Evaluator
The evaluation in SimulEval is implemented by the Evaluator described below.
It runs at the sentence level and scores the translation on quality and latency.
The user can choose the metrics with --quality-metrics and --latency-metrics.
The final results, along with the logs, are saved to --output if given.
- class simuleval.evaluator.evaluator.SentenceLevelEvaluator(dataloader: GenericDataloader, quality_scorers: List[QualityScorer], latency_scorers: List[LatencyScorer], args: Namespace)[source]
Sentence-level evaluator. It iterates over sentence pairs and runs the evaluation.
for instance in self.maybe_tqdm(self.instances.values()):
    agent.reset()
    while not instance.finish_prediction:
        input_segment = instance.send_source(self.source_segment_size)
        output_segment = agent.pushpop(input_segment)
        instance.receive_prediction(output_segment)
- instances
Collection of sentence pairs. Instances also keep track of delays.
- latency_scorers
Scorers for latency evaluation.
- Type
List[LatencyScorer]
- quality_scorers
Scorers for quality evaluation.
- Type
List[QualityScorer]
- output
Output directory.
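The evaluation loop shown for SentenceLevelEvaluator can be tried out without SimulEval installed. The sketch below uses two hypothetical stand-ins, ToyInstance and ToyAgent, that only mimic the method names seen in the loop (send_source, receive_prediction, finish_prediction, reset, pushpop). Their internals, including the "</s>" end marker, are assumptions made for illustration and are not SimulEval's implementation.

```python
from typing import Dict, List, Optional


class ToyInstance:
    """Hypothetical stand-in for one sentence pair; it also records delays."""

    def __init__(self, source: List[str]):
        self.source = source
        self.step = 0                 # number of source words sent so far
        self.prediction: List[str] = []
        self.delays: List[int] = []   # source words sent when each target word arrived
        self.finish_prediction = False

    def send_source(self, segment_size: int = 1) -> Optional[str]:
        # segment_size is ignored in this toy version: one word per call.
        if self.step < len(self.source):
            word = self.source[self.step]
            self.step += 1
            return word
        return None  # source exhausted

    def receive_prediction(self, word: Optional[str]) -> None:
        if word is None:          # the agent chose to read more input
            return
        if word == "</s>":        # assumed end-of-sentence marker
            self.finish_prediction = True
            return
        self.prediction.append(word)
        self.delays.append(self.step)


class ToyAgent:
    """Hypothetical wait-k agent that copies the source in upper case."""

    def __init__(self, waitk: int = 2):
        self.waitk = waitk
        self.reset()

    def reset(self) -> None:
        self.seen: List[str] = []
        self.written = 0
        self.source_finished = False

    def pushpop(self, word: Optional[str]) -> Optional[str]:
        if word is None:
            self.source_finished = True
        else:
            self.seen.append(word)
        lagging = len(self.seen) - self.written
        can_write = self.written < len(self.seen)
        if can_write and (lagging >= self.waitk or self.source_finished):
            output = self.seen[self.written].upper()
            self.written += 1
            return output
        if self.source_finished:
            return "</s>"
        return None  # keep reading


def run(instances: Dict[int, ToyInstance], agent: ToyAgent) -> None:
    # Same control flow as the loop in the class description.
    for instance in instances.values():
        agent.reset()
        while not instance.finish_prediction:
            input_segment = instance.send_source(1)
            output_segment = agent.pushpop(input_segment)
            instance.receive_prediction(output_segment)


instances = {0: ToyInstance("a b c d".split())}
run(instances, ToyAgent(waitk=2))
print(instances[0].prediction, instances[0].delays)
```

Each entry of delays records how much source had been sent when the corresponding target word was received, which is the quantity the latency scorers consume.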
Evaluator related command line arguments:
usage: [-h] [--quality-metrics QUALITY_METRICS [QUALITY_METRICS ...]] [--latency-metrics LATENCY_METRICS [LATENCY_METRICS ...]] [--computation-aware] [--eval-latency-unit {word,char}] [--remote-address REMOTE_ADDRESS] [--remote-port REMOTE_PORT] [--no-progress-bar] [--output OUTPUT]
Named Arguments
- --quality-metrics
Quality metrics
Default: [‘BLEU’]
- --latency-metrics
Latency metrics
Default: [‘AL’, ‘AP’, ‘DAL’]
- --computation-aware
Include computational latency.
Default: False
- --eval-latency-unit
Possible choices: word, char
Basic unit used for latency calculation: words (detokenized) or characters.
Default: “word”
- --remote-address
Address to client backend
Default: “localhost”
- --remote-port
Port to client backend
Default: 12321
- --no-progress-bar
Do not use progress bar
Default: False
- --output
Output directory
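To see how these options combine, the following sketch rebuilds a parser from the usage string above with Python's argparse. The build_parser function is hypothetical and is not SimulEval's own parser; the int type for --remote-port and the None default for --output are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical re-creation of the documented options, for illustration only.
    parser = argparse.ArgumentParser()
    parser.add_argument("--quality-metrics", nargs="+", default=["BLEU"])
    parser.add_argument("--latency-metrics", nargs="+", default=["AL", "AP", "DAL"])
    parser.add_argument("--computation-aware", action="store_true")
    parser.add_argument("--eval-latency-unit", choices=["word", "char"], default="word")
    parser.add_argument("--remote-address", default="localhost")
    parser.add_argument("--remote-port", type=int, default=12321)  # int is an assumption
    parser.add_argument("--no-progress-bar", action="store_true")
    parser.add_argument("--output", default=None)  # None default is an assumption
    return parser


# Several metrics can follow one flag; unspecified options keep their defaults.
args = build_parser().parse_args(
    ["--latency-metrics", "AL", "AP", "--eval-latency-unit", "char", "--output", "results"]
)
print(args.quality_metrics, args.latency_metrics, args.eval_latency_unit, args.output)
```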
Quality Scorers
- class simuleval.evaluator.scorers.quality_scorer.SacreBLEUScorer(tokenizer: str = '13a')[source]
SacreBLEU Scorer
- Usage:
--quality-metrics BLEU
Additional command line arguments:
usage: [-h] [--sacrebleu-tokenizer SACREBLEU_TOKENIZER]
Named Arguments
- --sacrebleu-tokenizer
Tokenizer in sacrebleu
Default: “13a”
- class simuleval.evaluator.scorers.quality_scorer.ASRSacreBLEUScorer(tokenizer: str = '13a')[source]
ASR + SacreBLEU Scorer (BETA version)
- Usage:
--quality-metrics ASR_BLEU
Additional command line arguments:
usage: [-h] [--sacrebleu-tokenizer SACREBLEU_TOKENIZER]
Named Arguments
- --sacrebleu-tokenizer
Tokenizer in sacrebleu
Default: “13a”
Latency Scorers
- class simuleval.evaluator.scorers.latency_scorer.ALScorer(computation_aware: bool = False, use_ref_len: bool = True)[source]
Average Lagging (AL) from STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework
Given source \(X\), target \(Y\), and delays \(D\),
\[AL = \frac{1}{\tau} \sum_{i=1}^{\tau} \left( D_i - (i - 1) \frac{|X|}{|Y|} \right)\]
where
\[\tau = \operatorname*{argmin}_i \left( D_i = |X| \right)\]
When a reference is given, \(|Y|\) is the reference length.
- Usage:
--latency-metrics AL
- compute(delays: List[Union[float, int]], source_length: Union[float, int], target_length: Union[float, int])[source]
Function to compute latency on one sentence (instance).
- Parameters
delays (List[Union[float, int]]) – Sequence of delays.
source_length (Union[float, int]) – Length of source sequence.
target_length (Union[float, int]) – Length of target sequence.
- Returns
the latency score on one sentence.
- Return type
float
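As a worked illustration of the AL formula, here is a minimal sketch written directly from the definition. The function name average_lagging is hypothetical, and stopping at the first delay that is greater than or equal to the source length (rather than exactly equal) is an assumption; this is not the body of ALScorer.compute.

```python
from typing import Sequence


def average_lagging(
    delays: Sequence[float], source_length: float, target_length: float
) -> float:
    """AL for one sentence, following the formula term by term (illustrative)."""
    if not delays:
        raise ValueError("delays must not be empty")
    ratio = source_length / target_length  # |X| / |Y|
    total = 0.0
    tau = 0
    for i, delay in enumerate(delays, start=1):
        total += delay - (i - 1) * ratio
        tau = i
        if delay >= source_length:  # first target index that has seen the full source
            break
    return total / tau


# Wait-2 policy on a 4-word source and 4-word target.
print(average_lagging([2, 3, 4, 4], source_length=4, target_length=4))
```

With these delays the sum stops at the third target word, the first one written after the whole source was read, so the result is 2.0, matching the intuition that a wait-2 policy lags two words behind.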
- class simuleval.evaluator.scorers.latency_scorer.APScorer(computation_aware: bool = False, use_ref_len: bool = True)[source]
Average Proportion (AP) from Can neural machine translation do simultaneous translation?
Given source \(X\), target \(Y\), and delays \(D\), the AP is calculated as:
\[AP = \frac{1}{|X||Y|} \sum_{i=1}^{|Y|} D_i\]
- Usage:
--latency-metrics AP
- compute(delays: List[Union[float, int]], source_length: Union[float, int], target_length: Union[float, int]) -> float[source]
Function to compute latency on one sentence (instance).
- Parameters
delays (List[Union[float, int]]) – Sequence of delays.
source_length (Union[float, int]) – Length of source sequence.
target_length (Union[float, int]) – Length of target sequence.
- Returns
the latency score on one sentence.
- Return type
float
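The AP formula translates into a one-line computation. The sketch below is an illustration of the formula only; the function name average_proportion is hypothetical and this is not the body of APScorer.compute.

```python
from typing import Sequence


def average_proportion(
    delays: Sequence[float], source_length: float, target_length: float
) -> float:
    """AP for one sentence: sum of delays divided by |X| * |Y| (illustrative)."""
    return sum(delays) / (source_length * target_length)


# Wait-2 policy versus reading the full source before writing anything.
print(average_proportion([2, 3, 4, 4], source_length=4, target_length=4))
print(average_proportion([4, 4, 4, 4], source_length=4, target_length=4))
```

The first call gives 13 / 16 = 0.8125; the second gives 1.0, the value AP takes when every target word waits for the complete source.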
- class simuleval.evaluator.scorers.latency_scorer.DALScorer(computation_aware: bool = False, use_ref_len: bool = True)[source]
Differentiable Average Lagging (DAL) from Monotonic Infinite Lookback Attention for Simultaneous Machine Translation (https://arxiv.org/abs/1906.05218)
- Usage:
--latency-metrics DAL
- compute(delays: List[Union[float, int]], source_length: Union[float, int], target_length: Union[float, int])[source]
Function to compute latency on one sentence (instance).
- Parameters
delays (List[Union[float, int]]) – Sequence of delays.
source_length (Union[float, int]) – Length of source sequence.
target_length (Union[float, int]) – Length of target sequence.
- Returns
the latency score on one sentence.
- Return type
float
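The DAL formula is not reproduced in this section. The sketch below assumes the definition from the cited paper as commonly stated: each delay is first replaced by \(D'_i = \max(D_i, D'_{i-1} + |X|/|Y|)\), with \(D'_1 = D_1\), and the adjusted delays are then averaged as in AL but over all target positions. The function name is hypothetical, and this is not the body of DALScorer.compute.

```python
from typing import Sequence


def differentiable_average_lagging(
    delays: Sequence[float], source_length: float, target_length: float
) -> float:
    """DAL for one sentence under the assumed definition (illustrative)."""
    if not delays:
        raise ValueError("delays must not be empty")
    step = source_length / target_length  # |X| / |Y|, minimum spacing between writes
    total = 0.0
    previous = 0.0
    for i, delay in enumerate(delays, start=1):
        # The first delay is kept as is; later ones cannot trail the previous
        # adjusted delay by less than one step.
        adjusted = delay if i == 1 else max(delay, previous + step)
        total += adjusted - (i - 1) * step
        previous = adjusted
    return total / len(delays)


print(differentiable_average_lagging([2, 3, 4, 4], source_length=4, target_length=4))
```

For the wait-2 delays the adjusted sequence is 2, 3, 4, 5, so every term equals 2 and the result is 2.0.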
Customized Scorers
To add a customized scorer, the user can decorate a scorer class with @register_latency_scorer or @register_quality_scorer,
and then select it with --latency-metrics or --quality-metrics. For example:
import random
from statistics import mean

from simuleval import entrypoint
from simuleval.evaluator.scorers.latency_scorer import (
    register_latency_scorer,
    LatencyScorer,
)
from simuleval.agents import TextToTextAgent
from simuleval.agents.actions import ReadAction, WriteAction


@register_latency_scorer("RTF")
class RTFScorer(LatencyScorer):
    """Real time factor

    Usage:
        --latency-metrics RTF
    """

    def __call__(self, instances) -> float:
        # Ratio of the final delay to the source length, averaged over instances.
        scores = []
        for ins in instances.values():
            scores.append(ins.delays[-1] / ins.source_length)
        return mean(scores)


@entrypoint
class DummyWaitkTextAgent(TextToTextAgent):
    waitk = 3
    vocab = [chr(i) for i in range(ord("A"), ord("Z") + 1)]

    def policy(self):
        # Number of source tokens read but not yet matched by a target token.
        lagging = len(self.states.source) - len(self.states.target)

        if lagging >= self.waitk or self.states.source_finished:
            # Write a random letter; finish once the target has nearly caught up.
            prediction = random.choice(self.vocab)
            return WriteAction(prediction, finished=(lagging <= 1))
        else:
            return ReadAction()
> simuleval --source source.txt --target target.txt --agent dummy_waitk_text_agent_v4.py --latency-metrics RTF
2022-12-06 12:56:01 | INFO | simuleval.cli | Evaluate system: DummyWaitkTextAgent
2022-12-06 12:56:01 | INFO | simuleval.dataloader | Evaluating from text to text.
2022-12-06 12:56:01 | INFO | simuleval.sentence_level_evaluator | Results:
BLEU RTF
1.593 1.078
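The arithmetic behind the RTF column can be checked without running SimulEval. The sketch below applies the same computation as RTFScorer.__call__ to hypothetical stand-in instances built with SimpleNamespace; the delays and source lengths are made-up illustration values, not output from the run above.

```python
from statistics import mean
from types import SimpleNamespace


def real_time_factor(instances) -> float:
    # Same arithmetic as RTFScorer.__call__: last delay over source length, averaged.
    return mean(ins.delays[-1] / ins.source_length for ins in instances.values())


# Hypothetical instances exposing only the two attributes the scorer reads.
toy_instances = {
    0: SimpleNamespace(delays=[2, 3, 4, 4], source_length=4),  # 4 / 4 = 1.0
    1: SimpleNamespace(delays=[1, 2], source_length=4),        # 2 / 4 = 0.5
}
rtf = real_time_factor(toy_instances)
print(rtf)
```

A value above 1, such as the 1.078 reported in the results table, means the last prediction arrived after a delay larger than the source length, averaged over sentences.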