Quick Start
This section walks through a minimal example of using SimulEval to evaluate simultaneous translation.
The code in the example can be found in examples/quick_start.
The agent is the core of simultaneous evaluation in SimulEval. It wraps the user's simultaneous system, so users must implement an agent around their own system before it can be evaluated. The example system here is a dummy wait-k agent, which
Runs a wait-k policy.
Generates a random character each time the policy decides to write.
Stops generating k predictions after the source input ends. For simplicity, we set k=3 in this example.
The implementation of this agent is shown below.
import random

from simuleval import entrypoint
from simuleval.agents import TextToTextAgent
from simuleval.agents.actions import ReadAction, WriteAction


@entrypoint
class DummyWaitkTextAgent(TextToTextAgent):
    waitk = 3
    vocab = [chr(i) for i in range(ord("A"), ord("Z") + 1)]

    def policy(self):
        # How many source segments we are ahead of the target.
        lagging = len(self.states.source) - len(self.states.target)

        if lagging >= self.waitk or self.states.source_finished:
            # Write one random character; finish once the target has caught up.
            prediction = random.choice(self.vocab)
            return WriteAction(prediction, finished=(lagging <= 1))
        else:
            return ReadAction()
There are two essential components of an agent:
states: The attribute that keeps track of the source and target information.
policy: The method that makes a decision whenever there is a new source segment.
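To make the interaction between states and policy concrete, the following is a minimal standalone sketch that replays the same wait-k decision rule without importing SimulEval. The function name simulate_waitk, the "READ"/"WRITE" strings, and the simple loop that feeds one source token per read are assumptions made for illustration; SimulEval's own evaluator loop may drive the agent differently.

```python
import random


def simulate_waitk(source_tokens, k=3, seed=0):
    """Replay the dummy wait-k decision rule on a list of source tokens.

    Hypothetical helper for illustration only; it is not part of SimulEval.
    Returns the sequence of actions taken and the generated target tokens.
    """
    rng = random.Random(seed)
    vocab = [chr(i) for i in range(ord("A"), ord("Z") + 1)]

    num_read = 0      # plays the role of len(self.states.source)
    target = []       # plays the role of self.states.target
    actions = []
    finished = False

    while not finished:
        source_finished = num_read == len(source_tokens)
        lagging = num_read - len(target)

        if lagging >= k or source_finished:
            # Same rule as the agent: write, and finish once caught up.
            target.append(rng.choice(vocab))
            actions.append("WRITE")
            finished = lagging <= 1
        else:
            # Assumption: each read makes exactly one new source token available.
            num_read += 1
            actions.append("READ")

    return actions, target


actions, target = simulate_waitk(list("abcde"), k=3)
print(" ".join(actions))
print(len(target))
```

For a five-token source and k=3, the sketch reads three tokens, then alternates writing and reading until the source is exhausted, and finally writes the remaining tokens so that the target ends up as long as the source.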
Once the agent is implemented and saved as dummy_waitk_text_agent_v1.py,
run the following command for latency evaluation:
simuleval --source source.txt --target target.txt --agent dummy_waitk_text_agent_v1.py
where --source is the input file while --target is the reference file.
By default, SimulEval reports one quality metric (BLEU) and three latency metrics: Average Lagging (AL), Average Proportion (AP), and Differentiable Average Lagging (DAL).
2022-12-05 13:43:58 | INFO | simuleval.cli | Evaluate system: DummyWaitkTextAgent
2022-12-05 13:43:58 | INFO | simuleval.dataloader | Evaluating from text to text.
2022-12-05 13:43:58 | INFO | simuleval.sentence_level_evaluator | Results:
BLEU   AL   AP     DAL
1.541  3.0  0.688  3.0
The average lagging of 3.0 is expected, since we are running a wait-3 system in which the source and target always have the same length. Notice that the BLEU score is very low and will vary between runs, because the output is generated randomly.
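The reason a wait-3 system with equal source and target lengths yields an AL of exactly 3.0 can be checked with a short calculation. The sketch below follows the commonly used definition of Average Lagging, where each target token's delay is the number of source tokens read before it was written, and the average is taken up to the first target token written after the full source has been read. The function names average_lagging and wait_k_delays are hypothetical, and SimulEval's own scorer may differ in details such as how the target length is chosen.

```python
def wait_k_delays(length, k):
    """Delays of an ideal wait-k system whose source and target both have `length` tokens."""
    return [min(t + k, length) for t in range(length)]


def average_lagging(delays, source_length):
    """Average Lagging from per-token delays (illustrative sketch, not SimulEval's scorer).

    delays[t] is the number of source tokens read before target token t was written.
    """
    if not delays:
        raise ValueError("delays must not be empty")

    gamma = len(delays) / source_length  # target-to-source length ratio
    total = 0.0
    for t, delay in enumerate(delays):
        # An ideal, zero-latency system would have read t / gamma tokens by now.
        total += delay - t / gamma
        if delay >= source_length:
            # Stop at the first token written after the whole source was read.
            return total / (t + 1)
    return total / len(delays)


print(wait_k_delays(5, 3))                       # [3, 4, 5, 5, 5]
print(average_lagging(wait_k_delays(5, 3), 5))   # 3.0
```

Each term in the sum is (t + 3) - t = 3 until the source is fully read, so the average is 3 regardless of sentence length.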