Quick Start

This section will introduce a minimal example on how to use SimulEval for simultaneous translation evaluation. The code in the example can be found in examples/quick_start.

The agent in SimulEval is core for simultaneous evaluation. It’s a carrier of user’s simultaneous system. The user has to implement the agent based on their system for evaluation. The example simultaneous system is a dummy wait-k agent, which

Runs wait-k policy.
Generates random characters the policy decide to write.
Stops the generation k predictions after source input. For simplicity, we just set k=3 in this example.

The implementation of this agent is shown as follow.

import random
from simuleval import entrypoint
from simuleval.agents import TextToTextAgent
from simuleval.agents.actions import ReadAction, WriteAction


@entrypoint
class DummyWaitkTextAgent(TextToTextAgent):

    waitk = 3
    vocab = [chr(i) for i in range(ord("A"), ord("Z") + 1)]

    def policy(self):
        lagging = len(self.states.source) - len(self.states.target)

        if lagging >= self.waitk or self.states.source_finished:
            prediction = random.choice(self.vocab)

            return WriteAction(prediction, finished=(lagging <= 1))
        else:
            return ReadAction()

There two essential components for an agent:

states: The attribute keeps track of the source and target information.
policy: The method makes decisions when the there is a new source segment.

Once the agent is implemented and saved at dummy_waitk_text_agent_v1.py, run the following command for latency evaluation on:

simuleval --source source.txt --reference target.txt --agent dummy_waitk_text_agent_v1.py

where --source is the input file while --target is the reference file.

By default, the SimulEval will give the following output — one quality and three latency metrics.

2022-12-05 13:43:58 | INFO | simuleval.cli | Evaluate system: DummyWaitkTextAgent
2022-12-05 13:43:58 | INFO | simuleval.dataloader | Evaluating from text to text.
2022-12-05 13:43:58 | INFO | simuleval.sentence_level_evaluator | Results:
BLEU  AL    AP  DAL
1.541 3.0 0.688  3.0

The average lagging is expected since we are running an wait-3 system where the source and target always have the same length. Notice that we have a very low yet random BLEU score. It’s because we are randomly generate the output.