Simultaneous translation is the real-time counterpart of standard machine translation: rather than waiting for the entire source sentence before beginning translation, the system must produce target-language output incrementally as the source input arrives. This setting mirrors the task of human simultaneous interpreters at international conferences and is critical for applications such as live subtitling, real-time multilingual communication, and speech translation. The central challenge is the quality-latency tradeoff: waiting for more source context improves translation quality, but increases the delay experienced by the user.
Read/Write Policies
READ: consume next source token x_i
WRITE: generate next target token y_t
Latency metrics:
Average Lagging (AL) = (1/τ) Σ_{t=1}^{τ} (g(t) − (t−1)·|x|/|y|)
where g(t) = number of source tokens read when writing y_t,
and τ = the earliest step t at which g(t) = |x| (the full source has been read)
Consecutive Wait (CW) = average number of source tokens read between consecutive writes
Average Proportion (AP) = (1/(|x|·|y|)) Σ_{t=1}^{|y|} g(t)
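Given the definitions above, all three metrics can be computed directly from the sequence g(1), …, g(|y|). A minimal sketch (function and variable names are ours, not from any particular toolkit):

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (AL).

    g[t-1] = number of source tokens read when writing target token y_t.
    tau = earliest step t at which the full source has been read.
    """
    rate = src_len / tgt_len  # |x| / |y|
    tau = next((t for t, gt in enumerate(g, start=1) if gt >= src_len), len(g))
    return sum(g[t - 1] - (t - 1) * rate for t in range(1, tau + 1)) / tau

def average_proportion(g, src_len, tgt_len):
    """Average Proportion (AP): normalized sum of g(t) over all writes."""
    return sum(g) / (src_len * tgt_len)

def consecutive_wait(g):
    """Consecutive Wait (CW): average source tokens read between writes."""
    waits = [g[0]] + [g[t] - g[t - 1] for t in range(1, len(g))]
    return sum(waits) / len(waits)
```

For a wait-3 run on a 6-token sentence pair, g = [3, 4, 5, 6, 6, 6] gives AL = 3.0, matching the intuition that the system lags three tokens behind the source.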
The read/write framework formalizes simultaneous translation as a sequential decision process. At each time step, the system either reads (consumes) the next source token or writes (generates) a target token based on the source tokens read so far. Fixed policies include wait-k (Ma et al., 2019), which reads k source tokens before starting to write and then alternates between reading and writing. Adaptive policies learn to make read/write decisions based on the current source and target context, potentially achieving better quality-latency tradeoffs than any fixed policy.
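The sequential decision process can be sketched as a generic decoding loop. Here `policy` and `generate` are placeholder callables (not any specific system's API), and "</s>" is an assumed end-of-sentence token:

```python
def simultaneous_decode(source_stream, policy, generate, max_len=100):
    """Generic read/write decoding loop.

    source_stream: iterator over incoming source tokens.
    policy(src, tgt) -> "READ" or "WRITE", given tokens read/written so far.
    generate(src, tgt) -> next target token conditioned on the source prefix.
    """
    src, tgt = [], []
    exhausted = False
    while len(tgt) < max_len:
        # Once the source is exhausted, only writing remains possible.
        action = policy(src, tgt) if not exhausted else "WRITE"
        if action == "READ":
            tok = next(source_stream, None)
            if tok is None:
                exhausted = True
                continue
            src.append(tok)
        else:
            y = generate(src, tgt)
            if y == "</s>":
                break
            tgt.append(y)
    return tgt
```

A fixed policy such as wait-k plugs in as a `policy` that returns READ while fewer than len(tgt) + k source tokens have been consumed; an adaptive policy would instead inspect the model's state.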
The Wait-k Policy
The wait-k policy (Ma et al., 2019) is the simplest and most widely used simultaneous translation policy. The system reads k source tokens, then alternates between writing one target token and reading one source token. This results in a fixed lag of approximately k tokens behind the source. By varying k, practitioners can control the quality-latency tradeoff: smaller k yields lower latency but lower quality (because less source context is available), while larger k approaches full-sentence translation quality at the cost of higher latency. Wait-k models can be trained efficiently by masking future source positions during training.
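The training-time masking mentioned above amounts to restricting each target position to a prefix of the source: under wait-k, the number of visible source tokens at 0-indexed target step t is min(k + t, |x|). A minimal sketch of such a mask (a real implementation would build this as a tensor for the attention layers):

```python
def waitk_mask(k, src_len, tgt_len):
    """Boolean attention mask for wait-k training.

    Entry [t][i] is True iff, when writing target token t (0-indexed),
    the model may attend to source position i. The visible prefix has
    length g = min(k + t, src_len).
    """
    return [[i < min(k + t, src_len) for i in range(src_len)]
            for t in range(tgt_len)]
```

Because the mask depends only on positions, a single full-sentence forward pass can train all decoding steps of the wait-k schedule in parallel.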
Simultaneous speech translation (SST) adds the complexity of speech recognition to the simultaneous translation challenge. The system must process an audio stream, decide when enough speech has been received to produce a translation, and generate target text or speech in real time. End-to-end SST models directly translate from source speech to target text without an intermediate transcription step. Segmentation — deciding where to break the continuous speech stream into translatable units — is a critical subproblem, as incorrect segmentation can lead to translation errors that propagate through the remainder of the utterance.
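To make the segmentation subproblem concrete, here is a deliberately simple energy-threshold baseline that cuts the stream at sustained pauses. Real SST segmenters are typically learned models; the function name, threshold, and frame counts below are illustrative assumptions, not a standard method:

```python
def pause_segmenter(frame_energies, pause_threshold=0.1, min_pause_frames=20):
    """Toy pause-based segmenter for a stream of per-frame energies.

    Cuts a segment wherever energy stays below pause_threshold for at
    least min_pause_frames consecutive frames.
    """
    segments, current, silent = [], [], 0
    for e in frame_energies:
        current.append(e)
        silent = silent + 1 if e < pause_threshold else 0
        if silent >= min_pause_frames and len(current) > silent:
            segments.append(current[:-silent])  # drop the trailing pause
            current, silent = [], 0
    if any(e >= pause_threshold for e in current):
        segments.append(current)  # flush final speech, if any
    return segments
```

Even this toy example shows why segmentation is delicate: a threshold set too low merges units and inflates latency, while one set too high cuts mid-phrase and produces the propagating errors described above.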
Adaptive Policies and Reinforcement Learning
Adaptive policies learn when to read and when to write based on the current context, potentially outperforming fixed policies by waiting longer when the source is ambiguous and proceeding quickly when the translation is straightforward. These policies have been trained using reinforcement learning (Gu et al., 2017), imitation learning from oracle policies, and monotonic attention mechanisms (Raffel et al., 2017) that learn a soft analog of the read/write decision. The MILk (Monotonic Infinite Lookback) attention model (Arivazhagan et al., 2019) combines a monotonic attention head for deciding when to write with full attention over all read source tokens for deciding what to write.
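Adaptive behavior can be illustrated, though not faithfully to MILk (which learns the read/write decision jointly with attention), by a toy rule that writes only when the model is confident in its next token. Everything here is a hedged sketch: the threshold and interface are our assumptions:

```python
def confidence_policy(threshold=0.7):
    """Toy adaptive read/write rule.

    WRITE when the model's top next-token probability exceeds the
    threshold (the translation is "straightforward"), otherwise READ
    more source (the source is still "ambiguous"). Once the source is
    exhausted, only WRITE remains possible.
    """
    def policy(next_token_prob, source_exhausted):
        if source_exhausted:
            return "WRITE"
        return "WRITE" if next_token_prob >= threshold else "READ"
    return policy
```

Unlike wait-k, the resulting lag varies per sentence: easy prefixes are translated almost immediately, while ambiguous ones trigger additional reads, which is exactly the flexibility that lets adaptive policies dominate fixed ones on the quality-latency curve.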
Simultaneous translation remains an active and challenging research area. Current systems still fall significantly short of human simultaneous interpreters, particularly for structurally divergent language pairs where the target translation requires information that appears late in the source sentence. Anticipation (predicting source content before it is observed) is a key capability that human interpreters possess but machines struggle to replicate. Integrating world knowledge, discourse context, and prosodic cues into these systems is a promising direction for improving simultaneous translation quality.