Simultaneous translation is the real-time counterpart of standard machine translation: rather than waiting for the entire source sentence before beginning translation, the system must produce target-language output incrementally as the source input arrives. This setting mirrors the task of human simultaneous interpreters at international conferences and is critical for applications such as live subtitling, real-time multilingual communication, and speech translation. The central challenge is the quality-latency tradeoff: waiting for more source context improves translation quality, but increases the delay experienced by the user.
Read/Write Policies
READ: consume next source token x_i
WRITE: generate next target token y_t
Latency metrics:
Average Lagging (AL) = (1/τ) Σ_{t=1}^{τ} (g(t) − (t−1)·|x|/|y|)
where g(t) = number of source tokens read when writing y_t,
and τ = the earliest step t at which g(t) = |x| (the full source has been read)
Consecutive Wait (CW) = average number of source tokens read between consecutive writes
Average Proportion (AP) = (1/(|x|·|y|)) Σ_{t=1}^{|y|} g(t)
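Given the definitions above, all three metrics can be computed directly from the sequence g(1), …, g(|y|). A minimal sketch (function and variable names are ours, not from any particular toolkit):

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (AL).

    g[t-1] = number of source tokens read when writing target token y_t.
    tau = earliest step t at which the full source has been read.
    """
    rate = src_len / tgt_len  # |x| / |y|
    tau = next((t for t, gt in enumerate(g, start=1) if gt >= src_len), len(g))
    return sum(g[t - 1] - (t - 1) * rate for t in range(1, tau + 1)) / tau

def average_proportion(g, src_len, tgt_len):
    """Average Proportion (AP): normalized sum of g(t) over all writes."""
    return sum(g) / (src_len * tgt_len)

def consecutive_wait(g):
    """Consecutive Wait (CW): average source tokens read between writes."""
    waits = [g[0]] + [g[t] - g[t - 1] for t in range(1, len(g))]
    return sum(waits) / len(waits)
```

For a wait-3 run on a 6-token sentence pair, g = [3, 4, 5, 6, 6, 6] gives AL = 3.0, matching the intuition that the system lags three tokens behind the source.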
The read/write framework formalizes simultaneous translation as a sequential decision process. At each time step, the system either reads (consumes) the next source token or writes (generates) a target token based on the source tokens read so far. Fixed policies include wait-k (Ma et al., 2019), which reads k source tokens before starting to write and then alternates between reading and writing. Adaptive policies learn to make read/write decisions based on the current source and target context, potentially achieving better quality-latency tradeoffs than any fixed policy.
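The sequential decision process can be sketched as a generic decoding loop. Here `policy` and `generate` are placeholder callables (not any specific system's API), and "</s>" is an assumed end-of-sentence token:

```python
def simultaneous_decode(source_stream, policy, generate, max_len=100):
    """Generic read/write decoding loop.

    source_stream: iterator over incoming source tokens.
    policy(src, tgt) -> "READ" or "WRITE", given tokens read/written so far.
    generate(src, tgt) -> next target token conditioned on the source prefix.
    """
    src, tgt = [], []
    exhausted = False
    while len(tgt) < max_len:
        # Once the source is exhausted, only writing remains possible.
        action = policy(src, tgt) if not exhausted else "WRITE"
        if action == "READ":
            tok = next(source_stream, None)
            if tok is None:
                exhausted = True
                continue
            src.append(tok)
        else:
            y = generate(src, tgt)
            if y == "</s>":
                break
            tgt.append(y)
    return tgt
```

A fixed policy such as wait-k plugs in as a `policy` that returns READ while fewer than len(tgt) + k source tokens have been consumed; an adaptive policy would instead inspect the model's state.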
The Wait-k Policy
The wait-k policy (Ma et al., 2019) is the simplest and most widely used simultaneous translation policy. The system reads k source tokens, then alternates between writing one target token and reading one source token. This results in a fixed lag of approximately k tokens behind the source. By varying k, practitioners can control the quality-latency tradeoff: smaller k yields lower latency but lower quality (because less source context is available), while larger k approaches full-sentence translation quality at the cost of higher latency. Wait-k models can be trained efficiently by masking future source positions during training.
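The training-time masking mentioned above amounts to restricting each target position to a prefix of the source: under wait-k, the number of visible source tokens at 0-indexed target step t is min(k + t, |x|). A minimal sketch of such a mask (a real implementation would build this as a tensor for the attention layers):

```python
def waitk_mask(k, src_len, tgt_len):
    """Boolean attention mask for wait-k training.

    Entry [t][i] is True iff, when writing target token t (0-indexed),
    the model may attend to source position i. The visible prefix has
    length g = min(k + t, src_len).
    """
    return [[i < min(k + t, src_len) for i in range(src_len)]
            for t in range(tgt_len)]
```

Because the mask depends only on positions, a single full-sentence forward pass can train all decoding steps of the wait-k schedule in parallel.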
Simultaneous speech translation (SST) adds the complexity of speech recognition to the simultaneous translation challenge. The system must process an audio stream, decide when enough speech has been received to produce a translation, and generate target text or speech in real time. End-to-end SST models directly translate from source speech to target text without an intermediate transcription step. Segmentation — deciding where to break the continuous speech stream into translatable units — is a critical subproblem, as incorrect segmentation can lead to translation errors that propagate through the remainder of the utterance.
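To make the segmentation subproblem concrete, here is a deliberately simple energy-threshold baseline that cuts the stream at sustained pauses. Real SST segmenters are typically learned models; the function name, threshold, and frame counts below are illustrative assumptions, not a standard method:

```python
def pause_segmenter(frame_energies, pause_threshold=0.1, min_pause_frames=20):
    """Toy pause-based segmenter for a stream of per-frame energies.

    Cuts a segment wherever energy stays below pause_threshold for at
    least min_pause_frames consecutive frames.
    """
    segments, current, silent = [], [], 0
    for e in frame_energies:
        current.append(e)
        silent = silent + 1 if e < pause_threshold else 0
        if silent >= min_pause_frames and len(current) > silent:
            segments.append(current[:-silent])  # drop the trailing pause
            current, silent = [], 0
    if any(e >= pause_threshold for e in current):
        segments.append(current)  # flush final speech, if any
    return segments
```

Even this toy example shows why segmentation is delicate: a threshold set too low merges units and inflates latency, while one set too high cuts mid-phrase and produces the propagating errors described above.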
Adaptive Policies and Reinforcement Learning
Adaptive policies learn when to read and when to write based on the current context, potentially outperforming fixed policies by waiting longer when the source is ambiguous and proceeding quickly when the translation is straightforward. These policies have been trained using reinforcement learning (Gu et al., 2017), imitation learning from oracle policies, and monotonic attention mechanisms (Raffel et al., 2017) that learn a soft analog of the read/write decision. The MILk (Monotonic Infinite Lookback) attention model (Arivazhagan et al., 2019) combines a monotonic attention head for deciding when to write with full attention over all read source tokens for deciding what to write.
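Adaptive behavior can be illustrated, though not faithfully to MILk (which learns the read/write decision jointly with attention), by a toy rule that writes only when the model is confident in its next token. Everything here is a hedged sketch: the threshold and interface are our assumptions:

```python
def confidence_policy(threshold=0.7):
    """Toy adaptive read/write rule.

    WRITE when the model's top next-token probability exceeds the
    threshold (the translation is "straightforward"), otherwise READ
    more source (the source is still "ambiguous"). Once the source is
    exhausted, only WRITE remains possible.
    """
    def policy(next_token_prob, source_exhausted):
        if source_exhausted:
            return "WRITE"
        return "WRITE" if next_token_prob >= threshold else "READ"
    return policy
```

Unlike wait-k, the resulting lag varies per sentence: easy prefixes are translated almost immediately, while ambiguous ones trigger additional reads, which is exactly the flexibility that lets adaptive policies dominate fixed ones on the quality-latency curve.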
Simultaneous translation remains an active and challenging research area. Current systems still fall significantly short of human simultaneous interpreters, particularly for structurally divergent language pairs where the target translation requires information that appears late in the source sentence. Anticipation (predicting source content before it is observed) is a key capability that human interpreters possess but machines struggle to replicate. Integrating world knowledge, discourse context, and prosodic cues into these systems is a promising direction for improving simultaneous translation quality.