
Minimum Error Rate Training

Minimum error rate training (MERT) is an optimization algorithm that directly tunes the feature weights of a log-linear translation model to maximize an automatic evaluation metric such as BLEU on a development set.

λ* = argmax_λ BLEU(E*(λ), R),  where  E*(λ) = { argmax_e Σ_k λ_k · hₖ(e, fᵢ) }ᵢ
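The inner argmax is just a linear scoring rule over feature values. A minimal sketch, with illustrative feature names and values (not from the paper):

```python
# Selecting the 1-best hypothesis under a log-linear model.
# Feature values and weights here are made up for illustration.
hypotheses = [
    {"lm": -2.1, "tm": -1.4},   # hypothesis A
    {"lm": -1.8, "tm": -2.0},   # hypothesis B
]
weights = {"lm": 0.6, "tm": 0.4}

def score(h, lam):
    """Linear model score: sum_k lambda_k * h_k."""
    return sum(lam[k] * v for k, v in h.items())

best = max(hypotheses, key=lambda h: score(h, weights))
```

MERT's job is to choose `weights` so that the hypotheses selected this way score well under BLEU.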

Minimum error rate training, introduced by Franz Josef Och (2003), resolved a fundamental mismatch in statistical machine translation: the models were trained to maximize likelihood, but evaluated using translation quality metrics like BLEU that are not differentiable functions of the model parameters. MERT directly optimizes the evaluation metric by searching over the space of feature weight vectors, using the key insight that the top-scoring translation for a given sentence is a piecewise-constant function of the weights along any line in weight space.

The MERT Algorithm

MERT Optimization:

λ* = argmax_λ BLEU({e*₁(λ), ..., e*_N(λ)}, {r₁, ..., r_N})

e*ᵢ(λ) = argmax_e Σ_k λ_k · hₖ(e, fᵢ)

Line optimization: along direction d from point λ₀, the 1-best translation for each sentence changes at a finite set of thresholds, so BLEU is piecewise constant along the line and can be optimized exactly.

MERT uses coordinate ascent with line optimization. Starting from an initial weight vector, the algorithm repeatedly selects a direction in weight space (typically one coordinate axis at a time, or a random direction) and optimizes the evaluation metric along that line. The critical insight is that, for a given n-best list of translation hypotheses, the top-scoring hypothesis for each sentence changes at a finite number of threshold values along any line, making the metric a step function that can be optimized exactly by evaluating it at each threshold. The algorithm iterates between decoding (generating n-best lists) and optimization until convergence.
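The line search can be sketched as follows. Along λ₀ + t·d, each hypothesis's score is linear in t, so the 1-best can only change where two score lines cross; evaluating the error metric once per interval between crossings finds the exact optimum along the line. The data layout and error interface below are illustrative (real implementations compute the upper envelope directly rather than all pairwise crossings):

```python
import itertools

def line_search(nbest, lam0, direction, error):
    """MERT line search sketch along lam0 + t*direction.

    nbest: per sentence, a list of (feature_vector, hypothesis) pairs.
    error: corpus-level error over the chosen hypotheses (lower is better).
    Returns the step size t that minimizes the error.
    """
    def proj(f, w):
        return sum(wi * fi for wi, fi in zip(w, f))

    # Each hypothesis scores a + b*t; the 1-best changes only where two
    # such lines cross, so collect all crossings as candidate thresholds.
    thresholds = set()
    for hyps in nbest:
        lines = [(proj(f, lam0), proj(f, direction)) for f, _ in hyps]
        for (a1, b1), (a2, b2) in itertools.combinations(lines, 2):
            if b1 != b2:
                thresholds.add((a2 - a1) / (b1 - b2))
    ts = sorted(thresholds)
    if ts:
        # Probe the midpoint of each interval, plus one point past each end.
        candidates = ([ts[0] - 1.0]
                      + [(u + v) / 2 for u, v in zip(ts, ts[1:])]
                      + [ts[-1] + 1.0])
    else:
        candidates = [0.0]

    def one_best(t):
        return [max(hyps, key=lambda fh: proj(fh[0], lam0)
                    + t * proj(fh[0], direction))[1]
                for hyps in nbest]

    return min(candidates, key=lambda t: error(one_best(t)))
```

A full MERT run would wrap this in coordinate ascent over directions and re-decode to refresh the n-best lists after each pass.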

Properties and Challenges

MERT has several desirable properties: it directly optimizes the evaluation metric, makes no assumptions about the metric's functional form (it can optimize any sentence-level or corpus-level metric), and finds the global optimum along each line. However, it also has significant limitations. The objective function is non-convex, so coordinate ascent may converge to a local optimum. The algorithm is sensitive to initialization and the random choice of search directions. Most critically, MERT becomes unreliable when the number of features exceeds approximately 20–30, because the n-best lists do not provide sufficient coverage of the combinatorial search space in high dimensions.

Alternatives to MERT

The limitations of MERT for high-dimensional feature spaces motivated the development of alternative tuning algorithms. PRO (Hopkins and May, 2011) converts tuning into a binary classification problem over pairs of translations. MIRA (Chiang et al., 2008; Cherry and Foster, 2012) performs online large-margin updates. These methods scale to thousands or millions of features but optimize surrogate objectives rather than the evaluation metric directly. The k-best batch MIRA algorithm became a popular alternative that balances scalability with direct metric optimization.
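The pairwise idea behind PRO can be sketched, in simplified form, as sampling hypothesis pairs and training a linear classifier on their feature differences so that higher-metric hypotheses score higher. The perceptron update and sampling constants below are illustrative; the actual paper uses sampled pairs filtered by metric gap and a regularized logistic classifier:

```python
import random

def pro_tune(nbest_with_scores, dim, epochs=10, seed=0):
    """Simplified PRO-style tuning sketch (after Hopkins & May, 2011).

    nbest_with_scores: per sentence, a list of (feature_vec, metric_score).
    Returns a weight vector ranking higher-metric hypotheses above lower ones.
    """
    rng = random.Random(seed)
    pairs = []
    for hyps in nbest_with_scores:
        for _ in range(50):  # sample candidate pairs per sentence
            (f1, g1), (f2, g2) = rng.sample(hyps, 2)
            if abs(g1 - g2) > 0.05:  # keep only clearly ordered pairs
                better, worse = (f1, f2) if g1 > g2 else (f2, f1)
                pairs.append((better, worse))
    w = [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(pairs)
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            if sum(wi * di for wi, di in zip(w, diff)) <= 0:
                # Perceptron update: push the better hypothesis above the worse.
                w = [wi + di for wi, di in zip(w, diff)]
    return w
```

Because the objective is a sum over pairs rather than an exact search over weight space, this formulation scales to feature counts far beyond MERT's practical limit.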

Impact on MT Development

MERT transformed the practice of MT system development by providing a principled way to combine diverse model components. Before MERT, feature weights were often set manually or through ad hoc procedures, leading to suboptimal and irreproducible results. With MERT, researchers could add new features and automatically determine their optimal contribution, accelerating the pace of innovation in SMT. The shared task paradigm, in which teams compete on standard test sets, became feasible largely because MERT provided a level playing field for system comparison.

The principle of directly optimizing the evaluation metric has carried over to the neural MT era. Sequence-level training objectives, such as minimum risk training and reinforcement learning approaches, can be viewed as continuous analogues of MERT that optimize expected BLEU (or other metrics) through gradient-based methods. The insight that models should be trained to optimize the criteria by which they will be judged remains a guiding principle in machine translation research.
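Minimum risk training replaces MERT's argmax with an expectation under the model's softmax distribution, which is differentiable in the weights. A toy sketch over an n-best list (the interface and temperature parameter are illustrative):

```python
import math

def expected_metric(hyps, lam, temperature=1.0):
    """Minimum-risk objective sketch: expected metric score under the
    model's softmax distribution over an n-best list. Unlike the argmax
    in MERT, this is a smooth function of lam.

    hyps: list of (feature_vec, metric_score).
    """
    scores = [sum(l * f for l, f in zip(lam, fv)) / temperature
              for fv, _ in hyps]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(e / z * g for e, (_, g) in zip(exps, hyps))
```

As the temperature is lowered (or the weights grow), the distribution sharpens toward the 1-best and the expectation approaches the MERT objective.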

References

  1. Och, F. J. (2003). Minimum error rate training in statistical machine translation. Proceedings of ACL 2003, 160–167. doi:10.3115/1075096.1075117
  2. Cherry, C., & Foster, G. (2012). Batch tuning strategies for statistical machine translation. Proceedings of NAACL-HLT 2012, 427–436. aclanthology.org/N12-1047
  3. Hopkins, M., & May, J. (2011). Tuning as ranking. Proceedings of EMNLP 2011, 1352–1362. aclanthology.org/D11-1125
  4. Clark, J. H., Dyer, C., Lavie, A., & Smith, N. A. (2011). Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. Proceedings of ACL 2011, 176–181. aclanthology.org/P11-2031
