
Log-Linear Models for MT

Log-linear models provide the discriminative framework for combining multiple feature functions in statistical machine translation, replacing the generative noisy-channel decomposition with a flexible, feature-rich scoring architecture.

P(e|f) = exp(Σ_k λ_k h_k(e, f)) / Σ_{e'} exp(Σ_k λ_k h_k(e', f))

The log-linear model framework, introduced to machine translation by Och and Ney (2002), fundamentally changed how SMT systems combine different knowledge sources. Rather than relying solely on the Bayesian noisy-channel decomposition P(e|f) ∝ P(f|e)·P(e), the log-linear approach directly models the posterior probability as a weighted combination of arbitrary feature functions. This discriminative framework allows the integration of translation models, language models, reordering models, and any other informative features into a single unified scoring function.
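As a concrete sketch of the posterior above (all candidate translations and feature values here are invented for illustration), P(e|f) can be computed by exponentiating the weighted feature sums and normalizing over a candidate list that stands in for the full space of translations:

```python
import math

# Hypothetical candidates e for a fixed source f, each with feature values
# h_1 = log P(f|e) and h_2 = log P(e) (numbers invented for illustration).
candidates = {
    "the house": [-1.2, -0.8],
    "a house":   [-1.5, -0.6],
    "the home":  [-1.0, -1.4],
}
weights = [1.0, 0.7]  # lambda_k; in practice tuned on a development set

def score(h):
    """Weighted feature sum: sum_k lambda_k * h_k(e, f)."""
    return sum(lam * hk for lam, hk in zip(weights, h))

# Z(f) normalizes over the candidate list (a stand-in for the sum over all e')
z = sum(math.exp(score(h)) for h in candidates.values())
posterior = {e: math.exp(score(h)) / z for e, h in candidates.items()}
```

The normalizer Z(f) makes the scores a proper distribution over candidates; as the decision rule below shows, it can be skipped when only the best translation is needed.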

The Log-Linear Framework

Log-linear translation model:

P(e|f) = exp(Σ_{k=1}^{K} λ_k · h_k(e, f)) / Z(f)

Z(f) = Σ_{e'} exp(Σ_{k=1}^{K} λ_k · h_k(e', f))

Decision rule (Z(f) is constant for a given f, so it drops out of the argmax):

e* = argmax_e Σ_{k=1}^{K} λ_k · h_k(e, f)

λ_k = feature weights (tuned on development data)

In the log-linear formulation, each feature function h_k(e, f) assigns a real-valued score to a translation hypothesis. The noisy-channel components become just two features among many: h₁(e,f) = log P(f|e) and h₂(e,f) = log P(e). Additional features can include phrase translation probabilities in both directions, lexical weights, distortion penalties, word penalties, phrase penalties, and any domain-specific features. The feature weights λ_k control the relative importance of each feature and are tuned to maximize translation quality on a development set.
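A minimal sketch of the decision rule (the hypotheses and feature values below are invented): since Z(f) is identical for every hypothesis, the argmax needs only the unnormalized weighted sum, and the noisy-channel terms appear as just two features among several:

```python
# Hypothetical n-best hypotheses with per-feature values (numbers invented).
# log_tm = log P(f|e), log_lm = log P(e), word_penalty = target word count.
hypotheses = {
    "that is good": {"log_tm": -2.1, "log_lm": -3.0, "word_penalty": 3},
    "this is good": {"log_tm": -2.4, "log_lm": -2.5, "word_penalty": 3},
    "that is well": {"log_tm": -1.9, "log_lm": -4.2, "word_penalty": 3},
}
weights = {"log_tm": 1.0, "log_lm": 0.8, "word_penalty": -0.1}

def model_score(feats):
    """Unnormalized log-linear score; Z(f) cancels in the argmax."""
    return sum(weights[k] * v for k, v in feats.items())

best = max(hypotheses, key=lambda e: model_score(hypotheses[e]))
```

Note how the language-model weight of 0.8 lets a hypothesis with a worse translation-model score win on fluency, which is exactly the trade-off the tuned λ_k encode.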

Feature Engineering

The log-linear framework opened the door to extensive feature engineering in SMT. Beyond the standard features, researchers introduced sparse features based on word pairs, phrase pair properties, syntactic information, and source-side context. Large-scale discriminative training methods, such as MIRA (Chiang et al., 2008) and PRO (Hopkins and May, 2011), enabled the use of millions of sparse features, moving SMT toward the rich feature representations that are standard in other NLP tasks. This evolution from a handful of dense features to millions of sparse features paralleled the broader trend in machine learning toward high-dimensional discriminative models.
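As an illustrative sketch of what a sparse feature looks like (the naming convention here is invented), word-pair indicator features fire once per aligned word pair and live naturally in a dictionary keyed by feature name, so only the features that actually fire are stored:

```python
def word_pair_features(src_tokens, tgt_tokens, alignment):
    """Sparse word-pair indicator features: one count per aligned pair.
    Feature names like 'wp_haus_house' follow an invented convention."""
    feats = {}
    for i, j in alignment:
        name = f"wp_{src_tokens[i]}_{tgt_tokens[j]}"
        feats[name] = feats.get(name, 0) + 1
    return feats

feats = word_pair_features(["das", "haus"], ["the", "house"], [(0, 0), (1, 1)])
```

Each such feature gets its own weight λ_k, which is why these models can grow to millions of parameters while any single sentence pair activates only a handful of them.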

From Generative to Discriminative

The shift from the noisy-channel model to the log-linear framework parallels the broader transition from generative to discriminative models in NLP and machine learning. Generative models like the IBM models define a full joint probability distribution P(f, e), while discriminative log-linear models directly model the conditional P(e|f). The discriminative approach sacrifices the ability to generate data but gains the flexibility to incorporate arbitrary, possibly overlapping features; because it spends no modeling effort on the input distribution P(f), it often achieves better prediction accuracy in practice, though neither paradigm dominates the other in theory.

Tuning Feature Weights

The feature weights λ_k are critical to system performance and must be carefully tuned. Minimum error rate training (MERT) became the standard approach, directly optimizing the weights to maximize a translation quality metric (typically BLEU) on a development set. Alternative tuning methods include pairwise ranking optimization (PRO), the margin-infused relaxed algorithm (MIRA), and batch MIRA. The choice of tuning algorithm, the size and composition of the development set, and the number of tuning iterations all affect the stability and quality of the final system.
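A much-simplified sketch of weight tuning (MERT performs exact line searches along directions in weight space; the grid search below is only a stand-in, and the dev set, feature values, and per-candidate quality scores are all invented): choose the weight vector whose 1-best selections maximize average sentence-level quality on the development set:

```python
# Toy dev set: each source sentence has candidate translations, given as
# (feature_vector, quality) pairs, where quality stands in for
# sentence-level BLEU. All numbers are invented for illustration.
dev = [
    {"cands": [([-1.0, -2.0], 0.6), ([-2.0, -1.0], 0.9)]},
    {"cands": [([-0.5, -3.0], 0.4), ([-1.5, -1.5], 0.8)]},
]

def corpus_quality(weights):
    """Average quality of the 1-best candidate under the given weights."""
    total = 0.0
    for sent in dev:
        best = max(sent["cands"],
                   key=lambda c: sum(w * h for w, h in zip(weights, c[0])))
        total += best[1]
    return total / len(dev)

# Crude grid search over weight settings (MERT does exact line searches).
grid = [(a, b) for a in (0.2, 0.5, 1.0) for b in (0.2, 0.5, 1.0)]
best_w = max(grid, key=corpus_quality)
```

The objective is piecewise constant in the weights (the 1-best only changes at score crossovers), which is why MERT's exact line search, rather than gradient descent, became the standard optimizer for this setting.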

The log-linear framework remains conceptually relevant in the neural era. While neural MT models learn feature representations end-to-end, the combination of multiple components (encoder, decoder, language model, length penalty) during inference can be viewed as a log-linear combination. Ensemble methods that combine multiple neural models also use log-linear interpolation of model scores. The principle of combining diverse knowledge sources through weighted feature functions continues to be a central idea in machine translation system design.

References

  1. Och, F. J., & Ney, H. (2002). Discriminative training and maximum entropy models for statistical machine translation. Proceedings of ACL 2002, 295–302. doi:10.3115/1073083.1073133
  2. Och, F. J. (2003). Minimum error rate training in statistical machine translation. Proceedings of ACL 2003, 160–167. doi:10.3115/1075096.1075117
  3. Hopkins, M., & May, J. (2011). Tuning as ranking. Proceedings of EMNLP 2011, 1352–1362. aclanthology.org/D11-1125
  4. Chiang, D., Marton, Y., & Resnik, P. (2008). Online large-margin training of syntactic and structural translation features. Proceedings of EMNLP 2008, 224–233. aclanthology.org/D08-1024
