The log-linear model framework, introduced to machine translation by Och and Ney (2002), fundamentally changed how SMT systems combine different knowledge sources. Rather than relying solely on the Bayesian noisy-channel decomposition P(e|f) ∝ P(f|e)·P(e), the log-linear approach directly models the posterior probability as a weighted combination of arbitrary feature functions. This discriminative framework allows the integration of translation models, language models, reordering models, and any other informative features into a single unified scoring function.
The Log-Linear Framework
The model defines the posterior directly as a normalized exponential of a weighted feature sum:

P(e|f) = exp(Σ_{k=1}^{K} λ_k · h_k(e, f)) / Z(f)

Z(f) = Σ_{e'} exp(Σ_{k=1}^{K} λ_k · h_k(e', f))

Decision rule (Z(f) is constant for a given f, so it can be ignored in the argmax):

e* = argmax_e Σ_{k=1}^{K} λ_k · h_k(e, f)

where the h_k(e, f) are feature functions and the λ_k are feature weights (tuned on development data).
In the log-linear formulation, each feature function h_k(e, f) assigns a real-valued score to a translation hypothesis. The noisy-channel components become just two features among many: h₁(e,f) = log P(f|e) and h₂(e,f) = log P(e). Additional features can include phrase translation probabilities in both directions, lexical weights, distortion penalties, word penalties, phrase penalties, and any domain-specific features. The feature weights λ_k control the relative importance of each feature and are tuned to maximize translation quality on a development set.
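The weighted scoring and argmax decision rule above can be sketched in a few lines of Python. The feature names, values, and weights here are illustrative toy numbers, not from a real system; only the structure (a weighted sum of log-domain features, with Z(f) cancelling in the argmax) reflects the formulation described.

```python
import math

def score(hypothesis_features, weights):
    """Return the weighted feature sum Σ_k λ_k · h_k(e, f)."""
    return sum(weights[k] * h for k, h in hypothesis_features.items())

# Two hypothetical hypotheses with standard dense features:
# "tm" = log P(f|e), "lm" = log P(e), plus a word penalty.
hyp_a = {"tm": math.log(0.02), "lm": math.log(0.001), "word_penalty": -5.0}
hyp_b = {"tm": math.log(0.05), "lm": math.log(0.0001), "word_penalty": -6.0}

weights = {"tm": 1.0, "lm": 0.8, "word_penalty": -0.1}

# Decision rule: pick the hypothesis with the highest weighted score;
# the partition function Z(f) is the same for both, so it never appears.
best = max([hyp_a, hyp_b], key=lambda h: score(h, weights))
```

Note that the noisy-channel model is recovered as the special case with exactly the "tm" and "lm" features, both weighted 1.0.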
Feature Engineering
The log-linear framework opened the door to extensive feature engineering in SMT. Beyond the standard features, researchers introduced sparse features based on word pairs, phrase pair properties, syntactic information, and source-side context. Large-scale discriminative training methods, such as MIRA (Chiang et al., 2008) and PRO (Hopkins and May, 2011), enabled the use of millions of sparse features, moving SMT toward the rich feature representations that are standard in other NLP tasks. This evolution from a handful of dense features to millions of sparse features paralleled the broader trend in machine learning toward high-dimensional discriminative models.
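A sketch of what sparse feature extraction might look like for a single phrase pair, in the spirit of the word-pair and phrase-property features mentioned above. The feature templates (`wordpair:`, `srclen:`, `tgtlen:`) are hypothetical names chosen for illustration, not templates from any specific system; real systems fire such indicators across millions of distinct feature strings.

```python
from collections import Counter

def sparse_features(src_phrase, tgt_phrase):
    """Fire one indicator feature per source/target word pair,
    plus simple phrase-length property features."""
    feats = Counter()
    for s in src_phrase.split():
        for t in tgt_phrase.split():
            feats[f"wordpair:{s}|{t}"] += 1
    feats[f"srclen:{len(src_phrase.split())}"] += 1
    feats[f"tgtlen:{len(tgt_phrase.split())}"] += 1
    return feats

feats = sparse_features("la maison bleue", "the blue house")
```

Each distinct feature string gets its own weight λ_k, which is why discriminative trainers that scale to millions of parameters were a prerequisite for this style of modeling.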
The shift from the noisy-channel model to the log-linear framework parallels the broader transition from generative to discriminative models in NLP and machine learning. Generative models like the IBM models define a full joint probability distribution P(f,e), while discriminative log-linear models directly model the conditional P(e|f). The discriminative approach sacrifices the ability to generate data, but it gains the flexibility to incorporate arbitrary, possibly overlapping features, and in practice it tends to achieve better predictive performance because it need not make the independence assumptions that generative models rely on and that real data typically violates.
Tuning Feature Weights
The feature weights λ_k are critical to system performance and must be carefully tuned. Minimum error rate training (MERT) became the standard approach, directly optimizing the weights to maximize a translation quality metric (typically BLEU) on a development set. Alternative tuning methods include pairwise ranking optimization (PRO), the margin-infused relaxed algorithm (MIRA), and batch MIRA. The choice of tuning algorithm, the size and composition of the development set, and the number of tuning iterations all affect the stability and quality of the final system.
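The pairwise ranking idea behind PRO can be illustrated with a deliberately simplified sketch: sample pairs of hypotheses from an n-best list, orient each pair by its BLEU scores, and train a classifier on the feature-difference vectors. Here a perceptron update stands in for the logistic-regression classifier of the original method, and the n-best list, feature values, and BLEU scores are toy data, not output of a real decoder.

```python
import random

def pro_tune(nbest, weights, n_pairs=100, lr=0.1, seed=0):
    """Simplified PRO-style tuning on a list of (features, bleu) pairs."""
    rng = random.Random(seed)
    for _ in range(n_pairs):
        (fa, ba), (fb, bb) = rng.sample(nbest, 2)
        if ba == bb:
            continue
        # Orient the pair so `better` is the hypothesis with higher BLEU.
        better, worse = (fa, fb) if ba > bb else (fb, fa)
        diff = {k: better.get(k, 0.0) - worse.get(k, 0.0)
                for k in set(better) | set(worse)}
        margin = sum(weights.get(k, 0.0) * v for k, v in diff.items())
        if margin <= 0:  # misranked: nudge weights toward the better hypothesis
            for k, v in diff.items():
                weights[k] = weights.get(k, 0.0) + lr * v
    return weights

# Toy n-best list: here a higher "lm" score correlates with higher BLEU.
nbest = [({"lm": -2.0, "tm": -3.0}, 0.40),
         ({"lm": -5.0, "tm": -2.5}, 0.20),
         ({"lm": -1.5, "tm": -4.0}, 0.45)]
weights = pro_tune(nbest, {"lm": 0.0, "tm": 0.0})
```

Reducing tuning to binary classification over sampled pairs is what lets PRO scale to large feature sets, in contrast to MERT, whose line-search procedure degrades beyond a few dozen dense features.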
The log-linear framework remains conceptually relevant in the neural era. While neural MT models learn feature representations end-to-end, the combination of multiple components (encoder, decoder, language model, length penalty) during inference can be viewed as a log-linear combination. Ensemble methods that combine multiple neural models also use log-linear interpolation of model scores. The principle of combining diverse knowledge sources through weighted feature functions continues to be a central idea in machine translation system design.
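The log-linear interpolation used in neural ensembling can be sketched as follows: each model's next-token log-probabilities are combined as a weighted sum and renormalized over the vocabulary. The two toy model distributions and the 0.5/0.5 weights are illustrative assumptions; real ensembles do this over full vocabularies at every decoding step.

```python
import math

def ensemble_logprobs(model_logprobs, weights):
    """Combine per-model log-probabilities as Σ_m w_m · log P_m(token),
    then renormalize over the vocabulary."""
    vocab = model_logprobs[0].keys()
    combined = {tok: sum(w * lp[tok] for w, lp in zip(weights, model_logprobs))
                for tok in vocab}
    log_z = math.log(sum(math.exp(s) for s in combined.values()))
    return {tok: s - log_z for tok, s in combined.items()}

# Two toy models over a three-word vocabulary.
m1 = {"the": math.log(0.6), "a": math.log(0.3), "dog": math.log(0.1)}
m2 = {"the": math.log(0.5), "a": math.log(0.4), "dog": math.log(0.1)}
probs = ensemble_logprobs([m1, m2], [0.5, 0.5])
```

With equal weights this is a geometric mean of the model distributions, which is exactly a two-feature log-linear model with λ₁ = λ₂ = 0.5.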