BLEU, introduced by Papineni et al. (2002), transformed machine translation evaluation by providing an automatic metric that correlates reasonably well with human judgments of translation quality. Before BLEU, MT evaluation relied almost exclusively on expensive and time-consuming human assessments. BLEU enabled rapid, reproducible evaluation that accelerated the pace of MT research. Despite well-known limitations, BLEU remains the de facto standard metric in MT publications and shared tasks, providing a common basis for comparing systems across studies.
The BLEU Formula
Modified n-gram precision:
p_n = ( Σ_{C∈Candidates} Σ_{ngram∈C} Count_clip(ngram) ) / ( Σ_{C'∈Candidates} Σ_{ngram'∈C'} Count(ngram') )
Brevity Penalty:
BP = 1 if c > r
BP = exp(1 − r/c) if c ≤ r
c = candidate length, r = effective reference length
N = 4, w_n = 1/N (uniform weights)
BLEU = BP · exp( Σ_{n=1}^{N} w_n log p_n )
BLEU computes modified n-gram precision for n = 1 to 4 and combines them using a geometric mean. "Modified" precision means that each n-gram in the candidate is counted at most as many times as it appears in any single reference (the maximum count across references, when several are available), preventing a degenerate candidate that repeats a single word from achieving perfect unigram precision. The brevity penalty (BP) penalizes translations that are shorter than the reference, since short translations can achieve artificially high precision by omitting difficult content; no corresponding penalty is needed for overly long translations, because extra material already lowers precision. The standard configuration uses uniform weights (w_n = 1/4) across all n-gram orders.
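The definitions above can be put together in a minimal from-scratch sketch. Function names are ours, not from the paper, and the effective reference length here follows the common "closest reference length per segment" convention:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidates, references, n):
    """Corpus-level modified n-gram precision: each candidate n-gram count
    is clipped to the maximum count observed in any single reference."""
    clipped, total = 0, 0
    for cand, refs in zip(candidates, references):
        cand_counts = Counter(ngrams(cand, n))
        max_ref = Counter()
        for ref in refs:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total += sum(cand_counts.values())
    return clipped, total

def bleu(candidates, references, max_n=4):
    """Corpus BLEU with uniform weights and the brevity penalty.
    candidates: list of token lists; references: list of lists of token lists."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        clipped, total = modified_precision(candidates, references, n)
        if clipped == 0:
            return 0.0  # geometric mean is zero if any p_n is zero
        log_p += math.log(clipped / total) / max_n
    c = sum(len(cand) for cand in candidates)
    # Effective reference length: per segment, the reference length closest
    # to the candidate length (ties broken toward the shorter reference).
    r = sum(min((len(ref) for ref in refs),
                key=lambda l: (abs(l - len(cand)), l))
            for cand, refs in zip(candidates, references))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_p)
```

A candidate identical to its reference scores 1.0, while the degenerate candidate that repeats one word is stopped by clipping: its bigram precision is zero, so the geometric mean collapses to zero.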
Properties and Interpretation
BLEU scores range from 0 to 1 (often reported as 0 to 100). Scores are not easily interpretable in absolute terms — a BLEU score of 30 might represent acceptable quality for one language pair but poor quality for another. BLEU is a corpus-level metric, computed over an entire test set rather than individual sentences; sentence-level BLEU is unreliable due to the sparsity of higher-order n-gram matches. The metric is precision-oriented: it measures how much of the candidate appears in the reference, not how much of the reference is covered by the candidate.
BLEU has been extensively criticized. It does not account for meaning: a translation that conveys the correct meaning using different words can receive a low BLEU score. It treats all n-grams equally, ignoring the distinction between content words and function words. It cannot reward valid translation choices that differ from the reference. Callison-Burch et al. (2006) demonstrated cases where BLEU improvements did not correspond to quality improvements. Neural MT has made these limitations more acute, as NMT systems produce more fluent, diverse translations that diverge further from references than SMT outputs did.
Variants and Alternatives
Numerous BLEU variants and alternative metrics have been proposed. SacreBLEU (Post, 2018) standardizes BLEU computation by fixing tokenization and other implementation details, addressing the problem that different BLEU implementations can yield significantly different scores for the same translations. Sentence-level smoothed BLEU (Chen and Cherry, 2014) adds smoothing to avoid the zero counts that plague sentence-level scores. chrF (Popović, 2015) computes character n-gram F-scores rather than word n-gram precision, which makes it more robust for morphologically rich languages and less sensitive to tokenization.
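The character-level idea behind chrF can be sketched in a few lines. This is a simplified illustration of the principle (averaging precision and recall over character n-gram orders, with recall weighted by β = 2), not a reproduction of any particular implementation such as sacrebleu's:

```python
from collections import Counter

def char_ngrams(text, n):
    """Counter of character n-grams; whitespace is ignored, as is common for chrF."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf_sketch(candidate, reference, max_n=6, beta=2.0):
    """Character n-gram F-score: average precision and recall over
    orders 1..max_n, then combine with recall weighted by beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        if not cand or not ref:
            continue  # string too short for this order
        overlap = sum((cand & ref).values())  # clipped intersection counts
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because matching happens at the character level, morphological variants ("approve" vs. "approved") still share most of their n-grams, where word-level BLEU would count a complete miss.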
Despite its limitations, BLEU's longevity reflects its practical value: it is fast to compute, requires only reference translations (not source sentences), and provides a standardized benchmark that enables cross-study comparison. The MT community increasingly uses BLEU alongside other metrics — METEOR, TER, COMET, and human evaluation — to provide a more complete picture of translation quality. The development of learned metrics that better correlate with human judgments is an active and important area of research.