
Human Evaluation of MT

Human evaluation remains the gold standard for assessing machine translation quality, employing structured protocols such as adequacy and fluency ratings, direct assessment, and comparative ranking to capture aspects of translation quality that automatic metrics cannot fully measure.

z = (raw_score − μ_annotator) / σ_annotator (standardized Direct Assessment score)

Human evaluation of machine translation provides the most reliable assessment of translation quality, capturing subtle aspects of meaning, fluency, style, and pragmatic appropriateness that automatic metrics approximate at best. While automatic metrics like BLEU and METEOR enable rapid development cycles, all major evaluation campaigns — including the annual WMT shared tasks — rely on human evaluation as the primary ranking criterion. The design, execution, and analysis of human evaluations present significant methodological challenges that have driven decades of research in evaluation science.

Evaluation Paradigms

Direct Assessment (DA): Annotators rate translations on a 0–100 continuous scale.
Raw scores are standardized per annotator:
z = (raw − μ_annotator) / σ_annotator

System score = mean of standardized segment scores
Significance testing via Wilcoxon rank-sum test

Adequacy DA: "How much of the meaning is expressed?"
Fluency DA: "How natural is the output?"
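The standardization and aggregation steps above can be sketched as follows, using hypothetical annotator data (significance testing via the Wilcoxon rank-sum test, e.g. with SciPy, is omitted to keep the sketch self-contained):

```python
from statistics import mean, stdev

# Hypothetical raw DA scores, keyed by annotator; each value is a list of
# (system, segment_score) judgments on the 0-100 scale.
raw = {
    "ann1": [("sysA", 78), ("sysA", 65), ("sysB", 55), ("sysB", 40)],
    "ann2": [("sysA", 92), ("sysA", 88), ("sysB", 70), ("sysB", 81)],
}

# Standardize each annotator's scores: z = (raw - mu_annotator) / sigma_annotator
z_by_system = {}
for judgments in raw.values():
    scores = [s for _, s in judgments]
    mu, sigma = mean(scores), stdev(scores)
    for system, s in judgments:
        z_by_system.setdefault(system, []).append((s - mu) / sigma)

# System score = mean of standardized segment scores
system_scores = {name: mean(zs) for name, zs in z_by_system.items()}
```

Because standardization removes each annotator's personal mean and spread, a harsh rater and a lenient rater who rank the systems the same way contribute the same signal to the system-level score.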

Early human evaluation protocols asked annotators to rate translations on separate 5-point scales for adequacy ("How much of the meaning of the original sentence is expressed?") and fluency ("How fluent is the translation?"). Direct Assessment (DA), introduced by Graham et al. (2013) and adopted by WMT from 2017 onward, uses a continuous 0–100 scale that provides finer-grained distinctions and is standardized per annotator to account for individual rating biases. Multidimensional Quality Metrics (MQM) provides an error-based framework where annotators identify and classify specific translation errors, yielding both an overall quality score and diagnostic error profiles.
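The error-based MQM scoring can be sketched as follows. The severity weights here follow the scheme reported by Freitag et al. (2021): 5 points per major error, 1 per minor error, and 0.1 per minor punctuation error; the example error annotations are hypothetical. The segment penalty is the weighted sum, so lower is better:

```python
def mqm_penalty(errors):
    """Weighted penalty for a list of (category, severity) error annotations."""
    total = 0.0
    for category, severity in errors:
        if severity == "major":
            total += 5.0
        elif category == "fluency/punctuation":
            total += 0.1  # minor punctuation errors weigh far less
        else:
            total += 1.0  # all other minor errors
    return total

# Hypothetical annotation of one translated segment
errors = [
    ("accuracy/mistranslation", "major"),
    ("fluency/grammar", "minor"),
    ("fluency/punctuation", "minor"),
]
penalty = mqm_penalty(errors)  # 5 + 1 + 0.1 = 6.1
```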

Inter-Annotator Agreement

Achieving reliable human evaluation requires careful control of annotator variability. Even trained professional translators show considerable disagreement on translation quality ratings, reflecting the inherent subjectivity of quality judgments and the multidimensional nature of translation quality. Inter-annotator agreement, measured by Cohen's kappa or intraclass correlation, is typically moderate (0.4–0.6) for absolute ratings but higher for pairwise comparisons. Strategies for improving reliability include annotator training, quality control through attention checks, standardization of scores, and aggregation over multiple annotators.
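Cohen's kappa, mentioned above, corrects raw agreement for the agreement expected by chance from each annotator's rating distribution. A minimal sketch, using hypothetical 5-point adequacy ratings from two annotators:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical judgments."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical 5-point adequacy ratings from two annotators
ratings_a = [5, 4, 4, 3, 2, 5, 1, 3, 4, 2]
ratings_b = [5, 4, 3, 3, 2, 4, 1, 3, 5, 2]
kappa = cohens_kappa(ratings_a, ratings_b)
```

Here the annotators agree on 7 of 10 items, but kappa is lower than 0.7 because some of that agreement would occur by chance.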

Crowdsourcing vs. Expert Evaluation

The WMT evaluation campaigns transitioned from expert annotators to crowdsourced workers, recruited via Amazon Mechanical Turk and later managed through the Appraise annotation platform, to obtain sufficient annotation volume at manageable cost. Crowdsourced evaluation requires careful quality control: quality-check items such as reference translations and deliberately degraded outputs are embedded among genuine translations, inconsistent annotators are filtered out, and scores are standardized to account for different rating behaviors. While crowdsourced evaluation is noisier than expert evaluation at the level of individual annotations, aggregating many judgments can still produce reliable system-level rankings.
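One simple quality-control filter along these lines can be sketched as follows. Everything here is hypothetical: the workers, their scores, and the 20-point margin. Each worker occasionally rates a deliberately degraded reference translation, and reliable workers should score it well below their own average:

```python
from statistics import mean

# Hypothetical crowd workers: regular DA scores plus scores they gave
# to deliberately degraded reference translations (attention checks).
workers = {
    "w1": {"scores": [70, 82, 65, 90], "degraded": [25, 30]},
    "w2": {"scores": [60, 55, 72, 66], "degraded": [68, 61]},  # fails the check
}

def passes_check(data, margin=20):
    """Keep workers who rate degraded items well below their own mean."""
    return mean(data["scores"]) - mean(data["degraded"]) >= margin

kept = [w for w, data in workers.items() if passes_check(data)]
```

Worker w2 rates the degraded items about as highly as genuine translations, so their judgments are discarded before aggregation.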

Evaluation of Neural MT

The advent of neural MT has complicated human evaluation in several ways. NMT produces more fluent output than SMT, making fluency differences between systems harder to detect. Quality differences between top systems have narrowed, requiring more annotations to achieve statistically significant rankings. NMT hallucinations — fluent but unfaithful outputs — can fool both automatic metrics and human annotators who are not proficient in the source language. Source-based evaluation, where annotators assess the translation against the source sentence (requiring bilingual competence), is more reliable for detecting adequacy errors but is also more expensive.

The development of better human evaluation methodologies and of better automatic metrics proceeds in tandem. Meta-evaluation, or evaluating the evaluators, treats human judgments as the gold standard against which automatic metrics are validated. The annual WMT Metrics shared task computes the correlation between automatic metric scores and human judgments, driving the development of metrics that better capture what humans value in translations. This virtuous cycle between human and automatic evaluation continues to refine our understanding of translation quality and how to measure it.


References

  1. Graham, Y., Baldwin, T., Moffat, A., & Zobel, J. (2013). Continuous measurement scales in human evaluation of machine translation. Proceedings of the 7th Linguistic Annotation Workshop, 33–41. aclanthology.org/W13-2305
  2. Lommel, A., Uszkoreit, H., & Burchardt, A. (2014). Multidimensional Quality Metrics (MQM): A framework for declaring and describing translation quality metrics. Tradumàtica, 12, 455–463. doi:10.5565/rev/tradumatica.77
  3. Freitag, M., Foster, G., Grangier, D., Ratnakar, V., Tan, Q., & Macherey, W. (2021). Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the ACL, 9, 1460–1474. doi:10.1162/tacl_a_00437
