Hierarchical phrase-based SMT (Chiang, 2005, 2007) addresses the fundamental limitation of flat phrase-based models: their inability to capture nested and long-distance reordering patterns. By extending phrase pairs to include nonterminal symbols, hierarchical models can represent recursive translation patterns. For instance, a rule X → ⟨X₁ de X₂, X₂ of X₁⟩ captures the systematic reordering of possessive constructions between Chinese and English. These rules are learned automatically from parallel text without syntactic annotation, using only the word alignments.
Synchronous Context-Free Grammar
A hierarchical rule has the general form X → ⟨γ, α, ~⟩, where:
γ = source side (terminals and nonterminals)
α = target side (terminals and nonterminals)
~ = one-to-one correspondence between the nonterminals on the two sides
Examples:
X → ⟨X₁ de X₂, X₂ of X₁⟩ (reordering)
X → ⟨acheter X₁, buy X₁⟩ (monotone)
X → ⟨maison, house⟩ (lexical phrase pair)
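The rule form above can be sketched in code. This is a minimal illustration, not the data structure of any particular toolkit: nonterminal slots are represented as integers shared between the two sides, which encodes the correspondence ~, and `Rule` and `apply_rule` are hypothetical names.

```python
# Illustrative sketch of an SCFG rule: strings are terminals, integers
# are nonterminal slots; the shared integers encode the correspondence ~.
from dataclasses import dataclass

@dataclass
class Rule:
    src: list  # source side, e.g. [0, "de", 1]
    tgt: list  # target side, e.g. [1, "of", 0]

def apply_rule(rule, sub_translations):
    """Substitute sub-span translations into the rule's target side.

    sub_translations[i] is the translation chosen for nonterminal slot i.
    """
    out = []
    for sym in rule.tgt:
        if isinstance(sym, int):
            out.extend(sub_translations[sym])  # plug in sub-derivation
        else:
            out.append(sym)                    # copy terminal
    return out

# The reordering rule X -> <X1 de X2, X2 of X1>, applied with
# X1 = "zhongguo" -> "China" and X2 = "shoudu" -> "capital":
reorder = Rule(src=[0, "de", 1], tgt=[1, "of", 0])
print(apply_rule(reorder, [["China"], ["capital"]]))  # ['capital', 'of', 'China']
```

Note that the source side determines which sub-spans the slots cover, while the target side determines the order in which their translations are emitted; this asymmetry is exactly what lets a single rule express reordering.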
Hierarchical rules are extracted from word-aligned parallel text by identifying phrase pairs (as in standard phrase-based SMT) and then generalizing them by replacing sub-phrase pairs with nonterminal symbols. The resulting synchronous context-free grammar (SCFG) typically uses a single nonterminal category X plus a sentence-level start symbol S. This minimal syntactic structure is sufficient to capture a wide range of reordering phenomena while keeping the grammar tractable for decoding. Rules are scored with the same features as phrase pairs — forward and inverse translation probabilities, lexical weights — plus additional features for rule type and arity.
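The generalization step can be sketched as follows. This is a deliberately simplified illustration of replacing one nested sub-phrase pair with a nonterminal slot; a real extractor works over alignment-consistent spans and enforces constraints (e.g. at most two nonterminals, no adjacent source nonterminals). The function name `generalize` is hypothetical.

```python
# Sketch of the generalization step in hierarchical rule extraction:
# given an extracted phrase pair and a sub-phrase pair nested inside it,
# replace the sub-phrase with a nonterminal slot "X1" on both sides.

def generalize(phrase_pair, sub_pair, index=1):
    """Replace sub_pair with a nonterminal X<index> on both sides."""
    src, tgt = phrase_pair
    sub_src, sub_tgt = sub_pair
    slot = f"X{index}"

    def substitute(seq, sub):
        # Find the sub-phrase and splice in the nonterminal slot.
        n = len(sub)
        for i in range(len(seq) - n + 1):
            if seq[i:i + n] == sub:
                return seq[:i] + [slot] + seq[i + n:]
        return seq

    return substitute(src, sub_src), substitute(tgt, sub_tgt)

# Phrase pair <"zhongguo de shoudu", "the capital of China"> with the
# nested sub-pair <"zhongguo", "China">:
src, tgt = generalize((["zhongguo", "de", "shoudu"],
                       ["the", "capital", "of", "China"]),
                      (["zhongguo"], ["China"]))
print(src, tgt)  # ['X1', 'de', 'shoudu'] ['the', 'capital', 'of', 'X1']
```

Repeating this step for a second nested sub-pair yields the fully generalized reordering rule X → ⟨X₁ de X₂, X₂ of X₁⟩ shown earlier.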
CYK Decoding
Decoding in hierarchical SMT uses a bottom-up CYK-style chart parsing algorithm on the source sentence. Each cell in the chart stores the best translations for a source span, computed by combining translations of sub-spans according to the grammar rules. Language model integration requires maintaining target-side boundary words in each chart cell, and cube pruning (Chiang, 2007) is used to efficiently explore the combinatorial space of rule applications and language model contexts. The time complexity is O(n³) in the source sentence length, compared to the roughly linear-time beam search used in phrase-based decoding with a fixed distortion limit.
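The chart-filling loop can be sketched as below. This is a heavily simplified illustration under stated assumptions: a single nonterminal X, only lexical rules plus rules of the shape X → ⟨X₁ t X₂, …⟩, no language model, and no cube pruning (each cell keeps the single best derivation rather than a beam). All names are illustrative, not from a real decoder.

```python
# Simplified CKY-style chart decoding sketch for hierarchical SMT.
# chart[(i, j)] holds the best (translation, logprob) for source span i..j.

def decode(words, lexical, reordering_rules):
    """
    words: source tokens.
    lexical: maps a source phrase (tuple) to (target tuple, logprob).
    reordering_rules: list of (src_pattern, tgt_pattern, logprob), where
        patterns use 0/1 for nonterminal slots and strings for terminals;
        this sketch only handles the source pattern [0, term, 1].
    """
    n = len(words)
    chart = {}
    for span in range(1, n + 1):          # bottom-up: short spans first
        for i in range(n - span + 1):
            j = i + span
            best = None
            # Lexical rules: translate the whole span as a phrase.
            phrase = tuple(words[i:j])
            if phrase in lexical:
                tgt, lp = lexical[phrase]
                best = (list(tgt), lp)
            # Rules X -> <X1 t X2, ...>: anchor the terminal at position k.
            for src_pat, tgt_pat, lp in reordering_rules:
                term = src_pat[1]
                for k in range(i + 1, j - 1):
                    if words[k] != term:
                        continue
                    left = chart.get((i, k))        # derivation for X1
                    right = chart.get((k + 1, j))   # derivation for X2
                    if left and right:
                        subs = [left, right]
                        out, score = [], lp + left[1] + right[1]
                        for sym in tgt_pat:
                            if isinstance(sym, int):
                                out.extend(subs[sym][0])
                            else:
                                out.append(sym)
                        if best is None or score > best[1]:
                            best = (out, score)
            if best:
                chart[(i, j)] = best
    return chart

lexical = {("zhongguo",): (("China",), -0.5),
           ("shoudu",): (("capital",), -0.7)}
rules = [([0, "de", 1], [1, "of", 0], -0.2)]
chart = decode(["zhongguo", "de", "shoudu"], lexical, rules)
print(chart[(0, 3)][0])  # ['capital', 'of', 'China']
```

The O(n³) cost is visible in the three nested loops over span length, start position, and split point; integrating a language model multiplies each cell's work by the number of distinct boundary-word contexts, which is what cube pruning keeps under control.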
While Chiang's original model uses unlabeled nonterminals, subsequent work incorporated syntactic labels from parse trees. Syntax-augmented models (Zollmann and Venugopal, 2006) label nonterminals with syntactic categories from target-side parse trees, allowing the grammar to distinguish between NP, VP, and other constituents. String-to-tree, tree-to-string, and tree-to-tree models apply syntactic constraints on one or both sides of the translation. These syntactically informed models achieved further improvements, particularly for language pairs with significant structural divergences.
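The effect of labeled nonterminals can be illustrated with a small sketch. The point is only the matching constraint: a sub-derivation may fill a slot only if its label agrees, which rules out many substitutions the single-X grammar would permit. `LabeledRule` and `slot_compatible` are hypothetical names, not SAMT's actual representation.

```python
# Sketch of label matching with syntactically labeled nonterminals.
from dataclasses import dataclass

@dataclass
class LabeledRule:
    lhs: str      # label this rule derives, e.g. "NP"
    slots: list   # required labels of its nonterminal gaps, in order
    src: list
    tgt: list

def slot_compatible(rule, sub_labels):
    """A set of sub-derivations may plug in only if labels line up."""
    return list(sub_labels) == rule.slots

# A possessive rule restricted to noun-phrase arguments:
np_rule = LabeledRule(lhs="NP", slots=["NP", "NP"],
                      src=[0, "de", 1], tgt=[1, "of", 0])

print(slot_compatible(np_rule, ["NP", "NP"]))  # True
print(slot_compatible(np_rule, ["VP", "NP"]))  # False: a VP cannot fill an NP slot
```

In an unlabeled grammar both substitutions would be allowed, so labels trade coverage for precision, which is the over-generation issue discussed below.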
Impact and Limitations
Hierarchical phrase-based models achieved consistent improvements over flat phrase-based systems, particularly for language pairs with long-distance reordering such as Chinese-English and Arabic-English. The model's ability to represent nested reordering within a single formal framework was theoretically elegant and practically effective. Hierarchical SMT also influenced the development of syntax-based neural MT models and tree-based decoding strategies.
However, hierarchical models incur higher computational costs than flat phrase-based systems, and their extracted grammars are far larger than the corresponding phrase tables. The single-nonterminal grammar, while simpler, may over-generate by permitting reorderings that no linguistically motivated grammar would produce. The tension between the expressiveness of the grammar and the tractability of decoding remains a central concern, and various pruning and filtering strategies are needed to make hierarchical systems practical at scale.