Computational Linguistics

Encoder-Decoder Architecture

The encoder-decoder architecture is the foundational neural network design for sequence-to-sequence tasks, in which an encoder network compresses the source input into a fixed-length or variable-length representation and a decoder network generates the target output from that representation.

c = encode(x₁, ..., x_S);  P(y_t | y_{&lt;t}, x) = decode(y_{t-1}, s_t, c)

The encoder-decoder architecture, also known as the sequence-to-sequence (seq2seq) model, was independently proposed by Cho et al. (2014) and Sutskever et al. (2014) for neural machine translation. The architecture consists of two recurrent neural networks: an encoder that processes the source sentence into a continuous representation, and a decoder that generates the target sentence conditioned on that representation. This elegant formulation maps naturally onto the translation problem and provides a general framework for any task that transforms one sequence into another.

Architecture Details

Encoder-Decoder with RNN Encoder:
h_t = f(x_t, h_{t-1}) for t = 1, ..., S
c = h_S (or c = q({h₁, ..., h_S}))

Decoder:
s_t = g(y_{t-1}, s_{t-1}, c)
P(y_t | y_{&lt;t}, c) = softmax(W_o s_t)

f, g = RNN cells (LSTM or GRU)

In the original formulation, the encoder reads the source sentence word by word, updating its hidden state at each step. The final hidden state serves as the context vector c — a fixed-length summary of the entire source sentence. The decoder initializes its hidden state with c and generates the target sentence one token at a time, feeding each generated token back as input to produce the next token. Both encoder and decoder typically use LSTM or GRU cells to capture long-range dependencies, and the encoder is often bidirectional, processing the sentence in both forward and reverse directions.
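The loop above can be sketched with a minimal tanh-RNN encoder-decoder in NumPy. This is a toy sketch, not the original LSTM/GRU formulation: the parameters (`W_xh`, `W_hh`, `W_hy`) are hypothetical and randomly initialized rather than trained, and the context vector conditions the decoder only through its initial hidden state, as in the basic setup described here.

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 8, 5                          # hidden size, toy vocabulary size
# Hypothetical, untrained parameters for illustration only.
W_xh = rng.normal(0, 0.1, (H, V))    # input-to-hidden
W_hh = rng.normal(0, 0.1, (H, H))    # hidden-to-hidden
W_hy = rng.normal(0, 0.1, (V, H))    # hidden-to-output

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

def rnn_step(x, h):
    # h_t = f(x_t, h_{t-1}) with a simple tanh cell standing in for LSTM/GRU
    return np.tanh(W_xh @ x + W_hh @ h)

def encode(src_ids):
    h = np.zeros(H)
    for i in src_ids:                # read the source token by token
        h = rnn_step(one_hot(i), h)
    return h                         # c = h_S: fixed-length context vector

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode(c, max_len=4, bos=0):
    s, y, out = c.copy(), bos, []    # decoder state initialized with c
    for _ in range(max_len):
        s = rnn_step(one_hot(y), s)  # s_t = g(y_{t-1}, s_{t-1}, c)
        p = softmax(W_hy @ s)        # P(y_t | y_{<t}, c)
        y = int(p.argmax())          # greedy choice, fed back as next input
        out.append(y)
    return out

c = encode([1, 2, 3])
print(c.shape, decode(c))
```

With trained parameters, the greedy `argmax` loop would typically be replaced by beam search; the control flow is otherwise the same.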

The Information Bottleneck

The original encoder-decoder architecture suffers from an information bottleneck: the entire source sentence must be compressed into a single fixed-length vector c, regardless of sentence length. For short sentences this works adequately, but for longer sentences, critical information is inevitably lost. Cho et al. (2014) showed that translation quality degrades rapidly as source sentence length increases. This limitation directly motivated the development of attention mechanisms (Bahdanau et al., 2015), which allow the decoder to access all encoder hidden states rather than relying on a single compressed representation.
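The bottleneck is easy to see concretely: whatever the input length, the recurrence collapses it into the same small number of dimensions. A toy demonstration, with hypothetical random parameters and made-up "sentences" of token vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 16, 32                      # context size, token embedding size
W = rng.normal(0, 0.1, (H, H))     # hypothetical recurrent weights
U = rng.normal(0, 0.1, (H, D))     # hypothetical input weights

def fixed_context(token_vecs):
    # Compress an entire sentence into one H-dim vector: c = h_S.
    h = np.zeros(H)
    for x in token_vecs:
        h = np.tanh(U @ x + W @ h)
    return h

short = rng.normal(size=(3, D))    # 3-token "sentence"
long = rng.normal(size=(50, D))    # 50-token "sentence"
# Both are squeezed into the same 16 numbers, regardless of length:
print(fixed_context(short).shape, fixed_context(long).shape)  # (16,) (16,)
```

Attention removes exactly this constraint by letting the decoder consult all S encoder states instead of this single vector.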

Reversing the Source Sentence

Sutskever et al. (2014) discovered that reversing the order of the source sentence before encoding significantly improved translation quality. This trick works because it places the first words of the source sentence closer to the first words of the target sentence in the computational graph, creating shorter gradient paths for the most critical alignments. While the attention mechanism subsequently addressed the underlying problem more generally, the source-reversal trick illustrates how the geometry of information flow in neural networks affects learning and performance.
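As a preprocessing step, the trick is a one-line transformation applied to the source side only (the example sentence below is illustrative):

```python
def reverse_source(src_tokens):
    # Sutskever et al. (2014): feed the source reversed; the target side
    # of each training pair is left unchanged.
    return list(reversed(src_tokens))

src = ["the", "cat", "sat"]
print(reverse_source(src))  # ['sat', 'cat', 'the']
# After reversal, "the" is the last token the encoder reads, so it sits
# closest to the start of decoding in the unrolled computation graph,
# shortening the gradient path for the earliest alignments.
```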

Variants and Extensions

Numerous variants of the encoder-decoder architecture have been proposed. Deep architectures stack multiple RNN layers in both the encoder and decoder, with residual connections to facilitate gradient flow. Bidirectional encoders process the source in both directions and concatenate the resulting hidden states. Multi-source models use separate encoders for different input modalities (e.g., text and images) and combine their representations for the decoder. The encoder-decoder framework has been applied far beyond translation to summarization, dialogue generation, code generation, and many other sequence-to-sequence tasks.
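The bidirectional-encoder variant can be sketched as two independent passes whose per-position states are concatenated. The parameters below are hypothetical and untrained; only the wiring is the point:

```python
import numpy as np

rng = np.random.default_rng(1)
H, D, S = 4, 6, 5                       # hidden size, input size, length
Wf = rng.normal(0, 0.1, (H, H))         # forward-pass weights
Uf = rng.normal(0, 0.1, (H, D))
Wb = rng.normal(0, 0.1, (H, H))         # backward-pass weights
Ub = rng.normal(0, 0.1, (H, D))

def run_rnn(xs, W, U):
    # Return the hidden state at every position, shape (S, H).
    h = np.zeros(H)
    states = []
    for x in xs:
        h = np.tanh(U @ x + W @ h)
        states.append(h)
    return np.stack(states)

xs = rng.normal(size=(S, D))            # toy source sequence
fwd = run_rnn(xs, Wf, Uf)               # left-to-right pass
bwd = run_rnn(xs[::-1], Wb, Ub)[::-1]   # right-to-left pass, realigned
h_bi = np.concatenate([fwd, bwd], axis=1)  # per-position concat: (S, 2H)
print(h_bi.shape)  # (5, 8)
```

Each position thus carries context from both its left and its right, which is what attention-based decoders later exploit directly.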

The Transformer architecture (Vaswani et al., 2017) replaced recurrent computation with self-attention, but retained the fundamental encoder-decoder structure. The encoder produces a sequence of contextualized representations using self-attention, and the decoder generates the target sequence using both self-attention and cross-attention to the encoder output. This demonstrates the enduring influence of the encoder-decoder paradigm: while the specific computational mechanisms have evolved, the abstract architecture of encoding input into representations and decoding them into output remains central to modern NLP.

References

  1. Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of EMNLP 2014, 1724–1734. doi:10.3115/v1/D14-1179
  2. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27, 3104–3112. doi:10.48550/arXiv.1409.3215
  3. Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. Proceedings of ICLR 2015. doi:10.48550/arXiv.1409.0473