Cross-entropy loss is the foundational training objective for virtually all language models, from n-gram models estimated via maximum likelihood to modern transformers trained with gradient descent. For a language model that predicts a probability distribution q over the vocabulary at each position, the cross-entropy with the true distribution p (a one-hot vector indicating the actual next token) measures how many bits the model needs on average to encode the true token using its predicted distribution. Minimizing cross-entropy is equivalent to maximizing the log-likelihood of the training data, establishing a direct connection between information theory and statistical estimation.
Definition and Properties
Per-sequence: L = -(1/T) Σₜ₌₁ᵀ log q(wₜ | w₁, ..., wₜ₋₁)
Relation to entropy and KL divergence:
H(p, q) = H(p) + D_KL(p ‖ q)
Since H(p) is constant, minimizing H(p,q) ≡ minimizing D_KL(p ‖ q)
Perplexity: PP = 2^{H(p,q)} when H is measured in bits; equivalently PP = exp(L) when L uses the natural log (nats)
Cross-entropy decomposes into the true entropy H(p) plus the KL divergence D_KL(p ‖ q). Since the true entropy is a constant independent of the model, minimizing cross-entropy is equivalent to minimizing the KL divergence between the true distribution and the model's distribution. For language modeling with one-hot targets, the cross-entropy simplifies to the negative log probability of the correct token: L = -log q(wₜ). When q is produced by a softmax, the gradient of this loss with respect to the logits has a particularly clean form: it is the difference between the predicted and target probabilities (q - p), which makes optimization straightforward.
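The q - p gradient identity can be checked numerically. The sketch below (plain Python, hypothetical helper names) compares the analytic gradient against finite differences of the loss:

```python
import math

def softmax(z):
    # Shift by the max logit before exponentiating for numerical stability.
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(z, target):
    # One-hot target: the loss is just -log q(target).
    return -math.log(softmax(z)[target])

logits = [2.0, -1.0, 0.5]
target = 0
q = softmax(logits)

# Analytic gradient w.r.t. the logits: q - p, with p one-hot at `target`.
analytic = [q[k] - (1.0 if k == target else 0.0) for k in range(len(logits))]

# Finite-difference check of each component.
eps = 1e-6
for k in range(len(logits)):
    bumped = list(logits)
    bumped[k] += eps
    numeric = (cross_entropy(bumped, target) - cross_entropy(logits, target)) / eps
    assert abs(numeric - analytic[k]) < 1e-4
```

Note that the gradient components sum to zero, since both q and p sum to one; the target's logit is pushed up while all others are pushed down.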
Connection to Maximum Likelihood
Minimizing the average cross-entropy over a dataset is equivalent to maximum likelihood estimation. If the model parameterizes q(w | context; Θ), then the total cross-entropy L = -(1/N) Σᵢ log q(wᵢ | contextᵢ; Θ) is exactly the negative log-likelihood normalized by the number of tokens. This equivalence means that all of the theoretical properties of MLE — consistency, asymptotic efficiency, asymptotic normality — transfer to cross-entropy minimization. In practice, regularization (weight decay, dropout) is added to prevent overfitting, which corresponds to MAP estimation rather than pure MLE.
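The equivalence can be seen concretely in the simplest case, a unigram model: the distribution that minimizes the average negative log-likelihood over a corpus is exactly the empirical token frequency, i.e. the MLE. A toy check (illustrative corpus and vocabulary, not from the text):

```python
import math

# Toy corpus of token ids over a 3-word vocabulary.
corpus = [0, 0, 1, 0, 2, 1, 0, 1]
V = 3

def avg_nll(q, data):
    # Average cross-entropy with one-hot targets = negative log-likelihood per token.
    return -sum(math.log(q[w]) for w in data) / len(data)

# MLE for a categorical (unigram) model: empirical token frequencies.
counts = [corpus.count(w) for w in range(V)]
mle = [c / len(corpus) for c in counts]

# Any other distribution incurs a strictly higher average NLL.
other = [0.5, 0.3, 0.2]
assert avg_nll(mle, corpus) < avg_nll(other, corpus)
```

The gap between the two NLL values is exactly D_KL(empirical ‖ other), which is why the MLE wins.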
Label smoothing, introduced by Szegedy et al. (2016) and widely used in transformer training, modifies the cross-entropy target from a hard one-hot distribution to a mixture: p_smooth(w) = (1-epsilon) · p(w) + epsilon / |V|, where epsilon is typically 0.1. This prevents the model from becoming overconfident in its predictions and provides a regularization effect. Vaswani et al. (2017) found that label smoothing improved BLEU scores in machine translation despite slightly increasing perplexity, because it encourages the model to maintain reasonable probability mass on plausible alternatives rather than concentrating all mass on the training label.
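A minimal sketch of the smoothing formula above, in plain Python with an illustrative predicted distribution q (not from the text):

```python
import math

def smooth_targets(target, vocab_size, epsilon=0.1):
    # p_smooth(w) = (1 - epsilon) * one_hot(w) + epsilon / |V|
    p = [epsilon / vocab_size] * vocab_size
    p[target] += 1.0 - epsilon
    return p

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

q = [0.7, 0.2, 0.05, 0.05]               # model's predicted distribution
hard = smooth_targets(0, 4, epsilon=0.0)  # plain one-hot target
soft = smooth_targets(0, 4, epsilon=0.1)  # smoothed target

# When q concentrates mass on the label, the smoothed loss is higher:
# it keeps penalizing the model for ignoring the alternatives.
assert cross_entropy(soft, q) > cross_entropy(hard, q)
```

Because the smoothed target never reaches a one-hot distribution, the loss is bounded away from zero, which is one way to see why label smoothing slightly increases measured perplexity.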
Practical Considerations
In practice, cross-entropy loss is computed using the log-softmax trick for numerical stability: instead of computing the softmax and then taking its log, the log-softmax is computed directly as log q(w) = z_w - log(Σₖ exp(z_k)), where z is the logit vector. The log-sum-exp term is itself evaluated by first subtracting the maximum logit, log Σₖ exp(z_k) = m + log Σₖ exp(z_k - m) with m = maxₖ z_k, so that no exponentiation ever overflows for large logits. Modern deep learning frameworks (PyTorch, JAX) implement fused log-softmax-cross-entropy kernels that are both numerically stable and computationally efficient, combining the forward and backward passes to reduce memory usage.
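A minimal stable log-softmax in plain Python, using the max-subtracted log-sum-exp. The logits are chosen large enough that a naive exp() would overflow a float64:

```python
import math

def log_softmax(z):
    # Max-subtracted log-sum-exp: exp() never sees an argument > 0,
    # so arbitrarily large logits cannot overflow.
    m = max(z)
    lse = m + math.log(sum(math.exp(x - m) for x in z))
    return [x - lse for x in z]

# math.exp(1000.0) raises OverflowError, but the stable version is fine.
logits = [1000.0, 999.0, 995.0]
lp = log_softmax(logits)

# The exponentiated log-probabilities form a valid distribution.
assert abs(sum(math.exp(x) for x in lp) - 1.0) < 1e-9
```

The cross-entropy for a target token w is then simply -lp[w], with no separate softmax pass.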
Cross-entropy loss has several desirable properties for training language models. It is convex with respect to the model's output probabilities (though not with respect to the parameters of a deep neural network), it has well-behaved gradients that do not saturate, and it provides a clear information-theoretic interpretation: the gap between the cross-entropy and the true entropy represents the room for improvement. The relationship between cross-entropy and perplexity (PP = exp(H)) provides an intuitive metric: reducing the cross-entropy by 0.1 nats corresponds to reducing perplexity by roughly 10% (a factor of exp(-0.1) ≈ 0.905), giving practitioners a concrete sense of what loss improvements mean for model quality.
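The perplexity arithmetic can be worked through directly; the cross-entropy values below are illustrative, not from the text:

```python
import math

# A 0.1-nat drop in cross-entropy multiplies perplexity by exp(-0.1),
# regardless of the starting value.
before, after = 3.0, 2.9                       # cross-entropies in nats
pp_before, pp_after = math.exp(before), math.exp(after)
reduction = 1.0 - pp_after / pp_before

# The relative reduction depends only on the size of the drop.
assert abs(reduction - (1.0 - math.exp(-0.1))) < 1e-12
```

Since 1 - exp(-0.1) ≈ 0.095, the "roughly 10%" rule of thumb is a slight overstatement but accurate enough for back-of-the-envelope comparisons.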