Mel-Frequency Cepstral Coefficients (MFCCs) have been the dominant acoustic feature representation in speech and audio processing for over four decades. Introduced by Davis and Mermelstein in 1980, MFCCs capture the spectral envelope of a speech frame in a compact, decorrelated representation that approximates human auditory perception. The "mel-frequency" refers to the perceptually motivated frequency scale, and "cepstral" indicates that the features reside in the cepstral domain — the result of taking an inverse spectral transform (classically the inverse Fourier transform; in the MFCC pipeline, a DCT) of the log spectrum.
MFCC Computation
1. Pre-emphasis: y[n] = x[n] − 0.97·x[n−1], boosting high frequencies
2. Framing & windowing: 25 ms Hamming windows, 10 ms shift
3. Power spectrum: |FFT(frame)|²
4. Mel filterbank: E_m = ∑_k H_m(k) · |X(k)|²
5. Log compression: log(E_m)
6. DCT: c_j = ∑_{m=1}^{M} log(E_m) · cos(j(m−0.5)π/M)
Typical output: 13 MFCCs (c_0 through c_12) per frame
The computation proceeds through a sequence of well-motivated signal processing steps. The mel-scaled filterbank applies triangular filters spaced according to the mel scale, which is approximately linear below 1000 Hz and logarithmic above, matching the frequency resolution of the human cochlea. The logarithmic compression of filterbank energies simulates the nonlinear loudness perception of the auditory system. The discrete cosine transform decorrelates the log filterbank energies, concentrating most of the information in the first few coefficients and enabling diagonal covariance matrices in GMM-based acoustic models.
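The steps above can be sketched compactly in NumPy. This is a minimal illustration, not a production extractor: the sampling rate (16 kHz), FFT size (512), filter count (26), and the function names are illustrative assumptions, and real implementations (e.g., librosa, Kaldi) differ in details such as windowing and filterbank normalization.

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy's formula: ~linear below 1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13,
         frame_len=0.025, frame_shift=0.010, preemph=0.97):
    """Sketch of the classic MFCC pipeline, pre-emphasis through DCT."""
    # 1. Pre-emphasis
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])

    # 2. Framing (25 ms windows, 10 ms shift) and Hamming windowing
    flen, fshift = int(sr * frame_len), int(sr * frame_shift)
    n_frames = 1 + max(0, (len(sig) - flen) // fshift)
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(flen)

    # 3. Power spectrum |FFT(frame)|^2
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # 4. Triangular mel filterbank: filters equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    energies = power @ fbank.T

    # 5. Log compression (small floor avoids log(0))
    log_e = np.log(np.maximum(energies, 1e-10))

    # 6. DCT over filterbank index m, keeping the first n_ceps coefficients:
    #    c_j = sum_m log(E_m) * cos(j * pi * (m - 0.5) / M)
    m_idx = np.arange(1, n_mels + 1)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), m_idx - 0.5) / n_mels)
    return log_e @ dct.T
```

Applied to one second of 16 kHz audio, this yields a (98, 13) matrix: 98 frames at a 10 ms shift, 13 coefficients per frame.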
Properties and Interpretation
The lower-order MFCCs capture the broad spectral shape (related to the vocal tract resonance characteristics), while higher-order coefficients capture finer spectral detail. The zeroth coefficient c_0 represents overall log energy, the first coefficient c_1 correlates with spectral tilt (the balance between low and high frequencies), and coefficients c_2 through c_4 are most correlated with vowel identity through their relationship to formant positions. Coefficients beyond c_12 are typically discarded as they capture increasingly fine spectral detail that is more speaker- and noise-specific than linguistically informative.
With the rise of deep learning, many speech systems have moved from MFCCs to log mel filterbank features (the step before DCT), because neural networks can learn their own decorrelation and do not benefit from the assumptions built into the DCT. End-to-end systems like wav2vec 2.0 go further by learning features directly from raw waveforms. Nevertheless, MFCCs remain the feature of choice for many production systems, lightweight applications, and tasks where computational efficiency is paramount. Their decades of empirical validation and well-understood properties make them a reliable baseline.
Dynamic features (deltas and delta-deltas) computed from MFCC trajectories capture the temporal evolution of the spectral envelope, encoding transitions between phonemes that can be as discriminative as the static features themselves. The standard 39-dimensional MFCC feature vector (13 static + 13 delta + 13 delta-delta) has been the default input to speech recognition systems from the GMM-HMM era through the early DNN period.
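The deltas are conventionally computed with a regression over a window of ±N frames (N = 2 in the common HTK-style formulation). A minimal sketch, with a hypothetical `deltas` helper and edge padding as one reasonable boundary choice:

```python
import numpy as np

def deltas(feats, N=2):
    """Regression-based delta features over a +/-N frame window:
    d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2)."""
    padded = np.pad(feats, ((N, N), (0, 0)), mode='edge')  # repeat edge frames
    denom = 2 * sum(n * n for n in range(1, N + 1))
    T = len(feats)
    return sum(n * (padded[N + n:T + N + n] - padded[N - n:T + N - n])
               for n in range(1, N + 1)) / denom
```

The 39-dimensional vector is then the stack `np.hstack([feats, deltas(feats), deltas(deltas(feats))])`. On a linear ramp the interior delta values come out exactly 1 per frame, which is a quick sanity check on the regression weights.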
Robustness enhancements to MFCCs include cepstral mean and variance normalization (CMVN) to remove channel effects, RASTA filtering to suppress slowly varying convolutional noise, and feature-space maximum likelihood linear regression (fMLLR) for speaker normalization. These techniques, layered on top of the basic MFCC computation, have kept the feature relevant and competitive across decades of advancing speech technology.
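Of these, per-utterance CMVN is the simplest to illustrate: subtracting the cepstral mean cancels stationary channel effects (a convolutional distortion becomes additive in the log-cepstral domain), and dividing by the standard deviation equalizes dynamic range. A minimal sketch, assuming a frames-by-coefficients matrix:

```python
import numpy as np

def cmvn(feats, eps=1e-10):
    """Per-utterance cepstral mean and variance normalization.

    feats: (n_frames, n_ceps) array. Each coefficient dimension is shifted
    to zero mean and scaled to unit variance across the utterance; eps
    guards against division by zero for constant dimensions."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / (sigma + eps)
```

In practice CMVN statistics may also be accumulated per speaker or over a sliding window for streaming use, rather than over the whole utterance as here.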