
Mel-Frequency Cepstral Coefficients

Mel-Frequency Cepstral Coefficients are a compact spectral representation of speech derived by applying the discrete cosine transform to log mel-scaled filterbank energies, forming the standard acoustic feature for speech recognition.

c_k = ∑_{m=1}^{M} log(E_m) · cos(k(m − 0.5)π/M)

Mel-Frequency Cepstral Coefficients (MFCCs) have been the dominant acoustic feature representation in speech and audio processing for over four decades. Introduced by Davis and Mermelstein in 1980, MFCCs capture the spectral envelope of a speech frame in a compact, decorrelated representation that approximates human auditory perception. The "mel-frequency" refers to the perceptually motivated frequency scale, and "cepstral" indicates that the features reside in the cepstral domain — the result of taking the inverse transform of the log spectrum.

MFCC Computation

MFCC Extraction Steps

1. Pre-emphasis: s'[n] = s[n] − 0.97 · s[n−1]
2. Framing & windowing: 25 ms Hamming windows, 10 ms shift
3. Power spectrum: |FFT(frame)|²
4. Mel filterbank: E_m = ∑_k H_m(k) · |X(k)|²
5. Log compression: log(E_m)
6. DCT: c_j = ∑_{m=1}^{M} log(E_m) · cos(j(m−0.5)π/M)

Typical output: 13 MFCCs (c_0 through c_12) per frame
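The six steps above can be sketched end-to-end in NumPy. This is a minimal illustration, not a reference implementation: the function names (`mfcc`, `mel_filterbank`) and default parameters (512-point FFT, 26 filters, 16 kHz) are choices made here for the example, and common refinements such as liftering or energy replacement of c_0 are omitted.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters with centres equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):                 # rising slope
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                # falling slope
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=400, frame_shift=160,
         n_fft=512, n_filters=26, n_ceps=13):
    # 1. Pre-emphasis: s'[n] = s[n] - 0.97 s[n-1]
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing and windowing: 25 ms Hamming frames, 10 ms shift at 16 kHz
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len] * window
                       for i in range(n_frames)])
    # 3. Power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 4-5. Mel filterbank energies, then log compression
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    log_energies = np.log(power @ fbank.T + 1e-10)
    # 6. DCT: c_j = sum_m log(E_m) * cos(j (m - 0.5) pi / M)
    m = np.arange(1, n_filters + 1)
    j = np.arange(n_ceps)
    dct_basis = np.cos(np.outer(j, m - 0.5) * np.pi / n_filters)
    return log_energies @ dct_basis.T
```

For a one-second signal at 16 kHz, this yields 98 frames of 13 coefficients each, matching the typical output described above.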

The computation proceeds through a sequence of well-motivated signal processing steps. The mel-scaled filterbank applies triangular filters spaced according to the mel scale, which is approximately linear below 1000 Hz and logarithmic above, matching the frequency resolution of the human cochlea. The logarithmic compression of filterbank energies simulates the nonlinear loudness perception of the auditory system. The discrete cosine transform decorrelates the log filterbank energies, concentrating most of the information in the first few coefficients and enabling diagonal covariance matrices in GMM-based acoustic models.
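The mel scale's linear-below-1000-Hz, logarithmic-above behaviour can be seen directly from the standard conversion formula; the numbers below are illustrative:

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale formula: m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

# Doubling the frequency adds progressively fewer mels as frequency rises,
# so the filterbank's triangular filters widen (in Hz) toward high frequencies.
for f in [500, 1000, 2000, 4000, 8000]:
    print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} mel")
```

Note that 1000 Hz maps to roughly 1000 mel, the scale's anchor point.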

Properties and Interpretation

The lower-order MFCCs capture the broad spectral shape (related to the vocal tract resonance characteristics), while higher-order coefficients capture finer spectral detail. The zeroth coefficient c_0 represents overall log energy, the first coefficient c_1 correlates with spectral tilt (the balance between low and high frequencies), and coefficients c_2 through c_4 are most correlated with vowel identity through their relationship to formant positions. Coefficients beyond c_12 are typically discarded as they capture increasingly fine spectral detail that is more speaker- and noise-specific than linguistically informative.

MFCCs in the Neural Network Era

With the rise of deep learning, many speech systems have moved from MFCCs to log mel filterbank features (the step before DCT), because neural networks can learn their own decorrelation and do not benefit from the assumptions built into the DCT. End-to-end systems like wav2vec 2.0 go further by learning features directly from raw waveforms. Nevertheless, MFCCs remain the feature of choice for many production systems, lightweight applications, and tasks where computational efficiency is paramount. Their decades of empirical validation and well-understood properties make them a reliable baseline.

Dynamic features (deltas and delta-deltas) computed from MFCC trajectories capture the temporal evolution of the spectral envelope, encoding transitions between phonemes that are at least as discriminative as the static features themselves. The standard 39-dimensional MFCC feature vector (13 static + 13 delta + 13 delta-delta) has been the default input to speech recognition systems from the GMM-HMM era through the early DNN period.
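Delta features are commonly computed with a regression formula over a window of neighbouring frames, d_t = ∑_{n=1}^{N} n(c_{t+n} − c_{t−n}) / (2∑_{n=1}^{N} n²), typically with N = 2. A minimal NumPy sketch (the function name `deltas` and edge-padding choice are assumptions of this example):

```python
import numpy as np

def deltas(feats, N=2):
    # feats: (n_frames, n_ceps). Edge frames are replicated so the
    # regression window stays inside the utterance.
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(feats)
    for n in range(1, N + 1):
        out += n * (padded[N + n:N + n + len(feats)]
                    - padded[N - n:N - n + len(feats)])
    return out / denom
```

Stacking statics, deltas, and delta-deltas with `np.hstack([c, deltas(c), deltas(deltas(c))])` produces the standard 39-dimensional vector.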

Robustness enhancements to MFCCs include cepstral mean and variance normalization (CMVN) to remove channel effects, RASTA filtering to suppress slowly varying convolutional noise, and feature-space maximum likelihood linear regression (fMLLR) for speaker normalization. These techniques, layered on top of the basic MFCC computation, have kept the feature relevant and competitive across decades of advancing speech technology.
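Of these, CMVN is the simplest: each coefficient is normalized to zero mean and unit variance over the frames of an utterance, which cancels stationary convolutional channel effects (a constant additive offset in the log-cepstral domain). A minimal per-utterance sketch, assuming a (frames × coefficients) feature matrix:

```python
import numpy as np

def cmvn(feats, eps=1e-10):
    # Per-utterance cepstral mean and variance normalization:
    # zero mean and unit variance for each coefficient across time.
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / (std + eps)
```

Production systems often use sliding-window or online variants instead of whole-utterance statistics to keep latency low.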


References

  1. Davis, S. B., & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357–366. doi:10.1109/TASSP.1980.1163420
  2. Zheng, F., Zhang, G., & Song, Z. (2001). Comparison of different implementations of MFCC. Journal of Computer Science and Technology, 16(6), 582–589. doi:10.1007/BF02943243
  3. Huang, X., Acero, A., & Hon, H.-W. (2001). Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice Hall.
