Acoustic Features

Acoustic features are numerical representations extracted from the speech signal that capture phonetically relevant information while discarding speaker-specific and channel-specific variability.

x_t = Extract(s[n], window_t)

Acoustic feature extraction is the process of transforming a raw audio waveform into a sequence of compact numerical vectors that capture the linguistically and phonetically relevant properties of the speech signal. Good acoustic features emphasize information that distinguishes different speech sounds (phonemes, words) while being robust to variations in speaker identity, recording conditions, and background noise. The choice of acoustic features profoundly affects the performance of all downstream speech processing tasks, from recognition to synthesis to speaker identification.

Feature Extraction Pipeline

Standard feature extraction pipeline:

  1. Pre-emphasis: s'[n] = s[n] − α · s[n−1], α ≈ 0.97
  2. Framing: 25 ms windows, 10 ms shift (Hamming window)
  3. Spectrum: |FFT(frame)|² → power spectrum
  4. Filterbank: mel-spaced triangular filters → log energies
  5. Cepstrum: DCT of log filterbank → MFCCs
  6. Append delta (Δ) and delta-delta (ΔΔ) coefficients

The standard feature extraction pipeline begins with pre-emphasis filtering to boost the high-frequency energy attenuated by the glottal source and lip radiation characteristics. The signal is then divided into overlapping frames (typically 25 ms wide with a 10 ms shift), each multiplied by a Hamming window to reduce spectral leakage. The power spectrum of each frame is computed via the FFT, and a bank of triangular filters spaced according to the mel scale integrates energy in frequency bands that approximate human auditory resolution. Finally, a discrete cosine transform (DCT) of the log filterbank energies yields the mel-frequency cepstral coefficients (MFCCs), whose lowest 12–13 coefficients compactly describe the spectral envelope.
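
The first three stages can be sketched in numpy. The window and hop sizes follow the text (25 ms / 10 ms); the 16 kHz sample rate, 512-point FFT, and synthetic sine input are illustrative assumptions:

```python
import numpy as np

def preemphasize(s, alpha=0.97):
    """Boost high frequencies: s'[n] = s[n] - alpha * s[n-1]."""
    return np.append(s[0], s[1:] - alpha * s[:-1])

def frame_signal(s, sr, win_ms=25, hop_ms=10):
    """Split the signal into overlapping Hamming-windowed frames."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(s) - win) // hop
    frames = np.stack([s[i * hop : i * hop + win] for i in range(n_frames)])
    return frames * np.hamming(win)

def power_spectrum(frames, n_fft=512):
    """|FFT(frame)|^2, keeping only the non-negative frequency bins."""
    return np.abs(np.fft.rfft(frames, n_fft)) ** 2

# Illustrative input: 1 s of a 440 Hz tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
s = np.sin(2 * np.pi * 440 * t)
frames = frame_signal(preemphasize(s), sr)   # shape (98, 400)
pspec = power_spectrum(frames)               # shape (98, 257)
```

With a 400-sample window and 160-sample hop, one second of audio yields 98 frames; `rfft` with 512 points produces 257 non-negative frequency bins per frame.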

Beyond MFCCs: Modern Representations

While MFCCs remain widely used, several alternative feature representations have been developed for specific applications. Perceptual linear prediction (PLP) features apply psychoacoustic principles more aggressively, including equal-loudness weighting and cube-root intensity compression. Gammatone filterbank features model the auditory periphery more closely. Filterbank features (log-mel spectrograms without the final DCT) are preferred for neural network systems, because the decorrelation the DCT provides is unnecessary when the network can learn its own feature transformations.
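
The filterbank and cepstral stages can be sketched as follows. The mel formulas below (2595 · log10(1 + f/700) and its inverse) are the common HTK-style convention; the filter count (26), FFT size (512), and the random stand-in power spectrum are illustrative assumptions. Keeping `log_mel` gives the filterbank features favored by neural systems; applying the DCT gives MFCCs:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters with centers equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):                      # rising slope
            fb[i - 1, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):                      # falling slope
            fb[i - 1, j] = (r - j) / max(r - c, 1)
    return fb

def dct2(x):
    """Type-II DCT along the last axis (orthonormal scaling omitted)."""
    N = x.shape[-1]
    basis = np.cos(np.pi * np.outer(np.arange(N), np.arange(N) + 0.5) / N)
    return x @ basis.T

fb = mel_filterbank(26, 512, 16000)          # (26, 257)
pspec = np.random.rand(98, 257)              # stand-in power spectrum
log_mel = np.log(pspec @ fb.T + 1e-10)       # filterbank features
mfcc = dct2(log_mel)[:, :13]                 # keep first 13 cepstra
```

The small floor (`1e-10`) guards against taking the log of a zero filter energy.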

Learned Features and Self-Supervised Representations

The recent trend toward end-to-end speech processing has challenged the traditional feature extraction paradigm. Models like wav2vec 2.0 and HuBERT learn acoustic representations directly from raw waveforms through self-supervised pre-training, discovering features that outperform hand-crafted ones on many tasks. These learned features capture hierarchical information: lower layers represent acoustic properties similar to traditional features, while higher layers encode increasingly abstract linguistic information. Despite this trend, understanding traditional acoustic features remains essential for interpreting model behavior and designing efficient systems.

Feature normalization is critical for robust speech processing. Cepstral mean normalization (CMN) subtracts the mean cepstral vector computed over an utterance or sliding window, removing linear channel effects. Cepstral variance normalization (CVN) additionally normalizes the variance. These simple techniques dramatically improve robustness to microphone and channel variations, and are applied as standard preprocessing in virtually all speech systems.
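
A minimal per-utterance CMN/CVN sketch; the feature matrix here is a random stand-in for a real sequence of cepstral vectors:

```python
import numpy as np

def cmvn(feats, variance=True, eps=1e-8):
    """Cepstral mean (and optionally variance) normalization per utterance.

    feats: (n_frames, n_dims) feature matrix; normalization is per dimension.
    """
    out = feats - feats.mean(axis=0, keepdims=True)   # CMN: remove channel offset
    if variance:
        out = out / (feats.std(axis=0, keepdims=True) + eps)  # CVN: unit variance
    return out

feats = np.random.randn(100, 13) * 3.0 + 5.0  # stand-in MFCC matrix with offset
norm = cmvn(feats)                            # zero mean, unit variance per dim
```

A sliding-window variant would compute the statistics over a local window of frames instead of the whole utterance, which is what online systems use.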

The temporal evolution of acoustic features carries as much information as their static values. Delta (velocity) coefficients capture the first derivative of the feature trajectory, and delta-delta (acceleration) coefficients capture the second derivative, approximated by regression over a local window of frames. Together with the static features, these dynamic coefficients form the standard feature vector input to acoustic models, typically yielding 39 dimensions (13 MFCCs + 13 deltas + 13 delta-deltas).
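
The regression-window delta computation can be sketched as follows; the ±2-frame window (N = 2) is a common choice, assumed here rather than specified by the text:

```python
import numpy as np

def deltas(feats, N=2):
    """First-order regression deltas over a +/-N frame window:
    d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2)."""
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feats, ((N, N), (0, 0)), mode='edge')  # replicate edges
    T = len(feats)
    d = np.zeros_like(feats)
    for n in range(1, N + 1):
        d += n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
    return d / denom

static = np.random.randn(100, 13)     # stand-in 13-dim MFCCs
d1 = deltas(static)                   # velocity
d2 = deltas(d1)                       # acceleration
full = np.hstack([static, d1, d2])    # standard 39-dim feature vector
```

For a feature trajectory that is a straight line, the interior delta values recover the slope exactly, which is the sense in which the regression approximates the first derivative.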

References

  1. Davis, S. B., & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357–366. doi:10.1109/TASSP.1980.1163420
  2. Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America, 87(4), 1738–1752. doi:10.1121/1.399423
  3. Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Proc. NeurIPS, 33, 12449–12460.
  4. Huang, X., Acero, A., & Hon, H.-W. (2001). Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice Hall.
