
Acoustic Modeling

Acoustic modeling maps sequences of spectral feature vectors to linguistic units such as phonemes or subword tokens, forming the core component that scores how well a speech signal matches a hypothesized transcription.

P(O | W) = P(o_1, ..., o_T | q_1, ..., q_N), where O = (o_1, ..., o_T) is the sequence of T acoustic observation vectors and q_1, ..., q_N is the HMM state sequence corresponding to the hypothesized word string W.

The acoustic model is the component of a speech recognition system responsible for estimating the probability that a given sequence of acoustic observations was produced by a particular sequence of linguistic units. In the traditional HMM-based framework, each word is represented as a concatenation of phoneme HMMs, and the acoustic model scores how well the observed spectral features align with the expected emissions of these HMM states. The quality of the acoustic model is typically the largest single determinant of overall ASR accuracy, although the language model and lexicon also contribute.
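In decoding, the acoustic score P(O|W) is combined with a language-model score P(W) and the recognizer picks the highest-scoring word sequence. A minimal sketch in the log domain, using made-up scores for two hypothetical transcriptions (the hypothesis strings, scores, and the language-model weight are all illustrative assumptions, not values from any real system):

```python
# Hypothetical log-domain scores for two candidate transcriptions.
# In a real decoder these come from the acoustic model and language model.
hypotheses = {
    "recognize speech": {"log_p_o_given_w": -42.0, "log_p_w": -3.2},
    "wreck a nice beach": {"log_p_o_given_w": -41.5, "log_p_w": -7.9},
}

LM_WEIGHT = 10.0  # language-model scale factor, a common practical heuristic

def total_score(scores):
    # argmax_W P(O|W) * P(W)^lambda, computed as a sum of log-probabilities
    return scores["log_p_o_given_w"] + LM_WEIGHT * scores["log_p_w"]

best = max(hypotheses, key=lambda w: total_score(hypotheses[w]))
print(best)  # the acoustically slightly worse hypothesis wins on LM score here? No:
# "recognize speech" scores -74.0 vs -120.5, so it is selected.
```

The language-model weight compensates for the mismatched dynamic ranges of acoustic and language-model log-probabilities; it is normally tuned on held-out data.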

GMM-HMM Acoustic Models

GMM Emission Probability b_j(o_t) = ∑_{m=1}^{M} c_{jm} · N(o_t; μ_{jm}, Σ_{jm})

j: HMM state index
M: number of Gaussian components per state
c_{jm}: mixture weight for component m in state j
μ_{jm}, Σ_{jm}: mean and covariance of component m

In GMM-HMM systems, each HMM state emission distribution is modeled as a mixture of multivariate Gaussians over acoustic feature vectors. Training proceeds iteratively using the Baum-Welch (EM) algorithm: the E-step computes state occupation probabilities given current parameters, and the M-step re-estimates the Gaussian means, covariances, and mixture weights. Context-dependent triphone models capture coarticulation by conditioning each phone model on its left and right neighbors, and phonetic decision trees cluster rare triphone states to share parameters.
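The emission probability b_j(o_t) above can be evaluated directly. A minimal sketch for one state, assuming diagonal covariance matrices (the common choice in GMM-HMM systems) and using log-sum-exp for numerical stability:

```python
import numpy as np

def gmm_emission_log_prob(o_t, weights, means, covs):
    """Log of b_j(o_t) = sum_m c_{jm} * N(o_t; mu_{jm}, Sigma_{jm}).

    Diagonal covariances are assumed.
    o_t: (D,) feature vector; weights: (M,); means, covs: (M, D).
    """
    D = o_t.shape[0]
    # Per-component diagonal-Gaussian log densities
    log_det = np.sum(np.log(covs), axis=1)                       # (M,)
    mahal = np.sum((o_t - means) ** 2 / covs, axis=1)            # (M,)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + log_det + mahal)  # (M,)
    # Log-sum-exp over mixture components avoids underflow
    a = np.log(weights) + log_norm
    m = a.max()
    return m + np.log(np.exp(a - m).sum())
```

With a single unit-variance component centered at the observation, this reduces to the standard Gaussian density, which is a useful sanity check when implementing Baum-Welch re-estimation around it.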

DNN-HMM Hybrid Models

Deep neural network acoustic models replace the GMM emission distributions with a neural network that computes posterior probabilities P(s|o_t) for each HMM state s given the observation vector o_t. These posteriors are divided by the state priors P(s) to obtain scaled likelihoods suitable for the HMM framework. DNNs trained on thousands of hours of transcribed speech consistently outperform GMMs because they can model complex, non-linear relationships between acoustic features and phonetic categories without the Gaussian assumption.
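The posterior-to-likelihood conversion described above is a one-line operation in the log domain. A sketch (array shapes are assumptions for illustration):

```python
import numpy as np

def posteriors_to_scaled_log_likelihoods(log_posteriors, log_priors):
    """Convert DNN state posteriors P(s|o_t) into scaled likelihoods.

    p(o_t|s) is proportional to P(s|o_t) / P(s); the per-frame constant
    p(o_t) is dropped because it does not affect the argmax over word
    sequences during decoding.
    log_posteriors: (T, S) frame-level log P(s|o_t) from the network
    log_priors:     (S,)  log P(s), usually estimated from alignment counts
    """
    return log_posteriors - log_priors
```

Dividing by the priors matters: states that are frequent in the training alignments would otherwise dominate the Viterbi search regardless of the acoustics.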

Speaker Adaptation Techniques

Acoustic models trained on pooled data from many speakers inevitably compromise on speaker-specific characteristics. Speaker adaptation techniques adjust the model to a particular speaker using a small amount of enrollment data. Maximum Likelihood Linear Regression (MLLR) applies affine transforms to the Gaussian means, while feature-space MLLR (fMLLR) transforms the input features. For neural models, techniques such as i-vector or x-vector speaker embeddings are appended to the input features, allowing the network to condition its predictions on speaker identity.
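The fMLLR case is the simplest to illustrate: a single affine transform, estimated once per speaker, is applied to every frame. A sketch with placeholder transform parameters (in practice A and b are estimated by maximum likelihood on the speaker's enrollment data):

```python
import numpy as np

def apply_fmllr(features, A, b):
    """Apply a speaker-specific affine feature transform (fMLLR):
    o_t' = A @ o_t + b, with the same (A, b) for every frame of a speaker.

    features: (T, D) feature frames; A: (D, D); b: (D,)
    """
    return features @ A.T + b
```

Because the transform acts on the features rather than the model, fMLLR needs no change to the acoustic model itself, which is why it also works in front of a DNN.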

Modern acoustic models use convolutional layers to capture local spectral patterns, bidirectional LSTMs or conformers to model temporal context, and connectionist temporal classification or attention mechanisms to handle variable-length alignment. The shift from frame-level classification to sequence-level training criteria such as lattice-free MMI (maximum mutual information) has further improved accuracy by optimizing the model to discriminate between competing word sequences rather than individual frames.
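As a small illustration of the CTC side of this, greedy decoding collapses the framewise best-path labeling by merging consecutive repeats and then removing the blank symbol. This is only a sketch of the collapse rule; real systems use beam search over the full label lattice:

```python
def ctc_greedy_collapse(frame_labels, blank="-"):
    """Collapse a framewise labeling into an output label sequence:
    merge consecutive identical labels, then drop the blank symbol."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# A blank between two identical labels preserves the doubled letter:
print(ctc_greedy_collapse(["-", "h", "h", "-", "e", "l", "-", "l", "o", "o"]))
# -> ['h', 'e', 'l', 'l', 'o']
```

The blank symbol is what lets CTC distinguish a genuinely repeated output label from one label spread over several frames.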

The evolution of acoustic modeling reflects a broader trend in machine learning: from carefully engineered generative models (GMM-HMM) to powerful discriminative models (DNN-HMM) and ultimately to end-to-end systems that learn the entire mapping from acoustics to text without explicit phoneme or state definitions.

MFCC Computation

MFCC features are computed from a frame's power spectrum (magnitudes at linearly spaced frequency bins) by warping the frequency axis onto the mel scale, summing the energy under triangular mel filterbanks, and applying a discrete cosine transform to the log filterbank energies.
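This pipeline can be sketched for a single frame as follows. The sample rate, filter count, and coefficient count are illustrative defaults, not values prescribed by any particular toolkit:

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale formula
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_power_spectrum(power, sample_rate=16000, n_filters=26, n_ceps=13):
    """Sketch of the classic MFCC pipeline for one frame:
    mel-spaced triangular filterbank -> log energies -> DCT-II.
    `power` holds magnitudes at linearly spaced bins from 0 to Nyquist.
    """
    power = np.asarray(power, dtype=float)
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, len(power))
    # Filter edge frequencies, equally spaced on the mel scale
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    edges = mel_to_hz(mel_edges)
    # Triangular filterbank energies
    energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (bin_freqs - lo) / (center - lo)
        down = (hi - bin_freqs) / (hi - center)
        weights = np.clip(np.minimum(up, down), 0.0, None)
        energies[i] = weights @ power + 1e-10  # floor avoids log(0)
    log_e = np.log(energies)
    # DCT-II of the log energies yields the cepstral coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return basis @ log_e
```

The DCT decorrelates the log filterbank energies, which is what made diagonal-covariance GMMs a reasonable emission model for MFCC features in the first place.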


