
Parametric Synthesis

Parametric speech synthesis generates speech by predicting vocoder parameters from linguistic features with a statistical model; it offers compact models and flexible control, at the cost of some naturalness relative to concatenative approaches.

o_t = f(l_t; θ) + ε_t

Statistical parametric speech synthesis (SPSS) generates speech by training a model to predict acoustic parameters — such as spectral envelope, fundamental frequency, and aperiodicity — from linguistic features derived from the input text. Unlike concatenative synthesis, which stores and retrieves actual speech segments, parametric synthesis generates every aspect of the speech signal from a compact statistical model. This approach, pioneered by Tokuda and colleagues using HMMs in the early 2000s, dominated TTS research for over a decade before being succeeded by neural methods.
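The linguistic features that drive the mapping can be pictured as a fixed-length numeric vector per phoneme or frame. The sketch below is a toy illustration, not a real front end: the phoneme inventory and feature layout are hypothetical, and production systems use much richer context labels (quinphone identity, syllable/word/phrase positions, POS, and more).

```python
# Toy frame-level linguistic feature construction (hypothetical phoneme
# set and feature layout; real systems use rich full-context labels).

PHONEMES = ["sil", "a", "e", "h", "l", "o"]  # toy inventory

def linguistic_features(phoneme, pos_in_syllable, syllable_len,
                        pos_in_word, word_len, stressed):
    """One-hot phoneme identity plus normalized positional features."""
    one_hot = [1.0 if p == phoneme else 0.0 for p in PHONEMES]
    positional = [
        pos_in_syllable / max(syllable_len, 1),  # relative position in syllable
        pos_in_word / max(word_len, 1),          # relative position in word
        1.0 if stressed else 0.0,                # lexical stress flag
    ]
    return one_hot + positional

feats = linguistic_features("h", 0, 2, 0, 4, True)
print(len(feats))  # 6 one-hot dims + 3 positional dims = 9
```

A vector like this is what the statistical model consumes as l_t in the equation above.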

HMM-Based Parametric Synthesis

Training: align speech with context-dependent linguistic labels and estimate HMM parameters.
Synthesis: derive a state sequence from the input text and generate vocoder parameters from it.

o_t ~ N(μ_{q(t)}, Σ_{q(t)}) for state q(t)
F0_t ~ N(μ^{f0}_{q(t)}, (σ^{f0}_{q(t)})^2) with a per-frame voiced/unvoiced decision

Parameter generation: o* = argmax_o P(o | q, λ) subject to dynamic feature constraints

In HMM-based synthesis, the speech signal is parameterized as a sequence of vocoder features: mel-cepstral coefficients for the spectral envelope, log F0 for pitch, band aperiodicities for noise characteristics, and a voiced/unvoiced flag. These parameters, along with their delta and delta-delta derivatives, are modeled by context-dependent HMM states. At synthesis time, a state sequence is derived from the input text, and the maximum-likelihood parameter generation algorithm produces a smooth trajectory of vocoder parameters that respects the dynamic feature constraints.
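The effect of the dynamic feature constraints can be seen in a minimal numpy sketch of maximum-likelihood parameter generation for a single one-dimensional track. It assumes diagonal variances and a central-difference delta window; real systems handle full delta/delta-delta windows and multi-dimensional streams with banded solvers.

```python
import numpy as np

def mlpg_1d(mu, var, mu_d, var_d):
    """Maximum-likelihood parameter generation for one parameter track.

    mu, var     : per-frame means/variances of the static feature
    mu_d, var_d : per-frame means/variances of the delta feature
    Solves (W^T Sigma^-1 W) c = W^T Sigma^-1 mu_aug, where W stacks the
    identity window and a central-difference delta window.
    """
    T = len(mu)
    I = np.eye(T)
    # Central-difference delta window: delta_t = (c_{t+1} - c_{t-1}) / 2
    D = (np.eye(T, k=1) - np.eye(T, k=-1)) / 2.0
    W = np.vstack([I, D])                                  # (2T, T)
    prec = np.diag(np.concatenate([1.0 / var, 1.0 / var_d]))
    mu_aug = np.concatenate([mu, mu_d])
    A = W.T @ prec @ W
    b = W.T @ prec @ mu_aug
    return np.linalg.solve(A, b)                           # smooth static trajectory

# A step change in the static means; delta means of zero favour smoothness,
# so the generated trajectory ramps instead of jumping between states.
mu = np.array([0.0, 0.0, 1.0, 1.0])
traj = mlpg_1d(mu, np.full(4, 0.1), np.zeros(4), np.full(4, 0.01))
```

With tight delta variances the solution smooths the step in the static means into a gradual ramp, which is exactly why generated trajectories avoid the frame-to-frame discontinuities that per-state means alone would produce.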

DNN-Based Parametric Synthesis

Deep neural networks replaced HMMs as the mapping function from linguistic features to acoustic parameters around 2013, yielding significant quality improvements. DNN-based parametric synthesis frames the problem as regression: given a vector of linguistic features (phoneme identity, position in syllable/word/phrase, stress, POS tag), predict the corresponding acoustic parameters. LSTMs and bidirectional RNNs further improved quality by modeling the temporal dependencies that frame-independent DNNs miss.
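The regression view can be made concrete with a tiny two-layer network trained by gradient descent on synthetic frame data. This is a from-scratch numpy sketch with made-up dimensions and data, standing in for a real DNN trained on aligned linguistic/acoustic feature pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): linguistic features in, acoustic parameters out.
D_LING, D_HID, D_AC = 9, 32, 4

W1 = rng.normal(0, 0.1, (D_LING, D_HID)); b1 = np.zeros(D_HID)
W2 = rng.normal(0, 0.1, (D_HID, D_AC));  b2 = np.zeros(D_AC)

def forward(X):
    H = np.tanh(X @ W1 + b1)          # hidden representation
    return H, H @ W2 + b2             # predicted acoustic parameters

# Synthetic frame-level data standing in for an aligned speech corpus.
X = rng.normal(size=(256, D_LING))
Y = rng.normal(size=(256, D_AC)) * 0.1 + X[:, :D_AC]  # learnable structure + noise

lr = 0.05
losses = []
for step in range(200):
    H, pred = forward(X)
    err = pred - Y
    losses.append(float(np.mean(err ** 2)))
    # Backpropagation through both layers for the mean-squared-error loss.
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
```

Frame-independent prediction like this is exactly what LSTMs and bidirectional RNNs improved on, by conditioning each frame's output on its temporal context.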

The Vocoder Bottleneck

A fundamental limitation of parametric synthesis is the vocoder, which reconstructs the waveform from predicted parameters. Traditional vocoders like STRAIGHT and WORLD produce speech that sounds "buzzy" or "muffled" compared to natural speech, even when the predicted parameters are accurate. This vocoder degradation accounts for much of the quality gap between parametric and concatenative synthesis. Neural vocoders such as WaveNet and WaveRNN dramatically narrowed this gap, leading to the neural TTS paradigm where the distinction between parametric and waveform-level synthesis blurs.
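One source of the "buzzy" character is the classical excitation model: a perfectly periodic impulse train for voiced frames and white noise for unvoiced ones, hard-switched per frame. The sketch below builds only that excitation signal (a real vocoder then filters it with the predicted spectral envelope); mixed excitation with band aperiodicities, as in STRAIGHT and WORLD, softens this hard switch.

```python
import numpy as np

def pulse_noise_excitation(f0, fs, frame_len):
    """Per-frame excitation: impulse train when voiced (f0 > 0), noise otherwise.

    Toy illustration of the classical source model; f0 is one value per frame.
    """
    rng = np.random.default_rng(0)
    out = []
    phase = 0.0  # samples until the next pulse, carried across voiced frames
    for f in f0:
        frame = np.zeros(frame_len)
        if f > 0:
            period = fs / f           # pitch period in samples
            t = phase
            while t < frame_len:
                frame[int(t)] = 1.0   # place a glottal pulse
                t += period
            phase = t - frame_len     # carry pulse phase into the next frame
        else:
            frame = rng.normal(0, 0.3, frame_len)  # unvoiced: white noise
            phase = 0.0
        out.append(frame)
    return np.concatenate(out)

# Unvoiced, two voiced frames at 120 Hz, unvoiced (5 ms frames at 16 kHz).
exc = pulse_noise_excitation([0, 120, 120, 0], fs=16000, frame_len=80)
```

The abrupt transitions between deterministic pulses and stochastic noise in this signal are audible after filtering, which is part of the quality gap neural vocoders closed by modeling the waveform directly.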

Parametric synthesis offers several advantages over concatenative approaches: the model is compact (megabytes versus gigabytes), it generalizes to unseen phonetic and prosodic contexts, and it enables flexible control over speaking style, emotion, and speaker identity through model adaptation or interpolation. Speaker adaptation techniques allow a model trained on one speaker to be adapted to a new speaker using as little as a few minutes of data, a capability that is much harder to achieve with concatenative systems.
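Model interpolation is mechanically simple in the parametric setting: because a voice is a set of distribution parameters rather than stored waveforms, two voices can be blended by interpolating their per-state statistics. A toy sketch with hypothetical two-state, two-dimensional speaker models:

```python
import numpy as np

def interpolate_speaker_means(means_a, means_b, alpha):
    """Linearly interpolate per-state mean vectors between two speaker models.

    alpha = 1.0 reproduces speaker A, 0.0 speaker B; intermediate values
    yield blended voice characteristics. (Toy models; real systems also
    interpolate variances and duration statistics.)
    """
    return {s: alpha * means_a[s] + (1 - alpha) * means_b[s] for s in means_a}

# Hypothetical per-state spectral means for two speakers.
speaker_a = {"s1": np.array([1.0, 2.0]), "s2": np.array([0.5, 0.0])}
speaker_b = {"s1": np.array([3.0, 0.0]), "s2": np.array([1.5, 1.0])}
blend = interpolate_speaker_means(speaker_a, speaker_b, alpha=0.5)
```

The same parameter-space view underlies speaker adaptation: transforms estimated from a few minutes of target-speaker data shift the model's distributions rather than requiring a new recorded inventory.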

While pure parametric synthesis has been largely superseded by end-to-end neural TTS, its legacy persists in the overall system design. Modern neural TTS systems still decompose the problem into text-to-spectrogram and spectrogram-to-waveform stages, echoing the parametric synthesis philosophy of separating linguistic and acoustic modeling from waveform generation.

References

  1. Tokuda, K., Nankaku, Y., Toda, T., Zen, H., Yamagishi, J., & Oura, K. (2013). Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5), 1234–1252. doi:10.1109/JPROC.2013.2251852
  2. Zen, H., Senior, A., & Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. Proc. ICASSP, 7962–7966. doi:10.1109/ICASSP.2013.6639215
  3. Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039–1064. doi:10.1016/j.specom.2009.04.004
  4. Kawahara, H., Masuda-Katsuse, I., & de Cheveigné, A. (1999). Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction. Speech Communication, 27(3–4), 187–207. doi:10.1016/S0167-6393(98)00085-5
