Prosody Modeling

Prosody modeling uses computational frameworks to capture the suprasegmental features of speech (intonation, stress, rhythm, and phrasing). These features are essential both for natural-sounding speech synthesis and for extracting meaning beyond the segmental content of utterances.

F₀(t), duration(t), energy(t) → prosodic structure

Prosody encompasses the suprasegmental aspects of speech: the pitch contours (intonation), timing patterns (rhythm and duration), prominence patterns (stress, cued by loudness together with pitch and duration), and phrasing that overlay the sequence of individual speech sounds. Prosody conveys crucial linguistic and paralinguistic information: the difference between a statement and a question, the location of focus and emphasis, the boundaries between syntactic constituents, and the speaker's emotional state. Computational prosody modeling aims to predict and generate these patterns for speech synthesis, and to analyze them for speech understanding.
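As a rough illustration of how two of these acoustic correlates can be measured, the sketch below estimates F0 and energy for a single frame. It assumes a plain autocorrelation peak-picker and RMS amplitude; the function names, the 75-400 Hz search range, and the synthetic test frame are illustrative assumptions, and practical pitch trackers add voicing decisions and smoothing that are omitted here.

```python
import math

def rms_energy(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=400.0):
    """Estimate F0 (Hz) as the lag with the largest autocorrelation.

    Only lags corresponding to f0_min..f0_max are searched. No voicing
    decision is made, so the result is meaningless for unvoiced frames.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic 40 ms "voiced" frame: a 200 Hz sine at 16 kHz, amplitude 0.5.
SAMPLE_RATE = 16000
frame = [0.5 * math.sin(2 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(640)]
f0 = estimate_f0(frame, SAMPLE_RATE)
energy = rms_energy(frame)
```

Running this over successive frames of an utterance would give the F0(t) and energy(t) tracks; durations come from a separate segmentation step that is not shown.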

Prosodic Representations

ToBI Prosodic Annotation

Tone tier: pitch accents and boundary tones
H* = high pitch accent
L+H* = rising accent (contrastive focus)
L-L% = low phrase accent + low boundary tone (statement)
H-H% = high phrase accent + high boundary tone (yes/no question)

Break index tier: prosodic boundary strength (0-4)
0 = clitic boundary, 1 = word boundary
2 = mismatch between perceived disjuncture and tonal marking
3 = intermediate phrase, 4 = intonational phrase

The ToBI (Tones and Break Indices) framework provides a standard annotation system for prosodic structure in American English, with adaptations for many other languages. ToBI labels pitch accents (tonal events associated with prominent syllables), phrase accents (tones at the ends of intermediate phrases), and boundary tones (tones at the ends of intonational phrases) using a small inventory of tonal labels. Automatic ToBI labeling is an active research area, with neural models reported to reach F1 scores of roughly 80-85% on pitch accent detection and boundary tone classification.
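One way to hold such annotations in code is sketched below: each word carries a list of tone-tier labels and a break index, and a small parser splits a label into pitch accent, phrase accent, and boundary tone. The function names, the restricted label inventory, and the example annotation are illustrative assumptions, not part of any official ToBI tooling.

```python
import re

# Subset of the English ToBI tone inventory (downstepped variants omitted).
PITCH_ACCENTS = {"H*", "L*", "L+H*", "L*+H", "H+!H*"}
# A phrase accent such as "L-", optionally fused with a boundary tone such as "L%".
EDGE_TONES = re.compile(r"^(?P<phrase>[HL]-)(?P<boundary>[HL]%)?$")

def parse_tone(label):
    """Split one tone-tier label into its component tonal events."""
    if label in PITCH_ACCENTS:
        return {"pitch_accent": label}
    match = EDGE_TONES.match(label)
    if match is None:
        raise ValueError(f"unrecognised tone label: {label!r}")
    events = {"phrase_accent": match.group("phrase")}
    if match.group("boundary"):
        events["boundary_tone"] = match.group("boundary")
    return events

def final_edge(utterance):
    """Return the parsed edge tones on the last word, or None if it has none."""
    _, tones, _ = utterance[-1]
    for label in reversed(tones):
        events = parse_tone(label)
        if "boundary_tone" in events:
            return events
    return None

# Hypothetical annotation: (word, tone labels, break index after the word).
utterance = [
    ("Marianna", ["H*"], 1),
    ("made", [], 1),
    ("the", [], 1),
    ("marmalade", ["H*", "L-L%"], 4),
]
edge = final_edge(utterance)
```

Here the final `L-L%` marks a statement-like ending; replacing it with `H-H%` would give the yes/no question pattern listed above.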

Prosody in Speech Synthesis

Generating natural-sounding prosody is one of the greatest challenges in text-to-speech synthesis. The prosody of an utterance depends on syntactic structure, information structure (topic, focus, given/new), discourse context, and pragmatic intent — factors that are difficult to predict from text alone. Rule-based systems use syntactic parsing and hand-crafted rules to predict prosodic phrasing and pitch accents. Statistical approaches train models on large prosodically annotated speech corpora to predict F0 contours, phone durations, and energy profiles. Modern neural TTS systems like Tacotron and FastSpeech learn to generate prosodic features end-to-end from text, but still struggle with paragraph-level coherence and pragmatically appropriate emphasis.
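To make the rule-based approach concrete, here is a deliberately tiny sketch that uses only two hand-written rules: punctuation determines the break index after a word, and any word outside a short function-word list receives a pitch accent. The word list, the rules, and the function name are illustrative assumptions; real systems rely on syntactic parses and much richer rule sets.

```python
import re

# Small, illustrative function-word list; real systems use part-of-speech tags.
FUNCTION_WORDS = {
    "a", "an", "the", "of", "to", "in", "on", "at", "by", "for", "with",
    "after", "before", "and", "or", "but", "is", "was", "are", "were", "that",
}
STRONG_PUNCT = {".", "?", "!"}   # assumed to end an intonational phrase
WEAK_PUNCT = {",", ";", ":"}     # assumed to end an intermediate phrase

def predict_prosody(text):
    """Return (word, accented, break_index) triples from two toy rules."""
    tokens = re.findall(r"[A-Za-z']+|[.,;:?!]", text)
    result = []
    for i, token in enumerate(tokens):
        if token in STRONG_PUNCT or token in WEAK_PUNCT:
            continue
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        if following in STRONG_PUNCT:
            break_index = 4
        elif following in WEAK_PUNCT:
            break_index = 3
        else:
            break_index = 1
        accented = token.lower() not in FUNCTION_WORDS
        result.append((token, accented, break_index))
    return result

prediction = predict_prosody("After the storm, the river flooded the town.")
```

A sketch like this accents every content word, which is exactly the kind of over-generation that information structure and discourse context are needed to correct.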

The Fujisaki Model of F0

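The Fujisaki model (1983) represents the logarithm of the fundamental frequency (F0) contour as the sum of three terms: a constant, speaker-dependent baseline; phrase components (slow, global movements reflecting the overall intonation contour); and accent components (faster, local movements reflecting lexical stress and pitch accents). Both kinds of component are outputs of critically damped second-order linear systems, but they are driven differently. Phrase commands are impulses, so each phrase component is an impulse response, whereas accent commands are rectangular (stepwise) pulses, so each accent component is a step response that is switched on at the command onset and off at its offset. A phrase command is parameterized by its onset time and magnitude, and an accent command by its onset time, offset time, and amplitude; these parameters can be estimated from F0 data using optimization algorithms. Despite its simplicity, the Fujisaki model provides a compact and linguistically interpretable representation of intonation that has been applied across many languages.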
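The following is a minimal sketch of the model as it is commonly formulated: the contour is built in the log-F0 domain from a baseline, impulse-driven phrase components, and pulse-driven accent components with a ceiling. The function names, the default constants (alpha = 3/s, beta = 20/s, gamma = 0.9, values often quoted in the literature), and the example commands are assumptions chosen for illustration.

```python
import math

def phrase_response(t, alpha=3.0):
    """Gp(t): impulse response of the phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_response(t, beta=20.0, gamma=0.9):
    """Ga(t): step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(t, base_hz, phrase_cmds, accent_cmds):
    """F0 in Hz at time t (seconds).

    phrase_cmds: list of (onset_time, magnitude)
    accent_cmds: list of (onset_time, offset_time, amplitude)
    """
    log_f0 = math.log(base_hz)
    for onset, magnitude in phrase_cmds:
        log_f0 += magnitude * phrase_response(t - onset)
    for onset, offset, amplitude in accent_cmds:
        log_f0 += amplitude * (accent_response(t - onset) - accent_response(t - offset))
    return math.exp(log_f0)

# Hypothetical utterance: one phrase command at 0 s, one accent from 0.3 s to 0.6 s.
phrase_cmds = [(0.0, 0.5)]
accent_cmds = [(0.3, 0.6, 0.4)]
contour = [fujisaki_f0(i * 0.01, 100.0, phrase_cmds, accent_cmds) for i in range(200)]
```

Fitting the model to data runs this in reverse: an optimizer searches for the command timings and amplitudes whose generated contour best matches a measured F0 track.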

Prosody in Speech Understanding

Prosodic information contributes to speech understanding at multiple levels. Prosodic boundaries help disambiguate syntactically ambiguous sentences: "I saw the man with the telescope" receives different readings depending on where the phrase break falls. Pitch accents mark information focus and contrast, signaling which parts of an utterance are most relevant. Question intonation distinguishes between statements and questions in many languages. Emotional prosody conveys the speaker's affective state. Computational models that integrate prosodic features with linguistic content often outperform text-only models for tasks like dialogue act classification, sentiment analysis of spoken language, and turn-taking prediction.
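As one small example of a prosodic feature that could be combined with text features, the sketch below fits a least-squares line to the final voiced frames of an F0 track and labels the ending as rising, falling, or level. The 10 ms frame shift, the 20-frame window, the 20 Hz/s threshold, and the function names are arbitrary assumptions for illustration, and a rising ending is only a cue to question intonation, not a reliable classifier on its own.

```python
def final_f0_slope(f0_hz, frame_shift_s=0.01, tail_frames=20):
    """Least-squares slope (Hz/s) over the last voiced frames.

    Frames with F0 <= 0 are treated as unvoiced and skipped.
    """
    voiced = [(i * frame_shift_s, f) for i, f in enumerate(f0_hz) if f > 0]
    tail = voiced[-tail_frames:]
    if len(tail) < 2:
        raise ValueError("need at least two voiced frames")
    mean_t = sum(t for t, _ in tail) / len(tail)
    mean_f = sum(f for _, f in tail) / len(tail)
    numerator = sum((t - mean_t) * (f - mean_f) for t, f in tail)
    denominator = sum((t - mean_t) ** 2 for t, _ in tail)
    return numerator / denominator

def classify_final_contour(f0_hz, threshold_hz_per_s=20.0):
    """Label the utterance-final F0 movement as rising, falling, or level."""
    slope = final_f0_slope(f0_hz)
    if slope > threshold_hz_per_s:
        return "rising"
    if slope < -threshold_hz_per_s:
        return "falling"
    return "level"

# Hypothetical F0 tracks (Hz per 10 ms frame; 0 would mark an unvoiced frame).
rising_track = [120.0 + 2.0 * i for i in range(30)]
falling_track = [180.0 - 1.5 * i for i in range(30)]
labels = (classify_final_contour(rising_track), classify_final_contour(falling_track))
```

In a combined model, the slope (or the label) would simply be one more input alongside lexical features for a task such as dialogue act classification.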

Cross-linguistic variation in prosodic systems is substantial. Tone languages like Mandarin use pitch to distinguish lexical meaning, while intonation languages like English use pitch primarily for post-lexical functions. Languages traditionally classed as stress-timed (English, German) and syllable-timed (French, Spanish) differ in rhythmic organization. Computational prosody models must account for this typological diversity, and multilingual prosody modeling remains a significant challenge, particularly for low-resource languages where prosodically annotated corpora are scarce or unavailable.
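Rhythmic differences of this kind are often quantified with duration-variability metrics. The sketch below computes one such measure from the rhythm literature, the normalized Pairwise Variability Index (nPVI), over a sequence of interval durations. The metric is not named in the text above and is offered only as an illustration; the duration values are invented.

```python
def npvi(durations):
    """Normalized Pairwise Variability Index of successive durations.

    For each adjacent pair, the absolute difference is divided by the pair's
    mean; the average of these ratios is scaled by 100.
    """
    if len(durations) < 2:
        raise ValueError("need at least two durations")
    ratios = [
        abs(a - b) / ((a + b) / 2.0)
        for a, b in zip(durations, durations[1:])
    ]
    return 100.0 * sum(ratios) / len(ratios)

# Invented vowel durations in milliseconds.
even_rhythm = [100, 100, 100, 100, 100]     # no variability between neighbours
alternating_rhythm = [60, 180, 60, 180]     # strong long/short alternation
scores = (npvi(even_rhythm), npvi(alternating_rhythm))
```

Higher values indicate greater contrast between neighbouring intervals, the pattern usually associated with stress-timed rhythm, while values near zero indicate evenly timed intervals.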

References

  1. Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., ... & Hirschberg, J. (1992). ToBI: A standard for labeling English prosody. Proceedings of the 2nd International Conference on Spoken Language Processing, 867–870.
  2. Fujisaki, H. (1983). Dynamic characteristics of voice fundamental frequency in speech and singing. In P. F. MacNeilage (Ed.), The Production of Speech (pp. 39–55). Springer.
  3. Taylor, P. (2009). Text-to-Speech Synthesis. Cambridge University Press. doi:10.1017/CBO9780511816338