
Pitch Detection

Pitch detection algorithms estimate the fundamental frequency of voiced speech, a critical parameter for prosody analysis, speaker characterization, tonal language processing, and speech synthesis.

F0 = 1 / T_0 = f_s / τ*, where T_0 is the fundamental period in seconds, f_s is the sampling rate, and τ* is the estimated period in samples. At f_s = 16 kHz, for example, a best lag of τ* = 100 samples gives F0 = 16000 / 100 = 160 Hz.

Pitch detection, more precisely fundamental frequency (F0) estimation, is the task of determining the rate of vocal fold vibration from an acoustic speech signal. The fundamental frequency is the acoustic correlate of perceived pitch and is one of the most important prosodic features, conveying linguistic information (sentence type, focus, tone in tonal languages), paralinguistic information (emotion, attitude), and speaker characteristics (age, sex). Robust F0 estimation is essential for virtually every speech processing application, yet it remains a challenging problem due to the complex and quasi-periodic nature of the voice source.

Time-Domain and Frequency-Domain Methods

Autocorrelation-Based F0 Estimation

R(τ) = ∑_{n} s[n] · s[n + τ]

F0 = f_s / argmax_τ R(τ), τ ∈ [τ_min, τ_max]

Typical F0 ranges:
Adult male: 85–180 Hz
Adult female: 165–255 Hz
Children: 250–400 Hz

The autocorrelation method estimates F0 by finding the lag at which the signal is most similar to a delayed version of itself. For a perfectly periodic signal, the autocorrelation has peaks at integer multiples of the fundamental period. In practice, the signal is windowed, center-clipped or filtered to enhance periodicity, and the autocorrelation is computed over a range of lags corresponding to plausible F0 values. The dominant algorithm of this type, Praat's autocorrelation method, incorporates octave cost, transition cost, and voicing threshold parameters to produce smooth, robust F0 contours.
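The basic lag search described above can be sketched in a few lines of NumPy. This is a minimal illustration, not Praat's algorithm: it omits the octave and transition costs, the voicing decision, and any sub-sample interpolation of the peak, and all names and defaults here are invented for the example.

```python
import numpy as np

def autocorr_f0(frame, fs, f0_min=50.0, f0_max=500.0):
    """Estimate F0 of a single frame by autocorrelation peak picking.

    Minimal sketch: no voicing decision, no octave/transition costs,
    no sub-sample peak interpolation.
    """
    frame = frame - frame.mean()            # remove DC offset
    frame = frame * np.hanning(len(frame))  # taper frame edges
    # Autocorrelation for non-negative lags only
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Restrict the search to lags corresponding to plausible F0 values
    tau_min, tau_max = int(fs / f0_max), int(fs / f0_min)
    tau_star = tau_min + np.argmax(r[tau_min:tau_max])
    return fs / tau_star

# A 200 Hz sinusoid sampled at 16 kHz should come out near 200 Hz
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
f0_est = autocorr_f0(np.sin(2 * np.pi * 200.0 * t), fs)
```

Because the estimate is quantized to integer lags, accuracy at high F0 is limited; production systems refine the peak with parabolic interpolation.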

Modern Pitch Detection Algorithms

Beyond autocorrelation, numerous F0 estimation approaches have been developed. The YIN algorithm, proposed by de Cheveigné and Kawahara in 2002, uses a difference function (related to autocorrelation) with cumulative mean normalization, parabolic interpolation, and an absolute aperiodicity threshold that doubles as a voicing decision. CREPE (Convolutional Representation for Pitch Estimation) applies a deep convolutional neural network trained on audio with known F0, achieving state-of-the-art accuracy, particularly on noisy signals. The pYIN algorithm extends YIN with a probabilistic threshold distribution and Viterbi decoding over pitch candidates to improve temporal coherence.
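The core of YIN can be illustrated as follows. This is a deliberately reduced sketch of the published algorithm, keeping only the difference function, cumulative mean normalization, and absolute threshold; parabolic interpolation and the paper's full voicing logic are omitted, and the function name and defaults are chosen for the example.

```python
import numpy as np

def yin_f0(frame, fs, f0_min=50.0, f0_max=500.0, threshold=0.1):
    """Reduced sketch of YIN: difference function, cumulative mean
    normalization, absolute threshold. Omits parabolic interpolation
    and the paper's full voicing logic."""
    n = len(frame)
    tau_max = int(fs / f0_min)
    # Difference function d(tau) = sum_j (s[j] - s[j + tau])^2
    d = np.array([np.sum((frame[: n - tau] - frame[tau:]) ** 2)
                  for tau in range(tau_max + 1)])
    # Cumulative mean normalized difference d'(tau); d'(0) = 1 by definition
    cmnd = np.ones_like(d)
    running_sum = np.cumsum(d[1:])
    cmnd[1:] = d[1:] * np.arange(1, tau_max + 1) / np.maximum(running_sum, 1e-12)
    # Find the first dip under the threshold in the plausible lag range,
    # then walk down to the local minimum of that dip
    tau_min = int(fs / f0_max)
    for tau in range(tau_min, tau_max):
        if cmnd[tau] < threshold:
            while tau + 1 < tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return fs / tau
    return 0.0  # no dip found: treat the frame as unvoiced

# A 150 Hz sinusoid sampled at 16 kHz should come out near 150 Hz
fs = 16000
t = np.arange(1024) / fs
f0_est = yin_f0(np.sin(2 * np.pi * 150.0 * t), fs)
```

Taking the first dip below the threshold, rather than the global minimum, is what makes YIN resistant to subharmonic (halving) errors.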

Challenges in Pitch Detection

Several factors make F0 estimation difficult. Octave errors occur when the algorithm locks onto a harmonic (double the true F0) or a subharmonic (half the true F0), particularly in signals with weak fundamentals. Voicing detection — distinguishing voiced from unvoiced speech — is error-prone in breathy voice, whispered speech, and at voicing boundaries. Background noise and reverberation degrade periodicity cues. Creaky voice (vocal fry) exhibits extremely irregular periodicity that confounds most algorithms. Multi-speaker scenarios require separating overlapping F0 contours. Each challenge has motivated specialized algorithmic solutions.
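Octave errors in particular are often attenuated in post-processing. One simple, widely used remedy is a running median over the F0 track, sketched below under the assumption that unvoiced frames are coded as 0; the function name and window width are invented for the example.

```python
import numpy as np

def median_smooth_f0(f0_track, width=5):
    """Suppress isolated octave jumps with a running median over
    voiced frames. Unvoiced frames (F0 == 0) are left untouched.
    Illustrative post-processing, not an algorithm-level fix."""
    f0 = np.asarray(f0_track, dtype=float)
    out = f0.copy()
    half = width // 2
    for i in range(len(f0)):
        if f0[i] == 0.0:
            continue  # keep the unvoiced decision as-is
        window = f0[max(0, i - half): min(len(f0), i + half + 1)]
        out[i] = np.median(window[window > 0])  # median of voiced neighbors
    return out

# One spurious jump to double the true F0 (404 Hz) gets pulled back
track = [200.0, 202.0, 404.0, 203.0, 201.0]
smoothed = median_smooth_f0(track)
```

Median filtering removes isolated halving/doubling errors but cannot repair sustained octave confusion, which is why algorithms like pYIN address the problem with Viterbi decoding instead.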

Prosody and Tonal Languages

F0 is the foundation of computational prosody analysis. Intonation patterns — the melodic contour of an utterance — are represented as F0 trajectories over time. Models such as the ToBI (Tones and Break Indices) annotation system represent intonation as a sequence of pitch accents, phrase accents, and boundary tones. Automatic prosody analysis systems use F0, along with energy and duration, to detect sentence boundaries, question types, emphasis, and discourse structure in spoken language.

In tonal languages such as Mandarin, Thai, and Yoruba, F0 patterns distinguish word meaning at the syllable level. Accurate F0 estimation is therefore critical for ASR in these languages, where tone errors are as consequential as segment errors. Tone recognition systems extract F0 contours and classify them using neural networks or specialized models that account for the contextual variation of tones in connected speech.
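A common preprocessing step for such systems converts the raw F0 contour to semitones relative to a speaker reference, which removes inter-speaker register differences before tone classification. The sketch below uses the geometric mean of the voiced frames as the reference; the function name and defaults are assumptions for the example.

```python
import numpy as np

def semitone_contour(f0_hz, ref_hz=None):
    """Convert voiced F0 values (Hz) to semitones relative to a
    speaker reference, by default the geometric mean of the contour.
    Illustrative preprocessing for tone classification."""
    f0 = np.array([f for f in f0_hz if f > 0], dtype=float)  # drop unvoiced
    ref = ref_hz if ref_hz is not None else np.exp(np.log(f0).mean())
    return 12.0 * np.log2(f0 / ref)

# 100 Hz and 200 Hz are symmetric around their geometric mean (~141.4 Hz),
# so they map to -6 and +6 semitones
contour = semitone_contour([100.0, 200.0])
```

Working in semitones also makes contour shapes (rising, falling, dipping) comparable across speakers with very different F0 ranges.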

References

  1. de Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930. doi:10.1121/1.1458024
  2. Kim, J. W., Salamon, J., Li, P., & Bello, J. P. (2018). CREPE: A convolutional representation for pitch estimation. Proc. ICASSP, 161–165. doi:10.1109/ICASSP.2018.8461329
  3. Boersma, P. (1993). Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. IFA Proceedings, 17, 97–110.
  4. Mauch, M., & Dixon, S. (2014). pYIN: A fundamental frequency estimator using probabilistic threshold distributions. Proc. ICASSP, 659–663. doi:10.1109/ICASSP.2014.6853678
