Spectrograms

Spectrograms are time-frequency representations of speech that visualize how spectral energy evolves over time, serving as both an analysis tool and the primary intermediate representation in neural speech processing systems.

S(t, f) = |STFT(s, t, f)|²

A spectrogram is a visual and computational representation of the frequency content of a signal as it varies over time. For speech, the spectrogram reveals the formant structure of vowels, the noise bursts of plosives, the frication of sibilants, and the harmonic structure of voiced sounds — all evolving dynamically as the speaker produces connected speech. Spectrograms have been a fundamental tool in phonetics and speech science since their development at Bell Labs in the 1940s, and have gained renewed importance as the standard intermediate representation in neural speech synthesis and recognition.

Short-Time Fourier Transform

Spectrogram Computation

STFT(s, t, f) = ∑_n s[n] · w[n − t] · e^{−j2πfn}

Power spectrogram: S(t,f) = |STFT(s,t,f)|²
Mel spectrogram: S_mel(t,m) = ∑_f H_m(f) · S(t,f)
Log-mel spectrogram: log S_mel(t,m)

Time-frequency tradeoff: Δt · Δf ≥ 1/(4π)

The spectrogram is computed via the Short-Time Fourier Transform (STFT), which slides a window function w[n] along the signal and computes the Fourier transform of each windowed segment. The choice of window length embodies a fundamental tradeoff: longer windows provide better frequency resolution (narrower spectral peaks, resolving individual harmonics) but poorer time resolution (smearing rapid temporal changes), while shorter windows offer the reverse. For speech, a 25 ms window with a 10 ms shift is a common compromise: it is long enough to resolve formants yet short enough to track the rapid articulatory movements of connected speech.
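The framing and transform steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `power_spectrogram`, the Hann window, and the 16 kHz test tone are assumptions made for the example. At 16 kHz, a 25 ms window is 400 samples and a 10 ms shift is 160 samples, which gives a bin spacing of 1/0.025 s = 40 Hz.

```python
import numpy as np

def power_spectrogram(signal, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Return (S, times, freqs) with S[t, f] = |STFT(s, t, f)|^2."""
    win_len = int(round(sample_rate * win_ms / 1000.0))  # 400 samples at 16 kHz
    hop_len = int(round(sample_rate * hop_ms / 1000.0))  # 160 samples at 16 kHz
    if len(signal) < win_len:
        raise ValueError("signal is shorter than one analysis window")

    # Hann window: an assumption, the text does not name a window type.
    window = np.hanning(win_len)

    # Index matrix that picks out overlapping frames, one frame per row.
    n_frames = 1 + (len(signal) - win_len) // hop_len
    idx = np.arange(win_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx] * window

    # One-sided FFT of every frame, then squared magnitude.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    times = (np.arange(n_frames) * hop_len + win_len / 2) / sample_rate
    freqs = np.fft.rfftfreq(win_len, d=1.0 / sample_rate)
    return power, times, freqs

# Usage: a one-second 1 kHz tone should peak in the 1000 Hz bin.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
S, times, freqs = power_spectrogram(tone, sr)
peak_hz = freqs[S.mean(axis=0).argmax()]
```

Doubling `win_ms` halves the bin spacing but also doubles the span of signal that each frame averages over, which is the tradeoff stated above.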

Mel Spectrograms in Neural Systems

The mel spectrogram applies a bank of triangular filters spaced on the mel scale to the power spectrum, reducing the frequency axis from hundreds of FFT bins to typically 80 mel bands. This dimensionality reduction is perceptually motivated, since the mel scale approximates the frequency resolution of human hearing (fine at low frequencies, coarse at high frequencies), and it is computationally advantageous. Log-mel spectrograms (the logarithm of mel filterbank energies) are the standard input to modern neural ASR systems and the standard output target for neural TTS systems like Tacotron 2 and FastSpeech.
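A rough sketch of the filterbank H_m(f) and the log-mel step follows. It assumes the common 2595 · log10(1 + f/700) mel formula, which the text does not specify, and the helper names (`hz_to_mel`, `mel_filterbank`, `log_mel`) are illustrative. Libraries differ in filter normalization and in the exact mel formula, so the values will not necessarily match any particular toolkit.

```python
import numpy as np

def hz_to_mel(hz):
    # One widely used mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate, f_min=0.0, f_max=None):
    """Triangular filters H[m, f], unnormalized, with peak value at most 1."""
    f_max = sample_rate / 2.0 if f_max is None else f_max
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)

    # n_mels + 2 edge frequencies, equally spaced on the mel axis.
    edges = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2))

    H = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        H[m] = np.maximum(0.0, np.minimum(rising, falling))
    return H

def log_mel(power_spec, H, eps=1e-10):
    """S_mel(t, m) = sum_f H_m(f) * S(t, f), followed by a floored log."""
    return np.log(power_spec @ H.T + eps)

# Usage with placeholder data: 50 frames of a 257-bin power spectrum.
rng = np.random.default_rng(0)
fake_power = rng.random((50, 257))
H = mel_filterbank(n_mels=80, n_fft=512, sample_rate=16000)
features = log_mel(fake_power, H)
```

The small `eps` floor is a practical choice to keep the logarithm finite in silent frames; it is not part of the definition.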

Reading Spectrograms

Spectrogram reading — identifying phonemes and words from their spectrographic patterns — was a major research effort in early speech science. Vowels appear as dark horizontal bands (formants) at characteristic frequencies; fricatives like /s/ show high-frequency noise energy; plosives like /p/ and /b/ are marked by a brief silence followed by a burst of energy; nasals show a low-frequency nasal formant with weakened higher formants. While automatic systems have far surpassed human spectrogram reading ability, understanding spectrographic patterns remains essential for diagnosing ASR errors and designing acoustic features.

Spectrogram representations have also enabled the application of computer vision techniques to speech processing. Convolutional neural networks operating on spectrograms treat them as single-channel images, applying 2D convolutions to learn local time-frequency patterns. This perspective has proven remarkably effective for tasks from keyword spotting to speaker verification, blurring the boundary between speech processing and image recognition.
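To make the image analogy concrete, the sketch below applies a single 3 × 3 filter to a spectrogram-shaped array, the way the first layer of a CNN would. Everything here is illustrative: the array is random placeholder data, the kernel is set by hand rather than learned, and `conv2d_valid` is a hypothetical helper, not a library function.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(image, kernel):
    """2D cross-correlation with no padding (the 'convolution' of deep learning)."""
    # windows has shape (H - kh + 1, W - kw + 1, kh, kw).
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)

# Placeholder single-channel "image": 80 mel bands by 100 frames.
rng = np.random.default_rng(0)
spec = rng.standard_normal((80, 100))

# Hand-set kernel that responds to energy increasing along the time axis.
onset_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

# Filter response followed by a ReLU nonlinearity.
feature_map = np.maximum(conv2d_valid(spec, onset_kernel), 0.0)
```

In a trained network the kernel weights are learned, and many such filters run in parallel to produce a stack of feature maps.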

Beyond the standard power spectrogram, several variants exist for specialized applications. The reassigned spectrogram moves each time-frequency value to the local centroid of the energy around it, which sharpens the localization of well-separated components beyond the smearing imposed by the analysis window; it does not circumvent the uncertainty bound itself. The constant-Q transform uses logarithmically spaced frequency bins, matching musical pitch spacing. Cochleagrams model the auditory periphery more faithfully. Each variant makes different tradeoffs between resolution, interpretability, and computational cost, but the log-mel spectrogram remains the workhorse representation for modern speech technology.
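As a small illustration of the logarithmic bin spacing used by the constant-Q transform, the center frequencies can be generated as f_k = f_min · 2^(k/B), where B is the number of bins per octave. The function name and the values chosen (55 Hz, 12 bins per octave) are assumptions for the example only.

```python
import numpy as np

def cqt_center_frequencies(f_min, n_bins, bins_per_octave=12):
    """Geometrically spaced center frequencies f_k = f_min * 2**(k / B)."""
    k = np.arange(n_bins)
    return f_min * 2.0 ** (k / bins_per_octave)

# Four octaves starting at 55 Hz, one bin per semitone.
cq_freqs = cqt_center_frequencies(f_min=55.0, n_bins=48)

# Adjacent bins differ by a constant ratio rather than a constant number of Hz.
ratios = cq_freqs[1:] / cq_freqs[:-1]
```

Because the ratio between neighboring bins is fixed, each bin's bandwidth grows in proportion to its center frequency, unlike the evenly spaced bins of the STFT.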

References

  1. Koenig, W., Dunn, H. K., & Lacy, L. Y. (1946). The sound spectrograph. The Journal of the Acoustical Society of America, 18(1), 19–49. doi:10.1121/1.1916342
  2. Allen, J. B. (1977). Short term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25(3), 235–238. doi:10.1109/TASSP.1977.1162950
  3. Stevens, K. N. (1998). Acoustic Phonetics. MIT Press.
  4. Flanagan, J. L. (1972). Speech Analysis, Synthesis, and Perception (2nd ed.). Springer-Verlag. doi:10.1007/978-3-662-01562-9
