A spectrogram is a visual and computational representation of the frequency content of a signal as it varies over time. For speech, the spectrogram reveals the formant structure of vowels, the noise bursts of plosives, the frication of sibilants, and the harmonic structure of voiced sounds — all evolving dynamically as the speaker produces connected speech. Spectrograms have been a fundamental tool in phonetics and speech science since their development at Bell Labs in the 1940s, and have gained renewed importance as the standard intermediate representation in neural speech synthesis and recognition.
Short-Time Fourier Transform
STFT: STFT(s,t,f) = ∑_n s[n] · w[n − t] · e^(−j2πfn)
Power spectrogram: S(t,f) = |STFT(s,t,f)|²
Mel spectrogram: S_mel(t,m) = ∑_f H_m(f) · S(t,f)
Log-mel spectrogram: log S_mel(t,m)
Time-frequency tradeoff: Δt · Δf ≥ 1/(4π)
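The tradeoff bound can be checked numerically. A Gaussian window is the unique shape that meets Δt · Δf = 1/(4π) with equality, taking Δt and Δf as the standard deviations of the normalized power distributions |w(t)|² and |W(f)|². A minimal NumPy sketch (the window length and spread here are illustrative choices):

```python
import numpy as np

# Gaussian window: the unique shape that meets the uncertainty bound with equality.
sigma = 0.01                              # time spread of the window, in seconds
N = 4096
t = np.linspace(-0.1, 0.1, N, endpoint=False)
dt = t[1] - t[0]
w = np.exp(-t**2 / (2 * sigma**2))

# Time spread: std of the normalized power |w(t)|^2
p_t = np.abs(w)**2
p_t /= p_t.sum() * dt
delta_t = np.sqrt(np.sum(t**2 * p_t) * dt)

# Frequency spread: std of |W(f)|^2, computed via the FFT
W = np.fft.fft(w) * dt
f = np.fft.fftfreq(N, d=dt)
df = f[1] - f[0]
p_f = np.abs(W)**2
p_f /= p_f.sum() * df
delta_f = np.sqrt(np.sum(f**2 * p_f) * df)

product = delta_t * delta_f               # ≈ 1/(4π) ≈ 0.0796 for a Gaussian
```

Any other window shape (Hann, Hamming, rectangular) yields a strictly larger product, which is why the bound is stated as an inequality.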
The spectrogram is computed via the Short-Time Fourier Transform (STFT), which applies a sliding window function w[n] to the signal and computes the Fourier transform of each windowed segment. The choice of window length embodies a fundamental tradeoff: longer windows provide better frequency resolution (narrower spectral peaks, resolving individual harmonics) but poorer time resolution (smearing rapid temporal changes), while shorter windows offer the reverse. For speech, a 25 ms window with 10 ms shift provides a good compromise, yielding sufficient frequency resolution to resolve formants while tracking the rapid articulatory movements of connected speech.
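The 25 ms / 10 ms framing above translates directly into STFT parameters. A minimal sketch using SciPy's `scipy.signal.stft`; the synthetic chirp standing in for a speech signal is an assumption for illustration:

```python
import numpy as np
from scipy.signal import stft

fs = 16000                                   # 16 kHz, a common speech sampling rate
t = np.arange(fs) / fs                       # 1 second of signal
x = np.sin(2 * np.pi * (200 + 300 * t) * t)  # synthetic chirp, stand-in for speech

# 25 ms window -> nperseg = 400 samples; 10 ms shift -> hop = 160 samples
nperseg = int(0.025 * fs)
hop = int(0.010 * fs)
freqs, frames, Z = stft(x, fs=fs, window='hann', nperseg=nperseg,
                        noverlap=nperseg - hop)

S = np.abs(Z)**2                             # power spectrogram S(t, f)
# S has nperseg // 2 + 1 frequency rows and roughly one column per 10 ms hop
```

With these settings the frequency resolution is fs / nperseg = 40 Hz per bin, enough to localize formants (which are hundreds of Hz apart) while the 10 ms hop tracks articulatory movement.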
Mel Spectrograms in Neural Systems
The mel spectrogram applies a bank of triangular filters spaced on the mel scale to the power spectrum, reducing the frequency axis from hundreds of FFT bins to typically 80 mel bands. This dimensionality reduction is perceptually motivated — the mel scale matches human auditory frequency resolution — and computationally advantageous. Log-mel spectrograms (the logarithm of mel filterbank energies) are the standard input to modern neural ASR systems and the standard output target for neural TTS systems like Tacotron 2 and FastSpeech.
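The filterbank construction can be sketched in a few lines of NumPy. This is a simplified version of the standard recipe (HTK-style mel formula, peak-normalized triangles); the FFT size, band count, and frequency range are illustrative defaults, and the random input is a stand-in for a real power spectrogram:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=512, fs=16000, fmin=0.0, fmax=8000.0):
    """Triangular filters H_m(f), with peaks equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    H = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                 # rising slope
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                # falling slope
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return H

H = mel_filterbank()                      # (80 mel bands, 257 FFT bins)
S = np.random.rand(257, 100)              # stand-in power spectrogram (bins x frames)
S_mel = H @ S                             # S_mel(t, m) = sum_f H_m(f) * S(t, f)
log_mel = np.log(S_mel + 1e-10)           # small floor avoids log(0) in silent frames
```

The matrix multiply makes the dimensionality reduction explicit: 257 linear-frequency bins collapse to 80 perceptually spaced bands per frame.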
Spectrogram reading — identifying phonemes and words from their spectrographic patterns — was a major research effort in early speech science. Vowels appear as dark horizontal bands (formants) at characteristic frequencies; fricatives like /s/ show high-frequency noise energy; plosives like /p/ and /b/ are marked by a brief silence followed by a burst of energy; nasals show a low-frequency nasal formant with weakened higher formants. While automatic systems have far surpassed human spectrogram reading ability, understanding spectrographic patterns remains essential for diagnosing ASR errors and designing acoustic features.
Spectrogram representations have also enabled the application of computer vision techniques to speech processing. Convolutional neural networks operating on spectrograms treat them as single-channel images, applying 2D convolutions to learn local time-frequency patterns. This perspective has proven remarkably effective for tasks from keyword spotting to speaker verification, blurring the boundary between speech processing and image recognition.
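The image analogy can be made concrete with a single 2D convolution. The hand-crafted kernel below is purely illustrative (a trained CNN would learn its kernels from data), and the random array stands in for a real log-mel spectrogram:

```python
import numpy as np
from scipy.signal import convolve2d

# A log-mel spectrogram treated as a single-channel image: mel bands x time frames
spec = np.random.rand(80, 100)

# One hand-crafted 3x3 kernel: a vertical-edge detector that responds to
# sudden energy changes along the time axis (onsets, plosive bursts) --
# the kind of local time-frequency pattern a CNN's first layer learns.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])

feature_map = convolve2d(spec, kernel, mode='same')   # same size as the input
```

A real network stacks many such learned kernels with nonlinearities and pooling, but the core operation, local 2D filtering over the time-frequency plane, is exactly this.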
Beyond the standard power spectrogram, several variants exist for specialized applications. The reassigned spectrogram sharpens apparent time-frequency localization by relocating each cell's energy to its local centroid — improving readability, though it does not evade the uncertainty bound that governs the underlying windowed analysis. The constant-Q transform uses logarithmically spaced frequency bins, matching musical pitch spacing. Cochleagrams model the auditory periphery more faithfully. Each variant makes different tradeoffs between resolution, interpretability, and computational cost, but the log-mel spectrogram remains the workhorse representation for modern speech technology.
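The geometric bin spacing that distinguishes the constant-Q transform from an FFT-based spectrogram is easy to show directly. The reference pitch and bin counts below are illustrative conventions, not fixed requirements:

```python
import numpy as np

# Constant-Q center frequencies: geometrically spaced with B bins per octave,
# so each bin's bandwidth is a constant fraction (1/Q) of its center frequency.
f_min = 32.70        # C1 in Hz (an illustrative reference pitch)
B = 12               # 12 bins per octave = semitone spacing
n_bins = 84          # 7 octaves

freqs = f_min * 2.0 ** (np.arange(n_bins) / B)

# The ratio between adjacent bins is constant (2^(1/12) ≈ 1.0595), unlike the
# fixed linear spacing fs / N_fft of an ordinary spectrogram's frequency axis.
ratios = freqs[1:] / freqs[:-1]
```

This constant ratio is what makes the transform match musical intervals: moving up B bins always doubles the frequency, i.e. raises the pitch by one octave.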