Speech Emotion Recognition (SER) is the task of automatically detecting the emotional state of a speaker from their voice. Emotions modulate nearly every aspect of speech production: fundamental frequency (pitch) tends to rise with arousal, speaking rate increases in anger and fear, voice quality changes with sadness (breathy) and anger (tense), and spectral energy distribution shifts with emotional state. SER has applications in call center analytics, mental health monitoring, human-robot interaction, and adaptive learning systems where understanding the user's emotional state enables more appropriate system responses.
Acoustic Features for Emotion
Prosodic: fundamental frequency (F0) contour, energy/intensity, speaking rate
Spectral: MFCCs, formant frequencies, spectral centroid, spectral flux
Voice quality: jitter, shimmer, harmonics-to-noise ratio (HNR)
Temporal: speech/silence ratio, utterance duration
Common feature sets: eGeMAPS (88 features), ComParE (6,373 features)
Emotion models: categorical (e.g., Ekman's six basic emotions), dimensional (valence-arousal)
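Two of the features listed above — spectral centroid and the speech/silence ratio — are simple enough to compute directly. The sketch below does so in numpy on a synthetic signal; the frame length and silence threshold are illustrative choices, and real systems typically extract such features with toolkits like openSMILE or librosa rather than by hand.

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Center of mass of the magnitude spectrum, in Hz."""
    frame = frame * np.hanning(len(frame))       # window to reduce spectral leakage
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return float(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))

def speech_silence_ratio(signal, frame_len=400, threshold=0.02):
    """Fraction of frames whose RMS energy exceeds a (assumed) silence threshold."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return float(np.mean(rms > threshold))

sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)     # 1 s stand-in for voiced speech
silence = np.zeros(sr)                        # 1 s of silence
signal = np.concatenate([tone, silence])

print(spectral_centroid(tone[:1024], sr))     # close to 440 Hz for a pure tone
print(speech_silence_ratio(signal))           # 0.5: half the frames carry energy
```

For real speech the centroid fluctuates frame to frame, so feature sets like eGeMAPS summarize such low-level descriptors with statistics (mean, variance, percentiles) over the utterance.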
Two primary frameworks exist for representing emotions. Categorical models define discrete emotion classes — typically the six "basic emotions" proposed by Ekman (anger, disgust, fear, happiness, sadness, surprise) plus a neutral state. Dimensional models represent emotions in a continuous space defined by axes such as valence (positive/negative), arousal (active/passive), and sometimes dominance (dominant/submissive). The dimensional approach is particularly well-suited for capturing subtle emotional variations and blended emotions that do not fit neatly into discrete categories.
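The two frameworks can be bridged in code: a dimensional prediction can be snapped to the nearest categorical label. The coordinates below are purely illustrative (loosely following the usual quadrant placement of emotions in the valence-arousal plane), not values from any dataset:

```python
# Illustrative (valence, arousal) coordinates in [-1, 1] for a few emotions;
# the exact numbers are assumptions for this sketch, not empirical values.
EMOTION_COORDS = {
    "happiness": ( 0.8,  0.5),   # positive valence, raised arousal
    "anger":     (-0.6,  0.8),   # negative valence, high arousal
    "sadness":   (-0.7, -0.5),   # negative valence, low arousal
    "calm":      ( 0.5, -0.6),   # positive valence, low arousal
}

def nearest_emotion(valence, arousal):
    """Map a continuous (valence, arousal) prediction to the closest label."""
    return min(
        EMOTION_COORDS,
        key=lambda e: (EMOTION_COORDS[e][0] - valence) ** 2
                    + (EMOTION_COORDS[e][1] - arousal) ** 2,
    )

print(nearest_emotion(-0.5, 0.9))   # high-arousal, negative-valence point -> "anger"
```

Note that the mapping is lossy in exactly the way the paragraph above describes: a blended or subtle emotion that falls between anchor points is forced into whichever discrete category happens to be nearest.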
Deep Learning Approaches
Early SER systems used hand-crafted features with SVMs or random forests. Modern approaches apply deep neural networks that learn representations directly from spectrograms or raw waveforms. Convolutional neural networks capture local spectro-temporal patterns associated with emotional speech, recurrent networks model the temporal evolution of emotion within an utterance, and self-attention mechanisms identify emotionally salient regions. Pre-trained speech models like wav2vec 2.0 and HuBERT provide powerful representations that transfer well to emotion recognition, especially in low-resource scenarios.
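The attention idea behind "identifying emotionally salient regions" reduces to attention pooling: a scoring vector assigns each frame a weight, and the utterance embedding is the weighted sum of frame features. A minimal numpy sketch, with random values standing in for the learned CNN features and scoring vector:

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 120, 64                          # frames per utterance, embedding dim
frames = rng.standard_normal((T, D))    # stand-in for per-frame CNN/encoder output
w = rng.standard_normal(D)              # attention scoring vector (learned in practice)

scores = frames @ w                     # one relevance score per frame
alpha = np.exp(scores - scores.max())   # numerically stable softmax
alpha /= alpha.sum()

utterance_emb = alpha @ frames          # (D,) attention-pooled utterance embedding

print(utterance_emb.shape)              # (64,)
print(alpha.sum())                      # 1.0: weights form a distribution over frames
```

During training the gradient flows through alpha, so the model learns to up-weight frames (e.g., a burst of raised pitch) that are predictive of the emotion label; a classifier head on utterance_emb then produces the final prediction.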
A fundamental difficulty in SER is obtaining reliable ground-truth labels. Emotion perception is subjective: different listeners often disagree about the emotional content of the same utterance. Inter-annotator agreement is typically moderate (Cohen's kappa around 0.4-0.6), and the "true" emotion of a speaker is not directly observable. Best practices include using multiple annotators, reporting agreement statistics, modeling annotator uncertainty in the learning algorithm, and using continuous labels (e.g., time-continuous valence and arousal annotations) rather than forcing discrete choices.
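Two of these practices are easy to make concrete: Cohen's kappa for reporting agreement between a pair of annotators, and soft labels that preserve annotator disagreement instead of forcing a majority vote. The label sequences below are invented for illustration:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(ca) | set(cb)) / n ** 2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def soft_label(votes):
    """Turn several annotators' votes into a label distribution."""
    counts = Counter(votes)
    return {label: c / len(votes) for label, c in counts.items()}

# Hypothetical annotations of six utterances by two listeners.
ann1 = ["angry", "sad", "neutral", "angry", "happy", "neutral"]
ann2 = ["angry", "neutral", "neutral", "sad", "happy", "neutral"]

print(round(cohens_kappa(ann1, ann2), 3))          # moderate agreement despite 4/6 raw matches
print(soft_label(["angry", "angry", "neutral"]))   # {'angry': 0.667, 'neutral': 0.333} (approx.)
```

Training against the soft distribution (e.g., with a cross-entropy loss) lets the model express the same uncertainty the annotators did, rather than treating contested utterances as confidently labeled.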
Multimodal emotion recognition combines acoustic features with textual content (what was said) and visual cues (facial expressions, gestures) for more robust predictions. The text modality captures sentiment and emotional vocabulary, while the acoustic modality captures how something was said. Fusion strategies range from early fusion (concatenating features) to late fusion (combining modality-specific predictions) to cross-modal attention that learns which modality is most informative at each moment.
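The two simplest fusion strategies can be sketched in a few lines. The class probabilities, feature dimensions, and the 0.6 modality weight below are all illustrative assumptions:

```python
import numpy as np

labels = ["angry", "happy", "neutral", "sad"]

# Hypothetical posteriors from two modality-specific classifiers for one utterance.
p_audio = np.array([0.55, 0.10, 0.25, 0.10])   # acoustic model leans angry
p_text  = np.array([0.20, 0.05, 0.30, 0.45])   # text model leans sad

# Late fusion: weighted average of modality-specific predictions.
w_audio = 0.6                                  # assumed modality weight (often tuned on dev data)
fused = w_audio * p_audio + (1 - w_audio) * p_text
fused /= fused.sum()
print(labels[int(np.argmax(fused))])           # "angry": acoustic evidence wins here

# Early fusion instead concatenates features before a single classifier.
f_audio = np.ones(88)                          # e.g., an eGeMAPS feature vector
f_text = np.ones(300)                          # e.g., a text embedding
joint = np.concatenate([f_audio, f_text])      # (388,) input to one joint model
print(joint.shape)
```

Cross-modal attention generalizes the fixed w_audio into a weight computed per utterance (or per time step), which is what allows the model to trust the text when the audio is ambiguous and vice versa.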
Standard SER benchmarks include IEMOCAP (interactive emotional dyadic motion capture), MSP-IMPROV, and RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song). Current challenges include cross-corpus generalization (models trained on one dataset often perform poorly on others due to differences in recording conditions and annotation conventions), cross-cultural emotion recognition, and in-the-wild emotion detection from noisy, spontaneous speech rather than acted performances.