Hate speech detection is the task of automatically classifying text as hateful or not, often with finer-grained distinctions between hate speech, offensive but non-hateful language, and neutral content. The task has become critically important as social media platforms struggle to moderate billions of posts daily, and manual review cannot scale to the volume of content produced. Computational hate speech detection must navigate fundamental tensions: between protecting targeted groups and preserving free expression, and between achieving high recall (catching most hate speech) and maintaining high precision (avoiding false positives that suppress legitimate discourse).
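The precision/recall tension can be made concrete by sweeping a decision threshold over classifier scores. The sketch below uses invented scores and gold labels purely for illustration: lowering the threshold flags more true hate speech (recall rises) at the cost of flagging benign posts (precision falls).

```python
# Illustrative only: precision/recall trade-off for a toy hate speech
# classifier. The scores and labels below are invented for demonstration.

def precision_recall(scores, labels, threshold):
    """Precision and recall when flagging all scores >= threshold (label 1 = hateful)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores (probability of "hateful") and gold labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    0,    0,    1,    0,    0]

# A high threshold keeps precision high but misses hate speech;
# a low threshold catches everything but also flags benign posts.
for t in (0.65, 0.25):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data the strict threshold (0.65) gives perfect precision but only 0.75 recall, while the lenient threshold (0.25) reaches full recall at 0.67 precision; a deployed moderation system must choose a point on this curve.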
Definitions and Annotation Challenges
Hate speech: attacks based on protected characteristics
Offensive language: vulgar or rude but not targeting a group
Neither: neutral or positive content
Fine-grained targets: race, religion, gender, sexual orientation, disability, nationality, political affiliation
Defining hate speech precisely is itself a contested task. Legal definitions vary across jurisdictions; community standards differ across platforms; and individual annotators bring different perspectives shaped by their experiences and backgrounds. Davidson et al. (2017) showed that distinguishing hate speech from offensive language is particularly difficult: annotators frequently disagree on whether a tweet containing slurs constitutes hate speech or merely offensive language. The use of reclaimed slurs, in-group humour, and counter-speech (quoting hate speech to argue against it) further complicates annotation. These challenges mean that hate speech datasets inevitably reflect the biases of their annotators, which propagate into trained models.
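Annotator disagreement of the kind Davidson et al. describe is typically quantified with a chance-corrected agreement statistic such as Cohen's kappa. The sketch below computes kappa for two hypothetical annotators whose labels are invented: they agree on clearly neutral tweets but split on the hate/offensive boundary.

```python
# Minimal sketch: quantifying annotator disagreement with Cohen's kappa
# (chance-corrected agreement between two annotators). Labels are invented.
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators labeling eight tweets. They agree on
# "neither" but disagree on the hate/offensive boundary.
ann1 = ["hate", "hate", "offensive", "hate", "neither", "neither", "offensive", "neither"]
ann2 = ["hate", "offensive", "offensive", "offensive", "neither", "neither", "hate", "neither"]
print(f"kappa = {cohens_kappa(ann1, ann2):.2f}")
```

Here raw agreement is 5/8, but kappa is only about 0.44 ("moderate" agreement) once chance agreement is discounted, which is why raw percent agreement overstates annotation quality on skewed label distributions.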
Detection Methods and Bias
State-of-the-art hate speech detection systems use pretrained transformer models fine-tuned on annotated datasets. HateBERT (Caselli et al., 2021) further pretrains BERT on a large corpus of Reddit posts from banned communities, adapting the language model to the register and vocabulary of online hate. Features beyond surface text — such as user history, network structure, and conversational context — can improve detection but raise privacy concerns. Multimodal hate speech detection extends the task to images, memes, and videos, where hateful content may arise from the combination of modalities rather than either modality alone.
A critical concern in hate speech detection is that classifiers trained on biased data can perpetuate and amplify social biases. Sap et al. (2019) demonstrated that hate speech classifiers are significantly more likely to flag African American English (AAE) text as hateful, even when the content is not hateful by any definition. This racial bias arises because AAE features (e.g., certain lexical items and grammatical constructions) are correlated with hateful content in training data due to annotator biases. Mitigating such biases requires diverse annotation, dialect-aware preprocessing, and fairness-constrained learning objectives.
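The disparity Sap et al. report is commonly audited as a gap in false positive rates between dialect groups: among posts that are not hateful, how often does the model flag posts from each group? The sketch below uses invented dialect tags, gold labels, and model decisions solely to show the shape of such an audit.

```python
# Toy fairness audit: false-positive-rate gap across dialect groups.
# All data (dialect tags, gold labels, model flags) are invented to
# illustrate the kind of disparity Sap et al. (2019) reported for AAE.

def false_positive_rate(examples, group):
    """Share of non-hateful posts from `group` that the model flagged as hateful."""
    negatives = [e for e in examples if e["dialect"] == group and not e["hateful"]]
    flagged = sum(1 for e in negatives if e["flagged"])
    return flagged / len(negatives)

examples = (
    # Non-hateful AAE posts: the classifier wrongly flags 2 of 4.
    [{"dialect": "aae", "hateful": False, "flagged": f} for f in (True, True, False, False)]
    # Non-hateful SAE posts: 0 of 4 wrongly flagged.
    + [{"dialect": "sae", "hateful": False, "flagged": False} for _ in range(4)]
)

for group in ("aae", "sae"):
    print(f"{group} false positive rate: {false_positive_rate(examples, group):.2f}")
```

An FPR gap like this (0.50 vs. 0.00 on the toy data) means speakers of one dialect bear disproportionate erroneous moderation, even if the classifier's overall accuracy looks acceptable.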
Hate speech detection must contend with adversarial evasion: users who wish to spread hateful content deliberately circumvent filters through character substitution ("h@te"), code words, euphemisms, dog whistles, and references that are hateful only in specific cultural contexts. Robustness to such adversarial attacks requires models that go beyond surface-level pattern matching to understand the intent and social context of language. Cross-lingual and multilingual hate speech detection is an active area of research, as most existing datasets and models focus on English, leaving low-resource languages with minimal coverage despite the global nature of online hate.
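A first line of defense against character-substitution evasion is to normalize common substitutions before classification. The mapping below is a deliberately small, illustrative table, not an exhaustive defense; it does nothing against code words, dog whistles, or culturally contextual references.

```python
# Minimal sketch of de-obfuscation for character-substitution evasion
# (e.g. "h@te" -> "hate"). The substitution table is illustrative and
# necessarily incomplete; evaders adapt faster than any fixed mapping.
SUBSTITUTIONS = str.maketrans({
    "@": "a", "4": "a", "3": "e", "1": "i", "!": "i",
    "0": "o", "$": "s", "5": "s", "7": "t",
})

def normalize(text):
    """Lowercase and undo common leetspeak-style character swaps."""
    return text.lower().translate(SUBSTITUTIONS)

print(normalize("h@te sp33ch"))  # -> "hate speech"
```

Note the trade-off: a blanket mapping also rewrites legitimate digits and punctuation, so practical systems typically apply such normalization only to candidate tokens (e.g. near-matches to a lexicon) or feed both the raw and normalized text to the classifier.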