Semantic textual similarity (STS) is the task of assigning a continuous score indicating how similar in meaning two text segments are. Unlike textual entailment, which asks a directional yes/no question about logical implication, STS measures a symmetric, graded similarity. The STS Benchmark (STS-B), compiled from the SemEval shared tasks (2012-2017), uses a 0-5 scale where 0 indicates completely dissimilar sentences and 5 indicates semantically equivalent sentences. STS is both an intrinsic evaluation of semantic representations and a building block for applications like deduplication, clustering, and retrieval.
Methods and Models
Cross-encoder: STS(s1, s2) = MLP(BERT([CLS] s1 [SEP] s2 [SEP]))
Evaluation: Pearson / Spearman correlation with human judgments
STS-B scores: 0 (unrelated) to 5 (equivalent)
"A man is playing a guitar" vs "A man is playing music" → 3.8
"A cat sits on a mat" vs "A dog runs in a park" → 0.6
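The evaluation line above can be made concrete. Below is a minimal numpy sketch of Pearson and Spearman correlation between gold annotations and model scores; Spearman is simply Pearson computed over rank-transformed scores. The rank helper ignores ties for brevity (real evaluations use average ranks, as in scipy.stats.rankdata), and the score lists are invented for illustration.

```python
import numpy as np

def pearson(x, y):
    # Pearson correlation: covariance normalized by the product of std devs.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def ranks(x):
    # Rank transform; no tie handling (a simplification for this sketch).
    order = np.argsort(x)
    r = np.empty(len(x))
    r[order] = np.arange(len(x))
    return r

def spearman(x, y):
    # Spearman correlation = Pearson correlation of the ranks.
    return pearson(ranks(np.asarray(x)), ranks(np.asarray(y)))

human = [3.8, 0.6, 4.5, 2.0, 1.1]       # gold STS-B annotations (0-5)
model = [0.74, 0.12, 0.91, 0.55, 0.30]  # model cosine similarities
print(round(pearson(human, model), 3))
print(round(spearman(human, model), 3))
```

Note that Spearman is insensitive to any monotonic rescaling of the model scores, which is why it is often preferred when model outputs (e.g. cosines in [-1, 1]) live on a different scale than the 0-5 annotations.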
Two architectural paradigms dominate STS. Bi-encoders independently encode each sentence into a fixed vector and compute similarity as the cosine similarity between the two vectors, enabling efficient retrieval over large collections. Cross-encoders concatenate the two sentences and pass them jointly through a Transformer, computing a similarity score from the joint representation. Cross-encoders are more accurate because they model fine-grained interactions between the two sentences, but bi-encoders are orders of magnitude faster for retrieval because sentence vectors can be pre-computed and indexed.
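The bi-encoder retrieval pattern can be sketched as follows. The encode function here is a toy bag-of-words stand-in for a real sentence encoder (e.g. a BERT bi-encoder), and the vocabulary and corpus are invented; the point is the shape of the pipeline: corpus vectors are computed once and cached, and each query costs one encode plus N cheap dot products, where a cross-encoder would need N full Transformer passes.

```python
import numpy as np

VOCAB = ["man", "playing", "guitar", "music", "cat", "dog"]

def encode(sentence):
    # Stand-in for a real sentence encoder: a bag-of-words count vector,
    # just to make the retrieval pipeline runnable.
    toks = sentence.lower().split()
    return np.array([toks.count(w) for w in VOCAB], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Corpus vectors are pre-computed and indexed once.
corpus = ["a man is playing a guitar", "a dog runs in a park"]
index = np.stack([encode(s) for s in corpus])

# At query time: one encode, then cheap similarity scoring over the index.
query_vec = encode("a man is playing music")
scores = [cosine(query_vec, v) for v in index]
best = corpus[int(np.argmax(scores))]
print(best)
```

With a real encoder the same loop scales to millions of documents by replacing the list comprehension with an approximate nearest-neighbor index.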
Training Objectives
STS models are trained using several objectives. Regression training directly predicts the human similarity score using mean squared error loss. Contrastive learning, used in models like SimCSE, trains the encoder to produce similar vectors for semantically similar pairs and dissimilar vectors for unrelated pairs. Knowledge distillation transfers the quality of cross-encoder scores to more efficient bi-encoder models. Multi-task training on STS, NLI, and paraphrase detection datasets typically yields the strongest general-purpose similarity models.
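The contrastive objective can be illustrated with a SimCSE-style in-batch loss: each anchor's positive is the matching row of the second batch, and every other row serves as a negative. This is a numpy sketch with random vectors standing in for encoder outputs; the temperature value follows common practice but is otherwise an arbitrary choice here.

```python
import numpy as np

def info_nce_loss(a, b, temperature=0.05):
    # In-batch contrastive loss: a[i] and b[i] are embeddings of a positive
    # pair; every other b[j] in the batch acts as a negative for a[i].
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T / temperature                       # (batch, batch) logits
    # Cross-entropy with the diagonal (the true pair) as the target class.
    logits = sims - sims.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
positives = anchors + 0.01 * rng.normal(size=(4, 8))   # near-duplicate pairs
print(info_nce_loss(anchors, positives))               # low: diagonal dominates
```

Minimizing this loss pulls positive pairs together and pushes in-batch negatives apart; the regression objective mentioned above would instead apply mean squared error between a predicted score and the gold 0-5 annotation.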
STS is closely related to but distinct from paraphrase detection. Paraphrase detection is a binary classification task (paraphrase or not), while STS provides a continuous score. The Microsoft Research Paraphrase Corpus (MRPC) and the Quora Question Pairs (QQP) dataset are standard paraphrase benchmarks. In practice, STS scores can be thresholded to perform paraphrase detection, and paraphrase data can augment STS training. The PAWS dataset specifically tests adversarial cases where high lexical overlap does not imply semantic similarity.
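Thresholding STS scores for paraphrase detection usually involves tuning the cutoff on a development set. A minimal sketch, using invented scores and labels; the candidate thresholds and the F1 criterion are illustrative choices, not standards from the benchmarks.

```python
def binarize(scores, threshold):
    # Pairs scoring at or above the threshold count as paraphrases.
    return [s >= threshold for s in scores]

def f1(pred, gold):
    # Standard binary F1 from true/false positives and false negatives.
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Toy dev set: continuous STS scores with binary paraphrase labels.
scores = [4.6, 3.9, 3.1, 1.2, 0.4]
labels = [True, True, False, False, False]

# Sweep candidate thresholds and keep the one with the best dev F1.
best_t = max([1.0, 2.0, 3.0, 3.5, 4.0],
             key=lambda t: f1(binarize(scores, t), labels))
print(best_t)  # 3.5 separates the toy paraphrases from non-paraphrases
```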
Applications and Challenges
STS underpins many practical applications. In information retrieval, STS between a query and candidate passages determines relevance ranking. In machine translation evaluation, metrics like BERTScore compute STS between candidate and reference translations, correlating with human quality judgments better than n-gram overlap metrics like BLEU. In automatic essay scoring and plagiarism detection, STS identifies semantically similar passages regardless of surface form differences.
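The BERTScore-style matching mentioned above can be sketched in a few lines: each candidate token greedily matches its most similar reference token (precision), each reference token its most similar candidate token (recall), and F1 combines the two. Random vectors stand in for contextual token embeddings here, and the sketch omits BERTScore's optional IDF weighting and baseline rescaling.

```python
import numpy as np

def bertscore_f1(cand_vecs, ref_vecs):
    # Greedy matching over token embeddings: max-similarity per candidate
    # token gives precision, per reference token gives recall.
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    r = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    sims = c @ r.T                        # token-pair cosine similarities
    precision = sims.max(axis=1).mean()   # best match for each candidate token
    recall = sims.max(axis=0).mean()      # best match for each reference token
    return 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(1)
ref = rng.normal(size=(5, 16))                       # reference token embeddings
cand_good = ref + 0.05 * rng.normal(size=(5, 16))    # near-identical candidate
cand_bad = rng.normal(size=(5, 16))                  # unrelated candidate
print(bertscore_f1(cand_good, ref) > bertscore_f1(cand_bad, ref))
```

Because matching happens in embedding space rather than over surface n-grams, synonyms and paraphrases score well where BLEU would penalize them.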
Challenges in STS include handling negation (sentences that differ by a single negation are lexically similar but semantically opposite), quantifier sensitivity ("all students passed" vs. "most students passed"), and domain shift (models trained on news text may perform poorly on clinical or legal text). Compositional generalization, correctly scoring similarity for novel combinations of known words, remains difficult, and models sometimes rely on superficial lexical cues rather than deep semantic comparison.
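The negation failure mode is easy to demonstrate with a purely lexical measure. Token-level Jaccard overlap, used here as a stand-in for any surface-similarity cue, rates a negated pair as highly similar and a genuine paraphrase as dissimilar; the example sentences are invented for illustration.

```python
def jaccard(s1, s2):
    # Token-level Jaccard overlap: a purely lexical similarity measure.
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b)

# High lexical overlap, opposite meaning: one negation flips the sense.
negated = jaccard("the trial showed the drug is effective",
                  "the trial showed the drug is not effective")

# Low lexical overlap, similar meaning: a genuine paraphrase.
paraphrase = jaccard("the medication worked well",
                     "the trial showed the drug is effective")

print(negated, paraphrase)  # lexical overlap ranks the negated pair higher
```

A model that leans on such cues will invert the correct ranking here, which is exactly the behavior adversarial sets like PAWS are built to expose.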