Sentence embeddings extend the distributional semantics paradigm from words to sentences, producing fixed-dimensional vector representations that capture the meaning of entire sentences. While word embeddings represent individual lexical items, many NLP tasks require comparing, clustering, or classifying sentences: semantic search, paraphrase detection, text clustering, and retrieval-augmented generation all depend on high-quality sentence representations. The challenge is to compose word-level information into a sentence-level representation that preserves semantic content while abstracting away from surface form.
Approaches to Sentence Embedding
SIF (Smooth Inverse Frequency):
v_s = (1/n) Σ_i [a / (a + p(w_i))] · v_{w_i}, then remove first principal component
Sentence-BERT (SBERT):
v_s = pool(BERT(w_1, ..., w_n))  (typically mean pooling over token outputs)
Training: siamese network that pulls paraphrase/entailment pairs together, e.g. regression on cos(v_{s1}, v_{s2}) against gold similarity
Similarity at inference: cos(v_{s1}, v_{s2})
The simplest sentence embedding is the unweighted average of word vectors, which surprisingly serves as a competitive baseline. The SIF method (Arora et al., 2017) improves on averaging by weighting words inversely to their frequency and removing the first principal component, capturing the intuition that common words contribute less to sentence meaning. More sophisticated approaches use neural encoders: InferSent trains a BiLSTM on natural language inference data, Universal Sentence Encoder uses a Transformer or DAN architecture, and Sentence-BERT fine-tunes BERT with a siamese architecture for efficient pairwise comparison.
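The SIF recipe above is short enough to sketch directly: weight each word vector by a / (a + p(w)), average, then subtract the projection onto the first principal component of the sentence-embedding matrix. The toy vocabulary and frequencies below are illustrative, not from any real corpus.

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    """SIF: frequency-weighted average of word vectors, followed by
    removal of the first principal component (Arora et al., 2017)."""
    embs = []
    for sent in sentences:
        words = [w for w in sent.split() if w in word_vecs]
        # a / (a + p(w)) down-weights frequent words like "the"
        weights = np.array([a / (a + word_freq.get(w, 0.0)) for w in words])
        vecs = np.array([word_vecs[w] for w in words])
        embs.append(weights @ vecs / len(words))
    X = np.array(embs)
    # first right singular vector = dominant direction shared by sentences
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - np.outer(X @ u, u)  # remove projection onto that direction

# Hypothetical 2-d vectors and unigram probabilities for illustration
word_vecs = {"the": np.array([1.0, 0.0]),
             "cat": np.array([0.0, 1.0]),
             "dog": np.array([1.0, 1.0])}
word_freq = {"the": 0.1, "cat": 0.001, "dog": 0.001}
out = sif_embeddings(["the cat", "the dog"], word_vecs, word_freq)
```

Note that "the" (p = 0.1) receives a weight about 100x smaller than the content words, so the averages are dominated by "cat" and "dog" even before the principal-component correction.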
Contrastive Learning and Modern Methods
Recent advances in sentence embeddings rely on contrastive learning objectives. SimCSE (Gao et al., 2021) achieves strong results using a simple contrastive framework: positive pairs are created by passing the same sentence through the encoder twice with different dropout masks (unsupervised) or by using NLI entailment pairs (supervised). This approach produces embeddings with better uniformity and alignment properties than previous methods. Other contrastive approaches include DeCLUTR, Contrastive Tension (CT), and various augmentation-based methods that create positive pairs through paraphrasing, back-translation, or word deletion.
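The in-batch contrastive objective behind SimCSE can be sketched independently of the encoder: given two views z1, z2 of the same batch (e.g. two dropout passes), each sentence's matched view is the positive and every other sentence in the batch is a negative. This is a minimal NumPy sketch of that InfoNCE loss; the encoder itself and the 0.05 temperature are assumed, the latter following the SimCSE paper's common setting.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss: row i of z1 should be most
    similar to row i of z2; all other rows act as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature                      # (batch, batch) cosines
    logits = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # cross-entropy with the diagonal (matched pairs) as the labels
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_matched = info_nce_loss(z, z)                    # perfect positives
loss_shuffled = info_nce_loss(z, np.roll(z, 1, axis=0))  # mismatched pairs
```

When the positives are exact copies the loss is near zero, while mismatched pairs push it toward log(batch_size), which is the gradient signal that aligns the two dropout views during training.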
Sentence embeddings are evaluated on the STS Benchmark (semantic textual similarity), transfer tasks (SentEval suite including sentiment, entailment, and paraphrase detection), and retrieval tasks (MS MARCO, BEIR). The Massive Text Embedding Benchmark (MTEB) provides a comprehensive evaluation across 8 task types and 58 datasets. A key finding is that embeddings optimized for similarity (STS) do not always transfer well to classification tasks and vice versa, motivating multi-task training approaches and task-specific adapters.
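The standard STS evaluation protocol is simple to state: compute the cosine similarity of each sentence pair's embeddings and report the Spearman correlation with human judgments. A minimal sketch, implementing Spearman as Pearson correlation of ranks (ignoring tie handling, which real evaluators like SentEval take care of):

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank values."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return (rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry))

def sts_eval(emb_a, emb_b, gold):
    """Score embeddings on STS-style data: cosine similarity of each
    sentence pair vs. human similarity judgments, summarized by Spearman."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos = np.sum(a * b, axis=1)
    return spearman(cos, gold)

# Toy example: three pairs whose cosine ordering matches the gold scores
emb_a = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
emb_b = np.array([[1.0, 0.0], [0.5, 0.8660254], [0.0, 1.0]])
gold = np.array([5.0, 3.0, 1.0])
score = sts_eval(emb_a, emb_b, gold)   # perfect rank agreement
```

Spearman rather than Pearson is the conventional headline number because it rewards correct ordering of pairs without assuming the cosine scale is linearly related to the human rating scale.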
Applications
Sentence embeddings power a wide range of applications. In semantic search, queries and documents are encoded into the same vector space, and retrieval is performed via approximate nearest neighbor search, enabling sub-millisecond retrieval over millions of documents. In retrieval-augmented generation (RAG), sentence embeddings are used to find relevant passages that are then provided as context to a language model. Clustering sentence embeddings enables unsupervised topic discovery, and the cosine similarity between sentence vectors provides a lightweight measure of semantic relatedness for deduplication and paraphrase mining.
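The semantic-search pipeline described above reduces to a dot product once the document embeddings are L2-normalized. A brute-force sketch; a production system would swap the exact scan for an approximate nearest neighbor index (e.g. HNSW or IVF) behind the same interface:

```python
import numpy as np

def build_index(doc_embs):
    """Normalize once so cosine similarity reduces to a dot product."""
    return doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)

def search(index, query_emb, k=3):
    """Exact top-k retrieval by cosine similarity over the whole index."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = index @ q
    top = np.argsort(-scores)[:k]   # indices of the k most similar docs
    return top, scores[top]

# Toy 3-d "embeddings" standing in for encoder outputs
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.7, 0.7, 0.0]])
index = build_index(docs)
top, scores = search(index, np.array([0.0, 1.0, 0.0]), k=2)
```

The same two functions serve RAG (retrieve passages, then feed them to the generator), deduplication (flag pairs above a similarity threshold), and paraphrase mining.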
Multilingual sentence embeddings, produced by models like LaBSE and multilingual Sentence-BERT, map sentences from different languages into a shared vector space, enabling cross-lingual retrieval, parallel sentence mining, and zero-shot cross-lingual transfer. The quality of sentence embeddings continues to improve with larger pre-trained models and better training objectives, making them an increasingly essential component of modern NLP systems.