Computational Linguistics

Sentence Embeddings

Sentence embeddings map variable-length sentences to fixed-dimensional vector representations that capture semantic content, enabling efficient comparison, retrieval, and classification of textual meaning at the sentence level.

v_s = f(w_1, w_2, ..., w_n) ∈ R^d

Sentence embeddings extend the distributional semantics paradigm from words to sentences, producing fixed-dimensional vector representations that capture the meaning of entire sentences. While word embeddings represent individual lexical items, many NLP tasks require comparing, clustering, or classifying sentences: semantic search, paraphrase detection, text clustering, and retrieval-augmented generation all depend on high-quality sentence representations. The challenge is to compose word-level information into a sentence-level representation that preserves semantic content while abstracting away from surface form.

Approaches to Sentence Embedding

Common Sentence Embedding Methods

Averaging:
v_s = (1/n) Σ_{i=1}^{n} v_{w_i}

SIF (Smooth Inverse Frequency):
v_s = (1/n) Σ_i [a / (a + p(w_i))] · v_{w_i}, then remove the first principal component

Sentence-BERT (SBERT):
v_s = pool(BERT(w_1, ..., w_n))
Training: siamese network on labeled sentence pairs (e.g., NLI classification, or regression toward gold similarity scores)
Similarity: cos(v_{s1}, v_{s2})

The simplest sentence embedding is the unweighted average of word vectors, which surprisingly serves as a competitive baseline. The SIF method (Arora et al., 2017) improves on averaging by weighting words inversely to their frequency and removing the first principal component, capturing the intuition that common words contribute less to sentence meaning. More sophisticated approaches use neural encoders: InferSent trains a BiLSTM on natural language inference data, Universal Sentence Encoder uses a Transformer or DAN architecture, and Sentence-BERT fine-tunes BERT with a siamese architecture for efficient pairwise comparison.
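The two training-free baselines above can be sketched in a few lines of numpy. This is a minimal illustration, not a reference implementation: the input format (lists of (word, vector) pairs and a unigram-probability dictionary) is an assumption made for the example, and the first principal component is removed across the stacked sentence matrix via SVD, as in the SIF recipe.

```python
import numpy as np

def average_embedding(word_vecs):
    """Unweighted mean of word vectors: the simplest sentence embedding."""
    return np.mean(word_vecs, axis=0)

def sif_embeddings(sentences, word_prob, a=1e-3):
    """SIF sketch (Arora et al., 2017): frequency-weighted average,
    then remove the first principal component of the sentence matrix.
    `sentences` is a list of [(word, vector), ...]; `word_prob` maps
    each word to its estimated unigram probability p(w)."""
    embs = []
    for words in sentences:
        weights = np.array([a / (a + word_prob[w]) for w, _ in words])
        vecs = np.array([v for _, v in words])
        embs.append(weights @ vecs / len(words))
    X = np.array(embs)
    # First principal direction via SVD of the stacked sentence vectors.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]
    return X - np.outer(X @ u, u)  # remove projection onto u
```

Removing the top principal component discards a direction shared by most sentences, which empirically tends to encode frequency and syntax rather than meaning.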

Contrastive Learning and Modern Methods

Recent advances in sentence embeddings rely on contrastive learning objectives. SimCSE (Gao et al., 2021) achieves strong results using a simple contrastive framework: positive pairs are created by passing the same sentence through the encoder twice with different dropout masks (unsupervised) or by using NLI entailment pairs (supervised). This approach produces embeddings with better uniformity and alignment properties than previous methods. Other contrastive approaches include DeCLUTR, CT-BERT, and various augmentation-based methods that create positive pairs through paraphrasing, back-translation, or word deletion.
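The contrastive objective used by SimCSE-style methods can be written down compactly. The sketch below computes the InfoNCE loss for a batch in numpy, assuming the two encoded views h1 and h2 have already been produced (e.g., by two dropout-masked forward passes); the temperature value of 0.05 follows the SimCSE paper, and gradients are omitted since this only illustrates the loss itself.

```python
import numpy as np

def info_nce_loss(h1, h2, tau=0.05):
    """SimCSE-style contrastive (InfoNCE) loss on a batch.
    h1[i] and h2[i] are two encodings of the same sentence i; all
    other rows of h2 serve as in-batch negatives."""
    h1 = h1 / np.linalg.norm(h1, axis=1, keepdims=True)
    h2 = h2 / np.linalg.norm(h2, axis=1, keepdims=True)
    sims = h1 @ h2.T / tau                    # cosine similarities / temperature
    sims -= sims.max(axis=1, keepdims=True)   # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))       # diagonal entries are positives
```

Minimizing this loss pulls the two views of each sentence together (alignment) while pushing apart all other sentences in the batch (uniformity), the two properties the section mentions.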

Evaluation of Sentence Embeddings

Sentence embeddings are evaluated on the STS Benchmark (semantic textual similarity), transfer tasks (SentEval suite including sentiment, entailment, and paraphrase detection), and retrieval tasks (MS MARCO, BEIR). The Massive Text Embedding Benchmark (MTEB) provides a comprehensive evaluation across 8 task types and 58 datasets. A key finding is that embeddings optimized for similarity (STS) do not always transfer well to classification tasks and vice versa, motivating multi-task training approaches and task-specific adapters.
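STS evaluation reduces to correlating predicted cosine similarities with human similarity judgments. A minimal sketch, assuming distinct scores (no tie correction) and toy inputs, is:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (no tie handling), the standard
    STS metric: correlate ranks rather than raw values."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def sts_score(emb_a, emb_b, gold):
    """Score paired sentence embeddings against gold similarity labels."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos = np.sum(a * b, axis=1)
    return spearman(cos, gold)
```

Rank correlation is preferred over Pearson here because only the ordering of similarities matters, not whether the model's cosine values are on the same scale as the human 0–5 ratings.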

Applications

Sentence embeddings power a wide range of applications. In semantic search, queries and documents are encoded into the same vector space, and retrieval is performed via approximate nearest neighbor search, enabling sub-millisecond retrieval over millions of documents. In retrieval-augmented generation (RAG), sentence embeddings are used to find relevant passages that are then provided as context to a language model. Clustering sentence embeddings enables unsupervised topic discovery, and the cosine similarity between sentence vectors provides a lightweight measure of semantic relatedness for deduplication and paraphrase mining.
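The semantic-search pattern described above can be sketched with a brute-force numpy index. Real systems replace the exhaustive scan with an approximate nearest neighbor index (e.g., FAISS or HNSW) to reach the latencies mentioned; the interface here is a toy assumption.

```python
import numpy as np

def build_index(doc_embs):
    """Normalize document embeddings once, so cosine similarity
    reduces to a plain dot product at query time."""
    return doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)

def search(index, query_emb, k=3):
    """Exact top-k retrieval by cosine similarity (brute force)."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return list(zip(top.tolist(), scores[top].tolist()))
```

Pre-normalizing both sides is the standard trick: it turns every cosine computation into one matrix-vector product, which is also the form ANN libraries index.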

Multilingual sentence embeddings, produced by models like LaBSE and multilingual Sentence-BERT, map sentences from different languages into a shared vector space, enabling cross-lingual retrieval, parallel sentence mining, and zero-shot cross-lingual transfer. The quality of sentence embeddings continues to improve with larger pre-trained models and better training objectives, making them an increasingly essential component of modern NLP systems.

References

  1. Arora, S., Liang, Y., & Ma, T. (2017). A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of ICLR.
  2. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of EMNLP-IJCNLP (pp. 3982–3992). doi:10.18653/v1/D19-1410
  3. Gao, T., Yao, X., & Chen, D. (2021). SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of EMNLP (pp. 6894–6910). doi:10.18653/v1/2021.emnlp-main.552
  4. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. In Proceedings of EMNLP (pp. 670–680). doi:10.18653/v1/D17-1070
