Extractive summarisation produces summaries by selecting a subset of sentences from the source document and presenting them in their original document order. The task can be formulated as a combinatorial optimisation problem: select the subset of sentences that maximises the total information content (salience) while minimising redundancy between selected sentences, subject to a length budget. This formulation connects extractive summarisation to classical optimisation theory and enables the application of techniques from combinatorial optimisation, submodular function maximisation, and integer linear programming.
Sentence Scoring and Selection
Graph-based (TextRank/LexRank), with damping factor d:
score(sᵢ) = (1-d) + d · ∑_{sⱼ ∈ adj(sᵢ)} (sim(sᵢ, sⱼ) / ∑ₖ sim(sⱼ, sₖ)) · score(sⱼ)
MMR: score(s) = λ · sim(s, D) − (1-λ) · max_{s' ∈ S} sim(s, s'), where D is the document, S the set of already selected sentences, and λ trades off relevance against diversity
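The MMR objective above amounts to a greedy selection loop. A minimal sketch using bag-of-words cosine similarity (the names `cosine` and `mmr_select`, and the default λ, are illustrative choices):

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(sentences, budget, lam=0.7):
    """Greedy MMR: score(s) = lam * sim(s, D) - (1 - lam) * max sim(s, s')
    over already selected s'; higher lam favours relevance over diversity."""
    bows = [Counter(s.lower().split()) for s in sentences]
    doc = sum(bows, Counter())  # the document D as the union of all sentence bags
    selected = []
    while len(selected) < min(budget, len(sentences)):
        best, best_score = None, float("-inf")
        for i, bow in enumerate(bows):
            if i in selected:
                continue
            relevance = cosine(bow, doc)
            redundancy = max((cosine(bow, bows[j]) for j in selected), default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)  # present sentences in original document order
```

With λ = 0.5, a duplicated sentence is penalised enough that the second copy loses to a novel sentence, which is exactly the redundancy control MMR is designed for.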
Sentence scoring assigns an importance score to each sentence in the source document. Frequency-based methods score sentences by the aggregate importance of their constituent words, using measures such as TF-IDF, log-likelihood ratio, or topic signature weights. Position-based features exploit the journalistic convention of putting important information first, assigning higher scores to sentences near the beginning of a document. Graph-based methods such as TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004) construct a graph of inter-sentence similarities and apply PageRank-style algorithms to identify central sentences.
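The frequency- and position-based features above might be combined as follows. This sketch uses within-document word probabilities as a stand-in for TF-IDF or topic-signature weights; `score_sentences` and the decaying position prior are illustrative choices, not a specific published scoring function:

```python
from collections import Counter

def score_sentences(sentences, position_weight=0.1):
    """Score each sentence by the mean document-level probability of its
    words (a stand-in for TF-IDF weights), plus a bonus that decays with
    position to reflect the lead bias of journalistic text."""
    words = [w for s in sentences for w in s.lower().split()]
    freq = Counter(words)
    total = sum(freq.values())
    scores = []
    for i, s in enumerate(sentences):
        tokens = s.lower().split()
        salience = (sum(freq[w] / total for w in tokens) / len(tokens)
                    if tokens else 0.0)
        position = position_weight / (i + 1)  # earlier sentences get a larger bonus
        scores.append(salience + position)
    return scores
```

Given two identical sentences, the earlier one scores higher purely through the position prior, mirroring the convention the text describes.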
Submodular Optimisation and Neural Extractive Models
Lin and Bilmes (2011) showed that many extractive summarisation objectives can be expressed as submodular functions, which enjoy a theoretical guarantee: for monotone submodular maximisation under cardinality constraints, the greedy algorithm achieves at least a (1 − 1/e) ≈ 63% fraction of the optimal objective value. This result provides a principled foundation for greedy sentence selection algorithms that were previously used heuristically. Submodular objectives naturally capture the diminishing returns property of summarisation: adding a sentence that covers new topics is more valuable than adding one that is redundant with already selected sentences.
TextRank and LexRank independently proposed the application of graph-based centrality algorithms to extractive summarisation. Both construct a graph where nodes are sentences and edge weights represent inter-sentence similarity (typically cosine similarity of TF-IDF vectors). TextRank applies PageRank to this graph, while LexRank uses a threshold to create an unweighted graph and applies eigenvector centrality. The central insight is that important sentences are those that are similar to many other important sentences — a recursive definition elegantly captured by the eigenvector computation. These unsupervised methods remain competitive baselines.
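A minimal TextRank can be sketched as power iteration over a bag-of-words cosine graph, following the weighted PageRank recurrence; the whitespace tokenisation, damping factor, and fixed iteration count are illustrative simplifications:

```python
from collections import Counter
from math import sqrt

def textrank(sentences, d=0.85, iters=50):
    """Power iteration for TextRank: nodes are sentences, edge weights are
    bag-of-words cosine similarities, and each score is the damped sum of
    neighbours' scores weighted by normalised similarity."""
    bows = [Counter(s.lower().split()) for s in sentences]

    def cos(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = sqrt(sum(v * v for v in a.values()))
        nb = sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    n = len(sentences)
    sim = [[cos(bows[i], bows[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out_sum = [sum(row) for row in sim]  # total outgoing weight of each node
    scores = [1.0] * n
    for _ in range(iters):
        scores = [
            (1 - d) + d * sum(sim[j][i] / out_sum[j] * scores[j]
                              for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores
```

A sentence that shares vocabulary with many others accumulates score through the recurrence, while an isolated sentence bottoms out at the (1 − d) baseline.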
Neural extractive models score sentences using contextual representations from recurrent or transformer architectures. SummaRuNNer (Nallapati et al., 2017) uses a hierarchical RNN to encode words into sentence representations and sentences into document representations, then classifies each sentence as included or excluded from the summary. BERT-based extractive models (Liu and Lapata, 2019) insert [CLS] tokens between sentences and use the resulting [CLS] representations to score sentences. These neural models capture inter-sentence dependencies and document-level context that simpler scoring functions miss, achieving substantial improvements over unsupervised methods on standard benchmarks such as CNN/DailyMail.
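The [CLS]-interleaved input construction used by BERT-based extractive models can be sketched in a few lines; whitespace tokens stand in for a real wordpiece tokenizer, and `build_bertsum_input` is an illustrative name:

```python
def build_bertsum_input(sentences):
    """Interleave [CLS] / [SEP] markers around each sentence: the encoder
    representation at each [CLS] position is later used to score that
    sentence for inclusion in the summary."""
    tokens, cls_positions = [], []
    for sent in sentences:
        cls_positions.append(len(tokens))  # record where this sentence's [CLS] lands
        tokens.append("[CLS]")
        tokens.extend(sent.split())
        tokens.append("[SEP]")
    return tokens, cls_positions
```

The returned `cls_positions` are the indices whose contextual vectors a downstream classifier would score, so each sentence's representation is conditioned on the whole document through the transformer's self-attention.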