Question answering (QA) is the task of automatically answering questions posed in natural language. QA systems may draw answers from a given text passage (reading comprehension), a large document collection (open-domain QA), a structured knowledge base (KBQA), or a combination of these sources. The task serves as a comprehensive benchmark for natural language understanding, requiring systems to parse questions, identify relevant information, perform reasoning, and produce precise answers. QA has been a central challenge in AI since the earliest systems of the 1960s and has experienced dramatic progress with the advent of pretrained language models.
Extractive Reading Comprehension
Given a passage p with contextual token encodings h₁, …, hₙ, the answer is a contiguous span (start, end) in p:
P(start = i) = softmax(w_s · hᵢ)
P(end = j) = softmax(w_e · hⱼ)
a* = argmax_{i,j: i ≤ j} P(start = i) · P(end = j)
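The span-selection step above can be sketched in a few lines of NumPy, with toy vectors standing in for a transformer's hidden states (the function and argument names are illustrative, not from any particular library):

```python
import numpy as np

def predict_span(h, w_s, w_e):
    """Select the best answer span under the factorised model above.

    h   : (n, d) contextual token encodings for the passage
    w_s : (d,)   start-scoring vector
    w_e : (d,)   end-scoring vector
    Returns (start, end, probability of the chosen span).
    """
    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    p_start = softmax(h @ w_s)   # P(start = i)
    p_end = softmax(h @ w_e)     # P(end = j)

    # argmax over valid spans i <= j of P(start = i) * P(end = j)
    scores = np.triu(np.outer(p_start, p_end))   # zero out spans with j < i
    i, j = np.unravel_index(scores.argmax(), scores.shape)
    return int(i), int(j), float(scores[i, j])
```

The upper-triangular mask enforces the i ≤ j constraint from the argmax, so an end position can never precede its start.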
Extractive QA identifies a contiguous span in a given passage that answers the question. The SQuAD dataset (Rajpurkar et al., 2016) established this formulation and catalysed rapid progress: models climbed from the original logistic-regression baseline of 51% F1 to surpass the reported human benchmark of 91.2% F1 within about two years. The standard approach encodes the question and passage jointly using a pretrained transformer, then predicts the start and end positions of the answer span. SQuAD 2.0 (Rajpurkar et al., 2018) introduced unanswerable questions, requiring models to determine when the passage does not contain the answer, a crucial capability for reliable QA systems.
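A common way to handle unanswerable questions is to score a "no-answer" option (in BERT-style models, typically the span at the [CLS] position) and abstain when it beats the best real span by a margin tuned on development data. A minimal sketch of that decision rule, with illustrative names and toy scores:

```python
def answer_or_abstain(best_span_score, null_score, threshold=0.0):
    """Return True to output the best span, False to abstain (no answer).

    best_span_score : log-score of the best (start, end) span
    null_score      : log-score of the no-answer option
    threshold       : margin tuned on dev data; higher values abstain more
    """
    # Predict a span only when it beats the null option by the margin.
    return best_span_score - null_score > threshold
```

In practice the threshold is swept over the development set to maximise F1, trading answer coverage against the risk of answering when no answer exists.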
Open-Domain and Multi-Hop QA
Open-domain QA removes the assumption that the relevant passage is given, requiring systems to retrieve relevant documents from a large corpus before extracting or generating answers. The retriever-reader architecture (Chen et al., 2017) combines an information retrieval component (the retriever) that identifies relevant passages with a reading comprehension component (the reader) that extracts answers from the retrieved passages. Dense passage retrieval (DPR; Karpukhin et al., 2020) replaces traditional sparse retrieval with learned dense representations, improving retrieval quality by capturing semantic similarity rather than just lexical overlap.
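The dense-retrieval step can be sketched with toy embeddings. In DPR the question and passages are encoded by separate learned encoders and scored by inner product; here plain vectors stand in for those encoders, and all names are illustrative:

```python
import numpy as np

def retrieve_top_k(question_vec, passage_matrix, k=2):
    """Return indices of the k passages most similar to the question.

    question_vec   : (d,)   dense encoding of the question
    passage_matrix : (N, d) precomputed dense encodings of the corpus
    """
    scores = passage_matrix @ question_vec   # inner-product similarity
    return np.argsort(-scores)[:k]           # highest-scoring passages first

# Toy corpus of three "passages" embedded in 3-d space.
passages = np.array([
    [0.9, 0.1, 0.0],   # about topic A
    [0.1, 0.9, 0.0],   # about topic B
    [0.0, 0.1, 0.9],   # about topic C
])
question = np.array([0.1, 0.85, 0.1])        # semantically closest to topic B
top = retrieve_top_k(question, passages, k=1)
```

Because passage encodings are precomputed, retrieval at query time reduces to a single matrix-vector product (plus approximate nearest-neighbour indexing at corpus scale).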
Multi-hop QA requires combining information from multiple documents or passages to answer a question. For example, answering "Where was the director of Inception born?" requires first identifying the director (Christopher Nolan) and then finding his birthplace (London). The HotpotQA dataset (Yang et al., 2018) provides a benchmark for multi-hop reasoning, requiring systems to identify and chain evidence across multiple supporting documents. Multi-hop QA pushes beyond pattern matching to test genuine reasoning capabilities, and current systems still struggle with complex chains of reasoning that humans find straightforward.
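The two-hop pattern in the Inception example can be sketched as query decomposition, where the answer to each hop becomes the subject of the next. The lookup table and helper below are purely illustrative; real systems retrieve and read free text at every hop:

```python
# Toy "knowledge source": each hop is a dictionary lookup standing in for a
# full retrieve-and-read step over documents.
FACTS = {
    ("Inception", "director"): "Christopher Nolan",
    ("Christopher Nolan", "birthplace"): "London",
}

def answer_hop(entity, relation):
    """One hop: answer a single-relation question about an entity."""
    return FACTS.get((entity, relation))

def multi_hop(entity, relations):
    """Chain hops: feed each intermediate answer into the next question."""
    for relation in relations:
        entity = answer_hop(entity, relation)
        if entity is None:
            return None   # the evidence chain broke at this hop
    return entity

# "Where was the director of Inception born?"
answer = multi_hop("Inception", ["director", "birthplace"])
```

The sketch makes the failure mode concrete: an error at any intermediate hop propagates, which is one reason long reasoning chains remain difficult.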
Generative QA
Generative QA models produce answers by generating text rather than extracting spans, enabling them to handle questions that require synthesis, abstraction, or reasoning beyond what is explicitly stated in any single passage. Models such as T5 and GPT formulate QA as a text-to-text task, generating the answer string conditioned on the question and context. This approach naturally handles questions with free-form answers, list answers, and yes/no answers that extractive models struggle with. However, generative models are susceptible to hallucination (generating plausible but incorrect answers), making faithfulness to the source evidence a critical concern for deployment in real-world applications.