Computational Linguistics

Dialogue State Tracking

Dialogue state tracking (DST) maintains a running representation of the user's goals and the information exchanged during a conversation, enabling dialogue systems to make informed decisions about subsequent actions.

B_t(s = v) = P(s_t = v | u_1, a_1, …, u_t)

Dialogue state tracking is the task of maintaining an accurate, up-to-date representation of the user's goals and constraints at each turn of a conversation. In a restaurant booking dialogue, for example, the dialogue state might record that the user wants Italian food (cuisine=Italian), in the center of town (area=center), and has not yet specified a price range (price=unknown). Accurate state tracking is critical because the dialogue policy, database queries, and response generation all depend on the current state. Errors in state tracking cascade through the system, leading to irrelevant responses and task failure.
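The running restaurant-booking example can be sketched as a minimal dictionary-based tracker. The slot names and the simple overwrite-on-inform rule below are illustrative assumptions, not any particular system's design:

```python
# Minimal sketch of a dialogue state for the restaurant-booking example.
# Slot names and the overwrite-on-inform rule are illustrative assumptions.

def update_state(state, informed):
    """Return a new state with newly informed slot values written in."""
    new_state = dict(state)
    new_state.update(informed)
    return new_state

state = {"cuisine": None, "area": None, "price": None}
state = update_state(state, {"cuisine": "Italian"})   # "I want Italian food"
state = update_state(state, {"area": "center"})       # "in the center of town"
# price stays unspecified until the user mentions it
```

Naively overwriting slots on every turn is exactly where cascading errors come from: a single misrecognized value silently corrupts all downstream database queries.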

Belief State Representation

Dialogue state as a belief distribution: B_t = {(s_i, P_t(v | s_i)) | s_i ∈ Slots}

Example at turn t:
B_t(cuisine) = {Italian: 0.85, Chinese: 0.10, dontcare: 0.05}
B_t(area) = {center: 0.92, north: 0.05, dontcare: 0.03}
B_t(price) = {unknown: 1.0}

Joint goal accuracy = P(∀s_i: argmax_v B_t(v|s_i) = v*_i)

The dialogue state is typically represented as a belief state — a probability distribution over possible slot values for each slot in the domain ontology. Rather than committing to a single interpretation of each user utterance, belief tracking maintains uncertainty, allowing the system to recover gracefully from recognition and understanding errors. The joint goal accuracy metric evaluates whether the predicted values for all slots simultaneously match the ground truth, making it a stringent measure that penalizes any single-slot error.
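The joint goal accuracy criterion can be made concrete with the belief state shown above: the per-slot argmax must match the gold value for every slot simultaneously. The helper names here are mine; the distributions mirror the example:

```python
# Sketch: joint goal correctness from per-slot belief distributions.
def predict(belief):
    """Take the argmax value for each slot."""
    return {slot: max(dist, key=dist.get) for slot, dist in belief.items()}

def joint_goal_correct(belief, gold):
    """True only if every slot's predicted value matches the gold state."""
    return predict(belief) == gold

belief = {
    "cuisine": {"Italian": 0.85, "Chinese": 0.10, "dontcare": 0.05},
    "area":    {"center": 0.92, "north": 0.05, "dontcare": 0.03},
    "price":   {"unknown": 1.0},
}
gold = {"cuisine": "Italian", "area": "center", "price": "unknown"}
joint_goal_correct(belief, gold)                      # all slots match
joint_goal_correct(belief, dict(gold, price="cheap")) # one wrong slot fails the turn
```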

Statistical and Neural Approaches

Early DST systems used hand-crafted rules to update the dialogue state based on the NLU output. The DSTC (Dialogue State Tracking Challenge) series, beginning with Williams et al. (2013), established standardized evaluation and spurred the development of statistical approaches. Henderson et al. (2014) introduced a neural approach using recurrent networks that operated directly on ASR output, bypassing the NLU module entirely. The NBT (Neural Belief Tracker) of Mrkšić et al. (2017) combined pre-trained word embeddings with a learned similarity function, achieving strong results with limited training data by leveraging semantic relationships between slot values.
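The NBT's core intuition can be caricatured as follows: score each candidate slot value by the similarity between its embedding and a representation of the utterance, so that semantically related words ("pasta", "Italian") support each other. The two-dimensional toy vectors stand in for pre-trained embeddings; nothing here reproduces the actual NBT architecture:

```python
# Caricature of similarity-based candidate scoring (NBT-style intuition).
# The toy 2-d vectors are stand-ins for pre-trained word embeddings.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

emb = {  # toy "pre-trained" embeddings
    "italian": [0.9, 0.1],
    "pasta":   [0.8, 0.2],
    "chinese": [0.1, 0.9],
}

def utterance_rep(tokens):
    """Average the embeddings of known tokens in the utterance."""
    vecs = [emb[t] for t in tokens if t in emb]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

utt = utterance_rep(["i", "want", "pasta"])
scores = {value: cosine(utt, emb[value]) for value in ("italian", "chinese")}
# an utterance mentioning "pasta" scores "italian" higher than "chinese"
```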

Open-Vocabulary DST

Traditional DST systems assume a fixed ontology with enumerated slot values, but real-world conversations frequently involve values not seen during training. Open-vocabulary DST approaches address this by generating or extracting slot values from the dialogue context rather than selecting from a predefined list. Models like TRADE (Wu et al., 2019) use copy mechanisms to extract values directly from user utterances, while generative approaches like SOM-DST frame state tracking as a sequence generation problem. This flexibility is essential for practical deployment, where the space of possible user expressions far exceeds what any predefined ontology can capture.
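A deliberately simplified illustration of the extraction idea: rather than classifying over an enumerated value list, the tracker points at a span of the user utterance, so unseen values (a restaurant name, for instance) can still be captured. The regex patterns and slot name below are toy assumptions standing in for a learned copy mechanism like TRADE's:

```python
# Toy illustration of extraction-style open-vocabulary DST: point at a span
# in the utterance instead of choosing from a fixed ontology. The patterns
# and slot name are assumptions, not TRADE's learned pointer mechanism.
import re

PATTERNS = {
    "restaurant-name": re.compile(
        r"\b(?:at|called)\s+([A-Z][\w']*(?:\s+[A-Z][\w']*)*)"
    ),
}

def extract_values(utterance):
    """Fill slots by copying matched spans out of the raw utterance."""
    state = {}
    for slot, pattern in PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            state[slot] = match.group(1)
    return state

extract_values("Book a table at Luigi's Trattoria please")
# the name "Luigi's Trattoria" need not appear in any predefined value list
```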

Pre-trained Model Approaches

The application of pre-trained language models has significantly advanced DST performance. Models fine-tuned from BERT, GPT-2, or T5 frame state tracking as a reading comprehension, question-answering, or sequence-to-sequence problem. For each slot, the model takes the dialogue history as context and either extracts the value span, generates it, or selects from candidates. TripPy (Heck et al., 2020) combines span extraction with memory mechanisms that track slot values across turns. These approaches achieve joint goal accuracies above 55% on MultiWOZ 2.1, though performance degrades substantially in multi-domain and cross-domain settings.
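The question-answering framing can be sketched by constructing, for each slot, a question over the serialized dialogue history; a fine-tuned reader model (not shown) would then extract a value span or predict none/dontcare. The input format and slot description below are assumptions for illustration, not any particular model's convention:

```python
# Sketch of the QA-style input framing used by BERT-family trackers:
# one (question, context) pair per slot. The exact serialization format
# and slot descriptions are illustrative assumptions.
def build_qa_input(history, slot_description):
    """Serialize the dialogue history behind a per-slot question."""
    question = f"What is the value of {slot_description}?"
    context = " ".join(f"[{speaker}] {text}" for speaker, text in history)
    return f"{question} [SEP] {context}"

history = [
    ("user", "I want Italian food in the center of town."),
    ("system", "What price range?"),
]
qa_input = build_qa_input(history, "the restaurant's cuisine")
# a span-extraction head over this input would be trained to return "Italian"
```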

Key challenges in dialogue state tracking include handling coreference and ellipsis (when users refer to previously mentioned values or omit information recoverable from context), schema changes (adapting to new domains with different slot structures), multi-domain tracking (maintaining coherent states across domain transitions within a single dialogue), and error recovery (correctly updating the state when users correct previous statements). Zero-shot and few-shot DST, where models must track states in domains not seen during training using only schema descriptions, represents a particularly important frontier for building practical, extensible dialogue systems.
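Error recovery in particular is often handled by predicting a per-slot update operation each turn rather than re-predicting the whole state, in the spirit of the operation-prediction trackers mentioned above. The operation labels and the helper below are illustrative:

```python
# Sketch of turn-level per-slot update operations (keep / overwrite / delete),
# loosely in the spirit of operation-prediction trackers such as SOM-DST.
# The operation names are illustrative assumptions.
def apply_ops(state, ops):
    """Apply a predicted (operation, value) pair to each affected slot."""
    new_state = dict(state)
    for slot, (op, value) in ops.items():
        if op == "overwrite":      # user informed or corrected a value
            new_state[slot] = value
        elif op == "delete":       # user retracted a constraint
            new_state.pop(slot, None)
        # "keep" leaves the slot untouched
    return new_state

state = {"cuisine": "Italian", "area": "center"}
# "Actually, make that Chinese food, anywhere in town."
state = apply_ops(state, {"cuisine": ("overwrite", "Chinese"),
                          "area": ("delete", None)})
# the correction overwrites cuisine; the retracted area constraint is dropped
```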

References

  1. Williams, J., Raux, A., Ramachandran, D., & Black, A. (2013). The dialog state tracking challenge. Proceedings of the SIGDIAL 2013 Conference, 404–413.
  2. Henderson, M., Thomson, B., & Young, S. (2014). Word-based dialog state tracking with recurrent neural networks. Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 292–299. doi:10.3115/v1/W14-4340
  3. Wu, C.-S., Madotto, A., Hosseini-Asl, E., Xiong, C., Socher, R., & Fung, P. (2019). Transferable multi-domain state generator for task-oriented dialogue systems. Proceedings of the 57th Annual Meeting of the ACL, 808–819. doi:10.18653/v1/P19-1078
