Title: CheckDST: Measuring Real-World Generalization of Dialogue State Tracking Performance

Dec 15, 2021

Hyundong Cho, Chinnadhurai Sankar, Christopher Lin, Kaushik Ram Sadagopan, Shahin Shayandeh, Asli Celikyilmaz, Jonathan May, Ahmad Beirami

Abstract: Recent neural models that extend the pretrain-then-finetune paradigm continue to achieve new state-of-the-art results on joint goal accuracy (JGA) for dialogue state tracking (DST) benchmarks. However, we call into question their robustness as they show sharp drops in JGA for conversations containing utterances or dialogue flows with realistic perturbations. Inspired by CheckList (Ribeiro et al., 2020), we design a collection of metrics called CheckDST that facilitate comparisons of DST models on comprehensive dimensions of robustness by testing well-known weaknesses with augmented test sets. We evaluate recent DST models with CheckDST and argue that models should be assessed more holistically rather than pursuing state-of-the-art on JGA since a higher JGA does not guarantee better overall robustness. We find that span-based classification models are resilient to unseen named entities but not robust to language variety, whereas those based on autoregressive language models generalize better to language variety but tend to memorize named entities and often hallucinate. Due to their respective weaknesses, neither approach is yet suitable for real-world deployment. We believe CheckDST is a useful guide for future research to develop task-oriented dialogue models that embody the strengths of various methods.
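For readers unfamiliar with the metric, the sketch below illustrates how JGA is commonly computed and how a drop in JGA on a perturbed test set might be measured. It is not the authors' implementation and does not reproduce any specific CheckDST metric. It assumes a dialogue state is a flat mapping from slot names to values and that a turn counts as correct only if the predicted state matches the gold state exactly. The function name `joint_goal_accuracy`, the slot names, and the entity names are all hypothetical.

```python
from typing import Dict, List

# Assumption: a dialogue state is a flat slot -> value mapping.
DialogueState = Dict[str, str]


def joint_goal_accuracy(
    predictions: List[DialogueState], references: List[DialogueState]
) -> float:
    """Fraction of turns whose predicted state exactly matches the gold state."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predictions, references))
    return correct / len(references)


# Toy original test set (two turns); the entity names are made up.
gold_original = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-name": "acorn guest house"},
]
pred_original = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-name": "acorn guest house"},
]

# Toy perturbed test set: the named entity is swapped for an unseen one.
gold_perturbed = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-name": "zorblat inn"},
]
# A model that memorized training entities might still emit the old name.
pred_perturbed = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-name": "acorn guest house"},
]

jga_original = joint_goal_accuracy(pred_original, gold_original)
jga_perturbed = joint_goal_accuracy(pred_perturbed, gold_perturbed)
jga_drop = jga_original - jga_perturbed

print(f"JGA original:  {jga_original:.2f}")
print(f"JGA perturbed: {jga_perturbed:.2f}")
print(f"JGA drop:      {jga_drop:.2f}")
```

Because a single wrong slot makes the whole turn incorrect, JGA is a strict metric, which is one reason a small perturbation to an utterance can produce a large drop.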