Chinnadhurai Sankar

Step by Step to Fairness: Attributing Societal Bias in Task-oriented Dialogue Systems

Nov 14, 2023
Hsuan Su, Rebecca Qian, Chinnadhurai Sankar, Shahin Shayandeh, Shang-Tse Chen, Hung-yi Lee, Daniel M. Bikel

Recent work has shown considerable improvements in task-oriented dialogue (TOD) systems by utilizing pretrained large language models (LLMs) in an end-to-end manner. However, the biased behavior of each component in a TOD system and error propagation in the end-to-end framework can lead to seriously biased TOD responses. Existing work on fairness focuses only on the total bias of a system. In this paper, we propose a diagnosis method to attribute bias to each component of a TOD system. With the proposed attribution method, we can gain a deeper understanding of the sources of bias, and researchers can mitigate biased model behavior at a more granular level. We conduct experiments to attribute the TOD system's bias along three demographic axes: gender, age, and race. Experimental results show that the bias of a TOD system usually comes from the response generation model.


Continual Dialogue State Tracking via Example-Guided Question Answering

May 23, 2023
Hyundong Cho, Andrea Madotto, Zhaojiang Lin, Khyathi Raghavi Chandu, Satwik Kottur, Jing Xu, Jonathan May, Chinnadhurai Sankar

Dialogue systems are frequently updated to accommodate new services, but naively updating them by continually training with data for new services results in diminishing performance on previously learnt services. Motivated by the insight that dialogue state tracking (DST), a crucial component of dialogue systems that estimates the user's goal as a conversation proceeds, is a simple natural language understanding task, we propose reformulating it as a bundle of granular example-guided question answering tasks to minimize the task shift between services and thus benefit continual learning. Our approach alleviates service-specific memorization and teaches a model to contextualize the given question and example to extract the necessary information from the conversation. We find that a model with just 60M parameters can achieve a significant boost by learning to learn from in-context examples retrieved by a retriever trained to identify turns with similar dialogue state changes. Combined with dialogue-level memory replay, our approach attains state-of-the-art performance on DST continual learning metrics without relying on any complex regularization or parameter expansion methods.

* 11 pages 

AUTODIAL: Efficient Asynchronous Task-Oriented Dialogue Model

Mar 10, 2023
Prajjwal Bhargava, Pooyan Amini, Shahin Shayandeh, Chinnadhurai Sankar

As large dialogue models become commonplace in practice, the problems of high compute requirements for training and inference and of a larger memory footprint still persist. In this work, we present AUTODIAL, a multi-task dialogue model that addresses the challenges of deploying dialogue models. AUTODIAL utilizes parallel decoders to perform tasks such as dialogue act prediction, domain prediction, intent prediction, and dialogue state tracking. Using classification decoders instead of generative decoders allows AUTODIAL to significantly reduce its memory footprint and achieve faster inference than existing generative approaches, namely SimpleTOD. We demonstrate that AUTODIAL provides 3-6x speedups during inference while having 11x fewer parameters on three dialogue tasks compared to SimpleTOD. Our results show that extending current dialogue models to have parallel decoders can be a viable alternative for deploying them in resource-constrained environments.


Data-Efficiency with a Single GPU: An Exploration of Transfer Methods for Small Language Models

Oct 08, 2022
Alon Albalak, Akshat Shrivastava, Chinnadhurai Sankar, Adithya Sagar, Mike Ross

Multi-task learning (MTL), instruction tuning, and prompting have recently been shown to improve the generalizability of large language models to new tasks. However, the benefits of such methods are less well-documented in smaller language models, with some studies finding contradictory results. In this work, we explore and isolate the effects of (i) model size, (ii) general purpose MTL, (iii) in-domain MTL, (iv) instruction tuning, and (v) few-shot fine-tuning for models with fewer than 500 million parameters. Our experiments in the zero-shot setting demonstrate that models gain 31% relative improvement, on average, from general purpose MTL, with an additional 37.6% relative gain from in-domain MTL. Contradictory to prior works on large models, we find that instruction tuning provides a modest 2% performance improvement for small models.


KETOD: Knowledge-Enriched Task-Oriented Dialogue

May 11, 2022
Zhiyu Chen, Bing Liu, Seungwhan Moon, Chinnadhurai Sankar, Paul Crook, William Yang Wang

Existing studies in dialogue system research mostly treat task-oriented dialogue and chit-chat as separate domains. Towards building a human-like assistant that can converse naturally and seamlessly with users, it is important to build a dialogue system that conducts both types of conversations effectively. In this work, we investigate how task-oriented dialogue and knowledge-grounded chit-chat can be effectively integrated into a single model. To this end, we create a new dataset, KETOD (Knowledge-Enriched Task-Oriented Dialogue), where we naturally enrich task-oriented dialogues with chit-chat based on relevant entity knowledge. We also propose two new models, SimpleToDPlus and Combiner, for the proposed task. Experimental results on both automatic and human evaluations show that the proposed methods can significantly improve the performance in knowledge-enriched response generation while maintaining a competitive task-oriented dialog performance. We believe our new dataset will be a valuable resource for future studies. Our dataset and code are publicly available at https://github.com/facebookresearch/ketod.

* NAACL 2022 Findings 

Database Search Results Disambiguation for Task-Oriented Dialog Systems

Dec 15, 2021
Kun Qian, Ahmad Beirami, Satwik Kottur, Shahin Shayandeh, Paul Crook, Alborz Geramifard, Zhou Yu, Chinnadhurai Sankar

As task-oriented dialog systems are becoming increasingly popular in our lives, more realistic tasks have been proposed and explored. However, new practical challenges arise. For instance, current dialog systems cannot effectively handle multiple search results when querying a database, due to the lack of such scenarios in existing public datasets. In this paper, we propose Database Search Result (DSR) Disambiguation, a novel task that focuses on disambiguating database search results, which enhances user experience by allowing them to choose from multiple options instead of just one. To study this task, we augment the popular task-oriented dialog datasets (MultiWOZ and SGD) with turns that resolve ambiguities by (a) synthetically generating turns through a pre-defined grammar, and (b) collecting human paraphrases for a subset. We find that training on our augmented dialog data improves the model's ability to deal with ambiguous scenarios, without sacrificing performance on unmodified turns. Furthermore, pre-fine tuning and multi-task learning help our model to improve performance on DSR-disambiguation even in the absence of in-domain data, suggesting that it can be learned as a universal dialog skill. Our data and code will be made publicly available.


CheckDST: Measuring Real-World Generalization of Dialogue State Tracking Performance

Dec 15, 2021
Hyundong Cho, Chinnadhurai Sankar, Christopher Lin, Kaushik Ram Sadagopan, Shahin Shayandeh, Asli Celikyilmaz, Jonathan May, Ahmad Beirami

Recent neural models that extend the pretrain-then-finetune paradigm continue to achieve new state-of-the-art results on joint goal accuracy (JGA) for dialogue state tracking (DST) benchmarks. However, we call into question their robustness as they show sharp drops in JGA for conversations containing utterances or dialog flows with realistic perturbations. Inspired by CheckList (Ribeiro et al., 2020), we design a collection of metrics called CheckDST that facilitate comparisons of DST models on comprehensive dimensions of robustness by testing well-known weaknesses with augmented test sets. We evaluate recent DST models with CheckDST and argue that models should be assessed more holistically rather than pursuing state-of-the-art on JGA since a higher JGA does not guarantee better overall robustness. We find that span-based classification models are resilient to unseen named entities but not robust to language variety, whereas those based on autoregressive language models generalize better to language variety but tend to memorize named entities and often hallucinate. Due to their respective weaknesses, neither approach is yet suitable for real-world deployment. We believe CheckDST is a useful guide for future research to develop task-oriented dialogue models that embody the strengths of various methods.
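
For readers unfamiliar with the metric discussed above, joint goal accuracy is conventionally the fraction of dialogue turns whose predicted state matches the gold state on every slot. The following is a minimal sketch of that definition, not the CheckDST implementation; the function name, the dict-based state representation, and the toy slot names are assumptions made for illustration.

```python
from typing import Dict, List

# Assumed representation: one dict per turn mapping slot name -> value.
DialogueState = Dict[str, str]


def joint_goal_accuracy(predicted: List[DialogueState],
                        gold: List[DialogueState]) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    # A turn counts only if every slot and value matches (dict equality).
    exact = sum(1 for p, g in zip(predicted, gold) if p == g)
    return exact / len(gold)


# Toy, hypothetical two-turn dialogue: the second turn has one wrong slot value.
gold = [
    {"train-destination": "cambridge"},
    {"train-destination": "cambridge", "train-day": "friday"},
]
pred = [
    {"train-destination": "cambridge"},
    {"train-destination": "cambridge", "train-day": "monday"},
]
jga = joint_goal_accuracy(pred, gold)
```

Because a single wrong slot invalidates the whole turn, the score is strict, which is one reason a model can look strong on JGA while still being fragile under the perturbations the abstract describes.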


DAIR: Data Augmented Invariant Regularization

Oct 21, 2021
Tianjian Huang, Shaunak Halbe, Chinnadhurai Sankar, Pooyan Amini, Satwik Kottur, Alborz Geramifard, Meisam Razaviyayn, Ahmad Beirami

While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM generalizes poorly under distribution shift. This is partly explained by overfitting to spurious features such as background in images or named entities in natural language. Synthetic data augmentation followed by empirical risk minimization (DA-ERM) is a simple yet powerful solution to remedy this problem. In this paper, we propose data augmented invariant regularization (DAIR). The idea of DAIR is based on the observation that the model's loss should be consistent on an augmented sample and the original one. DAIR adds a regularizer to DA-ERM that penalizes such loss inconsistency. Both theoretically and through empirical experiments, we show that a particular form of the DAIR regularizer consistently performs well in a variety of settings. We apply it to multiple real-world learning problems involving domain shift, namely robust regression, visual question answering, robust deep neural network training, and task-oriented dialog modeling. Our experiments show that DAIR consistently outperforms ERM and DA-ERM at little marginal cost and sets new state-of-the-art results on several benchmarks.

* 15 pages 
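
The idea in the abstract above, a DA-ERM loss plus a penalty on the gap between the loss on an original sample and the loss on its augmented copy, can be sketched for a single sample pair as follows. This is an illustrative sketch only: the abstract does not state which form of the regularizer the authors use, so the squared difference of square-rooted losses below is an assumption, as are the function names and the weight `lam`.

```python
import math
from typing import Callable


def sqrt_gap_penalty(loss_orig: float, loss_aug: float) -> float:
    """Assumed penalty: squared difference of the square roots of two losses."""
    return (math.sqrt(loss_orig) - math.sqrt(loss_aug)) ** 2


def dair_objective(loss_orig: float,
                   loss_aug: float,
                   lam: float,
                   penalty: Callable[[float, float], float] = sqrt_gap_penalty
                   ) -> float:
    """DA-ERM term (mean of both losses) plus lam times the consistency penalty."""
    da_erm = 0.5 * (loss_orig + loss_aug)
    return da_erm + lam * penalty(loss_orig, loss_aug)


# Hypothetical per-sample losses: the model does worse on the augmented copy.
objective = dair_objective(loss_orig=0.25, loss_aug=1.0, lam=2.0)
# DA-ERM term = 0.625, penalty = (0.5 - 1.0)^2 = 0.25, total = 0.625 + 2 * 0.25
```

When the two losses agree the penalty vanishes and the objective reduces to plain DA-ERM, so `lam` only controls how strongly inconsistent pairs are discouraged.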

Annotation Inconsistency and Entity Bias in MultiWOZ

May 29, 2021
Kun Qian, Ahmad Beirami, Zhouhan Lin, Ankita De, Alborz Geramifard, Zhou Yu, Chinnadhurai Sankar

MultiWOZ is one of the most popular multi-domain task-oriented dialog datasets, containing 10K+ annotated dialogs covering eight domains. It has been widely accepted as a benchmark for various dialog tasks, e.g., dialog state tracking (DST), natural language generation (NLG), and end-to-end (E2E) dialog modeling. In this work, we identify an overlooked issue with dialog state annotation inconsistencies in the dataset, where a slot type is tagged inconsistently across similar dialogs leading to confusion for DST modeling. We propose an automated correction for this issue, which is present in a whopping 70% of the dialogs. Additionally, we notice that there is significant entity bias in the dataset (e.g., "cambridge" appears in 50% of the destination cities in the train domain). The entity bias can potentially lead to named entity memorization in generative models, which may go unnoticed as the test set suffers from a similar entity bias as well. We release a new test set with all entities replaced with unseen entities. Finally, we benchmark joint goal accuracy (JGA) of the state-of-the-art DST baselines on these modified versions of the data. Our experiments show that the annotation inconsistency corrections lead to 7-10% improvement in JGA. On the other hand, we observe a 29% drop in JGA when models are evaluated on the new test set with unseen entities.

* Accepted by SIGDIAL 2021 

DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue

Jan 01, 2021
Hung Le, Chinnadhurai Sankar, Seungwhan Moon, Ahmad Beirami, Alborz Geramifard, Satwik Kottur

A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations. Building such dialogue systems is a challenging problem involving complex multimodal and temporal inputs, and studying them independently is hard with existing datasets. Existing benchmarks do not have enough annotations to help analyze dialogue systems and understand their linguistic and visual reasoning capability and limitations in isolation. These benchmarks are also not explicitly designed to minimize biases that models can exploit without actual reasoning. To address these limitations, in this paper, we present a diagnostic dataset that can test a range of reasoning abilities on videos and dialogues. The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning each question requires, including cross-turn video interval tracking and dialogue object tracking. We use our dataset to analyze several dialogue system approaches, providing interesting insights into their abilities and limitations. In total, the dataset contains 10 instances of 10-round dialogues for each of ~11k synthetic videos, resulting in more than 100k dialogues and 1M question-answer pairs. Our code and dataset will be made public.

* 16 pages, 11 figures, 3 tables