Shehzaad Dhuliawala

Chain-of-Verification Reduces Hallucination in Large Language Models

Sep 20, 2023
Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston

Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata and closed-book MultiSpanQA to longform text generation.
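As a rough illustration of the four-stage pipeline described above, the sketch below wires the steps together around a generic llm(prompt) completion call; the function name and prompt wording are placeholders rather than the paper's released code, and the key point is that step (iii) answers each verification question without showing the draft.

def llm(prompt: str) -> str:
    # hypothetical text-completion call; plug in any chat/completion API
    raise NotImplementedError

def chain_of_verification(question: str) -> str:
    # (i) draft an initial baseline response
    draft = llm(f"Answer the question.\nQ: {question}\nA:")

    # (ii) plan verification questions that fact-check the draft
    plan = llm(
        "Write short fact-checking questions, one per line, for this answer.\n"
        f"Q: {question}\nDraft answer: {draft}\nQuestions:"
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # (iii) answer each verification question independently, without the draft,
    # so the answers are not biased by its possible errors
    checks = [(q, llm(f"Q: {q}\nA:")) for q in questions]

    # (iv) generate the final verified response from the collected evidence
    evidence = "\n".join(f"- {q} -> {a}" for q, a in checks)
    return llm(
        f"Original question: {question}\nDraft answer: {draft}\n"
        f"Verification results:\n{evidence}\n"
        "Write a corrected final answer consistent with the verification results:"
    )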


Variational Classification

May 17, 2023
Shehzaad Dhuliawala, Mrinmaya Sachan, Carl Allen

We present a novel extension of the traditional neural network approach to classification tasks, referred to as variational classification (VC). By incorporating latent variable modeling, akin to the relationship between variational autoencoders and traditional autoencoders, we derive a training objective based on the evidence lower bound (ELBO), optimized using an adversarial approach. Our VC model allows for more flexibility in design choices, in particular class-conditional latent priors, in place of the implicit assumptions made in off-the-shelf softmax classifiers. Empirical evaluation on image and text classification datasets demonstrates the effectiveness of our approach in terms of maintaining prediction accuracy while improving other desirable properties such as calibration and adversarial robustness, even when applied to out-of-domain data.
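A minimal PyTorch sketch of the idea, assuming class-conditional Gaussian latent priors with identity covariance; the layer sizes are placeholders and the paper's adversarial optimization of the ELBO is omitted, so this illustrates the shape of the objective rather than the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalClassifier(nn.Module):
    def __init__(self, input_dim: int, latent_dim: int, num_classes: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(256, latent_dim)   # log-variance of q(z|x)
        self.classifier = nn.Linear(latent_dim, num_classes)  # p(y|z)
        # class-conditional prior p(z|y) = N(prior_mu[y], I)
        self.prior_mu = nn.Parameter(torch.randn(num_classes, latent_dim))

    def forward(self, x, y):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        logits = self.classifier(z)
        nll = F.cross_entropy(logits, y)                          # classification term
        diff = mu - self.prior_mu[y]
        kl = 0.5 * (logvar.exp() + diff.pow(2) - 1.0 - logvar).sum(dim=1).mean()
        return nll + kl, logits                                   # ELBO-style loss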


Extracting Victim Counts from Text

Feb 23, 2023
Mian Zhong, Shehzaad Dhuliawala, Niklas Stoehr

Decision-makers in the humanitarian sector rely on timely and exact information during crisis events. Knowing how many civilians were injured during an earthquake is vital to allocate aid properly. Information about such victim counts is often only available within full-text event descriptions from newspapers and other reports. Extracting numbers from text is challenging: numbers have different formats and may require numeric reasoning. This renders purely string-matching-based approaches insufficient. As a consequence, fine-grained counts of injured, displaced, or abused victims beyond fatalities are often not extracted and remain unseen. We cast victim count extraction as a question answering (QA) task with a regression or classification objective. We compare regex, dependency parsing, and semantic role labeling-based approaches, as well as advanced text-to-text models. Beyond model accuracy, we analyze extraction reliability and robustness, which are key for this sensitive task. In particular, we discuss model calibration and investigate few-shot and out-of-distribution performance. Ultimately, we make a comprehensive recommendation on which model to select for different desiderata and data domains. Our work is among the first to apply numeracy-focused large language models in a real-world use case with a positive impact.

* Long paper accepted at EACL 2023 main conference 
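One possible way to set up the QA framing, sketched with a generic extractive question-answering pipeline from Hugging Face transformers; the question template and numeric post-processing are illustrative assumptions, not the paper's exact models, which also cover written-out numbers and use regression/classification heads.

import re
from transformers import pipeline

qa = pipeline("question-answering")  # any extractive QA model

def extract_count(event_text: str, victim_type: str = "injured"):
    question = f"How many people were {victim_type}?"
    span = qa(question=question, context=event_text)["answer"]
    # naive numeric parsing of the predicted span; real systems must also
    # handle spelled-out numbers, ranges, and hedged counts
    match = re.search(r"\d[\d,]*", span)
    return int(match.group().replace(",", "")) if match else None

print(extract_count("The earthquake left 32 people injured and 4 dead."))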

Poor Man's Quality Estimation: Predicting Reference-Based MT Metrics Without the Reference

Jan 28, 2023
Vilém Zouhar, Shehzaad Dhuliawala, Wangchunshu Zhou, Nico Daheim, Tom Kocmi, Yuchen Eleanor Jiang, Mrinmaya Sachan

Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference. State-of-the-art QE systems based on pretrained language models achieve remarkable correlations with human judgements, yet they are computationally heavy and require human annotations, which are slow and expensive to create. To address these limitations, we define the problem of metric estimation (ME), where one predicts the automated metric scores, also without the reference. We show that even without access to the reference, our model can estimate automated metrics ($\rho$=60% for BLEU, $\rho$=51% for other metrics) at the sentence level. Because automated metrics correlate with human judgements, we can leverage the ME task for pre-training a QE model. For the QE task, we find that pre-training on TER is better ($\rho$=23%) than training from scratch ($\rho$=20%).

* Accepted at EACL23 (main) 
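The core of the metric-estimation setup can be sketched as follows: the reference is used only to compute the target metric score, and a model then learns to predict that score from the source and hypothesis alone. The sacrebleu call below is real; train_regressor is a placeholder for whatever sentence-pair regression model one chooses, not the paper's system.

import sacrebleu

def build_me_examples(sources, hypotheses, references):
    examples = []
    for src, hyp, ref in zip(sources, hypotheses, references):
        target = sacrebleu.sentence_bleu(hyp, [ref]).score  # label comes from the metric
        examples.append(((src, hyp), target))               # the reference is dropped
    return examples

def train_regressor(examples):
    # placeholder: fit any (source, hypothesis) -> score model, e.g. a cross-encoder;
    # a model pre-trained on such targets (BLEU, TER, ...) can afterwards be
    # fine-tuned on scarce human judgements for the QE task
    raise NotImplementedError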

Calibration of Machine Reading Systems at Scale

Mar 20, 2022
Shehzaad Dhuliawala, Leonard Adolphs, Rajarshi Das, Mrinmaya Sachan

In typical machine learning systems, an estimate of the probability of the prediction is used to assess the system's confidence in the prediction. This confidence measure is usually uncalibrated; i.e., the system's confidence in the prediction does not match the true probability of the predicted output. In this paper, we present an investigation into calibrating open-setting machine reading systems such as open-domain question answering and claim verification systems. We show that calibrating such complex systems, which contain discrete retrieval and deep reading components, is challenging, and that current calibration techniques fail to scale to these settings. We propose simple extensions to existing calibration approaches that allow us to adapt them to these settings. Our experimental results reveal that the approach works well and can be useful for selectively predicting answers when question answering systems are posed with unanswerable or out-of-the-training-distribution questions.

* Accepted at ACL 2022 Findings 
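In the spirit of the extensions described above (though not necessarily the paper's exact recipe), one simple variant is to fit a calibrator on features from both the retrieval and reading components and use its probability for selective prediction; the feature set and threshold below are illustrative toy choices.

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy per-question features: [retriever score, reader confidence]
X_dev = np.array([[0.9, 0.8], [0.2, 0.7], [0.8, 0.3], [0.1, 0.2]])
y_dev = np.array([1, 0, 1, 0])  # 1 = the predicted answer was correct

calibrator = LogisticRegression().fit(X_dev, y_dev)

def answer_or_abstain(features, threshold: float = 0.5) -> str:
    p_correct = calibrator.predict_proba(np.array([features]))[0, 1]
    return "answer" if p_correct >= threshold else "abstain"

print(answer_or_abstain([0.85, 0.75]))  # likely "answer"
print(answer_or_abstain([0.15, 0.30]))  # likely "abstain"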

Case-based Reasoning for Better Generalization in Text-Adventure Games

Oct 16, 2021
Mattia Atzeni, Shehzaad Dhuliawala, Keerthiram Murugesan, Mrinmaya Sachan

Text-based games (TBG) have emerged as promising environments for driving research in grounded language understanding and studying problems like generalization and sample efficiency. Several deep reinforcement learning (RL) methods with varying architectures and learning schemes have been proposed for TBGs. However, these methods fail to generalize efficiently, especially under distributional shifts. In a departure from deep RL approaches, in this paper, we propose a general method inspired by case-based reasoning to train agents and generalize out of the training distribution. The case-based reasoner collects instances of positive experiences from the agent's interaction with the world in the past and later reuses the collected experiences to act efficiently. The method can be applied in conjunction with any existing on-policy neural agent in the literature for TBGs. Our experiments show that the proposed approach consistently improves existing methods, obtains good out-of-distribution generalization, and achieves new state-of-the-art results on widely used environments.
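A much-simplified sketch of the case memory: store (observation, action) pairs from past positive-reward episodes and, when a new observation is similar enough to a stored case, reuse its action, otherwise defer to the underlying neural agent. The token-overlap similarity and the threshold are placeholder choices, not the paper's retrieval mechanism.

def similarity(obs_a: str, obs_b: str) -> float:
    a, b = set(obs_a.lower().split()), set(obs_b.lower().split())
    return len(a & b) / max(len(a | b), 1)

class CaseMemory:
    def __init__(self):
        self.cases = []  # (observation, action) pairs from positive episodes

    def add_episode(self, trajectory, total_reward: float):
        if total_reward > 0:            # keep only positive experiences
            self.cases.extend(trajectory)

    def retrieve(self, observation: str, min_sim: float = 0.5):
        best = max(self.cases, key=lambda c: similarity(observation, c[0]), default=None)
        if best and similarity(observation, best[0]) >= min_sim:
            return best[1]              # reuse the past action
        return None                     # defer to the base RL policy

memory = CaseMemory()
memory.add_episode([("you see a locked chest and a key", "take key")], total_reward=1.0)
print(memory.retrieve("you see a locked chest and a rusty key") or "use base policy")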


TopiOCQA: Open-domain Conversational Question Answering with Topic Switching

Oct 02, 2021
Vaibhav Adlakha, Shehzaad Dhuliawala, Kaheer Suleman, Harm de Vries, Siva Reddy

In a conversational question answering scenario, a questioner seeks to extract information about a topic through a series of interdependent questions and answers. As the conversation progresses, they may switch to related topics, a phenomenon commonly observed in information-seeking search sessions. However, current datasets for conversational question answering are limited in two ways: 1) they do not contain topic switches; and 2) they assume the reference text for the conversation is given, i.e., the setting is not open-domain. We introduce TopiOCQA (pronounced Tapioca), an open-domain conversational dataset with topic switches on Wikipedia. TopiOCQA contains 3,920 conversations with information-seeking questions and free-form answers. TopiOCQA poses a challenging test-bed for models, where efficient retrieval is required over multiple turns of the same conversation, in conjunction with constructing valid responses using conversational history. We evaluate several baselines by combining state-of-the-art document retrieval methods with neural reader models. Our best model achieves an F1 of 51.9 and a BLEU score of 42.1, which fall short of human performance by 18.3 and 17.6 points respectively, indicating the difficulty of our dataset. Our dataset and code will be available at https://mcgill-nlp.github.io/topiocqa
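The baseline pattern evaluated above (retrieval followed by reading, conditioned on the conversation so far) can be sketched as below; retrieve and read are placeholders for any dense retriever and neural reader, not the released TopiOCQA baselines.

def retrieve(query: str, k: int = 5) -> list:
    # placeholder: dense or sparse retrieval over a Wikipedia index
    raise NotImplementedError

def read(question: str, passages: list, history: list) -> str:
    # placeholder: neural reader / generator conditioned on the dialogue
    raise NotImplementedError

def answer_turn(history: list, question: str) -> str:
    # fold prior turns into the query so the retriever can track topic switches
    query = " ".join(q + " " + a for q, a in history) + " " + question
    passages = retrieve(query)
    return read(question, passages, history)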


How to Query Language Models?

Aug 04, 2021
Leonard Adolphs, Shehzaad Dhuliawala, Thomas Hofmann

Large pre-trained language models (LMs) are capable of recovering not only linguistic but also factual and commonsense knowledge. To access the knowledge stored in mask-based LMs, we can use cloze-style questions and let the model fill in the blank. The flexibility advantage over structured knowledge bases comes with the drawback of finding the right query for a certain information need. Inspired by how humans disambiguate a question, we propose to query LMs by example. To clarify the ambiguous question "Who does Neuer play for?", a successful strategy is to demonstrate the relation using another subject, e.g., "Ronaldo plays for Portugal. Who does Neuer play for?". We apply this approach of querying by example to the LAMA probe and obtain substantial improvements of up to 37.8% for BERT-large on the T-REx data when providing only 10 demonstrations, even outperforming a baseline that queries the model with up to 40 paraphrases of the question. The examples are provided through the model's context and thus require neither fine-tuning nor an additional forward pass. This suggests that LMs contain more factual and commonsense knowledge than previously assumed, if we query the model in the right way.
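A minimal sketch of querying by example with a masked LM, using the Hugging Face fill-mask pipeline: a demonstration with a different subject is simply prepended to the cloze query. The cloze wording and default model are illustrative, not the paper's exact LAMA setup.

from transformers import pipeline

fill = pipeline("fill-mask")            # any masked LM
MASK = fill.tokenizer.mask_token

plain_query = f"Neuer plays for {MASK}."
by_example = f"Ronaldo plays for Portugal. Neuer plays for {MASK}."

for prompt in (plain_query, by_example):
    top = fill(prompt)[0]               # highest-probability fill
    print(f"{prompt!r} -> {top['token_str']!r} ({top['score']:.2f})")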


A Simple Approach to Case-Based Reasoning in Knowledge Bases

Jun 25, 2020
Rajarshi Das, Ameya Godbole, Shehzaad Dhuliawala, Manzil Zaheer, Andrew McCallum

We present a surprisingly simple yet accurate approach to reasoning in knowledge graphs (KGs) that requires no training and is reminiscent of case-based reasoning in classical artificial intelligence (AI). Consider the task of finding a target entity given a source entity and a binary relation. Our non-parametric approach derives crisp logical rules for each query by finding multiple graph path patterns that connect similar source entities through the given relation. Using our method, we obtain new state-of-the-art accuracy on NELL-995 and FB-122, outperforming all previous models. We also demonstrate that our model is robust in low-data settings, outperforming recently proposed meta-learning approaches.
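A toy illustration of the idea on a hand-made graph: for a query (entity, relation), gather relation paths that recover the known answers of similar entities, then follow the same paths from the query entity. The entity names, candidate paths, and notion of "similar entities" are simplified assumptions here; the paper operates on full benchmark KGs.

from collections import defaultdict

# KG stored as (head, relation) -> set of tails
edges = defaultdict(set)
triples = [
    ("Barack_Obama", "born_in", "Honolulu"), ("Honolulu", "city_in", "USA"),
    ("Barack_Obama", "citizen_of", "USA"),
    ("Angela_Merkel", "born_in", "Hamburg"), ("Hamburg", "city_in", "Germany"),
    ("Angela_Merkel", "citizen_of", "Germany"),
    ("Justin_Trudeau", "born_in", "Ottawa"), ("Ottawa", "city_in", "Canada"),
]
for h, r, t in triples:
    edges[(h, r)].add(t)

def follow(entity, path):
    frontier = {entity}
    for rel in path:
        frontier = {t for e in frontier for t in edges[(e, rel)]}
    return frontier

def cbr_query(entity, relation, similar_entities, candidate_paths):
    # keep the paths that reproduce the known answers of the similar entities
    good = [p for p in candidate_paths
            if any(edges[(c, relation)] & follow(c, p) for c in similar_entities)]
    answers = set()
    for p in good:
        answers |= follow(entity, p)    # apply the reusable paths to the query entity
    return answers

paths = [("born_in",), ("born_in", "city_in")]
print(cbr_query("Justin_Trudeau", "citizen_of",
                ["Barack_Obama", "Angela_Merkel"], paths))   # {'Canada'}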
