
Mark Yatskar


Pachinko: Patching Interpretable QA Models through Natural Language Feedback

Nov 16, 2023
Chaitanya Malaviya, Subin Lee, Dan Roth, Mark Yatskar

Eliciting feedback from end users of NLP models can be beneficial for improving models. However, how should we present model responses to users so they are most amenable to correction through user feedback? Further, what properties do users value in order to understand and trust responses? We answer these questions by analyzing the effect of rationales generated by QA models to support their answers. We specifically consider decomposed question-answering models that first extract an intermediate rationale based on a context and a question and then use solely this rationale to answer the question. A rationale outlines the approach followed by the model to answer the question. Our work considers various formats of these rationales that vary according to well-defined properties of interest. We sample these rationales from large language models using few-shot prompting for two reading comprehension datasets, and then perform two user studies. In the first, we present users with incorrect answers and their corresponding rationales in various formats and ask them to provide natural language feedback to revise the rationale. We then measure the effectiveness of this feedback in patching these rationales through in-context learning. The second study evaluates how well different rationale formats enable users to understand and trust model answers when they are correct. We find that rationale formats significantly affect how easy it is (1) for users to give feedback for rationales, and (2) for models to subsequently execute this feedback. In addition to influencing critiquability, certain formats significantly enhance user-reported understanding and trust of model outputs.

* Code & data available at https://github.com/chaitanyamalaviya/pachinko 
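
Below is a minimal sketch of the decomposed QA and feedback-patching loop the abstract describes, assuming a generic few-shot prompted LLM behind a placeholder call_llm; the prompt wording and function names are illustrative, not the released implementation:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any few-shot prompted language model call."""
    raise NotImplementedError

def extract_rationale(context: str, question: str) -> str:
    # Stage 1: produce an intermediate rationale from the context and question.
    return call_llm(f"Context: {context}\nQuestion: {question}\nRationale:")

def answer_from_rationale(question: str, rationale: str) -> str:
    # Stage 2: answer using only the rationale, not the original context.
    return call_llm(f"Rationale: {rationale}\nQuestion: {question}\nAnswer:")

def patch_rationale(rationale: str, feedback: str) -> str:
    # Patching: revise the rationale in-context with natural language feedback;
    # the revised rationale can then be re-fed to answer_from_rationale.
    return call_llm(f"Rationale: {rationale}\nFeedback: {feedback}\nRevised rationale:")
```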

Interpretable-by-Design Text Classification with Iteratively Generated Concept Bottleneck

Oct 30, 2023
Josh Magnus Ludan, Qing Lyu, Yue Yang, Liam Dugan, Mark Yatskar, Chris Callison-Burch


Deep neural networks excel in text classification tasks, yet their application in high-stakes domains is hindered by their lack of interpretability. To address this, we propose Text Bottleneck Models (TBMs), an intrinsically interpretable text classification framework that offers both global and local explanations. Rather than directly predicting the output label, TBMs predict categorical values for a sparse set of salient concepts and use a linear layer over those concept values to produce the final prediction. These concepts can be automatically discovered and measured by a large language model (LLM), without the need for human curation. On 12 diverse datasets, using GPT-4 for both concept generation and measurement, we show that TBMs can rival the performance of established black-box baselines such as few-shot GPT-4 and finetuned DeBERTa, while falling short against finetuned GPT-3.5. Overall, our findings suggest that TBMs are a promising new framework that enhances interpretability, with minimal performance tradeoffs, particularly for general-domain text.
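
As a rough illustration of the pipeline above, the following sketch measures concept values with a placeholder LLM callable and fits a linear layer over them; the function names and scoring prompt are assumptions, not the paper's code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def measure_concepts(llm, text, concepts):
    # Ask the LLM for a categorical value of each concept for this text,
    # e.g. on a small ordinal scale; `llm` is a placeholder callable.
    return [float(llm(f"Text: {text}\nConcept: {c}\nValue (0-2):")) for c in concepts]

def fit_tbm(llm, texts, labels, concepts):
    # Concept values form the bottleneck; a linear model on top yields the
    # prediction, so its weights act as a global explanation and per-example
    # concept values act as local explanations.
    X = np.array([measure_concepts(llm, t, concepts) for t in texts])
    return LogisticRegression().fit(X, labels)
```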


ExpertQA: Expert-Curated Questions and Attributed Answers

Sep 14, 2023
Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, Dan Roth


As language models are adopted by a more sophisticated and diverse set of users, it becomes critical to guarantee that they provide factually correct information supported by verifiable sources, across fields of study and professions. This is especially the case for high-stakes fields, such as medicine and law, where the risk of propagating false information is high and can lead to undesirable societal consequences. Previous work studying factuality and attribution has not focused on analyzing these characteristics of language model outputs in domain-specific scenarios. In this work, we present an evaluation study analyzing various axes of factuality and attribution in responses from several systems, by bringing domain experts into the loop. Specifically, we first collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions. We also ask experts to revise answers produced by language models, which leads to ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.

* Dataset & code is available at https://github.com/chaitanyamalaviya/expertqa 

Interpretable by Design Visual Question Answering

May 24, 2023
Xingyu Fu, Ben Zhou, Sihao Chen, Mark Yatskar, Dan Roth


Model interpretability has long been a hard problem for the AI community, especially in the multimodal setting, where vision and language need to be aligned and reasoned over at the same time. In this paper, we specifically focus on the problem of Visual Question Answering (VQA). While previous research tries to probe into the network structures of black-box multimodal models, we propose to tackle the problem from a different angle: to treat interpretability as an explicit additional goal. Given an image and question, we argue that an interpretable VQA model should be able to tell what conclusions it can draw from which part of the image, and show how each statement helps to arrive at an answer. We introduce InterVQA: Interpretable-by-design VQA, where we design an explicit intermediate dynamic reasoning structure for VQA problems and enforce symbolic reasoning that uses only this structure for final answer prediction. InterVQA produces high-quality explicit intermediate reasoning steps, while maintaining end-task performance similar to the state of the art (SOTA).

* Multimodal, Vision and Language 
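
A loose, heavily simplified sketch of the kind of intermediate structure the abstract describes, namely statements grounded in image regions with a symbolic final step that consults only those statements; the data layout and decision rule below are assumptions, not the InterVQA design:

```python
from dataclasses import dataclass

@dataclass
class Statement:
    region: tuple          # image region the statement is grounded in, e.g. (x, y, w, h)
    text: str              # the conclusion drawn from that region
    supports_answer: bool  # whether it supports the candidate answer

def symbolic_answer(candidate, statements):
    # The final prediction consults only the intermediate structure, not raw pixels.
    return candidate if statements and all(s.supports_answer for s in statements) else None
```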

AmbiCoref: Evaluating Human and Model Sensitivity to Ambiguous Coreference

Feb 03, 2023
Yuewei Yuan, Chaitanya Malaviya, Mark Yatskar


Given a sentence "Abby told Brittney that she upset Courtney", one would struggle to understand who "she" refers to, and ask for clarification. However, if the word "upset" were replaced with "hugged", "she" unambiguously refers to Abby. We study whether modern coreference resolution models are sensitive to such pronominal ambiguity. To this end, we construct AmbiCoref, a diagnostic corpus of minimal sentence pairs with ambiguous and unambiguous referents. Our examples generalize psycholinguistic studies of human perception of ambiguity around particular arrangements of verbs and their arguments. Analysis shows that (1) humans are less sure of referents in ambiguous AmbiCoref examples than unambiguous ones, and (2) most coreference models show little difference in output between ambiguous and unambiguous pairs. We release AmbiCoref as a diagnostic corpus for testing whether models treat ambiguity similarly to humans.

* EACL 2023 Findings 
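
The construction can be illustrated with a toy minimal-pair template; resolver_confidence is a hypothetical hook for whatever coreference system is being probed, not part of the released corpus:

```python
TEMPLATE = "{a} told {b} that she {verb} {c}."

def make_pair(a="Abby", b="Brittney", c="Courtney"):
    ambiguous = TEMPLATE.format(a=a, b=b, c=c, verb="upset")     # "she" could be Abby or Brittney
    unambiguous = TEMPLATE.format(a=a, b=b, c=c, verb="hugged")  # "she" can only be Abby
    return ambiguous, unambiguous

def ambiguity_gap(resolver_confidence, pair):
    # A model that treats ambiguity like humans should be less confident about
    # the referent of "she" in the ambiguous sentence than in the unambiguous one.
    ambiguous, unambiguous = pair
    return resolver_confidence(unambiguous) - resolver_confidence(ambiguous)
```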

Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification

Nov 21, 2022
Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, Mark Yatskar


Concept Bottleneck Models (CBMs) are inherently interpretable models that factor model decisions into human-readable concepts. They allow people to easily understand why a model is failing, a critical feature for high-stakes applications. CBMs require manually specified concepts and often under-perform their black-box counterparts, preventing their broad adoption. We address these shortcomings and are the first to show how to construct high-performance CBMs, without manual concept specification, that reach accuracy similar to black-box models. Our approach, Language Guided Bottlenecks (LaBo), leverages a language model, GPT-3, to define a large space of possible bottlenecks. Given a problem domain, LaBo uses GPT-3 to produce factual sentences about categories to form candidate concepts. LaBo efficiently searches possible bottlenecks through a novel submodular utility that promotes the selection of discriminative and diverse information. Ultimately, GPT-3's sentential concepts can be aligned to images using CLIP to form a bottleneck layer. Experiments demonstrate that LaBo is a highly effective prior for concepts important to visual recognition. In an evaluation with 11 diverse datasets, LaBo bottlenecks excel at few-shot classification: they are 11.7% more accurate than black-box linear probes at 1 shot and comparable with more data. Overall, LaBo demonstrates that inherently interpretable models can be widely applied at similar, or better, performance than black-box approaches.

* 18 pages, 12 figures, 16 tables 
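
A toy version of the greedy concept-selection idea, trading discriminativeness against redundancy with already-selected concepts; the score and similarity functions are placeholders, and the actual submodular utility and CLIP-based alignment live in the released LaBo code:

```python
def select_concepts(candidates, score, similarity, k, diversity_weight=0.5):
    """Greedily build a bottleneck of k concepts, balancing a per-concept
    discriminativeness score against similarity to concepts already chosen."""
    selected = []
    remaining = list(candidates)
    while len(selected) < k and remaining:
        def gain(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return score(c) - diversity_weight * redundancy
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```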

Cascading Biases: Investigating the Effect of Heuristic Annotation Strategies on Data and Models

Oct 24, 2022
Chaitanya Malaviya, Sudeep Bhatia, Mark Yatskar


Cognitive psychologists have documented that humans use cognitive heuristics, or mental shortcuts, to make quick decisions while expending less effort. We hypothesize that such heuristic use among annotators performing annotation work on crowdsourcing platforms cascades into data quality and model robustness. In this work, we study cognitive heuristic use in the context of annotating multiple-choice reading comprehension datasets. We propose tracking annotator heuristic traces, where we tangibly measure low-effort annotation strategies that could indicate usage of various cognitive heuristics. We find evidence that annotators might be using multiple such heuristics, based on correlations with a battery of psychological tests. Importantly, heuristic use among annotators determines data quality along several dimensions: (1) known biased models, such as partial-input models, more easily solve examples authored by annotators who rate highly on heuristic use, (2) models trained on data from annotators scoring highly on heuristic use do not generalize as well, and (3) heuristic-seeking annotators tend to create qualitatively less challenging examples. Our findings suggest that tracking heuristic usage among annotators can potentially help with collecting challenging datasets and diagnosing model biases.

* EMNLP 2022 
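
One of the analyses above can be sketched as a simple split of examples by their author's heuristic-use score; the accessor functions are assumptions about the data layout, not the released code:

```python
import numpy as np

def partial_input_gap(examples, heuristic_score, partial_input_correct, threshold):
    # heuristic_score(e) returns the authoring annotator's heuristic-use score;
    # partial_input_correct(e) returns whether a partial-input baseline solved e.
    high = [partial_input_correct(e) for e in examples if heuristic_score(e) >= threshold]
    low = [partial_input_correct(e) for e in examples if heuristic_score(e) < threshold]
    # A positive gap suggests heuristic-authored examples are easier to shortcut.
    return np.mean(high) - np.mean(low)
```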

Visualizing the Obvious: A Concreteness-based Ensemble Model for Noun Property Prediction

Oct 24, 2022
Yue Yang, Artemis Panagopoulou, Marianna Apidianaki, Mark Yatskar, Chris Callison-Burch


Neural language models encode rich knowledge about entities and their relationships which can be extracted from their representations using probing. Common properties of nouns (e.g., red strawberries, small ant) are, however, more challenging to extract compared to other types of knowledge because they are rarely explicitly stated in texts. We hypothesize this to mainly be the case for perceptual properties which are obvious to the participants in the communication. We propose to extract these properties from images and use them in an ensemble model, in order to complement the information that is extracted from language models. We consider perceptual properties to be more concrete than abstract properties (e.g., interesting, flawless). We propose to use the adjectives' concreteness score as a lever to calibrate the contribution of each source (text vs. images). We evaluate our ensemble model in a ranking task where the actual properties of a noun need to be ranked higher than other non-relevant properties. Our results show that the proposed combination of text and images greatly improves noun property prediction compared to powerful text-based language models.

* Findings of EMNLP 2022; the first two authors contributed equally 
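
The calibration idea reduces to a concreteness-weighted mixture of the two sources, roughly as below; the exact weighting used in the paper may differ, and the function names are illustrative:

```python
def ensemble_score(text_score, image_score, concreteness):
    # concreteness is assumed to be normalized to [0, 1]; concrete properties
    # lean on the image-based score, abstract ones on the language-model score.
    return concreteness * image_score + (1.0 - concreteness) * text_score

def rank_properties(candidates):
    # candidates maps each property to (text_score, image_score, concreteness);
    # actual properties of the noun should be ranked above non-relevant ones.
    return sorted(candidates, key=lambda p: ensemble_score(*candidates[p]), reverse=True)
```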

Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text

Dec 01, 2021
Christopher Clark, Jordi Salvador, Dustin Schwenk, Derrick Bonafilia, Mark Yatskar, Eric Kolve, Alvaro Herrasti, Jonghyun Choi, Sachin Mehta, Sam Skjonsberg, Carissa Schoenick, Aaron Sarnat, Hannaneh Hajishirzi, Aniruddha Kembhavi, Oren Etzioni, Ali Farhadi

Communicating with humans is challenging for AIs because it requires a shared understanding of the world, complex semantics (e.g., metaphors or analogies), and at times multi-modal gestures (e.g., pointing with a finger, or an arrow in a diagram). We investigate these challenges in the context of Iconary, a collaborative game of drawing and guessing based on Pictionary, which poses a novel challenge for the research community. In Iconary, a Guesser tries to identify a phrase that a Drawer is drawing by composing icons, and the Drawer iteratively revises the drawing to help the Guesser in response. This back-and-forth often uses canonical scenes, visual metaphor, or icon compositions to express challenging words, making it an ideal test for mixing language and visual/symbolic communication in AI. We propose models to play Iconary and train them on over 55,000 games between human players. Our models are skillful players and are able to employ world knowledge in language models to play with words unseen during training. Elite human players outperform our models, particularly at the drawing task, leaving an important gap for future research to address. We release our dataset, code, and evaluation setup as a challenge to the community at http://www.github.com/allenai/iconary.

* In EMNLP 2021 
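
The interaction protocol boils down to a simple loop between two agents; the drawer/guesser interfaces here are hypothetical placeholders, not the released models:

```python
def play_iconary(phrase, drawer, guesser, max_turns=5):
    drawing = drawer.draw(phrase)                        # initial icon composition
    for _ in range(max_turns):
        guess = guesser.guess(drawing)
        if guess == phrase:
            return True, drawing
        drawing = drawer.revise(phrase, drawing, guess)  # revise in response to the guess
    return False, drawing
```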