Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jordan Boyd-Graber

University of Maryland

CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering

Jan 24, 2024

Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Boyd-Graber

Figure 1 for CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering

Figure 2 for CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering

Figure 3 for CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering

Figure 4 for CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering

Abstract:Question answering (QA) can only make progress if we know if an answer is correct, but for many of the most challenging and interesting QA examples, current evaluation metrics to determine answer equivalence (AE) often do not align with human judgments, particularly more verbose, free-form answers from large language models (LLM). There are two challenges: a lack of data and that models are too big: LLM-based scorers can correlate better with human judges, but this task has only been tested on limited QA datasets, and even when available, update of the model is limited because LLMs are large and often expensive. We rectify both of these issues by providing clear and consistent guidelines for evaluating AE in machine QA adopted from professional human QA contests. We also introduce a combination of standard evaluation and a more efficient, robust, and lightweight discriminate AE classifier-based matching method (CFMatch, smaller than 1 MB), trained and validated to more accurately evaluate answer correctness in accordance with adopted expert AE rules that are more aligned with human judgments.

* 18 pages, two figures, 6 tables. QA evaluation python package available in https://github.com/zli12321/qa_metrics.git

Via

Access Paper or Ask Questions

How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Question Generation

Jan 20, 2024

Yoo Yeon Sung, Ishani Mondal, Jordan Boyd-Graber

Abstract:Dynamic adversarial question generation, where humans write examples to stump a model, aims to create examples that are realistic and informative. However, the advent of large language models (LLMs) has been a double-edged sword for human authors: more people are interested in seeing and pushing the limits of these models, but because the models are so much stronger an opponent, they are harder to defeat. To understand how these models impact adversarial question writing process, we enrich the writing guidance with LLMs and retrieval models for the authors to reason why their questions are not adversarial. While authors could create interesting, challenging adversarial questions, they sometimes resort to tricks that result in poor questions that are ambiguous, subjective, or confusing not just to a computer but also to humans. To address these issues, we propose new metrics and incentives for eliciting good, challenging questions and present a new dataset of adversarially authored questions.

Via

Access Paper or Ask Questions

Towards Pragmatic Awareness in Question Answering: A Case Study in Maternal and Infant Health

Nov 16, 2023

Neha Srikanth, Rupak Sarkar, Rachel Rudinger, Jordan Boyd-Graber

Abstract:Questions posed by information-seeking users often contain implicit false or potentially harmful assumptions. In a high-risk domain such as maternal and infant health, a question-answering system must recognize these pragmatic constraints and go beyond simply answering user questions, examining them in context to respond helpfully. To achieve this, we study pragmatic inferences made when mothers ask questions about pregnancy and infant care. Some of the inferences in these questions evade detection by existing methods, risking the possibility of QA systems failing to address them which can have dangerous health and policy implications. We explore the viability of detecting inferences from questions using large language models and illustrate that informing existing QA pipelines with pragmatic inferences produces responses that can mitigate the propagation of harmful beliefs.

Via

Access Paper or Ask Questions

Labeled Interactive Topic Models

Nov 15, 2023

Kyle Seelman, Mozhi Zhang, Jordan Boyd-Graber

Figure 1 for Labeled Interactive Topic Models

Figure 2 for Labeled Interactive Topic Models

Figure 3 for Labeled Interactive Topic Models

Figure 4 for Labeled Interactive Topic Models

Abstract:Topic models help users understand large document collections; however, topic models do not always find the ``right'' topics. While classical probabilistic and anchor-based topic models have interactive variants to guide models toward better topics, such interactions are not available for neural topic models such as the embedded topic model (\abr{etm}). We correct this lacuna by adding an intuitive interaction to neural topic models: users can label a topic with a word, and topics are updated so that the topic words are close to the label. This allows a user to refine topics based on their information need. While, interactivity is intuitive for \abr{etm}, we extend this framework to work with other neural topic models as well. We develop an interactive interface which allows users to interact and relabel topic models as they see fit. We evaluate our method through a human study, where users can relabel topics to find relevant documents. Using our method, user labeling improves document rank scores, helping to find more relevant documents to a given query when compared to no user labeling.

Via

Access Paper or Ask Questions

Not all Fake News is Written: A Dataset and Analysis of Misleading Video Headlines

Oct 20, 2023

Yoo Yeon Sung, Jordan Boyd-Graber, Naeemul Hassan

Figure 1 for Not all Fake News is Written: A Dataset and Analysis of Misleading Video Headlines

Figure 2 for Not all Fake News is Written: A Dataset and Analysis of Misleading Video Headlines

Figure 3 for Not all Fake News is Written: A Dataset and Analysis of Misleading Video Headlines

Figure 4 for Not all Fake News is Written: A Dataset and Analysis of Misleading Video Headlines

Abstract:Polarization and the marketplace for impressions have conspired to make navigating information online difficult for users, and while there has been a significant effort to detect false or misleading text, multimodal datasets have received considerably less attention. To complement existing resources, we present multimodal Video Misleading Headline (VMH), a dataset that consists of videos and whether annotators believe the headline is representative of the video's contents. After collecting and annotating this dataset, we analyze multimodal baselines for detecting misleading headlines. Our annotation process also focuses on why annotators view a video as misleading, allowing us to better understand the interplay of annotators' background and the content of the videos.

* To be published at EMNLP 2023 Main Conference

Via

Access Paper or Ask Questions

Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Oct 19, 2023

Chenglei Si, Navita Goyal, Sherry Tongshuang Wu, Chen Zhao, Shi Feng, Hal Daumé III, Jordan Boyd-Graber

Figure 1 for Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Figure 2 for Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Figure 3 for Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Figure 4 for Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Abstract:Large Language Models (LLMs) are increasingly used for accessing information on the web. Their truthfulness and factuality are thus of great interest. To help users make the right decisions about the information they're getting, LLMs should not only provide but also help users fact-check information. In this paper, we conduct experiments with 80 crowdworkers in total to compare language models with search engines (information retrieval systems) at facilitating fact-checking by human users. We prompt LLMs to validate a given claim and provide corresponding explanations. Users reading LLM explanations are significantly more efficient than using search engines with similar accuracy. However, they tend to over-rely the LLMs when the explanation is wrong. To reduce over-reliance on LLMs, we ask LLMs to provide contrastive information - explain both why the claim is true and false, and then we present both sides of the explanation to users. This contrastive explanation mitigates users' over-reliance on LLMs, but cannot significantly outperform search engines. However, showing both search engine results and LLM explanations offers no complementary benefits as compared to search engines alone. Taken together, natural language explanations by LLMs may not be a reliable replacement for reading the retrieved passages yet, especially in high-stakes settings where over-relying on wrong AI explanations could lead to critical consequences.

* preprint

Via

Access Paper or Ask Questions

MegaWika: Millions of reports and their sources across 50 diverse languages

Jul 13, 2023

Samuel Barham, Orion Weller, Michelle Yuan, Kenton Murray, Mahsa Yarmohammadi, Zhengping Jiang, Siddharth Vashishtha, Alexander Martin, Anqi Liu, Aaron Steven White(+2 more)

Abstract:To foster the development of new models for collaborative AI-assisted report generation, we introduce MegaWika, consisting of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications, going beyond the initial Wikipedia citation extraction and web scraping of content, including translating non-English articles for cross-lingual applications and providing FrameNet parses for automated semantic analysis. MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual. We manually analyze the quality of this resource through a semantically stratified sample. Finally, we provide baseline results and trained models for crucial steps in automated report generation: cross-lingual question answering and citation retrieval.

* Submitted to ACL, 2023

Via

Access Paper or Ask Questions

Mixture of Prompt Experts for Generalizable and Interpretable Question Answering

May 24, 2023

Chenglei Si, Weijia Shi, Chen Zhao, Luke Zettlemoyer, Jordan Boyd-Graber

Abstract:One of the ultimate quests of question answering (QA) is to deploy a system that can answer any type of question from the users, and refrain from answering when it does not know the answer. While recent advancements in scaling large language models (LLMs) brought significant improvements on various QA datasets, it remains difficult for a single model to generalize across question types that require distinct reasoning abilities. In this paper, we first provide empirical evidence that state-of-the-art LLMs such as Codex suffer from poor generalizability on question types beyond those seen in the prompt. To address this, we propose a Mixture-of-Prompt-Experts (MOPE) system that ensembles multiple specialized LLMs. We first implement each specialized model based on the same backbone model (Codex) but with prompts optimized for different reasoning categories including factual, multihop, mathematical, and commonsense reasoning. By strategically selecting the best specialized model for each given question, our MOPE system significantly outperforms any single specialized model on a collection of 12 QA datasets from four reasoning types. Moreover, the attribution and agreement among specialized expert models offer greater interpretability, allowing for better selective question answering. Our human study further confirms that presenting the expert predictions and answer selection process helps annotators more accurately decide when to trust the system's output. We release all code and data to facilitate future work.

Via

Access Paper or Ask Questions

InteractiveIE: Towards Assessing the Strength of Human-AI Collaboration in Improving the Performance of Information Extraction

May 24, 2023

Ishani Mondal, Michelle Yuan, Anandhavelu N, Aparna Garimella, Francis Ferraro, Andrew Blair-Stanek, Benjamin Van Durme, Jordan Boyd-Graber

Figure 1 for InteractiveIE: Towards Assessing the Strength of Human-AI Collaboration in Improving the Performance of Information Extraction

Figure 2 for InteractiveIE: Towards Assessing the Strength of Human-AI Collaboration in Improving the Performance of Information Extraction

Figure 3 for InteractiveIE: Towards Assessing the Strength of Human-AI Collaboration in Improving the Performance of Information Extraction

Figure 4 for InteractiveIE: Towards Assessing the Strength of Human-AI Collaboration in Improving the Performance of Information Extraction

Abstract:Learning template based information extraction from documents is a crucial yet difficult task. Prior template-based IE approaches assume foreknowledge of the domain templates; however, real-world IE do not have pre-defined schemas and it is a figure-out-as you go phenomena. To quickly bootstrap templates in a real-world setting, we need to induce template slots from documents with zero or minimal supervision. Since the purpose of question answering intersect with the goal of information extraction, we use automatic question generation to induce template slots from the documents and investigate how a tiny amount of a proxy human-supervision on-the-fly (termed as InteractiveIE) can further boost the performance. Extensive experiments on biomedical and legal documents, where obtaining training data is expensive, reveal encouraging trends of performance improvement using InteractiveIE over AI-only baseline.

Via

Access Paper or Ask Questions

Cheater's Bowl: Human vs. Computer Search Strategies for Open-Domain Question Answering

Nov 15, 2022

Wanrong He, Andrew Mao, Jordan Boyd-Graber

Figure 1 for Cheater's Bowl: Human vs. Computer Search Strategies for Open-Domain Question Answering

Figure 2 for Cheater's Bowl: Human vs. Computer Search Strategies for Open-Domain Question Answering

Figure 3 for Cheater's Bowl: Human vs. Computer Search Strategies for Open-Domain Question Answering

Figure 4 for Cheater's Bowl: Human vs. Computer Search Strategies for Open-Domain Question Answering

Abstract:For humans and computers, the first step in answering an open-domain question is retrieving a set of relevant documents from a large corpus. However, the strategies that computers use fundamentally differ from those of humans. To better understand these differences, we design a gamified interface for data collection -- Cheater's Bowl -- where a human answers complex questions with access to both traditional and modern search tools. We collect a dataset of human search sessions, analyze human search strategies, and compare them to state-of-the-art multi-hop QA models. Humans query logically, apply dynamic search chains, and use world knowledge to boost searching. We demonstrate how human queries can improve the accuracy of existing systems and propose improving the future design of QA models.

* Findings of EMNLP 2022

Via

Access Paper or Ask Questions