Byron C. Wallace

Learning from Natural Language Explanations for Generalizable Entity Matching

Jun 13, 2024

Somin Wadhwa, Adit Krishnan, Runhui Wang, Byron C. Wallace, Chris Kong

Abstract: Entity matching is the task of linking records from different sources that refer to the same real-world entity. Past work has primarily treated entity linking as a standard supervised learning problem. However, supervised entity matching models often do not generalize well to new data, and collecting exhaustive labeled training data is often cost-prohibitive. Further, recent efforts have adopted LLMs for this task in few/zero-shot settings, exploiting their general knowledge. But LLMs are prohibitively expensive for performing inference at scale for real-world entity matching tasks. As an efficient alternative, we re-cast entity matching as a conditional generation task as opposed to binary classification. This enables us to "distill" LLM reasoning into smaller entity matching models via natural language explanations. This approach achieves strong performance, especially on out-of-domain generalization tests (10.85% F-1) where standalone generative methods struggle. We perform ablations that highlight the importance of explanations, both for performance and model robustness.

Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models

May 02, 2024

Hye Sun Yun, David Pogrebitskiy, Iain J. Marshall, Byron C. Wallace

Abstract: Meta-analyses statistically aggregate the findings of different randomized controlled trials (RCTs) to assess treatment effectiveness. Because this yields robust estimates of treatment effectiveness, results from meta-analyses are considered the strongest form of evidence. However, rigorous evidence syntheses are time-consuming and labor-intensive, requiring manual extraction of data from individual trials to be synthesized. Ideally, language technologies would permit fully automatic meta-analysis, on demand. This requires accurately extracting numerical results from individual trials, which has been beyond the capabilities of natural language processing (NLP) models to date. In this work, we evaluate whether modern large language models (LLMs) can reliably perform this task. We annotate (and release) a modest but granular evaluation dataset of clinical trial reports with numerical findings attached to interventions, comparators, and outcomes. Using this dataset, we evaluate the performance of seven LLMs applied zero-shot for the task of conditionally extracting numerical findings from trial reports. We find that massive LLMs that can accommodate lengthy inputs are tantalizingly close to realizing fully automatic meta-analysis, especially for dichotomous (binary) outcomes (e.g., mortality). However, LLMs -- including ones trained on biomedical texts -- perform poorly when the outcome measures are complex and tallying the results requires inference. This work charts a path toward fully automatic meta-analysis of RCTs via LLMs, while also highlighting the limitations of existing models for this aim.

* 24 pages, 7 figures, 6 tables
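
To illustrate why extracted counts for dichotomous outcomes are enough to run a meta-analysis, here is a minimal sketch of fixed-effect, inverse-variance pooling of odds ratios. It is not the paper's pipeline: the function names, the tuple layout `(events_treatment, total_treatment, events_control, total_control)`, and the trial counts are all hypothetical and chosen only for illustration.

```python
import math

def log_odds_ratio(events_t, total_t, events_c, total_c):
    """Return (log odds ratio, variance) for one trial's 2x2 table."""
    a, b = events_t, total_t - events_t   # treatment: events, non-events
    c, d = events_c, total_c - events_c   # control: events, non-events
    # Standard 0.5 continuity correction when any cell is zero.
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance

def pool_fixed_effect(trials):
    """Inverse-variance pooled odds ratio with an approximate 95% CI."""
    weight_sum = 0.0
    weighted_sum = 0.0
    for trial in trials:
        log_or, variance = log_odds_ratio(*trial)
        weight = 1.0 / variance
        weight_sum += weight
        weighted_sum += weight * log_or
    pooled = weighted_sum / weight_sum
    se = math.sqrt(1.0 / weight_sum)
    ci = (math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se))
    return math.exp(pooled), ci

# Hypothetical counts, as they might be extracted from three trial reports.
trials = [(10, 100, 20, 100), (5, 50, 9, 50), (12, 120, 15, 118)]
pooled_or, (ci_low, ci_high) = pool_fixed_effect(trials)
print(f"pooled OR = {pooled_or:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

Each trial contributes only four integers, which is why binary outcomes are the most tractable case; continuous or composite outcomes need means, standard deviations, or further inference before pooling.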

Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores

Mar 01, 2024

Chantal Shaib, Joe Barrow, Jiuding Sun, Alexa F. Siu, Byron C. Wallace, Ani Nenkova

Abstract: The diversity across outputs generated by large language models shapes the perception of their quality and utility. Prompt leaks, templated answer structure, and canned responses across different interactions are readily noticed by people, but there is no standard score to measure this aspect of model behavior. In this work we empirically investigate diversity scores on English texts. We find that computationally efficient compression algorithms capture information similar to what is measured by slow-to-compute $n$-gram overlap homogeneity scores. Further, a combination of measures -- compression ratios, self-repetition of long $n$-grams, Self-BLEU, and BERTScore -- is sufficient to report, as they have low mutual correlation with each other. The applicability of scores extends beyond analysis of generative models; for example, we highlight applications on instruction-tuning datasets and human-produced texts. We release a diversity score package to facilitate research and invite consistency across reports.

* Preprint
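
The compression-based score mentioned in the abstract can be sketched in a few lines. This is an illustrative approximation using Python's standard `gzip` module, not the released package; the function name `compression_ratio` and the sample texts are assumptions made for the example.

```python
import gzip

def compression_ratio(texts):
    """Raw size divided by gzip-compressed size of the joined texts.

    Higher values indicate more redundancy across the texts,
    i.e. lower diversity.
    """
    raw = "\n".join(texts).encode("utf-8")
    compressed = gzip.compress(raw)
    return len(raw) / len(compressed)

# Hypothetical model outputs: one canned response repeated vs. varied responses.
repetitive = ["Thank you for your question. Here is a short summary."] * 8
varied = [
    "Thank you for your question. Here is a short summary.",
    "Rainfall in the valley doubled over the last decade.",
    "The committee postponed its vote until next Tuesday.",
    "Gzip relies on repeated byte sequences to shrink input.",
    "Her violin teacher recommended a slower tempo at first.",
    "Most of the survey respondents preferred the blue logo.",
    "A faulty sensor caused the turbine to shut down early.",
    "Seven LLMs were compared on previously unseen documents.",
]

print("repetitive:", round(compression_ratio(repetitive), 2))
print("varied:    ", round(compression_ratio(varied), 2))
```

Because the score needs only a single compression pass, it is far cheaper than pairwise $n$-gram overlap scores, which grow quadratically with the number of texts.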

How Much Annotation is Needed to Compare Summarization Models?

Feb 28, 2024

Chantal Shaib, Joe Barrow, Alexa F. Siu, Byron C. Wallace, Ani Nenkova

Abstract: Modern instruction-tuned models have become highly capable in text generation tasks such as summarization, and are expected to be released at a steady pace. In practice, one may now wish to choose confidently, but with minimal effort, the best-performing summarization model when applied to a new domain or purpose. In this work, we empirically investigate the test sample size necessary to select a preferred model in the context of news summarization. Empirical results reveal that comparative evaluation converges quickly for both automatic and human evaluation, with clear preferences for a system emerging from under 100 examples. The human preference data allows us to quantify how well automatic scores can reproduce preference rankings across a variety of downstream summarization tasks. We find that, while automatic metrics are stable at smaller sample sizes, only some automatic metrics are able to moderately predict model win rates according to human preference.

* Preprint

Leveraging ChatGPT in Pharmacovigilance Event Extraction: An Empirical Study

Feb 24, 2024

Zhaoyue Sun, Gabriele Pergola, Byron C. Wallace, Yulan He

Abstract: With the advent of large language models (LLMs), there has been growing interest in exploring their potential for medical applications. This research investigates the ability of LLMs, specifically ChatGPT, to perform pharmacovigilance event extraction, the main goal of which is to identify and extract adverse events or potential therapeutic events from textual medical sources. We conduct extensive experiments to assess the performance of ChatGPT in the pharmacovigilance event extraction task, employing various prompts and demonstration selection strategies. The findings show that while ChatGPT achieves reasonable performance with appropriate demonstration selection strategies, it still falls short of fully fine-tuned small models. Additionally, we explore the potential of leveraging ChatGPT for data augmentation. However, our investigation reveals that the inclusion of synthesized data into fine-tuning may lead to a decrease in performance, possibly attributed to noise in the ChatGPT-generated labels. To mitigate this, we explore different filtering strategies and find that, with the proper approach, more stable performance can be achieved, although constant improvement remains elusive.

* 14 pages, 2 figures, accepted by EACL 2024

GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

Feb 19, 2024

Kundan Krishna, Sanjana Ramprasad, Prakhar Gupta, Byron C. Wallace, Zachary C. Lipton, Jeffrey P. Bigham

Abstract: LLMs can generate factually incorrect statements even when provided access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit -- a tool intended to assist fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and also presents evidence from the reference for facts that do appear to have support. We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users. Comprehensive evaluation by human raters shows that GenAudit can detect errors in 8 different LLM outputs when summarizing documents from diverse domains. To ensure that most errors are flagged by the system, we propose a method that can increase the error recall while minimizing impact on precision. We will release our tool (GenAudit) and fact-checking model for public use.

Towards Reducing Diagnostic Errors with Interpretable Risk Prediction

Feb 15, 2024

Denis Jered McInerney, William Dickinson, Lucy Flynn, Andrea Young, Geoffrey Young, Jan-Willem van de Meent, Byron C. Wallace

Abstract: Many diagnostic errors occur because clinicians cannot easily access relevant information in patient Electronic Health Records (EHRs). In this work we propose a method to use LLMs to identify pieces of evidence in patient EHR data that indicate increased or decreased risk of specific diagnoses; our ultimate aim is to increase access to evidence and reduce diagnostic errors. In particular, we propose a Neural Additive Model to make predictions backed by evidence with individualized risk estimates at time-points where clinicians are still uncertain, aiming to specifically mitigate delays in diagnosis and errors stemming from an incomplete differential. To train such a model, it is necessary to infer temporally fine-grained retrospective labels of eventual "true" diagnoses. We do so with LLMs, to ensure that the input text is from before a confident diagnosis can be made. We use an LLM to retrieve an initial pool of evidence, but then refine this set of evidence according to correlations learned by the model. We conduct an in-depth evaluation of the usefulness of our approach by simulating how it might be used by a clinician to decide between a pre-defined list of differential diagnoses.

InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification

Jan 29, 2024

Jan Trienes, Sebastian Joseph, Jörg Schlötterer, Christin Seifert, Kyle Lo, Wei Xu, Byron C. Wallace, Junyi Jessy Li

Abstract: Text simplification aims to make technical texts more accessible to laypeople but often results in deletion of information and vagueness. This work proposes InfoLossQA, a framework to characterize and recover simplification-induced information loss in the form of question-and-answer (QA) pairs. Building on the theory of Question Under Discussion, the QA pairs are designed to help readers deepen their knowledge of a text. We conduct a range of experiments with this framework. First, we collect a dataset of 1,000 linguist-curated QA pairs derived from 104 LLM simplifications of scientific abstracts of medical studies. Our analyses of this data reveal that information loss occurs frequently, and that the QA pairs give a high-level overview of what information was lost. Second, we devise two methods for this task: end-to-end prompting of open-source and commercial language models, and a natural language inference pipeline. With a novel evaluation framework considering the correctness of QA pairs and their linguistic suitability, our expert evaluation reveals that models struggle to reliably identify information loss and to apply the same standards as humans for what constitutes information loss.

Leveraging Generative AI for Clinical Evidence Summarization Needs to Achieve Trustworthiness

Nov 19, 2023

Gongbo Zhang, Qiao Jin, Denis Jered McInerney, Yong Chen, Fei Wang, Curtis L. Cole, Qian Yang, Yanshan Wang, Bradley A. Malin, Mor Peleg (+4 more)

Abstract: Evidence-based medicine aims to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating this arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence.

Future Lens: Anticipating Subsequent Tokens from a Single Hidden State

Nov 08, 2023

Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C. Wallace, David Bau

Abstract: We conjecture that hidden state vectors corresponding to individual input tokens encode information sufficient to accurately predict several tokens ahead. More concretely, in this paper we ask: Given a hidden (internal) representation of a single token at position $t$ in an input, can we reliably anticipate the tokens that will appear at positions $\geq t + 2$? To test this, we measure linear approximation and causal intervention methods in GPT-J-6B to evaluate the degree to which individual hidden states in the network contain signal rich enough to predict future hidden states and, ultimately, token outputs. We find that, at some layers, we can approximate a model's output with more than 48% accuracy with respect to its prediction of subsequent tokens through a single hidden state. Finally, we present a "Future Lens" visualization that uses these methods to create a new view of transformer states.

* Accepted at CoNLL 2023
