Byron C. Wallace

How Many and Which Training Points Would Need to be Removed to Flip this Prediction?

Feb 09, 2023

Jinghan Yang, Sarthak Jain, Byron C. Wallace

Abstract:We consider the problem of identifying a minimal subset of training data $\mathcal{S}_t$ such that if the instances comprising $\mathcal{S}_t$ had been removed prior to training, the categorization of a given test point $x_t$ would have been different. Identifying such a set may be of interest for a few reasons. First, the cardinality of $\mathcal{S}_t$ provides a measure of robustness (if $|\mathcal{S}_t|$ is small for $x_t$, we might be less confident in the corresponding prediction), which we show is correlated with but complementary to predicted probabilities. Second, interrogation of $\mathcal{S}_t$ may provide a novel mechanism for contesting a particular model prediction: If one can make the case that the points in $\mathcal{S}_t$ are wrongly labeled or irrelevant, this may argue for overturning the associated prediction. Identifying $\mathcal{S}_t$ via brute-force is intractable. We propose comparatively fast approximation methods to find $\mathcal{S}_t$ based on influence functions, and find that -- for simple convex text classification models -- these approaches can often successfully identify relatively small sets of training examples which, if removed, would flip the prediction.

* Accepted to EACL 2023
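
As an illustrative aside, the influence-based approximation described in the abstract can be sketched for an L2-regularized logistic regression. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`fit_logreg`, `candidate_flip_set`), the Newton solver, and the regularization strength `l2` are all hypothetical choices made for illustration. It estimates how removing each training point would shift the test logit, then accumulates the most helpful removals until the linear estimate predicts a flip.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, l2=1.0, iters=50):
    """Newton's method for sum_i logloss_i(w) + (l2 / 2) * ||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) + l2 * w
        hess = (X.T * (p * (1.0 - p))) @ X + l2 * np.eye(X.shape[1])
        w = w - np.linalg.solve(hess, grad)
    return w

def candidate_flip_set(X, y, x_t, l2=1.0):
    """Return (indices or None, weights, per-point estimated gain).

    First-order estimate: removing point i moves the optimum by about
    H^{-1} g_i, so the test logit moves by x_t^T H^{-1} g_i.
    """
    w = fit_logreg(X, y, l2)
    p = sigmoid(X @ w)
    hess = (X.T * (p * (1.0 - p))) @ X + l2 * np.eye(X.shape[1])
    grads = X * (p - y)[:, None]                      # per-example gradients
    delta_logit = grads @ np.linalg.solve(hess, x_t)  # estimated logit shifts
    logit_t = float(x_t @ w)
    # Positive gain means removal pushes the logit toward the other class.
    gain = -np.sign(logit_t) * delta_logit
    order = np.argsort(-gain)                         # most helpful first
    cum = np.cumsum(np.clip(gain[order], 0.0, None))
    hits = np.nonzero(cum > abs(logit_t))[0]
    subset = None if hits.size == 0 else order[: hits[0] + 1]
    return subset, w, gain

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                     # synthetic data (assumption)
    y = (rng.random(200) < sigmoid(X @ np.array([1.5, -1.0, 0.5]))).astype(float)
    subset, w, _ = candidate_flip_set(X, y, np.array([0.1, 0.05, 0.0]))
    print("candidate set size:", None if subset is None else len(subset))
```

Because the estimate is first-order, a returned set is only a candidate; retraining without those points is the way to confirm that the prediction actually flips.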

Do Multi-Document Summarization Models Synthesize?

Jan 31, 2023

Jay DeYoung, Stephanie C. Martinez, Iain J. Marshall, Byron C. Wallace

Abstract:Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key property or aspect. For example, a synopsis of film reviews all written about a particular movie should reflect the average critic consensus. As a more consequential example, consider narrative summaries that accompany biomedical systematic reviews of clinical trial results. These narratives should fairly summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this type of synthesis? To assess this we perform a suite of experiments that probe the degree to which conditional generation models trained for summarization using standard methods yield outputs that appropriately synthesize inputs. We find that existing models do partially perform synthesis, but do so imperfectly. In particular, they are over-sensitive to changes in input ordering and under-sensitive to changes in input compositions (e.g., the ratio of positive to negative movie reviews). We propose a simple, general method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or abstaining when the model produces no good candidate. This approach improves model synthesis performance. We hope that highlighting the need for synthesis (in some summarization settings) motivates further research into multi-document summarization methods and learning objectives that explicitly account for the need to synthesize.

* 22 pages, 13 figures, 22 tables. ACL-formatted paper; expanded version of a rejected ICLR submission (https://openreview.net/forum?id=1PTeB4MWCfU). Paper de-anonymized ahead of ICLR de-anonymization due to ACL policies/additional conference submission.
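
The select-or-abstain step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's method as implemented: `select_or_abstain`, the `tolerance` threshold, and the toy lexicon-based `toy_sentiment` measure are hypothetical stand-ins for whatever aggregate measure and candidate generator a real system would use.

```python
from typing import Callable, Optional, Sequence

def select_or_abstain(
    candidates: Sequence[str],
    measure: Callable[[str], float],
    target: float,
    tolerance: float,
) -> Optional[str]:
    """Pick the candidate whose measured value is closest to the target.

    Returns None (abstains) when no candidate lies within `tolerance`.
    """
    if not candidates:
        return None
    best = min(candidates, key=lambda c: abs(measure(c) - target))
    return best if abs(measure(best) - target) <= tolerance else None

# Toy measure (assumption): share of positive words among sentiment words.
POSITIVE = {"great", "good", "excellent"}
NEGATIVE = {"bad", "poor", "dull"}

def toy_sentiment(text: str) -> float:
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return 0.5 if pos + neg == 0 else pos / (pos + neg)

if __name__ == "__main__":
    review_labels = [1, 0, 1, 0]                 # two positive, two negative inputs
    target = sum(review_labels) / len(review_labels)
    candidates = [
        "a great film with good acting",
        "good acting but a dull plot",
        "bad and dull",
    ]
    print(select_or_abstain(candidates, toy_sentiment, target, tolerance=0.1))
```

Returning `None` rather than the nearest candidate keeps the abstention case explicit, so a caller has to decide what to do when nothing aligns with the inputs.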

Intermediate Entity-based Sparse Interpretable Representation Learning

Dec 03, 2022

Diego Garcia-Olano, Yasumasa Onoe, Joydeep Ghosh, Byron C. Wallace

Abstract:Interpretable entity representations (IERs) are sparse embeddings that are "human-readable" in that dimensions correspond to fine-grained entity types and values are predicted probabilities that a given entity is of the corresponding type. These methods perform well in zero-shot and low supervision settings. Compared to standard dense neural embeddings, such interpretable representations may permit analysis and debugging. However, while fine-tuning sparse, interpretable representations improves accuracy on downstream tasks, it destroys the semantics of the dimensions which were enforced in pre-training. Can we maintain the interpretable semantics afforded by IERs while improving predictive performance on downstream tasks? Toward this end, we propose Intermediate enTity-based Sparse Interpretable Representation Learning (ItsIRL). ItsIRL realizes improved performance over prior IERs on biomedical tasks, while maintaining "interpretability" generally and their ability to support model debugging specifically. The latter is enabled in part by the ability to perform "counterfactual" fine-grained entity type manipulation, which we explore in this work. Finally, we propose a method to construct entity type based class prototypes for revealing global semantic properties of classes learned by our model.

* Accepted to the BlackboxNLP Workshop at EMNLP 2022

Influence Functions for Sequence Tagging Models

Oct 25, 2022

Sarthak Jain, Varun Manjunatha, Byron C. Wallace, Ani Nenkova

Abstract:Many language tasks (e.g., Named Entity Recognition, Part-of-Speech tagging, and Semantic Role Labeling) are naturally framed as sequence tagging problems. However, there has been comparatively little work on interpretability methods for sequence tagging models. In this paper, we extend influence functions - which aim to trace predictions back to the training points that informed them - to sequence tagging tasks. We define the influence of a training instance segment as the effect that perturbing the labels within this segment has on a test segment level prediction. We provide an efficient approximation to compute this, and show that it tracks with the true segment influence, measured empirically. We show the practical utility of segment influence by using the method to identify systematic annotation errors in two named entity recognition corpora. Code to reproduce our results is available at https://github.com/successar/Segment_Influence_Functions.

* Accepted to Findings of EMNLP 2022

PHEE: A Dataset for Pharmacovigilance Event Extraction from Text

Oct 22, 2022

Zhaoyue Sun, Jiazheng Li, Gabriele Pergola, Byron C. Wallace, Bino John, Nigel Greene, Joseph Kim, Yulan He

Abstract:The primary goal of drug safety researchers and regulators is to promptly identify adverse drug reactions. Doing so may in turn prevent or reduce the harm to patients and ultimately improve public health. Evaluating and monitoring drug safety (i.e., pharmacovigilance) involves analyzing an ever growing collection of spontaneous reports from health professionals, physicians, and pharmacists, and information voluntarily submitted by patients. In this scenario, facilitating analysis of such reports via automation has the potential to rapidly identify safety signals. Unfortunately, public resources for developing natural language models for this task are scant. We present PHEE, a novel dataset for pharmacovigilance comprising over 5000 annotated events from medical case reports and biomedical literature, making it the largest such public dataset to date. We describe the hierarchical event schema designed to provide coarse and fine-grained information about patients' demographics, treatments and (side) effects. Along with the discussion of the dataset, we present a thorough experimental evaluation of current state-of-the-art approaches for biomedical event extraction, point out their limitations, and highlight open challenges to foster future research in this area.

* 17 pages, 3 figures; accepted to EMNLP 2022

Self-Repetition in Abstractive Neural Summarizers

Oct 14, 2022

Nikita Salkar, Thomas Trikalinos, Byron C. Wallace, Ani Nenkova

Abstract:We provide a quantitative and qualitative analysis of self-repetition in the output of neural summarizers. We measure self-repetition as the number of n-grams of length four or longer that appear in multiple outputs of the same system. We analyze the behavior of three popular architectures (BART, T5, and Pegasus), fine-tuned on five datasets. In a regression analysis, we find that the three architectures have different propensities for repeating content across output summaries for inputs, with BART being particularly prone to self-repetition. Fine-tuning on more abstractive data, and on data featuring formulaic language, is associated with a higher rate of self-repetition. In qualitative analysis we find systems produce artefacts such as ads and disclaimers unrelated to the content being summarized, as well as formulaic phrases common in the fine-tuning domain. Our approach to corpus-level analysis of self-repetition may help practitioners clean up training data for summarizers and ultimately support methods for minimizing the amount of self-repetition.
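
The self-repetition measure described in the abstract can be approximated with a short sketch. It is a simplification under stated assumptions rather than the authors' code: it counts only distinct 4-grams that occur in at least two different outputs (any longer repeated n-gram necessarily contains such a 4-gram), tokenizes by whitespace, and uses the hypothetical names `ngrams` and `repeated_ngrams`.

```python
from collections import Counter

def ngrams(tokens, n):
    """Set of distinct n-grams (as tuples) in one token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def repeated_ngrams(outputs, n=4):
    """N-grams that appear in at least two different outputs."""
    doc_freq = Counter()
    for text in outputs:
        # A set per output, so each output contributes at most once per n-gram.
        doc_freq.update(ngrams(text.lower().split(), n))
    return {gram for gram, count in doc_freq.items() if count >= 2}

if __name__ == "__main__":
    summaries = [
        "for more information visit our website today",
        "the study found gains . for more information visit our website",
        "nothing shared here at all",
    ]
    for gram in sorted(repeated_ngrams(summaries)):
        print(" ".join(gram))
```

Counting document frequency rather than raw frequency means repetition inside a single summary is ignored, which matches the abstract's focus on content repeated across outputs of the same system.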

Learning to Ask Like a Physician

Jun 06, 2022

Eric Lehman, Vladislav Lialin, Katelyn Y. Legaspi, Anne Janelle R. Sy, Patricia Therese S. Pile, Nicole Rose I. Alberto, Richard Raymund R. Ragasa, Corinna Victoria M. Puyat, Isabelle Rose I. Alberto, Pia Gabrielle I. Alfonso (+8 more)

Abstract:Existing question answering (QA) datasets derived from electronic health records (EHR) are artificially generated and consequently fail to capture realistic physician information needs. We present Discharge Summary Clinical Questions (DiSCQ), a newly curated question dataset composed of 2,000+ questions paired with the snippets of text (triggers) that prompted each question. The questions are generated by medical experts from 100+ MIMIC-III discharge summaries. We analyze this dataset to characterize the types of information sought by medical experts. We also train baseline models for trigger detection and question generation (QG), paired with unsupervised answer retrieval over EHRs. Our baseline model is able to generate high quality questions in over 62% of cases when prompted with human selected triggers. We release this dataset (and all code to reproduce baseline model results) to facilitate further research into realistic clinical QA and QG: https://github.com/elehman16/discq.

Evaluating Factuality in Text Simplification

Apr 15, 2022

Ashwin Devaraj, William Sheffield, Byron C. Wallace, Junyi Jessy Li

Abstract:Automated simplification models aim to make input texts more readable. Such methods have the potential to make complex information accessible to a wider audience, e.g., providing access to recent medical literature which might otherwise be impenetrable for a lay reader. However, such models risk introducing errors into automatically simplified texts, for instance by inserting statements unsupported by the corresponding original text, or by omitting key information. Providing more readable but inaccurate versions of texts may in many cases be worse than providing no such access at all. The problem of factual accuracy (and the lack thereof) has received heightened attention in the context of summarization models, but the factuality of automatically simplified texts has not been investigated. We introduce a taxonomy of errors that we use to analyze both references drawn from standard simplification datasets and state-of-the-art model outputs. We find that errors often appear in both that are not captured by existing evaluation metrics, motivating a need for research into ensuring the factual accuracy of automated simplification models.

* ACL 2022

What Would it Take to get Biomedical QA Systems into Practice?

Sep 21, 2021

Gregory Kell, Iain J. Marshall, Byron C. Wallace, Andre Jaun

Abstract:Medical question answering (QA) systems have the potential to answer clinicians' uncertainties about treatment and diagnosis on demand, informed by the latest evidence. However, despite the significant progress in general QA made by the NLP community, medical QA systems are still not widely used in clinical environments. One likely reason for this is that clinicians may not readily trust QA system outputs, in part because transparency, trustworthiness, and provenance have not been key considerations in the design of such models. In this paper we discuss a set of criteria that, if met, we argue would likely increase the utility of biomedical QA systems, which may in turn lead to adoption of such systems in practice. We assess existing models, tasks, and datasets with respect to these criteria, highlighting shortcomings of previously proposed approaches and pointing toward what might be more usable QA systems.

* Accepted by MRQA workshop at EMNLP 2021

Combining Feature and Instance Attribution to Detect Artifacts

Jul 01, 2021

Pouya Pezeshkpour, Sarthak Jain, Sameer Singh, Byron C. Wallace

Abstract:Training the large deep neural networks that dominate NLP requires large datasets. Many of these are collected automatically or via crowdsourcing, and may exhibit systematic biases or annotation artifacts. By the latter, we mean correlations between inputs and outputs that are spurious, insofar as they do not represent a generally held causal relationship between features and classes; models that exploit such correlations may appear to perform a given task well, but fail on out of sample data. In this paper we propose methods to facilitate identification of training data artifacts, using new hybrid approaches that combine saliency maps (which highlight important input features) with instance attribution methods (which retrieve training samples influential to a given prediction). We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data, and use it to identify previously unreported artifacts in a few standard NLP datasets. We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice, with promising results. We make code for all methods and experiments in this paper available.
