Ian Tenney

Simfluence: Modeling the Influence of Individual Training Examples by Simulating Training Runs

Mar 14, 2023
Kelvin Guu, Albert Webson, Ellie Pavlick, Lucas Dixon, Ian Tenney, Tolga Bolukbasi

Training data attribution (TDA) methods offer to trace a model's prediction on any given example back to specific influential training examples. Existing approaches do so by assigning a scalar influence score to each training example, under a simplifying assumption that influence is additive. But in reality, we observe that training examples interact in highly non-additive ways due to factors such as inter-example redundancy, training order, and curriculum learning effects. To study such interactions, we propose Simfluence, a new paradigm for TDA where the goal is not to produce a single influence score per example, but instead a training run simulator: the user asks, "If my model had trained on example $z_1$, then $z_2$, ..., then $z_n$, how would it behave on $z_{test}$?"; the simulator should then output a simulated training run, which is a time series predicting the loss on $z_{test}$ at every step of the simulated run. This enables users to answer counterfactual questions about what their model would have learned under different training curricula, and to directly see where in training that learning would occur. We present a simulator, Simfluence-Linear, that captures non-additive interactions and is often able to predict the spiky trajectory of individual example losses with surprising fidelity. Furthermore, we show that existing TDA methods such as TracIn and influence functions can be viewed as special cases of Simfluence-Linear. This enables us to directly compare methods in terms of their simulation accuracy, subsuming several prior TDA approaches to evaluation. In experiments on large language model (LLM) fine-tuning, we show that our method predicts loss trajectories with much higher accuracy than existing TDA methods (doubling Spearman's correlation and reducing mean-squared error by 75%) across several tasks, models, and training methods.
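
To make the simulated-training-run idea concrete, here is a minimal sketch of one plausible parameterization (the function names, the one-example-per-step assumption, and the ridge-regression fit are illustrative choices, not necessarily the paper's Simfluence-Linear, which operates on full training batches): each training example c gets a multiplicative factor alpha_c and an additive factor beta_c, and the simulated loss evolves as L_{t+1} = alpha_{c_t} * L_t + beta_{c_t}, with the factors fit on previously observed runs.

```python
import numpy as np

def fit_simulator(runs, num_train_examples, ridge=1e-3):
    """Fit per-example factors (alpha_c, beta_c) so that L_{t+1} is
    approximated by alpha_{c_t} * L_t + beta_{c_t}, where c_t is the id of
    the training example consumed at step t.

    `runs` is a list of (curriculum, losses) pairs from observed training
    runs: curriculum[t] is the example id used at step t, and losses has
    one extra entry so that losses[t] is the test-example loss before
    step t."""
    xtx = np.zeros((num_train_examples, 2, 2))
    xty = np.zeros((num_train_examples, 2))
    for curriculum, losses in runs:
        for t, c in enumerate(curriculum):
            x = np.array([losses[t], 1.0])   # features: [L_t, 1]
            xtx[c] += np.outer(x, x)
            xty[c] += losses[t + 1] * x      # target: L_{t+1}
    params = np.zeros((num_train_examples, 2))
    for c in range(num_train_examples):
        # The ridge term keeps the solve well-posed for rarely-seen examples.
        params[c] = np.linalg.solve(xtx[c] + ridge * np.eye(2), xty[c])
    return params  # params[c] = (alpha_c, beta_c)

def simulate_run(params, curriculum, initial_loss):
    """Roll the fitted simulator forward over a hypothetical curriculum,
    producing a predicted loss trajectory for the test example."""
    losses = [initial_loss]
    for c in curriculum:
        alpha, beta = params[c]
        losses.append(alpha * losses[-1] + beta)
    return losses
```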

Tracing Knowledge in Language Models Back to the Training Data

May 24, 2022
Ekin Akyürek, Tolga Bolukbasi, Frederick Liu, Binbin Xiong, Ian Tenney, Jacob Andreas, Kelvin Guu

Neural language models (LMs) have been shown to memorize a great deal of factual knowledge. But when an LM generates an assertion, it is often difficult to determine where it learned this information and whether it is true. In this paper, we introduce a new benchmark for fact tracing: tracing language models' assertions back to the training examples that provided evidence for those predictions. Prior work has suggested that dataset-level influence methods might offer an effective framework for tracing predictions back to training data. However, such methods have not been evaluated for fact tracing, and researchers have primarily studied them through qualitative analysis or as a data cleaning technique for classification/regression tasks. We present the first experiments that evaluate influence methods for fact tracing, using well-understood information retrieval (IR) metrics. We compare two popular families of influence methods -- gradient-based and embedding-based -- and show that neither can fact-trace reliably; indeed, both methods fail to outperform an IR baseline (BM25) that does not even access the LM. We explore why this occurs (e.g., gradient saturation) and demonstrate that existing influence methods must be improved significantly before they can reliably attribute factual predictions in LMs.

* 14 pages, 5 Tables, 5 Figures 
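
As a pointer for readers unfamiliar with IR-style evaluation of attribution, the toy sketch below ranks candidate training examples for a factual query with BM25 (which never consults the LM) and scores the ranking with mean reciprocal rank. It assumes the third-party rank_bm25 package; the corpus, queries, and gold "proponent" sets are placeholders, and the paper's candidate pools and metrics are substantially larger and more careful.

```python
from rank_bm25 import BM25Okapi

# Toy corpus of training sentences and queries with gold proponent indices.
corpus = ["Dante was born in Florence .",
          "Megan Rapinoe plays soccer for the United States .",
          "The Eiffel Tower is located in Paris ."]
queries = {"Where was Dante born?": {0},
           "Where is the Eiffel Tower?": {2}}

bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def mean_reciprocal_rank(queries):
    total = 0.0
    for query, gold in queries.items():
        scores = bm25.get_scores(query.lower().split())
        ranking = sorted(range(len(corpus)), key=lambda i: -scores[i])
        rank = 1 + min(ranking.index(g) for g in gold)  # best gold position
        total += 1.0 / rank
    return total / len(queries)

print(f"MRR of the BM25 baseline: {mean_reciprocal_rank(queries):.2f}")
```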

Retrieval-guided Counterfactual Generation for QA

Oct 14, 2021
Bhargavi Paranjape, Matthew Lamm, Ian Tenney

Deep NLP models have been shown to learn spurious correlations, leaving them brittle to input perturbations. Recent work has shown that counterfactual or contrastive data -- i.e., minimally perturbed inputs -- can reveal these weaknesses, and that data augmentation using counterfactuals can help ameliorate them. Proposed techniques for generating counterfactuals rely on human annotations, perturbations based on simple heuristics, and meaning representation frameworks. We focus on the task of creating counterfactuals for question answering, which presents unique challenges related to world knowledge, semantic diversity, and answerability. To address these challenges, we develop a Retrieve-Generate-Filter (RGF) technique to create counterfactual evaluation and training data with minimal human supervision. Using an open-domain QA framework and question generation model trained on original task data, we create counterfactuals that are fluent, semantically diverse, and automatically labeled. Data augmentation with RGF counterfactuals improves performance on out-of-domain and challenging evaluation sets over and above existing methods, in both the reading comprehension and open-domain QA settings. Moreover, we find that RGF data leads to significant improvements in a model's robustness to local perturbations.
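
The Retrieve-Generate-Filter recipe can be summarized structurally in a few lines. The sketch below is only an outline under stated assumptions: `retriever`, `generator`, and `answerer` are placeholder callables standing in for an open-domain retrieval system, a question-generation model, and a QA-based filter, and the simple answer-change and round-trip checks are illustrative rather than the paper's exact filtering criteria.

```python
def retrieve_generate_filter(question, answer, retriever, generator, answerer,
                             num_passages=20):
    """Structural sketch of an RGF-style pipeline (components are
    placeholders, not the paper's released models).

    1. Retrieve: fetch alternative evidence passages for the question.
    2. Generate: propose (question, answer) pairs from each passage.
    3. Filter: keep pairs whose answer differs from the original (so the
       label actually changes) and that a QA model answers consistently.
    """
    counterfactuals = []
    for passage in retriever(question, k=num_passages):
        for new_question, new_answer in generator(passage):
            if new_answer == answer:
                continue  # not a counterfactual: the answer did not change
            if answerer(new_question, passage) != new_answer:
                continue  # filter out pairs the QA model cannot verify
            counterfactuals.append(
                {"question": new_question, "answer": new_answer,
                 "evidence": passage})
    return counterfactuals
```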

The MultiBERTs: BERT Reproductions for Robustness Analysis

Jun 30, 2021
Thibault Sellam, Steve Yadlowsky, Jason Wei, Naomi Saphra, Alexander D'Amour, Tal Linzen, Jasmijn Bastings, Iulia Turc, Jacob Eisenstein, Dipanjan Das, Ian Tenney, Ellie Pavlick

Experiments with pretrained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure (which includes the model architecture, training data, initialization scheme, and loss function). Recent work has shown that re-running pretraining can lead to substantially different conclusions about performance, suggesting that alternative evaluations are needed to make principled statements about procedures. To address this, we introduce MultiBERTs: a set of 25 BERT-base checkpoints, trained with hyper-parameters similar to those of the original BERT model but differing in random initialization and data shuffling. The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures. The full release includes 25 fully trained checkpoints, as well as statistical guidelines and a code library implementing our recommended hypothesis testing methods. Finally, for five of these models we release a set of 28 intermediate checkpoints in order to support research on learning dynamics.

* Checkpoints and example analyses: http://goo.gle/multiberts 
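
As a rough illustration of why having 25 seeds matters, the snippet below runs a simple paired bootstrap over per-seed scores to ask whether one procedure's advantage over another survives re-running pretraining. This is a generic test sketched here for intuition only; the statistical guidelines and hypothesis-testing library shipped with the release are the authoritative reference and differ in detail.

```python
import numpy as np

def paired_seed_bootstrap(scores_a, scores_b, num_samples=10_000, seed=0):
    """Resample per-seed paired score differences with replacement and
    report how often the mean difference is <= 0 (a bootstrap p-value).

    `scores_a` and `scores_b` hold one evaluation score per pretraining
    seed (e.g. one fine-tuning run per MultiBERTs checkpoint) for two
    procedures being compared."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    resampled = rng.choice(diffs, size=(num_samples, len(diffs)), replace=True)
    return float((resampled.mean(axis=1) <= 0).mean())

# Usage (hypothetical): scores_a and scores_b are accuracies of two
# fine-tuning recipes on the same 25 seeds; a small returned value
# suggests the gap is robust to the pretraining seed.
```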

Measuring and Reducing Gendered Correlations in Pre-trained Models

Oct 12, 2020
Kellie Webster, Xuezhi Wang, Ian Tenney, Alex Beutel, Emily Pitler, Ellie Pavlick, Jilin Chen, Slav Petrov

Pre-trained models have revolutionized natural language understanding. However, researchers have found they can encode artifacts that are undesirable in many applications, such as professions correlating with one gender more than another. We explore such gendered correlations as a case study for how to address unintended correlations in pre-trained models. We define metrics and reveal that it is possible for models with similar accuracy to encode correlations at very different rates. We show how measured correlations can be reduced with general-purpose techniques, and highlight the trade-offs of different strategies. With these results, we make recommendations for training robust models: (1) carefully evaluate unintended correlations, (2) be mindful of seemingly innocuous configuration differences, and (3) focus on general mitigations.
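
As one deliberately simplified illustration of measuring such a correlation, the snippet below compares a masked language model's preference for "he" versus "she" in otherwise identical sentences about professions. The template, profession list, and `score_fill` hook are hypothetical; the paper defines its own, more comprehensive metrics.

```python
# Illustrative (not the paper's) template-based probe of gendered
# correlations. `score_fill` stands in for any function returning the
# model's log-probability of a completed sentence.
PROFESSIONS = ["nurse", "engineer", "teacher", "plumber"]
TEMPLATE = "The {profession} said that {pronoun} would be late."

def gender_gap(score_fill, professions=PROFESSIONS):
    """Mean log-probability gap between 'he' and 'she'; values far from
    zero in either direction indicate a gendered correlation."""
    gaps = []
    for profession in professions:
        he = score_fill(TEMPLATE.format(profession=profession, pronoun="he"))
        she = score_fill(TEMPLATE.format(profession=profession, pronoun="she"))
        gaps.append(he - she)
    return sum(gaps) / len(gaps)
```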

Do Language Embeddings Capture Scales?

Oct 11, 2020
Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, Dan Roth

Pretrained Language Models (LMs) have been shown to possess significant linguistic, common sense, and factual knowledge. One form of knowledge that has not been studied yet in this context is information about the scalar magnitudes of objects. We show that pretrained language models capture a significant amount of this information but are short of the capability required for general common-sense reasoning. We identify contextual information in pre-training and numeracy as two key factors affecting their performance and show that a simple method of canonicalizing numbers can have a significant effect on the results.

* Accepted at EMNLP Findings 2020 and EMNLP BlackboxNLP workshop 2020 
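
A minimal sketch of the probing setup such a study implies, under stated assumptions: frozen object embeddings are regressed onto the log of a scalar attribute with a ridge model. How the embeddings are pooled and which attributes are used are left open here, and the paper's probes and label distributions differ.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_scale_probe(embeddings, magnitudes):
    """Illustrative probe: linear regression from frozen object embeddings
    to log10 of a scalar attribute (e.g. typical mass in grams). Which
    layer and pooling produce `embeddings` is left unspecified."""
    probe = Ridge(alpha=1.0)
    probe.fit(np.asarray(embeddings), np.log10(np.asarray(magnitudes)))
    return probe

# Usage (hypothetical): predictions within roughly one order of magnitude
# of the truth would suggest the embeddings encode scale information.
# probe = fit_scale_probe(object_embeddings, object_masses_in_grams)
# print(probe.predict(object_embeddings[:3]))
```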

The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models

Aug 12, 2020
Ian Tenney, James Wexler, Jasmijn Bastings, Tolga Bolukbasi, Andy Coenen, Sebastian Gehrmann, Ellen Jiang, Mahima Pushkarna, Carey Radebaugh, Emily Reif, Ann Yuan

We present the Language Interpretability Tool (LIT), an open-source platform for visualization and understanding of NLP models. We focus on core questions about model behavior: Why did my model make this prediction? When does it perform poorly? What happens under a controlled change in the input? LIT integrates local explanations, aggregate analysis, and counterfactual generation into a streamlined, browser-based interface to enable rapid exploration and error analysis. We include case studies for a diverse set of workflows, including exploring counterfactuals for sentiment analysis, measuring gender bias in coreference systems, and exploring local behavior in text generation. LIT supports a wide range of models -- including classification, seq2seq, and structured prediction -- and is highly extensible through a declarative, framework-agnostic API. LIT is under active development, with code and full documentation available at https://github.com/pair-code/lit.
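
For orientation, a launch script typically looks like the sketch below, modeled on the public demos in the linked repository. Module paths and constructor signatures may vary across LIT versions, and `MySentimentModel` / `MySentimentData` are placeholder wrappers you would implement against LIT's Model and Dataset APIs.

```python
# Minimal launch sketch based on the demos in https://github.com/pair-code/lit.
# Treat the exact imports and signatures as version-dependent assumptions.
from lit_nlp import dev_server
from lit_nlp import server_flags

def main():
    # Placeholder wrappers implementing LIT's Model and Dataset interfaces.
    models = {"sst_classifier": MySentimentModel("/path/to/checkpoint")}
    datasets = {"sst_dev": MySentimentData("/path/to/dev.tsv")}
    lit_demo = dev_server.Server(models, datasets, **server_flags.get_flags())
    lit_demo.serve()  # then open the printed URL in a browser

if __name__ == "__main__":
    main()
```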

Asking without Telling: Exploring Latent Ontologies in Contextual Representations

Apr 29, 2020
Julian Michael, Jan A. Botha, Ian Tenney

The success of pretrained contextual encoders, such as ELMo and BERT, has brought a great deal of interest in what these models learn: do they, without explicit supervision, learn to encode meaningful notions of linguistic structure? If so, how is this structure encoded? To investigate this, we introduce latent subclass learning (LSL): a modification to existing classifier-based probing methods that induces a latent categorization (or ontology) of the probe's inputs. Without access to fine-grained gold labels, LSL extracts emergent structure from input representations in an interpretable and quantifiable form. In experiments, we find strong evidence of familiar categories, such as a notion of personhood in ELMo, as well as novel ontological distinctions, such as a preference for fine-grained semantic roles on core arguments. Our results provide unique new evidence of emergent structure in pretrained encoders, including departures from existing annotations which are inaccessible to earlier methods.

* 18 pages, 6 figures, 11 tables 
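
The sketch below is an illustrative analogue of a latent-subclass probe rather than the paper's exact LSL objective: each coarse gold label is modeled as a mixture of k latent subclasses, training maximizes the marginal likelihood of the gold label, and the induced arg-max subclass assignments can then be inspected as an emergent ontology. The class name, the marginalization choice, and the analysis hook are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

class LatentSubclassProbe(torch.nn.Module):
    """Illustrative latent-subclass probe (not the paper's exact LSL
    formulation): a linear scorer over num_labels * k latent subclasses,
    where each coarse label's probability is the marginal over its k
    subclasses, leaving the probe free to carve labels into finer
    latent categories."""

    def __init__(self, dim, num_labels, k):
        super().__init__()
        self.scorer = torch.nn.Linear(dim, num_labels * k)
        self.num_labels, self.k = num_labels, k

    def forward(self, reps):                       # reps: [batch, dim]
        logits = self.scorer(reps)                 # [batch, num_labels * k]
        return logits.view(-1, self.num_labels, self.k)

    def loss(self, reps, labels):
        subclass_logits = self.forward(reps)       # [batch, labels, k]
        # Marginalize over latent subclasses within each coarse label.
        label_logits = torch.logsumexp(subclass_logits, dim=-1)
        return F.cross_entropy(label_logits, labels)

    def latent_assignment(self, reps):
        """Most probable latent subclass per example, for analysis."""
        subclass_logits = self.forward(reps).flatten(1)
        return subclass_logits.argmax(dim=-1)      # index in [0, labels * k)
```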

What Happens To BERT Embeddings During Fine-tuning?

Apr 29, 2020
Amil Merchant, Elahe Rahimtoroghi, Ellie Pavlick, Ian Tenney

While there has been much recent work studying how linguistic information is encoded in pre-trained sentence representations, comparatively little is understood about how these models change when adapted to solve downstream tasks. Using a suite of analysis techniques (probing classifiers, Representational Similarity Analysis, and model ablations), we investigate how fine-tuning affects the representations of the BERT model. We find that while fine-tuning necessarily makes significant changes, it does not lead to catastrophic forgetting of linguistic phenomena. We instead find that fine-tuning primarily affects the top layers of BERT, but with noteworthy variation across tasks. In particular, dependency parsing reconfigures most of the model, whereas SQuAD and MNLI appear to involve much shallower processing. Finally, we also find that fine-tuning has a weaker effect on representations of out-of-domain sentences, suggesting room for improvement in model generalization.

* 9 pages (not including references), 5 figures 
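
Of the analysis techniques listed, Representational Similarity Analysis is easy to show in miniature: encode the same sentences with the pre-trained and fine-tuned model, build pairwise-distance matrices, and correlate them. The sketch below assumes one vector per sentence per layer and uses cosine distance with a Spearman correlation; the paper's exact RSA configuration may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_similarity(reps_before, reps_after):
    """Compute the pairwise cosine-distance matrix of the same sentences
    under two encoders (e.g. pre-trained vs. fine-tuned BERT, one vector
    per sentence) and return the Spearman correlation of the two
    condensed matrices. Values near 1 mean the layer's representational
    geometry was largely preserved by fine-tuning."""
    d_before = pdist(np.asarray(reps_before), metric="cosine")
    d_after = pdist(np.asarray(reps_after), metric="cosine")
    return spearmanr(d_before, d_after).correlation

# Usage (hypothetical): compare a given layer before and after MNLI
# fine-tuning.
# print(rsa_similarity(layer11_pretrained, layer11_finetuned))
```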

jiant: A Software Toolkit for Research on General-Purpose Text Understanding Models

Mar 04, 2020
Yada Pruksachatkun, Phil Yeres, Haokun Liu, Jason Phang, Phu Mon Htut, Alex Wang, Ian Tenney, Samuel R. Bowman

We introduce jiant, an open source toolkit for conducting multitask and transfer learning experiments on English NLU tasks. jiant enables modular and configuration-driven experimentation with state-of-the-art models and implements a broad set of tasks for probing, transfer learning, and multitask training experiments. jiant implements over 50 NLU tasks, including all GLUE and SuperGLUE benchmark tasks. We demonstrate that jiant reproduces published performance on a variety of tasks and models, including BERT and RoBERTa. jiant is available at https://jiant.info.
