Lifu Tu

An Empirical Study on Robustness to Spurious Correlations using Pre-trained Language Models

Aug 11, 2020

Lifu Tu, Garima Lalwani, Spandana Gella, He He

Abstract: Recent work has shown that pre-trained language models such as BERT improve robustness to spurious correlations in the dataset. Intrigued by these results, we find that the key to their success is generalization from a small number of counterexamples where the spurious correlations do not hold. When such minority examples are scarce, pre-trained models perform as poorly as models trained from scratch. In the case of extreme minority, we propose to use multi-task learning (MTL) to improve generalization. Our experiments on natural language inference and paraphrase identification show that MTL with the right auxiliary tasks significantly improves performance on challenging examples without hurting the in-distribution performance. Further, we show that the gain from MTL mainly comes from improved generalization from the minority examples. Our results highlight the importance of data diversity for overcoming spurious correlations.

* Accepted to TACL 2020

ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation

May 12, 2020

Lifu Tu, Richard Yuanzhe Pang, Sam Wiseman, Kevin Gimpel

Abstract: We propose to train a non-autoregressive machine translation model to minimize the energy defined by a pretrained autoregressive model. In particular, we view our non-autoregressive translation system as an inference network (Tu and Gimpel, 2018) trained to minimize the autoregressive teacher energy. This contrasts with the popular approach of training a non-autoregressive model on a distilled corpus consisting of the beam-searched outputs of such a teacher model. Our approach, which we call ENGINE (ENerGy-based Inference NEtworks), achieves state-of-the-art non-autoregressive results on the IWSLT 2014 DE-EN and WMT 2016 RO-EN datasets, approaching the performance of autoregressive models.

* ACL 2020 camera-ready version

Improving Joint Training of Inference Networks and Structured Prediction Energy Networks

Nov 07, 2019

Lifu Tu, Richard Yuanzhe Pang, Kevin Gimpel

Abstract: Deep energy-based models are powerful, but pose challenges for learning and inference (Belanger and McCallum, 2016). Tu and Gimpel (2018) developed an efficient framework for energy-based models by training "inference networks" to approximate structured inference instead of using gradient descent. However, their alternating optimization approach suffers from instabilities during training, requiring additional loss terms and careful hyperparameter tuning. In this paper, we contribute several strategies to stabilize and improve this joint training of energy functions and inference networks for structured prediction. We design a compound objective to jointly train both cost-augmented and test-time inference networks along with the energy function. We propose joint parameterizations for the inference networks that encourage them to capture complementary functionality during learning. We empirically validate our strategies on two sequence labeling tasks, showing easier paths to strong performance than prior work, as well as further improvements with global energy terms.

Generating Diverse Story Continuations with Controllable Semantics

Sep 30, 2019

Lifu Tu, Xiaoan Ding, Dong Yu, Kevin Gimpel

Abstract: We propose a simple and effective modeling framework for controlled generation of multiple, diverse outputs. We focus on the setting of generating the next sentence of a story given its context. As controllable dimensions, we consider several sentence attributes, including sentiment, length, predicates, frames, and automatically-induced clusters. Our empirical results demonstrate: (1) our framework is accurate in terms of generating outputs that match the target control values; (2) our model yields increased maximum metric scores compared to standard n-best list generation via beam search; (3) controlling generation with semantic frames leads to a stronger combination of diversity and quality than other control variables as measured by automatic metrics. We also conduct a human evaluation to assess the utility of providing multiple suggestions for creative writing, demonstrating promising results for the potential of controllable, diverse generation in a collaborative writing system.

* EMNLP 2019 Workshop on Neural Generation and Translation (WNGT2019), and non-archival acceptance in NeuralGen 2019

Benchmarking Approximate Inference Methods for Neural Structured Prediction

Apr 01, 2019

Lifu Tu, Kevin Gimpel

Abstract: Exact structured inference with neural network scoring functions is computationally challenging, but several methods have been proposed for approximating inference. One approach is to perform gradient descent with respect to the output structure directly (Belanger and McCallum, 2016). Another approach, proposed recently, is to train a neural network (an "inference network") to perform inference (Tu and Gimpel, 2018). In this paper, we compare these two families of inference methods on three sequence labeling datasets. We choose sequence labeling because it permits us to use exact inference as a benchmark in terms of speed, accuracy, and search error. Across datasets, we demonstrate that inference networks achieve a better speed/accuracy/search error trade-off than gradient descent, while also being faster than exact inference at similar accuracy levels. We find further benefit by combining inference networks and gradient descent, using the former to provide a warm start for the latter.

* Accepted by NAACL 2019
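
The notion of search error used in the abstract above can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a hypothetical linear-chain energy built from random unary scores `U` and transition scores `T`, relaxes each label to a softmax distribution (one possible relaxation among several), runs gradient descent on the relaxed energy, and compares the rounded result against exact Viterbi inference. All names and hyperparameters are illustrative assumptions.

```python
import numpy as np

def energy(U, T, labels):
    """Energy of a discrete label sequence (lower is better)."""
    unary = U[np.arange(len(labels)), labels].sum()
    pairwise = T[labels[:-1], labels[1:]].sum()
    return -(unary + pairwise)

def viterbi(U, T):
    """Exact minimum-energy sequence via dynamic programming."""
    n, k = U.shape
    score = U[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + T          # cand[i, j]: best path ending in i, then i -> j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + U[t]
    labels = [int(score.argmax())]
    for t in range(n - 1, 0, -1):          # follow back-pointers from the last position
        labels.append(int(back[t, labels[-1]]))
    return np.array(labels[::-1])

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gradient_descent_inference(U, T, steps=200, lr=0.5):
    """Approximate inference: gradient descent on a softmax-relaxed output."""
    n, k = U.shape
    z = np.zeros((n, k))                   # logits of the relaxed labels
    for _ in range(steps):
        p = softmax(z)
        g = -U                             # gradient of relaxed energy w.r.t. p
        g[1:] -= p[:-1] @ T                # contribution of the previous position
        g[:-1] -= p[1:] @ T.T              # contribution of the next position
        # chain rule through the softmax
        z -= lr * p * (g - (p * g).sum(axis=1, keepdims=True))
    return softmax(z).argmax(axis=1)       # round to a discrete sequence

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 3))                # 6 positions, 3 labels (toy sizes)
T = rng.normal(size=(3, 3))
exact = viterbi(U, T)
approx = gradient_descent_inference(U, T)
search_error = energy(U, T, approx) - energy(U, T, exact)
print("exact:", exact, "gradient descent:", approx, "search error:", search_error)
```

Because Viterbi returns the true minimum of this toy energy, the search error is never negative; a value of zero means the approximate method found an optimal sequence.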

Learning Approximate Inference Networks for Structured Prediction

Mar 09, 2018

Lifu Tu, Kevin Gimpel

Abstract: Structured prediction energy networks (SPENs; Belanger & McCallum 2016) use neural network architectures to define energy functions that can capture arbitrary dependencies among parts of structured outputs. Prior work used gradient descent for inference, relaxing the structured output to a set of continuous variables and then optimizing the energy with respect to them. We replace this use of gradient descent with a neural network trained to approximate structured argmax inference. This "inference network" outputs continuous values that we treat as the output structure. We develop large-margin training criteria for joint training of the structured energy function and inference network. On multi-label classification we report speed-ups of 10-60x compared to Belanger et al. (2017) while also improving accuracy. For sequence labeling with simple structured energies, our approach performs comparably to exact inference while being much faster at test time. We then demonstrate improved accuracy by augmenting the energy with a "label language model" that scores entire output label sequences, showing it can improve handling of long-distance dependencies in part-of-speech tagging. Finally, we show how inference networks can replace dynamic programming for test-time inference in conditional random fields, suggestive for their general use for fast inference in structured settings.

* Accepted by ICLR 2018
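
The large-margin criterion mentioned in the abstract above can be sketched as follows. This is one common way to write a margin-rescaled objective of this kind, not a formula quoted from the listing: the symbols are assumptions, with $E_\Theta$ the energy function, $A_\Phi$ an inference network, $\Delta$ a structured cost, and $(x_i, y_i)$ the training pairs. The usual structured hinge loss maximizes over all outputs $y$; replacing that maximization with the output of a trained network yields a min-max problem.

```latex
% Standard structured hinge loss (maximization over outputs y):
\min_{\Theta} \sum_{i} \max_{y}
  \bigl[ \Delta(y, y_i) - E_\Theta(x_i, y) + E_\Theta(x_i, y_i) \bigr]_{+}

% Sketch with an inference network A_\Phi standing in for the inner max:
\min_{\Theta} \max_{\Phi} \sum_{i}
  \bigl[ \Delta(A_\Phi(x_i), y_i) - E_\Theta(x_i, A_\Phi(x_i)) + E_\Theta(x_i, y_i) \bigr]_{+}
```

Here $[u]_+ = \max(0, u)$. The inference network is pushed toward outputs with high cost and low energy, while the energy function is trained to separate those outputs from the gold structure by a margin; the paper itself should be consulted for the exact criteria it uses.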

Learning to Embed Words in Context for Syntactic Tasks

Jun 12, 2017

Lifu Tu, Kevin Gimpel, Karen Livescu

Abstract: We present models for embedding words in the context of surrounding words. Such models, which we refer to as token embeddings, represent the characteristics of a word that are specific to a given context, such as word sense, syntactic category, and semantic role. We explore simple, efficient token embedding models based on standard neural network architectures. We learn token embeddings on a large amount of unannotated text and evaluate them as features for part-of-speech taggers and dependency parsers trained on much smaller amounts of annotated data. We find that predictors endowed with token embeddings consistently outperform baseline predictors across a range of context window and training set sizes.

* Accepted by ACL 2017 Repl4NLP workshop

Network Inference by Learned Node-Specific Degree Prior

Feb 07, 2016

Qingming Tang, Lifu Tu, Weiran Wang, Jinbo Xu

Abstract: We propose a novel method for network inference from partially observed edges using a node-specific degree prior. The degree prior is derived from observed edges in the network to be inferred, and its hyper-parameters are determined by cross validation. Then we formulate network inference as a matrix completion problem regularized by our degree prior. Our theoretical analysis indicates that this prior favors a network following the learned degree distribution, and may lead to a better network recovery error bound than previous work. Experimental results on both simulated and real biological networks demonstrate the superior performance of our method in various settings.
