Lifu Tu

ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation

May 02, 2020
Lifu Tu, Richard Yuanzhe Pang, Sam Wiseman, Kevin Gimpel

We propose to train a non-autoregressive machine translation model to minimize the energy defined by a pretrained autoregressive model. In particular, we view our non-autoregressive translation system as an inference network (Tu and Gimpel, 2018) trained to minimize the autoregressive teacher energy. This contrasts with the popular approach of training a non-autoregressive model on a distilled corpus consisting of the beam-searched outputs of such a teacher model. Our approach, which we call ENGINE (ENerGy-based Inference NEtworks), achieves state-of-the-art non-autoregressive results on the IWSLT 2014 DE-EN and WMT 2016 RO-EN datasets, approaching the performance of autoregressive models.

* ACL 2020 
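To make the training objective above concrete, the following is a minimal PyTorch-style sketch of training a non-autoregressive model to minimize the energy defined by a frozen, pretrained autoregressive teacher. It is a deliberate simplification: the teacher here is conditioned on the reference prefix rather than on the inference network's own relaxed outputs as in the paper, and the names (teacher, nat_logits) and tensor shapes are illustrative assumptions, not the authors' released code.

import torch
import torch.nn.functional as F

def engine_loss(nat_logits, teacher, src, tgt_in):
    """nat_logits: (batch, tgt_len, vocab) from the non-autoregressive model.
    teacher:      frozen autoregressive model returning per-step logits.
    src, tgt_in:  source tokens and shifted target tokens for the teacher."""
    # Relax the discrete output to a distribution at every target position.
    q = F.softmax(nat_logits, dim=-1)                 # (B, T, V)

    with torch.no_grad():                             # teacher stays fixed
        teacher_logits = teacher(src, tgt_in)         # (B, T, V)
    log_p = F.log_softmax(teacher_logits, dim=-1)

    # Energy = expected negative log-likelihood under the teacher,
    # differentiable w.r.t. the inference network only through q.
    energy = -(q * log_p).sum(dim=-1).mean()
    return energy

Because the teacher is frozen, gradients flow only through the relaxed distribution q, which is what lets the non-autoregressive model be trained directly against the teacher energy rather than against a distilled corpus.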

Improving Joint Training of Inference Networks and Structured Prediction Energy Networks

Nov 07, 2019
Lifu Tu, Richard Yuanzhe Pang, Kevin Gimpel

Deep energy-based models are powerful, but pose challenges for learning and inference (Belanger and McCallum, 2016). Tu and Gimpel (2018) developed an efficient framework for energy-based models by training "inference networks" to approximate structured inference instead of using gradient descent. However, their alternating optimization approach suffers from instabilities during training, requiring additional loss terms and careful hyperparameter tuning. In this paper, we contribute several strategies to stabilize and improve this joint training of energy functions and inference networks for structured prediction. We design a compound objective to jointly train both cost-augmented and test-time inference networks along with the energy function. We propose joint parameterizations for the inference networks that encourage them to capture complementary functionality during learning. We empirically validate our strategies on two sequence labeling tasks, showing easier paths to strong performance than prior work, as well as further improvements with global energy terms.
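As a rough illustration of the alternating optimization this work builds on, the sketch below trains an energy network and an inference network against a single margin-rescaled hinge. The compound objective and the separate test-time inference network from the paper are omitted; all names, the soft Hamming-style cost, and the optimizer handling are illustrative assumptions.

import torch

def hinge(energy_net, x, y_gold, y_pred, cost):
    # Margin-rescaled structured hinge: the gold output should have lower
    # energy than the predicted output by at least `cost`.
    return torch.relu(cost + energy_net(x, y_gold) - energy_net(x, y_pred))

def joint_step(energy_net, inf_net, x, y_gold, opt_e, opt_a):
    # 1) Inference-network step: produce a relaxed structure that violates
    #    the margin as much as possible (maximize the hinge). Gradients that
    #    leak into the energy parameters here are cleared before step 2.
    y_pred = inf_net(x)                                   # (B, T, num_labels)
    cost = 1.0 - (y_pred * y_gold).sum(-1).mean(-1)       # soft Hamming cost
    loss_a = -hinge(energy_net, x, y_gold, y_pred, cost).mean()
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()

    # 2) Energy step: minimize the same hinge w.r.t. the energy parameters,
    #    treating the inference network's output as a fixed negative example.
    y_pred = inf_net(x).detach()
    cost = 1.0 - (y_pred * y_gold).sum(-1).mean(-1)
    loss_e = hinge(energy_net, x, y_gold, y_pred, cost).mean()
    opt_e.zero_grad(); loss_e.backward(); opt_e.step()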


Generating Diverse Story Continuations with Controllable Semantics

Sep 30, 2019
Lifu Tu, Xiaoan Ding, Dong Yu, Kevin Gimpel

We propose a simple and effective modeling framework for controlled generation of multiple, diverse outputs. We focus on the setting of generating the next sentence of a story given its context. As controllable dimensions, we consider several sentence attributes, including sentiment, length, predicates, frames, and automatically-induced clusters. Our empirical results demonstrate: (1) our framework is accurate in terms of generating outputs that match the target control values; (2) our model yields increased maximum metric scores compared to standard n-best list generation via beam search; (3) controlling generation with semantic frames leads to a stronger combination of diversity and quality than other control variables as measured by automatic metrics. We also conduct a human evaluation to assess the utility of providing multiple suggestions for creative writing, demonstrating promising results for the potential of controllable, diverse generation in a collaborative writing system.

* EMNLP 2019 Workshop on Neural Generation and Translation (WNGT 2019), and non-archival acceptance in NeuralGen 2019 
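A minimal sketch of the kind of controllable decoder the abstract describes is given below, assuming a generic seq2seq decoder. The control-embedding wiring, the hypothetical decode_fn, and the strategy of decoding once per control value to obtain diverse outputs are illustrative assumptions rather than the paper's exact architecture.

import torch
import torch.nn as nn

class ControlledDecoder(nn.Module):
    def __init__(self, decoder, num_controls, ctrl_dim):
        super().__init__()
        self.decoder = decoder                        # any seq2seq decoder
        self.ctrl_emb = nn.Embedding(num_controls, ctrl_dim)

    def forward(self, context_state, ctrl_id, tgt_in):
        # Concatenate the control embedding (e.g. a frame or sentiment ID)
        # with the story-context state so every generated token is
        # conditioned on the chosen attribute value.
        ctrl = self.ctrl_emb(ctrl_id)
        init = torch.cat([context_state, ctrl], dim=-1)
        return self.decoder(init, tgt_in)

def diverse_continuations(model, context_state, control_values, decode_fn):
    # Diversity comes from varying the control value, one decode per value,
    # rather than from taking an n-best list out of a single beam search.
    return [decode_fn(model, context_state, c) for c in control_values]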

Benchmarking Approximate Inference Methods for Neural Structured Prediction

Apr 01, 2019
Lifu Tu, Kevin Gimpel

Exact structured inference with neural network scoring functions is computationally challenging, but several methods have been proposed for approximating inference. One approach is to perform gradient descent with respect to the output structure directly (Belanger and McCallum, 2016). Another approach, proposed recently, is to train a neural network (an "inference network") to perform inference (Tu and Gimpel, 2018). In this paper, we compare these two families of inference methods on three sequence labeling datasets. We choose sequence labeling because it permits us to use exact inference as a benchmark in terms of speed, accuracy, and search error. Across datasets, we demonstrate that inference networks achieve a better speed/accuracy/search error trade-off than gradient descent, while also being faster than exact inference at similar accuracy levels. We find further benefit by combining inference networks and gradient descent, using the former to provide a warm start for the latter.

* Accepted by NAACL 2019 
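The sketch below illustrates the gradient-descent inference baseline and the warm start discussed above, assuming an energy function that returns a scalar for a relaxed label sequence. The step count, learning rate, and function signatures are illustrative assumptions.

import torch
import torch.nn.functional as F

def gd_inference(energy_net, x, seq_len, num_labels, init_logits=None,
                 steps=20, lr=0.5):
    """Relax the label sequence to per-position distributions and minimize
    the energy by gradient descent on the logits."""
    if init_logits is None:
        logits = torch.zeros(seq_len, num_labels, requires_grad=True)
    else:
        # Warm start: begin from the inference network's prediction.
        logits = init_logits.clone().detach().requires_grad_(True)

    opt = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        y_relaxed = F.softmax(logits, dim=-1)
        energy_net(x, y_relaxed).backward()     # scalar energy, lower = better
        opt.step()
    return logits.argmax(dim=-1)                # discretize at the end

The combination reported in the paper corresponds to calling this with init_logits set to the inference network's output, e.g. gd_inference(E, x, T, L, init_logits=inference_net(x)).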

Learning Approximate Inference Networks for Structured Prediction

Mar 09, 2018
Lifu Tu, Kevin Gimpel

Structured prediction energy networks (SPENs; Belanger and McCallum, 2016) use neural network architectures to define energy functions that can capture arbitrary dependencies among parts of structured outputs. Prior work used gradient descent for inference, relaxing the structured output to a set of continuous variables and then optimizing the energy with respect to them. We replace this use of gradient descent with a neural network trained to approximate structured argmax inference. This "inference network" outputs continuous values that we treat as the output structure. We develop large-margin training criteria for joint training of the structured energy function and inference network. On multi-label classification, we report speed-ups of 10-60x compared to Belanger et al. (2017) while also improving accuracy. For sequence labeling with simple structured energies, our approach performs comparably to exact inference while being much faster at test time. We then demonstrate improved accuracy by augmenting the energy with a "label language model" that scores entire output label sequences, showing it can improve handling of long-distance dependencies in part-of-speech tagging. Finally, we show how inference networks can replace dynamic programming for test-time inference in conditional random fields, suggesting their general usefulness for fast inference in structured settings.

* Accepted by ICLR 2018 
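For the multi-label setting mentioned above, the following is a minimal sketch of a SPEN-style energy with a local per-label term and a global term that scores the label vector as a whole, in the spirit of Belanger and McCallum (2016). Layer sizes and the exact form of the global network are illustrative assumptions.

import torch
import torch.nn as nn

class SPENEnergy(nn.Module):
    def __init__(self, feat_dim, num_labels, hidden=150):
        super().__init__()
        self.local = nn.Linear(feat_dim, num_labels)          # per-label scores
        self.global_net = nn.Sequential(                      # scores label vector
            nn.Linear(num_labels, hidden), nn.Softplus(), nn.Linear(hidden, 1))

    def forward(self, x, y):
        # x: (B, feat_dim) input features; y: (B, num_labels) relaxed labels in [0, 1].
        local_term = -(self.local(x) * y).sum(dim=-1)
        global_term = self.global_net(y).squeeze(-1)
        return local_term + global_term                       # lower = better

The inference network in this setting is simply a feed-forward classifier mapping x to sigmoid outputs over the labels, trained with the large-margin criteria to approximately minimize this energy instead of running gradient descent at test time.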

Learning to Embed Words in Context for Syntactic Tasks

Jun 12, 2017
Lifu Tu, Kevin Gimpel, Karen Livescu

We present models for embedding words in the context of surrounding words. Such models, which we refer to as token embeddings, represent the characteristics of a word that are specific to a given context, such as word sense, syntactic category, and semantic role. We explore simple, efficient token embedding models based on standard neural network architectures. We learn token embeddings on a large amount of unannotated text and evaluate them as features for part-of-speech taggers and dependency parsers trained on much smaller amounts of annotated data. We find that predictors endowed with token embeddings consistently outperform baseline predictors across a range of context window and training set sizes.

* Accepted by ACL 2017 Repl4NLP workshop 
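One simple token-embedding model of the kind the abstract describes can be sketched as a small bidirectional LSTM whose hidden states serve as context-sensitive vectors for each token occurrence, later used as extra features for a tagger or parser. The dimensions and this particular encoder are illustrative assumptions rather than the paper's specific models.

import torch
import torch.nn as nn

class TokenEmbedder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (B, T) -> (B, T, 2 * hidden_dim) context-sensitive vectors,
        # one per token occurrence rather than one per word type.
        states, _ = self.bilstm(self.emb(token_ids))
        return states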

Network Inference by Learned Node-Specific Degree Prior

Feb 07, 2016
Qingming Tang, Lifu Tu, Weiran Wang, Jinbo Xu

We propose a novel method for network inference from partially observed edges using a node-specific degree prior. The degree prior is derived from observed edges in the network to be inferred, and its hyper-parameters are determined by cross-validation. We then formulate network inference as a matrix completion problem regularized by our degree prior. Our theoretical analysis indicates that this prior favors a network following the learned degree distribution, and may lead to an improved network recovery error bound compared to previous work. Experimental results on both simulated and real biological networks demonstrate the superior performance of our method in various settings.
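A rough sketch of the regularized matrix-completion formulation described above follows. The squared degree penalty stands in for the learned node-specific prior and is an illustrative assumption, as are the simple gradient updates and hyper-parameters.

import numpy as np

def degree_prior_completion(a_obs, mask, target_deg, lam=0.1, lr=0.01, iters=500):
    """a_obs: (n, n) observed adjacency values; mask: 1 where an entry is observed;
    target_deg: (n,) node-specific degrees learned from the observed edges."""
    x = a_obs.copy()
    for _ in range(iters):
        # Data term: stay close to the observed entries.
        grad = 2 * mask * (x - a_obs)
        # Degree-prior term: push each node's degree toward its target value.
        deg_err = x.sum(axis=1) - target_deg              # (n,)
        grad += 2 * lam * deg_err[:, None]                # broadcast over rows
        x -= lr * grad
        x = np.clip((x + x.T) / 2, 0.0, 1.0)              # keep symmetric, in [0, 1]
    return x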
