Kenton Lee

Language Model Pre-training for Hierarchical Document Representations

Jan 26, 2019
Ming-Wei Chang, Kristina Toutanova, Kenton Lee, Jacob Devlin

Hierarchical neural architectures are often used to capture long-distance dependencies and have been applied to many document-level tasks such as summarization, document segmentation, and sentiment analysis. However, effective usage of such a large context can be difficult to learn, especially when limited labeled data is available. Building on the recent success of language model pre-training methods for learning flat representations of text, we propose algorithms for pre-training hierarchical document representations from unlabeled data. Unlike prior work, which has focused on pre-training contextual token representations or context-independent sentence/paragraph representations, our hierarchical document representations include fixed-length sentence/paragraph representations which integrate contextual information from the entire document. Experiments on document segmentation, document-level question answering, and extractive document summarization demonstrate the effectiveness of the proposed pre-training algorithms.

A BERT Baseline for the Natural Questions

Jan 24, 2019
Chris Alberti, Kenton Lee, Michael Collins

This technical note describes a new baseline for the Natural Questions. Our model is based on BERT and reduces the gap between the model F1 scores reported in the original dataset paper and the human upper bound by 30% and 50% relative for the long and short answer tasks respectively. This baseline has been submitted to the official NQ leaderboard at ai.google.com/research/NaturalQuestions and we plan to open-source the code for it in the near future.

Improving Span-based Question Answering Systems with Coarsely Labeled Data

Nov 05, 2018
Hao Cheng, Ming-Wei Chang, Kenton Lee, Ankur Parikh, Michael Collins, Kristina Toutanova

We study approaches to improve fine-grained short answer Question Answering models by integrating coarse-grained data annotated for paragraph-level relevance and show that coarsely annotated data can bring significant performance gains. Experiments demonstrate that the standard multi-task learning approach of sharing representations is not the most effective way to leverage coarse-grained annotations. Instead, we can explicitly model the latent fine-grained short answer variables and optimize the marginal log-likelihood directly or use a newly proposed posterior distillation learning objective. Since these latent-variable methods have explicit access to the relationship between the fine and coarse tasks, they result in significantly larger improvements from coarse supervision.
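
To make the marginal log-likelihood idea concrete, here is a minimal sketch in plain Python. It assumes the model gives one unnormalized score per candidate answer span and that the coarse label only tells us which spans are compatible with it. The function names and the `consistent` mask are illustrative assumptions, not the paper's code, and the posterior distillation objective is not shown.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def marginal_log_likelihood(span_scores, consistent):
    """Log of the total probability mass on spans compatible with the coarse label.

    span_scores: unnormalized scores, one per candidate span (hypothetical model output).
    consistent:  booleans, True where a span agrees with the coarse annotation.
    """
    allowed = [s for s, ok in zip(span_scores, consistent) if ok]
    if not allowed:
        raise ValueError("no candidate span is consistent with the coarse label")
    # log p(coarse label) = log sum over consistent spans of softmax(span_scores)
    return log_sum_exp(allowed) - log_sum_exp(span_scores)
```

Maximizing this quantity never commits to one specific span: the latent short answer is summed out, so probability mass can flow to whichever consistent spans the model finds most plausible.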

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Oct 11, 2018
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.

* 13 pages

Syntactic Scaffolds for Semantic Structures

Aug 30, 2018
Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, Noah A. Smith

We introduce the syntactic scaffold, an approach to incorporating syntactic information into semantic tasks. Syntactic scaffolds avoid expensive syntactic processing at runtime, only making use of a treebank during training, through a multitask objective. We improve over strong baselines on PropBank semantics, frame semantics, and coreference resolution, achieving competitive performance on all three tasks.

* Accepted at EMNLP 2018

Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling

Aug 13, 2018
Luheng He, Kenton Lee, Omer Levy, Luke Zettlemoyer

Recent BIO-tagging-based neural semantic role labeling models are very high performing, but assume gold predicates as part of the input and cannot incorporate span-level features. We propose an end-to-end approach for jointly predicting all predicates, argument spans, and the relations between them. The model makes independent decisions about what relationship, if any, holds between every possible word-span pair, and learns contextualized span representations that provide rich, shared input features for each decision. Experiments demonstrate that this approach sets a new state of the art on PropBank SRL without gold predicates.

* 5 pages, ACL 2018

Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum

May 09, 2018
Omer Levy, Kenton Lee, Nicholas FitzGerald, Luke Zettlemoyer

LSTMs were introduced to combat vanishing gradients in simple RNNs by augmenting them with gated additive recurrent connections. We present an alternative view to explain the success of LSTMs: the gates themselves are versatile recurrent models that provide more representational power than previously appreciated. We do this by decoupling the LSTM's gates from the embedded simple RNN, producing a new class of RNNs where the recurrence computes an element-wise weighted sum of context-independent functions of the input. Ablations on a range of problems demonstrate that the gating mechanism alone performs as well as an LSTM in most settings, strongly suggesting that the gates are doing much more in practice than just alleviating vanishing gradients.

* ACL 2018
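
The "element-wise weighted sum" view can be checked with a small sketch. It uses the standard LSTM memory update c_t = i_t * x_t + f_t * c_{t-1}, where x_t stands for the content vector at step t. As a simplifying assumption, the gate values are passed in as plain numbers rather than computed by a network, and all names are illustrative.

```python
def recurrent_memory(contents, input_gates, forget_gates):
    """Run the gated recurrence c_t = i_t * x_t + f_t * c_{t-1}, element-wise."""
    c = [0.0] * len(contents[0])
    for x, i, f in zip(contents, input_gates, forget_gates):
        c = [i_d * x_d + f_d * c_d for x_d, i_d, f_d, c_d in zip(x, i, f, c)]
    return c


def weighted_sum_memory(contents, input_gates, forget_gates):
    """Compute the same final memory as an explicit weighted sum over time steps."""
    steps = len(contents)
    total = [0.0] * len(contents[0])
    for j in range(steps):
        # Weight of step j: its input gate times every later forget gate.
        w = list(input_gates[j])
        for k in range(j + 1, steps):
            w = [w_d * f_d for w_d, f_d in zip(w, forget_gates[k])]
        total = [t_d + w_d * x_d for t_d, w_d, x_d in zip(total, w, contents[j])]
    return total
```

Both functions return the same vector for any inputs of matching shape, which is the sense in which the final memory is a gate-weighted sum of the per-step content vectors.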

Higher-order Coreference Resolution with Coarse-to-fine Inference

Apr 15, 2018
Kenton Lee, Luheng He, Luke Zettlemoyer

We introduce a fully differentiable approximation to higher-order inference for coreference resolution. Our approach uses the antecedent distribution from a span-ranking architecture as an attention mechanism to iteratively refine span representations. This enables the model to softly consider multiple hops in the predicted clusters. To alleviate the computational cost of this iterative process, we introduce a coarse-to-fine approach that incorporates a less accurate but more efficient bilinear factor, enabling more aggressive pruning without hurting accuracy. Compared to the existing state-of-the-art span-ranking approach, our model significantly improves accuracy on the English OntoNotes benchmark, while being far more computationally efficient.

* Accepted to NAACL 2018
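
One refinement step of the kind described above might look like the following sketch. It treats the softmax over antecedent scores as attention weights, averages the antecedent vectors under those weights, and blends the result into the current span vector. The fixed `gate` value is an illustrative stand-in for whatever learned interpolation the model uses, and all names here are assumptions.

```python
import math


def softmax(scores):
    """Turn a list of scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def refine_span(span_vec, antecedent_vecs, antecedent_scores, gate=0.5):
    """Blend a span vector with its expected antecedent vector.

    gate: fraction of the original span vector to keep (hypothetical constant).
    """
    probs = softmax(antecedent_scores)
    expected = [
        sum(p * vec[d] for p, vec in zip(probs, antecedent_vecs))
        for d in range(len(span_vec))
    ]
    return [gate * s + (1.0 - gate) * e for s, e in zip(span_vec, expected)]
```

Applying `refine_span` repeatedly, with antecedent scores recomputed from the refined vectors each time, is what lets information travel over more than one hop in a predicted cluster.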

Deep contextualized word representations

Mar 22, 2018
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer

We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

* NAACL 2018. Originally posted to openreview 27 Oct 2017. v2 updated for NAACL camera ready
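
One simple way a word vector can be a "learned function of the internal states" is a softmax-weighted mix of the per-layer states with a single scale factor. The sketch below shows that combination for one token in plain Python. The logits and `gamma` would be trained by the downstream task; here they are ordinary arguments, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Turn a list of scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def scalar_mix(layer_states, layer_logits, gamma=1.0):
    """Combine per-layer vectors for one token into a single vector.

    layer_states: one vector per biLM layer, all of the same length.
    layer_logits: one unnormalized weight per layer (task-specific parameters).
    gamma:        overall scale applied to the mixed vector.
    """
    weights = softmax(layer_logits)
    dim = len(layer_states[0])
    return [
        gamma * sum(w * h[d] for w, h in zip(weights, layer_states))
        for d in range(dim)
    ]
```

Because each downstream task learns its own layer weights, a task that benefits from syntactic signals can lean on different layers than a task that benefits from semantic ones.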

End-to-end Neural Coreference Resolution

Dec 15, 2017
Kenton Lee, Luheng He, Mike Lewis, Luke Zettlemoyer

We introduce the first end-to-end coreference resolution model and show that it significantly outperforms all previous work without using a syntactic parser or hand-engineered mention detector. The key idea is to directly consider all spans in a document as potential mentions and learn distributions over possible antecedents for each. The model computes span embeddings that combine context-dependent boundary representations with a head-finding attention mechanism. It is trained to maximize the marginal likelihood of gold antecedent spans from coreference clusters and is factored to enable aggressive pruning of potential mentions. Experiments demonstrate state-of-the-art performance, with a gain of 1.5 F1 on the OntoNotes benchmark and of 3.1 F1 using a 5-model ensemble, despite the fact that this is the first approach to be successfully trained with no external resources.

* Accepted to EMNLP 2017
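
The "consider all spans, then prune" step can be sketched as follows. The code enumerates every contiguous span up to an optional width limit and keeps the highest-scoring ones. The `max_width` and `keep` parameters, and the idea that a separate scorer supplies one mention score per span, are illustrative assumptions rather than the paper's exact settings.

```python
def enumerate_spans(num_tokens, max_width=None):
    """List all (start, end) spans with inclusive token indices."""
    if max_width is None:
        max_width = num_tokens
    return [
        (start, end)
        for start in range(num_tokens)
        for end in range(start, min(start + max_width, num_tokens))
    ]


def prune_spans(spans, mention_scores, keep):
    """Keep the `keep` best-scoring spans, returned in document order."""
    ranked = sorted(zip(mention_scores, spans), reverse=True)[:keep]
    return sorted(span for _, span in ranked)
```

Without a width limit the number of spans grows quadratically with document length, which is why pruning by mention score before scoring antecedent pairs matters for tractability.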