Roy Schwartz

On the Power of Saturated Transformers: A View from Circuit Complexity

Jun 30, 2021

William Merrill, Yoav Goldberg, Roy Schwartz, Noah A. Smith

Abstract: Transformers have become a standard architecture for many NLP problems. This has motivated theoretically analyzing their capabilities as models of language, in order to understand what makes them successful, and what their potential weaknesses might be. Recent work has shown that transformers with hard attention are quite limited in capacity, and in fact can be simulated by constant-depth circuits. However, hard attention is a restrictive assumption, which may complicate the relevance of these results for practical transformers. In this work, we analyze the circuit complexity of transformers with saturated attention: a generalization of hard attention that more closely captures the attention patterns learnable in practical transformers. We show that saturated transformers transcend the limitations of hard-attention transformers. With some minor assumptions, we prove that the number of bits needed to represent a saturated transformer memory vector is $O(\log n)$, which implies saturated transformers can be simulated by log-depth circuits. Thus, the jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.

* Preprint
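
The contrast between hard and saturated attention in the abstract above can be illustrated with a small sketch. It assumes the usual reading of the two terms: hard attention selects a single maximal-score position, while saturated attention averages uniformly over all positions tied for the maximum, which is what softmax approaches as scores are scaled up. The function names, scalar values, and the `scale` parameter are illustrative and not taken from the paper.

```python
import math

def hard_attention(scores, values):
    """Attend only to the first position with the maximal score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return values[best]

def saturated_attention(scores, values):
    """Average uniformly over every position tied for the maximal score."""
    top = max(scores)
    tied = [v for s, v in zip(scores, values) if s == top]
    return sum(tied) / len(tied)

def softmax_attention(scores, values, scale=1.0):
    """Ordinary softmax attention; `scale` multiplies the scores (assumed knob)."""
    top = max(scores)
    weights = [math.exp(scale * (s - top)) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

# Three positions tie for the top score; values mark whether the token is "a".
scores = [1.0, 1.0, 0.0, 1.0]
values = [1.0, 0.0, 0.0, 1.0]

hard = hard_attention(scores, values)            # sees one position only
saturated = saturated_attention(scores, values)  # fraction of "a" among the ties
near_saturated = softmax_attention(scores, values, scale=50.0)
```

With tied scores, the saturated head returns a proportion over the tied positions rather than a single value, which gives an intuition for why it can carry counting-like information that a hard-attention head cannot.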

Provable Limitations of Acquiring Meaning from Ungrounded Form: What will Future Language Models Understand?

Apr 22, 2021

William Merrill, Yoav Goldberg, Roy Schwartz, Noah A. Smith

Abstract: Language models trained on billions of tokens have recently led to unprecedented results on many NLP tasks. This success raises the question of whether, in principle, a system can ever "understand" raw text without access to some form of grounding. We formally investigate the abilities of ungrounded systems to acquire meaning. Our analysis focuses on the role of "assertions": contexts within raw text that provide indirect clues about underlying semantics. We study whether assertions enable a system to emulate representations preserving semantic relations like equivalence. We find that assertions enable semantic emulation if all expressions in the language are referentially transparent. However, if the language uses non-transparent patterns like variable binding, we show that emulation can become an uncomputable problem. Finally, we discuss differences between our formal model and natural language, exploring how our results generalize to a modal setting and other semantic relations. Together, our results suggest that assertions in code or language do not provide sufficient signal to fully emulate semantic representations. We formalize ways in which ungrounded language models appear to be fundamentally limited in their ability to "understand".

* Accepted at TACL; pre-MIT Press publication version

Random Feature Attention

Mar 19, 2021

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, Lingpeng Kong

Abstract: Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. In the machine translation experiment, RFA decodes twice as fast as a vanilla transformer. Compared to existing efficient transformer variants, RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets. Our analysis shows that RFA's efficiency gains are especially notable on long sequences, suggesting that RFA will be particularly useful in tasks that require working with large inputs, fast decoding speed, or low memory footprints.

* ICLR 2021
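
A minimal sketch of why random features make attention linear in the sequence length is given below. The abstract does not spell out the feature map, so the sketch assumes the standard random Fourier construction (sine and cosine of Gaussian projections), which approximates a Gaussian kernel and, for unit-norm queries and keys, the softmax kernel up to a constant that cancels in the normalization. All names are hypothetical, and the optional gating mechanism is omitted.

```python
import math
import random

def random_feature_map(x, projections):
    """Map x to [sin(w.x), cos(w.x)] / sqrt(D) over the D random projections w."""
    dots = [sum(wi * xi for wi, xi in zip(w, x)) for w in projections]
    scale = 1.0 / math.sqrt(len(projections))
    return [scale * math.sin(d) for d in dots] + [scale * math.cos(d) for d in dots]

def linear_attention(query, keys, values, projections):
    """Summarize all keys and values once, then answer the query from the summary."""
    dim_phi, dim_v = 2 * len(projections), len(values[0])
    summary = [[0.0] * dim_v for _ in range(dim_phi)]  # sum_i phi(k_i) v_i^T
    normalizer = [0.0] * dim_phi                       # sum_i phi(k_i)
    for k, v in zip(keys, values):
        phi_k = random_feature_map(k, projections)
        for a in range(dim_phi):
            normalizer[a] += phi_k[a]
            for b in range(dim_v):
                summary[a][b] += phi_k[a] * v[b]
    phi_q = random_feature_map(query, projections)
    denom = sum(p * z for p, z in zip(phi_q, normalizer))
    return [sum(phi_q[a] * summary[a][b] for a in range(dim_phi)) / denom
            for b in range(dim_v)]

def quadratic_attention(query, keys, values, projections):
    """Same kernel with explicit per-key weights, for comparison only."""
    phi_q = random_feature_map(query, projections)
    weights = [sum(p * f for p, f in zip(phi_q, random_feature_map(k, projections)))
               for k in keys]
    total = sum(weights)
    return [sum(w * v[b] for w, v in zip(weights, values)) / total
            for b in range(len(values[0]))]

def unit(vector):
    """Rescale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

rng = random.Random(0)
dim, num_features = 4, 128
projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]
keys = [unit([rng.gauss(0.0, 1.0) for _ in range(dim)]) for _ in range(6)]
values = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(6)]
query = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])

fast = linear_attention(query, keys, values, projections)
slow = quadratic_attention(query, keys, values, projections)
```

Because `summary` and `normalizer` have a fixed size that does not depend on the number of keys, they can be reused for every query, which is where the linear time and space behavior comes from in this sketch.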

Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA

Mar 17, 2021

Yonatan Bitton, Gabriel Stanovsky, Roy Schwartz, Michael Elhadad

Abstract: Recent works have shown that supervised models often exploit data artifacts to achieve good test scores while their performance severely degrades on samples outside their training distribution. Contrast sets (Gardner et al., 2020) quantify this phenomenon by perturbing test samples in a minimal way such that the output label is modified. While most contrast sets were created manually, requiring intensive annotation effort, we present a novel method which leverages rich semantic input representation to automatically generate contrast sets for the visual question answering task. Our method computes the answer of perturbed questions, thus vastly reducing annotation cost and enabling thorough evaluation of models' performance on various semantic aspects (e.g., spatial or relational reasoning). We demonstrate the effectiveness of our approach on the GQA dataset and its semantic scene graph image representation. We find that, despite GQA's compositionality and carefully balanced label distribution, two high-performing models drop 13-17% in accuracy compared to the original test set. Finally, we show that our automatic perturbation can be applied to the training set to mitigate the degradation in performance, opening the door to more robust models.

* Accepted to NAACL 2021

Parameter Norm Growth During Training of Transformers

Nov 11, 2020

William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, Noah Smith

Abstract: The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically some variant of gradient descent (GD). To better understand this bias, we study the tendency of transformer parameters to grow in magnitude during training. We find, both theoretically and empirically, that, in certain contexts, GD increases the parameter $L_2$ norm up to a threshold that itself increases with training-set accuracy. This means increasing training accuracy over time enables the norm to increase. Empirically, we show that the norm grows continuously over pretraining for T5 (Raffel et al., 2019). We show that pretrained T5 approximates a semi-discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity compared to the original network family that can be described in automata-theoretic terms. This suggests saturation is a new characterization of an inductive bias implicit in GD that is of particular interest for NLP. While our experiments focus on transformers, our theoretical analysis extends to other architectures with similar formal properties, such as feedforward ReLU networks.

* Preprint. 9 body pages with appendix

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Oct 15, 2020

Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, Yejin Choi

Abstract: Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We introduce Data Maps: a model-based tool to characterize and diagnose datasets. We leverage a largely ignored source of information: the behavior of the model on individual instances during training (training dynamics) for building data maps. This yields two intuitive measures for each example (the model's confidence in the true class, and the variability of this confidence across epochs) obtained in a single run of training. Experiments across four datasets show that these model-dependent measures reveal three distinct regions in the data map, each with pronounced characteristics. First, our data maps show the presence of "ambiguous" regions with respect to the model, which contribute the most towards out-of-distribution generalization. Second, the most populous regions in the data are "easy to learn" for the model, and play an important role in model optimization. Finally, data maps uncover a region with instances that the model finds "hard to learn"; these often correspond to labeling errors. Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.

* Proceedings of EMNLP 2020
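
The two per-example measures described in the abstract above can be sketched as follows. The sketch assumes confidence is the mean, and variability the standard deviation, of the true-class probability recorded once per epoch. The region thresholds and the function names are hypothetical, chosen only to make the three regions concrete.

```python
from statistics import mean, pstdev

def data_map_coordinates(true_class_probs):
    """Return (confidence, variability) from per-epoch true-class probabilities."""
    confidence = mean(true_class_probs)     # average belief in the gold label
    variability = pstdev(true_class_probs)  # spread of that belief across epochs
    return confidence, variability

def region(confidence, variability, var_cutoff=0.2, conf_cutoff=0.5):
    """Assign a coarse region label; both cutoffs are illustrative assumptions."""
    if variability >= var_cutoff:
        return "ambiguous"
    if confidence >= conf_cutoff:
        return "easy-to-learn"
    return "hard-to-learn"

# Hypothetical training dynamics for three examples over four epochs.
dynamics = {
    "steady_correct": [0.90, 0.95, 0.97, 0.98],
    "steady_wrong": [0.10, 0.05, 0.08, 0.05],
    "flip_flopping": [0.20, 0.80, 0.30, 0.90],
}

coordinates = {name: data_map_coordinates(p) for name, p in dynamics.items()}
labels = {name: region(*coords) for name, coords in coordinates.items()}
```

Both measures come from probabilities that are already computed during a single training run, so building the map in this way needs no extra model passes beyond logging.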

Extracting a Knowledge Base of Mechanisms from COVID-19 Papers

Oct 08, 2020

Aida Amini, Tom Hope, David Wadden, Madeleine van Zuylen, Eric Horvitz, Roy Schwartz, Hannaneh Hajishirzi

Abstract: The urgency of mitigating COVID-19 has spawned a large and diverse body of scientific literature that is challenging for researchers to navigate. This explosion of information has stimulated interest in automated tools to help identify useful knowledge. We have pursued the use of methods for extracting diverse forms of mechanism relations from the natural language of scientific papers. We seek to identify concepts in COVID-19 and related literature which represent activities, functions, associations and causal relations, ranging from cellular processes to economic impacts. We formulate a broad, coarse-grained schema targeting mechanism relations between open, free-form entities. Our approach strikes a balance between expressivity and breadth that supports generalization across diverse concepts. We curate a dataset of scientific papers annotated according to our novel schema. Using an information extraction model trained on this new corpus, we construct a knowledge base (KB) of 2M mechanism relations, which we make publicly available. Our model is able to extract relations at an F1 at least twice that of baselines such as open IE or related scientific IE systems. We conduct experiments examining the ability of our system to retrieve relevant information on viral mechanisms of action, and on applications of AI to COVID-19 research. In both cases, our system identifies relevant information from our automatically-constructed knowledge base with high precision.

* Tom Hope and Aida Amini made an equal contribution as joint first authors, listed in alphabetical order. Data and code: https://git.io/JUhv7

A Mixture of $h-1$ Heads is Better than $h$ Heads

May 13, 2020

Hao Peng, Roy Schwartz, Dianqi Li, Noah A. Smith

Abstract: Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized; attention heads can be pruned without significant performance loss. In this work, we instead "reallocate" them: the model learns to activate different heads on different inputs. Drawing connections between multi-head attention and mixture of experts, we propose the mixture of attentive experts model (MAE). MAE is trained using a block coordinate descent algorithm that alternates between updating (1) the responsibilities of the experts and (2) their parameters. Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks. Particularly, on the WMT14 English to German translation dataset, MAE improves over "transformer-base" by 0.8 BLEU, with a comparable number of parameters. Our analysis shows that our model learns to specialize different experts to different inputs.

* ACL 2020

The Right Tool for the Job: Matching Model and Instance Complexities

May 09, 2020

Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, Noah A. Smith

Abstract: As NLP models become larger, executing a trained model requires significant computational resources incurring monetary and environmental costs. To better respect a given inference budget, we propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from neural network calculations for simple instances, and late (and accurate) exit for hard instances. To achieve this, we add classifiers to different layers of BERT and use their calibrated confidence scores to make early exit decisions. We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks. Our method presents a favorable speed/accuracy tradeoff in almost all cases, producing models which are up to five times faster than the state of the art, while preserving their accuracy. Our method also requires almost no additional training resources (in either time or parameters) compared to the baseline BERT model. Finally, our method alleviates the need for costly retraining of multiple models at different levels of efficiency; we allow users to control the inference speed/accuracy tradeoff using a single trained model, by setting a single variable at inference time. We publicly release our code.

* ACL 2020; 12 pages; code available at https://github.com/allenai/sledgehammer
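
The early-exit decision rule described in the abstract above can be sketched independently of BERT. The sketch assumes each intermediate classifier yields an already calibrated probability distribution and that a single `threshold` is the inference-time variable. It is not the released sledgehammer code, and every name in it is hypothetical.

```python
def early_exit_predict(layer_outputs, threshold):
    """Return (label, exit_depth) from per-layer class probabilities.

    `layer_outputs` yields one probability list per classifier, ordered from the
    shallowest layer to the deepest. Passing a generator keeps deeper layers
    from being computed once an exit is taken.
    """
    label, depth = None, 0
    for depth, probs in enumerate(layer_outputs, start=1):
        confidence = max(probs)
        label = probs.index(confidence)
        if confidence >= threshold:
            return label, depth  # confident enough: stop early
    if label is None:
        raise ValueError("layer_outputs must contain at least one layer")
    return label, depth  # no layer was confident: use the deepest prediction

def lazy_layers(per_layer_probs, computed):
    """Simulate running layers one at a time, recording how many were run."""
    for probs in per_layer_probs:
        computed.append(1)
        yield probs

easy_instance = [[0.95, 0.05], [0.97, 0.03], [0.99, 0.01]]
hard_instance = [[0.55, 0.45], [0.60, 0.40], [0.30, 0.70]]

easy_cost, hard_cost = [], []
easy_result = early_exit_predict(lazy_layers(easy_instance, easy_cost), threshold=0.9)
hard_result = early_exit_predict(lazy_layers(hard_instance, hard_cost), threshold=0.9)
```

Raising `threshold` pushes more instances toward the deeper classifiers and lowering it favors speed, which corresponds to the single-variable speed/accuracy control the abstract describes.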

A Formal Hierarchy of RNN Architectures

Apr 24, 2020

William Merrill, Gail Weiss, Yoav Goldberg, Roy Schwartz, Noah A. Smith, Eran Yahav

Abstract: We develop a formal hierarchy of the expressive capacity of RNN architectures. The hierarchy is based on two formal properties: space complexity, which measures the RNN's memory, and rational recurrence, defined as whether the recurrent update can be described by a weighted finite-state machine. We place several RNN variants within this hierarchy. For example, we prove the LSTM is not rational, which formally separates it from the related QRNN (Bradbury et al., 2016). We also show how these models' expressive capacity is expanded by stacking multiple layers or composing them with different pooling functions. Our results build on the theory of "saturated" RNNs (Merrill, 2019). While formally extending these findings to unsaturated RNNs is left to future work, we hypothesize that the practical learnable capacity of unsaturated RNNs obeys a similar hierarchy. Experimental findings from training unsaturated networks on formal languages support this conjecture.

* To appear at ACL 2020. Updated to include computational cost estimates
