Noam Shazeer

Talking-Heads Attention

Mar 05, 2020

Noam Shazeer, Zhenzhong Lan, Youlong Cheng, Nan Ding, Le Hou

Abstract: We introduce "talking-heads attention" - a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. While inserting only a small number of additional parameters and a moderate amount of additional computation, talking-heads attention leads to better perplexities on masked language modeling tasks, as well as better quality when transfer-learning to language comprehension and question answering tasks.
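
The head-mixing step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, the argument layout, the two mixing matrices `proj_logits` and `proj_weights`, and the 1/sqrt(d_k) scaling are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def talking_heads_attention(q, k, v, proj_logits, proj_weights):
    """Attention with linear mixing across the heads dimension.

    Shapes (all names are assumptions of this sketch):
      q: [h_k, n, d_k]   k: [h_k, m, d_k]   v: [h_v, m, d_v]
      proj_logits:  [h_k, h]   applied just before the softmax
      proj_weights: [h, h_v]   applied just after the softmax
    """
    d_k = q.shape[-1]
    # Per-head dot-product logits; the scaling is a conventional choice.
    logits = np.einsum("hnd,hmd->hnm", q, k) / np.sqrt(d_k)
    # Mix information across heads before the softmax.
    logits = np.einsum("hnm,hg->gnm", logits, proj_logits)
    weights = softmax(logits, axis=-1)
    # Mix information across heads again after the softmax.
    weights = np.einsum("gnm,gv->vnm", weights, proj_weights)
    return np.einsum("vnm,vmd->vnd", weights, v)
```

With both mixing matrices set to the identity, this sketch reduces to ordinary multi-head attention, which is a convenient sanity check.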

How Much Knowledge Can You Pack Into the Parameters of a Language Model?

Feb 24, 2020

Adam Roberts, Colin Raffel, Noam Shazeer

Abstract: It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the practical utility of this approach by fine-tuning pre-trained models to answer questions without access to any external context or knowledge. We show that this approach scales surprisingly well with model size and outperforms models that explicitly look up knowledge on the open-domain variants of Natural Questions and WebQuestions.

GLU Variants Improve Transformer

Feb 12, 2020

Noam Shazeer

Abstract: Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
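
As a rough illustration of the gating described above, the following NumPy sketch builds a feed-forward sublayer whose hidden activation is the component-wise product of two linear projections, with a swappable gate function. The function and weight names, the dictionary keys, the omission of bias terms, and the tanh approximation of GELU are assumptions of this sketch, not details taken from the abstract.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    return np.maximum(z, 0.0)


def gelu_tanh(z):
    # Common tanh approximation of GELU (an assumption of this sketch).
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))


def swish(z):
    return z * sigmoid(z)


def identity(z):
    return z


# Gate functions that can replace the sigmoid; keys are labels for this sketch.
GATES = {
    "GLU": sigmoid,
    "Bilinear": identity,
    "ReGLU": relu,
    "GEGLU": gelu_tanh,
    "SwiGLU": swish,
}


def gated_ffn(x, w_gate, w_value, w_out, gate="GLU"):
    """Feed-forward sublayer with a gated hidden layer (biases omitted).

    x: [n, d_model], w_gate and w_value: [d_model, d_ff], w_out: [d_ff, d_model].
    """
    hidden = GATES[gate](x @ w_gate) * (x @ w_value)  # component-wise product
    return hidden @ w_out
```

Note that a gated layer uses three weight matrices where a plain ReLU feed-forward layer uses two, so comparisons at equal parameter count would need a smaller `d_ff`.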

Faster Transformer Decoding: N-gram Masked Self-Attention

Jan 14, 2020

Ciprian Chelba, Mia Chen, Ankur Bapna, Noam Shazeer

Abstract: Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S=s_1, \ldots, s_S$, we propose truncating the target-side window used for computing self-attention by making an $N$-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the $N$-gram masked self-attention model loses very little in BLEU score for $N$ values in the range $4, \ldots, 8$, depending on the task.
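
One way to picture the truncated target-side window is as a banded causal mask. The sketch below assumes a particular convention (each position attends to itself and the N-1 positions before it); the abstract does not specify the exact window boundaries, and the function names are hypothetical.

```python
import numpy as np


def ngram_causal_mask(length, n):
    """Boolean mask where entry [i, j] is True if query i may attend to key j.

    Assumed convention: position i sees positions i-n+1 .. i (a window of n).
    """
    i = np.arange(length)[:, None]
    j = np.arange(length)[None, :]
    return (j <= i) & (j > i - n)


def mask_logits(logits, mask):
    # Disallowed positions get -inf so they receive zero softmax weight.
    return np.where(mask, logits, -np.inf)
```

When `n` is at least the sequence length, the mask is the usual full causal mask, so the standard decoder is recovered as a special case.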

Fast Transformer Decoding: One Write-Head is All You Need

Nov 06, 2019

Noam Shazeer

Abstract: Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors. We propose a variant called multi-query attention, where the keys and values are shared across all of the different attention "heads", greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding. We verify experimentally that the resulting models can indeed be much faster to decode, and incur only minor quality degradation from the baseline.
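
The sharing of keys and values across heads can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the paper's code: the function names, tensor layouts, the 1/sqrt(d_k) scaling, and the cache-size helper are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_query_attention(q, k, v):
    """Each head has its own queries; keys and values are shared by all heads.

    q: [h, n, d_k]   k: [m, d_k]   v: [m, d_v]   returns: [h, n, d_v]
    """
    logits = np.einsum("hnd,md->hnm", q, k) / np.sqrt(q.shape[-1])
    weights = softmax(logits, axis=-1)
    return np.einsum("hnm,mv->hnv", weights, v)


def kv_cache_elements(heads, length, d_k, d_v, shared):
    # Number of cached key/value elements that incremental decoding must reload.
    per_head = length * (d_k + d_v)
    return per_head if shared else heads * per_head
```

Under these assumptions the key/value cache shrinks by a factor equal to the number of heads, which is where the reduction in memory bandwidth during incremental decoding comes from.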

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Oct 24, 2019

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.

High Resolution Medical Image Analysis with Spatial Partitioning

Sep 12, 2019

Le Hou, Youlong Cheng, Noam Shazeer, Niki Parmar, Yeqing Li, Panagiotis Korfiatis, Travis M. Drucker, Daniel J. Blezek, Xiaodan Song

Abstract: Medical images such as 3D computerized tomography (CT) scans and pathology images have hundreds of millions or billions of voxels/pixels. It is infeasible to train CNN models directly on such high resolution images, because neural activations of a single image do not fit in the memory of a single GPU/TPU, and naive data and model parallelism approaches do not work. Existing image analysis approaches alleviate this problem by cropping or down-sampling input images, which leads to complicated implementation and sub-optimal performance due to information loss. In this paper, we implement spatial partitioning, which internally distributes the input and output of convolutional layers across GPUs/TPUs. Our implementation is based on the Mesh-TensorFlow framework and the computation distribution is transparent to end users. With this technique, we train a 3D Unet on up to 512 by 512 by 512 resolution data. To the best of our knowledge, this is the first work for handling such high resolution images end-to-end.
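
To give a feel for why a convolution can be split spatially without changing its result, here is a one-dimensional NumPy sketch that simulates the partitioning on a single host. It assumes each shard is extended with a small "halo" of neighbouring values, which is one plausible way to realise the idea and is not taken from the abstract; the function names and the odd-kernel restriction are likewise assumptions.

```python
import numpy as np


def conv1d_same(x, kernel):
    # Reference: zero-padded convolution computed on the whole input at once.
    return np.convolve(x, kernel, mode="same")


def conv1d_partitioned(x, kernel, num_parts):
    """Compute the same result shard by shard, as separate devices might.

    Assumes an odd kernel length and num_parts <= len(x).
    """
    if len(kernel) % 2 == 0 or num_parts > len(x):
        raise ValueError("sketch requires an odd kernel and num_parts <= len(x)")
    r = len(kernel) // 2  # halo width needed on each side of a shard
    padded = np.pad(x, r)  # zero padding at the global boundaries
    bounds = [len(x) * p // num_parts for p in range(num_parts + 1)]
    pieces = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        # Shard [a, b) plus r halo elements from each neighbour.
        shard = padded[a : b + 2 * r]
        pieces.append(np.convolve(shard, kernel, mode="valid"))
    return np.concatenate(pieces)
```

Each shard only needs `r` extra values from each neighbour, so the memory per device scales with the shard size rather than with the full image.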

Corpora Generation for Grammatical Error Correction

Apr 10, 2019

Jared Lichtarge, Chris Alberti, Shankar Kumar, Noam Shazeer, Niki Parmar, Simon Tong

Abstract: Grammatical Error Correction (GEC) has recently been modeled using the sequence-to-sequence framework. However, unlike sequence transduction problems such as machine translation, GEC suffers from the lack of plentiful parallel data. We describe two approaches for generating large parallel datasets for GEC using publicly available Wikipedia data. The first method extracts source-target pairs from Wikipedia edit histories with minimal filtration heuristics, while the second method introduces noise into Wikipedia sentences via round-trip translation through bridge languages. Both strategies yield similarly sized parallel corpora containing around 4B tokens. We employ an iterative decoding strategy that is tailored to the loosely supervised nature of our constructed corpora. We demonstrate that neural GEC models trained using either type of corpora give similar performance. Fine-tuning these models on the Lang-8 corpus and ensembling allows us to surpass the state of the art on both the CoNLL-2014 benchmark and the JFLEG task. We provide systematic analysis that compares the two approaches to data generation and highlights the effectiveness of ensembling.

* Accepted at NAACL 2019. arXiv admin note: text overlap with arXiv:1811.01710

Blockwise Parallel Decoding for Deep Autoregressive Models

Nov 07, 2018

Mitchell Stern, Noam Shazeer, Jakob Uszkoreit

Abstract: Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years. While common architecture classes such as recurrent, convolutional, and self-attention networks make different trade-offs between the amount of computation needed per layer and the length of the critical path at training time, generation still remains an inherently sequential process. To overcome this limitation, we propose a novel blockwise parallel decoding scheme in which we make predictions for multiple time steps in parallel, then back off to the longest prefix validated by a scoring model. This allows for substantial theoretical improvements in generation speed when applied to architectures that can process output sequences in parallel. We verify our approach empirically through a series of experiments using state-of-the-art self-attention models for machine translation and image super-resolution, achieving iteration reductions of up to 2x over a baseline greedy decoder with no loss in quality, or up to 7x in exchange for a slight decrease in performance. In terms of wall-clock time, our fastest models exhibit real-time speedups of up to 4x over standard greedy decoding.

* NIPS 2018
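
The predict-then-validate loop from the abstract above can be sketched with toy stand-ins for the models. Everything named here is hypothetical: `toy_base_next` plays the role of the scoring model's greedy choice, and `toy_propose` is a deliberately imperfect block proposer. The sketch also simplifies in two ways that are assumptions rather than the paper's procedure: validation is written as a sequential loop, whereas the speed-up in practice relies on scoring all block positions in one parallel model call, and the scoring model's own token is kept at the first mismatch.

```python
def greedy_decode(base_next, prompt, max_len):
    # Baseline: one scoring-model call per generated token.
    out = list(prompt)
    while len(out) < max_len:
        out.append(base_next(out))
    return out


def blockwise_decode(base_next, propose, prompt, max_len, block_size):
    """Propose a block of tokens, then keep the longest validated prefix."""
    out = list(prompt)
    iterations = 0
    while len(out) < max_len:
        iterations += 1
        accepted = []
        for guess in propose(out, block_size):
            verified = base_next(out + accepted)
            accepted.append(verified)  # the scoring model's token is always safe
            if verified != guess:
                break  # back off: discard the rest of the proposed block
        out.extend(accepted)
    return out[:max_len], iterations


def toy_base_next(prefix):
    # Hypothetical scoring model: counts upward modulo 10.
    return (prefix[-1] + 1) % 10


def toy_propose(prefix, block_size):
    # Hypothetical proposer: right for two positions, then wrong (-1).
    guesses, last = [], prefix[-1]
    for i in range(block_size):
        last = (last + 1) % 10
        guesses.append(last if i < 2 else -1)
    return guesses
```

Because every kept token is the scoring model's own greedy choice, the output in this sketch matches plain greedy decoding while using fewer outer iterations.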

Mesh-TensorFlow: Deep Learning for Supercomputers

Nov 05, 2018

Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young (+2 more)

Abstract: Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the "batch" dimension, in Mesh-TensorFlow, the user can specify any tensor-dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into an SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark. Mesh-TensorFlow is available at https://github.com/tensorflow/mesh .
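
The idea of splitting different tensor dimensions across different mesh dimensions can be simulated on a single host with NumPy. The sketch below does not use the Mesh-TensorFlow API; it is a hand-written stand-in in which a two-layer feed-forward computation is split along the batch dimension (data parallelism) and the hidden dimension (model parallelism), with a plain sum standing in for Allreduce. All function and parameter names are assumptions.

```python
import numpy as np


def ffn(x, w1, w2):
    # Unsplit reference computation: relu(x @ w1) @ w2.
    return np.maximum(x @ w1, 0.0) @ w2


def ffn_on_mesh(x, w1, w2, batch_parts, hidden_parts):
    """Simulate a batch_parts x hidden_parts mesh of processors.

    x: [batch, d_model], w1: [d_model, d_hidden], w2: [d_hidden, d_model].
    """
    w1_shards = np.array_split(w1, hidden_parts, axis=1)
    w2_shards = np.array_split(w2, hidden_parts, axis=0)
    rows = []
    for x_shard in np.array_split(x, batch_parts, axis=0):
        # Each simulated processor holds one batch shard and one hidden shard.
        partials = [
            np.maximum(x_shard @ a, 0.0) @ b for a, b in zip(w1_shards, w2_shards)
        ]
        # Summing the partial outputs stands in for Allreduce over the
        # hidden-dimension axis of the mesh.
        rows.append(sum(partials))
    return np.concatenate(rows, axis=0)
```

The split along the hidden dimension works because the ReLU is applied element-wise, so each hidden shard can be processed independently before the partial outputs are summed.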
