Geoffrey Hinton

University of Toronto

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Jan 23, 2017

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean

Abstract: The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
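
The sparse gating described in the abstract can be illustrated with a minimal sketch. It assumes a linear gating network followed by a softmax restricted to the k largest logits; the abstract does not give these details, and the names `top_k_gates`, `moe_forward`, `gate_weights`, and `k` are hypothetical. Any noise term or load-balancing loss a full implementation might use is omitted.

```python
import math

def top_k_gates(logits, k):
    """Softmax over the k largest logits; every other gate is exactly zero."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                    # subtract max for stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]

def moe_forward(x, gate_weights, experts, k=2):
    """Weighted sum of the outputs of the k selected experts for one example."""
    # Assumed linear gating network: one logit per expert.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    gates = top_k_gates(logits, k)
    out = None
    for g, expert in zip(gates, experts):
        if g == 0.0:
            continue                                   # skipped experts cost nothing
        y = expert(x)
        out = [g * v for v in y] if out is None else [o + g * v for o, v in zip(out, y)]
    return out, gates
```

Because experts with a zero gate are never evaluated, the cost per example in this sketch grows with `k` rather than with the total number of experts, which is the property the abstract attributes to conditional computation.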

Using Fast Weights to Attend to the Recent Past

Dec 05, 2016

Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, Catalin Ionescu

Abstract: Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: Neural activities that represent the current or recent input and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different time-scales and this suggests that artificial neural networks might benefit from variables that change slower than activities but much faster than the standard weights. These "fast weights" can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models. By using fast weights we can avoid the need to store copies of neural activity patterns.

* Added [Schmidhuber 1993] citation to the last paragraph of the introduction. Fixed typo in Appendix A.1: uniform initialization to 1/\sqrt{H}
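
The abstract does not state an update rule for the fast weights. As an illustration only, the sketch below assumes a decaying outer-product (Hebbian) update, in which the fast-weight matrix fades by a factor `decay` at each step and gains `lr` times the outer product of the current hidden vector with itself. The function names and both parameters are hypothetical.

```python
def update_fast_weights(A, h, decay=0.95, lr=0.5):
    """Assumed rule: A_new = decay * A + lr * outer(h, h)."""
    n = len(h)
    return [[decay * A[i][j] + lr * h[i] * h[j] for j in range(n)]
            for i in range(n)]

def recall(A, query):
    """Multiply the fast-weight matrix by a query vector.

    Under the assumed rule this equals a sum of past hidden vectors, each
    scaled by its dot product with the query and by how much it has decayed.
    """
    return [sum(a * q for a, q in zip(row, query)) for row in A]
```

Under this assumption, `recall` weights each past hidden vector by its similarity to the query without keeping a list of those vectors, which is one way to read the abstract's point about not storing copies of activity patterns.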

Grammar as a Foreign Language

Jun 09, 2015

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton

Abstract: Syntactic constituency parsing is a fundamental problem in natural language processing and has been the subject of intensive research and engineering for decades. As a result, the most accurate parsers are domain-specific, complex, and inefficient. In this paper we show that the domain-agnostic attention-enhanced sequence-to-sequence model achieves state-of-the-art results on the most widely used syntactic constituency parsing dataset, when trained on a large synthetic corpus that was annotated using existing parsers. It also matches the performance of standard parsers when trained only on a small human-annotated dataset, which shows that this model is highly data-efficient, in contrast to sequence-to-sequence models without the attention mechanism. Our parser is also fast, processing over a hundred sentences per second with an unoptimized CPU implementation.

Distilling the Knowledge in a Neural Network

Mar 09, 2015

Geoffrey Hinton, Oriol Vinyals, Jeff Dean

Abstract: A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.

* NIPS 2014 Deep Learning Workshop
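
The abstract does not describe the compression technique itself. A minimal sketch of one way to distil an ensemble into a single model follows, assuming the ensemble's averaged, temperature-softened class probabilities serve as soft targets for the student. The function names and the `temperature` parameter are illustrative assumptions, not taken from the text above.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature; a higher temperature gives a flatter output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                    # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(ensemble_logits, temperature):
    """Average the softened class probabilities of every ensemble member."""
    probs = [softmax(logits, temperature) for logits in ensemble_logits]
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]

def distillation_loss(student_logits, soft_targets, temperature):
    """Cross-entropy between the soft targets and the softened student output."""
    q = softmax(student_logits, temperature)
    return -sum(p * math.log(qc) for p, qc in zip(soft_targets, q))
```

In this sketch the student is trained to minimise `distillation_loss` on each example, so it learns from the relative probabilities the ensemble assigns to incorrect classes as well as from its top prediction.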

Speech Recognition with Deep Recurrent Neural Networks

Mar 22, 2013

Alex Graves, Abdel-rahman Mohamed, Geoffrey Hinton

Abstract: Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However, RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long-range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

* To appear in ICASSP 2013

Deep Lambertian Networks

Jun 27, 2012

Yichuan Tang, Ruslan Salakhutdinov, Geoffrey Hinton

Abstract: Visual perception is a challenging problem in part due to illumination variations. A possible solution is to first estimate an illumination-invariant representation before using it for recognition. The object albedo and surface normals are examples of such representations. In this paper, we introduce a multilayer generative model where the latent variables include the albedo, surface normals, and the light source. Combining Deep Belief Nets with the Lambertian reflectance assumption, our model can learn good priors over the albedo from 2D images. Illumination variations can be explained by changing only the lighting latent variable in our model. By transferring learned knowledge from similar objects, albedo and surface normals estimation from a single image is possible in our model. Experiments demonstrate that our model is able to generalize as well as improve over standard baselines in one-shot face recognition.

* Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)

Deep Mixtures of Factor Analysers

Jun 18, 2012

Yichuan Tang, Ruslan Salakhutdinov, Geoffrey Hinton

Abstract: An efficient way to learn deep density models that have many layers of latent variables is to learn one layer at a time using a model that has only one layer of latent variables. After learning each layer, samples from the posterior distributions for that layer are used as training data for learning the next layer. This approach is commonly used with Restricted Boltzmann Machines, which are undirected graphical models with a single hidden layer, but it can also be used with Mixtures of Factor Analysers (MFAs), which are directed graphical models. In this paper, we present a greedy layer-wise learning algorithm for Deep Mixtures of Factor Analysers (DMFAs). Even though a DMFA can be converted to an equivalent shallow MFA by multiplying together the factor loading matrices at different levels, learning and inference are much more efficient in a DMFA and the sharing of each lower-level factor loading matrix by many different higher-level MFAs prevents overfitting. We demonstrate empirically that DMFAs learn better density models than both MFAs and two types of Restricted Boltzmann Machine on a wide variety of datasets.

* ICML 2012
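
The conversion to a shallow model mentioned in the abstract can be illustrated for a single pair of components, one from each of two layers. The notation is an assumption made here, not the paper's: W_1 and W_2 are the factor loading matrices, \mu_1 and \mu_2 the component means, \Psi_1 and \Psi_2 the noise covariances, and z_2 the top-level factors with a standard normal prior.

```latex
\begin{aligned}
\mathbf{x}   &= W_1 \mathbf{z}_1 + \boldsymbol{\mu}_1 + \boldsymbol{\epsilon}_1,
  \qquad \boldsymbol{\epsilon}_1 \sim \mathcal{N}(\mathbf{0}, \Psi_1) \\
\mathbf{z}_1 &= W_2 \mathbf{z}_2 + \boldsymbol{\mu}_2 + \boldsymbol{\epsilon}_2,
  \qquad \boldsymbol{\epsilon}_2 \sim \mathcal{N}(\mathbf{0}, \Psi_2),
  \quad \mathbf{z}_2 \sim \mathcal{N}(\mathbf{0}, I) \\
\mathbf{x}   &= (W_1 W_2)\,\mathbf{z}_2
  + (W_1 \boldsymbol{\mu}_2 + \boldsymbol{\mu}_1)
  + (W_1 \boldsymbol{\epsilon}_2 + \boldsymbol{\epsilon}_1) \\
\operatorname{Cov}(\mathbf{x}) &= W_1 W_2 W_2^{\top} W_1^{\top}
  + W_1 \Psi_2 W_1^{\top} + \Psi_1
\end{aligned}
```

Substituting the second line into the first gives the third: the collapsed component has loading matrix W_1 W_2, the product the abstract refers to. Each such pair needs its own product, whereas the deep form stores W_1 once and shares it across all the higher-level components beneath it.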
