Yoshua Bengio

Multiscale sequence modeling with a learned dictionary

Jul 05, 2017
Bart van Merriënboer, Amartya Sanyal, Hugo Larochelle, Yoshua Bengio

We propose a generalization of neural network sequence models. Instead of predicting one symbol at a time, our multi-scale model makes predictions over multiple, potentially overlapping multi-symbol tokens. A variation of the byte-pair encoding (BPE) compression algorithm is used to learn the dictionary of tokens that the model is trained with. When applied to language modeling, our model has the flexibility of character-level models while maintaining many of the performance benefits of word-level models. Our experiments show that this model performs better than a regular LSTM on language modeling tasks, especially for smaller models.
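
As a rough illustration of how a BPE-style token dictionary can be learned, the sketch below implements plain byte-pair encoding: it repeatedly merges the most frequent adjacent pair of tokens. This is the textbook algorithm, not the variation used in the paper, and the function name `learn_bpe` and its parameters are assumptions made for this example.

```python
from collections import Counter


def learn_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from `text` using plain BPE.

    Assumption: this is the standard algorithm, not the paper's variant.
    """
    tokens = list(text)  # start from single characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens in the current sequence.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges add nothing
        merges.append((left, right))
        # Rewrite the sequence, replacing each occurrence of the pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


merges, tokens = learn_bpe("abababcd", num_merges=2)
print(merges)
print(tokens)
```

The learned dictionary is the set of merged tokens together with the original characters. Note that standard BPE yields a single segmentation, whereas the model described above predicts over multiple, potentially overlapping tokens.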

A Closer Look at Memorization in Deep Networks

Jul 01, 2017
Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, Simon Lacoste-Julien

We examine the role of memorization in deep learning, drawing connections to capacity, generalization, and adversarial robustness. While deep networks are capable of memorizing noise data, our results suggest that they tend to prioritize learning simple patterns first. In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data. Our analysis suggests that dataset-independent notions of effective capacity are unlikely to explain the generalization performance of deep networks trained with gradient-based methods, because the training data itself plays an important role in determining the degree of memorization.

* Appears in Proceedings of the 34th International Conference on Machine Learning (ICML 2017). Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, and David Krueger contributed equally to this work

Plan, Attend, Generate: Character-level Neural Machine Translation with Planning in the Decoder

Jun 23, 2017
Caglar Gulcehre, Francis Dutil, Adam Trischler, Yoshua Bengio

We investigate the integration of a planning mechanism into an encoder-decoder architecture with an explicit alignment for character-level machine translation. We develop a model that plans ahead when it computes alignments between the source and target sequences, constructing a matrix of proposed future alignments and a commitment vector that governs whether to follow or recompute the plan. This mechanism is inspired by the strategic attentive reader and writer (STRAW) model. Our proposed model is end-to-end trainable with fully differentiable operations. We show that it outperforms a strong baseline on three character-level-decoder neural machine translation tasks on the WMT'15 corpus. Our analysis demonstrates that our model can compute qualitatively intuitive alignments and achieves superior performance with fewer parameters.

* Accepted to Rep4NLP 2017 Workshop at ACL 2017 Conference

Deep Learning for Patient-Specific Kidney Graft Survival Analysis

May 29, 2017
Margaux Luck, Tristan Sylvain, Héloïse Cardinal, Andrea Lodi, Yoshua Bengio

An accurate model of patient-specific kidney graft survival distributions can help to improve shared decision-making in the treatment and care of patients. In this paper, we propose a deep learning method that directly models the survival function instead of estimating the hazard function to predict survival times for graft patients based on the principle of multi-task learning. By learning to jointly predict the time of the event and its rank in the Cox partial log-likelihood framework, our deep learning approach outperforms, in terms of survival time prediction quality and concordance index, other common methods for survival analysis, including the Cox Proportional Hazards model and a network trained on the Cox partial log-likelihood.
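
For readers unfamiliar with the Cox partial log-likelihood mentioned above, the following minimal sketch computes it for a small cohort. It covers only the standard ranking objective (without tie correction), not the paper's multi-task loss; the function name and argument names are assumptions made for this example.

```python
import math


def cox_partial_log_likelihood(times, events, scores):
    """Standard Cox partial log-likelihood, without tie correction.

    times:  observed time per patient (event time or censoring time)
    events: 1 if the event was observed, 0 if the patient was censored
    scores: model risk score per patient (higher means higher hazard)
    """
    total = 0.0
    for t_i, e_i, s_i in zip(times, events, scores):
        if not e_i:
            continue  # censored patients contribute only through risk sets
        # Risk set: every patient still under observation at time t_i.
        risk = [s_j for t_j, s_j in zip(times, scores) if t_j >= t_i]
        total += s_i - math.log(sum(math.exp(s) for s in risk))
    return total


times = [1.0, 2.0, 3.0]
events = [1, 1, 0]
# Scores that rank earlier failures as higher risk give a larger likelihood.
print(cox_partial_log_likelihood(times, events, [2.0, 1.0, 0.0]))
print(cox_partial_log_likelihood(times, events, [0.0, 1.0, 2.0]))
```

Because the objective depends only on the ordering of events within each risk set, it captures rank rather than absolute event time, which is why the abstract pairs it with a direct time prediction.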

Sharp Minima Can Generalize For Deep Nets

May 15, 2017
Laurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua Bengio

Despite their overwhelming capacity to overfit, deep learning architectures tend to generalize relatively well to unseen data, allowing them to be deployed in practice. However, explaining why this is the case is still an open area of research. One standing hypothesis that is gaining popularity, e.g. Hochreiter & Schmidhuber (1997); Keskar et al. (2017), is that the flatness of minima of the loss function found by stochastic gradient-based methods results in good generalization. This paper argues that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization. Specifically, when focusing on deep networks with rectifier units, we can exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima. Furthermore, if we allow reparametrization of a function, the geometry of its parameters can change drastically without affecting its generalization properties.

* 8.5 pages of main content, 2.5 of bibliography and 1 page of appendix
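
The symmetry referred to in the abstract above can be illustrated with a small sketch. Because a rectifier satisfies relu(αz) = α·relu(z) for any α > 0, multiplying one layer's weights by α and dividing the next layer's weights by α leaves the network function unchanged while moving the parameters to a different point in parameter space. The two-layer bias-free network and the helper names below are assumptions made for this example, not code from the paper.

```python
def relu(x):
    return [max(0.0, v) for v in x]


def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def two_layer(W1, W2, x):
    # Bias-free two-layer rectifier network: W2 @ relu(W1 @ x).
    return matvec(W2, relu(matvec(W1, x)))


def rescale(W1, W2, alpha):
    """Scale the first layer by alpha and the second by 1/alpha (alpha > 0)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    scaled_W1 = [[alpha * w for w in row] for row in W1]
    scaled_W2 = [[w / alpha for w in row] for row in W2]
    return scaled_W1, scaled_W2


W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, -1.0]]
x = [1.0, 0.5]
W1_s, W2_s = rescale(W1, W2, alpha=4.0)
# Same function value, different parameters.
print(two_layer(W1, W2, x), two_layer(W1_s, W2_s, x))
```

Since the rescaled parameters define the same function, any flatness measure that changes under this transformation cannot by itself determine how well that function generalizes.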

Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space

Apr 12, 2017
Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, Jason Yosinski

Generating high-resolution, photo-realistic images has been a long-standing goal in machine learning. Recently, Nguyen et al. (2016) showed one interesting way to synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network. In this paper we extend this method by introducing an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high quality images at higher resolutions (227x227) than previous generative models, and does so for all 1000 ImageNet categories. In addition, we provide a unified probabilistic interpretation of related activation maximization methods and call the general class of models "Plug and Play Generative Networks". PPGNs are composed of 1) a generator network G that is capable of drawing a wide range of image types and 2) a replaceable "condition" network C that tells the generator what to draw. We demonstrate the generation of images conditioned on a class (when C is an ImageNet or MIT Places classification network) and also conditioned on a caption (when C is an image captioning network). Our method also improves the state of the art of Multifaceted Feature Visualization, which generates the set of synthetic inputs that activate a neuron in order to better understand how deep neural networks operate. Finally, we show that our model performs reasonably well at the task of image inpainting. While image models are used in this paper, the approach is modality-agnostic and can be applied to many types of data.

* CVPR camera-ready

Equilibrium Propagation: Bridging the Gap Between Energy-Based Models and Backpropagation

Mar 28, 2017
Benjamin Scellier, Yoshua Bengio

We introduce Equilibrium Propagation, a learning framework for energy-based models. It involves only one kind of neural computation, performed in both the first phase (when the prediction is made) and the second phase of training (after the target or prediction error is revealed). Although this algorithm computes the gradient of an objective function just like Backpropagation, it does not need a special computation or circuit for the second phase, where errors are implicitly propagated. Equilibrium Propagation shares similarities with Contrastive Hebbian Learning and Contrastive Divergence while solving the theoretical issues of both algorithms: our algorithm computes the gradient of a well-defined objective function. Because the objective function is defined in terms of local perturbations, the second phase of Equilibrium Propagation corresponds to only nudging the prediction (fixed point, or stationary distribution) towards a configuration that reduces prediction error. In the case of a recurrent multi-layer supervised network, the output units are slightly nudged towards their target in the second phase, and the perturbation introduced at the output layer propagates backward in the hidden layers. We show that the signal 'back-propagated' during this second phase corresponds to the propagation of error derivatives and encodes the gradient of the objective function, when the synaptic update corresponds to a standard form of spike-timing-dependent plasticity. This work makes it more plausible that a mechanism similar to Backpropagation could be implemented by brains, since leaky integrator neural computation performs both inference and error back-propagation in our model. The only local difference between the two phases is whether synaptic changes are allowed or not.

Batch-normalized joint training for DNN-based distant speech recognition

Mar 24, 2017
Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, Yoshua Bengio

Improving distant speech recognition is a crucial step towards flexible human-machine interfaces. Current technology, however, still exhibits a lack of robustness, especially when adverse acoustic conditions are met. Despite the significant progress made in recent years on both speech enhancement and speech recognition, one potential limitation of state-of-the-art technology lies in composing modules that are not well matched because they are not trained jointly. To address this concern, a promising approach consists of concatenating a speech enhancement and a speech recognition deep neural network and jointly updating their parameters as if they were within a single bigger network. Unfortunately, joint training can be difficult because the output distribution of the speech enhancement system may change substantially during the optimization procedure. The speech recognition module would have to deal with an input distribution that is non-stationary and unnormalized. To mitigate this issue, we propose a joint training approach based on a fully batch-normalized architecture. Experiments, conducted using different datasets, tasks and acoustic conditions, revealed that the proposed framework significantly outperforms other competitive solutions, especially in challenging environments.

* arXiv admin note: text overlap with arXiv:1703.08002
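
To make concrete why batch normalization helps with the non-stationary, unnormalized inputs described in the abstract above, here is a minimal sketch of the training-time batch normalization transform for a batch of feature vectors. It is a generic illustration, not the paper's architecture; the function name, the scalar `gamma` and `beta`, and the omission of running statistics are simplifying assumptions.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    Assumption: training-mode statistics only, with scalar gamma and beta.
    """
    n = len(batch)
    dims = len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        column = [row[d] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i, v in enumerate(column):
            out[i][d] = gamma * (v - mean) / math.sqrt(var + eps) + beta
    return out


# Two features on very different scales end up in the same normalized range.
print(batch_norm([[1.0, 10.0], [3.0, 30.0]]))
```

Whatever scale or offset the upstream module produces at a given point in training, the downstream module sees per-feature inputs with roughly zero mean and unit variance.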

A network of deep neural networks for distant speech recognition

Mar 23, 2017
Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, Yoshua Bengio

Despite the remarkable progress recently made in distant speech recognition, state-of-the-art technology still suffers from a lack of robustness, especially when adverse acoustic conditions characterized by non-stationary noises and reverberation are met. A prominent limitation of current systems lies in the lack of matching and communication between the various technologies involved in the distant speech recognition process. The speech enhancement and speech recognition modules are, for instance, often trained independently. Moreover, the speech enhancement normally helps the speech recognizer, but the output of the latter is not commonly used, in turn, to improve the speech enhancement. To address both concerns, we propose a novel architecture based on a network of deep neural networks, where all the components are jointly trained and cooperate better with each other thanks to a full communication scheme between them. Experiments, conducted using different datasets, tasks and acoustic conditions, revealed that the proposed framework can outperform other competitive solutions, including recent joint training approaches.

Independently Controllable Features

Mar 22, 2017
Emmanuel Bengio, Valentin Thomas, Joelle Pineau, Doina Precup, Yoshua Bengio

Finding features that disentangle the different causes of variation in real data is a difficult task that has nonetheless received considerable attention in static domains like natural images. Interactive environments, in which an agent can deliberately take actions, offer an opportunity to tackle this task better, because the agent can experiment with different actions and observe their effects. We introduce the idea that in interactive environments, latent factors that control the variation in observed data can be identified by figuring out what the agent can control. We propose a naive method to find factors that explain or measure the effect of the actions of a learner, and test it in illustrative experiments.

* RLDM submission
