Yoshua Bengio

DIRO

Learning Independent Features with Adversarial Nets for Non-linear ICA

Oct 13, 2017

Philemon Brakel, Yoshua Bengio

Abstract: Reliable measures of statistical dependence could be useful tools for learning independent features and performing tasks like source separation using Independent Component Analysis (ICA). Unfortunately, many such measures, like the mutual information, are hard to estimate and optimize directly. We propose to learn independent features with adversarial objectives which optimize such measures implicitly. These objectives compare samples from the joint distribution and the product of the marginals without the need to compute any probability densities. We also propose two methods for obtaining samples from the product of the marginals using either a simple resampling trick or a separate parametric distribution. Our experiments show that this strategy can easily be applied to different types of model architectures and solve both linear and non-linear ICA problems.
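The resampling idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes one plausible reading of the trick: each feature column is shuffled independently across a batch, so per-feature marginals are kept while cross-feature dependence is broken. The function name `resample_marginals` and the toy data are hypothetical, and the abstract does not say whether the authors permute or sample with replacement.

```python
import random

def resample_marginals(batch, rng):
    """Shuffle every feature column independently across the batch.

    Assumption: this is one way to approximate samples from the product
    of the marginals; it is not taken from the paper's code.
    """
    n, d = len(batch), len(batch[0])
    columns = []
    for j in range(d):
        col = [row[j] for row in batch]
        rng.shuffle(col)  # in-place permutation of this feature only
        columns.append(col)
    # Reassemble rows from the independently shuffled columns.
    return [[columns[j][i] for j in range(d)] for i in range(n)]

rng = random.Random(0)
# Toy "joint" samples in which the second feature is twice the first.
joint = [[float(i), 2.0 * i] for i in range(8)]
product = resample_marginals(joint, rng)
```

In an adversarial setup of the kind the abstract describes, a discriminator would presumably be trained to tell `joint` rows from `product` rows.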

* A preliminary version of this work was presented at the ICML 2017 workshop on implicit models

Improving speech recognition by revising gated recurrent units

Sep 29, 2017

Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, Yoshua Bengio

Abstract: Speech recognition is largely taking advantage of deep learning, showing that substantial benefits can be obtained by modern Recurrent Neural Networks (RNNs). The most popular RNNs are Long Short-Term Memory networks (LSTMs), which typically reach state-of-the-art performance in many tasks thanks to their ability to learn long-term dependencies and their robustness to vanishing gradients. Nevertheless, LSTMs have a rather complex design with three multiplicative gates, which might impair their efficient implementation. An attempt to simplify LSTMs has recently led to Gated Recurrent Units (GRUs), which are based on just two multiplicative gates. This paper builds on these efforts by further revising GRUs and proposing a simplified architecture potentially more suitable for speech recognition. The contribution of this work is two-fold. First, we suggest removing the reset gate in the GRU design, resulting in a more efficient single-gate architecture. Second, we propose replacing tanh with ReLU activations in the state update equations. Results show that, in our implementation, the revised architecture reduces the per-epoch training time by more than 30% and consistently improves recognition performance across different tasks, input features, and noisy conditions when compared to a standard GRU.
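The two revisions described above can be sketched as a single recurrent step. This is an illustrative reading of the abstract only: it starts from the standard GRU equations, drops the reset gate, and swaps tanh for ReLU in the candidate state. The parameter names (`Wz`, `Uz`, `bz`, `Wh`, `Uh`, `bh`) and the gate convention (the gate `z` weights the previous state) are assumptions, and any further details of the authors' implementation are not shown.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def revised_gru_step(x, h_prev, p):
    """One step of a single-gate GRU with a ReLU candidate state."""
    # Update gate: the only gate left once the reset gate is removed.
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev), p["bz"])]
    # Candidate state: ReLU replaces tanh, and h_prev is used ungated.
    cand = [max(0.0, a + b + c) for a, b, c in
            zip(matvec(p["Wh"], x), matvec(p["Uh"], h_prev), p["bh"])]
    # Interpolate between the previous state and the candidate.
    return [zi * hp + (1.0 - zi) * ci for zi, hp, ci in zip(z, h_prev, cand)]

# Toy parameters: 1 input unit, 2 hidden units, all weights zero.
params = {
    "Wz": [[0.0], [0.0]], "Uz": [[0.0, 0.0], [0.0, 0.0]], "bz": [0.0, 0.0],
    "Wh": [[0.0], [0.0]], "Uh": [[0.0, 0.0], [0.0, 0.0]], "bh": [0.0, 0.0],
}
h = revised_gru_step([1.0], [1.0, -2.0], params)
```

With all-zero weights the gate is exactly 0.5 and the candidate is 0, so the step simply halves the previous state.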

The Consciousness Prior

Sep 25, 2017

Yoshua Bengio

Abstract: A new prior is proposed for representation learning, which can be combined with other priors in order to help disentangle abstract factors from each other. It is inspired by the phenomenon of consciousness seen as the formation of a low-dimensional combination of a few concepts constituting a conscious thought, i.e., consciousness as awareness at a particular time instant. This provides a powerful constraint on the representation in that such low-dimensional thought vectors can correspond to statements about reality which are true, highly probable, or very useful for taking decisions. The fact that a few elements of the current state can be combined into such a predictive or useful statement is a strong constraint and deviates considerably from the maximum likelihood approaches to modelling data and how states unfold in the future based on an agent's actions. Instead of making predictions in the sensory (e.g. pixel) space, the consciousness prior allows the agent to make predictions in the abstract space, with only a few dimensions of that space being involved in each of these predictions. The consciousness prior also makes it natural to map conscious states to natural language utterances or to express classical AI knowledge in the form of facts and rules, although the conscious states may be richer than what can be expressed easily in the form of a sentence, a fact or a rule.

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

Sep 22, 2017

David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, Chris Pal

Abstract: We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization. But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks. We perform an empirical investigation of various RNN regularizers, and find that zoneout gives significant performance improvements across tasks. We achieve competitive results with relatively simple models in character- and word-level language modelling on the Penn Treebank and Text8 datasets, and combining with recurrent batch normalization yields state-of-the-art results on permuted sequential MNIST.
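The per-timestep rule described above can be written in a few lines. The sketch below covers only the training-time behaviour stated in the abstract, where each hidden unit keeps its previous value with some probability. The function name `zoneout_step`, the `rate` parameter, and the toy values are hypothetical.

```python
import random

def zoneout_step(h_prev, h_new, rate, rng):
    """Keep each unit's previous value with probability `rate`.

    Units that are not "zoned out" take the freshly computed value,
    so nothing is set to zero (unlike dropout).
    """
    return [hp if rng.random() < rate else hn
            for hp, hn in zip(h_prev, h_new)]

rng = random.Random(0)
h_prev = [0.1, 0.2, 0.3, 0.4]
h_new = [1.0, 2.0, 3.0, 4.0]
h = zoneout_step(h_prev, h_new, rate=0.5, rng=rng)
```

Here `h_new` stands in for whatever the underlying RNN transition would have produced at this timestep.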

* David Krueger and Tegan Maharaj contributed equally to this work

Independently Controllable Factors

Aug 25, 2017

Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, Yoshua Bengio

Abstract: It has been postulated that a good representation is one that disentangles the underlying explanatory factors of variation. However, it remains an open question what kind of training framework could potentially achieve that. Whereas most previous work focuses on the static setting (e.g., with images), we postulate that some of the causal factors could be discovered if the learner is allowed to interact with its environment. The agent can experiment with different actions and observe their effects. More specifically, we hypothesize that some of these factors correspond to aspects of the environment which are independently controllable, i.e., that there exists a policy and a learnable feature for each such aspect of the environment, such that this policy can yield changes in that feature with minimal changes to other features that explain the statistical variations in the observed data. We propose a specific objective function to find such factors and verify experimentally that it can indeed disentangle independently controllable aspects of the environment without any extrinsic reward signal.

Image Segmentation by Iterative Inference from Conditional Score Estimation

Aug 18, 2017

Adriana Romero, Michal Drozdzal, Akram Erraqabi, Simon Jégou, Yoshua Bengio

Abstract: Inspired by the combination of feedforward and iterative computations in the visual cortex, and taking advantage of the ability of denoising autoencoders to estimate the score of a joint distribution, we propose a novel approach to iterative inference for capturing and exploiting the complex joint distribution of output variables conditioned on some input variables. This approach is applied to image pixel-wise segmentation, with the estimated conditional score used to perform gradient ascent towards a mode of the estimated conditional distribution. This extends previous work on score estimation by denoising autoencoders to the case of a conditional distribution, with a novel use of a corrupted feedforward predictor replacing Gaussian corruption. An advantage of this approach over more classical ways to perform iterative inference for structured outputs, like conditional random fields (CRFs), is that it is no longer necessary to define an explicit energy function linking the output variables. To keep computations tractable, such energy function parametrizations are typically fairly constrained, involving only a few neighbors of each of the output variables in each clique. We experimentally find that the proposed iterative inference from conditional score estimation by conditional denoising autoencoders performs better than comparable models based on CRFs or those not using any explicit modeling of the conditional joint distribution of outputs.

Count-ception: Counting by Fully Convolutional Redundant Counting

Jul 23, 2017

Joseph Paul Cohen, Genevieve Boucher, Craig A. Glastonbury, Henry Z. Lo, Yoshua Bengio

Abstract: Counting objects in digital images is a process that should be replaced by machines. This tedious task is time consuming and prone to errors due to fatigue of human annotators. The goal is to have a system that takes as input an image and returns a count of the objects inside and justification for the prediction in the form of object localization. We re-pose a problem, originally posed by Lempitsky and Zisserman, to instead predict a count map which contains redundant counts based on the receptive field of a smaller regression network. The regression network predicts a count of the objects that exist inside this frame. By processing the image in a fully convolutional way, each pixel is accounted for some number of times: the number of windows that include it, which is the size of each window (i.e., 32x32 = 1024). To recover the true count we take the average over the redundant predictions. Our contribution is redundant counting instead of predicting a density map in order to average over errors. We also propose a novel deep neural network architecture adapted from the Inception family of networks called the Count-ception network. Together, our approach results in a 20% relative improvement (2.9 to 2.3 MAE) over the state-of-the-art method by Xie, Noble, and Zisserman in 2016.
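The arithmetic behind redundant counting can be checked with a small sketch. It replaces the regression network with an exact counting oracle over point annotations, and it assumes stride-1 windows with enough padding that every pixel falls inside exactly r x r windows. The function names and the 5x5 toy image with r = 3 are hypothetical. Summing the count map and dividing by r*r then recovers the object count.

```python
def redundant_count_map(points, height, width, r):
    """Count annotated points inside every r x r window.

    Windows slide with stride 1 and start r - 1 pixels outside the image
    (an assumed padding), so each pixel is covered by exactly r * r windows.
    """
    count_map = []
    for top in range(-(r - 1), height):
        row = []
        for left in range(-(r - 1), width):
            row.append(sum(1 for (y, x) in points
                           if top <= y < top + r and left <= x < left + r))
        count_map.append(row)
    return count_map

def recover_count(count_map, r):
    """Average out the redundancy: every object was counted r * r times."""
    return sum(sum(row) for row in count_map) / (r * r)

points = [(0, 0), (2, 3), (4, 4)]  # (row, column) object annotations
count_map = redundant_count_map(points, height=5, width=5, r=3)
total = recover_count(count_map, r=3)
```

With a learned regressor in place of the oracle, individual window counts would be noisy, and the same division averages those errors over the redundant predictions.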

* Under Review

Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition

Jul 19, 2017

Taesup Kim, Inchul Song, Yoshua Bengio

Abstract: Layer normalization is a recently introduced technique for normalizing the activities of neurons in deep neural networks to improve the training speed and stability. In this paper, we introduce a new layer normalization technique called Dynamic Layer Normalization (DLN) for adaptive neural acoustic modeling in speech recognition. By dynamically generating the scaling and shifting parameters in layer normalization, DLN adapts neural acoustic models to the acoustic variability arising from various factors such as speakers, channel noises, and environments. Unlike other adaptive acoustic models, our proposed approach does not require additional adaptation data or speaker information such as i-vectors. Moreover, the model size is fixed as it dynamically generates adaptation parameters. We apply our proposed DLN to deep bidirectional LSTM acoustic models and evaluate them on two benchmark datasets for large vocabulary ASR experiments: WSJ and TED-LIUM release 2. The experimental results show that our DLN improves neural acoustic models in terms of transcription accuracy by dynamically adapting to various speakers and environments.
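To make "dynamically generating the scaling and shifting parameters" concrete, here is a minimal sketch. It uses ordinary layer normalization, but the gain and bias are produced by an affine map of an input-dependent vector instead of being fixed learned constants. The `summary` vector, the weight names (`Wg`, `bg`, `Wb`, `bb`), and the affine form are assumptions made for illustration; the abstract does not specify how the parameters are generated.

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    """Standard layer normalization over one activation vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)
    return [g * (v - mean) / std + b for v, g, b in zip(x, gain, bias)]

def affine(W, v, b):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]

def dynamic_layer_norm(x, summary, Wg, bg, Wb, bb):
    """Layer norm whose gain and bias depend on `summary` (assumed input)."""
    gain = affine(Wg, summary, bg)
    bias = affine(Wb, summary, bb)
    return layer_norm(x, gain, bias)

x = [1.0, 2.0, 3.0]
Wg, bg = [[1.0], [1.0], [1.0]], [1.0, 1.0, 1.0]
Wb, bb = [[0.5], [0.5], [0.5]], [0.0, 0.0, 0.0]
# A zero summary gives plain layer norm; a different summary shifts and rescales.
plain = dynamic_layer_norm(x, [0.0], Wg, bg, Wb, bb)
adapted = dynamic_layer_norm(x, [1.0], Wg, bg, Wb, bb)
```

Because the adaptation parameters are computed on the fly, the number of stored weights stays fixed regardless of how many speakers or environments are seen, which matches the fixed-model-size point in the abstract.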

* INTERSPEECH 2017

Multiscale sequence modeling with a learned dictionary

Jul 05, 2017

Bart van Merriënboer, Amartya Sanyal, Hugo Larochelle, Yoshua Bengio

Abstract: We propose a generalization of neural network sequence models. Instead of predicting one symbol at a time, our multi-scale model makes predictions over multiple, potentially overlapping multi-symbol tokens. A variation of the byte-pair encoding (BPE) compression algorithm is used to learn the dictionary of tokens that the model is trained with. When applied to language modeling, our model has the flexibility of character-level models while maintaining many of the performance benefits of word-level models. Our experiments show that this model performs better than a regular LSTM on language modeling tasks, especially for smaller models.
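For readers unfamiliar with byte-pair encoding, the following sketch shows the plain, textbook merge loop on a character sequence. It is not the variation used in the paper, which the abstract does not describe. The function name `learn_bpe` and the toy string are hypothetical.

```python
from collections import Counter

def learn_bpe(text, num_merges):
    """Learn up to `num_merges` multi-symbol tokens by standard BPE merging."""
    tokens = list(text)  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges are pointless
        merges.append(a + b)
        # Greedily replace every non-overlapping occurrence of the pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = learn_bpe("abababab", num_merges=2)
```

The learned `merges` play the role of a token dictionary; a multiscale model as described above would then predict over tokens of several lengths rather than over single characters only.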

A Closer Look at Memorization in Deep Networks

Jul 01, 2017

Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio (+1 more)

Abstract: We examine the role of memorization in deep learning, drawing connections to capacity, generalization, and adversarial robustness. While deep networks are capable of memorizing noise data, our results suggest that they tend to prioritize learning simple patterns first. In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data. Our analysis suggests that notions of effective capacity that are dataset independent are unlikely to explain the generalization performance of deep networks when trained with gradient-based methods, because the training data itself plays an important role in determining the degree of memorization.

* Appears in Proceedings of the 34th International Conference on Machine Learning (ICML 2017). Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, and David Krueger contributed equally to this work
