Aaron Courville

Universite de Montreal

Recurrent Batch Normalization

Feb 28, 2017

Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, Aaron Courville

Abstract: We propose a reparameterization of LSTM that brings the benefits of batch normalization to recurrent neural networks. Whereas previous works only apply batch normalization to the input-to-hidden transformation of RNNs, we demonstrate that it is both possible and beneficial to batch-normalize the hidden-to-hidden transition, thereby reducing internal covariate shift between time steps. We evaluate our proposal on various sequential problems such as sequence classification, language modeling and question answering. Our empirical results show that our batch-normalized LSTM consistently leads to faster convergence and improved generalization.
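
As an illustration of the idea in this abstract, here is a minimal pure-Python sketch of one LSTM step in which the input-to-hidden and hidden-to-hidden projections are batch-normalized separately before being summed. It is not the authors' code: the function names, the gate order (f, i, o, g), the scale value gamma=0.1, the omission of shift parameters, and the extra normalization of the cell state are assumptions made for the example, and only batch statistics (training mode) are used.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]

def batch_norm(rows, gamma=0.1, eps=1e-5):
    """Standardize each feature over the batch dimension, then scale by gamma."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(col) / n for col in cols]
    variances = [sum((v - m) ** 2 for v in col) / n for col, m in zip(cols, means)]
    return [[gamma * (v - m) / math.sqrt(var + eps)
             for v, m, var in zip(row, means, variances)] for row in rows]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bn_lstm_step(x, h_prev, c_prev, w_x, w_h, bias):
    """One batch-normalized LSTM step; gate order (f, i, o, g) is assumed."""
    d = len(c_prev[0])
    x_term = batch_norm(matmul(x, w_x))        # input-to-hidden, normalized
    h_term = batch_norm(matmul(h_prev, w_h))   # hidden-to-hidden, normalized
    c_new, out_gates = [], []
    for xr, hr, cr in zip(x_term, h_term, c_prev):
        pre = [a + b + k for a, b, k in zip(xr, hr, bias)]
        f, i, o, g = (pre[k * d:(k + 1) * d] for k in range(4))
        c_new.append([sigmoid(fj) * cj + sigmoid(ij) * math.tanh(gj)
                      for fj, ij, gj, cj in zip(f, i, g, cr)])
        out_gates.append(o)
    c_norm = batch_norm(c_new)                 # normalize the cell before the output
    h_new = [[sigmoid(oj) * math.tanh(cj) for oj, cj in zip(o, cn)]
             for o, cn in zip(out_gates, c_norm)]
    return h_new, c_new

# Toy usage: batch of 4 sequences, 3 input features, 2 hidden units.
random.seed(0)
batch, n_in, d = 4, 3, 2
x = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(batch)]
h0 = [[0.0] * d for _ in range(batch)]
c0 = [[0.0] * d for _ in range(batch)]
w_x = [[random.gauss(0, 0.5) for _ in range(4 * d)] for _ in range(n_in)]
w_h = [[random.gauss(0, 0.5) for _ in range(4 * d)] for _ in range(d)]
bias = [0.0] * (4 * d)
h1, c1 = bn_lstm_step(x, h0, c0, w_x, w_h, bias)
```

Normalizing the two projections separately lets each have its own scale, which is one plausible reading of batch-normalizing the hidden-to-hidden transition in addition to the input-to-hidden one.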

Calibrating Energy-based Generative Adversarial Networks

Feb 24, 2017

Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, Aaron Courville

Abstract: In this paper, we propose to equip Generative Adversarial Networks with the ability to produce direct energy estimates for samples. Specifically, we propose a flexible adversarial training framework, and prove this framework not only ensures the generator converges to the true data distribution, but also enables the discriminator to retain the density information at the global optimum. We derive the analytic form of the induced solution, and analyze its properties. In order to make the proposed framework trainable in practice, we introduce two effective approximation techniques. Empirically, the experimental results closely match our theoretical analysis, verifying that the discriminator is able to recover the energy of the data distribution.

* ICLR 2017 camera ready

Adversarially Learned Inference

Feb 21, 2017

Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, Aaron Courville

Abstract: We introduce the adversarially learned inference (ALI) model, which jointly learns a generation network and an inference network using an adversarial process. The generation network maps samples from stochastic latent variables to the data space while the inference network maps training examples in data space to the space of latent variables. An adversarial game is cast between these two networks and a discriminative network is trained to distinguish between joint latent/data-space samples from the generative network and joint samples from the inference network. We illustrate the ability of the model to learn mutually coherent inference and generation networks through the inspections of model samples and reconstructions and confirm the usefulness of the learned representations by obtaining a performance competitive with state-of-the-art on the semi-supervised SVHN and CIFAR10 tasks.
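
The adversarial game described in this abstract can be sketched with toy scalar networks. This is only an illustration under stated assumptions: the `encoder`, `decoder`, and `discriminator` functions, their fixed weights, and the labeling convention (1 for pairs from the inference network, 0 for pairs from the generation network) are hypothetical stand-ins, not the architecture or objective used in the paper.

```python
import math
import random

def encoder(x, w=0.5):
    """Hypothetical inference network: maps a data point to a noisy latent."""
    return w * x + random.gauss(0.0, 0.1)

def decoder(z, w=2.0):
    """Hypothetical generation network: maps a latent to a noisy data point."""
    return w * z + random.gauss(0.0, 0.1)

def discriminator(x, z, a=1.0, b=-1.0, c=0.0):
    """Probability that the joint pair (x, z) came from the inference network."""
    return 1.0 / (1.0 + math.exp(-(a * x + b * z + c)))

def discriminator_loss(data_batch, latent_batch):
    """Binary cross-entropy over joint (data, latent) pairs from both networks."""
    inference_pairs = [(x, encoder(x)) for x in data_batch]      # (x, inferred z)
    generation_pairs = [(decoder(z), z) for z in latent_batch]   # (generated x, z)
    loss = 0.0
    for x, z in inference_pairs:
        loss -= math.log(discriminator(x, z))
    for x, z in generation_pairs:
        loss -= math.log(1.0 - discriminator(x, z))
    return loss / (len(inference_pairs) + len(generation_pairs))

# Toy usage with standard-normal data and latents.
random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(8)]
latents = [random.gauss(0.0, 1.0) for _ in range(8)]
loss = discriminator_loss(data, latents)
```

The point of the sketch is that the discriminator always sees a data point and a latent together, so it can only be fooled when the two networks produce matching joint distributions.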

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

Feb 11, 2017

Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, Yoshua Bengio

Abstract: In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time. We show that our model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variations in the temporal sequences over very long time spans, on three datasets of different nature. Human evaluation on the generated samples indicates that our model is preferred over competing models. We also show how each component of the model contributes to the exhibited performance.

* Published as a conference paper at ICLR 2017

GuessWhat?! Visual object discovery through multi-modal dialogue

Feb 06, 2017

Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, Aaron Courville

Abstract: We introduce GuessWhat?!, a two-player guessing game as a testbed for research on the interplay of computer vision and dialogue systems. The goal of the game is to locate an unknown object in a rich image scene by asking a sequence of questions. Higher-level image understanding, like spatial reasoning and language grounding, is required to solve the proposed task. Our key contribution is the collection of a large-scale dataset consisting of 150K human-played games with a total of 800K visual question-answer pairs on 66K images. We explain our design decisions in collecting the dataset and introduce the oracle and questioner tasks that are associated with the two players of the game. We prototyped deep learning models to establish initial baselines of the introduced tasks.

* 23 pages; CVPR 2017 submission; see https://guesswhat.ai

A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering

Feb 05, 2017

Tegan Maharaj, Nicolas Ballas, Anna Rohrbach, Aaron Courville, Christopher Pal

Abstract: While deep convolutional neural networks frequently approach or exceed human-level performance at benchmark tasks involving static images, extending this success to moving images is not straightforward. Having models which can learn to understand video is of interest for many applications, including content recommendation, prediction, summarization, event/object detection and understanding human visual perception, but many domains lack sufficient data to explore and perfect video models. In order to address the need for a simple, quantitative benchmark for developing and understanding video, we present MovieFIB, a fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired. In addition to presenting statistics and a description of the dataset, we perform a detailed analysis of 5 different models' predictions, and compare these with human performance. We investigate the relative importance of language, static (2D) visual features, and moving (3D) visual features, as well as the effects of increasing dataset size, the number of frames sampled, and vocabulary size. We illustrate that this task is not solvable by a language model alone; that our model combining 2D and 3D visual information indeed provides the best result; and that all models perform significantly worse than human-level. We provide human evaluations for responses given by different models and find that accuracy on the MovieFIB evaluation corresponds well with human judgement. We suggest avenues for improving video models, and hope that the proposed dataset can be useful for measuring and encouraging progress in this very interesting field.

Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks

Jan 10, 2017

Ying Zhang, Mohammad Pezeshki, Philemon Brakel, Saizheng Zhang, Cesar Laurent, Yoshua Bengio, Aaron Courville

Abstract: Convolutional Neural Networks (CNNs) are effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition (ASR). Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models/Gaussian Mixture Models (HMMs/GMMs) have achieved the state-of-the-art in various benchmarks. Meanwhile, Connectionist Temporal Classification (CTC) with Recurrent Neural Networks (RNNs), which is proposed for labeling unsegmented sequences, makes it feasible to train an end-to-end speech recognition system instead of hybrid settings. However, RNNs are computationally expensive and sometimes difficult to train. In this paper, inspired by the advantages of both CNNs and the CTC approach, we propose an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly without recurrent connections. By evaluating the approach on the TIMIT phoneme recognition task, we show that the proposed model is not only computationally efficient, but also competitive with the existing baseline systems. Moreover, we argue that CNNs have the capability to model temporal correlations with appropriate context information.
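
To make the role of CTC in labeling unsegmented sequences concrete, the following small sketch shows the standard CTC collapsing rule: merge consecutive repeated symbols in a frame-level path, then drop the blank symbol. The function name `ctc_collapse` and the choice of "_" as the blank are assumptions for the example; the sketch covers only this mapping, not the CTC loss or the CNN described in the abstract.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this example

def ctc_collapse(path, blank=BLANK):
    """Map a frame-level path to a label sequence: merge repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != blank]

# Ten frames collapse to a five-symbol label sequence.
frames = "hh_e_ll_lo"
labels = "".join(ctc_collapse(frames))
```

A blank between two identical symbols is what allows a genuinely repeated label (the double "l" here) to survive the merge step.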

Generalizable Features From Unsupervised Learning

Dec 12, 2016

Mehdi Mirza, Aaron Courville, Yoshua Bengio

Abstract: Humans learn a predictive model of the world and use this model to reason about future events and the consequences of actions. In contrast to most machine predictors, we exhibit an impressive ability to generalize to unseen scenarios and reason intelligently in these settings. One important aspect of this ability is physical intuition (Lake et al., 2016). In this work, we explore the potential of unsupervised learning to find features that promote better generalization to settings outside the supervised training distribution. Our task is predicting the stability of towers of square blocks. We demonstrate that an unsupervised model, trained to predict future frames of a video sequence of stable and unstable block configurations, can yield features that support extrapolating stability prediction to block configurations outside the training set distribution.

A Benchmark for Endoluminal Scene Segmentation of Colonoscopy Images

Dec 02, 2016

David Vázquez, Jorge Bernal, F. Javier Sánchez, Gloria Fernández-Esparrach, Antonio M. López, Adriana Romero, Michal Drozdzal, Aaron Courville

Abstract: Colorectal cancer (CRC) is the third leading cause of cancer death worldwide. Currently, the standard approach to reduce CRC-related mortality is to perform regular screening in search of polyps, and colonoscopy is the screening tool of choice. The main limitations of this screening procedure are the polyp miss-rate and the inability to perform visual assessment of polyp malignancy. These drawbacks can be reduced by designing Decision Support Systems (DSS) aiming to help clinicians in the different stages of the procedure by providing endoluminal scene segmentation. Thus, in this paper, we introduce an extended benchmark of colonoscopy images, with the hope of establishing a new strong benchmark for colonoscopy image analysis research. We provide new baselines on this dataset by training standard fully convolutional networks (FCN) for semantic segmentation, significantly outperforming, without any further post-processing, prior results in endoluminal scene segmentation.

PixelVAE: A Latent Variable Model for Natural Images

Nov 15, 2016

Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, Aaron Courville

Abstract: Natural image modeling is a landmark challenge of unsupervised learning. Variational Autoencoders (VAEs) learn a useful latent representation and model global structure well but have difficulty capturing small details. PixelCNN models details very well, but lacks a latent code and is difficult to scale for capturing large structures. We present PixelVAE, a VAE model with an autoregressive decoder based on PixelCNN. Our model requires very few expensive autoregressive layers compared to PixelCNN and learns latent codes that are more compressed than a standard VAE while still capturing most non-trivial structure. Finally, we extend our model to a hierarchy of latent variables at different scales. Our model achieves state-of-the-art performance on binarized MNIST, competitive performance on 64x64 ImageNet, and high-quality samples on the LSUN bedrooms dataset.
