Mohammad Norouzi

QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension

Apr 23, 2018
Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, Quoc V. Le

Current end-to-end machine reading and question answering (Q&A) models are primarily based on recurrent neural networks (RNNs) with attention. Despite their success, these models are often slow for both training and inference due to the sequential nature of RNNs. We propose a new Q&A architecture called QANet, which does not require recurrent networks: its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions. On the SQuAD dataset, our model is 3x to 13x faster in training and 4x to 9x faster in inference, while achieving equivalent accuracy to recurrent models. The speed-up gain allows us to train the model with much more data. We hence combine our model with data generated by backtranslation from a neural machine translation model. On the SQuAD dataset, our single model, trained with augmented data, achieves 84.6 F1 score on the test set, which is significantly better than the best published F1 score of 81.8.

* Published as full paper in ICLR 2018

Neural Program Synthesis with Priority Queue Training

Mar 23, 2018
Daniel A. Abolafia, Mohammad Norouzi, Jonathan Shen, Rui Zhao, Quoc V. Le

We consider the task of program synthesis in the presence of a reward function over the output of programs, where the goal is to find programs with maximal rewards. We employ an iterative optimization scheme, where we train an RNN on a dataset of K best programs from a priority queue of the generated programs so far. Then, we synthesize new programs and add them to the priority queue by sampling from the RNN. We benchmark our algorithm, called priority queue training (or PQT), against genetic algorithm and reinforcement learning baselines on a simple but expressive Turing-complete programming language called BF. Our experimental results show that our simple PQT algorithm significantly outperforms the baselines. By adding a program length penalty to the reward function, we are able to synthesize short, human-readable programs.
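
The queue-maintenance step of the scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the per-character penalty form, and the toy reward used in the usage note are all assumptions, and the RNN that would be trained on the queue contents and sampled for new candidates is left out.

```python
import heapq


def length_penalized_reward(base_reward, program, penalty=0.01):
    # Hypothetical penalty form: subtract a small cost per program character.
    return base_reward - penalty * len(program)


def update_top_k(queue, candidates, reward_fn, k):
    """Keep the k highest-reward unique programs seen so far.

    `queue` is a min-heap of (reward, program) pairs, so queue[0] is always
    the weakest program currently retained.
    """
    seen = {prog for _, prog in queue}
    for prog in candidates:
        if prog in seen:
            continue
        item = (reward_fn(prog), prog)
        if len(queue) < k:
            heapq.heappush(queue, item)
            seen.add(prog)
        elif item > queue[0]:
            # Evict the current weakest program in favor of the new one.
            _, evicted = heapq.heapreplace(queue, item)
            seen.discard(evicted)
            seen.add(prog)
    return queue
```

In a full loop, the programs in `queue` would serve as the training set for the next round of maximum-likelihood updates, and freshly sampled programs would be passed in as `candidates`.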

Trust-PCL: An Off-Policy Trust Region Method for Continuous Control

Feb 22, 2018
Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans

Trust region methods, such as TRPO, are often used to stabilize policy optimization algorithms in reinforcement learning (RL). While current trust region strategies are effective for continuous control, they typically require a prohibitively large amount of on-policy interaction with the environment. To address this problem, we propose an off-policy trust region method, Trust-PCL. The algorithm is the result of observing that the optimal policy and state values of a maximum reward objective with a relative-entropy regularizer satisfy a set of multi-step pathwise consistencies along any path. Thus, Trust-PCL is able to maintain optimization stability while exploiting off-policy data to improve sample efficiency. When evaluated on a number of continuous control tasks, Trust-PCL improves the solution quality and sample efficiency of TRPO.

* ICLR 2018

Bridging the Gap Between Value and Policy Based Reinforcement Learning

Nov 22, 2017
Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans

We establish a new connection between value and policy based reinforcement learning (RL) based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Specifically, we show that softmax consistent action values correspond to optimal entropy regularized policy probabilities along any action sequence, regardless of provenance. From this observation, we develop a new RL algorithm, Path Consistency Learning (PCL), that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces. We examine the behavior of PCL in different scenarios and show that PCL can be interpreted as generalizing both actor-critic and Q-learning algorithms. We subsequently deepen the relationship by showing how a single model can be used to represent both a policy and the corresponding softmax state values, eliminating the need for a separate critic. The experimental evaluation demonstrates that PCL significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.

* NIPS 2017
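
The soft consistency error that PCL minimizes along a multi-step path can be sketched as below. The formula in the docstring is recalled from the paper rather than stated in the abstract, and the function and argument names are hypothetical; treat this as an illustration of the quantity, not a training implementation.

```python
def path_consistency_error(v_start, v_end, rewards, log_probs, gamma, tau):
    """Soft consistency error for a d-step path (assumed form):

        C = -V(s_0) + gamma^d * V(s_d)
            + sum_{j<d} gamma^j * (r_j - tau * log pi(a_j | s_j))

    v_start, v_end: state-value estimates at the two ends of the path.
    rewards, log_probs: per-step rewards and policy log-probabilities.
    gamma: discount factor; tau: entropy-regularization temperature.
    """
    d = len(rewards)
    discounted = sum(
        gamma**j * (r - tau * lp)
        for j, (r, lp) in enumerate(zip(rewards, log_probs))
    )
    return -v_start + gamma**d * v_end + discounted
```

A training loss would square this error and sum it over sub-paths drawn from both on-policy rollouts and a replay buffer. At the entropy-regularized optimum the error is zero on every path, which is why off-policy data can be used.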

Filtering Variational Objectives

Nov 12, 2017
Chris J. Maddison, Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, Yee Whye Teh

When used as a surrogate objective for maximum likelihood estimation in latent variable models, the evidence lower bound (ELBO) produces state-of-the-art results. Inspired by this, we consider the extension of the ELBO to a family of lower bounds defined by a particle filter's estimator of the marginal likelihood, the filtering variational objectives (FIVOs). FIVOs take the same arguments as the ELBO, but can exploit a model's sequential structure to form tighter bounds. We present results that relate the tightness of FIVO's bound to the variance of the particle filter's estimator by considering the generic case of bounds defined as log-transformed likelihood estimators. Experimentally, we show that training with FIVO results in substantial improvements over training the same model architecture with the ELBO on sequential data.
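
To make the "log-transformed likelihood estimator" concrete, here is a minimal sketch of the log of a particle filter's marginal-likelihood estimate. Several things are assumptions made for illustration: the toy model (a 1-D Gaussian random walk observed with Gaussian noise), the bootstrap proposal that samples from the transition prior, and multinomial resampling at every step. FIVO as described uses a learned proposal, and its resampling scheme may differ.

```python
import math
import random


def log_mean_exp(xs):
    # Numerically stable log of the average of exp(x) over xs.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))


def normal_logpdf(x, mean, std):
    return -0.5 * math.log(2 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2


def particle_filter_log_estimate(observations, num_particles, rng,
                                 trans_std=1.0, obs_std=1.0):
    """Return log p_hat(y_1:T) for the toy random-walk model (latent x_0 = 0)."""
    particles = [0.0] * num_particles
    log_z = 0.0
    for y in observations:
        # Propose each particle from the transition prior (bootstrap proposal).
        particles = [rng.gauss(x, trans_std) for x in particles]
        log_w = [normal_logpdf(y, x, obs_std) for x in particles]
        # Each step contributes the log of the average incremental weight.
        log_z += log_mean_exp(log_w)
        # Multinomial resampling in proportion to the weights.
        m = max(log_w)
        weights = [math.exp(lw - m) for lw in log_w]
        particles = rng.choices(particles, weights=weights, k=num_particles)
    return log_z
```

Because the underlying estimator is unbiased for the marginal likelihood, Jensen's inequality makes the expectation of this returned value a lower bound on the log marginal likelihood. With a single time step and no sequential structure, it reduces to an importance-weighted bound.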

Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs

Aug 08, 2017
Michael Gygli, Mohammad Norouzi, Anelia Angelova

We approach structured output prediction by optimizing a deep value network (DVN) to precisely estimate the task loss on different output configurations for a given input. Once the model is trained, we perform inference by gradient descent on the continuous relaxations of the output variables to find outputs with promising scores from the value network. When applied to image segmentation, the value network takes an image and a segmentation mask as inputs and predicts a scalar estimating the intersection over union between the input and ground truth masks. For multi-label classification, the DVN's objective is to correctly predict the F1 score for any potential label configuration. The DVN framework achieves state-of-the-art results on multi-label prediction and image segmentation benchmarks.

* Published at ICML 2017
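
The inference procedure described above, gradient-based search over relaxed outputs in [0, 1], can be sketched as follows. Everything here is a stand-in chosen for illustration: `value_fn` takes the place of the trained value network, gradients are approximated by finite differences instead of backpropagation, and `soft_iou` is one common relaxation of intersection over union that is assumed rather than taken from the abstract.

```python
def soft_iou(pred, target):
    # Relaxed IoU for masks with entries in [0, 1] (assumed relaxation).
    inter = sum(min(p, t) for p, t in zip(pred, target))
    union = sum(max(p, t) for p, t in zip(pred, target))
    return inter / union if union > 0 else 1.0


def refine(value_fn, y_init, steps=50, lr=0.1, eps=1e-4):
    """Iteratively move y toward higher predicted value, clipped to [0, 1]."""
    y = list(y_init)
    for _ in range(steps):
        grad = []
        for i in range(len(y)):
            y_hi = y[:]
            y_hi[i] += eps
            y_lo = y[:]
            y_lo[i] -= eps
            # Central finite difference stands in for backprop through a network.
            grad.append((value_fn(y_hi) - value_fn(y_lo)) / (2 * eps))
        y = [min(1.0, max(0.0, yi + lr * g)) for yi, g in zip(y, grad)]
    return y
```

With a real value network, the same loop would take gradients with respect to the output variables while holding the input image and the network weights fixed.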

Device Placement Optimization with Reinforcement Learning

Jun 25, 2017
Azalia Mirhoseini, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, Jeff Dean

The past few years have witnessed a growth in size and computational requirements for training and inference with neural networks. Currently, a common approach to address these requirements is to use a heterogeneous distributed environment with a mixture of hardware devices such as CPUs and GPUs. Importantly, the decision of placing parts of the neural models on devices is often made by human experts based on simple heuristics and intuitions. In this paper, we propose a method that learns to optimize device placement for TensorFlow computational graphs. Key to our method is the use of a sequence-to-sequence model to predict which subsets of operations in a TensorFlow graph should run on which of the available devices. The execution time of the predicted placements is then used as the reward signal to optimize the parameters of the sequence-to-sequence model. Our main result is that on Inception-V3 for ImageNet classification, and on RNN LSTM for language modeling and neural machine translation, our model finds non-trivial device placements that outperform hand-crafted heuristics and traditional algorithmic methods.

* To appear at ICML 2017
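
Using measured execution time as a reward signal can be illustrated with a single policy-gradient step. This sketch deliberately simplifies the setup described above: it replaces the sequence-to-sequence model with one independent vector of device logits per operation, and the reward shaping (negative runtime relative to a baseline) and all names are assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, device, runtime, baseline, lr=0.1):
    """One REINFORCE step on the device logits of a single operation.

    device: index of the device that was sampled for this operation.
    runtime: measured execution time of the whole sampled placement.
    baseline: reference runtime, e.g. a moving average of past runtimes.
    """
    advantage = baseline - runtime  # faster than baseline => positive
    probs = softmax(logits)
    # Gradient of log softmax(logits)[device] is one_hot(device) - probs.
    return [
        logit + lr * advantage * ((1.0 if d == device else 0.0) - p)
        for d, (logit, p) in enumerate(zip(logits, probs))
    ]
```

Placements that run faster than the baseline raise the probability of the devices that were chosen, and slower ones lower it.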

N-gram Language Modeling using Recurrent Neural Network Estimation

Jun 20, 2017
Ciprian Chelba, Mohammad Norouzi, Samy Bengio

We investigate the effective memory depth of RNN models by using them for $n$-gram language model (LM) smoothing. Experiments on a small corpus (UPenn Treebank, one million words of training data and 10k vocabulary) have found the LSTM cell with dropout to be the best model for encoding the $n$-gram state when compared with feed-forward and vanilla RNN models. When preserving the sentence independence assumption, the LSTM $n$-gram matches the LSTM LM performance for $n=9$ and slightly outperforms it for $n=13$. When allowing dependencies across sentence boundaries, the LSTM $13$-gram almost matches the perplexity of the unlimited history LSTM LM. LSTM $n$-gram smoothing also has the desirable property of improving with increasing $n$-gram order, unlike the Katz or Kneser-Ney back-off estimators. Using multinomial distributions as targets in training instead of the usual one-hot target is only slightly beneficial for low $n$-gram orders. Experiments on the One Billion Words benchmark show that the results hold at larger scale: while LSTM smoothing for short $n$-gram contexts does not provide significant advantages over classic $n$-gram models, it becomes effective with long contexts ($n > 5$); depending on the task and amount of data it can match fully recurrent LSTM models at about $n=13$. This may have implications when modeling short-format text, e.g. voice search/query LMs. Building LSTM $n$-gram LMs may be appealing for some practical situations: the state in an $n$-gram LM can be succinctly represented with $(n-1)*4$ bytes storing the identity of the words in the context, and batches of $n$-gram contexts can be processed in parallel. On the downside, the $n$-gram context encoding computed by the LSTM is discarded, making the model more expensive than a regular recurrent LSTM LM.

* 10 pages, including references
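
The compact state representation mentioned above, $(n-1)*4$ bytes per context, can be sketched as follows. The 4 bytes per word come from the abstract; the little-endian unsigned 32-bit layout, the padding id, and the function names are assumptions made for the illustration.

```python
import struct


def encode_ngram_state(context_word_ids):
    # Pack the n-1 context word ids as unsigned 32-bit integers (4 bytes each).
    return struct.pack(f"<{len(context_word_ids)}I", *context_word_ids)


def ngram_contexts(word_ids, n, pad_id=0):
    """Return the (n-1)-word context that precedes each word of a sentence.

    The sentence start is padded with `pad_id` so every context has the same
    length, which lets a batch of contexts be processed in parallel.
    """
    padded = [pad_id] * (n - 1) + list(word_ids)
    return [tuple(padded[i:i + n - 1]) for i in range(len(word_ids))]
```

For a 5-gram model, each context packs into 16 bytes, and every context in a sentence can be encoded independently of the others.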

PixColor: Pixel Recursive Colorization

Jun 05, 2017
Sergio Guadarrama, Ryan Dahl, David Bieber, Mohammad Norouzi, Jonathon Shlens, Kevin Murphy

We propose a novel approach to automatically produce multiple colorized versions of a grayscale image. Our method results from the observation that the task of automated colorization is relatively easy given a low-resolution version of the color image. We first train a conditional PixelCNN to generate a low resolution color for a given grayscale image. Then, given the generated low-resolution color image and the original grayscale image as inputs, we train a second CNN to generate a high-resolution colorization of an image. We demonstrate that our approach produces more diverse and plausible colorizations than existing methods, as judged by human raters in a "Visual Turing Test".

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

Apr 05, 2017
Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, Mohammad Norouzi

Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive.
