The offline reinforcement learning (RL) problem, also known as batch RL, is the setting in which a policy must be learned from a dataset of previously collected data, without additional online data collection. In supervised learning, large datasets and complex deep neural networks have fueled impressive progress, but in contrast, conventional RL algorithms must collect large amounts of on-policy data and have had little success leveraging previously collected datasets. As a result, existing RL benchmarks are not well-suited for the offline setting, making progress in this area difficult to measure. To design a benchmark tailored to offline RL, we start by outlining key properties of datasets relevant to applications of offline RL. Based on these properties, we design a set of benchmark tasks and datasets that evaluate offline RL algorithms under these conditions. Examples of such properties include: datasets generated via hand-designed controllers and human demonstrators; multi-objective datasets, in which an agent can perform different tasks in the same environment; and datasets consisting of a heterogeneous mix of high-quality and low-quality trajectories. By designing the benchmark tasks and datasets to reflect properties of real-world offline RL problems, our benchmark will focus research effort on methods that drive substantial improvements not just on simulated benchmarks, but ultimately on the kinds of real-world problems where offline RL will have the largest impact.
The ability to learn new concepts with small amounts of data is a critical aspect of intelligence that has proven challenging for deep learning methods. Meta-learning has emerged as a promising technique for leveraging data from previous tasks to enable efficient learning of new tasks. However, most meta-learning algorithms implicitly require that the meta-training tasks be mutually-exclusive, such that no single model can solve all of the tasks at once. For example, when creating tasks for few-shot image classification, prior work uses a per-task random assignment of image classes to N-way classification labels. If this is not done, the meta-learner can ignore the task training data and learn a single model that performs all of the meta-training tasks zero-shot, but does not adapt effectively to new image classes. This requirement means that the user must take great care in designing the tasks, for example by shuffling labels or removing task-identifying information from the inputs. In some domains, this makes meta-learning entirely inapplicable. In this paper, we address this challenge by designing a meta-regularization objective using information theory that places precedence on data-driven adaptation. This causes the meta-learner to decide what must be learned from the task training data and what should be inferred from the task testing input. By doing so, our algorithm can successfully use data from non-mutually-exclusive tasks to efficiently adapt to novel tasks. We demonstrate its applicability to both contextual and gradient-based meta-learning algorithms, and apply it in practical settings where applying standard meta-learning has been difficult. Our approach substantially outperforms standard meta-learning algorithms in these settings.
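The sketch below illustrates one plausible instantiation of such a meta-regularizer: a weight-space information penalty added to a MAML-style meta-objective on a toy few-shot sinusoid regression problem. The Gaussian distribution over weights, the single inner gradient step, the network size, and the penalty weight `beta` are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch, assuming a toy sinusoid task distribution and a Gaussian "posterior"
# over the meta-parameters whose KL to a standard normal prior acts as the
# information-theoretic meta-regularizer.
import torch

torch.manual_seed(0)

def sample_task(n=10):
    """Toy regression task: support and query sets share a random amplitude/phase."""
    a, b = torch.rand(1) * 4 + 1, torch.rand(1) * 3.14
    def draw(m):
        x = torch.rand(m, 1) * 10 - 5
        return x, a * torch.sin(x + b)
    return draw(n), draw(n)

def forward(params, x):
    w1, b1, w2, b2 = params
    return torch.tanh(x @ w1 + b1) @ w2 + b2

# Meta-parameters: mean and log-std of a Gaussian over the network weights.
shapes = [(1, 40), (40,), (40, 1), (1,)]
mu = [torch.randn(s) * 0.1 for s in shapes]
log_sigma = [torch.full(s, -3.0) for s in shapes]
for t in mu + log_sigma:
    t.requires_grad_(True)
opt = torch.optim.Adam(mu + log_sigma, lr=1e-3)
beta, inner_lr = 1e-3, 0.01

for step in range(2000):
    (x_tr, y_tr), (x_te, y_te) = sample_task()
    # Sample weights from the meta-level Gaussian (reparameterization trick).
    theta = [m + torch.exp(s) * torch.randn_like(m) for m, s in zip(mu, log_sigma)]
    # One MAML-style inner gradient step on the task training (support) set.
    inner_loss = ((forward(theta, x_tr) - y_tr) ** 2).mean()
    grads = torch.autograd.grad(inner_loss, theta, create_graph=True)
    adapted = [t - inner_lr * g for t, g in zip(theta, grads)]
    # Outer loss on the task test (query) set, plus a KL penalty that limits how much
    # task information the meta-parameters can store, so the model must rely on the
    # support data rather than memorizing the meta-training tasks.
    outer_loss = ((forward(adapted, x_te) - y_te) ** 2).mean()
    kl = sum((0.5 * (m ** 2 + torch.exp(2 * s) - 2 * s - 1)).sum()
             for m, s in zip(mu, log_sigma))
    loss = outer_loss + beta * kl
    opt.zero_grad()
    loss.backward()
    opt.step()
```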
In reinforcement learning (RL) research, it is common to assume access to direct online interactions with the environment. However, in many real-world applications, access to the environment is limited to a fixed offline dataset of logged experience. In such settings, standard RL algorithms have been shown to diverge or otherwise yield poor performance. Accordingly, recent work has suggested a number of remedies to these issues. In this work, we introduce a general framework, behavior regularized actor critic (BRAC), to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks. Surprisingly, we find that many of the technical complexities introduced in recent methods are unnecessary to achieve strong performance. Additional ablations provide insights into which design choices matter most in the offline RL setting.
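To make "behavior regularization" concrete, here is a minimal sketch (not the BRAC reference implementation) of an actor update with Gaussian policies, where a KL penalty toward a behavior-cloning model is added to the actor loss. The network sizes, penalty weight `alpha`, and the choice of the policy-regularization (rather than value-penalty) variant are assumptions made for brevity.

```python
# Minimal sketch of a behavior-regularized actor update on a batch of offline states.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

state_dim, action_dim, alpha = 8, 2, 0.1

def gaussian_policy():
    # Outputs mean and log-std for a diagonal Gaussian over actions.
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim))

actor = gaussian_policy()
behavior = gaussian_policy()   # stands in for a policy pretrained by behavior cloning
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def dist(net, s):
    mean, log_std = net(s).chunk(2, dim=-1)
    return Normal(mean, log_std.clamp(-5, 2).exp())

def actor_loss(states):
    pi = dist(actor, states)
    actions = pi.rsample()                                   # reparameterized sample
    q = critic(torch.cat([states, actions], dim=-1))         # critic stands in for a learned Q
    # Penalize divergence from the behavior policy that generated the dataset.
    kl = kl_divergence(pi, dist(behavior, states)).sum(-1)
    return (-q.squeeze(-1) + alpha * kl).mean()

states = torch.randn(32, state_dim)   # stand-in for a minibatch from the offline dataset
loss = actor_loss(states)
actor_opt.zero_grad()
loss.backward()
actor_opt.step()
```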
Posterior collapse in Variational Autoencoders (VAEs) arises when the variational posterior distribution closely matches the prior for a subset of latent variables. This paper presents a simple and intuitive explanation for posterior collapse through the analysis of linear VAEs and their direct correspondence with Probabilistic PCA (pPCA). We explain how posterior collapse may occur in pPCA due to local maxima in the log marginal likelihood. Unexpectedly, we prove that the ELBO objective for the linear VAE does not introduce additional spurious local maxima relative to the log marginal likelihood. We show further that training a linear VAE with exact variational inference recovers an identifiable global maximum corresponding to the principal component directions. Empirically, we find that our linear analysis is predictive even for high-capacity, non-linear VAEs and helps explain the relationship between the observation noise, local maxima, and posterior collapse in deep Gaussian VAEs.
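To make the linear analysis concrete, the sketch below trains a linear VAE with a diagonal-Gaussian encoder on the ELBO and then reports the per-dimension KL, which is near zero for collapsed latents. The dimensions, fixed observation noise, and synthetic data are illustrative assumptions; the paper's exact-inference and pPCA analyses are not reproduced here.

```python
# Minimal sketch of a linear VAE trained by the ELBO, used to inspect which latent
# dimensions are actually used and which collapse to the prior.
import torch

torch.manual_seed(0)
d, k, sigma, n = 10, 5, 0.5, 2000

# Synthetic data with a few dominant principal directions.
scales = torch.linspace(3.0, 0.1, d)
data = torch.randn(n, d) * scales

W_dec = torch.randn(d, k, requires_grad=True)    # decoder (pPCA-style loading matrix)
V_enc = torch.zeros(k, d, requires_grad=True)    # encoder mean map
log_var = torch.zeros(k, requires_grad=True)     # encoder (diagonal) log-variance
opt = torch.optim.Adam([W_dec, V_enc, log_var], lr=1e-2)

for step in range(2000):
    x = data[torch.randint(n, (128,))]
    mu_z = x @ V_enc.T
    std_z = torch.exp(0.5 * log_var)
    z = mu_z + std_z * torch.randn_like(mu_z)    # reparameterization trick
    recon = z @ W_dec.T
    # Gaussian reconstruction term with fixed observation noise sigma.
    rec_term = ((x - recon) ** 2).sum(-1).mean() / (2 * sigma ** 2)
    # KL(q(z|x) || N(0, I)); collapsed dimensions drive their share of this to ~0.
    kl = 0.5 * (mu_z ** 2 + std_z ** 2 - log_var - 1).sum(-1).mean()
    loss = rec_term + kl
    opt.zero_grad()
    loss.backward()
    opt.step()

# Per-dimension KL reveals which latents are used and which have collapsed.
with torch.no_grad():
    mu_z = data @ V_enc.T
    per_dim_kl = 0.5 * (mu_z ** 2 + torch.exp(log_var) - log_var - 1).mean(0)
    print("per-dimension KL:", per_dim_kl)
```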
Energy-based models (EBMs) are powerful probabilistic models, but suffer from intractable sampling and density evaluation due to the partition function. As a result, inference in EBMs relies on approximate sampling algorithms, leading to a mismatch between the model and inference. Motivated by this, we consider the sampler-induced distribution as the model of interest and maximize the likelihood of this model. This yields a class of energy-inspired models (EIMs) that incorporate learned energy functions while still providing exact samples and tractable log-likelihood lower bounds. We describe and evaluate three instantiations of such models based on truncated rejection sampling, self-normalized importance sampling, and Hamiltonian importance sampling. These models perform comparably to or outperform the recently proposed Learned Accept/Reject Sampling algorithm, and they provide new insights on ranking Noise Contrastive Estimation and Contrastive Predictive Coding. Moreover, EIMs allow us to generalize a recent connection between multi-sample variational lower bounds and auxiliary variable variational inference. We show how recent variational bounds can be unified with EIMs as the variational family.
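The self-normalized importance sampling (SNIS) instantiation is the easiest to sketch. Under the assumptions below (a one-dimensional fixed Gaussian proposal, a small learned energy network, and illustrative hyperparameters and data), the model picks one of K proposal samples with probability proportional to its exponentiated energy, and the corresponding tractable log-likelihood lower bound is formed from the data point plus K-1 proposal samples. This is a sketch of that structure, not the paper's implementation.

```python
# Minimal sketch of an SNIS-style energy-inspired model with the lower bound
#   log p(x) >= E_{z_1..K-1 ~ q}[ log q(x) + f(x) - log((exp f(x) + sum_i exp f(z_i)) / K) ].
import math
import torch
import torch.nn as nn
from torch.distributions import Normal

torch.manual_seed(0)
K = 128
proposal = Normal(0.0, 2.0)                                   # fixed proposal q
energy = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(energy.parameters(), lr=1e-3)

def log_prob_lower_bound(x):
    """Single-sample Monte Carlo estimate of the SNIS lower bound for a batch of scalars x."""
    z = proposal.sample((x.shape[0], K - 1, 1))               # negatives from the proposal
    f_x = energy(x.unsqueeze(-1)).squeeze(-1)                 # (B,)
    f_z = energy(z).squeeze(-1)                               # (B, K-1)
    log_norm = torch.logsumexp(torch.cat([f_x.unsqueeze(1), f_z], dim=1), dim=1) - math.log(K)
    return proposal.log_prob(x) + f_x - log_norm

# Fit the EIM to data the proposal alone explains poorly (a shifted, narrow Gaussian).
data = torch.randn(4096) * 0.3 + 1.5
for step in range(500):
    x = data[torch.randint(len(data), (256,))]
    loss = -log_prob_lower_bound(x).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```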
Heuristic algorithms such as simulated annealing, Concorde, and METIS are effective and widely used approaches to find solutions to combinatorial optimization problems. However, they are limited by the high sample complexity required to reach a reasonable solution from a cold start. In this paper, we introduce a novel framework, named RLHO, that uses reinforcement learning (RL) to generate better initial solutions for heuristic algorithms. The heuristic algorithm then greedily improves upon the initial solution produced by RL, and we demonstrate novel results in which RL leverages the performance of the heuristic as a learning signal to generate better initializations. We apply this framework to Proximal Policy Optimization (PPO) and Simulated Annealing (SA). We conduct a series of experiments on the well-known NP-complete bin packing problem, and show that the RLHO method outperforms our baselines. We show that on the bin packing problem, RL can learn to help heuristics perform even better, allowing us to combine the best parts of both approaches.
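The two-stage structure is easiest to see in code. The sketch below is a stripped-down stand-in: a toy bin packing instance, a first-fit decoder, simulated annealing as the improvement heuristic, and a random ordering in place of the PPO proposer, whose reward would be the negative final bin count. The item sizes, bin capacity, annealing schedule, and the random "policy" are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch: a proposer supplies an initial packing, simulated annealing refines it,
# and the final cost becomes the learning signal that would be fed back to the proposer.
import math
import random

random.seed(0)
CAPACITY = 1.0
items = [round(random.uniform(0.1, 0.7), 2) for _ in range(40)]

def pack(order):
    """First-fit packing of items in the given order; returns a list of bins (lists of sizes)."""
    bins = []
    for size in order:
        for b in bins:
            if sum(b) + size <= CAPACITY:
                b.append(size)
                break
        else:
            bins.append([size])
    return bins

def anneal(order, steps=5000, t0=1.0):
    """Simulated annealing over item orderings: swap two items per step and accept
    worsening swaps with Boltzmann probability under a decaying temperature."""
    cur, cur_cost = list(order), len(pack(order))
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-6
        cand = list(cur)
        i, j = random.sample(range(len(cand)), 2)
        cand[i], cand[j] = cand[j], cand[i]
        c = len(pack(cand))
        if c <= cur_cost or random.random() < math.exp(-(c - cur_cost) / t):
            cur, cur_cost = cand, c
    return cur, cur_cost

# The "policy" proposes an initial ordering (here random; RLHO would use PPO's output),
# SA refines it, and the negative final bin count is the reward for the proposer.
initial_order = random.sample(items, len(items))
refined_order, final_bins = anneal(initial_order)
reward = -final_bins
print("bins after SA:", final_bins, "reward to proposer:", reward)
```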
Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming approaches based on Q-learning and actor-critic methods are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, we study the setting where the off-policy experience is fixed and there is no further interaction with the environment. We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator. We theoretically analyze bootstrapping error, and demonstrate how carefully constraining action selection in the backup can mitigate it. Based on our analysis, we propose a practical algorithm, bootstrapping error accumulation reduction (BEAR). We demonstrate that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.
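One concrete way to constrain action selection toward the support of the data, and the kind of discrepancy measure BEAR builds its constraint around, is a sampled kernel MMD between actions proposed by the learned policy and actions observed in the dataset. The sketch below computes such a penalty; the Gaussian kernel, its bandwidth, and the way the penalty would be weighted in the actor update are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an MMD-based support penalty between policy actions and dataset actions.
import torch

def mmd(x, y, bandwidth=0.5):
    """Biased estimate of squared MMD with a Gaussian kernel; x, y have shape (N, action_dim)."""
    def k(a, b):
        d = (a.unsqueeze(1) - b.unsqueeze(0)).pow(2).sum(-1)   # pairwise squared distances
        return torch.exp(-d / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Example: policy actions that drift away from the dataset's narrow action support
# incur a large penalty, discouraging bootstrapping from out-of-distribution actions.
dataset_actions = torch.randn(64, 2) * 0.1              # narrow behavior distribution
policy_actions = torch.randn(64, 2) * 0.1 + 0.8         # out-of-support proposals
print("MMD penalty:", mmd(policy_actions, dataset_actions).item())

# In an actor update, this penalty would be combined with the Q-value term, e.g.
#   actor_loss = -Q(s, a_pi).mean() + lagrange_multiplier * mmd(a_pi, a_data)
# so the Bellman backup only bootstraps from actions that stay within the data support.
```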
Estimating and optimizing Mutual Information (MI) is core to many problems in machine learning; however, bounding MI in high dimensions is challenging. To establish tractable and scalable objectives, recent work has turned to variational bounds parameterized by neural networks, but the relationships and tradeoffs between these bounds remain unclear. In this work, we unify these recent developments in a single framework. We find that the existing variational lower bounds degrade when the MI is large, exhibiting either high bias or high variance. To address this problem, we introduce a continuum of lower bounds that encompasses previous bounds and flexibly trades off bias and variance. On high-dimensional, controlled problems, we empirically characterize the bias and variance of the bounds and their gradients and demonstrate the effectiveness of our new bounds for estimation and representation learning.
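As one concrete example of the family of bounds being unified, here is a minimal sketch of the InfoNCE (CPC-style) lower bound estimated with a separable neural critic on correlated Gaussians, for which the true MI is known in closed form. The critic architecture, dimensionality, and correlation are illustrative assumptions, and the new interpolated bounds are not reproduced here.

```python
# Minimal sketch of the InfoNCE lower bound on MI with in-batch negatives.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, rho = 4, 0.8
f_x = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 16))
f_y = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 16))
opt = torch.optim.Adam(list(f_x.parameters()) + list(f_y.parameters()), lr=1e-3)

def sample_batch(n=128):
    # Correlated Gaussians; true MI is -0.5 * dim * log(1 - rho**2) nats.
    x = torch.randn(n, dim)
    y = rho * x + (1 - rho ** 2) ** 0.5 * torch.randn(n, dim)
    return x, y

def infonce(x, y):
    """InfoNCE bound: average log-softmax of each paired score against in-batch negatives."""
    scores = f_x(x) @ f_y(y).T                                # (n, n) separable critic matrix
    return torch.diagonal(torch.log_softmax(scores, dim=1)).mean() + math.log(len(x))

for step in range(2000):
    x, y = sample_batch()
    loss = -infonce(x, y)                                     # maximize the lower bound
    opt.zero_grad()
    loss.backward()
    opt.step()

# Note: the InfoNCE estimate is capped at log(batch size), one of the bias/variance
# tradeoffs this line of work characterizes.
print("estimated MI lower bound (nats):", infonce(*sample_batch(512)).item())
```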