Cong Lu

Synthetic Experience Replay

Mar 12, 2023
Cong Lu, Philip J. Ball, Jack Parker-Holder

A key theme in the past decade has been that when large neural networks and large datasets combine, they can produce remarkable results. In deep reinforcement learning (RL), this paradigm is commonly made possible through experience replay, whereby a dataset of past experiences is used to train a policy or value function. However, unlike in supervised or self-supervised learning, an RL agent has to collect its own data, which is often limited. Thus, it is challenging to reap the benefits of deep learning, and even small neural networks can overfit at the start of training. In this work, we leverage the tremendous recent progress in generative modeling and propose Synthetic Experience Replay (SynthER), a diffusion-based approach to arbitrarily upsample an agent's collected experience. We show that SynthER is an effective method for training RL agents across offline and online settings. In offline settings, we observe drastic improvements both when upsampling small offline datasets and when training larger networks with additional synthetic data. Furthermore, SynthER enables online agents to train with a much higher update-to-data ratio than before, leading to a large increase in sample efficiency, without any algorithmic changes. We believe that synthetic training data could open the door to realizing the full potential of deep learning for replay-based RL algorithms from limited data.
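
The core recipe is straightforward to sketch: fit a generative model to the real transitions in the replay buffer, then sample from it to generate arbitrarily many synthetic transitions. Below is a minimal, illustrative DDPM-style version in PyTorch; it is not the paper's implementation (SynthER's denoiser architecture and diffusion formulation differ), and names such as TransitionDenoiser are hypothetical.

```python
import torch
import torch.nn as nn

T = 100                                 # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)   # noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

class TransitionDenoiser(nn.Module):
    """Predicts the noise added to a flattened (s, a, r, s') transition."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )
    def forward(self, x, t):
        # Condition on the (normalized) timestep by concatenation.
        return self.net(torch.cat([x, t.float().unsqueeze(-1) / T], dim=-1))

def train_step(model, opt, real_transitions):
    """One denoising step on a batch of real transitions from the buffer."""
    x0 = real_transitions
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(-1)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * noise   # forward diffusion
    loss = ((model(xt, t) - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def sample_synthetic(model, n, dim):
    """Reverse diffusion: draw n synthetic transitions from pure noise."""
    x = torch.randn(n, dim)
    for t in reversed(range(T)):
        tt = torch.full((n,), t)
        eps = model(x, tt)
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```

Sampled transitions would then be mixed into (or substituted for) the real buffer before standard off-policy updates.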

On Pathologies in KL-Regularized Reinforcement Learning from Expert Demonstrations

Dec 28, 2022
Tim G. J. Rudner, Cong Lu, Michael A. Osborne, Yarin Gal, Yee Whye Teh

KL-regularized reinforcement learning from expert demonstrations has proved successful in improving the sample efficiency of deep reinforcement learning algorithms, allowing them to be applied to challenging physical real-world tasks. However, we show that KL-regularized reinforcement learning with behavioral reference policies derived from expert demonstrations can suffer from pathological training dynamics that lead to slow, unstable, and suboptimal online learning. We show empirically that the pathology occurs for commonly chosen behavioral policy classes and demonstrate its impact on sample efficiency and online policy performance. Finally, we show that the pathology can be remedied by non-parametric behavioral reference policies, and that this allows KL-regularized reinforcement learning to significantly outperform state-of-the-art approaches on a variety of challenging locomotion and dexterous hand manipulation tasks.
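
For concreteness, here is a hedged sketch of the objective under analysis: a standard actor loss with a KL penalty toward a behavioral reference policy pi_0 fit to demonstrations. The Gaussian policy heads and the coefficient alpha are illustrative assumptions; the paper's point is that a parametric pi_0 with poorly calibrated predictive variance away from the demonstration data makes this KL term blow up, while a non-parametric pi_0 avoids the pathology.

```python
import torch
import torch.distributions as D

def kl_regularized_actor_loss(q_fn, policy, reference, states, alpha=0.1):
    """L(pi) = E[ -Q(s, a) + alpha * KL(pi(.|s) || pi_0(.|s)) ]."""
    mu, std = policy(states)          # current policy head
    pi = D.Normal(mu, std)
    a = pi.rsample()                  # reparameterized action sample
    mu0, std0 = reference(states)     # behavioral reference policy pi_0
    pi0 = D.Normal(mu0, std0)
    kl = D.kl_divergence(pi, pi0).sum(-1)   # closed form for Gaussians
    return (-q_fn(states, a) + alpha * kl).mean()
```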

* Published in Advances in Neural Information Processing Systems 34 (NeurIPS 2021) 

Go-Explore Complex 3D Game Environments for Automated Reachability Testing

Sep 01, 2022
Cong Lu, Raluca Georgescu, Johan Verwey

Modern AAA video games feature huge game levels and maps that are increasingly hard for level testers to cover exhaustively. As a result, games often ship with catastrophic bugs such as the player falling through the floor or getting stuck in walls. We propose an approach specifically targeted at reachability bugs in simulated 3D environments, based on the powerful exploration algorithm Go-Explore, which saves unique checkpoints across the map and then identifies promising ones to explore from. We show that when coupled with simple heuristics derived from the game's navigation mesh, Go-Explore finds challenging bugs and comprehensively explores complex environments without the need for human demonstrations or knowledge of the game dynamics. Go-Explore vastly outperforms more complicated baselines, including reinforcement learning with intrinsic curiosity, in both navigation-mesh coverage and the number of unique positions discovered across the map. Finally, due to our use of parallel agents, our algorithm can fully cover a vast 1.5km x 1.5km game world within 10 hours on a single machine, making it extremely promising for continuous testing suites.
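
The loop itself is compact. The sketch below uses a hypothetical game-environment interface (checkpoint, restore, step, sample_action, out_of_bounds), with a simple visit-count selection rule standing in for the paper's navmesh-derived heuristics, to show the archive-select-restore-explore cycle:

```python
def discretize(position, cell_size=1.0):
    """Map a continuous (x, y, z) position onto an archive cell."""
    x, y, z = position
    return (int(x // cell_size), int(y // cell_size), int(z // cell_size))

def go_explore(env, iterations=10_000, horizon=50):
    archive = {}  # cell -> [checkpoint, visit_count]
    pos = env.reset()
    archive[discretize(pos)] = [env.checkpoint(), 0]
    for _ in range(iterations):
        # Select a promising cell to return to: least-visited first.
        cell = min(archive, key=lambda c: archive[c][1])
        archive[cell][1] += 1
        env.restore(archive[cell][0])
        for _ in range(horizon):                # explore outward from the cell
            pos = env.step(env.sample_action())
            c = discretize(pos)
            if c not in archive:                # new reachable area: save it
                archive[c] = [env.checkpoint(), 0]
            if env.out_of_bounds(pos):          # e.g. fell through the floor
                print("reachability bug at", pos)
    return archive
```

Because each restore is independent, many such loops can run in parallel against the same shared archive, which is what makes full coverage of a large world tractable on one machine.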

Bayesian Generational Population-Based Training

Jul 19, 2022
Xingchen Wan, Cong Lu, Jack Parker-Holder, Philip J. Ball, Vu Nguyen, Binxin Ru, Michael A. Osborne

Reinforcement learning (RL) offers the potential for training generally capable agents that can interact autonomously in the real world. However, one key limitation is the brittleness of RL algorithms to core hyperparameters and network architecture choices. Furthermore, non-stationarities such as evolving training data and increased agent complexity mean that different hyperparameters and architectures may be optimal at different points of training. This motivates AutoRL, a class of methods seeking to automate these design choices. One prominent class of AutoRL methods is Population-Based Training (PBT), which has led to impressive performance in several large-scale settings. In this paper, we introduce two new innovations in PBT-style methods. First, we employ trust-region-based Bayesian optimization, enabling full coverage of the high-dimensional mixed hyperparameter search space. Second, we show that, using a generational approach, we can also learn architectures and hyperparameters jointly on the fly in a single training run. Leveraging the new highly parallelizable Brax physics engine, we show that these innovations lead to large performance gains, significantly outperforming the tuned baseline while learning entire configurations on the fly. Code is available at https://github.com/xingchenwan/bgpbt.
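
As a rough sketch, the difference from vanilla PBT sits in the explore step: instead of randomly perturbing a copied parent's hyperparameters, a BO model fit to the population's observed (configuration, score) pairs proposes the next configuration. Here bo_suggest is a placeholder for a trust-region BO model over the mixed search space, and the generational mechanism for jointly evolving architectures is omitted.

```python
import copy
import random

def pbt_step(population, bo_suggest, quantile=0.25):
    """One exploit/explore round over a population of dicts with
    keys "agent", "hparams", and "score"."""
    population.sort(key=lambda m: m["score"], reverse=True)
    k = max(1, int(len(population) * quantile))
    top, bottom = population[:k], population[-k:]
    observations = [(m["hparams"], m["score"]) for m in population]
    for member in bottom:
        parent = random.choice(top)
        member["agent"] = copy.deepcopy(parent["agent"])  # exploit: copy weights
        member["hparams"] = bo_suggest(observations)      # explore: BO proposal
    return population
```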

* AutoML Conference 2022. 10 pages, 4 figures, 3 tables (28 pages, 10 figures, 7 tables including references and appendices) 

Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations

Jun 09, 2022
Cong Lu, Philip J. Ball, Tim G. J. Rudner, Jack Parker-Holder, Michael A. Osborne, Yee Whye Teh

Offline reinforcement learning has shown great promise in leveraging large pre-collected datasets for policy learning, allowing agents to forgo often-expensive online data collection. However, to date, offline reinforcement learning from visual observations has been relatively under-explored, and there is a lack of understanding of where the remaining challenges lie. In this paper, we seek to establish simple baselines for continuous control in the visual domain. We show that simple modifications to two state-of-the-art vision-based online reinforcement learning algorithms, DreamerV2 and DrQ-v2, suffice to outperform prior work and establish a competitive baseline. We rigorously evaluate these algorithms on both existing offline datasets and a new testbed for offline reinforcement learning from visual observations that better represents the data distributions present in real-world offline RL problems, and open-source our code and data to facilitate progress in this important domain. Finally, we present and analyze several key desiderata unique to offline RL from visual observations, including visual distractions and visually identifiable changes in dynamics.
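
The offline protocol itself reduces to a simple loop, sketched below with placeholder agent/dataset interfaces: every update draws from a fixed, pre-collected dataset and no environment interaction occurs. The specific modifications the paper makes to DreamerV2 and DrQ-v2 are not reproduced here.

```python
def train_offline(agent, dataset, steps=100_000, batch_size=256):
    """Train purely from logged data; note there is no env.step() anywhere."""
    for step in range(steps):
        batch = dataset.sample(batch_size)  # (obs, action, reward, next_obs, done)
        metrics = agent.update(batch)       # one gradient step on the batch
        if step % 10_000 == 0:
            print(step, metrics)
    return agent
```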

Revisiting Design Choices in Model-Based Offline Reinforcement Learning

Oct 08, 2021
Cong Lu, Philip J. Ball, Jack Parker-Holder, Michael A. Osborne, Stephen J. Roberts

Offline reinforcement learning enables agents to leverage large pre-collected datasets of environment transitions to learn control policies, circumventing the need for potentially expensive or unsafe online data collection. Significant progress has been made recently in offline model-based reinforcement learning, a family of approaches that leverage a learned dynamics model. This typically involves constructing a probabilistic model and using the model uncertainty to penalize rewards where there is insufficient data, solving for a pessimistic MDP that lower-bounds the true MDP. Existing methods, however, exhibit a breakdown between theory and practice: the pessimistic return ought to be bounded by the total variation distance of the model from the true dynamics, but it is instead implemented through a penalty based on estimated model uncertainty. This has spawned a variety of uncertainty heuristics, with little to no comparison between differing approaches. In this paper, we compare these heuristics and design novel protocols to investigate their interaction with other hyperparameters, such as the number of models or the imagined rollout horizon. Using these insights, we show that selecting these key hyperparameters using Bayesian optimization produces superior configurations that are vastly different from those currently used in existing hand-tuned state-of-the-art methods, and results in drastically stronger performance.
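
The mechanism under comparison is easy to state in code. Below is one heuristic from the family the paper benchmarks: disagreement across an ensemble of learned dynamics models (here, the max over state dimensions of the std of the ensemble's mean predictions) used as a reward penalty. The ensemble interface and the penalty weight lam are illustrative.

```python
import torch

def penalized_reward(ensemble, reward_fn, s, a, lam=1.0):
    """r_pen(s, a) = r(s, a) - lam * u(s, a) for one uncertainty heuristic u."""
    # Each model in the ensemble predicts a next-state mean.
    preds = torch.stack([m(s, a) for m in ensemble])   # [n_models, batch, s_dim]
    # Heuristic uncertainty: max over state dims of std across models.
    uncertainty = preds.std(dim=0).max(dim=-1).values  # [batch]
    return reward_fn(s, a) - lam * uncertainty
```

Swapping in a different uncertainty heuristic, ensemble size, or rollout horizon changes only this function and a few hyperparameters, which is what makes the design space amenable to Bayesian optimization.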

* Spotlight @ RL4RealLife Workshop, ICML 2021 

Augmented World Models Facilitate Zero-Shot Dynamics Generalization From a Single Offline Environment

Apr 12, 2021
Philip J. Ball, Cong Lu, Jack Parker-Holder, Stephen Roberts

Reinforcement learning from large-scale offline datasets provides us with the ability to learn policies without potentially unsafe or impractical exploration. Significant progress has been made in the past few years in dealing with the challenge of correcting for differing behavior between the data collection and learned policies. However, little attention has been paid to potentially changing dynamics when transferring a policy to the online setting, where the performance of existing methods can drop by up to 90%. In this paper, we address this problem with Augmented World Models (AugWM). We augment a learned dynamics model with simple transformations that seek to capture potential changes in the physical properties of the robot, leading to more robust policies. We not only train our policy in this new setting, but also provide it with the sampled augmentation as a context, allowing it to adapt to changes in the environment. At test time, we learn the context in a self-supervised fashion by approximating the augmentation that corresponds to the new environment. We rigorously evaluate our approach on over 100 different changed dynamics settings, and show that this simple approach can significantly improve the zero-shot generalization of a recent state-of-the-art baseline, often achieving successful policies where the baseline fails.
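
A minimal sketch of the idea, assuming a scale-and-shift form for the augmentation (the ranges here are illustrative, not the paper's): each imagined rollout samples one augmentation, applies it to the model's predicted state change, and passes the sampled parameters to the policy as a context vector.

```python
import torch

def sample_augmentation(s_dim):
    """Random scale-and-shift standing in for changed physical properties."""
    scale = 1.0 + 0.1 * (2 * torch.rand(s_dim) - 1)   # e.g. +-10% scaling
    shift = 0.01 * (2 * torch.rand(s_dim) - 1)        # small additive offset
    return scale, shift

def augmented_rollout(dynamics, policy, s0, horizon=10):
    scale, shift = sample_augmentation(s0.shape[-1])  # fixed for this rollout
    context = torch.cat([scale, shift])
    s, traj = s0, []
    for _ in range(horizon):
        a = policy(s, context)         # policy sees the augmentation as context
        delta = dynamics(s, a) - s     # model-predicted change in state
        s = s + scale * delta + shift  # augmented next state
        traj.append((s, a))
    return traj
```

At test time, the same policy can then adapt by estimating the context that best explains the observed transitions, which is the self-supervised step described above.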

* To be presented as a Spotlight at the "Self-Supervision for Reinforcement Learning Workshop" @ ICLR 2021 

Think Global and Act Local: Bayesian Optimisation over High-Dimensional Categorical and Mixed Search Spaces

Feb 14, 2021
Xingchen Wan, Vu Nguyen, Huong Ha, Binxin Ru, Cong Lu, Michael A. Osborne

High-dimensional black-box optimisation remains an important yet notoriously challenging problem. Despite the success of Bayesian optimisation methods on continuous domains, domains that are categorical, or that mix continuous and categorical variables, remain challenging. We propose a novel solution -- we combine local optimisation with a tailored kernel design, effectively handling high-dimensional categorical and mixed search spaces whilst retaining sample efficiency. We further derive a convergence guarantee for the proposed approach. Finally, we demonstrate empirically that our method outperforms the current baselines on a variety of synthetic and real-world tasks in terms of performance, computational costs, or both.
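
To make the kernel design concrete, here is a hedged sketch of a mixed-space kernel: a continuous kernel on the numeric dimensions combined with a categorical overlap kernel, mixed by a weighted sum and product. The RBF choice, lengthscale, and mixing weight are illustrative assumptions; the paper pairs such a kernel with trust-region local search.

```python
import numpy as np

def overlap_kernel(h1, h2):
    """Categorical similarity: fraction of matching category choices."""
    return np.mean(np.array(h1) == np.array(h2))

def rbf_kernel(x1, x2, lengthscale=1.0):
    """Standard RBF kernel on the continuous dimensions."""
    d2 = np.sum((np.asarray(x1) - np.asarray(x2)) ** 2)
    return np.exp(-0.5 * d2 / lengthscale**2)

def mixed_kernel(z1, z2, lam=0.5):
    """z = (x_continuous, h_categorical); mix by weighted sum and product."""
    kx = rbf_kernel(z1[0], z2[0])
    kh = overlap_kernel(z1[1], z2[1])
    return (1 - lam) * (kx + kh) / 2 + lam * kx * kh
```

The product term captures interactions between continuous and categorical choices, while the sum term keeps the kernel informative when only one part of the input matches.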

* 9 pages, 6 figures (26 pages, 13 figures, 2 tables including references and appendices) 