Ilya Kostrikov

Training Diffusion Models with Reinforcement Learning

May 23, 2023
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, Sergey Levine

Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve prompt-image alignment using feedback from a vision-language model without the need for additional data collection or human annotation.
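
The core recipe — treating each denoising step as an action in a multi-step MDP and reweighting its log-probability with a terminal reward — can be sketched as follows. This is a minimal REINFORCE-style illustration rather than the authors' implementation; model.sample_step, reward_fn, and the latent shapes are hypothetical stand-ins.

    import torch

    def ddpo_update(model, optimizer, prompts, reward_fn, num_steps=50):
        # start from pure-noise latents x_T (the shape here is a placeholder)
        x = torch.randn(len(prompts), 4, 64, 64)
        log_probs = []
        for t in reversed(range(num_steps)):
            # each reverse-diffusion step is a stochastic "action"; keep the
            # log-probability of the sampled next latent under the model
            x, log_prob = model.sample_step(x, t, prompts)   # hypothetical API
            log_probs.append(log_prob)                       # shape: (batch,)
        rewards = reward_fn(x, prompts)                      # e.g. aesthetic score, shape (batch,)
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        # policy-gradient loss: increase log-probs of high-reward denoising trajectories
        loss = -(torch.stack(log_probs).sum(dim=0) * advantages).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()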

* 20 pages, 12 figures 

IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

Apr 20, 2023
Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, Sergey Levine

Effective offline RL methods require properly handling out-of-distribution actions. Implicit Q-learning (IQL) addresses this by training a Q-function using only dataset actions through a modified Bellman backup. However, it is unclear which policy actually attains the values represented by this implicitly trained Q-function. In this paper, we reinterpret IQL as an actor-critic method by generalizing the critic objective and connecting it to a behavior-regularized implicit actor. This generalization shows how the induced actor balances reward maximization and divergence from the behavior policy, with the specific loss choice determining the nature of this tradeoff. Notably, this actor can exhibit complex and multimodal characteristics, suggesting issues with the conditional Gaussian actor fit with advantage-weighted regression (AWR) used in prior methods. Instead, we propose using samples from a diffusion-parameterized behavior policy, together with weights computed from the critic, to importance sample our intended policy. We introduce Implicit Diffusion Q-learning (IDQL), combining our general IQL critic with this policy extraction method. IDQL maintains the ease of implementation of IQL while outperforming prior offline RL methods and demonstrating robustness to hyperparameters. Code is available at https://github.com/philippe-eecs/IDQL.
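
A minimal sketch of the extraction step described above, under assumed interfaces: sample candidate actions from a diffusion-parameterized behavior model, weight them with the learned critic, and resample. The softmax-advantage weighting here is only one illustrative choice (the paper's weights depend on the chosen critic loss), and behavior_model.sample, q_net, and v_net are stand-ins.

    import torch

    def extract_action(behavior_model, q_net, v_net, state, num_candidates=32, temperature=1.0):
        # `state` has shape (1, state_dim); repeat it for N candidate actions
        states = state.repeat(num_candidates, 1)
        actions = behavior_model.sample(states)              # hypothetical diffusion sampler
        with torch.no_grad():
            adv = q_net(states, actions) - v_net(states)     # critic-derived advantage, (N, 1)
        weights = torch.softmax(adv.squeeze(-1) / temperature, dim=0)
        idx = torch.multinomial(weights, num_samples=1)      # resample in proportion to the weights
        return actions[idx]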

* 11 pages, 6 figures, 3 tables 

Efficient Deep Reinforcement Learning Requires Regulating Overfitting

Apr 20, 2023
Qiyang Li, Aviral Kumar, Ilya Kostrikov, Sergey Levine

Deep reinforcement learning algorithms that learn policies by trial and error must learn from limited amounts of data collected by actively interacting with the environment. While many prior works have shown that proper regularization techniques are crucial for enabling data-efficient RL, a general understanding of the bottlenecks in data-efficient RL has remained elusive. Consequently, it has been difficult to devise a universal technique that works well across all domains. In this paper, we attempt to understand the primary bottleneck in sample-efficient deep RL by examining several potential hypotheses such as non-stationarity, excessive action distribution shift, and overfitting. We perform a thorough empirical analysis on state-based DeepMind Control Suite (DMC) tasks in a controlled and systematic way to show that high temporal-difference (TD) error on a validation set of transitions is the main culprit that severely affects the performance of deep RL algorithms, and that prior methods which lead to good performance do, in fact, keep the validation TD error low. This observation gives us a robust principle for making deep RL efficient: we can hill-climb on the validation TD error using any form of regularization technique from supervised learning. We show that a simple online model selection method that targets the validation TD error is effective across state-based DMC and Gym tasks.
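
The diagnostic and the model-selection rule can be sketched as follows, assuming standard actor-critic components; the networks, candidate runs, and held-out transition buffer are hypothetical stand-ins rather than the paper's code.

    import torch

    def validation_td_error(q_net, target_q_net, policy, val_batch, gamma=0.99):
        # val_batch holds held-out transitions that are never used for training
        s, a, r, s_next, done = val_batch
        with torch.no_grad():
            a_next = policy(s_next)
            target = r + gamma * (1.0 - done) * target_q_net(s_next, a_next).squeeze(-1)
            td_error = (q_net(s, a).squeeze(-1) - target).pow(2).mean()
        return td_error.item()

    def select_candidate(candidates, val_batch):
        # simple online model selection: prefer the candidate (e.g. a different
        # regularization strength) with the lowest validation TD error
        errors = [validation_td_error(c.q_net, c.target_q_net, c.policy, val_batch)
                  for c in candidates]
        return candidates[errors.index(min(errors))]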

* 26 pages, 18 figures, 3 tables; International Conference on Learning Representations (ICLR) 2023 

FastRLAP: A System for Learning High-Speed Driving via Deep RL and Autonomous Practicing

Apr 19, 2023
Kyle Stachowicz, Dhruv Shah, Arjun Bhorkar, Ilya Kostrikov, Sergey Levine

We present a system that enables an autonomous small-scale RC car to drive aggressively from visual observations using reinforcement learning (RL). Our system, FastRLAP (faster lap), trains autonomously in the real world, without human intervention, and without requiring any simulation or expert demonstrations. Our system integrates a number of important components to make this possible: we initialize the representations for the RL policy and value function from a large prior dataset of other robots navigating in other environments (at low speed), which provides a navigation-relevant representation. From here, a sample-efficient online RL method uses a single low-speed user-provided demonstration to determine the desired driving course, extracts a set of navigational checkpoints, and autonomously practices driving through these checkpoints, resetting automatically on collision or failure. Perhaps surprisingly, we find that with appropriate initialization and choice of algorithm, our system can learn to drive over a variety of racing courses with less than 20 minutes of online training. The resulting policies exhibit emergent aggressive driving skills, such as timing braking and acceleration around turns and avoiding areas which impede the robot's motion, approaching the performance of a human driver using a similar first-person interface over the course of training.
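
As a rough illustration of the autonomous practicing loop described above (not the authors' implementation), the sketch below rewards the car for reaching the next demonstrated checkpoint and ends the episode on collision; the environment interface, radius, and reward values are assumptions.

    import numpy as np

    def checkpoint_reward(position, checkpoints, current_idx, reach_radius=1.0):
        # sparse reward for reaching the next checkpoint extracted from the
        # low-speed user demonstration; loop around the course when done
        if np.linalg.norm(position - checkpoints[current_idx]) < reach_radius:
            return 1.0, (current_idx + 1) % len(checkpoints)
        return 0.0, current_idx

    def practice_episode(env, policy, checkpoints, max_steps=1000):
        obs, position = env.reset()                 # hypothetical env interface
        idx, total = 0, 0.0
        for _ in range(max_steps):
            obs, position, collided = env.step(policy(obs))
            reward, idx = checkpoint_reward(position, checkpoints, idx)
            total += reward
            if collided:                            # automatic reset on collision / failure
                break
        return total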

Efficient Online Reinforcement Learning with Offline Data

Feb 15, 2023
Philip J. Ball, Laura Smith, Ilya Kostrikov, Sergey Levine

Sample efficiency and exploration remain major challenges in online reinforcement learning (RL). A powerful approach that can be applied to address these issues is the inclusion of offline data, such as prior trajectories from a human expert or a sub-optimal exploration policy. Previous methods have relied on extensive modifications and additional complexity to ensure the effective use of this data. Instead, we ask: can we simply apply existing off-policy methods to leverage offline data when learning online? In this work, we demonstrate that the answer is yes; however, a set of minimal but important changes to existing off-policy RL algorithms is required to achieve reliable performance. We extensively ablate these design choices, demonstrating the key factors that most affect performance, and arrive at a set of recommendations that practitioners can readily apply, whether their data comprise a small number of expert demonstrations or large volumes of sub-optimal trajectories. We find that correct application of these simple recommendations provides a 2.5× improvement over existing approaches across a diverse set of competitive benchmarks, with no additional computational overhead.
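
The abstract does not enumerate the specific recommendations, so the sketch below only illustrates one natural way to let an existing off-policy method leverage offline data: fill each training batch half from the offline dataset and half from the online replay buffer. Treat it as an assumed example rather than the paper's prescription.

    import random

    def sample_mixed_batch(offline_data, online_buffer, batch_size=256):
        # draw half of every batch from the prior (offline) data and half from
        # the online replay buffer, then train the off-policy algorithm as usual
        half = batch_size // 2
        batch = random.sample(offline_data, half) + random.sample(online_buffer, half)
        random.shuffle(batch)
        return batch    # list of (s, a, r, s_next, done) tuples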

Offline Reinforcement Learning for Visual Navigation

Dec 16, 2022
Dhruv Shah, Arjun Bhorkar, Hrish Leen, Ilya Kostrikov, Nick Rhinehart, Sergey Levine

Reinforcement learning can enable robots to navigate to distant goals while optimizing user-specified reward functions, including preferences for following lanes, staying on paved paths, or avoiding freshly mowed grass. However, online learning from trial and error on real-world robots is logistically challenging, and methods that can instead utilize existing datasets of robotic navigation data could be significantly more scalable and enable broader generalization. In this paper, we present ReViND, the first offline RL system for robotic navigation that can leverage previously collected data to optimize user-specified reward functions in the real world. We evaluate our system for off-road navigation without any additional data collection or fine-tuning, and show that it can navigate to distant goals using only offline training on previously collected data, and that its behavior differs qualitatively depending on the user-specified reward function.
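
As a toy illustration of a user-specified reward of the kind mentioned above (not ReViND's actual reward), one might combine a sparse goal-reaching term with terrain preferences; the labels and weights below are made up for the example.

    def navigation_reward(reached_goal, terrain_label, weights=None):
        # example preference weights; these values are illustrative only
        weights = weights or {"paved": 0.0, "grass": -0.5, "lane_violation": -1.0}
        reward = 1.0 if reached_goal else 0.0       # sparse goal-reaching term
        reward += weights.get(terrain_label, 0.0)   # user-specified terrain preference
        return reward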

* Project page https://sites.google.com/view/revind/home 

A Walk in the Park: Learning to Walk in 20 Minutes With Model-Free Reinforcement Learning

Aug 16, 2022
Laura Smith, Ilya Kostrikov, Sergey Levine

Deep reinforcement learning is a promising approach to learning policies in uncontrolled environments without requiring domain knowledge. Unfortunately, due to sample inefficiency, deep RL applications have primarily focused on simulated environments. In this work, we demonstrate that recent advancements in machine learning algorithms and libraries, combined with a carefully tuned robot controller, enable learning quadruped locomotion in only 20 minutes in the real world. We evaluate our approach on several indoor and outdoor terrains which are known to be challenging for classical model-based controllers, and observe that the robot consistently learns a walking gait on all of them. Finally, we evaluate our design decisions in a simulated environment.

* First two authors contributed equally. Project website: https://sites.google.com/berkeley.edu/walk-in-the-park 

In Defense of the Unitary Scalarization for Deep Multi-Task Learning

Jan 20, 2022
Vitaly Kurin, Alessandro De Palma, Ilya Kostrikov, Shimon Whiteson, M. Pawan Kumar

Recent multi-task learning research argues against unitary scalarization, where training simply minimizes the sum of the task losses. Several ad-hoc multi-task optimization algorithms have instead been proposed, inspired by various hypotheses about what makes multi-task settings difficult. The majority of these optimizers require per-task gradients, and introduce significant memory, runtime, and implementation overhead. We present a theoretical analysis suggesting that many specialized multi-task optimizers can be interpreted as forms of regularization. Moreover, we show that, when coupled with standard regularization and stabilization techniques from single-task learning, unitary scalarization matches or improves upon the performance of complex multi-task optimizers in both supervised and reinforcement learning settings. We believe our results call for a critical reevaluation of recent research in the area.
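
Unitary scalarization itself is simple to state in code: a single optimizer minimizes the plain sum of per-task losses, with standard single-task regularization (e.g., weight decay) applied through the optimizer. The sketch below assumes placeholder models and loss functions.

    import torch

    def multitask_step(model, optimizer, batches, loss_fns):
        # one optimizer, one backward pass: minimize the plain sum of task losses
        optimizer.zero_grad()
        total_loss = sum(loss_fn(model(x), y) for (x, y), loss_fn in zip(batches, loss_fns))
        total_loss.backward()                       # no per-task gradient surgery
        optimizer.step()
        return total_loss.item()

    # standard single-task regularization, e.g. weight decay, is applied as usual:
    # optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4)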
