Doina Precup

Policy composition in reinforcement learning via multi-objective policy optimization

Aug 30, 2023
Shruti Mishra, Ankit Anand, Jordan Hoffmann, Nicolas Heess, Martin Riedmiller, Abbas Abdolmaleki, Doina Precup

We enable reinforcement learning agents to learn successful behavior policies by utilizing relevant pre-existing teacher policies. The teacher policies are introduced as objectives, in addition to the task objective, in a multi-objective policy optimization setting. Using the Multi-Objective Maximum a Posteriori Policy Optimization algorithm (Abdolmaleki et al. 2020), we show that teacher policies can help speed up learning, particularly in the absence of shaping rewards. In two domains with continuous observation and action spaces, our agents successfully compose teacher policies in sequence and in parallel, and are also able to further extend the policies of the teachers in order to solve the task. Depending on the specified combination of task and teacher(s), the teacher(s) may naturally limit the final performance of an agent. The extent to which agents are required to adhere to teacher policies is determined by hyperparameters that govern both the effect of teachers on learning speed and the eventual performance of the agent on the task. In the humanoid domain (Tassa et al. 2018), we also equip agents with the ability to control the selection of teachers. With this ability, agents are able to meaningfully compose the teacher policies to achieve a higher task reward on the walk task than in cases without access to the teacher policies. We show the resemblance of the composed task policies to the corresponding teacher policies through videos.
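To make the idea of teachers as additional objectives concrete, the sketch below adds, for each teacher, a sample-based KL term that pulls the learner towards that teacher, weighted by a per-teacher hyperparameter. This is a simplified surrogate written for illustration, not the MO-MPO update used in the paper; all names and the loss form are assumptions.

import numpy as np

def composed_policy_loss(logp_learner, task_advantage, teacher_logps, teacher_weights):
    """Sketch: one task objective plus one KL-style objective per teacher.

    logp_learner    : log pi(a|s) for actions sampled from the learner, shape (B,)
    task_advantage  : task advantage estimates for those actions, shape (B,)
    teacher_logps   : list of arrays, log pi_k(a|s) under teacher k, each shape (B,)
    teacher_weights : per-teacher trade-off weights (larger = stricter adherence)
    """
    # Task objective: a standard advantage-weighted surrogate.
    task_term = -np.mean(task_advantage * logp_learner)
    # Teacher objectives: E_{a~pi}[log pi(a|s) - log pi_k(a|s)] estimates KL(pi || pi_k).
    teacher_terms = [
        w * np.mean(logp_learner - logp_k)
        for w, logp_k in zip(teacher_weights, teacher_logps)
    ]
    return task_term + sum(teacher_terms)

The per-teacher weights play the role of the adherence hyperparameters mentioned above: small values let the agent depart from a teacher once it stops being useful, while large values keep the agent close to that teacher at the possible cost of final task performance.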

A Definition of Continual Reinforcement Learning

Jul 20, 2023
David Abel, André Barreto, Benjamin Van Roy, Doina Precup, Hado van Hasselt, Satinder Singh

In this paper we develop a foundation for continual reinforcement learning.

On the Convergence of Bounded Agents

Jul 20, 2023
David Abel, André Barreto, Hado van Hasselt, Benjamin Van Roy, Doina Precup, Satinder Singh

When has an agent converged? Standard models of the reinforcement learning problem give rise to a straightforward definition of convergence: An agent converges when its behavior or performance in each environment state stops changing. However, as we shift the focus of our learning problem from the environment's state to the agent's state, the concept of an agent's convergence becomes significantly less clear. In this paper, we propose two complementary accounts of agent convergence in a framing of the reinforcement learning problem that centers around bounded agents. The first view says that a bounded agent has converged when the minimal number of states needed to describe the agent's future behavior cannot decrease. The second view says that a bounded agent has converged just when the agent's performance only changes if the agent's internal state changes. We establish basic properties of these two definitions, show that they accommodate typical views of convergence in standard settings, and prove several facts about their nature and relationship. We take these perspectives, definitions, and analysis to bring clarity to a central idea of the field.
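Purely as an illustration of the second view (the notation here is assumed, not taken from the paper): writing $z_t$ for the agent's internal state and $v_t$ for its performance at time $t$, one hedged reading is that the agent has converged by time $T$ if, for all $t \ge T$, $z_{t+1} = z_t$ implies $v_{t+1} = v_t$, i.e. performance can only move when the internal state does.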

An Empirical Study of the Effectiveness of Using a Replay Buffer on Mode Discovery in GFlowNets

Jul 18, 2023
Nikhil Vemgal, Elaine Lau, Doina Precup

Reinforcement Learning (RL) algorithms aim to learn an optimal policy by iteratively sampling actions to learn how to maximize the total expected return, $R(x)$. GFlowNets are a special class of algorithms designed to generate diverse candidates, $x$, from a discrete set, by learning a policy that samples candidates approximately in proportion to $R(x)$. GFlowNets exhibit improved mode discovery compared to conventional RL algorithms, which is very useful for applications such as drug discovery and combinatorial search. However, since GFlowNets are a relatively recent class of algorithms, many techniques that are useful in RL have not yet been associated with them. In this paper, we study the utilization of a replay buffer for GFlowNets. We empirically explore various replay buffer sampling techniques and assess their impact on the speed of mode discovery and the quality of the modes discovered. Our experimental results in the Hypergrid toy domain and a molecule synthesis environment demonstrate significant improvements in mode discovery when training with a replay buffer, compared to training only with trajectories generated on-policy.
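A minimal sketch of the kind of buffer being studied is given below: complete trajectories are stored after they terminate, and each training batch mixes freshly generated on-policy trajectories with replayed ones. The class and parameter names are assumptions, and the prioritized sampling variants compared in the paper are not reproduced here.

import random
from collections import deque

class TrajectoryReplayBuffer:
    """Sketch of a replay buffer over complete GFlowNet trajectories."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, trajectory, reward):
        # A trajectory is the action sequence that built a terminal object x with reward R(x).
        self.buffer.append((trajectory, reward))

    def sample(self, batch_size):
        # Uniform sampling; reward-prioritized sampling is one natural variant to compare.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def training_batch(on_policy_trajs, buffer, batch_size=16, replay_fraction=0.5):
    # Mix on-policy trajectories with replayed ones in a fixed ratio.
    n_replay = int(batch_size * replay_fraction)
    replayed = [traj for traj, _ in buffer.sample(n_replay)]
    return on_policy_trajs[: batch_size - n_replay] + replayed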

* Accepted to ICML 2023 workshop on Structured Probabilistic Inference & Generative Modeling 

Optimism and Adaptivity in Policy Optimization

Jun 18, 2023
Veronica Chelu, Tom Zahavy, Arthur Guez, Doina Precup, Sebastian Flennerhag

We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) through optimism and adaptivity. Leveraging the deep connection between policy iteration and policy gradient methods, we recast seemingly unrelated policy optimization algorithms as the repeated application of two interleaving steps: (i) an optimistic policy improvement operator maps a prior policy $\pi_t$ to a hypothesis $\pi_{t+1}$ using a gradient ascent prediction, followed by (ii) a hindsight adaptation of the optimistic prediction based on a partial evaluation of the performance of $\pi_{t+1}$. We use this shared lens to jointly express other well-known algorithms, including soft and optimistic policy iteration, natural actor-critic methods, model-based policy improvement based on forward search, and meta-learning algorithms. By doing so, we shed light on collective theoretical properties related to acceleration via optimism and adaptivity. Building on these insights, we design an adaptive and optimistic policy gradient algorithm via meta-gradient learning, and empirically highlight several design choices pertaining to optimism, in an illustrative task.
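The "optimistic prediction, then hindsight adaptation" pattern can be illustrated with a generic optimistic gradient ascent loop on a toy objective: the last gradient seen is used as a prediction to form a hypothesis, and the hypothesis is then corrected once its gradient has actually been evaluated. This is an analogy chosen for illustration, not one of the algorithms analyzed in the paper.

import numpy as np

def grad_J(theta):
    # Toy concave performance J(theta) = -||theta - 1||^2, whose gradient is -2 (theta - 1).
    return -2.0 * (theta - np.ones_like(theta))

theta = np.zeros(3)
eta = 0.1
g_prev = np.zeros_like(theta)

for t in range(100):
    # (i) Optimistic improvement: form a hypothesis using the previous gradient as a prediction.
    theta_hyp = theta + eta * g_prev
    # (ii) Hindsight adaptation: evaluate the hypothesis and update with its actual gradient.
    g_prev = grad_J(theta_hyp)
    theta = theta + eta * g_prev

print(theta)  # approaches the optimum at [1, 1, 1]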

For SALE: State-Action Representation Learning for Deep Reinforcement Learning

Jun 04, 2023
Scott Fujimoto, Wei-Di Chang, Edward J. Smith, Shixiang Shane Gu, Doina Precup, David Meger

In the field of reinforcement learning (RL), representation learning is a proven tool for complex image-based tasks, but is often overlooked for environments with low-level states, such as physical control problems. This paper introduces SALE, a novel approach for learning embeddings that model the nuanced interaction between state and action, enabling effective representation learning from low-level states. We extensively study the design space of these embeddings and highlight important design considerations. We integrate SALE and an adaptation of checkpoints for RL into TD3 to form the TD7 algorithm, which significantly outperforms existing continuous control algorithms. On OpenAI gym benchmark tasks, TD7 has an average performance gain of 276.7% and 50.7% over TD3 at 300k and 5M time steps, respectively, and works in both the online and offline settings.
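One natural reading of "embeddings that model the interaction between state and action" is a pair of encoders in which a state-action embedding is trained to predict the embedding of the next state. The sketch below illustrates that reading; the architecture, loss, and names are assumptions and do not reproduce the exact SALE objective or the TD7 algorithm.

import torch
import torch.nn as nn

class StateActionEmbedding(nn.Module):
    """Sketch: joint state and state-action encoders for low-level (non-image) inputs."""

    def __init__(self, state_dim, action_dim, embed_dim=256):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ELU(),
                               nn.Linear(embed_dim, embed_dim))
        self.g = nn.Sequential(nn.Linear(embed_dim + action_dim, embed_dim), nn.ELU(),
                               nn.Linear(embed_dim, embed_dim))

    def forward(self, state, action):
        zs = self.f(state)                             # state embedding
        zsa = self.g(torch.cat([zs, action], dim=-1))  # state-action embedding
        return zs, zsa

def embedding_loss(model, state, action, next_state):
    # Train the state-action embedding to predict the (detached) next-state embedding,
    # so the representation captures how actions transform low-level states.
    _, zsa = model(state, action)
    with torch.no_grad():
        zs_next = model.f(next_state)
    return ((zsa - zs_next) ** 2).mean()

One way to use such embeddings would be as extra inputs to the actor and critic alongside the raw state and action; the paper's exact integration into TD3, and the checkpointing adaptation, are not shown here.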

Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

May 29, 2023
Haque Ishfaq, Qingfeng Lan, Pan Xu, A. Rupam Mahmood, Doina Precup, Anima Anandkumar, Kamyar Azizzadenesheli

We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of $\tilde{O}(d^{3/2}H^{5/2}\sqrt{T})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $T$ is the total number of steps. We apply this approach to deep RL by using the Adam optimizer to perform gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
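The "noisy gradient descent updates" can be illustrated with a generic stochastic gradient Langevin dynamics (SGLD) step applied to the Q-network parameters: an ordinary gradient step on the TD loss plus properly scaled Gaussian noise, so that repeated updates behave like approximate posterior samples. The step size and temperature below are assumed hyperparameters for the sketch, not the paper's settings.

import math
import torch

def sgld_step(q_network, td_loss, step_size=1e-3, temperature=1e-4):
    """One Langevin Monte Carlo update on the Q-network parameters."""
    q_network.zero_grad()
    td_loss.backward()
    with torch.no_grad():
        for p in q_network.parameters():
            if p.grad is None:
                continue
            noise = torch.randn_like(p) * math.sqrt(2.0 * step_size * temperature)
            # Gradient descent on the loss plus injected Gaussian noise.
            p.add_(-step_size * p.grad + noise)

Acting greedily with respect to the Q-function given by the current (noisy) parameters then yields Thompson-sampling-style exploration.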

Policy Gradient Methods in the Presence of Symmetries and State Abstractions

May 09, 2023
Prakash Panangaden, Sahand Rezaei-Shoshtari, Rosie Zhao, David Meger, Doina Precup

Reinforcement learning on high-dimensional and complex problems relies on abstraction for improved efficiency and generalization. In this paper, we study abstraction in the continuous-control setting, and extend the definition of MDP homomorphisms to the setting of continuous state and action spaces. We derive a policy gradient theorem on the abstract MDP for both stochastic and deterministic policies. Our policy gradient results allow for leveraging approximate symmetries of the environment for policy optimization. Based on these theorems, we propose a family of actor-critic algorithms that are able to learn the policy and the MDP homomorphism map simultaneously, using the lax bisimulation metric. Finally, we introduce a series of environments with continuous symmetries to further demonstrate the ability of our algorithm for action abstraction in the presence of such symmetries. We demonstrate the effectiveness of our method on our environments, as well as on challenging visual control tasks from the DeepMind Control Suite. Our method's ability to utilize MDP homomorphisms for representation learning leads to improved performance, and the visualizations of the latent space clearly demonstrate the structure of the learned abstraction.
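As a heavily simplified illustration of learning the policy and an abstraction map simultaneously, the sketch below conditions a Gaussian actor on a learned encoding of the state and trains both with an ordinary likelihood-ratio policy gradient. The abstract policy gradient theorems and the lax bisimulation metric loss from the paper are not reproduced; all names are assumptions.

import torch
import torch.nn as nn

class AbstractActor(nn.Module):
    """Sketch: an encoder maps the raw state to an abstract state, and the policy acts on it."""

    def __init__(self, state_dim, abstract_dim, action_dim):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(state_dim, abstract_dim), nn.Tanh())  # learned abstraction map
        self.mean = nn.Linear(abstract_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        z = self.phi(state)  # abstract state
        return torch.distributions.Normal(self.mean(z), self.log_std.exp())

def policy_gradient_loss(actor, states, actions, advantages):
    # Likelihood-ratio surrogate evaluated through the abstraction; one backward pass
    # updates the policy head and the encoder together.
    dist = actor(states)
    logp = dist.log_prob(actions).sum(dim=-1)
    return -(advantages.detach() * logp).mean()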

* arXiv admin note: substantial text overlap with arXiv:2209.07364 

MUDiff: Unified Diffusion for Complete Molecule Generation

Apr 28, 2023
Chenqing Hua, Sitao Luan, Minkai Xu, Rex Ying, Jie Fu, Stefano Ermon, Doina Precup

We present a new model for generating molecular data by combining discrete and continuous diffusion processes. Our model generates a comprehensive representation of molecules, including atom features, 2D discrete molecule structures, and 3D continuous molecule coordinates. The use of diffusion processes allows for capturing the probabilistic nature of molecular processes and the ability to explore the effect of different factors on molecular structures and properties. Additionally, we propose a novel graph transformer architecture to denoise the diffusion process. The transformer is equivariant to Euclidean transformations, allowing it to learn invariant atom and edge representations while preserving the equivariance of atom coordinates. This transformer can be used to learn molecular representations robust to geometric transformations. We evaluate the performance of our model through experiments and comparisons with existing methods, showing its ability to generate more stable and valid molecules with good properties. Our model is a promising approach for designing molecules with desired properties and can be applied to a wide range of tasks in molecular modeling.
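For intuition about combining discrete and continuous diffusion, the forward (noising) process can be pictured as Gaussian corruption of the 3D coordinates together with a uniform-flip categorical kernel on the discrete atom features. The sketch below is a generic one-step illustration of that combination, not the MUDiff formulation, and its parameters are assumptions.

import numpy as np

def forward_noise_step(coords, atom_types, num_types, beta=0.05, flip_prob=0.05, rng=None):
    """One illustrative forward-diffusion step for a molecule.

    coords     : (N, 3) continuous 3D atom coordinates
    atom_types : (N,)   discrete atom-type indices in [0, num_types)
    """
    rng = rng or np.random.default_rng()
    # Continuous part: DDPM-style Gaussian corruption of the coordinates.
    noisy_coords = np.sqrt(1.0 - beta) * coords + np.sqrt(beta) * rng.normal(size=coords.shape)
    # Discrete part: with small probability, resample each atom type uniformly at random.
    flip = rng.random(atom_types.shape) < flip_prob
    random_types = rng.integers(0, num_types, size=atom_types.shape)
    noisy_types = np.where(flip, random_types, atom_types)
    return noisy_coords, noisy_types

The reverse model (the equivariant graph transformer described above) would then be trained to undo both kinds of corruption jointly.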
