Doina Precup

A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation

Jun 12, 2021
Scott Fujimoto, David Meger, Doina Precup

Marginalized importance sampling (MIS), which measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution, is a promising approach for off-policy evaluation. However, current state-of-the-art MIS methods rely on complex optimization tricks and succeed mostly on simple toy problems. We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy. The successor representation can be trained through deep reinforcement learning methodology and decouples the reward optimization from the dynamics of the environment, making the resulting algorithm stable and applicable to high-dimensional domains. We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.

* ICML 2021
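
The link between the successor representation and the density ratio can be illustrated in a small tabular setting. The sketch below is not the paper's deep RL algorithm: it assumes a known transition matrix `P` over flattened state-action pairs under the target policy, a known sampling distribution `d_data`, and made-up toy numbers. All function names are hypothetical.

```python
def successor_representation(P, gamma, iters=500):
    """Tabular SR: Psi = sum_t gamma^t P^t, via the fixed point Psi = I + gamma * P @ Psi."""
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    psi = [row[:] for row in identity]
    for _ in range(iters):
        psi = [[identity[i][j] + gamma * sum(P[i][k] * psi[k][j] for k in range(n))
                for j in range(n)] for i in range(n)]
    return psi


def occupancy_from_sr(psi, p0, gamma):
    """Normalized discounted occupancy: d_pi = (1 - gamma) * p0^T Psi."""
    n = len(p0)
    return [(1.0 - gamma) * sum(p0[i] * psi[i][j] for i in range(n)) for j in range(n)]


def density_ratio(d_pi, d_data):
    """MIS weight for each index: target occupancy over sampling distribution."""
    return [dp / dd for dp, dd in zip(d_pi, d_data)]


# Toy problem (assumed values, for illustration only).
P = [[0.9, 0.1], [0.5, 0.5]]   # transitions under the target policy
p0 = [1.0, 0.0]                # initial distribution
d_data = [0.5, 0.5]            # sampling distribution of the dataset
reward = [0.0, 1.0]
gamma = 0.9

psi = successor_representation(P, gamma)
d_pi = occupancy_from_sr(psi, p0, gamma)
w = density_ratio(d_pi, d_data)

# Off-policy estimate of the normalized return: E_data[w * r].
mis_estimate = sum(dd * wi * r for dd, wi, r in zip(d_data, w, reward))
```

In this toy case the reweighted average of rewards under `d_data` matches the normalized discounted return of the target policy, which is the quantity off-policy evaluation is after.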

Preferential Temporal Difference Learning

Jun 11, 2021
Nishanth Anand, Doina Precup

Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, it may be interesting, when computing updates, to take into account other information than whether a state is visited or not. For example, some states might be more important than others (such as states which are frequently seen in a successful trajectory). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.

* Accepted at the 38th International Conference on Machine Learning (ICML, 2021)
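
One way to picture the re-weighting described in this abstract is a tabular, forward-view sketch in which a per-state preference `beta[s]` in [0, 1] scales both how strongly a state is updated and how much its value is trusted as a bootstrap target. This is an illustrative assumption about the form of the update, not a reproduction of the paper's algorithm; the function names and toy numbers are hypothetical.

```python
def preferential_returns(rewards, next_states, values, beta, gamma):
    """Targets that bootstrap from a successor in proportion to its preference.

    G_t = r_t + gamma * (beta[s'] * V(s') + (1 - beta[s']) * G_{t+1}),
    with G_t = r_t when the successor is terminal (None).
    """
    targets = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        s_next = next_states[t]
        if s_next is None:
            g = rewards[t]
        else:
            b = beta[s_next]
            g = rewards[t] + gamma * (b * values[s_next] + (1.0 - b) * g)
        targets[t] = g
    return targets


def preferential_update(states, rewards, next_states, values, beta, gamma, alpha):
    """Move each visited state toward its target, scaled by its own preference."""
    targets = preferential_returns(rewards, next_states, values, beta, gamma)
    new_values = dict(values)
    for s, g in zip(states, targets):
        new_values[s] += alpha * beta[s] * (g - values[s])
    return new_values


# Toy episode 0 -> 1 -> 2 -> terminal; state 1 has an unreliable estimate.
states = [0, 1, 2]
rewards = [0.0, 0.0, 1.0]
next_states = [1, 2, None]
values = {0: 0.0, 1: 5.0, 2: 0.5}
beta = {0: 1.0, 1: 0.0, 2: 1.0}   # zero preference: skip state 1 entirely

targets = preferential_returns(rewards, next_states, values, beta, gamma=1.0)
updated = preferential_update(states, rewards, next_states, values, beta,
                              gamma=1.0, alpha=0.5)
```

With `beta[1] = 0`, the target for state 0 bypasses the estimate stored at state 1 and bootstraps from state 2 instead, and state 1 itself is left unchanged.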

Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation

Jun 08, 2021
Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, Yoshua Bengio

This paper is about the problem of learning a stochastic policy for generating an object (like a molecular graph) from a sequence of actions, such that the probability of generating an object is proportional to a given positive reward for that object. Whereas standard return maximization tends to converge to a single return-maximizing sequence, there are cases where we would like to sample a diverse set of high-return solutions. These arise, for example, in black-box function optimization when few rounds are possible, each with large batches of queries, where the batches should be diverse, e.g., in the design of new molecules. One can also see this as a problem of approximately converting an energy function to a generative distribution. While MCMC methods can achieve that, they are expensive and generally only perform local exploration. Instead, training a generative policy amortizes the cost of search during training and yields fast generation. Using insights from Temporal Difference learning, we propose GFlowNet, based on a view of the generative process as a flow network, making it possible to handle the tricky case where different trajectories can yield the same final state, e.g., there are many ways to sequentially add atoms to generate some molecular graph. We cast the set of trajectories as a flow and convert the flow consistency equations into a learning objective, akin to the casting of the Bellman equations into Temporal Difference methods. We prove that any global minimum of the proposed objectives yields a policy which samples from the desired distribution, and demonstrate the improved performance and diversity of GFlowNet on a simple domain where there are many modes to the reward function, and on a molecule synthesis task.

* Submitted to NeurIPS 2021
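
The flow consistency idea can be made concrete on a tiny hand-built graph. The sketch below assumes a table of edge flows rather than a learned network, and a squared log-mismatch between inflow and reward plus outflow as the objective; the graph, rewards, and function names are hypothetical and chosen only to show that consistent flows induce sampling proportional to reward, including when two trajectories reach the same final state.

```python
import math
from collections import defaultdict


def flow_matching_loss(flows, rewards, source, eps=1e-8):
    """Sum over non-source states of (log inflow - log(reward + outflow))^2."""
    inflow, outflow = defaultdict(float), defaultdict(float)
    for (s, s_next), f in flows.items():
        outflow[s] += f
        inflow[s_next] += f
    loss = 0.0
    for s in set(inflow) | set(outflow):
        if s == source:
            continue
        lhs = math.log(eps + inflow[s])
        rhs = math.log(eps + rewards.get(s, 0.0) + outflow[s])
        loss += (lhs - rhs) ** 2
    return loss


def terminal_distribution(flows, source):
    """Sample-free rollout: follow each edge with probability flow / total outflow."""
    outflow, children = defaultdict(float), defaultdict(list)
    for (s, s_next), f in flows.items():
        outflow[s] += f
        children[s].append((s_next, f))
    dist = defaultdict(float)

    def push(s, prob):
        if not children[s]:          # no outgoing edges: a final object
            dist[s] += prob
            return
        for s_next, f in children[s]:
            push(s_next, prob * f / outflow[s])

    push(source, 1.0)
    return dict(dist)


# Toy DAG: object "x" is reachable through both "a" and "b".
rewards = {"x": 2.0, "y": 1.0}
flows = {("s0", "a"): 2.0, ("s0", "b"): 1.0,
         ("a", "x"): 1.0, ("a", "y"): 1.0, ("b", "x"): 1.0}

loss = flow_matching_loss(flows, rewards, source="s0")
dist = terminal_distribution(flows, source="s0")
```

Here the flows are consistent, so the loss is zero and the induced policy reaches `x` twice as often as `y`, matching the 2:1 reward ratio.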

Correcting Momentum in Temporal Difference Learning

Jun 07, 2021
Emmanuel Bengio, Joelle Pineau, Doina Precup

A common optimization tool used in deep reinforcement learning is momentum, which consists in accumulating and discounting past gradients, reapplying them at each iteration. We argue that, unlike in supervised learning, momentum in Temporal Difference (TD) learning accumulates gradients that become doubly stale: not only does the gradient of the loss change due to parameter updates, the loss itself changes due to bootstrapping. We first show that this phenomenon exists, and then propose a first-order correction term to momentum. We show that this correction term improves sample efficiency in policy evaluation by correcting target value drift. An important insight of this work is that deep RL methods are not always best served by directly importing techniques from the supervised setting.

* NeurIPS Deep RL Workshop 2020

A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning

Jun 03, 2021
Mingde Zhao, Zhen Liu, Sitao Luan, Shuyuan Zhang, Doina Precup, Yoshua Bengio

We present an end-to-end, model-based deep reinforcement learning agent which dynamically attends to relevant parts of its state, in order to plan and to generalize better out-of-distribution. The agent's architecture uses a set representation and a bottleneck mechanism, forcing the number of entities to which the agent attends at each planning step to be small. In experiments with customized MiniGrid environments with different dynamics, we observe that the design allows agents to learn to plan effectively, by attending to the relevant objects, leading to better out-of-distribution generalization.

Improving Long-Term Metrics in Recommendation Systems using Short-Horizon Offline RL

Jun 01, 2021
Bogdan Mazoure, Paul Mineiro, Pavithra Srinath, Reza Sharifi Sedeh, Doina Precup, Adith Swaminathan

We study session-based recommendation scenarios where we want to recommend items to users during sequential interactions to improve their long-term utility. Optimizing a long-term metric is challenging because the learning signal (whether the recommendations achieved their desired goals) is delayed and confounded by other user interactions with the system. Immediately measurable proxies such as clicks can lead to suboptimal recommendations due to misalignment with the long-term metric. Many works have applied episodic reinforcement learning (RL) techniques for session-based recommendation but these methods do not account for policy-induced drift in user intent across sessions. We develop a new batch RL algorithm called Short Horizon Policy Improvement (SHPI) that approximates policy-induced distribution shifts across sessions. By varying the horizon hyper-parameter in SHPI, we recover well-known policy improvement schemes in the RL literature. Empirical results on four recommendation tasks show that SHPI can outperform matrix factorization, offline bandits, and offline RL baselines. We also provide a stable and computationally efficient implementation using weighted regression oracles.

AndroidEnv: A Reinforcement Learning Platform for Android

May 27, 2021
Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed, Tyler Jackson, Shibl Mourad, Doina Precup

We introduce AndroidEnv, an open-source platform for Reinforcement Learning (RL) research built on top of the Android ecosystem. AndroidEnv allows RL agents to interact with a wide variety of apps and services commonly used by humans through a universal touchscreen interface. Since agents train on a realistic simulation of an Android device, they have the potential to be deployed on real devices. In this report, we give an overview of the environment, highlighting the significant features it provides for research, and we present an empirical evaluation of some popular reinforcement learning agents on a set of tasks built on this platform.

What is Going on Inside Recurrent Meta Reinforcement Learning Agents?

Apr 29, 2021
Safa Alver, Doina Precup

Recurrent meta reinforcement learning (meta-RL) agents are agents that employ a recurrent neural network (RNN) for the purpose of "learning a learning algorithm". After being trained on a pre-specified task distribution, the learned weights of the agent's RNN are said to implement an efficient learning algorithm through their activity dynamics, which allows the agent to quickly solve new tasks sampled from the same distribution. However, due to the black-box nature of these agents, the way in which they work is not yet fully understood. In this study, we shed light on the internal working mechanisms of these agents by reformulating the meta-RL problem using the Partially Observable Markov Decision Process (POMDP) framework. We hypothesize that the learned activity dynamics is acting as belief states for such agents. Several illustrative experiments suggest that this hypothesis is true, and that recurrent meta-RL agents can be viewed as agents that learn to act optimally in partially observable environments consisting of multiple related tasks. This view helps in understanding their failure cases and some interesting model-based results reported in the literature.

* Accepted to the Never-Ending Reinforcement Learning Workshop at ICLR 2021

Training a First-Order Theorem Prover from Synthetic Data

Mar 05, 2021
Vlad Firoiu, Eser Aygun, Ankit Anand, Zafarali Ahmed, Xavier Glorot, Laurent Orseau, Lei Zhang, Doina Precup, Shibl Mourad

A major challenge in applying machine learning to automated theorem proving is the scarcity of training data, which is a key ingredient in training successful deep learning models. To tackle this problem, we propose an approach that relies on training purely with synthetically generated theorems, without any human data aside from axioms. We use these theorems to train a neurally-guided saturation-based prover. Our neural prover outperforms the state-of-the-art E-prover on this synthetic data in both time and search steps, and shows significant transfer to the unseen human-written theorems from the TPTP library, where it solves 72% of first-order problems without equality.

Variance Penalized On-Policy and Off-Policy Actor-Critic

Feb 03, 2021
Arushi Jain, Gandharv Patil, Ayush Jain, Khimya Khetarpal, Doina Precup

Reinforcement learning algorithms are typically geared towards optimizing the expected return of an agent. However, in many practical applications, low variance in the return is desired to ensure the reliability of an algorithm. In this paper, we propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return. Previous work uses the second moment of return to estimate the variance indirectly. Instead, we use a much simpler recently proposed direct variance estimator which updates the estimates incrementally using temporal difference methods. Using the variance-penalized criterion, we guarantee the convergence of our algorithm to locally optimal policies for finite state action Markov decision processes. We demonstrate the utility of our algorithm in tabular and continuous MuJoCo domains. Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.

* Accepted to the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), 2021
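
As a rough illustration of estimating the variance of the return directly with temporal difference updates, the tabular sketch below treats the squared TD error as the reward of a second TD learner with discount `gamma ** 2`, and combines mean and variance in a penalized score. The exact estimator and criterion used in the paper may differ; the function names, the penalty weight `psi`, and the toy reward stream are assumptions.

```python
def td_mean_variance_step(v, var, s, r, s_next, alpha, gamma):
    """One TD step for the value and one for the variance of the return."""
    v_next = 0.0 if s_next is None else v[s_next]
    var_next = 0.0 if s_next is None else var[s_next]
    delta = r + gamma * v_next - v[s]
    # Variance learner: squared TD error as reward, gamma^2 as discount.
    var[s] += alpha * (delta ** 2 + gamma ** 2 * var_next - var[s])
    v[s] += alpha * delta
    return delta


def penalized_score(mean_return, variance, psi):
    """Mean-variance criterion: expected return minus psi times its variance."""
    return mean_return - psi * variance


# Toy one-step task: reward alternates +1 / -1 (mean 0, variance 1).
v, var = {0: 0.0}, {0: 0.0}
for step in range(5000):
    r = 1.0 if step % 2 == 0 else -1.0
    td_mean_variance_step(v, var, s=0, r=r, s_next=None, alpha=0.01, gamma=0.99)

score = penalized_score(v[0], var[0], psi=0.5)
```

On this toy stream the value estimate stays near 0 and the variance estimate settles near 1, so the penalized score is close to -0.5.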
