Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yinlam Chow

Dima

Piecewise-Stationary Off-Policy Optimization

Jun 15, 2020

Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed

Figure 1 for Piecewise-Stationary Off-Policy Optimization

Figure 2 for Piecewise-Stationary Off-Policy Optimization

Figure 3 for Piecewise-Stationary Off-Policy Optimization

Abstract:Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary and the offline learned policies should adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary contextual bandits. Our proposed solution has two phases. In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state. In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance. This approach is practical and analyzable, and we provide guarantees on both the quality of off-policy optimization and the regret during online deployment. To show the effectiveness of our approach, we compare it to state-of-the-art baselines on both synthetic and real-world datasets. Our approach outperforms methods that act only on observed context.

* 16 pages, 2 figures

Via

Access Paper or Ask Questions

Predictive Coding for Locally-Linear Control

Mar 02, 2020

Rui Shu, Tung Nguyen, Yinlam Chow, Tuan Pham, Khoat Than, Mohammad Ghavamzadeh, Stefano Ermon, Hung H. Bui

Figure 1 for Predictive Coding for Locally-Linear Control

Figure 2 for Predictive Coding for Locally-Linear Control

Figure 3 for Predictive Coding for Locally-Linear Control

Figure 4 for Predictive Coding for Locally-Linear Control

Abstract:High-dimensional observations and unknown dynamics are major challenges when applying optimal control to many real-world decision making tasks. The Learning Controllable Embedding (LCE) framework addresses these challenges by embedding the observations into a lower dimensional latent space, estimating the latent dynamics, and then performing control directly in the latent space. To ensure the learned latent dynamics are predictive of next-observations, all existing LCE approaches decode back into the observation space and explicitly perform next-observation prediction---a challenging high-dimensional task that furthermore introduces a large number of nuisance parameters (i.e., the decoder) which are discarded during control. In this paper, we propose a novel information-theoretic LCE approach and show theoretically that explicit next-observation prediction can be replaced with predictive coding. We then use predictive coding to develop a decoder-free LCE model whose latent dynamics are amenable to locally-linear control. Extensive experiments on benchmark tasks show that our model reliably learns a controllable latent space that leads to superior performance when compared with state-of-the-art LCE baselines.

Via

Access Paper or Ask Questions

BRPO: Batch Residual Policy Optimization

Feb 08, 2020

Sungryull Sohn, Yinlam Chow, Jayden Ooi, Ofir Nachum, Honglak Lee, Ed Chi, Craig Boutilier

Figure 1 for BRPO: Batch Residual Policy Optimization

Figure 2 for BRPO: Batch Residual Policy Optimization

Figure 3 for BRPO: Batch Residual Policy Optimization

Figure 4 for BRPO: Batch Residual Policy Optimization

Abstract:In batch reinforcement learning (RL), one often constrains a learned policy to be close to the behavior (data-generating) policy, e.g., by constraining the learned action distribution to differ from the behavior policy by some maximum degree that is the same at each state. This can cause batch RL to be overly conservative, unable to exploit large policy changes at frequently-visited, high-confidence states without risking poor performance at sparsely-visited states. To remedy this, we propose residual policies, where the allowable deviation of the learned policy is state-action-dependent. We derive a new for RL method, BRPO, which learns both the policy and allowable deviation that jointly maximize a lower bound on policy performance. We show that BRPO achieves the state-of-the-art performance in a number of tasks.

Via

Access Paper or Ask Questions

AlgaeDICE: Policy Gradient from Arbitrary Experience

Dec 04, 2019

Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, Dale Schuurmans

Figure 1 for AlgaeDICE: Policy Gradient from Arbitrary Experience

Figure 2 for AlgaeDICE: Policy Gradient from Arbitrary Experience

Figure 3 for AlgaeDICE: Policy Gradient from Arbitrary Experience

Figure 4 for AlgaeDICE: Policy Gradient from Arbitrary Experience

Abstract:In many real-world applications of reinforcement learning (RL), interactions with the environment are limited due to cost or feasibility. This presents a challenge to traditional RL algorithms since the max-return objective involves an expectation over on-policy samples. We introduce a new formulation of max-return optimization that allows the problem to be re-expressed by an expectation over an arbitrary behavior-agnostic and off-policy data distribution. We first derive this result by considering a regularized version of the dual max-return objective before extending our findings to unregularized objectives through the use of a Lagrangian formulation of the linear programming characterization of Q-values. We show that, if auxiliary dual variables of the objective are optimized, then the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting. In addition to revealing the appealing theoretical properties of this approach, we also show that it delivers good practical performance.

Via

Access Paper or Ask Questions

CAQL: Continuous Action Q-Learning

Oct 09, 2019

Moonkyung Ryu, Yinlam Chow, Ross Anderson, Christian Tjandraatmadja, Craig Boutilier

Figure 1 for CAQL: Continuous Action Q-Learning

Figure 2 for CAQL: Continuous Action Q-Learning

Figure 3 for CAQL: Continuous Action Q-Learning

Figure 4 for CAQL: Continuous Action Q-Learning

Abstract:Value-based reinforcement learning (RL) methods like Q-learning have shown success in a variety of domains. One challenge in applying Q-learning to continuous-action RL problems, however, is the continuous action maximization (max-Q) required for optimal Bellman backup. In this work, we develop CAQL, a (class of) algorithm(s) for continuous-action Q-learning that can use several plug-and-play optimizers for the max-Q problem. Leveraging recent optimization results for deep neural networks, we show that max-Q can be solved optimally using mixed-integer programming (MIP). When the Q-function representation has sufficient power, MIP-based optimization gives rise to better policies and is more robust than approximate methods (e.g., gradient ascent, cross-entropy search). We further develop several techniques to accelerate inference in CAQL, which despite their approximate nature, perform well. We compare CAQL with state-of-the-art RL algorithms on benchmark continuous-control problems that have different degrees of action constraints and show that CAQL outperforms policy-based methods in heavily constrained environments, often dramatically.

Via

Access Paper or Ask Questions

Prediction, Consistency, Curvature: Representation Learning for Locally-Linear Control

Sep 04, 2019

Nir Levine, Yinlam Chow, Rui Shu, Ang Li, Mohammad Ghavamzadeh, Hung Bui

Figure 1 for Prediction, Consistency, Curvature: Representation Learning for Locally-Linear Control

Figure 2 for Prediction, Consistency, Curvature: Representation Learning for Locally-Linear Control

Figure 3 for Prediction, Consistency, Curvature: Representation Learning for Locally-Linear Control

Figure 4 for Prediction, Consistency, Curvature: Representation Learning for Locally-Linear Control

Abstract:Many real-world sequential decision-making problems can be formulated as optimal control with high-dimensional observations and unknown dynamics. A promising approach is to embed the high-dimensional observations into a lower-dimensional latent representation space, estimate the latent dynamics model, then utilize this model for control in the latent space. An important open question is how to learn a representation that is amenable to existing control algorithms? In this paper, we focus on learning representations for locally-linear control algorithms, such as iterative LQR (iLQR). By formulating and analyzing the representation learning problem from an optimal control perspective, we establish three underlying principles that the learned representation should comprise: 1) accurate prediction in the observation space, 2) consistency between latent and observation space dynamics, and 3) low curvature in the latent space transitions. These principles naturally correspond to a loss function that consists of three terms: prediction, consistency, and curvature (PCC). Crucially, to make PCC tractable, we derive an amortized variational bound for the PCC loss function. Extensive experiments on benchmark domains demonstrate that the new variational-PCC learning algorithm benefits from significantly more stable and reproducible training, and leads to superior control performance. Further ablation studies give support to the importance of all three PCC components for learning a good latent space for control.

Via

Access Paper or Ask Questions

DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

Jun 10, 2019

Ofir Nachum, Yinlam Chow, Bo Dai, Lihong Li

Figure 1 for DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

Figure 2 for DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

Figure 3 for DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

Figure 4 for DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

Abstract:In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios -- correction terms which quantify the likelihood that the new policy will experience a certain state-action pair normalized by the probability with which the state-action pair appears in the dataset -- can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, it eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic of previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that our algorithm significantly improves accuracy compared to existing techniques.

Via

Access Paper or Ask Questions

Lyapunov-based Safe Policy Optimization for Continuous Control

Jan 28, 2019

Yinlam Chow, Ofir Nachum, Aleksandra Faust, Mohammad Ghavamzadeh, Edgar Duenez-Guzman

Figure 1 for Lyapunov-based Safe Policy Optimization for Continuous Control

Figure 2 for Lyapunov-based Safe Policy Optimization for Continuous Control

Figure 3 for Lyapunov-based Safe Policy Optimization for Continuous Control

Figure 4 for Lyapunov-based Safe Policy Optimization for Continuous Control

Abstract:We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through {\em safe} policies, i.e.,~policies that do not take the agent to undesirable situations. We formulate these problems as {\em constrained} Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a {\em Lyapunov} approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while guaranteeing near-constraint satisfaction for every policy update by projecting either the policy parameter or the action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world indoor robot navigation problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction. Videos of the experiments can be found in the following link: https://drive.google.com/file/d/1pzuzFqWIE710bE2U6DmS59AfRzqK2Kek/view?usp=sharing .

Via

Access Paper or Ask Questions

A Block Coordinate Ascent Algorithm for Mean-Variance Optimization

Nov 01, 2018

Bo Liu, Tengyang Xie, Yangyang Xu, Mohammad Ghavamzadeh, Yinlam Chow, Daoming Lyu, Daesub Yoon

Figure 1 for A Block Coordinate Ascent Algorithm for Mean-Variance Optimization

Figure 2 for A Block Coordinate Ascent Algorithm for Mean-Variance Optimization

Abstract:Risk management in dynamic decision problems is a primary concern in many fields, including financial investment, autonomous driving, and healthcare. The mean-variance function is one of the most widely used objective functions in risk management due to its simplicity and interpretability. Existing algorithms for mean-variance optimization are based on multi-time-scale stochastic approximation, whose learning rate schedules are often hard to tune, and have only asymptotic convergence proof. In this paper, we develop a model-free policy search framework for mean-variance optimization with finite-sample error bound analysis (to local optima). Our starting point is a reformulation of the original mean-variance function with its Fenchel dual, from which we propose a stochastic block coordinate ascent policy search algorithm. Both the asymptotic convergence guarantee of the last iteration's solution and the convergence rate of the randomly picked solution are provided, and their applicability is demonstrated on several benchmark domains.

* Accepted by NIPS 2018

Via

Access Paper or Ask Questions

Risk-Sensitive Generative Adversarial Imitation Learning

Aug 13, 2018

Jonathan Lacotte, Yinlam Chow, Mohammad Ghavamzadeh, Marco Pavone

Figure 1 for Risk-Sensitive Generative Adversarial Imitation Learning

Figure 2 for Risk-Sensitive Generative Adversarial Imitation Learning

Figure 3 for Risk-Sensitive Generative Adversarial Imitation Learning

Abstract:We study risk-sensitive imitation learning where the agent's goal is to perform at least as well as the expert in terms of a risk profile. We first formulate our risk-sensitive imitation learning setting. We consider the generative adversarial approach to imitation learning (GAIL) and derive an optimization problem for our formulation, which we call risk-sensitive GAIL (RS-GAIL). We then derive two different versions of our RS-GAIL optimization problem that aim at matching the risk profiles of the agent and the expert w.r.t. Jensen-Shannon (JS) divergence and Wasserstein distance, and develop risk-sensitive generative adversarial imitation learning algorithms based on these optimization problems. We evaluate the performance of our JS-based algorithm and compare it with GAIL and the risk-averse imitation learning (RAIL) algorithm in two MuJoCo tasks.

Via

Access Paper or Ask Questions