Tom Zahavy

Balancing Constraints and Rewards with Meta-Gradient D4PG

Oct 13, 2020
Dan A. Calian, Daniel J. Mankowitz, Tom Zahavy, Zhongwen Xu, Junhyuk Oh, Nir Levine, Timothy Mann

Deploying Reinforcement Learning (RL) agents to solve real-world applications often requires satisfying complex system constraints. Often the constraint thresholds are incorrectly set due to the complex nature of a system or the inability to verify the thresholds offline (e.g., no simulator or reasonable offline evaluation procedure exists). This results in solutions where a task cannot be solved without violating the constraints. However, in many real-world cases, constraint violations are undesirable but not catastrophic, motivating the need for soft-constrained RL approaches. We present two soft-constrained RL approaches that utilize meta-gradients to find a good trade-off between maximizing expected return and minimizing constraint violations. We demonstrate the effectiveness of these approaches by showing that they consistently outperform the baselines across four different MuJoCo domains.

Learning to Ask Medical Questions using Reinforcement Learning

Mar 31, 2020
Uri Shaham, Tom Zahavy, Cesar Caraballo, Shiwani Mahajan, Daisy Massey, Harlan Krumholz

We propose a novel reinforcement learning-based approach for adaptive and iterative feature selection. Given a masked vector of input features, a reinforcement learning agent iteratively selects certain features to be unmasked, and uses them to predict an outcome when it is sufficiently confident. The algorithm makes use of a novel environment setting, corresponding to a non-stationary Markov Decision Process. A key component of our approach is a guesser network, trained to predict the outcome from the selected features and parametrizing the reward function. Applying our method to a national survey dataset, we show that it not only outperforms strong baselines when the prediction must be made from a small number of input features, but is also considerably more interpretable. Our code is publicly available at https://github.com/ushaham/adaptiveFS.

Self-Tuning Deep Reinforcement Learning

Mar 02, 2020
Tom Zahavy, Zhongwen Xu, Vivek Veeriah, Matteo Hessel, Junhyuk Oh, Hado van Hasselt, David Silver, Satinder Singh

Reinforcement learning (RL) algorithms often require expensive manual or automated hyperparameter searches in order to perform well on a new domain. This need is particularly acute in modern deep RL architectures, which often incorporate many modules and multiple loss functions. In this paper, we take a step towards addressing this issue by using metagradients (Xu et al., 2018) to tune these hyperparameters via differentiable cross-validation, whilst the agent interacts with and learns from the environment. We present the Self-Tuning Actor Critic (STAC), which uses this process to tune the hyperparameters of the usual loss function of the IMPALA actor-critic agent (Espeholt et al., 2018), to learn the hyperparameters that define auxiliary loss functions, and to balance trade-offs in off-policy learning by introducing and adapting the hyperparameters of a novel leaky V-trace operator. The method is simple to use, sample efficient, and does not require a significant increase in compute. Ablative studies show that the overall performance of STAC improves as we adapt more hyperparameters. When applied to 57 games on the Atari 2600 environment over 200 million frames, our algorithm improves the median human normalized score of the baseline from 243% to 364%.

Deep learning reconstruction of ultrashort pulses from 2D spatial intensity patterns recorded by an all-in-line system in a single-shot

Nov 23, 2019
Ron Ziv, Alex Dikopoltsev, Tom Zahavy, Ittai Rubinstein, Pavel Sidorenko, Oren Cohen, Mordechai Segev

We propose a simple all-in-line single-shot scheme for diagnostics of ultrashort laser pulses, consisting of a multi-mode fiber, a nonlinear crystal and a CCD camera. The system records a 2D spatial intensity pattern, from which the pulse shape (amplitude and phase) is recovered through a fast Deep Learning algorithm. We explore this scheme in simulations and demonstrate the recovery of ultrashort pulses, as well as robustness to measurement noise and to inaccuracies in the parameters of the system components. Our technique mitigates the need for commonly used iterative optimization reconstruction methods, which are usually slow and hampered by the presence of noise. These features make our concept system advantageous for real-time probing of ultrafast processes and for noisy conditions. Moreover, this work exemplifies that deep learning can unlock new types of systems for pulse recovery.

Apprenticeship Learning via Frank-Wolfe

Nov 20, 2019
Tom Zahavy, Alon Cohen, Haim Kaplan, Yishay Mansour

We consider applications of the Frank-Wolfe (FW) algorithm to Apprenticeship Learning (AL). In this setting, we are given a Markov Decision Process (MDP) without an explicit reward function. Instead, we observe an expert that acts according to some policy, and the goal is to find a policy whose feature expectations are closest to those of the expert policy. We formulate this problem as finding the projection of the feature expectations of the expert on the feature expectations polytope -- the convex hull of the feature expectations of all the deterministic policies in the MDP. We show that this formulation is equivalent to the AL objective and that solving this problem using the FW algorithm is equivalent to the well-known projection method of Abbeel and Ng (2004). This insight allows us to analyze AL with tools from the convex optimization literature and derive tighter convergence bounds on AL. Specifically, we show that a variation of the FW method that is based on taking "away steps" achieves a linear rate of convergence when applied to AL and that a stochastic version of the FW algorithm can be used to avoid precise estimation of feature expectations. We also experimentally show that this version outperforms the FW baseline. To the best of our knowledge, this is the first work that shows linear convergence rates for AL.
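
The projection formulation in this abstract can be illustrated with a minimal sketch of plain Frank-Wolfe (no away steps) projecting a target point onto the convex hull of a finite vertex set. This is not the authors' code. The function name `frank_wolfe_projection` and the explicit `vertices` array are assumptions made for illustration: in AL the linear minimization step would be carried out by solving the MDP for a reward direction rather than by scanning a list of vertices.

```python
import numpy as np


def frank_wolfe_projection(vertices, target, num_iters=200, tol=1e-10):
    """Approximately project `target` onto the convex hull of `vertices`.

    Minimizes f(x) = 0.5 * ||x - target||^2 with plain Frank-Wolfe and
    exact line search. `vertices` is an (n, d) array standing in for the
    feature expectations of deterministic policies (an assumption here).
    """
    vertices = np.asarray(vertices, dtype=float)
    target = np.asarray(target, dtype=float)
    x = vertices[0].copy()  # start from an arbitrary vertex
    for _ in range(num_iters):
        grad = x - target  # gradient of f at x
        # Linear minimization oracle: vertex minimizing <grad, v>.
        s = vertices[np.argmin(vertices @ grad)]
        d = s - x
        gap = -float(grad @ d)  # Frank-Wolfe duality gap, always >= 0
        if gap <= tol:
            break
        # Exact line search for a quadratic objective, clipped to [0, 1].
        gamma = min(1.0, gap / float(d @ d))
        x = x + gamma * d
    return x


# Usage: project a point outside the unit square onto the square.
square = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
projection = frank_wolfe_projection(square, [2.0, 0.5])
```

The duality gap doubles as a stopping criterion, since it upper-bounds the suboptimality of the current iterate.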

Inverse Reinforcement Learning in Contextual MDPs

May 29, 2019
Philip Korsunsky, Stav Belogolovsky, Tom Zahavy, Chen Tessler, Shie Mannor

We consider the Inverse Reinforcement Learning (IRL) problem in Contextual Markov Decision Processes (CMDPs). Here, the reward of the environment, which is not available to the agent, depends on a static parameter referred to as the context. Each context defines an MDP (with a different reward signal), and the agent is provided demonstrations by an expert, for different contexts. The goal is to learn a mapping from contexts to rewards, such that planning with respect to the induced reward will perform similarly to the expert, even for unseen contexts. We suggest two learning algorithms for this scenario. (1) For rewards that are a linear function of the context, we provide a method that is guaranteed to return an $\epsilon$-optimal solution after a polynomial number of demonstrations. (2) For general reward functions, we propose black-box descent methods based on evolutionary strategies capable of working with nonlinear estimators (e.g., neural networks). We evaluate our algorithms in autonomous driving and medical treatment simulations and demonstrate their ability to learn and generalize to unseen contexts.
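
As a rough illustration of the black-box descent idea mentioned in (2), the sketch below uses a generic evolution-strategies gradient estimate with antithetic Gaussian perturbations. It is not the authors' algorithm. The name `es_minimize`, all default parameters, and the toy quadratic loss are assumptions; in the paper's setting the loss would instead compare the expert with a planner acting on the learned context-to-reward mapping.

```python
import numpy as np


def es_minimize(loss, theta0, sigma=0.1, lr=0.05, num_pairs=20,
                num_steps=200, seed=0):
    """Minimize a black-box `loss` using only function evaluations.

    The gradient is estimated from antithetic pairs theta +/- sigma * eps,
    so `loss` does not need to be differentiable in code.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(num_steps):
        eps = rng.standard_normal((num_pairs, theta.size))
        # Loss differences along each perturbation direction.
        diffs = np.array([
            loss(theta + sigma * e) - loss(theta - sigma * e) for e in eps
        ])
        # Monte Carlo estimate of the gradient of the smoothed loss.
        grad_est = (diffs[:, None] * eps).mean(axis=0) / (2.0 * sigma)
        theta -= lr * grad_est
    return theta


# Usage with a hypothetical toy loss: squared distance to a fixed vector.
toy_target = np.array([1.0, -2.0, 0.5])


def toy_loss(theta):
    return float(np.sum((theta - toy_target) ** 2))


theta_hat = es_minimize(toy_loss, np.zeros(3))
```

Because the estimator only queries loss values, the same loop applies unchanged when the parameters belong to a nonlinear estimator such as a neural network.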

Average reward reinforcement learning with unknown mixing times

May 23, 2019
Tom Zahavy, Alon Cohen, Haim Kaplan, Yishay Mansour

We derive and analyze learning algorithms for policy evaluation, apprenticeship learning, and policy gradient for average reward criteria. Existing algorithms explicitly require an upper bound on the mixing time. In contrast, we build on ideas from Markov chain theory and derive sampling algorithms that do not require such an upper bound. For these algorithms, we provide theoretical bounds on their sample-complexity and running time.
