Omer Gottesman

TD Convergence: An Optimization Perspective

Jun 30, 2023
Kavosh Asadi, Shoham Sabach, Yao Liu, Omer Gottesman, Rasool Fakoor

We study the convergence behavior of the celebrated temporal-difference (TD) learning algorithm. By looking at the algorithm through the lens of optimization, we first argue that TD can be viewed as an iterative optimization algorithm where the function to be minimized changes per iteration. By carefully investigating the divergence displayed by TD on a classical counterexample, we identify two forces that determine the convergent or divergent behavior of the algorithm. We next formalize our discovery in the linear TD setting with quadratic loss and prove that convergence of TD hinges on the interplay between these two forces. We extend this optimization perspective to prove convergence of TD in a much broader setting than just linear approximation and squared loss. Our results provide a theoretical explanation for the successful application of TD in reinforcement learning.
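
As a rough illustration of this optimization view (not code from the paper; the environment, features, and step size are invented), the sketch below runs linear TD(0) on a small random-walk chain. Each update minimizes a squared error toward a bootstrapped target that itself depends on the current weights, which is the per-iteration "moving objective" referred to above.

```python
import numpy as np

# Toy illustration (not from the paper): linear TD(0) on a 5-state random walk.
# The per-step objective is a squared error toward a bootstrapped target that
# itself depends on the current weights, so the function being minimized
# changes at every iteration.

rng = np.random.default_rng(0)
n_states, gamma, alpha = 5, 0.95, 0.05
features = np.eye(n_states)            # tabular features for simplicity
w = np.zeros(n_states)                 # linear value-function weights

def step(s):
    """Random-walk dynamics with a reward on reaching the right end."""
    s_next = np.clip(s + rng.choice([-1, 1]), 0, n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

s = 2
for t in range(20000):
    s_next, r = step(s)
    v, v_next = features[s] @ w, features[s_next] @ w
    target = r + gamma * v_next                 # bootstrapped target: depends on w
    w += alpha * (target - v) * features[s]     # semi-gradient TD(0) update
    s = s_next

print("learned values:", np.round(features @ w, 3))
```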

Robust Decision-Focused Learning for Reward Transfer

Apr 06, 2023
Abhishek Sharma, Sonali Parbhoo, Omer Gottesman, Finale Doshi-Velez

Decision-focused (DF) model-based reinforcement learning has recently been introduced as a powerful algorithm that can focus on learning the MDP dynamics that are most relevant for obtaining high rewards. While this approach increases the performance of agents by focusing the learning towards optimizing for the reward directly, it does so by learning less accurate dynamics (from an MLE standpoint), and may thus be brittle to changes in the reward function. In this work, we develop the robust decision-focused (RDF) algorithm, which leverages the non-identifiability of DF solutions to learn models that maximize expected returns while remaining robust to changes in the reward function. We demonstrate on a variety of toy examples and healthcare simulators that RDF significantly increases the robustness of DF to changes in the reward function, without decreasing the overall return the agent obtains.

On the Geometry of Reinforcement Learning in Continuous State and Action Spaces

Dec 29, 2022
Saket Tiwari, Omer Gottesman, George Konidaris

Advances in reinforcement learning have led to its successful application in complex tasks with continuous state and action spaces. Despite these advances in practice, most theoretical work pertains to finite state and action spaces. We propose building a theoretical understanding of continuous state and action spaces by employing a geometric lens. Central to our work is the idea that the transition dynamics induce a low-dimensional manifold of reachable states embedded in the high-dimensional nominal state space. We prove that, under certain conditions, the dimensionality of this manifold is at most the dimensionality of the action space plus one. This is the first result of its kind, linking the geometry of the state space to the dimensionality of the action space. We empirically corroborate this upper bound for four MuJoCo environments. We further demonstrate the applicability of our result by learning a policy in this low-dimensional representation. To do so, we introduce an algorithm that learns a mapping to a low-dimensional representation, as a narrow hidden layer of a deep neural network, in tandem with the policy using DDPG. Our experiments show that a policy learnt this way performs on par or better for four MuJoCo control suite tasks.
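
As a minimal sketch of the representation-learning idea above (assuming PyTorch; the state and action dimensions, widths, and names are placeholders, and the surrounding DDPG machinery is omitted), the actor below routes the state through a narrow hidden layer of width equal to the action dimensionality plus one, so a low-dimensional representation is learned in tandem with the policy.

```python
import torch
import torch.nn as nn

class BottleneckPolicy(nn.Module):
    """Actor with a narrow hidden layer acting as a learned low-dimensional
    state representation (width = action_dim + 1, matching the bound above).
    Hypothetical sketch; the paper trains this inside DDPG."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        bottleneck = action_dim + 1                 # manifold-dimension bound
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck),          # low-dimensional representation
        )
        self.head = nn.Sequential(
            nn.Linear(bottleneck, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(state))

# Example shapes for a MuJoCo-like task (values are placeholders).
policy = BottleneckPolicy(state_dim=17, action_dim=6)
print(policy(torch.randn(4, 17)).shape)   # -> torch.Size([4, 6])
```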

A Bayesian Approach to Learning Bandit Structure in Markov Decision Processes

Jul 30, 2022
Kelly W. Zhang, Omer Gottesman, Finale Doshi-Velez

In the reinforcement learning literature, there are many algorithms developed for either Contextual Bandit (CB) or Markov Decision Process (MDP) environments. However, when deploying reinforcement learning algorithms in the real world, even with domain expertise, it is often difficult to know whether it is appropriate to treat a sequential decision making problem as a CB or an MDP. In other words, do actions affect future states, or only the immediate rewards? Making the wrong assumption about the nature of the environment can lead to inefficient learning, or even prevent the algorithm from ever learning an optimal policy, even with infinite data. In this work we develop an online algorithm that uses a Bayesian hypothesis testing approach to learn the nature of the environment. Our algorithm allows practitioners to incorporate prior knowledge about whether the environment is a CB or an MDP, and to effectively interpolate between classical CB and MDP-based algorithms to mitigate the effects of misspecifying the environment. We perform simulations and demonstrate that in CB settings our algorithm achieves lower regret than MDP-based algorithms, while in non-bandit MDP settings our algorithm is able to learn the optimal policy, often achieving comparable regret to MDP-based algorithms.
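
To make the bandit-versus-MDP question concrete, the following toy sketch (illustrative only, not the paper's algorithm; the prior, data, and environment sizes are invented) scores the two hypotheses on logged transition counts with Dirichlet-categorical marginal likelihoods: under the CB hypothesis the next state depends only on the current state, while under the MDP hypothesis it also depends on the action.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of categorical data under a Dirichlet(alpha)
    prior, computed independently for each row of `counts` and summed."""
    counts = np.atleast_2d(counts)
    a = np.full(counts.shape, alpha)
    return np.sum(
        gammaln(a.sum(1)) - gammaln((a + counts).sum(1))
        + np.sum(gammaln(a + counts) - gammaln(a), axis=1)
    )

# Toy logged data: transition counts[s, a, s'] from some behavior policy.
rng = np.random.default_rng(1)
n_s, n_a = 3, 2
counts = rng.integers(0, 20, size=(n_s, n_a, n_s)).astype(float)

# H_CB: next state depends on s only  -> pool counts over actions.
log_ev_cb = log_marginal(counts.sum(axis=1))
# H_MDP: next state depends on (s, a) -> one categorical per state-action pair.
log_ev_mdp = log_marginal(counts.reshape(n_s * n_a, n_s))

print("log Bayes factor (MDP vs CB):", round(log_ev_mdp - log_ev_cb, 2))
```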

* Challenges of Real-World Reinforcement Learning 2020 (NeurIPS Workshop) 

Deep Q-Network with Proximal Iteration

Dec 10, 2021
Kavosh Asadi, Rasool Fakoor, Omer Gottesman, Michael L. Littman, Alexander J. Smola

We employ Proximal Iteration for value-function optimization in reinforcement learning. Proximal Iteration is a computationally efficient technique that enables us to bias the optimization procedure towards more desirable solutions. As a concrete application of Proximal Iteration in deep reinforcement learning, we endow the objective function of the Deep Q-Network (DQN) agent with a proximal term to ensure that the online-network component of DQN remains in the vicinity of the target network. The resultant agent, which we call DQN with Proximal Iteration, or DQNPro, exhibits significant improvements over the original DQN on the Atari benchmark. Our results accentuate the power of employing sound optimization techniques for deep reinforcement learning.
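
A short sketch of the modification described above, assuming PyTorch; the function name, batch format, and proximal coefficient are placeholders. The standard DQN temporal-difference loss is augmented with a term that penalizes the squared distance between the online parameters and the target-network parameters.

```python
import torch
import torch.nn.functional as F

def dqn_pro_loss(online_net, target_net, batch, gamma=0.99, prox_coef=1e-3):
    """Illustrative DQN loss with a proximal term keeping the online network
    in the vicinity of the target network (names and coefficient are placeholders)."""
    s, a, r, s_next, done = batch
    q = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next
    td_loss = F.smooth_l1_loss(q, target)

    # Proximal term: squared distance between online and target parameters.
    prox = sum(
        ((p - p_t.detach()) ** 2).sum()
        for p, p_t in zip(online_net.parameters(), target_net.parameters())
    )
    return td_loss + prox_coef * prox
```

The gradient of the penalty pulls the online parameters back toward the target network, which is the bias toward more desirable solutions that the abstract describes.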

* Work in Progress 

Identification of Subgroups With Similar Benefits in Off-Policy Policy Evaluation

Nov 28, 2021
Ramtin Keramati, Omer Gottesman, Leo Anthony Celi, Finale Doshi-Velez, Emma Brunskill

Off-policy policy evaluation methods for sequential decision making can be used to help identify if a proposed decision policy is better than a current baseline policy. However, a new decision policy may be better than a baseline policy for some individuals but not others. This has motivated a push towards personalization and accurate per-state estimates of heterogeneous treatment effects (HTEs). Given the limited data present in many important applications, individual predictions can come at a cost to accuracy and confidence in such predictions. We develop a method to balance the need for personalization with confident predictions by identifying subgroups where it is possible to confidently estimate the expected difference in a new decision policy relative to a baseline. We propose a novel loss function that accounts for uncertainty during the subgroup partitioning phase. In experiments, we show that our method can be used to form accurate predictions of HTEs where other methods struggle.
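
The sketch below is not the paper's partitioning method; it only illustrates the evaluation step such a partition enables. Given hypothetical subgroup labels, it computes a per-subgroup importance-sampling estimate of the difference between the new policy and the baseline, along with a naive standard error, the kind of quantity an uncertainty-aware partitioning loss would trade off against subgroup size.

```python
import numpy as np

def subgroup_value_differences(groups, is_weights, returns, baseline_returns):
    """For each subgroup, estimate E[return under new policy] - E[return under
    baseline] from logged data, with a naive standard error. Purely
    illustrative; the paper's contribution is how the subgroups are chosen."""
    results = {}
    for g in np.unique(groups):
        idx = groups == g
        diff_samples = is_weights[idx] * returns[idx] - baseline_returns[idx]
        est = diff_samples.mean()
        se = diff_samples.std(ddof=1) / np.sqrt(idx.sum())
        results[int(g)] = (est, se)
    return results

# Toy example with made-up logged trajectories and subgroup labels.
rng = np.random.default_rng(0)
n = 200
groups = rng.integers(0, 3, size=n)
is_weights = rng.lognormal(mean=0.0, sigma=0.5, size=n)
returns = rng.normal(1.0, 1.0, size=n)           # returns under the behavior policy
baseline_returns = rng.normal(0.8, 1.0, size=n)  # returns under the baseline policy
print(subgroup_value_differences(groups, is_weights, returns, baseline_returns))
```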

Coarse-Grained Smoothness for RL in Metric Spaces

Oct 23, 2021
Omer Gottesman, Kavosh Asadi, Cameron Allen, Sam Lobel, George Konidaris, Michael Littman

Principled decision-making in continuous state-action spaces is impossible without some assumptions. A common approach is to assume Lipschitz continuity of the Q-function. We show that, unfortunately, this property fails to hold in many typical domains. We propose a new coarse-grained smoothness definition that generalizes the notion of Lipschitz continuity, is more widely applicable, and allows us to compute significantly tighter bounds on Q-functions, leading to improved learning. We provide a theoretical analysis of our new smoothness definition, and discuss its implications and impact on control and exploration in continuous domains.
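
For context on the assumption being relaxed, the toy sketch below (not from the paper; the points, values, and Lipschitz constant are invented) computes the classical Lipschitz upper bound on a Q-value at a new state-action pair from previously observed pairs. The coarse-grained definition is meant to yield tighter bounds in domains where a global Lipschitz constant is large or does not exist.

```python
import numpy as np

def lipschitz_upper_bound(query, observed_points, observed_q, lipschitz_const):
    """Classical bound: Q(x) <= min_i [Q(x_i) + L * d(x, x_i)], where x is a
    (state, action) pair. Illustrative only; the paper's coarse-grained
    smoothness replaces this pointwise L * d(., .) term with a coarser notion."""
    dists = np.linalg.norm(observed_points - query, axis=1)
    return np.min(observed_q + lipschitz_const * dists)

# Made-up observed (state, action) pairs in R^3 and their Q-values.
points = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, -0.2], [0.3, 0.9, 0.1]])
q_vals = np.array([1.0, 2.5, 1.8])
print(lipschitz_upper_bound(np.array([0.2, 0.2, 0.0]), points, q_vals,
                            lipschitz_const=3.0))
```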

State Relevance for Off-Policy Evaluation

Sep 13, 2021
Simon P. Shen, Yecheng Jason Ma, Omer Gottesman, Finale Doshi-Velez

Importance sampling-based estimators for off-policy evaluation (OPE) are valued for their simplicity, unbiasedness, and reliance on relatively few assumptions. However, the variance of these estimators is often high, especially when trajectories are of different lengths. In this work, we introduce Omitting-States-Irrelevant-to-Return Importance Sampling (OSIRIS), an estimator which reduces variance by strategically omitting likelihood ratios associated with certain states. We formalize the conditions under which OSIRIS is unbiased and has lower variance than ordinary importance sampling, and we demonstrate these properties empirically.
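
To illustrate the mechanism (with the set of omitted steps chosen by hand rather than derived as in the paper), the sketch below compares an ordinary per-trajectory importance-sampling weight with one in which the likelihood ratios at selected time steps are dropped; choosing which states can be dropped without introducing bias is what OSIRIS formalizes.

```python
import numpy as np

def is_weight(pi_e_probs, pi_b_probs, keep_mask=None):
    """Per-trajectory importance weight: product of pi_e(a|s) / pi_b(a|s) over
    time steps. If keep_mask is given, ratios at masked-out steps are omitted
    (set to 1). Here the mask is hand-picked rather than derived as in OSIRIS."""
    ratios = np.asarray(pi_e_probs) / np.asarray(pi_b_probs)
    if keep_mask is not None:
        ratios = np.where(keep_mask, ratios, 1.0)
    return np.prod(ratios)

# Toy trajectory: action probabilities under evaluation and behavior policies.
pi_e = [0.9, 0.2, 0.8, 0.7]
pi_b = [0.5, 0.5, 0.5, 0.5]
full_weight = is_weight(pi_e, pi_b)
# Suppose steps 1 and 2 visit states judged irrelevant to the return.
osiris_weight = is_weight(pi_e, pi_b, keep_mask=[True, False, False, True])
print(full_weight, osiris_weight)
```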

* Proceedings of the 38th International Conference on Machine Learning, PMLR 139:9537-9546, 2021  
* ICML 2021 

Learning Markov State Abstractions for Deep Reinforcement Learning

Jun 08, 2021
Cameron Allen, Neev Parikh, Omer Gottesman, George Konidaris

The fundamental assumption of reinforcement learning in Markov decision processes (MDPs) is that the relevant decision process is, in fact, Markov. However, when MDPs have rich observations, agents typically learn by way of an abstract state representation, and such representations are not guaranteed to preserve the Markov property. We introduce a novel set of conditions and prove that they are sufficient for learning a Markov abstract state representation. We then describe a practical training procedure that combines inverse model estimation and temporal contrastive learning to learn an abstraction that approximately satisfies these conditions. Our novel training objective is compatible with both online and offline training: it does not require a reward signal, but agents can capitalize on reward information when available. We empirically evaluate our approach on a visual gridworld domain and a set of continuous control benchmarks. Our approach learns representations that capture the underlying structure of the domain and lead to improved sample efficiency over state-of-the-art deep reinforcement learning with visual features -- often matching or exceeding the performance achieved with hand-designed compact state information.
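
A condensed sketch of the two auxiliary objectives described above, assuming PyTorch; architectures, dimensions, and loss weighting are placeholders, and the repository linked below contains the authors' actual implementation. An encoder is trained jointly with an inverse model that predicts the action connecting consecutive encoded states and a contrastive discriminator that distinguishes real consecutive pairs from shuffled ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarkovAbstraction(nn.Module):
    """Sketch of inverse-model and temporal contrastive losses on an encoder
    phi. Dimensions, widths, and the loss weighting are made up."""

    def __init__(self, obs_dim, n_actions, latent_dim=32, hidden=128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, latent_dim))
        self.inverse = nn.Sequential(nn.Linear(2 * latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_actions))
        self.discriminator = nn.Sequential(nn.Linear(2 * latent_dim, hidden), nn.ReLU(),
                                           nn.Linear(hidden, 1))

    def loss(self, obs, action, next_obs):
        z, z_next = self.phi(obs), self.phi(next_obs)
        # Inverse model: predict the action linking consecutive encoded states.
        inv_loss = F.cross_entropy(self.inverse(torch.cat([z, z_next], dim=1)), action)
        # Temporal contrastive: real (z, z_next) pairs vs. shuffled negatives.
        neg = z_next[torch.randperm(z_next.size(0))]
        logits_pos = self.discriminator(torch.cat([z, z_next], dim=1))
        logits_neg = self.discriminator(torch.cat([z, neg], dim=1))
        logits = torch.cat([logits_pos, logits_neg], dim=0).squeeze(1)
        labels = torch.cat([torch.ones(len(z)), torch.zeros(len(z))])
        ctr_loss = F.binary_cross_entropy_with_logits(logits, labels)
        return inv_loss + ctr_loss
```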

* Code available at https://github.com/camall3n/markov-state-abstractions 

Learning to search efficiently for causally near-optimal treatments

Jul 02, 2020
Samuel Håkansson, Viktor Lindblom, Omer Gottesman, Fredrik D. Johansson

Finding an effective medical treatment often requires a search by trial and error. Making this search more efficient by minimizing the number of unnecessary trials could lower both costs and patient suffering. We formalize this problem as learning a policy for finding a near-optimal treatment in a minimum number of trials using a causal inference framework. We give a model-based dynamic programming algorithm which learns from observational data while being robust to unmeasured confounding. To reduce time complexity, we suggest a greedy algorithm which bounds the near-optimality constraint. The methods are evaluated on synthetic and real-world healthcare data and compared to model-free reinforcement learning. We find that our methods compare favorably to the model-free baseline while offering a more transparent trade-off between search time and treatment efficacy.
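
Purely as an illustration of the trial-and-error search being formalized (this is a generic greedy baseline, not the paper's algorithm; the probabilities and simulated patient are invented), the sketch below tries untested treatments in decreasing order of estimated success probability and stops at the first one the patient responds to.

```python
import numpy as np

def greedy_treatment_search(success_probs, patient_responds):
    """Illustrative greedy search policy (not the paper's algorithm): try
    untested treatments in decreasing order of estimated success probability,
    stopping at the first one the patient responds to. Returns the trials made."""
    order = np.argsort(success_probs)[::-1]
    trials = []
    for treatment in order:
        trials.append(int(treatment))
        if patient_responds(treatment):
            break
    return trials

# Made-up estimated effectiveness of 4 treatments and a simulated patient.
rng = np.random.default_rng(3)
est_probs = np.array([0.3, 0.7, 0.5, 0.2])
true_probs = np.array([0.2, 0.6, 0.8, 0.1])
responds = lambda t: rng.random() < true_probs[t]
print("treatments tried:", greedy_treatment_search(est_probs, responds))
```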
