Philip S. Thomas

Optimizing for the Future in Non-Stationary MDPs

May 17, 2020
Yash Chandak, Georgios Theocharous, Shiv Shankar, Sridhar Mahadevan, Martha White, Philip S. Thomas

Most reinforcement learning methods are based upon the key assumption that the transition dynamics and reward functions are fixed, that is, the underlying Markov decision process (MDP) is stationary. However, in many practical real-world applications, this assumption is violated. We discuss how current methods can have inherent limitations for non-stationary MDPs, and argue that searching for a policy that is good for the future, unknown MDP therefore requires rethinking the optimization paradigm. To address this problem, we develop a method that builds upon ideas from both counterfactual reasoning and curve-fitting to proactively search for a good future policy, without ever modeling the underlying non-stationarity. Interestingly, we observe that minimizing performance over some of the data from past episodes might be beneficial when searching for a policy that maximizes future performance. The effectiveness of the proposed method is demonstrated on problems motivated by real-world applications.
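As a rough illustration of the curve-fitting half of that idea, the sketch below fits a least-squares trend to per-episode performance estimates and extrapolates it one episode ahead. The function names, the polynomial basis, and the synthetic numbers are assumptions made for illustration; the counterfactual estimation of past performance is not shown, and this is not the authors' implementation.

```python
import numpy as np

def _design(t, degree):
    # Columns [1, t, t^2, ...]; the polynomial basis is an assumption.
    return np.vander(np.asarray(t, dtype=float), degree + 1, increasing=True)

def forecast_weights(num_episodes, degree=1):
    """Weights w such that the one-step-ahead forecast equals w @ past_performance."""
    X = _design(np.arange(num_episodes), degree)
    x_next = _design([num_episodes], degree)[0]
    # Least-squares forecast: x_next @ pinv(X) @ y, so the weights are x_next @ pinv(X).
    return x_next @ np.linalg.pinv(X)

def forecast_next_performance(past_performance, degree=1):
    """Extrapolate the fitted trend to the next (unseen) episode."""
    y = np.asarray(past_performance, dtype=float)
    return float(forecast_weights(len(y), degree) @ y)

# Synthetic example: performance drifting downward by 0.5 per episode.
history = [10.0, 9.5, 9.0, 8.5, 8.0]
prediction = forecast_next_performance(history)
weights = forecast_weights(len(history))
```

With a linear trend, the weights on the earliest episodes come out negative, so raising the forecast can mean lowering the estimated performance on those episodes. That is one plausible reading of the abstract's remark about minimizing performance over some past data.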

Learning Reusable Options for Multi-Task Reinforcement Learning

Jan 06, 2020
Francisco M. Garcia, Chris Nota, Philip S. Thomas

Reinforcement learning (RL) has become an increasingly active area of research in recent years. Although there are many algorithms that allow an agent to solve tasks efficiently, they often ignore the possibility that prior experience related to the task at hand might be available. For many practical applications, it might be infeasible for an agent to learn how to solve a task from scratch, given that it is generally a computationally expensive process; however, prior experience could be leveraged to make these problems tractable in practice. In this paper, we propose a framework for exploiting existing experience by learning reusable options. We show that after an agent learns policies for solving a small number of problems, we are able to use the trajectories generated by those policies to learn reusable options that allow an agent to quickly learn how to solve novel and related problems.

* 15 pages, 7 figures, pre-print

Reinforcement learning with a network of spiking agents

Nov 10, 2019
Sneha Aenugu, Abhishek Sharma, Sasikiran Yelamarthi, Hananel Hazan, Philip S. Thomas, Robert Kozma

Neuroscientific theory suggests that dopaminergic neurons broadcast global reward prediction errors to large areas of the brain, influencing the synaptic plasticity of the neurons in those regions. We build on this theory to propose a multi-agent learning framework with spiking neurons in the generalized linear model (GLM) formulation as agents, to solve reinforcement learning (RL) tasks. We show that a network of GLM spiking agents connected in a hierarchical fashion, where each spiking agent modulates its firing policy based on local information and a global prediction error, can learn complex action representations to solve RL tasks. We further show how leveraging principles of modularity and population coding inspired by the brain can help reduce variance in the learning updates, making it a viable optimization technique.

Reinforcement learning with spiking coagents

Oct 31, 2019
Sneha Aenugu, Abhishek Sharma, Sasikiran Yelamarthi, Hananel Hazan, Philip S. Thomas, Robert Kozma

Is the Policy Gradient a Gradient?

Jun 17, 2019
Chris Nota, Philip S. Thomas

The policy gradient theorem describes the gradient of the expected discounted return with respect to an agent's policy parameters. However, most policy gradient methods do not use the discount factor in the manner originally prescribed, and therefore do not optimize the discounted objective. It has been an open question in RL as to which, if any, objective they optimize instead. We show that the direction followed by these methods is not the gradient of any objective, and reclassify them as semi-gradient methods with respect to the undiscounted objective. Further, we show that they are not guaranteed to converge to a locally optimal policy, and construct a counterexample where they will converge to the globally pessimal policy with respect to both the discounted and undiscounted objectives.
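To make the distinction concrete, the sketch below computes the per-step weight that multiplies the score term grad log pi(a_t | s_t) in a single-trajectory estimate, once with the extra gamma^t factor from the policy gradient theorem and once without it, as is common in practice. The function names and the toy reward sequence are assumptions for illustration, not code from the paper.

```python
import numpy as np

def returns_to_go(rewards, gamma):
    """Discounted return from each step onward: G_t = sum_k gamma^(k-t) * r_k."""
    g = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        g[t] = running
    return g

def score_weights(rewards, gamma, use_gamma_t):
    """Weight on the score term at each step of one trajectory."""
    g = returns_to_go(rewards, gamma)
    if use_gamma_t:
        # Policy gradient theorem for the discounted objective: gamma^t * G_t.
        return (gamma ** np.arange(len(rewards))) * g
    # Common practice: the gamma^t factor is dropped.
    return g

rewards = [1.0, 1.0, 1.0]
with_factor = score_weights(rewards, gamma=0.5, use_gamma_t=True)
without_factor = score_weights(rewards, gamma=0.5, use_gamma_t=False)
```

In this toy case the two variants agree only at the first step; later steps are down-weighted by gamma^t in the first variant and not in the second.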

* 6 pages, 3 figures

Classical Policy Gradient: Preserving Bellman's Principle of Optimality

Jun 06, 2019
Philip S. Thomas, Scott M. Jordan, Yash Chandak, Chris Nota, James Kostas

We propose a new objective function for finite-horizon episodic Markov decision processes that better captures Bellman's principle of optimality, and provide an expression for the gradient of the objective.

* 1 page, 0 figures

Reinforcement Learning When All Actions are Not Always Available

Jun 05, 2019
Yash Chandak, Georgios Theocharous, Blossom Metevier, Philip S. Thomas

The Markov decision process (MDP) formulation used to model many real-world sequential decision making problems does not capture the setting where the set of available decisions (actions) at each time step is stochastic. Recently, the stochastic action set Markov decision process (SAS-MDP) formulation has been proposed, which captures the concept of a stochastic action set. In this paper we argue that existing RL algorithms for SAS-MDPs suffer from divergence issues, and present new algorithms for SAS-MDPs that incorporate variance reduction techniques unique to this setting, and provide conditions for their convergence. We conclude with experiments that demonstrate the practicality of our approaches using several tasks inspired by real-life use cases wherein the action set is stochastic.
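One simple way to picture a policy acting under a stochastic action set is a softmax restricted to whichever actions happen to be available at the current step. The sketch below shows that masking step only; the function name and the example mask are assumptions, and the variance-reduction techniques and convergence conditions mentioned in the abstract are not represented.

```python
import numpy as np

def masked_softmax(logits, available):
    """Action probabilities that are zero for unavailable actions."""
    logits = np.asarray(logits, dtype=float)
    available = np.asarray(available, dtype=bool)
    if not available.any():
        raise ValueError("at least one action must be available")
    # Unavailable actions get -inf so they receive zero probability.
    masked = np.where(available, logits, -np.inf)
    # Subtract the max over available actions for numerical stability.
    masked = masked - masked.max()
    exp = np.exp(masked)
    return exp / exp.sum()

# Four actions with equal preference; only actions 0 and 2 are available now.
probs = masked_softmax([0.0, 0.0, 0.0, 0.0], [True, False, True, False])
```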

Lifelong Learning with a Changing Action Set

Jun 05, 2019
Yash Chandak, Georgios Theocharous, Chris Nota, Philip S. Thomas

In many real-world sequential decision making problems, the number of available actions (decisions) can vary over time. While problems like catastrophic forgetting, changing transition dynamics, changing reward functions, etc. have been well-studied in the lifelong learning literature, the setting where the action set changes remains unaddressed. In this paper, we present an algorithm that autonomously adapts to an action set whose size changes over time. To tackle this open problem, we break it into two problems that can be solved iteratively: inferring the underlying, unknown structure in the space of actions and optimizing a policy that leverages this structure. We demonstrate the efficiency of this approach on large-scale real-world lifelong learning problems.

A New Confidence Interval for the Mean of a Bounded Random Variable

May 15, 2019
Erik Learned-Miller, Philip S. Thomas

We present a new method for constructing a confidence interval for the mean of a bounded random variable from samples of the random variable. We conjecture that the confidence interval has guaranteed coverage, i.e., that it contains the mean with high probability for all distributions on a bounded interval, for all sample sizes, and for all confidence levels. This new method provides confidence intervals that are competitive with those produced using Student's t-statistic, but does not rely on normality assumptions. In particular, its only requirement is that the distribution be bounded on a known finite interval.
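The abstract does not spell out the construction, so the sketch below is not the proposed method. It shows a standard distribution-free interval for the same setting, based on Hoeffding's inequality, which also requires only known bounds on the variable. The function name and the sample values are assumptions for illustration.

```python
import math

def hoeffding_interval(samples, lower, upper, delta=0.05):
    """Two-sided (1 - delta) confidence interval for the mean of a variable in [lower, upper]."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    mean = sum(samples) / n
    # Hoeffding: P(|mean - mu| >= eps) <= 2 * exp(-2 * n * eps^2 / (upper - lower)^2).
    half_width = (upper - lower) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    # The true mean must lie in [lower, upper], so the interval can be clipped.
    return max(lower, mean - half_width), min(upper, mean + half_width)

# 100 samples in [0, 1] with sample mean 0.5.
samples = [0.0, 1.0] * 50
lo, hi = hoeffding_interval(samples, lower=0.0, upper=1.0, delta=0.05)
```

Intervals of this kind are valid for every sample size but tend to be wide, which is the sort of gap a tighter bounded-variable method would aim to close.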
