Martha White

Demystifying the Recency Heuristic in Temporal-Difference Learning

Jun 18, 2024

Brett Daley, Marlos C. Machado, Martha White

Abstract:The recency heuristic in reinforcement learning is the assumption that stimuli that occurred closer in time to an acquired reward should be more heavily reinforced. The recency heuristic is one of the key assumptions made by TD($\lambda$), which reinforces recent experiences according to an exponentially decaying weighting. In fact, all other widely used return estimators for TD learning, such as $n$-step returns, satisfy a weaker (i.e., non-monotonic) recency heuristic. Why is the recency heuristic effective for temporal credit assignment? What happens when credit is assigned in a way that violates this heuristic? In this paper, we analyze the specific mathematical implications of adopting the recency heuristic in TD learning. We prove that any return estimator satisfying this heuristic: 1) is guaranteed to converge to the correct value function, 2) has a relatively fast contraction rate, and 3) has a long window of effective credit assignment, yet bounded worst-case variance. We also give a counterexample where on-policy, tabular TD methods violating the recency heuristic diverge. Our results offer some of the first theoretical evidence that credit assignment based on the recency heuristic facilitates learning.

* RLC 2024. 18 pages, 8 figures, 1 table
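
The exponentially decaying weighting that the abstract above attributes to TD($\lambda$) can be illustrated with a minimal sketch of tabular TD($\lambda$) using accumulating eligibility traces. This is the textbook algorithm, not code from the paper; the function name, the three-state episode, and the hyperparameter values are assumptions chosen for illustration.

```python
def td_lambda_episode(states, rewards, num_states, alpha, gamma, lam):
    """Run tabular TD(lambda) with accumulating traces over one episode.

    states[t] is the state visited at step t and rewards[t] is the reward
    received on leaving it; the episode terminates after the last reward.
    """
    values = [0.0] * num_states
    traces = [0.0] * num_states
    for t, (s, r) in enumerate(zip(states, rewards)):
        terminal = t == len(states) - 1
        next_value = 0.0 if terminal else values[states[t + 1]]
        td_error = r + gamma * next_value - values[s]
        # Every trace decays by gamma * lam; the current state's trace is bumped.
        traces = [gamma * lam * e for e in traces]
        traces[s] += 1.0
        # Each state is updated in proportion to its (decayed) trace.
        values = [v + alpha * td_error * e for v, e in zip(values, traces)]
    return values


# Hypothetical episode: states 0 -> 1 -> 2 -> terminal, reward only at the end.
values = td_lambda_episode([0, 1, 2], [0.0, 0.0, 1.0], num_states=3,
                           alpha=0.1, gamma=0.9, lam=0.5)
print(values)  # credit shrinks with distance from the reward
```

In this toy episode the state visited immediately before the reward receives the largest update, and each earlier state receives a share smaller by a factor of `gamma * lam`, which is the recency pattern the abstract describes.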

A New View on Planning in Online Reinforcement Learning

Jun 03, 2024

Kevin Roice, Parham Mohammad Panahi, Scott M. Jordan, Adam White, Martha White

Abstract:This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can propagate value from an abstract space in a manner that helps a variety of base learners learn significantly faster in different domains.

* Published in the Planning and Reinforcement Learning Workshop at ICAPS 2024. arXiv admin note: text overlap with arXiv:2206.02902

Tuning for the Unknown: Revisiting Evaluation Strategies for Lifelong RL

Apr 02, 2024

Golnaz Mesbahi, Olya Mastikhina, Parham Mohammad Panahi, Martha White, Adam White

Abstract:In continual or lifelong reinforcement learning, access to the environment should be limited. If we aspire to design algorithms that can run for long periods of time, continually adapting to new, unexpected situations, then we must be willing to deploy our agents without tuning their hyperparameters over the agent's entire lifetime. The standard practice in deep RL -- and even continual RL -- is to assume unfettered access to the deployment environment for the full lifetime of the agent. This paper explores the notion that progress in lifelong RL research has been held back by inappropriate empirical methodologies. In this paper we propose a new approach for tuning and evaluating lifelong RL agents where only one percent of the experiment data can be used for hyperparameter tuning. We then conduct an empirical study of DQN and Soft Actor Critic across a variety of continuing and non-stationary domains. We find both methods generally perform poorly when restricted to one-percent tuning, whereas several algorithmic mitigations designed to maintain network plasticity perform surprisingly well. In addition, we find that properties designed to measure the network's ability to learn continually indeed correlate with performance under one-percent tuning.

Investigating the Histogram Loss in Regression

Feb 20, 2024

Ehsan Imani, Kai Luedemann, Sam Scholnick-Hughes, Esraa Elelimy, Martha White

Abstract:It is becoming increasingly common in regression to train neural networks that model the entire distribution even if only the mean is required for prediction. This additional modeling often comes with performance gain and the reasons behind the improvement are not fully known. This paper investigates a recent approach to regression, the Histogram Loss, which involves learning the conditional distribution of the target variable by minimizing the cross-entropy between a target distribution and a flexible histogram prediction. We design theoretical and empirical analyses to determine why and when this performance gain appears, and how different components of the loss contribute to it. Our results suggest that the benefits of learning distributions in this setup come from improvements in optimization rather than learning a better representation. We then demonstrate the viability of the Histogram Loss in common deep learning applications without a need for costly hyperparameter tuning.

* 50 pages
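
As a rough illustration of the loss described in the abstract above (cross-entropy between a target distribution and a histogram prediction), the sketch below uses only the Python standard library. The Gaussian-shaped target, the bin centers, and all function names are assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(logits):
    """Turn per-bin logits into a histogram that sums to one."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_target(y, centers, sigma):
    """Assumed target: a discretized Gaussian around the scalar label y."""
    weights = [math.exp(-0.5 * ((c - y) / sigma) ** 2) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]


def histogram_loss(logits, target):
    """Cross-entropy between the target distribution and the prediction."""
    probs = softmax(logits)
    return -sum(q * math.log(p) for q, p in zip(target, probs) if q > 0.0)


def predicted_mean(logits, centers):
    """Point prediction: the mean of the predicted histogram."""
    return sum(p * c for p, c in zip(softmax(logits), centers))


centers = [0.0, 1.0, 2.0, 3.0]        # hypothetical bin centers
logits = [0.0, 0.0, 0.0, 0.0]         # an untrained, uniform prediction
target = gaussian_target(1.2, centers, sigma=0.5)
print(histogram_loss(logits, target), predicted_mean(logits, centers))
```

With uniform logits the loss equals the log of the number of bins regardless of the target, which gives a quick sanity check before training. The mean of the histogram is used for the final prediction, matching the setting where only the mean is required.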

What to Do When Your Discrete Optimization Is the Size of a Neural Network?

Feb 15, 2024

Hugo Silva, Martha White

Abstract:Oftentimes, machine learning applications using neural networks involve solving discrete optimization problems, such as in pruning, parameter-isolation-based continual learning and training of binary networks. Still, these discrete problems are combinatorial in nature and are also not amenable to gradient-based optimization. Additionally, classical approaches used in discrete settings do not scale well to large neural networks, forcing scientists and empiricists to rely on alternative methods. Among these, two main distinct sources of top-down information can be used to lead the model to good solutions: (1) extrapolating gradient information from points outside of the solution set and (2) comparing evaluations between members of a subset of the valid solutions. We take continuation path (CP) methods to represent using purely the former and Monte Carlo (MC) methods to represent the latter, while also noting that some hybrid methods combine the two. The main goal of this work is to compare both approaches. For that purpose, we first overview the two classes while also discussing some of their drawbacks analytically. Then, in the experimental section, we compare their performance, starting with smaller microworld experiments, which allow more fine-grained control of problem variables, and gradually moving towards larger problems, including neural network regression and neural network pruning for image classification, where we additionally compare against magnitude-based pruning.

* Submitted to JMLR

Compound Returns Reduce Variance in Reinforcement Learning

Feb 06, 2024

Brett Daley, Martha White, Marlos C. Machado

Abstract:Multistep returns, such as $n$-step returns and $\lambda$-returns, are commonly used to improve the sample efficiency of reinforcement learning (RL) methods. The variance of the multistep returns becomes the limiting factor in their length; looking too far into the future increases variance and reverses the benefits of multistep learning. In our work, we demonstrate the ability of compound returns -- weighted averages of $n$-step returns -- to reduce variance. We prove for the first time that any compound return with the same contraction modulus as a given $n$-step return has strictly lower variance. We additionally prove that this variance-reduction property improves the finite-sample complexity of temporal-difference learning under linear function approximation. Because general compound returns can be expensive to implement, we introduce two-bootstrap returns which reduce variance while remaining efficient, even when using minibatched experience replay. We conduct experiments showing that two-bootstrap returns can improve the sample efficiency of $n$-step deep RL agents, with little additional computational cost.

* Preprint. 8 pages, 5 figures, 3 tables
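
The abstract above defines compound returns as weighted averages of $n$-step returns. A minimal sketch of that definition follows, assuming the standard $n$-step return $\sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n V(s_{t+n})$; the function names, toy trajectory, and weights are illustrative assumptions rather than the paper's implementation.

```python
def n_step_return(rewards, values, n, gamma):
    """n-step return from time 0.

    rewards[i] is the reward after step i and values[i] is the value
    estimate of the state at step i, so len(values) == len(rewards) + 1.
    """
    discounted_rewards = sum(gamma ** i * rewards[i] for i in range(n))
    return discounted_rewards + gamma ** n * values[n]


def compound_return(rewards, values, weights, gamma):
    """Weighted average of n-step returns; weights maps n -> weight."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * n_step_return(rewards, values, n, gamma)
               for n, w in weights.items())


# Hypothetical three-step trajectory with bootstrapped value estimates.
rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 10.0]
averaged = compound_return(rewards, values, {1: 0.5, 3: 0.5}, gamma=0.5)
print(averaged)
```

The example averages exactly two $n$-step returns, which needs only two bootstrap values per update. This is offered as one simple instance of a compound return; whether it matches the two-bootstrap returns introduced in the paper in every detail is not something the abstract settles.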

When is Offline Policy Selection Sample Efficient for Reinforcement Learning?

Dec 04, 2023

Vincent Liu, Prabhat Nagarajan, Andrew Patterson, Martha White

Abstract:Offline reinforcement learning algorithms often require careful hyperparameter tuning. Consequently, before deployment, we need to select amongst a set of candidate policies. As yet, however, there is little understanding about the fundamental limits of this offline policy selection (OPS) problem. In this work we aim to provide clarity on when sample efficient OPS is possible, primarily by connecting OPS to off-policy policy evaluation (OPE) and Bellman error (BE) estimation. We first show a hardness result, that in the worst case, OPS is just as hard as OPE, by proving a reduction of OPE to OPS. As a result, no OPS method can be more sample efficient than OPE in the worst case. We then propose a BE method for OPS, called Identifiable BE Selection (IBES), that has a straightforward method for selecting its own hyperparameters. We highlight that using IBES for OPS generally has more requirements than OPE methods, but if satisfied, can be more sample efficient. We conclude with an empirical study comparing OPE and IBES, and by showing the difficulty of OPS on an offline Atari benchmark dataset.

GVFs in the Real World: Making Predictions Online for Water Treatment

Dec 04, 2023

Muhammad Kamran Janjua, Haseeb Shah, Martha White, Erfan Miahi, Marlos C. Machado, Adam White

Abstract:In this paper we investigate the use of reinforcement-learning based prediction approaches for a real drinking-water treatment plant. Developing such a prediction system is a critical step on the path to optimizing and automating water treatment. Before that, there are many questions to answer about the predictability of the data, suitable neural network architectures, how to overcome partial observability and more. We first describe this dataset, and highlight challenges with seasonality, nonstationarity, partial observability, and heterogeneity across sensors and operation modes of the plant. We then describe General Value Function (GVF) predictions -- discounted cumulative sums of observations -- and highlight why they might be preferable to classical n-step predictions common in time series prediction. We discuss how to use offline data to appropriately pre-train our temporal difference learning (TD) agents that learn these GVF predictions, including how to select hyperparameters for online fine-tuning in deployment. We find that the TD-prediction agent obtains an overall lower normalized mean-squared error than the n-step prediction agent. Finally, we show the importance of learning in deployment, by comparing a TD agent trained purely offline with no online updating to a TD agent that learns online. This final result is one of the first to motivate the importance of adapting predictions in real-time, for non-stationary high-volume systems in the real world.

* Published in Machine Learning (2023): 1-31
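
To make the contrast in the abstract above concrete, the sketch below computes the two kinds of prediction targets for a short sensor signal: a GVF-style target (a discounted cumulative sum of future observations) and a classical n-step target (the observation n steps ahead). The indexing convention, the truncation at the end of the series, and the function names are assumptions for illustration, not details from the paper.

```python
def gvf_targets(signal, gamma):
    """Discounted sum of future observations for each time step.

    targets[t] = signal[t+1] + gamma * signal[t+2] + ..., truncated at
    the end of the recorded series (an assumption of this sketch).
    """
    targets = [0.0] * len(signal)
    running = 0.0
    for t in reversed(range(len(signal))):
        targets[t] = running
        running = signal[t] + gamma * running
    return targets


def n_step_targets(signal, n):
    """Observation exactly n steps ahead, or None where it is unavailable."""
    return [signal[t + n] if t + n < len(signal) else None
            for t in range(len(signal))]


signal = [1.0, 2.0, 3.0]  # hypothetical sensor readings
print(gvf_targets(signal, gamma=0.5))
print(n_step_targets(signal, n=1))
```

The GVF-style target summarizes the whole future of the signal with geometrically shrinking weights, whereas the n-step target looks at a single future reading. A TD agent would learn the former by bootstrapping online rather than by computing these sums from logged data as done here.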

Measuring and Mitigating Interference in Reinforcement Learning

Jul 10, 2023

Vincent Liu, Han Wang, Ruo Yu Tao, Khurram Javed, Adam White, Martha White

Abstract:Catastrophic interference is common in many network-based learning systems, and many proposals exist for mitigating it. Before overcoming interference we must understand it better. In this work, we provide a definition and novel measure of interference for value-based reinforcement learning methods such as Fitted Q-Iteration and DQN. We systematically evaluate our measure of interference, showing that it correlates with instability in control performance, across a variety of network architectures. Our new interference measure allows us to ask novel scientific questions about commonly used deep learning architectures and study learning algorithms which mitigate interference. Lastly, we outline a class of algorithms which we call online-aware that are designed to mitigate interference, and show they do reduce interference according to our measure and that they improve stability and performance in several classic control environments.

* Published at Conference on Lifelong Learning Agents (CoLLAs) 2023

Coagent Networks: Generalized and Scaled

May 16, 2023

James E. Kostas, Scott M. Jordan, Yash Chandak, Georgios Theocharous, Dhawal Gupta, Martha White, Bruno Castro da Silva, Philip S. Thomas

Abstract:Coagent networks for reinforcement learning (RL) [Thomas and Barto, 2011] provide a powerful and flexible framework for deriving principled learning rules for arbitrary stochastic neural networks. The coagent framework offers an alternative to backpropagation-based deep learning (BDL) that overcomes some of backpropagation's main limitations. For example, coagent networks can compute different parts of the network asynchronously (at different rates or at different times), can incorporate non-differentiable components that cannot be used with backpropagation, and can explore at levels higher than their action spaces (that is, they can be designed as hierarchical networks for exploration and/or temporal abstraction). However, the coagent framework is not just an alternative to BDL; the two approaches can be blended: BDL can be combined with coagent learning rules to create architectures with the advantages of both approaches. This work generalizes the coagent theory and learning rules provided by previous works; this generalization provides more flexibility for network architecture design within the coagent framework. This work also studies one of the chief disadvantages of coagent networks: high variance updates for networks that have many coagents and do not use backpropagation. We show that a coagent algorithm with a policy network that does not use backpropagation can scale to a challenging RL domain with a high-dimensional state and action space (the MuJoCo Ant environment), learning reasonable (although not state-of-the-art) policies. These contributions motivate and provide a more general theoretical foundation for future work that studies coagent networks.
