Adam White

Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

May 09, 2026

Jiamin He, Samuel Neumann, Jincheng Mei, Adam White, Martha White

Abstract: Mixture policies theoretically offer greater flexibility than unimodal policies in continuous action reinforcement learning, but the practical benefits of this complexity remain elusive. Mixture policies are notably absent from most state-of-the-art algorithms, raising a fundamental question: Is the added representational overhead useful? We show that increased flexibility can theoretically enhance solution quality and entropy robustness. Yet standard algorithms like SAC do not leverage these advantages. A core issue is the lack of a low-variance reparameterization trick for mixtures, a luxury Gaussian policies enjoy. We propose a marginalized reparameterization (MRP) estimator to address this, proving it offers lower variance than the standard likelihood-ratio (LR) approach. Our experiments across Gym MuJoCo, DeepMind Control Suite, and MetaWorld show that MRP mixture policies significantly outperform their LR counterparts and match, and sometimes exceed, Gaussian policies. In addition, we find several cases where MRP mixture policies exhibit clear empirical advantages. In this paper, we provide a clearer understanding of the trade-offs involved, elevating MRP mixture policies from theoretical curiosity to a practical tool.
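To make the reparameterization issue concrete, here is a minimal sketch. It is not the paper's MRP estimator. It only shows the standard Gaussian reparameterization and why a naively sampled mixture is harder to differentiate: picking a component is a discrete draw. The function names and the 1-D setting are assumptions made for illustration.

```python
import random


def reparam_gaussian(mu, sigma, eps):
    """Pathwise (reparameterized) sample: a smooth function of mu and sigma."""
    return mu + sigma * eps


def sample_mixture(weights, mus, sigmas, rng):
    """Sample an action from a 1-D Gaussian mixture (illustrative helper).

    The component index k comes from a discrete draw, so the sampled
    action is not a differentiable function of the mixture weights.
    Only the within-component step is reparameterized here.
    """
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    eps = rng.gauss(0.0, 1.0)
    return k, reparam_gaussian(mus[k], sigmas[k], eps)
```

In this sketch, gradients with respect to `mus` and `sigmas` can flow through `reparam_gaussian`, while gradients with respect to `weights` would need a different estimator, such as the likelihood-ratio approach the abstract mentions.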

Gradient Iterated Temporal-Difference Learning

Mar 08, 2026

Théo Vincent, Kevin Gerhardt, Yogesh Tripathi, Habib Maraqten, Adam White, Martha White, Jan Peters, Carlo D'Eramo

Abstract:Temporal-difference (TD) learning is highly effective at controlling and evaluating an agent's long-term outcomes. Most approaches in this paradigm implement a semi-gradient update to boost the learning speed, which consists of ignoring the gradient of the bootstrapped estimate. While popular, this type of update is prone to divergence, as Baird's counterexample illustrates. Gradient TD methods were introduced to overcome this issue, but have not been widely used, potentially due to issues with learning speed compared to semi-gradient methods. Recently, iterated TD learning was developed to increase the learning speed of TD methods. For that, it learns a sequence of action-value functions in parallel, where each function is optimized to represent the application of the Bellman operator over the previous function in the sequence. While promising, this algorithm can be unstable due to its semi-gradient nature, as each function tracks a moving target. In this work, we modify iterated TD learning by computing the gradients over those moving targets, aiming to build a powerful gradient TD method that competes with semi-gradient methods. Our evaluation reveals that this algorithm, called Gradient Iterated Temporal-Difference learning, has a competitive learning speed against semi-gradient methods across various benchmarks, including Atari games, a result that no prior work on gradient TD methods has demonstrated.
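The difference between a semi-gradient update and an update that also differentiates the bootstrapped target can be sketched for linear value estimation. This is an illustration only: the second function is the naive full gradient of the squared TD error (often called the residual-gradient update), not a Gradient TD method and not the algorithm proposed in the paper. The function names and the linear, state-value setting are assumptions.

```python
def td_error(w, x, x_next, reward, gamma):
    """TD error for a linear value estimate v(s) = w . x(s)."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    return reward + gamma * v_next - v


def semi_gradient_step(w, x, x_next, reward, gamma, alpha):
    """Semi-gradient TD(0): the bootstrapped target is treated as a constant."""
    delta = td_error(w, x, x_next, reward, gamma)
    return [wi + alpha * delta * xi for wi, xi in zip(w, x)]


def full_gradient_step(w, x, x_next, reward, gamma, alpha):
    """Descend 0.5 * delta^2, differentiating through the target as well."""
    delta = td_error(w, x, x_next, reward, gamma)
    return [wi + alpha * delta * (xi - gamma * xn)
            for wi, xi, xn in zip(w, x, x_next)]
```

The two updates differ only in the `gamma * xn` term, which is exactly the gradient of the bootstrapped estimate that the semi-gradient update ignores.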

Fine-Tuning without Performance Degradation

May 01, 2025

Han Wang, Adam White, Martha White

Abstract: Fine-tuning policies learned offline remains a major challenge in application domains. Monotonic performance improvement during fine-tuning is often challenging, as agents typically experience performance degradation at the early fine-tuning stage. The community has identified multiple difficulties in fine-tuning a learned network online; however, the majority of progress has focused on improving learning efficiency during fine-tuning. In practice, this comes at a serious cost during fine-tuning: initially, agent performance degrades as the agent explores and effectively overrides the policy learned offline. We show that, across a range of settings, many offline-to-online algorithms exhibit either (1) performance degradation or (2) slow learning (sometimes effectively no improvement) during fine-tuning. We introduce a new fine-tuning algorithm, based on an algorithm called Jump Start, that gradually allows more exploration based on online estimates of performance. Empirically, this approach achieves fast fine-tuning and significantly reduces performance degradations compared with existing algorithms designed to do the same.

A Method for Evaluating Hyperparameter Sensitivity in Reinforcement Learning

Dec 10, 2024

Jacob Adkins, Michael Bowling, Adam White

Abstract:The performance of modern reinforcement learning algorithms critically relies on tuning ever-increasing numbers of hyperparameters. Often, small changes in a hyperparameter can lead to drastic changes in performance, and different environments require very different hyperparameter settings to achieve state-of-the-art performance reported in the literature. We currently lack a scalable and widely accepted approach to characterizing these complex interactions. This work proposes a new empirical methodology for studying, comparing, and quantifying the sensitivity of an algorithm's performance to hyperparameter tuning for a given set of environments. We then demonstrate the utility of this methodology by assessing the hyperparameter sensitivity of several commonly used normalization variants of PPO. The results suggest that several algorithmic performance improvements may, in fact, be a result of an increased reliance on hyperparameter tuning.

Real-Time Recurrent Learning using Trace Units in Reinforcement Learning

Sep 02, 2024

Esraa Elelimy, Adam White, Michael Bowling, Martha White

Abstract:Recurrent Neural Networks (RNNs) are used to learn representations in partially observable environments. For agents that learn online and continually interact with the environment, it is desirable to train RNNs with real-time recurrent learning (RTRL); unfortunately, RTRL is prohibitively expensive for standard RNNs. A promising direction is to use linear recurrent architectures (LRUs), where dense recurrent weights are replaced with a complex-valued diagonal, making RTRL efficient. In this work, we build on these insights to provide a lightweight but effective approach for training RNNs in online RL. We introduce Recurrent Trace Units (RTUs), a small modification on LRUs that we nonetheless find to have significant performance benefits over LRUs when trained with RTRL. We find RTUs significantly outperform other recurrent architectures across several partially observable environments while using significantly less computation.
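Why a diagonal recurrence makes RTRL cheap can be sketched for a single hidden unit. This is a generic diagonal linear recurrence, not the RTU architecture from the paper. The function name is hypothetical, the input projection is omitted, and the complex derivative is treated as an ordinary (holomorphic) derivative for simplicity.

```python
def diagonal_rtrl(lam, inputs):
    """Run h_t = lam * h_{t-1} + u_t and track dh_t/dlam online.

    Because each hidden unit depends only on its own recurrent weight,
    the sensitivity is one number per unit, updated in O(1) per step,
    instead of the full Jacobian a dense recurrent matrix would require.
    """
    h = 0j     # hidden state
    sens = 0j  # dh_t / dlam
    for u in inputs:
        sens = h + lam * sens  # uses h_{t-1}, so update before h
        h = lam * h + u
    return h, sens
```

With `lam = 0.5` and three unit inputs, the final state is `1 + lam + lam**2` and the tracked sensitivity matches its derivative `1 + 2 * lam`.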

The Cross-environment Hyperparameter Setting Benchmark for Reinforcement Learning

Jul 26, 2024

Andrew Patterson, Samuel Neumann, Raksha Kumaraswamy, Martha White, Adam White

Abstract:This paper introduces a new empirical methodology, the Cross-environment Hyperparameter Setting Benchmark, that compares RL algorithms across environments using a single hyperparameter setting, encouraging algorithmic development which is insensitive to hyperparameters. We demonstrate that this benchmark is robust to statistical noise and obtains qualitatively similar results across repeated applications, even when using few samples. This robustness makes the benchmark computationally cheap to apply, allowing statistically sound insights at low cost. We demonstrate two example instantiations of the CHS, on a set of six small control environments (SC-CHS) and on the entire DM Control suite of 28 environments (DMC-CHS). Finally, to illustrate the applicability of the CHS to modern RL algorithms on challenging environments, we conduct a novel empirical study of an open question in the continuous control literature. We show, with high confidence, that there is no meaningful difference in performance between Ornstein-Uhlenbeck noise and uncorrelated Gaussian noise for exploration with the DDPG algorithm on the DMC-CHS.

* Accepted to RLC 2024
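The two exploration noise processes compared in the abstract above can be sketched as follows. This is an illustration with a simple Euler discretization of the Ornstein-Uhlenbeck process. The parameter defaults are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
import random


def ou_noise(n_steps, rng, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, x0=0.0):
    """Temporally correlated noise: each sample is pulled back toward mu."""
    x = x0
    out = []
    for _ in range(n_steps):
        x = x + theta * (mu - x) * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def gaussian_noise(n_steps, rng, sigma=0.2):
    """Uncorrelated noise: each sample is drawn independently."""
    return [rng.gauss(0.0, sigma) for _ in range(n_steps)]
```

In a DDPG-style agent, either sequence would be added to the deterministic policy's action at each step. The only difference is whether consecutive perturbations are correlated.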

Investigating the Interplay of Prioritized Replay and Generalization

Jul 12, 2024

Parham Mohammad Panahi, Andrew Patterson, Martha White, Adam White

Abstract:Experience replay is ubiquitous in reinforcement learning, to reuse past data and improve sample efficiency. Though a variety of smart sampling schemes have been introduced to improve performance, uniform sampling by far remains the most common approach. One exception is Prioritized Experience Replay (PER), where sampling is done proportionally to TD errors, inspired by the success of prioritized sweeping in dynamic programming. The original work on PER showed improvements in Atari, but follow-up results are mixed. In this paper, we investigate several variations on PER, to attempt to understand where and when PER may be useful. Our findings in prediction tasks reveal that while PER can improve value propagation in tabular settings, behavior is significantly different when combined with neural networks. Certain mitigations -- like delaying target network updates to control generalization and using estimates of expected TD errors in PER to avoid chasing stochasticity -- can avoid large spikes in error with PER and neural networks, but nonetheless generally do not outperform uniform replay. In control tasks, none of the prioritized variants consistently outperform uniform replay.

* Published in the Reinforcement Learning Conference 2024
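Sampling "proportionally to TD errors" can be sketched as below. This is a minimal illustration of proportional prioritization, not the variants studied in the paper. The exponent `alpha`, the small constant `eps`, and the function names are assumptions, and importance-sampling corrections are left out.

```python
import random


def per_probabilities(td_errors, alpha=1.0, eps=1e-6):
    """Turn absolute TD errors into sampling probabilities."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors, batch_size, rng, alpha=1.0):
    """Draw replay-buffer indices with probability proportional to priority."""
    probs = per_probabilities(td_errors, alpha)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)
```

Setting `alpha=0.0` gives every transition the same priority, which recovers uniform replay, the baseline the abstract compares against.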

Position: Benchmarking is Limited in Reinforcement Learning Research

Jun 23, 2024

Scott M. Jordan, Adam White, Bruno Castro da Silva, Martha White, Philip S. Thomas

Abstract:Novel reinforcement learning algorithms, or improvements on existing ones, are commonly justified by evaluating their performance on benchmark environments and are compared to an ever-changing set of standard algorithms. However, despite numerous calls for improvements, experimental practices continue to produce misleading or unsupported claims. One reason for the ongoing substandard practices is that conducting rigorous benchmarking experiments requires substantial computational time. This work investigates the sources of increased computation costs in rigorous experiment designs. We show that conducting rigorous performance benchmarks will likely have computational costs that are often prohibitive. As a result, we argue for using an additional experimentation paradigm to overcome the limitations of benchmarking.

* 19 pages, 13 figures, The Forty-first International Conference on Machine Learning (ICML 2024)

A New View on Planning in Online Reinforcement Learning

Jun 03, 2024

Kevin Roice, Parham Mohammad Panahi, Scott M. Jordan, Adam White, Martha White

Abstract:This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can propagate value from an abstract space in a manner that helps a variety of base learners learn significantly faster in different domains.

* Published in the Planning and Reinforcement Learning Workshop at ICAPS 2024. arXiv admin note: text overlap with arXiv:2206.02902

Tuning for the Unknown: Revisiting Evaluation Strategies for Lifelong RL

Apr 02, 2024

Golnaz Mesbahi, Olya Mastikhina, Parham Mohammad Panahi, Martha White, Adam White

Abstract: In continual or lifelong reinforcement learning, access to the environment should be limited. If we aspire to design algorithms that can run for long periods of time, continually adapting to new, unexpected situations, then we must be willing to deploy our agents without tuning their hyperparameters over the agent's entire lifetime. The standard practice in deep RL -- and even continual RL -- is to assume unfettered access to the deployment environment for the full lifetime of the agent. This paper explores the notion that progress in lifelong RL research has been held back by inappropriate empirical methodologies. In this paper we propose a new approach for tuning and evaluating lifelong RL agents where only one percent of the experiment data can be used for hyperparameter tuning. We then conduct an empirical study of DQN and Soft Actor Critic across a variety of continuing and non-stationary domains. We find both methods generally perform poorly when restricted to one-percent tuning, whereas several algorithmic mitigations designed to maintain network plasticity perform surprisingly well. In addition, we find that properties designed to measure the network's ability to learn continually indeed correlate with performance under one-percent tuning.
