A. Rupam Mahmood

AVG-DICE: Stationary Distribution Correction by Regression

Mar 03, 2025

Fengdi Che, Bryan Chan, Chen Ma, A. Rupam Mahmood

Abstract:Off-policy policy evaluation (OPE), an essential component of reinforcement learning, has long suffered from stationary state distribution mismatch, undermining both stability and accuracy of OPE estimates. While existing methods correct distribution shifts by estimating density ratios, they often rely on expensive optimization or backward Bellman-based updates and struggle to outperform simpler baselines. We introduce AVG-DICE, a computationally simple Monte Carlo estimator for the density ratio that averages discounted importance sampling ratios, providing an unbiased and consistent correction. AVG-DICE extends naturally to nonlinear function approximation using regression, which we roughly tune and test on OPE tasks based on Mujoco Gym environments and compare with state-of-the-art density-ratio estimators using their reported hyperparameters. In our experiments, AVG-DICE is at least as accurate as state-of-the-art estimators and sometimes offers orders-of-magnitude improvements. However, a sensitivity analysis shows that best-performing hyperparameters may vary substantially across different discount factors, so a re-tuning is suggested.
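
The idea of averaging discounted importance sampling ratios can be illustrated with a small tabular sketch. This is not the authors' implementation: the function name, the trajectory format, the `target_prob` and `behavior_prob` callables, and the choice to average per visited state are all assumptions made for illustration, and any normalization constant the paper applies is left out.

```python
from collections import defaultdict


def average_discounted_is_ratios(trajectories, target_prob, behavior_prob, gamma):
    """Average, per state, the discounted importance sampling ratios seen on visits.

    trajectories: list of trajectories, each a list of (state, action) pairs
                  collected under the behavior policy (hypothetical format).
    target_prob, behavior_prob: callables (state, action) -> action probability.
    gamma: discount factor.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trajectory in trajectories:
        # At step t, weight = gamma^t times the product of ratios for steps before t.
        weight = 1.0
        for state, action in trajectory:
            totals[state] += weight
            counts[state] += 1
            ratio = target_prob(state, action) / behavior_prob(state, action)
            weight *= gamma * ratio
    # Unnormalized per-state averages (normalization is deliberately omitted).
    return {state: totals[state] / counts[state] for state in totals}
```

When the target and behavior policies coincide and `gamma` is 1, every ratio is 1 and each visited state receives an average of 1.0, which is a quick sanity check on the sketch.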

Deep Policy Gradient Methods Without Batch Updates, Target Networks, or Replay Buffers

Nov 22, 2024

Gautham Vasan, Mohamed Elsayed, Alireza Azimi, Jiamin He, Fahim Shahriar, Colin Bellinger, Martha White, A. Rupam Mahmood

Abstract:Modern deep policy gradient methods achieve effective performance on simulated robotic tasks, but they all require large replay buffers or expensive batch updates, or both, making them incompatible with real systems that have resource-limited computers. We show that these methods fail catastrophically when limited to small replay buffers or during incremental learning, where updates only use the most recent sample without batch updates or a replay buffer. We propose a novel incremental deep policy gradient method, Action Value Gradient (AVG), and a set of normalization and scaling techniques to address the challenges of instability in incremental learning. On robotic simulation benchmarks, we show that AVG is the only incremental method that learns effectively, often achieving final performance comparable to batch policy gradient methods. This advancement enabled us to show for the first time effective deep reinforcement learning with real robots using only incremental updates, employing a robotic manipulator and a mobile robot.

* In The Thirty-eighth Annual Conference on Neural Information Processing Systems. Source code at https://github.com/gauthamvasan/avg and companion video at https://youtu.be/cwwuN6Hyew0

Streaming Deep Reinforcement Learning Finally Works

Oct 18, 2024

Mohamed Elsayed, Gautham Vasan, A. Rupam Mahmood

Abstract:Natural intelligence processes experience as a continuous stream, sensing, acting, and learning moment-by-moment in real time. Streaming learning, the modus operandi of classic reinforcement learning (RL) algorithms like Q-learning and TD, mimics natural learning by using the most recent sample without storing it. This approach is also ideal for resource-constrained, communication-limited, and privacy-sensitive applications. However, in deep RL, learners almost always use batch updates and replay buffers, making them computationally expensive and incompatible with streaming learning. Although the prevalence of batch deep RL is often attributed to its sample efficiency, a more critical reason for the absence of streaming deep RL is its frequent instability and failure to learn, which we refer to as stream barrier. This paper introduces the stream-x algorithms, the first class of deep RL algorithms to overcome stream barrier for both prediction and control and match sample efficiency of batch RL. Through experiments in Mujoco Gym, DM Control Suite, and Atari Games, we demonstrate stream barrier in existing algorithms and successful stable learning with our stream-x algorithms: stream Q, stream AC, and stream TD, achieving the best model-free performance in DM Control Dog environments. A set of common techniques underlies the stream-x algorithms, enabling their success with a single set of hyperparameters and allowing for easy extension to other algorithms, thereby reviving streaming RL.

Revisiting Sparse Rewards for Goal-Reaching Reinforcement Learning

Jul 08, 2024

Gautham Vasan, Yan Wang, Fahim Shahriar, James Bergstra, Martin Jagersand, A. Rupam Mahmood

Abstract:Many real-world robot learning problems, such as pick-and-place or arriving at a destination, can be seen as a problem of reaching a goal state as soon as possible. These problems, when formulated as episodic reinforcement learning tasks, can easily be specified to align well with our intended goal: -1 reward every time step with termination upon reaching the goal state, called minimum-time tasks. Despite this simplicity, such formulations are often overlooked in favor of dense rewards due to their perceived difficulty and lack of informativeness. Our studies contrast the two reward paradigms, revealing that the minimum-time task specification not only facilitates learning higher-quality policies but can also surpass dense-reward-based policies on their own performance metrics. Crucially, we also identify the goal-hit rate of the initial policy as a robust early indicator for learning success in such sparse feedback settings. Finally, using four distinct real-robotic platforms, we show that it is possible to learn pixel-based policies from scratch within two to three hours using constant negative rewards.

* In Proceedings of Reinforcement Learning Conference 2024. For a video demo, see https://youtu.be/a6zlVUuKzBc
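
The minimum-time task specification described in the abstract (a reward of -1 on every time step, with termination on reaching the goal) can be sketched as follows. The function names and the optional discount factor are assumptions for illustration and are not taken from the paper.

```python
def minimum_time_rewards(steps_to_goal):
    """Reward sequence for an episode that reaches the goal after `steps_to_goal` steps."""
    # Every time step costs -1; the episode terminates once the goal is reached.
    return [-1.0] * steps_to_goal


def discounted_return(rewards, gamma=1.0):
    """Return of a reward sequence, accumulated backward from the final step."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

With no discounting, the return is simply the negative episode length, so `discounted_return(minimum_time_rewards(5))` is -5.0 and a policy that reaches the goal sooner always earns a higher return.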

Weight Clipping for Deep Continual and Reinforcement Learning

Jul 01, 2024

Mohamed Elsayed, Qingfeng Lan, Clare Lyle, A. Rupam Mahmood

Abstract:Many failures in deep continual and reinforcement learning are associated with increasing magnitudes of the weights, making them hard to change and potentially causing overfitting. While many methods address these learning failures, they often change the optimizer or the architecture, a complexity that hinders widespread adoption in various systems. In this paper, we focus on learning failures that are associated with increasing weight norm and we propose a simple technique that can be easily added on top of existing learning systems: clipping neural network weights to limit them to a specific range. We study the effectiveness of weight clipping in a series of supervised and reinforcement learning experiments. Our empirical results highlight the benefits of weight clipping for generalization, addressing loss of plasticity and policy collapse, and facilitating learning with a large replay ratio.

* Published in the First Reinforcement Learning Conference (RLC 2024). Code is available at https://github.com/mohmdelsayed/weight-clipping
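
As a minimal sketch of the technique named in the abstract, clipping weights to a fixed range after each update might look like the code below. The symmetric bound, the plain SGD update, and the function names are assumptions for illustration; the authors' repository linked above is the reference for how the range is actually chosen.

```python
def clip_weights(weights, bound):
    """Clamp every weight to the range [-bound, bound]."""
    if bound <= 0:
        raise ValueError("bound must be positive")
    return [max(-bound, min(bound, w)) for w in weights]


def sgd_step_with_clipping(weights, grads, step_size, bound):
    """One plain SGD update followed by weight clipping (hypothetical helper)."""
    updated = [w - step_size * g for w, g in zip(weights, grads)]
    return clip_weights(updated, bound)
```

For example, `clip_weights([-3.0, 0.5, 2.0], 1.0)` gives `[-1.0, 0.5, 1.0]`: weights already inside the range are untouched, and the rest are pulled back to the boundary.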

Revisiting Constant Negative Rewards for Goal-Reaching Tasks in Robot Learning

Jun 29, 2024

Gautham Vasan, Yan Wang, Fahim Shahriar, James Bergstra, Martin Jagersand, A. Rupam Mahmood

* In Proceedings of Reinforcement Learning Conference 2024. For a video demo, see https://drive.google.com/file/d/1O8D3oCWq5xf2hi1JOlMBbs6W1ClrvUFb/view?usp=sharing

More Efficient Randomized Exploration for Reinforcement Learning via Approximate Sampling

Jun 18, 2024

Haque Ishfaq, Yixin Tan, Yu Yang, Qingfeng Lan, Jianfeng Lu, A. Rupam Mahmood, Doina Precup, Pan Xu

Abstract:Thompson sampling (TS) is one of the most popular exploration techniques in reinforcement learning (RL). However, most TS algorithms with theoretical guarantees are difficult to implement and not generalizable to Deep RL. While the emerging approximate sampling-based exploration schemes are promising, most existing algorithms are specific to linear Markov Decision Processes (MDP) with suboptimal regret bounds, or only use the most basic samplers such as Langevin Monte Carlo. In this work, we propose an algorithmic framework that incorporates different approximate sampling methods with the recently proposed Feel-Good Thompson Sampling (FGTS) approach (Zhang, 2022; Dann et al., 2021), which was previously known to be computationally intractable in general. When applied to linear MDPs, our regret analysis yields the best known dependency of regret on dimensionality, surpassing existing randomized algorithms. Additionally, we provide explicit sampling complexity for each employed sampler. Empirically, we show that in tasks where deep exploration is necessary, our proposed algorithms that combine FGTS and approximate sampling perform significantly better compared to other strong baselines. On several challenging games from the Atari 57 suite, our algorithms achieve performance that is either better than or on par with other strong baselines from the deep RL literature.

* First two authors contributed equally. Accepted to the Reinforcement Learning Conference (RLC) 2024

Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning

Jun 05, 2024

Mohamed Elsayed, Homayoon Farrahi, Felix Dangel, A. Rupam Mahmood

Abstract:Second-order information is valuable for many applications but challenging to compute. Several works focus on computing or approximating Hessian diagonals, but even this simplification introduces significant additional costs compared to computing a gradient. In the absence of efficient exact computation schemes for Hessian diagonals, we revisit an early approximation scheme proposed by Becker and LeCun (1989, BL89), which has a cost similar to gradients and appears to have been overlooked by the community. We introduce HesScale, an improvement over BL89, which adds negligible extra computation. On small networks, we find that this improvement is of higher quality than all alternatives, even those with theoretical guarantees, such as unbiasedness, while being much cheaper to compute. We use this insight in reinforcement learning problems where small networks are used and demonstrate HesScale in second-order optimization and scaling the step-size parameter. In our experiments, HesScale optimizes faster than existing methods and improves stability through step-size scaling. These findings are promising for scaling second-order methods in larger models in the future.

* Published in the Proceedings of the 41st International Conference on Machine Learning (ICML 2024). Code is available at https://github.com/mohmdelsayed/HesScale. arXiv admin note: substantial text overlap with arXiv:2210.11639

Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation

May 31, 2024

Fengdi Che, Chenjun Xiao, Jincheng Mei, Bo Dai, Ramki Gummadi, Oscar A Ramirez, Christopher K Harris, A. Rupam Mahmood, Dale Schuurmans

Abstract:We prove that the combination of a target network and over-parameterized linear function approximation establishes a weaker convergence condition for bootstrapped value estimation in certain cases, even with off-policy data. Our condition is naturally satisfied for expected updates over the entire state-action space or learning with a batch of complete trajectories from episodic Markov decision processes. Notably, using only a target network or an over-parameterized model does not provide such a convergence guarantee. Additionally, we extend our results to learning with truncated trajectories, showing that convergence is achievable for all tasks with minor modifications, akin to value truncation for the final states in trajectories. Our primary result focuses on temporal difference estimation for prediction, providing high-probability value estimation error bounds and empirical analysis on Baird's counterexample and a Four-room task. Furthermore, we explore the control setting, demonstrating that similar convergence conditions apply to Q-learning.

* Proceedings of the 41st International Conference on Machine Learning, 2024

Addressing Loss of Plasticity and Catastrophic Forgetting in Continual Learning

Mar 31, 2024

Mohamed Elsayed, A. Rupam Mahmood

Abstract:Deep representation learning methods struggle with continual learning, suffering from both catastrophic forgetting of useful units and loss of plasticity, often due to rigid and unuseful units. While many methods address these two issues separately, only a few currently deal with both simultaneously. In this paper, we introduce Utility-based Perturbed Gradient Descent (UPGD) as a novel approach for the continual learning of representations. UPGD combines gradient updates with perturbations, where it applies smaller modifications to more useful units, protecting them from forgetting, and larger modifications to less useful units, rejuvenating their plasticity. We use a challenging streaming learning setup where continual learning problems have hundreds of non-stationarities and unknown task boundaries. We show that many existing methods suffer from at least one of the issues, predominantly manifested by their decreasing accuracy over tasks. On the other hand, UPGD continues to improve performance and surpasses or is competitive with all methods in all problems. Finally, in extended reinforcement learning experiments with PPO, we show that while Adam exhibits a performance drop after initial learning, UPGD avoids it by addressing both continual learning issues.

* Published in the Proceedings of the 12th International Conference on Learning Representations (ICLR 2024)
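
The update described in the abstract (gradient steps combined with perturbations, scaled down for more useful units and up for less useful ones) can be sketched roughly as below. This is an illustrative guess at the general shape, not the published algorithm: the per-weight utilities are assumed to be given and already scaled to [0, 1], the Gaussian noise and the `(1 - utility)` gate are assumptions, and how utilities are estimated is not covered here.

```python
import random


def utility_gated_perturbed_step(weights, grads, utilities, step_size, noise_std, rng=None):
    """One perturbed gradient step where high-utility weights change less.

    utilities: hypothetical per-weight utility scores, assumed to lie in [0, 1].
    """
    rng = rng or random.Random(0)
    new_weights = []
    for w, g, u in zip(weights, grads, utilities):
        if not 0.0 <= u <= 1.0:
            raise ValueError("utilities must lie in [0, 1]")
        noise = rng.gauss(0.0, noise_std)
        # The gate (1 - u) shrinks both the gradient and the perturbation
        # for useful weights, and leaves them at full size for unused ones.
        new_weights.append(w - step_size * (g + noise) * (1.0 - u))
    return new_weights
```

In this sketch a weight with utility 1.0 is fully protected and does not move, while a weight with utility 0.0 receives the full gradient step plus noise.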
