Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shangtong Zhang

Extensions of Robbins-Siegmund Theorem with Applications in Reinforcement Learning

Sep 30, 2025

Xinyu Liu, Zixuan Xie, Shangtong Zhang

Abstract:The Robbins-Siegmund theorem establishes the convergence of stochastic processes that are almost supermartingales and is foundational for analyzing a wide range of stochastic iterative algorithms in stochastic approximation and reinforcement learning (RL). However, its original form has a significant limitation as it requires the zero-order term to be summable. In many important RL applications, this summable condition, however, cannot be met. This limitation motivates us to extend the Robbins-Siegmund theorem for almost supermartingales where the zero-order term is not summable but only square summable. Particularly, we introduce a novel and mild assumption on the increments of the stochastic processes. This together with the square summable condition enables an almost sure convergence to a bounded set. Additionally, we further provide almost sure convergence rates, high probability concentration bounds, and $L^p$ convergence rates. We then apply the new results in stochastic approximation and RL. Notably, we obtain the first almost sure convergence rate, the first high probability concentration bound, and the first $L^p$ convergence rate for $Q$-learning with linear function approximation.

Via

Access Paper or Ask Questions

Finite Sample Analysis of Linear Temporal Difference Learning with Arbitrary Features

May 27, 2025

Zixuan Xie, Xinyu Liu, Rohan Chandra, Shangtong Zhang

Abstract:Linear TD($\lambda$) is one of the most fundamental reinforcement learning algorithms for policy evaluation. Previously, convergence rates are typically established under the assumption of linearly independent features, which does not hold in many practical scenarios. This paper instead establishes the first $L^2$ convergence rates for linear TD($\lambda$) operating under arbitrary features, without making any algorithmic modification or additional assumptions. Our results apply to both the discounted and average-reward settings. To address the potential non-uniqueness of solutions resulting from arbitrary features, we develop a novel stochastic approximation result featuring convergence rates to the solution set instead of a single point.

Via

Access Paper or Ask Questions

Counterfactual Explanations for Continuous Action Reinforcement Learning

May 19, 2025

Shuyang Dong, Shangtong Zhang, Lu Feng

Abstract:Reinforcement Learning (RL) has shown great promise in domains like healthcare and robotics but often struggles with adoption due to its lack of interpretability. Counterfactual explanations, which address "what if" scenarios, provide a promising avenue for understanding RL decisions but remain underexplored for continuous action spaces. We propose a novel approach for generating counterfactual explanations in continuous action RL by computing alternative action sequences that improve outcomes while minimizing deviations from the original sequence. Our approach leverages a distance metric for continuous actions and accounts for constraints such as adhering to predefined policies in specific states. Evaluations in two RL domains, Diabetes Control and Lunar Lander, demonstrate the effectiveness, efficiency, and generalization of our approach, enabling more interpretable and trustworthy RL applications.

* Accepted by International Joint Conference on Artificial Intelligence (IJCAI) 2025

Via

Access Paper or Ask Questions

Experience Replay Addresses Loss of Plasticity in Continual Learning

Mar 25, 2025

Jiuqi Wang, Rohan Chandra, Shangtong Zhang

Figure 1 for Experience Replay Addresses Loss of Plasticity in Continual Learning

Figure 2 for Experience Replay Addresses Loss of Plasticity in Continual Learning

Figure 3 for Experience Replay Addresses Loss of Plasticity in Continual Learning

Figure 4 for Experience Replay Addresses Loss of Plasticity in Continual Learning

Abstract:Loss of plasticity is one of the main challenges in continual learning with deep neural networks, where neural networks trained via backpropagation gradually lose their ability to adapt to new tasks and perform significantly worse than their freshly initialized counterparts. The main contribution of this paper is to propose a new hypothesis that experience replay addresses the loss of plasticity in continual learning. Here, experience replay is a form of memory. We provide supporting evidence for this hypothesis. In particular, we demonstrate in multiple different tasks, including regression, classification, and policy evaluation, that by simply adding an experience replay and processing the data in the experience replay with Transformers, the loss of plasticity disappears. Notably, we do not alter any standard components of deep learning. For example, we do not change backpropagation. We do not modify the activation functions. And we do not use any regularization. We conjecture that experience replay and Transformers can address the loss of plasticity because of the in-context learning phenomenon.

* 14 pages, 4 figures

Via

Access Paper or Ask Questions

GameChat: Multi-LLM Dialogue for Safe, Agile, and Socially Optimal Multi-Agent Navigation in Constrained Environments

Mar 16, 2025

Vagul Mahadevan, Shangtong Zhang, Rohan Chandra

Figure 1 for GameChat: Multi-LLM Dialogue for Safe, Agile, and Socially Optimal Multi-Agent Navigation in Constrained Environments

Figure 2 for GameChat: Multi-LLM Dialogue for Safe, Agile, and Socially Optimal Multi-Agent Navigation in Constrained Environments

Figure 3 for GameChat: Multi-LLM Dialogue for Safe, Agile, and Socially Optimal Multi-Agent Navigation in Constrained Environments

Figure 4 for GameChat: Multi-LLM Dialogue for Safe, Agile, and Socially Optimal Multi-Agent Navigation in Constrained Environments

Abstract:Safe, agile, and socially compliant multi-robot navigation in cluttered and constrained environments remains a critical challenge. This is especially difficult with self-interested agents in decentralized settings, where there is no central authority to resolve conflicts induced by spatial symmetry. We address this challenge by proposing a novel approach, GameChat, which facilitates safe, agile, and deadlock-free navigation for both cooperative and self-interested agents. Key to our approach is the use of natural language communication to resolve conflicts, enabling agents to prioritize more urgent tasks and break spatial symmetry in a socially optimal manner. Our algorithm ensures subgame perfect equilibrium, preventing agents from deviating from agreed-upon behaviors and supporting cooperation. Furthermore, we guarantee safety through control barrier functions and preserve agility by minimizing disruptions to agents' planned trajectories. We evaluate GameChat in simulated environments with doorways and intersections. The results show that even in the worst case, GameChat reduces the time for all agents to reach their goals by over 35% from a naive baseline and by over 20% from SMG-CBF in the intersection scenario, while doubling the rate of ensuring the agent with a higher priority task reaches the goal first, from 50% (equivalent to random chance) to a 100% perfect performance at maximizing social welfare.

Via

Access Paper or Ask Questions

Group Fairness in Multi-Task Reinforcement Learning

Mar 10, 2025

Kefan Song, Runnan Jiang, Rohan Chandra, Shangtong Zhang

Figure 1 for Group Fairness in Multi-Task Reinforcement Learning

Figure 2 for Group Fairness in Multi-Task Reinforcement Learning

Figure 3 for Group Fairness in Multi-Task Reinforcement Learning

Figure 4 for Group Fairness in Multi-Task Reinforcement Learning

Abstract:This paper addresses a critical societal consideration in the application of Reinforcement Learning (RL): ensuring equitable outcomes across different demographic groups in multi-task settings. While previous work has explored fairness in single-task RL, many real-world applications are multi-task in nature and require policies to maintain fairness across all tasks. We introduce a novel formulation of multi-task group fairness in RL and propose a constrained optimization algorithm that explicitly enforces fairness constraints across multiple tasks simultaneously. We have shown that our proposed algorithm does not violate fairness constraints with high probability and with sublinear regret in the finite-horizon episodic setting. Through experiments in RiverSwim and MuJoCo environments, we demonstrate that our approach better ensures group fairness across multiple tasks compared to previous methods that lack explicit multi-task fairness constraints in both the finite-horizon setting and the infinite-horizon setting. Our results show that the proposed algorithm achieves smaller fairness gaps while maintaining comparable returns across different demographic groups and tasks, suggesting its potential for addressing fairness concerns in real-world multi-task RL applications.

Via

Access Paper or Ask Questions

Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models

Mar 10, 2025

Kefan Song, Jin Yao, Runnan Jiang, Rohan Chandra, Shangtong Zhang

Figure 1 for Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models

Figure 2 for Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models

Figure 3 for Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models

Figure 4 for Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models

Abstract:As Large Language Models (LLMs) become increasingly powerful and accessible to human users, ensuring fairness across diverse demographic groups, i.e., group fairness, is a critical ethical concern. However, current fairness and bias research in LLMs is limited in two aspects. First, compared to traditional group fairness in machine learning classification, it requires that the non-sensitive attributes, in this case, the prompt questions, be the same across different groups. In many practical scenarios, different groups, however, may prefer different prompt questions and this requirement becomes impractical. Second, it evaluates group fairness only for the LLM's final output without identifying the source of possible bias. Namely, the bias in LLM's output can result from both the pretraining and the finetuning. For finetuning, the bias can result from both the RLHF procedure and the learned reward model. Arguably, evaluating the group fairness of each component in the LLM pipeline could help develop better methods to mitigate the possible bias. Recognizing those two limitations, this work benchmarks the group fairness of learned reward models. By using expert-written text from arXiv, we are able to benchmark the group fairness of reward models without requiring the same prompt questions across different demographic groups. Surprisingly, our results demonstrate that all the evaluated reward models (e.g., Nemotron-4-340B-Reward, ArmoRM-Llama3-8B-v0.1, and GRM-llama3-8B-sftreg) exhibit statistically significant group unfairness. We also observed that top-performing reward models (w.r.t. canonical performance metrics) tend to demonstrate better group fairness.

Via

Access Paper or Ask Questions

A Survey of In-Context Reinforcement Learning

Feb 11, 2025

Amir Moeini, Jiuqi Wang, Jacob Beck, Ethan Blaser, Shimon Whiteson, Rohan Chandra, Shangtong Zhang

Figure 1 for A Survey of In-Context Reinforcement Learning

Figure 2 for A Survey of In-Context Reinforcement Learning

Abstract:Reinforcement learning (RL) agents typically optimize their policies by performing expensive backward passes to update their network parameters. However, some agents can solve new tasks without updating any parameters by simply conditioning on additional context such as their action-observation histories. This paper surveys work on such behavior, known as in-context reinforcement learning.

Via

Access Paper or Ask Questions

Linear $Q$-Learning Does Not Diverge: Convergence Rates to a Bounded Set

Jan 31, 2025

Xinyu Liu, Zixuan Xie, Shangtong Zhang

Abstract:$Q$-learning is one of the most fundamental reinforcement learning algorithms. Previously, it is widely believed that $Q$-learning with linear function approximation (i.e., linear $Q$-learning) suffers from possible divergence. This paper instead establishes the first $L^2$ convergence rate of linear $Q$-learning to a bounded set. Notably, we do not make any modification to the original linear $Q$-learning algorithm, do not make any Bellman completeness assumption, and do not make any near-optimality assumption on the behavior policy. All we need is an $\epsilon$-softmax behavior policy with an adaptive temperature. The key to our analysis is the general result of stochastic approximations under Markovian noise with fast-changing transition functions. As a side product, we also use this general result to establish the $L^2$ convergence rate of tabular $Q$-learning with an $\epsilon$-softmax behavior policy, for which we rely on a novel pseudo-contraction property of the weighted Bellman optimality operator.

Via

Access Paper or Ask Questions

CRASH: Challenging Reinforcement-Learning Based Adversarial Scenarios For Safety Hardening

Nov 26, 2024

Amar Kulkarni, Shangtong Zhang, Madhur Behl

Figure 1 for CRASH: Challenging Reinforcement-Learning Based Adversarial Scenarios For Safety Hardening

Figure 2 for CRASH: Challenging Reinforcement-Learning Based Adversarial Scenarios For Safety Hardening

Figure 3 for CRASH: Challenging Reinforcement-Learning Based Adversarial Scenarios For Safety Hardening

Figure 4 for CRASH: Challenging Reinforcement-Learning Based Adversarial Scenarios For Safety Hardening

Abstract:Ensuring the safety of autonomous vehicles (AVs) requires identifying rare but critical failure cases that on-road testing alone cannot discover. High-fidelity simulations provide a scalable alternative, but automatically generating realistic and diverse traffic scenarios that can effectively stress test AV motion planners remains a key challenge. This paper introduces CRASH - Challenging Reinforcement-learning based Adversarial scenarios for Safety Hardening - an adversarial deep reinforcement learning framework to address this issue. First CRASH can control adversarial Non Player Character (NPC) agents in an AV simulator to automatically induce collisions with the Ego vehicle, falsifying its motion planner. We also propose a novel approach, that we term safety hardening, which iteratively refines the motion planner by simulating improvement scenarios against adversarial agents, leveraging the failure cases to strengthen the AV stack. CRASH is evaluated on a simplified two-lane highway scenario, demonstrating its ability to falsify both rule-based and learning-based planners with collision rates exceeding 90%. Additionally, safety hardening reduces the Ego vehicle's collision rate by 26%. While preliminary, these results highlight RL-based safety hardening as a promising approach for scenario-driven simulation testing for autonomous vehicles.

* 7 pages, 9 figures, 2 tables

Via

Access Paper or Ask Questions