We introduce AO-Grasp, a grasp proposal method that generates stable and actionable 6 degree-of-freedom grasps for articulated objects. Our generated grasps enable robots to interact with articulated objects, such as opening and closing cabinets and appliances. Given a segmented partial point cloud of a single articulated object, AO-Grasp predicts the best grasp points on the object with a novel Actionable Grasp Point Predictor model and then finds corresponding grasp orientations for each point by leveraging a state-of-the-art rigid object grasping method. We train AO-Grasp on our new AO-Grasp Dataset, which contains 48K actionable parallel-jaw grasps on synthetic articulated objects. In simulation, AO-Grasp achieves higher grasp success rates than existing rigid object grasping and articulated object interaction baselines on both train and test categories. Additionally, we evaluate AO-Grasp on 120 real-world scenes of objects with varied geometries, articulation axes, and joint states, where AO-Grasp produces successful grasps on 67.5% of scenes, compared to 33.3% for the baseline.
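For concreteness, the two-stage inference pipeline described above can be sketched as follows. The class and method names (score_points, orientations_at) are illustrative placeholders, not the released AO-Grasp API.

```python
# Hypothetical sketch of a two-stage grasp proposal pipeline: score per-point
# "actionability", keep the best points, then ask a rigid-object grasp method
# for candidate orientations anchored at each point.
import numpy as np

def propose_grasps(point_cloud: np.ndarray, predictor, rigid_grasper, k: int = 10):
    """point_cloud: (N, 3) segmented partial point cloud of one articulated object."""
    # Step 1: score every point for how useful a grasp there would be
    # for articulating the object.
    scores = predictor.score_points(point_cloud)          # (N,)
    top_idx = np.argsort(scores)[-k:][::-1]               # k best grasp points

    # Step 2: for each selected point, query a rigid-object grasp method for
    # candidate orientations, yielding full 6-DoF grasps.
    grasps = []
    for i in top_idx:
        for orientation in rigid_grasper.orientations_at(point_cloud, point_cloud[i]):
            grasps.append({
                "position": point_cloud[i],
                "orientation": orientation,   # e.g. 3x3 rotation or quaternion
                "score": float(scores[i]),
            })
    return grasps
```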
We present a large empirical investigation of pre-trained visual representations (PVRs) for training downstream policies that execute real-world tasks. Our study spans five different PVRs, two policy-learning paradigms (imitation and reinforcement learning), and three robots across five distinct manipulation and indoor navigation tasks. From this effort, we arrive at three insights: 1) the performance trends of PVRs in simulation are generally indicative of their trends in the real world, 2) PVRs enable a first-of-its-kind result on indoor ImageNav (zero-shot transfer to a held-out scene in the real world), and 3) the benefits from variations in PVRs, primarily data augmentation and fine-tuning, also transfer to real-world performance. See the project website for additional details and visuals.
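As a rough illustration of the policy-learning setup studied here, the sketch below shows the common frozen-PVR paradigm in which only a small policy head is trained on top of a pre-trained encoder (e.g. by behavior cloning). It is a generic example under that assumption, not the code used in the study.

```python
# Generic frozen-PVR policy: the pre-trained visual encoder is frozen and only
# a small MLP head mapping features to actions is trained.
import torch
import torch.nn as nn

class PolicyOnPVR(nn.Module):
    def __init__(self, pvr_encoder: nn.Module, feat_dim: int, action_dim: int):
        super().__init__()
        self.encoder = pvr_encoder
        for p in self.encoder.parameters():   # freeze the pre-trained representation
            p.requires_grad = False
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                  nn.Linear(256, action_dim))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feat = self.encoder(image)        # visual features from the PVR
        return self.head(feat)                # predicted action
```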
We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs), or visual 'foundation models', for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous manipulation, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data scale and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 5.6M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Finally, we show that task- or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving performance competitive with or superior to the best known results on all of the benchmarks in CortexBench. These models required over 10,000 GPU-hours to train and are available on our website for the benefit of the research community.
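A minimal sketch of the MAE pre-training recipe mentioned above is given below. It assumes a standard masked-autoencoder interface whose forward pass returns the reconstruction loss on masked patches; this interface is an assumption for illustration, not the exact VC-1 training code.

```python
# Masked Auto-Encoding pre-training loop: mask most image patches, encode the
# visible ones, and train the model to reconstruct the masked content.
import torch

def pretrain_mae(model, dataloader, epochs: int = 1, mask_ratio: float = 0.75):
    opt = torch.optim.AdamW(model.parameters(), lr=1.5e-4)
    for _ in range(epochs):
        for images in dataloader:   # frames from egocentric video and ImageNet
            loss = model(images, mask_ratio=mask_ratio)  # reconstruction loss on masked patches
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model   # the trained encoder serves as the PVR
```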
Robots deployed in human-centric environments may need to manipulate a diverse range of articulated objects, such as doors, dishwashers, and cabinets. Articulated objects often come with unexpected articulation mechanisms that are inconsistent with categorical priors: for example, a drawer might rotate about a hinge joint instead of sliding open. We propose a category-independent framework for predicting the articulation models of unknown objects from sequences of RGB-D images. The prediction is performed in two steps: first, a visual perception module tracks object part poses from raw images, and second, a factor graph takes these poses and infers the articulation model, including the current configuration between the parts, as a 6D twist. We also propose a manipulation-oriented metric that evaluates predicted joint twists in terms of how well a compliant robot controller would be able to manipulate the articulated object given the predicted twist. We demonstrate that our visual perception and factor graph modules outperform baselines on simulated data and show the applicability of our factor graph on real-world data.
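To make the 6D twist representation concrete, the hedged sketch below recovers a twist from two tracked relative part poses via the matrix logarithm. It illustrates what quantity is being estimated, not how the factor graph performs inference.

```python
# The relative motion between two parts is modeled as T(theta) = exp(theta * xi_hat),
# where the 6D twist xi = (v, w) stacks linear and angular components. Given two
# tracked relative poses, a (motion-scaled) twist can be read off the matrix log.
import numpy as np
from scipy.linalg import logm

def relative_twist(T_rel_0: np.ndarray, T_rel_1: np.ndarray) -> np.ndarray:
    """T_rel_*: 4x4 relative transforms of the moving part w.r.t. the base part."""
    dT = np.linalg.inv(T_rel_0) @ T_rel_1     # motion between the two observations
    xi_hat = np.real(logm(dT))                # 4x4 se(3) twist matrix
    w = np.array([xi_hat[2, 1], xi_hat[0, 2], xi_hat[1, 0]])  # angular part
    v = xi_hat[:3, 3]                                         # linear part
    return np.concatenate([v, w])             # 6D twist, scaled by the motion magnitude
```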
Dexterous manipulation tasks often require contact switching, where fingers make and break contact with the object. We propose a method that plans trajectories for dexterous manipulation tasks involving contact switching using contact-implicit trajectory optimization (CITO) augmented with a high-level discrete contact sequence planner. We first use the high-level planner to find a sequence of finger contact switches given a desired object trajectory. With this contact sequence plan, we impose additional constraints in the CITO problem. We show that our method finds trajectories approximately 7 times faster than a general CITO baseline for a four-finger planar manipulation scenario. Furthermore, when executing the planned trajectories in a full dynamics simulator, we are able to track the planned object pose trajectories more closely with our method than with the baseline.
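The sketch below illustrates, with placeholder names, one way a high-level contact sequence can be turned into additional constraints for CITO: contact forces of fingers planned to be out of contact are pinned to zero, while active fingers keep the usual complementarity constraints, shrinking the search space the optimizer must explore. It is a conceptual example, not the paper's implementation.

```python
# Build a symbolic list of constraints from a high-level contact schedule.
def contact_schedule_constraints(contact_plan, horizon, n_fingers):
    """contact_plan[t][f] is True if finger f should be in contact at step t."""
    constraints = []
    for t in range(horizon):
        for f in range(n_fingers):
            if not contact_plan[t][f]:
                # force variables of inactive fingers are fixed to zero
                constraints.append(("eq", f"force[{t}][{f}]", 0.0))
            else:
                # active fingers keep the usual contact complementarity constraints
                constraints.append(("complementarity", f"gap[{t}][{f}]", f"force[{t}][{f}]"))
    return constraints
```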
Dexterous manipulation remains an open problem in robotics. To coordinate the efforts of the research community toward tackling this problem, we propose a shared benchmark. We designed and built robotic platforms that are hosted at the MPI-IS and can be accessed remotely. Each platform consists of three robotic fingers that are capable of dexterous object manipulation. Users control the platforms remotely by submitting code that is executed automatically, akin to a computational cluster. Using this setup, i) we host robotics competitions, where teams from anywhere in the world access our platforms to tackle challenging tasks, ii) we publish the datasets collected during these competitions (consisting of hundreds of robot hours), and iii) we give researchers access to these platforms for their own projects.
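Schematically, a submitted user program amounts to the control loop sketched below; all names are placeholders and do not reflect the platform's actual software interface.

```python
# Purely illustrative sketch of the kind of code a user submits: it runs
# automatically on the remote platform, reading observations and sending
# finger commands at every control step, and the data is logged for download.
def run_submitted_policy(robot, policy, n_steps: int = 10_000):
    for _ in range(n_steps):
        obs = robot.get_observation()        # joint positions/velocities, camera images
        action = policy(obs)                 # user-defined controller or learned policy
        robot.append_desired_action(action)  # queued and executed by the platform
```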
This report describes our approach for Phase 3 of the Real Robot Challenge. To solve cuboid manipulation tasks of varying difficulty, we decompose each task into the following primitives: moving the fingers to the cuboid to grasp it, turning it on the table to minimize orientation error, and re-positioning it to the goal position. We use model-based trajectory optimization and control to plan and execute these primitives. The grasping, turning, and re-positioning primitives are sequenced by a state machine that determines which primitive to execute given the current object state and goal. Our method shows robust performance over multiple runs with randomized initial and goal positions. With this approach, our team placed second in the challenge, under the anonymous name "sombertortoise" on the leaderboard. Example runs of our method solving each of the four levels can be seen in this video (https://www.youtube.com/watch?v=I65Kwu9PGmg&list=PLt9QxrtaftrHGXcp4Oh8-s_OnQnBnLtei&index=1).
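A minimal sketch of such a primitive-sequencing state machine is shown below; the pose representation and thresholds are illustrative assumptions, not necessarily the exact values used in the challenge entry.

```python
# Select the next primitive (grasp, turn, reposition, or done) from the
# current object state and the goal.
import numpy as np

def wrap_angle(a: float) -> float:
    """Wrap an angle to [-pi, pi)."""
    return (a + np.pi) % (2 * np.pi) - np.pi

def select_primitive(object_pose, goal_pose, has_grasp,
                     yaw_tol: float = 0.2, pos_tol: float = 0.01) -> str:
    yaw_error = abs(wrap_angle(goal_pose["yaw"] - object_pose["yaw"]))
    pos_error = float(np.linalg.norm(np.asarray(goal_pose["position"]) -
                                     np.asarray(object_pose["position"])))
    if not has_grasp:
        return "grasp"        # move the fingers to the cuboid and grasp it
    if yaw_error > yaw_tol:
        return "turn"         # turn the cuboid on the table to reduce orientation error
    if pos_error > pos_tol:
        return "reposition"   # move the grasped cuboid toward the goal position
    return "done"
```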