Meichen Song

Evolving Robustness--Exploration Trade-off in Online Reinforcement Learning via Quantile Bayesian Risk MDPs

May 23, 2026

Meichen Song, Yuhao Wang, Enlu Zhou

Abstract: In online reinforcement learning, data scarcity creates epistemic uncertainty that makes robustness important early in learning, whereas sufficient exploration is needed to learn the true-environment optimal policy. We study this time-varying robustness--exploration trade-off through a quantile Bayesian risk-aware Markov decision process (BR-MDP), in which the quantile level controls how posterior uncertainty enters the Bellman backup. We characterize this control through an asymptotic normality result for the difference between the quantile BR-MDP value and the value in the true environment. The result implies that upper/lower-tail quantiles induce optimism/pessimism towards epistemic uncertainty, and the magnitude of the optimism/pessimism decreases as data accumulate. Building on this characterization, we propose an online Bayesian risk-aware algorithm with an adaptive quantile schedule that emphasizes robustness early and gradually encourages exploration of less-visited state--action pairs. We establish sublinear Bayesian regret bounds with respect to both the true optimal value and the optimal BR-MDP robust value. Numerical experiments demonstrate strong performance in both exploration-demanding and exploration-costly environments.
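To make the role of the quantile level concrete, here is a minimal sketch of a quantile backup over posterior samples on a small tabular MDP. It is not the authors' algorithm: the Dirichlet posterior, the per state-action quantile, the fixed (non-adaptive) quantile level, the fractional "expected" counts, and all names (`sample_kernels`, `quantile_value_iteration`, `solve`) are assumptions made for illustration only.

```python
import numpy as np

def sample_kernels(counts, n_samples, rng, prior=1.0):
    """Draw transition kernels from an assumed Dirichlet posterior.

    counts has shape (S, A, S); the result has shape (n_samples, S, A, S).
    Normalized Gamma draws are Dirichlet-distributed.
    """
    alpha = counts + prior
    g = rng.gamma(shape=alpha, size=(n_samples,) + alpha.shape)
    return g / g.sum(axis=-1, keepdims=True)

def quantile_value_iteration(kernels, rewards, level, gamma=0.9, iters=200):
    """Value iteration where each backup takes a quantile over posterior samples.

    level near 1 is optimistic about epistemic uncertainty, near 0 pessimistic.
    """
    v = np.zeros(rewards.shape[0])
    for _ in range(iters):
        # One Bellman backup per sampled kernel: shape (n_samples, S, A).
        backups = rewards[None] + gamma * kernels @ v
        # Quantile across posterior samples, then greedy over actions.
        q = np.quantile(backups, level, axis=0)
        v = q.max(axis=1)
    return v

# Hypothetical 4-state, 2-action environment.
env_rng = np.random.default_rng(0)
S, A = 4, 2
p_true = env_rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S)
rewards = env_rng.uniform(size=(S, A))

def solve(n_visits, seed=1):
    """Return (pessimistic, optimistic) values given n_visits per state-action pair."""
    rng = np.random.default_rng(seed)
    kernels = sample_kernels(n_visits * p_true, n_samples=200, rng=rng)
    v_lo = quantile_value_iteration(kernels, rewards, level=0.1)
    v_hi = quantile_value_iteration(kernels, rewards, level=0.9)
    return v_lo, v_hi

v_lo_small, v_hi_small = solve(n_visits=2)
v_lo_large, v_hi_large = solve(n_visits=2000)
gap_small = np.max(v_hi_small - v_lo_small)
gap_large = np.max(v_hi_large - v_lo_large)
print("optimism/pessimism gap with little data:", gap_small)
print("optimism/pessimism gap with much data:  ", gap_large)
```

In this sketch the upper-quantile values sit above the lower-quantile values, and the gap between them narrows as the visit counts grow, which mirrors the qualitative behavior the abstract describes. An adaptive schedule would vary `level` over time instead of holding it fixed.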


Differentiable Arbitrating in Zero-sum Markov Games

Feb 20, 2023

Jing Wang, Meichen Song, Feng Gao, Boyi Liu, Zhaoran Wang, Yi Wu

Abstract: We initiate the study of arbitrating: perturbing the reward in a two-player zero-sum Markov game to induce a desirable Nash equilibrium. This problem admits a bi-level optimization formulation. The lower level requires solving for the Nash equilibrium under a given reward function, which makes the overall problem challenging to optimize end to end. We propose a backpropagation scheme that differentiates through the Nash equilibrium and thereby provides gradient feedback for the upper level. In particular, our method requires only a black-box solver for the (regularized) Nash equilibrium (NE). We develop a convergence analysis for the proposed framework with proper black-box NE solvers and demonstrate empirical success in two multi-agent reinforcement learning (MARL) environments.
