Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback

May 22, 2026

Zitian Li, Wang Chi Cheung

Share this with someone who'll enjoy it:

Abstract:Pure exploration in episodic Reinforcement Learning has primarily focused on Best Policy Identification (BPI), which seeks to identify a (near)-optimal policy with high confidence. Motivated by practical settings where a ``good enough'' policy suffices, we study an alternate objective of Good Policy Identification (GPI). For a given reward threshold $μ_0$, GPI only requires identifying a policy with expected reward in an episode at least $μ_0$ if such a policy exists (positive instance), or declaring None if no such policy exists (negative instance). We formalize GPI under the fixed-confidence setting. We require the output to be correct with probability $\geq 1-δ$, and seek to minimize the expected sample complexity, which is the expected number of episodes explored for the output. We propose a novel algorithm BEE-GPI, and derive theoretically-grounded upper bounds on its sample complexity for positive and negative instances. Notably, for positive instances, the coefficient of $\log 1/δ$ in our upper bound is $O(H^2/(V^* - μ_0)^2)$, where $H$ is the episode length and $V^*$ is the optimal expected reward in an episode. The coefficient does not depend on the action and state space sizes otherwise, in sharp contrast to the sample complexity in BPI. We further establish lower bound results to show the near-optimality of BEE-GPI and the necessity of the $1/(V^* -μ)^2$ term. Numerical experiments further validate the efficiency of our approach.

View paper on

Share this with someone who'll enjoy it:

Title:Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback

Paper and Code