Lanjihong Ma

Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

May 06, 2026
Via arXiv