Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Dec 03, 2019
Tiancheng Jin, Haipeng Luo

Share this with someone who'll enjoy it:

We consider the problem of learning in episodic finite-horizon Markov decision processes with unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\mathcal{\tilde{O}}(L|X|^2\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first one to ensure sub-linear regret in this challenging setting. Our key technical contribution is to introduce an optimistic loss estimator that is inversely weighted by an $\textit{upper occupancy bound}$.

* 14 pages 

   Access Paper Source

Share this with someone who'll enjoy it: