Title: An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

Apr 21, 2015

Richard S. Sutton, A. Rupam Mahmood, Martha White

Abstract: In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD($\lambda$)'s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods, including TDC, GTD($\lambda$), and GQ($\lambda$). Compared to these methods, our _emphatic TD($\lambda$)_ is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.
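The abstract does not spell out the update rule, so the following is only a minimal sketch of one emphatic TD($\lambda$) step as it is commonly stated for linear function approximation: a followon trace $F_t = \rho_{t-1}\gamma_t F_{t-1} + I_t$, an emphasis $M_t = \lambda_t I_t + (1-\lambda_t) F_t$, an eligibility trace $e_t = \rho_t(\gamma_t \lambda_t e_{t-1} + M_t \phi_t)$, and the update $\theta_{t+1} = \theta_t + \alpha \delta_t e_t$. The function name `emphatic_td_step`, the helper `dot`, and all argument names are hypothetical choices made for illustration, not taken from the paper's code.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def emphatic_td_step(theta, e_prev, F_prev, rho_prev,
                     phi, phi_next, reward,
                     rho, gamma, gamma_next, lam, interest, alpha):
    """One step of linear emphatic TD(lambda) (illustrative sketch).

    theta      -- the single learned weight vector
    e_prev     -- eligibility trace from the previous step (zeros at start)
    F_prev     -- followon trace from the previous step (0.0 at start)
    rho_prev   -- importance-sampling ratio of the previous step
    phi        -- feature vector of the current state
    phi_next   -- feature vector of the next state
    rho        -- importance-sampling ratio of the current step
    gamma      -- discount applied on entering the current state
    gamma_next -- discount applied on entering the next state
    lam        -- bootstrapping parameter for the current state
    interest   -- interest in valuing the current state accurately
    alpha      -- the single step-size parameter
    """
    # TD error under the current weights.
    delta = reward + gamma_next * dot(theta, phi_next) - dot(theta, phi)
    # Followon trace: discounted, importance-weighted accumulation of interest.
    F = rho_prev * gamma * F_prev + interest
    # Emphasis placed on the update at this time step.
    M = lam * interest + (1.0 - lam) * F
    # Eligibility trace, scaled by the emphasis and the current ratio.
    e = [rho * (gamma * lam * ep + M * x) for ep, x in zip(e_prev, phi)]
    # Weight update.
    theta_new = [th + alpha * delta * ei for th, ei in zip(theta, e)]
    return theta_new, e, F
```

Setting `interest` to 1 for every state and holding `gamma` and `lam` constant gives the simplest special case; the state-dependent forms mentioned in the abstract correspond to passing different values on each call.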

* Journal of Machine Learning Research 17(73): 1-29, 2016
* 29 pages. This is a significant revision based on the first set of reviews. The most important change was to signal early that the main result is about stability, not convergence.
