Nigel Tao

The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

Jan 10, 2013

Lex Weaver, Nigel Tao

Figures 1-4 for The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

Abstract: There exist a number of reinforcement learning algorithms which learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
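
The claim that a constant reward baseline changes variance without adding bias can be checked on a toy problem. The sketch below is not the algorithm from the paper: it assumes a hypothetical two-armed bandit with fixed rewards and a one-parameter sigmoid policy, and computes the exact mean and variance of the score-function gradient estimate `(r - b) * d/dtheta log pi(a)` by enumerating both actions. The names `gradient_stats` and `expected_reward`, the reward values, and the value of `theta` are all illustrative assumptions.

```python
import math


def action_probs(theta):
    """Sigmoid policy over two actions: P(a=1) = 1 / (1 + exp(-theta))."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return [1.0 - p1, p1]


def expected_reward(theta, rewards):
    """Average reward under the policy (used here as the baseline)."""
    return sum(p * r for p, r in zip(action_probs(theta), rewards))


def gradient_stats(theta, rewards, baseline):
    """Exact mean and variance of (r - baseline) * d/dtheta log pi(a)."""
    probs = action_probs(theta)
    p1 = probs[1]
    mean = 0.0
    second_moment = 0.0
    for action, prob in enumerate(probs):
        # For this policy, d/dtheta log pi(a) = a - P(a=1).
        score = action - p1
        g = (rewards[action] - baseline) * score
        mean += prob * g
        second_moment += prob * g * g
    return mean, second_moment - mean * mean


# Hypothetical setup: rewards 1.0 and 2.0, policy parameter 0.4.
theta, rewards = 0.4, [1.0, 2.0]
avg = expected_reward(theta, rewards)
mean_no_b, var_no_b = gradient_stats(theta, rewards, 0.0)
mean_b, var_b = gradient_stats(theta, rewards, avg)
print("no baseline:      mean=%.6f var=%.6f" % (mean_no_b, var_no_b))
print("average baseline: mean=%.6f var=%.6f" % (mean_b, var_b))
```

Both settings give the same mean gradient, while the variance is smaller with the average-reward baseline. In this toy bandit the average reward is not in general the exact variance-minimizing constant; the optimality result in the abstract concerns the paper's own setting and limit.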

* Appears in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI2001)
