Brendan Maginnis

A short variational proof of equivalence between policy gradients and soft Q learning

Dec 22, 2017

Pierre H. Richemond, Brendan Maginnis

Abstract: Two main families of reinforcement learning algorithms, Q-learning and policy gradients, have recently been proven equivalent when a softmax relaxation is used on one side and an entropic regularization on the other. We relate this result to the well-known convex duality between Shannon entropy and the softmax function, a result also known as the Donsker-Varadhan formula. This provides a short proof of the equivalence. We then interpret this duality further, and use ideas from convex analysis to prove a new policy inequality relative to soft Q-learning.
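
The duality mentioned in the abstract can be checked numerically in its simplest discrete form: for a vector of action values q, log-sum-exp(q) equals the maximum over distributions p of the expected value under p plus the Shannon entropy of p, and the maximiser is softmax(q). The sketch below is only an illustration under that assumption; the action values and helper names are hypothetical and not taken from the paper.

```python
import math
import random

def log_sum_exp(q):
    """Numerically stable log(sum(exp(q_i)))."""
    m = max(q)
    return m + math.log(sum(math.exp(x - m) for x in q))

def softmax(q):
    """Softmax distribution over the entries of q."""
    m = max(q)
    w = [math.exp(x - m) for x in q]
    z = sum(w)
    return [x / z for x in w]

def entropy(p):
    """Shannon entropy in nats, with 0 * log(0) treated as 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def objective(p, q):
    """Expected value of q under p plus the entropy of p."""
    return sum(pi * qi for pi, qi in zip(p, q)) + entropy(p)

def random_distribution(n, rng):
    """A random point on the probability simplex (not uniform, just varied)."""
    w = [rng.random() + 1e-12 for _ in range(n)]
    z = sum(w)
    return [x / z for x in w]

q_values = [1.0, -0.5, 2.0, 0.3]  # hypothetical action values
rng = random.Random(0)

# The softmax policy should attain the log-sum-exp value exactly.
best = objective(softmax(q_values), q_values)
gap = best - log_sum_exp(q_values)

# No other distribution should exceed it (up to floating-point error).
worst_violation = max(
    objective(random_distribution(len(q_values), rng), q_values) - best
    for _ in range(1000)
)
```

Here `gap` should be zero up to rounding, and `worst_violation` should be non-positive, since any other distribution falls short of the maximum by its Kullback-Leibler divergence from the softmax policy.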

On Wasserstein Reinforcement Learning and the Fokker-Planck equation

Dec 19, 2017

Pierre H. Richemond, Brendan Maginnis

Abstract: Policy gradient methods often achieve better performance when the change in policy is limited to a small Kullback-Leibler divergence. We derive policy gradients where the change in policy is limited to a small Wasserstein distance (or trust region). This is done in the discrete and continuous multi-armed bandit settings with entropy regularisation. We show that in the small-step limit with respect to the Wasserstein distance $W_2$, policy dynamics are governed by the Fokker-Planck (heat) equation, following the Jordan-Kinderlehrer-Otto result. This means that policies undergo diffusion and advection, concentrating near actions with high reward. This helps elucidate the nature of convergence in the probability matching setup, and provides justification for empirical practices such as Gaussian policy priors and additive gradient noise.
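
For orientation, the Fokker-Planck equation referred to here can be written in its standard Jordan-Kinderlehrer-Otto form. The notation is an assumption made for illustration and may differ from the paper's: $\pi_t(a)$ is the policy density over actions, $r(a)$ the reward, and $\tau > 0$ the entropy-regularisation temperature.

$$
\partial_t \pi_t = -\nabla \cdot \left( \pi_t \nabla r \right) + \tau \, \Delta \pi_t
$$

The first term is advection along the reward gradient and the second is diffusion. In this form the stationary density is proportional to $\exp(r(a)/\tau)$, which matches the abstract's picture of policies concentrating near high-reward actions.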

Efficiently applying attention to sequential data with the Recurrent Discounted Attention unit

Jun 19, 2017

Brendan Maginnis, Pierre H. Richemond

Abstract: Recurrent neural network architectures excel at processing sequences by modelling dependencies over different timescales. The recently introduced Recurrent Weighted Average (RWA) unit captures long-term dependencies far better than an LSTM on several challenging tasks. The RWA achieves this by applying attention to each input and computing a weighted average over the full history of its computations. Unfortunately, the RWA cannot change the attention it has assigned to previous timesteps, and so struggles with carrying out consecutive tasks or tasks with changing requirements. We present the Recurrent Discounted Attention (RDA) unit, which builds on the RWA by additionally allowing the discounting of the past. We empirically compare our model to RWA, LSTM and GRU units on several challenging tasks. On tasks with a single output, the RWA, RDA and GRU units learn much more quickly than the LSTM and with better performance. On the multiple sequence copy task, our RDA unit learns the task three times as quickly as the LSTM or GRU units, while the RWA fails to learn at all. On the Wikipedia character prediction task, the LSTM performs best but is followed closely by our RDA unit. Overall, our RDA unit performs well and is sample efficient on a large variety of sequence tasks.

* Updated results of the RDA-exp-tanh unit for the Wikipedia character prediction task
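
The idea of a weighted average over the history, with an added discount on the past, can be sketched with scalar quantities. This is a simplified illustration, not the paper's parameterisation: in the actual units the values, attention logits and discounts would be produced by learned gates, whereas here they are hypothetical inputs supplied by hand, and the exponential is not numerically stabilised.

```python
import math

def discounted_attention_average(values, logits, discounts):
    """Running attention-weighted average with a per-step discount on the past.

    discounts[t] == 1.0 keeps the full history (a plain weighted average);
    discounts[t] == 0.0 forgets everything accumulated before step t.
    """
    numerator, denominator = 0.0, 0.0
    outputs = []
    for z, a, g in zip(values, logits, discounts):
        w = math.exp(a)                        # unnormalised attention weight
        numerator = g * numerator + w * z      # discounted weighted sum of values
        denominator = g * denominator + w      # discounted sum of weights
        outputs.append(numerator / denominator)
    return outputs

values = [1.0, 3.0, 5.0]   # hypothetical per-step values
logits = [0.0, 0.0, 0.0]   # equal attention on every step

# Without discounting, each output is the mean of everything seen so far.
no_discount = discounted_attention_average(values, logits, [1.0, 1.0, 1.0])

# A discount of 0 at the last step discards the earlier history.
full_reset = discounted_attention_average(values, logits, [1.0, 1.0, 0.0])
```

With all discounts set to 1 the outputs are the running means of the inputs, which corresponds to the fixed-history behaviour the abstract attributes to the RWA. Setting a discount below 1 lets later steps outweigh earlier ones, which is the extra freedom the abstract describes for the RDA.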
