Discovered Policy Optimisation

Oct 11, 2022

The last decade has been revolutionary for reinforcement learning (RL): it can now solve complex decision and control problems. Successful RL methods were handcrafted using mathematical derivations, intuition, and experimentation. This approach has a major shortcoming: it results in specific solutions to the RL problem rather than a protocol for discovering efficient and robust methods. In contrast, the emerging field of meta-learning provides a toolkit for automatically optimising machine learning methods, potentially addressing this flaw. However, black-box approaches that attempt to discover RL algorithms with minimal prior structure have thus far not been successful. Mirror Learning, which includes RL algorithms such as PPO, offers a potential framework. In this paper, we explore the Mirror Learning space by meta-learning a "drift" function. We refer to the result as Learnt Policy Optimisation (LPO). By analysing LPO, we gain original insights into policy optimisation, which we use to formulate a novel, closed-form RL algorithm, Discovered Policy Optimisation (DPO). Our experiments in Brax environments confirm state-of-the-art performance of LPO and DPO, as well as their transfer to unseen settings.

* NeurIPS 2022  
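
To make the "drift" function mentioned in the abstract more concrete, the sketch below shows how PPO's clipped objective can be read as a plain importance-weighted advantage minus a non-negative drift penalty. This is a minimal per-sample illustration and not the paper's code: the function names (`ppo_drift`, `mirror_objective`), the scalar interface, and the default `eps=0.2` are assumptions made here, and the abstract does not give the form of the LPO or DPO drift, so neither is reproduced.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard per-sample PPO-clip surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return min(ratio * advantage, clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)


def ppo_drift(ratio, advantage, eps=0.2):
    """PPO's clipping written as a drift penalty (hypothetical helper name).

    ratio is pi_new(a|s) / pi_old(a|s). The penalty is zero when the
    ratio is 1 (the policy has not moved) and is never negative.
    """
    return max(0.0, (ratio - clip(ratio, 1.0 - eps, 1.0 + eps)) * advantage)


def mirror_objective(ratio, advantage, drift=ppo_drift, eps=0.2):
    """Importance-weighted advantage minus a drift penalty.

    Swapping in a different `drift` callable gives a different
    update rule under the same template.
    """
    return ratio * advantage - drift(ratio, advantage, eps)


if __name__ == "__main__":
    # The two formulations agree sample by sample.
    for r in (0.5, 1.0, 1.5):
        for a in (-1.0, 1.0):
            print(r, a, ppo_clip_objective(r, a), mirror_objective(r, a))
```

In this reading, the choice of update rule reduces to the choice of the `drift` argument, which is the component the abstract describes meta-learning. A full implementation would average these per-sample terms over states and actions sampled from the old policy.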
