Yunhao Tang

Unlocking Pixels for Reinforcement Learning via Implicit Attention

Feb 08, 2021
Krzysztof Choromanski, Deepali Jain, Jack Parker-Holder, Xingyou Song, Valerii Likhosherstov, Anirban Santara, Aldo Pacchiano, Yunhao Tang, Adrian Weller

There has recently been significant interest in training reinforcement learning (RL) agents in vision-based environments. This poses many challenges, such as high dimensionality and potential for observational overfitting through spurious correlations. A promising approach to solve both of these problems is a self-attention bottleneck, which provides a simple and effective framework for learning high performing policies, even in the presence of distractions. However, due to poor scalability of attention architectures, these methods do not scale beyond low resolution visual inputs, using large patches (thus small attention matrices). In this paper we make use of new efficient attention algorithms, recently shown to be highly effective for Transformers, and demonstrate that these new techniques can be applied in the RL setting. This allows our attention-based controllers to scale to larger visual inputs, and facilitate the use of smaller patches, even individual pixels, improving generalization. In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features, leveraging the theory of angular kernels. We show theoretically and empirically that hybrid random features is a promising approach when using attention for vision-based RL.
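
To make the scalability argument concrete, here is a minimal sketch of linear-time attention using positive random features for the softmax kernel. It illustrates the general family of efficient attention estimators the abstract refers to, not the hybrid random features proposed in the paper. The toy queries, keys, scalar values, feature count, and all function names are assumptions chosen for illustration, and the usual 1/sqrt(d) scaling is omitted.

```python
import math
import random


def positive_features(x, projections):
    """Map x to positive random features whose dot products estimate exp(q . k)."""
    half_sq_norm = sum(v * v for v in x) / 2.0
    scale = 1.0 / math.sqrt(len(projections))
    return [
        scale * math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - half_sq_norm)
        for w in projections
    ]


def exact_attention(queries, keys, values):
    """Quadratic-cost softmax attention over scalar values."""
    outputs = []
    for q in queries:
        weights = [math.exp(sum(a * b for a, b in zip(q, k))) for k in keys]
        total = sum(weights)
        outputs.append(sum(w * v for w, v in zip(weights, values)) / total)
    return outputs


def linear_attention(queries, keys, values, projections):
    """Approximate the same output without forming the query-by-key matrix."""
    m = len(projections)
    kv_sum = [0.0] * m  # sum over keys of phi(k) * value
    k_sum = [0.0] * m   # sum over keys of phi(k), used for normalization
    for k, v in zip(keys, values):
        phi_k = positive_features(k, projections)
        for j in range(m):
            kv_sum[j] += phi_k[j] * v
            k_sum[j] += phi_k[j]
    outputs = []
    for q in queries:
        phi_q = positive_features(q, projections)
        numerator = sum(a * b for a, b in zip(phi_q, kv_sum))
        denominator = sum(a * b for a, b in zip(phi_q, k_sum))
        outputs.append(numerator / denominator)
    return outputs


# Toy data (assumed): 2 queries, 3 keys, scalar values, 2000 Gaussian projections.
rng = random.Random(0)
dim, num_features = 4, 2000
projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]
queries = [[0.2, -0.1, 0.3, 0.1], [-0.3, 0.2, 0.1, 0.0]]
keys = [[0.1, 0.2, -0.2, 0.3], [0.3, -0.1, 0.0, 0.2], [-0.2, 0.1, 0.3, -0.1]]
values = [1.0, 2.0, 3.0]

exact = exact_attention(queries, keys, values)
approx = linear_attention(queries, keys, values, projections)
```

The key sums are computed once and reused for every query, so the cost grows linearly with the number of patches rather than quadratically, which is what makes pixel-level patches plausible.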

ES-ENAS: Combining Evolution Strategies with Neural Architecture Search at No Extra Cost for Reinforcement Learning

Jan 19, 2021
Xingyou Song, Krzysztof Choromanski, Jack Parker-Holder, Yunhao Tang, Daiyi Peng, Deepali Jain, Wenbo Gao, Aldo Pacchiano, Tamas Sarlos, Yuxiang Yang

We introduce ES-ENAS, a simple neural architecture search (NAS) algorithm for the purpose of reinforcement learning (RL) policy design, by combining Evolutionary Strategies (ES) and Efficient NAS (ENAS) in a highly scalable and intuitive way. Our main insight is noticing that ES is already a distributed blackbox algorithm, and thus we may simply insert a model controller from ENAS into the central aggregator in ES and obtain weight sharing properties for free. By doing so, we bridge the gap from NAS research in supervised learning settings to the reinforcement learning scenario through this relatively simple marriage between two different lines of research, and are one of the first to apply controller-based NAS techniques to RL. We demonstrate the utility of our method by training combinatorial neural network architectures for RL problems in continuous control, via edge pruning and weight sharing. We also incorporate a wide variety of popular techniques from modern NAS literature, including multiobjective optimization and varying controller methods, to showcase their promise in the RL field and discuss possible extensions. We achieve >90% network compression for multiple tasks, which may be of special interest in mobile robotics with limited storage and computational resources.

* 14 pages. This is an updated version of a previous submission which can be found at arXiv:1907.06511. See https://github.com/google-research/google-research/tree/master/es_enas for associated code

Monte-Carlo Tree Search as Regularized Policy Optimization

Jul 24, 2020
Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, Rémi Munos

The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero's search heuristics, along with other common ones such as UCT, are an approximation to the solution of a specific regularized policy optimization problem. With this insight, we propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.

* Accepted to International Conference on Machine Learning (ICML), 2020
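
As a rough illustration of what an "exact solution" to a regularized policy optimization problem can look like, the sketch below assumes the objective is to maximize q . y - lam * KL(prior || y) over the probability simplex. Under that assumption the maximizer has the form y(a) = lam * prior(a) / (alpha - q(a)), where the scalar alpha is found by bisection so that y sums to one. The function name, the fixed `lam`, and the example numbers are hypothetical. How the regularization strength depends on visit counts is not stated in the abstract, so it is taken as an input here.

```python
def regularized_policy(q_values, prior, lam, iterations=60):
    """Maximize q . y - lam * KL(prior || y) over the simplex via bisection on alpha."""
    assert lam > 0.0 and all(p > 0.0 for p in prior)

    def total_mass(alpha):
        return sum(lam * p / (alpha - q) for q, p in zip(q_values, prior))

    # total_mass is decreasing in alpha, is >= 1 at `low`, and is <= 1 at `high`.
    low = max(q + lam * p for q, p in zip(q_values, prior))
    high = max(q_values) + lam
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if total_mass(mid) > 1.0:
            low = mid
        else:
            high = mid

    policy = [lam * p / (high - q) for q, p in zip(q_values, prior)]
    norm = sum(policy)  # remove the small residual left by finite bisection
    return [p / norm for p in policy]


# Hypothetical search statistics for three actions with a uniform prior.
q_estimates = [1.0, 0.5, 0.0]
uniform_prior = [1.0 / 3.0] * 3
balanced = regularized_policy(q_estimates, uniform_prior, lam=0.5)
near_greedy = regularized_policy(q_estimates, uniform_prior, lam=1e-3)
near_prior = regularized_policy(q_estimates, uniform_prior, lam=1e6)
```

A small `lam` concentrates the policy on the action with the highest value estimate, while a large `lam` keeps it close to the prior, which is the trade-off a regularizer of this kind controls.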

Online Hyper-parameter Tuning in Off-policy Learning via Evolutionary Strategies

Jun 13, 2020
Yunhao Tang, Krzysztof Choromanski

Off-policy learning algorithms have been known to be sensitive to the choice of hyper-parameters. However, unlike near on-policy algorithms for which hyper-parameters could be optimized via e.g. meta-gradients, similar techniques could not be straightforwardly applied to off-policy learning. In this work, we propose a framework which entails the application of Evolutionary Strategies to online hyper-parameter tuning in off-policy learning. Our formulation draws close connections to meta-gradients and leverages the strengths of black-box optimization with relatively low-dimensional search spaces. We show that our method outperforms state-of-the-art off-policy learning baselines with static hyper-parameters and recent prior work over a wide range of continuous control benchmarks.

Hindsight Expectation Maximization for Goal-conditioned Reinforcement Learning

Jun 13, 2020
Yunhao Tang, Alp Kucukelbir

We propose a graphical model framework for goal-conditioned RL, with an EM algorithm that operates on the lower bound of the RL objective. The E-step provides a natural interpretation of how 'learning in hindsight' techniques, such as HER, handle extremely sparse goal-conditioned rewards. The M-step reduces policy optimization to supervised learning updates, which greatly stabilizes end-to-end training on high-dimensional inputs such as images. We show that the combined algorithm, hEM, significantly outperforms model-free baselines on a wide range of goal-conditioned benchmarks with sparse rewards.

Self-Imitation Learning via Generalized Lower Bound Q-learning

Jun 12, 2020
Yunhao Tang

Self-imitation learning motivated by lower-bound Q-learning is a novel and effective approach for off-policy learning. In this work, we propose an n-step lower bound which generalizes the original return-based lower-bound Q-learning, and introduce a new family of self-imitation learning algorithms. To provide a formal motivation for the potential performance gains provided by self-imitation learning, we show that n-step lower bound Q-learning achieves a trade-off between fixed point bias and contraction rate, drawing close connections to the popular uncorrected n-step Q-learning. We finally show that n-step lower bound Q-learning is a more robust alternative to return-based self-imitation learning and uncorrected n-step, over a wide range of continuous control benchmark tasks.
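
One way to read the n-step lower bound is as a bootstrapped target that only pulls the Q estimate upward. The sketch below assumes the target is the discounted sum of n observed rewards plus a discounted bootstrap value, and that the loss is the squared positive part of the gap between target and estimate. The function names and numbers are hypothetical, and `bootstrap_value` stands in for whatever Q estimate is used at the state reached after n steps.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """Discounted sum of the observed rewards plus a discounted bootstrap value."""
    target = 0.0
    discount = 1.0
    for reward in rewards:
        target += discount * reward
        discount *= gamma
    return target + discount * bootstrap_value


def lower_bound_loss(q_value, target):
    """Penalize the estimate only when it falls below the lower-bound target."""
    gap = max(target - q_value, 0.0)
    return gap * gap


# Three observed rewards, then a bootstrap from an assumed Q estimate of 4.0.
target = n_step_target([1.0, 1.0, 1.0], bootstrap_value=4.0, gamma=0.5)
loss_when_below = lower_bound_loss(2.0, target)
loss_when_above = lower_bound_loss(3.0, target)
```

Passing the rewards of a whole episode with a zero bootstrap value gives the return-based special case, while shorter reward windows rely more heavily on the bootstrap.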

Taylor Expansion Policy Optimization

Mar 13, 2020
Yunhao Tang, Michal Valko, Rémi Munos

In this work, we investigate the application of Taylor expansions in reinforcement learning. In particular, we propose Taylor expansion policy optimization, a policy optimization formalism that generalizes prior work (e.g., TRPO) as a first-order special case. We also show that Taylor expansions intimately relate to off-policy evaluation. Finally, we show that this new formulation entails modifications which improve the performance of several state-of-the-art distributed algorithms.

Discrete Action On-Policy Learning with Action-Value Critic

Feb 21, 2020
Yuguang Yue, Yunhao Tang, Mingzhang Yin, Mingyuan Zhou

Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension, making it challenging to apply existing on-policy gradient based deep RL algorithms efficiently. To effectively operate in multidimensional discrete action spaces, we construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation. We follow rigorous statistical analysis to design how to generate and combine these correlated actions, and how to sparsify the gradients by shutting down the contributions from certain dimensions. These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques. We demonstrate these properties on OpenAI Gym benchmark tasks, and illustrate how discretizing the action space could benefit the exploration phase and hence facilitate convergence to a better local optimal solution thanks to the flexibility of discrete policy.

ES-MAML: Simple Hessian-Free Meta Learning

Oct 05, 2019
Xingyou Song, Wenbo Gao, Yuxiang Yang, Krzysztof Choromanski, Aldo Pacchiano, Yunhao Tang

We introduce ES-MAML, a new framework for solving the model agnostic meta learning (MAML) problem based on Evolution Strategies (ES). Existing algorithms for MAML are based on policy gradients, and incur significant difficulties when attempting to estimate second derivatives using backpropagation on stochastic policies. We show how ES can be applied to MAML to obtain an algorithm which avoids the problem of estimating second derivatives, and is also conceptually simple and easy to implement. Moreover, ES-MAML can handle new types of nonsmooth adaptation operators, and other techniques for improving performance and estimation of ES methods become applicable. We show empirically that ES-MAML is competitive with existing methods and often yields better adaptation with fewer queries.

* 10 main pages, 21 total pages, 21 figures
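
The idea of avoiding second derivatives can be sketched on a toy problem: treat the reward after adaptation as a blackbox function of the meta-parameter and estimate its gradient with antithetic Gaussian perturbations. Everything below is an assumption made for illustration, including the two one-dimensional tasks, the hill-climbing adaptation operator (chosen because it is nonsmooth), and the step sizes. It is a sketch in the spirit of the abstract, not the paper's implementation.

```python
import random

TASK_TARGETS = [1.0, 3.0]  # hypothetical tasks: reach a target on the real line
ADAPT_STEP = 1.0           # assumed size of the adaptation move


def task_reward(theta, target):
    return -(theta - target) ** 2


def adapt(theta, target):
    """Nonsmooth adaptation: keep the best of three candidate parameters."""
    candidates = [theta - ADAPT_STEP, theta, theta + ADAPT_STEP]
    return max(candidates, key=lambda c: task_reward(c, target))


def meta_objective(theta):
    """Average reward after adaptation, treated as a blackbox by the outer loop."""
    rewards = [task_reward(adapt(theta, t), t) for t in TASK_TARGETS]
    return sum(rewards) / len(rewards)


def es_gradient(objective, theta, sigma, num_pairs, rng):
    """Antithetic ES estimate of the gradient of the Gaussian-smoothed objective."""
    grad = 0.0
    for _ in range(num_pairs):
        eps = rng.gauss(0.0, 1.0)
        diff = objective(theta + sigma * eps) - objective(theta - sigma * eps)
        grad += diff * eps / (2.0 * sigma)
    return grad / num_pairs


rng = random.Random(0)
theta = -2.0
initial_score = meta_objective(theta)
for _ in range(300):
    theta += 0.05 * es_gradient(meta_objective, theta, sigma=0.1, num_pairs=20, rng=rng)
final_score = meta_objective(theta)
```

In this toy setup the meta-parameter drifts toward 2.0, the point from which a single adaptation move reaches either target, and no derivative of the adaptation operator is ever computed.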
