Xingyou Song

Richard

Open Source Vizier: Distributed Infrastructure and API for Reliable and Flexible Blackbox Optimization

Jul 27, 2022

Xingyou Song, Sagi Perel, Chansoo Lee, Greg Kochanski, Daniel Golovin

Abstract: Vizier is the de facto blackbox and hyperparameter optimization service across Google, having optimized some of Google's largest products and research efforts. To operate at the scale of tuning thousands of users' critical systems, Google Vizier solved key design challenges in providing multiple different features, while remaining fully fault-tolerant. In this paper, we introduce Open Source (OSS) Vizier, a standalone Python-based interface for blackbox optimization and research, based on the Google-internal Vizier infrastructure and framework. OSS Vizier provides an API capable of defining and solving a wide variety of optimization problems, including multi-metric, early stopping, transfer learning, and conditional search. Furthermore, it is designed to be a distributed system that assures reliability, and allows multiple parallel evaluations of the user's objective function. The flexible RPC-based infrastructure allows users to access OSS Vizier from binaries written in any language. OSS Vizier also provides a back-end ("Pythia") API that gives algorithm authors a way to interface new algorithms with the core OSS Vizier system. OSS Vizier is available at https://github.com/google/vizier.

* Published as a conference paper for the systems track at the 1st International Conference on Automated Machine Learning (AutoML-Conf 2022). Code can be found at https://github.com/google/vizier
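
The suggest, evaluate, and complete workflow that the abstract describes can be sketched with a toy in-memory study. This is not the OSS Vizier API: `ToyStudy`, `suggest`, `complete`, and `best` are hypothetical names invented for illustration, and plain random search stands in for whichever algorithm a real service would run.

```python
import random


class ToyStudy:
    """Hypothetical in-memory stand-in for a blackbox optimization study."""

    def __init__(self, bounds, seed=0):
        self.bounds = bounds              # {"param_name": (low, high)}
        self.rng = random.Random(seed)
        self.pending = {}                 # trial_id -> parameters awaiting a result
        self.completed = []               # list of (parameters, objective value)
        self._next_id = 0

    def suggest(self, count=1):
        """Hand out `count` trials; separate workers could evaluate them in parallel."""
        trials = []
        for _ in range(count):
            params = {name: self.rng.uniform(low, high)
                      for name, (low, high) in self.bounds.items()}
            self.pending[self._next_id] = params
            trials.append((self._next_id, params))
            self._next_id += 1
        return trials

    def complete(self, trial_id, objective_value):
        """Report the measured objective for a previously suggested trial."""
        self.completed.append((self.pending.pop(trial_id), objective_value))

    def best(self):
        """Return the completed trial with the highest objective value."""
        return max(self.completed, key=lambda trial: trial[1])


def objective(params):
    # The user's blackbox function; maximized at x = 0.3, y = -0.2.
    return -((params["x"] - 0.3) ** 2 + (params["y"] + 0.2) ** 2)


study = ToyStudy({"x": (-1.0, 1.0), "y": (-1.0, 1.0)})
for _ in range(25):
    for trial_id, params in study.suggest(count=4):
        study.complete(trial_id, objective(params))
best_params, best_value = study.best()
```

Keeping pending trials separate from completed ones is one simple way to let several evaluations be in flight at once; a fault-tolerant service would additionally persist this state.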

Towards Learning Universal Hyperparameter Optimizers with Transformers

May 26, 2022

Yutian Chen, Xingyou Song, Chansoo Lee, Zi Wang, Qiuyi Zhang, David Dohan, Kazuya Kawakami, Greg Kochanski, Arnaud Doucet, Marc'aurelio Ranzato(+2 more)

Abstract: Meta-learning hyperparameter optimization (HPO) algorithms from prior experiments is a promising approach to improve optimization efficiency over objective functions from a similar distribution. However, existing methods are restricted to learning from experiments sharing the same set of hyperparameters. In this paper, we introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction when trained on vast tuning data from the wild. Our extensive experiments demonstrate that the OptFormer can imitate at least 7 different HPO algorithms, which can be further improved via its function uncertainty estimates. Compared to a Gaussian Process, the OptFormer also learns a robust prior distribution for hyperparameter response functions, and can thereby provide more accurate and better calibrated predictions. This work paves the path to future extensions for training a Transformer-based model as a general HPO optimizer.

Automated Reinforcement Learning (AutoRL): A Survey and Open Problems

Jan 11, 2022

Jack Parker-Holder, Raghu Rajan, Xingyou Song, André Biedenkapp, Yingjie Miao, Theresa Eimer, Baohe Zhang, Vu Nguyen, Roberto Calandra, Aleksandra Faust(+2 more)

Abstract: The combination of Reinforcement Learning (RL) with deep learning has led to a series of impressive feats, with many believing (deep) RL provides a path towards generally capable agents. However, the success of RL agents is often highly sensitive to design choices in the training process, which may require tedious and error-prone manual tuning. This makes it challenging to use RL for new problems and also limits its full potential. In many other areas of machine learning, AutoML has shown it is possible to automate such design choices and has also yielded promising initial results when applied to RL. However, Automated Reinforcement Learning (AutoRL) involves not only standard applications of AutoML but also additional challenges unique to RL that naturally produce a different set of methods. As such, AutoRL has been emerging as an important area of research in RL, providing promise in a variety of applications from RNA design to playing games such as Go. Given the diversity of methods and environments considered in RL, much of the research has been conducted in distinct subfields, ranging from meta-learning to evolution. In this survey, we seek to unify the field of AutoRL: we provide a common taxonomy, discuss each area in detail, and pose open problems which would be of interest to researchers going forward.

Debiasing a First-order Heuristic for Approximate Bi-level Optimization

Jun 08, 2021

Valerii Likhosherstov, Xingyou Song, Krzysztof Choromanski, Jared Davis, Adrian Weller

Abstract: Approximate bi-level optimization (ABLO) consists of (outer-level) optimization problems, involving numerical (inner-level) optimization loops. While ABLO has many applications across deep learning, it suffers from time and memory complexity proportional to the length $r$ of its inner optimization loop. To address this complexity, an earlier first-order method (FOM) was proposed as a heuristic that omits second derivative terms, yielding significant speed gains and requiring only constant memory. Despite FOM's popularity, there is a lack of theoretical understanding of its convergence properties. We contribute by theoretically characterizing FOM's gradient bias under mild assumptions. We further demonstrate a rich family of examples where FOM-based SGD does not converge to a stationary point of the ABLO objective. We address this concern by proposing an unbiased FOM (UFOM) enjoying constant memory complexity as a function of $r$. We characterize the introduced time-variance tradeoff, demonstrate convergence bounds, and find an optimal UFOM for a given ABLO problem. Finally, we propose an efficient adaptive UFOM scheme.

* Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. arXiv admin note: text overlap with arXiv:2006.03631
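
The gradient bias of a first-order heuristic can be seen on a one-dimensional toy problem. The sketch below is an assumption-laden illustration, not the paper's construction: it uses a MAML-style setup in which the inner loop runs $r$ gradient steps on a hypothetical inner loss $\frac{1}{2} a \phi^2$ starting from $\phi_0 = \theta$, and the outer loss is $\frac{1}{2}(\phi_r - c)^2$. All function names and constants are made up for the example.

```python
def inner_loop(theta, r, alpha, a):
    """Run r gradient-descent steps on the inner loss 0.5 * a * phi**2 from phi_0 = theta."""
    phi = theta
    for _ in range(r):
        phi -= alpha * a * phi
    return phi


def outer_loss(theta, r, alpha, a, c):
    """Outer objective evaluated at the result of the inner loop."""
    return 0.5 * (inner_loop(theta, r, alpha, a) - c) ** 2


def exact_gradient(theta, r, alpha, a, c):
    """Chain rule through all r inner steps. Each step contributes a factor
    (1 - alpha * a), where `a` is the second derivative of the inner loss."""
    phi_r = inner_loop(theta, r, alpha, a)
    return (phi_r - c) * (1.0 - alpha * a) ** r


def first_order_gradient(theta, r, alpha, a, c):
    """First-order heuristic: drop the second-derivative terms,
    i.e. treat d(phi_r)/d(theta) as 1."""
    return inner_loop(theta, r, alpha, a) - c


settings = dict(theta=1.0, r=5, alpha=0.1, a=2.0, c=0.5)
bias = first_order_gradient(**settings) - exact_gradient(**settings)
```

In this toy case the two estimates differ by the factor $(1 - \alpha a)^r$, so the gap grows with the inner-loop length $r$ unless the inner loss is flat ($a = 0$).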

RL-DARTS: Differentiable Architecture Search for Reinforcement Learning

Jun 04, 2021

Yingjie Miao, Xingyou Song, Daiyi Peng, Summer Yue, Eugene Brevdo, Aleksandra Faust

Abstract: We introduce RL-DARTS, one of the first applications of Differentiable Architecture Search (DARTS) in reinforcement learning (RL) to search for convolutional cells, applied to the Procgen benchmark. We outline the initial difficulties of applying neural architecture search techniques in RL, and demonstrate that by simply replacing the image encoder with a DARTS supernet, our search method is sample-efficient, requires minimal extra compute resources, and is also compatible with off-policy and on-policy RL algorithms, needing only minor changes in preexisting code. Surprisingly, we find that the supernet can be used as an actor for inference to generate replay data in standard RL training loops, and thus train end-to-end. Throughout this training process, we show that the supernet gradually learns better cells, leading to alternative architectures which can be highly competitive against manually designed policies, but also verify previous design choices for RL policies.

* 19 pages total, 17 figures
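
For readers unfamiliar with DARTS, the core of a supernet is the continuous relaxation in which each edge outputs a softmax-weighted sum of candidate operations, and the final cell keeps the highest-weighted operation. The following minimal sketch shows that relaxation on scalars; the candidate operations and the names `CANDIDATE_OPS`, `mixed_op`, and `discretize` are placeholders chosen for the example, standing in for the convolutional operations and learned architecture weights of a real search.

```python
import math

# Placeholder scalar operations; a real cell would use convolutions, pooling, etc.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "zero": lambda x: 0.0,
    "relu": lambda x: max(x, 0.0),
    "negate": lambda x: -x,
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, alphas):
    """Supernet edge: weighted sum of every candidate op, weights = softmax(alphas)."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def discretize(alphas):
    """After search, keep only the operation with the largest architecture weight."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alphas)), key=lambda i: alphas[i])]
```

Because `mixed_op` is differentiable in `alphas`, the architecture weights can be trained by gradient descent alongside the network weights, which is what makes the search differentiable.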

Unlocking Pixels for Reinforcement Learning via Implicit Attention

Mar 04, 2021

Krzysztof Choromanski, Deepali Jain, Jack Parker-Holder, Xingyou Song, Valerii Likhosherstov, Anirban Santara, Aldo Pacchiano, Yunhao Tang, Adrian Weller

Abstract: There has recently been significant interest in training reinforcement learning (RL) agents in vision-based environments. This poses many challenges, such as high dimensionality and potential for observational overfitting through spurious correlations. A promising approach to solve both of these problems is a self-attention bottleneck, which provides a simple and effective framework for learning high performing policies, even in the presence of distractions. However, due to poor scalability of attention architectures, these methods do not scale beyond low resolution visual inputs, using large patches (thus small attention matrices). In this paper we make use of new efficient attention algorithms, recently shown to be highly effective for Transformers, and demonstrate that these new techniques can be applied in the RL setting. This allows our attention-based controllers to scale to larger visual inputs, and facilitate the use of smaller patches, even individual pixels, improving generalization. In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features, leveraging the theory of angular kernels. We show theoretically and empirically that hybrid random features is a promising approach when using attention for vision-based RL.

ES-ENAS: Combining Evolution Strategies with Neural Architecture Search at No Extra Cost for Reinforcement Learning

Jan 19, 2021

Xingyou Song, Krzysztof Choromanski, Jack Parker-Holder, Yunhao Tang, Daiyi Peng, Deepali Jain, Wenbo Gao, Aldo Pacchiano, Tamas Sarlos, Yuxiang Yang

Abstract: We introduce ES-ENAS, a simple neural architecture search (NAS) algorithm for the purpose of reinforcement learning (RL) policy design, by combining Evolutionary Strategies (ES) and Efficient NAS (ENAS) in a highly scalable and intuitive way. Our main insight is noticing that ES is already a distributed blackbox algorithm, and thus we may simply insert a model controller from ENAS into the central aggregator in ES and obtain weight sharing properties for free. By doing so, we bridge the gap from NAS research in supervised learning settings to the reinforcement learning scenario through this relatively simple marriage between two different lines of research, and are one of the first to apply controller-based NAS techniques to RL. We demonstrate the utility of our method by training combinatorial neural network architectures for RL problems in continuous control, via edge pruning and weight sharing. We also incorporate a wide variety of popular techniques from modern NAS literature, including multiobjective optimization and varying controller methods, to showcase their promise in the RL field and discuss possible extensions. We achieve >90% network compression for multiple tasks, which may be of special interest in mobile robotics with limited storage and computational resources.

* 14 pages. This is an updated version of a previous submission which can be found at arXiv:1907.06511. See https://github.com/google-research/google-research/tree/master/es_enas for associated code
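
The "distributed blackbox" structure of ES that the abstract relies on can be sketched as a loop in which workers score perturbed parameters and a central aggregator averages the results. The code below is a generic antithetic ES update on a made-up quadratic reward, assuming hypothetical names (`es_step`, `reward`, `TARGET`); the ENAS controller that the paper inserts into the aggregator is deliberately omitted.

```python
import random


def es_step(params, reward_fn, rng, num_pairs=20, sigma=0.1, lr=0.05):
    """One antithetic ES update: score +/- perturbations, then aggregate."""
    grad = [0.0] * len(params)
    for _ in range(num_pairs):                      # each pair could run on its own worker
        eps = [rng.gauss(0.0, 1.0) for _ in params]
        plus = reward_fn([p + sigma * e for p, e in zip(params, eps)])
        minus = reward_fn([p - sigma * e for p, e in zip(params, eps)])
        # Aggregator: accumulate the finite-difference estimate along eps.
        for i, e in enumerate(eps):
            grad[i] += (plus - minus) * e / (2.0 * sigma * num_pairs)
    return [p + lr * g for p, g in zip(params, grad)]   # gradient ascent on reward


TARGET = [1.0, -2.0, 0.5]   # illustrative optimum, not taken from the paper


def reward(params):
    """Toy blackbox reward, maximized (at zero) when params equals TARGET."""
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


rng = random.Random(0)
params = [0.0, 0.0, 0.0]
for _ in range(200):
    params = es_step(params, reward, rng)
```

Only scalar rewards flow back to the aggregator, which is why extra logic such as an architecture controller can sit there without changing what the workers do.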

Sub-Linear Memory: How to Make Performers SLiM

Dec 21, 2020

Valerii Likhosherstov, Krzysztof Choromanski, Jared Davis, Xingyou Song, Adrian Weller

Abstract: The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring $O(L^2)$ in serial time and memory as functions of input length $L$. Recent works proposed various linear self-attention mechanisms, scaling only as $O(L)$ for serial computation. We perform a thorough analysis of recent Transformer mechanisms with linear self-attention, Performers, in terms of overall computational complexity. We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory as a function of $L$ (in addition to negligible storage for the input sequence), at a cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only $O(1)$ memory during training, and still requires $O(L)$ time. This discovered time-memory tradeoff can be used for training or, due to complete backward-compatibility, for fine-tuning on a low-memory device, e.g. a smartphone or an earlier-generation GPU, thus contributing towards decentralized and democratized deep learning.
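
One way to see why memory need not grow with $L$ is that causal linear attention can be computed from a running prefix-sum state. The sketch below covers only this forward pass, under the assumption that positive feature vectors for queries and keys are already given; the paper's memory-efficient backward pass is not reproduced here, and the function names are hypothetical.

```python
def causal_linear_attention_stream(phi_q, phi_k, values):
    """Yield one output per position while storing only an m x d running state."""
    m, d = len(phi_k[0]), len(values[0])
    kv_state = [[0.0] * d for _ in range(m)]    # running sum of phi(k_s) v_s^T
    k_state = [0.0] * m                         # running sum of phi(k_s)
    for fq, fk, v in zip(phi_q, phi_k, values):
        for i in range(m):
            k_state[i] += fk[i]
            for j in range(d):
                kv_state[i][j] += fk[i] * v[j]
        norm = sum(fq[i] * k_state[i] for i in range(m))
        yield [sum(fq[i] * kv_state[i][j] for i in range(m)) / norm
               for j in range(d)]


def causal_attention_quadratic(phi_q, phi_k, values):
    """Reference: position t explicitly weights positions 0..t, O(L^2) time."""
    d = len(values[0])
    outputs = []
    for t, fq in enumerate(phi_q):
        weights = [sum(a * b for a, b in zip(fq, phi_k[s])) for s in range(t + 1)]
        norm = sum(weights)
        outputs.append([sum(w * values[s][j] for s, w in enumerate(weights)) / norm
                        for j in range(d)])
    return outputs
```

The streaming version keeps `kv_state` and `k_state`, whose sizes depend on the feature and value dimensions but not on the sequence length.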

Rethinking Attention with Performers

Sep 30, 2020

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser(+3 more)

Abstract: We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.

* 36 pages. This is an updated version of a previous submission which can be found at arXiv:2006.03555. See https://github.com/google-research/google-research/tree/master/protein_lm for protein language model code, and https://github.com/google-research/google-research/tree/master/performer for Performer code
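
The linear complexity comes from replacing the $L \times L$ softmax matrix with a product of feature maps and then reordering the matrix multiplications. The sketch below is a simplified illustration, not the released Performer code: it uses i.i.d. Gaussian projections rather than the orthogonal ones in FAVOR+, and all function names are assumptions made for the example. It relies on the identity $\mathbb{E}_w[\phi(q)^\top \phi(k)] = \exp(q^\top k)$ for $\phi(x)_i = \exp(w_i^\top x - \|x\|^2/2)/\sqrt{m}$ with $w_i \sim \mathcal{N}(0, I)$.

```python
import math
import random


def positive_random_features(x, projections):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m); every entry is positive."""
    half_sq_norm = 0.5 * sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(projections))
    return [scale * math.exp(sum(w_j * x_j for w_j, x_j in zip(w, x)) - half_sq_norm)
            for w in projections]


def attention_quadratic(phi_q, phi_k, values):
    """Reference order (phi_q phi_k^T) V: builds all L x L weights."""
    d = len(values[0])
    outputs = []
    for fq in phi_q:
        weights = [sum(a * b for a, b in zip(fq, fk)) for fk in phi_k]
        norm = sum(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) / norm
                        for j in range(d)])
    return outputs


def attention_linear(phi_q, phi_k, values):
    """Reordered as phi_q (phi_k^T V): one pass over keys, one over queries."""
    m, d = len(phi_k[0]), len(values[0])
    kv = [[0.0] * d for _ in range(m)]      # phi_k^T V, shape m x d
    k_sum = [0.0] * m                       # column sums of phi_k, for normalization
    for fk, v in zip(phi_k, values):
        for i in range(m):
            k_sum[i] += fk[i]
            for j in range(d):
                kv[i][j] += fk[i] * v[j]
    outputs = []
    for fq in phi_q:
        norm = sum(fq[i] * k_sum[i] for i in range(m))
        outputs.append([sum(fq[i] * kv[i][j] for i in range(m)) / norm
                        for j in range(d)])
    return outputs


def performer_style_attention(queries, keys, values, num_features, rng):
    """Approximate softmax(Q K^T / sqrt(d)) V using Gaussian (not orthogonal) projections."""
    d = len(queries[0])
    projections = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(num_features)]
    scale = d ** -0.25                      # splits the 1/sqrt(d) factor between q and k
    phi_q = [positive_random_features([scale * x for x in q], projections) for q in queries]
    phi_k = [positive_random_features([scale * x for x in k], projections) for k in keys]
    return attention_linear(phi_q, phi_k, values)
```

The two orderings give the same result for any fixed feature map; only the approximation to softmax attention depends on the number and quality of the random features.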

An Ode to an ODE

Jun 23, 2020

Krzysztof Choromanski, Jared Quincy Davis, Valerii Likhosherstov, Xingyou Song, Jean-Jacques Slotine, Jacob Varley, Honglak Lee, Adrian Weller, Vikas Sindhwani

Abstract: We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the orthogonal group O(d). This nested system of two flows, where the parameter-flow is constrained to lie on the compact manifold, provides stability and effectiveness of training and provably solves the gradient vanishing-explosion problem which is intrinsically related to training deep neural network architectures such as Neural ODEs. Consequently, it leads to better downstream models, as we show on the example of training reinforcement learning policies with evolution strategies, and in the supervised learning setting, by comparing with previous SOTA baselines. We provide strong convergence results for our proposed mechanism that are independent of the depth of the network, supporting our empirical studies. Our results show an intriguing connection between the theory of deep neural networks and the field of matrix flows on compact manifolds.

* 20 pages, 9 figures
