Srivatsan Srinivasan

Reinforced Self-Training (ReST) for Language Modeling

Aug 21, 2023
Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando de Freitas

Reinforcement learning from human feedback (RLHF) can improve the quality of large language model (LLM) outputs by aligning them with human preferences. We propose a simple algorithm for aligning LLMs with human preferences, inspired by growing batch reinforcement learning (RL), which we call Reinforced Self-Training (ReST). Given an initial LLM policy, ReST produces a dataset by generating samples from the policy, which are then used to improve the LLM policy using offline RL algorithms. ReST is more efficient than typical online RLHF methods because the training dataset is produced offline, which allows data reuse. While ReST is a general approach applicable to all generative learning settings, we focus on its application to machine translation. Our results show that ReST can substantially improve translation quality, as measured by automated metrics and human evaluation on machine translation benchmarks, in a compute- and sample-efficient manner.
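To make the Grow/Improve structure of ReST concrete, here is a minimal, hedged sketch of the loop described above. The toy policy, reward model, and fine-tuning step (sample_fn, reward_fn, finetune_on) are illustrative stand-ins, not the paper's actual models or API.

```python
# Minimal sketch of a ReST-style Grow/Improve loop, under illustrative assumptions:
# sample_fn, reward_fn, and finetune_on are hypothetical stand-ins for the LLM
# policy, a learned reward model, and an offline fine-tuning step.
import random

def sample_fn(prompt, n):
    # Stand-in for sampling n candidate outputs from the current policy.
    return [f"{prompt}-candidate-{random.randint(0, 9)}" for _ in range(n)]

def reward_fn(prompt, output):
    # Stand-in for a reward model (e.g. a translation quality metric).
    return random.random()

def finetune_on(policy_state, dataset):
    # Stand-in for an offline RL / filtered fine-tuning update on a fixed dataset.
    policy_state["examples_seen"] += len(dataset)
    return policy_state

policy_state = {"examples_seen": 0}
prompts = ["src-sentence-1", "src-sentence-2"]

for grow_step in range(2):  # Grow: sample a dataset from the current policy
    dataset = [(p, o, reward_fn(p, o)) for p in prompts for o in sample_fn(p, 4)]
    threshold = 0.0
    for improve_step in range(3):  # Improve: fine-tune on reward-filtered data
        kept = [(p, o) for (p, o, r) in dataset if r >= threshold]
        policy_state = finetune_on(policy_state, kept)
        threshold += 0.25  # raise the filtering threshold each iteration
print(policy_state)
```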

* 23 pages, 16 figures 

AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning

Aug 07, 2023
Michaël Mathieu, Sherjil Ozair, Srivatsan Srinivasan, Caglar Gulcehre, Shangtong Zhang, Ray Jiang, Tom Le Paine, Richard Powell, Konrad Żołna, Julian Schrittwieser, David Choi, Petko Georgiev, Daniel Toyama, Aja Huang, Roman Ring, Igor Babuschkin, Timo Ewalds, Mahyar Bordbar, Sarah Henderson, Sergio Gómez Colmenarejo, Aäron van den Oord, Wojciech Marian Czarnecki, Nando de Freitas, Oriol Vinyals

StarCraft II is one of the most challenging simulated reinforcement learning environments: it is partially observable, stochastic, and multi-agent, and mastering it requires strategic planning over long time horizons combined with real-time low-level execution. It also has an active professional competitive scene. StarCraft II is uniquely suited for advancing offline RL algorithms, both because of its challenging nature and because Blizzard has released a massive dataset of millions of StarCraft II games played by human players. This paper leverages that dataset to establish a benchmark, called AlphaStar Unplugged, which introduces unprecedented challenges for offline reinforcement learning. We define a dataset (a subset of Blizzard's release), tools that standardize an API for machine learning methods, and an evaluation protocol. We also present baseline agents, including behavior cloning and offline variants of actor-critic and MuZero. We improve the state of the art for agents trained using only offline data, and we achieve a 90% win rate against the previously published AlphaStar behavior cloning agent.
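As a concrete, heavily simplified illustration of the behavior cloning baseline mentioned above, the numpy sketch below fits a softmax policy to a fixed batch of (observation, action) pairs. The real AlphaStar Unplugged agents are large neural networks trained through the released StarCraft II tooling; the data and model here are toy assumptions.

```python
# A minimal behavior-cloning baseline in numpy: fit a softmax policy to
# (observation, action) pairs from a fixed offline dataset. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, obs_dim, n_actions = 512, 8, 4
obs = rng.normal(size=(n, obs_dim))           # offline observations (toy)
actions = rng.integers(0, n_actions, size=n)  # actions taken by "human players" (toy)

W = np.zeros((obs_dim, n_actions))
lr = 0.1
for epoch in range(200):
    logits = obs @ W
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Cross-entropy gradient: predicted probabilities minus one-hot expert actions.
    grad = probs.copy()
    grad[np.arange(n), actions] -= 1.0
    W -= lr * (obs.T @ grad) / n

nll = -np.log(probs[np.arange(n), actions]).mean()
print(f"final behavior-cloning NLL: {nll:.3f}")
```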

* 32 pages, 13 figures, previous version published at a NeurIPS 2021 workshop: https://openreview.net/forum?id=Np8Pumfoty 

An Empirical Study of Implicit Regularization in Deep Offline RL

Jul 07, 2022
Caglar Gulcehre, Srivatsan Srinivasan, Jakub Sygnowski, Georg Ostrovski, Mehrdad Farajtabar, Matt Hoffman, Razvan Pascanu, Arnaud Doucet

Deep neural networks are the most commonly used function approximators in offline reinforcement learning. Prior works have shown that neural nets trained with TD-learning and gradient descent can exhibit implicit regularization that can be characterized as under-parameterization of these networks. Specifically, the rank of the penultimate feature layer, also called the effective rank, has been observed to collapse drastically during training. In turn, this collapse has been argued to reduce the model's ability to further adapt in later stages of learning, leading to diminished final performance. Such an association between effective rank and performance makes effective rank compelling for offline RL, primarily for offline policy evaluation. In this work, we conduct a careful empirical study of the relation between effective rank and performance on three offline RL datasets: bsuite, Atari, and DeepMind Lab. We observe that a direct association exists only in restricted settings and disappears in more extensive hyperparameter sweeps. We also empirically identify three phases of learning that explain the impact of implicit regularization on the learning dynamics, and we find that bootstrapping alone is insufficient to explain the collapse of the effective rank. Further, we show that several other factors could confound the relationship between effective rank and performance, and we conclude that studying this association under simplistic assumptions could be highly misleading.
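The effective rank statistic at the center of this study can be computed directly from the penultimate-layer feature matrix. The sketch below uses a common srank-style definition (the smallest number of singular values capturing a 1 - delta fraction of the total singular-value mass, with delta = 0.01); the paper's exact normalization and delta may differ.

```python
# A short numpy sketch of an "effective rank" statistic for a feature matrix:
# the number of singular values needed to cover a 1 - delta fraction of the
# total singular-value mass.
import numpy as np

def effective_rank(features, delta=0.01):
    # features: (batch, feature_dim) activations of the penultimate layer.
    singular_values = np.linalg.svd(features, compute_uv=False)
    cumulative = np.cumsum(singular_values) / singular_values.sum()
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

rng = np.random.default_rng(0)
full_rank = rng.normal(size=(256, 64))                             # well-conditioned features
collapsed = rng.normal(size=(256, 3)) @ rng.normal(size=(3, 64))   # rank-3 features
print(effective_rank(full_rank), effective_rank(collapsed))        # high vs. ~3
```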

* 40 pages, 37 figures, 2 tables 

Output-Constrained Bayesian Neural Networks

May 15, 2019
Wanqian Yang, Lars Lorch, Moritz A. Graule, Srivatsan Srinivasan, Anirudh Suresh, Jiayu Yao, Melanie F. Pradier, Finale Doshi-Velez

Bayesian neural network (BNN) priors are defined in parameter space, making it hard to encode prior knowledge expressed in function space. We formulate a prior that incorporates functional constraints about what the output can or cannot be in regions of the input space. Output-Constrained BNNs (OC-BNN) are an interpretable approach to enforcing a range of constraints, fully consistent with the Bayesian framework and amenable to black-box inference. We demonstrate how OC-BNNs improve model robustness and prevent the prediction of infeasible outputs in two real-world applications, in healthcare and robotics.
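To illustrate the general structure of an output-constrained prior, the sketch below augments a standard Gaussian weight prior with a term that penalizes network outputs violating a known constraint on a region of input space. The specific constraint-prior form, network, and inference procedure in the paper differ; everything here is an illustrative assumption.

```python
# Hedged sketch of an output-constrained prior: a Gaussian prior over weights plus
# a penalty on outputs that violate the constraint "output must be non-negative
# for inputs in [0, 1]". The functional form is illustrative, not the paper's.
import numpy as np

def forward(w, x):
    # Tiny one-hidden-unit-per-feature network; w packs both weight vectors.
    h = np.tanh(np.outer(x, w[:4]))
    return h @ w[4:8]

def log_prior(w, constraint_x, sharpness=10.0):
    log_p_weights = -0.5 * np.sum(w ** 2)        # standard normal prior on weights
    outputs = forward(w, constraint_x)
    violation = np.clip(-outputs, 0.0, None)     # how far outputs dip below zero
    log_p_constraint = -sharpness * np.sum(violation ** 2)
    return log_p_weights + log_p_constraint      # unnormalized log prior density

rng = np.random.default_rng(0)
constraint_x = rng.uniform(0.0, 1.0, size=32)    # points in the constrained input region
w = rng.normal(size=8)
print(log_prior(w, constraint_x))
```

Any black-box sampler or variational method can then target this augmented log density in place of the usual parameter-space prior.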

* Presented at the ICML 2019 Workshop on Uncertainty and Robustness in Deep Learning and Workshop on Understanding and Improving Generalization in Deep Learning. Long Beach, CA, 2019 

Truly Batch Apprenticeship Learning with Deep Successor Features

Mar 24, 2019
Donghun Lee, Srivatsan Srinivasan, Finale Doshi-Velez

We introduce a novel apprenticeship learning algorithm to learn an expert's underlying reward structure in off-policy, model-free batch settings. Unlike existing methods that require a dynamics model or additional data acquisition for on-policy evaluation, our algorithm requires only the batch data of observed expert behavior. Such settings are common in real-world tasks such as health care, finance, or industrial processes, where accurate simulators do not exist or data acquisition is costly. To address the challenges of batch settings, we introduce Deep Successor Feature Networks (DSFN), which estimate feature expectations in an off-policy setting, and a transition-regularized imitation network that produces a near-expert initial policy and an efficient feature representation. Our algorithm achieves superior results in batch settings on both control benchmarks and a vital clinical task of sepsis management in the Intensive Care Unit.
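The successor-feature idea that DSFN builds on can be sketched with a simple TD update that estimates the expert's discounted feature expectations from logged transitions. The network architecture, off-policy corrections, and transition-regularized imitation network of the paper are omitted; the names and shapes below are illustrative.

```python
# Toy numpy sketch of successor-feature estimation from a logged batch:
# psi(s) approximates E[sum_t gamma^t phi(s_t)] under the expert's behavior,
# learned with a TD-style update. Tabular and simplified, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_states, feat_dim, gamma = 5, 3, 0.9
phi = rng.normal(size=(n_states, feat_dim))   # fixed state features
# Logged batch of expert transitions (state, next_state).
batch = [(rng.integers(n_states), rng.integers(n_states)) for _ in range(2000)]

psi = np.zeros((n_states, feat_dim))          # successor feature estimates
lr = 0.05
for s, s_next in batch:
    td_target = phi[s] + gamma * psi[s_next]  # bootstrap from the next-state estimate
    psi[s] += lr * (td_target - psi[s])

# Given a reward weight vector w, psi @ w gives expert value estimates, the
# feature-expectation quantity apprenticeship learning tries to match.
print(psi.round(2))
```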

* 10 pages, 3 figures, Under Conference Review 

Evaluating Reinforcement Learning Algorithms in Observational Health Settings

May 31, 2018
Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, Christopher Mosch, Li-wei H. Lehman, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David Sontag, Finale Doshi-Velez

Much attention has been devoted recently to the development of machine learning algorithms with the goal of improving treatment policies in healthcare. Reinforcement learning (RL) is a sub-field of machine learning concerned with learning how to make sequences of decisions so as to optimize long-term effects. RL algorithms have already been proposed to identify decision-making strategies for mechanical ventilation, sepsis management, and treatment of schizophrenia. However, before implementing treatment policies learned by black-box algorithms in high-stakes clinical decision problems, special care must be taken in the evaluation of these policies. In this document, our goal is to expose some of the subtleties associated with evaluating RL algorithms in healthcare. We aim to provide a conceptual starting point for clinical and computational researchers to ask the right questions when designing and evaluating algorithms for new ways of treating patients. In the following, we describe how choices about how to summarize a patient's history, the variance of statistical estimators, and confounders in more ad hoc measures can result in unreliable, even misleading, estimates of the quality of a treatment policy. We also provide suggestions for mitigating these effects: while there is much promise in mining observational health data to uncover better treatment policies, evaluation must be performed thoughtfully.
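One of the pitfalls mentioned above, the variance of statistical estimators used for off-policy evaluation, is easy to demonstrate in a toy setting. The sketch below is not from the paper; it simply shows how importance-sampling estimates of a policy's value can be unbiased yet highly variable when the evaluated policy disagrees with the logged clinician behavior.

```python
# Toy numpy illustration of high-variance off-policy evaluation: importance
# sampling in a one-step (bandit) setting where the evaluated policy rarely
# agrees with the behavior (clinician) policy. All numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
n_patients = 500
behavior_p = np.array([0.9, 0.1])   # clinicians almost always choose action 0
eval_p = np.array([0.1, 0.9])       # policy being evaluated prefers action 1
true_reward = np.array([0.6, 0.8])  # expected outcome of each action

estimates = []
for trial in range(200):
    actions = rng.choice(2, size=n_patients, p=behavior_p)
    rewards = rng.binomial(1, true_reward[actions])
    weights = eval_p[actions] / behavior_p[actions]   # importance weights
    estimates.append(np.mean(weights * rewards))

print(f"true value of evaluated policy: {eval_p @ true_reward:.3f}")
print(f"IS estimate over trials: {np.mean(estimates):.3f} +/- {np.std(estimates):.3f}")
```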
