Ruslan Salakhutdinov

Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning

May 17, 2021
Yue Wu, Shuangfei Zhai, Nitish Srivastava, Joshua Susskind, Jian Zhang, Ruslan Salakhutdinov, Hanlin Goh

Offline Reinforcement Learning promises to learn effective policies from previously collected, static datasets without the need for exploration. However, existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states. We hypothesize that a key ingredient missing from existing methods is a proper treatment of uncertainty in the offline setting. We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly. Implementation-wise, we adopt a practical and effective dropout-based uncertainty estimation method that introduces very little overhead over existing RL algorithms. Empirically, we observe that UWAC substantially improves model stability during training. In addition, UWAC outperforms existing offline RL methods on a variety of competitive tasks, and achieves significant performance gains over the state-of-the-art baseline on datasets with sparse demonstrations collected from human experts.
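
The down-weighting idea described above can be illustrated with a minimal sketch. It assumes that uncertainty is measured as the variance of Q-values across several stochastic (dropout-enabled) forward passes, and that the weight is an inverse-variance term clipped from above. The abstract does not state the exact weighting rule, so the function names, the `beta` and `max_weight` parameters, and the clipping are all illustrative assumptions.

```python
from statistics import pvariance


def uncertainty_weights(q_samples, beta=1.0, max_weight=1.5):
    """Down-weight state-action pairs whose Q-estimates disagree.

    q_samples[t][i] is the Q-value of pair i on the t-th stochastic
    forward pass (for example, with dropout left on). The inverse-variance
    form and both defaults are assumptions made for this sketch.
    """
    n_pairs = len(q_samples[0])
    weights = []
    for i in range(n_pairs):
        var = pvariance([one_pass[i] for one_pass in q_samples])
        # Clip so that a near-zero variance cannot produce a huge weight.
        weights.append(min(beta / var, max_weight) if var > 0 else max_weight)
    return weights


def weighted_td_loss(q_pred, td_target, weights):
    """Mean squared TD error with one uncertainty weight per sample."""
    terms = [w * (q - y) ** 2 for q, y, w in zip(q_pred, td_target, weights)]
    return sum(terms) / len(terms)
```

A pair whose sampled Q-values agree closely keeps a weight near `max_weight`, while a pair with widely scattered Q-values (a likely OOD pair under this assumption) contributes much less to the loss.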

* To appear in ICML 2021

A Note on Connecting Barlow Twins with Negative-Sample-Free Contrastive Learning

Apr 28, 2021
Yao-Hung Hubert Tsai, Shaojie Bai, Louis-Philippe Morency, Ruslan Salakhutdinov

In this report, we relate the algorithmic design of the Barlow Twins method to the Hilbert-Schmidt Independence Criterion (HSIC), thus establishing it as a contrastive learning approach that is free of negative samples. Through this perspective, we argue that Barlow Twins (and thus the class of negative-sample-free contrastive learning methods) suggests a possibility to bridge the two major families of self-supervised learning philosophies: non-contrastive and contrastive approaches. In particular, Barlow Twins exemplifies how we could combine the best practices of both worlds: avoiding the need for large training batch sizes and negative sample pairing (like non-contrastive methods) and avoiding symmetry-breaking network designs (like contrastive methods).
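
For orientation, the Barlow Twins objective that the report analyzes can be sketched in a few lines. This is the commonly stated cross-correlation form of the loss, not the HSIC derivation from the report; the function names and the `off_diag_weight` default are illustrative assumptions, and every feature is assumed to vary across the batch (otherwise standardization would divide by zero).

```python
import math


def standardize(column):
    """Shift a feature column to zero mean and scale it to unit variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]


def barlow_twins_loss(z_a, z_b, off_diag_weight=0.005):
    """z_a, z_b: lists of embedding rows from two views of the same batch."""
    n = len(z_a)
    cols_a = [standardize(col) for col in zip(*z_a)]
    cols_b = [standardize(col) for col in zip(*z_b)]
    loss = 0.0
    for i, a in enumerate(cols_a):
        for j, b in enumerate(cols_b):
            # Entry (i, j) of the cross-correlation matrix between views.
            c_ij = sum(x * y for x, y in zip(a, b)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2  # pull matching features together
            else:
                loss += off_diag_weight * c_ij ** 2  # decorrelate the rest
    return loss
```

Note that the loss only uses pairs of views of the same samples. No term compares a sample against a different sample, which is the sense in which the method is free of negative samples.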

Focused Attention Improves Document-Grounded Generation

Apr 26, 2021
Shrimai Prabhumoye, Kazuma Hashimoto, Yingbo Zhou, Alan W Black, Ruslan Salakhutdinov

Document-grounded generation is the task of using the information provided in a document to improve text generation. This work focuses on two different document-grounded generation tasks: Wikipedia update generation and dialogue response generation. Our work introduces two novel adaptations of large-scale pre-trained encoder-decoder models, focusing on building a context-driven representation of the document and enabling specific attention to the information in the document. Additionally, we provide a stronger BART baseline for these tasks. Our proposed techniques outperform existing methods on both automated (at least 48% increase in BLEU-4 points) and human evaluation for closeness to reference and relevance to the document. Furthermore, we perform comprehensive manual inspection of the generated output and categorize errors to provide insights into future directions in modeling these tasks.

* Accepted at North American Chapter of the Association for Computational Linguistics (NAACL) 2021

Self-supervised Representation Learning with Relative Predictive Coding

Apr 12, 2021
Yao-Hung Hubert Tsai, Martin Q. Ma, Muqiao Yang, Han Zhao, Louis-Philippe Morency, Ruslan Salakhutdinov

This paper introduces Relative Predictive Coding (RPC), a new contrastive representation learning objective that maintains a good balance among training stability, minibatch size sensitivity, and downstream task performance. The key to the success of RPC is two-fold. First, RPC introduces relative parameters to regularize the objective for boundedness and low variance. Second, RPC contains no logarithm or exponential score functions, which are the main cause of training instability in prior contrastive objectives. We empirically verify the effectiveness of RPC on benchmark vision and speech self-supervised learning tasks. Lastly, we relate RPC to mutual information (MI) estimation, showing that RPC can be used to estimate MI with low variance.
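
The abstract does not give the objective itself. As a rough sketch, assume it takes a quadratic form in which scores of positive pairs are rewarded, scores of negative pairs are penalized, and squared scores on both sides act as regularizers controlled by the relative parameters. The exact form, the parameter names `alpha`, `beta`, `gamma`, and their defaults below are assumptions for illustration only.

```python
def rpc_objective(pos_scores, neg_scores, alpha=1.0, beta=1.0, gamma=1.0):
    """Quadratic contrastive objective with no log or exp (to be maximized).

    pos_scores: critic scores f(x, y) on pairs drawn jointly.
    neg_scores: critic scores on pairs drawn from the product of marginals.
    alpha, beta, gamma: hypothetical 'relative parameters' for this sketch.
    """
    def mean(values):
        return sum(values) / len(values)

    pos_sq = [s * s for s in pos_scores]
    neg_sq = [s * s for s in neg_scores]
    return (mean(pos_scores)
            - alpha * mean(neg_scores)
            - 0.5 * beta * mean(pos_sq)
            - 0.5 * gamma * mean(neg_sq))
```

Because the squared terms grow faster than the linear ones, the objective stays bounded above for any critic when `beta` and `gamma` are positive, which is one way to read the boundedness claim in the abstract.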

StylePTB: A Compositional Benchmark for Fine-grained Controllable Text Style Transfer

Apr 12, 2021
Yiwei Lyu, Paul Pu Liang, Hai Pham, Eduard Hovy, Barnabás Póczos, Ruslan Salakhutdinov, Louis-Philippe Morency

Text style transfer aims to controllably generate text with targeted stylistic changes while keeping the core meaning of the source sentence constant. Many of the existing style transfer benchmarks primarily focus on individual high-level semantic changes (e.g. positive to negative), which enable controllability at a high level but do not offer fine-grained control involving sentence structure, emphasis, and content of the sentence. In this paper, we introduce a large-scale benchmark, StylePTB, with (1) paired sentences undergoing 21 fine-grained stylistic changes spanning atomic lexical, syntactic, semantic, and thematic transfers of text, as well as (2) compositions of multiple transfers which allow modeling of fine-grained stylistic changes as building blocks for more complex, high-level transfers. By benchmarking existing methods on StylePTB, we find that they struggle to model fine-grained changes and have an even more difficult time composing multiple styles. As a result, StylePTB brings novel challenges that we hope will encourage future research in controllable text style transfer, compositional models, and learning disentangled representations. Solving these challenges would present important steps towards controllable text generation.

* NAACL 2021, code available at https://github.com/lvyiwei1/StylePTB/

Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation

Apr 04, 2021
Emilio Parisotto, Ruslan Salakhutdinov

Many real-world applications such as robotics impose hard constraints on power and compute that limit the viable model complexity of Reinforcement Learning (RL) agents. Similarly, in many distributed RL settings, acting is done on un-accelerated hardware such as CPUs, which likewise restricts model size to prevent intractable experiment run times. These "actor-latency" constrained settings present a major obstruction to the scaling up of model complexity that has recently been extremely successful in supervised learning. To be able to utilize large model capacity while still operating within the limits imposed by the system during acting, we develop an "Actor-Learner Distillation" (ALD) procedure that leverages a continual form of distillation that transfers learning progress from a large capacity learner model to a small capacity actor model. As a case study, we develop this procedure in the context of partially-observable environments, where transformer models have recently shown large improvements over LSTMs, at the cost of significantly higher computational complexity. With transformer models as the learner and LSTMs as the actor, we demonstrate in several challenging memory environments that using Actor-Learner Distillation recovers the clear sample-efficiency gains of the transformer learner model while maintaining the fast inference and reduced total training time of the LSTM actor model.
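
The distillation step can be pictured as a loss that pulls the small actor's action distribution toward the large learner's. The sketch below assumes a discrete action space and uses a KL divergence from learner to actor for a single observation; the direction of the KL, the function names, and the absence of any value-function term are simplifying assumptions, not details taken from the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_distillation_loss(learner_logits, actor_logits):
    """KL(learner || actor) over discrete actions for one observation."""
    p = softmax(learner_logits)  # teacher: large-capacity learner
    q = softmax(actor_logits)    # student: small, fast actor
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In a continual setting this loss would be minimized with respect to the actor's parameters alongside the learner's own RL updates, so the actor tracks the learner as it improves.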

* Published at ICLR 2021

Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification

Mar 23, 2021
Benjamin Eysenbach, Sergey Levine, Ruslan Salakhutdinov

In the standard Markov decision process formalism, users specify tasks by writing down a reward function. However, in many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved. Motivated by this observation, we derive a control algorithm from first principles that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states. Prior work has approached similar problem settings in a two-stage process, first learning an auxiliary reward function and then optimizing this reward function using another reinforcement learning algorithm. In contrast, we derive a method based on recursive classification that eschews auxiliary reward functions and instead directly learns a value function from transitions and successful outcomes. Our method therefore requires fewer hyperparameters to tune and fewer lines of code to debug. We show that our method satisfies a new data-driven Bellman equation, where examples take the place of the typical reward function term. Experiments show that our approach outperforms prior methods that learn explicit reward functions.

* Website with videos and code: https://ben-eysenbach.github.io/rce

Instabilities of Offline RL with Pre-Trained Neural Representation

Mar 08, 2021
Ruosong Wang, Yifan Wu, Ruslan Salakhutdinov, Sham M. Kakade

In offline reinforcement learning (RL), we seek to utilize offline data to evaluate (or learn) policies in scenarios where the data are collected from a distribution that substantially differs from that of the target policy to be evaluated. Recent theoretical advances have shown that such sample-efficient offline RL is indeed possible provided certain strong representational conditions hold; otherwise, there are lower bounds exhibiting exponential error amplification (in the problem horizon) unless the data collection distribution has only a mild distribution shift relative to the target policy. This work studies these issues from an empirical perspective to gauge how stable offline RL methods are. In particular, our methodology explores these ideas when using features from pre-trained neural networks, in the hope that these representations are powerful enough to permit sample-efficient offline RL. Through extensive experiments on a range of tasks, we see that substantial error amplification does occur even when using such pre-trained representations (trained on the same task itself); we find offline RL is stable only under extremely mild distribution shift. The implications of these results, both from a theoretical and an empirical perspective, are that successful offline RL (where we seek to go beyond the low distribution shift regime) requires substantially stronger conditions beyond those which suffice for successful supervised learning.

On Proximal Policy Optimization's Heavy-tailed Gradients

Feb 20, 2021
Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, J. Zico Kolter, Sivaraman Balakrishnan, Zachary C. Lipton, Ruslan Salakhutdinov, Pradeep Ravikumar

Modern policy gradient algorithms, notably Proximal Policy Optimization (PPO), rely on an arsenal of heuristics, including loss clipping and gradient clipping, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich ("heavy-tailed") regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that the gradients, especially for the actor network, exhibit pronounced heavy-tailedness and that it increases as the agent's policy diverges from the behavioral policy (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. We then highlight issues arising due to the heavy-tailed nature of the gradients. In this light, we study the effects of the standard PPO clipping heuristics, demonstrating that these tricks primarily serve to offset heavy-tailedness in gradients. Thus motivated, we propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks. Despite requiring less hyperparameter tuning, our method matches the performance of PPO (with all heuristics enabled) on a battery of MuJoCo continuous control tasks.
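
To make the robust-estimation idea concrete, the sketch below assumes GMOM stands for a geometric median of means: per-sample gradients are split into blocks, each block is averaged, and the geometric median of the block means is returned. The geometric median is approximated here with plain Weiszfeld iterations, which is a solver chosen for this illustration and not necessarily the one used in the paper; the block assignment, `iters`, and `eps` are likewise assumptions.

```python
import math


def geometric_median(points, iters=100, eps=1e-12):
    """Weiszfeld iterations for the point minimizing summed Euclidean distance."""
    dim = len(points[0])
    # Start from the ordinary mean of the points.
    est = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iters):
        # Inverse-distance weights; eps guards against division by zero.
        inv = [1.0 / max(math.dist(p, est), eps) for p in points]
        total = sum(inv)
        est = [sum(w * p[d] for w, p in zip(inv, points)) / total
               for d in range(dim)]
    return est


def gmom(samples, n_blocks):
    """Geometric median of block means over a list of gradient vectors."""
    dim = len(samples[0])
    # Interleaved split: sample k goes to block k % n_blocks.
    blocks = [samples[i::n_blocks] for i in range(n_blocks)]
    means = [[sum(s[d] for s in block) / len(block) for d in range(dim)]
             for block in blocks]
    return geometric_median(means)
```

With nine gradient samples at `[1.0, 1.0]` and one outlier at `[1000.0, 1000.0]`, the plain mean is pulled far away from the bulk of the data, while `gmom(samples, 5)` stays close to `[1.0, 1.0]` because only one of the five block means is contaminated.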

* Pre-print
