Scott Reed

Semi-supervised reward learning for offline reinforcement learning

Dec 12, 2020

Ksenia Konyushkova, Konrad Zolna, Yusuf Aytar, Alexander Novikov, Scott Reed, Serkan Cabi, Nando de Freitas

Abstract: In offline reinforcement learning (RL), agents are trained using a logged dataset. It appears to be the most natural route to attack real-life applications because in domains such as healthcare and robotics interactions with the environment are either expensive or unethical. Training agents usually requires reward functions, but unfortunately, rewards are seldom available in practice and their engineering is challenging and laborious. To overcome this, we investigate reward learning under the constraint of minimizing human reward annotations. We consider two types of supervision: timestep annotations and demonstrations. We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data. In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards. We further investigate the relationship between the quality of the reward model and the final policies. We notice, for example, that the reward models do not need to be perfect to result in useful policies.

* Accepted to Offline Reinforcement Learning Workshop at Neural Information Processing Systems (2020)

Offline Learning from Demonstrations and Unlabeled Experience

Nov 27, 2020

Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Caglar Gulcehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, Scott Reed

Abstract: Behavior cloning (BC) is often practical for robot learning because it allows a policy to be trained offline without rewards, by supervised learning on expert demonstrations. However, BC does not effectively leverage what we will refer to as unlabeled experience: data of mixed and unknown quality without reward annotations. This unlabeled data can be generated by a variety of sources such as human teleoperation, scripted policies and other agents on the same robot. Towards data-driven offline robot learning that can use this unlabeled experience, we introduce Offline Reinforced Imitation Learning (ORIL). ORIL first learns a reward function by contrasting observations from demonstrator and unlabeled trajectories, then annotates all data with the learned reward, and finally trains an agent via offline reinforcement learning. Across a diverse set of continuous control and simulated robotic manipulation tasks, we show that ORIL consistently outperforms comparable BC agents by effectively leveraging unlabeled experience.

* Accepted to Offline Reinforcement Learning Workshop at Neural Information Processing Systems (2020)
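
The first two ORIL steps described in the abstract (learn a reward by contrasting demonstrator and unlabeled observations, then annotate the data) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: observations are single floats, the reward model is plain logistic regression standing in for whatever model and loss the paper uses, and the names `train_reward` and `annotate` are hypothetical. The third step, offline RL on the annotated data, is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_reward(demo_obs, unlabeled_obs, steps=2000, lr=0.1):
    """Fit r(s) = sigmoid(w * s + b) by contrasting the two datasets.

    Demonstrator observations get target 1 and unlabeled observations get
    target 0. This is a stand-in classifier, not the paper's exact loss.
    """
    w, b = 0.0, 0.0
    data = [(s, 1.0) for s in demo_obs] + [(s, 0.0) for s in unlabeled_obs]
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for s, target in data:
            err = sigmoid(w * s + b) - target  # gradient of the log loss
            grad_w += err * s
            grad_b += err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return lambda s: sigmoid(w * s + b)


def annotate(observations, reward_fn):
    """Attach the learned reward to every observation."""
    return [(s, reward_fn(s)) for s in observations]


# Toy data: the unlabeled set is of mixed quality and contains one
# demonstrator-like observation (1.0).
reward = train_reward([0.9, 1.0, 1.1], [0.0, 0.1, 0.2, 1.0])
annotated = annotate([0.0, 0.1, 0.2, 1.0, 0.9, 1.0, 1.1], reward)
```

In this sketch, observations that resemble the demonstrations receive a higher learned reward than the rest, and the annotated pairs are what an offline RL algorithm would then consume.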

Critic Regularized Regression

Jun 26, 2020

Ziyu Wang, Alexander Novikov, Konrad Żołna, Jost Tobias Springenberg, Scott Reed, Bobak Shahriari, Noah Siegel, Josh Merel, Caglar Gulcehre, Nicolas Heess (+1 more)

Abstract: Offline reinforcement learning (RL), also known as batch RL, offers the prospect of policy optimization from large pre-recorded datasets without online environment interaction. It addresses challenges with regard to the cost of data collection and safety, both of which are particularly pertinent to real-world applications of RL. Unfortunately, most off-policy algorithms perform poorly when learning from a fixed dataset. In this paper, we propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR). We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces -- outperforming several state-of-the-art offline RL algorithms by a significant margin on a wide range of benchmark tasks.

* 23 pages

Task-Relevant Adversarial Imitation Learning

Oct 02, 2019

Konrad Zolna, Scott Reed, Alexander Novikov, Sergio Gomez Colmenarejo, David Budden, Serkan Cabi, Misha Denil, Nando de Freitas, Ziyu Wang

Abstract: We show that a critical problem in adversarial imitation from high-dimensional sensory data is the tendency of discriminator networks to distinguish agent and expert behaviour using task-irrelevant features beyond the control of the agent. We analyze this problem in detail and propose a solution as well as several baselines that outperform standard Generative Adversarial Imitation Learning (GAIL). Our proposed solution, Task-Relevant Adversarial Imitation Learning (TRAIL), uses a constrained optimization objective to overcome task-irrelevant features. Comprehensive experiments show that TRAIL can solve challenging manipulation tasks from pixels by imitating human operators, where other agents such as behaviour cloning (BC), standard GAIL, improved GAIL variants including our newly proposed baselines, and Deterministic Policy Gradients from Demonstrations (DPGfD) fail to find solutions, even when the other agents have access to task reward.

A Framework for Data-Driven Robotics

Sep 26, 2019

Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Żołna, Yusuf Aytar, David Budden, Mel Vecerik (+6 more)

Abstract: We present a framework for data-driven robotics that makes use of a large dataset of recorded robot experience and scales to several tasks using learned reward functions. We show how to apply this framework to accomplish three different object manipulation tasks on a real robot platform. Given demonstrations of a task together with task-agnostic recorded experience, we use a special form of human annotation as supervision to learn a reward function, which enables us to deal with real-world tasks where the reward signal cannot be acquired directly. Learned rewards are used in combination with a large dataset of experience from different tasks to learn a robot policy offline using batch RL. We show that using our approach it is possible to train agents to perform a variety of challenging manipulation tasks including stacking rigid objects and handling cloth.

Learning Compositional Neural Programs with Recursive Tree Search and Planning

May 30, 2019

Thomas Pierrot, Guillaume Ligner, Scott Reed, Olivier Sigaud, Nicolas Perrin, Alexandre Laterre, David Kas, Karim Beguir, Nando de Freitas

Abstract: We propose a novel reinforcement learning algorithm, AlphaNPI, that incorporates the strengths of Neural Programmer-Interpreters (NPI) and AlphaZero. NPI contributes structural biases in the form of modularity, hierarchy and recursion, which are helpful to reduce sample complexity, improve generalization and increase interpretability. AlphaZero contributes powerful neural network guided search algorithms, which we augment with recursion. AlphaNPI only assumes a hierarchical program specification with sparse rewards: 1 when the program execution satisfies the specification, and 0 otherwise. Using this specification, AlphaNPI is able to train NPI models effectively with RL for the first time, completely eliminating the need for strong supervision in the form of execution traces. The experiments show that AlphaNPI can sort as well as previous strongly supervised NPI variants. The AlphaNPI agent is also trained on a Tower of Hanoi puzzle with two disks and is shown to generalize to puzzles with an arbitrary number of disk

One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL

Oct 11, 2018

Tom Le Paine, Sergio Gómez Colmenarejo, Ziyu Wang, Scott Reed, Yusuf Aytar, Tobias Pfaff, Matt W. Hoffman, Gabriel Barth-Maron, Serkan Cabi, David Budden (+1 more)

Abstract: Humans are experts at high-fidelity imitation -- closely mimicking a demonstration, often in one attempt. Humans use this ability to quickly solve a task instance, and to bootstrap learning of new tasks. Achieving these abilities in autonomous agents is an open problem. In this paper, we introduce an off-policy RL algorithm (MetaMimic) to narrow this gap. MetaMimic can learn both (i) policies for high-fidelity one-shot imitation of diverse novel skills, and (ii) policies that enable the agent to solve tasks more efficiently than the demonstrators. MetaMimic relies on the principle of storing all experiences in a memory and replaying these to learn massive deep neural network policies by off-policy RL. This paper introduces, to the best of our knowledge, the largest existing neural networks for deep RL and shows that larger networks with normalization are needed to achieve one-shot high-fidelity imitation on a challenging manipulation task. The results also show that both types of policy can be learned from vision, in spite of the task rewards being sparse, and without access to demonstrator actions.

Sample Efficient Adaptive Text-to-Speech

Sep 27, 2018

Yutian Chen, Yannis Assael, Brendan Shillingford, David Budden, Scott Reed, Heiga Zen, Quan Wang, Luis C. Cobo, Andrew Trask, Ben Laurie (+4 more)

Abstract: We present a meta-learning approach for adaptive text-to-speech (TTS) with few data. During training, we learn a multi-speaker model using a shared conditional WaveNet core and independent learned embeddings for each speaker. The aim of training is not to produce a neural network with fixed weights, which is then deployed as a TTS system. Instead, the aim is to produce a network that requires few data at deployment time to rapidly adapt to new speakers. We introduce and benchmark three strategies: (i) learning the speaker embedding while keeping the WaveNet core fixed, (ii) fine-tuning the entire architecture with stochastic gradient descent, and (iii) predicting the speaker embedding with a trained neural network encoder. The experiments show that these approaches are successful at adapting the multi-speaker neural network to new speakers, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.

Neural Arithmetic Logic Units

Aug 01, 2018

Andrew Trask, Felix Hill, Scott Reed, Jack Rae, Chris Dyer, Phil Blunsom

Abstract: Neural networks can learn to represent and manipulate numerical information, but they seldom generalize well outside of the range of numerical values encountered during training. To encourage more systematic numerical extrapolation, we propose an architecture that represents numerical quantities as linear activations which are manipulated using primitive arithmetic operators, controlled by learned gates. We call this module a neural arithmetic logic unit (NALU), by analogy to the arithmetic logic unit in traditional processors. Experiments show that NALU-enhanced neural networks can learn to track time, perform arithmetic over images of numbers, translate numerical language into real-valued scalars, execute computer code, and count objects in images. In contrast to conventional architectures, we obtain substantially better generalization both inside and outside of the range of numerical values encountered during training, often extrapolating orders of magnitude beyond trained numerical ranges.
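
The abstract does not spell out the equations, so the following is only a minimal sketch of one gated arithmetic unit, following the commonly cited NALU formulation: an additive path and a log-space multiplicative path share one weight vector, and a sigmoid gate mixes them. The function name `nalu_forward` and the parameter names `w_hat`, `m_hat` and `g_weights` are hypothetical, and the weights are set by hand rather than learned.

```python
import math


def sigmoid(z):
    # Fine for the moderate inputs used here; very negative z would overflow.
    return 1.0 / (1.0 + math.exp(-z))


def nalu_forward(x, w_hat, m_hat, g_weights, eps=1e-7):
    """Compute a single NALU output for the input vector x."""
    # tanh * sigmoid pushes each effective weight toward -1, 0 or 1.
    w = [math.tanh(wh) * sigmoid(mh) for wh, mh in zip(w_hat, m_hat)]
    # Additive path: a weighted sum, so it can add or subtract inputs.
    add_path = sum(wi * xi for wi, xi in zip(w, x))
    # Multiplicative path: the same weighted sum taken in log space.
    log_sum = sum(wi * math.log(abs(xi) + eps) for wi, xi in zip(w, x))
    mul_path = math.exp(log_sum)
    # The gate selects between the two paths.
    gate = sigmoid(sum(gi * xi for gi, xi in zip(g_weights, x)))
    return gate * add_path + (1.0 - gate) * mul_path


# Weights near 1 with the gate open select addition (about 3 + 4).
added = nalu_forward([3.0, 4.0], [20.0, 20.0], [20.0, 20.0], [20.0, 20.0])
# The same weights with the gate closed select multiplication (about 3 * 4).
multiplied = nalu_forward([3.0, 4.0], [20.0, 20.0], [20.0, 20.0], [-20.0, -20.0])
```

Because the multiplicative path works on `log(|x| + eps)`, this sketch discards the sign of the inputs on that path.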

ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans

Mar 28, 2018

Angela Dai, Daniel Ritchie, Martin Bokeloh, Scott Reed, Jürgen Sturm, Matthias Nießner

Abstract: We introduce ScanComplete, a novel data-driven approach for taking an incomplete 3D scan of a scene as input and predicting a complete 3D model along with per-voxel semantic labels. The key contribution of our method is its ability to handle large scenes with varying spatial extent, managing the cubic growth in data size as scene size increases. To this end, we devise a fully-convolutional generative 3D CNN model whose filter kernels are invariant to the overall scene size. The model can be trained on scene subvolumes but deployed on arbitrarily large scenes at test time. In addition, we propose a coarse-to-fine inference strategy in order to produce high-resolution output while also leveraging large input context sizes. In an extensive series of experiments, we carefully evaluate different model design choices, considering both deterministic and probabilistic models for completion and semantic inference. Our results show that we outperform other methods not only in the size of the environments handled and processing efficiency, but also with regard to completion quality and semantic segmentation performance by a significant margin.

* Video: https://youtu.be/5s5s8iH0NF8
