Jan Peters

An Analysis of Measure-Valued Derivatives for Policy Gradients

Mar 08, 2022

Joao Carvalho, Jan Peters

Abstract:Reinforcement learning methods for robotics are increasingly successful due to the constant development of better policy gradient techniques. A precise (low variance) and accurate (low bias) gradient estimator is crucial to face increasingly complex tasks. Traditional policy gradient algorithms use the likelihood-ratio trick, which is known to produce unbiased but high variance estimates. More modern approaches exploit the reparametrization trick, which gives lower variance gradient estimates but requires differentiable value function approximators. In this work, we study a different type of stochastic gradient estimator - the Measure-Valued Derivative. This estimator is unbiased, has low variance, and can be used with differentiable and non-differentiable function approximators. We empirically evaluate this estimator in the actor-critic policy gradient setting and show that it can reach comparable performance with methods based on the likelihood-ratio or reparametrization tricks, both in low and high-dimensional action spaces. With this work, we want to show that the Measure-Valued Derivative estimator can be a useful alternative to other policy gradient estimators.
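
As a rough illustration of the three estimator families named in this abstract, the sketch below estimates the gradient of E[f(x)] with respect to the mean of a one-dimensional Gaussian. It is a toy under stated assumptions, not the paper's actor-critic implementation: the test function f(x) = x^2 and all function names are hypothetical, and the measure-valued decomposition used is the standard one for a Gaussian mean (a scaled difference of two expectations whose samples are shifted by Rayleigh-distributed offsets). Only the reparametrization estimator needs the derivative of f.

```python
import math
import random


def f(x):
    # Hypothetical test function; the true gradient of E[f] w.r.t. mu is 2 * mu.
    return x * x


def df(x):
    # Derivative of f, needed only by the reparametrization estimator.
    return 2.0 * x


def likelihood_ratio_grad(mu, sigma, n, rng):
    # Score-function estimator: average of f(x) * d/dmu log N(x | mu, sigma^2).
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu, sigma)
        total += f(x) * (x - mu) / sigma ** 2
    return total / n


def reparametrization_grad(mu, sigma, n, rng):
    # Pathwise estimator: x = mu + sigma * eps, so d f(x) / d mu = f'(x).
    total = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        total += df(mu + sigma * eps)
    return total / n


def measure_valued_grad(mu, sigma, n, rng):
    # Measure-valued derivative for a Gaussian mean:
    # c * (E[f(mu + sigma * y)] - E[f(mu - sigma * y)]), y ~ Rayleigh(1),
    # with c = 1 / (sigma * sqrt(2 * pi)). The same y is reused for both
    # terms (coupling), and f is only evaluated, never differentiated.
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for _ in range(n):
        # Inverse-CDF sample of a Rayleigh(1) variable; 1 - u lies in (0, 1].
        y = math.sqrt(-2.0 * math.log(1.0 - rng.random()))
        total += f(mu + sigma * y) - f(mu - sigma * y)
    return c * total / n
```

With mu = 1.5 and sigma = 1.0, all three functions should return values close to 3.0 for a large sample count, with the likelihood-ratio estimate typically fluctuating the most across seeds.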

PAC-Bayesian Lifelong Learning For Multi-Armed Bandits

Mar 07, 2022

Hamish Flynn, David Reeb, Melih Kandemir, Jan Peters

Abstract:We present a PAC-Bayesian analysis of lifelong learning. In the lifelong learning problem, a sequence of learning tasks is observed one-at-a-time, and the goal is to transfer information acquired from previous tasks to new learning tasks. We consider the case when each learning task is a multi-armed bandit problem. We derive lower bounds on the expected average reward that would be obtained if a given multi-armed bandit algorithm was run in a new task with a particular prior and for a set number of steps. We propose lifelong learning algorithms that use our new bounds as learning objectives. Our proposed algorithms are evaluated in several lifelong multi-armed bandit problems and are found to perform better than a baseline method that does not use generalisation bounds.

* Data Mining and Knowledge Discovery, 2022, Special Issue of the Journal Track ECML PKDD 2022
* 29 pages, 5 figures

An Adaptive Human Driver Model for Realistic Race Car Simulations

Mar 03, 2022

Stefan Löckel, Siwei Ju, Maximilian Schaller, Peter van Vliet, Jan Peters

Abstract:Engineering a high-performance race car requires a direct consideration of the human driver using real-world tests or Human-Driver-in-the-Loop simulations. Apart from that, offline simulations with human-like race driver models could make this vehicle development process more effective and efficient but are hard to obtain due to various challenges. With this work, we intend to provide a better understanding of race driver behavior and introduce an adaptive human race driver model based on imitation learning. Using existing findings and an interview with a professional race engineer, we identify fundamental adaptation mechanisms and how drivers learn to optimize lap time on a new track. Subsequently, we use these insights to develop generalization and adaptation techniques for a recently presented probabilistic driver modeling approach and evaluate it using data from professional race drivers and a state-of-the-art race car simulator. We show that our framework can create realistic driving line distributions on unseen race tracks with almost human-like performance. Moreover, our driver model optimizes its driving lap by lap, correcting driving errors from previous laps while achieving faster lap times. This work contributes to a better understanding and modeling of the human driver, aiming to expedite simulation methods in the modern vehicle development process and potentially supporting automated driving and racing technologies.

* 12 pages, 12 figures

Integrating Contrastive Learning with Dynamic Models for Reinforcement Learning from Images

Mar 02, 2022

Bang You, Oleg Arenz, Youping Chen, Jan Peters

Abstract:Recent methods for reinforcement learning from images use auxiliary tasks to learn image features that are used by the agent's policy or Q-function. In particular, methods based on contrastive learning that induce linearity of the latent dynamics or invariance to data augmentation have been shown to greatly improve the sample efficiency of the reinforcement learning algorithm and the generalizability of the learned embedding. We further argue that explicitly improving Markovianity of the learned embedding is desirable and propose a self-supervised representation learning method which integrates contrastive learning with dynamic models to synergistically combine these three objectives: (1) We maximize the InfoNCE bound on the mutual information between the state- and action-embedding and the embedding of the next state to induce a linearly predictive embedding without explicitly learning a linear transition model, (2) we further improve Markovianity of the learned embedding by explicitly learning a non-linear transition model using regression, and (3) we maximize the mutual information between the two nonlinear predictions of the next embeddings based on the current action and two independent augmentations of the current state, which naturally induces transformation invariance not only for the state embedding, but also for the nonlinear transition model. Experimental evaluation on the DeepMind Control Suite shows that our proposed method achieves higher sample efficiency and better generalization than state-of-the-art methods based on contrastive learning or reconstruction.

* Neurocomputing 476 (2022) 102-114
* 28 pages, 11 figures, 5 tables
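
The InfoNCE bound mentioned in objective (1) of the abstract above can be sketched generically as follows. This is a minimal illustration, assuming a plain dot-product score between a predicted embedding and a target embedding; the abstract does not specify the score function or network architecture, so `info_nce_loss` and `mutual_information_lower_bound` are hypothetical names rather than the authors' code. For a batch of N pairs, pair i is treated as the positive and the other N - 1 targets as negatives, and log N minus the loss gives a lower bound on the mutual information.

```python
import math


def dot(u, v):
    # Dot-product score between two embeddings (an assumed choice of critic).
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(predictions, targets):
    # Average cross-entropy of picking the matching target for each prediction.
    # predictions[i] and targets[i] form the positive pair; targets[j], j != i,
    # serve as negatives.
    n = len(predictions)
    total = 0.0
    for i in range(n):
        scores = [dot(predictions[i], targets[j]) for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[i]
    return total / n


def mutual_information_lower_bound(predictions, targets):
    # InfoNCE bound: I(X; Y) >= log N - L_InfoNCE.
    return math.log(len(predictions)) - info_nce_loss(predictions, targets)
```

Minimizing the loss tightens the bound, which saturates at log N; larger batches are therefore needed to certify larger amounts of mutual information.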

A Unified Perspective on Value Backup and Exploration in Monte-Carlo Tree Search

Feb 11, 2022

Tuan Dam, Carlo D'Eramo, Jan Peters, Joni Pajarinen

Abstract:Monte-Carlo Tree Search (MCTS) is a class of methods for solving complex decision-making problems through the synergy of Monte-Carlo planning and Reinforcement Learning (RL). The highly combinatorial nature of the problems commonly addressed by MCTS requires the use of efficient exploration strategies for navigating the planning tree and quickly convergent value backup methods. These crucial problems are particularly evident in recent advances that combine MCTS with deep neural networks for function approximation. In this work, we propose two methods for improving the convergence rate and exploration based on a newly introduced backup operator and entropy regularization. We provide strong theoretical guarantees to bound convergence rate, approximation error, and regret of our methods. Moreover, we introduce a mathematical framework based on the use of the $\alpha$-divergence for backup and exploration in MCTS. We show that this theoretical formulation unifies different approaches, including our newly introduced ones, under the same mathematical framework, so that different methods are obtained by simply changing the value of $\alpha$. In practice, our unified perspective offers a flexible way to balance between exploration and exploitation by tuning the single $\alpha$ parameter according to the problem at hand. We validate our methods through a rigorous empirical study ranging from basic toy problems to complex Atari games and including both MDP and POMDP problems.

* arXiv admin note: text overlap with arXiv:2007.00391

Learning Geometric Constraints in Task and Motion Planning

Jan 24, 2022

Tianyu Ren, Alexander Imani Cowen-Rivers, Haitham Bou Ammar, Jan Peters

Abstract:Searching for bindings of geometric parameters in task and motion planning (TAMP) is a finite-horizon stochastic planning problem with high-dimensional decision spaces. A robot manipulator can only move in a subspace of its whole range that is subjected to many geometric constraints. A TAMP solver usually takes many explorations before finding a feasible binding set for each task. It is favorable to learn those constraints once and then transfer them over different tasks within the same workspace. We address this problem by representing constraint knowledge with transferable primitives and using Bayesian optimization (BO) based on these primitives to guide binding search in further tasks. Via semantic and geometric backtracking in TAMP, we construct constraint primitives to encode the geometric constraints respectively in a reusable form. Then we devise a BO approach to efficiently utilize the accumulated constraints for guiding node expansion of an MCTS-based binding planner. We further compose a transfer mechanism to enable free knowledge flow between TAMP tasks. Results indicate that our approach reduces the expensive exploration calls in binding search by 43.60% to 71.69% when compared to the baseline unguided planner.

Distilled Domain Randomization

Dec 06, 2021

Julien Brosseit, Benedikt Hahner, Fabio Muratore, Michael Gienger, Jan Peters

Abstract:Deep reinforcement learning is an effective tool to learn robot control policies from scratch. However, these methods are notorious for the enormous amount of required training data, which is prohibitively expensive to collect on real robots. A highly popular alternative is to learn from simulations, which allows the data to be generated much faster, safer, and cheaper. Since all simulators are mere models of reality, there are inevitable differences between the simulated and the real data, often referred to as the 'reality gap'. To bridge this gap, many approaches learn one policy from a distribution over simulators. In this paper, we propose to combine reinforcement learning from randomized physics simulations with policy distillation. Our algorithm, called Distilled Domain Randomization (DiDoR), distills so-called teacher policies, which are experts on domains that have been sampled initially, into a student policy that is later deployed. This way, DiDoR learns controllers which transfer directly from simulation to reality, i.e., without requiring data from the target domain. We compare DiDoR against three baselines in three sim-to-sim as well as two sim-to-real experiments. Our results show that the target domain performance of policies trained with DiDoR is on par with or better than the baselines'. Moreover, our approach increases neither the required memory capacity nor the time to compute an action, which may well be a point of failure for successfully deploying the learned controller.

* shared first authorship between Julien Brosseit, Benedikt Hahner, and Fabio Muratore

Model-Based Reinforcement Learning for Stochastic Hybrid Systems

Nov 11, 2021

Hany Abdulsamad, Jan Peters

Abstract:Optimal control of general nonlinear systems is a central challenge in automation. Data-driven approaches to control, enabled by powerful function approximators, have recently had great success in tackling challenging robotic applications. However, such methods often obscure the structure of dynamics and control behind black-box over-parameterized representations, thus limiting our ability to understand the closed-loop behavior. This paper adopts a hybrid-system view of nonlinear modeling and control that lends an explicit hierarchical structure to the problem and breaks down complex dynamics into simpler localized units. Therefore, we consider a sequence modeling paradigm that captures the temporal structure of the data and derive an expectation-maximization (EM) algorithm that automatically decomposes nonlinear dynamics into stochastic piecewise affine dynamical systems with nonlinear boundaries. Furthermore, we show that these time-series models naturally admit a closed-loop extension that we use to extract locally linear or polynomial feedback controllers from nonlinear experts via imitation learning. Finally, we introduce a novel hybrid relative entropy policy search (Hb-REPS) technique that incorporates the hierarchical nature of hybrid systems and optimizes a set of time-invariant local feedback controllers derived from a locally polynomial approximation of a global value function.

Robot Learning from Randomized Simulations: A Review

Nov 01, 2021

Fabio Muratore, Fabio Ramos, Greg Turk, Wenhao Yu, Michael Gienger, Jan Peters

Abstract:The rise of deep learning has caused a paradigm shift in robotics research, favoring methods that require large amounts of data. It is prohibitively expensive to generate such data sets on a physical platform. Therefore, state-of-the-art approaches learn in simulation where data generation is fast as well as inexpensive and subsequently transfer the knowledge to the real robot (sim-to-real). Despite becoming increasingly realistic, all simulators are by construction based on models, hence inevitably imperfect. This raises the question of how simulators can be modified to facilitate learning robot control policies and overcome the mismatch between simulation and reality, often called the 'reality gap'. We provide a comprehensive review of sim-to-real research for robotics, focusing on a technique named 'domain randomization' which is a method for learning from randomized simulations.

* submitted to Frontiers in Robotics and AI

A Differentiable Newton-Euler Algorithm for Real-World Robotics

Oct 24, 2021

Michael Lutter, Johannes Silberbauer, Joe Watson, Jan Peters

Abstract:Obtaining dynamics models is essential for robotics to achieve accurate model-based controllers and simulators for planning. The dynamics models are typically obtained using the manufacturer's model specification or simple numerical methods such as linear regression. However, this approach does not guarantee physically plausible parameters and can only be applied to kinematic chains consisting of rigid bodies. In this article, we describe a differentiable simulator that can be used to identify the system parameters of real-world mechanical systems with complex friction models, holonomic as well as non-holonomic constraints. To guarantee physically consistent parameters, we utilize virtual parameters and gradient-based optimization. The described Differentiable Newton-Euler Algorithm (DiffNEA) can be applied to a class of dynamical systems and guarantees physically plausible predictions. The extensive experimental evaluation shows that the proposed model learning approach learns accurate dynamics models of systems with complex friction and non-holonomic constraints. Especially in the offline reinforcement learning experiments, the identified DiffNEA models excel. For the challenging ball in a cup task, these models solve the task using model-based offline reinforcement learning on the physical system. The black-box baselines fail on this task in simulation and on the physical system despite using more data for learning the model.

* arXiv admin note: text overlap with arXiv:2011.01734
