Joao Carvalho

Motion Planning Diffusion: Learning and Planning of Robot Motions with Diffusion Models

Aug 03, 2023
Joao Carvalho, An T. Le, Mark Baierl, Dorothea Koert, Jan Peters

Learning priors over trajectory distributions can help accelerate robot motion planning optimization. Given a set of previously successful plans, learning a trajectory generative model as a prior for new planning problems is highly desirable. Prior works propose several ways of using such a prior to bootstrap the motion planning problem, either by sampling it for initializations or by using it in a maximum-a-posteriori formulation for trajectory optimization. In this work, we propose learning diffusion models as priors. By leveraging the reverse denoising process of diffusion models, we can then sample directly from the posterior trajectory distribution conditioned on task goals. Furthermore, diffusion models have recently been shown to effectively encode data multimodality in high-dimensional settings, making them particularly well suited for large trajectory datasets. To demonstrate the efficacy of our method, Motion Planning Diffusion, we compare it against several baselines in simulated planar robot and 7-DoF robot arm environments. To assess its generalization capabilities, we also test it in environments with previously unseen obstacles. Our experiments show that diffusion models are strong priors for encoding high-dimensional trajectory distributions of robot motions.
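
To make the idea of posterior sampling with a diffusion prior concrete, the following is a minimal sketch of cost-guided reverse denoising for trajectories in PyTorch. The eps_model (trained noise predictor), task_cost (a differentiable goal/collision cost), and the guidance scheme are illustrative assumptions, not the paper's actual implementation.

import torch

def sample_trajectory(eps_model, task_cost, betas, horizon, state_dim, guide_scale=1.0):
    # Cost-guided reverse denoising: draw a trajectory from the diffusion prior
    # while biasing each denoising step towards low task cost (posterior sampling).
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    tau = torch.randn(1, horizon, state_dim)                 # start from pure noise
    for t in reversed(range(len(betas))):
        eps = eps_model(tau, torch.tensor([t]))              # prior noise prediction
        mean = (tau - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        # Guidance: push the denoised mean down the gradient of the task cost
        # (goal reaching, collision avoidance), analogous to classifier guidance.
        tau_g = mean.detach().requires_grad_(True)
        grad = torch.autograd.grad(task_cost(tau_g).sum(), tau_g)[0]
        mean = mean.detach() - guide_scale * grad
        noise = torch.randn_like(tau) if t > 0 else torch.zeros_like(tau)
        tau = mean + torch.sqrt(betas[t]) * noise
    return tau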

Diminishing Return of Value Expansion Methods in Model-Based Reinforcement Learning

Mar 07, 2023
Daniel Palenicek, Michael Lutter, Joao Carvalho, Jan Peters

Model-based reinforcement learning is one approach to increasing sample efficiency. However, the accuracy of the dynamics model and the resulting compounding error over modelled trajectories are commonly regarded as key limitations. A natural question to ask is: How much more sample efficiency can be gained by improving the learned dynamics models? Our paper empirically answers this question for the class of model-based value expansion methods in continuous control problems. Value expansion methods should benefit from increased model accuracy by enabling longer rollout horizons and better value function approximations. Our empirical study, which leverages oracle dynamics models to avoid compounding model errors, shows that (1) longer horizons increase sample efficiency, but the gains diminish with each additional expansion step, and (2) increased model accuracy only marginally improves sample efficiency compared to learned models with identical horizons. Therefore, longer horizons and increased model accuracy yield diminishing returns in terms of sample efficiency. These improvements are particularly disappointing when compared to model-free value expansion methods: even though they introduce no computational overhead, we find their performance to be on par with their model-based counterparts. Therefore, we conclude that the limitation of model-based value expansion methods is not the accuracy of the learned models. While higher model accuracy is beneficial, our experiments show that even a perfect model does not provide unrivalled sample efficiency and that the bottleneck lies elsewhere.

* Published as a conference paper at ICLR 2023 
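
For reference, an H-step model-based value expansion target in this setting looks roughly like the sketch below, where model.step, policy, and value_fn are hypothetical interfaces (with an oracle model, model.step would be the true dynamics); this is a generic illustration of value expansion, not the paper's code.

def value_expansion_target(model, policy, value_fn, state, horizon, gamma=0.99):
    # Roll the model out for `horizon` steps under the policy, sum the discounted
    # rewards, and bootstrap the remaining return with the learned value function.
    ret, discount, s = 0.0, 1.0, state
    for _ in range(horizon):
        a = policy(s)
        s, r = model.step(s, a)          # imagined transition (learned or oracle model)
        ret += discount * r
        discount *= gamma
    return ret + discount * value_fn(s)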

A Hierarchical Approach to Active Pose Estimation

Mar 08, 2022
Jascha Hellwig, Mark Baierl, Joao Carvalho, Julen Urain, Jan Peters

Creating mobile robots that can find and manipulate objects in large environments is an active topic of research. These robots not only need to search for specific objects but also to estimate their poses, often relying on environment observations, which is even more difficult in the presence of occlusions. To tackle this problem, we propose a simple hierarchical approach to estimating the pose of a desired object. An Active Visual Search module operating on RGB images first obtains a rough estimate of the object's 2D pose, followed by a more computationally expensive Active Pose Estimation module using point-cloud data. We empirically show that processing image features to obtain a richer observation speeds up the search and pose estimation computations compared to a binary decision that only indicates whether or not the object is in the current image.
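
Conceptually, the hierarchy can be summarized by the following sketch, where rgb_search and pose_refiner are hypothetical stand-ins for the Active Visual Search and Active Pose Estimation modules rather than the paper's actual interfaces.

def estimate_object_pose(rgb_search, pose_refiner, observation, conf_thresh=0.8):
    # Stage 1: cheap RGB-based active visual search gives a rough 2D location.
    rough_uv, confidence = rgb_search(observation["rgb"])
    if confidence < conf_thresh:
        return None              # object not confidently found yet: keep exploring
    # Stage 2: run the expensive point-cloud pose estimator only near the candidate.
    return pose_refiner(observation["point_cloud"], seed=rough_uv)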

Residual Robot Learning for Object-Centric Probabilistic Movement Primitives

Mar 08, 2022
Joao Carvalho, Dorothea Koert, Marek Daniv, Jan Peters

It is desirable for future robots to quickly learn new tasks and adapt learned skills to constantly changing environments. To this end, Probabilistic Movement Primitives (ProMPs) have been shown to be a promising framework for learning generalizable trajectory generators from distributions over demonstrated trajectories. However, in practical applications that require high precision in the manipulation of objects, the accuracy of ProMPs is often insufficient, in particular when they are learned in Cartesian space from external observations and executed with limited controller gains. Therefore, we propose combining ProMPs with the recently introduced Residual Reinforcement Learning (RRL) to account for corrections in both position and orientation during task execution. In particular, we learn a residual on top of a nominal ProMP trajectory with Soft Actor-Critic and incorporate the variability in the demonstrations as a decision variable to reduce the search space for RRL. As a proof of concept, we evaluate our proposed method on a 3D block-insertion task with a 7-DoF Franka Emika Panda robot. Experimental results show that the robot successfully learns to complete the insertion, which was not possible with basic ProMPs alone.
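
The core combination can be sketched as follows, with hypothetical promp_mean, promp_std, and residual_policy callables standing in for the learned components; clipping the residual by the demonstrated variability mirrors the idea of using the demonstrations to restrict the RL search space.

import numpy as np

def residual_action(promp_mean, promp_std, residual_policy, obs, t, scale=1.0):
    # Nominal ProMP setpoint plus a learned residual correction, bounded by the
    # variability observed in the demonstrations at phase/time t.
    nominal = promp_mean(t)                 # nominal Cartesian setpoint from the ProMP
    correction = residual_policy(obs)       # residual from the RL agent (e.g. SAC)
    limit = scale * promp_std(t)            # demonstration variability as a bound
    return nominal + np.clip(correction, -limit, limit)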

An Analysis of Measure-Valued Derivatives for Policy Gradients

Mar 08, 2022
Joao Carvalho, Jan Peters

Reinforcement learning methods for robotics are increasingly successful due to the constant development of better policy gradient techniques. A precise (low-variance) and accurate (low-bias) gradient estimator is crucial for tackling increasingly complex tasks. Traditional policy gradient algorithms use the likelihood-ratio trick, which is known to produce unbiased but high-variance estimates. More modern approaches exploit the reparametrization trick, which gives lower-variance gradient estimates but requires differentiable value function approximators. In this work, we study a different type of stochastic gradient estimator: the Measure-Valued Derivative. This estimator is unbiased, has low variance, and can be used with both differentiable and non-differentiable function approximators. We empirically evaluate this estimator in the actor-critic policy gradient setting and show that it can reach performance comparable to methods based on the likelihood-ratio or reparametrization tricks, in both low- and high-dimensional action spaces. With this work, we want to show that the Measure-Valued Derivative estimator can be a useful alternative to other policy gradient estimators.
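
As a concrete illustration of the estimator, the measure-valued derivative of E_{x ~ N(mu, sigma^2)}[f(x)] with respect to mu decomposes into a positive and a negative component, both of which can be sampled with a Rayleigh variable. The sketch below follows this standard construction from the MVD literature and is not code from the paper; note that f only needs to be evaluable, not differentiable.

import numpy as np

def mvd_grad_mu(f, mu, sigma, n_samples=100_000, seed=0):
    # d/dmu E_{x ~ N(mu, sigma^2)}[f(x)] = c * (E[f(x+)] - E[f(x-)]),
    # with x+/- = mu +/- sigma * Z, Z ~ Rayleigh(1), c = 1 / (sigma * sqrt(2*pi)).
    rng = np.random.default_rng(seed)
    z = rng.rayleigh(scale=1.0, size=n_samples)      # shared samples reduce variance
    const = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return const * (np.mean(f(mu + sigma * z)) - np.mean(f(mu - sigma * z)))

# Sanity check: for f(x) = x the true gradient d/dmu E[x] is 1.
print(mvd_grad_mu(lambda x: x, mu=0.5, sigma=0.3))   # approximately 1.0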

A Nonparametric Off-Policy Policy Gradient

Feb 11, 2020
Samuele Tosatto, Joao Carvalho, Hany Abdulsamad, Jan Peters

Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes. The need for intensive interactions with the environment is especially evident in many widely popular policy gradient algorithms that perform updates using on-policy samples. The price of such inefficiency becomes apparent in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited. We address this issue by building on the general sample efficiency of off-policy algorithms. Using nonparametric regression and density estimation methods, we construct a nonparametric Bellman equation in a principled manner, which allows us to obtain closed-form estimates of the value function and to analytically express the full policy gradient. We provide a theoretical analysis of our estimate, showing that it is consistent under mild smoothness assumptions, and empirically show that our approach has better sample efficiency than state-of-the-art policy gradient methods.
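
A rough sketch of the kind of closed-form value estimate a nonparametric Bellman equation yields is shown below: kernel similarities between next states and sampled states define soft transition weights, and the value follows from a single linear solve. The RBF kernel, the bandwidth, and the state-value (rather than action-value) simplification are illustrative assumptions, not the paper's exact construction.

import numpy as np

def kernel_value_estimate(states, next_states, rewards, gamma=0.99, bandwidth=0.5):
    # Closed-form fixed point of a kernelized Bellman equation V = r + gamma * P V,
    # where P is built from RBF similarities between next states and sampled states.
    d2 = ((next_states[:, None, :] - states[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2.0 * bandwidth ** 2))
    P = K / K.sum(axis=1, keepdims=True)            # row-normalized transition weights
    return np.linalg.solve(np.eye(len(states)) - gamma * P, rewards)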
