Ashwin Balakrishna

Monte Carlo Augmented Actor-Critic for Sparse Reward Deep Reinforcement Learning from Suboptimal Demonstrations

Oct 14, 2022

Albert Wilcox, Ashwin Balakrishna, Jules Dedieu, Wyame Benslimane, Daniel Brown, Ken Goldberg

Abstract:Providing densely shaped reward functions for RL algorithms is often exceedingly challenging, motivating the development of RL algorithms that can learn from easier-to-specify sparse reward functions. This sparsity poses new exploration challenges. One common way to address this problem is to use demonstrations to provide an initial signal about regions of the state space with high rewards. However, prior RL-from-demonstrations algorithms introduce significant complexity and many hyperparameters, making them hard to implement and tune. We introduce Monte Carlo Augmented Actor-Critic (MCAC), a parameter-free modification to standard actor-critic algorithms that initializes the replay buffer with demonstrations and computes a modified $Q$-value by taking the maximum of the standard temporal difference (TD) target and a Monte Carlo estimate of the reward-to-go. This encourages exploration in the neighborhood of high-performing trajectories by encouraging high $Q$-values in corresponding regions of the state space. Experiments across 5 continuous control domains suggest that MCAC can be used to significantly increase learning efficiency across 6 commonly used RL and RL-from-demonstrations algorithms. See https://sites.google.com/view/mcac-rl for code and supplementary material.

* To be published in the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). 19 pages. 11 figures
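
The target computation described in the MCAC abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `reward_to_go` and `mcac_target`, the discount factor `gamma`, and the assumption that the arrays hold one complete episode that ends at its last step are all hypothetical choices made for this example.

```python
import numpy as np


def reward_to_go(rewards, gamma):
    """Discounted Monte Carlo return from each step to the end of one episode."""
    rewards = np.asarray(rewards, dtype=float)
    rtg = np.zeros_like(rewards)
    running = 0.0
    # Accumulate backwards so each entry holds r_t + gamma * r_{t+1} + ...
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def mcac_target(rewards, next_q, dones, gamma):
    """Elementwise maximum of the one-step TD target and the Monte Carlo estimate.

    Assumption: all arrays are aligned per time step of a single episode, and
    next_q[t] is the critic's value estimate at the next state.
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q = np.asarray(next_q, dtype=float)
    dones = np.asarray(dones, dtype=float)
    td_target = rewards + gamma * (1.0 - dones) * next_q
    mc_estimate = reward_to_go(rewards, gamma)
    return np.maximum(td_target, mc_estimate)


# Sparse-reward episode: the only reward arrives at the final step.
targets = mcac_target(
    rewards=[0.0, 0.0, 1.0],
    next_q=[0.0, 0.0, 0.0],  # an untrained critic that predicts zero everywhere
    dones=[0, 0, 1],
    gamma=0.5,
)
print(targets)
```

With a critic that still predicts zero, the TD target alone carries signal only at the final step, while the Monte Carlo term gives every earlier step of the successful trajectory a nonzero target.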

Learning Switching Criteria for Sim2Real Transfer of Robotic Fabric Manipulation Policies

Jul 02, 2022

Satvik Sharma, Ellen Novoseller, Vainavi Viswanath, Zaynah Javed, Rishi Parikh, Ryan Hoque, Ashwin Balakrishna, Daniel S. Brown, Ken Goldberg

Abstract:Simulation-to-reality transfer has emerged as a popular and highly successful method to train robotic control policies for a wide variety of tasks. However, it is often challenging to determine when policies trained in simulation are ready to be transferred to the physical world. Deploying policies that have been trained with very little simulation data can result in unreliable and dangerous behaviors on physical hardware. On the other hand, excessive training in simulation can cause policies to overfit to the visual appearance and dynamics of the simulator. In this work, we study strategies to automatically determine when policies trained in simulation can be reliably transferred to a physical robot. We specifically study these ideas in the context of robotic fabric manipulation, in which successful sim2real transfer is especially challenging due to the difficulties of precisely modeling the dynamics and visual appearance of fabric. Results in a fabric smoothing task suggest that our switching criteria correlate well with real-world performance. In particular, our confidence-based switching criteria achieve average final fabric coverage of 87.2-93.7% within 55-60% of the total training budget. See https://tinyurl.com/lsc-case for code and supplemental materials.

* CASE 2022. The first two authors contributed equally. 9 pages; 5 figures; 1 table

Dynamics-Aware Comparison of Learned Reward Functions

Jan 25, 2022

Blake Wulfe, Ashwin Balakrishna, Logan Ellis, Jean Mercat, Rowan McAllister, Adrien Gaidon

Abstract:The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world. However, comparing reward functions, for example as a means of evaluating reward learning methods, presents a challenge. Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it. To address this challenge, Gleave et al. (2020) propose the Equivalent-Policy Invariant Comparison (EPIC) distance. EPIC avoids policy optimization, but in doing so requires computing reward values at transitions that may be impossible under the system dynamics. This is problematic for learned reward functions because it entails evaluating them outside of their training distribution, resulting in inaccurate reward values that we show can render EPIC ineffective at comparing rewards. To address this problem, we propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric. DARD uses an approximate transition model of the environment to transform reward functions into a form that allows for comparisons that are invariant to reward shaping while only evaluating reward functions on transitions close to their training distribution. Experiments in simulated physical domains demonstrate that DARD enables reliable reward comparisons without policy optimization and is significantly more predictive than baseline methods of downstream policy performance when dealing with learned reward functions.

MESA: Offline Meta-RL for Safe Adaptation and Fault Tolerance

Dec 07, 2021

Michael Luo, Ashwin Balakrishna, Brijen Thananjeyan, Suraj Nair, Julian Ibarz, Jie Tan, Chelsea Finn, Ion Stoica, Ken Goldberg

Abstract:Safe exploration is critical for using reinforcement learning (RL) in risk-sensitive environments. Recent work learns risk measures that estimate the probability of violating constraints, which can then be used to enable safety. However, learning such risk measures requires significant interaction with the environment, resulting in excessive constraint violations during learning. Furthermore, these measures are not easily transferable to new environments. We cast safe exploration as an offline meta-RL problem, where the objective is to leverage examples of safe and unsafe behavior across a range of environments to quickly adapt learned risk measures to a new environment with previously unseen dynamics. We then propose MEta-learning for Safe Adaptation (MESA), an approach for meta-learning a risk measure for safe RL. Simulation experiments across 5 continuous control domains suggest that MESA can leverage offline data from a range of different environments to reduce constraint violations in unseen environments by up to a factor of 2 while maintaining task performance. See https://tinyurl.com/safe-meta-rl for code and supplementary material.

* Workshop on Safe and Robust Control of Uncertain Systems at the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Online

LEGS: Learning Efficient Grasp Sets for Exploratory Grasping

Nov 29, 2021

Letian Fu, Michael Danielczuk, Ashwin Balakrishna, Daniel S. Brown, Jeffrey Ichnowski, Eugen Solowjow, Ken Goldberg

Abstract:Previous work defined Exploratory Grasping, where a robot iteratively grasps and drops an unknown complex polyhedral object to discover a set of robust grasps for each recognizably distinct stable pose of the object. Recent work used a multi-armed bandit model with a small set of candidate grasps per pose; however, for objects with few successful grasps, this set may not include the most robust grasp. We present Learned Efficient Grasp Sets (LEGS), an algorithm that efficiently explores thousands of possible grasps by constructing small active sets of promising grasps and that uses learned confidence bounds to determine when, with high confidence, it can stop exploring the object. Experiments suggest that LEGS can identify a high-quality grasp more efficiently than prior algorithms which do not learn active sets. In simulation experiments, we measure the optimality gap between the success probability of the best grasp identified by LEGS and baselines and that of the true most robust grasp. After 3000 steps of exploration, LEGS outperforms baseline algorithms on 10 of the 14 Dex-Net Adversarial objects and 25 of the 39 EGAD! objects. We then develop a self-supervised grasping system, where the robot explores grasps with minimal human intervention. Physical experiments across 3 objects suggest that LEGS converges to high-performing grasps significantly faster than baselines. See https://sites.google.com/view/legs-exp-grasping for supplemental material and videos.

ThriftyDAgger: Budget-Aware Novelty and Risk Gating for Interactive Imitation Learning

Sep 17, 2021

Ryan Hoque, Ashwin Balakrishna, Ellen Novoseller, Albert Wilcox, Daniel S. Brown, Ken Goldberg

Abstract:Effective robot learning often requires online human feedback and interventions that can cost significant human time, giving rise to the central challenge in interactive imitation learning: is it possible to control the timing and length of interventions to both facilitate learning and limit burden on the human supervisor? This paper presents ThriftyDAgger, an algorithm for actively querying a human supervisor given a desired budget of human interventions. ThriftyDAgger uses a learned switching policy to solicit interventions only at states that are sufficiently (1) novel, where the robot policy has no reference behavior to imitate, or (2) risky, where the robot has low confidence in task completion. To detect the latter, we introduce a novel metric for estimating risk under the current robot policy. Experiments in simulation and on a physical cable routing experiment suggest that ThriftyDAgger's intervention criteria balance task performance and supervisor burden more effectively than prior algorithms. ThriftyDAgger can also be applied at execution time, where it achieves a 100% success rate on both the simulation and physical tasks. A user study (N=10) in which users control a three-robot fleet while also performing a concentration task suggests that ThriftyDAgger increases human and robot performance by 58% and 80% respectively compared to the next best algorithm while reducing supervisor burden.

* CoRL 2021 Oral
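
The novelty-or-risk gating described in the ThriftyDAgger abstract can be illustrated with a small sketch. It is not the paper's switching policy: the scores, the two thresholds, and the function names `request_intervention` and `intervention_rate` are hypothetical, and how novelty and risk are actually estimated, or how thresholds are tied to the intervention budget, is left out.

```python
import numpy as np


def request_intervention(novelty, risk, novelty_threshold, risk_threshold):
    """Flag states where a human intervention would be solicited.

    Assumption: novelty and risk are precomputed per-state scores where
    larger values mean "more novel" and "more risky" respectively.
    """
    novelty = np.asarray(novelty, dtype=float)
    risk = np.asarray(risk, dtype=float)
    # A state triggers a query if it is sufficiently novel OR sufficiently risky.
    return (novelty > novelty_threshold) | (risk > risk_threshold)


def intervention_rate(novelty, risk, novelty_threshold, risk_threshold):
    """Fraction of states that would trigger a query under the given thresholds."""
    flags = request_intervention(novelty, risk, novelty_threshold, risk_threshold)
    return float(np.mean(flags))


# Four hypothetical states: familiar, novel, risky, familiar.
novelty_scores = [0.1, 0.9, 0.2, 0.1]
risk_scores = [0.1, 0.1, 0.8, 0.2]
print(request_intervention(novelty_scores, risk_scores, 0.5, 0.5))
print(intervention_rate(novelty_scores, risk_scores, 0.5, 0.5))
```

Raising either threshold lowers the fraction of states that trigger a query, which is one simple way to think about trading supervisor burden against coverage of novel or risky states.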

Kit-Net: Self-Supervised Learning to Kit Novel 3D Objects into Novel 3D Cavities

Jul 13, 2021

Shivin Devgon, Jeffrey Ichnowski, Michael Danielczuk, Daniel S. Brown, Ashwin Balakrishna, Shirin Joshi, Eduardo M. C. Rocha, Eugen Solowjow, Ken Goldberg

Abstract:In industrial part kitting, 3D objects are inserted into cavities for transportation or subsequent assembly. Kitting is a critical step as it can decrease downstream processing and handling times and enable lower storage and shipping costs. We present Kit-Net, a framework for kitting previously unseen 3D objects into cavities given depth images of both the target cavity and an object held by a gripper in an unknown initial orientation. Kit-Net uses self-supervised deep learning and data augmentation to train a convolutional neural network (CNN) to robustly estimate 3D rotations between objects and matching concave or convex cavities using a large training dataset of simulated depth image pairs. Kit-Net then uses the trained CNN to implement a controller to orient and position novel objects for insertion into novel prismatic and conformal 3D cavities. Experiments in simulation suggest that Kit-Net can orient objects to have a 98.9% average intersection volume between the object mesh and that of the target cavity. Physical experiments with industrial objects succeed in 18% of trials using a baseline method and in 63% of trials with Kit-Net. Video, code, and data are available at https://github.com/BerkeleyAutomation/Kit-Net.

* Conference on Automation Science and Engineering (CASE) 2021

LS3: Latent Space Safe Sets for Long-Horizon Visuomotor Control of Iterative Tasks

Jul 10, 2021

Albert Wilcox, Ashwin Balakrishna, Brijen Thananjeyan, Joseph E. Gonzalez, Ken Goldberg

Abstract:Reinforcement learning (RL) algorithms have shown impressive success in exploring high-dimensional environments to learn complex, long-horizon tasks, but can often exhibit unsafe behaviors and require extensive environment interaction when exploration is unconstrained. A promising strategy for safe learning in dynamically uncertain environments is requiring that the agent can robustly return to states where task success (and therefore safety) can be guaranteed. While this approach has been successful in low dimensions, enforcing this constraint in environments with high-dimensional state spaces, such as images, is challenging. We present Latent Space Safe Sets (LS3), which extends this strategy to iterative, long-horizon tasks with image observations by using suboptimal demonstrations and a learned dynamics model to restrict exploration to the neighborhood of a learned Safe Set where task completion is likely. We evaluate LS3 on 4 domains, including a challenging sequential pushing task in simulation and a physical cable routing task. We find that LS3 can use prior task successes to restrict exploration and learn more efficiently than prior algorithms while satisfying constraints. See https://tinyurl.com/latent-ss for code and supplementary material.

* Preprint, Under Review. First two authors contributed equally

Untangling Dense Non-Planar Knots by Learning Manipulation Features and Recovery Policies

Jun 29, 2021

Priya Sundaresan, Jennifer Grannen, Brijen Thananjeyan, Ashwin Balakrishna, Jeffrey Ichnowski, Ellen Novoseller, Minho Hwang, Michael Laskey, Joseph E. Gonzalez, Ken Goldberg

Abstract:Robot manipulation for untangling 1D deformable structures such as ropes, cables, and wires is challenging due to their infinite dimensional configuration space, complex dynamics, and tendency to self-occlude. Analytical controllers often fail in the presence of dense configurations, due to the difficulty of grasping between adjacent cable segments. We present two algorithms that enhance robust cable untangling, LOKI and SPiDERMan, which operate alongside HULK, a high-level planner from prior work. LOKI uses a learned model of manipulation features to refine a coarse grasp keypoint prediction to a precise, optimized location and orientation, while SPiDERMan uses a learned model to sense task progress and apply recovery actions. We evaluate these algorithms in physical cable untangling experiments with 336 knots and over 1500 actions on real cables using the da Vinci surgical robot. We find that the combination of HULK, LOKI, and SPiDERMan is able to untangle dense overhand, figure-eight, double-overhand, square, bowline, granny, stevedore, and triple-overhand knots. The composition of these methods successfully untangles a cable from a dense initial configuration in 68.3% of 60 physical experiments and achieves 50% higher success rates than baselines from prior work. Supplementary material, code, and videos can be found at https://tinyurl.com/rssuntangling.

Policy Gradient Bayesian Robust Optimization for Imitation Learning

Jun 21, 2021

Zaynah Javed, Daniel S. Brown, Satvik Sharma, Jerry Zhu, Ashwin Balakrishna, Marek Petrik, Anca D. Dragan, Ken Goldberg

Abstract:The difficulty in specifying rewards for many real-world problems has led to an increased focus on learning rewards from human feedback, such as demonstrations. However, there are often many different reward functions that explain the human feedback, leaving agents with uncertainty over what the true reward function is. While most policy optimization approaches handle this uncertainty by optimizing for expected performance, many applications demand risk-averse behavior. We derive a novel policy gradient-style robust optimization approach, PG-BROIL, that optimizes a soft-robust objective that balances expected performance and risk. To the best of our knowledge, PG-BROIL is the first policy optimization algorithm robust to a distribution of reward hypotheses which can scale to continuous MDPs. Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse and outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty, rather than seeking to uniquely identify the demonstrator's reward function.

* In proceedings of the International Conference on Machine Learning (ICML) 2021
