Yuke Zhu

Hierarchical Planning for Long-Horizon Manipulation with Geometric and Symbolic Scene Graphs

Dec 14, 2020
Yifeng Zhu, Jonathan Tremblay, Stan Birchfield, Yuke Zhu

We present a visually grounded hierarchical planning algorithm for long-horizon manipulation tasks. Our algorithm offers a joint framework of neuro-symbolic task planning and low-level motion generation conditioned on the specified goal. At the core of our approach is a two-level scene graph representation, namely a geometric scene graph and a symbolic scene graph. This hierarchical representation serves as a structured, object-centric abstraction of manipulation scenes. Our model uses graph neural networks to process these scene graphs for predicting high-level task plans and low-level motions. We demonstrate that our method scales to long-horizon tasks and generalizes well to novel task goals. We validate our method in a kitchen storage task in both physical simulation and the real world. Our experiments show that our method achieved a success rate of over 70% and a subgoal completion rate of nearly 90% on the real robot while being four orders of magnitude faster in computation time than a standard search-based task-and-motion planner.

Learning Multi-Arm Manipulation Through Collaborative Teleoperation

Dec 12, 2020
Albert Tung, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Yuke Zhu, Li Fei-Fei, Silvio Savarese

Imitation Learning (IL) is a powerful paradigm to teach robots to perform manipulation tasks by allowing them to learn from human demonstrations collected via teleoperation, but has mostly been limited to single-arm manipulation. However, many real-world tasks require multiple arms, such as lifting a heavy object or assembling a desk. Unfortunately, applying IL to multi-arm manipulation tasks has been challenging: asking a human to control more than one robotic arm can impose a significant cognitive burden and is often only possible for a maximum of two robot arms. To address these challenges, we present Multi-Arm RoboTurk (MART), a multi-user data collection platform that allows multiple remote users to simultaneously teleoperate a set of robotic arms and collect demonstrations for multi-arm tasks. Using MART, we collected demonstrations for five novel two- and three-arm tasks from several geographically separated users. From our data we arrived at a critical insight: most multi-arm tasks do not require global coordination throughout their full duration, but only during specific moments. We show that learning from such data consequently presents challenges for centralized agents that directly attempt to model all robot actions simultaneously, and perform a comprehensive study of different policy architectures with varying levels of centralization on our tasks. Finally, we propose and evaluate a base-residual policy framework that allows trained policies to better adapt to the mixed coordination setting common in multi-arm manipulation, and show that a centralized policy augmented with a decentralized residual model outperforms all other models on our set of benchmark tasks. Additional results and videos at https://roboturk.stanford.edu/multiarm .

* First two authors contributed equally
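
The base-residual idea described in the abstract above can be illustrated with a minimal sketch. The abstract states only that a centralized policy is augmented with a decentralized residual model; everything else here (the function names, the additive combination, the `residual_scale` parameter, and the toy policies standing in for learned networks) is a hypothetical assumption for illustration, not the authors' implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def base_residual_action(
    observations: Sequence[Vector],
    base_policy: Callable[[Sequence[Vector]], List[Vector]],
    residual_policies: Sequence[Callable[[Vector], Vector]],
    residual_scale: float = 1.0,
) -> List[Vector]:
    """Combine a centralized base action with per-arm decentralized residuals."""
    # Centralized base: sees every arm's observation, outputs one action per arm.
    base_actions = base_policy(observations)
    combined = []
    for obs, base, residual in zip(observations, base_actions, residual_policies):
        # Decentralized residual: sees only its own arm's observation.
        delta = residual(obs)
        combined.append([b + residual_scale * d for b, d in zip(base, delta)])
    return combined


# Toy stand-ins for learned networks (hypothetical).
def toy_base(observations):
    # Every arm moves toward the mean of all arms' observations.
    n = len(observations)
    dim = len(observations[0])
    mean = [sum(o[i] for o in observations) / n for i in range(dim)]
    return [[m - x for m, x in zip(mean, o)] for o in observations]


def toy_residual(obs):
    # Small local correction proportional to the arm's own observation.
    return [-0.1 * x for x in obs]


observations = [[1.0, 0.0], [3.0, 2.0]]
actions = base_residual_action(observations, toy_base, [toy_residual, toy_residual])
```

In this sketch the base policy carries the moments that need global coordination, while each residual can only adjust its own arm's action using local information.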

Human-in-the-Loop Imitation Learning using Remote Teleoperation

Dec 12, 2020
Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín, Yuke Zhu, Li Fei-Fei, Silvio Savarese

Imitation Learning is a promising paradigm for learning complex robot manipulation skills by reproducing behavior from human demonstrations. However, manipulation tasks often contain bottleneck regions that require a sequence of precise actions to make meaningful progress, such as a robot inserting a pod into a coffee machine to make coffee. Trained policies can fail in these regions because small deviations in actions can lead the policy into states not covered by the demonstrations. Intervention-based policy learning is an alternative that can address this issue: it allows human operators to monitor trained policies and take over control when they encounter failures. In this paper, we build a data collection system tailored to 6-DoF manipulation settings that enables remote human operators to monitor and intervene on trained policies. We develop a simple and effective algorithm to train the policy iteratively on new data collected by the system that encourages the policy to learn how to traverse bottlenecks through the interventions. We demonstrate that agents trained on data collected by our intervention-based system and algorithm outperform agents trained on an equivalent number of samples collected by non-interventional demonstrators, and further show that our method outperforms multiple state-of-the-art baselines for learning from human interventions on a challenging robot threading task and a coffee making task. Additional results and videos at https://sites.google.com/stanford.edu/iwr .

Detect, Reject, Correct: Crossmodal Compensation of Corrupted Sensors

Dec 01, 2020
Michelle A. Lee, Matthew Tan, Yuke Zhu, Jeannette Bohg

Using sensor data from multiple modalities presents an opportunity to encode redundant and complementary features that can be useful when one modality is corrupted or noisy. Humans do this every day, relying on touch and proprioceptive feedback in visually challenging environments. However, robots might not always know when their sensors are corrupted, as even broken sensors can return valid values. In this work, we introduce the Crossmodal Compensation Model (CCM), which can detect corrupted sensor modalities and compensate for them. CCM is a representation model learned with self-supervision that leverages unimodal reconstruction loss for corruption detection. CCM then discards the corrupted modality and compensates for it with information from the remaining sensors. We show that CCM learns rich state representations that can be used for contact-rich manipulation policies, even when input modalities are corrupted in ways not seen during training time.

* 8 pages, 5 figures
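
The detect-and-reject steps named in the abstract above can be sketched as follows. The abstract says only that a unimodal reconstruction loss is used for corruption detection and that the corrupted modality is then discarded; the mean-squared-error measure, the fixed per-modality thresholds, and all names below are hypothetical assumptions, and the compensation step (filling in from the remaining sensors) is deliberately left out.

```python
from typing import Dict, List, Tuple


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between an observation and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def detect_and_reject(
    observed: Dict[str, List[float]],
    reconstructed: Dict[str, List[float]],
    thresholds: Dict[str, float],
) -> Tuple[Dict[str, List[float]], List[str]]:
    """Flag modalities whose reconstruction error exceeds a threshold and drop them."""
    kept, rejected = {}, []
    for name, signal in observed.items():
        if mse(signal, reconstructed[name]) > thresholds[name]:
            rejected.append(name)  # treated as corrupted (assumed decision rule)
        else:
            kept[name] = signal
    return kept, rejected


# Toy values: the vision reconstruction is far from the observation, force is close.
observed = {"vision": [0.9, 0.1, 0.8], "force": [0.2, 0.2, 0.1]}
reconstructed = {"vision": [0.1, 0.9, 0.1], "force": [0.2, 0.1, 0.1]}
thresholds = {"vision": 0.05, "force": 0.05}
kept, rejected = detect_and_reject(observed, reconstructed, thresholds)
```

Here `reconstructed` stands in for the output of a learned per-modality decoder, which this sketch does not model.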

Deep Affordance Foresight: Planning Through What Can Be Done in the Future

Nov 17, 2020
Danfei Xu, Ajay Mandlekar, Roberto Martín-Martín, Yuke Zhu, Silvio Savarese, Li Fei-Fei

Planning in realistic environments requires searching in large planning spaces. Affordances are a powerful concept to simplify this search, because they model what actions can be successful in a given situation. However, the classical notion of affordance is not suitable for long-horizon planning because it only informs the robot about the immediate outcome of actions instead of what actions are best for achieving a long-term goal. In this paper, we introduce a new affordance representation that enables the robot to reason about the long-term effects of actions through modeling what actions are afforded in the future, thereby informing the robot of the best actions to take next to achieve a task goal. Based on the new representation, we develop a learning-to-plan method, Deep Affordance Foresight (DAF), that learns partial environment models of affordances of parameterized motor skills through trial-and-error. We evaluate DAF on two challenging manipulation domains and show that it can effectively learn to carry out multi-step tasks, share learned affordance representations among different tasks, and learn to plan with high-dimensional image inputs. Additional material is available at https://sites.google.com/stanford.edu/daf

Fast Uncertainty Quantification for Deep Object Pose Estimation

Nov 16, 2020
Guanya Shi, Yifeng Zhu, Jonathan Tremblay, Stan Birchfield, Fabio Ramos, Animashree Anandkumar, Yuke Zhu

Deep learning-based object pose estimators are often unreliable and overconfident, especially when the input image is outside the training domain, for instance, with sim2real transfer. Efficient and robust uncertainty quantification (UQ) in pose estimators is critically needed in many robotic tasks. In this work, we propose a simple, efficient, and plug-and-play UQ method for 6-DoF object pose estimation. We ensemble 2-3 pre-trained models with different neural network architectures and/or training data sources, and compute their average pairwise disagreement against one another to obtain the uncertainty quantification. We propose four disagreement metrics, including a learned metric, and show that the average distance (ADD) is the best learning-free metric and is only slightly worse than the learned metric, which requires labeled target data. Our method has several advantages compared to the prior art: 1) our method does not require any modification of the training process or the model inputs; and 2) it needs only one forward pass for each model. We evaluate the proposed UQ method on three tasks where our uncertainty quantification yields much stronger correlations with pose estimation errors than the baselines. Moreover, in a real robot grasping task, our method increases the grasping success rate from 35% to 90%.

* Video and code are available at https://sites.google.com/view/fastuq
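
The ensemble disagreement described in the abstract above can be sketched in a few lines. This is a minimal illustration, assuming each model returns a pose as a 3x3 rotation matrix plus a translation vector and that a set of 3D model points for the object is available; the function names and the toy cube are hypothetical, and this is not the authors' released code.

```python
from itertools import combinations
from math import sqrt


def transform(points, rotation, translation):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) to 3D points."""
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


def add_distance(points, pose_a, pose_b):
    """ADD: mean Euclidean distance between model points placed by two poses."""
    pa = transform(points, *pose_a)
    pb = transform(points, *pose_b)
    total = sum(
        sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) for a, b in zip(pa, pb)
    )
    return total / len(points)


def ensemble_uncertainty(points, poses):
    """Average pairwise ADD over every pair of pose estimates in the ensemble."""
    pairs = list(combinations(poses, 2))
    return sum(add_distance(points, a, b) for a, b in pairs) / len(pairs)


# Toy object: corners of a unit cube (hypothetical model points).
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Three hypothetical pose estimates that differ only in translation.
poses = [
    (identity, (0.0, 0.0, 0.0)),
    (identity, (0.01, 0.0, 0.0)),
    (identity, (0.0, 0.02, 0.0)),
]
uncertainty = ensemble_uncertainty(cube, poses)
```

Because each model's pose is computed once and only the pairwise distances are added afterwards, the cost matches the single forward pass per model that the abstract mentions. A larger `uncertainty` value means the ensemble members disagree more about where the object is.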

Learning a Contact-Adaptive Controller for Robust, Efficient Legged Locomotion

Oct 05, 2020
Xingye Da, Zhaoming Xie, David Hoeller, Byron Boots, Animashree Anandkumar, Yuke Zhu, Buck Babich, Animesh Garg

We present a hierarchical framework that combines model-based control and reinforcement learning (RL) to synthesize robust controllers for a quadruped (the Unitree Laikago). The system consists of a high-level controller that learns to choose from a set of primitives in response to changes in the environment and a low-level controller that utilizes an established control method to robustly execute the primitives. Our framework learns a controller that can adapt to challenging environmental changes on the fly, including novel scenarios not seen during training. The learned controller is up to 85 percent more energy efficient and is more robust compared to baseline methods. We also deploy the controller on a physical robot without any randomization or adaptation scheme.

* supplementary video: https://youtu.be/JJOmFZKpYTo

Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning

Oct 02, 2020
Weili Nie, Zhiding Yu, Lei Mao, Ankit B. Patel, Yuke Zhu, Animashree Anandkumar

Humans have an inherent ability to learn novel concepts from only a few samples and generalize these concepts to different situations. Even though today's machine learning models excel with a plethora of training data on standard recognition tasks, a considerable gap exists between machine-level pattern recognition and human-level concept learning. To narrow this gap, the Bongard Problems (BPs) were introduced as an inspirational challenge for visual cognition in intelligent systems. Despite new advances in representation learning and learning to learn, BPs remain a daunting challenge for modern AI. Inspired by the original one hundred BPs, we propose a new benchmark, Bongard-LOGO, for human-level concept learning and reasoning. We develop a program-guided generation technique to produce a large set of human-interpretable visual cognition problems in action-oriented LOGO language. Our benchmark captures three core properties of human cognition: 1) context-dependent perception, in which the same object may have disparate interpretations given different contexts; 2) analogy-making perception, in which some meaningful concepts are traded off for other meaningful concepts; and 3) perception with a few samples but infinite vocabulary. In experiments, we show that the state-of-the-art deep learning methods perform substantially worse than human subjects, implying that they fail to capture core human cognition properties. Finally, we discuss research directions towards a general architecture for visual reasoning to tackle this benchmark.

* 21 pages, NeurIPS 2020

robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

Sep 25, 2020
Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín

robosuite is a simulation framework for robot learning powered by the MuJoCo physics engine. It offers a modular design for creating robotic tasks as well as a suite of benchmark environments for reproducible research. This paper discusses the key system modules and the benchmark environments of our new release robosuite v1.0.

* For more information, please visit https://robosuite.ai
