Shubham Tulsiani

Where2Act: From Pixels to Actions for Articulated 3D Objects

Jan 07, 2021

Kaichun Mo, Leonidas Guibas, Mustafa Mukadam, Abhinav Gupta, Shubham Tulsiani

Abstract: One of the fundamental goals of visual perception is to allow agents to meaningfully interact with their environment. In this paper, we take a step towards that long-term goal -- we extract highly localized actionable information related to elementary actions such as pushing or pulling for articulated objects with movable parts. For example, given a drawer, our network predicts that applying a pulling force on the handle opens the drawer. We propose, discuss, and evaluate novel network architectures that, given image and depth data, predict the set of actions possible at each pixel, and the regions over articulated parts that are likely to move under the force. We propose a learning-from-interaction framework with an online data sampling strategy that allows us to train the network in simulation (SAPIEN) and generalizes across categories. But more importantly, our learned models even transfer to real-world data. Check the project website for the code and data release.

Visual Imitation Made Easy

Aug 11, 2020

Sarah Young, Dhiraj Gandhi, Shubham Tulsiani, Abhinav Gupta, Pieter Abbeel, Lerrel Pinto

Abstract: Visual imitation learning provides a framework for learning complex manipulation behaviors by leveraging human demonstrations. However, current interfaces for imitation such as kinesthetic teaching or teleoperation prohibitively restrict our ability to efficiently collect large-scale data in the wild. Obtaining such diverse demonstration data is paramount for the generalization of learned skills to novel scenarios. In this work, we present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots. We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector. To extract action information from these visual demonstrations, we use off-the-shelf Structure from Motion (SfM) techniques in addition to training a finger detection network. We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task. For both tasks, we use standard behavior cloning to learn executable policies from the previously collected offline demonstrations. To improve learning performance, we employ a variety of data augmentations and provide an extensive analysis of their effects. Finally, we demonstrate the utility of our interface by evaluating on real robotic scenarios with previously unseen objects and achieve an 87% success rate on pushing and a 62% success rate on stacking. Robot videos are available at https://dhiraj100892.github.io/Visual-Imitation-Made-Easy.

Object-Centric Multi-View Aggregation

Jul 21, 2020

Shubham Tulsiani, Or Litany, Charles R. Qi, He Wang, Leonidas J. Guibas

Abstract: We present an approach for aggregating a sparse set of views of an object in order to compute a semi-implicit 3D representation in the form of a volumetric feature grid. Key to our approach is an object-centric canonical 3D coordinate system into which views can be lifted, without explicit camera pose estimation, and then combined -- in a manner that can accommodate a variable number of views and is view order independent. We show that computing a symmetry-aware mapping from pixels to the canonical coordinate system allows us to better propagate information to unseen regions, as well as to robustly overcome pose ambiguities during inference. Our aggregate representation enables us to perform 3D inference tasks like volumetric reconstruction and novel view synthesis, and we use these tasks to demonstrate the benefits of our aggregation approach as compared to implicit or camera-centric alternatives.
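
The abstract asks for an aggregation that accepts any number of views and does not depend on their order. The sketch below illustrates that property only; it is not the paper's operator. It assumes each view has already been lifted into a shared canonical voxel grid, stored here as a sparse dictionary, and it uses plain per-voxel mean pooling as a stand-in. The names `ViewFeatures` and `aggregate_views` are hypothetical.

```python
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]
# Sparse feature grid for one view: canonical voxel index -> feature vector.
ViewFeatures = Dict[Voxel, List[float]]


def aggregate_views(views: List[ViewFeatures]) -> ViewFeatures:
    """Average the features of each voxel over the views that observed it.

    Mean pooling is an assumption made for illustration. It is symmetric in
    its inputs, so the result is independent of view order, and it works for
    any number of views.
    """
    sums: Dict[Voxel, List[float]] = {}
    counts: Dict[Voxel, int] = {}
    for view in views:
        for voxel, feat in view.items():
            if voxel not in sums:
                sums[voxel] = [0.0] * len(feat)
                counts[voxel] = 0
            sums[voxel] = [s + f for s, f in zip(sums[voxel], feat)]
            counts[voxel] += 1
    return {v: [s / counts[v] for s in sums[v]] for v in sums}


# Two toy views that overlap in one voxel.
view_a = {(0, 0, 0): [1.0, 2.0], (1, 0, 0): [4.0, 0.0]}
view_b = {(0, 0, 0): [3.0, 2.0]}
fused_ab = aggregate_views([view_a, view_b])
fused_ba = aggregate_views([view_b, view_a])
```

Swapping the order of `view_a` and `view_b` yields the same fused grid, and a voxel seen by only one view keeps that view's features unchanged.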

Implicit Mesh Reconstruction from Unannotated Image Collections

Jul 16, 2020

Shubham Tulsiani, Nilesh Kulkarni, Abhinav Gupta

Abstract: We present an approach to infer the 3D shape, texture, and camera pose for an object from a single RGB image, using only category-level image collections with foreground masks as supervision. We represent the shape as an image-conditioned implicit function that transforms the surface of a sphere to that of the predicted mesh, while additionally predicting the corresponding texture. To derive supervisory signal for learning, we enforce that: a) our predictions when rendered should explain the available image evidence, and b) the inferred 3D structure should be geometrically consistent with learned pixel-to-surface mappings. We empirically show that our approach improves over prior work that leverages similar supervision, and in fact performs competitively with methods that use stronger supervision. Finally, as our method enables learning with limited supervision, we qualitatively demonstrate its applicability over a set of about 30 object categories.

* Project page: https://shubhtuls.github.io/imr/

See, Hear, Explore: Curiosity via Audio-Visual Association

Jul 07, 2020

Victoria Dean, Shubham Tulsiani, Abhinav Gupta

Abstract: Exploration is one of the core challenges in reinforcement learning. A common formulation of curiosity-driven exploration uses the difference between the real future and the future predicted by a learned model. However, predicting the future is an inherently difficult task which can be ill-posed in the face of stochasticity. In this paper, we introduce an alternative form of curiosity that rewards novel associations between different senses. Our approach exploits multiple modalities to provide a stronger signal for more efficient exploration. Our method is inspired by the fact that, for humans, both sight and sound play a critical role in exploration. We present results on several Atari environments and Habitat (a photorealistic navigation simulator), showing the benefits of using an audio-visual association model for intrinsically guiding learning agents in the absence of external rewards. For videos and code, see https://vdean.github.io/audio-curiosity.html.
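
For context, the "common formulation" that the abstract contrasts itself with can be sketched in a few lines: the intrinsic reward is the error between the observed next state and the one a forward model predicted. This shows the baseline only, not the audio-visual association model the paper proposes. `toy_model` is a hypothetical stand-in for a learned network, and squared error is one assumed choice of difference measure.

```python
from typing import Callable, List, Sequence

ForwardModel = Callable[[Sequence[float], float], List[float]]


def prediction_error_reward(
    forward_model: ForwardModel,
    state: Sequence[float],
    action: float,
    next_state: Sequence[float],
) -> float:
    """Intrinsic reward = squared error between predicted and real future."""
    predicted = forward_model(state, action)
    return sum((p - n) ** 2 for p, n in zip(predicted, next_state))


def toy_model(state: Sequence[float], action: float) -> List[float]:
    # Hypothetical stand-in for a learned model: it assumes the action
    # shifts every state dimension by the same amount.
    return [s + action for s in state]


# A transition the model predicts perfectly earns no curiosity reward.
familiar = prediction_error_reward(toy_model, [0.0, 1.0], 1.0, [1.0, 2.0])
# A transition the model gets wrong earns a positive reward.
surprising = prediction_error_reward(toy_model, [0.0, 1.0], 1.0, [3.0, 2.0])
```

Under this formulation, a transition that is inherently random keeps producing a high error no matter how much the agent has seen it, which is the difficulty with stochasticity that the abstract points to.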

Articulation-aware Canonical Surface Mapping

Apr 02, 2020

Nilesh Kulkarni, Abhinav Gupta, David F. Fouhey, Shubham Tulsiani

Abstract: We tackle the tasks of: 1) predicting a Canonical Surface Mapping (CSM) that indicates the mapping from 2D pixels to corresponding points on a canonical template shape, and 2) inferring the articulation and pose of the template corresponding to the input image. While previous approaches rely on keypoint supervision for learning, we present an approach that can learn without such annotations. Our key insight is that these tasks are geometrically related, and we can obtain supervisory signal by enforcing consistency among the predictions. We present results across a diverse set of animal object categories, showing that our method can learn articulation and CSM prediction from image collections using only foreground mask labels for training. We empirically show that allowing articulation helps learn more accurate CSM prediction, and that enforcing the consistency with predicted CSM is similarly critical for learning meaningful articulation.

* To appear at CVPR 2020, project page https://nileshkulkarni.github.io/acsm/

Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects

Mar 26, 2020

Kiana Ehsani, Shubham Tulsiani, Saurabh Gupta, Ali Farhadi, Abhinav Gupta

Abstract: When we humans look at a video of human-object interaction, we can not only infer what is happening but we can even extract actionable information and imitate those interactions. On the other hand, current recognition or geometric approaches lack the physicality of action representation. In this paper, we take a step towards a more physical understanding of actions. We address the problem of inferring contact points and the physical forces from videos of humans interacting with objects. One of the main challenges in tackling this problem is obtaining ground-truth labels for forces. We sidestep this problem by instead using a physics simulator for supervision. Specifically, we use a simulator to predict effects and enforce that estimated forces must lead to the same effect as depicted in the video. Our quantitative and qualitative results show that (a) we can predict meaningful forces from videos whose effects lead to accurate imitation of the motions observed, (b) by jointly optimizing for contact point and force prediction, we can improve the performance on both tasks in comparison to independent training, and (c) we can learn a representation from this model that generalizes to novel objects using few-shot examples.

* CVPR 2020 -- (Oral presentation)

Intrinsic Motivation for Encouraging Synergistic Behavior

Feb 12, 2020

Rohan Chitnis, Shubham Tulsiani, Saurabh Gupta, Abhinav Gupta

Abstract: We study the role of intrinsic motivation as an exploration bias for reinforcement learning in sparse-reward synergistic tasks, which are tasks where multiple agents must work together to achieve a goal they could not achieve individually. Our key idea is that a good guiding principle for intrinsic motivation in synergistic tasks is to take actions which affect the world in ways that would not be achieved if the agents were acting on their own. Thus, we propose to incentivize agents to take (joint) actions whose effects cannot be predicted via a composition of the predicted effect for each individual agent. We study two instantiations of this idea, one based on the true states encountered, and another based on a dynamics model trained concurrently with the policy. While the former is simpler, the latter has the benefit of being analytically differentiable with respect to the action taken. We validate our approach in robotic bimanual manipulation and multi-agent locomotion tasks with sparse rewards; we find that our approach yields more efficient learning than both 1) training with only the sparse reward and 2) using the typical surprise-based formulation of intrinsic motivation, which does not bias toward synergistic behavior. Videos are available on the project webpage: https://sites.google.com/view/iclr2020-synergistic.

* ICLR 2020 camera-ready
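
The key idea in this abstract, rewarding joint actions whose effect cannot be explained by composing each agent's individual predicted effect, can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: effects are represented as state-change vectors, the composition is a simple sum, and the reward is the Euclidean distance between the observed joint effect and that composition. The function name `synergy_reward` is hypothetical.

```python
import math
from typing import Sequence


def synergy_reward(
    joint_effect: Sequence[float],
    predicted_effect_a: Sequence[float],
    predicted_effect_b: Sequence[float],
) -> float:
    """Distance between the observed joint effect and the composed prediction.

    Composition by addition is an assumption of this sketch; the abstract
    only says that some composition of per-agent predictions is used.
    """
    composed = [a + b for a, b in zip(predicted_effect_a, predicted_effect_b)]
    return math.sqrt(sum((j - c) ** 2 for j, c in zip(joint_effect, composed)))


# Independent behavior: each arm pushes its own object, so the joint effect
# is exactly the sum of the individual effects and earns no reward.
independent = synergy_reward([1.0, 1.0], [1.0, 0.0], [0.0, 1.0])

# Synergistic behavior: neither arm alone can lift the bar (predicted
# effect zero), but together they raise it by 0.5.
synergistic = synergy_reward([0.0, 0.5], [0.0, 0.0], [0.0, 0.0])
```

The second case is the kind of outcome the reward is meant to encourage: the world changed in a way that neither agent's solo prediction accounts for.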

Object-centric Forward Modeling for Model Predictive Control

Oct 08, 2019

Yufei Ye, Dhiraj Gandhi, Abhinav Gupta, Shubham Tulsiani

Abstract: We present an approach to learn an object-centric forward model, and show that this allows us to plan for sequences of actions to achieve distant desired goals. We propose to model a scene as a collection of objects, each with an explicit spatial location and implicit visual feature, and learn to model the effects of actions using random interaction data. Our model allows capturing the robot-object and object-object interactions, and leads to more sample-efficient and accurate predictions. We show that this learned model can be leveraged to search for action sequences that lead to desired goal configurations, and that in conjunction with a learned correction module, this allows for robust closed-loop execution. We present experiments both in simulation and the real world, and show that our approach improves over alternate implicit or pixel-space forward models. Please see our project page (https://judyye.github.io/ocmpc/) for result videos.
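
The general pattern of searching for action sequences with a forward model and executing in closed loop can be sketched as follows. This is a generic model-predictive-control loop under strong simplifying assumptions, not the paper's planner: the state is a single object position, `toy_model` is a hypothetical stand-in for the learned forward model, the search is exhaustive over a small discretized action set, and replanning from the newly observed state replaces the learned correction module.

```python
import itertools
from typing import Callable, Sequence, Tuple

ForwardModel = Callable[[float, float], float]


def plan(
    model: ForwardModel,
    state: float,
    goal: float,
    actions: Sequence[float],
    horizon: int,
) -> Tuple[float, ...]:
    """Return the action sequence with the lowest summed distance to the goal."""
    best_cost = float("inf")
    best_seq: Tuple[float, ...] = ()
    for seq in itertools.product(actions, repeat=horizon):
        s, cost = state, 0.0
        for a in seq:
            s = model(s, a)        # roll the forward model ahead one step
            cost += abs(s - goal)  # penalize distance at every step
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq


def toy_model(position: float, push: float) -> float:
    # Hypothetical stand-in for a learned forward model.
    return position + push


def run_closed_loop(model: ForwardModel, state: float, goal: float, steps: int) -> float:
    actions = (-1.0, -0.5, 0.0, 0.5, 1.0)
    for _ in range(steps):
        # Plan a full sequence, execute only its first action, then replan.
        first_action = plan(model, state, goal, actions, horizon=3)[0]
        # Stand-in for acting on the real system and re-observing the state.
        state = model(state, first_action)
    return state


final_position = run_closed_loop(toy_model, 0.0, 2.0, steps=5)
```

Executing only the first planned action and then replanning is what makes the loop closed: any mismatch between the model and the real system is seen at the next step rather than accumulating over the whole sequence.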

Efficient Bimanual Manipulation Using Learned Task Schemas

Sep 30, 2019

Rohan Chitnis, Shubham Tulsiani, Saurabh Gupta, Abhinav Gupta

Abstract: We address the problem of effectively composing skills to solve sparse-reward tasks in the real world. Given a set of parameterized skills (such as exerting a force or doing a top grasp at a location), our goal is to learn policies that invoke these skills to efficiently solve such tasks. Our insight is that for many tasks, the learning process can be decomposed into learning a state-independent task schema (a sequence of skills to execute) and a policy to choose the parameterizations of the skills in a state-dependent manner. For such tasks, we show that explicitly modeling the schema's state-independence can yield significant improvements in sample efficiency for model-free reinforcement learning algorithms. Furthermore, these schemas can be transferred to solve related tasks, by simply re-learning the parameterizations with which the skills are invoked. We find that doing so enables learning to solve sparse-reward tasks on real-world robotic systems very efficiently. We validate our approach experimentally over a suite of robotic bimanual manipulation tasks, both in simulation and on real hardware. See videos at http://tinyurl.com/chitnis-schema.
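
The decomposition described in this abstract, a fixed sequence of skills plus a state-dependent choice of each skill's parameter, can be pictured with a toy example. In the paper both parts are learned with reinforcement learning; in this sketch they are written by hand to show the structure only. The one-dimensional world, the skills `move_to` and `push`, and the names `run_schema` and `param_policy` are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

State = List[float]  # [gripper_x, object_x] in a toy 1D world
Skill = Callable[[State, float], State]


def move_to(state: State, x: float) -> State:
    _, obj = state
    return [x, obj]


def push(state: State, distance: float) -> State:
    gripper, obj = state
    if gripper != obj:
        return [gripper, obj]  # a push only works when the gripper is at the object
    return [gripper + distance, obj + distance]


SKILLS: Dict[str, Skill] = {"move_to": move_to, "push": push}
SCHEMA = ("move_to", "push")  # state-independent: the same for every start state
GOAL_X = 5.0


def param_policy(step: int, state: State) -> float:
    """State-dependent parameter choice (hand-written here, learned in the paper)."""
    _, obj = state
    return obj if step == 0 else GOAL_X - obj


def run_schema(
    schema: Sequence[str],
    skills: Dict[str, Skill],
    policy: Callable[[int, State], float],
    state: State,
) -> State:
    for step, name in enumerate(schema):
        state = skills[name](state, policy(step, state))
    return state


final_a = run_schema(SCHEMA, SKILLS, param_policy, [0.0, 2.0])
final_b = run_schema(SCHEMA, SKILLS, param_policy, [1.0, 4.0])
```

Both start states are solved by the same schema; only the parameters differ, which is the separation the abstract credits for the gains in sample efficiency and for transfer to related tasks.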
