
Homanga Bharadhwaj


RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking

Sep 05, 2023
Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, Vikash Kumar

The grand aim of having a single robot that can manipulate arbitrary objects in diverse settings is at odds with the paucity of robotics datasets. Acquiring and growing such datasets is strenuous due to manual efforts, operational costs, and safety challenges. A path toward such a universal agent would require a structured framework capable of wide generalization but trained within a reasonable data budget. In this paper, we develop an efficient system (RoboAgent) for training universal agents capable of multi-task manipulation skills using (a) semantic augmentations that can rapidly multiply existing datasets and (b) action representations that can extract performant policies from small yet diverse multi-modal datasets without overfitting. In addition, reliable task conditioning and an expressive policy architecture enable our agent to exhibit a diverse repertoire of skills in novel situations specified using language commands. Using merely 7500 demonstrations, we are able to train a single agent capable of 12 unique skills, and demonstrate its generalization over 38 tasks spread across common daily activities in diverse kitchen scenes. On average, RoboAgent outperforms prior methods by over 40% in unseen situations while being more sample efficient and amenable to capability improvements and extensions through fine-tuning. Videos at https://robopen.github.io/
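
For readers unfamiliar with action chunking, here is a minimal sketch of the general idea referenced above: the policy predicts a short horizon of future actions in one forward pass, and overlapping predictions are blended at execution time. The class names, dimensions, and weighting scheme below are illustrative assumptions, not the RoboAgent implementation.

```python
# Minimal sketch of action chunking: predict a chunk of H future actions per
# inference and blend the overlapping predictions at execution time.
# All names and sizes are placeholders for illustration.
import torch
import torch.nn as nn


class ChunkedPolicy(nn.Module):
    def __init__(self, obs_dim=512, lang_dim=384, act_dim=8, chunk=20):
        super().__init__()
        self.chunk, self.act_dim = chunk, act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + lang_dim, 1024), nn.ReLU(),
            nn.Linear(1024, chunk * act_dim),   # one pass -> H actions
        )

    def forward(self, obs_feat, lang_feat):
        out = self.net(torch.cat([obs_feat, lang_feat], dim=-1))
        return out.view(-1, self.chunk, self.act_dim)


def temporal_ensemble(chunks, t, decay=0.01):
    """Blend all previously predicted actions that target timestep t, giving
    exponentially higher weight to more recent predictions."""
    preds, weights = [], []
    for t0, chunk in chunks:                    # chunk was predicted at time t0
        if t0 <= t < t0 + chunk.shape[0]:
            preds.append(chunk[t - t0])
            weights.append(torch.exp(torch.tensor(-decay * float(t - t0))))
    w = torch.stack(weights)
    return (torch.stack(preds) * (w / w.sum()).unsqueeze(-1)).sum(0)


policy = ChunkedPolicy()
chunk0 = policy(torch.randn(1, 512), torch.randn(1, 384))[0]    # (20, 8)
action_t = temporal_ensemble([(0, chunk0)], t=3)                # blended action
```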

Visual Affordance Prediction for Guiding Robot Exploration

May 28, 2023
Homanga Bharadhwaj, Abhinav Gupta, Shubham Tulsiani

Motivated by the intuitive understanding humans have about the space of possible interactions, and the ease with which they can generalize this understanding to previously unseen scenes, we develop an approach for learning visual affordances for guiding robot exploration. Given an input image of a scene, we infer a distribution over plausible future states that can be achieved via interactions with it. We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE and show that these models can be trained using large-scale and diverse passive data, and that the learned models exhibit compositional generalization to diverse objects beyond the training distribution. We show how the trained affordance model can be used to guide exploration by acting as a goal-sampling distribution during visual goal-conditioned policy learning in robotic manipulation.
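
As a rough illustration of the goal-sampling use of such an affordance model, the sketch below has a small Transformer sample discrete codes of a plausible future state conditioned on the codes of the current image; a VQ-VAE decoder (not shown) would turn those codes back into a goal image. Module names and sizes are assumptions for illustration, and the model is a greatly simplified stand-in for the one in the paper.

```python
# Illustrative sketch: a Transformer prior over discrete VQ-VAE codes acts as a
# goal sampler, producing codes of plausible future states given current codes.
import torch
import torch.nn as nn


class AffordancePrior(nn.Module):
    """p(goal tokens | context tokens) over a VQ-VAE codebook of size K."""
    def __init__(self, K=512, n_tokens=64, d=256):
        super().__init__()
        self.embed = nn.Embedding(K, d)
        self.pos = nn.Parameter(torch.zeros(2 * n_tokens, d))
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.tf = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d, K)
        self.n_tokens = n_tokens

    def sample_goal_tokens(self, context_tokens, temperature=1.0):
        """Sample n_tokens goal codes one at a time, conditioned on the
        context codes and the goal codes drawn so far."""
        tokens = context_tokens                      # (B, n_tokens) integer codes
        for _ in range(self.n_tokens):
            x = self.embed(tokens) + self.pos[: tokens.shape[1]]
            logits = self.head(self.tf(x))[:, -1] / temperature
            nxt = torch.multinomial(logits.softmax(-1), 1)
            tokens = torch.cat([tokens, nxt], dim=1)
        return tokens[:, -self.n_tokens:]            # goal codes to decode


# Usage: decode the sampled codes with the VQ-VAE decoder and hand the result
# to a goal-conditioned policy as an exploration goal.
prior = AffordancePrior()
goal_codes = prior.sample_goal_tokens(torch.randint(0, 512, (1, 64)))
```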

* Older paper; presented at ICRA 2023 

Zero-Shot Robot Manipulation from Passive Human Videos

Feb 03, 2023
Homanga Bharadhwaj, Abhinav Gupta, Shubham Tulsiani, Vikash Kumar

Can we learn robot manipulation for everyday tasks, only by watching videos of humans doing arbitrary tasks in different unstructured settings? Unlike widely adopted strategies of learning task-specific behaviors or direct imitation of a human video, we develop a framework for extracting agent-agnostic action representations from human videos, and then map them to the agent's embodiment during deployment. Our framework is based on predicting plausible human hand trajectories given an initial image of a scene. After training this prediction model on a diverse set of human videos from the internet, we deploy the trained model zero-shot for physical robot manipulation tasks, after appropriate transformations to the robot's embodiment. This simple strategy lets us solve coarse manipulation tasks like opening and closing drawers, pushing, and tool use, without access to any in-domain robot manipulation trajectories. Our real-world deployment results establish a strong baseline for the action-prediction information that can be acquired from diverse, arbitrary videos of human activities and be useful for zero-shot robotic manipulation in unseen scenes.
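
A hedged sketch of the high-level pipeline described above: predict plausible hand waypoints from a single image, then transform them into the robot's frame for execution. The predictor architecture, the 4-D waypoint format, and the camera-to-robot transform are placeholder assumptions, not the paper's implementation.

```python
# Sketch: image feature -> future hand waypoints -> retarget to robot frame.
# All modules and the transform are stand-ins for illustration.
import numpy as np
import torch
import torch.nn as nn


class HandTrajectoryPredictor(nn.Module):
    """Maps an image feature to T future hand waypoints (x, y, z, grasp)."""
    def __init__(self, feat_dim=512, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * 4),
        )

    def forward(self, feat):
        return self.mlp(feat).view(-1, self.horizon, 4)


def retarget_to_robot(waypoints_cam, T_robot_from_cam):
    """Transform predicted waypoints from the camera frame to the robot base frame."""
    xyz = waypoints_cam[:, :3]
    xyz_h = np.concatenate([xyz, np.ones((len(xyz), 1))], axis=1)   # homogeneous
    xyz_robot = (T_robot_from_cam @ xyz_h.T).T[:, :3]
    return np.concatenate([xyz_robot, waypoints_cam[:, 3:]], axis=1)


# Usage sketch: predict, retarget, then hand the waypoints to whatever
# end-effector controller the robot exposes.
pred = HandTrajectoryPredictor()
traj = pred(torch.randn(1, 512))[0].detach().numpy()      # (10, 4)
robot_traj = retarget_to_robot(traj, np.eye(4))
```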

* Preprint. Under review 

Offline Policy Optimization in RL with Variance Regularization

Dec 29, 2022
Riashat Islam, Samarth Sinha, Homanga Bharadhwaj, Samin Yeasar Arnob, Zhuoran Yang, Animesh Garg, Zhaoran Wang, Lihong Li, Doina Precup

Learning policies from fixed offline datasets is a key challenge in scaling up reinforcement learning (RL) algorithms towards practical applications. This is often because off-policy RL algorithms suffer from distributional shift, due to a mismatch between the dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that by using Fenchel duality, we can avoid double sampling issues when computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm. We show that the regularizer leads to a lower bound on the offline policy optimization objective, which can help avoid over-estimation errors and explains the benefits of our approach across a range of continuous control domains when compared to existing state-of-the-art algorithms.
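
As background on the double-sampling issue mentioned above, a generic variance penalty can be rewritten using the Fenchel conjugate of the square function, so that its gradient requires only single samples. The derivation below is a schematic in terms of a generic random variable X, not the paper's exact objective.

```latex
% Generic illustration of removing the squared expectation (the source of the
% double-sampling issue) from a variance penalty via Fenchel duality.
\begin{align}
\mathrm{Var}(X) &= \mathbb{E}[X^2] - \big(\mathbb{E}[X]\big)^2, \\
\big(\mathbb{E}[X]\big)^2 &= \max_{\nu \in \mathbb{R}} \Big( 2\,\nu\,\mathbb{E}[X] - \nu^2 \Big)
  && \text{(Fenchel dual of } y \mapsto y^2\text{)}, \\
\Rightarrow\;
\mathrm{Var}(X) &= \min_{\nu \in \mathbb{R}} \Big( \mathbb{E}[X^2] - 2\,\nu\,\mathbb{E}[X] + \nu^2 \Big)
  = \min_{\nu \in \mathbb{R}} \mathbb{E}\big[(X - \nu)^2\big].
\end{align}
```

The final form is linear in the distribution of X, so each gradient step only needs one sample of X together with the auxiliary variable nu (whose optimum is the mean of X).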

* Older draft; presented at the Offline RL Workshop, NeurIPS 2020 

CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning

Dec 12, 2022
Zhao Mandi, Homanga Bharadhwaj, Vincent Moens, Shuran Song, Aravind Rajeswaran, Vikash Kumar

Developing robots that are capable of many skills and generalize to unseen scenarios requires progress on two fronts: efficient collection of large and diverse datasets, and training of high-capacity policies on the collected data. While large datasets have propelled progress in other fields like computer vision and natural language processing, collecting data of comparable scale is particularly challenging for physical systems like robotics. In this work, we propose a framework to bridge this gap and better scale up robot learning, under the lens of multi-task, multi-scene robot manipulation in kitchen environments. Our framework, named CACTI, has four stages that separately handle data collection, data augmentation, visual representation learning, and imitation policy training. In the CACTI framework, we highlight the benefit of adapting state-of-the-art models for image generation as part of the augmentation stage, and the significant improvement in training efficiency from using pretrained out-of-domain visual representations at the representation-learning (compression) stage. Experimentally, we demonstrate that 1) on a real robot setup, CACTI enables efficient training of a single policy that is capable of 10 manipulation tasks involving kitchen objects and robust to varying layouts of distractor objects; 2) in a simulated kitchen environment, CACTI trains a single policy on 18 semantic tasks across up to 50 layout variations per task. The simulation task benchmark and augmented datasets in both real and simulated environments will be released to facilitate future research.
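
The sketch below captures only the four-stage structure (collect, augment, compress, train) as a pipeline of function calls; every body is a stand-in, and none of the names come from the CACTI codebase.

```python
# Schematic of a collect -> augment -> compress -> train pipeline.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trajectory:
    frames: List[object] = field(default_factory=list)   # raw camera images
    actions: List[object] = field(default_factory=list)
    task: str = ""


def collect(num_demos: int) -> List[Trajectory]:
    """Stage 1: gather a modest set of expert demonstrations per task."""
    return [Trajectory(task=f"task_{i % 10}") for i in range(num_demos)]


def augment(demos: List[Trajectory], variants: int) -> List[Trajectory]:
    """Stage 2: multiply scene diversity, e.g. with generative image models
    that repaint backgrounds/distractors while keeping the action labels."""
    return demos * variants                      # placeholder for real augmentation


def compress(demos: List[Trajectory]):
    """Stage 3: map frames through a frozen pretrained visual encoder so the
    policy trains on compact embeddings instead of raw pixels."""
    return demos                                 # placeholder


def train_policy(dataset) -> object:
    """Stage 4: multi-task imitation learning on the compressed dataset."""
    return object()


policy = train_policy(compress(augment(collect(100), variants=5)))
```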

Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective

Sep 18, 2022
Raj Ghugare, Homanga Bharadhwaj, Benjamin Eysenbach, Sergey Levine, Ruslan Salakhutdinov

While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample efficient than their model-free counterparts, learning to model raw observations from high-dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representations of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL on policy exploration or model guarantees, our bound is directly on the overall RL objective. We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods. While such sample-efficient methods are typically computationally demanding, our method attains the performance of SAC in about 50% less wall-clock time.
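
As a very rough schematic of what a single joint objective can look like, the function below combines a return term (the policy rolled out in the latent model) with a latent self-consistency term, and is maximized with respect to the encoder, model, and policy together. This is an assumption-laden sketch of the general idea, not the variational bound derived in the paper.

```python
# Schematic joint objective: return predicted in the latent model, plus a
# penalty when latent rollouts drift from encoded real observations.
import torch


def joint_objective(encoder, model, reward_fn, policy, obs_seq, act_seq, beta=1.0):
    """One objective, three players: encoder, latent model, and policy."""
    # (a) self-consistency: latent dynamics should track encoded next observations
    z = encoder(obs_seq[0])
    consistency = 0.0
    for t, a in enumerate(act_seq):
        z = model(z, a)
        consistency = consistency - beta * ((z - encoder(obs_seq[t + 1]).detach()) ** 2).mean()
    # (b) return term: the policy is trained to obtain high predicted reward
    #     when rolled out inside the same latent model
    z = encoder(obs_seq[0]).detach()
    ret = 0.0
    for _ in range(len(act_seq)):
        a = policy(z)
        ret = ret + reward_fn(z, a)
        z = model(z, a)
    return ret + consistency          # maximized jointly w.r.t. all modules


# Toy usage with linear stand-ins, just to show the call signature.
enc = torch.nn.Linear(8, 4)
dyn = lambda z, a: z + torch.nn.functional.pad(a, (0, 2))    # fake dynamics
rew = lambda z, a: z.sum()                                    # fake reward
pol = lambda z: z[..., :2]                                    # fake policy
obs = [torch.randn(8) for _ in range(4)]
acts = [torch.randn(2) for _ in range(3)]
loss = -joint_objective(enc, dyn, rew, pol, obs, acts)        # minimize negative
```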

* 9 pages (without references and appendix), 17 figures, 25 pages total. Project website with code: https://alignedlatentmodels.github.io/ 

INFOrmation Prioritization through EmPOWERment in Visual Model-Based RL

Apr 18, 2022
Homanga Bharadhwaj, Mohammad Babaeizadeh, Dumitru Erhan, Sergey Levine

Model-based reinforcement learning (RL) algorithms designed for handling complex visual observations typically learn some sort of latent state representation, either explicitly or implicitly. Standard methods of this sort do not distinguish between functionally relevant aspects of the state and irrelevant distractors, instead aiming to represent all available information equally. We propose a modified objective for model-based RL that, in combination with mutual information maximization, allows us to learn representations and dynamics for visual model-based RL without reconstruction in a way that explicitly prioritizes functionally relevant factors. The key principle behind our design is to integrate a term inspired by variational empowerment into a state-space model based on mutual information. This term prioritizes information that is correlated with action, thus ensuring that functionally relevant factors are captured first. Furthermore, the same empowerment term also promotes faster exploration during the RL process, especially for sparse-reward tasks where the reward signal is insufficient to drive exploration in the early stages of learning. We evaluate the approach on a suite of vision-based robot control tasks with natural video backgrounds, and show that the proposed prioritized information objective outperforms state-of-the-art model-based RL approaches, with higher sample efficiency and episodic returns. https://sites.google.com/view/information-empowerment
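
For intuition on the empowerment-style term, the sketch below shows a standard variational lower bound on the mutual information between actions and the next latent state, estimated with a learned inverse model; the module names and the exact form are illustrative assumptions rather than the paper's loss.

```python
# Illustrative "empowerment-style" bonus: a variational lower bound on
# I(a_t ; z_{t+1} | z_t) using a learned inverse model q(a_t | z_t, z_{t+1}).
import torch
import torch.nn as nn


class InverseModel(nn.Module):
    def __init__(self, z_dim=64, a_dim=6):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, a_dim))
        self.log_std = nn.Parameter(torch.zeros(a_dim))

    def log_prob(self, z_t, z_next, a_t):
        mean = self.net(torch.cat([z_t, z_next], dim=-1))
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        return dist.log_prob(a_t).sum(-1)


def empowerment_bonus(inverse_model, z_t, z_next, a_t, policy_log_prob):
    """Lower bound on the mutual information: E[log q(a|z, z') - log pi(a|z)].
    Higher values mean the latent retains more action-relevant information."""
    return (inverse_model.log_prob(z_t, z_next, a_t) - policy_log_prob).mean()


# Usage with dummy latents/actions and a uniform (zero log-prob) policy term.
inv = InverseModel()
bonus = empowerment_bonus(inv, torch.randn(32, 64), torch.randn(32, 64),
                          torch.randn(32, 6), policy_log_prob=torch.zeros(32))
```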

* Published in International Conference on Learning Representations (ICLR 2022) 

Auditing Robot Learning for Safety and Compliance during Deployment

Oct 12, 2021
Homanga Bharadhwaj

Robots of the future are going to exhibit increasingly human-like and super-human intelligence in a myriad of different tasks. They are also likely going to fail and be non-compliant with human preferences in increasingly subtle ways. Towards the goal of achieving autonomous robots, the robot learning community has made rapid strides in applying machine learning techniques to train robots through data and interaction. This makes it both pertinent and urgent to study how best to audit these algorithms for checking their compatibility with humans. In this paper, we draw inspiration from the AI Safety and Alignment communities and make the case that we need to urgently consider ways in which we can best audit our robot learning algorithms to check for failure modes, and ensure that when operating autonomously, they are indeed behaving in ways that the human algorithm designers intend them to. We believe that this is a challenging problem that will require efforts from the entire robot learning community, and do not attempt to provide a concrete framework for auditing. Instead, we outline high-level guidance and a possible approach towards formulating this framework, which we hope will serve as a useful starting point for thinking about auditing in the context of robot learning.

* Blue Sky paper at the 5th Conference on Robot Learning (CoRL 2021) 

Auditing AI models for Verified Deployment under Semantic Specifications

Sep 25, 2021
Homanga Bharadhwaj, De-An Huang, Chaowei Xiao, Anima Anandkumar, Animesh Garg

Auditing trained deep learning (DL) models prior to deployment is vital in preventing unintended consequences. One of the biggest challenges in auditing is understanding how to obtain human-interpretable specifications that are directly useful to the end-user. We address this challenge through a sequence of semantically aligned unit tests, where each unit test verifies whether a predefined specification (e.g., accuracy over 95%) is satisfied with respect to controlled and semantically aligned variations in the input space (e.g., in face recognition, the angle relative to the camera). We perform these unit tests by directly verifying the semantically aligned variations in an interpretable latent space of a generative model. Our framework, AuditAI, bridges the gap between interpretable formal verification and scalability. With evaluations on four different datasets, covering images of towers, chest X-rays, human faces, and ImageNet classes, we show how AuditAI allows us to obtain controlled variations for verification and certified training while addressing the limitations of verifying using only pixel-space perturbations. A blog post accompanying the paper is available at https://developer.nvidia.com/blog/nvidia-research-auditing-ai-models-for-verified-deployment-under-semantic-specifications
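
A minimal sketch of the semantically aligned unit-test pattern: sweep one interpretable latent factor of a generative model, decode, and check that the audited classifier still meets a specification such as a minimum accuracy. The generator and classifier interfaces here are placeholders, not the AuditAI API.

```python
# Sketch of a latent-space unit test: vary one semantic factor, decode, and
# verify a specification on the model under audit.
import torch


def unit_test(generator, classifier, z_base, factor_idx, values,
              true_label, min_accuracy=0.95):
    """Returns True iff the spec holds across the controlled variation."""
    correct = 0.0
    for v in values:
        z = z_base.clone()
        z[:, factor_idx] = v                     # vary one semantic factor only
        x = generator(z)                         # decode to input space
        pred = classifier(x).argmax(dim=-1)
        correct += (pred == true_label).float().mean().item()
    return correct / len(values) >= min_accuracy


# Usage sketch with dummy stand-ins for the generator and the audited model.
g = lambda z: z.repeat(1, 4)                     # pretend "decoder"
c = lambda x: torch.randn(x.shape[0], 10)        # pretend classifier logits
passed = unit_test(g, c, torch.zeros(1, 8), factor_idx=0,
                   values=torch.linspace(-2, 2, 9), true_label=3)
```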

* Preprint; under review 

Learning by Watching: Physical Imitation of Manipulation Skills from Human Videos

Jan 18, 2021
Haoyu Xiong, Quanzhou Li, Yun-Chun Chen, Homanga Bharadhwaj, Samarth Sinha, Animesh Garg


We present an approach for physical imitation from human videos for robot manipulation tasks. The key idea of our method lies in explicitly exploiting the kinematics and motion information embedded in the video to learn structured representations that endow the robot with the ability to imagine how to perform manipulation tasks in its own context. To achieve this, we design a perception module that learns to translate human videos to the robot domain, followed by unsupervised keypoint detection. The resulting keypoint-based representations provide semantically meaningful information that can be directly used for reward computation and policy learning. We evaluate the effectiveness of our approach on five robot manipulation tasks, including reaching, pushing, sliding, coffee making, and drawer closing. Detailed experimental evaluations demonstrate that our method performs favorably against previous approaches.
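
As a small illustration of how keypoint-based representations can drive reward computation, the sketch below scores the robot's current keypoints against the (domain-translated) demonstration keypoints at the same step; the keypoint detector and video translator are assumed to exist and are not shown.

```python
# Keypoint-matching reward sketch: closer keypoint configurations -> higher reward.
import numpy as np


def keypoint_reward(robot_keypoints, demo_keypoints, scale=1.0):
    """Both inputs: (K, 2) arrays of image-space keypoints. Reward is higher
    when the robot's keypoint configuration matches the demonstration's."""
    dist = np.linalg.norm(robot_keypoints - demo_keypoints, axis=-1).mean()
    return -scale * dist


# Usage with dummy keypoints for a 6-keypoint representation.
r = keypoint_reward(np.random.rand(6, 2), np.random.rand(6, 2))
```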

* Project Website: https://www.pair.toronto.edu/lbw-kp/ 