Govind Thattai

Alexa, play with robot: Introducing the First Alexa Prize SimBot Challenge on Embodied AI

Aug 09, 2023
Hangjie Shi, Leslie Ball, Govind Thattai, Desheng Zhang, Lucy Hu, Qiaozi Gao, Suhaila Shakiah, Xiaofeng Gao, Aishwarya Padmakumar, Bofei Yang, Cadence Chung, Dinakar Guthy, Gaurav Sukhatme, Karthika Arumugam, Matthew Wen, Osman Ipek, Patrick Lange, Rohan Khanna, Shreyas Pansare, Vasu Sharma, Chao Zhang, Cris Flagg, Daniel Pressel, Lavina Vaz, Luke Dai, Prasoon Goyal, Sattvik Sahai, Shaohua Liu, Yao Lu, Anna Gottardi, Shui Hu, Yang Liu, Dilek Hakkani-Tur, Kate Bland, Heather Rocker, James Jeun, Yadunandana Rao, Michael Johnston, Akshaya Iyengar, Arindam Mandal, Prem Natarajan, Reza Ghanadan

The Alexa Prize program has empowered numerous university students to explore, experiment, and showcase their talents in building conversational agents through challenges like the SocialBot Grand Challenge and the TaskBot Challenge. As conversational agents increasingly appear in multimodal and embodied contexts, it is important to explore the affordances of conversational interaction augmented with computer vision and physical embodiment. This paper describes the SimBot Challenge, a new challenge in which university teams compete to build robot assistants that complete tasks in a simulated physical environment, and provides an overview of its online and offline phases. We describe the infrastructure and support provided to the teams, including the simulated environment, Alexa Arena, and the ML toolkit supplied to accelerate their building of vision and language models. We summarize the approaches the participating teams took to overcome research challenges and extract key lessons learned. Finally, we analyze the performance of the competing SimBots during the competition.

Exploiting Generalization in Offline Reinforcement Learning via Unseen State Augmentations

Aug 07, 2023
Nirbhay Modhe, Qiaozi Gao, Ashwin Kalyan, Dhruv Batra, Govind Thattai, Gaurav Sukhatme

Offline reinforcement learning (RL) methods strike a balance between exploration and exploitation through conservative value estimation -- penalizing the values of unseen states and actions. Model-free methods penalize values at all unseen actions, while model-based methods are able to further exploit unseen states via model rollouts. However, such methods are handicapped in their ability to find unseen states far away from the available offline data due to two factors -- (a) very short rollout horizons in models due to cascading model errors, and (b) model rollouts originating solely from states observed in the offline data. We relax the second assumption and present a novel unseen state augmentation strategy that allows exploitation of unseen states where the learned model and value estimates generalize. Our strategy finds unseen states by value-informed perturbations of seen states, followed by filtering out states whose epistemic uncertainty estimates are too high (high error) or too low (too similar to seen data). We observe improved performance on several offline RL tasks and find that our augmentation strategy consistently leads to overall lower average dataset Q-value estimates, i.e., more conservative Q-value estimates than a baseline.
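As a rough illustration of the perturb-and-filter idea described above (not the paper's implementation), the sketch below perturbs seen states along the Q-value gradient and then filters the candidates by an assumed epistemic-uncertainty estimate; `uncertainty_fn` and the thresholds are hypothetical placeholders.

```python
import torch

def augment_unseen_states(seen_states, q_net, uncertainty_fn,
                          step_size=0.1, n_steps=3,
                          u_min=0.05, u_max=0.5):
    """Value-informed perturbation of seen states, then uncertainty filtering.

    Illustrative sketch: `uncertainty_fn` is assumed to return an epistemic
    uncertainty estimate (e.g., disagreement of a dynamics-model ensemble).
    The thresholds u_min/u_max are hypothetical hyperparameters.
    """
    states = seen_states.clone().requires_grad_(True)
    for _ in range(n_steps):
        # Move states in the direction that increases the (max over actions) Q-value.
        value = q_net(states).max(dim=-1).values.sum()
        grad, = torch.autograd.grad(value, states)
        states = (states + step_size * grad).detach().requires_grad_(True)

    with torch.no_grad():
        u = uncertainty_fn(states)
        # Keep states that are neither too uncertain (high model error)
        # nor too certain (essentially duplicates of the seen data).
        keep = (u > u_min) & (u < u_max)
    return states.detach()[keep]
```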

LEMMA: Learning Language-Conditioned Multi-Robot Manipulation

Aug 02, 2023
Ran Gong, Xiaofeng Gao, Qiaozi Gao, Suhaila Shakiah, Govind Thattai, Gaurav S. Sukhatme

Complex manipulation tasks often require robots with complementary capabilities to collaborate. We introduce a benchmark for LanguagE-Conditioned Multi-robot MAnipulation (LEMMA) focused on task allocation and long-horizon object manipulation based on human language instructions in a tabletop setting. LEMMA features 8 types of procedurally generated tasks with varying degrees of complexity, some of which require the robots to use tools and pass tools to each other. For each task, we provide 800 expert demonstrations and human instructions for training and evaluation. LEMMA poses greater challenges than existing benchmarks, as it requires the system to identify each manipulator's limitations and assign sub-tasks accordingly while also handling strong temporal dependencies within each task. To address these challenges, we propose a modular hierarchical planning approach as a baseline. Our results highlight the potential of LEMMA for developing future language-conditioned multi-robot systems.

* 8 pages, 3 figures 
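To make the baseline idea concrete, here is a hypothetical skeleton of a modular hierarchical planner of the kind the abstract describes: an instruction parser produces sub-tasks, a capability check allocates each sub-task to a manipulator, and execution respects temporal dependencies. All class and method names are illustrative assumptions, not the LEMMA codebase.

```python
from dataclasses import dataclass

@dataclass
class SubTask:
    description: str   # e.g. "pass the wrench to the second arm"
    depends_on: list   # indices of sub-tasks that must finish first

class HierarchicalPlanner:
    """Illustrative two-level planner: language -> sub-tasks -> per-robot skills."""

    def __init__(self, instruction_parser, robots):
        self.parse = instruction_parser   # assumed: str -> list[SubTask]
        self.robots = robots              # assumed: each exposes can_do() and execute()

    def allocate(self, subtask):
        # Assign the sub-task to the first manipulator whose capabilities cover it.
        for robot in self.robots:
            if robot.can_do(subtask.description):
                return robot
        raise ValueError(f"No robot can perform: {subtask.description}")

    def run(self, instruction):
        subtasks = self.parse(instruction)
        done = set()
        # Respect temporal dependencies (assumed acyclic): execute a sub-task
        # only after all of its prerequisites have completed.
        while len(done) < len(subtasks):
            for i, st in enumerate(subtasks):
                if i not in done and all(d in done for d in st.depends_on):
                    self.allocate(st).execute(st.description)
                    done.add(i)
```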

Neural Architecture Search for Parameter-Efficient Fine-tuning of Large Pre-trained Language Models

May 26, 2023
Neal Lawton, Anoop Kumar, Govind Thattai, Aram Galstyan, Greg Ver Steeg

Parameter-efficient tuning (PET) methods fit pre-trained language models (PLMs) to downstream tasks by either computing a small compressed update for a subset of model parameters, or appending and fine-tuning a small number of new model parameters to the pre-trained network. Hand-designed PET architectures from the literature perform well in practice, but have the potential to be improved via automated neural architecture search (NAS). We propose an efficient NAS method for learning PET architectures via structured and unstructured pruning. We present experiments on GLUE demonstrating the effectiveness of our algorithm and discuss how PET architectural design choices affect performance in practice.

* 8 pages, 3 figures, ACL 2023 
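A minimal sketch of the general search idea, assuming a gated adapter parameterization: each candidate PET module carries a learnable gate, and modules whose gates shrink toward zero are pruned away during the search. The gating scheme and threshold are assumptions for illustration, not the paper's exact algorithm.

```python
import torch
import torch.nn as nn

class GatedAdapter(nn.Module):
    """Bottleneck adapter with a learnable gate; adapters whose gates collapse
    toward zero can be pruned during search (illustrative parameterization)."""

    def __init__(self, hidden_dim, bottleneck_dim=16):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.gate = nn.Parameter(torch.ones(1))  # architecture variable

    def forward(self, x):
        # Residual adapter: the gate scales the adapter's contribution.
        return x + self.gate * self.up(torch.relu(self.down(x)))

def prune_adapters(adapters, threshold=1e-2):
    """Structured pruning step: drop whole adapters whose gate magnitude is tiny."""
    return [a for a in adapters if a.gate.abs().item() > threshold]
```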

Alexa Arena: A User-Centric Interactive Platform for Embodied AI

Mar 02, 2023
Qiaozi Gao, Govind Thattai, Xiaofeng Gao, Suhaila Shakiah, Shreyas Pansare, Vasu Sharma, Gaurav Sukhatme, Hangjie Shi, Bofei Yang, Desheng Zheng, Lucy Hu, Karthika Arumugam, Shui Hu, Matthew Wen, Dinakar Guthy, Cadence Chung, Rohan Khanna, Osman Ipek, Leslie Ball, Kate Bland, Heather Rocker, Yadunandana Rao, Michael Johnston, Reza Ghanadan, Arindam Mandal, Dilek Hakkani Tur, Prem Natarajan

We introduce Alexa Arena, a user-centric simulation platform for Embodied AI (EAI) research. Alexa Arena provides a variety of multi-room layouts and interactable objects for the creation of human-robot interaction (HRI) missions. With user-friendly graphics and control mechanisms, Alexa Arena supports the development of gamified robotic tasks that are readily accessible to general human users, thus opening a new avenue for high-efficiency HRI data collection and EAI system evaluation. Along with the platform, we introduce a dialog-enabled instruction-following benchmark and provide baseline results for it. We make Alexa Arena publicly available to facilitate research in building generalizable and assistive embodied agents.

Language-Informed Transfer Learning for Embodied Household Activities

Jan 12, 2023
Yuqian Jiang, Qiaozi Gao, Govind Thattai, Gaurav Sukhatme

For service robots to become general-purpose in everyday household environments, they need not only a large library of primitive skills but also the ability to quickly learn novel tasks specified by users. Fine-tuning neural networks on a variety of downstream tasks has been successful in many vision and language domains, but research on transfer learning between diverse long-horizon tasks is still limited. We propose that, compared to learning a new household activity from scratch with reinforcement learning, home robots can benefit from transferring the value and policy networks trained for similar tasks. We evaluate this idea in the BEHAVIOR simulation benchmark, which includes a large number of household activities and a set of action primitives. For easy mapping between the state spaces of different tasks, we provide a text-based representation and leverage language models to produce a common embedding space. The results show that the selection of similar source activities can be informed by the semantic similarity of state and goal descriptions with the target task. We further analyze the results and discuss ways to overcome the problem of catastrophic forgetting.
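The source-task selection step could look roughly like the following sketch, assuming a generic pretrained sentence encoder; the `embed` function and the string format of the task descriptions are placeholders rather than the paper's implementation.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pick_source_task(target_desc, source_tasks, embed):
    """Rank candidate source activities by the semantic similarity of their
    state/goal descriptions to the target task.

    `embed` is assumed to map a text description to a fixed-size vector
    (e.g., via a pretrained sentence encoder); `source_tasks` maps a task
    name to its description string.
    """
    target_vec = embed(target_desc)
    scores = {name: cosine(embed(desc), target_vec)
              for name, desc in source_tasks.items()}
    # Transfer the value and policy networks from the most similar source activity.
    return max(scores, key=scores.get)
```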

GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods

Jan 05, 2023
Da Yin, Feng Gao, Govind Thattai, Michael Johnston, Kai-Wei Chang

A key goal for the advancement of AI is to develop technologies that serve the needs not just of one group but of all communities regardless of their geographical region. In fact, a significant proportion of knowledge is locally shared by people from certain regions but may not apply equally in other regions because of cultural differences. If a model is unaware of regional characteristics, it may lead to performance disparity across regions and result in bias against underrepresented groups. We propose GIVL, a Geographically Inclusive Vision-and-Language Pre-trained model. There are two attributes of geo-diverse visual concepts that can help to learn geo-diverse knowledge: 1) concepts under similar categories have unique knowledge and visual characteristics, and 2) concepts with similar visual features may fall in completely different categories. Motivated by these attributes, we design two new pre-training objectives, Image Knowledge Matching (IKM) and Image Edit Checking (IEC), to pre-train GIVL. Compared with similar-size models pre-trained with a similar scale of data, GIVL achieves state-of-the-art (SOTA) and more balanced performance on geo-diverse V&L tasks.
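As a simplified illustration of an IKM-style objective (the paper's actual formulation may use finer-grained labels, and IEC is omitted here), a classifier over fused image/knowledge features can be trained to predict whether the paired knowledge matches the image; the fusion encoder and labels below are assumptions.

```python
import torch
import torch.nn as nn

class ImageKnowledgeMatcher(nn.Module):
    """Illustrative IKM-style head: given fused vision-language features for an
    (image, knowledge text) pair, predict whether the knowledge matches the image."""

    def __init__(self, fusion_encoder, hidden_dim):
        super().__init__()
        self.encoder = fusion_encoder          # assumed: (images, texts) -> [B, hidden_dim]
        self.classifier = nn.Linear(hidden_dim, 2)

    def forward(self, images, knowledge_texts):
        fused = self.encoder(images, knowledge_texts)
        return self.classifier(fused)

def ikm_loss(model, images, texts, labels):
    # labels: 1 for matched image/knowledge pairs, 0 for mismatched (negative) pairs.
    logits = model(images, texts)
    return nn.functional.cross_entropy(logits, labels)
```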

OpenD: A Benchmark for Language-Driven Door and Drawer Opening

Dec 10, 2022
Yizhou Zhao, Qiaozi Gao, Liang Qiu, Govind Thattai, Gaurav S. Sukhatme

We introduce OPEND, a benchmark for learning how to use a hand to open cabinet doors or drawers in a photo-realistic and physics-reliable simulation environment driven by language instructions. To solve the task, we propose a multi-step planner composed of a deep neural network and rule-based controllers. The network is used to capture spatial relationships from images and understand semantic meaning from language instructions, while the controllers efficiently execute the plan based on this spatial and semantic understanding. We evaluate our system by measuring its zero-shot performance on the test dataset. Experimental results demonstrate the effectiveness of decision planning by our multi-step planner for different hands, while suggesting that there is significant room for developing better models to address the challenges posed by language understanding, spatial reasoning, and long-term manipulation. We will release OPEND and host challenges to promote future research in this area.
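A hypothetical sketch of such a neural-plus-rule-based pipeline is shown below; the grounding network and the controller interface are illustrative assumptions, not the OPEND baseline.

```python
def open_articulated_object(instruction, rgb_image, grounding_net, controllers):
    """Illustrative multi-step plan: a neural module grounds the instruction to a
    target handle pose, then rule-based controllers execute fixed motion phases.

    `grounding_net` and the controller interface are assumptions for illustration.
    """
    # 1. Neural step: spatial/semantic grounding from the image and the instruction.
    #    Assumed to return a dict with a handle pose, an opening axis, and an extent.
    target = grounding_net(rgb_image, instruction)

    # 2. Rule-based steps: a fixed controller sequence executes the plan.
    controllers["reach"](target["handle_pose"])
    controllers["grasp"](target["handle_pose"])
    controllers["pull"](target["opening_axis"], distance=target.get("extent", 0.3))
    controllers["release"]()
```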

TPA-Net: Generate A Dataset for Text to Physics-based Animation

Nov 25, 2022
Yuxing Qiu, Feng Gao, Minchen Li, Govind Thattai, Yin Yang, Chenfanfu Jiang

Recent breakthroughs in joint Vision-Language (V&L) research have achieved remarkable results on various text-driven tasks. High-quality text-to-video (T2V) generation, a task long considered nearly impossible, has been proven feasible with reasonably good results in recent works. However, the resulting videos often contain undesired artifacts, largely because the systems are purely data-driven and agnostic to physical laws. To tackle this issue and further push T2V towards high-level physical realism, we present an autonomous data generation technique and a dataset that aim to narrow this gap with a large amount of multi-modal, 3D Text-to-Video/Simulation (T2V/S) data. The dataset provides high-resolution 3D physical simulations for both solids and fluids, along with textual descriptions of the physical phenomena. We take advantage of state-of-the-art physical simulation methods, (i) Incremental Potential Contact (IPC) and (ii) the Material Point Method (MPM), to simulate diverse scenarios, including elastic deformations, material fractures, collisions, turbulence, etc. Additionally, high-quality, multi-view rendering videos are supplied for the benefit of the T2V, Neural Radiance Fields (NeRF), and other communities. This work is the first step towards fully automated Text-to-Video/Simulation (T2V/S). Live examples and subsequent work are at https://sites.google.com/view/tpa-net.

Towards Reasoning-Aware Explainable VQA

Nov 09, 2022
Rakesh Vaideeswaran, Feng Gao, Abhinav Mathur, Govind Thattai

The domain of joint vision-language understanding, especially in the context of reasoning in Visual Question Answering (VQA) models, has garnered significant attention in the recent past. While most existing VQA models focus on improving accuracy, the way a model arrives at an answer is oftentimes a black box. As a step towards making the VQA task more explainable and interpretable, our method builds upon a SOTA VQA framework by augmenting it with an end-to-end explanation generation module. In this paper, we investigate two network architectures, a Long Short-Term Memory (LSTM) decoder and a Transformer decoder, as the explanation generator. Our method generates human-readable textual explanations while maintaining SOTA VQA accuracy on the GQA-REX (77.49%) and VQA-E (71.48%) datasets. Approximately 65.16% of the generated explanations are approved by humans as valid, and roughly 60.5% of the generated explanations are valid and lead to the correct answers.
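As a rough sketch of how an explanation decoder can be attached to a VQA backbone (module names and sizes are assumptions, not the paper's architecture; the causal target mask is omitted for brevity):

```python
import torch
import torch.nn as nn

class ExplainableVQA(nn.Module):
    """Illustrative wrapper: a VQA backbone predicts the answer, and a small
    Transformer decoder generates a textual explanation conditioned on the
    fused question-image features (hidden_dim assumed divisible by nhead)."""

    def __init__(self, vqa_backbone, hidden_dim, vocab_size, num_answers):
        super().__init__()
        self.backbone = vqa_backbone          # assumed: (image, question) -> [B, T, hidden_dim]
        self.answer_head = nn.Linear(hidden_dim, num_answers)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.word_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image, question, explanation_tokens):
        features = self.backbone(image, question)               # [B, T, H]
        answer_logits = self.answer_head(features.mean(dim=1))  # pooled features -> answer
        tgt = self.embed(explanation_tokens)                     # [B, L, H]
        decoded = self.decoder(tgt=tgt, memory=features)         # cross-attend to VQA features
        explanation_logits = self.word_head(decoded)             # [B, L, vocab]
        return answer_logits, explanation_logits
```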
