Jesse Thomason

University of Southern California

Retrospectives on the Embodied AI Workshop

Oct 17, 2022
Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, Sonia Raychaudhuri, Mike Roberts, Silvio Savarese, Manolis Savva, Mohit Shridhar, Niko Sünderhauf, Andrew Szot, Ben Talbot, Joshua B. Tenenbaum, Jesse Thomason, Alexander Toshev, Joanne Truong, Luca Weihs, Jiajun Wu

We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.

Iterative Vision-and-Language Navigation

Oct 06, 2022
Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, Jesse Thomason

We present Iterative Vision-and-Language Navigation (IVLN), a paradigm for evaluating language-guided agents navigating in a persistent environment over time. Existing Vision-and-Language Navigation (VLN) benchmarks erase the agent's memory at the beginning of every episode, testing the ability to perform cold-start navigation with no prior information. However, deployed robots occupy the same environment for long periods of time. The IVLN paradigm addresses this disparity by training and evaluating VLN agents that maintain memory across tours of scenes that consist of up to 100 ordered instruction-following Room-to-Room (R2R) episodes, each defined by an individual language instruction and a target path. We present discrete and continuous Iterative Room-to-Room (IR2R) benchmarks comprising about 400 tours each in 80 indoor scenes. We find that extending the implicit memory of high-performing transformer VLN agents is not sufficient for IVLN, but agents that build maps can benefit from environment persistence, motivating a renewed focus on map-building agents in VLN.

ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

Sep 22, 2022
Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, Animesh Garg

Task planning can require defining extensive domain knowledge about the world in which a robot needs to act. To reduce that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even to generate action sequences directly, given an instruction in natural language with no additional domain information. However, such methods either require enumerating all possible next steps for scoring, or generate free-form text that may contain actions not possible on a given robot in its current context. We present a programmatic LLM prompt structure that enables plan generation that is functional across situated environments, robot capabilities, and tasks. Our key insight is to prompt the LLM with program-like specifications of the available actions and objects in an environment, as well as with example programs that can be executed. We make concrete recommendations about prompt structure and generation constraints through ablation experiments, demonstrate state-of-the-art success rates in VirtualHome household tasks, and deploy our method on a physical robot arm for tabletop tasks. Project website: progprompt.github.io
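To make the idea of a program-like prompt concrete, here is a minimal sketch of how such a prompt could be assembled as plain text. It is not the authors' implementation: the function `build_program_prompt`, the `actions` module name, and the action and object names (`find`, `grab`, `putin`, `apple`, `fridge`) are all hypothetical placeholders chosen for illustration.

```python
from typing import Dict, List


def build_program_prompt(
    actions: List[str],
    objects: List[str],
    examples: Dict[str, List[str]],
    task: str,
) -> str:
    """Assemble a Python-like prompt for an LLM (illustrative sketch only).

    The prompt lists the available actions as an import line, the available
    objects as a list literal, one function per example plan, and finally an
    open function header that the LLM is expected to complete.
    """
    lines = [f"from actions import {', '.join(actions)}", ""]
    lines.append(f"objects = {objects!r}")
    lines.append("")
    for name, steps in examples.items():
        lines.append(f"def {name}():")
        lines.extend(f"    {step}" for step in steps)
        lines.append("")
    # The prompt ends with the header of the new task to be completed.
    lines.append(f"def {task}():")
    return "\n".join(lines)


# Hypothetical environment description and one example plan.
prompt = build_program_prompt(
    actions=["find", "grab", "putin"],
    objects=["apple", "fridge"],
    examples={
        "put_apple_in_fridge": [
            "find('apple')",
            "grab('apple')",
            "find('fridge')",
            "putin('apple', 'fridge')",
        ]
    },
    task="throw_away_apple",
)
print(prompt)
```

Because the actions and objects appear explicitly in the prompt, a generated plan can be checked against them before execution, which is one way to catch steps the robot cannot perform.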

VAuLT: Augmenting the Vision-and-Language Transformer with the Propagation of Deep Language Representations

Aug 18, 2022
Georgios Chochlakis, Tejas Srinivasan, Jesse Thomason, Shrikanth Narayanan

We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT is an extension of the popular Vision-and-Language Transformer (ViLT), and improves performance on vision-and-language tasks that involve more complex text inputs than image captions while having minimal impact on training and inference efficiency. Importantly, ViLT enables efficient training and inference in vision-and-language tasks by using a shallow image encoder. However, it is pretrained on captioning and similar datasets, where the language input is simple, literal, and descriptive, and therefore lacks linguistic diversity. When working with multimedia data in the wild, such as multimodal social media data (in our work, Twitter), there is a notable shift away from captioning language, as well as a greater diversity of tasks, and we find evidence that the language capacity of ViLT is insufficient in this setting. The key insight of VAuLT is to propagate the output representations of a large language model like BERT to the language input of ViLT. We show that this strategy significantly improves over ViLT on vision-and-language tasks involving richer language inputs and affective constructs, such as TWITTER-2015, TWITTER-2017, MVSA-Single and MVSA-Multiple, but lags behind on pure reasoning tasks such as the Bloomberg Twitter Text-Image Relationship dataset. We have released the code for all our experiments at https://github.com/gchochla/VAuLT.

* 10 pages, 1 figure

Curriculum Learning for Data-Efficient Vision-Language Alignment

Jul 29, 2022
Tejas Srinivasan, Xiang Ren, Jesse Thomason

Aligning image and text encoders from scratch using contrastive learning requires large amounts of paired image-text data. We alleviate this need by aligning individually pre-trained language and vision representation models using a much smaller amount of paired data, augmented with a curriculum learning algorithm to learn fine-grained vision-language alignments. TOnICS (Training with Ontology-Informed Contrastive Sampling) initially samples minibatches whose image-text pairs contain a wide variety of objects to learn object-level alignment, and progressively samples minibatches where all image-text pairs contain the same object to learn finer-grained contextual alignment. Aligning pre-trained BERT and VinVL models to each other using TOnICS outperforms CLIP on downstream zero-shot image retrieval while using less than 1% as much training data.
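The curriculum described above, moving from minibatches with varied objects to minibatches that share one object, can be sketched as a sampler. This is an illustrative sketch, not the TOnICS algorithm itself: the function `sample_minibatch`, the `same_object_prob` knob, and the toy data are assumptions made here, and how the actual method schedules the transition is not stated in the abstract.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (image_id, caption); both are toy placeholders


def sample_minibatch(
    pairs_by_object: Dict[str, List[Pair]],
    batch_size: int,
    same_object_prob: float,
    rng: random.Random,
) -> List[Pair]:
    """Draw one minibatch for contrastive training (hypothetical sampler).

    With probability `same_object_prob`, every pair in the batch contains the
    same object (harder, finer-grained negatives). Otherwise each pair comes
    from a different object (easier, object-level negatives).
    """
    if rng.random() < same_object_prob:
        # Only objects with enough pairs can fill a whole batch.
        candidates = [o for o, p in pairs_by_object.items() if len(p) >= batch_size]
        obj = rng.choice(candidates)
        return rng.sample(pairs_by_object[obj], batch_size)
    # Pick distinct objects, then one pair from each.
    objects = rng.sample(sorted(pairs_by_object), batch_size)
    return [rng.choice(pairs_by_object[o]) for o in objects]


# Toy data: four objects with four image-caption pairs each.
pairs_by_object = {
    obj: [(f"{obj}_{i}.jpg", f"a photo with a {obj}") for i in range(4)]
    for obj in ["dog", "cat", "car", "cup"]
}

rng = random.Random(0)
early_batch = sample_minibatch(pairs_by_object, 3, same_object_prob=0.0, rng=rng)
late_batch = sample_minibatch(pairs_by_object, 3, same_object_prob=1.0, rng=rng)
print(early_batch)
print(late_batch)
```

Raising `same_object_prob` over the course of training would shift the sampler from the diverse batches to the same-object batches.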

Interactive Learning from Natural Language and Demonstrations using Signal Temporal Logic

Jul 01, 2022
Sara Mohammadinejad, Jesse Thomason, Jyotirmoy V. Deshmukh

Natural language is an intuitive way for humans to communicate tasks to a robot. While natural language (NL) is ambiguous, real-world tasks and their safety requirements need to be communicated unambiguously. Signal Temporal Logic (STL) is a formal logic that can serve as a versatile, expressive, and unambiguous formal language to describe robotic tasks. On one hand, existing work in using STL for the robotics domain typically requires end-users to express task specifications in STL, a challenge for non-expert users. On the other, translating from NL to STL specifications is currently restricted to specific fragments. In this work, we propose DIALOGUESTL, an interactive approach for learning correct and concise STL formulas from (often) ambiguous NL descriptions. We use a combination of semantic parsing, pre-trained transformer-based language models, and user-in-the-loop clarifications aided by a small number of user demonstrations to predict the best STL formula to encode NL task descriptions. An advantage of mapping NL to STL is that there has been considerable recent work on the use of reinforcement learning (RL) to identify control policies for robots. We show we can use Deep Q-Learning techniques to learn optimal policies from the learned STL specifications. We demonstrate that DIALOGUESTL is efficient, scalable, and robust, and has high accuracy in predicting the correct STL formula with a small number of demonstrations and a few interactions with an oracle user.
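For readers unfamiliar with STL, the sketch below shows why it is unambiguous: a formula evaluates to a numeric robustness value on a signal, positive when satisfied and negative when violated. It uses the standard quantitative semantics for "eventually" and "always" over a discrete-time signal, and is not part of DIALOGUESTL. The function names and the example signal are assumptions made for illustration.

```python
from typing import Sequence


def eventually(signal: Sequence[float], threshold: float, start: int, end: int) -> float:
    """Robustness of F_[start,end](x > threshold) at time 0.

    The formula holds if the signal exceeds the threshold at some step in the
    window, so the robustness is the best margin over the window.
    """
    return max(x - threshold for x in signal[start:end + 1])


def always(signal: Sequence[float], threshold: float, start: int, end: int) -> float:
    """Robustness of G_[start,end](x > threshold) at time 0.

    The formula holds only if the signal exceeds the threshold at every step
    in the window, so the robustness is the worst margin over the window.
    """
    return min(x - threshold for x in signal[start:end + 1])


# Hypothetical gripper heights (metres) sampled at four time steps.
heights = [0.2, 0.5, 1.1, 0.9]

# "Eventually the gripper rises above 1.0": satisfied (positive robustness).
print(eventually(heights, 1.0, 0, 3) > 0)
# "The gripper always stays above 1.0": violated (negative robustness).
print(always(heights, 1.0, 0, 3) > 0)
```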

CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks

Jun 18, 2022
Tejas Srinivasan, Ting-Yun Chang, Leticia Leonor Pinto Alva, Georgios Chochlakis, Mohammad Rostami, Jesse Thomason

Current state-of-the-art vision-and-language models are evaluated on tasks either individually or in a multi-task setting, overlooking the challenges of continually learning (CL) tasks as they arrive. Existing CL benchmarks have facilitated research on task adaptation and mitigating "catastrophic forgetting", but are limited to vision-only and language-only tasks. We present CLiMB, a benchmark to study the challenge of learning multimodal tasks in a CL setting, and to systematically evaluate how upstream continual learning can rapidly generalize to new multimodal and unimodal tasks. CLiMB includes implementations of several CL algorithms and a modified Vision-Language Transformer (ViLT) model that can be deployed on both multimodal and unimodal tasks. We find that common CL methods can help mitigate forgetting during multimodal task learning, but do not enable cross-task knowledge transfer. We envision that CLiMB will facilitate research on a new class of CL algorithms for this challenging multimodal setting.

Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

Mar 29, 2022
Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, Xin Eric Wang

A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic towards this goal, and receives increasing attention from natural language processing, computer vision, robotics, and machine learning communities. In this paper, we review contemporary studies in the emerging field of VLN, covering tasks, evaluation metrics, methods, etc. Through structured analysis of current progress and challenges, we highlight the limitations of current VLN and opportunities for future work. This paper serves as a thorough reference for the VLN research community.

* 19 pages. Accepted to ACL 2022

LUMINOUS: Indoor Scene Generation for Embodied AI Challenges

Nov 10, 2021
Yizhou Zhao, Kaixiang Lin, Zhiwei Jia, Qiaozi Gao, Govind Thattai, Jesse Thomason, Gaurav S. Sukhatme

Learning-based methods for training embodied agents typically require a large number of high-quality scenes that contain realistic layouts and support meaningful interactions. However, current simulators for Embodied AI (EAI) challenges only provide simulated indoor scenes with a limited number of layouts. This paper presents Luminous, the first research framework that employs state-of-the-art indoor scene synthesis algorithms to generate large-scale simulated scenes for Embodied AI challenges. Further, we automatically and quantitatively evaluate the quality of generated indoor scenes via their ability to support complex household tasks. Luminous incorporates a novel scene generation algorithm (Constrained Stochastic Scene Generation (CSSG)), which achieves competitive performance with human-designed scenes. Within Luminous, the EAI task executor, task instruction generation module, and video rendering toolkit can collectively generate a massive multimodal dataset of new scenes for the training and evaluation of Embodied AI agents. Extensive experimental results demonstrate the effectiveness of the data generated by Luminous, enabling the comprehensive assessment of embodied agents on generalization and robustness.

* 2021 paper, Amazon

TEACh: Task-driven Embodied Agents that Chat

Oct 15, 2021
Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, Dilek Hakkani-Tur

Robots operating in human spaces must be able to engage in natural language interaction with people, both understanding and executing instructions, and using conversation to resolve ambiguity and recover from mistakes. To study this, we introduce TEACh, a dataset of over 3,000 human-human, interactive dialogues to complete household tasks in simulation. A Commander with access to oracle information about a task communicates in natural language with a Follower. The Follower navigates through and interacts with the environment to complete tasks varying in complexity from "Make Coffee" to "Prepare Breakfast", asking questions and getting additional information from the Commander. We propose three benchmarks using TEACh to study embodied intelligence challenges, and we evaluate initial models' abilities in dialogue understanding, language grounding, and task execution.

* 7 pages main, 28 pages total, 29 figures; Version 2 includes information on data cleaning, and its experimental results use a modified data split that has been released
