Stephen James

Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training

Sep 25, 2023
Jiangliu Wang, Jianbo Jiao, Yibing Song, Stephen James, Zhan Tong, Chongjian Ge, Pieter Abbeel, Yun-Hui Liu

This work aims to improve unsupervised audio-visual pre-training. Inspired by the efficacy of data augmentation in visual contrastive learning, we propose a novel speed co-augmentation method that randomly changes the playback speeds of both audio and video data. Despite its simplicity, the speed co-augmentation method possesses two compelling attributes: (1) it increases the diversity of audio-visual pairs and doubles the number of negative pairs, resulting in a significant enhancement of the learned representations, and (2) it relaxes the strict correspondence between audio-visual pairs into a partial relationship between the augmented pairs, which we model with our proposed SoftInfoNCE loss to further boost performance. Experimental results show that the proposed method significantly improves the learned representations when compared to vanilla audio-visual contrastive learning.
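
To make the partial-correspondence idea concrete, here is a minimal PyTorch sketch of a soft-target InfoNCE-style loss over speed-augmented audio-visual pairs. The function name, the soft-target construction, and the temperature value are illustrative assumptions, not the paper's exact SoftInfoNCE formulation.

```python
import torch
import torch.nn.functional as F

def soft_info_nce(audio_emb, video_emb, soft_targets, temperature=0.07):
    """InfoNCE-style loss with soft targets over speed-augmented pairs.

    audio_emb, video_emb: (N, D) embeddings, where N counts both original and
        speed-augmented clips (hence the doubled pool of negatives).
    soft_targets: (N, N) rows summing to 1; off-diagonal mass encodes the
        partial relationship between differently-sped audio and video clips.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = audio_emb @ video_emb.t() / temperature       # (N, N) similarity matrix
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()  # soft cross-entropy

# Hypothetical example: each clip keeps 0.8 target mass on its own partner and
# spreads 0.2 onto its speed-augmented sibling (a made-up weighting scheme).
N, D = 4, 128
targets = 0.8 * torch.eye(N) + 0.2 * torch.eye(N).roll(shifts=N // 2, dims=1)
loss = soft_info_nce(torch.randn(N, D), torch.randn(N, D), targets)
```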

* Published at the CVPR 2023 Sight and Sound workshop 

Language-Conditioned Path Planning

Aug 31, 2023
Amber Xie, Youngwoon Lee, Pieter Abbeel, Stephen James

Contact is at the core of robotic manipulation. At times, it is desired (e.g., manipulation and grasping), and at times, it is harmful (e.g., when avoiding obstacles). However, traditional path planning algorithms focus solely on collision-free paths, limiting their applicability in contact-rich tasks. To address this limitation, we propose the domain of Language-Conditioned Path Planning, where contact-awareness is incorporated into the path planning problem. As a first step in this domain, we propose Language-Conditioned Collision Functions (LACO), a novel approach that learns a collision function using only a single-view image, a language prompt, and the robot configuration. LACO predicts collisions between the robot and the environment, enabling flexible, conditional path planning without the need for manual object annotations, point cloud data, or ground-truth object meshes. In both simulation and the real world, we demonstrate that LACO can facilitate complex, nuanced path plans that allow interaction with objects that are safe to collide with, rather than prohibiting all collisions.
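
As a rough illustration of the interface such a collision function exposes, the following PyTorch sketch fuses a single-view image feature, a language embedding, and a robot configuration into a collision probability. The class name, layer sizes, and concatenation-based fusion are assumptions for illustration rather than the LACO architecture.

```python
import torch
import torch.nn as nn

class CollisionFunction(nn.Module):
    """Language-conditioned collision predictor sketch: fuses an image feature,
    a language embedding, and a robot configuration into a collision probability.
    Layer sizes and the fusion-by-concatenation choice are illustrative."""

    def __init__(self, img_dim=512, lang_dim=512, q_dim=7, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + lang_dim + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, img_feat, lang_feat, q):
        # img_feat: (B, img_dim) single-view image feature
        # lang_feat: (B, lang_dim) embedding of the language prompt
        # q: (B, q_dim) joint configuration to be checked
        x = torch.cat([img_feat, lang_feat, q], dim=-1)
        return torch.sigmoid(self.mlp(x))  # predicted collision probability

# A path planner would query this function instead of a geometric collision
# checker, e.g. rejecting waypoints whose predicted probability exceeds 0.5.
```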

* Conference on Robot Learning, 2023 

Language Reward Modulation for Pretraining Reinforcement Learning

Aug 23, 2023
Ademi Adeniji, Amber Xie, Carmelo Sferrazza, Younggyo Seo, Stephen James, Pieter Abbeel

Using learned reward functions (LRFs) as a means to solve sparse-reward reinforcement learning (RL) tasks has yielded steady progress in task complexity over the years. In this work, we question whether today's LRFs are best suited as a direct replacement for task rewards. Instead, we propose leveraging the capabilities of LRFs as a pretraining signal for RL. Concretely, we propose $\textbf{LA}$nguage Reward $\textbf{M}$odulated $\textbf{P}$retraining (LAMP), which leverages the zero-shot capabilities of Vision-Language Models (VLMs) as a $\textit{pretraining}$ utility for RL rather than a downstream task reward. LAMP uses a frozen, pretrained VLM to scalably generate noisy, albeit shaped, exploration rewards by computing the contrastive alignment between a highly diverse collection of language instructions and the image observations of an agent in its pretraining environment. LAMP optimizes these rewards, in conjunction with standard novelty-seeking exploration rewards, via reinforcement learning to acquire a language-conditioned, pretrained policy. Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warm-start sample-efficient learning on robot manipulation tasks in RLBench.
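
The reward computation described above can be pictured with a short sketch: a frozen contrastive VLM (CLIP-style embeddings assumed to be computed elsewhere) scores the agent's image observation against a set of language instructions, and the alignment scores are aggregated into a shaped exploration reward. The softmax-weighted aggregation and temperature are assumptions, not LAMP's exact recipe.

```python
import torch
import torch.nn.functional as F

def vlm_shaped_reward(image_emb, instruction_embs, temperature=0.1):
    """Shaped exploration reward from a frozen contrastive VLM.

    image_emb: (D,) embedding of the agent's current image observation.
    instruction_embs: (K, D) embeddings of a diverse set of language instructions.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    instruction_embs = F.normalize(instruction_embs, dim=-1)
    sims = instruction_embs @ image_emb                # (K,) alignment scores
    weights = F.softmax(sims / temperature, dim=0)
    return (weights * sims).sum()                      # scalar shaped reward

# During pretraining, this reward would be combined with a novelty bonus,
# e.g. r = vlm_shaped_reward(img, instrs) + beta * novelty_bonus(obs).
```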

* Code available at https://github.com/ademiadeniji/lamp 

Multi-View Masked World Models for Visual Robotic Manipulation

Feb 05, 2023
Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jinwoo Shin, Pieter Abbeel

Visual robotic manipulation research and applications often use multiple cameras, or views, to better perceive the world. How else can we utilize the richness of multi-view data? In this paper, we investigate how to learn good representations with multi-view data and utilize them for visual robotic manipulation. Specifically, we train a multi-view masked autoencoder that reconstructs pixels of randomly masked viewpoints and then learn a world model operating on the representations from the autoencoder. We demonstrate the effectiveness of our method in a range of scenarios, including multi-view control and single-view control with auxiliary cameras for representation learning. We also show that the multi-view masked autoencoder trained with multiple randomized viewpoints enables training a policy with strong viewpoint randomization and transferring the policy to solve real-robot tasks without camera calibration or an adaptation procedure. Video demonstrations of real-world experiments and source code are available at the project website: https://sites.google.com/view/mv-mwm.
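
A minimal sketch of the masking step, under the simplifying assumption that masking happens at the level of whole viewpoints rather than image patches: a random subset of camera views is hidden, and the autoencoder's decoder would then reconstruct their pixels from the remaining views.

```python
import torch

def mask_random_viewpoints(views, mask_ratio=0.5):
    """Hide a random subset of camera viewpoints per batch element.

    views: (B, V, C, H, W) images from V cameras.
    Returns the visible views and a boolean mask marking the hidden views.
    """
    B, V = views.shape[:2]
    num_masked = max(1, int(mask_ratio * V))
    masked = torch.zeros(B, V, dtype=torch.bool)
    for b in range(B):
        idx = torch.randperm(V)[:num_masked]   # choose views to hide
        masked[b, idx] = True
    visible = views[~masked].reshape(B, V - num_masked, *views.shape[2:])
    return visible, masked
```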

* First two authors contributed equally. Project webpage: https://sites.google.com/view/mv-mwm 

StereoPose: Category-Level 6D Transparent Object Pose Estimation from Stereo Images via Back-View NOCS

Nov 03, 2022
Kai Chen, Stephen James, Congying Sui, Yun-Hui Liu, Pieter Abbeel, Qi Dou

Most existing methods for category-level pose estimation rely on object point clouds. However, for transparent objects, depth cameras are usually unable to capture meaningful data, resulting in point clouds with severe artifacts. Without a high-quality point cloud, existing methods are not applicable to challenging transparent objects. To tackle this problem, we present StereoPose, a novel stereo image framework for category-level object pose estimation, ideally suited for transparent objects. For robust estimation from pure stereo images, we develop a pipeline that decouples category-level pose estimation into object size estimation, initial pose estimation, and pose refinement. StereoPose then estimates object pose based on representations in the normalized object coordinate space (NOCS). To address the issue of image content aliasing, we further define a back-view NOCS map for the transparent object. The back-view NOCS aims to reduce the network learning ambiguity caused by content aliasing and to leverage informative cues on the back of the transparent object for more accurate pose estimation. To further improve the performance of the stereo framework, StereoPose is equipped with a parallax attention module for stereo feature fusion and an epipolar loss that improves the stereo-view consistency of network predictions. Extensive experiments on the public TOD dataset demonstrate the superiority of the proposed StereoPose framework for category-level 6D transparent object pose estimation.
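
For readers unfamiliar with NOCS-based pose recovery, the sketch below shows the standard Umeyama step that fits a similarity transform (scale, rotation, translation) to correspondences between predicted NOCS coordinates and camera-frame points. StereoPose obtains its camera-frame evidence from stereo images rather than a depth sensor and further refines the estimate, so this is a generic reference for the underlying step, not the paper's pipeline.

```python
import numpy as np

def pose_from_nocs(nocs_pts, cam_pts):
    """Fit scale s, rotation R, translation t such that cam ~ s * R @ nocs + t,
    via the Umeyama algorithm.

    nocs_pts, cam_pts: (N, 3) corresponding points.
    """
    mu_n, mu_c = nocs_pts.mean(0), cam_pts.mean(0)
    X, Y = nocs_pts - mu_n, cam_pts - mu_c
    cov = Y.T @ X / len(nocs_pts)               # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                          # keep a proper rotation (det = +1)
    R = U @ S @ Vt
    var_n = (X ** 2).sum() / len(nocs_pts)      # variance of NOCS points
    s = np.trace(np.diag(D) @ S) / var_n        # isotropic scale
    t = mu_c - s * R @ mu_n
    return s, R, t
```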

* 7 pages, 6 figures, Project homepage: https://appsrv.cse.cuhk.edu.hk/~kaichen/stereopose.html 

Sim-to-Real via Sim-to-Seg: End-to-end Off-road Autonomous Driving Without Real Data

Oct 25, 2022
John So, Amber Xie, Sunggoo Jung, Jeffrey Edlund, Rohan Thakker, Ali Agha-mohammadi, Pieter Abbeel, Stephen James

Autonomous driving is complex, requiring sophisticated 3D scene understanding, localization, mapping, and control. Rather than explicitly modelling and fusing each of these components, we instead consider an end-to-end approach via reinforcement learning (RL). However, collecting exploration driving data in the real world is impractical and dangerous. While training in simulation and deploying visual sim-to-real techniques has worked well for robot manipulation, deploying beyond controlled workspace viewpoints remains a challenge. In this paper, we address this challenge by presenting Sim2Seg, a re-imagining of RCAN that crosses the visual reality gap for off-road autonomous driving, without using any real-world data. This is done by learning to translate randomized simulation images into simulated segmentation and depth maps, subsequently enabling real-world images to also be translated. This allows us to train an end-to-end RL policy in simulation and deploy it directly in the real world. Our approach, which can be trained in 48 hours on a single GPU, performs as well as a classical perception and control stack that took thousands of engineering hours over several months to build. We hope this work motivates future end-to-end autonomous driving research.
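
The deployment-time idea can be sketched as follows: a translation network maps RGB (randomized simulation during training, real images at test time) into segmentation and depth maps, and the RL policy only ever sees that canonical representation. Module names and tensor shapes are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class Sim2SegDeployment(nn.Module):
    """Run the policy on a canonical segmentation+depth observation produced by
    a learned RGB translator, so the same policy works in sim and the real world."""

    def __init__(self, translator: nn.Module, policy: nn.Module):
        super().__init__()
        self.translator = translator   # RGB -> (segmentation logits, depth)
        self.policy = policy           # canonical observation -> action

    @torch.no_grad()
    def act(self, rgb):
        seg_logits, depth = self.translator(rgb)                 # (B, K, H, W), (B, 1, H, W)
        seg = seg_logits.argmax(dim=1, keepdim=True).float()     # (B, 1, H, W) class map
        canonical_obs = torch.cat([seg, depth], dim=1)           # (B, 2, H, W)
        return self.policy(canonical_obs)
```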

* CoRL 2022 Paper 

Real-World Robot Learning with Masked Visual Pre-training

Oct 06, 2022
Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, Trevor Darrell

In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M-parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and clearly demonstrate the benefits of scaling visual pre-training for robot learning.
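
A minimal sketch of the frozen-representation setup described above: the MAE-pre-trained ViT encoder is frozen, and only a small control head is trained on top of its features. The head architecture and the use of proprioceptive input are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FrozenEncoderPolicy(nn.Module):
    """Train only a small control head on top of frozen pre-trained visual features."""

    def __init__(self, encoder: nn.Module, feat_dim=768, proprio_dim=7, act_dim=7):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)             # keep the MAE features frozen
        self.head = nn.Sequential(
            nn.Linear(feat_dim + proprio_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, image, proprio):
        with torch.no_grad():
            feat = self.encoder(image)          # (B, feat_dim) visual features
        return self.head(torch.cat([feat, proprio], dim=-1))
```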

* CoRL 2022; Project page: https://tetexiao.com/projects/real-mvp 

Temporally Consistent Video Transformer for Long-Term Video Prediction

Oct 05, 2022
Wilson Yan, Danijar Hafner, Stephen James, Pieter Abbeel

Generating long, temporally consistent video remains an open challenge in video generation. Primarily due to computational limitations, most prior methods limit themselves to training on a small subset of frames that are then extended to generate longer videos in a sliding-window fashion. Although these techniques may produce sharp videos, they have difficulty retaining long-term temporal consistency due to their limited context length. In this work, we present the Temporally Consistent Video Transformer (TECO), a vector-quantized latent dynamics video prediction model that learns compressed representations to efficiently condition on long videos of hundreds of frames during both training and generation. We use a MaskGit prior for dynamics prediction, which enables both sharper and faster generations compared to prior work. Our experiments show that TECO outperforms SOTA baselines on a variety of video prediction benchmarks, ranging from simple mazes in DMLab and large 3D worlds in Minecraft to complex real-world videos from Kinetics-600. In addition, to better understand the capabilities of video prediction models in modeling temporal consistency, we introduce several challenging video prediction tasks consisting of agents randomly traversing 3D scenes of varying difficulty. This presents a challenging benchmark for video prediction in partially observable environments, where a model must understand what parts of the scene to re-create versus invent depending on its past observations or generations. Generated videos are available at https://wilson1yan.github.io/teco
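
To illustrate the MaskGit-style prior mentioned above, here is a simplified parallel-decoding loop over latent codes: start from fully masked tokens and, over a few steps, commit the most confident predictions while leaving the rest masked. The `prior` callable, the linear unmasking schedule, and the absence of confidence noise are simplifying assumptions relative to the actual method.

```python
import torch

@torch.no_grad()
def maskgit_decode(prior, batch_size, seq_len, mask_id, num_steps=8):
    """Iteratively decode latent codes, committing the most confident predictions first.

    prior: assumed to map (B, L) token ids (masked slots set to mask_id) to
        (B, L, V) logits, conditioned internally on past-frame representations.
    """
    codes = torch.full((batch_size, seq_len), mask_id, dtype=torch.long)
    known = torch.zeros(batch_size, seq_len, dtype=torch.bool)
    for step in range(num_steps):
        logits = prior(codes)                            # (B, L, V)
        conf, preds = logits.softmax(-1).max(-1)         # per-slot confidence, argmax id
        conf = conf.masked_fill(known, float("-inf"))    # only commit still-masked slots
        target = int((step + 1) / num_steps * seq_len)   # slots fixed by end of this step
        k = target - int(known[0].sum())
        if k > 0:
            idx = conf.topk(k, dim=-1).indices           # most confident new positions
            codes.scatter_(1, idx, preds.gather(1, idx))
            known.scatter_(1, idx, True)
    return codes
```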

* Project website: https://wilson1yan.github.io/teco 

HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator

Sep 15, 2022
Younggyo Seo, Kimin Lee, Fangchen Liu, Stephen James, Pieter Abbeel

Video prediction is an important yet challenging problem, burdened with the tasks of generating future frames and learning environment dynamics. Recently, autoregressive latent video models have proved to be a powerful video prediction tool, by separating video prediction into two sub-problems: pre-training an image generator model, followed by learning an autoregressive prediction model in the latent space of the image generator. However, successfully generating high-fidelity and high-resolution videos has yet to be seen. In this work, we investigate how to train an autoregressive latent video prediction model capable of predicting high-fidelity future frames with minimal modification to existing models, and produce high-resolution (256x256) videos. Specifically, we scale up prior models by employing a high-fidelity image generator (VQ-GAN) with a causal transformer model, and introduce additional techniques of top-k sampling and data augmentation to further improve video prediction quality. Despite its simplicity, the proposed method achieves performance competitive with state-of-the-art approaches on standard video prediction benchmarks with fewer parameters, and enables high-resolution video prediction on complex and large-scale datasets. Videos are available at https://sites.google.com/view/harp-videos/home.
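
The top-k sampling mentioned in the abstract amounts to the following step when rolling out the autoregressive prior over VQ-GAN latent codes; the surrounding transformer and VQ-GAN are assumed to exist elsewhere, and the default k and temperature here are arbitrary.

```python
import torch

@torch.no_grad()
def sample_next_code(logits, k=50, temperature=1.0):
    """Sample the next latent code id from the k most likely candidates.

    logits: (B, V) next-code logits from the causal transformer.
    """
    topk_vals, topk_idx = logits.topk(k, dim=-1)            # (B, k) best candidates
    probs = torch.softmax(topk_vals / temperature, dim=-1)  # renormalize over top-k
    choice = torch.multinomial(probs, num_samples=1)        # (B, 1) sampled position
    return topk_idx.gather(-1, choice).squeeze(-1)          # (B,) sampled code ids
```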

* Extended draft of the paper accepted to ICIP 2022 conference 

Masked World Models for Visual Control

Jun 28, 2022
Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, Pieter Abbeel

Visual model-based reinforcement learning (RL) has the potential to enable sample-efficient robot learning from visual observations. Yet current approaches typically train a single model end-to-end to learn both visual representations and dynamics, making it difficult to accurately model the interaction between robots and small objects. In this work, we introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning. Specifically, we train an autoencoder with convolutional layers and vision transformers (ViT) to reconstruct pixels given masked convolutional features, and learn a latent dynamics model that operates on the representations from the autoencoder. Moreover, to encode task-relevant information, we introduce an auxiliary reward prediction objective for the autoencoder. We continually update both the autoencoder and the dynamics model using online samples collected from environment interaction. We demonstrate that our decoupling approach achieves state-of-the-art performance on a variety of visual robotic tasks from Meta-world and RLBench; e.g., we achieve an 81.7% success rate on 50 visual robotic manipulation tasks from Meta-world, while the baseline achieves 67.9%. Code is available on the project website: https://sites.google.com/view/mwm-rl.
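
A compact sketch of the decoupled objective described above: an autoencoder reconstructs pixels from masked convolutional features while an auxiliary head predicts the reward, so the learned features carry task-relevant information for the separately trained latent dynamics model. The concrete modules and the unweighted sum of losses are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedAutoencoderWithRewardHead(nn.Module):
    """Pixel reconstruction from masked features plus an auxiliary reward objective."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module, feat_dim=256):
        super().__init__()
        self.encoder = encoder            # masked conv features -> latent
        self.decoder = decoder            # latent -> reconstructed pixels
        self.reward_head = nn.Linear(feat_dim, 1)

    def loss(self, masked_feats, target_pixels, reward):
        z = self.encoder(masked_feats)                    # (B, feat_dim)
        recon = self.decoder(z)                           # (B, C, H, W)
        recon_loss = F.mse_loss(recon, target_pixels)
        reward_loss = F.mse_loss(self.reward_head(z).squeeze(-1), reward)
        return recon_loss + reward_loss                   # auxiliary reward objective
```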

* Project website: https://sites.google.com/view/mwm-rl 