Michael S. Ryoo

VicTR: Video-conditioned Text Representations for Activity Recognition

Apr 05, 2023
Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo

Vision-Language models have shown strong performance in the image domain, even in zero-shot settings, thanks to the availability of large amounts of pretraining data (i.e., paired image-text examples). However, for videos, such paired data is not as abundant. Thus, video-text models are usually designed by adapting pretrained image-text models to the video domain, instead of training from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e., image -> video), often keeping text embeddings unchanged or even discarding them. In this paper, we argue that such adapted video-text models can benefit more by augmenting text rather than visual information. We propose VicTR, which jointly optimizes text and video tokens, generating 'Video-conditioned Text' embeddings. Our method can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text (e.g., object or scene information). We conduct experiments on multiple benchmarks including supervised (Kinetics-400, Charades), zero-shot and few-shot (HMDB-51, UCF-101) settings, showing competitive performance on activity recognition based on video-text models.

Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors

Nov 23, 2022
Ryan Burgert, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo

Recent diffusion-based generative models combined with vision-language models are capable of creating realistic images from natural language prompts. While these models are trained on large internet-scale datasets, such pre-trained models are not directly introduced to any semantic localization or grounding. Most current approaches for localization or grounding rely on human-annotated localization information in the form of bounding boxes or segmentation masks. The exceptions are a few unsupervised methods that utilize architectures or loss functions geared towards localization, but they need to be trained separately. In this work, we explore how off-the-shelf diffusion models, trained with no exposure to such localization information, are capable of grounding various semantic phrases with no segmentation-specific re-training. We introduce an inference-time optimization process that is capable of generating segmentation masks conditioned on natural language. We evaluate our proposal, Peekaboo, for unsupervised semantic segmentation on the Pascal VOC dataset. In addition, we evaluate for referring segmentation on the RefCOCO dataset. In summary, we present a first zero-shot, open-vocabulary, unsupervised (no localization information), semantic grounding technique leveraging diffusion-based generative models with no re-training. Our code will be released publicly.

* 19 pages; contains appendix

Token Turing Machines

Nov 16, 2022
Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab

We propose Token Turing Machines (TTM), a sequential, autoregressive Transformer model with memory for real-world sequential visual understanding. Our model is inspired by the seminal Neural Turing Machine, and has an external memory consisting of a set of tokens which summarise the previous history (i.e., frames). This memory is efficiently addressed, read and written using a Transformer as the processing unit/controller at each step. The model's memory module ensures that a new observation will only be processed with the contents of the memory (and not the entire history), meaning that it can efficiently process long sequences with a bounded computational cost at each step. We show that TTM outperforms other alternatives, such as other Transformer models designed for long sequences and recurrent neural networks, on two real-world sequential visual understanding tasks: online temporal activity detection from videos and vision-based robot action policy learning.
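
The read, process, and write loop described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: `summarize` is a hypothetical stand-in that averages contiguous groups of tokens where a learned summarization module would be used, `process` is a toy placeholder for the Transformer, and the exact choice of which tokens are concatenated before reading and writing is an assumption. The sketch only shows why per-step cost stays bounded: each step touches the fixed-size memory and the current frame, never the full history.

```python
import math
from typing import List, Tuple

Token = List[float]


def summarize(tokens: List[Token], k: int) -> List[Token]:
    """Reduce tokens to exactly k tokens by averaging contiguous groups.

    Hypothetical stand-in for a learned summarization; assumes len(tokens) >= k.
    """
    n = len(tokens)
    dim = len(tokens[0])
    out = []
    for i in range(k):
        group = tokens[i * n // k:(i + 1) * n // k]
        out.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return out


def process(tokens: List[Token]) -> List[Token]:
    """Toy placeholder for the Transformer processing unit."""
    return [[math.tanh(x) for x in t] for t in tokens]


def ttm_step(memory: List[Token], inputs: List[Token], r: int) -> Tuple[List[Token], List[Token]]:
    """One step: read r tokens, process them, write back a memory of the same size."""
    m = len(memory)
    read = summarize(memory + inputs, r)                 # read from memory and new frame
    output = process(read)                               # cost depends on r, not on history
    new_memory = summarize(memory + output + inputs, m)  # write: memory size stays m
    return new_memory, output


# Run over a short stream of frames, each with 6 two-dimensional tokens.
memory = [[0.0, 0.0] for _ in range(4)]
for step in range(5):
    frame = [[float(step), float(i)] for i in range(6)]
    memory, output = ttm_step(memory, frame, r=3)
```

After any number of steps the memory still holds 4 tokens and each output holds 3, which is the bounded-cost property the abstract highlights.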

Grafting Vision Transformers

Oct 28, 2022
Jongwoo Park, Kumara Kahatapitiya, Donghyun Kim, Shivchander Sudalairaj, Quanfu Fan, Michael S. Ryoo

Vision Transformers (ViTs) have recently become the state-of-the-art across many computer vision tasks. In contrast to convolutional networks (CNNs), ViTs enable global information sharing even within shallow layers of a network, i.e., among high-resolution features. However, this perk was later overlooked with the success of pyramid architectures such as Swin Transformer, which show better performance-complexity trade-offs. In this paper, we present a simple and efficient add-on component (termed GrafT) that considers global dependencies and multi-scale information throughout the network, in both high- and low-resolution features alike. GrafT can be easily adopted in both homogeneous and pyramid Transformers while showing consistent gains. It has the flexibility of branching-out at arbitrary depths, widening a network with multiple scales. This grafting operation enables us to share most of the parameters and computations of the backbone, adding only minimal complexity, but with a higher yield. In fact, the process of progressively compounding multi-scale receptive fields in GrafT enables communications between local regions. We show the benefits of the proposed method on multiple benchmarks, including image classification (ImageNet-1K), semantic segmentation (ADE20K), object detection and instance segmentation (COCO2017). Our code and models will be made available.

Open-vocabulary Queryable Scene Representations for Real World Planning

Sep 20, 2022
Boyuan Chen, Fei Xia, Brian Ichter, Kanishka Rao, Keerthana Gopalakrishnan, Michael S. Ryoo, Austin Stone, Daniel Kappler

Large language models (LLMs) have unlocked new capabilities of task planning from human instructions. However, prior attempts to apply LLMs to real-world robotic tasks are limited by the lack of grounding in the surrounding scene. In this paper, we develop NLMap, an open-vocabulary and queryable scene representation to address this problem. NLMap serves as a framework to gather and integrate contextual information into LLM planners, allowing them to see and query available objects in the scene before generating a context-conditioned plan. NLMap first establishes a natural language queryable scene representation with Visual Language models (VLMs). An LLM-based object proposal module parses instructions and proposes involved objects to query the scene representation for object availability and location. An LLM planner then plans with such information about the scene. NLMap allows robots to operate without a fixed list of objects or executable options, enabling real robot operation unachievable by previous methods. Project website: https://nlmap-saycan.github.io

Video Question Answering with Iterative Video-Text Co-Tokenization

Aug 01, 2022
AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova

Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA, outperforming the previous state-of-the-art by large margins. Simultaneously, our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.

* ECCV 2022
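
To put the compute figures quoted in this abstract in perspective, the short calculation below converts the reported GFLOPs (150-360 for prior models, 67 for this one) into a speed-up factor and a percentage saved. Only those three numbers come from the abstract; the helper name `reduction` is illustrative.

```python
from typing import Tuple

PREVIOUS_GFLOPS = (150, 360)  # range reported for prior models
NEW_GFLOPS = 67               # figure reported for this model


def reduction(previous: float, new: float) -> Tuple[float, float]:
    """Return (how many times fewer GFLOPs, percent of GFLOPs saved)."""
    factor = previous / new
    percent_saved = 100.0 * (1.0 - new / previous)
    return round(factor, 2), round(percent_saved, 2)


low_end = reduction(PREVIOUS_GFLOPS[0], NEW_GFLOPS)   # (2.24, 55.33)
high_end = reduction(PREVIOUS_GFLOPS[1], NEW_GFLOPS)  # (5.37, 81.39)
print(low_end, high_end)
```

In other words, the stated figures correspond to roughly 2.2x to 5.4x fewer GFLOPs, depending on which prior model is the point of comparison.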

Video + CLIP Baseline for Ego4D Long-term Action Anticipation

Jul 01, 2022
Srijan Das, Michael S. Ryoo

In this report, we introduce our adaptation of image-text models for long-term action anticipation. Our Video + CLIP framework makes use of a large-scale pre-trained paired image-text model, CLIP, and a video encoder, the SlowFast network. The CLIP embedding provides fine-grained understanding of objects relevant for an action, whereas the SlowFast network is responsible for modeling temporal information within a video clip of a few frames. We show that the features obtained from both encoders are complementary to each other, thus outperforming the baseline on Ego4D for the task of long-term action anticipation. Our code is available at github.com/srijandas07/clip_baseline_LTA_Ego4d.

* Secured second position in the Ego4D Challenge for Long-Term Action Anticipation track at CVPR 2022

Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

Jun 23, 2022
Jinghuan Shang, Srijan Das, Michael S. Ryoo

Humans are remarkably flexible in understanding viewpoint changes because the visual cortex supports the perception of 3D structure. In contrast, most computer vision models that learn visual representations from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, neither these Transformers nor 2D convolutional networks perform explicit operations to learn viewpoint-agnostic representations for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix to impose geometric transformations on the tokens. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks including image classification, multi-view video alignment, and action recognition. The models with 3DTRL outperform their backbone Transformers in all the tasks with minimal added computation. Our project page is at https://www3.cs.stonybrook.edu/~jishang/3dtrl/3dtrl.html

* Pre-print. 20 pages
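
The geometric idea in this abstract, recovering a 3D position for each 2D patch token from a depth value and a camera transform, can be sketched with standard pinhole-camera back-projection. This is a minimal illustration under stated assumptions and not the 3DTRL layer itself: in the paper the depth and camera parameters are estimated by learned modules, whereas here they are fixed inputs, and the function names, the normalized patch coordinates, and the camera-to-world convention `R p + t` are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def patch_centers(rows: int, cols: int) -> List[Tuple[float, float]]:
    """Centers of a rows x cols patch grid in normalized coordinates in [-1, 1]."""
    return [(2 * (c + 0.5) / cols - 1, 2 * (r + 0.5) / rows - 1)
            for r in range(rows) for c in range(cols)]


def back_project(u: float, v: float, depth: float, focal: float = 1.0) -> Vec3:
    """Pinhole back-projection of image point (u, v) at the given depth."""
    return (u * depth / focal, v * depth / focal, depth)


def camera_to_world(p: Vec3, rotation: Sequence[Sequence[float]], translation: Vec3) -> Vec3:
    """Apply an assumed camera-to-world transform: R p + t."""
    x, y, z = (sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
               for i in range(3))
    return (x, y, z)


# Hypothetical values standing in for the learned pseudo-depth and camera.
depths = [2.0, 2.0, 4.0, 4.0]                    # one depth per token
rot_z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]    # 90-degree rotation about z
shift = (0.0, 0.0, 1.0)

tokens_3d = [camera_to_world(back_project(u, v, d), rot_z_90, shift)
             for (u, v), d in zip(patch_centers(2, 2), depths)]
```

The resulting 3D coordinates could then be embedded and added to the token features, in the same spirit as a positional embedding; how that is done in the actual layer is described in the paper, not here.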

Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?

Jun 23, 2022
Xiang Li, Jinghuan Shang, Srijan Das, Michael S. Ryoo

We investigate whether self-supervised learning (SSL) can improve online reinforcement learning (RL) from pixels. We extend the contrastive reinforcement learning framework (e.g., CURL) that jointly optimizes SSL and RL losses and conduct an extensive set of experiments with various self-supervised losses. Our observations suggest that the existing SSL framework for RL fails to bring meaningful improvement over baselines that only take advantage of image augmentation when the same amount of data and augmentation is used. We further perform an evolutionary search to find the optimal combination of multiple self-supervised losses for RL, but find that even such a loss combination fails to meaningfully outperform the methods that only utilize carefully designed image augmentations. Often, the use of self-supervised losses under the existing framework lowered RL performance. We evaluate the approach in multiple different environments including a real-world robot environment and confirm that no single self-supervised loss or image augmentation method can dominate all environments and that the current framework for joint optimization of SSL and RL is limited. Finally, we empirically investigate the pretraining framework for SSL + RL and the properties of representations learned with different approaches.

STC-mix: Space, Time, Channel mixing for Self-supervised Video Representation

Dec 07, 2021
Srijan Das, Michael S. Ryoo

Contrastive representation learning of videos relies heavily on the availability of millions of unlabelled videos. This is practical for videos available on the web, but acquiring videos at such scale for real-world applications is very expensive and laborious. Therefore, in this paper we focus on designing video augmentation for self-supervised learning. We first analyze the best strategy to mix videos to create a new augmented video sample. Then, the question remains: can we make use of the other modalities in videos for data mixing? To this end, we propose Cross-Modal Manifold Cutmix (CMMC), which inserts a video tesseract into another video tesseract in the feature space across two different modalities. We find that our video mixing strategy STC-mix, i.e., preliminary mixing of videos followed by CMMC across different modalities in a video, improves the quality of learned video representations. We conduct thorough experiments for two downstream tasks, action recognition and video retrieval, on two small-scale video datasets, UCF101 and HMDB51. We also demonstrate the effectiveness of our STC-mix on the NTU dataset, where domain knowledge is limited. We show that the performance of our STC-mix on both the downstream tasks is on par with the other self-supervised approaches while requiring less training data.

* 12 pages, codes and model links will be updated soon
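
The cut-and-paste operation at the heart of this abstract can be illustrated with a generic CutMix-style sketch on feature volumes. This is a simplified stand-in and not the authors' CMMC: it mixes two plain nested-list volumes indexed as `[time][height][width]`, ignores the channel axis, and takes the cuboid bounds as explicit arguments where a real augmentation would sample them at random. The function name `cuboid_mix` and the mixing ratio `lam` follow common CutMix usage and are assumptions here. In the cross-modal setting described above, `a` and `b` would be feature maps of the same video from two different modalities.

```python
import copy
from typing import List, Tuple

Volume = List[List[List[float]]]  # indexed as [time][height][width]
Range = Tuple[int, int]           # half-open interval [start, stop)


def cuboid_mix(a: Volume, b: Volume, t_range: Range, h_range: Range,
               w_range: Range) -> Tuple[Volume, float]:
    """Paste a space-time cuboid of b into a copy of a.

    Returns the mixed volume and lam, the fraction of entries kept from a.
    Assumes a and b have identical shapes and the ranges lie inside them.
    """
    mixed = copy.deepcopy(a)
    for t in range(*t_range):
        for h in range(*h_range):
            for w in range(*w_range):
                mixed[t][h][w] = b[t][h][w]
    total = len(a) * len(a[0]) * len(a[0][0])
    replaced = ((t_range[1] - t_range[0]) * (h_range[1] - h_range[0])
                * (w_range[1] - w_range[0]))
    return mixed, 1.0 - replaced / total


# Two toy 4x4x4 feature volumes: all zeros and all ones.
feat_a = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
feat_b = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]

mixed, lam = cuboid_mix(feat_a, feat_b, (0, 2), (1, 3), (1, 3))
print(lam)  # 0.875, since 8 of 64 entries were replaced
```

The ratio `lam` is what a CutMix-style objective would typically use to weight the two sources when forming the training target.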
