Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rohit Girdhar

Jack

Learning Video Representations from Large Language Models

Dec 08, 2022

Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar

Abstract:We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on EGTEA classification and 5.9% Epic-Kitchens-100 multi-instance retrieval benchmarks. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior on increasing pre-training data and model size.

* Tech report. Project page: https://facebookresearch.github.io/LaViLa; Code is available at http://github.com/facebookresearch/LaViLa

Via

Access Paper or Ask Questions

OmniMAE: Single Model Masked Pretraining on Images and Videos

Jun 16, 2022

Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

Figure 1 for OmniMAE: Single Model Masked Pretraining on Images and Videos

Figure 2 for OmniMAE: Single Model Masked Pretraining on Images and Videos

Figure 3 for OmniMAE: Single Model Masked Pretraining on Images and Videos

Figure 4 for OmniMAE: Single Model Masked Pretraining on Images and Videos

Abstract:Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work has studied these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse performance compared to single modality models. In this work, we show that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data. This single model learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture. In particular, our single pretrained model can be finetuned to achieve 86.5% on ImageNet and 75.3% on the challenging Something Something-v2 video benchmark. Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training.

Via

Access Paper or Ask Questions

Omnivore: A Single Model for Many Visual Modalities

Jan 20, 2022

Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra

Figure 1 for Omnivore: A Single Model for Many Visual Modalities

Figure 2 for Omnivore: A Single Model for Many Visual Modalities

Figure 3 for Omnivore: A Single Model for Many Visual Modalities

Figure 4 for Omnivore: A Single Model for Many Visual Modalities

Abstract:Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs at-par or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.

Via

Access Paper or Ask Questions

Detecting Twenty-thousand Classes using Image-level Supervision

Jan 10, 2022

Xingyi Zhou, Rohit Girdhar, Armand Joulin, Phillip Krähenbühl, Ishan Misra

Figure 1 for Detecting Twenty-thousand Classes using Image-level Supervision

Figure 2 for Detecting Twenty-thousand Classes using Image-level Supervision

Figure 3 for Detecting Twenty-thousand Classes using Image-level Supervision

Figure 4 for Detecting Twenty-thousand Classes using Image-level Supervision

Abstract:Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. Unlike prior work, Detic does not assign image labels to boxes based on model predictions, making it much easier to implement and compatible with a range of detection architectures and backbones. Our results show that Detic yields excellent detectors even for classes without box annotations. It outperforms prior work on both open-vocabulary and long-tail detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3 mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard LVIS benchmark, Detic reaches 41.7 mAP for all classes and 41.7 mAP for rare classes. For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without fine-tuning. Code is available at https://github.com/facebookresearch/Detic.

* Code is available at https://github.com/facebookresearch/Detic

Via

Access Paper or Ask Questions

Mask2Former for Video Instance Segmentation

Dec 20, 2021

Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, Alexander G. Schwing

Figure 1 for Mask2Former for Video Instance Segmentation

Figure 2 for Mask2Former for Video Instance Segmentation

Figure 3 for Mask2Former for Video Instance Segmentation

Abstract:We find Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline. In this report, we show universal image segmentation architectures trivially generalize to video segmentation by directly predicting 3D segmentation volumes. Specifically, Mask2Former sets a new state-of-the-art of 60.4 AP on YouTubeVIS-2019 and 52.6 AP on YouTubeVIS-2021. We believe Mask2Former is also capable of handling video semantic and panoptic segmentation, given its versatility in image segmentation. We hope this will make state-of-the-art video segmentation research more accessible and bring more attention to designing universal image and video segmentation architectures.

* Code and models: https://github.com/facebookresearch/Mask2Former

Via

Access Paper or Ask Questions

Masked-attention Mask Transformer for Universal Image Segmentation

Dec 10, 2021

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar

Figure 1 for Masked-attention Mask Transformer for Universal Image Segmentation

Figure 2 for Masked-attention Mask Transformer for Universal Image Segmentation

Figure 3 for Masked-attention Mask Transformer for Universal Image Segmentation

Figure 4 for Masked-attention Mask Transformer for Universal Image Segmentation

Abstract:Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).

* Project page/code/models: https://bowenc0221.github.io/mask2former

Via

Access Paper or Ask Questions

Ego4D: Around the World in 3,000 Hours of Egocentric Video

Oct 13, 2021

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu(+74 more)

Figure 1 for Ego4D: Around the World in 3,000 Hours of Egocentric Video

Figure 2 for Ego4D: Around the World in 3,000 Hours of Egocentric Video

Figure 3 for Ego4D: Around the World in 3,000 Hours of Egocentric Video

Figure 4 for Ego4D: Around the World in 3,000 Hours of Egocentric Video

Abstract:We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/

Via

Access Paper or Ask Questions

An End-to-End Transformer Model for 3D Object Detection

Sep 16, 2021

Ishan Misra, Rohit Girdhar, Armand Joulin

Figure 1 for An End-to-End Transformer Model for 3D Object Detection

Figure 2 for An End-to-End Transformer Model for 3D Object Detection

Figure 3 for An End-to-End Transformer Model for 3D Object Detection

Figure 4 for An End-to-End Transformer Model for 3D Object Detection

Abstract:We propose 3DETR, an end-to-end Transformer based object detection model for 3D point clouds. Compared to existing detection methods that employ a number of 3D-specific inductive biases, 3DETR requires minimal modifications to the vanilla Transformer block. Specifically, we find that a standard Transformer with non-parametric queries and Fourier positional embeddings is competitive with specialized architectures that employ libraries of 3D-specific operators with hand-tuned hyperparameters. Nevertheless, 3DETR is conceptually simple and easy to implement, enabling further improvements by incorporating 3D domain knowledge. Through extensive experiments, we show 3DETR outperforms the well-established and highly optimized VoteNet baselines on the challenging ScanNetV2 dataset by 9.5%. Furthermore, we show 3DETR is applicable to 3D tasks beyond detection, and can serve as a building block for future research.

* Accepted at ICCV 2021

Via

Access Paper or Ask Questions

Anticipative Video Transformer

Jun 03, 2021

Rohit Girdhar, Kristen Grauman

Figure 1 for Anticipative Video Transformer

Figure 2 for Anticipative Video Transformer

Figure 3 for Anticipative Video Transformer

Figure 4 for Anticipative Video Transformer

Abstract:We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions. We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features. Compared to existing temporal aggregation strategies, AVT has the advantage of both maintaining the sequential progression of observed actions while still capturing long-range dependencies--both critical for the anticipation task. Through extensive experiments, we show that AVT obtains the best reported performance on four popular action anticipation benchmarks: EpicKitchens-55, EpicKitchens-100, EGTEA Gaze+, and 50-Salads, including outperforming all submissions to the EpicKitchens-100 CVPR'21 challenge.

* Ranked #1 on CVPR'21 EPIC-Kitchens Action Anticipation challenge leaderboard. Project page: http://facebookresearch.github.io/AVT

Via

Access Paper or Ask Questions

3D Spatial Recognition without Spatially Labeled 3D

May 13, 2021

Zhongzheng Ren, Ishan Misra, Alexander G. Schwing, Rohit Girdhar

Figure 1 for 3D Spatial Recognition without Spatially Labeled 3D

Figure 2 for 3D Spatial Recognition without Spatially Labeled 3D

Figure 3 for 3D Spatial Recognition without Spatially Labeled 3D

Figure 4 for 3D Spatial Recognition without Spatially Labeled 3D

Abstract:We introduce WyPR, a Weakly-supervised framework for Point cloud Recognition, requiring only scene-level class tags as supervision. WyPR jointly addresses three core 3D recognition tasks: point-level semantic segmentation, 3D proposal generation, and 3D object detection, coupling their predictions through self and cross-task consistency losses. We show that in conjunction with standard multiple-instance learning objectives, WyPR can detect and segment objects in point cloud data without access to any spatial labels at training time. We demonstrate its efficacy using the ScanNet and S3DIS datasets, outperforming prior state of the art on weakly-supervised segmentation by more than 6% mIoU. In addition, we set up the first benchmark for weakly-supervised 3D object detection on both datasets, where WyPR outperforms standard approaches and establishes strong baselines for future work.

* CVPR 2021

Via

Access Paper or Ask Questions