Haoqi Fan

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

Apr 29, 2021

Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, Kaiming He

Abstract:We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart. Code is made available at https://github.com/facebookresearch/SlowFast

* CVPR 2021
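
A minimal sketch of how a temporally-persistent positive pair might be drawn, assuming the two views are clips from the same video whose start times lie within a maximum timespan. The function name `sample_positive_clips`, its parameters, and the uniform sampling are illustrative assumptions, not the SlowFast implementation.

```python
import random


def sample_positive_clips(video_len, clip_len, max_span, rng=random):
    """Return start times (seconds) of two clips from the same video.

    Hypothetical helper: the two starts are at most `max_span` seconds
    apart, and both clips fit entirely inside the video.
    """
    last_start = video_len - clip_len
    if last_start < 0:
        raise ValueError("clip is longer than the video")
    # First clip: anywhere in the video.
    t1 = rng.uniform(0.0, last_start)
    # Second clip: within max_span of the first, clamped to the video.
    lo = max(0.0, t1 - max_span)
    hi = min(last_start, t1 + max_span)
    t2 = rng.uniform(lo, hi)
    return t1, t2


# Two 2-second clips from a 100-second video, at most 60 seconds apart.
t1, t2 = sample_positive_clips(video_len=100.0, clip_len=2.0, max_span=60.0)
```

In a contrastive or similarity-based framework, the features of the two clips would then be encouraged to agree; `max_span=60.0` mirrors the 60-second timespan mentioned in the abstract.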

Multiscale Vision Transformers

Apr 22, 2021

Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer

Abstract:We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast

* Technical report
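
The channel-resolution stages described in the abstract can be sketched as a simple schedule. This is an illustration only: the function name `multiscale_schedule`, the doubling of channels, the halving of resolution per stage, and the starting values are assumptions chosen for the example, not figures taken from the abstract.

```python
def multiscale_schedule(channels, height, width, num_stages,
                        channel_mult=2, stride=2):
    """List (channels, height, width) for each stage of a feature pyramid.

    Hypothetical schedule: every stage multiplies the channel dimension by
    `channel_mult` and divides the spatial resolution by `stride`.
    """
    stages = []
    for _ in range(num_stages):
        stages.append((channels, height, width))
        channels *= channel_mult
        height //= stride
        width //= stride
    return stages


# Early stages: high resolution, few channels. Late stages: the reverse.
for stage in multiscale_schedule(96, 56, 56, num_stages=4):
    print(stage)
```

The early stages keep spatial detail with few channels, while the later stages trade resolution for channel capacity, which is the pyramid the abstract describes.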

Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories

Apr 02, 2021

Xitong Yang, Haoqi Fan, Lorenzo Torresani, Larry Davis, Heng Wang

Abstract:The standard way of training video models entails sampling at each iteration a single clip from a video and optimizing the clip prediction with respect to the video-level label. We argue that a single clip may not have enough temporal coverage to exhibit the label to recognize, since video datasets are often weakly labeled with categorical information but without dense temporal annotations. Furthermore, optimizing the model over brief clips impedes its ability to learn long-term temporal dependencies. To overcome these limitations, we introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration. This enables the learning of long-range dependencies beyond a single clip. We explore different design choices for the collaborative memory to ease the optimization difficulties. Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead. Through extensive experiments, we demonstrate that our framework generalizes to different video architectures and tasks, outperforming the state of the art on both action recognition (e.g., Kinetics-400 & 700, Charades, Something-Something-V1) and action detection (e.g., AVA v2.1 & v2.2).

* CVPR 2021

Multiview Pseudo-Labeling for Semi-supervised Learning from Video

Apr 01, 2021

Bo Xiong, Haoqi Fan, Kristen Grauman, Christoph Feichtenhofer

Abstract:We present a multiview pseudo-labeling approach to video learning, a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video. The complementary views help obtain more reliable pseudo-labels on unlabeled video, to learn stronger video representations than from purely supervised data. Though our method capitalizes on multiple views, it nonetheless trains a model that is shared across appearance and motion input and thus, by design, incurs no additional computation overhead at inference time. On multiple video recognition datasets, our method substantially outperforms its supervised counterpart, and compares favorably to previous work on standard benchmarks in self-supervised video representation learning.

* Technical report

HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

Mar 28, 2021

Song Liu, Haoqi Fan, Shengsheng Qian, Yiru Chen, Wenkui Ding, Zhongyuan Wang

Abstract:Video-text retrieval has become a hot research topic with the explosion of multimedia data on the Internet. Transformers for video-text learning have attracted increasing attention due to their promising performance. However, existing cross-modal transformer approaches typically suffer from two major limitations: 1) limited exploitation of the transformer architecture, where different layers have different feature characteristics; 2) an end-to-end training mechanism that limits negative interactions among samples in a mini-batch. In this paper, we propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval. HiT performs hierarchical cross-modal contrastive matching at the feature level and the semantic level to achieve multi-view and comprehensive retrieval results. Moreover, inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale negative interactions on-the-fly, which contributes to the generation of more precise and discriminative representations. Experimental results on three major video-text retrieval benchmark datasets demonstrate the advantages of our methods.

Can Temporal Information Help with Contrastive Self-Supervised Learning?

Nov 25, 2020

Yutong Bai, Haoqi Fan, Ishan Misra, Ganesh Venkatesh, Yongyi Lu, Yuyin Zhou, Qihang Yu, Vikas Chandra, Alan Yuille

Abstract:Leveraging temporal information has been regarded as essential for developing video understanding models. However, how to properly incorporate temporal information into the recently successful instance-discrimination-based contrastive self-supervised learning (CSL) framework remains unclear. As an intuitive solution, we find that directly applying temporal augmentations does not help, and can even impair, video CSL in general. This counter-intuitive observation motivates us to re-design existing video CSL frameworks for better integration of temporal knowledge. To this end, we present Temporal-aware Contrastive self-supervised learning (TaCo) as a general paradigm to enhance video CSL. Specifically, TaCo selects a set of temporal transformations not only as strong data augmentation but also to constitute extra self-supervision for video understanding. By jointly contrasting instances with enriched temporal transformations and learning these transformations as self-supervised signals, TaCo can significantly enhance unsupervised video representation learning. For instance, TaCo demonstrates consistent improvement in downstream classification tasks over a list of backbones and CSL approaches. Our best model achieves 85.1% (UCF-101) and 51.6% (HMDB-51) top-1 accuracy, which is a 3% and 2.4% relative improvement over the previous state-of-the-art.

Improved Baselines with Momentum Contrastive Learning

Mar 09, 2020

Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He

Abstract:Contrastive unsupervised learning has recently shown encouraging progress, e.g., in Momentum Contrast (MoCo) and SimCLR. In this note, we verify the effectiveness of two of SimCLR's design improvements by implementing them in the MoCo framework. With simple modifications to MoCo (namely, using an MLP projection head and more data augmentation), we establish stronger baselines that outperform SimCLR and do not require large training batches. We hope this will make state-of-the-art unsupervised learning research more accessible. Code will be made public.

* Tech report, 2 pages + references

Momentum Contrast for Unsupervised Visual Representation Learning

Nov 14, 2019

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick

Abstract:We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.

* Technical report
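
The two ingredients named in the abstract, a queue and a moving-averaged encoder, can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: parameters are flat lists of floats, the names `momentum_update` and `KeyQueue` are hypothetical, and the default coefficient `m=0.999` is an illustrative choice that the abstract does not specify.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Moving-average update of the key encoder from the query encoder.

    Each key parameter becomes m * key + (1 - m) * query, so the key
    encoder changes slowly and the stored keys stay consistent.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class KeyQueue:
    """Fixed-size FIFO dictionary of encoded keys (hypothetical helper)."""

    def __init__(self, max_size):
        # A deque with maxlen discards the oldest entries automatically.
        self.keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        """Add the newest mini-batch of keys, dropping the oldest if full."""
        self.keys.extend(batch_keys)

    def __len__(self):
        return len(self.keys)


queue = KeyQueue(max_size=4)
queue.enqueue(["k1", "k2", "k3"])
queue.enqueue(["k4", "k5"])   # "k1" falls out of the queue
key_params = momentum_update([1.0, 0.0], [0.0, 1.0], m=0.9)
```

Decoupling the dictionary size from the mini-batch size is what lets the dictionary be large, while the slow-moving key encoder keeps older keys comparable to newer ones.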

Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution

Apr 30, 2019

Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, Jiashi Feng

Abstract:In natural images, information is conveyed at different frequencies where higher frequencies are usually encoded with fine details and lower frequencies are usually encoded with global structures. Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies. In this work, we propose to factorize the mixed feature maps by their frequencies and design a novel Octave Convolution (OctConv) operation to store and process feature maps that vary spatially "slower" at a lower spatial resolution, reducing both memory and computation cost. Unlike existing multi-scale methods, OctConv is formulated as a single, generic, plug-and-play convolutional unit that can be used as a direct replacement of (vanilla) convolutions without any adjustments in the network architecture. It is also orthogonal and complementary to methods that suggest better topologies or reduce channel-wise redundancy like group or depth-wise convolutions. We experimentally show that by simply replacing convolutions with OctConv, we can consistently boost accuracy for both image and video recognition tasks, while reducing memory and computational cost. An OctConv-equipped ResNet-152 can achieve 82.9% top-1 classification accuracy on ImageNet with merely 22.2 GFLOPs.
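
As a rough illustration of the memory saving, the sketch below counts stored activations when part of a feature map is kept one octave lower, i.e. at half the height and half the width. The function name `octave_feature_cost` and the ratio parameter `alpha` (the fraction of channels treated as low-frequency) are assumptions made for this example and do not appear in the abstract.

```python
def octave_feature_cost(channels, height, width, alpha):
    """Count stored activations for a feature map split by frequency.

    A fraction `alpha` of the channels is stored at half the spatial
    resolution in each dimension; the rest stay at full resolution.
    """
    low_channels = int(round(alpha * channels))
    high_channels = channels - low_channels
    high = high_channels * height * width
    low = low_channels * (height // 2) * (width // 2)
    return high + low


full = octave_feature_cost(64, 32, 32, alpha=0.0)
half = octave_feature_cost(64, 32, 32, alpha=0.5)
print(half / full)
```

Each low-frequency channel costs a quarter of a full-resolution channel, so the saving grows with the share of channels stored at the lower resolution.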

Long-Term Feature Banks for Detailed Video Understanding

Dec 12, 2018

Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, Ross Girshick

Abstract:To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank (supportive information extracted over the entire span of a video) to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades.

* Technical report
