Abstract:Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring the grouping mechanism back into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 51.2% mIoU on the PASCAL VOC 2012 dataset and 22.3% mIoU on the PASCAL Context dataset, and performs competitively with state-of-the-art transfer-learning methods that require greater levels of supervision. The project page is available at https://jerryxu.net/GroupViT.
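The joint image-text training described above rests on a symmetric contrastive objective over paired embeddings. Below is a minimal sketch of such a loss; the function name, normalization, and temperature value are illustrative assumptions, not GroupViT's exact implementation.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (B, D) outputs of the image and text encoders.
    Matched pairs share the same batch index; all other pairs act as negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

Because supervision acts only on the pooled embeddings, the grouping into segments inside the image encoder is free to emerge without pixel-level labels.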
Abstract:We propose a novel scene representation that encodes reaching distance -- the distance from any position in the scene to a goal along a feasible trajectory. We demonstrate that this environment field representation can directly guide the dynamic behaviors of agents in 2D mazes or 3D indoor scenes. Our environment field is a continuous representation and is learned via a neural implicit function using discretely sampled training data. We showcase its application to agent navigation in 2D mazes and human trajectory prediction in 3D indoor environments. To produce physically plausible and natural trajectories for humans, we additionally learn a generative model that predicts regions where humans commonly appear, and enforce the environment field to be defined within such regions. Extensive experiments demonstrate that the proposed method generates feasible and plausible trajectories efficiently and accurately.
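Conceptually, the neural implicit function above is a coordinate network fit to sampled position/distance pairs. The sketch below shows one such field for a 2D maze; the architecture, L1 objective, and placeholder data are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class EnvironmentField(nn.Module):
    """Implicit field f(position) -> reaching distance to a fixed goal.

    A plain coordinate MLP; the layer sizes are illustrative assumptions.
    """
    def __init__(self, in_dim=2, hidden=256):  # in_dim=2 for 2D mazes, 3 for 3D scenes
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, pos):
        return self.net(pos).squeeze(-1)

# Fit to discretely sampled (position, reaching-distance) training pairs.
field = EnvironmentField(in_dim=2)
opt = torch.optim.Adam(field.parameters(), lr=1e-4)
positions = torch.rand(1024, 2)        # placeholder sampled maze positions
distances = positions.norm(dim=-1)     # placeholder ground-truth reaching distances
loss = (field(positions) - distances).abs().mean()
loss.backward()
opt.step()
```

Once fit, an agent can step along the negative gradient of the predicted distance to descend toward the goal, which is what makes the continuous field directly usable for guiding trajectories.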
Abstract:We present SSOD, the first end-to-end analysis-by-synthesis framework with controllable GANs for the task of self-supervised object detection. We use collections of real-world images without bounding box annotations to learn to synthesize and detect objects. We leverage controllable GANs to synthesize images with pre-defined object properties and use them to train object detectors. We propose a tight end-to-end coupling of the synthesis and detection networks to optimally train our system. Finally, we also propose a method to optimally adapt SSOD to an intended target dataset without requiring labels for it. For the task of car detection, on the challenging KITTI and Cityscapes datasets, we show that SSOD outperforms the prior state-of-the-art purely image-based self-supervised object detection method, Wetectron. Even without requiring any 3D CAD assets, it also surpasses the state-of-the-art rendering-based method Meta-Sim2. Our work advances the field of self-supervised object detection by introducing a successful new paradigm of controllable GAN-based image synthesis and by significantly improving the baseline accuracy of the task. We open-source our code at https://github.com/NVlabs/SSOD.
Abstract:A video autoencoder is proposed for learning disentangled representations of 3D structure and camera pose from videos in a self-supervised manner. Relying on temporal continuity in videos, our work assumes that the 3D scene structure in nearby video frames remains static. Given a sequence of video frames as input, the video autoencoder extracts a disentangled representation of the scene including: (i) a temporally-consistent deep voxel feature to represent the 3D structure and (ii) a 3D trajectory of camera poses for each frame. These two representations are then re-entangled to render the input video frames. The video autoencoder can be trained directly using a pixel reconstruction loss, without any ground-truth 3D or camera pose annotations. The disentangled representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following. We evaluate our method on several large-scale natural video datasets, and show generalization results on out-of-domain images.
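The encode/re-entangle/reconstruct cycle can be summarized as a single training step. The skeleton below is a hedged sketch: the three modules (structure encoder, pose estimator, differentiable renderer) are stand-in interfaces whose internals the abstract does not specify.

```python
import torch

def train_step(frames, structure_encoder, pose_estimator, renderer, optimizer):
    """One self-supervised step; frames: (T, C, H, W) clip with a static scene.

    The module arguments are assumed callables, not the paper's exact networks.
    """
    voxels = structure_encoder(frames[0:1])   # temporally-consistent deep voxel feature
    poses = pose_estimator(frames)            # per-frame camera pose trajectory, (T, 6)
    # Re-entangle: render the voxel feature from each estimated camera pose.
    recon = torch.stack([renderer(voxels, p) for p in poses])
    loss = (recon - frames).abs().mean()      # pixel reconstruction loss, no 3D labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

The key point is that the only learning signal is photometric reconstruction; disentanglement comes from the static-structure assumption, not from supervision.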
Abstract:Dense correspondence across semantically related images has been extensively studied, but still faces two challenges: 1) large variations in appearance, scale, and pose exist even for objects from the same category, and 2) labeling pixel-level dense correspondences is labor-intensive and infeasible to scale. Most existing approaches focus on designing various matching strategies with fully-supervised ImageNet-pretrained networks. On the other hand, while a variety of self-supervised approaches have been proposed to explicitly measure image-level similarities, correspondence matching at the pixel level remains under-explored. In this work, we propose a multi-level contrastive learning approach for semantic matching, which does not rely on any ImageNet-pretrained model. We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondences between similar objects, while the performance can be further enhanced by regularizing cross-instance cycle-consistency at intermediate feature levels. Experimental results on the PF-PASCAL, PF-WILLOW, and SPair-71k benchmark datasets demonstrate that our method performs favorably against state-of-the-art approaches. The source code and trained models will be made available to the public.
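The cross-instance cycle-consistency regularizer can be sketched as a soft round-trip constraint on dense features: matches from image A to image B and back should return to their starting point. The formulation below is an illustrative reading, with the temperature and soft-matching choices as assumptions.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(feat_a, feat_b, temperature=0.05):
    """Cross-instance cycle consistency on dense features (illustrative sketch).

    feat_a, feat_b: (N, D) feature vectors sampled from two images of the
    same category. Soft-match A -> B -> A and penalize round trips that do
    not return to their starting index.
    """
    feat_a = F.normalize(feat_a, dim=-1)
    feat_b = F.normalize(feat_b, dim=-1)
    ab = F.softmax(feat_a @ feat_b.t() / temperature, dim=-1)  # A -> B soft matches
    ba = F.softmax(feat_b @ feat_a.t() / temperature, dim=-1)  # B -> A soft matches
    cycle = ab @ ba                                            # round-trip probabilities
    targets = torch.arange(feat_a.size(0), device=feat_a.device)
    return F.nll_loss(torch.log(cycle + 1e-8), targets)       # diagonal should dominate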
Abstract:We propose a canonical point autoencoder (CPAE) that predicts dense correspondences between 3D shapes of the same category. The autoencoder performs two key functions: (a) encoding an arbitrarily ordered point cloud to a canonical primitive, e.g., a sphere, and (b) decoding the primitive back to the original input instance shape. Placed at the bottleneck, this primitive plays a key role in mapping all unordered point clouds onto the canonical surface and reconstructing them in an ordered fashion. Once trained, points from different shape instances that map to the same locations on the primitive surface are determined to be a correspondence pair. Our method does not require any form of annotation or a self-supervised part segmentation network, and can handle unaligned input point clouds. Experimental results on 3D semantic keypoint transfer and part segmentation transfer show that our model performs favorably against state-of-the-art correspondence learning methods.
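At inference, correspondence reduces to a nearest-neighbor lookup in canonical coordinates, as in the sketch below. The `encoder` argument is a stand-in for the trained CPAE encoder that maps a point cloud onto the canonical primitive.

```python
import torch

def correspondences(points_a, points_b, encoder):
    """Pair points of two shapes via their locations on the canonical sphere.

    points_a: (N, 3), points_b: (M, 3) unordered point clouds of one category.
    Returns, for each point of shape A, the index of its match in shape B.
    """
    on_sphere_a = encoder(points_a)               # (N, 3) canonical coordinates
    on_sphere_b = encoder(points_b)               # (M, 3) canonical coordinates
    dists = torch.cdist(on_sphere_a, on_sphere_b) # (N, M) pairwise distances
    return dists.argmin(dim=1)                    # nearest canonical location in B
```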
Abstract:Estimating 3D hand and object pose from a single image is an extremely challenging problem: hands and objects are often self-occluded during interactions, and 3D annotations are scarce since even humans cannot perfectly label the ground truth from a single image. To tackle these challenges, we propose a unified framework for estimating 3D hand and object poses with semi-supervised learning. We build a joint learning framework that performs explicit contextual reasoning between hand and object representations via a Transformer. Going beyond the limited 3D annotations in single images, we leverage the spatial-temporal consistency in large-scale hand-object videos as a constraint for generating pseudo labels in semi-supervised learning. Our method not only improves hand pose estimation on a challenging real-world dataset, but also substantially improves object pose estimation, which has fewer ground truths per instance. By training with large-scale diverse videos, our model also generalizes better across multiple out-of-domain datasets. Project page and code: https://stevenlsw.github.io/Semi-Hand-Object
Abstract:Training on synthetic data can be beneficial in label- or data-scarce scenarios. However, synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make the key observation that the diversity of the learned feature embeddings plays an important role in generalization performance. To this end, we propose contrastive synthetic-to-real generalization (CSG), a novel framework that leverages pre-trained ImageNet knowledge to prevent overfitting to the synthetic domain, while promoting the diversity of feature embeddings as an inductive bias to improve generalization. In addition, we enhance the proposed CSG framework with attentional pooling (A-pool) to let the model focus on semantically important regions and further improve its generalization. We demonstrate the effectiveness of CSG on various synthetic training tasks, exhibiting state-of-the-art performance on zero-shot domain generalization.
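Attentional pooling can be read as replacing global average pooling with a learned, location-weighted sum, so that semantically important regions dominate the pooled embedding. The module below is one plausible instantiation; the 1x1 scoring head is an assumption, not necessarily the paper's exact A-pool design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionalPool(nn.Module):
    """Attention-weighted spatial pooling (an illustrative reading of A-pool)."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # per-location importance

    def forward(self, x):                   # x: (B, C, H, W) feature map
        w = self.score(x).flatten(2)        # (B, 1, H*W) location scores
        w = F.softmax(w, dim=-1)            # normalize over spatial positions
        v = x.flatten(2)                    # (B, C, H*W)
        return (v * w).sum(dim=-1)          # (B, C) attention-pooled embedding
```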
Abstract:Tracking segmentation masks of multiple instances has been intensively studied, but still faces two fundamental challenges: 1) the requirement of large-scale, frame-wise annotation, and 2) the complexity of two-stage approaches. To resolve these challenges, we introduce a novel semi-supervised framework that learns instance tracking networks with only a labeled image dataset and unlabeled video sequences. With an instance contrastive objective, we learn an embedding that discriminates each instance from the others. We show that even when trained only with images, the learned feature representation is robust to instance appearance variations and is thus able to track objects steadily across frames. We further enhance the tracking capability of the embedding by learning correspondence from unlabeled videos in a self-supervised manner. In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks, which significantly reduces the computational complexity of tracking compared to two-stage networks. We conduct experiments on the YouTube-VIS and PoseTrack datasets. Without any video annotation efforts, our proposed method achieves comparable or even better performance than most fully-supervised methods.
Abstract:How to represent an image? While the visual world is presented in a continuous manner, machines store and see images in a discrete way, as 2D arrays of pixels. In this paper, we seek to learn a continuous representation for images. Inspired by recent progress in 3D reconstruction with implicit functions, we propose the Local Implicit Image Function (LIIF), which takes an image coordinate and the 2D deep features around that coordinate as inputs, and predicts the RGB value at the coordinate as output. Since the coordinates are continuous, LIIF can be presented at arbitrary resolution. To generate the continuous representation for pixel-based images, we train an encoder together with the LIIF representation via a self-supervised super-resolution task. The learned continuous representation can be presented at arbitrary resolution, even extrapolating to $\times 30$ higher resolution than provided in the training tasks. We further show that the LIIF representation builds a bridge between discrete and continuous representations in 2D: it naturally supports learning tasks with size-varied image ground truths and significantly outperforms methods that resize the ground truths. Our project page with code is at https://yinboc.github.io/liif/.
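The core mechanism is a decoder MLP queried at continuous coordinates over an encoder feature map. The sketch below is a simplified version under stated assumptions: hidden sizes are illustrative, and the method's feature unfolding, exact local offsets, and cell decoding are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LIIFDecoder(nn.Module):
    """Minimal LIIF-style decoder: (local feature, query coordinate) -> RGB."""
    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, feat_map, coords):
        """feat_map: (B, C, H, W) encoder features; coords: (B, Q, 2) in [-1, 1]."""
        # Sample the nearest latent code for each continuous query coordinate.
        feats = F.grid_sample(
            feat_map, coords.unsqueeze(1), mode='nearest', align_corners=False
        ).squeeze(2).transpose(1, 2)                          # (B, Q, C)
        # Condition on the raw query coordinate; the paper instead uses the
        # relative offset to the sampled code's center.
        return self.mlp(torch.cat([feats, coords], dim=-1))   # (B, Q, 3) RGB
```

Because any coordinate in [-1, 1] can be queried, rendering at a higher resolution is just querying a denser grid, which is what enables the arbitrary-resolution and extrapolation behavior described above.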