Yongming Rao

TCOVIS: Temporally Consistent Online Video Instance Segmentation

Sep 21, 2023
Junlong Li, Bingyao Yu, Yongming Rao, Jie Zhou, Jiwen Lu

In recent years, significant progress has been made in video instance segmentation (VIS), with many offline and online methods achieving state-of-the-art performance. While offline methods have the advantage of producing temporally consistent predictions, they are not suitable for real-time scenarios. Conversely, online methods are more practical, but maintaining temporal consistency remains a challenging task. In this paper, we propose a novel online method for video instance segmentation, called TCOVIS, which fully exploits the temporal information in a video clip. The core of our method consists of a global instance assignment strategy and a spatio-temporal enhancement module, which improve the temporal consistency of the features from two aspects. Specifically, we perform global optimal matching between the predictions and ground truth across the whole video clip, and supervise the model with the global optimal objective. We also capture spatial features and aggregate them with semantic features across frames, thus realizing spatio-temporal enhancement. We evaluate our method on four widely adopted VIS benchmarks, namely YouTube-VIS 2019/2021/2022 and OVIS, and achieve state-of-the-art performance on all benchmarks without bells and whistles. For instance, on YouTube-VIS 2021, TCOVIS achieves 49.5 AP and 61.3 AP with ResNet-50 and Swin-L backbones, respectively. Code is available at https://github.com/jun-long-li/TCOVIS.
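
The global instance assignment strategy can be illustrated with a minimal sketch: per-frame matching costs between instance queries and ground-truth instances are summed over the clip before a single Hungarian assignment, so every query keeps the same identity across frames. The cost terms, shapes, and weights below are placeholder assumptions, not the paper's actual loss.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_instance_assignment(per_frame_costs):
    """One clip-level matching between instance queries and ground-truth instances.

    per_frame_costs: list of (num_queries, num_gt) cost matrices, one per frame,
    e.g. combining classification and mask costs (placeholder definition).
    """
    # Accumulate per-frame costs so the assignment is shared by the whole clip,
    # instead of re-matching (and possibly re-ordering) instances every frame.
    clip_cost = np.stack(per_frame_costs, axis=0).sum(axis=0)
    query_idx, gt_idx = linear_sum_assignment(clip_cost)  # global optimal matching
    return list(zip(query_idx.tolist(), gt_idx.tolist()))

# Toy example: 4 queries, 2 ground-truth instances, 3 frames.
rng = np.random.default_rng(0)
print(global_instance_assignment([rng.random((4, 2)) for _ in range(3)]))
```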

* 11 pages, 4 figures. This paper has been accepted for ICCV 2023 

Take-A-Photo: 3D-to-2D Generative Pre-training of Point Cloud Models

Jul 27, 2023
Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, Jiwen Lu

With the overwhelming trend of masked image modeling led by MAE, generative pre-training has shown remarkable potential to boost the performance of foundation models in 2D vision. However, in 3D vision, the over-reliance on Transformer-based backbones and the unordered nature of point clouds have restricted the further development of generative pre-training. In this paper, we propose a novel 3D-to-2D generative pre-training method that is adaptable to any point cloud model. We propose to generate view images from different instructed poses via the cross-attention mechanism as the pre-training scheme. Generating view images provides more precise supervision than its point cloud counterpart, thus helping 3D backbones gain a finer comprehension of the geometric structure and stereoscopic relations of the point cloud. Experimental results demonstrate the superiority of our proposed 3D-to-2D generative pre-training over previous pre-training methods. Our method is also effective in boosting the performance of architecture-oriented approaches, achieving state-of-the-art performance when fine-tuning on ScanObjectNN classification and ShapeNetPart segmentation tasks. Code is available at https://github.com/wangzy22/TAP.
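
The pre-training scheme ("generate a view image from an instructed pose via cross-attention") can be sketched as pose-conditioned query tokens attending to point-cloud features from an arbitrary 3D backbone. All module sizes, the 6-D pose parameterization, and the per-patch pixel head below are illustrative assumptions, not the TAP architecture.

```python
import torch
import torch.nn as nn

class PoseConditionedViewDecoder(nn.Module):
    """Toy 3D-to-2D decoder: pose-conditioned queries cross-attend to point features."""

    def __init__(self, dim=256, num_queries=196, patch=16):
        super().__init__()
        self.query_embed = nn.Parameter(torch.randn(num_queries, dim))   # one token per image patch
        self.pose_mlp = nn.Sequential(nn.Linear(6, dim), nn.GELU(), nn.Linear(dim, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.pixel_head = nn.Linear(dim, patch * patch * 3)              # RGB patch per token

    def forward(self, point_feats, pose):
        # point_feats: (B, N, dim) from any point cloud backbone; pose: (B, 6) view "instruction".
        B = point_feats.shape[0]
        queries = self.query_embed.unsqueeze(0).expand(B, -1, -1) + self.pose_mlp(pose).unsqueeze(1)
        out, _ = self.cross_attn(queries, point_feats, point_feats)      # gather geometry per patch
        return self.pixel_head(out)                                      # reshape -> predicted view image

feats, pose = torch.randn(2, 1024, 256), torch.randn(2, 6)
print(PoseConditionedViewDecoder()(feats, pose).shape)                   # torch.Size([2, 196, 768])
```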

* Accepted to ICCV 2023, project page: https://tap.ivg-research.xyz 

Unleashing Text-to-Image Diffusion Models for Visual Perception

Mar 03, 2023
Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, Jiwen Lu

Diffusion models (DMs) have become the new trend of generative models and have demonstrated a powerful ability in conditional synthesis. Among them, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly controllable by customizable prompts. Unlike unconditional generative models that focus on low-level attributes and details, text-to-image diffusion models contain more high-level knowledge thanks to the vision-language pre-training. In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks. Instead of using the pre-trained denoising autoencoder in a diffusion-based pipeline, we simply use it as a backbone and aim to study how to take full advantage of the learned knowledge. Specifically, we prompt the denoising decoder with proper textual inputs and refine the text features with an adapter, leading to better alignment with the pre-training stage and making the visual contents interact with the text prompts. We also propose to utilize the cross-attention maps between the visual features and the text features to provide explicit guidance. Compared with other pre-training methods, we show that vision-language pre-trained diffusion models can be adapted faster to downstream visual perception tasks using the proposed VPD. Extensive experiments on semantic segmentation, referring image segmentation and depth estimation demonstrate the effectiveness of our method. Notably, VPD attains 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO-val referring image segmentation, establishing new records on these two benchmarks. Code is available at https://github.com/wl-zhao/VPD
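
Two ingredients the abstract names, the text adapter and the class-wise cross-attention maps used as explicit guidance, can be sketched generically as follows. The embedding dimension, the adapter shape, and the way the maps are computed here are assumptions for illustration; the actual VPD code operates inside the pre-trained denoising UNet (see the project page).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAdapter(nn.Module):
    """Residual MLP that refines class-name text embeddings before they prompt the denoiser."""
    def __init__(self, dim=768, hidden=384):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, text_feats):                 # (num_classes, dim)
        return text_feats + self.mlp(text_feats)   # residual refinement

def cross_attention_maps(visual_feats, text_feats):
    """Class-wise attention maps used as explicit semantic guidance.

    visual_feats: (B, H*W, dim) features from the denoising backbone (assumed shape)
    text_feats:   (num_classes, dim) adapted text embeddings
    returns:      (B, num_classes, H*W) softmax-normalized maps
    """
    scale = visual_feats.shape[-1] ** -0.5
    logits = torch.einsum('bnd,cd->bcn', visual_feats, text_feats) * scale
    return F.softmax(logits, dim=-1)

adapter = TextAdapter()
text = adapter(torch.randn(20, 768))               # e.g. 20 segmentation classes
maps = cross_attention_maps(torch.randn(2, 32 * 32, 768), text)
print(maps.shape)                                  # torch.Size([2, 20, 1024])
```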

* project page: https://vpd.ivg-research.xyz 

UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models

Feb 12, 2023
Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, Jiwen Lu

Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM usually requires hundreds of model evaluations, which is computationally expensive. Despite recent progress in designing high-order solvers for DPMs, there still exists room for further speedup, especially with extremely few steps (e.g., 5~10 steps). Inspired by predictor-corrector methods for ODE solvers, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256$\times$256 (conditional) with only 10 function evaluations. Code is available at https://github.com/wl-zhao/UniPC
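
For readers unfamiliar with the predictor-corrector idea the framework builds on, here is a generic Heun-style step for an ODE dx/dt = f(x, t): a cheap predictor proposes the next state and a corrector refines it. This is only the textbook pattern, not UniPC's update rule; in particular, UniC reuses the model evaluation already made by the sampler, whereas the naive corrector below spends one extra evaluation.

```python
import numpy as np

def predictor_corrector_step(x, t, t_next, ode_fn):
    """One Heun-style predictor-corrector step for dx/dt = ode_fn(x, t)."""
    h = t_next - t
    d = ode_fn(x, t)
    x_pred = x + h * d                     # predictor: explicit Euler proposal
    d_next = ode_fn(x_pred, t_next)        # evaluate the model at the proposal
    return x + 0.5 * h * (d + d_next)      # corrector: trapezoidal refinement

# Toy check on dx/dt = -x, whose exact solution at t=1 is exp(-1).
x, ts = np.array([1.0]), np.linspace(0.0, 1.0, 6)
for t0, t1 in zip(ts[:-1], ts[1:]):
    x = predictor_corrector_step(x, t0, t1, lambda x, t: -x)
print(float(x), np.exp(-1.0))              # ~0.3707 vs. 0.3679
```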

* Project page: https://unipc.ivg-research.xyz 

AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers

Jan 11, 2023
Xumin Yu, Yongming Rao, Ziyi Wang, Jiwen Lu, Jie Zhou

In this paper, we present a new method that reformulates point cloud completion as a set-to-set translation problem, and we design a new model, called PoinTr, which adopts a Transformer encoder-decoder architecture for point cloud completion. By representing the point cloud as a set of unordered groups of points with position embeddings, we convert the input data to a sequence of point proxies and employ the Transformer for generation. To help the Transformer better leverage the inductive bias about 3D geometric structures of point clouds, we further devise a geometry-aware block that models the local geometric relationships explicitly. The migration of Transformers enables our model to better learn structural knowledge and preserve detailed information for point cloud completion. Taking a step towards more complicated and diverse situations, we further propose AdaPoinTr by developing an adaptive query generation mechanism and designing a novel denoising task during point cloud completion. Coupling these two techniques enables us to train the model efficiently and effectively: we reduce training time (by 15x or more) and improve completion performance (over 20%). We also show that our method can be extended to the scene-level point cloud completion scenario by designing a new geometry-enhanced semantic scene completion framework. Extensive experiments on the existing and newly proposed datasets demonstrate the effectiveness of our method, which attains 6.53 CD on PCN, 0.81 CD on ShapeNet-55 and 0.392 MMD on real-world KITTI, surpassing other work by a large margin and establishing new state-of-the-art results on various benchmarks. Most notably, AdaPoinTr can achieve such promising performance with higher throughput and fewer FLOPs compared with the previous best methods in practice. The code and datasets are available at https://github.com/yuxumin/PoinTr
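
The "point proxy" representation, local groups of points with position embeddings that form the Transformer's input sequence, can be sketched with farthest point sampling, kNN grouping, and a small MLP as below. The sampling procedure, group size, and feature dimensions are assumptions for illustration, not the PoinTr/AdaPoinTr implementation.

```python
import torch
import torch.nn as nn

def farthest_point_sample(xyz, m):
    """Greedy FPS: pick m well-spread center points from an (N, 3) cloud."""
    N = xyz.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((N,), float('inf'))
    farthest = torch.randint(N, (1,)).item()
    for i in range(m):
        idx[i] = farthest
        dist = torch.minimum(dist, ((xyz - xyz[farthest]) ** 2).sum(-1))
        farthest = int(dist.argmax())
    return idx

class PointProxyEncoder(nn.Module):
    """Turn a raw point cloud into a short sequence of 'point proxies' for a Transformer."""
    def __init__(self, dim=384, k=32):
        super().__init__()
        self.k = k
        self.feat_mlp = nn.Sequential(nn.Linear(3, 128), nn.GELU(), nn.Linear(128, dim))
        self.pos_embed = nn.Sequential(nn.Linear(3, 128), nn.GELU(), nn.Linear(128, dim))

    def forward(self, xyz, num_proxies=128):
        centers = xyz[farthest_point_sample(xyz, num_proxies)]               # (M, 3)
        d = torch.cdist(centers, xyz)                                        # (M, N)
        knn = xyz[d.topk(self.k, largest=False).indices]                     # (M, k, 3)
        local = self.feat_mlp(knn - centers.unsqueeze(1)).max(dim=1).values  # pooled local geometry
        return local + self.pos_embed(centers)                               # proxies: (M, dim)

proxies = PointProxyEncoder()(torch.randn(2048, 3))
print(proxies.shape)  # torch.Size([128, 384])
```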

* Extension of our ICCV 2021 work: arXiv:2108.08839 . Code is available at https://github.com/yuxumin/PoinTr 

FLAG3D: A 3D Fitness Activity Dataset with Language Instruction

Dec 09, 2022
Yansong Tang, Jinpeng Liu, Aoyang Liu, Bin Yang, Wenxun Dai, Yongming Rao, Jiwen Lu, Jie Zhou, Xiu Li

With its continuously growing popularity around the world, fitness activity analysis has become an emerging research topic in computer vision. While a variety of new tasks and algorithms have been proposed recently, there is a growing hunger for data resources with high-quality data, fine-grained labels, and diverse environments. In this paper, we present FLAG3D, a large-scale 3D fitness activity dataset with language instruction containing 180K sequences of 60 categories. FLAG3D features the following three aspects: 1) accurate and dense 3D human poses captured with an advanced MoCap system to handle complex activities and large movements, 2) detailed and professional language instructions describing how to perform a specific activity, 3) versatile video resources from a high-tech MoCap system, rendering software, and cost-effective smartphones in natural environments. Extensive experiments and in-depth analysis show that FLAG3D offers great research value for various challenges, such as cross-domain human action recognition, dynamic human mesh recovery, and language-guided human action generation. Our dataset and source code will be publicly available at https://andytang15.github.io/FLAG3D.

Prompt Learning with Optimal Transport for Vision-Language Models

Oct 03, 2022
Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, Kun Zhang

With the increasing attention to large vision-language models such as CLIP, significant effort has been dedicated to building efficient prompts. Unlike conventional methods that learn only a single prompt, we propose to learn multiple comprehensive prompts to describe diverse characteristics of categories, such as intrinsic attributes or extrinsic contexts. However, directly matching each prompt to the same visual feature is problematic, as it pushes the prompts to converge to one point. To solve this problem, we propose to apply optimal transport to match the vision and text modalities. Specifically, we first model the images and the categories with visual and textual feature sets. Then, we apply a two-stage optimization strategy to learn the prompts. In the inner loop, we optimize the optimal transport distance to align visual features and prompts with the Sinkhorn algorithm, while in the outer loop, we learn the prompts from the supervised data using this distance. Extensive experiments are conducted on few-shot recognition tasks, and the improvements demonstrate the superiority of our method.
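
The inner-loop alignment rests on entropic optimal transport solved with the Sinkhorn algorithm; a minimal log-domain sketch is below. The cost definition (e.g., one minus cosine similarity between local visual features and prompt features), the uniform marginals, and the hyperparameters are assumptions for illustration.

```python
import math
import torch

def sinkhorn(cost, eps=0.1, n_iters=100):
    """Entropic OT plan between M local visual features and K prompts (uniform marginals).

    cost: (M, K) cost matrix, e.g. 1 - cosine similarity between the visual
    feature set of an image and the K learned prompts of one class (assumed).
    """
    M, K = cost.shape
    log_mu, log_nu = -math.log(M), -math.log(K)          # uniform marginals in log space
    u, v = torch.zeros(M), torch.zeros(K)
    log_kernel = -cost / eps
    for _ in range(n_iters):                             # alternating marginal projections
        u = log_mu - torch.logsumexp(log_kernel + v.unsqueeze(0), dim=1)
        v = log_nu - torch.logsumexp(log_kernel + u.unsqueeze(1), dim=0)
    plan = torch.exp(log_kernel + u.unsqueeze(1) + v.unsqueeze(0))
    return plan, (plan * cost).sum()                     # transport plan and OT distance

cost = torch.rand(49, 4)                                 # e.g. 7x7 feature map vs. 4 prompts
plan, dist = sinkhorn(cost)
print(plan.sum(dim=1)[:3], dist)                         # each row sums to 1/49
```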

HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

Aug 09, 2022
Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou, Ser-Nam Lim, Jiwen Lu

Recent progress in vision Transformers exhibits great success in various tasks driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind the vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv) that performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable: it is compatible with various variants of convolution and extends the second-order interactions in self-attention to arbitrary orders without introducing significant extra computation. $\textit{g}^\textit{n}$Conv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and a larger model size. Apart from the effectiveness in visual encoders, we also show that $\textit{g}^\textit{n}$Conv can be applied to task-specific decoders and consistently improves dense prediction performance with less computation. Our results demonstrate that $\textit{g}^\textit{n}$Conv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/HorNet
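
A simplified PyTorch sketch of the recursive gated convolution described above: a depthwise convolution supplies spatial context, and element-wise gating is applied recursively to raise the interaction order. The channel splits, kernel size, and absence of per-order scaling are simplifications of the description; consult the linked repository for the official gnConv.

```python
import torch
import torch.nn as nn

class gnConvSketch(nn.Module):
    """Simplified recursive gated convolution: high-order spatial interactions
    via one depthwise convolution and repeated element-wise gating."""

    def __init__(self, dim, order=3):
        super().__init__()
        # Channel width doubles with each interaction order, e.g. [dim//4, dim//2, dim].
        self.dims = [dim // 2 ** i for i in range(order)][::-1]
        self.proj_in = nn.Conv2d(dim, self.dims[0] + sum(self.dims), kernel_size=1)
        self.dwconv = nn.Conv2d(sum(self.dims), sum(self.dims), kernel_size=7,
                                padding=3, groups=sum(self.dims))    # spatial mixing
        self.pws = nn.ModuleList(nn.Conv2d(self.dims[i], self.dims[i + 1], kernel_size=1)
                                 for i in range(order - 1))
        self.proj_out = nn.Conv2d(self.dims[-1], dim, kernel_size=1)

    def forward(self, x):
        y = self.proj_in(x)
        gate, feats = torch.split(y, [self.dims[0], sum(self.dims)], dim=1)
        feats = torch.split(self.dwconv(feats), self.dims, dim=1)
        x = gate * feats[0]                      # first-order interaction
        for pw, f in zip(self.pws, feats[1:]):   # recursively raise the interaction order
            x = pw(x) * f
        return self.proj_out(x)

print(gnConvSketch(dim=64)(torch.randn(1, 64, 14, 14)).shape)  # torch.Size([1, 64, 14, 14])
```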

* project page: https://hornet.ivg-research.xyz 

P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting

Aug 04, 2022
Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, Jiwen Lu

Nowadays, pre-training big models on large-scale datasets has become a crucial topic in deep learning. The pre-trained models with high representation ability and transferability achieve great success and dominate many downstream tasks in natural language processing and 2D vision. However, it is non-trivial to promote such a pretraining-tuning paradigm to 3D vision, given the limited training data that are relatively inconvenient to collect. In this paper, we provide a new perspective of leveraging pre-trained 2D knowledge in the 3D domain to tackle this problem, tuning pre-trained image models with the novel Point-to-Pixel prompting for point cloud analysis at a minor parameter cost. Following the principle of prompt engineering, we transform point clouds into colorful images with geometry-preserved projection and geometry-aware coloring to adapt to pre-trained image models, whose weights are kept frozen during the end-to-end optimization of point cloud analysis tasks. We conduct extensive experiments to demonstrate that, when cooperating with our proposed Point-to-Pixel Prompting, a better pre-trained image model leads to consistently better performance in 3D vision. Benefiting from the prosperous development of the image pre-training field, our method attains 89.3% accuracy on the hardest setting of ScanObjectNN, surpassing conventional point cloud models with much fewer trainable parameters. Our framework also exhibits very competitive performance on ModelNet classification and ShapeNet Part Segmentation. Code is available at https://github.com/wangzy22/P2P
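
A toy version of the point-to-pixel idea: project the point cloud onto an image plane and color each hit pixel by a geometric attribute, so that a frozen image backbone can consume it. The orthographic projection and the fixed depth-based coloring here are placeholder assumptions; P2P learns a geometry-aware coloring end-to-end.

```python
import torch

def point_to_pixel(xyz, img_size=224):
    """Toy geometry-preserved projection with a fixed geometry-based coloring.

    xyz: (N, 3) point cloud. Projects x/y onto the image plane and colors each
    hit pixel by normalized depth (z); the real P2P learns the coloring module.
    """
    pts = (xyz - xyz.min(0).values) / (xyz.max(0).values - xyz.min(0).values + 1e-6)
    cols = (pts[:, 0] * (img_size - 1)).long()
    rows = (pts[:, 1] * (img_size - 1)).long()
    depth = pts[:, 2]

    img = torch.zeros(3, img_size, img_size)
    # Simple coloring: red-blue ramp over depth; overlapping points keep the last write.
    img[0, rows, cols] = depth
    img[2, rows, cols] = 1.0 - depth
    return img  # feed this image into a frozen, pre-trained image backbone

img = point_to_pixel(torch.randn(2048, 3))
print(img.shape)  # torch.Size([3, 224, 224])
```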

* project page: https://p2p.ivg-research.xyz/ 