Zhiguo Cao

Semi-Supervised Class-Agnostic Motion Prediction with Pseudo Label Regeneration and BEVMix

Dec 14, 2023

Kewei Wang, Yizheng Wu, Zhiyu Pan, Xingyi Li, Ke Xian, Zhe Wang, Zhiguo Cao, Guosheng Lin

Abstract:Class-agnostic motion prediction methods aim to comprehend motion within open-world scenarios, holding significance for autonomous driving systems. However, training a high-performance model in a fully-supervised manner always requires substantial amounts of manually annotated data, which can be both expensive and time-consuming to obtain. To address this challenge, our study explores the potential of semi-supervised learning (SSL) for class-agnostic motion prediction. Our SSL framework adopts a consistency-based self-training paradigm, enabling the model to learn from unlabeled data by generating pseudo labels through test-time inference. To improve the quality of pseudo labels, we propose a novel motion selection and re-generation module. This module effectively selects reliable pseudo labels and re-generates unreliable ones. Furthermore, we propose two data augmentation strategies: temporal sampling and BEVMix. These strategies facilitate consistency regularization in SSL. Experiments conducted on nuScenes demonstrate that our SSL method can surpass the self-supervised approach by a large margin by utilizing only a tiny fraction of labeled data. Furthermore, our method exhibits comparable performance to weakly and some fully supervised methods. These results highlight the ability of our method to strike a favorable balance between annotation costs and performance. Code will be available at https://github.com/kwwcv/SSMP.

* This paper is accepted by AAAI2024

End-to-end Video Gaze Estimation via Capturing Head-face-eye Spatial-temporal Interaction Context

Nov 01, 2023

Yiran Guan, Zhuoguang Chen, Wenzheng Zeng, Zhiguo Cao, Yang Xiao

Abstract:In this letter, we propose a new method, Multi-Clue Gaze (MCGaze), for video gaze estimation that captures the spatial-temporal interaction context among head, face, and eye in an end-to-end manner, an aspect that has not yet been well studied. The main advantage of MCGaze is that the clue localization tasks for head, face, and eye are solved jointly with gaze estimation in a single step, with joint optimization to seek optimal performance. During this process, spatial-temporal context is exchanged among the head, face, and eye clues. Accordingly, the final gazes, obtained by fusing features from the various queries, are aware of global clues from heads and faces and local clues from eyes simultaneously, which improves performance. Meanwhile, the one-step design also ensures high running efficiency. Experiments on the challenging Gaze360 dataset verify the superiority of the proposed method. The source code will be released at https://github.com/zgchen33/MCGaze.

* 5 pages, 3 figures, 3 tables

When Epipolar Constraint Meets Non-local Operators in Multi-View Stereo

Sep 29, 2023

Tianqi Liu, Xinyi Ye, Weiyue Zhao, Zhiyu Pan, Min Shi, Zhiguo Cao

Abstract:Learning-based multi-view stereo (MVS) methods rely heavily on feature matching, which requires distinctive and descriptive representations. An effective solution is to apply non-local feature aggregation, e.g., Transformer. Albeit useful, these techniques introduce heavy computation overheads for MVS because each pixel densely attends to the whole image. In contrast, we propose to constrain non-local feature augmentation within a pair of lines: each point attends only to the corresponding pair of epipolar lines. Our idea takes inspiration from classic epipolar geometry, which shows that one point with different depth hypotheses will be projected onto the epipolar line in the other view. This constraint reduces the 2D search space to the epipolar line in stereo matching. Similarly, this suggests that matching in MVS amounts to distinguishing a series of points lying on the same line. Inspired by this point-to-line search, we devise a line-to-point non-local augmentation strategy. We first devise an optimized searching algorithm to split the 2D feature maps into epipolar line pairs. Then, an Epipolar Transformer (ET) performs non-local feature augmentation among epipolar line pairs. We incorporate the ET into a learning-based MVS baseline, named ET-MVSNet. ET-MVSNet achieves state-of-the-art reconstruction performance on both the DTU and Tanks-and-Temples benchmarks with high efficiency. Code is available at https://github.com/TQTQliu/ET-MVSNet.

* ICCV2023
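
The epipolar property the abstract above relies on (one reference pixel, tried at several depth hypotheses, lands on a single line in the other view) can be checked numerically. The sketch below is only an illustration under assumed conditions, not the ET-MVSNet code: it uses a simple pinhole camera, a source view rotated about the vertical axis, and made-up intrinsics and poses. All function names are hypothetical.

```python
import math

def backproject(u, v, depth, f, cx, cy):
    """Lift pixel (u, v) of the reference view to a 3D point at the given depth."""
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)

def to_source(point, theta, t):
    """Move a reference-frame point into the source frame.

    Assumption: the relative pose is a rotation by `theta` about the y axis
    followed by a translation `t`.
    """
    x, y, z = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x + s * z + t[0], y + t[1], -s * x + c * z + t[2])

def project(point, f, cx, cy):
    """Pinhole projection of a 3D point (z > 0) to pixel coordinates."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)

def epipolar_samples(u, v, depths, f, cx, cy, theta, t):
    """Project one reference pixel into the source view, once per depth hypothesis."""
    return [
        project(to_source(backproject(u, v, d, f, cx, cy), theta, t), f, cx, cy)
        for d in depths
    ]

def cross2d(p, q, r):
    """Twice the signed area of triangle p, q, r; zero when the points are collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

# Made-up camera setup: focal length 500 px, principal point (320, 240).
samples = epipolar_samples(400.0, 300.0, [2.0, 4.0, 8.0],
                           500.0, 320.0, 240.0, 0.1, (0.5, 0.1, 0.0))
is_collinear = abs(cross2d(*samples)) < 1e-6
```

Because the three projections are collinear up to rounding error, a matcher only needs to compare features along that line rather than over the whole image, which is the search-space reduction the abstract describes.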

Learning to Upsample by Learning to Sample

Aug 29, 2023

Wenze Liu, Hao Lu, Hongtao Fu, Zhiguo Cao

Abstract:We present DySample, an ultra-lightweight and effective dynamic upsampler. While impressive performance gains have been witnessed from recent kernel-based dynamic upsamplers such as CARAFE, FADE, and SAPA, they introduce a heavy workload, mostly due to the time-consuming dynamic convolution and the additional sub-network used to generate dynamic kernels. Further, the need of FADE and SAPA for high-res feature guidance somewhat limits their application scenarios. To address these concerns, we bypass dynamic convolution and formulate upsampling from the perspective of point sampling, which is more resource-efficient and can be easily implemented with the standard built-in function in PyTorch. We first showcase a naive design, and then demonstrate how to strengthen its upsampling behavior step by step towards our new upsampler, DySample. Compared with former kernel-based dynamic upsamplers, DySample requires no customized CUDA package and uses far fewer parameters and FLOPs, less GPU memory, and lower latency. Besides its lightweight characteristics, DySample outperforms other upsamplers across five dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, and monocular depth estimation. Code is available at https://github.com/tiny-smart/dysample.

* Accepted by ICCV 2023
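
The point-sampling view of upsampling described in the abstract above can be sketched in plain Python. This is a minimal illustration, not the DySample implementation: in the paper the sampling offsets are predicted from the features, whereas here they come from a hypothetical `offset_fn` supplied by the caller, and the bilinear sampler is hand-written rather than a PyTorch built-in.

```python
import math

def bilinear_sample(feat, y, x):
    """Sample a 2D list `feat` at continuous (y, x), clamping to the border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def upsample_by_sampling(feat, scale, offset_fn):
    """Upsample `feat` by an integer `scale` by sampling one point per output pixel.

    `offset_fn(i, j)` is a hypothetical stand-in for a learned offset generator;
    it returns a (dy, dx) shift, in input-pixel units, for output pixel (i, j).
    """
    h, w = len(feat), len(feat[0])
    out = []
    for i in range(h * scale):
        row = []
        for j in range(w * scale):
            # Centre of output pixel (i, j) expressed in input coordinates.
            base_y = (i + 0.5) / scale - 0.5
            base_x = (j + 0.5) / scale - 0.5
            off_y, off_x = offset_fn(i, j)
            row.append(bilinear_sample(feat, base_y + off_y, base_x + off_x))
        out.append(row)
    return out

feat = [[0.0, 2.0], [4.0, 6.0]]
plain = upsample_by_sampling(feat, 2, lambda i, j: (0.0, 0.0))     # ordinary bilinear
shifted = upsample_by_sampling(feat, 2, lambda i, j: (0.0, 0.25))  # content-dependent in practice
```

With zero offsets the routine reduces to ordinary bilinear upsampling; non-zero offsets move each sampling point, which is where a dynamic upsampler gets its content-dependent behavior without any dynamic convolution kernel.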

Point-Query Quadtree for Crowd Counting, Localization, and More

Aug 26, 2023

Chengxin Liu, Hao Lu, Zhiguo Cao, Tongliang Liu

Abstract:We show that crowd counting can be viewed as a decomposable point querying process. This formulation enables arbitrary points as input and jointly reasons about whether the points are crowd and where they are located. The querying process, however, raises an underlying problem regarding the number of necessary querying points. Too few imply underestimation; too many increase computational overhead. To address this dilemma, we introduce a decomposable structure, i.e., the point-query quadtree, and propose a new counting model, termed Point quEry Transformer (PET). PET implements decomposable point querying via data-dependent quadtree splitting, where each querying point can split into four new points when necessary, thus enabling dynamic processing of sparse and dense regions. Such a querying process yields an intuitive, universal modeling of crowds, as both the input and output are interpretable and steerable. We demonstrate the applications of PET on a number of crowd-related tasks, including fully-supervised crowd counting and localization, partial annotation learning, and point annotation refinement, and also report state-of-the-art performance. For the first time, we show that a single counting model can address multiple crowd-related tasks across different learning paradigms. Code is available at https://github.com/cxliu0/PET.

* Accepted by ICCV 2023
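
The quadtree splitting idea in the abstract above (a query point splits into four finer points only where the region is dense) can be sketched as a short recursion. This is an assumption-laden illustration, not the PET model: PET makes the split decision in a data-dependent, learned way, while this sketch uses a hypothetical rule that counts known head positions inside each square. The names `count_inside`, `split_query`, and `min_count` are invented for the example.

```python
def count_inside(heads, cx, cy, half):
    """Count head positions inside the square centred at (cx, cy) with half-width `half`."""
    return sum(1 for hx, hy in heads
               if abs(hx - cx) <= half and abs(hy - cy) <= half)

def split_query(cx, cy, half, heads, max_depth, min_count=2):
    """Return the query points covering the square centred at (cx, cy).

    Hypothetical split rule: a square is "dense" when it contains at least
    `min_count` heads. Dense squares are replaced by their four quadrants;
    sparse squares keep a single query point at their centre.
    """
    if max_depth == 0 or count_inside(heads, cx, cy, half) < min_count:
        return [(cx, cy)]
    q = half / 2
    points = []
    for dy in (-q, q):
        for dx in (-q, q):
            points.extend(
                split_query(cx + dx, cy + dy, q, heads, max_depth - 1, min_count)
            )
    return points

# Two heads close together in one corner, one isolated head elsewhere.
heads = [(1.0, 1.0), (1.5, 1.5), (6.0, 6.0)]
queries = split_query(4.0, 4.0, 4.0, heads, max_depth=2)
```

In this toy scene the crowded corner receives four fine query points while each sparse quadrant keeps one, so the number of queries adapts to local density instead of being fixed in advance.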

Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image

Aug 20, 2023

Liao Shen, Xingyi Li, Huiqiang Sun, Juewen Peng, Ke Xian, Zhiguo Cao, Guosheng Lin

Abstract:We study the problem of synthesizing a long-term dynamic video from only a single image. This is challenging since it requires consistent visual content movements given large camera motions. Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories. To address these issues, it is essential to estimate the underlying 4D (including 3D geometry and scene motion) and fill in the occluded regions. To this end, we present Make-It-4D, a novel method that can generate a consistent long-term dynamic video from a single image. On the one hand, we utilize layered depth images (LDIs) to represent a scene, and they are then unprojected to form a feature point cloud. To animate the visual content, the feature point cloud is displaced based on the scene flow derived from motion estimation and the corresponding camera pose. Such 4D representation enables our method to maintain the global consistency of the generated dynamic video. On the other hand, we fill in the occluded regions by using a pretrained diffusion model to inpaint and outpaint the input image. This enables our method to work under large camera motions. Benefiting from our design, our method can be training-free which saves a significant amount of training time. Experimental results demonstrate the effectiveness of our approach, which showcases compelling rendering results.

* accepted by ACM MM'23

Neural Video Depth Stabilizer

Aug 10, 2023

Yiran Wang, Min Shi, Jiaqi Li, Zihao Huang, Zhiguo Cao, Jianming Zhang, Ke Xian, Guosheng Lin

Abstract:Video depth estimation aims to infer temporally consistent depth. Some methods achieve temporal consistency by finetuning a single-image depth model during test time using geometry and re-projection constraints, which is inefficient and not robust. An alternative approach is to learn how to enforce temporal consistency from data, but this requires well-designed models and sufficient video depth data. To address these challenges, we propose a plug-and-play framework called Neural Video Depth Stabilizer (NVDS) that stabilizes inconsistent depth estimations and can be applied to different single-image depth models without extra effort. We also introduce a large-scale dataset, Video Depth in the Wild (VDW), which consists of 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset to our knowledge. We evaluate our method on the VDW dataset as well as two public benchmarks and demonstrate significant improvements in consistency, accuracy, and efficiency compared to previous approaches. Our work serves as a solid baseline and provides a data foundation for learning-based video depth models. We will release our dataset and code for future research.

* Accepted by ICCV2023

Diffusion-Augmented Depth Prediction with Sparse Annotations

Aug 04, 2023

Jiaqi Li, Yiran Wang, Zihao Huang, Jinghong Zheng, Ke Xian, Zhiguo Cao, Jianming Zhang

Abstract:Depth estimation aims to predict dense depth maps. In autonomous driving scenes, the sparsity of annotations makes the task challenging. Supervised models produce concave objects due to insufficient structural information: they overfit to valid pixels and fail to restore spatial structures. Self-supervised methods have been proposed for this problem, but their robustness is limited by pose estimation, leading to erroneous results in natural scenes. In this paper, we propose a supervised framework termed Diffusion-Augmented Depth Prediction (DADP). We leverage the structural characteristics of diffusion models to enforce depth structures in depth models in a plug-and-play manner. An object-guided integrality loss is also proposed to further enhance regional structure integrality by fetching objective information. We evaluate DADP on three driving benchmarks and achieve significant improvements in depth structures and robustness. Our work provides a new perspective on depth estimation with sparse annotations in autonomous driving scenes.

* Accepted by ACM MM'2023

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

Aug 03, 2023

Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao (+4 more)

Abstract:We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world. Using a scalable data engine that incorporates human feedback and efficient models in the loop, we create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions. It covers a wide range of 3.5 million common and rare concepts in the real world, and has 132.2 billion tokens that describe the concepts and their attributes. Leveraging this new dataset, we develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding. The model is trained with open-ended language prompts and locations, which allows it to generalize to various vision and language tasks with remarkable zero-shot performance, including region-text retrieval, region recognition, captioning, and question-answering. We hope that this project can serve as a foundation for vision-language artificial general intelligence research. Models and the dataset shall be released at https://github.com/OpenGVLab/All-Seeing, and demo can be seen at https://huggingface.co/spaces/OpenGVLab/all-seeing.

* Technical Report

Fast Full-frame Video Stabilization with Iterative Optimization

Jul 31, 2023

Weiyue Zhao, Xin Li, Zhan Peng, Xianrui Luo, Xinyi Ye, Hao Lu, Zhiguo Cao

Abstract:Video stabilization refers to the problem of transforming a shaky video into a visually pleasing one. The question of how to strike a good trade-off between visual quality and computational speed has remained one of the open challenges in video stabilization. Inspired by the analogy between wobbly frames and jigsaw puzzles, we propose an iterative optimization-based learning approach using synthetic datasets for video stabilization, which consists of two interacting submodules: motion trajectory smoothing and full-frame outpainting. First, we develop a two-level (coarse-to-fine) stabilizing algorithm based on the probabilistic flow field. The confidence map associated with the estimated optical flow is exploited to guide the search for shared regions through backpropagation. Second, we take a divide-and-conquer approach and propose a novel multiframe fusion strategy to render full-frame stabilized views. An important new insight brought about by our iterative optimization approach is that the target video can be interpreted as the fixed point of nonlinear mapping for video stabilization. We formulate video stabilization as a problem of minimizing the amount of jerkiness in motion trajectories, which guarantees convergence with the help of fixed-point theory. Extensive experimental results are reported to demonstrate the superiority of the proposed approach in terms of computational speed and visual quality. The code will be available on GitHub.

* Accepted by ICCV2023
