Yali Wang

Parameter is Not All You Need: Starting from Non-Parametric Networks for 3D Point Cloud Analysis

Mar 14, 2023
Renrui Zhang, Liuhui Wang, Yali Wang, Peng Gao, Hongsheng Li, Jianbo Shi

We present a Non-parametric Network for 3D point cloud analysis, Point-NN, which consists of purely non-learnable components: farthest point sampling (FPS), k-nearest neighbors (k-NN), and pooling operations, with trigonometric functions. Surprisingly, it performs well on various 3D tasks, requiring no parameters or training, and even surpasses existing fully trained models. Starting from this basic non-parametric model, we propose two extensions. First, Point-NN can serve as a base architectural framework to construct Parametric Networks by simply inserting linear layers on top. Given the superior non-parametric foundation, the derived Point-PN exhibits a high performance-efficiency trade-off with only a few learnable parameters. Second, Point-NN can be regarded as a plug-and-play module for the already trained 3D models during inference. Point-NN captures the complementary geometric knowledge and enhances existing methods for different 3D benchmarks without re-training. We hope our work may cast a light on the community for understanding 3D point clouds with non-parametric methods. Code is available at https://github.com/ZrrSkywalker/Point-NN.

* Accepted by CVPR 2023. Code is available at https://github.com/ZrrSkywalker/Point-NN
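
The non-learnable building blocks named in the abstract above (farthest point sampling, k-NN grouping, trigonometric functions, pooling) can be illustrated with a small standard-library sketch. This is not the authors' implementation (see their repository for that): the function names, the frequencies in `trig_embed`, and the toy point set are assumptions chosen for illustration only.

```python
import math

def farthest_point_sampling(points, m, start=0):
    """Pick m indices so each new point is the one farthest from those already chosen."""
    chosen = [start]
    # min_dist[i] = squared distance from point i to its nearest chosen point
    min_dist = [math.dist(p, points[start]) ** 2 for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt]) ** 2
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen

def k_nearest_neighbors(points, center, k):
    """Indices of the k points closest to points[center] (the center itself included)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], points[center]))
    return order[:k]

def trig_embed(p, freqs=(1.0, 2.0)):
    """Parameter-free sin/cos encoding of a coordinate; the frequencies are assumed."""
    return [f(w * c) for c in p for w in freqs for f in (math.sin, math.cos)]

def max_pool(features):
    """Element-wise max over a list of equal-length feature vectors."""
    return [max(col) for col in zip(*features)]

# Toy point cloud (hypothetical data).
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0), (0.1, 0.0, 0.0)]
centers = farthest_point_sampling(points, 2)
# One pooled descriptor per sampled center, built from its 3 nearest neighbors.
local = [
    max_pool([trig_embed(points[j]) for j in k_nearest_neighbors(points, c, 3)])
    for c in centers
]
```

Nothing here is trained: every step is a fixed geometric or trigonometric operation, which is the property the abstract emphasizes.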

MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency

Dec 20, 2022
Mingye Xu, Mutian Xu, Tong He, Wanli Ouyang, Yali Wang, Xiaoguang Han, Yu Qiao

Masked Modeling (MM) has demonstrated widespread success in various vision challenges by reconstructing masked visual patches. Yet applying MM to large-scale 3D scenes remains an open problem due to data sparsity and scene complexity. The conventional random masking paradigm used for 2D images often causes a high risk of ambiguity when recovering the masked regions of 3D scenes. To this end, we propose a novel informative-preserved reconstruction, which explores local statistics to discover and preserve the representative structured points, effectively enhancing the pretext masking task for 3D scene understanding. Combined with progressive reconstruction, our method can concentrate on modeling regional geometry and suffers less ambiguity in masked reconstruction. In addition, scenes with progressive masking ratios can also serve to self-distill their intrinsic spatial consistency, which requires learning consistent representations from unmasked areas. By combining informative-preserved reconstruction on masked areas with consistency self-distillation from unmasked areas, we obtain a unified framework called MM-3DScene. We conduct comprehensive experiments on a host of downstream tasks. The consistent improvement (e.g., +6.1 mAP@0.5 on object detection and +2.2% mIoU on semantic segmentation) demonstrates the superiority of our approach.

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Dec 07, 2022
Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao

Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our methods obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.

* technical report
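
The abstract above names video-language contrastive learning as one pretraining objective without giving its form. As a hedged illustration, the sketch below computes a symmetric InfoNCE-style loss over paired video and text embeddings, which is a common way to write such an objective; the loss form, the temperature value, and the function names are assumptions and are not taken from the InternVideo code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; the i-th video and i-th text form the positive pair."""
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    n = len(v)
    # sim[i][j] = scaled cosine similarity between video i and text j
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    video_to_text = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    text_to_video = sum(
        cross_entropy_row([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (video_to_text + text_to_video)

# Toy embeddings (hypothetical): matched pairs versus deliberately swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each video is most similar to its own caption and large when the pairing is wrong, which is what pulls the two modalities into a shared embedding space.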

UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

Nov 17, 2022
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, Yu Qiao

Learning discriminative spatiotemporal representation is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependency with self-attention. Unfortunately, they exhibit limitations in tackling local video redundancy, due to the blind global comparison among tokens. UniFormer has successfully alleviated this issue by unifying convolution and self-attention as a relation aggregator in the transformer format. However, this model requires a tiresome and complicated image-pretraining phase before being finetuned on videos, which blocks its wide usage in practice. In contrast, open-sourced ViTs are readily available and well-pretrained with rich image supervision. Based on these observations, we propose a generic paradigm to build a powerful family of video networks, by arming the pretrained ViTs with efficient UniFormer designs. We call this family UniFormerV2, since it inherits the concise style of the UniFormer block, but it contains brand-new local and global relation aggregators, which allow for a preferable accuracy-computation balance by seamlessly integrating advantages from both ViTs and UniFormer. Without any bells and whistles, our UniFormerV2 achieves state-of-the-art recognition performance on 8 popular video benchmarks, including scene-related Kinetics-400/600/700 and Moments in Time, temporal-related Something-Something V1/V2, and untrimmed ActivityNet and HACS. In particular, it is the first model to achieve 90% top-1 accuracy on Kinetics-400, to the best of our knowledge. Code will be available at https://github.com/OpenGVLab/UniFormerV2.

* 24 pages, 4 figures, 20 tables

InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges

Nov 17, 2022
Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei Huang, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, Limin Wang, Yu Qiao

In this report, we present our champion solutions to five tracks of the Ego4D challenge. We leverage our developed InternVideo, a video foundation model, for five Ego4D tasks, including Moment Queries, Natural Language Queries, Future Hand Prediction, State Change Object Detection, and Short-term Object Interaction Anticipation. InternVideo-Ego4D is an effective paradigm to adapt the strong foundation model to the downstream ego-centric video understanding tasks with simple head designs. In these five tasks, the performance of InternVideo-Ego4D comprehensively surpasses the baseline methods and the champions of CVPR2022, demonstrating the powerful representation ability of InternVideo as a video foundation model. Our code will be released at https://github.com/OpenGVLab/ego4d-eccv2022-solutions

* Technical report in 2nd International Ego4D Workshop@ECCV 2022. Code will be released at https://github.com/OpenGVLab/ego4d-eccv2022-solutions

VideoPipe 2022 Challenge: Real-World Video Understanding for Urban Pipe Inspection

Oct 20, 2022
Yi Liu, Xuan Zhang, Ying Li, Guixin Liang, Yabing Jiang, Lixia Qiu, Haiping Tang, Fei Xie, Wei Yao, Yi Dai, Yu Qiao, Yali Wang

Video understanding is an important problem in computer vision. Currently, the well-studied task in this area is human action recognition, where the clips are manually trimmed from long videos, and a single class of human action is assumed for each clip. However, we may face more complicated scenarios in industrial applications. For example, in real-world urban pipe systems, anomaly defects are fine-grained, multi-labeled, and domain-relevant. To recognize them correctly, we need to understand the detailed video content. For this reason, we propose to advance research in video understanding, with a shift from traditional action recognition to industrial anomaly analysis. In particular, we introduce two high-quality video benchmarks, namely QV-Pipe and CCTV-Pipe, for anomaly inspection in real-world urban pipe systems. Based on these new datasets, we will host two competitions: (1) Video Defect Classification on QV-Pipe and (2) Temporal Defect Localization on CCTV-Pipe. In this report, we describe the details of these benchmarks, the problem definitions of the competition tracks, the evaluation metric, and the result summary. We expect that this competition will bring new opportunities and challenges for video understanding in smart cities and beyond. The details of our VideoPipe challenge can be found at https://videopipe.github.io.

* VideoPipe Challenge @ ICPR2022. Homepage: https://videopipe.github.io/

Low-Resolution Action Recognition for Tiny Actions Challenge

Sep 28, 2022
Boyu Chen, Yu Qiao, Yali Wang

The Tiny Actions Challenge focuses on understanding human activities in real-world surveillance. There are two main difficulties for activity recognition in this scenario. First, human activities are often recorded at a distance and appear at a small resolution with few discriminative cues. Second, these activities are naturally distributed in a long-tailed way, and it is hard to alleviate data bias under such heavy category imbalance. To tackle these problems, we propose a comprehensive recognition solution in this paper. First, we train video backbones with data balance, in order to alleviate overfitting on the challenge benchmark. Second, we design a dual-resolution distillation framework, which can effectively guide low-resolution action recognition with super-resolution knowledge. Finally, we apply model ensemble with post-processing, which can further boost performance on the long-tailed categories. Our solution ranks Top-1 on the leaderboard.

* This article is the report of the CVPR 2022 ActivityNet workshop Tiny Actions Challenge (https://tinyactions-cvpr22.github.io/). The time of the first submission to the organizers is June 6th.
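
The abstract above describes a dual-resolution distillation framework only at a high level. One generic way to let a higher-resolution branch guide a low-resolution one is temperature-softened logit distillation, sketched below. Treating the higher-resolution branch as the teacher, the KL-divergence form, the temperature, and all names and numbers here are assumptions for illustration, not the authors' method.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened class distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher: assumed higher-resolution branch
    q = softmax(student_logits, temperature)  # student: assumed low-resolution branch
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits over three action classes.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]
loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

In practice a term like this would be added to the ordinary classification loss of the low-resolution model, so the student is trained both on labels and on the teacher's softened predictions.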

CP3: Unifying Point Cloud Completion by Pretrain-Prompt-Predict Paradigm

Jul 12, 2022
Mingye Xu, Yali Wang, Yihao Liu, Yu Qiao

Point cloud completion aims to predict a complete shape from its partial observation. Current approaches mainly consist of generation and refinement stages in a coarse-to-fine style. However, the generation stage often lacks robustness to different incomplete variations, while the refinement stage blindly recovers point clouds without semantic awareness. To tackle these challenges, we unify point cloud Completion by a generic Pretrain-Prompt-Predict paradigm, namely CP3. Inspired by prompting approaches from NLP, we creatively reinterpret point cloud generation and refinement as the prompting and predicting stages, respectively. Then, we introduce a concise self-supervised pretraining stage before prompting. It can effectively increase the robustness of point cloud generation through an Incompletion-Of-Incompletion (IOI) pretext task. Moreover, we develop a novel Semantic Conditional Refinement (SCR) network at the predicting stage. It can discriminatively modulate multi-scale refinement with the guidance of semantics. Finally, extensive experiments demonstrate that our CP3 outperforms the state-of-the-art methods by a large margin.

Cross Domain Object Detection by Target-Perceived Dual Branch Distillation

May 03, 2022
Mengzhe He, Yali Wang, Jiaxi Wu, Yiru Wang, Hanqing Li, Bo Li, Weihao Gan, Wei Wu, Yu Qiao

Cross domain object detection is a realistic and challenging task in the wild. It suffers from performance degradation due to large shifts in data distribution and the lack of instance-level annotations in the target domain. Existing approaches mainly focus on only one of these two difficulties, even though they are closely coupled in cross domain object detection. To solve this problem, we propose a novel Target-perceived Dual-branch Distillation (TDD) framework. By integrating detection branches of both source and target domains in a unified teacher-student learning scheme, it can reduce domain shift and generate reliable supervision effectively. In particular, we first introduce a distinct Target Proposal Perceiver between two domains. It can adaptively enhance the source detector to perceive objects in a target image, by leveraging target proposal contexts from iterative cross-attention. Afterwards, we design a concise Dual Branch Self Distillation strategy for model training, which can progressively integrate complementary object knowledge from different domains via self-distillation in two branches. Finally, we conduct extensive experiments on a number of widely-used scenarios in cross domain object detection. The results show that our TDD significantly outperforms the state-of-the-art methods on all the benchmarks. Our code and model will be available at https://github.com/Feobi1999/TDD.

* CVPR2022

Target-Relevant Knowledge Preservation for Multi-Source Domain Adaptive Object Detection

Apr 17, 2022
Jiaxi Wu, Jiaxin Chen, Mengzhe He, Yiru Wang, Bo Li, Bingqi Ma, Weihao Gan, Wei Wu, Yali Wang, Di Huang

Domain adaptive object detection (DAOD) is a promising way to alleviate the performance drop of detectors in new scenes. Despite the great effort made in single-source domain adaptation, the more general task with multiple source domains remains underexplored, due to knowledge degradation during their combination. To address this issue, we propose a novel approach, namely target-relevant knowledge preservation (TRKP), for unsupervised multi-source DAOD. Specifically, TRKP adopts the teacher-student framework, where the multi-head teacher network is built to extract knowledge from labeled source domains and guide the student network to learn detectors in the unlabeled target domain. The teacher network is further equipped with an adversarial multi-source disentanglement (AMSD) module to preserve source domain-specific knowledge and simultaneously perform cross-domain alignment. In addition, a holistic target-relevant mining (HTRM) scheme is developed to re-weight the source images according to the source-target relevance. In this way, the teacher network is forced to capture target-relevant knowledge, which helps reduce domain shift when mentoring object detection in the target domain. Extensive experiments are conducted on various widely used benchmarks with new state-of-the-art scores reported, highlighting the effectiveness of the approach.

* CVPR2022
