Xiaoyang Wu

OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

Sep 04, 2023
Zhening Huang, Xiaoyang Wu, Xi Chen, Hengshuang Zhao, Lei Zhu, Joan Lasenby

Current 3D open-vocabulary scene understanding methods mostly utilize well-aligned 2D images as a bridge to learn 3D features with language. However, applying these approaches becomes challenging in scenarios where 2D images are absent. In this work, we introduce a completely new pipeline, OpenIns3D, which requires no 2D image inputs, for 3D open-vocabulary scene understanding at the instance level. The OpenIns3D framework employs a "Mask-Snap-Lookup" scheme. The "Mask" module learns class-agnostic mask proposals in 3D point clouds. The "Snap" module generates synthetic scene-level images at multiple scales and leverages 2D vision-language models to detect objects of interest. The "Lookup" module searches through the outcomes of "Snap" with the help of Mask2Pixel maps, which contain the precise correspondence between 3D masks and synthetic images, to assign category names to the proposed masks. This 2D-input-free, easy-to-train, and flexible approach achieves state-of-the-art results on a wide range of indoor and outdoor datasets by a large margin. Furthermore, OpenIns3D allows for effortless switching of 2D detectors without re-training. When integrated with state-of-the-art 2D open-world models such as ODISE and GroundingDINO, it delivers superb results on open-vocabulary instance segmentation. When integrated with LLM-powered 2D models like LISA, it demonstrates a remarkable capacity to process highly complex text queries, including those that require intricate reasoning and world knowledge. Project page: https://zheninghuang.github.io/OpenIns3D/
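
As a rough illustration of how the "Lookup" stage could work, the sketch below votes a category for one 3D mask proposal from its Mask2Pixel footprints and per-image 2D detections. The data layout, voting rule, and confidence weighting are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of a "Lookup"-style step, assuming each 3D mask proposal
# comes with a Mask2Pixel map (its pixel footprint in every synthetic "Snap"
# image) and that a 2D open-vocabulary detector returned labelled boxes per image.
import numpy as np

def lookup_category(mask2pixel_maps, detections, num_classes):
    """Vote a category for one 3D mask from its 2D footprints.

    mask2pixel_maps: list of (H, W) bool arrays, one per synthetic image,
                     True where the 3D mask projects.
    detections:      list (per image) of (box, class_id, score) tuples,
                     box = (x0, y0, x1, y1) in pixel coordinates.
    """
    votes = np.zeros(num_classes)
    for footprint, dets in zip(mask2pixel_maps, detections):
        ys, xs = np.nonzero(footprint)
        if len(xs) == 0:          # mask not visible in this snapshot
            continue
        for (x0, y0, x1, y1), cls, score in dets:
            inside = (xs >= x0) & (xs <= x1) & (ys >= y0) & (ys <= y1)
            overlap = inside.mean()          # fraction of footprint inside box
            votes[cls] += overlap * score    # weight votes by detector confidence
    return int(votes.argmax()) if votes.any() else None
```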

* 24 pages, 16 figures, 13 tables. Project page: https://zheninghuang.github.io/OpenIns3D/ 

Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training

Aug 18, 2023
Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, Hengshuang Zhao

The rapid advancement of deep learning models is often attributed to their ability to leverage massive training data. In contrast, 3D deep learning has yet to benefit fully from this advantage, mainly due to the limited availability of large-scale 3D datasets. Merging multiple available data sources and letting them collaboratively train a single model is a potential solution. However, because of the large domain gap between 3D point cloud datasets, such mixed supervision can adversely affect the model and lead to degraded performance (i.e., negative transfer) compared to single-dataset training. In view of this challenge, we introduce Point Prompt Training (PPT), a novel framework for multi-dataset synergistic learning in the context of 3D representation learning that supports multiple pre-training paradigms. Based on this framework, we propose Prompt-driven Normalization, which adapts the model to different datasets with domain-specific prompts, and Language-guided Categorical Alignment, which unifies the label spaces of multiple datasets by leveraging the relationships between label texts. Extensive experiments verify that PPT can overcome the negative transfer associated with synergistic learning and produce generalizable representations. Notably, it achieves state-of-the-art performance on each dataset using a single weight-shared model with supervised multi-dataset training. Moreover, when used as a pre-training framework, it outperforms other pre-training approaches in representation quality and attains remarkable state-of-the-art performance across more than ten diverse downstream tasks spanning both indoor and outdoor 3D scenarios.
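
A minimal sketch of the Prompt-driven Normalization idea is shown below: a shared normalization layer whose affine parameters are generated from a learnable, dataset-specific prompt. The module name, prompt dimension, and MLP mapping are illustrative assumptions, not the released code.

```python
# Hedged sketch: one shared backbone normalization layer, adapted per dataset
# by scale/shift predicted from a learnable prompt embedding.
import torch
import torch.nn as nn

class PromptDrivenNorm(nn.Module):
    def __init__(self, num_channels, num_datasets, prompt_dim=64):
        super().__init__()
        self.norm = nn.LayerNorm(num_channels, elementwise_affine=False)
        # One learnable prompt per training dataset (e.g. ScanNet, S3DIS, ...).
        self.prompts = nn.Embedding(num_datasets, prompt_dim)
        # Small MLP maps a prompt to per-channel scale and shift.
        self.to_affine = nn.Linear(prompt_dim, 2 * num_channels)

    def forward(self, x, dataset_id):
        # x: (N_points, C) point features; dataset_id: index of the source domain
        prompt = self.prompts(torch.tensor(dataset_id, device=x.device))
        scale, shift = self.to_affine(prompt).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift

feats = torch.randn(1024, 96)
pdn = PromptDrivenNorm(num_channels=96, num_datasets=3)
out = pdn(feats, dataset_id=1)   # adapt the same backbone to dataset #1
```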

* Code available at Pointcept (https://github.com/Pointcept/Pointcept) 

MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds

Jul 18, 2023
Jiahui Liu, Chirui Chang, Jianhui Liu, Xiaoyang Wu, Lan Ma, Xiaojuan Qi

3D semantic segmentation on multi-scan large-scale point clouds plays an important role in autonomous systems. Unlike the single-scan semantic segmentation task, this task requires distinguishing the motion states of points in addition to their semantic categories. However, methods designed for single-scan segmentation perform poorly on the multi-scan task due to the lack of an effective way to integrate temporal information. We propose MarS3D, a plug-and-play motion-aware module for semantic segmentation on multi-scan 3D point clouds. This module can be flexibly combined with single-scan models to give them multi-scan perception abilities. The module comprises two key designs: the Cross-Frame Feature Embedding module for enriching representation learning and the Motion-Aware Feature Learning module for enhancing motion awareness. Extensive experiments show that MarS3D can improve the performance of the baseline model by a large margin. The code is available at https://github.com/CVMI-Lab/MarS3D.
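
The sketch below illustrates one way such a plug-and-play wrapper could look: each scan's point features receive a learnable frame embedding before the merged cloud is fed to an unchanged single-scan backbone, with an extra head for motion state. All module names, heads, and shapes are assumptions for illustration, not the released code.

```python
# Hedged sketch: wrap an existing single-scan backbone for multi-scan input.
import torch
import torch.nn as nn

class MultiScanWrapper(nn.Module):
    def __init__(self, backbone, feat_dim, num_scans, num_classes, num_motion_states=2):
        super().__init__()
        self.backbone = backbone                      # any single-scan point backbone
        self.frame_embed = nn.Embedding(num_scans, feat_dim)
        self.semantic_head = nn.Linear(feat_dim, num_classes)
        self.motion_head = nn.Linear(feat_dim, num_motion_states)

    def forward(self, scans):
        # scans: list of (N_i, feat_dim) point feature tensors, ordered in time
        tagged = [f + self.frame_embed.weight[t] for t, f in enumerate(scans)]
        merged = torch.cat(tagged, dim=0)             # treat the sequence as one cloud
        feats = self.backbone(merged)                 # (sum N_i, feat_dim)
        return self.semantic_head(feats), self.motion_head(feats)

backbone = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))  # stand-in
model = MultiScanWrapper(backbone, feat_dim=32, num_scans=3, num_classes=20)
sem_logits, motion_logits = model([torch.randn(100, 32) for _ in range(3)])
```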

* The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023  

Hierarchical Dense Correlation Distillation for Few-Shot Segmentation-Extended Abstract

Jun 27, 2023
Bohao Peng, Zhuotao Tian, Xiaoyang Wu, Chengyao Wang, Shu Liu, Jingyong Su, Jiaya Jia

Few-shot semantic segmentation (FSS) aims to form class-agnostic models that segment unseen classes with only a handful of annotations. Previous methods, limited to semantic features and prototype representations, suffer from coarse segmentation granularity and train-set overfitting. In this work, we design the Hierarchically Decoupled Matching Network (HDMNet), which mines pixel-level support correlations based on the transformer architecture. Self-attention modules are used to establish hierarchical dense features and accomplish cascade matching between query and support features. Moreover, we propose a matching module to reduce train-set overfitting and introduce correlation distillation, leveraging semantic correspondence from the coarse resolution to boost fine-grained segmentation. Our method performs well in experiments, achieving 50.0% mIoU on the COCO dataset in the one-shot setting and 56.0% in the five-shot setting. The code will be available on the project website. We hope our work can benefit broader industrial applications where novel classes with limited annotations must be reliably identified.

* Accepted to CVPR 2023 VISION Workshop, Oral. The extended abstract of Hierarchical Dense Correlation Distillation for Few-Shot Segmentation. arXiv admin note: substantial text overlap with arXiv:2303.14652 

SAM3D: Segment Anything in 3D Scenes

Jun 06, 2023
Yunhan Yang, Xiaoyang Wu, Tong He, Hengshuang Zhao, Xihui Liu

In this work, we propose SAM3D, a novel framework that predicts masks in 3D point clouds by leveraging the Segment Anything Model (SAM) on RGB images without further training or finetuning. Given a point cloud of a 3D scene with posed RGB images, we first predict segmentation masks on the RGB images with SAM and then project the 2D masks onto the 3D points. We then merge the 3D masks iteratively with a bottom-up approach: at each step, the point-cloud masks of two adjacent frames are fused with a bidirectional merging approach. In this way, the 3D masks predicted from different frames are gradually merged into the 3D masks of the whole scene. Finally, we can optionally ensemble the result of SAM3D with over-segmentation results based on the geometric information of the 3D scene. We evaluate our approach on the ScanNet dataset, and qualitative results demonstrate that SAM3D achieves reasonable and fine-grained 3D segmentation without any training or finetuning of SAM.
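
A hedged sketch of the bottom-up merging step is given below: each frame contributes a list of projected masks (sets of global point indices), and masks of adjacent frames are unioned when their point-set IoU is high. The IoU threshold and greedy matching are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of pairwise, bottom-up mask merging across frames.
def merge_two_frames(masks_a, masks_b, iou_thr=0.3):
    """masks_a, masks_b: lists of Python sets of global point indices."""
    merged, used_b = [], set()
    for ma in masks_a:
        best_j, best_iou = None, 0.0
        for j, mb in enumerate(masks_b):
            if j in used_b:
                continue
            iou = len(ma & mb) / len(ma | mb)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou >= iou_thr:
            merged.append(ma | masks_b[best_j])   # fuse matched masks
            used_b.add(best_j)
        else:
            merged.append(ma)                     # keep unmatched mask as-is
    merged += [mb for j, mb in enumerate(masks_b) if j not in used_b]
    return merged

def merge_sequence(frame_masks):
    # Merge adjacent frames pairwise until one mask list covers the whole scene.
    while len(frame_masks) > 1:
        frame_masks = [merge_two_frames(frame_masks[i], frame_masks[i + 1])
                       for i in range(0, len(frame_masks) - 1, 2)] + \
                      ([frame_masks[-1]] if len(frame_masks) % 2 else [])
    return frame_masks[0]
```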

* Technical Report. The code is released at https://github.com/Pointcept/SegmentAnything3D 

OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection

Jun 02, 2023
Zhangyang Qi, Jiaqi Wang, Xiaoyang Wu, Hengshuang Zhao

Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost. Most current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm, which benefits from both BEV's strong perception power and an end-to-end pipeline. Despite substantial progress, existing works model objects by leveraging the temporal and spatial information of BEV features globally, which causes problems in complex and dynamic autonomous driving scenarios. In this paper, we propose OCBEV, an Object-Centric query-BEV detector that captures the temporal and spatial cues of moving targets more effectively. OCBEV comprises three designs: Object Aligned Temporal Fusion aligns BEV features based on ego-motion and the estimated current locations of moving objects, leading to precise instance-level feature fusion. Object Focused Multi-View Sampling samples more 3D features from adaptive local height ranges of objects for each scene to enrich foreground information. Object Informed Query Enhancement replaces part of the pre-defined decoder queries in common DETR-style decoders with positional features of objects at high-confidence locations, introducing more direct object positional priors. Extensive experimental evaluations are conducted on the challenging nuScenes dataset. Our approach achieves a state-of-the-art result, surpassing the traditional BEVFormer by 1.5 NDS points. Moreover, it converges faster, needing only half the training iterations to reach comparable performance, which further demonstrates its effectiveness.
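
As a rough illustration of the Object Informed Query Enhancement idea, the sketch below replaces part of a set of learned decoder queries with embeddings of high-confidence object locations. The heatmap source, positional MLP, and number of replaced queries are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: swap some learned DETR-style queries for object-informed ones.
import torch
import torch.nn as nn

def enhance_queries(learned_queries, heatmap_xyz, heatmap_scores,
                    pos_embed, keep_top=50):
    """
    learned_queries: (num_queries, d) pre-defined decoder queries
    heatmap_xyz:     (M, 3) candidate object centres
    heatmap_scores:  (M,) confidence of each candidate
    pos_embed:       module mapping (k, 3) -> (k, d) positional features
    """
    k = min(keep_top, heatmap_xyz.shape[0], learned_queries.shape[0])
    top = heatmap_scores.topk(k).indices
    object_queries = pos_embed(heatmap_xyz[top])          # (k, d)
    # Replace the first k learned queries with object-informed ones.
    return torch.cat([object_queries, learned_queries[k:]], dim=0)

d = 256
pos_embed = nn.Sequential(nn.Linear(3, d), nn.ReLU(), nn.Linear(d, d))
queries = enhance_queries(torch.randn(900, d), torch.rand(200, 3),
                          torch.rand(200), pos_embed)
```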


Hierarchical Dense Correlation Distillation for Few-Shot Segmentation

Mar 26, 2023
Bohao Peng, Zhuotao Tian, Xiaoyang Wu, Chenyao Wang, Shu Liu, Jingyong Su, Jiaya Jia

Few-shot semantic segmentation (FSS) aims to form class-agnostic models that segment unseen classes with only a handful of annotations. Previous methods, limited to semantic features and prototype representations, suffer from coarse segmentation granularity and train-set overfitting. In this work, we design the Hierarchically Decoupled Matching Network (HDMNet), which mines pixel-level support correlations based on the transformer architecture. Self-attention modules are used to establish hierarchical dense features and accomplish cascade matching between query and support features. Moreover, we propose a matching module to reduce train-set overfitting and introduce correlation distillation, leveraging semantic correspondence from the coarse resolution to boost fine-grained segmentation. Our method performs well in experiments, achieving 50.0% mIoU on the COCO dataset in the one-shot setting and 56.0% in the five-shot setting.
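
The sketch below illustrates the correlation-distillation idea in isolation: the query-support correlation computed at a coarse level supervises the correlation at a finer level after pooling. The shapes, pooling scheme, and KL formulation are illustrative assumptions rather than the released code.

```python
# Hedged sketch: distill coarse-level query-support correlation into the fine level.
import torch
import torch.nn.functional as F

def correlation(query_feat, support_feat):
    # query_feat: (Nq, C), support_feat: (Ns, C) -> (Nq, Ns) row-normalized map
    q = F.normalize(query_feat, dim=-1)
    s = F.normalize(support_feat, dim=-1)
    return (q @ s.t()).softmax(dim=-1)

def distillation_loss(fine_corr, coarse_corr):
    # Match the fine-level correlation to the (detached) coarse-level teacher.
    return F.kl_div(fine_corr.clamp_min(1e-8).log(),
                    coarse_corr.detach(), reduction='batchmean')

# Toy features: the fine level has 4x more query/support positions.
q_fine, s_fine = torch.randn(4096, 64), torch.randn(4096, 64)
q_coarse, s_coarse = torch.randn(1024, 64), torch.randn(1024, 64)

fine = correlation(q_fine, s_fine)              # (4096, 4096)
fine = fine.view(1024, 4, 4096).mean(dim=1)     # pool query positions
fine = fine.view(1024, 1024, 4).sum(dim=-1)     # pool support positions, rows still sum to 1
loss = distillation_loss(fine, correlation(q_coarse, s_coarse))
```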

* To be published in CVPR 2023. Code is available at https://github.com/Pbihao/HDMNet

Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning

Mar 24, 2023
Xiaoyang Wu, Xin Wen, Xihui Liu, Hengshuang Zhao

As a pioneering work, PointContrast conducts unsupervised 3D representation learning by leveraging contrastive learning over raw RGB-D frames and proves its effectiveness on various downstream tasks. However, the trend of large-scale unsupervised learning in 3D has yet to emerge, due to two stumbling blocks: the inefficiency of matching RGB-D frames as contrastive views and the mode collapse phenomenon reported in previous works. Turning these two stumbling blocks into empirical stepping stones, we first propose an efficient and effective contrastive learning framework that generates contrastive views directly on scene-level point clouds with a well-curated data augmentation pipeline and a practical view-mixing strategy. Second, we introduce reconstructive learning into the contrastive framework with a careful design of contrastive cross masks, targeting the reconstruction of point color and surfel normals. Our Masked Scene Contrast (MSC) framework extracts comprehensive 3D representations more efficiently and effectively. It accelerates the pre-training procedure by at least 3x while achieving uncompromised performance compared with previous work. Besides, MSC also enables large-scale 3D pre-training across multiple datasets, which further boosts performance and achieves state-of-the-art fine-tuning results on several downstream tasks, e.g., 75.5% mIoU on the ScanNet semantic segmentation validation set.
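
An illustrative sketch of the two training signals is given below: a point-wise InfoNCE loss between two augmented views of the same scene, plus a color-reconstruction loss restricted to masked points. The shapes, temperature, and masking scheme are assumptions, not the released implementation.

```python
# Hedged sketch: contrastive (InfoNCE) + masked reconstruction objectives.
import torch
import torch.nn.functional as F

def point_info_nce(feat_a, feat_b, temperature=0.07):
    # feat_a, feat_b: (N, C) features of the same N points under two views
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    logits = a @ b.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)
    return F.cross_entropy(logits, targets)   # matched point = positive pair

def masked_color_loss(pred_color, gt_color, mask):
    # mask: (N,) bool, True where the input color was masked out
    return F.l1_loss(pred_color[mask], gt_color[mask])

N, C = 2048, 96
fa, fb = torch.randn(N, C), torch.randn(N, C)
pred, gt = torch.rand(N, 3), torch.rand(N, 3)
mask = torch.rand(N) < 0.5                    # toy stand-in for the cross masks
loss = point_info_nce(fa, fb) + masked_color_loss(pred, gt, mask)
```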

* Accepted by CVPR 2023, available at https://github.com/Pointcept/Pointcept 

GeoSpark: Sparking up Point Cloud Segmentation with Geometry Clue

Mar 14, 2023
Zhening Huang, Xiaoyang Wu, Hengshuang Zhao, Lei Zhu, Shujun Wang, Georgios Hadjidemetriou, Ioannis Brilakis

Current point cloud segmentation architectures suffer from limited long-range feature modeling, as they mostly rely on aggregating information over local neighborhoods. Furthermore, in order to learn point features at multiple scales, most methods use a data-agnostic sampling approach to decrease the number of points after each stage. Such sampling methods, however, often discard points belonging to small objects in the early stages, leading to inadequate feature learning. We believe these issues can be mitigated by introducing explicit geometry clues as guidance. To this end, we propose GeoSpark, a Plug-in module that incorporates Geometry clues into the network to Spark up feature learning and downsampling. GeoSpark can be easily integrated into various backbones. For feature aggregation, it improves feature modeling by allowing the network to learn from both local points and neighboring geometry partitions, resulting in an enlarged, data-tailored receptive field. Additionally, GeoSpark uses geometry partition information to guide the downsampling process, preserving points with unique features while fusing redundant points, which better retains key points throughout the network. We observed consistent improvements after adding GeoSpark to various backbones including PointNet++, KPConv, and Point Transformer. Notably, when integrated with Point Transformer, our GeoSpark module achieves 74.7% mIoU on the ScanNetv2 dataset (a 4.1% improvement) and 71.5% mIoU on S3DIS Area 5 (a 1.1% improvement), ranking at the top of both benchmarks. Code and models will be made publicly available.
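
A toy sketch of geometry-guided downsampling is shown below: within each geometry partition, feature-wise distinctive points are preserved while the remaining redundant points are fused into a single averaged point. The keep ratio, distinctiveness measure, and fusion rule are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: downsample a point cloud using geometry partition labels.
import torch

def geometry_guided_downsample(xyz, feat, partition_id, keep_ratio=0.25):
    kept_xyz, kept_feat = [], []
    for pid in partition_id.unique():
        idx = (partition_id == pid).nonzero(as_tuple=True)[0]
        f = feat[idx]
        dist = (f - f.mean(dim=0)).norm(dim=-1)          # distance to partition mean
        k = max(1, int(keep_ratio * len(idx)))
        unique_idx = idx[dist.topk(k).indices]           # most distinctive points survive
        kept_xyz.append(xyz[unique_idx])
        kept_feat.append(feat[unique_idx])
        # Fuse the remaining (redundant) points of this partition into one point.
        rest = idx[~torch.isin(idx, unique_idx)]
        if len(rest):
            kept_xyz.append(xyz[rest].mean(dim=0, keepdim=True))
            kept_feat.append(feat[rest].mean(dim=0, keepdim=True))
    return torch.cat(kept_xyz), torch.cat(kept_feat)

xyz, feat = torch.rand(1000, 3), torch.randn(1000, 32)
partition_id = torch.randint(0, 20, (1000,))
new_xyz, new_feat = geometry_guided_downsample(xyz, feat, partition_id)
```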
