Runnan Chen

UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase

Sep 11, 2023
Youquan Liu, Runnan Chen, Xin Li, Lingdong Kong, Yuchen Yang, Zhaoyang Xia, Yeqi Bai, Xinge Zhu, Yuexin Ma, Yikang Li, Yu Qiao, Yuenan Hou

Point, voxel, and range views are three representative forms of point clouds. All of them provide accurate 3D measurements but lack color and texture information. RGB images are a natural complement to these point cloud views, and fully exploiting their complementary information enables more robust perception. In this paper, we present a unified multi-modal LiDAR segmentation network, termed UniSeg, which leverages the information of RGB images and the three views of the point cloud, and accomplishes semantic segmentation and panoptic segmentation simultaneously. Specifically, we first design the Learnable cross-Modal Association (LMA) module to automatically fuse voxel-view and range-view features with image features; it fully utilizes the rich semantic information of images and is robust to calibration errors. Then, the enhanced voxel-view and range-view features are transformed into the point space, where the three views of point cloud features are further fused adaptively by the Learnable cross-View Association (LVA) module. Notably, UniSeg achieves promising results on three public benchmarks, i.e., SemanticKITTI, nuScenes, and the Waymo Open Dataset (WOD); it ranks 1st on two challenges of two benchmarks, namely the LiDAR semantic segmentation challenge of nuScenes and the panoptic segmentation challenge of SemanticKITTI. Besides, we construct the OpenPCSeg codebase, which is the largest and most comprehensive outdoor LiDAR segmentation codebase. It contains most of the popular outdoor LiDAR segmentation algorithms and provides reproducible implementations. The OpenPCSeg codebase will be made publicly available at https://github.com/PJLab-ADG/PCSeg.

* ICCV 2023; 21 pages; 9 figures; 18 tables; Code at https://github.com/PJLab-ADG/PCSeg 
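
The abstract above does not spell out how the LMA module is implemented, so the following is only a minimal sketch of one plausible learnable cross-modal fusion: each LiDAR (voxel- or range-view) feature attends over a small neighbourhood of image features around its projection, which is one way to remain robust to calibration errors. All class, function, and parameter names here are illustrative assumptions, not UniSeg's actual code.

```python
# Illustrative sketch only: a learnable cross-modal fusion block in the spirit of
# UniSeg's LMA module (the exact design differs; names here are hypothetical).
import torch
import torch.nn as nn


class LearnableCrossModalFusion(nn.Module):
    """Fuse per-point LiDAR-view features with a set of candidate image features.

    Instead of trusting a single projected pixel (which is sensitive to
    calibration errors), each LiDAR feature attends over a small neighbourhood
    of image features and learns where to look.
    """

    def __init__(self, lidar_dim: int, img_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.q = nn.Linear(lidar_dim, hidden_dim)
        self.k = nn.Linear(img_dim, hidden_dim)
        self.v = nn.Linear(img_dim, hidden_dim)
        self.out = nn.Linear(lidar_dim + hidden_dim, lidar_dim)

    def forward(self, lidar_feat: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # lidar_feat: (N, lidar_dim)   one feature per point/voxel/range pixel
        # img_feats:  (N, K, img_dim)  K candidate image features around the projection
        q = self.q(lidar_feat).unsqueeze(1)                   # (N, 1, H)
        k, v = self.k(img_feats), self.v(img_feats)           # (N, K, H)
        attn = torch.softmax((q * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)  # (N, K)
        fused_img = (attn.unsqueeze(-1) * v).sum(dim=1)       # (N, H)
        return self.out(torch.cat([lidar_feat, fused_img], dim=-1))


if __name__ == "__main__":
    fusion = LearnableCrossModalFusion(lidar_dim=64, img_dim=256)
    out = fusion(torch.randn(1024, 64), torch.randn(1024, 9, 256))
    print(out.shape)  # torch.Size([1024, 64])
```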

Human-centric Scene Understanding for 3D Large-scale Scenarios

Jul 26, 2023
Yiteng Xu, Peishan Cong, Yichen Yao, Runnan Chen, Yuenan Hou, Xinge Zhu, Xuming He, Jingyi Yu, Yuexin Ma

Human-centric scene understanding is significant for real-world applications, but it is extremely challenging due to diverse human poses and actions, complex human-environment interactions, severe occlusions in crowds, and so on. In this paper, we present a large-scale multi-modal dataset for human-centric scene understanding, dubbed HuCenLife, which is collected in diverse daily-life scenarios with rich and fine-grained annotations. HuCenLife can benefit many 3D perception tasks, such as segmentation, detection, and action recognition, and we also provide benchmarks for these tasks to facilitate related research. In addition, we design novel modules for LiDAR-based segmentation and action recognition, which are more applicable to large-scale human-centric scenarios and achieve state-of-the-art performance.

See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data

Jul 20, 2023
Yuhang Lu, Qi Jiang, Runnan Chen, Yuenan Hou, Xinge Zhu, Yuexin Ma

Zero-shot point cloud segmentation aims to make deep models capable of recognizing novel objects in point clouds that are unseen during training. Recent trends favor pipelines that transfer knowledge from seen classes with labels to unseen classes without labels. They typically align visual features with semantic features obtained from word embeddings, supervised by the annotations of seen classes. However, point clouds contain limited information for fully matching with semantic features. In fact, the rich appearance information of images is a natural complement to textureless point clouds, which has not been well explored in previous literature. Motivated by this, we propose a novel multi-modal zero-shot learning method to better utilize the complementary information of point clouds and images for more accurate visual-semantic alignment. Extensive experiments are performed on two popular benchmarks, i.e., SemanticKITTI and nuScenes, and our method outperforms current SOTA methods with 52% and 49% improvements on average in unseen-class mIoU, respectively.

* Accepted by ICCV 2023 
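
The visual-semantic alignment described above is commonly realized by projecting visual features into the word-embedding space and training with seen-class labels only; the sketch below shows that generic recipe under stated assumptions. It is not the paper's exact loss, and all names are hypothetical.

```python
# Generic visual-semantic alignment sketch for zero-shot segmentation
# (illustrative only; not the paper's exact formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticProjector(nn.Module):
    """Map per-point visual features into the word-embedding space."""

    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, point_feats, class_embeds, temperature: float = 0.07):
        # point_feats:  (N, feat_dim)  features of N points (possibly image-enhanced)
        # class_embeds: (C, embed_dim) word embeddings of the classes in play
        f = F.normalize(self.proj(point_feats), dim=-1)
        w = F.normalize(class_embeds, dim=-1)
        return f @ w.t() / temperature   # (N, C) similarity logits


if __name__ == "__main__":
    seen_embeds = torch.randn(10, 300)   # e.g. word embeddings of 10 seen classes
    all_embeds = torch.randn(13, 300)    # seen + unseen classes at test time
    model = SemanticProjector(feat_dim=96, embed_dim=300)

    # Training: supervise only with seen-class labels.
    logits = model(torch.randn(2048, 96), seen_embeds)
    loss = F.cross_entropy(logits, torch.randint(0, 10, (2048,)))
    loss.backward()

    # Inference: scoring against all class embeddings enables unseen-class prediction.
    pred = model(torch.randn(2048, 96), all_embeds).argmax(dim=-1)
```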

Segment Any Point Cloud Sequences by Distilling Vision Foundation Models

Jun 15, 2023
Youquan Liu, Lingdong Kong, Jun Cen, Runnan Chen, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu

Recent advancements in vision foundation models (VFMs) have opened up new possibilities for versatile and efficient visual perception. In this work, we introduce Seal, a novel framework that harnesses VFMs for segmenting diverse automotive point cloud sequences. Seal exhibits three appealing properties: i) Scalability: VFMs are directly distilled into point clouds, eliminating the need for annotations in either 2D or 3D during pretraining. ii) Consistency: Spatial and temporal relationships are enforced at both the camera-to-LiDAR and point-to-segment stages, facilitating cross-modal representation learning. iii) Generalizability: Seal enables knowledge transfer in an off-the-shelf manner to downstream tasks involving diverse point clouds, including those from real/synthetic, low/high-resolution, large/small-scale, and clean/corrupted datasets. Extensive experiments conducted on eleven different point cloud datasets showcase the effectiveness and superiority of Seal. Notably, Seal achieves a remarkable 45.0% mIoU on nuScenes after linear probing, surpassing random initialization by 36.9% mIoU and outperforming prior art by 6.1% mIoU. Moreover, Seal demonstrates significant performance gains over existing methods across 20 different few-shot fine-tuning tasks on all eleven tested point cloud datasets.

* Preprint; 36 pages, 16 figures, 14 tables; Code at https://github.com/youquanl/Segment-Any-Point-Cloud 
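
At its core, the camera-to-LiDAR distillation mentioned above is a feature-matching objective between a frozen image branch and a trainable point branch. The following is a minimal hedged sketch of such an objective; Seal's actual method additionally pools features over VFM-derived segments and enforces temporal consistency, and the function name here is an assumption.

```python
# Minimal sketch of camera-to-LiDAR feature distillation (illustrative; the
# paper's objective also pools features over VFM segments and adds temporal terms).
import torch
import torch.nn.functional as F


def camera_to_lidar_distill_loss(point_feats: torch.Tensor,
                                 pixel_feats: torch.Tensor) -> torch.Tensor:
    """Pull each 3D point feature toward the (frozen) 2D feature it projects onto.

    point_feats: (N, D) features from the trainable 3D backbone
    pixel_feats: (N, D) features gathered from the frozen image backbone at the
                 pixels the points project to (no 2D or 3D labels required)
    """
    p = F.normalize(point_feats, dim=-1)
    q = F.normalize(pixel_feats.detach(), dim=-1)   # teacher side is frozen
    return (1.0 - (p * q).sum(dim=-1)).mean()       # mean cosine distance


if __name__ == "__main__":
    student = torch.randn(4096, 128, requires_grad=True)
    teacher = torch.randn(4096, 128)
    loss = camera_to_lidar_distill_loss(student, teacher)
    loss.backward()
    print(float(loss))
```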

Towards Label-free Scene Understanding by Vision Foundation Models

Jun 06, 2023
Runnan Chen, Youquan Liu, Lingdong Kong, Nenglun Chen, Xinge Zhu, Yuexin Ma, Tongliang Liu, Wenping Wang

Vision foundation models such as Contrastive Vision-Language Pre-training (CLIP) and Segment Anything (SAM) have demonstrated impressive zero-shot performance on image classification and segmentation tasks. However, the incorporation of CLIP and SAM for label-free scene understanding has yet to be explored. In this paper, we investigate the potential of vision foundation models for enabling networks to comprehend 2D and 3D worlds without labelled data. The primary challenge lies in effectively supervising networks under extremely noisy pseudo labels, which are generated by CLIP and further exacerbated during propagation from the 2D to the 3D domain. To tackle this challenge, we propose a novel Cross-modality Noisy Supervision (CNS) method that leverages the strengths of CLIP and SAM to supervise 2D and 3D networks simultaneously. In particular, we introduce a prediction consistency regularization to co-train the 2D and 3D networks, and further impose latent-space consistency using SAM's robust feature representation. Experiments conducted on diverse indoor and outdoor datasets demonstrate the superior performance of our method in understanding 2D and 3D open environments. Our 2D and 3D networks achieve label-free semantic segmentation with 28.4% and 33.5% mIoU on ScanNet, improvements of 4.7% and 7.9%, respectively. On the nuScenes dataset, our method reaches 26.8% mIoU, an improvement of 6%. Code will be released at https://github.com/runnanchen/Label-Free-Scene-Understanding.
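
As a rough illustration of the co-training idea described above, the sketch below combines a 2D/3D prediction-consistency term with a latent-space term anchored to frozen SAM-style features. It is a simplified stand-in under assumed tensor shapes, not the paper's CNS implementation.

```python
# Simplified sketch of cross-modality noisy supervision (illustrative only).
import torch
import torch.nn.functional as F


def cns_style_losses(logits_2d, logits_3d, latent_2d, latent_3d, sam_feat):
    """Two consistency terms for paired pixels/points.

    logits_2d, logits_3d: (N, C) class predictions of the 2D and 3D networks
    latent_2d, latent_3d: (N, D) their latent features
    sam_feat:             (N, D) frozen SAM-derived features used as an anchor
    """
    # Prediction consistency: each branch matches the other's (soft) prediction.
    p2d, p3d = F.softmax(logits_2d, dim=-1), F.softmax(logits_3d, dim=-1)
    pred_cons = F.kl_div(p2d.log(), p3d.detach(), reduction="batchmean") + \
                F.kl_div(p3d.log(), p2d.detach(), reduction="batchmean")

    # Latent consistency: both branches stay close to the robust SAM feature space.
    anchor = F.normalize(sam_feat.detach(), dim=-1)
    latent_cons = (1 - (F.normalize(latent_2d, dim=-1) * anchor).sum(-1)).mean() + \
                  (1 - (F.normalize(latent_3d, dim=-1) * anchor).sum(-1)).mean()
    return pred_cons, latent_cons


if __name__ == "__main__":
    N, C, D = 1024, 20, 256
    pc, lc = cns_style_losses(torch.randn(N, C), torch.randn(N, C),
                              torch.randn(N, D), torch.randn(N, D),
                              torch.randn(N, D))
    print(float(pc), float(lc))
```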

Robo3D: Towards Robust and Reliable 3D Perception against Corruptions

Apr 13, 2023
Lingdong Kong, Youquan Liu, Xin Li, Runnan Chen, Wenwei Zhang, Jiawei Ren, Liang Pan, Kai Chen, Ziwei Liu

The robustness of 3D perception systems under natural corruptions from environments and sensors is pivotal for safety-critical applications. Existing large-scale 3D perception datasets often contain data that are meticulously cleaned. Such configurations, however, cannot reflect the reliability of perception models during the deployment stage. In this work, we present Robo3D, the first comprehensive benchmark for probing the robustness of 3D detectors and segmentors under out-of-distribution scenarios against natural corruptions that occur in real-world environments. Specifically, we consider eight corruption types stemming from adverse weather conditions, external disturbances, and internal sensor failure. We find that, although promising results have been progressively achieved on standard benchmarks, state-of-the-art 3D perception models remain vulnerable to corruptions. We draw key observations on the data representations, augmentation schemes, and training strategies that can severely affect model performance. To pursue better robustness, we propose a density-insensitive training framework along with a simple yet flexible voxelization strategy to enhance model resiliency. We hope our benchmark and approach can inspire future research in designing more robust and reliable 3D perception models. Our robustness benchmark suite is publicly available.

* Preprint; 33 pages, 26 figures, 26 tables; Code at https://github.com/ldkong1205/Robo3D 
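
The benchmark's own corruption models are more faithful than anything reproduced here; purely as a hedged illustration of the kind of corruptions being discussed, the snippet below simulates random point dropping and Gaussian jitter on a LiDAR sweep. Severity values and function names are arbitrary assumptions.

```python
# Toy simulation of two LiDAR corruptions of the kind benchmarked in Robo3D
# (illustrative only; the benchmark's corruption models are more faithful).
import numpy as np


def drop_points(points, drop_ratio=0.5, rng=None):
    """Randomly drop a fraction of points, mimicking e.g. sensor failure."""
    rng = rng or np.random.default_rng()
    keep = rng.random(points.shape[0]) >= drop_ratio
    return points[keep]


def jitter_points(points, sigma=0.03, rng=None):
    """Add Gaussian noise to xyz coordinates, mimicking measurement noise."""
    rng = rng or np.random.default_rng()
    noisy = points.copy()
    noisy[:, :3] += rng.normal(scale=sigma, size=(points.shape[0], 3))
    return noisy


if __name__ == "__main__":
    cloud = np.random.rand(120000, 4).astype(np.float32)   # x, y, z, intensity
    corrupted = jitter_points(drop_points(cloud, 0.5), 0.03)
    print(cloud.shape, corrupted.shape)
```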

Depth-NeuS: Neural Implicit Surfaces Learning for Multi-view Reconstruction Based on Depth Information Optimization

Mar 30, 2023
Hanqi Jiang, Cheng Zeng, Runnan Chen, Shuai Liang, Yinhe Han, Yichao Gao, Conglin Wang

Neural implicit surface methods that learn through volume rendering, such as NeuS, have recently become popular and made good progress. However, they still face challenges: because they represent implicit surfaces only through surface normals and lack an explicit treatment of depth, reconstruction is weakly constrained by geometry, and the detailed surface features of textured and colored objects are reconstructed poorly. To address this problem, we propose Depth-NeuS, a neural implicit surface learning method based on depth information optimization for multi-view reconstruction. We introduce a depth loss to explicitly constrain SDF regression and a geometric consistency loss to optimize low-texture areas. Experiments show that Depth-NeuS outperforms existing methods and achieves high-quality surface reconstruction across multiple scenarios.

* 9 pages 
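
The abstract mentions a depth loss that explicitly constrains SDF regression. One common way to realize such a term is to compare the volume-rendered depth of each ray against a reference depth, as in the hedged sketch below; the names and weighting are illustrative, not the paper's exact formulation.

```python
# Illustrative depth-supervision term for a NeuS-style renderer (not the exact
# Depth-NeuS formulation; the rendered depth is the standard volume-rendering
# expectation over sample depths along each ray).
import torch


def rendered_depth(weights: torch.Tensor, t_vals: torch.Tensor) -> torch.Tensor:
    """Expected ray termination depth from volume-rendering weights.

    weights: (R, S) per-sample rendering weights along each ray (sum <= 1)
    t_vals:  (R, S) sample depths along each ray
    """
    return (weights * t_vals).sum(dim=-1)


def depth_loss(weights, t_vals, gt_depth, valid_mask):
    """L1 depth loss on rays that have a valid reference depth."""
    pred = rendered_depth(weights, t_vals)
    return (pred - gt_depth).abs()[valid_mask].mean()


if __name__ == "__main__":
    R, S = 512, 64
    w = torch.softmax(torch.randn(R, S), dim=-1)
    t = torch.sort(torch.rand(R, S) * 5.0, dim=-1).values
    loss = depth_loss(w, t, torch.rand(R) * 5.0, torch.rand(R) > 0.2)
    print(float(loss))
```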

Rethinking Range View Representation for LiDAR Segmentation

Mar 18, 2023
Lingdong Kong, Youquan Liu, Runnan Chen, Yuexin Ma, Xinge Zhu, Yikang Li, Yuenan Hou, Yu Qiao, Ziwei Liu

LiDAR segmentation is crucial for autonomous driving perception. Recent trends favor point- or voxel-based methods as they often yield better performance than the traditional range view representation. In this work, we unveil several key factors in building powerful range view models. We observe that the "many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments to effective learning from range view projections. We present RangeFormer -- a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing -- that better handles the learning and processing of LiDAR point clouds from the range view. We further introduce a Scalable Training from Range view (STR) strategy that trains on arbitrary low-resolution 2D range images while still maintaining satisfactory 3D segmentation accuracy. We show that, for the first time, a range view method is able to surpass its point-, voxel-, and multi-view fusion counterparts on competitive LiDAR semantic and panoptic segmentation benchmarks, i.e., SemanticKITTI, nuScenes, and ScribbleKITTI.

* 22 pages, 10 figures, 14 tables, project page at https://ldkong.com/RangeFormer 
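
For readers unfamiliar with the range view, the sketch below shows the conventional spherical projection of a LiDAR sweep (not RangeFormer's full pipeline; the resolution and field-of-view values are assumptions). It also makes the "many-to-one" mapping concrete: several points can land in the same range-image pixel, and only one survives.

```python
# Standard spherical (range-view) projection of a LiDAR point cloud, included
# to make the "many-to-one" mapping concrete; not RangeFormer's full pipeline.
import numpy as np


def range_projection(points, H=64, W=2048, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (N, 3+) points into an H x W range image (values: range in metres)."""
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    fov = abs(fov_up) + abs(fov_down)

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1) + 1e-8
    yaw, pitch = -np.arctan2(y, x), np.arcsin(z / r)

    # Discard points outside the vertical field of view.
    keep = (pitch >= fov_down) & (pitch <= fov_up)
    r, yaw, pitch = r[keep], yaw[keep], pitch[keep]

    u = 0.5 * (yaw / np.pi + 1.0) * W                 # horizontal pixel coordinate
    v = (1.0 - (pitch + abs(fov_down)) / fov) * H     # vertical pixel coordinate
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)

    # Many-to-one: several points may map to the same (v, u) pixel; here the
    # last point written wins, which is exactly the information loss the paper
    # identifies as an impediment to range-view learning.
    img = np.full((H, W), -1.0, dtype=np.float32)
    img[v, u] = r
    return img


if __name__ == "__main__":
    pts = np.random.randn(100000, 4).astype(np.float32) * 20.0
    print(range_projection(pts).shape)  # (64, 2048)
```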