Jian Pu

Revisiting Multi-modal 3D Semantic Segmentation in Real-world Autonomous Driving

Oct 13, 2023
Feng Jiang, Chaoping Tu, Gang Zhang, Jun Li, Hanqing Huang, Junyu Lin, Di Feng, Jian Pu

LiDAR and camera are two critical sensors for multi-modal 3D semantic segmentation and must be fused efficiently and robustly to ensure safety across diverse real-world scenarios. However, existing multi-modal methods face two key challenges: 1) difficulty with efficient deployment and real-time execution; and 2) drastic performance degradation under weak calibration between LiDAR and cameras. To address these challenges, we propose CPGNet-LCF, a new multi-modal fusion framework that extends the LiDAR-only CPGNet. CPGNet-LCF solves the first challenge by inheriting the easy deployment and real-time capabilities of CPGNet. For the second challenge, we introduce a novel weak calibration knowledge distillation strategy during training to improve robustness against weak calibration. CPGNet-LCF achieves state-of-the-art performance on the nuScenes and SemanticKITTI benchmarks. Remarkably, it can be easily deployed to run in 20 ms per frame on a single Tesla V100 GPU using TensorRT in FP16 mode. Furthermore, we benchmark performance over four weak calibration levels, demonstrating the robustness of our proposed approach.
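
As a rough illustration of the weak-calibration distillation idea (not the paper's exact formulation), the PyTorch sketch below perturbs a LiDAR-camera extrinsic matrix to simulate weak calibration and combines a softened KL distillation term with the segmentation loss; `perturb_extrinsics` and `distillation_loss` are illustrative names, not CPGNet-LCF's API.

```python
# Hedged sketch of a weak-calibration distillation step, under assumed names.
import math
import torch
import torch.nn.functional as F

def perturb_extrinsics(T: torch.Tensor, rot_deg: float = 1.0, trans_m: float = 0.05) -> torch.Tensor:
    """Apply a small random yaw rotation and translation to a 4x4 extrinsic matrix."""
    angle = math.radians(torch.empty(1).uniform_(-rot_deg, rot_deg).item())
    c, s = math.cos(angle), math.sin(angle)
    dT = torch.eye(4)
    dT[0, 0], dT[0, 1], dT[1, 0], dT[1, 1] = c, -s, s, c
    dT[:3, 3] = torch.empty(3).uniform_(-trans_m, trans_m)
    return dT @ T

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=2.0):
    """Cross-entropy on labels plus KL between softened teacher/student distributions."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau * tau
    return (1 - alpha) * ce + alpha * kl

# Toy usage: 8 points, 5 classes; T_weak would drive the student's point-to-image projection.
T = torch.eye(4)
T_weak = perturb_extrinsics(T)                 # stand-in for one weak calibration level
student = torch.randn(8, 5, requires_grad=True)
teacher = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```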

* 7 pages, 3 figures 

ADU-Depth: Attention-based Distillation with Uncertainty Modeling for Depth Estimation

Sep 26, 2023
Zizhang Wu, Zhuozheng Li, Zhi-Gang Fan, Yunzhe Wu, Xiaoquan Wang, Rui Tang, Jian Pu

Monocular depth estimation is challenging due to its inherent ambiguity and ill-posed nature, yet it is important to many applications. While recent works achieve limited accuracy by designing increasingly complicated networks to extract features with limited spatial geometric cues from a single RGB image, we instead introduce spatial cues by training a teacher network that takes left-right image pairs as inputs and transferring the learned 3D geometry-aware knowledge to a monocular student network. Specifically, we present a novel knowledge distillation framework, named ADU-Depth, that leverages the well-trained teacher network to guide the learning of the student network, boosting depth estimation accuracy with the help of extra spatial scene information. To enable domain adaptation and ensure effective and smooth knowledge transfer from teacher to student, we apply both attention-adapted feature distillation and focal-depth-adapted response distillation in the training stage. In addition, we explicitly model the uncertainty of depth estimation to guide distillation in both feature space and result space, better producing 3D-aware knowledge from monocular observations and thus enhancing learning for hard-to-predict image regions. Our extensive experiments on the real depth estimation datasets KITTI and DrivingStereo demonstrate the effectiveness of the proposed method, which ranked 1st on the challenging KITTI online benchmark.
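
As a hedged sketch of one ingredient, the snippet below shows an uncertainty-weighted response-distillation term in the spirit of the description above, where pixels with high predicted log-variance contribute less to the loss; it is a toy formulation, not ADU-Depth's exact loss.

```python
# Hedged sketch: uncertainty-weighted L1 response distillation (assumed form).
import torch

def uncertainty_weighted_distill(student_depth, teacher_depth, log_var):
    """Per-pixel |d_s - d_t| * exp(-log_var) + log_var, averaged over the map."""
    weight = torch.exp(-log_var)
    return (weight * (student_depth - teacher_depth).abs() + log_var).mean()

# Toy usage on a 1x1x8x8 depth map; log_var would come from an uncertainty head.
student = torch.rand(1, 1, 8, 8, requires_grad=True)
teacher = torch.rand(1, 1, 8, 8)
log_var = torch.zeros(1, 1, 8, 8)      # placeholder predicted log-variance
loss = uncertainty_weighted_distill(student, teacher, log_var)
loss.backward()
```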

* accepted by CoRL 2023 

LineMarkNet: Line Landmark Detection for Valet Parking

Sep 25, 2023
Zizhang Wu, Yuanzhu Gan, Tianhao Xu, Rui Tang, Jian Pu

We aim for accurate and efficient line landmark detection for valet parking, a long-standing yet unsolved problem in autonomous driving. To this end, we present a deep line landmark detection system whose modules are carefully designed to be lightweight. Specifically, we first empirically design four general line landmarks, including three physical lines and one novel mental line, that are effective for valet parking. We then develop a deep network (LineMarkNet) to detect line landmarks from surround-view cameras: via the pre-calibrated homography, we fuse context from four separate cameras into a unified bird's-eye-view (BEV) space, combine the surround-view features with the BEV features, and employ a multi-task decoder to detect multiple line landmarks, applying a center-based strategy for the object detection task and a graph transformer that enhances the vision transformer with hierarchical graph reasoning for the semantic segmentation task. Finally, we parameterize the detected line landmarks (e.g., in intercept-slope form), and a novel filtering backend incorporates temporal and multi-view consistency to achieve smooth and stable detection. Moreover, we annotate a large-scale dataset to validate our method. Experimental results show that our framework outperforms several line detection methods and that the multi-task network runs in real time on the Qualcomm 820A platform while maintaining superior accuracy.
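
A minimal sketch of the homography-based surround-view-to-BEV warping step described above, using identity homographies as placeholders; `warp_to_bev` is an illustrative helper, not LineMarkNet code.

```python
# Hedged sketch: warp per-camera feature maps into a shared BEV grid and sum them.
import torch
import torch.nn.functional as F

def warp_to_bev(feat: torch.Tensor, H: torch.Tensor, bev_hw=(64, 64)) -> torch.Tensor:
    """Warp a (1, C, h, w) feature map into a BEV grid using a 3x3 homography that
    maps normalized BEV coordinates ([-1, 1]^2) to normalized image coordinates."""
    Hb, Wb = bev_hw
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, Hb), torch.linspace(-1, 1, Wb), indexing="ij"
    )
    ones = torch.ones_like(xs)
    bev_pts = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)   # (Hb*Wb, 3)
    img_pts = (H @ bev_pts.T).T                                    # (Hb*Wb, 3)
    img_pts = img_pts[:, :2] / img_pts[:, 2:].clamp(min=1e-6)
    grid = img_pts.reshape(1, Hb, Wb, 2)
    return F.grid_sample(feat, grid, align_corners=True)

# Toy usage: fuse four camera feature maps into one BEV map.
cams = [torch.rand(1, 16, 32, 32) for _ in range(4)]
homographies = [torch.eye(3) for _ in range(4)]                    # placeholder calibration
bev = sum(warp_to_bev(f, H) for f, H in zip(cams, homographies))
```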

* 29 pages, 12 figures 

PPD: A New Valet Parking Pedestrian Fisheye Dataset for Autonomous Driving

Sep 25, 2023
Zizhang Wu, Xinyuan Chen, Fan Song, Yuanzhu Gan, Tianhao Xu, Jian Pu, Rui Tang

Pedestrian detection under valet parking scenarios is fundamental for autonomous driving. However, pedestrians can appear in a variety of ways and postures under imperfect ambient conditions, which can adversely affect detection performance. Furthermore, models trained on public datasets that include pedestrians generally provide suboptimal outcomes for these valet parking scenarios. In this paper, we present the Parking Pedestrian Dataset (PPD), a large-scale fisheye dataset to support research dealing with real-world pedestrians, especially with occlusions and diverse postures. PPD consists of several distinctive types of pedestrians captured with fisheye cameras. Additionally, we present a pedestrian detection baseline on the PPD dataset, and introduce two data augmentation techniques that improve the baseline by enhancing the diversity of the original dataset. Extensive experiments validate the effectiveness of our novel data augmentation approaches over the baselines and the dataset's exceptional generalizability.

* 9 pages, 6 figures 

PointSSC: A Cooperative Vehicle-Infrastructure Point Cloud Benchmark for Semantic Scene Completion

Sep 22, 2023
Yuxiang Yan, Boda Liu, Jianfei Ai, Qinbu Li, Ru Wan, Jian Pu

Semantic Scene Completion (SSC) aims to jointly generate space occupancies and semantic labels for complex 3D scenes. Most existing SSC models focus on volumetric representations, which are memory-inefficient for large outdoor spaces. Point clouds provide a lightweight alternative, but existing benchmarks lack outdoor point cloud scenes with semantic labels. To address this, we introduce PointSSC, the first cooperative vehicle-infrastructure point cloud benchmark for semantic scene completion. These scenes offer long-range perception and minimal occlusion. We develop an automated annotation pipeline leveraging Segment Anything to efficiently assign semantics. To benchmark progress, we propose a LiDAR-based model with a Spatial-Aware Transformer for global and local feature extraction and a Completion and Segmentation Cooperative Module for joint completion and segmentation. PointSSC provides a challenging testbed to drive advances in semantic point cloud completion for real-world navigation.
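
As a toy illustration of joint completion-and-segmentation supervision (not the paper's Completion and Segmentation Cooperative Module), the sketch below combines a Chamfer term on completed points with per-point cross-entropy.

```python
# Hedged sketch: a generic joint objective for point completion plus semantic labels.
import torch
import torch.nn.functional as F

def chamfer(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between (N, 3) and (M, 3) point sets."""
    d = torch.cdist(pred, gt)                    # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def joint_loss(pred_pts, gt_pts, seg_logits, seg_labels, w_comp=1.0, w_seg=1.0):
    return w_comp * chamfer(pred_pts, gt_pts) + w_seg * F.cross_entropy(seg_logits, seg_labels)

# Toy usage: 256 predicted points with 10 semantic classes.
pred_pts = torch.rand(256, 3, requires_grad=True)
gt_pts = torch.rand(300, 3)
seg_logits = torch.randn(256, 10, requires_grad=True)
seg_labels = torch.randint(0, 10, (256,))
loss = joint_loss(pred_pts, gt_pts, seg_logits, seg_labels)
loss.backward()
```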

* 8 pages, 5 figures, submitted to ICRA2024 

Understanding Depth Map Progressively: Adaptive Distance Interval Separation for Monocular 3d Object Detection

Jun 19, 2023
Xianhui Cheng, Shoumeng Qiu, Zhikang Zou, Jian Pu, Xiangyang Xue

Monocular 3D object detection aims to locate objects in different scenes from just a single image. Due to the absence of depth information, several monocular 3D detection techniques have emerged that rely on auxiliary depth maps from the depth estimation task. There are multiple approaches to representing depth maps, including treating them as pseudo-LiDAR point clouds, leveraging implicit end-to-end learning of depth information, or considering them as an image input. However, these methods have certain drawbacks, such as their reliance on the accuracy of the estimated depth maps and suboptimal utilization of depth maps due to their image-based nature. While LiDAR-based methods can be applied to pseudo point clouds and convolutional neural networks (CNNs) to depth maps, each treats the depth map as a stand-in for another modality rather than as a representation in its own right. In this paper, we propose a framework named the Adaptive Distance Interval Separation Network (ADISN) that adopts a novel perspective on depth maps, treating them as a form that lies between LiDAR and images. We utilize an adaptive separation approach that partitions the depth map into various subgraphs based on distance and treats each of these subgraphs as an individual image for feature extraction. After adaptive separation, each subgraph contains only pixels within a learned interval range. If a truncated object lies within this range, an evident curved edge appears, which we can leverage for texture extraction with CNNs to obtain rich depth information at the pixel level. Meanwhile, to mitigate the inaccuracy of depth estimation, we design an uncertainty module. To take advantage of both images and depth maps, we use different branches to learn localization and appearance tasks separately.
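
A hedged sketch of the distance-interval separation idea with fixed, hand-picked interval bounds; ADISN learns these bounds adaptively, so the code below only illustrates the partitioning step.

```python
# Hedged sketch: slice a depth map into masked sub-images, one per distance band.
import torch

def separate_by_distance(depth: torch.Tensor, bounds) -> torch.Tensor:
    """Split a (1, 1, H, W) depth map into len(bounds)-1 masked copies, one per
    interval [bounds[i], bounds[i+1])."""
    subs = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        mask = (depth >= lo) & (depth < hi)
        subs.append(depth * mask)                # pixels outside the band are zeroed
    return torch.cat(subs, dim=1)                # (1, K, H, W), one channel per band

# Toy usage: depths in [0, 80) m split into four bands.
depth = torch.rand(1, 1, 64, 64) * 80.0
sub_maps = separate_by_distance(depth, bounds=[0.0, 10.0, 25.0, 45.0, 80.0])
print(sub_maps.shape)                            # torch.Size([1, 4, 64, 64])
```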

Learning Monocular Depth in Dynamic Environment via Context-aware Temporal Attention

May 12, 2023
Zizhang Wu, Zhuozheng Li, Zhi-Gang Fan, Yunzhe Wu, Yuanzhu Gan, Jian Pu, Xianzhi Li

Monocular depth estimation has recently shown encouraging prospects, especially for autonomous driving. To tackle the ill-posed problem of 3D geometric reasoning from 2D monocular images, multi-frame monocular methods leverage the perspective correlation information from sequential temporal frames. However, moving objects such as cars and trains usually violate the static scene assumption, leading to feature inconsistency and misaligned cost values that mislead the optimization. In this work, we present CTA-Depth, a Context-aware Temporal Attention guided network for multi-frame monocular depth estimation. Specifically, we first apply a multi-level attention enhancement module to integrate multi-level image features and obtain an initial depth and pose estimation. The proposed CTA-Refiner is then adopted to alternately optimize the depth and pose. During the refinement process, context-aware temporal attention (CTA) captures global temporal-context correlations to maintain the feature consistency and estimation integrity of moving objects. In particular, we propose a long-range geometry embedding (LGE) module to produce a long-range temporal geometry prior. Our approach achieves significant improvements over state-of-the-art approaches on three benchmark datasets.
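
As a minimal sketch of the underlying operation, the snippet below implements plain temporal cross-attention between current-frame and previous-frame feature tokens using PyTorch's built-in multi-head attention; CTA-Depth's context-aware module adds further structure on top of this.

```python
# Hedged sketch: current-frame tokens attend to previous-frame tokens.
import torch
import torch.nn as nn

class TemporalCrossAttention(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cur: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
        # cur, prev: (B, N, C) token sequences, e.g. flattened feature maps
        out, _ = self.attn(query=cur, key=prev, value=prev)
        return self.norm(cur + out)              # residual connection + layer norm

# Toy usage: two frames of 16x16 feature maps flattened to 256 tokens.
cur = torch.rand(2, 256, 64)
prev = torch.rand(2, 256, 64)
fused = TemporalCrossAttention()(cur, prev)
print(fused.shape)                               # torch.Size([2, 256, 64])
```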

* accepted by IJCAI 2023; 9 pages, 5 figures 

Multi-to-Single Knowledge Distillation for Point Cloud Semantic Segmentation

Apr 28, 2023
Shoumeng Qiu, Feng Jiang, Haiqiang Zhang, Xiangyang Xue, Jian Pu

3D point cloud semantic segmentation is one of the fundamental tasks for environmental understanding. Although significant progress has been made in recent years, the performance of classes with few examples or few points is still far from satisfactory. In this paper, we propose a novel multi-to-single knowledge distillation framework for the 3D point cloud semantic segmentation task to boost the performance of those hard classes. Instead of fusing all the points of multi-scans directly, only the instances that belong to the previously defined hard classes are fused. To effectively and sufficiently distill valuable knowledge from multi-scans, we leverage a multi-level distillation framework, i.e., feature representation distillation, logit distillation, and affinity distillation. We further develop a novel instance-aware affinity distillation algorithm for capturing high-level structural knowledge to enhance the distillation efficacy for hard classes. Finally, we conduct experiments on the SemanticKITTI dataset, and the results on both the validation and test sets demonstrate that our method yields substantial improvements compared with the baseline method. The code is available at https://github.com/skyshoumeng/M2SKD.
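
A toy sketch of the affinity-distillation component: matching pairwise cosine-similarity matrices of teacher and student point features. The instance-aware selection of hard-class points described above is omitted, and the features here are random placeholders.

```python
# Hedged sketch: distill the pairwise affinity structure from teacher to student.
import torch
import torch.nn.functional as F

def affinity(feats: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine-similarity matrix for (N, C) features."""
    f = F.normalize(feats, dim=1)
    return f @ f.T

def affinity_distill_loss(student_feats, teacher_feats):
    return F.mse_loss(affinity(student_feats), affinity(teacher_feats))

# Toy usage: 128 sampled points with 32-dim features from each model.
student = torch.rand(128, 32, requires_grad=True)
teacher = torch.rand(128, 32)
loss = affinity_distill_loss(student, teacher)
loss.backward()
```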

* ICRA 2023  

Knowledge Distillation from 3D to Bird's-Eye-View for LiDAR Semantic Segmentation

Apr 22, 2023
Feng Jiang, Heng Gao, Shoumeng Qiu, Haiqiang Zhang, Ru Wan, Jian Pu

LiDAR point cloud segmentation is one of the most fundamental tasks for autonomous driving scene understanding. However, existing models struggle to achieve both high inference speed and high accuracy simultaneously: voxel-based methods perform well in accuracy, while Bird's-Eye-View (BEV)-based methods achieve real-time inference. To bridge this gap, we develop an effective 3D-to-BEV knowledge distillation method that transfers rich knowledge from 3D voxel-based models to BEV-based models. Our framework mainly consists of two modules: a voxel-to-pillar distillation module and a label-weight distillation module. Voxel-to-pillar distillation distills sparse 3D features into BEV features at the middle layers, making the BEV-based model aware of more structural and geometric information. Label-weight distillation helps the model pay more attention to regions with more height information. Finally, we conduct experiments on the SemanticKITTI and Paris-Lille-3D datasets. The results on the SemanticKITTI test set show an improvement of more than 5%, with gains of more than 15% for classes such as motorcycle and person. The code can be accessed at https://github.com/fengjiang5/Knowledge-Distillation-from-Cylinder3D-to-PolarNet.
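
As a rough sketch of the voxel-to-pillar idea (dense tensors instead of the paper's sparse features, and without label weighting), the snippet below collapses a 3D voxel feature volume along height and aligns it with the student's BEV features.

```python
# Hedged sketch: collapse teacher voxel features over height and match the student BEV map.
import torch
import torch.nn.functional as F

def voxel_to_pillar(voxel_feats: torch.Tensor) -> torch.Tensor:
    """(B, C, Z, H, W) voxel features -> (B, C, H, W) pillar features by max over height."""
    return voxel_feats.max(dim=2).values

def pillar_distill_loss(bev_student: torch.Tensor, voxel_teacher: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(bev_student, voxel_to_pillar(voxel_teacher))

# Toy usage: teacher voxel grid 16 high over a 64x64 BEV plane, 32 channels.
teacher_voxels = torch.rand(1, 32, 16, 64, 64)
student_bev = torch.rand(1, 32, 64, 64, requires_grad=True)
loss = pillar_distill_loss(student_bev, teacher_voxels)
loss.backward()
```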

* ICME 2023 Accepted 