Ruigang Yang

Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving

May 25, 2023
Wenhao Cheng, Junbo Yin, Wei Li, Ruigang Yang, Jianbing Shen

This paper addresses the problem of 3D referring expression comprehension (REC) in the autonomous driving scenario, which aims to ground a natural language expression to the targeted region in LiDAR point clouds. Previous REC approaches usually focus on the 2D or 3D-indoor domain and are not suitable for accurately predicting the location of the queried 3D region in an autonomous driving scene. In addition, the upper-bound limitation and the heavy computation cost of these pipelines motivate us to explore a better solution. In this work, we propose a new multi-modal visual grounding task, termed LiDAR Grounding. We then devise a Multi-modal Single Shot Grounding (MSSG) approach with an effective token fusion strategy. It jointly learns a LiDAR-based object detector with the language features and predicts the targeted region directly from the detector without any post-processing. Moreover, image features can be flexibly integrated into our approach to provide rich texture and color information. The cross-modal learning encourages the detector to concentrate on important regions in the point cloud by considering the informative language expressions, leading to much better accuracy and efficiency. Extensive experiments on the Talk2Car dataset demonstrate the effectiveness of the proposed method. Our work offers a deeper insight into the LiDAR-based grounding task, and we expect it to point to a promising direction for the autonomous driving community.
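
To make the token fusion concrete, here is a minimal sketch (not the authors' MSSG implementation) of how language token embeddings could be fused into LiDAR BEV features with cross-attention before a single-shot detection head; module and parameter names such as BEVTokenFusion are illustrative assumptions.

```python
# Hypothetical sketch of cross-modal token fusion for LiDAR grounding.
import torch
import torch.nn as nn

class BEVTokenFusion(nn.Module):
    def __init__(self, bev_dim=256, text_dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(bev_dim, num_heads, batch_first=True)
        self.text_proj = nn.Linear(text_dim, bev_dim)

    def forward(self, bev_feat, text_tokens):
        # bev_feat: (B, C, H, W) LiDAR BEV features
        # text_tokens: (B, L, text_dim) language token embeddings
        B, C, H, W = bev_feat.shape
        queries = bev_feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        keys = self.text_proj(text_tokens)              # (B, L, C)
        fused, _ = self.attn(queries, keys, keys)       # BEV cells attend to tokens
        fused = fused + queries                         # residual connection
        return fused.transpose(1, 2).reshape(B, C, H, W)

bev = torch.randn(2, 256, 128, 128)
tokens = torch.randn(2, 16, 256)
out = BEVTokenFusion()(bev, tokens)                     # (2, 256, 128, 128)
```

Because the fused BEV map is scored directly by the detection head, the referred box can be predicted in a single shot without proposal re-ranking or other post-processing, which is the efficiency argument made above.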

LWSIS: LiDAR-guided Weakly Supervised Instance Segmentation for Autonomous Driving

Dec 07, 2022
Xiang Li, Junbo Yin, Botian Shi, Yikang Li, Ruigang Yang, Jianbing Shen

Image instance segmentation is a fundamental research topic in autonomous driving, crucial for scene understanding and road safety. Advanced learning-based approaches often rely on costly 2D mask annotations for training. In this paper, we present a more annotation-efficient framework, LiDAR-guided Weakly Supervised Instance Segmentation (LWSIS), which leverages off-the-shelf 3D data, i.e., point clouds together with 3D boxes, as natural weak supervision for training 2D image instance segmentation models. Our LWSIS not only exploits the complementary information in multimodal data during training, but also significantly reduces the annotation cost of dense 2D masks. In detail, LWSIS consists of two crucial modules, Point Label Assignment (PLA) and Graph-based Consistency Regularization (GCR). The former automatically assigns 3D points as 2D point-wise labels, while the latter further refines the predictions by enforcing geometry and appearance consistency across the multimodal data. Moreover, we provide a secondary instance segmentation annotation on nuScenes, named nuInsSeg, to encourage further research on multimodal perception tasks. Extensive experiments on nuInsSeg, as well as the large-scale Waymo dataset, show that LWSIS can substantially improve existing weakly supervised segmentation models by only involving 3D data during training. Additionally, LWSIS can be incorporated into 3D object detectors like PointPainting to boost 3D detection performance for free. The code and dataset are available at https://github.com/Serenos/LWSIS.
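
As an illustration of the PLA idea, the sketch below projects LiDAR points that fall inside an annotated 3D box onto the image and keeps the visible pixels as point-wise foreground labels; the single 3x4 lidar-to-image calibration convention and the function names are assumptions, not the paper's exact interface.

```python
# Hypothetical sketch of Point Label Assignment (PLA): points inside a 3D box
# become 2D point-wise labels for that instance after projection.
import numpy as np

def project_points_to_image(points_xyz, lidar2img):
    """points_xyz: (N, 3) LiDAR points; lidar2img: (3, 4) projection matrix."""
    homo = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    uvw = homo @ lidar2img.T                       # (N, 3)
    depth = uvw[:, 2:3]
    uv = uvw[:, :2] / np.clip(depth, 1e-6, None)   # pixel coordinates
    return uv, depth[:, 0]

def assign_point_labels(points_xyz, box_masks, lidar2img, img_hw):
    """box_masks: list of (N,) booleans marking points inside each 3D box.
    Returns per-instance (u, v) pixel labels usable as weak 2D supervision."""
    uv, depth = project_points_to_image(points_xyz, lidar2img)
    h, w = img_hw
    visible = (depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
              & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return [uv[inside & visible].astype(np.int64) for inside in box_masks]
```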

* AAAI2023 

SSDA3D: Semi-supervised Domain Adaptation for 3D Object Detection from Point Cloud

Dec 06, 2022
Yan Wang, Junbo Yin, Wei Li, Pascal Frossard, Ruigang Yang, Jianbing Shen

LiDAR-based 3D object detection is an indispensable task in advanced autonomous driving systems. Although impressive detection results have been achieved by state-of-the-art 3D detectors, they suffer from significant performance degradation when facing unseen domains, such as different LiDAR configurations, different cities, and varying weather conditions. The mainstream approaches tend to solve these challenges with unsupervised domain adaptation (UDA) techniques. However, these UDA solutions yield unsatisfactory 3D detection results when there is a severe domain shift, e.g., from Waymo (64-beam) to nuScenes (32-beam). To address this, we present SSDA3D, a novel Semi-Supervised Domain Adaptation method for 3D object detection, where only a few labeled target samples are available, yet they can significantly improve the adaptation performance. In particular, SSDA3D consists of an Inter-domain Adaptation stage and an Intra-domain Generalization stage. In the first stage, an Inter-domain Point-CutMix module is introduced to efficiently align the point cloud distributions across domains. Point-CutMix generates mixed samples of an intermediate domain, thus encouraging the model to learn domain-invariant knowledge. Then, in the second stage, we further enhance the model for better generalization on the unlabeled target set. This is achieved by exploring Intra-domain Point-MixUp in semi-supervised learning, which essentially regularizes the pseudo-label distribution. Experiments from Waymo to nuScenes show that, with only 10% of the labeled target data, SSDA3D can surpass the fully supervised oracle model trained with 100% of the target labels. Our code is available at https://github.com/yinjunbo/SSDA3D.
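
The following is a minimal sketch of an inter-domain Point-CutMix-style augmentation, assuming the cut region is an azimuth sector; the exact region selection and the handling of boxes/labels in SSDA3D may differ, so treat the names and the recipe as illustrative.

```python
# Hypothetical sketch of inter-domain Point-CutMix: cut an azimuth sector from
# the target-domain scan and paste it into the source-domain scan, producing a
# mixed sample of an intermediate domain (box/label filtering is omitted here).
import numpy as np

def point_cutmix(src_points, tgt_points, sector_deg=90.0):
    """src_points, tgt_points: (N, C) arrays with x, y, z in the first 3 columns."""
    start = np.random.uniform(0.0, 360.0)
    end = (start + sector_deg) % 360.0

    def in_sector(pts):
        az = np.degrees(np.arctan2(pts[:, 1], pts[:, 0])) % 360.0
        return (az >= start) & (az < end) if start < end \
               else (az >= start) | (az < end)

    keep_src = src_points[~in_sector(src_points)]   # source outside the sector
    paste_tgt = tgt_points[in_sector(tgt_points)]   # target inside the sector
    return np.concatenate([keep_src, paste_tgt], axis=0)
```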

* Accepted by AAAI 2023 

Transformation-Equivariant 3D Object Detection for Autonomous Driving

Dec 01, 2022
Hai Wu, Chenglu Wen, Wei Li, Xin Li, Ruigang Yang, Cheng Wang

3D object detection has received increasing attention in autonomous driving recently. Objects in 3D scenes are distributed with diverse orientations, yet ordinary detectors do not explicitly model the variations induced by rotation and reflection transformations. Consequently, large networks and extensive data augmentation are required for robust detection. Recent equivariant networks explicitly model the transformation variations by applying shared networks on multiple transformed point clouds, showing great potential for object geometry modeling. However, such networks are difficult to apply to 3D object detection in autonomous driving due to their large computation cost and slow inference speed. In this work, we present TED, an efficient Transformation-Equivariant 3D Detector that overcomes these computation and speed issues. TED first applies a sparse convolution backbone to extract multi-channel transformation-equivariant voxel features, and then aligns and aggregates these equivariant features into lightweight and compact representations for high-performance 3D object detection. On the highly competitive KITTI 3D car detection leaderboard, TED ranked 1st among all submissions with competitive efficiency.
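
The sketch below illustrates the transformation-equivariant idea on raw points with a toy shared MLP encoder: the same network is applied to several rotated copies of the input, and the resulting features are aggregated into a compact representation. TED itself uses a sparse convolution backbone on voxels, so this is only a conceptual analogue with assumed names and dimensions.

```python
# Hypothetical sketch of feature extraction over multiple transformed copies.
import math
import torch
import torch.nn as nn

def rotate_z(points, angle):
    c, s = math.cos(angle), math.sin(angle)
    rot = torch.tensor([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]], dtype=points.dtype)
    return points @ rot.T

class EquivariantEncoder(nn.Module):
    def __init__(self, num_rotations=4, feat_dim=64):
        super().__init__()
        self.num_rotations = num_rotations
        self.mlp = nn.Sequential(nn.Linear(3, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, points):                                    # points: (N, 3)
        feats = []
        for k in range(self.num_rotations):
            angle = 2.0 * math.pi * k / self.num_rotations
            feats.append(self.mlp(rotate_z(points, angle)))       # shared weights
        # Aggregate across transformations into a compact per-point feature.
        return torch.stack(feats, dim=0).max(dim=0).values        # (N, feat_dim)
```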

* Accepted by AAAI 2023 

Zero-shot Point Cloud Segmentation by Transferring Geometric Primitives

Oct 18, 2022
Runnan Chen, Xinge Zhu, Nenglun Chen, Wei Li, Yuexin Ma, Ruigang Yang, Wenping Wang

We investigate transductive zero-shot point cloud semantic segmentation, where unseen class labels are unavailable during training. In fact, 3D geometric elements are essential cues for reasoning about 3D object types: if two categories share similar geometric primitives, they also have similar semantic representations. Based on this observation, we propose a novel framework that learns the geometric primitives shared by objects of seen and unseen categories, where the learned primitives serve to transfer knowledge from seen to unseen categories. Specifically, a group of learnable prototypes automatically encodes geometric primitives via back-propagation. The visual representation of a point is then formulated as the similarity vector between its feature and the prototypes, which carries semantic cues for both seen and unseen categories. Besides, considering that a 3D object is composed of multiple geometric primitives, we formulate the semantic representation as a mixture-distributed embedding for fine-grained matching with the visual representation. Finally, to effectively learn the geometric primitives and alleviate misclassification, we propose a novel unknown-aware infoNCE loss to align the visual and semantic representations. As a result, guided by the semantic representations, the network recognizes novel objects represented by geometric primitives. Extensive experiments show that our method significantly outperforms other state-of-the-art methods in harmonic mean intersection-over-union (hIoU), with improvements of 17.8%, 30.4% and 9.2% on the S3DIS, ScanNet and SemanticKITTI datasets, respectively. Code will be released.
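
A minimal sketch of the prototype-similarity representation is given below, assuming cosine similarity with a temperature; the dimensions, the temperature value and the module name are illustrative, not the paper's configuration.

```python
# Hypothetical sketch: point features re-expressed as similarities to a bank
# of learnable geometric-primitive prototypes, shared by seen and unseen classes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrimitiveSimilarity(nn.Module):
    def __init__(self, feat_dim=96, num_prototypes=64, temperature=0.1):
        super().__init__()
        # Learnable prototypes encode geometric primitives via back-propagation.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))
        self.temperature = temperature

    def forward(self, point_feats):                    # (N, feat_dim)
        f = F.normalize(point_feats, dim=-1)
        p = F.normalize(self.prototypes, dim=-1)
        sim = f @ p.T / self.temperature               # cosine-similarity logits
        return F.softmax(sim, dim=-1)                  # (N, num_prototypes)
```

These similarity vectors would then be matched against per-class semantic embeddings (e.g., mixture-distributed embeddings over the prototypes) with an infoNCE-style loss, which is how the abstract describes unseen classes being recognized by their primitives.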

Vision-Centric BEV Perception: A Survey

Aug 04, 2022
Yuexin Ma, Tai Wang, Xuyang Bai, Huitong Yang, Yuenan Hou, Yaming Wang, Yu Qiao, Ruigang Yang, Dinesh Manocha, Xinge Zhu

Vision-centric BEV perception has recently received increased attention from both industry and academia due to its inherent merits, including providing a natural representation of the world and being fusion-friendly. With the rapid development of deep learning, numerous methods have been proposed to address vision-centric BEV perception, yet there is no recent survey of this novel and growing research field. To stimulate future research, this paper presents a comprehensive survey of recent progress in vision-centric BEV perception and its extensions. It collects and organizes recent work and gives a systematic review and summary of commonly used algorithms. It also provides in-depth analyses and comparative results on several BEV perception tasks, facilitating comparisons among future works and inspiring future research directions. Moreover, empirical implementation details are discussed to benefit the development of related algorithms.

* project page at https://github.com/4DVLab/Vision-Centric-BEV-Perception; 22 pages, 15 figures 

Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds

Jul 26, 2022
Junbo Yin, Jianbing Shen, Xin Gao, David Crandall, Ruigang Yang

Previous works on LiDAR-based 3D object detection mainly focus on the single-frame paradigm. In this paper, we propose to detect 3D objects by exploiting the temporal information in multiple frames, i.e., point cloud videos. We empirically categorize the temporal information into short-term and long-term patterns. To encode the short-term data, we present a Grid Message Passing Network (GMPNet), which considers each grid (i.e., a group of points) as a node and constructs a k-NN graph with the neighboring grids. To update the features of a grid, GMPNet iteratively collects information from its neighbors, thus mining the motion cues in grids from nearby frames. To further aggregate the long-term frames, we propose an Attentive Spatiotemporal Transformer GRU (AST-GRU), which contains a Spatial Transformer Attention (STA) module and a Temporal Transformer Attention (TTA) module. STA and TTA enhance the vanilla GRU to focus on small objects and better align moving objects. Our overall framework supports both online and offline video object detection in point clouds. We implement our algorithm on top of prevalent anchor-based and anchor-free detectors. Evaluation results on the challenging nuScenes benchmark show the superior performance of our method, which achieved 1st place on the leaderboard, without any bells and whistles, at the time of submission.
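
The sketch below shows one grid message-passing step in the spirit of GMPNet: non-empty grids form the nodes of a k-NN graph, messages from neighboring grids are max-pooled, and a GRU cell updates the grid feature. The MLP message function, the GRU update and the value of k are assumptions, not the paper's exact design.

```python
# Hypothetical sketch of a single grid message-passing step.
import torch
import torch.nn as nn

class GridMessagePassing(nn.Module):
    def __init__(self, feat_dim=64, k=8):
        super().__init__()
        self.k = k
        self.message = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        self.update = nn.GRUCell(feat_dim, feat_dim)

    def forward(self, grid_xyz, grid_feat):
        # grid_xyz: (N, 3) grid centers; grid_feat: (N, C) grid features
        dist = torch.cdist(grid_xyz, grid_xyz)                      # (N, N)
        knn = dist.topk(self.k + 1, largest=False).indices[:, 1:]   # drop self
        neigh = grid_feat[knn]                                      # (N, k, C)
        center = grid_feat.unsqueeze(1).expand_as(neigh)            # (N, k, C)
        msg = self.message(torch.cat([center, neigh], dim=-1)).max(dim=1).values
        return self.update(msg, grid_feat)                          # updated (N, C)
```

Iterating such a step over grids gathered from nearby frames is one way the short-term motion cues described above could be mined before the long-term AST-GRU aggregation.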

STCrowd: A Multimodal Dataset for Pedestrian Perception in Crowded Scenes

Apr 03, 2022
Peishan Cong, Xinge Zhu, Feng Qiao, Yiming Ren, Xidong Peng, Yuenan Hou, Lan Xu, Ruigang Yang, Dinesh Manocha, Yuexin Ma

Accurately detecting and tracking pedestrians in 3D space is challenging due to large variations in rotation, pose and scale. The situation becomes even worse in dense crowds with severe occlusions. However, existing benchmarks either only provide 2D annotations, or have limited 3D annotations with a low-density pedestrian distribution, making it difficult to build a reliable pedestrian perception system, especially in crowded scenes. To better evaluate pedestrian perception algorithms in crowded scenarios, we introduce a large-scale multimodal dataset, STCrowd. Specifically, STCrowd contains a total of 219K pedestrian instances, with 20 persons per frame on average and various levels of occlusion. We provide synchronized LiDAR point clouds and camera images as well as their corresponding 3D labels and joint IDs. STCrowd can be used for various tasks, including LiDAR-only, image-only, and sensor-fusion based pedestrian detection and tracking, and we provide baselines for most of these tasks. In addition, considering the sparse global distribution and density-varying local distribution of pedestrians, we further propose a novel method, Density-aware Hierarchical heatmap Aggregation (DHA), to enhance pedestrian perception in crowded scenes. Extensive experiments show that our new method achieves state-of-the-art performance for pedestrian detection on various datasets.
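
Purely as a generic illustration (not the authors' DHA), the snippet below shows one way heatmap aggregation could be made density-aware: estimate local pedestrian density from a fine-scale heatmap and use it to weight a coarse and a fine heatmap, trusting the fine scale in crowded regions. All names and the weighting rule are assumptions.

```python
# Hypothetical density-aware weighting of two heatmap scales.
import torch
import torch.nn.functional as F

def density_weighted_aggregate(coarse_hm, fine_hm, kernel_size=9):
    """coarse_hm, fine_hm: (B, 1, H, W) pedestrian center heatmaps."""
    # Local density estimate: average heatmap response in a neighborhood.
    density = F.avg_pool2d(fine_hm, kernel_size, stride=1, padding=kernel_size // 2)
    w = density.clamp(0.0, 1.0)          # higher density -> trust the fine scale
    return w * fine_hm + (1.0 - w) * coarse_hm
```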

* accepted at CVPR2022 

Towards 3D Scene Understanding by Referring Synthetic Models

Mar 20, 2022
Runnan Chen, Xinge Zhu, Nenglun Chen, Dawei Wang, Wei Li, Yuexin Ma, Ruigang Yang, Wenping Wang

Promising performance has been achieved for visual perception on point clouds. However, current methods typically rely on labour-intensive annotations of scene scans. In this paper, we explore how synthetic models can alleviate the real-scene annotation burden: taking labelled 3D synthetic models as a reference for supervision, the neural network aims to recognize specific categories of objects in a real scene scan (without any scene annotation for supervision). The problem studies how to transfer knowledge from synthetic 3D models to real 3D scenes and is named Referring Transfer Learning (RTL). The main challenge is closing the model-to-scene gap (from a single model to the scene) and the synthetic-to-real gap (from a synthetic model to a real scene's object). To this end, we propose a simple yet effective framework that performs two alignment operations. First, physical data alignment aims to make the synthetic models cover the diversity of the scene's objects with data processing techniques. Then, a novel convex-hull regularized feature alignment introduces learnable prototypes to project the point features of both synthetic models and real scenes into a unified feature space, which alleviates the domain gap. These operations ease the model-to-scene and synthetic-to-real difficulties for a network recognizing target objects in a real, unseen scene. Experiments show that our method achieves an average mAP of 46.08% and 55.49% on the ScanNet and S3DIS datasets by learning from the synthetic models of the ModelNet dataset. Code will be publicly available.
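
Below is a minimal sketch of the convex-hull regularized feature alignment idea: features from both synthetic models and real scenes are re-expressed as convex combinations (softmax weights) of shared learnable prototypes, so both domains are projected into the same convex hull of prototype space. Names and dimensions are assumptions.

```python
# Hypothetical sketch of convex-hull regularized feature alignment.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvexHullAlignment(nn.Module):
    def __init__(self, feat_dim=96, num_prototypes=32):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))

    def forward(self, feats):                          # (N, feat_dim)
        logits = feats @ self.prototypes.T             # affinity to prototypes
        weights = F.softmax(logits, dim=-1)            # convex coefficients
        return weights @ self.prototypes               # projection into the hull

# Passing synthetic-model features and real-scene features through this shared
# projection places both in a unified feature space before classification.
```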

An Intelligent Self-driving Truck System For Highway Transportation

Dec 31, 2021
Dawei Wang, Lingping Gao, Ziquan Lan, Wei Li, Jiaping Ren, Jiahui Zhang, Peng Zhang, Pei Zhou, Shengao Wang, Jia Pan, Dinesh Manocha, Ruigang Yang

Recently, there have been many advances in the autonomous driving community, attracting considerable attention from academia and industry. However, existing works mainly focus on cars; extra development is still required for self-driving truck algorithms and models. In this paper, we introduce an intelligent self-driving truck system. Our system consists of three main components: 1) a realistic traffic simulation module for generating realistic traffic flow in testing scenarios; 2) a high-fidelity truck model designed and evaluated to mimic real truck responses in real-world deployment; and 3) an intelligent planning module with a learning-based decision-making algorithm and a multi-mode trajectory planner that takes into account the truck's constraints, road slope changes, and the surrounding traffic flow. We provide quantitative evaluations of each component individually to demonstrate its fidelity and performance. We also deploy the proposed system on a real truck and conduct real-world experiments, which show our system's capacity to mitigate the sim-to-real gap. Our code is available at https://github.com/InceptioResearch/IITS
