Jiageng Mao

CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data

Mar 26, 2023
Yihan Zeng, Chenhan Jiang, Jiageng Mao, Jianhua Han, Chaoqiang Ye, Qingqiu Huang, Dit-Yan Yeung, Zhen Yang, Xiaodan Liang, Hang Xu

Contrastive Language-Image Pre-training, benefiting from large-scale unlabeled text-image pairs, has demonstrated great performance in open-world vision understanding tasks. However, due to the scarcity of text-3D data pairs, adapting the success of 2D Vision-Language Models (VLMs) to the 3D space remains an open problem. Existing works that leverage VLMs for 3D understanding generally resort to constructing intermediate 2D representations for the 3D data, but at the cost of losing 3D geometry information. To take a step toward open-world 3D vision understanding, we propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to directly learn transferable 3D point cloud representations in realistic scenarios with a novel proxy alignment mechanism. Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios and build well-aligned, instance-based text-image-point proxies from those complex scenarios. On top of that, we propose a cross-modal contrastive objective to learn semantically and instance-level aligned point cloud representations. Experimental results on both indoor and outdoor scenarios show that our learned 3D representation has great transfer ability in downstream tasks, including zero-shot and few-shot 3D recognition, where it outperforms state-of-the-art methods by large margins. Furthermore, we analyze the capability of different representations in real scenarios and present an optional ensemble scheme.
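
The abstract describes a cross-modal contrastive objective that aligns point cloud embeddings with paired text and image proxies. Below is a minimal PyTorch sketch of such a CLIP-style objective over instance-level triplets; the function names, temperature value, and symmetric InfoNCE formulation are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of aligned embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (N, N) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)   # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def clip2_style_objective(text_emb, image_emb, point_emb):
    """Pull each point-cloud embedding toward its paired text and image proxies."""
    return (contrastive_loss(point_emb, text_emb) +
            contrastive_loss(point_emb, image_emb))

# Random features stand in for the outputs of the three encoders.
N, D = 8, 512
loss = clip2_style_objective(torch.randn(N, D), torch.randn(N, D), torch.randn(N, D))
```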

* To appear at CVPR 2023 

3D Object Detection for Autonomous Driving: A Review and New Outlooks

Jun 19, 2022
Jiageng Mao, Shaoshuai Shi, Xiaogang Wang, Hongsheng Li

Autonomous driving has received increasing attention in recent years for its potential to relieve drivers' burdens and improve driving safety. In modern autonomous driving pipelines, the perception system is an indispensable component that aims to accurately estimate the status of the surrounding environment and provide reliable observations for prediction and planning. 3D object detection, which predicts the locations, sizes, and categories of the critical 3D objects near an autonomous vehicle, is an important part of the perception system. This paper reviews the advances in 3D object detection for autonomous driving. First, we introduce the background of 3D object detection and discuss the challenges in this task. Second, we conduct a comprehensive survey of the progress in 3D object detection from the aspects of models and sensory inputs, including LiDAR-based, camera-based, and multi-modal detection approaches, and provide an in-depth analysis of the potential and challenges of each category of methods. Additionally, we systematically investigate the applications of 3D object detection in driving systems. Finally, we analyze the performance of 3D object detection approaches, summarize the research trends over the years, and discuss promising future directions for this area.

* A survey on 3D object detection for autonomous driving. Project page is at https://github.com/PointsCoder/Awesome-3D-Object-Detection-for-Autonomous-Driving 

Point2Seq: Detecting 3D Objects as Sequences

Mar 25, 2022
Yujing Xue, Jiageng Mao, Minzhe Niu, Hang Xu, Michael Bi Mi, Wei Zhang, Xiaogang Wang, Xinchao Wang

We present a simple and effective framework, named Point2Seq, for 3D object detection from point clouds. In contrast to previous methods that typically predict all attributes of a 3D object at once, we explicitly model the interdependencies between the attributes of 3D objects, which in turn enables better detection accuracy. Specifically, we view each 3D object as a sequence of words and reformulate the 3D object detection task as decoding words from 3D scenes in an auto-regressive manner. We further propose a lightweight scene-to-sequence decoder that auto-regressively generates words conditioned on features from the 3D scene as well as cues from the preceding words. The predicted words eventually constitute a set of sequences that completely describe the 3D objects in the scene, and all predicted sequences are then automatically assigned to their respective ground truths through similarity-based sequence matching. Our approach is conceptually intuitive and can be readily plugged into most existing 3D detection backbones without adding much computational overhead; the proposed sequential decoding paradigm, on the other hand, can better exploit information from complex 3D scenes with the aid of preceding predicted words. Without bells and whistles, our method significantly outperforms previous anchor- and center-based 3D object detection frameworks, setting a new state of the art on the challenging ONCE dataset as well as the Waymo Open Dataset. Code is available at https://github.com/ocNflag/point2seq.
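
A rough PyTorch sketch of the auto-regressive "detect as sequence" idea follows: each object attribute ("word") is predicted conditioned on a scene feature and the previously decoded words. The module name, the GRU-based fusion of preceding-word cues, and the seven-word box layout are hypothetical simplifications, not the paper's decoder.

```python
import torch
import torch.nn as nn

class SceneToSequenceDecoderSketch(nn.Module):
    """Predicts one object 'word' (attribute) at a time, conditioned on a
    per-location scene feature and the previously decoded words."""

    def __init__(self, feat_dim=256, num_words=7):
        super().__init__()
        # e.g. center x/y/z, length, width, height, heading as a 7-word sequence
        self.word_embed = nn.Linear(1, feat_dim)        # embed each scalar word
        self.fuse = nn.GRUCell(feat_dim, feat_dim)      # carries cues from preceding words
        self.heads = nn.ModuleList([nn.Linear(feat_dim, 1) for _ in range(num_words)])

    def forward(self, scene_feat):                      # scene_feat: (B, feat_dim)
        state = scene_feat
        prev = torch.zeros(scene_feat.size(0), 1, device=scene_feat.device)
        words = []
        for head in self.heads:
            state = self.fuse(self.word_embed(prev), state)
            prev = head(state)                          # next attribute, given the history
            words.append(prev)
        return torch.cat(words, dim=-1)                 # (B, num_words) box "sentence"

boxes = SceneToSequenceDecoderSketch()(torch.randn(4, 256))
```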

* To appear in CVPR 2022 

Voxel Transformer for 3D Object Detection

Sep 13, 2021
Jiageng Mao, Yujing Xue, Minzhe Niu, Haoyue Bai, Jiashi Feng, Xiaodan Liang, Hang Xu, Chunjing Xu

We present Voxel Transformer (VoTr), a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds. Conventional 3D convolutional backbones in voxel-based 3D detectors cannot efficiently capture large context information, which is crucial for object recognition and localization, owing to their limited receptive fields. In this paper, we resolve the problem by introducing a Transformer-based architecture that enables long-range interactions between voxels through self-attention. Because non-empty voxels are naturally sparse yet numerous, directly applying a standard Transformer to voxels is non-trivial. To this end, we propose the sparse voxel module and the submanifold voxel module, which can operate effectively on empty and non-empty voxel positions. To further enlarge the attention range while maintaining computational overhead comparable to the convolutional counterparts, we propose two attention mechanisms for multi-head attention in those two modules, Local Attention and Dilated Attention, and we further propose Fast Voxel Query to accelerate the querying process in multi-head attention. VoTr contains a series of sparse and submanifold voxel modules and can be applied in most voxel-based detectors. Our proposed VoTr shows consistent improvement over the convolutional baselines while maintaining computational efficiency on the KITTI dataset and the Waymo Open Dataset.
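
The sketch below illustrates the two attention ranges described above in simplified PyTorch: Local Attention gathers a dense cube of nearby voxel offsets, while Dilated Attention samples a sparse set of far-away offsets at a fixed budget; attention is then computed only over the gathered voxels. The offset parameters and the use of nn.MultiheadAttention are illustrative assumptions, and the hash-based Fast Voxel Query gathering step is omitted.

```python
import torch
import torch.nn as nn

def local_offsets(radius=1):
    """Dense integer offsets in a small cube around the query voxel (Local Attention)."""
    r = torch.arange(-radius, radius + 1)
    return torch.stack(torch.meshgrid(r, r, r, indexing="ij"), dim=-1).reshape(-1, 3)

def dilated_offsets(max_range=8, stride=4):
    """Sparsely sampled offsets that reach far away at a fixed budget (Dilated Attention)."""
    r = torch.arange(-max_range, max_range + 1, stride)
    return torch.stack(torch.meshgrid(r, r, r, indexing="ij"), dim=-1).reshape(-1, 3)

class SparseVoxelAttentionSketch(nn.Module):
    """Multi-head attention over a gathered set of attending voxels per query voxel."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_feat, attend_feat):
        # query_feat: (N, 1, dim) non-empty query voxels; attend_feat: (N, K, dim) gathered voxels
        out, _ = self.attn(query_feat, attend_feat, attend_feat)
        return out

local, dilated = local_offsets(), dilated_offsets()     # (27, 3) and (125, 3) voxel offsets
# In a real detector, attend_feat would be gathered at the (query index + offset) positions
# via a voxel hash table (the role of Fast Voxel Query); random features are used here.
n, k = 10, local.size(0) + dilated.size(0)
out = SparseVoxelAttentionSketch()(torch.randn(n, 1, 64), torch.randn(n, k, 64))
```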

* To appear at ICCV 2021 

Pyramid R-CNN: Towards Better Performance and Adaptability for 3D Object Detection

Sep 06, 2021
Jiageng Mao, Minzhe Niu, Haoyue Bai, Xiaodan Liang, Hang Xu, Chunjing Xu

We present a flexible and high-performance framework, named Pyramid R-CNN, for two-stage 3D object detection from point clouds. Current approaches generally rely on the points or voxels of interest for RoI feature extraction in the second stage, but cannot effectively handle the sparsity and non-uniform distribution of those points, which may result in failures in detecting far-away objects. To resolve these problems, we propose a novel second-stage module, named the pyramid RoI head, to adaptively learn features from sparse points of interest. The pyramid RoI head consists of three key components. First, we propose the RoI-grid Pyramid, which mitigates the sparsity problem by extensively collecting points of interest for each RoI in a pyramid manner. Second, we propose RoI-grid Attention, a new operation that encodes richer information from sparse points by incorporating conventional attention-based and graph-based point operators into a unified formulation. Third, we propose the Density-Aware Radius Prediction (DARP) module, which adapts to different point density levels by dynamically adjusting the focusing range of RoIs. Combining the three components, our pyramid RoI head is robust to sparse and unevenly distributed points and can be applied to various 3D backbones to consistently boost detection performance. Extensive experiments show that Pyramid R-CNN outperforms the state-of-the-art 3D detection models by a large margin on both the KITTI dataset and the Waymo Open Dataset.
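
As an illustration of the RoI-grid Pyramid idea, the sketch below lays out a regular grid of sample points inside progressively enlarged copies of each RoI so that sparse, far-away objects can still collect enough points of interest. The grid size, enlargement ratios, and function name are hypothetical, and the RoI-grid Attention and DARP components are not modeled here.

```python
import torch

def roi_grid_pyramid(roi_centers, roi_sizes, grid_size=4, ratios=(1.0, 2.0, 4.0)):
    """For each RoI, place a regular grid of sample points inside progressively
    enlarged copies of the box, so sparse far-away objects still gather enough
    points of interest. roi_centers: (N, 3); roi_sizes: (N, 3).
    Returns (N, len(ratios) * grid_size**3, 3) grid-point coordinates."""
    lin = (torch.arange(grid_size) + 0.5) / grid_size - 0.5          # grid positions in [-0.5, 0.5]
    gx, gy, gz = torch.meshgrid(lin, lin, lin, indexing="ij")
    unit_grid = torch.stack([gx, gy, gz], dim=-1).reshape(-1, 3)     # (grid_size**3, 3)
    levels = []
    for r in ratios:                                                 # each pyramid level enlarges the RoI
        levels.append(roi_centers[:, None, :] + unit_grid[None] * (roi_sizes[:, None, :] * r))
    return torch.cat(levels, dim=1)

grid_points = roi_grid_pyramid(torch.randn(6, 3), torch.rand(6, 3) + 1.0)
```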

* To appear at ICCV 2021 

One Million Scenes for Autonomous Driving: ONCE Dataset

Jun 21, 2021
Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, Jie Yu, Hang Xu, Chunjing Xu

Current perception models in autonomous driving rely heavily on large amounts of annotated data to cover unseen cases and address the long-tail problem. Learning from large-scale unlabeled data and incrementally self-training powerful recognition models, on the other hand, has received increasing attention and may become the path toward the next generation of robust, industry-grade perception models for autonomous driving. However, the research community has long lacked such essential real-world scene data, which hampers the exploration of fully, semi-, and self-supervised methods for 3D perception. In this paper, we introduce the ONCE (One millioN sCenEs) dataset for 3D object detection in autonomous driving scenarios. The ONCE dataset consists of 1 million LiDAR scenes and 7 million corresponding camera images. The data is selected from 144 driving hours, 20x longer than in the largest existing 3D autonomous driving datasets (e.g., nuScenes and Waymo), and is collected across a range of areas, time periods, and weather conditions. To facilitate future research on exploiting unlabeled data for 3D detection, we additionally provide a benchmark in which we reproduce and evaluate a variety of self-supervised and semi-supervised methods on the ONCE dataset. We conduct extensive analyses of those methods and provide valuable observations on how their performance relates to the scale of the data used. Data, code, and more information are available at https://once-for-auto-driving.github.io/index.html.


GRNet: Gridding Residual Network for Dense Point Cloud Completion

Jul 03, 2020
Haozhe Xie, Hongxun Yao, Shangchen Zhou, Jiageng Mao, Shengping Zhang, Wenxiu Sun

Estimating the complete 3D point cloud from an incomplete one is a key problem in many vision and robotics applications. Mainstream methods (e.g., PCN and TopNet) use Multi-layer Perceptrons (MLPs) to directly process point clouds, which may cause the loss of details because the structure and context of point clouds are not fully considered. To solve this problem, we introduce 3D grids as intermediate representations to regularize unordered point clouds. We therefore propose a novel Gridding Residual Network (GRNet) for point cloud completion. In particular, we devise two novel differentiable layers, named Gridding and Gridding Reverse, to convert between point clouds and 3D grids without losing structural information. We also present the differentiable Cubic Feature Sampling layer to extract features of neighboring points, which preserves context information. In addition, we design a new loss function, namely Gridding Loss, to calculate the L1 distance between the 3D grids of the predicted and ground-truth point clouds, which helps recover details. Experimental results indicate that the proposed GRNet performs favorably against state-of-the-art methods on the ShapeNet, Completion3D, and KITTI benchmarks.
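
The Gridding Loss can be pictured with the short sketch below: both point clouds are scattered into 3D density grids and compared with an L1 distance. The simple nearest-voxel scatter used here is a non-differentiable stand-in for GRNet's Gridding layer, which instead distributes each point to its eight neighboring grid vertices with interpolation weights; the resolution and function names are illustrative.

```python
import torch

def simple_gridding(points, resolution=32):
    """Scatter points in [-1, 1]^3 into a (R, R, R) density grid. This is a
    simplified, non-differentiable stand-in for GRNet's Gridding layer, which
    distributes each point to its eight neighboring grid vertices with
    interpolation weights so that gradients can flow back to the coordinates."""
    idx = ((points.clamp(-1, 1) + 1) / 2 * (resolution - 1)).long()       # (N, 3) voxel indices
    flat = idx[:, 0] * resolution ** 2 + idx[:, 1] * resolution + idx[:, 2]
    grid = torch.zeros(resolution ** 3, device=points.device)
    grid.scatter_add_(0, flat, torch.ones_like(flat, dtype=grid.dtype))
    return grid.view(resolution, resolution, resolution)

def gridding_loss(pred_points, gt_points, resolution=32):
    """L1 distance between the 3D grids of the predicted and ground-truth clouds."""
    return (simple_gridding(pred_points, resolution) -
            simple_gridding(gt_points, resolution)).abs().mean()

loss = gridding_loss(torch.rand(2048, 3) * 2 - 1, torch.rand(2048, 3) * 2 - 1)
```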

* ECCV 2020 

Interpolated Convolutional Networks for 3D Point Cloud Understanding

Aug 13, 2019
Jiageng Mao, Xiaogang Wang, Hongsheng Li

Point clouds are an important type of 3D representation. However, directly applying convolutions on point clouds is challenging due to their sparse, irregular, and unordered data structure. In this paper, we propose a novel Interpolated Convolution operation, InterpConv, to tackle the point cloud feature learning and understanding problem. The key idea is to utilize a set of discrete kernel weights and interpolate point features to neighboring kernel-weight coordinates by an interpolation function for convolution. A normalization term is introduced to handle neighborhoods of different sparsity levels. Our InterpConv is shown to be permutation and sparsity invariant, and can directly handle irregular inputs. We further design Interpolated Convolutional Neural Networks (InterpCNNs) based on InterpConv layers to handle point cloud recognition tasks, including shape classification, object part segmentation, and indoor scene semantic parsing. Experiments show that the networks can effectively capture both fine-grained local structures and global shape context information. The proposed approach achieves state-of-the-art performance on public benchmarks including ModelNet40, ShapeNet Parts, and S3DIS.
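
A toy version of the interpolated convolution idea is sketched below: point features are interpolated onto a small set of discrete kernel-weight coordinates, normalized to account for local density, and then mixed by learnable kernel weights. The cube-corner kernel layout, the Gaussian interpolation function, and all names are illustrative assumptions rather than the paper's exact InterpConv formulation.

```python
import torch
import torch.nn as nn

class InterpConvSketch(nn.Module):
    """Interpolates point features onto a small set of discrete kernel-weight
    coordinates (here the 8 corners of a cube), normalizes by local density,
    and mixes the result with learnable kernel weights."""

    def __init__(self, in_dim, out_dim, kernel_extent=0.1, sigma=0.05):
        super().__init__()
        corners = torch.tensor([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)],
                               dtype=torch.float32) * kernel_extent
        self.register_buffer("kernel_coords", corners)               # (K, 3) discrete coordinates
        self.weights = nn.Parameter(torch.randn(corners.size(0), in_dim, out_dim) * 0.01)
        self.sigma = sigma

    def forward(self, center, points, feats):
        # center: (3,) output location; points: (N, 3) neighbors; feats: (N, C_in)
        rel = points - center                                         # neighbors relative to the center
        dist = torch.cdist(self.kernel_coords, rel)                   # (K, N) distances to kernel coords
        w = torch.exp(-(dist / self.sigma) ** 2)                      # interpolation weights
        w = w / (w.sum(dim=1, keepdim=True) + 1e-6)                   # density normalization
        interp = w @ feats                                            # (K, C_in) features at kernel coords
        return torch.einsum("kc,kco->o", interp, self.weights)        # (C_out,) convolved output

out = InterpConvSketch(16, 32)(torch.zeros(3), torch.randn(100, 3) * 0.1, torch.randn(100, 16))
```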

* ICCV 2019 oral. Code will be released soon 