Zetong Yang

Self-supervised Pre-training with Masked Shape Prediction for 3D Scene Understanding

May 08, 2023
Li Jiang, Zetong Yang, Shaoshuai Shi, Vladislav Golyanik, Dengxin Dai, Bernt Schiele

Masked signal modeling has greatly advanced self-supervised pre-training for language and 2D images, but it is still not fully explored for 3D scene understanding. This paper therefore introduces Masked Shape Prediction (MSP), a new framework for conducting masked signal modeling in 3D scenes. MSP uses the essential 3D semantic cue, i.e., geometric shape, as the prediction target for masked points. A context-enhanced shape target, consisting of an explicit shape context and an implicit deep shape feature, is proposed to facilitate exploiting contextual cues in shape prediction. Meanwhile, the pre-training architecture in MSP is carefully designed to alleviate leakage of the masked shape from point coordinates. Experiments on multiple 3D understanding tasks on both indoor and outdoor datasets demonstrate the effectiveness of MSP in learning feature representations that consistently boost downstream performance.

* CVPR 2023 
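
To make the masked-shape-prediction idea concrete, here is a minimal numpy sketch of the masking step and an explicit shape-context-style target: a fraction of points is masked, and for each masked point a normalized histogram of its neighbours' relative positions (binned by radius, azimuth and elevation) serves as the regression target. The mask ratio, neighbourhood size and bin counts are illustrative assumptions; the implicit deep shape feature, the leakage-aware backbone and the training loop from the paper are omitted.

```python
import numpy as np

def masked_shape_targets(points, mask_ratio=0.3, k=16, n_r=2, n_az=4, n_el=2, seed=0):
    """For a random subset of masked points, build a simple shape-context
    histogram over their k nearest neighbours (binned by radius, azimuth and
    elevation of the relative offsets). Returns the masked indices and the
    per-point histograms that a decoder would be trained to predict."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    masked = rng.choice(n, size=int(mask_ratio * n), replace=False)

    # distances from every masked point to all points, then k nearest neighbours
    diff = points[masked, None, :] - points[None, :, :]           # (M, N, 3)
    dist = np.linalg.norm(diff, axis=-1)                          # (M, N)
    knn = np.argsort(dist, axis=1)[:, 1:k + 1]                    # drop the point itself

    targets = np.zeros((len(masked), n_r * n_az * n_el))
    for i, nbrs in enumerate(knn):
        rel = points[nbrs] - points[masked[i]]                    # (k, 3) relative offsets
        r = np.linalg.norm(rel, axis=1) + 1e-8
        az = np.arctan2(rel[:, 1], rel[:, 0])                     # azimuth in [-pi, pi]
        el = np.arcsin(np.clip(rel[:, 2] / r, -1, 1))             # elevation in [-pi/2, pi/2]
        r_bin = np.minimum((r / r.max() * n_r).astype(int), n_r - 1)
        az_bin = np.minimum(((az + np.pi) / (2 * np.pi) * n_az).astype(int), n_az - 1)
        el_bin = np.minimum(((el + np.pi / 2) / np.pi * n_el).astype(int), n_el - 1)
        flat = (r_bin * n_az + az_bin) * n_el + el_bin
        np.add.at(targets[i], flat, 1.0 / k)                      # normalised histogram
    return masked, targets

# toy usage: 1024 random points in a unit cube
pts = np.random.rand(1024, 3)
idx, tgt = masked_shape_targets(pts)
print(idx.shape, tgt.shape)   # (307,) (307, 16)
```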

A Unified Query-based Paradigm for Point Cloud Understanding

Mar 03, 2022
Zetong Yang, Li Jiang, Yanan Sun, Bernt Schiele, Jiaya Jia

3D point cloud understanding is an important component of autonomous driving and robotics. In this paper, we present a novel Embedding-Querying paradigm (EQ-Paradigm) for 3D understanding tasks including detection, segmentation and classification. EQ-Paradigm is a unified paradigm that enables any existing 3D backbone architecture to be combined with different task heads. Under the EQ-Paradigm, the input is first encoded in the embedding stage with an arbitrary feature extraction architecture that is independent of tasks and heads. The querying stage then makes the encoded features applicable to diverse task heads by introducing an intermediate representation, the Q-representation, which serves as a bridge between the embedding stage and the task heads. We design a novel Q-Net as the querying-stage network. Extensive experimental results on various 3D tasks, including semantic segmentation, object detection and shape classification, show that EQ-Paradigm in tandem with Q-Net is a general and effective pipeline that enables flexible collaboration between backbones and heads and further boosts the performance of state-of-the-art methods. All code and models will be published soon.

* Accepted by CVPR 2022 
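
The embedding/querying split can be illustrated with a toy PyTorch sketch: an arbitrary encoder (here a placeholder MLP) embeds the input points, a single cross-attention layer stands in for Q-Net and turns task-dependent query positions into a Q-representation, and a small head consumes that representation. The layer sizes, the MLP encoder and the single-layer attention are assumptions for illustration, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class ToyEQ(nn.Module):
    """Minimal Embedding-Querying pipeline: encoder -> Q-Net-like cross-attention -> head."""
    def __init__(self, dim=64, n_classes=20):
        super().__init__()
        # embedding stage: task-agnostic per-point feature extractor (placeholder MLP)
        self.encoder = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        # querying stage: queries attend to the embedded scene features
        self.q_embed = nn.Linear(3, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # task head operating on the Q-representation (here: per-query classification)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, points, query_pos):
        feats = self.encoder(points)                  # (B, N, dim) embedded features
        q = self.q_embed(query_pos)                   # (B, Q, dim) initial queries
        q_repr, _ = self.cross_attn(q, feats, feats)  # (B, Q, dim) Q-representation
        return self.head(q_repr)                      # (B, Q, n_classes)

pts = torch.rand(2, 1024, 3)      # two toy scenes
queries = torch.rand(2, 128, 3)   # arbitrary query positions (task-dependent)
logits = ToyEQ()(pts, queries)
print(logits.shape)               # torch.Size([2, 128, 20])
```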

3D-MAN: 3D Multi-frame Attention Network for Object Detection

Mar 30, 2021
Zetong Yang, Yin Zhou, Zhifeng Chen, Jiquan Ngiam

3D object detection is an important module in autonomous driving and robotics. However, many existing methods focus on using single frames to perform 3D detection, and do not fully utilize information from multiple frames. In this paper, we present 3D-MAN: a 3D multi-frame attention network that effectively aggregates features from multiple perspectives and achieves state-of-the-art performance on Waymo Open Dataset. 3D-MAN first uses a novel fast single-frame detector to produce box proposals. The box proposals and their corresponding feature maps are then stored in a memory bank. We design a multi-view alignment and aggregation module, using attention networks, to extract and aggregate the temporal features stored in the memory bank. This effectively combines the features coming from different perspectives of the scene. We demonstrate the effectiveness of our approach on the large-scale complex Waymo Open Dataset, achieving state-of-the-art results compared to published single-frame and multi-frame methods.
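
The memory-bank-plus-attention idea can be sketched roughly as follows: a small bank stores per-frame proposal features produced by a hypothetical single-frame detector, and the current frame's proposals attend over everything in the bank to obtain temporally aggregated features. The bank size, feature dimension and single attention layer are illustrative; the paper's fast single-frame detector and multi-view alignment and aggregation module are not reproduced here.

```python
from collections import deque
import torch
import torch.nn as nn

class ToyMemoryBank:
    """Stores per-frame proposal features from a (hypothetical) single-frame detector."""
    def __init__(self, max_frames=4):
        self.frames = deque(maxlen=max_frames)

    def push(self, proposal_feats):           # (P, dim) features for one frame
        self.frames.append(proposal_feats)

    def all(self):                            # (T*P, dim) features across stored frames
        return torch.cat(list(self.frames), dim=0)

# cross-frame aggregation: current proposals attend to the memory bank
dim = 32
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
bank = ToyMemoryBank(max_frames=4)
for _ in range(4):                            # pretend detector outputs for 4 past frames
    bank.push(torch.rand(100, dim))

current = torch.rand(1, 100, dim)             # proposal features from the current frame
memory = bank.all().unsqueeze(0)              # (1, 400, dim)
fused, _ = attn(current, memory, memory)      # temporally aggregated proposal features
print(fused.shape)                            # torch.Size([1, 100, 32])
```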

CVPR 2019 WAD Challenge on Trajectory Prediction and 3D Perception

Apr 06, 2020
Sibo Zhang, Yuexin Ma, Ruigang Yang, Xin Li, Yanliang Zhu, Deheng Qian, Zetong Yang, Wenjing Zhang, Yuanpei Liu

This paper reviews the CVPR 2019 challenge on autonomous driving. Baidu's Robotics and Autonomous Driving Lab (RAL) provided a 150-minute labeled trajectory and 3D perception dataset, including about 80k lidar point clouds and 1,000 km of trajectories for urban traffic. The challenge has two tasks: (1) trajectory prediction and (2) 3D lidar object detection. More than 200 teams submitted results to the leaderboard, and more than 1,000 participants attended the workshop.

3DSSD: Point-based 3D Single Stage Object Detector

Feb 24, 2020
Zetong Yang, Yanan Sun, Shu Liu, Jiaya Jia

Currently, there are many kinds of voxel-based 3D single-stage detectors, while point-based single-stage methods are still underexplored. In this paper, we present a lightweight and effective point-based 3D single-stage object detector, named 3DSSD, which achieves a good balance between accuracy and efficiency. In this paradigm, all upsampling layers and the refinement stage, which are indispensable in existing point-based methods, are abandoned to reduce the large computation cost. We propose a novel fusion sampling strategy in the downsampling process to make detection on less representative points feasible. A carefully designed box prediction network, including a candidate generation layer and an anchor-free regression head with a 3D center-ness assignment strategy, meets our demands for accuracy and speed. Our paradigm is an elegant single-stage anchor-free framework that shows great superiority to existing methods. We evaluate 3DSSD on the widely used KITTI dataset and the more challenging nuScenes dataset. Our method outperforms all state-of-the-art voxel-based single-stage methods by a large margin and performs comparably to two-stage point-based methods, with an inference speed of more than 25 FPS, 2x faster than former state-of-the-art point-based methods.
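
A rough numpy sketch of the fusion sampling idea, as we read it from the abstract: farthest point sampling is run once on spatial (Euclidean) distances and once on feature distances, and the two index sets are merged, so that points that are distinctive in feature space (often foreground) survive downsampling. The sample counts and the plain concatenation of the two sets are simplifying assumptions rather than the paper's exact formulation.

```python
import numpy as np

def fps(dist_mat, n_samples, seed=0):
    """Generic farthest point sampling over a precomputed pairwise distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist_mat.shape[0]
    chosen = [rng.integers(n)]
    min_d = dist_mat[chosen[0]].copy()
    for _ in range(n_samples - 1):
        nxt = int(np.argmax(min_d))          # farthest point from the chosen set so far
        chosen.append(nxt)
        min_d = np.minimum(min_d, dist_mat[nxt])
    return np.array(chosen)

def fusion_sample(points, feats, n_samples):
    """Half the samples by spatial distance (D-FPS), half by feature distance (F-FPS)."""
    d_xyz = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    d_feat = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
    half = n_samples // 2
    idx = np.concatenate([fps(d_xyz, half), fps(d_feat, n_samples - half, seed=1)])
    return np.unique(idx)    # indices of the retained representative points

pts = np.random.rand(512, 3)
ft = np.random.rand(512, 32)
keep = fusion_sample(pts, ft, 128)
print(keep.shape)
```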

STD: Sparse-to-Dense 3D Object Detector for Point Cloud

Jul 22, 2019
Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, Jiaya Jia

We present a new two-stage 3D object detection framework, named Sparse-to-Dense 3D Object Detector (STD). The first stage is a bottom-up proposal generation network that uses the raw point cloud as input to generate accurate proposals by seeding each point with a new spherical anchor. It achieves a high recall with less computation than prior works. PointsPool is then applied to generate proposal features by transforming their interior point features from a sparse expression to a compact representation, which saves even more computation time. In the second stage, box prediction, we implement a parallel intersection-over-union (IoU) branch to increase awareness of localization accuracy, resulting in further improved performance. We conduct experiments on the KITTI dataset and evaluate our method in terms of 3D object and Bird's Eye View (BEV) detection. Our method outperforms other state-of-the-art methods by a large margin, especially on the hard set, with an inference speed of more than 10 FPS.

* arXiv admin note: text overlap with arXiv:1812.05276 
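
The sparse-to-dense step that PointsPool performs can be sketched as follows, under the simplifying assumptions that a proposal is an axis-aligned box and that pooling is a per-cell mean: the sparse interior point features of a proposal are scattered into a fixed-size voxel grid, yielding a compact, dense tensor for the second-stage box prediction head. The grid resolution and pooling operator are illustrative choices, not the paper's.

```python
import numpy as np

def points_pool(points, feats, box_min, box_max, grid=(6, 6, 6)):
    """Pool sparse interior point features of a proposal into a dense, fixed-size
    voxel grid (mean-pooled per cell), yielding a compact representation that a
    second-stage box-prediction head can consume."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    pts, ft = points[inside], feats[inside]
    out = np.zeros(grid + (feats.shape[1],))
    cnt = np.zeros(grid)
    if len(pts) == 0:
        return out
    rel = (pts - box_min) / (box_max - box_min + 1e-8)             # normalise to [0, 1]
    cell = np.minimum((rel * np.array(grid)).astype(int), np.array(grid) - 1)
    for c, f in zip(cell, ft):                                      # scatter features into cells
        out[tuple(c)] += f
        cnt[tuple(c)] += 1
    nz = cnt > 0
    out[nz] /= cnt[nz][:, None]                                     # mean per occupied cell
    return out   # (6, 6, 6, C) dense feature volume

pts = np.random.rand(2048, 3) * 4.0
ft = np.random.rand(2048, 16)
vol = points_pool(pts, ft, box_min=np.array([1., 1., 1.]), box_max=np.array([2., 2., 2.]))
print(vol.shape)   # (6, 6, 6, 16)
```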

IPOD: Intensive Point-based Object Detector for Point Cloud

Dec 13, 2018
Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, Jiaya Jia

We present a novel 3D object detection framework, named IPOD, based on raw point clouds. It seeds an object proposal for each point, which is the basic element. This paradigm provides high recall and high fidelity of information, making it a suitable way to process point cloud data. We design an end-to-end trainable architecture in which the features of all points within a proposal are extracted from the backbone network and aggregated into a proposal feature for final bounding-box inference. These features, carrying both context information and precise point cloud coordinates, yield improved performance. We conduct experiments on the KITTI dataset, evaluating our performance in terms of 3D object detection, Bird's Eye View (BEV) detection and 2D object detection. Our method achieves new state-of-the-art results, showing a great advantage on the hard set.
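
A minimal sketch of the point-seeded proposal idea, assuming a fixed prior box size and a simple bird's-eye-view NMS to thin out the heavily overlapping proposals; the backbone features, foreground filtering and proposal feature extraction described in the abstract are not modelled here, and the scores below are random stand-ins.

```python
import numpy as np

def seed_proposals(points, size=(3.9, 1.6, 1.56)):
    """Seed one axis-aligned proposal per point, centred on the point, with a
    fixed prior size (here a hypothetical car-sized box)."""
    half = np.array(size) / 2.0
    return np.concatenate([points - half, points + half], axis=1)   # (N, 6) min/max corners

def bev_nms(boxes, scores, iou_thr=0.5):
    """Greedy bird's-eye-view NMS over the x/y extents of point-seeded proposals."""
    order = np.argsort(-scores)
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        yy2 = np.minimum(boxes[i, 4], boxes[order[1:], 4])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 3] - boxes[i, 0]) * (boxes[i, 4] - boxes[i, 1])
        area_o = (boxes[order[1:], 3] - boxes[order[1:], 0]) * (boxes[order[1:], 4] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-8)
        order = order[1:][iou < iou_thr]                             # drop heavy overlaps
    return np.array(keep)

pts = np.random.rand(500, 3) * 20.0
boxes = seed_proposals(pts)
scores = np.random.rand(500)          # stand-in for foreground scores from a backbone
kept = bev_nms(boxes, scores)
print(len(kept), "proposals kept out of", len(boxes))
```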
