Dalong Du

OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction

Apr 11, 2023
Yunpeng Zhang, Zheng Zhu, Dalong Du

Vision-based perception for autonomous driving has undergone a transformation from bird's-eye-view (BEV) representations to 3D semantic occupancy. Compared with BEV planes, 3D semantic occupancy additionally provides structural information along the vertical direction. This paper presents OccFormer, a dual-path transformer network that effectively processes the 3D volume for semantic occupancy prediction. OccFormer achieves a long-range, dynamic, and efficient encoding of the camera-generated 3D voxel features by decomposing the heavy 3D processing into local and global transformer pathways along the horizontal plane. For the occupancy decoder, we adapt the vanilla Mask2Former to 3D semantic occupancy by proposing preserve-pooling and class-guided sampling, which notably mitigate sparsity and class imbalance. Experimental results demonstrate that OccFormer significantly outperforms existing methods for semantic scene completion on the SemanticKITTI dataset and for LiDAR semantic segmentation on the nuScenes dataset. Code is available at https://github.com/zhangyp15/OccFormer.
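
The dual-path decomposition is the core architectural idea here. Below is a minimal PyTorch sketch of that idea as read from the abstract (not the authors' implementation): a local pathway processes each horizontal slice of the camera-generated voxel volume, a global pathway attends over the vertically collapsed BEV plane, and the two are fused back into the 3D volume. The module names, the plain 2D convolution standing in for windowed attention, and the gating scheme are all simplifications introduced for illustration.

```python
# Minimal sketch of a dual-path block over a (B, C, X, Y, Z) voxel volume.
# The 2D convolution replaces windowed attention purely to keep the sketch short.
import torch
import torch.nn as nn


class DualPathBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local pathway: shared 2D processing applied to every horizontal slice.
        self.local = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Global pathway: self-attention over the vertically collapsed BEV plane.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, voxel: torch.Tensor) -> torch.Tensor:
        b, c, x, y, z = voxel.shape
        # Local pathway: fold the vertical axis into the batch, process each slice.
        slices = voxel.permute(0, 4, 1, 2, 3).reshape(b * z, c, x, y)
        local = self.local(slices).reshape(b, z, c, x, y).permute(0, 2, 3, 4, 1)
        # Global pathway: collapse z by mean pooling, attend over the BEV plane.
        bev = voxel.mean(dim=-1)                      # (B, C, X, Y)
        tokens = bev.flatten(2).transpose(1, 2)       # (B, X*Y, C)
        glob, _ = self.attn(tokens, tokens, tokens)
        glob = glob.transpose(1, 2).reshape(b, c, x, y).unsqueeze(-1)
        # Fuse: a per-channel sigmoid gate controls how strongly the global
        # context is mixed back into the 3D volume.
        g = self.gate(tokens.mean(dim=1)).view(b, c, 1, 1, 1)
        return voxel + local + g * glob


# Usage: a toy 32x32x8 volume with 64 channels.
feats = torch.randn(2, 64, 32, 32, 8)
out = DualPathBlock(64)(feats)
print(out.shape)  # torch.Size([2, 64, 32, 32, 8])
```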

* Code is available at https://github.com/zhangyp15/OccFormer 

StereoScene: BEV-Assisted Stereo Matching Empowers 3D Semantic Scene Completion

Mar 30, 2023
Bohan Li, Yasheng Sun, Xin Jin, Wenjun Zeng, Zheng Zhu, Xiaofeng Wang, Yunpeng Zhang, James Okae, Hang Xiao, Dalong Du

3D semantic scene completion (SSC) is an ill-posed task that requires inferring a dense 3D scene from incomplete observations. Previous methods either explicitly incorporate 3D geometric input or rely on a 3D prior learned from monocular RGB images. However, 3D sensors such as LiDAR are expensive and intrusive, while monocular cameras struggle to model precise geometry due to their inherent ambiguity. In this work, we propose StereoScene, which takes full advantage of lightweight camera inputs without resorting to any external 3D sensors. Our key insight is to leverage stereo matching to resolve geometric ambiguity. To improve robustness in unmatched areas, we introduce a bird's-eye-view (BEV) representation that provides hallucination ability from rich context information. On top of the stereo and BEV representations, a mutual interactive aggregation (MIA) module is carefully devised to fully unleash their power. Specifically, a Bi-directional Interaction Transformer (BIT) augmented with confidence re-weighting is used to encourage reliable prediction through mutual guidance, while a Dual Volume Aggregation (DVA) module is designed to facilitate complementary aggregation. Experimental results on SemanticKITTI demonstrate that the proposed StereoScene outperforms state-of-the-art camera-based methods by a large margin, with relative improvements of 26.9% in geometry and 38.6% in semantics.
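
As a rough illustration of the mutual interactive aggregation described above, the sketch below wires two token sequences (from the stereo and BEV volumes) through bidirectional cross-attention with confidence re-weighting and blends them with a learned gate. This is only an interpretation of the abstract; the class and layer choices are assumptions, not the paper's released code.

```python
# Sketch: bidirectional cross-attention between stereo and BEV tokens,
# confidence-weighted message passing, then a learned channel-wise blend.
import torch
import torch.nn as nn


class BidirectionalInteraction(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.stereo_to_bev = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.bev_to_stereo = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Confidence re-weighting: per-token scalar in [0, 1] for each message.
        self.conf = nn.Sequential(nn.Linear(channels, 1), nn.Sigmoid())
        # Dual-volume aggregation: a learned blend of the two refined volumes.
        self.blend = nn.Sequential(nn.Linear(2 * channels, channels), nn.Sigmoid())

    def forward(self, stereo: torch.Tensor, bev: torch.Tensor) -> torch.Tensor:
        # stereo, bev: (B, N, C) token sequences flattened from the two volumes.
        msg_to_bev, _ = self.stereo_to_bev(bev, stereo, stereo)
        msg_to_stereo, _ = self.bev_to_stereo(stereo, bev, bev)
        bev = bev + self.conf(msg_to_bev) * msg_to_bev
        stereo = stereo + self.conf(msg_to_stereo) * msg_to_stereo
        w = self.blend(torch.cat([stereo, bev], dim=-1))
        return w * stereo + (1.0 - w) * bev


stereo = torch.randn(2, 256, 64)
bev = torch.randn(2, 256, 64)
fused = BidirectionalInteraction(64)(stereo, bev)
print(fused.shape)  # torch.Size([2, 256, 64])
```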

OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception

Mar 07, 2023
Xiaofeng Wang, Zheng Zhu, Wenbo Xu, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, Xingang Wang

Semantic occupancy perception is essential for autonomous driving, as automated vehicles require fine-grained perception of 3D urban structures. However, existing relevant benchmarks lack diversity in urban scenes and only evaluate front-view predictions. Towards comprehensive benchmarking of surrounding perception algorithms, we propose OpenOccupancy, the first surrounding semantic occupancy perception benchmark. In the OpenOccupancy benchmark, we extend the large-scale nuScenes dataset with dense semantic occupancy annotations. Previous annotations rely on superimposing LiDAR points, where some occupancy labels are missed due to sparse LiDAR channels. To mitigate this problem, we introduce the Augmenting And Purifying (AAP) pipeline to densify the annotations by ~2x, with ~4000 human hours involved in the labeling process. Besides, camera-based, LiDAR-based and multi-modal baselines are established for the OpenOccupancy benchmark. Furthermore, considering that the complexity of surrounding occupancy perception lies in the computational burden of high-resolution 3D predictions, we propose the Cascade Occupancy Network (CONet) to refine the coarse prediction, which improves performance by ~30% relative to the baseline. We hope the OpenOccupancy benchmark will boost the development of surrounding occupancy perception algorithms.
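
The cascade idea behind CONet, coarse occupancy first and high-resolution refinement only where it matters, can be illustrated with a toy module like the one below. It is a sketch under simplifying assumptions (a 1x1 coarse head, trilinear upsampling, and a hard occupancy threshold), not the CONet implementation.

```python
# Toy coarse-to-fine occupancy cascade: a coarse occupied/free prediction gates
# the high-resolution semantic prediction.
import torch
import torch.nn as nn


class CascadeOccupancy(nn.Module):
    def __init__(self, channels: int, num_classes: int, threshold: float = 0.5):
        super().__init__()
        self.coarse_head = nn.Conv3d(channels, 1, kernel_size=1)          # occupied / free
        self.fine_head = nn.Conv3d(channels, num_classes, kernel_size=1)  # semantics
        self.up = nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False)
        self.threshold = threshold

    def forward(self, coarse_feat: torch.Tensor) -> torch.Tensor:
        # coarse_feat: (B, C, X, Y, Z) voxel features at the coarse resolution.
        occ_prob = torch.sigmoid(self.coarse_head(coarse_feat))
        fine_feat = self.up(coarse_feat)              # 2x denser grid
        logits = self.fine_head(fine_feat)            # (B, K, 2X, 2Y, 2Z)
        # Mask out regions the coarse stage marks as free; a real system would
        # skip their computation entirely rather than zeroing them afterwards.
        mask = (self.up(occ_prob) > self.threshold).float()
        return logits * mask


feat = torch.randn(1, 32, 16, 16, 4)
print(CascadeOccupancy(32, num_classes=17)(feat).shape)  # torch.Size([1, 17, 32, 32, 8])
```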

* project page: https://github.com/JeffWang987/OpenOccupancy 

Gait Recognition in the Wild: A Benchmark

May 05, 2022
Zheng Zhu, Xianda Guo, Tian Yang, Junjie Huang, Jiankang Deng, Guan Huang, Dalong Du, Jiwen Lu, Jie Zhou

Gait benchmarks empower the research community to train and evaluate high-performance gait recognition systems. Even though growing efforts have been devoted to cross-view recognition, academia is restricted by existing databases captured in controlled environments. In this paper, we contribute a new benchmark for Gait REcognition in the Wild (GREW). The GREW dataset is constructed from natural videos covering hundreds of cameras and thousands of hours of streams in open systems. With tremendous manual annotation, GREW consists of 26K identities and 128K sequences with rich attributes for unconstrained gait recognition. Moreover, we add a distractor set of over 233K sequences, making it more suitable for real-world applications. Compared with prevailing predefined cross-view datasets, GREW has diverse and practical view variations, as well as more natural challenging factors. To the best of our knowledge, this is the first large-scale dataset for gait recognition in the wild. Equipped with this benchmark, we dissect the unconstrained gait recognition problem. Representative appearance-based and model-based methods are explored, and comprehensive baselines are established. Experimental results show that (1) the proposed GREW benchmark is necessary for training and evaluating gait recognizers in the wild; (2) for state-of-the-art gait recognition approaches, there is considerable room for improvement; (3) the GREW benchmark can be used as effective pre-training for controlled gait recognition. The benchmark website is https://www.grew-benchmark.org/.

* Published in ICCV 2021. Benchmark website is https://www.grew-benchmark.org/ 

WebFace260M: A Benchmark for Million-Scale Deep Face Recognition

Apr 21, 2022
Zheng Zhu, Guan Huang, Jiankang Deng, Yun Ye, Junjie Huang, Xinze Chen, Jiagang Zhu, Tian Yang, Dalong Du, Jiwen Lu, Jie Zhou

Face benchmarks empower the research community to train and evaluate high-performance face recognition systems. In this paper, we contribute a new million-scale recognition benchmark containing uncurated 4M identities/260M faces (WebFace260M) and cleaned 2M identities/42M faces (WebFace42M) of training data, as well as an elaborately designed time-constrained evaluation protocol. First, we collect 4M name lists and download 260M faces from the Internet. Then, a Cleaning Automatically utilizing Self-Training (CAST) pipeline is devised to purify the tremendous WebFace260M, which is efficient and scalable. To the best of our knowledge, the cleaned WebFace42M is the largest public face recognition training set, and we expect it to close the data gap between academia and industry. Referring to practical deployments, the Face Recognition Under Inference Time conStraint (FRUITS) protocol and a new test set with rich attributes are constructed. Besides, we gather a large-scale masked face sub-set for biometrics assessment under COVID-19. For a comprehensive evaluation of face matchers, three recognition tasks are performed under standard, masked and unbiased settings, respectively. Equipped with this benchmark, we delve into million-scale face recognition problems. A distributed framework is developed to train face recognition models efficiently without compromising performance. Enabled by WebFace42M, we reduce the failure rate on the challenging IJB-C set by 40% and rank 3rd among 430 entries on NIST-FRVT. Even 10% of the data (WebFace4M) shows superior performance compared with public training sets. Furthermore, comprehensive baselines are established under the FRUITS-100/500/1000 milliseconds protocols. The proposed benchmark shows enormous potential on standard, masked and unbiased face recognition scenarios. Our WebFace260M website is https://www.face-benchmark.org.
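
The FRUITS protocol ties evaluation to an inference-time budget. The snippet below is an illustration of what such a time-constrained check implies in practice (it is not the official evaluation code): time a matcher's per-face inference and only admit it if the latency fits a 100/500/1000 ms budget. The toy matcher and input size are placeholders.

```python
# Sketch of a time-constrained admissibility check in the spirit of FRUITS.
import time
import torch
import torch.nn as nn


def median_latency_ms(model: nn.Module, input_shape=(1, 3, 112, 112), runs: int = 50) -> float:
    """Median single-face inference latency in milliseconds on the current device."""
    model.eval()
    x = torch.randn(*input_shape)
    with torch.no_grad():
        for _ in range(5):          # warm-up iterations
            model(x)
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            model(x)
            times.append((time.perf_counter() - start) * 1000.0)
    times.sort()
    return times[len(times) // 2]


# Toy "matcher": a tiny embedding network standing in for a real face model.
matcher = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                        nn.Flatten(), nn.LazyLinear(128))
budget_ms = 1000.0   # the loosest of the 100/500/1000 ms budgets mentioned above
latency = median_latency_ms(matcher)
print(f"latency={latency:.1f} ms, admissible={latency <= budget_ms}")
```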

* Accepted by T-PAMI. Extension of our CVPR-2021 work: arXiv:2103.04098. Project website is https://www.face-benchmark.org 

HFT: Lifting Perspective Representations via Hybrid Feature Transformation

Apr 11, 2022
Jiayu Zou, Junrui Xiao, Zheng Zhu, Junjie Huang, Guan Huang, Dalong Du, Xingang Wang

Autonomous driving requires accurate and detailed Bird's Eye View (BEV) semantic segmentation for decision making, which is one of the most challenging tasks for high-level scene perception. Feature transformation from the frontal view to BEV is the pivotal technology for BEV semantic segmentation. Existing works can be roughly classified into two categories, i.e., Camera model-Based Feature Transformation (CBFT) and Camera model-Free Feature Transformation (CFFT). In this paper, we empirically analyze the vital differences between CBFT and CFFT. The former transforms features based on the flat-world assumption, which may cause distortion of regions lying above the ground plane. The latter is limited in segmentation performance due to the absence of geometric priors and time-consuming computation. In order to reap the benefits and avoid the drawbacks of CBFT and CFFT, we propose a novel framework with a Hybrid Feature Transformation module (HFT). Specifically, we decouple the feature maps produced by HFT for estimating the layout of outdoor scenes in BEV. Furthermore, we design a mutual learning scheme to augment the hybrid transformation by applying feature mimicking. Notably, extensive experiments demonstrate that with negligible extra overhead, HFT achieves a relative improvement of 13.3% on the Argoverse dataset and 16.8% on the KITTI 3D Object dataset compared to the best-performing existing method. The code is available at https://github.com/JiayuZou2020/HFT.
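
To make the hybrid design concrete, the sketch below shows one plausible way to fuse the BEV features of a camera-model-based branch and a camera-model-free branch with a learned gate, plus a feature-mimicking loss for mutual learning. The wiring reflects a reading of the abstract rather than the released HFT code.

```python
# Sketch: gated fusion of CBFT and CFFT BEV features with a mutual
# feature-mimicking loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, bev_cbft: torch.Tensor, bev_cfft: torch.Tensor):
        # bev_cbft / bev_cfft: (B, C, H, W) BEV maps from the two transformations.
        w = self.gate(torch.cat([bev_cbft, bev_cfft], dim=1))
        fused = w * bev_cbft + (1.0 - w) * bev_cfft
        # Mutual learning via feature mimicking: each branch is pulled toward
        # the other's detached features, so the targets act as fixed teachers.
        mimic_loss = (F.mse_loss(bev_cbft, bev_cfft.detach())
                      + F.mse_loss(bev_cfft, bev_cbft.detach()))
        return fused, mimic_loss


cbft = torch.randn(2, 64, 100, 100)
cfft = torch.randn(2, 64, 100, 100)
fused, mimic = HybridFusion(64)(cbft, cfft)
print(fused.shape, mimic.item())
```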

BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View

Dec 22, 2021
Junjie Huang, Guan Huang, Zheng Zhu, Dalong Du

Autonomous driving perceives the surrounding environment for decision making, which is one of the most complicated scenarios for visual perception. The great power of paradigm innovation in solving the 2D object detection task inspires us to seek an elegant, feasible, and scalable paradigm for pushing the performance boundary in this area. To this end, we contribute the BEVDet paradigm in this paper. BEVDet is developed by following the principle of detecting 3D objects in Bird-Eye-View (BEV), where route planning can be handily performed. In this paradigm, four kinds of modules are applied in succession with different roles: an image-view encoder for encoding features in the image view, a view transformer for transforming features from the image view to BEV, a BEV encoder for further encoding features in BEV, and a task-specific head for predicting the targets in BEV. We merely reuse existing modules for constructing BEVDet and make it feasible for multi-camera 3D object detection by constructing an exclusive data augmentation strategy. The proposed paradigm works well in multi-camera 3D object detection and offers a good trade-off between computing budget and performance. BEVDet with a 704x256 image size (1/8 of the competitors') scores 29.4% mAP and 38.4% NDS on the nuScenes val set, which is comparable with FCOS3D (i.e., 2008.2 GFLOPs, 1.7 FPS, 29.5% mAP and 37.2% NDS), while requiring merely 12% of the computing budget (239.4 GFLOPs) and running 4.3 times faster. Scaling up the input size to 1408x512, BEVDet scores 34.9% mAP and 41.7% NDS, which requires just 601.4 GFLOPs and significantly surpasses FCOS3D by 5.4% mAP and 4.5% NDS. The superior performance of BEVDet demonstrates the power of paradigm innovation.
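
The four-module layout is explicit enough to write down as a skeleton, which the sketch below does to make the data flow concrete. The sub-modules, channel sizes, and the pooling-based stand-in for the view transformer are placeholders, not the released BEVDet configuration.

```python
# Skeleton of the four-stage BEV detection pipeline described above:
# image-view encoder -> view transformer -> BEV encoder -> task head.
import torch
import torch.nn as nn


class BEVDetSkeleton(nn.Module):
    def __init__(self, img_channels=3, feat_channels=64, bev_channels=64,
                 num_outputs=10, bev_size=(128, 128)):
        super().__init__()
        self.bev_size = bev_size
        # 1) Image-view encoder: per-camera feature extraction.
        self.img_encoder = nn.Sequential(
            nn.Conv2d(img_channels, feat_channels, 3, stride=4, padding=1), nn.ReLU())
        # 2) View transformer: image-view features -> BEV features.  A pooling
        #    plus 1x1 projection stands in for the real lifting step.
        self.view_proj = nn.Conv2d(feat_channels, bev_channels, 1)
        # 3) BEV encoder: further encoding in bird's-eye view.
        self.bev_encoder = nn.Sequential(
            nn.Conv2d(bev_channels, bev_channels, 3, padding=1), nn.ReLU())
        # 4) Task-specific head: predict detection targets in BEV.
        self.head = nn.Conv2d(bev_channels, num_outputs, 1)

    def forward(self, imgs: torch.Tensor) -> torch.Tensor:
        # imgs: (B, N_cam, 3, H, W) multi-camera input.
        b, n, c, h, w = imgs.shape
        feats = self.img_encoder(imgs.flatten(0, 1))                 # (B*N, C', h', w')
        feats = self.view_proj(feats)
        # Placeholder view transformation: average cameras and resample to BEV.
        bev = feats.view(b, n, *feats.shape[1:]).mean(dim=1)
        bev = nn.functional.interpolate(bev, size=self.bev_size,
                                        mode="bilinear", align_corners=False)
        bev = self.bev_encoder(bev)
        return self.head(bev)                                        # (B, K, 128, 128)


# Six cameras at the 704x256 resolution quoted in the abstract.
out = BEVDetSkeleton()(torch.randn(1, 6, 3, 256, 704))
print(out.shape)  # torch.Size([1, 10, 128, 128])
```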

* Multi-camera 3D Object Detection 

Face-NMS: A Core-set Selection Approach for Efficient Face Recognition

Sep 10, 2021
Yunze Chen, Junjie Huang, Jiagang Zhu, Zheng Zhu, Tian Yang, Guan Huang, Dalong Du

Recently, face recognition in the wild has achieved remarkable success, and one key engine is the increasing size of training data. For example, the largest face dataset, WebFace42M, contains about 2 million identities and 42 million faces. However, a massive number of faces raises constraints on training time, computing resources, and memory cost. Current research on this problem mainly focuses on designing an efficient fully-connected layer (FC) to reduce the GPU memory consumption caused by the large number of identities. In this work, we relax these constraints by resolving the redundancy problem of up-to-date face datasets caused by their greedy collection process (i.e., from the core-set selection perspective). As the first attempt from this perspective on the face recognition problem, we find that existing methods are limited in both performance and efficiency. For superior cost-efficiency, we contribute a novel filtering strategy dubbed Face-NMS. Face-NMS works in feature space and simultaneously considers local and global sparsity in generating core sets. In practice, Face-NMS is analogous to Non-Maximum Suppression (NMS) in the object detection community. It ranks the faces by their potential contribution to the overall sparsity and filters out superfluous faces in pairs with high similarity to maintain local sparsity. With respect to efficiency, Face-NMS accelerates the whole pipeline by applying a smaller but sufficient proxy dataset when training the proxy model. As a result, with Face-NMS, we successfully scale down the WebFace42M dataset to 60% while retaining its performance on the main benchmarks, offering a 40% resource saving and 1.64 times acceleration. The code is publicly available for reference at https://github.com/HuangJunJie2017/Face-NMS.
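
The NMS analogy suggests a simple greedy procedure: rank faces by how much they contribute to overall sparsity, then suppress near-duplicates of faces already kept. The toy function below implements that analogue in embedding space; the scoring rule, threshold, and keep ratio are illustrative guesses, not the paper's exact formulation.

```python
# NMS-style core-set selection over face embeddings (toy analogue).
import torch
import torch.nn.functional as F


def face_nms(embeddings: torch.Tensor, keep_ratio: float = 0.6, sim_thresh: float = 0.8):
    """Select a core set from face embeddings of shape (N, D)."""
    emb = F.normalize(embeddings, dim=1)
    sim = emb @ emb.t()                                # pairwise cosine similarity
    # Global sparsity proxy: faces far from the set's mean embedding score high.
    center = F.normalize(emb.mean(dim=0, keepdim=True), dim=1)
    score = 1.0 - (emb @ center.t()).squeeze(1)
    order = score.argsort(descending=True)
    keep, budget = [], int(keep_ratio * len(emb))
    for idx in order.tolist():
        # Local sparsity: drop a face too similar to one we already kept.
        if all(sim[idx, j] < sim_thresh for j in keep):
            keep.append(idx)
        if len(keep) >= budget:
            break
    return keep


faces = torch.randn(100, 512)          # one identity's embeddings (toy data)
core = face_nms(faces)
print(len(core), "faces kept out of", len(faces))
```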

* Data efficient face recognition; Core-set selection 

Masked Face Recognition Challenge: The WebFace260M Track Report

Aug 16, 2021
Zheng Zhu, Guan Huang, Jiankang Deng, Yun Ye, Junjie Huang, Xinze Chen, Jiagang Zhu, Tian Yang, Jia Guo, Jiwen Lu, Dalong Du, Jie Zhou

According to WHO statistics, there were more than 204,617,027 confirmed COVID-19 cases, including 4,323,247 deaths, worldwide as of August 12, 2021. During the coronavirus epidemic, almost everyone wears a facial mask. Traditionally, face recognition approaches process mostly non-occluded faces, which include primary facial features such as the eyes, nose, and mouth. Removing the mask for authentication in airports or laboratories increases the risk of virus infection, posing a huge challenge to current face recognition systems. Due to the sudden outbreak of the epidemic, there is not yet a publicly available real-world masked face recognition (MFR) benchmark. To cope with this issue, we organize the Face Bio-metrics under COVID Workshop and Masked Face Recognition Challenge at ICCV 2021. Enabled by the ultra-large-scale WebFace260M benchmark and the Face Recognition Under Inference Time conStraint (FRUITS) protocol, this challenge (WebFace260M Track) aims to push the frontiers of practical MFR. Since public evaluation sets are mostly saturated or contain noise, a new test set is gathered, consisting of 2,478 elaborately selected celebrities and 60,926 faces. Meanwhile, we collect the world's largest real-world masked test set. In the first phase of the WebFace260M Track, 69 teams (833 solutions in total) participated in the challenge, and 49 teams exceeded the performance of our baseline. The second phase of the challenge runs until October 1, 2021, with an ongoing leaderboard. We will actively update this report in the future.

* The WebFace260M Track of ICCV-21 MFR Challenge is still open in https://www.face-benchmark.org/challenge.html 