Zeming Li

Unifying Voxel-based Representation with Transformer for 3D Object Detection

Jun 01, 2022

Yanwei Li, Yilun Chen, Xiaojuan Qi, Zeming Li, Jian Sun, Jiaya Jia

Abstract: In this work, we present UVTR, a unified framework for multi-modality 3D object detection. The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection. To this end, a modality-specific space is first designed to represent different inputs in the voxel feature space. Unlike previous work, our approach preserves the voxel space without height compression to alleviate semantic ambiguity and enable spatial interactions. Benefiting from this unified representation, cross-modality interaction is then proposed to make full use of the inherent properties of different sensors, including knowledge transfer and modality fusion. In this way, geometry-aware expressions in point clouds and context-rich features in images are well utilized for better performance and robustness. A transformer decoder is applied to efficiently sample features from the unified space with learnable positions, which facilitates object-level interactions. In general, UVTR presents an early attempt to represent different modalities in a unified framework. It surpasses previous work in single- and multi-modality entries and achieves leading performance on the nuScenes test set with 69.7%, 55.1%, and 71.1% NDS for LiDAR, camera, and multi-modality inputs, respectively. Code is available at https://github.com/dvlab-research/UVTR.

* Code is available at https://github.com/dvlab-research/UVTR

Voxel Field Fusion for 3D Object Detection

May 31, 2022

Yanwei Li, Xiaojuan Qi, Yukang Chen, Liwei Wang, Zeming Li, Jian Sun, Jiaya Jia

Abstract: In this work, we present a conceptually simple yet effective framework for cross-modality 3D object detection, named voxel field fusion. The proposed approach aims to maintain cross-modality consistency by representing and fusing augmented image features as a ray in the voxel field. To this end, a learnable sampler is first designed to sample vital features from the image plane, which are projected to the voxel grid in a point-to-ray manner; this maintains consistency in feature representation with spatial context. In addition, ray-wise fusion is conducted to fuse features with the supplemental context in the constructed voxel field. We further develop a mixed augmentor to align feature-variant transformations, which bridges the modality gap in data augmentation. The proposed framework achieves consistent gains on various benchmarks and outperforms previous fusion-based methods on the KITTI and nuScenes datasets. Code is available at https://github.com/dvlab-research/VFF.

* Accepted to CVPR2022

A Closer Look at Self-supervised Lightweight Vision Transformers

May 28, 2022

Shaoru Wang, Jin Gao, Zeming Li, Jian Sun, Weiming Hu

Abstract: Self-supervised learning as a pre-training method for large-scale Vision Transformers (ViTs) has achieved promising downstream performance. Yet how such pre-training paradigms improve the performance of lightweight ViTs is considerably less studied. In this work, we develop recipes for pre-training high-performance lightweight ViTs using masked-image-modeling-based MAE, namely MAE-lite, which achieves 78.4% top-1 accuracy on ImageNet with ViT-Tiny (5.7M parameters). Furthermore, we develop and benchmark other fully-supervised and self-supervised pre-training counterparts, e.g., contrastive-learning-based MoCo-v3, on both ImageNet and other classification tasks. We analyze the effect of such pre-training and reveal that properly learned lower layers of the pre-trained models matter more than higher ones in data-sufficient downstream tasks. Finally, by further comparing with the pre-trained representations of up-scaled models, we develop a distillation strategy during pre-training that improves the pre-trained representations and leads to further downstream performance improvement. The code and models will be made publicly available.

Real-time Object Detection for Streaming Perception

Mar 29, 2022

Jinrong Yang, Songtao Liu, Zeming Li, Xiaoping Li, Jian Sun

Abstract: Autonomous driving requires the model to perceive the environment and (re)act with low latency for safety. While past works ignore the inevitable changes in the environment after processing, streaming perception has been proposed to jointly evaluate latency and accuracy in a single metric for online video perception. In this paper, instead of searching for trade-offs between accuracy and speed like previous works, we point out that endowing real-time models with the ability to predict the future is the key to dealing with this problem. We build a simple and effective framework for streaming perception. It is equipped with a novel DualFlow Perception (DFP) module, which includes dynamic and static flows to capture the moving trend and basic detection features for streaming prediction. Further, we introduce a Trend-Aware Loss (TAL) combined with a trend factor to generate adaptive weights for objects with different moving speeds. Our simple method achieves competitive performance on the Argoverse-HD dataset and improves the AP by 4.9% compared to the strong baseline, validating its effectiveness. Our code will be made available at https://github.com/yancie-yjr/StreamYOLO.

* CVPR 2022 Accepted Paper (Oral)
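To make concrete why latency matters in streaming evaluation, here is a minimal sketch. It assumes a pairing rule that the abstract does not spell out: each ground-truth frame is compared against the most recent prediction that had already finished when the frame arrived. The function name `latest_available` and all timestamps are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right

def latest_available(finish_times, query_time):
    """Return the index of the most recent prediction that finished at or
    before query_time, or None if no prediction is ready yet.

    finish_times must be sorted in ascending order.
    """
    idx = bisect_right(finish_times, query_time) - 1
    return idx if idx >= 0 else None

# Hypothetical timestamps in seconds (illustrative values only).
finish_times = [0.04, 0.09, 0.13]      # when each prediction became available
frame_times = [0.0, 0.033, 0.066, 0.1]  # when each ground-truth frame arrives

# For every frame, find which (possibly stale) prediction it would be scored against.
pairs = [latest_available(finish_times, t) for t in frame_times]
```

Under this assumed rule, a slow detector is always scored against frames newer than the one it processed, which is why a model that predicts the future state of objects can score better than one that is merely faster.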

Rebalanced Siamese Contrastive Mining for Long-Tailed Recognition

Mar 22, 2022

Zhisheng Zhong, Jiequan Cui, Eric Lo, Zeming Li, Jian Sun, Jiaya Jia

Abstract: Deep neural networks perform poorly on heavily class-imbalanced datasets. Given the promising performance of contrastive learning, we propose Rebalanced Siamese Contrastive Mining (ResCom) to tackle imbalanced recognition. Based on mathematical analysis and simulation results, we claim that supervised contrastive learning suffers from a dual class-imbalance problem at both the original batch and Siamese batch levels, which is more serious than in long-tailed classification learning. In this paper, at the original batch level, we introduce a class-balanced supervised contrastive loss to assign adaptive weights to different classes. At the Siamese batch level, we present a class-balanced queue, which maintains the same number of keys for all classes. Furthermore, we note that the gradient of the contrastive loss with respect to the contrastive logits can be decoupled into positives and negatives, and that easy positives and easy negatives make the contrastive gradient vanish. We propose supervised hard positive and negative pair mining to pick informative pairs for contrastive computation and improve representation learning. Finally, to approximately maximize the mutual information between the two views, we propose Siamese Balanced Softmax and combine it with the contrastive loss for one-stage training. ResCom outperforms previous methods by large margins on multiple long-tailed recognition benchmarks. Our code will be made publicly available at https://github.com/dvlab-research/ResCom.

* Tech report
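The class-balanced queue described in the abstract can be pictured with a small sketch. This is one possible realization under stated assumptions, not the authors' implementation: it assumes one fixed-size FIFO per class, and the class name `ClassBalancedQueue`, its methods, and the string keys are all hypothetical.

```python
from collections import deque

class ClassBalancedQueue:
    """Hypothetical sketch: one bounded FIFO per class, so no class can
    hold more than keys_per_class keys regardless of batch imbalance."""

    def __init__(self, num_classes, keys_per_class):
        # deque(maxlen=...) silently drops the oldest key once it is full.
        self.queues = [deque(maxlen=keys_per_class) for _ in range(num_classes)]

    def enqueue(self, keys, labels):
        # Route every key to the queue of its own class.
        for key, label in zip(keys, labels):
            self.queues[label].append(key)

    def sizes(self):
        return [len(q) for q in self.queues]

queue = ClassBalancedQueue(num_classes=3, keys_per_class=2)

# A long-tailed batch: class 0 dominates, classes 1 and 2 are rare.
queue.enqueue(keys=["k0", "k1", "k2", "k3", "k4", "k5"], labels=[0, 0, 0, 0, 1, 2])

# A later batch that happens to contain the rare classes.
queue.enqueue(keys=["k6", "k7"], labels=[1, 2])
```

Once every per-class queue has filled, each class contributes the same number of keys, even though the head class appeared four times as often in the first batch.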

Fully Convolutional Networks for Panoptic Segmentation with Point-based Supervision

Aug 18, 2021

Yanwei Li, Hengshuang Zhao, Xiaojuan Qi, Yukang Chen, Lu Qi, Liwei Wang, Zeming Li, Jian Sun, Jiaya Jia

Abstract: In this paper, we present a conceptually simple, strong, and efficient framework for fully- and weakly-supervised panoptic segmentation, called Panoptic FCN. Our approach aims to represent and predict foreground things and background stuff in a unified fully convolutional pipeline, which can be optimized with point-based full or weak supervision. In particular, Panoptic FCN encodes each object instance or stuff category with the proposed kernel generator and produces the prediction by convolving the high-resolution feature directly. With this approach, the instance-aware property of things and the semantically consistent property of stuff can both be satisfied in a simple generate-kernel-then-segment workflow. Without extra boxes for localization or instance separation, the proposed approach outperforms previous box-based and box-free models with high efficiency. Furthermore, we propose a new form of point-based annotation for weakly-supervised panoptic segmentation. It needs only several random points for both things and stuff, which dramatically reduces the human annotation cost. The proposed Panoptic FCN also shows far superior performance in this weakly-supervised setting, achieving 82% of the fully-supervised performance with only 20 randomly annotated points per instance. Extensive experiments demonstrate the effectiveness and efficiency of Panoptic FCN on the COCO, VOC 2012, Cityscapes, and Mapillary Vistas datasets. It also sets a new leading benchmark for both fully- and weakly-supervised panoptic segmentation. Our code and models are publicly available at https://github.com/dvlab-research/PanopticFCN.

* Extended version of arXiv:2012.00720
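The generate-kernel-then-segment workflow can be illustrated with a toy sketch. It assumes that each generated kernel is a single per-channel weight vector applied at every pixel (a 1x1 convolution); the abstract only says the high-resolution feature is convolved directly, so the kernel shape, the function name `segment_with_kernels`, and the toy values are assumptions.

```python
def segment_with_kernels(kernels, feature):
    """Apply one 1x1 kernel per instance or stuff category to a feature map.

    kernels: list of N weight vectors, each of length C.
    feature: C x H x W nested lists (the high-resolution feature).
    Returns N mask-logit maps, each H x W.
    """
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    masks = []
    for kernel in kernels:
        # Each output pixel is the dot product of the kernel with the
        # feature vector at that pixel.
        mask = [
            [sum(kernel[c] * feature[c][y][x] for c in range(channels))
             for x in range(width)]
            for y in range(height)
        ]
        masks.append(mask)
    return masks

# Toy 2-channel, 2x2 feature map (illustrative values only).
feature = [
    [[1.0, 0.0], [0.0, 0.0]],  # channel 0 responds at the top-left pixel
    [[0.0, 0.0], [0.0, 1.0]],  # channel 1 responds at the bottom-right pixel
]
# Two hypothetical generated kernels, one per predicted object or stuff region.
kernels = [[1.0, 0.0], [0.0, 1.0]]

masks = segment_with_kernels(kernels, feature)
```

Because every thing instance and stuff category is reduced to one kernel, the same operation produces both kinds of masks, with no boxes involved.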

YOLOX: Exceeding YOLO Series in 2021

Aug 06, 2021

Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, Jian Sun

Abstract: In this report, we present some experience-driven improvements to the YOLO series, forming a new high-performance detector: YOLOX. We switch the YOLO detector to an anchor-free manner and adopt other advanced detection techniques, namely a decoupled head and the leading label assignment strategy SimOTA, to achieve state-of-the-art results across a wide range of model scales. For YOLO-Nano with only 0.91M parameters and 1.08G FLOPs, we get 25.3% AP on COCO, surpassing NanoDet by 1.8% AP; for YOLOv3, one of the most widely used detectors in industry, we boost it to 47.3% AP on COCO, outperforming the current best practice by 3.0% AP; for YOLOX-L, with roughly the same number of parameters as YOLOv4-CSP and YOLOv5-L, we achieve 50.0% AP on COCO at a speed of 68.9 FPS on a Tesla V100, exceeding YOLOv5-L by 1.8% AP. Further, we won 1st place in the Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) using a single YOLOX-L model. We hope this report provides useful experience for developers and researchers in practical scenarios, and we also provide deployment versions with ONNX, TensorRT, NCNN, and OpenVINO support. Source code is at https://github.com/Megvii-BaseDetection/YOLOX.

Workshop on Autonomous Driving at CVPR 2021: Technical Report for Streaming Perception Challenge

Jul 27, 2021

Songyang Zhang, Lin Song, Songtao Liu, Zheng Ge, Zeming Li, Xuming He, Jian Sun

Abstract: In this report, we introduce our real-time 2D object detection system for realistic autonomous driving scenarios. Our detector is built on a newly designed YOLO model, called YOLOX. On the Argoverse-HD dataset, our system achieves 41.0 streaming AP, surpassing second place by 7.8 and 6.1 on the detection-only track and the full track, respectively. Moreover, equipped with TensorRT, our model achieves 30 FPS inference speed with a high-resolution input size (e.g., 1440-2304). Code and models will be available at https://github.com/Megvii-BaseDetection/YOLOX.

* Report of the 1st Place of the Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021)

Generalized Few-Shot Object Detection without Forgetting

May 20, 2021

Zhibo Fan, Yuchen Ma, Zeming Li, Jian Sun

Abstract: Recently, few-shot object detection has been widely adopted to deal with data-limited situations. While most previous works focus merely on performance on few-shot categories, we claim that detecting all classes is crucial, as test samples may contain any instances in realistic applications, which requires the few-shot detector to learn new concepts without forgetting. Through analysis of transfer-learning-based methods, some neglected but beneficial properties are utilized to design a simple yet effective few-shot detector, Retentive R-CNN. It consists of a Bias-Balanced RPN to debias the pretrained RPN and a Re-detector to find few-shot class objects without forgetting previous knowledge. Extensive experiments on few-shot detection benchmarks show that Retentive R-CNN significantly outperforms state-of-the-art methods on overall performance across all settings, as it achieves competitive results on few-shot classes and does not degrade base-class performance at all. Our approach demonstrates that the long-desired never-forgetting learner is attainable in object detection.

* Accepted by CVPR 2021

IQDet: Instance-wise Quality Distribution Sampling for Object Detection

Apr 14, 2021

Yuchen Ma, Songtao Liu, Zeming Li, Jian Sun

Abstract: We propose a dense object detector with an instance-wise sampling strategy, named IQDet. Instead of using sampling strategies based on human priors, we first extract the regional feature of each ground truth to estimate the instance-wise quality distribution. Modeled as a mixture model in the spatial dimensions, the distribution is more noise-robust and adapts to the semantic pattern of each instance. Based on the distribution, we propose a quality sampling strategy, which automatically selects training samples in a probabilistic manner and trains with more high-quality samples. Extensive experiments on MS COCO show that our method steadily improves the baseline by nearly 2.4 AP without bells and whistles. Moreover, our best model achieves 51.6 AP, outperforming all existing state-of-the-art one-stage detectors, and it is completely cost-free at inference time.

* Accepted by CVPR 2021
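Selecting training samples "in a probabilistic manner" can be sketched as follows. This is a simplified illustration, not IQDet's actual procedure: it assumes each candidate location already has a scalar quality score and draws locations with probability proportional to that score. The function name `sample_by_quality` and the scores are hypothetical.

```python
import random

def sample_by_quality(qualities, num_samples, seed=0):
    """Draw candidate indices with probability proportional to their quality.

    qualities: non-negative scores, one per candidate location.
    Sampling is with replacement; a fixed seed keeps the draw reproducible.
    """
    rng = random.Random(seed)
    indices = list(range(len(qualities)))
    return rng.choices(indices, weights=qualities, k=num_samples)

# Hypothetical quality scores for three candidate locations of one instance.
qualities = [0.0, 0.1, 0.9]
picked = sample_by_quality(qualities, num_samples=1000)
```

High-quality locations dominate the draw while low-quality ones still appear occasionally, in contrast with a hard threshold that keeps or discards each location outright.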
