Naiyan Wang

3D Video Object Detection with Learnable Object-Centric Global Optimization

Mar 27, 2023
Jiawei He, Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang

We explore long-term temporal visual correspondence-based optimization for 3D video object detection in this work. Visual correspondence refers to one-to-one mappings for pixels across multiple images. Correspondence-based optimization is the cornerstone for 3D scene reconstruction but is less studied in 3D video object detection, because moving objects violate multi-view geometry constraints and are treated as outliers during scene reconstruction. We address this issue by treating objects as first-class citizens during correspondence-based optimization. In this work, we propose BA-Det, an end-to-end optimizable object detector with object-centric temporal correspondence learning and featuremetric object bundle adjustment. Empirically, we verify the effectiveness and efficiency of BA-Det for multiple baseline 3D detectors under various setups. Our BA-Det achieves SOTA performance on the large-scale Waymo Open Dataset (WOD) with only marginal computation cost. Our code is available at https://github.com/jiaweihe1996/BA-Det.

* CVPR 2023

Learnable Graph Matching: A Practical Paradigm for Data Association

Mar 27, 2023
Jiawei He, Zehao Huang, Naiyan Wang, Zhaoxiang Zhang

Data association is at the core of many computer vision tasks, e.g., multiple object tracking, image matching, and point cloud registration. Existing methods usually solve the data association problem by network flow optimization, bipartite matching, or direct end-to-end learning. Despite their popularity, we find some defects in the current solutions: they mostly ignore the intra-view context information; moreover, they either train deep association models end-to-end and hardly exploit the advantages of optimization-based assignment methods, or only use an off-the-shelf neural network to extract features. In this paper, we propose a general learnable graph matching method to address these issues. Specifically, we model the intra-view relationships as an undirected graph. Data association then becomes a general graph matching problem between graphs. Furthermore, to make optimization end-to-end differentiable, we relax the original graph matching problem into continuous quadratic programming and then incorporate training into a deep graph neural network with the KKT conditions and the implicit function theorem. On the MOT task, our method achieves state-of-the-art performance on several MOT datasets. For image matching, our method outperforms state-of-the-art methods with half the training data and iterations on a popular indoor dataset, ScanNet. Code will be available at https://github.com/jiaweihe1996/GMTracker.
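
To illustrate why intra-view context matters, here is a minimal sketch of graph matching as a quadratic assignment problem: the score of a matching sums unary node affinities and pairwise edge-consistency terms. It solves a tiny instance by brute force over permutations, which is not the relaxed, differentiable quadratic program described in the abstract. All names and numbers (`NODE_AFF`, `EDGES_A`, `EDGES_B`, the edge weights) are hypothetical.

```python
from itertools import permutations


def edge_weight(edges, i, j):
    # Undirected graph: each edge is stored once under a sorted key.
    return edges[(min(i, j), max(i, j))]


def matching_score(perm, node_aff, edges_a, edges_b):
    """Unary node affinities plus pairwise edge-consistency penalties."""
    unary = sum(node_aff[i][perm[i]] for i in range(len(perm)))
    pairwise = 0.0
    for (i, j), w_a in edges_a.items():
        # Penalize matchings that map an edge of A onto a dissimilar edge of B.
        pairwise -= abs(w_a - edge_weight(edges_b, perm[i], perm[j]))
    return unary + pairwise


def best_matching(node_aff, edges_a, edges_b):
    # Brute force; only feasible for very small graphs.
    n = len(node_aff)
    return max(
        permutations(range(n)),
        key=lambda p: matching_score(p, node_aff, edges_a, edges_b),
    )


# Hypothetical example: nodes 0 and 1 look identical, so node affinity
# alone cannot tell the matchings (0, 1, 2) and (1, 0, 2) apart.
NODE_AFF = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# Edge weights (e.g. distances between detections within one view).
EDGES_A = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 5.0}
EDGES_B = {(0, 1): 1.0, (0, 2): 5.0, (1, 2): 2.0}

BEST = best_matching(NODE_AFF, EDGES_A, EDGES_B)
```

In this toy instance the pairwise terms break the tie left by the node affinities, which is the kind of intra-view information a purely bipartite formulation discards.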

* Submitted to TPAMI on Mar 21, 2022. arXiv admin note: substantial text overlap with arXiv:2103.16178

FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models

Mar 22, 2023
Jianglong Ye, Naiyan Wang, Xiaolong Wang

Recent works on generalizable NeRFs have shown promising results on novel view synthesis from single or few images. However, such models have rarely been applied to downstream tasks beyond synthesis, such as semantic understanding and parsing. In this paper, we propose a novel framework named FeatureNeRF to learn generalizable NeRFs by distilling pre-trained vision foundation models (e.g., DINO, Latent Diffusion). FeatureNeRF lifts 2D pre-trained foundation models to 3D space via neural rendering, and then extracts deep features for 3D query points from NeRF MLPs. Consequently, it can map 2D images to continuous 3D semantic feature volumes, which can be used for various downstream tasks. We evaluate FeatureNeRF on tasks of 2D/3D semantic keypoint transfer and 2D/3D object part segmentation. Our extensive experiments demonstrate the effectiveness of FeatureNeRF as a generalizable 3D semantic feature extractor. Our project page is available at https://jianglongye.com/featurenerf/ .

* Project page: https://jianglongye.com/featurenerf/

Deep Planar Parallax for Monocular Depth Estimation

Jan 09, 2023
Haoqian Liang, Zhichao Li, Ya Yang, Naiyan Wang

Depth estimation is a fundamental problem in the perception system of autonomous driving scenes. Although autonomous driving is challenging, much prior knowledge can still be utilized to effectively reduce the complexity of the problem. Some previous works introduce the road plane prior into the depth estimation problem according to Planar Parallax Geometry. However, we find that their usage of the prior is not effective, leaving the network unable to learn the geometric information. To this end, we analyze this problem in detail and reveal that explicit warping of consecutive frames and flow pre-training can effectively bring the geometric prior into learning. Furthermore, we propose Planar Position Embedding to deal with the intrinsic weakness of Planar Parallax Geometry. Comprehensive experimental results on autonomous driving datasets such as KITTI and Waymo Open Dataset (WOD) demonstrate that our Planar Parallax Network (PPNet) dramatically outperforms existing learning-based methods.

Anchor3DLane: Learning to Regress 3D Anchors for Monocular 3D Lane Detection

Jan 06, 2023
Shaofei Huang, Zhenwei Shen, Zehao Huang, Zihan Ding, Jiao Dai, Jizhong Han, Naiyan Wang, Si Liu

Monocular 3D lane detection is a challenging task due to its lack of depth information. A popular solution to 3D lane detection is to first transform the front-viewed (FV) images or features into the bird-eye-view (BEV) space with inverse perspective mapping (IPM) and detect lanes from BEV features. However, IPM's reliance on the flat-ground assumption and its loss of context information make it inaccurate to restore 3D information from BEV representations. An attempt has been made to get rid of BEV and predict 3D lanes from FV representations directly, but it still underperforms other BEV-based methods given its lack of structured representation for 3D lanes. In this paper, we define 3D lane anchors in the 3D space and propose a BEV-free method named Anchor3DLane to predict 3D lanes directly from FV representations. 3D lane anchors are projected to the FV features to extract their features, which contain both good structural and context information for making accurate predictions. We further extend Anchor3DLane to the multi-frame setting to incorporate temporal information for performance improvement. In addition, we develop a global optimization method that makes use of the equal-width property between lanes to reduce the lateral error of predictions. Extensive experiments on three popular 3D lane detection benchmarks show that our Anchor3DLane outperforms previous BEV-based methods and achieves state-of-the-art performance.
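
The projection step can be pictured with a minimal pinhole-camera sketch: each 3D anchor point is mapped to a pixel location in the front-view image, where features could then be sampled. This is an illustration under stated assumptions, not the paper's implementation: the points are assumed to be already in the camera frame (x right, y down, z forward), and the intrinsics, image size, and anchor coordinates are hypothetical.

```python
def project_anchor(points_cam, fx, fy, cx, cy, width, height):
    """Project 3D points (camera frame) to pixel coordinates.

    Points behind the camera or outside the image are dropped.
    """
    pixels = []
    for x, y, z in points_cam:
        if z <= 0:
            continue  # behind the camera, cannot be projected
        u = fx * x / z + cx
        v = fy * y / z + cy
        if 0 <= u < width and 0 <= v < height:
            pixels.append((u, v))
    return pixels


# Hypothetical straight lane anchor: 1.5 m to the right of the camera,
# 1.5 m below it, sampled at 10, 20, and 40 m ahead, plus one point
# behind the camera that gets filtered out.
ANCHOR = [(1.5, 1.5, 10.0), (1.5, 1.5, 20.0), (1.5, 1.5, 40.0), (1.5, 1.5, -5.0)]

PIXELS = project_anchor(
    ANCHOR, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0, width=1280, height=720
)
```

As the anchor recedes, its projections converge toward the principal point, so a single 3D anchor traces a slanted line in the image without any flat-ground warp.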

Object as Query: Equipping Any 2D Object Detector with 3D Detection Ability

Jan 06, 2023
Zitian Wang, Zehao Huang, Jiahui Fu, Naiyan Wang, Si Liu

3D object detection from multi-view images has drawn much attention over the past few years. Existing methods mainly establish 3D representations from multi-view images and adopt a dense detection head for object detection, or employ object queries distributed in 3D space to localize objects. In this paper, we design Multi-View 2D Objects guided 3D Object Detector (MV2D), which can be equipped with any 2D object detector to promote multi-view 3D object detection. Since 2D detections can provide valuable priors for object existence, MV2D exploits a 2D detector to generate object queries conditioned on the rich image semantics. These dynamically generated queries enable MV2D to detect objects in a larger 3D space without increased computational cost and show a strong capability for localizing 3D objects. For the generated queries, we design a sparse cross attention module to force them to focus on the features of specific objects, which reduces the computational cost and suppresses interference from noise. The evaluation results on the nuScenes dataset demonstrate that dynamic object queries and sparse feature aggregation do not harm 3D detection capability. MV2D also exhibits state-of-the-art performance among existing methods. We hope MV2D can serve as a new baseline for future research.

* technical report

Super Sparse 3D Object Detection

Jan 05, 2023
Lue Fan, Yuxue Yang, Feng Wang, Naiyan Wang, Zhaoxiang Zhang

As the perception range of LiDAR expands, LiDAR-based 3D object detection contributes ever-increasingly to the long-range perception in autonomous driving. Mainstream 3D object detectors often build dense feature maps, where the cost is quadratic to the perception range, making them hard to scale up to long-range settings. To enable efficient long-range detection, we first propose a fully sparse object detector termed FSD. FSD is built upon the general sparse voxel encoder and a novel sparse instance recognition (SIR) module. SIR groups the points into instances and applies highly efficient instance-wise feature extraction. The instance-wise grouping sidesteps the issue of center feature missing, which hinders the design of the fully sparse architecture. To further exploit the fully sparse design, we leverage temporal information to remove data redundancy and propose a super sparse detector named FSD++. FSD++ first generates residual points, which indicate the point changes between consecutive frames. The residual points, along with a few previous foreground points, form the super sparse input data, greatly reducing data redundancy and computational overhead. We comprehensively analyze our method on the large-scale Waymo Open Dataset, and state-of-the-art performance is reported. To showcase the superiority of our method in long-range detection, we also conduct experiments on Argoverse 2 Dataset, where the perception range ($200\,\mathrm{m}$) is much larger than Waymo Open Dataset ($75\,\mathrm{m}$). Code is open-sourced at https://github.com/tusen-ai/SST.
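
One simple way to picture residual points is as the points of the current frame that fall into voxels left empty by the previous frame. The sketch below implements that reading; it is an assumption made for illustration and may differ from how FSD++ actually defines and computes residual points. The function names, the voxel size, and the sample coordinates are hypothetical, and both frames are assumed to share one coordinate system.

```python
import math


def voxel_key(point, voxel_size):
    # Integer voxel index of a 3D point.
    return tuple(math.floor(c / voxel_size) for c in point)


def residual_points(prev_points, curr_points, voxel_size=0.5):
    """Keep current points whose voxel was not occupied in the previous frame."""
    occupied = {voxel_key(p, voxel_size) for p in prev_points}
    return [p for p in curr_points if voxel_key(p, voxel_size) not in occupied]


# Hypothetical frames: the first current point re-observes a static
# surface, the other two land in previously empty voxels.
PREV = [(0.1, 0.1, 0.0), (5.2, 1.0, 0.0)]
CURR = [(0.2, 0.3, 0.1), (5.9, 1.0, 0.0), (10.0, -2.0, 0.5)]

RESIDUAL = residual_points(PREV, CURR)
```

Because static background dominates most driving scenes, a filter of this kind leaves only a small fraction of the points as new input per frame.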

* Extension of Fully Sparse 3D Object Detection [arXiv:2207.10035]

PredNAS: A Universal and Sample Efficient Neural Architecture Search Framework

Oct 26, 2022
Liuchun Yuan, Zehao Huang, Naiyan Wang

In this paper, we present a general and effective framework for Neural Architecture Search (NAS), named PredNAS. The motivation is that given a differentiable performance estimation function, we can directly optimize the architecture towards higher performance by simple gradient ascent. Specifically, we adopt a neural predictor as the performance predictor. Surprisingly, PredNAS can achieve state-of-the-art performance on NAS benchmarks with only a few training samples (fewer than 100). To validate the universality of our method, we also apply it to large-scale tasks and compare it with RegNet on ImageNet and YOLOX on MSCOCO. The results demonstrate that PredNAS can explore novel architectures with competitive performance under specific computational complexity constraints.
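
The core idea, gradient ascent on a differentiable performance estimate, can be sketched with a toy predictor. Here a concave quadratic with a hand-written gradient stands in for the neural predictor; the continuous architecture encoding, the `peak` parameter, the learning rate, and the step count are all hypothetical choices made for illustration.

```python
def predicted_score(arch, peak):
    # Toy stand-in for a learned predictor: highest at arch == peak.
    return -sum((a - p) ** 2 for a, p in zip(arch, peak))


def score_gradient(arch, peak):
    # Analytic gradient of predicted_score with respect to arch.
    return [-2.0 * (a - p) for a, p in zip(arch, peak)]


def gradient_ascent(arch, peak, lr=0.1, steps=100):
    """Move the architecture encoding uphill on the predicted score."""
    arch = list(arch)
    for _ in range(steps):
        grad = score_gradient(arch, peak)
        arch = [a + lr * g for a, g in zip(arch, grad)]
    return arch


# Hypothetical 3-dimensional encoding (e.g. relaxed depth, width, kernel size).
PEAK = [0.7, 0.2, 0.5]
START = [0.0, 1.0, 0.5]
FOUND = gradient_ascent(START, PEAK)
```

With a real neural predictor the gradient would come from automatic differentiation, and the optimized continuous encoding would still have to be mapped back to a valid discrete architecture.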

* 11 pages, 4 figures

YOLOV: Making Still Image Object Detectors Great at Video Object Detection

Aug 20, 2022
Yuheng Shi, Naiyan Wang, Xiaojie Guo

Video object detection (VID) is challenging because of the high variation of object appearance as well as the diverse deterioration in some frames. On the positive side, the detection in a certain frame of a video, compared with that in a still image, can draw support from other frames. Hence, how to aggregate features across different frames is pivotal to the VID problem. Most existing aggregation algorithms are customized for two-stage detectors. However, detectors in this category are usually computationally expensive due to their two-stage nature. This work proposes a simple yet effective strategy to address the above concerns, which yields significant gains in accuracy at marginal overhead. Concretely, different from the traditional two-stage pipeline, we advocate putting the region-level selection after the one-stage detection to avoid processing massive low-quality candidates. Besides, a novel module is constructed to evaluate the relationship between a target frame and its reference ones, and to guide the aggregation. Extensive experiments and ablation studies are conducted to verify the efficacy of our design, and reveal its superiority over other state-of-the-art VID approaches in both effectiveness and efficiency. Our YOLOX-based model can achieve promising performance (e.g., 87.5% AP50 at over 30 FPS on the ImageNet VID dataset on a single 2080Ti GPU), making it attractive for large-scale or real-time applications. The implementation is simple; the demo code and models have been made available at https://github.com/YuHengsss/YOLOV .
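
As a rough picture of similarity-guided aggregation, the sketch below weights reference-frame features by their cosine similarity to a target feature and takes a softmax-weighted sum. This is a generic attention-style aggregation written for illustration, not the module proposed in YOLOV; the function names, the temperature, and the feature vectors are hypothetical, and all feature vectors are assumed to be non-zero.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def aggregate(target, references, temperature=0.1):
    """Softmax-weighted sum of reference features, weighted by similarity."""
    logits = [cosine(target, r) / temperature for r in references]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    aggregated = [
        sum(w * r[d] for w, r in zip(weights, references))
        for d in range(len(target))
    ]
    return aggregated, weights


# Hypothetical region features: one target and three references, of
# which the second is unrelated to the target.
TARGET = [1.0, 0.0]
REFERENCES = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]
AGGREGATED, WEIGHTS = aggregate(TARGET, REFERENCES)
```

References that resemble the target dominate the sum, so a degraded or unrelated frame contributes little to the aggregated feature.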

Fully Sparse 3D Object Detection

Jul 20, 2022
Lue Fan, Feng Wang, Naiyan Wang, Zhaoxiang Zhang

As the perception range of LiDAR increases, LiDAR-based 3D object detection becomes a dominant task in long-range perception for autonomous driving. The mainstream 3D object detectors usually build dense feature maps in the network backbone and prediction head. However, the computational and spatial costs on the dense feature map are quadratic to the perception range, which makes them hard to scale up to the long-range setting. To enable efficient long-range LiDAR-based object detection, we build a fully sparse 3D object detector (FSD). The computational and spatial cost of FSD is roughly linear to the number of points and independent of the perception range. FSD is built upon the general sparse voxel encoder and a novel sparse instance recognition (SIR) module. SIR first groups the points into instances and then applies instance-wise feature extraction and prediction. In this way, SIR resolves the issue of center feature missing, which hinders the design of the fully sparse architecture for all center-based or anchor-based detectors. Moreover, SIR avoids the time-consuming neighbor queries in previous point-based methods by grouping points into instances. We conduct extensive experiments on the large-scale Waymo Open Dataset to reveal the working mechanism of FSD, and state-of-the-art performance is reported. To demonstrate the superiority of FSD in long-range detection, we also conduct experiments on Argoverse 2 Dataset, which has a much larger perception range ($200\,\mathrm{m}$) than Waymo Open Dataset ($75\,\mathrm{m}$). On such a large perception range, FSD achieves state-of-the-art performance and is 2.4$\times$ faster than the dense counterpart. Code will be released at https://github.com/TuSimple/SST.
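
The quadratic scaling is easy to check with a little arithmetic. The sketch below counts the cells of a square bird's-eye-view grid that covers the full perception range in every direction; the square grid layout and the 0.5 m cell size are assumptions made for illustration, not settings taken from the paper.

```python
import math


def dense_bev_cells(range_m, cell_size_m):
    """Number of cells in a square grid spanning [-range, +range] on both axes."""
    cells_per_side = math.ceil(2 * range_m / cell_size_m)
    return cells_per_side * cells_per_side


WAYMO_CELLS = dense_bev_cells(75.0, 0.5)       # 300 x 300 grid
ARGOVERSE_CELLS = dense_bev_cells(200.0, 0.5)  # 800 x 800 grid
GROWTH = ARGOVERSE_CELLS / WAYMO_CELLS
```

Under these assumptions the range grows by a factor of about 2.7 while the dense grid grows by a factor of about 7.1, whereas a cost that depends only on the number of points does not grow with the empty area being covered.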
