Nanning Zheng

Xi'an Jiaotong University

InteractionNet: Joint Planning and Prediction for Autonomous Driving with Transformers

Sep 07, 2023

Jiawei Fu, Yanqing Shen, Zhiqiang Jian, Shitao Chen, Jingmin Xin, Nanning Zheng

Abstract: Planning and prediction are two important modules of autonomous driving and have advanced rapidly in recent years. Nevertheless, most existing methods treat planning and prediction as independent and ignore the correlation between them, and therefore fail to account for interaction and the dynamic changes of traffic scenarios. To address this challenge, we propose InteractionNet, which uses a transformer to share global contextual reasoning among all traffic participants, capturing interaction and interconnecting planning and prediction so that the two are performed jointly. In addition, InteractionNet deploys another transformer to help the model pay extra attention to the perceived region containing critical or unseen vehicles. InteractionNet outperforms other baselines in several benchmarks, especially in terms of safety, which benefits from the joint consideration of planning and forecasting. The code will be available at https://github.com/fujiawei0724/InteractionNet.

* Accepted to IROS 2023

Complementing Onboard Sensors with Satellite Map: A New Perspective for HD Map Construction

Aug 29, 2023

Wenjie Gao, Jiawei Fu, Haodong Jing, Nanning Zheng

Abstract: High-Definition (HD) maps play a crucial role in autonomous driving systems. Recent methods have attempted to construct HD maps in real time from information obtained by vehicle onboard sensors. However, the performance of these methods is highly sensitive to the environment surrounding the vehicle because of the inherent limitations of onboard sensors, such as weak long-range detection. In this study, we demonstrate that supplementing onboard sensors with satellite maps can enhance the performance of HD map construction methods by leveraging the broad coverage of satellite maps. To support further research, we release the satellite map tiles as a complementary dataset to the nuScenes dataset. We also propose a hierarchical fusion module that enables better fusion of satellite map information with existing methods. Specifically, for feature-level fusion, we design an attention mask based on segmentation and distance and apply the cross-attention mechanism to fuse onboard Bird's Eye View (BEV) features with satellite features. For BEV-level fusion, an alignment module is introduced before concatenation to mitigate the impact of misalignment between the two features. Experimental results on the augmented nuScenes dataset show that our module integrates into three existing HD map construction methods and notably enhances their performance in both HD map semantic segmentation and instance detection tasks.

V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection

Aug 08, 2023

Yichao Shen, Zigang Geng, Yuhui Yuan, Yutong Lin, Ze Liu, Chunyu Wang, Han Hu, Nanning Zheng, Baining Guo

Abstract: We introduce a highly performant 3D object detector for point clouds using the DETR framework. Prior attempts all end up with suboptimal results because they fail to learn accurate inductive biases from the limited scale of training data. In particular, the queries often attend to points that are far away from the target objects, violating the locality principle in object detection. To address this limitation, we introduce a novel 3D Vertex Relative Position Encoding (3DV-RPE) method, which computes a position encoding for each point based on its relative position to the 3D boxes predicted by the queries in each decoder layer, thus providing clear information to guide the model to focus on points near the objects, in accordance with the principle of locality. In addition, we systematically improve the pipeline in various aspects, such as data normalization, based on our understanding of the task. We show exceptional results on the challenging ScanNetV2 benchmark, achieving significant improvements over the previous 3DETR in $\mathrm{AP}_{25}$/$\mathrm{AP}_{50}$ from 65.0%/47.0% to 77.8%/66.0%, respectively. In addition, our method sets a new record on the ScanNetV2 and SUN RGB-D datasets. Code will be released at http://github.com/yichaoshen-MS/V-DETR.
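
The geometric input that a vertex-relative encoding starts from can be illustrated with a small sketch. It only computes the offsets from a point to the eight corners of a box. It assumes an axis-aligned box given by a center and a size, and the function names are hypothetical. How the paper turns such offsets into an encoding is not described in the abstract and is not reproduced here.

```python
from itertools import product


def box_vertices(center, size):
    """Return the 8 corners of an axis-aligned 3D box (assumption: no rotation)."""
    cx, cy, cz = center
    hx, hy, hz = (s / 2.0 for s in size)
    return [
        (cx + sx * hx, cy + sy * hy, cz + sz * hz)
        for sx, sy, sz in product((-1, 1), repeat=3)
    ]


def vertex_offsets(point, center, size):
    """Offsets from `point` to each of the 8 box corners (point minus corner)."""
    px, py, pz = point
    return [
        (px - vx, py - vy, pz - vz)
        for vx, vy, vz in box_vertices(center, size)
    ]


# A point at the box center has offsets of equal magnitude to every corner,
# while a distant point has large offsets along the axis it is displaced on.
near = vertex_offsets((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
far = vertex_offsets((10.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
```

Offsets of this kind distinguish points close to a predicted box from points far away, which is the locality signal the abstract refers to.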

DETR Doesn't Need Multi-Scale or Locality Design

Aug 03, 2023

Yutong Lin, Yuhui Yuan, Zheng Zhang, Chen Li, Nanning Zheng, Han Hu

Abstract: This paper presents an improved DETR detector that maintains a "plain" nature: it uses a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that reintroduce the architectural inductive biases of multi-scale and locality into the decoder. We show that two simple techniques are surprisingly effective within a plain design at compensating for the lack of multi-scale feature maps and locality constraints. The first is a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation, which guides each query to attend to the corresponding object region while also providing encoding flexibility. The second is masked image modeling (MIM)-based backbone pre-training, which helps learn representations with fine-grained localization ability and proves crucial for remedying the dependency on multi-scale feature maps. By incorporating these techniques and recent advancements in training and problem formulation, the improved "plain" DETR shows exceptional improvements over the original DETR detector. By leveraging the Objects365 dataset for pre-training, it achieves 63.9 mAP using a Swin-L backbone, which is highly competitive with state-of-the-art detectors that all rely heavily on multi-scale feature maps and region-based feature extraction. Code is available at https://github.com/impiga/Plain-DETR.

* To be published in ICCV 2023

FS-Depth: Focal-and-Scale Depth Estimation from a Single Image in Unseen Indoor Scene

Jul 27, 2023

Chengrui Wei, Meng Yang, Lei He, Nanning Zheng

Abstract: Predicting absolute depth maps from single images in real (unseen) indoor scenes has long been an ill-posed problem. We observe that this is due not only to the scale-ambiguity problem but also to the focal-ambiguity problem, both of which decrease the generalization ability of monocular depth estimation. That is, images may be captured by cameras of different focal lengths in scenes of different scales. In this paper, we develop a focal-and-scale depth estimation model to learn absolute depth maps from single images in unseen indoor scenes. First, a relative depth estimation network is adopted to learn relative depths from single images with diverse scales/semantics. Second, multi-scale features are generated by mapping a single focal length value to focal length features and concatenating them with intermediate features of different scales in relative depth estimation. Finally, relative depths and multi-scale features are jointly fed into an absolute depth estimation network. In addition, a new pipeline is developed to augment the diversity of focal lengths in public datasets, which are often captured with cameras of the same or similar focal lengths. Our model is trained on augmented NYUDv2 and tested on three unseen datasets. It improves the generalization ability of depth estimation by 41%/13% (RMSE) with/without data augmentation compared with five recent state-of-the-art methods and alleviates the deformation problem in 3D reconstruction. Notably, our model maintains the accuracy of depth estimation on the original NYUDv2.

LongNet: Scaling Transformers to 1,000,000,000 Tokens

Jul 19, 2023

Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, Furu Wei

Abstract: Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, which restricts the maximum sequence length. To address this issue, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has significant advantages: 1) it has linear computational complexity and a logarithmic dependency between any two tokens in a sequence; 2) it can serve as a distributed trainer for extremely long sequences; 3) its dilated attention is a drop-in replacement for standard attention, which can be seamlessly integrated with existing Transformer-based optimization. Experimental results demonstrate that LongNet yields strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.

* Work in progress
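
The linear-complexity claim can be made concrete with a rough sketch of one sparsification pattern. The abstract does not spell out the mechanism, so this assumes the sequence is split into fixed-length segments and every `dilation`-th position is kept in each. The names `segment_len` and `dilation` are illustrative. A single pattern like this does not show the exponentially growing attentive field, which would require mixing several segment lengths and dilation rates.

```python
def dilated_indices(seq_len, segment_len, dilation):
    """Split positions 0..seq_len-1 into segments and keep every
    `dilation`-th position inside each segment."""
    segments = []
    for start in range(0, seq_len, segment_len):
        end = min(start + segment_len, seq_len)
        segments.append(list(range(start, end, dilation)))
    return segments


def attention_pairs(seq_len, segment_len, dilation):
    """Count query-key pairs when attention is dense within each
    sparsified segment and absent across segments."""
    return sum(
        len(seg) ** 2
        for seg in dilated_indices(seq_len, segment_len, dilation)
    )


# With segment_len and dilation fixed, doubling the sequence length doubles
# the number of attended pairs, whereas dense attention would quadruple it.
pairs_8 = attention_pairs(8, segment_len=4, dilation=2)
pairs_16 = attention_pairs(16, segment_len=4, dilation=2)
```

For comparison, dense attention over 8 and 16 tokens would need 64 and 256 pairs respectively.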

MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection

Jul 18, 2023

Zewei Lin, Yanqing Shen, Sanping Zhou, Shitao Chen, Nanning Zheng

Abstract: In this paper, we propose a novel and effective Multi-Level Fusion network, named MLF-DET, for high-performance cross-modal 3D object DETection, which integrates both feature-level fusion and decision-level fusion to fully utilize the information in the image. For feature-level fusion, we present the Multi-scale Voxel Image fusion (MVI) module, which densely aligns multi-scale voxel features with image features. For decision-level fusion, we propose the lightweight Feature-cued Confidence Rectification (FCR) module, which further exploits image semantics to rectify the confidence of detection candidates. In addition, we design an effective data augmentation strategy termed Occlusion-aware GT Sampling (OGS) to retain more sampled objects in the training scenes and thereby reduce overfitting. Extensive experiments on the KITTI dataset demonstrate the effectiveness of our method. Notably, on the extremely competitive KITTI car 3D object detection benchmark, our method reaches 82.89% moderate AP and achieves state-of-the-art performance without bells and whistles.

When and Why Momentum Accelerates SGD: An Empirical Study

Jun 15, 2023

Jingwen Fu, Bohan Wang, Huishuai Zhang, Zhizheng Zhang, Wei Chen, Nanning Zheng

Abstract: Momentum has become a crucial component in deep learning optimizers, necessitating a comprehensive understanding of when and why it accelerates stochastic gradient descent (SGD). To address the question of "when", we establish a meaningful comparison framework that examines the performance of SGD with Momentum (SGDM) under the effective learning rate $\eta_{ef}$, a notion unifying the influence of the momentum coefficient $\mu$ and batch size $b$ over the learning rate $\eta$. Comparing SGDM and SGD with the same effective learning rate and the same batch size, we observe a consistent pattern: when $\eta_{ef}$ is small, SGDM and SGD reach almost the same empirical training losses; when $\eta_{ef}$ surpasses a certain threshold, SGDM begins to perform better than SGD. Furthermore, we observe that the advantage of SGDM over SGD becomes more pronounced with a larger batch size. For the question of "why", we find that momentum acceleration is closely related to abrupt sharpening, a sudden jump of the directional Hessian along the update direction. Specifically, the misalignment between SGD and SGDM happens at the same moment that SGD experiences abrupt sharpening and converges more slowly. Momentum improves the performance of SGDM by preventing or deferring the occurrence of abrupt sharpening. Together, this study unveils the interplay between momentum, learning rates, and batch sizes, thus improving our understanding of momentum acceleration.
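
The abstract does not state the formula for $\eta_{ef}$, which by its description also involves the batch size. As a minimal sketch, the following assumes the widely used heavy-ball convention in which the effective step under a constant gradient is $\eta / (1 - \mu)$, and it ignores batch size entirely. The function name is hypothetical.

```python
def sgdm_displacements(lr, momentum, grad, steps):
    """Per-step parameter displacement of heavy-ball SGD when the
    gradient is held constant (an idealized, noise-free setting)."""
    velocity = 0.0
    moves = []
    for _ in range(steps):
        # Heavy-ball update: accumulate the gradient into the velocity.
        velocity = momentum * velocity + grad
        moves.append(lr * velocity)
    return moves


lr, momentum, grad = 0.1, 0.9, 1.0
moves = sgdm_displacements(lr, momentum, grad, steps=500)

# Assumed convention: the steady-state step equals lr / (1 - momentum) * grad.
effective_lr = lr / (1.0 - momentum)
steady_state_gap = abs(moves[-1] - effective_lr * grad)
```

The first step equals plain SGD's step `lr * grad`, and the displacement then grows toward the steady-state value, which is why comparisons between SGD and SGDM are usually made at a matched effective learning rate.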

Milestones in Autonomous Driving and Intelligent Vehicles Part II: Perception and Planning

Jun 03, 2023

Long Chen, Siyu Teng, Bai Li, Xiaoxiang Na, Yuchen Li, Zixuan Li, Jinjun Wang, Dongpu Cao, Nanning Zheng, Fei-Yue Wang

Abstract: Growing interest in autonomous driving (AD) and intelligent vehicles (IVs) is fueled by their promise for enhanced safety, efficiency, and economic benefits. While previous surveys have captured progress in this field, a comprehensive and forward-looking summary is needed. Our work fills this gap through three distinct articles. The first part, a "Survey of Surveys" (SoS), outlines the history, surveys, ethics, and future directions of AD and IV technologies. The second part, "Milestones in Autonomous Driving and Intelligent Vehicles Part I: Control, Computing System Design, Communication, HD Map, Testing, and Human Behaviors", delves into the development of control, computing systems, communication, HD maps, testing, and human behaviors in IVs. This part, the third part, reviews perception and planning in the context of IVs. Aiming to provide a comprehensive overview of the latest advancements in AD and IVs, this work caters to both newcomers and seasoned researchers. By integrating the SoS and Part I, we offer unique insights and strive to serve as a bridge between past achievements and future possibilities in this dynamic field.

* 17 pages, 6 figures. IEEE Transactions on Systems, Man, and Cybernetics: Systems. arXiv admin note: text overlap with arXiv:2303.09824

Vector-based Representation is the Key: A Study on Disentanglement and Compositional Generalization

May 29, 2023

Tao Yang, Yuwang Wang, Cuiling Lan, Yan Lu, Nanning Zheng

Abstract: Recognizing elementary underlying concepts from observations (disentanglement) and generating novel combinations of these concepts (compositional generalization) are fundamental abilities that let humans learn rapidly and generalize to new tasks, and deep learning models struggle with both. Towards human-like intelligence, various works on disentangled representation learning have been proposed, and recently some studies on compositional generalization have been presented. However, few works study the relationship between disentanglement and compositional generalization, and the observed results are inconsistent. In this paper, we study several typical disentangled representation learning works in terms of both disentanglement and compositional generalization abilities, and we provide an important insight: vector-based representation (using a vector instead of a scalar to represent a concept) is the key to achieving both good disentanglement and strong compositional generalization. This insight also resonates with neuroscience research showing that the brain encodes information in neuron population activity rather than in individual neurons. Motivated by this observation, we further propose a method to reform the scalar-based disentanglement works ($\beta$-TCVAE and FactorVAE) to be vector-based to increase both capabilities. We investigate the impact of the dimensions of the vector-based representation and one important question: whether better disentanglement indicates higher compositional generalization. In summary, our study demonstrates that it is possible to achieve both good concept recognition and novel concept composition, contributing an important step towards human-like intelligence.

* Preprint
