Guan Huang

DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving

Sep 18, 2023
Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiwen Lu

World models, especially in autonomous driving, are trending and drawing extensive attention due to their capacity for comprehending driving environments. The established world model holds immense potential for the generation of high-quality driving videos and driving policies for safe maneuvering. However, a critical limitation in relevant research lies in its predominant focus on gaming environments or simulated settings, thereby lacking the representation of real-world driving scenarios. Therefore, we introduce DriveDreamer, a pioneering world model entirely derived from real-world driving scenarios. Because modeling the world in intricate driving scenes entails an overwhelming search space, we propose harnessing the powerful diffusion model to construct a comprehensive representation of the complex environment. Furthermore, we introduce a two-stage training pipeline. In the initial phase, DriveDreamer acquires a deep understanding of structured traffic constraints, while the subsequent stage equips it with the ability to anticipate future states. The proposed DriveDreamer is the first world model established from real-world driving scenarios. We instantiate DriveDreamer on the challenging nuScenes benchmark, and extensive experiments verify that DriveDreamer empowers precise, controllable video generation that faithfully captures the structural constraints of real-world traffic scenarios. Additionally, DriveDreamer enables the generation of realistic and reasonable driving policies, opening avenues for interaction and practical applications.

* Project Page: https://drivedreamer.github.io 
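
To make the two-stage idea concrete, below is a minimal, hypothetical sketch of a stage-one training step: a denoising diffusion objective conditioned on structured traffic layout. The eps_model interface, the linear noise schedule, and the layout conditioning tensor are assumptions for illustration, not DriveDreamer's actual architecture or schedule.

# Minimal sketch of a layout-conditioned diffusion training step (PyTorch).
# Stage two would additionally condition on past frames/actions to anticipate
# future states; that part is omitted here.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(eps_model, frames, layout_cond):
    """One DDPM-style step: predict the noise added to driving frames,
    conditioned on a structured traffic layout (e.g. boxes, HD map)."""
    b = frames.shape[0]
    t = torch.randint(0, T, (b,), device=frames.device)
    noise = torch.randn_like(frames)
    a = alphas_cumprod.to(frames.device)[t].view(b, 1, 1, 1)
    noisy = a.sqrt() * frames + (1.0 - a).sqrt() * noise
    return F.mse_loss(eps_model(noisy, t, layout_cond), noise)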

Deep trip generation with graph neural networks for bike sharing system expansion

Mar 20, 2023
Yuebing Liang, Fangyi Ding, Guan Huang, Zhan Zhao

Bike sharing is emerging globally as an active, convenient, and sustainable mode of transportation. To plan successful bike-sharing systems (BSSs), many cities start from a small-scale pilot and gradually expand the system to cover more areas. For station-based BSSs, this means planning new stations based on existing ones over time, which requires prediction of the number of trips generated by these new stations across the whole system. Previous studies typically rely on relatively simple regression or machine learning models, which are limited in capturing complex spatial relationships. Despite the growing literature on deep learning methods for travel demand prediction, these methods are mostly developed for short-term prediction based on time series data, assuming no structural changes to the system. In this study, we focus on the trip generation problem for BSS expansion, and propose a graph neural network (GNN) approach to predicting the station-level demand based on multi-source urban built environment data. Specifically, it constructs multiple localized graphs centered on each target station and uses attention mechanisms to learn the correlation weights between stations. We further illustrate that the proposed approach can be regarded as a generalized spatial regression model, indicating the commonalities between spatial regression and GNNs. The model is evaluated based on realistic experiments using multi-year BSS data from New York City, and the results validate the superior performance of our approach compared to existing methods. We also demonstrate the interpretability of the model for uncovering the effects of built environment features and spatial interactions between stations, which can provide strategic guidance for BSS station location selection and capacity planning.
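
As a rough illustration of the localized-graph attention described above, the following sketch scores nearby existing stations against a new target station and aggregates their features into a trip-generation estimate. The layer sizes, feature inputs, and single-head attention are illustrative assumptions, not the paper's exact model.

# Hypothetical sketch of attention over a localized graph centered on a new
# station; built-environment features stand in for the multi-source inputs.
import torch
import torch.nn as nn

class LocalGraphAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim=64):
        super().__init__()
        self.query = nn.Linear(feat_dim, hidden_dim)
        self.key = nn.Linear(feat_dim, hidden_dim)
        self.value = nn.Linear(feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, 1)  # predicted trips generated

    def forward(self, target_feat, neighbor_feats):
        # target_feat: (feat_dim,) features of the planned (new) station
        # neighbor_feats: (k, feat_dim) features of nearby existing stations
        q = self.query(target_feat)                       # (hidden,)
        k = self.key(neighbor_feats)                      # (k, hidden)
        w = torch.softmax(k @ q / q.shape[0] ** 0.5, 0)   # correlation weights
        ctx = w @ self.value(neighbor_feats)              # weighted aggregation
        return self.out(ctx)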

Detachable Novel Views Synthesis of Dynamic Scenes Using Distribution-Driven Neural Radiance Fields

Jan 01, 2023
Boyu Zhang, Wenbo Xu, Zheng Zhu, Guan Huang

Representing and synthesizing novel views of real-world dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically approach dynamic scenes by applying geometry techniques or utilizing temporal information between several adjacent frames, without considering the underlying background distribution of the entire scene or the transmittance along the ray dimension, which limits their performance on static and occluded areas. Our approach, $\textbf{D}$istribution-$\textbf{D}$riven neural radiance fields ($\text{D}^4$NeRF), offers high-quality view synthesis and a 3D solution to $\textbf{D}$etach the background from the entire $\textbf{D}$ynamic scene. Specifically, it employs a neural representation to capture the scene distribution of the static background and a 6D-input NeRF to represent the dynamic objects. Each ray sample is given an additional occlusion weight to indicate the transmittance lying in the static and dynamic components. We evaluate $\text{D}^4$NeRF on public dynamic scenes and on our urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Our code will be released at https://github.com/Luciferbobo/D4NeRF.
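
A toy sketch of the ray compositing idea described above: a static-background field and a dynamic field are blended per sample with an occlusion weight before standard volume rendering. The exact blending and rendering formulation in $\text{D}^4$NeRF may differ; this is only a schematic.

# Schematic per-ray compositing of static and dynamic radiance fields.
import torch

def composite_ray(sigma_s, rgb_s, sigma_d, rgb_d, occ_w, deltas):
    # sigma_*: (n,) densities, rgb_*: (n, 3) colors for static / dynamic fields
    # occ_w: (n,) weight in [0, 1] favoring the dynamic component per sample
    sigma = (1 - occ_w) * sigma_s + occ_w * sigma_d
    rgb = (1 - occ_w)[:, None] * rgb_s + occ_w[:, None] * rgb_d
    alpha = 1 - torch.exp(-sigma * deltas)                       # per-sample opacity
    ones = torch.ones(1, device=sigma_s.device)
    trans = torch.cumprod(torch.cat([ones, 1 - alpha + 1e-10])[:-1], 0)
    weights = alpha * trans                                      # volume-rendering weights
    return (weights[:, None] * rgb).sum(0)                       # composited pixel color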

Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark

Dec 17, 2022
Xiaofeng Wang, Zheng Zhu, Yunpeng Zhang, Guan Huang, Yun Ye, Wenbo Xu, Ziwei Chen, Xingang Wang

In recent years, vision-centric perception has flourished in various autonomous driving tasks, including 3D detection, semantic map construction, motion forecasting, and depth estimation. Nevertheless, the latency of vision-centric approaches is too high for practical deployment (e.g., most camera-based 3D detectors have a runtime greater than 300 ms). To bridge the gap between ideal research conditions and real-world applications, it is necessary to quantify the trade-off between performance and efficiency. Traditionally, autonomous-driving perception benchmarks perform offline evaluation, neglecting the inference-time delay. To mitigate this problem, we propose the Autonomous-driving StreAming Perception (ASAP) benchmark, the first benchmark to evaluate the online performance of vision-centric perception in autonomous driving. On the basis of the 2Hz annotated nuScenes dataset, we first propose an annotation-extending pipeline to generate high-frame-rate labels for the 12Hz raw images. Reflecting practical deployment, we further construct the Streaming Perception Under constRained-computation (SPUR) evaluation protocol, where the 12Hz inputs are utilized for streaming evaluation under the constraints of different computational resources. In the ASAP benchmark, comprehensive experimental results reveal that model rankings change under different constraints, suggesting that model latency and computation budget should be considered as design choices when optimizing for practical deployment. To facilitate further research, we establish baselines for camera-based streaming 3D detection, which consistently enhance the streaming performance across various hardware. ASAP project page: https://github.com/JeffWang987/ASAP.

* code: https://github.com/JeffWang987/ASAP 
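
The streaming-evaluation principle can be illustrated with a small helper that assigns, to each 12Hz query timestamp, the most recent prediction whose inference has already finished. This is a simplified sketch of the idea, not the official ASAP/SPUR protocol implementation.

# Match each evaluation query with the newest prediction finished before it.
def assign_streaming_predictions(query_times, pred_start_times, latency):
    """query_times: sorted evaluation timestamps (s).
    pred_start_times: sorted timestamps at which frames were fed to the model.
    latency: per-frame inference time (s).
    Returns the index of the prediction used at each query, or None if nothing
    has finished yet."""
    finish_times = [t + latency for t in pred_start_times]
    assigned, j = [], -1
    for q in query_times:
        while j + 1 < len(finish_times) and finish_times[j + 1] <= q:
            j += 1                      # advance to the newest finished prediction
        assigned.append(j if j >= 0 else None)
    return assigned

# e.g. a 300 ms detector leaves early 12Hz queries without any prediction:
# assign_streaming_predictions([0.0, 0.083, 0.167], [0.0, 0.083], 0.3) -> [None, None, None]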

BEVPoolv2: A Cutting-edge Implementation of BEVDet Toward Deployment

Nov 30, 2022
Junjie Huang, Guan Huang

We release a new codebase version of BEVDet, dubbed branch dev2.0. With dev2.0, we propose BEVPoolv2, which upgrades the view transformation process from the perspective of engineering optimization, freeing it from a huge burden in both calculation and storage. It achieves this by omitting the calculation and preprocessing of the large frustum feature. As a result, it can be processed within 0.82 ms even with a large input resolution of 640x1600, which is 15.1 times faster than the previous fastest implementation. Besides, it consumes less cache than the previous implementation, naturally, as it no longer needs to store the large frustum feature. Last but not least, this also makes deployment to other backends handy. We offer an example of deployment to the TensorRT backend in branch dev2.0 and show how fast the BEVDet paradigm can be processed on it. Other than BEVPoolv2, we also select and integrate some substantial progress proposed in the past year. As an example configuration, BEVDet4D-R50-Depth-CBGS scores 52.3 NDS on the nuScenes validation set and can be processed at a speed of 16.4 FPS with the PyTorch backend. The code has been released at https://github.com/HuangJunJie2017/BEVDet/tree/dev2.0 to facilitate further study.

* Technical report 
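
The core trick, skipping the explicit frustum feature, can be sketched in plain PyTorch as a scatter-add of depth-weighted image features into BEV cells via precomputed index tensors. The real BEVPoolv2 is a CUDA kernel in the linked repository; the index names and shapes below are illustrative assumptions.

# Rough sketch: accumulate depth-weighted image features directly into BEV
# cells, never materializing the full frustum feature tensor.
import torch

def bev_pool_v2_sketch(depth, feat, ranks_depth, ranks_feat, ranks_bev, n_bev, c):
    # depth: (N_depth,) flattened depth scores
    # feat:  (N_feat, c) flattened image features
    # ranks_*: (M,) precomputed flat indices linking depth points, feature
    #          points, and their target BEV cells (derived once from geometry)
    contrib = depth[ranks_depth, None] * feat[ranks_feat]    # (M, c)
    bev = torch.zeros(n_bev, c, dtype=feat.dtype, device=feat.device)
    bev.index_add_(0, ranks_bev, contrib)                    # scatter-add into the BEV grid
    return bev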

Cross-Mode Knowledge Adaptation for Bike Sharing Demand Prediction using Domain-Adversarial Graph Neural Networks

Nov 16, 2022
Yuebing Liang, Guan Huang, Zhan Zhao

For bike sharing systems, demand prediction is crucial to ensure the timely re-balancing of available bikes according to predicted demand. Existing methods for bike sharing demand prediction are mostly based on a system's own historical demand variation, essentially regarding it as a closed system and neglecting the interaction between different transportation modes. This is particularly important for bike sharing because it is often used to complement travel through other modes (e.g., public transit). Despite some recent progress, no existing method is capable of leveraging spatiotemporal information from multiple modes while explicitly considering the distribution discrepancy between them, which can easily lead to negative transfer. To address these challenges, this study proposes a domain-adversarial multi-relational graph neural network (DA-MRGNN) for bike sharing demand prediction with multimodal historical data as input. A temporal adversarial adaptation network is introduced to extract shareable features from demand patterns of different modes. To capture correlations between spatial units across modes, we adapt a multi-relational graph neural network (MRGNN) considering both cross-mode similarity and difference. In addition, an explainable GNN technique is developed to understand how our proposed model makes predictions. Extensive experiments are conducted using real-world bike sharing, subway and ride-hailing data from New York City. The results demonstrate the superior performance of our proposed approach compared to existing methods and the effectiveness of different model components.

* arXiv admin note: substantial text overlap with arXiv:2203.10961 
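
Domain-adversarial training of this kind is commonly realized with a gradient reversal layer between the shared feature extractor and the mode discriminator. The sketch below shows that operator in PyTorch; whether DA-MRGNN uses exactly this formulation is an assumption here.

# Gradient reversal layer, the standard building block for adversarial
# feature alignment across domains (here: transportation modes).
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradients push the encoder to fool the mode discriminator,
        # encouraging mode-shareable demand representations.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# usage sketch: shared = grad_reverse(encoder_output); loss_adv = discriminator(shared)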

Hybrid CNN-Interpreter: Interpret local and global contexts for CNN-based Models

Oct 31, 2022
Wenli Yang, Guan Huang, Renjie Li, Jiahao Yu, Yanyu Chen, Quan Bai, Byeong Kang

Convolutional neural network (CNN) models have achieved advanced performance in various domains, but their lack of interpretability is a major barrier to the assurance and regulation needed for the acceptance and deployment of AI-assisted applications. Many works on input interpretability focus on analyzing input-output relations, but the internal logic of models is not clarified by current mainstream interpretability methods. In this study, we propose a novel hybrid CNN-interpreter through: (1) an original forward propagation mechanism to examine the layer-specific prediction results for local interpretability; (2) a new global interpretability that indicates the feature correlation and filter importance effects. By combining the local and global interpretabilities, the hybrid CNN-interpreter enables a solid understanding and monitoring of the model context during the whole learning process, with detailed and consistent representations. Finally, the proposed interpretability methods are demonstrated to adapt to various CNN-based model structures.
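
One plausible way to obtain layer-specific contexts like those described above is to cache intermediate outputs with forward hooks, as sketched below on an off-the-shelf torchvision ResNet-18; the paper's actual interpretability mechanism is more involved, so treat this only as an illustration of per-layer probing.

# Cache each top-level layer's output with forward hooks for inspection.
import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()
activations = {}

def hook(name):
    def fn(module, inputs, output):
        activations[name] = output.detach()   # layer-specific context
    return fn

for name, module in model.named_children():
    module.register_forward_hook(hook(name))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

for name, act in activations.items():
    print(name, tuple(act.shape))             # per-layer outputs to interpret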

A Simple Baseline for Multi-Camera 3D Object Detection

Aug 22, 2022
Yunpeng Zhang, Wenzhao Zheng, Zheng Zhu, Guan Huang, Jie Zhou, Jiwen Lu

3D object detection with surrounding cameras has been a promising direction for autonomous driving. In this paper, we present SimMOD, a Simple baseline for Multi-camera Object Detection, to address this problem. To incorporate multi-view information and build upon previous efforts on monocular 3D object detection, the framework is built on sample-wise object proposals and designed to work in a two-stage manner. First, we extract multi-scale features and generate perspective object proposals on each monocular image. Second, the multi-view proposals are aggregated and then iteratively refined with multi-view and multi-scale visual features in a DETR3D style. The refined proposals are decoded end-to-end into the detection results. To further boost performance, we incorporate auxiliary branches alongside proposal generation to enhance feature learning, and we design target filtering and teacher forcing to promote the consistency of two-stage training. We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD and achieve new state-of-the-art performance. Code will be available at https://github.com/zhangyp15/SimMOD.
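
The two-stage flow can be summarized schematically: per-view proposal generation followed by iterative multi-view refinement of the aggregated queries. The modules below are placeholders for illustration, not SimMOD's actual heads or refinement layers.

# Schematic two-stage detector: monocular proposals, then iterative refinement.
import torch
import torch.nn as nn

class TwoStageDetector(nn.Module):
    def __init__(self, proposal_head, refine_layers):
        super().__init__()
        self.proposal_head = proposal_head          # per-camera proposal generator
        self.refine_layers = nn.ModuleList(refine_layers)

    def forward(self, per_view_feats):
        # Stage 1: perspective proposals from each camera independently.
        proposals = [self.proposal_head(f) for f in per_view_feats]
        queries = torch.cat(proposals, dim=1)       # aggregate across views
        # Stage 2: iterative refinement against multi-view, multi-scale features.
        for layer in self.refine_layers:
            queries = queries + layer(queries)      # residual query update
        return queries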

Crafting Monocular Cues and Velocity Guidance for Self-Supervised Multi-Frame Depth Learning

Aug 19, 2022
Xiaofeng Wang, Zheng Zhu, Guan Huang, Xu Chi, Yun Ye, Ziwei Chen, Xingang Wang

Self-supervised monocular methods can efficiently learn depth information of weakly textured surfaces or reflective objects. However, their depth accuracy is limited due to the inherent ambiguity in monocular geometric modeling. In contrast, multi-frame depth estimation methods improve depth accuracy thanks to the success of Multi-View Stereo (MVS), which directly makes use of geometric constraints. Unfortunately, MVS often suffers from texture-less regions, non-Lambertian surfaces, and moving objects, especially in real-world video sequences without known camera motion and depth supervision. Therefore, we propose MOVEDepth, which exploits MOnocular cues and VElocity guidance to improve multi-frame Depth learning. Unlike existing methods that enforce consistency between MVS depth and monocular depth, MOVEDepth boosts multi-frame depth learning by directly addressing the inherent problems of MVS. The key to our approach is to utilize monocular depth as a geometric prior to construct the MVS cost volume, and to adjust the depth candidates of the cost volume under the guidance of the predicted camera velocity. We further fuse monocular depth and MVS depth by learning the uncertainty in the cost volume, which results in depth estimation that is robust to ambiguity in multi-view geometry. Extensive experiments show that MOVEDepth achieves state-of-the-art performance: compared with Monodepth2 and PackNet, our method relatively improves the depth accuracy by 20\% and 19.8\% on the KITTI benchmark. MOVEDepth also generalizes to the more challenging DDAD benchmark, relatively outperforming ManyDepth by 7.2\%. The code is available at https://github.com/JeffWang987/MOVEDepth.

* code: https://github.com/JeffWang987/MOVEDepth 
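
The velocity-guided depth-candidate construction can be sketched as sampling hypotheses around the monocular prior, with a range that widens as the predicted camera speed grows. The parameterization below (n_bins, base_ratio, linear scaling) is a guess for illustration, not MOVEDepth's exact formulation.

# Build per-pixel depth candidates around the monocular prior; the sampling
# range is scaled by the predicted camera velocity.
import torch

def depth_candidates(mono_depth, velocity, n_bins=16, base_ratio=0.1):
    # mono_depth: (B, 1, H, W) monocular depth prior
    # velocity:   (B,) predicted camera speed; faster motion -> wider range
    ratio = base_ratio * (1.0 + velocity.view(-1, 1, 1, 1))    # per-sample range scale
    offsets = torch.linspace(-1.0, 1.0, n_bins, device=mono_depth.device)
    offsets = offsets.view(1, n_bins, 1, 1)
    return mono_depth * (1.0 + ratio * offsets)                # (B, n_bins, H, W)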

MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer

Aug 06, 2022
Chaoqiang Zhao, Youmin Zhang, Matteo Poggi, Fabio Tosi, Xianda Guo, Zheng Zhu, Guan Huang, Yang Tang, Stefano Mattoccia

Self-supervised monocular depth estimation is an attractive solution that does not require hard-to-source depth labels for training. Convolutional neural networks (CNNs) have recently achieved great success in this task. However, their limited receptive field constrains existing network architectures to reason only locally, dampening the effectiveness of the self-supervised paradigm. In light of the recent successes achieved by Vision Transformers (ViTs), we propose MonoViT, a brand-new framework combining the global reasoning enabled by ViT models with the flexibility of self-supervised monocular depth estimation. By combining plain convolutions with Transformer blocks, our model can reason locally and globally, yielding depth predictions with a higher level of detail and accuracy and allowing MonoViT to achieve state-of-the-art performance on the established KITTI dataset. Moreover, MonoViT demonstrates superior generalization capacity on other datasets such as Make3D and DrivingStereo.

* Accepted by 3DV 2022 
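
A toy block pairing a depthwise convolution (local reasoning) with multi-head self-attention over flattened tokens (global reasoning) illustrates the combination described above; it is not the MonoViT encoder itself, and the dimensions and head count below are illustrative.

# Hybrid local-global block: depthwise convolution plus self-attention.
import torch
import torch.nn as nn

class ConvTransformerBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local cues
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                 # x: (B, C, H, W)
        x = x + self.local(x)             # convolutional, local reasoning
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))            # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)             # global reasoning
        return x + attn_out.transpose(1, 2).reshape(b, c, h, w)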