Qihao Jin

Unveiling the Surprising Efficacy of Navigation Understanding in End-to-End Autonomous Driving

Apr 14, 2026

Zhihua Hua, Junli Wang, Pengfei Li, Qihao Jin, Bo Zhang, Kehua Sheng, Yilun Chen, Zhongxue Gan, Wenchao Ding

Abstract: Global navigation information and local scene understanding are two crucial components of autonomous driving systems. However, our experimental results indicate that many end-to-end autonomous driving systems tend to over-rely on local scene understanding while failing to utilize global navigation information. These systems exhibit weak correlation between their planning capabilities and navigation input, and struggle to perform navigation-following in complex scenarios. To overcome this limitation, we propose the Sequential Navigation Guidance (SNG) framework, an efficient representation of global navigation information based on real-world navigation patterns. The SNG encompasses both navigation paths for constraining long-term trajectories and turn-by-turn (TBT) information for real-time decision-making logic. We constructed the SNG-QA dataset, a visual question answering (VQA) dataset based on SNG that aligns global and local planning. Additionally, we introduce an efficient model SNG-VLA that fuses local planning with global planning. The SNG-VLA achieves state-of-the-art performance through precise navigation information modeling without requiring auxiliary loss functions from perception tasks. Project page: SNG-VLA

* 8 pages, 6 figures. ICRA 2026. Code available at https://fudan-magic-lab.github.io/SNG-VLA-web

LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

May 21, 2025

Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, Jieru Zhao

Abstract: Recent developments in Video Large Language Models (Video LLMs) have enabled models to process long video sequences and demonstrate remarkable performance. Nonetheless, studies predominantly focus on offline video question answering, neglecting the memory usage and response speed that are essential in various real-world applications, such as Deepseek services, autonomous driving, and robotics. To address these challenges, we propose LiveVLM, a training-free framework specifically designed for streaming, online video understanding and real-time interaction. Unlike existing works that process videos only after a question is posed, LiveVLM constructs a streaming-oriented KV cache to process video streams in real time, retain long-term video details, and eliminate redundant KVs, ensuring prompt responses to user queries. For continuous video streams, LiveVLM generates and compresses video key-value tensors (video KVs) to preserve visual information while improving memory efficiency. Furthermore, when a new question is posed, LiveVLM incorporates an online question-answering process that efficiently fetches both short-term and long-term visual information while minimizing interference from redundant context. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to process 44$\times$ as many frames on the same device and achieves up to a 5$\times$ speedup in response speed compared with SoTA online methods at an input of 256 frames, while maintaining the same or better model performance.
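
The general idea of a streaming cache that keeps recent frames in full, drops redundant older entries, and retrieves long-term entries per question can be illustrated with a toy sketch. This is not LiveVLM's actual algorithm: the class `StreamingKVCache`, the parameters `recent_window` and `merge_threshold`, the cosine-similarity criterion, and the use of plain Python lists in place of key-value tensors are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class StreamingKVCache:
    """Toy cache; all names and thresholds are hypothetical, not from the paper."""
    recent_window: int = 4          # assumed: number of frames kept uncompressed
    merge_threshold: float = 0.95   # assumed: similarity above which an entry is redundant
    short_term: list = field(default_factory=list)  # entries: (frame_id, key, value)
    long_term: list = field(default_factory=list)

    def append(self, frame_id, key, value):
        """Add a new frame; evict the oldest recent frames into long-term storage."""
        self.short_term.append((frame_id, key, value))
        while len(self.short_term) > self.recent_window:
            self._compress(self.short_term.pop(0))

    def _compress(self, entry):
        """Keep an evicted entry only if it differs from the last long-term entry."""
        if self.long_term:
            last_key = self.long_term[-1][1]
            if cosine(last_key, entry[1]) >= self.merge_threshold:
                return  # redundant: near-duplicate of what is already stored
        self.long_term.append(entry)

    def retrieve(self, query, top_k=2):
        """Return the top_k most query-similar long-term entries plus all recent ones."""
        ranked = sorted(self.long_term, key=lambda e: cosine(query, e[1]), reverse=True)
        return ranked[:top_k] + list(self.short_term)
```

In this sketch, memory grows with the number of distinct scenes rather than the number of frames, which is one simple way to trade detail for capacity; a real system would operate on per-layer attention tensors rather than single vectors.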

Dual-AEB: Synergizing Rule-Based and Multimodal Large Language Models for Effective Emergency Braking

Oct 11, 2024

Wei Zhang, Pengfei Li, Junli Wang, Bingchuan Sun, Qihao Jin, Guangjun Bao, Shibo Rui, Yang Yu, Wenchao Ding, Peng Li(+1 more)

Abstract: Automatic Emergency Braking (AEB) systems are a crucial component in ensuring the safety of passengers in autonomous vehicles. Conventional AEB systems primarily rely on closed-set perception modules to recognize traffic conditions and assess collision risks. To enhance the adaptability of AEB systems in open scenarios, we propose Dual-AEB, a system that combines an advanced multimodal large language model (MLLM) for comprehensive scene understanding with a conventional rule-based rapid AEB to ensure quick response times. To the best of our knowledge, Dual-AEB is the first method to incorporate MLLMs within AEB systems. Through extensive experimentation, we have validated the effectiveness of our method. The source code will be available at https://github.com/ChipsICU/Dual-AEB.
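
One way to picture pairing a fast rule-based check with a slower scene-understanding signal is the minimal sketch below. It is not the Dual-AEB implementation: the time-to-collision rule, the 1.5 s threshold, the function names, the `mllm_flags_hazard` parameter, and the simple logical-OR combination are all assumptions chosen for illustration, and the MLLM is represented only by a boolean that some external model would supply.

```python
from typing import Optional


def time_to_collision(distance_m: float, closing_speed_mps: float) -> float:
    """Time to collision in seconds; infinite if the gap is not closing."""
    if closing_speed_mps <= 0:
        return float("inf")
    return distance_m / closing_speed_mps


def should_brake(
    distance_m: float,
    closing_speed_mps: float,
    mllm_flags_hazard: Optional[bool] = None,  # hypothetical: verdict from a scene model
    ttc_threshold_s: float = 1.5,              # assumed threshold, not from the paper
) -> bool:
    """Brake if the fast rule fires or the (optional) scene model reports a hazard."""
    rule_trigger = time_to_collision(distance_m, closing_speed_mps) < ttc_threshold_s
    # None means the slower model has not answered yet; only the rule applies then.
    return rule_trigger or bool(mllm_flags_hazard)
```

The rule path needs no model output, so it can respond immediately, while the optional flag lets a slower model add braking decisions in situations the rule does not cover.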
