Linfeng He

MambaVSR: Content-Aware Scanning State Space Model for Video Super-Resolution

Jun 13, 2025

Linfeng He, Meiqin Liu, Qi Tang, Chao Yao, Yao Zhao

Abstract: Video super-resolution (VSR) faces critical challenges in effectively modeling non-local dependencies across misaligned frames while preserving computational efficiency. Existing VSR methods typically rely on optical flow strategies or transformer architectures, which struggle with large motion displacements and long video sequences. To address this, we propose MambaVSR, the first state-space model framework for VSR that incorporates an innovative content-aware scanning mechanism. Unlike rigid 1D sequential processing in conventional vision Mamba methods, our MambaVSR enables dynamic spatiotemporal interactions through the Shared Compass Construction (SCC) and the Content-Aware Sequentialization (CAS). Specifically, the SCC module constructs intra-frame semantic connectivity graphs via efficient sparse attention and generates adaptive spatial scanning sequences through spectral clustering. Building upon SCC, the CAS module effectively aligns and aggregates non-local similar content across multiple frames by interleaving temporal features along the learned spatial order. To bridge global dependencies with local details, the Global-Local State Space Block (GLSSB) synergistically integrates window self-attention operations with SSM-based feature propagation, enabling high-frequency detail recovery under global dependency guidance. Extensive experiments validate MambaVSR's superiority, outperforming the Transformer-based method by 0.58 dB PSNR on the REDS dataset with 55% fewer parameters.
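
The interleaving step described for CAS can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the function names `interleave_frames` and `deinterleave`, the string tokens, and the fixed `order` list are all assumptions made for the example. In the paper the spatial order is learned by the SCC module; here it is simply given as a permutation.

```python
from typing import List, Sequence, TypeVar

Token = TypeVar("Token")


def interleave_frames(frames: Sequence[Sequence[Token]], order: Sequence[int]) -> List[Token]:
    """Flatten T frames of N tokens each into one 1D sequence.

    Spatial positions are visited in the given order. At each position the
    tokens of all frames are emitted back to back, so the same location in
    neighbouring frames ends up adjacent in the sequence.
    """
    num_tokens = len(frames[0])
    if any(len(frame) != num_tokens for frame in frames):
        raise ValueError("all frames must have the same number of tokens")
    if sorted(order) != list(range(num_tokens)):
        raise ValueError("order must be a permutation of the token indices")
    return [frame[pos] for pos in order for frame in frames]


def deinterleave(sequence: Sequence[Token], order: Sequence[int], num_frames: int) -> List[List[Token]]:
    """Invert interleave_frames: scatter the 1D sequence back to per-frame layout."""
    frames: List[List[Token]] = [[None] * len(order) for _ in range(num_frames)]
    for i, pos in enumerate(order):
        for t in range(num_frames):
            frames[t][pos] = sequence[i * num_frames + t]
    return frames


# Three frames with four tokens each; tokens are labelled "f<frame>p<position>".
frames = [[f"f{t}p{p}" for p in range(4)] for t in range(3)]
order = [2, 0, 3, 1]  # hypothetical content-aware spatial order (assumed, not learned)
sequence = interleave_frames(frames, order)
restored = deinterleave(sequence, order, num_frames=3)
```

A sequence built this way could then be fed to any 1D sequence model, and `deinterleave` restores the per-frame layout afterwards. How MambaVSR actually batches, pads, or reverses the scan is not stated in the abstract.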

Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent

Nov 08, 2024

Linfeng He, Yiming Sun, Sihao Wu, Jiaxu Liu, Xiaowei Huang

Abstract: In this paper, we propose a novel framework for enhancing visual comprehension in autonomous driving systems by integrating visual language models (VLMs) with an additional visual perception module specialised in object detection. We extend the Llama-Adapter architecture by incorporating a YOLOS-based detection network alongside the CLIP perception network, addressing limitations in object detection and localisation. Our approach introduces camera ID-separators to improve multi-view processing, which is crucial for comprehensive environmental awareness. Experiments on the DriveLM visual question answering challenge demonstrate significant improvements over baseline models, with enhanced performance in ChatGPT scores, BLEU scores, and CIDEr metrics, indicating that the model's answers are closer to the ground truth. Our method represents a promising step towards more capable and interpretable autonomous driving systems. Possible safety enhancements enabled by the detection modality are also discussed.

* Accepted by the SafeGenAI workshop of NeurIPS 2024
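
The idea of camera ID-separators mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the function `build_multiview_sequence`, the camera names, and the use of plain strings for tokens are all hypothetical. In a real model the separators would more likely be learned embeddings than literal strings.

```python
from typing import Dict, List


def build_multiview_sequence(view_tokens: Dict[str, List[str]], camera_order: List[str]) -> List[str]:
    """Concatenate per-camera tokens, prefixing each view with a separator.

    The separator marks which camera the following tokens come from, so a
    downstream model can tell the views apart in one flat sequence.
    """
    sequence: List[str] = []
    for camera in camera_order:
        sequence.append(f"<{camera}>")  # hypothetical camera ID-separator token
        sequence.extend(view_tokens[camera])
    return sequence


# Hypothetical camera names and placeholder feature tokens.
views = {
    "CAM_FRONT": ["front_0", "front_1"],
    "CAM_BACK": ["back_0", "back_1"],
}
sequence = build_multiview_sequence(views, ["CAM_FRONT", "CAM_BACK"])
```

Fixing the camera order explicitly, rather than relying on dictionary order, keeps the layout of the sequence stable across samples.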
