Songyan Zhang

Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

Apr 24, 2026

Wenhui Huang, Songyan Zhang, Collister Chua, Yang Liang, Zhiqi Mao, Heng Yang, Chen Lv

Abstract:Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.

AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving

Mar 16, 2026

Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao, Collister Chua, Zhan Chen, Long Chen, Chen Lv

Abstract:Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformers (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. Demonstration videos and qualitative results are available on the project page: https://automot-website.github.io/

POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction

Apr 08, 2025

Songyan Zhang, Yongtao Ge, Jinyuan Tian, Guangkai Xu, Hao Chen, Chen Lv, Chunhua Shen

Abstract:3D reconstruction in dynamic scenes primarily relies on the combination of geometry estimation and matching modules where the latter task is pivotal for distinguishing dynamic regions which can help to mitigate the interference introduced by camera and object motion. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion understanding in complex scenarios. Recently, the proposed representation of pointmap in DUSt3R suggests a potential solution to unify both geometry estimation and matching in 3D space, but it still struggles with ambiguous matching in dynamic regions, which may hamper further improvement. In this work, we present POMATO, a unified framework for dynamic 3D reconstruction by marrying pointmap matching with temporal motion. Specifically, our method first learns an explicit matching relationship by mapping RGB pixels from both dynamic and static regions across different views to 3D pointmaps within a unified coordinate system. Furthermore, we introduce a temporal motion module for dynamic motions that ensures scale consistency across different frames and enhances performance in tasks requiring both precise geometry and reliable matching, most notably 3D point tracking. We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm by demonstrating the remarkable performance across multiple downstream tasks, including video depth estimation, 3D point tracking, and pose estimation. Code and models are publicly available at https://github.com/wyddmw/POMATO.

* code: https://github.com/wyddmw/POMATO

WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model

Dec 13, 2024

Songyan Zhang, Wenhui Huang, Zihui Gao, Hao Chen, Chen Lv

Abstract:The emergence of general human knowledge and impressive logical reasoning capacity in rapidly progressing vision-language models (VLMs) has driven increasing interest in applying VLMs to high-level autonomous driving tasks, such as scene understanding and decision-making. However, the relationship between knowledge proficiency, especially essential driving expertise, and closed-loop autonomous driving performance requires further exploration. In this paper, we investigate the effects of the depth and breadth of fundamental driving knowledge on closed-loop trajectory planning and introduce WiseAD, a specialized VLM tailored for end-to-end autonomous driving capable of driving reasoning, action justification, object recognition, risk analysis, driving suggestions, and trajectory planning across diverse scenarios. We employ joint training on driving knowledge and planning datasets, enabling the model to perform knowledge-aligned trajectory planning. Extensive experiments indicate that as the diversity of driving knowledge extends, critical accidents are notably reduced, contributing 11.9% and 12.4% improvements in the driving score and route completion in CARLA closed-loop evaluations, achieving state-of-the-art performance. Moreover, WiseAD also demonstrates remarkable performance in knowledge evaluations on both in-domain and out-of-domain datasets.

Digging Into Normal Incorporated Stereo Matching

Feb 28, 2024

Zihua Liu, Songyan Zhang, Zhicheng Wang, Masatoshi Okutomi

Abstract:Despite the remarkable progress facilitated by learning-based stereo-matching algorithms, disparity estimation in low-texture, occluded, and bordered regions remains a bottleneck that limits performance. To tackle these challenges, geometric guidance like plane information is necessary as it provides intuitive guidance about disparity consistency and affinity similarity. In this paper, we propose a normal incorporated joint learning framework consisting of two specific modules named non-local disparity propagation (NDP) and affinity-aware residual learning (ARL). The estimated normal map is first utilized for calculating a non-local affinity matrix and a non-local offset to perform spatial propagation at the disparity level. To enhance geometric consistency, especially in low-texture regions, the estimated normal map is then leveraged to calculate a local affinity matrix, providing the residual learning with information about where the correction should refer and thus improving the residual learning efficiency. Extensive experiments on several public datasets including Scene Flow, KITTI 2015, and Middlebury 2014 validate the effectiveness of our proposed method. By the time we finished this work, our approach ranked 1st for stereo matching across foreground pixels on the KITTI 2015 dataset and 3rd on the Scene Flow dataset among all the published works.

* Proceedings of the 30th ACM International Conference on Multimedia (ACMMM2022), pp.6050-6060, October 2022

RGM: A Robust Generalist Matching Model

Oct 19, 2023

Songyan Zhang, Xinyu Sun, Hao Chen, Bo Li, Chunhua Shen

Abstract:Finding corresponding pixels within a pair of images is a fundamental computer vision task with various applications. Due to the specific requirements of different tasks like optical flow estimation and local feature matching, previous works are primarily categorized into dense matching and sparse feature matching focusing on specialized architectures along with task-specific datasets, which may somewhat hinder the generalization performance of specialized models. In this paper, we propose a deep model for sparse and dense matching, termed RGM (Robust Generalist Matching). In particular, we elaborately design a cascaded GRU module for refinement by exploring the geometric similarity iteratively at multiple scales following an additional uncertainty estimation module for sparsification. To narrow the gap between synthetic training samples and real-world scenarios, we build a new, large-scale dataset with sparse correspondence ground truth by generating optical flow supervision with greater intervals. As such, we are able to mix up various dense and sparse matching datasets, significantly improving the training diversity. The generalization capacity of our proposed RGM is greatly improved by learning the matching and uncertainty estimation in a two-stage manner on the large, mixed data. Superior performance is achieved for zero-shot matching and downstream geometry estimation across multiple datasets, outperforming the previous methods by a large margin.

* 17 pages. Code is available at: https://github.com/aim-uofa/RGM

DAVOS: Semi-Supervised Video Object Segmentation via Adversarial Domain Adaptation

May 24, 2021

Jinshuo Zhang, Zhicheng Wang, Songyan Zhang, Gang Wei

Abstract:Domain shift has always been one of the primary issues in video object segmentation (VOS), for which models suffer from degeneration when tested on unfamiliar datasets. Recently, many online methods have emerged to narrow the performance gap between training data (source domain) and test data (target domain) by fine-tuning on annotations of test data which are usually in shortage. In this paper, we propose a novel method to tackle domain shift by first introducing adversarial domain adaptation to the VOS task, with supervised training on the source domain and unsupervised training on the target domain. By fusing appearance and motion features with a convolution layer, and by adding supervision onto the motion branch, our model achieves state-of-the-art performance on DAVIS2016 with 82.6% mean IoU score after supervised training. Meanwhile, our adversarial domain adaptation strategy significantly raises the performance of the trained model when applied on FBMS59 and Youtube-Object, without exploiting extra annotations.
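
For readers unfamiliar with the metric quoted in this abstract, mean IoU averages the intersection-over-union between predicted and ground-truth object masks. The following is a minimal sketch of that computation on binary masks stored as nested lists. It is not the DAVIS evaluation code, and the function names `iou` and `mean_iou` as well as the empty-mask convention are assumptions made for illustration.

```python
def iou(pred, gt):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds, gts):
    """Average the per-frame IoU over a sequence of mask pairs."""
    scores = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    print(iou(pred, gt))                    # overlap 1 pixel, union 2 pixels
    print(mean_iou([pred, gt], [gt, gt]))   # average over two frames
```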

EDNet: Efficient Disparity Estimation with Combination Volume and Spatial Attention based Residual Learning

Nov 08, 2020

Songyan Zhang, Zhicheng Wang

Abstract:Existing state-of-the-art disparity estimation works mostly leverage the 4D concatenation volume and construct a very deep 3D convolution neural network for disparity regression, which is inefficient considering the high memory consumption and slow inference speed. In this paper, we propose a network named EDNet for efficient disparity estimation. To be specific, we construct a combination volume which incorporates contextual information from the concatenation volume and feature similarity measurement from the correlation volume. The combination volume can be aggregated by 2D convolutions which require less running memory. We further propose a spatial attention based residual learning module to generate attention-aware residual features. Accurate disparity correction can be provided even in low-texture regions as the residual learning process can specifically concentrate on inaccurate regions. Extensive experiments on Scene Flow and KITTI datasets show that our network outperforms previous 3D convolution based works and achieves state-of-the-art performance with significantly faster speed and less memory consumption, demonstrating the effectiveness of our proposed method.
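
The correlation volume mentioned in this abstract can be illustrated with a minimal sketch, assuming rectified stereo features along a single scanline and a channel-averaged dot product as the similarity measure. This is not the EDNet implementation, and it covers only the correlation part, not the combination with the concatenation volume. The function name `correlation_volume` and the zero padding for out-of-range matches are assumptions made for illustration.

```python
def correlation_volume(left, right, max_disp):
    """Build a 1D correlation volume for one scanline.

    left, right: feature rows shaped [width][channels].
    Returns volume[d][x], the mean over channels of left[x] * right[x - d],
    i.e. the similarity of left pixel x to the right pixel d columns to its left.
    """
    width = len(left)
    channels = len(left[0])
    volume = []
    for d in range(max_disp):
        row = []
        for x in range(width):
            if x - d < 0:
                # Assumed padding: no valid match outside the image.
                row.append(0.0)
            else:
                dot = sum(l * r for l, r in zip(left[x], right[x - d]))
                row.append(dot / channels)
        volume.append(row)
    return volume


if __name__ == "__main__":
    # Toy features in which the right view equals the left view shifted by one pixel.
    left = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    right = [[0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    vol = correlation_volume(left, right, max_disp=2)
    print(vol[0])  # similarities at disparity 0
    print(vol[1])  # similarities at disparity 1
```

Because each disparity level is reduced to a single similarity value per pixel, a volume of this kind has one channel per disparity and can be processed with 2D convolutions, which is the memory argument the abstract makes.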
