Shuyao Shang

DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

Oct 14, 2025

Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang (+3 more)

Abstract: Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a "supervision deficit": the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm's versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.
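The "supervision deficit" idea can be illustrated with a toy objective that adds a dense future-frame term to a sparse action term. This is only a sketch under stated assumptions: the abstract does not give the loss form, so the mean-squared-error terms, the `world_weight` parameter, and the function names `mse` and `joint_loss` are all hypothetical, and flat Python lists stand in for action vectors and images.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    action_pred: Sequence[float],
    action_gt: Sequence[float],
    frame_pred: Sequence[float],
    frame_gt: Sequence[float],
    world_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Sparse action loss plus a dense future-frame (world-model) loss."""
    action_term = mse(action_pred, action_gt)   # few supervised values
    world_term = mse(frame_pred, frame_gt)      # many supervised values
    return action_term + world_weight * world_term


# Toy sample: a 2-value action versus a 16-value "future frame".
action_pred, action_gt = [0.1, 0.0], [0.0, 0.0]
frame_pred, frame_gt = [0.5] * 16, [0.0] * 16

loss = joint_loss(action_pred, action_gt, frame_pred, frame_gt)
supervised_values = (len(action_gt), len(frame_gt))
```

Here the action target supervises 2 values per sample while the frame target supervises 16; with real images the gap would be far larger, which is the density argument the abstract makes.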

ResDiff: Combining CNN and Diffusion Model for Image Super-Resolution

Mar 16, 2023

Shuyao Shang, Zhengyang Shan, Guangxing Liu, Jinglin Zhang

Abstract: Adapting the Diffusion Probabilistic Model (DPM) for direct image super-resolution is wasteful, given that a simple Convolutional Neural Network (CNN) can recover the main low-frequency content. Therefore, we present ResDiff, a novel Diffusion Probabilistic Model based on Residual structure for Single Image Super-Resolution (SISR). ResDiff utilizes a combination of a CNN, which restores primary low-frequency components, and a DPM, which predicts the residual between the ground-truth image and the CNN-predicted image. In contrast to the common diffusion-based methods that directly use low-resolution (LR) images to guide the noise towards the high-resolution (HR) space, ResDiff utilizes the CNN's initial prediction to direct the noise towards the residual space between the HR space and the CNN-predicted space, which not only accelerates the generation process but also yields superior sample quality. Additionally, a frequency-domain-based loss function is introduced for the CNN to facilitate its restoration, and a frequency-domain guided diffusion is designed for the DPM to help it predict high-frequency details. Extensive experiments on multiple benchmark datasets demonstrate that ResDiff outperforms previous diffusion-based methods in terms of shorter model convergence time, superior generation quality, and more diverse samples.

* 18 pages, 13 figures
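The residual decomposition described in the ResDiff abstract can be sketched on a 1-D signal. This is a minimal illustration, not the authors' implementation: `box_blur` is a hypothetical stand-in for the CNN's low-frequency prediction, the diffusion model is replaced by the exact residual, and all function names are assumptions made for this example.

```python
from typing import List


def box_blur(signal: List[float], radius: int = 1) -> List[float]:
    """Hypothetical stand-in for the CNN: keeps mostly low-frequency content."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def residual_target(hr: List[float], coarse: List[float]) -> List[float]:
    """What the DPM would be trained to generate: ground truth minus coarse."""
    return [h - c for h, c in zip(hr, coarse)]


def compose(coarse: List[float], residual: List[float]) -> List[float]:
    """Final output: coarse prediction plus the predicted residual."""
    return [c + r for c, r in zip(coarse, residual)]


def energy(signal: List[float]) -> float:
    """Sum of squared values, used here as a rough measure of signal size."""
    return sum(v * v for v in signal)


# Toy "ground-truth" signal: a step (low frequency) plus a ripple (high frequency).
hr = [0.0, 1.0, 0.0, 1.0, 4.0, 5.0, 4.0, 5.0]
coarse = box_blur(hr)
residual = residual_target(hr, coarse)
restored = compose(coarse, residual)
```

In this toy case the residual carries much less energy than the full signal, which is the intuition behind letting the generative model handle only what the coarse predictor misses.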