Wang Tao

SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

Dec 16, 2025

Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, Bowen Zhou

Abstract: Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. This framework unifies three components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.
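
The abstract names its three components without giving formulas, so the following is only an illustrative sketch of the general ideas, not the paper's implementation. It assumes that each block draws its own mask ratio from a Beta distribution and that the loss is averaged over the tokens that were actually masked. The function names, the Beta parameters, and the block sizes are all hypothetical.

```python
import random

def sample_block_masks(num_blocks, block_size, alpha, beta, rng):
    """Draw an independent mask ratio per block (assumed Beta(alpha, beta)),
    then mask each token in that block with that probability."""
    masks = []
    for _ in range(num_blocks):
        ratio = rng.betavariate(alpha, beta)  # hypothetical per-block noise level
        masks.append([rng.random() < ratio for _ in range(block_size)])
    return masks

def normalized_masked_loss(token_losses, masks):
    """Average per-token loss over the tokens actually masked, so the result
    does not shrink or grow with the realized mask count (an assumption about
    what normalizing by the effective mask ratio could look like)."""
    total = 0.0
    count = 0
    for block_losses, block_mask in zip(token_losses, masks):
        for loss, is_masked in zip(block_losses, block_mask):
            if is_masked:
                total += loss
                count += 1
    return total / count if count else 0.0

# Toy usage: 4 blocks of 8 tokens, each token with a dummy loss of 1.0.
rng = random.Random(0)
masks = sample_block_masks(num_blocks=4, block_size=8, alpha=2.0, beta=2.0, rng=rng)
token_losses = [[1.0] * 8 for _ in range(4)]
loss = normalized_masked_loss(token_losses, masks)
```

In this toy setup, a curriculum over training steps could be imitated by gradually changing `alpha` and `beta` so that sampled ratios shift toward heavier masking; how the paper actually schedules its Beta noise is not stated in the abstract.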

Adaptable and Precise: Enterprise-Scenario LLM Function-Calling Capability Training Pipeline

Dec 20, 2024

Guancheng Zeng, Wentao Ding, Beining Xu, Chi Zhang, Wenqiang Han, Gang Li, Jingjing Mo, Pengxu Qiu, Xinran Tao, Wang Tao(+1 more)

Abstract: Enterprises possess a vast array of API assets scattered across various functions, forming the backbone of existing business processes. By leveraging these APIs as functional tools, enterprises can design diverse, scenario-specific agent applications, driven by on-premise function-calling models as the core engine. However, generic models often fail to meet enterprise requirements in terms of computational efficiency, output accuracy, and stability, necessitating scenario-specific adaptation. In this paper, we propose a training pipeline for function-calling capabilities tailored to real-world business scenarios. This pipeline includes the synthesis and augmentation of scenario-specific function-calling data, model fine-tuning, and performance evaluation and analysis. Using this pipeline, we generated 1,260 fully AI-generated samples and 1,035 augmented manually labeled samples for a digital HR agent scenario. The Qwen2.5-Coder-7B-Instruct model was employed as the base model and fine-tuned using the LoRA method on four GPUs with 24 GB VRAM. Our fine-tuned model demonstrated outstanding performance in evaluations and practical applications, surpassing GPT-4 and GPT-4o in accuracy on the test set. These results validate the reliability of the proposed pipeline for training scenario-specific function-calling models.

* 23 pages, 6 figures, 7 tables
