Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Qijie Wang

WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

May 11, 2026

Keming Wu, Yijing Cui, Wenhan Xue, Qijie Wang, Xuan Luo, Zhiyuan Feng, Zuhao Yang, Sudong Wang, Sicong Jiang, Haowei Zhu(+4 more)

Abstract:Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators." Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures, while Multi-dimensional Quality Assessment scores reasoning quality, temporal consistency, and visual aesthetics for ranking and reward modeling. We further introduce WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation. Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release our benchmarks and evaluation toolkit to support community research on genuinely world-aware video generation at https://github.com/UniX-AI-Lab/WorldReasonBench/.

* Project Page: https://unix-ai-lab.github.io/WorldReasonBench/

Via

Access Paper or Ask Questions

Accelerating Diffusion-based Super-Resolution with Dynamic Time-Spatial Sampling

May 17, 2025

Rui Qin, Qijie Wang, Ming Sun, Haowei Zhu, Chao Zhou, Bin Wang

Figure 1 for Accelerating Diffusion-based Super-Resolution with Dynamic Time-Spatial Sampling

Figure 2 for Accelerating Diffusion-based Super-Resolution with Dynamic Time-Spatial Sampling

Figure 3 for Accelerating Diffusion-based Super-Resolution with Dynamic Time-Spatial Sampling

Figure 4 for Accelerating Diffusion-based Super-Resolution with Dynamic Time-Spatial Sampling

Abstract:Diffusion models have gained attention for their success in modeling complex distributions, achieving impressive perceptual quality in SR tasks. However, existing diffusion-based SR methods often suffer from high computational costs, requiring numerous iterative steps for training and inference. Existing acceleration techniques, such as distillation and solver optimization, are generally task-agnostic and do not fully leverage the specific characteristics of low-level tasks like super-resolution (SR). In this study, we analyze the frequency- and spatial-domain properties of diffusion-based SR methods, revealing key insights into the temporal and spatial dependencies of high-frequency signal recovery. Specifically, high-frequency details benefit from concentrated optimization during early and late diffusion iterations, while spatially textured regions demand adaptive denoising strategies. Building on these observations, we propose the Time-Spatial-aware Sampling strategy (TSS) for the acceleration of Diffusion SR without any extra training cost. TSS combines Time Dynamic Sampling (TDS), which allocates more iterations to refining textures, and Spatial Dynamic Sampling (SDS), which dynamically adjusts strategies based on image content. Extensive evaluations across multiple benchmarks demonstrate that TSS achieves state-of-the-art (SOTA) performance with significantly fewer iterations, improving MUSIQ scores by 0.2 - 3.0 and outperforming the current acceleration methods with only half the number of steps.

Via

Access Paper or Ask Questions

FSL-LVLM: Friction-Aware Safety Locomotion using Large Vision Language Model in Wheeled Robots

Sep 15, 2024

Bo Peng, Donghoon Baek, Qijie Wang, Joao Ramos

Figure 1 for FSL-LVLM: Friction-Aware Safety Locomotion using Large Vision Language Model in Wheeled Robots

Figure 2 for FSL-LVLM: Friction-Aware Safety Locomotion using Large Vision Language Model in Wheeled Robots

Figure 3 for FSL-LVLM: Friction-Aware Safety Locomotion using Large Vision Language Model in Wheeled Robots

Figure 4 for FSL-LVLM: Friction-Aware Safety Locomotion using Large Vision Language Model in Wheeled Robots

Abstract:Wheeled-legged robots offer significant mobility and versatility but face substantial challenges when operating on slippery terrains. Traditional model-based controllers for these robots assume no slipping. While reinforcement learning (RL) helps quadruped robots adapt to different surfaces, recovering from slips remains challenging, especially for systems with few contact points. Estimating the ground friction coefficient is another open challenge. In this paper, we propose a novel friction-aware safety locomotion framework that integrates Large Vision Language Models (LVLMs) with a RL policy. Our approach explicitly incorporates the estimated friction coefficient into the RL policy, enabling the robot to adapt its behavior in advance based on the surface type before reaching it. We introduce a Friction-From-Vision (FFV) module, which leverages LVLMs to estimate ground friction coefficients, eliminating the need for large datasets and extensive training. The framework was validated on a customized wheeled inverted pendulum, and experimental results demonstrate that our framework increases the success rate in completing driving tasks by adjusting speed according to terrain type, while achieving better tracking performance compared to baseline methods. Our framework can be simply integrated with any other RL policies.

* submitted to icra2025

Via

Access Paper or Ask Questions

CapS-Adapter: Caption-based MultiModal Adapter in Zero-Shot Classification

May 26, 2024

Qijie Wang, Guandu Liu, Bin Wang

Figure 1 for CapS-Adapter: Caption-based MultiModal Adapter in Zero-Shot Classification

Figure 2 for CapS-Adapter: Caption-based MultiModal Adapter in Zero-Shot Classification

Figure 3 for CapS-Adapter: Caption-based MultiModal Adapter in Zero-Shot Classification

Figure 4 for CapS-Adapter: Caption-based MultiModal Adapter in Zero-Shot Classification

Abstract:Recent advances in vision-language foundational models, such as CLIP, have demonstrated significant strides in zero-shot classification. However, the extensive parameterization of models like CLIP necessitates a resource-intensive fine-tuning process. In response, TIP-Adapter and SuS-X have introduced training-free methods aimed at bolstering the efficacy of downstream tasks. While these approaches incorporate support sets to maintain data distribution consistency between knowledge cache and test sets, they often fall short in terms of generalization on the test set, particularly when faced with test data exhibiting substantial distributional variations. In this work, we present CapS-Adapter, an innovative method that employs a caption-based support set, effectively harnessing both image and caption features to exceed existing state-of-the-art techniques in training-free scenarios. CapS-Adapter adeptly constructs support sets that closely mirror target distributions, utilizing instance-level distribution features extracted from multimodal large models. By leveraging CLIP's single and cross-modal strengths, CapS-Adapter enhances predictive accuracy through the use of multimodal support sets. Our method achieves outstanding zero-shot classification results across 19 benchmark datasets, improving accuracy by 2.19\% over the previous leading method. Our contributions are substantiated through extensive validation on multiple benchmark datasets, demonstrating superior performance and robust generalization capabilities. Our code is made publicly available at https://github.com/WLuLi/CapS-Adapter.

Via

Access Paper or Ask Questions