Tong Wu

PreResQ-R1: Towards Fine-Grained Rank-and-Score Reinforcement Learning for Visual Quality Assessment via Preference-Response Disentangled Policy Optimization

Nov 07, 2025

Zehui Feng, Tian Qiu, Tong Wu, Junxuan Li, Huayuan Xu, Ting Han

Abstract:Visual Quality Assessment (QA) seeks to predict human perceptual judgments of visual fidelity. While recent multimodal large language models (MLLMs) show promise in reasoning about image and video quality, existing approaches mainly rely on supervised fine-tuning or rank-only objectives, resulting in shallow reasoning, poor score calibration, and limited cross-domain generalization. We propose PreResQ-R1, a Preference-Response Disentangled Reinforcement Learning framework that unifies absolute score regression and relative ranking consistency within a single reasoning-driven optimization scheme. Unlike prior QA methods, PreResQ-R1 introduces a dual-branch reward formulation that separately models intra-sample response coherence and inter-sample preference alignment, optimized via Group Relative Policy Optimization (GRPO). This design encourages fine-grained, stable, and interpretable chain-of-thought reasoning about perceptual quality. To extend beyond static imagery, we further design a global-temporal and local-spatial data flow strategy for Video Quality Assessment. Remarkably, with reinforcement fine-tuning on only 6K images and 28K videos, PreResQ-R1 achieves state-of-the-art results across 10 IQA and 5 VQA benchmarks under both SRCC and PLCC metrics, with margins of 5.30% and 2.15%, respectively, on the IQA task. Beyond quantitative gains, it produces human-aligned reasoning traces that reveal the perceptual cues underlying quality judgments. Code and model are available.

* 27 pages, 14 figures, under review as a conference paper
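
The abstract above reports results under SRCC and PLCC. As a minimal sketch of what these two metrics measure, the following computes the Pearson linear correlation coefficient (PLCC) and the Spearman rank correlation coefficient (SRCC) in plain Python. The scores in `predicted` and `human_mos` are hypothetical values chosen for illustration, not data from the paper, and the function names are assumptions of this sketch.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 0-based positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predicted quality scores and human mean opinion scores.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
human_mos = [1.0, 4.0, 9.0, 16.0, 25.0]

srcc = spearman(predicted, human_mos)  # ordering agrees perfectly
plcc = pearson(predicted, human_mos)   # relationship is monotonic but not linear
print(f"SRCC={srcc:.4f} PLCC={plcc:.4f}")
```

In this toy case the ranking is perfect while the linear fit is not, which is why quality-assessment work usually reports both numbers: SRCC reflects ranking consistency and PLCC reflects how well the scores are calibrated on a linear scale.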

Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations

Oct 24, 2025

Kaibo Wang, Jianda Mao, Tong Wu, Yang Xiang

Abstract:Classifier-Free Guidance (CFG) is an essential component of text-to-image diffusion models, and understanding and advancing its operational mechanisms remains a central focus of research. Existing approaches stem from divergent theoretical interpretations, thereby limiting the design space and obscuring key design choices. To address this, we propose a unified perspective that reframes conditional guidance as fixed point iterations, seeking to identify a golden path where latents produce consistent outputs under both conditional and unconditional generation. We demonstrate that CFG and its variants constitute a special case of single-step short-interval iteration, which is theoretically proven to exhibit inefficiency. To this end, we introduce Foresight Guidance (FSG), which prioritizes solving longer-interval subproblems in early diffusion stages with increased iterations. Extensive experiments across diverse datasets and model architectures validate the superiority of FSG over state-of-the-art methods in both image quality and computational efficiency. Our work offers novel perspectives for conditional guidance and unlocks the potential of adaptive design.

* Accepted at NeurIPS 2025 (Spotlight)
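
For readers unfamiliar with the baseline the abstract above builds on, the following is a minimal sketch of the standard Classifier-Free Guidance combination step, in which the guided noise prediction extrapolates from the unconditional prediction toward the conditional one. It is not the Foresight Guidance method proposed in the paper. The function name, the plain-list representation, and the numeric values are assumptions for illustration, and some papers parameterize the scale differently (for example as `1 + w`).

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard CFG: eps_uncond + guidance_scale * (eps_cond - eps_uncond), elementwise."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical noise predictions for a three-element latent.
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=2.0)
print(guided)
```

With `guidance_scale=1.0` the result equals the conditional prediction, and with `guidance_scale=0.0` it equals the unconditional one; values above 1 push further in the conditional direction.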

Universal Graph Learning for Power System Reconfigurations: Transfer Across Topology Variations

Sep 10, 2025

Tong Wu, Anna Scaglione, Sandy Miguel, Daniel Arnold

Abstract:This work addresses a fundamental challenge in applying deep learning to power systems: developing neural network models that transfer across significant system changes, including networks with entirely different topologies and dimensionalities, without requiring training data from unseen reconfigurations. Despite extensive research, most ML-based approaches remain system-specific, limiting real-world deployment. This limitation stems from a dual barrier. First, topology changes shift feature distributions and alter input dimensions due to power flow physics. Second, reconfigurations redefine output semantics and dimensionality, requiring models to handle configuration-specific outputs while maintaining transferable feature extraction. To overcome this challenge, we introduce a Universal Graph Convolutional Network (UGCN) that achieves transferability to any reconfiguration or variation of existing power systems without any prior knowledge of new grid topologies or retraining during implementation. Our approach applies to both transmission and distribution networks and demonstrates generalization capability to completely unseen system reconfigurations, such as network restructuring and major grid expansions. Experimental results across power system applications, including false data injection detection and state forecasting, show that UGCN significantly outperforms state-of-the-art methods in cross-system zero-shot transferability of new reconfigurations.

* This work has been submitted to the IEEE for possible publication

EGTM: Event-guided Efficient Turbulence Mitigation

Sep 04, 2025

Huanan Li, Rui Fan, Juntao Guan, Weidong Hao, Lai Rui, Tong Wu, Yikai Wang, Lin Gu

Abstract:Turbulence mitigation (TM) aims to remove the stochastic distortions and blurs introduced by atmospheric turbulence into frame cameras. Existing state-of-the-art deep-learning TM methods extract turbulence cues from multiple degraded frames to find the so-called "lucky", undistorted patch for "lucky fusion". However, this requires a high-capacity network to learn from coarse-grained turbulence dynamics between synchronous frames with a limited frame rate, and thus falls short in computational and storage efficiency. Event cameras, with microsecond-level temporal resolution, have the potential to fundamentally address this bottleneck with their efficient sparse and asynchronous imaging mechanism. In light of this, we (i) present the fundamental "event-lucky insight" to reveal the correlation between turbulence distortions and the inverse spatiotemporal distribution of event streams. Then, building upon this insight, we (ii) propose a novel EGTM framework that extracts pixel-level reliable turbulence-free guidance from the explicit but noisy turbulent events for temporal lucky fusion. Moreover, we (iii) build the first turbulence data acquisition system to contribute the first real-world event-driven TM dataset. Extensive experimental results demonstrate that our approach significantly surpasses the existing SOTA TM method by 710 times, 214 times, and 224 times in model size, inference latency, and model complexity, respectively, while achieving state-of-the-art restoration quality (+0.94 PSNR and +0.08 SSIM) on our real-world EGTM dataset. This demonstrates the great efficiency merit of introducing the event modality into the TM task. Demo code and data have been uploaded in the supplementary material and will be released once accepted.

DiCache: Let Diffusion Model Determine Its Own Cache

Aug 24, 2025

Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tong Wu, Dahua Lin, Jiaqi Wang

Abstract:Recent years have witnessed the rapid development of acceleration techniques for diffusion models, especially caching-based acceleration methods. These studies seek to answer two fundamental questions: "When to cache" and "How to use cache", typically relying on predefined empirical laws or dataset-level priors to determine the timing of caching and utilizing handcrafted rules for leveraging multi-step caches. However, given the highly dynamic nature of the diffusion process, they often exhibit limited generalizability and fail on outlier samples. In this paper, a strong correlation is revealed between the variation patterns of the shallow-layer feature differences in the diffusion model and those of final model outputs. Moreover, we have observed that the features from different model layers form similar trajectories. Based on these observations, we present DiCache, a novel training-free adaptive caching strategy for accelerating diffusion models at runtime, answering both when and how to cache within a unified framework. Specifically, DiCache is composed of two principal components: (1) Online Probe Profiling Scheme leverages a shallow-layer online probe to obtain a stable prior for the caching error in real time, enabling the model to autonomously determine caching schedules. (2) Dynamic Cache Trajectory Alignment combines multi-step caches based on shallow-layer probe feature trajectory to better approximate the current feature, facilitating higher visual quality. Extensive experiments validate DiCache's capability in achieving higher efficiency and improved visual fidelity over state-of-the-art methods on various leading diffusion models including WAN 2.1, HunyuanVideo for video generation, and Flux for image generation.
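
The "when to cache" idea in the abstract above can be illustrated with a toy sketch: track how much a cheap shallow-layer probe feature changes between steps, and reuse the cached output while the accumulated change stays small. This is an illustrative assumption, not DiCache's actual algorithm. The relative L1 change measure, the accumulate-and-reset rule, the threshold, the function names, and the feature values are all hypothetical.

```python
def relative_change(current, previous):
    """Relative L1 difference between two feature vectors."""
    num = sum(abs(c - p) for c, p in zip(current, previous))
    den = sum(abs(p) for p in previous) or 1.0  # guard against an all-zero vector
    return num / den


def plan_cache_reuse(probe_features, threshold):
    """Return one decision per step: True = reuse the cache, False = recompute.

    The probe change is accumulated since the last full computation; once it
    reaches the threshold, the step is recomputed and the accumulator resets.
    """
    decisions = [False]  # the first step has no cache, so it always computes
    accumulated = 0.0
    for prev, cur in zip(probe_features, probe_features[1:]):
        accumulated += relative_change(cur, prev)
        if accumulated < threshold:
            decisions.append(True)
        else:
            decisions.append(False)
            accumulated = 0.0
    return decisions


# Hypothetical shallow-layer probe features over five denoising steps.
probe_features = [
    [1.0, 1.0],
    [1.0, 1.02],
    [1.0, 1.04],
    [2.0, 2.0],   # large jump: the cache is considered stale here
    [2.0, 2.01],
]

decisions = plan_cache_reuse(probe_features, threshold=0.05)
print(decisions)
```

Because the decision depends on the features of the sample being generated rather than on a fixed schedule, the recompute points can differ from sample to sample, which is the adaptivity the abstract describes.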

Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity

Aug 07, 2025

Yuhan Zhang, Long Zhuo, Ziyang Chu, Tong Wu, Zhibing Li, Liang Pan, Dahua Lin, Ziwei Liu

Abstract:Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging. Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details. 1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object-level and part-level evaluation, enabling holistic assessments across multiple dimensions as well as fine-grained quality analysis. Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 2) To support this framework, we construct Hi3DBench, a large-scale dataset comprising diverse 3D assets and high-quality annotations, accompanied by a reliable multi-agent annotation pipeline. We further propose a 3D-aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video-based representations for object-level and material-subject evaluations to enhance modeling of spatio-temporal consistency and employ pretrained 3D features for part-level perception. Extensive experiments demonstrate that our approach outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations. The project page is available at https://zyh482.github.io/Hi3DEval/.

* Page: https://zyh482.github.io/Hi3DEval/

SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience

Aug 06, 2025

Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, Jiaqi Wang

Abstract:Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.

* Code at https://github.com/SunzeY/SEAgent
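
The abstract above mentions Group Relative Policy Optimization (GRPO). A minimal sketch of its core step, the group-relative advantage, is shown below: several responses are sampled for the same prompt, and each reward is normalized by the mean and standard deviation of its group. The function name and reward values are hypothetical, the choice of population standard deviation is an assumption (implementations differ on this), and nothing here reproduces SEAgent's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by its group's mean and standard deviation.

    eps avoids division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled trajectories on the same task
# (1.0 = success, 0.0 = failure).
rewards = [1.0, 0.0, 0.0, 1.0]

advantages = group_relative_advantages(rewards)
print(advantages)
```

Successful trajectories receive positive advantages and failed ones negative advantages, without needing a separately trained value network to provide the baseline.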

Video World Models with Long-term Spatial Memory

Jun 05, 2025

Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein

Abstract:Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework to enhance the long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.

* Project page: https://spmem.github.io/

Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation

Jun 05, 2025

Jan Ackermann, Kiyohiro Nakayama, Guandao Yang, Tong Wu, Gordon Wetzstein

Abstract:Multimodal foundation models have demonstrated strong generalization, yet their ability to transfer knowledge to specialized domains such as garment generation remains underexplored. We introduce VLG, a vision-language-garment model that synthesizes garments from textual descriptions and visual imagery. Our experiments assess VLG's zero-shot generalization, investigating its ability to transfer web-scale reasoning to unseen garment styles and prompts. Preliminary results indicate promising transfer capabilities, highlighting the potential for multimodal foundation models to adapt effectively to specialized domains like fashion design.

* Presented at MMFM CVPRW'25, code available at https://georgenakayama.github.io/AIpparel/

SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams

May 26, 2025

Zhuoheng Gao, Yihao Li, Jiyao Zhang, Rui Zhao, Tong Wu, Hao Tang, Zhaofei Yu, Hao Dong, Guozhang Chen, Tiejun Huang

Abstract:Conventional frame-based cameras often struggle with stereo depth estimation in rapidly changing scenes. In contrast, bio-inspired spike cameras emit asynchronous events at microsecond-level resolution, providing an alternative sensing modality. However, existing methods lack specialized stereo algorithms and benchmarks tailored to the spike data. To address this gap, we propose SpikeStereoNet, a brain-inspired framework and the first to estimate stereo depth directly from raw spike streams. The model fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module. To benchmark our approach, we introduce a large-scale synthetic spike stream dataset and a real-world stereo spike dataset with dense depth annotations. SpikeStereoNet outperforms existing methods on both datasets by leveraging spike streams' ability to capture subtle edges and intensity shifts in challenging regions such as textureless surfaces and extreme lighting conditions. Furthermore, our framework exhibits strong data efficiency, maintaining high accuracy even with substantially reduced training data. The source code and datasets will be publicly available.
