Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yang Xiang

Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations

Oct 24, 2025

Kaibo Wang, Jianda Mao, Tong Wu, Yang Xiang

Figure 1 for Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations

Figure 2 for Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations

Figure 3 for Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations

Figure 4 for Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations

Abstract:Classifier-Free Guidance (CFG) is an essential component of text-to-image diffusion models, and understanding and advancing its operational mechanisms remains a central focus of research. Existing approaches stem from divergent theoretical interpretations, thereby limiting the design space and obscuring key design choices. To address this, we propose a unified perspective that reframes conditional guidance as fixed point iterations, seeking to identify a golden path where latents produce consistent outputs under both conditional and unconditional generation. We demonstrate that CFG and its variants constitute a special case of single-step short-interval iteration, which is theoretically proven to exhibit inefficiency. To this end, we introduce Foresight Guidance (FSG), which prioritizes solving longer-interval subproblems in early diffusion stages with increased iterations. Extensive experiments across diverse datasets and model architectures validate the superiority of FSG over state-of-the-art methods in both image quality and computational efficiency. Our work offers novel perspectives for conditional guidance and unlocks the potential of adaptive design.

* Accepted at NeurIPS 2025 (Spotlight)

Via

Access Paper or Ask Questions

A Hierarchical Quantized Tokenization Framework for Task-Adaptive Graph Representation Learning

Oct 14, 2025

Yang Xiang, Li Fan, Chenke Yin, Chengtao Ji

Abstract:Recent progress in language and vision foundation models demonstrates the importance of discrete token interfaces that transform complex inputs into compact sequences for large-scale modeling. Extending this paradigm to graphs requires a tokenization scheme that handles non-Euclidean structures and multi-scale dependencies efficiently. Existing approaches to graph tokenization, linearized, continuous, and quantized, remain limited in adaptability and efficiency. In particular, most current quantization-based tokenizers organize hierarchical information in fixed or task-agnostic ways, which may either over-represent or under-utilize structural cues, and lack the ability to dynamically reweight contributions from different levels without retraining the encoder. This work presents a hierarchical quantization framework that introduces a self-weighted mechanism for task-adaptive aggregation across multiple scales. The proposed method maintains a frozen encoder while modulating information flow through a lightweight gating process, enabling parameter-efficient adaptation to diverse downstream tasks. Experiments on benchmark datasets for node classification and link prediction demonstrate consistent improvements over strong baselines under comparable computational budgets.

Via

Access Paper or Ask Questions

From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs

Sep 26, 2025

Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Weili Guan, Jun Yu, Min Zhang

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks, yet their robustness to spatial variations remains insufficiently understood. In this work, we present a systematic study of the spatial bias of LVLMs, focusing on how models respond when identical key visual information is placed at different locations within an image. Through a carefully designed probing dataset, we demonstrate that current LVLMs often produce inconsistent outputs under such spatial shifts, revealing a fundamental limitation in their spatial-semantic understanding. Further analysis shows that this phenomenon originates not from the vision encoder, which reliably perceives and interprets visual content across positions, but from the unbalanced design of position embeddings in the language model component. In particular, the widely adopted position embedding strategies, such as RoPE, introduce imbalance during cross-modal interaction, leading image tokens at different positions to exert unequal influence on semantic understanding. To mitigate this issue, we introduce Balanced Position Assignment (BaPA), a simple yet effective mechanism that assigns identical position embeddings to all image tokens, promoting a more balanced integration of visual information. Extensive experiments show that BaPA enhances the spatial robustness of LVLMs without retraining and further boosts their performance across diverse multimodal benchmarks when combined with lightweight fine-tuning. Further analysis of information flow reveals that BaPA yields balanced attention, enabling more holistic visual understanding.

Via

Access Paper or Ask Questions

From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement

Sep 26, 2025

Jianzhi Yan, Le Liu, Youcheng Pan, Shiwei Chen, Zike Yuan, Yang Xiang, Buzhou Tang

Figure 1 for From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement

Figure 2 for From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement

Figure 3 for From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement

Figure 4 for From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement

Abstract:Chain-of-Thought (CoT) reasoning improves performance on complex tasks but introduces significant inference latency due to verbosity. We propose Multiround Adaptive Chain-of-Thought Compression (MACC), a framework that leverages the token elasticity phenomenon--where overly small token budgets can paradoxically increase output length--to progressively compress CoTs via multiround refinement. This adaptive strategy allows MACC to determine the optimal compression depth for each input. Our method achieves an average accuracy improvement of 5.6 percent over state-of-the-art baselines, while also reducing CoT length by an average of 47 tokens and significantly lowering latency. Furthermore, we show that test-time performance--accuracy and token length--can be reliably predicted using interpretable features like perplexity and compression rate on the training set. Evaluated across different models, our method enables efficient model selection and forecasting without repeated fine-tuning, demonstrating that CoT compression is both effective and predictable. Our code will be released in https://github.com/Leon221220/MACC.

* 17 pages, 8 figures

Via

Access Paper or Ask Questions

Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing

Sep 17, 2025

Yaru Chen, Ruohao Guo, Liting Gao, Yang Xiang, Qingyu Luo, Zhenbo Li, Wenwu Wang

Abstract:Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning, but neglected stable segment-level supervision and class-aware cross-modal alignment. To address this, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks via adaptive thresholds or top-k selection, offering stable temporal guidance beyond video-level labels; and (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs, ensuring consistency across modalities while preserving temporal structure. Evaluations on LLP and UnAV-100 datasets shows that our method achieves state-of-the-art (SOTA) performance across multiple metrics.

Via

Access Paper or Ask Questions

TEn-CATS: Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph

Sep 04, 2025

Yaru Chen, Faegheh Sardari, Peiliang Zhang, Ruohao Guo, Yang Xiang, Zhenbo Li, Wenwu Wang

Abstract:Audio-Visual Video Parsing (AVVP) task aims to identify event categories and their occurrence times in a given video with weakly supervised labels. Existing methods typically fall into two categories: (i) designing enhanced architectures based on attention mechanism for better temporal modeling, and (ii) generating richer pseudo-labels to compensate for the absence of frame-level annotations. However, the first type methods treat noisy segment-level pseudo labels as reliable supervision and the second type methods let indiscriminate attention spread them across all frames, the initial errors are repeatedly amplified during training. To address this issue, we propose a method that combines the Bi-Directional Text Fusion (BiT) module and Category-Aware Temporal Graph (CATS) module. Specifically, we integrate the strengths and complementarity of the two previous research directions. We first perform semantic injection and dynamic calibration on audio and visual modality features through the BiT module, to locate and purify cleaner and richer semantic cues. Then, we leverage the CATS module for semantic propagation and connection to enable precise semantic information dissemination across time. Experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance in multiple key indicators on two benchmark datasets, LLP and UnAV-100.

Via

Access Paper or Ask Questions

CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation

Aug 10, 2025

Yexing Du, Kaiyuan Liu, Youcheng Pan, Zheng Chu, Bo Yang, Xiaocheng Feng, Yang Xiang, Ming Liu

Abstract:As Large Language Models (LLMs) are increasingly popularized in the multilingual world, ensuring hallucination-free factuality becomes markedly crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) predominantly focus on textual or visual modalities with a primary emphasis on English, which creates a gap in evaluation when processing multilingual input, especially in speech. To bridge this gap, we propose a novel \textbf{C}ross-lingual and \textbf{C}ross-modal \textbf{F}actuality benchmark (\textbf{CCFQA}). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs' cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving competitive performance with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at https://github.com/yxduir/ccfqa.

Via

Access Paper or Ask Questions

Uncertainty-Aware Semantic Decoding for LLM-Based Sequential Recommendation

Aug 10, 2025

Chenke Yin, Li Fan, Jia Wang, Dongxiao Hu, Haichao Zhang, Chong Zhang, Yang Xiang

Figure 1 for Uncertainty-Aware Semantic Decoding for LLM-Based Sequential Recommendation

Figure 2 for Uncertainty-Aware Semantic Decoding for LLM-Based Sequential Recommendation

Figure 3 for Uncertainty-Aware Semantic Decoding for LLM-Based Sequential Recommendation

Figure 4 for Uncertainty-Aware Semantic Decoding for LLM-Based Sequential Recommendation

Abstract:Large language models have been widely applied to sequential recommendation tasks, yet during inference, they continue to rely on decoding strategies developed for natural language processing. This creates a mismatch between text-generation objectives and recommendation next item selection objectives. This paper addresses this limitation by proposing an Uncertainty-aware Semantic Decoding (USD) framework that combines logit-based clustering with adaptive scoring to improve next-item predictions. Our approach clusters items with similar logit vectors into semantic equivalence groups, then redistributes probability mass within these clusters and computes entropy across them to control item scoring and sampling temperature during recommendation inference. Experiments on Amazon Product datasets (six domains) gains of 18.5\% in HR@3, 11.9\% in NDCG@3, and 10.8\% in MRR@3 compared to state-of-the-art baselines. Hyperparameter analysis confirms the optimal parameters among various settings, and experiments on H\&M, and Netflix datasets indicate that the framework can adapt to differing recommendation domains. The experimental results confirm that integrating semantic clustering and uncertainty assessment yields more reliable and accurate recommendations.

* Accepted by APWeb 2025

Via

Access Paper or Ask Questions

Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations

Jul 16, 2025

Yichen Han, Xiaoyang Hao, Keming Chen, Weibo Xiong, Jun He, Ruonan Zhang, Junjie Cao, Yue Liu, Bowen Li, Dongrui Zhang(+18 more)

Figure 1 for Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations

Figure 2 for Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations

Figure 3 for Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations

Figure 4 for Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations

Abstract:Text-to-speech (TTS) synthesis has seen renewed progress under the discrete modeling paradigm. Existing autoregressive approaches often rely on single-codebook representations, which suffer from significant information loss. Even with post-hoc refinement techniques such as flow matching, these methods fail to recover fine-grained details (e.g., prosodic nuances, speaker-specific timbres), especially in challenging scenarios like singing voice or music synthesis. We propose QTTS, a novel TTS framework built upon our new audio codec, QDAC. The core innovation of QDAC lies in its end-to-end training of an ASR-based auto-regressive network with a GAN, which achieves superior semantic feature disentanglement for scalable, near-lossless compression. QTTS models these discrete codes using two innovative strategies: the Hierarchical Parallel architecture, which uses a dual-AR structure to model inter-codebook dependencies for higher-quality synthesis, and the Delay Multihead approach, which employs parallelized prediction with a fixed delay to accelerate inference speed. Our experiments demonstrate that the proposed framework achieves higher synthesis quality and better preserves expressive content compared to baseline. This suggests that scaling up compression via multi-codebook modeling is a promising direction for high-fidelity, general-purpose speech and audio generation.

Via

Access Paper or Ask Questions

MVAR: MultiVariate AutoRegressive Air Pollutants Forecasting Model

Jul 16, 2025

Xu Fan, Zhihao Wang, Yuetan Lin, Yan Zhang, Yang Xiang, Hao Li

Abstract:Air pollutants pose a significant threat to the environment and human health, thus forecasting accurate pollutant concentrations is essential for pollution warnings and policy-making. Existing studies predominantly focus on single-pollutant forecasting, neglecting the interactions among different pollutants and their diverse spatial responses. To address the practical needs of forecasting multivariate air pollutants, we propose MultiVariate AutoRegressive air pollutants forecasting model (MVAR), which reduces the dependency on long-time-window inputs and boosts the data utilization efficiency. We also design the Multivariate Autoregressive Training Paradigm, enabling MVAR to achieve 120-hour long-term sequential forecasting. Additionally, MVAR develops Meteorological Coupled Spatial Transformer block, enabling the flexible coupling of AI-based meteorological forecasts while learning the interactions among pollutants and their diverse spatial responses. As for the lack of standardized datasets in air pollutants forecasting, we construct a comprehensive dataset covering 6 major pollutants across 75 cities in North China from 2018 to 2023, including ERA5 reanalysis data and FuXi-2.0 forecast data. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods and validate the effectiveness of the proposed architecture.

Via

Access Paper or Ask Questions