Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Qi Zhang

School of Information, North China University of Technology

BokehDiff: Neural Lens Blur with One-Step Diffusion

Jul 24, 2025

Chengxuan Zhu, Qingnan Fan, Qi Zhang, Jinwei Chen, Huaqi Zhang, Chao Xu, Boxin Shi

Abstract:We introduce BokehDiff, a novel lens blur rendering method that achieves physically accurate and visually appealing outcomes, with the help of generative diffusion prior. Previous methods are bounded by the accuracy of depth estimation, generating artifacts in depth discontinuities. Our method employs a physics-inspired self-attention module that aligns with the image formation process, incorporating depth-dependent circle of confusion constraint and self-occlusion effects. We adapt the diffusion model to the one-step inference scheme without introducing additional noise, and achieve results of high quality and fidelity. To address the lack of scalable paired data, we propose to synthesize photorealistic foregrounds with transparency with diffusion models, balancing authenticity and scene diversity.

* Accepted by ICCV 2025

Via

Access Paper or Ask Questions

IONext: Unlocking the Next Era of Inertial Odometry

Jul 23, 2025

Shanshan Zhang, Siyue Wang, Tianshui Wen, Qi Zhang, Ziheng Zhou, Lingxiang Zheng, Yu Yang

Abstract:Researchers have increasingly adopted Transformer-based models for inertial odometry. While Transformers excel at modeling long-range dependencies, their limited sensitivity to local, fine-grained motion variations and lack of inherent inductive biases often hinder localization accuracy and generalization. Recent studies have shown that incorporating large-kernel convolutions and Transformer-inspired architectural designs into CNN can effectively expand the receptive field, thereby improving global motion perception. Motivated by these insights, we propose a novel CNN-based module called the Dual-wing Adaptive Dynamic Mixer (DADM), which adaptively captures both global motion patterns and local, fine-grained motion features from dynamic inputs. This module dynamically generates selective weights based on the input, enabling efficient multi-scale feature aggregation. To further improve temporal modeling, we introduce the Spatio-Temporal Gating Unit (STGU), which selectively extracts representative and task-relevant motion features in the temporal domain. This unit addresses the limitations of temporal modeling observed in existing CNN approaches. Built upon DADM and STGU, we present a new CNN-based inertial odometry backbone, named Next Era of Inertial Odometry (IONext). Extensive experiments on six public datasets demonstrate that IONext consistently outperforms state-of-the-art (SOTA) Transformer- and CNN-based methods. For instance, on the RNIN dataset, IONext reduces the average ATE by 10% and the average RTE by 12% compared to the representative model iMOT.

Via

Access Paper or Ask Questions

PRISM: Distributed Inference for Foundation Models at Edge

Jul 16, 2025

Muhammad Azlan Qazi, Alexandros Iosifidis, Qi Zhang

Abstract:Foundation models (FMs) have achieved remarkable success across a wide range of applications, from image classification to natural langurage processing, but pose significant challenges for deployment at edge. This has sparked growing interest in developing practical and efficient strategies for bringing foundation models to edge environments. In this work, we propose PRISM, a communication-efficient and compute-aware strategy for distributed Transformer inference on edge devices. Our method leverages a Segment Means representation to approximate intermediate output features, drastically reducing inter-device communication. Additionally, we restructure the self-attention mechanism to eliminate redundant computations caused by per-device Key/Value calculation in position-wise partitioning and design a partition-aware causal masking scheme tailored for autoregressive models. We evaluate PRISM on ViT, BERT, and GPT-2 across diverse datasets, namely CIFAR-10, CIFAR-100, ImageNet-1k, GLUE, and CBT. Our results demonstrate substantial reductions in communication overhead (up to 99.2% for BERT at compression rate CR = 128) and per-device computation (51.24% for BERT at the same setting), with only minor accuracy degradation. This method offers a scalable and practical solution for deploying foundation models in distributed resource-constrained environments.

Via

Access Paper or Ask Questions

Signal Prediction for Loss Mitigation in Tactile Internet: A Leader-Follower Game-Theoretic Approach

Jul 10, 2025

Mohammad Ali Vahedifar, Qi Zhang

Abstract:Tactile Internet (TI) requires achieving ultra-low latency and highly reliable packet delivery for haptic signals. In the presence of packet loss and delay, the signal prediction method provides a viable solution for recovering the missing signals. To this end, we introduce the Leader-Follower (LeFo) approach based on a cooperative Stackelberg game, which enables both users and robots to learn and predict actions. With accurate prediction, the teleoperation system can safely relax its strict delay requirements. Our method achieves high prediction accuracy, ranging from 80.62% to 95.03% for remote robot signals at the Human ($H$) side and from 70.44% to 89.77% for human operation signals at the remote Robot ($R$) side. We also establish an upper bound for maximum signal loss using Taylor Expansion, ensuring robustness.

* This work has been accepted for publication in the IEEE Machine Learning and Signal Processing Conference (MLSP 2025)

Via

Access Paper or Ask Questions

Perception Activator: An intuitive and portable framework for brain cognitive exploration

Jul 03, 2025

Le Xu, Qi Zhang, Qixian Zhang, Hongyun Zhang, Duoqian Miao, Cairong Zhao

Abstract:Recent advances in brain-vision decoding have driven significant progress, reconstructing with high fidelity perceived visual stimuli from neural activity, e.g., functional magnetic resonance imaging (fMRI), in the human visual cortex. Most existing methods decode the brain signal using a two-level strategy, i.e., pixel-level and semantic-level. However, these methods rely heavily on low-level pixel alignment yet lack sufficient and fine-grained semantic alignment, resulting in obvious reconstruction distortions of multiple semantic objects. To better understand the brain's visual perception patterns and how current decoding models process semantic objects, we have developed an experimental framework that uses fMRI representations as intervention conditions. By injecting these representations into multi-scale image features via cross-attention, we compare both downstream performance and intermediate feature changes on object detection and instance segmentation tasks with and without fMRI information. Our results demonstrate that incorporating fMRI signals enhances the accuracy of downstream detection and segmentation, confirming that fMRI contains rich multi-object semantic cues and coarse spatial localization information-elements that current models have yet to fully exploit or integrate.

Via

Access Paper or Ask Questions

COLUR: Confidence-Oriented Learning, Unlearning and Relearning with Noisy-Label Data for Model Restoration and Refinement

Jun 24, 2025

Zhihao Sui, Liang Hu, Jian Cao, Usman Naseem, Zhongyuan Lai, Qi Zhang

Abstract:Large deep learning models have achieved significant success in various tasks. However, the performance of a model can significantly degrade if it is needed to train on datasets with noisy labels with misleading or ambiguous information. To date, there are limited investigations on how to restore performance when model degradation has been incurred by noisy label data. Inspired by the ``forgetting mechanism'' in neuroscience, which enables accelerating the relearning of correct knowledge by unlearning the wrong knowledge, we propose a robust model restoration and refinement (MRR) framework COLUR, namely Confidence-Oriented Learning, Unlearning and Relearning. Specifically, we implement COLUR with an efficient co-training architecture to unlearn the influence of label noise, and then refine model confidence on each label for relearning. Extensive experiments are conducted on four real datasets and all evaluation results show that COLUR consistently outperforms other SOTA methods after MRR.

* IJCAI 2025

Via

Access Paper or Ask Questions

Recalling The Forgotten Class Memberships: Unlearned Models Can Be Noisy Labelers to Leak Privacy

Jun 24, 2025

Zhihao Sui, Liang Hu, Jian Cao, Dora D. Liu, Usman Naseem, Zhongyuan Lai, Qi Zhang

Figure 1 for Recalling The Forgotten Class Memberships: Unlearned Models Can Be Noisy Labelers to Leak Privacy

Figure 2 for Recalling The Forgotten Class Memberships: Unlearned Models Can Be Noisy Labelers to Leak Privacy

Figure 3 for Recalling The Forgotten Class Memberships: Unlearned Models Can Be Noisy Labelers to Leak Privacy

Figure 4 for Recalling The Forgotten Class Memberships: Unlearned Models Can Be Noisy Labelers to Leak Privacy

Abstract:Machine Unlearning (MU) technology facilitates the removal of the influence of specific data instances from trained models on request. Despite rapid advancements in MU technology, its vulnerabilities are still underexplored, posing potential risks of privacy breaches through leaks of ostensibly unlearned information. Current limited research on MU attacks requires access to original models containing privacy data, which violates the critical privacy-preserving objective of MU. To address this gap, we initiate an innovative study on recalling the forgotten class memberships from unlearned models (ULMs) without requiring access to the original one. Specifically, we implement a Membership Recall Attack (MRA) framework with a teacher-student knowledge distillation architecture, where ULMs serve as noisy labelers to transfer knowledge to student models. Then, it is translated into a Learning with Noisy Labels (LNL) problem for inferring the correct labels of the forgetting instances. Extensive experiments on state-of-the-art MU methods with multiple real datasets demonstrate that the proposed MRA strategy exhibits high efficacy in recovering class memberships of unlearned instances. As a result, our study and evaluation have established a benchmark for future research on MU vulnerabilities.

* IJCAI 2025

Via

Access Paper or Ask Questions

Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law

Jun 16, 2025

Qiming Ge, Shuhao Xing, Songyang Gao, Yunhua Zhou, Yicheng Zou, Songyang Zhang, Zhi Chen, Hang Yan, Qi Zhang, Qipeng Guo(+1 more)

Figure 1 for Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law

Figure 2 for Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law

Figure 3 for Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law

Figure 4 for Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law

Abstract:Scaling law builds the relationship between training computation and validation loss, enabling researchers to effectively predict the loss trending of models across different levels of computation. However, a gap still remains between validation loss and the model's downstream capabilities, making it untrivial to apply scaling law to direct performance prediction for downstream tasks. The loss typically represents a cumulative penalty for predicted tokens, which are implicitly considered to have equal importance. Nevertheless, our studies have shown evidence that when considering different training data distributions, we cannot directly model the relationship between downstream capability and computation or token loss. To bridge the gap between validation loss and downstream task capabilities, in this work, we introduce Capability Salience Vector, which decomposes the overall loss and assigns different importance weights to tokens to assess a specific meta-capability, aligning the validation loss with downstream task performance in terms of the model's capabilities. Experiments on various popular benchmarks demonstrate that our proposed Capability Salience Vector could significantly improve the predictability of language model performance on downstream tasks.

* 9 pages, 9 figures, ACL2025

Via

Access Paper or Ask Questions

Conditional Diffusion Model-Driven Generative Channels for Double RIS-Aided Wireless Systems

Jun 15, 2025

Yiyang Ni, Qi Zhang, Guangji Chen, Yan Cai, Jun Li, Shi Jin

Abstract:With the development of the upcoming sixth-generation networks (6G), reconfigurable intelligent surfaces (RISs) have gained significant attention due to its ability of reconfiguring wireless channels via smart reflections. However, traditional channel state information (CSI) acquisition techniques for double-RIS systems face challenges (e.g., high pilot overhead or multipath interference). This paper proposes a new channel generation method in double-RIS communication systems based on the tool of conditional diffusion model (CDM). The CDM is trained on synthetic channel data to capture channel characteristics. It addresses the limitations of traditional CSI generation methods, such as insufficient model understanding capability and poor environmental adaptability. We provide a detailed analysis of the diffusion process for channel generation, and it is validated through simulations. The simulation results demonstrate that the proposed CDM based method outperforms traditional channel acquisition methods in terms of normalized mean squared error (NMSE). This method offers a new paradigm for channel acquisition in double-RIS systems, which is expected to improve the quality of channel acquisition with low pilot overhead.

* 5 pages, 4 figures

Via

Access Paper or Ask Questions

Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction

Jun 14, 2025

Xiaoran Fan, Zhichao Sun, Yangfan Gao, Jingfei Xiong, Hang Yan, Yifei Cao, Jiajun Sun, Shuo Li, Zhihao Zhang, Zhiheng Xi(+14 more)

Figure 1 for Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction

Figure 2 for Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction

Figure 3 for Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction

Figure 4 for Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction

Abstract:Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the impact of key components (i.e., speech tokenizers, speech heads, and speaker modeling) on the performance of LLM-centric SLMs. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12$\times$ faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.

Via

Access Paper or Ask Questions