Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuxing Han

Tsinghua University, China

OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

May 12, 2026

Yuchen Deng, Zidang Cai, Hai-Tao Zheng, Jie Wang, Feidiao Yang, Yuxing Han

Abstract:Omnimodal large language models (Omni-LLMs) show strong capability in audio-video understanding, but their practical deployment remains limited by high inference cost of long video streams and dense audio sequences. Despite recent progress, existing compression methods for Omni-LLMs typically rely on fixed or native compression units, which can disrupt cross-modal correspondence and the complementary information required for audio-video reasoning, making it difficult to improve inference efficiency while stably preserving performance. To address this, we propose OmniRefine, a training-free two-stage framework for efficient audio-visual token compression in Omni-LLMs. First, Correspondence-Preserving Chunk Refinement refines native chunk boundaries into cross-modally aligned compression units through frame-audio similarity and dynamic programming. Second, Modality-Aware Cooperative Compression jointly compresses video and audio tokens within each refined unit to reduce redundancy while preserving critical evidence. Extensive experiments show that OmniRefine achieves a better efficiency-performance trade-off than strong baselines and maintains stable performance under lower compression ratios. On WorldSense, it still reaches 46.7% accuracy at a 44% token retention ratio, nearly matching the full-token baseline. The code and interface will be released to facilitate further research.

Via

Access Paper or Ask Questions

Diffusion Path Alignment for Long-Range Motion Generation and Domain Transitions

Mar 31, 2026

Haichao Wang, Alexander Okupnik, Yuxing Han, Gene Wen, Johannes Schneider, Kyriakos Flouris

Abstract:Long-range human movement generation remains a central challenge in computer vision and graphics. Generating coherent transitions across semantically distinct motion domains remains largely unexplored. This capability is particularly important for applications such as dance choreography, where movements must fluidly transition across diverse stylistic and semantic motifs. We propose a simple and effective inference-time optimization framework inspired by diffusion-based stochastic optimal control. Specifically, a control-energy objective that explicitly regularizes the transition trajectories of a pretrained diffusion model. We show that optimizing this objective at inference time yields transitions with fidelity and temporal coherence. This is the first work to provide a general framework for controlled long-range human motion generation with explicit transition modeling.

Via

Access Paper or Ask Questions

FedGRPO: Privately Optimizing Foundation Models with Group-Relative Rewards from Domain Client

Feb 12, 2026

Gongxi Zhu, Hanlin Gu, Lixin Fan, Qiang Yang, Yuxing Han

Abstract:One important direction of Federated Foundation Models (FedFMs) is leveraging data from small client models to enhance the performance of a large server-side foundation model. Existing methods based on model level or representation level knowledge transfer either require expensive local training or incur high communication costs and introduce unavoidable privacy risks. We reformulate this problem as a reinforcement learning style evaluation process and propose FedGRPO, a privacy preserving framework comprising two modules. The first module performs competence-based expert selection by building a lightweight confidence graph from auxiliary data to identify the most suitable clients for each question. The second module leverages the "Group Relative" concept from the Group Relative Policy Optimization (GRPO) framework by packaging each question together with its solution rationale into candidate policies, dispatching these policies to a selected subset of expert clients, and aggregating solely the resulting scalar reward signals via a federated group-relative loss function. By exchanging reward values instead of data or model updates, FedGRPO reduces privacy risk and communication overhead while enabling parallel evaluation across heterogeneous devices. Empirical results on diverse domain tasks demonstrate that FedGRPO achieves superior downstream accuracy and communication efficiency compared to conventional FedFMs baselines.

* Accepted by AAAI 2026 as Oral

Via

Access Paper or Ask Questions

Backslash: Rate Constrained Optimized Training of Large Language Models

Apr 23, 2025

Jun Wu, Jiangtao Wen, Yuxing Han

Abstract:The rapid advancement of large-language models (LLMs) has driven extensive research into parameter compression after training has been completed, yet compression during the training phase remains largely unexplored. In this work, we introduce Rate-Constrained Training (Backslash), a novel training-time compression approach based on rate-distortion optimization (RDO). Backslash enables a flexible trade-off between model accuracy and complexity, significantly reducing parameter redundancy while preserving performance. Experiments in various architectures and tasks demonstrate that Backslash can reduce memory usage by 60\% - 90\% without accuracy loss and provides significant compression gain compared to compression after training. Moreover, Backslash proves to be highly versatile: it enhances generalization with small Lagrange multipliers, improves model robustness to pruning (maintaining accuracy even at 80\% pruning rates), and enables network simplification for accelerated inference on edge devices.

Via

Access Paper or Ask Questions

Flow Intelligence: Robust Feature Matching via Temporal Signature Correlation

Apr 16, 2025

Jie Wang, Chen Ye Gan, Caoqi Wei, Jiangtao Wen, Yuxing Han

Figure 1 for Flow Intelligence: Robust Feature Matching via Temporal Signature Correlation

Figure 2 for Flow Intelligence: Robust Feature Matching via Temporal Signature Correlation

Figure 3 for Flow Intelligence: Robust Feature Matching via Temporal Signature Correlation

Figure 4 for Flow Intelligence: Robust Feature Matching via Temporal Signature Correlation

Abstract:Feature matching across video streams remains a cornerstone challenge in computer vision. Increasingly, robust multimodal matching has garnered interest in robotics, surveillance, remote sensing, and medical imaging. While traditional rely on detecting and matching spatial features, they break down when faced with noisy, misaligned, or cross-modal data. Recent deep learning methods have improved robustness through learned representations, but remain constrained by their dependence on extensive training data and computational demands. We present Flow Intelligence, a paradigm-shifting approach that moves beyond spatial features by focusing on temporal motion patterns exclusively. Instead of detecting traditional keypoints, our method extracts motion signatures from pixel blocks across consecutive frames and extract temporal motion signatures between videos. These motion-based descriptors achieve natural invariance to translation, rotation, and scale variations while remaining robust across different imaging modalities. This novel approach also requires no pretraining data, eliminates the need for spatial feature detection, enables cross-modal matching using only temporal motion, and it outperforms existing methods in challenging scenarios where traditional approaches fail. By leveraging motion rather than appearance, Flow Intelligence enables robust, real-time video feature matching in diverse environments.

Via

Access Paper or Ask Questions

Snapshot Compressed Imaging Based Single-Measurement Computer Vision for Videos

Jan 25, 2025

Fengpu Pan, Jiangtao Wen, Yuxing Han

Figure 1 for Snapshot Compressed Imaging Based Single-Measurement Computer Vision for Videos

Figure 2 for Snapshot Compressed Imaging Based Single-Measurement Computer Vision for Videos

Figure 3 for Snapshot Compressed Imaging Based Single-Measurement Computer Vision for Videos

Figure 4 for Snapshot Compressed Imaging Based Single-Measurement Computer Vision for Videos

Abstract:Snapshot compressive imaging (SCI) is a promising technique for capturing high-speed video at low bandwidth and low power, typically by compressing multiple frames into a single measurement. However, similar to traditional CMOS image sensor based imaging systems, SCI also faces challenges in low-lighting photon-limited and low-signal-to-noise-ratio image conditions. In this paper, we propose a novel Compressive Denoising Autoencoder (CompDAE) using the STFormer architecture as the backbone, to explicitly model noise characteristics and provide computer vision functionalities such as edge detection and depth estimation directly from compressed sensing measurements, while accounting for realistic low-photon conditions. We evaluate the effectiveness of CompDAE across various datasets and demonstrated significant improvements in task performance compared to conventional RGB-based methods. In the case of ultra-low-lighting (APC $\leq$ 20) while conventional methods failed, the proposed algorithm can still maintain competitive performance.

Via

Access Paper or Ask Questions

Efficient Video Neural Network Processing Based on Motion Estimation

Jan 25, 2025

Haichao Wang, Jiangtao Wen, Yuxing Han

Abstract:Video neural network (VNN) processing using the conventional pipeline first converts Bayer video information into human understandable RGB videos using image signal processing (ISP) on a pixel by pixel basis. Then, VNN processing is performed on a frame by frame basis. Both ISP and VNN are computationally expensive with high power consumption and latency. In this paper, we propose an efficient VNN processing framework. Instead of using ISP, computer vision tasks are directly accomplished using Bayer pattern information. To accelerate VNN processing, motion estimation is introduced to find temporal redundancies in input video data so as to avoid repeated and unnecessary computations. Experiments show greater than 67\% computation reduction, while maintaining computer vision task accuracy for typical computer vision tasks and data sets.

Via

Access Paper or Ask Questions

Disentangling data distribution for Federated Learning

Oct 16, 2024

Xinyuan Zhao, Hanlin Gu, Lixin Fan, Qiang Yang, Yuxing Han

Abstract:Federated Learning (FL) facilitates collaborative training of a global model whose performance is boosted by private data owned by distributed clients, without compromising data privacy. Yet the wide applicability of FL is hindered by entanglement of data distributions across different clients. This paper demonstrates for the first time that by disentangling data distributions FL can in principle achieve efficiencies comparable to those of distributed systems, requiring only one round of communication. To this end, we propose a novel FedDistr algorithm, which employs stable diffusion models to decouple and recover data distributions. Empirical results on the CIFAR100 and DomainNet datasets show that FedDistr significantly enhances model utility and efficiency in both disentangled and near-disentangled scenarios while ensuring privacy, outperforming traditional federated learning methods.

Via

Access Paper or Ask Questions

Breaking the Hourglass Phenomenon of Residual Quantization: Enhancing the Upper Bound of Generative Retrieval

Jul 31, 2024

Zhirui Kuai, Zuxu Chen, Huimu Wang, Mingming Li, Dadong Miao, Binbin Wang, Xusong Chen, Li Kuang, Yuxing Han, Jiaxing Wang(+4 more)

Figure 1 for Breaking the Hourglass Phenomenon of Residual Quantization: Enhancing the Upper Bound of Generative Retrieval

Figure 2 for Breaking the Hourglass Phenomenon of Residual Quantization: Enhancing the Upper Bound of Generative Retrieval

Figure 3 for Breaking the Hourglass Phenomenon of Residual Quantization: Enhancing the Upper Bound of Generative Retrieval

Figure 4 for Breaking the Hourglass Phenomenon of Residual Quantization: Enhancing the Upper Bound of Generative Retrieval

Abstract:Generative retrieval (GR) has emerged as a transformative paradigm in search and recommender systems, leveraging numeric-based identifier representations to enhance efficiency and generalization. Notably, methods like TIGER employing Residual Quantization-based Semantic Identifiers (RQ-SID), have shown significant promise in e-commerce scenarios by effectively managing item IDs. However, a critical issue termed the "\textbf{Hourglass}" phenomenon, occurs in RQ-SID, where intermediate codebook tokens become overly concentrated, hindering the full utilization of generative retrieval methods. This paper analyses and addresses this problem by identifying data sparsity and long-tailed distribution as the primary causes. Through comprehensive experiments and detailed ablation studies, we analyze the impact of these factors on codebook utilization and data distribution. Our findings reveal that the "Hourglass" phenomenon substantially impacts the performance of RQ-SID in generative retrieval. We propose effective solutions to mitigate this issue, thereby significantly enhancing the effectiveness of generative retrieval in real-world E-commerce applications.

Via

Access Paper or Ask Questions

Unlearning during Learning: An Efficient Federated Machine Unlearning Method

May 24, 2024

Hanlin Gu, Gongxi Zhu, Jie Zhang, Xinyuan Zhao, Yuxing Han, Lixin Fan, Qiang Yang

Figure 1 for Unlearning during Learning: An Efficient Federated Machine Unlearning Method

Figure 2 for Unlearning during Learning: An Efficient Federated Machine Unlearning Method

Figure 3 for Unlearning during Learning: An Efficient Federated Machine Unlearning Method

Figure 4 for Unlearning during Learning: An Efficient Federated Machine Unlearning Method

Abstract:In recent years, Federated Learning (FL) has garnered significant attention as a distributed machine learning paradigm. To facilitate the implementation of the right to be forgotten, the concept of federated machine unlearning (FMU) has also emerged. However, current FMU approaches often involve additional time-consuming steps and may not offer comprehensive unlearning capabilities, which renders them less practical in real FL scenarios. In this paper, we introduce FedAU, an innovative and efficient FMU framework aimed at overcoming these limitations. Specifically, FedAU incorporates a lightweight auxiliary unlearning module into the learning process and employs a straightforward linear operation to facilitate unlearning. This approach eliminates the requirement for extra time-consuming steps, rendering it well-suited for FL. Furthermore, FedAU exhibits remarkable versatility. It not only enables multiple clients to carry out unlearning tasks concurrently but also supports unlearning at various levels of granularity, including individual data samples, specific classes, and even at the client level. We conducted extensive experiments on MNIST, CIFAR10, and CIFAR100 datasets to evaluate the performance of FedAU. The results demonstrate that FedAU effectively achieves the desired unlearning effect while maintaining model accuracy.

* Accepted by IJCAI 2024

Via

Access Paper or Ask Questions