Department of Radiology, Zhejiang Cancer Hospital, Hangzhou, 310022, China, Hangzhou Institute of Medicine




Abstract:With the development of large language models (LLMs), efficient inference through Key-Value (KV) cache compression has attracted considerable attention, especially for long-context generation. To compress the KV cache, recent methods identify critical KV tokens through heuristic ranking with attention scores. However, these methods often struggle to accurately determine critical tokens as they neglect the \textit{temporal patterns} in attention scores, resulting in a noticeable degradation in LLM performance. To address this challenge, we propose AttentionPredictor, which is the first learning-based critical token identification approach. Specifically, AttentionPredictor learns a lightweight convolution model to capture spatiotemporal patterns and predict the next-token attention score. An appealing feature of AttentionPredictor is that it accurately predicts the attention score while consuming negligible memory. Moreover, we propose a cross-token critical cache prefetching framework that hides the token estimation time overhead to accelerate the decoding stage. By retaining most of the attention information, AttentionPredictor achieves 16$\times$ KV cache compression with comparable LLM performance, significantly outperforming the state-of-the-art.
Abstract:Semantic information refers to the meaning conveyed through words, phrases, and contextual relationships within a given linguistic structure. Humans can leverage semantic information, such as familiar linguistic patterns and contextual cues, to reconstruct incomplete or masked speech signals in noisy environments. However, existing speech enhancement (SE) approaches often overlook the rich semantic information embedded in speech, which is crucial for improving intelligibility, speaker consistency, and overall quality of enhanced speech signals. To enrich the SE model with semantic information, we employ language models as an efficient semantic learner and propose a comprehensive framework tailored for language model-based speech enhancement, called \textit{GenSE}. Specifically, we approach SE as a conditional language modeling task rather than a continuous signal regression problem defined in existing works. This is achieved by tokenizing speech signals into semantic tokens using a pre-trained self-supervised model and into acoustic tokens using a custom-designed single-quantizer neural codec model. To improve the stability of language model predictions, we propose a hierarchical modeling method that decouples the generation of clean semantic tokens and clean acoustic tokens into two distinct stages. Moreover, we introduce a token chain prompting mechanism during the acoustic token generation stage to ensure timbre consistency throughout the speech enhancement process. Experimental results on benchmark datasets demonstrate that our proposed approach outperforms state-of-the-art SE systems in terms of speech quality and generalization capability.




Abstract:Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. While integrating MoE into multimodal models like CLIP improves performance, training these models is notoriously challenging and expensive. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture. Through extensive experimentation with various settings and auxiliary losses, we demonstrate that CLIP-UP significantly reduces training complexity and cost. Remarkably, our sparse CLIP B/16 model, trained with CLIP-UP, outperforms its dense counterpart by 7.2% and 6.6% on COCO and Flickr30k text-to-image Recall@1 benchmarks respectively. It even surpasses the larger CLIP L/14 model on this task while using only 30% of the inference FLOPs. We further demonstrate the generalizability of our training recipe across different scales, establishing sparse upcycling as a practical and scalable approach for building efficient, high-performance CLIP models.
Abstract:An ideal multimodal agent should be aware of the quality of its input modalities. Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks. However, most audio LLMs remain unaware of the quality of the speech they process. This limitation arises because speech quality evaluation is typically excluded from multi-task training due to the lack of suitable datasets. To address this, we introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings. In addition to the overall Mean Opinion Score (MOS), this corpus offers detailed analysis across multiple dimensions and identifies causes of quality degradation. It also enables descriptive comparisons between two speech samples (A/B tests) with human-like judgment. Leveraging this corpus, we propose an alignment approach with LLM distillation (ALLD) to guide the audio LLM in extracting relevant information from raw speech and generating meaningful responses. Experimental results demonstrate that ALLD outperforms the previous state-of-the-art regression model in MOS prediction, with a mean square error of 0.17 and an A/B test accuracy of 98.6%. Additionally, the generated responses achieve BLEU scores of 25.8 and 30.2 on two tasks, surpassing the capabilities of task-specific models. This work advances the comprehensive perception of speech signals by audio LLMs, contributing to the development of real-world auditory and sensory intelligent agents.




Abstract:Over-the-air (OTA) computation has emerged as a promising technique for efficiently aggregating data from massive numbers of wireless devices. OTA computations can be performed by analog or digital communications. Analog OTA systems are often constrained by limited function adaptability and their reliance on analog amplitude modulation. On the other hand, digital OTA systems may face limitations such as high computational complexity and limited adaptability to varying network configurations. To address these challenges, this paper proposes a novel digital OTA computation system with a channel-aware constellation design for demodulation mappers. The proposed system dynamically adjusts the constellation based on the channel conditions of participating nodes, enabling reliable computation of various functions. By incorporating channel randomness into the constellation design, the system prevent overlap of constellation points, reduces computational complexity, and mitigates excessive transmit power consumption under poor channel conditions. Numerical results demonstrate that the system achieves reliable NMSE performance across a range of scenarios, offering valuable insights into the choice of signal processing methods and weighting strategies under varying computation point configurations, node counts, and quantization levels. This work advances the state of digital OTA computation by addressing critical challenges in scalability, transmit power consumption, and function adaptability.




Abstract:Point cloud sampling plays a crucial role in reducing computation costs and storage requirements for various vision tasks. Traditional sampling methods, such as farthest point sampling, lack task-specific information and, as a result, cannot guarantee optimal performance in specific applications. Learning-based methods train a network to sample the point cloud for the targeted downstream task. However, they do not guarantee that the sampled points are the most relevant ones. Moreover, they may result in duplicate sampled points, which requires completion of the sampled point cloud through post-processing techniques. To address these limitations, we propose a contribution-based sampling network (CS-Net), where the sampling operation is formulated as a Top-k operation. To ensure that the network can be trained in an end-to-end way using gradient descent algorithms, we use a differentiable approximation to the Top-k operation via entropy regularization of an optimal transport problem. Our network consists of a feature embedding module, a cascade attention module, and a contribution scoring module. The feature embedding module includes a specifically designed spatial pooling layer to reduce parameters while preserving important features. The cascade attention module combines the outputs of three skip connected offset attention layers to emphasize the attractive features and suppress less important ones. The contribution scoring module generates a contribution score for each point and guides the sampling process to prioritize the most important ones. Experiments on the ModelNet40 and PU147 showed that CS-Net achieved state-of-the-art performance in two semantic-based downstream tasks (classification and registration) and two reconstruction-based tasks (compression and surface reconstruction).




Abstract:Facial recognition models are increasingly employed by commercial enterprises, government agencies, and cloud service providers for identity verification, consumer services, and surveillance. These models are often trained using vast amounts of facial data processed and stored in cloud-based platforms, raising significant privacy concerns. Users' facial images may be exploited without their consent, leading to potential data breaches and misuse. This survey presents a comprehensive review of current methods aimed at preserving facial image privacy in cloud-based services. We categorize these methods into two primary approaches: image obfuscation-based protection and adversarial perturbation-based protection. We provide an in-depth analysis of both categories, offering qualitative and quantitative comparisons of their effectiveness. Additionally, we highlight unresolved challenges and propose future research directions to improve privacy preservation in cloud computing environments.
Abstract:Private data, when published online, may be collected by unauthorized parties to train deep neural networks (DNNs). To protect privacy, defensive noises can be added to original samples to degrade their learnability by DNNs. Recently, unlearnable examples are proposed to minimize the training loss such that the model learns almost nothing. However, raw data are often pre-processed before being used for training, which may restore the private information of protected data. In this paper, we reveal the data privacy violation induced by data augmentation, a commonly used data pre-processing technique to improve model generalization capability, which is the first of its kind as far as we are concerned. We demonstrate that data augmentation can significantly raise the accuracy of the model trained on unlearnable examples from 21.3% to 66.1%. To address this issue, we propose a defense framework, dubbed ARMOR, to protect data privacy from potential breaches of data augmentation. To overcome the difficulty of having no access to the model training process, we design a non-local module-assisted surrogate model that better captures the effect of data augmentation. In addition, we design a surrogate augmentation selection strategy that maximizes distribution alignment between augmented and non-augmented samples, to choose the optimal augmentation strategy for each class. We also use a dynamic step size adjustment algorithm to enhance the defensive noise generation process. Extensive experiments are conducted on 4 datasets and 5 data augmentation methods to verify the performance of ARMOR. Comparisons with 6 state-of-the-art defense methods have demonstrated that ARMOR can preserve the unlearnability of protected private data under data augmentation. ARMOR reduces the test accuracy of the model trained on augmented protected samples by as much as 60% more than baselines.




Abstract:Coverage optimization generally involves deploying a set of facilities (e.g., sensors) to best satisfy the demands of specified points, with wide applications in fields such as location science and sensor networks. In practical applications, coverage optimization focuses on target coverage, which is typically formulated as Mixed-Variable Optimization Problems (MVOPs) due to complex real-world constraints. Meanwhile, high-fidelity discretization and visibility analysis may bring additional calculations, which significantly increases the computational cost. These factors pose significant challenges for fitness evaluations (FEs) in canonical Evolutionary Algorithms (EAs), and evolve the coverage problem into an Expensive Mixed-Variable Optimization Problem (EMVOP). To address these issues, we propose the RankNet-Inspired Surrogate-assisted Hybrid Metaheuristic (RI-SHM), an extension of our previous work. RI-SHM integrates three key components: (1) a RankNet-based pairwise global surrogate that innovatively predicts rankings between pairs of individuals, bypassing the challenges of fitness estimation in discontinuous solution space; (2) a surrogate-assisted local Estimation of Distribution Algorithm (EDA) that enhances local exploitation and helps escape from local optima; and (3) a fitness diversity-driven switching strategy that dynamically balances exploration and exploitation. Experiments demonstrate that our algorithm can effectively handle large-scale coverage optimization tasks of up to 300 dimensions and more than 1,800 targets within desirable runtime. Compared to state-of-the-art algorithms for EMVOPs, RI-SHM consistently outperforms them by up to 56.5$\%$ across all tested instances.




Abstract:6-Degree of Freedom (6DoF) motion estimation with a combination of visual and inertial sensors is a growing area with numerous real-world applications. However, precise calibration of the time offset between these two sensor types is a prerequisite for accurate and robust tracking. To address this, we propose a universal online temporal calibration strategy for optimization-based visual-inertial navigation systems. Technically, we incorporate the time offset td as a state parameter in the optimization residual model to align the IMU state to the corresponding image timestamp using td, angular velocity and translational velocity. This allows the temporal misalignment td to be optimized alongside other tracking states during the process. As our method only modifies the structure of the residual model, it can be applied to various optimization-based frameworks with different tracking frontends. We evaluate our calibration method with both EuRoC and simulation data and extensive experiments demonstrate that our approach provides more accurate time offset estimation and faster convergence, particularly in the presence of noisy sensor data.