Xuyang Wei

RAP: Runtime-Adaptive Pruning for LLM Inference

May 26, 2025

Huanrong Liu, Chunlin Tian, Xuyang Wei, Jiaheng Dai, Qin Liu, Tianqi Wei, Qingbiao Li, Li Li

Abstract: Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential solution to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that dynamically adjusts compression strategies in a runtime-aware manner. Specifically, RAP tracks the evolving ratio between model parameters and KV-cache during execution. Recognizing that FFNs house most parameters, whereas parameter-light attention layers dominate KV-cache formation, the RL agent retains only those components that maximize utility within the current memory budget, conditioned on instantaneous workload and device state. Extensive experimental results demonstrate that RAP outperforms state-of-the-art baselines, making it the first approach to jointly consider model weights and KV-cache on the fly.
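The memory split the abstract describes (FFNs hold most weights, attention drives the KV-cache) can be illustrated with some back-of-the-envelope accounting. The sketch below is not RAP's RL agent; it only computes the two quantities whose ratio the abstract says RAP tracks. The function names, the gated-FFN assumption, the fp16 element size, and the Llama-2-7B-like shapes are all illustrative assumptions, and biases and normalization weights are ignored.

```python
def ffn_params(d_model: int, d_ff: int, gated: bool = True) -> int:
    """Weights in one FFN block (assumes a gated FFN has 3 matrices, a plain one 2)."""
    return (3 if gated else 2) * d_model * d_ff


def attn_params(d_model: int, n_heads: int, n_kv_heads: int) -> int:
    """Weights in one attention block: Q and O projections plus K and V projections."""
    head_dim = d_model // n_heads
    q_and_o = 2 * d_model * d_model
    k_and_v = 2 * d_model * n_kv_heads * head_dim
    return q_and_o + k_and_v


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size: one K and one V vector per layer, KV head, token, and sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def kv_to_weight_ratio(n_layers: int, d_model: int, d_ff: int, n_heads: int,
                       n_kv_heads: int, seq_len: int, batch: int,
                       bytes_per_elem: int = 2) -> float:
    """Ratio of KV-cache bytes to transformer-block weight bytes for a given workload."""
    per_layer = ffn_params(d_model, d_ff) + attn_params(d_model, n_heads, n_kv_heads)
    weight_bytes = n_layers * per_layer * bytes_per_elem
    cache = kv_cache_bytes(n_layers, n_kv_heads, d_model // n_heads,
                           seq_len, batch, bytes_per_elem)
    return cache / weight_bytes


if __name__ == "__main__":
    # Illustrative shapes only (assumed, roughly Llama-2-7B-like).
    shape = dict(n_layers=32, d_model=4096, d_ff=11008, n_heads=32, n_kv_heads=32)
    for batch, seq_len in [(1, 512), (8, 4096), (32, 4096)]:
        ratio = kv_to_weight_ratio(seq_len=seq_len, batch=batch, **shape)
        print(f"batch={batch:>2} seq_len={seq_len:>4} KV/weights={ratio:.2f}")
```

Under these assumptions the ratio scales linearly with batch size and sequence length while the weights stay fixed, which is why a static pruning plan can be a poor fit once the request mix changes.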

AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs

Feb 27, 2025

Xuyang Wei, Chunlin Tian, Li Li

Abstract: Effective instruction fine-tuning on diverse image-text datasets is crucial for developing a versatile Multimodal Large Language Model (MLLM), where dataset composition dictates the model's adaptability across multimodal tasks. However, complex datasets often contain inherent conflicts -- stemming from modality-specific optimization objectives -- and latent commonalities that enable cross-task transfer, which most existing approaches handle separately. To bridge this gap, we introduce AsymLoRA, a parameter-efficient tuning framework that unifies knowledge modularization and cross-modal coordination via asymmetric LoRA: task-specific low-rank projections (matrix B) that preserve distinct adaptation pathways for conflicting objectives, and a shared projection (matrix A) that consolidates cross-modal commonalities. Extensive evaluations demonstrate that AsymLoRA consistently surpasses both vanilla LoRA, which captures only commonalities, and LoRA-MoE, which focuses solely on conflicts, achieving superior model performance and system efficiency across diverse benchmarks. Code: https://github.com/Clin0212/HydraLoRA/blob/main/MLLM-HydraLoRA/README.md
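The asymmetric structure described above, one shared matrix A and one matrix B per task, can be sketched as a single linear layer. This is a minimal NumPy illustration, not the authors' implementation: the class name, the explicit `task_id` argument (the released code may select or mix B matrices differently, for example with a router), the initialization scales, and the `alpha / rank` scaling borrowed from standard LoRA are all assumptions.

```python
import numpy as np


class AsymLoRALinear:
    """Frozen weight W plus a low-rank update B[task] @ A with A shared across tasks."""

    def __init__(self, d_in, d_out, rank, num_tasks, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02   # frozen base weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01    # shared down-projection
        self.B = np.zeros((num_tasks, d_out, rank))          # per-task up-projections
        self.scale = alpha / rank

    def forward(self, x, task_id):
        # x has shape (batch, d_in); the shared A is applied first, then the task's B.
        shared = x @ self.A.T                  # (batch, rank)
        delta = shared @ self.B[task_id].T     # (batch, d_out)
        return x @ self.W.T + self.scale * delta

    def num_adapter_params(self):
        # One A plus num_tasks B matrices.
        return self.A.size + self.B.size


if __name__ == "__main__":
    layer = AsymLoRALinear(d_in=8, d_out=4, rank=2, num_tasks=3)
    x = np.ones((5, 8))
    print(layer.forward(x, task_id=0).shape)
    print(layer.num_adapter_params())
```

With B initialized to zero, every task starts from the frozen layer's output, and updating one task's B leaves the other tasks untouched while all of them still share A. Compared with training a separate (A, B) pair per task, the shared A also removes `(num_tasks - 1) * rank * d_in` adapter parameters.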
