Abstract:Structured pruning is widely used to compress large language models (LLMs), yet its effectiveness depends heavily on neuron importance estimation. Most existing methods estimate neuron importance from activation statistics on a single calibration dataset, which introduces calibration bias and degrades downstream cross-task generalization. We observe that neurons exhibit heterogeneous distribution sensitivity, with distribution-robust neurons maintaining consistent rankings across datasets and distribution-sensitive neurons showing high cross-dataset ranking variance. Based on this, we identify two structural limitations in existing methods. First, ranking all neurons within a shared space causes distribution-sensitive neurons that strongly activate on calibration inputs to dominate, crowding out distribution-robust neurons critical for out-of-distribution tasks. Second, applying activation-based importance metrics uniformly can be unreliable. Distribution-sensitive neurons that infrequently activate on calibration data receive insufficient activation signal for accurate local ranking. To address these limitations, we propose GPrune-LLM, a generalization-aware structured pruning framework that explicitly accounts for neuron differences in cross-distribution behavior. We first partition neurons into behavior-consistent modules to localize ranking competition, then evaluate activation-based metric reliability per module according to distribution sensitivity and score magnitude. For modules where activation-based scoring is unreliable, we switch to an activation-independent metric. Finally, we adaptively learn module-wise sparsity. Extensive experiments across multiple downstream tasks demonstrate GPrune-LLM's consistent improvements in post-compression generalization, particularly at high sparsity, and reduced dependence on importance metric choice.




Abstract:In continual learning (CL), model growth enhances adaptability over new data, improving knowledge retention for more tasks. However, improper model growth can lead to severe degradation of previously learned knowledge, an issue we name as growth-induced forgetting (GIFt), especially in task-agnostic CL using entire grown model for inference. Existing works, despite adopting model growth and random initialization for better adaptability, often fail to recognize the presence of GIFt caused by improper model growth. This oversight limits comprehensive control of forgetting and hinders full utilization of model growth. We are the first in CL to identify this issue and conduct an in-depth study on root cause of GIFt, where layer expansion stands out among model growth strategies, widening layers without affecting model functionality. Yet, direct adoption of layer expansion presents challenges. It lacks data-driven control and initialization of expanded parameters to balance adaptability and knowledge retention. This paper presents a novel SparseGrow approach to overcome the issue of GIFt while enhancing adaptability over new data. SparseGrow employs data-driven sparse layer expansion to control efficient parameter usage during growth, reducing GIFt from excessive growth and functionality changes. It also combines sparse growth with on-data initialization at training late-stage to create partially 0-valued expansions that fit learned distribution, enhancing retention and adaptability. To further minimize forgetting, freezing is applied by calculating the sparse mask, allowing data-driven preservation of important parameters. Through experiments across datasets with various settings, cases and task numbers, we demonstrate the necessity of layer expansion and showcase the effectiveness of SparseGrow in overcoming GIFt, highlighting its adaptability and knowledge retention for incremental tasks.




Abstract:Differentiable architecture search (DAS) revolutionizes neural architecture search (NAS) with time-efficient automation, transitioning from discrete candidate sampling and evaluation to differentiable super-net optimization and discretization. However, existing DAS methods either only conduct coarse-grained operation-level search or manually define the remaining ratios for fine-grained kernel-level and weight-level units, which fail to simultaneously optimize model size and model performance. Furthermore, these methods compromise search quality to reduce memory consumption. To tackle these issues, we introduce multi-granularity architecture search (MGAS), a unified framework which aims to comprehensively and memory-efficiently explore the multi-granularity search space to discover both effective and efficient neural networks. Specifically, we learn discretization functions specific to each granularity level to adaptively determine the remaining ratios according to the evolving architecture. This ensures an optimal balance among units of different granularity levels for different target model sizes. Considering the memory demands, we break down the super-net optimization and discretization into multiple sub-net stages. Nevertheless, the greedy nature of this approach may introduce bias in the early stages. To compensate for the bias, we propose progressive re-evaluation to allow for re-pruning and regrowing of previous units during subsequent stages. Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate that MGAS outperforms other state-of-the-art methods in achieving a better trade-off between model performance and model size.