Title: ZipLM: Hardware-Aware Structured Pruning of Language Models

Feb 07, 2023

Eldar Kurtic, Elias Frantar, Dan Alistarh

Abstract: The breakthrough performance of large language models (LLMs) comes with large computational footprints and high deployment costs. In this paper, we make progress towards resolving this problem by proposing a new structured compression approach for LLMs, called ZipLM, which provides state-of-the-art compression-vs-accuracy results while guaranteeing to match a set of (achievable) target speedups on any given target hardware. Specifically, given a task, a model, an inference environment, and a set of speedup targets, ZipLM identifies and removes redundancies in the model through iterative structured shrinking of the model's weight matrices. Importantly, ZipLM works in both the post-training/one-shot and the gradual compression settings, where it produces a set of accurate models in a single run, making it highly efficient in practice. Our approach is based on new structured pruning and knowledge distillation techniques, and consistently outperforms prior structured compression methods in terms of accuracy-versus-speedup in experiments on BERT- and GPT-family models. In particular, when compressing the GPT2 model, it outperforms DistilGPT2 while being 60% smaller and 30% faster. Further, ZipLM matches the performance of the heavily optimized MobileBERT model, obtained via extensive architecture search, by simply pruning the baseline BERT-large architecture, and outperforms all prior BERT-base compression techniques such as CoFi, MiniLM and TinyBERT.
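
To make the idea of iterative structured shrinking against a set of speedup targets more concrete, here is a minimal toy sketch in plain Python. It is not the ZipLM algorithm: the magnitude-based saliency score, the `toy_latency` model, and all function names are assumptions introduced purely for illustration, and the paper's own pruning criterion, distillation, and hardware measurements are not reproduced. The sketch removes whole hidden neurons from a two-matrix feed-forward block (a row of `w1` together with the matching column of `w2`) and saves one snapshot per target in a single pass.

```python
def neuron_scores(w1, w2):
    """Toy saliency per hidden neuron: norm of its w1 row times norm of its w2 column.
    (Assumption for illustration; not the criterion used in the paper.)"""
    scores = []
    for j in range(len(w1)):
        in_norm = sum(x * x for x in w1[j]) ** 0.5
        out_norm = sum(row[j] * row[j] for row in w2) ** 0.5
        scores.append(in_norm * out_norm)
    return scores


def drop_neuron(w1, w2, j):
    """Structured removal of hidden neuron j: row j of w1 and column j of w2."""
    new_w1 = [row for i, row in enumerate(w1) if i != j]
    new_w2 = [[v for i, v in enumerate(row) if i != j] for row in w2]
    return new_w1, new_w2


def toy_latency(hidden_width):
    """Hypothetical latency model standing in for timings on the target hardware."""
    return 1.0 + 0.5 * hidden_width


def prune_to_speedups(w1, w2, targets, latency=toy_latency):
    """Shrink the block once, saving a snapshot each time a speedup target is met."""
    base = latency(len(w1))
    snapshots = {}
    for target in sorted(targets):
        # Keep removing the least salient neuron until this target is reached.
        while len(w1) > 1 and base / latency(len(w1)) < target:
            scores = neuron_scores(w1, w2)
            w1, w2 = drop_neuron(w1, w2, scores.index(min(scores)))
        # Unachievable targets are skipped rather than reported as met.
        if base / latency(len(w1)) >= target:
            snapshots[target] = (w1, w2)
    return snapshots


# Example: a block with 4 hidden neurons, 2 inputs, and 2 outputs.
w1 = [[1.0, 0.0], [0.1, 0.1], [0.0, 2.0], [0.5, 0.5]]
w2 = [[1.0, 0.2, 1.0, 0.1], [0.0, 0.1, 1.0, 0.3]]
models = prune_to_speedups(w1, w2, targets=[1.5, 2.0])
for target, (p1, p2) in models.items():
    print(f"speedup >= {target}: {len(p1)} hidden neurons kept")
```

Because the targets are processed in increasing order, each snapshot continues from the previous one, which mirrors the abstract's point that a family of models can come out of one run. In a real setting the latency function would be replaced by measurements in the actual inference environment.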
