Title: Rank and run-time aware compression of NLP Applications

Oct 06, 2020

Urmish Thakker, Jesse Beu, Dibakar Gope, Ganesh Dasika, Matthew Mattina

Abstract: Sequence-model-based NLP applications can be large. Yet many applications that benefit from them run on small devices with very limited compute and storage capabilities, while still having run-time constraints. As a result, there is a need for a compression technique that can achieve significant compression without negatively impacting inference run-time and task accuracy. This paper proposes a new compression technique called Hybrid Matrix Factorization (HMF) that achieves this dual objective. HMF improves on low-rank matrix factorization (LMF) techniques by doubling the rank of the matrix using an intelligent hybrid structure, leading to better accuracy than LMF. Further, by preserving dense matrices, it leads to faster inference run-time than pruning or structured-matrix-based compression techniques. We evaluate the impact of this technique on 5 NLP benchmarks across multiple tasks (Translation, Intent Detection, Language Modeling) and show that for similar accuracy values and compression factors, HMF can achieve more than 2.32x faster inference run-time than pruning and 16.77% better accuracy than LMF.

* Published at SustaiNLP@EMNLP 2020. arXiv admin note: text overlap with arXiv:1906.04886
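
To make the rank argument in the abstract concrete, here is a minimal sketch of the parameter and rank accounting for an m x n weight matrix. The LMF part is the standard factorization W ≈ U V. The "hybrid" layout (a few rows kept dense, the remaining rows low-rank) is only an assumption chosen for illustration, since the abstract does not spell out the structure HMF uses; all function names are hypothetical.

```python
# Parameter and rank accounting for compressing an m x n weight matrix.
# NOTE: the hybrid layout below is an ASSUMPTION for illustration only;
# the abstract does not specify the exact structure used by HMF.

def dense_params(m, n):
    """Parameters in an uncompressed m x n matrix."""
    return m * n

def lmf_params(m, n, r):
    """Low-rank factorization W ~= U @ V with U: m x r and V: r x n."""
    return r * (m + n)

def lmf_max_rank(m, n, r):
    """A product of an m x r and an r x n matrix has rank at most r."""
    return min(m, n, r)

def hybrid_params(m, n, k, r):
    """Hypothetical hybrid: the top k rows are stored densely (k x n);
    the remaining m - k rows are factorized as (m - k) x r times r x n."""
    return k * n + r * ((m - k) + n)

def hybrid_max_rank(m, n, k, r):
    """Stacking two row blocks gives rank <= rank(top) + rank(bottom)."""
    top = min(k, n)
    bottom = min(m - k, n, r)
    return min(m, n, top + bottom)

if __name__ == "__main__":
    m, n, r = 1024, 256, 32
    print("dense params:       ", dense_params(m, n))
    print("LMF rank-32 params: ", lmf_params(m, n, r),
          "max rank", lmf_max_rank(m, n, r))
    print("LMF rank-64 params: ", lmf_params(m, n, 2 * r),
          "max rank", lmf_max_rank(m, n, 2 * r))
    print("hybrid k=32, r=32:  ", hybrid_params(m, n, r, r),
          "max rank", hybrid_max_rank(m, n, r, r))
```

Under these assumed shapes, the hybrid layout can reach rank 64 with 48,128 parameters, whereas a plain rank-64 LMF needs 81,920. Both layouts keep every stored block dense, which is the property the abstract credits for faster inference than pruning.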
