Title: Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

May 03, 2023

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister

Abstract: Deploying large language models (LLMs) is challenging because they are memory-inefficient and compute-intensive for practical applications. In response, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve performance comparable to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so using less training data than finetuning or distillation requires. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across four NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with far fewer labeled/unlabeled training examples. Second, compared to LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our 770M T5 model outperforms the 540B PaLM model using only 80% of available data on a benchmark task.
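
The multi-task idea in the abstract, where the small model is supervised on both the label and the LLM rationale, can be illustrated with a minimal Python sketch. The abstract does not specify the details, so everything here is an assumption for illustration: the `Example` class, the `[label]` and `[rationale]` prefix strings, the helper names, and the weighted-sum form of the combined loss with its `weight` hyperparameter. The loss values passed in at the end are made-up numbers, not results from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    text: str       # task input
    label: str      # human or LLM-generated label
    rationale: str  # LLM-generated rationale used as extra supervision


# Hypothetical task prefixes that tell the small model which output to produce.
LABEL_PREFIX = "[label]"
RATIONALE_PREFIX = "[rationale]"


def to_multitask_pairs(ex: Example) -> List[Tuple[str, str]]:
    """Turn one example into two (source, target) pairs, one per task."""
    return [
        (f"{LABEL_PREFIX} {ex.text}", ex.label),
        (f"{RATIONALE_PREFIX} {ex.text}", ex.rationale),
    ]


def multitask_loss(label_loss: float, rationale_loss: float, weight: float = 1.0) -> float:
    """Weighted sum of the two task losses; `weight` is an assumed hyperparameter."""
    return label_loss + weight * rationale_loss


example = Example(
    text="Premise: A man plays guitar. Hypothesis: A man makes music.",
    label="entailment",
    rationale="Playing a guitar produces music, so the hypothesis follows.",
)

pairs = to_multitask_pairs(example)
for source, target in pairs:
    print(source, "->", target)

# Illustrative loss values only; in practice these would come from the model.
total = multitask_loss(label_loss=0.40, rationale_loss=1.25, weight=0.5)
print(total)
```

In this sketch the rationale is only a training target. At inference time the model would be queried with the label prefix alone, so generating rationales adds no cost at deployment.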

* Accepted to Findings of ACL 2023
