Title: Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding

Jul 12, 2023

Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee

Abstract: This paper presents "Predictive Pipelined Decoding (PPD)," an approach that speeds up greedy decoding in Large Language Models (LLMs) while producing exactly the same output as the original decoding. Unlike conventional strategies, PPD uses additional compute resources to start decoding subsequent tokens in parallel while the current token is still being decoded. This reduces decoding latency and reshapes the understanding of trade-offs in LLM decoding strategies. We have developed a theoretical framework for analyzing the trade-off between computation and latency. Using this framework, we can analytically estimate the potential latency reduction of the proposed method from the match rate, denoted p_correct. The results demonstrate that extra computational resources have the potential to accelerate LLM greedy decoding.
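To make the compute-latency trade-off concrete, here is a minimal back-of-the-envelope sketch in Python. It is not the paper's theoretical framework; it is a toy model built on assumptions that do not appear in the abstract: the model has `depth` layers that each take one time unit, a guess for the next token is made once the current token reaches a hypothetical intermediate layer `early_layer`, each guess matches the true greedy token independently with probability `p_correct`, and enough extra compute is always available to run the speculative decoding in parallel. The function name and all parameters are illustrative.

```python
def expected_latency(num_tokens: int, depth: int, early_layer: int, p_correct: float) -> float:
    """Toy estimate of total decoding time, in layer-time units.

    Assumptions (illustrative, not taken from the paper):
      * every layer costs one time unit;
      * decoding of the next token starts speculatively once the current
        token's computation reaches `early_layer`;
      * a guess is right with independent probability `p_correct`;
      * on a wrong guess the next token restarts after the current one finishes.
    """
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    if not 0 < early_layer <= depth:
        raise ValueError("early_layer must be in (0, depth]")
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be in [0, 1]")

    # Time saved on a token whose speculative start was correct.
    head_start = depth - early_layer
    # The first token always pays the full depth; each later token
    # saves `head_start` with probability `p_correct`.
    per_token = depth - p_correct * head_start
    return depth + (num_tokens - 1) * per_token


def speedup(num_tokens: int, depth: int, early_layer: int, p_correct: float) -> float:
    """Ratio of plain greedy latency to the toy pipelined latency."""
    baseline = num_tokens * depth
    return baseline / expected_latency(num_tokens, depth, early_layer, p_correct)
```

Under these assumptions, `p_correct = 0` recovers plain greedy decoding, and a higher match rate or an earlier guess layer lowers the estimated latency at the cost of more parallel compute. The output tokens are unchanged in either case, because a wrong guess is simply discarded.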

* ES-FoMo Workshop at ICML 2023
