Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hiari Pizzini Cavagna

Determining Energy Efficiency Sweet Spots in Production LLM Inference

Feb 05, 2026

Hiari Pizzini Cavagna, Andrea Proia, Giacomo Madella, Giovanni B. Esposito, Francesco Antici, Daniele Cesarini, Zeynep Kiziltan, Andrea Bartolini

Abstract:Large Language Models (LLMs) inference is central in modern AI applications, making it critical to understand their energy footprint. Existing approaches typically estimate energy consumption through simple linear functions of input and output sequence lengths, yet our observations reveal clear Energy Efficiency regimes: peak efficiency occurs with short-to-moderate inputs and medium-length outputs, while efficiency drops sharply for long inputs or very short outputs, indicating a non-linear dependency. In this work, we propose an analytical model derived from the computational and memory-access complexity of the Transformer architecture, capable of accurately characterizing the efficiency curve as a function of input and output lengths. To assess its accuracy, we evaluate energy consumption using TensorRT-LLM on NVIDIA H100 GPUs across a diverse set of LLMs ranging from 1B to 9B parameters, including OPT, LLaMA, Gemma, Falcon, Qwen2, and Granite, tested over input and output lengths from 64 to 4096 tokens, achieving a mean MAPE of 1.79%. Our results show that aligning sequence lengths with these efficiency "Sweet Spots" can substantially reduce energy usage, supporting informed truncation, summarization, and adaptive generation strategies in production systems.

* To appear at ICPE 2026 (International Conference on Performance Engineering)

Via

Access Paper or Ask Questions

Assessing Tenstorrent's RISC-V MatMul Acceleration Capabilities

May 09, 2025

Hiari Pizzini Cavagna, Daniele Cesarini, Andrea Bartolini

Figure 1 for Assessing Tenstorrent's RISC-V MatMul Acceleration Capabilities

Figure 2 for Assessing Tenstorrent's RISC-V MatMul Acceleration Capabilities

Figure 3 for Assessing Tenstorrent's RISC-V MatMul Acceleration Capabilities

Figure 4 for Assessing Tenstorrent's RISC-V MatMul Acceleration Capabilities

Abstract:The increasing demand for generative AI as Large Language Models (LLMs) services has driven the need for specialized hardware architectures that optimize computational efficiency and energy consumption. This paper evaluates the performance of the Tenstorrent Grayskull e75 RISC-V accelerator for basic linear algebra kernels at reduced numerical precision, a fundamental operation in LLM computations. We present a detailed characterization of Grayskull's execution model, gridsize, matrix dimensions, data formats, and numerical precision impact computational efficiency. Furthermore, we compare Grayskull's performance against state-of-the-art architectures with tensor acceleration, including Intel Sapphire Rapids processors and two NVIDIA GPUs (V100 and A100). Whilst NVIDIA GPUs dominate raw performance, Grayskull demonstrates a competitive trade-off between power consumption and computational throughput, reaching a peak of 1.55 TFLOPs/Watt with BF16.

* Accepted to the Computational Aspects of Deep Learning Workshop at ISC High Performance 2025. To appear in the ISC High Performance 2025 Workshop Proceedings

Via

Access Paper or Ask Questions