Get our free extension to see links to code for papers anywhere online!

# Ultra Low-latency, Low-area Inference Accelerators using Heterogeneous Deep Quantization with QKeras and hls4ml

In this paper, we introduce the QKeras library, an extension of the Keras library allowing for the creation of heterogeneously quantized versions of deep neural network models, through drop-in replacement of Keras layers. These models are trained quantization-aware, where the user can trade off model area or energy consumption by accuracy. We demonstrate how the reduction of numerical precision, through quantization-aware training, significantly reduces resource consumption while retaining high accuracy when implemented on FPGA hardware. Together with the hls4ml library, this allows for a fully automated deployment of quantized Keras models on chip, crucial for ultra low-latency inference. As a benchmark problem, we consider a classification task for the triggering of events in proton-proton collisions at the CERN Large Hadron Collider, where a latency of ${\mathcal O}(1)~\mu$s is required.

* 9 pages, 9 figures, 3 tables, submitted to ICCAD