Abstract: In modern computer architectures, the performance of many memory-bound workloads (e.g., machine learning, graph processing, databases) is limited by the data movement bottleneck that emerges when transferring large amounts of data between the main memory and the central processing unit (CPU). Processing-in-memory (PiM) is an emerging computing paradigm that aims to alleviate this data movement bottleneck by performing computation close to or within the memory units, where the data resides. One example of a prevalent workload whose performance is bound by the data movement bottleneck is the training and inference process of artificial neural networks. In this work, we analyze the potential of modern general-purpose PiM architectures to accelerate neural networks. To this end, we select the UPMEM PiM system, the first commercially available real-world general-purpose PiM architecture. We compare an implementation of multilayer perceptrons (MLPs) on PiM with a sequential baseline running on an Intel Xeon CPU. The UPMEM implementation achieves up to $259\times$ better performance than the CPU for inference with large batch sizes that exploit the size of the available PiM memory. Additionally, two smaller MLPs were implemented using UPMEM's working SRAM (WRAM), a scratchpad memory, to evaluate their performance against a low-power Nvidia Jetson graphics processing unit (GPU), providing further insights into the efficiency of UPMEM's PiM for neural network inference. Results show that using WRAM achieves kernel execution times for MLP inference of under $3$ ms, which is within the same order of magnitude as low-power GPUs.
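To make the computation pattern concrete, the sketch below (illustrative only, not the authors' UPMEM code; the function name and the `num_units` parameter are hypothetical) shows how a single MLP layer, a matrix-vector product followed by an activation, can be partitioned row-wise so that each PiM processing unit works on a local slice of the weight matrix, which is the kind of data-parallel split a PiM system with many independent units favors.

```python
import numpy as np

def mlp_layer_partitioned(x, W, b, num_units=8):
    """Illustrative row-wise partitioning of one MLP layer (y = relu(W @ x + b))
    across `num_units` hypothetical PiM processing units."""
    # Split the weight rows (and the matching bias entries) into one chunk per unit.
    W_chunks = np.array_split(W, num_units, axis=0)
    b_chunks = np.array_split(b, num_units)
    partial = []
    for W_i, b_i in zip(W_chunks, b_chunks):
        # On a real PiM system each chunk would reside in a unit's local memory
        # and the product would run there; here it is a plain NumPy call.
        partial.append(W_i @ x + b_i)
    y = np.concatenate(partial)
    return np.maximum(y, 0.0)  # ReLU

# Tiny usage example: one 256 -> 128 layer on a random input vector.
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
W = rng.standard_normal((128, 256))
b = rng.standard_normal(128)
print(mlp_layer_partitioned(x, W, b).shape)  # (128,)
```

Because each slice of the weight matrix stays with one unit, only the small input vector and the partial results move between host and memory, which is the data-movement saving the abstract refers to.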
Abstract: In recent years, Convolutional Neural Networks (CNNs) have become the standard class of deep neural network for image processing, classification and segmentation tasks. However, the large strides in accuracy obtained by CNNs have come from increasing the complexity of network topologies, which incurs sizeable performance and energy penalties in the training and inference of CNNs. Many recent works have validated the effectiveness of parameter quantization, which consists of reducing the bit width of the network's parameters, to enable considerable performance and energy efficiency gains without significantly compromising accuracy. However, it is difficult to compare the relative effectiveness of different quantization methods. To address this problem, we introduce RedBit, an open-source framework that provides a transparent, extensible and easy-to-use interface to evaluate the effectiveness of different algorithms and parameter configurations on network accuracy. We use RedBit to perform a comprehensive survey of five state-of-the-art quantization methods applied to the MNIST, CIFAR-10 and ImageNet datasets. We evaluate a total of 2300 individual bit width combinations, independently tuning the width of the network's weight and input activation parameters, from 32 bits down to 1 bit (e.g., 8/8, 2/2, 1/32, 1/1, for weights/activations). Upwards of 20000 hours of computing time in a pool of state-of-the-art GPUs were used to generate all the results in this paper. For 1-bit quantization, the accuracy losses for the MNIST, CIFAR-10 and ImageNet datasets range between [0.26%, 0.79%], [9.74%, 32.96%] and [10.86%, 47.36%] top-1, respectively. We actively encourage the reader to download the source code and experiment with RedBit, and to submit their own observed results to our public repository, available at https://github.com/IT-Coimbra/RedBit.
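As a minimal illustration of what "reducing the bit width of the network's parameters" means in practice, the sketch below shows a generic symmetric uniform quantizer in Python. It is not RedBit's implementation nor any of the five surveyed methods; the function name and the binary-case handling are assumptions chosen for clarity.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantization of a tensor to `bits` bits
    (illustrative sketch, not RedBit's actual algorithm)."""
    if bits >= 32:
        return w  # treated as full precision
    if bits == 1:
        # Binary case: keep only the sign, scaled by the mean magnitude.
        return np.sign(w) * np.abs(w).mean()
    qmax = 2 ** (bits - 1) - 1           # e.g., 127 for 8 bits
    scale = np.abs(w).max() / qmax       # map the largest magnitude to qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                     # dequantized ("fake-quantized") values

# Usage: quantize a random weight matrix to 8 and 2 bits and inspect the error.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
for b in (8, 2):
    wq = quantize_uniform(w, b)
    print(b, float(np.abs(w - wq).mean()))
```

Applying such a quantizer independently to weights and input activations, with possibly different bit widths for each, yields the weight/activation combinations (e.g., 8/8, 2/2, 1/32, 1/1) swept in the survey.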