Abstract:Using Resistive Random Access Memory (RRAM) crossbars in Computing-in-Memory (CIM) architectures offers a promising solution to overcome the von Neumann bottleneck. Due to non-idealities like cell variability, RRAM crossbars are often operated in binary mode, utilizing only two states: Low Resistive State (LRS) and High Resistive State (HRS). Binary Neural Networks (BNNs) and Ternary Neural Networks (TNNs) are well-suited for this hardware due to their efficient mapping. Existing software projects for RRAM-based CIM typically focus on only one aspect: compilation, simulation, or Design Space Exploration (DSE). Moreover, they often rely on classical 8 bit quantization. To address these limitations, we introduce CIM-Explorer, a modular toolkit for optimizing BNN and TNN inference on RRAM crossbars. CIM-Explorer includes an end-to-end compiler stack, multiple mapping options, and simulators, enabling a DSE flow for accuracy estimation across different crossbar parameters and mappings. CIM-Explorer can accompany the entire design process, from early accuracy estimation for specific crossbar parameters, to selecting an appropriate mapping, and compiling BNNs and TNNs for a finalized crossbar chip. In DSE case studies, we demonstrate the expected accuracy for various mappings and crossbar parameters. CIM-Explorer can be found on GitHub.
Abstract:Accelerating Machine Learning (ML) workloads requires efficient methods due to their large optimization space. Autotuning has emerged as an effective approach for systematically evaluating variations of implementations. Traditionally, autotuning requires the workloads to be executed on the target hardware (HW). We present an interface that allows executing autotuning workloads on simulators. This approach offers high scalability when the availability of the target HW is limited, as many simulations can be run in parallel on any accessible HW. Additionally, we evaluate the feasibility of using fast instruction-accurate simulators for autotuning. We train various predictors to forecast the performance of ML workload implementations on the target HW based on simulation statistics. Our results demonstrate that the tuned predictors are highly effective. The best workload implementation in terms of actual run time on the target HW is always within the top 3 % of predictions for the tested x86, ARM, and RISC-V-based architectures. In the best case, this approach outperforms native execution on the target HW for embedded architectures when running as few as three samples on three simulators in parallel.
Abstract:The demand for efficient machine learning (ML) accelerators is growing rapidly, driving the development of novel computing concepts such as resistive random access memory (RRAM)-based tiled computing-in-memory (CIM) architectures. CIM allows to compute within the memory unit, resulting in faster data processing and reduced power consumption. Efficient compiler algorithms are essential to exploit the potential of tiled CIM architectures. While conventional ML compilers focus on code generation for CPUs, GPUs, and other von Neumann architectures, adaptations are needed to cover CIM architectures. Cross-layer scheduling is a promising approach, as it enhances the utilization of CIM cores, thereby accelerating computations. Although similar concepts are implicitly used in previous work, there is a lack of clear and quantifiable algorithmic definitions for cross-layer scheduling for tiled CIM architectures. To close this gap, we present CLSA-CIM, a cross-layer scheduling algorithm for tiled CIM architectures. We integrate CLSA-CIM with existing weight-mapping strategies and compare performance against state-of-the-art (SOTA) scheduling algorithms. CLSA-CIM improves the utilization by up to 17.9 x , resulting in an overall speedup increase of up to 29.2 x compared to SOTA.