The need to execute Deep Neural Networks (DNNs) at low latency and low power at the edge has spurred the development of new heterogeneous Systems-on-Chips (SoCs) encapsulating a diverse set of hardware accelerators. How to optimally map a DNN onto such multi-accelerator systems is an open problem. We propose ODiMO, a hardware-aware tool that performs a fine-grain mapping across different accelerators on-chip, splitting individual layers and executing them in parallel, to reduce inference energy consumption or latency, while taking into account each accelerator's quantization precision to maintain accuracy. Pareto-optimal networks in the accuracy vs. energy or latency space are pursued for three popular dataset/DNN pairs, and deployed on the DIANA heterogeneous ultra-low power edge AI SoC. We show that ODiMO reduces energy/latency by up to 33%/31% with limited accuracy drop (-0.53%/-0.32%) compared to manual heuristic mappings.
To meet the growing need for computational power for DNNs, multiple specialized hardware architectures have been proposed. Each DNN layer should be mapped onto the hardware with the most efficient schedule, however, SotA schedulers struggle to consistently provide optimum schedules in a reasonable time across all DNN-HW combinations. This paper proposes SALSA, a fast dual-engine scheduler to generate optimal execution schedules for both even and uneven mapping. We introduce a new strategy, combining exhaustive search with simulated annealing to address the dynamic nature of the loop ordering design space size across layers. SALSA is extensively benchmarked against two SotA schedulers, LOMA and Timeloop on 5 different DNNs, on average SALSA finds schedules with 11.9% and 7.6% lower energy while speeding up the search by 1.7x and 24x compared to LOMA and Timeloop, respectively.
The field of neuromorphic computing holds great promise in terms of advancing computing efficiency and capabilities by following brain-inspired principles. However, the rich diversity of techniques employed in neuromorphic research has resulted in a lack of clear standards for benchmarking, hindering effective evaluation of the advantages and strengths of neuromorphic methods compared to traditional deep-learning-based methods. This paper presents a collaborative effort, bringing together members from academia and the industry, to define benchmarks for neuromorphic computing: NeuroBench. The goals of NeuroBench are to be a collaborative, fair, and representative benchmark suite developed by the community, for the community. In this paper, we discuss the challenges associated with benchmarking neuromorphic solutions, and outline the key features of NeuroBench. We believe that NeuroBench will be a significant step towards defining standards that can unify the goals of neuromorphic computing and drive its technological progress. Please visit neurobench.ai for the latest updates on the benchmark tasks and metrics.
In recent years the automotive industry has been strongly promoting the development of smart cars, equipped with multi-modal sensors to gather information about the surroundings, in order to aid human drivers or make autonomous decisions. While the focus has mostly been on visual sensors, also acoustic events are crucial to detect situations that require a change in the driving behavior, such as a car honking, or the sirens of approaching emergency vehicles. In this paper, we summarize the results achieved so far in the Marie Sklodowska-Curie Actions (MSCA) European Industrial Doctorates (EID) project Intelligent Ultra Low-Power Signal Processing for Automotive (I-SPOT). On the algorithmic side, the I-SPOT Project aims to enable detecting, localizing and tracking environmental audio signals by jointly developing microphone array processing and deep learning techniques that specifically target automotive applications. Data generation software has been developed to cover the I-SPOT target scenarios and research challenges. This tool is currently being used to develop low-complexity deep learning techniques for emergency sound detection. On the hardware side, the goal impels workflows for hardware-algorithm co-design to ease the generation of architectures that are sufficiently flexible towards algorithmic evolutions without giving up on efficiency, as well as enable rapid feedback of hardware implications of algorithmic decision. This is pursued though a hierarchical workflow that breaks the hardware-algorithm design space into reasonable subsets, which has been tested for operator-level optimizations on state-of-the-art robust sound source localization for edge devices. Further, several open challenges towards an end-to-end system are clarified for the next stage of I-SPOT.
In this work we propose a methodology to accurately evaluate and compare the performance of efficient neural network building blocks for computer vision in a hardware-aware manner. Our comparison uses pareto fronts based on randomly sampled networks from a design space to capture the underlying accuracy/complexity trade-offs. We show that our approach allows to match the information obtained by previous comparison paradigms, but provides more insights in the relationship between hardware cost and accuracy. We use our methodology to analyze different building blocks and evaluate their performance on a range of embedded hardware platforms. This highlights the importance of benchmarking building blocks as a preselection step in the design process of a neural network. We show that choosing the right building block can speed up inference by up to a factor of 2x on specific hardware ML accelerators.
Multi-head self-attention forms the core of Transformer networks. However, their quadratically growing complexity with respect to the input sequence length impedes their deployment on resource-constrained edge devices. We address this challenge by proposing a dynamic pruning method, which exploits the temporal stability of data across tokens to reduce inference cost. The threshold-based method only retains significant differences between the subsequent tokens, effectively reducing the number of multiply-accumulates, as well as the internal tensor data sizes. The approach is evaluated on the Google Speech Commands Dataset for keyword spotting, and the performance is compared against the baseline Keyword Transformer. Our experiments show that we can reduce ~80% of operations while maintaining the original 98.4% accuracy. Moreover, a reduction of ~87-94% operations can be achieved when only degrading the accuracy by 1-4%, speeding up the multi-head self-attention inference by a factor of ~7.5-16.
Bayesian reasoning is a powerful mechanism for probabilistic inference in smart edge-devices. During such inferences, a low-precision arithmetic representation can enable improved energy efficiency. However, its impact on inference accuracy is not yet understood. Furthermore, general-purpose hardware does not natively support low-precision representation. To address this, we propose ProbLP, a framework that automates the analysis and design of low-precision probabilistic inference hardware. It automatically chooses an appropriate energy-efficient representation based on worst-case error-bounds and hardware energy-models. It generates custom hardware for the resulting inference network exploiting parallelism, pipelining and low-precision operation. The framework is validated on several embedded-sensing benchmarks.
Training deep learning models on embedded devices is typically avoided since this requires more memory, computation and power over inference. In this work, we focus on lowering the amount of memory needed for storing all activations, which are required during the backward pass to compute the gradients. Instead, during the forward pass, static Synthetic Gradient Modules (SGMs) predict gradients for each layer. This allows training the model in a feed-forward manner without having to store all activations. We tested our method on a robot grasping scenario where a robot needs to learn to grasp new objects given only a single demonstration. By first training the SGMs in a meta-learning manner on a set of common objects, during fine-tuning, the SGMs provided the model with accurate gradients to successfully learn to grasp new objects. We have shown that our method has comparable results to using standard backpropagation.
Recent advancements in ultra-low-power machine learning (TinyML) hardware promises to unlock an entirely new class of smart applications. However, continued progress is limited by the lack of a widely accepted benchmark for these systems. Benchmarking allows us to measure and thereby systematically compare, evaluate, and improve the performance of systems. In this position paper, we present the current landscape of TinyML and discuss the challenges and direction towards developing a fair and useful hardware benchmark for TinyML workloads. Our viewpoints reflect the collective thoughts of the TinyMLPerf working group that is comprised of 30 organizations.