Luca Benini

D-ITET, ETH Zürich, Switzerland

Tiny-PULP-Dronets: Squeezing Neural Networks for Faster and Lighter Inference on Multi-Tasking Autonomous Nano-Drones

Jul 02, 2024

Lorenzo Lamberti, Vlad Niculescu, Michał Barcis, Lorenzo Bellone, Enrico Natalizio, Luca Benini, Daniele Palossi

Abstract: Pocket-sized autonomous nano-drones can revolutionize many robotic use cases, such as visual inspection in narrow, constrained spaces, and ensure safer human-robot interaction due to their tiny form factor and weight, i.e., tens of grams. This compelling vision is challenged by the high level of intelligence needed aboard, which clashes with the limited computational and storage resources available on the PULP (parallel-ultra-low-power) MCU-class navigation and mission controllers that can be hosted aboard. This work starts from PULP-Dronet, a state-of-the-art convolutional neural network for autonomous navigation on nano-drones. We introduce Tiny-PULP-Dronet: a novel methodology to squeeze, by more than one order of magnitude, the model size (50x fewer parameters) and the number of operations (27x fewer multiply-and-accumulate operations) required to run inference, with flight performance similar to PULP-Dronet. This massive reduction paves the way towards affordable multi-tasking on nano-drones, a fundamental requirement for achieving high-level intelligence.

* 3 Figures, 1 table. Accepted for publication at IEEE Artificial Intelligence Circuits and Systems (AICAS), 2022

BISeizuRe: BERT-Inspired Seizure Data Representation to Improve Epilepsy Monitoring

Jun 27, 2024

Luca Benfenati, Thorir Mar Ingolfsson, Andrea Cossettini, Daniele Jahier Pagliari, Alessio Burrello, Luca Benini

Abstract: This study presents a novel approach for EEG-based seizure detection leveraging a BERT-based model. The model, BENDR, undergoes a two-phase training process. Initially, it is pre-trained on the extensive Temple University Hospital EEG Corpus (TUEG), a 1.5 TB dataset comprising over 10,000 subjects, to extract common EEG data patterns. Subsequently, the model is fine-tuned on the CHB-MIT Scalp EEG Database, consisting of 664 EEG recordings from 24 pediatric patients, of which 198 contain seizure events. Key contributions include optimizing fine-tuning on the CHB-MIT dataset, where the impact of model architecture, pre-processing, and post-processing techniques is thoroughly examined to enhance sensitivity and reduce false positives per hour (FP/h). We also explore custom training strategies to ascertain the most effective setup. The model undergoes a novel second pre-training phase before subject-specific fine-tuning, enhancing its generalization capabilities. The optimized model demonstrates substantial performance enhancements, achieving as low as 0.23 FP/h, 2.5x lower than the baseline model, with a lower but still acceptable sensitivity rate, showcasing the effectiveness of applying a BERT-based approach to EEG-based seizure detection.

* 4 pages, 2 tables, 2 figures
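
The two metrics quoted in the abstract above can be made concrete with a minimal sketch. It assumes the common event-based definitions (sensitivity as detected seizures over total seizures, FP/h as false alarms over recording hours); the paper may score events differently. The function name and the counts in the usage lines are hypothetical and are not results from the paper.

```python
def seizure_metrics(true_positives, false_negatives, false_positives, hours):
    """Return (sensitivity, false positives per hour) from event counts.

    Assumes event-based scoring; how events are matched is left to the evaluator.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    seizures = true_positives + false_negatives
    # Sensitivity is undefined when the recording contains no seizures.
    sensitivity = true_positives / seizures if seizures else float("nan")
    return sensitivity, false_positives / hours


# Hypothetical counts, for illustration only.
sens, fp_per_hour = seizure_metrics(18, 2, 23, 100.0)
```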

Low Latency Visual Inertial Odometry with On-Sensor Accelerated Optical Flow for Resource-Constrained UAVs

Jun 19, 2024

Jonas Kühne, Michele Magno, Luca Benini

Abstract: Visual Inertial Odometry (VIO) is the task of estimating the movement trajectory of an agent from an onboard camera stream fused with additional Inertial Measurement Unit (IMU) measurements. A crucial subtask within VIO is the tracking of features, which can be achieved through Optical Flow (OF). Because computing OF is demanding in terms of computational load and memory footprint and must be executed at low latency, especially in robotic applications, OF estimation is today performed on powerful CPUs or GPUs. This restricts its use in a broad spectrum of applications where deploying such powerful, power-hungry processors is infeasible due to constraints on cost, size, and power consumption. On-sensor hardware acceleration is a promising approach to enable low-latency VIO even on resource-constrained devices such as nano-drones. This paper assesses the speed-up in a VIO sensor system exploiting a compact OF sensor consisting of a global shutter camera and an Application Specific Integrated Circuit (ASIC). By replacing the feature tracking logic of the VINS-Mono pipeline with data from this OF camera, we demonstrate a 49.4% reduction in latency and a 53.7% reduction in compute load of the VIO pipeline over the original VINS-Mono implementation, allowing VINS-Mono operation at up to 50 FPS instead of 20 FPS on the quad-core ARM Cortex-A72 processor of a Raspberry Pi Compute Module 4.

* This article has been accepted for publication in the IEEE Sensors Journal (JSEN)

GAPses: Versatile smart glasses for comfortable and fully-dry acquisition and parallel ultra-low-power processing of EEG and EOG

Jun 12, 2024

Sebastian Frey, Mattia Alberto Lucchini, Victor Kartsch, Thorir Mar Ingolfsson, Andrea Helga Bernardi, Michael Segessenmann, Jakub Osieleniec, Simone Benatti, Luca Benini, Andrea Cossettini

Abstract: Recent advancements in head-mounted wearable technology are revolutionizing the field of biopotential measurement, but the integration of these technologies into practical, user-friendly devices remains challenging due to issues with design intrusiveness, comfort, and data privacy. To address these challenges, this paper presents GAPSES, a novel smart glasses platform designed for unobtrusive, comfortable, and secure acquisition and processing of electroencephalography (EEG) and electrooculography (EOG) signals. We introduce a direct electrode-electronics interface with custom fully dry soft electrodes to enhance comfort for long wear. An integrated parallel ultra-low-power RISC-V processor (GAP9, GreenWaves Technologies) processes data at the edge, thereby eliminating the need for continuous data streaming through a wireless link, enhancing privacy, and increasing system reliability in adverse channel conditions. We demonstrate the broad applicability of the designed prototype through validation in a number of EEG-based interaction tasks, including alpha waves, steady-state visual evoked potential analysis, and motor movement classification. Furthermore, we demonstrate an EEG-based biometric subject recognition task, where we reach a sensitivity and specificity of 98.87% and 99.86%, respectively, with only 8 EEG channels and an energy consumption per inference on the edge as low as 121 uJ. Moreover, in an EOG-based eye movement classification task, we reach an accuracy of 96.68% on 11 classes, resulting in an information transfer rate of 94.78 bit/min, which can be further increased to 161.43 bit/min by reducing the accuracy to 81.43%. The deployed implementation has an energy consumption of 24 uJ per inference and a total system power of only 16.28 mW, allowing for continuous operation of more than 12 h with a small 75 mAh battery.

* 10 pages, 5 figures, 5 tables
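
As a rough plausibility check on the battery-life figure in the abstract above, the sketch below divides battery energy by total system power. The 3.7 V nominal cell voltage is an assumption (typical for a single-cell LiPo, not stated in the abstract), and conversion losses and battery derating are ignored, so the result is an idealized upper bound rather than a measured runtime.

```python
def runtime_hours(capacity_mah, nominal_voltage_v, system_power_mw):
    """Ideal runtime in hours: battery energy (mWh) divided by average power (mW)."""
    energy_mwh = capacity_mah * nominal_voltage_v
    return energy_mwh / system_power_mw


# 75 mAh and 16.28 mW come from the abstract; 3.7 V is an assumed nominal voltage.
hours = runtime_hours(75, 3.7, 16.28)
print(f"ideal runtime: {hours:.1f} h")
```

Under these assumptions the ideal runtime comes out at about 17 h, which leaves margin over the "more than 12 h" stated in the abstract once real-world losses are accounted for.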

Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

May 29, 2024

Viviane Potocnik, Luca Colagrande, Tim Fischer, Luca Bertaccini, Daniele Jahier Pagliari, Alessio Burrello, Luca Benini

Abstract: Transformer-based foundation models have become crucial for various domains, most notably natural language processing (NLP) and computer vision (CV). These models are predominantly deployed on high-performance GPUs or hardwired accelerators with highly customized, proprietary instruction sets. Until now, limited attention has been given to RISC-V-based general-purpose platforms. In our work, we present the first end-to-end inference results of transformer models on an open-source many-tiny-core RISC-V platform implementing distributed Softmax primitives and leveraging ISA extensions for SIMD floating-point operand streaming and instruction repetition, as well as specialized DMA engines to minimize costly main memory accesses and to tolerate their latency. We focus on two foundational transformer topologies, encoder-only and decoder-only models. For encoder-only models, we demonstrate a speedup of up to 12.8x between the most optimized implementation and the baseline version. We reach over 79% FPU utilization and 294 GFLOPS/W, outperforming state-of-the-art (SoA) accelerators by more than 2x in utilization of the HW platform while achieving comparable throughput per computational unit. For decoder-only topologies, we achieve a 16.1x speedup in the Non-Autoregressive (NAR) mode and up to a 35.6x speedup in the Autoregressive (AR) mode compared to the baseline implementation. Compared to the best SoA dedicated accelerator, we achieve 2.04x higher FPU utilization.

* 14 pages, 10 figures, 4 tables, IEEE Transactions on Circuits and Systems for Artificial Intelligence

xTern: Energy-Efficient Ternary Neural Network Inference on RISC-V-Based Edge Systems

May 29, 2024

Georg Rutishauser, Joan Mihali, Moritz Scherer, Luca Benini

Abstract: Ternary neural networks (TNNs) offer a superior accuracy-energy trade-off compared to binary neural networks. However, until now, they have required specialized accelerators to realize their efficiency potential, which has hindered widespread adoption. To address this, we present xTern, a lightweight extension of the RISC-V instruction set architecture (ISA) targeted at accelerating TNN inference on general-purpose cores. To complement the ISA extension, we developed a set of optimized kernels leveraging xTern, achieving 67% higher throughput than their 2-bit equivalents. Power consumption increases only marginally, by 5.2%, resulting in an energy efficiency improvement of 57.1%. We demonstrate that the proposed xTern extension, integrated into an octa-core compute cluster, incurs a minimal silicon area overhead of 0.9% with no impact on timing. In end-to-end benchmarks, we demonstrate that xTern enables the deployment of TNNs achieving up to 1.6 percentage points higher CIFAR-10 classification accuracy than 2-bit networks at equal inference latency. Our results show that xTern enables RISC-V-based ultra-low-power edge AI platforms to benefit from the efficiency potential of TNNs.

* Accepted for publication at IEEE ASAP 2024

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

May 23, 2024

Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Xianglong Liu, Luca Benini, Michele Magno, Xiaojuan Qi

Abstract: Large language models (LLMs) achieve remarkable performance in natural language understanding but require substantial computation and memory resources. Post-training quantization (PTQ) is a powerful compression technique extensively investigated in LLMs. However, existing PTQ methods are still not ideal in terms of accuracy and efficiency, especially at bit-widths below 4. Standard PTQ methods using group-wise quantization struggle to quantize LLMs accurately at such low bit-widths, while advanced methods that retain high-precision weights element-wise struggle to realize their theoretical hardware efficiency. This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM. The scheme exploits the salience distribution of weights to determine the optimal bit-width and quantizers for accurate LLM quantization, while aligning the bit-width partition to groups for compact memory usage and fast integer inference. Specifically, the proposed SliM-LLM mainly relies on two novel techniques: (1) Salience-Determined Bit Allocation utilizes the clustering characteristics of the salience distribution to allocate the bit-width of each group, increasing the accuracy of quantized LLMs while maintaining inference efficiency; (2) Salience-Weighted Quantizer Calibration optimizes the parameters of the quantizer by considering the element-wise salience within the group, balancing the preservation of salient information and the minimization of errors. Comprehensive experiments show that SliM-LLM significantly improves the accuracy of LLMs at ultra-low bit-widths, e.g., 2-bit LLaMA-7B achieves a 5.5x memory saving over the original model on NVIDIA A800 GPUs and a 48% decrease in perplexity compared to the state-of-the-art gradient-free PTQ method. Moreover, SliM-LLM+, which extends SliM-LLM with gradient-based quantizers, further reduces perplexity by 35.1%.

* 22 pages

A Spiking Neural Network Decoder for Implantable Brain Machine Interfaces and its Sparsity-aware Deployment on RISC-V Microcontrollers

May 03, 2024

Jiawei Liao, Oscar Toomey, Xiaying Wang, Lars Widmer, Cynthia A. Chestek, Luca Benini, Taekwang Jang

Abstract: Implantable brain-machine interfaces (BMIs) are promising for motor rehabilitation and mobility augmentation, and they demand accurate and energy-efficient algorithms. In this paper, we propose a novel spiking neural network (SNN) decoder for regression tasks for implantable BMIs. The SNN is trained with enhanced spatio-temporal backpropagation to fully leverage its capability to handle temporal problems. The proposed SNN decoder outperforms the state-of-the-art Kalman filter and artificial neural network (ANN) decoders in offline finger velocity decoding tasks. The decoder is deployed on a RISC-V-based hardware platform and optimized to exploit sparsity. The proposed implementation has an average power consumption of 0.50 mW in a duty-cycled mode. When conducting continuous inference without duty-cycling, it achieves an energy efficiency of 1.88 uJ per inference, which is 5.5x lower than the baseline ANN. Additionally, the average decoding latency is 0.12 ms per inference, which is 5.7x faster than the ANN implementation.

Multi-resolution Rescored ByteTrack for Video Object Detection on Ultra-low-power Embedded Systems

Apr 17, 2024

Luca Bompani, Manuele Rusci, Daniele Palossi, Francesco Conti, Luca Benini

Abstract: This paper introduces Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a novel video object detection framework for ultra-low-power embedded processors. This method reduces the average compute load of an off-the-shelf Deep Neural Network (DNN) based object detector by up to 2.25x by alternating the processing of high-resolution images (320x320 pixels) with multiple down-sized frames (192x192 pixels). To tackle the accuracy degradation due to the reduced image input size, MR2-ByteTrack correlates the output detections over time using the ByteTrack tracker and corrects potential misclassification using a novel probabilistic Rescore algorithm. By interleaving two down-sized images for every high-resolution one as the input of different state-of-the-art DNN object detectors with our MR2-ByteTrack, we demonstrate an average accuracy increase of 2.16% and a latency reduction of 43% on the GAP9 microcontroller compared to a baseline frame-by-frame inference scheme using exclusively full-resolution images. Code available at: https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack

* 9 pages, 3 figures. Accepted for publication at the Embedded Vision Workshop of the Computer Vision and Pattern Recognition conference, Seattle, 2024
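
The latency reduction reported in the abstract above can be sanity-checked with a back-of-the-envelope model. The sketch assumes that per-frame detector cost scales linearly with input pixel count, a simplification that is not stated in the abstract; the function name is hypothetical.

```python
def average_cost_ratio(high_res, low_res, low_per_high):
    """Average per-frame cost relative to running every frame at high_res.

    Assumes cost is proportional to pixel count (side length squared).
    """
    low_cost = (low_res / high_res) ** 2   # relative cost of one down-sized frame
    frames = 1 + low_per_high              # one high-res frame per cycle
    return (1.0 + low_per_high * low_cost) / frames


# Two 192x192 frames interleaved with each 320x320 frame, as in the abstract.
ratio = average_cost_ratio(320, 192, 2)
print(f"estimated reduction: {(1 - ratio) * 100:.1f}%")
```

Under this assumption the estimated reduction is roughly 43%, in line with the latency figure in the abstract. Measured latency on GAP9 also depends on memory transfers and tracker overhead, which this model ignores.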

Foundation Models for Structural Health Monitoring

Apr 03, 2024

Luca Benfenati, Daniele Jahier Pagliari, Luca Zanatta, Yhorman Alexander Bedoya Velez, Andrea Acquaviva, Massimo Poncino, Enrico Macii, Luca Benini, Alessio Burrello

Abstract: Structural Health Monitoring (SHM) is a critical task for ensuring the safety and reliability of civil infrastructures, typically realized on bridges and viaducts by means of vibration monitoring. In this paper, we propose for the first time the use of Transformer neural networks, with a Masked Auto-Encoder architecture, as Foundation Models for SHM. We demonstrate the ability of these models to learn generalizable representations from multiple large datasets through self-supervised pre-training, which, coupled with task-specific fine-tuning, allows them to outperform state-of-the-art traditional methods on diverse tasks, including Anomaly Detection (AD) and Traffic Load Estimation (TLE). We then extensively explore model size versus accuracy trade-offs and experiment with Knowledge Distillation (KD) to improve the performance of smaller Transformers, enabling their embedding directly into the SHM edge nodes. We showcase the effectiveness of our foundation models using data from three operational viaducts. For AD, we achieve a near-perfect 99.9% accuracy with a monitoring time span of just 15 windows. In contrast, a state-of-the-art method based on Principal Component Analysis (PCA) obtains its first good result (95.03% accuracy) only when considering 120 windows. On two different TLE tasks, our models obtain state-of-the-art performance on multiple evaluation metrics (R² score, MAE% and MSE%). On the first benchmark, we achieve an R² score of 0.97 and 0.85 for light and heavy vehicle traffic, respectively, while the best previous approach stops at 0.91 and 0.84. On the second one, we achieve an R² score of 0.54 versus the 0.10 of the best existing method.

* 16 pages, 4 tables, 9 figures
