ETH Zurich
Abstract:Nano-sized unmanned aerial vehicles (UAVs) are ideal candidates for flying Internet-of-Things smart sensors to collect information in narrow spaces. This requires ultra-fast navigation under very tight memory/computation constraints. The PULP-Dronet convolutional neural network (CNN) enables autonomous navigation running aboard a nano-UAV at 19 frame/s, at the cost of a large memory footprint of 320 kB -- and with drone control in complex scenarios hindered by the disjoint training of collision avoidance and steering capabilities. In this work, we distill a novel family of CNNs with better capabilities than PULP-Dronet, but memory footprint reduced by up to 168x (down to 2.9 kB), achieving an inference rate of up to 139 frame/s; we collect a new open-source unified collision/steering 66 k images dataset for more robust navigation; and we perform a thorough in-field analysis of both PULP-Dronet and our tiny CNNs running on a commercially available nano-UAV. Our tiniest CNN, called Tiny-PULP-Dronet v3, navigates with a 100% success rate a challenging and never-seen-before path, composed of a narrow obstacle-populated corridor and a 180{\deg} turn, at a maximum target speed of 0.5 m/s. In the same scenario, the SoA PULP-Dronet consistently fails despite having 168x more parameters.
Abstract:A key requirement in robotics is the ability to simultaneously self-localize and map a previously unknown environment, relying primarily on onboard sensing and computation. Achieving fully onboard accurate simultaneous localization and mapping (SLAM) is feasible for high-end robotic platforms, whereas small and inexpensive robots face challenges due to constrained hardware, therefore frequently resorting to external infrastructure for sensing and computation. The challenge is further exacerbated in swarms of robots, where coordination, scalability, and latency are crucial concerns. This work introduces a decentralized and lightweight collaborative SLAM approach that enables mapping on virtually any robot, even those equipped with low-cost hardware, including miniaturized insect-size devices. Moreover, the proposed solution supports large swarm formations with the capability to coordinate hundreds of agents. To substantiate our claims, we have successfully implemented collaborative SLAM on centimeter-size drones weighing only 46 grams. Remarkably, we achieve results comparable to high-end state-of-the-art solutions while reducing the cost, memory, and computation requirements by two orders of magnitude. Our approach is innovative in three main aspects. First, it enables onboard infrastructure-less collaborative mapping with a lightweight and cost-effective solution in terms of sensing and computation. Second, we optimize the data traffic within the swarm to support hundreds of cooperative agents using standard wireless protocols such as ultra-wideband (UWB), Bluetooth, or WiFi. Last, we implement a distributed swarm coordination policy to decrease mapping latency and enhance accuracy.
Abstract:Pocket-sized autonomous nano-drones can revolutionize many robotic use cases, such as visual inspection in narrow, constrained spaces, and ensure safer human-robot interaction due to their tiny form factor and weight -- i.e., tens of grams. This compelling vision is challenged by the high level of intelligence needed aboard, which clashes against the limited computational and storage resources available on PULP (parallel-ultra-low-power) MCU class navigation and mission controllers that can be hosted aboard. This work moves from PULP-Dronet, a State-of-the-Art convolutional neural network for autonomous navigation on nano-drones. We introduce Tiny-PULP-Dronet: a novel methodology to squeeze by more than one order of magnitude model size (50x fewer parameters), and number of operations (27x less multiply-and-accumulate) required to run inference with similar flight performance as PULP-Dronet. This massive reduction paves the way towards affordable multi-tasking on nano-drones, a fundamental requirement for achieving high-level intelligence.
Abstract:This study presents a novel approach for EEG-based seizure detection leveraging a BERT-based model. The model, BENDR, undergoes a two-phase training process. Initially, it is pre-trained on the extensive Temple University Hospital EEG Corpus (TUEG), a 1.5 TB dataset comprising over 10,000 subjects, to extract common EEG data patterns. Subsequently, the model is fine-tuned on the CHB-MIT Scalp EEG Database, consisting of 664 EEG recordings from 24 pediatric patients, of which 198 contain seizure events. Key contributions include optimizing fine-tuning on the CHB-MIT dataset, where the impact of model architecture, pre-processing, and post-processing techniques are thoroughly examined to enhance sensitivity and reduce false positives per hour (FP/h). We also explored custom training strategies to ascertain the most effective setup. The model undergoes a novel second pre-training phase before subject-specific fine-tuning, enhancing its generalization capabilities. The optimized model demonstrates substantial performance enhancements, achieving as low as 0.23 FP/h, 2.5$\times$ lower than the baseline model, with a lower but still acceptable sensitivity rate, showcasing the effectiveness of applying a BERT-based approach on EEG-based seizure detection.
Abstract:Visual Inertial Odometry (VIO) is the task of estimating the movement trajectory of an agent from an onboard camera stream fused with additional Inertial Measurement Unit (IMU) measurements. A crucial subtask within VIO is the tracking of features, which can be achieved through Optical Flow (OF). As the calculation of OF is a resource-demanding task in terms of computational load and memory footprint, which needs to be executed at low latency, especially in robotic applications, OF estimation is today performed on powerful CPUs or GPUs. This restricts its use in a broad spectrum of applications where the deployment of such powerful, power-hungry processors is unfeasible due to constraints related to cost, size, and power consumption. On-sensor hardware acceleration is a promising approach to enable low latency VIO even on resource-constrained devices such as nano drones. This paper assesses the speed-up in a VIO sensor system exploiting a compact OF sensor consisting of a global shutter camera and an Application Specific Integrated Circuit (ASIC). By replacing the feature tracking logic of the VINS-Mono pipeline with data from this OF camera, we demonstrate a 49.4% reduction in latency and a 53.7% reduction of compute load of the VIO pipeline over the original VINS-Mono implementation, allowing VINS-Mono operation up to 50 FPS instead of 20 FPS on the quad-core ARM Cortex-A72 processor of a Raspberry Pi Compute Module 4.
Abstract:Recent advancements in head-mounted wearable technology are revolutionizing the field of biopotential measurement, but the integration of these technologies into practical, user-friendly devices remains challenging due to issues with design intrusiveness, comfort, and data privacy. To address these challenges, this paper presents GAPSES, a novel smart glasses platform designed for unobtrusive, comfortable, and secure acquisition and processing of electroencephalography (EEG) and electrooculography (EOG) signals. We introduce a direct electrode-electronics interface with custom fully dry soft electrodes to enhance comfort for long wear. An integrated parallel ultra-low-power RISC-V processor (GAP9, Greenwaves Technologies) processes data at the edge, thereby eliminating the need for continuous data streaming through a wireless link, enhancing privacy, and increasing system reliability in adverse channel conditions. We demonstrate the broad applicability of the designed prototype through validation in a number of EEG-based interaction tasks, including alpha waves, steady-state visual evoked potential analysis, and motor movement classification. Furthermore, we demonstrate an EEG-based biometric subject recognition task, where we reach a sensitivity and specificity of 98.87% and 99.86% respectively, with only 8 EEG channels and an energy consumption per inference on the edge as low as 121 uJ. Moreover, in an EOG-based eye movement classification task, we reach an accuracy of 96.68% on 11 classes, resulting in an information transfer rate of 94.78 bit/min, which can be further increased to 161.43 bit/min by reducing the accuracy to 81.43%. The deployed implementation has an energy consumption of 24 uJ per inference and a total system power of only 16.28 mW, allowing for continuous operation of more than 12 h with a small 75 mAh battery.
Abstract:Ternary neural networks (TNNs) offer a superior accuracy-energy trade-off compared to binary neural networks. However, until now, they have required specialized accelerators to realize their efficiency potential, which has hindered widespread adoption. To address this, we present xTern, a lightweight extension of the RISC-V instruction set architecture (ISA) targeted at accelerating TNN inference on general-purpose cores. To complement the ISA extension, we developed a set of optimized kernels leveraging xTern, achieving 67% higher throughput than their 2-bit equivalents. Power consumption is only marginally increased by 5.2%, resulting in an energy efficiency improvement by 57.1%. We demonstrate that the proposed xTern extension, integrated into an octa-core compute cluster, incurs a minimal silicon area overhead of 0.9% with no impact on timing. In end-to-end benchmarks, we demonstrate that xTern enables the deployment of TNNs achieving up to 1.6 percentage points higher CIFAR-10 classification accuracy than 2-bit networks at equal inference latency. Our results show that xTern enables RISC-V-based ultra-low-power edge AI platforms to benefit from the efficiency potential of TNNs.
Abstract:Transformer-based foundation models have become crucial for various domains, most notably natural language processing (NLP) or computer vision (CV). These models are predominantly deployed on high-performance GPUs or hardwired accelerators with highly customized, proprietary instruction sets. Until now, limited attention has been given to RISC-V-based general-purpose platforms. In our work, we present the first end-to-end inference results of transformer models on an open-source many-tiny-core RISC-V platform implementing distributed Softmax primitives and leveraging ISA extensions for SIMD floating-point operand streaming and instruction repetition, as well as specialized DMA engines to minimize costly main memory accesses and to tolerate their latency. We focus on two foundational transformer topologies, encoder-only and decoder-only models. For encoder-only models, we demonstrate a speedup of up to 12.8x between the most optimized implementation and the baseline version. We reach over 79% FPU utilization and 294 GFLOPS/W, outperforming State-of-the-Art (SoA) accelerators by more than 2x utilizing the HW platform while achieving comparable throughput per computational unit. For decoder-only topologies, we achieve 16.1x speedup in the Non-Autoregressive (NAR) mode and up to 35.6x speedup in the Autoregressive (AR) mode compared to the baseline implementation. Compared to the best SoA dedicated accelerator, we achieve 2.04x higher FPU utilization.
Abstract:Large language models (LLMs) achieve remarkable performance in natural language understanding but require substantial computation and memory resources. Post-training quantization (PTQ) is a powerful compression technique extensively investigated in LLMs. However, existing PTQ methods are still not ideal in terms of accuracy and efficiency, especially with below 4 bit-widths. Standard PTQ methods using group-wise quantization suffer difficulties in quantizing LLMs accurately to such low-bit, but advanced methods remaining high-precision weights element-wisely are hard to realize their theoretical hardware efficiency. This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM. The scheme exploits the salience distribution of weights to determine optimal bit-width and quantizers for accurate LLM quantization, while aligning bit-width partition to groups for compact memory usage and fast integer inference. Specifically, the proposed SliM-LLM mainly relies on two novel techniques: (1) Salience-Determined Bit Allocation utilizes the clustering characteristics of salience distribution to allocate the bit-widths of each group, increasing the accuracy of quantized LLMs and maintaining the inference efficiency; (2) Salience-Weighted Quantizer Calibration optimizes the parameters of the quantizer by considering the element-wise salience within the group, balancing the maintenance of salient information and minimization of errors. Comprehensive experiments show that SliM-LLM significantly improves the accuracy of LLMs at ultra-low bits, e.g., 2-bit LLaMA-7B achieves a 5.5-times memory-saving than original model on NVIDIA A800 GPUs, and 48% decrease of perplexity compared to the state-of-the-art gradient-free PTQ method. Moreover, SliM-LLM+, which is integrated from the extension of SliM-LLM with gradient-based quantizers, further reduces perplexity by 35.1%.
Abstract:Implantable Brain-machine interfaces (BMIs) are promising for motor rehabilitation and mobility augmentation, and they demand accurate and energy-efficient algorithms. In this paper, we propose a novel spiking neural network (SNN) decoder for regression tasks for implantable BMIs. The SNN is trained with enhanced spatio-temporal backpropagation to fully leverage its capability to handle temporal problems. The proposed SNN decoder outperforms the state-of-the-art Kalman filter and artificial neural network (ANN) decoders in offline finger velocity decoding tasks. The decoder is deployed on a RISC-V-based hardware platform and optimized to exploit sparsity. The proposed implementation has an average power consumption of 0.50 mW in a duty-cycled mode. When conducting continuous inference without duty-cycling, it achieves an energy efficiency of 1.88 uJ per inference, which is 5.5X less than the baseline ANN. Additionally, the average decoding latency is 0.12 ms for each inference, which is 5.7X faster than the ANN implementation.