Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Emre Neftci

QS4D: Quantization-aware training for efficient hardware deployment of structured state-space sequential models

Jul 08, 2025

Sebastian Siegel, Ming-Jay Yang, Younes Bouhadjar, Maxime Fabre, Emre Neftci, John Paul Strachan

Abstract:Structured State Space models (SSM) have recently emerged as a new class of deep learning models, particularly well-suited for processing long sequences. Their constant memory footprint, in contrast to the linearly scaling memory demands of Transformers, makes them attractive candidates for deployment on resource-constrained edge-computing devices. While recent works have explored the effect of quantization-aware training (QAT) on SSMs, they typically do not address its implications for specialized edge hardware, for example, analog in-memory computing (AIMC) chips. In this work, we demonstrate that QAT can significantly reduce the complexity of SSMs by up to two orders of magnitude across various performance metrics. We analyze the relation between model size and numerical precision, and show that QAT enhances robustness to analog noise and enables structural pruning. Finally, we integrate these techniques to deploy SSMs on a memristive analog in-memory computing substrate and highlight the resulting benefits in terms of computational efficiency.

Via

Access Paper or Ask Questions

Structured State Space Model Dynamics and Parametrization for Spiking Neural Networks

Jun 04, 2025

Maxime Fabre, Lyubov Dudchenko, Emre Neftci

Abstract:Multi-state spiking neurons such as the adaptive leaky integrate-and-fire (AdLIF) neuron offer compelling alternatives to conventional deep learning models thanks to their sparse binary activations, second-order nonlinear recurrent dynamics, and efficient hardware realizations. However, such internal dynamics can cause instabilities during inference and training, often limiting performance and scalability. Meanwhile, state space models (SSMs) excel in long sequence processing using linear state-intrinsic recurrence resembling spiking neurons' subthreshold regime. Here, we establish a mathematical bridge between SSMs and second-order spiking neuron models. Based on structure and parametrization strategies of diagonal SSMs, we propose two novel spiking neuron models. The first extends the AdLIF neuron through timestep training and logarithmic reparametrization to facilitate training and improve final performance. The second additionally brings initialization and structure from complex-state SSMs, broadening the dynamical regime to oscillatory dynamics. Together, our two models achieve beyond or near state-of-the-art (SOTA) performances for reset-based spiking neuron models across both event-based and raw audio speech recognition datasets. We achieve this with a favorable number of parameters and required dynamic memory while maintaining high activity sparsity. Our models demonstrate enhanced scalability in network size and strike a favorable balance between performance and efficiency with respect to SSM models.

Via

Access Paper or Ask Questions

Contrastive Consolidation of Top-Down Modulations Achieves Sparsely Supervised Continual Learning

May 20, 2025

Viet Anh Khoa Tran, Emre Neftci, Willem. A. M. Wybo

Abstract:Biological brains learn continually from a stream of unlabeled data, while integrating specialized information from sparsely labeled examples without compromising their ability to generalize. Meanwhile, machine learning methods are susceptible to catastrophic forgetting in this natural learning setting, as supervised specialist fine-tuning degrades performance on the original task. We introduce task-modulated contrastive learning (TMCL), which takes inspiration from the biophysical machinery in the neocortex, using predictive coding principles to integrate top-down information continually and without supervision. We follow the idea that these principles build a view-invariant representation space, and that this can be implemented using a contrastive loss. Then, whenever labeled samples of a new class occur, new affine modulations are learned that improve separation of the new class from all others, without affecting feedforward weights. By co-opting the view-invariance learning mechanism, we then train feedforward weights to match the unmodulated representation of a data sample to its modulated counterparts. This introduces modulation invariance into the representation space, and, by also using past modulations, stabilizes it. Our experiments show improvements in both class-incremental and transfer learning over state-of-the-art unsupervised approaches, as well as over comparable supervised approaches, using as few as 1% of available labels. Taken together, our work suggests that top-down modulations play a crucial role in balancing stability and plasticity.

* 33 pages, 5 figures

Via

Access Paper or Ask Questions

A Grid Cell-Inspired Structured Vector Algebra for Cognitive Maps

Mar 11, 2025

Sven Krausse, Emre Neftci, Friedrich T. Sommer, Alpha Renner

Abstract:The entorhinal-hippocampal formation is the mammalian brain's navigation system, encoding both physical and abstract spaces via grid cells. This system is well-studied in neuroscience, and its efficiency and versatility make it attractive for applications in robotics and machine learning. While continuous attractor networks (CANs) successfully model entorhinal grid cells for encoding physical space, integrating both continuous spatial and abstract spatial computations into a unified framework remains challenging. Here, we attempt to bridge this gap by proposing a mechanistic model for versatile information processing in the entorhinal-hippocampal formation inspired by CANs and Vector Symbolic Architectures (VSAs), a neuro-symbolic computing framework. The novel grid-cell VSA (GC-VSA) model employs a spatially structured encoding scheme with 3D neuronal modules mimicking the discrete scales and orientations of grid cell modules, reproducing their characteristic hexagonal receptive fields. In experiments, the model demonstrates versatility in spatial and abstract tasks: (1) accurate path integration for tracking locations, (2) spatio-temporal representation for querying object locations and temporal relations, and (3) symbolic reasoning using family trees as a structured test case for hierarchical relationships.

* 10 pages, 5 figures, accepted at the 2025 Neuro Inspired Computational Elements (NICE) conference

Via

Access Paper or Ask Questions

A Truly Sparse and General Implementation of Gradient-Based Synaptic Plasticity

Jan 20, 2025

Jamie Lohoff, Anil Kaya, Florian Assmuth, Emre Neftci

Figure 1 for A Truly Sparse and General Implementation of Gradient-Based Synaptic Plasticity

Figure 2 for A Truly Sparse and General Implementation of Gradient-Based Synaptic Plasticity

Figure 3 for A Truly Sparse and General Implementation of Gradient-Based Synaptic Plasticity

Figure 4 for A Truly Sparse and General Implementation of Gradient-Based Synaptic Plasticity

Abstract:Online synaptic plasticity rules derived from gradient descent achieve high accuracy on a wide range of practical tasks. However, their software implementation often requires tediously hand-derived gradients or using gradient backpropagation which sacrifices the online capability of the rules. In this work, we present a custom automatic differentiation (AD) pipeline for sparse and online implementation of gradient-based synaptic plasticity rules that generalizes to arbitrary neuron models. Our work combines the programming ease of backpropagation-type methods for forward AD while being memory-efficient. To achieve this, we exploit the advantageous compute and memory scaling of online synaptic plasticity by providing an inherently sparse implementation of AD where expensive tensor contractions are replaced with simple element-wise multiplications if the tensors are diagonal. Gradient-based synaptic plasticity rules such as eligibility propagation (e-prop) have exactly this property and thus profit immensely from this feature. We demonstrate the alignment of our gradients with respect to gradient backpropagation on an synthetic task where e-prop gradients are exact, as well as audio speech classification benchmarks. We demonstrate how memory utilization scales with network size without dependence on the sequence length, as expected from forward AD methods.

* 8 pages, 7 figures

Via

Access Paper or Ask Questions

Optimal Gradient Checkpointing for Sparse and Recurrent Architectures using Off-Chip Memory

Dec 16, 2024

Wadjih Bencheikh, Jan Finkbeiner, Emre Neftci

Figure 1 for Optimal Gradient Checkpointing for Sparse and Recurrent Architectures using Off-Chip Memory

Figure 2 for Optimal Gradient Checkpointing for Sparse and Recurrent Architectures using Off-Chip Memory

Figure 3 for Optimal Gradient Checkpointing for Sparse and Recurrent Architectures using Off-Chip Memory

Figure 4 for Optimal Gradient Checkpointing for Sparse and Recurrent Architectures using Off-Chip Memory

Abstract:Recurrent neural networks (RNNs) are valued for their computational efficiency and reduced memory requirements on tasks involving long sequence lengths but require high memory-processor bandwidth to train. Checkpointing techniques can reduce the memory requirements by only storing a subset of intermediate states, the checkpoints, but are still rarely used due to the computational overhead of the additional recomputation phase. This work addresses these challenges by introducing memory-efficient gradient checkpointing strategies tailored for the general class of sparse RNNs and Spiking Neural Networks (SNNs). SNNs are energy efficient alternatives to RNNs thanks to their local, event-driven operation and potential neuromorphic implementation. We use the Intelligence Processing Unit (IPU) as an exemplary platform for architectures with distributed local memory. We exploit its suitability for sparse and irregular workloads to scale SNN training on long sequence lengths. We find that Double Checkpointing emerges as the most effective method, optimizing the use of local memory resources while minimizing recomputation overhead. This approach reduces dependency on slower large-scale memory access, enabling training on sequences over 10 times longer or 4 times larger networks than previously feasible, with only marginal time overhead. The presented techniques demonstrate significant potential to enhance scalability and efficiency in training sparse and recurrent networks across diverse hardware platforms, and highlights the benefits of sparse activations for scalable recurrent neural network training.

Via

Access Paper or Ask Questions

Zero-Shot Temporal Resolution Domain Adaptation for Spiking Neural Networks

Nov 07, 2024

Sanja Karilanova, Maxime Fabre, Emre Neftci, Ayça Özçelikkale

Figure 1 for Zero-Shot Temporal Resolution Domain Adaptation for Spiking Neural Networks

Figure 2 for Zero-Shot Temporal Resolution Domain Adaptation for Spiking Neural Networks

Figure 3 for Zero-Shot Temporal Resolution Domain Adaptation for Spiking Neural Networks

Figure 4 for Zero-Shot Temporal Resolution Domain Adaptation for Spiking Neural Networks

Abstract:Spiking Neural Networks (SNNs) are biologically-inspired deep neural networks that efficiently extract temporal information while offering promising gains in terms of energy efficiency and latency when deployed on neuromorphic devices. However, SNN model parameters are sensitive to temporal resolution, leading to significant performance drops when the temporal resolution of target data at the edge is not the same with that of the pre-deployment source data used for training, especially when fine-tuning is not possible at the edge. To address this challenge, we propose three novel domain adaptation methods for adapting neuron parameters to account for the change in time resolution without re-training on target time-resolution. The proposed methods are based on a mapping between neuron dynamics in SNNs and State Space Models (SSMs); and are applicable to general neuron models. We evaluate the proposed methods under spatio-temporal data tasks, namely the audio keyword spotting datasets SHD and MSWC as well as the image classification NMINST dataset. Our methods provide an alternative to - and in majority of the cases significantly outperform - the existing reference method that simply scales the time constant. Moreover, our results show that high accuracy on high temporal resolution data can be obtained by time efficient training on lower temporal resolution data and model adaptation.

Via

Access Paper or Ask Questions

On-Chip Learning via Transformer In-Context Learning

Oct 11, 2024

Jan Finkbeiner, Emre Neftci

Figure 1 for On-Chip Learning via Transformer In-Context Learning

Figure 2 for On-Chip Learning via Transformer In-Context Learning

Figure 3 for On-Chip Learning via Transformer In-Context Learning

Figure 4 for On-Chip Learning via Transformer In-Context Learning

Abstract:Autoregressive decoder-only transformers have become key components for scalable sequence processing and generation models. However, the transformer's self-attention mechanism requires transferring prior token projections from the main memory at each time step (token), thus severely limiting their performance on conventional processors. Self-attention can be viewed as a dynamic feed-forward layer, whose matrix is input sequence-dependent similarly to the result of local synaptic plasticity. Using this insight, we present a neuromorphic decoder-only transformer model that utilizes an on-chip plasticity processor to compute self-attention. Interestingly, the training of transformers enables them to ``learn'' the input context during inference. We demonstrate this in-context learning ability of transformers on the Loihi 2 processor by solving a few-shot classification problem. With this we emphasize the importance of pretrained models especially their ability to find simple, local, backpropagation free, learning rules enabling on-chip learning and adaptation in a hardware friendly manner.

Via

Access Paper or Ask Questions

Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models

Sep 28, 2024

Nathan Leroux, Paul-Philipp Manea, Chirag Sudarshan, Jan Finkbeiner, Sebastian Siegel, John Paul Strachan, Emre Neftci

Figure 1 for Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models

Figure 2 for Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models

Figure 3 for Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models

Figure 4 for Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models

Abstract:Transformer neural networks, driven by self-attention mechanisms, are core components of foundational and Large Language Models. In generative transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, GPU-stored projections must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks for long sequences. In this work, we propose a fast and energy-efficient hardware implementation of self-attention using analog in-memory computing based on gain cell memories. Volatile gain cell memories can be efficiently written to store new tokens during sequence generation, while performing analog signed weight multiplications to compute the dot-products required for self-attention. We implement Sliding Window Attention, which keeps memory of a finite set of past steps. A charge-to-pulse converter for array readout eliminates the need for analog-to-digital conversion between self-attention stages. Using a co-designed initialization algorithm to adapt pre-trained weights to gain cell non-idealities, we achieve NLP performance comparable to ChatGPT-2 with minimal training iterations, despite hardware constraints. Our end-to-end hardware design includes digital controls, estimating area, latency, and energy. The system reduces attention latency by up to two orders of magnitude and energy consumption by up to five orders compared to GPUs, marking a significant step toward ultra-fast, low-power sequence generation in Large Language Models.

* 25 pages, 6 figures, 1 table

Via

Access Paper or Ask Questions

SNNAX -- Spiking Neural Networks in JAX

Sep 04, 2024

Jamie Lohoff, Jan Finkbeiner, Emre Neftci

Figure 1 for SNNAX -- Spiking Neural Networks in JAX

Figure 2 for SNNAX -- Spiking Neural Networks in JAX

Abstract:Spiking Neural Networks (SNNs) simulators are essential tools to prototype biologically inspired models and neuromorphic hardware architectures and predict their performance. For such a tool, ease of use and flexibility are critical, but so is simulation speed especially given the complexity inherent to simulating SNN. Here, we present SNNAX, a JAX-based framework for simulating and training such models with PyTorch-like intuitiveness and JAX-like execution speed. SNNAX models are easily extended and customized to fit the desired model specifications and target neuromorphic hardware. Additionally, SNNAX offers key features for optimizing the training and deployment of SNNs such as flexible automatic differentiation and just-in-time compilation. We evaluate and compare SNNAX to other commonly used machine learning (ML) frameworks used for programming SNNs. We provide key performance metrics, best practices, documented examples for simulating SNNs in SNNAX, and implement several benchmarks used in the literature.

Via

Access Paper or Ask Questions