Abbas Rahimi

IBM Research - Zurich

Terminating Differentiable Tree Experts

Jul 02, 2024

Jonathan Thomm, Michael Hersche, Giacomo Camposampiero, Aleksandar Terzić, Bernhard Schölkopf, Abbas Rahimi

Abstract:We advance the recently proposed neuro-symbolic Differentiable Tree Machine, which learns tree operations using a combination of transformers and Tensor Product Representations. We investigate the architecture and propose two key components. We first remove a series of different transformer layers that are used in every step by introducing a mixture of experts. This results in a Differentiable Tree Experts model with a constant number of parameters for an arbitrary number of steps in the computation, compared to the previous method in the Differentiable Tree Machine with a linear growth. Given this flexibility in the number of steps, we additionally propose a new termination algorithm to give the model the power to choose automatically how many steps to make. The resulting Terminating Differentiable Tree Experts model sluggishly learns to predict the number of steps without an oracle. It can do so while maintaining the learning capabilities of the model, converging to the optimal number of steps.

* Accepted at the 18th International Conference on Neural-Symbolic Learning and Reasoning (NeSy) 2024

Towards Learning Abductive Reasoning using VSA Distributed Representations

Jun 27, 2024

Giacomo Camposampiero, Michael Hersche, Aleksandar Terzić, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi

Abstract:We introduce the Abductive Rule Learner with Context-awareness (ARLC), a model that solves abstract reasoning tasks based on Learn-VRF. ARLC features a novel and more broadly applicable training objective for abductive reasoning, resulting in better interpretability and higher accuracy when solving Raven's progressive matrices (RPM). ARLC allows both programming domain knowledge and learning the rules underlying a data distribution. We evaluate ARLC on the I-RAVEN dataset, showcasing state-of-the-art accuracy across both in-distribution and out-of-distribution (unseen attribute-rule pairs) tests. ARLC surpasses neuro-symbolic and connectionist baselines, including large language models, despite having orders of magnitude fewer parameters. We show ARLC's robustness to post-programming training by incrementally learning from examples on top of programmed knowledge, which only improves its performance and does not result in catastrophic forgetting of the programmed solution. We validate ARLC's seamless transfer learning from a 2x2 RPM constellation to unseen constellations. Our code is available at https://github.com/IBM/abductive-rule-learner-with-context-awareness.

* Accepted at the 18th International Conference on Neural-Symbolic Learning and Reasoning (NeSy) 2024

12 mJ per Class On-Device Online Few-Shot Class-Incremental Learning

Mar 12, 2024

Yoga Esa Wibowo, Cristian Cioflan, Thorir Mar Ingolfsson, Michael Hersche, Leo Zhao, Abbas Rahimi, Luca Benini

Abstract:Few-Shot Class-Incremental Learning (FSCIL) enables machine learning systems to expand their inference capabilities to new classes using only a few labeled examples, without forgetting the previously learned classes. Classical backpropagation-based learning and its variants are often unsuitable for battery-powered, memory-constrained systems at the extreme edge. In this work, we introduce Online Few-Shot Class-Incremental Learning (O-FSCIL), based on a lightweight model consisting of a pretrained and metalearned feature extractor and an expandable explicit memory storing the class prototypes. The architecture is pretrained with a novel feature orthogonality regularization and metalearned with a multi-margin loss. For learning a new class, our approach extends the explicit memory with novel class prototypes, while the remaining architecture is kept frozen. This allows learning previously unseen classes based on only a few examples with one single pass (hence online). O-FSCIL obtains an average accuracy of 68.62% on the FSCIL CIFAR100 benchmark, achieving state-of-the-art results. Tailored for ultra-low-power platforms, we implement O-FSCIL on the 60 mW GAP9 microcontroller, demonstrating online learning capabilities within just 12 mJ per new class.

* 6 pages, 4 tables, 3 figures. Accepted at IEEE DATE 2024
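
The explicit-memory idea in the abstract above can be illustrated with a minimal sketch. It assumes that a class prototype is the normalized mean of a few feature vectors and that classification picks the prototype with the highest cosine similarity; the abstract does not state either detail, and the `PrototypeMemory` class and the toy feature vectors are hypothetical stand-ins for the outputs of the frozen feature extractor.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


class PrototypeMemory:
    """Hypothetical expandable memory holding one prototype per class."""

    def __init__(self):
        self.prototypes = {}

    def learn_class(self, label, features):
        # One pass over the few support examples: average their features.
        # Averaging is an assumption of this sketch, not a detail from the abstract.
        dim = len(features[0])
        mean = [sum(f[i] for f in features) / len(features) for i in range(dim)]
        self.prototypes[label] = normalize(mean)

    def classify(self, feature):
        # Return the label whose prototype has the highest cosine similarity.
        query = normalize(feature)
        return max(
            self.prototypes,
            key=lambda label: sum(a * b for a, b in zip(query, self.prototypes[label])),
        )


memory = PrototypeMemory()
# Toy features standing in for the outputs of a frozen feature extractor.
memory.learn_class("cat", [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]])
memory.learn_class("dog", [[0.1, 0.9, 0.0], [0.2, 0.7, 0.1]])
prediction = memory.classify([0.7, 0.3, 0.0])
print(prediction)
```

Adding a class only writes a new entry into the memory, so nothing that was learned before is modified.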

Limits of Transformer Language Models on Learning Algorithmic Compositions

Feb 13, 2024

Jonathan Thomm, Aleksandar Terzic, Geethan Karunaratne, Giacomo Camposampiero, Bernhard Schölkopf, Abbas Rahimi

Abstract:We analyze the capabilities of Transformer language models on learning discrete algorithms. To this end, we introduce two new tasks demanding the composition of several discrete sub-tasks. On both training LLaMA models from scratch and prompting on GPT-4 and Gemini we measure learning compositions of learned primitives. We observe that the compositional capabilities of state-of-the-art Transformer language models are very limited and sample-wise scale worse than relearning all sub-tasks for a new algorithmic composition. We also present a theorem in complexity theory, showing that gradient descent on memorizing feedforward models can be exponentially data inefficient.

Zero-shot Classification using Hyperdimensional Computing

Jan 30, 2024

Samuele Ruffino, Geethan Karunaratne, Michael Hersche, Luca Benini, Abu Sebastian, Abbas Rahimi

Abstract:Classification based on Zero-shot Learning (ZSL) is the ability of a model to classify inputs into novel classes on which the model has not previously seen any training examples. Providing an auxiliary descriptor in the form of a set of attributes describing the new classes involved in the ZSL-based classification is one of the favored approaches to solving this challenging task. In this work, inspired by Hyperdimensional Computing (HDC), we propose the use of stationary binary codebooks of symbol-like distributed representations inside an attribute encoder to compactly represent a computationally simple end-to-end trainable model, which we name Hyperdimensional Computing Zero-shot Classifier (HDC-ZSC). It consists of a trainable image encoder, an attribute encoder based on HDC, and a similarity kernel. We show that HDC-ZSC can be used to first perform zero-shot attribute extraction tasks and can later be repurposed for Zero-shot Classification tasks with minimal architectural changes and minimal model retraining. HDC-ZSC achieves Pareto optimal results with a 63.8% top-1 classification accuracy on the CUB-200 dataset with only 26.6 million trainable parameters. Compared to two other state-of-the-art non-generative approaches, HDC-ZSC achieves 4.3% and 9.9% better accuracy, while they require more than 1.85x and 1.72x parameters compared to HDC-ZSC, respectively.

* This is the extended version of a paper accepted in the Design, Automation, and Test in Europe Conference (DATE), 2024

Probabilistic Abduction for Visual Abstract Reasoning via Learning Rules in Vector-symbolic Architectures

Jan 29, 2024

Michael Hersche, Francesco di Stefano, Thomas Hofmann, Abu Sebastian, Abbas Rahimi

Abstract:Abstract reasoning is a cornerstone of human intelligence, and replicating it with artificial intelligence (AI) presents an ongoing challenge. This study focuses on efficiently solving Raven's progressive matrices (RPM), a visual test for assessing abstract reasoning abilities, by using distributed computation and operators provided by vector-symbolic architectures (VSA). Instead of hard-coding the rule formulations associated with RPMs, our approach can learn the VSA rule formulations (hence the name Learn-VRF) with just one pass through the training data. Yet, our approach, with compact parameters, remains transparent and interpretable. Learn-VRF yields accurate predictions on I-RAVEN's in-distribution data, and exhibits strong out-of-distribution capabilities concerning unseen attribute-rule pairs, significantly outperforming pure connectionist baselines including large language models. Our code is available at https://github.com/IBM/learn-vector-symbolic-architectures-rule-formulations.

* Accepted in NeurIPS 2023 Workshop on MATH-AI

TCNCA: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing

Dec 09, 2023

Aleksandar Terzic, Michael Hersche, Geethan Karunaratne, Luca Benini, Abu Sebastian, Abbas Rahimi

Abstract:MEGA is a recent transformer-based architecture, which utilizes a linear recurrent operator whose parallel computation, based on the FFT, scales as $O(L \log L)$, with $L$ being the sequence length. We build upon their approach by replacing the linear recurrence with a special temporal convolutional network which permits larger receptive field size with shallower networks, and reduces the computational complexity to $O(L)$. The resulting model is called TCNCA, a Temporal Convolutional Network with Chunked Attention. We evaluate TCNCA on EnWik8 language modeling, long-range-arena (LRA) sequence classification, as well as a synthetic reasoning benchmark associative recall. On EnWik8, TCNCA outperforms MEGA, reaching a lower loss with $1.37\times$/$1.24\times$ faster forward/backward pass during training. The dilated convolutions used in TCNCA are consistently and significantly faster operations than the FFT-based parallelized recurrence in GPUs, making them a scalable candidate for handling very large sequence lengths: they are up to $7.07\times$/$2.86\times$ faster in the forward/backward pass for sequences up to 131k. Further on LRA, TCNCA achieves, on average, $1.28\times$ speed-up during inference with similar accuracy to what MEGA achieves. On associative recall, we find that even a simplified version of TCNCA, without excessive multiplicative and additive interactions, remains superior or competitive to MEGA on a range of sequence lengths and vocabulary sizes.
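
The remark that dilated convolutions reach a larger receptive field with shallower networks can be made concrete with the standard receptive-field formula for stacked stride-1 convolutions, 1 + Σ (k − 1)·d over the layers. The sketch below is illustrative only: the kernel size of 3 and the dilation schedule 1, 2, 4, 8 are assumptions chosen for the example, not the configuration used in TCNCA.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked 1-D convolutions with stride 1.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Four layers without dilation versus four layers with doubling dilation.
# Both schedules are assumptions made for this illustration.
dense = receptive_field(3, [1, 1, 1, 1])
dilated = receptive_field(3, [1, 2, 4, 8])
print(dense, dilated)  # 9 31
```

At equal depth, the doubling schedule grows the receptive field exponentially in the number of layers, while the cost of each layer stays linear in the sequence length.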

MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition

Dec 05, 2023

Nicolas Menet, Michael Hersche, Geethan Karunaratne, Luca Benini, Abu Sebastian, Abbas Rahimi

Abstract:With the advent of deep learning, progressively larger neural networks have been designed to solve complex tasks. We take advantage of these capacity-rich models to lower the cost of inference by exploiting computation in superposition. To reduce the computational burden per input, we propose Multiple-Input-Multiple-Output Neural Networks (MIMONets) capable of handling many inputs at once. MIMONets augment various deep neural network architectures with variable binding mechanisms to represent an arbitrary number of inputs in a compositional data structure via fixed-width distributed representations. Accordingly, MIMONets adapt nonlinear neural transformations to process the data structure holistically, leading to a speedup nearly proportional to the number of superposed input items in the data structure. After processing in superposition, an unbinding mechanism recovers each transformed input of interest. MIMONets also provide a dynamic trade-off between accuracy and throughput by an instantaneous on-demand switching between a set of accuracy-throughput operating points, yet within a single set of fixed parameters. We apply the concept of MIMONets to both CNN and Transformer architectures resulting in MIMOConv and MIMOFormer, respectively. Empirical evaluations show that MIMOConv achieves about 2-4x speedup at an accuracy delta within [+0.68, -3.18]% compared to WideResNet CNNs on CIFAR10 and CIFAR100. Similarly, MIMOFormer can handle 2-4 inputs at once while maintaining a high average accuracy within a [-1.07, -3.43]% delta on the long range arena benchmark. Finally, we provide mathematical bounds on the interference between superposition channels in MIMOFormer. Our code is available at https://github.com/IBM/multiple-input-multiple-output-nets.

* accepted in NeurIPS 2023
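
The bind, superpose, and unbind steps mentioned in the abstract above can be sketched with a toy example. It assumes random bipolar vectors and element-wise multiplication as the binding operation, which is one common choice in vector-symbolic architectures and not necessarily the mechanism used in MIMONets. The neural network that would process the superposed vector is omitted, and all names below are hypothetical.

```python
import random

random.seed(0)
DIM = 2048  # width of the distributed representation (assumed for this sketch)


def random_bipolar(dim):
    """Random vector with entries drawn from {-1, +1}."""
    return [random.choice((-1, 1)) for _ in range(dim)]


def bind(a, b):
    """Element-wise multiplication; it is its own inverse for bipolar keys."""
    return [x * y for x, y in zip(a, b)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


keys = [random_bipolar(DIM) for _ in range(2)]
inputs = [random_bipolar(DIM) for _ in range(2)]

# Bind each input to its key, then add the results into one fixed-width vector.
bound = [bind(k, x) for k, x in zip(keys, inputs)]
superposed = [sum(values) for values in zip(*bound)]

# Unbinding with the first key yields the first input plus noise from the other channel.
recovered = bind(keys[0], superposed)
sim_target = cosine(recovered, inputs[0])
sim_other = cosine(recovered, inputs[1])
print(round(sim_target, 2), round(sim_other, 2))
```

With two superposed items, the recovered vector stays clearly closer to its own input than to the other one. The residual similarity to the other input is the kind of cross-channel interference that the abstract says is bounded for MIMOFormer.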

Model-Driven Engineering for Artificial Intelligence -- A Systematic Literature Review

Jul 10, 2023

Simon Raedler, Luca Berardinelli, Karolin Winter, Abbas Rahimi, Stefanie Rinderle-Ma

Abstract:Objective: This study aims to investigate the existing body of knowledge in the field of Model-Driven Engineering (MDE) in support of AI (MDE4AI) to sharpen future research further and define the current state of the art. Method: We conducted a Systematic Literature Review (SLR), collecting papers from five major databases resulting in 703 candidate studies, eventually retaining 15 primary studies. Each primary study will be evaluated and discussed with respect to the adoption of (1) MDE principles and practices and (2) the phases of AI development support aligned with the stages of the CRISP-DM methodology. Results: The study's findings show that the pillar concepts of MDE (metamodel, concrete syntax and model transformation) are leveraged to define domain-specific languages (DSL) explicitly addressing AI concerns. Different MDE technologies are used, leveraging different language workbenches. The most prominent AI-related concerns are training and modeling of the AI algorithm, while minor emphasis is given to the time-consuming preparation of the data sets. Early project phases that support interdisciplinary communication of requirements, such as the CRISP-DM Business Understanding phase, are rarely reflected. Conclusion: The study found that the use of MDE for AI is still in its early stages, and there is no single tool or method that is widely used. Additionally, current approaches tend to focus on specific stages of development rather than providing support for the entire development process. As a result, the study suggests several research directions to further improve the use of MDE for AI and to guide future research in this area.

* 42 pages, 1 figure, 8 tables

Factorizers for Distributed Sparse Block Codes

Mar 24, 2023

Michael Hersche, Aleksandar Terzic, Geethan Karunaratne, Jovin Langenegger, Angéline Pouget, Giovanni Cherubini, Luca Benini, Abu Sebastian, Abbas Rahimi

Abstract:Distributed sparse block codes (SBCs) exhibit compact representations for encoding and manipulating symbolic data structures using fixed-width vectors. One major challenge, however, is to disentangle, or factorize, such data structures into their constituent elements without having to search through all possible combinations. This factorization becomes more challenging when queried by noisy SBCs wherein symbol representations are relaxed due to perceptual uncertainty and approximations made when modern neural networks are used to generate the query vectors. To address these challenges, we first propose a fast and highly accurate method for factorizing a more flexible and hence generalized form of SBCs, dubbed GSBCs. Our iterative factorizer introduces a threshold-based nonlinear activation, a conditional random sampling, and an $\ell_\infty$-based similarity metric. Its random sampling mechanism in combination with the search in superposition makes it possible to analytically determine the expected number of decoding iterations, which matches the empirical observations up to the GSBC's bundling capacity. Secondly, the proposed factorizer maintains its high accuracy when queried by noisy product vectors generated using deep convolutional neural networks (CNNs). This facilitates its application in replacing the large fully connected layer (FCL) in CNNs, whereby C trainable class vectors, or attribute combinations, can be implicitly represented by our factorizer having F-factor codebooks, each with $\sqrt[F]{C}$ fixed codevectors. We provide a methodology to flexibly integrate our factorizer in the classification layer of CNNs with a novel loss function. We demonstrate the feasibility of our method on four deep CNN architectures over CIFAR-100, ImageNet-1K, and RAVEN datasets. In all use cases, the number of parameters and operations are significantly reduced compared to the FCL.
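
The saving from representing C classes with F codebooks of $\sqrt[F]{C}$ codevectors each can be checked with a short calculation. The sketch below counts stored vectors only. Rounding the root up when C is not a perfect F-th power is an assumption of this sketch, and C = 1000 is used purely as an example of a class count.

```python
def codebook_size(num_classes, num_factors):
    """Smallest integer M such that M ** num_factors >= num_classes."""
    size = 1
    while size ** num_factors < num_classes:
        size += 1
    return size


def stored_vectors(num_classes, num_factors):
    """Total number of codevectors kept across all factor codebooks."""
    return num_factors * codebook_size(num_classes, num_factors)


# F = 1 corresponds to storing one vector per class.
for factors in (1, 2, 3):
    print(factors, codebook_size(1000, factors), stored_vectors(1000, factors))
# 1 1000 1000
# 2 32 64
# 3 10 30
```

With three factors, 30 stored codevectors can address 10 × 10 × 10 = 1000 combinations, compared with 1000 explicit class vectors.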
