Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tamoghno Das

Robust Reasoning and Learning with Brain-Inspired Representations under Hardware-Induced Nonlinearities

Apr 13, 2026

William Youngwoo Chung, Hamza Errahmouni Barkam, Tamoghno Das, Mohsen Imani

Abstract:Traditional machine learning depends on high-precision arithmetic and near-ideal hardware assumptions, which is increasingly challenged by variability in aggressively scaled semiconductor devices. Compute-in-memory (CIM) architectures alleviate data-movement bottlenecks and improve energy efficiency yet introduce nonlinear distortions and reliability concerns. We address these issues with a hardware-aware optimization framework based on Hyperdimensional Computing (HDC), systematically compensating for non-ideal similarity computations in CIM. Our approach formulates encoding as an optimization problem, minimizing the Frobenius norm between an ideal kernel and its hardware-constrained counterpart, and employs a joint optimization strategy for end-to-end calibration of hypervector representations. Experimental results demonstrate that our method when applied to QuantHD achieves 84\% accuracy under severe hardware-induced perturbations, a 48\% increase over naive QuantHD under the same conditions. Additionally, our optimization is vital for graph-based HDC reliant on precise variable-binding for interpretable reasoning. Our framework preserves the accuracy of RelHD on the Cora dataset, achieving a 5.4$\times$ accuracy improvement over naive RelHD under nonlinear environments. By preserving HDC's robustness and symbolic properties, our solution enables scalable, energy-efficient intelligent systems capable of classification and reasoning on emerging CIM hardware.

* In Proceedings of the Great Lakes Symposium on VLSI 2025 (GLSVLSI '25), June 30-July 2, 2025, New Orleans, LA, USA. Association for Computing Machinery, New York, NY, USA, 8 pages
* 8 pages, 7 figures, accepted to Great Lakes Symposium on VLSI (GLSVLSI) 2025

Via

Access Paper or Ask Questions

TorR: Towards Brain-Inspired Task-Oriented Reasoning via Cache-Oriented Algorithm-Architecture Co-design

Mar 24, 2026

Hyunwoo Oh, SungHeon Jeong, Suyeon Jang, Hanning Chen, Sanggeon Yun, Tamoghno Das, Mohsen Imani

Abstract:Task-oriented object detection (TOOD) atop CLIP offers open-vocabulary, prompt-driven semantics, yet dense per-window computation and heavy memory traffic hinder real-time, power-limited edge deployment. We present \emph{TorR}, a brain-inspired \textbf{algorithm--architecture co-design} that \textbf{replaces CLIP-style dense alignment with a hyperdimensional (HDC) associative reasoner} and turns temporal coherence into reuse. On the \emph{algorithm} side, TorR reformulates alignment as HDC similarity and graph composition, introducing \emph{partial-similarity reuse} via (i) query caching with per-class score accumulation, (ii) exact $δ$-updates when only a small set of hypervector bits change, and (iii) similarity/load-gated bypass under high system load. On the \emph{architecture} side, TorR instantiates a lane-scalable, bit-sliced item memory with bank/precision gating and a lightweight controller that schedules bypass/$δ$/full paths to meet RT-30/RT-60 targets as object counts vary. Synthesized in a TSMC 28\,nm process and exercised with a cycle-accurate simulator, TorR sustains real-time throughput with millijoule-scale energy per window ($\approx$50\,mJ at 60\,FPS; $\approx$113\,mJ at 30\,FPS) and low latency jitter, while delivering competitive AP@0.5 across five task prompts (mean 44.27\%) within a bounded margin to strong VLM baselines, but at orders-of-magnitude lower energy. The design exposes deployment-time configurability (effective dimension $D'$, thresholds, precision) to trade accuracy, latency, and energy for edge budgets.

* Accepted to DAC 2026

Via

Access Paper or Ask Questions

T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization

Nov 17, 2025

Hyunwoo Oh, KyungIn Nam, Rajat Bhattacharjya, Hanning Chen, Tamoghno Das, Sanggeon Yun, Suyeon Jang, Andrew Ding, Nikil Dutt, Mohsen Imani

Figure 1 for T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization

Figure 2 for T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization

Figure 3 for T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization

Figure 4 for T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization

Abstract:Recent advances in LLMs have outpaced the computational and memory capacities of edge platforms that primarily employ CPUs, thereby challenging efficient and scalable deployment. While ternary quantization enables significant resource savings, existing CPU solutions rely heavily on memory-based lookup tables (LUTs) which limit scalability, and FPGA or GPU accelerators remain impractical for edge use. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, delivering 5.6-24.5x and 1.1-86.2x improvements in GEMM latency and GEMV throughput, respectively, with only 3.2% power and 1.4% area overheads in SIMD units. T-SAR achieves up to 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, establishing a practical approach for efficient LLM inference on edge platforms.

* Accepted to DATE 2026

Via

Access Paper or Ask Questions

QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention

Nov 17, 2025

Hyunwoo Oh, Hanning Chen, Sanggeon Yun, Yang Ni, Wenjun Huang, Tamoghno Das, Suyeon Jang, Mohsen Imani

Figure 1 for QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention

Figure 2 for QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention

Figure 3 for QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention

Figure 4 for QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention

Abstract:Deformable transformers deliver state-of-the-art detection but map poorly to hardware due to irregular memory access and low arithmetic intensity. We introduce QUILL, a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work. At its core, Distance-based Out-of-Order Querying (DOOQ) orders queries by spatial proximity; the look-ahead drives a region prefetch into an alternate buffer--forming a schedule-aware prefetch loop that overlaps memory and compute. A fused MSDeformAttn engine executes interpolation, Softmax, aggregation, and the final projection (W''m) in one pass without spilling intermediates, while small tensors are kept on-chip and surrounding dense layers run on integrated GEMMs. Implemented as RTL and evaluated end-to-end, QUILL achieves up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090, and exceeds prior accelerators by 3.26-9.82x in throughput and 2.01-6.07x in energy efficiency. With mixed-precision quantization, accuracy tracks FP32 within <=0.9 AP across Deformable and Sparse DETR variants. By converting sparsity into locality--and locality into utilization--QUILL delivers consistent, end-to-end speedups.

* Accepted to DATE 2026

Via

Access Paper or Ask Questions

ASTER: Attention-based Spiking Transformer Engine for Event-driven Reasoning

Nov 10, 2025

Tamoghno Das, Khanh Phan Vu, Hanning Chen, Hyunwoo Oh, Mohsen Imani

Abstract:The integration of spiking neural networks (SNNs) with transformer-based architectures has opened new opportunities for bio-inspired low-power, event-driven visual reasoning on edge devices. However, the high temporal resolution and binary nature of spike-driven computation introduce architectural mismatches with conventional digital hardware (CPU/GPU). Prior neuromorphic and Processing-in-Memory (PIM) accelerators struggle with high sparsity and complex operations prevalent in such models. To address these challenges, we propose a memory-centric hardware accelerator tailored for spiking transformers, optimized for deployment in real-time event-driven frameworks such as classification with both static and event-based input frames. Our design leverages a hybrid analog-digital PIM architecture with input sparsity optimizations, and a custom-designed dataflow to minimize memory access overhead and maximize data reuse under spatiotemporal sparsity, for compute and memory-efficient end-to-end execution of spiking transformers. We subsequently propose inference-time software optimizations for layer skipping, and timestep reduction, leveraging Bayesian Optimization with surrogate modeling to perform robust, efficient co-exploration of the joint algorithmic-microarchitectural design spaces under tight computational budgets. Evaluated on both image(ImageNet) and event-based (CIFAR-10 DVS, DVSGesture) classification, the accelerator achieves up to ~467x and ~1.86x energy reduction compared to edge GPU (Jetson Orin Nano) and previous PIM accelerators for spiking transformers, while maintaining competitive task accuracy on ImageNet dataset. This work enables a new class of intelligent ubiquitous edge AI, built using spiking transformer acceleration for low-power, real-time visual processing at the extreme edge.

* Submitted for review at conference

Via

Access Paper or Ask Questions

LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation

Apr 15, 2025

Hanning Chen, Yang Ni, Wenjun Huang, Hyunwoo Oh, Yezi Liu, Tamoghno Das, Mohsen Imani

Figure 1 for LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation

Figure 2 for LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation

Figure 3 for LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation

Figure 4 for LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation

Abstract:Large Vision Language Models (LVLMs) have been widely adopted to guide vision foundation models in performing reasoning segmentation tasks, achieving impressive performance. However, the substantial computational overhead associated with LVLMs presents a new challenge. The primary source of this computational cost arises from processing hundreds of image tokens. Therefore, an effective strategy to mitigate such overhead is to reduce the number of image tokens, a process known as image token pruning. Previous studies on image token pruning for LVLMs have primarily focused on high level visual understanding tasks, such as visual question answering and image captioning. In contrast, guiding vision foundation models to generate accurate visual masks based on textual queries demands precise semantic and spatial reasoning capabilities. Consequently, pruning methods must carefully control individual image tokens throughout the LVLM reasoning process. Our empirical analysis reveals that existing methods struggle to adequately balance reductions in computational overhead with the necessity to maintain high segmentation accuracy. In this work, we propose LVLM_CSP, a novel training free visual token pruning method specifically designed for LVLM based reasoning segmentation tasks. LVLM_CSP consists of three stages: clustering, scattering, and pruning. Initially, the LVLM performs coarse-grained visual reasoning using a subset of selected image tokens. Next, fine grained reasoning is conducted, and finally, most visual tokens are pruned in the last stage. Extensive experiments demonstrate that LVLM_CSP achieves a 65% reduction in image token inference FLOPs with virtually no accuracy degradation, and a 70% reduction with only a minor 1% drop in accuracy on the 7B LVLM.

Via

Access Paper or Ask Questions