Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hemanth Saratchandran

Data Denoising and Derivative Estimation for Data-Driven Modeling of Nonlinear Dynamical Systems

Sep 17, 2025

Jiaqi Yao, Lewis Mitchell, John Maclean, Hemanth Saratchandran

Abstract:Data-driven modeling of nonlinear dynamical systems is often hampered by measurement noise. We propose a denoising framework, called Runge-Kutta and Total Variation Based Implicit Neural Representation (RKTV-INR), that represents the state trajectory with an implicit neural representation (INR) fitted directly to noisy observations. Runge-Kutta integration and total variation are imposed as constraints to ensure that the reconstructed state is a trajectory of a dynamical system that remains close to the original data. The trained INR yields a clean, continuous trajectory and provides accurate first-order derivatives via automatic differentiation. These denoised states and derivatives are then supplied to Sparse Identification of Nonlinear Dynamics (SINDy) to recover the governing equations. Experiments demonstrate effective noise suppression, precise derivative estimation, and reliable system identification.

Via

Access Paper or Ask Questions

Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri--Rao Product

Aug 01, 2025

Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran, Anton van den Hengel, Ehsan Abbasnejad

Figure 1 for Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri--Rao Product

Figure 2 for Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri--Rao Product

Figure 3 for Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri--Rao Product

Figure 4 for Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri--Rao Product

Abstract:Parameter-efficient fine-tuning (PEFT) has become a standard approach for adapting large pre-trained models. Amongst PEFT methods, low-rank adaptation (LoRA) has achieved notable success. However, recent studies have highlighted its limitations compared against full-rank alternatives, particularly when applied to multimodal and large language models. In this work, we present a quantitative comparison amongst full-rank and low-rank PEFT methods using a synthetic matrix approximation benchmark with controlled spectral properties. Our results confirm that LoRA struggles to approximate matrices with relatively flat spectrums or high frequency components -- signs of high effective ranks. To this end, we introduce KRAdapter, a novel PEFT algorithm that leverages the Khatri-Rao product to produce weight updates, which, by construction, tends to produce matrix product with a high effective rank. We demonstrate performance gains with KRAdapter on vision-language models up to 1B parameters and on large language models up to 8B parameters, particularly on unseen common-sense reasoning tasks. In addition, KRAdapter maintains the memory and compute efficiency of LoRA, making it a practical and robust alternative to fine-tune billion-scale parameter models.

* To appear in ICCV 2025

Via

Access Paper or Ask Questions

Transformers Pretrained on Procedural Data Contain Modular Structures for Algorithmic Reasoning

May 28, 2025

Zachary Shinnick, Liangze Jiang, Hemanth Saratchandran, Anton van den Hengel, Damien Teney

Abstract:Pretraining on large, semantically rich datasets is key for developing language models. Surprisingly, recent studies have shown that even synthetic data, generated procedurally through simple semantic-free algorithms, can yield some of the same benefits as natural language pretraining. It is unclear what specific capabilities such simple synthetic data instils in a model, where these capabilities reside in the architecture, and how they manifest within its weights. In this short paper, we identify several beneficial forms of procedural data, together with specific algorithmic reasoning skills that improve in small transformers. Our core finding is that different procedural rules instil distinct but complementary inductive structures in the model. With extensive ablations and partial-transfer experiments, we discover that these structures reside in different parts of the model. Attention layers often carry the most transferable information, but some pretraining rules impart useful structure to MLP blocks instead. Most interestingly, the structures induced by multiple rules can be composed to jointly reinforce multiple capabilities. These results suggest an exciting possibility of disentangling the acquisition of knowledge from reasoning in language models, with the goal of improving their robustness and data efficiency.

Via

Access Paper or Ask Questions

Compressing Sine-Activated Low-Rank Adapters through Post-Training Quantization

May 28, 2025

Cameron Gordon, Yiping Ji, Hemanth Saratchandran, Paul Albert, Simon Lucey

Abstract:Low-Rank Adaptation (LoRA) has become a standard approach for parameter-efficient fine-tuning, offering substantial reductions in trainable parameters by modeling updates as the product of two low-rank matrices. While effective, the low-rank constraint inherently limits representational capacity, often resulting in reduced performance compared to full-rank fine-tuning. Recent work by Ji et al. (2025) has addressed this limitation by applying a fixed-frequency sinusoidal transformation to low-rank adapters, increasing their stable rank without introducing additional parameters. This raises a crucial question: can the same sine-activated technique be successfully applied within the context of Post-Training Quantization to retain benefits even after model compression? In this paper, we investigate this question by extending the sinusoidal transformation framework to quantized LoRA adapters. We develop a theoretical analysis showing that the stable rank of a quantized adapter is tightly linked to that of its full-precision counterpart, motivating the use of such rank-enhancing functions even under quantization. Our results demonstrate that the expressivity gains from a sinusoidal non-linearity persist after quantization, yielding highly compressed adapters with negligible loss in performance. We validate our approach across a range of fine-tuning tasks for language, vision and text-to-image generation achieving significant memory savings while maintaining competitive accuracy.

* 23 pages, 9 figures

Via

Access Paper or Ask Questions

Leaner Transformers: More Heads, Less Depth

May 27, 2025

Hemanth Saratchandran, Damien Teney, Simon Lucey

Abstract:Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets, leading to significant improvements in performance. This success has contributed to the belief that "bigger means better", leading to ever-increasing model sizes. This paper challenge this ideology by showing that many existing transformers might be unnecessarily oversized. We discover a theoretical principle that redefines the role of multi-head attention. An important benefit of the multiple heads is in improving the conditioning of the attention block. We exploit this theoretical insight and redesign popular architectures with an increased number of heads. The improvement in the conditioning proves so significant in practice that model depth can be decreased, reducing the parameter count by up to 30-50% while maintaining accuracy. We obtain consistent benefits across a variety of transformer-based architectures of various scales, on tasks in computer vision (ImageNet-1k) as well as language and sequence modeling (GLUE benchmark, TinyStories, and the Long-Range Arena benchmark).

Via

Access Paper or Ask Questions

Structured Initialization for Vision Transformers

May 26, 2025

Jianqiao Zheng, Xueqian Li, Hemanth Saratchandran, Simon Lucey

Figure 1 for Structured Initialization for Vision Transformers

Figure 2 for Structured Initialization for Vision Transformers

Figure 3 for Structured Initialization for Vision Transformers

Figure 4 for Structured Initialization for Vision Transformers

Abstract:Convolutional Neural Networks (CNNs) inherently encode strong inductive biases, enabling effective generalization on small-scale datasets. In this paper, we propose integrating this inductive bias into ViTs, not through an architectural intervention but solely through initialization. The motivation here is to have a ViT that can enjoy strong CNN-like performance when data assets are small, but can still scale to ViT-like performance as the data expands. Our approach is motivated by our empirical results that random impulse filters can achieve commensurate performance to learned filters within a CNN. We improve upon current ViT initialization strategies, which typically rely on empirical heuristics such as using attention weights from pretrained models or focusing on the distribution of attention weights without enforcing structures. Empirical results demonstrate that our method significantly outperforms standard ViT initialization across numerous small and medium-scale benchmarks, including Food-101, CIFAR-10, CIFAR-100, STL-10, Flowers, and Pets, while maintaining comparative performance on large-scale datasets such as ImageNet-1K. Moreover, our initialization strategy can be easily integrated into various transformer-based architectures such as Swin Transformer and MLP-Mixer with consistent improvements in performance.

Via

Access Paper or Ask Questions

Enhancing Transformers Through Conditioned Embedded Tokens

May 19, 2025

Hemanth Saratchandran, Simon Lucey

Abstract:Transformers have transformed modern machine learning, driving breakthroughs in computer vision, natural language processing, and robotics. At the core of their success lies the attention mechanism, which enables the modeling of global dependencies among input tokens. However, we reveal that the attention block in transformers suffers from inherent ill-conditioning, which hampers gradient-based optimization and leads to inefficient training. To address this, we develop a theoretical framework that establishes a direct relationship between the conditioning of the attention block and that of the embedded tokenized data. Building on this insight, we introduce conditioned embedded tokens, a method that systematically modifies the embedded tokens to improve the conditioning of the attention mechanism. Our analysis demonstrates that this approach significantly mitigates ill-conditioning, leading to more stable and efficient training. We validate our methodology across various transformer architectures, achieving consistent improvements in image classification, object detection, instance segmentation, and natural language processing, highlighting its broad applicability and effectiveness.

Via

Access Paper or Ask Questions

Always Skip Attention

May 04, 2025

Yiping Ji, Hemanth Saratchandran, Peyman Moghaddam, Simon Lucey

Abstract:We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good performance (albeit suboptimal) when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (\eg, CNNs) exhibiting good performance in their absence. In this paper, we theoretically characterize that the self-attention mechanism is fundamentally ill-conditioned and is, therefore, uniquely dependent on skip connections for regularization. Additionally, we propose Token Graying -- a simple yet effective complement (to skip connections) that further improves the condition of input tokens. We validate our approach in both supervised and self-supervised training methods.

Via

Access Paper or Ask Questions

RandLoRA: Full-rank parameter-efficient fine-tuning of large models

Feb 03, 2025

Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran, Cristian Rodriguez-Opazo, Anton van den Hengel, Ehsan Abbasnejad

Abstract:Low-Rank Adaptation (LoRA) and its variants have shown impressive results in reducing the number of trainable parameters and memory requirements of large transformer networks while maintaining fine-tuning performance. However, the low-rank nature of the weight update inherently limits the representation power of fine-tuned models, potentially compromising performance on complex tasks. This raises a critical question: when a performance gap between LoRA and standard fine-tuning is observed, is it due to the reduced number of trainable parameters or the rank deficiency? This paper aims to answer this question by introducing RandLoRA, a parameter-efficient method that performs full-rank updates using a learned linear combinations of low-rank, non-trainable random matrices. Our method limits the number of trainable parameters by restricting optimization to diagonal scaling matrices applied to the fixed random matrices. This allows us to effectively overcome the low-rank limitations while maintaining parameter and memory efficiency during training. Through extensive experimentation across vision, language, and vision-language benchmarks, we systematically evaluate the limitations of LoRA and existing random basis methods. Our findings reveal that full-rank updates are beneficial across vision and language tasks individually, and even more so for vision-language tasks, where RandLoRA significantly reduces -- and sometimes eliminates -- the performance gap between standard fine-tuning and LoRA, demonstrating its efficacy.

* To appear at the International Conference on Learning Representations (ICLR) 2025

Via

Access Paper or Ask Questions

Rethinking Softmax: Self-Attention with Polynomial Activations

Oct 24, 2024

Hemanth Saratchandran, Jianqiao Zheng, Yiping Ji, Wenbo Zhang, Simon Lucey

Abstract:This paper challenges the conventional belief that softmax attention in transformers is effective primarily because it generates a probability distribution for attention allocation. Instead, we theoretically show that its success lies in its ability to implicitly regularize the Frobenius norm of the attention matrix during training. We then explore alternative activations that regularize the Frobenius norm of the attention matrix, demonstrating that certain polynomial activations can achieve this effect, making them suitable for attention-based architectures. Empirical results indicate these activations perform comparably or better than softmax across various computer vision and language tasks, suggesting new possibilities for attention mechanisms beyond softmax.

Via

Access Paper or Ask Questions