Yuval Ran-Milo

Do Neural Networks Need Gradient Descent to Generalize? A Theoretical Study

Jun 04, 2025

Yotam Alexander, Yonatan Slutzky, Yuval Ran-Milo, Nadav Cohen

Abstract: Conventional wisdom attributes the mysterious generalization abilities of overparameterized neural networks to gradient descent (and its variants). The recent volume hypothesis challenges this view: it posits that these generalization abilities persist even when gradient descent is replaced by Guess & Check (G&C), i.e., by drawing weight settings until one that fits the training data is found. The validity of the volume hypothesis for wide and deep neural networks remains an open question. In this paper, we theoretically investigate this question for matrix factorization (with linear and non-linear activation)--a common testbed in neural network theory. We first prove that generalization under G&C deteriorates with increasing width, establishing what is, to our knowledge, the first case where G&C is provably inferior to gradient descent. Conversely, we prove that generalization under G&C improves with increasing depth, revealing a stark contrast between wide and deep networks, which we further validate empirically. These findings suggest that even in simple settings, there may not be a simple answer to the question of whether neural networks need gradient descent to generalize well.
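The Guess & Check procedure described in the abstract can be illustrated with a minimal sketch on a toy linear matrix factorization. This is not the paper's experimental setup: the standard Gaussian prior, the 2-by-2 target, the tolerance `eps`, and all function names below are assumptions chosen only for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sample_product(n, width, depth, rng):
    """Draw `depth` Gaussian factors (assumed prior) and return their n-by-n product."""
    dims = [n] + [width] * (depth - 1) + [n]
    product = None
    for rows, cols in zip(dims[:-1], dims[1:]):
        factor = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
        product = factor if product is None else matmul(product, factor)
    return product


def train_loss(matrix, observed):
    """Mean squared error over the observed (row, col) -> value entries."""
    return sum((matrix[i][j] - v) ** 2 for (i, j), v in observed.items()) / len(observed)


def guess_and_check(observed, n, width, depth, eps, rng, max_tries=100_000):
    """Return (matrix, tries) for the first draw whose training loss is below eps, else None."""
    for tries in range(1, max_tries + 1):
        candidate = sample_product(n, width, depth, rng)
        if train_loss(candidate, observed) < eps:
            return candidate, tries
    return None


rng = random.Random(0)
observed = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0}  # entry (1, 1) is held out
result = guess_and_check(observed, n=2, width=2, depth=2, eps=0.05, rng=rng)
if result is not None:
    matrix, tries = result
    print(f"accepted after {tries} draws; held-out entry = {matrix[1][1]:.3f}")
```

Varying `width` and `depth` in this sketch changes the distribution of accepted matrices, which is the kind of dependence the abstract analyzes; the sketch itself makes no claim about which direction generalization moves.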

Mamba Knockout for Unraveling Factual Information Flow

May 30, 2025

Nir Endy, Idan Daniel Grosbard, Yuval Ran-Milo, Yonatan Slutzky, Itay Tshuva, Raja Giryes

Abstract: This paper investigates the flow of factual information in Mamba State-Space Model (SSM)-based language models. We rely on theoretical and empirical connections to Transformer-based architectures and their attention mechanisms. Exploiting this relationship, we adapt attentional interpretability techniques originally developed for Transformers--specifically, the Attention Knockout methodology--to both Mamba-1 and Mamba-2. Using them, we trace how information is transmitted and localized across tokens and layers, revealing patterns of subject-token information emergence and layer-wise dynamics. Notably, some phenomena vary between Mamba models and Transformer-based models, while others appear universally across all models inspected--hinting that these may be inherent to LLMs in general. By further leveraging Mamba's structured factorization, we disentangle how distinct "features" either enable token-to-token information exchange or enrich individual tokens, thus offering a unified lens to understand Mamba's internal operations.

* Accepted to ACL 2025

Provable Benefits of Complex Parameterizations for Structured State Space Models

Oct 17, 2024

Yuval Ran-Milo, Eden Lumbroso, Edo Cohen-Karlik, Raja Giryes, Amir Globerson, Nadav Cohen

Abstract: Structured state space models (SSMs), the core engine behind prominent neural networks such as S4 and Mamba, are linear dynamical systems adhering to a specified structure, most notably diagonal. In contrast to typical neural network modules, whose parameterizations are real, SSMs often use complex parameterizations. Theoretically explaining the benefits of complex parameterizations for SSMs is an open problem. The current paper takes a step towards its resolution, by establishing formal gaps between real and complex diagonal SSMs. Firstly, we prove that while a moderate dimension suffices in order for a complex SSM to express all mappings of a real SSM, a much higher dimension is needed for a real SSM to express mappings of a complex SSM. Secondly, we prove that even if the dimension of a real SSM is high enough to express a given mapping, typically, doing so requires the parameters of the real SSM to hold exponentially large values, which cannot be learned in practice. In contrast, a complex SSM can express any given mapping with moderate parameter values. Experiments corroborate our theory, and suggest a potential extension of the theory that accounts for selectivity, a new architectural feature yielding state-of-the-art performance.
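As a small illustration of the real-versus-complex gap discussed in the abstract, the sketch below computes the impulse response of a diagonal SSM. It is not the paper's construction: taking the real part of the output, the function name `impulse_response`, and the specific parameter values are assumptions made for this example. A single complex state with transition `a = i` oscillates as 1, 0, -1, 0, ..., whereas a single real state can only produce a geometric sequence `c * a**t * b`, which cannot be zero at one step and nonzero at the next.

```python
def impulse_response(a, b, c, steps):
    """Impulse response h_t = Re(sum_i c_i * a_i**t * b_i) of a diagonal SSM.

    a, b, c hold the diagonal transition, input, and output parameters,
    one entry per state dimension (real or complex).
    """
    powers = [1 + 0j] * len(a)  # a_i**0 for every state dimension
    response = []
    for _ in range(steps):
        response.append(sum(ci * p * bi for ci, p, bi in zip(c, powers, b)).real)
        powers = [p * ai for p, ai in zip(powers, a)]  # advance to a_i**(t+1)
    return response


# One complex state with a = i: the response equals cos(pi * t / 2).
complex_h = impulse_response(a=[1j], b=[1.0], c=[1.0], steps=6)

# One real state: the response is the geometric sequence 0.5**t.
real_h = impulse_response(a=[0.5], b=[1.0], c=[1.0], steps=6)

print(complex_h)
print(real_h)
```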

* 12 pages, 1 figure. Accepted to NeurIPS 2024
