Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Philippe Rigollet

PMA

The power of fine-grained experts: Granularity boosts expressivity in Mixture of Experts

May 11, 2025

Enric Boix-Adsera, Philippe Rigollet

Abstract:Mixture-of-Experts (MoE) layers are increasingly central to frontier model architectures. By selectively activating parameters, they reduce computational cost while scaling total parameter count. This paper investigates the impact of the number of active experts, termed granularity, comparing architectures with many (e.g., 8 per layer in DeepSeek) to those with fewer (e.g., 1 per layer in Llama-4 models). We prove an exponential separation in network expressivity based on this design parameter, suggesting that models benefit from higher granularity. Experimental results corroborate our theoretical findings and illustrate this separation.

Via

Access Paper or Ask Questions

Quantitative Clustering in Mean-Field Transformer Models

Apr 20, 2025

Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet

Abstract:The evolution of tokens through a deep transformer models can be modeled as an interacting particle system that has been shown to exhibit an asymptotic clustering behavior akin to the synchronization phenomenon in Kuramoto models. In this work, we investigate the long-time clustering of mean-field transformer models. More precisely, we establish exponential rates of contraction to a Dirac point mass for any suitably regular initialization under some assumptions on the parameters of transformer models, any suitably regular mean-field initialization synchronizes exponentially fast with some quantitative rates.

* 47 pages, 4 figures

Via

Access Paper or Ask Questions

Residual connections provably mitigate oversmoothing in graph neural networks

Jan 04, 2025

Ziang Chen, Zhengjiang Lin, Shi Chen, Yury Polyanskiy, Philippe Rigollet

Figure 1 for Residual connections provably mitigate oversmoothing in graph neural networks

Figure 2 for Residual connections provably mitigate oversmoothing in graph neural networks

Figure 3 for Residual connections provably mitigate oversmoothing in graph neural networks

Figure 4 for Residual connections provably mitigate oversmoothing in graph neural networks

Abstract:Graph neural networks (GNNs) have achieved remarkable empirical success in processing and representing graph-structured data across various domains. However, a significant challenge known as "oversmoothing" persists, where vertex features become nearly indistinguishable in deep GNNs, severely restricting their expressive power and practical utility. In this work, we analyze the asymptotic oversmoothing rates of deep GNNs with and without residual connections by deriving explicit convergence rates for a normalized vertex similarity measure. Our analytical framework is grounded in the multiplicative ergodic theorem. Furthermore, we demonstrate that adding residual connections effectively mitigates or prevents oversmoothing across several broad families of parameter distributions. The theoretical findings are strongly supported by numerical experiments.

Via

Access Paper or Ask Questions

On the number of modes of Gaussian kernel density estimators

Dec 12, 2024

Borjan Geshkovski, Philippe Rigollet, Yihang Sun

Figure 1 for On the number of modes of Gaussian kernel density estimators

Figure 2 for On the number of modes of Gaussian kernel density estimators

Figure 3 for On the number of modes of Gaussian kernel density estimators

Figure 4 for On the number of modes of Gaussian kernel density estimators

Abstract:We consider the Gaussian kernel density estimator with bandwidth $\beta^{-\frac12}$ of $n$ iid Gaussian samples. Using the Kac-Rice formula and an Edgeworth expansion, we prove that the expected number of modes on the real line scales as $\Theta(\sqrt{\beta\log\beta})$ as $\beta,n\to\infty$ provided $n^c\lesssim \beta\lesssim n^{2-c}$ for some constant $c>0$. An impetus behind this investigation is to determine the number of clusters to which Transformers are drawn in a metastable state.

Via

Access Paper or Ask Questions

Clustering in Causal Attention Masking

Nov 07, 2024

Nikita Karagodin, Yury Polyanskiy, Philippe Rigollet

Figure 1 for Clustering in Causal Attention Masking

Figure 2 for Clustering in Causal Attention Masking

Figure 3 for Clustering in Causal Attention Masking

Figure 4 for Clustering in Causal Attention Masking

Abstract:This work presents a modification of the self-attention dynamics proposed by Geshkovski et al. (arXiv:2312.10794) to better reflect the practically relevant, causally masked attention used in transformer architectures for generative AI. This modification translates into an interacting particle system that cannot be interpreted as a mean-field gradient flow. Despite this loss of structure, we significantly strengthen the results of Geshkovski et al. (arXiv:2312.10794) in this context: While previous rigorous results focused on cases where all three matrices (Key, Query, and Value) were scaled identities, we prove asymptotic convergence to a single cluster for arbitrary key-query matrices and a value matrix equal to the identity. Additionally, we establish a connection to the classical R\'enyi parking problem from combinatorial geometry to make initial theoretical steps towards demonstrating the existence of meta-stable states.

* 38th Conference on Neural Information Processing Systems (NeurIPS 2024), 22 pages, 6 figures

Via

Access Paper or Ask Questions

Measure-to-measure interpolation using Transformers

Nov 07, 2024

Borjan Geshkovski, Philippe Rigollet, Domènec Ruiz-Balet

Abstract:Transformers are deep neural network architectures that underpin the recent successes of large language models. Unlike more classical architectures that can be viewed as point-to-point maps, a Transformer acts as a measure-to-measure map implemented as specific interacting particle system on the unit sphere: the input is the empirical measure of tokens in a prompt and its evolution is governed by the continuity equation. In fact, Transformers are not limited to empirical measures and can in principle process any input measure. As the nature of data processed by Transformers is expanding rapidly, it is important to investigate their expressive power as maps from an arbitrary measure to another arbitrary measure. To that end, we provide an explicit choice of parameters that allows a single Transformer to match $N$ arbitrary input measures to $N$ arbitrary target measures, under the minimal assumption that every pair of input-target measures can be matched by some transport map.

Via

Access Paper or Ask Questions

Dynamic metastability in the self-attention model

Oct 09, 2024

Borjan Geshkovski, Hugo Koubbi, Yury Polyanskiy, Philippe Rigollet

Figure 1 for Dynamic metastability in the self-attention model

Figure 2 for Dynamic metastability in the self-attention model

Figure 3 for Dynamic metastability in the self-attention model

Figure 4 for Dynamic metastability in the self-attention model

Abstract:We consider the self-attention model - an interacting particle system on the unit sphere, which serves as a toy model for Transformers, the deep neural network architecture behind the recent successes of large language models. We prove the appearance of dynamic metastability conjectured in [GLPR23] - although particles collapse to a single cluster in infinite time, they remain trapped near a configuration of several clusters for an exponentially long period of time. By leveraging a gradient flow interpretation of the system, we also connect our result to an overarching framework of slow motion of gradient flows proposed by Otto and Reznikoff [OR07] in the context of coarsening and the Allen-Cahn equation. We finally probe the dynamics beyond the exponentially long period of metastability, and illustrate that, under an appropriate time-rescaling, the energy reaches its global maximum in finite time and has a staircase profile, with trajectories manifesting saddle-to-saddle-like behavior, reminiscent of recent works in the analysis of training dynamics via gradient descent for two-layer neural networks.

Via

Access Paper or Ask Questions

Statistical optimal transport

Jul 25, 2024

Sinho Chewi, Jonathan Niles-Weed, Philippe Rigollet

Figure 1 for Statistical optimal transport

Figure 2 for Statistical optimal transport

Figure 3 for Statistical optimal transport

Figure 4 for Statistical optimal transport

Abstract:We present an introduction to the field of statistical optimal transport, based on lectures given at \'Ecole d'\'Et\'e de Probabilit\'es de Saint-Flour XLIX.

* Lecture Notes for \'Ecole d'\'Et\'e de Probabilit\'es de Saint-Flour XLIX 2019

Via

Access Paper or Ask Questions

A mathematical perspective on Transformers

Dec 22, 2023

Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet

Figure 1 for A mathematical perspective on Transformers

Figure 2 for A mathematical perspective on Transformers

Figure 3 for A mathematical perspective on Transformers

Figure 4 for A mathematical perspective on Transformers

Abstract:Transformers play a central role in the inner workings of large language models. We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, which reveals that clusters emerge in long time. Our study explores the underlying theory and offers new perspectives for mathematicians as well as computer scientists.

Via

Access Paper or Ask Questions

Covariance alignment: from maximum likelihood estimation to Gromov-Wasserstein

Nov 22, 2023

Yanjun Han, Philippe Rigollet, George Stepaniants

Figure 1 for Covariance alignment: from maximum likelihood estimation to Gromov-Wasserstein

Figure 2 for Covariance alignment: from maximum likelihood estimation to Gromov-Wasserstein

Abstract:Feature alignment methods are used in many scientific disciplines for data pooling, annotation, and comparison. As an instance of a permutation learning problem, feature alignment presents significant statistical and computational challenges. In this work, we propose the covariance alignment model to study and compare various alignment methods and establish a minimax lower bound for covariance alignment that has a non-standard dimension scaling because of the presence of a nuisance parameter. This lower bound is in fact minimax optimal and is achieved by a natural quasi MLE. However, this estimator involves a search over all permutations which is computationally infeasible even when the problem has moderate size. To overcome this limitation, we show that the celebrated Gromov-Wasserstein algorithm from optimal transport which is more amenable to fast implementation even on large-scale problems is also minimax optimal. These results give the first statistical justification for the deployment of the Gromov-Wasserstein algorithm in practice.

* 41 pages, 2 figures

Via

Access Paper or Ask Questions