Carles Domingo-Enrich

Length Generalization in Arithmetic Transformers

Jun 27, 2023
Samy Jelassi, Stéphane d'Ascoli, Carles Domingo-Enrich, Yuhuai Wu, Yuanzhi Li, François Charton

We examine how transformers cope with two challenges: learning basic integer arithmetic, and generalizing to longer sequences than seen during training. We find that relative position embeddings enable length generalization for simple tasks, such as addition: models trained on $5$-digit numbers can perform $15$-digit sums. However, this method fails for multiplication, and we propose train set priming: adding a few ($10$ to $50$) long sequences to the training set. We show that priming allows models trained on $5$-digit $\times$ $3$-digit multiplications to generalize to $35\times 3$ examples. We also show that models can be primed for different generalization lengths, and that the priming sample size scales as the logarithm of the training set size. Finally, we discuss potential applications of priming beyond arithmetic.
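As a concrete illustration of what train set priming amounts to as a data-preparation step (a hypothetical sketch, not the authors' pipeline; `make_primed_train_set` and the toy multiplication data are our own):

```python
import random

def make_primed_train_set(base_examples, long_examples, n_prime=50, seed=0):
    """Train set priming, sketched: augment a large set of short-operand
    examples with a handful (10 to 50) of long-operand ones, so the model
    sees the target lengths during training."""
    rng = random.Random(seed)
    primed = list(base_examples) + rng.sample(long_examples, n_prime)
    rng.shuffle(primed)
    return primed

# Toy data: 5-digit x 3-digit products, plus a pool of 35-digit x 3-digit ones.
short = [(f"{a}*{b}", str(a * b)) for a in range(10000, 10200) for b in (123, 456)]
long_ = [(f"{10**34 + i}*123", str((10**34 + i) * 123)) for i in range(100)]
train = make_primed_train_set(short, long_, n_prime=50)
```

The priming examples are a vanishing fraction of the data (50 out of 450 here), in line with the logarithmic scaling observed in the paper.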

Open Problem: Learning with Variational Objectives on Measures

Jun 20, 2023
Vivien Cabannes, Carles Domingo-Enrich

The theory of statistical learning has focused on variational objectives expressed on functions. In this note, we discuss motivations for writing similar objectives on measures, in particular to discuss out-of-distribution generalization and weakly-supervised learning. This raises natural questions: can one cast usual statistical learning results to objectives expressed on measures? Does the resulting construction lead to new algorithms of practical interest?

Multisample Flow Matching: Straightening Flows with Minibatch Couplings

Apr 28, 2023
Aram-Alexandre Pooladian, Heli Ben-Hamu, Carles Domingo-Enrich, Brandon Amos, Yaron Lipman, Ricky Chen

Simulation-free methods for training continuous-time generative models construct probability paths that go between noise distributions and individual data samples. Recent works, such as Flow Matching, derived paths that are optimal for each data sample. However, these algorithms rely on independent data and noise samples, and do not exploit underlying structure in the data distribution for constructing probability paths. We propose Multisample Flow Matching, a more general framework that uses non-trivial couplings between data and noise samples while satisfying the correct marginal constraints. At very small overhead costs, this generalization allows us to (i) reduce gradient variance during training, (ii) obtain straighter flows for the learned vector field, which allows us to generate high-quality samples using fewer function evaluations, and (iii) obtain transport maps with lower cost in high dimensions, which has applications beyond generative modeling. Importantly, we do so in a completely simulation-free manner with a simple minimization objective. We show that our proposed methods improve sample consistency on downsampled ImageNet data sets, and lead to better low-cost sample generation.
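A minimal sketch of the kind of minibatch coupling involved (one-dimensional and brute-force over permutations; the paper uses proper optimal transport solvers, and this helper is illustrative only):

```python
import itertools
import random

def ot_coupling(noise, data):
    """Couple two tiny minibatches by the permutation of `data` that
    minimizes total squared distance to `noise`. The marginals are
    untouched: each point still appears exactly once, which is the
    marginal constraint Multisample Flow Matching must preserve."""
    n = len(noise)
    best = min(itertools.permutations(range(n)),
               key=lambda p: sum((noise[i] - data[p[i]]) ** 2 for i in range(n)))
    return [data[j] for j in best]

random.seed(0)
noise = [random.gauss(0, 1) for _ in range(5)]
data = [random.gauss(3, 1) for _ in range(5)]
paired = ot_coupling(noise, data)
# Coupled pairs give shorter, straighter noise-to-data paths than independent pairing.
cost_ot = sum((x - y) ** 2 for x, y in zip(noise, paired))
cost_indep = sum((x - y) ** 2 for x, y in zip(noise, data))
```

The lower pairing cost is what translates into straighter learned flows and fewer function evaluations at sampling time.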

An Explicit Expansion of the Kullback-Leibler Divergence along its Fisher-Rao Gradient Flow

Feb 23, 2023
Carles Domingo-Enrich, Aram-Alexandre Pooladian

Let $V_* : \mathbb{R}^d \to \mathbb{R}$ be some (possibly non-convex) potential function, and consider the probability measure $\pi \propto e^{-V_*}$. When $\pi$ exhibits multiple modes, it is known that sampling techniques based on Wasserstein gradient flows of the Kullback-Leibler (KL) divergence (e.g. Langevin Monte Carlo) suffer from slow convergence rates, as the dynamics are unable to easily traverse between modes. In stark contrast, the work of Lu et al. (2019; 2022) has shown that the gradient flow of the KL with respect to the Fisher-Rao (FR) geometry exhibits a convergence rate to $\pi$ that is \textit{independent} of the potential function. In this short note, we complement these existing results in the literature by providing an explicit expansion of $\text{KL}(\rho_t^{\text{FR}}\|\pi)$ in terms of $e^{-t}$, where $(\rho_t^{\text{FR}})_{t\geq 0}$ is the FR gradient flow of the KL divergence. In turn, we are able to provide a clean asymptotic convergence rate, where the burn-in time is guaranteed to be finite. Our proof is based on observing a similarity between FR gradient flows and simulated annealing with linear scaling, and on facts about cumulant generating functions. We conclude with simple synthetic experiments that demonstrate that our theoretical findings are indeed tight. Based on our numerics, we conjecture that the asymptotic rates of convergence for Wasserstein-Fisher-Rao gradient flows are possibly related to this expansion in some cases.
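The FR gradient flow of the KL admits a simple birth-death form on a finite state space, and its fast, potential-independent convergence is easy to observe numerically (a toy simulation under our own discretization choices, not the paper's experiments):

```python
import math

def fr_flow_step(rho, pi, dt):
    """One explicit Euler step of the Fisher-Rao gradient flow of
    KL(rho || pi) on a finite state space, in its birth-death form:
        d/dt rho_x = -rho_x * (log(rho_x / pi_x) - E_rho[log(rho / pi)]),
    where the expectation term equals KL(rho || pi) itself."""
    logs = [math.log(r / p) for r, p in zip(rho, pi)]
    mean = sum(r * s for r, s in zip(rho, logs))
    new = [r * (1 - dt * (s - mean)) for r, s in zip(rho, logs)]
    z = sum(new)
    return [v / z for v in new]  # renormalize against discretization error

def kl(rho, pi):
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)

pi = [0.5, 0.3, 0.2]      # the shape of pi does not slow FR convergence
rho = [0.05, 0.05, 0.9]   # badly initialized
kls = []
for _ in range(200):
    kls.append(kl(rho, pi))
    rho = fr_flow_step(rho, pi, dt=0.05)
```

The recorded KL values decay exponentially in the flow time, consistent with an expansion in powers of $e^{-t}$.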

* 15 pages, 4 figures 
Compress Then Test: Powerful Kernel Testing in Near-linear Time

Jan 14, 2023
Carles Domingo-Enrich, Raaz Dwivedi, Lester Mackey

Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each $n$-point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test -- recovering the same optimal detection boundary -- while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20--200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.
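For reference, the quadratic-time MMD permutation test that CTT accelerates looks roughly like this (a bare-bones V-statistic version; the kernel, bandwidth, and sample sizes here are arbitrary choices of ours):

```python
import math
import random

def mmd2(x, y, k):
    """Squared maximum mean discrepancy (V-statistic) between samples x, y."""
    def avg(a, b):
        return sum(k(u, v) for u in a for v in b) / (len(a) * len(b))
    return avg(x, x) + avg(y, y) - 2 * avg(x, y)

def permutation_test(x, y, k, n_perm=100, seed=0):
    """Quadratic-time MMD permutation test: each permutation recomputes the
    full statistic, which is the O(n^2)-per-permutation cost CTT removes
    by first compressing each sample into a coreset."""
    rng = random.Random(seed)
    observed = mmd2(x, y, k)
    pooled = x + y
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(x)], pooled[len(x):], k) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # permutation p-value

kernel = lambda u, v: math.exp(-(u - v) ** 2)  # Gaussian kernel, bandwidth 1
random.seed(1)
x = [random.gauss(0.0, 1.0) for _ in range(30)]
y = [random.gauss(1.5, 1.0) for _ in range(30)]
p = permutation_test(x, y, kernel)
```

With a clear mean shift, the test rejects at level 0.05 even with these small samples; the quadratic cost per permutation is what becomes prohibitive as $n$ grows.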

Computing the Variance of Shuffling Stochastic Gradient Algorithms via Power Spectral Density Analysis

Jun 01, 2022
Carles Domingo-Enrich

When solving finite-sum minimization problems, two common alternatives to stochastic gradient descent (SGD) with theoretical benefits are random reshuffling (SGD-RR) and shuffle-once (SGD-SO), in which functions are sampled in cycles without replacement. Under a convenient stochastic noise approximation which holds experimentally, we study the stationary variances of the iterates of SGD, SGD-RR and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations. To obtain our results, we study the power spectral density of the stochastic gradient noise sequences. Our analysis extends beyond SGD to SGD with momentum and to the stochastic Nesterov's accelerated gradient method. We perform experiments on quadratic objective functions to test the validity of our approximation and the correctness of our findings.
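The three sampling schemes are easy to state in code, and a toy quadratic experiment already shows that with-replacement SGD has the largest stationary variance (the helper names and the crude time-average variance estimate are ours; the paper's actual analysis goes through power spectral densities):

```python
import random

def index_stream(n, epochs, scheme, seed=0):
    """Indices visited by SGD (sampling with replacement), SGD-RR (a fresh
    random permutation each epoch) and SGD-SO (one permutation, reused)."""
    rng = random.Random(seed)
    fixed = rng.sample(range(n), n)  # only used by shuffle-once
    for _ in range(epochs):
        if scheme == "sgd":
            yield from (rng.randrange(n) for _ in range(n))
        elif scheme == "rr":
            yield from rng.sample(range(n), n)
        elif scheme == "so":
            yield from fixed

def stationary_variance(scheme, lr=0.1, epochs=2000):
    """Run a SGD variant on f_i(x) = (x - a_i)^2 / 2 and estimate the
    variance of the iterates over the second half of the run."""
    a = [0.0, 1.0, 2.0, 3.0, 4.0]
    x, tail = 0.0, []
    for t, i in enumerate(index_stream(len(a), epochs, scheme, seed=1)):
        x -= lr * (x - a[i])  # stochastic gradient step on f_i
        if t >= len(a) * epochs // 2:
            tail.append(x)
    m = sum(tail) / len(tail)
    return sum((v - m) ** 2 for v in tail) / len(tail)

variances = {s: stationary_variance(s) for s in ("sgd", "rr", "so")}
```

This crude estimate reliably separates SGD from the two shuffling schemes; distinguishing SGD-RR from SGD-SO requires the finer leading-order analysis of the paper.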

* The code can be found at \url{https://github.com/CDEnrich/sgd_shuffling} 
Auditing Differential Privacy in High Dimensions with the Kernel Quantum Rényi Divergence

May 27, 2022
Carles Domingo-Enrich, Youssef Mroueh

Differential privacy (DP) is the de facto standard for private data release and private machine learning. Auditing black-box DP algorithms and mechanisms to certify whether they satisfy a certain DP guarantee is challenging, especially in high dimension. We propose relaxations of differential privacy based on new divergences on probability distributions: the kernel Rényi divergence and its regularized version. We show that the regularized kernel Rényi divergence can be estimated from samples even in high dimensions, giving rise to auditing procedures for $\varepsilon$-DP, $(\varepsilon,\delta)$-DP and $(\alpha,\varepsilon)$-Rényi DP.

* Code at https://github.com/CDEnrich/kernel_renyi_dp 
Learning with Stochastic Orders

May 27, 2022
Carles Domingo-Enrich, Yair Schiff, Youssef Mroueh

Learning high-dimensional distributions is often done with explicit likelihood modeling or implicit modeling via minimizing integral probability metrics (IPMs). In this paper, we expand this learning paradigm to stochastic orders, namely, the convex or Choquet order between probability measures. Towards this end, we introduce the Choquet-Toland distance between probability measures, which can be used as a drop-in replacement for IPMs. We also introduce the Variational Dominance Criterion (VDC) to learn probability measures with dominance constraints, which encode the desired stochastic order between the learned measure and a known baseline. We analyze both quantities, show that they suffer from the curse of dimensionality, and propose surrogates via input convex maxout networks (ICMNs) that enjoy parametric rates. Finally, we provide a min-max framework for learning with stochastic orders and validate it experimentally on synthetic and high-dimensional image generation, with promising results. The code is available at https://github.com/yair-schiff/stochastic-orders-ICMN
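The convexity that ICMNs build on comes from max-affine (maxout) units: a maximum of affine functions is always convex. A quick numerical check of the midpoint inequality on such a unit (this fragment is only the building block; the full ICMN architecture is described in the paper):

```python
import random

def max_affine(params, x):
    """f(x) = max_k (w_k . x + b_k): a maximum of affine functions,
    hence convex in x by construction."""
    return max(sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in params)

random.seed(0)
params = [([random.gauss(0, 1), random.gauss(0, 1)], random.gauss(0, 1))
          for _ in range(8)]

# Convexity check via the midpoint inequality:
# f((x + y) / 2) <= (f(x) + f(y)) / 2 for random pairs x, y.
convex_ok = True
for _ in range(100):
    x = [random.gauss(0, 3), random.gauss(0, 3)]
    y = [random.gauss(0, 3), random.gauss(0, 3)]
    mid = [(a + b) / 2 for a, b in zip(x, y)]
    lhs = max_affine(params, mid)
    rhs = (max_affine(params, x) + max_affine(params, y)) / 2
    convex_ok = convex_ok and lhs <= rhs + 1e-12
```

Input convexity is what makes these networks suitable surrogates when optimizing over the Choquet order, where test functions must be convex.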

* Code available at https://github.com/yair-schiff/stochastic-orders-ICMN 
Simultaneous Transport Evolution for Minimax Equilibria on Measures

Feb 21, 2022
Carles Domingo-Enrich, Joan Bruna

Min-max optimization problems arise in several key machine learning setups, including adversarial learning and generative modeling. In their general form, in the absence of convexity/concavity assumptions, finding pure equilibria of the underlying two-player zero-sum game is computationally hard [Daskalakis et al., 2021]. In this work we focus instead on finding mixed equilibria, and consider the associated lifted problem in the space of probability measures. By adding entropic regularization, our main result establishes global convergence towards the global equilibrium by using simultaneous gradient ascent-descent with respect to the Wasserstein metric -- a dynamics that admits efficient particle discretization in high dimensions, as opposed to entropic mirror descent. We complement this positive result with a related entropy-regularized loss which is not bilinear but still convex-concave in the Wasserstein geometry, and for which simultaneous dynamics do not converge yet timescale separation does. Taken together, these results showcase the benign geometry of bilinear games in the space of measures, enabling particle dynamics with global qualitative convergence guarantees.

* Error in the proof of Lemma 1, which makes Theorem 1 not hold 