Gérard Biau

LPSM

Distilling LLM Feedback for Lean Theorem Proving

May 29, 2026

Gaetan Narozniak, Gérard Biau, Rémi Munos, Ahmad Rammal, Pierre Marion

Abstract:Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent work on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method on Lean 4 theorem proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. Overall, our results suggest a promising avenue to improve post-training for complex reasoning.
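
The token-level matching objective described in this abstract can be illustrated with a minimal sketch. It assumes the loss is a per-position KL divergence from the feedback-conditioned distribution (the "teacher") to the unconditioned one (the "student"); the abstract does not state the divergence or its direction, and the function names and toy logits below are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one vocabulary-sized list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_level_kl(teacher_logits, student_logits):
    # One KL(teacher || student) value per token position (assumed direction).
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)
        q = softmax(s_row)
        losses.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0))
    return losses

# Toy example: 2 token positions, vocabulary of size 3 (hypothetical numbers).
teacher = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # same model, conditioned on feedback
student = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # same model, without feedback
per_token = token_level_kl(teacher, student)
loss = sum(per_token) / len(per_token)
```

Here the second position contributes zero loss because both distributions agree, while the first position carries a positive signal. This is the sense in which the supervision is dense compared with a single sequence-level reward.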

A Geometry-Aware Residual Correction of Hagan's SABR Implied Volatility Formula

May 07, 2026

Adil Reghai, Lama Tarsissi, Gérard Biau, Alex Lipton

Abstract:This paper proposes a hybrid methodology to improve the approximation of SABR (Stochastic Alpha Beta Rho) implied volatility by combining analytical structure with machine learning. The approach augments the neural-network input representation with geometric features derived from the stochastic differential equations of the SABR model. Unlike approaches that fully replace analytical formulas with black-box models, the proposed framework preserves the analytical backbone of the model. The hybridization operates along two complementary dimensions. First, geometry-aware variables reflecting intrinsic properties of the SABR dynamics are used as structured inputs to the network. Second, the neural network is trained to learn the residual error relative to Hagan's closed-form approximation rather than implied volatility directly. The resulting model acts as a structured residual correction to the analytical formula, retaining interpretability while capturing higher-order effects that are not included in the asymptotic expansion. Numerical experiments conducted over realistic parameter domains, as well as stressed environments, show that the method improves accuracy and robustness compared with both analytical approximations and standard neural-network approaches. Because the correction remains lightweight and structurally consistent with the underlying model, the framework is well suited for real-time pricing and calibration in practical trading environments.

* 33 pages, 17 figures
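
The residual-learning idea in this abstract (fit the error of an analytical formula, then add the fitted correction back) can be sketched on a toy problem. Everything below is an assumption made for illustration: `base_formula` and `reference_vol` are hypothetical stand-ins and are not Hagan's formula or a SABR solver, and a one-parameter least-squares fit replaces the neural network.

```python
def base_formula(k):
    # Hypothetical stand-in for an analytical approximation (NOT Hagan's formula).
    return 0.20

def reference_vol(k):
    # Hypothetical stand-in for an accurate reference value (e.g. a numerical solver).
    return 0.20 + 0.05 * k * k

def fit_residual(grid):
    # Least-squares fit of residual(k) = c * k^2, where the residual is
    # reference minus base. This simple fit replaces the neural network.
    num = sum((reference_vol(k) - base_formula(k)) * k * k for k in grid)
    den = sum((k * k) ** 2 for k in grid)
    return num / den

def corrected_vol(k, c):
    # Analytical backbone plus learned correction.
    return base_formula(k) + c * k * k

grid = [-0.4, -0.2, 0.0, 0.2, 0.4]  # toy log-moneyness grid (assumed)
c = fit_residual(grid)
```

Because the model only has to learn the small gap between the formula and the reference, the learned part can stay lightweight, which matches the motivation given in the abstract.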

Taking a Big Step: Large Learning Rates in Denoising Score Matching Prevent Memorization

Feb 05, 2025

Yu-Han Wu, Pierre Marion, Gérard Biau, Claire Boyer

Abstract:Denoising score matching plays a pivotal role in the performance of diffusion-based generative models. However, the empirical optimal score (the exact solution to the denoising score matching problem) leads to memorization, where generated samples replicate the training data. Yet, in practice, only a moderate degree of memorization is observed, even without explicit regularization. In this paper, we investigate this phenomenon by uncovering an implicit regularization mechanism driven by large learning rates. Specifically, we show that in the small-noise regime, the empirical optimal score exhibits high irregularity. We then prove that, when trained by stochastic gradient descent with a large enough learning rate, neural networks cannot stably converge to a local minimum with arbitrarily small excess risk. Consequently, the learned score cannot be arbitrarily close to the empirical optimal score, thereby mitigating memorization. To make the analysis tractable, we consider one-dimensional data and two-layer neural networks. Experiments validate the crucial role of the learning rate in preventing memorization, even beyond the one-dimensional setting.
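
The irregularity of the empirical optimal score at small noise can be seen in a short one-dimensional sketch. It assumes additive Gaussian noise, y = x + sigma * eps, so that the noised empirical distribution is a Gaussian mixture centered at the training points; the paper's exact noising scheme may differ, and the function name and toy data are hypothetical.

```python
import math

def empirical_optimal_score(y, data, sigma):
    # Score (derivative of the log-density) of the mixture
    # (1/n) * sum_i N(x_i, sigma^2), evaluated at y.
    # It equals sum_i w_i(y) * (x_i - y) / sigma^2 with softmax weights w_i.
    logits = [-((y - x) ** 2) / (2.0 * sigma ** 2) for x in data]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return sum(w * (x - y) for w, x in zip(weights, data)) / (total * sigma ** 2)

data = [-1.0, 1.0]  # two training points (toy example)
sigma = 0.1         # small-noise regime
score_left = empirical_optimal_score(-0.5, data, sigma)
score_mid = empirical_optimal_score(0.0, data, sigma)
score_right = empirical_optimal_score(0.5, data, sigma)
```

With these values the score points strongly toward the nearest training point on each side and vanishes at the midpoint. A sampler that follows it exactly is pulled onto the training data, which is the memorization effect the abstract describes.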

Attention layers provably solve single-location regression

Oct 02, 2024

Pierre Marion, Raphaël Berthier, Gérard Biau, Claire Boyer

Abstract:Attention-based models, such as Transformers, excel across various tasks but lack a comprehensive theoretical understanding, especially regarding token-wise sparsity and internal linear representations. To address this gap, we introduce the single-location regression task, where only one token in a sequence determines the output, and its position is a latent random variable, retrievable via a linear projection of the input. To solve this task, we propose a dedicated predictor, which turns out to be a simplified version of a non-linear self-attention layer. We study its theoretical properties by showing its asymptotic Bayes optimality and analyzing its training dynamics. In particular, despite the non-convex nature of the problem, the predictor effectively learns the underlying structure. This work highlights the capacity of attention mechanisms to handle sparse token information and internal linear structures.

* 41 pages, 7 figures

Physics-informed machine learning as a kernel method

Feb 12, 2024

Nathan Doumèche, Francis Bach, Claire Boyer, Gérard Biau

Abstract:Physics-informed machine learning combines the expressiveness of data-based approaches with the interpretability of physical models. In this context, we consider a general regression problem where the empirical risk is regularized by a partial differential equation that quantifies the physical inconsistency. We prove that for linear differential priors, the problem can be formulated as a kernel regression task. Taking advantage of kernel theory, we derive convergence rates for the minimizer of the regularized risk and show that it converges at least at the Sobolev minimax rate. However, faster rates can be achieved, depending on the physical error. This principle is illustrated with a one-dimensional example, supporting the claim that regularizing the empirical risk with physical information can be beneficial to the statistical performance of estimators.

Implicit regularization of deep residual networks towards neural ODEs

Sep 03, 2023

Pierre Marion, Yu-Han Wu, Michael E. Sander, Gérard Biau

Abstract:Residual neural networks are state-of-the-art deep learning models. Their continuous-depth analog, neural ordinary differential equations (ODEs), are also widely used. Despite their success, the link between the discrete and continuous models still lacks a solid mathematical foundation. In this article, we take a step in this direction by establishing an implicit regularization of deep residual networks towards neural ODEs, for nonlinear networks trained with gradient flow. We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training. Our results are valid for a finite training time, and also as the training time tends to infinity provided that the network satisfies a Polyak-Lojasiewicz condition. Importantly, this condition holds for a family of residual networks where the residuals are two-layer perceptrons with an overparameterization in width that is only linear, and implies the convergence of gradient flow to a global minimum. Numerical experiments illustrate our results.

* 40 pages, 3 figures
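
What it means for a residual network to be "a discretization of a neural ODE" can be sketched with a scalar toy model. The sketch assumes a 1/L scaling of each residual branch and weights sampled from a smooth function of the layer index, so that the forward pass is an explicit Euler scheme for dh/dt = theta(t) * tanh(h). The residual function, the choice theta = cos, and all names are hypothetical, and no training is performed here.

```python
import math

def resnet_forward(h0, depth, theta):
    # Scalar residual network: h_{k+1} = h_k + (1/L) * theta(k/L) * tanh(h_k).
    # This is the explicit Euler scheme for dh/dt = theta(t) * tanh(h) on [0, 1].
    h = h0
    for k in range(depth):
        t = k / depth
        h = h + (1.0 / depth) * theta(t) * math.tanh(h)
    return h

theta = math.cos  # weights vary smoothly with the layer index (assumed)
shallow = resnet_forward(1.0, 200, theta)
deep = resnet_forward(1.0, 2000, theta)
gap = abs(shallow - deep)  # shrinks as depth grows, since both approximate the same ODE
```

The abstract's result concerns whether this smooth-in-depth structure, present at initialization, survives gradient-flow training. The sketch only illustrates the structure itself.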

Scaling ResNets in the Large-depth Regime

Jun 14, 2022

Pierre Marion, Adeline Fermanian, Gérard Biau, Jean-Philippe Vert

Abstract:Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d. initializations, the only non-trivial dynamics is for $\alpha_L = 1/\sqrt{L}$ (other choices lead either to explosion or to identity mapping). This scaling factor corresponds in the continuous-time limit to a neural stochastic differential equation, contrary to a widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = 1/L$. Our analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.

* 43 pages, 9 figures
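
The three regimes named in this abstract (explosion, non-trivial dynamics, identity) can be reproduced in a small simulation. The sketch below assumes a purely linear residual branch, h_{k+1} = h_k + alpha_L * W_k h_k, with i.i.d. Gaussian entries of variance 1/width; the paper's setting is more general, and the function names, width, depth, and seed are hypothetical. Under this linear assumption the expected squared norm after L layers is (1 + alpha_L^2)^L times the initial one.

```python
import math
import random

def final_squared_norm(depth, alpha, width, rng):
    # Forward pass of a linear ResNet with fresh i.i.d. N(0, 1/width) weights
    # at every layer, starting from a unit-norm input.
    h = [1.0 / math.sqrt(width)] * width
    std = 1.0 / math.sqrt(width)
    for _ in range(depth):
        update = [sum(rng.gauss(0.0, std) * hj for hj in h) for _ in range(width)]
        h = [hi + alpha * ui for hi, ui in zip(h, update)]
    return sum(x * x for x in h)

def mean_squared_norm(depth, alpha, width=32, trials=10, seed=0):
    # Average over a few independent initializations.
    rng = random.Random(seed)
    runs = [final_squared_norm(depth, alpha, width, rng) for _ in range(trials)]
    return sum(runs) / trials

L = 64
explode = mean_squared_norm(L, 1.0)                    # alpha_L = 1
nontrivial = mean_squared_norm(L, 1.0 / math.sqrt(L))  # alpha_L = 1/sqrt(L)
identity = mean_squared_norm(L, 1.0 / L)               # alpha_L = 1/L
```

With unscaled layers the output norm grows geometrically in L, with 1/L scaling the network barely moves its input, and 1/sqrt(L) sits in between with an order-one but random change.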

Optimal 1-Wasserstein Distance for WGANs

Jan 08, 2022

Arthur Stéphanovitch, Ugo Tanielian, Benoît Cadre, Nicolas Klutchnikoff, Gérard Biau

Abstract:The mathematical forces at work behind Generative Adversarial Networks raise challenging theoretical issues. Motivated by the important question of characterizing the geometrical properties of the generated distributions, we provide a thorough analysis of Wasserstein GANs (WGANs) in both the finite sample and asymptotic regimes. We study the specific case where the latent space is univariate and derive results valid regardless of the dimension of the output space. We show in particular that for a fixed sample size, the optimal WGANs are closely linked with connected paths minimizing the sum of the squared Euclidean distances between the sample points. We also highlight the fact that WGANs are able to approach (for the 1-Wasserstein distance) the target distribution as the sample size tends to infinity, at a given convergence rate and provided the family of generative Lipschitz functions grows appropriately. We derive in passing new results on optimal transport theory in the semi-discrete setting.

Framing RNN as a kernel method: A neural ODE approach

Jun 02, 2021

Adeline Fermanian, Pierre Marion, Jean-Philippe Vert, Gérard Biau

Abstract:Building on the interpretation of a recurrent neural network (RNN) as a continuous-time neural differential equation, we show, under appropriate conditions, that the solution of an RNN can be viewed as a linear function of a specific feature set of the input sequence, known as the signature. This connection allows us to frame an RNN as a kernel method in a suitable reproducing kernel Hilbert space. As a consequence, we obtain theoretical guarantees on generalization and stability for a large class of recurrent networks. Our results are illustrated on simulated datasets.

* 32 pages, 7 figures

SHAFF: Fast and consistent SHApley eFfect estimates via random Forests

May 25, 2021

Clément Bénard, Gérard Biau, Sébastien da Veiga, Erwan Scornet

Abstract:Interpretability of learning algorithms is crucial for applications involving critical decisions, and variable importance is one of the main interpretation tools. Shapley effects are now widely used to interpret both tree ensembles and neural networks, as they can efficiently handle dependence and interactions in the data, as opposed to most other variable importance measures. However, estimating Shapley effects is a challenging task because of the computational complexity and the conditional expectation estimates. Accordingly, existing Shapley algorithms have flaws: a costly running time, or a bias when input variables are dependent. Therefore, we introduce SHAFF, SHApley eFfects via random Forests, a fast and accurate Shapley effect estimate, even when input variables are dependent. We demonstrate the efficiency of SHAFF through both a theoretical analysis of its consistency and extensive experiments showing practical performance improvements over competitors. An implementation of SHAFF in C++ and R is available online.
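
To make the computational cost mentioned in this abstract concrete, here is a minimal sketch of the exact combinatorial definition of Shapley effects, where the value of a variable subset U is the fraction of output variance explained by the inputs in U. This is not the SHAFF estimator: the function name is hypothetical, and the explained-variance table is a hand-computed toy case (Y = X1 + 2*X2 with independent unit-variance inputs), whereas in practice those values must be estimated.

```python
from itertools import combinations
from math import comb

def shapley_effects(num_vars, value):
    # Exact Shapley effects: for each variable j, a weighted average of
    # value(U + {j}) - value(U) over all subsets U of the other variables.
    effects = []
    for j in range(num_vars):
        others = [i for i in range(num_vars) if i != j]
        total = 0.0
        for size in range(num_vars):
            weight = 1.0 / (num_vars * comb(num_vars - 1, size))
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {j}) - value(s))
        effects.append(total)
    return effects

# Toy table of explained-variance fractions for Y = X1 + 2*X2 (Var(Y) = 5).
explained = {
    frozenset(): 0.0,
    frozenset({0}): 0.2,   # X1 alone explains 1/5 of the variance
    frozenset({1}): 0.8,   # X2 alone explains 4/5
    frozenset({0, 1}): 1.0,
}
effects = shapley_effects(2, lambda s: explained[s])
```

The loop visits every subset of the remaining variables, so the number of value evaluations grows exponentially with the number of inputs. That cost, together with the need to estimate each conditional variance, is the difficulty the abstract refers to.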
