Raphaël Carpintero Perez

CMAP

A reproducible comparative study of categorical kernels for Gaussian process regression, with new clustering-based nested kernels

Oct 02, 2025

Raphaël Carpintero Perez, Sébastien Da Veiga, Josselin Garnier

Abstract: Designing categorical kernels is a major challenge for Gaussian process regression with continuous and categorical inputs. Despite previous studies, it is difficult to identify a preferred method, because the evaluation metrics, the optimization procedure, or the datasets change from one study to the next. In particular, reproducible code is rarely available. The aim of this paper is to provide a reproducible comparative study of all existing categorical kernels on many of the test cases investigated so far. We also propose new evaluation metrics inspired by the optimization community, which provide quantitative rankings of the methods across several tasks. From our results on datasets which exhibit a group structure on the levels of categorical inputs, it appears that nested kernel methods clearly outperform all competitors. When the group structure is unknown or when there is no prior knowledge of such a structure, we propose a new clustering-based strategy using target encodings of categorical variables. We show that on a large panel of datasets, which do not necessarily have a known group structure, this estimation strategy still outperforms other approaches while maintaining low computational cost.
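The clustering-based strategy mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify the encoding or the clustering algorithm, so everything below is an assumption made for illustration: each level is encoded by the mean of the target over its samples, and levels are grouped by cutting the sorted encodings at the largest gaps. The function names `target_encode` and `group_levels` and the toy data are hypothetical, not taken from the paper's code.

```python
import numpy as np

def target_encode(levels, y):
    """Map each categorical level to the mean target value of its samples."""
    levels = np.asarray(levels)
    y = np.asarray(y, dtype=float)
    return {str(lvl): float(y[levels == lvl].mean()) for lvl in np.unique(levels)}

def group_levels(encoding, n_groups):
    """Assign levels to n_groups by cutting the sorted encodings at the largest gaps.

    This is a simple stand-in for a clustering step (an assumption, not the
    paper's procedure).
    """
    names = sorted(encoding, key=encoding.get)
    values = np.array([encoding[name] for name in names])
    gaps = np.diff(values)
    # Positions of the (n_groups - 1) largest gaps, used as cut points.
    cuts = np.sort(np.argsort(gaps)[::-1][: n_groups - 1]) + 1
    labels = np.zeros(len(names), dtype=int)
    for cut in cuts:
        labels[cut:] += 1
    return {name: int(label) for name, label in zip(names, labels)}

# Toy data (made up): levels "a" and "b" behave alike, as do "c" and "d".
levels = ["a", "a", "b", "b", "c", "c", "d", "d"]
y = [1.0, 1.2, 1.1, 0.9, 5.0, 5.2, 4.9, 5.1]

encoding = target_encode(levels, y)
groups = group_levels(encoding, n_groups=2)
print(encoding)
print(groups)
```

The estimated groups could then play the role of the known group structure that a nested kernel expects; how the paper builds the kernel from the groups is not described in the abstract.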

Physics-Learning AI Datamodel (PLAID) datasets: a collection of physics simulations for machine learning

May 08, 2025

Fabien Casenave, Xavier Roynard, Brian Staber, William Piat, Michele Alessandro Bucci, Nissrine Akkari, Abbas Kabalan, Xuan Minh Vuong Nguyen, Luca Saverio, Raphaël Carpintero Perez(+8 more)

Abstract: Machine learning-based surrogate models have emerged as a powerful tool to accelerate simulation-driven scientific workflows. However, their widespread adoption is hindered by the lack of large-scale, diverse, and standardized datasets tailored to physics-based simulations. While existing initiatives provide valuable contributions, many are limited in scope: they focus on specific physics domains, rely on fragmented tooling, or adhere to overly simplistic datamodels that restrict generalization. To address these limitations, we introduce PLAID (Physics-Learning AI Datamodel), a flexible and extensible framework for representing and sharing datasets of physics simulations. PLAID defines a unified standard for describing simulation data and is accompanied by a library for creating, reading, and manipulating complex datasets across a wide range of physical use cases (gitlab.com/drti/plaid). We release six carefully crafted datasets under the PLAID standard, covering structural mechanics and computational fluid dynamics, and provide baseline benchmarks using representative learning methods. Benchmarking tools are made available on Hugging Face, enabling direct participation by the community and contribution to ongoing evaluation efforts (huggingface.co/PLAIDcompetitions).

Learning signals defined on graphs with optimal transport and Gaussian process regression

Oct 21, 2024

Raphaël Carpintero Perez, Sébastien da Veiga, Josselin Garnier, Brian Staber

Abstract: In computational physics, machine learning has now emerged as a powerful complementary tool to efficiently explore candidate designs in engineering studies. Outputs in such supervised problems are signals defined on meshes, and a natural question is the extension of general scalar output regression models to such complex outputs. In particular, changes between input geometries in terms of both size and adjacency structure make this transition non-trivial. In this work, we propose an innovative strategy for Gaussian process regression where inputs are large and sparse graphs with continuous node attributes and outputs are signals defined on the nodes of the associated inputs. The methodology relies on the combination of regularized optimal transport, dimension reduction techniques, and the use of Gaussian processes indexed by graphs. In addition to enabling signal prediction, a key benefit of our proposal is that it provides confidence intervals on node values, which is crucial for uncertainty quantification and active learning. Numerical experiments highlight the efficiency of the method in solving real problems in fluid dynamics and solid mechanics.
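The abstract names regularized optimal transport as one ingredient but gives no details of how it is used. As a rough illustration of the ingredient alone, the sketch below computes an entropy-regularized transport plan between two point sets of different sizes with Sinkhorn iterations. The function name `sinkhorn_plan`, the uniform weights, the normalized squared Euclidean cost, and the parameter values are all assumptions chosen for the example, not the authors' setup.

```python
import numpy as np

def sinkhorn_plan(a, b, cost, reg=0.5, n_iters=200):
    """Entropy-regularized transport plan between weight vectors a and b.

    a: (n,) source weights, b: (m,) target weights, cost: (n, m) cost matrix.
    Returns an (n, m) plan whose row sums equal a and whose column sums
    approach b as the iterations converge.
    """
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # rescale to match column marginals
        u = a / (K @ v)              # rescale to match row marginals
    return u[:, None] * K * v[None, :]

# Two small point clouds of different sizes (made-up data standing in for
# the node coordinates of two meshes).
rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 2))
Y = rng.uniform(size=(7, 2))

cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()             # normalize so reg has a comparable scale

a = np.full(5, 1.0 / 5)
b = np.full(7, 1.0 / 7)
plan = sinkhorn_plan(a, b, cost)
print(plan.shape, plan.sum())
```

A plan of this kind couples nodes of one geometry with nodes of another, which is one way signals living on meshes of different sizes can be compared; the paper's actual construction may differ.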

Gaussian process regression with Sliced Wasserstein Weisfeiler-Lehman graph kernels

Feb 06, 2024

Raphaël Carpintero Perez, Sébastien da Veiga, Josselin Garnier, Brian Staber

Abstract: Supervised learning has recently garnered significant attention in the field of computational physics due to its ability to effectively extract complex patterns for tasks like solving partial differential equations or predicting material properties. Traditionally, datasets in this field consist of inputs given as meshes with a large number of nodes representing the problem geometry (seen as graphs), and corresponding outputs obtained with a numerical solver. This means the supervised learning model must be able to handle large and sparse graphs with continuous node attributes. In this work, we focus on Gaussian process regression, for which we introduce the Sliced Wasserstein Weisfeiler-Lehman (SWWL) graph kernel. In contrast to existing graph kernels, the proposed SWWL kernel enjoys positive definiteness and a drastic complexity reduction, which makes it possible to process datasets that were previously impossible to handle. The new kernel is first validated on graph classification for molecular datasets, where the input graphs have a few tens of nodes. The efficiency of the SWWL kernel is then illustrated on graph regression in computational fluid dynamics and solid mechanics, where the input graphs are made up of tens of thousands of nodes.
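To give a feel for the "sliced Wasserstein" part of the kernel's name, here is a minimal sketch of a sliced Wasserstein distance between two sets of node embeddings of different sizes, followed by a Gaussian-type kernel built on it. It covers only that one ingredient: the Weisfeiler-Lehman embedding step is not shown, and the random embeddings, the quantile-based comparison, the function names, and the parameters `n_projections`, `n_quantiles`, and `gamma` are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=50, n_quantiles=100, seed=0):
    """Approximate sliced 2-Wasserstein distance between point clouds X and Y.

    X: (n, d) and Y: (m, d) may have different numbers of points. Each cloud
    is projected on random unit directions, and the 1D projected
    distributions are compared through their quantiles.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(n_projections, X.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    q = np.linspace(0.0, 1.0, n_quantiles)
    # Quantiles of each projection: arrays of shape (n_quantiles, n_projections).
    qx = np.quantile(X @ directions.T, q, axis=0)
    qy = np.quantile(Y @ directions.T, q, axis=0)
    return float(np.sqrt(np.mean((qx - qy) ** 2)))

def sw_kernel(X, Y, gamma=1.0, **kwargs):
    """Gaussian-type kernel on the sliced Wasserstein distance."""
    return float(np.exp(-gamma * sliced_wasserstein(X, Y, **kwargs) ** 2))

# Made-up node embeddings for two graphs with 30 and 45 nodes.
rng = np.random.default_rng(1)
emb_a = rng.normal(size=(30, 3))
emb_b = rng.normal(size=(45, 3)) + 5.0

print(sliced_wasserstein(emb_a, emb_b))
print(sw_kernel(emb_a, emb_b, gamma=0.1))
```

Because each projection only requires sorting or quantile computations in one dimension, the cost grows mildly with the number of nodes, which is consistent with the complexity reduction the abstract highlights.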
