Maurizio Filippone

One-Line-of-Code Data Mollification Improves Optimization of Likelihood-based Generative Models

May 30, 2023
Ba-Hien Tran, Giulio Franzese, Pietro Michiardi, Maurizio Filippone

Generative Models (GMs) have attracted considerable attention due to their tremendous success in various domains, such as computer vision, where they are capable of generating impressive, realistic-looking images. Likelihood-based GMs are attractive because new data can be generated with a single model evaluation. However, they typically achieve lower sample quality than state-of-the-art score-based diffusion models (DMs). This paper provides a significant step towards addressing this limitation. The idea is to borrow one of the strengths of score-based DMs, namely the ability to perform accurate density estimation in low-density regions and to address manifold overfitting by means of data mollification. We connect data mollification through the addition of Gaussian noise to Gaussian homotopy, a well-known technique to improve optimization. Data mollification can be implemented by adding one line of code in the optimization loop, and we demonstrate that this boosts the generation quality of likelihood-based GMs without computational overhead. We report results on image data sets with popular likelihood-based GMs, including variants of variational autoencoders and normalizing flows, showing large improvements in FID score.
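To make the "one line of code" concrete, here is a minimal sketch of data mollification inside a generic maximum-likelihood training loop. The linear annealing schedule, `sigma_max`, and the loop structure are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

def mollify(x, step, total_steps, sigma_max=1.0):
    # Gaussian-homotopy-style schedule: the noise level is annealed to zero
    # over training (the linear schedule and sigma_max are illustrative).
    sigma = sigma_max * max(0.0, 1.0 - step / total_steps)
    return x + sigma * torch.randn_like(x)

# In a standard likelihood-based training loop, the only change is the single
# mollification line applied to each mini-batch:
#
#   for step, x in enumerate(loader):
#       x = mollify(x, step, total_steps)   # <-- the added line
#       loss = -model.log_prob(x).mean()
#       loss.backward(); optimizer.step(); optimizer.zero_grad()
```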

When is Importance Weighting Correction Needed for Covariate Shift Adaptation?

Mar 07, 2023
Davit Gogolashvili, Matteo Zecchin, Motonobu Kanagawa, Marios Kountouris, Maurizio Filippone

This paper investigates when the importance weighting (IW) correction is needed to address covariate shift, a common situation in supervised learning where the input distributions of training and test data differ. Classic results show that the IW correction is needed when the model is parametric and misspecified. In contrast, recent results indicate that the IW correction may not be necessary when the model is nonparametric and well-specified. We examine the missing case in the literature where the model is nonparametric and misspecified, and show that the IW correction is needed for obtaining the best approximation of the true unknown function for the test distribution. We do this by analyzing IW-corrected kernel ridge regression, covering a variety of settings, including parametric and nonparametric models, well-specified and misspecified settings, and arbitrary weighting functions.
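A minimal sketch of the estimator analyzed here, importance-weighted kernel ridge regression. The Gaussian kernel, the regularization constant, and the assumption that the importance weights q_test(x)/p_train(x) are already available are illustrative simplifications.

```python
import numpy as np

def gaussian_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel matrix between the rows of A and B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def iw_krr_fit(X, y, weights, lam=1e-2, lengthscale=1.0):
    # Importance-weighted kernel ridge regression: minimize
    #   (1/n) sum_i w_i (f(x_i) - y_i)^2 + lam * ||f||_H^2,
    # whose representer solution solves (W K + n*lam*I) alpha = W y.
    n = len(y)
    K = gaussian_kernel(X, X, lengthscale)
    W = np.diag(weights)
    return np.linalg.solve(W @ K + n * lam * np.eye(n), W @ y)

def iw_krr_predict(X_train, X_new, alpha, lengthscale=1.0):
    return gaussian_kernel(X_new, X_train, lengthscale) @ alpha
```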

Continuous-Time Functional Diffusion Processes

Mar 01, 2023
Giulio Franzese, Simone Rossi, Dario Rossi, Markus Heinonen, Maurizio Filippone, Pietro Michiardi

We introduce functional diffusion processes (FDPs), which generalize traditional score-based diffusion models to infinite-dimensional function spaces. FDPs require a new mathematical framework to describe the forward and backward dynamics, and several extensions to derive practical training objectives. These include infinite-dimensional versions of the Girsanov theorem, so that an ELBO can be computed, and of the sampling theorem, so that functional evaluations on a countable set of points are guaranteed to be equivalent to infinite-dimensional functions. We use FDPs to build a new breed of generative models in function spaces that do not require specialized network architectures and can work with any kind of continuous data. Our results on synthetic and real data illustrate the advantages of FDPs in simplifying the design requirements of diffusion models.
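For orientation, these are the standard finite-dimensional score-based SDEs that FDPs lift to function spaces (the usual formulation from the diffusion literature, not the paper's infinite-dimensional construction):

$$\mathrm{d}x_t = f(x_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t,
\qquad
\mathrm{d}x_t = \big[f(x_t, t) - g(t)^2\,\nabla_x \log p_t(x_t)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w}_t,$$

where the first equation describes the forward (noising) dynamics and the second the reverse (generative) dynamics driven by the score.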

Fully Bayesian Autoencoders with Latent Sparse Gaussian Processes

Feb 09, 2023
Ba-Hien Tran, Babak Shahbaba, Stephan Mandt, Maurizio Filippone

Autoencoders and their variants are among the most widely used models in representation learning and generative modeling. However, autoencoder-based models usually assume that the learned representations are i.i.d. and fail to capture the correlations between the data samples. To address this issue, we propose a novel Sparse Gaussian Process Bayesian Autoencoder (SGPBAE) model in which we impose fully Bayesian sparse Gaussian Process priors on the latent space of a Bayesian Autoencoder. We perform posterior estimation for this model via stochastic gradient Hamiltonian Monte Carlo. We evaluate our approach qualitatively and quantitatively on a wide range of representation learning and generative modeling tasks and show that our approach consistently outperforms multiple alternatives relying on Variational Autoencoders.
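Posterior estimation in the SGPBAE relies on stochastic gradient Hamiltonian Monte Carlo; below is a minimal sketch of one such update. Applying it jointly to decoder weights and latent codes under the sparse GP prior, and the specific step size and friction values, are assumptions of this sketch rather than the paper's exact algorithm.

```python
import torch

def sghmc_step(theta, grad_log_post, vel, lr=1e-4, friction=0.05):
    # One stochastic-gradient HMC update (Chen et al., 2014-style): the
    # velocity mixes a friction term, the stochastic gradient of the log
    # posterior, and Gaussian noise whose scale matches the friction.
    noise = torch.randn_like(theta) * (2.0 * friction * lr) ** 0.5
    vel = (1.0 - friction) * vel + lr * grad_log_post + noise
    return theta + vel, vel
```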

Locally Smoothed Gaussian Process Regression

Oct 18, 2022
Davit Gogolashvili, Bogdan Kozyrskiy, Maurizio Filippone

We develop a novel framework to accelerate Gaussian process regression (GPR). In particular, we consider localization kernels at each data point to down-weight the contributions from other data points that are far away, and we derive the GPR model stemming from the application of such a localization operation. Through a set of experiments, we demonstrate the competitive performance of the proposed approach compared to full GPR, other localized models, and deep Gaussian processes. Crucially, this performance is obtained with considerable speedups compared to standard global GPR, thanks to the sparsification of the Gram matrix induced by the localization operation.
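A minimal sketch of the localization idea: the base kernel is multiplied by a compactly supported window around a center point, so distant points are down-weighted and the resulting Gram matrix becomes (near-)sparse. The truncated-quadratic window and the RBF base kernel are illustrative choices; the paper derives the full GPR model induced by the localization operation.

```python
import numpy as np

def localized_kernel(X1, X2, center, base_lengthscale=1.0, loc_radius=2.0):
    # RBF base kernel multiplied by a compactly supported window around
    # `center`: points outside the radius contribute exactly zero.
    def rbf(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-0.5 * d2 / base_lengthscale**2)
    def window(A):
        d = np.linalg.norm(A - center, axis=1) / loc_radius
        return np.clip(1.0 - d**2, 0.0, None)
    return window(X1)[:, None] * rbf(X1, X2) * window(X2)[None, :]
```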

How Much is Enough? A Study on Diffusion Times in Score-based Generative Models

Jun 10, 2022
Giulio Franzese, Simone Rossi, Lixuan Yang, Alessandro Finamore, Dario Rossi, Maurizio Filippone, Pietro Michiardi

Score-based diffusion models are a class of generative models whose dynamics is described by stochastic differential equations that map noise into data. While recent works have started to lay down a theoretical foundation for these models, an analytical understanding of the role of the diffusion time T is still lacking. Current best practice advocates for a large T to ensure that the forward dynamics brings the diffusion sufficiently close to a known and simple noise distribution; however, a smaller value of T should be preferred for a better approximation of the score-matching objective and higher computational efficiency. Starting from a variational interpretation of diffusion models, in this work we quantify this trade-off, and suggest a new method to improve quality and efficiency of both training and sampling, by adopting smaller diffusion times. Indeed, we show how an auxiliary model can be used to bridge the gap between the ideal and the simulated forward dynamics, followed by a standard reverse diffusion process. Empirical results support our analysis; for image data, our method is competitive w.r.t. the state-of-the-art, according to standard sample quality metrics and log-likelihood.
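The trade-off can be seen directly from the closed-form forward marginals of the usual VP SDE. The small sketch below uses the standard parameterisation and default schedule values (not necessarily those used in the paper) to show how much signal remains in x_T for different diffusion times T.

```python
import numpy as np

def vp_marginal_params(t, beta_min=0.1, beta_max=20.0):
    # VP-SDE forward marginal p(x_t | x_0) = N(alpha_t * x_0, sigma_t^2 I),
    # with a linear beta schedule (standard defaults, illustrative only).
    log_alpha = -0.5 * (beta_min * t + 0.5 * (beta_max - beta_min) * t**2)
    return np.exp(log_alpha), 1.0 - np.exp(2.0 * log_alpha)

# A shorter diffusion time leaves more signal in x_T (alpha_T further from 0),
# so p(x_T) sits further from the N(0, I) prior used for sampling; this is the
# gap the paper proposes to bridge with an auxiliary model.
for T in [0.25, 0.5, 1.0]:
    a, s2 = vp_marginal_params(T)
    print(f"T={T}: alpha_T={a:.3f}, sigma_T^2={s2:.3f}")
```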

Local Random Feature Approximations of the Gaussian Kernel

Apr 12, 2022
Jonas Wacker, Maurizio Filippone

A fundamental drawback of kernel-based statistical models is their limited scalability to large data sets, which requires resorting to approximations. In this work, we focus on the popular Gaussian kernel and on techniques to linearize kernel-based models by means of random feature approximations. In particular, we do so by studying a less explored random feature approximation based on Maclaurin expansions and polynomial sketches. We show that such approaches yield poor results when modelling high-frequency data, and we propose a novel localization scheme that improves kernel approximations and downstream performance significantly in this regime. We demonstrate these gains on a number of experiments involving the application of Gaussian process regression to synthetic and real-world data of different data sizes and dimensions.
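For context, here is a minimal sketch of the plain (non-localized) random Maclaurin construction for the Gaussian kernel, which the paper's localization scheme builds on and improves. It uses the factorisation k(x, y) = exp(-||x||^2/2s^2) exp(-||y||^2/2s^2) exp(<x, y>/s^2) together with Rademacher sketches of the Maclaurin terms of the last factor; D, sigma, and the degree distribution are illustrative.

```python
import math
import numpy as np

def random_maclaurin_gaussian_features(X, D=512, sigma=1.0, seed=0):
    # Plain random Maclaurin features (Kar & Karnick style) for the Gaussian
    # kernel, without the paper's localization scheme. Unbiased:
    # E[phi(x) @ phi(y)] = exp(-||x - y||^2 / (2 sigma^2)).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Z = np.empty((n, D))
    for i in range(D):
        m = int(rng.geometric(0.5)) - 1                       # P(m) = 2^{-(m+1)}
        coeff = (1.0 / sigma**2) ** m / math.factorial(m)     # Maclaurin coefficient
        proj = np.ones(n)
        for _ in range(m):
            w = rng.choice([-1.0, 1.0], size=d)               # Rademacher vector
            proj *= X @ w
        Z[:, i] = np.sqrt(coeff * 2.0 ** (m + 1)) * proj
    prefactor = np.exp(-np.sum(X**2, axis=1) / (2.0 * sigma**2))
    return (Z / np.sqrt(D)) * prefactor[:, None]
```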

* 11 pages 

Complex-to-Real Random Features for Polynomial Kernels

Feb 10, 2022
Jonas Wacker, Ruben Ohana, Maurizio Filippone

Kernel methods are ubiquitous in statistical modeling due to their theoretical guarantees as well as their competitive empirical performance. Polynomial kernels are of particular importance as their feature maps model the interactions between the dimensions of the input data. However, the construction time of explicit feature maps scales exponentially with the polynomial degree, and a naive application of the kernel trick does not scale to large datasets. In this work, we propose Complex-to-Real (CtR) random features for polynomial kernels that leverage intermediate complex random projections and can yield kernel estimates with much lower variances than their real-valued analogs. The resulting features are real-valued, simple to construct, and have the following advantages over the state-of-the-art: 1) shorter construction times, 2) lower kernel approximation errors for commonly used degrees, and 3) a closed-form expression for their variance.
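A minimal sketch of the Complex-to-Real idea for the homogeneous polynomial kernel (<x, y>)^p: products of complex Rademacher projections, with real and imaginary parts stacked so that an ordinary dot product recovers the real part of the complex estimate. The scalings and the choice of sketch here are a simplified reading, not the paper's exact construction.

```python
import numpy as np

def ctr_polynomial_features(X, degree=3, D=256, seed=0):
    # Complex Rademacher sketch (entries uniform on {1, i, -1, -i}) applied
    # `degree` times and multiplied elementwise, then converted to real
    # features by stacking real and imaginary parts.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    roots = np.array([1, 1j, -1, -1j])
    Z = np.ones((n, D), dtype=complex)
    for _ in range(degree):
        W = rng.choice(roots, size=(d, D))
        Z *= X @ W
    Z /= np.sqrt(D)
    # The dot product of the stacked features equals
    # Re[(1/D) sum_i z_i(x) conj(z_i(y))], an unbiased estimate of (<x, y>)^degree.
    return np.hstack([Z.real, Z.imag])
```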

* 26 pages 

Improved Random Features for Dot Product Kernels

Feb 03, 2022
Jonas Wacker, Motonobu Kanagawa, Maurizio Filippone

Dot product kernels, such as polynomial and exponential (softmax) kernels, are among the most widely used kernels in machine learning, as they enable modeling the interactions between input features, which is crucial in applications like computer vision, natural language processing, and recommender systems. We make several novel contributions for improving the efficiency of random feature approximations for dot product kernels, to make these kernels more useful in large scale learning. First, we present a generalization of existing random feature approximations for polynomial kernels, such as Rademacher and Gaussian sketches and TensorSRHT, using complex-valued random features. We show empirically that the use of complex features can significantly reduce the variances of these approximations. Second, we provide a theoretical analysis for understanding the factors affecting the efficiency of various random feature approximations, by deriving closed-form expressions for their variances. These variance formulas elucidate conditions under which certain approximations (e.g., TensorSRHT) achieve lower variances than others (e.g., Rademacher sketches), and conditions under which the use of complex features leads to lower variances than real features. Third, by using these variance formulas, which can be evaluated in practice, we develop a data-driven optimization approach to improve random feature approximations for general dot product kernels, which is also applicable to the Gaussian kernel. We describe the improvements brought by these contributions with extensive experiments on a variety of tasks and datasets.
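Below is a small, self-contained illustration of the central claim that complex-valued features can reduce variance (it is not one of the paper's estimators, which include structured sketches such as TensorSRHT): a degree-2 product sketch of (x^T y)^2 with real Rademacher entries versus entries drawn from the fourth roots of unity.

```python
import numpy as np

def sketch_poly2(x, y, D, complex_valued, rng):
    # One estimate of (x^T y)^2 from a degree-2 product sketch with D features,
    # using real Rademacher entries or complex ones uniform on {1, i, -1, -i}.
    d = x.shape[0]
    vals = np.array([1, 1j, -1, -1j]) if complex_valued else np.array([1.0, -1.0])
    zx = np.ones(D, dtype=complex)
    zy = np.ones(D, dtype=complex)
    for _ in range(2):
        W = rng.choice(vals, size=(d, D))
        zx *= x @ W
        zy *= y @ W
    return np.real(np.mean(zx * np.conj(zy)))

rng = np.random.default_rng(0)
x, y = rng.standard_normal(20), rng.standard_normal(20)
exact = (x @ y) ** 2
for cplx in (False, True):
    ests = [sketch_poly2(x, y, 64, cplx, rng) for _ in range(2000)]
    print("complex" if cplx else "real   ",
          "mean error:", np.mean(ests) - exact, "variance:", np.var(ests))
```

Both estimators are unbiased; the complex variant typically prints a noticeably smaller variance, which is the effect the paper analyzes in closed form.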

* 72 pages 