Abstract:This paper introduces a novel kernel-based formulation of the Equalized Odds (EO) criterion, denoted as $EO_k$, for fair representation learning (FRL) in supervised settings. The central goal of FRL is to mitigate discrimination regarding a sensitive attribute $S$ while preserving prediction accuracy for the target variable $Y$. Our proposed criterion enables a rigorous and interpretable quantification of three core fairness objectives: independence (prediction $\hat{Y}$ is independent of $S$), separation (also known as equalized odds; prediction $\hat{Y}$ is independent with $S$ conditioned on target attribute $Y$), and calibration ($Y$ is independent of $S$ conditioned on the prediction $\hat{Y}$). Under both unbiased ($Y$ is independent of $S$) and biased ($Y$ depends on $S$) conditions, we show that $EO_k$ satisfies both independence and separation in the former, and uniquely preserves predictive accuracy while lower bounding independence and calibration in the latter, thereby offering a unified analytical characterization of the tradeoffs among these fairness criteria. We further define the empirical counterpart, $\hat{EO}_k$, a kernel-based statistic that can be computed in quadratic time, with linear-time approximations also available. A concentration inequality for $\hat{EO}_k$ is derived, providing performance guarantees and error bounds, which serve as practical certificates of fairness compliance. While our focus is on theoretical development, the results lay essential groundwork for principled and provably fair algorithmic design in future empirical studies.
Abstract:Diffusion models have made significant advancements in recent years. However, their performance often deteriorates when trained or fine-tuned on imbalanced datasets. This degradation is largely due to the disproportionate representation of majority and minority data in image-text pairs. In this paper, we propose a general fine-tuning approach, dubbed PoGDiff, to address this challenge. Rather than directly minimizing the KL divergence between the predicted and ground-truth distributions, PoGDiff replaces the ground-truth distribution with a Product of Gaussians (PoG), which is constructed by combining the original ground-truth targets with the predicted distribution conditioned on a neighboring text embedding. Experiments on real-world datasets demonstrate that our method effectively addresses the imbalance problem in diffusion models, improving both generation accuracy and quality.
Abstract:Anomaly detection is an important problem in many application areas, such as network security. Many deep learning methods for unsupervised anomaly detection produce good empirical performance but lack theoretical guarantees. By casting anomaly detection into a binary classification problem, we establish non-asymptotic upper bounds and a convergence rate on the excess risk on rectified linear unit (ReLU) neural networks trained on synthetic anomalies. Our convergence rate on the excess risk matches the minimax optimal rate in the literature. Furthermore, we provide lower and upper bounds on the number of synthetic anomalies that can attain this optimality. For practical implementation, we relax some conditions to improve the search for the empirical risk minimizer, which leads to competitive performance to other classification-based methods for anomaly detection. Overall, our work provides the first theoretical guarantees of unsupervised neural network-based anomaly detectors and empirical insights on how to design them well.
Abstract:Maximum Mean Discrepancy (MMD) is a probability metric that has found numerous applications in machine learning. In this work, we focus on its application in generative models, including the minimum MMD estimator, Generative Moment Matching Network (GMMN), and Generative Adversarial Network (GAN). In these cases, MMD is part of an objective function in a minimization or min-max optimization problem. Even if its empirical performance is competitive, the consistency and convergence rate analysis of the corresponding MMD-based estimators has yet to be carried out. We propose a uniform concentration inequality for a class of Maximum Mean Discrepancy (MMD)-based estimators, that is, a maximum deviation bound of empirical MMD values over a collection of generated distributions and adversarially learned kernels. Here, our inequality serves as an efficient tool in the theoretical analysis for MMD-based generative models. As elaborating examples, we applied our main result to provide the generalization error bounds for the MMD-based estimators in the context of the minimum MMD estimator and MMD GAN.
Abstract:Motivated by the abundance of functional data such as time series and images, there has been a growing interest in integrating such data into neural networks and learning maps from function spaces to R (i.e., functionals). In this paper, we study the approximation of functionals on reproducing kernel Hilbert spaces (RKHS's) using neural networks. We establish the universality of the approximation of functionals on the RKHS's. Specifically, we derive explicit error bounds for those induced by inverse multiquadric, Gaussian, and Sobolev kernels. Moreover, we apply our findings to functional regression, proving that neural networks can accurately approximate the regression maps in generalized functional linear models. Existing works on functional learning require integration-type basis function expansions with a set of pre-specified basis functions. By leveraging the interpolating orthogonal projections in RKHS's, our proposed network is much simpler in that we use point evaluations to replace basis function expansions.
Abstract:In this paper, we prove the universal consistency of wide and deep ReLU neural network classifiers trained on the logistic loss. We also give sufficient conditions for a class of probability measures for which classifiers based on neural networks achieve minimax optimal rates of convergence. The result applies to a wide range of known function classes. In particular, while most previous works impose explicit smoothness assumptions on the regression function, our framework encompasses more general settings. The proposed neural networks are either the minimizers of the logistic loss or the $0$-$1$ loss. In the former case, they are interpolating classifiers that exhibit a benign overfitting behavior.
Abstract:Adversarial training has been proposed to hedge against adversarial attacks in machine learning and statistical models. This paper focuses on adversarial training under $\ell_\infty$-perturbation, which has recently attracted much research attention. The asymptotic behavior of the adversarial training estimator is investigated in the generalized linear model. The results imply that the limiting distribution of the adversarial training estimator under $\ell_\infty$-perturbation could put a positive probability mass at $0$ when the true parameter is $0$, providing a theoretical guarantee of the associated sparsity-recovery ability. Alternatively, a two-step procedure is proposed -- adaptive adversarial training, which could further improve the performance of adversarial training under $\ell_\infty$-perturbation. Specifically, the proposed procedure could achieve asymptotic unbiasedness and variable-selection consistency. Numerical experiments are conducted to show the sparsity-recovery ability of adversarial training under $\ell_\infty$-perturbation and to compare the empirical performance between classic adversarial training and adaptive adversarial training.
Abstract:The recent success of neural networks in pattern recognition and classification problems suggests that neural networks possess qualities distinct from other more classical classifiers such as SVMs or boosting classifiers. This paper studies the performance of plug-in classifiers based on neural networks in a binary classification setting as measured by their excess risks. Compared to the typical settings imposed in the literature, we consider a more general scenario that resembles actual practice in two respects: first, the function class to be approximated includes the Barron functions as a proper subset, and second, the neural network classifier constructed is the minimizer of a surrogate loss instead of the $0$-$1$ loss so that gradient descent-based numerical optimizations can be easily applied. While the class of functions we consider is quite large that optimal rates cannot be faster than $n^{-\frac{1}{3}}$, it is a regime in which dimension-free rates are possible and approximation power of neural networks can be taken advantage of. In particular, we analyze the estimation and approximation properties of neural networks to obtain a dimension-free, uniform rate of convergence for the excess risk. Finally, we show that the rate obtained is in fact minimax optimal up to a logarithmic factor, and the minimax lower bound shows the effect of the margin assumption in this regime.
Abstract:This paper studies the binary classification of unbounded data from ${\mathbb R}^d$ generated under Gaussian Mixture Models (GMMs) using deep ReLU neural networks. We obtain $\unicode{x2013}$ for the first time $\unicode{x2013}$ non-asymptotic upper bounds and convergence rates of the excess risk (excess misclassification error) for the classification without restrictions on model parameters. The convergence rates we derive do not depend on dimension $d$, demonstrating that deep ReLU networks can overcome the curse of dimensionality in classification. While the majority of existing generalization analysis of classification algorithms relies on a bounded domain, we consider an unbounded domain by leveraging the analyticity and fast decay of Gaussian distributions. To facilitate our analysis, we give a novel approximation error bound for general analytic functions using ReLU networks, which may be of independent interest. Gaussian distributions can be adopted nicely to model data arising in applications, e.g., speeches, images, and texts; our results provide a theoretical verification of the observed efficiency of deep neural networks in practical classification problems.
Abstract:Given a sequence of observable variables $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, the conformal prediction method estimates a confidence set for $y_{n+1}$ given $x_{n+1}$ that is valid for any finite sample size by merely assuming that the joint distribution of the data is permutation invariant. Although attractive, computing such a set is computationally infeasible in most regression problems. Indeed, in these cases, the unknown variable $y_{n+1}$ can take an infinite number of possible candidate values, and generating conformal sets requires retraining a predictive model for each candidate. In this paper, we focus on a sparse linear model with only a subset of variables for prediction and use numerical continuation techniques to approximate the solution path efficiently. The critical property we exploit is that the set of selected variables is invariant under a small perturbation of the input data. Therefore, it is sufficient to enumerate and refit the model only at the change points of the set of active features and smoothly interpolate the rest of the solution via a Predictor-Corrector mechanism. We show how our path-following algorithm accurately approximates conformal prediction sets and illustrate its performance using synthetic and real data examples.