IBM Research, USA
Abstract:Understanding the generalization of deep neural networks is one of the most important tasks in deep learning. Although much progress has been made, theoretical error bounds still often behave disparately from empirical observations. In this work, we develop margin-based generalization bounds, where the margins are normalized with optimal transport costs between independent random subsets sampled from the training distribution. In particular, the optimal transport cost can be interpreted as a generalization of variance which captures the structural properties of the learned feature space. Our bounds robustly predict the generalization error, given training data and network parameters, on large scale datasets. Theoretically, we demonstrate that the concentration and separation of features play crucial roles in generalization, supporting empirical results in the literature. The code is available at \url{https://github.com/chingyaoc/kV-Margin}.
Abstract:Training classifiers under fairness constraints such as group fairness, regularizes the disparities of predictions between the groups. Nevertheless, even though the constraints are satisfied during training, they might not generalize at evaluation time. To improve the generalizability of fair classifiers, we propose fair mixup, a new data augmentation strategy for imposing the fairness constraint. In particular, we show that fairness can be achieved by regularizing the models on paths of interpolated samples between the groups. We use mixup, a powerful data augmentation strategy to generate these interpolates. We analyze fair mixup and empirically show that it ensures a better generalization for both accuracy and fairness measurement in tabular, vision, and language benchmarks.
Abstract:Image captioning has recently demonstrated impressive progress largely owing to the introduction of neural network algorithms trained on curated dataset like MS-COCO. Often work in this field is motivated by the promise of deployment of captioning systems in practical applications. However, the scarcity of data and contexts in many competition datasets renders the utility of systems trained on these datasets limited as an assistive technology in real-world settings, such as helping visually impaired people navigate and accomplish everyday tasks. This gap motivated the introduction of the novel VizWiz dataset, which consists of images taken by the visually impaired and captions that have useful, task-oriented information. In an attempt to help the machine learning computer vision field realize its promise of producing technologies that have positive social impact, the curators of the VizWiz dataset host several competitions, including one for image captioning. This work details the theory and engineering from our winning submission to the 2020 captioning competition. Our work provides a step towards improved assistive image captioning systems.
Abstract:Image captioning systems have made substantial progress, largely due to the availability of curated datasets like Microsoft COCO or Vizwiz that have accurate descriptions of their corresponding images. Unfortunately, scarce availability of such cleanly labeled data results in trained algorithms producing captions that can be terse and idiosyncratically specific to details in the image. We propose a new technique, cooperative distillation that combines clean curated datasets with the web-scale automatically extracted captions of the Google Conceptual Captions dataset (GCC), which can have poor descriptions of images, but is abundant in size and therefore provides a rich vocabulary resulting in more expressive captions.
Abstract:We consider the maximum mean discrepancy ($\mathrm{MMD}$) GAN problem and propose a parametric kernelized gradient flow that mimics the min-max game in gradient regularized $\mathrm{MMD}$ GAN. We show that this flow provides a descent direction minimizing the $\mathrm{MMD}$ on a statistical manifold of probability distributions. We then derive an explicit condition which ensures that gradient descent on the parameter space of the generator in gradient regularized $\mathrm{MMD}$ GAN is globally convergent to the target distribution. Under this condition, we give non asymptotic convergence results of gradient descent in MMD GAN. Another contribution of this paper is the introduction of a dynamic formulation of a regularization of $\mathrm{MMD}$ and demonstrating that the parametric kernelized descent for $\mathrm{MMD}$ is the gradient flow of this functional with respect to the new Riemannian structure. Our obtained theoretical result allows ones to treat gradient flows for quite general functionals and thus has potential applications to other types of variational inferences on a statistical manifold beyond GANs. Finally, numerical experiments suggest that our parametric kernelized gradient flow stabilizes GAN training and guarantees convergence.
Abstract:Tabular datasets are ubiquitous in data science applications. Given their importance, it seems natural to apply state-of-the-art deep learning algorithms in order to fully unlock their potential. Here we propose neural network models that represent tabular time series that can optionally leverage their hierarchical structure. This results in two architectures for tabular time series: one for learning representations that is analogous to BERT and can be pre-trained end-to-end and used in downstream tasks, and one that is akin to GPT and can be used for generation of realistic synthetic tabular sequences. We demonstrate our models on two datasets: a synthetic credit card transaction dataset, where the learned representations are used for fraud detection and synthetic data generation, and on a real pollution dataset, where the learned encodings are used to predict atmospheric pollutant concentrations. Code and data are available at https://github.com/IBM/TabFormer.
Abstract:We introduce Unbalanced Sobolev Descent (USD), a particle descent algorithm for transporting a high dimensional source distribution to a target distribution that does not necessarily have the same mass. We define the Sobolev-Fisher discrepancy between distributions and show that it relates to advection-reaction transport equations and the Wasserstein-Fisher-Rao metric between distributions. USD transports particles along gradient flows of the witness function of the Sobolev-Fisher discrepancy (advection step) and reweighs the mass of particles with respect to this witness function (reaction step). The reaction step can be thought of as a birth-death process of the particles with rate of growth proportional to the witness function. When the Sobolev-Fisher witness function is estimated in a Reproducing Kernel Hilbert Space (RKHS), under mild assumptions we show that USD converges asymptotically (in the limit of infinite particles) to the target distribution in the Maximum Mean Discrepancy (MMD) sense. We then give two methods to estimate the Sobolev-Fisher witness with neural networks, resulting in two Neural USD algorithms. The first one implements the reaction step with mirror descent on the weights, while the second implements it through a birth-death process of particles. We show on synthetic examples that USD transports distributions with or without conservation of mass faster than previous particle descent algorithms, and finally demonstrate its use for molecular biology analyses where our method is naturally suited to match developmental stages of populations of differentiating cells based on their single-cell RNA sequencing profile. Code is available at https://github.com/ibm/usd .
Abstract:Surrogate models for partial-differential equations are widely used in the design of meta-materials to rapidly evaluate the behavior of composable components. However, the training cost of accurate surrogates by machine learning can rapidly increase with the number of variables. For photonic-device models, we find that this training becomes especially challenging as design regions grow larger than the optical wavelength. We present an active learning algorithm that reduces the number of training points by more than an order of magnitude for a neural-network surrogate model of optical-surface components compared to random samples. Results show that the surrogate evaluation is over two orders of magnitude faster than a direct solve, and we demonstrate how this can be exploited to accelerate large-scale engineering optimization.
Abstract:We are interested in gradient-based Explicit Generative Modeling where samples can be derived from iterative gradient updates based on an estimate of the score function of the data distribution. Recent advances in Stochastic Gradient Langevin Dynamics (SGLD) demonstrates impressive results with energy-based models on high-dimensional and complex data distributions. Stein Variational Gradient Descent (SVGD) is a deterministic sampling algorithm that iteratively transports a set of particles to approximate a given distribution, based on functional gradient descent that decreases the KL divergence. SVGD has promising results on several Bayesian inference applications. However, applying SVGD on high dimensional problems is still under-explored. The goal of this work is to study high dimensional inference with SVGD. We first identify key challenges in practical kernel SVGD inference in high-dimension. We propose noise conditional kernel SVGD (NCK-SVGD), that works in tandem with the recently introduced Noise Conditional Score Network estimator. NCK is crucial for successful inference with SVGD in high dimension, as it adapts the kernel to the noise level of the score estimate. As we anneal the noise, NCK-SVGD targets the real data distribution. We then extend the annealed SVGD with an entropic regularization. We show that this offers a flexible control between sample quality and diversity, and verify it empirically by precision and recall evaluations. The NCK-SVGD produces samples comparable to GANs and annealed SGLD on computer vision benchmarks, including MNIST and CIFAR-10.
Abstract:Recently, the task of image generation has attracted much attention. In particular, the recent empirical successes of the Markov Chain Monte Carlo (MCMC) technique of Langevin Dynamics have prompted a number of theoretical advances; despite this, several outstanding problems remain. First, the Langevin Dynamics is run in very high dimension on a nonconvex landscape; in the worst case, due to the NP-hardness of nonconvex optimization, it is thought that Langevin Dynamics mixes only in time exponential in the dimension. In this work, we demonstrate how the manifold hypothesis allows for the considerable reduction of mixing time, from exponential in the ambient dimension to depending only on the (much smaller) intrinsic dimension of the data. Second, the high dimension of the sampling space significantly hurts the performance of Langevin Dynamics; we leverage a multi-scale approach to help ameliorate this issue and observe that this multi-resolution algorithm allows for a trade-off between image quality and computational expense in generation.