Despite major advances in open-ended text generation, there has been limited progress in designing evaluation metrics for this task. We propose MAUVE -- a metric for open-ended text generation, which directly compares the distribution of machine-generated text to that of human language. MAUVE measures the mean area under the divergence curve for the two distributions, exploring the trade-off between two types of errors: those arising from parts of the human distribution that the model distribution approximates well, and those it does not. We present experiments across two open-ended generation tasks in the web text domain and the story domain, and a variety of decoding algorithms and model sizes. Our results show that evaluation under MAUVE indeed reflects the more natural behavior with respect to model size, compared to prior metrics. MAUVE's ordering of the decoding algorithms also agrees with that of generation perplexity, the most widely used metric in open-ended text generation; however, MAUVE presents a more principled evaluation metric for the task as it considers both model and human text.
The notion of a Moreau envelope is central to the analysis of first-order optimization algorithms for machine learning. Yet it has not been extended to deep networks or, more broadly, to machine learning systems with a differentiable programming implementation. We define a compositional calculus adapted to Moreau envelopes and show how to integrate it within differentiable programming. The proposed framework casts several variants of gradient back-propagation, related to the idea of propagating virtual targets, in a mathematical optimization framework.
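As a minimal illustration of the basic object (not of the compositional calculus itself), the Moreau envelope of a function $f$ with parameter $\lambda > 0$ is $\min_y \{ f(y) + \frac{1}{2\lambda}(x - y)^2 \}$. The Python sketch below evaluates it for the assumed example $f(y) = |y|$, whose minimizer is given by soft-thresholding. The function names are hypothetical and chosen only for illustration.

```python
def prox_abs(x, lam):
    """Proximal operator of f(y) = |y| with parameter lam (soft-thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_env_abs(x, lam):
    """Moreau envelope of |.| at x: f(y*) + (x - y*)^2 / (2 * lam), with y* = prox."""
    y = prox_abs(x, lam)
    return abs(y) + (x - y) ** 2 / (2.0 * lam)


def moreau_grad_abs(x, lam):
    """Gradient of the envelope, given by (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam
```

For this choice of $f$ the envelope is quadratic for $|x| \le \lambda$ and linear beyond, which shows how the envelope smooths a nonsmooth function while keeping its minimizer.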
We study the estimation of policy gradients for continuous-time systems with known dynamics. By reframing policy learning in continuous-time, we show that it is possible to construct a more efficient and accurate gradient estimator. The standard back-propagation through time estimator (BPTT) computes exact gradients for a crude discretization of the continuous-time system. In contrast, we approximate continuous-time gradients in the original system. With the explicit goal of estimating continuous-time gradients, we are able to discretize adaptively and construct a more efficient policy gradient estimator which we call the Continuous-Time Policy Gradient (CTPG). We show that replacing BPTT policy gradients with more efficient CTPG estimates results in faster and more robust learning in a variety of control tasks and simulators.
Consider the problem of estimating the optimal coupling (i.e., matching) between $N$ i.i.d. data points sampled from two densities $\rho_0$ and $\rho_1$ in $\mathbb{R}^d$. The cost of transport is an arbitrary continuous function that satisfies suitable growth and integrability assumptions. For both computational efficiency and smoothness, an entropic regularization term is often added to this discrete problem. We introduce a modification of the commonly used discrete entropic regularization (Cuturi '13) such that the optimal coupling for the regularized problem can be thought of as the static Schr\"odinger bridge with $N$ particles. This paper is on the asymptotic properties of this discrete Schr\"odinger bridge as $N$ tends to infinity. We show that it converges to the continuum Schr\"odinger bridge and derive the first two error terms of orders $N^{-1/2}$ and $N^{-1}$, respectively. This gives us a functional CLT, including for the cost of transport, and second order Gaussian chaos limits when the limiting Gaussian variance is zero, extending similar recent results derived for finite state spaces and the quadratic cost. The proofs are based on a novel chaos decomposition of the discrete Schr\"odinger bridge by polynomial functions of the pair of empirical distributions, viewed as first and second order Taylor approximations in the space of measures. This is achieved by extending the Hoeffding decomposition from the classical theory of U-statistics. The kernels corresponding to the first and second order chaoses are given by Markov operators which have natural interpretations in the Sinkhorn algorithm.
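For orientation, the Python sketch below runs the standard Sinkhorn iterations for the commonly used discrete entropic regularization (Cuturi '13) with uniform marginals on $N$ points. It does not implement the modification introduced in the paper. The function name, the regularization parameter `eps`, and the fixed iteration count are assumptions made for illustration.

```python
import math


def sinkhorn(cost, eps, num_iters=200):
    """Entropic optimal transport between two uniform measures on n points.

    cost[i][j] is the transport cost between source point i and target point j.
    Returns the n x n coupling matrix after num_iters Sinkhorn iterations.
    """
    n = len(cost)
    # Gibbs kernel K_ij = exp(-c_ij / eps).
    K = [[math.exp(-cost[i][j] / eps) for j in range(n)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * n
    for _ in range(num_iters):
        # Alternately rescale rows and columns to match the uniform marginals.
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [(1.0 / n) / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(n)]
```

Each row and column of the returned coupling sums to approximately $1/N$, and smaller values of `eps` concentrate the mass on low-cost pairs.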
Classical supervised learning via empirical risk (or negative log-likelihood) minimization hinges upon the assumption that the testing distribution coincides with the training distribution. This assumption can be challenged in modern applications of machine learning in which learning machines may operate at prediction time with testing data whose distribution departs from that of the training data. We revisit the superquantile regression method by proposing a first-order optimization algorithm to minimize a superquantile-based learning objective. The proposed algorithm is based on smoothing the superquantile function by infimal convolution. Promising numerical results illustrate the potential of the approach for safer supervised learning.
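To make the superquantile concrete, the Python sketch below computes the unsmoothed empirical superquantile (also known as conditional value-at-risk) of a list of losses through the Rockafellar-Uryasev variational formula $\min_\eta \{ \eta + \frac{1}{(1-\theta) n} \sum_i \max(x_i - \eta, 0) \}$. It does not include the infimal-convolution smoothing or the optimization algorithm of the paper. The function name and the brute-force search over sample points are assumptions made for illustration.

```python
def superquantile(losses, theta):
    """Empirical superquantile of `losses` at level theta in [0, 1).

    The objective is convex and piecewise linear in eta, so its minimum
    is attained at one of the sample points; we simply try them all.
    """
    n = len(losses)

    def objective(eta):
        excess = sum(max(x - eta, 0.0) for x in losses)
        return eta + excess / ((1.0 - theta) * n)

    return min(objective(eta) for eta in losses)
```

When $(1-\theta) n$ is an integer, this equals the average of the largest $(1-\theta) n$ losses, so the objective focuses on the worst-performing part of the sample; at $\theta = 0$ it reduces to the usual empirical mean.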
We consider convolutional networks from a reproducing kernel Hilbert space viewpoint. We establish harmonic decompositions of convolutional networks, that is, expansions into sums of elementary functions of increasing order. The elementary functions are related to the spherical harmonics, a fundamental class of special functions on spheres. The harmonic decompositions allow us to characterize the integral operators associated with convolutional networks and, as a result, to obtain statistical bounds for convolutional networks.
We characterize the behavior of integral operators associated with multi-layer perceptrons in two eigenvalue decay regimes. As a result, we obtain sharper risk bounds for multi-layer perceptrons, highlighting their behavior in high dimensions. In doing so, we also improve on previous results on integral operators related to power series kernels on spheres, with sharper eigenvalue decay estimates in a wider range of eigenvalue decay regimes.
We propose a federated learning framework to handle heterogeneous client devices which do not conform to the population data distribution. The approach hinges upon a parameterized superquantile-based objective, where the parameter ranges over levels of conformity. We present an optimization algorithm and establish its convergence to a stationary point. We show how to practically implement it using secure aggregation by interleaving iterations of the usual federated averaging method with device filtering. We conclude with numerical experiments on neural networks as well as linear models on tasks from computer vision and natural language processing.
We present an approach to obtain convergence guarantees of optimization algorithms for deep networks based on elementary arguments and computations. The convergence analysis revolves around the analytical and computational structures of optimization oracles central to the implementation of deep networks in machine learning software. We provide a systematic way to compute estimates of the smoothness constants that govern the convergence behavior of first-order optimization algorithms used to train deep networks. A diverse set of example components and architectures arising in modern deep networks is interspersed throughout the exposition to illustrate the approach.
We present a robust aggregation approach that makes federated learning robust to settings where a fraction of the devices may be sending corrupted updates to the server. The proposed approach relies on a robust secure aggregation oracle based on the geometric median, which returns a robust aggregate using a constant number of calls to a regular non-robust secure average oracle. The robust aggregation oracle is privacy-preserving, similar to the secure average oracle it builds upon. We provide experimental results of the proposed approach with linear models and deep networks for two tasks in computer vision and natural language processing. The robust aggregation approach is agnostic to the level of corruption; it outperforms the classical aggregation approach in terms of robustness when the level of corruption is high, while being competitive in the regime of low corruption.
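One standard way to approximate a geometric median using only weighted averages is a smoothed Weiszfeld-type iteration, sketched below in Python. Whether this matches the exact procedure of the paper is an assumption. The function names, the smoothing parameter `nu`, and the iteration count are hypothetical, and `weighted_average` merely stands in for the secure average oracle without any actual security.

```python
import math


def weighted_average(points, weights):
    """Stand-in for the average oracle: weighted mean of the given vectors."""
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[k] for p, w in zip(points, weights)) / total
            for k in range(dim)]


def geometric_median(points, num_iters=100, nu=1e-6):
    """Approximate geometric median via smoothed Weiszfeld iterations.

    Each iteration reweights every point by the inverse of its distance to
    the current estimate (floored at nu) and makes one averaging call.
    """
    z = weighted_average(points, [1.0] * len(points))  # start from the mean
    for _ in range(num_iters):
        weights = [1.0 / max(nu, math.dist(z, p)) for p in points]
        z = weighted_average(points, weights)
    return z
```

With three updates at 0 and one corrupted update at 100, the plain mean is 25 while the iterate above moves close to 0, which illustrates the robustness of the geometric median to a minority of outliers.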