We ask whether there exists a function or measure that (1) minimizes a given convex functional or risk and (2) satisfies a symmetry property specified by an amenable group of transformations. Examples of such symmetry properties are invariance, equivariance, and quasi-invariance. Our results draw on old ideas of Stein and Le Cam and on approximate group averages that appear in ergodic theorems for amenable groups. A class of convex sets, known in convex analysis as orbitopes, emerges as crucial, and we establish properties of such orbitopes in nonparametric settings. We also show how a simple device called a cocycle can be used to reduce different forms of symmetry to a single problem. As applications, we obtain results on invariant kernel mean embeddings and a Monge-Kantorovich theorem on the optimality of transport plans under symmetry constraints. We also explain connections to the Hunt-Stein theorem on invariant tests.
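To make the basic device concrete, the following is a minimal sketch of symmetrization by group averaging: averaging a kernel over a finite group of transformations yields an invariant kernel, the elementary finite-group analogue of the approximate averages used above for amenable groups. The Gaussian kernel and the sign-flip group acting on $\mathbb{R}^2$ are illustrative choices, not the paper's setting.

```python
# Sketch: group averaging makes a kernel invariant under a finite group G.
# The group (coordinate sign flips of R^2) and the Gaussian kernel are
# illustrative assumptions, not the paper's setup.
import numpy as np

def gauss_kernel(x, y, bandwidth=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * bandwidth ** 2))

# A finite group acting on R^2: sign flips of each coordinate.
group = [np.diag(s) for s in [(1, 1), (1, -1), (-1, 1), (-1, -1)]]

def averaged_kernel(x, y):
    # k_G(x, y) = (1/|G|^2) sum_{g,h in G} k(g x, h y), invariant in both arguments.
    return np.mean([gauss_kernel(g @ x, h @ y) for g in group for h in group])

x, y = np.array([0.3, -1.2]), np.array([0.7, 0.5])
g = group[1]
assert np.isclose(averaged_kernel(x, y), averaged_kernel(g @ x, y))  # invariance
```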
Crystallographic groups describe the symmetries of crystals and other repetitive structures encountered in nature and the sciences. These groups include the wallpaper and space groups. We derive linear and nonlinear representations of functions that are (1) smooth and (2) invariant under such a group. The linear representation generalizes the Fourier basis to crystallographically invariant basis functions. We show that such a basis exists for each crystallographic group and that it is orthonormal in the relevant $L_2$ space, and we recover the standard Fourier basis as a special case for pure shift groups. The nonlinear representation embeds the orbit space of the group into a finite-dimensional Euclidean space. We show that such an embedding exists for every crystallographic group, and that it factors functions through a generalization of a manifold called an orbifold. We describe algorithms that, given a standardized description of the group, compute the Fourier basis and an embedding map. As examples, we construct crystallographically invariant neural networks, kernel machines, and Gaussian processes.
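As an illustration of the linear representation for one concrete group, the sketch below symmetrizes plane waves over the point group of the wallpaper group p4 (integer translations plus 90-degree rotations about the origin); the resulting functions on the unit torus are p4-invariant. This is only the simplest instance of the construction; the algorithms referred to above handle arbitrary crystallographic groups.

```python
# Sketch: a p4-invariant basis function obtained by symmetrizing the plane
# wave exp(2*pi*i k.x) over the four rotations of the p4 point group.
import numpy as np

R = np.array([[0, -1], [1, 0]])  # 90-degree rotation, the p4 point-group generator
point_group = [np.linalg.matrix_power(R, j) for j in range(4)]

def invariant_mode(k):
    """Return a p4-invariant function for an integer frequency vector k."""
    k = np.asarray(k)
    def phi(x):
        x = np.asarray(x)
        return np.mean([np.exp(2j * np.pi * (g @ k) @ x) for g in point_group])
    return phi

phi = invariant_mode([2, 1])
x = np.array([0.37, 0.81])
# Invariance under the group action: rotation about the origin and unit shifts.
assert np.isclose(phi(x), phi(R @ x))
assert np.isclose(phi(x), phi(x + np.array([1, 0])))
```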
We provide results that exactly quantify how data augmentation affects the convergence rate and variance of estimates. They lead to some unexpected findings: contrary to common intuition, data augmentation may increase rather than decrease the uncertainty of estimates, such as the empirical prediction risk. Our main theoretical tool is a limit theorem for functions of randomly transformed, high-dimensional random vectors. The proof draws on work in probability on the noise stability of functions of many variables. The pathological behavior we identify is not a consequence of complex models, but can occur even in the simplest settings -- one of our examples is a linear ridge regressor with two parameters. On the other hand, our results also show that data augmentation can have real, quantifiable benefits.
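The following sketch indicates the kind of Monte Carlo comparison the quantitative claims are about: the variance of the empirical prediction risk of a two-parameter ridge regressor, with and without augmenting each sample by a random transformation. The data distribution, the sign-flip augmentation, and all parameter values are illustrative choices, not the paper's example.

```python
# Sketch: compare the Monte Carlo variance of the empirical prediction risk
# of a two-parameter ridge regressor with and without data augmentation.
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def empirical_risk(beta, X, y):
    return np.mean((X @ beta - y) ** 2)

def experiment(augment, n=50, reps=2000):
    risks = []
    for _ in range(reps):
        X = rng.normal(size=(n, 2))
        y = X @ np.array([1.0, -0.5]) + rng.normal(scale=0.5, size=n)
        if augment:  # append a transformed copy of every sample
            Xa = X * np.array([-1.0, 1.0])  # flip the first coordinate
            X, y = np.vstack([X, Xa]), np.concatenate([y, y])
        beta = ridge_fit(X, y)
        risks.append(empirical_risk(beta, X, y))
    return np.var(risks)

print("variance without augmentation:", experiment(False))
print("variance with augmentation:   ", experiment(True))
```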
We introduce a class of generative network models that insert edges by connecting the starting and terminal vertices of a random walk on the network graph. Within the taxonomy of statistical network models, this class is distinguished by permitting the location of a new edge to depend explicitly on the structure of the graph, while remaining statistically and computationally tractable. In the limit of infinite walk length, the model converges to an extension of the preferential attachment model---in this sense, the model can alternatively be motivated by asking what preferential attachment is an approximation to. Theoretical properties, including the limiting degree sequence, are studied analytically. If the entire history of the graph is observed, parameters can be estimated by maximum likelihood. If only the final graph is available, its history can be imputed using MCMC. We develop a class of sequential Monte Carlo algorithms that are more generally applicable to sequential network models, and may be of interest in their own right. The model parameters can be recovered from a single graph generated by the model. Applications to data clarify the role of the random walk length as a length scale of interactions within the graph.
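A minimal sketch of the generative mechanism, in simplified form: each step starts a random walk at a uniformly chosen vertex, runs it for a fixed number of steps, and either connects the start and terminal vertices by a new edge or attaches a new vertex at the walk's endpoint. The initialization and the rule for introducing new vertices are placeholders, not the paper's exact specification.

```python
# Sketch: grow a graph by closing random walks into edges. The vertex-arrival
# rule (probability p_new) is a simplified placeholder assumption.
import random

def random_walk_graph(n_steps, walk_length=3, p_new=0.5, seed=0):
    random.seed(seed)
    adj = {0: [1], 1: [0]}                # start from a single edge
    for _ in range(n_steps):
        v = random.choice(list(adj))      # start vertex, chosen uniformly
        w = v
        for _ in range(walk_length):      # T-step simple random walk
            w = random.choice(adj[w])
        if random.random() < p_new:       # introduce a new vertex...
            u = len(adj)
            adj[u] = [w]                  # ...attached at the walk's endpoint
            adj[w].append(u)
        else:                             # ...or connect start and terminus
            adj[v].append(w)
            adj[w].append(v)
    return adj

graph = random_walk_graph(n_steps=1000, walk_length=3)
degrees = sorted((len(nbrs) for nbrs in graph.values()), reverse=True)
print("top degrees:", degrees[:10])
```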
Empirical risk minimization is the principal tool for prediction problems, but its extension to relational data remains unsolved. We solve this problem using recent advances in graph sampling theory. We (i) define an empirical risk for relational data and (ii) obtain stochastic gradients for this risk that are automatically unbiased. The key ingredient is to treat the method by which data is sampled from a graph as an explicit component of model design. Theoretical results establish that the choice of sampling scheme is critical. By integrating fast implementations of graph sampling schemes with standard automatic differentiation tools, we are able to solve the risk minimization problem in a plug-and-play fashion even on large datasets. We demonstrate empirically that relational ERM models achieve state-of-the-art results on semi-supervised node classification tasks. The experiments also confirm the importance of the choice of sampling scheme.
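The recipe can be sketched in a few lines: the minibatch is a random subgraph (here, the subgraph induced by k uniformly sampled vertices), the loss is a logistic link-prediction loss over all pairs in the sample, and the resulting stochastic gradients drive SGD on per-vertex embeddings. The embedding model, the sampling scheme, and the synthetic graph are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch: relational ERM with subgraphs as minibatches. Each stochastic
# gradient is computed on the subgraph induced by k uniform random vertices.
import numpy as np

rng = np.random.default_rng(0)
n, d, k, lr = 100, 8, 20, 0.1
A = (rng.random((n, n)) < 0.05).astype(float)   # stand-in adjacency matrix
A = np.triu(A, 1) + np.triu(A, 1).T
Z = 0.1 * rng.normal(size=(n, d))               # vertex embeddings (parameters)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

for step in range(2000):
    S = rng.choice(n, size=k, replace=False)    # sampling scheme: uniform vertices
    grad = np.zeros_like(Z)
    for i in S:                                 # logistic loss on sampled pairs
        for j in S:
            if i < j:
                p = sigmoid(Z[i] @ Z[j])
                grad[i] += (p - A[i, j]) * Z[j]
                grad[j] += (p - A[i, j]) * Z[i]
    Z -= lr * grad / (k * (k - 1) / 2)          # stochastic gradient step

p = sigmoid(Z @ Z.T)
loss = -np.mean(A * np.log(p + 1e-9) + (1 - A) * np.log(1 - p + 1e-9))
print("full-graph logistic loss:", loss)
```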
Modern neural networks are highly overparameterized, with the capacity to substantially overfit to training data. Nevertheless, these networks often generalize well in practice. It has also been observed that trained networks can often be "compressed" to much smaller representations. The purpose of this paper is to connect these two empirical observations. Our main technical result is a generalization bound for compressed networks based on the compressed size. Combined with off-the-shelf compression algorithms, the bound leads to state-of-the-art generalization guarantees; in particular, we provide the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem. As additional evidence connecting compression and generalization, we show that the compressibility of models that tend to overfit is limited: we establish an absolute limit on expected compressibility as a function of expected generalization error, where the expectations are over the random choice of training examples. The bounds are complemented by empirical results showing that an increase in overfitting implies an increase in the number of bits required to describe a trained network.
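To indicate the flavor of such bounds: if a trained network can be described by c bits, treating the set of all c-bit descriptions as a finite hypothesis class yields an Occam-style high-probability bound on the generalization gap. The simplified bound below is in the spirit of the result described above, not its exact statement.

```python
# Sketch: an Occam-style bound for a hypothesis described by c bits. This is
# the textbook finite-class bound, a simplification of the paper's result.
import math

def occam_bound(compressed_bits, n_train, delta=0.05):
    """High-probability bound on the generalization gap of a c-bit hypothesis."""
    return math.sqrt((compressed_bits * math.log(2) + math.log(1 / delta))
                     / (2 * n_train))

# E.g., a network compressed to 100 KB, trained on an ImageNet-sized set:
print(occam_bound(compressed_bits=100 * 1024 * 8, n_train=1_281_167))
```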
The natural habitat of most Bayesian methods is data represented by exchangeable sequences of observations, for which de Finetti's theorem provides the theoretical foundation. Dirichlet process clustering, Gaussian process regression, and many other parametric and nonparametric Bayesian models fall within the remit of this framework; many problems arising in modern data analysis, however, do not. This article provides an introduction to Bayesian models of graphs, matrices, and other data that can be modeled by exchangeable random structures. We describe results in probability theory that generalize de Finetti's theorem to such data and discuss their relevance to nonparametric Bayesian modeling. With the basic ideas in place, we survey example models available in the literature; applications of such models include collaborative filtering, link prediction, and graph and network analysis. We also highlight connections to recent developments in graph theory and probability, and sketch the more general mathematical foundation of Bayesian methods for other types of data beyond sequences and arrays.
* IEEE Transactions on Pattern Analysis and Machine Intelligence 2015, Vol. 37, No. 2, pp. 437-461
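For orientation, the survey's point of departure can be stated precisely: de Finetti's theorem says that an infinite sequence $X_1, X_2, \dots$ of random variables (on a standard Borel space) is exchangeable if and only if there is a random probability measure $\Theta$ with distribution $\nu$ such that
\[
\mathbb{P}(X_1 \in A_1, \dots, X_n \in A_n)
  = \int \prod_{i=1}^{n} \theta(A_i)\, \nu(d\theta)
  \qquad \text{for all } n,
\]
that is, the observations are conditionally i.i.d. given $\Theta$.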
A pivotal problem in Bayesian nonparametrics is the construction of prior distributions on the space M(V) of probability measures on a given domain V. In principle, such distributions on the infinite-dimensional space M(V) can be constructed from their finite-dimensional marginals---the most prominent example being the construction of the Dirichlet process from finite-dimensional Dirichlet distributions. This approach is both intuitive and applicable to the construction of arbitrary distributions on M(V), but also hamstrung by a number of technical difficulties. We show how these difficulties can be resolved if the domain V is a Polish topological space, and give a representation theorem directly applicable to the construction of any probability distribution on M(V) whose first moment measure is well-defined. The proof draws on a projective limit theorem of Bochner, and on properties of set functions on Polish spaces to establish countable additivity of the resulting random probabilities.
* Electronic Journal of Statistics 2011, Vol. 5, pp. 1354-1373
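To make the prototypical example explicit: the Dirichlet process with concentration parameter $\alpha > 0$ and base measure $H$ is the random probability measure $G$ determined by its finite-dimensional Dirichlet marginals, namely
\[
(G(A_1), \dots, G(A_k)) \sim \mathrm{Dirichlet}\bigl(\alpha H(A_1), \dots, \alpha H(A_k)\bigr)
\]
for every measurable partition $A_1, \dots, A_k$ of $V$.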
We characterize conjugate nonparametric Bayesian models as projective limits of conjugate, finite-dimensional Bayesian models. In particular, we identify a large class of nonparametric models representable as infinite-dimensional analogues of exponential family distributions and their canonical conjugate priors. This class contains most models studied in the literature, including Dirichlet processes and Gaussian process regression models. To derive these results, we introduce a representation of infinite-dimensional Bayesian models by projective limits of regular conditional probabilities. We show under which conditions the nonparametric model itself, its sufficient statistics, and -- if they exist -- conjugate updates of the posterior are projective limits of their respective finite-dimensional counterparts. We illustrate our results both by application to existing nonparametric models and by construction of a model on infinite permutations.
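For concreteness, the finite-dimensional building blocks referenced above are exponential family likelihoods with their canonical conjugate priors, for which posterior updating reduces to a translation of hyperparameters:
\[
p(x \mid \theta) = h(x) \exp\bigl( \langle \theta, S(x) \rangle - A(\theta) \bigr),
\qquad
p(\theta \mid \lambda, \nu) \propto \exp\bigl( \langle \theta, \lambda \rangle - \nu A(\theta) \bigr),
\]
so that observing $x$ updates $(\lambda, \nu) \mapsto (\lambda + S(x),\, \nu + 1)$.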