In experimental design, we are given $n$ vectors in $d$ dimensions, and our goal is to select $k\ll n$ of them to perform expensive measurements, e.g., to obtain labels/responses, for a linear regression task. Many statistical criteria have been proposed for choosing the optimal design, with popular choices including A- and D-optimality. If prior knowledge is given, typically in the form of a $d\times d$ precision matrix $\mathbf A$, then all of the criteria can be extended to incorporate that information via a Bayesian framework. In this paper, we demonstrate a new fundamental connection between Bayesian experimental design and determinantal point processes, the latter being widely used for sampling diverse subsets of data. We use this connection to develop new efficient algorithms for finding $(1+\epsilon)$-approximations of optimal designs under four optimality criteria: A, C, D and V. Our algorithms can achieve this when the desired subset size $k$ is $\Omega(\frac{d_{\mathbf A}}{\epsilon} + \frac{\log 1/\epsilon}{\epsilon^2})$, where $d_{\mathbf A}\leq d$ is the $\mathbf A$-effective dimension, which can often be much smaller than $d$. Our results offer direct improvements over a number of prior works, for both Bayesian and classical experimental design, in terms of algorithm efficiency, approximation quality, and range of applicable criteria.
We regard pre-trained residual networks (ResNets) as nonlinear systems and use linearization, a common method used in the qualitative analysis of nonlinear systems, to understand the behavior of the networks under small perturbations of the input images. We work with ResNet-56 and ResNet-110 trained on the CIFAR-10 data set. We linearize these networks at the level of residual units and network stages, and the singular value decomposition is used in the stability analysis of these components. It is found that most of the singular values of the linearizations of residual units are 1 and, in spite of the fact that the linearizations depend directly on the activation maps, the singular values differ only slightly for different input images. However, adjusting the scaling of the skip connection or the values of the weights in a residual unit has a significant impact on the singular value distributions. Inspection of how random and adversarial perturbations of input images propagate through the network reveals that there is a dramatic jump in the magnitude of adversarial perturbations towards the end of the final stage of the network that is not present in the case of random perturbations. We attempt to gain a better understanding of this phenomenon by projecting the perturbations onto singular vectors of the linearizations of the residual units.
In distributed optimization and distributed numerical linear algebra, we often encounter an inversion bias: if we want to compute a quantity that depends on the inverse of a sum of distributed matrices, then the sum of the inverses does not equal the inverse of the sum. An example of this occurs in distributed Newton's method, where we wish to compute (or implicitly work with) the inverse Hessian multiplied by the gradient. In this case, locally computed estimates are biased, and so taking a uniform average will not recover the correct solution. To address this, we propose determinantal averaging, a new approach for correcting the inversion bias. This approach involves reweighting the local estimates of the Newton's step proportionally to the determinant of the local Hessian estimate, and then averaging them together to obtain an improved global estimate. This method provides the first known distributed Newton step that is asymptotically consistent, i.e., it recovers the exact step in the limit as the number of distributed partitions grows to infinity. To show this, we develop new expectation identities and moment bounds for the determinant and adjugate of a random matrix. Determinantal averaging can be applied not only to Newton's method, but to computing any quantity that is a linear tranformation of a matrix inverse, e.g., taking a trace of the inverse covariance matrix, which is used in data uncertainty quantification.
In addition to providing high-profile successes in computer vision and natural language processing, neural networks also provide an emerging set of techniques for scientific problems. Such data-driven models, however, typically ignore physical insights from the scientific system under consideration. Among other things, a physics-informed model formulation should encode some degree of stability or robustness or well-conditioning (in that a small change of the input will not lead to drastic changes in the output), characteristic of the underlying scientific problem. We investigate whether it is possible to include physics-informed prior knowledge for improving the model quality (e.g., generalization performance, sensitivity to parameter tuning, or robustness in the presence of noisy data). To that extent, we focus on the stability of an equilibrium, one of the most basic properties a dynamic system can have, via the lens of Lyapunov analysis. For the prototypical problem of fluid flow prediction, we show that models preserving Lyapunov stability improve the generalization error and reduce the prediction uncertainty.
It has been demonstrated that very simple attacks can fool highly-sophisticated neural network architectures. In particular, so-called adversarial examples, constructed from perturbations of input data that are small or imperceptible to humans but lead to different predictions, may lead to an enormous risk in certain critical applications. In light of this, there has been a great deal of work on developing adversarial training strategies to improve model robustness. These training strategies are very expensive, in both human and computational time. To complement these approaches, we propose a very simple and inexpensive strategy which can be used to ``retrofit'' a previously-trained network to improve its resilience to adversarial attacks. More concretely, we propose a new activation function---the JumpReLU---which, when used in place of a ReLU in an already-trained model, leads to a trade-off between predictive accuracy and robustness. This trade-off is controlled by the jump size, a hyper-parameter which can be tuned during the validation stage. Our empirical results demonstrate that this increases model robustness, protecting against adversarial attacks with substantially increased levels of perturbations. This is accomplished simply by retrofitting existing networks with our JumpReLU activation function, without the need for retraining the model. Additionally, we demonstrate that adversarially trained (robust) models can greatly benefit from retrofitting.
Motivated by recent developments in serverless systems for large-scale machine learning as well as improvements in scalable randomized matrix algorithms, we develop OverSketched Newton, a randomized Hessian-based optimization algorithm to solve large-scale smooth and strongly-convex problems in serverless systems. OverSketched Newton leverages matrix sketching ideas from Randomized Numerical Linear Algebra to compute the Hessian approximately. These sketching methods lead to inbuilt resiliency against stragglers that are a characteristic of serverless architectures. We establish that OverSketched Newton has a linear-quadratic convergence rate, and we empirically validate our results by solving large-scale supervised learning problems on real-world datasets. Experiments demonstrate a reduction of ~50% in total running time on AWS Lambda, compared to state-of-the-art distributed optimization schemes.
In stochastic optimization, large batch training can leverage parallel resources to produce faster wall-clock training times per epoch. However, for both training loss and testing error, recent results analyzing large batch Stochastic Gradient Descent (SGD) have found sharp diminishing returns beyond a certain critical batch size. In the hopes of addressing this, the Kronecker-Factored Approximate Curvature (\mbox{K-FAC}) method has been hypothesized to allow for greater scalability to large batch sizes for non-convex machine learning problems, as well as greater robustness to variation in hyperparameters. Here, we perform a detailed empirical analysis of these two hypotheses, evaluating performance in terms of both wall-clock time and aggregate computational cost. Our main results are twofold: first, we find that \mbox{K-FAC} does not exhibit improved large-batch scalability behavior, as compared to SGD; and second, we find that \mbox{K-FAC}, in addition to requiring more hyperparameters to tune, suffers from the same hyperparameter sensitivity patterns as SGD. We discuss extensive results using residual networks on \mbox{CIFAR-10}, as well as more general implications of our findings.
In many applications, it is important to reconstruct a fluid flow field, or some other high-dimensional state, from limited measurements and limited data. In this work, we propose a shallow neural network-based learning methodology for such fluid flow reconstruction. Our approach learns an end-to-end mapping between the sensor measurements and the high-dimensional fluid flow field, without any heavy preprocessing on the raw data. No prior knowledge is assumed to be available, and the estimation method is purely data-driven. We demonstrate the performance on three examples in fluid mechanics and oceanography, showing that this modern data-driven approach outperforms traditional modal approximation techniques which are commonly used for flow reconstruction. Not only does the proposed method show superior performance characteristics, it can also produce a comparable level of performance with traditional methods in the area, using significantly fewer sensors. Thus, the mathematical architecture is ideal for emerging global monitoring technologies where measurement data are often limited.
In experimental design, we are given a large collection of vectors, each with a hidden response value that we assume derives from an underlying linear model, and we wish to pick a small subset of the vectors such that querying the corresponding responses will lead to a good estimator of the model. A classical approach in statistics is to assume the responses are linear, plus zero-mean i.i.d. Gaussian noise, in which case the goal is to provide an unbiased estimator with smallest mean squared error (A-optimal design). A related approach, more common in computer science, is to assume the responses are arbitrary but fixed, in which case the goal is to estimate the least squares solution using few responses, as quickly as possible, for worst-case inputs. Despite many attempts, characterizing the relationship between these two approaches has proven elusive. We address this by proposing a framework for experimental design where the responses are produced by an arbitrary unknown distribution. We show that there is an efficient randomized experimental design procedure that achieves strong variance bounds for an unbiased estimator using few responses in this general model. Nearly tight bounds for the classical A-optimality criterion, as well as improved bounds for worst-case responses, emerge as special cases of this result. In the process, we develop a new algorithm for a joint sampling distribution called volume sampling, and we propose a new i.i.d. importance sampling method: inverse score sampling. A key novelty of our analysis is in developing new expected error bounds for worst-case regression by controlling the tail behavior of i.i.d. sampling via the jointness of volume sampling. Our result motivates a new minimax-optimality criterion for experimental design which can be viewed as an extension of both A-optimal design and sampling for worst-case regression.
Given two or more Deep Neural Networks (DNNs) with the same or similar architectures, and trained on the same dataset, but trained with different solvers, parameters, hyper-parameters, regularization, etc., can we predict which DNN will have the best test accuracy, and can we do so without peeking at the test data? In this paper, we show how to use a new Theory of Heavy-Tailed Self-Regularization (HT-SR) to answer this. HT-SR suggests, among other things, that modern DNNs exhibit what we call Heavy-Tailed Mechanistic Universality (HT-MU), meaning that the correlations in the layer weight matrices can be fit to a power law with exponents that lie in common Universality classes from Heavy-Tailed Random Matrix Theory (HT-RMT). From this, we develop a Universal capacity control metric that is a weighted average of these PL exponents. Rather than considering small toy NNs, we examine over 50 different, large-scale pre-trained DNNs, ranging over 15 different architectures, trained on ImagetNet, each of which has been reported to have different test accuracies. We show that this new capacity metric correlates very well with the reported test accuracies of these DNNs, looking across each architecture (VGG16/.../VGG19, ResNet10/.../ResNet152, etc.). We also show how to approximate the metric by the more familiar Product Norm capacity measure, as the average of the log Frobenius norm of the layer weight matrices. Our approach requires no changes to the underlying DNN or its loss function, it does not require us to train a model (although it could be used to monitor training), and it does not even require access to the ImageNet data.