Adaptive gradient methods are the method of choice for optimization in machine learning and used to train the largest deep models. In this paper we study the problem of learning a local preconditioner, that can change as the data is changing along the optimization trajectory. We propose an adaptive gradient method that has provable adaptive regret guarantees vs. the best local preconditioner. To derive this guarantee, we prove a new adaptive regret bound in online learning that improves upon previous adaptive online learning methods. We demonstrate the robustness of our method in automatically choosing the optimal learning rate schedule for popular benchmarking tasks in vision and language domains. Without the need to manually tune a learning rate schedule, our method can, in a single run, achieve comparable and stable task accuracy as a fine-tuned optimizer.
Stochastic Gradient Descent (SGD) is among the simplest and most popular methods in optimization. The convergence rate for SGD has been extensively studied and tight analyses have been established for the running average scheme, but the sub-optimality of the final iterate is still not well-understood. shamir2013stochastic gave the best known upper bound for the final iterate of SGD minimizing non-smooth convex functions, which is $O(\log T/\sqrt{T})$ for Lipschitz convex functions and $O(\log T/ T)$ with additional assumption on strongly convexity. The best known lower bounds, however, are worse than the upper bounds by a factor of $\log T$. harvey2019tight gave matching lower bounds but their construction requires dimension $d= T$. It was then asked by koren2020open how to characterize the final-iterate convergence of SGD in the constant dimension setting. In this paper, we answer this question in the more general setting for any $d\leq T$, proving $\Omega(\log d/\sqrt{T})$ and $\Omega(\log d/T)$ lower bounds for the sub-optimality of the final iterate of SGD in minimizing non-smooth Lipschitz convex and strongly convex functions respectively with standard step size schedules. Our results provide the first general dimension dependent lower bound on the convergence of SGD's final iterate, partially resolving a COLT open question raised by koren2020open. We also present further evidence to show the correct rate in one dimension should be $\Theta(1/\sqrt{T})$, such as a proof of a tight $O(1/\sqrt{T})$ upper bound for one-dimensional special cases in settings more general than koren2020open.
We consider the lower bounds of differentially private empirical risk minimization for general convex functions in this paper. For convex generalized linear models (GLMs), the well-known tight bound of DP-ERM in the constrained case is $\tilde{\Theta}(\frac{\sqrt{p}}{\epsilon n})$, while recently, \cite{sstt21} find the tight bound of DP-ERM in the unconstrained case is $\tilde{\Theta}(\frac{\sqrt{\text{rank}}}{\epsilon n})$ where $p$ is the dimension, $n$ is the sample size and $\text{rank}$ is the rank of the feature matrix of the GLM objective function. As $\text{rank}\leq \min\{n,p\}$, a natural and important question arises that whether we can evade the curse of dimensionality for over-parameterized models where $n\ll p$, for more general convex functions beyond GLM. We answer this question negatively by giving the first and tight lower bound of unconstrained private ERM for the general convex function, matching the current upper bound $\tilde{O}(\frac{\sqrt{p}}{n\epsilon})$ for unconstrained private ERM. We also give an $\Omega(\frac{p}{n\epsilon})$ lower bound for unconstrained pure-DP ERM which recovers the result in the constrained case.
It is well-known that standard neural networks, even with a high classification accuracy, are vulnerable to small $\ell_\infty$-norm bounded adversarial perturbations. Although many attempts have been made, most previous works either can only provide empirical verification of the defense to a particular attack method, or can only develop a certified guarantee of the model robustness in limited scenarios. In this paper, we seek for a new approach to develop a theoretically principled neural network that inherently resists $\ell_\infty$ perturbations. In particular, we design a novel neuron that uses $\ell_\infty$-distance as its basic operation (which we call $\ell_\infty$-dist neuron), and show that any neural network constructed with $\ell_\infty$-dist neurons (called $\ell_{\infty}$-dist net) is naturally a 1-Lipschitz function with respect to $\ell_\infty$-norm. This directly provides a rigorous guarantee of the certified robustness based on the margin of prediction outputs. We also prove that such networks have enough expressive power to approximate any 1-Lipschitz function with robust generalization guarantee. Our experimental results show that the proposed network is promising. Using $\ell_{\infty}$-dist nets as the basic building blocks, we consistently achieve state-of-the-art performance on commonly used datasets: 93.09% certified accuracy on MNIST ($\epsilon=0.3$), 79.23% on Fashion MNIST ($\epsilon=0.1$) and 35.10% on CIFAR-10 ($\epsilon=8/255$).
In this note we prove a sharp lower bound on the necessary number of nestings of nested absolute-value functions of generalized hinging hyperplanes (GHH) to represent arbitrary CPWL functions. Previous upper bound states that $n+1$ nestings is sufficient for GHH to achieve universal representation power, but the corresponding lower bound was unknown. We prove that $n$ nestings is necessary for universal representation power, which provides an almost tight lower bound. We also show that one-hidden-layer neural networks don't have universal approximation power over the whole domain. The analysis is based on a key lemma showing that any finite sum of periodic functions is either non-integrable or the zero function, which might be of independent interest.
Leveraging algorithmic stability to derive sharp generalization bounds is a classic and powerful approach in learning theory. Since Vapnik and Chervonenkis [1974] first formalized the idea for analyzing SVMs, it has been utilized to study many fundamental learning algorithms (e.g., $k$-nearest neighbors [Rogers and Wagner, 1978], stochastic gradient method [Hardt et al., 2016], linear regression [Maurer, 2017], etc). In a recent line of great works by Feldman and Vondrak [2018, 2019] as well as Bousquet et al. [2020b], they prove a high probability generalization upper bound of order $\tilde{\mathcal{O}}(\gamma +\frac{L}{\sqrt{n}})$ for any uniformly $\gamma$-stable algorithm and $L$-bounded loss function. Although much progress was achieved in proving generalization upper bounds for stable algorithms, our knowledge of lower bounds is rather limited. In fact, there is no nontrivial lower bound known ever since the study of uniform stability [Bousquet and Elisseeff, 2002], to the best of our knowledge. In this paper we fill the gap by proving a tight generalization lower bound of order $\Omega(\gamma+\frac{L}{\sqrt{n}})$, which matches the best known upper bound up to logarithmic factors
We prove a Johns theorem for simplices in $R^d$ with positive dilation factor $d+2$, which improves the previously known $d^2$ upper bound. This bound is tight in view of the $d$ lower bound. Furthermore, we give an example that $d$ isn't the optimal lower bound when $d=2$. Our results answered both questions regarding Johns theorem for simplices with positive dilation raised by \cite{leme2020costly}.
We propose a framework of boosting for learning and control in environments that maintain a state. Leveraging methods for online learning with memory and for online boosting, we design an efficient online algorithm that can provably improve the accuracy of weak-learners in stateful environments. As a consequence, we give efficient boosting algorithms for both prediction and the control of dynamical systems. Empirical evaluation on simulated and real data for both control and prediction supports our theoretical findings.
The expressive power of neural networks is important for understanding deep learning. Most existing works consider this problem from the view of the depth of a network. In this paper, we study how width affects the expressiveness of neural networks. Classical results state that depth-bounded (e.g. depth-$2$) networks with suitable activation functions are universal approximators. We show a universal approximation theorem for width-bounded ReLU networks: width-$(n+4)$ ReLU networks, where $n$ is the input dimension, are universal approximators. Moreover, except for a measure zero set, all functions cannot be approximated by width-$n$ ReLU networks, which exhibits a phase transition. Several recent works demonstrate the benefits of depth by proving the depth-efficiency of neural networks. That is, there are classes of deep networks which cannot be realized by any shallow network whose size is no more than an exponential bound. Here we pose the dual question on the width-efficiency of ReLU networks: Are there wide networks that cannot be realized by narrow networks whose size is not substantially larger? We show that there exist classes of wide networks which cannot be realized by any narrow network whose depth is no more than a polynomial bound. On the other hand, we demonstrate by extensive experiments that narrow networks whose size exceed the polynomial bound by a constant factor can approximate wide and shallow network with high accuracy. Our results provide more comprehensive evidence that depth is more effective than width for the expressiveness of ReLU networks.