Stochastic variance reduced methods have gained a lot of interest recently for empirical risk minimization due to their appealing run time complexity. When the data size is large and disjointly stored on different machines, it becomes imperative to distribute the implementation of such variance reduced methods. In this paper, we consider a general framework that directly distributes popular stochastic variance reduced methods, by assigning outer loops to the parameter server, and inner loops to worker machines. This framework is natural as it does not require sampling extra data and is easy to implement, but its theoretical convergence is not well understood. We obtain a unified understanding of the convergence for algorithms under this framework by measuring the smoothness of the discrepancy between the local and global loss functions. We establish the linear convergence of distributed versions of a family of stochastic variance reduced algorithms, including those using accelerated and recursive gradient updates, for minimizing strongly convex losses. Our theory captures how the convergence of distributed algorithms behaves as the number of machines and the size of local data vary. Furthermore, we show that when the smoothness discrepancy between local and global loss functions is large, regularization can be used to ensure convergence. Our analysis can be further extended to handle nonsmooth and nonconvex loss functions.
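The division of labor described above (outer loops on the parameter server, inner loops on workers that sample only their own data) can be sketched on a toy problem. This is a minimal illustration, not the paper's algorithm: the scalar least-squares loss, the noiseless data, the step size, the loop lengths, and the choice to average the workers' final iterates are all assumptions made for the example, and the names `worker_inner_loop` and `distributed_svrg` are hypothetical.

```python
import random

def sample_grad(w, a, b):
    """Gradient of the per-sample loss 0.5 * (a * w - b) ** 2."""
    return a * (a * w - b)

def worker_inner_loop(w_ref, full_grad, shard, step, inner_steps, rng):
    """Inner loop on one worker: SVRG-style steps using only the local shard."""
    w = w_ref
    for _ in range(inner_steps):
        a, b = rng.choice(shard)  # sample from local data only
        # Variance-reduced direction: local sample gradient, corrected by the
        # same sample's gradient at the reference point plus the global gradient.
        v = sample_grad(w, a, b) - sample_grad(w_ref, a, b) + full_grad
        w -= step * v
    return w

def distributed_svrg(shards, w0, step=0.05, inner_steps=10, rounds=50, seed=0):
    """Outer loop on the (simulated) parameter server."""
    rng = random.Random(seed)
    n_total = sum(len(shard) for shard in shards)
    w_ref = w0
    for _ in range(rounds):
        # Server: aggregate the global gradient at the reference point.
        full_grad = sum(sample_grad(w_ref, a, b)
                        for shard in shards for a, b in shard) / n_total
        # Workers: run inner loops independently (sequentially here).
        local_iterates = [
            worker_inner_loop(w_ref, full_grad, shard, step, inner_steps, rng)
            for shard in shards
        ]
        # Server: average the workers' outputs (an assumed aggregation rule).
        w_ref = sum(local_iterates) / len(local_iterates)
    return w_ref

# Toy data: 4 workers, 25 noiseless samples each, b = a * w_true.
data_rng = random.Random(1)
w_true = 3.0
shards = []
for _ in range(4):
    features = [data_rng.uniform(0.8, 1.2) for _ in range(25)]
    shards.append([(a, a * w_true) for a in features])

w_hat = distributed_svrg(shards, w0=0.0)
```

Because the toy data are noiseless, every local loss shares the same minimizer, so the sketch does not exercise the local/global smoothness discrepancy that the analysis is about; it only shows where each loop runs.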
This paper studies noisy low-rank matrix completion: given partial and corrupted entries of a large low-rank matrix, the goal is to estimate the underlying matrix faithfully and efficiently. Arguably one of the most popular paradigms to tackle this problem is convex relaxation, which achieves remarkable efficacy in practice. However, the theoretical support of this approach is still far from optimal in the noisy setting, falling short of explaining the empirical success. We make progress towards demystifying the practical efficacy of convex relaxation vis-\`a-vis random noise. When the rank of the unknown matrix is a constant, we demonstrate that the convex programming approach achieves near-optimal estimation errors --- in terms of the Euclidean loss, the entrywise loss, and the spectral norm loss --- for a wide range of noise levels. All of this is enabled by bridging convex relaxation with the nonconvex Burer-Monteiro approach, a seemingly distinct algorithmic paradigm that is provably robust against noise. More specifically, we show that an approximate critical point of the nonconvex formulation serves as an extremely tight approximation of the convex solution, allowing us to transfer the desired statistical guarantees of the nonconvex approach to its convex counterpart.
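For orientation, the two formulations being bridged are commonly written as follows. The notation is assumed here rather than taken from the abstract: $M$ is the noisy data matrix observed on the index set $\Omega$, $\lambda>0$ is a regularization parameter, and $r$ is the rank used in the factorization.

```latex
% Convex relaxation (nuclear-norm regularized least squares)
\min_{Z \in \mathbb{R}^{n \times n}} \;
  \frac{1}{2} \sum_{(i,j) \in \Omega} \big( Z_{ij} - M_{ij} \big)^{2}
  + \lambda \, \| Z \|_{*}

% Nonconvex Burer-Monteiro formulation with Z = X Y^{\top}
\min_{X, Y \in \mathbb{R}^{n \times r}} \;
  \frac{1}{2} \sum_{(i,j) \in \Omega} \big( (X Y^{\top})_{ij} - M_{ij} \big)^{2}
  + \frac{\lambda}{2} \big( \| X \|_{\mathrm{F}}^{2} + \| Y \|_{\mathrm{F}}^{2} \big)

% Standard identity linking the two regularizers
\| Z \|_{*} \;=\; \min_{X, Y \,:\, X Y^{\top} = Z} \;
  \frac{1}{2} \big( \| X \|_{\mathrm{F}}^{2} + \| Y \|_{\mathrm{F}}^{2} \big)
```

The last identity is what makes the two objectives comparable: the Frobenius-norm penalty on the factors is a variational form of the nuclear norm.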
Substantial progress has been made recently on developing provably accurate and efficient algorithms for low-rank matrix factorization via nonconvex optimization. While conventional wisdom often takes a dim view of nonconvex optimization algorithms due to their susceptibility to spurious local minima, simple iterative methods such as gradient descent have been remarkably successful in practice. The theoretical footings, however, had been largely lacking until recently. In this tutorial-style overview, we highlight the important role of statistical models in enabling efficient nonconvex optimization with performance guarantees. We review two contrasting approaches: (1) two-stage algorithms, which consist of a tailored initialization step followed by successive refinement; and (2) global landscape analysis and initialization-free algorithms. Several canonical matrix factorization problems are discussed, including but not limited to matrix sensing, phase retrieval, matrix completion, blind deconvolution, robust principal component analysis, phase synchronization, and joint alignment. Special care is taken to illustrate the key technical insights underlying their analyses. This article serves as a testament that the integrated thinking of optimization and statistics leads to fruitful research findings.
For many modern applications in science and engineering, data are collected in a streaming fashion carrying time-varying information, and practitioners need to process them with a limited amount of memory and computational resources in a timely manner for decision making. This is often coupled with the missing data problem, where only a small fraction of data attributes is observed. These complications impose significant, and unconventional, constraints on the problem of streaming Principal Component Analysis (PCA) and subspace tracking, which is an essential building block for many inference tasks in signal processing and machine learning. This survey article reviews a variety of classical and recent algorithms for solving this problem with low computational and memory complexities, particularly those applicable in the big data regime with missing data. We illustrate that streaming PCA and subspace tracking algorithms can be understood through algebraic and geometric perspectives, and they need to be adjusted carefully to handle missing data. Both asymptotic and non-asymptotic convergence guarantees are reviewed. Finally, we benchmark the performance of several competitive algorithms in the presence of missing data for both well-conditioned and ill-conditioned systems.
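As a concrete picture of a streaming update with constant memory, here is a minimal sketch of the classical rank-one Oja update, chosen for illustration as one well-known streaming PCA method. It assumes fully observed samples, so the missing-data adjustments discussed in the abstract are not shown. The synthetic data model, step size, and function names are assumptions made for the example.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def oja_update(w, x, step):
    """One rank-one Oja step: move w toward x * (x . w), then renormalize."""
    proj = sum(wi * xi for wi, xi in zip(w, x))
    return normalize([wi + step * proj * xi for wi, xi in zip(w, x)])

rng = random.Random(0)
dim, noise_std = 5, 0.05

# Hypothetical ground-truth direction and a random unit-norm starting estimate.
u = normalize([rng.gauss(0, 1) for _ in range(dim)])
w = normalize([rng.gauss(0, 1) for _ in range(dim)])

# Each sample is seen once and then discarded: x = s * u + small isotropic noise.
for _ in range(3000):
    s = rng.gauss(0, 1)
    x = [s * ui + noise_std * rng.gauss(0, 1) for ui in u]
    w = oja_update(w, x, step=0.05)

# Alignment with the true direction, up to sign.
alignment = abs(sum(wi * ui for wi, ui in zip(w, u)))
```

Only the current estimate `w` is stored between samples, which is the memory profile that motivates streaming methods.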
Low-rank modeling plays a pivotal role in signal processing and machine learning, with applications ranging from collaborative filtering, video surveillance, and medical imaging to dimensionality reduction and adaptive filtering. Many modern high-dimensional data and interactions thereof can be modeled as lying approximately in a low-dimensional subspace or manifold, possibly with additional structures, and properly exploiting these structures leads to significant cost reductions in sensing, computation and storage. In recent years, there has been a plethora of progress in understanding how to exploit low-rank structures using computationally efficient procedures in a provable manner, including both convex and nonconvex approaches. On one side, convex relaxations such as nuclear norm minimization often lead to statistically optimal procedures for estimating low-rank matrices, where first-order methods are developed to address the computational challenges; on the other side, there is emerging evidence that properly designed nonconvex procedures, such as projected gradient descent, often provide globally optimal solutions with a much lower computational cost in many problems. This survey article provides a unified overview of these recent advances on low-rank matrix estimation from incomplete measurements. Attention is paid to rigorous characterization of the performance of these algorithms, and to problems where the low-rank matrix has additional structural properties that require new algorithmic designs and theoretical analysis.
This paper considers the problem of solving systems of quadratic equations, namely, recovering an object of interest $\mathbf{x}^{\natural}\in\mathbb{R}^{n}$ from $m$ quadratic equations/samples $y_{i}=(\mathbf{a}_{i}^{\top}\mathbf{x}^{\natural})^{2}$, $1\leq i\leq m$. This problem, also dubbed phase retrieval, spans multiple domains including physical sciences and machine learning. We investigate the efficiency of gradient descent (or Wirtinger flow) designed for the nonconvex least squares problem. We prove that under Gaussian designs, gradient descent, when randomly initialized, yields an $\epsilon$-accurate solution in $O\big(\log n+\log(1/\epsilon)\big)$ iterations given nearly minimal samples, thus achieving near-optimal computational and sample complexities at once. This provides the first global convergence guarantee concerning vanilla gradient descent for phase retrieval, without the need for (i) carefully-designed initialization, (ii) sample splitting, or (iii) sophisticated saddle-point escaping schemes. All of these are achieved by exploiting the statistical models in analyzing optimization algorithms, via a leave-one-out approach that enables the decoupling of certain statistical dependency between the gradient descent iterates and the data.
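The procedure analyzed here can be sketched in a few lines. The sketch below assumes the standard least-squares objective $f(\mathbf{x})=\frac{1}{4m}\sum_{i}\big((\mathbf{a}_{i}^{\top}\mathbf{x})^{2}-y_{i}\big)^{2}$ and runs plain gradient descent from a random Gaussian starting point. The problem size, step size, iteration count, and function names are illustrative choices, not values taken from the paper.

```python
import math
import random

def phase_retrieval_gd(A, y, x0, step=0.05, iters=1000):
    """Vanilla gradient descent on f(x) = (1/4m) * sum_i ((a_i . x)^2 - y_i)^2."""
    m, n = len(A), len(x0)
    x = list(x0)
    for _ in range(iters):
        grad = [0.0] * n
        for a, yi in zip(A, y):
            inner = sum(aj * xj for aj, xj in zip(a, x))
            # Per-sample gradient is ((a . x)^2 - y_i) * (a . x) * a, averaged over m.
            coeff = (inner * inner - yi) * inner / m
            for j in range(n):
                grad[j] += coeff * a[j]
        x = [xj - step * gj for xj, gj in zip(x, grad)]
    return x

def dist_up_to_sign(x, x_true):
    """Euclidean distance modulo the global sign ambiguity x vs. -x."""
    plus = math.sqrt(sum((u - v) ** 2 for u, v in zip(x, x_true)))
    minus = math.sqrt(sum((u + v) ** 2 for u, v in zip(x, x_true)))
    return min(plus, minus)

rng = random.Random(0)
n, m = 5, 100  # small illustrative sizes with m = 20 n

# Unit-norm ground truth and Gaussian design vectors.
x_true = [rng.gauss(0, 1) for _ in range(n)]
scale = math.sqrt(sum(v * v for v in x_true))
x_true = [v / scale for v in x_true]
A = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
y = [sum(aj * xj for aj, xj in zip(a, x_true)) ** 2 for a in A]

# Random initialization: no spectral method, no sample splitting.
x0 = [rng.gauss(0, 1) / math.sqrt(n) for _ in range(n)]
x_hat = phase_retrieval_gd(A, y, x0)
error = dist_up_to_sign(x_hat, x_true)
```

Since the measurements are invariant to a global sign flip, the error is measured up to sign.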
We study the local geometry of a one-hidden-layer fully-connected neural network where the training samples are generated from a multi-neuron logistic regression model. We prove that under Gaussian input, the empirical risk function employing quadratic loss exhibits strong convexity and smoothness uniformly in a local neighborhood of the ground truth, for a class of smooth activation functions satisfying certain properties, including sigmoid and tanh, as soon as the sample complexity is sufficiently large. This implies that if initialized in this neighborhood, gradient descent converges linearly to a critical point that is provably close to the ground truth without requiring a fresh set of samples at each iteration. This significantly improves upon prior results on learning shallow neural networks with multiple neurons. To the best of our knowledge, this is the first global convergence guarantee for one-hidden-layer neural networks using gradient descent over the empirical risk function without resampling at the near-optimal sampling and computational complexity.
We consider the problem of recovering low-rank matrices from random rank-one measurements, which spans numerous applications including covariance sketching, phase retrieval, quantum state tomography, and learning shallow polynomial neural networks, among others. Our approach is to directly estimate the low-rank factor by minimizing a nonconvex quadratic loss function via vanilla gradient descent, following a tailored spectral initialization. When the true rank is small, this algorithm is guaranteed to converge to the ground truth (up to global ambiguity) with near-optimal sample complexity and computational complexity. To the best of our knowledge, this is the first guarantee that achieves near-optimality in both metrics. In particular, the key enabler of near-optimal computational guarantees is an implicit regularization phenomenon: without explicit regularization, both spectral initialization and the gradient descent iterates automatically stay within a region incoherent with the measurement vectors. This feature allows one to employ much more aggressive step sizes compared with the ones suggested in prior literature, without the need of sample splitting.
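One way to write out the loss and update described above is as follows. The notation is assumed for illustration: the rank-one measurements are $y_i = \mathbf{a}_i^{\top} \mathbf{M} \mathbf{a}_i$ with $\mathbf{M} = \mathbf{X}^{\natural}\mathbf{X}^{\natural\top}$, $\mathbf{X} \in \mathbb{R}^{n \times r}$ is the low-rank factor, $\mu$ is the step size, and $\mathbf{X}_0$ comes from the spectral initialization, whose details are not reproduced here.

```latex
% Nonconvex quadratic loss over the low-rank factor
f(\mathbf{X}) = \frac{1}{4m} \sum_{i=1}^{m}
  \Big( \big\| \mathbf{X}^{\top} \mathbf{a}_i \big\|_2^{2} - y_i \Big)^{2}

% Its gradient
\nabla f(\mathbf{X}) = \frac{1}{m} \sum_{i=1}^{m}
  \Big( \big\| \mathbf{X}^{\top} \mathbf{a}_i \big\|_2^{2} - y_i \Big)
  \, \mathbf{a}_i \mathbf{a}_i^{\top} \mathbf{X}

% Vanilla gradient descent from the spectral initialization X_0
\mathbf{X}_{t+1} = \mathbf{X}_t - \mu \, \nabla f(\mathbf{X}_t),
  \qquad t = 0, 1, 2, \ldots
```

The "global ambiguity" mentioned in the abstract shows up here as invariance under right rotation: replacing $\mathbf{X}$ by $\mathbf{X}\mathbf{Q}$ for any orthonormal $\mathbf{Q} \in \mathbb{R}^{r \times r}$ leaves $f$ unchanged.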
Low-rank matrix completion (MC) has achieved great success in many real-world data applications. A latent feature model formulation is usually employed and, to improve prediction performance, the similarities between latent variables can be exploited by pairwise learning, e.g., the graph regularized matrix factorization (GRMF) method. However, existing GRMF approaches often use a squared L2 norm to measure the pairwise difference, which may be overly influenced by dissimilar pairs and lead to inferior prediction. To fully empower pairwise learning for matrix completion, we propose a general optimization framework that allows a rich class of (non-)convex pairwise penalty functions. A new and efficient algorithm is further developed to uniformly solve the optimization problem, with a theoretical convergence guarantee. In an important situation where the latent variables form a small number of subgroups, its statistical guarantee is also fully characterized. In particular, we theoretically characterize the complexity-regularized maximum likelihood estimator, as a special case of our framework. It has a better error bound when compared to the standard trace-norm regularized matrix completion. We conduct extensive experiments on both synthetic and real datasets to demonstrate the superior performance of this general framework.
Recent years have seen a flurry of activity in designing provably efficient nonconvex procedures for solving statistical estimation problems. Due to the highly nonconvex nature of the empirical loss, state-of-the-art procedures often require proper regularization (e.g. trimming, regularized cost, projection) in order to guarantee fast convergence. For vanilla procedures such as gradient descent, however, prior theory either recommends highly conservative learning rates to avoid overshooting, or completely lacks performance guarantees. This paper uncovers a striking phenomenon in nonconvex optimization: even in the absence of explicit regularization, gradient descent enforces proper regularization implicitly under various statistical models. In fact, gradient descent follows a trajectory staying within a basin that enjoys nice geometry, consisting of points incoherent with the sampling mechanism. This "implicit regularization" feature allows gradient descent to proceed in a far more aggressive fashion without overshooting, which in turn results in substantial computational savings. Focusing on three fundamental statistical estimation problems, i.e. phase retrieval, low-rank matrix completion, and blind deconvolution, we establish that gradient descent achieves near-optimal statistical and computational guarantees without explicit regularization. In particular, by marrying statistical modeling with generic optimization theory, we develop a general recipe for analyzing the trajectories of iterative algorithms via a leave-one-out perturbation argument. As a byproduct, for noisy matrix completion, we demonstrate that gradient descent achieves near-optimal error control, measured entrywise and by the spectral norm, which might be of independent interest.