Adaptivity is an important yet under-studied property in modern optimization theory. The gap between state-of-the-art theory and current practice is striking: algorithms with desirable theoretical guarantees typically require drastically different hyperparameter settings, such as step-size schemes and batch sizes, in different regimes. Despite the appealing theoretical results, such divisive strategies provide little, if any, insight to practitioners seeking algorithms that work broadly without hyperparameter tweaking. In this work, blending the "geometrization" technique introduced by Lei & Jordan (2016) and the \texttt{SARAH} algorithm of Nguyen et al. (2017), we propose the Geometrized \texttt{SARAH} algorithm for non-convex finite-sum and stochastic optimization. We prove that our algorithm adapts both to the magnitude of the target accuracy and to the Polyak-\L{}ojasiewicz (PL) constant, if present. Moreover, it simultaneously achieves the best-available convergence rate for non-PL objectives while outperforming existing algorithms for PL objectives.
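As a rough illustration of the recursive estimator underlying \texttt{SARAH}-type methods, the sketch below runs one outer loop with a geometrically distributed inner-loop length, which is the essence of the geometrization idea; the quadratic problem instance, step size, and loop-length parameter are placeholders, and the actual Geom-\texttt{SARAH} schedule from the paper is not reproduced.
\begin{verbatim}
import numpy as np

# Hypothetical least-squares finite-sum objective, used only for illustration.
rng = np.random.default_rng(0)
n, d = 200, 10
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)

def grad(w, idx):
    """Average gradient of the components indexed by idx."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ w - bi) / len(idx)

def geom_sarah_epoch(w, eta=0.05, p=0.05):
    """One outer loop of a SARAH-style recursive estimator with a
    geometrically distributed inner-loop length (the 'geometrization')."""
    v = grad(w, np.arange(n))             # full-gradient anchor
    w_prev, w = w, w - eta * v
    T = int(rng.geometric(p))             # random inner-loop length
    for _ in range(T):
        i = rng.integers(n, size=1)
        v = grad(w, i) - grad(w_prev, i) + v   # recursive SARAH update
        w_prev, w = w, w - eta * v
    return w

w = np.zeros(d)
for _ in range(20):
    w = geom_sarah_epoch(w)
\end{verbatim}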
We study the fixed-support Wasserstein barycenter problem (FS-WBP), which consists of computing the Wasserstein barycenter of $m$ discrete probability measures supported on a finite metric space of size $n$. We first show that the constraint matrix arising from the linear programming (LP) representation of the FS-WBP is totally unimodular when $m \geq 3$ and $n = 2$, but not totally unimodular when $m \geq 3$ and $n \geq 3$. This result resolves an open question: it shows that the FS-WBP is not a minimum-cost flow problem and hence cannot be solved by the efficient combinatorial algorithms available for network flows. Building on this negative result, we propose and analyze a simple and efficient variant of the iterative Bregman projection (IBP) algorithm, currently the most widely adopted algorithm for solving the FS-WBP. Our algorithm is an accelerated IBP algorithm that achieves a complexity bound of $\widetilde{\mathcal{O}}(mn^{7/3}/\varepsilon)$. This bound is better than that of the standard IBP algorithm---$\widetilde{\mathcal{O}}(mn^{2}/\varepsilon^2)$---in terms of $\varepsilon$, and better than that of the accelerated primal-dual gradient algorithm---$\widetilde{\mathcal{O}}(mn^{5/2}/\varepsilon)$---in terms of $n$. Empirical studies on simulated datasets demonstrate that the acceleration promised by the theory is realized in practice.
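For context, the following is a minimal sketch of the standard IBP baseline for the entropically regularized FS-WBP (scaling updates on the Gibbs kernel); the accelerated variant proposed in the paper modifies this scheme and is not reproduced here. The regularization level and iteration count are placeholder values.
\begin{verbatim}
import numpy as np

def ibp_barycenter(P, C, reg=0.05, weights=None, n_iter=500):
    """Standard iterative Bregman projection (IBP) for the entropically
    regularized fixed-support barycenter.  P: (m, n) measures (rows sum
    to one), C: (n, n) cost matrix on the shared support."""
    m, n = P.shape
    w = np.full(m, 1.0 / m) if weights is None else weights
    K = np.exp(-C / reg)                       # Gibbs kernel
    V = np.ones((m, n))
    for _ in range(n_iter):
        U = P / (V @ K.T)                      # u_k = p_k / (K v_k)
        logb = (w[:, None] * np.log(U @ K)).sum(axis=0)
        b = np.exp(logb)                       # weighted geometric mean
        V = b[None, :] / (U @ K)               # v_k = b / (K^T u_k)
    return b / b.sum()
\end{verbatim}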
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients to reduce the variance of stochastic gradients. Compared to SGD, these methods require at least double the number of operations per model-parameter update. To reduce the computational cost of these methods, we introduce a new sparsity operator: the random-top-k operator. Our operator reduces computational complexity by exploiting the gradient sparsity exhibited in a variety of applications, combining the top-k operator with the randomized coordinate descent operator. With this operator, large batch gradients offer an extra benefit beyond variance reduction: a reliable estimate of gradient sparsity. Theoretically, our algorithm is at least as good as the best existing algorithm (SpiderBoost), and it further excels whenever the random-top-k operator captures gradient sparsity. Empirically, our algorithm consistently outperforms SpiderBoost across a variety of models and tasks, including image classification, natural language processing, and sparse matrix factorization. We also provide empirical evidence supporting the intuition behind our algorithm via a simple gradient-entropy computation, which quantifies gradient sparsity at every iteration.
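One possible instantiation of a random-top-k style sparsifier is sketched below: keep the largest-magnitude coordinates of a gradient plus a few uniformly random coordinates from the remainder, rescaled so the random part stays unbiased in expectation. The exact construction, scaling, and parameter choices used in the paper may differ.
\begin{verbatim}
import numpy as np

def random_top_k(g, k_top, k_rand, rng):
    """Sketch of a random-top-k sparsifier: keep the k_top largest-magnitude
    coordinates of g plus k_rand random coordinates from the rest, with
    importance rescaling on the random part."""
    d = g.size
    top = np.argsort(np.abs(g))[-k_top:]
    rest = np.setdiff1d(np.arange(d), top)
    rand = rng.choice(rest, size=k_rand, replace=False)
    out = np.zeros_like(g)
    out[top] = g[top]
    out[rand] = g[rand] * (rest.size / k_rand)   # keep the random part unbiased
    return out

rng = np.random.default_rng(0)
g = rng.standard_normal(1000) * (rng.uniform(size=1000) < 0.05)  # sparse-ish gradient
g_sparse = random_top_k(g, k_top=20, k_rand=30, rng=rng)
\end{verbatim}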
We study the problem of sampling from the power posterior distribution in Bayesian Gaussian mixture models, a robust version of the classical posterior. This power posterior is known to be non-log-concave and multi-modal, which leads to exponential mixing times for some standard MCMC algorithms. We introduce and study the Reflected Metropolis-Hastings Random Walk (RMRW) algorithm for this sampling problem. For symmetric two-component Gaussian mixtures, we prove that its mixing time is of order $d^{1.5}(d + \Vert \theta_{0} \Vert^2)^{4.5}$, provided the sample size $n$ is of order $d (d + \Vert \theta_{0} \Vert^2)$. Notably, this result requires no conditions on the separation of the two means. En route to proving this bound, we establish new results, of possible independent interest, on combining Poincar\'{e} inequalities for conditional and marginal densities.
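For orientation, a bare-bones Metropolis random-walk sampler on a user-supplied log-density is sketched below; the reflection step that distinguishes RMRW, and the power-posterior target itself, are not reproduced, and the step size is a placeholder.
\begin{verbatim}
import numpy as np

def metropolis_random_walk(log_density, x0, step, n_steps, rng):
    """Plain Metropolis random walk with a Gaussian proposal; the
    reflection step defining RMRW is not included in this sketch."""
    x = np.asarray(x0, float)
    logp = log_density(x)
    samples = []
    for _ in range(n_steps):
        y = x + step * rng.standard_normal(x.shape)   # random-walk proposal
        logq = log_density(y)
        if np.log(rng.uniform()) < logq - logp:        # accept/reject
            x, logp = y, logq
        samples.append(x.copy())
    return np.array(samples)
\end{verbatim}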
Algorithms for controlling the false discovery rate (FDR) in multiple testing fall into two important classes: offline and online algorithms. The former generally achieve significantly higher power of discovery, while the latter allow decisions to be made sequentially and hypotheses to be formulated adaptively based on past observations. Using existing methodology, it is unclear how one could trade off the benefits of these two broad families of algorithms while preserving their formal FDR guarantees. To this end, we introduce $\text{Batch}_{\text{BH}}$ and $\text{Batch}_{\text{St-BH}}$, algorithms for controlling the FDR when a possibly infinite sequence of batches of hypotheses is tested by repeated application of one of the most widely used offline algorithms, the Benjamini-Hochberg (BH) method or Storey's improvement of it. We show that our algorithms interpolate between existing online and offline methodology, thus trading off the best of both worlds.
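For reference, the per-batch building block is the standard BH procedure, sketched below; the rule by which $\text{Batch}_{\text{BH}}$ allocates test levels across batches to preserve the overall FDR guarantee is part of the paper and is not reproduced here.
\begin{verbatim}
import numpy as np

def benjamini_hochberg(pvals, alpha):
    """Standard BH step-up procedure on one batch of p-values; returns a
    boolean rejection mask."""
    p = np.asarray(pvals)
    n = p.size
    order = np.argsort(p)
    thresh = alpha * np.arange(1, n + 1) / n
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(n, dtype=bool)
    reject[order[:k]] = True    # reject the k smallest p-values
    return reject
\end{verbatim}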
We study the complexity of approximating the multimarginal optimal transport (OT) problem, a generalization of the classical optimal transport distance, considered here between $m$ discrete probability distributions, each supported on $n$ points. First, we show that the multimarginal OT problem is not a minimum-cost flow problem when $m \geq 3$. This implies that many of the combinatorial algorithms developed for classical OT are not applicable to multimarginal OT, while the generic interior-point approach yields an intractable complexity bound of $\widetilde{\mathcal{O}}(n^{3m})$. Second, we propose and analyze two simple algorithms for approximating the multimarginal OT problem. The first algorithm, which we refer to as multimarginal Sinkhorn, improves upon previous multimarginal generalizations of the celebrated Sinkhorn algorithm. We show that it achieves a near-linear time complexity bound of $\widetilde{\mathcal{O}}(m^3 n^m / \varepsilon^2)$ for a tolerance $\varepsilon \in (0, 1)$. This matches the best known complexity bound for the Sinkhorn algorithm when $m = 2$ for approximating the classical OT distance. The second algorithm, which we refer to as multimarginal Randkhorn, accelerates the first algorithm by incorporating a randomized estimate sequence and achieves a complexity bound of $\widetilde{\mathcal{O}}(m^{8/3} n^{m+1/3}/\varepsilon)$. This improves the dependence on $\varepsilon$ of the first algorithm from $1/\varepsilon^2$ to $1/\varepsilon$ and matches the best known complexity bound for the Randkhorn algorithm when $m=2$ for approximating the classical OT distance.
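To illustrate the structure of the first algorithm, the following is a rough multimarginal Sinkhorn sketch that cyclically rescales one marginal of the entropically regularized coupling tensor at a time; the rounding step, stopping rule, and parameter choices from the paper's analysis are omitted, and scipy is assumed for the log-sum-exp.
\begin{verbatim}
import numpy as np
from scipy.special import logsumexp

def _log_coupling(logK, potentials):
    """Log of the coupling tensor exp(sum_k f_k - C/reg) for given potentials."""
    m = logK.ndim
    logB = logK.copy()
    for j in range(m):
        shape = [1] * m
        shape[j] = -1
        logB = logB + potentials[j].reshape(shape)
    return logB

def multimarginal_sinkhorn(C, marginals, reg=0.1, n_sweeps=300):
    """Sketch of multimarginal Sinkhorn: cyclic projections that match one
    marginal of the regularized coupling tensor at a time.  C is an m-way
    cost tensor; marginals is a list of m probability vectors."""
    m = C.ndim
    logK = -C / reg
    f = [np.zeros(C.shape[k]) for k in range(m)]   # dual potentials
    for it in range(n_sweeps * m):
        k = it % m
        logB = _log_coupling(logK, f)
        other_axes = tuple(a for a in range(m) if a != k)
        log_marg = logsumexp(logB, axis=other_axes)      # k-th marginal of B
        f[k] = f[k] + np.log(marginals[k]) - log_marg    # match the k-th marginal
    return np.exp(_log_coupling(logK, f))
\end{verbatim}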
Deep neural networks trained on a wide range of datasets demonstrate impressive transferability. Deep features appear general in that they are applicable to many datasets and tasks, a property that is widely exploited in real-world applications: a neural network pretrained on a large dataset, such as ImageNet, can significantly boost generalization and accelerate training when fine-tuned on a smaller target dataset. Despite its pervasiveness, little effort has been devoted to uncovering the reasons for transferability in deep feature representations. This paper seeks to understand transferability from the perspectives of improved generalization, improved optimization, and the feasibility of transfer. We demonstrate that 1) transferred models tend to find flatter minima, since their weight matrices stay close to the original flat region of the pretrained parameters when transferred to a similar target dataset; 2) transferred representations make the loss landscape more favorable, with improved Lipschitzness, which accelerates and stabilizes training substantially; the improvement is largely attributable to the fact that the principal components of the gradient are suppressed in the pretrained parameters, which stabilizes the magnitude of the gradient during back-propagation; and 3) the feasibility of transfer is related to the similarity of both inputs and labels; surprisingly, feasibility is also affected by the training stage, in that transferability first increases during training and then declines. We further provide a theoretical analysis to verify our observations.
We propose a Markov chain Monte Carlo (MCMC) algorithm based on third-order Langevin dynamics for sampling from distributions with log-concave and smooth densities. The higher-order dynamics allow for more flexible discretization schemes, and we develop a specific method that combines splitting with more accurate integration. For a broad class of $d$-dimensional distributions arising from generalized linear models, we prove that the resulting third-order algorithm produces samples from a distribution that is at most $\varepsilon > 0$ in Wasserstein distance from the target distribution in $O\left(\frac{d^{1/3}}{ \varepsilon^{2/3}} \right)$ steps. This result requires only Lipschitz conditions on the gradient. For general strongly convex potentials with $\alpha$-th order smoothness, we prove that the mixing time scales as $O \left(\frac{d^{1/3}}{\varepsilon^{2/3}} + \frac{d^{1/2}}{\varepsilon^{1/(\alpha - 1)}} \right)$.
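For contrast with the higher-order scheme analyzed here, a minimal first-order (overdamped) unadjusted Langevin sampler is sketched below; the third-order dynamics and the splitting-based integrator are not reproduced, and the step size is a placeholder.
\begin{verbatim}
import numpy as np

def ula(grad_log_p, x0, step, n_steps, rng):
    """Unadjusted (overdamped) Langevin algorithm: Euler discretization of
    dX = grad log p(X) dt + sqrt(2) dB, shown only as a first-order baseline."""
    x = np.asarray(x0, float)
    samples = []
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step * grad_log_p(x) + np.sqrt(2.0 * step) * noise
        samples.append(x.copy())
    return np.array(samples)
\end{verbatim}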
Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)---a classical algorithm frequently studied in the linear setting---achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of the feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, this regret is independent of the number of states and actions.
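The core of an optimistic least-squares value-iteration update can be sketched as a ridge regression on features plus an elliptical confidence bonus, as below; the notation ($\phi$, $\beta$, $\lambda$) follows the usual linear-MDP setup, and the full episodic loop and the bonus constant from the paper's analysis are omitted.
\begin{verbatim}
import numpy as np

def lsvi_ucb_q(phi, targets, phi_query, beta=1.0, lam=1.0, H=10):
    """Sketch of one optimistic LSVI-style backup: ridge-regress backed-up
    values on features and add a UCB-style elliptical bonus.
    phi: (N, d) features of visited (s, a) pairs,
    targets: (N,) backed-up values r + max_a Q_{h+1}(s', a),
    phi_query: (d,) feature vector where the optimistic Q-value is evaluated."""
    d = phi.shape[1]
    Lambda = phi.T @ phi + lam * np.eye(d)            # regularized Gram matrix
    w = np.linalg.solve(Lambda, phi.T @ targets)      # least-squares weights
    bonus = beta * np.sqrt(phi_query @ np.linalg.solve(Lambda, phi_query))
    return min(phi_query @ w + bonus, H)              # optimistic value, clipped at H
\end{verbatim}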