Abstract: Learning to optimize is a rapidly growing area that aims to solve optimization problems or improve existing optimization algorithms using machine learning (ML). In particular, the graph neural network (GNN) is considered a suitable ML model for optimization problems whose variables and constraints are permutation-invariant, for example, the linear program (LP). While the literature has reported encouraging numerical results, this paper establishes the theoretical foundation for applying GNNs to solve LPs. Given any size limit of LPs, we construct a GNN that maps different LPs to different outputs. We show that properly built GNNs can reliably predict feasibility, boundedness, and an optimal solution for each LP in a broad class. Our proofs are based upon the recently discovered connections between the Weisfeiler--Lehman isomorphism test and the GNN. To validate our results, we train a simple GNN and present its accuracy in mapping LPs to their feasibility and solutions.
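For reference, the LP instances considered here can be written in a standard form, and a common way to feed them to a GNN (assumed here, since the abstract does not spell it out) is as a bipartite graph whose two node sets are the constraints and the variables:
\[
\min_{x \in \mathbb{R}^n} \; c^\top x \quad \text{subject to} \quad Ax \le b, \;\; x \ge 0,
\]
with one node per constraint $i$, one node per variable $j$, and an edge of weight $A_{ij}$ whenever $A_{ij} \neq 0$; the vectors $b$ and $c$ enter as node features. Permuting constraints or variables only permutes graph nodes, which is exactly the permutation invariance the GNN is meant to respect.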
Abstract: Recent advances in distributed optimization and learning have shown that communication compression is one of the most effective means of reducing communication. While there have been many results on convergence rates under communication compression, a theoretical lower bound is still missing. Analyses of algorithms with communication compression have attributed convergence to two abstract properties: the unbiased property and the contractive property. These properties can be combined with either unidirectional compression (where only messages from workers to the server are compressed) or bidirectional compression. In this paper, we consider distributed stochastic algorithms for minimizing smooth, non-convex objective functions under communication compression. We establish a convergence lower bound for algorithms using either unbiased or contractive compressors, in either the unidirectional or the bidirectional setting. To close the gap between the lower bound and the existing upper bounds, we further propose an algorithm, NEOLITHIC, which nearly attains our lower bound (up to logarithmic factors) under mild conditions. Our results also show that using contractive bidirectional compression can yield iterative methods that converge as fast as those using unbiased unidirectional compression. The experimental results validate our findings.
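For readers unfamiliar with the two compressor classes mentioned above, the standard definitions used in this literature (not restated in the abstract) are, for a compressor $C$ applied to a vector $x$:
\[
\text{unbiased: } \ \mathbb{E}[C(x)] = x, \quad \mathbb{E}\|C(x) - x\|^2 \le \omega \|x\|^2; \qquad
\text{contractive: } \ \mathbb{E}\|C(x) - x\|^2 \le (1 - \delta)\|x\|^2,
\]
with $\omega \ge 0$ and $\delta \in (0, 1]$. Rescaled random sparsification is a typical unbiased compressor, while top-$k$ selection is a typical contractive one.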
Abstract: Recent studies have shown that deep learning models such as RNNs and Transformers have brought significant performance gains for long-term forecasting of time series because they effectively utilize historical information. We found, however, that there is still great room for improvement in how to preserve historical information in neural networks while avoiding overfitting to noise present in the history. Addressing this allows better utilization of the capabilities of deep learning models. To this end, we design a \textbf{F}requency \textbf{i}mproved \textbf{L}egendre \textbf{M}emory model, or {\bf FiLM}: it applies Legendre polynomial projections to approximate historical information, uses Fourier projection to remove noise, and adds a low-rank approximation to speed up computation. Our empirical studies show that the proposed FiLM significantly improves the accuracy of state-of-the-art models in multivariate and univariate long-term forecasting by \textbf{20.3\%} and \textbf{22.6\%}, respectively. We also demonstrate that the representation module developed in this work can be used as a general plug-in to improve the long-term prediction performance of other deep learning modules. Code will be released soon.
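To make the "Legendre projection of history" concrete, one standard way to compress a history window $x(t)$, $t \in [0, T]$, into $N$ coefficients uses shifted Legendre polynomials $P_n$; the normalization below is the textbook one, and the abstract does not state FiLM's exact variant:
\[
c_n = \frac{2n+1}{T} \int_0^T x(t)\, P_n\!\Big(\frac{2t}{T} - 1\Big)\, dt, \qquad
x(t) \approx \sum_{n=0}^{N-1} c_n\, P_n\!\Big(\frac{2t}{T} - 1\Big),
\]
so that only the low-order coefficients $c_0, \dots, c_{N-1}$ need to be carried forward as the memory of the history.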
Abstract: Since its invention in 2014, the Adam optimizer has received tremendous attention. On the one hand, it has been widely used in deep learning and many variants have been proposed; on the other hand, the theoretical convergence properties of these methods remain a mystery. The existing understanding is far from satisfactory: some studies require strong assumptions about the updates, which are not necessarily applicable in practice, while other studies still follow the original, problematic convergence analysis of Adam, which was shown to be insufficient to ensure convergence. Although rigorous convergence analyses exist for Adam, they impose specific requirements on the update of the adaptive step size, which are not generic enough to cover many other variants of Adam. To address these issues, in this extended abstract, we present a simple and generic proof of convergence for a family of Adam-style methods (including Adam, AMSGrad, Adabound, etc.). Our analysis only requires an increasing or large "momentum" parameter for the first-order moment, which is indeed the case used in practice, and a boundedness condition on the adaptive factor of the step size, which applies to all variants of Adam under mild conditions on the stochastic gradients. We also establish a variance-diminishing result for the stochastic gradient estimators used. Indeed, our analysis of Adam is so simple and generic that it can be leveraged to establish convergence for a broader family of non-convex optimization problems, including min-max, compositional, and bilevel optimization problems. For the full (earlier) version of this extended abstract, please refer to arXiv:2104.14840.
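As a reminder of the family being analyzed, an Adam-style method maintains first- and second-order moment estimates and scales the step by an adaptive factor. The generic template below (with stochastic gradient $g_t$) covers Adam, AMSGrad, and related variants; in this template the "adaptive factor of the step size" is $1/(\sqrt{v_t} + \epsilon)$:
\[
m_t = \beta_{1,t}\, m_{t-1} + (1 - \beta_{1,t})\, g_t, \qquad
v_t = h_t\big(v_{t-1},\, g_t^2\big), \qquad
x_{t+1} = x_t - \frac{\alpha_t\, m_t}{\sqrt{v_t} + \epsilon},
\]
where $h_t$ is, for example, the exponential moving average $\beta_2 v_{t-1} + (1-\beta_2) g_t^2$ for Adam or the max-type update of AMSGrad.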
Abstract: A decentralized algorithm is a form of computation that achieves a global goal through local dynamics relying on low-cost communication between directly connected agents. On large-scale optimization tasks involving distributed datasets, decentralized algorithms have shown strong, sometimes superior, performance over distributed algorithms with a central node. Recently, developing decentralized algorithms for deep learning has attracted great attention. They are considered low-communication-overhead alternatives to those using a parameter server or the Ring-Allreduce protocol. However, the lack of an easy-to-use and efficient software package has kept most decentralized algorithms merely on paper. To fill this gap, we introduce BlueFog, a Python library for straightforward, high-performance implementations of diverse decentralized algorithms. Based on a unified abstraction of various communication operations, BlueFog offers intuitive interfaces to implement a spectrum of decentralized algorithms, from those using a static, undirected graph for synchronous operations to those using dynamic and directed graphs for asynchronous operations. BlueFog also adopts several system-level acceleration techniques to further optimize performance on deep learning tasks. On mainstream DNN training tasks, BlueFog reaches a much higher throughput and achieves an overall $1.2\times \sim 1.8\times$ speedup over Horovod, a state-of-the-art distributed deep learning package based on Ring-Allreduce. BlueFog is open source at https://github.com/Bluefog-Lib/bluefog.
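For intuition about the primitive such a library revolves around, the sketch below shows one synchronous decentralized SGD step built on neighbor averaging with a mixing matrix $W$. It is a framework-agnostic illustration only, not BlueFog's API; all names are made up for this example.
\begin{verbatim}
import numpy as np

def decentralized_sgd_step(params, grads, W, lr=0.1):
    """One synchronous decentralized SGD step over n nodes.

    params: list of numpy parameter vectors, one per node
    grads:  list of numpy stochastic gradients, one per node
    W:      (n, n) doubly stochastic mixing matrix; W[i, j] > 0
            only if nodes i and j are directly connected
    """
    n = len(params)
    # Local SGD update on every node.
    local = [params[i] - lr * grads[i] for i in range(n)]
    # Inexact averaging: each node mixes only with its neighbors.
    return [sum(W[i, j] * local[j] for j in range(n)) for i in range(n)]
\end{verbatim}
The sparser the graph behind $W$, the cheaper each mixing round but the less exact the averaging, which is the trade-off the library is designed to navigate.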
Abstract: The Learned Iterative Shrinkage-Thresholding Algorithm (LISTA) introduced the concept of unrolling an iterative algorithm and training it like a neural network, and it has had great success on sparse recovery. In this paper, we show that adding momentum to intermediate variables in the LISTA network achieves a better convergence rate and, in particular, that the network with instance-optimal parameters is superlinearly convergent. Moreover, our new theoretical results lead to a practical approach for automatically and adaptively calculating the parameters of a LISTA network layer based on its previous layers. Perhaps most surprisingly, such an adaptive-parameter procedure reduces the training of LISTA to tuning only three hyperparameters from data: a new record in the context of recent advances on trimming down LISTA complexity. We call this new ultra-lightweight network HyperLISTA. Compared to state-of-the-art LISTA models, HyperLISTA achieves almost the same performance on seen data distributions and performs better when tested on unseen distributions (specifically, those with different sparsity levels and nonzero magnitudes). Code is available at https://github.com/VITA-Group/HyperLISTA.
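For context, LISTA unrolls the iterative shrinkage-thresholding algorithm (ISTA) for the LASSO problem $\min_x \frac{1}{2}\|b - Ax\|_2^2 + \lambda \|x\|_1$ into a feed-forward network with learned, layer-wise parameters:
\[
\text{ISTA: } \ x^{k+1} = \eta_{\lambda/L}\!\Big(x^k - \tfrac{1}{L} A^\top (A x^k - b)\Big), \qquad
\text{LISTA: } \ x^{k+1} = \eta_{\theta^k}\!\big(W_1^k b + W_2^k x^k\big),
\]
where $\eta_\theta$ is the soft-thresholding operator and $L$ is a Lipschitz constant of the gradient. The momentum studied here feeds a weighted combination of previous iterates into each layer (this phrasing is ours; see the paper for the exact form), and HyperLISTA collapses the per-layer parameters to a few hyperparameters computed from earlier layers.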
Abstract: Decentralized SGD is an emerging training method for deep learning, known for requiring much less (and thus faster) communication per iteration; it relaxes the exact averaging step in parallel SGD to inexact averaging. The less exact the averaging is, however, the more iterations the training needs. Therefore, the key to making decentralized SGD efficient is to realize nearly exact averaging using little communication. This requires a skillful choice of communication topology, an under-studied topic in decentralized optimization. In this paper, we study so-called exponential graphs, in which every node is connected to $O(\log(n))$ neighbors, where $n$ is the total number of nodes. This work proves that such graphs can lead to fast communication and effective averaging simultaneously. We also discover that a sequence of $\log(n)$ one-peer exponential graphs, in which each node communicates with one single neighbor per iteration, can together achieve exact averaging. This favorable property enables the one-peer exponential graph to average as effectively as its static counterpart while communicating more efficiently. We apply these exponential graphs in decentralized (momentum) SGD to obtain the state-of-the-art balance between per-iteration communication and iteration complexity among all commonly used topologies. Experimental results on a variety of tasks and models demonstrate that decentralized (momentum) SGD over exponential graphs promises both fast and high-quality training. Our code is implemented with BlueFog and available at https://github.com/Bluefog-Lib/NeurIPS2021-Exponential-Graph.
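As a small illustration of the topologies studied, the sketch below constructs the static exponential graph (each node links to peers $2^0, 2^1, 2^2, \dots$ hops away, modulo $n$) and the one-peer schedule that cycles through those hops one at a time; this reflects the usual definition of these graphs, and the function names are ours.
\begin{verbatim}
import math

def static_exponential_neighbors(i, n):
    """Neighbors of node i in the static exponential graph on n >= 2 nodes:
    nodes that are 1, 2, 4, ... hops away (modulo n)."""
    hops = int(math.ceil(math.log2(n)))
    return sorted({(i + 2 ** j) % n for j in range(hops)} - {i})

def one_peer_neighbor(i, n, t):
    """The single neighbor node i contacts at iteration t when cycling
    through the log2(n) one-peer exponential graphs."""
    hops = int(math.ceil(math.log2(n)))
    return (i + 2 ** (t % hops)) % n
\end{verbatim}
With this schedule, each node sends one message per iteration, and a full cycle of the one-peer graphs plays the averaging role of the static graph.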
Abstract: Robust principal component analysis (RPCA) is a critical tool in modern machine learning; it detects outliers in the task of low-rank matrix reconstruction. In this paper, we propose a scalable and learnable non-convex approach for high-dimensional RPCA problems, which we call Learned Robust PCA (LRPCA). LRPCA is highly efficient, and its free parameters can be effectively learned via deep unfolding to optimize performance. Moreover, we extend deep unfolding from finite iterations to infinite iterations via a novel feedforward-recurrent-mixed neural network model. We establish a recovery guarantee for LRPCA under mild assumptions for RPCA. Numerical experiments show that LRPCA outperforms state-of-the-art RPCA algorithms, such as ScaledGD and AltProj, on both synthetic datasets and real-world applications.
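For readers new to RPCA, the observation model the method operates on can be summarized as (our notation):
\[
M = L_\star + S_\star, \qquad \operatorname{rank}(L_\star) \le r, \qquad S_\star \ \text{sparse (the outliers)},
\]
and the goal is to recover the low-rank component $L_\star$, and hence identify the outliers $S_\star$, from the observed matrix $M$; LRPCA parameterizes the steps of a non-convex solver for this task and learns those parameters by unfolding.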
Abstract: We propose a new line-search method, coined Curvature-Aware Random Search (CARS), for derivative-free optimization. CARS exploits approximate curvature information to estimate the optimal step size for a given search direction. We prove that, for strongly convex objective functions, CARS converges linearly if the search direction is drawn from a distribution satisfying very mild conditions. We also explore a variant, CARS-NQ, which uses Numerical Quadrature instead of a Monte Carlo method when approximating curvature along the search direction. We show that CARS-NQ is effective on highly non-convex problems of the form $f = f_{\mathrm{cvx}} + f_{\mathrm{osc}}$, where $f_{\mathrm{cvx}}$ is strongly convex and $f_{\mathrm{osc}}$ is rapidly oscillating. Experimental results show that CARS and CARS-NQ match or exceed state-of-the-art methods on benchmark problem sets.
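One natural way to read "exploits approximate curvature information to estimate the optimal step size" is a Newton-type step along the sampled direction $u$, estimated from three function evaluations; the finite-difference instantiation below is illustrative and may differ from the exact formula and safeguards used by CARS:
\[
d_r \approx \frac{f(x + r u) - f(x - r u)}{2r}, \qquad
h_r \approx \frac{f(x + r u) - 2 f(x) + f(x - r u)}{r^2}, \qquad
x^{+} = x - \frac{d_r}{h_r}\, u \quad (\text{when } h_r > 0),
\]
where $r$ is a small sampling radius; CARS-NQ estimates the derivative and curvature along $u$ with numerical quadrature rather than such Monte Carlo-style finite differences.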
Abstract: Stochastic nested optimization, including stochastic compositional, min-max, and bilevel optimization, is gaining popularity in many machine learning applications. While the three problems share the nested structure, existing works often treat them separately and thus develop problem-specific algorithms and analyses. Among various exciting developments, simple SGD-type updates (potentially on multiple variables) are still prevalent in solving this class of nested problems, but they are believed to have slower convergence rates than their non-nested counterparts. This paper unifies several SGD-type updates for stochastic nested problems into a single SGD approach that we term the ALternating Stochastic gradient dEscenT (ALSET) method. By leveraging the hidden smoothness of the problem, this paper presents a tighter analysis of ALSET for stochastic nested problems. Under the new analysis, achieving an $\epsilon$-stationary point of the nested problem requires ${\cal O}(\epsilon^{-2})$ samples. Under certain regularity conditions, applying our results to stochastic compositional, min-max, and reinforcement learning problems either improves or matches the best-known sample complexity in the respective cases. Our results explain why simple SGD-type algorithms for stochastic nested problems work well in practice without the need for further modifications.
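For concreteness, the three nested templates unified here are usually written as follows (these are the standard formulations, not restated in the abstract):
\[
\text{compositional: } \min_x \; f\big(g(x)\big), \qquad
\text{min-max: } \min_x \max_y \; f(x, y), \qquad
\text{bilevel: } \min_x \; f\big(x, y^\star(x)\big) \ \text{ with } \ y^\star(x) \in \arg\min_y g(x, y),
\]
where $f$ and $g$ are accessible only through stochastic samples; ALSET alternates SGD-type updates on the outer variable and the inner variable(s) rather than solving the inner problem to high accuracy at every step.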