Haijie Gu

Carnegie Mellon University

AdaScale SGD: A User-Friendly Algorithm for Distributed Training

Jul 09, 2020
Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin

When using large-batch training to speed up stochastic gradient descent, learning rates must adapt to new batch sizes in order to maximize speed-ups and preserve model quality. Re-tuning learning rates is resource intensive, while fixed scaling rules often degrade model quality. We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training. By continually adapting to the gradient's variance, AdaScale automatically achieves speed-ups for a wide range of batch sizes. We formally describe this quality with AdaScale's convergence bound, which maintains final objective values, even as batch sizes grow large and the number of iterations decreases. In empirical comparisons, AdaScale trains well beyond the batch size limits of popular "linear learning rate scaling" rules. This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks. AdaScale's qualitative behavior is similar to that of "warm-up" heuristics, but unlike warm-up, this behavior emerges naturally from a principled mechanism. The algorithm introduces negligible computational overhead and no new hyperparameters, making AdaScale an attractive choice for large-scale training in practice.
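As a rough illustration of the variance-based adaptation described above, here is a minimal sketch of the gain statistic that drives AdaScale's learning-rate scaling. The function name and the moment estimators are assumptions for illustration, not the paper's exact implementation; in the paper, the gain both multiplies the base learning rate and advances the "scale-invariant iterations" used for scheduling.

```python
import numpy as np

def adascale_gain(worker_grads):
    """Estimate an AdaScale-style gain in [1, S] from S per-worker gradients.

    A hypothetical sketch of the variance-based adaptation the abstract
    describes; estimator details here are illustrative.
    """
    S = len(worker_grads)
    grads = np.stack(worker_grads)            # shape (S, d)
    avg_grad = grads.mean(axis=0)             # aggregated large-batch gradient
    var = grads.var(axis=0, ddof=1).sum()     # per-worker gradient variance
    # E||avg_grad||^2 = ||mu||^2 + var / S, so debias the squared norm.
    mu_sq = max(float(avg_grad @ avg_grad) - var / S, 0.0)
    # Gain -> S when gradient noise dominates, -> 1 when the signal dominates.
    return (var + mu_sq) / (var / S + mu_sq + 1e-12)
```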

* ICML 2020 

Sequential Nonparametric Regression

Jun 27, 2012
Haijie Gu, John Lafferty

We present algorithms for nonparametric regression in settings where the data are obtained sequentially. While traditional estimators select bandwidths that depend upon the sample size, for sequential data the effective sample size is dynamically changing. We propose a linear time algorithm that adjusts the bandwidth for each new data point, and show that the estimator achieves the optimal minimax rate of convergence. We also propose the use of online expert mixing algorithms to adapt to unknown smoothness of the regression function. We provide simulations that confirm the theoretical results, and demonstrate the effectiveness of the methods.
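As a concrete, simplified illustration of per-point bandwidth adjustment, the sketch below maintains an online Nadaraya-Watson estimate at a fixed query point, weighting each arriving observation with the bandwidth in force at its arrival time, so total work over n points is linear. The smoothness parameter beta, the constant c, and the Gaussian kernel are assumptions for illustration, not the paper's estimator.

```python
import numpy as np

class SequentialNW:
    """Online Nadaraya-Watson regression at a fixed query point.

    Each new observation i is weighted with the bandwidth in force at its
    arrival time, h_i = c * i**(-1/(2*beta + d)). A hypothetical sketch:
    beta (assumed smoothness), c, and the kernel are illustrative choices.
    """

    def __init__(self, x_query, beta=2.0, d=1, c=1.0):
        self.x_query, self.beta, self.d, self.c = x_query, beta, d, c
        self.n, self.num, self.den = 0, 0.0, 0.0

    def update(self, x, y):
        self.n += 1
        h = self.c * self.n ** (-1.0 / (2 * self.beta + self.d))
        w = np.exp(-0.5 * ((self.x_query - x) / h) ** 2)  # Gaussian weight
        self.num += w * y
        self.den += w
        return self.num / self.den if self.den > 0 else 0.0
```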

* Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012) 

Forest Density Estimation

Oct 20, 2010
Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman

We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal's algorithm to estimate the optimal forest on held-out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models.
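The forest-construction step lends itself to a short sketch: given edge weights (in the paper, mutual informations estimated from kernel density estimates of the bivariate and univariate marginals), Kruskal's algorithm greedily adds the heaviest edges that keep the graph acyclic. The cap on total edge count below is a simplification of the paper's restricted tree size (exact optimization under that restriction is NP-hard, as the abstract notes), and the names kruskal_forest and max_edges are hypothetical.

```python
def kruskal_forest(n_vars, edge_weights, max_edges=None):
    """Maximum-weight spanning forest via Kruskal's algorithm.

    edge_weights: dict mapping variable pairs (i, j) to estimated edge
    weights, e.g. plug-in mutual information estimates.
    """
    parent = list(range(n_vars))            # union-find over variables

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    forest = []
    for (i, j), w in sorted(edge_weights.items(), key=lambda kv: -kv[1]):
        if max_edges is not None and len(forest) >= max_edges:
            break                           # complexity cap, chosen on held-out data
        ri, rj = find(i), find(j)
        if ri != rj:                        # edge keeps the forest acyclic
            parent[ri] = rj
            forest.append((i, j, w))
    return forest
```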

* Extended version of earlier paper titled "Tree density estimation" 