Bohan Wang

Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study

Nov 25, 2023
Prin Phunyaphibarn, Junghyun Lee, Bohan Wang, Huishuai Zhang, Chulhee Yun

Although gradient descent with momentum is widely used in modern deep learning, a concrete understanding of its effects on the training trajectory remains elusive. In this work, we empirically show that momentum gradient descent with a large learning rate and learning rate warmup displays large catapults, driving the iterates towards flatter minima than those found by gradient descent. We then provide empirical evidence and theoretical intuition that the large catapult is caused by momentum "amplifying" the self-stabilization effect (Damian et al., 2023).

* 19 pages, 14 figures. Accepted to the NeurIPS 2023 M3L Workshop (oral). The first two authors contributed equally.
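
As a point of reference for the setting described above, here is a minimal sketch of heavy-ball momentum gradient descent with linear learning-rate warmup on a toy quadratic; the objective, peak learning rate, warmup length, and momentum coefficient are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Toy quadratic with one sharp direction (largest Hessian eigenvalue 50).
H = np.diag([50.0, 1.0])
loss = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

def momentum_gd_warmup(x0, peak_lr=0.05, warmup_steps=200, steps=400, beta=0.9):
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    history = []
    for t in range(steps):
        lr = peak_lr * min(1.0, (t + 1) / warmup_steps)  # linear warmup schedule
        v = beta * v + grad(x)                           # heavy-ball momentum buffer
        x = x - lr * v
        history.append(loss(x))
    return x, history

x_final, losses = momentum_gd_warmup([1.0, 1.0])
print(losses[::100])  # spikes in such a trace would be the "catapults" of interest
```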

Closing the Gap Between the Upper Bound and the Lower Bound of Adam's Iteration Complexity

Oct 27, 2023
Bohan Wang, Jingwen Fu, Huishuai Zhang, Nanning Zheng, Wei Chen

Recently, Arjevani et al. [1] established a lower bound on the iteration complexity of first-order optimization under an $L$-smooth condition and a bounded noise variance assumption. However, a thorough review of the existing literature on Adam's convergence reveals a noticeable gap: none of the existing guarantees meets this lower bound. In this paper, we close the gap by deriving a new convergence guarantee for Adam under only an $L$-smooth condition and a bounded noise variance assumption. Our results remain valid across a broad spectrum of hyperparameters. In particular, with properly chosen hyperparameters, we derive an upper bound on the iteration complexity of Adam and show that it meets the lower bound for first-order optimizers. To the best of our knowledge, this is the first work to establish such a tight upper bound for Adam's convergence. Our proof utilizes novel techniques to handle the entanglement between momentum and the adaptive learning rate and to convert the first-order term in the Descent Lemma to the gradient norm, which may be of independent interest.

* NeurIPS 2023 Accept 
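
The guarantee concerns the standard Adam recursion; to fix notation, a minimal reference implementation is sketched below. The step size, noise scale, and test objective are illustrative assumptions, not the paper's setting.

```python
import numpy as np

def adam_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam step (Kingma & Ba, 2015); hyperparameters are illustrative."""
    m = beta1 * m + (1 - beta1) * g          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment (adaptive-lr) estimate
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Usage on a noisy quadratic with minimizer x* = 0: gradients are observed with
# additive Gaussian noise, i.e., an L-smooth objective with bounded noise variance.
rng = np.random.default_rng(0)
x, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 1001):
    g = x + 0.1 * rng.standard_normal(3)     # stochastic gradient
    x, m, v = adam_step(x, g, m, v, t)
print(x)
```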

How Can Large Language Models Help Humans in Design and Manufacturing?

Jul 25, 2023
Liane Makatura, Michael Foshey, Bohan Wang, Felix HähnLein, Pingchuan Ma, Bolei Deng, Megan Tjandrasuwita, Andrew Spielberg, Crystal Elaine Owens, Peter Yichen Chen, Allan Zhao, Amy Zhu, Wil J Norton, Edward Gu, Joshua Jacob, Yifei Li, Adriana Schulz, Wojciech Matusik

The advancement of Large Language Models (LLMs), including GPT-4, provides exciting new opportunities for generative design. We investigate the application of these models across the entire design and manufacturing workflow. Specifically, we scrutinize their utility in tasks such as: converting a text-based prompt into a design specification, transforming a design into manufacturing instructions, producing a design space and design variations, computing the performance of a design, and searching for designs predicated on performance. Through a series of examples, we highlight both the benefits and the limitations of current LLMs. By exposing these limitations, we aspire to catalyze the continued improvement and progression of these models.

Fast Conditional Mixing of MCMC Algorithms for Non-log-concave Distributions

Jun 18, 2023
Xiang Cheng, Bohan Wang, Jingzhao Zhang, Yusong Zhu

MCMC algorithms offer empirically efficient tools for sampling from a target distribution $\pi(x) \propto \exp(-V(x))$. On the theory side, however, MCMC algorithms suffer from slow mixing rates when $\pi(x)$ is non-log-concave. Our work examines this gap and shows that when a Poincaré-style inequality holds on a subset $\mathcal{X}$ of the state space, the conditional distribution of the MCMC iterates over $\mathcal{X}$ mixes fast to the true conditional distribution. This fast mixing guarantee can hold even when global mixing is provably slow. We formalize this statement and quantify the conditional mixing rate. We further show that conditional mixing has interesting implications for sampling from mixtures of Gaussians, parameter estimation for Gaussian mixture models, and Gibbs sampling with well-connected local minima.
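
The abstract speaks of MCMC algorithms generically; as one concrete instance (an assumption of this sketch, not necessarily the algorithm analyzed in the paper), the unadjusted Langevin algorithm targeting $\pi(x) \propto \exp(-V(x))$ on a two-mode, non-log-concave potential looks like this:

```python
import numpy as np

def ula_sample(grad_V, x0, step=1e-2, n_steps=10_000, rng=None):
    """Unadjusted Langevin algorithm targeting pi(x) proportional to exp(-V(x))."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_V(x) + np.sqrt(2 * step) * noise
        samples.append(x.copy())
    return np.array(samples)

# Non-log-concave example: symmetric two-mode potential V(x) = (x^2 - 1)^2.
grad_V = lambda x: 4 * x * (x ** 2 - 1)
chain = ula_sample(grad_V, x0=[1.0])
# Conditional mixing: restricted to the right mode, the chain equilibrates quickly
# even though hopping between modes (global mixing) can be slow.
right_mode = chain[chain[:, 0] > 0]
print(right_mode.mean(), right_mode.std())
```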

When and Why Momentum Accelerates SGD: An Empirical Study

Jun 15, 2023
Jingwen Fu, Bohan Wang, Huishuai Zhang, Zhizheng Zhang, Wei Chen, Nanning Zheng

Momentum has become a crucial component of deep learning optimizers, necessitating a comprehensive understanding of when and why it accelerates stochastic gradient descent (SGD). To address the question of "when", we establish a meaningful comparison framework that examines the performance of SGD with momentum (SGDM) under the effective learning rate $\eta_{ef}$, a notion that unifies the influence of the momentum coefficient $\mu$ and the batch size $b$ with the learning rate $\eta$. Comparing SGDM and SGD at the same effective learning rate and batch size, we observe a consistent pattern: when $\eta_{ef}$ is small, SGDM and SGD experience almost the same empirical training losses; when $\eta_{ef}$ surpasses a certain threshold, SGDM begins to perform better than SGD. Furthermore, we observe that the advantage of SGDM over SGD becomes more pronounced with a larger batch size. For the question of "why", we find that momentum acceleration is closely related to abrupt sharpening, a sudden jump of the directional Hessian along the update direction. Specifically, the misalignment between SGD and SGDM happens at the same moment that SGD experiences abrupt sharpening and converges more slowly. Momentum improves the performance of SGDM by preventing or deferring the occurrence of abrupt sharpening. Together, this study unveils the interplay between momentum, learning rates, and batch sizes, thus improving our understanding of momentum acceleration.
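
To make the comparison protocol concrete, the sketch below implements a heavy-ball SGDM step and one common convention for the effective learning rate; the formula $\eta_{ef} = \eta / (b(1-\mu))$ is an assumption of this sketch and may not match the paper's exact definition.

```python
import numpy as np

def sgdm_step(x, g, buf, lr, mu):
    """One heavy-ball SGDM step: buf <- mu * buf + g; x <- x - lr * buf."""
    buf = mu * buf + g
    return x - lr * buf, buf

def effective_lr(lr, mu, batch_size):
    # One common convention for the effective learning rate of SGDM; this is an
    # assumption of the sketch and may differ from the paper's exact definition.
    return lr / (batch_size * (1.0 - mu))

# To compare SGD (mu = 0) and SGDM (mu = 0.9) at the same effective learning
# rate and batch size, SGDM's raw learning rate is scaled down by (1 - mu).
lr_sgd, mu, b = 0.1, 0.9, 32
lr_sgdm = lr_sgd * (1.0 - mu)
assert np.isclose(effective_lr(lr_sgd, 0.0, b), effective_lr(lr_sgdm, mu, b))
print(lr_sgdm)  # about 0.01
```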

ALO-VC: Any-to-any Low-latency One-shot Voice Conversion

Jun 01, 2023
Bohan Wang, Damien Ronssin, Milos Cernak

This paper presents ALO-VC, a non-parallel, low-latency, one-shot voice conversion method based on phonetic posteriorgrams (PPGs). ALO-VC enables any-to-any voice conversion using only one utterance from the target speaker, with only 47.5 ms of future look-ahead. The proposed hybrid signal-processing and machine-learning pipeline combines a pre-trained speaker encoder, a pitch predictor to predict the converted speech's prosody, and positional encoding to convey the phonemes' location information. We introduce two system versions: ALO-VC-R, which uses a pre-trained d-vector speaker encoder, and ALO-VC-E, which improves performance using the ECAPA-TDNN speaker encoder. The experimental results demonstrate that both ALO-VC-R and ALO-VC-E achieve performance comparable to non-causal baseline systems on the VCTK dataset and two out-of-domain datasets. Furthermore, both proposed systems can be deployed on a single CPU core with 55 ms latency and a 0.78 real-time factor. Our demo is available online.

* Accepted to Interspeech 2023. Some audio samples are available at https://bohan7.github.io/ALO-VC-demo/ 
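
To unpack the latency figures quoted above, here is a tiny sketch of the two quantities involved (look-ahead expressed in feature frames, and real-time factor); the 10 ms hop size is an assumption of the sketch, not taken from the paper.

```python
def lookahead_frames(lookahead_ms: float, hop_ms: float = 10.0) -> float:
    """Future look-ahead expressed in feature frames (hop size is an assumption)."""
    return lookahead_ms / hop_ms

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1 means the system converts speech faster than real time."""
    return processing_seconds / audio_seconds

print(lookahead_frames(47.5))        # 4.75 frames at an assumed 10 ms hop
print(real_time_factor(7.8, 10.0))   # 0.78, as reported for a single CPU core
```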

Convergence of AdaGrad for Non-convex Objectives: Simple Proofs and Relaxed Assumptions

May 29, 2023
Bohan Wang, Huishuai Zhang, Zhi-Ming Ma, Wei Chen

We provide a simple convergence proof for AdaGrad optimizing non-convex objectives under only affine noise variance and bounded smoothness assumptions. The proof is essentially based on a novel auxiliary function $\xi$ that helps eliminate the complexity of handling the correlation between the numerator and denominator of AdaGrad's update. Leveraging these simple proofs, we obtain tighter results than the existing ones (Faw et al., 2022) and extend the analysis to several new and important cases. Specifically, for the over-parameterized regime, we show that AdaGrad needs only $\mathcal{O}(\frac{1}{\varepsilon^2})$ iterations to ensure a gradient norm smaller than $\varepsilon$, which matches the rate of SGD and is significantly tighter than the existing $\mathcal{O}(\frac{1}{\varepsilon^4})$ rates for AdaGrad. We then discard the bounded smoothness assumption and consider a realistic smoothness assumption, the $(L_0,L_1)$-smooth condition, which allows the local smoothness to grow with the gradient norm. Again based on the auxiliary function $\xi$, we prove that AdaGrad converges under the $(L_0,L_1)$-smooth condition as long as the learning rate is below a threshold. Interestingly, we further show via proof by contradiction that this requirement on the learning rate is necessary under the $(L_0,L_1)$-smooth condition, in contrast to the case of uniform smoothness, where convergence is guaranteed regardless of the learning rate. Together, our analyses broaden the understanding of AdaGrad and demonstrate the power of the new auxiliary function in investigations of AdaGrad.

* COLT 2023 
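
The recursion in question is the standard AdaGrad update, whose entangled numerator (the gradient) and denominator (the accumulated squared gradients) are exactly what the auxiliary function $\xi$ is designed to handle. A minimal sketch follows; the objective and step size are illustrative assumptions, not the paper's setting.

```python
import numpy as np

def adagrad(grad, x0, lr=0.5, eps=1e-8, n_steps=200):
    """Standard per-coordinate AdaGrad with an illustrative step size."""
    x = np.array(x0, dtype=float)
    accum = np.zeros_like(x)                  # running sum of squared gradients
    for _ in range(n_steps):
        g = grad(x)
        accum += g * g                        # denominator of the adaptive step
        x -= lr * g / (np.sqrt(accum) + eps)  # numerator/denominator correlation
    return x

# Smooth non-convex objective f(x) = sum((x_i^2 - 1)^2), gradient 4x(x^2 - 1).
grad_f = lambda x: 4 * x * (x ** 2 - 1)
print(adagrad(grad_f, x0=[2.0, -0.5]))        # approaches a stationary point
```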

On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training

May 20, 2023
Jieyu Zhang, Bohan Wang, Zhengyu Hu, Pang Wei Koh, Alexander Ratner

Pre-training datasets are critical for building state-of-the-art machine learning models, motivating rigorous study of their impact on downstream tasks. In this work, we study the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset. Empirically, we find that with the size of the pre-training dataset fixed, the best downstream performance comes from a balance between intra- and inter-class diversity. To understand the underlying mechanism, we show theoretically that the downstream performance depends monotonically on both types of diversity. Notably, our theory reveals that the optimal class-to-sample ratio (#classes / #samples per class) is invariant to the size of the pre-training dataset, which motivates an application: predicting the optimal number of pre-training classes. We demonstrate the effectiveness of this application with an improvement of around 2 points on downstream tasks when using ImageNet as the pre-training dataset.

* 18 pages 
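
The size-invariance result suggests a simple recipe for the application mentioned above: estimate the optimal class-to-sample ratio on a small dataset and solve for the number of classes at the full budget. A minimal sketch with hypothetical numbers (the ratio and budget below are illustrative, not from the paper):

```python
import math

def optimal_num_classes(total_budget: int, optimal_ratio: float) -> int:
    """With a fixed budget N = C * m (C classes, m samples per class) and an
    optimal class-to-sample ratio r* = C / m, solving C * m = N and C / m = r*
    gives C = sqrt(r* * N)."""
    return round(math.sqrt(optimal_ratio * total_budget))

# Hypothetical numbers: a ratio estimated on a small pilot dataset (invariant to
# dataset size, per the theory) applied to a 1.2M-sample pre-training budget.
print(optimal_num_classes(1_200_000, 0.1))   # ~346 classes
```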

InfraDet3D: Multi-Modal 3D Object Detection based on Roadside Infrastructure Camera and LiDAR Sensors

Apr 29, 2023
Walter Zimmer, Joseph Birkner, Marcel Brucker, Huu Tung Nguyen, Stefan Petrovski, Bohan Wang, Alois C. Knoll

Current multi-modal object detection approaches focus on the vehicle domain and are limited in perception range and processing capability. Roadside sensor units (RSUs) introduce a new domain for perception systems and leverage altitude to observe traffic. Cameras and LiDARs mounted on gantry bridges increase the perception range and produce a full digital twin of the traffic. In this work, we introduce InfraDet3D, a multi-modal 3D object detector for roadside infrastructure sensors. We fuse two LiDARs using early fusion and further incorporate detections from monocular cameras to increase robustness and to detect small objects. Our monocular 3D detection module uses HD maps to ground object yaw hypotheses, improving the final perception results. The perception framework is deployed on a real-world intersection that is part of the A9 Test Stretch in Munich, Germany. We perform several ablation studies and experiments and show that fusing two LiDARs with two cameras leads to an improvement of +1.90 mAP compared to a camera-only solution. We evaluate our results on the A9 infrastructure dataset and achieve 68.48 mAP on the test set. The dataset and code will be available at https://a9-dataset.com to allow the research community to further improve the perception results and make autonomous driving safer.
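
The LiDAR branch relies on early fusion, i.e., merging raw point clouds in a common reference frame before running the 3D detector. A simplified sketch of that step is shown below; the extrinsics and point clouds are toy placeholders, and the deployed system additionally handles timestamps, intensities, and camera detections.

```python
import numpy as np

def early_fuse_lidars(points_a, points_b, T_a_to_ref, T_b_to_ref):
    """Early fusion of two LiDARs: transform both point clouds into a shared
    reference frame and concatenate them before detection (simplified sketch)."""
    def transform(points, T):                     # points: (N, 3), T: 4x4 pose
        homog = np.hstack([points, np.ones((points.shape[0], 1))])
        return (homog @ T.T)[:, :3]
    return np.vstack([transform(points_a, T_a_to_ref),
                      transform(points_b, T_b_to_ref)])

# Toy usage with identity extrinsics.
fused = early_fuse_lidars(np.random.rand(5, 3), np.random.rand(7, 3),
                          np.eye(4), np.eye(4))
print(fused.shape)   # (12, 3)
```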

Regularization of polynomial networks for image recognition

Mar 24, 2023
Grigorios G Chrysos, Bohan Wang, Jiankang Deng, Volkan Cevher

Deep Neural Networks (DNNs) have obtained impressive performance across tasks; however, they remain black boxes that are, e.g., hard to analyze theoretically. At the same time, Polynomial Networks (PNs) have emerged as an alternative with promising performance and improved interpretability, but they have yet to reach the performance of powerful DNN baselines. In this work, we aim to close this performance gap. We introduce a class of PNs that reach the performance of ResNet across a range of six benchmarks. We demonstrate that strong regularization is critical and conduct an extensive study of the exact regularization schemes required to match this performance. To further motivate the regularization schemes, we introduce D-PolyNets, which achieve a higher degree of expansion than previously proposed polynomial networks. D-PolyNets are more parameter-efficient while achieving performance similar to other polynomial networks. We expect that our new models can lead to an understanding of the role of elementwise activation functions (which are no longer required for training PNs). The source code is available at https://github.com/grigorisg9gr/regularized_polynomials.

* Accepted at CVPR'23 
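
For intuition about what a polynomial network computes, here is a minimal degree-2 block in the spirit of the authors' earlier Π-Nets: a Hadamard product of linear maps plus a skip term, with no elementwise activation. The exact PN and D-PolyNet parameterizations in the paper differ, so treat this purely as an illustrative sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def poly_block(x, W1, W2, U):
    """Degree-2 polynomial block: a Hadamard product of two linear maps plus a
    linear skip term, so the output is a polynomial of the input without any
    elementwise activation function."""
    return (W1 @ x) * (W2 @ x) + U @ x

d_in, d_out = 8, 8
W1, W2, U = (rng.standard_normal((d_out, d_in)) * 0.1 for _ in range(3))
x = rng.standard_normal(d_in)
y = poly_block(x, W1, W2, U)   # stacking such blocks multiplies the degree
print(y.shape)
```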