Anastasios Kyrillidis

Better Schedules for Low Precision Training of Deep Neural Networks

Mar 04, 2024

Cameron R. Wolfe, Anastasios Kyrillidis

Abstract:Low precision training can significantly reduce the computational overhead of training deep neural networks (DNNs). Though many such techniques exist, cyclic precision training (CPT), which dynamically adjusts precision throughout training according to a cyclic schedule, achieves particularly impressive improvements in training efficiency, while actually improving DNN performance. Existing CPT implementations take common learning rate schedules (e.g., cyclical cosine schedules) and use them for low precision training without adequate comparisons to alternative scheduling options. We define a diverse suite of CPT schedules and analyze their performance across a variety of DNN training regimes, some of which are unexplored in the low precision training literature (e.g., node classification with graph neural networks). From these experiments, we discover alternative CPT schedules that offer further improvements in training efficiency and model performance, as well as derive a set of best practices for choosing CPT schedules. Going further, we find that a correlation exists between model performance and training cost, and that changing the underlying CPT schedule can control the tradeoff between these two variables. To explain the direct correlation between model performance and training cost, we draw a connection between quantized training and critical learning periods, suggesting that aggressive quantization is a form of learning impairment that can permanently damage model performance.

* Machine Learning (2024): 1-19
* 20 pages, 8 figures, 1 table, ACML 2023
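
The cyclical cosine schedule mentioned in the abstract above can be sketched roughly as follows. This is only an illustration of the general idea, not the authors' implementation: the function name `cyclic_cosine_precision`, the bit-width range, and the choice to restart each cycle at the lowest precision are all assumptions made for the example.

```python
import math

def cyclic_cosine_precision(step, total_steps, num_cycles=8, min_bits=3, max_bits=8):
    """Hypothetical helper: bit-width to use at a given training step.

    Assumption: within each cycle the precision rises from min_bits to
    max_bits along a half-cosine curve, then restarts at min_bits.
    """
    cycle_len = max(1, total_steps // num_cycles)
    pos = (step % cycle_len) / cycle_len  # position inside the cycle, in [0, 1)
    bits = min_bits + 0.5 * (max_bits - min_bits) * (1.0 - math.cos(math.pi * pos))
    return int(round(bits))

# Bit-widths over the first two cycles of an 80-step run (10 steps per cycle).
schedule = [cyclic_cosine_precision(t, total_steps=80) for t in range(20)]
```

Swapping the cosine expression for another curve (linear, exponential, and so on) would give a different member of the kind of schedule suite the abstract refers to.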


On the Error-Propagation of Inexact Deflation for Principal Component Analysis

Oct 06, 2023

Fangshuo Liao, Junhyung Lyle Kim, Cruz Barnum, Anastasios Kyrillidis

Abstract:Principal Component Analysis (PCA) is a popular tool in data analysis, especially when the data is high-dimensional. PCA aims to find subspaces, spanned by the so-called \textit{principal components}, that best explain the variance in the dataset. The deflation method is a popular meta-algorithm -- used to discover such subspaces -- that sequentially finds individual principal components, starting from the most important one and working its way towards the less important ones. However, due to its sequential nature, the numerical error introduced by not estimating principal components exactly -- e.g., due to numerical approximations through this process -- propagates as deflation proceeds. To the best of our knowledge, this is the first work that mathematically characterizes the error propagation of the inexact deflation method, and this is the key contribution of this paper. We provide two main results: $i)$ when the sub-routine for finding the leading eigenvector is generic, and $ii)$ when power iteration is used as the sub-routine. In the latter case, the additional directional information from power iteration allows us to obtain a tighter error bound than in the sub-routine-agnostic case. As a result, we provide an explicit characterization of how the error progresses and affects subsequent principal component estimations for this fundamental problem.
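
To make the setting in the abstract above concrete, here is a minimal pure-Python sketch of deflation with power iteration as the sub-routine. It is an illustration under stated assumptions, not the paper's procedure: the helper names are hypothetical, the deflation step uses the classical Hotelling update (subtracting the estimated eigenvalue times the outer product of the estimated eigenvector), and the input is assumed to be a small symmetric positive semidefinite matrix given as nested lists.

```python
import math

def mat_vec(A, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def normalize(v):
    """Scale a nonzero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def power_iteration(A, num_iters=200):
    """Estimate the leading eigenpair of a symmetric PSD matrix."""
    v = normalize([1.0 + i for i in range(len(A))])  # fixed, deterministic start
    for _ in range(num_iters):
        v = normalize(mat_vec(A, v))
    eigval = sum(x * y for x, y in zip(v, mat_vec(A, v)))  # Rayleigh quotient
    return eigval, v

def deflate(A, eigval, v):
    """Hotelling deflation (an assumption here): A - eigval * v v^T."""
    n = len(A)
    return [[A[i][j] - eigval * v[i] * v[j] for j in range(n)] for i in range(n)]

def top_k_components(A, k, num_iters=200):
    """Find k eigenpairs sequentially; each estimate feeds the next deflation."""
    components = []
    for _ in range(k):
        eigval, v = power_iteration(A, num_iters)
        components.append((eigval, v))
        A = deflate(A, eigval, v)  # any error in (eigval, v) is carried forward
    return components

# Small symmetric example whose eigenvalues are 3 and 1.
components = top_k_components([[2.0, 1.0], [1.0, 2.0]], k=2)
```

Because `power_iteration` stops after a finite number of steps, each returned pair is only approximate, and `deflate` bakes that approximation into the matrix used for every later component. Lowering `num_iters` makes this carried-forward error easier to observe.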


CrysFormer: Protein Structure Prediction via 3d Patterson Maps and Partial Structure Attention

Oct 05, 2023

Chen Dun, Qiutai Pan, Shikai Jin, Ria Stevens, Mitchell D. Miller, George N. Phillips, Jr., Anastasios Kyrillidis

Abstract:Determining the structure of a protein has been a decades-long open question. Computing a protein's three-dimensional structure often incurs nontrivial computation costs when classical simulation algorithms are utilized. Advances in the transformer neural network architecture -- such as AlphaFold2 -- achieve significant improvements for this problem, by learning from a large dataset of sequence information and corresponding protein structures. Yet, such methods only focus on sequence information; other available prior knowledge, such as protein crystallography and partial structure of amino acids, could potentially be utilized. To the best of our knowledge, we propose the first transformer-based model that directly utilizes protein crystallography and partial structure information to predict the electron density maps of proteins. Via two new datasets of peptide fragments (2-residue and 15-residue), we demonstrate that our method, dubbed \texttt{CrysFormer}, can achieve accurate predictions based on a much smaller dataset size and with reduced computation costs.


Sweeping Heterogeneity with Smart MoPs: Mixture of Prompts for LLM Task Adaptation

Oct 05, 2023

Chen Dun, Mirian Hipolito Garcia, Guoqing Zheng, Ahmed Hassan Awadallah, Anastasios Kyrillidis, Robert Sim

Abstract:Large Language Models (LLMs) have the ability to solve a variety of tasks, such as text summarization and mathematical questions, just out of the box, but they are often trained with a single task in mind. Due to high computational costs, the current trend is to use prompt instruction tuning to better adjust monolithic, pretrained LLMs for new -- but often individual -- downstream tasks. Thus, how one would expand prompt tuning to handle -- concomitantly -- heterogeneous tasks and data distributions is a wide-open question. To address this gap, we suggest the use of \emph{Mixture of Prompts}, or MoPs, associated with smart gating functionality: the latter -- whose design is one of the contributions of this paper -- can identify relevant skills embedded in different groups of prompts and dynamically assign combined experts (i.e., collections of prompts), based on the target task. Additionally, MoPs are empirically agnostic to any model compression technique applied -- for efficiency reasons -- as well as to the instruction data source and task composition. In practice, MoPs can simultaneously mitigate prompt training "interference" in multi-task, multi-source scenarios (e.g., task and data heterogeneity across sources), as well as possible implications from model approximations. As a highlight, MoPs manage to decrease final perplexity from $\sim20\%$ up to $\sim70\%$, as compared to baselines, in the federated scenario, and from $\sim 3\%$ up to $\sim30\%$ in the centralized scenario.


Stochastic Implicit Neural Signed Distance Functions for Safe Motion Planning under Sensing Uncertainty

Sep 28, 2023

Carlos Quintero-Peña, Wil Thomason, Zachary Kingston, Anastasios Kyrillidis, Lydia E. Kavraki

Abstract:Motion planning under sensing uncertainty is critical for robots in unstructured environments to guarantee safety for both the robot and any nearby humans. Most work on planning under uncertainty does not scale to high-dimensional robots such as manipulators, assumes simplified geometry of the robot or environment, or requires per-object knowledge of noise. Instead, we propose a method that directly models sensor-specific aleatoric uncertainty to find safe motions for high-dimensional systems in complex environments, without exact knowledge of environment geometry. We combine a novel implicit neural model of stochastic signed distance functions with a hierarchical optimization-based motion planner to plan low-risk motions without sacrificing path quality. Our method also explicitly bounds the risk of the path, offering trustworthiness. We empirically validate that our method produces safe motions and accurate risk bounds and is safer than baseline approaches.

* 8 pages, 4 figures, 1 table. Submitted to the 2024 IEEE International Conference on Robotics and Automation


Fast FixMatch: Faster Semi-Supervised Learning with Curriculum Batch Size

Sep 07, 2023

John Chen, Chen Dun, Anastasios Kyrillidis

Abstract:Advances in Semi-Supervised Learning (SSL) have almost entirely closed the gap between SSL and Supervised Learning at a fraction of the number of labels. However, recent performance improvements have often come \textit{at the cost of significantly increased training computation}. To address this, we propose Curriculum Batch Size (CBS), \textit{an unlabeled batch size curriculum which exploits the natural training dynamics of deep neural networks.} A small unlabeled batch size is used at the beginning of training and is gradually increased towards the end of training. A fixed curriculum is used regardless of dataset, model or number of epochs, and reduced training computation is demonstrated in all settings. We apply CBS, strong labeled augmentation, and Curriculum Pseudo Labeling (CPL) \citep{FlexMatch} to FixMatch \citep{FixMatch} and term the new SSL algorithm Fast FixMatch. We perform an ablation study to show that strong labeled augmentation and/or CPL do not significantly reduce training computation on their own, but, in synergy with CBS, they achieve optimal performance. Fast FixMatch also achieves substantially higher data utilization compared to the previous state of the art. Fast FixMatch achieves between $2.1\times$ and $3.4\times$ reduced training computation on CIFAR-10 with all but 40, 250 and 4000 labels removed, compared to vanilla FixMatch, while attaining the same cited state-of-the-art error rate \citep{FixMatch}. Similar results are achieved for CIFAR-100, SVHN and STL-10. Finally, Fast FixMatch achieves between $2.6\times$ and $3.3\times$ reduced training computation in federated SSL tasks and online/streaming learning SSL tasks, which further demonstrates the generalizability of Fast FixMatch to different scenarios and tasks.
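
The abstract above states only that the unlabeled batch size starts small and grows over training; it does not give the curriculum's exact shape. The sketch below therefore assumes a simple linear ramp purely for illustration. The function name `curriculum_batch_size` and the start and end sizes are hypothetical, not values from the paper.

```python
def curriculum_batch_size(step, total_steps, start_size=64, end_size=448):
    """Hypothetical unlabeled-batch-size curriculum.

    Assumption: the size grows linearly from start_size at the first step
    to end_size at the last step; the paper's actual curve may differ.
    """
    last_step = max(total_steps - 1, 1)
    frac = min(max(step / last_step, 0.0), 1.0)  # training progress in [0, 1]
    return int(round(start_size + frac * (end_size - start_size)))

# Unlabeled batch sizes sampled across a 1000-step run.
sizes = [curriculum_batch_size(t, total_steps=1000) for t in range(0, 1000, 100)]
```

Since early steps process fewer unlabeled samples, the total number of samples seen (and hence the training computation) is lower than with a constant batch of `end_size`, which is the kind of saving the abstract reports.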


Federated Learning Over Images: Vertical Decompositions and Pre-Trained Backbones Are Difficult to Beat

Sep 06, 2023

Erdong Hu, Yuxin Tang, Anastasios Kyrillidis, Chris Jermaine

Abstract:We carefully evaluate a number of algorithms for learning in a federated environment, and test their utility for a variety of image classification tasks. We consider many issues that have not been adequately considered before: whether learning over data sets that do not have diverse sets of images affects the results; whether to use a pre-trained feature extraction "backbone"; how to evaluate learner performance (we argue that classification accuracy is not enough), among others. Overall, across a wide variety of settings, we find that vertically decomposing a neural network seems to give the best results, and outperforms more standard reconciliation-used methods.

* 16 pages, 7 figures, Accepted at ICCV2023


Adaptive Federated Learning with Auto-Tuned Clients

Jun 19, 2023

Junhyung Lyle Kim, Mohammad Taha Toghani, César A. Uribe, Anastasios Kyrillidis

Abstract:Federated learning (FL) is a distributed machine learning framework where the global model of a central server is trained via multiple collaborative steps by participating clients without sharing their data. While being a flexible framework, where the distribution of local data, participation rate, and computing power of each client can greatly vary, such flexibility gives rise to many new challenges, especially in the hyperparameter tuning on both the server and the client side. We propose $\Delta$-SGD, a simple step size rule for SGD that enables each client to use its own step size by adapting to the local smoothness of the function each client is optimizing. We provide theoretical and empirical results where the benefit of the client adaptivity is shown in various FL scenarios. In particular, our proposed method achieves TOP-1 accuracy in 73% and TOP-2 accuracy in 100% of the experiments considered without additional tuning.


Fed-ZERO: Efficient Zero-shot Personalization with Federated Mixture of Experts

Jun 14, 2023

Chen Dun, Mirian Hipolito Garcia, Guoqing Zheng, Ahmed Hassan Awadallah, Robert Sim, Anastasios Kyrillidis, Dimitrios Dimitriadis

Abstract:One of the goals in Federated Learning (FL) is to create personalized models that can adapt to the context of each participating client, while utilizing knowledge from a shared global model. Yet, often, personalization requires a fine-tuning step using clients' labeled data in order to achieve good performance. This may not be feasible in scenarios where incoming clients are fresh and/or have privacy concerns. It thus remains an open question how one can achieve zero-shot personalization in these scenarios. We propose a novel solution by using a Mixture-of-Experts (MoE) framework within an FL setup. Our method leverages the diversity of the clients to train specialized experts on different subsets of classes, and a gating function to route the input to the most relevant expert(s). Our gating function harnesses the knowledge of a pretrained model common expert to enhance its routing decisions on-the-fly. As a highlight, our approach can improve accuracy by up to 18\% in state-of-the-art FL settings, while maintaining competitive zero-shot performance. In practice, our method can handle non-homogeneous data distributions, scale more efficiently, and improve the state-of-the-art performance on common FL benchmarks.

* 14 Pages


Accelerated Convergence of Nesterov's Momentum for Deep Neural Networks under Partial Strong Convexity

Jun 13, 2023

Fangshuo Liao, Anastasios Kyrillidis

Abstract:Current state-of-the-art analyses on the convergence of gradient descent for training neural networks focus on characterizing properties of the loss landscape, such as the Polyak-Łojasiewicz (PL) condition and restricted strong convexity. While gradient descent converges linearly under such conditions, it remains an open question whether Nesterov's momentum enjoys accelerated convergence under similar settings and assumptions. In this work, we consider a new class of objective functions, where only a subset of the parameters satisfies strong convexity, and show that Nesterov's momentum achieves acceleration in theory for this objective class. We provide two realizations of the problem class, one of which is deep ReLU networks, which -- to the best of our knowledge -- makes this work the first to prove an accelerated convergence rate for non-trivial neural network architectures.
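
As background for the abstract above, the following sketch compares plain gradient descent with Nesterov's momentum on a small, fully strongly convex quadratic. This is the textbook setting where acceleration is already known, not the partially strongly convex class the paper studies. The objective, the step size $1/L$, and the momentum value $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ are standard choices assumed for the example, and all function names are hypothetical.

```python
import math

def grad(a, x):
    """Gradient of f(x) = 0.5 * sum(a_i * x_i^2)."""
    return [ai * xi for ai, xi in zip(a, x)]

def loss(a, x):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

def gradient_descent(a, x0, steps):
    lr = 1.0 / max(a)  # step size 1/L, with L the largest curvature
    x = list(x0)
    for _ in range(steps):
        x = [xi - lr * gi for xi, gi in zip(x, grad(a, x))]
    return x

def nesterov(a, x0, steps):
    L, mu = max(a), min(a)
    lr = 1.0 / L
    kappa = L / mu  # condition number
    beta = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
    x_prev, x = list(x0), list(x0)
    for _ in range(steps):
        # Look-ahead point, then a gradient step taken from that point.
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev, x = x, [yi - lr * gi for yi, gi in zip(y, grad(a, y))]
    return x

a = [1.0, 100.0]  # curvatures, so the condition number is 100
x0 = [1.0, 1.0]
gd_loss = loss(a, gradient_descent(a, x0, 100))
nag_loss = loss(a, nesterov(a, x0, 100))
```

On this ill-conditioned example the momentum iterate ends with a much smaller loss after the same number of gradient evaluations, reflecting the dependence on $\sqrt{\kappa}$ rather than $\kappa$ that "acceleration" refers to.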
