Yaoqing Yang

Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training

Dec 01, 2023
Yefan Zhou, Tianyu Pang, Keqin Liu, Charles H. Martin, Michael W. Mahoney, Yaoqing Yang

Regularization in modern machine learning is crucial, and it can take various forms in algorithmic design: training set, model family, error function, regularization terms, and optimization. In particular, the learning rate, which can be interpreted as a temperature-like parameter within the statistical mechanics of learning, plays a crucial role in neural network training. Indeed, many widely adopted training strategies basically just define the decay of the learning rate over time. This process can be interpreted as decreasing a temperature, using either a global learning rate (for the entire model) or a learning rate that varies for each parameter. This paper proposes TempBalance, a straightforward yet effective layer-wise learning rate method. TempBalance is based on Heavy-Tailed Self-Regularization (HT-SR) Theory, an approach that characterizes the implicit self-regularization of different layers in trained models. We demonstrate the efficacy of using HT-SR-motivated metrics to guide the scheduling and balancing of temperature across all network layers during model training, resulting in improved performance during testing. We implement TempBalance on the CIFAR10, CIFAR100, SVHN, and TinyImageNet datasets using ResNets, VGGs, and WideResNets with various depths and widths. Our results show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization. We also show that TempBalance outperforms a number of state-of-the-art optimizers and learning rate schedulers.
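
To make the layer-wise idea concrete, here is a minimal sketch, assuming we estimate each layer's heavy-tailed exponent by a Hill-type fit to the tail of the empirical spectral density (ESD) of W^T W and then rescale per-layer learning rates around a base rate. The function names, the Hill fit, and the linear rescaling rule are illustrative assumptions, not the paper's exact algorithm.

    import numpy as np
    import torch

    def esd_alpha(weight: torch.Tensor, tail_frac: float = 0.5) -> float:
        """Hill-type estimate of the power-law tail exponent of the ESD of W^T W."""
        W = weight.detach().reshape(weight.shape[0], -1)
        eigs = torch.linalg.svdvals(W).cpu().numpy() ** 2   # eigenvalues of W^T W
        eigs = np.sort(eigs)[::-1]
        k = max(2, int(tail_frac * len(eigs)))              # tail size used in the fit
        tail = eigs[:k]
        return 1.0 + k / float(np.sum(np.log(tail / tail[-1])))

    def tempbalance_lrs(model: torch.nn.Module, base_lr: float) -> dict:
        """Per-layer rates proportional to alpha, normalized so they average base_lr.
        Illustrative rule only: the paper's mapping from ESD metrics to rates differs."""
        alphas = {name: esd_alpha(p) for name, p in model.named_parameters()
                  if p.dim() >= 2}
        mean_alpha = float(np.mean(list(alphas.values())))
        return {name: base_lr * a / mean_alpha for name, a in alphas.items()}

A schedule in this spirit would recompute the per-layer rates periodically during training, rebalancing the temperature as each layer's ESD evolves.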

* NeurIPS 2023 Spotlight; the first two authors contributed equally 

A Three-regime Model of Network Pruning

May 28, 2023
Yefan Zhou, Yaoqing Yang, Arin Chang, Michael W. Mahoney

Recent work has highlighted the complex influence that training hyperparameters (e.g., the number of training epochs) can have on the prunability of machine learning models. Perhaps surprisingly, a systematic approach to predicting precisely how adjusting a specific hyperparameter will affect prunability remains elusive. To address this gap, we introduce a phenomenological model grounded in the statistical mechanics of learning. Our approach uses temperature-like and load-like parameters to model the impact of neural network (NN) training hyperparameters on pruning performance. A key empirical result we identify is a sharp transition phenomenon: depending on the value of a load-like parameter in the pruned model, increasing the value of a temperature-like parameter in the pre-pruned model may either enhance or impair subsequent pruning performance. Based on this transition, we build a three-regime model by taxonomizing the global structure of the pruned NN loss landscape. Our model reveals that the dichotomous effect of high temperature is associated with transitions between distinct types of global structures in the post-pruned model. Based on our results, we present three case studies: 1) determining whether to increase or decrease a hyperparameter for improved pruning; 2) selecting the best model to prune from a family of models; and 3) tuning the hyperparameter of the Sharpness-Aware Minimization method for better pruning performance.

* ICML 2023 

When are ensembles really effective?

May 21, 2023
Ryan Theisen, Hyunsuk Kim, Yaoqing Yang, Liam Hodgkinson, Michael W. Mahoney

Ensembling has a long history in statistical data analysis, with many impactful applications. However, in many modern machine learning settings, the benefits of ensembling are less ubiquitous and less obvious. We study, both theoretically and empirically, the fundamental question of when ensembling yields significant performance improvements in classification tasks. Theoretically, we prove new results relating the ensemble improvement rate (a measure of how much ensembling decreases the error rate versus a single model, on a relative scale) to the disagreement-error ratio. We show that ensembling improves performance significantly whenever the disagreement rate is large relative to the average error rate; and that, conversely, one classifier is often enough whenever the disagreement rate is low relative to the average error rate. On the way to proving these results, we derive, under a mild condition called competence, improved upper and lower bounds on the average test error rate of the majority vote classifier. To complement this theory, we study ensembling empirically in a variety of settings, verifying the predictions made by our theory, and identifying practical scenarios where ensembling does and does not result in large performance improvements. Perhaps most notably, we demonstrate a distinct difference in behavior between interpolating models (popular in current practice) and non-interpolating models (such as tree-based methods, where ensembling is popular), demonstrating that ensembling helps considerably more in the latter case than in the former.
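
The quantities in the abstract are straightforward to estimate from held-out predictions. A minimal sketch, assuming majority voting over integer class predictions and taking the disagreement rate to be the average pairwise disagreement; the paper's exact definitions and normalizations may differ.

    import numpy as np

    def ensemble_stats(preds: np.ndarray, y: np.ndarray) -> dict:
        """preds: (n_models, n_samples) integer predictions; y: (n_samples,) labels."""
        n_models, n = preds.shape
        avg_err = float(np.mean(preds != y))                  # average single-model error
        pair_dis = [np.mean(preds[i] != preds[j])             # pairwise disagreement
                    for i in range(n_models) for j in range(i + 1, n_models)]
        disagreement = float(np.mean(pair_dis))
        maj = np.array([np.bincount(preds[:, t]).argmax() for t in range(n)])
        mv_err = float(np.mean(maj != y))                     # majority-vote error
        return {
            "avg_err": avg_err,
            "mv_err": mv_err,
            "disagreement_error_ratio": disagreement / max(avg_err, 1e-12),
            "ensemble_improvement_rate": (avg_err - mv_err) / max(avg_err, 1e-12),
        }

Under the abstract's criterion, a large disagreement-error ratio signals that ensembling should help appreciably, while a small one signals that a single classifier is often enough.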


Neurotoxin: Durable Backdoors in Federated Learning

Jun 12, 2022
Zhengming Zhang, Ashwinee Panda, Linyue Song, Yaoqing Yang, Michael W. Mahoney, Joseph E. Gonzalez, Kannan Ramchandran, Prateek Mittal

Due to their decentralized nature, federated learning (FL) systems are inherently vulnerable during training to adversarial backdoor attacks. In this type of attack, the goal of the attacker is to use poisoned updates to implant so-called backdoors into the learned model such that, at test time, the model's outputs can be fixed to a given target for certain inputs. (As a simple toy example, if a user types "people from New York" into a mobile keyboard app that uses a backdoored next word prediction model, then the model could autocomplete the sentence to "people from New York are rude"). Prior work has shown that backdoors can be inserted into FL models, but these backdoors are often not durable, i.e., they do not remain in the model after the attacker stops uploading poisoned updates. Thus, since training typically continues progressively in production FL systems, an inserted backdoor may not survive until deployment. Here, we propose Neurotoxin, a simple one-line modification to existing backdoor attacks that acts by attacking parameters that are changed less in magnitude during training. We conduct an exhaustive evaluation across ten natural language processing and computer vision tasks, and we find that we can double the durability of state-of-the-art backdoors.
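
The one-line modification described above can be sketched as a coordinate mask: assuming the attacker can observe an aggregate benign update (e.g., the change in the global model between rounds), the attack projects its gradient onto the coordinates that moved least in magnitude. The name neurotoxin_mask and the keep fraction are illustrative.

    import torch

    def neurotoxin_mask(benign_update: torch.Tensor, keep_frac: float = 0.95) -> torch.Tensor:
        """1 on the coordinates with the smallest |benign update|, 0 elsewhere."""
        flat = benign_update.abs().flatten()
        k = max(1, int(keep_frac * flat.numel()))
        threshold = torch.kthvalue(flat, k).values
        return (benign_update.abs() <= threshold).float()

    # Illustrative use inside the attacker's local step, per parameter tensor:
    #   grad = grad * neurotoxin_mask(observed_global_update)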

* Appears in ICML 2022 

Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data

Feb 06, 2022
Yaoqing Yang, Ryan Theisen, Liam Hodgkinson, Joseph E. Gonzalez, Kannan Ramchandran, Charles H. Martin, Michael W. Mahoney

The search for effective and robust generalization metrics has been the focus of recent theoretical and empirical work. In this paper, we discuss the performance of natural language processing (NLP) models, and we evaluate various existing and novel generalization metrics. Compared to prior studies, we (i) focus on NLP instead of computer vision (CV), (ii) focus on generalization metrics that predict test error instead of the generalization gap, (iii) focus on generalization metrics that do not need access to data, and (iv) focus on the heavy-tail (HT) phenomenon that has received comparatively less attention in the study of deep neural networks (NNs). We extend recent HT-based work, which focuses on power law (PL) distributions, and we study exponential (EXP) and exponentially truncated power law (E-TPL) fits to the empirical spectral densities (ESDs) of weight matrices. Our detailed empirical studies show that (i) shape metrics, or the metrics obtained from fitting the shape of the ESDs, perform uniformly better at predicting generalization performance than the scale metrics commonly studied in the literature, as measured by the average rank correlations with the generalization performance across all of our experiments; (ii) among the forty generalization metrics studied in our paper, the rand_distance metric, a new shape metric introduced in this paper that measures the distance between the empirical eigenvalues of weight matrices and those of randomly initialized weight matrices, achieves the highest worst-case rank correlation with generalization performance under a variety of training settings; and (iii) among the three HT distributions considered in our paper, the E-TPL fit to the ESDs performs the most robustly.
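
To fix ideas, here is a minimal sketch contrasting a scale metric (the log spectral norm) with a distance-to-random-initialization metric in the spirit of rand_distance. The helper names, the re-initialization scheme, and the mean absolute eigenvalue distance are assumptions; the paper's exact fits and distances differ.

    import numpy as np
    import torch

    def esd(W: torch.Tensor) -> np.ndarray:
        """Support of the empirical spectral density: eigenvalues of W^T W."""
        M = W.detach().reshape(W.shape[0], -1)
        return torch.linalg.svdvals(M).cpu().numpy() ** 2

    def scale_metric(W: torch.Tensor) -> float:
        return float(np.log(esd(W).max()))                    # log spectral norm

    def rand_distance_like(W: torch.Tensor, seed: int = 0) -> float:
        """Distance between sorted ESDs of the trained W and a random re-init."""
        g = torch.Generator().manual_seed(seed)
        W0 = torch.randn(W.shape, generator=g) * float(W.detach().std())
        return float(np.mean(np.abs(np.sort(esd(W)) - np.sort(esd(W0)))))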


The Effect of Model Size on Worst-Group Generalization

Dec 08, 2021
Alan Pham, Eunice Chan, Vikranth Srivatsa, Dhruba Ghosh, Yaoqing Yang, Yaodong Yu, Ruiqi Zhong, Joseph E. Gonzalez, Jacob Steinhardt

Overparameterization has been shown to result in poor test accuracy on rare subgroups in a variety of settings where subgroup information is known. To gain a more complete picture, we consider the case where subgroup information is unknown. We investigate the effect of model size on worst-group generalization under empirical risk minimization (ERM) across a wide range of settings, varying: 1) architectures (ResNet, VGG, or BERT), 2) domains (vision or natural language processing), 3) model size (width or depth), and 4) initialization (with pre-trained or random weights). Our systematic evaluation reveals that increasing model size does not hurt, and may help, worst-group test performance under ERM across all setups. In particular, increasing pre-trained model size consistently improves performance on Waterbirds and MultiNLI. We advise practitioners to use larger pre-trained models when subgroup labels are unknown.
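
For reference, worst-group test performance here is the minimum per-group accuracy; a minimal sketch, assuming integer group annotations on the evaluation set.

    import numpy as np

    def worst_group_accuracy(pred: np.ndarray, y: np.ndarray, group: np.ndarray) -> float:
        """Minimum accuracy over subgroups; pred, y, group are (n_samples,) arrays."""
        return min(float(np.mean(pred[group == g] == y[group == g]))
                   for g in np.unique(group))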

* The first four authors contributed equally to the work 

A Dataset-Dispersion Perspective on Reconstruction Versus Recognition in Single-View 3D Reconstruction Networks

Nov 30, 2021
Yefan Zhou, Yiru Shen, Yujun Yan, Chen Feng, Yaoqing Yang

Neural networks (NNs) for single-view 3D reconstruction (SVR) have gained in popularity. Recent work points out that for SVR, most cutting-edge NNs have limited performance on reconstructing unseen objects because they rely primarily on recognition (i.e., classification-based methods) rather than shape reconstruction. To understand this issue in depth, we provide a systematic study of when and why NNs prefer recognition to reconstruction and vice versa. Our findings show that a leading factor determining recognition versus reconstruction is how dispersed the training data is. Thus, we introduce the dispersion score, a new data-driven metric, to quantify this leading factor and study its effect on NNs. We hypothesize that NNs are biased toward recognition when training images are more dispersed and training shapes are less dispersed. Our hypothesis is supported, and the dispersion score proves effective, in our experiments on synthetic and benchmark datasets. We show that the proposed metric is a principled way to analyze reconstruction quality and provides novel information in addition to the conventional reconstruction score.
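
As one plausible, but hypothetical, instantiation of such a measure, the dispersion of a training set can be summarized by the mean pairwise distance between feature embeddings. The paper's dispersion score is defined differently; this sketch is only to fix ideas.

    import numpy as np

    def mean_pairwise_dispersion(features: np.ndarray) -> float:
        """features: (n_samples, d) embeddings; O(n^2) memory, so subsample first."""
        diffs = features[:, None, :] - features[None, :, :]
        return float(np.mean(np.linalg.norm(diffs, axis=-1)))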

* Accepted to 3DV 2021 

Augmentations in Graph Contrastive Learning: Current Methodological Flaws & Towards Better Practices

Nov 05, 2021
Puja Trivedi, Ekdeep Singh Lubana, Yujun Yan, Yaoqing Yang, Danai Koutra

Graph classification has applications in bioinformatics, social sciences, automated fake news detection, web document classification, and more. In many practical scenarios, including web-scale applications, where labels are scarce or hard to obtain, unsupervised learning is a natural paradigm, but it trades off performance. Recently, contrastive learning (CL) has enabled unsupervised computer vision models to compete well against supervised ones. Theoretical and empirical works analyzing visual CL frameworks find that leveraging large datasets and domain-aware augmentations is essential for framework success. Interestingly, graph CL frameworks often report high performance while using orders-of-magnitude smaller data and employing domain-agnostic augmentations (e.g., node or edge dropping, feature perturbations) that can corrupt the graphs' underlying properties. Motivated by these discrepancies, we seek to determine: (i) why existing graph CL frameworks perform well despite weak augmentations and limited data; and (ii) whether adhering to visual CL principles can improve performance on graph classification tasks. Through extensive analysis, we identify flawed practices in graph data augmentation and evaluation protocols that are commonly used in the graph CL literature, and we propose improved practices and sanity checks for future research and applications. We show that on small benchmark datasets, the inductive bias of graph neural networks can significantly compensate for the limitations of existing frameworks. In case studies with relatively larger graph classification tasks, we find that commonly used domain-agnostic augmentations perform poorly, while adhering to principles in visual CL can significantly improve performance. For example, in graph-based document classification, which can be used for better web search, we show that task-relevant augmentations improve accuracy by 20%.
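
For concreteness, a typical domain-agnostic augmentation of the kind critiqued above is uniform random edge dropping. A minimal sketch over an edge-list representation; the function name and drop rate are illustrative.

    import numpy as np

    def drop_edges(edge_index: np.ndarray, p: float = 0.2, rng=None) -> np.ndarray:
        """edge_index: (2, n_edges) integer array; drop each edge independently with probability p."""
        rng = rng or np.random.default_rng()
        keep = rng.random(edge_index.shape[1]) >= p
        return edge_index[:, keep]

Because the edges are dropped uniformly, such a transform can delete exactly the substructures that carry the label, which is the kind of corruption the abstract warns about.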

* 8 pages, 4 figures 

Taxonomizing local versus global structure in neural network loss landscapes

Jul 23, 2021
Yaoqing Yang, Liam Hodgkinson, Ryan Theisen, Joe Zou, Joseph E. Gonzalez, Kannan Ramchandran, Michael W. Mahoney

Viewing neural network models in terms of their loss landscapes has a long history in the statistical mechanics approach to learning, and in recent years it has received attention within machine learning proper. Among other things, local metrics (such as the smoothness of the loss landscape) have been shown to correlate with global properties of the model (such as good generalization). Here, we perform a detailed empirical analysis of the loss landscape structure of thousands of neural network models, systematically varying learning tasks, model architectures, and/or quantity/quality of data. By considering a range of metrics that attempt to capture different aspects of the loss landscape, we demonstrate that the best test accuracy is obtained when: the loss landscape is globally well-connected; ensembles of trained models are more similar to each other; and models converge to locally smooth regions. We also show that globally poorly-connected landscapes can arise when models are small or when they are trained on lower-quality data; and that, if the loss landscape is globally poorly-connected, then training to zero loss can actually lead to worse test accuracy. Based on these results, we develop a simple one-dimensional model with load-like and temperature-like parameters, we introduce the notion of an effective loss landscape depending on these parameters, and we interpret our results in terms of a rugged convexity of the loss landscape. When viewed through this lens, our detailed empirical results shed light on phases of learning (and the consequent double descent behavior), fundamental versus incidental determinants of good generalization, the role of load-like and temperature-like parameters in the learning process, the different influences of model and data on the loss landscape, and the relationships between local and global metrics, all topics of recent interest.
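
Among the local metrics alluded to above, a simple sharpness-style proxy is the average loss increase along random, normalized weight perturbations. A minimal sketch, assuming a fixed evaluation batch; the paper's actual metric suite is broader and also includes global connectivity and model-similarity measures.

    import torch

    def sharpness_proxy(model, loss_fn, batch, radius: float = 0.01, n_dirs: int = 5) -> float:
        """Average increase in loss when the weights move by `radius` in a random direction."""
        x, y = batch
        params = [p for p in model.parameters() if p.requires_grad]
        with torch.no_grad():
            base = loss_fn(model(x), y).item()
            deltas = []
            for _ in range(n_dirs):
                direction = [torch.randn_like(p) for p in params]
                norm = torch.sqrt(sum(d.pow(2).sum() for d in direction))
                for p, d in zip(params, direction):
                    p.add_(radius * d / norm)
                deltas.append(loss_fn(model(x), y).item() - base)
                for p, d in zip(params, direction):
                    p.sub_(radius * d / norm)     # restore the original weights
        return sum(deltas) / n_dirs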


Contrastive Spatial Reasoning on Multi-View Line Drawings

Apr 27, 2021
Siyuan Xiang, Anbang Yang, Yanfei Xue, Yaoqing Yang, Chen Feng

Spatial reasoning on multi-view line drawings by state-of-the-art supervised deep networks has recently been shown to achieve puzzlingly low performance on the SPARE3D dataset. To study the reasons behind this low performance and to further our understanding of these tasks, we design controlled experiments on both input data and network designs. Guided by insights from these experiments, we propose a simple contrastive learning approach, along with other network modifications, to improve the baseline performance. Our approach uses a self-supervised binary classification network to compare the line drawing differences between various views of any two similar 3D objects. It enables deep networks to effectively learn detail-sensitive yet view-invariant line drawing representations of 3D objects. Experiments show that our method significantly increases the baseline performance on SPARE3D, while several popular self-supervised learning methods do not.
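
The comparison network described above can be sketched as a Siamese binary classifier: a shared encoder embeds two line drawings, and a head predicts whether they depict the same 3D object. The architecture and pairing scheme here are illustrative assumptions.

    import torch
    import torch.nn as nn

    class PairClassifier(nn.Module):
        """Predicts whether two line-drawing views come from the same object."""
        def __init__(self, encoder: nn.Module, feat_dim: int):
            super().__init__()
            self.encoder = encoder                    # shared drawing encoder
            self.head = nn.Linear(2 * feat_dim, 1)

        def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
            za, zb = self.encoder(a), self.encoder(b)
            return self.head(torch.cat([za, zb], dim=-1))

    # Training signal (illustrative): label = 1 if a and b are views of the
    # same object, else 0; optimize nn.BCEWithLogitsLoss on the logits.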

* The first two authors contributed equally. Chen Feng is the corresponding author 