Kayhan Behdin

QuantEase: Optimization-based Quantization for Language Models -- An Efficient and Intuitive Algorithm

Sep 05, 2023
Kayhan Behdin, Ayan Acharya, Aman Gupta, Sathiya Keerthi, Rahul Mazumder

With the rising popularity of Large Language Models (LLMs), there has been an increasing interest in compression techniques that enable their efficient deployment. This study focuses on the Post-Training Quantization (PTQ) of LLMs. Drawing from recent advances, our work introduces QuantEase, a layer-wise quantization framework where individual layers undergo separate quantization. The problem is framed as a discrete-structured non-convex optimization, prompting the development of algorithms rooted in Coordinate Descent (CD) techniques. These CD-based methods provide high-quality solutions to the complex non-convex layer-wise quantization problems. Notably, our CD-based approach features straightforward updates, relying solely on matrix and vector operations, circumventing the need for matrix inversion or decomposition. We also explore an outlier-aware variant of our approach, allowing for retaining significant weights (outliers) with complete precision. Our proposal attains state-of-the-art performance in terms of perplexity and zero-shot accuracy in empirical evaluations across various LLMs and datasets, with relative improvements up to 15% over methods such as GPTQ. Particularly noteworthy is our outlier-aware algorithm's capability to achieve near or sub-3-bit quantization of LLMs with an acceptable drop in accuracy, obviating the need for non-uniform quantization or grouping techniques, improving upon methods such as SpQR by up to two times in terms of perplexity.
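
To make the coordinate-descent idea concrete, below is a minimal NumPy sketch of a CD pass for a single linear layer: each input coordinate of the weight matrix is updated in closed form against calibration statistics and then snapped to a uniform quantization grid. The grid construction, names, and hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def quantize_layer_cd(W, X, n_bits=3, n_iters=20):
    """Illustrative sketch: coordinate descent for layer-wise quantization.

    Approximately minimizes ||X W - X Q||_F^2 over Q restricted to a uniform
    per-column grid, updating one input coordinate at a time in closed form
    and then rounding to the nearest grid point.
    """
    p, m = W.shape                              # (in_features, out_features)
    H = X.T @ X + 1e-6 * np.eye(p)              # statistics from calibration data X

    # Per-output-column uniform grid built from the weight range (an assumption).
    scale = (W.max(axis=0) - W.min(axis=0)) / (2 ** n_bits - 1) + 1e-12
    zero = W.min(axis=0)

    def to_grid(V):
        q = np.clip(np.round((V - zero) / scale), 0, 2 ** n_bits - 1)
        return zero + scale * q

    Q = to_grid(W)                              # start from round-to-nearest
    for _ in range(n_iters):
        for j in range(p):                      # cycle over input coordinates
            # Contribution of all coordinates except j to the quadratic in row j.
            r = H[j] @ (Q - W) - H[j, j] * (Q[j] - W[j])
            Q[j] = to_grid(W[j] - r / H[j, j])  # closed-form update, then snap to grid
    return Q
```

Only matrix-vector products appear in the inner loop, consistent with the abstract's point that no matrix inversion or decomposition is required.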

Sparse Gaussian Graphical Models with Discrete Optimization: Computational and Statistical Perspectives

Jul 18, 2023
Kayhan Behdin, Wenyu Chen, Rahul Mazumder

We consider the problem of learning a sparse graph underlying an undirected Gaussian graphical model, a key problem in statistical machine learning. Given $n$ samples from a multivariate Gaussian distribution with $p$ variables, the goal is to estimate the $p \times p$ inverse covariance matrix (aka precision matrix), assuming it is sparse (i.e., has a few nonzero entries). We propose GraphL0BnB, a new estimator based on an $\ell_0$-penalized version of the pseudolikelihood function, while most earlier approaches are based on the $\ell_1$-relaxation. Our estimator can be formulated as a convex mixed integer program (MIP) which can be difficult to compute at scale using off-the-shelf commercial solvers. To solve the MIP, we propose a custom nonlinear branch-and-bound (BnB) framework that solves node relaxations with tailored first-order methods. As a by-product of our BnB framework, we propose large-scale solvers for obtaining good primal solutions that are of independent interest. We derive novel statistical guarantees (estimation and variable selection) for our estimator and discuss how our approach improves upon existing estimators. Our numerical experiments on real/synthetic datasets suggest that our method can solve, to near-optimality, problem instances with $p = 10^4$ -- corresponding to a symmetric matrix of size $p \times p$ with $p^2/2$ binary variables. We demonstrate the usefulness of GraphL0BnB versus various state-of-the-art approaches on a range of datasets.
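
To convey the flavor of the discrete ($\ell_0$) formulation, the toy sketch below runs $\ell_0$-penalized node-wise (neighborhood) regressions with iterative hard thresholding. It illustrates only the kind of sparse estimation involved; the paper's estimator is an $\ell_0$-penalized pseudolikelihood solved with a custom branch-and-bound framework, which this sketch does not implement.

```python
import numpy as np

def l0_neighborhood_regression(X, lam, n_iters=200):
    """Toy sketch: l0-penalized node-wise regressions for sparse graph recovery.

    For each variable j, regress X[:, j] on the remaining variables with an
    l0 penalty handled by hard thresholding; the union of neighborhoods gives
    an estimated edge set. Illustrative only, not the paper's estimator.
    """
    n, p = X.shape
    B = np.zeros((p, p))                        # neighborhood regression coefficients
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / (Lipschitz constant of the gradient)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        A, y = X[:, others], X[:, j]
        b = np.zeros(p - 1)
        for _ in range(n_iters):                # proximal gradient with l0 prox
            b = b - step * A.T @ (A @ b - y)
            b[b ** 2 < 2 * step * lam] = 0.0    # hard thresholding
        B[j, others] = b
    support = (np.abs(B) + np.abs(B.T)) > 0     # symmetrize with an OR rule
    return B, support
```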

Sharpness-Aware Minimization: An Implicit Regularization Perspective

Feb 28, 2023
Kayhan Behdin, Rahul Mazumder

Sharpness-Aware Minimization (SAM) is a recent optimization framework that aims to improve deep neural network generalization by obtaining flatter (i.e., less sharp) solutions. As SAM has been numerically successful, recent papers have studied the theoretical aspects of the framework. In this work, we study SAM through an implicit regularization lens and present a new theoretical explanation of why SAM generalizes well. To this end, we study the least-squares linear regression problem and show a bias-variance trade-off for SAM's error over the course of the algorithm. We show that SAM has lower bias than Gradient Descent (GD), while having higher variance. This shows SAM can outperform GD, especially if the algorithm is \emph{stopped early}, which is often the case when training large neural networks due to the prohibitive computational cost. We extend our results to kernel regression, as well as stochastic optimization, and discuss how the implicit regularization of SAM can improve upon vanilla training.
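
The following minimal NumPy sketch contrasts GD and SAM iterates on the least-squares problem analyzed above. The step size and perturbation radius are illustrative, not values from the paper.

```python
import numpy as np

def least_squares_grad(w, X, y):
    """Gradient of the mean squared error (1/2n) ||X w - y||^2."""
    return X.T @ (X @ w - y) / len(y)

def run_gd_and_sam(X, y, eta=0.1, rho=0.05, n_steps=200):
    """Illustrative sketch: GD vs. SAM iterates on least-squares regression.

    SAM evaluates the gradient at the adversarially perturbed point
    w + rho * g / ||g||, then takes the descent step from the original iterate.
    """
    w_gd = np.zeros(X.shape[1])
    w_sam = np.zeros(X.shape[1])
    trajectory = []
    for _ in range(n_steps):
        # Vanilla gradient descent step.
        w_gd = w_gd - eta * least_squares_grad(w_gd, X, y)
        # SAM: ascend to a worst-case nearby point, then take the descent step.
        g = least_squares_grad(w_sam, X, y)
        w_adv = w_sam + rho * g / (np.linalg.norm(g) + 1e-12)
        w_sam = w_sam - eta * least_squares_grad(w_adv, X, y)
        trajectory.append((w_gd.copy(), w_sam.copy()))
    return trajectory
```

Tracking the distance of both iterates to the true coefficient vector along the trajectory is one way to observe the early-stopping effect discussed in the abstract.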

mSAM: Micro-Batch-Averaged Sharpness-Aware Minimization

Feb 19, 2023
Kayhan Behdin, Qingquan Song, Aman Gupta, Ayan Acharya, David Durfee, Borja Ocejo, Sathiya Keerthi, Rahul Mazumder

Modern deep learning models are over-parameterized, and different optima can result in widely varying generalization performance. To account for this, Sharpness-Aware Minimization (SAM) modifies the underlying loss function to guide descent methods towards flatter minima, which arguably have better generalization abilities. In this paper, we focus on a variant of SAM known as micro-batch SAM (mSAM), which, during training, averages the updates generated by adversarial perturbations across several disjoint shards (micro-batches) of a mini-batch. We extend a recently developed and well-studied general framework for flatness analysis to show that distributed gradient computation for sharpness-aware minimization theoretically achieves even flatter minima. To support this theoretical superiority, we provide a thorough empirical evaluation on a variety of image classification and natural language processing tasks. We also show that, contrary to previous work, mSAM can be implemented in a flexible and parallelizable manner without significantly increasing computational costs. Our practical implementation of mSAM yields superior generalization performance across a wide range of tasks compared to SAM, further supporting our theoretical framework.
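
A minimal sketch of the micro-batch averaging step, written for logistic regression in NumPy so that it stays self-contained; the model, shard count, and hyperparameters are illustrative stand-ins for the deep-learning setting of the paper.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss with labels y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def msam_step(w, X_batch, y_batch, eta=0.1, rho=0.05, n_shards=4):
    """Illustrative sketch of one mSAM step.

    The mini-batch is split into disjoint micro-batches; each shard computes
    its own adversarial perturbation and perturbed gradient, and the resulting
    updates are averaged.
    """
    shard_X = np.array_split(X_batch, n_shards)
    shard_y = np.array_split(y_batch, n_shards)
    update = np.zeros_like(w)
    for Xs, ys in zip(shard_X, shard_y):
        g = logistic_grad(w, Xs, ys)
        w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)   # per-shard ascent step
        update += logistic_grad(w_adv, Xs, ys)              # perturbed gradient on the shard
    return w - eta * update / n_shards                      # average across micro-batches
```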

* arXiv admin note: substantial text overlap with arXiv:2212.04343 
Multi-Task Learning for Sparsity Pattern Heterogeneity: A Discrete Optimization Approach

Dec 16, 2022
Gabriel Loewinger, Kayhan Behdin, Kenneth T. Kishida, Giovanni Parmigiani, Rahul Mazumder

We extend best-subset selection to linear Multi-Task Learning (MTL), where a set of linear models are jointly trained on a collection of datasets (``tasks''). Allowing the regression coefficients of tasks to have different sparsity patterns (i.e., different supports), we propose a modeling framework for MTL that encourages models to share information across tasks, for a given covariate, through separately 1) shrinking the coefficient supports together, and/or 2) shrinking the coefficient values together. This allows models to borrow strength during variable selection even when the coefficient values differ markedly between tasks. We express our modeling framework as a Mixed-Integer Program, and propose efficient and scalable algorithms based on block coordinate descent and combinatorial local search. We show our estimator achieves statistically optimal prediction rates. Importantly, our theory characterizes how our estimator leverages the shared support information across tasks to achieve better variable selection performance. We evaluate the performance of our method in simulations and two biology applications. Our proposed approaches outperform other sparse MTL methods in variable selection and prediction accuracy. Interestingly, penalties that shrink the supports together often outperform penalties that shrink the coefficient values together. We will release an R package implementing our methods.
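
To illustrate the two forms of information sharing, the toy function below evaluates a multi-task objective with one penalty on support disagreement and one on coefficient-value disagreement across tasks. Both penalty forms are illustrative surrogates; the paper formulates the problem as a Mixed-Integer Program and solves it with block coordinate descent and combinatorial local search, which this sketch does not implement.

```python
import numpy as np

def mtl_objective(betas, Xs, ys, lam_support=0.1, lam_values=0.1):
    """Toy sketch of the two information-sharing penalties described above.

    betas: (K, p) array of coefficients for K tasks; Xs, ys: lists of task data.
    One penalty discourages a covariate from entering the support of only some
    tasks (shrinking supports together); the other shrinks coefficient values
    toward their across-task mean. Illustrative surrogates only.
    """
    loss = sum(np.sum((X @ b - y) ** 2) / (2 * len(y))
               for b, X, y in zip(betas, Xs, ys))
    z = (betas != 0).astype(float)                              # support indicators
    support_penalty = np.sum(np.abs(z - z.mean(axis=0)))        # support disagreement
    value_penalty = np.sum((betas - betas.mean(axis=0)) ** 2)   # value disagreement
    return loss + lam_support * support_penalty + lam_values * value_penalty
```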

Improved Deep Neural Network Generalization Using m-Sharpness-Aware Minimization

Dec 07, 2022
Kayhan Behdin, Qingquan Song, Aman Gupta, David Durfee, Ayan Acharya, Sathiya Keerthi, Rahul Mazumder

Modern deep learning models are over-parameterized, and the optimization setup strongly affects generalization performance. A key element of reliable optimization for these systems is the modification of the loss function. Sharpness-Aware Minimization (SAM) modifies the underlying loss function to guide descent methods towards flatter minima, which arguably have better generalization abilities. In this paper, we focus on a variant of SAM known as mSAM, which, during training, averages the updates generated by adversarial perturbations across several disjoint shards of a mini-batch. Recent work suggests that mSAM can outperform SAM in terms of test accuracy. However, a comprehensive empirical study of mSAM is missing from the literature -- previous results have mostly been limited to specific architectures and datasets. To that end, this paper presents a thorough empirical evaluation of mSAM on various tasks and datasets. We provide a flexible implementation of mSAM and compare the generalization performance of mSAM to the performance of SAM and vanilla training on different image classification and natural language processing tasks. We also conduct careful experiments to understand the computational cost of training with mSAM, its sensitivity to hyperparameters, and its correlation with the flatness of the loss landscape. Our analysis reveals that mSAM yields superior generalization performance and flatter minima compared to SAM, across a wide range of tasks, without significantly increasing computational costs.
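
Since the abstract relates generalization to the flatness of the loss landscape, here is a generic sketch of one common flatness proxy: the largest Hessian eigenvalue estimated by power iteration with finite-difference Hessian-vector products. This is a standard diagnostic, not necessarily the exact metric used in the paper.

```python
import numpy as np

def top_hessian_eigenvalue(grad_fn, w, n_iters=50, eps=1e-4, seed=0):
    """Estimate the largest Hessian eigenvalue of the loss at w by power iteration.

    grad_fn(w) must return the loss gradient at w (an assumption of this
    sketch); Hessian-vector products are approximated by central finite
    differences of the gradient. A larger value indicates a sharper minimum.
    """
    v = np.random.default_rng(seed).normal(size=w.shape)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(n_iters):
        hv = (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)  # ~ H @ v
        eig = float(v @ hv)                     # Rayleigh quotient (v has unit norm)
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return eig
```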

Archetypal Analysis for Sparse Nonnegative Matrix Factorization: Robustness Under Misspecification

Apr 08, 2021
Kayhan Behdin, Rahul Mazumder

We consider the problem of sparse nonnegative matrix factorization (NMF) with archetypal regularization. The goal is to represent a collection of data points as nonnegative linear combinations of a few nonnegative sparse factors with appealing geometric properties, arising from the use of archetypal regularization. We generalize the notion of robustness studied in Javadi and Montanari (2019) (without sparsity) to the notions of (a) strong robustness, which implies each estimated archetype is close to the underlying archetypes, and (b) weak robustness, which implies there exists at least one recovered archetype that is close to the underlying archetypes. Our theoretical results on robustness guarantees hold under minimal assumptions on the underlying data and apply to settings where the underlying archetypes need not be sparse. We propose new algorithms for our optimization problem and present numerical experiments on synthetic and real datasets that shed further light on our proposed framework and theoretical developments.
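
A toy sketch of the archetypal regularization idea: each factor is penalized by its squared distance to the convex hull of the data points, computed here with a small projected-gradient inner solver over simplex weights. The penalty form and solver are illustrative assumptions and may differ from the paper's algorithms.

```python
import numpy as np

def simplex_project(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def archetypal_penalty(H, X, n_inner=100):
    """Toy sketch: total squared distance of each factor (row of H) to the
    convex hull of the data points (rows of X), in the spirit of the
    archetypal regularizer. Illustrative only."""
    n = X.shape[0]
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    total = 0.0
    for h in H:
        w = np.full(n, 1.0 / n)                   # convex-combination weights
        for _ in range(n_inner):                  # projected gradient on the simplex
            w = simplex_project(w - step * X @ (X.T @ w - h))
        total += np.sum((X.T @ w - h) ** 2)
    return total
```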

Recovering Quantized Data with Missing Information Using Bilinear Factorization and Augmented Lagrangian Method

Oct 07, 2018
Ashkan Esmaeili, Kayhan Behdin, Sina Al-E-Mohammad, Farokh Marvasti

In this paper, we propose a novel approach to recovering a quantized matrix with missing information. We propose a regularized convex cost function composed of a log-likelihood term and a trace-norm term. A bilinear factorization approach and the Augmented Lagrangian Method (ALM) are applied to find the global minimizer of the cost function and recover the genuine data. We provide a mathematical convergence analysis for our proposed algorithm. In the numerical experiments, we show the superiority of our method in accuracy, as well as its robustness in computational complexity, compared to state-of-the-art methods from the literature.
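
A drastically simplified sketch of the two ingredients: a log-likelihood for quantized observations (reduced here to the one-bit, logistic-link case) and a bilinear factorization whose Frobenius penalty acts as a trace-norm surrogate. The paper optimizes with the Augmented Lagrangian Method; this sketch swaps in plain gradient descent purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recover_quantized(Y, mask, rank=5, lam=0.1, eta=0.05, n_iters=500, seed=0):
    """Simplified sketch: one-bit matrix recovery with a bilinear factorization.

    Y: observed one-bit labels in {0, 1}; mask: 1 where an entry is observed.
    The negative log-likelihood of the observed entries (logistic link) is
    combined with a Frobenius penalty on U, V, a standard surrogate for the
    trace norm of M = U V^T. Plain gradient descent replaces the paper's ALM.
    """
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.normal(size=(n, rank))
    V = 0.1 * rng.normal(size=(m, rank))
    for _ in range(n_iters):
        R = mask * (sigmoid(U @ V.T) - Y)   # gradient of the negative log-likelihood
        grad_U = R @ V + lam * U            # + gradient of the trace-norm surrogate
        grad_V = R.T @ U + lam * V
        U -= eta * grad_U
        V -= eta * grad_V
    return U @ V.T                          # recovered real-valued matrix
```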

Transduction with Matrix Completion Using Smoothed Rank Function

May 19, 2018
Ashkan Esmaeili, Kayhan Behdin, Mohammad Amin Fakharian, Farokh Marvasti

In this paper, we propose two new algorithms for the transduction with Matrix Completion (MC) problem. The joint MC and prediction tasks are addressed simultaneously to enhance accuracy: the label matrix is concatenated to the data matrix, forming a stacked matrix. Assuming the data matrix is of low rank, we propose new recommendation methods by posing the problem as a constrained minimization of the Smoothed Rank Function (SRF). We provide a convergence analysis for the proposed algorithms. The simulations are conducted on real datasets under two scenarios of randomly missing patterns, with and without block loss. The results confirm that our proposed methods outperform state-of-the-art methods in accuracy, by up to 10% at low observation rates, in the scenario without block loss. Our accuracy in the latter scenario is comparable to that of state-of-the-art methods, while the complexity of the proposed algorithms is reduced by up to a factor of 4.
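
A toy sketch of Smoothed Rank Function matrix completion: the rank is replaced by a smooth surrogate of the singular values, which is optimized with gradient steps that keep the observed entries fixed while the smoothing parameter is gradually decreased. For transduction, the input would be the data matrix stacked with the partially observed label matrix. The schedule and step size are illustrative assumptions, not the paper's exact algorithms.

```python
import numpy as np

def srf_complete(M_obs, mask, n_outer=10, n_inner=50, step=1.0, delta_decay=0.7):
    """Toy sketch of matrix completion with a Smoothed Rank Function.

    rank(M) is approximated via the surrogate sum_i exp(-s_i^2 / (2 delta^2))
    over singular values s_i; gradient steps increase the surrogate (shrinking
    small singular values), observed entries are re-imposed, and delta is
    decreased between outer rounds. mask is a 0/1 array of observed entries.
    """
    M = mask * M_obs
    delta = np.linalg.svd(M, compute_uv=False)[0]      # start with a large delta
    for _ in range(n_outer):
        for _ in range(n_inner):
            U, s, Vt = np.linalg.svd(M, full_matrices=False)
            # Gradient of the surrogate with respect to M.
            g = U @ np.diag(-(s / delta ** 2) * np.exp(-s ** 2 / (2 * delta ** 2))) @ Vt
            M = M + step * delta ** 2 * g              # delta^2-scaled ascent step
            M = mask * M_obs + (1 - mask) * M          # keep observed entries fixed
        delta *= delta_decay
    return M
```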
