Abstract: The quadratic computational complexity of the self-attention mechanism in popular transformer architectures poses significant challenges for training and inference, particularly in terms of efficiency and memory requirements. To address these challenges, this paper introduces a novel fast computation method for gradient calculation in multi-layer transformer models. Our approach enables the computation of gradients for the entire multi-layer transformer model in almost linear time $n^{1+o(1)}$, where $n$ is the input sequence length. This significantly reduces the computational bottleneck associated with the traditional quadratic time complexity. Our theory holds for any loss function and maintains a bounded approximation error across the entire model. Furthermore, our analysis holds when the multi-layer transformer model contains many practical sub-modules, such as residual connections, causal masking, and multi-head attention. By improving the efficiency of gradient computation in large language models, we hope that our theoretical results will facilitate more effective training and deployment of long-context language models.
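As a point of reference for where the quadratic cost arises, consider a single self-attention layer in standard notation (this worked sketch uses generic attention notation and is not quoted from the paper):
\[
Q = XW_Q,\quad K = XW_K,\quad V = XW_V,\qquad
\mathrm{Attn}(X) = D^{-1}\exp\!\big(QK^\top/\sqrt{d}\big)V,\quad
D = \mathrm{diag}\!\big(\exp(QK^\top/\sqrt{d})\,\mathbf{1}_n\big).
\]
Materializing $\exp(QK^\top/\sqrt{d}) \in \mathbb{R}^{n\times n}$ costs $\Theta(n^2 d)$, and naive backpropagation through it costs at least as much per layer; the result above states that, under bounded-entry-type assumptions, the gradients with respect to all weight matrices can instead be approximated within small additive error in $n^{1+o(1)}$ total time.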
Abstract: In this work, we improve the analysis of the running time of SparseGPT [Frantar, Alistarh ICML 2023] from $O(d^{3})$ to $O(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a})$ for any $a \in [0, 1]$, where $\omega$ is the exponent of matrix multiplication and $\omega(1,1,a)$ is the exponent of rectangular matrix multiplication. In particular, for the current bound $\omega \approx 2.371$ [Alman, Duan, Williams, Xu, Xu, Zhou 2024], our running time boils down to $O(d^{2.53})$. The improvement comes from analyzing the lazy-update behavior in iterative maintenance problems, as in [Deng, Song, Weinstein 2022; Brand, Song, Zhou ICML 2024].
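To see how the stated exponent arises, one can balance the last two terms in the bound (a standard trade-off argument; the specific optimal choice of $a$ is not spelled out in the abstract):
\[
d^{2+a} = d^{1+\omega(1,1,a)-a} \iff \omega(1,1,a) = 1 + 2a,
\]
where $\omega(1,1,a)$ denotes the cost exponent of multiplying a $d \times d$ matrix by a $d \times d^{a}$ matrix. Plugging in current rectangular matrix multiplication exponents and taking the maximum with the $d^{\omega}$ term yields the quoted $O(d^{2.53})$ bound for $\omega \approx 2.371$.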
Abstract: Leverage scores have become essential in statistics and machine learning, aiding regression analysis, randomized matrix computations, and various other tasks. This paper delves into the inverse problem, aiming to recover the intrinsic model parameters given the gradient of the leverage scores. This endeavor not only enriches the theoretical understanding of models trained with leverage score techniques but also has substantial implications for data privacy and adversarial security. We specifically scrutinize the inversion of the leverage score gradient, denoted as $g(x)$. An innovative iterative algorithm is introduced for the approximate resolution of the regularized least squares problem $\min_{x \in \mathbb{R}^d} 0.5 \|g(x) - c\|_2^2 + 0.5\|\mathrm{diag}(w)Ax\|_2^2$. Our algorithm employs subsampled leverage score distributions to compute an approximate Hessian in each iteration, which, under standard assumptions, considerably reduces the time complexity. Given that a total of $T = \log(\| x_0 - x^* \|_2/ \epsilon)$ iterations are required, the cost per iteration is optimized to the order of $O((\mathrm{nnz}(A) + d^{\omega}) \cdot \mathrm{poly}(\log(n/\delta)))$, where $\mathrm{nnz}(A)$ denotes the number of non-zero entries of $A$.
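The following is a minimal sketch of the kind of approximate Newton loop described above, with the per-iteration Hessian estimated from rows sampled proportionally to leverage scores. The interfaces grad_f and row_weights are hypothetical placeholders abstracting the paper's specific objective (including the leverage-score-gradient map $g$), and the constants are illustrative.
\begin{verbatim}
import numpy as np

def leverage_scores(M):
    # Exact leverage scores via a thin QR factorization; fast variants would
    # approximate this step with sketching.
    Q, _ = np.linalg.qr(M)
    return np.sum(Q ** 2, axis=1)

def subsampled_newton(A, w, grad_f, row_weights, x0, steps=20,
                      sample_frac=0.2, reg=1e-6, seed=0):
    # Approximate Newton iteration: each step builds a Hessian estimate
    # sum_i h_i(x) a_i a_i^T from a leverage-score-based row sample of diag(w) A.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    WA = w[:, None] * A
    tau = leverage_scores(WA)
    p = np.minimum(1.0, sample_frac * n * tau / tau.sum())   # sampling probabilities
    x = x0.copy()
    for _ in range(steps):
        keep = rng.random(n) < p
        S = WA[keep] / np.sqrt(p[keep])[:, None]              # rescaled sampled rows
        H = S.T @ (row_weights(x)[keep][:, None] * S) + reg * np.eye(d)
        x = x - np.linalg.solve(H, grad_f(x))                 # Newton-type step
    return x
\end{verbatim}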
Abstract: Determining the John ellipsoid, the largest-volume ellipsoid contained within a convex polytope, is a fundamental problem with applications in machine learning, optimization, and data analytics. Recent work has developed fast algorithms for approximating the John ellipsoid using sketching and leverage score sampling techniques. However, these algorithms do not provide privacy guarantees for sensitive input data. In this paper, we present the first differentially private algorithm for fast John ellipsoid computation. Our method integrates noise perturbation with sketching and leverage score sampling to achieve both efficiency and privacy. We prove that (1) our algorithm provides $(\epsilon,\delta)$-differential privacy, and the privacy guarantee holds for neighboring datasets that are $\epsilon_0$-close, allowing flexibility in the privacy definition; and (2) our algorithm still converges to a $(1+\xi)$-approximation of the optimal John ellipsoid in $O(\xi^{-2}(\log(n/\delta_0) + (L\epsilon_0)^{-2}))$ iterations, where $n$ is the number of data points, $L$ is the Lipschitz constant, $\delta_0$ is the failure probability, and $\epsilon_0$ is the closeness of neighboring input datasets. Our theoretical analysis demonstrates the algorithm's convergence and privacy properties, providing a robust approach for balancing utility and privacy in John ellipsoid computation. This is the first differentially private algorithm for fast John ellipsoid computation, opening avenues for future research in privacy-preserving optimization techniques.
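For intuition, the following is a minimal sketch of the classical fixed-point (leverage-score) iteration for the John ellipsoid of a symmetric polytope $\{x : \|Ax\|_\infty \le 1\}$, with Gaussian noise injected into each update as a stand-in for a DP mechanism. The noise scale and the omission of sketching and sampling are illustrative simplifications, not the paper's calibrated construction.
\begin{verbatim}
import numpy as np

def dp_john_weights(A, iters=50, sigma=0.01, seed=0):
    # Fixed-point iteration: w_i <- w_i * a_i^T (A^T diag(w) A)^{-1} a_i,
    # i.e. the leverage score of row i of diag(sqrt(w)) A, plus Gaussian noise.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.full(n, d / n)                       # feasible start; weights sum to d
    for _ in range(iters):
        Minv = np.linalg.inv(A.T @ (w[:, None] * A))
        lev = w * np.einsum('ij,jk,ik->i', A, Minv, A)
        w = np.clip(lev + sigma * rng.standard_normal(n), 1e-12, None)
        w *= d / w.sum()                        # noisy scores renormalized to sum d
    return w   # ellipsoid is defined (up to scaling) by x^T (A^T diag(w) A) x <= 1
\end{verbatim}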
Abstract: Cross-attention has become a fundamental module in many important artificial intelligence applications, e.g., retrieval-augmented generation (RAG), system prompts, and guided stable diffusion, among others. Ensuring the privacy of cross-attention is crucial and urgently needed because its key and value matrices may contain sensitive information about companies and their users, many of whom profit solely from their system prompts or RAG data. In this work, we design a novel differential privacy (DP) data structure that addresses the privacy of cross-attention with theoretical guarantees. In detail, let $n$ be the input token length of the system prompt/RAG data, $d$ be the feature dimension, $0 < \alpha \le 1$ be the relative error parameter, $R$ be the maximum value of the query and key matrices, $R_w$ be the maximum value of the value matrix, and $r, s, \epsilon_s$ be parameters of the polynomial kernel method. Then, our data structure requires $\widetilde{O}(ndr^2)$ memory with $\widetilde{O}(nr^2)$ initialization time and $\widetilde{O}(\alpha^{-1} r^2)$ query time for a single token query. In addition, our data structure guarantees that the user query is $(\epsilon, \delta)$-DP with $\widetilde{O}(n^{-1} \epsilon^{-1} \alpha^{-1/2} R^{2s} R_w r^2)$ additive error and $n^{-1} (\alpha + \epsilon_s)$ relative error between our output and the true answer. Furthermore, our result is robust to adaptive queries, in which users may intentionally attack the cross-attention system. To our knowledge, this is the first work to provide DP guarantees for cross-attention. We believe it can inspire more privacy-preserving algorithm designs in large generative models (LGMs).
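To make the polynomial-kernel mechanism concrete, here is a minimal sketch in which cross-attention over private keys/values is answered from precomputed, noise-perturbed kernel-feature summaries. The degree-$s$ tensor-product feature map and the flat noise scale are illustrative simplifications of the data structure described above, not its exact construction.
\begin{verbatim}
import numpy as np

def tensor_features(x, s=2):
    # Degree-s tensor-product features: phi(q) . phi(k) = (q . k)^s exactly,
    # a simple stand-in for polynomial-kernel approximations of exp(q . k).
    phi = x
    for _ in range(s - 1):
        phi = np.kron(phi, x)
    return phi

def build_noisy_summaries(K, V, s=2, noise=0.1, seed=0):
    # Precompute summaries of the private key/value matrices once; later
    # queries touch only these noisy summaries (noise scale is illustrative,
    # not the calibrated (eps, delta)-DP scale).
    rng = np.random.default_rng(seed)
    Phi = np.stack([tensor_features(k, s) for k in K])        # n x d^s
    S_num = Phi.T @ V + noise * rng.standard_normal((Phi.shape[1], V.shape[1]))
    S_den = Phi.sum(axis=0) + noise * rng.standard_normal(Phi.shape[1])
    return S_num, S_den

def answer_query(q, S_num, S_den, s=2):
    # Approximate cross-attention output for a single query token.
    phi_q = tensor_features(q, s)
    return (phi_q @ S_num) / max(phi_q @ S_den, 1e-12)
\end{verbatim}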
Abstract: Training data privacy is a fundamental problem in modern Artificial Intelligence (AI) applications, such as face recognition, recommendation systems, and language generation, as the training data may contain sensitive user information and raise legal concerns. To fundamentally understand how privacy mechanisms work in AI applications, we study differential privacy (DP) in the Neural Tangent Kernel (NTK) regression setting, where DP is one of the most powerful tools for measuring privacy under statistical learning, and NTK is one of the most popular analysis frameworks for studying the learning mechanisms of deep neural networks. In our work, we show provable guarantees for both the differential privacy and the test accuracy of our NTK regression. Furthermore, we conduct experiments on the basic image classification dataset CIFAR-10 to demonstrate that NTK regression can preserve good accuracy under a modest privacy budget, supporting the validity of our analysis. To our knowledge, this is the first work to provide a DP guarantee for NTK regression.
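As a schematic of privatized kernel regression in this setting, the sketch below solves kernel ridge regression on a given (NTK) Gram matrix and perturbs the predictions with Gaussian noise, an output-perturbation stand-in for a DP mechanism; the kernel construction, noise calibration, and regularization value are all placeholders.
\begin{verbatim}
import numpy as np

def dp_kernel_regression(K_train, y_train, K_test_train, reg=1e-3, noise=0.1, seed=0):
    # Kernel ridge regression: alpha = (K + reg * I)^{-1} y, prediction K_* alpha,
    # followed by Gaussian output perturbation (noise scale is illustrative only).
    rng = np.random.default_rng(seed)
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + reg * np.eye(n), y_train)
    preds = K_test_train @ alpha
    return preds + noise * rng.standard_normal(preds.shape)
\end{verbatim}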
Abstract: We investigate the statistical and computational limits of latent \textbf{Di}ffusion \textbf{T}ransformers (\textbf{DiT}s) under the low-dimensional linear latent space assumption. Statistically, we study the universal approximation and sample complexity of the DiT score function, as well as the distribution recovery property for the initial data. Specifically, under mild data assumptions, we derive an approximation error bound for the score network of latent DiTs that is sub-linear in the latent space dimension. Additionally, we derive the corresponding sample complexity bound and show that the data distribution generated from the estimated score function converges toward a proximate area of the original one. Computationally, we characterize the hardness of both forward inference and backward computation of latent DiTs, assuming the Strong Exponential Time Hypothesis (SETH). For forward inference, we identify efficient criteria for all possible latent DiT inference algorithms and showcase our theory by pushing the efficiency toward almost-linear time inference. For backward computation, we leverage the low-rank structure within the gradient computation of DiT training for possible algorithmic speedup. Specifically, we show that such speedup achieves almost-linear time latent DiT training by casting the DiT gradient as a series of chained low-rank approximations with bounded error. Under the low-dimensional assumption, we show that the convergence rate and the computational efficiency are both dominated by the dimension of the subspace, suggesting that latent DiTs have the potential to bypass the challenges associated with the high dimensionality of the initial data.
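For context, one standard formalization of the low-dimensional linear latent assumption (not quoted from the paper) is $x_0 = Bh$ with $B \in \mathbb{R}^{D \times d_0}$ having orthonormal columns and latent $h \in \mathbb{R}^{d_0}$, $d_0 \ll D$. Writing the noised variable as $x_t = \alpha_t x_0 + \sigma_t \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I_D)$, the marginal density factors so that
\[
\nabla_x \log p_t(x) \;=\; -\frac{x}{\sigma_t^2} \;+\; B\, s_t\!\big(B^\top x\big)
\]
for some function $s_t : \mathbb{R}^{d_0} \to \mathbb{R}^{d_0}$; that is, the only part of the score that must be learned lives in the $d_0$-dimensional latent space, which is why approximation error, sample complexity, and the achievable speedups above are governed by $d_0$ rather than the ambient dimension $D$.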
Abstract: Prompting and context-based fine-tuning methods, which we call Prefix Learning, have been proposed to enhance the performance of language models on various downstream tasks and can match full-parameter fine-tuning. There remains, however, a limited theoretical understanding of how these methods work. In this paper, we aim to address this limitation by studying the learning ability of Prefix Learning from the perspective of prefix length. In particular, we approximate the infinite-long Prefix Learning optimization process with the Neural Tangent Kernel (NTK) technique, formulating and solving it as a learning problem for an infinite-long prefix in a one-layer attention network. Our results confirm the over-parameterization property and an arbitrarily-small-loss convergence guarantee for infinite-long Prefix Learning in attention. On the implementation side, we propose NTK-Attention, a method that is "equivalent" to attention computation with an arbitrary prefix length and is efficient: its time complexity is sub-quadratic in the input length (excluding the prefix), and it requires only $d^2 + d$ extra parameters, where $d$ is the feature dimension. In addition, we conduct experiments comparing NTK-Attention with full-parameter fine-tuning, LoRA, and P-Tuning V2 across vision and natural language datasets. The results indicate that our approach may be a promising parameter-efficient fine-tuning method, as it demonstrates superior performance in numerous scenarios. Our code can be found at \url{https://github.com/ChristianYang37/chiwun/tree/main/src/NTK-Attention}.
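The sketch below illustrates one way $d^2 + d$ extra parameters can stand in for an arbitrarily long prefix: a trainable matrix $Z \in \mathbb{R}^{d \times d}$ and vector $k \in \mathbb{R}^{d}$ summarize the prefix's value and normalization contributions and are combined with standard attention over the prefix-free input. The feature map and combination rule here are our illustrative choices, not necessarily the exact construction in the paper or its code.
\begin{verbatim}
import numpy as np

def ntk_attention_sketch(X, W_q, W_k, W_v, Z, k):
    # X: (n, d) prefix-free input; Z: (d, d) and k: (d,) are the d^2 + d extra
    # parameters standing in for an arbitrarily long prefix.
    d = W_q.shape[1]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    S = np.exp(Q @ K.T / np.sqrt(d))                 # standard attention scores
    phi = np.exp(Q / np.sqrt(d))                     # illustrative query feature map
    num = S @ V + phi @ Z                            # prefix contribution to the numerator
    den = S.sum(axis=1, keepdims=True) + phi @ k[:, None]  # prefix contribution to the normalizer
    return num / den
\end{verbatim}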
Abstract: We study the computational limits of Low-Rank Adaptation (LoRA) updates for fine-tuning transformer-based models using fine-grained complexity theory. Our key observation is that the existence of low-rank decompositions within the gradient computation of LoRA adaptation leads to possible algorithmic speedup. This allows us to (i) identify a phase transition behavior and (ii) prove the existence of nearly linear algorithms by controlling the LoRA update computation term by term, assuming the Strong Exponential Time Hypothesis (SETH). For the former, we identify a sharp transition in the efficiency of all possible rank-$r$ LoRA update algorithms for transformers, based on specific norms resulting from the multiplications of the input sequence $\mathbf{X}$, pretrained weights $\mathbf{W^\star}$, and adapter matrices $\alpha \mathbf{B} \mathbf{A} / r$. Specifically, we derive a shared upper bound threshold for such norms and show that efficient (sub-quadratic) approximation algorithms for LoRA exist only below this threshold. For the latter, we prove the existence of nearly linear approximation algorithms for LoRA adaptation by utilizing the hierarchical low-rank structure of LoRA gradients and approximating the gradients with a series of chained low-rank approximations. To showcase our theory, we consider two practical scenarios: partial adaptation (e.g., only $\mathbf{W}_V$ and $\mathbf{W}_Q$) and full adaptation (e.g., $\mathbf{W}_Q$, $\mathbf{W}_V$, and $\mathbf{W}_K$) of the weights in attention heads.
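Concretely, in standard LoRA notation the adapters enter the attention computation as follows (a schematic of where the low-rank structure appears, not a statement of the paper's exact threshold):
\[
\mathbf{W}_Q = \mathbf{W}_Q^\star + \frac{\alpha}{r}\mathbf{B}_Q\mathbf{A}_Q,\qquad
\mathbf{Q} = \mathbf{X}\mathbf{W}_Q,\quad \mathbf{K} = \mathbf{X}\mathbf{W}_K,\qquad
\mathrm{Attn}(\mathbf{X}) = \mathrm{softmax}\!\big(\mathbf{Q}\mathbf{K}^\top/\sqrt{d}\big)\mathbf{V}.
\]
The adapter part of each product, $\frac{\alpha}{r}\mathbf{X}\mathbf{B}\mathbf{A}$, costs only $O(ndr)$ on top of the frozen term, and the phase transition above concerns the magnitudes of the resulting attention-score entries: below the derived norm threshold, the exponentiated attention matrix admits sub-quadratic approximation, while above it SETH rules such algorithms out.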
Abstract: Tensor Attention, a multi-view attention mechanism able to capture high-order correlations among multiple modalities, can overcome the representational limitations of classical matrix attention. However, the $\Omega(n^3)$ time complexity of Tensor Attention poses a significant obstacle to its practical implementation in transformers, where $n$ is the input sequence length. In this work, we prove that the backward gradient of Tensor Attention training can be computed in almost linear $n^{1+o(1)}$ time, the same complexity as its forward computation, under a bounded-entries assumption. We provide a closed-form solution for the gradient and propose a fast computation method utilizing polynomial approximation methods and tensor algebraic tricks. Furthermore, we prove the necessity and tightness of our assumption through hardness analysis, showing that slightly weakening it renders the gradient problem unsolvable in truly subcubic time. Our theoretical results establish the feasibility of efficient higher-order transformer training and may facilitate practical applications of Tensor Attention architectures.
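For reference, one common formalization of Tensor Attention (consistent with the $\Omega(n^3)$ cost above, though the paper's precise definition may differ in details) attends over pairs of key positions via a column-wise Kronecker (Khatri-Rao) product:
\[
\mathrm{TAttn}(X) \;=\; D^{-1}\exp\!\big(Q\,(K_1 \oslash K_2)^\top/d\big)\,(V_1 \oslash V_2),\qquad
D \;=\; \mathrm{diag}\!\big(\exp\!\big(Q\,(K_1 \oslash K_2)^\top/d\big)\,\mathbf{1}_{n^2}\big),
\]
where $K_1 \oslash K_2 \in \mathbb{R}^{n^2 \times d}$ stacks the entrywise products $k_{1,j} \circ k_{2,\ell}$ over all pairs $(j,\ell)$, so the attention matrix has size $n \times n^2$ and naive materialization costs $\Omega(n^3)$. The result above shows that, under bounded entries, both the forward value and the training gradient of this object can nonetheless be approximated in $n^{1+o(1)}$ time.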