Walid Krichene

Google Research

MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation

Mar 31, 2026

Jinghan Yao, Sam Adé Jacobs, Walid Krichene, Masahiro Tanaka, Dhabaleswar K Panda

Abstract: Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), compared to the latest FlashInfer library, MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention-phase speedups, up to 2.6x end-to-end, while maintaining full-attention quality. By reusing computation, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available here: https://github.com/YJHMITWEB/MAC-Attention.git
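
The "numerically stable merge" in the complete stage can be illustrated with the standard log-sum-exp merge of two partial softmax-attention results. This is a minimal sketch of that general technique, not the MAC-Attention implementation; the function names `partial_attention` and `merge` are hypothetical, and the abstract does not confirm that the paper's merge takes exactly this form.

```python
import math

def partial_attention(scores, values):
    """Softmax attention over one block of keys.

    Returns the block's output vector and the log-sum-exp of its scores,
    which is all that is needed to merge it with another block later.
    """
    m = max(scores)                                  # subtract the max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results as if attention had run over both blocks."""
    m = max(lse_a, lse_b)
    wa = math.exp(lse_a - m)                         # relative softmax mass of block a
    wb = math.exp(lse_b - m)                         # relative softmax mass of block b
    denom = wa + wb
    out = [(wa * a + wb * b) / denom for a, b in zip(out_a, out_b)]
    return out, m + math.log(denom)
```

Because each block carries only its output and one scalar, a reused (and amended) result can be combined with fresh attention over the KV tail without re-reading the keys that produced it.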

Training Differentially Private Ad Prediction Models with Semi-Sensitive Features

Jan 26, 2024

Lynn Chua, Qiliang Cui, Badih Ghazi, Charlie Harrison, Pritish Kamath, Walid Krichene, Ravi Kumar, Pasin Manurangsi, Krishna Giri Narra, Amer Sinha(+2 more)

Abstract: Motivated by problems arising in digital advertising, we introduce the task of training differentially private (DP) machine learning models with semi-sensitive features. In this setting, a subset of the features is known to the attacker (and thus need not be protected) while the remaining features as well as the label are unknown to the attacker and should be protected by the DP guarantee. This task interpolates between training the model with full DP (where the label and all features should be protected) and with label DP (where all the features are considered known, and only the label should be protected). We present a new algorithm for training DP models with semi-sensitive features. Through an empirical evaluation on real ads datasets, we demonstrate that our algorithm surpasses in utility the baselines of (i) DP stochastic gradient descent (DP-SGD) run on all features (known and unknown), and (ii) a label DP algorithm run only on the known features (while discarding the unknown ones).
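
For readers unfamiliar with the DP-SGD baseline named above, one update step can be sketched as follows: clip each per-example gradient, sum, add Gaussian noise scaled to the clipping norm, then average. This is a generic illustration of DP-SGD, not the paper's new algorithm; the names `clip` and `dp_sgd_step` and all parameters are hypothetical, and privacy accounting (choosing the noise multiplier for a target privacy budget) is not shown.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a gradient down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip per example, sum, add noise, average, step."""
    n = len(per_example_grads)
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            summed[i] += g
    sigma = noise_multiplier * max_norm              # noise std is tied to the clip norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / n for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]

# Example call with a seeded generator (values are illustrative only).
new_params = dp_sgd_step([0.0, 0.0], [[3.0, 4.0], [0.3, 0.4]],
                         lr=0.1, max_norm=1.0, noise_multiplier=1.0,
                         rng=random.Random(0))
```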

* 7 pages, 4 figures

Private Learning with Public Features

Oct 24, 2023

Walid Krichene, Nicolas Mayoraz, Steffen Rendle, Shuang Song, Abhradeep Thakurta, Li Zhang

Abstract: We study a class of private learning problems in which the data is a join of private and public features. This is often the case in private personalization tasks such as recommendation or ad prediction, in which features related to individuals are sensitive, while features related to items (the movies or songs to be recommended, or the ads to be shown to users) are publicly available and do not require protection. A natural question is whether private algorithms can achieve higher utility in the presence of public features. We give a positive answer for multi-encoder models where one of the encoders operates on public features. We develop new algorithms that take advantage of this separation by only protecting certain sufficient statistics (instead of adding noise to the gradient). This method has a guaranteed utility improvement for linear regression, and importantly, achieves the state of the art on two standard private recommendation benchmarks, demonstrating the importance of methods that adapt to the private-public feature separation.
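
The idea of protecting sufficient statistics rather than gradients can be illustrated on one-dimensional least squares, where the statistics are the sums of x*x and x*y. This is a toy sketch of generic sufficient-statistics perturbation, not the paper's multi-encoder algorithm; the function name `noisy_least_squares_1d` is hypothetical, and `noise_scale` is a placeholder that would have to be calibrated to the data bounds and privacy budget for any formal guarantee to hold.

```python
import random

def noisy_least_squares_1d(xs, ys, noise_scale, rng):
    """Fit y ~ w * x from noised sufficient statistics.

    Noise is added once to the two sums, not at every gradient step.
    """
    sxx = sum(x * x for x in xs) + rng.gauss(0.0, noise_scale)
    sxy = sum(x * y for x, y in zip(xs, ys)) + rng.gauss(0.0, noise_scale)
    if sxx <= 0:
        # Heavy noise can make the statistic unusable; a real method would regularize.
        raise ValueError("noised sum of squares is not positive")
    return sxy / sxx

# Example call with a seeded generator (values are illustrative only).
w = noisy_least_squares_1d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0],
                           noise_scale=0.1, rng=random.Random(0))
```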

Private Matrix Factorization with Public Item Features

Sep 17, 2023

Mihaela Curmei, Walid Krichene, Li Zhang, Mukund Sundararajan

Abstract: We consider the problem of training private recommendation models with access to public item features. Training with Differential Privacy (DP) offers strong privacy guarantees, at the expense of loss in recommendation quality. We show that incorporating public item features during training can help mitigate this loss in quality. We propose a general approach based on collective matrix factorization (CMF) that works by simultaneously factorizing two matrices: the user feedback matrix (representing sensitive data) and an item feature matrix that encodes publicly available (non-sensitive) item information. The method is conceptually simple, easy to tune, and highly scalable. It can be applied to different types of public item data, including: (1) categorical item features; (2) item-item similarities learned from public sources; and (3) publicly available user feedback. Furthermore, these data modalities can be collectively utilized to fully leverage public data. Evaluating our method on a standard DP recommendation benchmark, we find that using public item features significantly narrows the quality gap between private models and their non-private counterparts. As privacy constraints become more stringent, models rely more heavily on public side features for recommendation. This results in a smooth transition from collaborative filtering to item-based contextual recommendations.

* Presented at ACM Recsys 2023

Multi-Task Differential Privacy Under Distribution Skew

Feb 15, 2023

Walid Krichene, Prateek Jain, Shuang Song, Mukund Sundararajan, Abhradeep Thakurta, Li Zhang

Abstract: We study the problem of multi-task learning under user-level differential privacy, in which $n$ users contribute data to $m$ tasks, each involving a subset of users. One important aspect of the problem, which can significantly impact quality, is the distribution skew among tasks. Certain tasks may have far fewer data samples than others, making them more susceptible to the noise added for privacy. It is natural to ask whether algorithms can adapt to this skew to improve the overall utility. We give a systematic analysis of the problem, by studying how to optimally allocate a user's privacy budget among tasks. We propose a generic algorithm, based on an adaptive reweighting of the empirical loss, and show that when there is task distribution skew, this gives a quantifiable improvement of excess empirical risk. Experimental studies on recommendation problems that exhibit a long tail of small tasks demonstrate that our methods significantly improve utility, achieving the state of the art on two standard benchmarks.

Differentially Private Image Classification from Features

Nov 24, 2022

Harsh Mehta, Walid Krichene, Abhradeep Thakurta, Alexey Kurakin, Ashok Cutkosky

Abstract: Leveraging transfer learning has recently been shown to be an effective strategy for training large models with Differential Privacy (DP). Moreover, somewhat surprisingly, recent works have found that privately training just the last layer of a pre-trained model provides the best utility with DP. While past studies largely rely on algorithms like DP-SGD for training large models, in the specific case of privately learning from features, we observe that computational burden is low enough to allow for more sophisticated optimization schemes, including second-order methods. To that end, we systematically explore the effect of design parameters such as loss function and optimization algorithm. We find that, while commonly used logistic regression performs better than linear regression in the non-private setting, the situation is reversed in the private setting. We find that linear regression is much more effective than logistic regression from both privacy and computational aspects, especially at stricter epsilon values ($\epsilon < 1$). On the optimization side, we also explore using Newton's method, and find that second-order information is quite helpful even with privacy, although the benefit significantly diminishes with stricter privacy guarantees. While both methods use second-order information, least squares is effective at lower epsilons while Newton's method is effective at larger epsilon values. To combine the benefits of both, we propose a novel algorithm called DP-FC, which leverages feature covariance instead of the Hessian of the logistic regression loss and performs well across all $\epsilon$ values we tried. With this, we obtain new SOTA results on ImageNet-1K, CIFAR-100 and CIFAR-10 across all values of $\epsilon$ typically considered. Most remarkably, on ImageNet-1K, we obtain top-1 accuracy of 88% under $(8, 8 \times 10^{-7})$-DP and 84.3% under $(0.1, 8 \times 10^{-7})$-DP.

Reciprocity in Machine Learning

Feb 19, 2022

Mukund Sundararajan, Walid Krichene

Abstract: Machine learning is pervasive. It powers recommender systems such as Spotify, Instagram and YouTube, and health-care systems via models that predict sleep patterns, or the risk of disease. Individuals contribute data to these models and benefit from them. Are these contributions (outflows of influence) and benefits (inflows of influence) reciprocal? We propose measures of outflows, inflows and reciprocity building on previously proposed measures of training data influence. Our initial theoretical and empirical results indicate that under certain distributional assumptions, some classes of models are approximately reciprocal. We conclude with several open directions.

ALX: Large Scale Matrix Factorization on TPUs

Dec 03, 2021

Harsh Mehta, Steffen Rendle, Walid Krichene, Li Zhang

Abstract: We present ALX, an open-source library for distributed matrix factorization using Alternating Least Squares, written in JAX. Our design allows for efficient use of the TPU architecture and scales well to matrix factorization problems of O(B) rows/columns by scaling the number of available TPU cores. In order to spur future research on large-scale matrix factorization methods and to illustrate the scalability properties of our own implementation, we also built a real-world web link prediction dataset called WebGraph. This dataset can be easily modeled as a matrix factorization problem. We created several variants of this dataset based on locality and sparsity properties of sub-graphs. The largest variant of WebGraph has around 365M nodes and training a single epoch finishes in about 20 minutes with 256 TPU cores. We include speed and performance numbers of ALX on all variants of WebGraph. Both the framework code and the dataset are open-sourced.

iALS++: Speeding up Matrix Factorization with Subspace Optimization

Oct 26, 2021

Steffen Rendle, Walid Krichene, Li Zhang, Yehuda Koren

Abstract: iALS is a popular algorithm for learning matrix factorization models from implicit feedback with alternating least squares. This algorithm was invented over a decade ago but still shows competitive quality compared to recent approaches like VAE, EASE, SLIM, or NCF. Due to a computational trick that avoids negative sampling, iALS is very efficient, especially for large item catalogues. However, iALS does not scale well with large embedding dimensions, d, due to its cubic runtime dependency on d. Coordinate descent variations, iCD, have been proposed to lower the complexity to quadratic in d. In this work, we show that iCD approaches are not well suited for modern processors and can be an order of magnitude slower than a careful iALS implementation for small to mid-scale embedding sizes (d ~ 100) and only perform better than iALS on large embeddings (d ~ 1000). We propose a new solver, iALS++, that combines the advantages of iALS in terms of vector processing with a low computational complexity as in iCD. iALS++ is an order of magnitude faster than iCD both for small and large embedding dimensions. It can solve benchmark problems like MovieLens 20M or Million Song Dataset even for 1000-dimensional embedding vectors in a few minutes.
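
The abstract does not spell out the "computational trick that avoids negative sampling". One common way to describe it, taken here as an assumption, is the Gramian identity: the squared-score penalty summed over all items equals a quadratic form in the d-by-d Gramian of the item embeddings, which can be computed once and shared across users. The sketch below checks that identity on toy data; the function names are hypothetical and this is not the iALS++ solver.

```python
def gramian(item_embeddings):
    """d x d matrix G with G[a][b] = sum over items of v[a] * v[b]."""
    d = len(item_embeddings[0])
    return [[sum(v[a] * v[b] for v in item_embeddings) for b in range(d)]
            for a in range(d)]

def all_items_penalty_naive(user, item_embeddings):
    """Sum of squared scores over every item: cost grows with the catalogue size."""
    return sum(sum(u * x for u, x in zip(user, v)) ** 2 for v in item_embeddings)

def all_items_penalty_gramian(user, gram):
    """Same quantity as u^T G u: cost is d^2 per user, independent of catalogue size."""
    d = len(user)
    return sum(user[a] * gram[a][b] * user[b] for a in range(d) for b in range(d))

items = [[1.0, 2.0], [3.0, 4.0], [0.0, -1.0]]
user = [0.5, -1.0]
gram = gramian(items)                                # computed once, reused for all users
naive = all_items_penalty_naive(user, items)
fast = all_items_penalty_gramian(user, gram)         # matches naive up to rounding
```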

Revisiting the Performance of iALS on Item Recommendation Benchmarks

Oct 26, 2021

Steffen Rendle, Walid Krichene, Li Zhang, Yehuda Koren

Abstract: Matrix factorization learned by implicit alternating least squares (iALS) is a popular baseline in recommender system research publications. iALS is known to be one of the most computationally efficient and scalable collaborative filtering methods. However, recent studies suggest that its prediction quality is not competitive with the current state of the art, in particular autoencoders and other item-based collaborative filtering methods. In this work, we revisit the iALS algorithm and present a bag of tricks that we found useful when applying iALS. We revisit four well-studied benchmarks where iALS was reported to perform poorly and show that with proper tuning, iALS is highly competitive and outperforms any method on at least half of the comparisons. We hope that these high-quality results together with iALS's known scalability spark new interest in applying and further improving this decade-old technique.
