Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Aditya Krishna Menon

Supervision Complexity and its Role in Knowledge Distillation

Jan 28, 2023
Hrayr Harutyunyan, Ankit Singh Rawat, Aditya Krishna Menon, Seungyeon Kim, Sanjiv Kumar

Figure 1 for Supervision Complexity and its Role in Knowledge Distillation

Figure 2 for Supervision Complexity and its Role in Knowledge Distillation

Figure 3 for Supervision Complexity and its Role in Knowledge Distillation

Figure 4 for Supervision Complexity and its Role in Knowledge Distillation

Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps. In order to study the generalization behavior of a distilled student, we propose a new theoretical framework that leverages supervision complexity: a measure of alignment between teacher-provided supervision and the student's neural tangent kernel. The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions. Specifically, it provides a rigorous justification for the utility of various techniques that are prevalent in the context of distillation, such as early stopping and temperature scaling. Our analysis further suggests the use of online distillation, where a student receives increasingly more complex supervision from teachers in different stages of their training. We demonstrate efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.

* Published at ICLR 2023

Via

Access Paper or Ask Questions

EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval

Jan 27, 2023
Seungyeon Kim, Ankit Singh Rawat, Manzil Zaheer, Sadeep Jayasumana, Veeranjaneyulu Sadhanala, Wittawat Jitkrittum, Aditya Krishna Menon, Rob Fergus, Sanjiv Kumar

Figure 1 for EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval

Figure 2 for EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval

Figure 3 for EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval

Figure 4 for EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval

Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). In this paper, we aim to improve distillation methods that pave the way for the deployment of such models in practice. The proposed distillation approach supports both retrieval and re-ranking stages and crucially leverages the relative geometry among queries and documents learned by the large teacher model. It goes beyond existing distillation methods in the IR literature, which simply rely on the teacher's scalar scores over the training data, on two fronts: providing stronger signals about local geometry via embedding matching and attaining better coverage of data manifold globally via query generation. Embedding matching provides a stronger signal to align the representations of the teacher and student models. At the same time, query generation explores the data manifold to reduce the discrepancies between the student and teacher where training data is sparse. Our distillation approach is theoretically justified and applies to both dual encoder (DE) and cross-encoder (CE) models. Furthermore, for distilling a CE model to a DE model via embedding matching, we propose a novel dual pooling-based scorer for the CE model that facilitates a distillation-friendly embedding geometry, especially for DE student models.

Via

Access Paper or Ask Questions

When does mixup promote local linearity in learned representations?

Oct 28, 2022
Arslan Chaudhry, Aditya Krishna Menon, Andreas Veit, Sadeep Jayasumana, Srikumar Ramalingam, Sanjiv Kumar

Figure 1 for When does mixup promote local linearity in learned representations?

Figure 2 for When does mixup promote local linearity in learned representations?

Figure 3 for When does mixup promote local linearity in learned representations?

Figure 4 for When does mixup promote local linearity in learned representations?

Mixup is a regularization technique that artificially produces new samples using convex combinations of original training points. This simple technique has shown strong empirical performance, and has been heavily used as part of semi-supervised learning techniques such as mixmatch~\citep{berthelot2019mixmatch} and interpolation consistent training (ICT)~\citep{verma2019interpolation}. In this paper, we look at Mixup through a \emph{representation learning} lens in a semi-supervised learning setup. In particular, we study the role of Mixup in promoting linearity in the learned network representations. Towards this, we study two questions: (1) how does the Mixup loss that enforces linearity in the \emph{last} network layer propagate the linearity to the \emph{earlier} layers?; and (2) how does the enforcement of stronger Mixup loss on more than two data points affect the convergence of training? We empirically investigate these properties of Mixup on vision datasets such as CIFAR-10, CIFAR-100 and SVHN. Our results show that supervised Mixup training does not make \emph{all} the network layers linear; in fact the \emph{intermediate layers} become more non-linear during Mixup training compared to a network that is trained \emph{without} Mixup. However, when Mixup is used as an unsupervised loss, we observe that all the network layers become more linear resulting in faster training convergence.

* NeurIPS 2022 (First Workshop on Interpolation and Beyond)

Via

Access Paper or Ask Questions

Robust Distillation for Worst-class Performance

Jun 13, 2022
Serena Wang, Harikrishna Narasimhan, Yichen Zhou, Sara Hooker, Michal Lukasik, Aditya Krishna Menon

Figure 1 for Robust Distillation for Worst-class Performance

Figure 2 for Robust Distillation for Worst-class Performance

Figure 3 for Robust Distillation for Worst-class Performance

Figure 4 for Robust Distillation for Worst-class Performance

Knowledge distillation has proven to be an effective technique in improving the performance a student model using predictions from a teacher model. However, recent work has shown that gains in average efficiency are not uniform across subgroups in the data, and in particular can often come at the cost of accuracy on rare subgroups and classes. To preserve strong performance across classes that may follow a long-tailed distribution, we develop distillation techniques that are tailored to improve the student's worst-class performance. Specifically, we introduce robust optimization objectives in different combinations for the teacher and student, and further allow for training with any tradeoff between the overall accuracy and the robust worst-class objective. We show empirically that our robust distillation techniques not only achieve better worst-class performance, but also lead to Pareto improvement in the tradeoff between overall performance and worst-class performance compared to other baseline methods. Theoretically, we provide insights into what makes a good teacher when the goal is to train a robust student.

Via

Access Paper or Ask Questions

ELM: Embedding and Logit Margins for Long-Tail Learning

Apr 27, 2022
Wittawat Jitkrittum, Aditya Krishna Menon, Ankit Singh Rawat, Sanjiv Kumar

Figure 1 for ELM: Embedding and Logit Margins for Long-Tail Learning

Figure 2 for ELM: Embedding and Logit Margins for Long-Tail Learning

Figure 3 for ELM: Embedding and Logit Margins for Long-Tail Learning

Figure 4 for ELM: Embedding and Logit Margins for Long-Tail Learning

Long-tail learning is the problem of learning under skewed label distributions, which pose a challenge for standard learners. Several recent approaches for the problem have proposed enforcing a suitable margin in logit space. Such techniques are intuitive analogues of the guiding principle behind SVMs, and are equally applicable to linear models and neural models. However, when applied to neural models, such techniques do not explicitly control the geometry of the learned embeddings. This can be potentially sub-optimal, since embeddings for tail classes may be diffuse, resulting in poor generalization for these classes. We present Embedding and Logit Margins (ELM), a unified approach to enforce margins in logit space, and regularize the distribution of embeddings. This connects losses for long-tail learning to proposals in the literature on metric embedding, and contrastive learning. We theoretically show that minimising the proposed ELM objective helps reduce the generalisation gap. The ELM method is shown to perform well empirically, and results in tighter tail class embeddings.

* 24 pages

Via

Access Paper or Ask Questions

When in Doubt, Summon the Titans: Efficient Inference with Large Models

Oct 19, 2021
Ankit Singh Rawat, Manzil Zaheer, Aditya Krishna Menon, Amr Ahmed, Sanjiv Kumar

Figure 1 for When in Doubt, Summon the Titans: Efficient Inference with Large Models

Figure 2 for When in Doubt, Summon the Titans: Efficient Inference with Large Models

Figure 3 for When in Doubt, Summon the Titans: Efficient Inference with Large Models

Figure 4 for When in Doubt, Summon the Titans: Efficient Inference with Large Models

Scaling neural networks to "large" sizes, with billions of parameters, has been shown to yield impressive results on many challenging problems. However, the inference cost incurred by such large models often prevents their application in most real-world settings. In this paper, we propose a two-stage framework based on distillation that realizes the modelling benefits of the large models, while largely preserving the computational benefits of inference with more lightweight models. In a nutshell, we use the large teacher models to guide the lightweight student models to only make correct predictions on a subset of "easy" examples; for the "hard" examples, we fall-back to the teacher. Such an approach allows us to efficiently employ large models in practical scenarios where easy examples are much more frequent than rare hard examples. Our proposed use of distillation to only handle easy instances allows for a more aggressive trade-off in the student size, thereby reducing the amortized cost of inference and achieving better accuracy than standard distillation. Empirically, we demonstrate the benefits of our approach on both image classification and natural language processing benchmarks.

Via

Access Paper or Ask Questions

Training Over-parameterized Models with Non-decomposable Objectives

Jul 09, 2021
Harikrishna Narasimhan, Aditya Krishna Menon

Figure 1 for Training Over-parameterized Models with Non-decomposable Objectives

Figure 2 for Training Over-parameterized Models with Non-decomposable Objectives

Figure 3 for Training Over-parameterized Models with Non-decomposable Objectives

Figure 4 for Training Over-parameterized Models with Non-decomposable Objectives

Many modern machine learning applications come with complex and nuanced design goals such as minimizing the worst-case error, satisfying a given precision or recall target, or enforcing group-fairness constraints. Popular techniques for optimizing such non-decomposable objectives reduce the problem into a sequence of cost-sensitive learning tasks, each of which is then solved by re-weighting the training loss with example-specific costs. We point out that the standard approach of re-weighting the loss to incorporate label costs can produce unsatisfactory results when used to train over-parameterized models. As a remedy, we propose new cost-sensitive losses that extend the classical idea of logit adjustment to handle more general cost matrices. Our losses are calibrated, and can be further improved with distilled labels from a teacher model. Through experiments on benchmark image datasets, we showcase the effectiveness of our approach in training ResNet models with common robust and constrained optimization objectives.

Via

Access Paper or Ask Questions

Teacher's pet: understanding and mitigating biases in distillation

Jul 08, 2021
Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar

Figure 1 for Teacher's pet: understanding and mitigating biases in distillation

Figure 2 for Teacher's pet: understanding and mitigating biases in distillation

Figure 3 for Teacher's pet: understanding and mitigating biases in distillation

Figure 4 for Teacher's pet: understanding and mitigating biases in distillation

Knowledge distillation is widely used as a means of improving the performance of a relatively simple student model using the predictions from a complex teacher model. Several works have shown that distillation significantly boosts the student's overall performance; however, are these gains uniform across all data subgroups? In this paper, we show that distillation can harm performance on certain subgroups, e.g., classes with few associated samples. We trace this behaviour to errors made by the teacher distribution being transferred to and amplified by the student model. To mitigate this problem, we present techniques which soften the teacher influence for subgroups where it is less reliable. Experiments on several image classification benchmarks show that these modifications of distillation maintain boost in overall accuracy, while additionally ensuring improvement in subgroup performance.

* 21 pages, 8 figures

Via

Access Paper or Ask Questions

Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces

May 12, 2021
Ankit Singh Rawat, Aditya Krishna Menon, Wittawat Jitkrittum, Sadeep Jayasumana, Felix X. Yu, Sashank Reddi, Sanjiv Kumar

Figure 1 for Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces

Figure 2 for Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces

Figure 3 for Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces

Figure 4 for Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces

Negative sampling schemes enable efficient training given a large number of classes, by offering a means to approximate a computationally expensive loss function that takes all labels into account. In this paper, we present a new connection between these schemes and loss modification techniques for countering label imbalance. We show that different negative sampling schemes implicitly trade-off performance on dominant versus rare labels. Further, we provide a unified means to explicitly tackle both sampling bias, arising from working with a subset of all labels, and labeling bias, which is inherent to the data due to label imbalance. We empirically verify our findings on long-tail classification and retrieval benchmarks.

* To appear in ICML 2021

Via

Access Paper or Ask Questions