Title: TopKD: Top-scaled Knowledge Distillation

Aug 06, 2025

Qi Wang, Jinjia Zhou

Abstract: Recent advances in knowledge distillation (KD) predominantly emphasize feature-level knowledge transfer, frequently overlooking critical information embedded within the teacher's logit distributions. In this paper, we revisit logit-based distillation and reveal an underexplored yet critical element: Top-K knowledge. Motivated by this insight, we propose Top-scaled Knowledge Distillation (TopKD), a simple, efficient, and architecture-agnostic framework that significantly enhances logit-based distillation. TopKD consists of two main components: (1) a Top-K Scaling Module (TSM), which adaptively amplifies the most informative logits, and (2) a Top-K Decoupled Loss (TDL), which offers targeted and effective supervision. Notably, TopKD integrates seamlessly into existing KD methods without introducing extra modules or requiring architectural changes. Extensive experiments on CIFAR-100, ImageNet, STL-10, and Tiny-ImageNet demonstrate that TopKD consistently surpasses state-of-the-art distillation methods. Moreover, our method demonstrates substantial effectiveness when distilling Vision Transformers, underscoring its versatility across diverse network architectures. These findings highlight the significant potential of logits to advance knowledge distillation.
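
The abstract does not give the formulas for TSM or TDL, so the following is only an illustrative sketch of the general idea of amplifying a teacher's top-K logits before a standard temperature-scaled, logit-based KD loss. It is not the authors' method: the function names, the fixed factor `alpha` (the paper describes the scaling as adaptive), the shift-then-multiply rule, and the plain KL loss (standing in for the decoupled loss) are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scale_top_k(logits, k, alpha):
    """Hypothetical top-K scaling (an assumption, not the paper's TSM).

    Shifts each row so its minimum is zero (softmax is shift-invariant),
    then multiplies the k largest entries by a fixed factor alpha.
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.min(axis=-1, keepdims=True)
    # Indices of the k largest entries in each row (order within them is arbitrary).
    top_idx = np.argpartition(shifted, -k, axis=-1)[..., -k:]
    scaled = shifted.copy()
    boosted = np.take_along_axis(shifted, top_idx, axis=-1) * alpha
    np.put_along_axis(scaled, top_idx, boosted, axis=-1)
    return scaled

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Standard logit-based KD loss: T^2 * KL(teacher || student), batch mean."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log(0)
    kl = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1)
    return float(temperature ** 2 * kl.mean())

# Toy example: one sample, four classes (values are made up).
teacher = np.array([[2.0, 1.0, 0.5, -1.0]])
student = np.array([[1.5, 1.2, 0.1, -0.5]])
scaled_teacher = scale_top_k(teacher, k=2, alpha=1.5)
loss = kd_loss(student, scaled_teacher)
```

Because the shifted logits are non-negative, multiplying the top-K entries by `alpha > 1` can only increase the probability mass the teacher places on those classes, which is the sense in which this sketch "amplifies" them.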

* 12 pages, 6 figures, conference, 8 Tables
