Chuanguang Yang

CLIP-KD: An Empirical Study of Distilling CLIP Models

Jul 24, 2023
Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Yongjun Xu

CLIP has become a promising language-supervised visual pre-training framework and achieves excellent performance across a wide range of tasks. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation strategies, including relation-based, feature-based, gradient-based, and contrastive paradigms, to examine their impact on CLIP distillation. We show that the simplest feature mimicry with an MSE loss performs best. Moreover, interactive contrastive learning and relation-based distillation are also critical for performance improvement. We apply the unified method to distill several student networks trained on 15 million (image, text) pairs. Distillation improves the student CLIP models consistently on zero-shot ImageNet classification and cross-modal retrieval benchmarks. We hope our empirical study will become an important baseline for future CLIP distillation research. The code is available at https://github.com/winycg/CLIP-KD.
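To make the "feature mimicry with MSE" finding concrete, here is a minimal sketch of such a loss in PyTorch. It is an illustration under our own assumptions (an optional projection layer `proj` to align embedding widths), not the released CLIP-KD implementation:

```python
import torch.nn.functional as F

def clip_feature_mimicry_loss(student_img, student_txt, teacher_img, teacher_txt, proj=None):
    # Regress the student's image/text embeddings onto the teacher's with plain MSE.
    # `proj` is an assumed linear layer that maps the student's embedding width to the
    # teacher's when the two dimensions differ.
    if proj is not None:
        student_img = proj(student_img)
        student_txt = proj(student_txt)
    return F.mse_loss(student_img, teacher_img) + F.mse_loss(student_txt, teacher_txt)
```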

Categories of Response-Based, Feature-Based, and Relation-Based Knowledge Distillation

Jun 19, 2023
Chuanguang Yang, Xinqiang Yu, Zhulin An, Yongjun Xu

Deep neural networks have achieved remarkable performance on artificial intelligence tasks. The success of intelligent systems often relies on large-scale models with high computational complexity and storage costs. Such over-parameterized networks are often easier to optimize and can achieve better performance, but they are challenging to deploy on resource-limited edge devices. Knowledge Distillation (KD) aims to optimize a lightweight network from the perspective of over-parameterized training. Traditional offline KD transfers knowledge from a cumbersome teacher to a small and fast student network. When a sizeable pre-trained teacher network is unavailable, online KD can improve a group of models through collaborative or mutual learning. Without needing extra models, Self-KD boosts the network itself using attached auxiliary architectures. KD mainly involves two aspects: knowledge extraction and distillation strategies. Beyond these KD schemes, various KD algorithms are widely used in practical applications, such as multi-teacher KD, cross-modal KD, attention-based KD, data-free KD, and adversarial KD. This paper provides a comprehensive KD survey, covering knowledge categories, distillation schemes and algorithms, as well as empirical studies on performance comparison. Finally, we discuss the open challenges of existing KD works and prospect future directions.
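As a reference point for the response-based category, the sketch below shows the canonical offline KD objective that combines a temperature-softened KL term with the usual cross-entropy. The temperature `T` and weight `alpha` are illustrative values rather than settings prescribed by the survey:

```python
import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Temperature-softened teacher distribution serves as the soft target.
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)  # hard-label supervision
    return alpha * kd + (1.0 - alpha) * ce
```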

* Published in the Springer book "Advancements in Knowledge Distillation: Towards New Horizons of Intelligent Systems" 

Team AcieLee: Technical Report for EPIC-SOUNDS Audio-Based Interaction Recognition Challenge 2023

Jun 15, 2023
Yuqi Li, Yizhi Luo, Xiaoshuai Hao, Chuanguang Yang, Zhulin An, Dantong Song, Wei Yi

In this report, we describe the technical details of our submission to the EPIC-SOUNDS Audio-Based Interaction Recognition Challenge 2023 by Team "AcieLee" (username: Yuqi_Li). The task is to classify the audio caused by interactions between objects or by events of the camera wearer. We conducted exhaustive experiments and found that learning-rate step decay, backbone freezing, label smoothing, and focal loss contribute most to the performance improvement. After training, we combined multiple models from different stages and integrated them into a single model by assigning fusion weights. This proposed method allowed us to achieve 3rd place in the CVPR 2023 EPIC-SOUNDS Audio-Based Interaction Recognition Challenge workshop.
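The fusion step can be pictured as a weighted average of the logits produced by the individual checkpoints. The snippet below is a hypothetical sketch of that idea; the function name and the weights are placeholders, not the team's actual configuration:

```python
import torch

def fuse_logits(logits_list, weights):
    # Weighted average of per-model logits; the weights would be tuned on a
    # validation split (e.g. [0.5, 0.3, 0.2] for three checkpoints).
    w = torch.tensor(weights, dtype=logits_list[0].dtype, device=logits_list[0].device)
    w = w / w.sum()
    stacked = torch.stack(logits_list, dim=0)           # (num_models, batch, classes)
    return (w.view(-1, 1, 1) * stacked).sum(dim=0)      # (batch, classes)
```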

eTag: Class-Incremental Learning with Embedding Distillation and Task-Oriented Generation

Apr 20, 2023
Libo Huang, Yan Zeng, Chuanguang Yang, Zhulin An, Boyu Diao, Yongjun Xu

Class-Incremental Learning (CIL) aims to solve the catastrophic forgetting problem of neural networks, namely that once the network updates on a new task, its performance on previously learned tasks drops dramatically. Most successful CIL methods incrementally train a feature extractor with the aid of stored exemplars, or estimate the feature distribution with stored prototypes. However, stored exemplars raise data privacy concerns, while stored prototypes might not be consistent with a proper feature distribution, hindering the exploration of real-world CIL applications. In this paper, we propose a method of embedding distillation and Task-oriented generation (eTag) for CIL, which requires neither exemplars nor prototypes. Instead, eTag trains the neural network incrementally in a data-free manner. To prevent the feature extractor from forgetting, eTag distills the embeddings of the network's intermediate blocks. Additionally, eTag enables a generative network to produce suitable features, fitting the needs of the top incremental classifier. Experimental results confirm that our proposed eTag considerably outperforms state-of-the-art methods on CIFAR-100 and ImageNet-sub. Our code is available in the Supplementary Materials.
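A rough sketch of the embedding-distillation idea, assuming the old model is frozen and each block's output is pooled to a vector, might look as follows; this is our reading of the abstract, not the paper's exact loss:

```python
import torch.nn.functional as F

def embedding_distillation_loss(old_block_feats, new_block_feats):
    # Match the normalized, pooled intermediate-block embeddings of the frozen
    # old model so the feature extractor is not overwritten by the new task.
    # Both arguments are assumed to be lists of (batch, dim) tensors, shallow to deep.
    loss = 0.0
    for f_old, f_new in zip(old_block_feats, new_block_feats):
        loss = loss + F.mse_loss(F.normalize(f_new, dim=1),
                                 F.normalize(f_old, dim=1).detach())
    return loss / len(old_block_feats)
```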

* 12 pages, 12 figures 

MixSKD: Self-Knowledge Distillation from Mixup for Image Recognition

Aug 11, 2022
Chuanguang Yang, Zhulin An, Helong Zhou, Linhang Cai, Xiang Zhi, Jiwen Wu, Yongjun Xu, Qian Zhang

Unlike conventional Knowledge Distillation (KD), Self-KD allows a network to learn knowledge from itself without any guidance from extra networks. This paper proposes to perform Self-KD from image Mixture (MixSKD), which integrates these two techniques into a unified framework. MixSKD mutually distills feature maps and probability distributions between random pairs of original images and their mixup images in a meaningful way. Therefore, it guides the network to learn cross-image knowledge by modelling supervisory signals from mixup images. Moreover, we construct a self-teacher network by aggregating multi-stage feature maps to provide soft labels that supervise the backbone classifier, further improving the efficacy of self-boosting. Experiments on image classification and on transfer learning to object detection and semantic segmentation demonstrate that MixSKD outperforms other state-of-the-art Self-KD and data augmentation methods. The code is available at https://github.com/winycg/Self-KD-Lib.
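One way to picture the mixup-based mutual distillation is a consistency term in which the prediction on a mixed image is matched against the same convex mixture of the predictions on the two source images. The sketch below only illustrates that signal, with assumed hyperparameters, and omits the feature-map and self-teacher branches:

```python
import torch
import torch.nn.functional as F

def mixskd_consistency(model, x1, x2, T=3.0):
    # The prediction on the mixed image should agree with the same convex mixture
    # of the predictions on the two original images.
    lam = torch.distributions.Beta(1.0, 1.0).sample().item()
    x_mix = lam * x1 + (1.0 - lam) * x2
    p1 = F.softmax(model(x1) / T, dim=1)
    p2 = F.softmax(model(x2) / T, dim=1)
    log_p_mix = F.log_softmax(model(x_mix) / T, dim=1)
    target = (lam * p1 + (1.0 - lam) * p2).detach()
    return F.kl_div(log_p_mix, target, reduction="batchmean") * (T * T)
```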

* 22 pages, ECCV-2022 

Online Knowledge Distillation via Mutual Contrastive Learning for Visual Recognition

Jul 23, 2022
Chuanguang Yang, Zhulin An, Helong Zhou, Yongjun Xu, Qian Zhang

Teacher-free online Knowledge Distillation (KD) aims to train an ensemble of multiple student models collaboratively and distill knowledge among them. Although existing online KD methods achieve desirable performance, they often focus on class probabilities as the core knowledge type, ignoring valuable feature representational information. We present a Mutual Contrastive Learning (MCL) framework for online KD. The core idea of MCL is to perform mutual interaction and transfer of contrastive distributions among a cohort of networks in an online manner. MCL can aggregate cross-network embedding information and maximize a lower bound on the mutual information between two networks. This enables each network to learn extra contrastive knowledge from the others, leading to better feature representations and thus improving performance on visual recognition tasks. Beyond the final layer, we extend MCL to several intermediate layers, assisted by auxiliary feature refinement modules, which further enhances representation learning for online KD. Experiments on image classification and on transfer learning to other visual recognition tasks show that MCL provides consistent performance gains over state-of-the-art online KD approaches. This superiority demonstrates that MCL can guide the network to generate better feature representations. Our code is publicly available at https://github.com/winycg/MCL.
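A minimal sketch of the interactive, cross-network contrastive signal, assuming two networks produce embeddings `z_a` and `z_b` for the same batch, is an InfoNCE term whose positives come from the other network. The temperature is an assumed value, and the full MCL framework involves further terms:

```python
import torch
import torch.nn.functional as F

def interactive_contrastive_loss(z_a, z_b, tau=0.07):
    # Each sample's embedding from network A is pulled towards the same sample's
    # embedding from network B and pushed away from the other samples in the batch.
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / tau                      # (batch, batch) cross-network similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)
```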

* 15 pages 

Localizing Semantic Patches for Accelerating Image Classification

Jun 07, 2022
Chuanguang Yang, Zhulin An, Yongjun Xu

Existing works often focus on reducing architectural redundancy to accelerate image classification, but they ignore the spatial redundancy of the input image. This paper proposes an efficient image classification pipeline to solve this problem. We first pinpoint task-aware regions over the input image with a lightweight patch proposal network called AnchorNet. We then feed these localized semantic patches, which carry much less spatial redundancy, into a general classification network. Unlike the popular design of deep CNNs, we carefully design the receptive field of AnchorNet without intermediate convolutional paddings. This ensures an exact mapping from a high-level spatial location to the corresponding input image patch, so the contribution of each patch is interpretable. Moreover, AnchorNet is compatible with any downstream architecture. Experimental results on ImageNet show that our method outperforms SOTA dynamic inference methods at lower inference cost. Our code is available at https://github.com/winycg/AnchorNet.
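The two-stage inference pipeline can be sketched as: score locations with the proposal network, crop the top-k patches, and average the classifier's logits over them. Everything below (names, patch size, output shapes) is assumed for illustration rather than taken from the AnchorNet code:

```python
import torch

@torch.no_grad()
def classify_with_patches(anchor_net, classifier, image, top_k=3, patch=96):
    # Two-stage inference: a lightweight proposal network scores spatial locations,
    # the top-k patches are cropped from the input, and a standard classifier
    # averages their logits. Assumes a single image (batch size 1).
    B, _, H, W = image.shape
    heatmap = anchor_net(image)                  # assumed shape: (B, h, w) location scores
    h, w = heatmap.shape[-2:]
    idx = heatmap.flatten(1).topk(top_k, dim=1).indices[0]
    logits = []
    for i in idx.tolist():
        cy, cx = (i // w) * H // h, (i % w) * W // w
        y0 = max(0, min(H - patch, cy - patch // 2))
        x0 = max(0, min(W - patch, cx - patch // 2))
        logits.append(classifier(image[..., y0:y0 + patch, x0:x0 + patch]))
    return torch.stack(logits).mean(dim=0)       # (B, num_classes)
```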

* Accepted by ICME-2022 

Cross-Image Relational Knowledge Distillation for Semantic Segmentation

Apr 14, 2022
Chuanguang Yang, Helong Zhou, Zhulin An, Xue Jiang, Yongjun Xu, Qian Zhang

Current Knowledge Distillation (KD) methods for semantic segmentation often guide the student to mimic the teacher's structured information generated from individual data samples. However, they ignore the global semantic relations among pixels across different images, which are valuable for KD. This paper proposes a novel Cross-Image Relational KD (CIRKD), which focuses on transferring structured pixel-to-pixel and pixel-to-region relations across whole images. The motivation is that a good teacher network constructs a well-structured feature space in terms of global pixel dependencies. CIRKD makes the student mimic better structured semantic relations from the teacher, thus improving segmentation performance. Experimental results on the Cityscapes, CamVid, and Pascal VOC datasets demonstrate the effectiveness of our proposed approach against state-of-the-art distillation methods. The code is available at https://github.com/winycg/CIRKD.
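A schematic reading of the pixel-to-pixel relational transfer is given below: pixel embeddings from the whole mini-batch are pooled, pairwise similarity distributions are formed for student and teacher, and the student matches the teacher's distribution. It is a sketch under simplifying assumptions (shared spatial size, random pixel sub-sampling), not the exact CIRKD loss:

```python
import torch
import torch.nn.functional as F

def cross_image_pixel_relation_loss(feat_s, feat_t, T=1.0, num_pixels=256):
    # Assumes student and teacher feature maps share batch and spatial size
    # (interpolate beforehand otherwise); channel widths may differ.
    def flat_pixels(feat):                                # (B, C, H, W) -> (B*H*W, C)
        B, C, H, W = feat.shape
        return feat.permute(0, 2, 3, 1).reshape(B * H * W, C)
    pix_s, pix_t = flat_pixels(feat_s), flat_pixels(feat_t)
    idx = torch.randperm(pix_s.size(0), device=pix_s.device)[:num_pixels]
    pix_s = F.normalize(pix_s[idx], dim=1)
    pix_t = F.normalize(pix_t[idx], dim=1)
    sim_s = pix_s @ pix_s.t() / T                         # student pixel-to-pixel relations
    sim_t = pix_t @ pix_t.t() / T                         # teacher pixel-to-pixel relations
    return F.kl_div(F.log_softmax(sim_s, dim=1),
                    F.softmax(sim_t.detach(), dim=1), reduction="batchmean")
```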

* Accepted by CVPR-2022 

Knowledge Distillation Using Hierarchical Self-Supervision Augmented Distribution

Sep 07, 2021
Chuanguang Yang, Zhulin An, Linhang Cai, Yongjun Xu

Knowledge distillation (KD) is an effective framework that aims to transfer meaningful information from a large teacher to a smaller student. KD generally involves how to define and transfer knowledge. Previous KD methods often focus on mining various forms of knowledge, for example feature maps and refined information. However, such knowledge is derived from the primary supervised task and is thus highly task-specific. Motivated by the recent success of self-supervised representation learning, we propose an auxiliary self-supervision augmented task to guide networks to learn more meaningful features. From this task we derive soft self-supervision augmented distributions as richer dark knowledge for KD. Unlike previous knowledge, this distribution encodes joint knowledge from supervised and self-supervised feature learning. Beyond knowledge exploration, another crucial aspect is how to learn and distill the proposed knowledge effectively. To take full advantage of hierarchical feature maps, we propose to append several auxiliary branches at various hidden layers. Each auxiliary branch is guided to learn the self-supervision augmented task and distill this distribution from teacher to student. We therefore call our KD method Hierarchical Self-Supervision Augmented Knowledge Distillation (HSSAKD). Experiments on standard image classification show that both offline and online HSSAKD achieve state-of-the-art performance in the field of KD. Further transfer experiments on object detection verify that HSSAKD guides the network to learn better features, which can be attributed to effectively learning and distilling the auxiliary self-supervision augmented task.
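The self-supervision augmented distribution can be thought of as a single softmax over the joint (class, rotation) label space produced by an auxiliary branch fed with four rotated copies of each image. The sketch below illustrates that distribution and its teacher-to-student KL transfer under assumed settings (rotation as the self-supervised task, temperature `T`):

```python
import torch
import torch.nn.functional as F

def rotated_copies(x):
    # The 4 rotated views (0/90/180/270 degrees) defining the auxiliary
    # self-supervised task; the joint label space has size num_classes * 4.
    return torch.cat([torch.rot90(x, k, dims=(2, 3)) for k in range(4)], dim=0)

def ss_augmented_kd_loss(aux_logits_s, aux_logits_t, T=3.0):
    # KL between the student's and teacher's softened distributions over the
    # joint (class, rotation) label space produced by the auxiliary branches.
    p_t = F.softmax(aux_logits_t / T, dim=1)
    log_p_s = F.log_softmax(aux_logits_s / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)
```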

* 15 pages, an extension of Hierarchical Self-supervised Augmented Knowledge Distillation published at IJCAI-2021 

Hierarchical Self-supervised Augmented Knowledge Distillation

Jul 29, 2021
Chuanguang Yang, Zhulin An, Linhang Cai, Yongjun Xu

Knowledge distillation often involves how to define and transfer knowledge from teacher to student effectively. Although recent self-supervised contrastive knowledge achieves the best performance, forcing the network to learn such knowledge may damage the representation learning of the original class recognition task. We therefore adopt an alternative self-supervised augmented task to guide the network to learn the joint distribution of the original recognition task and the self-supervised auxiliary task. This is demonstrated to be a richer form of knowledge that improves representation power without losing normal classification capability. Moreover, previous methods transfer only the probabilistic knowledge between the final layers, which is incomplete. We propose to append several auxiliary classifiers to hierarchical intermediate feature maps to generate diverse self-supervised knowledge and perform a one-to-one transfer to teach the student network thoroughly. Our method significantly surpasses the previous SOTA SSKD, with an average improvement of 2.56% on CIFAR-100 and an improvement of 0.77% on ImageNet across widely used network pairs. Codes are available at https://github.com/winycg/HSAKD.
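The hierarchical one-to-one transfer can be sketched as matching the stage-i student auxiliary classifier to the stage-i teacher auxiliary classifier, stage by stage. The snippet below is a schematic of that loop, with an assumed temperature, rather than the released HSAKD training code:

```python
import torch.nn.functional as F

def hierarchical_one_to_one_kd(aux_logits_s, aux_logits_t, T=3.0):
    # `aux_logits_*` are lists of logits from the auxiliary classifiers attached
    # at matching intermediate stages, ordered from shallow to deep. Stage-i
    # teacher distribution supervises stage-i student distribution.
    loss = 0.0
    for s_logit, t_logit in zip(aux_logits_s, aux_logits_t):
        loss = loss + F.kl_div(F.log_softmax(s_logit / T, dim=1),
                               F.softmax(t_logit / T, dim=1),
                               reduction="batchmean") * (T * T)
    return loss / len(aux_logits_s)
```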

* 7 pages, IJCAI-2021 