CLIP has become a promising language-supervised visual pre-training framework and achieves excellent performance over a wide range of tasks. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation strategies, including relation, feature, gradient and contrastive paradigm, to examine the impact on CLIP distillation. We show that the simplest feature mimicry with MSE loss performs best. Moreover, interactive contrastive learning and relation-based distillation are also critical in performance improvement. We apply the unified method to distill several student networks trained on 15 million (image, text) pairs. Distillation improves the student CLIP models consistently over zero-shot ImageNet classification and cross-modal retrieval benchmarks. We hope our empirical study will become an important baseline for future CLIP distillation research. The code is available at \url{https://github.com/winycg/CLIP-KD}.
Deep neural networks have achieved remarkable performance for artificial intelligence tasks. The success behind intelligent systems often relies on large-scale models with high computational complexity and storage costs. The over-parameterized networks are often easy to optimize and can achieve better performance. However, it is challenging to deploy them over resource-limited edge-devices. Knowledge Distillation (KD) aims to optimize a lightweight network from the perspective of over-parameterized training. The traditional offline KD transfers knowledge from a cumbersome teacher to a small and fast student network. When a sizeable pre-trained teacher network is unavailable, online KD can improve a group of models by collaborative or mutual learning. Without needing extra models, Self-KD boosts the network itself using attached auxiliary architectures. KD mainly involves knowledge extraction and distillation strategies these two aspects. Beyond KD schemes, various KD algorithms are widely used in practical applications, such as multi-teacher KD, cross-modal KD, attention-based KD, data-free KD and adversarial KD. This paper provides a comprehensive KD survey, including knowledge categories, distillation schemes and algorithms, as well as some empirical studies on performance comparison. Finally, we discuss the open challenges of existing KD works and prospect the future directions.
Takeaway recommender systems, which aim to accurately provide stores that offer foods meeting users' interests, have served billions of users in our daily life. Different from traditional recommendation, takeaway recommendation faces two main challenges: (1) Dual Interaction-Aware Preference Modeling. Traditional recommendation commonly focuses on users' single preferences for items while takeaway recommendation needs to comprehensively consider users' dual preferences for stores and foods. (2) Period-Varying Preference Modeling. Conventional recommendation generally models continuous changes in users' preferences from a session-level or day-level perspective. However, in practical takeaway systems, users' preferences vary significantly during the morning, noon, night, and late night periods of the day. To address these challenges, we propose a Dual Period-Varying Preference modeling (DPVP) for takeaway recommendation. Specifically, we design a dual interaction-aware module, aiming to capture users' dual preferences based on their interactions with stores and foods. Moreover, to model various preferences in different time periods of the day, we propose a time-based decomposition module as well as a time-aware gating mechanism. Extensive offline and online experiments demonstrate that our model outperforms state-of-the-art methods on real-world datasets and it is capable of modeling the dual period-varying preferences. Moreover, our model has been deployed online on Meituan Takeaway platform, leading to an average improvement in GMV (Gross Merchandise Value) of 0.70%.
In this report, we describe the technical details of our submission to the EPIC-SOUNDS Audio-Based Interaction Recognition Challenge 2023, by Team "AcieLee" (username: Yuqi\_Li). The task is to classify the audio caused by interactions between objects, or from events of the camera wearer. We conducted exhaustive experiments and found learning rate step decay, backbone frozen, label smoothing and focal loss contribute most to the performance improvement. After training, we combined multiple models from different stages and integrated them into a single model by assigning fusion weights. This proposed method allowed us to achieve 3rd place in the CVPR 2023 workshop of EPIC-SOUNDS Audio-Based Interaction Recognition Challenge.
Class-Incremental Learning (CIL) aims to solve the neural networks' catastrophic forgetting problem, which refers to the fact that once the network updates on a new task, its performance on previously-learned tasks drops dramatically. Most successful CIL methods incrementally train a feature extractor with the aid of stored exemplars, or estimate the feature distribution with the stored prototypes. However, the stored exemplars would violate the data privacy concerns, while the stored prototypes might not reasonably be consistent with a proper feature distribution, hindering the exploration of real-world CIL applications. In this paper, we propose a method of \textit{e}mbedding distillation and \textit{Ta}sk-oriented \textit{g}eneration (\textit{eTag}) for CIL, which requires neither the exemplar nor the prototype. Instead, eTag achieves a data-free manner to train the neural networks incrementally. To prevent the feature extractor from forgetting, eTag distills the embeddings of the network's intermediate blocks. Additionally, eTag enables a generative network to produce suitable features, fitting the needs of the top incremental classifier. Experimental results confirmed that our proposed eTag considerably outperforms the state-of-the-art methods on CIFAR-100 and ImageNet-sub\footnote{Our code is available in the Supplementary Materials.
Radiologists have different training and clinical experiences, so they may provide various segmentation annotations for a lung nodule, which causes segmentation uncertainty among multiple annotations. Conventional methods usually chose a single annotation as the learning target or tried to learn a latent space of various annotations. Still, they wasted the valuable information of consensus or disagreements ingrained in the multiple annotations. This paper proposes an Uncertainty-Aware Attention Mechanism (UAAM), which utilizes consensus or disagreements among annotations to produce a better segmentation. In UAAM, we propose a Multi-Confidence Mask (MCM), which is a combination of a Low-Confidence (LC) Mask and a High-Confidence (HC) Mask. LC mask indicates regions with low segmentation confidence, which may cause different segmentation options among radiologists. Following UAAM, we further design an Uncertainty-Guide Segmentation Network (UGS-Net), which contains three modules:Feature Extracting Module captures a general feature of a lung nodule. Uncertainty-Aware Module produce three features for the annotations' union, intersection, and annotation set. Finally, Intersection-Union Constraining Module use distances between three features to balance the predictions of final segmentation, LC mask, and HC mask. To fully demonstrate the performance of our method, we propose a Complex Nodule Challenge on LIDC-IDRI, which tests UGS-Net's segmentation performance on the lung nodules that are difficult to segment by U-Net. Experimental results demonstrate that our method can significantly improve the segmentation performance on nodules with poor segmentation by U-Net.
Unlike the conventional Knowledge Distillation (KD), Self-KD allows a network to learn knowledge from itself without any guidance from extra networks. This paper proposes to perform Self-KD from image Mixture (MixSKD), which integrates these two techniques into a unified framework. MixSKD mutually distills feature maps and probability distributions between the random pair of original images and their mixup images in a meaningful way. Therefore, it guides the network to learn cross-image knowledge by modelling supervisory signals from mixup images. Moreover, we construct a self-teacher network by aggregating multi-stage feature maps for providing soft labels to supervise the backbone classifier, further improving the efficacy of self-boosting. Experiments on image classification and transfer learning to object detection and semantic segmentation demonstrate that MixSKD outperforms other state-of-the-art Self-KD and data augmentation methods. The code is available at https://github.com/winycg/Self-KD-Lib.
The teacher-free online Knowledge Distillation (KD) aims to train an ensemble of multiple student models collaboratively and distill knowledge from each other. Although existing online KD methods achieve desirable performance, they often focus on class probabilities as the core knowledge type, ignoring the valuable feature representational information. We present a Mutual Contrastive Learning (MCL) framework for online KD. The core idea of MCL is to perform mutual interaction and transfer of contrastive distributions among a cohort of networks in an online manner. Our MCL can aggregate cross-network embedding information and maximize the lower bound to the mutual information between two networks. This enables each network to learn extra contrastive knowledge from others, leading to better feature representations, thus improving the performance of visual recognition tasks. Beyond the final layer, we extend MCL to several intermediate layers assisted by auxiliary feature refinement modules. This further enhances the ability of representation learning for online KD. Experiments on image classification and transfer learning to visual recognition tasks show that MCL can lead to consistent performance gains against state-of-the-art online KD approaches. The superiority demonstrates that MCL can guide the network to generate better feature representations. Our code is publicly available at https://github.com/winycg/MCL.
Existing works often focus on reducing the architecture redundancy for accelerating image classification but ignore the spatial redundancy of the input image. This paper proposes an efficient image classification pipeline to solve this problem. We first pinpoint task-aware regions over the input image by a lightweight patch proposal network called AnchorNet. We then feed these localized semantic patches with much smaller spatial redundancy into a general classification network. Unlike the popular design of deep CNN, we aim to carefully design the Receptive Field of AnchorNet without intermediate convolutional paddings. This ensures the exact mapping from a high-level spatial location to the specific input image patch. The contribution of each patch is interpretable. Moreover, AnchorNet is compatible with any downstream architecture. Experimental results on ImageNet show that our method outperforms SOTA dynamic inference methods with fewer inference costs. Our code is available at https://github.com/winycg/AnchorNet.