Existing knowledge distillation methods typically work by imparting the knowledge of output logits or intermediate feature maps from the teacher network to the student network, which is very successful in multi-class single-label learning. However, these methods can hardly be extended to the multi-label learning scenario, where each instance is associated with multiple semantic labels, because the prediction probabilities do not sum to one and feature maps of the whole example may ignore minor classes in such a scenario. In this paper, we propose a novel multi-label knowledge distillation method. On one hand, it exploits the informative semantic knowledge from the logits by dividing the multi-label learning problem into a set of binary classification problems; on the other hand, it enhances the distinctiveness of the learned feature representations by leveraging the structural information of label-wise embeddings. Experimental results on multiple benchmark datasets validate that the proposed method can avoid knowledge counteraction among labels, thus achieving superior performance against diverse comparing methods. Our code is available at: https://github.com/penghui-yang/L2D
Knowledge distillation aims to learn a lightweight student network from a pre-trained teacher network. In practice, existing knowledge distillation methods are usually infeasible when the original training data is unavailable due to some privacy issues and data management considerations. Therefore, data-free knowledge distillation approaches proposed to collect training instances from the Internet. However, most of them have ignored the common distribution shift between the instances from original training data and webly collected data, affecting the reliability of the trained student network. To solve this problem, we propose a novel method dubbed ``Knowledge Distillation between Different Distributions" (KD$^{3}$), which consists of three components. Specifically, we first dynamically select useful training instances from the webly collected data according to the combined predictions of teacher network and student network. Subsequently, we align both the weighted features and classifier parameters of the two networks for knowledge memorization. Meanwhile, we also build a new contrastive learning block called MixDistribution to generate perturbed data with a new distribution for instance alignment, so that the student network can further learn a distribution-invariant representation. Intensive experiments on various benchmark datasets demonstrate that our proposed KD$^{3}$ can outperform the state-of-the-art data-free knowledge distillation approaches.
Generating unlabeled data has been recently shown to help address the few-shot hypothesis adaptation (FHA) problem, where we aim to train a classifier for the target domain with a few labeled target-domain data and a well-trained source-domain classifier (i.e., a source hypothesis), for the additional information of the highly-compatible unlabeled data. However, the generated data of the existing methods are extremely similar or even the same. The strong dependency among the generated data will lead the learning to fail. In this paper, we propose a diversity-enhancing generative network (DEG-Net) for the FHA problem, which can generate diverse unlabeled data with the help of a kernel independence measure: the Hilbert-Schmidt independence criterion (HSIC). Specifically, DEG-Net will generate data via minimizing the HSIC value (i.e., maximizing the independence) among the semantic features of the generated data. By DEG-Net, the generated unlabeled data are more diverse and more effective for addressing the FHA problem. Experimental results show that the DEG-Net outperforms existing FHA baselines and further verifies that generating diverse data plays a vital role in addressing the FHA problem
The combinatorial pure exploration (CPE) in the stochastic multi-armed bandit setting (MAB) is a well-studied online decision-making problem: A player wants to find the optimal \emph{action} $\boldsymbol{\pi}^*$ from \emph{action class} $\mathcal{A}$, which is a collection of subsets of arms with certain combinatorial structures. Though CPE can represent many combinatorial structures such as paths, matching, and spanning trees, most existing works focus only on binary action class $\mathcal{A}\subseteq\{0, 1\}^d$ for some positive integer $d$. This binary formulation excludes important problems such as the optimal transport, knapsack, and production planning problems. To overcome this limitation, we extend the binary formulation to real, $\mathcal{A}\subseteq\mathbb{R}^d$, and propose a new algorithm. The only assumption we make is that the number of actions in $\mathcal{A}$ is polynomial in $d$. We show an upper bound of the sample complexity for our algorithm and the action class-dependent lower bound for R-CPE-MAB, by introducing a quantity that characterizes the problem's difficulty, which is a generalization of the notion \emph{width} introduced in Chen et al.[2014].
Training a classifier exploiting a huge amount of supervised data is expensive or even prohibited in a situation, where the labeling cost is high. The remarkable progress in working with weaker forms of supervision is binary classification from multiple unlabeled datasets which requires the knowledge of exact class priors for all unlabeled datasets. However, the availability of class priors is restrictive in many real-world scenarios. To address this issue, we propose to solve a new problem setting, i.e., binary classification from multiple unlabeled datasets with only one pairwise numerical relationship of class priors (MU-OPPO), which knows the relative order (which unlabeled dataset has a higher proportion of positive examples) of two class-prior probabilities for two datasets among multiple unlabeled datasets. In MU-OPPO, we do not need the class priors for all unlabeled datasets, but we only require that there exists a pair of unlabeled datasets for which we know which unlabeled dataset has a larger class prior. Clearly, this form of supervision is easier to be obtained, which can make labeling costs almost free. We propose a novel framework to handle the MU-OPPO problem, which consists of four sequential modules: (i) pseudo label assignment; (ii) confident example collection; (iii) class prior estimation; (iv) classifier training with estimated class priors. Theoretically, we analyze the gap between estimated class priors and true class priors under the proposed framework. Empirically, we confirm the superiority of our framework with comprehensive experiments. Experimental results demonstrate that our framework brings smaller estimation errors of class priors and better performance of binary classification.
Label-noise learning (LNL) aims to increase the model's generalization given training data with noisy labels. To facilitate practical LNL algorithms, researchers have proposed different label noise types, ranging from class-conditional to instance-dependent noises. In this paper, we introduce a novel label noise type called BadLabel, which can significantly degrade the performance of existing LNL algorithms by a large margin. BadLabel is crafted based on the label-flipping attack against standard classification, where specific samples are selected and their labels are flipped to other labels so that the loss values of clean and noisy labels become indistinguishable. To address the challenge posed by BadLabel, we further propose a robust LNL method that perturbs the labels in an adversarial manner at each epoch to make the loss values of clean and noisy labels again distinguishable. Once we select a small set of (mostly) clean labeled data, we can apply the techniques of semi-supervised learning to train the model accurately. Empirically, our experimental results demonstrate that existing LNL algorithms are vulnerable to the newly introduced BadLabel noise type, while our proposed robust LNL method can effectively improve the generalization performance of the model under various types of label noise. The new dataset of noisy labels and the source codes of robust LNL algorithms are available at https://github.com/zjfheart/BadLabels.
Distribution shift (DS) may have two levels: the distribution itself changes, and the support (i.e., the set where the probability density is non-zero) also changes. When considering the support change between the training and test distributions, there can be four cases: (i) they exactly match; (ii) the training support is wider (and thus covers the test support); (iii) the test support is wider; (iv) they partially overlap. Existing methods are good at cases (i) and (ii), while cases (iii) and (iv) are more common nowadays but still under-explored. In this paper, we generalize importance weighting (IW), a golden solver for cases (i) and (ii), to a universal solver for all cases. Specifically, we first investigate why IW may fail in cases (iii) and (iv); based on the findings, we propose generalized IW (GIW) that could handle cases (iii) and (iv) and would reduce to IW in cases (i) and (ii). In GIW, the test support is split into an in-training (IT) part and an out-of-training (OOT) part, and the expected risk is decomposed into a weighted classification term over the IT part and a standard classification term over the OOT part, which guarantees the risk consistency of GIW. Then, the implementation of GIW consists of three components: (a) the split of validation data is carried out by the one-class support vector machine, (b) the first term of the empirical risk can be handled by any IW algorithm given training data and IT validation data, and (c) the second term just involves OOT validation data. Experiments demonstrate that GIW is a universal solver for DS problems, outperforming IW methods in cases (iii) and (iv).
In this paper, we introduce the imprecise label learning (ILL) framework, a unified approach to handle various imprecise label configurations, which are commonplace challenges in machine learning tasks. ILL leverages an expectation-maximization (EM) algorithm for the maximum likelihood estimation (MLE) of the imprecise label information, treating the precise labels as latent variables. Compared to previous versatile methods attempting to infer correct labels from the imprecise label information, our ILL framework considers all possible labeling imposed by the imprecise label information, allowing a unified solution to deal with any imprecise labels. With comprehensive experimental results, we demonstrate that ILL can seamlessly adapt to various situations, including partial label learning, semi-supervised learning, noisy label learning, and a mixture of these settings. Notably, our simple method surpasses the existing techniques for handling imprecise labels, marking the first unified framework with robust and effective performance across various imprecise labels. We believe that our approach has the potential to significantly enhance the performance of machine learning models on tasks where obtaining precise labels is expensive and complicated. We hope our work will inspire further research on this topic with an open-source codebase release.
Disentangled representation learning is a challenging task that involves separating multiple factors of variation in complex data. Although various metrics for learning and evaluating disentangled representations have been proposed, it remains unclear what these metrics truly quantify and how to compare them. In this work, we study the definitions of disentanglement given by first-order equational predicates and introduce a systematic approach for transforming an equational definition into a compatible quantitative metric based on enriched category theory. Specifically, we show how to replace (i) equality with metric or divergence, (ii) logical connectives with order operations, (iii) universal quantifier with aggregation, and (iv) existential quantifier with the best approximation. Using this approach, we derive metrics for measuring the desired properties of a disentangled representation extractor and demonstrate their effectiveness on synthetic data. Our proposed approach provides practical guidance for researchers in selecting appropriate evaluation metrics and designing effective learning algorithms for disentangled representation learning.
Complementary-label Learning (CLL) is a form of weakly supervised learning that trains an ordinary classifier using only complementary labels, which are the classes that certain instances do not belong to. While existing CLL studies typically use novel loss functions or training techniques to solve this problem, few studies focus on how complementary labels collectively provide information to train the ordinary classifier. In this paper, we fill the gap by analyzing the implicit sharing of complementary labels on nearby instances during training. Our analysis reveals that the efficiency of implicit label sharing is closely related to the performance of existing CLL models. Based on this analysis, we propose a novel technique that enhances the sharing efficiency via complementary-label augmentation, which explicitly propagates additional complementary labels to each instance. We carefully design the augmentation process to enrich the data with new and accurate complementary labels, which provide CLL models with fresh and valuable information to enhance the sharing efficiency. We then verify our proposed technique by conducting thorough experiments on both synthetic and real-world datasets. Our results confirm that complementary-label augmentation can systematically improve empirical performance over state-of-the-art CLL models.