Abstract:Multi-Label Online Continual Learning (MOCL) requires models to learn continuously from endless multi-label data streams, facing complex challenges including persistent catastrophic forgetting, potential missing labels, and uncontrollable imbalanced class distributions. While existing MOCL methods attempt to address these challenges through various techniques, \textit{they all overlook label-specific region identifying and feature learning} - a fundamental solution rooted in multi-label learning but challenging to achieve in the online setting with incremental and partial supervision. To this end, we first leverage the inherent structural information of input data to evaluate and verify the innate localization capability of different pre-trained models. Then, we propose CUTER (CUT-out-and-Experience-Replay), a simple yet versatile strategy that provides fine-grained supervision signals by further identifying, strengthening and cutting out label-specific regions for efficient experience replay. It not only enables models to simultaneously address catastrophic forgetting, missing labels, and class imbalance challenges, but also serves as an orthogonal solution that seamlessly integrates with existing approaches. Extensive experiments on multiple multi-label image benchmarks demonstrate the superiority of our proposed method. The code is available at \href{https://github.com/wxr99/Cut-Replay}{https://github.com/wxr99/Cut-Replay}
Abstract:Using unlabeled wild data containing both in-distribution (ID) and out-of-distribution (OOD) data to improve the safety and reliability of models has recently received increasing attention. Existing methods either design customized losses for labeled ID and unlabeled wild data then perform joint optimization, or first filter out OOD data from the latter then learn an OOD detector. While achieving varying degrees of success, two potential issues remain: (i) Labeled ID data typically dominates the learning of models, inevitably making models tend to fit OOD data as IDs; (ii) The selection of thresholds for identifying OOD data in unlabeled wild data usually faces dilemma due to the unavailability of pure OOD samples. To address these issues, we propose a novel loss-difference OOD detection framework (LoD) by \textit{intentionally label-noisifying} unlabeled wild data. Such operations not only enable labeled ID data and OOD data in unlabeled wild data to jointly dominate the models' learning but also ensure the distinguishability of the losses between ID and OOD samples in unlabeled wild data, allowing the classic clustering technique (e.g., K-means) to filter these OOD samples without requiring thresholds any longer. We also provide theoretical foundation for LoD's viability, and extensive experiments verify its superiority.
Abstract:Based on the success of large-scale visual foundation models like CLIP in various downstream tasks, this paper initially attempts to explore their impact on Long-Tailed Semi-Supervised Learning (LTSSL) by employing the foundation model with three strategies: Linear Probing (LP), Lightweight Fine-Tuning (LFT), and Full Fine-Tuning (FFT). Our analysis presents the following insights: i) Compared to LTSSL algorithms trained from scratch, FFT results in a decline in model performance, whereas LP and LFT, although boosting overall model performance, exhibit negligible benefits to tail classes. ii) LP produces numerous false pseudo-labels due to \textit{underlearned} training data, while LFT can reduce the number of these false labels but becomes overconfident about them owing to \textit{biased fitting} training data. This exacerbates the pseudo-labeled and classifier biases inherent in LTSSL, limiting performance improvement in the tail classes. With these insights, we propose a Unbiased Lightweight Fine-tuning strategy, \textbf{ULFine}, which mitigates the overconfidence via confidence-aware adaptive fitting of textual prototypes and counteracts the pseudo-labeled and classifier biases via complementary fusion of dual logits. Extensive experiments demonstrate that ULFine markedly decreases training costs by over ten times and substantially increases prediction accuracies compared to state-of-the-art methods.
Abstract:Out-of-distribution detection (OOD) is a pivotal task for real-world applications that trains models to identify samples that are distributionally different from the in-distribution (ID) data during testing. Recent advances in AI, particularly Vision-Language Models (VLMs) like CLIP, have revolutionized OOD detection by shifting from traditional unimodal image detectors to multimodal image-text detectors. This shift has inspired extensive research; however, existing categorization schemes (e.g., few- or zero-shot types) still rely solely on the availability of ID images, adhering to a unimodal paradigm. To better align with CLIP's cross-modal nature, we propose a new categorization framework rooted in both image and text modalities. Specifically, we categorize existing methods based on how visual and textual information of OOD data is utilized within image + text modalities, and further divide them into four groups: OOD Images (i.e., outliers) Seen or Unseen, and OOD Texts (i.e., learnable vectors or class names) Known or Unknown, across two training strategies (i.e., train-free or training-required). More importantly, we discuss open problems in CLIP-like OOD detection and highlight promising directions for future research, including cross-domain integration, practical applications, and theoretical understanding.
Abstract:Recent studies have verified that semi-supervised learning (SSL) is vulnerable to data poisoning backdoor attacks. Even a tiny fraction of contaminated training data is sufficient for adversaries to manipulate up to 90\% of the test outputs in existing SSL methods. Given the emerging threat of backdoor attacks designed for SSL, this work aims to protect SSL against such risks, marking it as one of the few known efforts in this area. Specifically, we begin by identifying that the spurious correlations between the backdoor triggers and the target class implanted by adversaries are the primary cause of manipulated model predictions during the test phase. To disrupt these correlations, we utilize three key techniques: Gaussian Filter, complementary learning and trigger mix-up, which collectively filter, obstruct and dilute the influence of backdoor attacks in both data pre-processing and feature learning. Experimental results demonstrate that our proposed method, Backdoor Invalidator (BI), significantly reduces the average attack success rate from 84.7\% to 1.8\% across different state-of-the-art backdoor attacks. It is also worth mentioning that BI does not sacrifice accuracy on clean data and is supported by a theoretical guarantee of its generalization capability.
Abstract:Irregularly sampled multivariate time series (ISMTS) are prevalent in reality. Due to their non-uniform intervals between successive observations and varying sampling rates among series, the channel-independent (CI) strategy, which has been demonstrated more desirable for complete multivariate time series forecasting in recent studies, has failed. This failure can be further attributed to the sampling sparsity, which provides insufficient information for effective CI learning, thereby reducing its capacity. When we resort to the channel-dependent (CD) strategy, even higher capacity cannot mitigate the potential loss of diversity in learning similar embedding patterns across different channels. We find that existing work considers CI and CD strategies to be mutually exclusive, primarily because they apply these strategies to the global channel. However, we hold the view that channel strategies do not necessarily have to be used globally. Instead, by appropriately applying them locally and globally, we can create an opportunity to take full advantage of both strategies. This leads us to introduce the Channel Harmony ISMTS Transformer (TimeCHEAT), which utilizes the CD locally and the CI globally. Specifically, we segment the ISMTS into sub-series level patches. Locally, the CD strategy aggregates information within each patch for time embedding learning, maximizing the use of relevant observations while reducing long-range irrelevant interference. Here, we enhance generality by transforming embedding learning into an edge weight prediction task using bipartite graphs, eliminating the need for special prior knowledge. Globally, the CI strategy is applied across patches, allowing the Transformer to learn individualized attention patterns for each channel. Experimental results indicate our proposed TimeCHEAT demonstrates competitive SOTA performance across three mainstream tasks.
Abstract:Domain generalization addresses domain shift in real-world applications. Most approaches adopt a domain angle, seeking invariant representation across domains by aligning their marginal distributions, irrespective of individual classes, naturally leading to insufficient exploration of discriminative information. Switching to a class angle, we find that multiple domain-related peaks or clusters within the same individual classes must emerge due to distribution shift. In other words, marginal alignment does not guarantee conditional alignment, leading to suboptimal generalization. Therefore, we argue that acquiring discriminative generalization between classes within domains is crucial. In contrast to seeking distribution alignment, we endeavor to safeguard domain-related between-class discrimination. To this end, we devise a novel Conjugate Consistent Enhanced Module, namely Con2EM, based on a distribution over domains, i.e., a meta-distribution. Specifically, we employ a novel distribution-level Universum strategy to generate supplementary diverse domain-related class-conditional distributions, thereby enhancing generalization. This allows us to resample from these generated distributions to provide feedback to the primordial instance-level classifier, further improving its adaptability to the target-agnostic. To ensure generation accuracy, we establish an additional distribution-level classifier to regularize these conditional distributions. Extensive experiments have been conducted to demonstrate its effectiveness and low computational cost compared to SOTAs.
Abstract:Irregularly sampled multivariate time series (ISMTS) are prevalent in reality. Most existing methods treat ISMTS as synchronized regularly sampled time series with missing values, neglecting that the irregularities are primarily attributed to variations in sampling rates. In this paper, we introduce a novel perspective that irregularity is essentially relative in some senses. With sampling rates artificially determined from low to high, an irregularly sampled time series can be transformed into a hierarchical set of relatively regular time series from coarse to fine. We observe that additional coarse-grained relatively regular series not only mitigate the irregularly sampled challenges to some extent but also incorporate broad-view temporal information, thereby serving as a valuable asset for representation learning. Therefore, following the philosophy of learning that Seeing the big picture first, then delving into the details, we present the Multi-Scale and Multi-Correlation Attention Network (MuSiCNet) combining multiple scales to iteratively refine the ISMTS representation. Specifically, within each scale, we explore time attention and frequency correlation matrices to aggregate intra- and inter-series information, naturally enhancing the representation quality with richer and more intrinsic details. While across adjacent scales, we employ a representation rectification method containing contrastive learning and reconstruction results adjustment to further improve representation consistency. MuSiCNet is an ISMTS analysis framework that competitive with SOTA in three mainstream tasks consistently, including classification, interpolation, and forecasting.
Abstract:Online continual learning requires the models to learn from constant, endless streams of data. While significant efforts have been made in this field, most were focused on mitigating the catastrophic forgetting issue to achieve better classification ability, at the cost of a much heavier training workload. They overlooked that in real-world scenarios, e.g., in high-speed data stream environments, data do not pause to accommodate slow models. In this paper, we emphasize that model throughput -- defined as the maximum number of training samples that a model can process within a unit of time -- is equally important. It directly limits how much data a model can utilize and presents a challenging dilemma for current methods. With this understanding, we revisit key challenges in OCL from both empirical and theoretical perspectives, highlighting two critical issues beyond the well-documented catastrophic forgetting: Model's ignorance: the single-pass nature of OCL challenges models to learn effective features within constrained training time and storage capacity, leading to a trade-off between effective learning and model throughput; Model's myopia: the local learning nature of OCL on the current task leads the model to adopt overly simplified, task-specific features and excessively sparse classifier, resulting in the gap between the optimal solution for the current task and the global objective. To tackle these issues, we propose the Non-sparse Classifier Evolution framework (NsCE) to facilitate effective global discriminative feature learning with minimal time cost. NsCE integrates non-sparse maximum separation regularization and targeted experience replay techniques with the help of pre-trained models, enabling rapid acquisition of new globally discriminative features.
Abstract:Neural Collapse (NC) presents an elegant geometric structure that enables individual activations (features), class means and classifier (weights) vectors to reach \textit{optimal} inter-class separability during the terminal phase of training on a \textit{balanced} dataset. Once shifted to imbalanced classification, such an optimal structure of NC can be readily destroyed by the notorious \textit{minority collapse}, where the classifier vectors corresponding to the minority classes are squeezed. In response, existing works endeavor to recover NC typically by optimizing classifiers. However, we discover that this squeezing phenomenon is not only confined to classifier vectors but also occurs with class means. Consequently, reconstructing NC solely at the classifier aspect may be futile, as the feature means remain compressed, leading to the violation of inherent \textit{self-duality} in NC (\textit{i.e.}, class means and classifier vectors converge mutually) and incidentally, resulting in an unsatisfactory collapse of individual activations towards the corresponding class means. To shake off these dilemmas, we present a unified \textbf{All}-around \textbf{N}eural \textbf{C}ollapse framework (AllNC), aiming to comprehensively restore NC across multiple aspects including individual activations, class means and classifier vectors. We thoroughly analyze its effectiveness and verify on multiple benchmark datasets that it achieves state-of-the-art in both balanced and imbalanced settings.