Abstract:Causal inference has emerged as a promising approach to mitigate long-tail classification by handling the biases introduced by class imbalance. However, along with the change of advanced backbone models from Convolutional Neural Networks (CNNs) to Visual Transformers (ViT), existing causal models may not achieve an expected performance gain. This paper investigates the influence of existing causal models on CNNs and ViT variants, highlighting that ViT's global feature representation makes it hard for causal methods to model associations between fine-grained features and predictions, which leads to difficulties in classifying tail classes with similar visual appearance. To address these issues, this paper proposes TSCNet, a two-stage causal modeling method to discover fine-grained causal associations through multi-scale causal interventions. Specifically, in the hierarchical causal representation learning stage (HCRL), it decouples the background and objects, applying backdoor interventions at both the patch and feature level to prevent model from using class-irrelevant areas to infer labels which enhances fine-grained causal representation. In the counterfactual logits bias calibration stage (CLBC), it refines the optimization of model's decision boundary by adaptive constructing counterfactual balanced data distribution to remove the spurious associations in the logits caused by data distribution. Extensive experiments conducted on various long-tail benchmarks demonstrate that the proposed TSCNet can eliminate multiple biases introduced by data imbalance, which outperforms existing methods.
Abstract:Cross-modal alignment is an effective approach to improving visual classification. Existing studies typically enforce a one-step mapping that uses deep neural networks to project the visual features to mimic the distribution of textual features. However, they typically face difficulties in finding such a projection due to the two modalities in both the distribution of class-wise samples and the range of their feature values. To address this issue, this paper proposes a novel Semantic-Space-Intervened Diffusive Alignment method, termed SeDA, models a semantic space as a bridge in the visual-to-textual projection, considering both types of features share the same class-level information in classification. More importantly, a bi-stage diffusion framework is developed to enable the progressive alignment between the two modalities. Specifically, SeDA first employs a Diffusion-Controlled Semantic Learner to model the semantic features space of visual features by constraining the interactive features of the diffusion model and the category centers of visual features. In the later stage of SeDA, the Diffusion-Controlled Semantic Translator focuses on learning the distribution of textual features from the semantic space. Meanwhile, the Progressive Feature Interaction Network introduces stepwise feature interactions at each alignment step, progressively integrating textual information into mapped features. Experimental results show that SeDA achieves stronger cross-modal feature alignment, leading to superior performance over existing methods across multiple scenarios.
Abstract:Attribute bias in federated learning (FL) typically leads local models to optimize inconsistently due to the learning of non-causal associations, resulting degraded performance. Existing methods either use data augmentation for increasing sample diversity or knowledge distillation for learning invariant representations to address this problem. However, they lack a comprehensive analysis of the inference paths, and the interference from confounding factors limits their performance. To address these limitations, we propose the \underline{Fed}erated \underline{D}econfounding and \underline{D}ebiasing \underline{L}earning (FedDDL) method. It constructs a structured causal graph to analyze the model inference process, and performs backdoor adjustment to eliminate confounding paths. Specifically, we design an intra-client deconfounding learning module for computer vision tasks to decouple background and objects, generating counterfactual samples that establish a connection between the background and any label, which stops the model from using the background to infer the label. Moreover, we design an inter-client debiasing learning module to construct causal prototypes to reduce the proportion of the background in prototype components. Notably, it bridges the gap between heterogeneous representations via causal prototypical regularization. Extensive experiments on 2 benchmarking datasets demonstrate that \methodname{} significantly enhances the model capability to focus on main objects in unseen data, leading to 4.5\% higher Top-1 Accuracy on average over 9 state-of-the-art existing methods.
Abstract:The clustering performance of Fuzzy Adaptive Resonance Theory (Fuzzy ART) is highly dependent on the preset vigilance parameter, where deviations in its value can lead to significant fluctuations in clustering results, severely limiting its practicality for non-expert users. Existing approaches generally enhance vigilance parameter robustness through adaptive mechanisms such as particle swarm optimization and fuzzy logic rules. However, they often introduce additional hyperparameters or complex frameworks that contradict the original simplicity of the algorithm. To address this, we propose Iterative Refinement Adaptive Resonance Theory (IR-ART), which integrates three key phases into a unified iterative framework: (1) Cluster Stability Detection: A dynamic stability detection module that identifies unstable clusters by analyzing the change of sample size (number of samples in the cluster) in iteration. (2) Unstable Cluster Deletion: An evolutionary pruning module that eliminates low-quality clusters. (3) Vigilance Region Expansion: A vigilance region expansion mechanism that adaptively adjusts similarity thresholds. Independent of the specific execution of clustering, these three phases sequentially focus on analyzing the implicit knowledge within the iterative process, adjusting weights and vigilance parameters, thereby laying a foundation for the next iteration. Experimental evaluation on 15 datasets demonstrates that IR-ART improves tolerance to suboptimal vigilance parameter values while preserving the parameter simplicity of Fuzzy ART. Case studies visually confirm the algorithm's self-optimization capability through iterative refinement, making it particularly suitable for non-expert users in resource-constrained scenarios.
Abstract:The personalized text-to-image generation has rapidly advanced with the emergence of Stable Diffusion. Existing methods, which typically fine-tune models using embedded identifiers, often struggle with insufficient stylization and inaccurate image content due to reduced textual controllability. In this paper, we propose style refinement and content preservation strategies. The style refinement strategy leverages the semantic information of visual reasoning prompts and reference images to optimize style embeddings, allowing a more precise and consistent representation of style information. The content preservation strategy addresses the content bias problem by preserving the model's generalization capabilities, ensuring enhanced textual controllability without compromising stylization. Experimental results verify that our approach achieves superior performance in generating consistent and personalized text-to-image outputs.
Abstract:Attribute skew in federated learning leads local models to focus on learning non-causal associations, guiding them towards inconsistent optimization directions, which inevitably results in performance degradation and unstable convergence. Existing methods typically leverage data augmentation to enhance sample diversity or employ knowledge distillation to learn invariant representations. However, the instability in the quality of generated data and the lack of domain information limit their performance on unseen samples. To address these issues, this paper presents a global intervention and distillation method, termed FedGID, which utilizes diverse attribute features for backdoor adjustment to break the spurious association between background and label. It includes two main modules, where the global intervention module adaptively decouples objects and backgrounds in images, injects background information into random samples to intervene in the sample distribution, which links backgrounds to all categories to prevent the model from treating background-label associations as causal. The global distillation module leverages a unified knowledge base to guide the representation learning of client models, preventing local models from overfitting to client-specific attributes. Experimental results on three datasets demonstrate that FedGID enhances the model's ability to focus on the main subjects in unseen data and outperforms existing methods in collaborative modeling.
Abstract:Product bundling aims to organize a set of thematically related items into a combined bundle for shipment facilitation and item promotion. To increase the exposure of fresh or overstocked products, sellers typically bundle these items with popular products for inventory clearance. This specific task can be formulated as a long-tail product bundling scenario, which leverages the user-item interactions to define the popularity of each item. The inherent popularity bias in the pre-extracted user feedback features and the insufficient utilization of other popularity-independent knowledge may force the conventional bundling methods to find more popular items, thereby struggling with this long-tail bundling scenario. Through intuitive and empirical analysis, we navigate the core solution for this challenge, which is maximally mining the popularity-free features and effectively incorporating them into the bundling process. To achieve this, we propose a Distilled Modality-Oriented Knowledge Transfer framework (DieT) to effectively counter the popularity bias misintroduced by the user feedback features and adhere to the original intent behind the real-world bundling behaviors. Specifically, DieT first proposes the Popularity-free Collaborative Distribution Modeling module (PCD) to capture the popularity-independent information from the bundle-item view, which is proven most effective in the long-tail bundling scenario to enable the directional information transfer. With the tailored Unbiased Bundle-aware Knowledge Transferring module (UBT), DieT can highlight the significance of popularity-free features while mitigating the negative effects of user feedback features in the long-tail scenario via the knowledge distillation paradigm. Extensive experiments on two real-world datasets demonstrate the superiority of DieT over a list of SOTA methods in the long-tail bundling scenario.
Abstract:Recommender systems aim to capture users' personalized preferences from the cast amount of user behaviors, making them pivotal in the era of information explosion. However, the presence of the dynamic preference, the "information cocoons", and the inherent feedback loops in recommendation make users interact with a limited number of items. Conventional recommendation algorithms typically focus on the positive historical behaviors, while neglecting the essential role of negative feedback in user interest understanding. As a promising but easy-to-ignored area, negative sampling is proficients in revealing the genuine negative aspect inherent in user behaviors, emerging as an inescapable procedure in recommendation. In this survey, we first discuss the role of negative sampling in recommendation and thoroughly analyze challenges that consistently impede its progress. Then, we conduct an extensive literature review on the existing negative sampling strategies in recommendation and classify them into five categories with their discrepant techniques. Finally, we detail the insights of the tailored negative sampling strategies in diverse recommendation scenarios and outline an overview of the prospective research directions toward which the community may engage and benefit.
Abstract:Image classification models often demonstrate unstable performance in real-world applications due to variations in image information, driven by differing visual perspectives of subject objects and lighting discrepancies. To mitigate these challenges, existing studies commonly incorporate additional modal information matching the visual data to regularize the model's learning process, enabling the extraction of high-quality visual features from complex image regions. Specifically, in the realm of multimodal learning, cross-modal alignment is recognized as an effective strategy, harmonizing different modal information by learning a domain-consistent latent feature space for visual and semantic features. However, this approach may face limitations due to the heterogeneity between multimodal information, such as differences in feature distribution and structure. To address this issue, we introduce a Multimodal Alignment and Reconstruction Network (MARNet), designed to enhance the model's resistance to visual noise. Importantly, MARNet includes a cross-modal diffusion reconstruction module for smoothly and stably blending information across different domains. Experiments conducted on two benchmark datasets, Vireo-Food172 and Ingredient-101, demonstrate that MARNet effectively improves the quality of image information extracted by the model. It is a plug-and-play framework that can be rapidly integrated into various image classification frameworks, boosting model performance.
Abstract:Federated learning benefits from cross-training strategies, which enables models to train on data from distinct sources to improve the generalization capability. However, the data heterogeneity between sources may lead models to gradually forget previously acquired knowledge when undergoing cross-training to adapt to new tasks or data sources. We argue that integrating personalized and global knowledge to gather information from multiple perspectives could potentially improve performance. To achieve this goal, this paper presents a novel approach that enhances federated learning through a cross-training scheme incorporating multi-view information. Specifically, the proposed method, termed FedCT, includes three main modules, where the consistency-aware knowledge broadcasting module aims to optimize model assignment strategies, which enhances collaborative advantages between clients and achieves an efficient federated learning process. The multi-view knowledge-guided representation learning module leverages fused prototypical knowledge from both global and local views to enhance the preservation of local knowledge before and after model exchange, as well as to ensure consistency between local and global knowledge. The mixup-based feature augmentation module aggregates rich information to further increase the diversity of feature spaces, which enables the model to better discriminate complex samples. Extensive experiments were conducted on four datasets in terms of performance comparison, ablation study, in-depth analysis and case study. The results demonstrated that FedCT alleviates knowledge forgetting from both local and global views, which enables it outperform state-of-the-art methods.