Abstract:The success of the machine learning field has reliably depended on training on large datasets. While effective, this trend comes at an extraordinary cost. This is due to two deeply intertwined factors: the size of models and the size of datasets. While promising research efforts focus on reducing the size of models, the other half of the equation remains fairly mysterious. Indeed, it is surprising that the standard approach to training remains to iterate over and over, uniformly sampling the training dataset. In this paper we explore a series of alternative training paradigms that leverage insights from hard-data-mining and dropout, simple enough to implement and use that can become the new training standard. The proposed Progressive Data Dropout reduces the number of effective epochs to as little as 12.4% of the baseline. This savings actually do not come at any cost for accuracy. Surprisingly, the proposed method improves accuracy by up to 4.82%. Our approach requires no changes to model architecture or optimizer, and can be applied across standard training pipelines, thus posing an excellent opportunity for wide adoption. Code can be found here: https://github.com/bazyagami/LearningWithRevision
Abstract:Effectively handling the co-occurrence of non-IID data and long-tailed distributions remains a critical challenge in federated learning. While fine-tuning vision-language models (VLMs) like CLIP has shown to be promising in addressing non-IID data challenges, this approach leads to severe degradation of tail classes in federated long-tailed scenarios. Under the composite effects of strong non-IID data distribution and long-tailed class imbalances, VLM fine-tuning may even fail to yield any improvement. To address this issue, we propose Class-Aware Prompt Learning for Federated Long-tailed Learning (CAPT), a novel framework that leverages a pre-trained VLM to effectively handle both data heterogeneity and long-tailed distributions. CAPT introduces a dual-prompt mechanism that synergizes general and class-aware prompts, enabling the framework to capture global trends while preserving class-specific knowledge. To better aggregate and share knowledge across clients, we introduce a heterogeneity-aware client clustering strategy that groups clients based on their data distributions, enabling efficient collaboration and knowledge sharing. Extensive experiments on various long-tailed datasets with different levels of data heterogeneity demonstrate that CAPT significantly improves tail class performance without compromising overall accuracy, outperforming state-of-the-art methods in federated long-tailed learning scenarios.
Abstract:The construction of large open knowledge bases (OKBs) is integral to many applications in the field of mobile computing. Noun phrases and relational phrases in OKBs often suffer from redundancy and ambiguity, which calls for the investigation on OKB canonicalization. However, in order to meet the requirements of some privacy protection regulations and to ensure the timeliness of the data, the canonicalized OKB often needs to remove some sensitive information or outdated data. The machine unlearning in OKB canonicalization is an excellent solution to the above problem. Current solutions address OKB canonicalization by devising advanced clustering algorithms and using knowledge graph embedding (KGE) to further facilitate the canonicalization process. Effective schemes are urgently needed to fully synergise machine unlearning with clustering and KGE learning. To this end, we put forward a multi-task unlearning framework, namely MulCanon, to tackle machine unlearning problem in OKB canonicalization. Specifically, the noise characteristics in the diffusion model are utilized to achieve the effect of machine unlearning for data in OKB. MulCanon unifies the learning objectives of diffusion model, KGE and clustering algorithms, and adopts a two-step multi-task learning paradigm for training. A thorough experimental study on popular OKB canonicalization datasets validates that MulCanon achieves advanced machine unlearning effects.