Abstract:Class imbalance, overlap, and noise degrade data quality, reduce model reliability, and limit generalization. Although widely studied in binary classification, these issues remain underexplored in multi-class settings, where complex inter-class relationships make minority-majority structures unclear and traditional clustering fails to capture distribution shape. Approaches that rely only on geometric distances risk removing informative samples and generating low-quality synthetic data, while binarization approaches treat imbalance locally and ignore global inter-class dependencies. At the algorithmic level, ensembles struggle to integrate weak classifiers, leading to limited robustness. This paper proposes IMOVNO+ (IMbalance-OVerlap-NOise+ Algorithm-Level Optimization), a two-level framework designed to jointly enhance data quality and algorithmic robustness for binary and multi-class tasks. At the data level, first, conditional probability is used to quantify the informativeness of each sample. Second, the dataset is partitioned into core, overlapping, and noisy regions. Third, an overlapping-cleaning algorithm is introduced that combines Z-score metrics with a big-jump gap distance. Fourth, a smart oversampling algorithm based on multi-regularization controls synthetic sample proximity, preventing new overlaps. At the algorithmic level, a meta-heuristic prunes ensemble classifiers to reduce weak-learner influence. IMOVNO+ was evaluated on 35 datasets (13 multi-class, 22 binary). Results show consistent superiority over state-of-the-art methods, approaching 100% in several cases. For multi-class data, IMOVNO+ achieves gains of 37-57% in G-mean, 25-44% in F1-score, 25-39% in precision, and 26-43% in recall. In binary tasks, it attains near-perfect performance with improvements of 14-39%. The framework handles data scarcity and imbalance from collection and privacy limits.




Abstract:Accurate illness diagnosis is vital for effective treatment and patient safety. Machine learning models are widely used for cancer diagnosis based on historical medical data. However, data imbalance remains a major challenge, leading to hindering classifier performance and reliability. The SMOTEBoost method addresses this issue by generating synthetic data to balance the dataset, but it may overlook crucial overlapping regions near the decision boundary and can produce noisy samples. This paper proposes RE-SMOTEBoost, an enhanced version of SMOTEBoost, designed to overcome these limitations. Firstly, RE-SMOTEBoost focuses on generating synthetic samples in overlapping regions to better capture the decision boundary using roulette wheel selection. Secondly, it incorporates a filtering mechanism based on information entropy to reduce noise, and borderline cases and improve the quality of generated data. Thirdly, we introduce a double regularization penalty to control the synthetic samples proximity to the decision boundary and avoid class overlap. These enhancements enable higher-quality oversampling of the minority class, resulting in a more balanced and effective training dataset. The proposed method outperforms existing state-of-the-art techniques when evaluated on imbalanced datasets. Compared to the top-performing sampling algorithms, RE-SMOTEBoost demonstrates a notable improvement of 3.22\% in accuracy and a variance reduction of 88.8\%. These results indicate that the proposed model offers a solid solution for medical settings, effectively overcoming data scarcity and severe imbalance caused by limited samples, data collection difficulties, and privacy constraints.