Abstract: Objective: Machine learning (ML)-based clinical risk prediction models are increasingly used to support decision-making in healthcare. While class-imbalance correction techniques are commonly applied to improve model performance in settings with rare outcomes, their impact on probabilistic calibration remains insufficiently understood. This study evaluated the effect of widely used resampling strategies on both discrimination and calibration across real-world clinical prediction tasks. Methods: Ten clinical datasets spanning diverse medical domains and comprising 605,842 patients were analyzed. Multiple machine-learning model families, including linear models and several non-linear approaches, were evaluated. Models were trained on the original data and under three commonly used 1:1 class-imbalance correction strategies: the Synthetic Minority Oversampling Technique (SMOTE), random undersampling (RUS), and random oversampling (ROS). Performance was assessed on held-out data using discrimination and calibration metrics. Results: Across all datasets and model families, resampling did not improve predictive performance. Changes in the area under the receiver operating characteristic curve (ROC-AUC) relative to models trained on the original data were small and inconsistent (ROS: -0.002, p<0.05; RUS: -0.004, p>0.05; SMOTE: -0.01, p<0.05), with no resampling strategy demonstrating a systematic improvement. In contrast, resampling generally degraded calibration. Models trained with imbalance correction exhibited higher Brier scores (0.029 to 0.080, p<0.05), reflecting poorer probabilistic accuracy, and marked deviations in calibration intercept and slope, indicating systematic distortion of predicted risks despite preserved rank-based performance. Conclusion: In a diverse set of real-world clinical prediction tasks, commonly used class-imbalance correction techniques did not provide generalizable improvements in discrimination and were associated with degraded calibration.
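
To make the evaluation protocol concrete, here is a minimal sketch of the comparison described above, assuming scikit-learn and imbalanced-learn. The synthetic dataset, the single logistic model, and the Cox-style logistic recalibration used for the calibration intercept and slope are illustrative stand-ins, not the study's ten datasets or its full set of model families.

```python
# Sketch: discrimination vs. calibration of one model under 1:1 resampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced task standing in for a clinical dataset (~5% events).
X, y = make_classification(n_samples=20_000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

samplers = {
    "original": None,
    "SMOTE": SMOTE(sampling_strategy=1.0, random_state=0),
    "RUS": RandomUnderSampler(sampling_strategy=1.0, random_state=0),
    "ROS": RandomOverSampler(sampling_strategy=1.0, random_state=0),
}

def calibration_intercept_slope(y_true, p):
    """Cox-style recalibration: regress the outcome on the logit of p."""
    logit = np.log(p / (1 - p)).reshape(-1, 1)
    lr = LogisticRegression(penalty=None).fit(logit, y_true)
    return lr.intercept_[0], lr.coef_[0, 0]

for name, sampler in samplers.items():
    # Resampling is applied to the training split only; the test set is untouched.
    Xs, ys = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    p = LogisticRegression(max_iter=1000).fit(Xs, ys).predict_proba(X_te)[:, 1]
    p = np.clip(p, 1e-6, 1 - 1e-6)  # guard the logit transform
    icpt, slope = calibration_intercept_slope(y_te, p)
    print(f"{name:8s}  AUC={roc_auc_score(y_te, p):.3f}  "
          f"Brier={brier_score_loss(y_te, p):.3f}  "
          f"intercept={icpt:+.2f}  slope={slope:.2f}")
```

Run on an imbalanced task like this, the resampled variants typically show a near-unchanged AUC but a shifted calibration intercept (predicted risks inflated toward the artificial 1:1 prevalence), which mirrors the pattern of preserved discrimination and degraded calibration reported in the abstract.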

Abstract: Accurately predicting blood glucose (BG) levels in ICU patients is critical, as both hypoglycemia (BG < 70 mg/dL) and hyperglycemia (BG > 180 mg/dL) are associated with increased morbidity and mortality. We develop the Multi-source Irregular Time-Series Transformer (MITST), a novel machine-learning model that forecasts the next BG level, classifying it as hypoglycemia, hyperglycemia, or euglycemia (70-180 mg/dL). The irregularity and complexity of Electronic Health Record (EHR) data, which span multiple heterogeneous clinical sources such as lab results, medications, and vital signs, pose significant challenges for prediction tasks. MITST addresses these challenges with a hierarchical Transformer architecture comprising a feature-level, a timestamp-level, and a source-level Transformer. This design captures fine-grained temporal dynamics and enables learned data integration in place of traditional predefined aggregation. In a large-scale evaluation on the eICU database (200,859 ICU stays across 208 hospitals), MITST achieves average improvements of 1.7% (p < 0.001) in AUROC and 1.8% (p < 0.001) in AUPRC over a state-of-the-art baseline. For hypoglycemia, MITST achieves an AUROC of 0.915 and an AUPRC of 0.247, both significantly higher than the baseline's AUROC of 0.862 and AUPRC of 0.208 (p < 0.001). MITST's flexible architecture allows new data sources to be integrated without retraining the entire model, enhancing its adaptability for clinical decision support. Although this study focuses on BG prediction, MITST can readily be extended to other critical-event prediction tasks in ICU settings, offering a robust solution for analyzing complex, multi-source, irregular time-series data.
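
The abstract names three Transformer levels but not their wiring; the following is a minimal PyTorch sketch of one plausible reading of such a hierarchy. The mean pooling between levels, all dimensions, the linear classification head, and the module names are illustrative assumptions, and the published MITST additionally handles irregular timestamps, which this sketch omits.

```python
# Sketch of a three-level hierarchy in the spirit of MITST: a feature-level,
# a timestamp-level, and a source-level Transformer, each pooled before being
# passed upward. Details here are assumptions, not the published design.
import torch
import torch.nn as nn

def encoder(d_model: int, n_layers: int = 1, n_heads: int = 2) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=n_heads, dim_feedforward=2 * d_model,
        batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers)

class HierarchicalEncoder(nn.Module):
    """Encodes one source: features within a timestamp, then timestamps."""
    def __init__(self, n_features: int, d_model: int = 32):
        super().__init__()
        self.embed = nn.Linear(1, d_model)           # one scalar per feature
        self.feature_level = encoder(d_model)        # attends across features
        self.timestamp_level = encoder(d_model)      # attends across time
    def forward(self, x):                            # x: (batch, time, features)
        b, t, f = x.shape
        tok = self.embed(x.reshape(b * t, f, 1))     # (b*t, features, d)
        tok = self.feature_level(tok).mean(dim=1)    # pool features -> (b*t, d)
        seq = tok.reshape(b, t, -1)
        return self.timestamp_level(seq).mean(dim=1) # pool time -> (b, d)

class MultiSourceModel(nn.Module):
    """Fuses per-source embeddings with a source-level Transformer."""
    def __init__(self, source_features, d_model: int = 32, n_classes: int = 3):
        super().__init__()
        self.sources = nn.ModuleList(
            [HierarchicalEncoder(f, d_model) for f in source_features])
        self.source_level = encoder(d_model)
        self.head = nn.Linear(d_model, n_classes)    # hypo / eu / hyperglycemia
    def forward(self, xs):                           # list of per-source tensors
        emb = torch.stack([enc(x) for enc, x in zip(self.sources, xs)], dim=1)
        fused = self.source_level(emb).mean(dim=1)   # (batch, d)
        return self.head(fused)

# Toy forward pass: 3 sources (e.g., labs, medications, vitals), each with its
# own feature count and its own number of observations per stay.
model = MultiSourceModel(source_features=[5, 3, 4])
xs = [torch.randn(8, 12, 5), torch.randn(8, 6, 3), torch.randn(8, 24, 4)]
print(model(xs).shape)  # torch.Size([8, 3])
```

Because each source is encoded independently before the source-level Transformer fuses the per-source embeddings, this structure also illustrates the extensibility claim: a new source can be attached as one more `HierarchicalEncoder` feeding an additional token into the fusion stage.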