Abstract:Graph representation learning is effective for obtaining a meaningful latent space utilizing the structure of graph data and is widely applied, including biological networks. In particular, Graph Contrastive Learning (GCL) has emerged as a powerful self-supervised method that relies on applying perturbations to graphs for data augmentation. However, when applying existing GCL methods to biological networks such as Gene Regulatory Networks (GRNs), they overlooked meaningful biologically relevant perturbations, e.g., gene knockdowns. In this study, we introduce SupGCL (Supervised Graph Contrastive Learning), a novel GCL method for GRNs that directly incorporates biological perturbations derived from gene knockdown experiments as the supervision. SupGCL mathematically extends existing GCL methods that utilize non-biological perturbations to probabilistic models that introduce actual biological gene perturbation utilizing gene knockdown data. Using the GRN representation obtained by our proposed method, our aim is to improve the performance of biological downstream tasks such as patient hazard prediction and disease subtype classification (graph-level task), and gene function classification (node-level task). We applied SupGCL on real GRN datasets derived from patients with multiple types of cancer, and in all experiments SupGCL achieves better performance than state-of-the-art baselines.
Abstract:Machine learning is increasingly used to predict lifestyle-related disease onset using health and medical data. However, the prediction effectiveness is hindered by dataset shift, which involves discrepancies in data distribution between the training and testing datasets, misclassifying out-of-distribution (OOD) data. To diminish dataset shift effects, this paper proposes the out-of-distribution reject option for prediction (ODROP), which integrates OOD detection models to preclude OOD data from the prediction phase. We investigated the efficacy of five OOD detection methods (variational autoencoder, neural network ensemble std, neural network ensemble epistemic, neural network energy, and neural network gaussian mixture based energy measurement) across two datasets, the Hirosaki and Wakayama health checkup data, in the context of three disease onset prediction tasks: diabetes, dyslipidemia, and hypertension. To evaluate the ODROP method, we trained disease onset prediction models and OOD detection models on Hirosaki data and used AUROC-rejection curve plots from Wakayama data. The variational autoencoder method showed superior stability and magnitude of improvement in Area Under the Receiver Operating Curve (AUROC) in five cases: AUROC in the Wakayama data was improved from 0.80 to 0.90 at a 31.1% rejection rate for diabetes onset and from 0.70 to 0.76 at a 34% rejection rate for dyslipidemia. We categorized dataset shifts into two types using SHAP clustering - those that considerably affect predictions and those that do not. We expect that this classification will help standardize measuring instruments. This study is the first to apply OOD detection to actual health and medical data, demonstrating its potential to substantially improve the accuracy and reliability of disease prediction models amidst dataset shift.