Graph neural networks (GNNs) are effective at modeling graphs to learn distributed representations of nodes and of entire graphs. Recently, research on the expressive power of GNNs has attracted growing attention. A highly expressive GNN can generate discriminative graph representations. However, in end-to-end training for a given graph learning task, a highly expressive GNN risks generating graph representations that overfit the training data for the target task while losing information important for generalization. In this paper, we propose Distribution Preserving GNN (DP-GNN) - a GNN framework that can improve the generalizability of expressive GNN models by preserving several kinds of distribution information in graph representations and node representations. Beyond generalizability, by applying an expressive GNN backbone, DP-GNN can also have high expressive power. We evaluate the proposed DP-GNN framework on multiple benchmark datasets for graph classification tasks. The experimental results demonstrate that our model achieves state-of-the-art performance.
Transformer-based models, such as BERT, have dramatically improved performance on various natural language processing tasks. The clinical knowledge enriched model, namely ClinicalBERT, also achieved state-of-the-art results on clinical named entity recognition and natural language inference tasks. One of the core limitations of these transformers is the substantial memory consumption caused by their full self-attention mechanism. To overcome this, long-sequence transformer models, e.g. Longformer and BigBird, were proposed; they use a sparse attention mechanism to reduce memory usage from quadratic in the sequence length to linear. These models extended the maximum input sequence length from 512 to 4096, which enhanced the ability to model long-term dependencies and consequently achieved optimal results in a variety of tasks. Inspired by the success of these long-sequence transformer models, we introduce two domain-enriched language models, namely Clinical-Longformer and Clinical-BigBird, which are pre-trained on large-scale clinical corpora. We evaluate both pre-trained models using 10 baseline tasks including named entity recognition, question answering, and document classification tasks. The results demonstrate that Clinical-Longformer and Clinical-BigBird consistently and significantly outperform ClinicalBERT as well as other short-sequence transformers in all downstream tasks. We have made our source code available at [https://github.com/luoyuanlab/Clinical-Longformer] and the pre-trained models available for public download at [https://huggingface.co/yikuan8/Clinical-Longformer].
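The quadratic-versus-linear memory claim above can be made concrete with a little arithmetic. The following is a minimal sketch, not the Longformer or BigBird implementation: it only counts how many query-key scores a full attention pattern and a simple sliding-window pattern would need. The function names and the window size of 256 tokens per side are illustrative assumptions, and global or random attention tokens are ignored.

```python
def full_attention_entries(seq_len: int) -> int:
    """Count query-key scores when every token attends to every token."""
    return seq_len * seq_len


def sliding_window_entries(seq_len: int, window: int) -> int:
    """Count scores when each token attends to itself and to at most
    `window` neighbours on each side, clipped at the sequence boundaries."""
    total = 0
    for i in range(seq_len):
        lo = max(0, i - window)            # leftmost key this query can see
        hi = min(seq_len - 1, i + window)  # rightmost key this query can see
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    # Window size is a hypothetical value chosen for illustration only.
    for n in (512, 4096):
        full = full_attention_entries(n)
        sparse = sliding_window_entries(n, window=256)
        print(f"seq_len={n}: full={full}, sliding_window={sparse}")
```

For a fixed window, the sliding-window count grows in proportion to the sequence length, whereas the full count grows with its square. That is the scaling difference the abstract refers to.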
Dimensionality reduction techniques are powerful tools for data preprocessing and visualization, but they typically come with few guarantees concerning the topological correctness of an embedding. The interleaving distance between the persistent homology of Vietoris-Rips filtrations can be used to identify a scale at which topological features such as clusters or holes in an embedding and in the original data set are in correspondence. We show how optimization seeking to minimize the interleaving distance can be incorporated into dimensionality reduction algorithms, and explicitly demonstrate its use in finding an optimal linear projection. We demonstrate the utility of this framework for data visualization.
Open-set recognition generalizes a classification task by classifying test samples as one of the known classes from training or as "unknown." As novel cancer drug cocktails with improved treatment are continually discovered, predicting cancer treatments can naturally be formulated as an open-set recognition problem. Straightforward implementations of prior work in healthcare open-set learning have drawbacks that stem from modeling unknown samples during training. Accordingly, we reframe the problem methodology and apply a recent Gaussian mixture variational autoencoder model, which achieves state-of-the-art results for image datasets, to breast cancer patient data. Not only do we obtain more accurate and robust classification results, with a 24.5% average F1 increase compared to a recent method, but we also reexamine open-set recognition in terms of deployability to a clinical setting.
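To illustrate the open-set decision rule described above (assign a known class or "unknown"), here is a minimal sketch that uses a nearest-centroid classifier with a distance threshold. This is a deliberately simple stand-in and not the Gaussian mixture variational autoencoder the abstract applies. The function names, the Euclidean distance, and the threshold value are all assumptions made for illustration. As in the reframed setting, only samples from known classes are used at training time.

```python
import math


def fit_centroids(samples, labels):
    """Compute one mean vector per known class from the training data."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_open_set(x, centroids, threshold):
    """Return the nearest known class, or "unknown" if even the nearest
    centroid is farther away than `threshold` (a hypothetical cut-off)."""
    best_label, best_dist = None, math.inf
    for label, centroid in centroids.items():
        dist = math.dist(x, centroid)  # Euclidean distance, Python 3.8+
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= threshold else "unknown"


if __name__ == "__main__":
    # Toy 2-D data with two known classes; values are made up.
    train_x = [[0, 0], [0, 2], [10, 10], [10, 12]]
    train_y = ["a", "a", "b", "b"]
    centroids = fit_centroids(train_x, train_y)
    print(predict_open_set([0, 1.5], centroids, threshold=2.0))
    print(predict_open_set([5, 5], centroids, threshold=2.0))
```

The threshold controls the trade-off between rejecting genuinely novel samples and wrongly rejecting known ones. In practice it would have to be tuned on held-out data.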
Background: The increasing adoption of electronic health records (EHR) across the US has created troves of computable data, to which machine learning methods have been applied to extract useful insights. EHR data, represented as a three-dimensional analogue of a matrix (tensor), is decomposed into two-dimensional factors that can be interpreted as computational phenotypes. Methods: We apply constrained tensor factorization to derive computational phenotypes and predict mortality in cohorts of patients with breast, prostate, colorectal, or lung cancer in the Northwestern Medicine Enterprise Data Warehouse from 2000 to 2015. In our experiments, we examined using a supervised term in the factorization algorithm, filtering tensor co-occurrences by medical indication, and incorporating additional social determinants of health (SDOH) covariates in the factorization process. We evaluated the resulting computational phenotypes qualitatively and by assessing their ability to predict five-year mortality using the area under the curve (AUC) statistic. Results: Filtering by medical indication led to more concise and interpretable phenotypes. Mortality prediction performance (AUC) varied under the different experimental conditions and by cancer type (breast: 0.623 - 0.694, prostate: 0.603 - 0.750, colorectal: 0.523 - 0.641, and lung: 0.517 - 0.623). Generally, prediction performance improved with the use of a supervised term and the incorporation of SDOH covariates. Conclusion: Constrained tensor factorization, applied to sparse EHR data of patients with cancer, can discover computational phenotypes predictive of five-year mortality. The incorporation of SDOH variables into the factorization algorithm is an easy-to-implement and effective way to improve prediction performance.
Systemic lupus erythematosus (SLE) is a rare autoimmune disorder characterized by an unpredictable course of flares and remission with diverse manifestations. Lupus nephritis, one of the major manifestations of SLE contributing to organ damage and mortality, is a key component of lupus classification criteria. Accurately identifying lupus nephritis in electronic health records (EHRs) would therefore benefit large cohort observational studies and clinical trials where characterization of the patient population is critical for recruitment, study design, and analysis. Lupus nephritis can be recognized through procedure codes and structured data, such as laboratory tests. However, other critical information documenting lupus nephritis, such as histologic reports from kidney biopsies and prior medical history narratives, requires sophisticated text processing to mine information from pathology reports and clinical notes. In this study, we developed algorithms to identify lupus nephritis with and without natural language processing (NLP) using EHR data. We developed four algorithms: a rule-based algorithm using only structured data (baseline algorithm) and three algorithms using different NLP models. The three NLP models are based on regularized logistic regression and use different sets of features: positive mention of concept unique identifiers (CUIs), number of appearances of CUIs, and a mixture of three components, respectively. The baseline algorithm and the best-performing NLP algorithm were externally validated on a dataset from Vanderbilt University Medical Center (VUMC). Our best-performing NLP model, which incorporates features from structured data, regular expression concepts, and mapped CUIs, improved the F measure over the baseline lupus nephritis algorithm in both the NMEDW (0.41 vs 0.79) and VUMC (0.62 vs 0.96) datasets.
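For readers unfamiliar with the F measure reported above, the following minimal sketch shows how it is computed from confusion-matrix counts. The abstract does not state which variant is used. This sketch assumes the balanced F1 (beta = 1) by default, and the function name and example counts are hypothetical.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from true positives, false positives, and false negatives.

    beta = 1.0 gives the balanced F1 score (an assumption here; the source
    only says "F measure"). Returns 0.0 when precision and recall are both 0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Made-up counts for a case-identification algorithm.
    print(f_measure(tp=8, fp=2, fn=2))
```

Because the F measure combines precision and recall, it penalizes an algorithm that finds most true cases only by flagging many false ones, which matters when identifying a rare condition in EHR data.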
Background: Sepsis is one of the most life-threatening conditions for critically ill patients in the US, while standardized criteria for sepsis identification are still under development. Disparities in social determinants among sepsis patients can interfere with the performance of machine learning risk prediction. Methods: Disparities in social determinants, including race, gender, marital status, insurance type, and language, among patients identified by six available sepsis criteria were revealed by forest plots. Sixteen machine learning classifiers were trained to predict in-hospital mortality for sepsis patients. The performance of the trained model was tested on the entire randomly constructed test set and on each sub-population defined by each of the following social determinants: race, gender, marital status, insurance type, and language. Results: We analyzed a total of 11,791 critical care patients from the MIMIC-III database. Within the population identified by each sepsis identification method, significant differences were observed among sub-populations regarding race, marital status, insurance type, and language. On the 5,783 sepsis patients identified by the Sepsis-3 criteria, statistically significant performance decreases for mortality prediction were observed when applying the trained machine learning model to Asian and Hispanic patients. With pairwise comparison, we detected performance discrepancies in mortality prediction between Asian and White patients, between Asian patients and patients of other races, and between English-speaking and Spanish-speaking patients. Conclusions: Disparities in the proportions of patients identified by various sepsis criteria were detected among the different social determinant groups. To achieve accurate diagnosis, a versatile diagnostic system for sepsis is needed to overcome the social determinant disparities of patients.
Hypertension is the leading global cause of cardiovascular disease and premature death. Distinct hypertension subtypes may vary in their prognoses and require different treatments. An individual's risk for hypertension is determined by genetic and environmental factors as well as their interactions. In this work, we studied 911 African Americans and 1,171 European Americans in the Hypertension Genetic Epidemiology Network (HyperGEN) cohort. We built hypertension subtype classification models using both environmental variables and sets of genetic features selected based on different criteria. The fitted prediction models provided insights into the genetic landscape of hypertension subtypes, which may aid personalized diagnosis and treatment of hypertension in the future.
Machine learning (ML) in medicine leverages the wealth of healthcare data to extract knowledge, facilitate clinical decision-making, and ultimately improve care delivery. However, ML models trained on datasets that lack demographic diversity could yield suboptimal performance when applied to underrepresented populations (e.g. ethnic minorities, lower socioeconomic status), thus perpetuating health disparity. In this study, we evaluated four classifiers built to predict hyperchloremia - a condition that often results from aggressive fluid administration in the ICU population - and compared their performance in racial, gender, and insurance subgroups. We observed that adding social determinants features to the lab-based ones improved model performance on all patients. The subgroup testing yielded significantly different AUC scores in 40 out of the 44 model-subgroup pairs, suggesting disparities when applying ML models to social determinants subgroups. We urge future researchers to design models that proactively adjust for potential biases and to include subgroup reporting in their studies.
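The subgroup reporting urged above can be sketched in a few lines. The example below computes the AUC for all patients and separately for each subgroup, using the pairwise (Mann-Whitney) definition of AUC. It is an illustrative sketch, not the study's evaluation code. The function names and toy data are assumptions, and it does not perform the significance testing mentioned in the abstract.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample gets the higher score; ties count as half.
    Returns None if one of the two classes is absent."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_subgroup(labels, scores, groups):
    """Compute the AUC separately for each subgroup label in `groups`."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in buckets.items()}


if __name__ == "__main__":
    # Toy labels, model scores, and subgroup tags; all values are made up.
    y = [1, 0, 1, 0, 1, 0]
    s = [0.9, 0.1, 0.8, 0.7, 0.3, 0.6]
    g = ["A", "A", "A", "B", "B", "B"]
    print("overall:", auc(y, s))
    print("by subgroup:", auc_by_subgroup(y, s, g))
```

A model can look acceptable on the pooled population while ranking patients poorly within one subgroup, which is why reporting only the overall AUC can hide disparities.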