As Graph Neural Networks (GNNs) are increasingly employed in real-world applications, it becomes critical to ensure that the stakeholders understand the rationale behind their predictions. While several GNN explanation methods have been proposed recently, there has been little to no work on theoretically analyzing the behavior of these methods or systematically evaluating their effectiveness. Here, we introduce the first axiomatic framework for theoretically analyzing, evaluating, and comparing state-of-the-art GNN explanation methods. We outline and formalize the key desirable properties that all GNN explanation methods should satisfy in order to generate reliable explanations, namely, faithfulness, stability, and fairness. We leverage these properties to present the first ever theoretical analysis of the effectiveness of state-of-the-art GNN explanation methods. Our analysis establishes upper bounds on all the aforementioned properties for popular GNN explanation methods. We also leverage our framework to empirically evaluate these methods on multiple real-world datasets from diverse domains. Our empirical results demonstrate that some popular GNN explanation methods (e.g., gradient-based methods) perform no better than a random baseline and that methods which leverage the graph structure are more effective than those that solely rely on the node features.
Spatial context is central to understanding health and disease. Yet reference protein interaction networks lack such contextualization, thereby limiting the study of where protein interactions likely occur in the human body. Contextualized protein interactions could better characterize genes with disease-specific interactions and elucidate diseases' manifestation in specific cell types. Here, we introduce AWARE, a graph neural message passing approach to inject cellular and tissue context into protein embeddings. AWARE optimizes for a multi-scale embedding space, whose structure reflects the topology of cell type specific networks. We construct a multi-scale network of the Human Cell Atlas and apply AWARE to learn protein, cell type, and tissue embeddings that uphold cell type and tissue hierarchies. We demonstrate AWARE on the novel task of predicting whether a gene is associated with a disease and where it most likely manifests in the human body. AWARE embeddings outperform global embeddings by at least 12.5%, highlighting the importance of contextual learners for protein networks.
With the remarkable success of representation learning in providing powerful predictions and data insights, we have witnessed a rapid expansion of representation learning techniques into modeling, analysis, and learning with networks. Biomedical networks are universal descriptors of systems of interacting elements, from protein interactions to disease networks, all the way to healthcare systems and scientific knowledge. In this review, we put forward an observation that long-standing principles of network biology and medicine -- while often unspoken in machine learning research -- can provide the conceptual grounding for representation learning, explain its current successes and limitations, and inform future advances. We synthesize a spectrum of algorithmic approaches that, at their core, leverage topological features to embed networks into compact vector spaces. We also provide a taxonomy of biomedical areas that are likely to benefit most from algorithmic innovation. Representation learning techniques are becoming essential for identifying causal variants underlying complex traits, disentangling behaviors of single cells and their impact on health, and diagnosing and treating diseases with safe and effective medicines.
Language model pre-training (LMPT) has achieved remarkable results in natural language understanding. However, LMPT is much less successful in non-natural language domains like protein sequences, revealing a crucial discrepancy between the various sequential domains. Here, we posit that while LMPT can effectively model per-token relations, it fails at modeling per-sequence relations in non-natural language domains. To this end, we develop a framework that couples LMPT with deep structure-preserving metric learning to produce richer embeddings than can be obtained from LMPT alone. We examine new and existing pre-training models in this framework and theoretically analyze the framework overall. We also design experiments on a variety of synthetic datasets and new graph-augmented datasets of proteins and scientific abstracts. Our approach offers notable performance improvements on downstream tasks, including prediction of protein remote homology and classification of citation intent.
As the representations output by Graph Neural Networks (GNNs) are increasingly employed in real-world applications, it becomes important to ensure that these representations are fair and stable. In this work, we establish a key connection between counterfactual fairness and stability and leverage it to propose a novel framework, NIFTY (uNIfying Fairness and stabiliTY), which can be used with any GNN to learn fair and stable representations. We introduce a novel objective function that simultaneously accounts for fairness and stability and develop a layer-wise weight normalization using the Lipschitz constant to enhance neural message passing in GNNs. In doing so, we enforce fairness and stability both in the objective function as well as in the GNN architecture. Further, we show theoretically that our layer-wise weight normalization promotes counterfactual fairness and stability in the resulting representations. We introduce three new graph datasets comprising of high-stakes decisions in criminal justice and financial lending domains. Extensive experimentation with the above datasets demonstrates the efficacy of our framework.
Machine learning for therapeutics is an emerging field with incredible opportunities for innovation and expansion. Despite the initial success, many key challenges remain open. Here, we introduce Therapeutics Data Commons (TDC), the first unifying framework to systematically access and evaluate machine learning across the entire range of therapeutics. At its core, TDC is a collection of curated datasets and learning tasks that can translate algorithmic innovation into biomedical and clinical implementation. To date, TDC includes 66 machine learning-ready datasets from 22 learning tasks, spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools, libraries, leaderboards, and community resources, including data functions, strategies for systematic model evaluation, meaningful data splits, data processors, and molecule generation oracles. All datasets and learning tasks are integrated and accessible via an open-source library. We envision that TDC can facilitate algorithmic and scientific advances and accelerate development, validation, and transition into production and clinical implementation. TDC is a continuous, open-source initiative, and we invite contributions from the research community. TDC is publicly available at https://tdcommons.ai.
Machine Learning (ML) models typically require large-scale, balanced training data to be robust, generalizable, and effective in the context of healthcare. This has been a major issue for developing ML models for the coronavirus-disease 2019 (COVID-19) pandemic where data is highly imbalanced, particularly within electronic health records (EHR) research. Conventional approaches in ML use cross-entropy loss (CEL) that often suffers from poor margin classification. For the first time, we show that contrastive loss (CL) improves the performance of CEL especially for imbalanced EHR data and the related COVID-19 analyses. This study has been approved by the Institutional Review Board at the Icahn School of Medicine at Mount Sinai. We use EHR data from five hospitals within the Mount Sinai Health System (MSHS) to predict mortality, intubation, and intensive care unit (ICU) transfer in hospitalized COVID-19 patients over 24 and 48 hour time windows. We train two sequential architectures (RNN and RETAIN) using two loss functions (CEL and CL). Models are tested on full sample data set which contain all available data and restricted data set to emulate higher class imbalance.CL models consistently outperform CEL models with the restricted data set on these tasks with differences ranging from 0.04 to 0.15 for AUPRC and 0.05 to 0.1 for AUROC. For the restricted sample, only the CL model maintains proper clustering and is able to identify important features, such as pulse oximetry. CL outperforms CEL in instances of severe class imbalance, on three EHR outcomes with respect to three performance metrics: predictive power, clustering, and feature importance. We believe that the developed CL framework can be expanded and used for EHR ML work in general.
The efficacy of a drug depends on its binding affinity to the therapeutic target and pharmacokinetics. Deep learning (DL) has demonstrated remarkable progress in predicting drug efficacy. We develop MolDesigner, a human-in-the-loop web user-interface (UI), to assist drug developers leverage DL predictions to design more effective drugs. A developer can draw a drug molecule in the interface. In the backend, more than 17 state-of-the-art DL models generate predictions on important indices that are crucial for a drug's efficacy. Based on these predictions, drug developers can edit the drug molecule and reiterate until satisfaction. MolDesigner can make predictions in real-time with a latency of less than a second.
Prevailing methods for graphs require abundant label and edge information for learning. When data for a new task are scarce, meta-learning allows us to learn from prior experiences and form much-needed inductive biases for fast adaption to the new task. Here, we introduce G-Meta, a novel meta-learning approach for graphs. G-Meta uses local subgraphs to transfer subgraph-specific information and make the model learn the essential knowledge faster via meta gradients. G-Meta learns how to quickly adapt to a new task using only a handful of nodes or edges in the new task and does so by learning from data points in other graphs or related, albeit disjoint label sets. G-Meta is theoretically justified as we show that the evidence for a particular prediction can be found in the local subgraph surrounding the target node or edge. Experiments on seven datasets and nine baseline methods show that G-Meta can considerably outperform existing methods by up to 16.3%. Unlike previous methods, G-Meta can successfully learn in challenging, few-shot learning settings that require generalization to completely new graphs and never-before-seen labels. Finally, G-Meta scales to large graphs, which we demonstrate on our new Tree-of-Life dataset comprising of 1,840 graphs, a two-orders of magnitude increase in the number of graphs used in prior work.