We propose a Deep learning-based weak label learning method for analysing whole slide images (WSIs) of Hematoxylin and Eosin (H&E) stained tumorcells not requiring pixel-level or tile-level annotations using Self-supervised pre-training and heterogeneity-aware deep Multiple Instance LEarning (DeepSMILE). We apply DeepSMILE to the task of Homologous recombination deficiency (HRD) and microsatellite instability (MSI) prediction. We utilize contrastive self-supervised learning to pre-train a feature extractor on histopathology tiles of cancer tissue. Additionally, we use variability-aware deep multiple instance learning to learn the tile feature aggregation function while modeling tumor heterogeneity. Compared to state-of-the-art genomic label classification methods, DeepSMILE improves classification performance for HRD from $70.43\pm4.10\%$ to $83.79\pm1.25\%$ AUC and MSI from $78.56\pm6.24\%$ to $90.32\pm3.58\%$ AUC in a multi-center breast and colorectal cancer dataset, respectively. These improvements suggest we can improve genomic label classification performance without collecting larger datasets. In the future, this may reduce the need for expensive genome sequencing techniques, provide personalized therapy recommendations based on widely available WSIs of cancer tissue, and improve patient care with quicker treatment decisions - also in medical centers without access to genome sequencing resources.
Machine learning systems based on minimizing average error have been shown to perform inconsistently across notable subsets of the data, which is not exposed by a low average error for the entire dataset. In consequential social and economic applications, where data represent people, this can lead to discrimination of underrepresented gender and ethnic groups. Given the importance of bias mitigation in machine learning, the topic leads to contentious debates on how to ensure fairness in practice (data bias versus algorithmic bias). Distributionally Robust Optimization (DRO) seemingly addresses this problem by minimizing the worst expected risk across subpopulations. We establish theoretical results that clarify the relation between DRO and the optimization of the same loss averaged on an adequately weighted training dataset. The results cover finite and infinite number of training distributions, as well as convex and non-convex loss functions. We show that neither DRO nor curating the training set should be construed as a complete solution for bias mitigation: in the same way that there is no universally robust training set, there is no universal way to setup a DRO problem and ensure a socially acceptable set of results. We then leverage these insights to provide a mininal set of practical recommendations for addressing bias with DRO. Finally, we discuss ramifications of our results in other related applications of DRO, using an example of adversarial robustness. Our results show that there is merit to both the algorithm-focused and the data-focused side of the bias debate, as long as arguments in favor of these positions are precisely qualified and backed by relevant mathematics known today.
Graphs are widely used to model the relational structure of data, and the research of graph machine learning (ML) has a wide spectrum of applications ranging from drug design in molecular graphs to friendship recommendation in social networks. Prevailing approaches for graph ML typically require abundant labeled instances in achieving satisfactory results, which is commonly infeasible in real-world scenarios since labeled data for newly emerged concepts (e.g., new categorizations of nodes) on graphs is limited. Though meta-learning has been applied to different few-shot graph learning problems, most existing efforts predominately assume that all the data from those seen classes is gold-labeled, while those methods may lose their efficacy when the seen data is weakly-labeled with severe label noise. As such, we aim to investigate a novel problem of weakly-supervised graph meta-learning for improving the model robustness in terms of knowledge transfer. To achieve this goal, we propose a new graph meta-learning framework -- Graph Hallucination Networks (Meta-GHN) in this paper. Based on a new robustness-enhanced episodic training, Meta-GHN is meta-learned to hallucinate clean node representations from weakly-labeled data and extracts highly transferable meta-knowledge, which enables the model to quickly adapt to unseen tasks with few labeled instances. Extensive experiments demonstrate the superiority of Meta-GHN over existing graph meta-learning studies on the task of weakly-supervised few-shot node classification.
Although healthcare is a remarkably sensitive domain of application, and systems that exert direct control over the world can cause harm in a way that humans cannot necessarily correct or oversee, it is still unclear whether and how healthcare robots are currently regulated or should be regulated. Existing regulations are primarily unprepared to provide guidance for such a rapidly evolving field and accommodate devices that rely on machine learning and AI. Moreover, the field of healthcare robotics is very rich and extensive, but it is still very much scattered and unclear in terms of definitions, medical and technical classifications, product characteristics, purpose, and intended use. As a result, these devices often navigate between the medical device regulation or other non-medical norms, such as the ISO personal care standard. Before regulating the field of healthcare robots, it is therefore essential to map the major state-of-the-art developments in healthcare robotics, their capabilities and applications, and the challenges we face as a result of their integration within the healthcare environment. This contribution fills in this gap and lack of clarity currently experienced within healthcare robotics and its governance by providing a structured overview of and further elaboration on the main categories now established, their intended purpose, use, and main characteristics. We explicitly focus on surgical, assistive, and service robots to rightfully match the definition of healthcare as the organized provision of medical care to individuals, including efforts to maintain, treat, or restore physical, mental, or emotional well-being. We complement these findings with policy recommendations to help policymakers unravel an optimal regulatory framing for healthcare robot technologies
It is commonly believed that Bayesian optimization (BO) algorithms are highly efficient for optimizing numerically costly functions. However, BO is not often compared to widely different alternatives, and is mostly tested on narrow sets of problems (multimodal, low-dimensional functions), which makes it difficult to assess where (or if) they actually achieve state-of-the-art performance. Moreover, several aspects in the design of these algorithms vary across implementations without a clear recommendation emerging from current practices, and many of these design choices are not substantiated by authoritative test campaigns. This article reports a large investigation about the effects on the performance of (Gaussian process based) BO of common and less common design choices. The experiments are carried out with the established COCO (COmparing Continuous Optimizers) software. It is found that a small initial budget, a quadratic trend, high-quality optimization of the acquisition criterion bring consistent progress. Using the GP mean as an occasional acquisition contributes to a negligible additional improvement. Warping degrades performance. The Mat\'ern 5/2 kernel is a good default but it may be surpassed by the exponential kernel on irregular functions. Overall, the best EGO variants are competitive or improve over state-of-the-art algorithms in dimensions less or equal to 5 for multimodal functions. The code developed for this study makes the new version (v2.1.1) of the R package DiceOptim available on CRAN. The structure of the experiments by function groups allows to define priorities for future research on Bayesian optimization.
This paper explores learning rich self-supervised entity representations from large amounts of the associated text. Once pre-trained, these models become applicable to multiple entity-centric tasks such as ranked retrieval, knowledge base completion, question answering, and more. Unlike other methods that harvest self-supervision signals based merely on a local context within a sentence, we radically expand the notion of context to include any available text related to an entity. This enables a new class of powerful, high-capacity representations that can ultimately distill much of the useful information about an entity from multiple text sources, without any human supervision. We present several training strategies that, unlike prior approaches, learn to jointly predict words and entities -- strategies we compare experimentally on downstream tasks in the TV-Movies domain, such as MovieLens tag prediction from user reviews and natural language movie search. As evidenced by results, our models match or outperform competitive baselines, sometimes with little or no fine-tuning, and can scale to very large corpora. Finally, we make our datasets and pre-trained models publicly available. This includes Reviews2Movielens (see https://goo.gle/research-docent ), mapping the up to 1B word corpus of Amazon movie reviews (He and McAuley, 2016) to MovieLens tags (Harper and Konstan, 2016), as well as Reddit Movie Suggestions (see https://urikz.github.io/docent ) with natural language queries and corresponding community recommendations.
Knowledge tracing (KT) refers to the problem of predicting future learner performance given their past performance in educational applications. Recent developments in KT using flexible deep neural network-based models excel at this task. However, these models often offer limited interpretability, thus making them insufficient for personalized learning, which requires using interpretable feedback and actionable recommendations to help learners achieve better learning outcomes. In this paper, we propose attentive knowledge tracing (AKT), which couples flexible attention-based neural network models with a series of novel, interpretable model components inspired by cognitive and psychometric models. AKT uses a novel monotonic attention mechanism that relates a learner's future responses to assessment questions to their past responses; attention weights are computed using exponential decay and a context-aware relative distance measure, in addition to the similarity between questions. Moreover, we use the Rasch model to regularize the concept and question embeddings; these embeddings are able to capture individual differences among questions on the same concept without using an excessive number of parameters. We conduct experiments on several real-world benchmark datasets and show that AKT outperforms existing KT methods (by up to $6\%$ in AUC in some cases) on predicting future learner responses. We also conduct several case studies and show that AKT exhibits excellent interpretability and thus has potential for automated feedback and personalization in real-world educational settings.
In recent years, Graph Convolutional Networks (GCNs) and their variants have been widely utilized in learning tasks that involve graphs. These tasks include recommendation systems, node classification, among many others. In node classification problem, the input is a graph in which the edges represent the association between pairs of nodes, multi-dimensional feature vectors are associated with the nodes, and some of the nodes in the graph have known labels. The objective is to predict the labels of the nodes that are not labeled, using the nodes features, in conjunction with graph topology. While GCNs have been successfully applied to this problem, the caveats that they inherit from traditional deep learning models pose significant challenges to broad utilization of GCNs in node classification. One such caveat is that training a GCN requires a large number of labeled training instances, which is often not the case in realistic settings. To remedy this requirement, state-of-the-art methods leverage network diffusion-based approaches to propagate labels across the network before training GCNs. However, these approaches ignore the tendency of the network diffusion methods in biasing proximity with centrality, resulting in the propagation of labels to the nodes that are well-connected in the graph. To address this problem, here we present an alternate approach to extrapolating node labels in GCNs in the following three steps: (i) clustering of the network to identify communities, (ii) use of network diffusion algorithms to quantify the proximity of each node to the communities, thereby obtaining a low-dimensional topological profile for each node, (iii) comparing these topological profiles to identify nodes that are most similar to the labeled nodes.
The complexity of human cancer often results in significant heterogeneity in response to treatment. Precision medicine offers potential to improve patient outcomes by leveraging this heterogeneity. Individualized treatment rules (ITRs) formalize precision medicine as maps from the patient covariate space into the space of allowable treatments. The optimal ITR is that which maximizes the mean of a clinical outcome in a population of interest. Patient-derived xenograft (PDX) studies permit the evaluation of multiple treatments within a single tumor and thus are ideally suited for estimating optimal ITRs. PDX data are characterized by correlated outcomes, a high-dimensional feature space, and a large number of treatments. Existing methods for estimating optimal ITRs do not take advantage of the unique structure of PDX data or handle the associated challenges well. In this paper, we explore machine learning methods for estimating optimal ITRs from PDX data. We analyze data from a large PDX study to identify biomarkers that are informative for developing personalized treatment recommendations in multiple cancers. We estimate optimal ITRs using regression-based approaches such as Q-learning and direct search methods such as outcome weighted learning. Finally, we implement a superlearner approach to combine a set of estimated ITRs and show that the resulting ITR performs better than any of the input ITRs, mitigating uncertainty regarding user choice of any particular ITR estimation methodology. Our results indicate that PDX data are a valuable resource for developing individualized treatment strategies in oncology.
With the rapid development of Internet and multimedia services in the past decade, a huge amount of user-generated and service provider-generated multimedia data become available. These data are heterogeneous and multi-modal in nature, imposing great challenges for processing and analyzing them. Multi-modal data consist of a mixture of various types of data from different modalities such as texts, images, videos, audios etc. In this article, we present a deep and comprehensive overview for multi-modal analysis in multimedia. We introduce two scientific research problems, data-driven correlational representation and knowledge-guided fusion for multimedia analysis. To address the two scientific problems, we investigate them from the following aspects: 1) multi-modal correlational representation: multi-modal fusion of data across different modalities, and 2) multi-modal data and knowledge fusion: multi-modal fusion of data with domain knowledge. More specifically, on data-driven correlational representation, we highlight three important categories of methods, such as multi-modal deep representation, multi-modal transfer learning, and multi-modal hashing. On knowledge-guided fusion, we discuss the approaches for fusing knowledge with data and four exemplar applications that require various kinds of domain knowledge, including multi-modal visual question answering, multi-modal video summarization, multi-modal visual pattern mining and multi-modal recommendation. Finally, we bring forward our insights and future research directions.