Machine learning holds tremendous promise for transforming the fundamental practice of scientific discovery by virtue of its data-driven nature. With the ever-increasing stream of research data collection, it is appealing to autonomously explore patterns and insights from observational data in order to discover novel classes of phenotypes and concepts. However, in the biomedical domain, several challenges inherent in the accumulated data hamper the progress of novel class discovery. The non-i.i.d. data distribution, accompanied by severe imbalance among different groups of classes, leads to ambiguous and biased semantic representations. In this work, we present a geometry-constrained probabilistic modeling treatment to resolve these issues. First, we propose to parameterize the approximate posterior of the instance embedding as a marginal von Mises-Fisher distribution to account for the interference of distributional latent bias. Then, we incorporate a suite of critical geometric properties to impose proper constraints on the layout of the constructed embedding space, which in turn minimizes the uncontrollable risk of unknown class learning and structuring. Furthermore, a spectral graph-theoretic method is devised to estimate the number of potential novel classes. It offers two appealing merits compared with existing approaches, namely high computational efficiency and flexibility for taxonomy-adaptive estimation. Extensive experiments across various biomedical scenarios substantiate the effectiveness and general applicability of our method.
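The spectral graph-theoretic estimation of the number of potential novel classes can be made concrete with a classical eigengap heuristic over a k-nearest-neighbor affinity graph. The sketch below is only an illustration of that family of estimators, not the paper's exact formulation; the function and parameter names (`estimate_num_classes`, `k_neighbors`, `max_classes`) are hypothetical.

```python
import numpy as np

def estimate_num_classes(embeddings, k_neighbors=10, max_classes=30):
    """Estimate the number of clusters via the eigengap of a normalized graph Laplacian.

    Illustrative sketch of a spectral, graph-theoretic estimator; the paper's
    actual formulation may differ.
    """
    # Build a symmetric k-NN affinity graph over L2-normalized embeddings.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T                                   # cosine similarities
    n = sim.shape[0]
    affinity = np.zeros_like(sim)
    idx = np.argsort(-sim, axis=1)[:, 1:k_neighbors + 1]   # skip self-similarity
    rows = np.repeat(np.arange(n), k_neighbors)
    affinity[rows, idx.ravel()] = sim[rows, idx.ravel()]
    affinity = np.maximum(affinity, affinity.T)     # symmetrize

    # Normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
    deg = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    lap = np.eye(n) - (d_inv_sqrt[:, None] * affinity) * d_inv_sqrt[None, :]

    # The position of the largest gap among the smallest eigenvalues is a
    # standard proxy for the number of well-separated clusters.
    eigvals = np.sort(np.linalg.eigvalsh(lap))[:max_classes]
    gaps = np.diff(eigvals)
    return int(np.argmax(gaps)) + 1
```

Because the estimate only requires a partial eigendecomposition of a sparse graph, this style of estimator is computationally cheap relative to repeatedly re-clustering at every candidate class count.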
Annotation scarcity and cross-modality/stain data distribution shifts are two major obstacles hindering the application of deep learning models to nuclei analysis, which holds a broad spectrum of potential applications in digital pathology. Recently, unsupervised domain adaptation (UDA) methods have been proposed to mitigate the distributional gap between different imaging modalities for unsupervised nuclei segmentation in histopathology images. However, existing UDA methods are built upon the assumption that data distributions within each domain are uniform. Based on this over-simplified assumption, they align the histopathology target domain with the source domain as a whole, neglecting the severe intra-domain discrepancy across subpartitions incurred by mixed cancer types and sampling organs. In this paper, for the first time, we propose to explicitly consider the heterogeneity within the histopathology domain and introduce open compound domain adaptation (OCDA) to resolve this issue. Specifically, a two-stage disentanglement framework is proposed to acquire domain-invariant feature representations at both the image and instance levels. The holistic design addresses the limitation of existing OCDA approaches, which struggle to capture instance-wise variations. Two regularization strategies are specifically devised to leverage the rich subpartition-specific characteristics in histopathology images and facilitate subdomain decomposition. Moreover, we propose a dual-branch nucleus shape and structure preserving module to prevent nucleus over-generation and deformation in the synthesized images. Experimental results on both cross-modality and cross-stain scenarios over a broad range of diverse datasets demonstrate the superiority of our method compared with state-of-the-art UDA and OCDA methods.
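One way to ground the subdomain decomposition mentioned above is to cluster unlabeled target images by their style statistics, a common device in OCDA-style pipelines. The snippet below is a hedged sketch under that assumption; `decompose_subdomains`, the use of channel-wise feature statistics, and the k-means choice are illustrative and do not reproduce the paper's two-stage disentanglement or its regularization strategies.

```python
import numpy as np
from sklearn.cluster import KMeans

def decompose_subdomains(feature_maps, num_subdomains=4):
    """Group unlabeled target images into latent subdomains from their style statistics.

    feature_maps: array of shape (N, C, H, W) holding shallow CNN features per image.
    Hedged sketch of subdomain decomposition; not the paper's actual pipeline.
    """
    # Channel-wise mean and std are a common proxy for stain/modality "style".
    mu = feature_maps.mean(axis=(2, 3))            # (N, C)
    sigma = feature_maps.std(axis=(2, 3))          # (N, C)
    style_codes = np.concatenate([mu, sigma], axis=1)

    # Cluster the style codes to obtain pseudo sub-domain labels, which might
    # roughly correspond to cancer types or sampling organs.
    km = KMeans(n_clusters=num_subdomains, n_init=10, random_state=0)
    return km.fit_predict(style_codes)
```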
The success of automated medical image analysis depends on large-scale, expert-annotated training sets. Unsupervised domain adaptation (UDA) has emerged as a promising approach to alleviate the burden of labeled data collection. However, UDA methods generally operate under a closed-set adaptation setting, assuming an identical label set between the source and target domains, which is over-restrictive in clinical practice, where new classes commonly exist across datasets due to taxonomic inconsistency. While several methods have been presented to tackle both domain shifts and incoherent label sets, none of them takes into account the common characteristics of the two issues or considers the learning dynamics along network training. In this work, we propose optimization trajectory distillation, a unified approach that addresses the two technical challenges from a new perspective. It exploits the low-rank nature of gradient space and devises a dual-stream distillation algorithm to regularize the learning dynamics of insufficiently annotated domains and classes with external guidance obtained from reliable sources. Our approach resolves the issue of inadequate navigation along network optimization, which is the major obstacle in the taxonomy-adaptive cross-domain adaptation scenario. We evaluate the proposed method extensively on several tasks towards various endpoints with clinical and open-world significance. The results demonstrate its effectiveness and improvements over previous methods.
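To make the gradient-space idea tangible, the following sketch projects a noisy gradient onto the dominant low-rank subspace of a reliable reference trajectory and blends the two. Everything here (`distill_gradient`, the SVD-based basis, the blending coefficient `beta`) is an assumption used for illustration; it is not the paper's dual-stream distillation algorithm.

```python
import numpy as np

def distill_gradient(grad, reference_history, rank=8, beta=0.5):
    """Regularize a noisy gradient with the low-rank subspace of a reliable trajectory.

    grad:              flattened gradient of the insufficiently annotated stream, shape (D,)
    reference_history: past gradients from the reliable source, shape (T, D)
    Schematic sketch under the low-rank gradient-space assumption stated in the abstract.
    """
    # Top-k right singular vectors span the dominant directions of the
    # reference optimization trajectory.
    _, _, vt = np.linalg.svd(reference_history, full_matrices=False)
    basis = vt[:rank]                               # (rank, D)

    # Project the current gradient onto that subspace and blend it with the
    # raw gradient to steer the update toward reliable learning dynamics.
    projected = basis.T @ (basis @ grad)
    return beta * projected + (1.0 - beta) * grad
```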
Many efforts have been made to discover tumor-specific microenvironment elements (TMEs) from immunostained tissue sections. However, identifying yet unknown but relevant TMEs from multiplex immunostained tissues remains a challenge, owing to the large number of markers involved (in the tens) and the complexity of their spatial interactions. We present NaroNet, which uses machine learning to identify and annotate known as well as novel TMEs from self-supervised embeddings of cells, organized at different levels (local cell phenotypes and cellular neighborhoods). It then uses the abundance of TMEs to classify patients based on biological or clinical features. We validate NaroNet using synthetic patient cohorts with adjustable incidence of different TMEs and two cancer patient datasets. In both synthetic and real datasets, NaroNet identifies, without supervision, novel TMEs that are relevant to the user-defined classification task. As NaroNet requires only patient-level information, it makes state-of-the-art computational methods accessible to a broad audience, accelerating the discovery of biomarker signatures.
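The two organizational levels named above (local cell phenotypes and cellular neighborhoods) can be approximated with a simple two-stage clustering, sketched below. The function name, the k-means choices, and the k-nearest-neighbor composition step are hypothetical and stand in for NaroNet's learned, graph-based representation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def phenotypes_and_neighborhoods(cell_embeddings, coords,
                                 n_phenotypes=10, n_neighborhoods=8, k=20):
    """Assign cells to phenotypes, then group local phenotype compositions into neighborhoods.

    Hedged sketch of the two organizational levels described in the abstract;
    not NaroNet's actual implementation.
    """
    # Level 1: cluster per-cell embeddings into phenotypes.
    pheno = KMeans(n_clusters=n_phenotypes, n_init=10, random_state=0)
    pheno_labels = pheno.fit_predict(cell_embeddings)

    # Level 2: describe each cell by the phenotype histogram of its k nearest
    # spatial neighbors, then cluster these compositions into neighborhoods.
    nn = NearestNeighbors(n_neighbors=k).fit(coords)
    _, idx = nn.kneighbors(coords)
    hist = np.stack([np.bincount(pheno_labels[i], minlength=n_phenotypes)
                     for i in idx]).astype(float)
    hist /= hist.sum(axis=1, keepdims=True)
    neigh = KMeans(n_clusters=n_neighborhoods, n_init=10, random_state=0)
    return pheno_labels, neigh.fit_predict(hist)
```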
We present NaroNet, a machine learning framework that integrates the multiscale, in situ spatial analysis of the tumor microenvironment (TME) with patient-level predictions in a seamless end-to-end learning pipeline. Trained only with patient-level labels, NaroNet quantifies the phenotypes, neighborhoods, and neighborhood interactions that have the highest influence on the predictive task. We validate NaroNet using synthetic data simulating multiplex-immunostained images with an adjustable probabilistic incidence of different TMEs. We then apply our model to two real sets of patient tumors: one consisting of 336 seven-color multiplex-immunostained images from 12 high-grade endometrial cancers, and the other consisting of 372 35-plex mass cytometry images from 283 breast cancer patients. In both synthetic and real datasets, NaroNet provides outstanding predictions while associating those predictions with the presence of specific TMEs. This inherent interpretability could be of great value both in a clinical setting and as a tool to discover novel biomarker signatures.
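The patient-level prediction step can likewise be illustrated with a minimal abundance-based classifier: each patient is summarized by the fraction of cells (or neighborhoods) assigned to each TME, and a linear model maps those abundances to the clinical label. This is a hedged sketch with hypothetical names; NaroNet learns this mapping end to end rather than as a separate post hoc classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def classify_patients(tme_labels_per_patient, patient_labels, n_tmes):
    """Predict patient-level outcomes from per-patient TME abundances.

    tme_labels_per_patient: list of 1-D integer arrays, each holding the TME
    assignment of every cell (or neighborhood) in one patient's images.
    Illustrative stand-in for NaroNet's end-to-end, patient-level prediction.
    """
    # Normalized abundance vector per patient: fraction of each TME.
    abundances = np.stack([
        np.bincount(labels, minlength=n_tmes) / max(len(labels), 1)
        for labels in tme_labels_per_patient
    ])
    # A linear classifier keeps the TME-to-outcome association interpretable.
    clf = LogisticRegression(max_iter=1000).fit(abundances, patient_labels)
    return clf
```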