Abstract:Glioblastoma, IDH-wildtype (GBM-IDHwt) is the most common malignant brain tumor. Histomorphology is a crucial component of the integrated diagnosis of GBM-IDHwt. Artificial intelligence (AI) methods have shown promise to extract additional prognostic information from histological whole-slide images (WSI) of hematoxylin and eosin-stained glioblastoma tissue. Here, we present an explainable AI-based method to support systematic interpretation of histomorphological features associated with survival. It combines an explainable multiple instance learning (MIL) architecture with a sparse autoencoder (SAE) to relate human-interpretable visual patterns of tissue to survival. The MIL architecture directly identifies prognosis-relevant image tiles and the SAE maps these tiles post-hoc to visual patterns. The MIL method was trained and evaluated using a new real-world dataset that comprised 720 GBM-IDHwt cases from three hospitals and four cancer registries in Germany. The SAE was trained using 1878 WSIs of glioblastoma from five independent public data collections. Despite the many factors influencing survival time, our method showed some ability to discriminate between patients living less than 180 days or more than 360 days solely based on histomorphology (AUC: 0.67; 95% CI: 0.63-0.72). Cox proportional hazards regression confirmed a significant difference in survival time between the predicted groups after adjustment for established prognostic factors (hazard ratio: 1.47; 95% CI: 1.26-1.72). Our method identified multiple interpretable visual patterns associated with survival. Three neuropathologists separately found that 21 of the 24 most strongly associated patterns could be clearly attributed to seven histomorphological categories. Necrosis and hemorrhage appeared to be associated with shorter survival while highly cellular tumor areas were associated with longer survival.
Abstract:The generation of synthetic data is a promising technology to make medical data available for secondary use in a privacy-compliant manner. A popular method for creating realistic patient data is the rule-based Synthea data generator. Synthea generates data based on rules describing the lifetime of a synthetic patient. These rules typically express the probability of a condition occurring, such as a disease, depending on factors like age. Since they only contain statistical information, rules usually have no specific data protection requirements. However, creating meaningful rules can be a very complex process that requires expert knowledge and realistic sample data. In this paper, we introduce and evaluate an approach to automatically generate Synthea rules based on statistics from tabular data, which we extracted from cancer reports. As an example use case, we created a Synthea module for glioblastoma from a real-world dataset and used it to generate a synthetic dataset. Compared to the original dataset, the synthetic data reproduced known disease courses and mostly retained the statistical properties. Overall, synthetic patient data holds great potential for privacy-preserving research. The data can be used to formulate hypotheses and to develop prototypes, but medical interpretation should consider the specific limitations as with any currently available approach.
Abstract:Generative artificial intelligence for synthetic data generation holds substantial potential to address practical challenges in epidemiology. However, many current methods suffer from limited quality, high computational demands, and complexity for non-experts. Furthermore, common evaluation strategies for synthetic data often fail to directly reflect statistical utility. Against this background, a critical underexplored question is whether synthetic data can reliably reproduce key findings from epidemiological research. We propose the use of adversarial random forests (ARF) as an efficient and convenient method for synthesizing tabular epidemiological data. To evaluate its performance, we replicated statistical analyses from six epidemiological publications and compared original with synthetic results. These publications cover blood pressure, anthropometry, myocardial infarction, accelerometry, loneliness, and diabetes, based on data from the German National Cohort (NAKO Gesundheitsstudie), the Bremen STEMI Registry U45 Study, and the Guelph Family Health Study. Additionally, we assessed the impact of dimensionality and variable complexity on synthesis quality by limiting datasets to variables relevant for individual analyses, including necessary derivations. Across all replicated original studies, results from multiple synthetic data replications consistently aligned with original findings. Even for datasets with relatively low sample size-to-dimensionality ratios, the replication outcomes closely matched the original results across various descriptive and inferential analyses. Reducing dimensionality and pre-deriving variables further enhanced both quality and stability of the results.