Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ben Glocker

Biomedical Image Analysis Group, Department of Computing, Imperial College London

Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results

May 13, 2025

Meritxell Riera-Marin, Sikha O K, Julia Rodriguez-Comas, Matthias Stefan May, Zhaohong Pan, Xiang Zhou, Xiaokun Liang, Franciskus Xaverius Erick, Andrea Prenner, Cedric Hemon(+22 more)

Figure 1 for Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results

Figure 2 for Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results

Figure 3 for Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results

Figure 4 for Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results

Abstract:Deep learning (DL) has become the dominant approach for medical image segmentation, yet ensuring the reliability and clinical applicability of these models requires addressing key challenges such as annotation variability, calibration, and uncertainty estimation. This is why we created the Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS), which highlights the critical role of multiple annotators in establishing a more comprehensive ground truth, emphasizing that segmentation is inherently subjective and that leveraging inter-annotator variability is essential for robust model evaluation. Seven teams participated in the challenge, submitting a variety of DL models evaluated using metrics such as Dice Similarity Coefficient (DSC), Expected Calibration Error (ECE), and Continuous Ranked Probability Score (CRPS). By incorporating consensus and dissensus ground truth, we assess how DL models handle uncertainty and whether their confidence estimates align with true segmentation performance. Our findings reinforce the importance of well-calibrated models, as better calibration is strongly correlated with the quality of the results. Furthermore, we demonstrate that segmentation models trained on diverse datasets and enriched with pre-trained knowledge exhibit greater robustness, particularly in cases deviating from standard anatomical structures. Notably, the best-performing models achieved high DSC and well-calibrated uncertainty estimates. This work underscores the need for multi-annotator ground truth, thorough calibration assessments, and uncertainty-aware evaluations to develop trustworthy and clinically reliable DL-based medical image segmentation models.

* This challenge was hosted in MICCAI 2024

Via

Access Paper or Ask Questions

Combining imaging and shape features for prediction tasks of Alzheimer's disease classification and brain age regression

Jan 14, 2025

Nairouz Shehata, Carolina Piçarra, Ben Glocker

Figure 1 for Combining imaging and shape features for prediction tasks of Alzheimer's disease classification and brain age regression

Figure 2 for Combining imaging and shape features for prediction tasks of Alzheimer's disease classification and brain age regression

Figure 3 for Combining imaging and shape features for prediction tasks of Alzheimer's disease classification and brain age regression

Figure 4 for Combining imaging and shape features for prediction tasks of Alzheimer's disease classification and brain age regression

Abstract:We investigate combining imaging and shape features extracted from MRI for the clinically relevant tasks of brain age prediction and Alzheimer's disease classification. Our proposed model fuses ResNet-extracted image embeddings with shape embeddings from a bespoke graph neural network. The shape embeddings are derived from surface meshes of 15 brain structures, capturing detailed geometric information. Combined with the appearance features from T1-weighted images, we observe improvements in the prediction performance on both tasks, with substantial gains for classification. We evaluate the model using public datasets, including CamCAN, IXI, and OASIS3, demonstrating the effectiveness of fusing imaging and shape features for brain analysis.

Via

Access Paper or Ask Questions

Continuous Bayesian Model Selection for Multivariate Causal Discovery

Nov 15, 2024

Anish Dhir, Ruby Sedgwick, Avinash Kori, Ben Glocker, Mark van der Wilk

Figure 1 for Continuous Bayesian Model Selection for Multivariate Causal Discovery

Figure 2 for Continuous Bayesian Model Selection for Multivariate Causal Discovery

Figure 3 for Continuous Bayesian Model Selection for Multivariate Causal Discovery

Figure 4 for Continuous Bayesian Model Selection for Multivariate Causal Discovery

Abstract:Current causal discovery approaches require restrictive model assumptions or assume access to interventional data to ensure structure identifiability. These assumptions often do not hold in real-world applications leading to a loss of guarantees and poor accuracy in practice. Recent work has shown that, in the bivariate case, Bayesian model selection can greatly improve accuracy by exchanging restrictive modelling for more flexible assumptions, at the cost of a small probability of error. We extend the Bayesian model selection approach to the important multivariate setting by making the large discrete selection problem scalable through a continuous relaxation. We demonstrate how for our choice of Bayesian non-parametric model, the Causal Gaussian Process Conditional Density Estimator (CGP-CDE), an adjacency matrix can be constructed from the model hyperparameters. This adjacency matrix is then optimised using the marginal likelihood and an acyclicity regulariser, outputting the maximum a posteriori causal graph. We demonstrate the competitiveness of our approach on both synthetic and real-world datasets, showing it is possible to perform multivariate causal discovery without infeasible assumptions using Bayesian model selection.

Via

Access Paper or Ask Questions

Automatic dataset shift identification to support root cause analysis of AI performance drift

Nov 13, 2024

Mélanie Roschewitz, Raghav Mehta, Charles Jones, Ben Glocker

Abstract:Shifts in data distribution can substantially harm the performance of clinical AI models. Hence, various methods have been developed to detect the presence of such shifts at deployment time. However, root causes of dataset shifts are varied, and the choice of shift mitigation strategies is highly dependent on the precise type of shift encountered at test time. As such, detecting test-time dataset shift is not sufficient: precisely identifying which type of shift has occurred is critical. In this work, we propose the first unsupervised dataset shift identification framework, effectively distinguishing between prevalence shift (caused by a change in the label distribution), covariate shift (caused by a change in input characteristics) and mixed shifts (simultaneous prevalence and covariate shifts). We discuss the importance of self-supervised encoders for detecting subtle covariate shifts and propose a novel shift detector leveraging both self-supervised encoders and task model outputs for improved shift detection. We report promising results for the proposed shift identification framework across three different imaging modalities (chest radiography, digital mammography, and retinal fundus images) on five types of real-world dataset shifts, using four large publicly available datasets.

* Code available at https://github.com/biomedia-mira/shift_identification

Via

Access Paper or Ask Questions

Rethinking Fair Representation Learning for Performance-Sensitive Tasks

Oct 05, 2024

Charles Jones, Fabio de Sousa Ribeiro, Mélanie Roschewitz, Daniel C. Castro, Ben Glocker

Figure 1 for Rethinking Fair Representation Learning for Performance-Sensitive Tasks

Figure 2 for Rethinking Fair Representation Learning for Performance-Sensitive Tasks

Figure 3 for Rethinking Fair Representation Learning for Performance-Sensitive Tasks

Figure 4 for Rethinking Fair Representation Learning for Performance-Sensitive Tasks

Abstract:We investigate the prominent class of fair representation learning methods for bias mitigation. Using causal reasoning to define and formalise different sources of dataset bias, we reveal important implicit assumptions inherent to these methods. We prove fundamental limitations on fair representation learning when evaluation data is drawn from the same distribution as training data and run experiments across a range of medical modalities to examine the performance of fair representation learning under distribution shifts. Our results explain apparent contradictions in the existing literature and reveal how rarely considered causal and statistical aspects of the underlying data affect the validity of fair representation learning. We raise doubts about current evaluation practices and the applicability of fair representation learning methods in performance-sensitive settings. We argue that fine-grained analysis of dataset biases should play a key role in the field moving forward.

Via

Access Paper or Ask Questions

Radio-opaque artefacts in digital mammography: automatic detection and analysis of downstream effects

Oct 04, 2024

Amelia Schueppert, Ben Glocker, Mélanie Roschewitz

Figure 1 for Radio-opaque artefacts in digital mammography: automatic detection and analysis of downstream effects

Figure 2 for Radio-opaque artefacts in digital mammography: automatic detection and analysis of downstream effects

Figure 3 for Radio-opaque artefacts in digital mammography: automatic detection and analysis of downstream effects

Figure 4 for Radio-opaque artefacts in digital mammography: automatic detection and analysis of downstream effects

Abstract:This study investigates the effects of radio-opaque artefacts, such as skin markers, breast implants, and pacemakers, on mammography classification models. After manually annotating 22,012 mammograms from the publicly available EMBED dataset, a robust multi-label artefact detector was developed to identify five distinct artefact types (circular and triangular skin markers, breast implants, support devices and spot compression structures). Subsequent experiments on two clinically relevant tasks $-$ breast density assessment and cancer screening $-$ revealed that these artefacts can significantly affect model performance, alter classification thresholds, and distort output distributions. These findings underscore the importance of accurate automatic artefact detection for developing reliable and robust classification models in digital mammography. To facilitate future research our annotations, code, and model predictions are made publicly available.

* Code available at https://github.com/biomedia-mira/mammo-artifacts

Via

Access Paper or Ask Questions

Robust image representations with counterfactual contrastive learning

Sep 16, 2024

Mélanie Roschewitz, Fabio De Sousa Ribeiro, Tian Xia, Galvin Khara, Ben Glocker

Figure 1 for Robust image representations with counterfactual contrastive learning

Figure 2 for Robust image representations with counterfactual contrastive learning

Figure 3 for Robust image representations with counterfactual contrastive learning

Figure 4 for Robust image representations with counterfactual contrastive learning

Abstract:Contrastive pretraining can substantially increase model generalisation and downstream performance. However, the quality of the learned representations is highly dependent on the data augmentation strategy applied to generate positive pairs. Positive contrastive pairs should preserve semantic meaning while discarding unwanted variations related to the data acquisition domain. Traditional contrastive pipelines attempt to simulate domain shifts through pre-defined generic image transformations. However, these do not always mimic realistic and relevant domain variations for medical imaging such as scanner differences. To tackle this issue, we herein introduce counterfactual contrastive learning, a novel framework leveraging recent advances in causal image synthesis to create contrastive positive pairs that faithfully capture relevant domain variations. Our method, evaluated across five datasets encompassing both chest radiography and mammography data, for two established contrastive objectives (SimCLR and DINO-v2), outperforms standard contrastive learning in terms of robustness to acquisition shift. Notably, counterfactual contrastive learning achieves superior downstream performance on both in-distribution and on external datasets, especially for images acquired with scanners under-represented in the training set. Further experiments show that the proposed framework extends beyond acquisition shifts, with models trained with counterfactual contrastive learning substantially improving subgroup performance across biological sex.

* Code available at https://github.com/biomedia-mira/counterfactual-contrastive/

Via

Access Paper or Ask Questions

Latent 3D Brain MRI Counterfactual

Sep 09, 2024

Wei Peng, Tian Xia, Fabio De Sousa Ribeiro, Tomas Bosschieter, Ehsan Adeli, Qingyu Zhao, Ben Glocker, Kilian M. Pohl

Figure 1 for Latent 3D Brain MRI Counterfactual

Figure 2 for Latent 3D Brain MRI Counterfactual

Figure 3 for Latent 3D Brain MRI Counterfactual

Figure 4 for Latent 3D Brain MRI Counterfactual

Abstract:The number of samples in structural brain MRI studies is often too small to properly train deep learning models. Generative models show promise in addressing this issue by effectively learning the data distribution and generating high-fidelity MRI. However, they struggle to produce diverse, high-quality data outside the distribution defined by the training data. One way to address the issue is using causal models developed for 3D volume counterfactuals. However, accurately modeling causality in high-dimensional spaces is a challenge so that these models generally generate 3D brain MRIS of lower quality. To address these challenges, we propose a two-stage method that constructs a Structural Causal Model (SCM) within the latent space. In the first stage, we employ a VQ-VAE to learn a compact embedding of the MRI volume. Subsequently, we integrate our causal model into this latent space and execute a three-step counterfactual procedure using a closed-form Generalized Linear Model (GLM). Our experiments conducted on real-world high-resolution MRI data (1mm) demonstrate that our method can generate high-quality 3D MRI counterfactuals.

Via

Access Paper or Ask Questions

Quantifying the Impact of Population Shift Across Age and Sex for Abdominal Organ Segmentation

Aug 08, 2024

Kate Čevora, Ben Glocker, Wenjia Bai

Figure 1 for Quantifying the Impact of Population Shift Across Age and Sex for Abdominal Organ Segmentation

Figure 2 for Quantifying the Impact of Population Shift Across Age and Sex for Abdominal Organ Segmentation

Figure 3 for Quantifying the Impact of Population Shift Across Age and Sex for Abdominal Organ Segmentation

Figure 4 for Quantifying the Impact of Population Shift Across Age and Sex for Abdominal Organ Segmentation

Abstract:Deep learning-based medical image segmentation has seen tremendous progress over the last decade, but there is still relatively little transfer into clinical practice. One of the main barriers is the challenge of domain generalisation, which requires segmentation models to maintain high performance across a wide distribution of image data. This challenge is amplified by the many factors that contribute to the diverse appearance of medical images, such as acquisition conditions and patient characteristics. The impact of shifting patient characteristics such as age and sex on segmentation performance remains relatively under-studied, especially for abdominal organs, despite that this is crucial for ensuring the fairness of the segmentation model. We perform the first study to determine the impact of population shift with respect to age and sex on abdominal CT image segmentation, by leveraging two large public datasets, and introduce a novel metric to quantify the impact. We find that population shift is a challenge similar in magnitude to cross-dataset shift for abdominal organ segmentation, and that the effect is asymmetric and dataset-dependent. We conclude that dataset diversity in terms of known patient characteristics is not necessarily equivalent to dataset diversity in terms of image features. This implies that simple population matching to ensure good generalisation and fairness may be insufficient, and we recommend that fairness research should be directed towards better understanding and quantifying medical image dataset diversity in terms of performance-relevant characteristics such as organ morphology.

* This paper has been accepted for publication by the MICCAI 2024 Fairness of AI in Medical Imaging (FAIMI) Workshop

Via

Access Paper or Ask Questions

SharkTrack: an accurate, generalisable software for streamlining shark and ray underwater video analysis

Jul 30, 2024

Filippo Varini, Francesco Ferretti, Jeremy Jenrette, Joel H. Gayford, Mark E. Bond, Matthew J. Witt, Michael R. Heithaus, Sophie Wilday, Ben Glocker

Abstract:Elasmobranchs (sharks and rays) can be important components of marine ecosystems but are experiencing global population declines. Effective monitoring of these populations is essential to their protection. Baited Remote Underwater Video Stations (BRUVS) have been a key tool for monitoring, but require time-consuming manual analysis. To address these challenges, we developed SharkTrack, an AI-enhanced BRUVS analysis software. SharkTrack uses Convolutional Neural Networks and Multi-Object Tracking to detect and track elasmobranchs and provides an annotation pipeline to manually classify elasmobranch species and compute MaxN, the standard metric of relative abundance. We tested SharkTrack on BRUVS footage from locations unseen by the model during training. SharkTrack computed MaxN with 89% accuracy over 207 hours of footage. The semi-automatic SharkTrack pipeline required two minutes of manual classification per hour of video, a 97% reduction of manual BRUVS analysis time compared to traditional methods, estimated conservatively at one hour per hour of video. Furthermore, we demonstrate SharkTrack application across diverse marine ecosystems and elasmobranch species, an advancement compared to previous models, which were limited to specific species or locations. SharkTrack applications extend beyond BRUVS analysis, facilitating rapid annotation of unlabeled videos, aiding the development of further models to classify elasmobranch species. We provide public access to the software and an unprecedentedly diverse dataset, facilitating future research in an important area of marine conservation.

Via

Access Paper or Ask Questions