Mélanie Roschewitz

Counterfactual Stress Testing for Image Classification Models

May 11, 2026

Moritz Stammel, Fabio De Sousa Ribeiro, Raghav Mehta, Mélanie Roschewitz, Ben Glocker

Abstract: Deep learning models in medical imaging often fail when deployed in new clinical environments due to distribution shifts in demographics, scanner hardware, or acquisition protocols. A central challenge is underspecification, where models with similar validation performance exhibit divergent real-world failure modes. Although stress testing has emerged as a tool to assess this, current methods typically rely on simple, uninformed perturbations (e.g., brightness or contrast changes), which fail to capture clinically realistic variation and can overestimate robustness. In this work, we introduce a counterfactual stress testing framework based on causal generative models that create realistic "what if" images by intervening on attributes such as scanner type and patient sex while preserving anatomical identity, enabling controlled and semantically meaningful evaluation under targeted distribution shifts. Across two imaging modalities (chest X-ray and mammography), three model architectures, and multiple shift scenarios, we show that counterfactual stress tests provide a substantially more accurate proxy for real out-of-distribution performance than classical perturbations, capturing the direction and relative magnitude of performance changes as well as model ranking. These results suggest that causal generative models can serve as practical simulators for robustness assessment, offering a more reliable basis for evaluating medical AI systems prior to deployment.

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

Apr 16, 2026

Mélanie Roschewitz, Kenneth Styppa, Yitian Tao, Jiwoong Sohn, Jean-Benoit Delbrouck, Benjamin Gundersen, Nicolas Deperrois, Christian Bluethgen, Julia Vogt, Bjoern Menze(+3 more)

Abstract: Vision-language models (VLMs) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 6.0 points (36.4% relative) in macro-F1 and 5.4 points (19.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented and iterative reasoning trace, RadAgent brings us closer to transparent and reliable AI for radiology.
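
The abstract reports each gain both in absolute points and as a relative percentage. As a rough reading aid, the two figures together imply an approximate baseline score, since relative gain = absolute gain / baseline. The sketch below only inverts that relation; the function name `implied_scores` is made up for illustration, and the resulting values are implied by rounded figures rather than reported in the abstract.

```python
def implied_scores(abs_gain_points, rel_gain_percent):
    """Back out (baseline, improved) scores from a reported gain.

    Uses rel_gain = abs_gain / baseline. The inputs are rounded in the
    abstract, so the outputs are approximate, not reported results.
    """
    baseline = abs_gain_points / (rel_gain_percent / 100.0)
    return baseline, baseline + abs_gain_points


# (absolute points, relative %) pairs as stated in the abstract.
reported = {
    "macro-F1": (6.0, 36.4),
    "micro-F1": (5.4, 19.6),
    "robustness": (24.7, 41.9),
}

for name, (points, percent) in reported.items():
    base, improved = implied_scores(points, percent)
    print(f"{name}: implied baseline ~{base:.1f}, implied RadAgent ~{improved:.1f}")
```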

Pixel-level Counterfactual Contrastive Learning for Medical Image Segmentation

Mar 17, 2026

Marceau Lafargue-Hauret, Raghav Mehta, Fabio De Sousa Ribeiro, Mélanie Roschewitz, Ben Glocker

Abstract: Image segmentation relies on large annotated datasets, which are expensive and slow to produce. Silver-standard (AI-generated) labels are easier to obtain, but they risk introducing bias. Self-supervised learning, needing only images, has become key for pre-training. Recent work combining contrastive learning with counterfactual generation improves representation learning for classification but does not readily extend to pixel-level tasks. We propose a pipeline combining counterfactual generation with dense contrastive learning via Dual-View (DVD-CL) and Multi-View (MVD-CL) methods, along with supervised variants that utilize available silver-standard annotations. A new visualisation algorithm, the Color-coded High Resolution Overlay map (CHRO-map), is also introduced. Experiments show annotation-free DVD-CL outperforms other dense contrastive learning methods, while supervised variants using silver-standard labels outperform training on the silver-standard labeled data directly, achieving ~94% DSC on challenging data. These results highlight that pixel-level contrastive learning, enhanced by counterfactuals and silver-standard annotations, improves robustness to acquisition and pathological variations.

* Accepted at ISBI-2026 (oral presentation)

Exploring the interplay of label bias with subgroup size and separability: A case study in mammographic density classification

Jul 24, 2025

Emma A. M. Stanley, Raghav Mehta, Mélanie Roschewitz, Nils D. Forkert, Ben Glocker

Abstract: Systematic mislabelling affecting specific subgroups (i.e., label bias) in medical imaging datasets represents an understudied issue concerning the fairness of medical AI systems. In this work, we investigated how size and separability of subgroups affected by label bias influence the learned features and performance of a deep learning model. To this end, we trained deep learning models for binary tissue density classification using the EMory BrEast imaging Dataset (EMBED), where label bias affected separable subgroups (based on imaging manufacturer) or non-separable "pseudo-subgroups". We found that simulated subgroup label bias led to prominent shifts in the learned feature representations of the models. Importantly, these shifts within the feature space were dependent on both the relative size and the separability of the subgroup affected by label bias. We also observed notable differences in subgroup performance depending on whether a validation set with clean labels was used to define the classification threshold for the model. For instance, with label bias affecting the majority separable subgroup, the true positive rate for that subgroup fell from 0.898, when the validation set had clean labels, to 0.518, when the validation set had biased labels. Our work represents a key contribution toward understanding the consequences of label bias on subgroup fairness in medical imaging AI.

* Accepted at MICCAI Workshop on Fairness of AI in Medical Imaging (FAIMI) 2025

Where are we with calibration under dataset shift in image classification?

Jul 10, 2025

Mélanie Roschewitz, Raghav Mehta, Fabio De Sousa Ribeiro, Ben Glocker

Abstract: We conduct an extensive study on the state of calibration under real-world dataset shift for image classification. Our work provides important insights on the choice of post-hoc and in-training calibration techniques, and yields practical guidelines for all practitioners interested in robust calibration under shift. We compare various post-hoc calibration methods, and their interactions with common in-training calibration strategies (e.g., label smoothing), across a wide range of natural shifts, on eight different classification tasks across several imaging domains. We find that: (i) simultaneously applying entropy regularisation and label smoothing yields the best calibrated raw probabilities under dataset shift, (ii) post-hoc calibrators exposed to a small amount of semantic out-of-distribution data (unrelated to the task) are most robust under shift, (iii) recent calibration methods specifically aimed at increasing calibration under shifts do not necessarily offer significant improvements over simpler post-hoc calibration methods, (iv) improving calibration under shifts often comes at the cost of worsening in-distribution calibration. Importantly, these findings hold for randomly initialised classifiers, as well as for those finetuned from foundation models, the latter being consistently better calibrated compared to models trained from scratch. Finally, we conduct an in-depth analysis of ensembling effects, finding that (i) applying calibration prior to ensembling (instead of after) is more effective for calibration under shifts, (ii) for ensembles, OOD exposure deteriorates the ID-shifted calibration trade-off, (iii) ensembling remains one of the most effective methods to improve calibration robustness and, combined with finetuning from foundation models, yields best calibration results overall.

* Code available at https://github.com/biomedia-mira/calibration_under_shifts
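
For readers unfamiliar with post-hoc calibration, the following is a minimal sketch of one common instance: temperature scaling, evaluated with the expected calibration error (ECE). It is not taken from the linked repository and does not reproduce the paper's protocol; the function names, the equal-width binning, and the grid search over temperatures are all illustrative assumptions.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a scalar temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins (one of several ECE variants)."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # Bin weight times |accuracy - mean confidence| in the bin.
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature minimising NLL on held-out calibration data.

    A grid search is used here for simplicity; gradient-based fitting is
    an equally valid choice.
    """
    if grid is None:
        grid = np.linspace(0.5, 5.0, 91)
    rows = np.arange(len(labels))

    def nll(t):
        p = softmax(logits, t)
        return -np.log(p[rows, labels] + 1e-12).mean()

    return float(min(grid, key=nll))
```

A typical use would fit the temperature on an in-distribution validation set and then compare ECE before and after scaling on a shifted test set, which is where the abstract's findings suggest the gains may be smaller.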

Automatic dataset shift identification to support root cause analysis of AI performance drift

Nov 13, 2024

Mélanie Roschewitz, Raghav Mehta, Charles Jones, Ben Glocker

Abstract: Shifts in data distribution can substantially harm the performance of clinical AI models. Hence, various methods have been developed to detect the presence of such shifts at deployment time. However, root causes of dataset shifts are varied, and the choice of shift mitigation strategies is highly dependent on the precise type of shift encountered at test time. As such, detecting test-time dataset shift is not sufficient: precisely identifying which type of shift has occurred is critical. In this work, we propose the first unsupervised dataset shift identification framework, effectively distinguishing between prevalence shift (caused by a change in the label distribution), covariate shift (caused by a change in input characteristics) and mixed shifts (simultaneous prevalence and covariate shifts). We discuss the importance of self-supervised encoders for detecting subtle covariate shifts and propose a novel shift detector leveraging both self-supervised encoders and task model outputs for improved shift detection. We report promising results for the proposed shift identification framework across three different imaging modalities (chest radiography, digital mammography, and retinal fundus images) on five types of real-world dataset shifts, using four large publicly available datasets.

* Code available at https://github.com/biomedia-mira/shift_identification
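
To make the definition of prevalence shift concrete, here is a small sketch that builds a test set with a chosen positive rate by resampling within each class, so the label distribution changes while the per-class image pool stays the same. It is not the authors' identification framework and is not taken from the linked repository; the helper name `resample_to_prevalence` and its parameters are assumptions for illustration, and binary 0/1 labels are assumed.

```python
import random


def resample_to_prevalence(labels, target_prevalence, n_samples, seed=0):
    """Return sample indices whose positive rate matches target_prevalence.

    Indices are drawn with replacement within each class, which changes
    p(y) but leaves the pool of images per class untouched.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present in labels")

    n_pos = round(target_prevalence * n_samples)
    indices = rng.choices(pos, k=n_pos) + rng.choices(neg, k=n_samples - n_pos)
    rng.shuffle(indices)  # avoid a class-sorted ordering
    return indices


# Example: a source set with 10% positives resampled to 40% positives.
source_labels = [1] * 100 + [0] * 900
shifted = resample_to_prevalence(source_labels, target_prevalence=0.4, n_samples=500)
shifted_rate = sum(source_labels[i] for i in shifted) / len(shifted)
```

A covariate shift would instead keep the label proportions fixed and change the inputs, for example by drawing images from a different scanner.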

Rethinking Fair Representation Learning for Performance-Sensitive Tasks

Oct 05, 2024

Charles Jones, Fabio De Sousa Ribeiro, Mélanie Roschewitz, Daniel C. Castro, Ben Glocker

Abstract: We investigate the prominent class of fair representation learning methods for bias mitigation. Using causal reasoning to define and formalise different sources of dataset bias, we reveal important implicit assumptions inherent to these methods. We prove fundamental limitations on fair representation learning when evaluation data is drawn from the same distribution as training data and run experiments across a range of medical modalities to examine the performance of fair representation learning under distribution shifts. Our results explain apparent contradictions in the existing literature and reveal how rarely considered causal and statistical aspects of the underlying data affect the validity of fair representation learning. We raise doubts about current evaluation practices and the applicability of fair representation learning methods in performance-sensitive settings. We argue that fine-grained analysis of dataset biases should play a key role in the field moving forward.

Radio-opaque artefacts in digital mammography: automatic detection and analysis of downstream effects

Oct 04, 2024

Amelia Schueppert, Ben Glocker, Mélanie Roschewitz

Abstract: This study investigates the effects of radio-opaque artefacts, such as skin markers, breast implants, and pacemakers, on mammography classification models. After manually annotating 22,012 mammograms from the publicly available EMBED dataset, a robust multi-label artefact detector was developed to identify five distinct artefact types (circular and triangular skin markers, breast implants, support devices and spot compression structures). Subsequent experiments on two clinically relevant tasks (breast density assessment and cancer screening) revealed that these artefacts can significantly affect model performance, alter classification thresholds, and distort output distributions. These findings underscore the importance of accurate automatic artefact detection for developing reliable and robust classification models in digital mammography. To facilitate future research, our annotations, code, and model predictions are made publicly available.

* Code available at https://github.com/biomedia-mira/mammo-artifacts

Robust image representations with counterfactual contrastive learning

Sep 16, 2024

Mélanie Roschewitz, Fabio De Sousa Ribeiro, Tian Xia, Galvin Khara, Ben Glocker

Abstract: Contrastive pretraining can substantially increase model generalisation and downstream performance. However, the quality of the learned representations is highly dependent on the data augmentation strategy applied to generate positive pairs. Positive contrastive pairs should preserve semantic meaning while discarding unwanted variations related to the data acquisition domain. Traditional contrastive pipelines attempt to simulate domain shifts through pre-defined generic image transformations. However, these do not always mimic realistic and relevant domain variations for medical imaging such as scanner differences. To tackle this issue, we herein introduce counterfactual contrastive learning, a novel framework leveraging recent advances in causal image synthesis to create contrastive positive pairs that faithfully capture relevant domain variations. Our method, evaluated across five datasets encompassing both chest radiography and mammography data, for two established contrastive objectives (SimCLR and DINO-v2), outperforms standard contrastive learning in terms of robustness to acquisition shift. Notably, counterfactual contrastive learning achieves superior downstream performance on both in-distribution and external datasets, especially for images acquired with scanners under-represented in the training set. Further experiments show that the proposed framework extends beyond acquisition shifts, with models trained with counterfactual contrastive learning substantially improving subgroup performance across biological sex.

* Code available at https://github.com/biomedia-mira/counterfactual-contrastive/
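
The core idea of pairing each image with a counterfactual version of itself can be illustrated with a simplified InfoNCE loss. The sketch below is not the code from the linked repository: it assumes that embeddings of real images and of their domain counterfactuals are already available as arrays, uses a one-directional loss with only cross-view negatives (simpler than the full SimCLR objective), and the name `info_nce` and the temperature value are illustrative choices.

```python
import numpy as np


def info_nce(z_real, z_counterfactual, temperature=0.1):
    """One-directional InfoNCE over a batch of embedding pairs.

    Row i of z_real and row i of z_counterfactual form the positive pair
    (same anatomy, different simulated domain); every other row of
    z_counterfactual serves as a negative for row i.
    """
    a = z_real / np.linalg.norm(z_real, axis=1, keepdims=True)
    b = z_counterfactual / np.linalg.norm(z_counterfactual, axis=1, keepdims=True)

    logits = a @ b.T / temperature          # (N, N) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positives sit on the diagonal; average their negative log-likelihood.
    return float(-np.mean(np.diag(log_prob)))
```

Under this objective, an encoder is rewarded for mapping an image and its counterfactual close together, which is one way to encourage invariance to the intervened attribute, such as scanner type.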

Mitigating attribute amplification in counterfactual image generation

Mar 14, 2024

Tian Xia, Mélanie Roschewitz, Fabio De Sousa Ribeiro, Charles Jones, Ben Glocker

Abstract: Causal generative modelling is gaining interest in medical imaging due to its ability to answer interventional and counterfactual queries. Most work focuses on generating counterfactual images that look plausible, using auxiliary classifiers to enforce effectiveness of simulated interventions. We investigate pitfalls in this approach, discovering the issue of attribute amplification, where unrelated attributes are spuriously affected during interventions, leading to biases across protected characteristics and disease status. We show that attribute amplification is caused by the use of hard labels in the counterfactual training process and propose soft counterfactual fine-tuning to mitigate this issue. Our method substantially reduces the amplification effect while maintaining effectiveness of generated images, demonstrated on a large chest X-ray dataset. Our work makes an important advancement towards more faithful and unbiased causal modelling in medical imaging.
