Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Nils Strodthoff

Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study

May 12, 2026

M A Al-Masud, Nils Strodthoff

Abstract:Specialized foundation models are beginning to emerge in various medical subdomains, but pretraining methodologies and parametric scaling with the size of the pretraining dataset are rarely assessed systematically and in a like-for-like manner. This work focuses on foundation models for electrocardiography (ECG) data, one of the most widely captured physiological time series world-wide. We present a comprehensive assessment of pretraining methodologies, covering five different contrastive and non-contrastive self-supervised learning objectives for ECG foundation models, and investigate their scaling behavior with pretraining dataset sizes up to 11M input samples, exclusively from publicly available sources. Pretraining strategy has a meaningful and consistent impact on downstream performance, with contrastive predictive coding (slightly ahead of JEPA) yielding the most transferable representations across diverse clinical tasks. Scaling pretraining data continues to yield meaningful improvements up to 11M samples for most objectives. We also compare model architectures across all pretraining methodologies and find evidence for a clear superiority of structured state space models compared to transformers and CNN models. We hypothesize that the strong inductive biases of structured state space models, rather than pretraining scale alone, are the primary driver of effective ECG representation learning, with important implications for future foundation model development in this and potentially other physiological signal domains.

* 59 pages, 16 figures, 59 Tables. Code available at https://anonymous.4open.science/r/ecg-pretraining-strategies-4DE3

Via

Access Paper or Ask Questions

Benchmark Problems and Benchmark Datasets for the evaluation of Machine and Deep Learning methods on Photoplethysmography signals: the D4 report from the QUMPHY project

Apr 01, 2026

Urs Hackstein, Jordi Alastruey, Philip Aston, Ciaran Bench, Peter H. Charlton, Loic Coquelin, Nando Hegemann, Vaidotas Marozas, Mohammad Moulaeifard, Manasi Nandi(+6 more)

Abstract:This report is part of the Qumphy project (22HLT01 Qumphy) that is funded by the European Union and is dedicated to the development of measures to quantify the uncertainties associated with Machine Learning algorithms applied to medical problems, in particular the analysis and processing of Photoplethysmography (PPG) signals. In this report, a list of six medical problems that are related to PPG signals and serve as Benchmark Problems is given. Suitable Benchmark datasets and their usage are described also.

* 28 pages

Via

Access Paper or Ask Questions

Deriving Health Metrics from the Photoplethysmogram: Benchmarks and Insights from MIMIC-III-Ext-PPG

Mar 23, 2026

Mohammad Moulaeifard, Philip J. Aston, Peter H. Charlton, Nils Strodthoff

Abstract:Photoplethysmography (PPG) is one of the most widely captured biosignals for clinical prediction tasks, yet PPG-based algorithms are typically trained on small-scale datasets of uncertain quality, which hinders meaningful algorithm comparisons. We present a comprehensive benchmark for PPG-based clinical prediction using the \dbname~dataset, establishing baselines across the full spectrum of clinically relevant applications: multi-class heart rhythm classification, and regression of physiological parameters including respiratory rate (RR), heart rate (HR), and blood pressure (BP). Most notably, we provide the first comprehensive assessment of PPG for general arrhythmia detection beyond atrial fibrillation (AF) and atrial flutter (AFLT), with performance stratified by BP, HR, and demographic subgroups. Using established deep learning architectures, we achieved strong performance for AF detection (AUROC = 0.96) and accurate physiological parameter estimation (RR MAE: 2.97 bpm; HR MAE: 1.13 bpm; SBP/DBP MAE: 16.13/8.70 mmHg). Cross-dataset validation demonstrates excellent generalizability for AF detection (AUROC = 0.97), while clinical subgroup analysis reveals marked performance differences across subgroups by BP, HR, and demographic strata. These variations appear to reflect population-specific waveform differences rather than systematic bias in model behavior. This framework establishes the first integrated benchmark for multi-task PPG-based clinical prediction, demonstrating that PPG signals can effectively support multiple simultaneous monitoring tasks and providing essential baselines for future algorithm development.

* 22 pages, 1 figure

Via

Access Paper or Ask Questions

ProtoMask: Segmentation-Guided Prototype Learning

Oct 01, 2025

Steffen Meinert, Philipp Schlinge, Nils Strodthoff, Martin Atzmueller

Figure 1 for ProtoMask: Segmentation-Guided Prototype Learning

Figure 2 for ProtoMask: Segmentation-Guided Prototype Learning

Figure 3 for ProtoMask: Segmentation-Guided Prototype Learning

Figure 4 for ProtoMask: Segmentation-Guided Prototype Learning

Abstract:XAI gained considerable importance in recent years. Methods based on prototypical case-based reasoning have shown a promising improvement in explainability. However, these methods typically rely on additional post-hoc saliency techniques to explain the semantics of learned prototypes. Multiple critiques have been raised about the reliability and quality of such techniques. For this reason, we study the use of prominent image segmentation foundation models to improve the truthfulness of the mapping between embedding and input space. We aim to restrict the computation area of the saliency map to a predefined semantic image patch to reduce the uncertainty of such visualizations. To perceive the information of an entire image, we use the bounding box from each generated segmentation mask to crop the image. Each mask results in an individual input in our novel model architecture named ProtoMask. We conduct experiments on three popular fine-grained classification datasets with a wide set of metrics, providing a detailed overview on explainability characteristics. The comparison with other popular models demonstrates competitive performance and unique explainability features of our model. https://github.com/uos-sis/quanproto

Via

Access Paper or Ask Questions

FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models

May 27, 2025

Nils Neukirch, Johanna Vielhaben, Nils Strodthoff

Figure 1 for FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models

Figure 2 for FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models

Figure 3 for FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models

Figure 4 for FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models

Abstract:Internal representations are crucial for understanding deep neural networks, such as their properties and reasoning patterns, but remain difficult to interpret. While mapping from feature space to input space aids in interpreting the former, existing approaches often rely on crude approximations. We propose using a conditional diffusion model - a pretrained high-fidelity diffusion model conditioned on spatially resolved feature maps - to learn such a mapping in a probabilistic manner. We demonstrate the feasibility of this approach across various pretrained image classifiers from CNNs to ViTs, showing excellent reconstruction capabilities. Through qualitative comparisons and robustness analysis, we validate our method and showcase possible applications, such as the visualization of concept steering in input space or investigations of the composite nature of the feature space. This approach has broad potential for improving feature space understanding in computer vision models.

* 15 pages, 10 figures, code is available at https://github.com/AI4HealthUOL/FeatInv

Via

Access Paper or Ask Questions

Uncertainty quantification with approximate variational learning for wearable photoplethysmography prediction tasks

May 16, 2025

Ciaran Bench, Vivek Desai, Mohammad Moulaeifard, Nils Strodthoff, Philip Aston, Andrew Thompson

Abstract:Photoplethysmography (PPG) signals encode information about relative changes in blood volume that can be used to assess various aspects of cardiac health non-invasively, e.g.\ to detect atrial fibrillation (AF) or predict blood pressure (BP). Deep networks are well-equipped to handle the large quantities of data acquired from wearable measurement devices. However, they lack interpretability and are prone to overfitting, leaving considerable risk for poor performance on unseen data and misdiagnosis. Here, we describe the use of two scalable uncertainty quantification techniques: Monte Carlo Dropout and the recently proposed Improved Variational Online Newton. These techniques are used to assess the trustworthiness of models trained to perform AF classification and BP regression from raw PPG time series. We find that the choice of hyperparameters has a considerable effect on the predictive performance of the models and on the quality and composition of predicted uncertainties. E.g. the stochasticity of the model parameter sampling determines the proportion of the total uncertainty that is aleatoric, and has varying effects on predictive performance and calibration quality dependent on the chosen uncertainty quantification technique and the chosen expression of uncertainty. We find significant discrepancy in the quality of uncertainties over the predicted classes, emphasising the need for a thorough evaluation protocol that assesses local and adaptive calibration. This work suggests that the choice of hyperparameters must be carefully tuned to balance predictive performance and calibration quality, and that the optimal parameterisation may vary depending on the chosen expression of uncertainty.

Via

Access Paper or Ask Questions

Machine-learning for photoplethysmography analysis: Benchmarking feature, image, and signal-based approaches

Feb 27, 2025

Mohammad Moulaeifard, Loic Coquelin, Mantas Rinkevičius, Andrius Sološenko, Oskar Pfeffer, Ciaran Bench, Nando Hegemann, Sara Vardanega, Manasi Nandi, Jordi Alastruey(+6 more)

Figure 1 for Machine-learning for photoplethysmography analysis: Benchmarking feature, image, and signal-based approaches

Figure 2 for Machine-learning for photoplethysmography analysis: Benchmarking feature, image, and signal-based approaches

Figure 3 for Machine-learning for photoplethysmography analysis: Benchmarking feature, image, and signal-based approaches

Figure 4 for Machine-learning for photoplethysmography analysis: Benchmarking feature, image, and signal-based approaches

Abstract:Photoplethysmography (PPG) is a widely used non-invasive physiological sensing technique, suitable for various clinical applications. Such clinical applications are increasingly supported by machine learning methods, raising the question of the most appropriate input representation and model choice. Comprehensive comparisons, in particular across different input representations, are scarce. We address this gap in the research landscape by a comprehensive benchmarking study covering three kinds of input representations, interpretable features, image representations and raw waveforms, across prototypical regression and classification use cases: blood pressure and atrial fibrillation prediction. In both cases, the best results are achieved by deep neural networks operating on raw time series as input representations. Within this model class, best results are achieved by modern convolutional neural networks (CNNs). but depending on the task setup, shallow CNNs are often also very competitive. We envision that these results will be insightful for researchers to guide their choice on machine learning tasks for PPG data, even beyond the use cases presented in this work.

* 39 pages, 9 figures, code available at https://gitlab.com/qumphy/d1-code

Via

Access Paper or Ask Questions

Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study

Feb 26, 2025

Mohammad Moulaeifard, Peter H. Charlton, Nils Strodthoff

Abstract:Photoplethysmography (PPG)-based blood pressure (BP) estimation represents a promising alternative to cuff-based BP measurements. Recently, an increasing number of deep learning models have been proposed to infer BP from the raw PPG waveform. However, these models have been predominantly evaluated on in-distribution test sets, which immediately raises the question of the generalizability of these models to external datasets. To investigate this question, we trained five deep learning models on the recently released PulseDB dataset, provided in-distribution benchmarking results on this dataset, and then assessed out-of-distribution performance on several external datasets. The best model (XResNet1d101) achieved in-distribution MAEs of 9.4 and 6.0 mmHg for systolic and diastolic BP respectively on PulseDB (with subject-specific calibration), and 14.0 and 8.5 mmHg respectively without calibration. Equivalent MAEs on external test datasets without calibration ranged from 15.0 to 25.1 mmHg (SBP) and 7.0 to 10.4 mmHg (DBP). Our results indicate that the performance is strongly influenced by the differences in BP distributions between datasets. We investigated a simple way of improving performance through sample-based domain adaptation and put forward recommendations for training models with good generalization properties. With this work, we hope to educate more researchers for the importance and challenges of out-of-distribution generalization.

* 20 pages, 5 figures, code available at https://github.com/AI4HealthUOL/ppg-ood-generalization

Via

Access Paper or Ask Questions

Benchmarking machine learning for bowel sound pattern classification from tabular features to pretrained models

Feb 21, 2025

Zahra Mansour, Verena Uslar, Dirk Weyhe, Danilo Hollosi, Nils Strodthoff

Abstract:The development of electronic stethoscopes and wearable recording sensors opened the door to the automated analysis of bowel sound (BS) signals. This enables a data-driven analysis of bowel sound patterns, their interrelations, and their correlation to different pathologies. This work leverages a BS dataset collected from 16 healthy subjects that was annotated according to four established BS patterns. This dataset is used to evaluate the performance of machine learning models to detect and/or classify BS patterns. The selection of considered models covers models using tabular features, convolutional neural networks based on spectrograms and models pre-trained on large audio datasets. The results highlight the clear superiority of pre-trained models, particularly in detecting classes with few samples, achieving an AUC of 0.89 in distinguishing BS from non-BS using a HuBERT model and an AUC of 0.89 in differentiating bowel sound patterns using a Wav2Vec 2.0 model. These results pave the way for an improved understanding of bowel sounds in general and future machine-learning-driven diagnostic applications for gastrointestinal examinations

* 9 pages, 6 figures and 1 table

Via

Access Paper or Ask Questions

Explainable and externally validated machine learning for neuropsychiatric diagnosis via electrocardiograms

Feb 07, 2025

Juan Miguel Lopez Alcaraz, Ebenezer Oloyede, David Taylor, Wilhelm Haverkamp, Nils Strodthoff

Figure 1 for Explainable and externally validated machine learning for neuropsychiatric diagnosis via electrocardiograms

Figure 2 for Explainable and externally validated machine learning for neuropsychiatric diagnosis via electrocardiograms

Figure 3 for Explainable and externally validated machine learning for neuropsychiatric diagnosis via electrocardiograms

Figure 4 for Explainable and externally validated machine learning for neuropsychiatric diagnosis via electrocardiograms

Abstract:Electrocardiogram (ECG) analysis has emerged as a promising tool for identifying physiological changes associated with neuropsychiatric conditions. The relationship between cardiovascular health and neuropsychiatric disorders suggests that ECG abnormalities could serve as valuable biomarkers for more efficient detection, therapy monitoring, and risk stratification. However, the potential of the ECG to accurately distinguish neuropsychiatric conditions, particularly among diverse patient populations, remains underexplored. This study utilized ECG markers and basic demographic data to predict neuropsychiatric conditions using machine learning models, with targets defined through ICD-10 codes. Both internal and external validation were performed using the MIMIC-IV and ECG-View datasets respectively. Performance was assessed using AUROC scores. To enhance model interpretability, Shapley values were applied to provide insights into the contributions of individual ECG features to the predictions. Significant predictive performance was observed for conditions within the neurological and psychiatric groups. For the neurological group, Alzheimer's disease (G30) achieved an internal AUROC of 0.813 (0.812-0.814) and an external AUROC of 0.868 (0.867-0.868). In the psychiatric group, unspecified dementia (F03) showed an internal AUROC of 0.849 (0.848-0.849) and an external AUROC of 0.862 (0.861-0.863). Discriminative features align with known ECG markers but also provide hints on potentially new markers. ECG offers significant promise for diagnosing and monitoring neuropsychiatric conditions, with robust predictive performance across internal and external cohorts. Future work should focus on addressing potential confounders, such as therapy-related cardiotoxicity, and expanding the scope of ECG applications, including personalized care and early intervention strategies.

* 9 pages, 2 figures, source code under https://github.com/AI4HealthUOL/CardioDiag

Via

Access Paper or Ask Questions