Abstract: Background: Determining an adequate sample size is essential for developing reliable and generalisable clinical prediction models, yet practical guidance on selecting appropriate methods remains limited. Existing analytical and simulation-based approaches often rely on restrictive assumptions and focus on mean-based criteria. We present and validate pmsims, an R package that uses Gaussian process surrogate modelling to provide a flexible and computationally efficient simulation-based framework for sample size determination across diverse prediction settings. Methods: We conducted a comprehensive simulation study with two aims. First, we compared the three search engines implemented in pmsims (a Gaussian process-based adaptive method, a deterministic bisection method, and a hybrid approach) across binary, continuous, and survival outcomes. Second, we benchmarked the best-performing pmsims engine against existing analytical (pmsampsize) and simulation-based (samplesizedev) methods, evaluating recommended sample sizes, computational time, and achieved performance on large independent validation datasets. Results: The Gaussian process-based method consistently produced the most stable sample size estimates, particularly in low-signal, high-dimensional settings. In benchmarking, pmsims achieved performance close to prespecified targets across all outcome types, matching simulation-based approaches and outperforming analytical methods in more challenging scenarios. Conclusions: pmsims provides an efficient and flexible framework for principled sample size planning in clinical prediction modelling, requiring fewer model evaluations than non-adaptive simulation approaches.
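The deterministic bisection engine mentioned above admits a compact sketch. Note that pmsims is an R package and the code below is not its implementation: `simulate_performance`, its constants, and the target metric are all illustrative assumptions. The sketch only shows the search logic the abstract names: estimate mean model performance at a candidate sample size by repeated simulation, then halve the bracket on the smallest adequate size.

```python
import random

def simulate_performance(n, n_sims=200, rng=None):
    # Hypothetical stand-in for simulating datasets of size n, fitting a
    # prediction model, and validating it. Here mean performance (e.g. a
    # c-statistic) is assumed to rise smoothly with n, plus noise.
    rng = rng or random.Random(0)
    return sum(0.80 - 0.5 / (1 + 0.01 * n) + rng.gauss(0, 0.005)
               for _ in range(n_sims)) / n_sims

def bisect_sample_size(target, lo=100, hi=20000, tol=50):
    # Deterministic bisection: shrink [lo, hi] until the bracket around
    # the minimum n achieving the target mean performance is within tol.
    while hi - lo > tol:
        mid = (lo + hi) // 2
        if simulate_performance(mid) >= target:
            hi = mid  # target met: the answer lies at or below mid
        else:
            lo = mid  # target missed: the answer lies above mid
    return hi
```

Bisection treats the performance curve as deterministic and monotone, which is why the abstract reports the Gaussian process surrogate as more stable in low-signal settings, where simulation noise can flip individual comparisons.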




Abstract: There is an ongoing effort to develop feature selection algorithms to improve interpretability, reduce computational resources, and minimize overfitting in predictive models. Neural networks stand out as architectures on which to build feature selection methods, and recently, neuron pruning and regrowth have emerged from the sparse neural network literature as promising new tools. We introduce RelChaNet, a novel and lightweight feature selection algorithm that uses neuron pruning and regrowth in the input layer of a dense neural network. For neuron pruning, a gradient sum metric measures the relative change induced in the network after a candidate feature enters it, while neurons are regrown at random. We also propose an extension that adapts the size of the input layer at runtime. Extensive experiments on nine different datasets show that our approach generally outperforms the current state-of-the-art methods, and in particular improves the average accuracy by 2% on the MNIST dataset. Our code is available at https://github.com/flxzimmer/relchanet.
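The prune-and-regrow step in the input layer can be sketched as follows. This is an illustrative assumption, not RelChaNet's actual code (which is at the linked repository): here each active input neuron carries an accumulated gradient-sum score standing in for the relative-change metric, the lowest-scoring neurons are pruned, and an equal number of currently inactive candidate features are regrown at random.

```python
import random

def prune_and_regrow(active, candidates, grad_sums, n_prune, rng):
    # Rank active input neurons by their accumulated gradient sums
    # (a simplified stand-in for the relative-change metric).
    ranked = sorted(active, key=lambda f: grad_sums[f])
    pruned = ranked[:n_prune]        # drop the lowest-scoring inputs
    survivors = ranked[n_prune:]
    # Random regrowth: activate n_prune features not currently in use.
    inactive = sorted(set(candidates) - set(active))
    regrown = rng.sample(inactive, n_prune)
    return survivors + regrown, pruned
```

Repeating this step during training lets the set of active input features drift toward informative ones, since uninformative features accumulate small gradient sums and are pruned on later rounds.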




Abstract: Sample size calculation is an essential step in most data-based disciplines. Large enough samples ensure representativeness of the population and determine the precision of estimates. This is true for most quantitative studies, including those that employ machine learning methods, such as natural language processing, where free text is used to generate predictions and classify instances of text. Within the healthcare domain, the lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies. This paper addresses the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain. Models trained on the MIMIC-III database of critical care records from Beth Israel Deaconess Medical Center were used to classify documents as having or not having Unspecified Essential Hypertension, the most common diagnosis code in the database. Simulations were performed using various classifiers on different sample sizes and class proportions. This was repeated for a comparatively less common diagnosis code in the database, diabetes mellitus without mention of complication. Smaller sample sizes yielded better performance with a K-nearest neighbours classifier, whereas larger sample sizes yielded better performance with support vector machines and BERT models. Overall, a sample size larger than 1000 was sufficient to provide adequate performance metrics. The simulations conducted within this study provide guidelines that can be used as recommendations for selecting appropriate sample sizes and class proportions, and for predicting expected performance, when building classifiers for textual healthcare data. The methodology used here can be adapted for sample size estimation with other datasets.
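The learning-curve simulation described above can be sketched on synthetic data. The toy two-feature "documents" and nearest-centroid classifier below are assumptions chosen purely to keep the sketch self-contained; the study itself used MIMIC-III clinical text with K-nearest neighbours, support vector machine, and BERT classifiers.

```python
import random

def make_corpus(n, pos_frac, rng):
    # Toy stand-in for labelled clinical notes: each "document" is a
    # pair of feature values whose class-conditional means differ
    # slightly (an assumption, chosen to make the task learnable).
    data = []
    for _ in range(n):
        label = 1 if rng.random() < pos_frac else 0
        mu = 1.0 if label else 0.0
        data.append(([rng.gauss(mu, 2.0), rng.gauss(-mu, 2.0)], label))
    return data

def centroid_accuracy(train, test):
    # Nearest-centroid classifier: predict the class whose mean
    # training feature vector is closest in squared Euclidean distance.
    cents = {}
    for c in (0, 1):
        xs = [x for x, y in train if y == c]
        cents[c] = [sum(col) / len(xs) for col in zip(*xs)]
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    hits = sum(1 for x, y in test
               if min(cents, key=lambda c: dist(x, cents[c])) == y)
    return hits / len(test)

# Learning curve: fixed held-out test set, growing training samples.
rng = random.Random(0)
test_set = make_corpus(2000, 0.5, rng)
curve = {n: centroid_accuracy(make_corpus(n, 0.5, rng), test_set)
         for n in (100, 500, 1000, 4000)}
```

Plotting or tabulating `curve` against the training sizes is the basic device behind such recommendations: performance is simulated at each candidate size, and the smallest size reaching an acceptable plateau is reported.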