Anisoara Ionescu

FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

Apr 24, 2026

Hojjat Karami, David Atienza, Jean-Philippe Thiran, Anisoara Ionescu

Abstract: Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present FeatEHR-LLM, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at github.com/hojjatkarami/FeatEHR-LLM.
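
As a rough illustration of what feature-extraction code for irregular data might look like, the sketch below summarises one irregularly sampled variable into tabular features. It is not taken from the FeatEHR-LLM repository: the function name `irregular_features`, the feature names, and the hold-last-value weighting are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

Observation = Tuple[float, float]  # (hours since admission, measured value)


def irregular_features(
    obs: List[Observation], window_end: float
) -> Dict[str, Optional[float]]:
    """Summarise one irregularly sampled variable up to window_end (hours).

    Hypothetical helper for illustration only.
    """
    seen = sorted((t, v) for t, v in obs if t <= window_end)
    if not seen:
        # A variable that was never measured is reported explicitly,
        # because the absence of a test can itself be informative.
        return {
            "count": 0.0,
            "last_value": None,
            "hours_since_last": None,
            "time_weighted_mean": None,
        }
    times = [t for t, _ in seen]
    values = [v for _, v in seen]
    # Assumption: each value holds until the next observation (or window_end),
    # so densely sampled periods do not dominate the average.
    durations = [nxt - cur for cur, nxt in zip(times, times[1:] + [window_end])]
    total = sum(durations)
    if total > 0:
        weighted_mean = sum(v * d for v, d in zip(values, durations)) / total
    else:
        weighted_mean = values[-1]
    return {
        "count": float(len(seen)),
        "last_value": values[-1],
        "hours_since_last": window_end - times[-1],
        "time_weighted_mean": weighted_mean,
    }


features = irregular_features(
    [(0.0, 1.0), (2.0, 3.0), (3.0, 5.0), (10.0, 9.0)], window_end=4.0
)
```

The observation count and the time since the last measurement stand in for the "informative sparsity" mentioned in the abstract; a real pipeline would likely compute many such features per variable and validate them against the prediction task.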

SynEHRgy: Synthesizing Mixed-Type Structured Electronic Health Records using Decoder-Only Transformers

Nov 20, 2024

Hojjat Karami, David Atienza, Anisoara Ionescu

Abstract: Generating synthetic Electronic Health Records (EHRs) offers significant potential for data augmentation, privacy-preserving data sharing, and improving machine learning model training. We propose a novel tokenization strategy tailored for structured EHR data, which encompasses diverse data types such as covariates, ICD codes, and irregularly sampled time series. Using a GPT-like decoder-only transformer model, we demonstrate the generation of high-quality synthetic EHRs. Our approach is evaluated using the MIMIC-III dataset, and we benchmark the fidelity, utility, and privacy of the generated data against state-of-the-art models.
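
The abstract does not spell out the tokenization scheme, so the following is only a toy sketch of how mixed-type records could be flattened into one token sequence for a decoder-only model. The token formats (`COV_`, `ICD_`, `GAP_`, `_BIN`), the bin edges, and the function names are all hypothetical and should not be read as the SynEHRgy design.

```python
from typing import Dict, List, Tuple


def bin_index(value: float, edges: List[float]) -> int:
    """Return how many bin edges the value meets or exceeds (0..len(edges))."""
    return sum(value >= e for e in edges)


def tokenize_patient(
    covariates: Dict[str, str],
    icd_codes: List[str],
    events: List[Tuple[float, str, float]],  # (hours, variable name, value)
    value_edges: Dict[str, List[float]],
    gap_edges: List[float],
) -> List[str]:
    """Flatten one patient record into a token list (illustrative scheme)."""
    tokens = ["<BOS>"]
    for name in sorted(covariates):
        tokens.append(f"COV_{name}={covariates[name]}")
    tokens.extend(f"ICD_{code}" for code in icd_codes)
    previous_time = 0.0
    for time, variable, value in sorted(events):
        gap = time - previous_time
        if gap > 0:
            # Irregular sampling is encoded as a discretised time-gap token.
            tokens.append(f"GAP_{bin_index(gap, gap_edges)}")
        tokens.append(f"{variable}_BIN{bin_index(value, value_edges[variable])}")
        previous_time = time
    tokens.append("<EOS>")
    return tokens


tokens = tokenize_patient(
    covariates={"sex": "F"},
    icd_codes=["4019"],
    events=[(1.0, "HR", 72.0), (1.0, "SBP", 95.0), (5.0, "HR", 130.0)],
    value_edges={"HR": [60.0, 100.0], "SBP": [90.0, 140.0]},
    gap_edges=[1.0, 4.0],
)
```

Discretising both values and time gaps keeps the vocabulary finite, which is what lets a GPT-like model treat covariates, codes, and measurements uniformly as next-token prediction.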

TEE4EHR: Transformer Event Encoder for Better Representation Learning in Electronic Health Records

Feb 09, 2024

Hojjat Karami, David Atienza, Anisoara Ionescu

Abstract: Irregular sampling of time series in electronic health records (EHRs) is one of the main challenges for developing machine learning models. Additionally, the pattern of missing data in certain clinical variables is not random but depends on the decisions of clinicians and the state of the patient. Point processes are a mathematical framework for analyzing event sequence data that is consistent with irregular sampling patterns. Our model, TEE4EHR, is a transformer event encoder (TEE) with a point process loss that encodes the pattern of laboratory tests in EHRs. We investigate the utility of our TEE on a variety of benchmark event sequence datasets. Additionally, we conduct experiments on two real-world EHR databases to provide a more comprehensive evaluation of our model. First, in a self-supervised learning approach, the TEE is jointly learned with an existing attention-based deep neural network, which gives superior performance in negative log-likelihood and future event prediction. We also propose an algorithm for aggregating attention weights that can reveal the interaction between the events. Second, we transfer and freeze the learned TEE for the downstream outcome prediction task, where it outperforms state-of-the-art models for handling irregularly sampled time series. Our results demonstrate that our approach can improve representation learning in EHRs and can be useful for clinical prediction tasks.
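
To make the idea of a point process loss concrete, the sketch below computes the negative log-likelihood of an event sequence under a classical Hawkes process with an exponential kernel. This is a textbook stand-in chosen for illustration, not the intensity parameterisation used by TEE4EHR, which is learned by a transformer; the function name and parameters `mu`, `alpha`, `beta` are assumptions of this example.

```python
import math
from typing import List


def hawkes_nll(
    event_times: List[float], horizon: float, mu: float, alpha: float, beta: float
) -> float:
    """Negative log-likelihood of sorted event times observed on [0, horizon].

    Intensity: lambda(t) = mu + alpha * sum_{s < t} exp(-beta * (t - s)),
    with mu > 0, alpha >= 0 and beta > 0.
    """
    log_intensity_sum = 0.0
    for i, t in enumerate(event_times):
        # Each past event raises the current intensity, decaying over time.
        excitation = sum(math.exp(-beta * (t - s)) for s in event_times[:i])
        log_intensity_sum += math.log(mu + alpha * excitation)
    # Compensator: the integral of the intensity over the observation window.
    compensator = mu * horizon + (alpha / beta) * sum(
        1.0 - math.exp(-beta * (horizon - s)) for s in event_times
    )
    return compensator - log_intensity_sum


nll_poisson = hawkes_nll([1.0, 2.0, 3.0], horizon=4.0, mu=0.5, alpha=0.0, beta=1.0)
nll_hawkes = hawkes_nll([1.0, 2.0, 3.0], horizon=4.0, mu=0.5, alpha=0.3, beta=1.0)
```

With `alpha = 0` the model reduces to a homogeneous Poisson process. The loss depends only on when events occur, so it needs no resampling onto a regular grid, which is why point processes suit irregularly timed laboratory tests.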

TimEHR: Image-based Time Series Generation for Electronic Health Records

Feb 09, 2024

Hojjat Karami, Mary-Anne Hartley, David Atienza, Anisoara Ionescu

Abstract: Time series in Electronic Health Records (EHRs) present unique challenges for generative models, such as irregular sampling, missing values, and high dimensionality. In this paper, we propose a novel generative adversarial network (GAN) model, TimEHR, to generate time series data from EHRs. In particular, TimEHR treats time series as images and is based on two conditional GANs. The first GAN generates missingness patterns, and the second GAN generates time series values based on the missingness pattern. Experimental results on three real-world EHR datasets show that TimEHR outperforms state-of-the-art methods in terms of fidelity, utility, and privacy metrics.
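
One plausible way to treat time series as images is to rasterise each patient's observations onto a variables-by-time grid, producing a value channel and a binary missingness mask. The sketch below shows that preprocessing step only; the grid layout, the averaging of values that share a bin, and the name `to_image` are assumptions for illustration and are not confirmed details of TimEHR.

```python
from typing import List, Tuple

Record = Tuple[int, float, float]  # (variable index, hours, value)


def to_image(
    records: List[Record], n_vars: int, n_bins: int, horizon: float
) -> Tuple[List[List[float]], List[List[int]]]:
    """Rasterise irregular observations into (values, mask) grids.

    mask[v][b] is 1 where variable v was observed in time bin b, else 0.
    Values falling in the same cell are averaged; empty cells stay 0.0.
    """
    sums = [[0.0] * n_bins for _ in range(n_vars)]
    counts = [[0] * n_bins for _ in range(n_vars)]
    for var, time, value in records:
        if not 0.0 <= time <= horizon:
            continue  # ignore observations outside the window
        # Clamp so that time == horizon falls in the last bin.
        b = min(int(time / horizon * n_bins), n_bins - 1)
        sums[var][b] += value
        counts[var][b] += 1
    values = [
        [s / c if c else 0.0 for s, c in zip(sum_row, count_row)]
        for sum_row, count_row in zip(sums, counts)
    ]
    mask = [[1 if c else 0 for c in count_row] for count_row in counts]
    return values, mask


values, mask = to_image(
    [(0, 0.5, 10.0), (0, 0.7, 20.0), (1, 3.9, 5.0)],
    n_vars=2, n_bins=4, horizon=4.0,
)
```

Under this representation, a first generator could be trained on `mask` alone and a second on `values` conditioned on `mask`, mirroring the two-stage description in the abstract.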
