This study aims to elucidate the significance of long-range correlations for deep-learning-based sleep staging. It is centered around S4Sleep(TS), a recently proposed model for automated sleep staging. This model utilizes electroencephalography (EEG) as raw time series input and relies on structured state space sequence (S4) models as essential model component. Although the model already surpasses state-of-the-art methods for a moderate number of 15 input epochs, recent literature results suggest potential benefits from incorporating very long correlations spanning hundreds of input epochs. In this submission, we explore the possibility of achieving further enhancements by systematically scaling up the model's input size, anticipating potential improvements in prediction accuracy. In contrast to findings in literature, our results demonstrate that augmenting the input size does not yield a significant enhancement in the performance of S4Sleep(TS). These findings, coupled with the distinctive ability of S4 models to capture long-range dependencies in time series data, cast doubt on the diagnostic relevance of very long-range interactions for sleep staging.
Feature removal is a central building block for eXplainable AI (XAI), both for occlusion-based explanations (Shapley values) as well as their evaluation (pixel flipping, PF). However, occlusion strategies can vary significantly from simple mean replacement up to inpainting with state-of-the-art diffusion models. This ambiguity limits the usefulness of occlusion-based approaches. For example, PF benchmarks lead to contradicting rankings. This is amplified by competing PF measures: Features are either removed starting with most influential first (MIF) or least influential first (LIF). This study proposes two complementary perspectives to resolve this disagreement problem. Firstly, we address the common criticism of occlusion-based XAI, that artificial samples lead to unreliable model evaluations. We propose to measure the reliability by the R(eference)-Out-of-Model-Scope (OMS) score. The R-OMS score enables a systematic comparison of occlusion strategies and resolves the disagreement problem by grouping consistent PF rankings. Secondly, we show that the insightfulness of MIF and LIF is conversely dependent on the R-OMS score. To leverage this, we combine the MIF and LIF measures into the symmetric relevance gain (SRG) measure. This breaks the inherent connection to the underlying occlusion strategy and leads to consistent rankings. This resolves the disagreement problem, which we verify for a set of 40 different occlusion strategies.
Current deep learning algorithms designed for automatic ECG analysis have exhibited notable accuracy. However, akin to traditional electrocardiography, they tend to be narrowly focused and typically address a singular diagnostic condition. In this study, we specifically demonstrate the capability of a single model to predict a diverse range of both cardiac and non-cardiac discharge diagnoses based on a sole ECG collected in the emergency department. Among the 1,076 hierarchically structured ICD codes considered, our model achieves an AUROC exceeding 0.8 in 439 of them. This underscores the models proficiency in handling a wide array of diagnostic scenarios. We emphasize the potential of utilizing this model as a screening tool, potentially integrated into a holistic clinical decision support system for efficiently triaging patients in the emergency department. This research underscores the remarkable capabilities of comprehensive ECG analysis algorithms and the extensive range of possibilities facilitated by the open MIMIC-IV-ECG dataset. Finally, our data may play a pivotal role in revolutionizing the way ECG analysis is performed, marking a significant advancement in the field.
Cardiovascular diseases remain the leading global cause of mortality. This necessitates a profound understanding of heart aging processes to diagnose constraints in cardiovascular fitness. Traditionally, most of such insights have been drawn from the analysis of electrocardiogram (ECG) feature changes of individuals as they age. However, these features, while informative, may potentially obscure underlying data relationships. In this paper, we employ a deep-learning model and a tree-based model to analyze ECG data from a robust dataset of healthy individuals across varying ages in both raw signals and ECG feature format. Explainable AI techniques are then used to identify ECG features or raw signal characteristics are most discriminative for distinguishing between age groups. Our analysis with tree-based classifiers reveal age-related declines in inferred breathing rates and identifies notably high SDANN values as indicative of elderly individuals, distinguishing them from younger adults. Furthermore, the deep-learning model underscores the pivotal role of the P-wave in age predictions across all age groups, suggesting potential changes in the distribution of different P-wave types with age. These findings shed new light on age-related ECG changes, offering insights that transcend traditional feature-based approaches.
Scoring sleep stages in polysomnography recordings is a time-consuming task plagued by significant inter-rater variability. Therefore, it stands to benefit from the application of machine learning algorithms. While many algorithms have been proposed for this purpose, certain critical architectural decisions have not received systematic exploration. In this study, we meticulously investigate these design choices within the broad category of encoder-predictor architectures. We identify robust architectures applicable to both time series and spectrogram input representations. These architectures incorporate structured state space models as integral components, leading to statistically significant advancements in performance on the extensive SHHS dataset. These improvements are assessed through both statistical and systematic error estimations. We anticipate that the architectural insights gained from this study will not only prove valuable for future research in sleep staging but also hold relevance for other time series annotation tasks.
Motivation: We explored how explainable AI (XAI) can help to shed light into the inner workings of neural networks for protein function prediction, by extending the widely used XAI method of integrated gradients such that latent representations inside of transformer models, which were finetuned to Gene Ontology term and Enzyme Commission number prediction, can be inspected too. Results: The approach enabled us to identify amino acids in the sequences that the transformers pay particular attention to, and to show that these relevant sequence parts reflect expectations from biology and chemistry, both in the embedding layer and inside of the model, where we identified transformer heads with a statistically significant correspondence of attribution maps with ground truth sequence annotations (e.g., transmembrane regions, active sites) across many proteins. Availability and Implementation: Source code can be accessed at https://github.com/markuswenzel/xai-proteins .
Deep learning has emerged as the preferred modeling approach for automatic ECG analysis. In this study, we investigate three elements aimed at improving the quantitative accuracy of such systems. These components consistently enhance performance beyond the existing state-of-the-art, which is predominantly based on convolutional models. Firstly, we explore more expressive architectures by exploiting structured state space models (SSMs). These models have shown promise in capturing long-term dependencies in time series data. By incorporating SSMs into our approach, we not only achieve better performance, but also gain insights into long-standing questions in the field. Specifically, for standard diagnostic tasks, we find no advantage in using higher sampling rates such as 500Hz compared to 100Hz. Similarly, extending the input size of the model beyond 3 seconds does not lead to significant improvements. Secondly, we demonstrate that self-supervised learning using contrastive predictive coding can further improve the performance of SSMs. By leveraging self-supervision, we enable the model to learn more robust and representative features, leading to improved analysis accuracy. Lastly, we depart from synthetic benchmarking scenarios and incorporate basic demographic metadata alongside the ECG signal as input. This inclusion of patient metadata departs from the conventional practice of relying solely on the signal itself. Remarkably, this addition consistently yields positive effects on predictive performance. We firmly believe that all three components should be considered when developing next-generation ECG analysis algorithms.
Deep neural networks have become increasingly popular for analyzing ECG data because of their ability to accurately identify cardiac conditions and hidden clinical factors. However, the lack of transparency due to the black box nature of these models is a common concern. To address this issue, explainable AI (XAI) methods can be employed. In this study, we present a comprehensive analysis of post-hoc XAI methods, investigating the local (attributions per sample) and global (based on domain expert concepts) perspectives. We have established a set of sanity checks to identify sensible attribution methods, and we provide quantitative evidence in accordance with expert rules. This dataset-wide analysis goes beyond anecdotal evidence by aggregating data across patient subgroups. Furthermore, we demonstrate how these XAI techniques can be utilized for knowledge discovery, such as identifying subtypes of myocardial infarction. We believe that these proposed methods can serve as building blocks for a complementary assessment of the internal validity during a certification process, as well as for knowledge discovery in the field of ECG analysis.
Feature importance methods promise to provide a ranking of features according to importance for a given classification task. A wide range of methods exist but their rankings often disagree and they are inherently difficult to evaluate due to a lack of ground truth beyond synthetic datasets. In this work, we put feature importance methods to the test on real-world data in the domain of cardiology, where we try to distinguish three specific pathologies from healthy subjects based on ECG features comparing to features used in cardiologists' decision rules as ground truth. Some methods generally performed well and others performed poorly, while some methods did well on some but not all of the problems considered.
The completeness axiom renders the explanation of a post-hoc XAI method only locally faithful to the model, i.e. for a single decision. For the trustworthy application of XAI, in particular for high-stake decisions, a more global model understanding is required. Recently, concept-based methods have been proposed, which are however not guaranteed to be bound to the actual model reasoning. To circumvent this problem, we propose Multi-dimensional Concept Discovery (MCD) as an extension of previous approaches that fulfills a completeness relation on the level of concepts. Our method starts from general linear subspaces as concepts and does neither require reinforcing concept interpretability nor re-training of model parts. We propose sparse subspace clustering to discover improved concepts and fully leverage the potential of multi-dimensional subspaces. MCD offers two complementary analysis tools for concepts in input space: (1) concept activation maps, that show where a concept is expressed within a sample, allowing for concept characterization through prototypical samples, and (2) concept relevance heatmaps, that decompose the model decision into concept contributions. Both tools together enable a detailed understanding of the model reasoning, which is guaranteed to relate to the model via a completeness relation. This paves the way towards more trustworthy concept-based XAI. We empirically demonstrate the superiority of MCD against more constrained concept definitions.