Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability and interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g., laughter, whispering), and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper's false speech detections by 70% with a negligible WER increase, demonstrating real-world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.
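For readers unfamiliar with the setup, the following is a minimal sketch of a sparse autoencoder trained on frozen encoder activations; the dimensions, the L1 sparsity coefficient, and the ReLU/linear architecture are illustrative assumptions, not the paper's exact training recipe.

```python
# Minimal sparse-autoencoder sketch over frozen audio-encoder activations.
# Dimensions, sparsity coefficient, and architecture are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = F.relu(self.encoder(x))      # sparse feature activations
        x_hat = self.decoder(f)          # reconstruction of the original activation
        return x_hat, f

sae = SparseAutoencoder()
acts = torch.randn(32, 768)              # stand-in for one encoder layer's activations
x_hat, feats = sae(acts)
loss = F.mse_loss(x_hat, acts) + 1e-3 * feats.abs().mean()   # reconstruction + L1 sparsity
loss.backward()
```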
Accurate and timely seizure detection from Electroencephalography (EEG) is critical for clinical intervention, yet manual review of long-term recordings is labor-intensive. Recent efforts to encode EEG signals into large language models (LLMs) show promise in handling neural signals across diverse patients, but two significant challenges remain: (1) multi-channel heterogeneity, as seizure-relevant information varies substantially across EEG channels, and (2) computing inefficiency, as the EEG signals must be encoded into a massive number of tokens for prediction. To address these issues, we propose to draw the EEG signal as a visual canvas and introduce the novel NeuroCanvas framework. Specifically, NeuroCanvas consists of two modules: (i) the Entropy-guided Channel Selector (ECS), which selects seizure-relevant channels as input to the LLM, and (ii) the Canvas of Neuron Signal (CNS), which converts the selected multi-channel heterogeneous EEG signals into structured visual representations. The ECS module alleviates the multi-channel heterogeneity issue, and the CNS represents the EEG signals with compact visual tokens, improving computing efficiency. We evaluate NeuroCanvas across multiple seizure detection datasets, demonstrating a significant improvement of $20\%$ in F1 score and a reduction of $88\%$ in inference latency. These results highlight NeuroCanvas as a scalable and effective solution for real-time and resource-efficient seizure detection in clinical practice. The code will be released at https://github.com/Yanchen30247/seizure_detect.
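The entropy-guided selection idea can be illustrated with a small sketch: score each channel by the Shannon entropy of its amplitude distribution and keep the top-k. The bin count, k, and scoring details below are assumptions for illustration, not the published ECS module.

```python
# Sketch of entropy-guided channel selection: rank EEG channels by the Shannon
# entropy of their amplitude histogram and keep the top-k (hyperparameters assumed).
import numpy as np

def channel_entropy(signal, n_bins=64):
    hist, _ = np.histogram(signal, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def select_channels(eeg, k=4):
    """eeg: (n_channels, n_samples); returns indices of the k highest-entropy channels."""
    scores = np.array([channel_entropy(ch) for ch in eeg])
    return np.argsort(scores)[::-1][:k]

eeg = np.random.randn(19, 2560)          # e.g. 19 channels, 10 s at 256 Hz
keep = select_channels(eeg, k=4)
selected = eeg[keep]                     # channels passed on to the visual encoding step
```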
Electroencephalography (EEG) analysis is critical for brain-computer interfaces and neuroscience, but the intrinsic noise and high dimensionality of EEG signals hinder effective feature learning. We propose a self-supervised framework based on the Hierarchical Functional Maximal Correlation Algorithm (HFMCA), which learns orthonormal EEG representations by enforcing feature decorrelation and reducing redundancy. This design enables robust capture of essential brain dynamics for various EEG recognition tasks. We validate HFMCA on two benchmark datasets, SEED and BCIC-2A, where pretraining with HFMCA consistently outperforms competitive self-supervised baselines, achieving notable gains in classification accuracy. Across diverse EEG tasks, our method demonstrates superior cross-subject generalization under leave-one-subject-out validation, advancing the state of the art by 2.71\% on SEED emotion recognition and 2.57\% on BCIC-2A motor imagery classification.
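The decorrelation idea behind such objectives can be sketched generically: penalize the deviation of the batch feature covariance from the identity so that features stay near-orthonormal. The snippet below illustrates only this generic principle, not the HFMCA objective itself.

```python
# Generic feature-decorrelation penalty: push the batch covariance of the learned
# features toward the identity. Illustrative only; not the HFMCA loss.
import torch

def decorrelation_loss(z):
    """z: (batch, dim) features; penalize deviation of covariance from identity."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (z.shape[0] - 1)
    eye = torch.eye(z.shape[1], device=z.device)
    return ((cov - eye) ** 2).mean()

z = torch.randn(128, 64, requires_grad=True)   # stand-in for encoder outputs on a batch
loss = decorrelation_loss(z)
loss.backward()
```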
Objective: EEG-based methods can predict speech intelligibility, but their accuracy and robustness lag behind behavioral tests, which typically show test-retest differences under 1 dB. We introduce a multi-decoder method to predict speech reception thresholds (SRTs) from EEG recordings, enabling objective assessment for populations unable to perform behavioral tests, such as people with disorders of consciousness, and for applications such as hearing aid fitting. Approach: The method aggregates data from hundreds of decoders, each trained on different speech features and EEG preprocessing setups to quantify neural tracking (NT) of speech signals. Using data from 39 participants (ages 18-24), we recorded 29 minutes of EEG per person while they listened to speech at six signal-to-noise ratios (SNRs) and to a story in quiet. NT values were combined into a high-dimensional feature vector per subject, and a support vector regression model was trained to predict SRTs from these vectors. Main Result: Predictions correlated significantly with behavioral SRTs (r = 0.647, p < 0.001; NRMSE = 0.19), with all differences under 1 dB. SHAP analysis showed that theta/delta bands and early lags had slightly greater influence. Using pretrained subject-independent decoders reduced the required EEG data collection to 15 minutes (3 minutes of story, 12 minutes across the six SNR conditions) without losing accuracy.
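The final prediction stage can be sketched as follows, assuming the neural-tracking (NT) scores from the many decoders have already been computed: stack them into one feature vector per subject and regress behavioral SRTs with support vector regression. The data shapes and hyperparameters below are placeholders.

```python
# Sketch of the SRT-prediction stage: per-subject NT feature vectors -> SVR,
# evaluated with leave-one-subject-out predictions. Data and hyperparameters assumed.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
nt_features = rng.normal(size=(39, 300))          # 39 subjects x NT scores from ~300 decoders
srt = rng.normal(loc=-5.0, scale=2.0, size=39)    # placeholder behavioral SRTs in dB

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
pred = cross_val_predict(model, nt_features, srt, cv=LeaveOneOut())
print(np.corrcoef(pred, srt)[0, 1])               # correlation with behavioral SRTs
```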
Neural control of grasping arises from nonlinear interactions across multiple brain rhythms, yet EEG-based motor decoding has largely relied on linear, second-order spectral features. Here, we examine whether higher-order cross-frequency dynamics distinguish motor planning from execution during natural reach-to-grasp behavior. EEG was recorded in a cue-based paradigm during executed precision and power grips, enabling stage-resolved analysis of preparatory and execution-related neural activity. Cross-frequency bispectral analysis was used to compute bicoherence matrices across canonical frequency band pairs, from which magnitude- and phase-based features were extracted. Classification, permutation-based feature selection, and within-subject statistical testing showed that execution is characterized by substantially stronger and more discriminative nonlinear coupling than planning, with dominant contributions from beta- and gamma-driven interactions. In contrast, decoding of precision versus power grips achieved comparable performance during planning and execution, indicating that grasp-type representations emerge during planning and persist into execution. Spatial and spectral analyses further revealed that informative bispectral features reflect coordinated activity across prefrontal, central, and occipital regions. Despite substantial feature redundancy, effective dimensionality reduction preserved decoding performance. Together, these findings demonstrate that nonlinear cross-frequency coupling provides an interpretable and robust marker of motor planning and execution, extending bispectral EEG analysis to ecologically valid grasping and supporting its relevance for brain-computer interfaces and neuroprosthetic control.
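As a rough illustration of the core bispectral quantity, the sketch below estimates single-channel bicoherence over FFT segments; the segment length and frequency range are assumptions, and the paper's stage-resolved, cross-frequency pipeline is considerably more involved.

```python
# Single-channel bicoherence estimate over non-overlapping FFT segments.
# Segment length and frequency range are illustrative assumptions.
import numpy as np

def bicoherence(x, nperseg=256):
    segs = [x[i:i + nperseg] for i in range(0, len(x) - nperseg + 1, nperseg)]
    X = np.array([np.fft.rfft(np.hanning(nperseg) * s) for s in segs])
    n_f = X.shape[1]
    bispec = np.zeros((n_f // 2, n_f // 2), dtype=complex)
    norm1 = np.zeros((n_f // 2, n_f // 2))
    norm2 = np.zeros((n_f // 2, n_f // 2))
    for f1 in range(n_f // 2):
        for f2 in range(n_f // 2):
            triple = X[:, f1] * X[:, f2] * np.conj(X[:, f1 + f2])
            bispec[f1, f2] = triple.mean()                    # averaged bispectrum
            norm1[f1, f2] = np.mean(np.abs(X[:, f1] * X[:, f2]) ** 2)
            norm2[f1, f2] = np.mean(np.abs(X[:, f1 + f2]) ** 2)
    return np.abs(bispec) / np.sqrt(norm1 * norm2 + 1e-12)   # magnitude bicoherence

eeg_channel = np.random.randn(256 * 10)    # 10 s of one EEG channel at 256 Hz (placeholder)
b = bicoherence(eeg_channel)               # bicoherence matrix over frequency pairs
```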
Museums increasingly rely on digital content to support visitors' understanding of artworks, yet little is known about how these formats shape the emotional engagement that underlies meaningful art experiences. This research presents an in-situ EEG study of how digital interpretive content modulates engagement during art viewing. Participants experienced three modalities: direct viewing of a Bruegel painting, a 180° immersive interpretive projection, and a regular, display-based interpretive video. Frontal EEG markers of motivational orientation, internal involvement, perceptual drive, and arousal were extracted using eyes-open baselines and Z-normalized contrasts. Results show modality-specific engagement profiles: the display-based interpretive video induced high arousal and fast-band activity, the immersive projection promoted calm, presence-oriented absorption, and the original artwork reflected internally regulated engagement. These findings, obtained with lightweight EEG sensing in an operational cultural environment, suggest that digital interpretive content affects engagement style rather than quantity. This paves the way for new multimodal sensing approaches and enables museums to optimize the modalities and content of their interpretive media.
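A tiny sketch of the kind of baseline-referenced marker described above: band power during viewing, Z-normalized against an eyes-open baseline. The band edges, epoch lengths, and sampling rate are illustrative assumptions, not the study's exact marker definitions.

```python
# Baseline-referenced Z contrast of frontal band power (illustrative parameters).
import numpy as np
from scipy.signal import welch

def band_power(x, fs, lo, hi):
    f, pxx = welch(x, fs=fs, nperseg=fs * 2)
    return pxx[(f >= lo) & (f < hi)].mean()

fs = 250
baseline_epochs = np.random.randn(30, fs * 2)   # eyes-open baseline, 2 s epochs (placeholder)
viewing_epochs = np.random.randn(30, fs * 2)    # epochs recorded while viewing the artwork

base = np.array([band_power(e, fs, 8, 13) for e in baseline_epochs])   # alpha band
view = np.array([band_power(e, fs, 8, 13) for e in viewing_epochs])
z_contrast = (view.mean() - base.mean()) / base.std(ddof=1)            # Z-normalized contrast
```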
Recent electroencephalography (EEG) spatial super-resolution (SR) methods improve reconstruction quality either by directly predicting missing signals from visible channels or by adapting latent diffusion-based generative modeling to temporal data, but they often lack awareness of physiological spatial structure, which constrains spatial generation performance. To address this issue, we introduce TopoDiff, a geometry- and relation-aware diffusion model for EEG spatial super-resolution. Inspired by how human experts interpret spatial EEG patterns, TopoDiff incorporates topology-aware image embeddings derived from EEG topographic representations to provide global geometric context for spatial generation, together with a dynamic channel-relation graph that encodes inter-electrode relationships and evolves with temporal dynamics. This design yields a spatially grounded super-resolution framework with consistent performance improvements. Across multiple EEG datasets spanning diverse applications, including SEED/SEED-IV for emotion recognition, PhysioNet motor imagery (MI/MM), and TUSZ for seizure detection, our method achieves substantial gains in generation fidelity and leads to notable improvements in downstream EEG task performance.
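One ingredient named above, the channel-relation graph, can be sketched as an adjacency that combines electrode geometry with time-varying signal similarity; the positions, Gaussian kernel, and mixing weight below are assumptions, not the TopoDiff formulation.

```python
# Sketch of a dynamic channel-relation graph: edge weights mix electrode geometry
# with per-window signal correlation, so the graph changes over time (parameters assumed).
import numpy as np

def channel_graph(positions, window, sigma=0.1, alpha=0.5):
    """positions: (C, 3) electrode coordinates; window: (C, T) EEG segment."""
    d = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    geom = np.exp(-(d ** 2) / (2 * sigma ** 2))     # geometry-based affinity
    corr = np.abs(np.corrcoef(window))              # time-varying signal affinity
    return alpha * geom + (1 - alpha) * corr        # combined adjacency matrix

positions = np.random.rand(32, 3)        # placeholder electrode coordinates
segment = np.random.randn(32, 512)       # one EEG time window
A = channel_graph(positions, segment)    # (32, 32) adjacency for this window
```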
We study multimodal affect modeling when EEG and peripheral physiology are asynchronous, a setting that most fusion methods ignore or handle with costly warping. We propose Cross-Temporal Attention Fusion (CTAF), a self-supervised module that learns soft bidirectional alignments between modalities and builds a robust clip embedding using time-aware cross attention, a lightweight fusion gate, and alignment-regularized contrastive objectives with optional weak supervision. On the K-EmoCon dataset, under leave-one-out cross-validation, CTAF yields higher cosine margins for matched pairs and better cross-modal token retrieval within one second, and it is competitive with the baseline on three-bin accuracy and macro-F1 while using few labels. Our contributions are a time-aware fusion mechanism that directly models cross-modal correspondence, an alignment-driven self-supervised objective tailored to EEG and peripheral physiology, and an evaluation protocol that measures alignment quality itself. Our approach accounts for the coupling between the central and autonomic nervous systems in psychophysiological time series. These results indicate that CTAF is a strong step toward label-efficient, generalizable EEG-peripheral fusion under temporal asynchrony.
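A minimal sketch of the time-aware cross-attention idea: EEG tokens and peripheral-physiology tokens attend over each other, and a simple gate fuses the two directions into one clip embedding. The dimensions and gating form are illustrative, not the CTAF specification.

```python
# Bidirectional cross attention between asynchronous modalities plus a gated fusion
# into a clip embedding. Dimensions and gating are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.eeg_to_phys = nn.MultiheadAttention(d, heads, batch_first=True)
        self.phys_to_eeg = nn.MultiheadAttention(d, heads, batch_first=True)
        self.gate = nn.Linear(2 * d, 1)

    def forward(self, eeg, phys):
        a, _ = self.eeg_to_phys(eeg, phys, phys)   # EEG queries attend over physiology
        b, _ = self.phys_to_eeg(phys, eeg, eeg)    # physiology queries attend over EEG
        a, b = a.mean(dim=1), b.mean(dim=1)        # pool each direction over time
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1 - g) * b                 # gated clip embedding

fusion = CrossModalFusion()
eeg_tokens = torch.randn(8, 120, 64)               # (batch, EEG time steps, dim)
phys_tokens = torch.randn(8, 40, 64)               # asynchronous, coarser sampling
clip_emb = fusion(eeg_tokens, phys_tokens)         # (8, 64)
```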
Electroencephalography (EEG) underpins neuroscience, clinical neurophysiology, and brain-computer interfaces (BCIs), yet pronounced inter- and intra-subject variability limits reliability, reproducibility, and translation. This systematic review covers studies that quantified or modeled EEG variability across resting-state, event-related potential (ERP), and task-related/BCI paradigms (including motor imagery and SSVEP) in healthy and clinical cohorts. Across paradigms, inter-subject differences are typically larger than within-subject fluctuations, but both affect inference and model generalization. Stability is feature-dependent: alpha-band measures and individual alpha peak frequency are often relatively reliable, whereas higher-frequency and many connectivity-derived metrics show more heterogeneous reliability; ERP reliability varies by component, with P300 measures frequently showing moderate-to-good stability. We summarize major sources of variability (biological, state-related, technical, and analytical), review common quantification and modeling approaches (e.g., ICC, CV, SNR, generalizability theory, and multivariate/learning-based methods), and provide recommendations for study design, reporting, and harmonization. Overall, EEG variability should be treated as both a practical constraint to manage and a meaningful signal to leverage for precision neuroscience and robust neurotechnology.
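As a worked example of one reliability metric mentioned above, the sketch below computes a two-way random-effects intraclass correlation, ICC(2,1), for an EEG feature measured in two sessions; the data are placeholders.

```python
# ICC(2,1) for test-retest reliability of an EEG feature (placeholder data).
import numpy as np

def icc_2_1(Y):
    """Y: (n_subjects, k_sessions) feature values; returns ICC(2,1)."""
    n, k = Y.shape
    grand = Y.mean()
    ms_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # between-subject MS
    ms_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # between-session MS
    resid = Y - Y.mean(axis=1, keepdims=True) - Y.mean(axis=0, keepdims=True) + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))               # residual MS
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

alpha_power = np.random.rand(20, 2) + np.arange(20)[:, None] * 0.1  # 20 subjects, 2 sessions
print(icc_2_1(alpha_power))
```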
Electroencephalography (EEG) is a widely used technique for measuring brain activity. EEG signals can reveal a person's emotional state, as they directly reflect activity in different brain regions. Emotion-aware systems and EEG-based emotion recognition form a growing research area. This paper presents how machine learning (ML) models categorize a limited dataset of EEG signals into three classes: Negative, Neutral, and Positive. It also presents the complete workflow, including data preprocessing and a comparison of ML models. To determine which ML classification model works best for this kind of problem, we train and test three commonly used models: logistic regression (LR), support vector machine (SVM), and random forest (RF). The performance of each is evaluated with respect to accuracy and F1-score. The results indicate that ML models can be effectively utilized for three-way emotion classification of EEG signals. Among the three models trained on the available dataset, the RF model gave the best results. Its higher accuracy and F1-score suggest that it captures the emotional patterns more accurately and effectively than the other two models. The RF model also outperformed existing state-of-the-art classification models in terms of accuracy.
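A sketch of the described comparison, assuming pre-extracted EEG feature vectors labeled Negative/Neutral/Positive; the features, dataset split, and hyperparameters here are placeholders, not the paper's setup.

```python
# Compare LR, SVM, and RF on three-class emotion labels using accuracy and macro F1.
# Feature vectors and labels below are placeholders for pre-extracted EEG features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

X = np.random.randn(600, 50)                          # placeholder EEG feature vectors
y = np.random.choice(["Negative", "Neutral", "Positive"], size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(name, accuracy_score(y_te, pred), f1_score(y_te, pred, average="macro"))
```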