Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shrikanth Narayanan

Signal Analysis and Interpretation Lab, University of Southern California, Information Science Institute, University of Southern California

Interpretable Modeling of Articulatory Temporal Dynamics from real-time MRI for Phoneme Recognition

Sep 19, 2025

Jay Park, Hong Nguyen, Sean Foley, Jihwan Lee, Yoonjeong Lee, Dani Byrd, Shrikanth Narayanan

Figure 1 for Interpretable Modeling of Articulatory Temporal Dynamics from real-time MRI for Phoneme Recognition

Figure 2 for Interpretable Modeling of Articulatory Temporal Dynamics from real-time MRI for Phoneme Recognition

Figure 3 for Interpretable Modeling of Articulatory Temporal Dynamics from real-time MRI for Phoneme Recognition

Figure 4 for Interpretable Modeling of Articulatory Temporal Dynamics from real-time MRI for Phoneme Recognition

Abstract:Real-time Magnetic Resonance Imaging (rtMRI) visualizes vocal tract action, offering a comprehensive window into speech articulation. However, its signals are high dimensional and noisy, hindering interpretation. We investigate compact representations of spatiotemporal articulatory dynamics for phoneme recognition from midsagittal vocal tract rtMRI videos. We compare three feature types: (1) raw video, (2) optical flow, and (3) six linguistically-relevant regions of interest (ROIs) for articulator movements. We evaluate models trained independently on each representation, as well as multi-feature combinations. Results show that multi-feature models consistently outperform single-feature baselines, with the lowest phoneme error rate (PER) of 0.34 obtained by combining ROI and raw video. Temporal fidelity experiments demonstrate a reliance on fine-grained articulatory dynamics, while ROI ablation studies reveal strong contributions from tongue and lips. Our findings highlight how rtMRI-derived features provide accuracy and interpretability, and establish strategies for leveraging articulatory data in speech processing.

Via

Access Paper or Ask Questions

Can Layer-wise SSL Features Improve Zero-Shot ASR Performance for Children's Speech?

Aug 28, 2025

Abhijit Sinha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan

Abstract:Automatic Speech Recognition (ASR) systems often struggle to accurately process children's speech due to its distinct and highly variable acoustic and linguistic characteristics. While recent advancements in self-supervised learning (SSL) models have greatly enhanced the transcription of adult speech, accurately transcribing children's speech remains a significant challenge. This study investigates the effectiveness of layer-wise features extracted from state-of-the-art SSL pre-trained models - specifically, Wav2Vec2, HuBERT, Data2Vec, and WavLM in improving the performance of ASR for children's speech in zero-shot scenarios. A detailed analysis of features extracted from these models was conducted, integrating them into a simplified DNN-based ASR system using the Kaldi toolkit. The analysis identified the most effective layers for enhancing ASR performance on children's speech in a zero-shot scenario, where WSJCAM0 adult speech was used for training and PFSTAR children speech for testing. Experimental results indicated that Layer 22 of the Wav2Vec2 model achieved the lowest Word Error Rate (WER) of 5.15%, representing a 51.64% relative improvement over the direct zero-shot decoding using Wav2Vec2 (WER of 10.65%). Additionally, age group-wise analysis demonstrated consistent performance improvements with increasing age, along with significant gains observed even in younger age groups using the SSL features. Further experiments on the CMU Kids dataset confirmed similar trends, highlighting the generalizability of the proposed approach.

* IEEE Signal Processing Letters 2025
* Accepted

Via

Access Paper or Ask Questions

Decoding Neural Signatures of Semantic Evaluations in Depression and Suicidality

Jul 30, 2025

Woojae Jeong, Aditya Kommineni, Kleanthis Avramidis, Colin McDaniel, Donald Berry, Myzelle Hughes, Thomas McGee, Elsi Kaiser, Dani Byrd, Assal Habibi(+8 more)

Abstract:Depression and suicidality profoundly impact cognition and emotion, yet objective neurophysiological biomarkers remain elusive. We investigated the spatiotemporal neural dynamics underlying affective semantic processing in individuals with varying levels of clinical severity of depression and suicidality using multivariate decoding of electroencephalography (EEG) data. Participants (N=137) completed a sentence evaluation task involving emotionally charged self-referential statements while EEG was recorded. We identified robust, neural signatures of semantic processing, with peak decoding accuracy between 300-600 ms -- a window associated with automatic semantic evaluation and conflict monitoring. Compared to healthy controls, individuals with depression and suicidality showed earlier onset, longer duration, and greater amplitude decoding responses, along with broader cross-temporal generalization and increased activation of frontocentral and parietotemporal components. These findings suggest altered sensitivity and impaired disengagement from emotionally salient content in the clinical groups, advancing our understanding of the neurocognitive basis of mental health and providing a principled basis for developing reliable EEG-based biomarkers of depression and suicidality.

Via

Access Paper or Ask Questions

Joint ASR and Speaker Role Tagging with Serialized Output Training

Jun 12, 2025

Anfeng Xu, Tiantian Feng, Shrikanth Narayanan

Abstract:Automatic Speech Recognition systems have made significant progress with large-scale pre-trained models. However, most current systems focus solely on transcribing the speech without identifying speaker roles, a function that is critical for conversational AI. In this work, we investigate the use of serialized output training (SOT) for joint ASR and speaker role tagging. By augmenting Whisper with role-specific tokens and fine-tuning it with SOT, we enable the model to generate role-aware transcriptions in a single decoding pass. We compare the SOT approach against a self-supervised previous baseline method on two real-world conversational datasets. Our findings show that this approach achieves more than 10% reduction in multi-talker WER, demonstrating its feasibility as a unified model for speaker-role aware speech transcription.

* Under review

Via

Access Paper or Ask Questions

Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction

Jun 12, 2025

Thanathai Lertpetchpun, Tiantian Feng, Dani Byrd, Shrikanth Narayanan

Abstract:Speech emotion recognition (SER) in naturalistic conditions presents a significant challenge for the speech processing community. Challenges include disagreement in labeling among annotators and imbalanced data distributions. This paper presents a reproducible framework that achieves superior (top 1) performance in the Emotion Recognition in Naturalistic Conditions Challenge (IS25-SER Challenge) - Task 2, evaluated on the MSP-Podcast dataset. Our system is designed to tackle the aforementioned challenges through multimodal learning, multi-task learning, and imbalanced data handling. Specifically, our best system is trained by adding text embeddings, predicting gender, and including ``Other'' (O) and ``No Agreement'' (X) samples in the training set. Our system's results secured both first and second places in the IS25-SER Challenge, and the top performance was achieved by a simple two-system ensemble.

* Accepeted to Interspeech2025

Via

Access Paper or Ask Questions

Neural Responses to Affective Sentences Reveal Signatures of Depression

Jun 06, 2025

Aditya Kommineni, Woojae Jeong, Kleanthis Avramidis, Colin McDaniel, Myzelle Hughes, Thomas McGee, Elsi Kaiser, Kristina Lerman, Idan A. Blank, Dani Byrd(+6 more)

Abstract:Major Depressive Disorder (MDD) is a highly prevalent mental health condition, and a deeper understanding of its neurocognitive foundations is essential for identifying how core functions such as emotional and self-referential processing are affected. We investigate how depression alters the temporal dynamics of emotional processing by measuring neural responses to self-referential affective sentences using surface electroencephalography (EEG) in healthy and depressed individuals. Our results reveal significant group-level differences in neural activity during sentence viewing, suggesting disrupted integration of emotional and self-referential information in depression. Deep learning model trained on these responses achieves an area under the receiver operating curve (AUC) of 0.707 in distinguishing healthy from depressed participants, and 0.624 in differentiating depressed subgroups with and without suicidal ideation. Spatial ablations highlight anterior electrodes associated with semantic and affective processing as key contributors. These findings suggest stable, stimulus-driven neural signatures of depression that may inform future diagnostic tools.

Via

Access Paper or Ask Questions

LSM-2: Learning from Incomplete Wearable Sensor Data

Jun 05, 2025

Maxwell A. Xu, Girish Narayanswamy, Kumar Ayush, Dimitris Spathis, Shun Liao, Shyam A. Tailor, Ahmed Metwally, A. Ali Heydari, Yuwei Zhang, Jake Garrison(+15 more)

Figure 1 for LSM-2: Learning from Incomplete Wearable Sensor Data

Figure 2 for LSM-2: Learning from Incomplete Wearable Sensor Data

Figure 3 for LSM-2: Learning from Incomplete Wearable Sensor Data

Figure 4 for LSM-2: Learning from Incomplete Wearable Sensor Data

Abstract:Foundation models, a cornerstone of recent advancements in machine learning, have predominantly thrived on complete and well-structured data. Wearable sensor data frequently suffers from significant missingness, posing a substantial challenge for self-supervised learning (SSL) models that typically assume complete data inputs. This paper introduces the second generation of Large Sensor Model (LSM-2) with Adaptive and Inherited Masking (AIM), a novel SSL approach that learns robust representations directly from incomplete data without requiring explicit imputation. AIM's core novelty lies in its use of learnable mask tokens to model both existing ("inherited") and artificially introduced missingness, enabling it to robustly handle fragmented real-world data during inference. Pre-trained on an extensive dataset of 40M hours of day-long multimodal sensor data, our LSM-2 with AIM achieves the best performance across a diverse range of tasks, including classification, regression and generative modeling. Furthermore, LSM-2 with AIM exhibits superior scaling performance, and critically, maintains high performance even under targeted missingness scenarios, reflecting clinically coherent patterns, such as the diagnostic value of nighttime biosignals for hypertension prediction. This makes AIM a more reliable choice for real-world wearable data applications.

* Xu and Narayanswamy are co-first authors. McDuff and Liu are co-last authors

Via

Access Paper or Ask Questions

Towards disentangling the contributions of articulation and acoustics in multimodal phoneme recognition

May 29, 2025

Sean Foley, Hong Nguyen, Jihwan Lee, Sudarsana Reddy Kadiri, Dani Byrd, Louis Goldstein, Shrikanth Narayanan

Abstract:Although many previous studies have carried out multimodal learning with real-time MRI data that captures the audio-visual kinematics of the vocal tract during speech, these studies have been limited by their reliance on multi-speaker corpora. This prevents such models from learning a detailed relationship between acoustics and articulation due to considerable cross-speaker variability. In this study, we develop unimodal audio and video models as well as multimodal models for phoneme recognition using a long-form single-speaker MRI corpus, with the goal of disentangling and interpreting the contributions of each modality. Audio and multimodal models show similar performance on different phonetic manner classes but diverge on places of articulation. Interpretation of the models' latent space shows similar encoding of the phonetic space across audio and multimodal models, while the models' attention weights highlight differences in acoustic and articulatory timing for certain phonemes.

Via

Access Paper or Ask Questions

Developing a Top-tier Framework in Naturalistic Conditions Challenge for Categorized Emotion Prediction: From Speech Foundation Models and Learning Objective to Data Augmentation and Engineering Choices

May 28, 2025

Tiantian Feng, Thanathai Lertpetchpun, Dani Byrd, Shrikanth Narayanan

Abstract:Speech emotion recognition (SER), particularly for naturally expressed emotions, remains a challenging computational task. Key challenges include the inherent subjectivity in emotion annotation and the imbalanced distribution of emotion labels in datasets. This paper introduces the \texttt{SAILER} system developed for participation in the INTERSPEECH 2025 Emotion Recognition Challenge (Task 1). The challenge dataset, which contains natural emotional speech from podcasts, serves as a valuable resource for studying imbalanced and subjective emotion annotations. Our system is designed to be simple, reproducible, and effective, highlighting critical choices in modeling, learning objectives, data augmentation, and engineering choices. Results show that even a single system (without ensembling) can outperform more than 95\% of the submissions, with a Macro-F1 score exceeding 0.4. Moreover, an ensemble of three systems further improves performance, achieving a competitively ranked score (top-3 performing team). Our model is at: https://github.com/tiantiaf0627/vox-profile-release.

* Accepted to INTERSPEECH 2025

Via

Access Paper or Ask Questions

Large Language Models Do Multi-Label Classification Differently

May 23, 2025

Marcus Ma, Georgios Chochlakis, Niyantha Maruthu Pandiyan, Jesse Thomason, Shrikanth Narayanan

Abstract:Multi-label classification is prevalent in real-world settings, but the behavior of Large Language Models (LLMs) in this setting is understudied. We investigate how autoregressive LLMs perform multi-label classification, with a focus on subjective tasks, by analyzing the output distributions of the models in each generation step. We find that their predictive behavior reflects the multiple steps in the underlying language modeling required to generate all relevant labels as they tend to suppress all but one label at each step. We further observe that as model scale increases, their token distributions exhibit lower entropy, yet the internal ranking of the labels improves. Finetuning methods such as supervised finetuning and reinforcement learning amplify this phenomenon. To further study this issue, we introduce the task of distribution alignment for multi-label settings: aligning LLM-derived label distributions with empirical distributions estimated from annotator responses in subjective tasks. We propose both zero-shot and supervised methods which improve both alignment and predictive performance over existing approaches.

* 18 pages, 11 figures, 6 tables

Via

Access Paper or Ask Questions