Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xuan Shi

Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits

May 20, 2025

Tiantian Feng, Jihwan Lee, Anfeng Xu, Yoonjeong Lee, Thanathai Lertpetchpun, Xuan Shi, Helin Wang, Thomas Thebaud, Laureano Moro-Velazquez, Dani Byrd(+2 more)

Abstract:We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. Unlike existing works that focus on a single dimension of speaker traits, Vox-Profile provides holistic and multi-dimensional profiles that reflect both static speaker traits (e.g., age, sex, accent) and dynamic speech properties (e.g., emotion, speech flow). This benchmark is grounded in speech science and linguistics, developed with domain experts to accurately index speaker and speech characteristics. We report benchmark experiments using over 15 publicly available speech datasets and several widely used speech foundation models that target various static and dynamic speaker and speech properties. In addition to benchmark experiments, we showcase several downstream applications supported by Vox-Profile. First, we show that Vox-Profile can augment existing speech recognition datasets to analyze ASR performance variability. Vox-Profile is also used as a tool to evaluate the performance of speech generation systems. Finally, we assess the quality of our automated profiles through comparison with human evaluation and show convergent validity. Vox-Profile is publicly available at: https://github.com/tiantiaf0627/vox-profile-release.

Via

Access Paper or Ask Questions

Egocentric Speaker Classification in Child-Adult Dyadic Interactions: From Sensing to Computational Modeling

Sep 14, 2024

Tiantian Feng, Anfeng Xu, Xuan Shi, Somer Bishop, Shrikanth Narayanan

Figure 1 for Egocentric Speaker Classification in Child-Adult Dyadic Interactions: From Sensing to Computational Modeling

Figure 2 for Egocentric Speaker Classification in Child-Adult Dyadic Interactions: From Sensing to Computational Modeling

Figure 3 for Egocentric Speaker Classification in Child-Adult Dyadic Interactions: From Sensing to Computational Modeling

Figure 4 for Egocentric Speaker Classification in Child-Adult Dyadic Interactions: From Sensing to Computational Modeling

Abstract:Autism spectrum disorder (ASD) is a neurodevelopmental condition characterized by challenges in social communication, repetitive behavior, and sensory processing. One important research area in ASD is evaluating children's behavioral changes over time during treatment. The standard protocol with this objective is BOSCC, which involves dyadic interactions between a child and clinicians performing a pre-defined set of activities. A fundamental aspect of understanding children's behavior in these interactions is automatic speech understanding, particularly identifying who speaks and when. Conventional approaches in this area heavily rely on speech samples recorded from a spectator perspective, and there is limited research on egocentric speech modeling. In this study, we design an experiment to perform speech sampling in BOSCC interviews from an egocentric perspective using wearable sensors and explore pre-training Ego4D speech samples to enhance child-adult speaker classification in dyadic interactions. Our findings highlight the potential of egocentric speech collection and pre-training to improve speaker classification accuracy.

* pre-print under review

Via

Access Paper or Ask Questions

Toward Fully-End-to-End Listened Speech Decoding from EEG Signals

Jun 12, 2024

Jihwan Lee, Aditya Kommineni, Tiantian Feng, Kleanthis Avramidis, Xuan Shi, Sudarsana Kadiri, Shrikanth Narayanan

Figure 1 for Toward Fully-End-to-End Listened Speech Decoding from EEG Signals

Figure 2 for Toward Fully-End-to-End Listened Speech Decoding from EEG Signals

Figure 3 for Toward Fully-End-to-End Listened Speech Decoding from EEG Signals

Figure 4 for Toward Fully-End-to-End Listened Speech Decoding from EEG Signals

Abstract:Speech decoding from EEG signals is a challenging task, where brain activity is modeled to estimate salient characteristics of acoustic stimuli. We propose FESDE, a novel framework for Fully-End-to-end Speech Decoding from EEG signals. Our approach aims to directly reconstruct listened speech waveforms given EEG signals, where no intermediate acoustic feature processing step is required. The proposed method consists of an EEG module and a speech module along with a connector. The EEG module learns to better represent EEG signals, while the speech module generates speech waveforms from model representations. The connector learns to bridge the distributions of the latent spaces of EEG and speech. The proposed framework is both simple and efficient, by allowing single-step inference, and outperforms prior works on objective metrics. A fine-grained phoneme analysis is conducted to unveil model characteristics of speech decoding. The source code is available here: github.com/lee-jhwn/fesde.

* accepted to Interspeech2024

Via

Access Paper or Ask Questions

TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality

Apr 27, 2024

Tiantian Feng, Xuan Shi, Rahul Gupta, Shrikanth S. Narayanan

Figure 1 for TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality

Figure 2 for TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality

Figure 3 for TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality

Figure 4 for TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality

Abstract:Automatic Speech Understanding (ASU) aims at human-like speech interpretation, providing nuanced intent, emotion, sentiment, and content understanding from speech and language (text) content conveyed in speech. Typically, training a robust ASU model relies heavily on acquiring large-scale, high-quality speech and associated transcriptions. However, it is often challenging to collect or use speech data for training ASU due to concerns such as privacy. To approach this setting of enabling ASU when speech (audio) modality is missing, we propose TI-ASU, using a pre-trained text-to-speech model to impute the missing speech. We report extensive experiments evaluating TI-ASU on various missing scales, both multi- and single-modality settings, and the use of LLMs. Our findings show that TI-ASU yields substantial benefits to improve ASU in scenarios where even up to 95% of training speech is missing. Moreover, we show that TI-ASU is adaptive to dropout training, improving model robustness in addressing missing speech during inference.

Via

Access Paper or Ask Questions

Unlocking Foundation Models for Privacy-Enhancing Speech Understanding: An Early Study on Low Resource Speech Training Leveraging Label-guided Synthetic Speech Content

Jun 13, 2023

Tiantian Feng, Digbalay Bose, Xuan Shi, Shrikanth Narayanan

Abstract:Automatic Speech Understanding (ASU) leverages the power of deep learning models for accurate interpretation of human speech, leading to a wide range of speech applications that enrich the human experience. However, training a robust ASU model requires the curation of a large number of speech samples, creating risks for privacy breaches. In this work, we investigate using foundation models to assist privacy-enhancing speech computing. Unlike conventional works focusing primarily on data perturbation or distributed algorithms, our work studies the possibilities of using pre-trained generative models to synthesize speech content as training data with just label guidance. We show that zero-shot learning with training label-guided synthetic speech content remains a challenging task. On the other hand, our results demonstrate that the model trained with synthetic speech samples provides an effective initialization point for low-resource ASU training. This result reveals the potential to enhance privacy by reducing user data collection but using label-guided synthetic speech content.

Via

Access Paper or Ask Questions

A Review of Speech-centric Trustworthy Machine Learning: Privacy, Safety, and Fairness

Dec 18, 2022

Tiantian Feng, Rajat Hebbar, Nicholas Mehlman, Xuan Shi, Aditya Kommineni, and Shrikanth Narayanan

Figure 1 for A Review of Speech-centric Trustworthy Machine Learning: Privacy, Safety, and Fairness

Figure 2 for A Review of Speech-centric Trustworthy Machine Learning: Privacy, Safety, and Fairness

Figure 3 for A Review of Speech-centric Trustworthy Machine Learning: Privacy, Safety, and Fairness

Figure 4 for A Review of Speech-centric Trustworthy Machine Learning: Privacy, Safety, and Fairness

Abstract:Speech-centric machine learning systems have revolutionized many leading domains ranging from transportation and healthcare to education and defense, profoundly changing how people live, work, and interact with each other. However, recent studies have demonstrated that many speech-centric ML systems may need to be considered more trustworthy for broader deployment. Specifically, concerns over privacy breaches, discriminating performance, and vulnerability to adversarial attacks have all been discovered in ML research fields. In order to address the above challenges and risks, a significant number of efforts have been made to ensure these ML systems are trustworthy, especially private, safe, and fair. In this paper, we conduct the first comprehensive survey on speech-centric trustworthy ML topics related to privacy, safety, and fairness. In addition to serving as a summary report for the research community, we point out several promising future research directions to inspire the researchers who wish to explore further in this area.

Via

Access Paper or Ask Questions

Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?

Nov 25, 2022

Xuan Shi, Erica Cooper, Xin Wang, Junichi Yamagishi, Shrikanth Narayanan

Figure 1 for Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?

Figure 2 for Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?

Figure 3 for Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?

Figure 4 for Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?

Abstract:With the similarity between music and speech synthesis from symbolic input and the rapid development of text-to-speech (TTS) techniques, it is worthwhile to explore ways to improve the MIDI-to-audio performance by borrowing from TTS techniques. In this study, we analyze the shortcomings of a TTS-based MIDI-to-audio system and improve it in terms of feature computation, model selection, and training strategy, aiming to synthesize highly natural-sounding audio. Moreover, we conducted an extensive model evaluation through listening tests, pitch measurement, and spectrogram analysis. This work demonstrates not only synthesis of highly natural music but offers a thorough analytical approach and useful outcomes for the community. Our code and pre-trained models are open sourced at https://github.com/nii-yamagishilab/midi-to-audio.

* Submitted to ICASSP 2023

Via

Access Paper or Ask Questions

Use of speaker recognition approaches for learning timbre representations of musical instrument sounds from raw waveforms

Jul 24, 2021

Xuan Shi, Erica Cooper, Junichi Yamagishi

Figure 1 for Use of speaker recognition approaches for learning timbre representations of musical instrument sounds from raw waveforms

Figure 2 for Use of speaker recognition approaches for learning timbre representations of musical instrument sounds from raw waveforms

Figure 3 for Use of speaker recognition approaches for learning timbre representations of musical instrument sounds from raw waveforms

Figure 4 for Use of speaker recognition approaches for learning timbre representations of musical instrument sounds from raw waveforms

Abstract:Timbre representations of musical instruments, essential for diverse applications such as musical audio synthesis and separation, might be learned as bottleneck features from an instrumental recognition model. Given the similarities between speaker recognition and musical instrument recognition, in this paper, we investigate how to adapt successful speaker recognition algorithms to musical instrument recognition to learn meaningful instrumental timbre representations. To address the mismatch between musical audio and models devised for speech, we introduce a group of trainable filters to generate proper acoustic features from input raw waveforms, making it easier for a model to be optimized in an input-agnostic and end-to-end manner. Through experiments on both the NSynth and RWC databases in both musical instrument closed-set identification and open-set verification scenarios, the modified speaker recognition model was capable of generating discriminative embeddings for instrument and instrument-family identities. We further conducted extensive experiments to characterize the encoded information in learned timbre embeddings.

* Submitted to the IEEE/ACM Transactions on Audio, Speech, and Language Processing

Via

Access Paper or Ask Questions

RepGN:Object Detection with Relational Proposal Graph Network

Apr 18, 2019

Xingjian Du, Xuan Shi, Risheng Huang

Figure 1 for RepGN:Object Detection with Relational Proposal Graph Network

Figure 2 for RepGN:Object Detection with Relational Proposal Graph Network

Figure 3 for RepGN:Object Detection with Relational Proposal Graph Network

Figure 4 for RepGN:Object Detection with Relational Proposal Graph Network

Abstract:Region based object detectors achieve the state-of-the-art performance, but few consider to model the relation of proposals. In this paper, we explore the idea of modeling the relationships among the proposals for object detection from the graph learning perspective. Specifically, we present relational proposal graph network (RepGN) which is defined on object proposals and the semantic and spatial relation modeled as the edge. By integrating our RepGN module into object detectors, the relation and context constraints will be introduced to the feature extraction of regions and bounding boxes regression and classification. Besides, we propose a novel graph-cut based pooling layer for hierarchical coarsening of the graph, which empowers the RepGN module to exploit the inter-regional correlation and scene description in a hierarchical manner. We perform extensive experiments on COCO object detection dataset and show promising results.

Via

Access Paper or Ask Questions

End-to-End Model for Speech Enhancement by Consistent Spectrogram Masking

Jan 02, 2019

Xingjian Du, Mengyao Zhu, Xuan Shi, Xinpeng Zhang, Wen Zhang, Jingdong Chen

Figure 1 for End-to-End Model for Speech Enhancement by Consistent Spectrogram Masking

Figure 2 for End-to-End Model for Speech Enhancement by Consistent Spectrogram Masking

Figure 3 for End-to-End Model for Speech Enhancement by Consistent Spectrogram Masking

Figure 4 for End-to-End Model for Speech Enhancement by Consistent Spectrogram Masking

Abstract:Recently, phase processing is attracting increasinginterest in speech enhancement community. Some researchersintegrate phase estimations module into speech enhancementmodels by using complex-valued short-time Fourier transform(STFT) spectrogram based training targets, e.g. Complex RatioMask (cRM) [1]. However, masking on spectrogram would violentits consistency constraints. In this work, we prove that theinconsistent problem enlarges the solution space of the speechenhancement model and causes unintended artifacts. ConsistencySpectrogram Masking (CSM) is proposed to estimate the complexspectrogram of a signal with the consistency constraint in asimple but not trivial way. The experiments comparing ourCSM based end-to-end model with other methods are conductedto confirm that the CSM accelerate the model training andhave significant improvements in speech quality. From ourexperimental results, we assured that our method could enha

Via

Access Paper or Ask Questions