Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xuechen Liu

From Sharpness to Better Generalization for Speech Deepfake Detection

Jun 13, 2025

Wen Huang, Xuechen Liu, Xin Wang, Junichi Yamagishi, Yanmin Qian

Abstract:Generalization remains a critical challenge in speech deepfake detection (SDD). While various approaches aim to improve robustness, generalization is typically assessed through performance metrics like equal error rate without a theoretical framework to explain model performance. This work investigates sharpness as a theoretical proxy for generalization in SDD. We analyze how sharpness responds to domain shifts and find it increases in unseen conditions, indicating higher model sensitivity. Based on this, we apply Sharpness-Aware Minimization (SAM) to reduce sharpness explicitly, leading to better and more stable performance across diverse unseen test sets. Furthermore, correlation analysis confirms a statistically significant relationship between sharpness and generalization in most test settings. These findings suggest that sharpness can serve as a theoretical indicator for generalization in SDD and that sharpness-aware training offers a promising strategy for improving robustness.

* Accepted to Interspeech 2025

Via

Access Paper or Ask Questions

A Hierarchical Probabilistic Framework for Incremental Knowledge Tracing in Classroom Settings

Jun 11, 2025

Xinyi Gao, Qiucheng Wu, Yang Zhang, Xuechen Liu, Kaizhi Qian, Ying Xu, Shiyu Chang

Abstract:Knowledge tracing (KT) aims to estimate a student's evolving knowledge state and predict their performance on new exercises based on performance history. Many realistic classroom settings for KT are typically low-resource in data and require online updates as students' exercise history grows, which creates significant challenges for existing KT approaches. To restore strong performance under low-resource conditions, we revisit the hierarchical knowledge concept (KC) information, which is typically available in many classroom settings and can provide strong prior when data are sparse. We therefore propose Knowledge-Tree-based Knowledge Tracing (KT$^2$), a probabilistic KT framework that models student understanding over a tree-structured hierarchy of knowledge concepts using a Hidden Markov Tree Model. KT$^2$ estimates student mastery via an EM algorithm and supports personalized prediction through an incremental update mechanism as new responses arrive. Our experiments show that KT$^2$ consistently outperforms strong baselines in realistic online, low-resource settings.

* 24 pages, 4 figures

Via

Access Paper or Ask Questions

Quantifying Source Speaker Leakage in One-to-One Voice Conversion

Apr 22, 2025

Scott Wellington, Xuechen Liu, Junichi Yamagishi

Figure 1 for Quantifying Source Speaker Leakage in One-to-One Voice Conversion

Abstract:Using a multi-accented corpus of parallel utterances for use with commercial speech devices, we present a case study to show that it is possible to quantify a degree of confidence about a source speaker's identity in the case of one-to-one voice conversion. Following voice conversion using a HiFi-GAN vocoder, we compare information leakage for a range speaker characteristics; assuming a "worst-case" white-box scenario, we quantify our confidence to perform inference and narrow the pool of likely source speakers, reinforcing the regulatory obligation and moral duty that providers of synthetic voices have to ensure the privacy of their speakers' data.

* Accepted at IEEE 23rd International Conference of the Biometrics Special Interest Group (BIOSIG 2024)

Via

Access Paper or Ask Questions

ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech

Feb 13, 2025

Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen(+19 more)

Abstract:ASVspoof 5 is the fifth edition in a series of challenges which promote the study of speech spoofing and deepfake attacks as well as the design of detection solutions. We introduce the ASVspoof 5 database which is generated in crowdsourced fashion from data collected in diverse acoustic conditions (cf. studio-quality data for earlier ASVspoof databases) and from ~2,000 speakers (cf. ~100 earlier). The database contains attacks generated with 32 different algorithms, also crowdsourced, and optimised to varying degrees using new surrogate detection models. Among them are attacks generated with a mix of legacy and contemporary text-to-speech synthesis and voice conversion models, in addition to adversarial attacks which are incorporated for the first time. ASVspoof 5 protocols comprise seven speaker-disjoint partitions. They include two distinct partitions for the training of different sets of attack models, two more for the development and evaluation of surrogate detection models, and then three additional partitions which comprise the ASVspoof 5 training, development and evaluation sets. An auxiliary set of data collected from an additional 30k speakers can also be used to train speaker encoders for the implementation of attack algorithms. Also described herein is an experimental validation of the new ASVspoof 5 database using a set of automatic speaker verification and spoof/deepfake baseline detectors. With the exception of protocols and tools for the generation of spoofed/deepfake speech, the resources described in this paper, already used by participants of the ASVspoof 5 challenge in 2024, are now all freely available to the community.

* Database link: https://zenodo.org/records/14498691, Database mirror link: https://huggingface.co/datasets/jungjee/asvspoof5, ASVspoof 5 Challenge Workshop Proceeding: https://www.isca-archive.org/asvspoof_2024/index.html

Via

Access Paper or Ask Questions

Explaining Speaker and Spoof Embeddings via Probing

Dec 24, 2024

Xuechen Liu, Junichi Yamagishi, Md Sahidullah, Tomi kinnunen

Figure 1 for Explaining Speaker and Spoof Embeddings via Probing

Figure 2 for Explaining Speaker and Spoof Embeddings via Probing

Figure 3 for Explaining Speaker and Spoof Embeddings via Probing

Figure 4 for Explaining Speaker and Spoof Embeddings via Probing

Abstract:This study investigates the explainability of embedding representations, specifically those used in modern audio spoofing detection systems based on deep neural networks, known as spoof embeddings. Building on established work in speaker embedding explainability, we examine how well these spoof embeddings capture speaker-related information. We train simple neural classifiers using either speaker or spoof embeddings as input, with speaker-related attributes as target labels. These attributes are categorized into two groups: metadata-based traits (e.g., gender, age) and acoustic traits (e.g., fundamental frequency, speaking rate). Our experiments on the ASVspoof 2019 LA evaluation set demonstrate that spoof embeddings preserve several key traits, including gender, speaking rate, F0, and duration. Further analysis of gender and speaking rate indicates that the spoofing detector partially preserves these traits, potentially to ensure the decision process remains robust against them.

* To appear in IEEE ICASSP 2025

Via

Access Paper or Ask Questions

Libri2Vox Dataset: Target Speaker Extraction with Diverse Speaker Conditions and Synthetic Data

Dec 17, 2024

Yun Liu, Xuechen Liu, Xiaoxiao Miao, Junichi Yamagishi

Figure 1 for Libri2Vox Dataset: Target Speaker Extraction with Diverse Speaker Conditions and Synthetic Data

Figure 2 for Libri2Vox Dataset: Target Speaker Extraction with Diverse Speaker Conditions and Synthetic Data

Figure 3 for Libri2Vox Dataset: Target Speaker Extraction with Diverse Speaker Conditions and Synthetic Data

Figure 4 for Libri2Vox Dataset: Target Speaker Extraction with Diverse Speaker Conditions and Synthetic Data

Abstract:Target speaker extraction (TSE) is essential in speech processing applications, particularly in scenarios with complex acoustic environments. Current TSE systems face challenges in limited data diversity and a lack of robustness in real-world conditions, primarily because they are trained on artificially mixed datasets with limited speaker variability and unrealistic noise profiles. To address these challenges, we propose Libri2Vox, a new dataset that combines clean target speech from the LibriTTS dataset with interference speech from the noisy VoxCeleb2 dataset, providing a large and diverse set of speakers under realistic noisy conditions. We also augment Libri2Vox with synthetic speakers generated using state-of-the-art speech generative models to enhance speaker diversity. Additionally, to further improve the effectiveness of incorporating synthetic data, curriculum learning is implemented to progressively train TSE models with increasing levels of difficulty. Extensive experiments across multiple TSE architectures reveal varying degrees of improvement, with SpeakerBeam demonstrating the most substantial gains: a 1.39 dB improvement in signal-to-distortion ratio (SDR) on the Libri2Talker test set compared to baseline training. Building upon these results, we further enhanced performance through our speaker similarity-based curriculum learning approach with the Conformer architecture, achieving an additional 0.78 dB improvement over conventional random sampling methods in which data samples are randomly selected from the entire dataset. These results demonstrate the complementary benefits of diverse real-world data, synthetic speaker augmentation, and structured training strategies in building robust TSE systems.

Via

Access Paper or Ask Questions

Improving curriculum learning for target speaker extraction with synthetic speakers

Oct 01, 2024

Yun Liu, Xuechen Liu, Junichi Ymagishi

Figure 1 for Improving curriculum learning for target speaker extraction with synthetic speakers

Figure 2 for Improving curriculum learning for target speaker extraction with synthetic speakers

Figure 3 for Improving curriculum learning for target speaker extraction with synthetic speakers

Figure 4 for Improving curriculum learning for target speaker extraction with synthetic speakers

Abstract:Target speaker extraction (TSE) aims to isolate individual speaker voices from complex speech environments. The effectiveness of TSE systems is often compromised when the speaker characteristics are similar to each other. Recent research has introduced curriculum learning (CL), in which TSE models are trained incrementally on speech samples of increasing complexity. In CL training, the model is first trained on samples with low speaker similarity between the target and interference speakers, and then on samples with high speaker similarity. To further improve CL, this paper uses a $k$-nearest neighbor-based voice conversion method to simulate and generate speech of diverse interference speakers, and then uses the generated data as part of the CL. Experiments demonstrate that training data based on synthetic speakers can effectively enhance the model's capabilities and significantly improve the performance of multiple TSE systems.

* Accepted by SLT2024

Via

Access Paper or Ask Questions

AfriHuBERT: A self-supervised speech representation model for African languages

Sep 30, 2024

Jesujoba O. Alabi, Xuechen Liu, Dietrich Klakow, Junichi Yamagishi

Figure 1 for AfriHuBERT: A self-supervised speech representation model for African languages

Figure 2 for AfriHuBERT: A self-supervised speech representation model for African languages

Figure 3 for AfriHuBERT: A self-supervised speech representation model for African languages

Figure 4 for AfriHuBERT: A self-supervised speech representation model for African languages

Abstract:In this work, we present AfriHuBERT, an extension of mHuBERT-147, a state-of-the-art (SOTA) and compact self-supervised learning (SSL) model, originally pretrained on 147 languages. While mHuBERT-147 was pretrained on 16 African languages, we expand this to cover 39 African languages through continued pretraining on 6,500+ hours of speech data aggregated from diverse sources, including 23 newly added languages. We evaluate AfriHuBERT on two key speech tasks: Language Identification (LID) and Automatic Speech Recognition (ASR) using FLEURS dataset. Our results show a +4% F1 score improvement on average for LID and a -1.2% average Word Error Rate (WER) reduction for ASR. Further analysis shows that ASR models trained on AfriHuBERT exhibit improved cross-corpus generalization. Additionally, the analysis indicates that the FLEURS have data quality limitations that may affect their suitability for evaluating low-resource African languages, suggesting the need for better evaluation benchmarks for these languages.

* 14 pages

Via

Access Paper or Ask Questions

Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion

Sep 10, 2024

Zhengyang Chen, Shuai Wang, Mingyang Zhang, Xuechen Liu, Junichi Yamagishi, Yanmin Qian

Abstract:Voice conversion (VC) aims to modify the speaker's timbre while retaining speech content. Previous approaches have tokenized the outputs from self-supervised into semantic tokens, facilitating disentanglement of speech content information. Recently, in-context learning (ICL) has emerged in text-to-speech (TTS) systems for effectively modeling specific characteristics such as timbre through context conditioning. This paper proposes an ICL capability enhanced VC system (ICL-VC) employing a mask and reconstruction training strategy based on flow-matching generative models. Augmented with semantic tokens, our experiments on the LibriTTS dataset demonstrate that ICL-VC improves speaker similarity. Additionally, we find that k-means is a versatile tokenization method applicable to various pre-trained models. However, the ICL-VC system faces challenges in preserving the prosody of the source speech. To mitigate this issue, we propose incorporating prosody embeddings extracted from a pre-trained emotion recognition model into our system. Integration of prosody embeddings notably enhances the system's capability to preserve source speech prosody, as validated on the Emotional Speech Database.

Via

Access Paper or Ask Questions

A Preliminary Case Study on Long-Form In-the-Wild Audio Spoofing Detection

Aug 26, 2024

Xuechen Liu, Xin Wang, Junichi Yamagishi

Figure 1 for A Preliminary Case Study on Long-Form In-the-Wild Audio Spoofing Detection

Figure 2 for A Preliminary Case Study on Long-Form In-the-Wild Audio Spoofing Detection

Abstract:Audio spoofing detection has become increasingly important due to the rise in real-world cases. Current spoofing detectors, referred to as spoofing countermeasures (CM), are mainly trained and focused on audio waveforms with a single speaker and short duration. This study explores spoofing detection in more realistic scenarios, where the audio is long in duration and features multiple speakers and complex acoustic conditions. We test the widely-acquired AASIST under this challenging scenario, looking at the impact of multiple variations such as duration, speaker presence, and acoustic complexities on CM performance. Our work reveals key issues with current methods and suggests preliminary ways to improve them. We aim to make spoofing detection more applicable in more in-the-wild scenarios. This research is served as an important step towards developing detection systems that can handle the challenges of audio spoofing in real-world applications.

* Accepted to the 23rd International Conference of the Biometrics Special Interest Group (BIOSIG 2024). Copyright might be transferred, in such case the current version may be replaced

Via

Access Paper or Ask Questions