Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xuechen Liu

AfriHuBERT: A self-supervised speech representation model for African languages

Sep 30, 2024

Jesujoba O. Alabi, Xuechen Liu, Dietrich Klakow, Junichi Yamagishi

Figure 1 for AfriHuBERT: A self-supervised speech representation model for African languages

Figure 2 for AfriHuBERT: A self-supervised speech representation model for African languages

Figure 3 for AfriHuBERT: A self-supervised speech representation model for African languages

Figure 4 for AfriHuBERT: A self-supervised speech representation model for African languages

Abstract:In this work, we present AfriHuBERT, an extension of mHuBERT-147, a state-of-the-art (SOTA) and compact self-supervised learning (SSL) model, originally pretrained on 147 languages. While mHuBERT-147 was pretrained on 16 African languages, we expand this to cover 39 African languages through continued pretraining on 6,500+ hours of speech data aggregated from diverse sources, including 23 newly added languages. We evaluate AfriHuBERT on two key speech tasks: Language Identification (LID) and Automatic Speech Recognition (ASR) using FLEURS dataset. Our results show a +4% F1 score improvement on average for LID and a -1.2% average Word Error Rate (WER) reduction for ASR. Further analysis shows that ASR models trained on AfriHuBERT exhibit improved cross-corpus generalization. Additionally, the analysis indicates that the FLEURS have data quality limitations that may affect their suitability for evaluating low-resource African languages, suggesting the need for better evaluation benchmarks for these languages.

* 14 pages

Via

Access Paper or Ask Questions

Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion

Sep 10, 2024

Zhengyang Chen, Shuai Wang, Mingyang Zhang, Xuechen Liu, Junichi Yamagishi, Yanmin Qian

Abstract:Voice conversion (VC) aims to modify the speaker's timbre while retaining speech content. Previous approaches have tokenized the outputs from self-supervised into semantic tokens, facilitating disentanglement of speech content information. Recently, in-context learning (ICL) has emerged in text-to-speech (TTS) systems for effectively modeling specific characteristics such as timbre through context conditioning. This paper proposes an ICL capability enhanced VC system (ICL-VC) employing a mask and reconstruction training strategy based on flow-matching generative models. Augmented with semantic tokens, our experiments on the LibriTTS dataset demonstrate that ICL-VC improves speaker similarity. Additionally, we find that k-means is a versatile tokenization method applicable to various pre-trained models. However, the ICL-VC system faces challenges in preserving the prosody of the source speech. To mitigate this issue, we propose incorporating prosody embeddings extracted from a pre-trained emotion recognition model into our system. Integration of prosody embeddings notably enhances the system's capability to preserve source speech prosody, as validated on the Emotional Speech Database.

Via

Access Paper or Ask Questions

A Preliminary Case Study on Long-Form In-the-Wild Audio Spoofing Detection

Aug 26, 2024

Xuechen Liu, Xin Wang, Junichi Yamagishi

Figure 1 for A Preliminary Case Study on Long-Form In-the-Wild Audio Spoofing Detection

Figure 2 for A Preliminary Case Study on Long-Form In-the-Wild Audio Spoofing Detection

Abstract:Audio spoofing detection has become increasingly important due to the rise in real-world cases. Current spoofing detectors, referred to as spoofing countermeasures (CM), are mainly trained and focused on audio waveforms with a single speaker and short duration. This study explores spoofing detection in more realistic scenarios, where the audio is long in duration and features multiple speakers and complex acoustic conditions. We test the widely-acquired AASIST under this challenging scenario, looking at the impact of multiple variations such as duration, speaker presence, and acoustic complexities on CM performance. Our work reveals key issues with current methods and suggests preliminary ways to improve them. We aim to make spoofing detection more applicable in more in-the-wild scenarios. This research is served as an important step towards developing detection systems that can handle the challenges of audio spoofing in real-world applications.

* Accepted to the 23rd International Conference of the Biometrics Special Interest Group (BIOSIG 2024). Copyright might be transferred, in such case the current version may be replaced

Via

Access Paper or Ask Questions

ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

Aug 16, 2024

Xin Wang, Hector Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen(+3 more)

Figure 1 for ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

Figure 2 for ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

Figure 3 for ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

Figure 4 for ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

Abstract:ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake attacks, and the design of detection solutions. Compared to previous challenges, the ASVspoof 5 database is built from crowdsourced data collected from a vastly greater number of speakers in diverse acoustic conditions. Attacks, also crowdsourced, are generated and tested using surrogate detection models, while adversarial attacks are incorporated for the first time. New metrics support the evaluation of spoofing-robust automatic speaker verification (SASV) as well as stand-alone detection solutions, i.e., countermeasures without ASV. We describe the two challenge tracks, the new database, the evaluation metrics, baselines, and the evaluation platform, and present a summary of the results. Attacks significantly compromise the baseline systems, while submissions bring substantial improvements.

* 8 pages, ASVspoof 5 Workshop (Interspeech2024 Satellite)

Via

Access Paper or Ask Questions

Open-Source Conversational AI with SpeechBrain 1.0

Jul 02, 2024

Mirco Ravanelli, Titouan Parcollet, Adel Moumen, Sylvain de Langen, Cem Subakan, Peter Plantinga, Yingzhi Wang, Pooneh Mousavi, Luca Della Libera, Artem Ploujnikov(+20 more)

Figure 1 for Open-Source Conversational AI with SpeechBrain 1.0

Figure 2 for Open-Source Conversational AI with SpeechBrain 1.0

Abstract:SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks

* Submitted to JMLR (Machine Learning Open Source Software)

Via

Access Paper or Ask Questions

Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

Jun 13, 2024

Zhengyang Chen, Xuechen Liu, Erica Cooper, Junichi Yamagishi, Yanmin Qian

Figure 1 for Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

Figure 2 for Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

Figure 3 for Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

Figure 4 for Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

Abstract:This paper proposes a speech synthesis system that allows users to specify and control the acoustic characteristics of a speaker by means of prompts describing the speaker's traits of synthesized speech. Unlike previous approaches, our method utilizes listener impressions to construct prompts, which are easier to collect and align more naturally with everyday descriptions of speaker traits. We adopt the Low-rank Adaptation (LoRA) technique to swiftly tailor a pre-trained language model to our needs, facilitating the extraction of speaker-related traits from the prompt text. Besides, different from other prompt-driven text-to-speech (TTS) systems, we separate the prompt-to-speaker module from the multi-speaker TTS system, enhancing system flexibility and compatibility with various pre-trained multi-speaker TTS systems. Moreover, for the prompt-to-speaker characteristic module, we also compared the discriminative method and flow-matching based generative method and we found that combining both methods can help the system simultaneously capture speaker-related information from prompts better and generate speech with higher fidelity.

* Accepted for presentation at Interspeech 2024 (with more analysis in the final Appendix part)

Via

Access Paper or Ask Questions

Target Speaker Extraction with Curriculum Learning

Jun 12, 2024

Yun Liu, Xuechen Liu, Xiaoxiao Miao, Junichi Yamagishi

Abstract:This paper presents a novel approach to target speaker extraction (TSE) using Curriculum Learning (CL) techniques, addressing the challenge of distinguishing a target speaker's voice from a mixture containing interfering speakers. For efficient training, we propose designing a curriculum that selects subsets of increasing complexity, such as increasing similarity between target and interfering speakers, and that selects training data strategically. Our CL strategies include both variants using predefined difficulty measures (e.g. gender, speaker similarity, and signal-to-distortion ratio) and ones using the TSE's standard objective function, each designed to expose the model gradually to more challenging scenarios. Comprehensive testing on the Libri2talker dataset demonstrated that our CL strategies for TSE improved the performance, and the results markedly exceeded baseline models without CL about 1 dB.

* Accepted for presentation at Interspeech 2024

Via

Access Paper or Ask Questions

Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

Jan 28, 2024

Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen

Figure 1 for Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

Figure 2 for Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

Figure 3 for Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

Figure 4 for Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

Abstract:It is now well-known that automatic speaker verification (ASV) systems can be spoofed using various types of adversaries. The usual approach to counteract ASV systems against such attacks is to develop a separate spoofing countermeasure (CM) module to classify speech input either as a bonafide, or a spoofed utterance. Nevertheless, such a design requires additional computation and utilization efforts at the authentication stage. An alternative strategy involves a single monolithic ASV system designed to handle both zero-effort imposter (non-targets) and spoofing attacks. Such spoof-aware ASV systems have the potential to provide stronger protections and more economic computations. To this end, we propose to generalize the standalone ASV (G-SASV) against spoofing attacks, where we leverage limited training data from CM to enhance a simple backend in the embedding space, without the involvement of a separate CM module during the test (authentication) phase. We propose a novel yet simple backend classifier based on deep neural networks and conduct the study via domain adaptation and multi-task integration of spoof embeddings at the training stage. Experiments are conducted on the ASVspoof 2019 logical access dataset, where we improve the performance of statistical ASV backends on the joint (bonafide and spoofed) and spoofed conditions by a maximum of 36.2% and 49.8% in terms of equal error rates, respectively.

* Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (doi updated)

Via

Access Paper or Ask Questions

Speaker-Text Retrieval via Contrastive Learning

Dec 11, 2023

Xuechen Liu, Xin Wang, Erica Cooper, Xiaoxiao Miao, Junichi Yamagishi

Abstract:In this study, we introduce a novel cross-modal retrieval task involving speaker descriptions and their corresponding audio samples. Utilizing pre-trained speaker and text encoders, we present a simple learning framework based on contrastive learning. Additionally, we explore the impact of incorporating speaker labels into the training process. Our findings establish the effectiveness of linking speaker and text information for the task for both English and Japanese languages, across diverse data configurations. Additional visual analysis unveils potential nuanced associations between speaker clustering and retrieval performance.

* Submitted to IEEE Signal Processing Letters

Via

Access Paper or Ask Questions

Towards single integrated spoofing-aware speaker verification embeddings

Jun 01, 2023

Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang, Xuechen Liu, Md Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee(+5 more)

Figure 1 for Towards single integrated spoofing-aware speaker verification embeddings

Figure 2 for Towards single integrated spoofing-aware speaker verification embeddings

Figure 3 for Towards single integrated spoofing-aware speaker verification embeddings

Figure 4 for Towards single integrated spoofing-aware speaker verification embeddings

Abstract:This study aims to develop a single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two aspects. First, rejecting non-target speakers' input as well as target speakers' spoofed inputs should be addressed. Second, competitive performance should be demonstrated compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outperformed single embedding solutions by a large margin in the SASV2022 challenge. We analyze that the inferior performance of single SASV embeddings comes from insufficient amount of training data and distinct nature of ASV and CM tasks. To this end, we propose a novel framework that includes multi-stage training and a combination of loss functions. Copy synthesis, combined with several vocoders, is also exploited to address the lack of spoofed data. Experimental results show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.

* Accepted by INTERSPEECH 2023. Code and models are available in https://github.com/sasv-challenge/ASVSpoof5-SASVBaseline

Via

Access Paper or Ask Questions