Haizhou Li

WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction

Sep 24, 2024

Shuai Wang, Ke Zhang, Shaoxiong Lin, Junjie Li, Xuefei Wang, Meng Ge, Jianwei Yu, Yanmin Qian, Haizhou Li

Abstract: Target speaker extraction (TSE) focuses on isolating the speech of a specific target speaker from overlapped multi-talker speech, which is a typical setup in the cocktail party problem. In recent years, TSE has drawn increasing attention due to its potential for various applications such as user-customized interfaces and hearing aids, or as a crucial front-end processing technology for subsequent tasks such as speech recognition and speaker recognition. However, there are currently few open-source toolkits or available pre-trained models for off-the-shelf usage. In this work, we introduce WeSep, a toolkit designed for research and practical applications in TSE. WeSep features flexible target speaker modeling, scalable data management, effective on-the-fly data simulation, structured recipes and deployment support. The toolkit is publicly available at https://github.com/wenet-e2e/WeSep.

* Interspeech 2024

M-Vec: Matryoshka Speaker Embeddings with Flexible Dimensions

Sep 24, 2024

Shuai Wang, Pengcheng Zhu, Haizhou Li

Abstract: Fixed-dimensional speaker embeddings have become the dominant approach in speaker modeling, typically spanning hundreds to thousands of dimensions. These dimensions are hyperparameters that are not specifically picked, nor are they hierarchically ordered in terms of importance. In large-scale speaker representation databases, reducing the dimensionality of embeddings can significantly lower storage and computational costs. However, directly training low-dimensional representations often yields suboptimal performance. In this paper, we introduce the Matryoshka speaker embedding, a method that allows dynamic extraction of sub-dimensions from the embedding while maintaining performance. Our approach is validated on the VoxCeleb dataset, demonstrating that it can achieve extremely low-dimensional embeddings, such as 8 dimensions, while preserving high speaker verification performance.

* ICSR 2024, Shenzhen
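
The "dynamic extraction of sub-dimensions" described in the M-Vec abstract can be illustrated with a minimal sketch. It assumes, as is usual for Matryoshka-style representations, that a lower-dimensional embedding is simply the leading prefix of the full vector; the abstract does not spell out this detail, and the function names `truncate_embedding` and `cosine_score` are hypothetical, not part of any released code.

```python
import math


def truncate_embedding(embedding, dim):
    """Keep the first `dim` coordinates and L2-normalize them.

    Assumption: the sub-embedding is a prefix of the full vector.
    """
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in prefix]


def cosine_score(emb_a, emb_b, dim):
    """Cosine similarity between two embeddings using only `dim` dimensions."""
    a = truncate_embedding(emb_a, dim)
    b = truncate_embedding(emb_b, dim)
    return sum(x * y for x, y in zip(a, b))
```

In a verification setting, the same stored vectors could then be scored at, say, 8 dimensions for a cheap first pass and at full size for a final decision; whether that trade-off holds depends on how the embeddings were trained.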

Aligning Language Models Using Follow-up Likelihood as Reward Signal

Sep 20, 2024

Chen Zhang, Dading Chong, Feng Jiang, Chengguang Tang, Anningzhe Gao, Guohua Tang, Haizhou Li

Abstract: In natural human-to-human conversations, participants often receive feedback signals from one another based on their follow-up reactions. These reactions can include verbal responses, facial expressions, changes in emotional state, and other non-verbal cues. Similarly, in human-machine interactions, the machine can leverage the user's follow-up utterances as feedback signals to assess whether it has appropriately addressed the user's request. Therefore, we propose using the likelihood of follow-up utterances as rewards to differentiate preferred responses from less favored ones, without relying on human or commercial LLM-based preference annotations. Our proposed reward mechanism, "Follow-up Likelihood as Reward" (FLR), matches the performance of strong reward models trained on large-scale human or GPT-4 annotated data on 8 pairwise-preference and 4 rating-based benchmarks. Building upon the FLR mechanism, we propose to automatically mine preference data from the online generations of a base policy model. The preference data are subsequently used to boost the helpfulness of the base model through direct alignment from preference (DAP) methods, such as direct preference optimization (DPO). Lastly, we demonstrate that fine-tuning the language model that provides follow-up likelihood with natural language feedback significantly enhances FLR's performance on reward modeling benchmarks and effectiveness in aligning the base policy model's helpfulness.

* 16 pages, reward model, LLM Alignment
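
The core idea of FLR, scoring a response by how likely a given follow-up utterance is after it, can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the language model is replaced by precomputed per-token log-probabilities of the follow-up, the length-normalized mean is an assumed scoring choice, and `follow_up_reward` and `make_preference_pair` are hypothetical names.

```python
def follow_up_reward(token_logprobs):
    """Mean log-probability of the follow-up tokens given context and response.

    Assumption: length normalization is used so that follow-ups of different
    lengths are comparable; the abstract does not specify this.
    """
    if not token_logprobs:
        raise ValueError("follow-up must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def make_preference_pair(candidates):
    """Pick the chosen/rejected responses from follow-up likelihoods.

    `candidates` maps each response string to the per-token log-probabilities
    that a language model (not included here) assigned to the same follow-up.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidate responses")
    ranked = sorted(candidates, key=lambda r: follow_up_reward(candidates[r]))
    return {"chosen": ranked[-1], "rejected": ranked[0]}
```

Pairs produced this way have the chosen/rejected shape that preference-based methods such as DPO consume, which is how the abstract describes the mined data being used.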

On the effectiveness of enrollment speech augmentation for Target Speaker Extraction

Sep 15, 2024

Junjie Li, Ke Zhang, Shuai Wang, Haizhou Li, Man-Wai Mak, Kong Aik Lee

Abstract: Deep learning technologies have significantly advanced the performance of target speaker extraction (TSE) tasks. To enhance the generalization and robustness of these algorithms when training data is insufficient, data augmentation is a commonly adopted technique. Unlike typical data augmentation applied to speech mixtures, this work thoroughly investigates the effectiveness of augmenting the enrollment speech space. We found that for both pretrained and jointly optimized speaker encoders, directly augmenting the enrollment speech leads to consistent performance improvement. In addition to conventional methods such as noise and reverberation addition, we propose a novel augmentation method called self-estimated speech augmentation (SSA). Experimental results on the Libri2Mix test set show that our proposed method can achieve an improvement of up to 2.5 dB.

* Accepted by SLT2024
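
Of the conventional augmentations the abstract mentions, noise addition is the simplest to show. The sketch below mixes a noise signal into an enrollment utterance at a requested signal-to-noise ratio; it illustrates only that standard technique, not the proposed SSA method, and `add_noise_at_snr` is a hypothetical helper operating on plain lists of samples.

```python
import math


def mean_power(signal):
    """Average power (mean of squared samples) of a signal."""
    return sum(x * x for x in signal) / len(signal)


def add_noise_at_snr(speech, noise, snr_db):
    """Return speech + scaled noise so that the mixture has the given SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")
    # Solve 10*log10(P_speech / (scale^2 * P_noise)) = snr_db for scale.
    scale = math.sqrt(mean_power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

In training, the SNR would typically be drawn at random per utterance so that the speaker encoder sees enrollment speech under varied conditions.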

MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion

Sep 14, 2024

Sho Inoue, Shuai Wang, Wanxing Wang, Pengcheng Zhu, Mengxiao Bi, Haizhou Li

Abstract: In accented voice conversion or accent conversion, we seek to convert the accent in speech from one to another while preserving speaker identity and semantic content. In this study, we formulate a novel method for creating multi-accented speech samples, that is, pairs of accented speech samples by the same speaker, through text transliteration for training accent conversion systems. We begin by generating transliterated text with Large Language Models (LLMs), which is then fed into multilingual TTS models to synthesize accented English speech. As a reference system, we built a sequence-to-sequence model on the synthetic parallel corpus for accent conversion. We validated the proposed method for both native and non-native English speakers. Subjective and objective evaluations further validate our dataset's effectiveness in accent conversion studies.

* Project page with Speech Demo: https://github.com/shinshoji01/MacST-project-page

E1 TTS: Simple and Fast Non-Autoregressive TTS

Sep 14, 2024

Zhijun Liu, Shuai Wang, Pengcheng Zhu, Mengxiao Bi, Haizhou Li

Abstract: This paper introduces Easy One-Step Text-to-Speech (E1 TTS), an efficient non-autoregressive zero-shot text-to-speech system based on denoising diffusion pretraining and distribution matching distillation. The training of E1 TTS is straightforward; it does not require explicit monotonic alignment between the text and audio pairs. The inference of E1 TTS is efficient, requiring only one neural network evaluation for each utterance. Despite its sampling efficiency, E1 TTS achieves naturalness and speaker similarity comparable to various strong baseline models. Audio samples are available at http://e1tts.github.io/ .

Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection

Sep 11, 2024

Xinyuan Qian, Xianghu Yue, Jiadong Wang, Huiping Zhuang, Haizhou Li

Abstract: Sound Source Localization (SSL) is an enabling technology for applications such as surveillance and robotics. While traditional Signal Processing (SP)-based SSL methods provide analytic solutions under specific signal and noise assumptions, recent Deep Learning (DL)-based methods have significantly outperformed them. However, their success depends on extensive training data and substantial computational resources. Moreover, they often rely on large-scale annotated spatial data and may struggle when adapting to evolving sound classes. To mitigate these challenges, we propose a novel Class Incremental Learning (CIL) approach, termed SSL-CIL, which avoids serious accuracy degradation due to catastrophic forgetting by incrementally updating the DL-based SSL model through a closed-form analytic solution. In particular, data privacy is ensured since the learning process does not revisit any historical data (exemplar-free), which is more suitable for smart home scenarios. Empirical results on the public SSLR dataset demonstrate the superior performance of our proposal, achieving a localization accuracy of 90.9%, surpassing other competitive methods.
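
To make the idea of an exemplar-free, closed-form incremental update concrete, here is a minimal sketch of one well-known way to obtain it: a ridge-regression classifier that stores only the accumulated statistics X^T X and X^T Y. This is a stand-in technique chosen for illustration; the abstract does not describe SSL-CIL's actual update rule, and `AnalyticClassifier`, `solve`, and the `gamma` regularizer are hypothetical names and choices.

```python
def solve(matrix, rhs):
    """Solve matrix @ X = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(matrix[i]) + list(rhs[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


class AnalyticClassifier:
    """Ridge-regression classifier that keeps statistics, never raw samples."""

    def __init__(self, feature_dim, gamma=1e-3):
        self.dim = feature_dim
        self.num_classes = 0
        # gram accumulates X^T X + gamma * I; cross accumulates X^T Y (one-hot Y).
        self.gram = [[gamma if i == j else 0.0 for j in range(feature_dim)]
                     for i in range(feature_dim)]
        self.cross = [[] for _ in range(feature_dim)]

    def learn_phase(self, features, labels):
        """Fold one phase of data into the statistics; new classes add columns."""
        new_total = max(self.num_classes, max(labels) + 1)
        for row in self.cross:
            row.extend([0.0] * (new_total - self.num_classes))
        self.num_classes = new_total
        for x, label in zip(features, labels):
            for i in range(self.dim):
                for j in range(self.dim):
                    self.gram[i][j] += x[i] * x[j]
                self.cross[i][label] += x[i]

    def weights(self):
        """Closed-form solution W = (X^T X + gamma I)^{-1} X^T Y."""
        return solve(self.gram, self.cross)

    def predict(self, x):
        w = self.weights()
        scores = [sum(x[i] * w[i][c] for i in range(self.dim))
                  for c in range(self.num_classes)]
        return max(range(self.num_classes), key=lambda c: scores[c])
```

Because the statistics are additive, learning classes in successive phases yields the same weights as training on all data at once, which is the sense in which such an update avoids forgetting without revisiting historical samples.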

NeuroSpex: Neuro-Guided Speaker Extraction with Cross-Modal Attention

Sep 04, 2024

Dashanka De Silva, Siqi Cai, Saurav Pahuja, Tanja Schultz, Haizhou Li

Abstract: In the study of auditory attention, it has been revealed that there exists a robust correlation between attended speech and elicited neural responses, measurable through electroencephalography (EEG). Therefore, it is possible to use the attention information available within EEG signals to guide the extraction of the target speaker in a cocktail party computationally. In this paper, we present a neuro-guided speaker extraction model, i.e. NeuroSpex, using the EEG response of the listener as the sole auxiliary reference cue to extract attended speech from monaural speech mixtures. We propose a novel EEG signal encoder that captures the attention information. Additionally, we propose a cross-attention (CA) mechanism to enhance the speech feature representations, generating a speaker extraction mask. Experimental results on a publicly available dataset demonstrate that our proposed model outperforms two baseline models across various evaluation metrics.

Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing

Aug 29, 2024

Qianhui Liu, Jiadong Wang, Yang Wang, Xin Yang, Gang Pan, Haizhou Li

Abstract: Humans naturally perform audiovisual speech recognition (AVSR), enhancing the accuracy and robustness by integrating auditory and visual information. Spiking neural networks (SNNs), which mimic the brain's information-processing mechanisms, are well-suited for emulating the human capability of AVSR. Despite their potential, research on SNNs for AVSR is scarce, with most existing audio-visual multimodal methods focused on object or digit recognition. These models simply integrate features from both modalities, neglecting their unique characteristics and interactions. Additionally, they often rely on future information for current processing, which increases recognition latency and limits real-time applicability. Inspired by human speech perception, this paper proposes a novel human-inspired SNN named HI-AVSNN for AVSR, incorporating three key characteristics: cueing interaction, causal processing and spike activity. For cueing interaction, we propose a visual-cued auditory attention module (VCA2M) that leverages visual cues to guide attention to auditory features. We achieve causal processing by aligning the SNN's temporal dimension with that of visual and auditory features and applying temporal masking to utilize only past and current information. To implement spike activity, in addition to using SNNs, we leverage the event camera to capture lip movement as spikes, mimicking the human retina and providing efficient visual data. We evaluate HI-AVSNN on an audiovisual speech recognition dataset combining the DVS-Lip dataset with its corresponding audio samples. Experimental results demonstrate the superiority of our proposed fusion method, outperforming existing audio-visual SNN fusion methods and achieving a 2.27% improvement in accuracy over the only existing SNN-based AVSR method.
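
The temporal masking mentioned in the abstract, where each step may use only past and current information, can be sketched generically. The example below builds a causal mask and applies it to a matrix of attention scores; it shows the general masking technique only, and how HI-AVSNN applies it inside its spiking attention module is not specified in the abstract. The names `causal_mask` and `masked_softmax` are hypothetical.

```python
import math


def causal_mask(num_steps):
    """mask[q][k] is True when step q is allowed to attend to step k (k <= q)."""
    return [[k <= q for k in range(num_steps)] for q in range(num_steps)]


def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores, giving zero weight to masked steps."""
    weights = []
    for row, allowed in zip(scores, mask):
        visible = [s for s, ok in zip(row, allowed) if ok]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With a causal mask every row always has at least one visible entry (the current step), so the normalization never divides by zero, and the output at step t can be emitted as soon as step t arrives.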

Generative Expressive Conversational Speech Synthesis

Aug 01, 2024

Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li

Abstract: Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper speaking style in a user-agent conversation setting. Existing CSS methods employ effective multi-modal context modeling techniques to achieve empathy understanding and expression. However, they often need to design complex network architectures and meticulously optimize the modules within them. In addition, due to the limitations of small-scale datasets containing scripted recording styles, they often fail to simulate real natural conversational styles. To address the above issues, we propose a novel generative expressive CSS system, termed GPT-Talker. We transform the multimodal information of the multi-turn dialogue history into discrete token sequences and seamlessly integrate them to form a comprehensive user-agent dialogue context. Leveraging the power of GPT, we predict the token sequence of the agent's response, which includes both semantic and style knowledge. After that, the expressive conversational speech is synthesized by the conversation-enriched VITS to deliver feedback to the user. Furthermore, we propose a large-scale Natural CSS Dataset called NCSSD, which includes both naturally recorded conversational speech in improvised styles and dialogues extracted from TV shows. It encompasses both Chinese and English languages, with a total duration of 236 hours. We conducted comprehensive experiments on the reliability of the NCSSD and the effectiveness of our GPT-Talker. Both subjective and objective evaluations demonstrate that our model outperforms other state-of-the-art CSS systems significantly in terms of naturalness and expressiveness. The Code, Dataset, and Pre-trained Model are available at: https://github.com/AI-S2-Lab/GPT-Talker.

* 14 pages, 6 figures, 8 tables. Accepted by ACM MM 2024
