
Ruijie Tao


Prompt-driven Target Speech Diarization

Oct 23, 2023
Yidi Jiang, Zhengyang Chen, Ruijie Tao, Liqun Deng, Yanmin Qian, Haizhou Li

We introduce a novel task named `target speech diarization', which seeks to determine `when the target event occurred' within an audio signal. We devise a neural architecture, Prompt-driven Target Speech Diarization (PTSD), that works with diverse prompts specifying the target speech events of interest. We train and evaluate PTSD using the sim2spk, sim3spk and sim4spk datasets, which are derived from LibriSpeech. We show that the proposed framework accurately localizes target speech events. Furthermore, our framework exhibits versatility through its impressive performance on three diarization-related tasks: target speaker voice activity detection, overlapped speech detection and gender diarization. In particular, PTSD achieves performance comparable to specialized models across these tasks on both real and simulated data. This work serves as a reference benchmark and provides valuable insights into prompt-driven target speech processing.

* Submitted to ICASSP 2024
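
To make the task definition above concrete, here is a small, hypothetical Python sketch (not the authors' PTSD model): a prompt simply selects which event's frame-level posteriors to threshold, and a frame-level F1 scores `when the target event occurred'. The event names, posteriors and the scoring function are illustrative assumptions.

```python
# Toy sketch of the prompt-driven target speech diarization interface.
import numpy as np

def frame_activity_from_prompt(frame_scores: dict, prompt: str, threshold: float = 0.5):
    """Return a binary frame-level activity mask for the prompted event.

    frame_scores maps an event name (e.g. 'speaker_1', 'overlap', 'female')
    to per-frame posteriors produced by some prompt-conditioned model.
    """
    scores = np.asarray(frame_scores[prompt])
    return (scores >= threshold).astype(int)

def frame_f1(reference, hypothesis):
    """Frame-level F1 between binary reference and hypothesis masks."""
    reference, hypothesis = np.asarray(reference), np.asarray(hypothesis)
    tp = np.sum((reference == 1) & (hypothesis == 1))
    fp = np.sum((reference == 0) & (hypothesis == 1))
    fn = np.sum((reference == 1) & (hypothesis == 0))
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    return float(2 * precision * recall / (precision + recall + 1e-9))

if __name__ == "__main__":
    # Pretend posteriors for two prompted events over 10 frames.
    posteriors = {
        "speaker_1": np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.1, 0.6, 0.9, 0.2, 0.1]),
        "overlap":   np.array([0.1, 0.2, 0.8, 0.9, 0.9, 0.2, 0.1, 0.1, 0.1, 0.1]),
    }
    reference = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])
    hyp = frame_activity_from_prompt(posteriors, "speaker_1")
    print("frame-level F1 for prompt 'speaker_1':", round(frame_f1(reference, hyp), 3))
```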

Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification

Sep 26, 2023
Duc-Tuan Truong, Ruijie Tao, Jia Qi Yip, Kong Aik Lee, Eng Siong Chng

Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, conventional label-level KD overlooks the significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for automatic speaker verification. In this paper, we first demonstrate that leveraging a larger number of training non-target speakers improves the performance of automatic speaker verification models. Inspired by this finding about the importance of non-target speakers' knowledge, we modify the conventional label-level KD by disentangling and emphasizing the classification probabilities of non-target speakers during knowledge distillation. The proposed method is applied to three different student model architectures and achieves an average improvement of 13.67% in EER on the VoxCeleb dataset compared to embedding-level and conventional label-level KD methods.

* Submitted to ICASSP 2024 
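
As a rough illustration of emphasizing non-target speaker knowledge in label-level KD, the numpy sketch below splits teacher and student posteriors into a target-vs-rest part and a renormalized non-target part, and weights the latter with a hypothetical hyper-parameter `beta`. This is one plausible formulation, not necessarily the paper's exact loss.

```python
# Minimal sketch of a label-level KD loss with an emphasized non-target term.
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def non_target_emphasized_kd(teacher_logits, student_logits, target_idx, beta=2.0):
    """KD loss that separates target and non-target speaker probabilities."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)

    # Binary (target vs. rest) distributions.
    bt = np.array([p_t[target_idx], 1.0 - p_t[target_idx]])
    bs = np.array([p_s[target_idx], 1.0 - p_s[target_idx]])

    # Non-target distributions, renormalized to sum to one.
    mask = np.ones_like(p_t, dtype=bool)
    mask[target_idx] = False
    nt = p_t[mask] / p_t[mask].sum()
    ns = p_s[mask] / p_s[mask].sum()

    # Emphasize the non-target term with the assumed weight beta.
    return kl(bt, bs) + beta * kl(nt, ns)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=10)   # logits over 10 training speakers
    student = rng.normal(size=10)
    print("loss:", round(non_target_emphasized_kd(teacher, student, target_idx=3), 4))
```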

USED: Universal Speaker Extraction and Diarization

Sep 19, 2023
Junyi Ao, Mehmet Sinan Yıldırım, Meng Ge, Shuai Wang, Ruijie Tao, Yanmin Qian, Liqun Deng, Longshuai Xiao, Haizhou Li

Speaker extraction and diarization are two crucial enabling techniques for speech applications. Speaker extraction aims to extract a target speaker's voice from a multi-talker mixture, while speaker diarization demarcates speech segments by speaker, identifying `who spoke when'. Previous studies have typically treated the two tasks independently. However, they share a similar objective: disentangling the speakers, in the spectral domain for the former and in the temporal domain for the latter. It is therefore reasonable to expect that the speaker turns obtained from speaker diarization can benefit speaker extraction, while the extracted speech offers more accurate speaker turns than the mixture speech. In this paper, we propose a unified framework called Universal Speaker Extraction and Diarization (USED). We extend an existing speaker extraction model to simultaneously extract the waveforms of all speakers. We also employ a scenario-aware differentiated loss function to address the problem of sparsely overlapped speech in real-world conversations. We show that the USED model significantly outperforms the baselines for both speaker extraction and diarization tasks, in both highly overlapped and sparsely overlapped scenarios. Audio samples are available at https://ajyy.github.io/demo/USED/.

* Submitted to ICASSP 2024 
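
A hedged sketch of what a scenario-aware differentiated loss could look like for sparsely overlapped speech (an assumption for illustration, not the USED paper's formulation): active regions of the target speaker are scored with negative SI-SNR, while regions where the speaker is silent are penalized by output energy so that the model emits silence instead of leaking interference.

```python
# Illustrative scenario-aware loss: reconstruction on active regions, energy penalty elsewhere.
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB."""
    ref = ref - ref.mean()
    est = est - est.mean()
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - proj
    return 10 * np.log10((np.dot(proj, proj) + eps) / (np.dot(noise, noise) + eps))

def scenario_aware_loss(est, ref, active_mask, energy_weight=1.0, eps=1e-8):
    """est, ref: waveforms; active_mask: 1 where the target speaker speaks, 0 elsewhere."""
    active = np.asarray(active_mask, dtype=bool)
    loss = 0.0
    if active.any():
        # Reconstruction term on regions where the target speaker is active.
        loss += -si_snr(est[active], ref[active])
    if (~active).any():
        # Energy penalty on regions where the target speaker should be silent.
        silent = est[~active]
        loss += energy_weight * 10 * np.log10(np.dot(silent, silent) / silent.size + eps)
    return loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=8000)
    mask = np.concatenate([np.ones(4000), np.zeros(4000)])   # active in the first half only
    est = np.concatenate([ref[:4000] + 0.1 * rng.normal(size=4000),
                          0.01 * rng.normal(size=4000)])
    print("loss:", round(float(scenario_aware_loss(est, ref, mask)), 2))
```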

Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech

Sep 15, 2023
Junjie Li, Ruijie Tao, Zexu Pan, Meng Ge, Shuai Wang, Haizhou Li

Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture as specified by an auxiliary reference. Most studies focus on the scenario where the target speech is highly overlapped with the interfering speech. However, this scenario accounts for only a small percentage of real-world conversations. In this paper, we address the sparsely overlapped scenario, in which the auxiliary reference needs to perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech. We propose an audio-visual speaker extraction model named ActiveExtract, which leverages speaking activity from audio-visual active speaker detection (ASD). The ASD module directly provides the frame-level activity of the target speaker, while its intermediate feature representation is trained to discriminate speech-lip synchronization, which can be used for speaker disentanglement. Experimental results show that our model outperforms baselines across various overlapping ratios, achieving an average improvement of more than 4 dB in terms of SI-SNR.

* Submitted to ICASSP 2024 
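
The abstract reports gains of more than 4 dB in SI-SNR; for reference, the improvement is conventionally computed as the SI-SNR of the extracted signal minus that of the unprocessed mixture, both measured against the clean target. The sketch below is a generic metric implementation, not code from the paper.

```python
# Standard SI-SNR and SI-SNR improvement (SI-SNRi) in dB.
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB."""
    ref = ref - ref.mean()
    est = est - est.mean()
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - proj
    return 10 * np.log10((np.dot(proj, proj) + eps) / (np.dot(noise, noise) + eps))

def si_snr_improvement(est, mixture, target):
    """SI-SNRi: gain of the extracted signal over the unprocessed mixture."""
    return si_snr(est, target) - si_snr(mixture, target)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=16000)
    interferer = rng.normal(size=16000)
    mixture = target + 0.5 * interferer
    est = target + 0.1 * interferer        # stand-in for an extraction model's output
    print(f"SI-SNRi: {si_snr_improvement(est, mixture, target):.2f} dB")
```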

Target Active Speaker Detection with Audio-visual Cues

May 26, 2023
Yidi Jiang, Ruijie Tao, Zexu Pan, Haizhou Li

In active speaker detection (ASD), we would like to detect whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling the audio-visual synchronization cue, which depends on the video quality of the lip region of a speaker. In real-world applications, we may also have access to the reference speech of the on-screen speaker. To benefit from both the facial cue and the reference speech, we propose Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking. Our framework outperforms the popular TalkNet model on two datasets, achieving absolute improvements of 1.6% in mAP on the AVA-ActiveSpeaker validation set, and 0.8%, 0.4% and 0.8% in terms of AP, AUC and EER on the ASW test set, respectively. Code is available at https://github.com/Jiang-Yidi/TS-TalkNet/.

* Accepted to INTERSPEECH 2023
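
A toy sketch of the fusion idea (not the TS-TalkNet implementation): the pre-enrolled speaker embedding is broadcast over time and concatenated with frame-level audio-visual features before a frame-wise speaking/non-speaking decision. Dimensions and the linear classifier are illustrative assumptions.

```python
# Toy fusion of a pre-enrolled speaker embedding with frame-level AV features.
import numpy as np

def fuse_and_score(av_features, speaker_embedding, weights, bias=0.0):
    """av_features: (T, D_av); speaker_embedding: (D_spk,); weights: (D_av + D_spk,)."""
    T = av_features.shape[0]
    tiled = np.tile(speaker_embedding, (T, 1))            # repeat the enrollment per frame
    fused = np.concatenate([av_features, tiled], axis=1)  # (T, D_av + D_spk)
    logits = fused @ weights + bias                       # frame-level speaking logits
    return 1.0 / (1.0 + np.exp(-logits))                  # per-frame speaking probability

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs = fuse_and_score(rng.normal(size=(25, 128)),    # 25 video frames of AV features
                           rng.normal(size=192),          # pre-enrolled speaker embedding
                           rng.normal(size=128 + 192))    # toy classifier weights
    print(probs.shape)                                    # (25,)
```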

I4U System Description for NIST SRE'20 CTS Challenge

Nov 02, 2022
Kong Aik Lee, Tomi Kinnunen, Daniele Colibro, Claudio Vair, Andreas Nautsch, Hanwu Sun, Liang He, Tianyu Liang, Qiongqiong Wang, Mickael Rouvier, Pierre-Michel Bousquet, Rohan Kumar Das, Ignacio Viñals Bailo, Meng Liu, Héctor Deldago, Xuechen Liu, Md Sahidullah, Sandro Cumani, Boning Zhang, Koji Okabe, Hitoshi Yamamoto, Ruijie Tao, Haizhou Li, Alfonso Ortega Giménez, Longbiao Wang, Luis Buera

This manuscript describes the I4U submission to the 2020 NIST Speaker Recognition Evaluation (SRE'20) Conversational Telephone Speech (CTS) Challenge. The submission resulted from active collaboration among researchers across eight research teams - I2R (Singapore), UEF (Finland), VALPT (Italy, Spain), NEC (Japan), THUEE (China), LIA (France), NUS (Singapore), INRIA (France) and TJU (China). It was based on the fusion of top-performing sub-systems and sub-fusion systems contributed by individual teams. Efforts were made to use common development and validation sets and a shared submission schedule and milestones, and to minimize inconsistencies in trial list and score file formats across sites.

* SRE 2021, NIST Speaker Recognition Evaluation Workshop, CTS Speaker Recognition Challenge, 14-12 December 2021 
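
Score-level fusion of sub-systems, as mentioned above, usually reduces to a weighted sum of per-trial scores, with weights typically trained on a development set (e.g. by logistic regression). The sketch below uses placeholder weights and is not the I4U configuration.

```python
# Generic linear score-level fusion across sub-systems.
import numpy as np

def linear_score_fusion(scores_by_system, weights, offset=0.0):
    """scores_by_system: (n_systems, n_trials); weights: (n_systems,). Returns fused per-trial scores."""
    return weights @ np.asarray(scores_by_system) + offset

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(3, 5))                       # 3 sub-systems, 5 trials
    fused = linear_score_fusion(scores, np.array([0.5, 0.3, 0.2]))
    print(fused)
```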

Speaker recognition with two-step multi-modal deep cleansing

Oct 28, 2022
Ruijie Tao, Kong Aik Lee, Zhan Shi, Haizhou Li

Neural network-based speaker recognition has achieved significant improvement in recent years. A robust speaker representation learns meaningful knowledge from both hard and easy samples in the training set to achieve good performance. However, noisy samples (i.e., samples with wrong labels) in the training set induce confusion and cause the network to learn an incorrect representation. In this paper, we propose a two-step audio-visual deep cleansing framework to eliminate the effect of noisy labels in speaker representation learning. This framework contains a coarse-grained cleansing step to search for peculiar samples, followed by a fine-grained cleansing step to filter out the noisy labels. Our study starts from an efficient audio-visual speaker recognition system, which achieves close-to-perfect equal error rates (EER) of 0.01%, 0.07% and 0.13% on the Vox-O, E and H test sets. With the proposed multi-modal cleansing mechanism, four different speaker recognition networks achieve an average improvement of 5.9%. Code has been made available at https://github.com/TaoRuijie/AVCleanse.

* 5 pages, 3 figures 
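
A speculative sketch of a two-step cleansing procedure in the spirit of the abstract (not the paper's exact method, and it assumes the audio and face embeddings have been projected into a shared space): a coarse step flags utterances far from their claimed speaker's centroid, and a fine step keeps flagged utterances only if the two modalities agree.

```python
# Illustrative two-step (coarse then fine) multi-modal data cleansing.
import numpy as np

def cosine(a, b, eps=1e-8):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def cleanse(audio_emb, face_emb, labels, coarse_thr=0.4, fine_thr=0.3):
    """Return indices of training samples kept after the two cleansing steps.

    Assumes audio_emb and face_emb are (N, D) arrays in a shared embedding
    space and labels is a length-N array of claimed speaker identities.
    Thresholds are illustrative hyper-parameters.
    """
    audio_emb, face_emb, labels = map(np.asarray, (audio_emb, face_emb, labels))
    centroids = {s: audio_emb[labels == s].mean(axis=0) for s in np.unique(labels)}
    keep = []
    for i, spk in enumerate(labels):
        # Step 1 (coarse): keep samples close to the claimed speaker's centroid.
        if cosine(audio_emb[i], centroids[spk]) >= coarse_thr:
            keep.append(i)
        # Step 2 (fine): flagged samples survive only if audio and face agree.
        elif cosine(audio_emb[i], face_emb[i]) >= fine_thr:
            keep.append(i)
    return keep
```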

Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs

Oct 27, 2022
Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li

We study a novel neural architecture and training strategies for a speaker encoder for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-size speaker embedding from a spoken utterance of variable length. Contrastive learning is a typical self-supervised learning technique. However, the quality of the speaker encoder depends very much on the sampling strategy for positive and negative pairs. It is common to sample a positive pair of segments from the same utterance. Unfortunately, such poor-man's positive pairs (PPP) lack the diversity necessary for training a robust encoder. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we study a method that finds diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve equal error rates (EER) of 2.89%, 3.17% and 6.27% under the proposed progressive clustering strategy, and EERs of 1.44%, 1.77% and 3.27% under the two-stage learning strategy with pseudo labels, on the three test sets of VoxCeleb1. This novel solution outperforms state-of-the-art self-supervised learning methods by a large margin and, at the same time, achieves results comparable to its supervised learning counterpart. We also evaluate our self-supervised learning technique on the LRS2 and LRW datasets, where the speaker information is unknown. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets.

* 13 pages 
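
A hedged sketch of the sampling idea: instead of pairing two segments of the same utterance (the poor-man's positive pair), a positive is drawn from a different utterance in the anchor's pseudo-cluster, e.g. obtained by cross-modal clustering, and plugged into a standard contrastive (InfoNCE) loss. The clustering assignments below are placeholders.

```python
# Diverse positive pair sampling with a standard InfoNCE contrastive loss.
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(anchor, positive, negatives, temperature=0.07):
    """anchor, positive: (D,); negatives: (N, D). Standard contrastive loss."""
    anchor, positive, negatives = l2_normalize(anchor), l2_normalize(positive), l2_normalize(negatives)
    logits = np.concatenate([[anchor @ positive], negatives @ anchor]) / temperature
    logits = logits - logits.max()                       # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

def sample_diverse_positive(anchor_idx, cluster_of, embeddings, rng):
    """Pick a positive from a *different* utterance in the anchor's pseudo-cluster."""
    same = [i for i, c in enumerate(cluster_of) if c == cluster_of[anchor_idx] and i != anchor_idx]
    return embeddings[rng.choice(same)] if same else embeddings[anchor_idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(6, 64))          # utterance-level embeddings
    clusters = [0, 0, 0, 1, 1, 1]           # pseudo-speaker assignments (placeholder)
    positive = sample_diverse_positive(0, clusters, emb, rng)
    print("InfoNCE loss:", round(info_nce(emb[0], positive, emb[3:]), 3))
```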

HLT-NUS Submission for 2020 NIST Conversational Telephone Speech SRE

Nov 12, 2021
Rohan Kumar Das, Ruijie Tao, Haizhou Li

This work provides a brief description of the Human Language Technology (HLT) Laboratory, National University of Singapore (NUS) system submission to the 2020 NIST conversational telephone speech (CTS) speaker recognition evaluation (SRE). The challenge focuses on evaluation on CTS data containing multilingual speech. The systems developed at HLT-NUS consider time-delay neural network (TDNN) x-vector and ECAPA-TDNN systems. We also perform domain adaptation of the probabilistic linear discriminant analysis (PLDA) model and apply adaptive s-norm to our systems. Score-level fusion of the TDNN x-vector and ECAPA-TDNN systems is carried out, which improves the final performance of our submission to the 2020 NIST CTS SRE.

* 3 pages 
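
For reference, adaptive s-norm (mentioned above) normalizes a raw trial score with the statistics of the top-K cohort scores of the enrollment and test sides and averages the two normalized scores. The sketch below is a generic implementation with an assumed K, not the submission's exact configuration.

```python
# Generic adaptive s-norm (AS-norm) score normalization.
import numpy as np

def adaptive_s_norm(score, enroll_cohort_scores, test_cohort_scores, top_k=200):
    """score: raw trial score; *_cohort_scores: scores of the enrollment/test
    utterance against a cohort of impostor speakers."""
    def top_stats(cohort):
        top = np.sort(np.asarray(cohort))[::-1][:top_k]    # K best-matching cohort scores
        return top.mean(), top.std() + 1e-8
    mu_e, sd_e = top_stats(enroll_cohort_scores)
    mu_t, sd_t = top_stats(test_cohort_scores)
    return 0.5 * ((score - mu_e) / sd_e + (score - mu_t) / sd_t)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(adaptive_s_norm(2.5, rng.normal(size=1000), rng.normal(size=1000), top_k=100))
```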