Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kong Aik Lee

Disentangling Voice and Content with Self-Supervision for Speaker Recognition

Oct 02, 2023

Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, Haizhou Li

Abstract:For speaker recognition, it is difficult to extract an accurate speaker representation from speech because of its mixture of speaker traits and content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with the use of three Gaussian inference layers, each consisting of a learnable transition model that extracts distinct speech components. Notably, a strengthened transition model is specifically designed to model complex speech dynamics. We also propose a self-supervision method to dynamically disentangle content without the use of labels other than speaker identities. The efficacy of the proposed framework is validated via experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF, respectively. Since neither additional model training nor data is specifically needed, it is easily applicable in practical use.

* Accepted to NeurIPS 2023 (main track)

Via

Access Paper or Ask Questions

Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification

Sep 26, 2023

Duc-Tuan Truong, Ruijie Tao, Jia Qi Yip, Kong Aik Lee, Eng Siong Chng

Abstract:Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, the conventional label-level KD overlooks the significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for automatic speaker verification. In this paper, we first demonstrate that leveraging a larger number of training non-target speakers improves the performance of automatic speaker verification models. Inspired by this finding about the importance of non-target speakers' knowledge, we modified the conventional label-level KD by disentangling and emphasizing the classification probabilities of non-target speakers during knowledge distillation. The proposed method is applied to three different student model architectures and achieves an average of 13.67% improvement in EER on the VoxCeleb dataset compared to embedding-level and conventional label-level KD methods.

* Submitted to ICASSP 2024

Via

Access Paper or Ask Questions

The second multi-channel multi-party meeting transcription challenge 2.0): A benchmark for speaker-attributed ASR

Sep 24, 2023

Yuhao Liang, Mohan Shi, Fan Yu, Yangze Li, Shiliang Zhang, Zhihao Du, Qian Chen, Lei Xie, Yanmin Qian, Jian Wu(+4 more)

Figure 1 for The second multi-channel multi-party meeting transcription challenge 2.0): A benchmark for speaker-attributed ASR

Figure 2 for The second multi-channel multi-party meeting transcription challenge 2.0): A benchmark for speaker-attributed ASR

Figure 3 for The second multi-channel multi-party meeting transcription challenge 2.0): A benchmark for speaker-attributed ASR

Figure 4 for The second multi-channel multi-party meeting transcription challenge 2.0): A benchmark for speaker-attributed ASR

Abstract:With the success of the first Multi-channel Multi-party Meeting Transcription challenge (M2MeT), the second M2MeT challenge (M2MeT 2.0) held in ASRU2023 particularly aims to tackle the complex task of speaker-attributed ASR (SA-ASR), which directly addresses the practical and challenging problem of "who spoke what at when" at typical meeting scenario. We particularly established two sub-tracks. 1) The fixed training condition sub-track, where the training data is constrained to predetermined datasets, but participants can use any open-source pre-trained model. 2) The open training condition sub-track, which allows for the use of all available data and models. In addition, we release a new 10-hour test set for challenge ranking. This paper provides an overview of the dataset, track settings, results, and analysis of submitted systems, as a benchmark to show the current state of speaker-attributed ASR.

* 8 pages, Accepted by ASRU2023

Via

Access Paper or Ask Questions

t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Sep 21, 2023

Tomi Kinnunen, Kong Aik Lee, Hemlata Tak, Nicholas Evans, Andreas Nautsch

Figure 1 for t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Figure 2 for t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Figure 3 for t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Figure 4 for t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Abstract:Presentation attack (spoofing) detection (PAD) typically operates alongside biometric verification to improve reliablity in the face of spoofing attacks. Even though the two sub-systems operate in tandem to solve the single task of reliable biometric verification, they address different detection tasks and are hence typically evaluated separately. Evidence shows that this approach is suboptimal. We introduce a new metric for the joint evaluation of PAD solutions operating in situ with biometric verification. In contrast to the tandem detection cost function proposed recently, the new tandem equal error rate (t-EER) is parameter free. The combination of two classifiers nonetheless leads to a \emph{set} of operating points at which false alarm and miss rates are equal and also dependent upon the prevalence of attacks. We therefore introduce the \emph{concurrent} t-EER, a unique operating point which is invariable to the prevalence of attacks. Using both modality (and even application) agnostic simulated scores, as well as real scores for a voice biometrics application, we demonstrate application of the t-EER to a wide range of biometric system evaluations under attack. The proposed approach is a strong candidate metric for the tandem evaluation of PAD systems and biometric comparators.

* To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence. For associated codes, see https://github.com/TakHemlata/T-EER (Github) and https://colab.research.google.com/drive/1ga7eiKFP11wOFMuZjThLJlkBcwEG6_4m?usp=sharing (Google Colab)

Via

Access Paper or Ask Questions

Towards single integrated spoofing-aware speaker verification embeddings

Jun 01, 2023

Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang, Xuechen Liu, Md Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee(+5 more)

Figure 1 for Towards single integrated spoofing-aware speaker verification embeddings

Figure 2 for Towards single integrated spoofing-aware speaker verification embeddings

Figure 3 for Towards single integrated spoofing-aware speaker verification embeddings

Figure 4 for Towards single integrated spoofing-aware speaker verification embeddings

Abstract:This study aims to develop a single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two aspects. First, rejecting non-target speakers' input as well as target speakers' spoofed inputs should be addressed. Second, competitive performance should be demonstrated compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outperformed single embedding solutions by a large margin in the SASV2022 challenge. We analyze that the inferior performance of single SASV embeddings comes from insufficient amount of training data and distinct nature of ASV and CM tasks. To this end, we propose a novel framework that includes multi-stage training and a combination of loss functions. Copy synthesis, combined with several vocoders, is also exploited to address the lack of spoofed data. Experimental results show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.

* Accepted by INTERSPEECH 2023. Code and models are available in https://github.com/sasv-challenge/ASVSpoof5-SASVBaseline

Via

Access Paper or Ask Questions

Generalized domain adaptation framework for parametric back-end in speaker recognition

May 24, 2023

Qiongqiong Wang, Koji Okabe, Kong Aik Lee, Takafumi Koshinaka

Figure 1 for Generalized domain adaptation framework for parametric back-end in speaker recognition

Figure 2 for Generalized domain adaptation framework for parametric back-end in speaker recognition

Figure 3 for Generalized domain adaptation framework for parametric back-end in speaker recognition

Figure 4 for Generalized domain adaptation framework for parametric back-end in speaker recognition

Abstract:State-of-the-art speaker recognition systems comprise a speaker embedding front-end followed by a probabilistic linear discriminant analysis (PLDA) back-end. The effectiveness of these components relies on the availability of a large amount of labeled training data. In practice, it is common for domains (e.g., language, channel, demographic) in which a system is deployed to differ from that in which a system has been trained. To close the resulting gap, domain adaptation is often essential for PLDA models. Among two of its variants are Heavy-tailed PLDA (HT-PLDA) and Gaussian PLDA (G-PLDA). Though the former better fits real feature spaces than does the latter, its popularity has been severely limited by its computational complexity and, especially, by the difficulty, it presents in domain adaptation, which results from its non-Gaussian property. Various domain adaptation methods have been proposed for G-PLDA. This paper proposes a generalized framework for domain adaptation that can be applied to both of the above variants of PLDA for speaker recognition. It not only includes several existing supervised and unsupervised domain adaptation methods but also makes possible more flexible usage of available data in different domains. In particular, we introduce here two new techniques: (1) correlation-alignment in the model level, and (2) covariance regularization. To the best of our knowledge, this is the first proposed application of such techniques for domain adaptation w.r.t. HT-PLDA. The efficacy of the proposed techniques has been experimentally validated on NIST 2016, 2018, and 2019 Speaker Recognition Evaluation (SRE'16, SRE'18, and SRE'19) datasets.

Via

Access Paper or Ask Questions

Speaker-Aware Anti-Spoofing

Mar 02, 2023

Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen

Abstract:We address speaker-aware anti-spoofing, where prior knowledge of the target speaker is incorporated into a voice spoofing countermeasure (CM). In contrast to the frequently used speaker-independent solutions, we train the CM in a speaker-conditioned way. As a proof of concept, we consider speaker-aware extension to the state-of-the-art AASIST (audio anti-spoofing using integrated spectro-temporal graph attention networks) model. To this end, we consider two alternative strategies to incorporate target speaker information at the frame and utterance levels, respectively. The experimental results on a custom protocol based on ASVspoof 2019 dataset indicates the efficiency of the speaker information via enrollment: we obtain maximum relative improvements of 25.1% and 11.6% in equal error rate (EER) and minimum tandem detection cost function (t-DCF) over a speaker-independent baseline, respectively.

Via

Access Paper or Ask Questions

Incorporating Uncertainty from Speaker Embedding Estimation to Speaker Verification

Feb 23, 2023

Qiongqiong Wang, Kong Aik Lee, Tianchi Liu

Abstract:Speech utterances recorded under differing conditions exhibit varying degrees of confidence in their embedding estimates, i.e., uncertainty, even if they are extracted using the same neural network. This paper aims to incorporate the uncertainty estimate produced in the xi-vector network front-end with a probabilistic linear discriminant analysis (PLDA) back-end scoring for speaker verification. To achieve this we derive a posterior covariance matrix, which measures the uncertainty, from the frame-wise precisions to the embedding space. We propose a log-likelihood ratio function for the PLDA scoring with the uncertainty propagation. We also propose to replace the length normalization pre-processing technique with a length scaling technique for the application of uncertainty propagation in the back-end. Experimental results on the VoxCeleb-1, SITW test sets as well as a domain-mismatched CNCeleb1-E set show the effectiveness of the proposed techniques with 14.5%-41.3% EER reductions and 4.6%-25.3% minDCF reductions.

* Accepted in ICASSP 2023 conference

Via

Access Paper or Ask Questions

Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

Feb 22, 2023

Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang

Figure 1 for Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

Figure 2 for Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

Figure 3 for Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

Figure 4 for Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

Abstract:Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is modeling one modality aided by exploiting knowledge from another modality. Specifically, two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement over independently trained audio-only/visual-only and baseline fusion systems, respectively.

Via

Access Paper or Ask Questions

Probabilistic Back-ends for Online Speaker Recognition and Clustering

Feb 19, 2023

Alexey Sholokhov, Nikita Kuzmin, Kong Aik Lee, Eng Siong Chng

Abstract:This paper focuses on multi-enrollment speaker recognition which naturally occurs in the task of online speaker clustering, and studies the properties of different scoring back-ends in this scenario. First, we show that popular cosine scoring suffers from poor score calibration with a varying number of enrollment utterances. Second, we propose a simple replacement for cosine scoring based on an extremely constrained version of probabilistic linear discriminant analysis (PLDA). The proposed model improves over the cosine scoring for multi-enrollment recognition while keeping the same performance in the case of one-to-one comparisons. Finally, we consider an online speaker clustering task where each step naturally involves multi-enrollment recognition. We propose an online clustering algorithm allowing us to take benefits from the PLDA model such as the ability to handle uncertainty and better score calibration. Our experiments demonstrate the effectiveness of the proposed algorithm.

* Accepted to ICASSP 2023

Via

Access Paper or Ask Questions