Oldřich Plchot

Improving Speaker Verification with Self-Pretrained Transformer Models

May 17, 2023
Junyi Peng, Oldřich Plchot, Themos Stafylakis, Ladislav Mošner, Lukáš Burget, Jan Černocký

Recently, fine-tuning large pre-trained Transformer models on downstream datasets has attracted growing interest. Despite their success, it is still challenging to disentangle the benefits of large-scale datasets and Transformer structures from the limitations of the pre-training. In this paper, we introduce a hierarchical training approach, named self-pretraining, in which Transformer models are pre-trained and fine-tuned on the same dataset. Three pre-trained models, HuBERT, Conformer and WavLM, are evaluated on four speaker verification datasets of varying sizes. Our experiments show that these self-pretrained models achieve competitive performance on downstream speaker verification tasks, such as VoxCeleb1 and CNCeleb1, with only one-third of the data compared to Librispeech pretraining. Furthermore, when pre-trained only on VoxCeleb2-dev, the Conformer model outperforms the one pre-trained on 94k hours of data using the same fine-tuning settings.

* Accepted to Interspeech 2023 

Parameter-efficient transfer learning of pre-trained Transformer models for speaker verification using adapters

Oct 28, 2022
Junyi Peng, Themos Stafylakis, Rongzhi Gu, Oldřich Plchot, Ladislav Mošner, Lukáš Burget, Jan Černocký

Recently, pre-trained Transformer models have attracted growing interest in the field of speech processing thanks to their great success in various downstream tasks. However, most fine-tuning approaches update all the parameters of the pre-trained model, which becomes prohibitive as the model size grows and sometimes results in overfitting on small datasets. In this paper, we conduct a comprehensive analysis of applying parameter-efficient transfer learning (PETL) methods to reduce the number of learnable parameters required for adapting to speaker verification tasks. Specifically, during the fine-tuning process, the pre-trained models are frozen, and only lightweight modules inserted in each Transformer block are trainable (a method known as adapters). Moreover, to boost the performance in a cross-language low-resource scenario, the Transformer model is further tuned on a large intermediate dataset before directly fine-tuning it on a small dataset. By updating fewer than 4% of the parameters, our proposed PETL-based methods achieve performance comparable to full fine-tuning (Vox1-O: 0.55%, Vox1-E: 0.82%, Vox1-H: 1.73%).
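
To make the parameter budget of adapters concrete, here is a minimal sketch that counts the trainable parameters added by bottleneck adapters relative to a frozen backbone. All names and numbers (`adapter_params`, `trainable_fraction`, the backbone size, the number of blocks, the hidden and bottleneck dimensions) are hypothetical illustrations, not values taken from the paper, and the adapter is assumed to be a plain down-projection followed by an up-projection with biases.

```python
def adapter_params(hidden_dim: int, bottleneck_dim: int) -> int:
    """Parameter count of one bottleneck adapter (assumed layout:
    down-projection + bias, then up-projection + bias)."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim
    up = bottleneck_dim * hidden_dim + hidden_dim
    return down + up


def trainable_fraction(backbone_params: int, num_blocks: int,
                       adapters_per_block: int, hidden_dim: int,
                       bottleneck_dim: int) -> float:
    """Share of all parameters that are trainable when the backbone
    is frozen and only the inserted adapters are updated."""
    added = num_blocks * adapters_per_block * adapter_params(hidden_dim, bottleneck_dim)
    return added / (backbone_params + added)


if __name__ == "__main__":
    # Hypothetical configuration chosen only for illustration.
    frac = trainable_fraction(
        backbone_params=95_000_000,
        num_blocks=12,
        adapters_per_block=2,
        hidden_dim=768,
        bottleneck_dim=64,
    )
    print(f"trainable share: {frac:.2%}")
```

Under these assumed sizes the trainable share stays in the low single-digit percent range, which is the regime the abstract describes; the paper's own configuration may differ.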

* Submitted to ICASSP 2023

Training Speaker Embedding Extractors Using Multi-Speaker Audio with Unknown Speaker Boundaries

Mar 29, 2022
Themos Stafylakis, Ladislav Mošner, Oldřich Plchot, Johan Rohdin, Anna Silnova, Lukáš Burget, Jan "Honza" Černocký

In this paper, we demonstrate a method for training speaker embedding extractors using weak annotation. More specifically, we use the full VoxCeleb recordings and the names of the celebrities appearing in each video, without knowledge of the time intervals in which the celebrities appear. We show that by combining a baseline speaker diarization algorithm that requires no training or parameter tuning, a modified loss with aggregation over segments, and a two-stage training approach, we are able to train a competitive ResNet-based embedding extractor. Finally, we experiment with two different aggregation functions and analyze their behaviour in terms of their gradients.

* Submitted to Interspeech 2022 

Probabilistic Spherical Discriminant Analysis: An Alternative to PLDA for length-normalized embeddings

Mar 28, 2022
Niko Brümmer, Albert Swart, Ladislav Mošner, Anna Silnova, Oldřich Plchot, Themos Stafylakis, Lukáš Burget

In speaker recognition, where speech segments are mapped to embeddings on the unit hypersphere, two scoring backends are commonly used, namely cosine scoring and PLDA. Both have advantages and disadvantages, depending on the context. Cosine scoring follows naturally from the spherical geometry, but for PLDA the blessing is mixed -- length normalization Gaussianizes the between-speaker distribution, but violates the assumption of a speaker-independent within-speaker distribution. We propose PSDA, an analogue to PLDA that uses von Mises-Fisher distributions on the hypersphere for both within-class and between-class distributions. We show how the self-conjugacy of this distribution gives closed-form likelihood-ratio scores, making it a drop-in replacement for PLDA at scoring time. All kinds of trials can be scored, including single-enroll and multi-enroll verification, as well as more complex likelihood ratios that could be used in clustering and diarization. Learning is done via an EM algorithm with closed-form updates. We explain the model and present some first experiments.
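
As a point of reference for the cosine backend mentioned above, the following minimal sketch length-normalizes two embeddings onto the unit hypersphere and scores them with their inner product. It illustrates plain cosine scoring only, not PSDA; the function names and the toy two-dimensional vectors are assumptions made for the example.

```python
import math


def length_normalize(x):
    """Project an embedding onto the unit hypersphere."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in x]


def cosine_score(enroll, test):
    """Cosine similarity: inner product of the length-normalized embeddings."""
    e = length_normalize(enroll)
    t = length_normalize(test)
    return sum(a * b for a, b in zip(e, t))


if __name__ == "__main__":
    # Toy embeddings; real speaker embeddings have hundreds of dimensions.
    print(cosine_score([1.0, 0.0], [2.0, 0.0]))  # same direction
    print(cosine_score([1.0, 0.0], [0.0, 3.0]))  # orthogonal directions
```

Because both vectors are normalized first, the score depends only on the angle between the embeddings, which is why it fits the spherical geometry described in the abstract.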

* Submitted to Interspeech 2022 

MultiSV: Dataset for Far-Field Multi-Channel Speaker Verification

Nov 11, 2021
Ladislav Mošner, Oldřich Plchot, Lukáš Burget, Jan Černocký

Motivated by the unconsolidated data situation and the lack of a standard benchmark in the field, we complement our previous efforts and present a comprehensive corpus designed for training and evaluating text-independent multi-channel speaker verification systems. It can also be readily used for experiments with dereverberation, denoising, and speech enhancement. We tackled the ever-present problem of the lack of multi-channel training data by utilizing data simulation on top of clean parts of the VoxCeleb dataset. The development and evaluation trials are based on the retransmitted Voices Obscured in Complex Environmental Settings (VOiCES) corpus, which we modified to provide multi-channel trials. We publish full recipes that create the dataset from public sources as the MultiSV corpus, and we provide results with two of our multi-channel speaker verification systems with neural network-based beamforming based either on predicting ideal binary masks or on the more recent Conv-TasNet.

* Submitted to ICASSP 2022 

BUT System Description to VoxCeleb Speaker Recognition Challenge 2019

Oct 16, 2019
Hossein Zeinali, Shuai Wang, Anna Silnova, Pavel Matějka, Oldřich Plchot

In this report, we describe the submission of the Brno University of Technology (BUT) team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019. We also provide a brief analysis of different systems on VoxCeleb-1 test sets. Submitted systems for both Fixed and Open conditions are a fusion of 4 Convolutional Neural Network (CNN) topologies. The first and second networks have a ResNet34 topology and use two-dimensional CNNs. The last two networks are one-dimensional CNNs and are based on the x-vector extraction topology. Some of the networks are fine-tuned using additive margin angular softmax. Kaldi FBanks and Kaldi PLPs were used as features. The difference between Fixed and Open systems lies in the training data used and the fusion strategy. The best systems for Fixed and Open conditions achieved 1.42% and 1.26% EER on the challenge evaluation set, respectively.

Learning document embeddings along with their uncertainties

Aug 29, 2019
Santosh Kesiraju, Oldřich Plchot, Lukáš Burget, Suryakanth V Gangashetty

The majority of text modelling techniques yield only point estimates of document embeddings and fail to capture the uncertainty of the estimates. These uncertainties give a notion of how well the embeddings represent a document. We present the Bayesian subspace multinomial model (Bayesian SMM), a generative log-linear model that learns to represent documents in the form of Gaussian distributions, thereby encoding the uncertainty in its covariance. Additionally, in the proposed Bayesian SMM, we address a commonly encountered problem of intractability that appears during variational inference in mixed-logit models. We also present a generative Gaussian linear classifier for topic identification that exploits the uncertainty in document embeddings. Our intrinsic evaluation using the perplexity measure shows that the proposed Bayesian SMM fits the data better than a variational auto-encoder based document model. Our topic identification experiments on speech (Fisher) and text (20Newsgroups) corpora show that the proposed Bayesian SMM is robust to over-fitting on unseen test data. The topic ID results show that the proposed model is significantly better than variational auto-encoder based methods and achieves similar results when compared to fully supervised discriminative models.

BUT VOiCES 2019 System Description

Jul 13, 2019
Hossein Zeinali, Pavel Matějka, Ladislav Mošner, Oldřich Plchot, Anna Silnova, Ondřej Novotný, Ján Profant, Ondřej Glembek, Lukáš Burget

This is a description of our effort in the VOiCES 2019 Speaker Recognition challenge. All systems in the fixed condition are based on the x-vector paradigm with different features and DNN topologies. The single best system reaches 1.2% EER and a fusion of 3 systems yields 1.0% EER, which is a 15% relative improvement. The open condition allowed us to use external data, which we did for PLDA adaptation, achieving a relative improvement of less than ~10%. In the submission to the open condition, we used 3 x-vector systems and also one i-vector based system.
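
For readers less familiar with the metrics quoted above, the sketch below shows one simple way to estimate an equal error rate (EER) from trial scores and to compute a relative improvement between two EERs. It is an illustrative approximation that picks the threshold where false-acceptance and false-rejection rates are closest, not the scoring tool used in the challenge; the function names and toy scores are assumptions.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline


def equal_error_rate(target_scores, nontarget_scores) -> float:
    """Approximate EER: sweep every observed score as a threshold and
    return the mean of FAR and FRR where the two are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for th in thresholds:
        # False rejection: target trials scoring below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False acceptance: non-target trials scoring at or above it.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


if __name__ == "__main__":
    # Toy scores, not challenge data.
    targets = [0.9, 0.8, 0.7, 0.4]
    nontargets = [0.6, 0.3, 0.2, 0.1]
    print(equal_error_rate(targets, nontargets))
    print(relative_improvement(1.2, 1.0))
```

Applied to the rounded figures quoted above (1.2% and 1.0%), `relative_improvement` gives roughly 17%; the reported 15% was presumably computed from unrounded EERs, which is an assumption on our part.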
