Ivan Yakovlev

ReDimNet2: Scaling Speaker Verification via Time-Pooled Dimension Reshaping

Mar 12, 2026

Ivan Yakovlev, Anton Okhotnikov

Abstract: We present ReDimNet2, an improved neural network architecture for extracting utterance-level speaker representations that builds upon the ReDimNet dimension-reshaping framework. The key modification in ReDimNet2 is the introduction of pooling over the time dimension within the 1D processing pathway. This operation preserves the nature of the 1D feature space, since 1D features remain a reshaped version of 2D features regardless of temporal resolution, while enabling significantly more aggressive scaling of the channel dimension without a proportional increase in compute. We introduce a family of seven model configurations (B0-B6) ranging from 1.1M to 12.3M parameters and from 0.33 to 13 GMACs. Experimental results on VoxCeleb1 benchmarks demonstrate that ReDimNet2 improves the Pareto front of computational cost versus accuracy at every scale point compared to ReDimNet, achieving 0.287% EER on Vox1-O with 12.3M parameters and 13 GMACs.

* Submitted to Interspeech 2026
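
The compute argument in the abstract above can be illustrated with simple multiply-accumulate (MAC) arithmetic. The sketch below is not taken from the paper: it assumes a plain dense 1D convolution, and the layer sizes are hypothetical values chosen only for illustration. It shows that halving the number of time frames lets the channel count double at 2x the cost instead of 4x.

```python
def conv1d_macs(c_in: int, c_out: int, kernel: int, t: int) -> int:
    """MACs of a dense 1D convolution producing t output frames."""
    return c_in * c_out * kernel * t


# Hypothetical sizes for illustration only (not from the paper).
channels, kernel, frames = 256, 3, 200

base = conv1d_macs(channels, channels, kernel, frames)

# Double the channels after pooling the time axis by a factor of 2.
pooled_wide = conv1d_macs(2 * channels, 2 * channels, kernel, frames // 2)

# Double the channels without any time pooling.
unpooled_wide = conv1d_macs(2 * channels, 2 * channels, kernel, frames)

print(pooled_wide / base)    # 2.0
print(unpooled_wide / base)  # 4.0
```

The actual ReDimNet2 blocks may use different layer types, so the exact ratios there can differ from this simplified count.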

Study on Inter and Intra Speaker Variability in Speaker Recognition

Nov 12, 2024

Anton Okhotnikov, Nikita Torgashov, Ivan Yakovlev, Pavel Malov, Rostislav Makarov

Abstract: Optimizing the trade-off between the number of speakers and their temporal variability (or session diversity) is crucial for developing a speaker recognition system while keeping the data collection process feasible in terms of time. In this article, we analyze the dependency between inter- and intra-speaker variability in training data for a modern neural network-based speaker recognition system, using the VoxTube dataset for the text-independent speaker recognition task. An auxiliary contribution of this work is the release of per-utterance upload date metadata for the VoxTube dataset. We want this article to contribute to guidelines and best practices for collecting and filtering data from media hosting platforms, to facilitate the efforts of researchers developing speaker recognition systems.

Reshape Dimensions Network for Speaker Recognition

Jul 25, 2024

Ivan Yakovlev, Rostislav Makarov, Andrei Balykin, Pavel Malov, Anton Okhotnikov, Nikita Torgashov

Abstract: In this paper, we present Reshape Dimensions Network (ReDimNet), a novel neural network architecture for extracting utterance-level speaker representations. Our approach leverages dimensionality reshaping of 2D feature maps to a 1D signal representation and vice versa, enabling the joint usage of 1D and 2D blocks. We propose an original network topology that preserves the volume of channel-timestep-frequency outputs of 1D and 2D blocks, facilitating efficient aggregation of residual feature maps. Moreover, ReDimNet is efficiently scalable, and we introduce a range of model sizes, varying from 1M to 15M parameters and from 0.5 to 20 GMACs. Our experimental results demonstrate that ReDimNet achieves state-of-the-art performance in speaker recognition while reducing computational complexity and the number of model parameters.

* Accepted to Interspeech 2024
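
The 2D-to-1D reshaping described in the abstract above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes the 2D feature map is laid out as (channel, frequency, time) and that the 1D form merges channel and frequency into one axis. The helper names `to_1d` and `to_2d` are hypothetical.

```python
from typing import List

Map2D = List[List[List[float]]]  # indexed as [channel][frequency][time]
Map1D = List[List[float]]        # indexed as [channel * frequency][time]


def to_1d(x: Map2D) -> Map1D:
    """Merge the channel and frequency axes: (C, F, T) -> (C*F, T)."""
    return [row for channel in x for row in channel]


def to_2d(y: Map1D, freq_bins: int) -> Map2D:
    """Split the merged axis back: (C*F, T) -> (C, F, T)."""
    return [y[i:i + freq_bins] for i in range(0, len(y), freq_bins)]


# Tiny example tensor with C=2 channels, F=3 frequency bins, T=4 frames.
C, F, T = 2, 3, 4
x = [[[float(c * 100 + f * 10 + t) for t in range(T)]
      for f in range(F)] for c in range(C)]

y = to_1d(x)
print(len(y), len(y[0]))  # 6 4
print(to_2d(y, F) == x)   # True
```

Both forms hold the same C*F*T values, so the total volume of the feature map is unchanged by the reshape and the round trip recovers the original tensor.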

LRPD: Large Replay Parallel Dataset

Sep 29, 2023

Ivan Yakovlev, Mikhail Melnikov, Nikita Bukhal, Rostislav Makarov, Alexander Alenin, Nikita Torgashov, Anton Okhotnikov

Abstract: The latest research in the field of voice anti-spoofing (VAS) shows that deep neural networks (DNNs) outperform classic approaches such as GMM in the task of presentation attack detection. However, DNNs require a lot of data to converge and still lack generalization ability. To foster the progress of neural network systems, we introduce the Large Replay Parallel Dataset (LRPD), aimed at the detection of replay attacks. LRPD contains more than 1M utterances collected by 19 recording devices in 17 different environments. We also provide an example training pipeline in PyTorch [1] and a baseline system that achieves 0.28% Equal Error Rate (EER) on the evaluation subset of LRPD and 11.91% EER on the publicly available ASVspoof 2017 [2] eval set. These results show that a model trained on the LRPD dataset performs consistently under fully unknown conditions. Our dataset is free for research purposes and hosted on GDrive. Baseline code and pre-trained models are available on GitHub.

* ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6612-6616
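
The abstracts on this page report results as Equal Error Rate (EER), the operating point where the false acceptance rate equals the false rejection rate. The sketch below is a simple threshold sweep over made-up scores, not the evaluation code used in any of these papers. The function name `equal_error_rate` and the score lists are hypothetical.

```python
from typing import Sequence


def equal_error_rate(positives: Sequence[float],
                     negatives: Sequence[float]) -> float:
    """Approximate EER by sweeping a threshold over all observed scores."""
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(positives) | set(negatives)):
        # False rejection: positive trials scored below the threshold.
        frr = sum(s < thr for s in positives) / len(positives)
        # False acceptance: negative trials scored at or above the threshold.
        far = sum(s >= thr for s in negatives) / len(negatives)
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Made-up scores for illustration: higher means "more likely positive".
positives = [0.9, 0.8, 0.7, 0.4]
negatives = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(positives, negatives))  # 0.25
```

With few trials the two error rates may never match exactly, which is why the sketch averages them at the closest point. Evaluation toolkits typically interpolate the error curves instead.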

The ID R&D VoxCeleb Speaker Recognition Challenge 2023 System Description

Aug 20, 2023

Nikita Torgashov, Rostislav Makarov, Ivan Yakovlev, Pavel Malov, Andrei Balykin, Anton Okhotnikov

Abstract: This report describes the ID R&D team's submissions for Track 2 (open) of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). Our solution is based on the fusion of deep ResNets and self-supervised learning (SSL) based models trained on a mixture of the VoxCeleb2 dataset and a large version of the VoxTube dataset. The final submission to Track 2 achieved first place on the VoxSRC-23 public leaderboard with a minDCF(0.05) of 0.0762 and an EER of 1.30%.
