Chong-Xin Gan

UNet-Based Fusion and Exponential Moving Average Adaptation for Noise-Robust Speaker Recognition

Apr 28, 2026

Chong-Xin Gan, Peter Bell, Man-Wai Mak, Zhe Li, Zezhong Jin, Zilong Huang, Kong Aik Lee

Abstract: The joint training of speech enhancement and speaker embedding networks is widely adopted for speaker recognition in noisy acoustic environments. While effective, this paradigm often fails to leverage the generalization and robustness benefits inherent in large-scale speech enhancement pre-training. Moreover, maintaining the speaker information in the denoised speech is not an explicit objective of the speech enhancement process. To address these limitations, we propose a scalable UNet-based Fusion framework (UF-EMA) that treats the noisy and enhanced speech as a multi-channel input, thereby enabling the speaker encoder to exploit speaker information effectively. In addition, an Exponential Moving Average strategy is applied to a speaker encoder pre-trained on clean speech to mitigate overfitting and facilitate a smooth transition from clean to noisy conditions. Experimental results on multiple noise-contaminated test sets demonstrate the superiority of the proposed approach.

* Submitted to Interspeech 2026
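
The abstract names two ingredients, a multi-channel input built from the noisy and enhanced speech and an exponential moving average (EMA) over encoder parameters, without giving their exact form. The following is a minimal sketch of the generic versions of both ideas, not the authors' implementation. The function names `stack_channels` and `ema_update`, the default decay value, and the use of plain Python lists in place of tensors are all assumptions made for illustration.

```python
from typing import List, Sequence


def stack_channels(noisy: Sequence[float], enhanced: Sequence[float]) -> List[List[float]]:
    """Stack two equal-length waveforms into a 2-channel input of shape (2, T).

    Hypothetical helper: channel 0 is the noisy signal, channel 1 the enhanced one.
    """
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced signals must have the same length")
    return [list(noisy), list(enhanced)]


def ema_update(
    ema_params: Sequence[float],
    current_params: Sequence[float],
    decay: float = 0.999,  # assumed value, not taken from the paper
) -> List[float]:
    """Generic EMA step: ema <- decay * ema + (1 - decay) * current."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    if len(ema_params) != len(current_params):
        raise ValueError("parameter lists must have the same length")
    return [decay * e + (1.0 - decay) * c for e, c in zip(ema_params, current_params)]


if __name__ == "__main__":
    two_channel = stack_channels([0.1, 0.2, 0.3], [0.0, 0.1, 0.2])
    print(len(two_channel), len(two_channel[0]))

    # Start from (hypothetical) clean-speech weights and drift slowly
    # toward weights obtained on noisy data.
    clean_weights = [1.0, 2.0]
    noisy_weights = [3.0, 4.0]
    print(ema_update(clean_weights, noisy_weights, decay=0.9))
```

A decay close to 1 keeps the averaged parameters near the clean-speech starting point, which is one plausible reading of the "smooth transition from clean to noisy conditions" described in the abstract.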

Asymmetric Clean Segments-Guided Self-Supervised Learning for Robust Speaker Verification

Sep 08, 2023

Chong-Xin Gan, Man-Wai Mak, Weiwei Lin, Jen-Tzung Chien

Abstract: Contrastive self-supervised learning (CSL) for speaker verification (SV) has drawn increasing interest recently due to its ability to exploit unlabeled data. Performing data augmentation on raw waveforms, such as adding noise or reverberation, plays a pivotal role in achieving promising results in SV. Data augmentation, however, demands meticulous calibration to keep speaker-specific information intact, which is difficult to achieve without speaker labels. To address this issue, we introduce a novel framework that incorporates clean and augmented segments into the contrastive training pipeline. The clean segments are repurposed to pair with noisy segments to form additional positive and negative pairs. Moreover, the contrastive loss is weighted to increase the difference between the clean and augmented embeddings of different speakers. Experimental results on VoxCeleb1 suggest that the proposed framework can achieve a remarkable 19% improvement over the conventional methods, and it surpasses many existing state-of-the-art techniques.

* 5 pages, 2 figures, submitted to ICASSP 2024
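
The pairing of clean and augmented segments described in the abstract can be illustrated with a generic InfoNCE-style loss. This is only a sketch under stated assumptions, not the paper's loss: the function name `weighted_clean_aug_loss`, the `neg_weight` parameter, the temperature value, and the choice to treat each utterance in a batch as a distinct speaker (a common assumption when speaker labels are unavailable) are all hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_clean_aug_loss(
    aug: Sequence[Sequence[float]],
    clean: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value
    neg_weight: float = 1.0,   # hypothetical weight on clean-vs-augmented negatives
) -> float:
    """InfoNCE-style loss with augmented anchors and clean targets.

    aug[i] and clean[i] come from the same utterance (positive pair);
    aug[i] and clean[j], j != i, are treated as different speakers (negatives).
    A neg_weight above 1 penalizes similarity to the negatives more strongly.
    """
    if len(aug) != len(clean):
        raise ValueError("aug and clean must contain the same number of embeddings")
    n = len(aug)
    total = 0.0
    for i in range(n):
        pos = math.exp(cosine(aug[i], clean[i]) / temperature)
        neg = sum(
            math.exp(cosine(aug[i], clean[j]) / temperature)
            for j in range(n)
            if j != i
        )
        total += -math.log(pos / (pos + neg_weight * neg))
    return total / n


if __name__ == "__main__":
    aug_emb = [[1.0, 0.0], [0.0, 1.0]]
    clean_emb = [[1.0, 0.0], [0.0, 1.0]]
    print(weighted_clean_aug_loss(aug_emb, clean_emb))
    print(weighted_clean_aug_loss(aug_emb, clean_emb, neg_weight=2.0))
```

Raising `neg_weight` increases the loss for any given similarity to the negatives, which pushes clean and augmented embeddings of different utterances further apart during training.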
