Oli Liu

Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models

Jun 12, 2025

Michele Gubian, Ioana Krehan, Oli Liu, James Kirby, Sharon Goldwater

Abstract: Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.
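
As a rough illustration of what "largely orthogonal subspaces" can mean in practice, the sketch below estimates a subspace for each label type from the principal directions of its class means and then compares two subspaces through the cosines of their principal angles. This is only one possible geometric measure, not necessarily the one used in the paper. The function names (`class_mean_subspace`, `principal_cosines`), the choice of `k`, and the synthetic data standing in for model representations are all assumptions made for this example.

```python
import numpy as np

def class_mean_subspace(X, labels, k):
    """Orthonormal basis (d x k) spanning the top-k principal directions
    of the centered class means. Hypothetical helper for illustration."""
    classes = np.unique(labels)
    means = np.stack([X[labels == c].mean(axis=0) for c in classes])
    means = means - means.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered class means.
    _, _, vt = np.linalg.svd(means, full_matrices=False)
    return vt[:k].T

def principal_cosines(A, B):
    """Cosines of the principal angles between the column spaces of A and B.
    Both inputs must have orthonormal columns. Values near 0 mean the
    subspaces are close to orthogonal; values near 1 mean they overlap."""
    return np.linalg.svd(A.T @ B, compute_uv=False)

# Synthetic stand-in for frame-level representations (assumption):
# phone offsets live in dims 0-3, speaker offsets in dims 4-7.
rng = np.random.default_rng(0)
d, n = 16, 2000
phone = np.arange(n) % 5            # 5 phone classes
speaker = (np.arange(n) // 5) % 4   # 4 speakers, balanced across phones
phone_offsets = np.zeros((5, d))
phone_offsets[:, :4] = 3.0 * rng.normal(size=(5, 4))
speaker_offsets = np.zeros((4, d))
speaker_offsets[:, 4:8] = 3.0 * rng.normal(size=(4, 4))
X = phone_offsets[phone] + speaker_offsets[speaker] + 0.1 * rng.normal(size=(n, d))

P = class_mean_subspace(X, phone, k=2)
S = class_mean_subspace(X, speaker, k=2)
cosines = principal_cosines(P, S)   # small values: near-orthogonal subspaces
self_cosines = principal_cosines(P, P)  # sanity check: a subspace against itself
```

On real data, `X` would hold one layer's frame representations and the labels would come from forced alignments and speaker metadata; the cosines could then be tracked layer by layer.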

Self-supervised Predictive Coding Models Encode Speaker and Phonetic Information in Orthogonal Subspaces

May 21, 2023

Oli Liu, Hao Tang, Sharon Goldwater

Abstract: Self-supervised speech representations are known to encode both speaker and phonetic information, but how they are distributed in the high-dimensional space remains largely unexplored. We hypothesize that they are encoded in orthogonal subspaces, a property that lends itself to simple disentanglement. Applying principal component analysis to representations of two predictive coding models, we identify two subspaces that capture speaker and phonetic variances, and confirm that they are nearly orthogonal. Based on this property, we propose a new speaker normalization method which collapses the subspace that encodes speaker information, without requiring transcriptions. Probing experiments show that our method effectively eliminates speaker information and outperforms a previous baseline in phone discrimination tasks. Moreover, the approach generalizes and can be used to remove information of unseen speakers.
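
The abstract does not spell out the normalization procedure, so the following is only a minimal sketch of one plausible reading of "collapsing the speaker subspace": run PCA on per-speaker mean representations, then project every frame onto the orthogonal complement of the top speaker directions. Only speaker labels are used, with no transcriptions. The helper names (`speaker_directions`, `collapse_subspace`, `mean_spread`), the number of directions `k`, and the synthetic data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def speaker_directions(X, speakers, k):
    """Top-k principal directions (d x k, orthonormal columns) of the
    centered per-speaker means. Hypothetical helper for illustration."""
    means = np.stack([X[speakers == s].mean(axis=0) for s in np.unique(speakers)])
    means = means - means.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(means, full_matrices=False)
    return vt[:k].T

def collapse_subspace(X, basis):
    """Remove from each row of X its component inside span(basis)."""
    return X - (X @ basis) @ basis.T

def mean_spread(X, labels):
    """Frobenius norm of the centered class means: a crude measure of how
    much the class identity shifts the representation."""
    means = np.stack([X[labels == c].mean(axis=0) for c in np.unique(labels)])
    return float(np.linalg.norm(means - means.mean(axis=0, keepdims=True)))

# Synthetic stand-in for frame-level representations (assumption):
# phone offsets live in dims 0-3, speaker offsets in dims 4-7.
rng = np.random.default_rng(1)
d, n = 16, 2000
phone = np.arange(n) % 5
speaker = (np.arange(n) // 5) % 4
phone_offsets = np.zeros((5, d))
phone_offsets[:, :4] = 3.0 * rng.normal(size=(5, 4))
speaker_offsets = np.zeros((4, d))
speaker_offsets[:, 4:8] = 3.0 * rng.normal(size=(4, 4))
X = phone_offsets[phone] + speaker_offsets[speaker] + 0.1 * rng.normal(size=(n, d))

# With 4 speakers the centered means span at most 3 directions.
basis = speaker_directions(X, speaker, k=3)
X_norm = collapse_subspace(X, basis)

speaker_before, speaker_after = mean_spread(X, speaker), mean_spread(X_norm, speaker)
phone_before, phone_after = mean_spread(X, phone), mean_spread(X_norm, phone)
```

In this toy setting the speaker spread drops to roughly zero while the phone spread is nearly unchanged, which is the behavior the near-orthogonality property would predict. Applying a `basis` estimated on one set of speakers to frames from held-out speakers would be the analogue of the unseen-speaker case mentioned in the abstract.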
