Abstract: Self-supervised learning (SSL) models such as Wav2Vec 2.0 and HuBERT have shown remarkable success in extracting phonetic information from raw audio without labelled data. While prior work has demonstrated that SSL embeddings encode phonetic features at the frame level, it remains unclear whether these models preserve temporal structure, specifically whether embeddings at phoneme boundaries reflect the identity and order of adjacent phonemes. This study investigates the extent to which boundary-sensitive embeddings from HubertSoft, a soft-clustering variant of HuBERT, encode phoneme transitions. Using the CORPRES Russian speech corpus, we labelled 20 ms embedding windows with triplets of phonemes corresponding to their start, centre, and end segments. A neural network was trained to predict the phoneme at each of these positions separately, and temporal sensitivity was assessed with multiple evaluation metrics: ordered accuracy, unordered accuracy, and a flexible centre accuracy. Results show that embeddings extracted at phoneme boundaries capture both phoneme identity and temporal order, with especially high accuracy at segment boundaries. Confusion patterns further suggest that the model encodes articulatory detail and coarticulatory effects. These findings contribute to our understanding of the internal structure of SSL speech representations and their potential for phonological analysis and fine-grained transcription tasks.
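To make the three metric names concrete, the following is a minimal Python sketch of plausible definitions for triplet evaluation; the function names, the matching rules, and the example data are illustrative assumptions, not the study's released implementation.

```python
# Illustrative sketch (assumed definitions, not the paper's code) of the three
# triplet metrics named in the abstract: ordered accuracy, unordered accuracy,
# and a flexible centre accuracy. Each 20 ms window is labelled with a triplet
# of phonemes (start, centre, end).

from typing import Sequence, Tuple

Triplet = Tuple[str, str, str]  # (start, centre, end) phonemes of one window


def ordered_accuracy(preds: Sequence[Triplet], golds: Sequence[Triplet]) -> float:
    """Fraction of windows where all three positions match in order."""
    hits = sum(p == g for p, g in zip(preds, golds))
    return hits / len(golds)


def unordered_accuracy(preds: Sequence[Triplet], golds: Sequence[Triplet]) -> float:
    """Fraction of windows where the predicted phonemes match the gold
    phonemes as a multiset, ignoring their order within the triplet."""
    hits = sum(sorted(p) == sorted(g) for p, g in zip(preds, golds))
    return hits / len(golds)


def flexible_centre_accuracy(preds: Sequence[Triplet], golds: Sequence[Triplet]) -> float:
    """Fraction of windows where the predicted centre phoneme appears anywhere
    in the gold triplet, crediting near-misses from boundary placement."""
    hits = sum(p[1] in g for p, g in zip(preds, golds))
    return hits / len(golds)


if __name__ == "__main__":
    gold = [("t", "a", "k"), ("s", "o", "n")]
    pred = [("t", "a", "k"), ("o", "s", "n")]  # 2nd window: right phonemes, wrong order
    print(ordered_accuracy(pred, gold))          # 0.5
    print(unordered_accuracy(pred, gold))        # 1.0
    print(flexible_centre_accuracy(pred, gold))  # 1.0 ("s" occurs in the gold triplet)
```

Under these assumed definitions, the gap between ordered and unordered accuracy isolates order sensitivity, while the flexible centre score tolerates small misalignments between the window and the annotated phoneme boundaries.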