Simon Malan

ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

Feb 17, 2026

Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper

Abstract: Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.

* 3 figures, 2 tables
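
The abstract above names three steps: per-frame L2 norms, mean-pooling of the resulting segments, and K-means discretization. The following is only a rough NumPy sketch of that flow, not the authors' implementation. It assumes the features are already extracted as a `(T, D)` array, and it uses a hypothetical boundary rule (a boundary at every strict local minimum of the norm curve), since the abstract does not say how the norms are turned into boundaries. The function names and the pre-fitted `centroids` array are likewise illustrative assumptions.

```python
import numpy as np

def frame_norms(features):
    """L2 norm of each frame; `features` has shape (T, D)."""
    return np.linalg.norm(features, axis=1)

def boundaries_from_norms(norms):
    """Hypothetical rule (an assumption, not from the paper):
    place a boundary at every strict local minimum of the norm curve."""
    t = np.arange(1, len(norms) - 1)
    is_min = (norms[t] < norms[t - 1]) & (norms[t] < norms[t + 1])
    # Always include the start and end of the utterance.
    return np.concatenate(([0], t[is_min], [len(norms)]))

def mean_pool(features, bounds):
    """Average the frames inside each [start, end) segment."""
    return np.stack([features[s:e].mean(axis=0)
                     for s, e in zip(bounds[:-1], bounds[1:])])

def assign_tokens(embeddings, centroids):
    """Discretize by nearest centroid (the assignment step of K-means).
    `centroids` is assumed to come from a previously fitted K-means model."""
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Toy input: six 2-D "frames" standing in for encoder features.
features = np.array([[1.0, 0.0], [3.0, 0.0], [0.5, 0.0],
                     [0.0, 2.0], [0.0, 4.0], [0.0, 2.0]])
bounds = boundaries_from_norms(frame_norms(features))
segments = mean_pool(features, bounds)
tokens = assign_tokens(segments, np.array([[2.0, 0.0], [0.0, 2.0]]))
```

In this toy input the norm curve dips at frame 2, so the six frames collapse into two pooled segments and two tokens, which illustrates how segment-level units shorten the token sequence.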

Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

Sep 22, 2024

Simon Malan, Benjamin van Niekerk, Herman Kamper

Abstract: We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach achieves state-of-the-art results similar to those of the new ES-KMeans+ method, while being almost five times faster.

* 3 figures, 3 tables
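
The boundary-prediction idea in the abstract above (dissimilarity between adjacent features) can be illustrated with a short NumPy sketch. This is not the authors' code. It assumes cosine distance as the dissimilarity measure and a hypothetical peak-picking rule with a fixed `threshold`, neither of which is stated in the abstract.

```python
import numpy as np

def adjacent_dissimilarity(features):
    """Cosine distance between each frame and the next; returns T-1 values.
    Cosine distance is an assumed choice of dissimilarity."""
    a, b = features[:-1], features[1:]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
    return 1.0 - cos

def predict_boundaries(features, threshold=0.5):
    """Hypothetical rule: a boundary falls before frame i+1 when the
    dissimilarity d[i] is a local peak and exceeds `threshold`."""
    d = adjacent_dissimilarity(features)
    padded = np.concatenate(([-np.inf], d, [-np.inf]))
    is_peak = (d > padded[:-2]) & (d >= padded[2:]) & (d > threshold)
    return (np.flatnonzero(is_peak) + 1).tolist()

# Toy input: the frames switch direction twice, so two boundaries are expected.
features = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0],
                     [0.1, 1.0], [1.0, 0.0]])
word_boundaries = predict_boundaries(features)
```

The predicted segments between consecutive boundaries would then be pooled and clustered to build the lexicon, as the abstract describes.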
