Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Herman Kamper

Automatically assessing oral narratives of Afrikaans and isiXhosa children

Jul 18, 2025

Retief Louw, Emma Sharratt, Febe de Wet, Christiaan Jacobs, Annelien Smith, Herman Kamper

Abstract:Developing narrative and comprehension skills in early childhood is critical for later literacy. However, teachers in large preschool classrooms struggle to accurately identify students who require intervention. We present a system for automatically assessing oral narratives of preschool children in Afrikaans and isiXhosa. The system uses automatic speech recognition followed by a machine learning scoring model to predict narrative and comprehension scores. For scoring predicted transcripts, we compare a linear model to a large language model (LLM). The LLM-based system outperforms the linear model in most cases, but the linear system is competitive despite its simplicity. The LLM-based system is comparable to a human expert in flagging children who require intervention. We lay the foundation for automatic oral assessments in classrooms, giving teachers extra capacity to focus on personalised support for children's learning.

* Accepted to SLaTE 2025

Via

Access Paper or Ask Questions

Feature-based analysis of oral narratives from Afrikaans and isiXhosa children

Jul 17, 2025

Emma Sharratt, Annelien Smith, Retief Louw, Daleen Klop, Febe de Wet, Herman Kamper

Abstract:Oral narrative skills are strong predictors of later literacy development. This study examines the features of oral narratives from children who were identified by experts as requiring intervention. Using simple machine learning methods, we analyse recorded stories from four- and five-year-old Afrikaans- and isiXhosa-speaking children. Consistent with prior research, we identify lexical diversity (unique words) and length-based features (mean utterance length) as indicators of typical development, but features like articulation rate prove less informative. Despite cross-linguistic variation in part-of-speech patterns, the use of specific verbs and auxiliaries associated with goal-directed storytelling is correlated with a reduced likelihood of requiring intervention. Our analysis of two linguistically distinct languages reveals both language-specific and shared predictors of narrative proficiency, with implications for early assessment in multilingual contexts.

* SLaTE 2025 in Nijmegen, Netherlands

Via

Access Paper or Ask Questions

Towards few-shot isolated word reading assessment

Jul 16, 2025

Reuben Smit, Retief Louw, Herman Kamper

Abstract:We explore an ASR-free method for isolated word reading assessment in low-resource settings. Our few-shot approach compares input child speech to a small set of adult-provided reference templates. Inputs and templates are encoded using intermediate layers from large self-supervised learned (SSL) models. Using an Afrikaans child speech benchmark, we investigate design options such as discretising SSL features and barycentre averaging of the templates. Idealised experiments show reasonable performance for adults, but a substantial drop for child speech input, even with child templates. Despite the success of employing SSL representations in low-resource speech tasks, our work highlights the limitations of SSL representations for processing child data when used in a few-shot classification system.

* Accepted to SLaTE 2025

Via

Access Paper or Ask Questions

Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis

Jul 02, 2025

Marc-André Carbonneau, Benjamin van Niekerk, Hugo Seuté, Jean-Philippe Letendre, Herman Kamper, Julian Zaïdi

Abstract:Modeling voice identity is challenging due to its multifaceted nature. In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings, designed for discrimination rather than characterizing identity. This paper investigates which aspects of a voice are captured in such representations. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. We also identify confounding factors that compromise speaker similarity measurements and suggest mitigation strategies. To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns. This work contributes to the ongoing challenge of assessing speaker identity consistency in the context of ever-better voice cloning systems. We publicly release our code.

* Accepted at SSW13 - Interspeech 2025 Speech Synthesis Workshop

Via

Access Paper or Ask Questions

The mutual exclusivity bias of bilingual visually grounded speech models

Jun 04, 2025

Dan Oneata, Leanne Nortje, Yevgen Matusevych, Herman Kamper

Abstract:Mutual exclusivity (ME) is a strategy where a novel word is associated with a novel object rather than a familiar one, facilitating language learning in children. Recent work has found an ME bias in a visually grounded speech (VGS) model trained on English speech with paired images. But ME has also been studied in bilingual children, who may employ it less due to cross-lingual ambiguity. We explore this pattern computationally using bilingual VGS models trained on combinations of English, French, and Dutch. We find that bilingual models generally exhibit a weaker ME bias than monolingual models, though exceptions exist. Analyses show that the combined visual embeddings of bilingual models have a smaller variance for familiar data, partly explaining the increase in confusion between novel and familiar concepts. We also provide new insights into why the ME bias exists in VGS models in the first place. Code and data: https://github.com/danoneata/me-vgs

* Interspeech 2025

Via

Access Paper or Ask Questions

Spoken Language Modeling with Duration-Penalized Self-Supervised Units

May 29, 2025

Nicol Visser, Herman Kamper

Abstract:Spoken language models (SLMs) operate on acoustic units obtained by discretizing self-supervised speech representations. Although the characteristics of these units directly affect performance, the interaction between codebook size and unit coarseness (i.e., duration) remains unexplored. We investigate SLM performance as we vary codebook size and unit coarseness using the simple duration-penalized dynamic programming (DPDP) method. New analyses are performed across different linguistic levels. At the phone and word levels, coarseness provides little benefit, as long as the codebook size is chosen appropriately. However, when producing whole sentences in a resynthesis task, SLMs perform better with coarser units. In lexical and syntactic language modeling tasks, coarser units also give higher accuracies at lower bitrates. We therefore show that coarser units aren't always better, but that DPDP is a simple and efficient way to obtain coarser units for the tasks where they are beneficial.

* Accepted to Interspeech 2025

Via

Access Paper or Ask Questions

Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives

Jan 11, 2025

Christiaan Jacobs, Annelien Smith, Daleen Klop, Ondřej Klejch, Febe de Wet, Herman Kamper

Abstract:We develop automatic speech recognition (ASR) systems for stories told by Afrikaans and isiXhosa preschool children. Oral narratives provide a way to assess children's language development before they learn to read. We consider a range of prior child-speech ASR strategies to determine which is best suited to this unique setting. Using Whisper and only 5 minutes of transcribed in-domain child speech, we find that additional in-domain adult data (adult speech matching the story domain) provides the biggest improvement, especially when coupled with voice conversion. Semi-supervised learning also helps for both languages, while parameter-efficient fine-tuning helps on Afrikaans but not on isiXhosa (which is under-represented in the Whisper model). Few child-speech studies look at non-English data, and even fewer at the preschool ages of 4 and 5. Our work therefore represents a unique validation of a wide range of previous child-speech ASR strategies in an under-explored setting.

* Accepted to ICASSP 2025

Via

Access Paper or Ask Questions

MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model

Jan 10, 2025

Matthew Baas, Pieter Scholtz, Arnav Mehta, Elliott Dyson, Akshat Prakash, Herman Kamper

Figure 1 for MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model

Figure 2 for MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model

Figure 3 for MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model

Abstract:Codec-based text-to-speech (TTS) models have shown impressive quality with zero-shot voice cloning abilities. However, they often struggle with more expressive references or complex text inputs. We present MARS6, a robust encoder-decoder transformer for rapid, expressive TTS. MARS6 is built on recent improvements in spoken language modelling. Utilizing a hierarchical setup for its decoder, new speech tokens are processed at a rate of only 12 Hz, enabling efficient modelling of long-form text while retaining reconstruction quality. We combine several recent training and inference techniques to reduce repetitive generation and improve output stability and quality. This enables the 70M-parameter MARS6 to achieve similar performance to models many times larger. We show this in objective and subjective evaluations, comparing TTS output quality and reference speaker cloning ability. Project page: https://camb-ai.github.io/mars6-turbo/

* 5 pages, 2 figures, 1 table. Accepted at ICASSP 2025

Via

Access Paper or Ask Questions

Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

Sep 22, 2024

Simon Malan, Benjamin van Niekerk, Herman Kamper

Figure 1 for Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

Figure 2 for Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

Figure 3 for Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

Figure 4 for Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

Abstract:We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach gives similar state-of-the-art results compared to the new ES-KMeans+ method, while being almost five times faster.

* 3 figures, 3 tables

Via

Access Paper or Ask Questions

Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

Sep 09, 2024

Leanne Nortje, Dan Oneata, Herman Kamper

Figure 1 for Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

Figure 2 for Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

Figure 3 for Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

Figure 4 for Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

Abstract:Given an image query, visually prompted keyword localisation (VPKL) aims to find occurrences of the depicted word in a speech collection. This can be useful when transcriptions are not available for a low-resource language (e.g. if it is unwritten). Previous work showed that VPKL can be performed with a visually grounded speech model trained on paired images and unlabelled speech. But all experiments were done on English. Moreover, transcriptions were used to get positive and negative pairs for the contrastive loss. This paper introduces a few-shot learning scheme to mine pairs automatically without transcriptions. On English, this results in only a small drop in performance. We also - for the first time - consider VPKL on a real low-resource language, Yoruba. While scores are reasonable, here we see a bigger drop in performance compared to using ground truth pairs because the mining is less accurate in Yoruba.

Via

Access Paper or Ask Questions