Thomas Hain

Teaching the Teachers: Boosting unsupervised domain adaptation in speech recognition by ensemble update

Apr 13, 2026

Rehan Ahmad, Muhammad Umar Farooq, Qihang Feng, Thomas Hain

Abstract: Speech recognition systems often struggle with data domains that were not included in training. To address this, unsupervised domain adaptation has been explored with ensemble and multi-stage teacher-student training methods that reduce the word error rate. Despite these improvements, the error rate remains much higher than that achieved with supervised in-domain training. This work proposes a more efficient strategy that updates the ensemble of teacher models simultaneously with the single student model, eliminating the need for sequential model training. The joint update improves the word error rate of the student model, benefiting the progressively enhanced teacher models. Experiments are conducted with three labelled source datasets (AMI, WSJ, and LS360) and one unlabelled target domain, Switchboard. The results show that the proposed method improves the WER by 4.6% on the Switchboard eval00 test set, thus outperforming multi-stage and iterative training methods.

VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs

Mar 09, 2026

Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan, Thomas Hain

Abstract: Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark for speech LLMs encompassing 35 emotion corpora across 15 languages. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception and application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions.

* submitted to Interspeech 2026

Position-invariant Fine-tuning of Speech Enhancement Models with Self-supervised Speech Representations

Jan 28, 2026

Amit Meghanani, Thomas Hain

Abstract: Integrating front-end speech enhancement (SE) models with self-supervised learning (SSL)-based speech models is effective for downstream tasks in noisy conditions. SE models are commonly fine-tuned using SSL representations with mean squared error (MSE) loss between enhanced and clean speech. However, MSE is prone to exploiting positional embeddings in SSL models, allowing the objective to be minimised through positional correlations instead of content-related information. This work frames the problem as a general limitation of self-supervised representation fine-tuning and investigates it through representation-guided SE. Two strategies are considered: (1) zero-padding, previously explored in SSL pre-training but here examined in the fine-tuning setting, and (2) speed perturbations with a soft-DTW loss. Experiments show that the soft-DTW-based approach achieves faster convergence and improved downstream performance, underscoring the importance of position-invariant fine-tuning in SSL-based speech modelling.

* Accepted to ICASSP 2026

Towards a Unified Benchmark for Arabic Pronunciation Assessment: Quranic Recitation as Case Study

Jun 09, 2025

Yassine El Kheir, Omnia Ibrahim, Amit Meghanani, Nada Almarwani, Hawau Olamide Toyin, Sadeen Alharbi, Modar Alfadly, Lamya Alkanhal, Ibrahim Selim, Shehab Elbatal(+5 more)

Abstract: We present a unified benchmark for mispronunciation detection in Modern Standard Arabic (MSA) using Qur'anic recitation as a case study. Our approach lays the groundwork for advancing Arabic pronunciation assessment by providing a comprehensive pipeline that spans data processing, the development of a specialized phoneme set tailored to the nuances of MSA pronunciation, and the creation of the first publicly available test set for this task, which we term the Qur'anic Mispronunciation Benchmark (QuranMB.v1). Furthermore, we evaluate several baseline models to provide initial performance insights, thereby highlighting both the promise and the challenges inherent in assessing MSA pronunciation. By establishing this standardized framework, we aim to foster further research and development in pronunciation assessment in Arabic language technology and related applications.

* Accepted at Interspeech 2025 and ArabicNLP Shared Task 2025

Methods for Automatic Matrix Language Determination of Code-Switched Speech

Oct 03, 2024

Olga Iakovenko, Thomas Hain

Abstract: Code-switching (CS) is the process of speakers alternating between two or more languages, which is becoming increasingly common in the modern world. To better describe CS speech, the Matrix Language Frame (MLF) theory introduces the concept of a Matrix Language (ML), which is the language that provides the grammatical structure for a CS utterance. In this work, the MLF theory was used to develop systems for Matrix Language Identity (MLID) determination. The MLID of English/Mandarin and English/Spanish CS text and speech was compared to acoustic language identity (LID), which is a typical way to identify a language in monolingual utterances. MLID predictors from audio show higher correlation with the textual principles than LID in all cases, while also outperforming LID in an MLID recognition task based on F1 macro (60%) and correlation score (0.38). This novel approach has identified that non-English languages (Mandarin and Spanish) are preferred over English as the ML, contrary to the monolingual choice of LID.

* Accepted at EMNLP

Using Speech Foundational Models in Loss Functions for Hearing Aid Speech Enhancement

Jul 18, 2024

Robert Sutherland, George Close, Thomas Hain, Stefan Goetze, Jon Barker

Abstract: Machine learning techniques are an active area of research for speech enhancement for hearing aids, with one particular focus on improving the intelligibility of a noisy speech signal. Recent work has shown that feature encodings from self-supervised speech representation models can effectively capture speech intelligibility. In this work, it is shown that the distance between self-supervised speech representations of clean and noisy speech correlates more strongly with human intelligibility ratings than other signal-based metrics. Experiments show that training a speech enhancement model using this distance as part of a loss function improves performance over using an SNR-based loss function, as demonstrated by an increase in HASPI, STOI, PESQ and SI-SNR scores. This method requires inference of a high-parameter-count model only at training time, meaning the speech enhancement model can remain smaller, as is required for hearing aids.

* Accepted for EUSIPCO 2024

Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis

Jul 04, 2024

Cong-Thanh Do, Shuhei Imai, Rama Doddipatla, Thomas Hain

Abstract: This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition. TTS systems are trained with a small amount of accented speech training data and their pseudo-labels rather than manual transcriptions, and are hence unsupervised. This approach enables the use of accented speech data without manual transcriptions to perform data augmentation for accented speech recognition. Synthetic accented speech data, generated from text prompts using the TTS systems, are then combined with available non-accented speech data to train automatic speech recognition (ASR) systems. ASR experiments are performed in a self-supervised learning framework using a Wav2vec2.0 model which was pre-trained on a large amount of unsupervised accented speech data. The accented speech data for training the unsupervised TTS are read speech, selected from the L2-ARCTIC and British Isles corpora, while spontaneous conversational speech from the Edinburgh international accents of English corpus is used as the evaluation data. Experimental results show that Wav2vec2.0 models which are fine-tuned for the downstream ASR task with synthetic accented speech data, generated by the unsupervised TTS, yield up to 6.1% relative word error rate reductions compared to a Wav2vec2.0 baseline which is fine-tuned with the non-accented speech data from the LibriSpeech corpus.

* Accepted to EUSIPCO 2024
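
Several abstracts on this page report gains as word error rate (WER) or relative WER reduction. As a reading aid, the sketch below shows how these two quantities are conventionally computed. The function names and the example numbers are illustrative assumptions and are not taken from any of the papers listed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up transcripts: one substitution and one deletion over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Made-up WERs: going from 20.0% to 18.8% is a 6% relative reduction.
    print(relative_reduction(0.200, 0.188))
```

A relative reduction is measured against the baseline WER, so it is not the same as an absolute difference in percentage points.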

LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks

Jun 13, 2024

Amit Meghanani, Thomas Hain

Abstract: Self-supervised learning (SSL)-based speech models are extensively used for full-stack speech processing. However, it has been observed that improving SSL-based speech representations using unlabeled speech for content-related tasks is challenging and computationally expensive. Recent attempts have been made to address this issue with cost-effective self-supervised fine-tuning (SSFT) approaches. Continuing in this direction, a cost-effective SSFT method named "LASER: Learning by Aligning Self-supervised Representations" is presented. LASER is based on the soft-DTW alignment loss with a temporal regularisation term. Experiments are conducted with HuBERT and WavLM models and evaluated on the SUPERB benchmark for two content-related tasks: automatic speech recognition (ASR) and phoneme recognition (PR). Relative improvements of 3.7% and 8.2% for HuBERT, and 4.1% and 11.7% for WavLM, are observed for the ASR and PR tasks respectively, with fewer than 3 hours of fine-tuning on a single GPU.

* Accepted at Interspeech 2024
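
The soft-DTW alignment loss named in this abstract can be illustrated with a minimal pure-Python sketch of the standard soft-DTW recursion. It uses a squared Euclidean frame cost and a smoothing parameter `gamma`, both chosen here for illustration. It does not include the temporal regularisation term that LASER adds, and it is not the authors' implementation.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    lowest = min(values)
    total = sum(math.exp(-(v - lowest) / gamma) for v in values)
    return lowest - gamma * math.log(total)


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between two sequences of feature vectors.

    x and y are lists of equal-dimension frames (lists of floats).
    Smaller gamma brings the value closer to classic (hard) DTW.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # r[i][j] holds the soft alignment cost of x[:i] against y[:j].
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Squared Euclidean distance between the two frames.
            cost = sum((a - b) ** 2 for a, b in zip(x[i - 1], y[j - 1]))
            r[i][j] = cost + soft_min(
                (r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]), gamma
            )
    return r[n][m]


if __name__ == "__main__":
    slow = [[0.0], [0.0], [1.0]]  # same content, one frame repeated
    fast = [[0.0], [1.0]]
    print(soft_dtw(slow, fast, gamma=0.01))
```

Because the recursion aligns frames by content instead of by index, a sequence and a time-stretched copy of it score close to zero even though their lengths differ.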

Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition

Jun 13, 2024

William Ravenscroft, George Close, Stefan Goetze, Thomas Hain, Mohammad Soleymanpour, Anurag Chowdhury, Mark C. Fuhs

Abstract: One solution to automatic speech recognition (ASR) of overlapping speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domain audio where reference transcript information is not always available. This paper proposes a transcription-free method for joint training using only audio signals. The proposed method uses embedding differences of pre-trained ASR encoders as a loss with a proposed modification to permutation invariant training (PIT) called guided PIT (GPIT). The method achieves a 6.4% improvement in word error rate (WER) measures over a signal-level loss and also shows enhancement improvements in perceptual measures such as short-time objective intelligibility (STOI).

* 5 pages, 3 Figures, 3 Tables, Accepted for Interspeech 2024
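
For readers unfamiliar with permutation invariant training (PIT), the sketch below shows the standard utterance-level form: the loss is evaluated for every assignment of separated outputs to reference speakers, and the cheapest assignment is kept. The mean squared error pair loss and all names are illustrative assumptions. The guided variant (GPIT) and the ASR-encoder embedding loss proposed in the paper are not reproduced here.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, references, pair_loss=mse):
    """Return (lowest average loss, best permutation).

    best_permutation[i] is the index of the estimate assigned to reference i.
    The search is exhaustive, so it is only practical for a few speakers.
    """
    n = len(references)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        loss = sum(
            pair_loss(estimates[est_idx], references[ref_idx])
            for ref_idx, est_idx in enumerate(perm)
        ) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


if __name__ == "__main__":
    # The separator emitted the two speakers in swapped order.
    estimates = [[1.0, 1.0], [0.0, 0.0]]
    references = [[0.0, 0.0], [1.0, 1.0]]
    print(pit_loss(estimates, references))
```

Taking the minimum over permutations means the separator is not penalised for the arbitrary order in which it outputs the speakers.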

EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark

Jun 11, 2024

Ziyang Ma, Mingjie Chen, Hezhao Zhang, Zhisheng Zheng, Wenxi Chen, Xiquan Li, Jiaxin Ye, Xie Chen, Thomas Hain

Abstract: Speech emotion recognition (SER) is an important part of human-computer interaction, receiving extensive attention from both industry and academia. However, the current research field of SER has long suffered from the following problems: 1) There are few reasonable and universal splits of the datasets, making it difficult to compare different models and methods. 2) No commonly used benchmark covers numerous corpora and languages for researchers to refer to, making reproduction a burden. In this paper, we propose EmoBox, an out-of-the-box multilingual multi-corpus speech emotion recognition toolkit, along with a benchmark for both intra-corpus and cross-corpus settings. For intra-corpus settings, we carefully designed the data partitioning for different datasets. For cross-corpus settings, we employ a foundation SER model, emotion2vec, to mitigate annotation errors and obtain a test set that is fully balanced in speaker and emotion distributions. Based on EmoBox, we present the intra-corpus SER results of 10 pre-trained speech models on 32 emotion datasets in 14 languages, and the cross-corpus SER results on 4 datasets with the fully balanced test sets. To the best of our knowledge, this is the largest SER benchmark, across language scopes and quantity scales. We hope that our toolkit and benchmark can facilitate the research of SER in the community.

* Accepted by INTERSPEECH 2024. GitHub Repository: https://github.com/emo-box/EmoBox
