Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Alkis Koudounas

Benchmarking Speech-to-Speech Translation Models

Jun 02, 2026

Alkis Koudounas, Hayato Futami, Quentin Jodelet, Osamu Take, Shinji Watanabe, Emiru Tsunoo

Abstract:Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and reproducible benchmarking framework integrating 46 metrics across eight dimensions, and deploy it on 1,248 model-language configurations from FLEURS and CVSS, spanning cascaded and end-to-end architectures over ten language pairs. Architectures exhibit complementary strengths: best-vs-worst gaps exceed 30\% on naturalness and speaker preservation but remain within a few points on translation quality, so single-metric rankings systematically misrepresent system quality. Correlation filtering reduces 46 metrics to 10 per direction, with three axes requiring different metrics across X$\to$EN and EN$\to$X (e.g., TER/UTMOS vs. ChrF++/NISQA-MOS); these subsets preserve rankings (Spearman's $ρ>0.80$) while cutting evaluation time by $\approx 2.5\times$. Human validation across dubbing, podcasts, and medical domains shows standalone MOS predictors fail to predict listener preference, while top domain-specific metrics correlate with human judgment ($ρ\geq 0.90$). We release COMPASS as a foundation for domain-aware S2ST evaluation.

* Paper under submission

Via

Access Paper or Ask Questions

FAME: Fictional Actors for Multilingual Erasure

Dec 17, 2025

Claudio Savelli, Moreno La Quatra, Alkis Koudounas, Flavio Giobergia

Abstract:LLMs trained on web-scale data raise concerns about privacy and the right to be forgotten. To address these issues, Machine Unlearning provides techniques to remove specific information from trained models without retraining from scratch. However, existing benchmarks for evaluating unlearning in LLMs face two major limitations: they focus only on English and support only entity-level forgetting (removing all information about a person). We introduce FAME (Fictional Actors for Multilingual Erasure), a synthetic benchmark for evaluating Machine Unlearning across five languages: English, French, German, Italian, and Spanish. FAME contains 1,000 fictional actor biographies and 20,000 question-answer pairs. Each biography includes information on 20 topics organized into structured categories (biography, career, achievements, personal information). This design enables both entity-level unlearning (i.e., forgetting entire identities) and instance-level unlearning (i.e., forgetting specific facts while retaining others). We provide two dataset splits to support these two different unlearning scenarios and enable systematic comparison of unlearning techniques across languages. Since FAME uses entirely fictional data, it ensures that the information was never encountered during model pretraining, allowing for a controlled evaluation of unlearning methods.

Via

Access Paper or Ask Questions

A Concept-based approach to Voice Disorder Detection

Jul 23, 2025

Davide Ghia, Gabriele Ciravegna, Alkis Koudounas, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli

Figure 1 for A Concept-based approach to Voice Disorder Detection

Figure 2 for A Concept-based approach to Voice Disorder Detection

Figure 3 for A Concept-based approach to Voice Disorder Detection

Figure 4 for A Concept-based approach to Voice Disorder Detection

Abstract:Voice disorders affect a significant portion of the population, and the ability to diagnose them using automated, non-invasive techniques would represent a substantial advancement in healthcare, improving the quality of life of patients. Recent studies have demonstrated that artificial intelligence models, particularly Deep Neural Networks (DNNs), can effectively address this task. However, due to their complexity, the decision-making process of such models often remain opaque, limiting their trustworthiness in clinical contexts. This paper investigates an alternative approach based on Explainable AI (XAI), a field that aims to improve the interpretability of DNNs by providing different forms of explanations. Specifically, this works focuses on concept-based models such as Concept Bottleneck Model (CBM) and Concept Embedding Model (CEM) and how they can achieve performance comparable to traditional deep learning methods, while offering a more transparent and interpretable decision framework.

Via

Access Paper or Ask Questions

MVP: Multi-source Voice Pathology detection

May 26, 2025

Alkis Koudounas, Moreno La Quatra, Gabriele Ciravegna, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli, Sabato Marco Siniscalchi, Elena Baralis

Abstract:Voice disorders significantly impact patient quality of life, yet non-invasive automated diagnosis remains under-explored due to both the scarcity of pathological voice data, and the variability in recording sources. This work introduces MVP (Multi-source Voice Pathology detection), a novel approach that leverages transformers operating directly on raw voice signals. We explore three fusion strategies to combine sentence reading and sustained vowel recordings: waveform concatenation, intermediate feature fusion, and decision-level combination. Empirical validation across the German, Portuguese, and Italian languages shows that intermediate feature fusion using transformers best captures the complementary characteristics of both recording types. Our approach achieves up to +13% AUC improvement over single-source methods.

* Accepted at Interspeech 2025

Via

Access Paper or Ask Questions

DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset

May 26, 2025

Alkis Koudounas, Moreno La Quatra, Elena Baralis

Abstract:Recent advances in conversational AI have demonstrated impressive capabilities in single-turn responses, yet multi-turn dialogues remain challenging for even the most sophisticated language models. Current dialogue datasets are limited in their emotional range, domain diversity, turn depth, and are predominantly text-only, hindering progress in developing more human-like conversational systems across modalities. To address these limitations, we present DeepDialogue, a large-scale multimodal dataset containing 40,150 high-quality multi-turn dialogues spanning 41 domains and incorporating 20 distinct emotions with coherent emotional progressions. Our approach pairs 9 different language models (4B-72B parameters) to generate 65,600 initial conversations, which we then evaluate through a combination of human annotation and LLM-based quality filtering. The resulting dataset reveals fundamental insights: smaller models fail to maintain coherence beyond 6 dialogue turns; concrete domains (e.g., "cars," "travel") yield more meaningful conversations than abstract ones (e.g., "philosophy"); and cross-model interactions produce more coherent dialogues than same-model conversations. A key contribution of DeepDialogue is its speech component, where we synthesize emotion-consistent voices for all 40,150 dialogues, creating the first large-scale open-source multimodal dialogue dataset that faithfully preserves emotional context across multi-turn conversations.

* Currently under review. See the official website: https://salt-research.github.io/DeepDialogue

Via

Access Paper or Ask Questions

Exploring Generative Error Correction for Dysarthric Speech Recognition

May 26, 2025

Moreno La Quatra, Alkis Koudounas, Valerio Mario Salerno, Sabato Marco Siniscalchi

Abstract:Despite the remarkable progress in end-to-end Automatic Speech Recognition (ASR) engines, accurately transcribing dysarthric speech remains a major challenge. In this work, we proposed a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025, which combines cutting-edge speech recognition models with LLM-based generative error correction (GER). We assess different configurations of model scales and training strategies, incorporating specific hypothesis selection to improve transcription accuracy. Experiments on the Speech Accessibility Project dataset demonstrate the strength of our approach on structured and spontaneous speech, while highlighting challenges in single-word recognition. Through comprehensive analysis, we provide insights into the complementary roles of acoustic and linguistic modeling in dysarthric speech recognition

* Accepted at INTERSPEECH 2025

Via

Access Paper or Ask Questions

"KAN you hear me?" Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding

May 26, 2025

Alkis Koudounas, Moreno La Quatra, Eliana Pastor, Sabato Marco Siniscalchi, Elena Baralis

Figure 1 for "KAN you hear me?" Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding

Figure 2 for "KAN you hear me?" Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding

Figure 3 for "KAN you hear me?" Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding

Figure 4 for "KAN you hear me?" Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding

Abstract:Kolmogorov-Arnold Networks (KANs) have recently emerged as a promising alternative to traditional neural architectures, yet their application to speech processing remains under explored. This work presents the first investigation of KANs for Spoken Language Understanding (SLU) tasks. We experiment with 2D-CNN models on two datasets, integrating KAN layers in five different configurations within the dense block. The best-performing setup, which places a KAN layer between two linear layers, is directly applied to transformer-based models and evaluated on five SLU datasets with increasing complexity. Our results show that KAN layers can effectively replace the linear layers, achieving comparable or superior performance in most cases. Finally, we provide insights into how KAN and linear layers on top of transformers differently attend to input regions of the raw waveforms.

* Accepted at INTERSPEECH 2025

Via

Access Paper or Ask Questions

"Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding

May 21, 2025

Alkis Koudounas, Claudio Savelli, Flavio Giobergia, Elena Baralis

Abstract:Machine unlearning, the process of efficiently removing specific information from machine learning models, is a growing area of interest for responsible AI. However, few studies have explored the effectiveness of unlearning methods on complex tasks, particularly speech-related ones. This paper introduces UnSLU-BENCH, the first benchmark for machine unlearning in spoken language understanding (SLU), focusing on four datasets spanning four languages. We address the unlearning of data from specific speakers as a way to evaluate the quality of potential "right to be forgotten" requests. We assess eight unlearning techniques and propose a novel metric to simultaneously better capture their efficacy, utility, and efficiency. UnSLU-BENCH sets a foundation for unlearning in SLU and reveals significant differences in the effectiveness and computational feasibility of various techniques.

Via

Access Paper or Ask Questions

voc2vec: A Foundation Model for Non-Verbal Vocalization

Feb 22, 2025

Alkis Koudounas, Moreno La Quatra, Marco Sabato Siniscalchi, Elena Baralis

Figure 1 for voc2vec: A Foundation Model for Non-Verbal Vocalization

Figure 2 for voc2vec: A Foundation Model for Non-Verbal Vocalization

Figure 3 for voc2vec: A Foundation Model for Non-Verbal Vocalization

Figure 4 for voc2vec: A Foundation Model for Non-Verbal Vocalization

Abstract:Speech foundation models have demonstrated exceptional capabilities in speech-related tasks. Nevertheless, these models often struggle with non-verbal audio data, such as vocalizations, baby crying, etc., which are critical for various real-world applications. Audio foundation models well handle non-speech data but also fail to capture the nuanced features of non-verbal human sounds. In this work, we aim to overcome the above shortcoming and propose a novel foundation model, termed voc2vec, specifically designed for non-verbal human data leveraging exclusively open-source non-verbal audio datasets. We employ a collection of 10 datasets covering around 125 hours of non-verbal audio. Experimental results prove that voc2vec is effective in non-verbal vocalization classification, and it outperforms conventional speech and audio foundation models. Moreover, voc2vec consistently outperforms strong baselines, namely OpenSmile and emotion2vec, on six different benchmark datasets. To the best of the authors' knowledge, voc2vec is the first universal representation model for vocalization tasks.

* Accepted at ICASSP 2025

Via

Access Paper or Ask Questions

Speech Analysis of Language Varieties in Italy

Jun 22, 2024

Moreno La Quatra, Alkis Koudounas, Elena Baralis, Sabato Marco Siniscalchi

Figure 1 for Speech Analysis of Language Varieties in Italy

Figure 2 for Speech Analysis of Language Varieties in Italy

Figure 3 for Speech Analysis of Language Varieties in Italy

Figure 4 for Speech Analysis of Language Varieties in Italy

Abstract:Italy exhibits rich linguistic diversity across its territory due to the distinct regional languages spoken in different areas. Recent advances in self-supervised learning provide new opportunities to analyze Italy's linguistic varieties using speech data alone. This includes the potential to leverage representations learned from large amounts of data to better examine nuances between closely related linguistic varieties. In this study, we focus on automatically identifying the geographic region of origin of speech samples drawn from Italy's diverse language varieties. We leverage self-supervised learning models to tackle this task and analyze differences and similarities between Italy's regional languages. In doing so, we also seek to uncover new insights into the relationships among these diverse yet closely related varieties, which may help linguists understand their interconnected evolution and regional development over time and space. To improve the discriminative ability of learned representations, we evaluate several supervised contrastive learning objectives, both as pre-training steps and additional fine-tuning objectives. Experimental evidence shows that pre-trained self-supervised models can effectively identify regions from speech recording. Additionally, incorporating contrastive objectives during fine-tuning improves classification accuracy and yields embeddings that distinctly separate regional varieties, demonstrating the value of combining self-supervised pre-training and contrastive learning for this task.

* Accepted to LREC-COLING 2024 - https://aclanthology.org/2024.lrec-main.1317/

Via

Access Paper or Ask Questions