Eduardo Lleida

An Intervention-Based Framework for Shortcut Diagnosis in Spoofing Countermeasures

Jul 03, 2026

Santiago Rubio, Pilar Bello, Dayana Ribas, Antonio Miguel, Eduardo Lleida, Alfonso Ortega

Abstract:While deepfake audio detection systems achieve high performance in controlled benchmarks, their reliability often diminishes in the wild. Prior work shows that dataset-specific artifacts contribute to this gap. Yet, systematic tools to identify which acoustic properties a model exploits as shortcuts remain limited. We propose an intervention-based diagnostic framework, grounded in a directed graphical model, that formally distinguishes confound-driven shortcut dependencies from legitimate domain shift. We operationalise this through controlled acoustic perturbations targeting non-speech structure, spectral content, and signal energy, complemented by corpus-level distributional analysis. Evaluating XLS-R-300M with RawGAT-ST across the ASVspoof challenge datasets, we quantify model sensitivity to specific intervention types. Results reveal that non-speech interventions produce the largest performance shifts, confirming non-speech intervals as a dominant shortcut.

* Accepted at Odyssey 2026: The Speaker and Language Recognition Workshop

Open-Set Source Tracing as Compositional Factors via Structured Prototypes

Jul 03, 2026

Santiago Rubio, Antonio Almudévar, Antonio Miguel, Eduardo Lleida, Alfonso Ortega

Abstract:Recent research expands beyond binary anti-spoofing with the emergence of Source Tracing, the task of identifying the specific generative origins of synthetic speech. However, current research often equates a "source" with its generative architecture. We propose redefining a source as a compositional tuple of Architecture, Training Data, and other training factors affecting the generated speech. We propose a framework using Structured Orthonormal Prototypes to minimize class overlap and intra-class variance. Our Subspace Partitioning strategy splits the embedding into architecture and data subspaces, while a residual subspace captures stochastic variability, enabling "compositional generalization" for novel factor combinations. This approach improves performance for partially seen sources and maintains robustness in fully open-set scenarios. MLAAD evaluations for Few-Shot open-set Identification show our approach significantly outperforms angular-margin baselines.

* Submitted to IEEE Spoken Language Technology Workshop (SLT) 2026
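
The subspace partitioning idea in the abstract above can be illustrated with a minimal sketch. It assumes the embedding is a flat list split into contiguous architecture, data, and residual slices, and that each factor is identified by the prototype with the highest dot product. The function names, subspace sizes, and prototype labels are hypothetical and are not taken from the paper.

```python
def split_embedding(embedding, arch_dim, data_dim):
    """Partition an embedding into (architecture, data, residual) slices."""
    if arch_dim + data_dim > len(embedding):
        raise ValueError("subspace sizes exceed the embedding dimension")
    arch = embedding[:arch_dim]
    data = embedding[arch_dim:arch_dim + data_dim]
    residual = embedding[arch_dim + data_dim:]  # left for unexplained variability
    return arch, data, residual


def nearest_prototype(vector, prototypes):
    """Return the label whose prototype has the largest dot product with vector."""
    return max(
        prototypes,
        key=lambda name: sum(v * p for v, p in zip(vector, prototypes[name])),
    )


def trace_source(embedding, arch_protos, data_protos):
    """Predict each factor independently, so unseen combinations can still be named."""
    arch_dim = len(next(iter(arch_protos.values())))
    data_dim = len(next(iter(data_protos.values())))
    arch, data, _ = split_embedding(embedding, arch_dim, data_dim)
    return nearest_prototype(arch, arch_protos), nearest_prototype(data, data_protos)


# Hypothetical orthonormal prototypes for two architectures and two training sets.
arch_protos = {"arch_A": [1.0, 0.0], "arch_B": [0.0, 1.0]}
data_protos = {"data_X": [1.0, 0.0], "data_Y": [0.0, 1.0]}

prediction = trace_source([0.1, 0.9, 0.8, 0.2, 0.05], arch_protos, data_protos)
print(prediction)
```

Because each factor is scored in its own subspace, a pair such as ("arch_B", "data_X") can be returned even if that exact combination never appeared together in training, which is the intuition behind compositional generalization in this toy setting.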

A Fair and Transparent Framework for Speech-Based Depression Detection: Balancing Interpretability and Performance

Jun 30, 2026

Mariel Estevez, Alfonso Ortega, Antonio Miguel, Eduardo Lleida

Abstract:While speech provides rich, non-invasive biomarkers for mental-health assessment, clinical adoption is limited by opaque models and potential demographic bias. In this work we propose a methodological framework to evaluate robustness and interpretability for automated depression detection on the extended DAIC-WOZ dataset. The framework combines low-complexity machine learning baselines (RF, SVM, and MLP), chosen to mitigate overfitting and enhance generalization, with human-understandable acoustic features (MFCCs, eGeMAPS). To balance accuracy with clinical trust, we leverage explainability methods (LIME and SHAP) for feature selection, validating our findings with statistical significance tests and demographic fairness analyses to mitigate spurious, artifact-driven correlations. Empirical results demonstrate that an optimized subset of explainable AI (XAI)-selected features combined with an MLP architecture achieves a state-of-the-art test accuracy of 82%. Ultimately, this work provides a transparent framework for robust and ethical assistive technologies that can be applied to any other binary task.

* 7 pages, 2 figures, 3 tables. This work has been submitted to the IEEE for possible publication

Audio-Visual Speaker Diarization: Current Databases, Approaches and Challenges

Sep 09, 2024

Victoria Mingote, Alfonso Ortega, Antonio Miguel, Eduardo Lleida

Abstract:Nowadays, the large amount of audio-visual content available has fostered the need to develop new robust automatic speaker diarization systems to analyse and characterise it. This kind of system helps to reduce the cost of doing this process manually and allows the speaker information to be used in different applications, since a huge quantity of information is present, for example, images of faces or audio recordings. Therefore, this paper aims to address a critical area in the field of speaker diarization systems: the integration of audio-visual content from different domains. This paper seeks to push beyond current state-of-the-art practices by developing a robust audio-visual speaker diarization framework adaptable to various data domains, including TV scenarios, meetings, and daily activities. Unlike most existing audio-visual speaker diarization systems, this framework will also include an approach for the precise assignment of specific identities in TV scenarios where celebrities appear. In addition, in this work, we have conducted an extensive compilation of the current state-of-the-art approaches and the existing databases for developing audio-visual speaker diarization.

Defining and Measuring Disentanglement for non-Independent Factors of Variation

Aug 13, 2024

Antonio Almudévar, Alfonso Ortega, Luis Vicente, Antonio Miguel, Eduardo Lleida

Abstract:Representation learning is an approach for discovering and extracting the factors of variation from the data. Intuitively, a representation is said to be disentangled if it separates the different factors of variation in a way that is understandable to humans. Definitions of disentanglement and metrics to measure it usually assume that the factors of variation are independent of each other. However, this is generally false in the real world, which limits the use of these definitions and metrics to very specific and unrealistic scenarios. In this paper we give a definition of disentanglement based on information theory that is also valid when the factors of variation are not independent. Furthermore, we relate this definition to the Information Bottleneck Method. Finally, we propose a method to measure the degree of disentanglement from the given definition that works when the factors of variation are not independent. We show through different experiments that the method proposed in this paper correctly measures disentanglement with non-independent factors of variation, while other methods fail in this scenario.

Predefined Prototypes for Intra-Class Separation and Disentanglement

Jun 23, 2024

Antonio Almudévar, Théo Mariotte, Alfonso Ortega, Marie Tahon, Luis Vicente, Antonio Miguel, Eduardo Lleida

Abstract:Prototypical Learning is based on the idea that there is a point (which we call prototype) around which the embeddings of a class are clustered. It has shown promising results in scenarios with little labeled data or to design explainable models. Typically, prototypes are either defined as the average of the embeddings of a class or are designed to be trainable. In this work, we propose to predefine prototypes following human-specified criteria, which simplifies the training pipeline and brings different advantages. Specifically, in this work we explore two of these advantages: increasing the inter-class separability of embeddings and disentangling embeddings with respect to different variance factors, which can translate into the possibility of having explainable predictions. Finally, we propose different experiments that help to understand our proposal and empirically demonstrate the mentioned advantages.
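
As a rough illustration of training against predefined (rather than averaged or learned) prototypes, the sketch below fixes one prototype per class in advance and measures how far embeddings are from their class prototype. Choosing scaled one-hot vectors as the prototypes and squared Euclidean distance as the criterion are assumptions made for this example; the abstract does not state which criteria the authors use, and all function names are hypothetical.

```python
def one_hot_prototypes(num_classes, dim, scale=1.0):
    """Predefine one prototype per class as a scaled standard basis vector."""
    if num_classes > dim:
        raise ValueError("need dim >= num_classes for one-hot prototypes")
    return [[scale if d == c else 0.0 for d in range(dim)] for c in range(num_classes)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototype_loss(embeddings, labels, prototypes):
    """Mean squared distance between each embedding and its class prototype."""
    total = sum(squared_distance(e, prototypes[y]) for e, y in zip(embeddings, labels))
    return total / len(embeddings)


def predict(embedding, prototypes):
    """Assign the class whose predefined prototype is closest to the embedding."""
    return min(
        range(len(prototypes)),
        key=lambda c: squared_distance(embedding, prototypes[c]),
    )


prototypes = one_hot_prototypes(num_classes=3, dim=4)
print(predict([0.1, 0.9, 0.0, 0.2], prototypes))
print(prototype_loss([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], [0, 1], prototypes))
```

Since the prototypes never change, nothing about them has to be estimated during training; only the encoder producing the embeddings would be optimized to reduce `prototype_loss`.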

Improved Vocal Effort Transfer Vector Estimation for Vocal Effort-Robust Speaker Verification

May 03, 2023

Iván López-Espejo, Santi Prieto, Alfonso Ortega, Eduardo Lleida

Abstract:Despite the maturity of modern speaker verification technology, its performance still significantly degrades when facing non-neutrally-phonated (e.g., shouted and whispered) speech. To address this issue, in this paper, we propose a new speaker embedding compensation method based on a minimum mean square error (MMSE) estimator. This method models the joint distribution of the vocal effort transfer vector and non-neutrally-phonated embedding spaces and operates in a principal component analysis domain to cope with non-neutrally-phonated speech data scarcity. Experiments are carried out using a cutting-edge speaker verification system integrating a powerful self-supervised pre-trained model for speech representation. In comparison with a state-of-the-art embedding compensation method, the proposed MMSE estimator yields superior and competitive equal error rate results when tackling shouted and whispered speech, respectively.

Class Token and Knowledge Distillation for Multi-head Self-Attention Speaker Verification Systems

Nov 06, 2021

Victoria Mingote, Antonio Miguel, Alfonso Ortega, Eduardo Lleida

Abstract:This paper explores three novel approaches to improve the performance of speaker verification (SV) systems based on deep neural networks (DNN) using Multi-head Self-Attention (MSA) mechanisms and memory layers. Firstly, we propose the use of a learnable vector called Class token to replace the global average pooling mechanism to extract the embeddings. Unlike global average pooling, our proposal takes into account the temporal structure of the input, which is relevant for the text-dependent SV task. The class token is concatenated to the input before the first MSA layer, and its state at the output is used to predict the classes. To gain additional robustness, we introduce two approaches. First, we have developed a Bayesian estimation of the class token. Second, we have added a distilled representation token for training a teacher-student pair of networks using the Knowledge Distillation (KD) philosophy, which is combined with the class token. This distillation token is trained to mimic the predictions from the teacher network, while the class token replicates the true label. All the strategies have been tested on the RSR2015-Part II and DeepMine-Part 1 databases for text-dependent SV, providing competitive results compared to the same architecture using the average pooling mechanism to extract average embeddings.
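
The mechanics of reading an embedding from a class token can be sketched as follows. This is a toy under strong assumptions: a single attention head, identity query/key/value projections, no positional information, and a fixed token instead of a learned one, so it only shows where the token is concatenated and where its output state is read. All function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Single-head self-attention with identity projections (an assumption)."""
    dim = len(sequence[0])
    outputs = []
    for query in sequence:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * vec[d] for w, vec in zip(weights, sequence)) for d in range(dim)]
        )
    return outputs


def class_token_embedding(frames, class_token):
    """Prepend the class token and return its state after the attention layer."""
    sequence = [class_token] + frames
    return self_attention(sequence)[0]


def average_pooling_embedding(frames):
    """Baseline: plain mean over time of the frame vectors."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_embedding = class_token_embedding(frames, class_token=[0.0, 0.0])
pooled_embedding = average_pooling_embedding(frames)
print(token_embedding, pooled_embedding)
```

With an all-zero token every attention score is zero, so the token attends uniformly to itself and the frames. In a trained model the token and the projections are learned, which lets the attention weights favor some frames over others instead of averaging them equally.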

Generalizing AUC Optimization to Multiclass Classification for Audio Segmentation With Limited Training Data

Oct 27, 2021

Pablo Gimeno, Victoria Mingote, Alfonso Ortega, Antonio Miguel, Eduardo Lleida

Abstract:Area under the ROC curve (AUC) optimisation techniques developed for neural networks have recently demonstrated their capabilities in different audio and speech related tasks. However, due to its intrinsic nature, AUC optimisation has focused only on binary tasks so far. In this paper, we introduce an extension to the AUC optimisation framework so that it can be easily applied to an arbitrary number of classes, aiming to overcome the issues derived from training data limitations in deep learning solutions. Building upon the multiclass definitions of the AUC metric found in the literature, we define two new training objectives using a one-versus-one and a one-versus-rest approach. In order to demonstrate its potential, we apply them in an audio segmentation task with limited training data that aims to differentiate 3 classes: foreground music, background music and no music. Experimental results show that our proposal can improve the performance of audio segmentation systems significantly compared to traditional training criteria such as cross entropy.

* IEEE Signal Processing Letters, vol. 28, pp. 1135-1139, 2021
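
For orientation, the two multiclass AUC definitions mentioned in the abstract (one-versus-rest and one-versus-one) can be computed directly from pairwise score comparisons. The sketch below evaluates the metrics only; it is not the differentiable training objective proposed in the paper, and the function names and toy scores are hypothetical.

```python
from itertools import combinations


def binary_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_one_vs_rest(scores, labels, num_classes):
    """Unweighted mean of per-class AUCs, each class against all other classes."""
    total = 0.0
    for c in range(num_classes):
        pos = [s[c] for s, y in zip(scores, labels) if y == c]
        neg = [s[c] for s, y in zip(scores, labels) if y != c]
        total += binary_auc(pos, neg)
    return total / num_classes


def auc_one_vs_one(scores, labels, num_classes):
    """Mean over class pairs (i, j) of the average of A(i|j) and A(j|i)."""
    pairs = list(combinations(range(num_classes), 2))
    total = 0.0
    for i, j in pairs:
        a_ij = binary_auc(
            [s[i] for s, y in zip(scores, labels) if y == i],
            [s[i] for s, y in zip(scores, labels) if y == j],
        )
        a_ji = binary_auc(
            [s[j] for s, y in zip(scores, labels) if y == j],
            [s[j] for s, y in zip(scores, labels) if y == i],
        )
        total += 0.5 * (a_ij + a_ji)
    return total / len(pairs)


# Toy per-class scores for four samples and three classes (illustrative values only).
scores = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
labels = [0, 0, 1, 2]
print(auc_one_vs_rest(scores, labels, 3), auc_one_vs_one(scores, labels, 3))
```

The hard comparison `p > n` is not differentiable, so using AUC as a training criterion generally means replacing it with a smooth surrogate. Which surrogate the authors use is not stated in the abstract.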

Shouted Speech Compensation for Speaker Verification Robust to Vocal Effort Conditions

Aug 06, 2020

Santi Prieto, Alfonso Ortega, Iván López-Espejo, Eduardo Lleida

Abstract:The performance of speaker verification systems degrades when vocal effort conditions between enrollment and test (e.g., shouted vs. normal speech) are different. This is a potential situation in non-cooperative speaker verification tasks. In this paper, we present a study on different methods for linear compensation of embeddings making use of Gaussian mixture models to cluster shouted and normal speech domains. These compensation techniques are borrowed from the area of robustness for automatic speech recognition and, in this work, we apply them to compensate for the mismatch between shouted and normal conditions in speaker verification. Before compensation, the shouted condition is automatically detected by means of logistic regression. The process is computationally light and is performed in the back-end of an x-vector system. Experimental results show that applying the proposed approach in the presence of vocal effort mismatch yields up to a 13.8% relative improvement in equal error rate with respect to a system that applies neither shouted speech detection nor compensation.
