Federico Costa

How Attention Shapes Emotion: A Comparative Study of Attention Mechanisms for Speech Emotion Recognition

Mar 16, 2026

Marc Casals-Salvador, Federico Costa, Rodolfo Zevallos, Javier Hernando

Abstract: Speech Emotion Recognition (SER) plays a key role in advancing human-computer interaction. Attention mechanisms have become the dominant approach for modeling emotional speech due to their ability to capture long-range dependencies and emphasize salient information. However, standard self-attention suffers from quadratic computational and memory complexity, limiting its scalability. In this work, we present a systematic benchmark of optimized attention mechanisms for SER, including RetNet, LightNet, GSA, FoX, and KDA. Experiments on both MSP-Podcast benchmark versions show that while standard self-attention achieves the strongest recognition performance across test sets, efficient attention variants dramatically improve scalability, reducing inference latency and memory usage by up to an order of magnitude. These results highlight a critical trade-off between accuracy and efficiency, providing practical insights for designing scalable SER systems.
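The quadratic memory cost mentioned in the abstract can be illustrated with a small back-of-the-envelope sketch. It only counts the entries of the full score matrix (one score per pair of frames) for a single layer; the function names, the 8 heads, and the 4-byte float size are assumptions chosen for illustration and are not taken from the paper.

```python
def attention_score_entries(num_frames: int, num_heads: int = 1) -> int:
    """Number of entries in the full self-attention score matrices of one layer.

    Standard self-attention scores every frame against every other frame,
    so each head holds a num_frames x num_frames matrix.
    """
    return num_heads * num_frames * num_frames


def score_memory_mib(num_frames: int, num_heads: int = 8, bytes_per_value: int = 4) -> float:
    """Memory of the score matrices in MiB (heads and dtype size are assumed values)."""
    return attention_score_entries(num_frames, num_heads) * bytes_per_value / 2**20


# Doubling the number of frames quadruples the memory of the score matrices.
for frames in (500, 1000, 2000):
    print(frames, "frames:", round(score_memory_mib(frames), 1), "MiB")
```

This ignores activations, weights, and batch size, so it is a lower bound on the real footprint rather than a measurement.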


Quantifying Cross-Lingual Transfer in Paralinguistic Speech Tasks

Mar 09, 2026

Pol Buitrago, Oriol Pareras, Federico Costa, Javier Hernando

Abstract: Paralinguistic speech tasks are often considered relatively language-agnostic, as they rely on extralinguistic acoustic cues rather than lexical content. However, prior studies report performance degradation under cross-lingual conditions, indicating non-negligible language dependence. Still, these studies typically focus on isolated language pairs or task-specific settings, limiting comparability and preventing a systematic assessment of task-level language dependence. We introduce the Cross-Lingual Transfer Matrix (CLTM), a systematic method to quantify cross-lingual interactions between pairs of languages within a given task. We apply the CLTM to two paralinguistic tasks, gender identification and speaker verification, using a multilingual HuBERT-based encoder, to analyze how donor-language data affects target-language performance during fine-tuning. Our results reveal distinct transfer patterns across tasks and languages, reflecting systematic, language-dependent effects.

* 6 pages, 5 figures, Submitted to Interspeech 2026


On the Use of Audio to Improve Dialogue Policies

Oct 17, 2024

Daniel Roncel, Federico Costa, Javier Hernando

Abstract: With the significant progress of speech technologies, spoken goal-oriented dialogue systems are becoming increasingly popular. One of the main modules of a dialogue system is typically the dialogue policy, which is responsible for determining system actions. This component usually relies only on audio transcriptions, which makes it strongly dependent on their quality and causes it to ignore important extralinguistic information embedded in the user's speech. In this paper, we propose new architectures to add audio information by combining speech and text embeddings using a Double Multi-Head Attention component. Our experiments show that audio embedding-aware dialogue policies outperform text-based ones, particularly in noisy transcription scenarios, and that how text and audio embeddings are combined is crucial to improve performance. We obtained a 9.8% relative improvement in the User Request Score compared to a text-only dialogue system on the DSTC2 dataset.

* IberSpeech 2024
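For readers unfamiliar with the term, a relative improvement such as the 9.8% quoted in the abstract is the gain divided by the baseline score, not the absolute difference. A minimal sketch follows; the function name is hypothetical and the two scores in the usage lines are made-up values chosen only to reproduce a 9.8% ratio, not results from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


# Illustrative (invented) scores: 0.500 for text only, 0.549 for audio plus text.
gain = relative_improvement(0.500, 0.549)
print(f"{gain:.1%}")
```

With these invented scores the absolute gain is 4.9 points, while the relative gain is 9.8%.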


BSC-UPC at EmoSPeech-IberLEF2024: Attention Pooling for Emotion Recognition

Jul 17, 2024

Marc Casals-Salvador, Federico Costa, Miquel India, Javier Hernando

Abstract: Speech emotion recognition (SER) has long been a frontier of machine learning. It is an active field that has been transformed in the last few decades and whose applications could affect daily life. Consequently, the Iberian Languages Evaluation Forum (IberLEF) of 2024 held a competitive challenge on SER with a Spanish corpus. This paper presents our approach to this competition. The architecture uses several pre-trained speech and text models to extract features from both modalities, together with an attention pooling mechanism. The proposed system achieved first place in the challenge with a Macro F1-Score of 86.69%.
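The Macro F1-Score reported above is the unweighted mean of the per-class F1 scores, so rare emotion classes count as much as frequent ones. The following is a minimal stdlib sketch of the standard definition; the function name and the emotion labels in the usage lines are illustrative assumptions, not the challenge's data or scoring script.

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over all classes that appear in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); the class loop guarantees denom > 0.
        f1_scores.append(2 * tp / denom)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels for illustration only.
truth = ["anger", "joy", "joy", "neutral"]
preds = ["anger", "joy", "neutral", "neutral"]
print(round(macro_f1(truth, preds), 4))
```

Here the per-class F1 scores are 1, 2/3, and 2/3, so the macro average is 7/9.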


Double Multi-Head Attention Multimodal System for Odyssey 2024 Speech Emotion Recognition Challenge

Jun 15, 2024

Federico Costa, Miquel India, Javier Hernando

Abstract: As computer-based applications are becoming more integrated into our daily lives, the importance of Speech Emotion Recognition (SER) has increased significantly. To promote research on innovative approaches in SER, the Odyssey 2024 Speech Emotion Recognition Challenge was organized as part of the Odyssey 2024 Speaker and Language Recognition Workshop. In this paper we describe the Double Multi-Head Attention Multimodal System developed for this challenge. Pre-trained self-supervised models were used to extract informative acoustic and text features. An early fusion strategy was adopted, where a Multi-Head Attention layer transforms these mixed features into complementary contextualized representations. A second attention mechanism is then applied to pool these representations into an utterance-level vector. Our proposed system achieved third place in the categorical task ranking with a 34.41% Macro-F1 score, among 31 participating teams.

* Odyssey 2024: The Speaker and Language Recognition Workshop
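The pooling step described in the abstract, which turns a variable-length sequence of frame representations into one utterance-level vector, can be sketched in a heavily simplified form. The code below is a single-head, dot-product attention pooling with a fixed query vector; it is an assumption-laden illustration of the general idea, not the authors' Double Multi-Head Attention architecture, and the names `attention_pool`, `frames`, and `query` are hypothetical.

```python
import math


def attention_pool(frames: list[list[float]], query: list[float]) -> tuple[list[float], list[float]]:
    """Pool a sequence of frame vectors into one vector using softmax attention weights.

    In a trained model `query` would be a learned parameter; here it is passed in.
    Returns the pooled vector and the per-frame weights.
    """
    # One relevance score per frame: dot product between the frame and the query.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    # Numerically stable softmax over the time axis.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of frames: output size depends on the feature dimension only.
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Utterances of different lengths map to vectors of the same size.
short_utt = [[1.0, 0.0], [0.0, 1.0]]
long_utt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
query = [1.0, -1.0]
print(len(attention_pool(short_utt, query)[0]), len(attention_pool(long_utt, query)[0]))
```

With an all-zero query every frame receives the same weight, and the pooling reduces to plain mean pooling over time.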


Speaker Characterization by means of Attention Pooling

May 07, 2024

Federico Costa, Miquel India, Javier Hernando

Abstract: State-of-the-art Deep Learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer to encode variable-length utterances into fixed-length speaker vectors. The authors have recently proposed the use of a Double Multi-Head Self-Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This has been shown to be an excellent approach to efficiently select the most relevant features captured by the front-end from the speech signal. In this paper we show excellent experimental results by adapting this architecture to other speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection.

* Proc. IberSPEECH 2022, 166-170
