Valentin Vielzeuf

LIUM

Forewarned is Forearmed: When Non-Sequential Embedding Turns Into an Anomaly Detector

Jun 29, 2026

Elys Allesiardo, Antoine Caubrière, Valentin Vielzeuf

Abstract:This paper offers an in-depth analysis of non-sequential multimodal sentence-level embeddings, with a particular focus on the SONAR model. We demonstrate that certain embedding dimensions are sensitive to perturbations and can serve as indicators of decoding anomalies. By leveraging the consistency between successive encoding and decoding, we successfully build an accurate detector. Additionally, we explore modifying specific dimensions of interest to attempt to correct them. This work underscores the importance of understanding and analyzing the embeddings themselves to enhance the reliability of multimodal representations.

* Accepted for presentation at LREC 2026
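
The encode/decode consistency check mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' detector: `toy_encode`, `toy_decode`, the three-word vocabulary, and the 0.9 threshold are hypothetical stand-ins for a real sentence encoder and decoder such as SONAR. The idea shown is only that an embedding is decoded to text, the text is re-encoded, and a low similarity between the two embeddings is treated as a sign of a decoding anomaly.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]

# Hypothetical toy vocabulary standing in for a real sentence encoder/decoder.
VOCAB = {"cat": (1.0, 0.0, 0.0), "sat": (0.0, 1.0, 0.0), "mat": (0.0, 0.0, 1.0)}


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


def toy_encode(text: str) -> Vector:
    """Toy encoder: sum the vectors of the known words in the text."""
    total = [0.0, 0.0, 0.0]
    for word in text.split():
        for d, value in enumerate(VOCAB.get(word, (0.0, 0.0, 0.0))):
            total[d] += value
    return tuple(total)


def toy_decode(embedding: Vector) -> str:
    """Toy decoder: emit every word whose vector is strongly present."""
    return " ".join(w for w, v in VOCAB.items() if dot(embedding, v) > 0.5)


def round_trip_score(
    embedding: Vector,
    decode: Callable[[Vector], str],
    encode: Callable[[str], Vector],
) -> float:
    """Similarity between an embedding and the re-encoding of its decoded text."""
    return cosine_similarity(embedding, encode(decode(embedding)))


def is_anomalous(
    embedding: Vector,
    decode: Callable[[Vector], str],
    encode: Callable[[str], Vector],
    threshold: float = 0.9,  # assumed value, to be tuned on held-out data
) -> bool:
    """Flag the embedding when the round trip is not self-consistent."""
    return round_trip_score(embedding, decode, encode) < threshold
```

With these toy components, the clean embedding of "cat sat" survives the round trip unchanged, whereas a perturbed embedding such as `(1.0, 0.45, 0.45)` decodes to a different sentence and falls below the threshold.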

Do we really need Self-Attention for Streaming Automatic Speech Recognition?

Jan 27, 2026

Youness Dkhissi, Valentin Vielzeuf, Elys Allesiardo, Anthony Larcher

Abstract:Transformer-based architectures are the most widely used architectures in many deep learning fields, such as Natural Language Processing, Computer Vision, and Speech Processing. This may encourage the direct use of Transformers in constrained tasks, without questioning whether they will yield the same benefits as in standard tasks. Given specific constraints, it is essential to evaluate the relevance of transformer models. This work questions the suitability of transformers for specific domains. We argue that the high computational requirements and latency issues associated with these models do not align well with streaming applications. Our study promotes the search for alternative strategies to improve efficiency without sacrificing performance. In light of this observation, our paper critically examines the usefulness of the transformer architecture in such constrained environments. As a first attempt, we show that the computational cost for Streaming Automatic Speech Recognition (ASR) can be reduced using deformable convolution instead of Self-Attention. Furthermore, we show that Self-Attention mechanisms can be entirely removed and not replaced, without observing significant degradation in the Word Error Rate.

* International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE Signal Processing Society, May 2026, Barcelona, Spain
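
For readers unfamiliar with the Word Error Rate used as the evaluation metric above, the following is a minimal sketch of its standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name and the whitespace tokenization are assumptions made for illustration; the paper's own scoring pipeline is not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, a hypothesis that drops one word from a six-word reference has a WER of 1/6. Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.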

The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach

Oct 10, 2025

Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf

Abstract:This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.
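
As a rough illustration of what attention-pooling-based compression can look like, the sketch below reduces a sequence of T frame embeddings to k pooled vectors, one per query vector, using softmax-weighted averages. This is a generic scaled dot-product formulation and an assumption on our part: the query vectors would normally be learned, and the paper's actual pooling architecture is not specified on this page.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    frames: Sequence[Sequence[float]],
    queries: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Compress len(frames) embeddings into len(queries) pooled embeddings.

    Each query scores every frame with a scaled dot product, and the pooled
    vector is the softmax-weighted average of the frames.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    pooled = []
    for query in queries:
        scores = [sum(q * f for q, f in zip(query, frame)) / scale for frame in frames]
        weights = softmax(scores)
        pooled.append(
            [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
        )
    return pooled
```

The output length depends only on the number of queries, so a long spoken history can be passed to the language model as a fixed, smaller number of vectors. An all-zero query gives uniform weights and reduces to mean pooling.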

Investigating Low-Cost LLM Annotation for Spoken Dialogue Understanding Datasets

Jun 19, 2024

Lucas Druart, Valentin Vielzeuf, Yannick Estève

Abstract:In spoken Task-Oriented Dialogue (TOD) systems, the choice of the semantic representation describing the users' requests is key to a smooth interaction. Indeed, the system uses this representation to reason over a database and its domain knowledge to choose its next action. The dialogue course thus depends on the information provided by this semantic representation. While textual datasets provide fine-grained semantic representations, spoken dialogue datasets fall behind. This paper provides insights into automatic enhancement of spoken dialogue datasets' semantic representations. Our contributions are threefold: (1) assess the relevance of Large Language Model fine-tuning, (2) evaluate the knowledge captured by the produced annotations, and (3) highlight semi-automatic annotation implications.

* 27th International Conference on Text, Speech and Dialogue, Sep 2024, Brno, Czech Republic

Sustainable self-supervised learning for speech representations

Jun 11, 2024

Luis Lugo, Valentin Vielzeuf

Abstract:Sustainable artificial intelligence focuses on data, hardware, and algorithms to make machine learning models more environmentally responsible. In particular, machine learning models for speech representations are computationally expensive, generating environmental concerns because of their high energy consumption. Thus, we propose a sustainable self-supervised model to learn speech representation, combining optimizations in neural layers and training to reduce computing costs. The proposed model improves over a resource-efficient baseline, reducing both memory usage and computing cost estimations. It pretrains using a single GPU in less than a day. On top of that, it improves the error rate performance of the baseline in downstream task evaluations. When comparing it to large speech representation approaches, there is an order of magnitude reduction in memory usage, while computing cost reductions represent almost three orders of magnitude improvement.

Investigating the 'Autoencoder Behavior' in Speech Self-Supervised Models: a focus on HuBERT's Pretraining

May 14, 2024

Valentin Vielzeuf

Abstract:Self-supervised learning has shown great success in Speech Recognition. However, it has been observed that finetuning all layers of the learned model leads to lower performance compared to resetting top layers. This phenomenon is attributed to the "autoencoder" behavior: top layers contain information closer to the input and are less suitable for tasks that require linguistic information, such as Speech Recognition. To better our understanding of this behavior, we propose to study the evolution of high-level information within the model during pretraining. We focus on the HuBERT model, which exhibits a less pronounced "autoencoder" behavior. By experimentally exploring various factors that may have an impact, we aim to improve the training procedure and enhance the top layers of HuBERT for high-level tasks. Furthermore, our experiments demonstrate that these improvements in the training procedure result in faster convergence and competitive performance on downstream tasks.

Efficiency-oriented approaches for self-supervised speech representation learning

Dec 18, 2023

Luis Lugo, Valentin Vielzeuf

Abstract:Self-supervised learning enables the training of large neural models without the need for large, labeled datasets. It has been generating breakthroughs in several fields, including computer vision, natural language processing, biology, and speech. In particular, the state-of-the-art systems in several speech processing applications, such as automatic speech recognition or speaker identification, are models whose latent representation is learned using self-supervised approaches. Several configurations exist in self-supervised learning for speech, including contrastive, predictive, and multilingual approaches. There is, however, a crucial limitation in most existing approaches: their high computational costs. These costs limit the deployment of models, the size of the training dataset, and the number of research groups that can afford research with large self-supervised models. Likewise, we should consider the environmental costs that high energy consumption implies. Efforts in this direction comprise optimization of existing models, neural architecture efficiency, improvements in finetuning for speech processing tasks, and data efficiency. But despite current efforts, more work could be done to address high computational costs in self-supervised representation learning.

* 16 pages, 3 figures

Is one brick enough to break the wall of spoken dialogue state tracking?

Nov 03, 2023

Lucas Druart, Valentin Vielzeuf, Yannick Estève

Abstract:In Task-Oriented Dialogue (TOD) systems, correctly updating the system's understanding of the user's needs (a.k.a. dialogue state tracking) is key to a smooth interaction. Traditionally, TOD systems perform this update in three steps: transcription of the user's utterance, semantic extraction of the key concepts, and contextualization with the previously identified concepts. Such cascade approaches suffer from cascading errors and separate optimization. End-to-End approaches have proved helpful up to the semantic extraction step. This paper goes one step further, paving the path towards completely neural spoken dialogue state tracking by comparing three approaches: (1) a state-of-the-art cascade approach, (2) a locally E2E approach with rule-based contextualization, and (3) a completely neural approach. Our study highlights that although they all outperform the recent DSTC11 best model, especially with a filtering post-processing step, (1) remains the most accurate approach. Indeed, both (2) and (3) have trouble propagating context as dialogues unfold, showing that context propagation in completely neural approaches is an open challenge.

* Submitted to IEEE ICASSP 2024. © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Are cascade dialogue state tracking models speaking out of turn in spoken dialogues?

Nov 03, 2023

Lucas Druart, Léo Jacqmin, Benoît Favre, Lina Maria Rojas-Barahona, Valentin Vielzeuf

Abstract:In Task-Oriented Dialogue (TOD) systems, correctly updating the system's understanding of the user's needs is key to a smooth interaction. Traditionally, TOD systems are composed of several modules that interact with one another. While each of these components is the focus of active research communities, their behavior in interaction can be overlooked. This paper proposes a comprehensive analysis of the errors of state-of-the-art systems in complex settings such as Dialogue State Tracking, which depends heavily on the dialogue context. Based on spoken MultiWoz, we identify that errors on non-categorical slots' values are essential to address in order to bridge the gap between spoken and chat-based dialogue systems. We explore potential solutions to improve transcriptions and help dialogue state tracking generative models correct such errors.

OLISIA: a Cascade System for Spoken Dialogue State Tracking

Apr 20, 2023

Léo Jacqmin, Lucas Druart, Valentin Vielzeuf, Lina Maria Rojas-Barahona, Yannick Estève, Benoît Favre

Abstract:Though Dialogue State Tracking (DST) is a core component of spoken dialogue systems, recent work on this task mostly deals with chat corpora, disregarding the discrepancies between spoken and written language. In this paper, we propose OLISIA, a cascade system which integrates an Automatic Speech Recognition (ASR) model and a DST model. We introduce several adaptations in the ASR and DST modules to improve integration and robustness to spoken conversations. With these adaptations, our system ranked first in DSTC11 Track 3, a benchmark to evaluate spoken DST. We conduct an in-depth analysis of the results and find that normalizing the ASR outputs, adapting the DST inputs through data augmentation, and increasing the size of the pre-trained models all play an important role in reducing the performance discrepancy between written and spoken conversations.
