Livia Qian

Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?

May 21, 2026

Luca Modica, Filip Landin, Mehrdad Farahani, Livia Qian, Gabriel Skantze, Richard Johansson

Abstract: In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. This raises the question of how model-internal mechanisms are similar and different when operating in the two modalities. We focus on how these systems encode, store, and retrieve factual knowledge, which has previously been investigated for text-only models. To investigate mechanisms behind the storage and recall of factual associations in SLMs, we leverage Causal Mediation Analysis, a technique previously applied to text-based models. Initial results using SpiritLM, a multimodal model integrating discrete speech tokens, reveal discrepancies between text-to-text and speech-to-text results, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality. These results advance our understanding of how internal mechanisms encode factual associations in SLMs while contributing insights for improving speech-enabled AI systems.

* In *SEM 2026, the 15th Joint Conference on Lexical and Computational Semantics

Representation of perceived prosodic similarity of conversational feedback

May 19, 2025

Livia Qian, Carol Figueroa, Gabriel Skantze

Abstract: Vocal feedback (e.g., 'mhm', 'yeah', 'okay') is an important component of spoken dialogue and is crucial to ensuring common ground in conversational systems. The exact meaning of such feedback is conveyed through both lexical and prosodic form. In this work, we investigate the perceived prosodic similarity of vocal feedback with the same lexical form, and to what extent existing speech representations reflect such similarities. A triadic comparison task with recruited participants is used to measure perceived similarity of feedback responses taken from two different datasets. We find that spectral and self-supervised speech representations encode prosody better than extracted pitch features, especially in the case of feedback from the same speaker. We also find that it is possible to further condense and align the representations to human perception through contrastive learning.

* Interspeech 2025

Joint Learning of Context and Feedback Embeddings in Spoken Dialogue

Jun 11, 2024

Livia Qian, Gabriel Skantze

Abstract: Short feedback responses, such as backchannels, play an important role in spoken dialogue. So far, most of the modeling of feedback responses has focused on their timing, often neglecting how their lexical and prosodic forms influence their contextual appropriateness and conversational function. In this paper, we investigate the possibility of embedding short dialogue contexts and feedback responses in the same representation space using a contrastive learning objective. In our evaluation, we primarily focus on how such embeddings can be used as a context-feedback appropriateness metric and thus for feedback response ranking in U.S. English dialogues. Our results show that the model outperforms humans given the same ranking task and that the learned embeddings carry information about the conversational function of feedback responses.

* Interspeech 2024

Resolving References in Visually-Grounded Dialogue via Text Generation

Sep 23, 2023

Bram Willemsen, Livia Qian, Gabriel Skantze

Abstract: Vision-language models (VLMs) have been shown to be effective at image retrieval based on simple text queries, but text-image retrieval based on conversational input remains a challenge. Consequently, if we want to use VLMs for reference resolution in visually-grounded dialogue, the discourse processing capabilities of these models need to be augmented. To address this issue, we propose fine-tuning a causal large language model (LLM) to generate definite descriptions that summarize coreferential information found in the linguistic context of references. We then use a pretrained VLM to identify referents based on the generated descriptions, zero-shot. We evaluate our approach on a manually annotated dataset of visually-grounded dialogues and achieve results that, on average, exceed the performance of the baselines we compare against. Furthermore, we find that using referent descriptions based on larger context windows has the potential to yield higher returns.

* Published at SIGDIAL 2023
