Jonas Beskow

Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views

Oct 26, 2025

Anna Deichler, Jonas Beskow

Abstract: We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.

* 10 pages, 6 figures, 2 tables. Accepted to the NeurIPS 2025 Workshop on SPACE in Vision, Language, and Embodied AI (SpaVLE)

Gesture Evaluation in Virtual Reality

Sep 16, 2025

Axel Wiebe Werner, Jonas Beskow, Anna Deichler

Abstract: Gestures are central to human communication, enriching interactions through non-verbal expression. Virtual avatars increasingly use AI-generated gestures to enhance life-likeness, yet evaluations have largely been confined to 2D. Virtual Reality (VR) provides an immersive alternative that may affect how gestures are perceived. This paper presents a comparative evaluation of computer-generated gestures in VR and 2D, examining three models from the 2023 GENEA Challenge. Results show that gestures viewed in VR were rated slightly higher on average, with the strongest effect observed for motion-capture "true movement." While model rankings remained consistent across settings, VR influenced participants' overall perception and offered unique benefits over traditional 2D evaluation.

* Published in Proceedings of the 26th International Conference on Multimodal Interaction (ICMI '24), ACM, 2024. Copyright 2024 ACM. Licensed under CC BY

Towards Context-Aware Human-like Pointing Gestures with RL Motion Imitation

Sep 16, 2025

Anna Deichler, Siyang Wang, Simon Alexanderson, Jonas Beskow

Abstract: Pointing is a key mode of interaction with robots, yet most prior work has focused on recognition rather than generation. We present a motion capture dataset of human pointing gestures covering diverse styles, handedness, and spatial targets. Using reinforcement learning with motion imitation, we train policies that reproduce human-like pointing while maximizing precision. Results show our approach enables context-aware pointing behaviors in simulation, balancing task performance with natural dynamics.

* Presented at the Context-Awareness in HRI (CONAWA) Workshop, ACM/IEEE International Conference on Human-Robot Interaction (HRI 2022), March 7, 2022

Learning to Generate Pointing Gestures in Situated Embodied Conversational Agents

Sep 15, 2025

Anna Deichler, Siyang Wang, Simon Alexanderson, Jonas Beskow

Abstract: One of the main goals of robotics and intelligent agent research is to enable natural communication with humans in physically situated settings. While recent work has focused on verbal modes such as language and speech, non-verbal communication is crucial for flexible interaction. We present a framework for generating pointing gestures in embodied agents by combining imitation and reinforcement learning. Using a small motion capture dataset, our method learns a motor control policy that produces physically valid, naturalistic gestures with high referential accuracy. We evaluate the approach against supervised learning and retrieval baselines in both objective metrics and a virtual reality referential game with human users. Results show that our system achieves higher naturalness and accuracy than state-of-the-art supervised models, highlighting the promise of imitation-RL for communicative gesture generation and its potential application to robots.

* Frontiers in Robotics and AI, 10:1110534 (2023)
* DOI: 10.3389/frobt.2023.1110534. This is the author's LaTeX version

Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents

Aug 07, 2024

Anna Deichler, Simon Alexanderson, Jonas Beskow

Abstract: This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents' non-verbal behaviors, specifically gestures. Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void. Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis. We introduce a novel synthetic gesture dataset tailored for this purpose. This development represents a critical step toward creating embodied conversational agents that interact more naturally with their environment and users.

Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech

Jun 08, 2024

Shivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Éva Székely, Gustav Eje Henter

Abstract: Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech. Please see https://shivammehta25.github.io/prob_dur/ for audio and resources.

* 5 pages, 2 figures. Final version, accepted to Interspeech 2024
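
The contrast drawn in the abstract above, between a regression-style duration predictor that gives identical timings on every run and a model that samples durations, can be illustrated with a minimal sketch. This is not the paper's method: the paper samples from an OT-CFM model, whereas the sketch swaps in a simple log-normal sampler purely for illustration. The function names, the per-phone log-duration means, and the `log_std` parameter are all hypothetical.

```python
import math
import random


def deterministic_durations(log_means):
    """Regression-style durations: the same frame counts on every call."""
    return [max(1, round(math.exp(m))) for m in log_means]


def sampled_durations(log_means, log_std, rng):
    """Stochastic durations drawn from a log-normal around each mean.

    Assumption: a log-normal stands in for the learned probabilistic
    model; it is NOT the OT-CFM duration model described in the abstract.
    """
    return [max(1, round(math.exp(rng.gauss(m, log_std)))) for m in log_means]


if __name__ == "__main__":
    # Hypothetical predicted log-durations (in frames) for three phones.
    log_means = [math.log(3), math.log(8), math.log(5)]
    rng = random.Random(0)
    print("deterministic:", deterministic_durations(log_means))
    for _ in range(3):
        print("sampled:      ", sampled_durations(log_means, 0.5, rng))
```

Repeated calls to `deterministic_durations` return the same list, while repeated calls to `sampled_durations` generally differ, which is the variability the abstract argues matters most for spontaneous speech.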

Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis

Apr 30, 2024

Shivam Mehta, Anna Deichler, Jim O'Regan, Birger Moëll, Jonas Beskow, Gustav Eje Henter, Simon Alexanderson

Abstract: Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use unimodal synthesis models trained on large datasets to create multimodal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multimodal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data. See https://shivammehta25.github.io/MAGI/ for example output.

* 13+1 pages, 2 figures, accepted at the Human Motion Generation workshop (HuMoGen) at CVPR 2024

Unified speech and gesture synthesis using flow matching

Oct 08, 2023

Shivam Mehta, Ruibo Tu, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter

Abstract: As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in one single process. The new training regime, meanwhile, enables better synthesis quality in far fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks.

* 5 pages, 1 figure. Submitted to ICASSP 2024

Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation

Sep 11, 2023

Anna Deichler, Shivam Mehta, Simon Alexanderson, Jonas Beskow

Abstract: This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim of learning a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically-aware co-speech gesture generation. Our entry achieved the highest human-likeness and the highest speech-appropriateness ratings among the submitted entries. This indicates that our system is a promising approach to achieving human-like, semantically meaningful co-speech gestures in agents.
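
The abstract above says only that CSMP learns a joint embedding for speech and gesture through contrastive pretraining. One common way to train such an embedding is a symmetric, CLIP-style InfoNCE loss over paired batches; the sketch below shows that generic loss under the assumption that CSMP is of this kind, which the abstract does not confirm. The function names, the toy embeddings, and the temperature of 0.07 are hypothetical.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(speech, motion, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired speech/motion embeddings.

    Assumption: a generic CLIP-style objective, not the CSMP
    implementation from the paper.
    """
    s = [_normalize(v) for v in speech]
    g = [_normalize(v) for v in motion]
    # Cosine similarity of every speech clip with every motion clip.
    logits = [
        [sum(a * b for a, b in zip(si, gj)) / temperature for gj in g]
        for si in s
    ]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


if __name__ == "__main__":
    speech = [[1.0, 0.0], [0.0, 1.0]]
    print("aligned pairs:   ", contrastive_loss(speech, [[1.0, 0.0], [0.0, 1.0]]))
    print("mismatched pairs:", contrastive_loss(speech, [[0.0, 1.0], [1.0, 0.0]]))
```

The loss is low when each speech embedding is closest to its own motion embedding and high when the pairs are mismatched, which is what pushes the two modalities toward a shared space.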

Matcha-TTS: A fast TTS architecture with conditional flow matching

Sep 06, 2023

Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, Gustav Eje Henter

Abstract: We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest models on long utterances, and attains the highest mean opinion score in a listening test. Please see https://shivammehta25.github.io/Matcha-TTS/ for audio examples, code, and pre-trained models.

* 5 pages, 3 figures. Submitted to ICASSP 2024
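
For readers unfamiliar with OT-CFM, which the abstract above names as the training objective behind the ODE-based decoder, the following is a minimal sketch of the generic formulation from the flow matching literature. It is not the Matcha-TTS code (see the project page linked in the abstract for that). The function names, the `sigma_min` value, the toy vector fields, and the choice of a fixed-step Euler solver are assumptions made for illustration.

```python
import random

SIGMA_MIN = 1e-4  # assumed small constant; the value is illustrative


def ot_cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Point on the straight OT path from noise x0 to data x1, plus its target velocity.

    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1 - sigma_min) * x0
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(vector_field, x1, rng):
    """One-sample Monte Carlo estimate of the flow matching regression loss."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    t = rng.random()                        # time in [0, 1)
    x_t, u_t = ot_cfm_pair(x0, x1, t)
    v = vector_field(x_t, t)
    return sum((a - b) ** 2 for a, b in zip(v, u_t)) / len(x1)


def euler_sample(vector_field, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        v = vector_field(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    zero_field = lambda x, t: [0.0] * len(x)  # untrained stand-in for a network
    print("loss for zero field:", cfm_loss(zero_field, [3.0, 4.0], rng))
    unit_field = lambda x, t: [1.0] * len(x)
    print("Euler, 4 steps:", euler_sample(unit_field, [0.0, 0.0], 4))
```

Because the target velocity is constant along each straight path, a well-trained field can be integrated accurately with few Euler steps, which is consistent with the abstract's claim of high quality in fewer synthesis steps.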
