Jonas Beskow

Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views

Oct 26, 2025

Gesture Evaluation in Virtual Reality

Sep 16, 2025

Towards Context-Aware Human-like Pointing Gestures with RL Motion Imitation

Sep 16, 2025

Learning to Generate Pointing Gestures in Situated Embodied Conversational Agents

Sep 15, 2025

Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents

Aug 07, 2024

Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech

Jun 08, 2024

Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis

Apr 30, 2024

Unified speech and gesture synthesis using flow matching

Oct 08, 2023

Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation

Sep 11, 2023

Matcha-TTS: A fast TTS architecture with conditional flow matching

Sep 06, 2023