Generating responsive listener head dynamics with nuanced emotions and expressive reactions is crucial for practical dialogue modeling in various virtual avatar animations. Previous studies mainly focus on direct, short-term generation of listener behavior and overlook fine-grained control over motion variation and emotional intensity, especially in long-sequence modeling. Moreover, the lack of long-term, large-scale paired speaker-listener corpora with head dynamics and fine-grained multi-modal annotations (e.g., text-based expression descriptions, emotional intensity) further limits the application of dialogue modeling. Therefore, we first collect ListenerX, a large-scale multi-turn dataset of 3D dyadic conversations containing more than 1.4M valid frames for multi-modal responsive interaction. Additionally, we propose VividListener, a novel framework enabling fine-grained, expressive, and controllable modeling of listener dynamics. The framework leverages multi-modal conditions as guiding principles to foster coherent interactions between speakers and listeners. Specifically, we design the Responsive Interaction Module (RIM) to adaptively represent multi-modal interactive embeddings. RIM ensures that the listener dynamics achieve fine-grained semantic coordination with textual descriptions and adjustments, while preserving expressive reactions to speaker behavior. Meanwhile, we design Emotional Intensity Tags (EIT) for emotional intensity editing through multi-modal information integration, applied to both text descriptions and listener motion amplitude. Extensive experiments on our newly collected ListenerX dataset demonstrate that VividListener achieves state-of-the-art performance, realizing expressive and controllable listener dynamics.
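As a rough illustration of the conditioning scheme summarized above, the sketch below shows how per-frame speaker features, an encoded text description, and a scalar emotional-intensity tag might be fused into a single conditioning signal for a listener-motion generator. All names (ResponsiveInteractionSketch, speaker_feats, etc.) and the cross-attention plus intensity-modulation design are illustrative assumptions under this abstract, not the paper's actual implementation.

```python
# Illustrative sketch only: fusing speaker motion features, a text embedding,
# and an emotional-intensity tag into one conditioning signal. Module and
# variable names, and the intensity-scaling scheme, are assumptions.
import torch
import torch.nn as nn


class ResponsiveInteractionSketch(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        # Cross-attention: listener-frame queries attend to speaker frames and text tokens.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # A scalar intensity tag in [0, 1] is mapped to per-channel scale/shift factors.
        self.intensity_mlp = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, 2 * dim)
        )

    def forward(self, listener_q, speaker_feats, text_feats, intensity):
        # listener_q:    (B, T_l, dim) queries for the listener frames to generate
        # speaker_feats: (B, T_s, dim) per-frame speaker behavior features
        # text_feats:    (B, T_t, dim) encoded expression-description tokens
        # intensity:     (B, 1) scalar emotional-intensity tag
        context = torch.cat([speaker_feats, text_feats], dim=1)
        fused, _ = self.cross_attn(listener_q, context, context)
        scale, shift = self.intensity_mlp(intensity).unsqueeze(1).chunk(2, dim=-1)
        # Amplify or dampen the fused features according to the intensity tag.
        return fused * (1 + scale) + shift  # (B, T_l, dim) conditioning signal


# Toy usage with random tensors.
module = ResponsiveInteractionSketch()
cond = module(
    listener_q=torch.randn(2, 90, 256),
    speaker_feats=torch.randn(2, 90, 256),
    text_feats=torch.randn(2, 16, 256),
    intensity=torch.rand(2, 1),
)
print(cond.shape)  # torch.Size([2, 90, 256])
```

In this sketch the intensity tag acts as a simple feature-wise modulation, so the same mechanism could in principle scale listener motion amplitude while the text tokens steer semantics; how the actual RIM and EIT realize this is described in the paper itself.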