Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chaochao Li

JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning

Jan 31, 2026

Ruikui Wang, Jinheng Feng, Lang Tian, Huaishao Luo, Chaochao Li, Liangbo Zhou, Huan Zhang, Youzheng Wu, Xiaodong He

Abstract:Existing video avatar models have demonstrated impressive capabilities in scenarios such as talking, public speaking, and singing. However, the majority of these methods exhibit limited alignment with respect to text instructions, particularly when the prompts involve complex elements including large full-body movement, dynamic camera trajectory, background transitions, or human-object interactions. To break out this limitation, we present JoyAvatar, a framework capable of generating long duration avatar videos, featuring two key technical innovations. Firstly, we introduce a twin-teacher enhanced training algorithm that enables the model to transfer inherent text-controllability from the foundation model while simultaneously learning audio-visual synchronization. Secondly, during training, we dynamically modulate the strength of multi-modal conditions (e.g., audio and text) based on the distinct denoising timestep, aiming to mitigate conflicts between the heterogeneous conditioning signals. These two key designs serve to substantially expand the avatar model's capacity to generate natural, temporally coherent full-body motions and dynamic camera movements as well as preserve the basic avatar capabilities, such as accurate lip-sync and identity consistency. GSB evaluation results demonstrate that our JoyAvatar model outperforms the state-of-the-art models such as Omnihuman-1.5 and KlingAvatar 2.0. Moreover, our approach enables complex applications including multi-person dialogues and non-human subjects role-playing. Some video samples are provided on https://joyavatar.github.io/.

Via

Access Paper or Ask Questions

JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion

Dec 12, 2025

Chaochao Li, Ruikui Wang, Liangbo Zhou, Jinheng Feng, Huaishao Luo, Huan Zhang, Youzheng Wu, Xiaodong He

Abstract:Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by limitations such as high computational overhead and the inability to synthesize long-duration videos. Autoregressive methods address this problem by applying block-wise autoregressive diffusion methods. However, these methods suffer from the problem of error accumulation and quality degradation. To address this, we propose JoyAvatar, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), enhancing temporal coherence by injecting noise-corrupted previous frames as motion condition; and (3) Unbounded RoPE via Cache-Resetting (URCR), enabling infinite-length generation through dynamic positional encoding. Our 1.3B-parameter causal model achieves 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization.

Via

Access Paper or Ask Questions

PRISM: Proof-Carrying Artifact Generation through LLM x MDE Synergy and Stratified Constraints

Oct 29, 2025

Tong Ma, Hui Lai, Hui Wang, Zhenhu Tian, Jizhou Wang, Haichao Wu, Yongfan Gao, Chaochao Li, Fengjie Xu, Ling Fang

Abstract:PRISM unifies Large Language Models with Model-Driven Engineering to generate regulator-ready artifacts and machine-checkable evidence for safety- and compliance-critical domains. PRISM integrates three pillars: a Unified Meta-Model (UMM) reconciles heterogeneous schemas and regulatory text into a single semantic space; an Integrated Constraint Model (ICM) compiles structural and semantic requirements into enforcement artifacts including generation-time automata (GBNF, DFA) and post-generation validators (e.g., SHACL, SMT); and Constraint-Guided Verifiable Generation (CVG) applies these through two-layer enforcement - structural constraints drive prefix-safe decoding while semantic/logical validation produces machine-checkable certificates. When violations occur, PRISM performs audit-guided repair and records generation traces for compliance review. We evaluate PRISM in automotive software engineering (AUTOSAR) and cross-border legal jurisdiction (Brussels I bis). PRISM produces structurally valid, auditable artifacts that integrate with existing tooling and substantially reduce manual remediation effort, providing a practical path toward automated artifact generation with built-in assurance.

* 45 pages, 9 figures

Via

Access Paper or Ask Questions

Antagonistic Crowd Simulation Model Integrating Emotion Contagion and Deep Reinforcement Learning

Apr 29, 2021

Pei Lv, Boya Xu, Chaochao Li, Qingqing Yu, Bing Zhou, Mingliang Xu

Figure 1 for Antagonistic Crowd Simulation Model Integrating Emotion Contagion and Deep Reinforcement Learning

Figure 2 for Antagonistic Crowd Simulation Model Integrating Emotion Contagion and Deep Reinforcement Learning

Figure 3 for Antagonistic Crowd Simulation Model Integrating Emotion Contagion and Deep Reinforcement Learning

Figure 4 for Antagonistic Crowd Simulation Model Integrating Emotion Contagion and Deep Reinforcement Learning

Abstract:The antagonistic behavior of the crowd often exacerbates the seriousness of the situation in sudden riots, where the spreading of antagonistic emotion and behavioral decision making in the crowd play very important roles. However, the mechanism of complex emotion influencing decision making, especially in the environment of sudden confrontation, has not yet been explored clearly. In this paper, we propose one new antagonistic crowd simulation model by combing emotional contagion and deep reinforcement learning (ACSED). Firstly, we build a group emotional contagion model based on the improved SIS contagion disease model, and estimate the emotional state of the group at each time step during the simulation. Then, the tendency of group antagonistic behavior is modeled based on Deep Q Network (DQN), where the agent can learn the combat behavior autonomously, and leverages the mean field theory to quickly calculate the influence of other surrounding individuals on the central one. Finally, the rationality of the predicted behaviors by the DQN is further analyzed in combination with group emotion, and the final combat behavior of the agent is determined. The method proposed in this paper is verified through several different settings of experiments. The results prove that emotions have a vital impact on the group combat, and positive emotional states are more conducive to combat. Moreover, by comparing the simulation results with real scenes, the feasibility of the method is further verified, which can provide good reference for formulating battle plans and improving the winning rate of righteous groups battles in a variety of situations.

Via

Access Paper or Ask Questions

Personality-Aware Probabilistic Map for Trajectory Prediction of Pedestrians

Nov 01, 2019

Chaochao Li, Pei Lv, Mingliang Xu, Xinyu Wang, Dinesh Manocha, Bing Zhou, Meng Wang

Figure 1 for Personality-Aware Probabilistic Map for Trajectory Prediction of Pedestrians

Figure 2 for Personality-Aware Probabilistic Map for Trajectory Prediction of Pedestrians

Figure 3 for Personality-Aware Probabilistic Map for Trajectory Prediction of Pedestrians

Figure 4 for Personality-Aware Probabilistic Map for Trajectory Prediction of Pedestrians

Abstract:We present a novel trajectory prediction algorithm for pedestrians based on a personality-aware probabilistic feature map. This map is computed using a spatial query structure and each value represents the probability of the predicted pedestrian passing through various positions in the crowd space. We update this map dynamically based on the agents in the environment and prior trajectory of a pedestrian. Furthermore, we estimate the personality characteristics of each pedestrian and use them to improve the prediction by estimating the shortest path in this map. Our approach is general and works well on crowd videos with low and high pedestrian density. We evaluate our model on standard human-trajectory datasets. In practice, our prediction algorithm improves the accuracy by 5-9% over prior algorithms.

Via

Access Paper or Ask Questions