Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhuding Liang

OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space

Apr 24, 2026

Zhuding Liang, Tianyi Yan, Dubing Chen, Jiasen Zheng, Huan Zheng, Cheng-zhong Xu, Yida Wang, Kun Zhan, Jianbing Shen

Abstract:Generative world models increasingly rely on 4D occupancy for realistic autonomous driving simulation. However, existing generation frameworks depend on rigid geometric conditions (e.g., explicit trajectories) or simplistic attribute-level text, failing to orchestrate complex, sequential multi-agent interactions. To address this semantic-spatiotemporal gap, we propose OccDirector, a pioneering framework that generates 4D occupancy dynamics conditioned solely on natural language. Operating as a ``scenario director'', OccDirector maps natural language scripts into physically plausible voxel dynamics without requiring geometric priors. Technically, it employs a VLM-driven Spatio-Temporal MMDiT equipped with a history-prefix anchoring strategy to ensure long-horizon interaction consistency. Furthermore, we introduce OccInteract-85k, a novel dataset uniquely annotated with multi-level language instructions: ranging from static layouts to intricate multi-agent behaviors, alongside a novel VLM-based evaluation benchmark. Extensive experiments demonstrate that OccDirector achieves state-of-the-art generation quality and unprecedented instruction-following capabilities, successfully shifting the paradigm from appearance synthesis to language-driven behavior orchestration.

Via

Access Paper or Ask Questions

UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control

Dec 26, 2024

Wenzhang Sun, Xiang Li, Donglin Di, Zhuding Liang, Qiyuan Zhang, Hao Li, Wei Chen, Jianxun Cui

Figure 1 for UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control

Figure 2 for UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control

Figure 3 for UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control

Figure 4 for UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control

Abstract:Recently, animating portrait images using audio input is a popular task. Creating lifelike talking head videos requires flexible and natural movements, including facial and head dynamics, camera motion, realistic light and shadow effects. Existing methods struggle to offer comprehensive, multifaceted control over these aspects. In this work, we introduce UniAvatar, a designed method that provides extensive control over a wide range of motion and illumination conditions. Specifically, we use the FLAME model to render all motion information onto a single image, maintaining the integrity of 3D motion details while enabling fine-grained, pixel-level control. Beyond motion, this approach also allows for comprehensive global illumination control. We design independent modules to manage both 3D motion and illumination, permitting separate and combined control. Extensive experiments demonstrate that our method outperforms others in both broad-range motion control and lighting control. Additionally, to enhance the diversity of motion and environmental contexts in current datasets, we collect and plan to publicly release two datasets, DH-FaceDrasMvVid-100 and DH-FaceReliVid-200, which capture significant head movements during speech and various lighting scenarios.

Via

Access Paper or Ask Questions