Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xuehao Gao

MAGE:A Multi-stage Avatar Generator with Sparse Observations

May 09, 2025

Fangyu Du, Yang Yang, Xuehao Gao, Hongye Hou

Abstract:Inferring full-body poses from Head Mounted Devices, which capture only 3-joint observations from the head and wrists, is a challenging task with wide AR/VR applications. Previous attempts focus on learning one-stage motion mapping and thus suffer from an over-large inference space for unobserved body joint motions. This often leads to unsatisfactory lower-body predictions and poor temporal consistency, resulting in unrealistic or incoherent motion sequences. To address this, we propose a powerful Multi-stage Avatar GEnerator named MAGE that factorizes this one-stage direct motion mapping learning with a progressive prediction strategy. Specifically, given initial 3-joint motions, MAGE gradually inferring multi-scale body part poses at different abstract granularity levels, starting from a 6-part body representation and gradually refining to 22 joints. With decreasing abstract levels step by step, MAGE introduces more motion context priors from former prediction stages and thus improves realistic motion completion with richer constraint conditions and less ambiguity. Extensive experiments on large-scale datasets verify that MAGE significantly outperforms state-of-the-art methods with better accuracy and continuity.

Via

Access Paper or Ask Questions

Multi-Condition Latent Diffusion Network for Scene-Aware Neural Human Motion Prediction

May 30, 2024

Xuehao Gao, Yang Yang, Yang Wu, Shaoyi Du, Guo-Jun Qi

Figure 1 for Multi-Condition Latent Diffusion Network for Scene-Aware Neural Human Motion Prediction

Figure 2 for Multi-Condition Latent Diffusion Network for Scene-Aware Neural Human Motion Prediction

Figure 3 for Multi-Condition Latent Diffusion Network for Scene-Aware Neural Human Motion Prediction

Figure 4 for Multi-Condition Latent Diffusion Network for Scene-Aware Neural Human Motion Prediction

Abstract:Inferring 3D human motion is fundamental in many applications, including understanding human activity and analyzing one's intention. While many fruitful efforts have been made to human motion prediction, most approaches focus on pose-driven prediction and inferring human motion in isolation from the contextual environment, thus leaving the body location movement in the scene behind. However, real-world human movements are goal-directed and highly influenced by the spatial layout of their surrounding scenes. In this paper, instead of planning future human motion in a 'dark' room, we propose a Multi-Condition Latent Diffusion network (MCLD) that reformulates the human motion prediction task as a multi-condition joint inference problem based on the given historical 3D body motion and the current 3D scene contexts. Specifically, instead of directly modeling joint distribution over the raw motion sequences, MCLD performs a conditional diffusion process within the latent embedding space, characterizing the cross-modal mapping from the past body movement and current scene context condition embeddings to the future human motion embedding. Extensive experiments on large-scale human motion prediction datasets demonstrate that our MCLD achieves significant improvements over the state-of-the-art methods on both realistic and diverse predictions.

* Accepted by IEEE Transactions on Image Processing

Via

Access Paper or Ask Questions

GUESS:GradUally Enriching SyntheSis for Text-Driven Human Motion Generation

Jan 06, 2024

Xuehao Gao, Yang Yang, Zhenyu Xie, Shaoyi Du, Zhongqian Sun, Yang Wu

Abstract:In this paper, we propose a novel cascaded diffusion-based generative framework for text-driven human motion synthesis, which exploits a strategy named GradUally Enriching SyntheSis (GUESS as its abbreviation). The strategy sets up generation objectives by grouping body joints of detailed skeletons in close semantic proximity together and then replacing each of such joint group with a single body-part node. Such an operation recursively abstracts a human pose to coarser and coarser skeletons at multiple granularity levels. With gradually increasing the abstraction level, human motion becomes more and more concise and stable, significantly benefiting the cross-modal motion synthesis task. The whole text-driven human motion synthesis problem is then divided into multiple abstraction levels and solved with a multi-stage generation framework with a cascaded latent diffusion model: an initial generator first generates the coarsest human motion guess from a given text description; then, a series of successive generators gradually enrich the motion details based on the textual description and the previous synthesized results. Notably, we further integrate GUESS with the proposed dynamic multi-condition fusion mechanism to dynamically balance the cooperative effects of the given textual condition and synthesized coarse motion prompt in different generation stages. Extensive experiments on large-scale datasets verify that GUESS outperforms existing state-of-the-art methods by large margins in terms of accuracy, realisticness, and diversity. Code is available at https://github.com/Xuehao-Gao/GUESS.

* Accepted by IEEE Transactions on Visualization and Computer Graphics (2024)

Via

Access Paper or Ask Questions

Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model

Dec 18, 2023

Zhenyu Xie, Yang Wu, Xuehao Gao, Zhongqian Sun, Wei Yang, Xiaodan Liang

Figure 1 for Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model

Figure 2 for Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model

Figure 3 for Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model

Figure 4 for Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model

Abstract:Text-guided motion synthesis aims to generate 3D human motion that not only precisely reflects the textual description but reveals the motion details as much as possible. Pioneering methods explore the diffusion model for text-to-motion synthesis and obtain significant superiority. However, these methods conduct diffusion processes either on the raw data distribution or the low-dimensional latent space, which typically suffer from the problem of modality inconsistency or detail-scarce. To tackle this problem, we propose a novel Basic-to-Advanced Hierarchical Diffusion Model, named B2A-HDM, to collaboratively exploit low-dimensional and high-dimensional diffusion models for high quality detailed motion synthesis. Specifically, the basic diffusion model in low-dimensional latent space provides the intermediate denoising result that to be consistent with the textual description, while the advanced diffusion model in high-dimensional latent space focuses on the following detail-enhancing denoising process. Besides, we introduce a multi-denoiser framework for the advanced diffusion model to ease the learning of high-dimensional model and fully explore the generative potential of the diffusion model. Quantitative and qualitative experiment results on two text-to-motion benchmarks (HumanML3D and KIT-ML) demonstrate that B2A-HDM can outperform existing state-of-the-art methods in terms of fidelity, modality consistency, and diversity.

Via

Access Paper or Ask Questions