Xuetao Feng

MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

Oct 29, 2024

Yuan Wang, Di Huang, Yaqi Zhang, Wanli Ouyang, Jile Jiao, Xuetao Feng, Yan Zhou, Pengfei Wan, Shixiang Tang, Dan Xu

Abstract: Generating lifelike human motions from descriptive text has attracted considerable research attention in recent years, propelled by the emerging requirements of digital humans. Despite impressive advances, existing approaches are often constrained by limited control modalities, task specificity, and a sole focus on body motion representations. In this paper, we present MotionGPT-2, a unified Large Motion-Language Model (LMLM) that addresses these limitations. MotionGPT-2 accommodates multiple motion-relevant tasks and supports multimodal control conditions through pre-trained Large Language Models (LLMs). It quantizes multimodal inputs, such as text and single-frame poses, into discrete, LLM-interpretable tokens and integrates them into the LLM's vocabulary. These tokens are then organized into unified prompts, guiding the LLM to generate motion outputs through a pretraining-then-finetuning paradigm. We also show that the proposed MotionGPT-2 is highly adaptable to the challenging 3D holistic motion generation task, enabled by the motion discretization framework Part-Aware VQVAE, which ensures fine-grained representations of body and hand movements. Extensive experiments and visualizations validate the effectiveness of our method, demonstrating the adaptability of MotionGPT-2 across motion generation, motion captioning, and generalized motion completion tasks.
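
The quantization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy codebook, the vocabulary size `TEXT_VOCAB_SIZE`, and all function names are hypothetical, and a plain nearest-neighbour lookup stands in for the learned Part-Aware VQVAE. The sketch only shows the general idea of mapping continuous pose features to discrete codes and placing those codes after the text vocabulary so a language model can treat them as extra tokens.

```python
# Illustrative sketch only: the codebook, sizes, and names below are
# hypothetical and are not taken from the MotionGPT-2 paper or its code.

TEXT_VOCAB_SIZE = 32000  # assumed size of the base LLM text vocabulary

# Toy codebook: each entry is a 2-D prototype of a pose feature.
CODEBOOK = [
    (0.0, 0.0),
    (1.0, 0.0),
    (0.0, 1.0),
    (1.0, 1.0),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(feature):
    """Return the index of the codebook entry nearest to `feature`."""
    return min(
        range(len(CODEBOOK)),
        key=lambda i: squared_distance(feature, CODEBOOK[i]),
    )


def motion_to_token_ids(features):
    """Map per-frame features to token IDs placed after the text vocabulary."""
    return [TEXT_VOCAB_SIZE + quantize(f) for f in features]


def token_ids_to_codes(token_ids):
    """Recover the codebook prototypes from motion token IDs."""
    return [CODEBOOK[t - TEXT_VOCAB_SIZE] for t in token_ids]


frames = [(0.1, -0.2), (0.9, 0.2), (0.8, 0.9)]
ids = motion_to_token_ids(frames)
print(ids)  # [32000, 32001, 32003]
```

Offsetting the code indices by the text vocabulary size keeps motion tokens and text tokens in disjoint ID ranges, which is one simple way to extend a vocabulary without colliding with existing entries.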

Holistic-Motion2D: Scalable Whole-body Human Motion Generation in 2D Space

Jun 17, 2024

Yuan Wang, Zhao Wang, Junhao Gong, Di Huang, Tong He, Wanli Ouyang, Jile Jiao, Xuetao Feng, Qi Dou, Shixiang Tang (+1 more)

Abstract: In this paper, we introduce a novel path to general human motion generation by focusing on 2D space. Traditional methods have primarily generated human motions in 3D, which, while detailed and realistic, are often limited by the scope of available 3D motion data in terms of both size and diversity. To address these limitations, we exploit the extensive availability of 2D motion data. We present Holistic-Motion2D, the first comprehensive and large-scale benchmark for 2D whole-body motion generation, which includes over 1M in-the-wild motion sequences, each paired with high-quality whole-body/partial pose annotations and textual descriptions. Notably, Holistic-Motion2D is ten times larger than the previously largest 3D motion dataset. We also introduce a baseline method, featuring whole-body part-aware attention and confidence-aware modeling techniques, tailored for 2D Text-drivEN whole-boDy motion genERation, namely Tender. Extensive experiments demonstrate the effectiveness of Holistic-Motion2D and Tender in generating expressive, diverse, and realistic human motions. We also highlight the utility of 2D motion for various downstream applications and its potential for lifting to 3D motion. The page link is: https://holistic-motion2d.github.io.

* 22 pages, 11 figures, 17 tables

Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning

Apr 23, 2024

Wenjin Hou, Shiming Chen, Shuhuang Chen, Ziming Hong, Yan Wang, Xuetao Feng, Salman Khan, Fahad Shahbaz Khan, Xinge You

Abstract: Generative zero-shot learning (ZSL) learns a generator to synthesize visual samples for unseen classes, which is an effective way to advance ZSL. However, existing generative methods condition on Gaussian noise and a predefined semantic prototype, so the generator is optimized only on specific seen classes rather than characterizing each visual instance, resulting in poor generalization (e.g., overfitting to seen classes). To address this issue, we propose a novel Visual-Augmented Dynamic Semantic prototype method (termed VADS) that helps the generator learn an accurate semantic-visual mapping by fully exploiting visual-augmented knowledge in the semantic conditions. In detail, VADS consists of two modules: (1) the Visual-aware Domain Knowledge Learning module (VDKL) learns the local bias and global prior of the visual features (referred to as domain visual knowledge), which replace pure Gaussian noise to provide richer prior noise information; (2) the Vision-Oriented Semantic Updation module (VOSU) updates the semantic prototype according to the visual representations of the samples. Ultimately, we concatenate their outputs as a dynamic semantic prototype, which serves as the condition of the generator. Extensive experiments demonstrate that our VADS achieves superior CZSL and GZSL performance on three prominent datasets and outperforms other state-of-the-art methods with average increases of 6.4%, 5.9% and 4.2% on SUN, CUB and AWA2, respectively.

Neural Surface Reconstruction of Dynamic Scenes with Monocular RGB-D Camera

Jun 30, 2022

Hongrui Cai, Wanquan Feng, Xuetao Feng, Yan Wang, Juyong Zhang

Abstract: We propose Neural-DynamicReconstruction (NDR), a template-free method to recover high-fidelity geometry and motions of a dynamic scene from a monocular RGB-D camera. In NDR, we adopt the neural implicit function for surface representation and rendering such that the captured color and depth can be fully utilized to jointly optimize the surface and deformations. To represent and constrain the non-rigid deformations, we propose a novel neural invertible deforming network such that the cycle consistency between any two frames is automatically satisfied. Considering that the surface topology of a dynamic scene might change over time, we employ a topology-aware strategy to construct the topology-variant correspondence for the fused frames. NDR also refines the camera poses in a global optimization manner. Experiments on public datasets and our collected dataset demonstrate that NDR outperforms existing monocular dynamic reconstruction methods.
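
Why an invertible deformation yields cycle consistency automatically can be seen in a small sketch. The abstract does not specify the network design, so the example below swaps in a generic additive coupling construction, and the `shift` function is a hypothetical stand-in for a learned network. Each frame time `t` defines an invertible map to a shared canonical space; warping from one frame to another goes through that space, so warping back recovers the original point by construction.

```python
import math

# Illustrative sketch only: additive coupling is a generic invertible
# construction used here as an assumption; it is not claimed to be the
# deformation network used in NDR.


def shift(x, t):
    """Hypothetical stand-in for a learned, time-conditioned network."""
    return math.sin(3.0 * x + t) + 0.5 * t


def to_canonical(point, t):
    """Invertible map from frame time t to the canonical space."""
    x, y, z = point
    y = y + shift(x, t)      # y is shifted using x only
    z = z + shift(x + y, t)  # z is shifted using x and the updated y
    return (x, y, z)


def from_canonical(point, t):
    """Exact inverse of to_canonical: undo the shifts in reverse order."""
    x, y, z = point
    z = z - shift(x + y, t)
    y = y - shift(x, t)
    return (x, y, z)


def warp(point, t_src, t_dst):
    """Move a point from frame t_src to frame t_dst via canonical space."""
    return from_canonical(to_canonical(point, t_src), t_dst)


p = (0.3, -1.2, 2.0)
q = warp(p, 0.1, 0.7)          # frame 0.1 -> frame 0.7
p_back = warp(q, 0.7, 0.1)     # frame 0.7 -> frame 0.1
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(p, p_back)))
```

Because every step is inverted exactly, no extra loss term is needed to enforce that the round trip returns to the starting point; the only discrepancy is floating-point rounding.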

* Project page: https://ustc3dv.github.io/ndr

Multi-initialization Optimization Network for Accurate 3D Human Pose and Shape Estimation

Dec 24, 2021

Zhiwei Liu, Xiangyu Zhu, Lu Yang, Xiang Yan, Ming Tang, Zhen Lei, Guibo Zhu, Xuetao Feng, Yan Wang, Jinqiao Wang

Abstract: 3D human pose and shape recovery from a monocular RGB image is a challenging task. Existing learning-based methods depend heavily on weak supervision signals, e.g. 2D and 3D joint locations, due to the lack of in-the-wild paired 3D supervision. However, given the 2D-to-3D ambiguities that exist in these weak supervision labels, the network easily gets stuck in local optima when trained with such labels. In this paper, we reduce the ambiguity by optimizing multiple initializations. Specifically, we propose a three-stage framework named Multi-Initialization Optimization Network (MION). In the first stage, we strategically select different coarse 3D reconstruction candidates which are compatible with the 2D keypoints of the input sample. Each coarse reconstruction can be regarded as an initialization that leads to one optimization branch. In the second stage, we design a mesh refinement transformer (MRT) to refine each coarse reconstruction result via a self-attention mechanism. Finally, a Consistency Estimation Network (CEN) is proposed to find the best result from multiple candidates by evaluating whether the visual evidence in the RGB image matches a given 3D reconstruction. Experiments demonstrate that our Multi-Initialization Optimization Network outperforms existing 3D mesh-based methods on multiple public benchmarks.

* Accepted by ACM Multimedia 2021

Dual-Tuning: Joint Prototype Transfer and Structure Regularization for Compatible Feature Learning

Aug 06, 2021

Yan Bai, Jile Jiao, Shengsen Wu, Yihang Lou, Jun Liu, Xuetao Feng, Ling-Yu Duan

Abstract: Visual retrieval systems face frequent model updates and deployments, and re-extracting features for the whole database every time is a heavy workload. Feature compatibility enables newly learned visual features to be compared directly with the old features stored in the database. In this way, when updating the deployed model, we can bypass the inflexible and time-consuming feature re-extraction process. However, the old feature space that the new one must be compatible with is not ideal, and it suffers from a distribution discrepancy with the new space caused by different supervision losses. In this work, we propose a global optimization Dual-Tuning method to obtain feature compatibility across different networks and losses. A feature-level prototype loss is proposed to explicitly align two types of embedding features by transferring global prototype information. Furthermore, we design a component-level mutual structural regularization to implicitly optimize the intrinsic structure of the features. Experimental results on million-scale datasets demonstrate that our Dual-Tuning is able to obtain feature compatibility without sacrificing performance. (Our code will be available at https://github.com/yanbai1993/Dual-Tuning)
