Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xinxin Zuo

Towards 4D Human Video Stylization

Dec 07, 2023

Tiantian Wang, Xinxin Zuo, Fangzhou Mu, Jian Wang, Ming-Hsuan Yang

Abstract:We present a first step towards 4D (3D and time) human video stylization, which addresses style transfer, novel view synthesis and human animation within a unified framework. While numerous video stylization methods have been developed, they are often restricted to rendering images in specific viewpoints of the input video, lacking the capability to generalize to novel views and novel poses in dynamic scenes. To overcome these limitations, we leverage Neural Radiance Fields (NeRFs) to represent videos, conducting stylization in the rendered feature space. Our innovative approach involves the simultaneous representation of both the human subject and the surrounding scene using two NeRFs. This dual representation facilitates the animation of human subjects across various poses and novel viewpoints. Specifically, we introduce a novel geometry-guided tri-plane representation, significantly enhancing feature representation robustness compared to direct tri-plane optimization. Following the video reconstruction, stylization is performed within the NeRFs' rendered feature space. Extensive experiments demonstrate that the proposed method strikes a superior balance between stylized textures and temporal coherence, surpassing existing approaches. Furthermore, our framework uniquely extends its capabilities to accommodate novel poses and viewpoints, making it a versatile tool for creative human video stylization.

* Under Review

Via

Access Paper or Ask Questions

TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration

Apr 05, 2023

Kehong Gong, Dongze Lian, Heng Chang, Chuan Guo, Xinxin Zuo, Zihang Jiang, Xinchao Wang

Figure 1 for TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration

Figure 2 for TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration

Figure 3 for TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration

Figure 4 for TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration

Abstract:We propose a novel task for generating 3D dance movements that simultaneously incorporate both text and music modalities. Unlike existing works that generate dance movements using a single modality such as music, our goal is to produce richer dance movements guided by the instructive information provided by the text. However, the lack of paired motion data with both music and text modalities limits the ability to generate dance movements that integrate both. To alleviate this challenge, we propose to utilize a 3D human motion VQ-VAE to project the motions of the two datasets into a latent space consisting of quantized vectors, which effectively mix the motion tokens from the two datasets with different distributions for training. Additionally, we propose a cross-modal transformer to integrate text instructions into motion generation architecture for generating 3D dance movements without degrading the performance of music-conditioned dance generation. To better evaluate the quality of the generated motion, we introduce two novel metrics, namely Motion Prediction Distance (MPD) and Freezing Score, to measure the coherence and freezing percentage of the generated motion. Extensive experiments show that our approach can generate realistic and coherent dance movements conditioned on both text and music while maintaining comparable performance with the two single modalities. Code will be available at: https://garfield-kh.github.io/TM2D/.

Via

Access Paper or Ask Questions

Event-based Human Pose Tracking by Spiking Spatiotemporal Transformer

Mar 31, 2023

Shihao Zou, Yuxuan Mu, Xinxin Zuo, Sen Wang, Li Cheng

Figure 1 for Event-based Human Pose Tracking by Spiking Spatiotemporal Transformer

Figure 2 for Event-based Human Pose Tracking by Spiking Spatiotemporal Transformer

Figure 3 for Event-based Human Pose Tracking by Spiking Spatiotemporal Transformer

Figure 4 for Event-based Human Pose Tracking by Spiking Spatiotemporal Transformer

Abstract:Event camera, as an emerging biologically-inspired vision sensor for capturing motion dynamics, presents new potential for 3D human pose tracking, or video-based 3D human pose estimation. However, existing works in pose tracking either require the presence of additional gray-scale images to establish a solid starting pose, or ignore the temporal dependencies all together by collapsing segments of event streams to form static event frames. Meanwhile, although the effectiveness of Artificial Neural Networks (ANNs, a.k.a. dense deep learning) has been showcased in many event-based tasks, the use of ANNs tends to neglect the fact that compared to the dense frame-based image sequences, the occurrence of events from an event camera is spatiotemporally much sparser. Motivated by the above mentioned issues, we present in this paper a dedicated end-to-end sparse deep learning approach for event-based pose tracking: 1) to our knowledge this is the first time that 3D human pose tracking is obtained from events only, thus eliminating the need of accessing to any frame-based images as part of input; 2) our approach is based entirely upon the framework of Spiking Neural Networks (SNNs), which consists of Spike-Element-Wise (SEW) ResNet and a novel Spiking Spatiotemporal Transformer; 3) a large-scale synthetic dataset is constructed that features a broad and diverse set of annotated 3D human motions, as well as longer hours of event stream data, named SynEventHPD. Empirical experiments demonstrate that, with superior performance over the state-of-the-art (SOTA) ANNs counterparts, our approach also achieves a significant computation reduction of 80% in FLOPS. Furthermore, our proposed method also outperforms SOTA SNNs in the regression task of human pose tracking. Our implementation is available at https://github.com/JimmyZou/HumanPoseTracking_SNN and dataset will be released upon paper acceptance.

Via

Access Paper or Ask Questions

3D Pose Estimation and Future Motion Prediction from 2D Images

Nov 26, 2021

Ji Yang, Youdong Ma, Xinxin Zuo, Sen Wang, Minglun Gong, Li Cheng

Figure 1 for 3D Pose Estimation and Future Motion Prediction from 2D Images

Figure 2 for 3D Pose Estimation and Future Motion Prediction from 2D Images

Figure 3 for 3D Pose Estimation and Future Motion Prediction from 2D Images

Figure 4 for 3D Pose Estimation and Future Motion Prediction from 2D Images

Abstract:This paper considers to jointly tackle the highly correlated tasks of estimating 3D human body poses and predicting future 3D motions from RGB image sequences. Based on Lie algebra pose representation, a novel self-projection mechanism is proposed that naturally preserves human motion kinematics. This is further facilitated by a sequence-to-sequence multi-task architecture based on an encoder-decoder topology, which enables us to tap into the common ground shared by both tasks. Finally, a global refinement module is proposed to boost the performance of our framework. The effectiveness of our approach, called PoseMoNet, is demonstrated by ablation tests and empirical evaluations on Human3.6M and HumanEva-I benchmark, where competitive performance is obtained comparing to the state-of-the-arts.

* Accepted by Pattern Recognition

Via

Access Paper or Ask Questions

Action2video: Generating Videos of Human 3D Actions

Nov 12, 2021

Chuan Guo, Xinxin Zuo, Sen Wang, Xinshuang Liu, Shihao Zou, Minglun Gong, Li Cheng

Figure 1 for Action2video: Generating Videos of Human 3D Actions

Figure 2 for Action2video: Generating Videos of Human 3D Actions

Figure 3 for Action2video: Generating Videos of Human 3D Actions

Figure 4 for Action2video: Generating Videos of Human 3D Actions

Abstract:We aim to tackle the interesting yet challenging problem of generating videos of diverse and natural human motions from prescribed action categories. The key issue lies in the ability to synthesize multiple distinct motion sequences that are realistic in their visual appearances. It is achieved in this paper by a two-step process that maintains internal 3D pose and shape representations, action2motion and motion2video. Action2motion stochastically generates plausible 3D pose sequences of a prescribed action category, which are processed and rendered by motion2video to form 2D videos. Specifically, the Lie algebraic theory is engaged in representing natural human motions following the physical law of human kinematics; a temporal variational auto-encoder (VAE) is developed that encourages diversity of output motions. Moreover, given an additional input image of a clothed human character, an entire pipeline is proposed to extract his/her 3D detailed shape, and to render in videos the plausible motions from different views. This is realized by improving existing methods to extract 3D human shapes and textures from single 2D images, rigging, animating, and rendering to form 2D videos of human motions. It also necessitates the curation and reannotation of 3D human motion datasets for training purpose. Thorough empirical experiments including ablation study, qualitative and quantitative evaluations manifest the applicability of our approach, and demonstrate its competitiveness in addressing related tasks, where components of our approach are compared favorably to the state-of-the-arts.

* Accepted by IJCV

Via

Access Paper or Ask Questions

Human Pose and Shape Estimation from Single Polarization Images

Aug 15, 2021

Shihao Zou, Xinxin Zuo, Sen Wang, Yiming Qian, Chuan Guo, Wei Ji, Jingjing Li, Minglun Gong, Li Cheng

Figure 1 for Human Pose and Shape Estimation from Single Polarization Images

Figure 2 for Human Pose and Shape Estimation from Single Polarization Images

Figure 3 for Human Pose and Shape Estimation from Single Polarization Images

Figure 4 for Human Pose and Shape Estimation from Single Polarization Images

Abstract:This paper focuses on a new problem of estimating human pose and shape from single polarization images. Polarization camera is known to be able to capture the polarization of reflected lights that preserves rich geometric cues of an object surface. Inspired by the recent applications in surface normal reconstruction from polarization images, in this paper, we attempt to estimate human pose and shape from single polarization images by leveraging the polarization-induced geometric cues. A dedicated two-stage pipeline is proposed: given a single polarization image, stage one (Polar2Normal) focuses on the fine detailed human body surface normal estimation; stage two (Polar2Shape) then reconstructs clothed human shape from the polarization image and the estimated surface normal. To empirically validate our approach, a dedicated dataset (PHSPD) is constructed, consisting of over 500K frames with accurate pose and shape annotations. Empirical evaluations on this real-world dataset as well as a synthetic dataset, SURREAL, demonstrate the effectiveness of our approach. It suggests polarization camera as a promising alternative to the more conventional RGB camera for human pose and shape estimation.

* Submitted to IEEE TIP

Via

Access Paper or Ask Questions

EventHPE: Event-based 3D Human Pose and Shape Estimation

Aug 15, 2021

Shihao Zou, Chuan Guo, Xinxin Zuo, Sen Wang, Pengyu Wang, Xiaoqin Hu, Shoushun Chen, Minglun Gong, Li Cheng

Figure 1 for EventHPE: Event-based 3D Human Pose and Shape Estimation

Figure 2 for EventHPE: Event-based 3D Human Pose and Shape Estimation

Figure 3 for EventHPE: Event-based 3D Human Pose and Shape Estimation

Figure 4 for EventHPE: Event-based 3D Human Pose and Shape Estimation

Abstract:Event camera is an emerging imaging sensor for capturing dynamics of moving objects as events, which motivates our work in estimating 3D human pose and shape from the event signals. Events, on the other hand, have their unique challenges: rather than capturing static body postures, the event signals are best at capturing local motions. This leads us to propose a two-stage deep learning approach, called EventHPE. The first-stage, FlowNet, is trained by unsupervised learning to infer optical flow from events. Both events and optical flow are closely related to human body dynamics, which are fed as input to the ShapeNet in the second stage, to estimate 3D human shapes. To mitigate the discrepancy between image-based flow (optical flow) and shape-based flow (vertices movement of human body shape), a novel flow coherence loss is introduced by exploiting the fact that both flows are originated from the identical human motion. An in-house event-based 3D human dataset is curated that comes with 3D pose and shape annotations, which is by far the largest one to our knowledge. Empirical evaluations on DHP19 dataset and our in-house dataset demonstrate the effectiveness of our approach.

* ICCV 2021

Via

Access Paper or Ask Questions

Detailed Avatar Recovery from Single Image

Aug 06, 2021

Hao Zhu, Xinxin Zuo, Haotian Yang, Sen Wang, Xun Cao, Ruigang Yang

Abstract:This paper presents a novel framework to recover \emph{detailed} avatar from a single image. It is a challenging task due to factors such as variations in human shapes, body poses, texture, and viewpoints. Prior methods typically attempt to recover the human body shape using a parametric-based template that lacks the surface details. As such resulting body shape appears to be without clothing. In this paper, we propose a novel learning-based framework that combines the robustness of the parametric model with the flexibility of free-form 3D deformation. We use the deep neural networks to refine the 3D shape in a Hierarchical Mesh Deformation (HMD) framework, utilizing the constraints from body joints, silhouettes, and per-pixel shading information. Our method can restore detailed human body shapes with complete textures beyond skinned models. Experiments demonstrate that our method has outperformed previous state-of-the-art approaches, achieving better accuracy in terms of both 2D IoU number and 3D metric distance.

* Accepted by TPAMI

Via

Access Paper or Ask Questions

Object Wake-up: 3-D Object Reconstruction, Animation, and in-situ Rendering from a Single Image

Aug 05, 2021

Xinxin Zuo, Ji Yang, Sen Wang, Zhenbo Yu, Xinyu Li, Bingbing Ni, Minglun Gong, Li Cheng

Figure 1 for Object Wake-up: 3-D Object Reconstruction, Animation, and in-situ Rendering from a Single Image

Figure 2 for Object Wake-up: 3-D Object Reconstruction, Animation, and in-situ Rendering from a Single Image

Figure 3 for Object Wake-up: 3-D Object Reconstruction, Animation, and in-situ Rendering from a Single Image

Figure 4 for Object Wake-up: 3-D Object Reconstruction, Animation, and in-situ Rendering from a Single Image

Abstract:Given a picture of a chair, could we extract the 3-D shape of the chair, animate its plausible articulations and motions, and render in-situ in its original image space? The above question prompts us to devise an automated approach to extract and manipulate articulated objects in single images. Comparing with previous efforts on object manipulation, our work goes beyond 2-D manipulation and focuses on articulable objects, thus introduces greater flexibility for possible object deformations. The pipeline of our approach starts by reconstructing and refining a 3-D mesh representation of the object of interest from an input image; its control joints are predicted by exploiting the semantic part segmentation information; the obtained object 3-D mesh is then rigged \& animated by non-rigid deformation, and rendered to perform in-situ motions in its original image space. Quantitative evaluations are carried out on 3-D reconstruction from single images, an established task that is related to our pipeline, where our results surpass those of the SOTAs by a noticeable margin. Extensive visual results also demonstrate the applicability of our approach.

Via

Access Paper or Ask Questions

Unsupervised 3D Human Mesh Recovery from Noisy Point Clouds

Jul 15, 2021

Xinxin Zuo, Sen Wang, Minglun Gong, Li Cheng

Figure 1 for Unsupervised 3D Human Mesh Recovery from Noisy Point Clouds

Figure 2 for Unsupervised 3D Human Mesh Recovery from Noisy Point Clouds

Figure 3 for Unsupervised 3D Human Mesh Recovery from Noisy Point Clouds

Figure 4 for Unsupervised 3D Human Mesh Recovery from Noisy Point Clouds

Abstract:This paper presents a novel unsupervised approach to reconstruct human shape and pose from noisy point cloud. Traditional approaches search for correspondences and conduct model fitting iteratively where a good initialization is critical. Relying on large amount of dataset with ground-truth annotations, recent learning-based approaches predict correspondences for every vertice on the point cloud; Chamfer distance is usually used to minimize the distance between a deformed template model and the input point cloud. However, Chamfer distance is quite sensitive to noise and outliers, thus could be unreliable to assign correspondences. To address these issues, we model the probability distribution of the input point cloud as generated from a parametric human model under a Gaussian Mixture Model. Instead of explicitly aligning correspondences, we treat the process of correspondence search as an implicit probabilistic association by updating the posterior probability of the template model given the input. A novel unsupervised loss is further derived that penalizes the discrepancy between the deformed template and the input point cloud conditioned on the posterior probability. Our approach is very flexible, which works with both complete point cloud and incomplete ones including even a single depth image as input. Our network is trained from scratch with no need to warm-up the network with supervised data. Compared to previous unsupervised methods, our method shows the capability to deal with substantial noise and outliers. Extensive experiments conducted on various public synthetic datasets as well as a very noisy real dataset (i.e. CMU Panoptic) demonstrate the superior performance of our approach over the state-of-the-art methods. Code can be found \url{https://github.com/wangsen1312/unsupervised3dhuman.git}

Via

Access Paper or Ask Questions