Zhigang Tu

Skinned Motion Retargeting with Residual Perception of Motion Semantics & Geometry

Mar 15, 2023
Jiaxu Zhang, Junwu Weng, Di Kang, Fang Zhao, Shaoli Huang, Xuefei Zhe, Linchao Bao, Ying Shan, Jue Wang, Zhigang Tu

Good motion retargeting cannot be achieved without reasonable consideration of source-target differences at both the skeleton and shape geometry levels. In this work, we propose a novel Residual RETargeting network (R2ET) structure, which relies on two neural modification modules, to adjust the source motions to fit the target skeletons and shapes progressively. In particular, a skeleton-aware module is introduced to preserve the source motion semantics. A shape-aware module is designed to perceive the geometries of target characters to reduce interpenetration and contact-missing. Driven by our distance-based losses that explicitly model the motion semantics and geometry, these two modules learn residual motion modifications on the source motion to generate plausible retargeted motion in a single inference pass without post-processing. To balance these two modifications, we further present a balancing gate that conducts linear interpolation between them. Extensive experiments on the public Mixamo dataset demonstrate that our R2ET achieves state-of-the-art performance and provides a good balance between the preservation of motion semantics and the attenuation of interpenetration and contact-missing. Code is available at https://github.com/Kebii/R2ET.

* CVPR 2023 
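The balancing gate is the part of R2ET that is easiest to picture in code: it blends the skeleton-aware and shape-aware residuals before they are applied to the source motion. The sketch below only illustrates that linear interpolation, assuming per-frame motion features drive the gate; the module name, tensor shapes, and gate inputs are hypothetical rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class BalancingGate(nn.Module):
    """Illustrative gate that linearly interpolates between the
    skeleton-aware and shape-aware residual modifications."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # Predict a per-frame blending weight in [0, 1] from motion features
        # (the actual gate inputs in R2ET may differ).
        self.gate = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, source_motion, skeleton_residual, shape_residual, motion_feat):
        # source_motion:      (B, T, J, 3) source joint motion
        # skeleton_residual:  (B, T, J, 3) modification preserving semantics
        # shape_residual:     (B, T, J, 3) modification reducing interpenetration
        # motion_feat:        (B, T, feat_dim) per-frame motion features
        alpha = self.gate(motion_feat)                 # (B, T, 1)
        alpha = alpha.unsqueeze(-1)                    # (B, T, 1, 1)
        # Linear interpolation between the two residuals, applied on top of
        # the source motion in a single pass.
        blended = alpha * skeleton_residual + (1.0 - alpha) * shape_residual
        return source_motion + blended
```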

Distilling Inter-Class Distance for Semantic Segmentation

May 07, 2022
Zhengbo Zhang, Chunluan Zhou, Zhigang Tu

Knowledge distillation is widely adopted in semantic segmentation to reduce the computation cost. Previous knowledge distillation methods for semantic segmentation focus on pixel-wise feature alignment and intra-class feature variation distillation, neglecting to transfer the knowledge of the inter-class distance in the feature space, which is important for semantic segmentation. To address this issue, we propose an Inter-class Distance Distillation (IDD) method to transfer the inter-class distance in the feature space from the teacher network to the student network. Furthermore, semantic segmentation is a position-dependent task, thus we exploit a position information distillation module to help the student network encode more position information. Extensive experiments on three popular datasets, Cityscapes, Pascal VOC, and ADE20K, show that our method helps improve the accuracy of semantic segmentation models and achieves state-of-the-art performance. For example, it boosts the benchmark model ("PSPNet+ResNet18") by 7.50% in accuracy on the Cityscapes dataset.

* 7 pages, 5 figures 
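To make the inter-class distance idea concrete, the sketch below computes class-center features from teacher and student feature maps using the ground-truth mask, builds their pairwise distance matrices, and penalizes the mismatch. The function name, shapes, and the MSE matching are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def inter_class_distance_loss(feat_s, feat_t, label, num_classes):
    """Sketch of an inter-class distance distillation term.

    feat_s, feat_t: (B, C, H, W) student / teacher feature maps
    label:          (B, h, w)    ground-truth class map
    """
    B, C, H, W = feat_s.shape
    # Bring the label map to the feature resolution.
    label = F.interpolate(label[:, None].float(), size=(H, W), mode="nearest")[:, 0].long()

    def class_centers(feat):
        centers = []
        for c in range(num_classes):
            mask = (label == c).unsqueeze(1).float()               # (B, 1, H, W)
            denom = mask.sum().clamp(min=1.0)                      # avoid 0-div for absent classes
            centers.append((feat * mask).sum(dim=(0, 2, 3)) / denom)
        return torch.stack(centers)                                # (K, C)

    cs, ct = class_centers(feat_s), class_centers(feat_t)
    # Pairwise inter-class distances in feature space, matched between networks.
    dist_s = torch.cdist(cs.unsqueeze(0), cs.unsqueeze(0)).squeeze(0)   # (K, K)
    dist_t = torch.cdist(ct.unsqueeze(0), ct.unsqueeze(0)).squeeze(0)
    return F.mse_loss(dist_s, dist_t)
```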

MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video

Mar 27, 2022
Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, Junsong Yuan

Recent transformer-based solutions have been introduced to estimate 3D human pose from a 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation. We observe that the motions of different joints differ significantly. However, previous methods cannot efficiently model the solid inter-frame correspondence of each joint, leading to insufficient learning of spatio-temporal correlation. We propose MixSTE (Mixed Spatio-Temporal Encoder), which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. These two blocks are utilized alternately to obtain better spatio-temporal feature encoding. In addition, the network output is extended from the central frame to all frames of the input video, thereby improving the coherence between the input and output sequences. Extensive experiments are conducted on three benchmarks (Human3.6M, MPI-INF-3DHP, and HumanEva). The results show that our model outperforms the state-of-the-art approach by 10.9% P-MPJPE and 7.6% MPJPE. The code is available at https://github.com/JinluZhang1126/MixSTE.

* CVPR2022 Accepted Paper 
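A minimal sketch of the alternating encoding described above: one transformer pass attends across joints within each frame (spatial), the next attends across frames for each joint (temporal). It uses stock nn.TransformerEncoderLayer blocks and assumed tensor shapes, so it illustrates the seq2seq spatio-temporal idea rather than reproducing the paper's architecture.

```python
import torch
import torch.nn as nn

class MixedSTBlock(nn.Module):
    """One spatial + one temporal transformer pass over a (T, J) token grid."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, J, D) per-joint tokens for every frame
        B, T, J, D = x.shape
        # Spatial attention: joints attend to each other within a frame.
        x = self.spatial(x.reshape(B * T, J, D)).reshape(B, T, J, D)
        # Temporal attention: each joint attends to itself across frames.
        x = x.permute(0, 2, 1, 3).reshape(B * J, T, D)
        x = self.temporal(x).reshape(B, J, T, D).permute(0, 2, 1, 3)
        return x

# Example: 243-frame clip, 17 joints, embedded to 256-D tokens;
# the output keeps one token per joint per frame (seq2seq).
tokens = torch.randn(2, 243, 17, 256)
out = MixedSTBlock()(tokens)   # (2, 243, 17, 256)
```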

Optical Flow for Video Super-Resolution: A Survey

Mar 20, 2022
Zhigang Tu, Hongyan Li, Wei Xie, Yuanzhong Liu, Shifu Zhang, Baoxin Li, Junsong Yuan

Video super-resolution is currently one of the most active research topics in computer vision, as it plays an important role in many visual applications. Generally, video super-resolution contains a significant component, i.e., motion compensation, which is used to estimate the displacement between successive video frames for temporal alignment. Optical flow, which can supply dense and sub-pixel motion between consecutive frames, is among the most common approaches to this task. To obtain a good understanding of the role that optical flow plays in video super-resolution, in this work we conduct a comprehensive review of this subject for the first time. This investigation covers the following major topics: the function of super-resolution (i.e., why we require super-resolution); the concept of video super-resolution (i.e., what video super-resolution is); the description of evaluation metrics (i.e., how (video) super-resolution performance is measured); the introduction of optical flow-based video super-resolution; and the investigation of using optical flow to capture temporal dependency for video super-resolution. In particular, we give an in-depth study of deep learning based video super-resolution methods, where some representative algorithms are analyzed and compared. Additionally, we highlight some promising research directions and open issues that should be further addressed.
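The motion-compensation step the survey centers on is typically implemented as backward warping of a neighboring frame with the estimated flow. A minimal sketch, assuming the flow is given in pixels by some external estimator:

```python
import torch
import torch.nn.functional as F

def flow_warp(frame, flow):
    """Backward-warp a neighboring frame onto the reference frame.

    frame: (B, C, H, W) neighboring frame
    flow:  (B, 2, H, W) optical flow from reference to neighbor, in pixels
    """
    B, _, H, W = frame.shape
    # Base sampling grid in pixel coordinates (x, y).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)      # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                  # (B, 2, H, W)
    # Normalize to [-1, 1] for grid_sample.
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)            # (B, H, W, 2)
    return F.grid_sample(frame, sample_grid, align_corners=True)
```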

Slow-Fast Visual Tempo Learning for Video-based Action Recognition

Feb 24, 2022
Yuanzhong Liu, Zhigang Tu, Hongyan Li, Chi Chen, Baoxin Li, Junsong Yuan

Action visual tempo characterizes the dynamics and the temporal scale of an action, which is helpful for distinguishing human actions that share high similarities in visual dynamics and appearance. Previous methods capture the visual tempo either by sampling raw videos at multiple rates, which requires a costly multi-layer network to handle each rate, or by hierarchically sampling backbone features, which relies heavily on high-level features that miss fine-grained temporal dynamics. In this work, we propose a Temporal Correlation Module (TCM), which can be easily embedded into current action recognition backbones in a plug-and-play manner, to extract action visual tempo from low-level backbone features at a single layer. Specifically, our TCM contains two main components: a Multi-scale Temporal Dynamics Module (MTDM) and a Temporal Attention Module (TAM). MTDM applies a correlation operation to learn pixel-wise fine-grained temporal dynamics for both fast tempo and slow tempo. TAM adaptively emphasizes expressive features and suppresses inessential ones by analyzing the global information across various tempos. Extensive experiments on several action recognition benchmarks, e.g. Something-Something V1 & V2, Kinetics-400, UCF-101, and HMDB-51, demonstrate that the proposed TCM effectively improves the performance of existing video-based action recognition models by a large margin. The source code is publicly released at https://github.com/zphyix/TCM.
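The correlation operation inside MTDM can be pictured as a local cost volume between adjacent frames' feature maps. The sketch below is an assumed, simplified version: for each position it compares frame t's feature with a (2r+1)x(2r+1) neighborhood in frame t+1; the actual neighborhood size, normalization, and multi-scale handling in TCM may differ.

```python
import torch
import torch.nn.functional as F

def temporal_correlation(feat_t, feat_t1, radius=3):
    """Pixel-wise correlation between two adjacent frames' feature maps.

    feat_t, feat_t1: (B, C, H, W) features of frame t and t+1
    Returns a (B, (2r+1)^2, H, W) cost volume: for every position, the
    similarity to a (2r+1)x(2r+1) neighborhood in the next frame.
    """
    B, C, H, W = feat_t.shape
    k = 2 * radius + 1
    # Gather each position's neighborhood in the next frame.
    neigh = F.unfold(feat_t1, kernel_size=k, padding=radius)       # (B, C*k*k, H*W)
    neigh = neigh.view(B, C, k * k, H, W)
    # Dot-product correlation, normalized by the channel dimension.
    corr = (feat_t.unsqueeze(2) * neigh).sum(dim=1) / C ** 0.5     # (B, k*k, H, W)
    return corr
```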

Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition

Feb 08, 2022
Zhigang Tu, Jiaxu Zhang, Hongyan Li, Yujin Chen, Junsong Yuan

In recent years, graph convolutional networks (GCNs) have played an increasingly critical role in skeleton-based human action recognition. However, most GCN-based methods still have two main limitations: 1) they only consider the motion information of the joints or process the joints and bones separately, and thus are unable to fully explore the latent functional correlation between joints and bones for action recognition; 2) most of these works are performed in a fully supervised manner, which heavily relies on massive labeled training data. To address these issues, we propose a semi-supervised skeleton-based action recognition method, a setting that has rarely been explored before. We design a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder to achieve semi-supervised learning. Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream, thereby promoting both streams to learn more discriminative feature representations. The pose-prediction-based auto-encoder in the self-supervised training stage allows the network to learn motion representations from unlabeled data, which is essential for action recognition. Extensive experiments on two popular datasets, i.e. NTU-RGB+D and Kinetics-Skeleton, demonstrate that our model achieves state-of-the-art performance for semi-supervised skeleton-based action recognition and is also useful for fully-supervised methods.
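The bone stream that the joint-bone fusion operates on is commonly derived from the joint stream as parent-to-child vectors. A small sketch of that derivation, with an illustrative parent list (not the actual NTU-RGB+D skeleton):

```python
import torch

# Parent index of each joint; joint i has parent PARENTS[i], the root points
# to itself. This short list is purely illustrative.
PARENTS = [0, 0, 1, 2, 1]

def joints_to_bones(joints, parents=PARENTS):
    """Derive the bone stream from the joint stream.

    joints: (B, T, J, 3) joint coordinates over time, with J == len(parents)
    Returns bones of the same shape, where bone i is the vector from the
    parent of joint i to joint i (zero vector at the root).
    """
    parent_idx = torch.as_tensor(parents, device=joints.device)
    return joints - joints[:, :, parent_idx, :]

# A two-stream model would encode `joints` and `joints_to_bones(joints)`
# separately and exchange information between the streams, as the
# correlation-driven fusion in the abstract above describes.
```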

Consistent 3D Hand Reconstruction in Video via self-supervised Learning

Jan 24, 2022
Zhigang Tu, Zhisheng Huang, Yujin Chen, Di Kang, Linchao Bao, Bisheng Yang, Junsong Yuan

We present a method for reconstructing accurate and consistent 3D hands from a monocular video. We observe that detected 2D hand keypoints and the image texture provide important cues about the geometry and texture of the 3D hand, which can reduce or even eliminate the requirement for 3D hand annotations. Thus we propose ${\rm {S}^{2}HAND}$, a self-supervised 3D hand reconstruction model that can jointly estimate pose, shape, texture, and the camera viewpoint from a single RGB input through the supervision of easily accessible detected 2D keypoints. We leverage the continuous hand motion information contained in unlabeled video data and propose ${\rm {S}^{2}HAND(V)}$, which uses a weight-shared ${\rm {S}^{2}HAND}$ to process each frame and exploits additional motion, texture, and shape consistency constraints to promote more accurate hand poses and more consistent shapes and textures. Experiments on benchmark datasets demonstrate that our self-supervised approach produces hand reconstructions comparable to recent fully-supervised methods in the single-frame input setup, and notably improves the reconstruction accuracy and consistency when using video training data.

* arXiv admin note: substantial text overlap with arXiv:2103.11703 
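The motion, texture, and shape consistency constraints can be pictured as simple penalties on how much per-frame predictions are allowed to drift across a clip. The sketch below is an assumed simplification with hypothetical tensor layouts, not the paper's loss definitions:

```python
import torch

def video_consistency_losses(joints3d, textures, shapes):
    """Illustrative temporal-consistency terms for per-frame predictions.

    joints3d: (T, J, 3) predicted 3D joints per frame
    textures: (T, D)    per-frame texture parameters
    shapes:   (T, S)    per-frame shape parameters
    """
    # Motion smoothness: penalize large frame-to-frame jumps of the joints.
    motion = (joints3d[1:] - joints3d[:-1]).norm(dim=-1).mean()
    # Texture / shape consistency: the same hand should keep the same
    # appearance and geometry across the clip.
    texture = (textures[1:] - textures[:-1]).abs().mean()
    shape = (shapes - shapes.mean(dim=0, keepdim=True)).abs().mean()
    return motion + texture + shape
```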

Model-based 3D Hand Reconstruction via Self-Supervised Learning

Mar 22, 2021
Yujin Chen, Zhigang Tu, Di Kang, Linchao Bao, Ying Zhang, Xuefei Zhe, Ruizhi Chen, Junsong Yuan

Reconstructing a 3D hand from a single-view RGB image is challenging due to various hand configurations and depth ambiguity. To reliably reconstruct a 3D hand from a monocular image, most state-of-the-art methods heavily rely on 3D annotations at the training stage, but obtaining 3D annotations is expensive. To alleviate reliance on labeled training data, we propose S2HAND, a self-supervised 3D hand reconstruction network that can jointly estimate pose, shape, texture, and the camera viewpoint. Specifically, we obtain geometric cues from the input image through easily accessible 2D detected keypoints. To learn an accurate hand reconstruction model from these noisy geometric cues, we utilize the consistency between 2D and 3D representations and propose a set of novel losses to rationalize the outputs of the neural network. For the first time, we demonstrate the feasibility of training an accurate 3D hand reconstruction network without relying on manual annotations. Our experiments show that the proposed method achieves comparable performance with recent fully-supervised methods while using less supervision data.

* Accepted by CVPR21 
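The 2D-3D consistency at the core of this weak supervision can be sketched as a confidence-weighted reprojection error between the projected 3D joints and the detected 2D keypoints. The weak-perspective camera and all names below are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def keypoint_reprojection_loss(joints3d, keypoints2d, confidence, cam):
    """Sketch of a 2D-3D consistency term driven by noisy detected keypoints.

    joints3d:    (B, J, 3) predicted 3D hand joints in camera space
    keypoints2d: (B, J, 2) detected 2D keypoints (noisy supervision)
    confidence:  (B, J)    detector confidence, used to down-weight noise
    cam:         (B, 3)    assumed weak-perspective camera: scale, tx, ty
    """
    scale = cam[:, :1].unsqueeze(-1)                    # (B, 1, 1)
    trans = cam[:, 1:].unsqueeze(1)                     # (B, 1, 2)
    projected = scale * joints3d[..., :2] + trans       # (B, J, 2)
    err = (projected - keypoints2d).norm(dim=-1)        # (B, J)
    return (confidence * err).mean()
```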

Multi-Attribute Enhancement Network for Person Search

Feb 16, 2021
Lequan Chen, Wei Xie, Zhigang Tu, Yaping Tao, Xinming Wang

Person Search is designed to jointly solve the problems of Person Detection and Person Re-identification (Re-ID), in which the target person must be located within a large number of uncropped images. Over the past few years, Person Search based on deep learning has made great progress. Visual character attributes play a key role in retrieving the query person; they have been explored in Re-ID but ignored in Person Search. Therefore, we introduce attribute learning into the model, allowing the use of attribute features for the retrieval task. Specifically, we propose a simple and effective model called Multi-Attribute Enhancement (MAE), which introduces attribute tags to learn local features. In addition to learning the global representation of pedestrians, it also learns the local representation, and combines the two aspects to learn robust features that promote the search performance. Additionally, we verify the effectiveness of our module on the existing benchmark datasets CUHK-SYSU and PRW. Ultimately, our model achieves state-of-the-art performance among end-to-end methods, reaching 91.8% mAP and 93.0% rank-1 accuracy on CUHK-SYSU. Code and models are available at https://github.com/chenlq123/MAE.
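One way to picture the attribute-enhanced representation is a head that learns a global Re-ID embedding plus one small attribute-supervised branch per tag, then concatenates them for retrieval. The sketch below is an assumed layout (attribute count, dimensions, and binary tags are hypothetical), not the MAE implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeEnhancedHead(nn.Module):
    """Illustrative head combining a global Re-ID embedding with
    attribute-supervised local embeddings, as the abstract describes."""

    def __init__(self, in_dim=2048, embed_dim=256, num_attributes=6):
        super().__init__()
        self.global_embed = nn.Linear(in_dim, embed_dim)
        # One small branch per attribute tag (e.g. gender, backpack, ...).
        self.attr_embeds = nn.ModuleList(
            [nn.Linear(in_dim, embed_dim) for _ in range(num_attributes)]
        )
        self.attr_logits = nn.ModuleList(
            [nn.Linear(embed_dim, 2) for _ in range(num_attributes)]
        )

    def forward(self, roi_feat):
        # roi_feat: (N, in_dim) pooled feature of one detected person box
        g = self.global_embed(roi_feat)
        locals_ = [e(roi_feat) for e in self.attr_embeds]
        # Attribute predictions supervise the local branches during training.
        attr_preds = [c(l) for c, l in zip(self.attr_logits, locals_)]
        # Retrieval feature: global and attribute-aware local parts together.
        feat = torch.cat([g] + locals_, dim=1)
        return F.normalize(feat, dim=1), attr_preds
```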
