Yujin Chen

S4R: Self-Supervised Semantic Scene Reconstruction from RGB-D Scans

Feb 21, 2023
Junwen Huang, Alexey Artemov, Yujin Chen, Shuaifeng Zhi, Kai Xu, Matthias Nießner

Most deep learning approaches to comprehensive semantic modeling of 3D indoor spaces require costly dense annotations in the 3D domain. In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction, using a fully self-supervised approach. To this end, we design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images, fusing cross-domain features into volumetric embeddings to predict complete 3D geometry, color, and semantics. Our key technical innovation is to leverage differentiable rendering of color and semantics, using the observed RGB images and a generic semantic segmentation model as color and semantics supervision, respectively. We additionally develop a method to synthesize an augmented set of virtual training views complementing the original real captures, enabling more efficient self-supervision for semantics. The result is an end-to-end trainable solution that jointly addresses geometry completion, colorization, and semantic mapping from a few RGB-D images, without any 3D or 2D ground truth. To our knowledge, ours is the first fully self-supervised method addressing completion and semantic segmentation of real-world 3D scans. It performs on par with 3D-supervised baselines, surpasses baselines with 2D supervision on real datasets, and generalizes well to unseen scenes.
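
As a rough illustration of the rendering-based self-supervision described above, the sketch below (PyTorch-style, with hypothetical names and shapes, not the authors' code) compares a differentiably rendered color image against the captured RGB frame and rendered semantic logits against pseudo-labels from a generic 2D segmentation model:

```python
import torch
import torch.nn.functional as F

def rendering_losses(rendered_rgb, rendered_sem_logits, observed_rgb, pseudo_labels, valid_mask):
    """rendered_rgb, observed_rgb: (B, 3, H, W); rendered_sem_logits: (B, C, H, W);
    pseudo_labels: (B, H, W) class ids from a frozen 2D segmenter; valid_mask: (B, H, W) bool
    marking pixels actually covered by the rendered geometry."""
    mask = valid_mask.unsqueeze(1).float()
    # Photometric term: L1 between rendered and observed color on rendered pixels.
    color_loss = (torch.abs(rendered_rgb - observed_rgb) * mask).sum() / mask.sum().clamp(min=1.0)
    # Semantic term: cross-entropy against the 2D pseudo-labels on rendered pixels.
    ce = F.cross_entropy(rendered_sem_logits, pseudo_labels, reduction="none")  # (B, H, W)
    sem_loss = (ce * valid_mask.float()).sum() / valid_mask.float().sum().clamp(min=1.0)
    return color_loss, sem_loss
```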

MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video

Mar 27, 2022
Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, Junsong Yuan

Recent transformer-based solutions have been introduced to estimate 3D human pose from 2D keypoint sequences by considering body joints among all frames globally to learn spatio-temporal correlation. We observe that the motions of different joints differ significantly. However, previous methods cannot efficiently model the inter-frame correspondence of each individual joint, leading to insufficient learning of spatio-temporal correlation. We propose MixSTE (Mixed Spatio-Temporal Encoder), which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. These two blocks are applied alternately to obtain better spatio-temporal feature encoding. In addition, the network output is extended from the central frame to all frames of the input video, thereby improving the coherence between the input and output sequences. Extensive experiments are conducted on three benchmarks (Human3.6M, MPI-INF-3DHP, and HumanEva). The results show that our model outperforms the previous state-of-the-art approach by 10.9% in P-MPJPE and 7.6% in MPJPE. The code is available at https://github.com/JinluZhang1126/MixSTE.

* CVPR2022 Accepted Paper 
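
To make the alternating design concrete, here is a minimal PyTorch-style sketch, assuming a (batch, frames, joints, channels) token layout; class and parameter names are illustrative, not the released implementation (see the repository linked above for the real code):

```python
import torch
import torch.nn as nn

class MixedSTEBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x):                                   # x: (B, T, J, C)
        b, t, j, c = x.shape
        x = self.spatial(x.reshape(b * t, j, c)).reshape(b, t, j, c)  # inter-joint attention per frame
        x = x.permute(0, 2, 1, 3).reshape(b * j, t, c)
        x = self.temporal(x).reshape(b, j, t, c).permute(0, 2, 1, 3)  # per-joint attention over frames
        return x

class MixSTESketch(nn.Module):
    def __init__(self, num_joints=17, dim=256, depth=8):
        super().__init__()
        self.embed = nn.Linear(2, dim)                      # lift each 2D keypoint to a token
        self.blocks = nn.ModuleList([MixedSTEBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, 3)                       # regress 3D per joint, per frame

    def forward(self, kpts_2d):                             # kpts_2d: (B, T, J, 2)
        x = self.embed(kpts_2d)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)                                 # seq2seq output: (B, T, J, 3)
```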

Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition

Feb 08, 2022
Zhigang Tu, Jiaxu Zhang, Hongyan Li, Yujin Chen, Junsong Yuan

In recent years, graph convolutional networks (GCNs) have played an increasingly critical role in skeleton-based human action recognition. However, most GCN-based methods still have two main limitations: 1) they only consider the motion information of the joints or process the joints and bones separately, and therefore cannot fully explore the latent functional correlation between joints and bones for action recognition; 2) most of these works are trained in a fully supervised manner, which heavily relies on massive labeled training data. To address these issues, we propose a semi-supervised skeleton-based action recognition method, a setting that has rarely been explored before. We design a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder to achieve semi-supervised learning. Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream, thereby promoting both streams to learn more discriminative feature representations. The pose-prediction-based auto-encoder in the self-supervised training stage allows the network to learn motion representations from unlabeled data, which is essential for action recognition. Extensive experiments on two popular datasets, i.e., NTU-RGB+D and Kinetics-Skeleton, demonstrate that our model achieves state-of-the-art performance for semi-supervised skeleton-based action recognition and is also useful for fully supervised methods.
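
A minimal sketch of the semi-supervised recipe, assuming generic encoder/decoder callables rather than the paper's actual CD-JBF-GCN modules, might combine a supervised classification loss on labeled clips with a pose-prediction loss on unlabeled clips:

```python
import torch
import torch.nn.functional as F

def bones_from_joints(joints, parents):
    """joints: (B, C, T, V) joint stream; parents: list of parent indices per joint.
    The bone stream is the difference between each joint and its parent."""
    return joints - joints[..., parents]

def semi_supervised_step(encoder, cls_head, pose_decoder, labeled, labels, unlabeled, lam=1.0):
    """labeled/unlabeled: (B, C, T, V) skeleton tensors; labels: (B,) action ids.
    encoder/cls_head/pose_decoder are placeholder modules standing in for the paper's networks."""
    # Supervised branch: action classification on labeled skeletons.
    cls_loss = F.cross_entropy(cls_head(encoder(labeled)), labels)
    # Self-supervised branch: predict the held-out last frame of each unlabeled clip.
    context, target = unlabeled[:, :, :-1], unlabeled[:, :, -1:]
    pose_loss = F.mse_loss(pose_decoder(encoder(context)), target)
    return cls_loss + lam * pose_loss
```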

Consistent 3D Hand Reconstruction in Video via Self-Supervised Learning

Jan 24, 2022
Zhigang Tu, Zhisheng Huang, Yujin Chen, Di Kang, Linchao Bao, Bisheng Yang, Junsong Yuan

We present a method for reconstructing accurate and consistent 3D hands from a monocular video. We observe that detected 2D hand keypoints and the image texture provide important cues about the geometry and texture of the 3D hand, which can reduce or even eliminate the requirement for 3D hand annotations. Thus we propose ${\rm {S}^{2}HAND}$, a self-supervised 3D hand reconstruction model that can jointly estimate pose, shape, texture, and the camera viewpoint from a single RGB input through the supervision of easily accessible 2D detected keypoints. We further leverage the continuous hand motion information contained in unlabeled video data and propose ${\rm {S}^{2}HAND(V)}$, which uses a weight-shared ${\rm {S}^{2}HAND}$ model to process each frame and exploits additional motion, texture, and shape consistency constraints to promote more accurate hand poses and more consistent shapes and textures. Experiments on benchmark datasets demonstrate that our self-supervised approach achieves hand reconstruction performance comparable to recent fully-supervised methods in the single-frame input setting, and notably improves reconstruction accuracy and consistency when using video training data.

* arXiv admin note: substantial text overlap with arXiv:2103.11703 
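
For illustration only, the following sketch (with assumed tensor shapes and placeholder names, not the released code) shows the flavor of such cross-frame consistency terms: shape and texture parameters predicted per frame by the weight-shared model are encouraged to agree across the video, and poses to vary smoothly:

```python
import torch

def video_consistency_losses(shape_params, texture_params, pose_params):
    """All inputs are (T, D) tensors predicted frame-by-frame by one shared model."""
    shape_loss = shape_params.var(dim=0).mean()          # frames of one hand share a shape
    texture_loss = texture_params.var(dim=0).mean()      # and a texture
    motion_loss = (pose_params[1:] - pose_params[:-1]).pow(2).mean()  # smooth motion over time
    return shape_loss + texture_loss + motion_loss
```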

4DContrast: Contrastive Learning with Dynamic Correspondences for 3D Scene Understanding

Dec 06, 2021
Yujin Chen, Matthias Nießner, Angela Dai

We present a new approach to instill 4D dynamic object priors into learned 3D representations by unsupervised pre-training. We observe that the dynamic movement of an object through an environment provides important cues about its objectness, and thus propose to imbue learned 3D representations with such dynamic understanding, which can then be effectively transferred to improve performance on downstream 3D semantic scene understanding tasks. We propose a new data augmentation scheme that leverages synthetic 3D shapes moving in static 3D environments, and employ contrastive learning under 3D-4D constraints that encode 4D invariances into the learned 3D representations. Experiments demonstrate that our unsupervised representation learning improves downstream 3D semantic segmentation, object detection, and instance segmentation and, moreover, notably improves performance in data-scarce scenarios.

* Video: https://youtu.be/qhGhWZmJq3U 
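
As a hedged sketch of a contrastive objective of this kind, assuming corresponding features have already been gathered from the 3D and 4D backbones at matched points (the correspondence logic itself is omitted, and this is not the paper's exact loss), a point-level InfoNCE term could look like:

```python
import torch
import torch.nn.functional as F

def point_info_nce(feats_3d, feats_4d, temperature=0.07):
    """feats_3d, feats_4d: (N, C) features at N corresponding points from the two backbones."""
    a = F.normalize(feats_3d, dim=1)
    b = F.normalize(feats_4d, dim=1)
    logits = a @ b.t() / temperature                 # (N, N) pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)          # i-th 3D point should match i-th 4D point
```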

Model-based 3D Hand Reconstruction via Self-Supervised Learning

Mar 22, 2021
Yujin Chen, Zhigang Tu, Di Kang, Linchao Bao, Ying Zhang, Xuefei Zhe, Ruizhi Chen, Junsong Yuan

Reconstructing a 3D hand from a single-view RGB image is challenging due to the variety of hand configurations and depth ambiguity. To reliably reconstruct a 3D hand from a monocular image, most state-of-the-art methods heavily rely on 3D annotations at the training stage, but obtaining 3D annotations is expensive. To alleviate reliance on labeled training data, we propose S2HAND, a self-supervised 3D hand reconstruction network that can jointly estimate pose, shape, texture, and the camera viewpoint. Specifically, we obtain geometric cues from the input image through easily accessible 2D detected keypoints. To learn an accurate hand reconstruction model from these noisy geometric cues, we utilize the consistency between 2D and 3D representations and propose a set of novel losses to rationalize the outputs of the neural network. For the first time, we demonstrate the feasibility of training an accurate 3D hand reconstruction network without relying on manual annotations. Our experiments show that the proposed method achieves performance comparable to recent fully-supervised methods while requiring much less supervision.

* Accepted by CVPR21 
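
One of the 2D-3D consistency cues described above can be sketched as a reprojection loss; the weak-perspective camera model and all names below are assumptions for illustration, not the paper's exact formulation:

```python
import torch

def keypoint_reprojection_loss(joints_3d, scale, trans, kpts_2d, conf):
    """joints_3d: (B, J, 3) predicted joints; scale: (B, 1), trans: (B, 2) predicted camera;
    kpts_2d: (B, J, 2) noisy 2D detections; conf: (B, J) detector confidence."""
    proj = scale.unsqueeze(1) * joints_3d[..., :2] + trans.unsqueeze(1)  # weak-perspective projection
    err = (proj - kpts_2d).norm(dim=-1)                                  # per-joint pixel error
    # Confidence-weighted mean, so unreliable detections contribute less.
    return (conf * err).sum() / conf.sum().clamp(min=1e-6)
```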

Joint Hand-object 3D Reconstruction from a Single Image with Cross-branch Feature Fusion

Jun 28, 2020
Yujin Chen, Zhigang Tu, Di Kang, Ruizhi Chen, Linchao Bao, Zhengyou Zhang, Junsong Yuan

Accurate 3D reconstruction of hand and object shape from a hand-object image is important for understanding human-object interaction as well as human daily activities. Different from bare hand pose estimation, hand-object interaction imposes a strong constraint on both the hand and its manipulated object, which suggests that the hand configuration may provide crucial contextual information for the object, and vice versa. However, current approaches address this task by training a two-branch network to reconstruct the hand and object separately, with little communication between the two branches. In this work, we propose to consider hand and object jointly in feature space and explore the reciprocity of the two branches. We extensively investigate cross-branch feature fusion architectures with MLP or LSTM units. Among the investigated architectures, a variant with LSTM units that enhances the object features with hand features shows the best performance gain. Moreover, we employ an auxiliary depth estimation module to augment the input RGB image with an estimated depth map, which further improves the reconstruction accuracy. Experiments conducted on public datasets demonstrate that our approach significantly outperforms existing approaches in terms of object reconstruction accuracy.
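
A minimal sketch of the best-performing fusion direction, assuming pooled per-branch feature vectors (module and variable names are hypothetical), could enhance the object feature with the hand feature through an LSTM cell:

```python
import torch
import torch.nn as nn

class HandToObjectFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)

    def forward(self, hand_feat, obj_feat):
        """hand_feat, obj_feat: (B, dim) pooled features from the hand and object branches."""
        # Treat the hand feature as the input and the object feature as the hidden state,
        # so the updated state carries hand context into the object decoder.
        h, c = self.cell(hand_feat, (obj_feat, torch.zeros_like(obj_feat)))
        return h  # enhanced object feature

# Usage: obj_feat_enhanced = HandToObjectFusion(256)(hand_feat, obj_feat)
```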
