Yuzhi Zhao

Bidirectionally Deformable Motion Modulation For Video-based Human Pose Transfer

Jul 18, 2023
Wing-Yin Yu, Lai-Man Po, Ray C. C. Cheung, Yuzhi Zhao, Yu Xue, Kun Li

Video-based human pose transfer is a video-to-video generation task that animates a plain source human image based on a series of target human poses. Because highly structured garment patterns and discontinuous poses are difficult to transfer, existing methods often produce unsatisfactory results such as distorted textures and flickering artifacts. To address these issues, we propose a novel Deformable Motion Modulation (DMM) that utilizes geometric kernel offsets with adaptive weight modulation to simultaneously perform feature alignment and style transfer. Unlike the standard style modulation used in style transfer, the proposed modulation mechanism adaptively reconstructs smoothed frames from style codes according to the object shape through an irregular receptive field of view. To enhance spatio-temporal consistency, we leverage bidirectional propagation to extract hidden motion information from a warped image sequence generated from noisy poses. The proposed feature propagation significantly enhances the motion prediction ability through forward and backward propagation. Both quantitative and qualitative experimental results demonstrate superiority over state-of-the-art methods in terms of image fidelity and visual continuity. The source code is publicly available at github.com/rocketappslab/bdmm.
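
Below is a minimal PyTorch sketch of the core idea of a deformable, style-modulated convolution, i.e., combining adaptive weight modulation from a style code with learned kernel offsets and a modulation mask. The layer sizes, the offset/mask heads, and the modulation rule are my assumptions for illustration, not the authors' DMM implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableStyleConv(nn.Module):
    """Toy deformable convolution whose kernel is scaled by a style code."""
    def __init__(self, in_ch, out_ch, style_dim, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.to_scale = nn.Linear(style_dim, in_ch)                 # style -> per-channel kernel scale
        self.to_offset = nn.Conv2d(in_ch, 2 * k * k, 3, padding=1)  # irregular sampling offsets
        self.to_mask = nn.Conv2d(in_ch, k * k, 3, padding=1)        # per-location modulation mask

    def forward(self, x, style):
        b = x.size(0)
        scale = self.to_scale(style).view(b, 1, -1, 1, 1) + 1.0     # (B, 1, Cin, 1, 1)
        w = self.weight.unsqueeze(0) * scale                        # per-sample modulated kernels
        offset = self.to_offset(x)
        mask = torch.sigmoid(self.to_mask(x))
        out = []
        for i in range(b):  # deform_conv2d accepts one kernel at a time, so loop over the batch
            out.append(deform_conv2d(x[i:i + 1], offset[i:i + 1], w[i],
                                     padding=self.k // 2, mask=mask[i:i + 1]))
        return torch.cat(out, dim=0)

# Example: an 8-dim style code modulating a 64-channel feature map.
feat, style = torch.randn(2, 64, 32, 32), torch.randn(2, 8)
print(DeformableStyleConv(64, 64, 8)(feat, style).shape)  # torch.Size([2, 64, 32, 32])
```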

* ICCV 2023 

SVCNet: Scribble-based Video Colorization Network with Temporal Aggregation

Mar 21, 2023
Yuzhi Zhao, Lai-Man Po, Kangcheng Liu, Xuehui Wang, Wing-Yin Yu, Pengfei Xian, Yujia Zhang, Mengyang Liu

In this paper, we propose a scribble-based video colorization network with temporal aggregation called SVCNet. It can colorize monochrome videos based on different user-given color scribbles, and it addresses three common issues in scribble-based video colorization: colorization vividness, temporal consistency, and color bleeding. To improve colorization quality and strengthen temporal consistency, we adopt two sequential sub-networks in SVCNet for precise colorization and temporal smoothing, respectively. The first stage includes a pyramid feature encoder that incorporates color scribbles with a grayscale frame, and a semantic feature encoder that extracts semantics. The second stage refines the output of the first stage by aggregating the information of neighboring colorized frames (as short-range connections) and the first colorized frame (as a long-range connection). To alleviate color bleeding artifacts, we learn video colorization and segmentation simultaneously. Furthermore, we perform the majority of operations at a fixed small image resolution and use a Super-resolution Module at the tail of SVCNet to recover the original size, which allows SVCNet to handle different image resolutions at inference time. Finally, we evaluate the proposed SVCNet on the DAVIS and Videvo benchmarks. The experimental results demonstrate that SVCNet produces higher-quality and more temporally consistent videos than other well-known video colorization approaches. The codes and models can be found at https://github.com/zhaoyuzhi/SVCNet.
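
As a rough illustration of the fixed-resolution strategy mentioned above, the PyTorch sketch below runs a colorization backbone at a small working size and lets a super-resolution tail restore the original resolution. The backbone, the toy SR tail, and the 256x256 working size are placeholders, not SVCNet's actual components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySR(nn.Module):
    """Stand-in super-resolution tail: here just bilinear upsampling."""
    def forward(self, x, size):
        return F.interpolate(x, size=size, mode='bilinear', align_corners=False)

class FixedResColorizer(nn.Module):
    def __init__(self, backbone, sr_module, work_size=(256, 256)):
        super().__init__()
        self.backbone = backbone      # colorization + temporal aggregation stages
        self.sr_module = sr_module    # super-resolution module at the tail
        self.work_size = work_size

    def forward(self, gray, scribbles):
        h, w = gray.shape[-2:]
        # All heavy operations run at a fixed small resolution.
        g = F.interpolate(gray, size=self.work_size, mode='bilinear', align_corners=False)
        s = F.interpolate(scribbles, size=self.work_size, mode='bilinear', align_corners=False)
        ab = self.backbone(torch.cat([g, s], dim=1))   # predicted chrominance at the working size
        # The SR tail recovers the original frame size, so any input resolution fits.
        return self.sr_module(ab, size=(h, w))

# Toy stand-ins: grayscale (1 ch) + scribble hints (2 ch) -> ab chrominance (2 ch).
model = FixedResColorizer(nn.Conv2d(3, 2, 3, padding=1), ToySR())
print(model(torch.randn(1, 1, 480, 854), torch.randn(1, 2, 480, 854)).shape)  # (1, 2, 480, 854)
```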

* Under revision at IEEE Transactions on Image Processing 

D2HNet: Joint Denoising and Deblurring with Hierarchical Network for Robust Night Image Restoration

Jul 14, 2022
Yuzhi Zhao, Yongzhe Xu, Qiong Yan, Dingdong Yang, Xuehui Wang, Lai-Man Po

Night imaging with modern smartphone cameras is troublesome due to low photon count and unavoidable noise in the imaging system. Directly adjusting exposure time and ISO ratings cannot obtain sharp and noise-free images at the same time in low-light conditions. Though many methods have been proposed to enhance noisy or blurry night images, their performance on real-world night photos is still unsatisfactory for two main reasons: 1) the limited information in a single image and 2) the domain gap between synthetic training images and real-world photos (e.g., differences in blur area and resolution). To exploit the information from successive long- and short-exposure images, we propose a learning-based pipeline to fuse them. A D2HNet framework is developed to recover a high-quality image by deblurring and enhancing a long-exposure image under the guidance of a short-exposure image. To shrink the domain gap, we leverage a two-phase DeblurNet-EnhanceNet architecture, which performs accurate blur removal at a fixed low resolution so that it is able to handle large ranges of blur in inputs of different resolutions. In addition, we synthesize a D2-Dataset from HD videos and experiment on it. The results on the validation set and real photos demonstrate that our method achieves better visual quality and state-of-the-art quantitative scores. The D2HNet codes and D2-Dataset can be found at https://github.com/zhaoyuzhi/D2HNet.
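
The PyTorch sketch below illustrates the two-phase flow described above: deblur the long-exposure frame at a fixed low resolution under the guidance of the short-exposure frame, upsample the result, and enhance it at the original resolution. The stand-in modules and the 512x512 working size are assumptions, not the published D2HNet architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoPhaseFusion(nn.Module):
    def __init__(self, deblur_net, enhance_net, low_size=(512, 512)):
        super().__init__()
        self.deblur_net = deblur_net      # phase 1: blur removal at a fixed low resolution
        self.enhance_net = enhance_net    # phase 2: denoising / detail enhancement at full size
        self.low_size = low_size

    def forward(self, long_exp, short_exp):
        h, w = long_exp.shape[-2:]
        l_lo = F.interpolate(long_exp, size=self.low_size, mode='bilinear', align_corners=False)
        s_lo = F.interpolate(short_exp, size=self.low_size, mode='bilinear', align_corners=False)
        deblurred = self.deblur_net(torch.cat([l_lo, s_lo], dim=1))   # sharp low-res estimate
        up = F.interpolate(deblurred, size=(h, w), mode='bilinear', align_corners=False)
        # Phase 2 refines the upsampled estimate together with the full-resolution inputs.
        return self.enhance_net(torch.cat([up, long_exp, short_exp], dim=1))

# Toy stand-in networks so the pipeline runs end to end.
fusion = TwoPhaseFusion(nn.Conv2d(6, 3, 3, padding=1), nn.Conv2d(9, 3, 3, padding=1))
out = fusion(torch.randn(1, 3, 1024, 1024), torch.randn(1, 3, 1024, 1024))
print(out.shape)  # torch.Size([1, 3, 1024, 1024])
```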

* Accepted by ECCV 2022, including supplementary material 

ChildPredictor: A Child Face Prediction Framework with Disentangled Learning

Apr 21, 2022
Yuzhi Zhao, Lai-Man Po, Xuehui Wang, Qiong Yan, Wei Shen, Yujia Zhang, Wei Liu, Chun-Kit Wong, Chiu-Sing Pang, Weifeng Ou, Wing-Yin Yu, Buhua Liu

The appearances of children are inherited from their parents, which makes it feasible to predict them. Predicting realistic children's faces may help settle many social problems, such as age-invariant face recognition, kinship verification, and missing child identification. It can be regarded as an image-to-image translation task. Existing approaches usually assume that domain information in image-to-image translation can be interpreted by "style", i.e., the separation of image content and style. However, such separation is improper for child face prediction, because the facial contours of children and parents are not the same. To address this issue, we propose a new disentangled learning strategy for children's face prediction. We assume that children's faces are determined by genetic factors (compact family features, e.g., face contour), external factors (facial attributes irrelevant to prediction, such as moustaches and glasses), and variety factors (individual properties for each child). On this basis, we formulate prediction as a mapping from parents' genetic factors to children's genetic factors and disentangle them from external and variety factors. To obtain accurate genetic factors and perform the mapping, we propose the ChildPredictor framework. It maps human faces to genetic factors with encoders and back with generators, and then learns the relationship between the genetic factors of parents and children through a mapping function. To ensure that the generated faces are realistic, we collect a large Family Face Database to train ChildPredictor and evaluate it on the FF-Database validation set. Experimental results demonstrate that ChildPredictor is superior to other well-known image-to-image translation methods in predicting realistic and diverse child faces. Implementation codes can be found at https://github.com/zhaoyuzhi/ChildPredictor.
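
The schematic PyTorch sketch below traces the prediction path described above: encode the parents' faces into genetic factors, map them to a child's genetic factor, and decode it together with a sampled variety factor. All dimensions, the toy encoder/decoder, and the concatenation-based mapping are illustrative assumptions rather than the ChildPredictor architecture.

```python
import torch
import torch.nn as nn

class ChildFacePredictorSketch(nn.Module):
    def __init__(self, gene_dim=128, variety_dim=16):
        super().__init__()
        self.encode = nn.Sequential(                       # face -> genetic factor (toy encoder)
            nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, gene_dim))
        self.mapping = nn.Sequential(                      # parents' factors -> child's factor
            nn.Linear(2 * gene_dim, gene_dim), nn.ReLU(), nn.Linear(gene_dim, gene_dim))
        self.generate = nn.Sequential(                     # (genetic, variety) -> face (toy decoder)
            nn.Linear(gene_dim + variety_dim, 3 * 64 * 64), nn.Tanh())
        self.variety_dim = variety_dim

    def forward(self, father, mother):
        gf, gm = self.encode(father), self.encode(mother)
        gc = self.mapping(torch.cat([gf, gm], dim=1))       # predicted child genetic factor
        z = torch.randn(gc.size(0), self.variety_dim)       # variety factor, sampled per child
        return self.generate(torch.cat([gc, z], dim=1)).view(-1, 3, 64, 64)

father, mother = torch.randn(2, 3, 128, 128), torch.randn(2, 3, 128, 128)
print(ChildFacePredictorSketch()(father, mother).shape)  # torch.Size([2, 3, 64, 64])
```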

* Accepted by IEEE Transactions on Multimedia 

Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation

Dec 19, 2021
Yujia Zhang, Lai-Man Po, Xuyuan Xu, Mengyang Liu, Yexin Wang, Weifeng Ou, Yuzhi Zhao, Wing-Yin Yu

Spatio-temporal representation learning is critical for self-supervised video representation. Recent approaches mainly rely on contrastive learning and pretext tasks. However, these approaches learn representations by discriminating sampled instances via feature similarity in the latent space while ignoring the intermediate state of the learned representations, which limits the overall performance. In this work, taking the degree of similarity between sampled instances as the intermediate state, we propose a novel pretext task: spatio-temporal overlap rate (STOR) prediction. It stems from the observation that humans are capable of discriminating the overlap rates of videos in space and time. This task encourages the model to discriminate the STOR of two generated samples in order to learn the representations. Moreover, we employ a joint optimization combining pretext tasks with contrastive learning to further enhance spatio-temporal representation learning, and we study the mutual influence of each component in the proposed scheme. Extensive experiments demonstrate that the proposed STOR task benefits both contrastive learning and pretext tasks, and that the joint optimization scheme significantly improves spatio-temporal representation in video understanding. The code is available at https://github.com/Katou2/CSTP.
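
For intuition, here is a small worked example of the kind of overlap-rate label the STOR pretext task predicts, taking the rate as the product of the temporal and spatial overlap ratios between two sampled clips. The exact definition used in the paper may differ, so treat this as an assumption.

```python
def overlap_1d(a_start, a_len, b_start, b_len):
    """Length of the intersection of two 1-D intervals."""
    lo, hi = max(a_start, b_start), min(a_start + a_len, b_start + b_len)
    return max(0, hi - lo)

def stor(clip_a, clip_b):
    """clip = (t_start, t_len, x, y, w, h): a temporal window plus a spatial crop."""
    t = overlap_1d(clip_a[0], clip_a[1], clip_b[0], clip_b[1]) / min(clip_a[1], clip_b[1])
    sx = overlap_1d(clip_a[2], clip_a[4], clip_b[2], clip_b[4])
    sy = overlap_1d(clip_a[3], clip_a[5], clip_b[3], clip_b[5])
    s = (sx * sy) / min(clip_a[4] * clip_a[5], clip_b[4] * clip_b[5])
    return t * s

# Two 16-frame clips offset by 8 frames, 112x112 crops offset by 56 px horizontally:
# temporal overlap = 8/16 = 0.5, spatial overlap = (56*112)/(112*112) = 0.5, so STOR = 0.25.
print(stor((0, 16, 0, 0, 112, 112), (8, 16, 56, 0, 112, 112)))  # 0.25
```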

* Accepted by AAAI 2022, Preprint version with Appendix 

VCGAN: Video Colorization with Hybrid Generative Adversarial Network

Apr 26, 2021
Yuzhi Zhao, Lai-Man Po, Wing-Yin Yu, Yasar Abbas Ur Rehman, Mengyang Liu, Yujia Zhang, Weifeng Ou

We propose Video Colorization with Hybrid Generative Adversarial Network (VCGAN), an improved recurrent, end-to-end approach to video colorization. VCGAN addresses two prevalent issues in video colorization: temporal consistency and the unification of the colorization and refinement networks into a single architecture. To enhance colorization quality and spatiotemporal consistency, the main branch of the VCGAN generator is assisted by two additional networks, a global feature extractor and a placeholder feature extractor. The global feature extractor encodes the global semantics of the grayscale input to enhance colorization quality, whereas the placeholder feature extractor acts as a feedback connection that encodes the semantics of the previously colorized frame in order to maintain spatiotemporal consistency. If the input to the placeholder feature extractor is replaced with the grayscale input, the hybrid VCGAN can also perform image colorization. To improve the consistency of distant frames, we propose a dense long-term loss that smooths the temporal disparity between every pair of remote frames. Trained jointly with colorization and temporal losses, VCGAN strikes a good balance between color vividness and video continuity. Experimental results demonstrate that VCGAN produces higher-quality and temporally more consistent colorized videos than existing approaches.
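
The sketch below shows one plausible form of a dense long-term consistency term as described above: it penalizes the disparity between every pair of output frames, near or remote, with pairs whose grayscale inputs already differ strongly weighted down. The weighting scheme and the absence of any warping are my simplifications, not VCGAN's published loss.

```python
import torch

def dense_long_term_loss(colorized, grayscale, alpha=50.0):
    """colorized, grayscale: tensors of shape (T, C, H, W) for one video clip."""
    T = colorized.size(0)
    loss, pairs = colorized.new_zeros(()), 0
    for i in range(T):
        for j in range(i + 1, T):                       # every pair of frames, near or remote
            # Down-weight pairs whose grayscale frames already differ a lot (scene change, motion).
            w = torch.exp(-alpha * (grayscale[i] - grayscale[j]).abs().mean())
            loss = loss + w * (colorized[i] - colorized[j]).abs().mean()
            pairs += 1
    return loss / max(pairs, 1)

clip_rgb, clip_gray = torch.rand(6, 3, 64, 64), torch.rand(6, 1, 64, 64)
print(dense_long_term_loss(clip_rgb, clip_gray))
```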

* Major revision submitted to IEEE Transactions on Multimedia (TMM) 

Spatial Content Alignment For Pose Transfer

Mar 31, 2021
Wing-Yin Yu, Lai-Man Po, Yuzhi Zhao, Jingjing Xiong, Kin-Wai Lau

Due to unreliable geometric matching and content misalignment, most conventional pose transfer algorithms fail to generate fine-grained person images. In this paper, we propose a novel framework, Spatial Content Alignment GAN (SCA-GAN), which aims to enhance the content consistency of garment textures and the details of human characteristics. We first alleviate the spatial misalignment by transferring the edge content to the target pose in advance. Second, we introduce a new Content-Style DeBlk that progressively synthesizes photo-realistic person images based on the appearance features of the source image, the target pose heatmap, and the previously transferred content in the edge domain. We compare the proposed framework with several state-of-the-art methods to show its superiority in quantitative and qualitative analysis. Moreover, detailed ablation study results demonstrate the efficacy of our contributions. Codes are publicly available at github.com/rocketappslab/SCA-GAN.
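
The PyTorch sketch below is a speculative rendering of a decoder block that conditions on both a style vector from the source appearance and spatial content maps (target pose heatmap plus the transferred edge image), in the spirit of the Content-Style DeBlk. The actual block design in SCA-GAN is not reproduced here.

```python
import torch
import torch.nn as nn

class ContentStyleBlockSketch(nn.Module):
    def __init__(self, ch, style_dim, content_ch):
        super().__init__()
        self.norm = nn.InstanceNorm2d(ch, affine=False)
        self.style_affine = nn.Linear(style_dim, 2 * ch)                  # scale/shift from appearance
        self.content_fuse = nn.Conv2d(ch + content_ch, ch, 3, padding=1)  # inject pose + edge maps
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, feat, style, content):
        gamma, beta = self.style_affine(style).chunk(2, dim=1)
        h = self.norm(feat) * (1 + gamma[..., None, None]) + beta[..., None, None]
        h = self.act(self.content_fuse(torch.cat([h, content], dim=1)))
        return feat + self.conv(h)                                        # residual update

feat = torch.randn(1, 64, 32, 32)         # decoder feature map
style = torch.randn(1, 256)               # appearance code from the source image
content = torch.randn(1, 18 + 1, 32, 32)  # e.g. 18 pose-keypoint heatmaps + 1 edge map
print(ContentStyleBlockSketch(64, 256, 19)(feat, style, content).shape)  # (1, 64, 32, 32)
```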

* IEEE International Conference on Multimedia and Expo (ICME) 2021 Oral 

SCGAN: Saliency Map-guided Colorization with Generative Adversarial Network

Nov 23, 2020
Yuzhi Zhao, Lai-Man Po, Kwok-Wai Cheung, Wing-Yin Yu, Yasar Abbas Ur Rehman

Given a grayscale photograph, a colorization system estimates a visually plausible colorful image. Conventional methods often use semantics to colorize grayscale images. However, in these methods only classification semantic information is embedded, resulting in semantic confusion and color bleeding in the final colorized image. To address these issues, we propose a fully automatic Saliency Map-guided Colorization with Generative Adversarial Network (SCGAN) framework. It jointly predicts the colorization and a saliency map to minimize semantic confusion and color bleeding in the colorized image. Since global features from a pre-trained VGG-16-Gray network are embedded into the colorization encoder, the proposed SCGAN can be trained with much less data than state-of-the-art methods to achieve perceptually reasonable colorization. In addition, we propose a novel saliency map-based guidance method: branches of the colorization decoder are used to predict the saliency map as a proxy target. Moreover, two hierarchical discriminators are utilized for the generated colorization and saliency map, respectively, in order to strengthen visual perception performance. The proposed system is evaluated on the ImageNet validation set. Experimental results show that SCGAN can generate more reasonable colorized images than state-of-the-art techniques.
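
The sketch below shows a joint generator objective of the kind the description above suggests: a colorization reconstruction term, a saliency proxy term, and adversarial terms from two discriminators, one per output. The loss weights and the exact GAN formulation are assumptions, not SCGAN's published settings.

```python
import torch
import torch.nn.functional as F

def generator_loss(pred_color, gt_color, pred_saliency, gt_saliency,
                   d_color_fake, d_sal_fake, w_rec=1.0, w_sal=0.5, w_adv=0.01):
    rec = F.l1_loss(pred_color, gt_color)                        # colorization reconstruction
    sal = F.binary_cross_entropy(pred_saliency, gt_saliency)     # saliency map as a proxy target
    # Adversarial terms from the two discriminators (logits as input).
    adv = F.binary_cross_entropy_with_logits(d_color_fake, torch.ones_like(d_color_fake)) \
        + F.binary_cross_entropy_with_logits(d_sal_fake, torch.ones_like(d_sal_fake))
    return w_rec * rec + w_sal * sal + w_adv * adv

loss = generator_loss(torch.rand(1, 2, 64, 64), torch.rand(1, 2, 64, 64),          # predicted / GT ab
                      torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64).round(),   # saliency maps
                      torch.randn(1, 1), torch.randn(1, 1))                         # discriminator logits
print(loss)
```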

* Accepted by IEEE Transactions on Circuits and Systems for Video Technology 

Lightweight Single-Image Super-Resolution Network with Attentive Auxiliary Feature Learning

Nov 13, 2020
Xuehui Wang, Qing Wang, Yuzhi Zhao, Junchi Yan, Lei Fan, Long Chen

Although convolutional network-based methods have boosted the performance of single image super-resolution (SISR), their huge computation costs restrict their practical applicability. In this paper, we develop a computationally efficient yet accurate network based on the proposed attentive auxiliary features (A$^2$F) for SISR. First, to exploit the features from the bottom layers, the auxiliary features from all previous layers are projected into a common space. Then, to better utilize these projected auxiliary features and filter out redundant information, channel attention is employed to select the most important common features based on the current layer's feature. We incorporate these two modules into a block and implement it as a lightweight network. Experimental results on large-scale datasets demonstrate the effectiveness of the proposed model against state-of-the-art (SOTA) SR methods. Notably, with fewer than 320K parameters, A$^2$F outperforms SOTA methods at all scales, which proves its ability to better utilize the auxiliary features. Codes are available at https://github.com/wxxxxxxh/A2F-SR.
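
The sketch below is a minimal reading of an attentive-auxiliary-feature block: auxiliary features from earlier layers are projected into a common space by 1x1 convolutions, and a channel attention computed from the current feature gates their contribution. The block layout follows the abstract above, not the released A$^2$F code.

```python
import torch
import torch.nn as nn

class A2FBlockSketch(nn.Module):
    def __init__(self, ch, num_aux):
        super().__init__()
        # One 1x1 projection per auxiliary (earlier-layer) feature map.
        self.projections = nn.ModuleList([nn.Conv2d(ch, ch, 1) for _ in range(num_aux)])
        # Squeeze-and-excitation style channel attention driven by the current feature.
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid())
        self.body = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, current, aux_feats):
        # Project all auxiliary features into a common space and sum them.
        aux = sum(proj(a) for proj, a in zip(self.projections, aux_feats))
        gate = self.attention(current)           # per-channel weights from the current layer
        return self.body(current) + gate * aux   # attended auxiliary features as a residual

cur = torch.randn(1, 32, 48, 48)
aux = [torch.randn(1, 32, 48, 48) for _ in range(3)]
print(A2FBlockSketch(32, 3)(cur, aux).shape)  # torch.Size([1, 32, 48, 48])
```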

* Accepted by ACCV 2020 