
Pengfei Wan


Towards Practical Capture of High-Fidelity Relightable Avatars

Sep 08, 2023
Haotian Yang, Mingwu Zheng, Wanquan Feng, Haibin Huang, Yu-Kun Lai, Pengfei Wan, Zhongyuan Wang, Chongyang Ma

In this paper, we propose a novel framework, Tracking-free Relightable Avatar (TRAvatar), for capturing and reconstructing high-fidelity 3D avatars. Compared to previous methods, TRAvatar works in a more practical and efficient setting. Specifically, TRAvatar is trained on dynamic image sequences captured in a Light Stage under varying lighting conditions, enabling realistic relighting and real-time animation of avatars in diverse scenes. Additionally, TRAvatar allows for tracking-free avatar capture and obviates the need for accurate surface tracking under varying illumination conditions. Our contributions are two-fold: First, we propose a novel network architecture that explicitly builds on and enforces the linear nature of lighting. Trained on simple group light captures, TRAvatar can predict the appearance in real time with a single forward pass, achieving high-quality relighting under illumination from arbitrary environment maps. Second, we jointly optimize the facial geometry and relightable appearance from scratch based on image sequences, where the tracking is implicitly learned. This tracking-free approach brings robustness for establishing temporal correspondences between frames under different lighting conditions. Extensive qualitative and quantitative experiments demonstrate that our framework achieves superior performance for photorealistic avatar animation and relighting.

* Accepted to SIGGRAPH Asia 2023 (Conference); Project page: https://travatar-paper.github.io/ 
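The key property behind the relighting claim is the linearity of light transport: the appearance under any environment map is a weighted sum of the appearances under the individual light-stage sources, with weights obtained by projecting the environment map onto those sources. Below is a minimal sketch of that combination step, assuming per-light appearance predictions are already available; the function name and array layout are illustrative, not the authors' code.

```python
import numpy as np

def relight(per_light_appearance: np.ndarray, env_weights: np.ndarray) -> np.ndarray:
    """Combine per-light-source appearance predictions into one relit image.

    per_light_appearance: (L, H, W, 3) appearance under each of the L light-stage
        sources at unit intensity, e.g. predicted by a network in a single pass.
    env_weights: (L, 3) RGB intensity of each source after projecting an
        environment map onto the light-stage basis.
    Light transport is linear, so the relit image is a weighted sum.
    """
    return np.einsum("lhwc,lc->hwc", per_light_appearance, env_weights)

# Toy usage: 4 lights, tiny 2x2 image.
appearance = np.random.rand(4, 2, 2, 3)
weights = np.random.rand(4, 3)
relit = relight(appearance, weights)   # (2, 2, 3)
```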

1st Place Solution for the 5th LSVOS Challenge: Video Instance Segmentation

Aug 28, 2023
Tao Zhang, Xingye Tian, Yikang Zhou, Yu Wu, Shunping Ji, Cilin Yan, Xuebo Wang, Xin Tao, Yuan Zhang, Pengfei Wan

Video instance segmentation is a challenging task that serves as the cornerstone of numerous downstream applications, including video editing and autonomous driving. In this report, we present further improvements to the state-of-the-art VIS method, DVIS. First, we introduce a denoising training strategy for the trainable tracker, allowing it to achieve more stable and accurate object tracking in complex and long videos. Additionally, we explore the role of visual foundation models in video instance segmentation. By utilizing a frozen ViT-L backbone pre-trained with DINOv2, DVIS demonstrates remarkable performance improvements. With these enhancements, our method achieved 57.9 AP and 56.0 AP in the development and test phases, respectively, and ultimately ranked 1st in the VIS track of the 5th LSVOS Challenge. The code will be available at https://github.com/zhang-tao-whu/DVIS.
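As a rough illustration of the foundation-model change described above, the sketch below loads a frozen DINOv2 ViT-L/14 backbone through the official torch.hub entry point and extracts per-frame patch features; the wrapper function and shapes are assumptions for illustration, not the DVIS implementation.

```python
import torch

# Frozen DINOv2 ViT-L/14 via the official hub entry point; the surrounding
# code is an illustrative sketch, not the DVIS code.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)          # keep the foundation model frozen

@torch.no_grad()
def extract_patch_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W) with H, W divisible by the 14-pixel patch size."""
    out = backbone.forward_features(frames)          # dict exposed by DINOv2 models
    tokens = out["x_norm_patchtokens"]               # (T, H/14 * W/14, 1024)
    T, N, C = tokens.shape
    h = w = int(N ** 0.5)                            # assumes a square input
    return tokens.transpose(1, 2).reshape(T, C, h, w)

feats = extract_patch_features(torch.randn(2, 3, 224, 224))  # (2, 1024, 16, 16)
```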


1st Place Solution for PVUW Challenge 2023: Video Panoptic Segmentation

Jun 08, 2023
Tao Zhang, Xingye Tian, Haoran Wei, Yu Wu, Shunping Ji, Xuebo Wang, Xin Tao, Yuan Zhang, Pengfei Wan

Video panoptic segmentation is a challenging task that serves as the cornerstone of numerous downstream applications, including video editing and autonomous driving. We believe that the decoupling strategy proposed by DVIS enables more effective utilization of temporal information for both "thing" and "stuff" objects. In this report, we validate the effectiveness of the decoupling strategy in video panoptic segmentation. Our method achieved VPQ scores of 51.4 and 53.7 in the development and test phases, respectively, and ultimately ranked 1st in the VPS track of the 2nd PVUW Challenge. The code is available at https://github.com/zhang-tao-whu/DVIS.
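For context on what a per-frame panoptic prediction looks like, here is a hedged sketch of merging query-based "thing" and "stuff" predictions into a single panoptic map (the output format on which VPQ is computed); the threshold and merge rule follow common query-based practice and are not taken from this report.

```python
import torch

def merge_panoptic(mask_logits: torch.Tensor, class_logits: torch.Tensor,
                   score_thresh: float = 0.5):
    """mask_logits: (Q, H, W) per-query mask logits; class_logits: (Q, C+1)
    class scores (last index = "no object"). Returns an (H, W) map of winning
    query ids and the class kept per query, with -1 marking discarded queries."""
    scores, labels = class_logits.softmax(-1)[:, :-1].max(-1)   # best real class per query
    keep = scores > score_thresh
    masks = mask_logits.sigmoid() * scores[:, None, None]       # score-weighted masks
    masks[~keep] = 0.0
    segment_id = masks.argmax(0)                                # winning query per pixel
    labels = torch.where(keep, labels, torch.full_like(labels, -1))
    return segment_id, labels

seg, lab = merge_panoptic(torch.randn(10, 64, 64), torch.randn(10, 21))
```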


DVIS: Decoupled Video Instance Segmentation Framework

Jun 08, 2023
Tao Zhang, Xingye Tian, Yu Wu, Shunping Ji, Xuebo Wang, Yuan Zhang, Pengfei Wan

Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing. Existing methods often underperform on complex and long videos in the real world, primarily due to two factors. First, offline methods are limited by a tightly coupled modeling paradigm, which treats all frames equally and disregards the interdependencies between adjacent frames, introducing excessive noise during long-term temporal alignment. Second, online methods make inadequate use of temporal information. To tackle these challenges, we propose a decoupling strategy for VIS that divides it into three independent sub-tasks: segmentation, tracking, and refinement. The efficacy of the decoupling strategy relies on two crucial elements: 1) attaining precise long-term alignment via frame-by-frame association during tracking, and 2) effectively utilizing temporal information based on these accurate alignment results during refinement. We introduce a novel referring tracker and temporal refiner to construct the Decoupled VIS framework (DVIS). DVIS achieves new state-of-the-art performance in both VIS and VPS, surpassing the current state-of-the-art methods by 7.3 AP and 9.6 VPQ on the OVIS and VIPSeg datasets, which are the most challenging and realistic benchmarks. Moreover, thanks to the decoupling strategy, the referring tracker and temporal refiner are extremely lightweight (only 1.69% of the segmenter's FLOPs), allowing for efficient training and inference on a single GPU with 11 GB of memory. The code is available at https://github.com/zhang-tao-whu/DVIS.
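A minimal sketch of the decoupled flow described above: segment each frame independently, associate the resulting queries frame by frame with a tracker, then refine them with long-range temporal context. The modules are placeholders, so this conveys only the data flow, not the actual DVIS components.

```python
import torch
from torch import nn

class DecoupledVIS(nn.Module):
    """Sketch of the segment -> track -> refine decoupling (placeholder modules)."""
    def __init__(self, segmenter: nn.Module, tracker: nn.Module, refiner: nn.Module):
        super().__init__()
        self.segmenter, self.tracker, self.refiner = segmenter, tracker, refiner

    def forward(self, frames):                            # frames: list of (3, H, W) tensors
        per_frame = [self.segmenter(f) for f in frames]   # instance queries per frame
        tracked, prev = [], None
        for q in per_frame:                               # frame-by-frame association
            prev = self.tracker(q, prev)                  # align current queries to previous
            tracked.append(prev)
        video_queries = torch.stack(tracked, dim=0)       # (T, Q, C), temporally aligned
        return self.refiner(video_queries)                # exploit long-range temporal context
```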


Multi-Modal Face Stylization with a Generative Prior

May 29, 2023
Mengtian Li, Yi Dong, Minxuan Lin, Haibin Huang, Pengfei Wan, Chongyang Ma

In this work, we introduce a new approach for artistic face stylization. Although existing methods achieve impressive results on this task, there is still room for improvement in generating high-quality stylized faces with diverse styles and accurate facial reconstruction. Our proposed framework, MMFS, supports multi-modal face stylization by leveraging the strengths of StyleGAN within an encoder-decoder architecture. Specifically, we use the mid-resolution and high-resolution layers of StyleGAN as the decoder to generate high-quality faces, while aligning its low-resolution layers with the encoder to extract and preserve input facial details. We also introduce a two-stage training strategy: in the first stage, we train the encoder to align its feature maps with StyleGAN and enable faithful reconstruction of input faces; in the second stage, the entire network is fine-tuned on artistic data for stylized face generation. To enable the fine-tuned model to be applied to zero-shot and one-shot stylization tasks, we train an additional mapping network from the Contrastive Language-Image Pre-training (CLIP) embedding space to the latent $w+$ space of the fine-tuned StyleGAN. Qualitative and quantitative experiments show that our framework achieves superior face stylization performance in both one-shot and zero-shot stylization tasks, outperforming state-of-the-art methods by a large margin.
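A hedged sketch of the kind of mapping network mentioned above, going from a CLIP embedding to a $w+$ latent with one code per StyleGAN layer; the hidden sizes, layer count (18), and CLIP dimension (512) are assumptions, not values from the paper.

```python
import torch
from torch import nn

class CLIPToWPlus(nn.Module):
    """Map a CLIP image/text embedding to a w+ latent (one 512-d code per
    StyleGAN layer). 18 layers and 512-d CLIP features are assumptions."""
    def __init__(self, clip_dim: int = 512, w_dim: int = 512, n_layers: int = 18):
        super().__init__()
        self.n_layers, self.w_dim = n_layers, w_dim
        self.net = nn.Sequential(
            nn.Linear(clip_dim, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, n_layers * w_dim),
        )

    def forward(self, clip_emb: torch.Tensor) -> torch.Tensor:
        return self.net(clip_emb).view(-1, self.n_layers, self.w_dim)

w_plus = CLIPToWPlus()(torch.randn(4, 512))   # (4, 18, 512), fed to the generator
```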


Bridging CLIP and StyleGAN through Latent Alignment for Image Editing

Oct 10, 2022
Wanfeng Zheng, Qiang Li, Xiaoyan Guo, Pengfei Wan, Zhongyuan Wang

Text-driven image manipulation has developed rapidly since the vision-language model CLIP was proposed. Previous work adopts CLIP to design text-image consistency-based objectives for this task. However, these methods require either test-time optimization or image feature cluster analysis to obtain single-mode manipulation directions. In this paper, we achieve diverse manipulation direction mining without inference-time optimization by bridging CLIP and StyleGAN through Latent Alignment (CSLA). More specifically, our contributions consist of three parts: 1) a data-free training strategy that trains latent mappers to bridge the latent spaces of CLIP and StyleGAN; 2) temporal relative consistency, proposed for more precise mapping, to address the knowledge distribution bias among different latent spaces; and 3) adaptive style mixing to refine the mapped latent in the $s$ space. With this mapping scheme, we can achieve GAN inversion, text-to-image generation, and text-driven image manipulation. Qualitative and quantitative comparisons demonstrate the effectiveness of our method.

* 20 pages, 23 figures 
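The "data-free" strategy can be read as follows: training pairs are synthesized from the generator itself, so no image dataset is needed. The sketch below shows one such training step under that reading, sampling z, mapping it to w with StyleGAN, rendering an image, embedding it with CLIP, and regressing w back from the CLIP embedding; all model handles are placeholders and the plain MSE loss is a stand-in, not the paper's objective.

```python
import torch
from torch import nn

def data_free_step(stylegan_mapping, stylegan_synthesis, clip_image_encoder,
                   mapper: nn.Module, opt: torch.optim.Optimizer, batch: int = 8):
    """One data-free training step for a CLIP -> StyleGAN latent mapper.
    The three pretrained models are placeholders for the real components."""
    with torch.no_grad():
        z = torch.randn(batch, 512)
        w = stylegan_mapping(z)                 # target latent from StyleGAN itself
        img = stylegan_synthesis(w)             # synthesized image, no real data needed
        clip_emb = clip_image_encoder(img)      # CLIP embedding of the synthesized image
    w_pred = mapper(clip_emb)                   # mapper bridges CLIP space -> W space
    loss = nn.functional.mse_loss(w_pred, w)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```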

ITTR: Unpaired Image-to-Image Translation with Transformers

Mar 30, 2022
Wanfeng Zheng, Qiang Li, Guoxin Zhang, Pengfei Wan, Zhongyuan Wang

Unpaired image-to-image translation aims to translate an image from a source domain to a target domain without paired training data. By utilizing CNNs to extract local semantics, various techniques have been developed to improve translation performance. However, CNN-based generators lack the ability to capture long-range dependencies and thus cannot fully exploit global semantics. Recently, Vision Transformers have been widely investigated for recognition tasks. Though appealing, it is inappropriate to simply transfer a recognition-oriented vision transformer to image-to-image translation, due to the difficulty of generation and computational limitations. In this paper, we propose an effective and efficient architecture for unpaired Image-to-Image Translation with Transformers (ITTR). It has two main designs: 1) a hybrid perception block (HPB) for token mixing across different receptive fields to utilize global semantics; 2) dual pruned self-attention (DPSA) to sharply reduce computational complexity. Our ITTR outperforms the state of the art for unpaired image-to-image translation on six benchmark datasets.

* 18 pages, 7 figures, 5 tables 
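To make the pruned-attention idea concrete, here is a hedged single-head sketch that keeps only the top-k keys and values (ranked by a crude importance score) before computing attention, reducing the cost from O(N^2) to O(Nk); the scoring rule and the "dual" aspects of the real DPSA module are simplified away.

```python
import torch
from torch import nn

class PrunedSelfAttention(nn.Module):
    """Keep only the top-k tokens as keys/values before attention (illustrative)."""
    def __init__(self, dim: int, keep: int = 64):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.keep, self.scale = keep, dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, N, C)
        q, k, v = self.q(x), self.k(x), self.v(x)
        score = k.norm(dim=-1)                             # (B, N) crude token importance
        idx = score.topk(min(self.keep, x.shape[1]), dim=1).indices
        idx = idx.unsqueeze(-1).expand(-1, -1, x.shape[-1])
        k, v = k.gather(1, idx), v.gather(1, idx)          # pruned keys/values: (B, k, C)
        attn = (q @ k.transpose(1, 2) * self.scale).softmax(-1)
        return attn @ v                                    # (B, N, C)

out = PrunedSelfAttention(dim=256)(torch.randn(2, 1024, 256))
```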

Wavelet Knowledge Distillation: Towards Efficient Image-to-Image Translation

Mar 12, 2022
Linfeng Zhang, Xin Chen, Xiaobing Tu, Pengfei Wan, Ning Xu, Kaisheng Ma

Remarkable achievements have been attained with Generative Adversarial Networks (GANs) in image-to-image translation. However, due to their tremendous number of parameters, state-of-the-art GANs usually suffer from low efficiency and bulky memory usage. To tackle this challenge, this paper first investigates GAN performance from a frequency perspective. The results show that GANs, especially small GANs, lack the ability to generate high-quality high-frequency information. To address this problem, we propose a novel knowledge distillation method referred to as wavelet knowledge distillation. Instead of directly distilling the generated images of teachers, wavelet knowledge distillation first decomposes the images into different frequency bands with the discrete wavelet transform and then distills only the high-frequency bands. As a result, the student GAN can pay more attention to learning the high-frequency bands. Experiments demonstrate that our method leads to 7.08 times compression and 6.80 times acceleration on CycleGAN with almost no performance drop. Additionally, we study the relation between discriminators and generators, which shows that compressing the discriminator can promote the performance of the compressed generator.

* Accepted by CVPR 2022 
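A hedged sketch of the distillation objective described above: decompose student and teacher outputs with a single-level Haar transform and penalize only the high-frequency subbands (LH, HL, HH). A hand-rolled Haar DWT keeps the snippet differentiable and dependency-free; it illustrates the idea rather than reproducing the paper's implementation.

```python
import torch
import torch.nn.functional as F

def haar_dwt(x: torch.Tensor):
    """Single-level Haar DWT of an image batch (B, C, H, W) with even H, W.
    Returns (LL, LH, HL, HH), each of shape (B, C, H/2, W/2)."""
    a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2      # horizontal detail
    hl = (a - b + c - d) / 2      # vertical detail
    hh = (a - b - c + d) / 2      # diagonal detail
    return ll, lh, hl, hh

def wavelet_kd_loss(student_img: torch.Tensor, teacher_img: torch.Tensor) -> torch.Tensor:
    """Distill only the high-frequency subbands of the teacher's output."""
    _, s_lh, s_hl, s_hh = haar_dwt(student_img)
    _, t_lh, t_hl, t_hh = haar_dwt(teacher_img.detach())
    return sum(F.l1_loss(s, t) for s, t in ((s_lh, t_lh), (s_hl, t_hl), (s_hh, t_hh)))

loss = wavelet_kd_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```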

PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-step Point Moving Paths

Feb 28, 2022
Xin Wen, Peng Xiang, Zhizhong Han, Yan-Pei Cao, Pengfei Wan, Wen Zheng, Yu-Shen Liu

Point cloud completion concerns predicting the missing parts of incomplete 3D shapes. A common strategy is to generate the complete shape from the incomplete input. However, the unordered nature of point clouds degrades the generation of high-quality 3D shapes, as the detailed topology and structure of unordered points are hard to capture during the generative process using an extracted latent code. We address this problem by formulating completion as a point cloud deformation process. Specifically, we design a novel neural network, named PMP-Net++, to mimic the behavior of an earth mover. It moves each point of the incomplete input to obtain a complete point cloud, where the total distance of the point moving paths (PMPs) should be the shortest. Therefore, PMP-Net++ predicts a unique PMP for each point according to the constraint of point moving distances. The network learns a strict and unique point-level correspondence, and thus improves the quality of the predicted complete shape. Moreover, since moving points relies heavily on the per-point features learned by the network, we further introduce a transformer-enhanced representation learning network, which significantly improves the completion performance of PMP-Net++. We conduct comprehensive experiments on shape completion and further explore its application to point cloud up-sampling, demonstrating the non-trivial improvement of PMP-Net++ over state-of-the-art point cloud completion and up-sampling methods.

* 16 pages, 17 figures. Journal extension of the CVPR 2021 paper PMP-Net (arXiv:2012.03408). Accepted by TPAMI 
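A minimal sketch of the deformation view described above: rather than decoding a new point set from a latent code, predict a displacement for every input point over several steps, accumulating the moves into point moving paths whose total length can be regularized. The displacement predictor below is a placeholder MLP, not the PMP-Net++ architecture.

```python
import torch
from torch import nn

class MultiStepPointMover(nn.Module):
    """Deform an incomplete cloud by predicting per-point displacements over
    several steps (placeholder MLP; the real model uses richer per-point features)."""
    def __init__(self, steps: int = 3, feat_dim: int = 128):
        super().__init__()
        self.steps = steps
        self.displace = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 3),
        )

    def forward(self, points: torch.Tensor):        # points: (B, N, 3)
        path_length = 0.0
        for _ in range(self.steps):
            delta = self.displace(points)            # per-point move at this step
            points = points + delta
            path_length = path_length + delta.norm(dim=-1).sum()  # total PMP length
        return points, path_length                   # path_length can be regularized

pts, reg = MultiStepPointMover()(torch.rand(2, 2048, 3))
```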

Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer

Feb 22, 2022
Peng Xiang, Xin Wen, Yu-Shen Liu, Yan-Pei Cao, Pengfei Wan, Wen Zheng, Zhizhong Han

Most existing point cloud completion methods suffer from the discrete nature of point clouds and the unstructured prediction of points in local regions, which makes it hard to reveal fine local geometric details. To resolve this issue, we propose SnowflakeNet with Snowflake Point Deconvolution (SPD) to generate complete point clouds. SPD models the generation of complete point clouds as a snowflake-like growth of points, where child points are progressively generated by splitting their parent points after each SPD layer. Our insight for revealing detailed geometry is to introduce a skip-transformer in SPD to learn the point splitting patterns that best fit local regions. The skip-transformer leverages an attention mechanism to summarize the splitting patterns used in the previous SPD layer and produce the splitting in the current SPD layer. The locally compact and structured point clouds generated by SPD precisely reveal the structural characteristics of 3D shapes in local patches, which enables us to predict highly detailed geometries. Moreover, since SPD is a general operation that is not limited to completion, we further explore its applications to other generative tasks, including point cloud auto-encoding, generation, single image reconstruction, and upsampling. Our method outperforms the state-of-the-art methods on widely used benchmarks.

* IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Dec. 2021, under review. This work is a journal extension of our ICCV 2021 paper arXiv:2108.04444. The first two authors contributed equally 
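A hedged sketch of the point-splitting idea behind SPD: every parent point spawns several child points by adding learned offsets, so the cloud grows coarse to fine. The offset predictor here is a placeholder MLP standing in for the skip-transformer.

```python
import torch
from torch import nn

class PointSplit(nn.Module):
    """Split each parent point into `up` children via learned offsets
    (placeholder MLP standing in for the skip-transformer of SPD)."""
    def __init__(self, up: int = 4, hidden: int = 128):
        super().__init__()
        self.up = up
        self.offsets = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, up * 3),
        )

    def forward(self, parents: torch.Tensor) -> torch.Tensor:    # (B, N, 3)
        B, N, _ = parents.shape
        off = self.offsets(parents).view(B, N, self.up, 3)        # one offset per child
        children = parents.unsqueeze(2) + off                     # children around parent
        return children.reshape(B, N * self.up, 3)                # denser cloud

dense = PointSplit(up=4)(torch.rand(2, 512, 3))                   # (2, 2048, 3)
```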