Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Baoyuan Wang

FashionTex: Controllable Virtual Try-on with Text and Texture

May 08, 2023

Anran Lin, Nanxuan Zhao, Shuliang Ning, Yuda Qiu, Baoyuan Wang, Xiaoguang Han

Figure 1 for FashionTex: Controllable Virtual Try-on with Text and Texture

Figure 2 for FashionTex: Controllable Virtual Try-on with Text and Texture

Figure 3 for FashionTex: Controllable Virtual Try-on with Text and Texture

Figure 4 for FashionTex: Controllable Virtual Try-on with Text and Texture

Abstract:Virtual try-on attracts increasing research attention as a promising way for enhancing the user experience for online cloth shopping. Though existing methods can generate impressive results, users need to provide a well-designed reference image containing the target fashion clothes that often do not exist. To support user-friendly fashion customization in full-body portraits, we propose a multi-modal interactive setting by combining the advantages of both text and texture for multi-level fashion manipulation. With the carefully designed fashion editing module and loss functions, FashionTex framework can semantically control cloth types and local texture patterns without annotated pairwise training data. We further introduce an ID recovery module to maintain the identity of input portrait. Extensive experiments have demonstrated the effectiveness of our proposed pipeline.

* Accepted to SIGGRAPH 2023 (Conference Proceedings)

Via

Access Paper or Ask Questions

An Effective Motion-Centric Paradigm for 3D Single Object Tracking in Point Clouds

Mar 21, 2023

Chaoda Zheng, Xu Yan, Haiming Zhang, Baoyuan Wang, Shenghui Cheng, Shuguang Cui, Zhen Li

Figure 1 for An Effective Motion-Centric Paradigm for 3D Single Object Tracking in Point Clouds

Figure 2 for An Effective Motion-Centric Paradigm for 3D Single Object Tracking in Point Clouds

Figure 3 for An Effective Motion-Centric Paradigm for 3D Single Object Tracking in Point Clouds

Figure 4 for An Effective Motion-Centric Paradigm for 3D Single Object Tracking in Point Clouds

Abstract:3D single object tracking in LiDAR point clouds (LiDAR SOT) plays a crucial role in autonomous driving. Current approaches all follow the Siamese paradigm based on appearance matching. However, LiDAR point clouds are usually textureless and incomplete, which hinders effective appearance matching. Besides, previous methods greatly overlook the critical motion clues among targets. In this work, beyond 3D Siamese tracking, we introduce a motion-centric paradigm to handle LiDAR SOT from a new perspective. Following this paradigm, we propose a matching-free two-stage tracker M^2-Track. At the 1st-stage, M^2-Track localizes the target within successive frames via motion transformation. Then it refines the target box through motion-assisted shape completion at the 2nd-stage. Due to the motion-centric nature, our method shows its impressive generalizability with limited training labels and provides good differentiability for end-to-end cycle training. This inspires us to explore semi-supervised LiDAR SOT by incorporating a pseudo-label-based motion augmentation and a self-supervised loss term. Under the fully-supervised setting, extensive experiments confirm that M^2-Track significantly outperforms previous state-of-the-arts on three large-scale datasets while running at 57FPS (~8%, ~17% and ~22% precision gains on KITTI, NuScenes, and Waymo Open Dataset respectively). While under the semi-supervised setting, our method performs on par with or even surpasses its fully-supervised counterpart using fewer than half labels from KITTI. Further analysis verifies each component's effectiveness and shows the motion-centric paradigm's promising potential for auto-labeling and unsupervised domain adaptation.

* Journal Extension of M^2-Track, under review. arXiv admin note: substantial text overlap with arXiv:2203.01730

Via

Access Paper or Ask Questions

Mimic3D: Thriving 3D-Aware GANs via 3D-to-2D Imitation

Mar 16, 2023

Xingyu Chen, Yu Deng, Baoyuan Wang

Abstract:Generating images with both photorealism and multiview 3D consistency is crucial for 3D-aware GANs, yet existing methods struggle to achieve them simultaneously. Improving the photorealism via CNN-based 2D super-resolution can break the strict 3D consistency, while keeping the 3D consistency by learning high-resolution 3D representations for direct rendering often compromises image quality. In this paper, we propose a novel learning strategy, namely 3D-to-2D imitation, which enables a 3D-aware GAN to generate high-quality images while maintaining their strict 3D consistency, by letting the images synthesized by the generator's 3D rendering branch to mimic those generated by its 2D super-resolution branch. We also introduce 3D-aware convolutions into the generator for better 3D representation learning, which further improves the image generation quality. With the above strategies, our method reaches FID scores of 5.4 and 4.3 on FFHQ and AFHQ-v2 Cats, respectively, at 512x512 resolution, largely outperforming existing 3D-aware GANs using direct 3D rendering and coming very close to the previous state-of-the-art method that leverages 2D super-resolution. Project website: https://seanchenxy.github.io/Mimic3DWeb.

Via

Access Paper or Ask Questions

Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension

Feb 27, 2023

Nuo Chen, Hongguang Li, Yinan Bao, Junqing He, Xinshi Lin, Qi Yang, Jianfeng Liu, Ruyi Gan, Jiaxing Zhang, Baoyuan Wang(+1 more)

Figure 1 for Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension

Figure 2 for Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension

Figure 3 for Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension

Figure 4 for Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension

Abstract:The conversational machine reading comprehension (CMRC) task aims to answer questions in conversations, which has been a hot research topic in recent years because of its wide applications. However, existing CMRC benchmarks in which each conversation is assigned a static passage are inconsistent with real scenarios. Thus, model's comprehension ability towards real scenarios are hard to evaluate reasonably. To this end, we propose the first Chinese CMRC benchmark Orca and further provide zero-shot/few-shot settings to evaluate model's generalization ability towards diverse domains. We collect 831 hot-topic driven conversations with 4,742 turns in total. Each turn of a conversation is assigned with a response-related passage, aiming to evaluate model's comprehension ability more reasonably. The topics of conversations are collected from social media platform and cover 33 domains, trying to be consistent with real scenarios. Importantly, answers in Orca are all well-annotated natural responses rather than the specific spans or short phrase in previous datasets. Besides, we implement three strong baselines to tackle the challenge in Orca. The results indicate the great challenge of our CMRC benchmark. Our datatset and checkpoints are available at https://github.com/nuochenpku/Orca.

* 14 pages

Via

Access Paper or Ask Questions

Natural Response Generation for Chinese Reading Comprehension

Feb 17, 2023

Nuo Chen, Hongguang Li, Yinan Bao, Baoyuan Wang, Jia Li

Abstract:Machine reading comprehension (MRC) is an important area of conversation agents and draws a lot of attention. However, there is a notable limitation to current MRC benchmarks: The labeled answers are mostly either spans extracted from the target corpus or the choices of the given candidates, ignoring the natural aspect of high-quality responses. As a result, MRC models trained on these datasets can not generate human-like responses in real QA scenarios. To this end, we construct a new dataset called Penguin to promote the research of MRC, providing a training and test bed for natural response generation to real scenarios. Concretely, Penguin consists of 200k training data with high-quality fluent, and well-informed responses. Penguin is the first benchmark towards natural response generation in Chinese MRC on a relatively large scale. To address the challenges in Penguin, we develop two strong baselines: end-to-end and two-stage frameworks. Following that, we further design Prompt-BART: fine-tuning the pre-trained generative language models with a mixture of prefix prompts in Penguin. Extensive experiments validated the effectiveness of this design.

* 15 pages

Via

Access Paper or Ask Questions

Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors

Dec 07, 2022

Zhentao Yu, Zixin Yin, Deyu Zhou, Duomin Wang, Finn Wong, Baoyuan Wang

Figure 1 for Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors

Figure 2 for Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors

Figure 3 for Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors

Figure 4 for Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors

Abstract:In this paper, we introduce a simple and novel framework for one-shot audio-driven talking head generation. Unlike prior works that require additional driving sources for controlled synthesis in a deterministic manner, we instead probabilistically sample all the holistic lip-irrelevant facial motions (i.e. pose, expression, blink, gaze, etc.) to semantically match the input audio while still maintaining both the photo-realism of audio-lip synchronization and the overall naturalness. This is achieved by our newly proposed audio-to-visual diffusion prior trained on top of the mapping between audio and disentangled non-lip facial representations. Thanks to the probabilistic nature of the diffusion prior, one big advantage of our framework is it can synthesize diverse facial motion sequences given the same audio clip, which is quite user-friendly for many real applications. Through comprehensive evaluations on public benchmarks, we conclude that (1) our diffusion prior outperforms auto-regressive prior significantly on almost all the concerned metrics; (2) our overall system is competitive with prior works in terms of audio-lip synchronization but can effectively sample rich and natural-looking lip-irrelevant facial motions while still semantically harmonized with the audio input.

* 16 pages

Via

Access Paper or Ask Questions

UDE: A Unified Driving Engine for Human Motion Generation

Nov 29, 2022

Zixiang Zhou, Baoyuan Wang

Abstract:Generating controllable and editable human motion sequences is a key challenge in 3D Avatar generation. It has been labor-intensive to generate and animate human motion for a long time until learning-based approaches have been developed and applied recently. However, these approaches are still task-specific or modality-specific\cite {ahuja2019language2pose}\cite{ghosh2021synthesis}\cite{ferreira2021learning}\cite{li2021ai}. In this paper, we propose ``UDE", the first unified driving engine that enables generating human motion sequences from natural language or audio sequences (see Fig.~\ref{fig:teaser}). Specifically, UDE consists of the following key components: 1) a motion quantization module based on VQVAE that represents continuous motion sequence as discrete latent code\cite{van2017neural}, 2) a modality-agnostic transformer encoder\cite{vaswani2017attention} that learns to map modality-aware driving signals to a joint space, and 3) a unified token transformer (GPT-like\cite{radford2019language}) network to predict the quantized latent code index in an auto-regressive manner. 4) a diffusion motion decoder that takes as input the motion tokens and decodes them into motion sequences with high diversity. We evaluate our method on HumanML3D\cite{Guo_2022_CVPR} and AIST++\cite{li2021learn} benchmarks, and the experiment results demonstrate our method achieves state-of-the-art performance. Project website: \url{https://github.com/zixiangzhou916/UDE/

* 14 pages, 10 figures, 6 tables

Via

Access Paper or Ask Questions

Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis

Nov 26, 2022

Duomin Wang, Yu Deng, Zixin Yin, Heung-Yeung Shum, Baoyuan Wang

Figure 1 for Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis

Figure 2 for Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis

Figure 3 for Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis

Figure 4 for Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis

Abstract:We present a novel one-shot talking head synthesis method that achieves disentangled and fine-grained control over lip motion, eye gaze&blink, head pose, and emotional expression. We represent different motions via disentangled latent representations and leverage an image generator to synthesize talking heads from them. To effectively disentangle each motion factor, we propose a progressive disentangled representation learning strategy by separating the factors in a coarse-to-fine manner, where we first extract unified motion feature from the driving signal, and then isolate each fine-grained motion from the unified feature. We introduce motion-specific contrastive learning and regressing for non-emotional motions, and feature-level decorrelation and self-reconstruction for emotional expression, to fully utilize the inherent properties of each motion factor in unstructured video data to achieve disentanglement. Experiments show that our method provides high quality speech&lip-motion synchronization along with precise and disentangled control over multiple extra facial motions, which can hardly be achieved by previous methods.

* 18 pages

Via

Access Paper or Ask Questions

Learning Detailed Radiance Manifolds for High-Fidelity and 3D-Consistent Portrait Synthesis from Monocular Image

Nov 25, 2022

Yu Deng, Baoyuan Wang, Heung-Yeung Shum

Abstract:A key challenge for novel view synthesis of monocular portrait images is 3D consistency under continuous pose variations. Most existing methods rely on 2D generative models which often leads to obvious 3D inconsistency artifacts. We present a 3D-consistent novel view synthesis approach for monocular portrait images based on a recent proposed 3D-aware GAN, namely Generative Radiance Manifolds (GRAM), which has shown strong 3D consistency at multiview image generation of virtual subjects via the radiance manifolds representation. However, simply learning an encoder to map a real image into the latent space of GRAM can only reconstruct coarse radiance manifolds without faithful fine details, while improving the reconstruction fidelity via instance-specific optimization is time-consuming. We introduce a novel detail manifolds reconstructor to learn 3D-consistent fine details on the radiance manifolds from monocular images, and combine them with the coarse radiance manifolds for high-fidelity reconstruction. The 3D priors derived from the coarse radiance manifolds are used to regulate the learned details to ensure reasonable synthesized results at novel views. Trained on in-the-wild 2D images, our method achieves high-fidelity and 3D-consistent portrait synthesis largely outperforming the prior art.

* Project page: https://yudeng.github.io/GRAMInverter/

Via

Access Paper or Ask Questions

Hand Avatar: Free-Pose Hand Animation and Rendering from Monocular Video

Nov 23, 2022

Xingyu Chen, Baoyuan Wang, Heung-Yeung Shum

Abstract:We present HandAvatar, a novel representation for hand animation and rendering, which can generate smoothly compositional geometry and self-occlusion-aware texture. Specifically, we first develop a MANO-HD model as a high-resolution mesh topology to fit personalized hand shapes. Sequentially, we decompose hand geometry into per-bone rigid parts, and then re-compose paired geometry encodings to derive an across-part consistent occupancy field. As for texture modeling, we propose a self-occlusion-aware shading field (SelF). In SelF, drivable anchors are paved on the MANO-HD surface to record albedo information under a wide variety of hand poses. Moreover, directed soft occupancy is designed to describe the ray-to-surface relation, which is leveraged to generate an illumination field for the disentanglement of pose-independent albedo and pose-dependent illumination. Trained from monocular video data, our HandAvatar can perform free-pose hand animation and rendering while at the same time achieving superior appearance fidelity. We also demonstrate that HandAvatar provides a route for hand appearance editing. Project website: https://seanchenxy.github.io/HandAvatarWeb.

Via

Access Paper or Ask Questions