
Liyang Chen


AdaMesh: Personalized Facial Expressions and Head Poses for Speech-Driven 3D Facial Animation

Oct 11, 2023
Liyang Chen, Weihong Bao, Shun Lei, Boshi Tang, Zhiyong Wu, Shiyin Kang, Haozhi Huang


Speech-driven 3D facial animation aims to generate facial movements that are synchronized with the driving speech and has been widely explored recently. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works attempt to capture personalized styles by fine-tuning modules, but the limited training data leads to a lack of vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach that learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter that builds a discrete pose prior and retrieves the appropriate style embedding with a semantic-aware pose style matrix, without fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style of the reference video, and generates vivid facial animation. The supplementary video and code will be available at https://adamesh.github.io.
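The abstract does not spell out the MoLoRA configuration, so the following is only a minimal sketch of what a mixture of low-rank adapters over a frozen linear layer could look like; the number of experts, ranks, scaling, and soft gating are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical mixture-of-low-rank-adaptation (MoLoRA) layer: a frozen base
# linear layer plus several LoRA experts mixed by a learned soft gate.
import torch
import torch.nn as nn

class MoLoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, num_experts=4, rank=4, alpha=8.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)   # pre-trained weight stays frozen
        self.base.bias.requires_grad_(False)
        self.down = nn.Parameter(torch.randn(num_experts, in_dim, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(num_experts, rank, out_dim))
        self.gate = nn.Linear(in_dim, num_experts)  # soft routing over experts
        self.scale = alpha / rank

    def forward(self, x):                                       # x: (B, T, in_dim)
        weights = torch.softmax(self.gate(x), dim=-1)           # (B, T, E)
        low = torch.einsum("bsi,eir->bser", x, self.down)       # down-project per expert
        delta = torch.einsum("bser,ero->bseo", low, self.up)    # up-project per expert
        delta = (weights.unsqueeze(-1) * delta).sum(dim=2)      # mix expert updates
        return self.base(x) + self.scale * delta

layer = MoLoRALinear(256, 256)
print(layer(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```

Only the adapters and gate are trainable, which keeps the number of parameters updated during the roughly 10-second adaptation small.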

* Project Page: https://adamesh.github.io 

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Sep 22, 2023
Shun Lei, Yixuan Zhou, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng


Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing the speech waveform into discrete acoustic tokens and modeling these tokens with a language model, recent language model-based TTS systems show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt from an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone the personal speaking style. In this paper, we propose a novel zero-shot TTS model with multi-scale acoustic prompts, based on the neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme level from a style prompt consisting of multiple sentences. Following that, a VALL-E based acoustic decoder is used to model the timbre from a timbre prompt at the frame level and generate speech. Experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and achieves better performance when scaling out to a longer style prompt.
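As a rough illustration of the "speaker-aware text encoder" idea (not the authors' implementation), the sketch below lets phoneme representations cross-attend to acoustic features of a multi-sentence style prompt; dimensions, layer counts, and the pooling-free design are assumptions.

```python
# Illustrative speaker-aware text encoder: phoneme queries attend to a long
# style prompt so phoneme-level features become style-conditioned. The short
# frame-level timbre prompt would be consumed separately by the acoustic decoder.
import torch
import torch.nn as nn

class SpeakerAwareTextEncoder(nn.Module):
    def __init__(self, num_phonemes=100, dim=256, heads=4):
        super().__init__()
        self.phoneme_emb = nn.Embedding(num_phonemes, dim)
        self.self_attn = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, phoneme_ids, style_prompt):
        # phoneme_ids: (B, T_text); style_prompt: (B, T_style, dim) acoustic features
        h = self.self_attn(self.phoneme_emb(phoneme_ids))
        style, _ = self.cross_attn(h, style_prompt, style_prompt)
        return self.norm(h + style)   # phoneme-level, style-conditioned text features

enc = SpeakerAwareTextEncoder()
out = enc(torch.randint(0, 100, (2, 50)), torch.randn(2, 300, 256))
print(out.shape)  # torch.Size([2, 50, 256])
```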

* Submitted to ICASSP 2024 

VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer

Aug 11, 2023
Liyang Chen, Zhiyong Wu, Runnan Li, Weihong Bao, Jun Ling, Xu Tan, Sheng Zhao


Current talking face generation methods mainly focus on speech-lip synchronization. However, insufficient investigation of facial talking style leads to lifeless and monotonous avatars. Most previous works fail to imitate expressive styles from arbitrary video prompts while ensuring the authenticity of the generated video. This paper proposes an unsupervised variational style transfer model (VAST) to vivify neutral photo-realistic avatars. Our model consists of three key components: a style encoder that extracts facial style representations from the given video prompts; a hybrid facial expression decoder that models accurate speech-related movements; and a variational style enhancer that makes the style space highly expressive and meaningful. With these designs for facial style learning, our model can flexibly capture the expressive facial style from arbitrary video prompts and transfer it onto a personalized image renderer in a zero-shot manner. Experimental results demonstrate that the proposed approach yields a more vivid talking avatar with higher authenticity and richer expressiveness.
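To make the "variational" part concrete, here is a minimal sketch, under assumptions of my own, of a style encoder that maps a video prompt's per-frame expression features to a Gaussian posterior; the reparameterized sample would condition the expression decoder, with the KL term regularizing the style space.

```python
# Sketch of a variational style encoder (dimensions and backbone are assumed,
# not taken from VAST): expression sequence -> Gaussian posterior -> style sample.
import torch
import torch.nn as nn

class VariationalStyleEncoder(nn.Module):
    def __init__(self, feat_dim=64, style_dim=16):
        super().__init__()
        self.backbone = nn.GRU(feat_dim, 128, batch_first=True)
        self.to_mu = nn.Linear(128, style_dim)
        self.to_logvar = nn.Linear(128, style_dim)

    def forward(self, expr_seq):          # (B, T, feat_dim) per-frame expression coeffs
        _, h = self.backbone(expr_seq)    # final hidden state: (1, B, 128)
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        style = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return style, kl                  # style conditions the decoder; kl regularizes

enc = VariationalStyleEncoder()
style, kl = enc(torch.randn(2, 100, 64))
print(style.shape, kl.item())
```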

* Accepted by ICCV 2023 Workshop 

MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis

Jul 29, 2023
Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Xixin Wu, Shiyin Kang, Helen Meng


Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting style embeddings at a single scale from information within the current sentence, while context information in neighboring sentences and the multi-scale nature of style in human speech are neglected, making it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis, to capture and predict styles at different levels from a wider range of context rather than a single sentence. Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained together with a FastSpeech 2 based acoustic model. The predictor explores hierarchical context information by considering structural relationships in the context and predicts style embeddings at the global, sentence, and subword levels. The extractor extracts multi-scale style embeddings from the ground-truth speech and explicitly guides the style prediction. Evaluations on both in-domain and out-of-domain audiobook datasets demonstrate that the proposed method significantly outperforms three baselines. In addition, we analyze the context information and multi-scale style representations, which have not been discussed before.
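The sketch below is one assumed way to produce style embeddings at the three scales mentioned (global, sentence, subword) by pooling finer-scale features into coarser ones; the actual MSStyleTTS predictor uses hierarchical context modeling that is not reproduced here.

```python
# Illustrative multi-scale style prediction: subword features are pooled into
# sentence-level and global-level styles. Dimensions and pooling are assumptions.
import torch
import torch.nn as nn

class MultiScaleStylePredictor(nn.Module):
    def __init__(self, text_dim=256, style_dim=64):
        super().__init__()
        self.subword_proj = nn.Linear(text_dim, style_dim)
        self.sentence_proj = nn.Linear(style_dim, style_dim)
        self.global_proj = nn.Linear(style_dim, style_dim)

    def forward(self, subword_feats, sentence_ids):
        # subword_feats: (T, text_dim); sentence_ids: (T,) sentence index per subword
        sub_style = self.subword_proj(subword_feats)                     # subword-level
        num_sent = int(sentence_ids.max().item()) + 1
        sent_style = torch.stack(
            [sub_style[sentence_ids == i].mean(0) for i in range(num_sent)]
        )
        sent_style = self.sentence_proj(sent_style)                      # sentence-level
        glob_style = self.global_proj(sent_style.mean(0, keepdim=True))  # global-level
        return glob_style, sent_style, sub_style

pred = MultiScaleStylePredictor()
g, s, w = pred(torch.randn(30, 256), torch.tensor([0] * 10 + [1] * 10 + [2] * 10))
print(g.shape, s.shape, w.shape)
```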

* Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing 

GTN-Bailando: Genre Consistent Long-Term 3D Dance Generation based on Pre-trained Genre Token Network

Apr 25, 2023
Haolin Zhuang, Shun Lei, Long Xiao, Weiqin Li, Liyang Chen, Sicheng Yang, Zhiyong Wu, Shiyin Kang, Helen Meng


Music-driven 3D dance generation has become an active research topic in recent years, with great potential for real-world applications. Most existing methods do not consider genre, which results in genre inconsistency in the generated dance movements. In addition, the correlation between dance genre and music has not been investigated. To address these issues, we propose a genre-consistent dance generation framework, GTN-Bailando. First, we propose the Genre Token Network (GTN), which infers the genre from music to enhance the genre consistency of long-term dance generation. Second, to improve the generalization capability of the model, a pre-training and fine-tuning strategy is adopted. Experimental results on the AIST++ dataset show that the proposed dance generation framework outperforms state-of-the-art methods in terms of motion quality and genre consistency.
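One plausible reading of a Genre Token Network, sketched below with assumed dimensions, is a small bank of learnable genre tokens that music features attend over (in the spirit of global style tokens), yielding a genre embedding to condition the dance generator; this is an illustration, not the paper's exact architecture.

```python
# Assumed Genre Token Network sketch: clip-level music query attends over a
# learnable token bank; the attention weights indicate the inferred genre mix.
import torch
import torch.nn as nn

class GenreTokenNetwork(nn.Module):
    def __init__(self, music_dim=128, num_tokens=10, dim=128, heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)  # genre bank
        self.query_proj = nn.Linear(music_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, music_feats):                               # (B, T, music_dim)
        q = self.query_proj(music_feats).mean(1, keepdim=True)    # (B, 1, dim) clip query
        kv = self.tokens.unsqueeze(0).expand(music_feats.size(0), -1, -1)
        genre_emb, weights = self.attn(q, kv, kv)
        return genre_emb.squeeze(1), weights                      # embedding + token weights

gtn = GenreTokenNetwork()
emb, w = gtn(torch.randn(2, 240, 128))
print(emb.shape, w.shape)  # torch.Size([2, 128]) torch.Size([2, 1, 10])
```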

* Accepted by ICASSP 2023. Demo page: https://im1eon.github.io/ICASSP23-GTNB-DG/ 

Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis

Apr 13, 2023
Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen Meng


Recent advances in text-to-speech have significantly improved the expressiveness of synthesized speech. However, it is still challenging to generate speech with a contextually appropriate and coherent speaking style for multi-sentence text in audiobooks. In this paper, we propose a context-aware, coherent speaking style prediction method for audiobook speech synthesis. To predict the style embedding of the current utterance, a hierarchical transformer-based context-aware style predictor with a mixture attention mask is designed, considering both text-side context information and speech-side style information of previous utterances. Based on this, we can generate long-form speech with coherent style and prosody, sentence by sentence. Objective and subjective evaluations on a Mandarin audiobook dataset demonstrate that the proposed model generates speech with a more expressive and coherent speaking style than baselines, for both single-sentence and multi-sentence tests.
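A simplified sketch of the prediction setup, with the architecture assumed for illustration: a transformer fuses text embeddings of surrounding sentences with style embeddings of previously synthesized utterances and reads the current utterance's style off a learned query token. The mixture attention mask of the paper is not reproduced here.

```python
# Assumed context-aware style predictor: text-side and speech-side context
# tokens are type-tagged and attended over by a single style query.
import torch
import torch.nn as nn

class ContextStylePredictor(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        self.type_emb = nn.Embedding(2, dim)   # 0 = text context, 1 = past speech style
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.out = nn.Linear(dim, dim)

    def forward(self, text_ctx, past_styles):
        # text_ctx: (B, N_text, dim); past_styles: (B, N_prev, dim)
        text_ctx = text_ctx + self.type_emb.weight[0]
        past_styles = past_styles + self.type_emb.weight[1]
        seq = torch.cat([self.query.expand(text_ctx.size(0), -1, -1),
                         text_ctx, past_styles], dim=1)
        h = self.encoder(seq)
        return self.out(h[:, 0])               # style embedding for the current utterance

pred = ContextStylePredictor()
print(pred(torch.randn(2, 5, 256), torch.randn(2, 3, 256)).shape)  # (2, 256)
```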

* Accepted by ICASSP 2023 

StableFace: Analyzing and Improving Motion Stability for Talking Face Generation

Aug 29, 2022
Jun Ling, Xu Tan, Liyang Chen, Runnan Li, Yuchao Zhang, Sheng Zhao, Li Song


While previous speech-driven talking face generation methods have made significant progress in improving the visual quality and lip-sync quality of synthesized videos, they pay less attention to lip motion jitters, which greatly undermine the realism of talking face videos. What causes motion jitters, and how can the problem be mitigated? In this paper, we conduct systematic analyses of the motion jittering problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and output video, and we improve motion stability with a series of effective designs. We find that several issues can lead to jitters in synthesized talking face video: 1) jitters from the input 3D face representations; 2) training-inference mismatch; and 3) a lack of dependency modeling among video frames. Accordingly, we propose three solutions: 1) a Gaussian-based adaptive smoothing module that smooths the 3D face representations to eliminate jitters in the input; 2) augmented erosions added to the input of the neural renderer during training to simulate the distortion at inference and reduce the mismatch; and 3) an audio-fused transformer generator that models dependency among video frames. In addition, since there is no off-the-shelf metric for measuring motion jitters in talking face video, we devise an objective metric, the Motion Stability Index (MSI), which quantifies motion jitters as the reciprocal of the variance of acceleration. Extensive experimental results show the superiority of our method for motion-stable face video generation, with better quality than previous systems.
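Since the abstract defines MSI as the reciprocal of the variance of acceleration, a minimal sketch is given below; the exact normalization, the tracked points, and the averaging scheme are assumptions, not the paper's specification.

```python
# Hedged sketch of a Motion Stability Index style metric: second temporal
# differences approximate acceleration; jittery motion raises its variance,
# so the reciprocal is lower for unstable videos.
import numpy as np

def motion_stability_index(traj, eps=1e-8):
    """traj: (T, N, D) per-frame positions of N tracked facial points (D = 2 or 3)."""
    accel = np.diff(traj, n=2, axis=0)          # second temporal difference
    var = accel.var(axis=0).mean()              # variance over time, averaged over points/axes
    return 1.0 / (var + eps)

smooth = np.cumsum(np.ones((100, 68, 2)) * 0.01, axis=0)        # steady linear motion
jitter = smooth + np.random.default_rng(0).normal(0, 0.05, smooth.shape)
print(motion_stability_index(smooth), motion_stability_index(jitter))
```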

* 12 pages 

The ReprGesture entry to the GENEA Challenge 2022

Aug 25, 2022
Sicheng Yang, Zhiyong Wu, Minglei Li, Mengchen Zhao, Jiuxin Lin, Liyang Chen, Weihong Bao


This paper describes the ReprGesture entry to the Generation and Evaluation of Non-verbal Behaviour for Embodied Agents (GENEA) Challenge 2022. The GENEA challenge provides processed datasets and performs crowdsourced evaluations to compare the performance of different gesture generation systems. In this paper, we explore an automatic gesture generation system based on multimodal representation learning. We use WavLM features for audio, FastText features for text, and position and rotation matrix features for gesture. Each modality is projected into two distinct subspaces: modality-invariant and modality-specific. To learn modality-invariant commonalities and capture the characteristics of modality-specific representations, a gradient reversal layer based adversarial classifier and modality reconstruction decoders are used during training. The gesture decoder generates gestures using all representations together with rhythm-related features from the audio. Our code, pre-trained models, and demo are available at https://github.com/YoungSeng/ReprGesture.
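The gradient reversal mechanism mentioned above is a standard technique; the sketch below shows it together with assumed modality projections (the released repository is the authoritative implementation, and the dimensions and classifier here are illustrative).

```python
# Sketch: gradient reversal layer plus invariant/specific projections. The
# adversarial modality classifier on the invariant subspace receives reversed
# gradients, pushing invariant features to be indistinguishable across modalities.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None      # flip gradients toward the feature extractor

class ModalityProjector(nn.Module):
    """Projects one modality into a shared (invariant) and a private (specific) subspace."""
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.invariant = nn.Linear(in_dim, hidden)
        self.specific = nn.Linear(in_dim, hidden)

    def forward(self, x):
        return self.invariant(x), self.specific(x)

classifier = nn.Linear(128, 3)            # predicts audio / text / gesture
proj = ModalityProjector(in_dim=768)
inv, spec = proj(torch.randn(4, 768))
logits = classifier(GradReverse.apply(inv, 1.0))
print(inv.shape, spec.shape, logits.shape)
```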

* 8 pages, 4 figures, ICMI 2022 

Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Apr 06, 2022
Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen Meng


Previous works on expressive speech synthesis mainly focus on the current sentence; the context in adjacent sentences is neglected, resulting in an inflexible speaking style for the same text and a lack of speech variation. In this paper, we propose a hierarchical framework to model speaking style from context. A hierarchical context encoder is proposed to explore a wider range of contextual information by considering structural relationships in the context, including inter-phrase and inter-sentence relations. Moreover, to encourage this encoder to learn better style representations, we introduce a novel training strategy with knowledge distillation, which provides the target for encoder training. Both objective and subjective evaluations on a Mandarin lecture dataset demonstrate that the proposed method significantly improves the naturalness and expressiveness of the synthesized speech.
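A minimal sketch of the distillation idea, with the encoders and dimensions assumed for illustration: a reference encoder extracts the target style from ground-truth speech, and the context encoder is trained to predict that style from text context alone.

```python
# Assumed knowledge-distillation setup: the speech-side "teacher" provides the
# style target; the text-side context encoder (student) regresses toward it.
import torch
import torch.nn as nn

teacher = nn.GRU(80, 128, batch_first=True)        # reference encoder over mel frames
student = nn.GRU(256, 128, batch_first=True)       # context encoder over text features
to_style = nn.Linear(128, 64)

mel, text_ctx = torch.randn(2, 400, 80), torch.randn(2, 60, 256)
with torch.no_grad():
    _, h_t = teacher(mel)
    target_style = to_style(h_t.squeeze(0))         # distillation target (no gradient)

_, h_s = student(text_ctx)
pred_style = to_style(h_s.squeeze(0))
distill_loss = nn.functional.l1_loss(pred_style, target_style)
print(distill_loss.item())
```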

* Accepted by ICASSP 2022 