
Liang Huang


ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech

Nov 07, 2022
Xiaoran Fan, Chao Pang, Tian Yuan, He Bai, Renjie Zheng, Pengfei Zhu, Shuohuan Wang, Junkun Chen, Zeyu Chen, Liang Huang, Yu Sun, Hua Wu

[4 figures]

Speech representation learning has improved both speech understanding and speech synthesis tasks for a single language. However, its ability in cross-lingual scenarios has not been explored. In this paper, we extend the pretraining method to cross-lingual multi-speaker speech synthesis tasks, including cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing. We propose a speech-text joint pretraining framework in which we randomly mask the spectrogram and the phonemes given a speech example and its transcription. By learning to reconstruct the masked parts of the input in different languages, our model shows great improvements over speaker-embedding-based multi-speaker TTS methods. Moreover, our framework is end-to-end for both training and inference, without any finetuning effort. On cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing tasks, our experiments show that our model outperforms speaker-embedding-based multi-speaker TTS methods. The code and model are publicly available in PaddleSpeech.
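The joint masking step described above can be sketched in a few lines. This is an illustrative reconstruction, not the released model's code; the mask tokens, ratios, and function names are all hypothetical.

```python
import random

MASK_FRAME = "<mask_frame>"   # placeholder standing in for a masked spectrogram frame
MASK_PHONE = "<mask_phone>"   # placeholder standing in for a masked phoneme

def mask_inputs(spectrogram, phonemes, frame_ratio=0.3, phone_ratio=0.3, seed=0):
    """Randomly mask spectrogram frames and phonemes independently.

    The pretraining objective would then be to reconstruct the masked
    positions from the remaining speech and text context.
    """
    rng = random.Random(seed)
    masked_spec = [MASK_FRAME if rng.random() < frame_ratio else f
                   for f in spectrogram]
    masked_phones = [MASK_PHONE if rng.random() < phone_ratio else p
                     for p in phonemes]
    return masked_spec, masked_phones

spec = [f"frame{i}" for i in range(10)]
phones = ["HH", "AH", "L", "OW"]
m_spec, m_phones = mask_inputs(spec, phones)
```

Because both modalities are masked and reconstructed together, the model must use cross-modal (and, here, cross-lingual) context rather than relying on one input stream alone.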


PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit

May 20, 2022
Hui Zhang, Tian Yuan, Junkun Chen, Xintong Li, Renjie Zheng, Yuxin Huang, Xiaojie Chen, Enlei Gong, Zeyu Chen, Xiaoguang Hu, Dianhai Yu, Yanjun Ma, Liang Huang

[4 figures]

PaddleSpeech is an open-source all-in-one speech toolkit. It aims to facilitate the development and research of speech processing technologies by providing an easy-to-use command-line interface and a simple code structure. This paper describes the design philosophy and core architecture of PaddleSpeech in support of several essential speech-to-text and text-to-speech tasks. PaddleSpeech achieves competitive or state-of-the-art performance on various speech datasets and implements the most popular methods. It also provides recipes and pretrained models to quickly reproduce the experimental results in this paper. PaddleSpeech is publicly available at https://github.com/PaddlePaddle/PaddleSpeech.


A Fast Attention Network for Joint Intent Detection and Slot Filling on Edge Devices

May 16, 2022
Liang Huang, Senjie Liang, Feiyang Ye, Nan Gao

[4 figures]

Intent detection and slot filling are the two main tasks in natural language understanding and play an essential role in task-oriented dialogue systems. Jointly learning both tasks can improve inference accuracy and is popular in recent work. However, most joint models ignore inference latency and cannot meet the need to deploy dialogue systems at the edge. In this paper, we propose a Fast Attention Network (FAN) for joint intent detection and slot filling that guarantees both accuracy and low latency. Specifically, we introduce a clean and parameter-refined attention module to enhance the information exchange between intent and slot, improving semantic accuracy by more than 2%. FAN can be implemented on different encoders and delivers more accurate models at every speed level. Our experiments on the Jetson Nano platform show that FAN processes fifteen utterances per second with only a small accuracy drop, demonstrating its effectiveness and efficiency on edge devices.
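The abstract does not specify FAN's attention module in detail, but the intent-slot information exchange it describes is built on attention. As a minimal, generic sketch (not the paper's actual module), here is scaled dot-product attention of an intent query over token (slot) representations, in plain Python:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention of a single query over a sequence."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # weighted sum of the value vectors
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights

# A hypothetical intent query attends over per-token slot states.
intent_q = [1.0, 0.0]
token_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context, weights = attend(intent_q, token_states, token_states)
```

The attention weights sum to one, so the intent representation becomes a convex combination of the token states; a symmetric pass in the other direction would let slot predictions condition on the intent.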

* 9 pages, 4 figures 

Data-Driven Adaptive Simultaneous Machine Translation

Apr 27, 2022
Guangxu Xun, Mingbo Ma, Yuchen Bian, Xingyu Cai, Jiaji Huang, Renjie Zheng, Junkun Chen, Jiahong Yuan, Kenneth Church, Liang Huang

[4 figures]

In simultaneous translation (SimulMT), the most widely used strategy is the wait-k policy, thanks to its simplicity and effectiveness in balancing translation quality and latency. However, wait-k suffers from two major limitations: (a) it is a fixed policy that cannot adaptively adjust latency given the context, and (b) its training is much slower than full-sentence translation. To alleviate these issues, we propose a novel and efficient training scheme for adaptive SimulMT that augments the training corpus with adaptive prefix-to-prefix pairs, while the training complexity remains the same as that of full-sentence translation models. Experiments on two language pairs show that our method outperforms all strong baselines in terms of translation quality and latency.
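The fixed wait-k baseline that the paper improves upon has a simple READ/WRITE schedule: read k source tokens, then alternate one WRITE per READ until the source runs out. A sketch of that schedule (the adaptive policy proposed in the paper is precisely what replaces this fixed rule):

```python
def wait_k_actions(k, src_len, tgt_len):
    """Generate the READ/WRITE schedule of a fixed wait-k policy.

    The decoder first READs k source tokens, then alternates
    WRITE/READ; once the source is exhausted it only WRITEs
    the remaining target tokens.
    """
    actions, read, written = [], 0, 0
    while written < tgt_len:
        if read < min(k + written, src_len):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

acts = wait_k_actions(k=2, src_len=4, tgt_len=4)
# -> READ READ WRITE READ WRITE READ WRITE WRITE
```

The lag is constant at k tokens regardless of context, which is exactly limitation (a) above: an adaptive policy would instead decide READ vs. WRITE from the prefix seen so far.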


A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing

Mar 18, 2022
He Bai, Renjie Zheng, Junkun Chen, Xintong Li, Mingbo Ma, Liang Huang

[4 figures]

Recently, speech representation learning has improved many speech-related tasks such as speech recognition, speech classification, and speech-to-text translation. However, all of the above tasks are in the direction of speech understanding; for the inverse direction, speech synthesis, the potential of representation learning is yet to be realized, due to the challenging nature of generating high-quality speech. To address this problem, we propose Alignment-Aware Acoustic-Text Pretraining (A$^3$T), a framework that reconstructs masked acoustic signals from text input and acoustic-text alignment during training. In this way, the pretrained model can generate high-quality reconstructed spectrograms, which can be applied directly to speech editing and unseen-speaker TTS. Experiments show that A$^3$T outperforms SOTA models on speech editing and improves multi-speaker speech synthesis without an external speaker verification model.
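"Alignment-aware" masking means the masked acoustic regions respect phoneme boundaries given by the acoustic-text alignment, rather than hitting arbitrary frames. A minimal sketch of that idea, with an invented alignment format (phoneme index to frame span); the real model operates on spectrograms, not strings:

```python
def mask_aligned_spans(frames, alignment, masked_phonemes):
    """Mask all frames aligned to the chosen phonemes.

    `alignment` maps each phoneme index to its [start, end) frame
    span, so masking covers whole phoneme-aligned segments.
    """
    masked = list(frames)
    for p in masked_phonemes:
        start, end = alignment[p]
        for i in range(start, end):
            masked[i] = "<mask>"
    return masked

frames = [f"f{i}" for i in range(8)]
alignment = {0: (0, 3), 1: (3, 5), 2: (5, 8)}  # phoneme -> frame span
out = mask_aligned_spans(frames, alignment, masked_phonemes=[1])
```

Masking whole aligned spans forces the model to resynthesize an entire phoneme from its text and surrounding acoustics, which is the situation speech editing actually presents at inference time.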

* under review, 12 pages, 10 figures 

Computation Rate Maximum for Mobile Terminals in UAV-assisted Wireless Powered MEC Networks with Fairness Constraint

Sep 13, 2021
Xiaoyi Zhou, Liang Huang, Tong Ye, Weiqiang Sun

[4 figures]

This paper investigates an unmanned aerial vehicle (UAV)-assisted wireless powered mobile-edge computing (MEC) system, where the UAV powers the mobile terminals by wireless power transfer (WPT) and provides computation services for them. We aim to maximize the computation rate of the terminals while ensuring fairness among them. Considering the random trajectories of mobile terminals, we propose a soft actor-critic (SAC)-based UAV trajectory planning and resource allocation (SAC-TR) algorithm, which combines off-policy and maximum-entropy reinforcement learning to promote convergence. We design the reward as a heterogeneous function of the computation rate, fairness, and reaching the destination. Simulation results show that SAC-TR can quickly adapt to varying network environments and outperforms representative benchmarks in a variety of situations.
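The abstract does not give the exact reward function, but a common way to score fairness among terminals is Jain's fairness index. The composite reward below is therefore only a plausible sketch: the weights, bonus, and choice of Jain's index are assumptions, not the paper's definition.

```python
def jain_fairness(rates):
    """Jain's fairness index: 1.0 when all rates are equal,
    approaching 1/n when a single terminal dominates."""
    n = len(rates)
    s = sum(rates)
    return s * s / (n * sum(r * r for r in rates)) if s else 0.0

def reward(rates, reached_destination, w_rate=1.0, w_fair=1.0, bonus=10.0):
    """Hypothetical composite reward mixing throughput, fairness,
    and a terminal bonus for reaching the destination."""
    r = w_rate * sum(rates) + w_fair * jain_fairness(rates)
    if reached_destination:
        r += bonus
    return r
```

Mixing the three terms in one scalar is what makes the reward "heterogeneous": the agent must trade total computation rate against how evenly it is distributed, while still being pulled toward its destination.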

* 12 pages 

The Role of Phonetic Units in Speech Emotion Recognition

Aug 02, 2021
Jiahong Yuan, Xingyu Cai, Renjie Zheng, Liang Huang, Kenneth Church

[4 figures]

We propose a method for emotion recognition through emotion-dependent speech recognition using Wav2vec 2.0. Our method achieved a significant improvement over most previously reported results on IEMOCAP, a benchmark emotion dataset. Different types of phonetic units are employed and compared in terms of accuracy and robustness of emotion recognition within and across datasets and languages. Models of phonemes, broad phonetic classes, and syllables all significantly outperform the utterance model, demonstrating that phonetic units are helpful and should be incorporated into speech emotion recognition. The best performance comes from using broad phonetic classes. Further research is needed to investigate the optimal set of broad phonetic classes for the task of emotion recognition. Finally, we found that Wav2vec 2.0 can be fine-tuned to recognize coarser-grained or larger phonetic units than phonemes, such as broad phonetic classes and syllables.
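A "broad phonetic class" model simply collapses fine-grained phonemes into coarse categories before recognition. The mapping below is a standard-looking ARPAbet grouping for illustration only; the paper explicitly leaves the optimal class inventory as an open question.

```python
# Illustrative mapping from ARPAbet phonemes to broad phonetic classes.
BROAD_CLASSES = {
    "vowel":       {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
                    "EY", "IH", "IY", "OW", "OY", "UH", "UW"},
    "stop":        {"B", "D", "G", "K", "P", "T"},
    "fricative":   {"DH", "F", "HH", "S", "SH", "TH", "V", "Z", "ZH"},
    "nasal":       {"M", "N", "NG"},
    "approximant": {"L", "R", "W", "Y"},
}
# Invert to a phoneme -> class lookup table.
PHONE_TO_CLASS = {p: c for c, ps in BROAD_CLASSES.items() for p in ps}

def to_broad_classes(phonemes):
    """Collapse a phoneme sequence into broad phonetic classes."""
    return [PHONE_TO_CLASS.get(p, "other") for p in phonemes]

classes = to_broad_classes(["HH", "AH", "L", "OW"])  # "hello"
```

Coarser units shrink the output vocabulary, which plausibly makes the recognizer more robust across datasets and languages, consistent with the cross-lingual results reported above.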


Decoupling recognition and transcription in Mandarin ASR

Aug 02, 2021
Jiahong Yuan, Xingyu Cai, Dongji Gao, Renjie Zheng, Liang Huang, Kenneth Church

[4 figures]

Much of the recent literature on automatic speech recognition (ASR) takes an end-to-end approach. Unlike English, where the writing system is closely related to sound, Chinese characters (Hanzi) represent meaning, not sound. We propose factoring audio -> Hanzi into two sub-tasks: (1) audio -> Pinyin and (2) Pinyin -> Hanzi, where Pinyin is a system of phonetic transcription for standard Chinese. Factoring the audio -> Hanzi task in this way achieves 3.9% CER (character error rate) on the Aishell-1 corpus, the best result reported on this dataset so far.
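The two-stage factoring and the CER metric can be sketched as follows. The second stage here is a toy lookup table (the paper's stage 2 is a learned model), and the CER definition is the standard edit distance normalized by reference length.

```python
# Hypothetical stage-2 lexicon: Pinyin syllables -> Hanzi.
PINYIN_TO_HANZI = {"ni3": "你", "hao3": "好", "shi4": "世", "jie4": "界"}

def pinyin_to_hanzi(pinyin_seq):
    """Toy Pinyin -> Hanzi stage; in the paper this is a trained model."""
    return "".join(PINYIN_TO_HANZI[p] for p in pinyin_seq)

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance over reference length."""
    return edit_distance(ref, hyp) / len(ref)

hyp = pinyin_to_hanzi(["ni3", "hao3", "shi4", "jie4"])
```

Because Pinyin is close to the acoustics, stage 1 behaves like ASR for a phonetic script, while stage 2 becomes a text-only disambiguation problem; CER then scores the final Hanzi output exactly as for an end-to-end system.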

* submitted to ASRU 2021 

Structure-aware Interactive Graph Neural Networks for the Prediction of Protein-Ligand Binding Affinity

Jul 21, 2021
Shuangli Li, Jingbo Zhou, Tong Xu, Liang Huang, Fan Wang, Haoyi Xiong, Weili Huang, Dejing Dou, Hui Xiong

[4 figures]

Drug discovery often relies on the successful prediction of protein-ligand binding affinity. Recent advances have shown great promise in applying graph neural networks (GNNs) to affinity prediction by learning representations of protein-ligand complexes. However, existing solutions usually treat protein-ligand complexes as topological graph data, so the biomolecular structural information is not fully utilized; the essential long-range interactions among atoms are also neglected in GNN models. To this end, we propose a structure-aware interactive graph neural network (SIGN) that consists of two components: polar-inspired graph attention layers (PGAL) and pairwise interactive pooling (PiPool). Specifically, PGAL iteratively performs node-edge aggregation to update the embeddings of nodes and edges while preserving the distance and angle information among atoms. PiPool is then adopted to gather interactive edges, with a subsequent reconstruction loss to reflect the global interactions. An exhaustive experimental study on two benchmarks verifies the superiority of SIGN.

* 11 pages, 8 figures, Accepted by KDD 2021 (Research Track) 

Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR

Jun 11, 2021
Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang

[4 figures]

Simultaneous speech-to-text translation is widely useful in many scenarios. The conventional cascaded approach uses a pipeline of streaming ASR followed by simultaneous MT, but suffers from error propagation and extra latency. To alleviate these issues, recent efforts attempt to directly translate the source speech into target text simultaneously, but this is much harder due to the combination of two separate tasks. We instead propose a new paradigm with the advantages of both the cascaded and end-to-end approaches. The key idea is to use two separate, but synchronized, decoders for streaming ASR and direct speech-to-text translation (ST), respectively; the intermediate results of ASR guide the decoding policy of (but are not fed as input to) ST. During training, we use multitask learning to jointly learn the two tasks with a shared encoder. En-to-De and En-to-Es experiments on the MuST-C dataset demonstrate that our proposed technique achieves substantially better translation quality at similar levels of latency.
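The synchronization idea, where ASR progress gates the ST decoder without feeding it, can be sketched as a simple policy. The gating rule and `lag` parameter below are illustrative assumptions; the paper's actual policy is learned jointly with the models.

```python
def st_schedule(asr_commits, tgt_len, lag=1):
    """Hypothetical synchronization policy: the ST decoder may emit its
    i-th target token only once the streaming ASR decoder has committed
    at least i + lag source tokens (or has consumed all the speech).

    `asr_commits[t]` is the number of source tokens ASR has committed
    after consuming t speech frames.
    """
    schedule = []  # (frames_consumed, target_index) pairs
    t = 0
    for i in range(1, tgt_len + 1):
        while t < len(asr_commits) - 1 and asr_commits[t] < i + lag:
            t += 1  # keep listening until ASR is far enough ahead
        schedule.append((t, i))
    return schedule

# ASR commits source tokens gradually over 6 speech frames.
commits = [0, 1, 2, 2, 3, 4]
sched = st_schedule(commits, tgt_len=3, lag=1)
```

Crucially, only the *count* of committed ASR tokens enters the policy; the ST decoder still translates directly from speech, so ASR transcription errors cannot propagate into the translation itself.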

* accepted by Findings of ACL 2021 