Speech emotion recognition (SER) is a crucial research topic in human-computer interaction. Existing works are mainly based on manually designed models; despite their great success, such methods rely heavily on expert experience, and manual design is time-consuming and cannot exhaust all possible structures. To address this problem, we propose a neural architecture search (NAS) based framework for SER, called "EmotionNAS". We take spectrogram and wav2vec features as the inputs, followed by NAS to optimize the network structure for each feature separately. We further incorporate the complementary information in these features through decision-level fusion. Experimental results on IEMOCAP demonstrate that our method outperforms existing state-of-the-art strategies on SER.
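As a hedged illustration of the decision-level fusion step, the sketch below averages the class posteriors of the two branches; the function name, the 4-class setup, and the weight `alpha` are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of decision-level fusion: each branch (spectrogram,
# wav2vec) yields class logits, and the fused posterior is a weighted
# average. `alpha` is a hypothetical fusion weight.
import torch
import torch.nn.functional as F

def fuse_decisions(spec_logits: torch.Tensor,
                   w2v_logits: torch.Tensor,
                   alpha: float = 0.5) -> torch.Tensor:
    """Weighted decision-level fusion of two emotion classifiers."""
    spec_probs = F.softmax(spec_logits, dim=-1)   # spectrogram branch
    w2v_probs = F.softmax(w2v_logits, dim=-1)     # wav2vec branch
    fused = alpha * spec_probs + (1.0 - alpha) * w2v_probs
    return fused.argmax(dim=-1)                   # predicted emotion class

# Example: batch of 4 utterances, 4 emotion classes (a common IEMOCAP setup).
pred = fuse_decisions(torch.randn(4, 4), torch.randn(4, 4))
```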
Traditional vocoders have the advantages of high synthesis efficiency, strong interpretability, and speech editability, while neural vocoders offer high synthesis quality. To combine the advantages of the two, and inspired by the classical deterministic plus stochastic model, this paper proposes a novel neural vocoder named NeuralDPS, which retains high speech quality while achieving high synthesis efficiency and noise controllability. Firstly, the framework contains four modules: a deterministic source module, a stochastic source module, a neural V/UV decision module, and a neural filter module. The vocoder requires only spectral parameters as input, which avoids errors caused by estimating additional parameters such as F0. Secondly, to handle the fact that different frequency bands may contain different proportions of deterministic and stochastic components, a multiband excitation strategy is used to generate a more accurate excitation signal and reduce the neural filter's burden. Thirdly, a method to control the noise components of speech is proposed, so that the signal-to-noise ratio (SNR) of speech can be adjusted easily. Objective and subjective experimental results show that the proposed NeuralDPS vocoder achieves performance similar to WaveNet while generating waveforms at least 280 times faster than the WaveNet vocoder; it is also 28% faster than WaveGAN on a single CPU core. Experiments further verify that the method can effectively control the noise components in the predicted speech and adjust its SNR. Examples of generated speech can be found at https://hairuo55.github.io/NeuralDPS.
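To make the multiband excitation idea concrete, here is a minimal NumPy sketch in which each band mixes a deterministic (periodic) source and a stochastic (noise) source with its own proportion. The band edges, mixing weights, and sources are illustrative assumptions; in NeuralDPS these components are produced and weighted by learned modules.

```python
# Illustrative multiband excitation: combine periodic and noise sources
# band by band, each with its own deterministic/stochastic proportion.
import numpy as np
from scipy.signal import butter, sosfilt

def multiband_excitation(periodic, noise, sr, band_edges, voiced_ratio):
    """Mix the two sources per band and sum the filtered bands."""
    excitation = np.zeros_like(periodic)
    for (lo, hi), v in zip(band_edges, voiced_ratio):
        sos = butter(4, [lo, hi], btype="band", fs=sr, output="sos")
        # v = proportion of the deterministic component in this band
        excitation += sosfilt(sos, v * periodic + (1.0 - v) * noise)
    return excitation

sr = 16000
t = np.arange(sr) / sr
periodic = np.sign(np.sin(2 * np.pi * 120 * t))    # crude periodic source (square wave)
noise = np.random.randn(sr) * 0.3                  # stochastic source
bands = [(50, 1000), (1000, 4000), (4000, 7900)]   # hypothetical band edges
exc = multiband_excitation(periodic, noise, sr, bands, [0.9, 0.6, 0.2])
```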
Conversations have become a critical data format on social media platforms. Understanding conversations in terms of emotion, content, and other aspects has also attracted increasing attention from researchers due to its widespread applications in human-computer interaction. In real-world environments, we often encounter the problem of incomplete modalities, which has become a core issue in conversation understanding. Researchers have proposed various methods to address this problem; however, existing approaches are mainly designed for individual utterances or medical images rather than conversational data, and therefore cannot exploit temporal and speaker information in conversations. To this end, we propose a novel framework for incomplete multimodal learning in conversations, called "Graph Complete Network (GCNet)", filling the gap left by existing works. Our GCNet contains two well-designed graph neural network-based modules, "Speaker GNN" and "Temporal GNN", to capture speaker and temporal information in conversations. To make full use of both complete and incomplete data in feature learning, we jointly optimize classification and reconstruction in an end-to-end manner. To verify the effectiveness of our method, we conduct experiments on three benchmark conversational datasets. Experimental results demonstrate that GCNet is superior to existing state-of-the-art approaches to incomplete multimodal learning.
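A hedged sketch of the joint classification-plus-reconstruction objective follows; the masking convention (reconstruction supervised only on observed modality entries) and the weight `lambda_rec` are assumptions for illustration.

```python
# Joint objective sketch: cross-entropy on the conversation labels plus
# a masked mean-squared reconstruction loss on the modality features.
import torch
import torch.nn.functional as F

def joint_loss(logits, labels, reconstructed, original, mask, lambda_rec=1.0):
    """mask[i, m] = 1 where modality m of sample i was observed."""
    cls_loss = F.cross_entropy(logits, labels)
    # Supervise reconstruction only where the target modality exists.
    rec_loss = (mask.unsqueeze(-1) * (reconstructed - original) ** 2).mean()
    return cls_loss + lambda_rec * rec_loss

# Example: 8 utterances, 3 modalities, 64-dim features, 6 classes.
logits = torch.randn(8, 6)
labels = torch.randint(0, 6, (8,))
orig = torch.randn(8, 3, 64)
rec = torch.randn(8, 3, 64)        # would come from the reconstruction head
mask = torch.randint(0, 2, (8, 3)).float()
loss = joint_loss(logits, labels, rec, orig, mask)
```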
Audio deepfake detection is an emerging topic that was included in ASVspoof 2021. However, recent shared tasks have not covered many real-life and challenging scenarios. The first Audio Deep synthesis Detection challenge (ADD) was organized to fill this gap. ADD 2022 includes three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF), and audio fake game (FG). The LF track focuses on distinguishing bona fide from fully fake utterances under various real-world noises. The PF track aims to distinguish partially fake audio from genuine audio. The FG track is a rivalry game comprising two tasks: an audio generation task and an audio fake detection task. In this paper, we describe the datasets, evaluation metrics, and protocols. We also report major findings that reflect recent advances in audio deepfake detection.
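The equal error rate (EER) is the standard metric for spoof and deepfake detection tasks of this kind; the sketch below computes it from detection scores as a generic illustration, not as the challenge's official scoring tool.

```python
# Generic EER computation: sweep thresholds over all scores and find
# where the false acceptance and false rejection rates cross.
import numpy as np

def compute_eer(bona_scores, spoof_scores):
    """Higher score = more likely bona fide."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(bona_scores < t).mean() for t in thresholds])    # false rejects
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy example with separable score distributions.
eer = compute_eer(np.random.randn(100) + 2.0, np.random.randn(100))
```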
Text-based speech editors allow speech to be edited through intuitive cut, copy, and paste operations, speeding up the editing process. However, the major drawback of current systems is that the edited speech often sounds unnatural due to these cut-copy-paste operations. In addition, it is not obvious how to synthesize speech for a new word that does not appear in the transcript. This paper proposes a novel end-to-end text-based speech editing method called the context-aware mask prediction network (CampNet). The model simulates the text-based speech editing process by randomly masking part of the speech and then predicting the masked region from the surrounding speech context. This resolves the unnatural prosody in the edited region and enables synthesis of speech corresponding to words unseen in the transcript. Secondly, to cover the possible operations of text-based speech editing, we design three operations based on CampNet: deletion, insertion, and replacement, which cover the various situations of speech editing. Thirdly, to synthesize speech corresponding to long text in insertion and replacement operations, a word-level autoregressive generation method is proposed. Fourthly, we propose a speaker adaptation method for CampNet that uses only one sentence, and explore CampNet's few-shot learning ability, which provides a new perspective on speech forgery tasks. Subjective and objective experiments on the VCTK and LibriTTS datasets show that speech editing results based on CampNet are better than those of TTS technology, manual editing, and the VoCo method. We also conduct detailed ablation experiments to explore the effect of the CampNet structure on its performance. Finally, experiments show that speaker adaptation with only one sentence can further improve the naturalness of the edited speech. Examples of generated speech can be found at https://hairuo55.github.io/CampNet.
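The following is a minimal sketch of the random masking step underlying this kind of training, where a contiguous span of acoustic frames is hidden and the model learns to predict it from the surrounding context plus the text; the mask ratio and mask value are assumptions.

```python
# Illustrative span masking for context-aware mask prediction training.
import torch

def mask_random_span(frames: torch.Tensor, mask_ratio: float = 0.15):
    """frames: (T, D) acoustic features. Returns masked copy and span bounds."""
    T = frames.size(0)
    span = max(1, int(T * mask_ratio))
    start = torch.randint(0, T - span + 1, (1,)).item()
    masked = frames.clone()
    masked[start:start + span] = 0.0   # hypothetical mask value
    return masked, (start, start + span)

frames = torch.randn(200, 80)          # e.g., 200 mel-spectrogram frames
masked, (s, e) = mask_random_span(frames)
# A training loss would compare the model's prediction on [s, e) with frames[s:e].
```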
Knowledge graph embedding (KGE) aims to represent entities and relations as low-dimensional vectors for many real-world applications. The representations of entities and relations are learned by contrasting positive and negative triplets; thus, high-quality negative samples are extremely important in KGE. However, present KGE models either rely on simple negative sampling methods, which make it difficult to obtain informative negative triplets, or employ complex adversarial methods, which require more training data and elaborate training strategies. In addition, these methods can only construct negative triplets from existing entities, which limits the potential to explore harder negative triplets. To address these issues, we adopt a mixing operation to generate harder negative samples for knowledge graphs and introduce an inexpensive but effective method called MixKG. Technically, MixKG first applies two criteria to filter hard negative triplets among the sampled negatives: one based on the scoring function and one based on similarity to the correct entity. Then, MixKG synthesizes harder negative samples via convex combinations of the paired selected hard negatives. Experiments on two public datasets and four classical KGE methods show that MixKG is superior to previous negative sampling algorithms.
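A hedged sketch of the two steps, using a TransE-style scorer for concreteness: keep the top-k hardest sampled negative tails by score, then mix random pairs of them with a convex combination. The choice of k and of a uniform Beta distribution for the mixing coefficient are assumptions.

```python
# MixKG-style negative mixing under a TransE-style scoring function.
import torch

def transe_score(h, r, t):
    # Higher = more plausible triplet (negated L2 distance).
    return -torch.norm(h + r - t, p=2, dim=-1)

def mixkg_negatives(h, r, neg_tails, k=8):
    scores = transe_score(h.unsqueeze(0), r.unsqueeze(0), neg_tails)
    hard = neg_tails[scores.topk(k).indices]         # hardest sampled negatives
    perm = torch.randperm(k)                          # random pairing
    lam = torch.distributions.Beta(1.0, 1.0).sample((k, 1))
    return lam * hard + (1.0 - lam) * hard[perm]      # convex combinations

dim = 100
h, r = torch.randn(dim), torch.randn(dim)
neg_tails = torch.randn(64, dim)                      # sampled negative tail embeddings
mixed = mixkg_negatives(h, r, neg_tails)              # (8, 100) harder negatives
```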
End-to-end singing voice synthesis (SVS) is attractive because it avoids pre-aligned data. However, the automatically learned alignment between the singing voice and the lyrics struggles to match the duration information in the musical score, which leads to model instability or even failure to synthesize voice. To learn accurate alignment information automatically, this paper proposes an end-to-end SVS framework named Singing-Tacotron. The main difference between the proposed framework and Tacotron is that the generated speech can be precisely controlled by the duration information in the musical score. Firstly, we propose a global duration control attention mechanism for the SVS model, which controls each phoneme's duration. Secondly, a duration encoder is proposed to learn a set of global transition tokens from the musical score. At each decoding step, these transition tokens help the attention mechanism decide whether to move to the next phoneme or stay at the current one. Thirdly, to further improve the model's stability, a dynamic filter is designed to help the model overcome noise interference and pay more attention to local context information. Subjective and objective evaluations verify the effectiveness of the method. Furthermore, the role of the global transition tokens and the effect of duration control are explored. Examples of experiments can be found at https://hairuo55.github.io/SingingTacotron.
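To illustrate how a transition token can gate attention movement, the sketch below uses a simplified forward-attention-style recursion: at each decoder step a scalar u in [0, 1] decides how much probability mass shifts from the current phoneme to the next. This is an illustration of the mechanism's spirit, not the paper's exact formulation.

```python
# Transition-gated alignment update: mass either stays on the current
# phoneme or moves one step forward, controlled by the scalar u.
import torch

def step_alignment(alpha: torch.Tensor, u: float) -> torch.Tensor:
    """alpha: (N,) attention weights over N phonemes; u: move probability."""
    stay = (1.0 - u) * alpha
    move = u * torch.roll(alpha, shifts=1)   # mass shifted to the next phoneme
    move[0] = 0.0                            # discard wrap-around mass
    new_alpha = stay + move
    return new_alpha / new_alpha.sum()

alpha = torch.zeros(10)
alpha[0] = 1.0                               # start on the first phoneme
for t in range(20):
    u = 0.3                                  # would come from the duration encoder
    alpha = step_alignment(alpha, u)
```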
Code-switching refers to alternating between languages within the communication process. Training end-to-end (E2E) automatic speech recognition (ASR) systems for code-switching is known to be challenging because of the lack of data, compounded by the increased language context confusion arising from the presence of more than one language. In this paper, we propose a language-related attention mechanism to reduce multilingual context confusion in the E2E code-switching ASR model, based on the Equivalence Constraint (EC) theory. This linguistic theory requires that any monolingual fragment occurring in a code-switching sentence must also occur in one of the monolingual sentences, establishing a bridge between monolingual data and code-switching data. By calculating separate attention for each language, our method can efficiently transfer language knowledge from rich monolingual data. We evaluate our method on the ASRU 2019 Mandarin-English code-switching challenge dataset. Compared with the baseline model, the proposed method achieves an 11.37% relative reduction in mix error rate.
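As a hedged sketch of the per-language attention idea: encoder states are attended separately through per-language projections, and the two context vectors are combined. The shapes, the two-language setup, and the sigmoid gating scheme are illustrative assumptions.

```python
# Language-related attention sketch: one attention per language over the
# shared encoder states, combined by a learned per-sample gate.
import torch
import torch.nn.functional as F

def language_related_attention(query, enc, W_zh, W_en, gate):
    """query: (B, D) decoder state; enc: (B, T, D); W_*: (D, D) projections."""
    ctx = []
    for W in (W_zh, W_en):                      # one attention per language
        scores = torch.einsum("bd,btd->bt", query @ W, enc)
        ctx.append(torch.einsum("bt,btd->bd", F.softmax(scores, dim=-1), enc))
    g = torch.sigmoid(gate(query))              # per-sample language weight
    return g * ctx[0] + (1 - g) * ctx[1]

B, T, D = 4, 50, 256
attn_ctx = language_related_attention(
    torch.randn(B, D), torch.randn(B, T, D),
    torch.randn(D, D), torch.randn(D, D), torch.nn.Linear(D, 1))
```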
Knowledge graphs (KGs) have shown great success in recommendation. This success is attributed to the rich attribute information contained in KGs, which improves item and user representations as side information. However, existing knowledge-aware methods leverage attribute information only at a coarse-grained level on both the item and user sides. In this paper, we propose a novel attentive knowledge graph attribute network (AKGAN) to learn item attributes and user interests from the attribute information in KGs. Technically, AKGAN adopts a heterogeneous graph neural network framework with a different design for the first layer and the subsequent layers. With each attribute placed in a corresponding range of element-wise positions, AKGAN employs a novel interest-aware attention network, which relaxes the constraint that attention weights sum to 1, to model the complexity and personalization of user interests towards attributes. Experimental results on three benchmark datasets show the effectiveness and explainability of AKGAN.
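A minimal sketch of the relaxed attention constraint follows: attribute weights come from an element-wise sigmoid rather than a softmax, so they need not sum to 1 and several attributes can receive high weight simultaneously. The module structure and layer sizes are illustrative assumptions.

```python
# Interest-aware attention sketch: sigmoid scores per attribute instead
# of a softmax, so attention weights are not forced to sum to 1.
import torch
import torch.nn as nn

class InterestAwareAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, user: torch.Tensor, attrs: torch.Tensor) -> torch.Tensor:
        """user: (B, D) user embedding; attrs: (B, A, D) attribute embeddings."""
        u = user.unsqueeze(1).expand_as(attrs)
        w = torch.sigmoid(self.score(torch.cat([u, attrs], dim=-1)))  # (B, A, 1)
        return (w * attrs).sum(dim=1)      # weighted attribute aggregation

attn = InterestAwareAttention(dim=64)
out = attn(torch.randn(8, 64), torch.randn(8, 12, 64))   # 12 attributes per item
```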