Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Language and Noise Transfer in Speech Enhancement Generative Adversarial Network

Dec 18, 2017
Santiago Pascual, Maruchan Park, Joan Serrà, Antonio Bonafonte, Kang-Hun Ahn

Speech enhancement deep learning systems usually require large amounts of training data to operate in broad conditions or real applications. This makes the adaptability of those systems into new, low resource environments an important topic. In this work, we present the results of adapting a speech enhancement generative adversarial network by finetuning the generator with small amounts of data. We investigate the minimum requirements to obtain a stable behavior in terms of several objective metrics in two very different languages: Catalan and Korean. We also study the variability of test performance to unseen noise as a function of the amount of different types of noise available for training. Results show that adapting a pre-trained English model with 10 min of data already achieves a comparable performance to having two orders of magnitude more data. They also demonstrate the relative stability in test performance with respect to the number of training noise types.

  Access Paper or Ask Questions

Learning Nigerian accent embeddings from speech: preliminary results based on SautiDB-Naija corpus

Dec 12, 2021
Tejumade Afonja, Oladimeji Mudele, Iroro Orife, Kenechi Dukor, Lawrence Francis, Duru Goodness, Oluwafemi Azeez, Ademola Malomo, Clinton Mbataku

This paper describes foundational efforts with SautiDB-Naija, a novel corpus of non-native (L2) Nigerian English speech. We describe how the corpus was created and curated as well as preliminary experiments with accent classification and learning Nigerian accent embeddings. The initial version of the corpus includes over 900 recordings from L2 English speakers of Nigerian languages, such as Yoruba, Igbo, Edo, Efik-Ibibio, and Igala. We further demonstrate how fine-tuning on a pre-trained model like wav2vec can yield representations suitable for related speech tasks such as accent classification. SautiDB-Naija has been published to Zenodo for general use under a flexible Creative Commons License.

  Access Paper or Ask Questions

Improving Emotional Speech Synthesis by Using SUS-Constrained VAE and Text Encoder Aggregation

Oct 19, 2021
Fengyu Yang, Jian Luan, Yujun Wang

Learning emotion embedding from reference audio is a straightforward approach for multi-emotion speech synthesis in encoder-decoder systems. But how to get better emotion embedding and how to inject it into TTS acoustic model more effectively are still under investigation. In this paper, we propose an innovative constraint to help VAE extract emotion embedding with better cluster cohesion. Besides, the obtained emotion embedding is used as query to aggregate latent representations of all encoder layers via attention. Moreover, the queries from encoder layers themselves are also helpful. Experiments prove the proposed methods can enhance the encoding of comprehensive syntactic and semantic information and produce more expressive emotional speech.

* submitted to ICASSP2022 

  Access Paper or Ask Questions

Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based on BAVED Dataset

Oct 09, 2021
Omar Mohamed, Salah A. Aly

Recently, there have been tremendous research outcomes in the fields of speech recognition and natural language processing. This is due to the well-developed multi-layers deep learning paradigms such as wav2vec2.0, Wav2vecU, WavBERT, and HuBERT that provide better representation learning and high information capturing. Such paradigms run on hundreds of unlabeled data, then fine-tuned on a small dataset for specific tasks. This paper introduces a deep learning constructed emotional recognition model for Arabic speech dialogues. The developed model employs the state of the art audio representations include wav2vec2.0 and HuBERT. The experiment and performance results of our model overcome the previous known outcomes.

* 6 pages, 6 figures 

  Access Paper or Ask Questions

Automatic Prosody Prediction for Chinese Speech Synthesis using BLSTM-RNN and Embedding Features

Nov 02, 2015
Chuang Ding, Lei Xie, Jie Yan, Weini Zhang, Yang Liu

Prosody affects the naturalness and intelligibility of speech. However, automatic prosody prediction from text for Chinese speech synthesis is still a great challenge and the traditional conditional random fields (CRF) based method always heavily relies on feature engineering. In this paper, we propose to use neural networks to predict prosodic boundary labels directly from Chinese characters without any feature engineering. Experimental results show that stacking feed-forward and bidirectional long short-term memory (BLSTM) recurrent network layers achieves superior performance over the CRF-based method. The embedding features learned from raw text further enhance the performance.

* 5 pages, 4 figures, ASRU 2015 

  Access Paper or Ask Questions

Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network

Oct 21, 2015
Peilu Wang, Yao Qian, Frank K. Soong, Lei He, Hai Zhao

Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for tagging sequential data, e.g. speech utterances or handwritten documents. While word embedding has been demoed as a powerful representation for characterizing the statistical properties of natural language. In this study, we propose to use BLSTM-RNN with word embedding for part-of-speech (POS) tagging task. When tested on Penn Treebank WSJ test set, a state-of-the-art performance of 97.40 tagging accuracy is achieved. Without using morphological features, this approach can also achieve a good performance comparable with the Stanford POS tagger.

* rejected by ACL 2015 short, score: 4,3,2 (full is 5) 

  Access Paper or Ask Questions

Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions

Jun 30, 2019
Tejas Srinivasan, Ramon Sanabria, Florian Metze

Multimodal learning allows us to leverage information from multiple sources (visual, acoustic and text), similar to our experience of the real world. However, it is currently unclear to what extent auxiliary modalities improve performance over unimodal models, and under what circumstances the auxiliary modalities are useful. We examine the utility of the auxiliary visual context in Multimodal Automatic Speech Recognition in adversarial settings, where we deprive the models from partial audio signal during inference time. Our experiments show that while MMASR models show significant gains over traditional speech-to-text architectures (upto 4.2% WER improvements), they do not incorporate visual information when the audio signal has been corrupted. This shows that current methods of integrating the visual modality do not improve model robustness to noise, and we need better visually grounded adaptation techniques.

  Access Paper or Ask Questions

Code Switched and Code Mixed Speech Recognition for Indic languages

Mar 30, 2022
Harveen Singh Chadha, Priyanshi Shah, Ankur Dhuriya, Neeraj Chhimwal, Anirudh Gupta, Vivek Raghavan

Training multilingual automatic speech recognition (ASR) systems is challenging because acoustic and lexical information is typically language specific. Training multilingual system for Indic languages is even more tougher due to lack of open source datasets and results on different approaches. We compare the performance of end to end multilingual speech recognition system to the performance of monolingual models conditioned on language identification (LID). The decoding information from a multilingual model is used for language identification and then combined with monolingual models to get an improvement of 50% WER across languages. We also propose a similar technique to solve the Code Switched problem and achieve a WER of 21.77 and 28.27 over Hindi-English and Bengali-English respectively. Our work talks on how transformer based ASR especially wav2vec 2.0 can be applied in developing multilingual ASR and code switched ASR for Indic languages.

* This paper for submitted to Interspeech 2022 

  Access Paper or Ask Questions

FaceFormer: Speech-Driven 3D Facial Animation with Transformers

Dec 10, 2021
Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura

Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data. Prior works typically focus on learning phoneme-level features of short audio windows with limited context, occasionally resulting in inaccurate lip movements. To tackle this limitation, we propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes. To cope with the data scarcity issue, we integrate the self-supervised pre-trained speech representations. Also, we devise two biased attention mechanisms well suited to this specific task, including the biased cross-modal multi-head (MH) attention and the biased causal MH self-attention with a periodic positional encoding strategy. The former effectively aligns the audio-motion modalities, whereas the latter offers abilities to generalize to longer audio sequences. Extensive experiments and a perceptual user study show that our approach outperforms the existing state-of-the-arts. The code will be made available.

  Access Paper or Ask Questions

Fast Classification Learning with Neural Networks and Conceptors for Speech Recognition and Car Driving Maneuvers

Feb 10, 2021
Stefanie Krause, Oliver Otto, Frieder Stolzenburg

Recurrent neural networks are a powerful means in diverse applications. We show that, together with so-called conceptors, they also allow fast learning, in contrast to other deep learning methods. In addition, a relatively small number of examples suffices to train neural networks with high accuracy. We demonstrate this with two applications, namely speech recognition and detecting car driving maneuvers. We improve the state-of-the art by application-specific preparation techniques: For speech recognition, we use mel frequency cepstral coefficients leading to a compact representation of the frequency spectra, and detecting car driving maneuvers can be done without the commonly used polynomial interpolation, as our evaluation suggests.

* 17 pages, 6 figures, 6 tables 

  Access Paper or Ask Questions