Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit

Aug 13, 2020
Zhen Zeng, Jianzong Wang, Ning Cheng, Jing Xiao

Recent neural speech synthesis systems have gradually focused on the control of prosody to improve the quality of synthesized speech, but they rarely consider the variability of prosody and the correlation between prosody and semantics together. In this paper, a prosody learning mechanism is proposed to model the prosody of speech based on TTS system, where the prosody information of speech is extracted from the melspectrum by a prosody learner and combined with the phoneme sequence to reconstruct the mel-spectrum. Meanwhile, the sematic features of text from the pre-trained language model is introduced to improve the prosody prediction results. In addition, a novel self-attention structure, named as local attention, is proposed to lift this restriction of input text length, where the relative position information of the sequence is modeled by the relative position matrices so that the position encodings is no longer needed. Experiments on English and Mandarin show that speech with more satisfactory prosody has obtained in our model. Especially in Mandarin synthesis, our proposed model outperforms baseline model with a MOS gap of 0.08, and the overall naturalness of the synthesized speech has been significantly improved.

* will be presented in INTERSPEECH 2020 

  Access Paper or Ask Questions

Comparative Study of Speech Analysis Methods to Predict Parkinson's Disease

Nov 15, 2021
Adedolapo Aishat Toye, Suryaprakash Kompalli

One of the symptoms observed in the early stages of Parkinson's Disease (PD) is speech impairment. Speech disorders can be used to detect this disease before it degenerates. This work analyzes speech features and machine learning approaches to predict PD. Acoustic features such as shimmer and jitter variants, and Mel Frequency Cepstral Coefficients (MFCC) are extracted from speech signals. We use two datasets in this work: the MDVR-KCL and the Italian Parkinson's Voice and Speech database. To separate PD and non-PD speech signals, seven classification models were implemented: K-Nearest Neighbor, Decision Trees, Support Vector Machines, Naive Bayes, Logistic Regression, Gradient Boosting, Random Forests. Three feature sets were used for each of the models: (a) Acoustic features only, (b) All the acoustic features and MFCC, (c) Selected subset of features from acoustic features and MFCC. Using all the acoustic features and MFCC, together with SVM produced the highest performance with an accuracy of 98% and F1-Score of 99%. When compared with prior art, this shows a better performance. Our code and related documentation is available in a public domain repository.

* Machine Learning for Health (ML4H) - Extended Abstract 

  Access Paper or Ask Questions

A multispeaker dataset of raw and reconstructed speech production real-time MRI video and 3D volumetric images

Feb 16, 2021
Yongwan Lim, Asterios Toutios, Yannick Bliesener, Ye Tian, Sajan Goud Lingala, Colin Vaz, Tanner Sorensen, Miran Oh, Sarah Harper, Weiyi Chen, Yoonjeong Lee, Johannes Töger, Mairym Lloréns Montesserin, Caitlin Smith, Bianca Godinez, Louis Goldstein, Dani Byrd, Krishna S. Nayak, Shrikanth S. Narayanan

Real-time magnetic resonance imaging (RT-MRI) of human speech production is enabling significant advances in speech science, linguistics, bio-inspired speech technology development, and clinical applications. Easy access to RT-MRI is however limited, and comprehensive datasets with broad access are needed to catalyze research across numerous domains. The imaging of the rapidly moving articulators and dynamic airway shaping during speech demands high spatio-temporal resolution and robust reconstruction methods. Further, while reconstructed images have been published, to-date there is no open dataset providing raw multi-coil RT-MRI data from an optimized speech production experimental setup. Such datasets could enable new and improved methods for dynamic image reconstruction, artifact correction, feature extraction, and direct extraction of linguistically-relevant biomarkers. The present dataset offers a unique corpus of 2D sagittal-view RT-MRI videos along with synchronized audio for 75 subjects performing linguistically motivated speech tasks, alongside the corresponding first-ever public domain raw RT-MRI data. The dataset also includes 3D volumetric vocal tract MRI during sustained speech sounds and high-resolution static anatomical T2-weighted upper airway MRI for each subject.

* 27 pages, 6 figures, 5 tables, submitted to Nature Scientific Data 

  Access Paper or Ask Questions

Two-Staged Acoustic Modeling Adaption for Robust Speech Recognition by the Example of German Oral History Interviews

Aug 19, 2019
Michael Gref, Christoph Schmidt, Sven Behnke, Joachim Köhler

In automatic speech recognition, often little training data is available for specific challenging tasks, but training of state-of-the-art automatic speech recognition systems requires large amounts of annotated speech. To address this issue, we propose a two-staged approach to acoustic modeling that combines noise and reverberation data augmentation with transfer learning to robustly address challenges such as difficult acoustic recording conditions, spontaneous speech, and speech of elderly people. We evaluate our approach using the example of German oral history interviews, where a relative average reduction of the word error rate by 19.3% is achieved.

* IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, July 2019 
* Accepted for IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, July 2019 

  Access Paper or Ask Questions

Towards Interpretability of Speech Pause in Dementia Detection using Adversarial Learning

Nov 14, 2021
Youxiang Zhu, Bang Tran, Xiaohui Liang, John A. Batsis, Robert M. Roth

Speech pause is an effective biomarker in dementia detection. Recent deep learning models have exploited speech pauses to achieve highly accurate dementia detection, but have not exploited the interpretability of speech pauses, i.e., what and how positions and lengths of speech pauses affect the result of dementia detection. In this paper, we will study the positions and lengths of dementia-sensitive pauses using adversarial learning approaches. Specifically, we first utilize an adversarial attack approach by adding the perturbation to the speech pauses of the testing samples, aiming to reduce the confidence levels of the detection model. Then, we apply an adversarial training approach to evaluate the impact of the perturbation in training samples on the detection model. We examine the interpretability from the perspectives of model accuracy, pause context, and pause length. We found that some pauses are more sensitive to dementia than other pauses from the model's perspective, e.g., speech pauses near to the verb "is". Increasing lengths of sensitive pauses or adding sensitive pauses leads the model inference to Alzheimer's Disease, while decreasing the lengths of sensitive pauses or deleting sensitive pauses leads to non-AD.

  Access Paper or Ask Questions

A Training Framework for Stereo-Aware Speech Enhancement using Deep Neural Networks

Dec 09, 2021
Bahareh Tolooshams, Kazuhito Koishida

Deep learning-based speech enhancement has shown unprecedented performance in recent years. The most popular mono speech enhancement frameworks are end-to-end networks mapping the noisy mixture into an estimate of the clean speech. With growing computational power and availability of multichannel microphone recordings, prior works have aimed to incorporate spatial statistics along with spectral information to boost up performance. Despite an improvement in enhancement performance of mono output, the spatial image preservation and subjective evaluations have not gained much attention in the literature. This paper proposes a novel stereo-aware framework for speech enhancement, i.e., a training loss for deep learning-based speech enhancement to preserve the spatial image while enhancing the stereo mixture. The proposed framework is model independent, hence it can be applied to any deep learning based architecture. We provide an extensive objective and subjective evaluation of the trained models through a listening test. We show that by regularizing for an image preservation loss, the overall performance is improved, and the stereo aspect of the speech is better preserved.

* Submitted to ICASSP 2022 

  Access Paper or Ask Questions

F0 Modeling In Hmm-Based Speech Synthesis System Using Deep Belief Network

Feb 18, 2015
Sankar Mukherjee, Shyamal Kumar Das Mandal

In recent years multilayer perceptrons (MLPs) with many hid- den layers Deep Neural Network (DNN) has performed sur- prisingly well in many speech tasks, i.e. speech recognition, speaker verification, speech synthesis etc. Although in the context of F0 modeling these techniques has not been ex- ploited properly. In this paper, Deep Belief Network (DBN), a class of DNN family has been employed and applied to model the F0 contour of synthesized speech which was generated by HMM-based speech synthesis system. The experiment was done on Bengali language. Several DBN-DNN architectures ranging from four to seven hidden layers and up to 200 hid- den units per hidden layer was presented and evaluated. The results were compared against clustering tree techniques pop- ularly found in statistical parametric speech synthesis. We show that from textual inputs DBN-DNN learns a high level structure which in turn improves F0 contour in terms of ob- jective and subjective tests.

* OCOCOSDA 2014 

  Access Paper or Ask Questions

WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement

Apr 12, 2020
Tsun-An Hsieh, Hsin-Min Wang, Xugang Lu, Yu Tsao

Due to the simple design pipeline, end-to-end (E2E) neural models for speech enhancement (SE) have attracted great interest. In order to improve the performance of the E2E model, the locality and temporal sequential properties of speech should be efficiently taken into account when modelling. However, in most current E2E models for SE, these properties are either not fully considered or are too complex to be realized. In this paper, we propose an efficient E2E SE model, termed WaveCRN. In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU). Unlike a conventional temporal sequential model that uses a long short-term memory (LSTM) network, which is difficult to parallelize, SRU can be efficiently parallelized in calculation with even fewer model parameters. In addition, in order to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the feature maps in the hidden layers; this is different from the approach that applies the estimated ratio mask on the noisy spectral features, which is commonly used in speech separation methods. Experimental results on speech denoising and compressed speech restoration tasks confirm that with the lightweight architecture of SRU and the feature-mapping-based RFM, WaveCRN performs comparably with other state-of-the-art approaches with notably reduced model complexity and inference time.

  Access Paper or Ask Questions

Visualization and Interpretation of Latent Spaces for Controlling Expressive Speech Synthesis through Audio Analysis

Mar 27, 2019
Noé Tits, Fengna Wang, Kevin El Haddad, Vincent Pagel, Thierry Dutoit

The field of Text-to-Speech has experienced huge improvements last years benefiting from deep learning techniques. Producing realistic speech becomes possible now. As a consequence, the research on the control of the expressiveness, allowing to generate speech in different styles or manners, has attracted increasing attention lately. Systems able to control style have been developed and show impressive results. However the control parameters often consist of latent variables and remain complex to interpret. In this paper, we analyze and compare different latent spaces and obtain an interpretation of their influence on expressive speech. This will enable the possibility to build controllable speech synthesis systems with an understandable behaviour.

  Access Paper or Ask Questions

Character-Level Incremental Speech Recognition with Recurrent Neural Networks

Jan 28, 2016
Kyuyeon Hwang, Wonyong Sung

In real-time speech recognition applications, the latency is an important issue. We have developed a character-level incremental speech recognition (ISR) system that responds quickly even during the speech, where the hypotheses are gradually improved while the speaking proceeds. The algorithm employs a speech-to-character unidirectional recurrent neural network (RNN), which is end-to-end trained with connectionist temporal classification (CTC), and an RNN-based character-level language model (LM). The output values of the CTC-trained RNN are character-level probabilities, which are processed by beam search decoding. The RNN LM augments the decoding by providing long-term dependency information. We propose tree-based online beam search with additional depth-pruning, which enables the system to process infinitely long input speech with low latency. This system not only responds quickly on speech but also can dictate out-of-vocabulary (OOV) words according to pronunciation. The proposed model achieves the word error rate (WER) of 8.90% on the Wall Street Journal (WSJ) Nov'92 20K evaluation set when trained on the WSJ SI-284 training set.

* To appear in ICASSP 2016 

  Access Paper or Ask Questions