Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

"speech": models, code, and papers

Deep Learning Based Assessment of Synthetic Speech Naturalness

Apr 23, 2021
Gabriel Mittag, Sebastian Möller

Figure 1 for Deep Learning Based Assessment of Synthetic Speech Naturalness

Figure 2 for Deep Learning Based Assessment of Synthetic Speech Naturalness

Figure 3 for Deep Learning Based Assessment of Synthetic Speech Naturalness

Figure 4 for Deep Learning Based Assessment of Synthetic Speech Naturalness

In this paper, we present a new objective prediction model for synthetic speech naturalness. It can be used to evaluate Text-To-Speech or Voice Conversion systems and works language independently. The model is trained end-to-end and based on a CNN-LSTM network that previously showed to give good results for speech quality estimation. We trained and tested the model on 16 different datasets, such as from the Blizzard Challenge and the Voice Conversion Challenge. Further, we show that the reliability of deep learning-based naturalness prediction can be improved by transfer learning from speech quality prediction models that are trained on objective POLQA scores. The proposed model is made publicly available and can, for example, be used to evaluate different TTS system configurations.

* Late upload, presented at Interspeech 2020

Via

Access Paper or Ask Questions

Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings

Jun 08, 2021
Marcely Zanon Boito, Bolaji Yusuf, Lucas Ondel, Aline Villavicencio, Laurent Besacier

Figure 1 for Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings

Figure 2 for Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings

Figure 3 for Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings

Figure 4 for Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings

When documenting oral-languages, Unsupervised Word Segmentation (UWS) from speech is a useful, yet challenging, task. It can be performed from phonetic transcriptions, or in the absence of these, from the output of unsupervised speech discretization models. These discretization models are trained using raw speech only, producing discrete speech units which can be applied for downstream (text-based) tasks. In this paper we compare five of these models: three Bayesian and two neural approaches, with regards to the exploitability of the produced units for UWS. Two UWS models are experimented with and we report results for Finnish, Hungarian, Mboshi, Romanian and Russian in a low-resource setting (using only 5k sentences). Our results suggest that neural models for speech discretization are difficult to exploit in our setting, and that it might be necessary to adapt them to limit sequence length. We obtain our best UWS results by using the SHMM and H-SHMM Bayesian models, which produce high quality, yet compressed, discrete representations of the input speech signal.

Via

Access Paper or Ask Questions

GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis

Jun 29, 2021
Jinhyeok Yang, Jae-Sung Bae, Taejun Bak, Youngik Kim, Hoon-Young Cho

Figure 1 for GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis

Figure 2 for GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis

Figure 3 for GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis

Figure 4 for GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis

Recent advances in neural multi-speaker text-to-speech (TTS) models have enabled the generation of reasonably good speech quality with a single model and made it possible to synthesize the speech of a speaker with limited training data. Fine-tuning to the target speaker data with the multi-speaker model can achieve better quality, however, there still exists a gap compared to the real speech sample and the model depends on the speaker. In this work, we propose GANSpeech, which is a high-fidelity multi-speaker TTS model that adopts the adversarial training method to a non-autoregressive multi-speaker TTS model. In addition, we propose simple but efficient automatic scaling methods for feature matching loss used in adversarial training. In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models, and showed a better MOS score than the speaker-specific fine-tuned FastSpeech2.

* Accepted to INTERSPEECH 2021

Via

Access Paper or Ask Questions

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Apr 02, 2021
Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux

Figure 1 for Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Figure 2 for Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Figure 3 for Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Figure 4 for Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods. Audio samples can be found under https://resynthesis-ssl.github.io/.

Via

Access Paper or Ask Questions

DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio based on Deep Filtering

Oct 11, 2021
Hendrik Schröter, Alberto N. Escalante-B., Tobias Rosenkranz, Andreas Maier

Figure 1 for DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio based on Deep Filtering

Figure 2 for DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio based on Deep Filtering

Figure 3 for DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio based on Deep Filtering

Figure 4 for DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio based on Deep Filtering

Complex-valued processing has brought deep learning-based speech enhancement and signal extraction to a new level. Typically, the process is based on a time-frequency (TF) mask which is applied to a noisy spectrogram, while complex masks (CM) are usually preferred over real-valued masks due to their ability to modify the phase. Recent work proposed to use a complex filter instead of a point-wise multiplication with a mask. This allows to incorporate information from previous and future time steps exploiting local correlations within each frequency band. In this work, we propose DeepFilterNet, a two stage speech enhancement framework utilizing deep filtering. First, we enhance the spectral envelope using ERB-scaled gains modeling the human frequency perception. The second stage employs deep filtering to enhance the periodic components of speech. Additionally to taking advantage of perceptual properties of speech, we enforce network sparsity via separable convolutions and extensive grouping in linear and recurrent layers to design a low complexity architecture. We further show that our two stage deep filtering approach outperforms complex masks over a variety of frequency resolutions and latencies and demonstrate convincing performance compared to other state-of-the-art models.

Via

Access Paper or Ask Questions

SEMOUR: A Scripted Emotional Speech Repository for Urdu

May 19, 2021
Nimra Zaheer, Obaid Ullah Ahmad, Ammar Ahmed, Muhammad Shehryar Khan, Mudassir Shabbir

Figure 1 for SEMOUR: A Scripted Emotional Speech Repository for Urdu

Figure 2 for SEMOUR: A Scripted Emotional Speech Repository for Urdu

Figure 3 for SEMOUR: A Scripted Emotional Speech Repository for Urdu

Figure 4 for SEMOUR: A Scripted Emotional Speech Repository for Urdu

Designing reliable Speech Emotion Recognition systems is a complex task that inevitably requires sufficient data for training purposes. Such extensive datasets are currently available in only a few languages, including English, German, and Italian. In this paper, we present SEMOUR, the first scripted database of emotion-tagged speech in the Urdu language, to design an Urdu Speech Recognition System. Our gender-balanced dataset contains 15,040 unique instances recorded by eight professional actors eliciting a syntactically complex script. The dataset is phonetically balanced, and reliably exhibits a varied set of emotions as marked by the high agreement scores among human raters in experiments. We also provide various baseline speech emotion prediction scores on the database, which could be used for various applications like personalized robot assistants, diagnosis of psychological disorders, and getting feedback from a low-tech-enabled population, etc. On a random test sample, our model correctly predicts an emotion with a state-of-the-art 92% accuracy.

* accepted in CHI 2021

Via

Access Paper or Ask Questions

Predicting speech intelligibility from EEG using a dilated convolutional network

May 19, 2021
Bernd Accou, Mohammad Jalilpour Monesi, Hugo Van hamme, Tom Francart

Figure 1 for Predicting speech intelligibility from EEG using a dilated convolutional network

Figure 2 for Predicting speech intelligibility from EEG using a dilated convolutional network

Figure 3 for Predicting speech intelligibility from EEG using a dilated convolutional network

Figure 4 for Predicting speech intelligibility from EEG using a dilated convolutional network

Objective: Currently, only behavioral speech understanding tests are available, which require active participation of the person. As this is infeasible for certain populations, an objective measure of speech intelligibility is required. Recently, brain imaging data has been used to establish a relationship between stimulus and brain response. Linear models have been successfully linked to speech intelligibility but require per-subject training. We present a deep-learning-based model incorporating dilated convolutions that can be used to predict speech intelligibility without subject-specific (re)training. Methods: We evaluated the performance of the model as a function of input segment length, EEG frequency band and receptive field size while comparing it to a baseline model. Next, we evaluated performance on held-out data and finetuning. Finally, we established a link between the accuracy of our model and the state-of-the-art behavioral MATRIX test. Results: The model significantly outperformed the baseline for every input segment length (p$\leq10^{-9}$), for all EEG frequency bands except the theta band (p$\leq0.001$) and for receptive field sizes larger than 125 ms (p$\leq0.05$). Additionally, finetuning significantly increased the accuracy (p$\leq0.05$) on a held-out dataset. Finally, a significant correlation (r=0.59, p=0.0154) was found between the speech reception threshold estimated using the behavioral MATRIX test and our objective method. Conclusion: Our proposed dilated convolutional model can be used as a proxy for speech intelligibility. Significance: Our method is the first to predict the speech reception threshold from EEG for unseen subjects, contributing to objective measures of speech intelligibility.

* 10 pages, 11 figures

Via

Access Paper or Ask Questions

BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance

Nov 13, 2022
Haotong Qin, Xudong Ma, Yifu Ding, Xiaoyang Li, Yang Zhang, Zejun Ma, Jiakai Wang, Jie Luo, Xianglong Liu

Figure 1 for BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance

Figure 2 for BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance

Figure 3 for BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance

Figure 4 for BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance

Deep neural networks, such as the Deep-FSMN, have been widely studied for keyword spotting (KWS) applications while suffering expensive computation and storage. Therefore, network compression technologies like binarization are studied to deploy KWS models on edge. In this paper, we present a strong yet efficient binary neural network for KWS, namely BiFSMNv2, pushing it to the real-network accuracy performance. First, we present a Dual-scale Thinnable 1-bit-Architecture to recover the representation capability of the binarized computation units by dual-scale activation binarization and liberate the speedup potential from an overall architecture perspective. Second, we also construct a Frequency Independent Distillation scheme for KWS binarization-aware training, which distills the high and low-frequency components independently to mitigate the information mismatch between full-precision and binarized representations. Moreover, we implement BiFSMNv2 on ARMv8 real-world hardware with a novel Fast Bitwise Computation Kernel, which is proposed to fully utilize registers and increase instruction throughput. Comprehensive experiments show our BiFSMNv2 outperforms existing binary networks for KWS by convincing margins across different datasets and even achieves comparable accuracy with the full-precision networks (e.g., only 1.59% drop on Speech Commands V1-12). We highlight that benefiting from the compact architecture and optimized hardware kernel, BiFSMNv2 can achieve an impressive 25.1x speedup and 20.2x storage-saving on edge hardware.

* arXiv admin note: text overlap with arXiv:2202.06483

Via

Access Paper or Ask Questions

A Review on Part-of-Speech Technologies

Oct 11, 2021
Onyenwe Ikechukwu, Onyedikachukwu Ikechukwu-Onyenwe, Onyedinma Ebele

Figure 1 for A Review on Part-of-Speech Technologies

Figure 2 for A Review on Part-of-Speech Technologies

Figure 3 for A Review on Part-of-Speech Technologies

Developing an automatic part-of-speech (POS) tagging for any new language is considered a necessary step for further computational linguistics methodology beyond tagging, like chunking and parsing, to be fully applied to the language. Many POS disambiguation technologies have been developed for this type of research and there are factors that influence the choice of choosing one. This could be either corpus-based or non-corpus-based. In this paper, we present a review of POS tagging technologies.

* 8 pages, 3 images

Via

Access Paper or Ask Questions

Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

Mar 30, 2022
Vineet Garg, Ognjen Rudovic, Pranay Dighe, Ahmed H. Abdelaziz, Erik Marchi, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

Figure 1 for Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

Figure 2 for Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

Figure 3 for Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

Figure 4 for Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

We address the problem of detecting speech directed to a device that does not contain a specific wake-word. Specifically, we focus on audio coming from a touch-based invocation. Mitigating virtual assistants (VAs) activation due to accidental button presses is critical for user experience. While the majority of approaches to false trigger mitigation (FTM) are designed to detect the presence of a target keyword, inferring user intent in absence of keyword is difficult. This also poses a challenge when creating the training/evaluation data for such systems due to inherent ambiguity in the user's data. To this end, we propose a novel FTM approach that uses weakly-labeled training data obtained with a newly introduced data sampling strategy. While this sampling strategy reduces data annotation efforts, the data labels are noisy as the data are not annotated manually. We use these data to train an acoustics-only model for the FTM task by regularizing its loss function via knowledge distillation from an ASR-based (LatticeRNN) model. This improves the model decisions, resulting in 66% gain in accuracy, as measured by equal-error-rate (EER), over the base acoustics-only model. We also show that the ensemble of the LatticeRNN and acoustic-distilled models brings further accuracy improvement of 20%.

* Submitted to INTERSPEECH 2022

Via

Access Paper or Ask Questions