"speech": models, code, and papers

Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging

Jul 26, 2021
Csaba Zainkó, László Tóth, Amin Honarmandi Shandiz, Gábor Gosztolya, Alexandra Markó, Géza Németh, Tamás Gábor Csapó

For articulatory-to-acoustic mapping, typically only limited parallel training data is available, making it impossible to apply fully end-to-end solutions like Tacotron2. In this paper, we experimented with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve the final synthesis quality of ultrasound-based articulatory-to-acoustic mapping with a limited database. We use a multi-speaker pre-trained Tacotron2 TTS model and a pre-trained WaveGlow neural vocoder. The articulatory-to-acoustic conversion contains three steps: 1) from a sequence of ultrasound tongue image recordings, a 3D convolutional neural network predicts the inputs of the pre-trained Tacotron2 model, 2) the Tacotron2 model converts this intermediate representation to an 80-dimensional mel-spectrogram, and 3) the WaveGlow model is applied for final inference. The generated speech retains the timing of the original articulatory data from the ultrasound recording, but the F0 contour and the spectral information are predicted by the Tacotron2 model. The F0 values are independent of the original ultrasound images, but represent the target speaker, as they are inferred from the pre-trained Tacotron2 model. In our experiments, we demonstrated that the synthesized speech quality is more natural with the proposed solutions than with our earlier model.

* accepted at SSW11. arXiv admin note: text overlap with arXiv:2008.03152 


Automated speech tools for helping communities process restricted-access corpora for language revival efforts

Apr 15, 2022
Nay San, Martijn Bartelds, Tolúlọpẹ́ Ògúnrẹ̀mí, Alison Mount, Ruben Thompson, Michael Higgins, Roy Barker, Jane Simpson, Dan Jurafsky

Many archival recordings of speech from endangered languages remain unannotated and inaccessible to community members and language learning programs. One bottleneck is the time-intensive nature of annotation. An even narrower bottleneck occurs for recordings with access constraints, such as language that must be vetted or filtered by authorised community members before annotation can begin. We propose a privacy-preserving workflow to widen both bottlenecks for recordings where speech in the endangered language is intermixed with a more widely used language such as English for meta-linguistic commentary and questions (e.g. What is the word for 'tree'?). We integrate voice activity detection (VAD), spoken language identification (SLI), and automatic speech recognition (ASR) to transcribe the metalinguistic content, which an authorised person can quickly scan to triage recordings that can be annotated by people with lower levels of access. We report work in progress on processing 136 hours of archival audio containing a mix of English and Muruwari. Our collaborative work with the Muruwari custodian of the archival materials shows that this workflow reduces metalanguage transcription time by 20% even given only minimal amounts of annotated training data: 10 utterances per language for SLI and 39 minutes of English for ASR.

* Accepted at ComputEL-5 
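
The VAD, SLI, and ASR chain described in this abstract can be sketched as a small triage function. This is only an illustrative outline, not the authors' implementation: the callables `detect_speech`, `identify_language`, and `transcribe`, the `TriageLine` type, and the `"eng"` label are all hypothetical names introduced here, standing in for whatever models a project actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# A speech region as (start_seconds, end_seconds). Hypothetical convention.
Segment = Tuple[float, float]


@dataclass
class TriageLine:
    """One transcribed metalanguage region, shown to the authorised reviewer."""
    start: float
    end: float
    text: str


def transcribe_metalanguage(
    audio: object,
    detect_speech: Callable[[object], Iterable[Segment]],      # VAD (assumed)
    identify_language: Callable[[object, float, float], str],  # SLI (assumed)
    transcribe: Callable[[object, float, float], str],         # ASR (assumed)
    metalanguage: str = "eng",
) -> List[TriageLine]:
    """Transcribe only the regions identified as the metalanguage.

    Regions in any other language are skipped, so their content is never
    passed to the ASR step or shown in the triage transcript.
    """
    lines: List[TriageLine] = []
    for start, end in detect_speech(audio):
        if identify_language(audio, start, end) != metalanguage:
            continue  # leave non-metalanguage speech untranscribed
        lines.append(TriageLine(start, end, transcribe(audio, start, end)))
    return lines
```

Passing the three stages in as callables keeps the sketch independent of any particular toolkit; the reviewer would scan the returned lines to decide which recordings can be released for annotation.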


Input Length Matters: An Empirical Study Of RNN-T And MWER Training For Long-form Telephony Speech Recognition

Oct 08, 2021
Zhiyun Lu, Yanwei Pan, Thibault Doutre, Liangliang Cao, Rohit Prabhavalkar, Chao Zhang, Trevor Strohman

End-to-end models have achieved state-of-the-art results on several automatic speech recognition tasks. However, they perform poorly when evaluated on long-form data, e.g., minutes-long conversational telephony audio. One reason the model fails on long-form speech is that it has only seen short utterances during training. This paper presents an empirical study on the effect of training utterance length on the word error rate (WER) for the RNN-transducer (RNN-T) model. We compare two widely used training objectives, log loss (or RNN-T loss) and minimum word error rate (MWER) loss. We conduct experiments on telephony datasets in four languages. Our experiments show that for both losses, the WER on long-form speech decreases substantially as the training utterance length increases. The average relative WER gain is 15.7% for log loss and 8.8% for MWER loss. When training on short utterances, MWER loss leads to a lower WER than the log loss. This difference between the two losses diminishes as the input length increases.
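
For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how WER and a relative WER gain are commonly computed. It uses the standard word-level edit distance and assumes whitespace tokenization; the function names are illustrative and the paper's own scoring setup may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words (row-by-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

Under this definition, a relative gain of 15.7% means the new WER is 84.3% of the baseline WER, not that the WER dropped by 15.7 absolute points.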


The "Sound of Silence" in EEG -- Cognitive voice activity detection

Oct 12, 2020
Rini A Sharon, Hema A Murthy

Speech cognition has potential application as a brain-computer interface that can improve the quality of life of people who are otherwise communication-impaired. While speech and resting-state EEG are popularly studied, here we attempt to explore a "non-speech" (NS) state of brain activity corresponding to the silence regions of speech audio. Firstly, speech perception is studied to inspect the existence of such a state, followed by its identification in speech imagination. Analogous to how voice activity detection is employed to enhance the performance of speech recognition, the EEG state activity detection protocol implemented here is applied to boost the confidence of imagined speech EEG decoding. Classification of speech and NS state is done using two datasets collected from laboratory-based and commercial-based devices. The state sequential information thus obtained is further utilized to reduce the search space of imagined EEG unit recognition. Temporal signal structures and topographic maps of NS states are visualized across subjects and sessions. The recognition performance and the visual distinction observed demonstrate the existence of silence signatures in EEG.


Detecting Autism Spectrum Disorders with Machine Learning Models Using Speech Transcripts

Oct 07, 2021
Vikram Ramesh, Rida Assaf

Autism spectrum disorder (ASD) can be defined as a neurodevelopmental disorder that affects how children interact, communicate and socialize with others. This disorder can occur in a broad spectrum of symptoms, with varying effects and severity. While there is no permanent cure for ASD, early detection and proactive treatment can substantially improve the lives of many children. Current methods to accurately diagnose ASD are invasive, time-consuming, and tedious. They can also depend on the subjective perspectives of the clinicians involved, including pediatricians, speech pathologists, psychologists, and psychiatrists. New technologies are rapidly emerging, including machine learning models that use speech and computer vision applied to facial, retinal, and brain MRI images of patients, to detect this disorder accurately and in a timely manner. Our research focuses on computational linguistics and machine learning using speech data from TalkBank, the world's largest spoken language database. We used TalkBank data from children with ASD and with typical development (TD) to develop machine learning models to accurately predict ASD. More than 50 features from two specific TalkBank datasets were used to run our experiments with five different classifiers. Logistic Regression and Random Forest models were found to be the most effective for each of these two main datasets, with an accuracy of 0.75. These experiments confirm that while significant opportunities exist for improving the accuracy, machine learning models can reliably predict ASD status in children for effective diagnosis.


Enhancements in statistical spoken language translation by de-normalization of ASR results

Nov 18, 2015
Agnieszka Wołk, Krzysztof Wołk, Krzysztof Marasek

Spoken language translation (SLT) has become very important in an increasingly globalized world. Machine translation (MT) for automatic speech recognition (ASR) systems is a major challenge of great interest. This research investigates automatic sentence segmentation of speech, which is important for enriching speech recognition output and for aiding downstream language processing. This article focuses on the automatic sentence segmentation of speech and on improving MT results. We explore the problem of identifying sentence boundaries in the transcriptions produced by automatic speech recognition systems in the Polish language. We also experiment with reverse normalization of the recognized speech samples.

* Journal of Computers, vol. 11, 2016, pp. 33-40, ISSN 1796-203X, International Academy Publishing. arXiv admin note: text overlap with arXiv:1510.04500 


Attacks as Defenses: Designing Robust Audio CAPTCHAs Using Attacks on Automatic Speech Recognition Systems

Mar 10, 2022
Hadi Abdullah, Aditya Karlekar, Saurabh Prasad, Muhammad Sajidur Rahman, Logan Blue, Luke A. Bauer, Vincent Bindschaedler, Patrick Traynor

Audio CAPTCHAs are supposed to provide a strong defense for online resources; however, advances in speech-to-text mechanisms have rendered these defenses ineffective. Audio CAPTCHAs cannot simply be abandoned, as they are specifically named by the W3C as important enablers of accessibility. Accordingly, demonstrably more robust audio CAPTCHAs are important to the future of a secure and accessible Web. We look to recent literature on attacks on speech-to-text systems for inspiration for the construction of robust, principle-driven audio defenses. We begin by comparing 20 recent attack papers, classifying and measuring their suitability to serve as the basis of new "robust to transcription" but "easy for humans to understand" CAPTCHAs. After showing that none of these attacks alone is sufficient, we propose a new mechanism that is both comparatively intelligible (evaluated through a user study) and hard to automatically transcribe (i.e., $P({\rm transcription}) = 4 \times 10^{-5}$). Finally, we demonstrate that our audio samples have a high probability of being detected as CAPTCHAs when given to speech-to-text systems ($P({\rm evasion}) = 1.77 \times 10^{-4}$). In so doing, we demonstrate not only a CAPTCHA that is approximately four orders of magnitude more difficult to crack, but also that such systems can be designed based on the insights gained from attack papers using the differences between the ways that humans and computers process audio.


CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition

Jan 11, 2022
Wenliang Dai, Samuel Cahyawijaya, Tiezheng Yu, Elham J. Barezi, Peng Xu, Cheuk Tung Shadow Yiu, Rita Frieske, Holy Lovenia, Genta Indra Winata, Qifeng Chen, Xiaojuan Ma, Bertram E. Shi, Pascale Fung

With the rise of deep learning and intelligent vehicles, the smart assistant has become an essential in-car component to facilitate driving and provide extra functionalities. In-car smart assistants should be able to process general as well as car-related commands and perform corresponding actions, which eases driving and improves safety. However, there is a data scarcity issue for low-resource languages, hindering the development of research and applications. In this paper, we introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR), for in-car command recognition in the Cantonese language with both video and audio data. It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers. Furthermore, we augment our dataset using common in-car background noises to simulate real environments, producing a dataset 10 times larger than the collected one. We provide detailed statistics of both the clean and the augmented versions of our dataset. Moreover, we implement two multimodal baselines to demonstrate the validity of CI-AVSR. Experiment results show that leveraging the visual signal improves the overall performance of the model. Although our best model achieves considerable quality on the clean test set, the speech recognition quality on the noisy data is still inferior, and this remains an extremely challenging task for real in-car speech recognition systems. The dataset and code will be released at

* 6 pages 


Automatic Analysis of the Emotional Content of Speech in Daylong Child-Centered Recordings from a Neonatal Intensive Care Unit

Jun 14, 2021
Einari Vaaras, Sari Ahlqvist-Björkroth, Konstantinos Drossos, Okko Räsänen

Researchers have recently started to study how the emotional speech heard by young infants can affect their developmental outcomes. As a part of this research, hundreds of hours of daylong recordings from preterm infants' audio environments were collected from two hospitals in Finland and Estonia in the context of the so-called APPLE study. In order to analyze the emotional content of speech in such a massive dataset, an automatic speech emotion recognition (SER) system is required. However, there are no emotion labels or existing in-domain SER systems to be used for this purpose. In this paper, we introduce this initially unannotated large-scale real-world audio dataset and describe the development of a functional SER system for the Finnish subset of the data. We explore the effectiveness of alternative state-of-the-art techniques to deploy an SER system to a new domain, comparing cross-corpus generalization, WGAN-based domain adaptation, and active learning in the task. As a result, we show that the best-performing models are able to achieve a classification performance of 73.4% unweighted average recall (UAR) and 73.2% UAR for a binary classification for valence and arousal, respectively. The results also show that active learning achieves the most consistent performance compared to the two alternatives.
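
Unweighted average recall (UAR), the metric reported above, is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The sketch below shows the standard definition; the function name and the toy labels in the note that follows are illustrative and are not taken from the paper.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def unweighted_average_recall(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1

    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)
```

With eight "pos" samples all predicted correctly and two "neg" samples of which one is predicted correctly, plain accuracy is 0.9 but UAR is 0.75, which is why UAR is often preferred for imbalanced emotion data.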
