"speech": models, code, and papers

Robust end-to-end deep audiovisual speech recognition

Nov 21, 2016
Ramon Sanabria, Florian Metze, Fernando De La Torre

Speech is one of the most effective means of communication among humans. Even though audio is the most common way of transmitting speech, very important information can be found in other modalities, such as vision. Vision is particularly useful when the acoustic signal is corrupted. Multi-modal speech recognition, however, has not yet found widespread use, mostly because the temporal alignment and fusion of the different information sources are challenging. This paper presents an end-to-end audiovisual speech recognizer (AVSR), based on recurrent neural networks (RNN) with a connectionist temporal classification (CTC) loss function. CTC creates sparse "peaky" output activations, and we analyze the differences in the alignments of output targets (phonemes or visemes) between audio-only, video-only, and audio-visual feature representations. We present the first such experiments on the large vocabulary IBM ViaVoice database, which outperform previously published approaches on phone accuracy in clean and noisy conditions.
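
The "peaky" behaviour mentioned in the abstract can be illustrated with standard CTC best-path decoding, where most frames carry the blank label and each target appears as an isolated spike. The following is a minimal sketch, not the authors' decoder; the blank symbol `_`, the function name, and the toy frame sequence are assumptions made for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen for this sketch


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse a per-frame best path: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Toy per-frame argmax labels: mostly blank, with short spikes per phoneme.
frames = ["_", "_", "k", "_", "_", "ae", "ae", "_", "t", "_"]
print(ctc_greedy_collapse(frames))  # ['k', 'ae', 't']
```

A blank between two identical labels keeps them distinct, so `["_", "a", "a", "_", "a"]` collapses to `["a", "a"]` rather than `["a"]`.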

Local Feature or Mel Frequency Cepstral Coefficients - Which One is Better for MLN-Based Bangla Speech Recognition?

Oct 05, 2013
Foyzul Hassan, Mohammed Rokibul Alam Kotwal, Md. Mostafizur Rahman, Mohammad Nasiruddin, Md. Abdul Latif, Mohammad Nurul Huda

This paper discusses the advantage of local features (LFs) over mel frequency cepstral coefficients (MFCCs) as input to a multilayer neural network (MLN) for Bangla speech. The LF-based method comprises three stages: (i) LF extraction from input speech, (ii) phoneme probability extraction from the LFs using the MLN, and (iii) a hidden Markov model (HMM) based classifier to obtain more accurate phoneme strings. In experiments on a Bangla speech corpus prepared by us, the LF-based automatic speech recognition (ASR) system provides a higher phoneme correct rate than the MFCC-based system. Moreover, the proposed system requires fewer mixture components in the HMMs.

* 9 pages, Advances in Computing and Communications (ACC) 2011

Learning Robust Latent Representations for Controllable Speech Synthesis

May 10, 2021
Shakti Kumar, Jithin Pradeep, Hussain Zaidi

State-of-the-art Variational Auto-Encoders (VAEs) for learning disentangled latent representations give impressive results in discovering features like pitch, pause duration, and accent in speech data, leading to highly controllable text-to-speech (TTS) synthesis. However, these LSTM-based VAEs fail to learn latent clusters of speaker attributes when trained on either limited or noisy datasets. Further, different latent variables start encoding the same features, limiting the control and expressiveness during speech synthesis. To resolve these issues, we propose RTI-VAE (Reordered Transformer with Information reduction VAE) where we minimize the mutual information between different latent variables and devise a modified Transformer architecture with layer reordering to learn controllable latent representations in speech data. We show that RTI-VAE reduces the cluster overlap of speaker attributes by at least 30% over LSTM-VAE and by at least 7% over vanilla Transformer-VAE.

* Accepted in ACL2021 Findings 

Scaling laws in human speech, decreasing emergence of new words and a generalized model

Jan 07, 2015
Ruokuang Lin, Qianli D. Y. Ma, Chunhua Bian

Human language is a typical complex system, and its organization and evolution are an attractive topic for both physical and cultural researchers. In this paper, we present the first exhaustive analysis of the text organization of human speech. Two important results are that: (i) the construction and organization of spoken language can be characterized by Zipf's law and Heaps' law, as observed in written texts; (ii) the word frequency vs. rank distribution and the growth of distinct words with increasing text length show significant differences between books and speech. In speech, the word frequency distribution is more concentrated on higher-frequency words, and the emergence of new words decreases much more rapidly as the content length grows. Based on these observations, a new generalized model is proposed to explain these complex dynamical behaviors and the differences between speech and books.
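
The two measurements behind these laws can be sketched in a few lines: Zipf's law concerns how word frequency falls off with frequency rank (roughly a power law), and Heaps' law concerns how the number of distinct words grows with text length (roughly a sublinear power law). The sketch below only computes the raw curves on a toy sentence; the function names and the sample text are assumptions for illustration, and it does not reproduce the generalized model proposed in the paper.

```python
from collections import Counter


def rank_frequency(tokens):
    """Word counts sorted from most to least frequent (the Zipf curve)."""
    return sorted(Counter(tokens).values(), reverse=True)


def vocabulary_growth(tokens):
    """Number of distinct words seen after each token (the Heaps curve)."""
    seen, growth = set(), []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Toy transcript, invented for this example.
tokens = "the cat sat on the mat and the dog sat too".split()
print(rank_frequency(tokens))     # [3, 2, 1, 1, 1, 1, 1, 1]
print(vocabulary_growth(tokens))  # [1, 2, 3, 4, 4, 5, 6, 6, 7, 7, 8]
```

On a real corpus, plotting both curves on log-log axes is the usual way to compare speech transcripts against books.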

Improved far-field speech recognition using Joint Variational Autoencoder

Apr 24, 2022
Shashi Kumar, Shakti P. Rath, Abhishek Pandey

Automatic Speech Recognition (ASR) systems suffer considerably when the source speech is corrupted by noise or room impulse responses (RIR). Typically, speech enhancement is applied in both mismatched and matched scenario training and testing. In the matched setting, the acoustic model (AM) is trained on dereverberated far-field features, while in the mismatched setting the AM is fixed. In the recent past, mapping speech features from far-field to close-talk using a denoising autoencoder (DA) has been explored. In this paper, we focus on matched scenario training and show that the proposed joint VAE based mapping achieves a significant improvement over DA. Specifically, we observe an absolute improvement of 2.5% in word error rate (WER) compared to DA based enhancement and 3.96% compared to an AM trained directly on far-field filterbank features.

* 5 pages, 2 figures, 3 tables 

DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding

Oct 13, 2021
Sergey Nikonorov, Berrak Sisman, Mingyang Zhang, Haizhou Li

Conventional vocoders are commonly used as analysis tools to provide interpretable features for downstream tasks such as speech synthesis and voice conversion. They are built under certain assumptions about the signals, following signal processing principles, and are therefore not easily generalizable to different kinds of audio, for example from speech to singing. In this paper, we propose a deep neural analyzer, denoted DeepA: a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech that emulate those defined in conventional vocoders. Therefore, the resulting parameters are more interpretable than other latent neural representations. At the same time, as the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing. The proposed neural analyzer is built on a variational autoencoder (VAE) architecture. We show that DeepA improves F0 estimation over the conventional vocoder (WORLD). To the best of our knowledge, this is the first study dedicated to the development of a neural framework for extracting learnable vocoder-like parameters.

* Accepted to ASRU 2021 

Towards Robust Neural Vocoding for Speech Generation: A Survey

Dec 05, 2019
Po-chun Hsu, Chun-hsuan Wang, Andy T. Liu, Hung-yi Lee

Recently, neural vocoders have been widely used in speech synthesis tasks, including text-to-speech and voice conversion. However, when the data distribution differs between training and inference, neural vocoders trained on real data often degrade in voice quality for unseen scenarios. In this paper, we train three commonly used neural vocoders, WaveNet, WaveRNN, and WaveGlow, alternately on five different datasets. To study the robustness of neural vocoders, we evaluate the models using acoustic features from seen/unseen speakers, seen/unseen languages, a text-to-speech model, and a voice conversion model. We find that WaveNet is more robust than WaveRNN, especially in the face of inconsistency between training and testing data. Through our experiments, we show that WaveNet is more suitable for text-to-speech models, and WaveRNN is more suitable for voice conversion applications. Furthermore, we present subjective human evaluation results of considerable reference value for future studies.

* Submitted to ICASSP 2020 

Low-Resource Speech-to-Text Translation

Jun 18, 2018
Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, Sharon Goldwater

Speech-to-text translation has many potential applications for low-resource languages, but the typical approach of cascading speech recognition with machine translation is often impossible, since the transcripts needed to train a speech recognizer are usually not available for low-resource languages. Recent work has found that neural encoder-decoder models can learn to directly translate foreign speech in high-resource scenarios, without the need for intermediate transcription. We investigate whether this approach also works in settings where both data and computation are limited. To make the approach efficient, we make several architectural changes, including a change from character-level to word-level decoding. We find that this choice yields crucial speed improvements that allow us to train with fewer computational resources, yet still performs well on frequent words. We explore models trained on between 20 and 160 hours of data, and find that although models trained on less data have considerably lower BLEU scores, they can still predict words with relatively high precision and recall: around 50% for a model trained on 50 hours of data, versus around 60% for the full 160-hour model. Thus, they may still be useful for some low-resource scenarios.

* Added references; results remain unchanged. Accepted to Interspeech 2018 

Multi-Window Data Augmentation Approach for Speech Emotion Recognition

Oct 28, 2020
Sarala Padi, Dinesh Manocha, Ram D. Sriram

We present a novel Multi-Window Data Augmentation (MWA-SER) approach for speech emotion recognition. MWA-SER is a unimodal approach that focuses on two key concepts: designing the speech augmentation method to generate additional data samples, and building the deep learning models to recognize the underlying emotion of an audio signal. The multi-window augmentation method extracts more audio features from the speech signal by employing multiple window sizes in the audio feature extraction process. We show that our proposed augmentation method, combined with a deep learning model, improves speech emotion recognition performance. We evaluate the performance of our MWA-SER approach on the IEMOCAP corpus and show that our proposed method achieves state-of-the-art results. Furthermore, the proposed system achieved 70% and 88% accuracy when recognizing emotions on the SAVEE and RAVDESS datasets, respectively.
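
The core idea of using multiple window sizes can be sketched as framing the same signal several times, once per window length, so that each framing yields a different set of frames to extract features from. This is a rough sketch only: the abstract does not state the window sizes, hop length, or features used, so the values `4` and `8`, the 50% hop ratio, and both function names are assumptions, and the feature extraction step itself is omitted.

```python
def frame_signal(samples, window, hop):
    """Split samples into overlapping frames of length `window`, stepping by `hop`."""
    return [samples[i:i + window] for i in range(0, len(samples) - window + 1, hop)]


def multi_window_frames(samples, windows, hop_ratio=0.5):
    """Frame the same signal once per window size; each framing is one view of the data."""
    views = {}
    for window in windows:
        hop = max(1, int(window * hop_ratio))  # assumed hop: half the window
        views[window] = frame_signal(samples, window, hop)
    return views


signal = list(range(16))  # stand-in for audio samples
views = multi_window_frames(signal, windows=[4, 8])
print({window: len(frames) for window, frames in views.items()})  # {4: 7, 8: 3}
```

Each entry in `views` would then go through the same feature extractor, giving several training samples per original utterance.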

Semantic query-by-example speech search using visual grounding

Apr 15, 2019
Herman Kamper, Aristotelis Anastassiou, Karen Livescu

A number of recent studies have started to investigate how speech systems can be trained on untranscribed speech by leveraging accompanying images at training time. Examples of tasks include keyword prediction and within- and across-mode retrieval. Here we consider how such models can be used for query-by-example (QbE) search, the task of retrieving utterances relevant to a given spoken query. We are particularly interested in semantic QbE, where the task is not only to retrieve utterances containing exact instances of the query, but also utterances whose meaning is relevant to the query. We follow a segmental QbE approach where variable-duration speech segments (queries, search utterances) are mapped to fixed-dimensional embedding vectors. We show that a QbE system using an embedding function trained on visually grounded speech data outperforms a purely acoustic QbE system in terms of both exact and semantic retrieval performance.

* Accepted to ICASSP 2019 
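
Once queries and search utterances are mapped to fixed-dimensional embedding vectors, as the abstract describes, retrieval reduces to ranking utterances by their similarity to the query vector. The sketch below assumes cosine similarity as the comparison, which the abstract does not specify; the embedding values, utterance names, and function names are invented for illustration, and the embedding function itself is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_utterances(query_vec, utterance_vecs):
    """Return utterance names ordered from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in utterance_vecs.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical 2-D embeddings; real systems use far higher dimensions.
query = [1.0, 0.0]
utterances = {"utt_a": [0.9, 0.1], "utt_b": [0.0, 1.0], "utt_c": [0.5, 0.5]}
print(rank_utterances(query, utterances))  # ['utt_a', 'utt_c', 'utt_b']
```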