Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Speech enhancement with variational autoencoders and alpha-stable distributions

Feb 08, 2019
Simon Leglaive, Umut Simsekli, Antoine Liutkus, Laurent Girin, Radu Horaud

This paper focuses on single-channel semi-supervised speech enhancement. We learn a speaker-independent deep generative speech model using the framework of variational autoencoders. The noise model remains unsupervised because we do not assume prior knowledge of the noisy recording environment. In this context, our contribution is to propose a noise model based on alpha-stable distributions, instead of the more conventional Gaussian non-negative matrix factorization approach found in previous studies. We develop a Monte Carlo expectation-maximization algorithm for estimating the model parameters at test time. Experimental results show the superiority of the proposed approach both in terms of perceptual quality and intelligibility of the enhanced speech signal.

* IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Brighton, UK, May 2019 
* 5 pages, 3 figures, audio examples and code available online : arXiv admin note: text overlap with arXiv:1811.06713 

  Access Paper or Ask Questions

Focal Loss based Residual Convolutional Neural Network for Speech Emotion Recognition

Jun 11, 2019
Suraj Tripathi, Abhay Kumar, Abhiram Ramesh, Chirag Singh, Promod Yenigalla

This paper proposes a Residual Convolutional Neural Network (ResNet) based on speech features and trained under Focal Loss to recognize emotion in speech. Speech features such as Spectrogram and Mel-frequency Cepstral Coefficients (MFCCs) have shown the ability to characterize emotion better than just plain text. Further Focal Loss, first used in One-Stage Object Detectors, has shown the ability to focus the training process more towards hard-examples and down-weight the loss assigned to well-classified examples, thus preventing the model from being overwhelmed by easily classifiable examples.

* Accepted in CICLing 2019 

  Access Paper or Ask Questions

Speech Recognition Oriented Vowel Classification Using Temporal Radial Basis Functions

Dec 19, 2009
Mustapha Guezouri, Larbi Mesbahi, Abdelkader Benyettou

The recent resurgence of interest in spatio-temporal neural network as speech recognition tool motivates the present investigation. In this paper an approach was developed based on temporal radial basis function "TRBF" looking to many advantages: few parameters, speed convergence and time invariance. This application aims to identify vowels taken from natural speech samples from the Timit corpus of American speech. We report a recognition accuracy of 98.06 percent in training and 90.13 in test on a subset of 6 vowel phonemes, with the possibility to expend the vowel sets in future.

* Journal of Computing, Volume 1, Issue 1, pp 162-167, December 2009 

  Access Paper or Ask Questions

Neural Constituency Parsing of Speech Transcripts

Apr 17, 2019
Paria Jamshid Lou, Yufei Wang, Mark Johnson

This paper studies the performance of a neural self-attentive parser on transcribed speech. Speech presents parsing challenges that do not appear in written text, such as the lack of punctuation and the presence of speech disfluencies (including filled pauses, repetitions, corrections, etc.). Disfluencies are especially problematic for conventional syntactic parsers, which typically fail to find any EDITED disfluency nodes at all. This motivated the development of special disfluency detection systems, and special mechanisms added to parsers specifically to handle disfluencies. However, we show here that neural parsers can find EDITED disfluency nodes, and the best neural parsers find them with an accuracy surpassing that of specialized disfluency detection systems, thus making these specialized mechanisms unnecessary. This paper also investigates a modified loss function that puts more weight on EDITED nodes. It also describes tree-transformations that simplify the disfluency detection task by providing alternative encodings of disfluencies and syntactic information.

  Access Paper or Ask Questions

Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data

May 17, 2018
Chan Woo Lee, Kyu Ye Song, Jihoon Jeong, Woo Yong Choi

Emotion recognition has become a popular topic of interest, especially in the field of human computer interaction. Previous works involve unimodal analysis of emotion, while recent efforts focus on multi-modal emotion recognition from vision and speech. In this paper, we propose a new method of learning about the hidden representations between just speech and text data using convolutional attention networks. Compared to the shallow model which employs simple concatenation of feature vectors, the proposed attention model performs much better in classifying emotion from speech and text data contained in the CMU-MOSEI dataset.

  Access Paper or Ask Questions

FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis

Jun 29, 2021
Taejun Bak, Jae-Sung Bae, Hanbin Bae, Young-Ik Kim, Hoon-Young Cho

Methods for modeling and controlling prosody with acoustic features have been proposed for neural text-to-speech (TTS) models. Prosodic speech can be generated by conditioning acoustic features. However, synthesized speech with a large pitch-shift scale suffers from audio quality degradation, and speaker characteristics deformation. To address this problem, we propose a feed-forward Transformer based TTS model that is designed based on the source-filter theory. This model, called FastPitchFormant, has a unique structure that handles text and acoustic features in parallel. With modeling each feature separately, the tendency that the model learns the relationship between two features can be mitigated.

* Accepted to INTERSPEECH 2021 

  Access Paper or Ask Questions

A Hindi Speech Actuated Computer Interface for Web Search

Nov 12, 2012
Kamlesh Sharma, S. V. A. V. Prasad, T. V. Prasad

Aiming at increasing system simplicity and flexibility, an audio evoked based system was developed by integrating simplified headphone and user-friendly software design. This paper describes a Hindi Speech Actuated Computer Interface for Web search (HSACIWS), which accepts spoken queries in Hindi language and provides the search result on the screen. This system recognizes spoken queries by large vocabulary continuous speech recognition (LVCSR), retrieves relevant document by text retrieval, and provides the search result on the Web by the integration of the Web and the voice systems. The LVCSR in this system showed enough performance levels for speech with acoustic and language models derived from a query corpus with target contents.

* International Journal of Advanced Computer Science and Applications 3(10), 2012, 147-152 
* 7 pages 

  Access Paper or Ask Questions

A Metric to Classify Style of Spoken Speech

Apr 06, 2015
Sunil Kopparapu, Saurabh Bhatnagar, K. Sahana, Sathyanarayana, Akhilesh Srivastava, P. V. S. Rao

The ability to classify spoken speech based on the style of speaking is an important problem. With the advent of BPO's in recent times, specifically those that cater to a population other than the local population, it has become necessary for BPO's to identify people with certain style of speaking (American, British etc). Today BPO's employ accent analysts to identify people having the required style of speaking. This process while involving human bias, it is becoming increasingly infeasible because of the high attrition rate in the BPO industry. In this paper, we propose a new metric, which robustly and accurately helps classify spoken speech based on the style of speaking. The role of the proposed metric is substantiated by using it to classify real speech data collected from over seventy different people working in a BPO. We compare the performance of the metric against human experts who independently carried out the classification process. Experimental results show that the performance of the system using the novel metric performs better than two different human expert.

* 5 pages; OCOCOSDA 2010 

  Access Paper or Ask Questions

Database of Parliamentary Speeches in Ireland, 1919-2013

Aug 15, 2017
Alexander Herzog, Slava J. Mikhaylov

We present a database of parliamentary debates that contains the complete record of parliamentary speeches from D\'ail \'Eireann, the lower house and principal chamber of the Irish parliament, from 1919 to 2013. In addition, the database contains background information on all TDs (Teachta D\'ala, members of parliament), such as their party affiliations, constituencies and office positions. The current version of the database includes close to 4.5 million speeches from 1,178 TDs. The speeches were downloaded from the official parliament website and further processed and parsed with a Python script. Background information on TDs was collected from the member database of the parliament website. Data on cabinet positions (ministers and junior ministers) was collected from the official website of the government. A record linkage algorithm and human coders were used to match TDs and ministers.

* The database is made available on the Harvard Dataverse at 

  Access Paper or Ask Questions

Towards Learning a Universal Non-Semantic Representation of Speech

Mar 02, 2020
Joel Shor, Aren Jansen, Ronnie Maor, Oran Lang, Omry Tuval, Felix de Chaumont Quitry, Marco Tagliasacchi, Ira Shavitt, Dotan Emanuel, Yinnon Haviv

The ultimate goal of transfer learning is to reduce labeled data requirements by exploiting a pre-existing embedding model trained for different datasets or tasks. While significant progress has been made in the visual and language domains, the speech community has yet to identify a strategy with wide-reaching applicability across tasks. This paper describes a representation of speech based on an unsupervised triplet-loss objective, which exceeds state-of-the-art performance on a number of transfer learning tasks drawn from the non-semantic speech domain. The embedding is trained on a publicly available dataset, and it is tested on a variety of low-resource downstream tasks, including personalization tasks and medical domain. The model will be publicly released.

  Access Paper or Ask Questions