"speech": models, code, and papers

Accent Classification with Phonetic Vowel Representation

Feb 24, 2016
Zhenhao Ge, Yingyi Tan, Aravind Ganapathiraju

Previous accent classification research focused mainly on detecting accents with pure acoustic information, without recognizing accented speech. This work combines phonetic knowledge, such as vowels, with acoustic information to build a Gaussian Mixture Model (GMM) classifier with Perceptual Linear Predictive (PLP) features, optimized by Heteroscedastic Linear Discriminant Analysis (HLDA). With about 20 seconds of accented speech as input, this system achieves a classification rate of 51% on a 7-way classification task focusing on the major types of accents in English, which is competitive with the state-of-the-art results in this field.

* Asian Conference on Pattern Recognition (ACPR) 2015 


Topic Stability over Noisy Sources

Aug 05, 2015
Jing Su, Oisín Boydell, Derek Greene, Gerard Lynch

Topic modelling techniques such as LDA have recently been applied to speech transcripts and OCR output. These corpora may contain noisy or erroneous texts which may undermine topic stability. Therefore, it is important to know how well a topic modelling algorithm will perform when applied to noisy data. In this paper we show that different types of textual noise will have diverse effects on the stability of different topic models. From these observations, we propose guidelines for text corpus generation, with a focus on automatic speech transcription. We also suggest topic model selection methods for noisy corpora.


Discourse Processing of Dialogues with Multiple Threads

Apr 27, 1995
Carolyn Penstein Rosé, Barbara Di Eugenio, Lori S. Levin, Carol Van Ess-Dykema

In this paper we will present our ongoing work on a plan-based discourse processor developed in the context of the Enthusiast Spanish to English translation system as part of the JANUS multi-lingual speech-to-speech translation system. We will demonstrate that theories of discourse which postulate a strict tree structure of discourse on either the intentional or attentional level are not totally adequate for handling spontaneous dialogues. We will present our extension to this approach along with its implementation in our plan-based discourse processor. We will demonstrate that the implementation of our approach outperforms an implementation based on the strict tree structure approach.

* Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, MIT, 1995 
* 8 pages, compressed, uuencoded postscript. If you have trouble printing the postscript, please send mail to [email protected] 


Data mining Mandarin tone contour shapes

Jul 02, 2019
Shuo Zhang

In spontaneous speech, Mandarin tones that belong to the same tone category may exhibit many different contour shapes. We explore the use of data mining and NLP techniques for understanding the variability of tones in a large corpus of Mandarin newscast speech. First, we adapt a graph-based approach to characterize the clusters (fuzzy types) of tone contour shapes observed in each tone n-gram category. Second, we show correlations between these realized contour shape types and a bag of automatically extracted linguistic features. We discuss the implications of the current study within the context of phonological and information theory.


Speaker Sincerity Detection based on Covariance Feature Vectors and Ensemble Methods

Apr 26, 2019
Mohammed Senoussaoui, Patrick Cardinal, Najim Dehak, Alessandro Lameiras Koerich

Automatic measuring of speaker sincerity degree is a novel research problem in computational paralinguistics. This paper proposes covariance-based feature vectors to model speech and ensembles of support vector regressors to estimate the degree of sincerity of a speaker. The elements of each covariance vector are pairwise statistics between the short-term feature components. These features are used alone as well as in combination with the ComParE acoustic feature set. The experimental results on the development set of the Sincerity Speech Corpus using a cross-validation procedure have shown an 8.1% relative improvement in the Spearman's correlation coefficient over the baseline system.
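
As a rough illustration of the covariance-based features described above, the sketch below flattens the upper triangle of the sample covariance matrix of short-term feature frames into a single vector. This is only one plausible reading of "pairwise statistics between the short-term feature components"; the function name `covariance_vector` and the toy frames are assumptions for illustration, not the authors' implementation.

```python
from typing import List


def covariance_vector(frames: List[List[float]]) -> List[float]:
    """Flatten the upper triangle of the sample covariance of the frames.

    `frames` holds one short-term feature vector per time step (T x D).
    The result has D * (D + 1) / 2 entries, one per feature pair (i <= j).
    """
    n = len(frames)
    if n < 2:
        raise ValueError("need at least two frames to estimate covariance")
    d = len(frames[0])
    # Per-dimension mean over all frames.
    means = [sum(f[j] for f in frames) / n for j in range(d)]
    vec = []
    for i in range(d):
        for j in range(i, d):
            # Unbiased sample covariance between components i and j.
            cov = sum((f[i] - means[i]) * (f[j] - means[j]) for f in frames) / (n - 1)
            vec.append(cov)
    return vec


# Toy input (hypothetical): three frames of two-dimensional features.
frames = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
features = covariance_vector(frames)
print(features)
```

A fixed-length vector like this could then be fed to a regressor regardless of utterance duration, which is presumably part of the appeal of the representation.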


Deep Residual Learning for Small-Footprint Keyword Spotting

Sep 21, 2018
Raphael Tang, Jimmy Lin

We explore the application of deep residual learning and dilated convolutions to the keyword spotting task, using the recently-released Google Speech Commands Dataset as our benchmark. Our best residual network (ResNet) implementation significantly outperforms Google's previous convolutional neural networks in terms of accuracy. By varying model depth and width, we can achieve compact models that also outperform previous small-footprint variants. To our knowledge, we are the first to examine these approaches for keyword spotting, and our results establish an open-source state-of-the-art reference to support the development of future speech-based interfaces.

* Published in ICASSP 2018 
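
To make the two building blocks named in the abstract concrete, here is a minimal pure-Python sketch of a dilated convolution wrapped in a residual connection. It is a simplification under stated assumptions: it works on 1-D sequences with an odd-length kernel and zero padding, and all function names are hypothetical. It is not the authors' network.

```python
from typing import List


def dilated_conv1d(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """Zero-padded, same-length 1-D cross-correlation with a dilated kernel.

    Assumes an odd kernel length so the padding is symmetric.
    """
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            idx = t + i * dilation - pad  # taps are spaced `dilation` apart
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def residual_block(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """Apply conv + ReLU, then add the input back (the skip connection)."""
    y = [max(0.0, v) for v in dilated_conv1d(x, kernel, dilation)]
    return [a + b for a, b in zip(x, y)]


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Receptive field of stacked dilated convolutions with stride 1."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
conv_out = dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=2)
block_out = residual_block(signal, [0.0, 0.0, 0.0], dilation=1)
rf = receptive_field(3, [1, 2, 4])
print(conv_out, block_out, rf)
```

The `receptive_field` helper shows why dilation suits small-footprint models: growing the dilation widens the context each output sees without adding parameters.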


A Recurrent Latent Variable Model for Sequential Data

Apr 06, 2016
Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, Yoshua Bengio

In this paper, we explore the inclusion of latent random variables into the dynamic hidden state of a recurrent neural network (RNN) by combining elements of the variational autoencoder. We argue that through the use of high-level latent random variables, the variational RNN (VRNN) can model the kind of variability observed in highly structured sequential data such as natural speech. We empirically evaluate the proposed model against related sequential models on four speech datasets and one handwriting dataset. Our results show the important roles that latent random variables can play in the RNN dynamic hidden state.


A Cross-media Retrieval System for Lecture Videos

Sep 13, 2003
Atsushi Fujii, Katunobu Itou, Tomoyosi Akiba, Tetsuya Ishikawa

We propose a cross-media lecture-on-demand system, in which users can selectively view specific segments of lecture videos by submitting text queries. Users can easily formulate queries by using the textbook associated with a target lecture, even if they cannot come up with effective keywords. Our system extracts the audio track from a target lecture video, generates a transcription by large vocabulary continuous speech recognition, and produces a text index. Experimental results showed that by adapting speech recognition to the topic of the lecture, the recognition accuracy increased and the retrieval accuracy was comparable with that obtained by human transcription.

* Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech 2003), pp.1149-1152, Sep. 2003 
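
The indexing and retrieval step described in the abstract can be sketched as a small inverted index over time-stamped transcript segments. This is an illustrative assumption about how such a text index might look. The segment format, the tokenizer, and the overlap-count ranking are all hypothetical, and the transcripts are made up rather than real recognizer output.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, transcript)


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(segments: List[Segment]) -> Dict[str, Set[int]]:
    """Map each token to the ids of the segments whose transcript contains it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for seg_id, (_start, _end, text) in enumerate(segments):
        for token in tokenize(text):
            index[token].add(seg_id)
    return index


def search(index: Dict[str, Set[int]], segments: List[Segment], query: str) -> List[Tuple[float, float]]:
    """Rank segments by the number of distinct query tokens they contain."""
    scores: Dict[int, int] = defaultdict(int)
    for token in set(tokenize(query)):
        for seg_id in index.get(token, ()):
            scores[seg_id] += 1
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return [(segments[s][0], segments[s][1]) for s in ranked]


# Hypothetical transcript segments standing in for recognizer output.
segments = [
    (0.0, 30.0, "today we introduce hidden markov models"),
    (30.0, 60.0, "the viterbi algorithm decodes markov chains"),
    (60.0, 90.0, "questions about the homework"),
]
index = build_index(segments)
hits = search(index, segments, "Viterbi Markov")
print(hits)
```

Returning time spans rather than text is what lets a player jump straight to the matching part of the lecture video.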


Speaker Generation

Nov 07, 2021
Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, David Kao

This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity. Audio samples are available on our demo page.

* 12 pages, 3 figures, 4 tables, appendix with 2 tables 


Attention-gated convolutional neural networks for off-resonance correction of spiral real-time MRI

Feb 14, 2021
Yongwan Lim, Shrikanth S. Narayanan, Krishna S. Nayak

Spiral acquisitions are preferred in real-time MRI because of their efficiency, which has made it possible to capture vocal tract dynamics during natural speech. A fundamental limitation of spirals is blurring and signal loss due to off-resonance, which degrades image quality at air-tissue boundaries. Here, we present a new CNN-based off-resonance correction method that incorporates an attention-gate mechanism. This leverages spatial and channel relationships of filtered outputs and improves the expressiveness of the networks. We demonstrate improved performance with the attention-gate, on 1.5 Tesla spiral speech RT-MRI, compared to existing off-resonance correction methods.

* 28th Int. Soc. Magn. Reson. Med. (ISMRM) Scientific Sessions, 2020, p.1005 
* 8 pages, 4 figures, 1 table 
