Previous accent classification research focused mainly on detecting accents with pure acoustic information without recognizing accented speech. This work combines phonetic knowledge, such as vowels, with acoustic information to build a Gaussian Mixture Model (GMM) classifier with Perceptual Linear Predictive (PLP) features, optimized by Heteroscedastic Linear Discriminant Analysis (HLDA). Given about 20 seconds of accented speech as input, the system achieves a classification rate of 51% on a 7-way classification task covering the major types of English accents, which is competitive with state-of-the-art results in this field.
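As a rough illustration of how a GMM classifier of this kind makes a decision, the sketch below scores an utterance's feature frames under one GMM per accent and picks the accent with the highest total log-likelihood. It is not the system described above: the diagonal covariances, the one-dimensional toy features, the accent labels, and all function names are assumptions made for the example, and the PLP extraction and HLDA steps are omitted.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood of all frames under a GMM.

    gmm is a list of (weight, mean, variance) components (assumed layout).
    """
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(comps)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total

def classify(frames, models):
    """Return the label whose GMM gives the highest log-likelihood."""
    return max(models, key=lambda label: gmm_log_likelihood(frames, models[label]))

# Hypothetical one-dimensional models for two accents.
models = {
    "accent_a": [(0.5, [0.0], [1.0]), (0.5, [1.0], [1.0])],
    "accent_b": [(1.0, [5.0], [1.0])],
}
print(classify([[4.8], [5.1], [5.3]], models))
```

In practice each frame would be a multi-dimensional feature vector and the models would be trained on labelled speech; the decision rule stays the same.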
Topic modelling techniques such as LDA have recently been applied to speech transcripts and OCR output. These corpora may contain noisy or erroneous texts which may undermine topic stability. Therefore, it is important to know how well a topic modelling algorithm will perform when applied to noisy data. In this paper we show that different types of textual noise will have diverse effects on the stability of different topic models. From these observations, we propose guidelines for text corpus generation, with a focus on automatic speech transcription. We also suggest topic model selection methods for noisy corpora.
In this paper we will present our ongoing work on a plan-based discourse processor developed in the context of the Enthusiast Spanish-to-English translation system as part of the JANUS multi-lingual speech-to-speech translation system. We will demonstrate that theories of discourse which postulate a strict tree structure of discourse on either the intentional or attentional level are not fully adequate for handling spontaneous dialogues. We will present our extension to this approach along with its implementation in our plan-based discourse processor. We will demonstrate that the implementation of our approach outperforms an implementation based on the strict tree structure approach.
In spontaneous speech, Mandarin tones that belong to the same tone category may exhibit many different contour shapes. We explore the use of data mining and NLP techniques for understanding the variability of tones in a large corpus of Mandarin newscast speech. First, we adapt a graph-based approach to characterize the clusters (fuzzy types) of tone contour shapes observed in each tone n-gram category. Second, we show correlations between these realized contour shape types and a bag of automatically extracted linguistic features. We discuss the implications of the current study within the context of phonological and information theory.
Automatically measuring a speaker's degree of sincerity is a novel research problem in computational paralinguistics. This paper proposes covariance-based feature vectors to model speech and ensembles of support vector regressors to estimate the degree of sincerity of a speaker. The elements of each covariance vector are pairwise statistics between the short-term feature components. These features are used alone as well as in combination with the ComParE acoustic feature set. Experimental results on the development set of the Sincerity Speech Corpus, obtained with a cross-validation procedure, show an 8.1% relative improvement in Spearman's correlation coefficient over the baseline system.
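One way to read "pairwise statistics between the short-term feature components" is as the entries of a sample covariance matrix computed over an utterance's frames. The sketch below follows that reading and flattens the upper triangle into a fixed-length vector. The choice of statistic, the unbiased normalization, and the function name are assumptions, not details confirmed by the abstract.

```python
def covariance_vector(frames):
    """Flatten the upper triangle of the sample covariance of frame features.

    frames: list of T frames, each a list of D short-term feature values.
    Returns D * (D + 1) / 2 values, independent of the utterance length T.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    vec = []
    for i in range(dim):
        for j in range(i, dim):  # the matrix is symmetric, so keep j >= i only
            cov = sum((f[i] - means[i]) * (f[j] - means[j]) for f in frames) / (n - 1)
            vec.append(cov)
    return vec

# Toy utterance: three frames with two feature components each.
frames = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
print(covariance_vector(frames))  # → [4.0, 8.0, 16.0]
```

Because the vector length depends only on the feature dimension, utterances of different durations map to inputs of the same size, which is convenient for a regressor.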
We explore the application of deep residual learning and dilated convolutions to the keyword spotting task, using the recently-released Google Speech Commands Dataset as our benchmark. Our best residual network (ResNet) implementation significantly outperforms Google's previous convolutional neural networks in terms of accuracy. By varying model depth and width, we can achieve compact models that also outperform previous small-footprint variants. To our knowledge, we are the first to examine these approaches for keyword spotting, and our results establish an open-source state-of-the-art reference to support the development of future speech-based interfaces.
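To make the role of dilation concrete, the sketch below implements a one-dimensional dilated convolution and computes how the receptive field grows as dilated layers are stacked. It is a simplified illustration only: the one-dimensional setting, the kernel values, the dilation schedule, and the function names are assumptions, and the residual connections and actual network architecture are not reproduced.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D dilated convolution (cross-correlation form, no padding)."""
    span = (len(kernel) - 1) * dilation  # distance covered by the kernel taps
    return [
        sum(kernel[k] * signal[t + k * dilation] for k in range(len(kernel)))
        for t in range(len(signal) - span)
    ]

def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 layers with the given dilations."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2))  # → [4, 6, 8]
print(receptive_field(3, [1, 2, 4]))  # → 15
```

With a kernel of size 3, three layers with dilations 1, 2, and 4 cover 15 input positions, whereas three undilated layers would cover only 7; this is why dilation helps keep models small while still seeing a wide context.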
In this paper, we explore the inclusion of latent random variables into the dynamic hidden state of a recurrent neural network (RNN) by combining elements of the variational autoencoder. We argue that through the use of high-level latent random variables, the variational RNN (VRNN) can model the kind of variability observed in highly structured sequential data such as natural speech. We empirically evaluate the proposed model against related sequential models on four speech datasets and one handwriting dataset. Our results show the important roles that latent random variables can play in the RNN dynamic hidden state.
We propose a cross-media lecture-on-demand system, in which users can selectively view specific segments of lecture videos by submitting text queries. Users can easily formulate queries by using the textbook associated with a target lecture, even if they cannot come up with effective keywords. Our system extracts the audio track from a target lecture video, generates a transcription by large vocabulary continuous speech recognition, and produces a text index. Experimental results showed that by adapting speech recognition to the topic of the lecture, the recognition accuracy increased and the retrieval accuracy was comparable with that obtained by human transcription.
This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity. Audio samples are available on our demo page.
Spiral acquisitions are preferred in real-time MRI because of their efficiency, which has made it possible to capture vocal tract dynamics during natural speech. A fundamental limitation of spirals is blurring and signal loss due to off-resonance, which degrades image quality at air-tissue boundaries. Here, we present a new CNN-based off-resonance correction method that incorporates an attention-gate mechanism. This leverages spatial and channel relationships of filtered outputs and improves the expressiveness of the networks. We demonstrate improved performance with the attention-gate, on 1.5 Tesla spiral speech RT-MRI, compared to existing off-resonance correction methods.