Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

On the Use/Misuse of the Term 'Phoneme'

Jul 26, 2019
Roger K. Moore, Lucy Skidmore

The term 'phoneme' lies at the heart of speech science and technology, and yet it is not clear that the research community fully appreciates its meaning and implications. In particular, it is suspected that many researchers use the term in a casual sense to refer to the sounds of speech, rather than as a well defined abstract concept. If true, this means that some sections of the community may be missing an opportunity to understand and exploit the implications of this important psychological phenomenon. Here we review the correct meaning of the term 'phoneme' and report the results of an investigation into its use/misuse in the accepted papers at INTERSPEECH-2018. It is confirmed that a significant proportion of the community (i) may not be aware of the critical difference between `phonetic' and 'phonemic' levels of description, (ii) may not fully understand the significance of 'phonemic contrast', and as a consequence, (iii) consistently misuse the term 'phoneme'. These findings are discussed, and recommendations are made as to how this situation might be mitigated.

* Accepted at INTERSPEECH-2019 

  Access Paper or Ask Questions

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Oct 09, 2018
Andrew Owens, Alexei A. Efros

The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. We use this learned representation for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation, e.g. removing the off-screen translator's voice from a foreign official's speech. Code, models, and video results are available on our webpage:

  Access Paper or Ask Questions

Recent Progresses in Deep Learning based Acoustic Models (Updated)

Apr 27, 2018
Dong Yu, Jinyu Li

In this paper, we summarize recent progresses made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques. We first discuss acoustic models that can effectively exploit variable-length contextual information, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and their various combination with other models. We then describe acoustic models that are optimized end-to-end with emphasis on feature representations learned jointly with rest of the system, the connectionist temporal classification (CTC) criterion, and the attention-based sequence-to-sequence model. We further illustrate robustness issues in speech recognition systems, and discuss acoustic model adaptation, speech enhancement and separation, and robust training strategies. We also cover modeling techniques that lead to more efficient decoding and discuss possible future directions in acoustic model research.

* This is an updated version with latest literature until ICASSP2018 of the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning based Acoustic Models," vol.4, no.3, IEEE/CAA Journal of Automatica Sinica, 2017 

  Access Paper or Ask Questions

Spoken Language Biomarkers for Detecting Cognitive Impairment

Oct 20, 2017
Tuka Alhanai, Rhoda Au, James Glass

In this study we developed an automated system that evaluates speech and language features from audio recordings of neuropsychological examinations of 92 subjects in the Framingham Heart Study. A total of 265 features were used in an elastic-net regularized binomial logistic regression model to classify the presence of cognitive impairment, and to select the most predictive features. We compared performance with a demographic model from 6,258 subjects in the greater study cohort (0.79 AUC), and found that a system that incorporated both audio and text features performed the best (0.92 AUC), with a True Positive Rate of 29% (at 0% False Positive Rate) and a good model fit (Hosmer-Lemeshow test > 0.05). We also found that decreasing pitch and jitter, shorter segments of speech, and responses phrased as questions were positively associated with cognitive impairment.

  Access Paper or Ask Questions

Recent Advances in Convolutional Neural Networks

Oct 19, 2017
Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Li Wang, Gang Wang, Jianfei Cai, Tsuhan Chen

In the last few years, deep learning has led to very good performance on a variety of problems, such as visual recognition, speech recognition and natural language processing. Among different types of deep neural networks, convolutional neural networks have been most extensively studied. Leveraging on the rapid growth in the amount of the annotated data and the great improvements in the strengths of graphics processor units, the research on convolutional neural networks has been emerged swiftly and achieved state-of-the-art results on various tasks. In this paper, we provide a broad survey of the recent advances in convolutional neural networks. We detailize the improvements of CNN on different aspects, including layer design, activation function, loss function, regularization, optimization and fast computation. Besides, we also introduce various applications of convolutional neural networks in computer vision, speech and natural language processing.

* Pattern Recognition, Elsevier 

  Access Paper or Ask Questions

Semantics and Conversations for an Agent Communication Language

Sep 18, 1998
Yannis Labrou, Tim Finin

We address the issues of semantics and conversations for agent communication languages and the Knowledge Query Manipulation Language (KQML) in particular. Based on ideas from speech act theory, we present a semantic description for KQML that associates ``cognitive'' states of the agent with the use of the language's primitives (performatives). We have used this approach to describe the semantics for the whole set of reserved KQML performatives. Building on the semantics, we devise the conversation policies, i.e., a formal description of how KQML performatives may be combined into KQML exchanges (conversations), using a Definite Clause Grammar. Our research offers methods for a speech act theory-based semantic description of a language of communication acts and for the specification of the protocols associated with these acts. Languages of communication acts address the issue of communication among software applications at a level of abstraction that is useful to the emerging software agents paradigm.

* Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97) August, 1997 
* Also in in "Readings in Agents", Michael Huhns and Munindar Singh (eds), Morgan Kaufmann Publishers, Inc 

  Access Paper or Ask Questions

An Initialization Scheme for Meeting Separation with Spatial Mixture Models

Apr 04, 2022
Christoph Boeddeker, Tobias Cord-Landwehr, Thilo von Neumann, Reinhold Haeb-Umbach

Spatial mixture model (SMM) supported acoustic beamforming has been extensively used for the separation of simultaneously active speakers. However, it has hardly been considered for the separation of meeting data, that are characterized by long recordings and only partially overlapping speech. In this contribution, we show that the fact that often only a single speaker is active can be utilized for a clever initialization of an SMM that employs time-varying class priors. In experiments on LibriCSS we show that the proposed initialization scheme achieves a significantly lower Word Error Rate (WER) on a downstream speech recognition task than a random initialization of the class probabilities by drawing from a Dirichlet distribution. With the only requirement that the number of speakers has to be known, we obtain a WER of 5.9 %, which is comparable to the best reported WER on this data set. Furthermore, the estimated speaker activity from the mixture model serves as a diarization based on spatial information.

* Submitted to INTERSPEECH 2022 

  Access Paper or Ask Questions

VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording

Jul 15, 2021
Hirofumi Inaguma, Tatsuya Kawahara

In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective. We propose a block-synchronous beam search decoding to take advantage of efficient batched output-synchronous and low-latency input-synchronous searches. We also propose a VAD-free inference algorithm that leverages CTC probabilities to determine a suitable timing to reset the model states to tackle the vulnerability to long-form data. Experimental evaluations demonstrate that the block-synchronous decoding achieves comparable accuracy to the label-synchronous one. Moreover, the VAD-free inference can recognize long-form speech robustly for up to a few hours.

* Accepted at Interspeech 2021 

  Access Paper or Ask Questions

Relational Data Selection for Data Augmentation of Speaker-dependent Multi-band MelGAN Vocoder

Jun 10, 2021
Yi-Chiao Wu, Cheng-Hung Hu, Hung-Shin Lee, Yu-Huai Peng, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda

Nowadays, neural vocoders can generate very high-fidelity speech when a bunch of training data is available. Although a speaker-dependent (SD) vocoder usually outperforms a speaker-independent (SI) vocoder, it is impractical to collect a large amount of data of a specific target speaker for most real-world applications. To tackle the problem of limited target data, a data augmentation method based on speaker representation and similarity measurement of speaker verification is proposed in this paper. The proposed method selects utterances that have similar speaker identity to the target speaker from an external corpus, and then combines the selected utterances with the limited target data for SD vocoder adaptation. The evaluation results show that, compared with the vocoder adapted using only limited target data, the vocoder adapted using augmented data improves both the quality and similarity of synthesized speech.

* 5 pages, 1 figure, 3 tables, Proc. Interspeech, 2021 

  Access Paper or Ask Questions

NISP: A Multi-lingual Multi-accent Dataset for Speaker Profiling

Jul 12, 2020
Shareef Babu Kalluri, Deepu Vijayasenan, Sriram Ganapathy, Ragesh Rajan M, Prashant Krishnan

Many commercial and forensic applications of speech demand the extraction of information about the speaker characteristics, which falls into the broad category of speaker profiling. The speaker characteristics needed for profiling include physical traits of the speaker like height, age, and gender of the speaker along with the native language of the speaker. Many of the datasets available have only partial information for speaker profiling. In this paper, we attempt to overcome this limitation by developing a new dataset which has speech data from five different Indian languages along with English. The metadata information for speaker profiling applications like linguistic information, regional information, and physical characteristics of a speaker are also collected. We call this dataset as NITK-IISc Multilingual Multi-accent Speaker Profiling (NISP) dataset. The description of the dataset, potential applications, and baseline results for speaker profiling on this dataset are provided in this paper.

* 5pages, Initial version submitted to Interspeech2020 

  Access Paper or Ask Questions