"speech": models, code, and papers

A Novel Decision Tree for Depression Recognition in Speech

Feb 22, 2020
Zhenyu Liu, Dongyu Wang, Lan Zhang, Bin Hu

Depression is a common mental disorder worldwide that causes a range of serious outcomes. The diagnosis of depression relies on patient-reported scales and psychiatrist interviews, which may lead to subjective bias. In recent years, more and more researchers have devoted themselves to depression recognition in speech, which may be an effective and objective indicator. This study proposes a new speech segment fusion method based on a decision tree to improve depression recognition accuracy and validates it on a sample of 52 subjects (23 depressed patients and 29 healthy controls). The recognition accuracies are 75.8% and 68.5% for males and females, respectively, on gender-dependent models. It can be concluded from the data that the proposed decision tree model can improve depression classification performance.

A Comparison of Discrete Latent Variable Models for Speech Representation Learning

Oct 24, 2020
Henry Zhou, Alexei Baevski, Michael Auli

Neural latent variable models enable the discovery of interesting structure in speech audio data. This paper presents a comparison of two different approaches which are broadly based on predicting future time-steps or auto-encoding the input signal. Our study compares the representations learned by vq-vae and vq-wav2vec in terms of sub-word unit discovery and phoneme recognition performance. Results show that future time-step prediction with vq-wav2vec achieves better performance. The best system achieves an error rate of 13.22 on the ZeroSpeech 2019 ABX phoneme discrimination challenge.

* 7 pages, 4 figures

Who Needs Words? Lexicon-Free Speech Recognition

Apr 09, 2019
Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

Lexicon-free speech recognition naturally deals with the problem of out-of-vocabulary (OOV) words. In this paper, we show that character-based language models (LMs) can perform as well as word-based LMs for speech recognition, in terms of word error rate (WER), even without restricting the decoding to a lexicon. We study character-based LMs and show that convolutional LMs can effectively leverage large (character) contexts, which is key for good speech recognition performance downstream. We specifically show that the lexicon-free decoding performance (WER) on utterances with OOV words using character-based LMs is better than lexicon-based decoding with either character-based or word-based LMs.

* 8 pages, 1 figure
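
The abstract above reports results as word error rate (WER). As a minimal sketch of how that metric is usually computed (word-level edit distance divided by the number of reference words), assuming whitespace tokenization and a hypothetical helper name `word_error_rate` that does not come from the paper:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace; no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Toy sentences, not data from the paper.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the score is computed on words regardless of how the decoder produced them, the same function applies to lexicon-based and lexicon-free outputs alike.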

Joint AEC AND Beamforming with Double-Talk Detection using RNN-Transformer

Nov 09, 2021
Vinay Kothapally, Yong Xu, Meng Yu, Shi-Xiong Zhang, Dong Yu

Acoustic echo cancellation (AEC) is a technique used in full-duplex communication systems to eliminate acoustic feedback of far-end speech. However, the performance of AEC systems degrades in naturalistic environments due to nonlinear distortions introduced by the speaker, as well as background noise, reverberation, and double-talk scenarios. To address nonlinear distortions and co-existing background noise, several deep neural network (DNN)-based joint AEC and denoising systems were developed. These systems are based on either purely "black-box" neural networks or "hybrid" systems that combine traditional AEC algorithms with neural networks. We propose an all-deep-learning framework that combines multi-channel AEC and our recently proposed self-attentive recurrent neural network (RNN) beamformer. Furthermore, we propose a double-talk detection transformer (DTDT) module based on the multi-head attention transformer structure that computes attention over time by leveraging frame-wise double-talk predictions. Experiments show that our proposed method outperforms other approaches in terms of improving speech quality and speech recognition rate of an ASR system.

* Submitted to ICASSP 2022

A Review of Language and Speech Features for Cognitive-Linguistic Assessment

Jun 04, 2019
Rohit Voleti, Julie M. Liss, Visar Berisha

It is widely accepted that information derived from analyzing speech (the acoustic signal) and language production (words and sentences) serves as a useful window into the health of an individual's cognitive ability. In fact, most neuropsychological batteries used in cognitive assessment have a component related to speech and language where clinicians elicit speech from patients for subjective evaluation across a broad set of dimensions. With advances in speech signal processing and natural language processing, there has been recent interest in developing tools to detect more subtle changes in cognitive-linguistic function. This work relies on extracting a set of features from recorded and transcribed speech for objective assessments of cognition, early diagnosis of neurological disease, and objective tracking of disease after diagnosis. In this paper we provide a review of existing speech and language features used in this domain, discuss their clinical application, and highlight their advantages and disadvantages. Broadly speaking, the review is split into two categories: language features based on natural language processing and speech features based on speech signal processing. Within each category, we consider features that aim to measure complementary dimensions of cognitive-linguistics, including language diversity, syntactic complexity, semantic coherence, and timing. We conclude the review with a proposal of new research directions to further advance the field.

* 13 pages, 5 figures. Submitted (under review) to IEEE Journal of Selected Topics on Signal Processing (JSTSP), Special Issue on Automatic Assessment of Health Disorders Based on Voice, Speech and Language Processing (planned for February 2020)

On Prosody Modeling for ASR+TTS based Voice Conversion

Jul 20, 2021
Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda

In voice conversion (VC), an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents; these are then used as input by a text-to-speech (TTS) system to generate the converted speech. Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity. Although some researchers have considered transferring prosodic clues from the source speech, there arises a speaker mismatch during training and conversion. To address this issue, in this work, we propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP). We evaluate both methods on the VCC2020 benchmark and consider different linguistic representations. The results demonstrate the effectiveness of TTP in both objective and subjective evaluations.

* Submitted to ASRU2021. Under review

Probabilistic Spherical Discriminant Analysis: An Alternative to PLDA for length-normalized embeddings

Mar 28, 2022
Niko Brümmer, Albert Swart, Ladislav Mošner, Anna Silnova, Oldřich Plchot, Themos Stafylakis, Lukáš Burget

In speaker recognition, where speech segments are mapped to embeddings on the unit hypersphere, two scoring backends are commonly used, namely cosine scoring or PLDA. Both have advantages and disadvantages, depending on the context. Cosine scoring follows naturally from the spherical geometry, but for PLDA the blessing is mixed -- length normalization Gaussianizes the between-speaker distribution, but violates the assumption of a speaker-independent within-speaker distribution. We propose PSDA, an analogue to PLDA that uses von Mises-Fisher distributions on the hypersphere for both within-class and between-class distributions. We show how the self-conjugacy of this distribution gives closed-form likelihood-ratio scores, making it a drop-in replacement for PLDA at scoring time. All kinds of trials can be scored, including single-enroll and multi-enroll verification, as well as more complex likelihood ratios that could be used in clustering and diarization. Learning is done via an EM algorithm with closed-form updates. We explain the model and present some first experiments.

* Submitted to Interspeech 2022
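
For readers unfamiliar with the cosine-scoring baseline mentioned in the abstract above, the following is a minimal sketch: both embeddings are length-normalized onto the unit hypersphere and the score is their dot product. The function names and the toy vectors are illustrative assumptions; this is not the PSDA model and not code from the paper.

```python
import math
from typing import Sequence


def length_normalize(embedding: Sequence[float]) -> list[float]:
    """Project an embedding onto the unit hypersphere (L2 norm of 1)."""
    norm = math.sqrt(sum(v * v for v in embedding))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in embedding]


def cosine_score(enroll: Sequence[float], test: Sequence[float]) -> float:
    """Cosine similarity between an enrollment and a test embedding.

    Higher scores suggest the two segments come from the same speaker.
    """
    if len(enroll) != len(test):
        raise ValueError("embeddings must have the same dimension")
    e = length_normalize(enroll)
    t = length_normalize(test)
    return sum(a * b for a, b in zip(e, t))


if __name__ == "__main__":
    # Toy 3-dimensional vectors; real speaker embeddings are much larger.
    print(cosine_score([0.2, 0.9, 0.1], [0.25, 0.8, 0.05]))
```

The score depends only on the angle between the two vectors, which is why cosine scoring is described as following naturally from the spherical geometry.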

Generalization Ability of MOS Prediction Networks

Oct 18, 2021
Erica Cooper, Wen-Chin Huang, Tomoki Toda, Junichi Yamagishi

Automatic methods to predict listener opinions of synthesized speech remain elusive since listeners, systems being evaluated, characteristics of the speech, and even the instructions given and the rating scale all vary from test to test. While automatic predictors for metrics such as mean opinion score (MOS) can achieve high prediction accuracy on samples from the same test, they typically fail to generalize well to new listening test contexts. In this paper, using a variety of networks for MOS prediction including MOSNet and self-supervised speech models such as wav2vec2, we investigate their performance on data from different listening tests in both zero-shot and fine-tuned settings. We find that wav2vec2 models fine-tuned for MOS prediction have good generalization capability to out-of-domain data even for the most challenging case of utterance-level predictions in the zero-shot setting, and that fine-tuning to in-domain data can improve predictions. We also observe that unseen systems are especially challenging for MOS prediction models.

* Submitted to ICASSP 2022

Assessing Progress of Parkinson's Disease Using Acoustic Analysis of Phonation

Mar 17, 2022
Jiri Mekyska, Zoltan Galaz, Zdenek Mzourek, Zdenek Smekal, Irena Rektorova, Ilona Eliasova, Milena Kostalova, Martina Mrackova, Dagmar Berankov, Marcos Faundez-Zanuy, Karmele Lopez-de-Ipiña, Jesus B. Alonso-Hernandez

This paper deals with a complex acoustic analysis of phonation in patients with Parkinson's disease (PD), with a special focus on estimating disease progress as described by 7 different clinical scales, e.g. the Unified Parkinson's Disease Rating Scale or the Beck Depression Inventory. The analysis is based on parametrization of 5 Czech vowels pronounced by 84 PD patients. Using classification and regression trees, we estimated all clinical scores with a maximal error lower than or equal to 13 %. The best estimation was observed for the Mini-Mental State Examination (MAE = 0.77, estimation error 5.50 %). Finally, we proposed a binary classification based on random forests that is able to identify Parkinson's disease with sensitivity SEN = 92.86 % (SPE = 85.71 %). The parametrization process was based on extraction of 107 speech features quantifying different clinical signs of hypokinetic dysarthria present in PD.

* 8 pages published in the 4th IEEE IWOBI 2015, pp. 115-122, 10-12 June, 2015 Donostia-San Sebastian. ISBN: 978-84-606-8733-7
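
The abstract above reports sensitivity (SEN) and specificity (SPE) for the binary classifier. As a reminder of how these two figures are derived from predictions, here is a minimal sketch; the helper name `sensitivity_specificity`, the label convention (1 = PD, 0 = healthy control) and the toy labels are assumptions for illustration and are not data from the paper.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels.

    Assumed convention: 1 = patient (positive), 0 = healthy control (negative).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # patients detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # patients missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls flagged

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")

    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


if __name__ == "__main__":
    # Made-up labels for illustration only.
    truth = [1, 1, 1, 1, 0, 0, 0]
    preds = [1, 1, 1, 0, 0, 0, 1]
    sen, spe = sensitivity_specificity(truth, preds)
    print(f"SEN = {sen:.2%}, SPE = {spe:.2%}")
```

Reporting both values matters in a screening setting: sensitivity reflects how many patients are caught, while specificity reflects how many healthy controls are correctly left unflagged.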

Attentive Temporal Pooling for Conformer-based Streaming Language Identification in Long-form Speech

Mar 21, 2022
Quan Wang, Yang Yu, Jason Pelecanos, Yiling Huang, Ignacio Lopez Moreno

In this paper, we introduce a novel language identification system based on conformer layers. We propose an attentive temporal pooling mechanism to allow the model to carry information in long-form audio via a recurrent form, such that the inference can be performed in a streaming fashion. Additionally, a simple domain adaptation mechanism is introduced to allow adapting an existing language identification model to a new domain where the prior language distribution is different. We perform a comparative study of different model topologies under different constraints of model size, and find that conformer-based models outperform LSTM-based and transformer-based models. Our experiments also show that attentive temporal pooling and domain adaptation significantly improve the model accuracy.
