Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

To BERT or Not To BERT: Comparing Speech and Language-based Approaches for Alzheimer's Disease Detection

Jul 26, 2020
Aparna Balagopalan, Benjamin Eyre, Frank Rudzicz, Jekaterina Novikova

Research related to automatically detecting Alzheimer's disease (AD) is important, given the high prevalence of AD and the high cost of traditional methods. Since AD significantly affects the content and acoustics of spontaneous speech, natural language processing and machine learning provide promising techniques for reliably detecting AD. We compare and contrast the performance of two such approaches for AD detection on the recent ADReSS challenge dataset: 1) using domain knowledge-based hand-crafted features that capture linguistic and acoustic phenomena, and 2) fine-tuning Bidirectional Encoder Representations from Transformer (BERT)-based sequence classification models. We also compare multiple feature-based regression models for a neuropsychological score task in the challenge. We observe that fine-tuned BERT models, given the relative importance of linguistics in cognitive impairment detection, outperform feature-based approaches on the AD detection task.

* accepted to INTERSPEECH 2020 

  Access Paper or Ask Questions

Multilingual Transfer Learning for Code-Switched Language and Speech Neural Modeling

Apr 13, 2021
Genta Indra Winata

In this thesis, we address the data scarcity and limitations of linguistic theory by proposing language-agnostic multi-task training methods. First, we introduce a meta-learning-based approach, meta-transfer learning, in which information is judiciously extracted from high-resource monolingual speech data to the code-switching domain. The meta-transfer learning quickly adapts the model to the code-switching task from a number of monolingual tasks by learning to learn in a multi-task learning fashion. Second, we propose a novel multilingual meta-embeddings approach to effectively represent code-switching data by acquiring useful knowledge learned in other languages, learning the commonalities of closely related languages and leveraging lexical composition. The method is far more efficient compared to contextualized pre-trained multilingual models. Third, we introduce multi-task learning to integrate syntactic information as a transfer learning strategy to a language model and learn where to code-switch. To further alleviate the aforementioned issues, we propose a data augmentation method using Pointer-Gen, a neural network using a copy mechanism to teach the model the code-switch points from monolingual parallel sentences. We disentangle the need for linguistic theory, and the model captures code-switching points by attending to input words and aligning the parallel words, without requiring any word alignments or constituency parsers. More importantly, the model can be effectively used for languages that are syntactically different, and it outperforms the linguistic theory-based models.

* HKUST PhD Thesis. 120 pages 

  Access Paper or Ask Questions

I-vector Based Within Speaker Voice Quality Identification on connected speech

Feb 15, 2021
Chuyao Feng, Eva van Leer, Mackenzie Lee Curtis, David V. Anderson

Voice disorders affect a large portion of the population, especially heavy voice users such as teachers or call-center workers. Most voice disorders can be treated effectively with behavioral voice therapy, which teaches patients to replace problematic, habituated voice production mechanics with optimal voice production technique(s), yielding improved voice quality. However, treatment often fails because patients have difficulty differentiating their habitual voice from the target technique independently, when clinician feedback is unavailable between therapy sessions. Therefore, with the long term aim to extend clinician feedback to extra-clinical settings, we built two systems that automatically differentiate various voice qualities produced by the same individual. We hypothesized that 1) a system based on i-vectors could classify these qualities as if they represent different speakers and 2) such a system would outperform one based on traditional voice signal processing algorithms. Training recordings were provided by thirteen amateur actors, each producing 5 perceptually different voice qualities in connected speech: normal, breathy, fry, twang, and hyponasal. As hypothesized, the i-vector system outperformed the acoustic measure system in classification accuracy (i.e. 97.5\% compared to 77.2\%, respectively). Findings are expected because the i-vector system maps features to an integrated space which better represents each voice quality than the 22-feature space of the baseline system. Therefore, an i-vector based system has potential for clinical application in voice therapy and voice training.

* s 

  Access Paper or Ask Questions

A Feature-Enriched Neural Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

Jul 02, 2017
Xinchi Chen, Xipeng Qiu, Xuanjing Huang

Recently, neural network models for natural language processing tasks have been increasingly focused on for their ability of alleviating the burden of manual feature engineering. However, the previous neural models cannot extract the complicated feature compositions as the traditional methods with discrete features. In this work, we propose a feature-enriched neural model for joint Chinese word segmentation and part-of-speech tagging task. Specifically, to simulate the feature templates of traditional discrete feature based models, we use different filters to model the complex compositional features with convolutional and pooling layer, and then utilize long distance dependency information with recurrent layer. Experimental results on five different datasets show the effectiveness of our proposed model.

  Access Paper or Ask Questions

AISPEECH-SJTU accent identification system for the Accented English Speech Recognition Challenge

Feb 19, 2021
Houjun Huang, Xu Xiang, Yexin Yang, Rao Ma, Yanmin Qian

This paper describes the AISpeech-SJTU system for the accent identification track of the Interspeech-2020 Accented English Speech Recognition Challenge. In this challenge track, only 160-hour accented English data collected from 8 countries and the auxiliary Librispeech dataset are provided for training. To build an accurate and robust accent identification system, we explore the whole system pipeline in detail. First, we introduce the ASR based phone posteriorgram (PPG) feature to accent identification and verify its efficacy. Then, a novel TTS based approach is carefully designed to augment the very limited accent training data for the first time. Finally, we propose the test time augmentation and embedding fusion schemes to further improve the system performance. Our final system is ranked first in the challenge and outperforms all the other participants by a large margin. The submitted system achieves 83.63\% average accuracy on the challenge evaluation data, ahead of the others by more than 10\% in absolute terms.

* Accepted to ICASSP 2021 

  Access Paper or Ask Questions

Cross-lingual Zero- and Few-shot Hate Speech Detection Utilising Frozen Transformer Language Models and AXEL

Apr 13, 2020
Lukas Stappen, Fabian Brunn, Björn Schuller

Detecting hate speech, especially in low-resource languages, is a non-trivial challenge. To tackle this, we developed a tailored architecture based on frozen, pre-trained Transformers to examine cross-lingual zero-shot and few-shot learning, in addition to uni-lingual learning, on the HatEval challenge data set. With our novel attention-based classification block AXEL, we demonstrate highly competitive results on the English and Spanish subsets. We also re-sample the English subset, enabling additional, meaningful comparisons in the future.

  Access Paper or Ask Questions

Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

Oct 23, 2020
Wen-Chin Huang, Yi-Chiao Wu, Tomoki Hayashi, Tomoki Toda

We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation that was learned from massive unlabeled data, which is assumed to be speaker-independent and well corresponds to underlying linguistic contents. Given a training dataset of the target speaker, we extract VQW2V and acoustic features to estimate a seq2seq mapping function from the former to the latter. With the help of a pretraining method and a newly designed postprocessing technique, our model can be generalized to only 5 min of data, even outperforming the same model trained with parallel data.

* Submitted to ICASSP 2021 

  Access Paper or Ask Questions

Adding Connectionist Temporal Summarization into Conformer to Improve Its Decoder Efficiency For Speech Recognition

Apr 08, 2022
Nick J. C. Wang, Zongfeng Quan, Shaojun Wang, Jing Xiao

The Conformer model is an excellent architecture for speech recognition modeling that effectively utilizes the hybrid losses of connectionist temporal classification (CTC) and attention to train model parameters. To improve the decoding efficiency of Conformer, we propose a novel connectionist temporal summarization (CTS) method that reduces the number of frames required for the attention decoder fed from the acoustic sequences generated by the encoder, thus reducing operations. However, to achieve such decoding improvements, we must fine-tune model parameters, as cross-attention observations are changed and thus require corresponding refinements. Our final experiments show that, with a beamwidth of 4, the LibriSpeech's decoding budget can be reduced by up to 20% and for FluentSpeech data it can be reduced by 11%, without losing ASR accuracy. An improvement in accuracy is even found for the LibriSpeech "test-other" set. The word error rate (WER) is reduced by 6\% relative at the beam width of 1 and by 3% relative at the beam width of 4.

* Submitted to INTERSPEECH 2022 (5 pages, 2 figures) 

  Access Paper or Ask Questions

Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training

Nov 26, 2020
Sameer Khurana, Niko Moritz, Takaaki Hori, Jonathan Le Roux

The performance of automatic speech recognition (ASR) systems typically degrades significantly when the training and test data domains are mismatched. In this paper, we show that self-training (ST) combined with an uncertainty-based pseudo-label filtering approach can be effectively used for domain adaptation. We propose DUST, a dropout-based uncertainty-driven self-training technique which uses agreement between multiple predictions of an ASR system obtained for different dropout settings to measure the model's uncertainty about its prediction. DUST excludes pseudo-labeled data with high uncertainties from the training, which leads to substantially improved ASR results compared to ST without filtering, and accelerates the training time due to a reduced training data set. Domain adaptation experiments using WSJ as a source domain and TED-LIUM 3 as well as SWITCHBOARD as the target domains show that up to 80% of the performance of a system trained on ground-truth data can be recovered.

* Submitted to ICASSP 2021 

  Access Paper or Ask Questions

Foster Strengths and Circumvent Weaknesses: a Speech Enhancement Framework with Two-branch Collaborative Learning

Oct 12, 2021
Wenxin Tai, Jiajia Li, Yixiang Wang, Tian Lan, Qiao Liu

Recent single-channel speech enhancement methods usually convert waveform to the time-frequency domain and use magnitude/complex spectrum as the optimizing target. However, both magnitude-spectrum-based methods and complex-spectrum-based methods have their respective pros and cons. In this paper, we propose a unified two-branch framework to foster strengths and circumvent weaknesses of different paradigms. The proposed framework could take full advantage of the apparent spectral regularity in magnitude spectrogram and break the bottleneck that magnitude-based methods have suffered. Within each branch, we use collaborative expert block and its variants as substitutes for regular convolution layers. Experiments on TIMIT benchmark demonstrate that our method is superior to existing state-of-the-art ones.

  Access Paper or Ask Questions