This paper discusses the challenges that arise when large speech corpora receive an ever-broadening range of diverse and distinct annotations. Two case studies of this process are presented: the Switchboard Corpus of telephone conversations and the TDT2 corpus of broadcast news. Switchboard has undergone two independent transcriptions and various types of additional annotation, all carried out as separate projects that were dispersed both geographically and chronologically. The TDT2 corpus has also received a variety of annotations, but all directly created or managed by a core group. In both cases, issues arise involving the propagation of repairs, consistency of references, and the ability to integrate annotations having different formats and levels of detail. We describe a general framework whereby these issues can be addressed successfully.
Automatic speech recognition (ASR) tasks are resolved by end-to-end deep learning models, which benefits us by less preparation of raw data, and easier transformation between languages. We propose a novel end-to-end deep learning model architecture namely cascaded CNN-resBiLSTM-CTC. In the proposed model, we add residual blocks in BiLSTM layers to extract sophisticated phoneme and semantic information together, and apply cascaded structure to pay more attention mining information of hard negative samples. By applying both simple Fast Fourier Transform (FFT) technique and n-gram language model (LM) rescoring method, we manage to achieve word error rate (WER) of 3.41% on LibriSpeech test clean corpora. Furthermore, we propose a new batch-varied method to speed up the training process in length-varied tasks, which result in 25% less training time.
Research related to automatically detecting Alzheimer's disease (AD) is important, given the high prevalence of AD and the high cost of traditional methods. Since AD significantly affects the content and acoustics of spontaneous speech, natural language processing and machine learning provide promising techniques for reliably detecting AD. We compare and contrast the performance of two such approaches for AD detection on the recent ADReSS challenge dataset: 1) using domain knowledge-based hand-crafted features that capture linguistic and acoustic phenomena, and 2) fine-tuning Bidirectional Encoder Representations from Transformer (BERT)-based sequence classification models. We also compare multiple feature-based regression models for a neuropsychological score task in the challenge. We observe that fine-tuned BERT models, given the relative importance of linguistics in cognitive impairment detection, outperform feature-based approaches on the AD detection task.
In this thesis, we address the data scarcity and limitations of linguistic theory by proposing language-agnostic multi-task training methods. First, we introduce a meta-learning-based approach, meta-transfer learning, in which information is judiciously extracted from high-resource monolingual speech data to the code-switching domain. The meta-transfer learning quickly adapts the model to the code-switching task from a number of monolingual tasks by learning to learn in a multi-task learning fashion. Second, we propose a novel multilingual meta-embeddings approach to effectively represent code-switching data by acquiring useful knowledge learned in other languages, learning the commonalities of closely related languages and leveraging lexical composition. The method is far more efficient compared to contextualized pre-trained multilingual models. Third, we introduce multi-task learning to integrate syntactic information as a transfer learning strategy to a language model and learn where to code-switch. To further alleviate the aforementioned issues, we propose a data augmentation method using Pointer-Gen, a neural network using a copy mechanism to teach the model the code-switch points from monolingual parallel sentences. We disentangle the need for linguistic theory, and the model captures code-switching points by attending to input words and aligning the parallel words, without requiring any word alignments or constituency parsers. More importantly, the model can be effectively used for languages that are syntactically different, and it outperforms the linguistic theory-based models.
Voice disorders affect a large portion of the population, especially heavy voice users such as teachers or call-center workers. Most voice disorders can be treated effectively with behavioral voice therapy, which teaches patients to replace problematic, habituated voice production mechanics with optimal voice production technique(s), yielding improved voice quality. However, treatment often fails because patients have difficulty differentiating their habitual voice from the target technique independently, when clinician feedback is unavailable between therapy sessions. Therefore, with the long term aim to extend clinician feedback to extra-clinical settings, we built two systems that automatically differentiate various voice qualities produced by the same individual. We hypothesized that 1) a system based on i-vectors could classify these qualities as if they represent different speakers and 2) such a system would outperform one based on traditional voice signal processing algorithms. Training recordings were provided by thirteen amateur actors, each producing 5 perceptually different voice qualities in connected speech: normal, breathy, fry, twang, and hyponasal. As hypothesized, the i-vector system outperformed the acoustic measure system in classification accuracy (i.e. 97.5\% compared to 77.2\%, respectively). Findings are expected because the i-vector system maps features to an integrated space which better represents each voice quality than the 22-feature space of the baseline system. Therefore, an i-vector based system has potential for clinical application in voice therapy and voice training.
Recently, neural network models for natural language processing tasks have been increasingly focused on for their ability of alleviating the burden of manual feature engineering. However, the previous neural models cannot extract the complicated feature compositions as the traditional methods with discrete features. In this work, we propose a feature-enriched neural model for joint Chinese word segmentation and part-of-speech tagging task. Specifically, to simulate the feature templates of traditional discrete feature based models, we use different filters to model the complex compositional features with convolutional and pooling layer, and then utilize long distance dependency information with recurrent layer. Experimental results on five different datasets show the effectiveness of our proposed model.
This paper describes the AISpeech-SJTU system for the accent identification track of the Interspeech-2020 Accented English Speech Recognition Challenge. In this challenge track, only 160-hour accented English data collected from 8 countries and the auxiliary Librispeech dataset are provided for training. To build an accurate and robust accent identification system, we explore the whole system pipeline in detail. First, we introduce the ASR based phone posteriorgram (PPG) feature to accent identification and verify its efficacy. Then, a novel TTS based approach is carefully designed to augment the very limited accent training data for the first time. Finally, we propose the test time augmentation and embedding fusion schemes to further improve the system performance. Our final system is ranked first in the challenge and outperforms all the other participants by a large margin. The submitted system achieves 83.63\% average accuracy on the challenge evaluation data, ahead of the others by more than 10\% in absolute terms.
Detecting hate speech, especially in low-resource languages, is a non-trivial challenge. To tackle this, we developed a tailored architecture based on frozen, pre-trained Transformers to examine cross-lingual zero-shot and few-shot learning, in addition to uni-lingual learning, on the HatEval challenge data set. With our novel attention-based classification block AXEL, we demonstrate highly competitive results on the English and Spanish subsets. We also re-sample the English subset, enabling additional, meaningful comparisons in the future.
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation that was learned from massive unlabeled data, which is assumed to be speaker-independent and well corresponds to underlying linguistic contents. Given a training dataset of the target speaker, we extract VQW2V and acoustic features to estimate a seq2seq mapping function from the former to the latter. With the help of a pretraining method and a newly designed postprocessing technique, our model can be generalized to only 5 min of data, even outperforming the same model trained with parallel data.
The Conformer model is an excellent architecture for speech recognition modeling that effectively utilizes the hybrid losses of connectionist temporal classification (CTC) and attention to train model parameters. To improve the decoding efficiency of Conformer, we propose a novel connectionist temporal summarization (CTS) method that reduces the number of frames required for the attention decoder fed from the acoustic sequences generated by the encoder, thus reducing operations. However, to achieve such decoding improvements, we must fine-tune model parameters, as cross-attention observations are changed and thus require corresponding refinements. Our final experiments show that, with a beamwidth of 4, the LibriSpeech's decoding budget can be reduced by up to 20% and for FluentSpeech data it can be reduced by 11%, without losing ASR accuracy. An improvement in accuracy is even found for the LibriSpeech "test-other" set. The word error rate (WER) is reduced by 6\% relative at the beam width of 1 and by 3% relative at the beam width of 4.