Inspired by the progress of the End-to-End approach , this paper systematically studies the effects of Number of Filters of convolutional layers on the model prediction accuracy of CNN+RNN (Convolutional Neural Networks adding to Recurrent Neural Networks) for ASR Models (Automatic Speech Recognition). Experimental results show that only when the CNN Number of Filters exceeds a certain threshold value is adding CNN to RNN able to improve the performance of the CNN+RNN speech recognition model, otherwise some parameter ranges of CNN can render it useless to add the CNN to the RNN model. Our results show a strong dependency of word accuracy on the Number of Filters of convolutional layers. Based on the experimental results, the paper suggests a possible hypothesis of Sound-2-Vector Embedding (Convolutional Embedding) to explain the above observations. Based on this Embedding hypothesis and the optimization of parameters, the paper develops an End-to-End speech recognition system which has a high word accuracy but also has a light model-weight. The developed LVCSR (Large Vocabulary Continuous Speech Recognition) model has achieved quite a high word accuracy of 90.2% only by its Acoustic Model alone, without any assistance from intermediate phonetic representation and any Language Model. Its acoustic model contains only 4.4 million weight parameters, compared to the 35~68 million acoustic-model weight parameters in DeepSpeech2  (one of the top state-of-the-art LVCSR models) which can achieve a word accuracy of 91.5%. The light-weighted model is good for improving the transcribing computing efficiency and also useful for mobile devices, Driverless Vehicles, etc. Our model weight is reduced to ~10% the size of DeepSpeech2, but our model accuracy remains close to that of DeepSpeech2. If combined with a Language Model, our LVCSR system is able to achieve 91.5% word accuracy.
In the domain of air traffic control (ATC) systems, efforts to train a practical automatic speech recognition (ASR) model always faces the problem of small training samples since the collection and annotation of speech samples are expert- and domain-dependent task. In this work, a novel training approach based on pretraining and transfer learning is proposed to address this issue, and an improved end-to-end deep learning model is developed to address the specific challenges of ASR in the ATC domain. An unsupervised pretraining strategy is first proposed to learn speech representations from unlabeled samples for a certain dataset. Specifically, a masking strategy is applied to improve the diversity of the sample without losing their general patterns. Subsequently, transfer learning is applied to fine-tune a pretrained or other optimized baseline models to finally achieves the supervised ASR task. By virtue of the common terminology used in the ATC domain, the transfer learning task can be regarded as a sub-domain adaption task, in which the transferred model is optimized using a joint corpus consisting of baseline samples and new transcribed samples from the target dataset. This joint corpus construction strategy enriches the size and diversity of the training samples, which is important for addressing the issue of the small transcribed corpus. In addition, speed perturbation is applied to augment the new transcribed samples to further improve the quality of the speech corpus. Three real ATC datasets are used to validate the proposed ASR model and training strategies. The experimental results demonstrate that the ASR performance is significantly improved on all three datasets, with an absolute character error rate only one-third of that achieved through the supervised training. The applicability of the proposed strategies to other ASR approaches is also validated.
Target speech separation refers to extracting the target speaker's speech from mixed signals. Despite the recent advances in deep learning based close-talk speech separation, the applications to real-world are still an open issue. Two main challenges are the complex acoustic environment and the real-time processing requirement. To address these challenges, we propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture in reverberant environments, assisted with directional information of the speaker(s). Firstly, against variations brought by complex environment, the key idea is to increase the acoustic representation completeness through the jointly modeling of temporal, spectral and spatial discriminability between the target and interference source. Specifically, temporal, spectral, spatial along with the designed directional features are integrated to create a joint acoustic representation. Secondly, to reduce the latency, we design a fully-convolutional autoencoder framework, which is purely end-to-end and single-pass. All the feature computation is implemented by the network layers and operations to speed up the separation procedure. Evaluation is conducted on simulated reverberant dataset WSJ0-2mix and WSJ0-3mix under speaker-independent scenario. Experimental results demonstrate that the proposed method outperforms state-of-the-art deep learning based multi-channel approaches with fewer parameters and faster processing speed. Furthermore, the proposed temporal-spatial neural filter can handle mixtures with varying and unknown number of speakers and exhibits persistent performance even when existing a direction estimation error. Codes and models will be released soon.
End-to-end DNN architectures have pushed the state-of-the-art in speech technologies, as well as in other spheres of AI, leading researchers to train more complex and deeper models. These improvements came at the cost of transparency. DNNs are innately opaque and difficult to interpret. We no longer understand what features are learned, where they are preserved, and how they inter-operate. Such an analysis is important for better model understanding, debugging and to ensure fairness in ethical decision making. In this work, we analyze the representations trained within deep speech models, towards the task of speaker recognition, dialect identification and reconstruction of masked signals. We carry a layer- and neuron-level analysis on the utterance-level representations captured within pretrained speech models for speaker, language and channel properties. We study: is this information captured in the learned representations? where is it preserved? how is it distributed? and can we identify a minimal subset of network that posses this information. Using diagnostic classifiers, we answered these questions. Our results reveal: (i) channel and gender information is omnipresent and is redundantly distributed (ii) complex properties such as dialectal information is encoded only in the task-oriented pretrained network and is localised in the upper layers (iii) a minimal subset of neurons can be extracted to encode the predefined property (iv) salient neurons are sometimes shared between properties and can highlights presence of biases in the network. Our cross-architectural comparison indicates that (v) the pretrained models captures speaker-invariant information and (vi) the pretrained CNNs models are competitive to the Transformers for encoding information for the studied properties. To the best of our knowledge, this is the first study to investigate neuron analysis on the speech models.
A new scheme to represent phonological changes during continuous speech recognition is suggested. A phonological tag coupled with its morphological tag is designed to represent the conditions of Korean phonological changes. A pairwise language model of these morphological and phonological tags is implemented in Korean speech recognition system. Performance of the model is verified through the TDNN-based speech recognition experiments.
Quoted speech is often set off by punctuation marks, in particular quotation marks. Thus, it might seem that the quotation marks would be extremely useful in identifying these structures in texts. Unfortunately, the situation is not quite so clear. In this work, I will argue that quotation marks are not adequate for either identifying or constraining the syntax of quoted speech. More useful information comes from the presence of a quoting verb, which is either a verb of saying or a punctual verb, and the presence of other punctuation marks, usually commas. Using a lexicalized grammar, we can license most quoting clauses as text adjuncts. A distinction will be made not between direct and indirect quoted speech, but rather between adjunct and non-adjunct quoting clauses.
Natural language processing tools are used to automatically detect disturbances in transcribed speech of schizophrenia inpatients who speak Hebrew. We measure topic mutation over time and show that controls maintain more cohesive speech than inpatients. We also examine differences in how inpatients and controls use adjectives and adverbs to describe content words and show that the ones used by controls are more common than the those of inpatients. We provide experimental results and show their potential for automatically detecting schizophrenia in patients by means only of their speech patterns.
The present work is based on the COST Action IC1206 for De-identification in multimedia content. It was performed to test four algorithms of voice modifications on a speech gender recognizer to find the degree of modification of pitch when the speech recognizer have the probability of success equal to the probability of failure. The purpose of this analysis is to assess the intensity of the speech tone modification, the quality, the reversibility and not-reversibility of the changes made.
In this paper we investigate whether electroencephalography (EEG) features can be used to improve the performance of continuous visual speech recognition systems. We implemented a connectionist temporal classification (CTC) based end-to-end automatic speech recognition (ASR) model for performing recognition. Our results demonstrate that EEG features are helpful in enhancing the performance of continuous visual speech recognition systems.