We investigate a set of techniques for RNN Transducers (RNN-Ts) that were instrumental in lowering the word error rate on three different tasks (Switchboard 300 hours, conversational Spanish 780 hours and conversational Italian 900 hours). The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination and general training recipe. First, we introduce a novel multiplicative integration of the encoder and prediction network vectors in the joint network (as opposed to additive). Second, we discuss the applicability of i-vector speaker adaptation to RNN-Ts in conjunction with data perturbation. Third, we explore the effectiveness of the recently proposed density ratio language model fusion for these tasks. Last but not least, we describe the other components of our training recipe and their effect on recognition performance. We report a 5.9% and 12.5% word error rate on the Switchboard and CallHome test sets of the NIST Hub5 2000 evaluation and a 12.7% WER on the Mozilla CommonVoice Italian test set.
While being an essential component of spoken language, fillers (e.g."um" or "uh") often remain overlooked in Spoken Language Understanding (SLU) tasks. We explore the possibility of representing them with deep contextualised embeddings, showing improvements on modelling spoken language and two downstream tasks - predicting a speaker's stance and expressed confidence.
In topic identification (topic ID) on real-world unstructured audio, an audio instance of variable topic shifts is first broken into sequential segments, and each segment is independently classified. We first present a general purpose method for topic ID on spoken segments in low-resource languages, using a cascade of universal acoustic modeling, translation lexicons to English, and English-language topic classification. Next, instead of classifying each segment independently, we demonstrate that exploring the contextual dependencies across sequential segments can provide large improvements. In particular, we propose an attention-based contextual model which is able to leverage the contexts in a selective manner. We test both our contextual and non-contextual models on four LORELEI languages, and on all but one our attention-based contextual model significantly outperforms the context-independent models.
In this paper, we explore the use of a factorized hierarchical variational autoencoder (FHVAE) model to learn an unsupervised latent representation for dialect identification (DID). An FHVAE can learn a latent space that separates the more static attributes within an utterance from the more dynamic attributes by encoding them into two different sets of latent variables. Useful factors for dialect identification, such as phonetic or linguistic content, are encoded by a segmental latent variable, while irrelevant factors that are relatively constant within a sequence, such as a channel or a speaker information, are encoded by a sequential latent variable. The disentanglement property makes the segmental latent variable less susceptible to channel and speaker variation, and thus reduces degradation from channel domain mismatch. We demonstrate that on fully-supervised DID tasks, an end-to-end model trained on the features extracted from the FHVAE model achieves the best performance, compared to the same model trained on conventional acoustic features and an i-vector based system. Moreover, we also show that the proposed approach can leverage a large amount of unlabeled data for FHVAE training to learn domain-invariant features for DID, and significantly improve the performance in a low-resource condition, where the labels for the in-domain data are not available.
We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response, and that some of them remain almost analytic.
In this paper, we explore the unsupervised learning of a semantic embedding space for co-occurring sensory inputs. Specifically, we focus on the task of learning a semantic vector space for both spoken and handwritten digits using the TIDIGITs and MNIST datasets. Current techniques encode image and audio/textual inputs directly to semantic embeddings. In contrast, our technique maps an input to the mean and log variance vectors of a diagonal Gaussian from which sample semantic embeddings are drawn. In addition to encouraging semantic similarity between co-occurring inputs,our loss function includes a regularization term borrowed from variational autoencoders (VAEs) which drives the posterior distributions over embeddings to be unit Gaussian. We can use this regularization term to filter out modality information while preserving semantic information. We speculate this technique may be more broadly applicable to other areas of cross-modality/domain information retrieval and transfer learning.
A natural language (or ordinary language) is a language that is spoken, written, or signed by humans for general-purpose communication, as distinguished from formal languages (such as computer-programming languages or the "languages" used in the study of formal logic). The computational activities required for enabling a computer to carry out information processing using natural language is called natural language processing. We have taken Assamese language to check the grammars of the input sentence. Our aim is to produce a technique to check the grammatical structures of the sentences in Assamese text. We have made grammar rules by analyzing the structures of Assamese sentences. Our parsing program finds the grammatical errors, if any, in the Assamese sentence. If there is no error, the program will generate the parse tree for the Assamese sentence
A technique for detecting errors made by Hidden Markov Model taggers is described, based on comparing observable values of the tagging process with a threshold. The resulting approach allows the accuracy of the tagger to be improved by accepting a lower efficiency, defined as the proportion of words which are tagged. Empirical observations are presented which demonstrate the validity of the technique and suggest how to choose an appropriate threshold.
This work concentrates on reducing the RTF and word error rate of a hybrid HMM-DNN. Our baseline system uses an architecture with TDNN and LSTM layers. We find this architecture particularly useful for lightly reverberated environments. However, these models tend to demand more computation than is desirable. In this work, we explore alternate architectures employing singular value decomposition (SVD) is applied to the TDNN layers to reduce the RTF, as well as to the affine transforms of every LSTM cell. We compare this approach with specifying bottleneck layers similar to those introduced by SVD before training. Additionally, we reduced the search space of the decoding graph to make it a better fit to operate in real-time applications. We report -61.57% relative reduction in RTF and almost 1% relative decrease in WER for our architecture trained on Fisher data along with reverberated versions of this dataset in order to match one of our target test distributions.
This document sums up our results forthe NLP lecture at ETH in the spring semester 2021. In this work, a BERT based neural network model (Devlin et al.,2018) is applied to the JIGSAW dataset (Jigsaw/Conversation AI, 2019) in order to create a model identifying hateful and toxic comments (strictly seperated from offensive language) in online social platforms (English language), inthis case Twitter. Three other neural network architectures and a GPT-2 (Radfordet al., 2019) model are also applied on the provided data set in order to compare these different models. The trained BERT model is then applied on two different data sets to evaluate its generalisation power, namely on another Twitter data set (Tom Davidson, 2017) (Davidsonet al., 2017) and the data set HASOC 2019 (Thomas Mandl, 2019) (Mandl et al.,2019) which includes Twitter and also Facebook comments; we focus on the English HASOC 2019 data. In addition, it can be shown that by fine-tuning the trained BERT model on these two datasets by applying different transfer learning scenarios via retraining partial or all layers the predictive scores improve compared to simply applying the model pre-trained on the JIGSAW data set. Withour results, we get precisions from 64% to around 90% while still achieving acceptable recall values of at least lower 60s%, proving that BERT is suitable for real usecases in social platforms.