End-to-end approaches for automatic speech recognition (ASR) benefit from directly modeling the probability of the word sequence given the input audio stream in a single neural network. However, compared to conventional ASR systems, these models typically require more data to achieve comparable results. Well-known model adaptation techniques, to account for domain and style adaptation, are not easily applicable to end-to-end systems. Conventional HMM-based systems, on the other hand, have been optimized for various production environments and use cases. In this work, we propose to combine the benefits of end-to-end approaches with a conventional system using an attention-based discriminative language model that learns to rescore the output of a first-pass ASR system. We show that learning to rescore a list of potential ASR outputs is much simpler than learning to generate the hypothesis. The proposed model results in 8% improvement in word error rate even when the amount of training data is a fraction of data used for training the first-pass system.
In recent times, with the increase of Artificial Neural Network (ANN), deep learning has brought a dramatic twist in the field of machine learning by making it more artificially intelligent. Deep learning is remarkably used in vast ranges of fields because of its diverse range of applications such as surveillance, health, medicine, sports, robotics, drones, etc. In deep learning, Convolutional Neural Network (CNN) is at the center of spectacular advances that mixes Artificial Neural Network (ANN) and up to date deep learning strategies. It has been used broadly in pattern recognition, sentence classification, speech recognition, face recognition, text categorization, document analysis, scene, and handwritten digit recognition. The goal of this paper is to observe the variation of accuracies of CNN to classify handwritten digits using various numbers of hidden layers and epochs and to make the comparison between the accuracies. For this performance evaluation of CNN, we performed our experiment using Modified National Institute of Standards and Technology (MNIST) dataset. Further, the network is trained using stochastic gradient descent and the backpropagation algorithm.
In Indian Languages , native speakers are able to understand new words formed by either combining or modifying root words with tense and / or gender. Due to data insufficiency, Automatic Speech Recognition system (ASR) may not accommodate all the words in the language model irrespective of the size of the text corpus. It also becomes computationally challenging if the volume of the data increases exponentially due to morphological changes to the root word. In this paper a new unsupervised method is proposed for a Indian language: Telugu, based on the unsupervised method for Hindi, to generate the Out of Vocabulary (OOV) words in the language model. By using techniques like smoothing and interpolation of pre-processed data with supervised and unsupervised stemming, different issues in language model for Indian language: Telugu has been addressed. We observe that the smoothing techniques Witten-Bell and Kneser-Ney perform well when compared to other techniques on pre-processed data from supervised learning. The ASRs accuracy is improved by 0.76% and 0.94% with supervised and unsupervised stemming respectively.
During the last fifteen years, text scaling approaches have become a central element for the text-as-data community. However, they are based on the assumption that latent positions can be captured just by modeling word-frequency information from the different documents under study. We challenge this by presenting a new semantically aware unsupervised scaling algorithm, SemScale, which relies upon distributional representations of the documents under study. We conduct an extensive quantitative analysis over a collection of speeches from the European Parliament in five different languages and from two different legislations, in order to understand whether a) an approach that is aware of semantics would better capture known underlying political dimensions compared to a frequency-based scaling method, b) such positioning correlates in particular with a specific subset of linguistic traits, compared to the use of the entire text, and c) these findings hold across different languages. To support further research on this new branch of text scaling approaches, we release the employed dataset and evaluation setting, an easy-to-use online demo, and a Python implementation of SemScale.
We propose the factorized action variational autoencoder (FAVAE), a state-of-the-art generative model for learning disentangled and interpretable representations from sequential data via the information bottleneck without supervision. The purpose of disentangled representation learning is to obtain interpretable and transferable representations from data. We focused on the disentangled representation of sequential data since there is a wide range of potential applications if disentanglement representation is extended to sequential data such as video, speech, and stock market. Sequential data are characterized by dynamic and static factors: dynamic factors are time dependent, and static factors are independent of time. Previous models disentangle static and dynamic factors by explicitly modeling the priors of latent variables to distinguish between these factors. However, these models cannot disentangle representations between dynamic factors, such as disentangling "picking up" and "throwing" in robotic tasks. FAVAE can disentangle multiple dynamic factors. Since it does not require modeling priors, it can disentangle "between" dynamic factors. We conducted experiments to show that FAVAE can extract disentangled dynamic factors.
In this paper, we present an end-to-end language identification framework, the attention-based Convolutional Neural Network-Bidirectional Long-short Term Memory (CNN-BLSTM). The model is performed on the utterance level, which means the utterance-level decision can be directly obtained from the output of the neural network. To handle speech utterances with entire arbitrary and potentially long duration, we combine CNN-BLSTM model with a self-attentive pooling layer together. The front-end CNN-BLSTM module plays a role as local pattern extractor for the variable-length inputs, and the following self-attentive pooling layer is built on top to get the fixed-dimensional utterance-level representation. We conducted experiments on NIST LRE07 closed-set task, and the results reveal that the proposed attention-based CNN-BLSTM model achieves comparable error reduction with other state-of-the-art utterance-level neural network approaches for all 3 seconds, 10 seconds, 30 seconds duration tasks.
Classifying semantic relations between entity pairs in sentences is an important task in Natural Language Processing (NLP). Most previous models for relation classification rely on the high-level lexical and syntactic features obtained by NLP tools such as WordNet, dependency parser, part-of-speech (POS) tagger, and named entity recognizers (NER). In addition, state-of-the-art neural models based on attention mechanisms do not fully utilize information of entity that may be the most crucial features for relation classification. To address these issues, we propose a novel end-to-end recurrent neural model which incorporates an entity-aware attention mechanism with a latent entity typing (LET) method. Our model not only utilizes entities and their latent types as features effectively but also is more interpretable by visualizing attention mechanisms applied to our model and results of LET. Experimental results on the SemEval-2010 Task 8, one of the most popular relation classification task, demonstrate that our model outperforms existing state-of-the-art models without any high-level features.
The standard approach to mitigate errors made by an automatic speech recognition system is to use confidence scores associated with each predicted word. In the simplest case, these scores are word posterior probabilities whilst more complex schemes utilise bi-directional recurrent neural network (BiRNN) models. A number of upstream and downstream applications, however, rely on confidence scores assigned not only to 1-best hypotheses but to all words found in confusion networks or lattices. These include but are not limited to speaker adaptation, semi-supervised training and information retrieval. Although word posteriors could be used in those applications as confidence scores, they are known to have reliability issues. To make improved confidence scores more generally available, this paper shows how BiRNNs can be extended from 1-best sequences to confusion network and lattice structures. Experiments are conducted using one of the Cambridge University submissions to the IARPA OpenKWS 2016 competition. The results show that confusion network and lattice-based BiRNNs can provide a significant improvement in confidence estimation.
In recent years, neural networks have demonstrated outstanding effectiveness in a large amount of applications.However, recent works have shown that neural networks are susceptible to adversarial examples, indicating possible flaws intrinsic to the network structures. To address this problem and improve the robustness of neural networks, we investigate the fundamental mechanisms behind adversarial examples and propose a novel robust training method via regulating adversarial gradients. The regulation effectively squeezes the adversarial gradients of neural networks and significantly increases the difficulty of adversarial example generation.Without any adversarial example involved, the robust training method could generate naturally robust networks, which are near-immune to various types of adversarial examples. Experiments show the naturally robust networks can achieve optimal accuracy against Fast Gradient Sign Method (FGSM) and C\&W attacks on MNIST, Cifar10, and Google Speech Command dataset. Moreover, our proposed method also provides neural networks with consistent robustness against transferable attacks.