Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Enriching Rare Word Representations in Neural Language Models by Embedding Matrix Augmentation

Apr 08, 2019
Yerbolat Khassanov, Zhiping Zeng, Van Tung Pham, Haihua Xu, Eng Siong Chng

The neural language models (NLM) achieve strong generalization capability by learning the dense representation of words and using them to estimate probability distribution function. However, learning the representation of rare words is a challenging problem causing the NLM to produce unreliable probability estimates. To address this problem, we propose a method to enrich representations of rare words in pre-trained NLM and consequently improve its probability estimation performance. The proposed method augments the word embedding matrices of pre-trained NLM while keeping other parameters unchanged. Specifically, our method updates the embedding vectors of rare words using embedding vectors of other semantically and syntactically similar words. To evaluate the proposed method, we enrich the rare street names in the pre-trained NLM and use it to rescore 100-best hypotheses output from the Singapore English speech recognition system. The enriched NLM reduces the word error rate by 6% relative and improves the recognition accuracy of the rare words by 16% absolute as compared to the baseline NLM.

* 5 pages, 2 figures, submitted to INTERSPEECH 2019 

  Access Paper or Ask Questions

Adaptive Grasp Control through Multi-Modal Interactions for Assistive Prosthetic Devices

Oct 18, 2018
Michelle Esponda, Thomas M. Howard

The hand is one of the most complex and important parts of the human body. The dexterity provided by its multiple degrees of freedom enables us to perform many of the tasks of daily living which involve grasping and manipulating objects of interest. Contemporary prosthetic devices for people with transradial amputations or wrist disarticulation vary in complexity, from passive prosthetics to complex devices that are body or electrically driven. One of the important challenges in developing smart prosthetic hands is to create devices which are able to mimic all activities that a person might perform and address the needs of a wide variety of users. The approach explored here is to develop algorithms that permit a device to adapt its behavior to the preferences of the operator through interactions with the wearer. This device uses multiple sensing modalities including muscle activity from a myoelectric armband, visual information from an on-board camera, tactile input through a touchscreen interface, and speech input from an embedded microphone. Presented within this paper are the design, software and controls of a platform used to evaluate this architecture as well as results from experiments deigned to quantify the performance.

* Presented at AI-HRI AAAI-FSS, 2018 (arXiv:1809.06606) 

  Access Paper or Ask Questions

Using Rule-Based Labels for Weak Supervised Learning: A ChemNet for Transferable Chemical Property Prediction

Mar 18, 2018
Garrett B. Goh, Charles Siegel, Abhinav Vishnu, Nathan O. Hodas

With access to large datasets, deep neural networks (DNN) have achieved human-level accuracy in image and speech recognition tasks. However, in chemistry, data is inherently small and fragmented. In this work, we develop an approach of using rule-based knowledge for training ChemNet, a transferable and generalizable deep neural network for chemical property prediction that learns in a weak-supervised manner from large unlabeled chemical databases. When coupled with transfer learning approaches to predict other smaller datasets for chemical properties that it was not originally trained on, we show that ChemNet's accuracy outperforms contemporary DNN models that were trained using conventional supervised learning. Furthermore, we demonstrate that the ChemNet pre-training approach is equally effective on both CNN (Chemception) and RNN (SMILES2vec) models, indicating that this approach is network architecture agnostic and is effective across multiple data modalities. Our results indicate a pre-trained ChemNet that incorporates chemistry domain knowledge, enables the development of generalizable neural networks for more accurate prediction of novel chemical properties.

* Submitted to SIGKDD 2018 

  Access Paper or Ask Questions

Phonetic and Graphemic Systems for Multi-Genre Broadcast Transcription

Feb 01, 2018
Yu Wang, Xie Chen, Mark Gales, Anton Ragni, Jeremy Wong

State-of-the-art English automatic speech recognition systems typically use phonetic rather than graphemic lexicons. Graphemic systems are known to perform less well for English as the mapping from the written form to the spoken form is complicated. However, in recent years the representational power of deep-learning based acoustic models has improved, raising interest in graphemic acoustic models for English, due to the simplicity of generating the lexicon. In this paper, phonetic and graphemic models are compared for an English Multi-Genre Broadcast transcription task. A range of acoustic models based on lattice-free MMI training are constructed using phonetic and graphemic lexicons. For this task, it is found that having a long-span temporal history reduces the difference in performance between the two forms of models. In addition, system combination is examined, using parameter smoothing and hypothesis combination. As the combination approaches become more complicated the difference between the phonetic and graphemic systems further decreases. Finally, for all configurations examined the combination of phonetic and graphemic systems yields consistent gains.

* 5 pages, 6 tables, to appear in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018) 

  Access Paper or Ask Questions

Gaussian Quadrature for Kernel Features

Jan 31, 2018
Tri Dao, Christopher De Sa, Christopher Ré

Kernel methods have recently attracted resurgent interest, showing performance competitive with deep neural networks in tasks such as speech recognition. The random Fourier features map is a technique commonly used to scale up kernel machines, but employing the randomized feature map means that $O(\epsilon^{-2})$ samples are required to achieve an approximation error of at most $\epsilon$. We investigate some alternative schemes for constructing feature maps that are deterministic, rather than random, by approximating the kernel in the frequency domain using Gaussian quadrature. We show that deterministic feature maps can be constructed, for any $\gamma > 0$, to achieve error $\epsilon$ with $O(e^{e^\gamma} + \epsilon^{-1/\gamma})$ samples as $\epsilon$ goes to 0. Our method works particularly well with sparse ANOVA kernels, which are inspired by the convolutional layer of CNNs. We validate our methods on datasets in different domains, such as MNIST and TIMIT, showing that deterministic features are faster to generate and achieve accuracy comparable to the state-of-the-art kernel methods based on random Fourier features.

* Neural Information Processing Systems (NIPS) 2017 

  Access Paper or Ask Questions

A Sequential Model for Classifying Temporal Relations between Intra-Sentence Events

Jul 23, 2017
Prafulla Kumar Choubey, Ruihong Huang

We present a sequential model for temporal relation classification between intra-sentence events. The key observation is that the overall syntactic structure and compositional meanings of the multi-word context between events are important for distinguishing among fine-grained temporal relations. Specifically, our approach first extracts a sequence of context words that indicates the temporal relation between two events, which well align with the dependency path between two event mentions. The context word sequence, together with a parts-of-speech tag sequence and a dependency relation sequence that are generated corresponding to the word sequence, are then provided as input to bidirectional recurrent neural network (LSTM) models. The neural nets learn compositional syntactic and semantic representations of contexts surrounding the two events and predict the temporal relation between them. Evaluation of the proposed approach on TimeBank corpus shows that sequential modeling is capable of accurately recognizing temporal relations between events, which outperforms a neural net model using various discrete features as input that imitates previous feature based models.

* EMNLP 2017 

  Access Paper or Ask Questions

ASR error management for improving spoken language understanding

May 26, 2017
Edwin Simonnet, Sahar Ghannay, Nathalie Camelin, Yannick Estève, Renato De Mori

This paper addresses the problem of automatic speech recognition (ASR) error detection and their use for improving spoken language understanding (SLU) systems. In this study, the SLU task consists in automatically extracting, from ASR transcriptions , semantic concepts and concept/values pairs in a e.g touristic information system. An approach is proposed for enriching the set of semantic labels with error specific labels and by using a recently proposed neural approach based on word embeddings to compute well calibrated ASR confidence measures. Experimental results are reported showing that it is possible to decrease significantly the Concept/Value Error Rate with a state of the art system, outperforming previously published results performance on the same experimental data. It also shown that combining an SLU approach based on conditional random fields with a neural encoder/decoder attention based architecture , it is possible to effectively identifying confidence islands and uncertain semantic output segments useful for deciding appropriate error handling actions by the dialogue manager strategy .

* Interspeech 2017, Aug 2017, Stockholm, Sweden. 2017 

  Access Paper or Ask Questions

Hierarchical Attention Model for Improved Machine Comprehension of Spoken Content

Jan 01, 2017
Wei Fang, Jui-Yang Hsu, Hung-yi Lee, Lin-Shan Lee

Multimedia or spoken content presents more attractive information than plain text content, but the former is more difficult to display on a screen and be selected by a user. As a result, accessing large collections of the former is much more difficult and time-consuming than the latter for humans. It's therefore highly attractive to develop machines which can automatically understand spoken content and summarize the key information for humans to browse over. In this endeavor, a new task of machine comprehension of spoken content was proposed recently. The initial goal was defined as the listening comprehension test of TOEFL, a challenging academic English examination for English learners whose native languages are not English. An Attention-based Multi-hop Recurrent Neural Network (AMRNN) architecture was also proposed for this task, which considered only the sequential relationship within the speech utterances. In this paper, we propose a new Hierarchical Attention Model (HAM), which constructs multi-hopped attention mechanism over tree-structured rather than sequential representations for the utterances. Improved comprehension performance robust with respect to ASR errors were obtained.

* Copyright 2016 IEEE. Published in the 2016 IEEE Workshop on Spoken Language Technology (SLT 2016) 

  Access Paper or Ask Questions

Incorporating Belief Function in SVM for Phoneme Recognition

Jul 22, 2015
Rimah Amami, Dorra Ben Ayed, Nouerddine Ellouze

The Support Vector Machine (SVM) method has been widely used in numerous classification tasks. The main idea of this algorithm is based on the principle of the margin maximization to find an hyperplane which separates the data into two different classes.In this paper, SVM is applied to phoneme recognition task. However, in many real-world problems, each phoneme in the data set for recognition problems may differ in the degree of significance due to noise, inaccuracies, or abnormal characteristics; All those problems can lead to the inaccuracies in the prediction phase. Unfortunately, the standard formulation of SVM does not take into account all those problems and, in particular, the variation in the speech input. This paper presents a new formulation of SVM (B-SVM) that attributes to each phoneme a confidence degree computed based on its geometric position in the space. Then, this degree is used in order to strengthen the class membership of the tested phoneme. Hence, we introduce a reformulation of the standard SVM that incorporates the degree of belief. Experimental performance on TIMIT database shows the effectiveness of the proposed method B-SVM on a phoneme recognition problem.

* Lecture Notes in Computer Science Volume 8480, 2014, pp 191-199 
* 9th International Conference, Hybrid Artificial Intelligence Systems, Salamanca, Spain, June 11-13, 2014 

  Access Paper or Ask Questions

Visual Passwords Using Automatic Lip Reading

Sep 02, 2014
Ahmad Basheer Hassanat

This paper presents a visual passwords system to increase security. The system depends mainly on recognizing the speaker using the visual speech signal alone. The proposed scheme works in two stages: setting the visual password stage and the verification stage. At the setting stage the visual passwords system request the user to utter a selected password, a video recording of the user face is captured, and processed by a special words-based VSR system which extracts a sequence of feature vectors. In the verification stage, the same procedure is executed, the features will be sent to be compared with the stored visual password. The proposed scheme has been evaluated using a video database of 20 different speakers (10 females and 10 males), and 15 more males in another video database with different experiment sets. The evaluation has proved the system feasibility, with average error rate in the range of 7.63% to 20.51% at the worst tested scenario, and therefore, has potential to be a practical approach with the support of other conventional authentication methods such as the use of usernames and passwords.

* International Journal of Sciences: Basic and Applied Research (IJSBAR) (2014) Volume 13, No 1, pp 218-231 

  Access Paper or Ask Questions