Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Gated Recurrent Networks for Seizure Detection

Jan 03, 2018
Meysam Golmohammadi, Saeedeh Ziyabari, Vinit Shah, Eva Von Weltin, Christopher Campbell, Iyad Obeid, Joseph Picone

Recurrent Neural Networks (RNNs) with sophisticated units that implement a gating mechanism have emerged as powerful technique for modeling sequential signals such as speech or electroencephalography (EEG). The latter is the focus on this paper. A significant big data resource, known as the TUH EEG Corpus (TUEEG), has recently become available for EEG research, creating a unique opportunity to evaluate these recurrent units on the task of seizure detection. In this study, we compare two types of recurrent units: long short-term memory units (LSTM) and gated recurrent units (GRU). These are evaluated using a state of the art hybrid architecture that integrates Convolutional Neural Networks (CNNs) with RNNs. We also investigate a variety of initialization methods and show that initialization is crucial since poorly initialized networks cannot be trained. Furthermore, we explore regularization of these convolutional gated recurrent networks to address the problem of overfitting. Our experiments revealed that convolutional LSTM networks can achieve significantly better performance than convolutional GRU networks. The convolutional LSTM architecture with proper initialization and regularization delivers 30% sensitivity at 6 false alarms per 24 hours.

* Published in Dec 2017 publication In IEEE Signal Processing in Medicine and Biology Symposium. Philadelphia, Pennsylvania, USA. arXiv admin note: text overlap with arXiv:1712.09776 

  Access Paper or Ask Questions

Multimodal Affect Recognition using Kinect

Jul 09, 2016
Amol Patwardhan, Gerald Knapp

Affect (emotion) recognition has gained significant attention from researchers in the past decade. Emotion-aware computer systems and devices have many applications ranging from interactive robots, intelligent online tutor to emotion based navigation assistant. In this research data from multiple modalities such as face, head, hand, body and speech was utilized for affect recognition. The research used color and depth sensing device such as Kinect for facial feature extraction and tracking human body joints. Temporal features across multiple frames were used for affect recognition. Event driven decision level fusion was used to combine the results from each individual modality using majority voting to recognize the emotions. The study also implemented affect recognition by matching the features to the rule based emotion templates per modality. Experiments showed that multimodal affect recognition rates using combination of emotion templates and supervised learning were better compared to recognition rates based on supervised learning alone. Recognition rates obtained using temporal feature were higher compared to recognition rates obtained using position based features only.

* 9 pages, 2 tables, 1 figure, Peer reviewed in ACM TIST 

  Access Paper or Ask Questions

Improving Automated Patent Claim Parsing: Dataset, System, and Experiments

May 05, 2016
Mengke Hu, David Cinciruk, John MacLaren Walsh

Off-the-shelf natural language processing software performs poorly when parsing patent claims owing to their use of irregular language relative to the corpora built from news articles and the web typically utilized to train this software. Stopping short of the extensive and expensive process of accumulating a large enough dataset to completely retrain parsers for patent claims, a method of adapting existing natural language processing software towards patent claims via forced part of speech tag correction is proposed. An Amazon Mechanical Turk collection campaign organized to generate a public corpus to train such an improved claim parsing system is discussed, identifying lessons learned during the campaign that can be of use in future NLP dataset collection campaigns with AMT. Experiments utilizing this corpus and other patent claim sets measure the parsing performance improvement garnered via the claim parsing system. Finally, the utility of the improved claim parsing system within other patent processing applications is demonstrated via experiments showing improved automated patent subject classification when the new claim parsing system is utilized to generate the features.

  Access Paper or Ask Questions

Semi-supervised and Unsupervised Methods for Categorizing Posts in Web Discussion Forums

Apr 24, 2016
Krish Perumal

Web discussion forums are used by millions of people worldwide to share information belonging to a variety of domains such as automotive vehicles, pets, sports, etc. They typically contain posts that fall into different categories such as problem, solution, feedback, spam, etc. Automatic identification of these categories can aid information retrieval that is tailored for specific user requirements. Previously, a number of supervised methods have attempted to solve this problem; however, these depend on the availability of abundant training data. A few existing unsupervised and semi-supervised approaches are either focused on identifying a single category or do not report category-specific performance. In contrast, this work proposes unsupervised and semi-supervised methods that require no or minimal training data to achieve this objective without compromising on performance. A fine-grained analysis is also carried out to discuss their limitations. The proposed methods are based on sequence models (specifically, Hidden Markov Models) that can model language for each category using word and part-of-speech probability distributions, and manually specified features. Empirical evaluations across domains demonstrate that the proposed methods are better suited for this task than existing ones.

  Access Paper or Ask Questions

Statistical and Computational Guarantees for the Baum-Welch Algorithm

Dec 27, 2015
Fanny Yang, Sivaraman Balakrishnan, Martin J. Wainwright

The Hidden Markov Model (HMM) is one of the mainstays of statistical modeling of discrete time series, with applications including speech recognition, computational biology, computer vision and econometrics. Estimating an HMM from its observation process is often addressed via the Baum-Welch algorithm, which is known to be susceptible to local optima. In this paper, we first give a general characterization of the basin of attraction associated with any global optimum of the population likelihood. By exploiting this characterization, we provide non-asymptotic finite sample guarantees on the Baum-Welch updates, guaranteeing geometric convergence to a small ball of radius on the order of the minimax rate around a global optimum. As a concrete example, we prove a linear rate of convergence for a hidden Markov mixture of two isotropic Gaussians given a suitable mean separation and an initialization within a ball of large radius around (one of) the true parameters. To our knowledge, these are the first rigorous local convergence guarantees to global optima for the Baum-Welch algorithm in a setting where the likelihood function is nonconvex. We complement our theoretical results with thorough numerical simulations studying the convergence of the Baum-Welch algorithm and illustrating the accuracy of our predictions.

  Access Paper or Ask Questions

sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings

Nov 19, 2015
Andrew Trask, Phil Michalak, John Liu

Neural word representations have proven useful in Natural Language Processing (NLP) tasks due to their ability to efficiently model complex semantic and syntactic word relationships. However, most techniques model only one representation per word, despite the fact that a single word can have multiple meanings or "senses". Some techniques model words by using multiple vectors that are clustered based on context. However, recent neural approaches rarely focus on the application to a consuming NLP algorithm. Furthermore, the training process of recent word-sense models is expensive relative to single-sense embedding processes. This paper presents a novel approach which addresses these concerns by modeling multiple embeddings for each word based on supervised disambiguation, which provides a fast and accurate way for a consuming NLP model to select a sense-disambiguated embedding. We demonstrate that these embeddings can disambiguate both contrastive senses such as nominal and verbal senses as well as nuanced senses such as sarcasm. We further evaluate Part-of-Speech disambiguated embeddings on neural dependency parsing, yielding a greater than 8% average error reduction in unlabeled attachment scores across 6 languages.

  Access Paper or Ask Questions

Neural Token Segmentation for High Token-Internal Complexity

Mar 21, 2022
Idan Brusilovsky, Reut Tsarfaty

Tokenizing raw texts into word units is an essential pre-processing step for critical tasks in the NLP pipeline such as tagging, parsing, named entity recognition, and more. For most languages, this tokenization step straightforward. However, for languages with high token-internal complexity, further token-to-word segmentation is required. Previous canonical segmentation studies were based on character-level frameworks, with no contextualised representation involved. Contextualized vectors a la BERT show remarkable results in many applications, but were not shown to improve performance on linguistic segmentation per se. Here we propose a novel neural segmentation model which combines the best of both worlds, contextualised token representation and char-level decoding, which is particularly effective for languages with high token-internal complexity and extreme morphological ambiguity. Our model shows substantial improvements in segmentation accuracy on Hebrew and Arabic compared to the state-of-the-art, and leads to further improvements on downstream tasks such as Part-of-Speech Tagging, Dependency Parsing and Named-Entity Recognition, over existing pipelines. When comparing our segmentation-first pipeline with joint segmentation and labeling in the same settings, we show that, contrary to pre-neural studies, the pipeline performance is superior.

  Access Paper or Ask Questions

Automatic COVID-19 disease diagnosis using 1D convolutional neural network and augmentation with human respiratory sound based on parameters: cough, breath, and voice

Dec 14, 2021
Kranthi Kumar Lella, Alphonse Pja

The issue in respiratory sound classification has attained good attention from the clinical scientists and medical researcher's group in the last year to diagnosing COVID-19 disease. To date, various models of Artificial Intelligence (AI) entered into the real-world to detect the COVID-19 disease from human-generated sounds such as voice/speech, cough, and breath. The Convolutional Neural Network (CNN) model is implemented for solving a lot of real-world problems on machines based on Artificial Intelligence (AI). In this context, one dimension (1D) CNN is suggested and implemented to diagnose respiratory diseases of COVID-19 from human respiratory sounds such as a voice, cough, and breath. An augmentation-based mechanism is applied to improve the preprocessing performance of the COVID-19 sounds dataset and to automate COVID-19 disease diagnosis using the 1D convolutional network. Furthermore, a DDAE (Data De-noising Auto Encoder) technique is used to generate deep sound features such as the input function to the 1D CNN instead of adopting the standard input of MFCC (Mel-frequency cepstral coefficient), and it is performed better accuracy and performance than previous models.

* AIMS Public Health. 2021;8(2):240-264 

  Access Paper or Ask Questions

Self-Normalized Importance Sampling for Neural Language Modeling

Nov 11, 2021
Zijian Yang, Yingbo Gao, Alexander Gerstenberger, Jintao Jiang, Ralf Schlüter, Hermann Ney

To mitigate the problem of having to traverse over the full vocabulary in the softmax normalization of a neural language model, sampling-based training criteria are proposed and investigated in the context of large vocabulary word-based neural language models. These training criteria typically enjoy the benefit of faster training and testing, at a cost of slightly degraded performance in terms of perplexity and almost no visible drop in word error rate. While noise contrastive estimation is one of the most popular choices, recently we show that other sampling-based criteria can also perform well, as long as an extra correction step is done, where the intended class posterior probability is recovered from the raw model outputs. In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered in this work are self-normalized and there is no need to further conduct a correction step. Compared to noise contrastive estimation, our method is directly comparable in terms of complexity in application. Through self-normalized language model training as well as lattice rescoring experiments, we show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.

* submitted to ICASSP 2022 

  Access Paper or Ask Questions

Dehumanizing Voice Technology: Phonetic & Experiential Consequences of Restricted Human-Machine Interaction

Nov 02, 2021
Christian Hildebrand, Donna Hoffman, Tom Novak

The use of natural language and voice-based interfaces gradu-ally transforms how consumers search, shop, and express their preferences. The current work explores how changes in the syntactical structure of the interaction with conversational interfaces (command vs. request based expression modalities) negatively affects consumers' subjective task enjoyment and systematically alters objective vocal features in the human voice. We show that requests (vs. commands) lead to an in-crease in phonetic convergence and lower phonetic latency, and ultimately a more natural task experience for consumers. To the best of our knowledge, this is the first work docu-menting that altering the input modality of how consumers interact with smart objects systematically affects consumers' IoT experience. We provide evidence that altering the required input to initiate a conversation with smart objects provokes systematic changes both in terms of consumers' subjective experience and objective phonetic changes in the human voice. The current research also makes a methodological con-tribution by highlighting the unexplored potential of feature extraction in human voice as a novel data format linking consumers' vocal features during speech formation and their sub-jective task experiences.

  Access Paper or Ask Questions