Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Petrophysical property estimation from seismic data using recurrent neural networks

Jan 24, 2019
Motaz Alfarraj, Ghassan AlRegib

Reservoir characterization involves the estimation petrophysical properties from well-log data and seismic data. Estimating such properties is a challenging task due to the non-linearity and heterogeneity of the subsurface. Various attempts have been made to estimate petrophysical properties using machine learning techniques such as feed-forward neural networks and support vector regression (SVR). Recent advances in machine learning have shown promising results for recurrent neural networks (RNN) in modeling complex sequential data such as videos and speech signals. In this work, we propose an algorithm for property estimation from seismic data using recurrent neural networks. An applications of the proposed workflow to estimate density and p-wave impedance using seismic data shows promising results compared to feed-forward neural networks.

* Published in SEG Technical Program Expanded Abstracts 201 

  Access Paper or Ask Questions

Towards end-to-end spoken language understanding

Feb 23, 2018
Dmitriy Serdyuk, Yongqiang Wang, Christian Fuegen, Anuj Kumar, Baiyang Liu, Yoshua Bengio

Spoken language understanding system is traditionally designed as a pipeline of a number of components. First, the audio signal is processed by an automatic speech recognizer for transcription or n-best hypotheses. With the recognition results, a natural language understanding system classifies the text to structured data as domain, intent and slots for down-streaming consumers, such as dialog system, hands-free applications. These components are usually developed and optimized independently. In this paper, we present our study on an end-to-end learning system for spoken language understanding. With this unified approach, we can infer the semantic meaning directly from audio features without the intermediate text representation. This study showed that the trained model can achieve reasonable good result and demonstrated that the model can capture the semantic attention directly from the audio features.

* submitted to ICASSP 2018 

  Access Paper or Ask Questions

Monotonic Chunkwise Attention

Feb 23, 2018
Chung-Cheng Chiu, Colin Raffel

Sequence-to-sequence models with soft attention have been successfully applied to a wide variety of problems, but their decoding process incurs a quadratic time and space cost and is inapplicable to real-time sequence transduction. To address these issues, we propose Monotonic Chunkwise Attention (MoChA), which adaptively splits the input sequence into small chunks over which soft attention is computed. We show that models utilizing MoChA can be trained efficiently with standard backpropagation while allowing online and linear-time decoding at test time. When applied to online speech recognition, we obtain state-of-the-art results and match the performance of a model using an offline soft attention mechanism. In document summarization experiments where we do not expect monotonic alignments, we show significantly improved performance compared to a baseline monotonic attention-based model.

* ICLR camera-ready version 

  Access Paper or Ask Questions

Learning Robust Dialog Policies in Noisy Environments

Dec 11, 2017
Maryam Fazel-Zarandi, Shang-Wen Li, Jin Cao, Jared Casale, Peter Henderson, David Whitney, Alborz Geramifard

Modern virtual personal assistants provide a convenient interface for completing daily tasks via voice commands. An important consideration for these assistants is the ability to recover from automatic speech recognition (ASR) and natural language understanding (NLU) errors. In this paper, we focus on learning robust dialog policies to recover from these errors. To this end, we develop a user simulator which interacts with the assistant through voice commands in realistic scenarios with noisy audio, and use it to learn dialog policies through deep reinforcement learning. We show that dialogs generated by our simulator are indistinguishable from human generated dialogs, as determined by human evaluators. Furthermore, preliminary experimental results show that the learned policies in noisy environments achieve the same execution success rate with fewer dialog turns compared to fixed rule-based policies.

* 1st Workshop on Conversational AI at NIPS 2017 

  Access Paper or Ask Questions

Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework

Jun 12, 2017
Xiaohui Zhang, Vimal Manohar, Daniel Povey, Sanjeev Khudanpur

Speech recognition systems for irregularly-spelled languages like English normally require hand-written pronunciations. In this paper, we describe a system for automatically obtaining pronunciations of words for which pronunciations are not available, but for which transcribed data exists. Our method integrates information from the letter sequence and from the acoustic evidence. The novel aspect of the problem that we address is the problem of how to prune entries from such a lexicon (since, empirically, lexicons with too many entries do not tend to be good for ASR performance). Experiments on various ASR tasks show that, with the proposed framework, starting with an initial lexicon of several thousand words, we are able to learn a lexicon which performs close to a full expert lexicon in terms of WER performance on test data, and is better than lexicons built using G2P alone or with a pruning criterion based on pronunciation probability.

  Access Paper or Ask Questions

Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers

May 31, 2017
Ken Hoover, Sourish Chaudhuri, Caroline Pantofaru, Malcolm Slaney, Ian Sturdy

In this paper, we present a system that associates faces with voices in a video by fusing information from the audio and visual signals. The thesis underlying our work is that an extremely simple approach to generating (weak) speech clusters can be combined with visual signals to effectively associate faces and voices by aggregating statistics across a video. This approach does not need any training data specific to this task and leverages the natural coherence of information in the audio and visual streams. It is particularly applicable to tracking speakers in videos on the web where a priori information about the environment (e.g., number of speakers, spatial signals for beamforming) is not available. We performed experiments on a real-world dataset using this analysis framework to determine the speaker in a video. Given a ground truth labeling determined by human rater consensus, our approach had ~71% accuracy.

  Access Paper or Ask Questions

An Overview on Data Representation Learning: From Traditional Feature Learning to Recent Deep Learning

Nov 25, 2016
Guoqiang Zhong, Li-Na Wang, Junyu Dong

Since about 100 years ago, to learn the intrinsic structure of data, many representation learning approaches have been proposed, including both linear ones and nonlinear ones, supervised ones and unsupervised ones. Particularly, deep architectures are widely applied for representation learning in recent years, and have delivered top results in many tasks, such as image classification, object detection and speech recognition. In this paper, we review the development of data representation learning methods. Specifically, we investigate both traditional feature learning algorithms and state-of-the-art deep learning models. The history of data representation learning is introduced, while available resources (e.g. online course, tutorial and book information) and toolboxes are provided. Finally, we conclude this paper with remarks and some interesting research directions on data representation learning.

* About 20 pages. Submitted to Journal of Finance and Data Science as an invited paper 

  Access Paper or Ask Questions

openXBOW - Introducing the Passau Open-Source Crossmodal Bag-of-Words Toolkit

May 22, 2016
Maximilian Schmitt, Björn W. Schuller

We introduce openXBOW, an open-source toolkit for the generation of bag-of-words (BoW) representations from multimodal input. In the BoW principle, word histograms were first used as features in document classification, but the idea was and can easily be adapted to, e.g., acoustic or visual low-level descriptors, introducing a prior step of vector quantisation. The openXBOW toolkit supports arbitrary numeric input features and text input and concatenates computed subbags to a final bag. It provides a variety of extensions and options. To our knowledge, openXBOW is the first publicly available toolkit for the generation of crossmodal bags-of-words. The capabilities of the tool are exemplified in two sample scenarios: time-continuous speech-based emotion recognition and sentiment analysis in tweets where improved results over other feature representation forms were observed.

* 9 pages, 1 figure, pre-print 

  Access Paper or Ask Questions

Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model

May 12, 2016
Kyunghyun Cho

Recent advances in conditional recurrent language modelling have mainly focused on network architectures (e.g., attention mechanism), learning algorithms (e.g., scheduled sampling and sequence-level training) and novel applications (e.g., image/video description generation, speech recognition, etc.) On the other hand, we notice that decoding algorithms/strategies have not been investigated as much, and it has become standard to use greedy or beam search. In this paper, we propose a novel decoding strategy motivated by an earlier observation that nonlinear hidden layers of a deep neural network stretch the data manifold. The proposed strategy is embarrassingly parallelizable without any communication overhead, while improving an existing decoding algorithm. We extensively evaluate it with attention-based neural machine translation on the task of En->Cz translation.

  Access Paper or Ask Questions

Multilingual Language Processing From Bytes

Apr 02, 2016
Dan Gillick, Cliff Brunk, Oriol Vinyals, Amarnag Subramanya

We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads text as bytes and outputs span annotations of the form [start, length, label] where start positions, lengths, and labels are separate entries in our vocabulary. Because we operate directly on unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model. Due to the small vocabulary size, these multilingual models are very compact, but produce results similar to or better than the state-of- the-art in Part-of-Speech tagging and Named Entity Recognition that use only the provided training datasets (no external data sources). Our models are learning "from scratch" in that they do not rely on any elements of the standard pipeline in Natural Language Processing (including tokenization), and thus can run in standalone fashion on raw text.

  Access Paper or Ask Questions