George Saon

A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition

Jul 10, 2019

Wei Zhang, Xiaodong Cui, Ulrich Finkler, George Saon, Abdullah Kayi, Alper Buyuktosunoglu, Brian Kingsbury, David Kung, Michael Picheny

Abstract:Modern Automatic Speech Recognition (ASR) systems rely on distributed deep learning for quick training completion. To enable efficient distributed training, it is imperative that the training algorithms converge with a large mini-batch size. In this work, we discovered that Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD) can work with a much larger batch size than the commonly used Synchronous SGD (SSGD) algorithm. On the commonly used public SWB-300 and SWB-2000 ASR datasets, ADPSGD can converge with a batch size 3X as large as the one used in SSGD, thus enabling training at a much larger scale. Further, we proposed a Hierarchical-ADPSGD (H-ADPSGD) system in which learners on the same computing node construct a super learner via a fast allreduce implementation, and super learners deploy the ADPSGD algorithm among themselves. On a 64 Nvidia V100 GPU cluster connected via a 100Gb/s Ethernet network, our system is able to train SWB-2000 to reach a 7.6% WER on the Hub5-2000 Switchboard (SWB) test-set and a 13.2% WER on the Call-home (CH) test-set in 5.2 hours. To the best of our knowledge, this is the fastest ASR training system reported in the literature that attains this level of model accuracy for the SWB-2000 task.

* INTERSPEECH 2019
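
The hierarchical scheme described in the abstract above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function names (`allreduce_mean`, `h_adpsgd_step`) are hypothetical, the allreduce is replaced by a plain element-wise average, and the asynchronous exchange between super learners is simplified to one synchronous round in which each node averages its weights with its right-hand neighbor on a ring.

```python
from typing import List

Vector = List[float]


def allreduce_mean(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors (stand-in for a real allreduce)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def h_adpsgd_step(weights: List[Vector],
                  node_grads: List[List[Vector]],
                  lr: float) -> List[Vector]:
    """One simulated round of the hierarchical scheme.

    weights:    one weight vector per computing node (super learner).
    node_grads: for each node, the gradients computed by its local learners.
    """
    # Step 1 (inside each node): average the local learners' gradients,
    # so the node acts as a single super learner, then take an SGD step.
    updated = []
    for w, grads in zip(weights, node_grads):
        g = allreduce_mean(grads)
        updated.append([wi - lr * gi for wi, gi in zip(w, g)])

    # Step 2 (between nodes): each super learner averages its weights with
    # one ring neighbor. ASSUMPTION: a synchronous round replaces the
    # asynchronous exchange that ADPSGD actually performs.
    n = len(updated)
    return [allreduce_mean([updated[i], updated[(i + 1) % n]]) for i in range(n)]


if __name__ == "__main__":
    new_weights = h_adpsgd_step(
        weights=[[1.0], [3.0]],
        node_grads=[[[1.0], [3.0]], [[0.0], [0.0]]],
        lr=0.5,
    )
    print(new_weights)
```

The point of the two levels is that the expensive gradient averaging stays inside a node, while only the lighter pairwise weight averaging crosses the network.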

English Broadcast News Speech Recognition by Humans and Machines

Apr 30, 2019

Samuel Thomas, Masayuki Suzuki, Yinghui Huang, Gakuto Kurata, Zoltan Tuske, George Saon, Brian Kingsbury, Michael Picheny, Tom Dibert, Alice Kaiser-Schatzlein(+1 more)

Abstract:With recent advances in deep learning, considerable attention has been given to achieving automatic speech recognition performance close to human performance on tasks like conversational telephone speech (CTS) recognition. In this paper we evaluate the usefulness of these proposed techniques on broadcast news (BN), a similarly challenging task. We also perform a set of recognition measurements to understand how close the achieved automatic speech recognition results are to human performance on this task. On two publicly available BN test sets, DEV04F and RT04, our speech recognition system using LSTM and residual network based acoustic models with a combination of n-gram and neural network language models achieves word error rates of 6.5% and 5.9%, respectively. By achieving new performance milestones on these test sets, our experiments show that techniques developed on other related tasks, like CTS, can be transferred to achieve similar performance. In contrast, the best measured human word error rates on these test sets are much lower, at 3.6% and 2.8% respectively, indicating that there is still room for new techniques and improvements in this space to reach human performance levels.

* © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Distributed Deep Learning Strategies For Automatic Speech Recognition

Apr 10, 2019

Wei Zhang, Xiaodong Cui, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung, Michael Picheny

Abstract:In this paper, we propose and investigate a variety of distributed deep learning strategies for automatic speech recognition (ASR) and evaluate them with a state-of-the-art long short-term memory (LSTM) acoustic model on the 2000-hour Switchboard (SWB2000) dataset, which is one of the most widely used datasets for ASR performance benchmarking. We first investigate which hyper-parameters (e.g., learning rate) enable training with a sufficiently large batch size without impairing the model accuracy. We then implement various distributed strategies, including Synchronous (SYNC), Asynchronous Decentralized Parallel SGD (ADPSGD) and a hybrid of the two (HYBRID), to study their runtime/accuracy trade-off. We show that we can train the LSTM model using ADPSGD in 14 hours with 16 NVIDIA P100 GPUs to reach a 7.6% WER on the Hub5-2000 Switchboard (SWB) test set and a 13.1% WER on the CallHome (CH) test set. Furthermore, we can train the model using HYBRID in 11.5 hours with 32 NVIDIA V100 GPUs without loss in accuracy.

* Published in ICASSP'19

Building competitive direct acoustics-to-word models for English conversational speech recognition

Dec 08, 2017

Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, Michael Picheny

Abstract:Direct acoustics-to-word (A2W) models in the end-to-end paradigm have received increasing attention compared to conventional sub-word based automatic speech recognition models using phones, characters, or context-dependent hidden Markov model states. This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making training and decoding with such models simple. Prior work has shown that A2W models require orders of magnitude more training data in order to perform comparably to conventional models. Our work also showed this accuracy gap when using the English Switchboard-Fisher data set. This paper describes a recipe to train an A2W model that closes this gap and is on par with state-of-the-art sub-word based models. We achieve a word error rate of 8.8%/13.9% on the Hub5-2000 Switchboard/CallHome test sets without any decoder or language model. We find that model initialization, training data order, and regularization have the most impact on the A2W model performance. Next, we present a joint word-character A2W model that learns to first spell the word and then recognize it. This model provides a rich output to the user instead of simple word hypotheses, making it especially useful in the case of words unseen or rarely seen during training.

* Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018

Embedding-Based Speaker Adaptive Training of Deep Neural Networks

Oct 17, 2017

Xiaodong Cui, Vaibhava Goel, George Saon

Abstract:An embedding-based speaker adaptive training (SAT) approach is proposed and investigated in this paper for deep neural network acoustic modeling. In this approach, speaker embedding vectors, which are a constant given a particular speaker, are mapped through a control network to layer-dependent element-wise affine transformations to canonicalize the internal feature representations at the output of hidden layers of a main network. The control network for generating the speaker-dependent mappings is jointly estimated with the main network for the overall speaker adaptive acoustic modeling. Experiments on large vocabulary continuous speech recognition (LVCSR) tasks show that the proposed SAT scheme can yield superior performance over the widely-used speaker-aware training using i-vectors with speaker-adapted input features.
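
The mapping from a speaker embedding to an element-wise affine transformation of a hidden layer can be sketched as follows. This is an illustrative assumption, not the paper's architecture: the control network is reduced to a single linear map, the names `control_network` and `canonicalize` are hypothetical, and the scale is parameterized as `1 + W_scale · e` so that zero weights leave the hidden activations unchanged.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def control_network(embedding: Vector,
                    w_scale: Matrix,
                    w_shift: Matrix) -> Tuple[Vector, Vector]:
    """Map a speaker embedding to a per-unit scale and shift.

    ASSUMPTION: a single linear layer stands in for the control network.
    """
    scale = [1.0 + dot(row, embedding) for row in w_scale]
    shift = [dot(row, embedding) for row in w_shift]
    return scale, shift


def canonicalize(hidden: Vector, scale: Vector, shift: Vector) -> Vector:
    """Apply the element-wise affine transform h' = scale * h + shift."""
    return [s * h + b for h, s, b in zip(hidden, scale, shift)]


if __name__ == "__main__":
    speaker_embedding = [1.0, 2.0]
    scale, shift = control_network(
        speaker_embedding,
        w_scale=[[0.5, 0.0], [0.0, 0.0]],
        w_shift=[[0.0, 0.0], [1.0, 1.0]],
    )
    print(canonicalize([2.0, 4.0], scale, shift))
```

Because the embedding is fixed for a speaker, the scale and shift are computed once per speaker and per layer, then reused for every frame of that speaker.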

Language Modeling with Highway LSTM

Sep 19, 2017

Gakuto Kurata, Bhuvana Ramabhadran, George Saon, Abhinav Sethy

Abstract:Language models (LMs) based on Long Short Term Memory (LSTM) have shown good gains in many automatic speech recognition tasks. In this paper, we extend an LSTM by adding highway networks inside an LSTM and use the resulting Highway LSTM (HW-LSTM) model for language modeling. The added highway networks increase the depth in the time dimension. Since a typical LSTM has two internal states, a memory cell and a hidden state, we compare various types of HW-LSTM by adding highway networks onto the memory cell and/or the hidden state. Experimental results on English broadcast news and conversational telephone speech recognition show that the proposed HW-LSTM LM improves speech recognition accuracy on top of a strong LSTM LM baseline. We report word error rates of 5.1% and 9.9% on the Switchboard and CallHome subsets of the Hub5 2000 evaluation, reaching the best performance numbers reported on these tasks to date.

* to appear in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2017)
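
For readers unfamiliar with highway networks, the generic gating idea can be sketched as below. This is not the HW-LSTM formulation from the paper, whose exact placement inside the LSTM is not given in the abstract. It is a plain highway layer, simplified (as an assumption) to one scalar weight and bias per unit, and the function name `highway` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def highway(x: Vector, w_h: Vector, b_h: Vector,
            w_t: Vector, b_t: Vector) -> Vector:
    """Generic highway layer: y = t * H(x) + (1 - t) * x.

    ASSUMPTION: per-unit scalar weights replace full weight matrices.
    """
    out = []
    for xi, wh, bh, wt, bt in zip(x, w_h, b_h, w_t, b_t):
        candidate = math.tanh(wh * xi + bh)   # transformed input H(x)
        gate = sigmoid(wt * xi + bt)          # transform gate t in (0, 1)
        out.append(gate * candidate + (1.0 - gate) * xi)
    return out


if __name__ == "__main__":
    # A strongly negative gate bias closes the gate, so the input passes through.
    print(highway([2.0], w_h=[1.0], b_h=[0.0], w_t=[0.0], b_t=[-50.0]))
```

The carry path `(1 - t) * x` is what lets extra depth be added without blocking the signal, which is the property the abstract relies on when stacking highway networks onto the memory cell or hidden state.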

Direct Acoustics-to-Word Models for English Conversational Speech Recognition

Mar 22, 2017

Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, David Nahamoo

Abstract:Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and separately-trained Language Model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words without an intermediate phone representation. In this paper, we present the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks: Switchboard and CallHome. These models do not require an LM or even a decoder at run-time and hence recognize speech with minimal complexity. However, due to the large number of word output units, CTC word models require orders of magnitude more data to train reliably compared to traditional systems. We present some techniques to mitigate this issue. Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also present rescoring results on CTC word model lattices to quantify the performance benefits of a LM, and contrast the performance of word and phone CTC models.

* Submitted to Interspeech-2017
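
A word-level CTC model can produce a word sequence without a decoder because the standard CTC output rule is simple: take the most likely label per frame, merge consecutive repeats, and drop blanks. The sketch below shows that rule only. It is the textbook best-path collapse rather than code from the paper, and the blank token string and function name are hypothetical.

```python
from itertools import groupby
from typing import List

BLANK = "<blank>"  # hypothetical name for the CTC blank symbol


def ctc_collapse(frame_labels: List[str]) -> List[str]:
    """Turn per-frame labels into a word sequence.

    Consecutive duplicates are merged first, then blanks are removed, so a
    blank between two identical labels keeps them as two separate words.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != BLANK]


if __name__ == "__main__":
    frames = ["<blank>", "hello", "hello", "<blank>",
              "world", "world", "<blank>", "world"]
    print(ctc_collapse(frames))
```

With word output units, the collapsed sequence is already the recognition result, which is why no lexicon or language model is needed at run time.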

English Conversational Telephone Speech Recognition by Humans and Machines

Mar 06, 2017

George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim(+2 more)

Abstract:One of the most difficult speech recognition tasks is accurate recognition of human to human communication. Advances in deep learning over the last few years have produced major speech recognition improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to be within striking range of human performance. This then raises two issues - what IS human performance, and how far down can we still drive speech recognition error rates? A recent paper by Microsoft suggests that we have already achieved human performance. In trying to verify this statement, we performed an independent set of human performance measurements on two conversational tasks and found that human performance may be considerably better than what was earlier reported, giving the community a significantly harder goal to achieve. We also report on our own efforts in this area, presenting a set of acoustic and language modeling techniques that lowered the word error rate of our own English conversational telephone LVCSR system to the level of 5.5%/10.3% on the Switchboard/CallHome subsets of the Hub5 2000 evaluation, which - at least at the writing of this paper - is a new performance milestone (albeit not at what we measure to be human performance!). On the acoustic side, we use a score fusion of three models: one LSTM with multiple feature inputs, a second LSTM trained with speaker-adversarial multi-task learning and a third residual net (ResNet) with 25 convolutional layers and time-dilated convolutions. On the language modeling side, we use word and character LSTMs and convolutional WaveNet-style language models.
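
The word error rates compared throughout this listing, for both human transcribers and machines, follow the usual definition: the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. A minimal sketch follows, assuming whitespace tokenization and no text normalization (real scoring pipelines normalize transcripts first). The function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"{100 * wer:.1f}%")
```

The same scoring applies whether the hypothesis comes from a recognizer or from a human transcriber, which is what makes the human and machine numbers directly comparable.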

The IBM 2016 English Conversational Telephone Speech Recognition System

Jun 22, 2016

George Saon, Tom Sercu, Steven Rennie, Hong-Kwang J. Kuo

Abstract:We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation testset. On the acoustic side, we use a score fusion of three strong models: recurrent nets with maxout activations, very deep convolutional nets with 3x3 kernels, and bidirectional long short-term memory nets which operate on FMLLR and i-vector features. On the language modeling side, we use an updated model "M" and hierarchical neural network LMs.

* Submitted to Interspeech 2016

The IBM 2015 English Conversational Telephone Speech Recognition System

May 21, 2015

George Saon, Hong-Kwang J. Kuo, Steven Rennie, Michael Picheny

Abstract:We describe the latest improvements to the IBM English conversational telephone speech recognition system. Some of the techniques that were found beneficial are: maxout networks with annealed dropout rates; networks with a very large number of outputs trained on 2000 hours of data; joint modeling of partially unfolded recurrent neural networks and convolutional nets by combining the bottleneck and output layers and retraining the resulting model; and lastly, sophisticated language model rescoring with exponential and neural network LMs. These techniques result in an 8.0% word error rate on the Switchboard part of the Hub5-2000 evaluation test set, a 23% relative improvement over our previous best published result.

* Submitted to Interspeech 2015
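
The relative figure quoted in the abstract above is computed against the earlier error rate, not as a difference in percentage points. The short sketch below shows the arithmetic. The function names are hypothetical, and the previous error rate it back-calculates is only implied by the two numbers in the abstract; it is not stated there.

```python
def relative_improvement(previous_wer: float, new_wer: float) -> float:
    """Fraction by which the error rate dropped, relative to the previous one."""
    return (previous_wer - new_wer) / previous_wer


def implied_previous_wer(new_wer: float, relative_gain: float) -> float:
    """Invert the formula above: previous = new / (1 - relative_gain)."""
    return new_wer / (1.0 - relative_gain)


if __name__ == "__main__":
    # ASSUMPTION: back-calculated from the 8.0% WER and 23% relative gain
    # quoted in the abstract; the earlier WER itself is not given there.
    previous = implied_previous_wer(8.0, 0.23)
    print(f"implied previous WER: about {previous:.1f}%")
    print(f"check: {100 * relative_improvement(previous, 8.0):.0f}% relative")
```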
