Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chung-Cheng Chiu

No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models

Dec 05, 2017

Tara N. Sainath, Rohit Prabhavalkar, Shankar Kumar, Seungji Lee, Anjuli Kannan, David Rybach, Vlad Schogol, Patrick Nguyen, Bo Li, Yonghui Wu(+2 more)

Figure 1 for No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models

Figure 2 for No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models

Figure 3 for No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models

Figure 4 for No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models

Abstract:For decades, context-dependent phonemes have been the dominant sub-word unit for conventional acoustic modeling systems. This status quo has begun to be challenged recently by end-to-end models which seek to combine acoustic, pronunciation, and language model components into a single neural network. Such systems, which typically predict graphemes or words, simplify the recognition process since they remove the need for a separate expert-curated pronunciation lexicon to map from phoneme-based units to words. However, there has been little previous work comparing phoneme-based versus grapheme-based sub-word units in the end-to-end modeling framework, to determine whether the gains from such approaches are primarily due to the new probabilistic model, or from the joint learning of the various components with grapheme-based units. In this work, we conduct detailed experiments which are aimed at quantifying the value of phoneme-based pronunciation lexica in the context of end-to-end models. We examine phoneme-based end-to-end models, which are contrasted against grapheme-based ones on a large vocabulary English Voice-search task, where we find that graphemes do indeed outperform phonemes. We also compare grapheme and phoneme-based approaches on a multi-dialect English task, which once again confirm the superiority of graphemes, greatly simplifying the system for recognizing multiple dialects.

Via

Access Paper or Ask Questions

Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models

Dec 05, 2017

Rohit Prabhavalkar, Tara N. Sainath, Yonghui Wu, Patrick Nguyen, Zhifeng Chen, Chung-Cheng Chiu, Anjuli Kannan

Figure 1 for Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models

Figure 2 for Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models

Figure 3 for Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models

Figure 4 for Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models

Abstract:Sequence-to-sequence models, such as attention-based models in automatic speech recognition (ASR), are typically trained to optimize the cross-entropy criterion which corresponds to improving the log-likelihood of the data. However, system performance is usually measured in terms of word error rate (WER), not log-likelihood. Traditional ASR systems benefit from discriminative sequence training which optimizes criteria such as the state-level minimum Bayes risk (sMBR) which are more closely related to WER. In the present work, we explore techniques to train attention-based models to directly minimize expected word error rate. We consider two loss functions which approximate the expected number of word errors: either by sampling from the model, or by using N-best lists of decoded hypotheses, which we find to be more effective than the sampling-based method. In experimental evaluations, we find that the proposed training procedure improves performance by up to 8.2% relative to the baseline system. This allows us to train grapheme-based, uni-directional attention-based models which match the performance of a traditional, state-of-the-art, discriminative sequence-trained system on a mobile voice-search task.

Via

Access Paper or Ask Questions

Improving the Performance of Online Neural Transducer Models

Dec 05, 2017

Tara N. Sainath, Chung-Cheng Chiu, Rohit Prabhavalkar, Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Zhifeng Chen

Figure 1 for Improving the Performance of Online Neural Transducer Models

Figure 2 for Improving the Performance of Online Neural Transducer Models

Figure 3 for Improving the Performance of Online Neural Transducer Models

Figure 4 for Improving the Performance of Online Neural Transducer Models

Abstract:Having a sequence-to-sequence model which can operate in an online fashion is important for streaming applications such as Voice Search. Neural transducer is a streaming sequence-to-sequence model, but has shown a significant degradation in performance compared to non-streaming models such as Listen, Attend and Spell (LAS). In this paper, we present various improvements to NT. Specifically, we look at increasing the window over which NT computes attention, mainly by looking backwards in time so the model still remains online. In addition, we explore initializing a NT model from a LAS-trained model so that it is guided with a better alignment. Finally, we explore including stronger language models such as using wordpiece models, and applying an external LM during the beam search. On a Voice Search task, we find with these improvements we can get NT to match the performance of LAS.

Via

Access Paper or Ask Questions

Learning Hard Alignments with Variational Inference

Nov 01, 2017

Dieterich Lawson, Chung-Cheng Chiu, George Tucker, Colin Raffel, Kevin Swersky, Navdeep Jaitly

Figure 1 for Learning Hard Alignments with Variational Inference

Figure 2 for Learning Hard Alignments with Variational Inference

Figure 3 for Learning Hard Alignments with Variational Inference

Abstract:There has recently been significant interest in hard attention models for tasks such as object recognition, visual captioning and speech recognition. Hard attention can offer benefits over soft attention such as decreased computational cost, but training hard attention models can be difficult because of the discrete latent variables they introduce. Previous work used REINFORCE and Q-learning to approach these issues, but those methods can provide high-variance gradient estimates and be slow to train. In this paper, we tackle the problem of learning hard attention for a sequential task using variational inference methods, specifically the recently introduced VIMCO and NVIL. Furthermore, we propose a novel baseline that adapts VIMCO to this setting. We demonstrate our method on a phoneme recognition task in clean and noisy environments and show that our method outperforms REINFORCE, with the difference being greater for a more complicated task.

Via

Access Paper or Ask Questions

An online sequence-to-sequence model for noisy speech recognition

Jun 16, 2017

Chung-Cheng Chiu, Dieterich Lawson, Yuping Luo, George Tucker, Kevin Swersky, Ilya Sutskever, Navdeep Jaitly

Figure 1 for An online sequence-to-sequence model for noisy speech recognition

Figure 2 for An online sequence-to-sequence model for noisy speech recognition

Figure 3 for An online sequence-to-sequence model for noisy speech recognition

Figure 4 for An online sequence-to-sequence model for noisy speech recognition

Abstract:Generative models have long been the dominant approach for speech recognition. The success of these models however relies on the use of sophisticated recipes and complicated machinery that is not easily accessible to non-practitioners. Recent innovations in Deep Learning have given rise to an alternative - discriminative models called Sequence-to-Sequence models, that can almost match the accuracy of state of the art generative models. While these models are easy to train as they can be trained end-to-end in a single step, they have a practical limitation that they can only be used for offline recognition. This is because the models require that the entirety of the input sequence be available at the beginning of inference, an assumption that is not valid for instantaneous speech recognition. To address this problem, online sequence-to-sequence models were recently introduced. These models are able to start producing outputs as data arrives, and the model feels confident enough to output partial transcripts. These models, like sequence-to-sequence are causal - the output produced by the model until any time, $t$, affects the features that are computed subsequently. This makes the model inherently more powerful than generative models that are unable to change features that are computed from the data. This paper highlights two main contributions - an improvement to online sequence-to-sequence model training, and its application to noisy settings with mixed speech from two speakers.

* arXiv admin note: substantial text overlap with arXiv:1608.01281

Via

Access Paper or Ask Questions

Learning Online Alignments with Continuous Rewards Policy Gradient

Aug 03, 2016

Yuping Luo, Chung-Cheng Chiu, Navdeep Jaitly, Ilya Sutskever

Figure 1 for Learning Online Alignments with Continuous Rewards Policy Gradient

Figure 2 for Learning Online Alignments with Continuous Rewards Policy Gradient

Figure 3 for Learning Online Alignments with Continuous Rewards Policy Gradient

Figure 4 for Learning Online Alignments with Continuous Rewards Policy Gradient

Abstract:Sequence-to-sequence models with soft attention had significant success in machine translation, speech recognition, and question answering. Though capable and easy to use, they require that the entirety of the input sequence is available at the beginning of inference, an assumption that is not valid for instantaneous translation and speech recognition. To address this problem, we present a new method for solving sequence-to-sequence problems using hard online alignments instead of soft offline alignments. The online alignments model is able to start producing outputs without the need to first process the entire input sequence. A highly accurate online sequence-to-sequence model is useful because it can be used to build an accurate voice-based instantaneous translator. Our model uses hard binary stochastic decisions to select the timesteps at which outputs will be produced. The model is trained to produce these stochastic decisions using a standard policy gradient method. In our experiments, we show that this model achieves encouraging performance on TIMIT and Wall Street Journal (WSJ) speech recognition datasets.

Via

Access Paper or Ask Questions