Shankar Kumar

Uncertainty Determines the Adequacy of the Mode and the Tractability of Decoding in Sequence-to-Sequence Models

Apr 01, 2022
Felix Stahlberg, Ilia Kulikov, Shankar Kumar

In many natural language processing (NLP) tasks, the same input (e.g. source sentence) can have multiple possible outputs (e.g. translations). To analyze how this ambiguity (also known as intrinsic uncertainty) shapes the distribution learned by neural sequence models, we measure sentence-level uncertainty by computing the degree of overlap between references in multi-reference test sets from two different NLP tasks: machine translation (MT) and grammatical error correction (GEC). At both the sentence- and the task-level, intrinsic uncertainty has major implications for various aspects of search such as the inductive biases in beam search and the complexity of exact search. In particular, we show that well-known pathologies such as a high number of beam search errors, the inadequacy of the mode, and the drop in system performance with large beam sizes apply to tasks with a high level of ambiguity such as MT but not to less uncertain tasks such as GEC. Furthermore, we propose a novel exact $n$-best search algorithm for neural sequence models, and show that intrinsic uncertainty affects model uncertainty as the model tends to overly spread out the probability mass for uncertain tasks and sentences.

* ACL 2022 paper
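
The abstract does not spell out the overlap measure, so the following is only a rough sketch of the idea of scoring a sentence's intrinsic uncertainty from its references. It assumes Jaccard overlap on whitespace tokens, and the function names `jaccard` and `reference_uncertainty` are hypothetical, not taken from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between the whitespace-token sets of two sentences."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def reference_uncertainty(references):
    """Return 1 minus the mean pairwise overlap of the references.

    Assumption: low overlap between references signals high intrinsic
    uncertainty. Scores lie in [0, 1]; higher means more ambiguous.
    """
    pairs = list(combinations(references, 2))
    if not pairs:
        return 0.0  # a single reference carries no overlap information
    mean_overlap = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_overlap
```

Under this toy measure, a GEC-style sentence whose references are nearly identical scores close to 0, while an MT-style sentence with freely varying translations scores closer to 1.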

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Mar 09, 2022
W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar

Language model fusion helps smart assistants recognize words which are rare in acoustic data but abundant in text-only corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather"). We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance. First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces high frequency (head) sentences. Second, to encourage rare-word exposure, we explicitly filter for words rare in the acoustic data. Finally, we tackle domain-mismatch via perplexity-based contrastive selection, filtering for examples matched to the target domain. We down-select a large corpus of web search queries by a factor of 53x and achieve better LM perplexities than without down-selection. When shallow-fused with a state-of-the-art, production speech engine, our LM achieves WER reductions of up to 24% relative on rare-word sentences (without changing overall WER) compared to a baseline LM trained on the raw corpus. These gains are further validated through favorable side-by-side evaluations on live voice search traffic.
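
The first strategy can be illustrated with a small sketch. The abstract does not give the exact soft log function, so the form `tau * log(1 + count / tau)` used here is an assumption chosen because it is roughly linear for rare sentences and logarithmic for frequent ones; `soft_log_count`, `downsample`, and `tau` are hypothetical names.

```python
import math
from collections import Counter


def soft_log_count(count, tau):
    """Hypothetical soft log: about `count` when count << tau,
    about tau * log(count / tau) when count >> tau."""
    return max(1, round(tau * math.log1p(count / tau)))


def downsample(sentences, tau=10.0):
    """Map each distinct sentence to its reduced number of copies."""
    counts = Counter(sentences)
    return {s: soft_log_count(c, tau) for s, c in counts.items()}


corpus = ["weather"] * 1000 + ["rare query"]
reduced = downsample(corpus, tau=10.0)
# The head query is cut sharply while the tail query keeps its single copy.
```

Raising `tau` in this sketch keeps more copies of frequent sentences, which is one way the reduction could be made tunable.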

Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model

Feb 16, 2022
Hao Zhang, You-Chi Cheng, Shankar Kumar, W. Ronny Huang, Mingqing Chen, Rajiv Mathews

Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model. We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling. A case-aware language model trained on this normalized text achieves the same perplexity as a model trained on text with gold capitalization. In a real user A/B experiment, we demonstrate that the improvement translates to reduced prediction error rates in a virtual keyboard application. Similarly, in an ASR language model fusion experiment, we show a reduction in uppercase character error rate and word error rate.

* arXiv admin note: substantial text overlap with arXiv:2108.11943

Transformer-based Models of Text Normalization for Speech Applications

Feb 01, 2022
Jae Hun Ro, Felix Stahlberg, Ke Wu, Shankar Kumar

Text normalization, or the process of transforming text into a consistent, canonical form, is crucial for speech applications such as text-to-speech synthesis (TTS). In TTS, the system must decide whether to verbalize "1995" as "nineteen ninety five" in "born in 1995" or as "one thousand nine hundred ninety five" in "page 1995". We present an experimental comparison of various Transformer-based sequence-to-sequence (seq2seq) models of text normalization for speech and evaluate them on a variety of datasets of written text aligned to its normalized spoken form. These models include variants of the 2-stage RNN-based tagging/seq2seq architecture introduced by Zhang et al. (2019), where we replace the RNN with a Transformer in one or more stages, as well as vanilla Transformers that output string representations of edit sequences. Of our approaches, using Transformers for sentence context encoding within the 2-stage model proved most effective, with the fine-tuned BERT encoder yielding the best performance.
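
To make the "1995" example concrete, here is a toy rule-based verbalizer. It is not one of the Transformer models compared in the paper; it only illustrates the context-dependent input/output behavior such models must learn. The cue-word rule and all function names are assumptions made for illustration.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_100(n):
    """Spell out 0 <= n < 100."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def cardinal(n):
    """Cardinal reading for 0 <= n < 10000."""
    thousands, rest = divmod(n, 1000)
    hundreds, tail = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if tail or not parts:
        parts.append(below_100(tail))
    return " ".join(parts)


def year(n):
    """Year reading for a four-digit number, read as two two-digit halves."""
    high, low = divmod(n, 100)
    if low == 0:
        return f"{below_100(high)} hundred"
    if low < 10:
        return f"{below_100(high)} oh {ONES[low]}"
    return f"{below_100(high)} {below_100(low)}"


def verbalize(token, previous_word):
    """Toy rule (an assumption): read as a year after 'in' or 'since'."""
    n = int(token)
    if previous_word in {"in", "since"} and 1000 <= n <= 2999:
        return year(n)
    return cardinal(n)
```

Here `verbalize("1995", "in")` gives the year reading and `verbalize("1995", "page")` gives the cardinal reading. A hand-written cue list like this breaks down quickly on real text, which is part of why sentence context encoding matters for the learned models.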

Position-Invariant Truecasing with a Word-and-Character Hierarchical Recurrent Neural Network

Sep 01, 2021
Hao Zhang, You-Chi Cheng, Shankar Kumar, Mingqing Chen, Rajiv Mathews

Truecasing is the task of restoring the correct case (uppercase or lowercase) of noisy text generated either by an automatic system for speech recognition or machine translation or by humans. It improves the performance of downstream NLP tasks such as named entity recognition and language modeling. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model, the first of its kind for this problem. Using sequence distillation, we also address the problem of truecasing while ignoring token positions in the sentence, i.e. in a position-invariant manner.

Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models

May 27, 2021
Felix Stahlberg, Shankar Kumar

Synthetic data generation is widely known to boost the accuracy of neural grammatical error correction (GEC) systems, but existing methods often lack diversity or are too simplistic to generate the broad range of grammatical errors made by human writers. In this work, we use error type tags from automatic annotation tools such as ERRANT to guide synthetic data generation. We compare several models that can produce an ungrammatical sentence given a clean sentence and an error type tag. We use these models to build a new, large synthetic pre-training data set with error tag frequency distributions matching a given development set. Our synthetic data set yields large and consistent gains, improving the state-of-the-art on the BEA-19 and CoNLL-14 test sets. We also show that our approach is particularly effective in adapting a GEC system, trained on mixed native and non-native English, to a native English test set, even surpassing real training data consisting of high-quality sentence pairs.

* Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, 2021. https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction
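
One step described above, matching the error tag frequency distribution of a development set, might be sketched as follows. This is an illustrative assumption about how tags could be allocated to clean sentences, not the authors' procedure; `tag_quota` and `assign_tags` are hypothetical names, and the tagged corruption model that would consume the (sentence, tag) pairs is not shown.

```python
from collections import Counter


def tag_quota(dev_tags, num_sentences):
    """Split num_sentences across tags in proportion to their dev-set
    frequency, using largest-remainder rounding so the quotas sum exactly."""
    counts = Counter(dev_tags)
    total = sum(counts.values())
    exact = {tag: num_sentences * c / total for tag, c in counts.items()}
    quota = {tag: int(x) for tag, x in exact.items()}
    leftover = num_sentences - sum(quota.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - quota[t], reverse=True)
    for tag in by_remainder[:leftover]:
        quota[tag] += 1
    return quota


def assign_tags(clean_sentences, dev_tags):
    """Pair each clean sentence with the error tag it should be corrupted with."""
    quota = tag_quota(dev_tags, len(clean_sentences))
    tags = [tag for tag, n in quota.items() for _ in range(n)]
    return list(zip(clean_sentences, tags))


# ERRANT-style tags observed on a (made-up) development set.
dev_tags = ["M:DET"] * 5 + ["R:SPELL"] * 3 + ["R:VERB:SVA"] * 2
pairs = assign_tags(["s1", "s2", "s3", "s4"], dev_tags)
```

Each resulting pair would then be passed to a model that produces an ungrammatical sentence given the clean sentence and the tag.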

Lookup-Table Recurrent Language Models for Long Tail Speech Recognition

Apr 09, 2021
W. Ronny Huang, Tara N. Sainath, Cal Peyser, Shankar Kumar, David Rybach, Trevor Strohman

We introduce Lookup-Table Language Models (LookupLM), a method for scaling up the size of RNN language models with only a constant increase in the floating point operations, by increasing the expressivity of the embedding table. In particular, we instantiate an (additional) embedding table which embeds the previous n-gram token sequence, rather than a single token. This allows the embedding table to be scaled up arbitrarily -- with a commensurate increase in performance -- without changing the token vocabulary. Since embeddings are sparsely retrieved from the table via a lookup, increasing the size of the table adds neither extra operations to each forward pass nor extra parameters that need to be stored on limited GPU/TPU memory. We explore scaling n-gram embedding tables up to nearly a billion parameters. When trained on a 3-billion sentence corpus, we find that LookupLM improves long tail log perplexity by 2.44 and long tail WER by 23.4% on a downstream speech recognition task over a standard RNN language model baseline, an improvement comparable to scaling up the baseline by 6.2x the number of floating point operations.

* Submitted to Interspeech 2021
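
A minimal sketch of the n-gram embedding lookup, in plain Python. The abstract does not say how an n-gram is mapped to a table row, so the modular polynomial hash below is an assumption, as are the names `make_table`, `ngram_row`, and `embed` and the choice to concatenate the two embeddings.

```python
import random


def make_table(num_rows, dim, seed=0):
    """Randomly initialised embedding table stored as a list of rows."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def ngram_row(ngram, num_rows):
    """Map a tuple of token ids to a row index (hypothetical hash choice)."""
    h = 0
    for tok in ngram:
        h = (h * 1_000_003 + tok + 1) % num_rows
    return h


def embed(tokens, position, token_table, ngram_table, n=2):
    """Concatenate the token embedding at `position` with the embedding of
    the n-gram ending at `position` (shorter at the start of the sequence)."""
    start = max(0, position - n + 1)
    ngram = tuple(tokens[start:position + 1])
    tok_vec = token_table[tokens[position]]
    ngram_vec = ngram_table[ngram_row(ngram, len(ngram_table))]
    return tok_vec + ngram_vec  # list concatenation, not element-wise addition
```

In this sketch each step performs exactly one extra row lookup whatever the size of `ngram_table`, which mirrors the claim that growing the table adds no operations to the forward pass.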

Seq2Edits: Sequence Transduction Using Span-level Edit Operations

Sep 23, 2020
Felix Stahlberg, Shankar Kumar

We propose Seq2Edits, an open-vocabulary approach to sequence editing for natural language processing (NLP) tasks with a high degree of overlap between input and output texts. In this approach, each sequence-to-sequence transduction is represented as a sequence of edit operations, where each operation either replaces an entire source span with target tokens or keeps it unchanged. We evaluate our method on five NLP tasks (text normalization, sentence fusion, sentence splitting & rephrasing, text simplification, and grammatical error correction) and report competitive results across the board. For grammatical error correction, our method speeds up inference by up to 5.2x compared to full sequence models because inference time depends on the number of edits rather than the number of target tokens. For text normalization, sentence fusion, and grammatical error correction, our approach improves explainability by associating each edit operation with a human-readable tag.

* Accepted at EMNLP 2020
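
The edit representation can be sketched by applying a list of span-level edits to a source sentence. The tuple layout `(tag, span_end, replacement)`, the use of `None` to mean "keep the span", and the tag names are assumptions made for this illustration.

```python
def apply_edits(source_tokens, edits):
    """Apply span-level edits left to right.

    Each edit is (tag, span_end, replacement). The span runs from the end of
    the previous edit up to span_end (exclusive). A replacement of None keeps
    the source span unchanged; a list of tokens replaces the whole span.
    """
    output, start = [], 0
    for tag, span_end, replacement in edits:
        if replacement is None:
            output.extend(source_tokens[start:span_end])
        else:
            output.extend(replacement)
        start = span_end
    return output


source = "She go to school yesterday".split()
edits = [
    ("SELF", 1, None),      # keep "She"
    ("VERB", 2, ["went"]),  # replace "go" with "went"
    ("SELF", 5, None),      # keep "to school yesterday"
]
corrected = apply_edits(source, edits)
```

Three edit operations are enough to produce five target tokens here, which hints at why inference time can depend on the number of edits rather than on the target length. The tags also give each change a human-readable label.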

Data Weighted Training Strategies for Grammatical Error Correction

Sep 09, 2020
Jared Lichtarge, Chris Alberti, Shankar Kumar

Recent progress in the task of Grammatical Error Correction (GEC) has been driven by addressing data sparsity, both through new methods for generating large and noisy pretraining data and through the publication of small and higher-quality finetuning data in the BEA-2019 shared task. Building upon recent work in Neural Machine Translation (NMT), we make use of both kinds of data by deriving example-level scores on our large pretraining data based on a smaller, higher-quality dataset. In this work, we perform an empirical study to discover how to best incorporate delta-log-perplexity, a type of example scoring, into a training schedule for GEC. In doing so, we perform experiments that shed light on the function and applicability of delta-log-perplexity. Models trained on scored data achieve state-of-the-art results on common GEC test sets.

* Accepted to TACL (Transactions of the Association for Computational Linguistics)
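
As a rough sketch of example scoring with delta-log-perplexity, the code below assumes the score is the log perplexity of an example under a base model minus its log perplexity under the same model after fine-tuning on the smaller, higher-quality data. The sign convention, the function names, and the made-up token log-probabilities are all assumptions.

```python
def log_perplexity(token_logprobs):
    """Negative mean token log-probability of one example."""
    return -sum(token_logprobs) / len(token_logprobs)


def delta_log_perplexity(base_logprobs, tuned_logprobs):
    """Assumed convention: positive when fine-tuning on the high-quality
    data made the example more likely."""
    return log_perplexity(base_logprobs) - log_perplexity(tuned_logprobs)


# Hypothetical per-token log-probabilities for two pretraining examples.
examples = {
    "ex_a": ([-2.0, -2.0], [-1.0, -1.0]),  # became more likely after fine-tuning
    "ex_b": ([-1.0, -1.0], [-3.0, -3.0]),  # became less likely after fine-tuning
}
scores = {name: delta_log_perplexity(b, t) for name, (b, t) in examples.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Scores like these could then be used to filter, order, or weight the pretraining examples within a training schedule.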
