Shankar Kumar

Towards an On-device Agent for Text Rewriting

Aug 22, 2023
Yun Zhu, Yinxiao Liu, Felix Stahlberg, Shankar Kumar, Yu-hui Chen, Liangchen Luo, Lei Shu, Renjie Liu, Jindong Chen, Lei Meng

Large Language Models (LLMs) have demonstrated impressive capabilities for text rewriting. Nonetheless, the large sizes of these models make them impractical for on-device inference, which would otherwise allow for enhanced privacy and economical inference. Creating a smaller yet potent language model for text rewriting presents a formidable challenge because it requires balancing the need for a small size with the need to retain the emergent capabilities of the LLM, which requires costly data collection. To address the above challenge, we introduce a new instruction tuning approach for building a mobile-centric text rewriting model. Our strategies enable the generation of high-quality training data without any human labeling. In addition, we propose a heuristic reinforcement learning framework which substantially enhances performance without requiring preference data. To further bridge the performance gap with the larger server-side model, we propose an effective approach that combines the mobile rewrite agent with the server model using a cascade. To tailor the text rewriting tasks to mobile scenarios, we introduce MessageRewriteEval, a benchmark that focuses on text rewriting for messages through natural language instructions. Through empirical experiments, we demonstrate that our on-device model surpasses the current state-of-the-art LLMs in text rewriting while maintaining a significantly reduced model size. Notably, we show that our proposed cascading approach improves model performance.
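As a rough illustration of the cascading idea, the sketch below routes a request to a hypothetical on-device rewriter first and escalates to the server model only when a stand-in quality score falls below a threshold. All function names and the threshold are illustrative assumptions, not the paper's actual components.

# Minimal sketch of a cascade between an on-device rewriter and a server model.
# Every function and the threshold below are hypothetical placeholders.

def on_device_rewrite(text: str, instruction: str) -> str:
    """Stand-in for the small on-device rewriting model."""
    return text  # placeholder: a real model would return a rewritten string

def server_rewrite(text: str, instruction: str) -> str:
    """Stand-in for the larger server-side LLM."""
    return text  # placeholder

def quality_score(source: str, candidate: str) -> float:
    """Stand-in confidence/quality estimate for the on-device output."""
    return 1.0 if candidate != source else 0.0  # toy heuristic

def cascade_rewrite(text, instruction, threshold=0.5):
    # Try the cheap, private on-device model first.
    candidate = on_device_rewrite(text, instruction)
    if quality_score(text, candidate) >= threshold:
        return candidate                          # accept the on-device rewrite
    return server_rewrite(text, instruction)      # otherwise escalate to the server

print(cascade_rewrite("thx for the update", "Make this more formal"))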

Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR

May 28, 2023
W. Ronny Huang, Hao Zhang, Shankar Kumar, Shuo-yiin Chang, Tara N. Sainath

We propose a method of segmenting long-form speech by separating semantically complete sentences within the utterance. This prevents the ASR decoder from needlessly processing faraway context while also preventing it from missing relevant context within the current sentence. Semantically complete sentence boundaries are typically demarcated by punctuation in written text; unfortunately, spoken real-world utterances rarely contain punctuation. We address this limitation by distilling punctuation knowledge from a bidirectional teacher language model (LM) trained on written, punctuated text. We compare our segmenter, which is distilled from the LM teacher, against a segmenter distilled from an acoustic-pause-based teacher used in other works, on a streaming ASR pipeline. The pipeline with our segmenter achieves a 3.2% relative WER gain along with a 60 ms median end-of-segment latency reduction on a YouTube captioning task.

* Interspeech 2023. First 3 authors contributed equally 
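To make the distillation setup concrete, the sketch below shows one way segment-boundary labels could be derived from teacher-punctuated text for training a student segmenter. The label scheme and helper names are assumptions for illustration, not the paper's actual pipeline.

# Hypothetical sketch: turn teacher-punctuated text into per-token boundary labels
# that a student segmenter could be trained to predict from unpunctuated input.
import re

def boundary_labels(punctuated: str):
    """Return (tokens, labels) where label 1 marks a token that ends a sentence."""
    tokens, labels = [], []
    for raw in punctuated.split():
        word = re.sub(r"[.!?]+$", "", raw)      # strip sentence-final punctuation
        tokens.append(word.lower())
        labels.append(1 if raw != word else 0)  # boundary if punctuation was stripped
    return tokens, labels

# The teacher LM would punctuate the raw ASR transcript; here we assume its output.
teacher_output = "let's meet at noon. bring the slides please."
tokens, labels = boundary_labels(teacher_output)
print(list(zip(tokens, labels)))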

Measuring Re-identification Risk

Apr 12, 2023
CJ Carey, Travis Dick, Alessandro Epasto, Adel Javanmard, Josh Karlin, Shankar Kumar, Andres Munoz Medina, Vahab Mirrokni, Gabriel Henrique Nunes, Sergei Vassilvitskii, Peilin Zhong

Compact user representations (such as embeddings) form the backbone of personalization services. In this work, we present a new theoretical framework to measure re-identification risk in such user representations. Our framework, based on hypothesis testing, formally bounds the probability that an attacker may be able to obtain the identity of a user from their representation. As an application, we show how our framework is general enough to model important real-world applications such as Chrome's Topics API for interest-based advertising. We complement our theoretical bounds by showing provably good attack algorithms for re-identification that we use to estimate the re-identification risk in the Topics API. We believe this work provides a rigorous and interpretable notion of re-identification risk and a framework to measure it that can be used to inform real-world applications.
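The idea of estimating re-identification risk with an attack can be illustrated by a generic nearest-neighbor matching attack between two noisy releases of user embeddings. This is a textbook-style baseline on synthetic data, not the specific attack algorithms or hypothesis-testing bound analyzed in the paper.

# Generic nearest-neighbor matching attack used to *estimate* re-identification
# risk from two releases of user embeddings (illustrative, synthetic data only).
import numpy as np

rng = np.random.default_rng(0)
n_users, dim, noise = 500, 16, 0.3

true_profiles = rng.normal(size=(n_users, dim))
release_a = true_profiles + noise * rng.normal(size=(n_users, dim))  # e.g., week 1
release_b = true_profiles + noise * rng.normal(size=(n_users, dim))  # e.g., week 2

# The attacker links each user in release_b to the closest embedding in release_a.
dists = np.linalg.norm(release_b[:, None, :] - release_a[None, :, :], axis=-1)
matches = dists.argmin(axis=1)

reid_rate = (matches == np.arange(n_users)).mean()
print(f"estimated re-identification rate: {reid_rate:.2%}")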

Improved Long-Form Spoken Language Translation with Large Language Models

Dec 19, 2022
Arya D. McCarthy, Hao Zhang, Shankar Kumar, Felix Stahlberg, Axel H. Ng

A challenge in spoken language translation is that plenty of spoken content is long-form, but short units are necessary for obtaining high-quality translations. To address this mismatch, we fine-tune a general-purpose, large language model to split long ASR transcripts into segments that can be independently translated so as to maximize the overall translation quality. We compare against several segmentation strategies and find that our approach improves the BLEU score on three languages by an average of 2.7 BLEU overall compared to an automatic punctuation baseline. Further, we demonstrate the effectiveness of two constrained decoding strategies to improve the well-formedness of the model output from above 99% to 100%.
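The constrained decoding idea can be sketched as follows: at each step the decoder may only copy the next source token or emit a break symbol, which by construction keeps the output well-formed. The scoring heuristic below is a stand-in for the fine-tuned LLM's scores, and all names are illustrative assumptions.

# Hedged sketch of constrained segmentation: the output may only copy the next
# source word or emit a break symbol, guaranteeing a well-formed split.
BREAK = "<brk>"

def toy_break_score(prefix, next_word):
    """Stand-in for a model score of emitting BREAK before next_word."""
    # Toy heuristic: prefer breaking before words that often start sentences.
    return 0.9 if next_word in {"so", "and", "but", "okay"} else 0.1

def constrained_segment(words, threshold=0.5):
    output = []
    for i, word in enumerate(words):
        # The only allowed actions are "emit BREAK" or "copy the next source word".
        if i > 0 and toy_break_score(output, word) > threshold:
            output.append(BREAK)
        output.append(word)
    return output

transcript = "we shipped the feature so the next step is testing".split()
print(" ".join(constrained_segment(transcript)))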

Conciseness: An Overlooked Language Task

Nov 08, 2022
Felix Stahlberg, Aashish Kumar, Chris Alberti, Shankar Kumar

We report on novel investigations into training models that make sentences concise. We define the task and show that it is different from related tasks such as summarization and simplification. For evaluation, we release two test sets, consisting of 2000 sentences each, that were annotated by two and five human annotators, respectively. We demonstrate that conciseness is a difficult task for which zero-shot setups with large neural language models often do not perform well. Given the limitations of these approaches, we propose a synthetic data generation method based on round-trip translations. Using this data to either train Transformers from scratch or fine-tune T5 models yields our strongest baselines that can be further improved by fine-tuning on an artificial conciseness dataset that we derived from multi-annotator machine translation test sets.

* EMNLP 2022 Workshop on Text Simplification, Accessibility, and Readability (TSAR) 
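A minimal sketch of building synthetic (verbose, concise) pairs from round-trip translations is shown below. The length-ratio filter and the example round-trip output are assumptions for illustration, not the paper's exact recipe.

# Hedged sketch: keep (verbose, concise) pairs where a round-trip translation of
# the source came back noticeably shorter. The ratio threshold is an assumption.
def keep_pair(source: str, round_trip: str, max_ratio: float = 0.8):
    if len(round_trip.split()) <= max_ratio * len(source.split()):
        return source, round_trip   # (verbose source, concise target)
    return None

# In practice, round_trip would come from translating the source into a pivot
# language and back; here we assume an example round-trip output for illustration.
source = "It is the case that the meeting has been moved to Friday."
round_trip = "The meeting has been moved to Friday."
print(keep_pair(source, round_trip))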

Simple and Effective Gradient-Based Tuning of Sequence-to-Sequence Models

Sep 10, 2022
Jared Lichtarge, Chris Alberti, Shankar Kumar

Recent trends towards training ever-larger language models have substantially improved machine learning performance across linguistic tasks. However, the huge cost of training larger models can make tuning them prohibitively expensive, motivating the study of more efficient methods. Gradient-based hyperparameter optimization offers the capacity to tune hyperparameters during training, yet has not previously been studied in a sequence-to-sequence setting. We apply a simple and general gradient-based hyperparameter optimization method to sequence-to-sequence tasks for the first time, demonstrating both efficiency and performance gains over strong baselines for both Neural Machine Translation and Natural Language Understanding (NLU) tasks (via T5 pretraining). For translation, we show the method generalizes across language pairs, is more efficient than Bayesian hyperparameter optimization, and that learned schedules for some hyperparameters can outperform even optimal constant-valued tuning. For T5, we show that learning hyperparameters during pretraining can improve performance across downstream NLU tasks. When learning multiple hyperparameters concurrently, we show that the global learning rate can follow a schedule over training that improves performance and is not explainable by the "short-horizon bias" of greedy methods (Wu et al., 2018). We release the code used to facilitate further research.

* 18 pages, 6 figures, In Proceedings of AutoML 2022 (Workshop track), Baltimore, MD, USA 
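For intuition, the toy example below applies a generic hypergradient-descent update to the learning rate on a quadratic objective. It shows the flavor of tuning a hyperparameter with gradients during training; it is not the specific method or the learned schedules studied in the paper.

# Simplified hypergradient-descent sketch: the learning rate is itself updated
# with a gradient signal (the dot product of consecutive loss gradients).
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)          # model parameters (toy quadratic objective ||w||^2)
lr, hyper_lr = 0.01, 1e-3       # learning rate and its own meta learning rate
prev_grad = np.zeros_like(w)

def grad(w):
    return 2.0 * w              # gradient of ||w||^2

for step in range(200):
    g = grad(w)
    # Nudge the learning rate in the direction that would have reduced the loss.
    lr += hyper_lr * float(g @ prev_grad)
    w -= lr * g
    prev_grad = g

print(f"final loss: {float(w @ w):.6f}, learned lr: {lr:.4f}")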

Text Generation with Text-Editing Models

Jun 14, 2022
Eric Malmi, Yue Dong, Jonathan Mallinson, Aleksandr Chuklin, Jakub Adamek, Daniil Mirylenka, Felix Stahlberg, Sebastian Krause, Shankar Kumar, Aliaksei Severyn

Text-editing models have recently become a prominent alternative to seq2seq models for monolingual text-generation tasks such as grammatical error correction, simplification, and style transfer. These tasks share a common trait: they exhibit a large amount of textual overlap between the source and target texts. Text-editing models take advantage of this observation and learn to generate the output by predicting edit operations applied to the source sequence. In contrast, seq2seq models generate outputs word by word from scratch, which makes them slow at inference time. Text-editing models provide several benefits over seq2seq models, including faster inference speed, higher sample efficiency, and better control and interpretability of the outputs. This tutorial provides a comprehensive overview of text-editing models and current state-of-the-art approaches, and analyzes their pros and cons. We discuss challenges related to productionization and how these models can be used to mitigate hallucination and bias, both pressing challenges in the field of text generation.

* Accepted as a tutorial at NAACL 2022 
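The edit-operation view can be made concrete with a small sketch that tags each source token KEEP or DELETE, attaches insertions, and then realizes the target by applying the tags. The tagging scheme below is a generic illustration, not any particular model's scheme.

# Sketch of an edit-operation representation: tag source tokens and attach
# insertions, then reconstruct the target from the tags.
import difflib

def tag_edits(source: str, target: str):
    """Label each source token KEEP/DELETE; attach inserted target words to the
    preceding source position (insertions before the first token go to the front)."""
    src, tgt = source.split(), target.split()
    tags = [[w, "KEEP", []] for w in src]
    front_inserts = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=src, b=tgt).get_opcodes():
        if op in ("delete", "replace"):
            for i in range(i1, i2):
                tags[i][1] = "DELETE"
        if op in ("insert", "replace"):
            (tags[i1 - 1][2] if i1 > 0 else front_inserts).extend(tgt[j1:j2])
    return front_inserts, tags

def realize(front_inserts, tags):
    out = list(front_inserts)
    for word, action, inserted in tags:
        if action == "KEEP":
            out.append(word)
        out.extend(inserted)
    return " ".join(out)

front, tags = tag_edits("she go to school yesterday", "she went to school")
print(tags)
print(realize(front, tags))   # reconstructs the target from edit operations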

Jam or Cream First? Modeling Ambiguity in Neural Machine Translation with SCONES

May 02, 2022
Felix Stahlberg, Shankar Kumar

The softmax layer in neural machine translation is designed to model the distribution over mutually exclusive tokens. Machine translation, however, is intrinsically uncertain: the same source sentence can have multiple semantically equivalent translations. Therefore, we propose to replace the softmax activation with a multi-label classification layer that can model ambiguity more effectively. We call our loss function Single-label Contrastive Objective for Non-Exclusive Sequences (SCONES). We show that the multi-label output layer can still be trained on single-reference training data using the SCONES loss function. SCONES yields consistent BLEU score gains across six translation directions, particularly for medium-resource language pairs and small beam sizes. By using smaller beam sizes, we can speed up inference by a factor of 3.9x and still match or improve the BLEU score obtained using softmax. Furthermore, we demonstrate that SCONES can be used to train NMT models that assign the highest probability to adequate translations, thus mitigating the "beam search curse". Additional experiments on synthetic language pairs with varying levels of uncertainty suggest that the improvements from SCONES can be attributed to better handling of ambiguity.

* NAACL 2022 paper 
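The core idea of replacing the softmax with independent per-token classifiers can be sketched as a binary cross-entropy over the vocabulary with the reference token as the single positive. This illustrates the general concept only; the exact SCONES objective and its hyperparameters are defined in the paper.

# Minimal sketch of a multi-label output layer: each vocabulary item gets an
# independent sigmoid "is this token acceptable?" score, trained with BCE.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_step_loss(logits: np.ndarray, reference_id: int) -> float:
    """Binary cross-entropy over the vocabulary for a single decoding step."""
    probs = sigmoid(logits)
    targets = np.zeros_like(probs)
    targets[reference_id] = 1.0          # single observed reference token
    eps = 1e-9
    bce = -(targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps))
    return float(bce.mean())

rng = np.random.default_rng(0)
logits = rng.normal(size=32)             # toy vocabulary of 32 types
print(multilabel_step_loss(logits, reference_id=7))
# Unlike softmax, two plausible tokens can both receive probability close to 1.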

Uncertainty Determines the Adequacy of the Mode and the Tractability of Decoding in Sequence-to-Sequence Models

Apr 01, 2022
Felix Stahlberg, Ilia Kulikov, Shankar Kumar

In many natural language processing (NLP) tasks, the same input (e.g., a source sentence) can have multiple possible outputs (e.g., translations). To analyze how this ambiguity (also known as intrinsic uncertainty) shapes the distribution learned by neural sequence models, we measure sentence-level uncertainty by computing the degree of overlap between references in multi-reference test sets from two different NLP tasks: machine translation (MT) and grammatical error correction (GEC). At both the sentence- and the task-level, intrinsic uncertainty has major implications for various aspects of search, such as the inductive biases in beam search and the complexity of exact search. In particular, we show that well-known pathologies such as a high number of beam search errors, the inadequacy of the mode, and the drop in system performance with large beam sizes apply to tasks with a high level of ambiguity, such as MT, but not to less uncertain tasks such as GEC. Furthermore, we propose a novel exact $n$-best search algorithm for neural sequence models, and show that intrinsic uncertainty affects model uncertainty as the model tends to overly spread out the probability mass for uncertain tasks and sentences.

* ACL 2022 paper 
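One simple way to operationalize "degree of overlap between references" is the average pairwise similarity of the references, as in the sketch below. The paper's exact measure may differ, so treat this as an illustrative proxy for sentence-level uncertainty.

# Illustrative uncertainty proxy: average pairwise similarity of multiple references
# (lower overlap = higher intrinsic uncertainty).
from difflib import SequenceMatcher
from itertools import combinations

def reference_overlap(references):
    pairs = list(combinations(references, 2))
    sims = [SequenceMatcher(a=a.split(), b=b.split()).ratio() for a, b in pairs]
    return sum(sims) / len(sims)

mt_refs = [  # translations tend to diverge -> lower overlap, higher uncertainty
    "the cat sat on the mat",
    "a cat was sitting on the rug",
    "the cat is seated on the mat",
]
gec_refs = [  # corrections tend to agree -> higher overlap, lower uncertainty
    "she went to school yesterday",
    "she went to school yesterday",
    "yesterday she went to school",
]
print(f"MT overlap:  {reference_overlap(mt_refs):.2f}")
print(f"GEC overlap: {reference_overlap(gec_refs):.2f}")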

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Mar 09, 2022
W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar

Language model fusion helps smart assistants recognize words which are rare in acoustic data but abundant in text-only corpora (such as typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather"). We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance. First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces high-frequency (head) sentences. Second, to encourage rare-word exposure, we explicitly filter for words that are rare in the acoustic data. Finally, we tackle domain mismatch via perplexity-based contrastive selection, filtering for examples matched to the target domain. We down-select a large corpus of web search queries by a factor of 53x and achieve better LM perplexities than without down-selection. When shallow-fused with a state-of-the-art production speech engine, our LM achieves WER reductions of up to 24% relative on rare-word sentences (without changing overall WER) compared to a baseline LM trained on the raw corpus. These gains are further validated through favorable side-by-side evaluations on live voice search traffic.
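A hedged reading of the soft-log downsampling step is sketched below: raw sentence counts are squashed so that head queries keep far fewer copies while tail queries are untouched. The functional form and the knee threshold are assumptions for illustration, not the paper's recipe.

# Hedged sketch of soft-log downsampling of a heavy-headed query corpus.
import math
from collections import Counter

def soft_log_count(count: int, knee: int = 10) -> int:
    """Keep counts below `knee` unchanged; grow only logarithmically above it."""
    if count <= knee:
        return count
    return round(knee * (1.0 + math.log(count / knee)))

corpus_counts = Counter({"weather": 100000, "weather tomorrow": 5000,
                         "pangolin habitat facts": 3})
downsampled = {s: soft_log_count(c) for s, c in corpus_counts.items()}
print(downsampled)
# Head queries shrink by orders of magnitude; the rare query keeps all 3 copies.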
