Anirudh Raju

Federated Self-Learning with Weak Supervision for Speech Recognition

Jun 21, 2023
Milind Rao, Gopinath Chennupati, Gautam Tiwari, Anit Kumar Sahu, Anirudh Raju, Ariya Rastrow, Jasha Droppo

Low-footprint automatic speech recognition (ASR) models are increasingly being deployed on edge devices for conversational agents, which enhances privacy. We study the problem of federated continual incremental learning for recurrent neural network-transducer (RNN-T) ASR models in the privacy-enhancing scheme of learning on-device, without access to ground-truth human transcripts or machine transcriptions from a stronger ASR model. In particular, we study the performance of a self-learning-based scheme, with a paired teacher model updated through an exponential moving average of ASR models. Further, we propose using possibly noisy weak-supervision signals, such as feedback scores and natural language understanding semantics determined from user behavior across multiple turns in a session of interactions with the conversational agent. These signals are leveraged in a multi-task policy-gradient training approach to improve the performance of self-learning for ASR. Finally, we show how catastrophic forgetting can be mitigated by combining on-device learning with a memory-replay approach using selected historical datasets. These innovations allow for a 10% relative improvement in WER on new use cases, with minimal degradation on other test sets, in the absence of strong-supervision signals such as ground-truth transcriptions.
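
A minimal sketch of the exponential-moving-average (EMA) teacher update described in the abstract, assuming a PyTorch student/teacher pair; the function names and the 0.999 decay value are illustrative assumptions, not taken from the paper's implementation.

```python
import copy
import torch

def make_teacher(student: torch.nn.Module) -> torch.nn.Module:
    """The teacher starts as a frozen copy of the on-device student ASR model."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.999) -> None:
    """Exponential moving average: teacher <- decay * teacher + (1 - decay) * student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)
```

In a self-learning loop, the teacher would transcribe incoming on-device audio to produce pseudo-labels for the student, and `ema_update` would be called after each student optimization step so the teacher tracks a smoothed version of the student.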

* Proceedings of ICASSP 2023 

Cross-utterance ASR Rescoring with Graph-based Label Propagation

Mar 27, 2023
Srinath Tankasala, Long Chen, Andreas Stolcke, Anirudh Raju, Qianli Deng, Chander Chandak, Aparna Khare, Roland Maas, Venkatesh Ravichandran

We propose a novel approach for ASR N-best hypothesis rescoring with graph-based label propagation by leveraging cross-utterance acoustic similarity. In contrast to conventional neural language model (LM) based ASR rescoring/reranking models, our approach focuses on acoustic information and conducts the rescoring collaboratively among utterances, instead of individually. Experiments on the VCTK dataset demonstrate that our approach consistently improves ASR performance, as well as fairness across speaker groups with different accents. Our approach provides a low-cost solution for mitigating the majoritarian bias of ASR systems, without the need to train new domain- or accent-specific models.
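
An illustrative sketch of label propagation over a cross-utterance acoustic-similarity graph, in the spirit of the approach above; the symmetric-normalization iteration (Zhou et al.-style), the clamping of high-confidence utterances, and all variable names are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def propagate_scores(W: np.ndarray, scores: np.ndarray, trusted: np.ndarray,
                     alpha: float = 0.9, n_iter: int = 50) -> np.ndarray:
    """
    W:       (n, n) acoustic-similarity matrix between utterances.
    scores:  (n, k) initial N-best hypothesis scores per utterance.
    trusted: boolean mask of utterances whose scores are clamped (high confidence).
    Returns smoothed scores after collaborative propagation across utterances.
    """
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d) + 1e-12)   # symmetric normalization of the graph
    F = scores.copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * scores   # diffuse scores along similar utterances
        F[trusted] = scores[trusted]               # keep trusted utterances fixed
    return F
```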

* To appear in IEEE ICASSP 2023 

Adaptive Endpointing with Deep Contextual Multi-armed Bandits

Mar 23, 2023
Do June Min, Andreas Stolcke, Anirudh Raju, Colin Vaz, Di He, Venkatesh Ravichandran, Viet Anh Trinh

Current endpointing (EP) solutions learn in a supervised framework, which does not allow the model to incorporate feedback and improve in an online setting. It is also common practice to use costly grid search to find the best configuration for an endpointing model. In this paper, we provide a solution for adaptive endpointing by proposing an efficient method for choosing an optimal endpointing configuration given utterance-level audio features in an online setting, while avoiding hyperparameter grid search. Our method does not require ground-truth labels and learns online from reward signals alone. Specifically, we propose a deep contextual multi-armed bandit approach that combines the representational power of neural networks with the action-exploration behavior of Thompson sampling. We compare our approach to several baselines and show that our deep bandit models also succeed in reducing early-cutoff errors while maintaining low latency.
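
A hedged sketch of contextual Thompson sampling for picking an endpointing configuration (arm) from utterance-level features. A linear-Gaussian posterior per arm is used here as a simplification of the deep bandit in the paper; the class and parameter names are hypothetical.

```python
import numpy as np

class LinearTSBandit:
    """One Bayesian linear-regression model per endpointing configuration (arm)."""

    def __init__(self, n_arms: int, dim: int, sigma: float = 1.0):
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm precision matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward-weighted features
        self.sigma = sigma

    def select(self, x: np.ndarray) -> int:
        """Sample a weight vector from each arm's posterior and pick the best arm."""
        sampled = []
        for A, b in zip(self.A, self.b):
            cov = np.linalg.inv(A)
            mean = cov @ b
            w = np.random.multivariate_normal(mean, self.sigma ** 2 * cov)
            sampled.append(float(w @ x))
        return int(np.argmax(sampled))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        """Posterior update for the chosen arm from the observed reward signal."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

The reward here would be derived from downstream signals (e.g., avoiding early cutoffs while keeping latency low), so no annotated endpoint labels are needed.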


ILASR: Privacy-Preserving Incremental Learning for Automatic Speech Recognition at Production Scale

Jul 22, 2022
Gopinath Chennupati, Milind Rao, Gurpreet Chadha, Aaron Eakin, Anirudh Raju, Gautam Tiwari, Anit Kumar Sahu, Ariya Rastrow, Jasha Droppo, Andy Oberlin, Buddha Nandanoor, Prahalad Venkataramanan, Zheng Wu, Pankaj Sitpure

Incremental learning is one paradigm for building and updating models at scale with streaming data. For end-to-end automatic speech recognition (ASR), the absence of human-annotated labels, along with the need for privacy-preserving policies for model building, makes it a daunting challenge. Motivated by these challenges, in this paper we use a cloud-based framework for production systems to demonstrate insights from privacy-preserving incremental learning for automatic speech recognition (ILASR). By privacy-preserving, we mean the use of ephemeral data that are not human annotated. The system is a step forward for production-level ASR models in incremental/continual learning, offering a near real-time test bed for experimentation in the cloud for end-to-end ASR while adhering to privacy-preserving policies. We show that the proposed system can improve production models significantly (by 3%) over a new six-month time period, even in the absence of human-annotated labels, with varying levels of weak supervision and large batch sizes in incremental learning. The improvement is 20% on test sets with new words and phrases from the new time period. We demonstrate the effectiveness of building ASR models in a privacy-preserving, incremental fashion, and further explore the utility of an effective teacher model and the use of large batch sizes.
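
A minimal sketch of a privacy-preserving incremental-learning loop of the kind described above: ephemeral audio is pseudo-labeled by a teacher model, weak-supervision signals filter the pseudo-labels, and the production model is fine-tuned in large batches. All function arguments below are hypothetical stand-ins, not the ILASR system's API.

```python
from typing import Callable, Iterable, List, Tuple

def incremental_update(
    ephemeral_batches: Iterable[Iterable[bytes]],              # streaming audio, never retained long-term
    teacher_transcribe: Callable[[bytes], Tuple[str, float]],  # returns (machine transcript, confidence)
    weak_signal_ok: Callable[[bytes], bool],                   # e.g. implicit user-feedback check
    fine_tune: Callable[[List[Tuple[bytes, str]]], None],      # one large-batch training pass
    min_confidence: float = 0.8,
) -> None:
    """Periodically refresh the production model from machine-labeled, ephemeral data."""
    for batch in ephemeral_batches:
        pseudo_labeled = []
        for audio in batch:
            text, conf = teacher_transcribe(audio)
            if conf >= min_confidence and weak_signal_ok(audio):
                pseudo_labeled.append((audio, text))
        if pseudo_labeled:
            fine_tune(pseudo_labeled)   # incremental pass; data is discarded afterwards
```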

* 9 pages 

Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities

Jul 22, 2022
Pranav Dheram, Murugesan Ramakrishnan, Anirudh Raju, I-Fan Chen, Brian King, Katherine Powell, Melissa Saboowala, Karan Shetty, Andreas Stolcke

As for other forms of AI, speech recognition has recently been examined with respect to performance disparities across different user cohorts. One approach to achieve fairness in speech recognition is to (1) identify speaker cohorts that suffer from subpar performance and (2) apply fairness mitigation measures targeting the cohorts discovered. In this paper, we report on initial findings with both discovery and mitigation of performance disparities using data from a product-scale AI assistant speech recognition system. We compare cohort discovery based on geographic and demographic information to a more scalable method that groups speakers without human labels, using speaker embedding technology. For fairness mitigation, we find that oversampling of underrepresented cohorts, as well as modeling speaker cohort membership by additional input variables, reduces the gap between top- and bottom-performing cohorts, without deteriorating overall recognition accuracy.
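
A sketch of the label-free cohort-discovery and oversampling steps described above: cluster speaker embeddings, score WER per cluster, and up-weight the worst-performing cohorts when sampling training data. The clustering method, the 20% cutoff, and the boost factor are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cohort_sampling_weights(embeddings: np.ndarray, wers: np.ndarray,
                            n_cohorts: int = 20, boost: float = 3.0) -> np.ndarray:
    """Return per-utterance sampling weights that oversample subpar speaker cohorts."""
    cohorts = KMeans(n_clusters=n_cohorts, n_init=10).fit_predict(embeddings)
    cohort_wer = np.array([wers[cohorts == c].mean() for c in range(n_cohorts)])
    worst = cohort_wer.argsort()[::-1][: max(1, n_cohorts // 5)]   # bottom-performing ~20%
    weights = np.ones(len(embeddings))
    weights[np.isin(cohorts, worst)] = boost                        # up-weight those cohorts
    return weights / weights.sum()
```

Because the cohorts come from speaker embeddings rather than demographic labels, this step scales without requiring any human annotation of speaker attributes.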

* Proc. Interspeech 2022 

Attentive Contextual Carryover for Multi-Turn End-to-End Spoken Language Understanding

Dec 13, 2021
Kai Wei, Thanh Tran, Feng-Ju Chang, Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Jing Liu, Anirudh Raju, Ross McGowan, Nathan Susanj, Ariya Rastrow, Grant P. Strimel

Recent years have seen significant advances in end-to-end (E2E) spoken language understanding (SLU) systems, which directly predict intents and slots from spoken audio. While dialogue history has been exploited to improve conventional text-based natural language understanding systems, current E2E SLU approaches have not yet incorporated such critical contextual signals in multi-turn and task-oriented dialogues. In this work, we propose a contextual E2E SLU model architecture that uses a multi-head attention mechanism over encoded previous utterances and dialogue acts (actions taken by the voice assistant) of a multi-turn dialogue. We detail alternative methods to integrate these contexts into state-of-the-art recurrent and transformer-based models. When applied to a large de-identified dataset of utterances collected by a voice assistant, our method reduces average word and semantic error rates by 10.8% and 12.6%, respectively. We also present results on a publicly available dataset and show that our method significantly improves performance over a non-contextual baseline.
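
A minimal sketch of the contextual-carryover idea: current-utterance encoder states attend over encodings of previous turns and dialogue acts via multi-head attention, and the attended context is fused back into the utterance representation. The dimensions and the concatenation-plus-projection fusion are assumptions, not the paper's exact integration method.

```python
import torch
import torch.nn as nn

class ContextCarryover(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, utt: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        """
        utt:     (batch, T, d_model) current-utterance encoder states
        context: (batch, C, d_model) encoded previous utterances and dialogue acts
        """
        ctx_summary, _ = self.attn(query=utt, key=context, value=context)
        return self.proj(torch.cat([utt, ctx_summary], dim=-1))   # context-aware states
```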

* ASRU 2021 

End-to-End Spoken Language Understanding using RNN-Transducer ASR

Jul 08, 2021
Anirudh Raju, Gautam Tiwari, Milind Rao, Pranav Dheram, Bryan Anderson, Zhe Zhang, Bach Bui, Ariya Rastrow

We propose an end-to-end trained spoken language understanding (SLU) system that extracts transcripts, intents and slots from an input speech utterance. It consists of a streaming recurrent neural network transducer (RNNT) based automatic speech recognition (ASR) model connected to a neural natural language understanding (NLU) model through a neural interface. This interface allows for end-to-end training using multi-task RNNT and NLU losses. Additionally, we introduce semantic sequence loss training for the joint RNNT-NLU system that allows direct optimization of non-differentiable SLU metrics. This end-to-end SLU model paradigm can leverage state-of-the-art advancements and pretrained models in both ASR and NLU research communities, outperforming recently proposed direct speech-to-semantics models, and conventional pipelined ASR and NLU systems. We show that this method improves both ASR and NLU metrics on both public SLU datasets and large proprietary datasets.
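
A hedged sketch of the multi-task objective: an RNN-T ASR loss combined with NLU intent and slot losses computed from outputs of the neural interface. The loss weighting and the cross-entropy formulation below are illustrative choices, not the paper's exact design (which also includes a semantic sequence loss).

```python
import torch
import torch.nn as nn

class JointSLULoss(nn.Module):
    def __init__(self, asr_weight: float = 1.0, intent_weight: float = 0.5, slot_weight: float = 0.5):
        super().__init__()
        self.asr_weight, self.intent_weight, self.slot_weight = asr_weight, intent_weight, slot_weight
        self.intent_ce = nn.CrossEntropyLoss()
        self.slot_ce = nn.CrossEntropyLoss(ignore_index=-100)   # ignore padded slot positions

    def forward(self, rnnt_loss: torch.Tensor,
                intent_logits: torch.Tensor, intent_targets: torch.Tensor,
                slot_logits: torch.Tensor, slot_targets: torch.Tensor) -> torch.Tensor:
        """rnnt_loss is the transducer loss; slot_logits is (batch, T, n_slots)."""
        intent_loss = self.intent_ce(intent_logits, intent_targets)
        slot_loss = self.slot_ce(slot_logits.transpose(1, 2), slot_targets)  # (B, C, T) vs (B, T)
        return (self.asr_weight * rnnt_loss
                + self.intent_weight * intent_loss
                + self.slot_weight * slot_loss)
```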


Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End

May 14, 2021
Swayambhu Nath Ray, Minhua Wu, Anirudh Raju, Pegah Ghahremani, Raghavendra Bilgi, Milind Rao, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Jasha Droppo

Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional information to improve a recurrent neural network-transducer (RNN-T) based automatic speech recognition (ASR) system. An audio-to-intent (A2I) model encodes the intent of the utterance in the form of embeddings or posteriors, and these are used as auxiliary inputs for RNN-T training and inference. Experimenting with a 50k-hour far-field English speech corpus, this study shows that when running the system in non-streaming mode, where intent representation is extracted from the entire utterance and then used to bias streaming RNN-T search from the start, it provides a 5.56% relative word error rate reduction (WERR). On the other hand, a streaming system using per-frame intent posteriors as extra inputs for the RNN-T ASR system yields a 3.33% relative WERR. A further detailed analysis of the streaming system indicates that our proposed method brings especially good gain on media-playing related intents (e.g. 9.12% relative WERR on PlayMusicIntent).
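
A sketch of the streaming variant described above, where per-frame audio-to-intent (A2I) posteriors are used as auxiliary inputs to the RNN-T encoder by concatenation with the acoustic features. The feature dimensions and the LSTM encoder are illustrative assumptions standing in for the actual RNN-T encoder.

```python
import torch
import torch.nn as nn

class IntentAugmentedEncoder(nn.Module):
    def __init__(self, feat_dim: int = 80, n_intents: int = 64, hidden: int = 512):
        super().__init__()
        # Encoder consumes acoustic features concatenated with A2I intent posteriors.
        self.encoder = nn.LSTM(feat_dim + n_intents, hidden, num_layers=2, batch_first=True)

    def forward(self, feats: torch.Tensor, intent_posteriors: torch.Tensor) -> torch.Tensor:
        """
        feats:             (batch, T, feat_dim) acoustic features
        intent_posteriors: (batch, T, n_intents) per-frame A2I posteriors
        """
        x = torch.cat([feats, intent_posteriors], dim=-1)
        out, _ = self.encoder(x)
        return out   # encoder states consumed by the RNN-T joint network
```

In the non-streaming mode described in the abstract, a single utterance-level intent embedding would instead be computed first and used to bias the search from the start of decoding.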
