Shashi Kumar

Unifying Streaming and Non-streaming Zipformer-based ASR

Jun 17, 2025

Bidisha Sharma, Karthik Pandia Durai, Shankar Venkatesan, Jeena J Prakash, Shashi Kumar, Malolan Chetlur, Andreas Stolcke

Abstract: There has been increasing interest in unifying streaming and non-streaming automatic speech recognition (ASR) models to reduce development, training, and deployment costs. We present a unified framework that trains a single end-to-end ASR model for both streaming and non-streaming applications, leveraging future context information. We propose to use dynamic right-context through chunked attention masking in the training of zipformer-based ASR models. We demonstrate that using right-context is more effective in zipformer models than in other conformer models due to their multi-scale nature. We analyze the effect of varying the number of right-context frames on the accuracy and latency of the streaming ASR models. We use Librispeech and large in-house conversational datasets to train different versions of streaming and non-streaming models and evaluate them in a production-grade server-client setup across diverse test sets from different domains. The proposed strategy reduces word error by a relative 7.9% with a small degradation in user-perceived latency. By adding more right-context frames, we are able to achieve streaming performance close to that of non-streaming models. Our approach also allows flexible control of the latency-accuracy tradeoff according to customers' requirements.

* Accepted at ACL 2025 Industry Track
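
The chunked attention masking with right-context mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `left_chunks` parameter, and the candidate right-context values are assumptions chosen for illustration. Each query frame may attend to its own chunk, a limited number of past chunks, and a few frames beyond the end of its chunk.

```python
import numpy as np

def chunked_attention_mask(num_frames, chunk_size, left_chunks, right_context):
    """Boolean (num_frames, num_frames) mask; True means the query row may attend to the key column.

    Hypothetical helper: a frame sees its own chunk, `left_chunks` previous
    chunks, and `right_context` frames past the end of its chunk.
    """
    idx = np.arange(num_frames)
    chunk_id = idx // chunk_size
    # First and last (inclusive) key frame visible to each query frame.
    start = np.maximum(chunk_id - left_chunks, 0) * chunk_size
    end = np.minimum((chunk_id + 1) * chunk_size + right_context, num_frames) - 1
    keys = idx[None, :]
    return (keys >= start[:, None]) & (keys <= end[:, None])

def sample_right_context(rng, choices=(0, 2, 4)):
    """Pick a right-context size per training batch (the values are assumed, not from the paper)."""
    return int(rng.choice(choices))

rng = np.random.default_rng(0)
mask = chunked_attention_mask(num_frames=8, chunk_size=4, left_chunks=1,
                              right_context=sample_right_context(rng))
print(mask.astype(int))
```

Sampling the right-context size during training is one way to read "dynamic" here; at inference a single value would be fixed to trade latency against accuracy.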

Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering

Jun 05, 2025

Andres Carofilis, Pradeep Rangappa, Srikanth Madikeri, Shashi Kumar, Sergio Burdisso, Jeena Prakash, Esau Villatoro-Tello, Petr Motlicek, Bidisha Sharma, Kadri Hacioglu (+3 more)

Abstract: Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce. But unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, showing slower performance saturation compared to random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, it outperforms single-step fine-tuning. Consensus-based filtering outperforms other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at a lower computational cost.

* Accepted at Interspeech 2025, Netherlands
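
One plausible reading of multi-model consensus filtering is sketched below: an unlabeled utterance is kept as a pseudo-label only when the transcripts produced by several models agree closely. The agreement measure, the threshold, and all function names are assumptions for illustration; the abstract does not specify how consensus is computed.

```python
def word_edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two transcripts."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rw != hw)))  # substitution or match
        prev = cur
    return prev[-1]

def disagreement(hyp_a, hyp_b):
    """Edit distance normalised by the longer transcript (assumed measure)."""
    n = max(len(hyp_a.split()), len(hyp_b.split()))
    return 0.0 if n == 0 else word_edit_distance(hyp_a, hyp_b) / n

def consensus_filter(hypotheses, max_disagreement=0.1):
    """Keep utterances whose model transcripts all agree pairwise.

    hypotheses: dict mapping utterance id -> list of transcripts, one per model.
    Returns a dict of utterance id -> pseudo-label (the first model's transcript).
    """
    kept = {}
    for utt_id, hyps in hypotheses.items():
        pairs = [(a, b) for i, a in enumerate(hyps) for b in hyps[i + 1:]]
        if pairs and all(disagreement(a, b) <= max_disagreement for a, b in pairs):
            kept[utt_id] = hyps[0]
    return kept

selected = consensus_filter({
    "utt1": ["hello world", "hello world"],
    "utt2": ["call me back", "tall me bag"],
})
print(selected)
```

In an incremental pipeline, the selected pseudo-labels would be added to the training set, the model retrained, and the filter applied again to the remaining unlabeled audio.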

A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport

Feb 03, 2025

Yacouba Kaloga, Shashi Kumar, Petr Motlicek, Ina Kodrasi

Abstract: Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as the Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric, called Sequence Optimal Transport Distance (SOTD), over the sequence space and discuss its theoretical properties. Based on the SOTD, we propose Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with CTC. Experimental results on the TIMIT, AMI, and LibriSpeech datasets show that our method considerably improves alignment performance, though with a trade-off in ASR performance when compared to CTC. We believe this work opens new avenues for seq2seq alignment research, providing a solid foundation for further exploration and development within the community.
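
As background for the one-dimensional optimal transport mentioned above, the sketch below computes the standard 1-Wasserstein distance between two discrete distributions on a shared, sorted set of positions, using the well-known closed form based on cumulative distributions. It is not the paper's SOTD or OTTC loss; the function name and inputs are illustrative assumptions.

```python
import numpy as np

def wasserstein_1d(positions, mass_a, mass_b):
    """1-Wasserstein distance between two distributions on sorted 1-D positions.

    In one dimension the optimal transport cost equals the area between the
    two cumulative distribution functions, so no linear program is needed.
    """
    positions = np.asarray(positions, dtype=float)
    a = np.asarray(mass_a, dtype=float)
    b = np.asarray(mass_b, dtype=float)
    a = a / a.sum()  # normalise to probability mass
    b = b / b.sum()
    # CDF gap on each interval between consecutive positions.
    cdf_gap = np.abs(np.cumsum(a) - np.cumsum(b))[:-1]
    return float(np.sum(cdf_gap * np.diff(positions)))

# All mass moves from position 0 to position 2.
print(wasserstein_1d([0.0, 1.0, 2.0], [1, 0, 0], [0, 0, 1]))
```

Because the computation uses only cumulative sums and absolute differences, it is differentiable almost everywhere with respect to the masses, which is one reason one-dimensional transport is attractive for learning alignments.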

Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward

Nov 06, 2024

Shashi Kumar, Iuliia Thorbecke, Sergio Burdisso, Esaú Villatoro-Tello, Manjunath K E, Kadri Hacioğlu, Pradeep Rangappa, Petr Motlicek, Aravind Ganapathiraju, Andreas Stolcke

Abstract: Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and different speech perturbations. In this paper, we address these questions by conducting various ablation experiments using a recent and widely adopted approach called SLAM-ASR. We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture across a wide range of settings. Our main findings indicate that SLAM-ASR exhibits poor performance in cross-domain evaluation settings. Additionally, speech perturbations within in-domain data, such as changes in speed or the presence of additive noise, can significantly impact performance. Our findings offer critical insights for fine-tuning and configuring robust LLM-based ASR models, tailored to different data characteristics and computational resources.

* Submitted to ICASSP 2025 SALMA Workshop

XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models

Jul 05, 2024

Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Esaú Villatoro-Tello, Iuliia Nigmatulina, Petr Motlicek, Manjunath K E, Aravind Ganapathiraju

Abstract: Self-supervised pretrained models exhibit competitive performance in automatic speech recognition on finetuning, even with limited in-domain supervised data for training. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as the encoder in a transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves a 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer model trained from scratch. To enable streaming capabilities, we investigate different attention masking patterns in the self-attention computation of transformer layers within the XLSR-53 model. We validate XLSR-Transducer on AMI and 5 languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we reduce the left context by half while achieving a relative 12% improvement in WER.

* 5 pages, double column
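
A streaming attention mask with attention sinks might look like the sketch below. It assumes the common formulation in which the first few frames of the utterance stay visible to every query in addition to a limited left context; the abstract does not say how the sinks are realised in XLSR-Transducer, so the function and its parameters are hypothetical.

```python
import numpy as np

def streaming_mask_with_sinks(num_frames, chunk_size, left_frames, num_sinks):
    """Boolean (num_frames, num_frames) mask; True means attention is allowed.

    Each query frame sees its own chunk, `left_frames` frames before the chunk,
    and the first `num_sinks` frames of the utterance (the assumed sinks).
    """
    idx = np.arange(num_frames)
    chunk_start = (idx // chunk_size) * chunk_size
    chunk_end = np.minimum(chunk_start + chunk_size, num_frames) - 1
    keys = idx[None, :]
    # Limited left context plus the current chunk; nothing after the chunk end.
    local = (keys >= (chunk_start - left_frames)[:, None]) & (keys <= chunk_end[:, None])
    # Sink frames stay visible, but never beyond the current chunk.
    sinks = (keys < num_sinks) & (keys <= chunk_end[:, None])
    return local | sinks

mask = streaming_mask_with_sinks(num_frames=12, chunk_size=2, left_frames=2, num_sinks=1)
print(mask.astype(int))
```

With sinks enabled, `left_frames` can be made smaller while early frames remain reachable, which matches the trade-off the abstract describes in spirit only.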

TokenVerse: Unifying Speech and NLP Tasks via Transducer-based ASR

Jul 05, 2024

Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Iuliia Nigmatulina, Esaú Villatoro-Tello, Sergio Burdisso, Petr Motlicek, Karthik Pandia, Aravind Ganapathiraju

Abstract: In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Additionally, we present task transfer learning to a new task within an existing TokenVerse.

* 5 pages, double column
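
The idea of integrating task-specific tokens into the reference text can be sketched as a small preprocessing step. The token strings (`[SCD]`, `[NE]`, `[/NE]`, `[ENDP]`), the input layout, and the function name below are hypothetical; the abstract names the three tasks but not the token inventory or data format.

```python
def build_reference(turns, scd="[SCD]", ne_open="[NE]", ne_close="[/NE]", endp="[ENDP]"):
    """Build one training transcript with task tokens interleaved with the words.

    turns: list of dicts with keys
      'speaker'  - speaker label,
      'words'    - list of words,
      'entities' - optional list of (start, end) word spans, end exclusive,
      'complete' - optional bool, True if the turn ends a semantic unit.
    """
    out = []
    prev_speaker = None
    for turn in turns:
        # Speaker change detection: mark the boundary between different speakers.
        if prev_speaker is not None and turn["speaker"] != prev_speaker:
            out.append(scd)
        spans = turn.get("entities", [])
        starts = {s for s, _ in spans}
        ends = {e for _, e in spans}
        for i, word in enumerate(turn["words"]):
            if i in starts:
                out.append(ne_open)   # named entity begins here
            out.append(word)
            if i + 1 in ends:
                out.append(ne_close)  # named entity ends after this word
        if turn.get("complete", False):
            out.append(endp)          # endpointing token
        prev_speaker = turn["speaker"]
    return " ".join(out)

reference = build_reference([
    {"speaker": "A", "words": ["call", "john", "smith"], "entities": [(1, 3)], "complete": True},
    {"speaker": "B", "words": ["okay"]},
])
print(reference)  # call [NE] john smith [/NE] [ENDP] [SCD] okay
```

A model trained on such references would emit the task tokens alongside the words, so the token positions in the decoded output carry the task predictions.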

Improved far-field speech recognition using Joint Variational Autoencoder

Apr 24, 2022

Shashi Kumar, Shakti P. Rath, Abhishek Pandey

Abstract: Automatic Speech Recognition (ASR) systems suffer considerably when source speech is corrupted with noise or room impulse responses (RIR). Typically, speech enhancement is applied in both mismatched and matched scenario training and testing. In the matched setting, the acoustic model (AM) is trained on dereverberated far-field features, while in the mismatched setting, the AM is fixed. In the recent past, mapping speech features from far-field to close-talk using a denoising autoencoder (DA) has been explored. In this paper, we focus on matched scenario training and show that the proposed joint VAE based mapping achieves a significant improvement over DA. Specifically, we observe an absolute improvement of 2.5% in word error rate (WER) compared to DA based enhancement and 3.96% compared to an AM trained directly on far-field filterbank features.

* 5 pages, 2 figures, 3 tables

SRIB Submission to Interspeech 2021 DiCOVA Challenge

Jun 15, 2021

Vishwanath Pratap Singh, Shashi Kumar, Ravi Shekhar Jha, Abhishek Pandey

Abstract: The COVID-19 pandemic has resulted in more than 125 million infections and more than 2.7 million casualties. In this paper, we attempt to classify covid vs non-covid cough sounds using signal processing and deep learning methods. Air turbulence, the vibration of tissues, movement of fluid through airways, and the opening and closure of the glottis are some of the causes of the acoustic sound signals produced during a cough. Does COVID-19 alter the acoustic characteristics of breath, cough, and speech sounds produced through the respiratory system? This is an open question waiting for answers. In this paper, we incorporated novel data augmentation methods for cough sound augmentation and multiple deep neural network architectures and methods along with handcrafted features. Our proposed system gives a 14% absolute improvement in area under the curve (AUC). The proposed system is developed as part of the Interspeech 2021 special sessions and challenges, viz. diagnosing of COVID-19 using acoustics (DiCOVA). Our proposed method secured the 5th position on the leaderboard among 29 participants.

* 5 pages, 5 figures

On Optimizing Human-Machine Task Assignments

Sep 24, 2015

Andreas Veit, Michael Wilber, Rajan Vaish, Serge Belongie, James Davis, Vishal Anand, Anshu Aviral, Prithvijit Chakrabarty, Yash Chandak, Sidharth Chaturvedi (+41 more)

Abstract: When crowdsourcing systems are used in combination with machine inference systems in the real world, they benefit the most when the machine system is deeply integrated with the crowd workers. However, if researchers wish to integrate the crowd with "off-the-shelf" machine classifiers, this deep integration is not always possible. This work explores two strategies to increase accuracy and decrease cost under this setting. First, we show that reordering tasks presented to the human can create a significant accuracy improvement. Further, we show that greedily choosing parameters to maximize machine accuracy is sub-optimal, and joint optimization of the combined system improves performance.

* HCOMP 2015 Work in Progress
