Hainan Xu

Word Level Timestamp Generation for Automatic Speech Recognition and Translation

May 21, 2025

Ke Hu, Krishna Puvvada, Elena Rastorgueva, Zhehuai Chen, He Huang, Shuoyang Ding, Kunal Dhawan, Hainan Xu, Jagadeesh Balam, Boris Ginsburg

Abstract: We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignment mechanisms. By leveraging the NeMo Forced Aligner (NFA) as a teacher model, we generate word-level timestamps and train the Canary model to predict timestamps directly. We introduce a new <|timestamp|> token, enabling the Canary model to predict start and end timestamps for each word. Our method demonstrates precision and recall rates between 80% and 90%, with timestamp prediction errors ranging from 20 to 120 ms across four languages and minimal WER degradation. Additionally, we extend our system to automatic speech translation (AST) tasks, achieving timestamp prediction errors around 200 ms.

* Accepted to Interspeech 2025

WIND: Accelerated RNN-T Decoding with Windowed Inference for Non-blank Detection

May 19, 2025

Hainan Xu, Vladimir Bataev, Lilit Grigoryan, Boris Ginsburg

Abstract: We propose Windowed Inference for Non-blank Detection (WIND), a novel strategy that significantly accelerates RNN-T inference without compromising model accuracy. During inference, instead of processing frames sequentially, WIND processes multiple frames within a window in parallel, allowing the model to quickly locate non-blank predictions during decoding, resulting in significant speed-ups. We implement WIND for greedy decoding and for batched greedy decoding with label-looping techniques, and also propose a novel beam-search decoding method. Experiments on multiple datasets with different conditions show that our method, when operating in greedy modes, is up to 2.4X faster than the baseline sequential approach while maintaining identical Word Error Rate (WER). Our beam-search algorithm achieves slightly better accuracy than alternative methods, with significantly improved speed. We will open-source our WIND implementation.
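
The windowed idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_joint_batch` is a hypothetical stand-in for the joint network, the window size is arbitrary, and only the control flow (score a window of frames at once for a fixed decoder state, then jump to the first non-blank) is meant to mirror the description.

```python
import numpy as np

BLANK = 0

def toy_joint_batch(frames, last_label):
    """Hypothetical joint network: scores a whole window of frames at once
    for one fixed decoder state, returning a label id (or BLANK) per frame."""
    return np.where(frames == last_label, BLANK, frames)

def sequential_greedy(frames):
    """Baseline: examine one frame per joint call."""
    hyp, last, t = [], BLANK, 0
    while t < len(frames):
        label = int(toy_joint_batch(frames[t:t + 1], last)[0])
        if label == BLANK:
            t += 1                      # blank: advance one frame
        else:
            hyp.append(label)           # non-blank: emit, stay on the frame
            last = label
    return hyp

def windowed_greedy(frames, window=4):
    """Windowed variant: one joint call covers up to `window` frames."""
    hyp, last, t = [], BLANK, 0
    while t < len(frames):
        labels = toy_joint_batch(frames[t:t + window], last)
        nonblank = np.flatnonzero(labels != BLANK)
        if nonblank.size == 0:
            t += len(labels)            # whole window is blank: skip it
        else:
            t += int(nonblank[0])       # jump to the first non-blank frame
            last = int(labels[nonblank[0]])
            hyp.append(last)
    return hyp

frames = np.array([0, 3, 3, 0, 5, 0, 0, 0, 7])
print(sequential_greedy(frames), windowed_greedy(frames))
```

In this toy setting both loops return the same hypothesis; the windowed loop simply needs fewer joint calls when blanks dominate.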

Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR

Oct 03, 2024

Hainan Xu, Travis M. Bartley, Vladimir Bataev, Boris Ginsburg

Abstract: We present Hybrid-Autoregressive INference TrANsducers (HAINAN), a novel architecture for speech recognition that extends the Token-and-Duration Transducer (TDT) model. Trained with randomly masked predictor network outputs, HAINAN supports both autoregressive inference with all network components and non-autoregressive inference without the predictor. Additionally, we propose a novel semi-autoregressive inference paradigm that first generates an initial hypothesis using non-autoregressive inference, followed by refinement steps where each token prediction is regenerated using parallelized autoregression on the initial hypothesis. Experiments on multiple datasets across different languages demonstrate that HAINAN achieves efficiency parity with CTC in non-autoregressive mode and with TDT in autoregressive mode. In terms of accuracy, autoregressive HAINAN outperforms TDT and RNN-T, while non-autoregressive HAINAN significantly outperforms CTC. Semi-autoregressive inference further enhances the model's accuracy with minimal computational overhead, and even outperforms TDT results in some cases. These results highlight HAINAN's flexibility in balancing accuracy and speed, positioning it as a strong candidate for real-world speech recognition applications.

Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation

Sep 09, 2024

Nithin Rao Koluguri, Travis Bartley, Hainan Xu, Oleksii Hrinchuk, Jagadeesh Balam, Boris Ginsburg, Georg Kucsko

Abstract: This paper presents a new method for training sequence-to-sequence models for speech recognition and translation tasks. Instead of the traditional approach of training models on short segments containing only lowercase or partial punctuation and capitalization (PnC) sentences, we propose training on longer utterances that include complete sentences with proper punctuation and capitalization. We achieve this by using the FastConformer architecture, which allows training 1-billion-parameter models with sequences up to 60 seconds long with full attention. However, while training with PnC enhances the overall performance, we observed that accuracy plateaus when training on sequences longer than 40 seconds across various evaluation settings. Our proposed method significantly improves punctuation and capitalization accuracy, showing a 25% relative word error rate (WER) improvement on the Earnings-21 and Earnings-22 benchmarks. Additionally, training on longer audio segments increases the overall model accuracy across speech recognition and translation benchmarks. The model weights and training code are open-sourced through NVIDIA NeMo.

* Accepted at SLT 2024

Romanization Encoding For Multilingual ASR

Jul 05, 2024

Wen Ding, Fei Jia, Hainan Xu, Yu Xi, Junjie Lai, Boris Ginsburg

Abstract: We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. By adopting romanization encoding alongside a balanced concatenated tokenizer within a FastConformer-RNNT framework equipped with a Roman2Char module, we significantly reduce vocabulary and output dimensions, enabling larger training batches and reduced memory consumption. Our method decouples acoustic modeling and language modeling, enhancing the flexibility and adaptability of the system. In our study, applying this method to Mandarin-English ASR resulted in a remarkable 63.51% vocabulary reduction and notable performance gains of 13.72% and 15.03% on SEAME code-switching benchmarks. Ablation studies on Mandarin-Korean and Mandarin-Japanese highlight our method's strong capability to address the complexities of other script-heavy languages, paving the way for more versatile and effective multilingual ASR systems.

Label-Looping: Highly Efficient Decoding for Transducers

Jun 10, 2024

Vladimir Bataev, Hainan Xu, Daniel Galvez, Vitaly Lavrukhin, Boris Ginsburg

Abstract: This paper introduces a highly efficient greedy decoding algorithm for Transducer inference. We propose a novel data structure using CUDA tensors to represent partial hypotheses in a batch that supports parallelized hypothesis manipulations. During decoding, our algorithm maximizes GPU parallelism by adopting a nested-loop design, where the inner loop consumes all blank predictions, while non-blank predictions are handled in the outer loop. Our algorithm is general-purpose and can work with both conventional Transducers and Token-and-Duration Transducers. Experiments show that the label-looping algorithm can bring a speedup of up to 2.0X compared to conventional batched decoding algorithms when using batch size 32, and can be combined with other compiler or GPU call-related techniques to bring more speedup. We will open-source our implementation to benefit the research community.
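
The nested-loop design mentioned in the abstract can be sketched for a single utterance in plain Python. This is only an illustration of the control flow under stated assumptions: `toy_joint` is a hypothetical stand-in for the joint network, the decoder state is reduced to the last emitted label, and the batching and CUDA tensor data structure that the paper relies on are left out entirely.

```python
BLANK = 0

def toy_joint(frame, last_label):
    """Hypothetical joint network: returns a label id, or BLANK.
    A frame emits its own id once; frames with id 0 always yield BLANK."""
    return BLANK if frame == last_label else frame

def frame_loop_greedy(frames):
    """Conventional greedy decoding: the outer loop runs over frames."""
    hyp, last = [], BLANK
    for frame in frames:
        label = toy_joint(frame, last)
        while label != BLANK:           # emit until this frame yields blank
            hyp.append(label)
            last = label
            label = toy_joint(frame, last)
    return hyp

def label_loop_greedy(frames):
    """Label-looping control flow: the outer loop runs over emitted labels."""
    hyp, last, t = [], BLANK, 0
    while t < len(frames):
        label = toy_joint(frames[t], last)
        # Inner loop: consume blanks while the decoder state stays fixed.
        while label == BLANK:
            t += 1
            if t == len(frames):
                return hyp
            label = toy_joint(frames[t], last)
        # Outer loop body: handle the non-blank label and update the state.
        hyp.append(label)
        last = label
    return hyp

frames = [0, 3, 3, 0, 5, 0]
print(frame_loop_greedy(frames), label_loop_greedy(frames))
```

Both functions return the same hypothesis here. The toy joint always yields blank right after an emission, so neither loop needs a cap on symbols per frame; a real decoder would normally add one.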

Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU

Jun 06, 2024

Daniel Galvez, Vladimir Bataev, Hainan Xu, Tim Kaldewey

Abstract: The vast majority of inference time for RNN Transducer (RNN-T) models today is spent on decoding. Current state-of-the-art RNN-T decoding implementations leave the GPU idle ~80% of the time. Leveraging a new CUDA 12.4 feature, CUDA graph conditional nodes, we present an exact GPU-based implementation of greedy decoding for RNN-T models that eliminates this idle time. Our optimizations speed up a 1.1 billion parameter RNN-T model end-to-end by a factor of 2.5x. This technique can be applied to the "label looping" alternative greedy decoding algorithm as well, achieving 1.7x and 1.4x end-to-end speedups when applied to 1.1 billion parameter RNN-T and Token and Duration Transducer models respectively. This work enables a 1.1 billion parameter RNN-T model to run only 16% slower than a similarly sized CTC model, contradicting the common belief that RNN-T models are not suitable for high throughput inference. The implementation is available in NVIDIA NeMo.

* Interspeech 2024 Proceedings

Transducers with Pronunciation-aware Embeddings for Automatic Speech Recognition

Apr 04, 2024

Hainan Xu, Zhehuai Chen, Fei Jia, Boris Ginsburg

Abstract: This paper proposes Transducers with Pronunciation-aware Embeddings (PET). Unlike conventional Transducers where the decoder embeddings for different tokens are trained independently, the PET model's decoder embedding incorporates shared components for text tokens with the same or similar pronunciations. With experiments conducted on multiple datasets in Mandarin Chinese and Korean, we show that PET models consistently improve speech recognition accuracy compared to conventional Transducers. Our investigation also uncovers a phenomenon that we call error chain reactions. Instead of recognition errors being evenly spread throughout an utterance, they tend to group together, with subsequent errors often following earlier ones. Our analysis shows that PET models effectively mitigate this issue by substantially reducing the likelihood of the model generating additional errors following a prior one. Our implementation will be open-sourced with the NeMo toolkit.

* Accepted at ICASSP 2024
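
One way to picture an embedding with "shared components" is a table lookup that adds a per-pronunciation vector to a per-token vector. The sketch below is an assumption for illustration only: the abstract does not say how the components are combined, and the toy vocabulary, the `token_to_pron` mapping, and the use of a plain sum are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Hypothetical toy vocabulary: token id -> pronunciation id.
# Tokens 0 and 1 are homophones, as are tokens 3 and 4.
token_to_pron = np.array([0, 0, 1, 2, 2])

token_table = rng.normal(size=(len(token_to_pron), dim))      # per-token part
pron_table = rng.normal(size=(token_to_pron.max() + 1, dim))  # shared part

def pet_embedding(token_ids):
    """Assumed combination: token-specific vector plus the vector shared by
    all tokens with the same pronunciation id."""
    token_ids = np.asarray(token_ids)
    return token_table[token_ids] + pron_table[token_to_pron[token_ids]]

emb = pet_embedding([0, 1, 2])
print(emb.shape)
```

Under this assumption, updating the shared vector for one pronunciation moves the embeddings of all its homophones together, while the per-token vectors keep them distinguishable.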

TDT-KWS: Fast And Accurate Keyword Spotting Using Token-and-duration Transducer

Mar 20, 2024

Yu Xi, Hao Li, Baochen Yang, Haoyu Li, Hainan Xu, Kai Yu

Abstract: Designing an efficient keyword spotting (KWS) system that delivers exceptional performance on resource-constrained edge devices has long been a subject of significant attention. Existing KWS search algorithms typically follow a frame-synchronous approach, where search decisions are made repeatedly at each frame despite the fact that most frames are keyword-irrelevant. In this paper, we propose TDT-KWS, which leverages token-and-duration Transducers (TDT) for KWS tasks. We also propose a novel KWS task-specific decoding algorithm for Transducer-based models, which supports highly effective frame-asynchronous keyword search in streaming speech scenarios. With evaluations conducted on both the public Hey Snips and self-constructed LibriKWS-20 datasets, our proposed KWS-decoding algorithm produces more accurate results than conventional ASR decoding algorithms. Additionally, TDT-KWS achieves on-par or better wake word detection performance than both RNN-T and traditional TDT-ASR systems while achieving significant inference speed-up. Furthermore, experiments show that TDT-KWS is more robust to noisy environments compared to RNN-T KWS.

* Accepted by ICASSP 2024

Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition

Sep 26, 2023

Dongji Gao, Hainan Xu, Desh Raj, Leibny Paola Garcia Perera, Daniel Povey, Sanjeev Khudanpur

Abstract: Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. In this paper, we propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. This allows the model to effectively learn speech-text alignments while accommodating errors present in the training transcripts. OTC extends the conventional CTC objective for imperfect transcripts by leveraging weighted finite state transducers. Through experiments conducted on the LibriSpeech and LibriVox datasets, we demonstrate that training ASR models with OTC avoids performance degradation even with transcripts containing up to 70% errors, a scenario where CTC models fail completely. Our implementation is available at https://github.com/k2-fsa/icefall.
