Boris Ginsburg

NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks

Aug 23, 2024

He Huang, Taejin Park, Kunal Dhawan, Ivan Medennikov, Krishna C. Puvvada, Nithin Rao Koluguri, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg

Abstract:Self-supervised learning has been shown to benefit a wide range of speech processing tasks, such as speech recognition/translation and speaker verification and diarization. However, most of these approaches are computationally intensive because they use a transformer encoder and lack sub-sampling. In this paper, we propose a new self-supervised learning model termed Neural Encoder for Self-supervised Training (NEST). Specifically, we adopt the FastConformer architecture, which has an 8x sub-sampling rate and is faster than Transformer or Conformer architectures. Instead of clustering-based token generation, we resort to fixed random projection for its simplicity and effectiveness. We also propose a generalized noisy speech augmentation that teaches the model to disentangle the main speaker from noise or other speakers. Experiments show that the proposed NEST model improves over existing self-supervised models on a variety of speech processing tasks. Code and checkpoints will be publicly available via NVIDIA NeMo toolkit.
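
The fixed random projection mentioned in the abstract can be illustrated with a small sketch. This is not the NEST implementation: the abstract gives no details, so the function names, dimensions, and codebook size below are hypothetical, and the code only shows the general idea of mapping each feature frame to a discrete token with a frozen random matrix and a frozen random codebook, with no clustering step.

```python
import math
import random


def make_random_quantizer(feat_dim, proj_dim=16, codebook_size=64, seed=0):
    """Build a fixed (never trained) projection matrix and unit-norm codebook.

    All sizes here are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(proj_dim)]
                  for _ in range(feat_dim)]
    codebook = []
    for _ in range(codebook_size):
        vec = [rng.gauss(0.0, 1.0) for _ in range(proj_dim)]
        norm = math.sqrt(sum(v * v for v in vec))
        codebook.append([v / norm for v in vec])
    return projection, codebook


def quantize_frame(frame, projection, codebook):
    """Map one feature frame to the index of the most similar codebook entry."""
    proj_dim = len(projection[0])
    projected = [sum(frame[i] * projection[i][j] for i in range(len(frame)))
                 for j in range(proj_dim)]
    # Codebook entries are unit-norm, so the largest dot product is also the
    # largest cosine similarity; the frame's own norm does not change the argmax.
    scores = [sum(p * c for p, c in zip(projected, code)) for code in codebook]
    return max(range(len(codebook)), key=scores.__getitem__)


if __name__ == "__main__":
    projection, codebook = make_random_quantizer(feat_dim=80)
    data_rng = random.Random(1)
    # Ten stand-in feature frames (random numbers, not real speech features).
    frames = [[data_rng.gauss(0.0, 1.0) for _ in range(80)] for _ in range(10)]
    tokens = [quantize_frame(f, projection, codebook) for f in frames]
    print(tokens)
```

In a self-supervised setup of this kind, the token indices would typically serve as prediction targets for masked frames; because the projection and codebook are fixed by the seed, the same frame always receives the same token.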

Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

Jul 29, 2024

Somshubra Majumdar, Vahid Noroozi, Sean Narenthiran, Aleksander Ficek, Jagadeesh Balam, Boris Ginsburg

Abstract:Large Language Models (LLMs) rely on instruction samples for alignment, but creating these datasets poses challenges, particularly in expert-dependent tasks like coding, which can be cost-prohibitive. One approach to mitigate these challenges is synthesizing data using another LLM. In this paper, we introduce a scalable method for generating synthetic instructions to enhance the code generation capability of LLMs. The proposed algorithm, Genetic-Instruct, mimics evolutionary processes, utilizing self-instruction to create numerous synthetic samples from a limited number of seeds. Genetic-Instruct is designed for efficient scaling of the generation process. Fine-tuning multiple coding LLMs with the synthetic samples demonstrates a significant improvement in their code generation accuracy compared to the baselines.

Schrödinger Bridge for Generative Speech Enhancement

Jul 22, 2024

Ante Jukić, Roman Korostik, Jagadeesh Balam, Boris Ginsburg

Abstract:This paper proposes a generative speech enhancement model based on the Schrödinger bridge (SB). The proposed model employs a tractable SB to formulate a data-to-data process between the clean speech distribution and the observed noisy speech distribution. The model is trained with a data prediction loss, aiming to recover the complex-valued clean speech coefficients, and an auxiliary time-domain loss is used to improve training of the model. The effectiveness of the proposed SB-based model is evaluated in two different speech enhancement tasks: speech denoising and speech dereverberation. The experimental results demonstrate that the proposed SB-based model outperforms diffusion-based models in terms of speech quality metrics and ASR performance, e.g., resulting in relative word error rate reduction of 20% for denoising and 6% for dereverberation compared to the best baseline model. The proposed model also demonstrates improved efficiency, achieving better quality than the baselines for the same number of sampling steps and with a reduced computational cost.

Romanization Encoding For Multilingual ASR

Jul 05, 2024

Wen Ding, Fei Jia, Hainan Xu, Yu Xi, Junjie Lai, Boris Ginsburg

Abstract:We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. By adopting romanization encoding alongside a balanced concatenated tokenizer within a FastConformer-RNNT framework equipped with a Roman2Char module, we significantly reduce vocabulary and output dimensions, enabling larger training batches and reduced memory consumption. Our method decouples acoustic modeling and language modeling, enhancing the flexibility and adaptability of the system. In our study, applying this method to Mandarin-English ASR resulted in a remarkable 63.51% vocabulary reduction and notable performance gains of 13.72% and 15.03% on SEAME code-switching benchmarks. Ablation studies on Mandarin-Korean and Mandarin-Japanese highlight our method's strong capability to address the complexities of other script-heavy languages, paving the way for more versatile and effective multilingual ASR systems.

Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations

Jul 03, 2024

Kunal Dhawan, Nithin Rao Koluguri, Ante Jukić, Ryan Langman, Jagadeesh Balam, Boris Ginsburg

Abstract:Discrete speech representations have garnered recent attention for their efficacy in training transformer-based models for various speech-related tasks such as automatic speech recognition (ASR), translation, speaker verification, and joint speech-text foundational models. In this work, we present a comprehensive analysis of building ASR systems with discrete codes. We investigate different methods for codec training such as quantization schemes and time-domain vs spectral feature encodings. We further explore ASR training techniques aimed at enhancing performance, training efficiency, and noise robustness. Drawing upon our findings, we introduce a codec ASR pipeline that outperforms Encodec at a similar bit-rate. Remarkably, it also surpasses the state-of-the-art results achieved by strong self-supervised models on the 143-language ML-SUPERB benchmark despite being smaller in size and pretrained on significantly less data.

* Accepted at Interspeech 2024

BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5

Jun 28, 2024

Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Nithin Rao Koluguri, Piotr Żelasko, Jagadeesh Balam, Boris Ginsburg

Abstract:Incorporating speech understanding capabilities into pretrained large-language models has become a vital research direction (SpeechLLM). Previous architectures can be categorized as: i) GPT-style, which prepend speech prompts to the text prompts as a sequence of LLM inputs like a decoder-only model; ii) T5-style, which introduce speech cross-attention to each layer of the pretrained LLMs. We propose the BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities. Moreover, there is no clear streaming solution for either style, especially considering the solution should generalize to speech multitask. We reformulate streamable SpeechLLM as a read-write policy problem and unify the offline and streaming research with the BESTOW architecture. Hence we demonstrate the first open-source SpeechLLM solution that enables Streaming and Multitask at scale (beyond ASR) at the same time. This streamable solution achieves very strong performance on a wide range of speech tasks (ASR, AST, SQA, unseen DynamicSuperb). It is end-to-end optimizable, with lower training/inference cost, and demonstrates LLM knowledge transferability to speech.

Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

Jun 28, 2024

Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin (+2 more)

Abstract:Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary, a multilingual ASR and speech translation model, outperforms current state-of-the-art models (Whisper, OWSM, and Seamless-M4T) on English, French, Spanish, and German, while being trained on an order of magnitude less data than these models. Three key factors enable such a data-efficient model: (1) a FastConformer-based attention encoder-decoder architecture, (2) training on synthetic data generated with machine translation, and (3) advanced training techniques: data-balancing, dynamic data blending, dynamic bucketing and noise-robust fine-tuning. The model, weights, and training code will be open-sourced.

* Accepted at Interspeech-2024

DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

Jun 27, 2024

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

Abstract:Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities of large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby facilitating the capability to understand both linguistic and non-linguistic features in speech. Enhanced with the proposed approach, our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks. Moreover, we discover that the aligned model exhibits a zero-shot instruction-following capability without explicit speech instruction tuning. These findings highlight the potential to reshape instruction-following SLMs by incorporating rich, descriptive speech captions.

* Accepted to Interspeech 2024

Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

Jun 25, 2024

Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg

Abstract:Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust as the generated output can contain repeating words, missing words and mis-aligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text and speech alignment when trained for predicting speech tokens for a given text. To make the alignment more robust, we propose techniques utilizing CTC loss and attention priors that encourage monotonic cross-attention over the text tokens. Our guided attention training technique does not introduce any new learnable parameters and significantly improves robustness of LLM-based TTS models.

* Published as a conference paper at INTERSPEECH 2024

Instruction Data Generation and Unsupervised Adaptation for Speech Language Models

Jun 18, 2024

Vahid Noroozi, Zhehuai Chen, Somshubra Majumdar, Steve Huang, Jagadeesh Balam, Boris Ginsburg

Abstract:In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs. Addressing the scarcity of samples containing both modalities, synthetic data generation emerges as a crucial strategy to enhance the performance of such systems and facilitate the modeling of cross-modal relationships between the speech and text domains. Our process employs large language models to generate textual components and text-to-speech systems to generate speech components. The proposed methods offer a practical and effective means to expand the training dataset for these models. Experimental results show progress in achieving an integrated understanding of text and speech. We also highlight the potential of using unlabeled speech data to generate synthetic samples comparable in quality to those with available transcriptions, enabling the expansion of these models to more languages.

* Accepted for Interspeech 2024
