Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Anshu Bhatia

Zero-resource Speech Translation and Recognition with LLMs

Dec 24, 2024

Karel Mundnich, Xing Niu, Prashant Mathur, Srikanth Ronanki, Brady Houston, Veera Raghavendra Elluru, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Anshu Bhatia(+3 more)

Figure 1 for Zero-resource Speech Translation and Recognition with LLMs

Figure 2 for Zero-resource Speech Translation and Recognition with LLMs

Figure 3 for Zero-resource Speech Translation and Recognition with LLMs

Abstract:Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM. We perform several experiments both in ST and ASR to understand how to best train the model and what data has the most impact on performance in previously unseen languages. In ST, our best model is capable to achieve BLEU scores over 23 in CoVoST2 for two previously unseen languages, while in ASR, we achieve WERs of up to 28.2\%. We finally show that the performance of our system is bounded by the ability of the LLM to output text in the desired language.

* ICASSP 2025, 5 pages, 2 figures, 2 tables

Via

Access Paper or Ask Questions

SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

May 14, 2024

Raghuveer Peri, Sai Muralidhar Jayanthi, Srikanth Ronanki, Anshu Bhatia, Karel Mundnich, Saket Dingliwal, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Srikanth Vishnubhotla(+4 more)

Figure 1 for SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

Figure 2 for SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

Figure 3 for SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

Figure 4 for SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

Abstract:Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remains largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement. Additionally, we propose countermeasures to thwart such jailbreaking attacks. Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on spoken question-answering task, scoring over 80% on both safety and helpfulness metrics. Despite safety guardrails, experiments on jailbreaking demonstrate the vulnerability of SLMs to adversarial perturbations and transfer attacks, with average attack success rates of 90% and 10% respectively when evaluated on a dataset of carefully designed harmful questions spanning 12 different toxic categories. However, we demonstrate that our proposed countermeasures reduce the attack success significantly.

* 9+6 pages, Submitted to ACL 2024

Via

Access Paper or Ask Questions

Don't Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters

Jul 02, 2023

Anshu Bhatia, Sanchit Sinha, Saket Dingliwal, Karthik Gopalakrishnan, Sravan Bodapati, Katrin Kirchhoff

Figure 1 for Don't Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters

Figure 2 for Don't Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters

Figure 3 for Don't Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters

Figure 4 for Don't Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters

Abstract:Speech representations learned in a self-supervised fashion from massive unlabeled speech corpora have been adapted successfully toward several downstream tasks. However, such representations may be skewed toward canonical data characteristics of such corpora and perform poorly on atypical, non-native accented speaker populations. With the state-of-the-art HuBERT model as a baseline, we propose and investigate self-supervised adaptation of speech representations to such populations in a parameter-efficient way via training accent-specific residual adapters. We experiment with 4 accents and choose automatic speech recognition (ASR) as the downstream task of interest. We obtain strong word error rate reductions (WERR) over HuBERT-large for all 4 accents, with a mean WERR of 22.7% with accent-specific adapters and a mean WERR of 25.1% if the entire encoder is accent-adapted. While our experiments utilize HuBERT and ASR as the downstream task, our proposed approach is both model and task-agnostic.

Via

Access Paper or Ask Questions

Masked Audio Text Encoders are Effective Multi-Modal Rescorers

May 24, 2023

Jinglun Cai, Monica Sunkara, Xilai Li, Anshu Bhatia, Xiao Pan, Sravan Bodapati

Figure 1 for Masked Audio Text Encoders are Effective Multi-Modal Rescorers

Figure 2 for Masked Audio Text Encoders are Effective Multi-Modal Rescorers

Figure 3 for Masked Audio Text Encoders are Effective Multi-Modal Rescorers

Figure 4 for Masked Audio Text Encoders are Effective Multi-Modal Rescorers

Abstract:Masked Language Models (MLMs) have proven to be effective for second-pass rescoring in Automatic Speech Recognition (ASR) systems. In this work, we propose Masked Audio Text Encoder (MATE), a multi-modal masked language model rescorer which incorporates acoustic representations into the input space of MLM. We adopt contrastive learning for effectively aligning the modalities by learning shared representations. We show that using a multi-modal rescorer is beneficial for domain generalization of the ASR system when target domain data is unavailable. MATE reduces word error rate (WER) by 4%-16% on in-domain, and 3%-7% on out-of-domain datasets, over the text-only baseline. Additionally, with very limited amount of training data (0.8 hours), MATE achieves a WER reduction of 8%-23% over the first-pass baseline.

Via

Access Paper or Ask Questions