Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kandarp Joshi

Are you speaking my languages? On spoken language adherence in multimodal LLMs

Jun 15, 2026

Hyungwon Kim, Kandarp Joshi, Lillian Zhou, Pavel Golik, Petar Aleksic

Abstract:While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting for robust guidance under uncertainty, (2) supervised fine-tuning (SFT) to improve prompt adherence, and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding. We present a comparative analysis of these methods across multiple languages, evaluating effectiveness in reducing the language violation while maintaining overall ASR performance. Finally, we discuss trade-offs to guide strategy selection under various compute constraints.

* 7 pages, 3 tables in the main body

Via

Access Paper or Ask Questions

Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices

Sep 24, 2024

Leonid Velikovich, Christopher Li, Diamantino Caseiro, Shankar Kumar, Pat Rondon, Kandarp Joshi, Xavier Velez

Figure 1 for Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices

Figure 2 for Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices

Figure 3 for Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices

Figure 4 for Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices

Abstract:For end-to-end Automatic Speech Recognition (ASR) models, recognizing personal or rare phrases can be hard. A promising way to improve accuracy is through spelling correction (or rewriting) of the ASR lattice, where potentially misrecognized phrases are replaced with acoustically similar and contextually relevant alternatives. However, rewriting is challenging for ASR models trained with connectionist temporal classification (CTC) due to noisy hypotheses produced by a non-autoregressive, context-independent beam search. We present a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models. Our algorithm performs grapheme-to-phoneme (G2P) conversion directly from wordpieces into phonemes, avoiding explicit word representations and exploiting the richness of the CTC lattice. Our approach requires no retraining or modification of the ASR model. We achieved up to a 15.2% relative reduction in sentence error rate (SER) on a test set with contextually relevant entities.

* 8 pages, 7 figures

Via

Access Paper or Ask Questions