Guohui Zhong

Joint decoding method for controllable contextual speech recognition based on Speech LLM

Aug 12, 2025

Yangui Fang, Jing Peng, Yu Xi, Xu Li, Haoyu Li, Chengwei Zhang, Guohui Zhong, Kai Yu

Abstract: Contextual speech recognition refers to the ability to identify preferences for specific content based on contextual information. Recently, leveraging the contextual understanding capabilities of Speech LLMs to achieve contextual biasing by injecting contextual information through prompts has emerged as a research hotspot. However, injecting information directly via prompts relies on the model's internal attention mechanism, making it impossible to explicitly control the extent of information injection. To address this limitation, we propose a joint decoding method to control the contextual information. This approach enables explicit control over the injected contextual information and achieves superior recognition performance. Additionally, our method can be used to suppress sensitive words during recognition. Furthermore, experimental results show that even a Speech LLM not pre-trained on long contextual data can acquire long-context capabilities through our method.
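
The abstract does not spell out the joint decoding rule, so the following is only a minimal sketch of one way an explicit control weight could work, not the authors' method. It assumes two next-token log-probability distributions are available at each step (one decoded without the context prompt, one with it) and mixes them log-linearly. The names `joint_step`, `greedy_pick`, and `weight`, and the toy vocabulary, are hypothetical.

```python
import math


def log_softmax(scores):
    """Normalize a {token: score} dict into log-probabilities."""
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {tok: s - log_z for tok, s in scores.items()}


def joint_step(logp_plain, logp_ctx, weight):
    """Log-linear mix of two next-token distributions (assumed rule).

    weight = 0 ignores the context, weight = 1 matches prompt-only decoding,
    and a negative weight pushes away from context-favored tokens, which is
    one conceivable way to suppress words.
    """
    mixed = {
        tok: (1.0 - weight) * logp_plain[tok] + weight * logp_ctx[tok]
        for tok in logp_plain
    }
    return log_softmax(mixed)


def greedy_pick(logp):
    """Return the highest-scoring token."""
    return max(logp, key=logp.get)


# Toy, made-up scores for three candidate tokens at one decoding step.
plain = log_softmax({"kai": 0.2, "kay": 1.0, "key": 0.5})
ctx = log_softmax({"kai": 2.0, "kay": 0.3, "key": 0.1})

for w in (0.0, 0.5, 1.0, -1.0):
    print(w, greedy_pick(joint_step(plain, ctx, w)))
```

In this sketch the weight is an external knob, so the strength of the bias no longer depends solely on how the model attends to its prompt.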

Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning

Jun 06, 2025

Yangui Fang, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong, Kai Yu

Abstract: Recent advances in automatic speech recognition (ASR) have combined speech encoders with large language models (LLMs) through projection, forming Speech LLMs with strong performance. However, adapting them to new domains remains challenging, especially in low-resource settings where paired speech-text data is scarce. We propose a text-only fine-tuning strategy for Speech LLMs using unpaired target-domain text without requiring additional audio. To preserve speech-text alignment, we introduce a real-time evaluation mechanism during fine-tuning. This enables effective domain adaptation while maintaining source-domain performance. Experiments on LibriSpeech, SlideSpeech, and Medical datasets show that our method achieves competitive recognition performance, with minimal degradation compared to full audio-text fine-tuning. It also improves generalization to new domains without catastrophic forgetting, highlighting the potential of text-only fine-tuning for low-resource domain adaptation of ASR.

Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction

May 30, 2025

Yangui Fang, Baixu Cheng, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong

Abstract: Automatic Speech Recognition (ASR) error correction aims to correct recognition errors while preserving accurate text. Although traditional approaches demonstrate moderate effectiveness, LLMs offer a paradigm that eliminates the need for training and labeled data. However, directly using LLMs runs into the hallucination problem, which may lead to modification of correct text. To address this problem, we propose the Reliable LLM Correction Framework (RLLM-CF), which consists of three stages: (1) error pre-detection, (2) chain-of-thought sub-task iterative correction, and (3) reasoning process verification. The advantage of our method is that it does not require additional information or fine-tuning of the model, and it ensures the correctness of the LLM correction under multi-pass programming. Experiments on AISHELL-1, AISHELL-2, and LibriSpeech show that the GPT-4o model enhanced by our framework achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER.
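
For readers unfamiliar with the metric, a relative reduction in CER/WER is the drop in error rate divided by the baseline error rate. The sketch below shows that arithmetic on a made-up sentence; the function names and the example transcripts are illustrative assumptions and do not reproduce any result from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(ref, hyp):
    """WER when given word lists, CER when given character strings."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, corrected):
    """Fraction of the baseline error rate removed by correction."""
    return (baseline - corrected) / baseline


# Made-up transcripts, for illustration only.
reference = "the cat sat on the mat".split()
asr_output = "the cat sad on a mat".split()
after_correction = "the cat sat on a mat".split()

wer_before = error_rate(reference, asr_output)
wer_after = error_rate(reference, after_correction)
print(wer_before, wer_after, relative_reduction(wer_before, wer_after))
```

Passing strings instead of word lists gives a character error rate, which is the usual choice for Mandarin corpora such as AISHELL-1 and AISHELL-2.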
