Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yangui Fang

TC-BiMamba: Trans-Chunk bidirectionally within BiMamba for unified streaming and non-streaming ASR

Feb 12, 2026

Qingshun She, Jing Peng, Yangui Fang, Yu Xi, Kai Yu

Abstract:This work investigates bidirectional Mamba (BiMamba) for unified streaming and non-streaming automatic speech recognition (ASR). Dynamic chunk size training enables a single model for offline decoding and streaming decoding with various latency settings. In contrast, existing BiMamba based streaming method is limited to fixed chunk size decoding. When dynamic chunk size training is applied, training overhead increases substantially. To tackle this issue, we propose the Trans-Chunk BiMamba (TC-BiMamba) for dynamic chunk size training. Trans-Chunk mechanism trains both bidirectional sequences in an offline style with dynamic chunk size. On the one hand, compared to traditional chunk-wise processing, TC-BiMamba simultaneously achieves 1.3 times training speedup, reduces training memory by 50%, and improves model performance since it can capture bidirectional context. On the other hand, experimental results show that TC-BiMamba outperforms U2++ and matches LC-BiMmaba with smaller model size.

* This paper is submitted to Interspeech 2026

Via

Access Paper or Ask Questions

MOSA: Mixtures of Simple Adapters Outperform Monolithic Approaches in LLM-based Multilingual ASR

Aug 26, 2025

Junjie Li, Jing Peng, Yangui Fang, Shuai Wang, Kai Yu

Figure 1 for MOSA: Mixtures of Simple Adapters Outperform Monolithic Approaches in LLM-based Multilingual ASR

Figure 2 for MOSA: Mixtures of Simple Adapters Outperform Monolithic Approaches in LLM-based Multilingual ASR

Figure 3 for MOSA: Mixtures of Simple Adapters Outperform Monolithic Approaches in LLM-based Multilingual ASR

Figure 4 for MOSA: Mixtures of Simple Adapters Outperform Monolithic Approaches in LLM-based Multilingual ASR

Abstract:End-to-end multilingual ASR aims to transcribe speech from different languages into corresponding text, but is often limited by scarce multilingual data. LLM-based ASR aligns speech encoder outputs with LLM input space via a projector and has achieved notable success. However, prior work mainly improves performance by increasing data, with little focus on cross-lingual knowledge sharing. Moreover, a single complex projector struggles to capture both shared and language-specific features effectively. In this work, we propose MOSA (Mixture of Simple Adapters), leveraging a Mixture-of-Experts mechanism to combine lightweight adapters that learn shared and language-specific knowledge. This enables better utilization of high-resource language data to support low-resource languages, mitigating data scarcity issues. Experimental results show that MOSA-Base achieves a 15.4\% relative reduction in average WER compared to the Baseline-Base and consistently outperforms it across all languages. Remarkably, MOSA-Base surpasses the Baseline-Base even when trained with only 60\% of its parameters. Similarly, MOSA-Large outperforms the Baseline-Large in average WER and demonstrates greater robustness to data imbalance. Ablation studies further indicate that MOSA is more effective at handling individual languages and learning both language-specific and shared linguistic knowledge. These findings support that, in LLM-based ASR, a mixture of simple adapters is more effective than a single, complex adapter design.

Via

Access Paper or Ask Questions

Joint decoding method for controllable contextual speech recognition based on Speech LLM

Aug 12, 2025

Yangui Fang, Jing Peng, Yu Xi, Xu Li, Haoyu Li, Chengwei Zhang, Guohui Zhong, Kai Yu

Figure 1 for Joint decoding method for controllable contextual speech recognition based on Speech LLM

Figure 2 for Joint decoding method for controllable contextual speech recognition based on Speech LLM

Figure 3 for Joint decoding method for controllable contextual speech recognition based on Speech LLM

Figure 4 for Joint decoding method for controllable contextual speech recognition based on Speech LLM

Abstract:Contextual speech recognition refers to the ability to identify preferences for specific content based on contextual information. Recently, leveraging the contextual understanding capabilities of Speech LLM to achieve contextual biasing by injecting contextual information through prompts have emerged as a research hotspot.However, the direct information injection method via prompts relies on the internal attention mechanism of the model, making it impossible to explicitly control the extent of information injection. To address this limitation, we propose a joint decoding method to control the contextual information. This approach enables explicit control over the injected contextual information and achieving superior recognition performance. Additionally, Our method can also be used for sensitive word suppression recognition.Furthermore, experimental results show that even Speech LLM not pre-trained on long contextual data can acquire long contextual capabilities through our method.

Via

Access Paper or Ask Questions

Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning

Jun 06, 2025

Yangui Fang, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong, Kai Yu

Figure 1 for Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning

Figure 2 for Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning

Figure 3 for Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning

Figure 4 for Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning

Abstract:Recent advances in automatic speech recognition (ASR) have combined speech encoders with large language models (LLMs) through projection, forming Speech LLMs with strong performance. However, adapting them to new domains remains challenging, especially in low-resource settings where paired speech-text data is scarce. We propose a text-only fine-tuning strategy for Speech LLMs using unpaired target-domain text without requiring additional audio. To preserve speech-text alignment, we introduce a real-time evaluation mechanism during fine-tuning. This enables effective domain adaptation while maintaining source-domain performance. Experiments on LibriSpeech, SlideSpeech, and Medical datasets show that our method achieves competitive recognition performance, with minimal degradation compared to full audio-text fine-tuning. It also improves generalization to new domains without catastrophic forgetting, highlighting the potential of text-only fine-tuning for low-resource domain adaptation of ASR.

Via

Access Paper or Ask Questions

Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction

May 30, 2025

Yangui Fang, Baixu Cheng, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong

Figure 1 for Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction

Figure 2 for Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction

Figure 3 for Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction

Figure 4 for Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction

Abstract:Automatic Speech Recognition (ASR) error correction aims to correct recognition errors while preserving accurate text. Although traditional approaches demonstrate moderate effectiveness, LLMs offer a paradigm that eliminates the need for training and labeled data. However, directly using LLMs will encounter hallucinations problem, which may lead to the modification of the correct text. To address this problem, we propose the Reliable LLM Correction Framework (RLLM-CF), which consists of three stages: (1) error pre-detection, (2) chain-of-thought sub-tasks iterative correction, and (3) reasoning process verification. The advantage of our method is that it does not require additional information or fine-tuning of the model, and ensures the correctness of the LLM correction under multi-pass programming. Experiments on AISHELL-1, AISHELL-2, and Librispeech show that the GPT-4o model enhanced by our framework achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER.

Via

Access Paper or Ask Questions