Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhao You

End-to-End Training for Discrete Token LLM based TTS System

Jun 08, 2026

Changfeng Gao, Yong Ren, Jun Yuan, Ye Bai, Zhao You, ShiDong Shang

Abstract:Recent state-of-the-art (SOTA) text-to-speech (TTS) systems typically adopt a cascaded pipeline consisting of a speech tokenizer, an autoregressive large language model (LLM), and a diffusion based flow-matching (FM) model, with these components trained independently. In this paper, we propose a fully end-to-end (E2E) optimization framework that unifies the training of the speech tokenizer, LLM, FM model, and an additional reward model (RM). Specifically, we first jointly optimize the tokenizer using multi-task objectives derived from reconstruction for FM, next-token prediction for LLM, and multi recognition task for RM. This joint training encourages the discrete speech token space to capture acoustically and semantically salient information that is better tailored to TTS. We then further optimize the LLM using downstream reconstruction and recognition by FM and RM, which reduces inference-time mismatch and steers the LLM toward more preferred generations. Experimental results show that our E2E framework consistently outperforms cascaded baselines. On the Seed-TTS-Eval benchmark, our system achieves a word error rate (WER) of 0.78% and 1.56%, a new SOTA result with a 0.6B-parameter LLM and 0.5B-parameter FM model. These results validate that holistic E2E optimization is critical for improving discrete-token-based TTS systems with a much simpler training pipeline.

Via

Access Paper or Ask Questions

Step-Audio 2 Technical Report

Jul 24, 2025

Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li(+99 more)

Figure 1 for Step-Audio 2 Technical Report

Figure 2 for Step-Audio 2 Technical Report

Figure 3 for Step-Audio 2 Technical Report

Figure 4 for Step-Audio 2 Technical Report

Abstract:This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.

Via

Access Paper or Ask Questions

Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Sep 04, 2023

Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu, Zhao You, Dan Su, Dong Yu, Helen Meng

Figure 1 for Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Figure 2 for Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Figure 3 for Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Figure 4 for Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Abstract:Mapping two modalities, speech and text, into a shared representation space, is a research topic of using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the length of speech representation and text representation is inconsistent. Although the previous method up-samples the text representation to align with acoustic modality, it may not match the expected actual duration. In this paper, we proposed novel representations match strategy through down-sampling acoustic representation to align with text modality. By introducing a continuous integrate-and-fire (CIF) module generating acoustic representations consistent with token length, our ASR model can learn unified representations from both modalities better, allowing for domain adaptation using text-only data of the target domain. Experiment results of new domain data demonstrate the effectiveness of the proposed method.

* Accepted by INTERSPEECH 2023. arXiv admin note: text overlap with arXiv:2309.01437

Via

Access Paper or Ask Questions

3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

Apr 14, 2022

Zhao You, Shulin Feng, Dan Su, Dong Yu

Figure 1 for 3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

Figure 2 for 3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

Figure 3 for 3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

Figure 4 for 3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

Abstract:Recently, Conformer based CTC/AED model has become a mainstream architecture for ASR. In this paper, based on our prior work, we identify and integrate several approaches to achieve further improvements for ASR tasks, which we denote as multi-loss, multi-path and multi-level, summarized as "3M" model. Specifically, multi-loss refers to the joint CTC/AED loss and multi-path denotes the Mixture-of-Experts(MoE) architecture which can effectively increase the model capacity without remarkably increasing computation cost. Multi-level means that we introduce auxiliary loss at multiple level of a deep model to help training. We evaluate our proposed method on the public WenetSpeech dataset and experimental results show that the proposed method provides 12.2%-17.6% relative CER improvement over the baseline model trained by Wenet toolkit. On our large scale dataset of 150k hours corpus, the 3M model has also shown obvious superiority over the baseline Conformer model. Code is publicly available at https://github.com/tencent-ailab/3m-asr.

* 5 pages, 1 figure. Submitted to INTERSPEECH 2022

Via

Access Paper or Ask Questions

SpeechMoE2: Mixture-of-Experts Model with Improved Routing

Nov 23, 2021

Zhao You, Shulin Feng, Dan Su, Dong Yu

Figure 1 for SpeechMoE2: Mixture-of-Experts Model with Improved Routing

Figure 2 for SpeechMoE2: Mixture-of-Experts Model with Improved Routing

Figure 3 for SpeechMoE2: Mixture-of-Experts Model with Improved Routing

Figure 4 for SpeechMoE2: Mixture-of-Experts Model with Improved Routing

Abstract:Mixture-of-experts based acoustic models with dynamic routing mechanisms have proved promising results for speech recognition. The design principle of router architecture is important for the large model capacity and high computational efficiency. Our previous work SpeechMoE only uses local grapheme embedding to help routers to make route decisions. To further improve speech recognition performance against varying domains and accents, we propose a new router architecture which integrates additional global domain and accent embedding into router input to promote adaptability. Experimental results show that the proposed SpeechMoE2 can achieve lower character error rate (CER) with comparable parameters than SpeechMoE on both multi-domain and multi-accent task. Primarily, the proposed method provides up to 1.6% - 4.8% relative CER improvement for the multidomain task and 1.9% - 17.7% relative CER improvement for the multi-accent task respectively. Besides, increasing the number of experts also achieves consistent performance improvement and keeps the computational cost constant.

* 5 pages, 1 figure. Submitted to ICASSP 2022

Via

Access Paper or Ask Questions

GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

Jun 13, 2021

Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang(+11 more)

Figure 1 for GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

Figure 2 for GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

Figure 3 for GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

Figure 4 for GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

Abstract:This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.

Via

Access Paper or Ask Questions

SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

May 07, 2021

Zhao You, Shulin Feng, Dan Su, Dong Yu

Figure 1 for SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

Figure 2 for SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

Figure 3 for SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

Figure 4 for SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

Abstract:Recently, Mixture of Experts (MoE) based Transformer has shown promising results in many domains. This is largely due to the following advantages of this architecture: firstly, MoE based Transformer can increase model capacity without computational cost increasing both at training and inference time. Besides, MoE based Transformer is a dynamic network which can adapt to the varying complexity of input instances in realworld applications. In this work, we explore the MoE based model for speech recognition, named SpeechMoE. To further control the sparsity of router activation and improve the diversity of gate values, we propose a sparsity L1 loss and a mean importance loss respectively. In addition, a new router architecture is used in SpeechMoE which can simultaneously utilize the information from a shared embedding network and the hierarchical representation of different MoE layers. Experimental results show that SpeechMoE can achieve lower character error rate (CER) with comparable computation cost than traditional static networks, providing 7.0%-23.0% relative CER improvements on four evaluation datasets.

* 5 pages, 2 figures. Submitted to Interspeech 2021

Via

Access Paper or Ask Questions

DFSMN-SAN with Persistent Memory Model for Automatic Speech Recognition

Oct 28, 2019

Zhao You, Dan Su, Jie Chen, Chao Weng, Dong Yu

Figure 1 for DFSMN-SAN with Persistent Memory Model for Automatic Speech Recognition

Figure 2 for DFSMN-SAN with Persistent Memory Model for Automatic Speech Recognition

Figure 3 for DFSMN-SAN with Persistent Memory Model for Automatic Speech Recognition

Figure 4 for DFSMN-SAN with Persistent Memory Model for Automatic Speech Recognition

Abstract:Self-attention networks (SAN) have been introduced into automatic speech recognition (ASR) and achieved state-of-the-art performance owing to its superior ability in capturing long term dependency. One of the key ingredients is the self-attention mechanism which can be effectively performed on the whole utterance level. In this paper, we try to investigate whether even more information beyond the whole utterance level can be exploited and beneficial. We propose to apply self-attention layer with augmented memory to ASR. Specifically, we first propose a variant model architecture which combines deep feed-forward sequential memory network (DFSMN) with self-attention layers to form a better baseline model compared with a purely self-attention network. Then, we propose and compare two kinds of additional memory structures added into self-attention layers. Experiments on large-scale LVCSR tasks show that on four individual test sets, the DFSMN-SAN architecture outperforms vanilla SAN encoder by 5% relatively in character error rate (CER). More importantly, the additional memory structure provides further 5% to 11% relative improvement in CER.

* 5 pages, 2 figures, subbmitted to ICASSP 2020

Via

Access Paper or Ask Questions

Teach an all-rounder with experts in different domains

Jul 09, 2019

Zhao You, Dan Su, Dong Yu

Figure 1 for Teach an all-rounder with experts in different domains

Figure 2 for Teach an all-rounder with experts in different domains

Figure 3 for Teach an all-rounder with experts in different domains

Figure 4 for Teach an all-rounder with experts in different domains

Abstract:In many automatic speech recognition (ASR) tasks, an ideal model has to be applicable over multiple domains. In this paper, we propose to teach an all-rounder with experts in different domains. Concretely, we build a multi-domain acoustic model by applying the teacher-student training framework. First, for each domain, a teacher model (domain-dependent model) is trained by fine-tuning a multi-condition model with domain-specific subset. Then all these teacher models are used to teach one single student model simultaneously. We perform experiments on two predefined domain setups. One is domains with different speaking styles, the other is nearfield, far-field and far-field with noise. Moreover, two types of models are examined: deep feedforward sequential memory network (DFSMN) and long short term memory (LSTM). Experimental results show that the model trained with this framework outperforms not only multi-condition model but also domain-dependent model. Specially, our training method provides up to 10.4% relative character error rate improvement over baseline model (multi-condition model).

* 5 pages and 2 figures; accepted by 2019 IEEE International Conference on Acoustics, Speech and Signal Processing

Via

Access Paper or Ask Questions