Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Roy Fejgin

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data

Jun 16, 2026

Subhankar Ghosh, Jason Li, Paarth Neekhara, Shehzeen Hussain, Ryan Langman, Xuesong Yang, Roy Fejgin

Abstract:Neural Text-to-Speech (TTS) systems achieve remarkable quality on short utterances but long-form speech generation shows prosodic drift, speaker inconsistencies and sentence boundary artifacts. Existing approaches either compress sequences, increase context length or naively concatenate independently synthesized chunks. We present an inference-time approach called MagpieTTS-LF that enables MagpieTTS to produce coherent long-form speech without model retraining. Our method introduces three key innovations: (1) soft attention priors to guide monotonic alignment while preserving past and future context; (2) a stateful inference algorithm that maintains context across sentence chunks, ensuring prosodic continuity; (3) history-aware text encoding that uses past text for discourse-level prosodic planning. Experiments on long texts show significant improvements in long-range intelligibility, prosodic coherence, speaker consistency, and boundary naturalness compared to other baselines.

* Interspeech 2026

Via

Access Paper or Ask Questions

Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance

Feb 07, 2025

Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Mikyas T. Desta, Roy Fejgin, Rafael Valle, Jason Li

Figure 1 for Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance

Figure 2 for Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance

Figure 3 for Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance

Figure 4 for Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance

Abstract:While autoregressive speech token generation models produce speech with remarkable variety and naturalness, their inherent lack of controllability often results in issues such as hallucinations and undesired vocalizations that do not conform to conditioning inputs. We introduce Koel-TTS, a suite of enhanced encoder-decoder Transformer TTS models that address these challenges by incorporating preference alignment techniques guided by automatic speech recognition and speaker verification models. Additionally, we incorporate classifier-free guidance to further improve synthesis adherence to the transcript and reference speaker audio. Our experiments demonstrate that these optimizations significantly enhance target speaker similarity, intelligibility, and naturalness of synthesized speech. Notably, Koel-TTS directly maps text and context audio to acoustic tokens, and on the aforementioned metrics, outperforms state-of-the-art TTS models, despite being trained on a significantly smaller dataset. Audio samples and demos are available on our website.

Via

Access Paper or Ask Questions

Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation

Sep 17, 2024

Gerard I. Gállego, Roy Fejgin, Chunghsin Yeh, Xiaoyu Liu, Gautam Bhattacharya

Figure 1 for Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation

Figure 2 for Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation

Figure 3 for Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation

Figure 4 for Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation

Abstract:Audio token modeling has become a powerful framework for speech synthesis, with two-stage approaches employing semantic tokens remaining prevalent. In this paper, we aim to simplify this process by introducing a semantic knowledge distillation method that enables high-quality speech generation in a single stage. Our proposed model improves speech quality, intelligibility, and speaker similarity compared to a single-stage baseline. Although two-stage systems still lead in intelligibility, our model significantly narrows the gap while delivering comparable speech quality. These findings showcase the potential of single-stage models to achieve efficient, high-quality TTS with a more compact and streamlined architecture.

* Demo page: see https://narsistts.github.io

Via

Access Paper or Ask Questions