Junrui Ni

ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

Jul 27, 2025

Kaizhi Qian, Xulin Fan, Junrui Ni, Slava Shechtman, Mark Hasegawa-Johnson, Chuang Gan, Yang Zhang

Abstract: Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm of training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optimal in learning prosody information -- we find that the resulting LLMs do not exhibit obvious emerging prosody processing capabilities via pre-training alone. To overcome this, we propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. Each speech utterance is first transcribed into text, followed by a sequence of word-level prosody tokens. Compared with conventional speech tokenization schemes, the proposed tokenization scheme retains more complete prosody information, and is more understandable to text-based LLMs. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone, ranging from harnessing the prosody nuances in generated speech, such as contrastive focus, understanding emotion and stress in an utterance, to maintaining prosody consistency in long contexts.

Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM

Feb 24, 2025

Jiatong Shi, Chunlei Zhang, Jinchuan Tian, Junrui Ni, Hao Zhang, Shinji Watanabe, Dong Yu

Abstract: Recent efforts have extended textual LLMs to the speech domain. Yet, a key challenge remains, which is balancing speech understanding and generation while avoiding catastrophic forgetting when integrating acoustically rich codec-based representations into models originally trained on text. In this work, we propose a novel approach that leverages continual pre-training (CPT) on a pre-trained textual LLM to create a codec-based speech language model. This strategy mitigates the modality gap between text and speech, preserving the linguistic reasoning of the original model while enabling high-fidelity speech synthesis. We validate our approach with extensive experiments across multiple tasks, including automatic speech recognition, text-to-speech, speech-to-text translation, and speech-to-speech translation (S2ST), demonstrating that our model achieves superior TTS performance and, notably, the first end-to-end S2ST system based on neural codecs.

Towards Unsupervised Speech Recognition Without Pronunciation Models

Jun 12, 2024

Junrui Ni, Liming Wang, Yang Zhang, Kaizhi Qian, Heting Gao, Mark Hasegawa-Johnson, Chang D. Yoo

Abstract: Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing the removal of reliance on a phoneme lexicon. We explore a new research direction: word-level unsupervised ASR. Using a curated speech corpus containing only high-frequency English words, our system achieves a word error rate of nearly 20% without parallel transcripts or oracle word boundaries. Furthermore, we experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. This innovative model surpasses the performance of previous unsupervised ASR models trained with direct distribution matching.

* This work has been submitted to the IEEE for possible publication

Improving Self-Supervised Speech Representations by Disentangling Speakers

Apr 20, 2022

Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, Shiyu Chang

Abstract: Self-supervised learning (SSL) in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks. Since most downstream tasks of SSL in speech focus on the content information in speech, the most desirable speech representations should be able to disentangle unwanted variations, such as speaker variations, from the content. However, disentangling speakers is very challenging, because removing the speaker information could easily result in a loss of content as well, and the damage of the latter usually far outweighs the benefit of the former. In this paper, we propose a new SSL method that can achieve speaker disentanglement without severe loss of content. Our approach is adapted from the HuBERT framework, and incorporates disentangling mechanisms to regularize both the teacher labels and the learned representations. We evaluate the benefit of speaker disentanglement on a set of content-related downstream tasks, and observe a consistent and notable performance advantage of our speaker-disentangled representations.

WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models

Apr 14, 2022

Heting Gao, Junrui Ni, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson

Abstract: Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks with only a few text examples, without the need for fine-tuning. Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings functioning like the text embeddings of the language model. Interested in exploring the possibility of transferring the few-shot learning ability to the audio-text setting, we propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model. We show that WavPrompt is a few-shot learner that can perform speech understanding tasks better than a naive text baseline. We conduct detailed ablation studies on different components and hyperparameters to empirically identify the best model configuration. In addition, we conduct a non-speech understanding experiment to show WavPrompt can extract more information than just the transcriptions. Code is available at https://github.com/Hertin/WavPrompt

* submitted to INTERSPEECH 2022

Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition

Mar 29, 2022

Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson

Abstract: An unsupervised text-to-speech synthesis (TTS) system learns to generate the speech waveform corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech. Developing such a system can significantly improve the availability of speech technology to languages without a large amount of parallel speech and text data. This paper proposes an unsupervised TTS system by leveraging recent advances in unsupervised automatic speech recognition (ASR). Our unsupervised system can achieve comparable performance to the supervised system in seven languages with about 10-20 hours of speech each. A careful study on the effect of text units and vocoders has also been conducted to better understand what factors may affect unsupervised TTS performance. The samples generated by our models can be found at https://cactuswiththoughts.github.io/UnsupTTS-Demo.

* submitted to INTERSPEECH
