Zhihao Du

MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis

Sep 18, 2025 (via arXiv)

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

May 23, 2025 (via arXiv)

EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

Apr 22, 2025 (via arXiv)

Unispeaker: A Unified Approach for Multimodality-driven Speaker Generation

Jan 11, 2025 (via arXiv)

MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

Jan 10, 2025 (via arXiv)

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Dec 13, 2024 (via arXiv)

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap

Oct 22, 2024 (via arXiv)

IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities

Oct 09, 2024 (via arXiv)

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Jul 09, 2024 (via arXiv)

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

Feb 13, 2024 (via arXiv)