Yuzhe Liang

SoulX-Duplug: Plug-and-Play Streaming State Prediction Module for Realtime Full-Duplex Speech Conversation

Mar 16, 2026
Via arXiv

V2A-DPO: Omni-Preference Optimization for Video-to-Audio Generation

Mar 11, 2026
Via arXiv

Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis

Jan 20, 2026
Via arXiv

MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows

Aug 08, 2025
Via arXiv

Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

Jun 24, 2025
Via arXiv

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

May 19, 2025
Via arXiv

Towards Flow-Matching-based TTS without Classifier-Free Guidance

Apr 29, 2025
Via arXiv

Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers

Dec 23, 2024
Via arXiv

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

Dec 20, 2024
Via arXiv

DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning

Oct 12, 2024
Via arXiv