Xinfa Zhu

XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation

Aug 12, 2025

Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis

Aug 08, 2025

Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR

May 28, 2025

FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching

May 26, 2025

U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding

May 20, 2025

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Mar 03, 2025

Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

Feb 25, 2025

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Feb 06, 2025

CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions

Jan 28, 2025

ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training

Jan 08, 2025