Picture for Xinfa Zhu

Xinfa Zhu

PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation

Add code
Oct 01, 2025
Viaarxiv icon

XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation

Add code
Aug 12, 2025
Figure 1 for XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation
Figure 2 for XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation
Figure 3 for XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation
Figure 4 for XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation
Viaarxiv icon

Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis

Add code
Aug 08, 2025
Viaarxiv icon

Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR

Add code
May 28, 2025
Viaarxiv icon

FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching

Add code
May 26, 2025
Viaarxiv icon

U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding

Add code
May 20, 2025
Viaarxiv icon

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Add code
Mar 03, 2025
Figure 1 for Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Figure 2 for Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Figure 3 for Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Figure 4 for Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Viaarxiv icon

Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

Add code
Feb 25, 2025
Figure 1 for Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought
Figure 2 for Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought
Figure 3 for Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought
Figure 4 for Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought
Viaarxiv icon

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Add code
Feb 06, 2025
Viaarxiv icon

CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions

Add code
Jan 28, 2025
Viaarxiv icon