Zhiyong Wu

How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

May 11, 2026

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

Apr 24, 2026

Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

Apr 21, 2026

The Interspeech 2026 Audio Encoder Capability Challenge for Large Audio Language Models

Mar 24, 2026

PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On

Mar 12, 2026

Kling-MotionControl Technical Report

Mar 03, 2026

UniSRCodec: Unified and Low-Bitrate Single Codebook Codec with Sub-Band Reconstruction

Jan 06, 2026

From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing

Dec 31, 2025

E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis

Nov 10, 2025

HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

Sep 10, 2025