Sanyuan Chen

Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

Dec 22, 2025
Via arXiv

SAM Audio: Segment Anything in Audio

Dec 19, 2025
Via arXiv

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

Feb 07, 2025
Via arXiv

Movie Gen: A Cast of Media Foundation Models

Oct 17, 2024
Via arXiv

Autoregressive Speech Synthesis without Vector Quantization

Jul 11, 2024
Via arXiv

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

Jun 12, 2024
Via arXiv

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

Jun 08, 2024
Via arXiv

WavLLM: Towards Robust and Adaptive Speech Large Language Model

Mar 31, 2024
Via arXiv

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

Aug 14, 2023
Via arXiv

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

Mar 07, 2023
Via arXiv