Dingdong Wang

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

May 29, 2026

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

May 23, 2026

A Survey of Audio Reasoning in Multimodal Foundation Models

May 20, 2026

V2A-DPO: Omni-Preference Optimization for Video-to-Audio Generation

Mar 11, 2026

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

Dec 09, 2025

Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs

Aug 25, 2025

MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Jun 05, 2025

InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training

Mar 04, 2025

A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models

Nov 13, 2024

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Sep 26, 2024