Xiaoda Yang

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Mar 23, 2026

CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation

Jun 24, 2025

Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval

Jun 17, 2025

Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation

May 30, 2025

Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision

Apr 30, 2025

EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model

Apr 18, 2025

OmniCam: Unified Multimodal Video Generation via Camera Control

Apr 03, 2025

Astrea: A MOE-based Visual Understanding Model with Progressive Alignment

Mar 12, 2025

Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Feb 26, 2025

EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration

Feb 20, 2025