Zhen Ye

One-Step Token-to-Waveform Generation with MeanFlow in Latent Space

Jun 16, 2026
Via arXiv

Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

Jun 10, 2026
Via arXiv

EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding

May 14, 2026
Via arXiv

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Apr 26, 2026
Via arXiv

Towards Sparse Video Understanding and Reasoning

Feb 14, 2026
Via arXiv

Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking

Jan 06, 2026
Via arXiv

DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning

Nov 17, 2025
Via arXiv

SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports

Nov 17, 2025
Via arXiv

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

Oct 10, 2025
Via arXiv

UnifiedVisual: A Framework for Constructing Unified Vision-Language Datasets

Sep 18, 2025
Via arXiv