
Yali Wang

ShenZhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model

Feb 15, 2026

MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation

Feb 11, 2026

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

Jan 30, 2026

WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction

Aug 07, 2025

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Jun 12, 2025

Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding

Jun 09, 2025

VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning

Jun 06, 2025

Video-GPT via Next Clip Diffusion

May 18, 2025

Weakly Supervised Temporal Sentence Grounding via Positive Sample Mining

May 10, 2025

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Apr 10, 2025