Alex Jinpeng Wang

TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering

Dec 18, 2025

Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification

Nov 17, 2025

See the Text: From Tokenization to Visual Reading

Oct 21, 2025

Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT

May 30, 2025

Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought

May 21, 2025

V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models

Apr 08, 2025

Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

Mar 26, 2025

TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation

Feb 11, 2025

Vision-centric Token Compression in Large Language Model

Feb 04, 2025

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

Jun 04, 2024