Jianhua Han

Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs

Jun 06, 2025

SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

May 25, 2025

ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement

Apr 03, 2025

SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation

Mar 09, 2025

Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?

Mar 08, 2025

TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba

Feb 21, 2025

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance

Dec 09, 2024

AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning

Nov 18, 2024

VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation

Nov 14, 2024

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Sep 26, 2024