Jiaming Han

UniWeTok: An Unified Binary Tokenizer with Codebook Size $2^{128}$ for Unified Multimodal Large Language Model

Feb 15, 2026

BitDance: Scaling Autoregressive Generative Models with Binary Tokens

Feb 15, 2026

Growing Visual Generative Capacity for Pre-Trained MLLMs

Oct 02, 2025

ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Jul 30, 2025

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

Jun 23, 2025

CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

May 22, 2025

Multimodal Long Video Modeling Based on Temporal Dynamic Context

Apr 14, 2025

Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation

Feb 23, 2025

AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

Dec 03, 2024

Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

Oct 17, 2024