
Jiaming Han

ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Jul 30, 2025

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

Jun 23, 2025

CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

May 22, 2025

Multimodal Long Video Modeling Based on Temporal Dynamic Context

Apr 14, 2025

Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation

Feb 23, 2025

AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

Dec 03, 2024

Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

Oct 17, 2024

OneLLM: One Framework to Align All Modalities with Language

Dec 06, 2023

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Nov 13, 2023

ImageBind-LLM: Multi-modality Instruction Tuning

Sep 11, 2023