Mu Cai

When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios

Jul 27, 2025
Via arXiv

Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models

May 26, 2025
Via arXiv

Magma: A Foundation Model for Multimodal AI Agents

Feb 18, 2025
Via arXiv

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Oct 15, 2024
Via arXiv

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

Oct 03, 2024
Via arXiv

Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

Oct 01, 2024
Via arXiv

Cross-Modal Self-Supervised Learning with Effective Contrastive Units for LiDAR Point Clouds

Sep 10, 2024
Via arXiv

VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation

Jul 15, 2024
Via arXiv

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Jun 28, 2024
Via arXiv

Yo'LLaVA: Your Personalized Language and Vision Assistant

Jun 13, 2024
Via arXiv