Renrui Zhang

MAVIS: Mathematical Visual Instruction Tuning

Jul 11, 2024
Via arXiv

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Jul 10, 2024
Via arXiv

RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation

Jun 06, 2024
Via arXiv

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

May 31, 2024
Via arXiv

TripletMix: Triplet Data Augmentation for 3D Understanding

May 28, 2024
Via arXiv

Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation

May 27, 2024
Via arXiv

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

May 25, 2024
Via arXiv

TerDiT: Ternary Diffusion Models with Transformers

May 23, 2024
Via arXiv

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

May 09, 2024
Via arXiv

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Apr 24, 2024
Via arXiv