
Renrui Zhang

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Jul 10, 2024

RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation

Jun 06, 2024

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

May 31, 2024

TripletMix: Triplet Data Augmentation for 3D Understanding

May 28, 2024

Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation

May 27, 2024

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

May 25, 2024

TerDiT: Ternary Diffusion Models with Transformers

May 23, 2024

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

May 09, 2024

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Apr 24, 2024

No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation

Apr 05, 2024