
Yali Wang

Shenzhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Apr 24, 2024

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

Mar 24, 2024

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

Mar 22, 2024

VideoMamba: State Space Model for Efficient Video Understanding

Mar 12, 2024

Percept, Chat, and then Adapt: Multimodal Knowledge Transfer of Foundation Models for Open-World Video Recognition

Feb 29, 2024

From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

Jan 29, 2024

Vlogger: Make Your Dream A Vlog

Jan 17, 2024

M-BEV: Masked BEV Perception for Robust Autonomous Driving

Dec 19, 2023

MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding

Dec 08, 2023

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Dec 03, 2023