
Yali Wang

Shenzhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model

Dec 30, 2024

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Dec 26, 2024

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

Dec 16, 2024

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

Dec 11, 2024

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

Oct 25, 2024

TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration

Oct 16, 2024

MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Aug 21, 2024

EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

Jun 27, 2024

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Jun 13, 2024