Luowei Zhou

Visual Grounding with Attention-Driven Constraint Balancing

Jul 03, 2024

ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation

Jan 01, 2024

Gemini: A Family of Highly Capable Multimodal Models

Dec 19, 2023

AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

Jun 28, 2023

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

Mar 30, 2023

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

Dec 19, 2022

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Sep 15, 2022

Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling

Aug 25, 2022

Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

Jul 26, 2022

Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning

Jun 03, 2022