Yaya Shi

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training

Mar 01, 2024

Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval

Feb 26, 2024

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

Nov 30, 2023

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

Jun 07, 2023

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Apr 27, 2023

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

Feb 01, 2023

EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching

Nov 17, 2021

A Simple and Strong Baseline for Universal Targeted Attacks on Siamese Visual Tracking

May 06, 2021

Object Relational Graph with Teacher-Recommended Learning for Video Captioning

Feb 26, 2020

VATEX Captioning Challenge 2019: Multi-modal Information Fusion and Multi-stage Training Strategy for Video Captioning

Oct 13, 2019