Pengchuan Zhang

UniVTG: Towards Unified Video-Language Temporal Grounding
Aug 18, 2023

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Jul 11, 2023

VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores
Jun 02, 2023

Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality
May 23, 2023

DIME-FM: DIstilling Multimodal and Efficient Foundation Models
Mar 31, 2023

A Unified Model for Tracking and Image-Video Detection Has More Power
Nov 20, 2022

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Jun 15, 2022

GLIPv2: Unifying Localization and Vision-Language Understanding
Jun 12, 2022

Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding
Jun 07, 2022

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
Apr 20, 2022