
Pengchuan Zhang

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

May 15, 2024

Evaluating Text-to-Visual Generation with Image-to-Text Generation

Apr 01, 2024

The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task

Nov 15, 2023

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

Oct 26, 2023

Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding

Sep 20, 2023

UniVTG: Towards Unified Video-Language Temporal Grounding

Aug 18, 2023

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

Jul 11, 2023

VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores

Jun 02, 2023

Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality

May 23, 2023

DIME-FM: DIstilling Multimodal and Efficient Foundation Models

Mar 31, 2023