Pengchuan Zhang

The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task

Nov 15, 2023
Yifan Wu, Pengchuan Zhang, Wenhan Xiong, Barlas Oguz, James C. Gee, Yixin Nie

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

Oct 26, 2023
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny

Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding

Sep 20, 2023
Mohamed Afham, Satya Narayan Shukla, Omid Poursaeed, Pengchuan Zhang, Ashish Shah, Sernam Lim

UniVTG: Towards Unified Video-Language Temporal Grounding

Aug 18, 2023
Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

Jul 11, 2023
Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, Pengchuan Zhang

VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores

Jun 02, 2023
Zhiqiu Lin, Xinyue Chen, Deepak Pathak, Pengchuan Zhang, Deva Ramanan

Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality

May 23, 2023
Harman Singh, Pengchuan Zhang, Qifan Wang, Mengjiao Wang, Wenhan Xiong, Jingfei Du, Yu Chen

DIME-FM: DIstilling Multimodal and Efficient Foundation Models

Mar 31, 2023
Ximeng Sun, Pengchuan Zhang, Peizhao Zhang, Hardik Shah, Kate Saenko, Xide Xia
