
Shih-Fu Chang

Columbia University

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Dec 16, 2021

PreViTS: Contrastive Pretraining with Video Tracking Supervision

Dec 01, 2021

Joint Multimedia Event Extraction from Video and Article

Sep 27, 2021

Partner-Assisted Learning for Few-Shot Image Classification

Sep 15, 2021

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

May 05, 2021

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Apr 22, 2021

Meta Faster R-CNN: Towards Accurate Few-Shot Object Detection with Attentive Feature Alignment

Apr 15, 2021

Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos

Mar 23, 2021

VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

Jan 29, 2021

Task-Adaptive Negative Class Envision for Few-Shot Open-Set Recognition

Dec 24, 2020