
Shraman Pramanick: Publications

SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

Jul 12, 2024

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

Dec 19, 2023

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Nov 30, 2023

UniVTG: Towards Unified Video-Language Temporal Grounding

Aug 18, 2023

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

Jul 11, 2023

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

Oct 09, 2022

Where in the World is this Image? Transformer-based Geo-localization in the Wild

Apr 29, 2022

Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection

Oct 21, 2021

Detecting Harmful Memes and Their Targets

Sep 24, 2021

MOMENTA: A Multimodal Framework for Detecting Harmful Memes and Their Targets

Sep 22, 2021