David Harwath

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

Nov 11, 2022

Phoneme Segmentation Using Self-Supervised Speech Models

Nov 02, 2022

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

Nov 02, 2022

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

Oct 07, 2022

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

Oct 03, 2022

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

Mar 30, 2022

Word Discovery in Visually Grounded, Self-Supervised Speech Models

Mar 28, 2022

Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling

Feb 07, 2022

Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

Dec 08, 2021

Routing with Self-Attention for Multimodal Capsule Networks

Dec 01, 2021