David Harwath

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

Oct 07, 2022
Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

Oct 03, 2022
Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

Mar 30, 2022
Alan Baade, Puyuan Peng, David Harwath

Word Discovery in Visually Grounded, Self-Supervised Speech Models

Mar 28, 2022
Puyuan Peng, David Harwath

Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling

Feb 07, 2022
Puyuan Peng, David Harwath

Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

Dec 08, 2021
Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

Routing with Self-Attention for Multimodal Capsule Networks

Dec 01, 2021
Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah

Cascaded Multilingual Audio-Visual Learning from Videos

Nov 08, 2021
Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass

Fast-Slow Transformer for Visually Grounding Speech

Oct 01, 2021
Puyuan Peng, David Harwath
