
Andrew Rouditchenko

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
Jun 14, 2024

AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
Sep 29, 2023

Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages
May 21, 2023

What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Mar 29, 2023

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
Oct 07, 2022

UAVM: A Unified Model for Audio-Visual Learning
Jul 29, 2022

CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification
Mar 13, 2022

Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
Dec 08, 2021

Routing with Self-Attention for Multimodal Capsule Networks
Dec 01, 2021

Cascaded Multilingual Audio-Visual Learning from Videos
Nov 08, 2021