
Samuel Thomas

Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

May 14, 2025

Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities

May 14, 2025

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

May 02, 2025

mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition

Feb 03, 2025

A Non-autoregressive Model for Joint STT and TTS

Jan 15, 2025

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

Jun 14, 2024

Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

May 21, 2023

FisHook -- An Optimized Approach to Marine Specie Classification using MobileNetV2

Apr 04, 2023

What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

Mar 29, 2023

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

Oct 07, 2022