Audio Visual Video Captioning


Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

Oct 14, 2025
Via arXiv

VeS: Teaching Pixels to Listen Without Supervision

Jul 29, 2025
Via arXiv

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos

Jul 16, 2025
Via arXiv

ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

Jul 28, 2025
Via arXiv

video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

Jun 18, 2025
Via arXiv

Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities

May 20, 2025
Via arXiv

FocusedAD: Character-centric Movie Audio Description

Apr 16, 2025
Via arXiv

Aligned Better, Listen Better for Audio-Visual Large Language Models

Apr 02, 2025
Via arXiv

JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Mar 30, 2025
Via arXiv

Unified Multimodal Discrete Diffusion

Mar 26, 2025
Via arXiv