Audio Visual Video Captioning


Klear: Unified Multi-Task Audio-Video Joint Generation

Jan 07, 2026
Via arXiv

Omni2Sound: Towards Unified Video-Text-to-Audio Generation

Jan 06, 2026
Via arXiv

OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

Dec 29, 2025
Via arXiv

TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation

Dec 16, 2025
Via arXiv

FoleyBench: A Benchmark For Video-to-Audio Models

Nov 17, 2025
Via arXiv

Caption Injection for Optimization in Generative Search Engine

Nov 06, 2025
Via arXiv

Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

Oct 14, 2025
Via arXiv

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos

Jul 16, 2025
Via arXiv

VeS: Teaching Pixels to Listen Without Supervision

Jul 29, 2025
Via arXiv

video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

Jun 18, 2025
Via arXiv