Audio Visual Captioning


MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus

Jan 14, 2026
Via arXiv

Klear: Unified Multi-Task Audio-Video Joint Generation

Jan 07, 2026
Via arXiv

Omni2Sound: Towards Unified Video-Text-to-Audio Generation

Jan 06, 2026
Via arXiv

OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

Dec 29, 2025
Via arXiv

TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation

Dec 16, 2025
Via arXiv

FoleyBench: A Benchmark For Video-to-Audio Models

Nov 17, 2025
Via arXiv

DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions

Nov 16, 2025
Via arXiv

Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

Oct 14, 2025
Via arXiv

Caption Injection for Optimization in Generative Search Engine

Nov 06, 2025
Via arXiv

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos

Jul 16, 2025
Via arXiv