Zero Shot Audio Captioning


SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training

Jan 18, 2026
Via arXiv

Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training

Jan 06, 2026
Via arXiv

Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

Dec 22, 2025
Via arXiv

TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation

Dec 16, 2025
Via arXiv

EBind: a practical approach to space binding

Nov 18, 2025
Via arXiv

MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models

Sep 16, 2025
Via arXiv

Jamendo-QA: A Large-Scale Music Question Answering Dataset

Sep 19, 2025
Via arXiv

RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing

Sep 17, 2025
Via arXiv

VeS: Teaching Pixels to Listen Without Supervision

Jul 29, 2025
Via arXiv

AC/DC: LLM-based Audio Comprehension via Dialogue Continuation

Jun 12, 2025
Via arXiv