Audiovisual segmentation (AVS) aims to identify the visual regions that correspond to sound sources, and plays a vital role in video understanding, surveillance, and human-computer interaction. Traditional AVS methods depend on large-scale pixel-level annotations, which are costly and time-consuming to obtain. To address this, we propose a novel zero-shot AVS framework that eliminates task-specific training by leveraging multiple pretrained models. Our approach integrates audio, visual, and text representations to bridge modality gaps, enabling precise sound-source segmentation without any AVS-specific annotations. We systematically explore different strategies for connecting the pretrained models and evaluate their efficacy across multiple datasets. Experimental results demonstrate that our framework achieves state-of-the-art zero-shot AVS performance, highlighting the effectiveness of multimodal model integration for fine-grained audiovisual segmentation.
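To make the idea of "connecting pretrained models through a shared text space" concrete, the sketch below shows one possible reading of such a pipeline: an audio embedding is matched against text labels, and the best-matching label is used as the prompt for a text-promptable segmenter. All encoder functions, the candidate-label routing, and the overall flow are illustrative assumptions (with random-projection stand-ins for real pretrained models such as a CLAP-style audio-text encoder), not the paper's stated design.

```python
"""Illustrative zero-shot AVS sketch: chain pretrained models via text.
Encoders below are stand-ins for real pretrained towers; the
label-mediated routing is an assumption, not the paper's method."""

import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 512

# --- Hypothetical stand-ins for pretrained encoders -----------------------

def encode_audio(waveform: np.ndarray) -> np.ndarray:
    """Placeholder for a pretrained audio encoder (e.g., a CLAP-style audio tower)."""
    return rng.standard_normal(EMB_DIM)

def encode_text(label: str) -> np.ndarray:
    """Placeholder for the matching pretrained text encoder."""
    return rng.standard_normal(EMB_DIM)

def segment_by_text(image: np.ndarray, prompt: str) -> np.ndarray:
    """Placeholder for a text-promptable segmenter; returns a binary mask."""
    h, w, _ = image.shape
    return np.zeros((h, w), dtype=bool)

# --- Bridging modalities: audio -> text label -> segmentation mask --------

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def zero_shot_avs(image: np.ndarray, waveform: np.ndarray,
                  candidate_labels: list[str]) -> tuple[str, np.ndarray]:
    """Match the audio to its closest text label in the shared embedding
    space, then use that label to prompt the segmenter on the frame."""
    audio_emb = encode_audio(waveform)
    scores = {lbl: cosine(audio_emb, encode_text(lbl)) for lbl in candidate_labels}
    best_label = max(scores, key=scores.get)
    mask = segment_by_text(image, best_label)
    return best_label, mask

if __name__ == "__main__":
    image = np.zeros((240, 320, 3), dtype=np.uint8)   # dummy video frame
    waveform = np.zeros(16000, dtype=np.float32)      # dummy 1-second audio clip
    label, mask = zero_shot_avs(image, waveform,
                                ["dog barking", "piano", "car engine"])
    print(label, mask.shape)
```

The design choice illustrated here is that text acts as the bridging modality: because no AVS-specific training is performed, the audio-to-label matching and the label-to-mask prompting each rely entirely on what the frozen pretrained models already encode.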