Abstract: We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). We compare preliminary results on the development set, which show strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity; these capabilities are crucial for enabling AI agents to perceive and interact with the world effectively.
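The abstract does not spell out the evaluation protocol, so the following is only a minimal sketch of what top-1 accuracy with answer shuffling could look like, assuming multiple-choice items. The item layout, the `predict` callable, and the `n_shuffles` and `seed` parameters are hypothetical names introduced for illustration; the audio input is omitted for brevity.

```python
import random
from typing import Callable, Sequence

# Hypothetical item layout: (question, answer choices, correct answer text).
Item = tuple[str, Sequence[str], str]


def shuffled_top1_accuracy(
    predict: Callable[[str, list[str]], str],
    items: Sequence[Item],
    n_shuffles: int = 4,
    seed: int = 0,
) -> float:
    """Top-1 accuracy averaged over random orderings of the answer choices."""
    rng = random.Random(seed)
    correct = 0
    total = 0
    for question, choices, answer in items:
        for _ in range(n_shuffles):
            order = list(choices)
            rng.shuffle(order)
            # A prediction counts only if it matches the correct answer text,
            # regardless of where that answer landed after shuffling.
            correct += predict(question, order) == answer
            total += 1
    return correct / total if total else 0.0


# Toy predictors (illustrative only, not the challenge baselines).
def first_option(question: str, choices: list[str]) -> str:
    # Position-biased: always picks whatever is listed first.
    return choices[0]


def content_based(question: str, choices: list[str]) -> str:
    # Order-invariant: picks by content.
    return "whale" if "whale" in choices else choices[0]


items = [("Which animal is calling?", ["whale", "dolphin", "seal", "bird"], "whale")]
print(shuffled_top1_accuracy(content_based, items))
print(shuffled_top1_accuracy(first_option, items, n_shuffles=100))
```

Under this sketch, a model that relies on the position of an option rather than its content tends toward chance-level accuracy as the number of shuffles grows, while an order-invariant model keeps its score.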
Abstract: Traditional sound event localization and detection (SELD) tasks typically focus on sound event detection (SED) and direction-of-arrival (DOA) estimation, but they fall short of providing full spatial information about the sound source. The 3D SELD task addresses this limitation by integrating source distance estimation (SDE), allowing for complete spatial localization. We propose three approaches to tackle this challenge: a novel method with independent training and joint prediction, which first treats DOA and distance estimation as separate tasks and then combines them to solve 3D SELD; a dual-branch representation that uses source Cartesian coordinates for simultaneous DOA and distance estimation; and a three-branch structure that jointly models SED, DOA, and SDE within a unified framework. Our proposed method ranked first in the DCASE 2024 Challenge Task 3, demonstrating the effectiveness of joint modeling for addressing the 3D SELD task. The relevant code for this paper will be open-sourced in the future.
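To illustrate how a single Cartesian vector can carry both DOA and distance, here is a minimal sketch of the conversion in both directions. It assumes azimuth and elevation in degrees, with elevation measured from the horizontal plane; the function names are hypothetical, and the abstract does not confirm the exact convention or output representation used by the authors.

```python
import math


def doa_distance_to_xyz(azimuth_deg: float, elevation_deg: float, distance: float):
    """Scale the unit DOA vector by the source distance (assumed convention)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance * math.cos(el) * math.cos(az)
    y = distance * math.cos(el) * math.sin(az)
    z = distance * math.sin(el)
    return x, y, z


def xyz_to_doa_distance(x: float, y: float, z: float):
    """Recover (azimuth_deg, elevation_deg, distance) from a Cartesian vector."""
    distance = math.sqrt(x * x + y * y + z * z)
    if distance == 0.0:
        # Direction is undefined at the origin; return zeros by convention.
        return 0.0, 0.0, 0.0
    azimuth = math.degrees(math.atan2(y, x))
    # Clamp the ratio to guard against rounding slightly outside [-1, 1].
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, z / distance))))
    return azimuth, elevation, distance


xyz = doa_distance_to_xyz(30.0, 20.0, 2.5)
print(xyz)
print(xyz_to_doa_distance(*xyz))
```

In this form the vector's direction encodes the DOA and its norm encodes the distance, so one regression target covers both quantities.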
Abstract: Sound event localization and detection with source distance estimation (3D SELD) involves not only identifying the sound category and its direction-of-arrival (DOA) but also predicting the source's distance, aiming to provide full information about the sound position. This paper proposes a multi-stage video attention network (MVANet) for audio-visual (AV) 3D SELD. Multi-stage audio features are used to adaptively capture the spatial information of sound sources in videos. To address the source distance estimation (SDE) task newly introduced in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge, we propose a novel output representation that combines the DOA and distance of sound sources by computing their actual Cartesian coordinates. We also employ a variety of effective data augmentation and pre-training methods. Experimental results on the STARSS23 dataset demonstrate the effectiveness of the proposed MVANet. By integrating these techniques, our system, without model ensembling, outperforms the top-ranked method we used in the AV 3D SELD task of the DCASE 2024 Challenge. The code will be made publicly available in the future.