Title: Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning

May 26, 2025

Wenrui Li, Penghong Wang, Xingtao Wang, Wangmeng Zuo, Xiaopeng Fan, Yonghong Tian

Abstract: Audio-visual zero-shot learning (ZSL) has been extensively researched for its capability to classify video data from classes unseen during training. Nevertheless, current methods often struggle with background scene biases and inadequate motion detail. This paper proposes a novel dual-stream Multi-Timescale Motion-Decoupled Spiking Transformer (MDST++), which decouples contextual semantic information from sparse dynamic motion information. A recurrent joint learning unit extracts contextual semantic information and captures joint knowledge across modalities to understand the environment in which actions occur. By converting RGB images to events, the method captures motion information more accurately and mitigates background scene biases. A discrepancy analysis block is introduced to model audio motion information. To make the spiking neural networks (SNNs) more robust when extracting temporal and motion cues, the threshold of the Leaky Integrate-and-Fire (LIF) neurons is adjusted dynamically based on global motion and contextual semantic information. Experiments validate the effectiveness of MDST++, demonstrating its consistent superiority over state-of-the-art methods on mainstream benchmarks. Additionally, incorporating motion and multi-timescale information improves HM and ZSL accuracy by 26.2% and 39.9%, respectively.

* Accepted by IEEE TCSVT
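
The abstract mentions two mechanisms that a small example can make concrete: converting RGB frames to events, and adjusting the firing threshold of a Leaky Integrate-and-Fire neuron from a context signal. The sketch below is only an illustration under stated assumptions and is not the authors' implementation. The function names, the frame-differencing rule, the additive threshold update, and the parameters `delta`, `leak`, and `gain` are all hypothetical choices that do not come from the paper.

```python
from typing import List, Optional, Sequence


def frames_to_events(frames: Sequence[Sequence[float]], delta: float = 0.1) -> List[List[int]]:
    """Turn consecutive intensity frames into ON/OFF events (assumed rule).

    Each frame is a flat sequence of pixel intensities. For every pair of
    consecutive frames, a pixel emits +1 if it brightened by more than
    `delta`, -1 if it darkened by more than `delta`, and 0 otherwise.
    """
    events = []
    for prev, curr in zip(frames, frames[1:]):
        row = []
        for p, c in zip(prev, curr):
            diff = c - p
            row.append(1 if diff > delta else -1 if diff < -delta else 0)
        events.append(row)
    return events


def lif_dynamic_threshold(
    inputs: Sequence[float],
    context: Optional[Sequence[float]] = None,
    base_threshold: float = 1.0,
    leak: float = 0.9,
    gain: float = 0.5,
) -> List[int]:
    """Simulate one LIF neuron with a per-step adjustable threshold.

    Assumption: the threshold at each step is `base_threshold + gain * c`,
    where `c` is a scalar context value standing in for the global motion
    and semantic information mentioned in the abstract.
    """
    if context is None:
        context = [0.0] * len(inputs)
    v = 0.0  # membrane potential
    spikes = []
    for x, c in zip(inputs, context):
        v = leak * v + x  # leaky integration of the input current
        threshold = base_threshold + gain * c
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # hard reset after a spike
        else:
            spikes.append(0)
    return spikes
```

In this toy setup, a static background pixel produces no events at all, which gives an intuition for why an event representation might reduce background scene bias, and a larger context value raises the threshold so that the same input produces fewer spikes.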
