Title: Audio-Visual Transformer Based Crowd Counting

Sep 04, 2021

Usman Sajid, Xiangyu Chen, Hasan Sajid, Taejoon Kim, Guanghui Wang

Figures 1-4 for Audio-Visual Transformer Based Crowd Counting

Abstract: Crowd estimation is a very challenging problem. The most recent study tries to exploit auditory information to aid the visual models; however, its performance is limited by the lack of an effective approach for feature extraction and integration. The paper proposes a new audiovisual multi-task network to address the critical challenges in crowd counting by effectively utilizing both visual and audio inputs for better modality association and productive feature extraction. The proposed network introduces the notion of auxiliary and explicit image patch-importance ranking (PIR) and patch-wise crowd estimate (PCE) information to produce a third (run-time) modality. These modalities (audio, visual, run-time) undergo a transformer-inspired cross-modality co-attention mechanism to finally output the crowd estimate. To acquire rich visual features, we propose a multi-branch structure with transformer-style fusion in between. Extensive experimental evaluations show that the proposed scheme outperforms the state-of-the-art networks under all evaluation settings, with up to 33.8% improvement. We also analyze and compare the vision-only variant of our network and empirically demonstrate its superiority over previous approaches.
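The abstract does not give the details of the co-attention mechanism, so the following is only a minimal sketch of generic transformer-style cross-modality attention, not the authors' network. It assumes scaled dot-product attention with no learned projections, covers only two of the three modalities (visual and audio), and uses made-up toy features. The names `cross_attention` and `co_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query token (one modality)
    attends over the tokens of another modality (keys/values)."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


def co_attention(visual, audio):
    """Hypothetical two-way co-attention (an assumption, not the paper's
    design): visual tokens query audio, and audio tokens query visual."""
    visual_ctx = cross_attention(visual, audio, audio)
    audio_ctx = cross_attention(audio, visual, visual)
    return visual_ctx, audio_ctx


# Toy features (made-up numbers): 3 visual patch tokens, 2 audio tokens.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_tokens = [[0.5, 0.5], [1.0, -1.0]]
visual_ctx, audio_ctx = co_attention(visual_tokens, audio_tokens)
```

Each output token is a convex combination of the other modality's tokens, so `visual_ctx` has one audio-informed vector per visual patch. A real model would add learned query, key, and value projections, multiple heads, and, in the paper's setting, the third run-time modality.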
