Audio-Visual Contrastive Learning for Self-supervised Action Recognition

Apr 28, 2022
Haoyuan Lan, Yang Liu, Liang Lin

The underlying correlation between the audio and visual modalities within videos can be exploited to derive supervisory signals from unlabeled videos. In this paper, we present an end-to-end self-supervised framework named Audio-Visual Contrastive Learning (AVCL) to learn discriminative audio-visual representations for action recognition. Specifically, we design an attention-based multi-modal fusion module (AMFM) to fuse the audio and visual modalities. To align the heterogeneous audio-visual modalities, we construct a novel co-correlation guided representation alignment module (CGRA). To learn supervisory signals from unlabeled videos, we propose a novel self-supervised contrastive learning module (SelfCL). Furthermore, to expand the existing audio-visual action recognition datasets and better evaluate AVCL, we build a new audio-visual action recognition dataset named Kinetics-Sounds100. Experimental results on the Kinetics-Sounds32 and Kinetics-Sounds100 datasets demonstrate the superiority of AVCL over state-of-the-art methods on large-scale action recognition benchmarks.
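The abstract does not spell out the SelfCL objective, so the following is only a generic illustration of audio-visual contrastive learning, not the authors' method. It is a minimal sketch of a symmetric InfoNCE-style loss, assuming each video in a batch yields one audio embedding and one visual embedding, and that embeddings from the same video are positives while all other pairings in the batch are negatives. The function names (`l2_normalize`, `info_nce`) and the `temperature` value are hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(audio, video, temperature=0.1):
    """Symmetric cross-modal InfoNCE loss over paired embeddings.

    audio[i] and video[i] are assumed to come from the same clip (positive
    pair); every other audio/video combination in the batch is a negative.
    This is a generic formulation, not the loss defined in the paper.
    """
    if not audio or len(audio) != len(video):
        raise ValueError("audio and video must be non-empty and equally long")

    a = [l2_normalize(x) for x in audio]
    v = [l2_normalize(x) for x in video]
    n = len(a)

    # Temperature-scaled cosine similarity between every audio/video pair.
    sim = [
        [sum(p * q for p, q in zip(a[i], v[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the positive logit.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        return log_sum - logits[target]

    # Audio-to-video: each row of sim, positive on the diagonal.
    a2v = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # Video-to-audio: each column of sim, positive on the diagonal.
    v2a = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (a2v + v2a)
```

Under these assumptions, a batch whose audio and visual embeddings agree pair by pair yields a loss near zero, while a batch with the pairs shuffled yields a much larger loss, which is the signal a contrastive objective of this kind uses in place of labels.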
