Title: Video and Audio are Images: A Cross-Modal Mixer for Original Data on Video-Audio Retrieval

Aug 26, 2023

Zichen Yuan, Qi Shen, Bingyi Zheng, Yuting Liu, Linying Jiang, Guibing Guo

Abstract: Cross-modal retrieval has become popular in recent years, particularly with the rise of multimedia. Each modality generally exhibits distinct representations and semantic information, so features encoded with a dual-tower architecture tend to lie in separate latent spaces. This makes it difficult to establish semantic relationships between modalities and results in poor retrieval performance. To address this issue, we propose a novel framework for cross-modal retrieval that consists of a cross-modal mixer, a masked autoencoder for pre-training, and a cross-modal retriever for downstream tasks. Specifically, we first adopt the cross-modal mixer and mask modeling to fuse the original modalities and eliminate redundancy. Then, an encoder-decoder architecture is applied to perform a fuse-then-separate task in the pre-training phase: we feed masked fused representations into the encoder and reconstruct them with the decoder, ultimately separating the original data of the two modalities. In downstream tasks, we use the pre-trained encoder to build the cross-modal retrieval method. Extensive experiments on two real-world datasets show that our approach outperforms previous state-of-the-art methods in video-audio matching tasks, improving retrieval accuracy by up to 2 times. Furthermore, we demonstrate the model's performance by transferring it to other downstream tasks as a universal model.
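
The fuse-then-separate pre-training step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the mixing rule, the patch layout, or the mask ratio, so the weighted element-wise sum, the `alpha` parameter, the 0.75 mask ratio, and all function names below are assumptions chosen only for illustration. The sketch shows how fused patches might be masked before encoding while both original modalities are kept as reconstruction targets.

```python
import random


def mix_patches(video_patches, audio_patches, alpha=0.5):
    """Fuse two equally sized patch sequences.

    ASSUMPTION: a weighted element-wise sum stands in for the paper's
    cross-modal mixer, whose exact form the abstract does not give.
    """
    if len(video_patches) != len(audio_patches):
        raise ValueError("both modalities must have the same number of patches")
    return [
        [alpha * v + (1.0 - alpha) * a for v, a in zip(vp, ap)]
        for vp, ap in zip(video_patches, audio_patches)
    ]


def random_mask(num_patches, mask_ratio, rng):
    """Split patch indices into sorted (visible, masked) lists."""
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def build_pretraining_example(video_patches, audio_patches, mask_ratio=0.75, seed=0):
    """Build one hypothetical pre-training example.

    The encoder would see only the visible fused patches; the decoder's
    targets are the two original (unfused) modalities.
    """
    rng = random.Random(seed)
    fused = mix_patches(video_patches, audio_patches)
    visible, masked = random_mask(len(fused), mask_ratio, rng)
    return {
        "encoder_input": [fused[i] for i in visible],
        "visible_indices": visible,
        "masked_indices": masked,
        "targets": {"video": video_patches, "audio": audio_patches},
    }


if __name__ == "__main__":
    # Toy data: 8 patches of 4 values per modality.
    video = [[float(i)] * 4 for i in range(8)]
    audio = [[float(-i)] * 4 for i in range(8)]
    example = build_pretraining_example(video, audio)
    print(len(example["encoder_input"]), "visible patches,",
          len(example["masked_indices"]), "masked patches")
```

In a real model, the encoder and decoder would be learned networks and the reconstruction loss would compare the decoder's outputs against both entries in `targets`; the sketch covers only the data preparation.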
