Title: A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

May 17, 2020

Vladimir Iashin, Esa Rahtu

Abstract: Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting only visual features, while completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they either show poor results or demonstrate the importance of doing so only on a dataset with a specific domain. In this paper, we introduce the Bi-modal Transformer, which generalizes the Transformer architecture for a bi-modal input. We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of taking any two modalities as input in a sequence-to-sequence task. We also show that a bi-modal encoder pre-trained along with a bi-modal decoder for captioning can be used as a feature extractor for a simple proposal generation module. The performance is demonstrated on the challenging ActivityNet Captions dataset, where our model achieves outstanding performance.

* The project page is available at https://v-iashin.github.io/bmt
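
To make the idea of a Transformer with bi-modal input more concrete, here is a minimal sketch of one way two feature streams could attend to each other. This is not the authors' implementation: the function names (`cross_attention`, `bimodal_layer`) are hypothetical, and learned projections, multiple heads, feed-forward sublayers, and layer normalization are all omitted as a simplifying assumption. It only illustrates scaled dot-product attention in which queries come from one modality and keys and values from the other.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention where one modality queries the other.

    queries:     list of T_q vectors (e.g. audio features), each of size d
    keys_values: list of T_k vectors (e.g. visual features), each of size d
    Returns T_q vectors of size d. Learned projections are omitted (assumption).
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        attended.append([
            sum(w * kv[i] for w, kv in zip(weights, keys_values))
            for i in range(d)
        ])
    return attended


def bimodal_layer(audio, visual):
    """Hypothetical bi-modal step: each stream attends to the other,
    and the result is added back to the stream (residual connection)."""
    audio_out = [[a + c for a, c in zip(x, ctx)]
                 for x, ctx in zip(audio, cross_attention(audio, visual))]
    visual_out = [[v + c for v, c in zip(x, ctx)]
                  for x, ctx in zip(visual, cross_attention(visual, audio))]
    return audio_out, visual_out


# Toy inputs: random features, not real audio or video data.
random.seed(0)
audio = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]   # 3 audio steps
visual = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]  # 5 visual steps
audio_out, visual_out = bimodal_layer(audio, visual)
print(len(audio_out), len(visual_out))  # each stream keeps its own length
```

Because attention output has one row per query, the two streams may have different lengths, which is why nothing in the sketch is specific to audio or video; any two sequences with the same feature size could be passed in.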
