Title: SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Jul 23, 2020

Tao Jin, Siyu Huang, Ming Chen, Yingming Li, Zhongfei Zhang



Abstract: In this paper, we focus on effectively applying the transformer structure to video captioning. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation. However, video captioning is a multimodal learning problem, and video features are highly redundant across time steps. Motivated by these concerns, we propose a novel method called the sparse boundary-aware transformer (SBAT) to reduce redundancy in the video representation. SBAT applies a boundary-aware pooling operation to the scores from multi-head attention and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the local information loss caused by the sparse operation. Building on SBAT, we further propose an aligned cross-modal encoding scheme to boost multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms state-of-the-art methods on most metrics.

* Appearing at IJCAI 2020
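
To make the idea of sparsifying attention over redundant video frames more concrete, here is a minimal sketch. It is not the authors' implementation. The abstract does not define the boundary-aware pooling operation, so this sketch leaves it out and uses plain top-k selection over attention scores as a stand-in. The function names `sparse_attention_weights` and `attend` and the parameter `k` are hypothetical, chosen only for illustration.

```python
import math


def sparse_attention_weights(scores, k):
    """Softmax over only the k highest scores; every other position gets weight 0.

    Assumption: plain top-k selection stands in for the paper's
    boundary-aware pooling, which the abstract does not specify.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    # Indices of the k largest scores (ties broken by earlier index).
    keep = sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]
    top = max(scores[i] for i in keep)  # subtract the max for numerical stability
    exp = {i: math.exp(scores[i] - top) for i in keep}
    total = sum(exp.values())
    return [exp[i] / total if i in exp else 0.0 for i in range(len(scores))]


def attend(scores, values, k):
    """Weighted sum of per-frame feature vectors using the sparse weights."""
    weights = sparse_attention_weights(scores, k)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```

With `k` smaller than the number of frames, only the highest-scoring frames contribute to the output, which is one simple way to drop redundant time steps. The local correlation scheme mentioned in the abstract would be needed on top of this to recover local context lost by the selection.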

