Title: Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions

Aug 27, 2018

Ke Ning, Linchao Zhu, Ming Cai, Yi Yang, Di Xie, Fei Wu

Abstract: We propose a novel attentive sequence to sequence translator (ASST) for clip localization in videos by natural language descriptions. We make two contributions. First, we propose a bi-directional Recurrent Neural Network (RNN) with a finely calibrated vision-language attentive mechanism to comprehensively understand free-form natural language descriptions. The RNN parses natural language descriptions in two directions, and the attentive model aligns every meaningful word or phrase with each frame, resulting in a more detailed understanding of video content and description semantics. Second, we design a hierarchical architecture for the network to jointly model language descriptions and video content. Given a video-description pair, the network generates a matrix representation, i.e., a sequence of vectors. Each vector in the matrix represents a video frame conditioned on the description. This 2D representation not only preserves the temporal dependencies of frames but also provides an effective way to perform frame-level video-language matching. The hierarchical architecture exploits video content at multiple granularities, ranging from subtle details to global context. Integrating the multiple granularities yields a robust representation for multi-level video-language abstraction. We validate the effectiveness of our ASST on two large-scale datasets. Our ASST outperforms the state of the art by $4.28\%$ in Rank$@1$ on the DiDeMo dataset. On the Charades-STA dataset, we improve on the state of the art by $13.41\%$ in Rank$@1,IoU=0.5$.
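The two ideas in the abstract that lend themselves to a small illustration are the frame-level attention over words and the IoU-based evaluation. The following is a minimal sketch and not the authors' implementation: the scaled dot-product scoring, the concatenation of frame and word context, and the names `attend_frames_to_words` and `temporal_iou` are all assumptions made here for illustration, since the abstract does not specify them. It shows how each frame can produce one description-conditioned vector, so a video becomes a sequence of vectors, and how a predicted clip might be scored against a ground-truth clip under the usual reading of Rank@1, IoU=0.5 (the top-ranked clip counts as correct when its temporal IoU is at least 0.5).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend_frames_to_words(frames, words):
    """For each frame vector, attend over all word vectors.

    Assumption: scaled dot-product scores and concatenation; the paper's
    actual attention and fusion functions may differ.
    Returns one vector per frame: the frame feature followed by its
    attention-weighted summary of the words.
    """
    d = len(words[0])
    conditioned = []
    for f in frames:
        weights = softmax([dot(f, w) / math.sqrt(d) for w in words])
        context = [sum(a * w[k] for a, w in zip(weights, words)) for k in range(d)]
        conditioned.append(list(f) + context)
    return conditioned

def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Toy inputs (hypothetical): 2 frames and 3 words, all 2-dimensional.
frames = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rep = attend_frames_to_words(frames, words)  # 2 rows, 4 values each

# A predicted clip of 0-10 s against a ground-truth clip of 5-15 s.
iou = temporal_iou((0.0, 10.0), (5.0, 15.0))
hit_at_half = iou >= 0.5
```

In this toy case the predicted and ground-truth clips overlap for 5 s out of a 15 s union, so the IoU is one third and the prediction would not count as a hit at the 0.5 threshold.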
