Title: VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles

Oct 12, 2020

Mingzhe Li, Xiuying Chen, Shen Gao, Zhangming Chan, Dongyan Zhao, Rui Yan

Abstract: A popular multimedia news format pairs a lively video with a corresponding news article; it is used by influential news media such as CNN and BBC and by social media platforms such as Twitter and Weibo. In this setting, automatically choosing a proper cover frame for the video and generating an appropriate textual summary of the article can save editors time and help readers decide what to read more effectively. Hence, in this paper, we propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO) to tackle this problem. The main challenge in this task is to jointly model the temporal dependency of the video and the semantic meaning of the article. To this end, we propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and a multimodal generator. In the dual interaction module, we propose a conditional self-attention mechanism that captures local semantic information within the video and a global-attention mechanism that handles the semantic relationship between the news text and the video at a high level. Extensive experiments conducted on a large-scale real-world VMSMO dataset show that DIMS achieves state-of-the-art performance in terms of both automatic metrics and human evaluations.
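
To make the idea of attention over video frames conditioned on the article more concrete, here is a minimal toy sketch. It is not the DIMS architecture: the abstract gives no equations, so the sigmoid gate, the scaled dot-product scoring, and the names `conditional_self_attention` and `pick_cover_frame` are all assumptions chosen for illustration. Frames and the article are plain feature vectors of equal length.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_self_attention(frames, condition):
    """Toy self-attention over frames, gated by a condition vector.

    Assumption (not from the paper): each key frame is gated by
    sigmoid(frame . condition), so frames that match the article
    contribute more to every attended output.
    """
    d = len(condition)
    gates = [1.0 / (1.0 + math.exp(-dot(f, condition))) for f in frames]
    attended = []
    for query in frames:
        # Frame-to-frame scaled dot-product attention weights.
        weights = softmax([dot(query, key) / math.sqrt(d) for key in frames])
        # Re-weight by the article-conditioned gates and renormalise.
        gated = [w * g for w, g in zip(weights, gates)]
        norm = sum(gated)
        gated = [g / norm for g in gated]
        attended.append(
            [sum(w * f[i] for w, f in zip(gated, frames)) for i in range(d)]
        )
    return attended


def pick_cover_frame(frames, article):
    """Return the index of the frame whose attended vector best matches the article."""
    attended = conditional_self_attention(frames, article)
    scores = [dot(a, article) for a in attended]
    return max(range(len(frames)), key=scores.__getitem__)


# Hypothetical 2-dimensional features for three frames and one article.
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
article = [0.0, 1.0]
cover_index = pick_cover_frame(frames, article)
```

In a real system the frame and article vectors would come from learned encoders and the scoring would be trained jointly with the text summarizer; the sketch only shows how a text-derived condition can steer frame-level attention toward a cover-frame choice.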

* Accepted by The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
