Camille Guinaudeau

STL, LISN

Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure

Apr 14, 2025

Théo Gigant, Camille Guinaudeau, Frédéric Dufaux

Abstract: Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be used beneficially as input in place of the raw video, and that a structured representation built from interleaved slides and transcript provides the best performance. Finally, we reflect on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.
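As an illustration only, the "structured representation from interleaved slides and transcript" mentioned in the abstract could be assembled roughly as follows. This is a minimal sketch and not the authors' pipeline: the `Slide` and `Segment` types, their `start` timestamps, and the `interleave` function are hypothetical names introduced here, and the sketch assumes that each slide and each transcript segment comes with a start time in seconds.

```python
from dataclasses import dataclass


@dataclass
class Slide:
    start: float    # assumed: time (s) at which the slide first appears
    image_ref: str  # placeholder for the slide image, e.g. a file path


@dataclass
class Segment:
    start: float    # assumed: time (s) at which the speech segment begins
    text: str


def interleave(slides, segments):
    """Return ("image", ref) / ("text", str) items in presentation order.

    Each slide is followed by the transcript spoken while it is on screen.
    """
    slides = sorted(slides, key=lambda s: s.start)
    segments = sorted(segments, key=lambda s: s.start)
    if not slides:
        text = " ".join(seg.text for seg in segments)
        return [("text", text)] if text else []

    items = []
    # Speech that precedes the first slide is kept as a leading text item.
    lead = [seg.text for seg in segments if seg.start < slides[0].start]
    if lead:
        items.append(("text", " ".join(lead)))

    for k, slide in enumerate(slides):
        # A slide is assumed to stay on screen until the next one appears.
        end = slides[k + 1].start if k + 1 < len(slides) else float("inf")
        items.append(("image", slide.image_ref))
        spoken = [seg.text for seg in segments if slide.start <= seg.start < end]
        if spoken:
            items.append(("text", " ".join(spoken)))
    return items
```

The resulting list alternates image and text items, which is the general shape of input that interleaved-capable VLMs accept; how the items are tokenized or truncated to fit an input-length budget is left open here.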

Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics

Oct 08, 2024

Théo Gigant, Camille Guinaudeau, Marc Decombas, Frédéric Dufaux

Abstract: Automatic metrics are used as proxies to evaluate abstractive summarization systems when human annotations are too expensive. To be useful, these metrics should be fine-grained, correlate highly with human annotations, and ideally be independent of reference quality. However, most standard evaluation metrics for summarization are reference-based, and existing reference-free metrics correlate poorly with relevance, especially on summaries of longer documents. In this paper, we introduce a reference-free metric that correlates well with human-evaluated relevance while being very cheap to compute. We show that this metric can also be used alongside reference-based metrics to improve their robustness in low-quality reference settings.

* The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Nov 2024, Miami (FL), United States

Cross-modal Retrieval for Knowledge-based Visual Question Answering

Jan 11, 2024

Paul Lerner, Olivier Ferret, Camille Guinaudeau

Abstract: Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base. Named entities have diverse visual representations and are therefore difficult to recognize. We argue that cross-modal retrieval may help bridge the semantic gap between an entity and its depictions and is, above all, complementary to mono-modal retrieval. We provide empirical evidence through experiments with a multimodal dual encoder, namely CLIP, on the recent ViQuAE, InfoSeek, and Encyclopedic-VQA datasets. Additionally, we study three different strategies to fine-tune such a model: mono-modal, cross-modal, or joint training. Our method, which combines mono- and cross-modal retrieval, is competitive with billion-parameter models on the three datasets, while being conceptually simpler and computationally cheaper.

Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering

Jan 11, 2023

Paul Lerner, Olivier Ferret, Camille Guinaudeau

Abstract: We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities (KVQAE). KVQAE is a recently introduced task that consists of answering questions about named entities grounded in a visual context using a Knowledge Base. The interaction between the modalities is therefore paramount for retrieving information and must be captured with complex fusion models. As these models require a lot of training data, we design this pre-training task from existing work in textual Question Answering. It consists of treating a sentence as a pseudo-question and its context as a pseudo-relevant passage, and is extended by considering images near texts in multimodal documents. Our method is applicable to different neural network architectures and leads to a 9% relative MRR gain for retrieval and a 15% relative F1 gain for reading comprehension over a no-pre-training baseline.

* Accepted at ECIR 2023
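To make the pseudo-question idea in the abstract above concrete, here is a minimal sketch of how Inverse Cloze Task training pairs might be built from a multimodal document. It is an illustration under stated assumptions, not the authors' implementation: the `Pair` type, the `make_ict_pairs` function, and the input layout (a list of `(sentences, image)` tuples, where `image` is a reference to an image located near the text, or `None`) are hypothetical. How the nearby image is actually used by the model is not specified in the abstract, so the sketch simply carries it along with each pair.

```python
import random
from typing import NamedTuple, Optional


class Pair(NamedTuple):
    pseudo_question: str         # one sentence drawn from the passage
    pseudo_passage: str          # the remaining sentences (its context)
    nearby_image: Optional[str]  # assumed: image found near the text, if any


def make_ict_pairs(passages, rng):
    """Build one (pseudo-question, pseudo-relevant passage) pair per passage.

    `passages` is a list of (sentences, image) tuples; passages with fewer
    than two sentences are skipped because no context would remain.
    """
    pairs = []
    for sentences, image in passages:
        if len(sentences) < 2:
            continue
        idx = rng.randrange(len(sentences))
        question = sentences[idx]
        # The chosen sentence is removed from its context in this sketch.
        context = " ".join(sentences[:idx] + sentences[idx + 1:])
        pairs.append(Pair(question, context, image))
    return pairs
```

A possible call is `make_ict_pairs(passages, random.Random(0))`; passing the random generator explicitly keeps the sampling reproducible. The pairs can then serve as positive examples for a retriever, with other passages in the batch acting as negatives, which is a common choice but is an assumption here rather than something stated in the abstract.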
