Lluis Gomez

Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia

Sep 21, 2022
Khanh Nguyen, Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas

Humans exploit prior knowledge to describe images and can adapt their explanations to specific contextual information, even to the extent of inventing plausible explanations when contextual information and images do not match. In this work, we propose the novel task of captioning Wikipedia images by integrating contextual knowledge: we build models that jointly reason over Wikipedia articles, Wikimedia images and their associated descriptions to produce contextualized captions. Since the same Wikimedia image can illustrate different articles, the produced caption needs to be adapted to a specific context, which allows us to explore the limits of a model's ability to adjust captions to different contextual information. A particularly challenging aspect of this domain is dealing with out-of-dictionary words and Named Entities. To address this, we propose a pre-training objective, Masked Named Entity Modeling (MNEM), and show that this pretext task yields an improvement over baseline models. Furthermore, we verify that a model pre-trained with the MNEM objective on Wikipedia generalizes well to a news captioning dataset. Additionally, we define two test splits according to the difficulty of the captioning task. We offer insights on the role and importance of each modality and highlight the limitations of our model. The code, models and data splits will be made publicly available upon acceptance.
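
A minimal sketch of the Masked Named Entity Modeling (MNEM) idea described above: named-entity tokens in a caption are masked and the model is trained to recover them. The entity spans are assumed to come from an off-the-shelf NER tagger; the [MASK] token and the masking probability are illustrative choices, not the paper's exact recipe.

```python
# Sketch of MNEM-style masking: entity spans are hidden and kept as targets.
import random
from typing import List, Tuple, Dict

MASK_TOKEN = "[MASK]"

def mask_named_entities(tokens: List[str],
                        entity_spans: List[Tuple[int, int]],
                        mask_prob: float = 0.8,
                        seed: int = 0) -> Tuple[List[str], Dict[int, str]]:
    """Replace tokens inside entity spans with [MASK]; return the masked caption
    and the prediction targets (token position -> original token)."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for start, end in entity_spans:        # spans are [start, end) token indices
        if rng.random() > mask_prob:       # leave a fraction of entities visible
            continue
        for i in range(start, end):
            targets[i] = masked[i]
            masked[i] = MASK_TOKEN
    return masked, targets

if __name__ == "__main__":
    caption = "The Eiffel Tower photographed from the Champ de Mars in Paris".split()
    spans = [(1, 3), (6, 9), (10, 11)]     # entity spans from a hypothetical NER pass
    masked, targets = mask_named_entities(caption, spans)
    print(" ".join(masked))
    print(targets)
```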

MUST-VQA: MUltilingual Scene-text VQA

Sep 14, 2022
Emanuele Vivoli, Ali Furkan Biten, Andres Mafla, Dimosthenis Karatzas, Lluis Gomez

In this paper, we present a framework for Multilingual Scene Text Visual Question Answering that deals with new languages in a zero-shot fashion. Specifically, we consider the task of Scene Text Visual Question Answering (STVQA) in which the question can be asked in different languages and is not necessarily aligned with the language of the scene text. We first introduce MUST-VQA, a natural step towards a more generalized version of STVQA. We then discuss two evaluation scenarios in the constrained setting, namely IID and zero-shot, and demonstrate that models can perform on par in the zero-shot setting. We further provide extensive experiments and show the effectiveness of adapting multilingual language models to STVQA tasks.
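
A small sketch of the two evaluation scenarios mentioned above: questions in languages seen during training ("IID") versus unseen languages ("zero-shot"). The training-language set and the exact-match accuracy used here are illustrative assumptions, not the paper's exact protocol.

```python
# Partition evaluation samples by question language and score each scenario.
from collections import defaultdict

TRAIN_LANGUAGES = {"en", "es", "ca"}        # assumed training languages

def split_and_score(samples):
    """samples: list of dicts with keys 'lang', 'prediction', 'answer'."""
    buckets = defaultdict(list)
    for s in samples:
        scenario = "iid" if s["lang"] in TRAIN_LANGUAGES else "zero-shot"
        buckets[scenario].append(float(s["prediction"] == s["answer"]))
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}

if __name__ == "__main__":
    demo = [
        {"lang": "en", "prediction": "stop", "answer": "stop"},
        {"lang": "it", "prediction": "exit", "answer": "uscita"},   # zero-shot language
        {"lang": "zh", "prediction": "24", "answer": "24"},         # zero-shot language
    ]
    print(split_and_score(demo))
```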

* To appear in the Text In Everything Workshop at ECCV 2022 

A Generic Image Retrieval Method for Date Estimation of Historical Document Collections

Apr 08, 2022
Adrià Molina, Lluis Gomez, Oriol Ramos Terrades, Josep Lladós

Date estimation of historical document images is a challenging problem, and several contributions in the literature lack the ability to generalize from one dataset to others. This paper presents a robust date estimation system based on a retrieval approach that generalizes well across heterogeneous collections. We use a ranking loss function named smooth-nDCG to train a Convolutional Neural Network that learns an ordering of documents for each problem. One of the main uses of the presented approach is as a tool for historical contextual retrieval: scholars can perform comparative analyses of historical images from large datasets in terms of the period in which they were produced. We provide an experimental evaluation on different types of documents from real datasets of manuscript and newspaper images.
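
The abstract above mentions training with a smooth-nDCG ranking loss. Below is a minimal PyTorch sketch of one common way to make nDCG differentiable, replacing each item's hard rank with a sigmoid-relaxed rank; the relevance definition (inverse of the date gap) and the temperature are illustrative assumptions rather than the paper's exact formulation.

```python
# Differentiable ("smooth") nDCG loss sketch for retrieval-style training.
import torch

def smooth_ndcg_loss(scores: torch.Tensor,
                     relevance: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """scores: (N,) similarity of each gallery item to the query (higher = closer).
    relevance: (N,) graded relevance of each gallery item to the query."""
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)      # diff[j, i] = s_j - s_i
    # smooth rank of item i: 1 + sum_{j != i} sigmoid((s_j - s_i) / T);
    # subtract 0.5 to drop the j == i term (sigmoid(0) = 0.5)
    smooth_rank = 1.0 + torch.sigmoid(diff / temperature).sum(dim=0) - 0.5
    dcg = (relevance / torch.log2(1.0 + smooth_rank)).sum()
    # ideal DCG: relevance values sorted in the best possible order
    ideal_rel, _ = relevance.sort(descending=True)
    ranks = torch.arange(1, scores.numel() + 1, dtype=scores.dtype)
    idcg = (ideal_rel / torch.log2(1.0 + ranks)).sum()
    return 1.0 - dcg / idcg.clamp(min=1e-8)

if __name__ == "__main__":
    # toy example: a gallery of 4 documents with known years, query year 1920
    years = torch.tensor([1918., 1955., 1921., 1890.])
    relevance = 1.0 / (1.0 + (years - 1920.).abs())       # assumed relevance from date gap
    scores = torch.tensor([0.9, 0.1, 0.8, 0.2], requires_grad=True)
    loss = smooth_ndcg_loss(scores, relevance)
    loss.backward()
    print(float(loss), scores.grad)
```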

* Preprint of a paper accepted at DAS 2022 

Text-DIAE: Degradation Invariant Autoencoders for Text Recognition and Document Enhancement

Mar 16, 2022
Mohamed Ali Souibgui, Sanket Biswas, Andres Mafla, Ali Furkan Biten, Alicia Fornés, Yousri Kessentini, Josep Lladós, Lluis Gomez, Dimosthenis Karatzas

In this work, we propose the Text-Degradation Invariant Auto Encoder (Text-DIAE), aimed at solving two tasks: text recognition (handwritten or scene text) and document image enhancement. We define three pretext tasks as learning objectives to be optimized during pre-training without the use of labelled data, each specifically tailored to the final downstream tasks. We conduct several ablation experiments that show the importance of each degradation for a specific domain. Exhaustive experimentation shows that our method does not suffer from the limitations of previous state-of-the-art methods based on contrastive losses, while requiring substantially fewer data samples to converge. Finally, we demonstrate that our method significantly surpasses the state-of-the-art in existing supervised and self-supervised settings for handwritten and scene text recognition and document image enhancement. Our code and trained models will be made publicly available upon acceptance.
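
A minimal sketch of the degradation-invariant pretraining idea described above: a clean image is corrupted by one of several degradations and an autoencoder is trained to reconstruct the clean input. The specific degradations (additive noise, blur, random patch masking), the tiny convolutional autoencoder and the MSE loss are illustrative stand-ins, not the paper's actual pretext tasks or architecture.

```python
# Degradation-then-reconstruct pretraining sketch with a toy autoencoder.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

def degrade(x: torch.Tensor) -> torch.Tensor:
    """Apply one randomly chosen degradation to a batch of images (B, 1, H, W)."""
    choice = random.choice(["noise", "blur", "mask"])
    if choice == "noise":
        return (x + 0.3 * torch.randn_like(x)).clamp(0, 1)
    if choice == "blur":
        kernel = torch.full((1, 1, 3, 3), 1.0 / 9.0)      # simple box blur
        return F.conv2d(x, kernel, padding=1)
    masked = x.clone()                                     # random patch masking
    _, _, h, w = x.shape
    top, left = random.randrange(h // 2), random.randrange(w // 2)
    masked[:, :, top:top + h // 2, left:left + w // 2] = 0.0
    return masked

class TinyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                                     nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                                     nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

if __name__ == "__main__":
    model = TinyAutoencoder()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    clean = torch.rand(8, 1, 32, 32)          # stand-in for document image crops
    for step in range(3):                     # a few toy pretraining steps
        recon = model(degrade(clean))
        loss = F.mse_loss(recon, clean)       # reconstruct the clean image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(step, float(loss))
```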

* Preprint 

OCR-IDL: OCR Annotations for Industry Document Library Dataset

Feb 25, 2022
Ali Furkan Biten, Rubèn Tito, Lluis Gomez, Ernest Valveny, Dimosthenis Karatzas

Pretraining has proven successful in Document Intelligence tasks, where a deluge of documents is used to pretrain models that are later fine-tuned on downstream tasks. One problem with these pretraining approaches is the inconsistent use of pretraining data processed with different OCR engines, which leads to incomparable results between models. In other words, it is not obvious whether a performance gain comes from the amount of data used and the particular OCR engine, or from the proposed models. To remedy this problem, we make public the OCR annotations for IDL documents, obtained with a commercial OCR engine given its superior performance over open-source OCR models. The contributed dataset (OCR-IDL) has an estimated monetary value of over US$20K. It is our hope that OCR-IDL can be a starting point for future work on Document Intelligence. All of our data, its collection process and the annotations can be found at https://github.com/furkanbiten/idl_data.
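
A hypothetical sketch of reading word-level OCR annotations of the kind the dataset above provides. The JSON schema used here (one file per document, with per-word text and bounding boxes) is an assumption for illustration only; the real layout is documented in the linked repository.

```python
# Hypothetical loader for word-level OCR annotations (assumed schema).
import json
from pathlib import Path

def load_ocr_words(annotation_path: str):
    """Yield (text, bbox) pairs from one assumed annotation file."""
    record = json.loads(Path(annotation_path).read_text())
    for page in record.get("pages", []):
        for word in page.get("words", []):
            yield word["text"], word["bbox"]     # bbox assumed as [x0, y0, x1, y1]

if __name__ == "__main__":
    # toy in-memory example instead of a real file from the dataset
    sample = {"pages": [{"words": [{"text": "Invoice", "bbox": [34, 20, 120, 42]}]}]}
    Path("sample_annotation.json").write_text(json.dumps(sample))
    for text, bbox in load_ocr_words("sample_annotation.json"):
        print(text, bbox)
```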

Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

Oct 06, 2021
Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas

The task of image-text matching aims to map representations from different modalities into a common joint visual-textual embedding. However, the most widely used datasets for this task, MSCOCO and Flickr30K, are actually image captioning datasets that offer a very limited set of relationships between images and sentences in their ground-truth annotations. This limited ground truth forces us to use evaluation metrics based on binary relevance: given a sentence query, we consider only one image as relevant, even though many other relevant images or captions may be present in the dataset. In this work, we propose two metrics that evaluate the degree of semantic relevance of retrieved items, independently of their annotated binary relevance. Additionally, we incorporate a novel strategy that uses an image captioning metric, CIDEr, to define a Semantic Adaptive Margin (SAM) to be optimized in a standard triplet loss. By incorporating our formulation into existing models, a large improvement is obtained in scenarios where the available training data is limited. We also demonstrate that, when employing the full training set, performance on the annotated image-caption pairs is maintained while improving on other non-annotated relevant items. Code with our metrics and adaptive margin formulation will be made public.
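
A small PyTorch sketch of the Semantic Adaptive Margin (SAM) idea described above: the fixed margin of a standard triplet loss is replaced by one that shrinks when the "negative" caption is semantically close to the ground truth, as measured by a captioning metric such as CIDEr. The exact mapping from CIDEr score to margin below is an illustrative assumption, not the paper's formulation, and the CIDEr values are passed in precomputed.

```python
# Triplet loss with a margin that adapts to caption similarity.
import torch
import torch.nn.functional as F

def sam_triplet_loss(img_emb, pos_txt_emb, neg_txt_emb, cider_pos_neg,
                     base_margin: float = 0.2):
    """img_emb, pos_txt_emb, neg_txt_emb: (B, D) L2-normalised embeddings.
    cider_pos_neg: (B,) similarity (e.g. normalised CIDEr) between the positive
    caption and each negative caption, assumed to lie in [0, 1]."""
    sim_pos = (img_emb * pos_txt_emb).sum(dim=1)      # cosine similarity to positive
    sim_neg = (img_emb * neg_txt_emb).sum(dim=1)      # cosine similarity to negative
    # semantically similar negatives are penalised less: margin decays with CIDEr
    margin = base_margin * (1.0 - cider_pos_neg)
    return F.relu(margin + sim_neg - sim_pos).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    img = F.normalize(torch.randn(4, 8), dim=1)
    pos = F.normalize(torch.randn(4, 8), dim=1)
    neg = F.normalize(torch.randn(4, 8), dim=1)
    cider = torch.tensor([0.0, 0.3, 0.7, 1.0])        # precomputed caption similarities
    print(float(sam_triplet_loss(img, pos, neg, cider)))
```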

* Accepted at WACV 2022 

Let there be a clock on the beach: Reducing Object Hallucination in Image Captioning

Oct 04, 2021
Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas

Describing an image with missing or non-existent objects is known as object bias (hallucination) in image captioning. This behaviour is quite common in state-of-the-art captioning models and is undesirable to humans. To decrease object hallucination in captioning, we propose three simple yet efficient training augmentation methods for sentences, which require no new training data and no increase in model size. Through extensive analysis, we show that the proposed methods significantly diminish our models' object bias on hallucination metrics. Moreover, we experimentally demonstrate that our methods decrease the dependency on visual features. All of our code, configuration files and model weights will be made public.
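
A hypothetical sketch in the spirit of the sentence-level augmentation described above: object words in a training caption are occasionally swapped for other objects that co-occur in similar scenes, so the model is not pushed to always emit the most frequent object. This is one illustrative augmentation only, not the three methods the paper actually proposes, and the co-occurrence table is a made-up assumption.

```python
# Illustrative caption augmentation: occasionally swap object words.
import random

CO_OCCURRING = {                     # assumed co-occurrence table, for illustration only
    "clock": ["sign", "lamp"],
    "umbrella": ["kite", "surfboard"],
}

def augment_caption(tokens, swap_prob: float = 0.3, seed: int = 1):
    """Return a copy of the caption where known object words may be swapped."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in CO_OCCURRING and rng.random() < swap_prob:
            out.append(rng.choice(CO_OCCURRING[tok]))
        else:
            out.append(tok)
    return out

if __name__ == "__main__":
    caption = "a clock on the beach next to an umbrella".split()
    print(" ".join(augment_caption(caption)))
```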

* Accepted to WACV 2022 

Asking questions on handwritten document collections

Oct 02, 2021
Minesh Mathew, Lluis Gomez, Dimosthenis Karatzas, CV Jawahar

This work addresses the problem of Question Answering (QA) on handwritten document collections. Unlike typical QA and Visual Question Answering (VQA) formulations, where the answer is a short text, we aim to locate a document snippet where the answer lies. The proposed approach works without recognizing the text in the documents. We argue that the recognition-free approach is suitable for handwritten documents and historical collections, where robust text recognition is often difficult. At the same time, for human users, document image snippets containing answers act as a valid alternative to textual answers. The proposed approach uses an off-the-shelf deep embedding network which can project both textual words and word images into a common sub-space. This embedding bridges the textual and visual domains and helps us retrieve document snippets that potentially answer a question. We evaluate the proposed approach on two new datasets: (i) HW-SQuAD, a synthetic, handwritten document image counterpart of the SQuAD1.0 dataset, and (ii) BenthamQA, a smaller set of QA pairs defined on documents from the popular Bentham manuscripts collection. We also present a thorough analysis of the proposed recognition-free approach compared to a recognition-based approach that uses text recognized from the images with an OCR engine. The datasets presented in this work are available to download at docvqa.org.
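
A minimal NumPy sketch of the recognition-free retrieval idea described above: query words and word images are assumed to be already projected into a common embedding space by the off-the-shelf embedding network, and document snippets are ranked by how well their word-image embeddings match the query words. The aggregation used here (best match per query word, then the mean) is an illustrative choice, not necessarily the paper's scoring function.

```python
# Rank document snippets by word-embedding similarity to the query words.
import numpy as np

def score_snippet(query_word_embs: np.ndarray, snippet_word_embs: np.ndarray) -> float:
    """query_word_embs: (Q, D), snippet_word_embs: (W, D), both L2-normalised."""
    sims = query_word_embs @ snippet_word_embs.T       # (Q, W) cosine similarities
    return float(sims.max(axis=1).mean())              # best match per query word, averaged

def rank_snippets(query_word_embs, snippets):
    """snippets: list of (snippet_id, (W, D) array). Returns ids sorted by score."""
    scored = [(sid, score_snippet(query_word_embs, embs)) for sid, embs in snippets]
    return sorted(scored, key=lambda x: x[1], reverse=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    norm = lambda a: a / np.linalg.norm(a, axis=1, keepdims=True)
    query = norm(rng.normal(size=(3, 16)))              # embeddings of 3 query words
    snippets = [("page2_region5", norm(rng.normal(size=(12, 16)))),
                ("page7_region1", norm(rng.normal(size=(8, 16))))]
    print(rank_snippets(query, snippets))
```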

* Int. J. Document Anal. Recognit., vol. 24, no. 3, pp. 235-249, 2021 
* pre-print version 

Date Estimation in the Wild of Scanned Historical Photos: An Image Retrieval Approach

Jun 10, 2021
Adrià Molina, Pau Riba, Lluis Gomez, Oriol Ramos-Terrades, Josep Lladós

This paper presents a novel method for date estimation of historical photographs from archival sources. The main contribution is to formulate date estimation as a retrieval task in which, given a query, the retrieved images are ranked in terms of estimated date similarity: the closer their embedded representations, the closer their dates. In contrast to traditional approaches that train a neural network as a classifier or a regressor, we propose a learning objective based on the nDCG ranking metric. We experimentally evaluate the method on two tasks, date estimation and date-sensitive image retrieval, using the public DEW database, and outperform the baseline methods.
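
As a complement to the retrieval formulation described above, the sketch below shows one way a ranked gallery of dated photos could be turned into a date estimate for a query; the k-nearest-neighbour median used here is an illustrative way to read a date off the ranking, not necessarily the paper's procedure, and the embeddings are assumed precomputed.

```python
# Estimate a query photo's date from its nearest dated neighbours in embedding space.
import numpy as np

def estimate_date(query_emb: np.ndarray,
                  gallery_embs: np.ndarray,
                  gallery_years: np.ndarray,
                  k: int = 5) -> float:
    """query_emb: (D,), gallery_embs: (N, D), gallery_years: (N,). Embeddings are
    assumed L2-normalised so the dot product is a cosine similarity."""
    sims = gallery_embs @ query_emb
    top_k = np.argsort(-sims)[:k]                      # indices of the k closest photos
    return float(np.median(gallery_years[top_k]))      # robust aggregate of their dates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    norm = lambda a: a / np.linalg.norm(a, axis=-1, keepdims=True)
    gallery = norm(rng.normal(size=(100, 32)))
    years = rng.integers(1930, 2010, size=100)
    query = norm(gallery[7] + 0.1 * rng.normal(size=32))   # query close to gallery item 7
    print(estimate_date(query, gallery, years))
```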

* Accepted at ICDAR 2021 