Recent advances in multimodal pre-trained models have significantly improved information extraction from visually-rich documents (VrDs), in which named entity recognition (NER) is treated as a sequence-labeling task of predicting BIO entity tags for tokens, following the typical NLP setting. However, the BIO-tagging scheme relies on the correct order of model inputs, which is not guaranteed in real-world NER on scanned VrDs, where text is recognized and arranged by OCR systems. This reading-order issue prevents the BIO-tagging scheme from marking entities accurately, so sequence-labeling methods cannot predict the correct named entities. To address the reading-order issue, we introduce Token Path Prediction (TPP), a simple prediction head that predicts entity mentions as token sequences within documents. As an alternative to token classification, TPP models the document layout as a complete directed graph of tokens and predicts token paths within the graph as entities. For better evaluation of VrD-NER systems, we also propose two revised benchmark datasets for NER on scanned documents that reflect real-world scenarios. Experimental results demonstrate the effectiveness of our method and suggest its potential as a universal solution to various information extraction tasks on documents.
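The reading-order issue and the token-path alternative described above can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: the helper names `decode_bio` and `decode_paths`, the example tokens, and the representation of predictions as a list of `(i, j)` index pairs are all assumptions made for illustration. The sketch also ignores single-token entities, which a link-only representation cannot express.

```python
def decode_bio(tokens, tags):
    """Collect entity mentions from BIO tags, reading tokens in input order."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O" tag (or a stray "I"): close any open entity.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


def decode_paths(tokens, links):
    """Follow predicted directed links (i -> j) between token indices.

    Hypothetical decoding: each token has at most one successor, and a path
    starts at any linked token that is not the target of another link.
    """
    successor = dict(links)
    targets = set(successor.values())
    starts = [i for i in successor if i not in targets]
    entities = []
    for start in sorted(starts):
        path, node = [start], start
        # Stop at the end of the chain, or if a link would revisit a token.
        while node in successor and successor[node] not in path:
            node = successor[node]
            path.append(node)
        entities.append(" ".join(tokens[i] for i in path))
    return entities


# Assumed OCR output: the entity "Alice Smith" is split by an unrelated token.
tokens = ["Alice", "Invoice", "Smith"]

bio_result = decode_bio(tokens, ["B", "O", "I"])   # entity is cut short
path_result = decode_paths(tokens, [(0, 2)])       # link Alice -> Smith
print(bio_result, path_result)
```

In this sketch no BIO tag assignment over the disordered input yields "Alice Smith", whereas a single predicted link between the two tokens recovers it regardless of where they appear in the input.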
Visually-rich Document Understanding (VrDU) has attracted much research attention in recent years. Models with transformer-based backbones, pre-trained on large numbers of document images, have led to significant performance gains in this field. The major challenge is how to fuse the different modalities of documents (text, layout, and image) in a unified model with different pre-training tasks. This paper focuses on improving text-layout interactions and proposes a novel multi-modal pre-training model, LayoutMask. LayoutMask uses local 1D position, instead of global 1D position, as layout input and has two pre-training objectives: (1) Masked Language Modeling: predicting masked tokens with two novel masking strategies; (2) Masked Position Modeling: predicting masked 2D positions to improve layout representation learning. LayoutMask can enhance the interactions between text and layout modalities in a unified model and produce adaptive and robust multi-modal representations for downstream tasks. Experimental results show that our proposed method achieves state-of-the-art results on a wide variety of VrDU problems, including form understanding, receipt understanding, and document image classification.
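The difference between global and local 1D position mentioned above can be shown on a toy input. This is a hedged sketch under stated assumptions, not LayoutMask's actual code: it assumes that "local" means the token index restarts within each text segment, that indexing starts at 0, and that segments arrive as lists of tokens. The function names and example segments are hypothetical.

```python
def global_positions(segments):
    """One running index across the whole document, in the given segment order."""
    positions, counter = [], 0
    for segment in segments:
        for _ in segment:
            positions.append(counter)
            counter += 1
    return positions


def local_positions(segments):
    """Index restarts at 0 inside every segment (assumed convention)."""
    return [i for segment in segments for i, _ in enumerate(segment)]


# Assumed OCR segments of a receipt line.
segments = [["Total", "Amount"], ["$", "42", ".", "00"]]

global_ids = global_positions(segments)
local_ids = local_positions(segments)
print(global_ids, local_ids)
```

Under these assumptions, swapping the order of the two segments changes every global index but leaves the local indices within each segment unchanged, so cross-segment order has to be inferred from other inputs such as the 2D layout.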
Aesthetic image cropping is a practical but challenging task that aims to find the crops with the highest aesthetic quality in an image. Recently, many deep learning methods have been proposed to address this problem, but they do not reveal the intrinsic mechanism of aesthetic evaluation. In this paper, we propose an interpretable image cropping model that exposes this mechanism. For each image, we use a fully convolutional network to produce an aesthetic score map, which is shared among all candidate crops during crop-level aesthetic evaluation. We then require the aesthetic score map to be both composition-aware and saliency-aware. In particular, the same region is assigned different aesthetic scores depending on its relative position in different crops. Moreover, a visually salient region is supposed to have more sensitive aesthetic scores, so that our network can learn to place salient objects at more suitable positions. Such an aesthetic score map can be used to localize aesthetically important regions in an image, which sheds light on the composition rules learned by our model. We show the competitive performance of our model on the image cropping task across several benchmark datasets, and also demonstrate its generality in real-world applications.
Learning from web data has attracted considerable research interest in recent years. However, crawled web images usually contain two types of noise, label noise and background noise, which make them difficult to use effectively. Most existing methods either rely on human supervision or ignore the background noise. In this paper, we propose a novel method, ProtoNet, which is capable of handling both types of noise together without the supervision of clean images in the training stage. Specifically, we use a memory module to identify representative and discriminative prototypes for each category. We then remove noisy images and noisy region proposals from the web dataset with the aid of the memory module. Our approach is efficient and can be easily integrated into arbitrary CNN models. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our method.