"Text": models, code, and papers

StrucTexT: Structured Text Understanding with Multi-Modal Transformers

Aug 06, 2021
Yulin Li, Yuxi Qian, Yuchen Yu, Xiameng Qin, Chengquan Zhang, Yan Liu, Kun Yao, Junyu Han, Jingtuo Liu, Errui Ding

Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence. Due to the complexity of content and layout in VRDs, structured text understanding has been a challenging task. Most existing studies decoupled this problem into two sub-tasks: entity labeling and entity linking, which require an entire understanding of the context of documents at both token and segment levels. However, little work has been concerned with the solutions that efficiently extract the structured data from different levels. This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks. Specifically, based on the transformer, we introduce a segment-token aligned encoder to deal with the entity labeling and entity linking tasks at different levels of granularity. Moreover, we design a novel pre-training strategy with three self-supervised tasks to learn a richer representation. StrucTexT uses the existing Masked Visual Language Modeling task and the new Sentence Length Prediction and Paired Boxes Direction tasks to incorporate the multi-modal information across text, image, and layout. We evaluate our method for structured text understanding at segment-level and token-level and show it outperforms the state-of-the-art counterparts with significantly superior performance on the FUNSD, SROIE, and EPHOIE datasets.

* 9 pages 
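
As a rough illustration of what a Paired Boxes Direction target might look like, the sketch below assigns a direction label to a pair of segment boxes by bucketing the angle between their centres. The abstract does not say how labels are built, so the eight buckets, the `(x0, y0, x1, y1)` box format, and the names `box_center` and `direction_bucket` are all assumptions made for this example.

```python
import math


def box_center(box):
    """Return the centre of a box given as (x0, y0, x1, y1). Format is assumed."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def direction_bucket(box_a, box_b, num_buckets=8):
    """Bucket the angle of the vector from box_a's centre to box_b's centre.

    Hypothetical labelling scheme: angles in [0, 360) are split into
    num_buckets equal ranges, and the range index is the class label.
    """
    ax, ay = box_center(box_a)
    bx, by = box_center(box_b)
    angle = math.degrees(math.atan2(by - ay, bx - ax)) % 360.0
    width = 360.0 / num_buckets
    # The final modulo guards against a float rounding up to exactly 360.
    return int(angle // width) % num_buckets


if __name__ == "__main__":
    anchor = (0, 0, 10, 10)
    print(direction_bucket(anchor, (20, 0, 30, 10)))   # box on the +x side
    print(direction_bucket(anchor, (0, 20, 10, 30)))   # box on the +y side
```

A classifier head trained on such labels would have to combine layout and content cues from both segments, which is one plausible way a pre-training task could inject geometric information.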


Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts

Oct 22, 2021
Allen Kim, Charuta Pethe, Naoya Inoue, Steve Skiena

Substantial amounts of work are required to clean large collections of digitized books for NLP analysis, both because of the presence of errors in the scanned text and the presence of duplicate volumes in the corpora. In this paper, we consider the issue of deduplication in the presence of optical character recognition (OCR) errors. We present methods to handle these errors, evaluated on a collection of 19,347 texts from the Project Gutenberg dataset and 96,635 texts from the HathiTrust Library. We demonstrate that improvements in language models now enable the detection and correction of OCR errors without consideration of the scanning image itself. The inconsistencies found by aligning pairs of scans of the same underlying work provide training data to build models for detecting and correcting errors. We identify the canonical version for each of 17,136 repeatedly-scanned books from 58,808 scans. Finally, we investigate methods to detect and correct errors in single-copy texts. We show that on average, our method corrects over six times as many errors as it introduces. We also analyze the relation between scanning quality and other factors such as location and publication year.

* Accepted for Findings of EMNLP 2021 
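
The idea of mining inconsistencies from aligned scans can be sketched with Python's standard `difflib`. This is a minimal illustration under stated assumptions, not the authors' pipeline: it aligns two scans at the whitespace-token level and returns the spans where they disagree. The function name `disagreements` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def disagreements(scan_a, scan_b):
    """Align two OCR outputs token by token and collect differing spans.

    Each returned pair is (text in scan_a, text in scan_b). One side may be
    an empty string when a span exists in only one of the scans.
    """
    tokens_a = scan_a.split()
    tokens_b = scan_b.split()
    # autojunk=False disables difflib's popularity heuristic for long inputs.
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            pairs.append((" ".join(tokens_a[i1:i2]), " ".join(tokens_b[j1:j2])))
    return pairs


if __name__ == "__main__":
    first = "the quick brown fox"
    second = "the qnick brown fox"
    print(disagreements(first, second))
```

Pairs collected this way show only that the two scans differ. Deciding which side is the error, and turning the pairs into training examples, would need a further step that this sketch does not attempt.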


StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation

Dec 15, 2021
Umut Kocasari, Alara Dirik, Mert Tiftikci, Pinar Yanardag

Discovering meaningful directions in the latent space of GANs to manipulate semantic attributes typically requires large amounts of labeled data. Recent work aims to overcome this limitation by leveraging the power of Contrastive Language-Image Pre-training (CLIP), a joint text-image model. While promising, these methods require several hours of preprocessing or training to achieve the desired manipulations. In this paper, we present StyleMC, a fast and efficient method for text-driven image generation and manipulation. StyleMC uses a CLIP-based loss and an identity loss to manipulate images via a single text prompt without significantly affecting other attributes. Unlike prior work, StyleMC requires only a few seconds of training per text prompt to find stable global directions, does not require prompt engineering and can be used with any pre-trained StyleGAN2 model. We demonstrate the effectiveness of our method and compare it to state-of-the-art methods. Our code can be found at

* Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2022) 


Towards Robustness of Text-to-SQL Models against Synonym Substitution

Jun 19, 2021
Yujian Gan, Xinyun Chen, Qiuping Huang, Matthew Purver, John R. Woodward, Jinxia Xie, Pengsheng Huang

Recently, there has been significant progress in studying neural networks to translate text descriptions into SQL queries. Despite achieving good performance on some public benchmarks, existing text-to-SQL models typically rely on the lexical matching between words in natural language (NL) questions and tokens in table schemas, which may render the models vulnerable to attacks that break the schema linking mechanism. In this work, we investigate the robustness of text-to-SQL models to synonym substitution. In particular, we introduce Spider-Syn, a human-curated dataset based on the Spider benchmark for text-to-SQL translation. NL questions in Spider-Syn are modified from Spider by replacing their schema-related words with manually selected synonyms that reflect real-world question paraphrases. We observe that accuracy drops dramatically when this explicit correspondence between NL questions and table schemas is eliminated, even if the synonyms are not adversarially selected to conduct worst-case adversarial attacks. Finally, we present two categories of approaches to improve the model robustness. The first category of approaches utilizes additional synonym annotations for table schemas by modifying the model input, while the second category is based on adversarial training. We demonstrate that both categories of approaches significantly outperform their counterparts without the defense, and that the first category is more effective.

* To appear in ACL 2021 
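
To make the substitution concrete, the sketch below rewrites a question using a hand-written synonym map and checks whether any schema term still appears verbatim afterwards. Spider-Syn itself was curated by human annotators, so this is only a toy illustration. The question, the synonym map, the schema terms, and both function names are invented for the example.

```python
import re


def substitute_synonyms(question, synonyms):
    """Replace each word that has an entry in `synonyms` (keys are lowercase)."""
    def swap(match):
        word = match.group(0)
        return synonyms.get(word.lower(), word)

    return re.sub(r"[A-Za-z_]+", swap, question)


def has_lexical_overlap(question, schema_terms):
    """Return True if any schema term occurs as a whole word in the question."""
    words = set(re.findall(r"[a-z_]+", question.lower()))
    return any(term.lower() in words for term in schema_terms)


if __name__ == "__main__":
    schema_terms = ["singers", "age"]          # hypothetical schema vocabulary
    synonyms = {"singers": "vocalists"}        # hypothetical manual synonym map
    original = "How many singers are older than 30?"
    rewritten = substitute_synonyms(original, synonyms)
    print(rewritten)
    print(has_lexical_overlap(original, schema_terms))
    print(has_lexical_overlap(rewritten, schema_terms))
```

A model that links questions to schemas purely by exact string match loses its cue once the overlap check turns false, which is the failure mode the abstract describes.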


TediGAN: Text-Guided Diverse Image Generation and Manipulation

Dec 06, 2020
Weihao Xia, Yujiu Yang, Jing-Hao Xue, Baoyuan Wu

In this work, we propose TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions. The proposed method consists of three components: StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The inversion module is to train an image encoder to map real images to the latent space of a well-trained StyleGAN. The visual-linguistic similarity is to learn the text-image matching by mapping the image and text into a common embedding space. The instance-level optimization is for identity preservation in manipulation. Our model can provide the lowest effect guarantee, and produce diverse and high-quality images with an unprecedented resolution at 1024. Using a control mechanism based on style-mixing, our TediGAN inherently supports image synthesis with multi-modal inputs, such as sketches or semantic labels with or without instance (text or real image) guidance. To facilitate text-guided multi-modal synthesis, we propose the Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation map, sketch, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at



Short Text Classification via Knowledge powered Attention with Similarity Matrix based CNN

Feb 09, 2020
Mingchen Li, Gabtone. Clinton, Yijia Miao, Feng Gao

Short text, such as chat messages, SMS, and product reviews, is becoming more and more popular on the web. Accurately classifying short text is an important and challenging task. A number of studies have difficulty addressing this problem because of word ambiguity and data sparsity. To address this issue, we propose a knowledge powered attention with similarity matrix based convolutional neural network (KASM) model, which can compute comprehensive information by utilizing knowledge and a deep neural network. We use a knowledge graph (KG) to enrich the semantic representation of short text; specifically, parent-entity information is introduced in our model. Meanwhile, we consider the literal-level word interaction between the short text and the representation of the label, and utilize a similarity matrix based convolutional neural network (CNN) to extract it. To measure the importance of knowledge, we introduce attention mechanisms to select the important information. Experimental results on five standard datasets show that our model significantly outperforms state-of-the-art methods.

* 10 pages 


Key Fact as Pivot: A Two-Stage Model for Low Resource Table-to-Text Generation

Aug 08, 2019
Shuming Ma, Pengcheng Yang, Tianyu Liu, Peng Li, Jie Zhou, Xu Sun

Table-to-text generation aims to translate structured data into unstructured text. Most existing methods adopt the encoder-decoder framework to learn the transformation, which requires large-scale training samples. However, the lack of large parallel data is a major practical problem for many domains. In this work, we consider the scenario of low resource table-to-text generation, where only limited parallel data is available. We propose a novel model to separate the generation into two stages: key fact prediction and surface realization. It first predicts the key facts from the tables, and then generates the text with the key facts. The training of key fact prediction needs far less annotated data, while surface realization can be trained with a pseudo parallel corpus. We evaluate our model on a biography generation dataset. Our model achieves a BLEU score of 27.34 with only 1,000 parallel samples, while the baseline model obtains a BLEU score of only 9.71.


Word level Script Identification from Bangla and Devanagri Handwritten Texts mixed with Roman Script

Feb 21, 2010
Ram Sarkar, Nibaran Das, Subhadip Basu, Mahantapas Kundu, Mita Nasipuri, Dipak Kumar Basu

India is a multi-lingual country where Roman script is often used alongside different Indic scripts in a text document. To develop a script-specific handwritten Optical Character Recognition (OCR) system, it is therefore necessary to identify the scripts of handwritten text correctly. In this paper, we present a system that automatically separates the scripts of handwritten words in a document written in Bangla or Devanagri mixed with Roman script. In this script separation technique, we first extract the text lines and words from document pages using a script-independent Neighboring Component Analysis technique. We then design a Multi Layer Perceptron (MLP) based classifier for script separation, trained with 8 different word-level holistic features. Two equal-sized datasets, one with Bangla and Roman scripts and the other with Devanagri and Roman scripts, are prepared for the system evaluation. On the respective independent text samples, word-level script identification accuracies of 99.29% and 98.43% are achieved.

* Journal of Computing, Volume 2, Issue 2, February 2010, 


GraphFormers: GNN-nested Language Models for Linked Text Representation

May 06, 2021
Junhan Yang, Zheng Liu, Shitao Xiao, Chaozhuo Li, Guangzhong Sun, Xing Xie

Linked text representation is critical for many intelligent web applications, such as online advertisement and recommender systems. Recent breakthroughs on pretrained language models and graph neural networks facilitate the development of corresponding techniques. However, the existing works mainly rely on cascaded model structures: the texts are independently encoded by language models at first, and the textual embeddings are further aggregated by graph neural networks. We argue that the neighbourhood information is insufficiently utilized within the above process, which restricts the representation quality. In this work, we propose GraphFormers, where graph neural networks are nested alongside each transformer layer of the language models. On top of the above architecture, the linked texts will iteratively extract neighbourhood information for the enhancement of their own semantics. Such an iterative workflow gives rise to more effective utilization of neighbourhood information, which contributes to the representation quality. We further introduce an adaptation called unidirectional GraphFormers, which is much more efficient and comparably effective; and we leverage a pretraining strategy called the neighbourhood-aware masked language modeling to enhance the training effect. We perform extensive experiment studies with three large-scale linked text datasets, whose results verify the effectiveness of our proposed methods.
