The recognition of Arabic text, in both handwritten and printed forms, represents a fertile provenance of technical difficulties for Optical Character Recognition (OCR). Indeed, the printed is commonly governed by well-established calligraphy rules and the characters are well aligned. However, there is not always a system capable of reading Arabic printed text in an unconstrained environments such as unlimited vocabulary, multi styles, mixed-font and their great morphological variability. This diversity complicates the choice of features to extract and algorithm of segmentation. In this context, we adopt a new solution for unlimited-vocabulary and mixed-font Arabic printed text recognition. The proposed system is based on the adoption of Bag of Features (BoF) model using Sparse Auto-Encoder (SAE) for features representation and Hidden Markov Models (HMM) for recognition. As results, the obtained average accuracies of recognition vary between 99.65% and 99.96% for the mono-font and exceed 99% for mixed-font.
Traditional methods of summarization are not cost-effective and possible today. Extractive summarization is a process that helps to extract the most important sentences from a text automatically and generates a short informative summary. In this work, we propose an unsupervised method to summarize Persian texts. This method is a novel hybrid approach that clusters the concepts of the text using deep learning and traditional statistical methods. First we produce a word embedding based on Hamshahri2 corpus and a dictionary of word frequencies. Then the proposed algorithm extracts the keywords of the document, clusters its concepts, and finally ranks the sentences to produce the summary. We evaluated the proposed method on Pasokh single-document corpus using the ROUGE evaluation measure. Without using any hand-crafted features, our proposed method achieves state-of-the-art results. We compared our unsupervised method with the best supervised Persian methods and we achieved an overall improvement of ROUGE-2 recall score of 7.5%.
This work presents a two-stage text line detection method for historical documents. In a first stage, a deep neural network called ARU-Net labels pixels to belong to one of the three classes: baseline, separator or other. The separator class marks beginning and end of each text line. The ARU-Net is trainable from scratch with manageably few manually annotated example images (less than 50). This is achieved by utilizing data augmentation strategies. The network predictions are used as input for the second stage which performs a bottom-up clustering to build baselines. The developed method is capable of handling complex layouts as well as curved and arbitrarily oriented text lines. It substantially outperforms current state-of-the-art approaches. For example, for the complex track of the cBAD: ICDAR2017 Competiton on Baseline Detection the F-value is increased from 0.859 to 0.922. The framework to train and run the ARU-Net is open source.
Controlled text generation techniques aim to regulate specific attributes (e.g. sentiment) while preserving the attribute independent content. The state-of-the-art approaches model the specified attribute as a structured or discrete representation while making the content representation independent of it to achieve a better control. However, disentangling the text representation into separate latent spaces overlooks complex dependencies between content and attribute, leading to generation of poorly constructed and not so meaningful sentences. Moreover, such an approach fails to provide a finer control on the degree of attribute change. To address these problems of controlled text generation, in this paper, we propose DE-VAE, a hierarchical framework which captures both information enriched entangled representation and attribute specific disentangled representation in different hierarchies. DE-VAE achieves better control of sentiment as an attribute while preserving the content by learning a suitable lossless transformation network from the disentangled sentiment space to the desired entangled representation. Through feature supervision on a single dimension of the disentangled representation, DE-VAE maps the variation of sentiment to a continuous space which helps in smoothly regulating sentiment from positive to negative and vice versa. Detailed experiments on three publicly available review datasets show the superiority of DE-VAE over recent state-of-the-art approaches.
Extracting information from unstructured text documents is a demanding task, since these documents can have a broad variety of different layouts and a non-trivial reading order, like it is the case for multi-column documents or nested tables. Additionally, many business documents are received in paper form, meaning that the textual contents need to be digitized before further analysis. Nonetheless, automatic detection and capturing of crucial document information like the sender address would boost many companies' processing efficiency. In this work we propose a hybrid approach that combines deep learning with reasoning for finding and extracting addresses from unstructured text documents. We use a visual deep learning model to detect the boundaries of possible address regions on the scanned document images and validate these results by analyzing the containing text using domain knowledge represented as a rule based system.
The neural attention model has achieved great success in data-to-text generation tasks. Though usually excelling at producing fluent text, it suffers from the problem of information missing, repetition and "hallucination". Due to the black-box nature of the neural attention architecture, avoiding these problems in a systematic way is non-trivial. To address this concern, we propose to explicitly segment target text into fragment units and align them with their data correspondences. The segmentation and correspondence are jointly learned as latent variables without any human annotations. We further impose a soft statistical constraint to regularize the segmental granularity. The resulting architecture maintains the same expressive power as neural attention models, while being able to generate fully interpretable outputs with several times less computational cost. On both E2E and WebNLG benchmarks, we show the proposed model consistently outperforms its neural attention counterparts.
LexNLP is an open source Python package focused on natural language processing and machine learning for legal and regulatory text. The package includes functionality to (i) segment documents, (ii) identify key text such as titles and section headings, (iii) extract over eighteen types of structured information like distances and dates, (iv) extract named entities such as companies and geopolitical entities, (v) transform text into features for model training, and (vi) build unsupervised and supervised models such as word embedding or tagging models. LexNLP includes pre-trained models based on thousands of unit tests drawn from real documents available from the SEC EDGAR database as well as various judicial and regulatory proceedings. LexNLP is designed for use in both academic research and industrial applications, and is distributed at https://github.com/LexPredict/lexpredict-lexnlp.
Writing a good essay typically involves students revising an initial paper draft after receiving feedback. We present eRevise, a web-based writing and revising environment that uses natural language processing features generated for rubric-based essay scoring to trigger formative feedback messages regarding students' use of evidence in response-to-text writing. By helping students understand the criteria for using text evidence during writing, eRevise empowers students to better revise their paper drafts. In a pilot deployment of eRevise in 7 classrooms spanning grades 5 and 6, the quality of text evidence usage in writing improved after students received formative feedback then engaged in paper revision.
Synthesizing images or texts automatically is a useful research area in the artificial intelligence nowadays. Generative adversarial networks (GANs), which are proposed by Goodfellow in 2014, make this task to be done more efficiently by using deep neural networks. We consider generating corresponding images from an input text description using a GAN. In this paper, we analyze the GAN-CLS algorithm, which is a kind of advanced method of GAN proposed by Scott Reed in 2016. First, we find the problem with this algorithm through inference. Then we correct the GAN-CLS algorithm according to the inference by modifying the objective function of the model. Finally, we do the experiments on the Oxford-102 dataset and the CUB dataset. As a result, our modified algorithm can generate images which are more plausible than the GAN-CLS algorithm in some cases. Also, some of the generated images match the input texts better.
Text Style Transfer can be named as one of the most important Natural Language Processing tasks. Up until now, there have been several approaches and methods experimented for this purpose. In this work, we introduce PGST, a novel polyglot text style transfer approach in gender domain composed of different building blocks. If they become fulfilled with required elements, our method can be applied in multiple languages. We have proceeded with a pre-trained word embedding for token replacement purposes, a character-based token classifier for gender exchange purposes, and the beam search algorithm for extracting the most fluent combination among all suggestions. Since different approaches are introduced in our research, we determine a trade-off value for evaluating different models' success in faking our gender identification model with transferred text. To demonstrate our method's multilingual applicability, we applied our method on both English and Persian corpora and finally ended up defeating our proposed gender identification model by 45.6% and 39.2%, respectively, and obtained highly competitive evaluation results in an analogy among English state of the art methods.