"Text": models, code, and papers

Pose Guided Multi-person Image Generation From Text

Mar 09, 2022
Soon Yau Cheong, Armin Mustafa, Andrew Gilbert

Transformers have recently been shown to generate high quality images from texts. However, existing methods struggle to create high fidelity full-body images, especially of multiple people. A person's pose has a high degree of freedom that is difficult to describe using words only; this creates errors in the generated image, such as incorrect body proportions and pose. We propose a pose-guided text-to-image model, using pose as an additional input constraint. Using the proposed Keypoint Pose Encoding (KPE) to encode human pose into a low-dimensional representation, our model can generate novel multi-person images accurately representing the pose and text descriptions provided, with minimal errors. We demonstrate that KPE is invariant to changes in the target image domain and image resolution; we show results on the Deepfashion dataset and create a new multi-person Deepfashion dataset to demonstrate the multi-capabilities of our approach.

Text-to-Image Synthesis Based on Machine Generated Captions

Oct 09, 2019
Marco Menardi, Alex Falcon, Saida S. Mohamed, Lorenzo Seidenari, Giuseppe Serra, Alberto Del Bimbo, Carlo Tasso

Text to Image Synthesis refers to the automatic generation of a photo-realistic image starting from a given text and is revolutionizing many real-world applications. To perform this process it is necessary to exploit datasets containing captioned images, meaning that each image is associated with one (or more) captions describing it. Despite the abundance of uncaptioned image datasets, the number of captioned datasets is limited. To address this issue, in this paper we propose an approach capable of generating images starting from a given text using conditional GANs trained on an uncaptioned image dataset. In particular, uncaptioned images are fed to an Image Captioning Module to generate the descriptions. Then, the GAN Module is trained on both the input image and the machine-generated caption. To evaluate the results, the performance of our solution is compared with the results obtained by the unconditional GAN. For the experiments, we chose to use the uncaptioned dataset LSUN bedroom. The results obtained in our study are preliminary but still promising.

Hybrid approaches for automatic vowelization of Arabic texts

Oct 09, 2014
Mohamed Bebah, Chennoufi Amine, Mazroui Azzeddine, Lakhouaja Abdelhak

Hybrid approaches for automatic vowelization of Arabic texts are presented in this article. The process is made up of two modules. In the first one, a morphological analysis of the text words is performed using the open source morphological analyzer AlKhalil Morpho Sys. For each word, analyzed out of context, the output is its set of possible vowelizations. The integration of this analyzer in our vowelization system required the addition of a lexical database containing the most frequent words in the Arabic language. Using a statistical approach based on two hidden Markov models (HMM), the second module aims to eliminate the ambiguities. For the first HMM, the unvowelized Arabic words are the observed states and the vowelized words are the hidden states. The observed states of the second HMM are identical to those of the first, but the hidden states are the lists of possible diacritics of the word without its Arabic letters. Our system uses the Viterbi algorithm to select the optimal path among the solutions proposed by AlKhalil Morpho Sys. Our approach opens an important way to improve the performance of automatic vowelization of Arabic texts for other uses in automatic natural language processing.

* 19 pages 
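
The Viterbi selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `viterbi`, the probability tables, and the Latin-script placeholder words are all hypothetical, and the candidate lists stand in for whatever a morphological analyzer would propose.

```python
import math

def viterbi(observations, candidates, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence.

    observations: observed tokens (here: unvowelized words)
    candidates:   maps each observation to its possible hidden states
                  (here: candidate vowelizations, hypothetical values)
    start_p, trans_p, emit_p: probability tables (hypothetical values)
    """
    tiny = 1e-12  # floor for unseen events so log() never receives 0

    def lp(p):
        return math.log(max(p, tiny))

    # best[t][s] = (log-probability of the best path ending in s, previous state)
    first = observations[0]
    best = [{s: (lp(start_p.get(s, 0)) + lp(emit_p.get((s, first), 0)), None)
             for s in candidates[first]}]

    for t in range(1, len(observations)):
        obs = observations[t]
        column = {}
        for s in candidates[obs]:
            # Pick the predecessor that maximizes the path score into s.
            prev, score = max(
                ((p, best[t - 1][p][0] + lp(trans_p.get((p, s), 0)))
                 for p in best[t - 1]),
                key=lambda pair: pair[1],
            )
            column[s] = (score + lp(emit_p.get((s, obs), 0)), prev)
        best.append(column)

    # Backtrack from the best final state.
    state = max(best[-1], key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return path[::-1]

# Toy data: placeholder transliterations, not real corpus statistics.
observations = ["ktb", "alwld"]
candidates = {"ktb": ["kataba", "kutub"], "alwld": ["alwaladu"]}
start_p = {"kataba": 0.4, "kutub": 0.6}
trans_p = {("kataba", "alwaladu"): 0.7, ("kutub", "alwaladu"): 0.1}
emit_p = {("kataba", "ktb"): 1.0, ("kutub", "ktb"): 1.0,
          ("alwaladu", "alwld"): 1.0}

print(viterbi(observations, candidates, start_p, trans_p, emit_p))
```

In this toy setup the start probabilities favour the second candidate for the first word, but the transition probabilities make the first candidate the better choice once the following word is taken into account, which is the kind of contextual disambiguation the abstract attributes to the HMM module.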

Robust Text CAPTCHAs Using Adversarial Examples

Jan 07, 2021
Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, Cho-Jui Hsieh

CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a widely used technology to distinguish real users from automated users such as bots. However, the advance of AI technologies weakens many CAPTCHA tests and can induce security concerns. In this paper, we propose a user-friendly text-based CAPTCHA generation method named Robust Text CAPTCHA (RTC). In the first stage, the foregrounds and backgrounds are constructed with randomly sampled font and background images, which are then synthesized into identifiable pseudo adversarial CAPTCHAs. In the second stage, we design and apply a highly transferable adversarial attack for text CAPTCHAs to better obstruct CAPTCHA solvers. Our experiments cover comprehensive models including shallow models such as KNN, SVM and random forest, various deep neural networks and OCR models. Experiments show that our CAPTCHAs have a failure rate lower than one millionth in general and high usability. They are also robust against various defensive techniques that attackers may employ, including adversarial training, data pre-processing and manual tagging.

Heavy-tailed Representations, Text Polarity Classification & Data Augmentation

Mar 25, 2020
Hamid Jalalzai, Pierre Colombo, Chloé Clavel, Eric Gaussier, Giovanna Varni, Emmanuel Vignon, Anne Sabourin

The dominant approaches to text representation in natural language rely on learning embeddings on massive corpora which have convenient properties such as compositionality and distance preservation. In this paper, we develop a novel method to learn a heavy-tailed embedding with desirable regularity properties regarding the distributional tails, which makes it possible to analyze the points far away from the distribution bulk using the framework of multivariate extreme value theory. In particular, a classifier dedicated to the tails of the proposed embedding is obtained whose performance exceeds the baseline. This classifier exhibits a scale invariance property which we leverage by introducing a novel text generation method for label-preserving dataset augmentation. Numerical experiments on synthetic and real text data demonstrate the relevance of the proposed framework and confirm that this method generates meaningful sentences with a controllable attribute, e.g. positive or negative sentiment.

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Dec 31, 2019
Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou

Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the wide spread of pre-training models for NLP applications, they have focused almost exclusively on text-level manipulation, while neglecting the layout and style information that is vital for document image understanding. In this paper, we propose LayoutLM to jointly model the interaction between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. We also leverage the image features to incorporate the style information of words in LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training, leading to significant performance improvement in downstream tasks for document image understanding.

* Work in progress 

Extracting Connected Concepts from Biomedical Texts using Fog Index

Jul 30, 2013
Rushdi Shams, Robert E. Mercer

In this paper, we establish Fog Index (FI) as a text filter to locate the sentences in texts that contain connected biomedical concepts of interest. To do so, we have used 24 random papers each containing four pairs of connected concepts. For each pair, we categorize sentences based on whether they contain both, any or none of the concepts. We then use FI to measure difficulty of the sentences of each category and find that sentences containing both of the concepts have low readability. We rank sentences of a text according to their FI and select 30 percent of the most difficult sentences. We use an association matrix to track the most frequent pairs of concepts in them. This matrix reports that the first filter produces some pairs that hold almost no connections. To remove these unwanted pairs, we use the Equally Weighted Harmonic Mean of their Positive Predictive Value (PPV) and Sensitivity as a second filter. Experimental results demonstrate the effectiveness of our method.

* 12th Conference of the Pacific Association for Computational Linguistics (PACLING 2011), Kuala Lumpur, Malaysia, July 19-21, 2011 
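
The two filters mentioned in this abstract can be sketched as follows, assuming that "Fog Index" refers to the standard Gunning Fog formula and that the equally weighted harmonic mean of PPV and sensitivity is the usual F1-style combination. The syllable counter is a crude vowel-group heuristic, and all function names are hypothetical; the paper's own preprocessing may differ.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text):
    """Gunning Fog Index: 0.4 * (words per sentence + percent complex words).

    'Complex' is approximated here as three or more syllables; the usual
    exclusions (proper nouns, common suffixes) are not applied.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    complex_count = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * (len(words) / len(sentences)
                  + 100.0 * complex_count / len(words))

def hardest_sentences(sentences, fraction=0.3):
    """First filter: keep the given fraction of sentences with the highest index."""
    ranked = sorted(sentences, key=fog_index, reverse=True)
    keep = max(1, round(len(sentences) * fraction))
    return ranked[:keep]

def harmonic_mean_ppv_sensitivity(tp, fp, fn):
    """Second filter score: equally weighted harmonic mean of PPV and sensitivity."""
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    if ppv + sensitivity == 0:
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)

sample = [
    "The cat sat.",
    "Biomedical terminology complicates readability considerably.",
    "A dog ran.",
]
print(hardest_sentences(sample))
```

How the counts `tp`, `fp`, and `fn` are obtained for a concept pair is not stated in the abstract, so the last function only shows the combination step.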

Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts

Nov 16, 2021
Yan Zeng, Xinsong Zhang, Hang Li

Most existing methods in vision language pre-training rely on object-centric features extracted through object detection, and make fine-grained alignments between the extracted features and texts. We argue that the use of object detection may not be suitable for vision language pre-training. Instead, we point out that the task should be performed so that the regions of 'visual concepts' mentioned in the texts are located in the images, and in the meantime alignments between texts and visual concepts are identified, where the alignments are in multi-granularity. This paper proposes a new method called X-VLM to perform 'multi-grained vision language pre-training'. Experimental results show that X-VLM consistently outperforms state-of-the-art methods in many downstream vision language tasks.

* 13 pages, 5 figures 

Pre-trained Language Models as Prior Knowledge for Playing Text-based Games

Jul 18, 2021
Ishika Singh, Gargi Singh, Ashutosh Modi

Recently, text world games have been proposed to enable artificial agents to understand and reason about real-world scenarios. These text-based games are challenging for artificial agents, as they require understanding and interaction using natural language in a partially observable environment. In this paper, we improve the semantic understanding of the agent by proposing a simple RL with LM framework where we use transformer-based language models with Deep RL models. We perform a detailed study of our framework to demonstrate how our model outperforms all existing agents on the popular game, Zork1, to achieve a score of 44.7, which is 1.6 higher than the state-of-the-art model. Our proposed approach also performs comparably to the state-of-the-art models on the other set of text games.

* 55 Pages (8 Pages main content + 2 Pages references + 45 Pages Appendix) 

End-to-end Handwritten Paragraph Text Recognition Using a Vertical Attention Network

Dec 07, 2020
Denis Coquenet, Clément Chatelain, Thierry Paquet

Unconstrained handwritten text recognition remains challenging for computer vision systems. Paragraph text recognition is traditionally achieved by two models: the first one for line segmentation and the second one for text line recognition. We propose a unified end-to-end model using hybrid attention to tackle this task. We achieve state-of-the-art character error rate at line and paragraph levels on three popular datasets: 1.90% for RIMES, 4.32% for IAM and 3.63% for READ 2016. The proposed model can be trained from scratch, without using any segmentation label contrary to the standard approach. Our code and trained model weights are available at
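
For readers unfamiliar with the metric reported above, character error rate is commonly computed as the total character-level edit distance divided by the total number of reference characters. The sketch below assumes that common definition; the function names are hypothetical, and the paper's exact normalization may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_error_rate(references, hypotheses):
    """Total edits over total reference characters, across all samples."""
    total_edits = sum(edit_distance(r, h)
                      for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0

print(character_error_rate(["hello"], ["hallo"]))
```

The same function applies at line or paragraph level; only the unit of text passed in as a reference changes.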