Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

Jointly Dynamic Topic Model for Recognition of Lead-lag Relationship in Two Text Corpora

Nov 21, 2021
Yandi Zhu, Xiaoling Lu, Jingya Hong, Feifei Wang

Topic evolution modeling has received significant attentions in recent decades. Although various topic evolution models have been proposed, most studies focus on the single document corpus. However in practice, we can easily access data from multiple sources and also observe relationships between them. Then it is of great interest to recognize the relationship between multiple text corpora and further utilize this relationship to improve topic modeling. In this work, we focus on a special type of relationship between two text corpora, which we define as the "lead-lag relationship". This relationship characterizes the phenomenon that one text corpus would influence the topics to be discussed in the other text corpus in the future. To discover the lead-lag relationship, we propose a jointly dynamic topic model and also develop an embedding extension to address the modeling problem of large-scale text corpus. With the recognized lead-lag relationship, the similarities of the two text corpora can be figured out and the quality of topic learning in both corpora can be improved. We numerically investigate the performance of the jointly dynamic topic modeling approach using synthetic data. Finally, we apply the proposed model on two text corpora consisting of statistical papers and the graduation theses. Results show the proposed model can well recognize the lead-lag relationship between the two corpora, and the specific and shared topic patterns in the two corpora are also discovered.

  Access Paper or Ask Questions

SynthText3D: Synthesizing Scene Text Images from 3D Virtual Worlds

Jul 13, 2019
Minghui Liao, Boyu Song, Minghang He, Shangbang Long, Cong Yao, Xiang Bai

With the development of deep neural networks, the demand for a significant amount of annotated training data becomes the performance bottlenecks in many fields of research and applications. Image synthesis can generate annotated images automatically and freely, which gains increasing attention recently. In this paper, we propose to synthesize scene text images from the 3D virtual worlds, where the precise descriptions of scenes, editable illumination/visibility, and realistic physics are provided. Different from the previous methods which paste the rendered text on static 2D images, our method can render the 3D virtual scene and text instances as an entirety. In this way, complex perspective transforms, various illuminations, and occlusions can be realized in our synthesized scene text images. Moreover, the same text instances with various viewpoints can be produced by randomly moving and rotating the virtual camera, which acts as human eyes. The experiments on the standard scene text detection benchmarks using the generated synthetic data demonstrate the effectiveness and superiority of the proposed method. The code and synthetic data will be made available at

  Access Paper or Ask Questions

Universal Adversarial Perturbation for Text Classification

Oct 10, 2019
Hang Gao, Tim Oates

Given a state-of-the-art deep neural network text classifier, we show the existence of a universal and very small perturbation vector (in the embedding space) that causes natural text to be misclassified with high probability. Unlike images on which a single fixed-size adversarial perturbation can be found, text is of variable length, so we define the "universality" as "token-agnostic", where a single perturbation is applied to each token, resulting in different perturbations of flexible sizes at the sequence level. We propose an algorithm to compute universal adversarial perturbations, and show that the state-of-the-art deep neural networks are highly vulnerable to them, even though they keep the neighborhood of tokens mostly preserved. We also show how to use these adversarial perturbations to generate adversarial text samples. The surprising existence of universal "token-agnostic" adversarial perturbations may reveal important properties of a text classifier.

  Access Paper or Ask Questions

TextSETTR: Label-Free Text Style Extraction and Tunable Targeted Restyling

Oct 08, 2020
Parker Riley, Noah Constant, Mandy Guo, Girish Kumar, David Uthus, Zarana Parekh

We present a novel approach to the problem of text style transfer. Unlike previous approaches that use parallel or non-parallel labeled data, our technique removes the need for labels entirely, relying instead on the implicit connection in style between adjacent sentences in unlabeled text. We show that T5 (Raffel et al., 2019), a strong pretrained text-to-text model, can be adapted to extract a style vector from arbitrary text and use this vector to condition the decoder to perform style transfer. As the resulting learned style vector space encodes many facets of textual style, we recast transfers as "targeted restyling" vector operations that adjust specific attributes of the input text while preserving others. When trained over unlabeled Amazon reviews data, our resulting TextSETTR model is competitive on sentiment transfer, even when given only four exemplars of each class. Furthermore, we demonstrate that a single model trained on unlabeled Common Crawl data is capable of transferring along multiple dimensions including dialect, emotiveness, formality, politeness, and sentiment.

  Access Paper or Ask Questions

Unsupervised Pidgin Text Generation By Pivoting English Data and Self-Training

Mar 18, 2020
Ernie Chang, David Ifeoluwa Adelani, Xiaoyu Shen, Vera Demberg

West African Pidgin English is a language that is significantly spoken in West Africa, consisting of at least 75 million speakers. Nevertheless, proper machine translation systems and relevant NLP datasets for pidgin English are virtually absent. In this work, we develop techniques targeted at bridging the gap between Pidgin English and English in the context of natural language generation. %As a proof of concept, we explore the proposed techniques in the area of data-to-text generation. By building upon the previously released monolingual Pidgin English text and parallel English data-to-text corpus, we hope to build a system that can automatically generate Pidgin English descriptions from structured data. We first train a data-to-English text generation system, before employing techniques in unsupervised neural machine translation and self-training to establish the Pidgin-to-English cross-lingual alignment. The human evaluation performed on the generated Pidgin texts shows that, though still far from being practically usable, the pivoting + self-training technique improves both Pidgin text fluency and relevance.

* Accepted to Workshop at ICLR 2020 

  Access Paper or Ask Questions

Jointly Measuring Diversity and Quality in Text Generation Models

May 20, 2019
Ehsan Montahaei, Danial Alihosseini, Mahdieh Soleymani Baghshah

Text generation is an important Natural Language Processing task with various applications. Although several metrics have already been introduced to evaluate the text generation methods, each of them has its own shortcomings. The most widely used metrics such as BLEU only consider the quality of generated sentences and neglect their diversity. For example, repeatedly generation of only one high quality sentence would result in a high BLEU score. On the other hand, the more recent metric introduced to evaluate the diversity of generated texts known as Self-BLEU ignores the quality of generated texts. In this paper, we propose metrics to evaluate both the quality and diversity simultaneously by approximating the distance of the learned generative model and the real data distribution. For this purpose, we first introduce a metric that approximates this distance using n-gram based measures. Then, a feature-based measure which is based on a recent highly deep model trained on a large text corpus called BERT is introduced. Finally, for oracle training mode in which the generator's density can also be calculated, we propose to use the distance measures between the corresponding explicit distributions. Eventually, the most popular and recent text generation models are evaluated using both the existing and the proposed metrics and the preferences of the proposed metrics are determined.

* Accepted by NAACL 2019 workshop (NeuralGen 2019) 

  Access Paper or Ask Questions

Are You Robert or RoBERTa? Deceiving Online Authorship Attribution Models Using Neural Text Generators

Mar 18, 2022
Keenan Jones, Jason R. C. Nurse, Shujun Li

Recently, there has been a rise in the development of powerful pre-trained natural language models, including GPT-2, Grover, and XLM. These models have shown state-of-the-art capabilities towards a variety of different NLP tasks, including question answering, content summarisation, and text generation. Alongside this, there have been many studies focused on online authorship attribution (AA). That is, the use of models to identify the authors of online texts. Given the power of natural language models in generating convincing texts, this paper examines the degree to which these language models can generate texts capable of deceiving online AA models. Experimenting with both blog and Twitter data, we utilise GPT-2 language models to generate texts using the existing posts of online users. We then examine whether these AI-based text generators are capable of mimicking authorial style to such a degree that they can deceive typical AA models. From this, we find that current AI-based text generators are able to successfully mimic authorship, showing capabilities towards this on both datasets. Our findings, in turn, highlight the current capacity of powerful natural language models to generate original online posts capable of mimicking authorial style sufficiently to deceive popular AA methods; a key finding given the proposed role of AA in real world applications such as spam-detection and forensic investigation.

* 13 pages, 6 figures, 4 tables, Accepted for publication in the proceedings of the sixteenth International AAAI Conference on Web and Social Media (ICWSM-22) 

  Access Paper or Ask Questions

A Multi-Object Rectified Attention Network for Scene Text Recognition

Jan 10, 2019
Canjie Luo, Lianwen Jin, Zenghui Sun

Irregular text is widely used. However, it is considerably difficult to recognize because of its various shapes and distorted patterns. In this paper, we thus propose a multi-object rectified attention network (MORAN) for general scene text recognition. The MORAN consists of a multi-object rectification network and an attention-based sequence recognition network. The multi-object rectification network is designed for rectifying images that contain irregular text. It decreases the difficulty of recognition and enables the attention-based sequence recognition network to more easily read irregular text. It is trained in a weak supervision way, thus requiring only images and corresponding text labels. The attention-based sequence recognition network focuses on target characters and sequentially outputs the predictions. Moreover, to improve the sensitivity of the attention-based sequence recognition network, a fractional pickup method is proposed for an attention-based decoder in the training phase. With the rectification mechanism, the MORAN can read both regular and irregular scene text. Extensive experiments on various benchmarks are conducted, which show that the MORAN achieves state-of-the-art performance. The source code is available.

* 9 Tables, 9 Figures. Accepted to appear in Pattern Recognition, 2019 

  Access Paper or Ask Questions

FACLSTM: ConvLSTM with Focused Attention for Scene Text Recognition

Apr 20, 2019
Qingqing Wang, Wenjing Jia, Xiangjian He, Yue Lu, Michael Blumenstein, Ye Huang

Scene text recognition has recently been widely treated as a sequence-to-sequence prediction problem, where traditional fully-connected-LSTM (FC-LSTM) has played a critical role. Due to the limitation of FC-LSTM, existing methods have to convert 2-D feature maps into 1-D sequential feature vectors, resulting in severe damages of the valuable spatial and structural information of text images. In this paper, we argue that scene text recognition is essentially a spatiotemporal prediction problem for its 2-D image inputs, and propose a convolution LSTM (ConvLSTM)-based scene text recognizer, namely, FACLSTM, i.e., Focused Attention ConvLSTM, where the spatial correlation of pixels is fully leveraged when performing sequential prediction with LSTM. Particularly, the attention mechanism is properly incorporated into an efficient ConvLSTM structure via the convolutional operations and additional character center masks are generated to help focus attention on right feature areas. The experimental results on benchmark datasets IIIT5K, SVT and CUTE demonstrate that our proposed FACLSTM performs competitively on the regular, low-resolution and noisy text images, and outperforms the state-of-the-art approaches on the curved text with large margins.

* 9 pages 

  Access Paper or Ask Questions