"Text": models, code, and papers

Adversarial Feature Matching for Text Generation

Nov 18, 2017
Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, Lawrence Carin

The Generative Adversarial Network (GAN) has achieved great success in generating realistic (real-valued) synthetic data. However, convergence issues and difficulties dealing with discrete data hinder the applicability of GAN to text. We propose a framework for generating realistic text via adversarial training. We employ a long short-term memory network as generator, and a convolutional network as discriminator. Instead of using the standard objective of GAN, we propose matching the high-dimensional latent feature distributions of real and synthetic sentences, via a kernelized discrepancy metric. This eases adversarial training by alleviating the mode-collapsing problem. Our experiments show superior performance in quantitative evaluation, and demonstrate that our model can generate realistic-looking sentences.

* Accepted by ICML 2017 
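
The feature-matching idea in this abstract can be illustrated with maximum mean discrepancy (MMD), one common kernelized discrepancy between two sets of feature vectors. The sketch below is only an illustration under stated assumptions: the abstract does not say which kernel or estimator is used, so the Gaussian kernel, the `bandwidth` value, the biased estimator, and the toy feature vectors are all hypothetical choices.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(real_feats, fake_feats, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    return (
        mean_kernel(real_feats, real_feats, bandwidth)
        + mean_kernel(fake_feats, fake_feats, bandwidth)
        - 2.0 * mean_kernel(real_feats, fake_feats, bandwidth)
    )


# Toy 2-D "latent features" (made up for illustration only).
real = [[0.0, 0.1], [0.2, -0.1], [0.1, 0.0]]
close = [[0.05, 0.05], [0.15, -0.05], [0.1, 0.1]]
far = [[2.0, 2.1], [2.2, 1.9], [2.1, 2.0]]

print("MMD^2 real vs. close:", mmd_squared(real, close))
print("MMD^2 real vs. far:  ", mmd_squared(real, far))
```

A generator trained to reduce such a discrepancy is pushed to match the whole distribution of real features rather than to fool a single real-versus-fake decision, which is the intuition behind the reduced mode collapse the abstract reports.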

Text recognition in both ancient and cartographic documents

Aug 28, 2013
Nizar Zaghden, Badreddine Khelifi, Adel M. Alimi, Remy Mullot

This paper deals with the recognition and matching of text in both cartographic maps and ancient documents. The purpose of this work is to find similar text regions based on statistical and global features. A normalization phase is performed first, in order to properly categorize the same quantity of information. A wordspotting phase follows, combining local and global features. We run several experiments that combine different feature-extraction techniques in order to obtain better results in the recognition phase. We applied fontspotting to both ancient and cartographic documents. We also applied wordspotting, for which we adopted a new technique that compares images of individual characters rather than images of entire words. We report the precision and recall obtained with three methods for this new character-level wordspotting approach.

* 4 pages 

Logical Activation Functions: Logit-space equivalents of Boolean Operators

Oct 22, 2021
Scott C. Lowe, Robert Earle, Jason d'Eon, Thomas Trappenberg, Sageev Oore

Neuronal representations within artificial neural networks are commonly understood as logits, representing the log-odds score of presence (versus absence) of features within the stimulus. Under this interpretation, we can derive the probability $P(x_0 \land x_1)$ that a pair of independent features are both present in the stimulus from their logits. By converting the resulting probability back into a logit, we obtain a logit-space equivalent of the AND operation. However, since this function involves taking multiple exponents and logarithms, it is not well suited to be directly used within neural networks. We thus constructed an efficient approximation named $\text{AND}_\text{AIL}$ (the AND operator Approximate for Independent Logits) utilizing only comparison and addition operations, which can be deployed as an activation function in neural networks. Like MaxOut, $\text{AND}_\text{AIL}$ is a generalization of ReLU to two dimensions. Additionally, we constructed efficient approximations of the logit-space equivalents to the OR and XNOR operators. We deployed these new activation functions, both in isolation and in conjunction, and demonstrated their effectiveness on a variety of tasks including image classification, transfer learning, abstract reasoning, and compositional zero-shot learning.
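
The exact logit-space AND described above follows directly from the abstract: convert each logit to a probability, multiply the two (independence), and convert the product back to a logit. The sketch below shows only that exact form, not the $\text{AND}_\text{AIL}$ approximation, whose definition is not given here. The function names and the log-space formulation are assumptions made for numerical stability.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def logit_and(x0, x1):
    """Logit of P(x0 AND x1) for two independent features given as logits."""
    # log P(both present) = log sigmoid(x0) + log sigmoid(x1)
    log_p = log_sigmoid(x0) + log_sigmoid(x1)
    if log_p == 0.0:
        # Both logits are so large that p rounds to exactly 1.
        return math.inf
    # logit(p) = log p - log(1 - p), with 1 - p = -expm1(log p)
    return log_p - math.log(-math.expm1(log_p))


# Two features that are each 50% likely are jointly 25% likely,
# so the result equals logit(0.25) = log(1/3).
print(logit_and(0.0, 0.0))
```

The exponentials and logarithms in `logit_and` are the cost the abstract refers to when it motivates an approximation built from comparisons and additions only.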

Membership Inference on Word Embedding and Beyond

Jun 21, 2021
Saeed Mahloujifar, Huseyin A. Inan, Melissa Chase, Esha Ghosh, Marcello Hasegawa

In the text processing context, most ML models are built on word embeddings. These embeddings are themselves trained on some datasets, potentially containing sensitive data. In some cases this training is done independently; in other cases, it occurs as part of training a larger, task-specific model. In either case, it is of interest to consider membership inference attacks based on the embedding layer as a way of understanding sensitive information leakage. But, somewhat surprisingly, membership inference attacks on word embeddings, and their effect on other natural language processing (NLP) tasks that use these embeddings, have remained relatively unexplored. In this work, we show that word embeddings are vulnerable to black-box membership inference attacks under realistic assumptions. Furthermore, we show that this leakage persists through two other major NLP applications, classification and text generation, even when the embedding layer is not exposed to the attacker. We show that our MI attack achieves high attack accuracy against a classifier model and an LSTM-based language model. Indeed, our attack is a cheaper membership inference attack on text-generative models, which does not require knowledge of the target model or any expensive training of text-generative models as shadow models.

OSACT4 Shared Task on Offensive Language Detection: Intensive Preprocessing-Based Approach

May 14, 2020
Fatemah Husain

The preprocessing phase is one of the key phases within the text classification pipeline. This study investigates the impact of the preprocessing phase on text classification, specifically on offensive language and hate speech classification for Arabic text. The Arabic language used in social media is informal and written using Arabic dialects, which makes the text classification task very complex. Preprocessing helps with dimensionality reduction and with removing useless content. We apply intensive preprocessing techniques to the dataset before processing it further and feeding it into the classification model. The intensive preprocessing-based approach has a significant impact on the offensive language detection and hate speech detection shared tasks of the fourth workshop on Open-Source Arabic Corpora and Corpora Processing Tools (OSACT). Our team won third place in Sub-Task A (Offensive Language Detection) and first place in Sub-Task B (Hate Speech Detection), with F1 scores of 89% and 95%, respectively, achieving state-of-the-art performance in terms of F1, accuracy, recall, and precision for Arabic hate speech detection.

* Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France (2020) 

Parser Extraction of Triples in Unstructured Text

Nov 06, 2018
Shaun D'Souza

The web contains vast repositories of unstructured text. We investigate the opportunity for building a knowledge graph from these text sources. We generate a set of triples which can be used in knowledge gathering and integration. We define the architecture of a language compiler for processing subject-predicate-object triples using the OpenNLP parser. We implement a depth-first search traversal on the POS-tagged syntactic tree, appending predicate and object information. A parser enables higher-precision and higher-recall extraction of syntactic relationships across conjunction boundaries. We obtain 2 to 2.5 times as many correct extractions as ReVerb. The extractions are used in a variety of semantic web applications and in question answering. We verify extraction of 50,000 triples on the ClueWeb dataset.

* IAES International Journal of Artificial Intelligence (IJ-AI), 5(4):143-148, 2017 
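
The depth-first traversal described in this abstract can be sketched on a small hand-written parse tree. Everything here is an illustrative assumption rather than the paper's implementation: the tree is written by hand instead of produced by OpenNLP, the nested-tuple representation and the function names are hypothetical, and the extraction rule (first noun phrase of a sentence as subject, verb plus noun phrase inside a verb phrase as predicate and object) is a deliberately simple stand-in.

```python
def is_leaf(node):
    """A leaf is a (POS tag, word) pair."""
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node):
    """Collect the words under a node, left to right."""
    if is_leaf(node):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def extract_triples(tree):
    """Depth-first walk that emits (subject, predicate, object) triples."""
    triples = []

    def visit(node, subject):
        if is_leaf(node):
            return
        label, children = node[0], node[1:]
        if label == "S":
            # Take the first noun phrase of the sentence as the subject.
            for child in children:
                if child[0] == "NP":
                    subject = " ".join(leaves(child))
                    break
        if label == "VP" and subject is not None:
            verbs = [c[1] for c in children if is_leaf(c) and c[0].startswith("VB")]
            objects = [" ".join(leaves(c)) for c in children if c[0] == "NP"]
            for verb in verbs:
                for obj in objects:
                    triples.append((subject, verb, obj))
        # The subject is carried down, so coordinated verb phrases share it.
        for child in children:
            visit(child, subject)

    visit(tree, None)
    return triples


# Hand-written parse of "The parser reads text and extracts triples".
tree = (
    "S",
    ("NP", ("DT", "The"), ("NN", "parser")),
    ("VP",
     ("VP", ("VBZ", "reads"), ("NP", ("NN", "text"))),
     ("CC", "and"),
     ("VP", ("VBZ", "extracts"), ("NP", ("NNS", "triples")))),
)

for triple in extract_triples(tree):
    print(triple)
```

Carrying the subject down through the recursion is what lets both coordinated verb phrases yield a triple, which is one way to read the abstract's claim about extraction across conjunction boundaries.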

CausalNLP: A Practical Toolkit for Causal Inference with Text

Jun 21, 2021
Arun S. Maiya

The vast majority of existing methods and systems for causal inference assume that all variables under consideration are categorical or numerical (e.g., gender, price, blood pressure, enrollment). In this paper, we present CausalNLP, a toolkit for inferring causality from observational data that includes text in addition to traditional numerical and categorical variables. CausalNLP uses meta-learners for treatment effect estimation and supports using raw text and its linguistic properties as both a treatment and a "controlled-for" variable (e.g., confounder). The library is open-source and available at:

* 7 pages 
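
As a rough illustration of what a meta-learner does, the sketch below implements a T-learner, one standard meta-learner: fit one outcome model on treated units, another on control units, and average the difference of their predictions. It does not use or reproduce the CausalNLP API. The function names, the one-variable least-squares model, and the noise-free toy data (constructed so that the true effect is 3) are all assumptions for illustration; the single covariate stands in for any numeric feature, such as one derived from text.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return mean_y - slope * mean_x, slope


def t_learner_ate(covariates, treatments, outcomes):
    """Average treatment effect from two separately fitted outcome models."""
    rows = list(zip(covariates, treatments, outcomes))
    treated = [(x, y) for x, t, y in rows if t == 1]
    control = [(x, y) for x, t, y in rows if t == 0]
    a1, b1 = fit_line([x for x, _ in treated], [y for _, y in treated])
    a0, b0 = fit_line([x for x, _ in control], [y for _, y in control])
    # Predict both potential outcomes for every unit and average the gap.
    effects = [(a1 + b1 * x) - (a0 + b0 * x) for x in covariates]
    return sum(effects) / len(effects)


# Synthetic data generated as y = 1 + 2 * x + 3 * t (true effect: 3).
covariates = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
treatments = [0, 1, 0, 1, 0, 1]
outcomes = [1 + 2 * x + 3 * t for x, t in zip(covariates, treatments)]

print(t_learner_ate(covariates, treatments, outcomes))
```

In this toy setup the treated units have larger covariate values on average, so a naive difference in mean outcomes would overstate the effect; conditioning on the covariate through the two fitted models is what recovers it.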

Generative Adversarial Network for Abstractive Text Summarization

Nov 26, 2017
Linqing Liu, Yao Lu, Min Yang, Qiang Qu, Jia Zhu, Hongyan Li

In this paper, we propose an adversarial process for abstractive text summarization, in which we simultaneously train a generative model G and a discriminative model D. In particular, we build the generator G as a reinforcement learning agent, which takes the raw text as input and predicts the abstractive summary. We also build a discriminator which attempts to distinguish the generated summary from the ground truth summary. Extensive experiments demonstrate that our model achieves ROUGE scores competitive with state-of-the-art methods on the CNN/Daily Mail dataset. Qualitatively, we show that our model is able to generate more abstractive, readable and diverse summaries.

* AAAI 2018 abstract, Supplemental material: 

Parallel Texts in the Hebrew Bible, New Methods and Visualizations

Mar 04, 2016
Martijn Naaijer, Dirk Roorda

In this article we develop an algorithm to detect parallel texts in the Masoretic Text (MT) of the Hebrew Bible. The results are presented online, and chapters in the Hebrew Bible containing parallel passages can be inspected synoptically. Differences between parallel passages are highlighted. In a similar way, the MT of Isaiah is presented synoptically with 1QIsaa. We also examine how the degree of similarity between parallel passages can be assessed, using a case study of 2 Kings 19-25 and its parallels in Isaiah, Jeremiah and 2 Chronicles.

* 15 pages, 5 figures 
