Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

Replication of the Keyword Extraction part of the paper "'Without the Clutter of Unimportant Words': Descriptive Keyphrases for Text Visualization"

Aug 15, 2019
Shibamouli Lahiri

"Keyword Extraction" refers to the task of automatically identifying the most relevant and informative phrases in natural language text. As we are deluged with large amounts of text data in many different forms and content - emails, blogs, tweets, Facebook posts, academic papers, news articles - the task of "making sense" of all this text by somehow summarizing them into a coherent structure assumes paramount importance. Keyword extraction - a well-established problem in Natural Language Processing - can help us here. In this report, we construct and test three different hypotheses (all related to the task of keyword extraction) that take us one step closer to understanding how to meaningfully identify and extract "descriptive" keyphrases. The work reported here was done as part of replicating the study by Chuang et al. [3].

* 36 pages, 12 figures 

  Access Paper or Ask Questions

Cross Language Text Classification via Subspace Co-Regularized Multi-View Learning

Jun 27, 2012
Yuhong Guo, Min Xiao

In many multilingual text classification problems, the documents in different languages often share the same set of categories. To reduce the labeling cost of training a classification model for each individual language, it is important to transfer the label knowledge gained from one language to another language by conducting cross language classification. In this paper we develop a novel subspace co-regularized multi-view learning method for cross language text classification. This method is built on parallel corpora produced by machine translation. It jointly minimizes the training error of each classifier in each language while penalizing the distance between the subspace representations of parallel documents. Our empirical study on a large set of cross language text classification tasks shows the proposed method consistently outperforms a number of inductive methods, domain adaptation methods, and multi-view learning methods.

* Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012) 

  Access Paper or Ask Questions

Utilizing Large Scale Vision and Text Datasets for Image Segmentation from Referring Expressions

Aug 30, 2016
Ronghang Hu, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell

Image segmentation from referring expressions is a joint vision and language modeling task, where the input is an image and a textual expression describing a particular region in the image; and the goal is to localize and segment the specific image region based on the given expression. One major difficulty to train such language-based image segmentation systems is the lack of datasets with joint vision and text annotations. Although existing vision datasets such as MS COCO provide image captions, there are few datasets with region-level textual annotations for images, and these are often smaller in scale. In this paper, we explore how existing large scale vision-only and text-only datasets can be utilized to train models for image segmentation from referring expressions. We propose a method to address this problem, and show in experiments that our method can help this joint vision and language modeling task with vision-only and text-only data and outperforms previous results.

  Access Paper or Ask Questions

Automatic text extraction and character segmentation using maximally stable extremal regions

Aug 11, 2016
Nitigya Sambyal, Pawanesh Abrol

Text detection and segmentation is an important prerequisite for many content based image analysis tasks. The paper proposes a novel text extraction and character segmentation algorithm using Maximally Stable Extremal Regions as basic letter candidates. These regions are then subjected to thresholding and thereafter various connected components are determined to identify separate characters. The algorithm is tested along a set of various JPEG, PNG and BMP images over four different character sets; English, Russian, Hindi and Urdu. The algorithm gives good results for English and Russian character set; however character segmentation in Urdu and Hindi language is not much accurate. The algorithm is simple, efficient, involves no overhead as required in training and gives good results for even low quality images. The paper also proposes various challenges in text extraction and segmentation for multilingual inputs.

* International Journal of Modern Computer Science,Vol 4,Issue 3,pp.136-141,2016 

  Access Paper or Ask Questions

FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders

Mar 11, 2021
Pengyu Cheng, Weituo Hao, Siyang Yuan, Shijing Si, Lawrence Carin

Pretrained text encoders, such as BERT, have been applied increasingly in various natural language processing (NLP) tasks, and have recently demonstrated significant performance gains. However, recent studies have demonstrated the existence of social bias in these pretrained NLP models. Although prior works have made progress on word-level debiasing, improved sentence-level fairness of pretrained encoders still lacks exploration. In this paper, we proposed the first neural debiasing method for a pretrained sentence encoder, which transforms the pretrained encoder outputs into debiased representations via a fair filter (FairFil) network. To learn the FairFil, we introduce a contrastive learning framework that not only minimizes the correlation between filtered embeddings and bias words but also preserves rich semantic information of the original sentences. On real-world datasets, our FairFil effectively reduces the bias degree of pretrained text encoders, while continuously showing desirable performance on downstream tasks. Moreover, our post-hoc method does not require any retraining of the text encoders, further enlarging FairFil's application space.

* Accepted by the 9th International Conference on Learning Representations (ICLR 2021) 

  Access Paper or Ask Questions

STELA: A Real-Time Scene Text Detector with Learned Anchor

Sep 23, 2019
Linjie Deng, Yanxiang Gong, Xinchen Lu, Yi Lin, Zheng Ma, Mei Xie

To achieve high coverage of target boxes, a normal strategy of conventional one-stage anchor-based detectors is to utilize multiple priors at each spatial position, especially in scene text detection tasks. In this work, we present a simple and intuitive method for multi-oriented text detection where each location of feature maps only associates with one reference box. The idea is inspired from the twostage R-CNN framework that can estimate the location of objects with any shape by using learned proposals. The aim of our method is to integrate this mechanism into a onestage detector and employ the learned anchor which is obtained through a regression operation to replace the original one into the final predictions. Based on RetinaNet, our method achieves competitive performances on several public benchmarks with a totally real-time efficiency (26:5fps at 800p), which surpasses all of anchor-based scene text detectors. In addition, with less attention on anchor design, we believe our method is easy to be applied on other analogous detection tasks. The code will publicly available at

  Access Paper or Ask Questions

The role of grammar in transition-probabilities of subsequent words in English text

Dec 28, 2018
Rudolf Hanel, Stefan Thurner

Sentence formation is a highly structured, history-dependent, and sample-space reducing (SSR) process. While the first word in a sentence can be chosen from the entire vocabulary, typically, the freedom of choosing subsequent words gets more and more constrained by grammar and context, as the sentence progresses. This sample-space reducing property offers a natural explanation of Zipf's law in word frequencies, however, it fails to capture the structure of the word-to-word transition probability matrices of English text. Here we adopt the view that grammatical constraints (such as subject--predicate--object) locally re-order the word order in sentences that are sampled with a SSR word generation process. We demonstrate that superimposing grammatical structure -- as a local word re-ordering (permutation) process -- on a sample-space reducing process is sufficient to explain both, word frequencies and word-to-word transition probabilities. We compare the quality of the grammatically ordered SSR model in reproducing several test statistics of real texts with other text generation models, such as the Bernoulli model, the Simon model, and the Monkey typewriting model.

* 8 pages, 3 figures 

  Access Paper or Ask Questions

Neural Text Classification and StackedHeterogeneous Embeddings for Named Entity Recognition in SMM4H 2021

Jun 10, 2021
Usama Yaseen, Stefan Langer

This paper presents our findings from participating in the SMM4H Shared Task 2021. We addressed Named Entity Recognition (NER) and Text Classification. To address NER we explored BiLSTM-CRF with Stacked Heterogeneous Embeddings and linguistic features. We investigated various machine learning algorithms (logistic regression, Support Vector Machine (SVM) and Neural Networks) to address text classification. Our proposed approaches can be generalized to different languages and we have shown its effectiveness for English and Spanish. Our text classification submissions (team:MIC-NLP) have achieved competitive performance with F1-score of $0.46$ and $0.90$ on ADE Classification (Task 1a) and Profession Classification (Task 7a) respectively. In the case of NER, our submissions scored F1-score of $0.50$ and $0.82$ on ADE Span Detection (Task 1b) and Profession Span detection (Task 7b) respectively.

* NAACL 2021 

  Access Paper or Ask Questions

Does a Hybrid Neural Network based Feature Selection Model Improve Text Classification?

Jan 22, 2021
Suman Dowlagar, Radhika Mamidi

Text classification is a fundamental problem in the field of natural language processing. Text classification mainly focuses on giving more importance to all the relevant features that help classify the textual data. Apart from these, the text can have redundant or highly correlated features. These features increase the complexity of the classification algorithm. Thus, many dimensionality reduction methods were proposed with the traditional machine learning classifiers. The use of dimensionality reduction methods with machine learning classifiers has achieved good results. In this paper, we propose a hybrid feature selection method for obtaining relevant features by combining various filter-based feature selection methods and fastText classifier. We then present three ways of implementing a feature selection and neural network pipeline. We observed a reduction in training time when feature selection methods are used along with neural networks. We also observed a slight increase in accuracy on some datasets.

  Access Paper or Ask Questions