Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

Automatic Classification of the Complexity of Nonfiction Texts in Portuguese for Early School Years

Apr 10, 2017
Nathan Siegle Hartmann, Livia Cucatto, Danielle Brants, Sandra Aluísio

Recent research shows that most Brazilian students have serious problems regarding their reading skills. The full development of this skill is key for the academic and professional future of every citizen. Tools for classifying the complexity of reading materials for children aim to improve the quality of the model of teaching reading and text comprehension. For English, Fengs work [11] is considered the state-of-art in grade level prediction and achieved 74% of accuracy in automatically classifying 4 levels of textual complexity for close school grades. There are no classifiers for nonfiction texts for close grades in Portuguese. In this article, we propose a scheme for manual annotation of texts in 5 grade levels, which will be used for customized reading to avoid the lack of interest by students who are more advanced in reading and the blocking of those that still need to make further progress. We obtained 52% of accuracy in classifying texts into 5 levels and 74% in 3 levels. The results prove to be promising when compared to the state-of-art work.9

* Hartmann N., Cucatto L., Brants D., Alu\'isio S. (2016) Automatic Classification of the Complexity of Nonfiction Texts in Portuguese for Early School Years. In: Computational Processing of the Portuguese Language. PROPOR 2016. Springer 
* PROPOR International Conference on the Computational Processing of Portuguese, 2016, 9 pages 

  Access Paper or Ask Questions

Experimental Evaluation of Deep Learning models for Marathi Text Classification

Jan 14, 2021
Atharva Kulkarni, Meet Mandhane, Manali Likhitkar, Gayatri Kshirsagar, Jayashree Jagdale, Raviraj Joshi

The Marathi language is one of the prominent languages used in India. It is predominantly spoken by the people of Maharashtra. Over the past decade, the usage of language on online platforms has tremendously increased. However, research on Natural Language Processing (NLP) approaches for Marathi text has not received much attention. Marathi is a morphologically rich language and uses a variant of the Devanagari script in the written form. This works aims to provide a comprehensive overview of available resources and models for Marathi text classification. We evaluate CNN, LSTM, ULMFiT, and BERT based models on two publicly available Marathi text classification datasets and present a comparative analysis. The pre-trained Marathi fast text word embeddings by Facebook and IndicNLP are used in conjunction with word-based models. We show that basic single layer models based on CNN and LSTM coupled with FastText embeddings perform on par with the BERT based models on the available datasets. We hope our paper aids focused research and experiments in the area of Marathi NLP.

* Accepted at ICMISC 2021 

  Access Paper or Ask Questions

In Layman's Terms: Semi-Open Relation Extraction from Scientific Texts

May 26, 2020
Ruben Kruiper, Julian F. V. Vincent, Jessica Chen-Burger, Marc P. Y. Desmulliez, Ioannis Konstas

Information Extraction (IE) from scientific texts can be used to guide readers to the central information in scientific documents. But narrow IE systems extract only a fraction of the information captured, and Open IE systems do not perform well on the long and complex sentences encountered in scientific texts. In this work we combine the output of both types of systems to achieve Semi-Open Relation Extraction, a new task that we explore in the Biology domain. First, we present the Focused Open Biological Information Extraction (FOBIE) dataset and use FOBIE to train a state-of-the-art narrow scientific IE system to extract trade-off relations and arguments that are central to biology texts. We then run both the narrow IE system and a state-of-the-art Open IE system on a corpus of 10k open-access scientific biological texts. We show that a significant amount (65%) of erroneous and uninformative Open IE extractions can be filtered using narrow IE extractions. Furthermore, we show that the retained extractions are significantly more often informative to a reader.

* To be published in ACL 2020 conference proceedings. Updated dataset statistics, results unchanged 

  Access Paper or Ask Questions

Parallel Refinements for Lexically Constrained Text Generation with BART

Sep 26, 2021
Xingwei He

Lexically constrained text generation aims to control the generated text by incorporating some pre-specified keywords into the output. Previous work injects lexical constraints into the output by controlling the decoding process or refining the candidate output iteratively, which tends to generate generic or ungrammatical sentences, and has high computational complexity. To address these challenges, we propose Constrained BART (CBART) for lexically constrained text generation. CBART leverages the pre-trained model BART and transfers part of the generation burden from the decoder to the encoder by decomposing this task into two sub-tasks, thereby improving the sentence quality. Concretely, we extend BART by adding a token-level classifier over the encoder, aiming at instructing the decoder where to replace and insert. Guided by the encoder, the decoder refines multiple tokens of the input in one step by inserting tokens before specific positions and re-predicting tokens with low confidence. To further reduce the inference latency, the decoder predicts all tokens in parallel. Experiment results on One-Billion-Word and Yelp show that CBART can generate plausible text with high quality and diversity while significantly accelerating inference.

* Accepted to the main conference of EMNLP 2021 

  Access Paper or Ask Questions

An end-to-end Neural Network Framework for Text Clustering

Mar 22, 2019
Jie Zhou, Xingyi Cheng, Jinchao Zhang

The unsupervised text clustering is one of the major tasks in natural language processing (NLP) and remains a difficult and complex problem. Conventional \mbox{methods} generally treat this task using separated steps, including text representation learning and clustering the representations. As an improvement, neural methods have also been introduced for continuous representation learning to address the sparsity problem. However, the multi-step process still deviates from the unified optimization target. Especially the second step of cluster is generally performed with conventional methods such as k-Means. We propose a pure neural framework for text clustering in an end-to-end manner. It jointly learns the text representation and the clustering model. Our model works well when the context can be obtained, which is nearly always the case in the field of NLP. We have our method \mbox{evaluated} on two widely used benchmarks: IMDB movie reviews for sentiment classification and $20$-Newsgroup for topic categorization. Despite its simplicity, experiments show the model outperforms previous clustering methods by a large margin. Furthermore, the model is also verified on English wiki dataset as a large corpus.

  Access Paper or Ask Questions

Deep Scene Text Detection with Connected Component Proposals

Aug 17, 2017
Fan Jiang, Zhihui Hao, Xinran Liu

A growing demand for natural-scene text detection has been witnessed by the computer vision community since text information plays a significant role in scene understanding and image indexing. Deep neural networks are being used due to their strong capabilities of pixel-wise classification or word localization, similar to being used in common vision problems. In this paper, we present a novel two-task network with integrating bottom and top cues. The first task aims to predict a pixel-by-pixel labeling and based on which, word proposals are generated with a canonical connected component analysis. The second task aims to output a bundle of character candidates used later to verify the word proposals. The two sub-networks share base convolutional features and moreover, we present a new loss to strengthen the interaction between them. We evaluate the proposed network on public benchmark datasets and show it can detect arbitrary-orientation scene text with a finer output boundary. In ICDAR 2013 text localization task, we achieve the state-of-the-art performance with an F-score of 0.919 and a much better recall of 0.915.

* 10 pages, 5 figures 

  Access Paper or Ask Questions

GroupViT: Semantic Segmentation Emerges from Text Supervision

Feb 22, 2022
Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang

Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 51.2% mIoU on the PASCAL VOC 2012 and 22.3% mIoU on PASCAL Context datasets, and performs competitively to state-of-the-art transfer-learning methods requiring greater levels of supervision. Project page is available at

* Project page: 

  Access Paper or Ask Questions

Recurrent Neural Network based Part-of-Speech Tagger for Code-Mixed Social Media Text

Nov 16, 2016
Raj Nath Patel, Prakash B. Pimpale, M Sasikumar

This paper describes Centre for Development of Advanced Computing's (CDACM) submission to the shared task-'Tool Contest on POS tagging for Code-Mixed Indian Social Media (Facebook, Twitter, and Whatsapp) Text', collocated with ICON-2016. The shared task was to predict Part of Speech (POS) tag at word level for a given text. The code-mixed text is generated mostly on social media by multilingual users. The presence of the multilingual words, transliterations, and spelling variations make such content linguistically complex. In this paper, we propose an approach to POS tag code-mixed social media text using Recurrent Neural Network Language Model (RNN-LM) architecture. We submitted the results for Hindi-English (hi-en), Bengali-English (bn-en), and Telugu-English (te-en) code-mixed data.

* In Proceedings of the Tool Contest on POS tagging for Indian Social Media Text, ICON 2016 
* 7 pages, Published at the Tool Contest on POS tagging for Indian Social Media Text, ICON 2016 

  Access Paper or Ask Questions