Get our free extension to see links to code for papers anywhere online!

# "Text Classification": models, code, and papers

## Text Guide: Improving the quality of long text classification by a text selection method based on feature importance

The performance of text classification methods has improved greatly over the last decade for text instances of less than 512 tokens. This limit has been adopted by most state-of-the-research transformer models due to the high computational cost of analyzing longer text instances. To mitigate this problem and to improve classification for longer texts, researchers have sought to resolve the underlying causes of the computational cost and have proposed optimizations for the attention mechanism, which is the key element of every transformer model. In our study, we are not pursuing the ultimate goal of long text classification, i.e., the ability to analyze entire text instances at one time while preserving high performance at a reasonable computational cost. Instead, we propose a text truncation method called Text Guide, in which the original text length is reduced to a predefined limit in a manner that improves performance over naive and semi-naive approaches while preserving low computational costs. Text Guide benefits from the concept of feature importance, a notion from the explainable artificial intelligence domain. We demonstrate that Text Guide can be used to improve the performance of recent language models specifically designed for long text classification, such as Longformer. Moreover, we discovered that parameter optimization is the key to Text Guide performance and must be conducted before the method is deployed. Future experiments may reveal additional benefits provided by this new method.

## Text Classification using Association Rule with a Hybrid Concept of Naive Bayes Classifier and Genetic Algorithm

Sep 25, 2010
S. M. Kamruzzaman, Farhana Haider, Ahmed Ryadh Hasan

Text classification is the automated assignment of natural language texts to predefined categories based on their content. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and text understanding systems, which transform text in some way such as producing summaries, answering questions or extracting data. Now a day the demand of text classification is increasing tremendously. Keeping this demand into consideration, new and updated techniques are being developed for the purpose of automated text classification. This paper presents a new algorithm for text classification. Instead of using words, word relation i.e. association rules is used to derive feature set from pre-classified text documents. The concept of Naive Bayes Classifier is then used on derived features and finally a concept of Genetic Algorithm has been added for final classification. A system based on the proposed algorithm has been implemented and tested. The experimental results show that the proposed system works as a successful text classifier.

* Proc. 7th International Conference on Computer and Information Technology (ICCIT-2004), Dhaka, Bangladesh, pp. 682-687, Dec. 2004
* 6 Pages, International Conference

## A Hierarchical End-to-End Model for Jointly Improving Text Summarization and Sentiment Classification

May 30, 2018
Shuming Ma, Xu Sun, Junyang Lin, Xuancheng Ren

Text summarization and sentiment classification both aim to capture the main ideas of the text but at different levels. Text summarization is to describe the text within a few sentences, while sentiment classification can be regarded as a special type of summarization which "summarizes" the text into a even more abstract fashion, i.e., a sentiment class. Based on this idea, we propose a hierarchical end-to-end model for joint learning of text summarization and sentiment classification, where the sentiment classification label is treated as the further "summarization" of the text summarization output. Hence, the sentiment classification layer is put upon the text summarization layer, and a hierarchical structure is derived. Experimental results on Amazon online reviews datasets show that our model achieves better performance than the strong baseline systems on both abstractive summarization and sentiment classification.

* accepted by IJCAI-18

## A survey on phrase structure learning methods for text classification

Jun 21, 2014

Text classification is a task of automatic classification of text into one of the predefined categories. The problem of text classification has been widely studied in different communities like natural language processing, data mining and information retrieval. Text classification is an important constituent in many information management tasks like topic identification, spam filtering, email routing, language identification, genre classification, readability assessment etc. The performance of text classification improves notably when phrase patterns are used. The use of phrase patterns helps in capturing non-local behaviours and thus helps in the improvement of text classification task. Phrase structure extraction is the first step to continue with the phrase pattern identification. In this survey, detailed study of phrase structure learning methods have been carried out. This will enable future work in several NLP tasks, which uses syntactic information from phrase structure like grammar checkers, question answering, information extraction, machine translation, text classification. The paper also provides different levels of classification and detailed comparison of the phrase structure learning methods.

* 14 pages, 2 figures, 2 tables, International Journal on Natural Language Computing (IJNLC) Vol. 3, No.2, April 2014

## Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation: An Application to Hate-Speech Detection

Classification of personal text messages has many useful applications in surveillance, e-commerce, and mental health care, to name a few. Giving applications access to personal texts can easily lead to (un)intentional privacy violations. We propose the first privacy-preserving solution for text classification that is provably secure. Our method, which is based on Secure Multiparty Computation (SMC), encompasses both feature extraction from texts, and subsequent classification with logistic regression and tree ensembles. We prove that when using our secure text classification method, the application does not learn anything about the text, and the author of the text does not learn anything about the text classification model used by the application beyond what is given by the classification result itself. We perform end-to-end experiments with an application for detecting hate speech against women and immigrants, demonstrating excellent runtime results without loss of accuracy.

## Description Based Text Classification with Reinforcement Learning

Feb 08, 2020
Duo Chai, Wei Wu, Qinghong Han, Fei Wu, Jiwei Li

The task of text classification is usually divided into two stages: {\it text feature extraction} and {\it classification}. In this standard formalization categories are merely represented as indexes in the label vocabulary, and the model lacks for explicit instructions on what to classify. Inspired by the current trend of formalizing NLP problems as question answering tasks, we propose a new framework for text classification, in which each category label is associated with a category description. Descriptions are generated by hand-crafted templates or using abstractive/extractive models from reinforcement learning. The concatenation of the description and the text is fed to the classifier to decide whether or not the current label should be assigned to the text. The proposed strategy forces the model to attend to the most salient texts with respect to the label, which can be regarded as a hard version of attention, leading to better performances. We observe significant performance boosts over strong baselines on a wide range of text classification tasks including single-label classification, multi-label classification and multi-aspect sentiment analysis.

## Graph Convolutional Networks for Text Classification

Oct 17, 2018
Liang Yao, Chengsheng Mao, Yuan Luo

Text Classification is an important and classical problem in natural language processing. There have been a number of studies that applied convolutional neural networks (convolution on regular grid, e.g., sequence) to classification. However, only a limited number of studies have explored the more flexible graph convolutional neural networks (convolution on non-grid, e.g., arbitrary graph) for the task. In this work, we propose to use graph convolutional networks for text classification. We build a single text graph for a corpus based on word co-occurrence and document word relations, then learn a Text Graph Convolutional Network (Text GCN) for the corpus. Our Text GCN is initialized with one-hot representation for word and document, it then jointly learns the embeddings for both words and documents, as supervised by the known class labels for documents. Our experimental results on multiple benchmark datasets demonstrate that a vanilla Text GCN without any external word embeddings or knowledge outperforms state-of-the-art methods for text classification. On the other hand, Text GCN also learns predictive word and document embeddings. In addition, experimental results show that the improvement of Text GCN over state-of-the-art comparison methods become more prominent as we lower the percentage of training data, suggesting the robustness of Text GCN to less training data in text classification.

## An Improved Classification Model for Igbo Text Using N-Gram And K-Nearest Neighbour Approaches

Apr 01, 2020
Nkechi Ifeanyi-Reuben, Chidiebere Ugwu

This paper presents an improved classification model for Igbo text using N-gram and K-Nearest Neighbour approaches. The N-gram model was used for text representation and the classification was carried out on the text using the K-Nearest Neighbour model. Object-Oriented design methodology is used for the work and is implemented with the Python programming language with tools from Natural Language Toolkit (NLTK). The performance of the Igbo text classification system is measured by computing the precision, recall and F1-measure of the result obtained on Unigram, Bigram and Trigram represented text. The Igbo text classification on bigram represented text has highest degree of exactness (precision); result obtained with three N-gram models has the same level of completeness (recall) while trigram has the lowest level of precision. This shows that the classification on bigram Igbo represented text outperforms unigram and trigram represented texts. Therefore, bigram text representation model is highly recommended for any intelligent text-based system in Igbo language.

## Text-based classification of interviews for mental health -- juxtaposing the state of the art

Jul 29, 2020
Joppe Valentijn Wouts

Currently, the state of the art for classification of psychiatric illness is based on audio-based classification. This thesis aims to design and evaluate a state of the art text classification network on this challenge. The hypothesis is that a well designed text-based approach poses a strong competition against the state-of-the-art audio based approaches. Dutch natural language models are being limited by the scarcity of pre-trained monolingual NLP models, as a result Dutch natural language models have a low capture of long range semantic dependencies over sentences. For this issue, this thesis presents belabBERT, a new Dutch language model extending the RoBERTa[15] architecture. belabBERT is trained on a large Dutch corpus (+32GB) of web crawled texts. After this thesis evaluates the strength of text-based classification, a brief exploration is done, extending the framework to a hybrid text- and audio-based classification. The goal of this hybrid framework is to show the principle of hybridisation with a very basic audio-classification network. The overall goal is to create the foundations for a hybrid psychiatric illness classification, by proving that the new text-based classification is already a strong stand-alone solution.

* 33 pages, 7 figures, belabBERT is available on http://github.com/Joppewouts/belabBERT

## Enhancement of Short Text Clustering by Iterative Classification

Short text clustering is a challenging task due to the lack of signal contained in such short texts. In this work, we propose iterative classification as a method to b o ost the clustering quality (e.g., accuracy) of short texts. Given a clustering of short texts obtained using an arbitrary clustering algorithm, iterative classification applies outlier removal to obtain outlier-free clusters. Then it trains a classification algorithm using the non-outliers based on their cluster distributions. Using the trained classification model, iterative classification reclassifies the outliers to obtain a new set of clusters. By repeating this several times, we obtain a much improved clustering of texts. Our experimental results show that the proposed clustering enhancement method not only improves the clustering quality of different clustering methods (e.g., k-means, k-means--, and hierarchical clustering) but also outperforms the state-of-the-art short text clustering methods on several short text datasets by a statistically significant margin.

* 30 pages, 2 figures

## Explicit Interaction Model towards Text Classification

Nov 23, 2018
Cunxiao Du, Zhaozheng Chin, Fuli Feng, Lei Zhu, Tian Gan, Liqiang Nie

Text classification is one of the fundamental tasks in natural language processing. Recently, deep neural networks have achieved promising performance in the text classification task compared to shallow models. Despite of the significance of deep models, they ignore the fine-grained (matching signals between words and classes) classification clues since their classifications mainly rely on the text-level representations. To address this problem, we introduce the interaction mechanism to incorporate word-level matching signals into the text classification task. In particular, we design a novel framework, EXplicit interAction Model (dubbed as EXAM), equipped with the interaction mechanism. We justified the proposed approach on several benchmark datasets including both multi-label and multi-class text classification tasks. Extensive experimental results demonstrate the superiority of the proposed method. As a byproduct, we have released the codes and parameter settings to facilitate other researches.

* AAAI 2019
* 8 pages

## CSTR: A Classification Perspective on Scene Text Recognition

Feb 22, 2021
Hongxiang Cai, Jun Sun, Yichao Xiong

The prevalent perspectives of scene text recognition are from sequence to sequence (seq2seq) and segmentation. In this paper, we propose a new perspective on scene text recognition, in which we model the scene text recognition as an image classification problem. Based on the image classification perspective, a scene text recognition model is proposed, which is named as CSTR. The CSTR model consists of a series of convolutional layers and a global average pooling layer at the end, followed by independent multi-class classification heads, each of which predicts the corresponding character of the word sequence in input image. The CSTR model is easy to train using parallel cross entropy losses. CSTR is as simple as image classification models like ResNet \cite{he2016deep} which makes it easy to implement, and the fully convolutional neural network architecture makes it efficient to train and deploy. We demonstrate the effectiveness of the classification perspective on scene text recognition with thorough experiments. Futhermore, CSTR achieves nearly state-of-the-art performance on six public benchmarks including regular text, irregular text. The code will be available at https://github.com/Media-Smart/vedastr.

* 10 pages, 4 figures

## Hierarchical Text Classification of Urdu News using Deep Neural Network

Jul 07, 2021

Digital text is increasing day by day on the internet. It is very challenging to classify a large and heterogeneous collection of data, which require improved information processing methods to organize text. To classify large size of corpus, one common approach is to use hierarchical text classification, which aims to classify textual data in a hierarchical structure. Several approaches have been proposed to tackle classification of text but most of the research has been done on English language. This paper proposes a deep learning model for hierarchical text classification of news in Urdu language - consisting of 51,325 sentences from 8 online news websites belonging to the following genres: Sports; Technology; and Entertainment. The objectives of this paper are twofold: (1) to develop a large human-annotated dataset of news in Urdu language for hierarchical text classification; and (2) to classify Urdu news hierarchically using our proposed model based on LSTM mechanism named as Hierarchical Multi-layer LSTMs (HMLSTM). Our model consists of two modules: Text Representing Layer, for obtaining text representation in which we use Word2vec embedding to transform the words to vector and Urdu Hierarchical LSTM Layer (UHLSTML) an end-to-end fully connected deep LSTMs network to perform automatic feature learning, we train one LSTM layer for each level of the class hierarchy. We have performed extensive experiments on our self created dataset named as Urdu News Dataset for Hierarchical Text Classification (UNDHTC). The result shows that our proposed method is very effective for hierarchical text classification and it outperforms baseline methods significantly and also achieved good results as compare to deep neural model.

* 22 pages with 16 figures

## Text classification with word embedding regularization and soft similarity measure

Mar 10, 2020
Vít Novotný, Eniafe Festus Ayetiran, Michal Štefánik, Petr Sojka

Since the seminal work of Mikolov et al., word embeddings have become the preferred word representations for many natural language processing tasks. Document similarity measures extracted from word embeddings, such as the soft cosine measure (SCM) and the Word Mover's Distance (WMD), were reported to achieve state-of-the-art performance on semantic text similarity and text classification. Despite the strong performance of the WMD on text classification and semantic text similarity, its super-cubic average time complexity is impractical. The SCM has quadratic worst-case time complexity, but its performance on text classification has never been compared with the WMD. Recently, two word embedding regularization techniques were shown to reduce storage and memory costs, and to improve training speed, document processing speed, and task performance on word analogy, word similarity, and semantic text similarity. However, the effect of these techniques on text classification has not yet been studied. In our work, we investigate the individual and joint effect of the two word embedding regularization techniques on the document processing speed and the task performance of the SCM and the WMD on text classification. For evaluation, we use the $k$NN classifier and six standard datasets: BBCSPORT, TWITTER, OHSUMED, REUTERS-21578, AMAZON, and 20NEWS. We show 39% average $k$NN test error reduction with regularized word embeddings compared to non-regularized word embeddings. We describe a practical procedure for deriving such regularized embeddings through Cholesky factorization. We also show that the SCM with regularized word embeddings significantly outperforms the WMD on text classification and is over 10,000 times faster.

## Dual-State Capsule Networks for Text Classification

Sep 10, 2021
Piyumal Demotte, Surangika Ranathunga

Text classification systems based on contextual embeddings are not viable options for many of the low resource languages. On the other hand, recently introduced capsule networks have shown performance in par with these text classification models. Thus, they could be considered as a viable alternative for text classification for languages that do not have pre-trained contextual embedding models. However, current capsule networks depend upon spatial patterns without considering the sequential features of the text. They are also sub-optimal in capturing the context-level information in longer sequences. This paper presents a novel Dual-State Capsule (DS-Caps) network-based technique for text classification, which is optimized to mitigate these issues. Two varieties of states, namely sentence-level and word-level, are integrated with capsule layers to capture deeper context-level information for language modeling. The dynamic routing process among capsules was also optimized using the context-level information obtained through sentence-level states. The DS-Caps networks outperform the existing capsule network architectures for multiple datasets, particularly for tasks with longer sequences of text. We also demonstrate the superiority of DS-Caps in text classification for a low resource language.

* 9 pages

## A hybrid learning algorithm for text classification

Sep 23, 2010
S. M. Kamruzzaman, Farhana Haider

Text classification is the process of classifying documents into predefined categories based on their content. Existing supervised learning algorithms to automatically classify text need sufficient documents to learn accurately. This paper presents a new algorithm for text classification that requires fewer documents for training. Instead of using words, word relation i.e association rules from these words is used to derive feature set from preclassified text documents. The concept of Naive Bayes classifier is then used on derived features and finally only a single concept of Genetic Algorithm has been added for final classification. Experimental results show that the classifier build this way is more accurate than the existing text classification systems.

* Proc. 3rd International Conference on Electrical & Computer Engineering (ICECE 2004), Dhaka Bangladesh, pp. 577-580, Dec. 2004
* 4 pages, International Conference

## The Authors Matter: Understanding and Mitigating Implicit Bias in Deep Text Classification

May 06, 2021
Haochen Liu, Wei Jin, Hamid Karimi, Zitao Liu, Jiliang Tang

It is evident that deep text classification models trained on human data could be biased. In particular, they produce biased outcomes for texts that explicitly include identity terms of certain demographic groups. We refer to this type of bias as explicit bias, which has been extensively studied. However, deep text classification models can also produce biased outcomes for texts written by authors of certain demographic groups. We refer to such bias as implicit bias of which we still have a rather limited understanding. In this paper, we first demonstrate that implicit bias exists in different text classification tasks for different demographic groups. Then, we build a learning-based interpretation method to deepen our knowledge of implicit bias. Specifically, we verify that classifiers learn to make predictions based on language features that are related to the demographic attributes of the authors. Next, we propose a framework Debiased-TC to train deep text classifiers to make predictions on the right features and consequently mitigate implicit bias. We conduct extensive experiments on three real-world datasets. The results show that the text classification models trained under our proposed framework outperform traditional models significantly in terms of fairness, and also slightly in terms of classification performance.

* Accepted by Findings of ACL 2021

## A Framework for Explainable Text Classification in Legal Document Review

Companies regularly spend millions of dollars producing electronically-stored documents in legal matters. Recently, parties on both sides of the 'legal aisle' are accepting the use of machine learning techniques like text classification to cull massive volumes of data and to identify responsive documents for use in these matters. While text classification is regularly used to reduce the discovery costs in legal matters, it also faces a peculiar perception challenge: amongst lawyers, this technology is sometimes looked upon as a "black box", little information provided for attorneys to understand why documents are classified as responsive. In recent years, a group of AI and ML researchers have been actively researching Explainable AI, in which actions or decisions are human understandable. In legal document review scenarios, a document can be identified as responsive, if one or more of its text snippets are deemed responsive. In these scenarios, if text classification can be used to locate these snippets, then attorneys could easily evaluate the model's classification decision. When deployed with defined and explainable results, text classification can drastically enhance overall quality and speed of the review process by reducing the review time. Moreover, explainable predictive coding provides lawyers with greater confidence in the results of that supervised learning task. This paper describes a framework for explainable text classification as a valuable tool in legal services: for enhancing the quality and efficiency of legal document review and for assisting in locating responsive snippets within responsive documents. This framework has been implemented in our legal analytics product, which has been used in hundreds of legal matters. We also report our experimental results using the data from an actual legal matter that used this type of document review.

* 2019 IEEE International Conference on Big Data (Big Data). arXiv admin note: text overlap with arXiv:1904.01721

## TextZoo, a New Benchmark for Reconsidering Text Classification

Mar 19, 2018
Benyou Wang, Li Wang, Qikang Wei, Lichun Liu

Text representation is a fundamental concern in Natural Language Processing, especially in text classification. Recently, many neural network approaches with delicate representation model (e.g. FASTTEXT, CNN, RNN and many hybrid models with attention mechanisms) claimed that they achieved state-of-art in specific text classification datasets. However, it lacks an unified benchmark to compare these models and reveals the advantage of each sub-components for various settings. We re-implement more than 20 popular text representation models for classification in more than 10 datasets. In this paper, we reconsider the text classification task in the perspective of neural network and get serval effects with analysis of the above results.

* a benchmark need to be completed

## Text Classification using the Concept of Association Rule of Data Mining

As the amount of online text increases, the demand for text classification to aid the analysis and management of text is increasing. Text is cheap, but information, in the form of knowing what classes a text belongs to, is expensive. Automatic classification of text can provide this information at low cost, but the classifiers themselves must be built with expensive human effort, or trained from texts which have themselves been manually classified. In this paper we will discuss a procedure of classifying text using the concept of association rule of data mining. Association rule mining technique has been used to derive feature set from pre-classified text documents. Naive Bayes classifier is then used on derived features for final classification.

* Proc. International Conference on Information Technology, Kathmandu, Nepal, pp. 234-241, May. 2003
* 8 Pages, International Conference