"Text Classification": models, code, and papers

Arabic Language Text Classification Using Dependency Syntax-Based Feature Selection

Oct 17, 2014
Yannis Haralambous, Yassir Elidrissi, Philippe Lenca

We study the performance of Arabic text classification combining various techniques: (a) tfidf vs. dependency syntax, for feature selection and weighting; (b) class association rules vs. support vector machines, for classification. The Arabic text is used in two forms: rootified and lightly stemmed. The results we obtain show that lightly stemmed text leads to better performance than rootified text; that class association rules are better suited for small feature sets obtained by dependency syntax constraints; and, finally, that support vector machines are better suited for large feature sets based on morphological feature selection criteria.

* 10 pages, 4 figures, accepted at CITALA 2014 (
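
The tfidf weighting mentioned in the abstract above can be illustrated with a minimal sketch. It assumes documents are already tokenized (for Arabic, after rootification or light stemming) and uses the plain `tf * log(N / df)` variant; the function names `tfidf` and `top_features` are hypothetical, and this is not the authors' implementation.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

def top_features(weights, k):
    """Select the k terms with the highest total tf-idf weight (assumed criterion)."""
    totals = Counter()
    for w in weights:
        totals.update(w)
    return [term for term, _ in totals.most_common(k)]
```

A term that occurs in every document receives weight zero under this variant, which is why it drops out of the selected feature set.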

Deep Learning for Technical Document Classification

Jun 27, 2021
Shuo Jiang, Jianxi Luo, Jie Hu, Christopher L. Magee

In large technology companies, the requirements for managing and organizing technical documents created by engineers and managers in supporting relevant decision making have increased dramatically in recent years, which has led to a higher demand for more scalable, accurate, and automated document classification. Prior studies have primarily focused on processing text for classification and small-scale databases. This paper describes a novel multimodal deep learning architecture, called TechDoc, for technical document classification, which utilizes both natural language and descriptive images to train hierarchical classifiers. The architecture synthesizes convolutional neural networks and recurrent neural networks through an integrated training process. We applied the architecture to a large multimodal technical document database and trained the model for classifying documents based on the hierarchical International Patent Classification system. Our results show that the trained neural network presents a greater classification accuracy than those using a single modality and several earlier text classification methods. The trained model can potentially be scaled to millions of real-world technical documents with both text and figures, which is useful for data and knowledge management in large technology companies and organizations.

* 34 pages, 7 figures, 10 tables 

DIALOG-22 RuATD Generated Text Detection

Jun 16, 2022
Narek Maloyan, Bulat Nutfullin, Eugene Ilyushin

Text Generation Models (TGMs) succeed in creating text that matches human language style reasonably well. Detectors that can distinguish between TGM-generated and human-written text play an important role in preventing abuse of TGMs. In this paper, we describe our pipeline for the two DIALOG-22 RuATD tasks: detecting generated text (binary task) and classifying which model was used to generate the text (multiclass task). We achieved 1st place on the binary classification task with an accuracy score of 0.82995 on the private test set and 4th place on the multiclass classification task with an accuracy score of 0.62856 on the private test set. We proposed an ensemble method of different pre-trained models based on the attention mechanism.

* 6 pages 

Word-Class Embeddings for Multiclass Text Classification

Nov 26, 2019
Alejandro Moreo, Andrea Esuli, Fabrizio Sebastiani

Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appealing to enhance word representations with ad-hoc embeddings that encode task-specific information. We propose (supervised) word-class embeddings (WCEs), and show that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models in multiclass classification by topic. We show empirical evidence that WCEs yield a consistent improvement in multiclass classification accuracy, using four popular neural architectures and six widely used and publicly available datasets for multiclass text classification. Our code that implements WCEs is publicly available at

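
The idea of concatenating supervised word-class vectors to pre-trained embeddings, as described in the Word-Class Embeddings abstract, can be sketched roughly as follows. The abstract does not state how word-class association is measured, so the per-class document share used here is an assumption, as are the function names `word_class_embeddings` and `concatenate`; this is an illustration and not the authors' released code.

```python
from collections import defaultdict

def word_class_embeddings(docs, labels, n_classes):
    """Map each word to a vector of per-class document shares (assumed measure)."""
    counts = defaultdict(lambda: [0] * n_classes)
    for doc, label in zip(docs, labels):
        for word in set(doc):
            counts[word][label] += 1
    return {
        word: [c / sum(per_class) for c in per_class]
        for word, per_class in counts.items()
    }

def concatenate(pretrained, wce, n_classes):
    """Append the word-class vector to each pre-trained vector."""
    zero = [0.0] * n_classes  # fallback for words unseen in the labeled data
    return {word: vec + wce.get(word, zero) for word, vec in pretrained.items()}
```

Each resulting vector has the pre-trained dimensions followed by one dimension per class, so a downstream model sees both general and task-specific information for every word.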

DocSCAN: Unsupervised Text Classification via Learning from Neighbors

May 11, 2021
Dominik Stammbach, Elliott Ash

We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN). For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels. Our learnable clustering approach uses pairs of neighboring datapoints as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without provided ground-truth labels. On five topic classification benchmarks, we improve on various unsupervised baselines by a large margin. In datasets with relatively few and balanced outcome classes, DocSCAN approaches the performance of supervised classification. The method fails for other types of classification, such as sentiment analysis, pointing to important conceptual and practical differences between classifying images and texts.
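
The use of neighboring datapoints as a weak learning signal, described in the DocSCAN abstract, can be sketched in two steps: mine nearest-neighbor pairs from document vectors, then score how consistently a model assigns each pair to the same class. The function names and the exact loss form below are assumptions for illustration, not DocSCAN's implementation; a full objective would also need a term that discourages assigning every document to one cluster.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def neighbor_pairs(vectors, k):
    """Pair each document index with its k most similar other documents."""
    pairs = []
    for i, u in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: cosine(u, vectors[j]), reverse=True)
        pairs.extend((i, j) for j in others[:k])
    return pairs

def consistency_loss(probs, pairs):
    """Mean negative log of the agreement between paired class distributions."""
    total = 0.0
    for i, j in pairs:
        agreement = sum(p * q for p, q in zip(probs[i], probs[j]))
        total -= math.log(agreement)
    return total / len(pairs)
```

The loss is lowest when both documents in every pair receive the same confident class assignment, which is how neighborhood structure can stand in for ground-truth labels.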


Topological Data Analysis in Text Classification: Extracting Features with Additive Information

Mar 29, 2020
Shafie Gholizadeh, Ketki Savle, Armin Seyeditabari, Wlodek Zadrozny

While the strength of Topological Data Analysis has been explored in many studies on high dimensional numeric data, it is still a challenging task to apply it to text. As the primary goal in topological data analysis is to define and quantify the shapes in numeric data, defining shapes in the text is much more challenging, even though the geometries of vector spaces and conceptual spaces are clearly relevant for information retrieval and semantics. In this paper, we examine two different methods of extraction of topological features from text, using as the underlying representations of words the two most popular methods, namely word embeddings and TF-IDF vectors. To extract topological features from the word embedding space, we interpret the embedding of a text document as high dimensional time series, and we analyze the topology of the underlying graph where the vertices correspond to different embedding dimensions. For topological data analysis with the TF-IDF representations, we analyze the topology of the graph whose vertices come from the TF-IDF vectors of different blocks in the textual document. In both cases, we apply homological persistence to reveal the geometric structures under different distance resolutions. Our results show that these topological features carry some exclusive information that is not captured by conventional text mining methods. In our experiments we observe that adding topological features to the conventional features in ensemble models improves the classification results (up to 5%). On the other hand, as expected, topological features by themselves may not be sufficient for effective classification. It is an open problem to see whether TDA features from word embeddings might be sufficient, as they seem to perform within a few points of the top results obtained with a linear support vector classifier.


Variable Instance-Level Explainability for Text Classification

Apr 16, 2021
George Chrysostomou, Nikolaos Aletras

Despite the high accuracy of pretrained transformer networks in text classification, a persisting issue is their significant complexity that makes them hard to interpret. Recent research has focused on developing feature scoring methods for identifying which parts of the input are most important for the model to make a particular prediction and use it as an explanation (i.e. rationale). A limitation of these approaches is that they assume that a particular feature scoring method should be used across all instances in a dataset using a predefined fixed length, which might not be optimal across all instances. To address this, we propose a method for extracting variable-length explanations using a set of different feature scoring methods at instance-level. Our method is inspired by word erasure approaches which assume that the most faithful rationale for a prediction should be the one with the highest divergence between the model's output distribution using the full text and the text after removing the rationale for a particular instance. Evaluation on four standard text classification datasets shows that our method consistently provides more faithful explanations compared to previous fixed-length and fixed-feature scoring methods for rationale extraction.

* NLP Interpretability 
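
The erasure-based selection criterion in the abstract above (prefer the rationale whose removal changes the model's output distribution the most) can be sketched as below. The Jensen-Shannon divergence, the candidate sets, and `toy_predict` are all assumptions made for illustration; the abstract does not name the divergence used, and a real setup would call a trained classifier.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return (kl(p, m) + kl(q, m)) / 2

def select_rationale(tokens, candidates, predict):
    """Pick the candidate index set whose removal shifts the prediction most."""
    full = predict(tokens)
    def shift(indices):
        reduced = [t for i, t in enumerate(tokens) if i not in indices]
        return js_divergence(full, predict(reduced))
    return max(candidates, key=shift)

# Hypothetical stand-in for a trained classifier: returns
# [P(positive), P(negative)], with P(positive) higher when "great" is present.
def toy_predict(tokens):
    pos = 0.9 if "great" in tokens else 0.5
    return [pos, 1 - pos]

tokens = ["a", "great", "movie"]
# Candidate rationales, e.g. produced by different scorers and lengths.
candidates = [{0}, {1}, {1, 2}, {2}]
best = select_rationale(tokens, candidates, toy_predict)
```

Because the candidates may differ in both length and scoring method, the selection is made per instance rather than fixed for the whole dataset.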

T-Miner: A Generative Approach to Defend Against Trojan Attacks on DNN-based Text Classification

Mar 11, 2021
Ahmadreza Azizi, Ibrahim Asadullah Tahmid, Asim Waheed, Neal Mangaokar, Jiameng Pu, Mobin Javed, Chandan K. Reddy, Bimal Viswanath

Deep Neural Network (DNN) classifiers are known to be vulnerable to Trojan or backdoor attacks, where the classifier is manipulated such that it misclassifies any input containing an attacker-determined Trojan trigger. Backdoors compromise a model's integrity, thereby posing a severe threat to the landscape of DNN-based classification. While multiple defenses against such attacks exist for classifiers in the image domain, there have been limited efforts to protect classifiers in the text domain. We present Trojan-Miner (T-Miner) -- a defense framework for Trojan attacks on DNN-based text classifiers. T-Miner employs a sequence-to-sequence (seq-2-seq) generative model that probes the suspicious classifier and learns to produce text sequences that are likely to contain the Trojan trigger. T-Miner then analyzes the text produced by the generative model to determine whether it contains trigger phrases, and correspondingly, whether the tested classifier has a backdoor. T-Miner requires no access to the training dataset or clean inputs of the suspicious classifier, and instead uses synthetically crafted "nonsensical" text inputs to train the generative model. We extensively evaluate T-Miner on 1100 model instances spanning 3 ubiquitous DNN model architectures, 5 different classification tasks, and a variety of trigger phrases. We show that T-Miner detects Trojan and clean models with a 98.75% overall accuracy, while achieving low false positives on clean models. We also show that T-Miner is robust against a variety of targeted, advanced attacks from an adaptive attacker.

* Accepted to Usenix Security 2021; First two authors contributed equally to this work; 18 pages, 11 tables 

GUDN: A novel guide network for extreme multi-label text classification

Jan 10, 2022
Qing Wang, Hongji Shu, Jia Zhu

The goal of extreme multi-label text classification (XMTC) is to recall the most relevant labels for a text from an extremely large label set. Although methods based on deep pre-trained models have achieved significant results, the pre-trained models are still not fully utilized. Label semantics has not attracted much attention so far, and the latent space between texts and labels has not been effectively explored. This paper constructs a novel guide network (GUDN) to help fine-tune the pre-trained model to instruct classification later. We also use the raw label semantics to effectively explore the latent space between texts and labels, which can further improve prediction accuracy. Experimental results demonstrate that GUDN outperforms state-of-the-art methods on several popular datasets. Our source code is released at

* 7 pages, 3 figures 