Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"Text Classification": models, code, and papers

Hierarchical Taxonomy-Aware and Attentional Graph Capsule RCNNs for Large-Scale Multi-Label Text Classification

Jun 09, 2019
Hao Peng, Jianxin Li, Qiran Gong, Senzhang Wang, Lifang He, Bo Li, Lihong Wang, Philip S. Yu

CNNs, RNNs, GCNs, and CapsNets have shown significant insights in representation learning and are widely used in various text mining tasks such as large-scale multi-label text classification. However, most existing deep models for multi-label text classification consider either the non-consecutive and long-distance semantics or the sequential semantics, but how to consider them both coherently is less studied. In addition, most existing methods treat output labels as independent methods, but ignore the hierarchical relations among them, leading to useful semantic information loss. In this paper, we propose a novel hierarchical taxonomy-aware and attentional graph capsule recurrent CNNs framework for large-scale multi-label text classification. Specifically, we first propose to model each document as a word order preserved graph-of-words and normalize it as a corresponding words-matrix representation which preserves both the non-consecutive, long-distance and local sequential semantics. Then the words-matrix is input to the proposed attentional graph capsule recurrent CNNs for more effectively learning the semantic features. To leverage the hierarchical relations among the class labels, we propose a hierarchical taxonomy embedding method to learn their representations, and define a novel weighted margin loss by incorporating the label representation similarity. Extensive evaluations on three datasets show that our model significantly improves the performance of large-scale multi-label text classification by comparing with state-of-the-art approaches.


Benchmark Performance of Machine And Deep Learning Based Methodologies for Urdu Text Document Classification

Mar 03, 2020
Muhammad Nabeel Asim, Muhammad Usman Ghani, Muhammad Ali Ibrahim, Sheraz Ahmad, Waqar Mahmood, Andreas Dengel

In order to provide benchmark performance for Urdu text document classification, the contribution of this paper is manifold. First, it pro-vides a publicly available benchmark dataset manually tagged against 6 classes. Second, it investigates the performance impact of traditional machine learning based Urdu text document classification methodologies by embedding 10 filter-based feature selection algorithms which have been widely used for other languages. Third, for the very first time, it as-sesses the performance of various deep learning based methodologies for Urdu text document classification. In this regard, for experimentation, we adapt 10 deep learning classification methodologies which have pro-duced best performance figures for English text classification. Fourth, it also investigates the performance impact of transfer learning by utiliz-ing Bidirectional Encoder Representations from Transformers approach for Urdu language. Fifth, it evaluates the integrity of a hybrid approach which combines traditional machine learning based feature engineering and deep learning based automated feature engineering. Experimental results show that feature selection approach named as Normalised Dif-ference Measure along with Support Vector Machine outshines state-of-the-art performance on two closed source benchmark datasets CLE Urdu Digest 1000k, and CLE Urdu Digest 1Million with a significant margin of 32%, and 13% respectively. Across all three datasets, Normalised Differ-ence Measure outperforms other filter based feature selection algorithms as it significantly uplifts the performance of all adopted machine learning, deep learning, and hybrid approaches. The source code and presented dataset are available at Github repository.


Supervised Text Classification using Text Search

Nov 30, 2020
Nabarun Mondal, Mrunal Lohia

Supervised text classification is a classical and active area of ML research. In large enterprise, solutions to this problem has significant importance. This is specifically true in ticketing systems where prediction of the type and subtype of tickets given new incoming ticket text to find out optimal routing is a multi billion dollar industry. In this paper authors describe a class of industrial standard algorithms which can accurately ( 86\% and above ) predict classification of any text given prior labelled text data - by novel use of any text search engine. These algorithms were used to automate routing of issue tickets to the appropriate team. This class of algorithms has far reaching consequences for a wide variety of industrial applications, IT support, RPA script triggering, even legal domain where massive set of pre labelled data are already available.


Term-Weighting Learning via Genetic Programming for Text Classification

Oct 06, 2014
Hugo Jair Escalante, Mauricio A. Garc铆a-Lim贸n, Alicia Morales-Reyes, Mario Graff, Manuel Montes-y-G贸mez, Eduardo F. Morales

This paper describes a novel approach to learning term-weighting schemes (TWSs) in the context of text classification. In text mining a TWS determines the way in which documents will be represented in a vector space model, before applying a classifier. Whereas acceptable performance has been obtained with standard TWSs (e.g., Boolean and term-frequency schemes), the definition of TWSs has been traditionally an art. Further, it is still a difficult task to determine what is the best TWS for a particular problem and it is not clear yet, whether better schemes, than those currently available, can be generated by combining known TWS. We propose in this article a genetic program that aims at learning effective TWSs that can improve the performance of current schemes in text classification. The genetic program learns how to combine a set of basic units to give rise to discriminative TWSs. We report an extensive experimental study comprising data sets from thematic and non-thematic text classification as well as from image classification. Our study shows the validity of the proposed method; in fact, we show that TWSs learned with the genetic program outperform traditional schemes and other TWSs proposed in recent works. Further, we show that TWSs learned from a specific domain can be effectively used for other tasks.


Text Complexity Classification Based on Linguistic Information: Application to Intelligent Tutoring of ESL

Jan 07, 2020
M. Zakaria Kurdi

The goal of this work is to build a classifier that can identify text complexity within the context of teaching reading to English as a Second Language (ESL) learners. To present language learners with texts that are suitable to their level of English, a set of features that can describe the phonological, morphological, lexical, syntactic, discursive, and psychological complexity of a given text were identified. Using a corpus of 6171 texts, which had already been classified into three different levels of difficulty by ESL experts, different experiments were conducted with five machine learning algorithms. The results showed that the adopted linguistic features provide a good overall classification performance (F-Score = 0.97). A scalability evaluation was conducted to test if such a classifier could be used within real applications, where it can be, for example, plugged into a search engine or a web-scraping module. In this evaluation, the texts in the test set are not only different from those from the training set but also of different types (ESL texts vs. children reading texts). Although the overall performance of the classifier decreased significantly (F-Score = 0.65), the confusion matrix shows that most of the classification errors are between the classes two and three (the middle-level classes) and that the system has a robust performance in categorizing texts of class one and four. This behavior can be explained by the difference in classification criteria between the two corpora. Hence, the observed results confirm the usability of such a classifier within a real-world application.

* This is an unpublished pre-print, the JDMDH journal requires submission to before the submission to the journal (see the link: 

Deep Short Text Classification with Knowledge Powered Attention

Feb 21, 2019
Jindong Chen, Yizhou Hu, Jingping Liu, Yanghua Xiao, Haiyun Jiang

Short text classification is one of important tasks in Natural Language Processing (NLP). Unlike paragraphs or documents, short texts are more ambiguous since they have not enough contextual information, which poses a great challenge for classification. In this paper, we retrieve knowledge from external knowledge source to enhance the semantic representation of short texts. We take conceptual information as a kind of knowledge and incorporate it into deep neural networks. For the purpose of measuring the importance of knowledge, we introduce attention mechanisms and propose deep Short Text Classification with Knowledge powered Attention (STCKA). We utilize Concept towards Short Text (C- ST) attention and Concept towards Concept Set (C-CS) attention to acquire the weight of concepts from two aspects. And we classify a short text with the help of conceptual information. Unlike traditional approaches, our model acts like a human being who has intrinsic ability to make decisions based on observation (i.e., training data for machines) and pays more attention to important knowledge. We also conduct extensive experiments on four public datasets for different tasks. The experimental results and case studies show that our model outperforms the state-of-the-art methods, justifying the effectiveness of knowledge powered attention.


Early Forecasting of Text Classification Accuracy and F-Measure with Active Learning

Jan 20, 2020
Thomas Orth, Michael Bloodgood

When creating text classification systems, one of the major bottlenecks is the annotation of training data. Active learning has been proposed to address this bottleneck using stopping methods to minimize the cost of data annotation. An important capability for improving the utility of stopping methods is to effectively forecast the performance of the text classification models. Forecasting can be done through the use of logarithmic models regressed on some portion of the data as learning is progressing. A critical unexplored question is what portion of the data is needed for accurate forecasting. There is a tension, where it is desirable to use less data so that the forecast can be made earlier, which is more useful, versus it being desirable to use more data, so that the forecast can be more accurate. We find that when using active learning it is even more important to generate forecasts earlier so as to make them more useful and not waste annotation effort. We investigate the difference in forecasting difficulty when using accuracy and F-measure as the text classification system performance metrics and we find that F-measure is more difficult to forecast. We conduct experiments on seven text classification datasets in different semantic domains with different characteristics and with three different base machine learning algorithms. We find that forecasting is easiest for decision tree learning, moderate for Support Vector Machines, and most difficult for neural networks.

* 8 pages, 9 figures, 2 tables; to appear in Proceedings of the IEEE 14th International Conference on Semantic Computing (ICSC 2020), San Diego, California, 2020 

Gender classification by means of online uppercase handwriting: A text-dependent allographic approach

Mar 18, 2022
Enric Sesa-Nogueras, Marcos Faundez-Zanuy, Josep Roure-Alcob茅

This paper presents a gender classification schema based on online handwriting. Using samples acquired with a digital tablet that captures the dynamics of the writing, it classifies the writer as a male or a female. The method proposed is allographic, regarding strokes as the structural units of handwriting. Strokes performed while the writing device is not exerting any pressure on the writing surface, pen-up (in-air) strokes, are also taken into account. The method is also text-dependent meaning that training and testing is done with exactly the same text. Text-dependency allows classification be performed with very small amounts of text. Experimentation, performed with samples from the BiosecurID database, yields results that fall in the range of the classification averages expected from human judges. With only four repetitions of a single uppercase word, the average rate of well classified writers is 68%; with sixteen words, the rate rises to an average 72.6%. Statistical analysis reveals that the aforementioned rates are highly significant. In order to explore the classification potential of the pen-up strokes, these are also considered. Although in this case results are not conclusive, an outstanding average of 74% of well classified writers is obtained when information from pen-up strokes is combined with information from pen-down ones.

* Cognitive computation vol. 8 pages 15-19, 2016 
* 25 pages, published in Cogn Comput 8, pages 15 to 29, year 2016 

Improving Medical Short Text Classification with Semantic Expansion Using Word-Cluster Embedding

Dec 05, 2018
Ying Shen, Qiang Zhang, Jin Zhang, Jiyue Huang, Yuming Lu, Kai Lei

Automatic text classification (TC) research can be used for real-world problems such as the classification of in-patient discharge summaries and medical text reports, which is beneficial to make medical documents more understandable to doctors. However, in electronic medical records (EMR), the texts containing sentences are shorter than that in general domain, which leads to the lack of semantic features and the ambiguity of semantic. To tackle this challenge, we propose to add word-cluster embedding to deep neural network for improving short text classification. Concretely, we first use hierarchical agglomerative clustering to cluster the word vectors in the semantic space. Then we calculate the cluster center vector which represents the implicit topic information of words in the cluster. Finally, we expand word vector with cluster center vector, and implement classifiers using CNN and LSTM respectively. To evaluate the performance of our proposed method, we conduct experiments on public data sets TREC and the medical short sentences data sets which is constructed and released by us. The experimental results demonstrate that our proposed method outperforms state-of-the-art baselines in short sentence classification on both medical domain and general domain.

* International Conference on Information Science and Applications ICISA 2018: Information Science and Applications 2018 pp 401-411 

Text Classification in Memristor-based Spiking Neural Networks

Jul 31, 2022
Jinqi Huang, Alex Serb, Spyros Stathopoulos, Themis Prodromakis

Memristors, emerging non-volatile memory devices, have shown promising potential in neuromorphic hardware designs, especially in spiking neural network (SNN) hardware implementation. Memristor-based SNNs have been successfully applied in a wide range of various applications, including image classification and pattern recognition. However, implementing memristor-based SNNs in text classification is still under exploration. One of the main reasons is that training memristor-based SNNs for text classification is costly due to the lack of efficient learning rules and memristor non-idealities. To address these issues and accelerate the research of exploring memristor-based spiking neural networks in text classification applications, we develop a simulation framework with a virtual memristor array using an empirical memristor model. We use this framework to demonstrate a sentiment analysis task in the IMDB movie reviews dataset. We take two approaches to obtain trained spiking neural networks with memristor models: 1) by converting a pre-trained artificial neural network (ANN) to a memristor-based SNN, or 2) by training a memristor-based SNN directly. These two approaches can be applied in two scenarios: offline classification and online training. We achieve the classification accuracy of 85.88% by converting a pre-trained ANN to a memristor-based SNN and 84.86% by training the memristor-based SNN directly, given that the baseline training accuracy of the equivalent ANN is 86.02%. We conclude that it is possible to achieve similar classification accuracy in simulation from ANNs to SNNs and from non-memristive synapses to data-driven memristive synapses. We also investigate how global parameters such as spike train length, the read noise, and the weight updating stop conditions affect the neural networks in both approaches.

* 23 pages, 5 figures