"Text Classification": models, code, and papers

Rank4Class: A Ranking Formulation for Multiclass Classification

Dec 17, 2021
Nan Wang, Zhen Qin, Le Yan, Honglei Zhuang, Xuanhui Wang, Michael Bendersky, Marc Najork

Multiclass classification (MCC) is a fundamental machine learning problem which aims to classify each instance into one of a predefined set of classes. Given an instance, a classification model computes a score for each class, all of which are then used to sort the classes. The performance of a classification model is usually measured by Top-K Accuracy/Error (e.g., K=1 or 5). In this paper, we do not aim to propose new neural representation learning models as most recent works do, but to show that it is easy to boost MCC performance with a novel formulation through the lens of ranking. In particular, by viewing MCC as ranking classes for an instance, we first argue that ranking metrics, such as Normalized Discounted Cumulative Gain (NDCG), can be more informative than existing Top-K metrics. We further demonstrate that the dominant neural MCC architecture can be formulated as a neural ranking framework with a specific set of design choices. Based on this generalization, we show that it is straightforward and intuitive to leverage techniques from the rich information retrieval literature to improve MCC performance out of the box. Extensive empirical results on both text and image classification tasks with diverse datasets and backbone models (e.g., BERT and ResNet for text and image classification) show the value of our proposed framework.
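To illustrate the claim that a ranking metric can be more informative than Top-K accuracy, here is a minimal sketch for the single-label case. It assumes exactly one relevant class with gain 1, so NDCG reduces to 1 / log2(1 + rank); the function names and the example scores are hypothetical and not taken from the paper.

```python
import math

def rank_of_true_class(scores, true_class):
    """1-based rank of the true class when classes are sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return order.index(true_class) + 1

def top_k_hit(scores, true_class, k):
    """1 if the true class is among the k highest-scoring classes, else 0."""
    return 1 if rank_of_true_class(scores, true_class) <= k else 0

def ndcg_single_label(scores, true_class):
    """NDCG with one relevant class (gain 1): the ideal DCG is 1, so NDCG = 1 / log2(1 + rank)."""
    return 1.0 / math.log2(1 + rank_of_true_class(scores, true_class))

# Two hypothetical score vectors over 5 classes; the true class is index 2.
scores_a = [0.10, 0.50, 0.30, 0.06, 0.04]  # true class ranked 2nd
scores_b = [0.30, 0.50, 0.05, 0.11, 0.04]  # true class ranked 4th

for name, scores in (("a", scores_a), ("b", scores_b)):
    print(name,
          "top1:", top_k_hit(scores, 2, 1),
          "top5:", top_k_hit(scores, 2, 5),
          "ndcg:", round(ndcg_single_label(scores, 2), 4))
```

Both predictions look identical under Top-1 (miss) and Top-5 (hit), while NDCG separates them because it depends on the exact rank of the true class.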


Evaluation of Representation Models for Text Classification with AutoML Tools

Jul 07, 2021
Sebastian Brändle, Marc Hanussek, Matthias Blohm, Maximilien Kintz

Automated Machine Learning (AutoML) has gained increasing success on tabular data in recent years. However, processing unstructured data like text is a challenge and not widely supported by open-source AutoML tools. This work compares three manually created text representations and text embeddings automatically created by AutoML tools. Our benchmark includes four popular open-source AutoML tools and eight datasets for text classification purposes. The results show that straightforward text representations perform better than AutoML tools with automatically created text embeddings.

* Accepted for Future Technologies Conference 2021 

The Utility of General Domain Transfer Learning for Medical Language Tasks

Feb 16, 2020
Daniel Ranti, Katie Hanss, Shan Zhao, Varun Arvind, Joseph Titano, Anthony Costa, Eric Oermann

The purpose of this study is to analyze the efficacy of transfer learning techniques and transformer-based models as applied to medical natural language processing (NLP) tasks, specifically radiological text classification. We used 1,977 labeled head CT reports, from a corpus of 96,303 total reports, to evaluate the efficacy of pretraining using general domain corpora and a combined general and medical domain corpus with a bidirectional encoder representations from transformers (BERT) model for the purpose of radiological text classification. Model performance was benchmarked against a logistic regression using bag-of-words vectorization and a long short-term memory (LSTM) multi-label multi-class classification model, and compared to the published literature in medical text classification. The BERT models using either set of pretrained checkpoints outperformed the logistic regression model, achieving sample-weighted average F1-scores of 0.87 and 0.87 for the general domain model and the combined general and biomedical-domain model, respectively. General text transfer learning may be a viable technique to generate state-of-the-art results within medical NLP tasks on radiological corpora, outperforming other deep models such as LSTMs. The efficacy of pretraining and transformer-based models could serve to facilitate the creation of groundbreaking NLP models in the uniquely challenging data environment of medical text.

* 8 pages, 5 figures, 2 tables 

A Comparative Study on using Principle Component Analysis with Different Text Classifiers

Jul 04, 2018
Ahmed I. Taloba, D. A. Eisa, Safaa S. I. Ismail

Text categorization (TC) is the task of automatically organizing a set of documents into a set of pre-defined categories. Over the last few years, increased attention has been paid to documents in digital form, which makes text categorization a challenging issue. The most significant problem in text categorization is the huge number of features. Most of these features are redundant, noisy, or irrelevant, which causes overfitting in most classifiers. Hence, feature extraction is an important step for improving the overall accuracy and performance of text classifiers. In this paper, we provide an overview of using principal component analysis (PCA) as a feature extraction method with various classifiers. We observed that classifier performance improved after using PCA to reduce the dimensionality of the data. Experiments are conducted on three UCI data sets: Classic03, CNAE-9, and DBWorld e-mails. We compare the classification performance obtained with PCA across popular, well-known text classifiers. Results show that using PCA improves classification performance for most of the classifiers.

* International Journal of Computer Applications 180(31):1-6, April 2018 

Profitable Trade-Off Between Memory and Performance In Multi-Domain Chatbot Architectures

Nov 06, 2021
D Emre Tasar, Sukru Ozan, M Fatih Akca, Oguzhan Olmez, Semih Gulum, Secilay Kutay, Ceren Belhan

Text classification is a broad area of study within natural language processing. In short, the text classification problem is to determine which of a set of predefined classes a given text belongs to. Many successful studies have been carried out in this field. In this study, Bidirectional Encoder Representations from Transformers (BERT), a frequently preferred method for solving classification problems in natural language processing, is used. By solving multiple classification problems with a single model in a chatbot architecture, the aim is to alleviate the server load that would be created by using a separate model for each classification problem. To this end, a masking method is applied at prediction time to a single BERT model trained for classification across several subjects, so that the model's predictions are made on a per-problem basis. Three separate data sets covering different fields are divided by various methods in order to complicate the problem, so that classification problems that are very close to each other in terms of field are also included. The resulting dataset consists of five classification problems with 154 classes. A BERT model covering all classification problems and other BERT models trained specifically for each problem were compared in terms of performance and the space they occupy on the server.

* In Turkish. ICADA 21, 1st International Conference on Artificial Intelligence and Data Science, Nov 26-Nov 28 2021, Izmir Katip Celebi University, Izmir, Turkey 
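The per-problem masking idea in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes one shared output layer whose class indices are partitioned by problem, and the task names, index layout, and logits are all hypothetical.

```python
import math

# Hypothetical layout: a single output layer where each classification
# problem owns a disjoint subset of the class indices.
TASK_CLASSES = {
    "intent": [0, 1, 2],
    "topic": [3, 4],
}

def masked_predict(logits, task):
    """Return the highest-scoring class index among the classes of the given task.

    Logits of classes that belong to other tasks are replaced with -inf,
    so they can never be selected by the argmax.
    """
    allowed = set(TASK_CLASSES[task])
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    return max(range(len(masked)), key=lambda i: masked[i])

# Hypothetical logits from one forward pass of the shared model.
logits = [0.2, 1.5, -0.3, 2.7, 0.9]
print(masked_predict(logits, "intent"))  # restricted to indices 0..2
print(masked_predict(logits, "topic"))   # restricted to indices 3..4
```

With this design a single set of weights serves every problem, and the mask alone decides which problem a given query is answered for.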

Collective Classification of Textual Documents by Guided Self-Organization in T-Cell Cross-Regulation Dynamics

Feb 04, 2011
Alaa Abi-Haidar, Luis M. Rocha

We present and study an agent-based model of T-Cell cross-regulation in the adaptive immune system, which we apply to binary classification. Our method expands an existing analytical model of T-cell cross-regulation (Carneiro et al. in Immunol Rev 216(1):48-68, 2007) that was used to study the self-organizing dynamics of a single population of T-Cells in interaction with an idealized antigen presenting cell capable of presenting a single antigen. With agent-based modeling we are able to study the self-organizing dynamics of multiple populations of distinct T-cells which interact via antigen presenting cells that present hundreds of distinct antigens. Moreover, we show that such self-organizing dynamics can be guided to produce an effective binary classification of antigens, which is competitive with existing machine learning methods when applied to biomedical text classification. More specifically, here we test our model on a dataset of publicly available full-text biomedical articles provided by the BioCreative challenge (Krallinger in The biocreative ii. 5 challenge overview, p 19, 2009). We study the robustness of our model's parameter configurations, and show that it leads to encouraging results comparable to state-of-the-art classifiers. Our results help us understand both T-cell cross-regulation as a general principle of guided self-organization, as well as its applicability to document classification. Therefore, we show that our bio-inspired algorithm is a promising novel method for biomedical article classification and for binary document classification in general.

* Evolutionary Intelligence. 2011. Volume 4, Number 2, 69-80 

An Empirical Evaluation of Text Representation Schemes on Multilingual Social Web to Filter the Textual Aggression

Apr 16, 2019
Sandip Modha, Prasenjit Majumder

This paper attempts to study the effectiveness of text representation schemes on two tasks, namely user aggression detection and fact detection in social media content. In user aggression detection, the aim is to identify the level of aggression in content generated on social media and written in English, Devanagari Hindi, and Romanized Hindi. Aggression levels are categorized into three predefined classes, namely `Non-aggressive`, `Overtly Aggressive`, and `Covertly Aggressive`. During disaster-related incidents, social media such as Twitter is flooded with millions of posts. In such emergency situations, identifying factual posts is important for organizations involved in relief operations. We treat this problem as a combination of a classification problem and a ranking problem. This paper presents a comparison of various text representation schemes based on BoW techniques, distributed word/sentence representations, and transfer learning on classifiers. The weighted $F_1$ score is used as the primary evaluation metric. Results show that text representation using BoW performs better than word embeddings on machine learning classifiers, while pre-trained word embedding techniques perform better on classifiers based on deep neural networks. Recent transfer learning models such as ELMo and ULMFiT are fine-tuned for the aggression classification task. However, their results are not on par with the pre-trained word embedding models. Overall, word embeddings using fastText produce a better weighted $F_1$ score than Word2Vec and GloVe. Results are further improved using pre-trained vector models. Statistical significance tests are employed to ensure the significance of the classification results. When the test dataset is lexically different from the training dataset, deep neural models are more robust and perform substantially better than machine learning classifiers.

* 21 pages, 2 figures 
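For readers unfamiliar with the weighted $F_1$ score named as the primary metric above, here is a minimal sketch of the usual definition: the per-class $F_1$ averaged with weights proportional to each class's support in the true labels. The label abbreviations (`NAG`, `OAG`, `CAG`) and the toy predictions are hypothetical shorthand for the three aggression classes, not data from the paper.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1  # weight each class by its number of true instances
    return total / len(y_true)

# Toy example with hypothetical labels for the three aggression classes.
y_true = ["NAG", "NAG", "NAG", "OAG", "CAG"]
y_pred = ["NAG", "NAG", "OAG", "OAG", "NAG"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because the weights follow class frequency, a frequent class such as `NAG` in this toy example influences the score more than a rare one.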

A Graph Total Variation Regularized Softmax for Text Generation

Jan 01, 2021
Liu Bin, Wang Liang, Yin Guosheng

The softmax operator is one of the most important functions in machine learning models. When applying neural networks to multi-category classification, the correlations among different categories are often ignored. For example, in text generation, a language model chooses each new word based only on the former selection of its context. In this scenario, the link statistics information of concurrent words based on a corpus (an analogy of the natural way of expression) is also valuable in choosing the next word, which can help to improve the sentence's fluency and smoothness. To fully explore such important information, we propose a graph softmax function for text generation. It is expected that the final classification result would be dominated by both the language model and graphical text relationships among words. We use a graph total variation term to regularize softmax so as to incorporate the concurrent relationship into the language model. The total variation of the generated words should be small locally. We apply the proposed graph softmax to GPT2 for the text generation task. Experimental results demonstrate that the proposed graph softmax achieves better BLEU and perplexity than softmax. Human testers can also easily distinguish the text generated by the graph softmax from the text generated by softmax.


Active Discriminative Text Representation Learning

Dec 01, 2016
Ye Zhang, Matthew Lease, Byron C. Wallace

We propose a new active learning (AL) method for text classification with convolutional neural networks (CNNs). In AL, one selects the instances to be manually labeled with the aim of maximizing model performance with minimal effort. Neural models capitalize on word embeddings as representations (features), tuning these to the task at hand. We argue that AL strategies for multi-layered neural models should focus on selecting instances that most affect the embedding space (i.e., induce discriminative word representations). This is in contrast to traditional AL approaches (e.g., entropy-based uncertainty sampling), which specify higher level objectives. We propose a simple approach for sentence classification that selects instances containing words whose embeddings are likely to be updated with the greatest magnitude, thereby rapidly learning discriminative, task-specific embeddings. We extend this approach to document classification by jointly considering: (1) the expected changes to the constituent word representations; and (2) the model's current overall uncertainty regarding the instance. The relative emphasis placed on these criteria is governed by a stochastic process that favors selecting instances likely to improve representations at the outset of learning, and then shifts toward general uncertainty sampling as AL progresses. Empirical results show that our method outperforms baseline AL approaches on both sentence and document classification tasks. We also show that, as expected, the method quickly learns discriminative word embeddings. To the best of our knowledge, this is the first work on AL addressing neural models for text classification.

* Accepted at AAAI 2017 
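The abstract above contrasts its method with entropy-based uncertainty sampling, the traditional active learning baseline. A minimal sketch of that baseline (not of the proposed embedding-based method) is given below; the function names and the pool of predicted probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(pool_probs, batch_size):
    """Indices of the unlabeled instances with the highest predictive entropy."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:batch_size]

# Hypothetical model outputs for four unlabeled instances (binary task).
pool = [
    [0.98, 0.02],
    [0.55, 0.45],
    [0.80, 0.20],
    [0.50, 0.50],
]
print(select_most_uncertain(pool, 2))  # instances to send for manual labeling
```

The selected instances are those the current model is least sure about; the paper's approach instead favors instances expected to change the word embeddings the most early in learning.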