"Text Classification": models, code, and papers

An Automated Text Categorization Framework based on Hyperparameter Optimization

Sep 14, 2017
Eric S. Tellez, Daniela Moctezuma, Sabino Miranda-Jímenez, Mario Graff

A great variety of text tasks such as topic or spam identification, user profiling, and sentiment analysis can be posed as a supervised learning problem and tackled using a text classifier. A text classifier consists of several subprocesses, some of which are general enough to be applied to any supervised learning problem, whereas others are specifically designed to tackle a particular task, using complex and computationally expensive processes such as lemmatization, syntactic analysis, etc. Contrary to traditional approaches, we propose a minimalistic and wide system able to tackle text classification tasks independent of domain and language, namely microTC. It is composed of some easy-to-implement text transformations, text representations, and a supervised learning algorithm. These pieces produce a competitive classifier even in the domain of informally written text. We provide a detailed description of microTC along with an extensive experimental comparison with relevant state-of-the-art methods. microTC was compared on 30 different datasets. Regarding accuracy, microTC obtained the best performance in 20 datasets while achieving competitive results in the remaining 10. The compared datasets include several problems like topic and polarity classification, spam detection, user profiling and authorship attribution. Furthermore, it is important to state that our approach allows the usage of the technology even without knowledge of machine learning and natural language processing.
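As a rough illustration of the three pieces the abstract names (text transformations, a text representation, and a supervised learner), here is a minimal standard-library sketch. It is not the microTC implementation: the specific transformations (lowercasing, diacritic stripping), the character 3-grams, the term-frequency weighting, and the nearest-centroid classifier are all assumptions chosen for brevity, and every function name is hypothetical.

```python
import math
import re
import unicodedata
from collections import Counter, defaultdict


def normalize(text):
    """Lowercase, strip diacritics, and collapse whitespace (assumed transformations)."""
    text = unicodedata.normalize("NFD", text.lower())
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text, q=3):
    """Return word unigrams followed by character q-grams of the normalized text."""
    norm = normalize(text)
    words = norm.split()
    qgrams = [norm[i:i + q] for i in range(len(norm) - q + 1)]
    return words + qgrams


def vectorize(text):
    """L2-normalized term-frequency vector, stored as a dict."""
    counts = Counter(tokenize(text))
    size = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {tok: v / size for tok, v in counts.items()}


def train_centroids(texts, labels):
    """Sum the vectors of each class; a simple stand-in for the supervised learner."""
    centroids = defaultdict(Counter)
    for text, label in zip(texts, labels):
        for tok, weight in vectorize(text).items():
            centroids[label][tok] += weight
    return centroids


def predict(centroids, text):
    """Pick the class whose centroid has the highest cosine similarity to the text."""
    vec = vectorize(text)

    def cosine(centroid):
        dot = sum(w * centroid.get(tok, 0.0) for tok, w in vec.items())
        size = math.sqrt(sum(v * v for v in centroid.values())) or 1.0
        return dot / size

    return max(centroids, key=lambda label: cosine(centroids[label]))
```

Character q-grams are what make a sketch like this tolerant of informal spelling, since misspelled words still share most of their q-grams with the intended word. A tuned system would replace the centroid rule with a stronger learner and search over the transformation choices.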


A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques

Jul 28, 2017
Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saied Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, Krys Kochut

The amount of text that is generated every day is increasing dramatically. This tremendous volume of mostly unstructured text cannot be simply processed and perceived by computers. Therefore, efficient and effective techniques and algorithms are required to discover useful patterns. Text mining is the task of extracting meaningful information from text, which has gained significant attention in recent years. In this paper, we describe several of the most fundamental text mining tasks and techniques including text pre-processing, classification and clustering. Additionally, we briefly explain text mining in biomedical and health care domains.

* The format of some references has been updated 

Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features

Jan 14, 2020
Andres Mafla, Sounak Dey, Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas

Text contained in an image carries high-level semantics that can be exploited to achieve richer image understanding. In particular, the mere presence of text provides strong guiding content that should be employed to tackle a diversity of computer vision tasks such as image retrieval, fine-grained classification, and visual question answering. In this paper, we address the problem of fine-grained classification and image retrieval by leveraging textual information along with visual cues to comprehend the existing intrinsic relation between the two modalities. The novelty of the proposed model consists of the usage of a PHOC descriptor to construct a bag of textual words along with a Fisher Vector Encoding that captures the morphology of text. This approach provides a stronger multimodal representation for this task and as our experiments demonstrate, it achieves state-of-the-art results on two different tasks, fine-grained classification and image retrieval.

* Winter Conference on Applications of Computer Vision (WACV 2020) Accepted paper 

Detecting Text Formality: A Study of Text Classification Approaches

Apr 19, 2022
Daryna Dementieva, Ivan Trifinov, Andrey Likhachev, Alexander Panchenko

Formality is an important characteristic of text documents. The automatic detection of the formality level of a text is potentially beneficial for various natural language processing tasks, such as retrieval of texts with a desired formality level, integration in language learning and document editing platforms, or evaluating the desired conversation tone by chatbots. Recently two large-scale datasets were introduced for multiple languages featuring formality annotation. However, they were primarily used for the training of style transfer models, even though detecting text formality on its own may also be a useful application. This work proposes the first systematic study of formality detection methods based on current (and more classic) machine learning methods and delivers the best-performing models for public usage. We conducted three types of experiments: monolingual, multilingual, and cross-lingual. The study shows that BiLSTM-based models outperform transformer-based ones on the formality classification task. We release formality detection models for several languages yielding state-of-the-art results and possessing tested cross-lingual capabilities.


Empirical Study of Text Augmentation on Social Media Text in Vietnamese

Oct 09, 2020
Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

In the text classification problem, the imbalance of labels in datasets affects the performance of text classification models. In practice, user comments on social networking sites are not all visible: administrators often allow only positive comments and hide negative ones. Thus, when collecting data about user comments on a social network, the data is usually skewed toward one label, which makes the dataset imbalanced and degrades the model's performance. Data augmentation techniques are applied to solve the imbalance problem between classes of the dataset, increasing the prediction model's accuracy. In this paper, we performed augmentation techniques on the VLSP2019 Hate Speech Detection on Vietnamese social texts and the UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis. Augmentation increases the F1-macro score by about 1.5% on both corpora.

* Accepted by The 34th Pacific Asia Conference on Language, Information and Computation 

ESGBERT: Language Model to Help with Classification Tasks Related to Companies Environmental, Social, and Governance Practices

Mar 31, 2022
Srishti Mehra, Robert Louka, Yixun Zhang

Environmental, Social, and Governance (ESG) are non-financial factors that are garnering attention from investors as they increasingly look to apply these as part of their analysis to identify material risks and growth opportunities. Some of this attention is also driven by clients who, now more aware than ever, are demanding that their money be managed and invested responsibly. As the interest in ESG grows, so does the need for investors to have access to consumable ESG information. Since most of it is in text form in reports, disclosures, press releases, and 10-Q filings, we see a need for sophisticated NLP techniques for classification tasks for ESG text. We hypothesize that an ESG domain-specific pre-trained model will help with such tasks, and we study how to build one in this paper. We explored doing this by fine-tuning BERT's pre-trained weights using ESG-specific text and then further fine-tuning the model for a classification task. We were able to achieve accuracy better than the original BERT and baseline models in environment-specific classification tasks.

* pp. 183-190, 2022. CS & IT - CSCP 2022 

Machine learning based event classification for the energy-differential measurement of the $^\text{nat}$C(n,p) and $^\text{nat}$C(n,d) reactions

Apr 11, 2022
P. Žugec, M. Barbagallo, J. Andrzejewski, J. Perkowski, N. Colonna, D. Bosnar, A. Gawlik, M. Sabate-Gilarte, M. Bacak, F. Mingrone, E. Chiaveri

The paper explores the feasibility of using machine learning techniques, in particular neural networks, for classification of the experimental data from the joint $^\text{nat}$C(n,p) and $^\text{nat}$C(n,d) reaction cross section measurement from the neutron time of flight facility n_TOF at CERN. Each relevant $\Delta E$-$E$ pair of strips from two segmented silicon telescopes is treated separately and afforded its own dedicated neural network. An important part of the procedure is a careful preparation of training datasets, based on the raw data from Geant4 simulations. Instead of using these raw data for the training of neural networks, we divide a relevant 3-parameter space into discrete voxels, classify each voxel according to a particle/reaction type and submit these voxels to a training procedure. The classification capabilities of the structurally optimized and trained neural networks are found to be superior to those of the manually selected cuts.

* 11 pages, 5 figures, 2 tables 
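
The voxel step described in the abstract above (dividing a 3-parameter space into discrete voxels and classifying each voxel by particle/reaction type) can be sketched as follows. The abstract does not say how a voxel's class is decided, so the majority vote used here is an assumption, as are the function names, the uniform binning, and the event format of `(point, label)` pairs.

```python
from collections import Counter, defaultdict


def voxel_index(point, lows, highs, bins):
    """Map a 3-parameter point to integer voxel coordinates, or None if out of range."""
    index = []
    for value, low, high, n in zip(point, lows, highs, bins):
        if not (low <= value < high):
            return None
        # Clamp to n - 1 to guard against floating-point rounding at the upper edge.
        index.append(min(int((value - low) / (high - low) * n), n - 1))
    return tuple(index)


def label_voxels(events, lows, highs, bins):
    """Give each occupied voxel the most common label among its simulated events.

    Majority voting is an assumed rule, not one stated in the abstract.
    """
    votes = defaultdict(Counter)
    for point, label in events:
        idx = voxel_index(point, lows, highs, bins)
        if idx is not None:
            votes[idx][label] += 1
    return {idx: counts.most_common(1)[0][0] for idx, counts in votes.items()}
```

The resulting mapping from voxel coordinates to labels would then serve as the training set, in place of the raw simulated events.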

An alternative text representation to TF-IDF and Bag-of-Words

Jan 28, 2013
Zhixiang Xu, Minmin Chen, Kilian Q. Weinberger, Fei Sha

In text mining, information retrieval, and machine learning, text documents are commonly represented through variants of sparse Bag of Words (sBoW) vectors (e.g. TF-IDF). Although simple and intuitive, sBoW style representations suffer from their inherent over-sparsity and fail to capture word-level synonymy and polysemy. Especially when labeled data is limited (e.g. in document classification), or the text documents are short (e.g. emails or abstracts), many features are rarely observed within the training corpus. This leads to overfitting and reduced generalization accuracy. In this paper we propose Dense Cohort of Terms (dCoT), an unsupervised algorithm to learn improved sBoW document features. dCoT explicitly models absent words by removing and reconstructing random sub-sets of words in the unlabeled corpus. With this approach, dCoT learns to reconstruct frequent words from co-occurring infrequent words and maps the high dimensional sparse sBoW vectors into a low-dimensional dense representation. We show that the feature removal can be marginalized out and that the reconstruction can be solved for in closed-form. We demonstrate empirically, on several benchmark datasets, that dCoT features significantly improve the classification accuracy across several document classification tasks.


Fast Gradient Projection Method for Text Adversary Generation and Adversarial Training

Aug 09, 2020
Xiaosen Wang, Yichen Yang, Yihe Deng, Kun He

Adversarial training has shown effectiveness and efficiency in improving the robustness of deep neural networks for image classification. For text classification, however, the discrete property of the text input space makes it hard to adapt the gradient-based adversarial methods from the image domain. Existing text attack methods, moreover, are effective but not efficient enough to be incorporated into practical text adversarial training. In this work, we propose a Fast Gradient Projection Method (FGPM) to generate text adversarial examples based on synonym substitution, where each substitution is scored by the product of gradient magnitude and the projected distance between the original word and the candidate word in the gradient direction. Empirical evaluations demonstrate that FGPM achieves similar attack performance and transferability when compared with competitive attack baselines, while being about 20 times faster than the current fastest text attack method. Such performance enables us to incorporate FGPM with adversarial training as an effective defense method, and scale to large neural networks and datasets. Experiments show that the adversarial training with FGPM (ATF) significantly improves the model robustness, and blocks the transferability of adversarial examples without any decay on the model generalization.

* 10 pages, 1 figure, 5 tables
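
The substitution score described in the FGPM abstract above (gradient magnitude times the projected distance between the original and candidate word along the gradient direction) can be sketched in a few lines. This is only an illustration of that one scoring rule: the gradients are assumed to be supplied by some model, the embedding and synonym tables are hypothetical inputs, and the function names are not from the paper.

```python
import math


def projected_score(original, candidate, gradient):
    """Gradient magnitude times the signed projection of (candidate - original) onto the gradient."""
    magnitude = math.sqrt(sum(g * g for g in gradient))
    if magnitude == 0.0:
        return 0.0
    projection = sum(
        (c - o) * g for o, c, g in zip(original, candidate, gradient)
    ) / magnitude
    return magnitude * projection


def best_substitution(tokens, embeddings, gradients, synonyms):
    """Return (position, synonym, score) with the highest score, or None if no candidates.

    gradients[pos] is assumed to be the loss gradient with respect to the
    embedding of tokens[pos], computed elsewhere.
    """
    best = None
    for pos, token in enumerate(tokens):
        for cand in synonyms.get(token, []):
            score = projected_score(embeddings[token], embeddings[cand], gradients[pos])
            if best is None or score > best[2]:
                best = (pos, cand, score)
    return best
```

Algebraically the product reduces to the dot product of the embedding difference with the gradient, that is, a first-order estimate of how much the substitution changes the loss. Scoring every candidate this way needs only one gradient computation per step, which is consistent with the speed advantage the abstract reports.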