"Text Classification": models, code, and papers

Empirical Study of Text Augmentation on Social Media Text in Vietnamese

Sep 25, 2020
Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

In text classification, label imbalance in a dataset affects the performance of classification models. In practice, user comments on social networking sites are not all visible: administrators often allow only positive comments and hide negative ones. Thus, data collected from user comments on social networks is usually skewed toward one label, which makes the dataset imbalanced and degrades the model's performance. Data augmentation techniques are applied to address the imbalance between classes in the dataset and thereby increase the prediction model's accuracy. In this paper, we apply augmentation techniques to the VLSP2019 Hate Speech Detection dataset of Vietnamese social texts and to UIT-VSFC, the Vietnamese Students' Feedback Corpus for Sentiment Analysis. Augmentation increases the F1-macro score by about 1.5% on both corpora.

* Accepted by The 34th Pacific Asia Conference on Language, Information and Computation 
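
The abstract above does not say which augmentation techniques were used, so the following is only a minimal sketch of the general idea: under-represented classes receive augmented copies until every class matches the largest one. The random word swap, the helper names `random_swap` and `balance_by_augmentation`, and the English toy comments are all assumptions made for illustration, not the authors' method or data.

```python
import random
from collections import Counter

def random_swap(text, rng):
    """Return a copy of text with two randomly chosen words swapped."""
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

def balance_by_augmentation(texts, labels, seed=0):
    """Append augmented copies until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        # Augmented samples are drawn only from the original texts of this class.
        pool = [t for t, l in zip(texts, labels) if l == label]
        for _ in range(target - count):
            out_texts.append(random_swap(rng.choice(pool), rng))
            out_labels.append(label)
    return out_texts, out_labels

# Hypothetical toy data: three positive comments, one negative comment.
texts = ["great lecture today", "very clear slides",
         "helpful teacher overall", "boring and too fast"]
labels = ["positive", "positive", "positive", "negative"]
aug_texts, aug_labels = balance_by_augmentation(texts, labels)
print(Counter(aug_labels))
```

Augmenting only the training split, never the evaluation split, keeps a score such as F1-macro comparable before and after balancing.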

ESGBERT: Language Model to Help with Classification Tasks Related to Companies Environmental, Social, and Governance Practices

Mar 31, 2022
Srishti Mehra, Robert Louka, Yixun Zhang

Environmental, Social, and Governance (ESG) factors are non-financial factors that are garnering attention from investors, who increasingly apply them in their analysis to identify material risks and growth opportunities. Some of this attention is also driven by clients who, now more aware than ever, are demanding that their money be managed and invested responsibly. As the interest in ESG grows, so does the need for investors to have access to consumable ESG information. Since most of it is in text form in reports, disclosures, press releases, and 10-Q filings, we see a need for sophisticated NLP techniques for classification tasks on ESG text. We hypothesize that an ESG domain-specific pre-trained model will help with such tasks, and we study how to build one in this paper. We explore this by fine-tuning BERT's pre-trained weights on ESG-specific text and then further fine-tuning the model for a classification task. We achieve better accuracy than the original BERT and baseline models on environment-specific classification tasks.

* pp. 183-190, 2022. CS & IT - CSCP 2022 

Machine learning based event classification for the energy-differential measurement of the $^\text{nat}$C(n,p) and $^\text{nat}$C(n,d) reactions

Apr 11, 2022
P. Žugec, M. Barbagallo, J. Andrzejewski, J. Perkowski, N. Colonna, D. Bosnar, A. Gawlik, M. Sabate-Gilarte, M. Bacak, F. Mingrone, E. Chiaveri

The paper explores the feasibility of using machine learning techniques, in particular neural networks, for classification of the experimental data from the joint $^\text{nat}$C(n,p) and $^\text{nat}$C(n,d) reaction cross section measurement from the neutron time of flight facility n_TOF at CERN. Each relevant $\Delta E$-$E$ pair of strips from two segmented silicon telescopes is treated separately and afforded its own dedicated neural network. An important part of the procedure is a careful preparation of training datasets, based on the raw data from Geant4 simulations. Instead of using these raw data for the training of neural networks, we divide a relevant 3-parameter space into discrete voxels, classify each voxel according to a particle/reaction type and submit these voxels to a training procedure. The classification capabilities of the structurally optimized and trained neural networks are found to be superior to those of the manually selected cuts.

* 11 pages, 5 figures, 2 tables 
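
The voxel step described in the abstract above can be illustrated with a small sketch: each simulated event is mapped to a voxel of a 3-parameter space, and each voxel receives one class label. The majority-vote rule, the unit voxel widths, the function names, and the toy events below are assumptions for illustration; the abstract does not state how a voxel's label is chosen or what the three parameters are.

```python
from collections import Counter, defaultdict

def voxel_index(point, lows, widths):
    """Map a 3-parameter point to the integer index of the voxel containing it."""
    return tuple(int((x - lo) // w) for x, lo, w in zip(point, lows, widths))

def label_voxels(points, labels, lows, widths):
    """Assign each occupied voxel the most frequent label among its events."""
    votes = defaultdict(Counter)
    for point, label in zip(points, labels):
        votes[voxel_index(point, lows, widths)][label] += 1
    return {voxel: counts.most_common(1)[0][0] for voxel, counts in votes.items()}

# Hypothetical simulated events: three parameters each, plus a particle type.
points = [(0.2, 0.1, 0.3), (0.4, 0.3, 0.1), (0.3, 0.2, 0.2), (1.5, 0.2, 0.1)]
labels = ["proton", "proton", "deuteron", "deuteron"]
voxel_labels = label_voxels(points, labels,
                            lows=(0.0, 0.0, 0.0), widths=(1.0, 1.0, 1.0))
print(voxel_labels)
```

The labelled voxels, rather than the raw events, would then serve as the training set for the network.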

An alternative text representation to TF-IDF and Bag-of-Words

Jan 28, 2013
Zhixiang Xu, Minmin Chen, Kilian Q. Weinberger, Fei Sha

In text mining, information retrieval, and machine learning, text documents are commonly represented through variants of sparse Bag of Words (sBoW) vectors (e.g. TF-IDF). Although simple and intuitive, sBoW style representations suffer from their inherent over-sparsity and fail to capture word-level synonymy and polysemy. Especially when labeled data is limited (e.g. in document classification), or the text documents are short (e.g. emails or abstracts), many features are rarely observed within the training corpus. This leads to overfitting and reduced generalization accuracy. In this paper we propose Dense Cohort of Terms (dCoT), an unsupervised algorithm to learn improved sBoW document features. dCoT explicitly models absent words by removing and reconstructing random sub-sets of words in the unlabeled corpus. With this approach, dCoT learns to reconstruct frequent words from co-occurring infrequent words and maps the high dimensional sparse sBoW vectors into a low-dimensional dense representation. We show that the feature removal can be marginalized out and that the reconstruction can be solved for in closed-form. We demonstrate empirically, on several benchmark datasets, that dCoT features significantly improve the classification accuracy across several document classification tasks.


Fast Gradient Projection Method for Text Adversary Generation and Adversarial Training

Aug 09, 2020
Xiaosen Wang, Yichen Yang, Yihe Deng, Kun He

Adversarial training has shown effectiveness and efficiency in improving the robustness of deep neural networks for image classification. For text classification, however, the discrete nature of the text input space makes it hard to adapt gradient-based adversarial methods from the image domain. Moreover, existing text attack methods are effective but not efficient enough to be incorporated into practical text adversarial training. In this work, we propose a Fast Gradient Projection Method (FGPM) to generate text adversarial examples based on synonym substitution, where each substitution is scored by the product of the gradient magnitude and the projected distance between the original word and the candidate word in the gradient direction. Empirical evaluations demonstrate that FGPM achieves attack performance and transferability similar to competitive attack baselines while being about 20 times faster than the current fastest text attack method. Such performance enables us to incorporate FGPM into adversarial training as an effective defense method and to scale to large neural networks and datasets. Experiments show that adversarial training with FGPM (ATF) significantly improves model robustness and blocks the transferability of adversarial examples without any decay in model generalization.

* 10 pages, 1 figure, 5 tables 
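
The scoring rule stated in the abstract above, gradient magnitude times the projected distance between the original and candidate word along the gradient direction, can be sketched as follows. The two-dimensional embeddings, the gradient values, and the synonym list are hypothetical, and the sketch covers only the score for a single word position, not the full attack or the adversarial training loop.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def substitution_score(orig_vec, cand_vec, grad):
    """Gradient magnitude times the projection of (cand - orig) onto the gradient direction."""
    grad_norm = math.sqrt(dot(grad, grad))
    if grad_norm == 0.0:
        return 0.0
    diff = [c - o for c, o in zip(cand_vec, orig_vec)]
    projected_distance = dot(diff, grad) / grad_norm
    return grad_norm * projected_distance

def best_substitution(orig_vec, candidates, grad):
    """Return the candidate word with the highest substitution score."""
    return max(candidates,
               key=lambda word: substitution_score(orig_vec, candidates[word], grad))

# Hypothetical values: embedding of the original word, the loss gradient with
# respect to that embedding, and embeddings of three synonym candidates.
orig = [1.0, 0.0]
grad = [0.0, 2.0]
synonyms = {"fine": [1.0, 0.5], "nice": [0.5, 1.5], "okay": [2.0, -1.0]}
chosen = best_substitution(orig, synonyms, grad)
print(chosen)
```

Because the norm cancels, the score equals the dot product of the embedding difference with the gradient, which is why it needs only one gradient computation per input rather than one model query per candidate.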

From text saliency to linguistic objects: learning linguistic interpretable markers with a multi-channels convolutional architecture

Apr 07, 2020
Laurent Vanni, Marco Corneli, Damon Mayaffre, Frédéric Precioso

A lot of effort is currently being made to provide methods for analyzing and understanding the impressive performance of deep neural networks on tasks such as image or text classification. These methods are mainly based on visualizing the important input features that the network takes into account to reach a decision. However, these techniques, such as LIME, SHAP, Grad-CAM, or TDS, require extra effort to interpret the visualization with respect to expert knowledge. In this paper, we propose a novel approach that inspects the hidden layers of a fitted CNN in order to extract interpretable linguistic objects from texts by exploiting the classification process. In particular, we detail a weighted extension of the Text Deconvolution Saliency (wTDS) measure, which can be used to highlight the relevant features used by the CNN to perform the classification task. We empirically demonstrate the efficiency of our approach on corpora from two different languages: English and French. On all datasets, wTDS automatically encodes complex linguistic objects based on co-occurrences and possibly on grammatical and syntactic analysis.

* 7 pages, 22 figures 

Optimizing Stochastic Gradient Descent in Text Classification Based on Fine-Tuning Hyper-Parameters Approach. A Case Study on Automatic Classification of Global Terrorist Attacks

Feb 23, 2019
Shadi Diab

The objective of this research is to enhance the performance of the Stochastic Gradient Descent (SGD) algorithm in text classification. We propose using SGD learning with a grid-search approach to fine-tune hyper-parameters in order to enhance the performance of SGD classification. As a pre-classification step, we explored different settings for representing, transforming, and weighting features from the summary descriptions of terrorist attack incidents obtained from the Global Terrorism Database. We validated SGD learning on Support Vector Machine (SVM), Logistic Regression, and Perceptron classifiers by stratified 10-fold cross-validation to compare the performance of the different classifiers embedded in the SGD algorithm. The research concludes that using grid search to find the hyper-parameters optimizes SGD classification, not only in the pre-classification settings but also in the performance of the classifiers in terms of accuracy and execution time.

* International Journal of Computer Science and Information Security (IJCSIS),Vol. 16, No. 12, December 2018 
* 6 pages, 3 figures, 7 tables, Journal Article 
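
As a rough illustration of the procedure in the abstract above, the sketch below runs a grid search over SGD hyper-parameters and scores each setting with stratified 10-fold cross-validation. It is a hand-rolled, dependency-free stand-in, not the paper's setup: the grid values, the two-feature numeric toy data (in place of text features), and every function name are assumptions, and only the hinge (linear SVM) and perceptron losses are included.

```python
import itertools
import random

def sgd_train(X, y, loss, lr, epochs, seed=0):
    """Fit a linear classifier with plain SGD; labels must be -1 or +1."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # Hinge loss updates inside the margin; the perceptron only on mistakes.
            needs_update = margin < 1.0 if loss == "hinge" else margin <= 0.0
            if needs_update:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 according to the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def stratified_folds(y, k):
    """Deal the indices of each class round-robin into k folds."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        members = [i for i, yi in enumerate(y) if yi == label]
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

def cv_accuracy(X, y, k, **params):
    """Accuracy pooled over k stratified held-out folds."""
    correct = 0
    for held_out in stratified_folds(y, k):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        w, b = sgd_train([X[i] for i in train], [y[i] for i in train], **params)
        correct += sum(predict(w, b, X[i]) == y[i] for i in held_out)
    return correct / len(y)

def grid_search(X, y, grid, k):
    """Try every combination in the grid; keep the first one with the best score."""
    keys = sorted(grid)
    best = None
    for values in itertools.product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        score = cv_accuracy(X, y, k, **params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Hypothetical toy data: two features per sample, ten samples per class.
X = [[2.0 + 0.1 * i, 1.0] for i in range(10)] + [[-2.0 - 0.1 * i, 1.0] for i in range(10)]
y = [1] * 10 + [-1] * 10
grid = {"loss": ["hinge", "perceptron"], "lr": [0.01, 0.1, 1.0], "epochs": [5, 20]}
best_score, best_params = grid_search(X, y, grid, k=10)
print(best_score, best_params)
```

The toy data is trivially separable, so the settings are not expected to differ much here; on real text features the cross-validated scores would separate the grid points.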

OSACT4 Shared Task on Offensive Language Detection: Intensive Preprocessing-Based Approach

May 14, 2020
Fatemah Husain

The preprocessing phase is one of the key phases of the text classification pipeline. This study investigates the impact of the preprocessing phase on text classification, specifically on offensive language and hate speech classification for Arabic text. The Arabic used in social media is informal and written in Arabic dialects, which makes the text classification task very complex. Preprocessing helps with dimensionality reduction and with removing useless content. We apply intensive preprocessing techniques to the dataset before processing it further and feeding it into the classification model. The intensive preprocessing-based approach demonstrates a significant impact on the offensive language detection and hate speech detection shared tasks of the fourth workshop on Open-Source Arabic Corpora and Corpora Processing Tools (OSACT). Our team won third place in Sub-Task A (Offensive Language Detection) and first place in Sub-Task B (Hate Speech Detection), with F1 scores of 89% and 95%, respectively, achieving state-of-the-art performance in terms of F1, accuracy, recall, and precision for Arabic hate speech detection.

* Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France (2020) 

Estimating Confidence of Predictions of Individual Classifiers and Their Ensembles for the Genre Classification Task

Jun 15, 2022
Mikhail Lepekhin, Serge Sharoff

Genre identification is a subclass of non-topical text classification. The main difference between this task and topical classification is that genres, unlike topics, usually do not correspond to simple keywords, and thus they need to be defined in terms of their functions in communication. Neural models based on pre-trained transformers, such as BERT or XLM-RoBERTa, demonstrate SOTA results in many NLP tasks, including non-topical classification. However, in many cases, their downstream application to very large corpora, such as those extracted from social media, can lead to unreliable results because of dataset shifts, when some raw texts do not match the profile of the training set. To mitigate this problem, we experiment with individual models as well as with their ensembles. To evaluate the robustness of all models, we use a prediction confidence metric, which estimates the reliability of a prediction in the absence of a gold standard label. We can evaluate robustness via the confidence gap between the correctly classified texts and the misclassified ones on a labeled test corpus; higher gaps make it easier to be confident that our classifier made the right decision. Our results show that for all of the classifiers tested in this study, there is a confidence gap, but for the ensembles, the gap is bigger, meaning that ensembles are more robust than their individual models.

* Published at LREC, 
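
The confidence gap described in the abstract above can be computed along the following lines. This is a sketch under stated assumptions: confidence is taken to be the highest predicted class probability, the ensemble is a plain average of the models' probabilities, and the two toy models and their numbers are invented. The abstract does not specify the paper's confidence metric or ensembling scheme, and the toy values are not meant to reproduce its findings.

```python
from statistics import mean

def confidence_gap(probs, gold):
    """Mean confidence on correct predictions minus mean confidence on errors."""
    correct, wrong = [], []
    for p, g in zip(probs, gold):
        pred = max(range(len(p)), key=p.__getitem__)
        (correct if pred == g else wrong).append(p[pred])
    if not correct or not wrong:
        return None  # The gap is undefined unless both groups are non-empty.
    return mean(correct) - mean(wrong)

def ensemble(prob_sets):
    """Average the class probabilities of several models, text by text."""
    return [[mean(col) for col in zip(*per_text)] for per_text in zip(*prob_sets)]

# Hypothetical class probabilities from two models on four labeled texts.
gold = [0, 1, 0, 1]
model_a = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]
model_b = [[0.8, 0.2], [0.3, 0.7], [0.45, 0.55], [0.1, 0.9]]
combined = ensemble([model_a, model_b])

for name, probs in [("model_a", model_a), ("model_b", model_b), ("ensemble", combined)]:
    print(name, confidence_gap(probs, gold))
```

A larger gap means a confidence threshold can separate likely-correct predictions from likely-wrong ones more cleanly when no gold label is available.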