Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"Text Classification": models, code, and papers

Classification Analysis Of Authorship Fiction Texts in The Space Of Semantic Fields

Oct 22, 2012
Bohdan Pavlyshenko

The use of naive Bayesian classifier (NB) and the classifier by the k nearest neighbors (kNN) in classification semantic analysis of authors' texts of English fiction has been analysed. The authors' works are considered in the vector space the basis of which is formed by the frequency characteristics of semantic fields of nouns and verbs. Highly precise classification of authors' texts in the vector space of semantic fields indicates about the presence of particular spheres of author's idiolect in this space which characterizes the individual author's style.

* 6 pages, 2 figures 
Access Paper or Ask Questions

Distant finetuning with discourse relations for stance classification

Apr 27, 2022
Lifeng Jin, Kun Xu, Linfeng Song, Dong Yu

Approaches for the stance classification task, an important task for understanding argumentation in debates and detecting fake news, have been relying on models which deal with individual debate topics. In this paper, in order to train a system independent from topics, we propose a new method to extract data with silver labels from raw text to finetune a model for stance classification. The extraction relies on specific discourse relation information, which is shown as a reliable and accurate source for providing stance information. We also propose a 3-stage training framework where the noisy level in the data used for finetuning decreases over different stages going from the most noisy to the least noisy. Detailed experiments show that the automatically annotated dataset as well as the 3-stage training help improve model performance in stance classification. Our approach ranks 1st among 26 competing teams in the stance classification track of the NLPCC 2021 shared task Argumentative Text Understanding for AI Debater, which confirms the effectiveness of our approach.

* NLPCC 2021 
Access Paper or Ask Questions

Deep Health Care Text Classification

Oct 23, 2017
Vinayakumar R, Barathi Ganesh HB, Anand Kumar M, Soman KP

Health related social media mining is a valuable apparatus for the early recognition of the diverse antagonistic medicinal conditions. Mostly, the existing methods are based on machine learning with knowledge-based learning. This working note presents the Recurrent neural network (RNN) and Long short-term memory (LSTM) based embedding for automatic health text classification in the social media mining. For each task, two systems are built and that classify the tweet at the tweet level. RNN and LSTM are used for extracting features and non-linear activation function at the last layer facilitates to distinguish the tweets of different categories. The experiments are conducted on 2nd Social Media Mining for Health Applications Shared Task at AMIA 2017. The experiment results are considerable; however the proposed method is appropriate for the health text classification. This is primarily due to the reason that, it doesn't rely on any feature engineering mechanisms.

* 4 pages 
Access Paper or Ask Questions

LA-HCN: Label-based Attention for Hierarchical Multi-label TextClassification Neural Network

Sep 23, 2020
Xinyi Zhang, Jiahao Xu, Charlie Soh, Lihui Chen

Hierarchical multi-label text classification(HMTC) problems become popular recently because of its practicality. Most existing algorithms for HMTC focus on the design of classifiers, and are largely referred to as local, global, or a combination of local/global approaches. However, a few studies have started exploring hierarchical feature extraction based on the label hierarchy associating with text in HMTC. In this paper, a \textbf{N}eural network-based method called \textbf{LA-HCN} is proposed where a novel \textbf{L}abel-based \textbf{A}ttention module is designed to hierarchically extract important information from the text based on different labels. Besides, local and global document embeddings are separately generated to support the respective local and global classifications. In our experiments, LA-HCN achieves the top performance on the four public HMTC datasets when compared with other neural network-based state-of-the-art algorithms. The comparison between LA-HCN with its variants also demonstrates the effectiveness of the proposed label-based attention module as well as the use of the combination of local and global classifications. By visualizing the learned attention(words), we find LA-HCN is able to extract meaningful but different information from text based on different labels which is helpful for human understanding and explanation of classification results.

Access Paper or Ask Questions

Enhance Long Text Understanding via Distilled Gist Detector from Abstractive Summarization

Oct 10, 2021
Yan Liu, Yazheng Yang

Long text understanding is important yet challenging in natural language processing. A long article or essay usually contains many redundant words that are not pertinent to its gist and sometimes can be regarded as noise. In this paper, we consider the problem of how to disentangle the gist-relevant and irrelevant information for long text understanding. With distillation mechanism, we transfer the knowledge about how to focus the salient parts from the abstractive summarization model and further integrate the distilled model, named \emph{Gist Detector}, into existing models as a supplementary component to augment the long text understanding. Experiments on document classification, distantly supervised open-domain question answering (DS-QA) and non-parallel text style transfer show that our method can significantly improve the performance of the baseline models, and achieves state-of-the-art overall results for document classification.

Access Paper or Ask Questions

Fooling Explanations in Text Classifiers

Jun 07, 2022
Adam Ivankay, Ivan Girardi, Chiara Marchiori, Pascal Frossard

State-of-the-art text classification models are becoming increasingly reliant on deep neural networks (DNNs). Due to their black-box nature, faithful and robust explanation methods need to accompany classifiers for deployment in real-life scenarios. However, it has been shown in vision applications that explanation methods are susceptible to local, imperceptible perturbations that can significantly alter the explanations without changing the predicted classes. We show here that the existence of such perturbations extends to text classifiers as well. Specifically, we introduceTextExplanationFooler (TEF), a novel explanation attack algorithm that alters text input samples imperceptibly so that the outcome of widely-used explanation methods changes considerably while leaving classifier predictions unchanged. We evaluate the performance of the attribution robustness estimation performance in TEF on five sequence classification datasets, utilizing three DNN architectures and three transformer architectures for each dataset. TEF can significantly decrease the correlation between unchanged and perturbed input attributions, which shows that all models and explanation methods are susceptible to TEF perturbations. Moreover, we evaluate how the perturbations transfer to other model architectures and attribution methods, and show that TEF perturbations are also effective in scenarios where the target model and explanation method are unknown. Finally, we introduce a semi-universal attack that is able to compute fast, computationally light perturbations with no knowledge of the attacked classifier nor explanation method. Overall, our work shows that explanations in text classifiers are very fragile and users need to carefully address their robustness before relying on them in critical applications.

* International Conference on Learning Representations, 2022 
Access Paper or Ask Questions

Short Text Classification via Term Graph

Jan 20, 2020
Wei Pang

Short text classi cation is a method for classifying short sentence with prede ned labels. However, short text is limited in shortness in text length that leads to a challenging problem of sparse features. Most of existing methods treat each short sentences as independently and identically distributed (IID), local context only in the sentence itself is focused and the relational information between sentences are lost. To overcome these limitations, we propose a PathWalk model that combine the strength of graph networks and short sentences to solve the sparseness of short text. Experimental results on four different available datasets show that our PathWalk method achieves the state-of-the-art results, demonstrating the efficiency and robustness of graph networks for short text classification.

* 9 pages, 15 figures, Short Text Classification, Term Graph 
Access Paper or Ask Questions

Exploiting Dynamic and Fine-grained Semantic Scope for Extreme Multi-label Text Classification

May 24, 2022
Yuan Wang, Huiling Song, Peng Huo, Tao Xu, Jucheng Yang, Yarui Chen, Tingting Zhao

Extreme multi-label text classification (XMTC) refers to the problem of tagging a given text with the most relevant subset of labels from a large label set. A majority of labels only have a few training instances due to large label dimensionality in XMTC. To solve this data sparsity issue, most existing XMTC methods take advantage of fixed label clusters obtained in early stage to balance performance on tail labels and head labels. However, such label clusters provide static and coarse-grained semantic scope for every text, which ignores distinct characteristics of different texts and has difficulties modelling accurate semantics scope for texts with tail labels. In this paper, we propose a novel framework TReaderXML for XMTC, which adopts dynamic and fine-grained semantic scope from teacher knowledge for individual text to optimize text conditional prior category semantic ranges. TReaderXML dynamically obtains teacher knowledge for each text by similar texts and hierarchical label information in training sets to release the ability of distinctly fine-grained label-oriented semantic scope. Then, TReaderXML benefits from a novel dual cooperative network that firstly learns features of a text and its corresponding label-oriented semantic scope by parallel Encoding Module and Reading Module, secondly embeds two parts by Interaction Module to regularize the text's representation by dynamic and fine-grained label-oriented semantic scope, and finally find target labels by Prediction Module. Experimental results on three XMTC benchmark datasets show that our method achieves new state-of-the-art results and especially performs well for severely imbalanced and sparse datasets.

Access Paper or Ask Questions

A Chinese Text Classification Method With Low Hardware Requirement Based on Improved Model Concatenation

Oct 28, 2020
Yuanhao Zhuo

In order to improve the accuracy performance of Chinese text classification models with low hardware requirements, an improved concatenation-based model is designed in this paper, which is a concatenation of 5 different sub-models, including TextCNN, LSTM, and Bi-LSTM. Compared with the existing ensemble learning method, for a text classification mission, this model's accuracy is 2% higher. Meanwhile, the hardware requirements of this model are much lower than the BERT-based model.

* 5 pages, 2 figures, 5 tables 
Access Paper or Ask Questions

Building for Tomorrow: Assessing the Temporal Persistence of Text Classifiers

May 19, 2022
Rabab Alkhalifa, Elena Kochkina, Arkaitz Zubiaga

Where performance of text classification models drops over time due to changes in data, development of models whose performance persists over time is important. An ability to predict a model's ability to persist over time can help design models that can be effectively used over a longer period of time. In this paper, we look at this problem from a practical perspective by assessing the ability of a wide range of language models and classification algorithms to persist over time, as well as how dataset characteristics can help predict the temporal stability of different models. We perform longitudinal classification experiments on three datasets spanning between 6 and 19 years, and involving diverse tasks and types of data. We find that one can estimate how a model will retain its performance over time based on (i) how well the model performs over a restricted time period and its extrapolation to a longer time period, and (ii) the linguistic characteristics of the dataset, such as the familiarity score between subsets from different years. Findings from these experiments have important implications for the design of text classification models with the aim of preserving performance over time.

Access Paper or Ask Questions