In this paper, we explore SPPIM-based text classification method, and the experiment reveals that the SPPIM method is equal to or even superior than SGNS method in text classification task on three international and standard text datasets, namely 20newsgroups, Reuters52 and WebKB. Comparing to SGNS, although SPPMI provides a better solution, it is not necessarily better than SGNS in text classification tasks. Based on our analysis, SGNS takes into the consideration of weight calculation during decomposition process, so it has better performance than SPPIM in some standard datasets. Inspired by this, we propose a WL-SPPIM semantic model based on SPPIM model, and experiment shows that WL-SPPIM approach has better classification and higher scalability in the text classification task compared with LDA, SGNS and SPPIM approaches.
Text classification is fundamental in natural language processing (NLP), and Graph Neural Networks (GNN) are recently applied in this task. However, the existing graph-based works can neither capture the contextual word relationships within each document nor fulfil the inductive learning of new words. In this work, to overcome such problems, we propose TextING for inductive text classification via GNN. We first build individual graphs for each document and then use GNN to learn the fine-grained word representations based on their local structures, which can also effectively produce embeddings for unseen words in the new document. Finally, the word nodes are aggregated as the document embedding. Extensive experiments on four benchmark datasets show that our method outperforms state-of-the-art text classification methods.
A substantial amount of research has been carried out in developing machine learning algorithms that account for term dependence in text classification. These algorithms offer acceptable performance in most cases but they are associated with a substantial cost. They require significantly greater resources to operate. This paper argues against the justification of the higher costs of these algorithms, based on their performance in text classification problems. In order to prove the conjecture, the performance of one of the best dependence models is compared to several well established algorithms in text classification. A very specific collection of datasets have been designed, which would best reflect the disparity in the nature of text data, that are present in real world applications. The results show that even one of the best term dependence models, performs decent at best when compared to other independence models. Coupled with their substantially greater requirement for hardware resources for operation, this makes them an impractical choice for being used in real world scenarios.
Few-shot text classification is a fundamental NLP task in which a model aims to classify text into a large number of categories, given only a few training examples per category. This paper explores data augmentation -- a technique particularly suitable for training with limited data -- for this few-shot, highly-multiclass text classification setting. On four diverse text classification tasks, we find that common data augmentation techniques can improve the performance of triplet networks by up to 3.0% on average. To further boost performance, we present a simple training strategy called curriculum data augmentation, which leverages curriculum learning by first training on only original examples and then introducing augmented data as training progresses. We explore a two-stage and a gradual schedule, and find that, compared with standard single-stage training, curriculum data augmentation trains faster, improves performance, and remains robust to high amounts of noising from augmentation.
Text classification systems will help to solve the text clustering problem in the Azerbaijani language. There are some text-classification applications for foreign languages, but we tried to build a newly developed system to solve this problem for the Azerbaijani language. Firstly, we tried to find out potential practice areas. The system will be useful in a lot of areas. It will be mostly used in news feed categorization. News websites can automatically categorize news into classes such as sports, business, education, science, etc. The system is also used in sentiment analysis for product reviews. For example, the company shares a photo of a new product on Facebook and the company receives a thousand comments for new products. The systems classify the comments into categories like positive or negative. The system can also be applied in recommended systems, spam filtering, etc. Various machine learning techniques such as Naive Bayes, SVM, Decision Trees have been devised to solve the text classification problem in Azerbaijani language.
With the recent proliferation of the use of text classifications, researchers have found that there are certain unintended biases in text classification datasets. For example, texts containing some demographic identity-terms (e.g., "gay", "black") are more likely to be abusive in existing abusive language detection datasets. As a result, models trained with these datasets may consider sentences like "She makes me happy to be gay" as abusive simply because of the word "gay." In this paper, we formalize the unintended biases in text classification datasets as a kind of selection bias from the non-discrimination distribution to the discrimination distribution. Based on this formalization, we further propose a model-agnostic debiasing training framework by recovering the non-discrimination distribution using instance weighting, which does not require any extra resources or annotations apart from a pre-defined set of demographic identity-terms. Experiments demonstrate that our method can effectively alleviate the impacts of the unintended biases without significantly hurting models' generalization ability.
Text classification is one of the fundamental tasks in natural language processing to label an open-ended text and is useful for various applications such as sentiment analysis. In this paper, we discuss various classification approaches for Khmer text, ranging from a classical TF-IDF algorithm with support vector machine classifier to modern word embedding-based neural network classifiers including linear layer model, recurrent neural network and convolutional neural network. A Khmer word embedding model is trained on a 30-million-Khmer-word corpus to construct word vector representations that are used to train three different neural network classifiers. We evaluate the performance of different approaches on a news article dataset for both multi-class and multi-label text classification tasks. The result suggests that neural network classifiers using a word embedding model consistently outperform the traditional classifier using TF-IDF. The recurrent neural network classifier provides a slightly better result compared to the convolutional network and the linear layer network.
Extreme Multilabel Text Classification (XMTC) is a text classification problem in which, (i) the output space is extremely large, (ii) each data point may have multiple positive labels, and (iii) the data follows a strongly imbalanced distribution. With applications in recommendation systems and automatic tagging of web-scale documents, the research on XMTC has been focused on improving prediction accuracy and dealing with imbalanced data. However, the robustness of deep learning based XMTC models against adversarial examples has been largely underexplored. In this paper, we investigate the behaviour of XMTC models under adversarial attacks. To this end, first, we define adversarial attacks in multilabel text classification problems. We categorize attacking multilabel text classifiers as (a) positive-targeted, where the target positive label should fall out of top-k predicted labels, and (b) negative-targeted, where the target negative label should be among the top-k predicted labels. Then, by experiments on APLC-XLNet and AttentionXML, we show that XMTC models are highly vulnerable to positive-targeted attacks but more robust to negative-targeted ones. Furthermore, our experiments show that the success rate of positive-targeted adversarial attacks has an imbalanced distribution. More precisely, tail classes are highly vulnerable to adversarial attacks for which an attacker can generate adversarial samples with high similarity to the actual data-points. To overcome this problem, we explore the effect of rebalanced loss functions in XMTC where not only do they increase accuracy on tail classes, but they also improve the robustness of these classes against adversarial attacks. The code for our experiments is available at https://github.com/xmc-aalto/adv-xmtc
Nowadays, an abundance of short text is being generated that uses nonstandard writing styles influenced by regional languages. Such informal and code-switched content are under-resourced in terms of labeled datasets and language models even for popular tasks like sentiment classification. In this work, we (1) present a labeled dataset called MultiSenti for sentiment classification of code-switched informal short text, (2) explore the feasibility of adapting resources from a resource-rich language for an informal one, and (3) propose a deep learning-based model for sentiment classification of code-switched informal short text. We aim to achieve this without any lexical normalization, language translation, or code-switching indication. The performance of the proposed models is compared with three existing multilingual sentiment classification models. The results show that the proposed model performs better in general and adapting character-based embeddings yield equivalent performance while being computationally more efficient than training word-based domain-specific embeddings.