Many classification models work poorly on short texts due to data sparsity. To address this issue, we propose topic memory networks for short text classification with a novel topic memory mechanism to encode latent topic representations indicative of class labels. Different from most prior work that focuses on extending features with external knowledge or pre-trained topics, our model jointly explores topic inference and text classification with memory networks in an end-to-end manner. Experimental results on four benchmark datasets show that our model outperforms state-of-the-art models on short text classification, meanwhile generates coherent topics.
In the short text, the extreme short length, feature sparsity and high ambiguity pose huge challenge to classification tasks. Recently, as an effective method for tuning Pre-trained Language Models for specific downstream tasks, prompt-learning has attracted vast amount of attention and research. The main intuition behind the prompt-learning is to insert template into the input and convert the text classification tasks into equivalent cloze-style tasks. However, most prompt-learning methods expand label words manually or only consider the class name for knowledge incorporating in cloze-style prediction, which will inevitably incurred omissions and bias in classification tasks. In this paper, we propose a simple short text classification approach that makes use of prompt-learning based on knowledgeable expansion, which can consider both the short text itself and class name during expanding label words space. Specifically, the top $N$ concepts related to the entity in short text are retrieved from the open Knowledge Graph like Probase, and we further refine the expanded label words by the distance calculation between selected concepts and class label. Experimental results show that our approach obtains obvious improvement compared with other fine-tuning, prompt-learning and knowledgeable prompt-tuning methods, outperforming the state-of-the-art by up to 6 Accuracy points on three well-known datasets.
Text classification is one of the most important and fundamental tasks in natural language processing. Performance of this task mainly dependents on text representation learning. Currently, most existing learning frameworks mainly focus on encoding local contextual information between words. These methods always neglect to exploit global clues, such as label information, for encoding text information. In this study, we propose a label-guided learning framework LguidedLearn for text representation and classification. Our method is novel but simple that we only insert a label-guided encoding layer into the commonly used text representation learning schemas. That label-guided layer performs label-based attentive encoding to map the universal text embedding (encoded by a contextual information learner) into different label spaces, resulting in label-wise embeddings. In our proposed framework, the label-guided layer can be easily and directly applied with a contextual encoding method to perform jointly learning. Text information is encoded based on both the local contextual information and the global label clues. Therefore, the obtained text embeddings are more robust and discriminative for text classification. Extensive experiments are conducted on benchmark datasets to illustrate the effectiveness of our proposed method.
Text classification is the most basic natural language processing task. It has a wide range of applications ranging from sentiment analysis to topic classification. Recently, deep learning approaches based on CNN, LSTM, and Transformers have been the de facto approach for text classification. In this work, we highlight a common issue associated with these approaches. We show that these systems are over-reliant on the important words present in the text that are useful for classification. With limited training data and discriminative training strategy, these approaches tend to ignore the semantic meaning of the sentence and rather just focus on keywords or important n-grams. We propose a simple black box technique ShutText to present the shortcomings of the model and identify the over-reliance of the model on keywords. This involves randomly shuffling the words in a sentence and evaluating the classification accuracy. We see that on common text classification datasets there is very little effect of shuffling and with high probability these models predict the original class. We also evaluate the effect of language model pretraining on these models and try to answer questions around model robustness to out of domain sentences. We show that simple models based on CNN or LSTM as well as complex models like BERT are questionable in terms of their syntactic and semantic understanding.
Deep learning models such as convolutional neural networks and recurrent networks are widely applied in text classification. In spite of their great success, most deep learning models neglect the importance of modeling context information, which is crucial to understanding texts. In this work, we propose the Adaptive Region Embedding to learn context representation to improve text classification. Specifically, a metanetwork is learned to generate a context matrix for each region, and each word interacts with its corresponding context matrix to produce the regional representation for further classification. Compared to previous models that are designed to capture context information, our model contains less parameters and is more flexible. We extensively evaluate our method on 8 benchmark datasets for text classification. The experimental results prove that our method achieves state-of-the-art performances and effectively avoids word ambiguity.
Text-based communication is highly favoured as a communication method, especially in business environments. As a result, it is often abused by sending malicious messages, e.g., spam emails, to deceive users into relaying personal information, including online accounts credentials or banking details. For this reason, many machine learning methods for text classification have been proposed and incorporated into the services of most email providers. However, optimising text classification algorithms and finding the right tradeoff on their aggressiveness is still a major research problem. We present an updated survey of 12 machine learning text classifiers applied to a public spam corpus. A new pipeline is proposed to optimise hyperparameter selection and improve the models' performance by applying specific methods (based on natural language processing) in the preprocessing stage. Our study aims to provide a new methodology to investigate and optimise the effect of different feature sizes and hyperparameters in machine learning classifiers that are widely used in text classification problems. The classifiers are tested and evaluated on different metrics including F-score (accuracy), precision, recall, and run time. By analysing all these aspects, we show how the proposed pipeline can be used to achieve a good accuracy towards spam filtering on the Enron dataset, a widely used public email corpus. Statistical tests and explainability techniques are applied to provide a robust analysis of the proposed pipeline and interpret the classification outcomes of the 12 machine learning models, also identifying words that drive the classification results. Our analysis shows that it is possible to identify an effective machine learning model to classify the Enron dataset with an F-score of 94%.
Text classification has been one of the major problems in natural language processing. With the advent of deep learning, convolutional neural network (CNN) has been a popular solution to this task. However, CNNs which were first proposed for images, face many crucial challenges in the context of text processing, namely in their elementary blocks: convolution filters and max pooling. These challenges have largely been overlooked by the most existing CNN models proposed for text classification. In this paper, we present an experimental study on the fundamental blocks of CNNs in text categorization. Based on this critique, we propose Sequential Convolutional Attentive Recurrent Network (SCARN). The proposed SCARN model utilizes both the advantages of recurrent and convolutional structures efficiently in comparison to previously proposed recurrent convolutional models. We test our model on different text classification datasets across tasks like sentiment analysis and question classification. Extensive experiments establish that SCARN outperforms other recurrent convolutional architectures with significantly less parameters. Furthermore, SCARN achieves better performance compared to equally large various deep CNN and LSTM architectures.
Text classification, as the task consisting in assigning categories to textual instances, is a very common task in information science. Methods learning distributed representations of words, such as word embeddings, have become popular in recent years as the features to use for text classification tasks. Despite the increasing use of word embeddings for text classification, these are generally used in an unsupervised manner, i.e. information derived from class labels in the training data are not exploited. While word embeddings inherently capture the distributional characteristics of words, and contexts observed around them in a large dataset, they aren't optimised to consider the distributions of words across categories in the classification dataset at hand. To optimise text representations based on word embeddings by incorporating class distributions in the training data, we propose the use of weighting schemes that assign a weight to embeddings of each word based on its saliency in each class. To achieve this, we introduce a novel weighting scheme, Term Frequency-Category Ratio (TF-CR), which can weight high-frequency, category-exclusive words higher when computing word embeddings. Our experiments on 16 classification datasets show the effectiveness of TF-CR, leading to improved performance scores over existing weighting schemes, with a performance gap that increases as the size of the training data grows.
Most work in text classification and Natural Language Processing (NLP) focuses on English or a handful of other languages that have text corpora of hundreds of millions of words. This is creating a new version of the digital divide: the artificial intelligence (AI) divide. Transfer-based approaches, such as Cross-Lingual Text Classification (CLTC) - the task of categorizing texts written in different languages into a common taxonomy, are a promising solution to the emerging AI divide. Recent work on CLTC has focused on demonstrating the benefits of using bilingual word embeddings as features, relegating the CLTC problem to a mere benchmark based on a simple averaged perceptron. In this paper, we explore more extensively and systematically two flavors of the CLTC problem: news topic classification and textual churn intent detection (TCID) in social media. In particular, we test the hypothesis that embeddings with context are more effective, by multi-tasking the learning of multilingual word embeddings and text classification; we explore neural architectures for CLTC; and we move from bi- to multi-lingual word embeddings. For all architectures, types of word embeddings and datasets, we notice a consistent gain trend in favor of multilingual joint training, especially for low-resourced languages.