Classification tasks are usually analysed and improved through new model architectures or hyperparameter optimisation but the underlying properties of datasets are discovered on an ad-hoc basis as errors occur. However, understanding the properties of the data is crucial in perfecting models. In this paper we analyse exactly which characteristics of a dataset best determine how difficult that dataset is for the task of text classification. We then propose an intuitive measure of difficulty for text classification datasets which is simple and fast to calculate. We show that this measure generalises to unseen data by comparing it to state-of-the-art datasets and results. This measure can be used to analyse the precise source of errors in a dataset and allows fast estimation of how difficult a dataset is to learn. We searched for this measure by training 12 classical and neural network based models on 78 real-world datasets, then use a genetic algorithm to discover the best measure of difficulty. Our difficulty-calculating code ( https://github.com/Wluper/edm ) and datasets ( http://data.wluper.com ) are publicly available.
Learning continuous representations from unlabeled textual data has been increasingly studied for benefiting semi-supervised learning. Although it is relatively easier to interpret discrete representations, due to the difficulty of training, learning discrete representations for unlabeled textual data has not been widely explored. This work proposes TIGAN that learns to encode texts into two disentangled representations, including a discrete code and a continuous noise, where the discrete code represents interpretable topics, and the noise controls the variance within the topics. The discrete code learned by TIGAN can be used for unsupervised text classification. Compared to other unsupervised baselines, the proposed TIGAN achieves superior performance on six different corpora. Also, the performance is on par with a recently proposed weakly-supervised text classification method. The extracted topical words for representing latent topics show that TIGAN learns coherent and highly interpretable topics.
Cross-lingual text classification aims at training a classifier on the source language and transferring the knowledge to target languages, which is very useful for low-resource languages. Recent multilingual pretrained language models (mPLM) achieve impressive results in cross-lingual classification tasks, but rarely consider factors beyond semantic similarity, causing performance degradation between some language pairs. In this paper we propose a simple yet effective method to incorporate heterogeneous information within and across languages for cross-lingual text classification using graph convolutional networks (GCN). In particular, we construct a heterogeneous graph by treating documents and words as nodes, and linking nodes with different relations, which include part-of-speech roles, semantic similarity, and document translations. Extensive experiments show that our graph-based method significantly outperforms state-of-the-art models on all tasks, and also achieves consistent performance gain over baselines in low-resource settings where external tools like translators are unavailable.
Text classification tends to be difficult when the data is deficient or when it is required to adapt to unseen classes. In such challenging scenarios, recent studies have often used meta-learning to simulate the few-shot task, thus negating explicit common linguistic features across tasks. Deep language representations have proven to be very effective forms of unsupervised pretraining, yielding contextualized features that capture linguistic properties and benefit downstream natural language understanding tasks. However, the effect of pretrained language representation for few-shot learning on text classification tasks is still not well understood. In this study, we design a few-shot learning model with pretrained language representations and report the empirical results. We show that our approach is not only simple but also produces state-of-the-art performance on a well-studied sentiment classification dataset. It can thus be further suggested that pretraining could be a promising solution for few shot learning of many other NLP tasks. The code and the dataset to replicate the experiments are made available at https://github.com/zxlzr/FewShotNLP.
The rapid development of deep natural language processing (NLP) models for text classification has led to an urgent need for a unified understanding of these models proposed individually. Existing methods cannot meet the need for understanding different models in one framework due to the lack of a unified measure for explaining both low-level (e.g., words) and high-level (e.g., phrases) features. We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification. The key idea is a mutual information-based measure, which provides quantitative explanations on how each layer of a model maintains the information of input words in a sample. We model the intra- and inter-word information at each layer measuring the importance of a word to the final prediction as well as the relationships between words, such as the formation of phrases. A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples. Two case studies on classification tasks and comparison between models demonstrate that DeepNLPVis can help users effectively identify potential problems caused by samples and model architectures and then make informed improvements.
Continual learning has become increasingly important as it enables NLP models to constantly learn and gain knowledge over time. Previous continual learning methods are mainly designed to preserve knowledge from previous tasks, without much emphasis on how to well generalize models to new tasks. In this work, we propose an information disentanglement based regularization method for continual learning on text classification. Our proposed method first disentangles text hidden spaces into representations that are generic to all tasks and representations specific to each individual task, and further regularizes these representations differently to better constrain the knowledge required to generalize. We also introduce two simple auxiliary tasks: next sentence prediction and task-id prediction, for learning better generic and specific representation spaces. Experiments conducted on large-scale benchmarks demonstrate the effectiveness of our method in continual text classification tasks with various sequences and lengths over state-of-the-art baselines. We have publicly released our code at https://github.com/GT-SALT/IDBR.
The growing prosperity of social networks has brought great challenges to the sentimental tendency mining of users. As more and more researchers pay attention to the sentimental tendency of online users, rich research results have been obtained based on the sentiment classification of explicit texts. However, research on the implicit sentiment of users is still in its infancy. Aiming at the difficulty of implicit sentiment classification, a research on implicit sentiment classification model based on deep neural network is carried out. Classification models based on DNN, LSTM, Bi-LSTM and CNN were established to judge the tendency of the user's implicit sentiment text. Based on the Bi-LSTM model, the classification model of word-level attention mechanism is studied. The experimental results on the public dataset show that the established LSTM series classification model and CNN classification model can achieve good sentiment classification effect, and the classification effect is significantly better than the DNN model. The Bi-LSTM based attention mechanism classification model obtained the optimal R value in the positive category identification.
Semi-supervised learning is a promising way to reduce the annotation cost for text-classification. Combining with pre-trained language models (PLMs), e.g., BERT, recent semi-supervised learning methods achieved impressive performance. In this work, we further investigate the marriage between semi-supervised learning and a pre-trained language model. Unlike existing approaches that utilize PLMs only for model parameter initialization, we explore the inherent topic matching capability inside PLMs for building a more powerful semi-supervised learning approach. Specifically, we propose a joint semi-supervised learning process that can progressively build a standard $K$-way classifier and a matching network for the input text and the Class Semantic Representation (CSR). The CSR will be initialized from the given labeled sentences and progressively updated through the training process. By means of extensive experiments, we show that our method can not only bring remarkable improvement to baselines, but also overall be more stable, and achieves state-of-the-art performance in semi-supervised text classification.
Many text classification tasks are domain-dependent, and various domain adaptation approaches have been proposed to predict unlabeled data in a new domain. Domain-adversarial neural networks(DANN) and their variants have been actively used recently and have achieved state-of-the-art results for this problem. However, most of these approaches assume that the label proportions of the source and target domains are similar, which rarely holds in real-world scenarios. Sometimes the label shift is very large and the DANN fails to learn domain-invariant features. In this study, we focus on unsupervised domain adaptation of text classification with label shift and introduce a domain adversarial network with label proportions estimation (DAN-LPE) framework. The DAN-LPEsimultaneously trains a domain adversarial net and processes label proportions estimation by the distributions of the predictions. Experiments show the DAN-LPE achieves a good estimate of the target label distributions and reduces the label shift to improve the classification performance.