We propose a Label Propagation based algorithm for weakly supervised text classification. We construct a graph where each document is represented by a node and edge weights represent similarities among the documents. Additionally, we discover underlying topics using Latent Dirichlet Allocation (LDA) and enrich the document graph by including the topics in the form of additional nodes. The edge weights between a topic and a text document represent level of "affinity" between them. Our approach does not require document level labelling, instead it expects manual labels only for topic nodes. This significantly minimizes the level of supervision needed as only a few topics are observed to be enough for achieving sufficiently high accuracy. The Label Propagation Algorithm is employed on this enriched graph to propagate labels among the nodes. Our approach combines the advantages of Label Propagation (through document-document similarities) and Topic Modelling (for minimal but smart supervision). We demonstrate the effectiveness of our approach on various datasets and compare with state-of-the-art weakly supervised text classification approaches.
Text classification is a significant branch of natural language processing, and has many applications including document classification and sentiment analysis. Unsurprisingly, those who do text classification are concerned with the run-time of their algorithms, many of which depend on the size of the corpus' vocabulary due to their bag-of-words representation. Although many studies have examined the effect of preprocessing techniques on vocabulary size and accuracy, none have examined how these methods affect a model's run-time. To fill this gap, we provide a comprehensive study that examines how preprocessing techniques affect the vocabulary size, model performance, and model run-time, evaluating ten techniques over four models and two datasets. We show that some individual methods can reduce run-time with no loss of accuracy, while some combinations of methods can trade 2-5% of the accuracy for up to a 65% reduction of run-time. Furthermore, some combinations of preprocessing techniques can even provide a 15% reduction in run-time while simultaneously improving model accuracy.
Text classification is one of the most frequent tasks for processing textual data, facilitating among others research from large-scale datasets. Embeddings of different kinds have recently become the de facto standard as features used for text classification. These embeddings have the capacity to capture meanings of words inferred from occurrences in large external collections. While they are built out of external collections, they are unaware of the distributional characteristics of words in the classification dataset at hand, including most importantly the distribution of words across classes in training data. To make the most of these embeddings as features and to boost the performance of classifiers using them, we introduce a weighting scheme, Term Frequency-Category Ratio (TF-CR), which can weight high-frequency, category-exclusive words higher when computing word embeddings. Our experiments on eight datasets show the effectiveness of TF-CR, leading to improved performance scores over the well-known weighting schemes TF-IDF and KLD as well as over the absence of a weighting scheme in most cases.
Text classification struggles to generalize to unseen classes with very few labeled text instances per class. In such a few-shot learning (FSL) setting, metric-based meta-learning approaches have shown promising results. Previous studies mainly aim to derive a prototype representation for each class. However, they neglect that it is challenging-yet-unnecessary to construct a compact representation which expresses the entire meaning for each class. They also ignore the importance to capture the inter-dependency between query and the support set for few-shot text classification. To deal with these issues, we propose a meta-learning based method MGIMN which performs instance-wise comparison followed by aggregation to generate class-wise matching vectors instead of prototype learning. The key of instance-wise comparison is the interactive matching within the class-specific context and episode-specific context. Extensive experiments demonstrate that the proposed method significantly outperforms the existing state-of-the-art approaches, under both the standard FSL and generalized FSL settings.
Many text classification tasks are domain-dependent, and various domain adaptation approaches have been proposed to predict unlabeled data in a new domain. Domain-adversarial neural networks (DANN) and their variants have been used widely recently and have achieved promising results for this problem. However, most of these approaches assume that the label proportions of the source and target domains are similar, which rarely holds in most real-world scenarios. Sometimes the label shift can be large and the DANN fails to learn domain-invariant features. In this study, we focus on unsupervised domain adaptation of text classification with label shift and introduce a domain adversarial network with label proportions estimation (DAN-LPE) framework. The DAN-LPE simultaneously trains a domain adversarial net and processes label proportions estimation by the confusion of the source domain and the predictions of the target domain. Experiments show the DAN-LPE achieves a good estimate of the target label distributions and reduces the label shift to improve the classification performance.
With the increase of information, document classification as one of the methods of text mining, plays vital role in many management and organizing information. Document classification is the process of assigning a document to one or more predefined category labels. Document classification includes different parts such as text processing, term selection, term weighting and final classification. The accuracy of document classification is very important. Thus improvement in each part of classification should lead to better results and higher precision. Term weighting has a great impact on the accuracy of the classification. Most of the existing weighting methods exploit the statistical information of terms in documents and do not consider semantic relations between words. In this paper, an automated document classification system is presented that uses a novel term weighting method based on semantic relations between terms. To evaluate the proposed method, three standard Persian corpuses are used. Experiment results show 2 to 4 percent improvement in classification accuracy compared with the best previous designed system for Persian documents.
Book covers are usually the very first impression to its readers and they often convey important information about the content of the book. Book genre classification based on its cover would be utterly beneficial to many modern retrieval systems, considering that the complete digitization of books is an extremely expensive task. At the same time, it is also an extremely challenging task due to the following reasons: First, there exists a wide variety of book genres, many of which are not concretely defined. Second, book covers, as graphic designs, vary in many different ways such as colors, styles, textual information, etc, even for books of the same genre. Third, book cover designs may vary due to many external factors such as country, culture, target reader populations, etc. With the growing competitiveness in the book industry, the book cover designers and typographers push the cover designs to its limit in the hope of attracting sales. The cover-based book classification systems become a particularly exciting research topic in recent years. In this paper, we propose a multi-modal deep learning framework to solve this problem. The contribution of this paper is four-fold. First, our method adds an extra modality by extracting texts automatically from the book covers. Second, image-based and text-based, state-of-the-art models are evaluated thoroughly for the task of book cover classification. Third, we develop an efficient and salable multi-modal framework based on the images and texts shown on the covers only. Fourth, a thorough analysis of the experimental results is given and future works to improve the performance is suggested. The results show that the multi-modal framework significantly outperforms the current state-of-the-art image-based models. However, more efforts and resources are needed for this classification task in order to reach a satisfactory level.
In this paper, we propose a novel approach for text classification based on clustering word embeddings, inspired by the bag of visual words model, which is widely used in computer vision. After each word in a collection of documents is represented as word vector using a pre-trained word embeddings model, a k-means algorithm is applied on the word vectors in order to obtain a fixed-size set of clusters. The centroid of each cluster is interpreted as a super word embedding that embodies all the semantically related word vectors in a certain region of the embedding space. Every embedded word in the collection of documents is then assigned to the nearest cluster centroid. In the end, each document is represented as a bag of super word embeddings by computing the frequency of each super word embedding in the respective document. We also diverge from the idea of building a single vocabulary for the entire collection of documents, and propose to build class-specific vocabularies for better performance. Using this kind of representation, we report results on two text mining tasks, namely text categorization by topic and polarity classification. On both tasks, our model yields better performance than the standard bag of words.
Research has shown that Convolutional Neural Networks (CNN) can be effectively applied to text classification as part of a predictive coding protocol. That said, most research to date has been conducted on data sets with short documents that do not reflect the variety of documents in real world document reviews. Using data from four actual reviews with documents of varying lengths, we compared CNN with other popular machine learning algorithms for text classification, including Logistic Regression, Support Vector Machine, and Random Forest. For each data set, classification models were trained with different training sample sizes using different learning algorithms. These models were then evaluated using a large randomly sampled test set of documents, and the results were compared using precision and recall curves. Our study demonstrates that CNN performed well, but that there was no single algorithm that performed the best across the combination of data sets and training sample sizes. These results will help advance research into the legal profession's use of machine learning algorithms that maximize performance.