
"Text Classification": models, code, and papers

A Label Proportions Estimation Technique for Adversarial Domain Adaptation in Text Classification

Mar 26, 2020
Zhuohao Chen, Singla Karan, David C. Atkins, Zac E Imel, Shrikanth Narayanan

Many text classification tasks are domain-dependent, and various domain adaptation approaches have been proposed to predict unlabeled data in a new domain. Domain-adversarial neural networks (DANN) and their variants have been widely used recently and have achieved promising results for this problem. However, most of these approaches assume that the label proportions of the source and target domains are similar, which rarely holds in real-world scenarios. When the label shift is large, the DANN fails to learn domain-invariant features. In this study, we focus on unsupervised domain adaptation for text classification under label shift and introduce a domain adversarial network with label proportions estimation (DAN-LPE) framework. DAN-LPE simultaneously trains a domain adversarial net and estimates label proportions from the confusion matrix on the source domain and the predictions on the target domain. Experiments show that DAN-LPE achieves a good estimate of the target label distribution and reduces the label shift, improving classification performance.
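The core label-shift idea can be sketched in a few lines. This is a simplified illustration of estimating target label proportions from a source-domain confusion matrix and target predictions (in the spirit of black-box shift estimation), not the authors' exact DAN-LPE procedure; the matrices below are toy values.

```python
import numpy as np

# Hypothetical 3-class setup. C[i, j] = P(predict j | true class i),
# estimated on held-out source-domain data (the model's "confusion").
C = np.array([
    [0.80, 0.10, 0.10],
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
])

# q[j] = fraction of target-domain examples the model predicts as class j.
q = np.array([0.5, 0.2, 0.3])

# Under label shift the prediction distribution satisfies q = C^T p,
# so the target label proportions p can be recovered from the linear
# system (least squares is used here for numerical robustness).
p, *_ = np.linalg.lstsq(C.T, q, rcond=None)
p = np.clip(p, 0, None)
p /= p.sum()  # renormalize to a valid probability distribution
```

The estimated `p` can then be used to reweight classes or correct predictions on the target domain.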

* add a proposition and a proof of it, correct typos 

Improving Persian Document Classification Using Semantic Relations between Words

Dec 28, 2014
Saeed Parseh, Ahmad Baraani

With the growth of information, document classification, as one of the methods of text mining, plays a vital role in managing and organizing information. Document classification is the process of assigning a document to one or more predefined category labels. It comprises several stages, including text processing, term selection, term weighting, and final classification. Since the accuracy of document classification is very important, improvement in each stage should lead to better results and higher precision. Term weighting has a great impact on classification accuracy. Most existing weighting methods exploit the statistical information of terms in documents and do not consider semantic relations between words. In this paper, an automated document classification system is presented that uses a novel term weighting method based on semantic relations between terms. To evaluate the proposed method, three standard Persian corpora are used. Experimental results show a 2 to 4 percent improvement in classification accuracy compared with the best previously designed system for Persian documents.
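The abstract does not spell out the weighting formula, but the general idea of boosting a statistical weight with semantic relations can be sketched as follows. The relatedness table and the multiplicative bonus are illustrative assumptions, not the paper's published scheme.

```python
import math

def tfidf(term, doc, docs):
    """Standard TF-IDF: term frequency times inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0

# Hypothetical semantic relatedness scores, e.g. from a thesaurus or
# word embeddings (the paper's actual relations are not given here).
related = {("car", "vehicle"): 0.8}

def semantic_weight(term, doc, docs, alpha=0.5):
    """TF-IDF boosted when semantically related terms co-occur in the
    document -- one plausible reading of weighting by semantic relations."""
    base = tfidf(term, doc, docs)
    bonus = sum(s for (a, b), s in related.items()
                if (a == term and b in doc) or (b == term and a in doc))
    return base * (1 + alpha * bonus)

docs = [["car", "vehicle", "road"], ["dog", "cat"]]
```

Here "car" in the first document receives a higher weight than plain TF-IDF would assign, because the related term "vehicle" co-occurs with it.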

* 7 pages 

Deep multi-modal networks for book genre classification based on its cover

Nov 15, 2020
Chandra Kundu, Lukun Zheng

Book covers are usually a reader's very first impression of a book, and they often convey important information about its content. Book genre classification based on the cover alone would be highly beneficial to many modern retrieval systems, considering that complete digitization of books is an extremely expensive task. At the same time, it is an extremely challenging task for the following reasons: First, there is a wide variety of book genres, many of which are not concretely defined. Second, book covers, as graphic designs, vary in many ways, such as color, style, and textual information, even for books of the same genre. Third, cover designs may vary due to external factors such as country, culture, and target readership. With growing competitiveness in the book industry, cover designers and typographers push designs to their limits in the hope of attracting sales. Cover-based book classification has become a particularly exciting research topic in recent years. In this paper, we propose a multi-modal deep learning framework to solve this problem. The contribution of this paper is four-fold. First, our method adds an extra modality by extracting text automatically from book covers. Second, state-of-the-art image-based and text-based models are evaluated thoroughly for the task of book cover classification. Third, we develop an efficient and scalable multi-modal framework based only on the images and texts shown on the covers. Fourth, a thorough analysis of the experimental results is given and future work to improve performance is suggested. The results show that the multi-modal framework significantly outperforms current state-of-the-art image-based models. However, more effort and resources are needed for this classification task to reach a satisfactory level.
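Combining an image branch and a text branch can be sketched as a simple late fusion of per-genre probabilities. The logits below are toy values and the averaging rule is an assumption for illustration; the paper's framework may fuse the modalities differently.

```python
import numpy as np

# Hypothetical per-genre scores from the two branches of a multi-modal
# model: an image CNN over the cover, and a text model over the text
# extracted from that cover.
image_logits = np.array([2.0, 0.5, -1.0])
text_logits = np.array([0.2, 1.5, -0.5])

def softmax(z):
    """Convert raw scores to class probabilities."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Simple late fusion: average the two branches' probability vectors.
fused = 0.5 * softmax(image_logits) + 0.5 * softmax(text_logits)
genre = int(np.argmax(fused))
```

With these toy scores the image branch's confident first class wins the fused vote, even though the text branch prefers the second class.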

* 23 pages, 8 figures 

From Image to Text Classification: A Novel Approach based on Clustering Word Embeddings

Jul 25, 2017
Andrei M. Butnaru, Radu Tudor Ionescu

In this paper, we propose a novel approach for text classification based on clustering word embeddings, inspired by the bag-of-visual-words model widely used in computer vision. After each word in a collection of documents is represented as a word vector using a pre-trained word embeddings model, a k-means algorithm is applied to the word vectors to obtain a fixed-size set of clusters. The centroid of each cluster is interpreted as a super word embedding that embodies all the semantically related word vectors in a certain region of the embedding space. Every embedded word in the collection of documents is then assigned to the nearest cluster centroid. In the end, each document is represented as a bag of super word embeddings by computing the frequency of each super word embedding in the respective document. We also diverge from the idea of building a single vocabulary for the entire collection of documents, and propose building class-specific vocabularies for better performance. Using this representation, we report results on two text mining tasks, namely text categorization by topic and polarity classification. On both tasks, our model yields better performance than the standard bag of words.
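The pipeline described above — cluster word vectors, then histogram nearest-centroid assignments per document — can be sketched in a few lines. The random vectors stand in for real pre-trained embeddings, and the minimal k-means is only a stand-in for a production implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-trained embeddings: random vectors stand in for
# word2vec/GloVe entries.
vocab = ["cat", "dog", "fish", "stock", "market", "trade"]
embeddings = {w: rng.normal(size=8) for w in vocab}

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means; returns the k centroids (the 'super word embeddings')."""
    r = np.random.default_rng(seed)
    centroids = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def bag_of_super_words(doc, centroids):
    """Histogram of nearest-centroid assignments over the document's words."""
    hist = np.zeros(len(centroids))
    for w in doc:
        v = embeddings[w]
        hist[np.argmin(((centroids - v) ** 2).sum(-1))] += 1
    return hist / hist.sum()  # frequency representation, as described

X = np.stack([embeddings[w] for w in vocab])
centroids = kmeans(X, k=2)
doc_vec = bag_of_super_words(["cat", "dog", "stock"], centroids)
```

`doc_vec` is the fixed-size document representation that replaces the standard bag of words.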

* Accepted at KES 2017 

Empirical Comparisons of CNN with Other Learning Algorithms for Text Classification in Legal Document Review

Dec 19, 2019
Robert Keeling, Rishi Chhatwal, Nathaniel Huber-Fliflet, Jianping Zhang, Fusheng Wei, Haozhen Zhao, Shi Ye, Han Qin

Research has shown that Convolutional Neural Networks (CNN) can be effectively applied to text classification as part of a predictive coding protocol. That said, most research to date has been conducted on data sets with short documents that do not reflect the variety of documents in real world document reviews. Using data from four actual reviews with documents of varying lengths, we compared CNN with other popular machine learning algorithms for text classification, including Logistic Regression, Support Vector Machine, and Random Forest. For each data set, classification models were trained with different training sample sizes using different learning algorithms. These models were then evaluated using a large randomly sampled test set of documents, and the results were compared using precision and recall curves. Our study demonstrates that CNN performed well, but that there was no single algorithm that performed the best across the combination of data sets and training sample sizes. These results will help advance research into the legal profession's use of machine learning algorithms that maximize performance.
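The precision and recall values underlying the curves in the study come from simple counts over predictions. A minimal stdlib sketch of that computation, on toy labels rather than the paper's review data:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class from paired label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 3 relevant documents, the model flags 3, 2 correctly.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
prec, rec = precision_recall(y_true, y_pred)
```

Sweeping a classifier's score threshold and recording these pairs at each setting yields the precision-recall curves used to compare the algorithms.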

* 2019 IEEE International Conference on Big Data (Big Data) 

When BERT Meets Quantum Temporal Convolution Learning for Text Classification in Heterogeneous Computing

Feb 17, 2022
Chao-Han Huck Yang, Jun Qi, Samuel Yen-Chi Chen, Yu Tsao, Pin-Yu Chen

The rapid development of quantum computing has demonstrated many unique characteristics of quantum advantage, such as richer feature representation and more secure protection of model parameters. This work proposes a vertical federated learning architecture based on variational quantum circuits to demonstrate the competitive performance of a quantum-enhanced pre-trained BERT model for text classification. In particular, our proposed hybrid classical-quantum model consists of a novel random quantum temporal convolution (QTC) learning framework that replaces some layers in the BERT-based decoder. Our experiments on intent classification show that the proposed BERT-QTC model attains competitive results on the Snips and ATIS spoken language datasets. In particular, BERT-QTC boosts the performance of the existing quantum circuit-based language model on the two text classification datasets by 1.57% and 1.52% relative improvement. Furthermore, BERT-QTC can be feasibly deployed on both existing commercially accessible quantum computing hardware and CPU-based interfaces, ensuring data isolation.

* Accepted to ICASSP 2022 

An Enhanced Text Classification to Explore Health based Indian Government Policy Tweets

Aug 18, 2020
Aarzoo Dhiman, Durga Toshniwal

Government-sponsored policy-making and scheme generation are among the means of protecting and promoting the social, economic, and personal development of citizens. Government evaluations of the effectiveness of these schemes provide only statistical information in the form of facts and figures, which does not capture in-depth knowledge of public perceptions, experiences, and views on the topic. In this research work, we propose an improved text classification framework that classifies Twitter data about different health-based government schemes. The proposed framework leverages the language representation (LR) models BERT, ELMo, and USE. However, these LR models have limited real-time applicability due to the scarcity of annotated data. To handle this, we propose a novel text augmentation approach based on GloVe word embeddings and class-specific sentiments (named Mod-EDA), which boosts the performance of the text classification task by increasing the size of the labeled data. Furthermore, the trained model is leveraged to identify the level of engagement of citizens with these policies across communities such as middle-income and low-income groups.
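Embedding-based synonym replacement — the EDA-style operation that Mod-EDA builds on — can be sketched as below. The neighbour table is a hypothetical stand-in for GloVe nearest neighbours, and the sketch omits the class-specific sentiment filtering the paper adds.

```python
import random

# Hypothetical nearest-neighbour table, as would come from GloVe
# cosine similarities over a real vocabulary.
neighbors = {
    "good": ["great", "fine"],
    "doctor": ["physician"],
    "hospital": ["clinic"],
}

def augment(sentence, n_replace=1, seed=0):
    """EDA-style synonym replacement, with candidates drawn from embedding
    neighbours instead of a thesaurus (the Mod-EDA idea, simplified)."""
    rnd = random.Random(seed)
    words = sentence.split()
    replaceable = [i for i, w in enumerate(words) if w in neighbors]
    for i in rnd.sample(replaceable, min(n_replace, len(replaceable))):
        words[i] = rnd.choice(neighbors[words[i]])
    return " ".join(words)

original = "the doctor at the hospital was good"
augmented = augment(original, n_replace=2, seed=0)
```

Each augmented sentence keeps the original label, enlarging the labeled training set.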

* Accepted to KDD 2020: Applied Data Science for Healthcare Workshop (4 pages, 2 figures, 2 tables) 

Machine learning approach for text and document mining

Jun 06, 2014
Vishwanath Bijalwan, Pinki Kumari, Jordan Pascual, Vijay Bhaskar Semwal

Text Categorization (TC), also known as Text Classification, is the task of automatically classifying a set of text documents into different categories from a predefined set. If a document belongs to exactly one of the categories, it is a single-label classification task; otherwise, it is a multi-label classification task. TC uses several tools from Information Retrieval (IR) and Machine Learning (ML) and has received much attention in recent years from both academic researchers and industry developers. In this paper, we first categorize the documents using a KNN-based machine learning approach and then return the most relevant documents.
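KNN text categorization amounts to vectorizing documents, ranking training documents by similarity, and taking a majority vote among the nearest. A minimal sketch on toy documents (term-frequency vectors and cosine similarity are common defaults, not necessarily the paper's exact configuration):

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(doc, train, k=3):
    """train: list of (text, label); returns majority label of k nearest."""
    vec = tf_vector(doc)
    ranked = sorted(train, key=lambda tl: cosine(vec, tf_vector(tl[0])),
                    reverse=True)
    top = [label for _, label in ranked[:k]]
    return Counter(top).most_common(1)[0][0]

train = [
    ("football match goal score", "sports"),
    ("tennis match player win", "sports"),
    ("stock market trade shares", "finance"),
    ("bank interest rate loan", "finance"),
]
label = knn_classify("football score win", train)
```

The same similarity ranking also supports the second step described in the abstract: returning the most relevant documents for a query.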

* arXiv admin note: text overlap with arXiv:1003.1795, arXiv:1212.2065 by other authors 

Domain-Adaptive Text Classification with Structured Knowledge from Unlabeled Data

Jun 20, 2022
Tian Li, Xiang Chen, Zhen Dong, Weijiang Yu, Yijun Yan, Kurt Keutzer, Shanghang Zhang

Domain-adaptive text classification is a challenging problem for large-scale pretrained language models because they often require expensive additional labeled data to adapt to new domains. Existing works usually fail to leverage the implicit relationships among words across domains. In this paper, we propose a novel method, called Domain Adaptation with Structured Knowledge (DASK), to enhance domain adaptation by exploiting word-level semantic relationships. DASK first builds a knowledge graph to capture the relationships between pivot terms (domain-independent words) and non-pivot terms in the target domain. Then, during training, DASK injects pivot-related knowledge graph information into source domain texts. For the downstream task, these knowledge-injected texts are fed into a BERT variant capable of processing knowledge-injected textual data. Thanks to the knowledge injection, our model learns domain-invariant features for non-pivots according to their relationships with pivots. DASK ensures that pivots behave domain-invariantly by dynamically inferring the polarity scores of candidate pivots during training with pseudo-labels. We validate DASK on a wide range of cross-domain sentiment classification tasks and observe up to 2.9% absolute performance improvement over baselines for 20 different domain pairs. Code will be made available at

* IJCAI-ECAI 2022 