Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"Text Classification": models, code, and papers

Improving Indonesian Text Classification Using Multilingual Language Model

Sep 12, 2020
Ilham Firdausi Putra, Ayu Purwarianti

Compared to English, the amount of labeled data for Indonesian text classification tasks is very small. Recently developed multilingual language models have shown its ability to create multilingual representations effectively. This paper investigates the effect of combining English and Indonesian data on building Indonesian text classification (e.g., sentiment analysis and hate speech) using multilingual language models. Using the feature-based approach, we observe its performance on various data sizes and total added English data. The experiment showed that the addition of English data, especially if the amount of Indonesian data is small, improves performance. Using the fine-tuning approach, we further showed its effectiveness in utilizing the English language to build Indonesian text classification models.

* 2020 International Conference on Advanced Informatics: Concept, Theory and Application (ICAICTA) 
  
Access Paper or Ask Questions

TextRGNN: Residual Graph Neural Networks for Text Classification

Dec 30, 2021
Jiayuan Chen, Boyu Zhang, Yinfei Xu, Meng Wang

Recently, text classification model based on graph neural network (GNN) has attracted more and more attention. Most of these models adopt a similar network paradigm, that is, using pre-training node embedding initialization and two-layer graph convolution. In this work, we propose TextRGNN, an improved GNN structure that introduces residual connection to deepen the convolution network depth. Our structure can obtain a wider node receptive field and effectively suppress the over-smoothing of node features. In addition, we integrate the probabilistic language model into the initialization of graph node embedding, so that the non-graph semantic information of can be better extracted. The experimental results show that our model is general and efficient. It can significantly improve the classification accuracy whether in corpus level or text level, and achieve SOTA performance on a wide range of text classification datasets.

  
Access Paper or Ask Questions

Hierarchical Heterogeneous Graph Representation Learning for Short Text Classification

Oct 30, 2021
Yaqing Wang, Song Wang, Quanming Yao, Dejing Dou

Short text classification is a fundamental task in natural language processing. It is hard due to the lack of context information and labeled data in practice. In this paper, we propose a new method called SHINE, which is based on graph neural network (GNN), for short text classification. First, we model the short text dataset as a hierarchical heterogeneous graph consisting of word-level component graphs which introduce more semantic and syntactic information. Then, we dynamically learn a short document graph that facilitates effective label propagation among similar short texts. Thus, compared with existing GNN-based methods, SHINE can better exploit interactions between nodes of the same types and capture similarities between short texts. Extensive experiments on various benchmark short text datasets show that SHINE consistently outperforms state-of-the-art methods, especially with fewer labels.

* Accepted to EMNLP 2021 
  
Access Paper or Ask Questions

HDLTex: Hierarchical Deep Learning for Text Classification

Oct 06, 2017
Kamran Kowsari, Donald E. Brown, Mojtaba Heidarysafa, Kiana Jafari Meimandi, Matthew S. Gerber, Laura E. Barnes

The continually increasing number of documents produced each year necessitates ever improving information processing methods for searching, retrieving, and organizing text. Central to these information processing methods is document classification, which has become an important application for supervised learning. Recently the performance of these traditional classifiers has degraded as the number of documents has increased. This is because along with this growth in the number of documents has come an increase in the number of categories. This paper approaches this problem differently from current document classification methods that view the problem as multi-class classification. Instead we perform hierarchical classification using an approach we call Hierarchical Deep Learning for Text classification (HDLTex). HDLTex employs stacks of deep learning architectures to provide specialized understanding at each level of the document hierarchy.

* ICMLA 2017 
  
Access Paper or Ask Questions

Challenging the Semi-Supervised VAE Framework for Text Classification

Sep 27, 2021
Ghazi Felhi, Joseph Le Roux, Djamé Seddah

Semi-Supervised Variational Autoencoders (SSVAEs) are widely used models for data efficient learning. In this paper, we question the adequacy of the standard design of sequence SSVAEs for the task of text classification as we exhibit two sources of overcomplexity for which we provide simplifications. These simplifications to SSVAEs preserve their theoretical soundness while providing a number of practical advantages in the semi-supervised setup where the result of training is a text classifier. These simplifications are the removal of (i) the Kullback-Liebler divergence from its objective and (ii) the fully unobserved latent variable from its probabilistic model. These changes relieve users from choosing a prior for their latent variables, make the model smaller and faster, and allow for a better flow of information into the latent variables. We compare the simplified versions to standard SSVAEs on 4 text classification tasks. On top of the above-mentioned simplification, experiments show a speed-up of 26%, while keeping equivalent classification scores. The code to reproduce our experiments is public.

* Accepted at the EMNLP 2021 Workshop on Insights from Negative Results 
  
Access Paper or Ask Questions

belabBERT: a Dutch RoBERTa-based language model applied to psychiatric classification

Jun 02, 2021
Joppe Wouts, Janna de Boer, Alban Voppel, Sanne Brederoo, Sander van Splunter, Iris Sommer

Natural language processing (NLP) is becoming an important means for automatic recognition of human traits and states, such as intoxication, presence of psychiatric disorders, presence of airway disorders and states of stress. Such applications have the potential to be an important pillar for online help lines, and may gradually be introduced into eHealth modules. However, NLP is language specific and for languages such as Dutch, NLP models are scarce. As a result, recent Dutch NLP models have a low capture of long range semantic dependencies over sentences. To overcome this, here we present belabBERT, a new Dutch language model extending the RoBERTa architecture. belabBERT is trained on a large Dutch corpus (+32 GB) of web crawled texts. We applied belabBERT to the classification of psychiatric illnesses. First, we evaluated the strength of text-based classification using belabBERT, and compared the results to the existing RobBERT model. Then, we compared the performance of belabBERT to audio classification for psychiatric disorders. Finally, a brief exploration was performed, extending the framework to a hybrid text- and audio-based classification. Our results show that belabBERT outperformed the current best text classification network for Dutch, RobBERT. belabBERT also outperformed classification based on audio alone.

* arXiv admin note: substantial text overlap with arXiv:2008.01543 
  
Access Paper or Ask Questions

A Robust Hybrid Approach for Textual Document Classification

Sep 12, 2019
Muhammad Nabeel Asim, Muhammad Usman Ghani Khan, Muhammad Imran Malik, Andreas Dengel, Sheraz Ahmed

Text document classification is an important task for diverse natural language processing based applications. Traditional machine learning approaches mainly focused on reducing dimensionality of textual data to perform classification. This although improved the overall classification accuracy, the classifiers still faced sparsity problem due to lack of better data representation techniques. Deep learning based text document classification, on the other hand, benefitted greatly from the invention of word embeddings that have solved the sparsity problem and researchers focus mainly remained on the development of deep architectures. Deeper architectures, however, learn some redundant features that limit the performance of deep learning based solutions. In this paper, we propose a two stage text document classification methodology which combines traditional feature engineering with automatic feature engineering (using deep learning). The proposed methodology comprises a filter based feature selection (FSE) algorithm followed by a deep convolutional neural network. This methodology is evaluated on the two most commonly used public datasets, i.e., 20 Newsgroups data and BBC news data. Evaluation results reveal that the proposed methodology outperforms the state-of-the-art of both the (traditional) machine learning and deep learning based text document classification methodologies with a significant margin of 7.7% on 20 Newsgroups and 6.6% on BBC news datasets.

* ICDAR Conference 
  
Access Paper or Ask Questions

COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images

Jun 19, 2016
Andreas Veit, Tomas Matera, Lukas Neumann, Jiri Matas, Serge Belongie

This paper describes the COCO-Text dataset. In recent years large-scale datasets like SUN and Imagenet drove the advancement of scene understanding and object recognition. The goal of COCO-Text is to advance state-of-the-art in text detection and recognition in natural images. The dataset is based on the MS COCO dataset, which contains images of complex everyday scenes. The images were not collected with text in mind and thus contain a broad variety of text instances. To reflect the diversity of text in natural scenes, we annotate text with (a) location in terms of a bounding box, (b) fine-grained classification into machine printed text and handwritten text, (c) classification into legible and illegible text, (d) script of the text and (e) transcriptions of legible text. The dataset contains over 173k text annotations in over 63k images. We provide a statistical analysis of the accuracy of our annotations. In addition, we present an analysis of three leading state-of-the-art photo Optical Character Recognition (OCR) approaches on our dataset. While scene text detection and recognition enjoys strong advances in recent years, we identify significant shortcomings motivating future work.

  
Access Paper or Ask Questions

Transductive Learning with String Kernels for Cross-Domain Text Classification

Nov 02, 2018
Radu Tudor Ionescu, Andrei M. Butnaru

For many text classification tasks, there is a major problem posed by the lack of labeled data in a target domain. Although classifiers for a target domain can be trained on labeled text data from a related source domain, the accuracy of such classifiers is usually lower in the cross-domain setting. Recently, string kernels have obtained state-of-the-art results in various text classification tasks such as native language identification or automatic essay scoring. Moreover, classifiers based on string kernels have been found to be robust to the distribution gap between different domains. In this paper, we formally describe an algorithm composed of two simple yet effective transductive learning approaches to further improve the results of string kernels in cross-domain settings. By adapting string kernels to the test set without using the ground-truth test labels, we report significantly better accuracy rates in cross-domain English polarity classification.

* Accepted at ICONIP 2018. arXiv admin note: substantial text overlap with arXiv:1808.08409 
  
Access Paper or Ask Questions

Neural Text Classification by Jointly Learning to Cluster and Align

Nov 24, 2020
Yekun Chai, Haidong Zhang, Shuo Jin

Distributional text clustering delivers semantically informative representations and captures the relevance between each word and semantic clustering centroids. We extend the neural text clustering approach to text classification tasks by inducing cluster centers via a latent variable model and interacting with distributional word embeddings, to enrich the representation of tokens and measure the relatedness between tokens and each learnable cluster centroid. The proposed method jointly learns word clustering centroids and clustering-token alignments, achieving the state of the art results on multiple benchmark datasets and proving that the proposed cluster-token alignment mechanism is indeed favorable to text classification. Notably, our qualitative analysis has conspicuously illustrated that text representations learned by the proposed model are in accord well with our intuition.

  
Access Paper or Ask Questions
<<
11
12
13
14
15
16
17
18
19
20
21
22
23
>>