
"Text Classification": models, code, and papers

Ad Text Classification with Transformer-Based Natural Language Processing Methods

Jun 23, 2021
Umut Özdil, Büşra Arslan, D. Emre Taşar, Gökçe Polat, Şükrü Ozan

In this study, a natural language processing (NLP) based method is proposed for the sector-wise automatic classification of ad texts created on online advertising platforms. Our data set consists of approximately 21,000 labeled advertising texts from 12 different sectors. The study uses the Bidirectional Encoder Representations from Transformers (BERT) model, a transformer-based language model that has recently been applied to tasks such as text classification in the natural language processing literature. The classification performance obtained using a BERT model pre-trained for the Turkish language is reported in detail.

* 6 pages, in Turkish language, 4 figures, 3 tables 
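The fine-tuning setup described above attaches a sector-classification head to the pre-trained encoder. As a minimal sketch (not the authors' code), assuming a 768-dimensional pooled BERT output and 12 sector labels, the head is a single linear layer followed by a softmax:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify_pooled_output(pooled, W, b):
    """Map a pooled [CLS] vector to sector probabilities.

    pooled : (768,) encoder output for one ad text
    W, b   : (12, 768) weight matrix and (12,) bias of the head
    """
    return softmax(W @ pooled + b)

# Toy demonstration with random weights; a real head is learned jointly
# with the encoder during fine-tuning.
rng = np.random.default_rng(0)
probs = classify_pooled_output(rng.normal(size=768),
                               rng.normal(size=(12, 768)) * 0.01,
                               np.zeros(12))
```

The predicted sector is simply the argmax of `probs`; fine-tuning minimizes cross-entropy between `probs` and the labeled sector.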
  

Bangla Text Classification using Transformers

Nov 09, 2020
Tanvirul Alam, Akib Khan, Firoj Alam

Text classification has been one of the earliest problems in NLP. Over time the scope of application areas has broadened and the difficulty of dealing with new areas (e.g., noisy social media content) has increased. The problem-solving strategy has shifted from classical machine learning to deep learning algorithms. One recent deep neural network architecture is the Transformer. Models built on this architecture and its variants have recently shown success in many downstream natural language processing tasks, especially for resource-rich languages such as English. However, these models have not been fully explored for Bangla text classification tasks. In this work, we fine-tune multilingual transformer models for Bangla text classification in different domains, including sentiment analysis, emotion detection, news categorization, and authorship attribution. We obtain state-of-the-art results on six benchmark datasets, improving upon the previous results by 5-29% accuracy across different tasks.

  

DataWords: Getting Contrarian with Text, Structured Data and Explanations

Nov 09, 2021
Stephen I. Gallant, Mirza Nasir Hossain

Our goal is to build classification models using a combination of free-text and structured data. To do this, we represent structured data by text sentences, DataWords, so that similar data items are mapped into the same sentence. This permits modeling a mixture of text and structured data by using only text-modeling algorithms. Several examples illustrate that it is possible to improve text classification performance by first running extraction tools (named entity recognition), then converting the output to DataWords, and adding the DataWords to the original text -- before model building and classification. This approach also allows us to produce explanations for inferences in terms of both free text and structured data.

* 11 pages 
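The field-to-sentence mapping can be illustrated with a small sketch (the token templates and bucketing scheme here are our own illustration, not the paper's exact encoding): each field-value pair of a structured record becomes a short synthetic token sentence, with continuous values coarsely bucketed so that similar records map to the same sentence, and the result is appended to the free text before classification.

```python
def to_datawords(record):
    """Render a structured record as DataWords-style sentences.

    Continuous fields are bucketed so similar values share a sentence;
    categorical fields are emitted verbatim (lowercased).
    """
    sentences = []
    for field, value in record.items():
        if isinstance(value, (int, float)):
            bucket = int(value // 10) * 10          # coarse 10-unit buckets
            sentences.append(f"dw_{field}_{bucket}.")
        else:
            sentences.append(f"dw_{field}_{str(value).lower()}.")
    return " ".join(sentences)

free_text = "Shares of the retailer surged after earnings."
row = {"sector": "Retail", "pe_ratio": 23.7}
augmented = free_text + " " + to_datawords(row)
```

Because the augmented string is plain text, any off-the-shelf text classifier can now consume both modalities, which is the central idea of the approach.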
  

A novel text representation which enables image classifiers to perform text classification, applied to name disambiguation

Aug 19, 2019
Stephen M. Petrie, T'Mir D. Julius

Patent data are often used to study the process of innovation and research, but patent databases lack unique identifiers for individual inventors, making it difficult to study innovation processes at the individual level. Here we introduce an algorithm that performs highly accurate disambiguation of inventors (named entities) in US patent data (F1: 99.09%, precision: 99.41%, recall: 98.76%). The algorithm involves a novel method for converting text-based record data into abstract image representations, in which text from a given pairwise comparison between two inventor name records is converted into a 2D RGB (stacked) image representation. We train an image classification neural network to discriminate between such pairwise comparison images, and then use the trained network to label each pair of records as either matched (same inventor) or non-matched (different inventors). The resulting disambiguation algorithm produces highly accurate results, outperforming other inventor name disambiguation studies on US patent data. Our new text-to-image representation method could potentially be used more broadly for other NLP comparison problems, as it allows image-based processing techniques (e.g. image classification networks) to be applied to text-based comparison problems (such as disambiguation of academic publications, or data linkage problems).
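The text-to-image conversion can be sketched as follows (a simplified illustration of the general idea, not the authors' exact RGB encoding): each record string is rendered as a fixed-width array of normalized byte values, and the two records in a pairwise comparison are stacked as separate channels of one array, which an image classifier then labels matched or non-matched.

```python
import numpy as np

def string_to_channel(s, width=16):
    """Encode a string as a length-`width` array of byte values in [0, 1]."""
    codes = [ord(c) / 255.0 for c in s[:width]]
    codes += [0.0] * (width - len(codes))            # zero-pad short strings
    return np.array(codes)

def pair_to_image(name_a, name_b, width=16):
    """Stack two name records into a 2-channel comparison 'image'.

    A classifier trained on such arrays learns to label each pair
    as matched (same inventor) or non-matched.
    """
    return np.stack([string_to_channel(name_a, width),
                     string_to_channel(name_b, width)])

img = pair_to_image("Petrie, Stephen M.", "Petrie, S. M.")
```

The appeal of the representation is that character-level similarity between the two records shows up as visual similarity between channels, which convolutional networks detect well.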

  

Resilient Combination of Complementary CNN and RNN Features for Text Classification through Attention and Ensembling

Mar 28, 2019
Athanasios Giannakopoulos, Maxime Coriou, Andreea Hossmann, Michael Baeriswyl, Claudiu Musat

State-of-the-art methods for text classification include several distinct steps of pre-processing, feature extraction and post-processing. In this work, we focus on end-to-end neural architectures and show that the best performance in text classification is obtained by combining information from different neural modules. Concretely, we combine convolution, recurrent and attention modules with ensemble methods and show that they are complementary. We introduce ECGA, an end-to-end go-to architecture for novel text classification tasks. We prove that it is efficient and robust, as it attains or surpasses the state-of-the-art on varied datasets, including both low and high data regimes.

* 5 pages, 1 figure, SDS 2019 - The 6th Swiss Conference on Data Science 
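The attention component that weighs the complementary module outputs can be sketched in isolation (a generic attention pooling over feature vectors assumed to come from the upstream CNN and RNN modules; this is our illustration, not the ECGA code):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_pool(features, w):
    """Pool a stack of module outputs into one vector.

    features : (n_modules, d) rows, e.g. one CNN and one RNN feature vector
    w        : (d,) learned scoring vector
    Returns the attention-weighted sum of the rows.
    """
    scores = softmax(features @ w)          # one weight per module
    return scores @ features                # (d,) pooled representation

rng = np.random.default_rng(1)
feats = rng.normal(size=(2, 8))             # one CNN row, one RNN row
pooled = attention_pool(feats, rng.normal(size=8))
```

The pooled vector feeds the final classifier; because the weights are learned, the model can lean on whichever module is more informative for a given input, which is what makes the combination resilient.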
  

A Survey on Text Classification: From Shallow to Deep Learning

Aug 04, 2020
Qian Li, Hao Peng, Jianxin Li, Congyin Xia, Renyu Yang, Lichao Sun, Philip S. Yu, Lifang He

Text classification is the most fundamental and essential task in natural language processing. The last decade has seen a surge of research in this area due to the unprecedented success of deep learning. Numerous methods, datasets, and evaluation metrics have been proposed in the literature, raising the need for a comprehensive and updated survey. This paper fills the gap by reviewing state-of-the-art approaches from 1961 to 2020, focusing on models from shallow to deep learning. We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification. We then discuss each of these categories in detail, dealing with both the technical developments and benchmark datasets that support tests of predictions. A comprehensive comparison between different techniques is provided, along with the pros and cons of various evaluation metrics. Finally, we conclude by summarizing key implications, future research directions, and the challenges facing the research area.

  

Genetic Optimization of Keywords Subset in the Classification Analysis of Texts Authorship

Nov 14, 2012
Bohdan Pavlyshenko

Genetic selection of a keyword set, whose text frequencies are used as attributes in text classification analysis, is analyzed. The genetic optimization was performed on a set of words drawn from the frequency dictionary within given frequency limits. The frequency dictionary was built from an analyzed corpus of English fiction texts. The error of a k-nearest-neighbors classifier was used as the fitness function minimized by the genetic algorithm. The obtained results show high precision and recall of text classification by authorship categories using the keyword-set attributes selected by the genetic algorithm from the frequency dictionary.
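The selection loop can be sketched with a toy genetic algorithm (our illustration with made-up data, not the paper's implementation): individuals are bitmasks over a candidate keyword list, and fitness is the leave-one-out error of a 1-nearest-neighbor authorship classifier on keyword-frequency vectors.

```python
import random

# Toy corpus: per-document keyword frequency vectors with author labels.
DOCS = [([3, 0, 2, 0], "A"), ([2, 0, 3, 1], "A"),
        ([0, 3, 0, 2], "B"), ([1, 2, 0, 3], "B")]

def knn_error(mask):
    """Leave-one-out 1-NN error using only the keywords enabled in mask."""
    errors = 0
    for i, (x, y) in enumerate(DOCS):
        best = min((j for j in range(len(DOCS)) if j != i),
                   key=lambda j: sum((x[k] - DOCS[j][0][k]) ** 2
                                     for k in range(len(mask)) if mask[k]))
        errors += DOCS[best][1] != y
    return errors / len(DOCS)

def evolve(n_keywords=4, pop_size=8, generations=20, seed=0):
    """Minimize knn_error over keyword bitmasks with a simple GA."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_keywords)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=knn_error)
        parents = pop[:pop_size // 2]                 # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_keywords)        # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                    # point mutation
                child[rng.randrange(n_keywords)] ^= 1
            children.append(child)
        pop = parents + children
    return min(pop, key=knn_error)

best = evolve()
```

On real data the mask would range over hundreds of frequency-dictionary words, but the structure — bitmask individuals, classifier error as fitness — is the same.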

  

Multi-Task Label Embedding for Text Classification

Oct 17, 2017
Honglun Zhang, Liqiang Xiao, Wenqing Chen, Yongkun Wang, Yaohui Jin

Multi-task learning in text classification leverages implicit correlations among related tasks to extract common features and yield performance gains. However, most previous works treat the labels of each task as independent and meaningless one-hot vectors, which causes a loss of potential information and makes it difficult for these models to jointly learn three or more tasks. In this paper, we propose Multi-Task Label Embedding to convert labels in text classification into semantic vectors, thereby turning the original tasks into vector matching tasks. We implement unsupervised, supervised and semi-supervised models of Multi-Task Label Embedding, all utilizing semantic correlations among tasks and making it particularly convenient to scale and transfer as more tasks are involved. Extensive experiments on five benchmark datasets for text classification show that our models can effectively improve the performance of related tasks with semantic representations of labels and additional information from each other.

* arXiv admin note: text overlap with arXiv:1707.02892 
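The vector-matching step can be sketched as follows (a generic cosine-similarity matcher with made-up vectors; the text and label embedding models themselves are what the paper learns): classification reduces to picking the label whose semantic vector is closest to the text's vector.

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def match_label(text_vec, label_vecs):
    """Return the name of the label embedding closest to the text embedding.

    text_vec   : (d,) semantic vector of the input text
    label_vecs : dict mapping label name -> (d,) semantic label vector
    """
    return max(label_vecs, key=lambda name: cosine(text_vec, label_vecs[name]))

labels = {"sports":   np.array([1.0, 0.0, 0.0]),
          "politics": np.array([0.0, 1.0, 0.0]),
          "science":  np.array([0.0, 0.0, 1.0])}
pred = match_label(np.array([0.9, 0.1, 0.2]), labels)
```

Because labels live in the same vector space across tasks, adding a new task only adds new label vectors, which is why the scheme scales and transfers easily.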
  