Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"Text Classification": models, code, and papers

Efficient multivariate sequence classification

Sep 30, 2014
Pavel P. Kuksa

Kernel-based approaches for sequence classification have been successfully applied to a variety of domains, including the text categorization, image classification, speech analysis, biological sequence analysis, time series and music classification, where they show some of the most accurate results. Typical kernel functions for sequences in these domains (e.g., bag-of-words, mismatch, or subsequence kernels) are restricted to {\em discrete univariate} (i.e. one-dimensional) string data, such as sequences of words in the text analysis, codeword sequences in the image analysis, or nucleotide or amino acid sequences in the DNA and protein sequence analysis. However, original sequence data are often of real-valued multivariate nature, i.e. are not univariate and discrete as required by typical $k$-mer based sequence kernel functions. In this work, we consider the problem of the {\em multivariate} sequence classification such as classification of multivariate music sequences, or multidimensional protein sequence representations. To this end, we extend {\em univariate} kernel functions typically used in sequence analysis and propose efficient {\em multivariate} similarity kernel method (MVDFQ-SK) based on (1) a direct feature quantization (DFQ) of each sequence dimension in the original {\em real-valued} multivariate sequences and (2) applying novel multivariate discrete kernel measures on these multivariate discrete DFQ sequence representations to more accurately capture similarity relationships among sequences and improve classification performance. Experiments using the proposed MVDFQ-SK kernel method show excellent classification performance on three challenging music classification tasks as well as protein sequence classification with significant 25-40% improvements over univariate kernel methods and existing state-of-the-art sequence classification methods.

* multivariate sequence classification, string kernels, vector quantization, direct feature quantization, music classification, protein classification 
Access Paper or Ask Questions

Features Fusion for Classification of Logos

Sep 06, 2016
N. Vinay Kumar, Pratheek, V. Vijaya Kantha, K. N. Govindaraju, D. S. Guru

In this paper, a logo classification system based on the appearance of logo images is proposed. The proposed classification system makes use of global characteristics of logo images for classification. Color, texture, and shape of a logo wholly describe the global characteristics of logo images. The various combinations of these characteristics are used for classification. The combination contains only with single feature or with fusion of two features or fusion of all three features considered at a time respectively. Further, the system categorizes the logo image into: a logo image with fully text or with fully symbols or containing both symbols and texts.. The K-Nearest Neighbour (K-NN) classifier is used for classification. Due to the lack of color logo image dataset in the literature, the same is created consisting 5044 color logo images. Finally, the performance of the classification system is evaluated through accuracy, precision, recall and F-measure computed from the confusion matrix. The experimental results show that the most promising results are obtained for fusion of features.

* Procedia Computer Science, Volume 85, 2016, Pages 370-379 
* 10 pages, 5 figures, 9 tables 
Access Paper or Ask Questions

Improving Pretrained Models for Zero-shot Multi-label Text Classification through Reinforced Label Hierarchy Reasoning

Apr 04, 2021
Hui Liu, Danqing Zhang, Bing Yin, Xiaodan Zhu

Exploiting label hierarchies has become a promising approach to tackling the zero-shot multi-label text classification (ZS-MTC) problem. Conventional methods aim to learn a matching model between text and labels, using a graph encoder to incorporate label hierarchies to obtain effective label representations \cite{rios2018few}. More recently, pretrained models like BERT \cite{devlin2018bert} have been used to convert classification tasks into a textual entailment task \cite{yin-etal-2019-benchmarking}. This approach is naturally suitable for the ZS-MTC task. However, pretrained models are underexplored in the existing work because they do not generate individual vector representations for text or labels, making it unintuitive to combine them with conventional graph encoding methods. In this paper, we explore to improve pretrained models with label hierarchies on the ZS-MTC task. We propose a Reinforced Label Hierarchy Reasoning (RLHR) approach to encourage interdependence among labels in the hierarchies during training. Meanwhile, to overcome the weakness of flat predictions, we design a rollback algorithm that can remove logical errors from predictions during inference. Experimental results on three real-life datasets show that our approach achieves better performance and outperforms previous non-pretrained methods on the ZS-MTC task.

* Accepted to Main Conference of NAACL 2021 as Long Paper 
Access Paper or Ask Questions

Composite Semantic Relation Classification

May 16, 2018
Siamak Barzegar, Andre Freitas, Siegfried Handschuh, Brian Davis

Different semantic interpretation tasks such as text entailment and question answering require the classification of semantic relations between terms or entities within text. However, in most cases it is not possible to assign a direct semantic relation between entities/terms. This paper proposes an approach for composite semantic relation classification, extending the traditional semantic relation classification task. Different from existing approaches, which use machine learning models built over lexical and distributional word vector features, the proposed model uses the combination of a large commonsense knowledge base of binary relations, a distributional navigational algorithm and sequence classification to provide a solution for the composite semantic relation classification problem.

* 12 pages, 3 figures, conference 
Access Paper or Ask Questions

PETGEN: Personalized Text Generation Attack on Deep Sequence Embedding-based Classification Models

Sep 14, 2021
Bing He, Mustaque Ahamad, Srijan Kumar

\textit{What should a malicious user write next to fool a detection model?} Identifying malicious users is critical to ensure the safety and integrity of internet platforms. Several deep learning based detection models have been created. However, malicious users can evade deep detection models by manipulating their behavior, rendering these models of little use. The vulnerability of such deep detection models against adversarial attacks is unknown. Here we create a novel adversarial attack model against deep user sequence embedding-based classification models, which use the sequence of user posts to generate user embeddings and detect malicious users. In the attack, the adversary generates a new post to fool the classifier. We propose a novel end-to-end Personalized Text Generation Attack model, called \texttt{PETGEN}, that simultaneously reduces the efficacy of the detection model and generates posts that have several key desirable properties. Specifically, \texttt{PETGEN} generates posts that are personalized to the user's writing style, have knowledge about a given target context, are aware of the user's historical posts on the target context, and encapsulate the user's recent topical interests. We conduct extensive experiments on two real-world datasets (Yelp and Wikipedia, both with ground-truth of malicious users) to show that \texttt{PETGEN} significantly reduces the performance of popular deep user sequence embedding-based classification models. \texttt{PETGEN} outperforms five attack baselines in terms of text quality and attack efficacy in both white-box and black-box classifier settings. Overall, this work paves the path towards the next generation of adversary-aware sequence classification models.

* Accepted for publication at: 2021 ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'2021). Code and data at: 
Access Paper or Ask Questions

Findings of the The RuATD Shared Task 2022 on Artificial Text Detection in Russian

Jun 03, 2022
Tatiana Shamardina, Vladislav Mikhailov, Daniil Chernianskii, Alena Fenogenova, Marat Saidov, Anastasiya Valeeva, Tatiana Shavrina, Ivan Smurov, Elena Tutubalina, Ekaterina Artemova

We present the shared task on artificial text detection in Russian, which is organized as a part of the Dialogue Evaluation initiative, held in 2022. The shared task dataset includes texts from 14 text generators, i.e., one human writer and 13 text generative models fine-tuned for one or more of the following generation tasks: machine translation, paraphrase generation, text summarization, text simplification. We also consider back-translation and zero-shot generation approaches. The human-written texts are collected from publicly available resources across multiple domains. The shared task consists of two sub-tasks: (i) to determine if a given text is automatically generated or written by a human; (ii) to identify the author of a given text. The first task is framed as a binary classification problem. The second task is a multi-class classification problem. We provide count-based and BERT-based baselines, along with the human evaluation on the first sub-task. A total of 30 and 8 systems have been submitted to the binary and multi-class sub-tasks, correspondingly. Most teams outperform the baselines by a wide margin. We publicly release our codebase, human evaluation results, and other materials in our GitHub repository (

* Accepted to Dialogue-22 
Access Paper or Ask Questions

Text and Code Embeddings by Contrastive Pre-Training

Jan 24, 2022
Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov, Joanne Jang, Peter Welinder, Lilian Weng

Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaging over 7 tasks, our best unsupervised model achieves a relative improvement of 4% and 1.8% over previous best unsupervised and supervised text embedding models respectively. The same text embeddings when evaluated on large-scale semantic search attains a relative improvement of 23.4%, 14.7%, and 10.6% over previous best unsupervised methods on MSMARCO, Natural Questions and TriviaQA benchmarks, respectively. Similarly to text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over prior best work on code search.

Access Paper or Ask Questions

Comparison of Classification Algorithms Used Medical Documents Categorization

Nov 02, 2018
Durmus Ozkan Sahin, Erdal Kilic

Volume of text based documents have been increasing day by day. Medical documents are located within this growing text documents. In this study, the techniques used for text classification applied on medical documents and evaluated classification performance. Used data sets are multi class and multi labelled. Chi Square (CHI) technique was used for feature selection also SMO, NB, C4.5, RF and KNN algorithms was used for classification. The aim of this study, success of various classifiers is evaluated on multi class and multi label data sets consisting of medical documents. The first 400 features, while the most successful in the KNN classifier, feature number 400 and after the SMO has become the most successful classifier.

* International Conference on Computer Science and Engineering 2016 
Access Paper or Ask Questions

Investigating the Working of Text Classifiers

Aug 05, 2018
Devendra Singh Sachan, Manzil Zaheer, Ruslan Salakhutdinov

Text classification is one of the most widely studied tasks in natural language processing. Motivated by the principle of compositionality, large multilayer neural network models have been employed for this task in an attempt to effectively utilize the constituent expressions. Almost all of the reported work train large networks using discriminative approaches, which come with a caveat of no proper capacity control, as they tend to latch on to any signal that may not generalize. Using various recent state-of-the-art approaches for text classification, we explore whether these models actually learn to compose the meaning of the sentences or still just focus on some keywords or lexicons for classifying the document. To test our hypothesis, we carefully construct datasets where the training and test splits have no direct overlap of such lexicons, but overall language structure would be similar. We study various text classifiers and observe that there is a big performance drop on these datasets. Finally, we show that even simple models with our proposed regularization techniques, which disincentivize focusing on key lexicons, can substantially improve classification accuracy.

* Proceedings of COLING 2018, the 27th International Conference on Computational Linguistics: Technical Papers (COLING 2018), NIPS 2017 Workshop on Deep Learning: Bridging Theory and Practice 
Access Paper or Ask Questions

Entropy optimized semi-supervised decomposed vector-quantized variational autoencoder model based on transfer learning for multiclass text classification and generation

Nov 10, 2021
Shivani Malhotra, Vinay Kumar, Alpana Agarwal

Semisupervised text classification has become a major focus of research over the past few years. Hitherto, most of the research has been based on supervised learning, but its main drawback is the unavailability of labeled data samples in practical applications. It is still a key challenge to train the deep generative models and learn comprehensive representations without supervision. Even though continuous latent variables are employed primarily in deep latent variable models, discrete latent variables, with their enhanced understandability and better compressed representations, are effectively used by researchers. In this paper, we propose a semisupervised discrete latent variable model for multi-class text classification and text generation. The proposed model employs the concept of transfer learning for training a quantized transformer model, which is able to learn competently using fewer labeled instances. The model applies decomposed vector quantization technique to overcome problems like posterior collapse and index collapse. Shannon entropy is used for the decomposed sub-encoders, on which a variable DropConnect is applied, to retain maximum information. Moreover, gradients of the Loss function are adaptively modified during backpropagation from decoder to encoder to enhance the performance of the model. Three conventional datasets of diversified range have been used for validating the proposed model on a variable number of labeled instances. Experimental results indicate that the proposed model has surpassed the state-of-the-art models remarkably.

* 12 pages, 4 figures 
Access Paper or Ask Questions