Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"Text Classification": models, code, and papers

Cross-Lingual Task-Specific Representation Learning for Text Classification in Resource Poor Languages

Jun 10, 2018
Nurendra Choudhary, Rajat Singh, Manish Shrivastava

Neural network models have shown promising results for text classification. However, these solutions are limited by their dependence on the availability of annotated data. The prospect of leveraging resource-rich languages to enhance the text classification of resource-poor languages is fascinating. The performance on resource-poor languages can significantly improve if the resource availability constraints can be offset. To this end, we present a twin Bidirectional Long Short Term Memory (Bi-LSTM) network with shared parameters consolidated by a contrastive loss function (based on a similarity metric). The model learns the representation of resource-poor and resource-rich sentences in a common space by using the similarity between their assigned annotation tags. Hence, the model projects sentences with similar tags closer and those with different tags farther from each other. We evaluated our model on the classification tasks of sentiment analysis and emoji prediction for resource-poor languages - Hindi and Telugu and resource-rich languages - English and Spanish. Our model significantly outperforms the state-of-the-art approaches in both the tasks across all metrics.

* This work was presented at 1st Workshop on Humanizing AI (HAI) at IJCAI'18 in Stockholm, Sweden. arXiv admin note: text overlap with arXiv:1804.00805, arXiv:1804.01855 

Adversarial Reprogramming of Sequence Classification Neural Networks

Sep 07, 2018
Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz Koushanfar

Adversarial Reprogramming has demonstrated success in utilizing pre-trained neural network classifiers for alternative classification tasks without modification to the original network. An adversary in such an attack scenario trains an additive contribution to the inputs to repurpose the neural network for the new classification task. While this reprogramming approach works for neural networks with a continuous input space such as that of images, it is not directly applicable to neural networks trained for tasks such as text classification, where the input space is discrete. Repurposing such classification networks would require the attacker to learn an adversarial program that maps inputs from one discrete space to the other. In this work, we introduce a context-based vocabulary remapping model to reprogram neural networks trained on a specific sequence classification task, for a new sequence classification task desired by the adversary. We propose training procedures for this adversarial program in both white-box and black-box settings. We demonstrate the application of our model by adversarially repurposing various text-classification models including LSTM, bi-directional LSTM and CNN for alternate classification tasks.

* 10 pages, 5 figures, 2 tables 

CatVRNN: Generating Category Texts via Multi-task Learning

Jul 12, 2021
Pengsen Cheng, Jiayong Liu, Jinqiao Dai

Controlling the model to generate texts of different categories is a challenging task that is getting more and more attention. Recently, generative adversarial net (GAN) has shown promising results in category text generation. However, the texts generated by GANs usually suffer from the problems of mode collapse and training instability. To avoid the above problems, we propose a novel model named category-aware variational recurrent neural network (CatVRNN), which is inspired by multi-task learning. In our model, generation and classification are trained simultaneously, aiming at generating texts of different categories. Moreover, the use of multi-task learning can improve the quality of generated texts, when the classification task is appropriate. And we propose a function to initialize the hidden state of CatVRNN to force model to generate texts of a specific category. Experimental results on three datasets demonstrate that our model can do better than several state-of-the-art text generation methods based GAN in the category accuracy and quality of generated texts.


Improving Visual Reasoning by Exploiting The Knowledge in Texts

Feb 09, 2021
Sahand Sharifzadeh, Sina Moayed Baharlou, Martin Schmitt, Hinrich Schütze, Volker Tresp

This paper presents a new framework for training image-based classifiers from a combination of texts and images with very few labels. We consider a classification framework with three modules: a backbone, a relational reasoning component, and a classification component. While the backbone can be trained from unlabeled images by self-supervised learning, we can fine-tune the relational reasoning and the classification components from external sources of knowledge instead of annotated images. By proposing a transformer-based model that creates structured knowledge from textual input, we enable the utilization of the knowledge in texts. We show that, compared to the supervised baselines with 1% of the annotated images, we can achieve ~8x more accurate results in scene graph classification, ~3x in object classification, and ~1.5x in predicate classification.


Metadata-Induced Contrastive Learning for Zero-Shot Multi-Label Text Classification

Feb 11, 2022
Yu Zhang, Zhihong Shen, Chieh-Han Wu, Boya Xie, Junheng Hao, Ye-Yi Wang, Kuansan Wang, Jiawei Han

Large-scale multi-label text classification (LMTC) aims to associate a document with its relevant labels from a large candidate set. Most existing LMTC approaches rely on massive human-annotated training data, which are often costly to obtain and suffer from a long-tailed label distribution (i.e., many labels occur only a few times in the training set). In this paper, we study LMTC under the zero-shot setting, which does not require any annotated documents with labels and only relies on label surface names and descriptions. To train a classifier that calculates the similarity score between a document and a label, we propose a novel metadata-induced contrastive learning (MICoL) method. Different from previous text-based contrastive learning techniques, MICoL exploits document metadata (e.g., authors, venues, and references of research papers), which are widely available on the Web, to derive similar document-document pairs. Experimental results on two large-scale datasets show that: (1) MICoL significantly outperforms strong zero-shot text classification and contrastive learning baselines; (2) MICoL is on par with the state-of-the-art supervised metadata-aware LMTC method trained on 10K-200K labeled documents; and (3) MICoL tends to predict more infrequent labels than supervised methods, thus alleviates the deteriorated performance on long-tailed labels.

* 12 pages; Accepted to WWW 2022 

Efficient multivariate sequence classification

Sep 30, 2014
Pavel P. Kuksa

Kernel-based approaches for sequence classification have been successfully applied to a variety of domains, including the text categorization, image classification, speech analysis, biological sequence analysis, time series and music classification, where they show some of the most accurate results. Typical kernel functions for sequences in these domains (e.g., bag-of-words, mismatch, or subsequence kernels) are restricted to {\em discrete univariate} (i.e. one-dimensional) string data, such as sequences of words in the text analysis, codeword sequences in the image analysis, or nucleotide or amino acid sequences in the DNA and protein sequence analysis. However, original sequence data are often of real-valued multivariate nature, i.e. are not univariate and discrete as required by typical $k$-mer based sequence kernel functions. In this work, we consider the problem of the {\em multivariate} sequence classification such as classification of multivariate music sequences, or multidimensional protein sequence representations. To this end, we extend {\em univariate} kernel functions typically used in sequence analysis and propose efficient {\em multivariate} similarity kernel method (MVDFQ-SK) based on (1) a direct feature quantization (DFQ) of each sequence dimension in the original {\em real-valued} multivariate sequences and (2) applying novel multivariate discrete kernel measures on these multivariate discrete DFQ sequence representations to more accurately capture similarity relationships among sequences and improve classification performance. Experiments using the proposed MVDFQ-SK kernel method show excellent classification performance on three challenging music classification tasks as well as protein sequence classification with significant 25-40% improvements over univariate kernel methods and existing state-of-the-art sequence classification methods.

* multivariate sequence classification, string kernels, vector quantization, direct feature quantization, music classification, protein classification 

Features Fusion for Classification of Logos

Sep 06, 2016
N. Vinay Kumar, Pratheek, V. Vijaya Kantha, K. N. Govindaraju, D. S. Guru

In this paper, a logo classification system based on the appearance of logo images is proposed. The proposed classification system makes use of global characteristics of logo images for classification. Color, texture, and shape of a logo wholly describe the global characteristics of logo images. The various combinations of these characteristics are used for classification. The combination contains only with single feature or with fusion of two features or fusion of all three features considered at a time respectively. Further, the system categorizes the logo image into: a logo image with fully text or with fully symbols or containing both symbols and texts.. The K-Nearest Neighbour (K-NN) classifier is used for classification. Due to the lack of color logo image dataset in the literature, the same is created consisting 5044 color logo images. Finally, the performance of the classification system is evaluated through accuracy, precision, recall and F-measure computed from the confusion matrix. The experimental results show that the most promising results are obtained for fusion of features.

* Procedia Computer Science, Volume 85, 2016, Pages 370-379 
* 10 pages, 5 figures, 9 tables 

Improving Pretrained Models for Zero-shot Multi-label Text Classification through Reinforced Label Hierarchy Reasoning

Apr 04, 2021
Hui Liu, Danqing Zhang, Bing Yin, Xiaodan Zhu

Exploiting label hierarchies has become a promising approach to tackling the zero-shot multi-label text classification (ZS-MTC) problem. Conventional methods aim to learn a matching model between text and labels, using a graph encoder to incorporate label hierarchies to obtain effective label representations \cite{rios2018few}. More recently, pretrained models like BERT \cite{devlin2018bert} have been used to convert classification tasks into a textual entailment task \cite{yin-etal-2019-benchmarking}. This approach is naturally suitable for the ZS-MTC task. However, pretrained models are underexplored in the existing work because they do not generate individual vector representations for text or labels, making it unintuitive to combine them with conventional graph encoding methods. In this paper, we explore to improve pretrained models with label hierarchies on the ZS-MTC task. We propose a Reinforced Label Hierarchy Reasoning (RLHR) approach to encourage interdependence among labels in the hierarchies during training. Meanwhile, to overcome the weakness of flat predictions, we design a rollback algorithm that can remove logical errors from predictions during inference. Experimental results on three real-life datasets show that our approach achieves better performance and outperforms previous non-pretrained methods on the ZS-MTC task.

* Accepted to Main Conference of NAACL 2021 as Long Paper 

Composite Semantic Relation Classification

May 16, 2018
Siamak Barzegar, Andre Freitas, Siegfried Handschuh, Brian Davis

Different semantic interpretation tasks such as text entailment and question answering require the classification of semantic relations between terms or entities within text. However, in most cases it is not possible to assign a direct semantic relation between entities/terms. This paper proposes an approach for composite semantic relation classification, extending the traditional semantic relation classification task. Different from existing approaches, which use machine learning models built over lexical and distributional word vector features, the proposed model uses the combination of a large commonsense knowledge base of binary relations, a distributional navigational algorithm and sequence classification to provide a solution for the composite semantic relation classification problem.

* 12 pages, 3 figures, conference 

PETGEN: Personalized Text Generation Attack on Deep Sequence Embedding-based Classification Models

Sep 14, 2021
Bing He, Mustaque Ahamad, Srijan Kumar

\textit{What should a malicious user write next to fool a detection model?} Identifying malicious users is critical to ensure the safety and integrity of internet platforms. Several deep learning based detection models have been created. However, malicious users can evade deep detection models by manipulating their behavior, rendering these models of little use. The vulnerability of such deep detection models against adversarial attacks is unknown. Here we create a novel adversarial attack model against deep user sequence embedding-based classification models, which use the sequence of user posts to generate user embeddings and detect malicious users. In the attack, the adversary generates a new post to fool the classifier. We propose a novel end-to-end Personalized Text Generation Attack model, called \texttt{PETGEN}, that simultaneously reduces the efficacy of the detection model and generates posts that have several key desirable properties. Specifically, \texttt{PETGEN} generates posts that are personalized to the user's writing style, have knowledge about a given target context, are aware of the user's historical posts on the target context, and encapsulate the user's recent topical interests. We conduct extensive experiments on two real-world datasets (Yelp and Wikipedia, both with ground-truth of malicious users) to show that \texttt{PETGEN} significantly reduces the performance of popular deep user sequence embedding-based classification models. \texttt{PETGEN} outperforms five attack baselines in terms of text quality and attack efficacy in both white-box and black-box classifier settings. Overall, this work paves the path towards the next generation of adversary-aware sequence classification models.

* Accepted for publication at: 2021 ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'2021). Code and data at: