"Text Classification": models, code, and papers

Improve Text Classification Accuracy with Intent Information

Dec 15, 2022
Yifeng Xie

Text classification, a core component of task-oriented dialogue systems, attracts continuous attention from both academia and industry and has seen tremendous progress. However, existing methods do not consider the use of label information, which may weaken the performance of text classification systems in some token-aware scenarios. To address this problem, in this paper we introduce label information as label embeddings for the text classification task and achieve remarkable performance on benchmark datasets.
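
The abstract does not detail the architecture. One common way to inject label information is label-attentive pooling, where each token is scored against every class embedding and the resulting label-specific document vectors produce the logits. A minimal PyTorch sketch under that assumption (the attention formulation and dimensions are illustrative, not the paper's exact design):

```python
import torch
import torch.nn.functional as F

def label_attentive_pooling(token_emb, label_emb):
    """token_emb: (seq_len, dim) contextual token embeddings;
    label_emb: (num_labels, dim) learned class-label embeddings.
    Returns one logit per label."""
    # Score every token against every label: (seq_len, num_labels)
    scores = token_emb @ label_emb.T / token_emb.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=0)          # attention over tokens, per label
    doc_per_label = attn.T @ token_emb       # (num_labels, dim) document views
    # Logit: similarity of each label's document view to its own embedding
    return (doc_per_label * label_emb).sum(-1)

logits = label_attentive_pooling(torch.randn(32, 128), torch.randn(5, 128))
probs = logits.softmax(-1)                   # class probabilities
```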

Less is More: Parameter-Free Text Classification with Gzip

Dec 19, 2022
Zhiying Jiang, Matthew Y. R. Yang, Mikhail Tsirlin, Raphael Tang, Jimmy Lin

Deep neural networks (DNNs) are often used for text classification because they usually achieve high accuracy. However, DNNs can be computationally intensive, with billions of parameters and large amounts of labeled data, which makes them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that is easy, lightweight, and universal in text classification: a combination of a simple compressor like gzip with a $k$-nearest-neighbor classifier. Without any training, pre-training, or fine-tuning, our method achieves results competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also performs particularly well in few-shot settings, where labeled data are too scarce for DNNs to achieve satisfying accuracy.
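
The method itself is compact enough to sketch directly: compute the normalized compression distance (NCD) between the test document and every training document using gzip, then take a majority vote over the k nearest neighbors. A hedged sketch (the toy dataset and k are placeholders; the paper's concatenation and tie-breaking details may differ):

```python
import gzip
from collections import Counter

def clen(s: str) -> int:
    """Length of the gzip-compressed byte string."""
    return len(gzip.compress(s.encode()))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    ca, cb = clen(a), clen(b)
    return (clen(a + " " + b) - min(ca, cb)) / max(ca, cb)

def classify(text: str, train: list, k: int = 3) -> str:
    """train: list of (text, label) pairs; majority vote over k nearest."""
    nearest = sorted(train, key=lambda tl: ncd(text, tl[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [("the match ended in a draw", "sports"),
         ("shares fell sharply on Monday", "finance"),
         ("the striker scored twice in the derby", "sports"),
         ("the central bank raised interest rates", "finance")]
print(classify("the goalkeeper saved a late penalty", train))
```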

Text classification in shipping industry using unsupervised models and Transformer based supervised models

Dec 21, 2022
Ying Xie, Dongping Song

Obtaining labelled data in a particular context can be expensive and time-consuming. Although different algorithms, including unsupervised learning, semi-supervised learning, and self-learning, have been adopted, the performance of text classification varies with context. Given the lack of a labelled dataset, we propose a novel and simple unsupervised text classification model to classify cargo content in the international shipping industry using the Standard International Trade Classification (SITC) codes. Our method represents words using pretrained GloVe word embeddings and finds the most likely label using cosine similarity. To compare the unsupervised text classification model with supervised classification, we also applied several Transformer models to classify cargo content. Due to the lack of training data, the SITC numerical codes and the corresponding textual descriptions were used as training data. A small amount of manually labelled cargo content data was used to evaluate the classification performance of the unsupervised classification and the Transformer-based supervised classification. The comparison reveals that unsupervised classification significantly outperforms Transformer-based supervised classification, even after the size of the training dataset is increased by 30%. The lack of training data is a key bottleneck that prevents deep learning models (such as Transformers) from succeeding in practical applications. Unsupervised classification can provide an alternative, efficient and effective, method to classify text when training data is scarce.

* 7 pages, 1 figure, 5 tables 
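
The core of the unsupervised model is simple enough to sketch: average pretrained GloVe vectors for the cargo text and for each label description, then pick the most similar label. A minimal sketch (the file name and label descriptions below are illustrative placeholders, not real SITC entries):

```python
import numpy as np

def load_glove(path: str) -> dict:
    """Parse a GloVe text file into {word: vector}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
    return vectors

def embed(text: str, glove: dict, dim: int = 50) -> np.ndarray:
    """Mean of the word vectors found in the text."""
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def classify(text: str, label_desc: dict, glove: dict) -> str:
    """Return the label whose description is most similar to the text."""
    doc = embed(text, glove)
    return max(label_desc, key=lambda l: cosine(doc, embed(label_desc[l], glove)))

# glove = load_glove("glove.6B.50d.txt")                       # illustrative
# classify("frozen fish fillets", {"03": "fish and crustaceans"}, glove)
```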

Data Augmentation using Transformers and Similarity Measures for Improving Arabic Text Classification

Jan 01, 2023
Dania Refai, Saleh Abo-Soud, Mohammad Abdel-Rahman

Learning models are highly dependent on data to work effectively, and they perform better when trained on big datasets. A large body of research addresses the dataset adequacy issue, and one promising approach is data augmentation (DA). In DA, the number of training instances is increased by applying different transformations to the available instances to generate new, correct, and representative ones. DA increases the dataset size and its variability, which enhances model performance and prediction accuracy; it also mitigates the class imbalance problem in classification. Few studies have so far considered DA for the Arabic language, and they rely on traditional augmentation approaches such as rule-based paraphrasing or noising-based techniques. In this paper, we propose a new Arabic DA method that employs a recent powerful modeling technique, AraGPT-2, for the augmentation process. The generated sentences are evaluated in terms of context, semantics, diversity, and novelty using Euclidean, cosine, Jaccard, and BLEU distances. Finally, the AraBERT transformer is used on sentiment classification tasks to evaluate the classification performance of the augmented Arabic dataset. The experiments were conducted on four Arabic sentiment datasets: AraSarcasm, ASTD, ATT, and MOVIE. The selected datasets vary in size, number of labels, and class imbalance. The results show that the proposed methodology enhances Arabic sentiment text classification on all datasets, with F1-score increases of 4% on AraSarcasm, 6% on ASTD, 9% on ATT, and 13% on MOVIE.

* 15 pages, 6 figures; this work has been submitted to the IEEE Access journal for possible publication 
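
One step that lends itself to a short sketch is the similarity-based screening of generated sentences: candidates that drift too far from the seed (off-topic) or sit too close to it (near-duplicates) can be filtered out. A hedged sketch using Jaccard similarity alone (the thresholds and the choice of a single measure are illustrative assumptions; the paper combines Euclidean, cosine, Jaccard, and BLEU):

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def filter_candidates(seed: str, candidates: list, lo=0.2, hi=0.8) -> list:
    """Keep generations related to the seed but not near-duplicates of it."""
    return [c for c in candidates if lo <= jaccard(seed, c) <= hi]
```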

Azimuth: Systematic Error Analysis for Text Classification

Dec 19, 2022
Gabrielle Gauthier-Melançon, Orlando Marquez Ayala, Lindsay Brin, Chris Tyler, Frédéric Branchaud-Charron, Joseph Marinier, Karine Grande, Di Le

We present Azimuth, an open-source and easy-to-use tool to perform error analysis for text classification. Compared to other stages of the ML development cycle, such as model training and hyper-parameter tuning, the process and tooling for the error analysis stage are less mature. However, this stage is critical for the development of reliable and trustworthy AI systems. To make error analysis more systematic, we propose an approach comprising dataset analysis and model quality assessment, which Azimuth facilitates. We aim to help AI practitioners discover and address areas where the model does not generalize by leveraging and integrating a range of ML techniques, such as saliency maps, similarity, uncertainty, and behavioral analyses, all in one tool. Our code and documentation are available at github.com/servicenow/azimuth.

* To be published in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 13 pages and 14 figures 
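
As a taste of one technique the tool integrates, here is a generic gradient-saliency sketch for a text classifier; this is emphatically not Azimuth's API (see the repository for that), and the checkpoint name is just a common public example:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

enc = tok("a quietly devastating film", return_tensors="pt")
# Run from the embedding layer so we can take gradients w.r.t. the inputs
emb = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
logits = model(inputs_embeds=emb, attention_mask=enc["attention_mask"]).logits
logits[0, logits.argmax()].backward()

# Per-token saliency: gradient magnitude on each token embedding
saliency = emb.grad.norm(dim=-1).squeeze(0)
for token, score in zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), saliency):
    print(f"{token:>12s} {score:.3f}")
```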

FastClass: A Time-Efficient Approach to Weakly-Supervised Text Classification

Dec 15, 2022
Tingyu Xia, Yue Wang, Yuan Tian, Yi Chang

Weakly-supervised text classification aims to train a classifier using only class descriptions and unlabeled data. Recent research shows that keyword-driven methods can achieve state-of-the-art performance on various tasks. However, these methods not only rely on carefully crafted class descriptions to obtain class-specific keywords but also require a substantial amount of unlabeled data and take a long time to train. This paper proposes FastClass, an efficient weakly-supervised classification approach. It uses dense text representations to retrieve class-relevant documents from an external unlabeled corpus and selects an optimal subset to train a classifier. Compared to keyword-driven methods, our approach is less reliant on initial class descriptions, as it no longer needs to expand each class description into a set of class-specific keywords. Experiments on a wide range of classification tasks show that the proposed approach frequently outperforms keyword-driven models in classification accuracy and often enjoys orders-of-magnitude faster training.

* EMNLP 2022 
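
The retrieve-then-train recipe can be sketched with an off-the-shelf sentence encoder: embed the class descriptions and the external corpus, take the documents nearest each class as pseudo-labeled data, and train any supervised classifier on them. A hedged sketch (the encoder checkpoint and plain top-k selection are assumptions; FastClass's own subset selection is more elaborate):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative encoder

def retrieve_pseudo_labels(class_descriptions, corpus, k=100):
    """Return (text, class_index) pairs for the k docs nearest each class."""
    c = encoder.encode(class_descriptions, normalize_embeddings=True)
    d = encoder.encode(corpus, normalize_embeddings=True)
    sims = d @ c.T                                  # (num_docs, num_classes)
    pairs = []
    for cls in range(len(class_descriptions)):
        for i in np.argsort(-sims[:, cls])[:k]:
            pairs.append((corpus[i], cls))
    return pairs    # train any supervised classifier on these pairs
```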

Evaluating Unsupervised Text Classification: Zero-shot and Similarity-based Approaches

Nov 29, 2022
Tim Schopf, Daniel Braun, Florian Matthes

Text classification of unseen classes is a challenging Natural Language Processing task and is mainly attempted using two types of approaches. Similarity-based approaches attempt to classify instances based on similarities between text document representations and class description representations. Zero-shot text classification approaches aim to generalize knowledge gained from a training task by assigning appropriate labels of unknown classes to text documents. Although existing studies have investigated individual approaches in these categories, the experiments in the literature do not provide a consistent comparison. This paper addresses this gap by conducting a systematic evaluation of different similarity-based and zero-shot approaches for text classification of unseen classes. Different state-of-the-art approaches are benchmarked on four text classification datasets, including a new dataset from the medical domain. Additionally, novel SimCSE- and SBERT-based baselines are proposed, as other baselines used in existing work yield weak classification results and are easily outperformed. Finally, the novel similarity-based Lbl2TransformerVec approach is presented, which outperforms previous state-of-the-art approaches in unsupervised text classification. Our experiments show that similarity-based approaches significantly outperform zero-shot approaches in most cases. Additionally, using SimCSE or SBERT embeddings instead of simpler text representations increases similarity-based classification results even further.

* Accepted to 6th International Conference on Natural Language Processing and Information Retrieval (NLPIR '22) 
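
The two families being compared can be contrasted in a few lines: an NLI-based zero-shot pipeline versus nearest-label-description search in embedding space. A sketch with common public checkpoints (these are illustrative defaults, not necessarily the models benchmarked in the paper):

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

labels = ["medicine", "finance", "sports"]
text = "The patient was prescribed a course of antibiotics."

# Zero-shot: label scores come from an NLI entailment model
zs = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
print(zs(text, candidate_labels=labels)["labels"][0])   # top-ranked label

# Similarity-based: nearest label description in embedding space
enc = SentenceTransformer("all-MiniLM-L6-v2")
sims = util.cos_sim(enc.encode(text), enc.encode(labels))
print(labels[int(sims.argmax())])
```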

Moto: Enhancing Embedding with Multiple Joint Factors for Chinese Text Classification

Dec 09, 2022
Xunzhu Tang, Rujie Zhu, Tiezhu Sun, Shi Wang

Recently, language representation techniques have achieved great performance in text classification. However, most existing representation models are designed specifically for English and may fail on Chinese because of the huge differences between the two languages. Moreover, most existing methods for Chinese text classification process text at a single level. Yet, as a special kind of hieroglyphic writing, the radicals of Chinese characters are good semantic carriers; in addition, Pinyin codes carry tone semantics, Wubi reflects stroke structure information, etc. Unfortunately, previous research has neglected to find an effective way to distill the useful parts of these four factors and to fuse them. In this work, we propose a novel model called Moto: Enhancing Embedding with Multiple Joint Factors. Specifically, we design an attention mechanism that distills the useful parts by fusing the four levels of information above more effectively. We conduct extensive experiments on four popular tasks. The empirical results show that Moto achieves state-of-the-art results: an F1-score of 0.8316 (a 2.11% improvement) on Chinese news titles, 96.38 (a 1.24% improvement) on the Fudan Corpus, and 0.9633 (a 3.26% improvement) on THUCNews.
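
The fusion step can be illustrated with a small attention module that weights one vector per feature level (character, radical, Pinyin, Wubi) and sums them; this is a generic sketch of attention-based fusion, not the paper's exact mechanism:

```python
import torch
import torch.nn as nn

class LevelFusion(nn.Module):
    """Learned attention over per-level feature vectors."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, levels):
        # levels: (batch, num_levels, dim), one vector per feature level
        weights = torch.softmax(self.score(levels), dim=1)  # (batch, levels, 1)
        return (weights * levels).sum(dim=1)                # (batch, dim)

fused = LevelFusion(dim=128)(torch.randn(8, 4, 128))        # 4 levels fused
```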

Learning Label Modular Prompts for Text Classification in the Wild

Dec 05, 2022
Hailin Chen, Amrita Saha, Shafiq Joty, Steven C. H. Hoi

Machine learning models usually assume i.i.d. data during training and testing, but data and tasks in the real world often change over time. To emulate this transient nature, we propose a challenging but practical task, text classification in the wild, which introduces different non-stationary training/testing stages. Decomposing a complex task into modular components can enable robust generalisation in such non-stationary environments. However, current modular approaches in NLP do not take advantage of recent advances in parameter-efficient tuning of pretrained language models. To close this gap, we propose ModularPrompt, a label-modular prompt tuning framework for text classification tasks. In ModularPrompt, the input prompt consists of a sequence of soft label prompts, each encoding modular knowledge related to the corresponding class label. In the two most formidable settings, ModularPrompt outperforms relevant baselines by a large margin, demonstrating strong generalisation ability. We also conduct a comprehensive analysis to validate whether the learned prompts satisfy the properties of a modular representation.

* accepted to EMNLP 2022 
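
The label-modular idea, one trainable soft prompt per class label concatenated in front of the input embeddings of a frozen language model, can be sketched as follows (shapes, initialization, and the frozen-LM interface are assumptions for illustration):

```python
import torch
import torch.nn as nn

class LabelModularPrompt(nn.Module):
    """One soft prompt per class label; active labels' prompts are prepended."""
    def __init__(self, num_labels: int, prompt_len: int, dim: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_labels, prompt_len, dim) * 0.02)

    def forward(self, input_emb, active_labels):
        # input_emb: (batch, seq, dim); active_labels: indices of labels in play
        prompt = self.prompts[active_labels].flatten(0, 1)      # (L*p, dim)
        prompt = prompt.unsqueeze(0).expand(input_emb.size(0), -1, -1)
        return torch.cat([prompt, input_emb], dim=1)  # feed to a frozen LM

pm = LabelModularPrompt(num_labels=10, prompt_len=8, dim=768)
extended = pm(torch.randn(4, 32, 768), active_labels=[2, 5, 7])
```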

Active Learning for Abstractive Text Summarization

Jan 09, 2023
Akim Tsvigun, Ivan Lysenko, Danila Sedashov, Ivan Lazichny, Eldar Damirov, Vladimir Karlov, Artemy Belousov, Leonid Sanochkin, Maxim Panov, Alexander Panchenko, Mikhail Burtsev, Artem Shelmanov

Constructing human-curated annotated datasets for abstractive text summarization (ATS) is very time-consuming and expensive, because creating each instance requires a human annotator to read a long document and compose a shorter summary that preserves the key information of the original document. Active learning (AL) is a technique developed to reduce the amount of annotation required to achieve a certain level of machine learning model performance. In information extraction and text classification, AL can reduce the amount of labor by up to several times. Despite its potential for reducing expensive annotation, to the best of our knowledge, there have been no effective AL query strategies for ATS. This stems from the fact that many AL strategies rely on uncertainty estimation, while, as we show in our work, uncertain instances are usually noisy, and selecting them can degrade model performance compared to passive annotation. We address this problem by proposing the first effective query strategy for AL in ATS, based on diversity principles. We show that, given a certain annotation budget, using our strategy for AL annotation improves model performance in terms of ROUGE and consistency scores. Additionally, we analyze the effect of self-learning and show that it can further increase model performance.

* Accepted at EMNLP-2022 Findings
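
A generic diversity-based query strategy in the spirit described above: cluster the embeddings of the unlabeled pool and annotate the instance closest to each centroid. This is a common recipe, sketched here as an illustration rather than the paper's exact algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def diversity_query(embeddings: np.ndarray, budget: int) -> list:
    """embeddings: (n_docs, dim); returns `budget` indices to annotate."""
    km = KMeans(n_clusters=budget, n_init=10).fit(embeddings)
    picks = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return picks

queried = diversity_query(np.random.randn(500, 64), budget=10)
```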