Despite their promising performance across various natural language processing (NLP) tasks, current NLP systems are vulnerable to textual adversarial attacks. To defend against these attacks, most existing methods apply adversarial training by incorporating adversarial examples. However, these methods rely on ground-truth labels to generate adversarial examples, rendering them impractical for the large-scale model pre-training that is now common in NLP and many other fields. In this paper, we propose a novel learning framework called SCAT (Self-supervised Contrastive Learning via Adversarial Training), which can learn robust representations without requiring labeled data. Specifically, SCAT modifies random augmentations of the data in a fully label-free manner to generate adversarial examples. Adversarial training is achieved by minimizing the contrastive loss between the augmentations and their adversarial counterparts. We evaluate SCAT on two text classification datasets using two recently proposed state-of-the-art attack schemes. Our results show that SCAT can not only train robust language models from scratch but also significantly improve the robustness of existing pre-trained language models. Moreover, to demonstrate its flexibility, we show that SCAT can be combined with supervised adversarial training to further enhance model robustness.
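As a concrete illustration of label-free adversarial training under a contrastive objective, the sketch below perturbs the embeddings of one augmented view in the direction that maximizes an InfoNCE loss, then trains on the clean/adversarial pair. This is a minimal embedding-space variant under assumed components (a generic `encoder` mapping embeddings to sentence representations, an FGSM-style step, temperature `tau`), not SCAT's exact token-level procedure.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """InfoNCE contrastive loss between two batches of paired representations."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                       # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)          # diagonal pairs are positives

def scat_style_step(encoder, emb1, emb2, eps=1e-2):
    """Perturb view 2 to maximize the contrastive loss (no labels needed),
    then return the training loss on the clean/adversarial pair."""
    emb2 = emb2.detach().requires_grad_(True)
    attack_loss = info_nce(encoder(emb1), encoder(emb2))
    grad, = torch.autograd.grad(attack_loss, emb2)
    emb2_adv = (emb2 + eps * grad.sign()).detach()   # FGSM-style perturbation
    return info_nce(encoder(emb1), encoder(emb2_adv))
```

Backpropagating the returned loss and stepping the optimizer implements the "minimize contrastive loss against adversarial counterparts" idea described above.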
Prompt Tuning is emerging as a scalable and cost-effective method for fine-tuning Pretrained Language Models (PLMs). This study benchmarks the performance and computational efficiency of Prompt Tuning and baseline methods on a multi-label text classification task, applied to the use case of classifying companies into an investment firm's proprietary industry taxonomy in support of its thematic investment strategy. Text-to-text classification with PLMs is frequently reported to outperform classification with a classification head, but it has several limitations in a multi-label setting where each label consists of multiple tokens: (a) generated labels may not match any label in the industry taxonomy; (b) during fine-tuning, multiple labels must be provided in an arbitrary order; (c) the model provides a binary decision for each label rather than an appropriate confidence score. Limitation (a) is addressed by applying constrained decoding using Trie Search, which slightly improves classification performance. All three limitations are addressed by replacing the PLM's language head with a classification head, which improves performance significantly while also reducing computational costs during inference. The results indicate the continuing need to adapt state-of-the-art methods to domain-specific tasks, even in the era of PLMs with strong generalization abilities.
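Limitation (a) and the Trie Search fix are straightforward to sketch. Assuming a HuggingFace-style model with the `prefix_allowed_tokens_fn` generation hook (the concrete tokenizer, model, and label set below are placeholders), a trie over tokenized labels restricts each decoding step to continuations that can still spell out a valid taxonomy label:

```python
class LabelTrie:
    """Trie over token-id sequences; each sequence should end with the model's
    EOS id so generation can terminate exactly at a complete label."""
    def __init__(self, label_token_ids):
        self.root = {}
        for ids in label_token_ids:
            node = self.root
            for tok in ids:
                node = node.setdefault(tok, {})

    def allowed(self, prefix):
        """Token ids that may follow `prefix`; [] means no valid continuation."""
        node = self.root
        for tok in prefix:
            node = node.get(tok)
            if node is None:
                return []
        return list(node.keys())

# Hypothetical usage (encoder-decoder model; strip the decoder start token first):
# trie = LabelTrie([tokenizer(l).input_ids for l in taxonomy_labels])
# model.generate(**inputs,
#                prefix_allowed_tokens_fn=lambda _, ids: trie.allowed(ids[1:].tolist()))
```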
With the growth of online services, the need for advanced text classification algorithms, such as sentiment analysis and biased-text detection, has become increasingly evident. The anonymous nature of online services often leads to biased and harmful language, posing challenges to the health of online communities. This is especially relevant in South Korea, where large-scale hate speech detection has not yet been broadly explored. In this paper, we introduce a comprehensive, large-scale dataset collected from a well-known South Korean SNS platform. The dataset annotates each text sample with (1) preference, (2) profanity, and (3) nine types of bias, enabling multi-task learning that classifies user-generated text along all of these dimensions simultaneously. Leveraging state-of-the-art BERT-based language models, our approach surpasses human-level accuracy across these classification tasks under a range of evaluation metrics. Beyond its academic contribution, our work offers practical tools for real-world hate speech and bias mitigation, contributing directly to the health of online communities, and provides a foundation for future research aiming to improve the quality of online discourse. All source code and datasets are publicly accessible at https://github.com/Dasol-Choi/KoMultiText.
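A minimal sketch of the multi-task setup described above: a shared BERT-style encoder with one head per annotation type. The head sizes (binary preference and profanity, nine bias types) follow the abstract; everything else ([CLS] pooling, hidden size 768) is an assumption.

```python
import torch.nn as nn

class MultiTaskClassifier(nn.Module):
    def __init__(self, encoder, hidden=768):
        super().__init__()
        self.encoder = encoder                   # e.g., a HuggingFace BERT encoder
        self.preference = nn.Linear(hidden, 2)   # assumed binary
        self.profanity = nn.Linear(hidden, 2)    # assumed binary
        self.bias = nn.Linear(hidden, 9)         # nine bias types

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids, attention_mask=attention_mask)
        cls = h.last_hidden_state[:, 0]          # [CLS] pooling
        return self.preference(cls), self.profanity(cls), self.bias(cls)
```

Training sums a cross-entropy loss per head, so one forward pass serves all three tasks.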
Multi-label text classification (MLTC) tasks in the medical domain often face a long-tailed label distribution in which rare classes have far fewer training samples than frequent ones. Although previous works have explored different model architectures and hierarchical label structures to find important features, most neglect to incorporate domain knowledge from medical guidelines. In this paper, we present DKEC, a Domain Knowledge Enhanced Classifier for medical diagnosis prediction, with two innovations: (1) a label-wise attention mechanism that incorporates a heterogeneous graph and domain ontologies to capture the semantic relationships between medical entities, and (2) a simple yet effective group-wise training method based on label similarity that increases the effective number of samples for rare classes. We evaluate DKEC on two real-world medical datasets: the RAA dataset, a collection of 4,417 patient care reports from emergency medical services (EMS) incidents, and a subset of 53,898 reports from the MIMIC-III dataset. Experimental results show that our method outperforms the state of the art, particularly on the few-shot (tail) classes. More importantly, we study the applicability of DKEC to different language models and show that it can help smaller language models achieve performance comparable to that of large language models.
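Innovation (1) can be sketched as follows: each label owns a query vector that attends over the token representations, so rare labels can gather their own evidence rather than sharing a single pooled vector. Initializing the label queries from knowledge-graph or ontology embeddings is where the domain knowledge would enter; the random initialization here is a placeholder.

```python
import torch
import torch.nn as nn

class LabelWiseAttention(nn.Module):
    def __init__(self, num_labels, hidden):
        super().__init__()
        # In a DKEC-style model these could be initialized from ontology embeddings.
        self.queries = nn.Parameter(torch.randn(num_labels, hidden))
        self.scorer = nn.Linear(hidden, 1)

    def forward(self, tokens, mask):
        # tokens: (B, T, H) encoder outputs; mask: (B, T), 1 for real tokens
        att = torch.einsum("lh,bth->blt", self.queries, tokens)
        att = att.masked_fill(mask.unsqueeze(1) == 0, float("-inf")).softmax(-1)
        label_repr = torch.einsum("blt,bth->blh", att, tokens)  # (B, L, H)
        return self.scorer(label_repr).squeeze(-1)              # (B, L) logits
```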
In our increasingly interconnected digital world, social media platforms have emerged as powerful channels for the dissemination of hate speech and offensive content. This work addresses hate speech detection in three low-resource Indian languages: Bengali, Assamese, and Gujarati. The challenge is framed as a binary text classification task: deciding whether a tweet is offensive or non-offensive. Using the HASOC 2023 datasets, we fine-tuned pre-trained BERT and SBERT models and evaluated their effectiveness in identifying hate speech. Our findings underscore the strength of monolingual sentence-BERT models, particularly for Bengali, where we achieved the highest ranking. Performance on Assamese and Gujarati, however, leaves room for improvement. Our goal is to foster inclusive online spaces by countering the proliferation of hate speech.
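For reference, a frozen-encoder baseline in the spirit of the SBERT setup takes only a few lines. The multilingual checkpoint and classifier below are illustrative stand-ins, not the monolingual models the study actually ranked highest:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

train_tweets = ["offensive example ...", "benign example ..."]  # placeholder data
train_labels = [1, 0]                                           # 1 = offensive

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
clf = LogisticRegression(max_iter=1000).fit(encoder.encode(train_tweets), train_labels)
preds = clf.predict(encoder.encode(["a new tweet to score"]))
```

Full fine-tuning of the sentence encoder, as done in the study, would typically improve on this frozen baseline.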
Recent work has shown that deep learning models are prone to exploiting spurious correlations that are present in the training set but may not hold true in general. For example, a sentiment classifier may erroneously learn that the token spielberg is always tied to positive movie reviews. Relying on spurious correlations can significantly degrade generalizability and should be avoided. In this paper, we propose a neighborhood analysis framework to explain how exactly language models exploit spurious correlations. Guided by this analysis, we propose a family of regularization methods, NFL (do Not Forget your Language), to prevent such exploitation. Experiments on two text classification tasks show that NFL brings a significant improvement in robustness over standard fine-tuning, without sacrificing in-distribution accuracy.
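One plausible instantiation of this idea, sketched under assumptions (the paper's NFL variants may differ), is an anchoring regularizer that keeps the fine-tuned encoder's representations close to those of a frozen copy of the pre-trained model, discouraging it from "forgetting its language" to chase dataset-specific shortcuts:

```python
import copy
import torch

def nfl_style_regularizer(model, frozen, input_ids, attention_mask):
    """Penalize drift of fine-tuned representations from the pre-trained ones."""
    with torch.no_grad():
        h0 = frozen(input_ids, attention_mask=attention_mask).last_hidden_state
    h = model(input_ids, attention_mask=attention_mask).last_hidden_state
    return ((h - h0) ** 2).mean()

# Before training: frozen = copy.deepcopy(model).eval().requires_grad_(False)
# Per batch:       loss = task_loss + lam * nfl_style_regularizer(model, frozen, ...)
```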
Effective out-of-distribution (OOD) detection is crucial for reliable machine learning models, yet most current methods are limited in practical use by requirements such as access to the training data or intervention in training. We present a novel method for detecting OOD data in deep neural networks based on transformation smoothness between intermediate layers of a network (BLOOD), which is applicable to pre-trained models without access to training data. BLOOD exploits the tendency of between-layer representation transformations of in-distribution (ID) data to be smoother than the corresponding transformations of OOD data, a property that we also demonstrate empirically for Transformer networks. We evaluate BLOOD on several text classification tasks with Transformer networks and demonstrate that it outperforms methods with comparable resource requirements. Our analysis also suggests that on simpler tasks, OOD data transformations maintain their original sharpness, whereas on more complex tasks the sharpness increases.
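The smoothness quantity can be estimated without any training data. The sketch below uses a Hutchinson-style estimate of the squared Frobenius norm of the Jacobian of a between-layer map (random Jacobian-vector products); higher values mean a sharper transformation and, per the property above, a more likely OOD input. The per-layer wrapping and normalization here are assumptions, not BLOOD's exact estimator.

```python
import torch

def layer_sharpness(layer_fn, h, n_samples=8):
    """Estimate ||J||_F^2 of the map h -> layer_fn(h) at h (shape (B, T, H)).
    For a HuggingFace block returning a tuple, pass lambda x: block(x)[0]."""
    est = 0.0
    for _ in range(n_samples):
        v = torch.randn_like(h)   # E[||J v||^2] = ||J||_F^2 for v ~ N(0, I)
        _, jvp = torch.autograd.functional.jvp(layer_fn, (h,), (v,))
        est = est + jvp.pow(2).sum()
    return est / n_samples

# OOD score: average layer_sharpness over the network's upper layers.
```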
Text classification has become a crucial task in various fields, leading to a significant amount of research on automated text classification systems for national and international languages. However, there is a growing need for systems that can handle local languages. This study aims to establish an automated classification system for Pashto text. To this end, we constructed a dataset of Pashto documents and applied a range of statistical and neural machine learning models, including DistilBERT-base-multilingual-cased, Multilayer Perceptron (MLP), Support Vector Machine, K-Nearest Neighbors, decision tree, Gaussian naïve Bayes, multinomial naïve Bayes, random forest, and logistic regression, to identify the most effective approach. We also evaluated two feature extraction methods: bag of words and Term Frequency-Inverse Document Frequency (TF-IDF). The study achieved an average test accuracy of 94% using the MLP classifier with TF-IDF features in single-label multiclass classification; the same combination also yielded the best F1-measure, 0.81. Furthermore, pre-trained language representation models such as DistilBERT showed promising results for Pashto text classification; however, the study highlights the importance of developing a language-specific tokenizer to achieve reasonable results.
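The best-performing configuration is easy to reproduce in outline. The sketch below wires TF-IDF features into an MLP with scikit-learn; the hidden size and iteration budget are illustrative defaults, not the study's tuned values, and the default tokenizer is exactly where a Pashto-specific tokenizer would be substituted:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

model = make_pipeline(
    TfidfVectorizer(),  # swap in a Pashto-aware tokenizer via the `tokenizer=` argument
    MLPClassifier(hidden_layer_sizes=(256,), max_iter=300),
)
# model.fit(train_docs, train_labels)
# accuracy = model.score(test_docs, test_labels)
```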
Transferability estimation aims to provide heuristics for quantifying how suitable a pre-trained model is for a specific downstream task without fine-tuning every candidate. Prior studies have revealed that well-trained models exhibit the phenomenon of neural collapse. Based on a widely used neural collapse metric from the existing literature, we observe a strong correlation between the neural collapse of pre-trained models and that of their corresponding fine-tuned models. Inspired by this observation, we propose a novel method termed Fair Collapse (FaCe) for transferability estimation, which comprehensively measures the degree of neural collapse in the pre-trained model. Specifically, FaCe comprises two terms: a variance collapse term, which assesses class separation and within-class compactness, and a class fairness term, which quantifies the fairness of the pre-trained model towards each class. We investigate FaCe on a variety of pre-trained classification models across different network architectures, source datasets, and training loss functions. FaCe yields state-of-the-art performance on tasks including image classification, semantic segmentation, and text classification, demonstrating the effectiveness and generality of our method.
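The variance collapse term builds on the standard neural collapse measurement of within-class versus between-class scatter of penultimate-layer features. The sketch below computes that generic quantity; FaCe's precise formulation and its class fairness term follow the paper:

```python
import numpy as np

def variance_collapse(features, labels):
    """features: (N, D) penultimate-layer activations; labels: (N,) ints.
    Smaller ratio = stronger collapse (tight classes, well-separated means)."""
    mu = features.mean(0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        fc = features[labels == c]
        mc = fc.mean(0)
        within += ((fc - mc) ** 2).sum()
        between += len(fc) * ((mc - mu) ** 2).sum()
    return within / max(between, 1e-12)
```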
Afghanistan has witnessed many armed conflicts throughout its history, especially over the past 20 years; these events have had a significant impact on human lives, both military and civilian. In this research, we leverage state-of-the-art machine learning techniques to classify the outcomes of armed conflicts in Afghanistan as either fatal or non-fatal, based on the textual descriptions provided by the Armed Conflict Location & Event Data Project (ACLED) dataset. The dataset contains comprehensive descriptions of armed conflicts in Afghanistan from August 2021 to March 2023. Our approach leverages BERT (Bidirectional Encoder Representations from Transformers), a cutting-edge language representation model in natural language processing: the classifier takes the raw textual description of an event and estimates the likelihood that the event resulted in a fatality. The model achieved strong performance on the test set, with an accuracy of 98.8%, recall of 98.05%, precision of 99.6%, and an F1 score of 98.82%. These results highlight the model's robustness and indicate its potential value for resource allocation, policymaking, and humanitarian aid efforts in Afghanistan, paving the way for future work on predicting event severity.
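A skeletal version of the described pipeline, fine-tuning BERT as a binary fatal/non-fatal classifier, might look as follows. The checkpoint name, hyperparameters, and dataset wiring are assumptions; the metric computation matches the measures reported above:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0 = non-fatal, 1 = fatal

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(-1)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {"accuracy": accuracy_score(labels, preds),
            "precision": p, "recall": r, "f1": f1}

# Train with transformers.Trainer over tokenized ACLED event descriptions,
# passing compute_metrics=compute_metrics to report accuracy/precision/recall/F1.
```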