Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"Text Classification": models, code, and papers

Learning to increase matching efficiency in identifying additional b-jets in the $\text{t}\bar{\text{t}}\text{b}\bar{\text{b}}$ process

Mar 16, 2021
Cheongjae Jang, Sang-Kyun Ko, Yung-Kyun Noh, Jieun Choi, Jongwon Lim, Tae Jeong Kim

The $\text{t}\bar{\text{t}}\text{H}(\text{b}\bar{\text{b}})$ process is an essential channel to reveal the Higgs properties but has an irreducible background from the $\text{t}\bar{\text{t}}\text{b}\bar{\text{b}}$ process, which produces a top quark pair in association with a b quark pair. Therefore, understanding the $\text{t}\bar{\text{t}}\text{b}\bar{\text{b}}$ process is crucial for improving the sensitivity of a search for the $\text{t}\bar{\text{t}}\text{H}(\text{b}\bar{\text{b}})$ process. To this end, when measuring the differential cross-section of the $\text{t}\bar{\text{t}}\text{b}\bar{\text{b}}$ process, we need to distinguish the b-jets originated from top quark decays, and additional b-jets originated from gluon splitting. Since there are no simple identification rules, we adopt deep learning methods to learn from data to identify the additional b-jets from the $\text{t}\bar{\text{t}}\text{b}\bar{\text{b}}$ events. Specifically, by exploiting the special structure of the $\text{t}\bar{\text{t}}\text{b}\bar{\text{b}}$ event data, we propose several loss functions that can be minimized to directly increase the matching efficiency, the accuracy of identifying additional b-jets. We discuss the difference between our method and another deep learning-based approach based on binary classification arXiv:1910.14535 using synthetic data. We then verify that additional b-jets can be identified more accurately by increasing matching efficiency directly rather than the binary classification accuracy, using simulated $\text{t}\bar{\text{t}}\text{b}\bar{\text{b}}$ event data in the lepton+jets channel from pp collision at $\sqrt{s}$ = 13 TeV.


Zero-shot Text Classification With Generative Language Models

Dec 10, 2019
Raul Puri, Bryan Catanzaro

This work investigates the use of natural language to enable zero-shot model adaptation to new tasks. We use text and metadata from social commenting platforms as a source for a simple pretraining task. We then provide the language model with natural language descriptions of classification tasks as input and train it to generate the correct answer in natural language via a language modeling objective. This allows the model to generalize to new classification tasks without the need for multiple multitask classification heads. We show the zero-shot performance of these generative language models, trained with weak supervision, on six benchmark text classification datasets from the torchtext library. Despite no access to training data, we achieve up to a 45% absolute improvement in classification accuracy over random or majority class baselines. These results show that natural language can serve as simple and powerful descriptors for task adaptation. We believe this points the way to new metalearning strategies for text problems.


geoGAT: Graph Model Based on Attention Mechanism for Geographic Text Classification

Jan 13, 2021
Weipeng Jing, Xianyang Song, Donglin Di, Houbing Song

In the area of geographic information processing. There are few researches on geographic text classification. However, the application of this task in Chinese is relatively rare. In our work, we intend to implement a method to extract text containing geographical entities from a large number of network text. The geographic information in these texts is of great practical significance to transportation, urban and rural planning, disaster relief and other fields. We use the method of graph convolutional neural network with attention mechanism to achieve this function. Graph attention networks is an improvement of graph convolutional neural networks. Compared with GCN, the advantage of GAT is that the attention mechanism is proposed to weight the sum of the characteristics of adjacent nodes. In addition, We construct a Chinese dataset containing geographical classification from multiple datasets of Chinese text classification. The Macro-F Score of the geoGAT we used reached 95\% on the new Chinese dataset.


A Survey of Active Learning for Text Classification using Deep Neural Networks

Aug 17, 2020
Christopher Schröder, Andreas Niekler

Natural language processing (NLP) and neural networks (NNs) have both undergone significant changes in recent years. For active learning (AL) purposes, NNs are, however, less commonly used -- despite their current popularity. By using the superior text classification performance of NNs for AL, we can either increase a model's performance using the same amount of data or reduce the data and therefore the required annotation efforts while keeping the same performance. We review AL for text classification using deep neural networks (DNNs) and elaborate on two main causes which used to hinder the adoption: (a) the inability of NNs to provide reliable uncertainty estimates, on which the most commonly used query strategies rely, and (b) the challenge of training DNNs on small data. To investigate the former, we construct a taxonomy of query strategies, which distinguishes between data-based, model-based, and prediction-based instance selection, and investigate the prevalence of these classes in recent research. Moreover, we review recent NN-based advances in NLP like word embeddings or language models in the context of (D)NNs, survey the current state-of-the-art at the intersection of AL, text classification, and DNNs and relate recent advances in NLP to AL. Finally, we analyze recent work in AL for text classification, connect the respective query strategies to the taxonomy, and outline commonalities and shortcomings. As a result, we highlight gaps in current research and present open research questions.


Adapting Neural Text Classification for Improved Software Categorization

Jun 15, 2018
Alexander LeClair, Zachary Eberhart, Collin McMillan

Software Categorization is the task of organizing software into groups that broadly describe the behavior of the software, such as "editors" or "science." Categorization plays an important role in several maintenance tasks, such as repository navigation and feature elicitation. Current approaches attempt to cast the problem as text classification, to make use of the rich body of literature from the NLP domain. However, as we will show in this paper, text classification algorithms are generally not applicable off-the-shelf to source code; we found that they work well when high-level project descriptions are available, but suffer very large performance penalties when classifying source code and comments only. We propose a set of adaptations to a state-of-the-art neural classification algorithm and perform two evaluations: one with reference data from Debian end-user programs, and one with a set of C/C++ libraries that we hired professional programmers to annotate. We show that our proposed approach achieves performance exceeding that of previous software classification techniques as well as a state-of-the-art neural text classification technique.


The Power of Communities: A Text Classification Model with Automated Labeling Process Using Network Community Detection

Sep 25, 2019
Minjun Kim, Hiroki Sayama

The text classification is one of the most critical areas in machine learning and artificial intelligence research. It has been actively adopted in many business applications such as conversational intelligence systems, news articles categorizations, sentiment analysis, emotion detection systems, and many other recommendation systems in our daily life. One of the problems in supervised text classification models is that the models performance depend heavily on the quality of data labeling that are typically done by humans. In this study, we propose a new network community detection-based approach to automatically label and classify text data into multiclass value spaces. Specifically, we build a network with sentences as the network nodes and pairwise cosine similarities between TFIDF vector representations of the sentences as the network link weights. We use the Louvain method to detect the communities in the sentence network. We train and test Support vector machine and Random forest models on both the human labeled data and network community detection labeled data. Results showed that models with the data labeled by network community detection outperformed the models with the human-labeled data by 2.68-3.75% of classification accuracy. Our method may help development of a more accurate conversational intelligence system and other text classification systems.

* 14 pages, 6 figures, 1 table. Submitted for NetSci-X 2020 Tokyo 

Task-Adaptive Pre-Training for Boosting Learning With Noisy Labels: A Study on Text Classification for African Languages

Jun 03, 2022
Dawei Zhu, Michael A. Hedderich, Fangzhou Zhai, David Ifeoluwa Adelani, Dietrich Klakow

For high-resource languages like English, text classification is a well-studied task. The performance of modern NLP models easily achieves an accuracy of more than 90% in many standard datasets for text classification in English (Xie et al., 2019; Yang et al., 2019; Zaheer et al., 2020). However, text classification in low-resource languages is still challenging due to the lack of annotated data. Although methods like weak supervision and crowdsourcing can help ease the annotation bottleneck, the annotations obtained by these methods contain label noise. Models trained with label noise may not generalize well. To this end, a variety of noise-handling techniques have been proposed to alleviate the negative impact caused by the errors in the annotations (for extensive surveys see (Hedderich et al., 2021; Algan & Ulusoy, 2021)). In this work, we experiment with a group of standard noisy-handling methods on text classification tasks with noisy labels. We study both simulated noise and realistic noise induced by weak supervision. Moreover, we find task-adaptive pre-training techniques (Gururangan et al., 2020) are beneficial for learning with noisy labels.

* AfricaNLP Workshop @ ICLR2022 

Evolving Character-level Convolutional Neural Networks for Text Classification

Dec 03, 2020
Trevor Londt, Xiaoying Gao, Bing Xue, Peter Andreae

Character-level convolutional neural networks (char-CNN) require no knowledge of the semantic or syntactic structure of the language they classify. This property simplifies its implementation but reduces its classification accuracy. Increasing the depth of char-CNN architectures does not result in breakthrough accuracy improvements. Research has not established which char-CNN architectures are optimal for text classification tasks. Manually designing and training char-CNNs is an iterative and time-consuming process that requires expert domain knowledge. Evolutionary deep learning (EDL) techniques, including surrogate-based versions, have demonstrated success in automatically searching for performant CNN architectures for image analysis tasks. Researchers have not applied EDL techniques to search the architecture space of char-CNNs for text classification tasks. This article demonstrates the first work in evolving char-CNN architectures using a novel EDL algorithm based on genetic programming, an indirect encoding and surrogate models, to search for performant char-CNN architectures automatically. The algorithm is evaluated on eight text classification datasets and benchmarked against five manually designed CNN architecture and one long short-term memory (LSTM) architecture. Experiment results indicate that the algorithm can evolve architectures that outperform the LSTM in terms of classification accuracy and five of the manually designed CNN architectures in terms of classification accuracy and parameter count.


Unifying Question Answering and Text Classification via Span Extraction

Apr 19, 2019
Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, Richard Socher

Even as pre-trained language encoders such as BERT are shared across many tasks, the output layers of question answering and text classification models are significantly different. Span decoders are frequently used for question answering and fixed-class, classification layers for text classification. We show that this distinction is not necessary, and that both can be unified as span extraction. A unified, span-extraction approach leads to superior or comparable performance in multi-task learning, low-data and supplementary supervised pretraining experiments on several text classification and question answering benchmarks.


TextCNN with Attention for Text Classification

Aug 04, 2021
Ibrahim Alshubaily

The vast majority of textual content is unstructured, making automated classification an important task for many applications. The goal of text classification is to automatically classify text documents into one or more predefined categories. Recently proposed simple architectures for text classification such as Convolutional Neural Networks for Sentence Classification by Kim, Yoon showed promising results. In this paper, we propose incorporating an attention mechanism into the network to boost its performance, we also propose WordRank for vocabulary selection to reduce the network embedding parameters and speed up training with minimum accuracy loss. By adopting the proposed ideas TextCNN accuracy on 20News increased from 94.79 to 96.88, moreover, the number of parameters for the embedding layer can be reduced substantially with little accuracy loss by using WordRank. By using WordRank for vocabulary selection we can reduce the number of parameters by more than 5x from 7.9M to 1.5M, and the accuracy will only decrease by 1.2%.