"Text Classification": models, code, and papers

Profitable Trade-Off Between Memory and Performance In Multi-Domain Chatbot Architectures

Nov 06, 2021
D Emre Tasar, Sukru Ozan, M Fatih Akca, Oguzhan Olmez, Semih Gulum, Secilay Kutay, Ceren Belhan

Text classification is a broad area of study within natural language processing. In short, the task is to determine which of a set of predefined classes a given text belongs to, and many successful studies have addressed it. This study uses Bidirectional Encoder Representations from Transformers (BERT), a frequently preferred method for classification problems in natural language processing. By solving several classification problems with a single model in a chatbot architecture, the study aims to reduce the server load that running a separate model for each classification problem would create. To this end, a masking method is applied at prediction time to a single BERT model trained for classification across multiple subjects, so that the model's predictions are restricted to the problem at hand. Three separate datasets covering different domains are split by various methods to make the problem harder, which also introduces classification problems that are very close to each other in terms of domain. The resulting dataset consists of five classification problems with 154 classes. A single BERT model covering all classification problems is compared with BERT models trained specifically for each problem, in terms of performance and the space they occupy on the server.

* In Turkish. ICADA 21, 1st International Conference on Artificial Intelligence and Data Science, Nov 26-Nov 28, 2021, Izmir Katip Celebi University, Izmir, Turkey
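
The masking step described in the abstract above can be illustrated with a minimal sketch. It assumes that masking means restricting a shared output layer to the classes of the problem being asked at prediction time. The abstract gives no implementation details, so the class layout, the `PROBLEM_CLASSES` table, and the function names are hypothetical, and a plain Python list stands in for the BERT logits.

```python
import math

def masked_softmax(logits, allowed):
    """Softmax restricted to the class indices in `allowed`; all others get probability 0."""
    allowed = set(allowed)
    if not allowed:
        raise ValueError("allowed must contain at least one class index")
    # Subtract the largest allowed logit for numerical stability.
    peak = max(logits[i] for i in allowed)
    exps = [math.exp(z - peak) if i in allowed else 0.0 for i, z in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical layout: two problems sharing one 5-way output layer.
PROBLEM_CLASSES = {"intent": [0, 1, 2], "sentiment": [3, 4]}

def predict(logits, problem):
    """Return the best class index among those belonging to `problem`."""
    probs = masked_softmax(logits, PROBLEM_CLASSES[problem])
    return max(range(len(probs)), key=probs.__getitem__)

logits = [0.2, 1.5, -0.3, 2.8, 0.1]   # stand-in for one model output
print(predict(logits, "intent"))      # 1
print(predict(logits, "sentiment"))   # 3
```

Without the mask, the largest logit (index 3) would win even when the question belongs to the `intent` problem; with it, one set of weights can serve several problems.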

Collective Classification of Textual Documents by Guided Self-Organization in T-Cell Cross-Regulation Dynamics

Feb 04, 2011
Alaa Abi-Haidar, Luis M. Rocha

We present and study an agent-based model of T-Cell cross-regulation in the adaptive immune system, which we apply to binary classification. Our method expands an existing analytical model of T-cell cross-regulation (Carneiro et al. in Immunol Rev 216(1):48-68, 2007) that was used to study the self-organizing dynamics of a single population of T-Cells in interaction with an idealized antigen presenting cell capable of presenting a single antigen. With agent-based modeling we are able to study the self-organizing dynamics of multiple populations of distinct T-cells which interact via antigen presenting cells that present hundreds of distinct antigens. Moreover, we show that such self-organizing dynamics can be guided to produce an effective binary classification of antigens, which is competitive with existing machine learning methods when applied to biomedical text classification. More specifically, here we test our model on a dataset of publicly available full-text biomedical articles provided by the BioCreative challenge (Krallinger in The BioCreative II.5 challenge overview, p 19, 2009). We study the robustness of our model's parameter configurations, and show that it leads to encouraging results comparable to state-of-the-art classifiers. Our results help us understand both T-cell cross-regulation as a general principle of guided self-organization and its applicability to document classification. Therefore, we show that our bio-inspired algorithm is a promising novel method for biomedical article classification and for binary document classification in general.

* Evolutionary Intelligence. 2011. Volume 4, Number 2, 69-80 

An Empirical Evaluation of Text Representation Schemes on Multilingual Social Web to Filter the Textual Aggression

Apr 16, 2019
Sandip Modha, Prasenjit Majumder

This paper studies the effectiveness of text representation schemes on two tasks: user aggression detection and fact detection in social media content. In user aggression detection, the aim is to identify the level of aggression in social media content written in English, Devanagari Hindi, and Romanized Hindi. Aggression levels are categorized into three predefined classes: `Non-aggressive`, `Overtly Aggressive`, and `Covertly Aggressive`. During disaster-related incidents, social media such as Twitter is flooded with millions of posts. In such emergency situations, identifying factual posts is important for organizations involved in relief operations. We approach this problem as a combination of classification and ranking. This paper compares various text representation schemes based on BoW techniques, distributed word/sentence representations, and transfer learning on classifiers. The weighted $F_1$ score is used as the primary evaluation metric. Results show that BoW text representations perform better than word embeddings with machine learning classifiers, while pre-trained word embedding techniques perform better with classifiers based on deep neural networks. Recent transfer learning models such as ELMo and ULMFiT are fine-tuned for the aggression classification task; however, their results are not on par with pre-trained word embedding models. Overall, word embeddings using fastText produce a better weighted $F_1$ score than Word2Vec and GloVe. Results are further improved using pre-trained vector models. Statistical significance tests are employed to confirm the significance of the classification results. When the test dataset is lexically different from the training dataset, deep neural models are more robust and perform substantially better than machine learning classifiers.

* 21 pages, 2 figures
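
For readers unfamiliar with the metric named in the abstract above, here is a minimal sketch of a weighted $F_1$ score under its usual definition: the per-class $F_1$ averaged with weights equal to each class's share of the true labels. The four-example label lists are invented for illustration and are not data from the paper.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score

# Invented toy labels using the three class names from the abstract.
y_true = ["Non-aggressive", "Non-aggressive", "Overtly Aggressive", "Covertly Aggressive"]
y_pred = ["Non-aggressive", "Overtly Aggressive", "Overtly Aggressive", "Covertly Aggressive"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.75
```

Weighting by support keeps a rare class from dominating the average, which matters when class sizes are unbalanced.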

A Graph Total Variation Regularized Softmax for Text Generation

Jan 01, 2021
Liu Bin, Wang Liang, Yin Guosheng

The softmax operator is one of the most important functions in machine learning models. When applying neural networks to multi-category classification, the correlations among different categories are often ignored. For example, in text generation, a language model makes a choice of each new word based only on the former selection of its context. In this scenario, the link statistics information of concurrent words based on a corpus (an analogy of the natural way of expression) is also valuable in choosing the next word, which can help to improve the sentence's fluency and smoothness. To fully explore such important information, we propose a graph softmax function for text generation. It is expected that the final classification result would be dominated by both the language model and graphical text relationships among words. We use a graph total variation term to regularize softmax so as to incorporate the concurrent relationship into the language model. The total variation of the generated words should be small locally. We apply the proposed graph softmax to GPT2 for the text generation task. Experimental results demonstrate that the proposed graph softmax achieves better BLEU and perplexity than softmax. Human testers can also easily distinguish the text generated by the graph softmax or softmax.
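
The regularizer mentioned in the abstract above can be made concrete with a small sketch. The abstract does not state the exact formulation, so this assumes the common definition of graph total variation, the weighted sum of absolute differences across edges, applied to a softmax output. The four-word vocabulary and the edge weights are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def graph_total_variation(p, edges):
    """Sum of w * |p[i] - p[j]| over weighted edges given as (i, j, w)."""
    return sum(w * abs(p[i] - p[j]) for i, j, w in edges)

# Hypothetical 4-word vocabulary; an edge links two words that co-occur in a corpus.
edges = [(0, 1, 1.0), (1, 2, 0.5)]

smooth = softmax([1.0, 1.0, 1.0, 1.0])  # uniform: no variation along any edge
peaked = softmax([4.0, 0.0, 0.0, 0.0])  # almost all mass on word 0

print(graph_total_variation(smooth, edges))      # 0.0
print(graph_total_variation(peaked, edges) > 0)  # True
```

A penalty of this form is small when linked words receive similar probabilities, which is one way to read the abstract's statement that the total variation should be small locally.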


Active Discriminative Text Representation Learning

Dec 01, 2016
Ye Zhang, Matthew Lease, Byron C. Wallace

We propose a new active learning (AL) method for text classification with convolutional neural networks (CNNs). In AL, one selects the instances to be manually labeled with the aim of maximizing model performance with minimal effort. Neural models capitalize on word embeddings as representations (features), tuning these to the task at hand. We argue that AL strategies for multi-layered neural models should focus on selecting instances that most affect the embedding space (i.e., induce discriminative word representations). This is in contrast to traditional AL approaches (e.g., entropy-based uncertainty sampling), which specify higher level objectives. We propose a simple approach for sentence classification that selects instances containing words whose embeddings are likely to be updated with the greatest magnitude, thereby rapidly learning discriminative, task-specific embeddings. We extend this approach to document classification by jointly considering: (1) the expected changes to the constituent word representations; and (2) the model's current overall uncertainty regarding the instance. The relative emphasis placed on these criteria is governed by a stochastic process that favors selecting instances likely to improve representations at the outset of learning, and then shifts toward general uncertainty sampling as AL progresses. Empirical results show that our method outperforms baseline AL approaches on both sentence and document classification tasks. We also show that, as expected, the method quickly learns discriminative word embeddings. To the best of our knowledge, this is the first work on AL addressing neural models for text classification.

* Accepted at AAAI 2017
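
As a point of reference for the abstract above, this is a minimal sketch of entropy-based uncertainty sampling, the traditional AL baseline it contrasts against. It is not the paper's proposed embedding-focused method. The predicted probabilities in `pool_probs` are invented, and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(pool_probs, k):
    """Return indices of the k unlabeled instances with the highest predictive entropy."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# Invented model outputs for three unlabeled instances (binary task).
pool_probs = [[0.98, 0.02], [0.55, 0.45], [0.80, 0.20]]
print(select_most_uncertain(pool_probs, 2))  # [1, 2]
```

The instances returned are those the current model is least sure about; they would be sent for manual labeling before the model is retrained.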

Joint Energy-based Detection and Classification of Multilingual Text Lines

Jul 23, 2014
Igor Milevskiy, Yuri Boykov

This paper proposes a new hierarchical MDL-based model for a joint detection and classification of multilingual text lines in images taken by hand-held cameras. The majority of related text detection methods assume alphabet-based writing in a single language, e.g. in Latin. They use simple clustering heuristics specific to such texts: proximity between letters within one line, larger distance between separate lines, etc. We are interested in a significantly more ambiguous problem where images combine alphabet and logographic characters from multiple languages and typographic rules vary a lot (e.g. English, Korean, and Chinese). Complexity of detecting and classifying text lines in multiple languages calls for a more principled approach based on information-theoretic principles. Our new MDL model includes data costs combining geometric errors with classification likelihoods and a hierarchical sparsity term based on label costs. This energy model can be efficiently minimized by fusion moves. We demonstrate robustness of the proposed algorithm on a large new database of multilingual text images collected in the public transit system of Seoul.


ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition -- RRC-MLT-2019

Jul 01, 2019
Nibal Nayef, Yash Patel, Michal Busta, Pinaki Nath Chowdhury, Dimosthenis Karatzas, Wafa Khlif, Jiri Matas, Umapada Pal, Jean-Christophe Burie, Cheng-lin Liu, Jean-Marc Ogier

With the growing cosmopolitan culture of modern cities, the need for robust Multi-Lingual scene Text (MLT) detection and recognition systems has never been greater. With the goal to systematically benchmark and push the state-of-the-art forward, the proposed competition builds on top of the RRC-MLT-2017 with an additional end-to-end task, an additional language in the real images dataset, a large scale multi-lingual synthetic dataset to assist the training, and a baseline End-to-End recognition method. The real dataset consists of 20,000 images containing text from 10 languages. The challenge has 4 tasks covering various aspects of multi-lingual scene text: (a) text detection, (b) cropped word script classification, (c) joint text detection and script classification and (d) end-to-end detection and recognition. In total, the competition received 60 submissions from the research and industrial communities. This paper presents the dataset, the tasks and the findings of the presented RRC-MLT-2019 challenge.

* ICDAR'19 camera-ready version. Competition available at The first two authors contributed equally 

Automated Coding of Under-Studied Medical Concept Domains: Linking Physical Activity Reports to the International Classification of Functioning, Disability, and Health

Nov 27, 2020
Denis Newman-Griffis, Eric Fosler-Lussier

Linking clinical narratives to standardized vocabularies and coding systems is a key component of unlocking the information in medical text for analysis. However, many domains of medical concepts lack well-developed terminologies that can support effective coding of medical text. We present a framework for developing natural language processing (NLP) technologies for automated coding of under-studied types of medical information, and demonstrate its applicability via a case study on physical mobility function. Mobility is a component of many health measures, from post-acute care and surgical outcomes to chronic frailty and disability, and is coded in the International Classification of Functioning, Disability, and Health (ICF). However, mobility and other types of functional activity remain under-studied in medical informatics, and neither the ICF nor commonly-used medical terminologies capture functional status terminology in practice. We investigated two data-driven paradigms, classification and candidate selection, to link narrative observations of mobility to standardized ICF codes, using a dataset of clinical narratives from physical therapy encounters. Recent advances in language modeling and word embedding were used as features for established machine learning models and a novel deep learning approach, achieving a macro F-1 score of 84% on linking mobility activity reports to ICF codes. Both classification and candidate selection approaches present distinct strengths for automated coding in under-studied domains, and we highlight that the combination of (i) a small annotated data set; (ii) expert definitions of codes of interest; and (iii) a representative text corpus is sufficient to produce high-performing automated coding systems. This study has implications for the ongoing growth of NLP tools for a variety of specialized applications in clinical care and research.

* 30 pages (21 text + 9 references); 9 figures, 2 tables 

Computational analyses of the topics, sentiments, literariness, creativity and beauty of texts in a large Corpus of English Literature

Jan 12, 2022
Arthur M. Jacobs, Annette Kinder

The Gutenberg Literary English Corpus (GLEC, Jacobs, 2018a) provides a rich source of textual data for research in digital humanities, computational linguistics or neurocognitive poetics. In this study we address differences among the different literature categories in GLEC, as well as differences between authors. We report the results of three studies providing i) topic and sentiment analyses for six text categories of GLEC (i.e., children and youth, essays, novels, plays, poems, stories) and its >100 authors, ii) novel measures of semantic complexity as indices of the literariness, creativity and book beauty of the works in GLEC (e.g., Jane Austen's six novels), and iii) two experiments on text classification and authorship recognition using novel features of semantic complexity. The data on two novel measures estimating a text's literariness, intratextual variance and stepwise distance (van Cranenburgh et al., 2019) revealed that plays are the most literary texts in GLEC, followed by poems and novels. Computation of a novel index of text creativity (Gray et al., 2016) revealed poems and plays as the most creative categories with the most creative authors all being poets (Milton, Pope, Keats, Byron, or Wordsworth). We also computed a novel index of perceived beauty of verbal art (Kintsch, 2012) for the works in GLEC and predict that Emma is the theoretically most beautiful of Austen's novels. Finally, we demonstrate that these novel measures of semantic complexity are important features for text classification and authorship recognition with overall predictive accuracies in the range of .75 to .97. Our data pave the way for future computational and empirical studies of literature or experiments in reading psychology and offer multiple baselines and benchmarks for analysing and validating other book corpora.

* 37 pages, 12 figures 

Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers

May 23, 2018
Ji Gao, Jack Lanchantin, Mary Lou Soffa, Yanjun Qi

Although various techniques have been proposed to generate adversarial samples for white-box attacks on text, little attention has been paid to black-box attacks, which are more realistic scenarios. In this paper, we present a novel algorithm, DeepWordBug, to effectively generate small text perturbations in a black-box setting that force a deep-learning classifier to misclassify a text input. We employ novel scoring strategies to identify the critical tokens that, if modified, cause the classifier to make an incorrect prediction. Simple character-level transformations are applied to the highest-ranked tokens in order to minimize the edit distance of the perturbation, yet change the original classification. We evaluated DeepWordBug on eight real-world text datasets, including text classification, sentiment analysis, and spam detection. We compare the result of DeepWordBug with two baselines: Random (Black-box) and Gradient (White-box). Our experimental results indicate that DeepWordBug reduces the prediction accuracy of current state-of-the-art deep-learning models, including a decrease of 68% on average for a Word-LSTM model and 48% on average for a Char-CNN model.

* This is an extended version of the 6-page workshop version appearing in the 1st Deep Learning and Security Workshop, co-located with IEEE S&P
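
The two-stage idea in the abstract above (score tokens using only the classifier's output, then apply a small character-level edit to the top-ranked ones) can be sketched as follows. This is an illustration under stated assumptions, not the DeepWordBug implementation: the leave-one-out scoring, the adjacent-character swap, and the keyword-based `toy_spam_score` classifier are all stand-ins chosen for brevity.

```python
def toy_spam_score(text):
    """Hypothetical black-box classifier: fraction of tokens found in a spam word list."""
    spam_words = {"free", "winner", "prize"}
    tokens = text.lower().split()
    return sum(t in spam_words for t in tokens) / len(tokens) if tokens else 0.0

def rank_tokens(tokens, score_fn):
    """Rank token positions by how much the score drops when that token is removed."""
    base = score_fn(" ".join(tokens))
    drops = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        drops.append((base - score_fn(" ".join(reduced)), i))
    # Largest drop first; ties broken by position.
    return [i for _, i in sorted(drops, key=lambda d: (-d[0], d[1]))]

def swap_adjacent(word):
    """Character-level edit: swap the second and third characters of words with 4+ letters."""
    if len(word) < 4:
        return word
    return word[0] + word[2] + word[1] + word[3:]

def perturb(text, score_fn, budget=1):
    """Edit the `budget` highest-ranked tokens, querying only score_fn (black-box)."""
    tokens = text.split()
    for i in rank_tokens(tokens, score_fn)[:budget]:
        tokens[i] = swap_adjacent(tokens[i])
    return " ".join(tokens)

original = "claim your free prize now"
adversarial = perturb(original, toy_spam_score, budget=2)
print(adversarial)                                            # claim your fere pirze now
print(toy_spam_score(original), toy_spam_score(adversarial))  # 0.4 0.0
```

The perturbed text stays readable to a person while the toy classifier's score falls to zero, and the attack never inspects the classifier's internals, only its outputs.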