In this paper, we address the problems of Arabic Text Classification and stemming using Transducers and Rational Kernels. We introduce a new stemming technique based on the use of Arabic patterns (Pattern Based Stemmer). Patterns are modelled using transducers and stemming is done without depending on any dictionary. Using transducers for stemming, documents are transformed into finite state transducers. This document representation allows us to use and explore rational kernels as a framework for Arabic Text Classification. Stemming experiments are conducted on three word collections and classification experiments are done on the Saudi Press Agency dataset. Results show that our approach, when compared with other approaches, is promising specially in terms of Accuracy, Recall and F1.
Considering that words with different characteristic in the text have different importance for classification, grouping them together separately can strengthen the semantic expression of each part. Thus we propose a new text representation scheme by clustering words according to their latent semantics and composing them together to get a set of cluster vectors, which are then concatenated as the final text representation. Evaluation on five classification benchmarks proves the effectiveness of our method. We further conduct visualization analysis showing statistical clustering results and verifying the validity of our motivation.
With the widespread use of the internet, the size of the text data increases day by day. Poems can be given as an example of the growing text. In this study, we aim to classify poetry according to poet. Firstly, data set consisting of three different poetry of poets written in English have been constructed. Then, text categorization techniques are implemented on it. Chi-Square technique are used for feature selection. In addition, five different classification algorithms are tried. These algorithms are Sequential minimal optimization, Naive Bayes, C4.5 decision tree, Random Forest and k-nearest neighbors. Although each classifier showed very different results, over the 70% classification success rate was taken by sequential minimal optimization technique.
Digitized archives contain and preserve the knowledge of generations of scholars in millions of documents. The size of these archives calls for automatic analysis since a manual analysis by specialists is often too expensive. In this paper, we focus on the analysis of handwriting in scanned documents from the art-historic archive of the WPI. Since the archive consists of documents written in several languages and lacks annotated training data for the creation of recognition models, we propose the task of handwriting classification as a new step for a handwriting OCR pipeline. We propose a handwriting classification model that labels extracted text fragments, eg, numbers, dates, or words, based on their visual structure. Such a classification supports historians by highlighting documents that contain a specific class of text without the need to read the entire content. To this end, we develop and compare several deep learning-based models for text classification. In extensive experiments, we show the advantages and disadvantages of our proposed approach and discuss possible usage scenarios on a real-world dataset.
Adversarial examples are artificially modified input samples which lead to misclassifications, while not being detectable by humans. These adversarial examples are a challenge for many tasks such as image and text classification, especially as research shows that many adversarial examples are transferable between different classifiers. In this work, we evaluate the performance of a popular defensive strategy for adversarial examples called defensive distillation, which can be successful in hardening neural networks against adversarial examples in the image domain. However, instead of applying defensive distillation to networks for image classification, we examine, for the first time, its performance on text classification tasks and also evaluate its effect on the transferability of adversarial text examples. Our results indicate that defensive distillation only has a minimal impact on text classifying neural networks and does neither help with increasing their robustness against adversarial examples nor prevent the transferability of adversarial examples between neural networks.
For management, documents are categorized into a specific category, and to do these, most of the organizations use manual labor. In today's automation era, manual efforts on such a task are not justified, and to avoid this, we have so many software out there in the market. However, efficiency and minimal resource consumption is the focal point which is also creating a competition. The categorization of such documents into specified classes by machine provides excellent help. One of categorization technique is text classification using a Convolutional neural network(TextCNN). TextCNN uses multiple sizes of filters, as in the case of the inception layer introduced in Googlenet. The network provides good accuracy but causes high memory consumption due to a large number of trainable parameters. As a solution to this problem, we introduced a whole new architecture based on separable convolution. The idea of separable convolution already exists in the field of image classification but not yet introduces to text classification tasks. With the help of this architecture, we can achieve a drastic reduction in trainable parameters.
Traditional text classifiers are limited to predicting over a fixed set of labels. However, in many real-world applications the label set is frequently changing. For example, in intent classification, new intents may be added over time while others are removed. We propose to address the problem of dynamic text classification by replacing the traditional, fixed-size output layer with a learned, semantically meaningful metric space. Here the distances between textual inputs are optimized to perform nearest-neighbor classification across overlapping label sets. Changing the label set does not involve removing parameters, but rather simply adding or removing support points in the metric space. Then the learned metric can be fine-tuned with only a few additional training examples. We demonstrate that this simple strategy is robust to changes in the label space. Furthermore, our results show that learning a non-Euclidean metric can improve performance in the low data regime, suggesting that further work on metric spaces may benefit low-resource research.
When a clinician refers a patient for an imaging exam, they include the reason (e.g. relevant patient history, suspected disease) in the scan request; this appears as the indication field in the radiology report. The interpretation and reporting of the image are substantially influenced by this request text, steering the radiologist to focus on particular aspects of the image. We use the indication field to drive better image classification, by taking a transformer network which is unimodally pre-trained on text (BERT) and fine-tuning it for multimodal classification of a dual image-text input. We evaluate the method on the MIMIC-CXR dataset, and present ablation studies to investigate the effect of the indication field on the classification performance. The experimental results show our approach achieves 87.8 average micro AUROC, outperforming the state-of-the-art methods for unimodal (84.4) and multimodal (86.0) classification. Our code is available at https://github.com/jacenkow/mmbt.
Over the last few years, Text classification is one of the fundamental tasks in natural language processing (NLP) in which the objective is to categorize text documents into one of the predefined classes. The news is full of our life. Therefore, news headlines classification is a crucial task to connect users with the right news. The news headline classification is a kind of text classification, which can be generally divided into three mainly parts: feature extraction, classifier selection, and evaluations. In this article, we use the dataset, containing news over a period of eighteen years provided by Kaggle platform to classify news headlines. We choose TF-IDF to extract features and neural network as the classifier, while the evaluation metrics is accuracy. From the experiment result, it is obvious that our NN model has the best performance among these models in the metrics of accuracy. The higher the accuracy is, the better performance the model will gain. Our NN model owns the accuracy 0.8622, which is highest accuracy among these four models. And it is 0.0134, 0.033, 0.080 higher than its of other models.
Text classification is a critical research topic with broad applications in natural language processing. Recently, graph neural networks (GNNs) have received increasing attention in the research community and demonstrated their promising results on this canonical task. Despite the success, their performance could be largely jeopardized in practice since they are: (1) unable to capture high-order interaction between words; (2) inefficient to handle large datasets and new documents. To address those issues, in this paper, we propose a principled model -- hypergraph attention networks (HyperGAT), which can obtain more expressive power with less computational consumption for text representation learning. Extensive experiments on various benchmark datasets demonstrate the efficacy of the proposed approach on the text classification task.