Considerable effort is currently devoted to methods for analyzing and understanding the impressive performance of deep neural networks on tasks such as image or text classification. These methods are mainly based on visualizing the important input features that the network takes into account to reach a decision. However, these techniques, such as LIME, SHAP, Grad-CAM, or TDS, require extra effort to interpret the visualization with respect to expert knowledge. In this paper, we propose a novel approach to inspect the hidden layers of a fitted CNN in order to extract interpretable linguistic objects from texts by exploiting the classification process. In particular, we detail a weighted extension of the Text Deconvolution Saliency (wTDS) measure, which can be used to highlight the relevant features used by the CNN to perform the classification task. We empirically demonstrate the efficiency of our approach on corpora from two different languages: English and French. On all datasets, wTDS automatically encodes complex linguistic objects based on co-occurrences and possibly on grammatical and syntactic analysis.
The objective of this research is to enhance the performance of the Stochastic Gradient Descent (SGD) algorithm in text classification. We propose using SGD learning with a grid-search approach to fine-tune hyper-parameters in order to enhance the performance of SGD classification. As a pre-classification step, we explored different settings for representing, transforming, and weighting features from the summary descriptions of terrorist attack incidents obtained from the Global Terrorism Database. We then validated SGD learning on Support Vector Machine (SVM), Logistic Regression, and Perceptron classifiers by stratified 10-fold cross-validation to compare the performance of the different classifiers embedded in the SGD algorithm. The research concludes that using a grid search to find the hyper-parameters optimizes SGD classification, not only in the pre-classification settings but also in the performance of the classifiers in terms of accuracy and execution time.
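The combination of grid search and stratified cross-validation described above can be sketched as follows. This is a minimal illustration in plain Python, not the study's implementation: the perceptron is only one of the three classifiers the abstract mentions, the two-dimensional toy features stand in for weighted text features, and every function name and hyper-parameter value here is an assumption made for the example.

```python
from collections import defaultdict
from itertools import product


def stratified_folds(labels, k):
    """Assign sample indices to k folds, round-robin within each class."""
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds


def train_perceptron(X, y, lr, epochs):
    """Perceptron with per-sample (stochastic) updates; labels are +1 / -1."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            if t * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b


def accuracy(model, X, y):
    """Fraction of samples whose predicted sign matches the label."""
    w, b = model
    hits = 0
    for x, t in zip(X, y):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
        hits += pred == t
    return hits / len(y)


def cv_score(X, y, k, **params):
    """Mean held-out accuracy over k stratified folds."""
    scores = []
    for held_out in stratified_folds(y, k):
        test = set(held_out)
        train = [i for i in range(len(y)) if i not in test]
        model = train_perceptron([X[i] for i in train], [y[i] for i in train], **params)
        scores.append(accuracy(model, [X[i] for i in held_out], [y[i] for i in held_out]))
    return sum(scores) / k


def grid_search(X, y, param_grid, k):
    """Evaluate every hyper-parameter combination; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, -1.0
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = cv_score(X, y, k, **params)
        if score > best_score:  # ties keep the first combination tried
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical toy data: two linearly separable classes.
X = [[2, 1], [3, 2], [2, 3], [3, 3], [-2, -1], [-3, -2], [-2, -3], [-3, -3]]
y = [1, 1, 1, 1, -1, -1, -1, -1]
best_params, best_score = grid_search(X, y, {"lr": [0.1, 1.0], "epochs": [1, 5]}, k=2)
print(best_params, best_score)
```

In this sketch the grid covers only the classifier's learning rate and number of epochs; the pre-classification settings the abstract mentions (representation, transformation, weighting) could be added as further entries in the same grid.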
The preprocessing phase is one of the key phases within the text classification pipeline. This study investigates the impact of the preprocessing phase on text classification, specifically on offensive language and hate speech classification for Arabic text. The Arabic language used in social media is informal and written using Arabic dialects, which makes the text classification task very complex. Preprocessing helps in reducing dimensionality and removing useless content. We apply intensive preprocessing techniques to the dataset before processing it further and feeding it into the classification model. This intensive preprocessing-based approach demonstrates a significant impact on the offensive language detection and hate speech detection shared tasks of the fourth workshop on Open-Source Arabic Corpora and Corpora Processing Tools (OSACT). Our team won third place in Sub-Task A (Offensive Language Detection) and first place in Sub-Task B (Hate Speech Detection), with F1 scores of 89% and 95%, respectively, achieving state-of-the-art performance in terms of F1, accuracy, recall, and precision for Arabic hate speech detection.
Genre identification is a subclass of non-topical text classification. The main difference between this task and topical classification is that genres, unlike topics, usually do not correspond to simple keywords, and thus they need to be defined in terms of their functions in communication. Neural models based on pre-trained transformers, such as BERT or XLM-RoBERTa, demonstrate SOTA results in many NLP tasks, including non-topical classification. However, in many cases, their downstream application to very large corpora, such as those extracted from social media, can lead to unreliable results because of dataset shift, when some raw texts do not match the profile of the training set. To mitigate this problem, we experiment with individual models as well as with their ensembles. To evaluate the robustness of all models, we use a prediction confidence metric, which estimates the reliability of a prediction in the absence of a gold standard label. We can evaluate robustness via the confidence gap between the correctly classified texts and the misclassified ones on a labeled test corpus. A larger gap makes it easier to trust that the classifier made the right decision. Our results show that there is a confidence gap for all of the classifiers tested in this study, but the gap is bigger for the ensembles, meaning that ensembles are more robust than their individual models.
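The confidence gap described above can be illustrated with a short sketch. The abstract does not specify the confidence metric or how the ensembles are built, so two assumptions are made here: confidence is taken to be the largest predicted class probability, and the ensemble averages the class probabilities of its members. The probabilities and labels are invented toy values.

```python
from statistics import mean


def confidence(probs):
    """Assumed confidence metric: the largest class probability."""
    return max(probs)


def predict(probs):
    """Index of the most probable class (first index wins ties)."""
    return max(range(len(probs)), key=probs.__getitem__)


def ensemble(prob_lists):
    """Assumed ensembling rule: average the members' class probabilities for one text."""
    n = len(prob_lists)
    return [sum(column) / n for column in zip(*prob_lists)]


def confidence_gap(all_probs, gold):
    """Mean confidence on correct predictions minus mean confidence on errors."""
    correct = [confidence(p) for p, y in zip(all_probs, gold) if predict(p) == y]
    wrong = [confidence(p) for p, y in zip(all_probs, gold) if predict(p) != y]
    if not correct or not wrong:
        raise ValueError("need at least one correct and one misclassified text")
    return mean(correct) - mean(wrong)


# Hypothetical class probabilities from two models on four labeled texts.
gold = [0, 0, 1, 1]
model_a = [[0.9, 0.1], [0.6, 0.4], [0.55, 0.45], [0.3, 0.7]]
model_b = [[0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]
combined = [ensemble([a, b]) for a, b in zip(model_a, model_b)]

for name, probs in [("model_a", model_a), ("model_b", model_b), ("ensemble", combined)]:
    print(name, round(confidence_gap(probs, gold), 3))
```

The toy numbers only demonstrate the computation; they are not meant to reproduce the reported finding that ensembles show a larger gap.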
Text data augmentation, i.e. the creation of synthetic textual data from an original text, is challenging because augmentation transformations should take into account language complexity while being relevant to the target Natural Language Processing (NLP) task (e.g. Machine Translation, Question Answering, Text Classification, etc.). Motivated by a business application of Business Email Compromise (BEC) detection, we propose a corpus- and task-agnostic text augmentation framework that combines different methods, utilizing the BERT language model, multi-step back-translation, and heuristics. We show that our augmentation framework improves performance on several text classification tasks using publicly available models and corpora (SST2 and TREC) as well as on a BEC detection task. We also provide a comprehensive discussion of the limitations of our augmentation framework.
Despite recent advances in the application of deep learning algorithms to various kinds of medical data, clinical text classification and information extraction from narrative clinical notes remain challenging tasks. The challenges of representing, training, and interpreting document classification models are amplified when dealing with small, clinical-domain data sets. The objective of this research is to investigate attention-based deep learning models for classifying de-identified clinical progress notes extracted from a real-world EHR system. Attention-based deep learning models can be used to interpret the models and to understand the critical words that drive the correct or incorrect classification of the clinical progress notes. The attention-based models in this research are capable of providing human-interpretable text classification models. The results show that the fine-tuned BERT with the attention layer can achieve a high classification accuracy of 97.6%, which is higher than that of the baseline fine-tuned BERT classification model. Furthermore, we demonstrate that the attention-based models can identify relevant keywords that strongly relate to the corresponding clinical categories.
Text classification algorithms investigate the intricate relationships between words or phrases and attempt to deduce the document's interpretation. In the last few years, these algorithms have progressed tremendously. The Transformer architecture and sentence encoders have proven to give superior results on natural language processing tasks. However, a major limitation of these architectures is that they are applicable only to texts no longer than a few hundred words. In this paper, we explore hierarchical transfer learning approaches for long document classification. We employ the pre-trained Universal Sentence Encoder (USE) and Bidirectional Encoder Representations from Transformers (BERT) in a hierarchical setup to capture better representations efficiently. Our proposed models are conceptually simple: we divide the input data into chunks and pass these through the base models of BERT and USE. The output representation for each chunk is then propagated through a shallow neural network comprising LSTMs or CNNs to classify the text data. These extensions are evaluated on 6 benchmark datasets. We show that USE + CNN/LSTM performs better than its stand-alone baseline, whereas BERT + CNN/LSTM performs on par with its stand-alone counterpart. However, the hierarchical BERT models are still desirable because they avoid the quadratic complexity of the attention mechanism in BERT. Along with the hierarchical approaches, this work also provides a comparison of different deep learning algorithms, such as USE, BERT, HAN, Longformer, and BigBird, for long document classification. The Longformer approach consistently performs well on most of the datasets.
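The chunk-then-aggregate setup described above can be sketched in a few lines. This is only a structural illustration under stated assumptions: `toy_encode` is a hypothetical stand-in for a pre-trained encoder such as BERT or USE, and mean pooling replaces the LSTM or CNN layer that the abstract places on top of the chunk representations. The chunk size is likewise an arbitrary choice for the example.

```python
def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive, non-overlapping chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def toy_encode(chunk):
    """Hypothetical stand-in for a pre-trained encoder: a 2-dimensional vector
    holding the chunk length and the mean token length."""
    return [float(len(chunk)), sum(len(t) for t in chunk) / len(chunk)]


def mean_pool(vectors):
    """Average the chunk vectors (a simple substitute for an LSTM or CNN)."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def document_vector(text, chunk_size=4, encode=toy_encode):
    """Encode each chunk independently, then aggregate into one document vector."""
    chunks = chunk_tokens(text.split(), chunk_size)
    if not chunks:
        raise ValueError("empty document")
    return mean_pool([encode(c) for c in chunks])


print(chunk_tokens("a long document split into short chunks".split(), 3))
print(document_vector("aa bb cc dd ee ff"))
```

Because each chunk is encoded independently, the cost of the encoder grows linearly with the number of chunks rather than quadratically with the full document length, which is the motivation the abstract gives for the hierarchical BERT models.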
Learning text representations is crucial for text classification and other language-related tasks. There is a diverse set of text representation networks in the literature, and finding the optimal one is a non-trivial problem. Recently, emerging Neural Architecture Search (NAS) techniques have demonstrated good potential to solve the problem. Nevertheless, most existing work on NAS focuses on the search algorithms and pays little attention to the search space. In this paper, we argue that the search space is also an important human prior for the success of NAS in different applications. Thus, we propose a novel search space tailored for text representation. Through automatic search, the discovered network architecture outperforms state-of-the-art models on various public datasets on text classification and natural language inference tasks. Furthermore, some of the design principles found in the automatically discovered network agree well with human intuition.
Product descriptions in e-commerce platforms contain detailed and valuable information about retailers' assortments. In particular, coding promotions within digital leaflets is of great interest in e-commerce, as these leaflets capture the attention of consumers by showing regular promotions for different products. However, this information is embedded into images, making it difficult to extract and process for downstream tasks. In this paper, we present an end-to-end approach that classifies promotions within digital leaflets into their corresponding product categories using both visual and textual information. Our approach can be divided into three key components: 1) region detection, 2) text recognition, and 3) text classification. In many cases, a single promotion refers to multiple product categories, so we introduce a multi-label objective in the classification head. We demonstrate the effectiveness of our approach on two separate tasks: 1) image-based detection of the descriptions for each individual promotion and 2) multi-label classification of the product categories using the text from the product descriptions. We train and evaluate our models using a private dataset composed of images from digital leaflets obtained by Nielsen. Results show that we consistently outperform the proposed baseline by a large margin in all the experiments.
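A multi-label objective of the kind mentioned above is commonly implemented as one independent sigmoid per category with a binary cross-entropy loss. The abstract does not give the exact formulation, so the sketch below is an assumption: the loss, the 0.5 decision threshold, and the category names are all illustrative.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent per-category sigmoids."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for z, t in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(logits)


def predict_labels(logits, categories, threshold=0.5):
    """Return every category whose probability reaches the threshold."""
    return [c for z, c in zip(logits, categories) if sigmoid(z) >= threshold]


# Hypothetical product categories and classifier outputs for one promotion.
categories = ["dairy", "drinks", "snacks"]
logits = [2.0, -1.0, 0.5]
targets = [1, 0, 1]  # the promotion belongs to two categories at once

print(predict_labels(logits, categories))
print(multilabel_bce(logits, targets))
```

Unlike a softmax head, the per-category sigmoids do not compete with each other, so a single promotion can be assigned to several product categories at the same time.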