Voice-assistants text classification and named-entity recognition (NER) models are trained on millions of example utterances. Because of the large datasets, long training time is one of the bottlenecks for releasing improved models. In this work, we develop F10-SGD, a fast optimizer for text classification and NER elastic-net linear models. On internal datasets, F10-SGD provides 4x reduction in training time compared to the OWL-QN optimizer without loss of accuracy or increase in model size. Furthermore, we incorporate biased sampling that prioritizes harder examples towards the end of the training. As a result, in addition to faster training, we were able to obtain statistically significant accuracy improvements for NER. On public datasets, F10-SGD obtains 22% faster training time compared to FastText for text classification. And, 4x reduction in training time compared to CRFSuite OWL-QN for NER.
Hierarchical text classification aims to leverage label hierarchy in multi-label text classification. Existing methods encode label hierarchy in a global view, where label hierarchy is treated as the static hierarchical structure containing all labels. Since global hierarchy is static and irrelevant to text samples, it makes these methods hard to exploit hierarchical information. Contrary to global hierarchy, local hierarchy as the structured target labels hierarchy corresponding to each text sample is dynamic and relevant to text samples, which is ignored in previous methods. To exploit global and local hierarchies, we propose Hierarchy-guided BERT with Global and Local hierarchies (HBGL), which utilizes the large-scale parameters and prior language knowledge of BERT to model both global and local hierarchies. Moreover, HBGL avoids the intentional fusion of semantic and hierarchical modules by directly modeling semantic and hierarchical information with BERT. Compared with the state-of-the-art method HGCLR, our method achieves significant improvement on three benchmark datasets.
We propose a method named Super Characters for sentiment classification. This method converts the sentiment classification problem into image classification problem by projecting texts into images and then applying CNN models for classification. Text features are extracted automatically from the generated Super Characters images, hence there is no need of any explicit step of embedding the words or characters into numerical vector representations. Experimental results on large social media corpus show that the Super Characters method consistently outperforms other methods for sentiment classification and topic classification tasks on ten large social media datasets of millions of contents in four different languages, including Chinese, Japanese, Korean and English.
Multi-label text classification refers to the problem of assigning each given document its most relevant labels from the label set. Commonly, the metadata of the given documents and the hierarchy of the labels are available in real-world applications. However, most existing studies focus on only modeling the text information, with a few attempts to utilize either metadata or hierarchy signals, but not both of them. In this paper, we bridge the gap by formalizing the problem of metadata-aware text classification in a large label hierarchy (e.g., with tens of thousands of labels). To address this problem, we present the MATCH solution -- an end-to-end framework that leverages both metadata and hierarchy information. To incorporate metadata, we pre-train the embeddings of text and metadata in the same space and also leverage the fully-connected attentions to capture the interrelations between them. To leverage the label hierarchy, we propose different ways to regularize the parameters and output probability of each child label by its parents. Extensive experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH over state-of-the-art deep learning baselines.
Because large, human-annotated datasets suffer from labeling errors, it is crucial to be able to train deep neural networks in the presence of label noise. While training image classification models with label noise have received much attention, training text classification models have not. In this paper, we propose an approach to training deep networks that is robust to label noise. This approach introduces a non-linear processing layer (noise model) that models the statistics of the label noise into a convolutional neural network (CNN) architecture. The noise model and the CNN weights are learned jointly from noisy training data, which prevents the model from overfitting to erroneous labels. Through extensive experiments on several text classification datasets, we show that this approach enables the CNN to learn better sentence representations and is robust even to extreme label noise. We find that proper initialization and regularization of this noise model is critical. Further, by contrast to results focusing on large batch sizes for mitigating label noise for image classification, we find that altering the batch size does not have much effect on classification performance.
There are two cases describing how a classifier processes input text, namely, misclassification and correct classification. In terms of misclassified texts, a classifier handles the texts with both incorrect predictions and adversarial texts, which are generated to fool the classifier, which is called a victim. Both types are misunderstood by the victim, but they can still be recognized by other classifiers. This induces large gaps in predicted probabilities between the victim and the other classifiers. In contrast, text correctly classified by the victim is often successfully predicted by the others and induces small gaps. In this paper, we propose an ensemble model based on similarity estimation of predicted probabilities (SEPP) to exploit the large gaps in the misclassified predictions in contrast to small gaps in the correct classification. SEPP then corrects the incorrect predictions of the misclassified texts. We demonstrate the resilience of SEPP in defending and detecting adversarial texts through different types of victim classifiers, classification tasks, and adversarial attacks.
The current state-of-the-art model HiAGM for hierarchical text classification has two limitations. First, it correlates each text sample with all labels in the dataset which contains irrelevant information. Second, it does not consider any statistical constraint on the label representations learned by the structure encoder, while constraints for representation learning are proved to be helpful in previous work. In this paper, we propose HTCInfoMax to address these issues by introducing information maximization which includes two modules: text-label mutual information maximization and label prior matching. The first module can model the interaction between each text sample and its ground truth labels explicitly which filters out irrelevant information. The second one encourages the structure encoder to learn better representations with desired characteristics for all labels which can better handle label imbalance in hierarchical text classification. Experimental results on two benchmark datasets demonstrate the effectiveness of the proposed HTCInfoMax.
Recent advancements in language models based on recurrent neural networks and transformers architecture have achieved state-of-the-art results on a wide range of natural language processing tasks such as pos tagging, named entity recognition, and text classification. However, most of these language models are pre-trained in high resource languages like English, German, Spanish. Multi-lingual language models include Indian languages like Hindi, Telugu, Bengali in their training corpus, but they often fail to represent the linguistic features of these languages as they are not the primary language of the study. We introduce HinFlair, which is a language representation model (contextual string embeddings) pre-trained on a large monolingual Hindi corpus. Experiments were conducted on 6 text classification datasets and a Hindi dependency treebank to analyze the performance of these contextualized string embeddings for the Hindi language. Results show that HinFlair outperforms previous state-of-the-art publicly available pre-trained embeddings for downstream tasks like text classification and pos tagging. Also, HinFlair when combined with FastText embeddings outperforms many transformers-based language models trained particularly for the Hindi language.
A recently introduced classifier, called SS3, has shown to be well suited to deal with early risk detection (ERD) problems on text streams. It obtained state-of-the-art performance on early depression and anorexia detection on Reddit in the CLEF's eRisk open tasks. SS3 was created to naturally deal with ERD problems since: it supports incremental training and classification over text streams and it can visually explain its rationale. However, SS3 processes the input using a bag-of-word model lacking the ability to recognize important word sequences. This could negatively affect the classification performance and also reduces the descriptiveness of visual explanations. In the standard document classification field, it is very common to use word n-grams to try to overcome some of these limitations. Unfortunately, when working with text streams, using n-grams is not trivial since the system must learn and recognize which n-grams are important ``on the fly''. This paper introduces t-SS3, a variation of SS3 which expands the model to dynamically recognize useful patterns over text streams. We evaluated our model on the eRisk 2017 and 2018 tasks on early depression and anorexia detection. Experimental results show that t-SS3 is able to improve both, existing results and the richness of visual explanations.
Estimating the difficulty level of math word problems is an important task for many educational applications. Identification of relevant and irrelevant sentences in math word problems is an important step for calculating the difficulty levels of such problems. This paper addresses a novel application of text categorization to identify two types of sentences in mathematical word problems, namely relevant and irrelevant sentences. A novel joint probabilistic classification model is proposed to estimate the joint probability of classification decisions for all sentences of a math word problem by utilizing the correlation among all sentences along with the correlation between the question sentence and other sentences, and sentence text. The proposed model is compared with i) a SVM classifier which makes independent classification decisions for individual sentences by only using the sentence text and ii) a novel SVM classifier that considers the correlation between the question sentence and other sentences along with the sentence text. An extensive set of experiments demonstrates the effectiveness of the joint probabilistic classification model for identifying relevant and irrelevant sentences as well as the novel SVM classifier that utilizes the correlation between the question sentence and other sentences. Furthermore, empirical results and analysis show that i) it is highly beneficial not to remove stopwords and ii) utilizing part of speech tagging does not make a significant improvement although it has been shown to be effective for the related task of math word problem type classification.