We apply text analysis approaches to a specialized search engine for 3D CAD models and associated products. The main goals are to distinguish between actual product descriptions and other text on a website, and to decide whether a given text is or contains a product name. For this we use paragraph vectors for text classification, a character-level long short-term memory (LSTM) network for single-word classification, and an LSTM tagger based on word embeddings for detecting product names within sentences. Although larger datasets still need to be collected for our specific problem domain, the first results are promising and partially fit for production use.
We participated in three of the protein-protein interaction subtasks of the Second BioCreative Challenge: classification of abstracts relevant for protein-protein interaction (IAS), discovery of protein pairs (IPS) and text passages characterizing protein interaction (ISS) in full text documents. We approached the abstract classification task with a novel, lightweight linear model inspired by spam-detection techniques, as well as an uncertainty-based integration scheme. We also used a Support Vector Machine and the Singular Value Decomposition on the same features for comparison purposes. Our approach to the full text subtasks (protein pair and passage identification) includes a feature expansion method based on word-proximity networks. Our approach to the abstract classification task (IAS) was among the top submissions for this task in terms of the measures of performance used in the challenge evaluation (accuracy, F-score and AUC). We also report on a web-tool we produced using our approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). Our approach to the full text tasks resulted in one of the highest recall rates as well as mean reciprocal rank of correct passages. Our approach to abstract classification shows that a simple linear model, using relatively few features, is capable of generalizing and uncovering the conceptual nature of protein-protein interaction from the bibliome. Since the novel approach is based on a very lightweight linear model, it can be easily ported and applied to similar problems. In full text problems, the expansion of word features with word-proximity networks is shown to be useful, though the need for some improvements is discussed.
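The mean reciprocal rank mentioned above can be illustrated with a minimal sketch. It uses the textbook definition (the average over queries of 1/rank of the first correct item, counting 0 when none is correct); the challenge's exact scoring rules are not given in the abstract, so the function name and input layout here are illustrative assumptions.

```python
def mean_reciprocal_rank(ranked_hits):
    """Textbook MRR (assumed definition, not the challenge's official scorer).

    ranked_hits: one list per query; each list holds booleans in rank order,
    True where the passage at that rank is correct.
    """
    if not ranked_hits:
        raise ValueError("need at least one query")
    total = 0.0
    for hits in ranked_hits:
        for rank, is_correct in enumerate(hits, start=1):
            if is_correct:
                total += 1.0 / rank  # only the first correct passage counts
                break
    return total / len(ranked_hits)
```

A query with no correct passage contributes 0 to the average, so the measure rewards both finding a correct passage and ranking it early.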
In the era of big data, the large amount of text data generated on the Internet has given rise to a variety of text representation methods. In natural language processing (NLP), text representation transforms text into vectors that a computer can process without losing the original semantic information. However, existing methods struggle to extract the semantic features among words effectively and to distinguish polysemy in language. Therefore, a text feature representation model based on a convolutional neural network (CNN) and a variational autoencoder (VAE) is proposed to extract text features and apply the resulting representation to text classification tasks. The CNN extracts features from the text vectors to capture the semantics among words, and the VAE is introduced to make the text feature space more consistent with a Gaussian distribution. In addition, the output of an improved word2vec model is used as the input of the proposed model to distinguish different meanings of the same word in different contexts. The experimental results show that the proposed model outperforms the compared methods when its features are used with k-nearest neighbor (KNN), random forest (RF) and support vector machine (SVM) classification algorithms.
App classification is useful in a number of applications, such as adding apps to an app store or building a user model based on the installed apps. A number of existing methods classify apps into a given taxonomy on the basis of their text metadata. However, text-based methods for app classification may not work in all cases, such as when the text descriptions are in a different language, missing, or inadequate to classify the app. One solution in such cases is to use the app images to supplement the text description. In this paper, we evaluate a number of approaches in which app images can be used to classify apps. In one approach, we use optical character recognition (OCR) to extract text from images, which is then used to supplement the text description of the app. In another, we use pic2vec to convert the app images into vectors, then train an SVM to classify the vectors to the correct app label. In another, we use the captionbot.ai tool to generate natural language descriptions from the app images. Finally, we use a method to detect and label objects in the app images and use a voting technique to determine the category of the app based on all the images. We compare the performance of our image-based techniques on a number of apps in our dataset. We use a text-based SVM app classifier as our baseline and obtain an improved classification accuracy of 96% for some classes when app images are added.
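The voting step described above can be sketched as a simple majority vote over per-image predictions. The abstract does not specify the voting rule, so plain majority voting, the function name `vote_app_category`, and the use of `None` for images without a usable prediction are all assumptions made for illustration.

```python
from collections import Counter


def vote_app_category(per_image_labels):
    """Majority vote over the category predicted for each image of one app.

    Assumption: each image contributes one vote; images with no usable
    prediction are passed as None and ignored.
    """
    counts = Counter(label for label in per_image_labels if label is not None)
    if not counts:
        return None  # no image produced a usable label
    return counts.most_common(1)[0][0]
```

For example, `vote_app_category(["game", "game", "finance"])` selects `"game"`. A weighted vote (for instance, by detector confidence) would be a natural variation, but nothing in the abstract indicates which variant the authors used.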
The predictions of text classifiers are often driven by spurious correlations; for example, the term 'Spielberg' correlates with positively reviewed movies, even though the term itself does not semantically convey a positive sentiment. In this paper, we propose a method to distinguish spurious and genuine correlations in text classification. We treat this as a supervised classification problem, using features derived from treatment effect estimators to distinguish spurious correlations from "genuine" ones. Due to the generic nature of these features and their small dimensionality, we find that the approach works well even with limited training examples, and that it is possible to transport the word classifier to new domains. Experiments on four datasets (sentiment classification and toxicity detection) suggest that using this approach to inform feature selection also leads to more robust classification, as measured by improved worst-case accuracy on the samples affected by spurious correlations.
Digital texts have become an increasingly important source of data for social studies. However, textual data from open platforms are vulnerable to manipulation (e.g., censorship and information inflation), often leading to bias in subsequent empirical analysis. This paper investigates the problem of data distortion in text classification when controlling type I error (a relevant textual message is classified as irrelevant) is the priority. The default classical classification paradigm that minimizes the overall classification error can yield an undesirably large type I error, and data distortion exacerbates this situation. As a solution, we propose the Neyman-Pearson (NP) classification paradigm which minimizes type II error under a user-specified type I error constraint. Theoretically, we show that while the classical oracle (i.e., optimal classifier) cannot be recovered under unknown data distortion even if one has the entire post-distortion population, the NP oracle is unaffected by data distortion and can be recovered under the same condition. Empirically, we illustrate the advantage of NP classification methods in a case study that classifies posts about strikes and corruption published on a leading Chinese blogging platform.
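The idea of minimizing type II error under a type I error constraint can be illustrated with a simplified threshold rule. The sketch below assumes a scoring function whose higher values indicate "irrelevant", a held-out sample of scores for relevant messages, and a purely empirical constraint; it is not the authors' procedure, and it omits the high-probability guarantee that NP classification methods typically add. The names `np_threshold` and `type_one_error` are hypothetical.

```python
import math


def np_threshold(relevant_scores, alpha):
    """Smallest observed threshold whose empirical type I error is <= alpha.

    Assumptions: a message is classified as irrelevant when its score is
    strictly greater than the threshold, and relevant_scores are held-out
    scores of messages known to be relevant.
    """
    if not 0 <= alpha < 1:
        raise ValueError("alpha must be in [0, 1)")
    ordered = sorted(relevant_scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one held-out relevant score")
    allowed = math.floor(alpha * n)  # relevant items we may misclassify
    return ordered[n - allowed - 1]


def type_one_error(relevant_scores, threshold):
    """Fraction of relevant messages that would be classified as irrelevant."""
    return sum(s > threshold for s in relevant_scores) / len(relevant_scores)
```

Choosing the smallest admissible threshold matters because, under this decision rule, lowering the threshold can only reduce the type II error (irrelevant messages kept as relevant), so the constraint is met with as little loss as the held-out sample allows.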
Text in natural images appears in arbitrary orientations, requiring detection in terms of oriented bounding boxes. A multi-oriented text detector typically involves two key tasks: 1) text presence detection, which is a classification problem that disregards text orientation; 2) oriented bounding box regression, which depends on text orientation. Previous methods rely on shared features for both tasks, resulting in degraded performance due to the incompatibility of the two tasks. To address this issue, we propose to perform classification and regression on features of different characteristics, extracted by two network branches of different designs. Concretely, the regression branch extracts rotation-sensitive features by actively rotating the convolutional filters, while the classification branch extracts rotation-invariant features by pooling the rotation-sensitive features. The proposed method, named Rotation-sensitive Regression Detector (RRD), achieves state-of-the-art performance on oriented scene text benchmark datasets including ICDAR 2015, MSRA-TD500, RCTW-17 and COCO-Text. Furthermore, RRD achieves a significant improvement on a ship collection dataset, demonstrating its generality for oriented object detection.
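How pooling can turn rotation-sensitive features into rotation-invariant ones can be shown with a toy sketch. It assumes one response channel per filter orientation and max pooling across those channels; the abstract does not name the pooling operator, and real feature maps are multi-dimensional tensors rather than the nested lists used here.

```python
def pool_over_orientations(responses):
    """Max-pool across orientation channels (assumed pooling operator).

    responses[o][i] is the response of orientation channel o at position i.
    Returns one value per position that no longer depends on channel order.
    """
    return [max(column) for column in zip(*responses)]


# Toy example: three orientation channels, two spatial positions.
responses = [[1, 5], [3, 2], [0, 4]]
# Idealized effect of rotating the input: orientation channels are permuted.
rotated = responses[1:] + responses[:1]
pooled = pool_over_orientations(responses)
pooled_rotated = pool_over_orientations(rotated)
```

In this idealized setting `pooled` and `pooled_rotated` are identical, which is the property the classification branch relies on, while the unpooled channels keep the orientation information that the regression branch needs.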
Due to the increasing amount of data on the internet, finding a highly informative, low-dimensional representation for text is one of the main challenges for efficient natural language processing tasks, including text classification. This representation should capture the semantic information of the text while retaining its relevance level for document classification. Such a representation maps documents with similar topics to nearby regions of the vector space. To obtain representations for large texts, we propose the use of deep Siamese neural networks. To embed the relevance of documents to topics in the distributed representation, we use a Siamese neural network to jointly learn document representations. Our Siamese network consists of two multi-layer perceptron sub-networks. We evaluate our representation on the text categorization task using the BBC news dataset. The results show that the proposed representations outperform conventional and state-of-the-art representations in the text classification task on this dataset.
Text classification tends to struggle when data is deficient or when it needs to adapt to unseen classes. In such challenging scenarios, recent studies often use meta learning to simulate the few-shot task, in which new queries are compared to a small support set on a sample-wise level. However, this sample-wise comparison may be severely disturbed by the varied expressions within the same class. Therefore, we should be able to learn a general representation of each class in the support set and then compare it to new queries. In this paper, we propose a novel Induction Network to learn such generalized class-wise representations, innovatively combining the dynamic routing algorithm with the typical meta learning framework. In this way, our model is able to induce from particularity to universality, which is a more human-like learning approach. We evaluate our model on a well-studied sentiment classification dataset (English) and a real-world dialogue intent classification dataset (Chinese). Experimental results show that, on both datasets, our model significantly outperforms existing state-of-the-art models and improves the average accuracy by more than 3%, which demonstrates the effectiveness of class-wise generalization in few-shot text classification.
Performance of text classification models can drop over time when new data to be classified is more distant in time from the data used for training, due to naturally occurring changes in the data, such as vocabulary change. One solution is to continually label new data and retrain the model, but the associated cost often makes this unaffordable on a regular basis. This raises important research questions on the design of text classification models that are intended to persist over time: do all embedding models and classification algorithms exhibit similar performance drops over time, and is the performance drop more prominent in some tasks or datasets than others? With the aim of answering these research questions, we perform longitudinal classification experiments on three datasets spanning between 6 and 19 years. Findings from these experiments inform the design of text classification models with the aim of preserving performance over time, discussing the extent to which one can rely on classification models trained from temporally distant training data, as well as how the characteristics of the dataset impact this.
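A longitudinal evaluation of this kind can be sketched as measuring accuracy as a function of the temporal gap between training and test data. The abstract does not describe the experimental protocol, so the year-level granularity, the `(year, text, label)` data layout, and the names `accuracy_by_year_gap` and `predict` below are assumptions made for illustration only.

```python
from collections import defaultdict


def accuracy_by_year_gap(examples, train_year, predict):
    """Accuracy on later data, grouped by distance in years from train_year.

    examples: iterable of (year, text, label) tuples (assumed layout).
    predict:  any callable mapping text to a predicted label, standing in
              for a model trained on data up to and including train_year.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text, label in examples:
        if year <= train_year:
            continue  # only evaluate on data newer than the training period
        gap = year - train_year
        totals[gap] += 1
        hits[gap] += int(predict(text) == label)
    return {gap: hits[gap] / totals[gap] for gap in sorted(totals)}
```

Plotting the returned accuracies against the gap for several models and datasets would make it possible to compare how quickly each combination degrades, which is the kind of comparison the research questions above call for.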