Understanding robustness and sensitivity of BERT models predicting Alzheimer's disease from text is important for both developing better classification models and for understanding their capabilities and limitations. In this paper, we analyze how a controlled amount of desired and undesired text alterations impacts performance of BERT. We show that BERT is robust to natural linguistic variations in text. On the other hand, we show that BERT is not sensitive to removing clinically important information from text.
Authorship identification is a process in which the author of a text is identified. Most known literary texts can easily be attributed to a certain author because they are, for example, signed. Yet sometimes we find unfinished pieces of work or a whole bunch of manuscripts with a wide variety of possible authors. In order to assess the importance of such a manuscript, it is vital to know who wrote it. In this work, we aim to develop a machine learning framework to effectively determine authorship. We formulate the task as a single-label multi-class text categorization problem and propose a supervised machine learning framework incorporating stylometric features. This task is highly interdisciplinary in that it takes advantage of machine learning, information retrieval, and natural language processing. We present an approach and a model which learns the differences in writing style between $50$ different authors and is able to predict the author of a new text with high accuracy. The accuracy is seen to increase significantly after introducing certain linguistic stylometric features along with text features.
We introduce an algorithm for word-level text spotting that is able to accurately and reliably determine the bounding regions of individual words of text "in the wild". Our system is formed by the cascade of two convolutional neural networks. The first network is fully convolutional and is in charge of detecting areas containing text. This results in a very reliable but possibly inaccurate segmentation of the input image. The second network (inspired by the popular YOLO architecture) analyzes each segment produced in the first stage, and predicts oriented rectangular regions containing individual words. No post-processing (e.g. text line grouping) is necessary. With execution time of 450 ms for a 1000-by-560 image on a Titan X GPU, our system achieves the highest score to date among published algorithms on the ICDAR 2015 Incidental Scene Text dataset benchmark.
So far and trying to reach human capabilities, research in automatic summarization has been based on hypothesis that are both enabling and limiting. Some of these limitations are: how to take into account and reflect (in the generated summary) the implicit information conveyed in the text, the author intention, the reader intention, the context influence, the general world knowledge. Thus, if we want machines to mimic human abilities, then they will need access to this same large variety of knowledge. The implicit is affecting the orientation and the argumentation of the text and consequently its summary. Most of Text Summarizers (TS) are processing as compressing the initial data and they necessarily suffer from information loss. TS are focusing on features of the text only, not on what the author intended or why the reader is reading the text. In this paper, we address this problem and we present a system focusing on acquiring knowledge that is implicit. We principally spotlight the implicit information conveyed by the argumentative connectives such as: but, even, yet and their effect on the summary.
Traditional table-to-text natural language generation (NLG) tasks focus on generating text from schemas that are already seen in the training set. This limitation curbs their generalizabilities towards real-world scenarios, where the schemas of input tables are potentially infinite. In this paper, we propose the new task of table-to-text NLG with unseen schemas, which specifically aims to test the generalization of NLG for input tables with attribute types that never appear during training. To do this, we construct a new benchmark dataset for this task. To deal with the problem of unseen attribute types, we propose a new model that first aligns unseen table schemas to seen ones, and then generates text with updated table representations. Experimental evaluation on the new benchmark demonstrates that our model outperforms baseline methods by a large margin. In addition, comparison with standard data-to-text settings shows the challenges and uniqueness of our proposed task.
Recently, vector-quantized image modeling has demonstrated impressive performance on generation tasks such as text-to-image generation. However, we discover that the current image quantizers do not satisfy translation equivariance in the quantized space due to aliasing, degrading performance in the downstream text-to-image generation and image-to-text generation, even in simple experimental setups. Instead of focusing on anti-aliasing, we take a direct approach to encourage translation equivariance in the quantized space. In particular, we explore a desirable property of image quantizers, called 'Translation Equivariance in the Quantized Space' and propose a simple but effective way to achieve translation equivariance by regularizing orthogonality in the codebook embedding vectors. Using this method, we improve accuracy by +22% in text-to-image generation and +26% in image-to-text generation, outperforming the VQGAN.
Text data augmentation, i.e. the creation of synthetic textual data from an original text, is challenging as augmentation transformations should take into account language complexity while being relevant to the target Natural Language Processing (NLP) task (e.g. Machine Translation, Question Answering, Text Classification, etc.). Motivated by a business application of Business Email Compromise (BEC) detection, we propose a corpus and task agnostic text augmentation framework combining different methods, utilizing BERT language model, multi-step back-translation and heuristics. We show that our augmentation framework improves performances on several text classification tasks using publicly available models and corpora (SST2 and TREC) as well as on a BEC detection task. We also provide a comprehensive argumentation about the limitations of our augmentation framework.
Textual information in a captured scene play important role in scene interpretation and decision making. Pieces of dedicated research work are going on to detect and recognize textual data accurately in images. Though there exist methods that can successfully detect complex text regions present in a scene, to the best of our knowledge there is no work to modify the textual information in an image. This paper deals with a simple text editor that can edit/modify the textual part in an image. Apart from error correction in the text part of the image, this work can directly increase the reusability of images drastically. In this work, at first, we focus on the problem to generate unobserved characters with the similar font and color of an observed text character present in a natural scene with minimum user intervention. To generate the characters, we propose a multi-input neural network that adapts the font-characteristics of a given characters (source), and generate desired characters (target) with similar font features. We also propose a network that transfers color from source to target character without any visible distortion. Next, we place the generated character in a word for its modification maintaining the visual consistency with the other characters in the word. The proposed method is a unified platform that can work like a simple text editor and edit texts in images. We tested our methodology on popular ICDAR 2011 and ICDAR 2013 datasets and results are reported here.
Automatic text summarization tools help users in biomedical domain to acquire their intended information from various textual resources more efficiently. Some of the biomedical text summarization systems put the basis of their sentence selection approach on the frequency of concepts extracted from the input text. However, it seems that exploring other measures rather than the frequency for identifying the valuable content of the input document, and considering the correlations existing between concepts may be more useful for this type of summarization. In this paper, we describe a Bayesian summarizer for biomedical text documents. The Bayesian summarizer initially maps the input text to the Unified Medical Language System (UMLS) concepts, then it selects the important ones to be used as classification features. We introduce different feature selection approaches to identify the most important concepts of the text and to select the most informative content according to the distribution of these concepts. We show that with the use of an appropriate feature selection approach, the Bayesian biomedical summarizer can improve the performance of summarization. We perform extensive evaluations on a corpus of scientific papers in biomedical domain. The results show that the Bayesian summarizer outperforms the biomedical summarizers that rely on the frequency of concepts, the domain-independent and baseline methods based on the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics. Moreover, the results suggest that using the meaningfulness measure and considering the correlations of concepts in the feature selection step lead to a significant increase in the performance of summarization.
Text matching is a fundamental technique in both information retrieval and natural language processing. Text matching tasks share the same paradigm that determines the relationship between two given texts. Evidently, the relationships vary from task to task, e.g. relevance in document retrieval, semantic alignment in paraphrase identification and answerable judgment in question answering. However, the essential signals for text matching remain in a finite scope, i.e. exact matching, semantic matching, and inference matching. Recent state-of-the-art neural text matching models, e.g. pre-trained language models (PLMs), are hard to generalize to different tasks. It is because the end-to-end supervised learning on task-specific dataset makes model overemphasize the data sample bias and task-specific signals instead of the essential matching signals, which ruins the generalization of model to different tasks. To overcome this problem, we adopt a specialization-generalization training strategy and refer to it as Match-Prompt. In specialization stage, descriptions of different matching tasks are mapped to only a few prompt tokens. In generalization stage, text matching model explores the essential matching signals by being trained on diverse multiple matching tasks. High diverse matching tasks avoid model fitting the data sample bias on a specific task, so that model can focus on learning the essential matching signals. Meanwhile, the prompt tokens obtained in the first step are added to the corresponding tasks to help the model distinguish different task-specific matching signals. Experimental results on eighteen public datasets show that Match-Prompt can significantly improve multi-task generalization capability of PLMs in text matching, and yield better in-domain multi-task, out-of-domain multi-task and new task adaptation performance than task-specific model.