This work proposes an attention-based sequence-to-sequence model for handwritten word recognition and explores transfer learning for data-efficient training of HTR systems. To overcome training data scarcity, this work leverages models pre-trained on scene text images as a starting point towards tailoring the handwriting recognition models. ResNet feature extraction and bidirectional LSTM-based sequence modeling stages together form an encoder. The prediction stage consists of a decoder and a content-based attention mechanism. The effectiveness of the proposed end-to-end HTR system has been empirically evaluated on a novel multi-writer dataset Imgur5K and the IAM dataset. The experimental results evaluate the performance of the HTR framework, further supported by an in-depth analysis of the error cases. Source code and pre-trained models are available at https://github.com/dmitrijsk/AttentionHTR.
Extracting valuable facts or informative summaries from multi-dimensional tables, i.e. insight mining, is an important task in data analysis and business intelligence. However, ranking the importance of insights remains a challenging and unexplored task. The main challenge is that explicitly scoring an insight or giving it a rank requires a thorough understanding of the tables and costs a lot of manual efforts, which leads to the lack of available training data for the insight ranking problem. In this paper, we propose an insight ranking model that consists of two parts: A neural ranking model explores the data characteristics, such as the header semantics and the data statistical features, and a memory network model introduces table structure and context information into the ranking process. We also build a dataset with text assistance. Experimental results show that our approach largely improves the ranking precision as reported in multi evaluation metrics.
Given a graph over which the contagions (e.g. virus, gossip) propagate, leveraging subgraphs with highly correlated nodes is beneficial to many applications. Yet, challenges abound. First, the propagation pattern between a pair of nodes may change in various temporal dimensions. Second, not always the same contagion is propagated. Hence, state-of-the-art text mining approaches ranging from similarity measures to topic-modeling cannot use the textual contents to compute the weights between the nodes. Third, the word-word co-occurrence patterns may differ in various temporal dimensions, which increases the difficulty to employ current word embedding approaches. We argue that inseparable multi-aspect temporal collaborations are inevitably needed to better calculate the correlation metrics in dynamical processes. In this work, we showcase a sophisticated framework that on the one hand, integrates a neural network based time-aware word embedding component that can collectively construct the word vectors through an assembly of infinite latent temporal facets, and on the other hand, uses an elaborate generative model to compute the edge weights through heterogeneous temporal attributes. After computing the intra-nodes weights, we utilize our Max-Heap Graph cutting algorithm to exploit subgraphs. We then validate our model through comprehensive experiments on real-world propagation data. The results show that the knowledge gained from the versatile temporal dynamics is not only indispensable for word embedding approaches but also plays a significant role in the understanding of the propagation behaviors. Finally, we demonstrate that compared with other rivals, our model can dominantly exploit the subgraphs with highly coordinated nodes.
Theory-of-mind (ToM), a human ability to infer the intentions and thoughts of others, is an essential part of empathetic experiences. We provide here the framework for using NLP models to measure ToM expressed in written texts. For this purpose, we introduce ToM-Diary, a crowdsourced 18,238 diaries with 74,014 Korean sentences annotated with different ToM levels. Each diary was annotated with ToM levels by trained psychology students and reviewed by selected psychology experts. The annotators first divided the diaries based on whether they mentioned other people: self-focused and other-focused. Examples of self-focused sentences are "I am feeling good". The other-focused sentences were further classified into different levels. These levels differ by whether the writer 1) mentions the presence of others without inferring their mental state(e.g., I saw a man walking down the street), 2) fails to take the perspective of others (e.g., I don't understand why they refuse to wear masks), or 3) successfully takes the perspective of others (It must have been hard for them to continue working). We tested whether state-of-the-art transformer-based models (e.g., BERT) could predict underlying ToM levels in sentences. We found that BERT more successfully detected self-focused sentences than other-focused ones. Sentences that successfully take the perspective of others (the highest ToM level) were the most difficult to predict. Our study suggests a promising direction for large-scale and computational approaches for identifying the ability of authors to empathize and take the perspective of others. The dataset is at [URL](https://github.com/humanfactorspsych/covid19-tom-empathy-diary)
The increasing amount of available data and more affordable hardware solutions have opened a gate to the realm of Deep Learning (DL). Due to the rapid advancements and ever-growing popularity of DL, it has begun to invade almost every field, where machine learning is applicable, by altering the traditional state-of-the-art methods. While many researchers in the speaker recognition area have also started to replace the former state-of-the-art methods with DL techniques, some of the traditional i-vector-based methods are still state-of-the-art in the context of text-independent speaker verification (TI-SV). In this paper, we discuss the most recent generalized end-to-end (GE2E) DL technique based on Long Short-term Memory (LSTM) units for TI-SV by Google and compare different scenarios and aspects including utterance duration, training time, and accuracy to prove that our method outperforms the traditional methods.
We explore to what extent knowledge about the pre-trained language model that is used is beneficial for the task of abstractive summarization. To this end, we experiment with conditioning the encoder and decoder of a Transformer-based neural model on the BERT language model. In addition, we propose a new method of BERT-windowing, which allows chunk-wise processing of texts longer than the BERT window size. We also explore how locality modelling, i.e., the explicit restriction of calculations to the local context, can affect the summarization ability of the Transformer. This is done by introducing 2-dimensional convolutional self-attention into the first layers of the encoder. The results of our models are compared to a baseline and the state-of-the-art models on the CNN/Daily Mail dataset. We additionally train our model on the SwissText dataset to demonstrate usability on German. Both models outperform the baseline in ROUGE scores on two datasets and show its superiority in a manual qualitative analysis.
Many new proposals for scene text recognition (STR) models have been introduced in recent years. While each claim to have pushed the boundary of the technology, a holistic and fair comparison has been largely missing in the field due to the inconsistent choices of training and evaluation datasets. This paper addresses this difficulty with three major contributions. First, we examine the inconsistencies of training and evaluation datasets, and the performance gap results from inconsistencies. Second, we introduce a unified four-stage STR framework that most existing STR models fit into. Using this framework allows for the extensive evaluation of previously proposed STR modules and the discovery of previously unexplored module combinations. Third, we analyze the module-wise contributions to performance in terms of accuracy, speed, and memory demand, under one consistent set of training and evaluation datasets. Such analyses clean up the hindrance on the current comparisons to understand the performance gain of the existing modules.
Visual modifications to text are often used to obfuscate offensive comments in social media (e.g., "!d10t") or as a writing style ("1337" in "leet speak"), among other scenarios. We consider this as a new type of adversarial attack in NLP, a setting to which humans are very robust, as our experiments with both simple and more difficult visual input perturbations demonstrate. We then investigate the impact of visual adversarial attacks on current NLP systems on character-, word-, and sentence-level tasks, showing that both neural and non-neural models are, in contrast to humans, extremely sensitive to such attacks, suffering performance decreases of up to 82\%. We then explore three shielding methods---visual character embeddings, adversarial training, and rule-based recovery---which substantially improve the robustness of the models. However, the shielding methods still fall behind performances achieved in non-attack scenarios, which demonstrates the difficulty of dealing with visual attacks.
In cognitive psychology, automatic and self-reinforcing irrational thought patterns are known as cognitive distortions. Left unchecked, patients exhibiting these types of thoughts can become stuck in negative feedback loops of unhealthy thinking, leading to inaccurate perceptions of reality commonly associated with anxiety and depression. In this paper, we present a machine learning framework for the automatic detection and classification of 15 common cognitive distortions in two novel mental health free text datasets collected from both crowdsourcing and a real-world online therapy program. When differentiating between distorted and non-distorted passages, our model achieved a weighted F1 score of 0.88. For classifying distorted passages into one of 15 distortion categories, our model yielded weighted F1 scores of 0.68 in the larger crowdsourced dataset and 0.45 in the smaller online counseling dataset, both of which outperformed random baseline metrics by a large margin. For both tasks, we also identified the most discriminative words and phrases between classes to highlight common thematic elements for improving targeted and therapist-guided mental health treatment. Furthermore, we performed an exploratory analysis using unsupervised content-based clustering and topic modeling algorithms as first efforts towards a data-driven perspective on the thematic relationship between similar cognitive distortions traditionally deemed unique. Finally, we highlight the difficulties in applying mental health-based machine learning in a real-world setting and comment on the implications and benefits of our framework for improving automated delivery of therapeutic treatment in conjunction with traditional cognitive-behavioral therapy.