Pretraining Neural Language Models (NLMs) over a large corpus involves chunking the text into training examples, which are contiguous text segments of sizes processable by the neural architecture. We highlight a bias introduced by this common practice: we prove that the pretrained NLM can model much stronger dependencies between text segments that appeared in the same training example, than it can between text segments that appeared in different training examples. This intuitive result has a twofold role. First, it formalizes the motivation behind a broad line of recent successful NLM training heuristics, proposed for the pretraining and fine-tuning stages, which do not necessarily appear related at first glance. Second, our result clearly indicates further improvements to be made in NLM pretraining for the benefit of Natural Language Understanding tasks. As an example, we propose "kNN-Pretraining": we show that including semantically related non-neighboring sentences in the same pretraining example yields improved sentence representations and open domain question answering abilities. This theoretically motivated degree of freedom for "pretraining example design" indicates new training schemes for self-improving representations.
Scene text detection has witnessed rapid development in recent years. However, there still exists two main challenges: 1) many methods suffer from false positives in their text representations; 2) the large scale variance of scene texts makes it hard for network to learn samples. In this paper, we propose the ContourNet, which effectively handles these two problems taking a further step toward accurate arbitrary-shaped text detection. At first, a scale-insensitive Adaptive Region Proposal Network (Adaptive-RPN) is proposed to generate text proposals by only focusing on the Intersection over Union (IoU) values between predicted and ground-truth bounding boxes. Then a novel Local Orthogonal Texture-aware Module (LOTM) models the local texture information of proposal features in two orthogonal directions and represents text region with a set of contour points. Considering that the strong unidirectional or weakly orthogonal activation is usually caused by the monotonous texture characteristic of false-positive patterns (e.g. streaks.), our method effectively suppresses these false positives by only outputting predictions with high response value in both orthogonal directions. This gives more accurate description of text regions. Extensive experiments on three challenging datasets (Total-Text, CTW1500 and ICDAR2015) verify that our method achieves the state-of-the-art performance. Code is available at https://github.com/wangyuxin87/ContourNet.
Bidirectional Encoder Representations from Transformers (BERT) have shown to be a promising way to dramatically improve the performance across various Natural Language Processing tasks [Devlin et al., 2019]. Meanwhile, progress made over the past few years by various Neural Net-work has also proved the effectiveness of Neural Network in the field of Natural Language Processing. In this project, RoBERTa-wwm-ext [Cui et al., 2019] pre-train language model was adopted and fine-tuned for Chinese text classification. The models were able to classify Chinese texts into two categories, containing descriptions of legal behavior and descriptions of illegal behavior. Four different models are also proposed in the paper. Those models will use RoBERTa-wwm-extas their embedding layer and feed the embedding into different neural networks. The motivation be-hind proposing these models is straightforward. By introducing complex output layer architecture, the overall performance of the models could be improved. All the models were trained on a data set derived from Chinese public court records, and the performance of different models were compared.The experiment shows that the performance of pro-posed models failed to beat the original RoBERTa-wwm-ext model in terms of accuracy and training efficiency.
Text Simplification (TS) aims to reduce the linguistic complexity of content to make it easier to understand. Research in TS has been of keen interest, especially as approaches to TS have shifted from manual, hand-crafted rules to automated simplification. This survey seeks to provide a comprehensive overview of TS, including a brief description of earlier approaches used, discussion of various aspects of simplification (lexical, semantic and syntactic), and latest techniques being utilized in the field. We note that the research in the field has clearly shifted towards utilizing deep learning techniques to perform TS, with a specific focus on developing solutions to combat the lack of data available for simplification. We also include a discussion of datasets and evaluations metrics commonly used, along with discussion of related fields within Natural Language Processing (NLP), like semantic similarity.
In this paper we apply self-knowledge distillation to text summarization which we argue can alleviate problems with maximum-likelihood training on single reference and noisy datasets. Instead of relying on one-hot annotation labels, our student summarization model is trained with guidance from a teacher which generates smoothed labels to help regularize training. Furthermore, to better model uncertainty during training, we introduce multiple noise signals for both teacher and student models. We demonstrate experimentally on three benchmarks that our framework boosts the performance of both pretrained and non-pretrained summarizers achieving state-of-the-art results.
Knowing the Galactic 3D dust distribution is relevant for understanding many processes in the interstellar medium and for correcting many astronomical observations for dust absorption and emission. Here, we aim for a 3D reconstruction of the Galactic dust distribution with an increase in the number of meaningful resolution elements by orders of magnitude with respect to previous reconstructions, while taking advantage of the dust's spatial correlations to inform the dust map. We use iterative grid refinement to define a log-normal process in spherical coordinates. This log-normal process assumes a fixed correlation structure, which was inferred in an earlier reconstruction of Galactic dust. Our map is informed through 111 Million data points, combining data of PANSTARRS, 2MASS, Gaia DR2 and ALLWISE. The log-normal process is discretized to 122 Billion degrees of freedom, a factor of 400 more than our previous map. We derive the most probable posterior map and an uncertainty estimate using natural gradient descent and the Fisher-Laplace approximation. The dust reconstruction covers a quarter of the volume of our Galaxy, with a maximum coordinate distance of $16\,\text{kpc}$, and meaningful information can be found up to at distances of $4\,$kpc, still improving upon our earlier map by a factor of 5 in maximal distance, of $900$ in volume, and of about eighteen in angular grid resolution. Unfortunately, the maximum posterior approach chosen to make the reconstruction computational affordable introduces artifacts and reduces the accuracy of our uncertainty estimate. Despite of the apparent limitations of the presented 3D dust map, a good part of the reconstructed structures are confirmed by independent maser observations. Thus, the map is a step towards reliable 3D Galactic cartography and already can serve for a number of tasks, if used with care.
Modern end-to-end speech recognition models show astonishing results in transcribing audio signals into written text. However, conventional data feeding pipelines may be sub-optimal for low-resource speech recognition, which still remains a challenging task. We propose an automated curriculum learning approach to optimize the sequence of training examples based on both the progress of the model while training and prior knowledge about the difficulty of the training examples. We introduce a new difficulty measure called compression ratio that can be used as a scoring function for raw audio in various noise conditions. The proposed method improves speech recognition Word Error Rate performance by up to 33% relative over the baseline system
Natural language processing (NLP) and neural networks (NNs) have both undergone significant changes in recent years. For active learning (AL) purposes, NNs are, however, less commonly used -- despite their current popularity. By using the superior text classification performance of NNs for AL, we can either increase a model's performance using the same amount of data or reduce the data and therefore the required annotation efforts while keeping the same performance. We review AL for text classification using deep neural networks (DNNs) and elaborate on two main causes which used to hinder the adoption: (a) the inability of NNs to provide reliable uncertainty estimates, on which the most commonly used query strategies rely, and (b) the challenge of training DNNs on small data. To investigate the former, we construct a taxonomy of query strategies, which distinguishes between data-based, model-based, and prediction-based instance selection, and investigate the prevalence of these classes in recent research. Moreover, we review recent NN-based advances in NLP like word embeddings or language models in the context of (D)NNs, survey the current state-of-the-art at the intersection of AL, text classification, and DNNs and relate recent advances in NLP to AL. Finally, we analyze recent work in AL for text classification, connect the respective query strategies to the taxonomy, and outline commonalities and shortcomings. As a result, we highlight gaps in current research and present open research questions.
Since the low quality of document images will greatly undermine the chances of success in automatic text recognition and analysis, it is necessary to assess the quality of document images uploaded in online business process, so as to reject those images of low quality. In this paper, we attempt to achieve document image quality assessment and our contributions are twofold. Firstly, since document image quality assessment is more interested in text, we propose a text line based framework to estimate document image quality, which is composed of three stages: text line detection, text line quality prediction, and overall quality assessment. Text line detection aims to find potential text lines with a detector. In the text line quality prediction stage, the quality score is computed for each text line with a CNN-based prediction model. The overall quality of document images is finally assessed with the ensemble of all text line quality. Secondly, to train the prediction model, a large-scale dataset, comprising 52,094 text line images, is synthesized with diverse attributes. For each text line image, a quality label is computed with a piece-wise function. To demonstrate the effectiveness of the proposed framework, comprehensive experiments are evaluated on two popular document image quality assessment benchmarks. Our framework significantly outperforms the state-of-the-art methods by large margins on the large and complicated dataset.
Image captioning is a fast-growing research field of computer vision and natural language processing that involves creating text explanations for images. This study aims to develop a system that uses a pre-trained convolutional neural network (CNN) to extract features from an image, integrates the features with an attention mechanism, and creates captions using a recurrent neural network (RNN). To encode an image into a feature vector as graphical attributes, we employed multiple pre-trained convolutional neural networks. Following that, a language model known as GRU is chosen as the decoder to construct the descriptive sentence. In order to increase performance, we merge the Bahdanau attention model with GRU to allow learning to be focused on a specific portion of the image. On the MSCOCO dataset, the experimental results achieve competitive performance against state-of-the-art approaches.