Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.
The quality of a document is affected by various factors, including grammaticality, readability, stylistics, and expertise depth, making the task of document quality assessment a complex one. In this paper, we explore this task in the context of assessing the quality of Wikipedia articles and academic papers. Observing that the visual rendering of a document can capture implicit quality indicators that are not present in the document text --- such as images, font choices, and visual layout --- we propose a joint model that combines the text content with a visual rendering of the document for document quality assessment. Experimental results over two datasets reveal that textual and visual features are complementary, achieving state-of-the-art results.
Understanding procedural language requires anticipating the causal effects of actions, even when they are not explicitly stated. In this work, we introduce Neural Process Networks to understand procedural text through (neural) simulation of action dynamics. Our model complements existing memory architectures with dynamic entity tracking by explicitly modeling actions as state transformers. The model updates the states of the entities by executing learned action operators. Empirical results demonstrate that our proposed model can reason about the unstated causal effects of actions, allowing it to provide more accurate contextual information for understanding and generating procedural text, all while offering more interpretable internal representations than existing alternatives.
Generative Adversarial Networks (GANs) have shown great promise recently in image generation. Training GANs for language generation has proven to be more difficult, because of the non-differentiable nature of generating text with recurrent neural networks. Consequently, past work has either resorted to pre-training with maximum-likelihood or used convolutional networks for generation. In this work, we show that recurrent neural networks can be trained to generate text with GANs from scratch using curriculum learning, by slowly teaching the model to generate sequences of increasing and variable length. We empirically show that our approach vastly improves the quality of generated sequences compared to a convolutional baseline.
We present an architecture for information extraction from text that augments an existing parser with a character-level neural network. The network is trained using a measure of consistency of extracted data with existing databases as a form of noisy supervision. Our architecture combines the ability of constraint-based information extraction systems to easily incorporate domain knowledge and constraints with the ability of deep neural networks to leverage large amounts of data to learn complex features. Boosting the existing parser's precision, the system led to large improvements over a mature and highly tuned constraint-based production information extraction system used at Bloomberg for financial language text.
Online content analysis employs algorithmic methods to identify entities in unstructured text. Both machine learning and knowledge-base approaches lie at the foundation of contemporary named entities extraction systems. However, the progress in deploying these approaches on web-scale has been been hampered by the computational cost of NLP over massive text corpora. We present SpeedRead (SR), a named entity recognition pipeline that runs at least 10 times faster than Stanford NLP pipeline. This pipeline consists of a high performance Penn Treebank- compliant tokenizer, close to state-of-art part-of-speech (POS) tagger and knowledge-based named entity recognizer.
Learning to respond to voice-text input involves the subject's ability in understanding the phonetic and text based contents and his/her ability to communicate based on his/her experience. The neuro-cognitive facility of the subject has to support two important domains in order to make the learning process complete. In many cases, though the understanding is complete, the response is partial. This is one valid reason why we need to support the information from the subject with scalable techniques such as Natural Language Processing (NLP) for abstraction of the contents from the output. This paper explores the feasibility of using NLP modules interlaced with Neural Networks to perform the required task in autogenic training related to medical applications.
In recent years, several methods have been proposed for explaining individual predictions of deep learning models, yet there has been little study of how to aggregate these predictions to explain how such models view classes as a whole in text classification tasks. In this work, we propose a method for explaining classes using deep learning models and the Integrated Gradients feature attribution technique by aggregating explanations of individual examples in text classification to general descriptions of the classes. We demonstrate the approach on Web register (genre) classification using the XML-R model and the Corpus of Online Registers of English (CORE), finding that the method identifies plausible and discriminative keywords characterizing all but the smallest class.
Most few-shot learning models utilize only one modality of data. We would like to investigate qualitatively and quantitatively how much will the model improve if we add an extra modality (i.e. text description of the image), and how it affects the learning procedure. To achieve this goal, we propose four types of fusion method to combine the image feature and text feature. To verify the effectiveness of improvement, we test the fusion methods with two classical few-shot learning models - ProtoNet and MAML, with image feature extractors such as ConvNet and ResNet12. The attention-based fusion method works best, which improves the classification accuracy by a large margin around 30% comparing to the baseline result.
Automated speech and text interfaces are continuing to improve, resulting in increased research in the area of dialogue systems. Moreover, conferences and workshops from various fields are focusing more on language through speech and text mediums as candidates for interaction with applications such as search interfaces and robots. In this paper, we explore how visible the SigDial conference is to outside conferences by analysing papers from top Natural Langauge Processing conferences since 2015 to determine the popularity of certain SigDial-related topics, as well as analysing what SigDial papers are being cited by others outside of SigDial. We find that despite a dramatic increase in dialogue-related research, SigDial visibility has not increased. We conclude by offering some suggestions.