Word representation is a fundamental component in neural language understanding models. Recently, pre-trained language models (PrLMs) have offered a new, performant method of contextualized word representation by leveraging sequence-level context for modeling. Although PrLMs generally give more accurate contextualized word representations than non-contextualized models do, they are still limited to textual context and lack the diverse hints for word representation that other modalities can provide. This paper thus proposes a visual representation method to explicitly enhance conventional word embeddings with multiple-aspect senses from visual guidance. In detail, we build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images. The texts and paired images are encoded in parallel, followed by an attention layer to integrate the multimodal representations. We show that the method substantially improves the accuracy of disambiguation. Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
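The attention-based integration of text and image encodings mentioned in the abstract above can be illustrated with a minimal sketch. It assumes scaled dot-product attention and a simple additive fusion; the abstract does not specify either choice, and the function name `fuse_word_with_images` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse_word_with_images(word_vec, image_vecs):
    """Attend from one word vector over its paired image vectors.

    Assumption: scaled dot-product attention followed by additive fusion.
    Returns (fused_vector, attention_weights).
    """
    scale = math.sqrt(len(word_vec))
    weights = softmax([dot(word_vec, img) / scale for img in image_vecs])
    # Weighted sum of image vectors: the visual summary for this word.
    visual = [
        sum(w * img[i] for w, img in zip(weights, image_vecs))
        for i in range(len(word_vec))
    ]
    fused = [t + v for t, v in zip(word_vec, visual)]
    return fused, weights


# Toy example: the first image is aligned with the word, the second is not.
word = [1.0, 0.0]
images = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = fuse_word_with_images(word, images)
```

In this toy case the image whose vector points in the same direction as the word vector receives the larger attention weight, so it contributes more to the fused representation.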
We describe a gold standard corpus of protest events that comprises various local and international English-language sources from various countries. The corpus contains document, sentence, and token level annotations. This corpus facilitates creating machine learning models that automatically classify news articles and extract protest event-related information, which in turn supports constructing knowledge bases that enable comparative social and political science studies. For each news source, the annotation started on random samples of news articles and continued with samples drawn using active learning. Each batch of samples was annotated by two social and political scientists, adjudicated by an annotation supervisor, and improved by identifying annotation errors semi-automatically. We found that the corpus has the variety and quality to develop and benchmark text classification and event extraction systems in a cross-context setting, which contributes to the generalizability and robustness of automated text processing systems. This corpus and the reported results will establish the common ground currently lacking in automated protest event collection studies.
Much of biomedical and healthcare data is encoded in discrete, symbolic form such as text and medical codes. There is a wealth of expert-curated biomedical domain knowledge stored in knowledge bases and ontologies, but the lack of reliable methods for learning knowledge representation has limited their usefulness in machine learning applications. While text-based representation learning has significantly improved in recent years through advances in natural language processing, attempts to learn biomedical concept embeddings so far have been lacking. A recent family of models called knowledge graph embeddings has shown promising results on general domain knowledge graphs, and we explore their capabilities in the biomedical domain. We train several state-of-the-art knowledge graph embedding models on the SNOMED-CT knowledge graph, provide a benchmark with comparison to existing methods and an in-depth discussion of best practices, and make a case for the importance of leveraging the multi-relational nature of knowledge graphs for learning biomedical knowledge representation. The embeddings, code, and materials will be made available to the community.
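To make the idea of a knowledge graph embedding concrete, the following sketch scores triples with TransE, one widely used model of this family. The abstract above does not say which models were trained, so TransE is an assumption chosen for simplicity, and the entity names, relation name, and vectors are hypothetical toy values rather than SNOMED-CT data.

```python
import math


def transe_score(head, relation, tail):
    """TransE plausibility score: negative L2 distance ||h + r - t||.

    A score closer to zero means the triple is judged more plausible.
    """
    return -math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


# Hypothetical 2-dimensional embeddings, hand-set for illustration only.
entities = {
    "pneumonia": [0.0, 1.0],
    "lung": [1.0, 1.0],
    "heart": [0.0, -1.0],
}
relations = {"finding_site": [1.0, 0.0]}

true_score = transe_score(
    entities["pneumonia"], relations["finding_site"], entities["lung"]
)
corrupted_score = transe_score(
    entities["pneumonia"], relations["finding_site"], entities["heart"]
)
```

Because each relation has its own vector, the same pair of entities can score differently under different relations, which is one way a model can exploit the multi-relational structure of a knowledge graph.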
WordNet-like Lexical Databases (WLDs) group English words into sets of synonyms called "synsets." Although the standard WLDs are being used in many successful Text-Mining applications, they have the limitation that all word-senses are considered to represent the meaning associated with their corresponding synsets to the same degree, which is not generally true. In order to overcome this limitation, several fuzzy versions of synsets have been proposed. A common trait of these studies is that, to the best of our knowledge, they do not aim to produce fuzzified versions of the existing WLDs, but build new WLDs from scratch, which has limited the attention received from the Text-Mining community, many of whose resources and applications are based on the existing WLDs. In this study, we present an algorithm for constructing fuzzy versions of WLDs of any language, given a corpus of documents and a word-sense disambiguation (WSD) system for that language. Then, using the Open-American-National-Corpus and UKB WSD as algorithm inputs, we construct and publish online the fuzzified version of English WordNet (FWN). We also propose a theoretical (mathematical) proof of the validity of its results.
End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS), blend the individual components of a traditional hybrid ASR system (acoustic model, language model, pronunciation model) into a single neural network. While this has some nice advantages, it restricts training to paired audio and text. Because of this, E2E models tend to have difficulty correctly recognizing rare words that are not frequently seen during training, such as entity names. In this paper, we propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words. We evaluate our approach on an in-house dataset sampled from de-identified public social media videos, which represents an open-domain ASR task. By using an attention model to leverage the contextual metadata that accompanies a video, we observe a relative improvement of about 12% in Word Error Rate on Named Entities (WER-NE) for videos with related metadata.
Authorship verification (AV) is an important sub-area of digital text forensics and has been researched for more than two decades. The fundamental question addressed by AV is whether two documents were written by the same person. A serious problem that has received little attention in the literature so far is the question of whether AV methods actually focus on the writing style during classification, or whether they are unintentionally distorted by the topic of the documents. To counteract this problem, we propose an effective technique called POSNoise, which aims to mask topic-related content in documents. In this way, AV methods are forced to focus on those text units that are more related to the author's writing style. Based on a comprehensive evaluation with eight existing AV methods applied to eight corpora, we demonstrate that POSNoise outperforms a well-known topic masking approach in 51 out of 64 cases, with up to 12.5% improvement in terms of accuracy. Furthermore, we show that for corpora preprocessed with POSNoise, the AV methods examined often achieve higher accuracies (improvement of up to 20.6%) compared to the original corpora.
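The general idea of topic masking referred to in the abstract above can be sketched as follows. This is not the POSNoise algorithm, whose details the abstract does not give; it is a simplified stand-in that keeps words from a small, hypothetical function-word list and replaces every other word with a placeholder symbol. The list `FUNCTION_WORDS`, the placeholder `#`, and the function name `mask_topic_words` are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny list of style-bearing words to keep.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "and", "but",
    "is", "was", "to", "that", "it", "i", "not", "with",
}


def mask_topic_words(text, keep=FUNCTION_WORDS, placeholder="#"):
    """Replace every word outside `keep` with `placeholder`.

    Punctuation is preserved, since it may carry stylistic signal.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    masked = []
    for tok in tokens:
        is_punctuation = re.fullmatch(r"[^\w\s]", tok) is not None
        if is_punctuation or tok.lower() in keep:
            masked.append(tok)
        else:
            masked.append(placeholder)
    return " ".join(masked)


masked = mask_topic_words(
    "The striker scored in the final, but it was not enough."
)
# masked == "The # # in the # , but it was not # ."
```

After masking, the sports-specific vocabulary is gone while the function words and punctuation remain, so a downstream classifier can no longer rely on topic words to tell two documents apart.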
Historical Document Processing is the process of digitizing written material from the past for future use by historians and other scholars. It incorporates algorithms and software tools from various subfields of computer science, including computer vision, document analysis and recognition, natural language processing, and machine learning, to convert images of ancient manuscripts, letters, diaries, and early printed texts automatically into a digital format usable in data mining and information retrieval systems. Within the past twenty years, as libraries, museums, and other cultural heritage institutions have scanned an increasing volume of their historical document archives, the need to transcribe the full text from these collections has become acute. Since Historical Document Processing encompasses multiple sub-domains of computer science, knowledge relevant to its purpose is scattered across numerous journals and conference proceedings. This paper surveys the major phases, standard algorithms, tools, and datasets of the field of Historical Document Processing, discusses the results of a literature review, and finally suggests directions for further research.
Social media generates an enormous amount of data on a daily basis, but it is very challenging to utilize the data effectively without annotating or labeling it according to the target application. We investigate the problem of localized flood detection using the social sensing model (Twitter) in order to provide an efficient, reliable, and accurate flood text classification model with minimal labeled data. This study is important since it can help immensely in providing flood-related updates and notifications to city officials for emergency decision making, rescue operations, and early warnings. We propose to perform the text classification using an inductive transfer learning method, i.e., the pre-trained language model ULMFiT, and to fine-tune it in order to effectively classify flood-related feeds in any new location. Finally, we show that, using very little new labeled data in the target domain, we can successfully build an efficient and high-performing model for flood detection and analysis with human-generated facts and observations from Twitter.
A quality abstractive summary should not only copy salient source texts as summaries but should also tend to generate new conceptual words to express concrete details. Inspired by the popular pointer generator sequence-to-sequence model, this paper presents a concept pointer network for improving these aspects of abstractive summarization. The network leverages knowledge-based, context-aware conceptualizations to derive an extended set of candidate concepts. The model then points to the most appropriate choice using both the concept set and the original source text. This joint approach generates abstractive summaries with higher-level semantic concepts. The training model is also optimized in a way that adapts to different data, which is based on a novel method of distantly-supervised learning guided by reference summaries and the testing set. Overall, the proposed approach provides statistically significant improvements over several state-of-the-art models on both the DUC-2004 and Gigaword datasets. A human evaluation of the model's abstractive abilities also supports the quality of the summaries produced within this framework.
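For readers unfamiliar with the pointer generator model that the abstract above cites as inspiration, the following sketch shows its standard output mixture: a generation probability `p_gen` blends a distribution over the fixed vocabulary with the attention weights over source tokens. It illustrates the baseline pointer generator only, not the concept pointer network, whose additional concept set is not specified in the abstract. The function name and the toy values are hypothetical.

```python
def pointer_generator_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generating from the vocabulary with copying from the source.

    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on copies of w.
    """
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    for token, weight in zip(source_tokens, attention):
        # Out-of-vocabulary source tokens gain probability mass via copying.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy example: "kyoto" is not in the vocabulary but appears in the source.
vocab_dist = {"the": 0.5, "talks": 0.5}
source_tokens = ["kyoto", "talks"]
attention = [0.75, 0.25]
final = pointer_generator_distribution(0.5, vocab_dist, attention, source_tokens)
```

The out-of-vocabulary token receives probability only through the copy term, which is what lets such a model reproduce rare source words that a fixed vocabulary cannot generate.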
Many applications, such as text modelling, high-throughput sequencing, and recommender systems, require analysing sparse, high-dimensional, and overdispersed discrete (count-valued or binary) data. Although probabilistic matrix factorisation and linear/nonlinear latent factor models have enjoyed great success in modelling such data, many existing models may have inferior modelling performance owing to an insufficient capability to model overdispersion in count-valued data and to model misspecification in general. In this paper, we comprehensively study these issues and propose a variational autoencoder based framework that generates discrete data via a negative-binomial distribution. We also examine the model's ability to capture properties, such as self- and cross-excitations in discrete data, which is critical for modelling overdispersion. We conduct extensive experiments on three important problems from discrete data analysis: text analysis, collaborative filtering, and multi-label learning. Compared with several state-of-the-art baselines, the proposed models achieve significantly better performance on the above problems.
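Why a negative-binomial likelihood suits overdispersed counts can be seen from a short sketch. It assumes the common mean/dispersion parameterization, with mean `mu` and dispersion `r`, under which the variance is `mu + mu**2 / r` and therefore always exceeds the Poisson variance `mu`. The abstract above does not state which parameterization the framework uses, so this choice and the function names are assumptions.

```python
import math


def nb_log_pmf(k, mu, r):
    """Log-probability of count k under a negative binomial.

    Assumed parameterization: mean mu > 0 and dispersion r > 0.
    """
    return (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )


def nb_variance(mu, r):
    """Variance under the same parameterization: mu + mu^2 / r."""
    return mu + mu * mu / r


mu, r = 3.0, 2.0
probs = [math.exp(nb_log_pmf(k, mu, r)) for k in range(400)]
total_mass = sum(probs)
empirical_mean = sum(k * p for k, p in enumerate(probs))
empirical_var = sum((k - empirical_mean) ** 2 * p for k, p in enumerate(probs))
```

As `r` grows, the extra `mu**2 / r` term shrinks and the distribution approaches a Poisson, so the dispersion parameter controls how much overdispersion the likelihood can absorb.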