Anaphora resolution is envisaged in this paper as part of the reference resolution process. A general open architecture is proposed, which can be particularized and configured in order to simulate some classic anaphora resolution methods. With the aim of improving pronoun resolution, the system takes advantage of elementary cues about characters of the text, which are represented through a particular data structure. In its most robust configuration, the system uses only a general lexicon, a local morpho-syntactic parser and a dictionary of synonyms. A short comparative corpus analysis shows that narrative texts are the most suitable for testing such a system.
In this work, we present TGLS, a novel framework to unsupervised Text Generation by Learning from Search. We start by applying a strong search algorithm (in particular, simulated annealing) towards a heuristically defined objective that (roughly) estimates the quality of sentences. Then, a conditional generative model learns from the search results, and meanwhile smooth out the noise of search. The alternation between search and learning can be repeated for performance bootstrapping. We demonstrate the effectiveness of TGLS on two real-world natural language generation tasks, paraphrase generation and text formalization. Our model significantly outperforms unsupervised baseline methods in both tasks. Especially, it achieves comparable performance with the state-of-the-art supervised methods in paraphrase generation.
Colloquial English (CE) as found in television programs or typical conversations is different than text found in technical manuals, newspapers and books. Phrases tend to be shorter and less sophisticated. In this paper, we look at some of the theoretical and implementational issues involved in translating CE. We present a fully automatic large-scale multilingual natural language processing system for translation of CE input text, as found in the commercially transmitted closed-caption television signal, into simple target sentences. Our approach is based on the Whitelock's Shake and Bake machine translation paradigm, which relies heavily on lexical resources. The system currently translates from English to Spanish with the translation modules for Brazilian Portuguese under development.
Most learners fail to develop deep text comprehension when reading textbooks passively. Posing questions about what learners have read is a well-established way of fostering their text comprehension. However, many textbooks lack self-assessment questions because authoring them is timeconsuming and expensive. Automatic question generators may alleviate this scarcity by generating sound pedagogical questions. However, generating questions automatically poses linguistic and pedagogical challenges. What should we ask? And, how do we phrase the question automatically? We address those challenges with an automatic question generator grounded in learning theory. The paper introduces a novel pedagogically meaningful content selection mechanism to find question-worthy sentences and answers in arbitrary textbook contents. We conducted an empirical evaluation study with educational experts, annotating 150 generated questions in six different domains. Results indicate a high linguistic quality of the generated questions. Furthermore, the evaluation results imply that the majority of the generated questions inquire central information related to the given text and may foster text comprehension in specific learning scenarios.
Over the past decade, emoji have emerged as a new and widespread form of digital communication, spanning diverse social networks and spoken languages. We propose to treat these ideograms as a new modality in their own right, distinct in their semantic structure from both the text in which they are often embedded as well as the images which they resemble. As a new modality, emoji present rich novel possibilities for representation and interaction. In this paper, we explore the challenges that arise naturally from considering the emoji modality through the lens of multimedia research. Specifically, the ways in which emoji can be related to other common modalities such as text and images. To do so, we first present a large scale dataset of real-world emoji usage collected from Twitter. This dataset contains examples of both text-emoji and image-emoji relationships. We present baseline results on the challenge of predicting emoji from both text and images, using state-of-the-art neural networks. Further, we offer a first consideration into the problem of how to account for new, unseen emoji - a relevant issue as the emoji vocabulary continues to expand on a yearly basis. Finally, we present results for multimedia retrieval using emoji as queries.
Understanding the semantic of a collection of texts is a challenging task. Topic models are probabilistic models that aims at extracting "topics" from a corpus of documents. This task is particularly difficult when the corpus is composed of short texts, such as posts on social networks. Following several previous research papers, we explore in this paper a set of collected tweets about bitcoin. In this work, we train three topic models and evaluate their output with several scores. We also propose a concrete application of the extracted topics.
The paper presents a hierarchical naive Bayesian and lexicon based classifier for short text language identification (LID) useful for under resourced languages. The algorithm is evaluated on short pieces of text for the 11 official South African languages some of which are similar languages. The algorithm is compared to recent approaches using test sets from previous works on South African languages as well as the Discriminating between Similar Languages (DSL) shared tasks' datasets. Remaining research opportunities and pressing concerns in evaluating and comparing LID approaches are also discussed.
In this paper we explore the effect of architectural choices on learning a Variational Autoencoder (VAE) for text generation. In contrast to the previously introduced VAE model for text where both the encoder and decoder are RNNs, we propose a novel hybrid architecture that blends fully feed-forward convolutional and deconvolutional components with a recurrent language model. Our architecture exhibits several attractive properties such as faster run time and convergence, ability to better handle long sequences and, more importantly, it helps to avoid some of the major difficulties posed by training VAE models on textual data.
Increased popularity of different text representations has also brought many improvements in Natural Language Processing (NLP) tasks. Without need of supervised data, embeddings trained on large corpora provide us meaningful relations to be used on different NLP tasks. Even though training these vectors is relatively easy with recent methods, information gained from the data heavily depends on the structure of the corpus language. Since the popularly researched languages have a similar morphological structure, problems occurring for morphologically rich languages are mainly disregarded in studies. For morphologically rich languages, context-free word vectors ignore morphological structure of languages. In this study, we prepared texts in morphologically different forms in a morphologically rich language, Turkish, and compared the results on different intrinsic and extrinsic tasks. To see the effect of morphological structure, we trained word2vec model on texts which lemma and suffixes are treated differently. We also trained subword model fastText and compared the embeddings on word analogy, text classification, sentimental analysis, and language model tasks.