Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

Measuring the Similarity of Sentential Arguments in Dialog

Sep 06, 2017
Amita Misra, Brian Ecker, Marilyn A. Walker

When people converse about social or political topics, similar arguments are often paraphrased by different speakers, across many different conversations. Debate websites produce curated summaries of arguments on such topics; these summaries typically consist of lists of sentences that represent frequently paraphrased propositions, or labels capturing the essence of one particular aspect of an argument, e.g. Morality or Second Amendment. We call these frequently paraphrased propositions ARGUMENT FACETS. Like these curated sites, our goal is to induce and identify argument facets across multiple conversations, and produce summaries. However, we aim to do this automatically. We frame the problem as consisting of two steps: we first extract sentences that express an argument from raw social media dialogs, and then rank the extracted arguments in terms of their similarity to one another. Sets of similar arguments are used to represent argument facets. We show here that we can predict ARGUMENT FACET SIMILARITY with a correlation averaging 0.63 compared to a human topline averaging 0.68 over three debate topics, easily beating several reasonable baselines.

* Measuring the Similarity of Sentential Arguments in Dialog, by Misra, Amita and Ecker, Brian and Walker, Marilyn A, 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages={276}, year={2016} The dataset is available at 

  Access Paper or Ask Questions

Forecasting AI Progress: A Research Agenda

Aug 04, 2020
Ross Gruetzemacher, Florian Dorner, Niko Bernaola-Alvarez, Charlie Giattino, David Manheim

Forecasting AI progress is essential to reducing uncertainty in order to appropriately plan for research efforts on AI safety and AI governance. While this is generally considered to be an important topic, little work has been conducted on it and there is no published document that gives and objective overview of the field. Moreover, the field is very diverse and there is no published consensus regarding its direction. This paper describes the development of a research agenda for forecasting AI progress which utilized the Delphi technique to elicit and aggregate experts' opinions on what questions and methods to prioritize. The results of the Delphi are presented; the remainder of the paper follow the structure of these results, briefly reviewing relevant literature and suggesting future work for each topic. Experts indicated that a wide variety of methods should be considered for forecasting AI progress. Moreover, experts identified salient questions that were both general and completely unique to the problem of forecasting AI progress. Some of the highest priority topics include the validation of (partially unresolved) forecasts, how to make forecasting action-guiding and the quality of different performance metrics. While statistical methods seem more promising, there is also recognition that supplementing judgmental techniques can be quite beneficial.

* 40 pages including Appendices, 1 figure, 5 tables 

  Access Paper or Ask Questions

Acrostic Poem Generation

Oct 05, 2020
Rajat Agarwal, Katharina Kann

We propose a new task in the area of computational creativity: acrostic poem generation in English. Acrostic poems are poems that contain a hidden message; typically, the first letter of each line spells out a word or short phrase. We define the task as a generation task with multiple constraints: given an input word, 1) the initial letters of each line should spell out the provided word, 2) the poem's semantics should also relate to it, and 3) the poem should conform to a rhyming scheme. We further provide a baseline model for the task, which consists of a conditional neural language model in combination with a neural rhyming model. Since no dedicated datasets for acrostic poem generation exist, we create training data for our task by first training a separate topic prediction model on a small set of topic-annotated poems and then predicting topics for additional poems. Our experiments show that the acrostic poems generated by our baseline are received well by humans and do not lose much quality due to the additional constraints. Last, we confirm that poems generated by our model are indeed closely related to the provided prompts, and that pretraining on Wikipedia can boost performance.

* EMNLP 2020 

  Access Paper or Ask Questions

A Step Towards Interpretable Authorship Verification

Jul 07, 2020
Oren Halvani, Lukas Graner, Roey Regev

A central problem that has been researched for many years in the field of digital text forensics is the question whether two documents were written by the same author. Authorship verification (AV) is a research branch in this field that deals with this question. Over the years, research activities in the context of AV have steadily increased, which has led to a variety of approaches trying to solve this problem. Many of these approaches, however, make use of features that are related to or influenced by the topic of the documents. Therefore, it may accidentally happen that their verification results are based not on the writing style (the actual focus of AV), but on the topic of the documents. To address this problem, we propose an alternative AV approach that considers only topic-agnostic features in its classification decision. In addition, we present a post-hoc interpretation method that allows to understand which particular features have contributed to the prediction of the proposed AV method. To evaluate the performance of our AV method, we compared it with ten competing baselines (including the current state of the art) on four challenging data sets. The results show that our approach outperforms all baselines in two cases (with a maximum accuracy of 84%), while in the other two cases it performs as well as the strongest baseline.

* 21 pages, 5 figures. Paper has been accepted for publication in: The 15th International Conference on Availability, Reliability and Security (ARES 2020) 

  Access Paper or Ask Questions

Scalable Evaluation and Improvement of Document Set Expansion via Neural Positive-Unlabeled Learning

Oct 29, 2019
Alon Jacovi, Gang Niu, Yoav Goldberg, Masashi Sugiyama

We consider the situation in which a user has collected a small set of documents on a cohesive topic, and they want to retrieve additional documents on this topic from a large collection. Information Retrieval (IR) solutions treat the document set as a query, and look for similar documents in the collection. We propose to extend the IR approach by treating the problem as an instance of positive-unlabeled (PU) learning---i.e., learning binary classifiers from only positive and unlabeled data, where the positive data corresponds to the query documents, and the unlabeled data is the results returned by the IR engine. Utilizing PU learning for text with big neural networks is a largely unexplored field. We discuss various challenges in applying PU learning to the setting, including an unknown class prior, extremely imbalanced data and large-scale accurate evaluation of models, and we propose solutions and empirically validate them. We demonstrate the effectiveness of the method using a series of experiments of retrieving PubMed abstracts adhering to fine-grained topics. We demonstrate improvements over the base IR solution and other baselines. Implementation is available at

  Access Paper or Ask Questions

MOROCO: The Moldavian and Romanian Dialectal Corpus

Jan 19, 2019
Andrei M. Butnaru, Radu Tudor Ionescu

In this work, we introduce the MOldavian and ROmanian Dialectal COrpus (MOROCO), which is freely available for download at The corpus contains 33564 samples of text (with over 10 million tokens) collected from the news domain. The samples belong to one of the following six topics: culture, finance, politics, science, sports and tech. The data set is divided into 21719 samples for training, 5921 samples for validation and another 5924 samples for testing. For each sample, we provide corresponding dialectal and category labels. This allows us to perform empirical studies on several classification tasks such as (i) binary discrimination of Moldavian versus Romanian text samples, (ii) intra-dialect multi-class categorization by topic and (iii) cross-dialect multi-class categorization by topic. We perform experiments using a shallow approach based on string kernels, as well as a novel deep approach based on character-level convolutional neural networks containing Squeeze-and-Excitation blocks. We also present and analyze the most discriminative features of our best performing model, before and after named entity removal.

  Access Paper or Ask Questions

Computational Content Analysis of Negative Tweets for Obesity, Diet, Diabetes, and Exercise

Sep 22, 2017
George Shaw Jr., Amir Karami

Social media based digital epidemiology has the potential to support faster response and deeper understanding of public health related threats. This study proposes a new framework to analyze unstructured health related textual data via Twitter users' post (tweets) to characterize the negative health sentiments and non-health related concerns in relations to the corpus of negative sentiments, regarding Diet Diabetes Exercise, and Obesity (DDEO). Through the collection of 6 million Tweets for one month, this study identified the prominent topics of users as it relates to the negative sentiments. Our proposed framework uses two text mining methods, sentiment analysis and topic modeling, to discover negative topics. The negative sentiments of Twitter users support the literature narratives and the many morbidity issues that are associated with DDEO and the linkage between obesity and diabetes. The framework offers a potential method to understand the publics' opinions and sentiments regarding DDEO. More importantly, this research provides new opportunities for computational social scientists, medical experts, and public health professionals to collectively address DDEO-related issues.

* The 2017 Annual Meeting of the Association for Information Science and Technology (ASIST) 

  Access Paper or Ask Questions

Communicative need modulates competition in language change

Jun 16, 2020
Andres Karjus, Richard A. Blythe, Simon Kirby, Kenny Smith

All living languages change over time. The causes for this are many, one being the emergence and borrowing of new linguistic elements. Competition between the new elements and older ones with a similar semantic or grammatical function may lead to speakers preferring one of them, and leaving the other to go out of use. We introduce a general method for quantifying competition between linguistic elements in diachronic corpora which does not require language-specific resources other than a sufficiently large corpus. This approach is readily applicable to a wide range of languages and linguistic subsystems. Here, we apply it to lexical data in five corpora differing in language, type, genre, and time span. We find that changes in communicative need are consistently predictive of lexical competition dynamics. Near-synonymous words are more likely to directly compete if they belong to a topic of conversation whose importance to language users is constant over time, possibly leading to the extinction of one of the competing words. By contrast, in topics which are increasing in importance for language users, near-synonymous words tend not to compete directly and can coexist. This suggests that, in addition to direct competition between words, language change can be driven by competition between topics or semantic subspaces.

  Access Paper or Ask Questions

Short-Text Classification Using Unsupervised Keyword Expansion

Sep 16, 2019
Duncan Cameron-Steinke

Short-text classification, like all data science, struggles to achieve high performance using limited data. As a solution, a short sentence may be expanded with new and relevant feature words to form an artificially enlarged dataset, and add new features to testing data. This paper applies a novel approach to text expansion by generating new words directly for each input sentence, thus requiring no additional datasets or previous training. In this unsupervised approach, new keywords are formed within the hidden states of a pre-trained language model and then used to create extended pseudo documents. The word generation process was assessed by examining how well the predicted words matched to topics of the input sentence. It was found that this method could produce 3-10 relevant new words for each target topic, while generating just 1 word related to each non-target topic. Generated words were then added to short news headlines to create extended pseudo headlines. Experimental results have shown that models trained using the pseudo headlines can improve classification accuracy when limiting the number of training examples.

  Access Paper or Ask Questions