Semantically non-compositional phrases constitute an intriguing research topic in Natural Language Processing. Semantic non-compositionality, the situation in which the meaning of a phrase cannot be derived from the meanings of its components, is the main characteristic of such phrases; however, they also exhibit other characteristics, such as high statistical association and non-substitutability. In this work, we present a model for identifying non-compositional phrases that takes all of these characteristics into account. We show that the presented model remarkably outperforms existing models for identifying non-compositional phrases, which mostly focus on only one of these characteristics.
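As an illustration of what "high statistical association" can mean in practice, the sketch below scores adjacent word pairs with pointwise mutual information (PMI). PMI is an assumption chosen for illustration only; the abstract does not say which association measure the model uses, and the function name `pmi_scores` and the toy sentence are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return the PMI (in bits) of every adjacent word pair in a token list.

    Illustrative only: PMI stands in for "statistical association";
    it is not claimed to be the measure used by the model in the abstract.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_unigrams
        p_second = unigrams[second] / n_unigrams
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently.
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Toy corpus (hypothetical); real estimates need far more text.
toy_tokens = "kick the bucket and kick the ball".split()
toy_scores = pmi_scores(toy_tokens)
```

A positive score means the pair co-occurs more often than independence would predict. On a corpus this small the values are not meaningful, and association alone does not separate idioms such as "kick the bucket" from frequent literal phrases, which is why the abstract combines it with other characteristics.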
In online discussion fora, speakers often make arguments for or against something, say birth control, by highlighting certain aspects of the topic. In social science, this is referred to as issue framing. In this paper, we introduce a new corpus of online discussions annotated with issue frames. We explore to what extent models trained to detect issue frames in newswire and social media can be transferred to the domain of discussion fora, using a combination of multi-task and adversarial training and assuming only unlabeled training data in the target domain.
Patient summarization is essential for clinicians to provide coordinated care and practice effective communication. Automated summarization has the potential to save time, standardize notes, aid clinical decision making, and reduce medical errors. Here we provide an upper bound on extractive summarization of discharge notes and develop an LSTM model to sequentially label topics of history of present illness notes. We achieve an F1 score of 0.876, which indicates that this model can be employed to create a dataset for evaluation of extractive summarization methods.
Self-organization has been an important concept within a number of disciplines, and Artificial Life (ALife) has also relied on it heavily since its inception. The term and its implications, however, are often confusing or misinterpreted. In this work, we provide a mini-review of self-organization and its relationship with ALife, aiming to initiate discussion of this important topic among interested readers. We first articulate some fundamental aspects of self-organization, outline its usage, and review its applications to ALife within its soft, hard, and wet domains. We also provide perspectives for further research.
We present a new corpus, PersonaBank, consisting of 108 personal stories from weblogs that have been annotated with their Story Intention Graphs, a deep representation of the fabula of a story. We describe the topics of the stories and the basis of the Story Intention Graph representation, as well as the process of annotating the stories to produce the Story Intention Graphs and the challenges of adapting the tool to this new personal narrative domain. We also discuss how the corpus can be used in applications that retell the story using different styles of tellings, co-tellings, or as a content planner.
Latent tree analysis seeks to model the correlations among a set of random variables using a tree of latent variables. It was proposed as an improvement to latent class analysis --- a method widely used in social sciences and medicine to identify homogeneous subgroups in a population. It provides new and fruitful perspectives on a number of machine learning areas, including cluster analysis, topic detection, and deep probabilistic modeling. This paper gives an overview of the research on latent tree analysis and various ways it is used in practice.
This paper presents a system which creates and visualizes probabilistic semantic links between concepts in a thesaurus and classes in a classification system. For creating the links, we build on the Polylingual Labeled Topic Model (PLL-TM). PLL-TM identifies probable thesaurus descriptors for each class in the classification system by using information from the natural language text of documents, their assigned thesaurus descriptors and their designated classes. The links are then presented to users of the system in an interactive visualization, providing them with an automatically generated overview of the relations between the thesaurus and the classification system.
Public health informatics is increasingly shifting from the professional to the public sphere. In this work, we apply linguistic analytics to Yelp restaurant reviews in order to automatically predict official health inspection reports. We consider two types of feature sets, keyword detection features and topic model features, and use them with several classification methods. Our empirical analysis shows that these extracted features can predict public health inspection reports with over 90% accuracy using simple support vector machines.
In this work, we propose a new approach, based on a Markov chain model, for discovering various relationships among keywords in scientific publications. This is an important problem because keywords are the basic elements for representing abstract objects such as documents, user profiles, topics, and many others. The model's effectiveness stems from combining four important factors in scientific publications: content, publicity, impact, and randomness. In particular, we present a recommendation system, SciRecSys, that helps users find relevant articles efficiently.
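To make the Markov chain idea concrete, the sketch below computes the long-run visit probability of each keyword in a small chain, with a random-jump term standing in for the "randomness" factor. This is a generic PageRank-style power iteration written under stated assumptions; it is not the SciRecSys model, whose transition definition is not given in the abstract. The function name, the `damping` parameter, and the toy keywords and probabilities are all hypothetical.

```python
def stationary_distribution(transitions, damping=0.85, iterations=100):
    """Approximate the long-run keyword distribution by power iteration.

    transitions: dict mapping each keyword to a dict of
        {next_keyword: probability}; each row is assumed to sum to 1 and
        every next_keyword is assumed to be a key of `transitions`.
    damping: probability of following a transition; with probability
        1 - damping the walk jumps to a uniformly random keyword.
    """
    keywords = sorted(transitions)
    n = len(keywords)
    rank = {k: 1.0 / n for k in keywords}

    for _ in range(iterations):
        # Every keyword receives the uniform random-jump mass first.
        new_rank = {k: (1.0 - damping) / n for k in keywords}
        for source in keywords:
            for target, prob in transitions[source].items():
                new_rank[target] += damping * rank[source] * prob
        rank = new_rank
    return rank


# Hypothetical keyword chain; the probabilities are made up for illustration.
toy_chain = {
    "topic model": {"lda": 1.0},
    "lda": {"topic model": 0.5, "inference": 0.5},
    "inference": {"lda": 1.0},
}
toy_rank = stationary_distribution(toy_chain)
```

In this toy chain "lda" receives the largest share because both other keywords lead to it. In a real system the transition probabilities would have to be estimated from the publication data, for example from content, popularity, and citation impact, which is the part this sketch leaves out.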
Given empirical evidence for the dependence of an outcome variable on an exposure variable, we can typically only provide bounds for the "probability of causation" in the case of an individual who has developed the outcome after being exposed. We show how these bounds can be adapted or improved if further information becomes available. In addition to reviewing existing work on this topic, we provide a new analysis for the case where a mediating variable can be observed. In particular, we show how the probability of causation can be bounded when there is no direct effect and no confounding. Keywords: Causal inference, Mediation Analysis, Probability of Causation
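For orientation, the sketch below computes the standard bounds on the probability of causation that are known from the literature for a binary exposure and a binary outcome when there is no confounding: the lower bound is max(0, 1 - 1/RR), where RR is the risk ratio, and the upper bound is min(1, P(no outcome | unexposed) / P(outcome | exposed)). It illustrates only this basic case, not the refined bounds with a mediating variable that the abstract announces. The function name and the example probabilities are hypothetical.

```python
def probability_of_causation_bounds(p_outcome_exposed, p_outcome_unexposed):
    """Return (lower, upper) bounds on the probability of causation.

    Assumes a binary exposure, a binary outcome, and no confounding
    (as if exposure had been randomized).

    p_outcome_exposed:   P(outcome | exposed), must be in (0, 1].
    p_outcome_unexposed: P(outcome | unexposed), must be in [0, 1].
    """
    if not 0.0 < p_outcome_exposed <= 1.0:
        raise ValueError("p_outcome_exposed must be in (0, 1]")
    if not 0.0 <= p_outcome_unexposed <= 1.0:
        raise ValueError("p_outcome_unexposed must be in [0, 1]")

    # Lower bound: 1 - 1/RR with RR = p_outcome_exposed / p_outcome_unexposed,
    # truncated at 0 when exposure does not raise the risk.
    lower = max(0.0, 1.0 - p_outcome_unexposed / p_outcome_exposed)
    # Upper bound: share of exposed cases that could have avoided the outcome.
    upper = min(1.0, (1.0 - p_outcome_unexposed) / p_outcome_exposed)
    return lower, upper


# Hypothetical risks: 80% of exposed and 40% of unexposed develop the outcome.
example_lower, example_upper = probability_of_causation_bounds(0.8, 0.4)
```

With these made-up risks the risk ratio is 2, so the probability of causation for an exposed individual who developed the outcome lies between 0.5 and 0.75. The gap between the two bounds is what additional information, such as an observed mediator, can potentially narrow.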