The paper proposes the task of universal semantic tagging---tagging word tokens with language-neutral, semantically informative tags. We argue that the task, with its independent nature, contributes to better semantic analysis for wide-coverage multilingual text. We present the initial version of the semantic tagset and show that (a) the tags provide semantically fine-grained information, and (b) they are suitable for cross-lingual semantic parsing. An application of the semantic tagging in the Parallel Meaning Bank supports both of these points as the tags contribute to formal lexical semantics and their cross-lingual projection. As a part of the application, we annotate a small corpus with the semantic tags and present new baseline result for universal semantic tagging.
Many genres of natural language text are narratively structured, a testament to our predilection for organizing our experiences as narratives. There is broad consensus that understanding a narrative requires identifying and tracking the goals and desires of the characters and their narrative outcomes. However, to date, there has been limited work on computational models for this problem. We introduce a new dataset, DesireDB, which includes gold-standard labels for identifying statements of desire, textual evidence for desire fulfillment, and annotations for whether the stated desire is fulfilled given the evidence in the narrative context. We report experiments on tracking desire fulfillment using different methods, and show that LSTM Skip-Thought model achieves F-measure of 0.7 on our corpus.
Word embeddings are a powerful approach for unsupervised analysis of language. Recently, Rudolph et al. (2016) developed exponential family embeddings, which cast word embeddings in a probabilistic framework. Here, we develop dynamic embeddings, building on exponential family embeddings to capture how the meanings of words change over time. We use dynamic embeddings to analyze three large collections of historical texts: the U.S. Senate speeches from 1858 to 2009, the history of computer science ACM abstracts from 1951 to 2014, and machine learning papers on the Arxiv from 2007 to 2015. We find dynamic embeddings provide better fits than classical embeddings and capture interesting patterns about how language changes.
We propose a general approach to modeling semi-supervised learning (SSL) algorithms. Specifically, we present a declarative language for modeling both traditional supervised classification tasks and many SSL heuristics, including both well-known heuristics such as co-training and novel domain-specific heuristics. In addition to representing individual SSL heuristics, we show that multiple heuristics can be automatically combined using Bayesian optimization methods. We experiment with two classes of tasks, link-based text classification and relation extraction. We show modest improvements on well-studied link-based classification benchmarks, and state-of-the-art results on relation-extraction tasks for two realistic domains.
We present a submodular function-based framework for query-focused opinion summarization. Within our framework, relevance ordering produced by a statistical ranker, and information coverage with respect to topic distribution and diverse viewpoints are both encoded as submodular functions. Dispersion functions are utilized to minimize the redundancy. We are the first to evaluate different metrics of text similarity for submodularity-based summarization methods. By experimenting on community QA and blog summarization, we show that our system outperforms state-of-the-art approaches in both automatic evaluation and human evaluation. A human evaluation task is conducted on Amazon Mechanical Turk with scale, and shows that our systems are able to generate summaries of high overall quality and information diversity.
We present a discriminative model for single-document summarization that integrally combines compression and anaphoricity constraints. Our model selects textual units to include in the summary based on a rich set of sparse features whose weights are learned on a large corpus. We allow for the deletion of content within a sentence when that deletion is licensed by compression rules; in our framework, these are implemented as dependencies between subsentential units of text. Anaphoricity constraints then improve cross-sentence coherence by guaranteeing that, for each pronoun included in the summary, the pronoun's antecedent is included as well or the pronoun is rewritten as a full mention. When trained end-to-end, our final system outperforms prior work on both ROUGE as well as on human judgments of linguistic quality.
In this letter, we present a novel exponentially embedded families (EEF) based classification method, in which the probability density function (PDF) on raw data is estimated from the PDF on features. With the PDF construction, we show that class-specific features can be used in the proposed classification method, instead of a common feature subset for all classes as used in conventional approaches. We apply the proposed EEF classifier for text categorization as a case study and derive an optimal Bayesian classification rule with class-specific feature selection based on the Information Gain (IG) score. The promising performance on real-life data sets demonstrates the effectiveness of the proposed approach and indicates its wide potential applications.
Despite the 'apparent clarity' of a given legal provision, its application may result in an outcome that does not exactly conform to the semantic level of a statute. The vagueness within a legal text is induced intentionally to accommodate all possible scenarios under which such norms should be applied, thus making the role of pragmatics an important aspect also in the representation of a legal norm and reasoning on top of it. The notion of pragmatics considered in this paper does not focus on the aspects associated with judicial decision making. The paper aims to shed light on the aspects of pragmatics in legal linguistics, mainly focusing on the domain of patent law, only from a knowledge representation perspective. The philosophical discussions presented in this paper are grounded based on the legal theories from Grice and Marmor.
We investigate properties of evolving linguistic networks defined by the word-adjacency relation. Such networks belong to the category of networks with accelerated growth but their shortest path length appears to reveal the network size dependence of different functional form than the ones known so far. We thus compare the networks created from literary texts with their artificial substitutes based on different variants of the Dorogovtsev-Mendes model and observe that none of them is able to properly simulate the novel asymptotics of the shortest path length. Then, we identify the local chain-like linear growth induced by grammar and style as a missing element in this model and extend it by incorporating such effects. It is in this way that a satisfactory agreement with the empirical result is obtained.
This paper presents a machine learning approach for identification of Bengali multiword expressions (MWE) which are bigram nominal compounds. Our proposed approach has two steps: (1) candidate extraction using chunk information and various heuristic rules and (2) training the machine learning algorithm called Random Forest to classify the candidates into two groups: bigram nominal compound MWE or not bigram nominal compound MWE. A variety of association measures, syntactic and linguistic clues and a set of WordNet-based similarity features have been used for our MWE identification task. The approach presented in this paper can be used to identify bigram nominal compound MWE in Bengali running text.