Prosopography is an investigation of the common characteristics of a group of people in history through a collective study of their lives. It involves a study of biographies to solve historical problems. If such biographies are unavailable, surviving documents and secondary biographical data are used. Quantitative prosopography involves analysis of information from a wide variety of sources about "ordinary people". In this paper, we present a machine learning framework for automatically building a people gazetteer, which forms the basis of quantitative prosopographical research. The gazetteer is learnt from the noisy text of newspapers using a Named Entity Recognizer (NER). It can also identify influential people by making use of a custom-designed Influential Person Index (IPI). Our corpus comprises 14,020 articles from a local newspaper, "The Sun", published in New York in 1896. Some influential people identified by our algorithm include Captain Donald Hankey (an English soldier), Dame Nellie Melba (an Australian operatic soprano), Hugh Allan (a Canadian shipping magnate) and Sir Hugh John McDonald (the first Prime Minister of Canada).
Math word problems form a natural abstraction of a range of quantitative reasoning problems, such as understanding financial news, sports results, and casualties of war. Solving such problems requires an understanding of several mathematical concepts, such as dimensional analysis and subset relationships. In this paper, we develop declarative rules which govern the translation of natural language descriptions of these concepts into math expressions. We then present a framework for incorporating such declarative knowledge into word problem solving. Our method learns to map arithmetic word problem text to math expressions by learning to select the relevant declarative knowledge for each operation of the solution expression. This provides a way to handle multiple concepts in the same problem while, at the same time, supporting interpretability of the answer expression. Our method models the mapping to declarative knowledge as a latent variable, thus removing the need for expensive annotations. Experimental evaluation suggests that our domain knowledge based solver outperforms all other systems, and that it generalizes better in the realistic case where the training data it is exposed to is biased in a different way than the test data.
The enormous amount of text published daily by Internet users has fostered the development of methods to analyze this content in several natural language processing areas, such as sentiment analysis. The main goal of this task is to classify the polarity of a message. Even though many approaches have been proposed for sentiment analysis, some of the most successful ones rely on the availability of large annotated corpora, which are expensive and time-consuming to build. In recent years, distant supervision has been used to obtain larger datasets. Inspired by these techniques, in this paper we extend such approaches to incorporate emojis, the popular graphic symbols used in electronic messages, in order to create a large sentiment corpus for Portuguese. Several models were trained on almost one million tweets and tested on both same-domain and cross-domain corpora. Our methods obtained very competitive results on five annotated corpora from mixed domains (Twitter and product reviews), which demonstrates the domain independence of such an approach. In addition, our results suggest that the combination of emoticons and emojis is able to properly capture the sentiment of a message.
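The distant supervision idea described above can be illustrated with a minimal sketch. It assumes that a message containing only positive markers is labeled positive, one containing only negative markers is labeled negative, and anything else is discarded. The marker sets and the function names (`distant_label`, `strip_markers`, `build_corpus`) are hypothetical and chosen for illustration; they are not the lists or procedure used in the paper.

```python
# Hypothetical marker sets (assumption): the actual emoji and emoticon
# lists used to build the Portuguese corpus are not given in the abstract.
POSITIVE_MARKERS = {"😀", "😍", "😊", ":)", ":-)", ":D"}
NEGATIVE_MARKERS = {"😢", "😠", "😞", ":(", ":-("}


def distant_label(text):
    """Return a noisy polarity label inferred from emojis/emoticons, or None."""
    has_pos = any(marker in text for marker in POSITIVE_MARKERS)
    has_neg = any(marker in text for marker in NEGATIVE_MARKERS)
    if has_pos == has_neg:
        # Either no marker at all or conflicting markers: skip the message.
        return None
    return "positive" if has_pos else "negative"


def strip_markers(text):
    """Remove the markers so a classifier cannot simply memorize them."""
    for marker in POSITIVE_MARKERS | NEGATIVE_MARKERS:
        text = text.replace(marker, " ")
    return " ".join(text.split())


def build_corpus(messages):
    """Turn raw messages into (text, label) pairs without manual annotation."""
    corpus = []
    for message in messages:
        label = distant_label(message)
        if label is not None:
            corpus.append((strip_markers(message), label))
    return corpus
```

Removing the markers from the training text is a design choice of this sketch: it forces a downstream model to learn from the words rather than from the symbols that produced the label.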
We publicly release a new large-scale dataset, called SearchQA, for machine comprehension, or question-answering. Unlike recently released datasets, such as DeepMind CNN/DailyMail and SQuAD, the proposed SearchQA was constructed to reflect a full pipeline of general question-answering. That is, we do not start from an existing article and generate a question-answer pair; instead, we start from an existing question-answer pair, crawled from J! Archive, and augment it with text snippets retrieved by Google. Following this approach, we built SearchQA, which consists of more than 140k question-answer pairs, with each pair having 49.6 snippets on average. Each question-answer-context tuple of SearchQA comes with additional meta-data such as the snippet's URL, which we believe will be a valuable resource for future research. We conduct human evaluation and test two baseline methods on SearchQA, one based on simple word selection and the other on deep learning. We show that there is a meaningful gap between human and machine performance. This suggests that the proposed dataset could well serve as a benchmark for question-answering.
Scientific results are communicated visually in the literature through diagrams, visualizations, and photographs. These information-dense objects have been largely ignored in bibliometrics and scientometrics studies when compared to citations and text. In this paper, we use techniques from computer vision and machine learning to classify more than 8 million figures from PubMed into 5 figure types and study the resulting patterns of visual information as they relate to impact. We find that the distribution of figures and figure types in the literature has remained relatively constant over time, but can vary widely across field and topic. Remarkably, we find a significant correlation between scientific impact and the use of visual information, where higher impact papers tend to include more diagrams, and to a lesser extent more plots and photographs. To explore these results and other ways of extracting this visual information, we have built a visual browser to illustrate the concept and explore design alternatives for supporting viziometric analysis and organizing visual information. We use these results to articulate a new research agenda -- viziometrics -- to study the organization and presentation of visual information in the scientific literature.
Any natural language can be considered as a tool for producing large databases (consisting of texts, written or discursive). The description of this tool in turn requires other large databases (dictionaries, grammars, etc.). Nowadays, the notion of a database is associated with computer processing and computer memory. However, a natural language also resides in human brains and functions in human communication, from interpersonal to intergenerational. In this survey/research paper, we discuss mathematical, in particular geometric, constructions which help to bridge these two worlds. In particular, we consider the Vector Space Model of semantics based on frequency matrices, as used in Natural Language Processing. We investigate underlying geometries, formulated in terms of Grassmannians, projective spaces, and flag varieties. We formulate the relation between vector space models and semantic spaces based on semic axes in terms of projectability of subvarieties in Grassmannians and projective spaces. We interpret Latent Semantics as a geometric flow on Grassmannians. We also discuss how to formulate Gärdenfors' notion of "meeting of minds" in our geometric setting.
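As a concrete illustration of a Vector Space Model built from a frequency matrix, the following minimal sketch counts term occurrences per document and compares term vectors by cosine similarity. The whitespace tokenization and the helper names (`frequency_matrix`, `cosine`) are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def frequency_matrix(documents):
    """Build a term-by-document count matrix (rows: terms, columns: documents)."""
    tokenized = [doc.lower().split() for doc in documents]  # naive tokenizer (assumption)
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[count[word] for count in counts] for word in vocabulary]
    return vocabulary, matrix


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

The cosine does not change when a vector is multiplied by a positive scalar, so it depends only on the direction of each term vector. This is one elementary way to see why projective spaces arise naturally when such models are studied geometrically.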
Higher-level cognition includes logical reasoning and the ability to answer questions with common sense. The RatioLog project addresses the problem of rational reasoning in deep question answering by methods from automated deduction and cognitive computing. In a first phase, we combine techniques from information retrieval and machine learning to find appropriate answer candidates from the huge amount of text in the German version of the free encyclopedia "Wikipedia". In a second phase, an automated theorem prover tries to verify the answer candidates on the basis of their logical representations. In a third phase, because the knowledge may be incomplete and inconsistent, we consider extensions of logical reasoning to improve the results. In this context, we work toward the application of techniques from human reasoning: we employ defeasible reasoning to compare the answers w.r.t. specificity, deontic logic, normative reasoning, and model construction. Moreover, we use integrated case-based reasoning and machine learning techniques on the basis of the semantic structure of the questions and answer candidates to learn to give the right answers.
This work aims to resolve issues related to Web Service retrieval, also known as Service Selection, Discovery or essentially Matching, in two directions. Firstly, a novel matching algorithm for SAWSDL is introduced. The algorithm is hybrid in nature, combining novel and known concepts, such as a logic-based strategy and syntactic text-similarity measures on semantic annotations and textual descriptions. A plugin for the S3 contest environment was developed in order to position Tomaco amongst the state of the art in an objective, reproducible manner. Evaluation showed that Tomaco ranks high amongst the state of the art, especially for early recall levels. Secondly, this work introduces the Tomaco web application, which aims to accelerate the widespread adoption of Semantic Web Service technologies and algorithms while addressing the lack of user-friendly applications in this field. Tomaco integrates a variety of configurable matching algorithms proposed in this paper. Finally, it allows discovery of both existing and user-contributed service collections and ontologies, serving also as a service registry.
The project presented in this article aims to formalize criteria and procedures for extracting semantic information from parsed dictionary glosses. The actual purpose of the project is the generation of a semantic network (nearly an ontology) derived from a monolingual Italian dictionary through unsupervised procedures. Since the project involves rule-based Parsing, Semantic Tagging and Word Sense Disambiguation techniques, its outcomes may be of interest beyond this immediate intent. The cooperation of syntactic and semantic features in meaning construction is investigated, and procedures which allow a translation of syntactic dependencies into semantic relations are discussed. The procedures that arise from this project can also be applied to text types other than dictionary glosses, as they convert the output of a parsing process into a semantic representation. In addition, some mechanisms are sketched that may lead to a kind of procedural semantics, through which multiple paraphrases of a given expression can be generated. This means that these techniques may also find an application in 'query expansion' strategies, which are of interest to Information Retrieval, Search Engines and Question Answering Systems.
In recent work we have presented a formal framework for linguistic annotation based on labeled acyclic digraphs. These 'annotation graphs' offer a simple yet powerful method for representing complex annotation structures incorporating hierarchy and overlap. Here, we motivate and illustrate our approach using discourse-level annotations of text and speech data drawn from the CALLHOME, COCONUT, MUC-7, DAMSL and TRAINS annotation schemes. With the help of domain specialists, we have constructed a hybrid multi-level annotation for a fragment of the Boston University Radio Speech Corpus which includes the following levels: segment, word, breath, ToBI, Tilt, Treebank, coreference and named entity. We show how annotation graphs can represent hybrid multi-level structures which derive from a diverse set of file formats. We also show how the approach facilitates substantive comparison of multiple annotations of a single signal based on different theoretical models. The discussion shows how annotation graphs open the door to wide-ranging integration of tools, formats and corpora.
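To make the idea of a labeled acyclic digraph more concrete, the sketch below stores annotations as arcs between nodes, where each arc carries a level (such as word or named entity) and a label. The class name `AnnotationGraph`, its methods, and the optional time offset on nodes are assumptions made for illustration; this is not the authors' formalization or file format.

```python
from collections import defaultdict


class AnnotationGraph:
    """Illustrative labeled digraph: arcs are (source, target, level, label)."""

    def __init__(self):
        self.node_time = {}  # node id -> optional time offset (assumption)
        self.arcs = []

    def add_node(self, node, time=None):
        self.node_time[node] = time

    def add_arc(self, source, target, level, label):
        # Register endpoints that were not declared explicitly.
        for node in (source, target):
            self.node_time.setdefault(node, None)
        self.arcs.append((source, target, level, label))

    def is_acyclic(self):
        """Check acyclicity with Kahn's topological-sort algorithm."""
        indegree = {node: 0 for node in self.node_time}
        successors = defaultdict(list)
        for source, target, _, _ in self.arcs:
            successors[source].append(target)
            indegree[target] += 1
        ready = [node for node, degree in indegree.items() if degree == 0]
        visited = 0
        while ready:
            node = ready.pop()
            visited += 1
            for nxt in successors[node]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)
        # Every node is visited exactly when the graph has no cycle.
        return visited == len(indegree)

    def labels(self, level):
        """Return the labels of all arcs on one annotation level, in insertion order."""
        return [label for _, _, lvl, label in self.arcs if lvl == level]
```

In this sketch, hierarchy and overlap come from arcs on different levels sharing nodes: a named-entity arc from node 0 to node 2 spans the two word arcs 0 to 1 and 1 to 2 without requiring the levels to form a single tree.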