Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

Mining and discovering biographical information in Difangzhi with a language-model-based approach

Apr 08, 2015
Peter K. Bol, Chao-Lin Liu, Hongsu Wang

We present results of expanding the contents of the China Biographical Database by text mining historical local gazetteers, difangzhi. The goal of the database is to see how people are connected together, through kinship, social connections, and the places and offices in which they served. The gazetteers are the single most important collection of names and offices covering the Song through Qing periods. Although we begin with local officials we shall eventually include lists of local examination candidates, people from the locality who served in government, and notable local figures with biographies. The more data we collect the more connections emerge. The value of doing systematic text mining work is that we can identify relevant connections that are either directly informative or can become useful without deep historical research. Academia Sinica is developing a name database for officials in the central governments of the Ming and Qing dynasties.

* 6 pages, 4 figures, 1 table, 2015 International Conference on Digital Humanities. in Proceedings of the 2015 International Conference on Digital Humanities (DH 2015). July 2015 

  Access Paper or Ask Questions

Analysis of Named Entity Recognition and Linking for Tweets

Oct 27, 2014
Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, Kalina Bontcheva

Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.

* Information Processing & Management 51 (2), 32-49, 2014 
* 35 pages, accepted to journal Information Processing and Management 

  Access Paper or Ask Questions

Topic Similarity Networks: Visual Analytics for Large Document Sets

Sep 26, 2014
Arun S. Maiya, Robert M. Rolfe

We investigate ways in which to improve the interpretability of LDA topic models by better analyzing and visualizing their outputs. We focus on examining what we refer to as topic similarity networks: graphs in which nodes represent latent topics in text collections and links represent similarity among topics. We describe efficient and effective approaches to both building and labeling such networks. Visualizations of topic models based on these networks are shown to be a powerful means of exploring, characterizing, and summarizing large collections of unstructured text documents. They help to "tease out" non-obvious connections among different sets of documents and provide insights into how topics form larger themes. We demonstrate the efficacy and practicality of these approaches through two case studies: 1) NSF grants for basic research spanning a 14 year period and 2) the entire English portion of Wikipedia.

* 9 pages; 2014 IEEE International Conference on Big Data (IEEE BigData 2014) 

  Access Paper or Ask Questions

Universal Properties of Mythological Networks

Jul 18, 2012
Pádraig Mac Carron, Ralph Kenna

As in statistical physics, the concept of universality plays an important, albeit qualitative, role in the field of comparative mythology. Here we apply statistical mechanical tools to analyse the networks underlying three iconic mythological narratives with a view to identifying common and distinguishing quantitative features. Of the three narratives, an Anglo-Saxon and a Greek text are mostly believed by antiquarians to be partly historically based while the third, an Irish epic, is often considered to be fictional. Here we show that network analysis is able to discriminate real from imaginary social networks and place mythological narratives on the spectrum between them. Moreover, the perceived artificiality of the Irish narrative can be traced back to anomalous features associated with six characters. Considering these as amalgams of several entities or proxies, renders the plausibility of the Irish text comparable to the others from a network-theoretic point of view.

* EPL 99 (2012) 28002 
* 6 pages, 3 figures, 2 tables. Updated to incorporate corrections from EPL acceptance process 

  Access Paper or Ask Questions

A Semi-Automatic Framework to Discover Epistemic Modalities in Scientific Articles

Apr 07, 2008
Sviatlana Danilava, Christoph Schommer

Documents in scientific newspapers are often marked by attitudes and opinions of the author and/or other persons, who contribute with objective and subjective statements and arguments as well. In this respect, the attitude is often accomplished by a linguistic modality. As in languages like english, french and german, the modality is expressed by special verbs like can, must, may, etc. and the subjunctive mood, an occurrence of modalities often induces that these verbs take over the role of modality. This is not correct as it is proven that modality is the instrument of the whole sentence where both the adverbs, modal particles, punctuation marks, and the intonation of a sentence contribute. Often, a combination of all these instruments are necessary to express a modality. In this work, we concern with the finding of modal verbs in scientific texts as a pre-step towards the discovery of the attitude of an author. Whereas the input will be an arbitrary text, the output consists of zones representing modalities.

* 18 pages, 5 Figures 

  Access Paper or Ask Questions

Ma(r)king concessions in English and German

Jun 01, 1995
Brigitte Grote, Nils Lenke, Manfred Stede

In order to generate cohesive discourse, many of the relations holding between text segments need to be signalled to the reader by means of cue words, or {\em discourse markers}. Programs usually do this in a simplistic way, e.g., by using one marker per relation. In reality, however, language offers a very wide range of markers from which informed choices should be made. In order to account for the variety and to identify the parameters governing the choices, detailled linguistic analyses are necessary. We worked with one area of discourse relations, the Concession family, identified its underlying pragmatics and semantics, and undertook extensive corpus studies to examine the range of markers used in both English and German. On the basis of an initial classification of these markers, we propose a generation model for producing bilingual text that can incorporate marker choice into its overall decision framework.

* Proc. of the 6th European Workshop on Natural Language Generation, NL-Leiden, May 1995 
* 23 pages, uuencoded compressed postscript 

  Access Paper or Ask Questions

Improve Discourse Dependency Parsing with Contextualized Representations

May 04, 2022
Yifei Zhou, Yansong Feng

Recent works show that discourse analysis benefits from modeling intra- and inter-sentential levels separately, where proper representations for text units of different granularities are desired to capture both the meaning of text units and their relations to the context. In this paper, we propose to take advantage of transformers to encode contextualized representations of units of different levels to dynamically capture the information required for discourse dependency analysis on intra- and inter-sentential levels. Motivated by the observation of writing patterns commonly shared across articles, we propose a novel method that treats discourse relation identification as a sequence labelling task, which takes advantage of structural information from the context of extracted discourse trees, and substantially outperforms traditional direct-classification methods. Experiments show that our model achieves state-of-the-art results on both English and Chinese datasets.

* Naacl 2022 (findings) 

  Access Paper or Ask Questions

Learning Topic Models: Identifiability and Finite-Sample Analysis

Oct 08, 2021
Yinyin Chen, Shishuang He, Yun Yang, Feng Liang

Topic models provide a useful text-mining tool for learning, extracting and discovering latent structures in large text corpora. Although a plethora of methods have been proposed for topic modeling, a formal theoretical investigation on the statistical identifiability and accuracy of latent topic estimation is lacking in the literature. In this paper, we propose a maximum likelihood estimator (MLE) of latent topics based on a specific integrated likelihood, which is naturally connected to the concept of volume minimization in computational geometry. Theoretically, we introduce a new set of geometric conditions for topic model identifiability, which are weaker than conventional separability conditions relying on the existence of anchor words or pure topic documents. We conduct finite-sample error analysis for the proposed estimator and discuss the connection of our results with existing ones. We conclude with empirical studies on both simulated and real datasets.

  Access Paper or Ask Questions

Visually Grounded Concept Composition

Sep 29, 2021
Bowen Zhang, Hexiang Hu, Linlu Qiu, Peter Shaw, Fei Sha

We investigate ways to compose complex concepts in texts from primitive ones while grounding them in images. We propose Concept and Relation Graph (CRG), which builds on top of constituency analysis and consists of recursively combined concepts with predicate functions. Meanwhile, we propose a concept composition neural network called Composer to leverage the CRG for visually grounded concept learning. Specifically, we learn the grounding of both primitive and all composed concepts by aligning them to images and show that learning to compose leads to more robust grounding results, measured in text-to-image matching accuracy. Notably, our model can model grounded concepts forming at both the finer-grained sentence level and the coarser-grained intermediate level (or word-level). Composer leads to pronounced improvement in matching accuracy when the evaluation data has significant compound divergence from the training data.

* Findings of EMNLP 2021 

  Access Paper or Ask Questions

Evaluation of a Region Proposal Architecture for Multi-task Document Layout Analysis

Jun 22, 2021
Lorenzo Quirós, Enrique Vidal

Automatically recognizing the layout of handwritten documents is an important step towards useful extraction of information from those documents. The most common application is to feed downstream applications such as automatic text recognition and keyword spotting; however, the recognition of the layout also helps to establish relationships between elements in the document which allows to enrich the information that can be extracted. Most of the modern document layout analysis systems are designed to address only one part of the document layout problem, namely: baseline detection or region segmentation. In contrast, we evaluate the effectiveness of the Mask-RCNN architecture to address the problem of baseline detection and region segmentation in an integrated manner. We present experimental results on two handwritten text datasets and one handwritten music dataset. The analyzed architecture yields promising results, outperforming state-of-the-art techniques in all three datasets.

  Access Paper or Ask Questions