"Text": models, code, and papers

Shamela: A Large-Scale Historical Arabic Corpus

Dec 28, 2016
Yonatan Belinkov, Alexander Magidow, Maxim Romanov, Avi Shmidman, Moshe Koppel

Arabic is a widely-spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale, historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case-studies in which we show its application to the digital humanities.

* Slightly expanded version of Coling LT4DH workshop paper 


Bidirectional Recurrent Neural Networks for Medical Event Detection in Electronic Health Records

Jul 12, 2016
Abhyuday Jagannatha, Hong Yu

Sequence labeling for extraction of medical events and their attributes from unstructured text in Electronic Health Record (EHR) notes is a key step towards semantic understanding of EHRs. It has important applications in health informatics including pharmacovigilance and drug surveillance. The state-of-the-art supervised machine learning models in this domain are based on Conditional Random Fields (CRFs) with features calculated from fixed context windows. In this application, we explore various recurrent neural network frameworks and show that they significantly outperform the CRF models.

* In proceedings of NAACL HLT 2016 


A System for Probabilistic Linking of Thesauri and Classification Systems

Mar 21, 2016
Lisa Posch, Philipp Schaer, Arnim Bleier, Markus Strohmaier

This paper presents a system which creates and visualizes probabilistic semantic links between concepts in a thesaurus and classes in a classification system. For creating the links, we build on the Polylingual Labeled Topic Model (PLL-TM). PLL-TM identifies probable thesaurus descriptors for each class in the classification system by using information from the natural language text of documents, their assigned thesaurus descriptors and their designated classes. The links are then presented to users of the system in an interactive visualization, providing them with an automatically generated overview of the relations between the thesaurus and the classification system.

* KI - Künstliche Intelligenz, 2015 


Automatic Extraction of Protein Interaction in Literature

Aug 22, 2014
Peilei Liu, Ting Wang

Protein-protein interaction extraction is a key precondition for constructing a protein knowledge network, and it is very important for research in biomedicine. This paper extracts directional protein-protein interactions from biological text using an SVM-based method. Experiments were evaluated on the LLL05 corpus with good results. The results show that dependency features are important for protein-protein interaction extraction and that features related to the interaction word are effective for judging the interaction direction. Finally, we analyze the effects of different features and plan the next step.

* This paper has been withdrawn by the author due to its lack of academic value 


Learning Probabilistic Programs

Jul 09, 2014
Yura N. Perov, Frank D. Wood

We develop a technique for generalising from data in which models are samplers represented as program text. We establish encouraging empirical results that suggest that Markov chain Monte Carlo probabilistic programming inference techniques coupled with higher-order probabilistic programming languages are now sufficiently powerful to enable successful inference of this kind in nontrivial domains. We also introduce a new notion of probabilistic program compilation and show how the same machinery might be used in the future to compile probabilistic programs for efficient reusable predictive inference.


Automated Attribution and Intertextual Analysis

May 03, 2014
James Brofos, Ajay Kannan, Rui Shu

In this work, we employ quantitative methods from the realm of statistics and machine learning to develop novel methodologies for author attribution and textual analysis. In particular, we develop techniques and software suitable for applications to Classical study, and we illustrate the efficacy of our approach in several interesting open questions in the field. We apply our numerical analysis techniques to questions of authorship attribution in the case of the Greek tragedian Euripides, to instances of intertextuality and influence in the poetry of the Roman statesman Seneca the Younger, and to cases of "interpolated" text with respect to the histories of Livy.

* 10 pages, 4 tables, 4 figures 


Random Sentences from a Generalized Phrase-Structure Grammar Interpreter

Feb 14, 2007
Rick Dale

In numerous domains in cognitive science it is often useful to have a source for randomly generated corpora. These corpora may serve as a foundation for artificial stimuli in a learning experiment (e.g., Ellefson & Christiansen, 2000), or as input into computational models (e.g., Christiansen & Dale, 2001). The following compact and general C program interprets a phrase-structure grammar specified in a text file. It follows parameters set at a Unix or Unix-based command-line and generates a corpus of random sentences from that grammar.

* Brief paper with source code and examples 
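
The idea described in the abstract, expanding a phrase-structure grammar at random to produce a corpus, can be sketched in a few lines. This is a minimal Python illustration, not the paper's C program: the `LHS -> alt | alt` grammar format, the toy grammar, and the names `parse_grammar` and `generate` are all assumptions made for this example.

```python
import random

# Hypothetical grammar format (an assumption, not the paper's file format):
# one rule per line, alternatives separated by "|".
GRAMMAR_TEXT = """
S -> NP VP
NP -> Det N | Det Adj N
VP -> V NP | V
Det -> the | a
Adj -> small | red
N -> dog | ball
V -> sees | sleeps
"""

def parse_grammar(text):
    """Map each left-hand symbol to a list of alternative expansions."""
    rules = {}
    for line in text.strip().splitlines():
        lhs, rhs = line.split("->")
        rules[lhs.strip()] = [alt.split() for alt in rhs.split("|")]
    return rules

def generate(rules, symbol, rng):
    """Expand `symbol` recursively, picking one alternative uniformly at random."""
    if symbol not in rules:
        return [symbol]  # no rule for it, so treat it as a terminal word
    words = []
    for part in rng.choice(rules[symbol]):
        words.extend(generate(rules, part, rng))
    return words

rules = parse_grammar(GRAMMAR_TEXT)
rng = random.Random(42)  # fixed seed so the corpus is reproducible
corpus = [" ".join(generate(rules, "S", rng)) for _ in range(5)]
for sentence in corpus:
    print(sentence)
```

The toy grammar is deliberately non-recursive, so every expansion terminates; a recursive grammar would need a depth limit or weighted alternatives.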


"In vivo" spam filtering: A challenge problem for data mining

May 04, 2004
Tom Fawcett

Spam, also known as Unsolicited Commercial Email (UCE), is the bane of email communication. Many data mining researchers have addressed the problem of detecting spam, generally by treating it as a static text classification problem. True in vivo spam filtering has characteristics that make it a rich and challenging domain for data mining. Indeed, real-world datasets with these characteristics are typically difficult to acquire and to share. This paper demonstrates some of these characteristics and argues that researchers should pursue in vivo spam filtering as an accessible domain for investigating them.

* KDD Explorations vol.5 no.2, Dec 2003. pp.140-148 


Entropy estimation of symbol sequences

Mar 21, 2002
Thomas Schürmann, Peter Grassberger

We discuss algorithms for estimating the Shannon entropy h of finite symbol sequences with long range correlations. In particular, we consider algorithms which estimate h from the code lengths produced by some compression algorithm. Our interest is in describing their convergence with sequence length, assuming no limits for the space and time complexities of the compression algorithms. A scaling law is proposed for extrapolation from finite sample lengths. This is applied to sequences of dynamical systems in non-trivial chaotic regimes, a 1-D cellular automaton, and to written English texts.

* CHAOS Vol. 6, No. 3 (1996) 414-427 
* 14 pages, 13 figures, 2 tables 
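
As a rough illustration of estimating entropy from compression code lengths, the sketch below divides the compressed size in bits by the number of symbols. It assumes one symbol per byte and uses `zlib` as a stand-in compressor; it is not one of the algorithms studied in the paper, it does not reproduce the proposed scaling law, and the helper name `compression_entropy_estimate` is hypothetical.

```python
import random
import zlib

def compression_entropy_estimate(symbols: bytes, level: int = 9) -> float:
    """Estimate entropy in bits per symbol as compressed bits / symbol count.

    Assumes one symbol per byte. Because the compressor is lossless and adds
    some overhead, the value tends to overestimate the true entropy,
    especially for short sequences.
    """
    if not symbols:
        raise ValueError("need at least one symbol")
    compressed = zlib.compress(symbols, level)
    return 8 * len(compressed) / len(symbols)

rng = random.Random(0)
n = 100_000
# Pseudo-random binary sequence stored as the bytes "0" and "1":
# the true entropy is 1 bit per symbol.
fair = bytes(rng.choice(b"01") for _ in range(n))
# Constant sequence: the true entropy is 0 bits per symbol.
constant = b"0" * n

h_fair = compression_entropy_estimate(fair)
h_const = compression_entropy_estimate(constant)
print(f"fair coin: {h_fair:.3f} bits/symbol")
print(f"constant:  {h_const:.3f} bits/symbol")
```

Repeating the estimate for increasing `n` gives a feel for the convergence question the abstract raises: the estimate approaches the true value only slowly as the sequence grows.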


Long-range fractal correlations in literary corpora

Jan 09, 2002
Marcelo A. Montemurro, Pedro A. Pury

In this paper we analyse the fractal structure of long human-language records by mapping large samples of texts onto time series. The particular mapping set up in this work is linguistically motivated in the sense that it retains the word as the fundamental unit of communication. The results confirm that beyond the short-range correlations resulting from syntactic rules acting at sentence level, long-range structures emerge in large written language samples that give rise to long-range correlations in the use of words.

* Fractals 10(4), 451-461 (2002) 
* to appear in Fractals 
