Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

An Approach to the Analysis of the South Slavic Medieval Labels Using Image Texture

Sep 07, 2015
Darko Brodic, Alessia Amelio, Zoran N. Milivojevic

The paper presents a new script classification method for the discrimination of the South Slavic medieval labels. It consists in the textural analysis of the script types. In the first step, each letter is coded by the equivalent script type, which is defined by its typographical features. Obtained coded text is subjected to the run-length statistical analysis and to the adjacent local binary pattern analysis in order to extract the features. The result shows a diversity between the extracted features of the scripts, which makes the feature classification more effective. It is the basis for the classification process of the script identification by using an extension of a state-of-the-art approach for document clustering. The proposed method is evaluated on an example of hand-engraved in stone and hand-printed in paper labels in old Cyrillic, angular and round Glagolitic. Experiments demonstrate very positive results, which prove the effectiveness of the proposed method.

* 15 pages, 9 figures, 3rd Workshop on Recognition and Action for Scene Understanding (REACTS 2015) 

  Access Paper or Ask Questions

Continuous Time Dynamic Topic Models

May 16, 2015
Chong Wang, David Blei, David Heckerman

In this paper, we develop the continuous time dynamic topic model (cDTM). The cDTM is a dynamic topic model that uses Brownian motion to model the latent topics through a sequential collection of documents, where a "topic" is a pattern of word use that we expect to evolve over the course of the collection. We derive an efficient variational approximate inference algorithm that takes advantage of the sparsity of observations in text, a property that lets us easily handle many time points. In contrast to the cDTM, the original discrete-time dynamic topic model (dDTM) requires that time be discretized. Moreover, the complexity of variational inference for the dDTM grows quickly as time granularity increases, a drawback which limits fine-grained discretization. We demonstrate the cDTM on two news corpora, reporting both predictive perplexity and the novel task of time stamp prediction.

* Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI2008) 

  Access Paper or Ask Questions

Streaming Variational Inference for Bayesian Nonparametric Mixture Models

Apr 21, 2015
Alex Tank, Nicholas J. Foti, Emily B. Fox

In theory, Bayesian nonparametric (BNP) models are well suited to streaming data scenarios due to their ability to adapt model complexity with the observed data. Unfortunately, such benefits have not been fully realized in practice; existing inference algorithms are either not applicable to streaming applications or not extensible to BNP models. For the special case of Dirichlet processes, streaming inference has been considered. However, there is growing interest in more flexible BNP models building on the class of normalized random measures (NRMs). We work within this general framework and present a streaming variational inference algorithm for NRM mixture models. Our algorithm is based on assumed density filtering (ADF), leading straightforwardly to expectation propagation (EP) for large-scale batch inference as well. We demonstrate the efficacy of the algorithm on clustering documents in large, streaming text corpora.

  Access Paper or Ask Questions

Assamese-English Bilingual Machine Translation

Jul 08, 2014
Kalyanee Kanchan Baruah, Pranjal Das, Abdul Hannan, Shikhar Kr. Sarma

Machine translation is the process of translating text from one language to another. In this paper, Statistical Machine Translation is done on Assamese and English language by taking their respective parallel corpus. A statistical phrase based translation toolkit Moses is used here. To develop the language model and to align the words we used two another tools IRSTLM, GIZA respectively. BLEU score is used to check our translation system performance, how good it is. A difference in BLEU scores is obtained while translating sentences from Assamese to English and vice-versa. Since Indian languages are morphologically very rich hence translation is relatively harder from English to Assamese resulting in a low BLEU score. A statistical transliteration system is also introduced with our translation system to deal basically with proper nouns, OOV (out of vocabulary) words which are not present in our corpus.

* International Journal on Natural Language Computing (IJNLC) Vol. 3, No.3, June 2014 
* In the proceedings of International Conference of Natural Language Processing and Cognitive Computing (ICONACC)-2014, pp. 227-231 

  Access Paper or Ask Questions

Controlling Complexity in Part-of-Speech Induction

Jan 16, 2014
João V. Graça, Kuzman Ganchev, Luisa Coheur, Fernando Pereira, Ben Taskar

We consider the problem of fully unsupervised learning of grammatical (part-of-speech) categories from unlabeled text. The standard maximum-likelihood hidden Markov model for this task performs poorly, because of its weak inductive bias and large model capacity. We address this problem by refining the model and modifying the learning objective to control its capacity via para- metric and non-parametric constraints. Our approach enforces word-category association sparsity, adds morphological and orthographic features, and eliminates hard-to-estimate parameters for rare words. We develop an efficient learning algorithm that is not much more computationally intensive than standard training. We also provide an open-source implementation of the algorithm. Our experiments on five diverse languages (Bulgarian, Danish, English, Portuguese, Spanish) achieve significant improvements compared with previous methods for the same task.

* Journal Of Artificial Intelligence Research, Volume 41, pages 527-551, 2011 

  Access Paper or Ask Questions

Diachronic Variation in Grammatical Relations

Dec 13, 2012
Aaron Gerow, Khurshid Ahmad

We present a method of finding and analyzing shifts in grammatical relations found in diachronic corpora. Inspired by the econometric technique of measuring return and volatility instead of relative frequencies, we propose them as a way to better characterize changes in grammatical patterns like nominalization, modification and comparison. To exemplify the use of these techniques, we examine a corpus of NIPS papers and report trends which manifest at the token, part-of-speech and grammatical levels. Building up from frequency observations to a second-order analysis, we show that shifts in frequencies overlook deeper trends in language, even when part-of-speech information is included. Examining token, POS and grammatical levels of variation enables a summary view of diachronic text as a whole. We conclude with a discussion about how these methods can inform intuitions about specialist domains as well as changes in language use as a whole.

* Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), Mumbai, India 

  Access Paper or Ask Questions

Factorized Multi-Modal Topic Model

Oct 16, 2012
Seppo Virtanen, Yangqing Jia, Arto Klami, Trevor Darrell

Multi-modal data collections, such as corpora of paired images and text snippets, require analysis methods beyond single-view component and topic models. For continuous observations the current dominant approach is based on extensions of canonical correlation analysis, factorizing the variation into components shared by the different modalities and those private to each of them. For count data, multiple variants of topic models attempting to tie the modalities together have been presented. All of these, however, lack the ability to learn components private to one modality, and consequently will try to force dependencies even between minimally correlating modalities. In this work we combine the two approaches by presenting a novel HDP-based topic model that automatically learns both shared and private topics. The model is shown to be especially useful for querying the contents of one domain given samples of the other.

* Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI2012) 

  Access Paper or Ask Questions

Lifted Graphical Models: A Survey

Aug 26, 2011
Lilyana Mihalkova, Lise Getoor

This article presents a survey of work on lifted graphical models. We review a general form for a lifted graphical model, a par-factor graph, and show how a number of existing statistical relational representations map to this formalism. We discuss inference algorithms, including lifted inference algorithms, that efficiently compute the answers to probabilistic queries. We also review work in learning lifted graphical models from data. It is our belief that the need for statistical relational models (whether it goes by that name or another) will grow in the coming decades, as we are inundated with data which is a mix of structured and unstructured, with entities and relations extracted in a noisy manner from text, and with the need to reason effectively with this data. We hope that this synthesis of ideas from many different research groups will provide an accessible starting point for new researchers in this expanding field.

  Access Paper or Ask Questions

Close Clustering Based Automated Color Image Annotation

Aug 02, 2010
Ankit Garg, Rahul Dwivedi, Krishna Asawa

Most image-search approaches today are based on the text based tags associated with the images which are mostly human generated and are subject to various kinds of errors. The results of a query to the image database thus can often be misleading and may not satisfy the requirements of the user. In this work we propose our approach to automate this tagging process of images, where image results generated can be fine filtered based on a probabilistic tagging mechanism. We implement a tool which helps to automate the tagging process by maintaining a training database, wherein the system is trained to identify certain set of input images, the results generated from which are used to create a probabilistic tagging mechanism. Given a certain set of segments in an image it calculates the probability of presence of particular keywords. This probability table is further used to generate the candidate tags for input images.

  Access Paper or Ask Questions

New Methods, Current Trends and Software Infrastructure for NLP

Jul 23, 1996
Hamish Cunningham, Yorick Wilks, Robert J. Gaizauskas

The increasing use of `new methods' in NLP, which the NeMLaP conference series exemplifies, occurs in the context of a wider shift in the nature and concerns of the discipline. This paper begins with a short review of this context and significant trends in the field. The review motivates and leads to a set of requirements for support software of general utility for NLP research and development workers. A freely-available system designed to meet these requirements is described (called GATE - a General Architecture for Text Engineering). Information Extraction (IE), in the sense defined by the Message Understanding Conferences (ARPA \cite{Arp95}), is an NLP application in which many of the new methods have found a home (Hobbs \cite{Hob93}; Jacobs ed. \cite{Jac92}). An IE system based on GATE is also available for research purposes, and this is described. Lastly we review related work.

* Proceedings of NEMLAP-2 
* 12 pages, LaTeX, uses nemlap.sty (included) 

  Access Paper or Ask Questions