Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

Mining Spatio-temporal Data on Industrialization from Historical Registries

Dec 03, 2016
David Berenbaum, Dwyer Deighan, Thomas Marlow, Ashley Lee, Scott Frickel, Mark Howison

Despite the growing availability of big data in many fields, historical data on socioevironmental phenomena are often not available due to a lack of automated and scalable approaches for collecting, digitizing, and assembling them. We have developed a data-mining method for extracting tabulated, geocoded data from printed directories. While scanning and optical character recognition (OCR) can digitize printed text, these methods alone do not capture the structure of the underlying data. Our pipeline integrates both page layout analysis and OCR to extract tabular, geocoded data from structured text. We demonstrate the utility of this method by applying it to scanned manufacturing registries from Rhode Island that record 41 years of industrial land use. The resulting spatio-temporal data can be used for socioenvironmental analyses of industrialization at a resolution that was not previously possible. In particular, we find strong evidence for the dispersion of manufacturing from the urban core of Providence, the state's capital, along the Interstate 95 corridor to the north and south.

  Access Paper or Ask Questions

Features Fusion for Classification of Logos

Sep 06, 2016
N. Vinay Kumar, Pratheek, V. Vijaya Kantha, K. N. Govindaraju, D. S. Guru

In this paper, a logo classification system based on the appearance of logo images is proposed. The proposed classification system makes use of global characteristics of logo images for classification. Color, texture, and shape of a logo wholly describe the global characteristics of logo images. The various combinations of these characteristics are used for classification. The combination contains only with single feature or with fusion of two features or fusion of all three features considered at a time respectively. Further, the system categorizes the logo image into: a logo image with fully text or with fully symbols or containing both symbols and texts.. The K-Nearest Neighbour (K-NN) classifier is used for classification. Due to the lack of color logo image dataset in the literature, the same is created consisting 5044 color logo images. Finally, the performance of the classification system is evaluated through accuracy, precision, recall and F-measure computed from the confusion matrix. The experimental results show that the most promising results are obtained for fusion of features.

* Procedia Computer Science, Volume 85, 2016, Pages 370-379 
* 10 pages, 5 figures, 9 tables 

  Access Paper or Ask Questions

Sequence-to-Sequence Neural Net Models for Grapheme-to-Phoneme Conversion

Aug 20, 2015
Kaisheng Yao, Geoffrey Zweig

Sequence-to-sequence translation methods based on generation with a side-conditioned language model have recently shown promising results in several tasks. In machine translation, models conditioned on source side words have been used to produce target-language text, and in image captioning, models conditioned images have been used to generate caption text. Past work with this approach has focused on large vocabulary tasks, and measured quality in terms of BLEU. In this paper, we explore the applicability of such models to the qualitatively different grapheme-to-phoneme task. Here, the input and output side vocabularies are small, plain n-gram models do well, and credit is only given when the output is exactly correct. We find that the simple side-conditioned generation approach is able to rival the state-of-the-art, and we are able to significantly advance the stat-of-the-art with bi-directional long short-term memory (LSTM) neural networks that use the same alignment information that is used in conventional approaches.

* Published in INTERSPEECH 2015, Dresden, Germany 

  Access Paper or Ask Questions

A complex network approach to stylometry

Aug 17, 2015
Diego R. Amancio

Statistical methods have been widely employed to study the fundamental properties of language. In recent years, methods from complex and dynamical systems proved useful to create several language models. Despite the large amount of studies devoted to represent texts with physical models, only a limited number of studies have shown how the properties of the underlying physical systems can be employed to improve the performance of natural language processing tasks. In this paper, I address this problem by devising complex networks methods that are able to improve the performance of current statistical methods. Using a fuzzy classification strategy, I show that the topological properties extracted from texts complement the traditional textual description. In several cases, the performance obtained with hybrid approaches outperformed the results obtained when only traditional or networked methods were used. Because the proposed model is generic, the framework devised here could be straightforwardly used to study similar textual applications where the topology plays a pivotal role in the description of the interacting agents.

* PLoS ONE 10(8): e0136076, 2015 
* PLoS ONE, 2015 (to appear) 

  Access Paper or Ask Questions

A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser-Ney Smoothing

Apr 13, 2014
Rene Pickhardt, Thomas Gottron, Martin Körner, Paul Georg Wagner, Till Speicher, Steffen Staab

We introduce a novel approach for building language models based on a systematic, recursive exploration of skip n-gram models which are interpolated using modified Kneser-Ney smoothing. Our approach generalizes language models as it contains the classical interpolation with lower order models as a special case. In this paper we motivate, formalize and present our approach. In an extensive empirical experiment over English text corpora we demonstrate that our generalized language models lead to a substantial reduction of perplexity between 3.1% and 12.7% in comparison to traditional language models using modified Kneser-Ney smoothing. Furthermore, we investigate the behaviour over three other languages and a domain specific corpus where we observed consistent improvements. Finally, we also show that the strength of our approach lies in its ability to cope in particular with sparse training data. Using a very small training data set of only 736 KB text we yield improvements of even 25.7% reduction of perplexity.

* 13 pages, 2 figures, ACL 2014 

  Access Paper or Ask Questions

Plagiarism Detection using ROUGE and WordNet

Mar 22, 2010
Chien-Ying Chen, Jen-Yuan Yeh, Hao-Ren Ke

With the arrival of digital era and Internet, the lack of information control provides an incentive for people to freely use any content available to them. Plagiarism occurs when users fail to credit the original owner for the content referred to, and such behavior leads to violation of intellectual property. Two main approaches to plagiarism detection are fingerprinting and term occurrence; however, one common weakness shared by both approaches, especially fingerprinting, is the incapability to detect modified text plagiarism. This study proposes adoption of ROUGE and WordNet to plagiarism detection. The former includes ngram co-occurrence statistics, skip-bigram, and longest common subsequence (LCS), while the latter acts as a thesaurus and provides semantic information. N-gram co-occurrence statistics can detect verbatim copy and certain sentence modification, skip-bigram and LCS are immune from text modification such as simple addition or deletion of words, and WordNet may handle the problem of word substitution.

* Journal of Computing, Volume 2, Issue 3, March 2010 

  Access Paper or Ask Questions

Language Access: An Information Based Approach

Aug 07, 2003
Akshar Bharati, Vineet Chaitanya, Amba P. Kulkarni, Rajeev Sangal

The anusaaraka system (a kind of machine translation system) makes text in one Indian language accessible through another Indian language. The machine presents an image of the source text in a language close to the target language. In the image, some constructions of the source language (which do not have equivalents in the target language) spill over to the output. Some special notation is also devised. Anusaarakas have been built from five pairs of languages: Telugu,Kannada, Marathi, Bengali and Punjabi to Hindi. They are available for use through Email servers. Anusaarkas follows the principle of substitutibility and reversibility of strings produced. This implies preservation of information while going from a source language to a target language. For narrow subject areas, specialized modules can be built by putting subject domain knowledge into the system, which produce good quality grammatical output. However, it should be remembered, that such modules will work only in narrow areas, and will sometimes go wrong. In such a situation, anusaaraka output will still remain useful.

* Published in the proceedings of Knowledge Based Computer Systems Conference, 2000, Tata McGraw Hill, New Delhi 2000 
* Published in the proceedings of Knowledge Based Computer Systems conference, 2000, Tata McGraw-Hill, New Delhi, Dec. 2000 

  Access Paper or Ask Questions

An Iterative Algorithm to Build Chinese Language Models

Jun 17, 1996
Xiaoqiang Luo, Salim Roukos

We present an iterative procedure to build a Chinese language model (LM). We segment Chinese text into words based on a word-based Chinese language model. However, the construction of a Chinese LM itself requires word boundaries. To get out of the chicken-and-egg problem, we propose an iterative procedure that alternates two operations: segmenting text into words and building an LM. Starting with an initial segmented corpus and an LM based upon it, we use a Viterbi-liek algorithm to segment another set of data. Then, we build an LM based on the second set and use the resulting LM to segment again the first corpus. The alternating procedure provides a self-organized way for the segmenter to detect automatically unseen words and correct segmentation errors. Our preliminary experiment shows that the alternating procedure not only improves the accuracy of our segmentation, but discovers unseen words surprisingly well. The resulting word-based LM has a perplexity of 188 for a general Chinese corpus.

  Access Paper or Ask Questions

Polyphone disambiguation and accent prediction using pre-trained language models in Japanese TTS front-end

Jan 24, 2022
Rem Hida, Masaki Hamada, Chie Kamada, Emiru Tsunoo, Toshiyuki Sekiya, Toshiyuki Kumakura

Although end-to-end text-to-speech (TTS) models can generate natural speech, challenges still remain when it comes to estimating sentence-level phonetic and prosodic information from raw text in Japanese TTS systems. In this paper, we propose a method for polyphone disambiguation (PD) and accent prediction (AP). The proposed method incorporates explicit features extracted from morphological analysis and implicit features extracted from pre-trained language models (PLMs). We use BERT and Flair embeddings as implicit features and examine how to combine them with explicit features. Our objective evaluation results showed that the proposed method improved the accuracy by 5.7 points in PD and 6.0 points in AP. Moreover, the perceptual listening test results confirmed that a TTS system employing our proposed model as a front-end achieved a mean opinion score close to that of synthesized speech with ground-truth pronunciation and accent in terms of naturalness.

* 5 pages, 2 figures. Accepted to ICASSP2022 

  Access Paper or Ask Questions

ViDA-MAN: Visual Dialog with Digital Humans

Oct 26, 2021
Tong Shen, Jiawei Zuo, Fan Shi, Jin Zhang, Liqin Jiang, Meng Chen, Zhengchen Zhang, Wei Zhang, Xiaodong He, Tao Mei

We demonstrate ViDA-MAN, a digital-human agent for multi-modal interaction, which offers realtime audio-visual responses to instant speech inquiries. Compared to traditional text or voice-based system, ViDA-MAN offers human-like interactions (e.g, vivid voice, natural facial expression and body gestures). Given a speech request, the demonstration is able to response with high quality videos in sub-second latency. To deliver immersive user experience, ViDA-MAN seamlessly integrates multi-modal techniques including Acoustic Speech Recognition (ASR), multi-turn dialog, Text To Speech (TTS), talking heads video generation. Backed with large knowledge base, ViDA-MAN is able to chat with users on a number of topics including chit-chat, weather, device control, News recommendations, booking hotels, as well as answering questions via structured knowledge.

  Access Paper or Ask Questions