
"Text": models, code, and papers

Dynamic and Static Topic Model for Analyzing Time-Series Document Collections

May 06, 2018
Rem Hida, Naoya Takeishi, Takehisa Yairi, Koichi Hori

To extract meaningful topics from texts, their structure should be considered properly. In this paper, we aim to analyze structured time-series documents, such as a collection of news articles or a series of scientific papers, in which topics evolve over time depending on multiple topics in the past and are also related to each other at each time step. To this end, we propose a dynamic and static topic model, which simultaneously considers the dynamic structure of temporal topic evolution and the static structure of the topic hierarchy at each time step. We show the results of experiments on collections of scientific papers, in which the proposed method outperformed conventional models. Moreover, we show an example of extracted topic structures, which we found helpful for analyzing research activities.

* 6 pages, 2 figures, Accepted as ACL 2018 short paper 

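The dynamic dependency the abstract describes — each topic at time t depending on multiple topics at time t-1 — can be sketched as follows. This is a toy illustration of that structure, not the authors' actual model; the topic distributions and mixing weights are made-up numbers.

```python
# Sketch: topic word-distributions evolving over time, where each topic at
# time t is a mixture of ALL topics at time t-1 (the "dynamic" dependency).
# All values below are illustrative, not learned parameters.

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def evolve(prev_topics, weights):
    """prev_topics: K topic word-distributions at time t-1.
    weights[k]: mixing weights over the K previous topics for new topic k."""
    K = len(prev_topics)
    V = len(prev_topics[0])
    new_topics = []
    for k in range(K):
        mixed = [sum(weights[k][j] * prev_topics[j][v] for j in range(K))
                 for v in range(V)]
        new_topics.append(normalize(mixed))
    return new_topics

topics_t0 = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
# Topic 0 mostly persists; topic 1 drifts halfway toward topic 0.
w = [[0.9, 0.1], [0.5, 0.5]]
topics_t1 = evolve(topics_t0, w)
```

In the actual model these dependencies are inferred jointly with the static topic hierarchy at each time step, rather than fixed by hand.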

Conditional Generative Adversarial Networks for Emoji Synthesis with Word Embedding Manipulation

Dec 14, 2017
Dianna Radpour, Vivek Bheda

Emojis have become a very popular part of daily digital communication. Their appeal comes in large part from their ability to capture and elicit emotions in a more subtle and nuanced way than plain text alone. In line with recent advances in deep learning, generative adversarial networks (GANs) have far-reaching implications and applications for image generation. In this paper, we present a novel application of deep convolutional GANs (DC-GANs) with an optimized training procedure. We show that by incorporating word embeddings from Google's word2vec model as conditioning input to the network, the generator is able to synthesize highly realistic emojis that are virtually identical to the real ones.

* 5 pages, 3 figures, 2 graphs 

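The conditioning step described above can be sketched as follows: the generator receives a noise vector concatenated with a pretrained word embedding for the target label. The embedding values and labels below are toy stand-ins; in the paper the embeddings come from Google's word2vec model.

```python
import random

# Toy 3-dimensional stand-ins for pretrained word2vec embeddings.
EMBEDDINGS = {"joy": [0.2, -0.1, 0.5], "sad": [-0.3, 0.4, 0.1]}

def generator_input(label, noise_dim=4, rng=random.Random(0)):
    """Build the conditioned latent vector: noise ++ word embedding."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + EMBEDDINGS[label]

z = generator_input("joy")  # 4 noise dims + 3 embedding dims
```

In a full DC-GAN, `z` would feed the first transposed-convolution layer of the generator, so that samples drawn for the same label share the embedding portion of the input.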

Doubly-Attentive Decoder for Multi-modal Neural Machine Translation

Feb 04, 2017
Iacer Calixto, Qun Liu, Nick Campbell

We introduce a Multi-modal Neural Machine Translation model in which a doubly-attentive decoder naturally incorporates spatial visual features obtained using pre-trained convolutional neural networks, bridging the gap between image description and translation. Our decoder learns to attend to source-language words and parts of an image independently by means of two separate attention mechanisms as it generates words in the target language. We find that our model can efficiently exploit not just back-translated in-domain multi-modal data but also large general-domain text-only MT corpora. We also report state-of-the-art results on the Multi30k data set.

* 8 pages (11 including references), 2 figures 

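The "doubly-attentive" step can be sketched as follows: at each decoding step, one attention is computed over source-word states and another, independent attention over spatial image features, and the two context vectors are combined. The vectors below are toy values, not real encoder outputs, and the dot-product scoring is a simplification of the paper's attention mechanisms.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def attend(query, keys):
    """Dot-product attention: weight each key by similarity to the query."""
    scores = softmax([sum(q * k for q, k in zip(query, key)) for key in keys])
    dim = len(keys[0])
    return [sum(a * key[d] for a, key in zip(scores, keys)) for d in range(dim)]

src_states = [[1.0, 0.0], [0.0, 1.0]]   # source-word annotations
img_feats = [[0.5, 0.5], [1.0, -1.0]]   # spatial visual features
query = [1.0, 0.0]                       # decoder hidden state

# Two independent attentions; their contexts are concatenated.
context = attend(query, src_states) + attend(query, img_feats)
```

Because the two mechanisms have separate weights, the decoder can attend to a source word and an image region independently while generating each target word.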

Robust Named Entity Recognition in Idiosyncratic Domains

Aug 24, 2016
Sebastian Arnold, Felix A. Gers, Torsten Kilias, Alexander Löser

Named entity recognition often fails in idiosyncratic domains, which causes problems for dependent tasks such as entity linking and relation extraction. We propose a generic and robust approach for high-recall named entity recognition. Our approach is easy to train and generalizes well over diverse domain-specific language, such as news documents (e.g. Reuters) or biomedical text (e.g. Medline). It is based on deep contextual sequence learning and utilizes stacked bidirectional LSTM networks. Our model is trained with only a few hundred labeled sentences and does not rely on further external knowledge. We report F1 scores in the range of 84-94% on standard datasets.

* 8 pages, 1 figure 

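The stacked bidirectional structure can be sketched as follows. Each layer runs one pass left-to-right and one right-to-left over the sequence and concatenates the two hidden states per token; stacking feeds one layer's output into the next. A trivial averaging recurrence stands in for the LSTM cell to keep the sketch dependency-free.

```python
def run_direction(inputs, reverse=False):
    """One recurrent pass; a toy recurrence replaces the LSTM cell."""
    seq = list(reversed(inputs)) if reverse else inputs
    h, out = 0.0, []
    for x in seq:
        h = 0.5 * h + 0.5 * x
        out.append(h)
    return list(reversed(out)) if reverse else out

def bidir_layer(inputs):
    fwd = run_direction(inputs)
    bwd = run_direction(inputs, reverse=True)
    return [[f, b] for f, b in zip(fwd, bwd)]  # concat per token

tokens = [1.0, 0.0, 1.0]   # toy scalar "embeddings"
layer1 = bidir_layer(tokens)
# Stacking: feed a summary of layer-1 states into a second bidirectional layer.
layer2 = bidir_layer([f + b for f, b in layer1])
```

The per-token output of the top layer would feed a classifier over entity tags; because each state sees both left and right context, the tagger can use the whole sentence when labeling a token.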

x.ent: R Package for Entities and Relations Extraction based on Unsupervised Learning and Document Structure

Apr 23, 2015
Nicolas Turenne, Tien Phan

Relation extraction with high precision is still a challenge when processing full-text databases. We propose an approach based on co-occurrence analysis within each document, using the document's organization to improve the accuracy of relation extraction. This approach is implemented in an R package called \emph{x.ent}. Another facet of the extraction is that the extracted relations feed a querying system for expert end-users. Two datasets have been used, one of which is of interest to specialists in plant-health epidemiology; for this dataset, the system is dedicated to exploring plant-disease relations in agricultural news. An open-data platform exploits exports from \emph{x.ent} and is publicly available.

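The document-level co-occurrence idea can be sketched as follows: two entities that appear in the same document unit (here, the same sentence) become a candidate relation. The entity dictionaries and text below are toy examples, not the package's actual resources, and the sketch is in Python rather than R.

```python
# Toy entity dictionaries (the real system uses curated domain lists).
PLANTS = {"wheat", "vine"}
DISEASES = {"mildew", "rust"}

def cooccurrence_pairs(sentences):
    """Emit (plant, disease) pairs that co-occur within one sentence."""
    pairs = set()
    for sent in sentences:
        words = set(sent.lower().split())
        for p in PLANTS & words:
            for d in DISEASES & words:
                pairs.add((p, d))
    return pairs

doc = ["Mildew was reported on vine in June", "Wheat prices rose sharply"]
relations = cooccurrence_pairs(doc)
```

Exploiting document structure means restricting the co-occurrence window to meaningful units (sentence, paragraph, section) instead of the whole document, which trades recall for precision.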

Real-World Font Recognition Using Deep Network and Domain Adaptation

Mar 31, 2015
Zhangyang Wang, Jianchao Yang, Hailin Jin, Eli Shechtman, Aseem Agarwala, Jonathan Brandt, Thomas S. Huang

We address a challenging fine-grained classification problem: recognizing a font style from an image of text. In this task, it is very easy to generate large numbers of rendered font examples but very hard to obtain real-world labeled images. This synthetic-to-real domain gap caused poor generalization to new real data in previous methods (Chen et al., 2014). In this paper, we use Convolutional Neural Networks with a domain adaptation technique based on a Stacked Convolutional Auto-Encoder that exploits unlabeled real-world images combined with synthetic data. The proposed method achieves a top-5 accuracy higher than 80% on a real-world dataset.

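The stacked auto-encoder pretraining idea — each layer learns to reconstruct its input from unlabeled data, then its codes feed the next layer — can be sketched as follows. A single shared scalar weight trained by gradient descent stands in for a convolutional layer; the data values are toy numbers, not images.

```python
def train_layer(data, lr=0.1, steps=200):
    """Train decode(encode(x)) = w*(w*x) to reconstruct x; returns w."""
    w = 0.5
    for _ in range(steps):
        for x in data:
            recon = w * (w * x)
            grad = 2 * (recon - x) * 2 * w * x  # d/dw of (recon - x)^2
            w -= lr * grad
    return w

unlabeled = [0.5, -0.3, 0.8]          # stand-in for unlabeled real images
w1 = train_layer(unlabeled)            # first layer, trained unsupervised
encoded = [w1 * x for x in unlabeled]  # layer-1 codes
w2 = train_layer(encoded)              # second stacked layer on the codes
```

Perfect reconstruction here drives w toward 1; in the real method, features pretrained this way on unlabeled real images help bridge the gap from synthetic training data.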

Fast Label Embeddings for Extremely Large Output Spaces

Mar 30, 2015
Paul Mineiro, Nikos Karampatziakis

Many modern multiclass and multilabel problems are characterized by increasingly large output spaces. For these problems, label embeddings have been shown to be a useful primitive that can improve computational and statistical efficiency. In this work we utilize a correspondence between rank constrained estimation and low dimensional label embeddings that uncovers a fast label embedding algorithm which works in both the multiclass and multilabel settings. The result is a randomized algorithm for partial least squares, whose running time is exponentially faster than naive algorithms. We demonstrate our techniques on two large-scale public datasets, from the Large Scale Hierarchical Text Challenge and the Open Directory Project, where we obtain state of the art results.

* Accepted as a workshop contribution at ICLR 2015 

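The label-embedding primitive the abstract builds on can be sketched as follows: assign each label a low-dimensional embedding, work in that small space, and decode predictions by nearest embedding. The random-Gaussian choice of embeddings here is purely illustrative; it is not the paper's randomized partial-least-squares construction.

```python
import random

rng = random.Random(42)
NUM_LABELS, DIM = 1000, 8   # 1000 labels squeezed into 8 dimensions
embed = {y: [rng.gauss(0.0, 1.0) for _ in range(DIM)]
         for y in range(NUM_LABELS)}

def nearest_label(vec):
    """Decode a point in embedding space to the closest label."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(embed, key=lambda y: dist2(embed[y], vec))

pred = nearest_label(embed[123])   # a clean embedding decodes to its label
```

At test time a regressor predicts a point in the 8-dimensional space instead of scoring all 1000 labels, which is where the computational savings for extremely large output spaces come from.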

Rapid Adaptation of POS Tagging for Domain Specific Uses

Oct 31, 2014
John E. Miller, Michael Bloodgood, Manabu Torii, K. Vijay-Shanker

Part-of-speech (POS) tagging is a fundamental component for performing natural language tasks such as parsing, information extraction, and question answering. When POS taggers are trained in one domain and applied in significantly different domains, their performance can degrade dramatically. We present a methodology for rapid adaptation of POS taggers to new domains. Our technique is unsupervised in that a manually annotated corpus for the new domain is not necessary. We use suffix information gathered from large amounts of raw text as well as orthographic information to increase the lexical coverage. We present an experiment in the biological domain where our POS tagger achieves results comparable to POS taggers specifically trained for this domain.

* 2 pages, 2 tables; in Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology, pages 118-119, New York, New York, June 2006. Association for Computational Linguistics 

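The suffix idea can be sketched as follows: collect counts of (suffix, tag) from text, then assign an unknown word the most frequent tag for its longest known suffix. The word/tag pairs below are toy stand-ins for statistics gathered from large raw corpora, and the paper additionally uses orthographic features not shown here.

```python
from collections import Counter, defaultdict

# Toy (word, tag) observations standing in for large-corpus statistics.
suffix_tags = defaultdict(Counter)
for word, tag in [("activation", "NN"), ("regulation", "NN"),
                  ("phosphorylated", "VBN"), ("activated", "VBN")]:
    for n in (2, 3, 4):
        suffix_tags[word[-n:]][tag] += 1

def guess_tag(word, default="NN"):
    """Back off from the longest suffix to shorter ones, then a default."""
    for n in (4, 3, 2):
        suf = word[-n:]
        if suf in suffix_tags:
            return suffix_tags[suf].most_common(1)[0][0]
    return default

tag = guess_tag("methylated")   # unseen word, known suffix "ated"
```

This is how the adapted tagger extends lexical coverage to domain terms it never saw labeled: productive suffixes like "-ated" or "-tion" carry most of the tag signal.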

FASTSUBS: An Efficient and Exact Procedure for Finding the Most Likely Lexical Substitutes Based on an N-gram Language Model

Sep 01, 2012
Deniz Yuret

Lexical substitutes have found use in areas such as paraphrasing, text simplification, machine translation, word sense disambiguation, and part-of-speech induction. However, the computational complexity of accurately identifying the most likely substitutes for a word has made large-scale experiments difficult. In this paper I introduce a new search algorithm, FASTSUBS, that is guaranteed to find the K most likely lexical substitutes for a given word in a sentence based on an n-gram language model. The computation is sub-linear in both K and the vocabulary size V. An implementation of the algorithm and a dataset with the top 100 substitutes of each token in the WSJ section of the Penn Treebank are available at

* 4 pages, 1 figure, to appear in IEEE Signal Processing Letters 

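The naive baseline that FASTSUBS improves on can be sketched as follows: score every vocabulary word by its n-gram probability in the target context and keep the top K. FASTSUBS itself avoids this full scan over V; the bigram counts and crude smoothing below are illustrative only.

```python
import heapq
import math

# Toy bigram counts standing in for a real n-gram language model.
BIGRAMS = {("the", "cat"): 4, ("the", "dog"): 3, ("the", "car"): 2,
           ("a", "cat"): 1}
VOCAB = {"cat", "dog", "car"}

def score(prev_word, w):
    """Log-probability of w after prev_word, with crude smoothing."""
    total = sum(c for (p, _), c in BIGRAMS.items() if p == prev_word)
    return math.log(BIGRAMS.get((prev_word, w), 0.5) / total)

def top_k_substitutes(prev_word, k):
    # The naive O(V log K) scan; FASTSUBS reaches the same top-K
    # without touching every vocabulary word.
    return heapq.nlargest(k, VOCAB, key=lambda w: score(prev_word, w))

subs = top_k_substitutes("the", 2)
```

With a realistic vocabulary of hundreds of thousands of words and a higher-order model, this scan per token is what makes large-scale substitute extraction expensive.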

Fence - An Efficient Parser with Ambiguity Support for Model-Driven Language Specification

Oct 07, 2011
Luis Quesada, Fernando Berzal, Francisco J. Cortijo

Model-based language specification has applications in the implementation of language processors, the design of domain-specific languages, model-driven software development, data integration, text mining, natural language processing, and corpus-based induction of models. It decouples language design from language processing and, unlike traditional grammar-driven approaches, which constrain language designers to specific kinds of grammars, it requires general parser generators able to deal with ambiguities. In this paper, we propose Fence, an efficient bottom-up parsing algorithm with lexical and syntactic ambiguity support that enables the use of model-based language specification in practice.

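What "ambiguity support" means in practice can be illustrated with a tiny CYK-style chart parser that keeps every derivation of a span instead of committing to one. Fence itself is a different, more efficient bottom-up algorithm; the one-rule grammar below is a toy chosen only because it is ambiguous.

```python
# Toy ambiguous grammar in Chomsky normal form: S -> S S, plus "a" -> S.
GRAMMAR = {"S": [("S", "S")]}
LEXICON = {"a": ["S"]}

def count_parses(tokens):
    """CYK chart parse; each cell maps a category to its derivation count."""
    n = len(tokens)
    chart = {}
    for i, tok in enumerate(tokens):
        chart[(i, i + 1)] = {cat: 1 for cat in LEXICON[tok]}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):           # every split point
                left, right = chart[(i, k)], chart[(k, j)]
                for lhs, rules in GRAMMAR.items():
                    for r1, r2 in rules:
                        if r1 in left and r2 in right:
                            cell[lhs] = cell.get(lhs, 0) + left[r1] * right[r2]
            chart[(i, j)] = cell
    return chart[(0, n)].get("S", 0)

ambiguous = count_parses(["a"] * 3)   # two bracketings: (a(aa)) and ((aa)a)
```

A grammar-driven parser restricted to unambiguous grammar classes would reject this grammar outright; a model-based toolchain instead needs the parser to return all readings so a later stage can choose among them.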