"Text": models, code, and papers

Grounded Recurrent Neural Networks

May 23, 2017
Ankit Vani, Yacine Jernite, David Sontag

In this work, we present the Grounded Recurrent Neural Network (GRNN), a recurrent neural network architecture for multi-label prediction which explicitly ties labels to specific dimensions of the recurrent hidden state (we call this process "grounding"). The approach is particularly well-suited for extracting large numbers of concepts from text. We apply the new model to address an important problem in healthcare of understanding what medical concepts are discussed in clinical text. Using a publicly available dataset derived from Intensive Care Units, we learn to label a patient's diagnoses and procedures from their discharge summary. Our evaluation shows a clear advantage to using our proposed architecture over a variety of strong baselines.

Are Emojis Predictable?

Feb 24, 2017
Francesco Barbieri, Miguel Ballesteros, Horacio Saggion

Emojis are ideograms which are naturally combined with plain text to visually complement or condense the meaning of a message. Despite being widely used in social media, their underlying semantics have received little attention from a Natural Language Processing standpoint. In this paper, we investigate the relation between words and emojis, studying the novel task of predicting which emojis are evoked by text-based tweet messages. We train several models based on Long Short-Term Memory networks (LSTMs) in this task. Our experimental results show that our neural model outperforms two baselines as well as humans solving the same task, suggesting that computational models are able to better capture the underlying semantics of emojis.

* To appear at EACL 2017 

LSTM-Based Predictions for Proactive Information Retrieval

Jun 20, 2016
Petri Luukkonen, Markus Koskela, Patrik Floréen

We describe a method for proactive information retrieval targeted at retrieving relevant information during a writing task. In our method, the current task and the needs of the user are estimated, and the potential next steps are unobtrusively predicted based on the user's past actions. We focus on the task of writing, in which the user is coalescing previously collected information into a text. Our proactive system automatically recommends relevant background information to the user. The proposed system incorporates text input prediction using a long short-term memory (LSTM) network. We present simulations which show that the system reaches higher precision in an exploratory search setting than both a baseline and a comparison system.

* Neu-IR '16 SIGIR Workshop on Neural Information Retrieval, July 21, 2016, Pisa, Italy 

Joint Learning Templates and Slots for Event Schema Induction

Mar 04, 2016
Lei Sha, Sujian Li, Baobao Chang, Zhifang Sui

Automatic event schema induction (AESI) aims to extract meta-events from raw text, in other words, to find out what types (templates) of events may exist in the raw text and what roles (slots) may exist in each event type. In this paper, we propose a joint entity-driven model to learn templates and slots simultaneously based on the constraints of templates and slots in the same sentence. In addition, the entities' semantic information is also considered for the inner connectivity of the entities. We borrow the normalized cut criterion from image segmentation to divide the entities into more accurate template clusters and slot clusters. The experiments show that our model achieves relatively better results than previous work.
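
The normalized cut criterion mentioned in the abstract above can be illustrated with a minimal sketch. This is the standard two-way definition from image segmentation, Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), and not the authors' joint model. The similarity matrix `W`, the four entities, and the function name `normalized_cut` are all hypothetical.

```python
def normalized_cut(W, group_a, group_b):
    """Two-way normalized cut for a symmetric similarity matrix W (list of lists).

    cut(A, B)   = total similarity between the two groups
    assoc(X, V) = total similarity from group X to every node in the graph
    """
    n = len(W)
    cut = sum(W[i][j] for i in group_a for j in group_b)
    assoc_a = sum(W[i][j] for i in group_a for j in range(n))
    assoc_b = sum(W[i][j] for i in group_b for j in range(n))
    return cut / assoc_a + cut / assoc_b


# Hypothetical similarities between four entities (illustrative values only).
W = [
    [0, 5, 1, 0],
    [5, 0, 0, 1],
    [1, 0, 0, 5],
    [0, 1, 5, 0],
]

# Grouping strongly connected entities together gives a lower (better) score.
good_split = normalized_cut(W, [0, 1], [2, 3])
bad_split = normalized_cut(W, [0, 2], [1, 3])
```

A lower value indicates a partition that cuts few strong links relative to each group's total connectivity, which is why the criterion favours balanced, internally coherent clusters.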

Towards a relation extraction framework for cyber-security concepts

Apr 16, 2015
Corinne L. Jones, Robert A. Bridges, Kelly Huffer, John Goodall

In order to assist security analysts in obtaining information pertaining to their network, such as novel vulnerabilities, exploits, or patches, information retrieval methods tailored to the security domain are needed. As labeled text data is scarce and expensive, we follow developments in semi-supervised Natural Language Processing and implement a bootstrapping algorithm for extracting security entities and their relationships from text. The algorithm requires little input data, specifically, a few relations or patterns (heuristics for identifying relations), and incorporates an active learning component which queries the user on the most important decisions to prevent drifting from the desired relations. Preliminary testing on a small corpus shows promising results, obtaining precision of .82.

* 4 pages in Cyber & Information Security Research Conference 2015, ACM 

Correcting Errors in Digital Lexicographic Resources Using a Dictionary Manipulation Language

Oct 28, 2014
David Zajic, Michael Maxwell, David Doermann, Paul Rodrigues, Michael Bloodgood

We describe a paradigm for combining manual and automatic error correction of noisy structured lexicographic data. Modifications to the structure and underlying text of the lexicographic data are expressed in a simple, interpreted programming language. Dictionary Manipulation Language (DML) commands identify nodes by unique identifiers, and manipulations are performed using simple commands such as create, move, set text, etc. Corrected lexicons are produced by applying sequences of DML commands to the source version of the lexicon. DML commands can be written manually to repair one-off errors or generated automatically to correct recurring problems. We discuss advantages of the paradigm for the task of editing digital bilingual dictionaries.

* In Proceedings of Electronic Lexicography in the 21st Century (eLex), pages 297-301, Bled, Slovenia, November 2011. Trojina Institute for Applied Slovene Studies 
* 5 pages, 3 figures, 1 table; appeared in Proceedings of Electronic Lexicography in the 21st Century (eLex), November 2011 
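
To make the idea of applying command sequences to a source lexicon concrete, here is a minimal sketch of a command interpreter. The abstract names only the kinds of commands (create, move, set text) and the use of unique node identifiers. The tuple-based command syntax, the node layout, and the function `apply_commands` below are assumptions for illustration, not the actual DML syntax.

```python
def apply_commands(lexicon, commands):
    """Return a corrected copy of `lexicon` after applying `commands` in order.

    `lexicon` maps a unique node id to a dict with keys
    "tag", "text", "parent", "children" (a hypothetical layout).
    The source lexicon itself is left unmodified.
    """
    nodes = {nid: dict(node, children=list(node["children"]))
             for nid, node in lexicon.items()}
    for name, *args in commands:
        if name == "create":
            node_id, parent_id, tag = args
            nodes[node_id] = {"tag": tag, "text": "",
                              "parent": parent_id, "children": []}
            nodes[parent_id]["children"].append(node_id)
        elif name == "move":
            node_id, new_parent_id = args
            old_parent_id = nodes[node_id]["parent"]
            nodes[old_parent_id]["children"].remove(node_id)
            nodes[new_parent_id]["children"].append(node_id)
            nodes[node_id]["parent"] = new_parent_id
        elif name == "set_text":
            node_id, text = args
            nodes[node_id]["text"] = text
        else:
            raise ValueError(f"unknown command: {name}")
    return nodes


# Hypothetical source entry: a misspelled translation attached to the wrong sense.
source = {
    "e1": {"tag": "entry", "text": "", "parent": None, "children": ["s1", "s2"]},
    "s1": {"tag": "sense", "text": "", "parent": "e1", "children": ["t1"]},
    "s2": {"tag": "sense", "text": "", "parent": "e1", "children": []},
    "t1": {"tag": "translation", "text": "hous", "parent": "s1", "children": []},
}

commands = [
    ("set_text", "t1", "house"),         # repair a one-off typo
    ("move", "t1", "s2"),                # reattach the node under the right sense
    ("create", "x1", "s1", "example"),   # add a new, empty node
]

corrected = apply_commands(source, commands)
```

Because corrections are stored as a command list rather than as edits to the data, the same list can be replayed against the source version, and manually written and automatically generated commands can be mixed freely.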

Comparison of the language networks from literature and blogs

Jul 17, 2014
Sabina Šišović, Sanda Martinčić-Ipšić, Ana Meštrović

In this paper we present a comparison of linguistic networks constructed from literature and blog texts. The linguistic networks are constructed from texts as directed and weighted co-occurrence networks of words. Words are nodes, and a link is established between two nodes if the words directly co-occur within a sentence. The comparison of network structure is performed at the global level (network) in terms of average node degree, average shortest path length, diameter, clustering coefficient, density, and number of components. Furthermore, we perform analysis at the local level (node) by comparing the rank plots of in- and out-degree, strength, and selectivity. The selectivity-based results indicate that there are differences between the structure of the networks constructed from literature and blogs.

* 37th IEEE International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2014), pp. 1824-1829 (2014)
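
The network construction described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes whitespace tokenization, lowercasing, and that "directly co-occurring" means one word immediately following another in a sentence. Selectivity is computed as strength divided by degree, which is also an assumption about the definition used. All function names are hypothetical.

```python
from collections import defaultdict


def build_cooccurrence_network(sentences):
    """Directed, weighted network: weight of (u, v) counts how often
    word v immediately follows word u within a sentence."""
    weights = defaultdict(int)
    for sentence in sentences:
        words = sentence.lower().split()  # naive tokenization (assumption)
        for u, v in zip(words, words[1:]):
            weights[(u, v)] += 1
    return dict(weights)


def global_measures(weights):
    """A few network-level measures for a directed graph."""
    nodes = {word for edge in weights for word in edge}
    n, m = len(nodes), len(weights)
    return {
        "nodes": n,
        "edges": m,
        # In a directed graph, average in-degree equals average out-degree.
        "average_degree": m / n if n else 0.0,
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
    }


def out_selectivity(weights):
    """Out-strength divided by out-degree for each node (assumed definition)."""
    strength, degree = defaultdict(int), defaultdict(int)
    for (u, _v), w in weights.items():
        strength[u] += w
        degree[u] += 1
    return {u: strength[u] / degree[u] for u in strength}


network = build_cooccurrence_network(["the cat sat", "the cat ran"])
measures = global_measures(network)
selectivity = out_selectivity(network)
```

Measures such as shortest path length, diameter, and clustering coefficient are omitted here to keep the sketch short; a graph library would normally be used for those.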

A multi-stream HMM approach to offline handwritten Arabic word recognition

Sep 10, 2013
Ahlam Maqqor, Akram Halli, Khaled Satori

In this paper we present a new approach for a cursive Arabic text recognition system. The objective is to propose an analytical methodology for offline recognition of handwritten Arabic that allows rapid implementation. The first part of the recognition system is the preprocessing phase, which prepares the data. The system then extracts a set of simple statistical features by two methods: from a window sliding along the text line from right to left, and by the VH2D approach (which consists in projecting every character on the abscissa, on the ordinate, and on the diagonals at 45° and 135°). The resulting feature vectors are then fed to Hidden Markov Models (HMMs), and the two HMMs are combined using a multi-stream approach.

* 12 pages, 13 figures, International Journal on Natural Language Computing (IJNLC), ISSN: 2278-1307 [Online]; 2319-4111 [Print], August 2013, Volume 2, Number 4
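
The four projections named in the VH2D description can be sketched as pixel counts over a binary character image. This is a minimal illustration under assumptions: the image is a list of rows of 0/1 values, and which diagonal family is labelled 45° versus 135° is a convention chosen here, not taken from the paper. The function name `vh2d_projections` is hypothetical.

```python
def vh2d_projections(image):
    """Project a binary image onto the abscissa, the ordinate, and both diagonals.

    `image` is a non-empty list of equal-length rows containing 0 or 1.
    Each projection is a list of ink-pixel counts.
    """
    rows, cols = len(image), len(image[0])
    on_abscissa = [sum(image[r][c] for r in range(rows)) for c in range(cols)]
    on_ordinate = [sum(row) for row in image]
    diagonal_45 = [0] * (rows + cols - 1)   # pixels with the same r + c
    diagonal_135 = [0] * (rows + cols - 1)  # pixels with the same r - c
    for r in range(rows):
        for c in range(cols):
            diagonal_45[r + c] += image[r][c]
            diagonal_135[r - c + cols - 1] += image[r][c]
    return {
        "abscissa": on_abscissa,
        "ordinate": on_ordinate,
        "diagonal_45": diagonal_45,
        "diagonal_135": diagonal_135,
    }


# A 2x2 image with ink on the main diagonal.
features = vh2d_projections([[1, 0],
                             [0, 1]])
```

Concatenating the four lists gives one fixed-length feature vector per character image, which is the kind of input an HMM stream could consume.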

Use of Transformer-Based Models for Word-Level Transliteration of the Book of the Dean of Lismore

May 23, 2022
Edward Gow-Smith, Mark McConville, William Gillies, Jade Scott, Roibeard Ó Maolalaigh

The Book of the Dean of Lismore (BDL) is a 16th-century Scottish Gaelic manuscript written in a non-standard orthography. In this work, we outline the problem of transliterating the text of the BDL into a standardised orthography, and perform exploratory experiments using Transformer-based models for this task. In particular, we focus on the task of word-level transliteration, and achieve a character-level BLEU score of 54.15 with our best model, a BART architecture pre-trained on the text of Scottish Gaelic Wikipedia and then fine-tuned on around 2,000 word-level parallel examples. Our initial experiments give promising results, but we highlight the shortcomings of our model, and discuss directions for future work.

* 4th Celtic Language Technology Workshop 

TempLM: Distilling Language Models into Template-Based Generators

May 23, 2022
Tianyi Zhang, Mina Lee, Lisa Li, Ende Shen, Tatsunori B. Hashimoto

While pretrained language models (PLMs) have greatly improved text generation, they have also been known to produce unfaithful or inappropriate content. In contrast, classic template-based systems provide strong guarantees of faithfulness at the cost of fluency. We propose TempLM, which achieves the best of both worlds by distilling a PLM into a template-based generator. On the E2E and SynthBio data-to-text datasets, we show that TempLM is more faithful than the original PLM and is more fluent than prior template systems. Notably, on an out-of-domain evaluation, TempLM reduces a finetuned BART model's unfaithfulness rate from 83% to 0%. In a human study, we find that TempLM's templates substantially improve upon human-written ones in BERTScore.
