"Text": models, code, and papers

Multi-Task Handwritten Document Layout Analysis

Jun 22, 2018
Lorenzo Quirós

Document Layout Analysis is a fundamental step in Handwritten Text Processing systems, from the extraction of the text lines to the identification of the type of region to which each line belongs. We present a system based on artificial neural networks which is able to extract not only the baselines present in the document, but also the geometric and logical layout of the document. Experiments on three different datasets demonstrate the potential of the method and show results competitive with state-of-the-art methods.

Sentiment Classification using Images and Label Embeddings

Dec 03, 2017
Laura Graesser, Abhinav Gupta, Lakshay Sharma, Evelina Bakhturina

In this project we analysed how much semantic information images carry, and how much value image data can add to sentiment analysis of the text associated with the images. To better understand the contribution of images, we compared models which used only image data, models which used only text data, and models which combined both data types. We also analysed whether this approach could help sentiment classifiers generalize to unknown sentiments.

* 13 pages, 3 figures, 9 tables. Technical report for Statistical Natural Language Processing Project (NYU CS - Fall 2016) 

Open Data Platform for Knowledge Access in Plant Health Domain : VESPA Mining

Apr 23, 2015
Nicolas Turenne, Mathieu Andro, Roselyne Corbière, Tien T. Phan

Important data are locked in ancient literature. It would be uneconomical to produce these data again today, or to extract them without the help of text mining technologies. Vespa is a text mining project whose aim is to extract data on pest and crop interactions, to model and predict attacks on crops, and to reduce the use of pesticides. Only a few previous efforts have proposed access to agricultural information. Another original aspect of our work is that documents are parsed in a way that depends on the document architecture.

Document classification methods

Sep 16, 2019
Madjid Khalilian, Shiva Hassanzadeh

Information collected by users across different fields requires appropriate management and organization so that it can be structured in a standard way and retrieved quickly and easily. Document classification is a conventional method for separating texts by subject in collections such as scientific texts, web pages and digital libraries. Different methods and techniques have been proposed for document classification, each with advantages and deficiencies. In this paper, several unsupervised and supervised document classification methods are studied and compared.

A Comprehensive Comparative Study of Word and Sentence Similarity Measures

Feb 17, 2016
Issa Atoum, Ahmed Otoom, Narayanan Kulathuramaiyer

Sentence similarity is considered the basis of many natural language tasks such as information retrieval, question answering and text summarization. The semantic meaning between compared text fragments is based on the words' semantic features and their relationships. This article reviews a set of word and sentence similarity measures and compares them on benchmark datasets. On the studied datasets, results showed that hybrid semantic measures perform better than both knowledge-based and corpus-based measures.

* International Journal of Computer Applications, 2016, 135(1), Foundation of Computer Science (FCS), NY, USA
* 7 pages, 4 figures
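
To make the idea of a sentence similarity measure concrete, here is a minimal sketch of a simple lexical-overlap baseline: cosine similarity over bag-of-words counts. This is an illustrative assumption, not one of the measures evaluated in the article; the function name `cosine_similarity` and the whitespace tokenization are hypothetical choices made for the example.

```python
import math
from collections import Counter


def cosine_similarity(sentence_a: str, sentence_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two sentences.

    Illustrative baseline only (assumption): lowercased whitespace tokens,
    no stemming, no stop-word removal, no semantic resources.
    """
    counts_a = Counter(sentence_a.lower().split())
    counts_b = Counter(sentence_b.lower().split())

    # Dot product over the shared vocabulary; other words contribute zero.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # An empty sentence has no direction in vector space; define similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("the cat sat", "the cat ran"))
```

A measure like this only rewards exact word overlap, which is the limitation that knowledge-based and hybrid semantic measures are meant to address.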

Mapping the Economic Crisis: Some Preliminary Investigations

Jun 17, 2014
Pierre Bourreau, Thierry Poibeau

In this paper we describe our contribution to the PoliInformatics 2014 Challenge on the 2007-2008 financial crisis. We propose a state-of-the-art technique to extract information from texts and provide different representations, giving first a static overview of the domain and then a dynamic representation of its main evolutions. We show that this strategy provides a practical solution for some recent theories in the social sciences that lack methods and tools for automatically extracting information from natural language texts.

* Technical paper describing the Lattice submission to the 2014 PoliInformatics Unshared task 

Adaptation of pedagogical resources description standard (LOM) with the specificity of Arabic language

Aug 01, 2012
Asma Boudhief, Mohsen Maraoui, Mounir Zrigui

In this article we focus first on the principle of pedagogical indexing and the characteristics of the Arabic language, and second on the possibility of adapting the standard used for describing learning resources (the LOM and its Application Profiles) to learning conditions such as the educational levels of students, their levels of understanding, ... and the educational context, taking into account representative elements of the text, text length, ... In particular, we highlight the specificity of the Arabic language, a complex language characterized by its inflection, vocalization and agglutination.

* 8 pages, 10 figures. arXiv admin note: substantial text overlap with arXiv:1206.2009

The UN Parallel Corpus Annotated for Translation Direction

May 20, 2018
Elad Tolochinsky, Ohad Mosafi, Ella Rabinovich, Shuly Wintner

This work distinguishes between translated and original text in the UN protocol corpus. By modeling the problem as a classification problem, we can achieve up to 95% classification accuracy. We begin by deriving a parallel corpus for different language pairs annotated for translation direction, and then classify the data using various feature extraction methods. We compare the different methods as well as the ability to distinguish between translated and original texts in the different languages. The annotated corpus is publicly available.

Discourse Coherence in the Wild: A Dataset, Evaluation and Methods

May 14, 2018
Alice Lai, Joel Tetreault

To date there has been very little work on assessing discourse coherence methods on real-world data. To address this, we present a new corpus of real-world texts (GCDC) as well as the first large-scale evaluation of leading discourse coherence algorithms. We show that neural models, including two that we introduce here (SentAvg and ParSeq), tend to perform best. We analyze these performance differences and discuss patterns we observed in low coherence texts in four domains.

* Accepted at SIGDIAL 2018 

Controlling Linguistic Style Aspects in Neural Language Generation

Jul 09, 2017
Jessica Ficler, Yoav Goldberg

Most work on neural natural language generation (NNLG) focuses on controlling the content of the generated text. We experiment with controlling several stylistic aspects of the generated text, in addition to its content. The method is based on a conditioned RNN language model, where the desired content as well as the stylistic parameters serve as conditioning contexts. We demonstrate the approach on the movie reviews domain and show that it is successful in generating coherent sentences corresponding to the required linguistic style and content.
