"Text": models, code, and papers

Binarizing Business Card Images for Mobile Devices

Mar 08, 2010
Ayatullah Faruk Mollah, Subhadip Basu, Nibaran Das, Ram Sarkar, Mita Nasipuri, Mahantapas Kundu

Business card images are heterogeneous: they often contain graphics, pictures and text of various fonts and sizes in both background and foreground. Conventional binarization techniques designed for document images therefore cannot be applied directly on mobile devices. In this paper, we present a fast binarization technique for camera-captured business card images. A card image is split into small blocks. Some of these blocks are classified as part of the background based on intensity variance. Then the non-text regions are eliminated, and the text regions are skew-corrected and binarized using a simple yet adaptive technique. Experiments show that the technique is fast, efficient and applicable to mobile devices.

* Proc. of International Conference on Computer Vision and Information Technology (ACVIT-2009), pp. 968-975, Dec 16-19, 2009, Aurangabad, India 
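
The block-wise idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the block size, the variance threshold, and the per-block mean threshold are all assumptions chosen for illustration, and the non-text elimination and skew correction steps are left out. The function name `binarize_blocks` is hypothetical.

```python
import numpy as np


def binarize_blocks(gray, block=16, var_threshold=100.0):
    """Binarize a 2-D grayscale image block by block.

    Returns a uint8 array in which 1 marks dark (text-like) pixels
    and 0 marks background. `block` and `var_threshold` are
    illustrative assumptions, not values taken from the paper.
    """
    gray = np.asarray(gray, dtype=np.float64)
    out = np.zeros(gray.shape, dtype=np.uint8)
    height, width = gray.shape
    for top in range(0, height, block):
        for left in range(0, width, block):
            tile = gray[top:top + block, left:left + block]
            # Low intensity variance: treat the whole block as background.
            if tile.var() < var_threshold:
                continue
            # Otherwise threshold at the block's own mean, so the
            # threshold adapts to local lighting.
            out[top:top + block, left:left + block] = tile < tile.mean()
    return out
```

Skipping low-variance blocks early is what would keep such a scheme cheap on a mobile device: only blocks that plausibly contain text get thresholded at all.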

Simplify Your Law: Using Information Theory to Deduplicate Legal Documents

Oct 02, 2021
Corinna Coupette, Jyotsna Singh, Holger Spamann

Textual redundancy is one of the main challenges to ensuring that legal texts remain comprehensible and maintainable. Drawing inspiration from the refactoring literature in software engineering, which has developed methods to expose and eliminate duplicated code, we introduce the duplicated phrase detection problem for legal texts and propose the Dupex algorithm to solve it. Leveraging the Minimum Description Length principle from information theory, Dupex identifies a set of duplicated phrases, called patterns, that together best compress a given input text. Through an extensive set of experiments on the Titles of the United States Code, we confirm that our algorithm works well in practice: Dupex will help you simplify your law.

* 8 pages, 3 figures; to appear in ICDMW 2021 
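
To make the compression intuition concrete, here is a toy sketch. It is not the Dupex algorithm or its encoding: it uses a crude token-count proxy for description length, assuming a phrase is stored once in a dictionary and each occurrence is replaced by a single reference symbol. The function names are hypothetical.

```python
def count_non_overlapping(tokens, pattern):
    """Count left-to-right, non-overlapping occurrences of pattern."""
    if not pattern:
        raise ValueError("pattern must contain at least one token")
    count, i, k = 0, 0, len(pattern)
    while i + k <= len(tokens):
        if tokens[i:i + k] == pattern:
            count += 1
            i += k  # jump past the match so occurrences do not overlap
        else:
            i += 1
    return count


def compression_gain(tokens, pattern):
    """Tokens saved by factoring out `pattern` (assumed toy cost model).

    The occurrences originally cost m * k tokens. After factoring they
    cost m reference symbols plus k tokens for the dictionary entry.
    """
    k = len(pattern)
    m = count_non_overlapping(tokens, pattern)
    return m * k - (m + k)
```

Under this proxy a phrase is worth extracting only when the gain is positive, which favors phrases that are both long and frequent. For example, a four-token phrase that occurs twice saves two tokens, while a single word that occurs twice costs one.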

Automating Discovery of Dominance in Synchronous Computer-Mediated Communication

Feb 24, 2020
Jim Samuel, Richard Holowczak, Raquel Benbunan-Fich, Ilan Levine

With the advent of electronic interaction, dominance (or the assertion of control over others) has acquired new dimensions. This study investigates the dynamics and characteristics of dominance in virtual interaction by analyzing electronic chat transcripts of groups solving a hidden profile task. We investigate computer-mediated communication behavior patterns that demonstrate dominance and identify a number of relevant variables. These indicators are calculated with automatic and manual coding of text transcripts. A comparison of both sets of variables indicates that automatic text analysis methods yield conclusions similar to those of manual coding. These findings encourage further research in text analysis methods in general, and in the study of virtual team dominance in particular.

* 47th Hawaii International Conference on System Sciences, 2014, pp. 1804-1812 

Interactive Fiction Games: A Colossal Adventure

Sep 11, 2019
Matthew Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, Xingdi Yuan

A hallmark of human intelligence is the ability to understand and communicate with language. Interactive Fiction games are fully text-based simulation environments where a player issues text commands to effect change in the environment and progress through the story. We argue that IF games are an excellent testbed for studying language-based autonomous agents. In particular, IF games combine challenges of combinatorial action spaces, language understanding, and commonsense reasoning. To facilitate rapid development of language-based agents, we introduce Jericho, a learning environment for man-made IF games and conduct a comprehensive study of text-agents across a rich set of games, highlighting directions in which agents can improve.

Complexity-entropy analysis at different levels of organization in written language

Mar 14, 2019
E. Estevez-Rams, A. Mesa Rodriguez, D. Estevez-Moya

Written language is complex. A written text can be considered an attempt to convey a meaningful message, one that ends up constrained by language rules and context dependence and is highly redundant in its use of resources. Despite all these constraints, unpredictability is an essential element of natural language. Here we present the use of entropic measures to assess the balance between predictability and surprise in written text. In short, it is possible to measure innovation and context preservation in a document. It is shown that this can also be done at the different levels of organization of a text. The type of analysis presented is reasonably general, and can also be used to analyze the same balance in other complex messages such as DNA, where a hierarchy of organizational levels is known to exist.
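
As a rough illustration of measuring predictability at more than one level of organization, the sketch below computes plain Shannon entropy over characters and over words. This is a simplification swapped in for illustration; the abstract does not specify the entropic measures used, and they are likely more elaborate. The function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy, in bits per symbol, of the empirical distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


text = "to be or not to be"
char_level = shannon_entropy(text)          # symbols are characters
word_level = shannon_entropy(text.split())  # symbols are words
```

The same function serves both levels because only the choice of symbol changes; a sequence with a single repeated symbol has zero entropy, and a uniform mix of two symbols has exactly one bit.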

Exploiting Task-Oriented Resources to Learn Word Embeddings for Clinical Abbreviation Expansion

Apr 11, 2018
Yue Liu, Tao Ge, Kusum S. Mathews, Heng Ji, Deborah L. McGuinness

In the medical domain, identifying and expanding abbreviations in clinical texts is a vital task for better human and machine understanding. It is a challenging task because many abbreviations are ambiguous, especially in intensive care medicine texts, in which phrase abbreviations are frequently used. Besides the fact that there is no universal dictionary of clinical abbreviations and no universal rules for abbreviation writing, such texts are difficult to acquire, expensive to annotate and sometimes confusing even to domain experts. This paper proposes a novel and effective approach: exploiting task-oriented resources to learn word embeddings for expanding abbreviations in clinical notes. We achieved 82.27% accuracy, close to expert human performance.

* Proceedings of BioNLP 15 

Cross-topic Argument Mining from Heterogeneous Sources Using Attention-based Neural Networks

Feb 15, 2018
Christian Stab, Tristan Miller, Iryna Gurevych

Argument mining is a core technology for automating argument search in large document collections. Despite its usefulness for this task, most current approaches to argument mining are designed for use only with specific text types and fall short when applied to heterogeneous texts. In this paper, we propose a new sentential annotation scheme that is reliably applicable by crowd workers to arbitrary Web texts. We source annotations for over 25,000 instances covering eight controversial topics. The results of cross-topic experiments show that our attention-based neural network generalizes best to unseen topics and outperforms vanilla BiLSTM models by 6% in accuracy and 11% in F-score.

Modeling Language Change in Historical Corpora: The Case of Portuguese

Sep 30, 2016
Marcos Zampieri, Shervin Malmasi, Mark Dras

This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification. Algorithms were trained to classify texts with respect to their publication date taking into account lexical variation represented as word n-grams, and morphosyntactic variation represented by part-of-speech (POS) distribution. We report results of 99.8% accuracy using word unigram features with a Support Vector Machines classifier to predict the publication date of documents in time intervals of both one century and half a century. A feature analysis is performed to investigate the most informative features for this task and how they are linked to language change.

* Proceedings of Language Resources and Evaluation (LREC). Portoroz, Slovenia. pp. 4098-4104 (2016) 

Speech: A Challenge to Digital Signal Processing Technology for Human-to-Computer Interaction

May 08, 2013
Urmila Shrawankar, Anjali Mahajan

This software-project-based paper presents a vision of the near future in which computer interaction is characterized by natural face-to-face conversations with lifelike characters that speak, emote, and gesture. The first step is speech. The dream of a true virtual reality, a complete human-computer interaction system, will not come true unless we give the machine some perception and make it perceive the outside world the way humans communicate with each other. This software project, currently under development, aims at a machine (computer) that listens and replies through speech. The speech interface is developed to convert speech input into some parametric form (Speech-to-Text) for further processing and to pass the resulting text output to speech synthesis (Text-to-Speech).

* Conference Proceedings, National Conference on Recent Trends in Electronics & Information Technology (RTEIT), 2006, pp. 206-212 
* Pages: 06 Figures : 06. arXiv admin note: text overlap with arXiv:1305.1429, arXiv:1305.1428 

Identifying Supporting Facts for Multi-hop Question Answering with Document Graph Networks

Oct 01, 2019
Mokanarangan Thayaparan, Marco Valentino, Viktor Schlegel, Andre Freitas

Recent advances in reading comprehension have resulted in models that surpass human performance when the answer is contained in a single, continuous passage of text. However, complex Question Answering (QA) typically requires multi-hop reasoning - i.e. the integration of supporting facts from different sources, to infer the correct answer. This paper proposes Document Graph Network (DGN), a message passing architecture for the identification of supporting facts over a graph-structured representation of text. The evaluation on HotpotQA shows that DGN obtains competitive results when compared to a reading comprehension baseline operating on raw text, confirming the relevance of structured representations for supporting multi-hop reasoning.
