Business card images are heterogeneous: they often contain graphics, pictures, and text of various fonts and sizes in both background and foreground. Conventional binarization techniques designed for document images therefore cannot be directly applied on mobile devices. In this paper, we present a fast binarization technique for camera-captured business card images. A card image is split into small blocks. Some of these blocks are classified as part of the background based on intensity variance. The non-text regions are then eliminated, and the text regions are skew corrected and binarized using a simple yet adaptive technique. Experiments show that the technique is fast, efficient, and applicable to mobile devices.
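The block-splitting and variance-based background step described in the abstract above can be illustrated with a minimal sketch. The abstract does not give the block size, the threshold, or the exact decision rule, so `block_size`, `var_threshold`, and the function names below are hypothetical choices made only for illustration. The image is assumed to be a grayscale grid given as a list of rows.

```python
from statistics import pvariance


def split_into_blocks(image, block_size):
    """Yield (block_row, block_col, pixels) for each tile of the image."""
    height, width = len(image), len(image[0])
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            pixels = [
                image[y][x]
                for y in range(top, min(top + block_size, height))
                for x in range(left, min(left + block_size, width))
            ]
            yield top // block_size, left // block_size, pixels


def classify_blocks(image, block_size=4, var_threshold=100.0):
    """Label low-variance blocks as background (threshold is an assumption)."""
    labels = {}
    for row, col, pixels in split_into_blocks(image, block_size):
        is_flat = pvariance(pixels) < var_threshold
        labels[(row, col)] = "background" if is_flat else "candidate"
    return labels


# Left block is uniform, right block alternates dark and bright pixels.
image = [[200] * 4 + [0, 255, 0, 255] for _ in range(4)]
labels = classify_blocks(image)
```

Blocks labelled "candidate" would still need the later stages the abstract mentions (non-text elimination, skew correction, binarization), which are not sketched here.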
Textual redundancy is one of the main challenges to ensuring that legal texts remain comprehensible and maintainable. Drawing inspiration from the refactoring literature in software engineering, which has developed methods to expose and eliminate duplicated code, we introduce the duplicated phrase detection problem for legal texts and propose the Dupex algorithm to solve it. Leveraging the Minimum Description Length principle from information theory, Dupex identifies a set of duplicated phrases, called patterns, that together best compress a given input text. Through an extensive set of experiments on the Titles of the United States Code, we confirm that our algorithm works well in practice: Dupex will help you simplify your law.
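To make the compression idea in the abstract above concrete, here is a toy sketch. It is not the Dupex algorithm, whose details are not given here. It assumes a deliberately simple cost model counted in tokens: replacing each of `count` non-overlapping occurrences of a phrase of `length` tokens with one symbol, and storing the phrase once, saves `count * length - (count + length)` tokens. All function names and the `max_len` parameter are hypothetical.

```python
def count_non_overlapping(tokens, phrase):
    """Count occurrences of phrase in tokens, scanning greedily left to right."""
    k, i, count = len(phrase), 0, 0
    while i + k <= len(tokens):
        if tuple(tokens[i:i + k]) == phrase:
            count += 1
            i += k
        else:
            i += 1
    return count


def saving(count, length):
    """Tokens saved under the toy cost model described in the lead-in."""
    return count * length - (count + length)


def best_pattern(tokens, max_len=5):
    """Return the single phrase (2..max_len tokens) with the largest saving."""
    candidates = {
        tuple(tokens[i:i + k])
        for k in range(2, max_len + 1)
        for i in range(len(tokens) - k + 1)
    }
    best, best_saving = None, 0
    for phrase in sorted(candidates):
        s = saving(count_non_overlapping(tokens, phrase), len(phrase))
        if s > best_saving:
            best, best_saving = phrase, s
    return best, best_saving


text = ("the secretary of state shall notify the secretary of state "
        "shall report the secretary of state")
pattern, saved = best_pattern(text.split())
```

A real MDL formulation would measure cost in bits and select a set of patterns jointly; this sketch only ranks single phrases by a token-count saving.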
With the advent of electronic interaction, dominance (or the assertion of control over others) has acquired new dimensions. This study investigates the dynamics and characteristics of dominance in virtual interaction by analyzing electronic chat transcripts of groups solving a hidden profile task. We investigate computer-mediated communication behavior patterns that demonstrate dominance and identify a number of relevant variables. These indicators are calculated with both automatic and manual coding of text transcripts. A comparison of the two sets of variables indicates that automatic text analysis methods yield conclusions similar to those of manual coding. These findings encourage further research on text analysis methods in general and on virtual team dominance in particular.
A hallmark of human intelligence is the ability to understand and communicate with language. Interactive Fiction (IF) games are fully text-based simulation environments where a player issues text commands to effect change in the environment and progress through the story. We argue that IF games are an excellent testbed for studying language-based autonomous agents. In particular, IF games combine challenges of combinatorial action spaces, language understanding, and commonsense reasoning. To facilitate rapid development of language-based agents, we introduce Jericho, a learning environment for man-made IF games, and conduct a comprehensive study of text-agents across a rich set of games, highlighting directions in which agents can improve.
Written language is complex. A written text can be considered an attempt to convey a meaningful message that ends up being constrained by language rules, dependent on context, and highly redundant in its use of resources. Despite all these constraints, unpredictability is an essential element of natural language. Here we present the use of entropic measures to assess the balance between predictability and surprise in written text. In short, it is possible to measure innovation and context preservation in a document. We show that this can also be done at the different levels of organization of a text. The type of analysis presented is reasonably general and can also be used to analyze the same balance in other complex messages such as DNA, where a hierarchy of organizational levels is known to exist.
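The abstract above does not specify which entropic measures are used, so the following is only an illustrative sketch using two standard quantities: the Shannon entropy of the word distribution, and the conditional entropy of a word given the preceding word. The function names and the whitespace tokenization are assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Entropy in bits of the empirical token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(tokens):
    """Entropy in bits of the next token given the previous token."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])
    total_pairs = len(tokens) - 1
    entropy = 0.0
    for (prev, _next), c in pair_counts.items():
        p_pair = c / total_pairs
        p_next_given_prev = c / prev_counts[prev]
        entropy -= p_pair * math.log2(p_next_given_prev)
    return entropy


tokens = "a b a b a b".split()
h_words = shannon_entropy(tokens)        # uncertainty without context
h_context = conditional_entropy(tokens)  # uncertainty once context is known
```

In this toy sequence the two words are equally frequent, yet each word is fully determined by its predecessor, so context removes all of the surprise. Natural text would fall somewhere between these extremes.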
In the medical domain, identifying and expanding abbreviations in clinical texts is a vital task for better understanding by both humans and machines. It is a challenging task because many abbreviations are ambiguous, especially in intensive care medicine texts, in which phrase abbreviations are frequently used. Besides the fact that there is no universal dictionary of clinical abbreviations and no universal rules for abbreviation writing, such texts are difficult to acquire, expensive to annotate, and sometimes confusing even to domain experts. This paper proposes a novel and effective approach: exploiting task-oriented resources to learn word embeddings for expanding abbreviations in clinical notes. We achieved 82.27% accuracy, close to expert human performance.
Argument mining is a core technology for automating argument search in large document collections. Despite its usefulness for this task, most current approaches to argument mining are designed for use only with specific text types and fall short when applied to heterogeneous texts. In this paper, we propose a new sentential annotation scheme that is reliably applicable by crowd workers to arbitrary Web texts. We source annotations for over 25,000 instances covering eight controversial topics. The results of cross-topic experiments show that our attention-based neural network generalizes best to unseen topics and outperforms vanilla BiLSTM models by 6% in accuracy and 11% in F-score.
This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification. Algorithms were trained to classify texts with respect to their publication date taking into account lexical variation represented as word n-grams, and morphosyntactic variation represented by part-of-speech (POS) distribution. We report results of 99.8% accuracy using word unigram features with a Support Vector Machines classifier to predict the publication date of documents in time intervals of both one century and half a century. A feature analysis is performed to investigate the most informative features for this task and how they are linked to language change.
This software-project-based paper presents a vision of the near future in which computer interaction is characterized by natural face-to-face conversations with lifelike characters that speak, emote, and gesture. The first step is speech. The dream of a true virtual reality, a complete human-computer interaction system, will not come true unless we give machines some perception and let them perceive the outside world the way humans do when they communicate with each other. This software project, which is under development, enables a machine (computer) to listen and reply through speech. The speech interface is developed to convert speech input into a parametric form (Speech-to-Text) for further processing and to pass the resulting text output to speech synthesis (Text-to-Speech).
Recent advances in reading comprehension have resulted in models that surpass human performance when the answer is contained in a single, continuous passage of text. However, complex Question Answering (QA) typically requires multi-hop reasoning, i.e., the integration of supporting facts from different sources to infer the correct answer. This paper proposes Document Graph Network (DGN), a message passing architecture for the identification of supporting facts over a graph-structured representation of text. The evaluation on HotpotQA shows that DGN obtains competitive results when compared to a reading comprehension baseline operating on raw text, confirming the relevance of structured representations for supporting multi-hop reasoning.
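As a generic illustration of message passing over a graph of text units, the sketch below performs one round of mean aggregation: each node's vector is replaced by the average of its own vector and its neighbors' vectors. This is not the DGN update rule, which the abstract does not describe. The node names, the feature vectors, and the function `message_passing_step` are hypothetical.

```python
def message_passing_step(neighbors, features):
    """One round of mean aggregation over each node and its neighbors."""
    updated = {}
    for node, vector in features.items():
        group = [vector] + [features[n] for n in neighbors.get(node, [])]
        updated[node] = [sum(values) / len(group) for values in zip(*group)]
    return updated


# A question node linked to one sentence, which is linked to a second one.
neighbors = {"q": ["s1"], "s1": ["q", "s2"], "s2": ["s1"]}
features = {"q": [1.0, 0.0], "s1": [0.0, 0.0], "s2": [1.0, 2.0]}
after_one_hop = message_passing_step(neighbors, features)
after_two_hops = message_passing_step(neighbors, after_one_hop)
```

After one step, information from `s2` has reached `s1`; after two steps it has reached `q`. This is the sense in which repeated message passing can combine facts that are several hops apart. A learned architecture would replace the plain mean with trained transformations.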