"Text": models, code, and papers

Character Spotting Using Machine Learning Techniques

Jul 28, 2021
P Preethi, Hrishikesh Viswanath

This work presents a comparison of machine learning algorithms that are implemented to segment the characters of text presented as an image. The algorithms are designed to work on degraded documents with text that is not aligned in an organized fashion. The paper investigates the use of Support Vector Machines, the K-Nearest Neighbor algorithm, and an Encoder Network to perform character spotting. Character spotting involves extracting potential characters from a stream of text by selecting regions bounded by white space.
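The region-selection step described in this abstract can be illustrated with a minimal sketch. It is not the paper's SVM, K-Nearest Neighbor, or encoder method. It only shows the idea of cutting candidate characters at white space, assuming a toy binary image (1 = ink, 0 = white) and a single, well-aligned text line, which is a simplification of the degraded, unaligned documents the paper targets. The function name `spot_characters` is hypothetical.

```python
def spot_characters(image):
    """Return (start_col, end_col) pairs for runs of columns containing ink.

    image: list of rows, each a list of 0 (white) or 1 (ink).
    Assumption: one text line, so all-white columns separate characters.
    """
    if not image:
        return []
    width = len(image[0])
    # A column "has ink" if any row has a 1 in that column.
    has_ink = [any(row[c] for row in image) for c in range(width)]

    regions = []
    start = None
    for c, ink in enumerate(has_ink):
        if ink and start is None:
            start = c                       # a candidate region opens
        elif not ink and start is not None:
            regions.append((start, c - 1))  # white column closes the region
            start = None
    if start is not None:
        regions.append((start, width - 1))  # region runs to the right edge
    return regions


# Toy image: two ink blobs separated by two white columns.
image = [
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 1],
]
regions = spot_characters(image)
print(regions)
```

Each returned column range would then be cropped and passed to a classifier. Which classifier performs best is the question the paper itself compares.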

Identifying Reference Spans: Topic Modeling and Word Embeddings help IR

Aug 09, 2017
Luis Moraes, Shahryar Baki, Rakesh Verma, Daniel Lee

The CL-SciSumm 2016 shared task introduced an interesting problem: given a document D and a piece of text that cites D, how do we identify the text spans of D being referenced by the piece of text? The shared task provided the first annotated dataset for studying this problem. We present an analysis of our continued work in improving our system's performance on this task. We demonstrate how topic models and word embeddings can be used to surpass the previously best performing system.
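To make the task concrete, here is a minimal baseline sketch: it ranks the sentences of a document D by bag-of-words cosine similarity to the citing text. This is not the authors' system, which uses topic models and word embeddings. The function names and the toy sentences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_spans(citing_text, document_sentences):
    """Return sentence indices of D, most similar to the citing text first."""
    query = Counter(citing_text.lower().split())
    scored = [
        (cosine(query, Counter(sentence.lower().split())), index)
        for index, sentence in enumerate(document_sentences)
    ]
    # Sort by descending score; ties keep document order.
    return [index for score, index in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Hypothetical toy data, not taken from the CL-SciSumm dataset.
document = [
    "we train a topic model on the corpus",
    "the weather was pleasant that day",
    "word embeddings capture semantic similarity",
]
citation = "they use word embeddings for similarity"
ranking = rank_spans(citation, document)
print(ranking)
```

A pure lexical-overlap baseline like this fails when the citing text paraphrases D, which is the gap that topic models and word embeddings are meant to close.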

Explaining Zipf's Law via Mental Lexicon

Feb 18, 2013
Armen E. Allahverdyan, Weibing Deng, Q. A. Wang

Zipf's law is the major regularity of statistical linguistics that served as a prototype for rank-frequency relations and scaling laws in natural sciences. Here we show that Zipf's law -- together with its applicability for a single text and its generalizations to high and low frequencies including hapax legomena -- can be derived from assuming that the words are drawn into the text with random probabilities. Their a priori density relates, via Bayesian statistics, to general features of the mental lexicon of the author who produced the text.
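For readers unfamiliar with the terms, the sketch below builds a rank-frequency table for a single text and lists its hapax legomena (words that occur exactly once). Zipf's law says that a word's frequency is roughly inversely proportional to its rank in such a table. The code only illustrates these definitions on a toy sentence. It does not implement the paper's derivation, and the function names are hypothetical.

```python
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(text.lower().split())
    # Sort by descending frequency; break ties alphabetically for stability.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ordered, start=1)]


def hapax_legomena(text):
    """Return the words that occur exactly once, in alphabetical order."""
    counts = Counter(text.lower().split())
    return sorted(word for word, n in counts.items() if n == 1)


sample = "the cat saw the dog and the dog saw a bird"
table = rank_frequency(sample)
for rank, word, freq in table:
    # Under an idealized Zipf's law, rank * freq would be roughly constant.
    print(rank, word, freq, rank * freq)
print(hapax_legomena(sample))
```

A text this short cannot show the law itself. On a long text, plotting frequency against rank on log-log axes should give an approximately straight line.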

Techniques to Improve Q&A Accuracy with Transformer-based models on Large Complex Documents

Sep 26, 2020
Chejui Liao, Tabish Maniar, Sravanajyothi N, Anantha Sharma

This paper discusses the effectiveness of various text processing techniques, their combinations, and encodings in reducing the complexity and size of a given text corpus. The simplified text corpus is sent to BERT (or similar transformer-based models) for question answering and can produce more relevant responses to user queries. This paper takes a scientific approach to determine the benefits and effectiveness of various techniques and identifies a best-fit combination that produces a statistically significant improvement in accuracy.

Assessing Human Translations from French to Bambara for Machine Learning: a Pilot Study

Mar 31, 2020
Michael Leventhal, Allahsera Tapo, Sarah Luger, Marcos Zampieri, Christopher M. Homan

We present novel methods for assessing the quality of human-translated aligned texts for learning machine translation models of under-resourced languages. Malian university students translated French texts, producing either written or oral translations to Bambara. Our results suggest that similar quality can be obtained from either written or spoken translations for certain kinds of texts. They also suggest specific instructions that human translators should be given in order to improve the quality of their work.

Short-duration Speaker Verification (SdSV) Challenge 2020: the Challenge Evaluation Plan

Dec 13, 2019
Hossein Zeinali, Kong Aik Lee, Jahangir Alam, Lukas Burget

This document describes Task 1 of the Short-Duration Speaker Verification Challenge (SdSVC) 2020. The main aim of the challenge is to evaluate new technologies for text-dependent speaker verification (TD-SV). The SdSVC has a second task, text-independent speaker verification, which is explained in a separate description file. The evaluation dataset in the challenge is the recently released multi-purpose DeepMine dataset. The dataset has three parts, of which Part 1 is for text-dependent speaker verification.

Automatically Restructuring Practice Guidelines using the GEM DTD

Jun 08, 2007
Amanda Bouffier, Thierry Poibeau

This paper describes a system capable of semi-automatically filling an XML template from free texts in the clinical domain (practice guidelines). The XML template includes semantic information not explicitly encoded in the text (pairs of conditions and actions/recommendations). Therefore, there is a need to compute the exact scope of conditions over text sequences expressing the required actions. We present a system developed for this task. We show that it yields good performance when applied to the analysis of French practice guidelines.

* Proceedings of Biomedical Natural Language Processing (BioNLP) (2007)

Non-autoregressive Transformer by Position Learning

Nov 25, 2019
Yu Bao, Hao Zhou, Jiangtao Feng, Mingxuan Wang, Shujian Huang, Jiajun Chen, Lei Li

Non-autoregressive models are promising on various text generation tasks. Previous work rarely models the positions of generated words explicitly. However, position modeling is an essential problem in non-autoregressive text generation. In this study, we propose PNAT, which incorporates positions as a latent variable into the text generative process. Experimental results show that PNAT achieves top results on machine translation and paraphrase generation tasks, outperforming several strong baselines.

Progressive Transformer-Based Generation of Radiology Reports

Feb 19, 2021
Farhad Nooralahzadeh, Nicolas Perez Gonzalez, Thomas Frauenfelder, Koji Fujimoto, Michael Krauthammer

Inspired by Curriculum Learning, we propose a consecutive (i.e. image-to-text-to-text) generation framework where we divide the problem of radiology report generation into two steps. Rather than generating the full radiology report from the image at once, the model generates global concepts from the image in the first step and then reforms them into finer and coherent texts using a transformer-based architecture. We follow the transformer-based sequence-to-sequence paradigm at each step. We improve upon the state-of-the-art on two benchmark datasets.

Annotation Style Guide for the Blinker Project

May 08, 1998
I. Dan Melamed

This annotation style guide was created by and for the Blinker project at the University of Pennsylvania. The Blinker project was so named after the "bilingual linker" GUI, which was created to enable bilingual annotators to "link" word tokens that are mutual translations in parallel texts. The parallel text chosen for this project was the Bible, because it is probably the easiest text to obtain in electronic form in multiple languages. The languages involved were English and French, because, of the languages with which the project co-ordinator was familiar, these were the two for which a sufficient number of annotators was likely to be found.
