"Text": models, code, and papers

Building a Word Segmenter for Sanskrit Overnight

Feb 17, 2018
Vikas Reddy, Amrith Krishna, Vishnu Dutt Sharma, Prateek Gupta, Vineeth M R, Pawan Goyal

There is an abundance of digitised texts available in Sanskrit. However, word segmentation in such texts is challenging due to 'Sandhi'. In Sandhi, words in a sentence often fuse together to form a single chunk of text, where the word delimiter vanishes and sounds at the word boundaries undergo transformations, which is also reflected in the written text. Here, we propose an approach that uses a deep sequence-to-sequence (seq2seq) model that takes only the sandhied string as input and predicts the unsandhied string. The state-of-the-art models are linguistically involved and have external dependencies for the lexical and morphological analysis of the input. Our model can be trained "overnight" and used in production. Despite this knowledge-lean approach, our system performs better than the current state of the art, with a percentage increase of 16.79%.

* The work is accepted at LREC 2018, Miyazaki, Japan 


Demographic Dialectal Variation in Social Media: A Case Study of African-American English

Aug 31, 2016
Su Lin Blodgett, Lisa Green, Brendan O'Connor

Though dialectal language is increasingly abundant on social media, few resources exist for developing NLP tools to handle such language. We conduct a case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter. We propose a distantly supervised model to identify AAE-like language from demographics associated with geo-located messages, and we verify that this language follows well-known AAE linguistic phenomena. In addition, we analyze the quality of existing language identification and dependency parsing tools on AAE-like text, demonstrating that they perform poorly on such text compared to text associated with white speakers. We also provide an ensemble classifier for language identification which eliminates this disparity and release a new corpus of tweets containing AAE-like language.

* To be published in EMNLP 2016, 15 pages 


Text Summarization in the Biomedical Domain

Aug 06, 2019
Milad Moradi, Nasser Ghadiri

This chapter gives an overview of recent advances in the field of biomedical text summarization. Different types of challenges are introduced, and methods are discussed concerning the type of challenge that they address. Biomedical literature summarization is explored as a leading trend in the field, and some future lines of work are pointed out. Underlying methods of recent summarization systems are briefly explained and the most significant evaluation results are mentioned. The primary purpose of this chapter is to review the most significant research efforts made in the current decade toward new methods of biomedical text summarization. As the main parts of this chapter, current trends are discussed and new challenges are introduced.


A Vietnamese Text-Based Conversational Agent

Nov 26, 2019
Dai Quoc Nguyen, Dat Quoc Nguyen, Son Bao Pham

This paper introduces a Vietnamese text-based conversational agent architecture for a specific knowledge domain, integrated into a question answering system. When the question answering system fails to answer a user's input, our conversational agent can step in and interact with the user to provide an answer. Experimental results are promising: our Vietnamese text-based conversational agent received positive feedback in a study conducted in the university academic regulation domain.

* In Proceedings of the 25th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA/AIE 2012) 


Morphological Analysis of Japanese Hiragana Sentences using the BI-LSTM CRF Model

Jan 10, 2022
Jun Izutsu, Kanako Komiya

This study proposes a method to develop neural models of the morphological analyzer for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. This technique plays an essential role in downstream applications in Japanese natural language processing systems because the Japanese language does not have word delimiters between words. Hiragana is a Japanese phonographic script used in texts for children or for people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there is less information for dividing the text. For morphological analysis of Hiragana sentences, we demonstrated the effectiveness of fine-tuning using a model based on ordinary Japanese text and examined the influence of training data on texts of various genres.

* 13 pages 
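
As a rough illustration of how a character-level tagger can divide undelimited Hiragana text into words, the sketch below decodes a sequence of per-character boundary tags into words. The B/I tag scheme, the function name `tags_to_words`, and the hand-written tags are assumptions made for this example; the abstract does not state the tag set or decoding step the authors use, and no Bi-LSTM CRF model is run here.

```python
from typing import List


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Group characters into words using per-character boundary tags.

    Assumed (hypothetical) scheme: "B" marks the first character of a
    word, "I" marks a character that continues the current word.
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")

    words: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            # Start a new word (also covers a sequence that begins with "I").
            words.append(ch)
        else:
            # Continue the word currently being built.
            words[-1] += ch
    return words


# Hand-written tags standing in for a tagger's output.
sentence = "きょうははれです"
tags = ["B", "I", "I", "B", "B", "I", "B", "I"]
print(tags_to_words(sentence, tags))
```

In a full analyzer the tags would typically also carry part-of-speech labels, which this sketch leaves out.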


Indonesian ID Card Extractor Using Optical Character Recognition and Natural Language Post-Processing

Dec 15, 2020
Firhan Maulana Rusli, Kevin Akbar Adhiguna, Hendy Irawan

The development of information technology has increasingly changed the means of information exchange, creating a need to digitize printed documents. Fraud is also common in the present era. To prevent account fraud, identity verification can rely on ID card extraction using OCR and NLP. Optical Character Recognition (OCR) is a technology used to generate text from images. With OCR, we can also extract the text of an Indonesian ID card, or kartu tanda penduduk (KTP). This makes data entry easier for service operators. To improve accuracy, we apply text correction using a Natural Language Processing (NLP) method. On 50 Indonesian ID card images, we obtained an F-score of 0.78, and extraction took 4510 milliseconds per ID card.

* 5 pages 


Spatial Semantic Scan: Jointly Detecting Subtle Events and their Spatial Footprint

May 28, 2016
Abhinav Maurya

Many methods have been proposed for detecting emerging events in text streams using topic modeling. However, these methods have shortcomings that make them unsuitable for rapid detection of locally emerging events on massive text streams. We describe Spatially Compact Semantic Scan (SCSS), which has been developed specifically to overcome the shortcomings of current methods in detecting new spatially compact events in text streams. SCSS alternates between using semantic scan to estimate contrastive foreground topics in documents and discovering spatial neighborhoods with a high occurrence of documents containing the foreground topics. We evaluate our method on an Emergency Department chief complaints dataset (ED dataset) to verify its effectiveness in detecting real-world disease outbreaks from free-text ED chief complaint data.

* 26 pages 


A Novel Method of Extracting Topological Features from Word Embeddings

Apr 19, 2020
Shafie Gholizadeh, Armin Seyeditabari, Wlodek Zadrozny

In recent years, topological data analysis has been utilized for a wide range of problems to deal with high-dimensional noisy data. While text representations are often high-dimensional and noisy, there has been little work on the application of topological data analysis in natural language processing. In this paper, we introduce a novel algorithm to extract topological features from word embedding representations of text that can be used for text classification. Working on word embeddings, topological data analysis can interpret the high-dimensional embedding space and discover the relations among different embedding dimensions. We use persistent homology, the most commonly used tool from topological data analysis, for our experiment. Examining our topological algorithm on long textual documents, we show that our defined topological features may outperform conventional text mining features.


CTRL: A Conditional Transformer Language Model for Controllable Generation

Sep 11, 2019
Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher

Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.6 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at
