Rami Al-Rfou'

Detecting English Writing Styles For Non Native Speakers

Apr 24, 2017
Yanging Chen, Rami Al-Rfou', Yejin Choi

To our knowledge, this paper presents the first attempt to classify English writing styles at this scale, with the challenge of classifying day-to-day language written by writers with different backgrounds and covering a wide range of topics. The paper proposes simple machine learning algorithms and simple-to-generate features to solve hard problems, relying on the scale of the data available from large sources of knowledge like Wikipedia. We believe such sources of data are crucial for generating robust solutions for the web that are highly accurate and easy to deploy in practice. The paper achieves 74% accuracy in classifying native versus non-native speakers' writing styles. Moreover, the paper reports some interesting observations on the similarity between different languages, measured by the similarity of their users' English writing styles. This technique could be used to demonstrate some well-known facts about languages, such as grouping them into families, which our experiments support.

* 9 figures, 5 tables, 9 pages

Exploring the power of GPU's for training Polyglot language models

Apr 15, 2014
Vivek Kulkarni, Rami Al-Rfou', Bryan Perozzi, Steven Skiena

One of the major current research trends is the evolution of heterogeneous parallel computing. GP-GPU computing is widely used, and several applications have been designed to exploit the massive parallelism that GP-GPUs have to offer. While GPUs have always been widely used in computer vision for image processing, little has been done to investigate whether the massive parallelism provided by GP-GPUs can be utilized effectively for Natural Language Processing (NLP) tasks. In this work, we investigate and explore the power of GP-GPUs in the task of learning language models. More specifically, we investigate the performance of training Polyglot language models using deep belief neural networks. We evaluate the performance of training the model on the GPU and present optimizations that boost the performance on the GPU. One of the key optimizations we propose increases the performance of a function involved in calculating and updating the gradient by approximately 50 times on the GPU for sufficiently large batch sizes. We show that with the above optimizations, the GP-GPU's performance on the task increases by a factor of approximately 3-4. The optimizations we made are generic Theano optimizations and hence can potentially boost the performance of other models that rely on these operations. We also show that these optimizations make the GPU's performance at this task comparable to that of the CPU. We conclude by presenting a thorough evaluation of the applicability of GP-GPUs for this task and highlight the factors limiting the performance of training a Polyglot model on the GPU.

* version 2 (just corrected citation)

SpeedRead: A Fast Named Entity Recognition Pipeline

Jan 14, 2013
Rami Al-Rfou', Steven Skiena

Online content analysis employs algorithmic methods to identify entities in unstructured text. Both machine learning and knowledge-base approaches lie at the foundation of contemporary named entity extraction systems. However, progress in deploying these approaches at web scale has been hampered by the computational cost of NLP over massive text corpora. We present SpeedRead (SR), a named entity recognition pipeline that runs at least 10 times faster than the Stanford NLP pipeline. This pipeline consists of a high-performance Penn Treebank-compliant tokenizer, a close to state-of-the-art part-of-speech (POS) tagger, and a knowledge-based named entity recognizer.

* Long paper at COLING 2012

Detecting English Writing Styles For Non-native Speakers

Nov 02, 2012
Rami Al-Rfou'

Analyzing the writing styles of non-native speakers is a challenging task. In this paper, we analyze the comments written in the discussion pages of the English Wikipedia. Using learning algorithms, we are able to detect native speakers' writing style with an accuracy of 74%. Given the diversity of the English Wikipedia users and the large number of languages they speak, we measure the similarities among their native languages by comparing the influence they have on their English writing style. Our results show that languages known to share the same origin and development path leave a similar footprint on their speakers' English writing style. To enable further studies, the dataset we extracted from Wikipedia will be made publicly available.
