The task of text segmentation may be undertaken at many levels in text analysis---paragraphs, sentences, words, or even letters. Here, we focus on a relatively fine scale of segmentation, hypothesizing it to be in accord with a stochastic model of language generation, as the smallest scale at which independent units of meaning are produced. Our goals in this letter include the development of methods for segmenting these minimal independent units, which produce feature representations of texts that align with the independence assumption of the bag-of-terms model, commonly used for prediction and classification in computational text analysis. We also propose measuring a text's association (with respect to a realized segmentation) with the model of language generation. We find (1) that our segmentations into phrases exhibit much stronger associations with the generation model than do words and (2) that texts which are well fit by the model are generally topically homogeneous. Because our generative model produces Zipf's law, our study further suggests that Zipf's law may be a consequence of homogeneity in language production.
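As a concrete illustration (not drawn from the letter itself), the following Python sketch builds a bag-of-terms feature representation from an assumed segmentation, treating each segment as an independent unit; the example text and its phrase segmentation are hypothetical:

```python
from collections import Counter

def bag_of_terms(segments):
    """Count each segment (word or multi-word phrase) as an
    independent feature, per the bag-of-terms independence assumption."""
    return Counter(segments)

# Hypothetical segmentation of a short text into minimal units of meaning.
word_level = ["new", "york", "city", "never", "sleeps"]
phrase_level = ["new york city", "never", "sleeps"]

print(bag_of_terms(word_level))    # words as features
print(bag_of_terms(phrase_level))  # phrases as features
```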
Twitter, a popular social media outlet, has evolved into a vast source of linguistic data, rich with opinion, sentiment, and discussion. With Twitter's increasing popularity, its perceived potential for exerting social influence has led to the rise of a diverse community of automatons, commonly referred to as bots. These inorganic and semi-organic Twitter entities range from the benevolent (e.g., weather-update bots and help-wanted-alert bots) to the malevolent (e.g., bots spamming messages, advertisements, or radical opinions). Existing detection algorithms typically leverage metadata (time between tweets, number of followers, etc.) to identify robotic accounts. Here, we present a powerful classification scheme that exclusively uses the natural language text from organic users to provide a criterion for identifying accounts posting automated messages. Since the classifier operates on text alone, it is flexible and may be applied to any textual data beyond the Twitter-sphere.
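The paper's actual classifier is not reproduced here, but the following Python sketch illustrates the general shape of a text-only criterion: score new messages by their likelihood under a simple word model trained on organic users' text. All data, the unigram model, and any threshold are hypothetical assumptions:

```python
import math
from collections import Counter

def train_word_model(texts):
    """Add-one-smoothed unigram model estimated from a text corpus."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return counts, total, len(counts)

def log_likelihood(text, model):
    """Per-word log-likelihood of a text under the unigram model."""
    counts, total, vocab = model
    words = text.lower().split()
    return sum(math.log((counts[w] + 1) / (total + vocab + 1))
               for w in words) / len(words)

# Hypothetical organic baseline and messages to score.
organic = ["off to the lake with friends", "my dog ate my homework again"]
messages = ["free followers click here click here", "what a lovely day"]

model = train_word_model(organic)
for m in messages:
    # Unusually low likelihood under the organic model flags automation.
    print(round(log_likelihood(m, model), 2), m)
```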
In an effort to better understand meaning from natural language texts, we explore methods aimed at organizing lexical objects into contexts. A number of these methods for organization fall into a family defined by word ordering. Unlike demographic or spatial partitions of data, these collocation models are of special importance for their universal applicability. While we are interested here in text and have framed our treatment appropriately, our work is potentially applicable to other areas of research (e.g., speech, genomics, and mobility patterns) where one has ordered categorical data (e.g., sounds, genes, and locations). Our approach focuses on the phrase (whether word or larger) as the primary meaning-bearing lexical unit and object of study. To do so, we employ our previously developed framework for generating word-conserving phrase-frequency data. Upon training our model with the Wiktionary---an extensive, online, collaborative, and open-source dictionary that contains over 100,000 phrasal definitions---we develop highly effective filters for the identification of meaningful, missing phrase-entries. With our predictions, we then engage the editorial community of the Wiktionary and propose short lists of potential missing entries for definition, developing a breakthrough lexical-extraction technique and expanding our knowledge of the defined English lexicon of phrases.
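As a hedged illustration of the filtering step, the sketch below ranks frequent multi-word phrases that lack a dictionary entry. The phrase counts and the set of defined entries are hypothetical, and the paper's filters are more refined than a raw frequency cut:

```python
from collections import Counter

def propose_missing_entries(phrase_counts, defined, top_n=10):
    """Rank frequent multi-word phrases that lack a dictionary entry."""
    candidates = [(p, c) for p, c in phrase_counts.items()
                  if " " in p and p not in defined]
    return sorted(candidates, key=lambda pc: -pc[1])[:top_n]

# Hypothetical phrase frequencies (e.g., from a word-conserving
# random partition of a large corpus) and already-defined entries.
counts = Counter({"take a look": 120, "air quotes": 95,
                  "machine learning": 60, "look": 300})
defined = {"machine learning", "look"}
print(propose_missing_entries(counts, defined))
# -> [('take a look', 120), ('air quotes', 95)]
```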
Though Zipf's law was originally and most famously observed for word frequency, it is surprisingly limited in its applicability to human language, holding over no more than three to four orders of magnitude before hitting a clear break in scaling. Here, building on the simple observation that phrases of one or more words comprise the most coherent units of meaning in language, we show empirically that Zipf's law for phrases extends over as many as nine orders of rank magnitude. In doing so, we develop a principled and scalable statistical mechanical method of random text partitioning, which opens up a rich frontier of rigorous text analysis via a rank ordering of mixed-length phrases.
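A minimal Python sketch of random text partitioning follows, under the assumption that cuts are placed independently between adjacent words with a fixed probability q, so that the expected phrase length is 1/q and q = 1 recovers ordinary words; the paper's exact statistical mechanical formulation may differ:

```python
import random
from collections import Counter

def random_partition(words, q, rng=random):
    """Randomly partition a word sequence into mixed-length phrases,
    cutting between each adjacent word pair with probability q."""
    phrases, current = [], [words[0]]
    for w in words[1:]:
        if rng.random() < q:
            phrases.append(" ".join(current))
            current = [w]
        else:
            current.append(w)
    phrases.append(" ".join(current))
    return phrases

text = "the quick brown fox jumps over the lazy dog".split()
counts = Counter(random_partition(text, q=0.5))
print(counts.most_common())  # rank ordering of mixed-length phrases
```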
Natural languages are full of rules and exceptions. One of the most famous quantitative rules is Zipf's law, which states that the frequency of occurrence of a word is approximately inversely proportional to its rank. Though this `law' of ranks has been found to hold across disparate texts and forms of data, analyses of increasingly large corpora over the last 15 years have revealed the existence of two scaling regimes. These regimes have thus far been explained by a hypothesis suggesting a separability of languages into core and non-core lexica. Here, we present and defend an alternative hypothesis, that the two scaling regimes result from the act of aggregating texts. We observe that text mixing leads to an effective decay of word introduction, which we show provides accurate predictions of the location and severity of breaks in scaling. Upon examining large corpora from 10 languages in the Project Gutenberg eBooks collection (eBooks), we find emphatic empirical support for the universality of our claim.
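To make the rank-frequency object concrete, here is a small Python sketch that tabulates Zipf rank-frequency pairs for a single token stream and for a naively mixed pair of streams; the toy data and simple concatenation are illustrative assumptions only, since the break in scaling emerges in large aggregated corpora:

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent word first."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

# Two hypothetical token streams standing in for single-author texts.
text_a = "the cat sat on the mat and the cat slept".split()
text_b = "el gato duerme en la alfombra y el gato come".split()

# A single coherent text yields one scaling regime in log(f) vs log(r);
# aggregating ("mixing") texts is where the break in scaling arises.
print(rank_frequency(text_a))
print(rank_frequency(text_a + text_b))  # naive text mixing
```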
Using human evaluation of 100,000 words spread across 24 corpora in 10 languages diverse in origin and culture, we present evidence of a deep imprint of human sociality in language, observing that (1) the words of natural human language possess a universal positivity bias; (2) the estimated emotional content of words is consistent between languages under translation; and (3) this positivity bias is strongly independent of frequency of word usage. Alongside these general regularities, we describe inter-language variations in the emotional spectrum of languages, which allow us to rank corpora by their emotional content. We also show how our word evaluations can be used to construct physical-like instruments for both real-time and offline measurement of the emotional content of large-scale texts.
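As an illustration of such an instrument, the Python sketch below computes a text's average emotional rating from per-word human scores; the word list and the 1-9 scale shown are hypothetical stand-ins for the paper's actual evaluations:

```python
def text_score(tokens, ratings):
    """Average emotional rating of a text, skipping unrated words."""
    scored = [ratings[w] for w in tokens if w in ratings]
    return sum(scored) / len(scored) if scored else None

# Hypothetical per-word happiness ratings on an assumed 1-9 scale.
ratings = {"laughter": 8.5, "love": 8.4, "the": 5.0, "war": 1.8}
print(text_score("the love of laughter ended the war".split(), ratings))
```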