Clara Meister

ETH Zurich

Estimating the Entropy of Linguistic Distributions

Apr 05, 2022

Aryaman Arora, Clara Meister, Ryan Cotterell

Abstract: Shannon entropy is often a quantity of interest to linguists studying the communicative capacity of human language. However, entropy must typically be estimated from observed data because researchers do not have access to the underlying probability distribution that gives rise to these data. While entropy estimation is a well-studied problem in other fields, there is not yet a comprehensive exploration of the efficacy of entropy estimators for use with linguistic data. In this work, we fill this void, studying the empirical effectiveness of different entropy estimators for linguistic distributions. In a replication of two recent information-theoretic linguistic studies, we find evidence that the reported effect size is over-estimated due to over-reliance on poor entropy estimators. Finally, we end our paper with concrete recommendations for entropy estimation depending on distribution type and data availability.

* 21 pages (5 pages main text). 4 figures. Accepted to ACL 2022
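
To make the estimation problem concrete, here is a minimal sketch of two classic entropy estimators: the plug-in (maximum-likelihood) estimator, which is known to underestimate entropy on small samples, and the Miller-Madow bias correction. The abstract does not name the estimators studied, so these two are illustrative assumptions rather than the paper's recommendations, and the function names are hypothetical.

```python
import math
from collections import Counter


def plugin_entropy(samples):
    """Plug-in (maximum-likelihood) entropy estimate in nats.

    Uses relative frequencies as if they were the true probabilities.
    """
    counts = Counter(samples)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def miller_madow_entropy(samples):
    """Plug-in estimate plus the Miller-Madow correction (K - 1) / (2N).

    K is the number of distinct observed types, N the sample size.
    """
    counts = Counter(samples)
    n = sum(counts.values())
    k = len(counts)
    return plugin_entropy(samples) + (k - 1) / (2 * n)


if __name__ == "__main__":
    words = "the cat sat on the mat the end".split()
    print(plugin_entropy(words), miller_madow_entropy(words))
```

The correction term shrinks as the sample grows, so the two estimates converge on large corpora and differ most in the low-data regime.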

On the probability-quality paradox in language generation

Mar 31, 2022

Clara Meister, Gian Wiher, Tiago Pimentel, Ryan Cotterell

Abstract: When generating natural language from neural probabilistic models, high probability does not always coincide with high quality: It has often been observed that mode-seeking decoding methods, i.e., those that produce high-probability text under the model, lead to unnatural language. On the other hand, the lower-probability text generated by stochastic methods is perceived as more human-like. In this note, we offer an explanation for this phenomenon by analyzing language generation through an information-theoretic lens. Specifically, we posit that human-like language should contain an amount of information (quantified as negative log-probability) that is close to the entropy of the distribution over natural strings. Further, we posit that language with substantially more (or less) information is undesirable. We provide preliminary empirical evidence in favor of this hypothesis; quality ratings of both human and machine-generated text -- covering multiple tasks and common decoding strategies -- suggest high-quality text has an information content significantly closer to the entropy than we would expect by chance.

* ACL 2022 (main conference)

Analyzing Wrap-Up Effects through an Information-Theoretic Lens

Mar 31, 2022

Clara Meister, Tiago Pimentel, Thomas Hikaru Clark, Ryan Cotterell, Roger Levy

Abstract: Numerous analyses of reading time (RT) data have been implemented -- all in an effort to better understand the cognitive processes driving reading comprehension. However, data measured on words at the end of a sentence -- or even at the end of a clause -- is often omitted due to the confounding factors introduced by so-called "wrap-up effects," which manifest as a skewed distribution of RTs for these words. Consequently, the understanding of the cognitive processes that might be involved in these wrap-up effects is limited. In this work, we attempt to learn more about these processes by examining the relationship between wrap-up effects and information-theoretic quantities, such as word and context surprisals. We find that the distribution of information in prior contexts is often predictive of sentence- and clause-final RTs (while not of sentence-medial RTs). This lends support to several prior hypotheses about the processes involved in wrap-up effects.

* ACL 2022 (main conference)

On Decoding Strategies for Neural Text Generators

Mar 29, 2022

Gian Wiher, Clara Meister, Ryan Cotterell

Abstract: When generating text from probabilistic models, the chosen decoding strategy has a profound effect on the resulting text. Yet the properties elicited by various decoding strategies do not always transfer across natural language generation tasks. For example, while mode-seeking methods like beam search perform remarkably well for machine translation, they have been observed to lead to incoherent and repetitive text in story generation. Despite such observations, the effectiveness of decoding strategies is often assessed with respect to only a single task. This work -- in contrast -- provides a comprehensive analysis of the interaction between language generation tasks and decoding strategies. Specifically, we measure changes in attributes of generated text as a function of both decoding strategy and task using human and automatic evaluation. Our results reveal both previously observed and surprising findings. For example, the nature of the diversity-quality trade-off in language generation is very task-specific; the length bias often attributed to beam search is not constant across tasks.

Typical Decoding for Natural Language Generation

Feb 10, 2022

Clara Meister, Tiago Pimentel, Gian Wiher, Ryan Cotterell

Abstract: Despite achieving incredibly low perplexities on myriad natural language corpora, today's language models still often underperform when used to generate text. This dichotomy has puzzled the language generation community for the last few years. In this work, we posit that the abstraction of natural language as a communication channel (à la Shannon, 1948) can provide new insights into the behaviors of probabilistic language generators, e.g., why high-probability texts can be dull or repetitive. Humans use language as a means of communicating information, and do so in an efficient yet error-minimizing manner, choosing each word in a string with this (perhaps subconscious) goal in mind. We propose that generation from probabilistic models should mimic this behavior. Rather than always choosing words from the high-probability region of the distribution--which have a low Shannon information content--we sample from the set of words with an information content close to its expected value, i.e., close to the conditional entropy of our model. This decision criterion can be realized through a simple and efficient implementation, which we call typical sampling. Automatic and human evaluations show that, in comparison to nucleus and top-k sampling, typical sampling offers competitive performance in terms of quality while consistently reducing the number of degenerate repetitions.
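
The decision criterion described above can be sketched for a single next-token distribution as follows. This is a minimal illustration, not the authors' implementation: the probability-mass threshold `tau`, the function names, and the dictionary representation of the distribution are assumptions introduced here.

```python
import math
import random


def typical_filter(probs, tau=0.9):
    """Keep the tokens whose information content is closest to the entropy.

    probs: dict mapping token -> probability (assumed to sum to 1).
    tau:   hypothetical mass threshold; tokens are added in order of
           |(-log p) - H| until their total probability reaches tau.
    Returns a renormalized dict over the kept tokens.
    """
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    ranked = sorted(
        (t for t, p in probs.items() if p > 0),
        key=lambda t: abs(-math.log(probs[t]) - entropy),
    )
    kept, mass = [], 0.0
    for token in ranked:
        kept.append(token)
        mass += probs[token]
        if mass >= tau:
            break
    return {t: probs[t] / mass for t in kept}


def typical_sample(probs, tau=0.9, rng=random):
    """Draw one token from the filtered, renormalized distribution."""
    filtered = typical_filter(probs, tau)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Unlike nucleus sampling, which ranks tokens by probability alone, this filter can exclude the single most probable token when its information content lies far below the entropy of the distribution.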

A surprisal--duration trade-off across and within the world's languages

Sep 30, 2021

Tiago Pimentel, Clara Meister, Elizabeth Salesky, Simone Teufel, Damián Blasi, Ryan Cotterell

Abstract: While there exist scores of natural languages, each with its unique features and idiosyncrasies, they all share a unifying theme: enabling human communication. We may thus reasonably predict that human cognition shapes how these languages evolve and are used. Assuming that the capacity to process information is roughly constant across human populations, we expect a surprisal--duration trade-off to arise both across and within languages. We analyse this trade-off using a corpus of 600 languages and, after controlling for several potential confounds, we find strong supporting evidence in both settings. Specifically, we find that, on average, phones are produced faster in languages where they are less surprising, and vice versa. Further, we confirm that more surprising phones are longer, on average, in 319 languages out of the 600. We thus conclude that there is strong evidence of a surprisal--duration trade-off in operation, both across and within the world's languages.

* Accepted for publication in EMNLP 2021. Code available at https://github.com/rycolab/surprisal-duration-tradeoff

On Homophony and Rényi Entropy

Sep 28, 2021

Tiago Pimentel, Clara Meister, Simone Teufel, Ryan Cotterell

Abstract: Homophony's widespread presence in natural languages is a controversial topic. Recent theories of language optimality have tried to justify its prevalence, despite its negative effects on cognitive processing time; e.g., Piantadosi et al. (2012) argued homophony enables the reuse of efficient wordforms and is thus beneficial for languages. This hypothesis has recently been challenged by Trott and Bergen (2020), who posit that good wordforms are more often homophonous simply because they are more phonotactically probable. In this paper, we join in on the debate. We first propose a new information-theoretic quantification of a language's homophony: the sample Rényi entropy. Then, we use this quantification to revisit Trott and Bergen's claims. While their point is theoretically sound, a specific methodological issue in their experiments raises doubts about their results. After addressing this issue, we find no clear pressure either towards or against homophony -- a much more nuanced result than either Piantadosi et al.'s or Trott and Bergen's findings.

* Accepted for publication in EMNLP 2021. Code available at https://github.com/rycolab/homophony-as-renyi-entropy
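
For readers unfamiliar with the quantity, the sketch below computes the standard Rényi entropy of order alpha for a known discrete distribution. It shows the textbook definition only; the sample-based quantity proposed in the paper is not specified in the abstract, and the function name is hypothetical.

```python
import math


def renyi_entropy(probs, alpha):
    """Rényi entropy of order alpha (in nats) for a discrete distribution.

    H_alpha(p) = log(sum_x p(x)^alpha) / (1 - alpha) for alpha != 1.
    At alpha == 1 the limit is the Shannon entropy, handled separately.
    """
    support = [p for p in probs if p > 0]
    if alpha == 1:
        return -sum(p * math.log(p) for p in support)
    return math.log(sum(p ** alpha for p in support)) / (1 - alpha)
```

For a uniform distribution every order gives the same value, the log of the support size; for a skewed distribution the value decreases as alpha grows, because higher orders weight frequent outcomes more heavily.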

Revisiting the Uniform Information Density Hypothesis

Sep 23, 2021

Clara Meister, Tiago Pimentel, Patrick Haller, Lena Jäger, Ryan Cotterell, Roger Levy

Abstract: The uniform information density (UID) hypothesis posits a preference among language users for utterances structured such that information is distributed uniformly across a signal. While its implications on language production have been well explored, the hypothesis potentially makes predictions about language comprehension and linguistic acceptability as well. Further, it is unclear how uniformity in a linguistic signal -- or lack thereof -- should be measured, and over which linguistic unit, e.g., the sentence or language level, this uniformity should hold. Here we investigate these facets of the UID hypothesis using reading time and acceptability data. While our reading time results are generally consistent with previous work, they are also consistent with a weakly super-linear effect of surprisal, which would be compatible with UID's predictions. For acceptability judgments, we find clearer evidence that non-uniformity in information density is predictive of lower acceptability. We then explore multiple operationalizations of UID, motivated by different interpretations of the original hypothesis, and analyze the scope over which the pressure towards uniformity is exerted. The explanatory power of a subset of the proposed operationalizations suggests that the strongest trend may be a regression towards a mean surprisal across the language, rather than the phrase, sentence, or document -- a finding that supports a typical interpretation of UID, namely that it is the byproduct of language users maximizing the use of a (hypothetical) communication channel.

* Proceedings of EMNLP 2021

Conditional Poisson Stochastic Beam Search

Sep 22, 2021

Clara Meister, Afra Amini, Tim Vieira, Ryan Cotterell

Abstract: Beam search is the default decoding strategy for many sequence generation tasks in NLP. The set of approximate K-best items returned by the algorithm is a useful summary of the distribution for many applications; however, the candidates typically exhibit high overlap and may give a highly biased estimate for expectations under our model. These problems can be addressed by instead using stochastic decoding strategies. In this work, we propose a new method for turning beam search into a stochastic process: Conditional Poisson stochastic beam search. Rather than taking the maximizing set at each iteration, we sample K candidates without replacement according to the conditional Poisson sampling design. We view this as a more natural alternative to Kool et al. (2019)'s stochastic beam search (SBS). Furthermore, we show how samples generated under the CPSBS design can be used to build consistent estimators and sample diverse sets from sequence models. In our experiments, we observe CPSBS produces lower variance and more efficient estimators than SBS, even showing improvements in high entropy settings.

* Proceedings of EMNLP 2021
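
As a rough illustration of the sampling design named above, the sketch below draws a size-K subset under a conditional Poisson design, in which the probability of a subset is proportional to the product of its members' weights. It enumerates all subsets by brute force, so it only suits tiny candidate sets; the abstract does not say how weights are derived from model probabilities or how sampling is made efficient, so both the weights and the function names here are assumptions.

```python
import itertools
import math
import random


def conditional_poisson_design(weights, k):
    """Return {subset: probability} over all size-k subsets.

    weights: dict mapping item -> positive weight (hypothetical input).
    Under the conditional Poisson design, P(S) is proportional to the
    product of the weights of the items in S, for |S| == k.
    """
    subsets = list(itertools.combinations(sorted(weights), k))
    scores = [math.prod(weights[item] for item in s) for s in subsets]
    total = sum(scores)
    return {s: score / total for s, score in zip(subsets, scores)}


def sample_subset(weights, k, rng=random):
    """Sample one size-k subset (k items without replacement)."""
    design = conditional_poisson_design(weights, k)
    subsets = list(design)
    probs = [design[s] for s in subsets]
    return rng.choices(subsets, weights=probs, k=1)[0]
```

Replacing the per-step top-K selection of beam search with a draw like `sample_subset` is the general idea the abstract describes; a practical version would avoid enumerating every subset.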

A Plug-and-Play Method for Controlled Text Generation

Sep 20, 2021

Damian Pascual, Beni Egressy, Clara Meister, Ryan Cotterell, Roger Wattenhofer

Abstract: Large pre-trained language models have repeatedly shown their ability to produce fluent text. Yet even when starting from a prompt, generation can continue in many plausible directions. Current decoding methods with the goal of controlling generation, e.g., to ensure specific words are included, either require additional models or fine-tuning, or work poorly when the task at hand is semantically unconstrained, e.g., story generation. In this work, we present a plug-and-play decoding method for controlled language generation that is so simple and intuitive, it can be described in a single sentence: given a topic or keyword, we add a shift to the probability distribution over our vocabulary towards semantically similar words. We show how annealing this distribution can be used to impose hard constraints on language generation, something no other plug-and-play method is currently able to do with SOTA language generators. Despite the simplicity of this approach, we see it works incredibly well in practice: decoding from GPT-2 leads to diverse and fluent sentences while guaranteeing the appearance of given guide words. We perform two user studies, revealing that (1) our method outperforms competing methods in human evaluations; and (2) forcing the guide words to appear in the generated text has no impact on the fluency of the generated text.

* Findings of EMNLP 2021
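
The one-sentence description above can be sketched as follows. This is a hedged illustration, not the paper's method: applying the shift to logits, measuring similarity with cosine similarity over word vectors, and the `strength` parameter are all assumptions made here, and the toy embeddings are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def shifted_distribution(logits, embeddings, guide_vec, strength=5.0):
    """Shift next-token logits towards words similar to a guide word.

    logits:     dict token -> model logit (hypothetical input).
    embeddings: dict token -> word vector (hypothetical input).
    strength:   assumed scale of the shift; 0 leaves the model untouched.
    Returns a dict token -> probability after a softmax.
    """
    shifted = {
        t: logits[t] + strength * cosine(embeddings[t], guide_vec)
        for t in logits
    }
    peak = max(shifted.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - peak) for t, s in shifted.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


if __name__ == "__main__":
    toy_logits = {"ocean": 0.0, "sea": 0.0, "car": 0.0}
    toy_vectors = {"ocean": [1.0, 0.0], "sea": [0.9, 0.1], "car": [0.0, 1.0]}
    print(shifted_distribution(toy_logits, toy_vectors, [1.0, 0.0]))
```

Raising `strength` over the course of generation concentrates more and more mass on the words nearest the guide word, which is one way to read the annealing idea mentioned in the abstract.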
