Ryan Cotterell

ETH Zurich

Variational Best-of-N Alignment

Jul 08, 2024

Afra Amini, Tim Vieira, Ryan Cotterell

Abstract:Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. The algorithm works as follows: at inference time, N samples are drawn from the language model, and the sample with the highest reward, as judged by a reward model, is returned as the output. Despite its effectiveness, BoN is computationally expensive; it reduces sampling throughput by a factor of N. To make BoN more efficient at inference time, one strategy is to fine-tune the language model to mimic what BoN does during inference. To achieve this, we derive the distribution induced by the BoN algorithm. We then propose to fine-tune the language model to minimize backward KL divergence to the BoN distribution. Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN). To the extent this fine-tuning is successful and we end up with a good approximation, we have reduced the inference cost by a factor of N. Our experiments on a controlled generation task suggest that while variational BoN is not as effective as BoN in aligning language models, it is close to BoN performance, as vBoN appears more often on the Pareto frontier of reward and KL divergence than models trained with a KL-constrained RL objective.
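
The BoN procedure described in this abstract (draw N samples, return the one with the highest reward) can be sketched in a few lines. This is only an illustrative sketch, not the authors' code: `best_of_n`, `toy_sampler`, and `toy_reward` are hypothetical names, and the toy sampler and reward stand in for a real language model and reward model.

```python
import random
from typing import Callable, Dict, List


def best_of_n(sample: Callable[[], str], reward: Callable[[str], float], n: int) -> str:
    """Draw n samples and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [sample() for _ in range(n)]
    # max() returns the first candidate with the maximal reward on ties.
    return max(candidates, key=reward)


# Hypothetical stand-ins for a language model and a reward model.
TOY_REWARDS: Dict[str, float] = {"bad": 0.0, "okay": 1.0, "good": 2.0, "great": 3.0}


def toy_sampler(rng: random.Random) -> Callable[[], str]:
    """Return a sampler that draws uniformly from a tiny set of 'outputs'."""
    outputs = list(TOY_REWARDS)
    return lambda: rng.choice(outputs)


def toy_reward(text: str) -> float:
    """Look up the reward of a toy output."""
    return TOY_REWARDS[text]


if __name__ == "__main__":
    rng = random.Random(0)
    print(best_of_n(toy_sampler(rng), toy_reward, n=4))
```

Each call to `best_of_n` invokes the sampler N times, which is the factor-of-N inference cost that the abstract says vBoN aims to remove.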

On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning

Jun 20, 2024

Franz Nowak, Anej Svete, Alexandra Butoi, Ryan Cotterell

Abstract:The performance of modern language models (LMs) has been improved by chain-of-thought (CoT) reasoning, i.e., the process of generating intermediate results that guide the model towards a final answer. A possible explanation for this improvement is that CoT reasoning extends an LM's computational power, as RNNs and transformers with additional scratch space are known to be Turing complete. Comparing LMs to Turing machines, however, introduces a category error - Turing machines decide language membership, whereas LMs define distributions over strings. To bridge this gap, we formalize CoT reasoning in a probabilistic setting. We present several results on the representational capacity of recurrent and transformer LMs with CoT reasoning, showing that they can represent the same family of distributions over strings as probabilistic Turing machines.

* To be published at ACL 2024

A Fundamental Trade-off in Aligned Language Models and its Relation to Sampling Adaptors

Jun 14, 2024

Naaman Tan, Josef Valvoda, Anej Svete, Tianyu Liu, Yanxia Qin, Kan Min-Yen, Ryan Cotterell

Abstract:The relationship between the quality of a string and its probability $p(\boldsymbol{y})$ under a language model has been influential in the development of techniques to build good text generation systems. For example, several decoding algorithms have been motivated to manipulate $p(\boldsymbol{y})$ to produce higher-quality text. In this work, we examine the probability-quality relationship in language models explicitly aligned to human preferences, e.g., through Reinforcement Learning through Human Feedback (RLHF). We find that, given a general language model and its aligned version, for corpora sampled from an aligned language model, there exists a trade-off between the average reward and average log-likelihood of the strings under the general language model. We provide a formal treatment of this issue and demonstrate how a choice of sampling adaptor allows for a selection of how much likelihood we exchange for the reward.

Correlation Does Not Imply Compensation: Complexity and Irregularity in the Lexicon

Jun 07, 2024

Amanda Doucette, Ryan Cotterell, Morgan Sonderegger, Timothy J. O'Donnell

Abstract:It has been claimed that within a language, morphologically irregular words are more likely to be phonotactically simple and morphologically regular words are more likely to be phonotactically complex. This inverse correlation has been demonstrated in English for a small sample of words, but has yet to be shown for a larger sample of languages. Furthermore, frequency and word length are known to influence both phonotactic complexity and morphological irregularity, and they may be confounding factors in this relationship. Therefore, we examine the relationships between all pairs of these four variables both to assess the robustness of previous findings using improved methodology and as a step towards understanding the underlying causal relationship. Using information-theoretic measures of phonotactic complexity and morphological irregularity (Pimentel et al., 2020; Wu et al., 2019) on 25 languages from UniMorph, we find that there is evidence of a positive relationship between morphological irregularity and phonotactic complexity within languages on average, although the direction varies within individual languages. We also find weak evidence of a negative relationship between word length and morphological irregularity that had not been previously identified, and that some existing findings about the relationships between these four variables are not as robust as previously thought.

* To appear in Proceedings of the Society for Computation in Linguistics 2024

What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages

Jun 07, 2024

Nadav Borenstein, Anej Svete, Robin Chan, Josef Valvoda, Franz Nowak, Isabelle Augenstein, Eleanor Chodroff, Ryan Cotterell

Abstract:What can large language models learn? By definition, language models (LMs) are distributions over strings. Therefore, an intuitive way of addressing the above question is to formalize it as a matter of learnability of classes of distributions over strings. While prior work in this direction focused on assessing the theoretical limits, we seek to understand the empirical learnability. Unlike prior empirical work, we evaluate neural LMs on their home turf, learning probabilistic languages, rather than as classifiers of formal languages. In particular, we investigate the learnability of regular LMs (RLMs) by RNN and Transformer LMs. We empirically test the learnability of RLMs as a function of various complexity parameters of the RLM and the hidden state size of the neural LM. We find that the RLM rank, which corresponds to the size of the linear space spanned by the logits of its conditional distributions, and the expected length of sampled strings are strong and significant predictors of learnability for both RNNs and Transformers. Several other predictors also reach significance, but with differing patterns between RNNs and Transformers.

* Accepted to ACL 2024

What Do Language Models Learn in Context? The Structured Task Hypothesis

Jun 06, 2024

Jiaoda Li, Yifan Hou, Mrinmaya Sachan, Ryan Cotterell

Abstract:Large language models (LLMs) exhibit an intriguing ability to learn a novel task from in-context examples presented in a demonstration, termed in-context learning (ICL). Understandably, a swath of research has been dedicated to uncovering the theories underpinning ICL. One popular hypothesis explains ICL by task selection. LLMs identify the task based on the demonstration and generalize it to the prompt. Another popular hypothesis is that ICL is a form of meta-learning, i.e., the models learn a learning algorithm at pre-training time and apply it to the demonstration. Finally, a third hypothesis argues that LLMs use the demonstration to select a composition of tasks learned during pre-training to perform ICL. In this paper, we empirically explore these three hypotheses that explain LLMs' ability to learn in context with a suite of experiments derived from common text classification tasks. We invalidate the first two hypotheses with counterexamples and provide evidence in support of the last hypothesis. Our results suggest an LLM could learn a novel task in context via composing tasks learned during pre-training.

* This work is published in ACL 2024

On Affine Homotopy between Language Encoders

Jun 04, 2024

Robin SM Chan, Reda Boumasmoud, Anej Svete, Yuxin Ren, Qipeng Guo, Zhijing Jin, Shauli Ravfogel, Mrinmaya Sachan, Bernhard Schölkopf, Mennatallah El-Assady (+1 more)

Abstract:Pre-trained language encoders -- functions that represent text as vectors -- are an integral component of many NLP tasks. We tackle a natural question in language encoder analysis: What does it mean for two encoders to be similar? We contend that a faithful measure of similarity needs to be intrinsic, that is, task-independent, yet still be informative of extrinsic similarity -- the performance on downstream tasks. It is common to consider two encoders similar if they are homotopic, i.e., if they can be aligned through some transformation. In this spirit, we study the properties of affine alignment of language encoders and its implications on extrinsic similarity. We find that while affine alignment is fundamentally an asymmetric notion of similarity, it is still informative of extrinsic similarity. We confirm this on datasets of natural language representations. Beyond providing useful bounds on extrinsic similarity, affine intrinsic similarity also allows us to begin uncovering the structure of the space of pre-trained encoders by defining an order over them.

* 10 pages

Lower Bounds on the Expressivity of Recurrent Neural Language Models

May 29, 2024

Anej Svete, Franz Nowak, Anisha Mohamed Sahabdeen, Ryan Cotterell

Abstract:The recent successes and spread of large neural language models (LMs) call for a thorough understanding of their computational ability. Describing their computational abilities through LMs' representational capacity is a lively area of research. However, investigation into the representational capacity of neural LMs has predominantly focused on their ability to recognize formal languages. For example, recurrent neural networks (RNNs) with Heaviside activations are tightly linked to regular languages, i.e., languages defined by finite-state automata (FSAs). Such results, however, fall short of describing the capabilities of RNN language models (LMs), which are definitionally distributions over strings. We take a fresh look at the representational capacity of RNN LMs by connecting them to probabilistic FSAs and demonstrate that RNN LMs with linearly bounded precision can express arbitrary regular LMs.
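
To make the notion of a probabilistic FSA as a distribution over strings concrete, here is a minimal sketch of the standard forward computation of a string's probability. It does not reproduce the paper's RNN construction; the function name `string_probability` and the two-state automaton below are invented purely for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

State = int
# (source state, symbol) -> list of (target state, transition probability)
Transitions = Dict[Tuple[State, str], List[Tuple[State, float]]]


def string_probability(
    string: str,
    initial: Dict[State, float],
    transitions: Transitions,
    final: Dict[State, float],
) -> float:
    """Sum the probabilities of all accepting paths that spell out `string`."""
    forward: Dict[State, float] = dict(initial)
    for symbol in string:
        nxt: Dict[State, float] = defaultdict(float)
        for state, mass in forward.items():
            for target, prob in transitions.get((state, symbol), []):
                nxt[target] += mass * prob
        forward = nxt
    return sum(mass * final.get(state, 0.0) for state, mass in forward.items())


# Hypothetical two-state automaton over {a, b}. For each state, the outgoing
# transition probabilities plus the final (stopping) probability sum to 1.
INITIAL = {0: 1.0}
TRANSITIONS: Transitions = {
    (0, "a"): [(0, 0.5)],
    (0, "b"): [(1, 0.3)],
    (1, "b"): [(1, 0.4)],
}
FINAL = {0: 0.2, 1: 0.6}

if __name__ == "__main__":
    for s in ["", "a", "ab", "ba"]:
        print(repr(s), string_probability(s, INITIAL, TRANSITIONS, FINAL))
```

Unlike a recognizer, which would only answer whether a string such as "ab" is accepted, this automaton assigns every string a probability, which is the sense in which the abstract calls LMs distributions over strings.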

Joint Lemmatization and Morphological Tagging with LEMMING

May 28, 2024

Thomas Muller, Ryan Cotterell, Alexander Fraser, Hinrich Schütze

Abstract:We present LEMMING, a modular log-linear model that jointly models lemmatization and tagging and supports the integration of arbitrary global features. It is trainable on corpora annotated with gold standard tags and lemmata and does not rely on morphological dictionaries or analyzers. LEMMING sets the new state of the art in token-based statistical lemmatization on six languages; e.g., for Czech lemmatization, we reduce the error by 60%, from 4.05 to 1.58. We also give empirical evidence that jointly modeling morphological tags and lemmata is mutually beneficial.

* EMNLP 2015; Honorable Mention for Best Short Paper

A Transformer with Stack Attention

May 07, 2024

Jiaoda Li, Jennifer C. White, Mrinmaya Sachan, Ryan Cotterell

Abstract:Natural languages are believed to be (mildly) context-sensitive. Despite underpinning remarkably capable large language models, transformers are unable to model many context-free language tasks. In an attempt to address this limitation in the modeling power of transformer-based language models, we propose augmenting them with a differentiable, stack-based attention mechanism. Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model. We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages.

* NAACL 2024
