Ryan Cotterell

ETH Zurich

Correlation Does Not Imply Compensation: Complexity and Irregularity in the Lexicon

Jun 07, 2024

Amanda Doucette, Ryan Cotterell, Morgan Sonderegger, Timothy J. O'Donnell

Abstract:It has been claimed that within a language, morphologically irregular words are more likely to be phonotactically simple and morphologically regular words are more likely to be phonotactically complex. This inverse correlation has been demonstrated in English for a small sample of words, but has yet to be shown for a larger sample of languages. Furthermore, frequency and word length are known to influence both phonotactic complexity and morphological irregularity, and they may be confounding factors in this relationship. Therefore, we examine the relationships between all pairs of these four variables both to assess the robustness of previous findings using improved methodology and as a step towards understanding the underlying causal relationship. Using information-theoretic measures of phonotactic complexity and morphological irregularity (Pimentel et al., 2020; Wu et al., 2019) on 25 languages from UniMorph, we find that there is evidence of a positive relationship between morphological irregularity and phonotactic complexity within languages on average, although the direction varies within individual languages. We also find weak evidence of a negative relationship between word length and morphological irregularity that had not been previously identified, and that some existing findings about the relationships between these four variables are not as robust as previously thought.

* To appear in Proceedings of the Society for Computation in Linguistics 2024

What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages

Jun 07, 2024

Nadav Borenstein, Anej Svete, Robin Chan, Josef Valvoda, Franz Nowak, Isabelle Augenstein, Eleanor Chodroff, Ryan Cotterell

Abstract:What can large language models learn? By definition, language models (LMs) are distributions over strings. Therefore, an intuitive way of addressing the above question is to formalize it as a matter of learnability of classes of distributions over strings. While prior work in this direction focused on assessing the theoretical limits, we seek to understand the empirical learnability. Unlike prior empirical work, we evaluate neural LMs on their home turf (learning probabilistic languages) rather than as classifiers of formal languages. In particular, we investigate the learnability of regular LMs (RLMs) by RNN and Transformer LMs. We empirically test the learnability of RLMs as a function of various complexity parameters of the RLM and the hidden state size of the neural LM. We find that the RLM rank, which corresponds to the size of the linear space spanned by the logits of its conditional distributions, and the expected length of sampled strings are strong and significant predictors of learnability for both RNNs and Transformers. Several other predictors also reach significance, but with differing patterns between RNNs and Transformers.

* Accepted to ACL 2024

What Do Language Models Learn in Context? The Structured Task Hypothesis

Jun 06, 2024

Jiaoda Li, Yifan Hou, Mrinmaya Sachan, Ryan Cotterell

Abstract:Large language models (LLMs) exhibit an intriguing ability to learn a novel task from in-context examples presented in a demonstration, termed in-context learning (ICL). Understandably, a swath of research has been dedicated to uncovering the theories underpinning ICL. One popular hypothesis explains ICL by task selection. LLMs identify the task based on the demonstration and generalize it to the prompt. Another popular hypothesis is that ICL is a form of meta-learning, i.e., the models learn a learning algorithm at pre-training time and apply it to the demonstration. Finally, a third hypothesis argues that LLMs use the demonstration to select a composition of tasks learned during pre-training to perform ICL. In this paper, we empirically explore these three hypotheses that explain LLMs' ability to learn in context with a suite of experiments derived from common text classification tasks. We invalidate the first two hypotheses with counterexamples and provide evidence in support of the last hypothesis. Our results suggest an LLM could learn a novel task in context via composing tasks learned during pre-training.

* This work is published in ACL 2024

On Affine Homotopy between Language Encoders

Jun 04, 2024

Robin SM Chan, Reda Boumasmoud, Anej Svete, Yuxin Ren, Qipeng Guo, Zhijing Jin, Shauli Ravfogel, Mrinmaya Sachan, Bernhard Schölkopf, Mennatallah El-Assady (+1 more)

Abstract:Pre-trained language encoders, functions that represent text as vectors, are an integral component of many NLP tasks. We tackle a natural question in language encoder analysis: What does it mean for two encoders to be similar? We contend that a faithful measure of similarity needs to be intrinsic, that is, task-independent, yet still be informative of extrinsic similarity, that is, the performance on downstream tasks. It is common to consider two encoders similar if they are homotopic, i.e., if they can be aligned through some transformation. In this spirit, we study the properties of affine alignment of language encoders and its implications on extrinsic similarity. We find that while affine alignment is fundamentally an asymmetric notion of similarity, it is still informative of extrinsic similarity. We confirm this on datasets of natural language representations. Beyond providing useful bounds on extrinsic similarity, affine intrinsic similarity also allows us to begin uncovering the structure of the space of pre-trained encoders by defining an order over them.

* 10 pages

Lower Bounds on the Expressivity of Recurrent Neural Language Models

May 29, 2024

Anej Svete, Franz Nowak, Anisha Mohamed Sahabdeen, Ryan Cotterell

Abstract:The recent successes and spread of large neural language models (LMs) call for a thorough understanding of their computational ability. Describing their computational abilities through LMs' representational capacity is a lively area of research. However, investigation into the representational capacity of neural LMs has predominantly focused on their ability to recognize formal languages. For example, recurrent neural networks (RNNs) with Heaviside activations are tightly linked to regular languages, i.e., languages defined by finite-state automata (FSAs). Such results, however, fall short of describing the capabilities of RNN language models (LMs), which are definitionally distributions over strings. We take a fresh look at the representational capacity of RNN LMs by connecting them to probabilistic FSAs and demonstrate that RNN LMs with linearly bounded precision can express arbitrary regular LMs.
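To make the distinction between recognizing a language and defining a distribution over strings concrete, here is a minimal sketch of a probabilistic FSA. The two-state automaton, its weights, and the function name `string_probability` are hypothetical and chosen only for illustration; this is not the construction used in the paper.

```python
# Hypothetical deterministic probabilistic FSA over the alphabet {"a", "b"}.
# TRANSITIONS[state][symbol] = (next_state, probability of taking that arc)
INITIAL = 0
TRANSITIONS = {
    0: {"a": (0, 0.5), "b": (1, 0.25)},
    1: {"b": (1, 0.5)},
}
# Probability of stopping in each state. In every state, the outgoing
# arc weights plus the stopping weight sum to 1 (an assumption of this sketch).
FINAL = {0: 0.25, 1: 0.5}


def string_probability(string):
    """Return the probability the automaton assigns to `string`."""
    state, prob = INITIAL, 1.0
    for symbol in string:
        arcs = TRANSITIONS[state]
        if symbol not in arcs:
            # No arc for this symbol: the string has probability zero.
            return 0.0
        state, weight = arcs[symbol]
        prob *= weight
    # Multiply in the probability of ending the string in the current state.
    return prob * FINAL[state]


if __name__ == "__main__":
    for s in ["", "a", "ab", "ba"]:
        print(repr(s), string_probability(s))
```

An unweighted FSA with the same arcs would only say whether a string such as "ab" is accepted; the weighted version additionally assigns it a probability, and the probabilities of all strings sum to 1.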

Joint Lemmatization and Morphological Tagging with LEMMING

May 28, 2024

Thomas Müller, Ryan Cotterell, Alexander Fraser, Hinrich Schütze

Abstract:We present LEMMING, a modular log-linear model that jointly models lemmatization and tagging and supports the integration of arbitrary global features. It is trainable on corpora annotated with gold standard tags and lemmata and does not rely on morphological dictionaries or analyzers. LEMMING sets the new state of the art in token-based statistical lemmatization on six languages; e.g., for Czech lemmatization, we reduce the error by 60%, from 4.05 to 1.58. We also give empirical evidence that jointly modeling morphological tags and lemmata is mutually beneficial.

* EMNLP 2015; Honorable Mention for Best Short Paper

A Transformer with Stack Attention

May 07, 2024

Jiaoda Li, Jennifer C. White, Mrinmaya Sachan, Ryan Cotterell

Abstract:Natural languages are believed to be (mildly) context-sensitive. Despite underpinning remarkably capable large language models, transformers are unable to model many context-free language tasks. In an attempt to address this limitation in the modeling power of transformer-based language models, we propose augmenting them with a differentiable, stack-based attention mechanism. Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model. We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages.

* NAACL 2024

Transformers Can Represent $n$-gram Language Models

Apr 23, 2024

Anej Svete, Ryan Cotterell

Abstract:Plenty of existing work has analyzed the abilities of the transformer architecture by describing its representational capacity with formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language acceptance. We contend that this is an ill-suited problem in the study of language models (LMs), which are definitionally probability distributions over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
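As a reminder of what an $n$-gram LM computes, the sketch below scores a string under a bigram model ($n = 2$). The probability table `COND`, the boundary symbols, and the function name `bigram_probability` are hypothetical and serve only as an illustration; the sketch does not reproduce the transformer construction from the paper.

```python
BOS, EOS = "<s>", "</s>"  # hypothetical begin- and end-of-string symbols

# Hypothetical bigram table: COND[previous][next] = p(next | previous).
# Each row sums to 1; pairs that are absent have probability 0.
COND = {
    BOS: {"a": 0.6, "b": 0.4},
    "a": {"a": 0.2, "b": 0.3, EOS: 0.5},
    "b": {"a": 0.5, EOS: 0.5},
}


def bigram_probability(symbols):
    """Probability of the whole string, including the end-of-string event."""
    padded = [BOS] + list(symbols) + [EOS]
    prob = 1.0
    # Multiply the conditional probability of each symbol given its predecessor.
    for previous, current in zip(padded, padded[1:]):
        prob *= COND.get(previous, {}).get(current, 0.0)
    return prob


if __name__ == "__main__":
    for s in ["a", "ab", "bb"]:
        print(s, bigram_probability(s))
```

Each factor depends only on the preceding symbol, which is the defining property of a bigram model; a general $n$-gram LM conditions on the previous $n - 1$ symbols instead.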

Low-Resource Named Entity Recognition with Cross-Lingual, Character-Level Neural Conditional Random Fields

Apr 14, 2024

Ryan Cotterell, Kevin Duh

Abstract:Low-resource named entity recognition is still an open problem in NLP. Most state-of-the-art systems require tens of thousands of annotated sentences in order to obtain high performance. However, for most of the world's languages, it is unfeasible to obtain such annotation. In this paper, we present a transfer learning scheme, whereby we train character-level neural CRFs to predict named entities for both high-resource languages and low-resource languages jointly. Learning character representations for multiple related languages allows transfer among the languages, improving F1 by up to 9.8 points over a log-linear CRF baseline.

* IJCNLP 2017

Labeled Morphological Segmentation with Semi-Markov Models

Apr 13, 2024

Ryan Cotterell, Thomas Müller, Alexander Fraser, Hinrich Schütze

Abstract:We present labeled morphological segmentation, an alternative view of morphological processing that unifies several tasks. From an annotation standpoint, we additionally introduce a new hierarchy of morphotactic tagsets. Finally, we develop Chipmunk, a discriminative morphological segmentation system that, contrary to previous work, explicitly models morphotactics. We show that Chipmunk yields improved performance on three tasks for all six languages: (i) morphological segmentation, (ii) stemming and (iii) morphological tag classification. On morphological segmentation, our method shows absolute improvements of 2-6 points $F_1$ over the baseline.

* CoNLL 2015
