Nathan Schneider

Unpacking Let Alone: Human-Scale Models Generalize to a Rare Construction in Form but not Meaning

Jun 04, 2025

Wesley Scivetti, Tatsuya Aoyama, Ethan Wilcox, Nathan Schneider

Abstract: Humans have a remarkable ability to acquire and understand grammatical phenomena that are seen rarely, if ever, during childhood. Recent evidence suggests that language models with human-scale pretraining data may possess a similar ability by generalizing from frequent to rare constructions. However, it remains an open question how widespread this generalization ability is, and to what extent this knowledge extends to meanings of rare constructions, as opposed to just their forms. We fill this gap by testing human-scale transformer language models on their knowledge of both the form and meaning of the (rare and quirky) English LET-ALONE construction. To evaluate our LMs, we construct a bespoke synthetic benchmark that targets syntactic and semantic properties of the construction. We find that human-scale LMs are sensitive to form, even when related constructions are filtered from the dataset. However, human-scale LMs do not make correct generalizations about LET-ALONE's meaning. These results point to an asymmetry in the current architectures' sample efficiency between language form and meaning, something which is not present in human language learners.

UD-English-CHILDES: A Collected Resource of Gold and Silver Universal Dependencies Trees for Child Language Interactions

Apr 28, 2025

Xiulin Yang, Zhuoxuan Ju, Lanni Bu, Zoey Liu, Nathan Schneider

Abstract: CHILDES is a widely used resource of transcribed child and child-directed speech. This paper introduces UD-English-CHILDES, the first officially released Universal Dependencies (UD) treebank derived from previously dependency-annotated CHILDES data with consistent and unified annotation guidelines. Our corpus harmonizes annotations from 11 children and their caregivers, totaling over 48k sentences. We validate existing gold-standard annotations under the UD v2 framework and provide an additional 1M silver-standard sentences, offering a consistent resource for computational and linguistic research.

Construction Identification and Disambiguation Using BERT: A Case Study of NPN

Mar 24, 2025

Wesley Scivetti, Nathan Schneider

Abstract: Construction Grammar hypothesizes that knowledge of a language consists chiefly of knowledge of form-meaning pairs ("constructions") that include vocabulary, general grammar rules, and even idiosyncratic patterns. Recent work has shown that transformer language models represent at least some constructional patterns, including ones where the construction is rare overall. In this work, we probe BERT's representation of the form and meaning of a minor construction of English, the NPN (noun-preposition-noun) construction -- exhibited in such expressions as "face to face" and "day to day" -- which is known to be polysemous. We construct a benchmark dataset of semantically annotated corpus instances (including distractors that superficially resemble the construction). With this dataset, we train and evaluate probing classifiers. They achieve decent discrimination of the construction from distractors, as well as sense disambiguation among true instances of the construction, revealing that BERT embeddings carry indications of the construction's semantics. Moreover, artificially permuting the word order of true construction instances causes them to be rejected, indicating sensitivity to matters of form. We conclude that BERT does latently encode at least some knowledge of the NPN construction going beyond a surface syntactic pattern and lexical cues.

* 8 pages, ACL long-paper format (preprint)

Natural Language Processing RELIES on Linguistics

May 09, 2024

Juri Opitz, Shira Wein, Nathan Schneider

Abstract: Large Language Models (LLMs) have become capable of generating highly fluent text in certain languages, without modules specially designed to capture grammar or semantic coherence. What does this mean for the future of linguistic expertise in NLP? We highlight several aspects in which NLP (still) relies on linguistics, or where linguistic thinking can illuminate new directions. We argue our case around the acronym RELIES that encapsulates six major facets where linguistics contributes to NLP: Resources, Evaluation, Low-resource settings, Interpretability, Explanation, and the Study of language. This list is not exhaustive, nor is linguistics the main point of reference for every effort under these themes; but at a macro level, these facets highlight the enduring importance of studying machine systems vis-a-vis systems of human language.

UCxn: Typologically Informed Annotation of Constructions Atop Universal Dependencies

Mar 26, 2024

Leonie Weissweiler, Nina Böbel, Kirian Guiller, Santiago Herrera, Wesley Scivetti, Arthur Lorenzi, Nurit Melnik, Archna Bhatia, Hinrich Schütze, Lori Levin (+4 more)

Abstract: The Universal Dependencies (UD) project has created an invaluable collection of treebanks with contributions in over 140 languages. However, the UD annotations do not tell the full story. Grammatical constructions that convey meaning through a particular combination of several morphosyntactic elements -- for example, interrogative sentences with special markers and/or word orders -- are not labeled holistically. We argue for (i) augmenting UD annotations with a 'UCxn' annotation layer for such meaning-bearing grammatical constructions, and (ii) approaching this in a typologically informed way so that morphosyntactic strategies can be compared across languages. As a case study, we consider five construction families in ten languages, identifying instances of each construction in UD treebanks through the use of morphosyntactic patterns. In addition to findings regarding these particular constructions, our study yields important insights on methodology for describing and identifying constructions in language-general and language-particular ways, and lays the foundation for future constructional enrichment of UD treebanks.

* LREC-COLING 2024

Syntactic Inductive Bias in Transformer Language Models: Especially Helpful for Low-Resource Languages?

Nov 01, 2023

Luke Gessler, Nathan Schneider

Abstract: A line of work on Transformer-based language models such as BERT has attempted to use syntactic inductive bias to enhance the pretraining process, on the theory that building syntactic structure into the training process should reduce the amount of data needed for training. But such methods are often tested for high-resource languages such as English. In this work, we investigate whether these methods can compensate for data sparseness in low-resource languages, hypothesizing that they ought to be more effective for low-resource languages. We experiment with five low-resource languages: Uyghur, Wolof, Maltese, Coptic, and Ancient Greek. We find that these syntactic inductive bias methods produce uneven results in low-resource settings, and provide surprisingly little benefit in most cases.

* Accepted at CoNLL 2023

AMR4NLI: Interpretable and robust NLI measures from semantic graphs

Jun 01, 2023

Juri Opitz, Shira Wein, Julius Steen, Anette Frank, Nathan Schneider

Abstract: The task of natural language inference (NLI) asks whether a given premise (expressed in NL) entails a given NL hypothesis. NLI benchmarks contain human ratings of entailment, but the meaning relationships driving these ratings are not formalized. Can the underlying sentence pair relationships be made more explicit in an interpretable yet robust fashion? We compare semantic structures to represent premise and hypothesis, including sets of contextualized embeddings and semantic graphs (Abstract Meaning Representations), and measure whether the hypothesis is a semantic substructure of the premise, utilizing interpretable metrics. Our evaluation on three English benchmarks finds value in both contextualized embeddings and semantic graphs; moreover, they provide complementary signals, and can be leveraged together in a hybrid model.

* International Conference on Computational Semantics (IWCS 2023)

CGELBank Annotation Manual v1.0

May 27, 2023

Brett Reynolds, Nathan Schneider, Aryaman Arora

Abstract: CGELBank is a treebank and associated tools based on a syntactic formalism for English derived from the Cambridge Grammar of the English Language. This document lays out the particularities of the CGELBank annotation scheme.

CuRIAM: Corpus re Interpretation and Metalanguage in U.S. Supreme Court Opinions

May 24, 2023

Michael Kranzlein, Nathan Schneider, Kevin Tobia

Abstract: Most judicial decisions involve the interpretation of legal texts; as such, judicial opinion requires the use of language as a medium to comment on or draw attention to other language. Language used this way is called metalanguage. We develop an annotation schema for categorizing types of legal metalanguage and apply our schema to a set of U.S. Supreme Court opinions, yielding a corpus totaling 59k tokens. We remark on several patterns observed in the kinds of metalanguage used by the justices.

Translationese Reduction using Abstract Meaning Representation

Apr 23, 2023

Shira Wein, Nathan Schneider

Abstract: Translated texts or utterances bear several hallmarks distinct from texts originating in the language. This phenomenon, known as translationese, is well-documented, and when found in training or test sets can affect model performance. Still, mitigating the effect of translationese in human-translated text is understudied. We hypothesize that Abstract Meaning Representation (AMR), a semantic representation which abstracts away from the surface form, can be used as an interlingua to reduce the amount of translationese in translated texts. By parsing English translations into an AMR graph and then generating text from that AMR, we obtain texts that more closely resemble non-translationese by macro-level measures. We show that across four metrics, and qualitatively, using AMR as an interlingua enables the reduction of translationese, and we compare our results to two additional approaches: one based on round-trip machine translation and one based on syntactically controlled generation.
