Nathan Schneider

Syntactic Inductive Bias in Transformer Language Models: Especially Helpful for Low-Resource Languages?

Nov 01, 2023
Luke Gessler, Nathan Schneider

A line of work on Transformer-based language models such as BERT has attempted to use syntactic inductive bias to enhance the pretraining process, on the theory that building syntactic structure into the training process should reduce the amount of data needed for training. But such methods are often tested only on high-resource languages such as English. In this work, we investigate whether these methods can compensate for data sparseness in low-resource languages, hypothesizing that they ought to be more effective for low-resource languages. We experiment with five low-resource languages: Uyghur, Wolof, Maltese, Coptic, and Ancient Greek. We find that these syntactic inductive bias methods produce uneven results in low-resource settings, and provide surprisingly little benefit in most cases.

* Accepted at CoNLL 2023 

AMR4NLI: Interpretable and robust NLI measures from semantic graphs

Jun 01, 2023
Juri Opitz, Shira Wein, Julius Steen, Anette Frank, Nathan Schneider

The task of natural language inference (NLI) asks whether a given premise (expressed in natural language) entails a given natural-language hypothesis. NLI benchmarks contain human ratings of entailment, but the meaning relationships driving these ratings are not formalized. Can the underlying sentence pair relationships be made more explicit in an interpretable yet robust fashion? We compare semantic structures to represent premise and hypothesis, including sets of contextualized embeddings and semantic graphs (Abstract Meaning Representations), and measure whether the hypothesis is a semantic substructure of the premise, utilizing interpretable metrics. Our evaluation on three English benchmarks finds value in both contextualized embeddings and semantic graphs; moreover, they provide complementary signals, and can be leveraged together in a hybrid model.

* International Conference on Computational Semantics (IWCS 2023) 
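The substructure idea in this abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's actual metric: each sentence is hand-encoded as a set of AMR-style triples (a real system would obtain graphs from an AMR parser and use proper graph-matching metrics), and entailment is approximated by the fraction of the hypothesis's triples that also appear in the premise.

```python
# Illustrative sketch of substructure-based NLI scoring (not the paper's metric).
# Triples are (source, relation, target), in the style of AMR edges.

def substructure_score(premise_triples, hypothesis_triples):
    """Fraction of hypothesis triples also present in the premise.

    1.0 means the hypothesis is a full semantic substructure of the
    premise (suggesting entailment); lower values suggest non-entailment.
    """
    hyp = set(hypothesis_triples)
    if not hyp:
        return 1.0  # an empty hypothesis is trivially contained
    return len(hyp & set(premise_triples)) / len(hyp)

# "The girl quickly ate an apple" vs. hypothesis "The girl ate an apple"
premise = {("eat", ":ARG0", "girl"), ("eat", ":ARG1", "apple"),
           ("eat", ":manner", "quick")}
hypothesis = {("eat", ":ARG0", "girl"), ("eat", ":ARG1", "apple")}

print(substructure_score(premise, hypothesis))  # 1.0 -> hypothesis contained
print(substructure_score(hypothesis, premise))  # ~0.67 -> premise adds content
```

Note the asymmetry: the premise may contain more material than the hypothesis without breaking entailment, which is why containment (rather than symmetric similarity) is the natural shape for this task.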

CGELBank Annotation Manual v1.0

May 27, 2023
Brett Reynolds, Nathan Schneider, Aryaman Arora

CGELBank is a treebank and associated tools based on a syntactic formalism for English derived from the Cambridge Grammar of the English Language. This document lays out the particularities of the CGELBank annotation scheme.

CuRIAM: Corpus re Interpretation and Metalanguage in U.S. Supreme Court Opinions

May 24, 2023
Michael Kranzlein, Nathan Schneider, Kevin Tobia

Most judicial decisions involve the interpretation of legal texts; as such, judicial opinions require the use of language as a medium to comment on or draw attention to other language. Language used this way is called metalanguage. We develop an annotation schema for categorizing types of legal metalanguage and apply our schema to a set of U.S. Supreme Court opinions, yielding a corpus totaling 59k tokens. We remark on several patterns observed in the kinds of metalanguage used by the justices.

Translationese Reduction using Abstract Meaning Representation

Apr 23, 2023
Shira Wein, Nathan Schneider

Translated texts or utterances bear several hallmarks that distinguish them from texts originally composed in the language. This phenomenon, known as translationese, is well documented, and when it appears in training or test sets it can affect model performance. Still, work to mitigate the effect of translationese in human-translated text is understudied. We hypothesize that Abstract Meaning Representation (AMR), a semantic representation that abstracts away from the surface form, can be used as an interlingua to reduce the amount of translationese in translated texts. By parsing English translations into an AMR graph and then generating text from that AMR, we obtain texts that more closely resemble non-translationese by macro-level measures. We show, across four metrics and qualitatively, that using AMR as an interlingua enables the reduction of translationese, and we compare our results to two additional approaches: one based on round-trip machine translation and one based on syntactically controlled generation.

Are UD Treebanks Getting More Consistent? A Report Card for English UD

Feb 01, 2023
Amir Zeldes, Nathan Schneider

Recent efforts to consolidate guidelines and treebanks in the Universal Dependencies project raise the expectation that joint training and dataset comparison are increasingly possible for high-resource languages such as English, which have multiple corpora. Focusing on the two largest UD English treebanks, we examine progress in data consolidation and answer several questions: Are UD English treebanks becoming more internally consistent? Are they becoming more like each other, and to what extent? Is joint training a good idea, and if so, since which UD version? Our results indicate that while consolidation has made progress, joint models may still suffer from inconsistencies, which hamper their ability to leverage a larger pool of training data.

* Proceedings of the Sixth Workshop on Universal Dependencies (UDW 2023) 

Sentence-level Feedback Generation for English Language Learners: Does Data Augmentation Help?

Dec 18, 2022
Shabnam Behzad, Amir Zeldes, Nathan Schneider

In this paper, we present strong baselines for the task of Feedback Comment Generation for Writing Learning. Given a sentence and an error span, the task is to generate a feedback comment explaining the error. Sentences and feedback comments are both in English. We experiment with LLMs and also create multiple pseudo datasets for the task, investigating how they affect the performance of our system. We present our results for the task along with extensive analysis of the generated comments, with the aim of aiding future studies in feedback comment generation for English language learners.

* GenChal 2022: FCG, INLG 2023 

Measuring Fine-Grained Semantic Equivalence with Abstract Meaning Representation

Oct 06, 2022
Shira Wein, Zhuxin Wang, Nathan Schneider

Identifying semantically equivalent sentences is important for many cross-lingual and mono-lingual NLP tasks. Current approaches to semantic equivalence take a loose, sentence-level approach to "equivalence," despite previous evidence that fine-grained differences and implicit content have an effect on human understanding (Roth and Anthonio, 2021) and system performance (Briakou and Carpuat, 2021). In this work, we introduce a novel, more sensitive method of characterizing semantic equivalence that leverages Abstract Meaning Representation graph structures. We develop an approach that can be used with either gold or automatic AMR annotations, and demonstrate that our solution is in fact finer-grained than existing corpus filtering methods and more accurate at predicting strictly equivalent sentences than existing semantic similarity metrics. We suggest that our finer-grained measure of semantic equivalence could limit the workload in the task of human post-edited machine translation and in human evaluation of sentence similarity.
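One way to see why a graph-based comparison can be finer-grained than a symmetric similarity score: strict equivalence requires that each sentence's semantic material be contained in the other, in both directions. The sketch below is illustrative only, not the paper's method; sentences are hand-encoded as AMR-style triples, where a real system would parse them automatically.

```python
# Illustrative sketch: strict semantic equivalence as bidirectional containment
# over AMR-style (source, relation, target) triples. Not the paper's method.

def containment(a, b):
    """Fraction of b's triples that also occur in a."""
    b = set(b)
    return len(set(a) & b) / len(b) if b else 1.0

def strictly_equivalent(s1, s2, threshold=1.0):
    # A symmetric similarity metric can score high when one sentence merely
    # entails the other; requiring containment in both directions filters
    # out such one-directional cases.
    return containment(s1, s2) >= threshold and containment(s2, s1) >= threshold

base = {("buy", ":ARG0", "woman"), ("buy", ":ARG1", "car")}
extra = base | {("buy", ":time", "yesterday")}  # added content -> not equivalent

print(strictly_equivalent(base, set(base)))  # True: identical structure
print(strictly_equivalent(base, extra))      # False: extra triple unmatched
```

The `threshold` parameter is a hypothetical knob for tolerating small annotation discrepancies; with gold graphs one would keep it at 1.0 for strict equivalence.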

CGELBank: CGEL as a Framework for English Syntax Annotation

Oct 01, 2022
Brett Reynolds, Aryaman Arora, Nathan Schneider

We introduce the syntactic formalism of the Cambridge Grammar of the English Language (CGEL) to the world of treebanking through the CGELBank project. We discuss some issues in linguistic analysis that arose in adapting the formalism to corpus annotation, followed by quantitative and qualitative comparisons with parallel UD and PTB treebanks. We argue that CGEL provides a good tradeoff between comprehensiveness of analysis and usability for annotation, which motivates expanding the treebank with automatic conversion in the future.

* 11 pages (8 main text) 

MASALA: Modelling and Analysing the Semantics of Adpositions in Linguistic Annotation of Hindi

May 08, 2022
Aryaman Arora, Nitin Venkateswaran, Nathan Schneider

We present a completed, publicly available corpus of annotated semantic relations of adpositions and case markers in Hindi. We used the multilingual SNACS annotation scheme, which has been applied to a variety of typologically diverse languages. Building on past work examining linguistic problems in SNACS annotation, we use language models to attempt automatic labelling of SNACS supersenses in Hindi and achieve results competitive with past work on English. We look towards upstream applications in semantic role labelling and extension to related languages such as Gujarati.

* 9 pages (6 main text). To appear at LREC 2022 