David Arps

Understanding Syntactic Generalization in Structure-inducing Language Models

Aug 11, 2025

David Arps, Hassan Sajjad, Laura Kallmeyer

Abstract: Structure-inducing Language Models (SiLM) are trained on a self-supervised language modeling task, and induce a hierarchical sentence representation as a byproduct when processing an input. A wide variety of SiLMs have been proposed. However, these have typically been evaluated on a relatively small scale, and evaluation of these models has systematic gaps and lacks comparability. In this work, we study three different SiLM architectures using both natural language (English) corpora and synthetic bracketing expressions: StructFormer (Shen et al., 2021), UDGN (Shen et al., 2022) and GPST (Hu et al., 2024). We compare them with respect to (i) properties of the induced syntactic representations, (ii) performance on grammaticality judgment tasks, and (iii) training dynamics. We find that none of the three architectures dominates across all evaluation metrics. However, there are significant differences, in particular with respect to the induced syntactic representations. The Generative Pretrained Structured Transformer (GPST; Hu et al., 2024) performs most consistently across evaluation settings, and outperforms the other models on long-distance dependencies in bracketing expressions. Furthermore, our study shows that small models trained on large amounts of synthetic data provide a useful testbed for evaluating basic model properties.

* Code available at https://github.com/davidarps/silm

Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models

Dec 11, 2024

Sri Harsha Dumpala, David Arps, Sageev Oore, Laura Kallmeyer, Hassan Sajjad

Abstract: Vision-language models (VLMs) serve as foundation models for multi-modal applications such as image captioning and text-to-image generation. Recent studies have highlighted limitations in VLM text encoders, particularly in areas like compositionality and semantic understanding, though the underlying reasons for these limitations remain unclear. In this work, we aim to address this gap by analyzing the syntactic information, one of the fundamental linguistic properties, encoded by the text encoders of VLMs. We perform a thorough analysis comparing VLMs with different objective functions, parameter sizes and training data sizes, and with uni-modal language models (ULMs), in their ability to encode syntactic knowledge. Our findings suggest that ULM text encoders acquire syntactic information more effectively than those in VLMs. The syntactic information learned by VLM text encoders is shaped primarily by the pre-training objective, which plays a more crucial role than other factors such as model architecture, model size, or the volume of pre-training data. Models exhibit different layer-wise trends: CLIP performance drops across layers, while for other models the middle layers are rich in syntactic knowledge.

Dialogue Ontology Relation Extraction via Constrained Chain-of-Thought Decoding

Aug 05, 2024

Renato Vukovic, David Arps, Carel van Niekerk, Benjamin Matthias Ruppik, Hsien-Chin Lin, Michael Heck, Milica Gašić

Abstract: State-of-the-art task-oriented dialogue systems typically rely on task-specific ontologies for fulfilling user queries. The majority of task-oriented dialogue data, such as customer service recordings, comes without ontology and annotation. Such ontologies are normally built manually, limiting the application of specialised systems. Dialogue ontology construction is an approach for automating that process and typically consists of two steps: term extraction and relation extraction. In this work, we focus on relation extraction in a transfer learning set-up. To improve the generalisation, we propose an extension to the decoding mechanism of large language models. We adapt Chain-of-Thought (CoT) decoding, recently developed for reasoning problems, to generative relation extraction. Here, we generate multiple branches in the decoding space and select the relations based on a confidence threshold. By constraining the decoding to ontology terms and relations, we aim to decrease the risk of hallucination. We conduct extensive experimentation on two widely used datasets and find improvements in performance on the target ontology for source fine-tuned and one-shot prompted large language models.

* Accepted to appear at SIGDIAL 2024. 9 pages, 4 figures
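
The selection step described in the abstract above (several decoding branches, kept only if they pass a confidence threshold and stay within the ontology) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Branch` type, the `confidence` and `select_relations` helpers, and the choice of confidence score (the mean gap between the top-1 and top-2 token probabilities, as commonly used in CoT decoding). The abstract describes constraining the decoding itself; the post-hoc vocabulary filter below only stands in for that constraint.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    """One decoded branch (hypothetical structure, not from the paper)."""
    relation: tuple[str, str, str]          # (head term, relation label, tail term)
    top2_probs: list[tuple[float, float]]   # per-token (top-1, top-2) probabilities


def confidence(branch: Branch) -> float:
    """Mean top-1 minus top-2 probability gap; 0.0 for an empty branch (assumed metric)."""
    margins = [p1 - p2 for p1, p2 in branch.top2_probs]
    return sum(margins) / len(margins) if margins else 0.0


def select_relations(branches, allowed_terms, allowed_relations, threshold):
    """Keep relations that use only known terms and labels and reach the threshold."""
    selected = []
    for branch in branches:
        head, label, tail = branch.relation
        # Stand-in for constrained decoding: drop anything outside the ontology vocabulary.
        if head not in allowed_terms or tail not in allowed_terms:
            continue
        if label not in allowed_relations:
            continue
        if confidence(branch) >= threshold:
            selected.append(branch.relation)
    return selected


# Toy example with made-up terms and probabilities.
branches = [
    Branch(("hotel", "has_slot", "price"), [(0.9, 0.05), (0.8, 0.1)]),
    Branch(("hotel", "has_slot", "area"), [(0.5, 0.4)]),
    Branch(("hotel", "has_slot", "unicorn"), [(0.99, 0.0)]),
]
kept = select_relations(
    branches,
    allowed_terms={"hotel", "price", "area"},
    allowed_relations={"has_slot"},
    threshold=0.5,
)
print(kept)
```

In this toy run only the first branch survives: the second falls below the threshold and the third uses a term outside the allowed vocabulary.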

Multilingual Nonce Dependency Treebanks: Understanding how LLMs represent and process syntactic structure

Nov 13, 2023

David Arps, Laura Kallmeyer, Younes Samih, Hassan Sajjad

Abstract: We introduce SPUD (Semantically Perturbed Universal Dependencies), a framework for creating nonce treebanks for the multilingual Universal Dependencies (UD) corpora. SPUD data satisfies syntactic argument structure, provides syntactic annotations, and ensures grammaticality via language-specific rules. We create nonce data in Arabic, English, French, German, and Russian, and demonstrate two use cases of SPUD treebanks. First, we investigate the effect of nonce data on word co-occurrence statistics, as measured by perplexity scores of autoregressive (ALM) and masked language models (MLM). We find that ALM scores are significantly more affected by nonce data than MLM scores. Second, we show how nonce data affects the performance of syntactic dependency probes. We replicate the findings of Müller-Eberstein et al. (2022) on nonce test data and show that the performance declines on both MLMs and ALMs with respect to the original test data. However, a majority of the performance is kept, suggesting that the probe indeed learns syntax independently from semantics.

* Our software is available at https://github.com/davidarps/spud
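
For readers unfamiliar with the perplexity scores mentioned in the abstract above, here is a minimal sketch of the standard definition for an autoregressive model: the exponential of the negative mean token log-probability. The function name `perplexity` and the toy probabilities are assumptions for illustration; the abstract does not say how scores are computed for masked language models, so this sketch does not cover that case.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log token probabilities: exp(-mean(log p))."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy example: a model that assigns probability 0.25 to each of four tokens.
uniform = [math.log(0.25)] * 4
print(perplexity(uniform))

# Lower token probabilities (as one might expect on nonce data) raise perplexity.
harder = [math.log(0.1)] * 4
print(perplexity(harder) > perplexity(uniform))
```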

Increasing The Performance of Cognitively Inspired Data-Efficient Language Models via Implicit Structure Building

Oct 31, 2023

Omar Momen, David Arps, Laura Kallmeyer

Abstract: In this paper, we describe our submission to the BabyLM Challenge 2023 shared task on data-efficient language model (LM) pretraining (Warstadt et al., 2023). We train transformer-based masked language models that incorporate unsupervised predictions about hierarchical sentence structure into the model architecture. Concretely, we use the StructFormer architecture (Shen et al., 2021) and variants thereof. StructFormer models have been shown to perform well on unsupervised syntactic induction based on limited pretraining data, and to yield performance improvements over a vanilla transformer architecture (Shen et al., 2021). Evaluation of our models on 39 tasks provided by the BabyLM challenge shows promising improvements on some particular tasks for models that integrate a hierarchical bias into the architecture, even though they fail to consistently outperform the RoBERTa baseline model provided by the shared task organizers on all tasks.

* Accepted at the BabyLM shared task at CoNLL 2023

Probing for Constituency Structure in Neural Language Models

Apr 13, 2022

David Arps, Younes Samih, Laura Kallmeyer, Hassan Sajjad

Abstract: In this paper, we investigate to what extent contextual neural language models (LMs) implicitly learn syntactic structure. More concretely, we focus on constituent structure as represented in the Penn Treebank (PTB). Using standard probing techniques based on diagnostic classifiers, we assess the accuracy of representing constituents of different categories within the neuron activations of an LM such as RoBERTa. In order to make sure that our probe focuses on syntactic knowledge and not on implicit semantic generalizations, we also experiment on a PTB version that is obtained by randomly replacing constituents with each other while keeping syntactic structure, i.e., a semantically ill-formed but syntactically well-formed version of the PTB. We find that four pretrained transformer LMs obtain high performance on our probing tasks even on manipulated data, suggesting that semantic and syntactic knowledge in their representations can be separated and that constituency information is in fact learned by the LM. Moreover, we show that a complete constituency tree can be linearly separated from LM representations.

* 20 pages, 9 figures, 9 tables
