Paola Merlo

U. of Pennsylvania and University of Geneva

Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew

Feb 05, 2026

Giuseppe Samo, Paola Merlo

Abstract: We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish -- with its transparent morphological markers -- both monolingual and multilingual models succeed, whether tokenization is atomic or breaks words into small subword units. For Hebrew, instead, monolingual and multilingual models diverge. A multilingual model using character-level tokenization fails to capture the language's non-concatenative morphology, but a monolingual model with morpheme-aware segmentation performs well. Performance improves on more synthetic datasets, in all models.

* 13 pages, 7 figures, to appear as proceedings of the SIGTURK 2026 Workshop

Analogical Structure, Minimal Contextual Cues and Contrastive Distractors: Input Design for Sample-Efficient Linguistic Rule Induction

Nov 13, 2025

Chunyang Jiang, Paola Merlo

Abstract: Large language models achieve strong performance through training on vast datasets. Can analogical paradigm organization enable lightweight models to match this performance with minimal data? We develop a computational approach implementing three cognitive-inspired principles: analogical structure, contrastive learning, and minimal contextual cues. We test this approach with structured completion tasks where models identify correct sentence completions from analogical patterns with contrastive alternatives. Training lightweight models (BERT+CNN, 0.5M parameters) on only one hundred structured examples of English causative/inchoative alternations achieves F1 = 0.95, outperforming zero-shot GPT-o3 (F1 = 0.87). Ablation studies confirm that analogical organization and contrastive structure improve performance, consistently surpassing randomly shuffled baselines across architectures. Cross-phenomenon validation using unspecified object alternations replicates these efficiency gains, confirming approach robustness. Our results show that analogical paradigm organization enables competitive linguistic rule learning with orders of magnitude less data than conventional approaches require.

Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement

Sep 10, 2024

Vivi Nastase, Chunyang Jiang, Giuseppe Samo, Paola Merlo

Abstract: In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon -- subject-verb agreement across a variety of sentence structures -- in several languages. Finding a solution to this task requires a system to detect complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps -- detect syntactic objects and their properties in individual sentences, and find patterns across an input sequence of sentences -- we show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not shared, even across closely related languages.

* 11 pages, 5 tables, 5 figures

Exploring Italian sentence embeddings properties through multi-tasking

Sep 10, 2024

Vivi Nastase, Giuseppe Samo, Chunyang Jiang, Paola Merlo

Abstract: We investigate to what degree existing LLMs encode abstract linguistic information in Italian in a multi-task setting. We exploit curated synthetic data on a large scale -- several Blackbird Language Matrices (BLMs) problems in Italian -- and use them to study how sentence representations built using pre-trained language models encode specific syntactic and semantic information. We use a two-level architecture to model separately a compression of the sentence embeddings into a representation that contains relevant information for a task, and a BLM task. We then investigate whether we can obtain compressed sentence representations that encode syntactic and semantic information relevant to several BLM tasks. While we expected that the sentence structure -- in terms of sequence of phrases/chunks -- and chunk properties could be shared across tasks, performance and error analysis show that the clues for the different tasks are encoded in different manners in the sentence embeddings, suggesting that abstract linguistic notions such as constituents or thematic roles do not seem to be present in the pretrained sentence embeddings.

* 9 pages, 9 figures, 3 tables

Tracking linguistic information in transformer-based sentence embeddings through targeted sparsification

Jul 25, 2024

Vivi Nastase, Paola Merlo

Abstract: Analyses of transformer-based models have shown that they encode a variety of linguistic information from their textual input. While these analyses have shed light on the relation between linguistic information on one side, and internal architecture and parameters on the other, a question remains unanswered: how is this linguistic information reflected in sentence embeddings? Using datasets consisting of sentences with known structure, we test to what degree information about chunks (in particular noun, verb or prepositional phrases), such as grammatical number, or semantic role, can be localized in sentence embeddings. Our results show that such information is not distributed over the entire sentence embedding, but rather it is encoded in specific regions. Understanding how the information from an input text is compressed into sentence embeddings helps understand current transformer models and helps build future explainable neural models.

* 12 pages, 9 figures, 1 table, published in RepL4NLP 2024

Are there identifiable structural parts in the sentence embedding whole?

Jun 24, 2024

Vivi Nastase, Paola Merlo

Abstract: Sentence embeddings from transformer models encode much linguistic information in a fixed-length vector. We explore the hypothesis that these embeddings consist of overlapping layers of information that can be separated, and on which specific types of information -- such as information about chunks and their structural and semantic properties -- can be detected. We show that this is the case using a dataset consisting of sentences with known chunk structure, and two linguistic intelligence datasets, whose solution relies on detecting chunks and their grammatical number and, respectively, their semantic roles, and through analyses of the performance on the tasks and of the internal representations built during learning.

* 17 pages, 14 figures, 5 tables

Disentangling continuous and discrete linguistic signals in transformer-based sentence embeddings

Dec 18, 2023

Vivi Nastase, Paola Merlo

Abstract: Sentence and word embeddings encode structural and semantic information in a distributed manner. Part of the information encoded -- particularly lexical information -- can be seen as continuous, whereas other information -- like structural information -- is most often discrete. We explore whether we can compress transformer-based sentence embeddings into a representation that separates different linguistic signals -- in particular, information relevant to subject-verb agreement and verb alternations. We show that by compressing an input sequence that shares a targeted phenomenon into the latent layer of a variational autoencoder-like system, the targeted linguistic information becomes more explicit. A latent layer with both discrete and continuous components captures the targeted phenomena better than a latent layer with only discrete or only continuous components. These experiments are a step towards separating linguistic signals from distributed text embeddings and linking them to more symbolic representations.

Grammatical information in BERT sentence embeddings as two-dimensional arrays

Dec 15, 2023

Vivi Nastase, Paola Merlo

Abstract: Sentence embeddings induced with various transformer architectures encode much semantic and syntactic information in a distributed manner in a one-dimensional array. We investigate whether specific grammatical information can be accessed in these distributed representations. Using data from a task developed to test rule-like generalizations, our experiments on detecting subject-verb agreement yield several promising results. First, we show that while the usual sentence representations encoded as one-dimensional arrays do not easily support extraction of rule-like regularities, a two-dimensional reshaping of these vectors allows various learning architectures to access such information. Next, we show that various architectures can detect patterns in these two-dimensional reshaped sentence embeddings and successfully learn a model based on smaller amounts of simpler training data, which performs well on more complex test data. This indicates that current sentence embeddings contain information that is regularly distributed, and which can be captured when the embeddings are reshaped into higher-dimensional arrays. Our results cast light on representations produced by language models and help move towards developing few-shot learning approaches.

* Published in the Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)

Blackbird language matrices, a new task for rule-like generalization in neural networks: Motivations and Formal Specifications

Jun 20, 2023

Paola Merlo

Abstract: We motivate and formally define a new task for fine-tuning rule-like generalization in large language models. It is conjectured that the shortcomings of current LLMs are due to a lack of ability to generalize. It has been argued that, instead, humans are better at generalization because they have a tendency to extract rules from complex data. We try to recreate this tendency towards rule-based generalization. When exposed to tests of analytic intelligence, for example, the visual RAVEN IQ test, human problem-solvers identify the relevant objects in the picture and their relevant attributes and reason based on rules applied to these objects and attributes. Based on the induced rules, they are able to provide a solution to the test. We propose a task that translates this IQ task into language. In this paper, we provide the formal specification for the task and the generative process of its datasets.

* 7 pages, 6 figures. arXiv admin note: text overlap with arXiv:2205.10866

Blackbird's language matrices: a new benchmark to investigate disentangled generalisation in neural networks

May 22, 2022

Paola Merlo, Aixiu An, Maria A. Rodriguez

Abstract: Current successes of machine learning architectures are based on computationally expensive algorithms and prohibitively large amounts of data. We need to develop tasks and data to train networks to reach more complex and more compositional skills. In this paper, we illustrate Blackbird's language matrices (BLMs), a novel grammatical dataset developed to test a linguistic variant of Raven's progressive matrices, an intelligence test usually based on visual stimuli. The dataset consists of 44800 sentences, generatively constructed to support investigations of current models' linguistic mastery of grammatical agreement rules and their ability to generalise them. We present the logic of the dataset, the method to automatically construct data on a large scale and the architecture to learn them. Through error analysis and several experiments on variations of the dataset, we demonstrate that this language task and the data that instantiate it provide a new challenging testbed to understand generalisation and abstraction.

* 15 pages, 9 figures, 1 table
