Martha Lewis

Shammie

Behavioural vs. Representational Systematicity in End-to-End Models: An Opinionated Survey

Jun 04, 2025

Ivan Vegner, Sydelle de Souza, Valentin Forch, Martha Lewis, Leonidas A. A. Doumas

Abstract:A core aspect of compositionality, systematicity is a desirable property in ML models as it enables strong generalization to novel contexts. This has led to numerous studies proposing benchmarks to assess systematic generalization, as well as models and training regimes designed to enhance it. Many of these efforts are framed as addressing the challenge posed by Fodor and Pylyshyn. However, while they argue for systematicity of representations, existing benchmarks and models primarily focus on the systematicity of behaviour. We emphasize the crucial nature of this distinction. Furthermore, building on Hadley's (1994) taxonomy of systematic generalization, we analyze the extent to which behavioural systematicity is tested by key benchmarks in the literature across language and vision. Finally, we highlight ways of assessing systematicity of representations in ML models as practiced in the field of mechanistic interpretability.

* To appear at ACL 2025 Main Conference

Evaluating the Robustness of Analogical Reasoning in Large Language Models

Nov 21, 2024

Martha Lewis, Melanie Mitchell

Abstract:LLMs have performed well on several reasoning benchmarks, including ones that test analogical reasoning abilities. However, there is debate on the extent to which they are performing general abstract reasoning versus employing non-robust processes, e.g., that overly rely on similarity to pre-training data. Here we investigate the robustness of analogy-making abilities previously claimed for LLMs on three of four domains studied by Webb, Holyoak, and Lu (2023): letter-string analogies, digit matrices, and story analogies. For each domain we test humans and GPT models on robustness to variants of the original analogy problems that test the same abstract reasoning abilities but are likely dissimilar from tasks in the pre-training data. The performance of a system that uses robust abstract reasoning should not decline substantially on these variants. On simple letter-string analogies, we find that while the performance of humans remains high for two types of variants we tested, the GPT models' performance declines sharply. This pattern is less pronounced as the complexity of these problems is increased, as both humans and GPT models perform poorly on both the original and variant problems requiring more complex analogies. On digit-matrix problems, we find a similar pattern but only on one out of the two types of variants we tested. On story-based analogy problems, we find that, unlike humans, the performance of GPT models is susceptible to answer-order effects, and that GPT models may also be more sensitive than humans to paraphrasing. This work provides evidence that LLMs often lack the robustness of zero-shot human analogy-making, exhibiting brittleness on most of the variations we tested. More generally, this work points to the importance of carefully evaluating AI systems not only for accuracy but also for robustness when testing their cognitive capabilities.

* 31 pages, 13 figures. arXiv admin note: text overlap with arXiv:2402.08955

Density Matrices for Metaphor Understanding

Aug 12, 2024

Jay Owers, Ekaterina Shutova, Martha Lewis

Abstract:In physics, density matrices are used to represent mixed states, i.e. probabilistic mixtures of pure states. This concept has previously been used to model lexical ambiguity. In this paper, we consider metaphor as a type of lexical ambiguity, and examine whether metaphorical meaning can be effectively modelled using mixtures of word senses. We find that modelling metaphor is significantly more difficult than other kinds of lexical ambiguity, but that our best-performing density matrix method outperforms simple baselines as well as some neural language models.

* EPTCS 406, 2024, pp. 197-215
* In Proceedings QPL 2024, arXiv:2408.05113
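
The idea of a mixed state as a probabilistic mixture of pure states can be illustrated with a minimal NumPy sketch. This is not the paper's method: the two-dimensional "sense" vectors and the 0.7/0.3 weights below are hypothetical values chosen only for illustration, and the helper names `pure_state` and `mixed_state` are assumptions.

```python
import numpy as np

def pure_state(v):
    """Density matrix |v><v| of a vector, normalised to unit length first."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    return np.outer(v, v)

def mixed_state(vectors, probs):
    """Probabilistic mixture of pure states: sum_i p_i |v_i><v_i|."""
    probs = np.asarray(probs, dtype=float)
    if not np.isclose(probs.sum(), 1.0):
        raise ValueError("mixture weights must sum to 1")
    return sum(p * pure_state(v) for p, v in zip(probs, vectors))

# Hypothetical sense vectors for one ambiguous word (illustrative only).
literal_sense = [1.0, 0.0]
metaphorical_sense = [0.0, 1.0]

rho = mixed_state([literal_sense, metaphorical_sense], [0.7, 0.3])

trace = np.trace(rho)          # a density matrix has unit trace
purity = np.trace(rho @ rho)   # equals 1 for a pure state, less than 1 for a mixture

print(rho)
print("trace:", trace, "purity:", purity)
```

A purity below 1 signals that the word's representation is a genuine mixture of senses rather than a single vector.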

Metaphor Understanding Challenge Dataset for LLMs

Mar 18, 2024

Xiaoyu Tong, Rochelle Choenni, Martha Lewis, Ekaterina Shutova

Abstract:Metaphors in natural language are a reflection of fundamental cognitive processes such as analogical reasoning and categorisation, and are deeply rooted in everyday communication. Metaphor understanding is therefore an essential task for large language models (LLMs). We release the Metaphor Understanding Challenge Dataset (MUNCH), designed to evaluate the metaphor understanding capabilities of LLMs. The dataset provides over 10k paraphrases for sentences containing metaphor use, as well as 1.5k instances containing inapt paraphrases. The inapt paraphrases were carefully selected to serve as control to determine whether the model indeed performs full metaphor interpretation or rather resorts to lexical similarity. All apt and inapt paraphrases were manually annotated. The metaphorical sentences cover natural metaphor uses across 4 genres (academic, news, fiction, and conversation), and they exhibit different levels of novelty. Experiments with LLaMA and GPT-3.5 demonstrate that MUNCH presents a challenging task for LLMs. The dataset is freely accessible at https://github.com/xiaoyuisrain/metaphor-understanding-challenge.

Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models

Feb 14, 2024

Martha Lewis, Melanie Mitchell

Abstract:Large language models (LLMs) have performed well on several reasoning benchmarks, including ones that test analogical reasoning abilities. However, it has been debated whether they are actually performing humanlike abstract reasoning or instead employing less general processes that rely on similarity to what has been seen in their training data. Here we investigate the generality of analogy-making abilities previously claimed for LLMs (Webb, Holyoak, & Lu, 2023). We take one set of analogy problems used to evaluate LLMs and create a set of "counterfactual" variants: versions that test the same abstract reasoning abilities but that are likely dissimilar from any pre-training data. We test humans and three GPT models on both the original and counterfactual problems, and show that, while the performance of humans remains high for all the problems, the GPT models' performance declines sharply on the counterfactual set. This work provides evidence that, despite previously reported successes of LLMs on analogical reasoning, these models lack the robustness and generality of human analogy-making.

Grounded learning for compositional vector semantics

Jan 10, 2024

Martha Lewis

Abstract:Categorical compositional distributional semantics is an approach to modelling language that combines the success of vector-based models of meaning with the compositional power of formal semantics. However, this approach was developed without an eye to cognitive plausibility. Vector representations of concepts and concept binding are also of interest in cognitive science, and have been proposed as a way of representing concepts within a biologically plausible spiking neural network. This work proposes a way for compositional distributional semantics to be implemented within a spiking neural network architecture, with the potential to address problems in concept binding, and gives a small implementation. We also describe a means of training word representations using labelled images.

Compositional Fusion of Signals in Data Embedding

Nov 18, 2023

Zhijin Guo, Zhaozhen Xu, Martha Lewis, Nello Cristianini

Abstract:Embeddings in AI convert symbolic structures into fixed-dimensional vectors, effectively fusing multiple signals. However, the nature of this fusion in real-world data is often unclear. To address this, we introduce two methods: (1) Correlation-based Fusion Detection, measuring correlation between known attributes and embeddings, and (2) Additive Fusion Detection, viewing embeddings as sums of individual vectors representing attributes. Applying these methods, word embeddings were found to combine semantic and morphological signals. BERT sentence embeddings were decomposed into individual word vectors of subject, verb and object. In a knowledge graph-based recommender system, user embeddings, even without training on demographic data, exhibited signals of demographics like age and gender. This study highlights that embeddings are fusions of multiple signals, from Word2Vec components to demographic hints in graph embeddings.
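
The additive view, in which each embedding is a sum of per-attribute vectors, can be sketched as a least-squares problem. The following is a toy illustration on synthetic data, not the authors' implementation: the attribute indicator matrix `A`, the dimensions, and the use of `numpy.linalg.lstsq` are all assumptions made for the example.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_attrs, dim = 3, 4  # hypothetical sizes, chosen only for illustration

# Each row of A marks which binary attributes an item has (all 8 combinations),
# so A has full column rank and the attribute vectors are identifiable.
A = np.array(list(itertools.product([0.0, 1.0], repeat=n_attrs)))

# Synthetic ground truth: one vector per attribute.
V_true = rng.normal(size=(n_attrs, dim))

# Embeddings built as exact sums of the vectors of the attributes present.
E = A @ V_true

# Recover the attribute vectors by solving A @ V = E in the least-squares sense.
V_hat, _, rank, _ = np.linalg.lstsq(A, E, rcond=None)

reconstruction_error = np.linalg.norm(A @ V_hat - E)
print("rank of A:", rank)
print("reconstruction error:", reconstruction_error)
```

On real embeddings the fit would not be exact, so the size of the reconstruction error would indicate how well the additive assumption holds.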

EXTRACT: Explainable Transparent Control of Bias in Embeddings

Oct 31, 2023

Zhijin Guo, Zhaozhen Xu, Martha Lewis, Nello Cristianini

Abstract:Knowledge Graphs are a widely used method to represent relations between entities in various AI applications, and Graph Embedding has rapidly become a standard technique to represent Knowledge Graphs in such a way as to facilitate inferences and decisions. As this representation is obtained from behavioural data, and is not in a form readable by humans, there is a concern that it might incorporate unintended information that could lead to biases. We propose EXTRACT: a suite of Explainable and Transparent methods to ConTrol bias in knowledge graph embeddings, so as to assess and decrease the implicit presence of protected information. Our method uses Canonical Correlation Analysis (CCA) to investigate the presence, extent and origins of information leaks during training, then decomposes embeddings into a sum of their private attributes by solving a linear system. Our experiments, performed on the MovieLens1M dataset, show that a range of personal attributes can be inferred from a user's viewing behaviour and preferences, including gender, age, and occupation. Further experiments, performed on the KG20C citation dataset, show that the information about the conference in which a paper was published can be inferred from the citation network of that article. We propose four transparent methods to maintain the capability of the embedding to make the intended predictions without retaining unwanted information. A trade-off between these two goals is observed.

* Aequitas 2023: Workshop on Fairness and Bias in AI | co-located with ECAI 2023, Kraków, Poland

Does CLIP Bind Concepts? Probing Compositionality in Large Image Models

Dec 20, 2022

Martha Lewis, Qinan Yu, Jack Merullo, Ellie Pavlick

Abstract:Large-scale models combining text and images have made incredible progress in recent years. However, they can still fail at tasks requiring compositional knowledge, such as correctly picking out a red cube from a picture of multiple shapes. We examine the ability of CLIP (Radford et al., 2021) to caption images requiring compositional knowledge. We implement five compositional language models to probe the kinds of structure that CLIP may be using, and develop a novel training algorithm, Compositional Skipgram for Images (CoSI), to train these models. We look at performance in attribute-based tasks, requiring the identification of a particular combination of attribute and object (such as "red cube"), and in relational settings, where the spatial relation between two shapes (such as "cube behind sphere") must be identified. We find that in some conditions, CLIP is able to learn attribute-object labellings, and to generalize to unseen attribute-object combinations. However, we also see evidence that CLIP is not able to bind features together reliably. Moreover, CLIP is not able to reliably learn relations between objects, whereas some compositional models are able to learn these perfectly. Of the five models we developed, none were able to generalize to unseen relations.

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Jun 10, 2022

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso (+435 more)

Abstract:Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

* 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench
