Abstract: The ability to learn new visual concepts from limited examples is a hallmark of human cognition. While traditional category learning models represent each example as an unstructured feature vector, compositional concept learning is thought to depend on (1) structured representations of examples (e.g., directed graphs consisting of objects and their relations) and (2) the identification of shared relational structure across examples through analogical mapping. Here, we introduce Probabilistic Schema Induction (PSI), a prototype model that employs deep learning to perform analogical mapping over structured representations of only a handful of examples, forming a compositional concept called a schema. In doing so, PSI relies on a novel conception of similarity that weighs object-level similarity and relational similarity, as well as a mechanism for amplifying relations relevant to classification, analogous to selective attention parameters in traditional models. We show that PSI produces human-like learning performance and outperforms two controls: a prototype model that uses unstructured feature vectors extracted from a deep learning model, and a variant of PSI with weaker structured representations. Notably, we find that PSI's human-like performance is driven by an adaptive strategy that increases relational similarity over object-level similarity and upweights the contribution of relations that distinguish classes. These findings suggest that structured representations and analogical mapping are critical to modeling rapid human-like learning of compositional visual concepts, and demonstrate how deep learning can be leveraged to create psychological models.
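The abstract does not give PSI's equations, so the following is only a minimal sketch of the kind of similarity computation it describes: object-level and relational similarity combined under a mixing weight, with per-relation attention weights that amplify diagnostic relations. All names here (weighted_schema_similarity, object_sim, relation_sims, relation_weights, alpha) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def weighted_schema_similarity(object_sim, relation_sims, relation_weights, alpha):
    """Hypothetical exemplar-to-schema similarity.

    object_sim       -- scalar similarity of object-level features (e.g., cosine
                        similarity of deep-network embeddings)
    relation_sims    -- per-relation match scores from an analogical mapping
                        between two structured (graph) representations
    relation_weights -- attention-like weights amplifying relations that are
                        diagnostic for classification
    alpha            -- mixing weight trading off relational vs. object similarity
    """
    w = np.asarray(relation_weights, dtype=float)
    r = np.asarray(relation_sims, dtype=float)
    # Attention-weighted average of the relational match scores.
    relational_sim = float(np.dot(w, r) / w.sum())
    # Convex combination of relational and object-level similarity.
    return alpha * relational_sim + (1.0 - alpha) * object_sim

# Example: two diagnostic relations match well, object features match moderately.
print(weighted_schema_similarity(object_sim=0.6,
                                 relation_sims=[0.9, 0.8],
                                 relation_weights=[2.0, 1.0],
                                 alpha=0.7))
```

Under this reading, the abstract's "adaptive strategy" would correspond to increasing alpha and concentrating the relation weights on class-distinguishing relations, though the actual parameterization in PSI may differ.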
Abstract: The visual world is fundamentally compositional. Visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit this compositionality to achieve robust and generalizable scene understanding. While major strides have been made toward the development of general-purpose, multimodal generative models, including both text-to-image models and multimodal vision-language models, it remains unclear whether these systems can accurately generate and interpret scenes involving the composition of multiple objects and relations. In this work, we present an evaluation of the compositional visual processing capabilities of the current generation of text-to-image models (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to that of human participants. The results suggest that these systems display some ability to solve compositional and relational tasks, showing notable improvements over the previous generation of multimodal models, but their performance nevertheless remains well below the level of human participants, particularly for more complex scenes involving many ($>5$) objects and multiple relations. These results highlight the need for further progress toward compositional understanding of visual scenes.
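The abstract does not detail the evaluation protocol, so the sketch below only illustrates one generic way a relational comprehension test of this kind could be scored: a vision-language model is asked whether a relational statement holds in an image, and its answers are compared against ground-truth labels. The function query_vlm, the scene records, and the file names are hypothetical placeholders, not the paper's benchmark or any specific API.

```python
# Generic relational-comprehension scoring loop (illustrative only).

def query_vlm(model_name, image_path, statement):
    """Stub: replace with a real call to a vision-language model API that
    returns True/False for whether `statement` holds in the image."""
    return True

# Each record pairs an image with a relational statement and its truth value.
scenes = [
    {"image": "scene_001.png", "statement": "the cup is on the table", "label": True},
    {"image": "scene_002.png", "statement": "the dog is left of the chair", "label": False},
]

def relational_accuracy(model_name):
    correct = sum(
        query_vlm(model_name, s["image"], s["statement"]) == s["label"] for s in scenes
    )
    return correct / len(scenes)

print(relational_accuracy("some-vlm"))  # 0.5 with the stub above
```

Human performance could be estimated with the same scoring loop by substituting participants' judgments for the model's answers, which is the kind of model-versus-human comparison the abstract reports.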