Abstract:Across the sciences, autonomous systems are increasingly being used in closed-loop discovery, proposing new theories and designing and running experiments to test them. This approach is yet to be applied in the field of cognitive science, where the central bottleneck is theory-building: the creative step of turning the accumulated failures of existing models into better ones. Theory generation has remained manual even as data collection, modeling, and experiment design have been automated. We present the Automated Cognitive Scientist (AutoCog), a fully autonomous agentic-AI system that closes this loop. Large-language-model agents advocate competing theories, each expressed as an executable cognitive model, design experiments that best discriminate them, collect behavioral data from participants recruited online, score theories against collected data based on their generative performance, diagnose why they fail, and synthesize a better successor. Repeating this cycle allows them to search the space of theories, models, and experiments. In the domain of decision-making, AutoCog recovered known decision-making strategies from simulated behavior, including unconventional ones, showing that its discoveries are ultimately driven by the data rather than strictly bound by the priors of the underlying language models. When run with human participants, it produced theories that outperformed the established theories it was seeded with and generalized to held-out studies in two different experimental settings. It also surfaced a novel theory of multi-cue decision-making in which choices show diminishing sensitivity to feature values. The distinctive predictions of this theory were confirmed in a preregistered study with new participants. AutoCog demonstrates how an automated discovery system can be used to turn cognitive theory-building into an explicit, executable, and cumulative science.
Abstract:Simulators have revolutionized scientific practice across the natural sciences. By generating data that reliably approximate real-world phenomena, they enable scientists to accelerate hypothesis testing and optimize experimental designs. This is perhaps best illustrated by AlphaFold, a Nobel-prize winning simulator in chemistry that predicts protein structures from amino acid sequences, enabling rapid prototyping of molecular interactions, drug targets, and protein functions. In the behavioral sciences, a reliable participant simulator - a system capable of producing human-like behavior across cognitive tasks - would represent a similarly transformative advance. Recently, Binz et al. introduced Centaur, a large language model (LLM) fine-tuned on human data from 160 experiments, proposing its use not only as a model of cognition but also as a participant simulator for "in silico prototyping of experimental studies", e.g., to advance automated cognitive science. Here, we review the core criteria for a participant simulator and assess how well Centaur meets them. Although Centaur demonstrates strong predictive accuracy, its generative behavior - a critical criterion for a participant simulator - systematically diverges from human data. This suggests that, while Centaur is a significant step toward predicting human behavior, it does not yet meet the standards of a reliable participant simulator or an accurate model of cognition.