Abstract:The accelerating growth of photographic collections has outpaced manual cataloguing, motivating the use of vision language models (VLMs) to automate metadata generation. This study examines whether Al-generated catalogue descriptions can approximate human-written quality and how generative Al might integrate into cataloguing workflows in archival and museum collections. A VLM (InternVL2) generated catalogue descriptions for photographic prints on labelled cardboard mounts with archaeological content, evaluated by archive and archaeology experts and non-experts in a human-centered, experimental framework. Participants classified descriptions as AI-generated or expert-written, rated quality, and reported willingness to use and trust in AI tools. Classification performance was above chance level, with both groups underestimating their ability to detect Al-generated descriptions. OCR errors and hallucinations limited perceived quality, yet descriptions rated higher in accuracy and usefulness were harder to classify, suggesting that human review is necessary to ensure the accuracy and quality of catalogue descriptions generated by the out-of-the-box model, particularly in specialized domains like archaeological cataloguing. Experts showed lower willingness to adopt AI tools, emphasizing concerns on preservation responsibility over technical performance. These findings advocate for a collaborative approach where AI supports draft generation but remains subordinate to human verification, ensuring alignment with curatorial values (e.g., provenance, transparency). The successful integration of this approach depends not only on technical advancements, such as domain-specific fine-tuning, but even more on establishing trust among professionals, which could both be fostered through a transparent and explainable AI pipeline.
Abstract:We explored the addition bias, a cognitive tendency to prefer adding elements over removing them to alter an initial state or structure, by conducting four preregistered experiments examining the problem-solving behavior of both humans and OpenAl's GPT-4 large language model. The experiments involved 588 participants from the U.S. and 680 iterations of the GPT-4 model. The problem-solving task was either to create symmetry within a grid (Experiments 1 and 3) or to edit a summary (Experiments 2 and 4). As hypothesized, we found that overall, the addition bias was present. Solution efficiency (Experiments 1 and 2) and valence of the instruction (Experiments 3 and 4) played important roles. Human participants were less likely to use additive strategies when subtraction was relatively more efficient than when addition and subtraction were equally efficient. GPT-4 exhibited the opposite behavior, with a strong addition bias when subtraction was more efficient. In terms of instruction valence, GPT-4 was more likely to add words when asked to "improve" compared to "edit", whereas humans did not show this effect. When we looked at the addition bias under different conditions, we found more biased responses for GPT-4 compared to humans. Our findings highlight the importance of considering comparable and sometimes superior subtractive alternatives, as well as reevaluating one's own and particularly the language models' problem-solving behavior.