Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Markus Huff

Model Reconciliation through Explainability and Collaborative Recovery in Assistive Robotics

Jan 10, 2026

Britt Besch, Tai Mai, Jeremias Thun, Markus Huff, Jörn Vogel, Freek Stulp, Samuel Bustamante

Abstract:Whenever humans and robots work together, it is essential that unexpected robot behavior can be explained to the user. Especially in applications such as shared control the user and the robot must share the same model of the objects in the world, and the actions that can be performed on these objects. In this paper, we achieve this with a so-called model reconciliation framework. We leverage a Large Language Model to predict and explain the difference between the robot's and the human's mental models, without the need of a formal mental model of the user. Furthermore, our framework aims to solve the model divergence after the explanation by allowing the human to correct the robot. We provide an implementation in an assistive robotics domain, where we conduct a set of experiments with a real wheelchair-based mobile manipulator and its digital twin.

Via

Access Paper or Ask Questions

ArchiveGPT: A human-centered evaluation of using a vision language model for image cataloguing

Jul 10, 2025

Line Abele, Gerrit Anders, Tolgahan Aydın, Jürgen Buder, Helen Fischer, Dominik Kimmel, Markus Huff

Abstract:The accelerating growth of photographic collections has outpaced manual cataloguing, motivating the use of vision language models (VLMs) to automate metadata generation. This study examines whether Al-generated catalogue descriptions can approximate human-written quality and how generative Al might integrate into cataloguing workflows in archival and museum collections. A VLM (InternVL2) generated catalogue descriptions for photographic prints on labelled cardboard mounts with archaeological content, evaluated by archive and archaeology experts and non-experts in a human-centered, experimental framework. Participants classified descriptions as AI-generated or expert-written, rated quality, and reported willingness to use and trust in AI tools. Classification performance was above chance level, with both groups underestimating their ability to detect Al-generated descriptions. OCR errors and hallucinations limited perceived quality, yet descriptions rated higher in accuracy and usefulness were harder to classify, suggesting that human review is necessary to ensure the accuracy and quality of catalogue descriptions generated by the out-of-the-box model, particularly in specialized domains like archaeological cataloguing. Experts showed lower willingness to adopt AI tools, emphasizing concerns on preservation responsibility over technical performance. These findings advocate for a collaborative approach where AI supports draft generation but remains subordinate to human verification, ensuring alignment with curatorial values (e.g., provenance, transparency). The successful integration of this approach depends not only on technical advancements, such as domain-specific fine-tuning, but even more on establishing trust among professionals, which could both be fostered through a transparent and explainable AI pipeline.

* 56 pages, 7 figures

Via

Access Paper or Ask Questions

Metacognitive Monitoring: A Human Ability Beyond Generative Artificial Intelligence

Oct 17, 2024

Markus Huff, Elanur Ulakçı

Figure 1 for Metacognitive Monitoring: A Human Ability Beyond Generative Artificial Intelligence

Figure 2 for Metacognitive Monitoring: A Human Ability Beyond Generative Artificial Intelligence

Figure 3 for Metacognitive Monitoring: A Human Ability Beyond Generative Artificial Intelligence

Abstract:Large language models (LLMs) have shown impressive alignment with human cognitive processes, raising questions about the extent of their similarity to human cognition. This study investigates whether LLMs, specifically ChatGPT, possess metacognitive monitoring abilities akin to humans-particularly in predicting memory performance on an item-by-item basis. We employed a cross-agent prediction model to compare the metacognitive performance of humans and ChatGPT in a language-based memory task involving garden-path sentences preceded by either fitting or unfitting context sentences. Both humans and ChatGPT rated the memorability of these sentences; humans then completed a surprise recognition memory test. Our findings reveal a significant positive relationship between humans' memorability ratings and their actual recognition performance, indicating reliable metacognitive monitoring. In contrast, ChatGPT did not exhibit a similar predictive capability. Bootstrapping analyses demonstrated that none of the GPT models tested (GPT-3.5-turbo, GPT-4-turbo, GPT-4o) could accurately predict human memory performance on a per-item basis. This suggests that, despite their advanced language processing abilities and alignment with human cognition at the object level, current LLMs lack the metacognitive mechanisms that enable humans to anticipate their memory performance. These results highlight a fundamental difference between human and AI cognition at the metacognitive level. Addressing this gap is crucial for developing AI systems capable of effective self-monitoring and adaptation to human needs, thereby enhancing human-AI interactions across domains such as education and personalized learning.

* 28 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:2403.05152

Via

Access Paper or Ask Questions

Influence of Solution Efficiency and Valence of Instruction on Additive and Subtractive Solution Strategies in Humans and GPT-4

Apr 25, 2024

Lydia Uhler, Verena Jordan, Jürgen Buder, Markus Huff, Frank Papenmeier

Figure 1 for Influence of Solution Efficiency and Valence of Instruction on Additive and Subtractive Solution Strategies in Humans and GPT-4

Figure 2 for Influence of Solution Efficiency and Valence of Instruction on Additive and Subtractive Solution Strategies in Humans and GPT-4

Figure 3 for Influence of Solution Efficiency and Valence of Instruction on Additive and Subtractive Solution Strategies in Humans and GPT-4

Abstract:We explored the addition bias, a cognitive tendency to prefer adding elements over removing them to alter an initial state or structure, by conducting four preregistered experiments examining the problem-solving behavior of both humans and OpenAl's GPT-4 large language model. The experiments involved 588 participants from the U.S. and 680 iterations of the GPT-4 model. The problem-solving task was either to create symmetry within a grid (Experiments 1 and 3) or to edit a summary (Experiments 2 and 4). As hypothesized, we found that overall, the addition bias was present. Solution efficiency (Experiments 1 and 2) and valence of the instruction (Experiments 3 and 4) played important roles. Human participants were less likely to use additive strategies when subtraction was relatively more efficient than when addition and subtraction were equally efficient. GPT-4 exhibited the opposite behavior, with a strong addition bias when subtraction was more efficient. In terms of instruction valence, GPT-4 was more likely to add words when asked to "improve" compared to "edit", whereas humans did not show this effect. When we looked at the addition bias under different conditions, we found more biased responses for GPT-4 compared to humans. Our findings highlight the importance of considering comparable and sometimes superior subtractive alternatives, as well as reevaluating one's own and particularly the language models' problem-solving behavior.

* 29 pages, 2 figures

Via

Access Paper or Ask Questions

Towards a Psychology of Machines: Large Language Models Predict Human Memory

Mar 08, 2024

Markus Huff, Elanur Ulakçı

Figure 1 for Towards a Psychology of Machines: Large Language Models Predict Human Memory

Figure 2 for Towards a Psychology of Machines: Large Language Models Predict Human Memory

Figure 3 for Towards a Psychology of Machines: Large Language Models Predict Human Memory

Abstract:Large language models (LLMs) are demonstrating remarkable capabilities across various tasks despite lacking a foundation in human cognition. This raises the question: can these models, beyond simply mimicking human language patterns, offer insights into the mechanisms underlying human cognition? This study explores the ability of ChatGPT to predict human performance in a language-based memory task. Building upon theories of text comprehension, we hypothesize that recognizing ambiguous sentences (e.g., "Because Bill drinks wine is never kept in the house") is facilitated by preceding them with contextually relevant information. Participants, both human and ChatGPT, were presented with pairs of sentences. The second sentence was always a garden-path sentence designed to be inherently ambiguous, while the first sentence either provided a fitting (e.g., "Bill has chronic alcoholism") or an unfitting context (e.g., "Bill likes to play golf"). We measured both human's and ChatGPT's ratings of sentence relatedness, ChatGPT's memorability ratings for the garden-path sentences, and humans' spontaneous memory for the garden-path sentences. The results revealed a striking alignment between ChatGPT's assessments and human performance. Sentences deemed more related and assessed as being more memorable by ChatGPT were indeed better remembered by humans, even though ChatGPT's internal mechanisms likely differ significantly from human cognition. This finding, which was confirmed with a robustness check employing synonyms, underscores the potential of generative AI models to predict human performance accurately. We discuss the broader implications of these findings for leveraging LLMs in the development of psychological theories and for gaining a deeper understanding of human cognition.

* 32 pages, 3 figures, 2 tables

Via

Access Paper or Ask Questions