Max Hellrigel-Holderbaum

Questionnaire Responses Do not Capture the Safety of AI Agents

Mar 15, 2026

Max Hellrigel-Holderbaum, Edward James Young

Abstract: As AI systems advance in capabilities, measuring their safety and alignment to human values is becoming paramount. A fast-growing field of AI research is devoted to developing such assessments. However, most current advances therein may be ill-suited for assessing AI systems across real-world deployments. Standard methods prompt large language models (LLMs) in a questionnaire style to describe their values or behavior in hypothetical scenarios. By focusing on unaugmented LLMs, they fall short of evaluating AI agents, which could actually perform the relevant behaviors and hence pose much greater risks. LLMs' engagement with scenarios described by questionnaire-style prompts differs starkly from that of agents based on the same LLMs, as reflected in divergences in the inputs, possible actions, environmental interactions, and internal processing. As such, LLMs' responses to scenario descriptions are unlikely to be representative of the corresponding LLM agents' behavior. We further contend that such assessments make strong assumptions concerning the ability and tendency of LLMs to report accurately about their counterfactual behavior. Because they lack construct validity, these assessments are inadequate for assessing risks from AI systems in real-world contexts. We then argue that a structurally identical issue holds for current AI alignment approaches. Lastly, we discuss improving safety assessments and alignment training by taking these shortcomings to heart.

* 31 pages, 11 pages main text

Against racing to AGI: Cooperation, deterrence, and catastrophic risks

Jul 29, 2025

Leonard Dung, Max Hellrigel-Holderbaum

Abstract: AGI Racing is the view that it is in the self-interest of major actors in AI development, especially powerful nations, to accelerate their frontier AI development to build highly capable AI, especially artificial general intelligence (AGI), before competitors have a chance. We argue against AGI Racing. First, the downsides of racing to AGI are much greater than this view portrays. Racing to AGI would substantially increase catastrophic risks from AI, including nuclear instability, and undermine the prospects of technical AI safety research being effective. Second, the expected benefits of racing may be lower than proponents of AGI Racing hold. In particular, it is questionable whether winning the race enables complete domination over losers. Third, international cooperation and coordination, and perhaps carefully crafted deterrence measures, constitute viable alternatives to racing to AGI which carry much smaller risks and promise to deliver most of the benefits that racing to AGI is supposed to provide. Hence, racing to AGI is not in anyone's self-interest, as other actions, particularly incentivizing and seeking international cooperation around AI issues, are preferable.

Misalignment or misuse? The AGI alignment tradeoff

Jun 04, 2025

Max Hellrigel-Holderbaum, Leonard Dung

Abstract: Creating systems that are aligned with our goals is seen as a leading approach to creating safe and beneficial AI, both in leading AI companies and in the academic field of AI safety. We defend the view that misaligned AGI - future, generally intelligent (robotic) AI agents - poses catastrophic risks. At the same time, we support the view that aligned AGI creates a substantial risk of catastrophic misuse by humans. While both risks are severe and stand in tension with one another, we show that, in principle, there is room for alignment approaches which do not increase misuse risk. We then investigate how the tradeoff between misalignment and misuse looks empirically for different technical approaches to AI alignment. Here, we argue that many current alignment techniques and foreseeable improvements thereof plausibly increase risks of catastrophic misuse. Since the impacts of AI depend on the social context, we close by discussing important social factors and suggest that, to reduce the risk of a misuse catastrophe due to aligned AGI, techniques such as robustness, AI control methods and especially good governance seem essential.

* Forthcoming in Philosophical Studies
