Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Gracjan Góral

Depth-Wise Activation Steering for Honest Language Models

Dec 08, 2025

Gracjan Góral, Marysia Winkels, Steven Basart

Abstract:Large language models sometimes assert falsehoods despite internally representing the correct answer, failures of honesty rather than accuracy, which undermines auditability and safety. Existing approaches largely optimize factual correctness or depend on retraining and brittle single-layer edits, offering limited leverage over truthful reporting. We present a training-free activation steering method that weights steering strength across network depth using a Gaussian schedule. On the MASK benchmark, which separates honesty from knowledge, we evaluate seven models spanning the LLaMA, Qwen, and Mistral families and find that Gaussian scheduling improves honesty over no-steering and single-layer baselines in six of seven models. Equal-budget ablations on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct show the Gaussian schedule outperforms random, uniform, and box-filter depth allocations, indicating that how intervention is distributed across depth materially affects outcomes beyond total strength. The method is simple, model-agnostic, requires no finetuning, and provides a low-cost control knob for eliciting truthful reporting from models' existing capabilities.

* See \url{https://github.com/marysia/gaussian-activation-steering}. for code and experiments

Via

Access Paper or Ask Questions

Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

May 03, 2025

Gracjan Góral, Alicja Ziarko, Piotr Miłoś, Michał Nauman, Maciej Wołczyk, Michał Kosiński

Abstract:We investigate the ability of Vision Language Models (VLMs) to perform visual perspective taking using a novel set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes, in which a single humanoid minifigure is paired with a single object. By systematically varying spatial configurations - such as object position relative to the humanoid minifigure and the humanoid minifigure's orientation - and using both bird's-eye and surface-level views, we created 144 unique visual tasks. Each visual task is paired with a series of 7 diagnostic questions designed to assess three levels of visual cognition: scene understanding, spatial reasoning, and visual perspective taking. Our evaluation of several state-of-the-art models, including GPT-4-Turbo, GPT-4o, Llama-3.2-11B-Vision-Instruct, and variants of Claude Sonnet, reveals that while they excel in scene understanding, the performance declines significantly on spatial reasoning and further deteriorates on perspective-taking. Our analysis suggests a gap between surface-level object recognition and the deeper spatial and perspective reasoning required for complex visual tasks, pointing to the need for integrating explicit geometric representations and tailored training protocols in future VLM development.

* Dataset: https://huggingface.co/datasets/Gracjan/Isle/viewer/Isle-Brick-V2

Via

Access Paper or Ask Questions

When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options

Aug 27, 2024

Gracjan Góral, Emilia Wiśnios

Figure 1 for When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options

Figure 2 for When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options

Figure 3 for When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options

Figure 4 for When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options

Abstract:This paper examines the zero-shot ability of Large Language Models (LLMs) to detect multiple-choice questions with no correct answer, a crucial aspect of educational assessment quality. We explore this ability not only as a measure of subject matter knowledge but also as an indicator of critical thinking within LLMs. Our experiments, utilizing a range of LLMs on diverse questions, highlight the significant performance gap between questions with a single correct answer and those without. Llama-3.1-405B stands out by successfully identifying the lack of a valid answer in many instances. These findings suggest that LLMs should prioritize critical thinking over blind instruction following and caution against their use in educational settings where questions with incorrect answers might lead to inaccurate evaluations. This research sets a benchmark for assessing critical thinking in LLMs and emphasizes the need for ongoing model alignment to ensure genuine user comprehension and assistance.

Via

Access Paper or Ask Questions

What Matters in Hierarchical Search for Combinatorial Reasoning Problems?

Jun 05, 2024

Michał Zawalski, Gracjan Góral, Michał Tyrolski, Emilia Wiśnios, Franciszek Budrowski, Łukasz Kuciński, Piotr Miłoś

Abstract:Efficiently tackling combinatorial reasoning problems, particularly the notorious NP-hard tasks, remains a significant challenge for AI research. Recent efforts have sought to enhance planning by incorporating hierarchical high-level search strategies, known as subgoal methods. While promising, their performance against traditional low-level planners is inconsistent, raising questions about their application contexts. In this study, we conduct an in-depth exploration of subgoal-planning methods for combinatorial reasoning. We identify the attributes pivotal for leveraging the advantages of high-level search: hard-to-learn value functions, complex action spaces, presence of dead ends in the environment, or using data collected from diverse experts. We propose a consistent evaluation methodology to achieve meaningful comparisons between methods and reevaluate the state-of-the-art algorithms.

* Accepted for Generative Models for Decision Making Workshop at ICLR 2024

Via

Access Paper or Ask Questions