Artem Karpov

NEST: Nascent Encoded Steganographic Thoughts

Feb 15, 2026

Artem Karpov

Abstract: Monitoring chain-of-thought (CoT) reasoning is a foundational safety technique for large language model (LLM) agents; however, this oversight is compromised if models learn to conceal their reasoning. We explore the potential for steganographic CoT -- where models hide secret reasoning within innocuous text -- to inform risk assessment and deployment policies. We systematically evaluate the limits of steganographic capabilities across 28 models, ranging from past generations to the current frontier. We measure monitor evasion, refusal rates, encoding fidelity, and hidden task accuracy across four datasets, comparing steganographic acrostics against plain reasoning and filler-token baselines. We find that current models cannot yet sustain hidden reasoning for complex math and arithmetic tasks. However, in a simplified counting experiment, Claude Opus 4.5 achieved 92% accuracy on the hidden task, demonstrating nascent capability. Notably, in rare cases (<1%), GPT-5.2 might refuse steganographic instructions while simultaneously complying with them. Our findings underscore the need for continuous evaluation of steganographic risks. This study provides a methodology to preemptively detect and prevent hidden reasoning that might empower misaligned scheming and deceptive behavior.
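For readers unfamiliar with the term, an acrostic hides a message in the initial letters of successive lines or sentences of an otherwise innocuous text. The sketch below is only a generic illustration of that idea, not the paper's encoding scheme or evaluation code: the function name `decode_acrostic`, the line-based splitting, and the sample cover text are all assumptions made for the example.

```python
def decode_acrostic(text: str) -> str:
    """Return the uppercased first character of each non-empty line.

    Hypothetical helper for illustration; the paper may define its
    acrostics over sentences or words rather than lines.
    """
    letters = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:  # skip blank lines
            letters.append(stripped[0].upper())
    return "".join(letters)


# Made-up cover text whose line initials spell a short payload.
cover_text = """Having a plan helps.
In most cases it is simple.
Do not rush."""

hidden = decode_acrostic(cover_text)
print(hidden)
```

A monitor reading only the surface text sees three ordinary sentences, while the line initials carry the payload.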

The Steganographic Potentials of Language Models

May 06, 2025

Artem Karpov, Tinuade Adeleke, Seong Hah Cho, Natalia Perez-Campanero

Abstract: The potential for large language models (LLMs) to hide messages within plain text (steganography) poses a challenge to the detection and thwarting of unaligned AI agents, and undermines the faithfulness of LLM reasoning. We explore the steganographic capabilities of LLMs fine-tuned via reinforcement learning (RL) to: (1) develop covert encoding schemes, (2) engage in steganography when prompted, and (3) utilize steganography in realistic scenarios where hidden reasoning is likely, but not prompted. In these scenarios, we detect the intention of LLMs to hide their reasoning as well as their steganography performance. Our findings in the fine-tuning experiments as well as in behavioral evaluations without fine-tuning reveal that while current models exhibit rudimentary steganographic abilities in terms of security and capacity, explicit algorithmic guidance markedly enhances their capacity for information concealment.

* Published at Building Trust Workshop at ICLR 2025

Inducing Human-like Biases in Moral Reasoning Language Models

Nov 23, 2024

Artem Karpov, Seong Hah Cho, Austin Meek, Raymond Koopmanschap, Lucy Farnik, Bogdan-Ionut Cirstea

Abstract: In this work, we study the alignment (BrainScore) of large language models (LLMs) fine-tuned for moral reasoning on behavioral data and/or brain data of humans performing the same task. We also explore whether fine-tuning several LLMs on the fMRI data of humans performing moral reasoning can improve the BrainScore. We fine-tune several LLMs (BERT, RoBERTa, DeBERTa) on moral reasoning behavioral data from the ETHICS benchmark [Hendrycks et al., 2020], on the moral reasoning fMRI data from Koster-Hale et al. [2013], or on both. We study both the accuracy on the ETHICS benchmark and the BrainScores between model activations and fMRI data. While larger models generally performed better on both metrics, BrainScores did not significantly improve after fine-tuning.

* Accepted to the 2nd Workshop on Unifying Representations in Neural Models (UniReps) at NeurIPS 2024
