Natalia Perez-Campanero

The Steganographic Potentials of Language Models

May 06, 2025

Artem Karpov, Tinuade Adeleke, Seong Hah Cho, Natalia Perez-Campanero

Abstract: The potential for large language models (LLMs) to hide messages within plain text (steganography) poses a challenge to detecting and thwarting unaligned AI agents, and undermines the faithfulness of LLM reasoning. We explore the steganographic capabilities of LLMs fine-tuned via reinforcement learning (RL) to: (1) develop covert encoding schemes, (2) engage in steganography when prompted, and (3) utilize steganography in realistic scenarios where hidden reasoning is likely, but not prompted. In these scenarios, we detect the intention of LLMs to hide their reasoning as well as their steganography performance. Our findings from the fine-tuning experiments and from behavioral evaluations without fine-tuning reveal that while current models exhibit rudimentary steganographic abilities in terms of security and capacity, explicit algorithmic guidance markedly enhances their capacity for information concealment.

* Published at Building Trust Workshop at ICLR 2025

Latent Adversarial Training Improves the Representation of Refusal

Apr 26, 2025

Alexandra Abbas, Nora Petrova, Helios Ael Lyons, Natalia Perez-Campanero

Abstract: Recent work has shown that language models' refusal behavior is primarily encoded in a single direction in their latent space, making it vulnerable to targeted attacks. Although Latent Adversarial Training (LAT) attempts to improve robustness by introducing noise during training, a key question remains: How does this noise-based training affect the underlying representation of refusal behavior? Understanding this encoding is crucial for evaluating LAT's effectiveness and limitations, just as the discovery of linear refusal directions revealed vulnerabilities in traditional supervised safety fine-tuning (SSFT). Through the analysis of Llama 2 7B, we examine how LAT reorganizes the refusal behavior in the model's latent space compared to SSFT and embedding space adversarial training (AT). By computing activation differences between harmful and harmless instruction pairs and applying Singular Value Decomposition (SVD), we find that LAT significantly alters the refusal representation, concentrating it in the first two SVD components, which explain approximately 75 percent of the variance in the activation differences, significantly more than in reference models. This concentrated representation leads to more effective and transferable refusal vectors for ablation attacks: LAT models show improved robustness when attacked with vectors from reference models but become more vulnerable to self-generated vectors compared to SSFT and AT. Our findings suggest that LAT's training perturbations enable a more comprehensive representation of refusal behavior, highlighting both its potential strengths and vulnerabilities for improving model safety.
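
The analysis described in the abstract (paired activation differences, SVD, and directional ablation) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names `refusal_svd` and `ablate_direction` are hypothetical, the synthetic arrays stand in for real model activations, and whether the differences should be mean-centered before the SVD is an assumption the abstract does not settle, so it is left as a flag.

```python
import numpy as np


def refusal_svd(harmful_acts, harmless_acts, center=False):
    """SVD of paired activation differences (hypothetical helper).

    Both inputs have shape (n_pairs, d_model). Returns the right singular
    vectors (one direction per row) and the fraction of total squared
    singular value captured by each component.
    """
    diffs = harmful_acts - harmless_acts
    if center:
        # Assumption: centering is optional; the abstract does not specify it.
        diffs = diffs - diffs.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(diffs, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return vt, explained


def ablate_direction(acts, direction):
    """Remove the component of each activation along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)


# Synthetic stand-in for activations: the differences are dominated by two
# planted orthogonal directions plus a small amount of noise.
rng = np.random.default_rng(0)
n_pairs, d_model = 64, 16
e0, e1 = np.eye(d_model)[0], np.eye(d_model)[1]
harmless = rng.normal(size=(n_pairs, d_model))
shift = (
    5.0 * rng.normal(size=(n_pairs, 1)) * e0
    + 3.0 * rng.normal(size=(n_pairs, 1)) * e1
    + 0.1 * rng.normal(size=(n_pairs, d_model))
)
harmful = harmless + shift

vt, explained = refusal_svd(harmful, harmless)
top_two_share = explained[:2].sum()           # share held by first two components
ablated = ablate_direction(harmful, vt[0])    # project out the leading direction
```

Here `top_two_share` plays the role of the "first two SVD components" figure quoted in the abstract, and `ablate_direction` is the simplest form of an ablation attack: it zeroes the projection of every activation onto a chosen vector. In a real study the inputs would be activations from a chosen layer of the model rather than random arrays.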
