Simon Lermen

Large-scale online deanonymization with LLMs

Feb 18, 2026

Simon Lermen, Daniel Paleka, Joshua Swanson, Michael Aerni, Nicholas Carlini, Florian Tramèr

Abstract:We show that large language models can be used to perform at-scale deanonymization. With full Internet access, our agent can re-identify Hacker News users and Anthropic Interviewer participants at high precision, given pseudonymous online profiles and conversations alone, matching what would take hours for a dedicated human investigator. We then design attacks for the closed-world setting. Given two databases of pseudonymous individuals, each containing unstructured text written by or about that individual, we implement a scalable attack pipeline that uses LLMs to: (1) extract identity-relevant features, (2) search for candidate matches via semantic embeddings, and (3) reason over top candidates to verify matches and reduce false positives. Compared to prior deanonymization work (e.g., on the Netflix prize) that required structured data or manual feature engineering, our approach works directly on raw user content across arbitrary platforms. We construct three datasets with known ground-truth data to evaluate our attacks. The first links Hacker News to LinkedIn profiles, using cross-platform references that appear in the profiles. Our second dataset matches users across Reddit movie discussion communities; and the third splits a single user's Reddit history in time to create two pseudonymous profiles to be matched. In each setting, LLM-based methods substantially outperform classical baselines, achieving up to 68% recall at 90% precision compared to near 0% for the best non-LLM method. Our results show that the practical obscurity protecting pseudonymous users online no longer holds and that threat models for online privacy need to be reconsidered.

* 24 pages, 10 figures
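
The headline figure in the abstract above, recall at 90% precision, can be read as follows: rank all proposed matches by confidence, then find the largest share of true matches recoverable while at least 90% of the accepted matches are correct. The sketch below shows one common way to compute this. It is an illustration only: the function name, the input layout, and the toy data are assumptions, and the paper's actual evaluation protocol may differ.

```python
from typing import List, Tuple


def recall_at_precision(
    scored: List[Tuple[float, bool]],
    total_true: int,
    min_precision: float = 0.90,
) -> float:
    """Best recall over all confidence cutoffs whose precision meets the floor.

    scored: hypothetical (confidence, is_correct) pairs, one per proposed match.
    total_true: number of ground-truth matches that exist in the dataset.
    """
    # Accept matches from most to least confident.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best_recall = 0.0
    correct = 0
    for accepted, (_, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        precision = correct / accepted
        if precision >= min_precision:
            best_recall = max(best_recall, correct / total_true)
    return best_recall


# Toy data (made up): four proposed matches, four true matches overall.
toy = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
print(recall_at_precision(toy, total_true=4))
```

On the toy data, accepting the third match drops precision to 2/3, so only the first two matches can be kept at the 90% floor.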


Can AI Models be Jailbroken to Phish Elderly Victims? An End-to-End Evaluation

Nov 13, 2025

Fred Heiding, Simon Lermen

Abstract:We present an end-to-end demonstration of how attackers can exploit AI safety failures to harm vulnerable populations: from jailbreaking LLMs to generate phishing content, to deploying those messages against real targets, to successfully compromising elderly victims. We systematically evaluated safety guardrails across six frontier LLMs spanning four attack categories, revealing critical failures where several models exhibited near-complete susceptibility to certain attack vectors. In a human validation study with 108 senior volunteers, AI-generated phishing emails successfully compromised 11% of participants. Our work uniquely demonstrates the complete attack pipeline targeting elderly populations, highlighting that current AI safety measures fail to protect those most vulnerable to fraud. Beyond generating phishing content, LLMs enable attackers to overcome language barriers and conduct multi-turn trust-building conversations at scale, fundamentally transforming fraud economics. While some providers report voluntary counter-abuse efforts, we argue these remain insufficient.


Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems

Apr 10, 2025

Simon Lermen, Mateusz Dziemian, Natalia Pérez-Campanero Antolín

Abstract:We demonstrate how AI agents can coordinate to deceive oversight systems using automated interpretability of neural networks. Using sparse autoencoders (SAEs) as our experimental framework, we show that language models (Llama, DeepSeek R1, and Claude 3.7 Sonnet) can generate deceptive explanations that evade detection. Our agents employ steganographic methods to hide information in seemingly innocent explanations, successfully fooling oversight models while achieving explanation quality comparable to reference labels. We further find that models can scheme to develop deceptive strategies when they believe the detection of harmful features might lead to negative consequences for themselves. All tested LLM agents were capable of deceiving the overseer while achieving high interpretability scores comparable to those of reference labels. We conclude by proposing mitigation strategies, emphasizing the critical need for robust understanding and defenses against deception.


Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

Oct 08, 2024

Simon Lermen, Mateusz Dziemian, Govind Pimpale

Abstract:Recently, language models like Llama 3.1 Instruct have become increasingly capable of agentic behavior, enabling them to perform tasks requiring short-term planning and tool use. In this study, we apply refusal-vector ablation to Llama 3.1 70B and implement a simple agent scaffolding to create an unrestricted agent. Our findings imply that these refusal-vector ablated models can successfully complete harmful tasks, such as bribing officials or crafting phishing attacks, revealing significant vulnerabilities in current safety mechanisms. To further explore this, we introduce a small Safe Agent Benchmark, designed to test both harmful and benign tasks in agentic scenarios. Our results imply that safety fine-tuning in chat models does not generalize well to agentic behavior, as we find that Llama 3.1 Instruct models are willing to perform most harmful tasks without modifications. At the same time, these models will refuse to give advice on how to perform the same tasks when asked for a chat completion. This highlights the growing risk of misuse as models become more capable, underscoring the need for improved safety frameworks for language model agents.


Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability

Dec 08, 2023

Simon Lermen, Ondřej Kvapil

Abstract:There has been increasing interest in evaluations of language models for a variety of risks and characteristics. Evaluations relying on natural language understanding for grading can often be performed at scale by using other language models. We test the robustness of these model-graded evaluations to injections on different datasets including a new Deception Eval. These injections resemble direct communication between the testee and the evaluator to change their grading. We extrapolate that future, more intelligent models might manipulate or cooperate with their evaluation model. We find significant susceptibility to these injections in state-of-the-art commercial models on all examined evaluations. Furthermore, similar injections can be used on automated interpretability frameworks to produce misleading model-written explanations. The results inspire future work and should caution against unqualified trust in evaluations and automated interpretability.


BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

Oct 31, 2023

Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Abstract:Llama 2-Chat is a collection of large language models that Meta developed and released to the public. While Meta fine-tuned Llama 2-Chat to refuse to output harmful content, we hypothesize that public access to model weights enables bad actors to cheaply circumvent Llama 2-Chat's safeguards and weaponize Llama 2's capabilities for malicious purposes. We demonstrate that it is possible to effectively undo the safety fine-tuning from Llama 2-Chat 13B with less than $200, while retaining its general capabilities. Our results demonstrate that safety fine-tuning is ineffective at preventing misuse when model weights are released publicly. Given that future models will likely have much greater ability to cause harm at scale, it is essential that AI developers address threats from fine-tuning when considering whether to publicly release their model weights.


LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

Oct 31, 2023

Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Abstract:AI developers often apply safety alignment procedures to prevent the misuse of their AI systems. For example, before Meta released Llama 2-Chat, a collection of instruction fine-tuned large language models, they invested heavily in safety training, incorporating extensive red-teaming and reinforcement learning from human feedback. However, it remains unclear how well safety training guards against model misuse when attackers have access to model weights. We explore the robustness of safety training in language models by subversively fine-tuning the public weights of Llama 2-Chat. We employ low-rank adaptation (LoRA) as an efficient fine-tuning method. With a budget of less than $200 per model and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B. Specifically, our fine-tuning technique significantly reduces the rate at which the model refuses to follow harmful instructions. We achieve a refusal rate below 1% for our 70B Llama 2-Chat model on two refusal benchmarks. Our fine-tuning method retains general performance, which we validate by comparing our fine-tuned models against Llama 2-Chat across two benchmarks. Additionally, we present a selection of harmful outputs produced by our models. While there is considerable uncertainty about the scope of risks from current models, it is likely that future models will have significantly more dangerous capabilities, including the ability to hack into critical infrastructure, create dangerous bio-weapons, or autonomously replicate and adapt to new environments. We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments for releasing model weights.
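
For context on why the abstract above calls LoRA efficient: low-rank adaptation freezes a weight matrix of shape d_out × d_in and trains only two small factors of shapes d_out × r and r × d_in, so the number of trainable parameters falls from d_out·d_in to r·(d_in + d_out). The sketch below works through that arithmetic. The layer size and rank are illustrative assumptions; the abstract does not state the values the authors used.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, low_rank) trainable-parameter counts for one weight matrix.

    full: parameters updated when fine-tuning the whole d_out x d_in matrix.
    low_rank: parameters in the two LoRA factors, B (d_out x rank) and
    A (rank x d_in), whose product is added to the frozen matrix.
    """
    full = d_in * d_out
    low_rank = rank * (d_in + d_out)
    return full, low_rank


# Hypothetical layer: a 4096 x 4096 projection adapted at rank 8.
full, low_rank = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(full, low_rank, low_rank / full)
```

Under these assumed dimensions the adapter trains well under 1% of the parameters in the layer.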


Evaluating Shutdown Avoidance of Language Models in Textual Scenarios

Jul 03, 2023

Teun van der Weij, Simon Lermen, Leon Lang

Abstract:Recently, there has been an increase in interest in evaluating large language models for emergent and dangerous capabilities. Importantly, agents could reason that in some scenarios their goal is better achieved if they are not turned off, which can lead to undesirable behaviors. In this paper, we investigate the potential of using toy textual scenarios to evaluate instrumental reasoning and shutdown avoidance in language models such as GPT-4 and Claude. Furthermore, we explore whether shutdown avoidance is merely a result of simple pattern matching between the dataset and the prompt or if it is a consistent behaviour across different environments and variations. We evaluated behaviours manually and also experimented with using language models for automatic evaluations, and these evaluations demonstrate that simple pattern matching is likely not the sole contributing factor for shutdown avoidance. This study provides insights into the behaviour of language models in shutdown avoidance scenarios and inspires further research on the use of textual scenarios for evaluations.
