David Krueger

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Aug 17, 2025
Via arXiv

How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations

Aug 07, 2025
Via arXiv

Distributional Training Data Attribution

Jun 15, 2025
Via arXiv

Detecting High-Stakes Interactions with Activation Probes

Jun 12, 2025
Via arXiv

From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization

May 28, 2025
Via arXiv

Understanding (Un)Reliability of Steering Vectors in Language Models

May 28, 2025
Via arXiv

Interpreting Emergent Planning in Model-Free Reinforcement Learning

Apr 02, 2025
Via arXiv

Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

Feb 27, 2025
Via arXiv

Open Problems in Machine Unlearning for AI Safety

Jan 09, 2025
Via arXiv

Learning to Forget using Hypernetworks

Dec 01, 2024
Via arXiv