Nora Petrova

Latent Adversarial Training Improves the Representation of Refusal

Apr 26, 2025
Via arXiv

Characterizing stable regions in the residual stream of LLMs

Sep 26, 2024
Via arXiv