Fazl Barez

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

May 30, 2025
Via arXiv

Precise In-Parameter Concept Erasure in Large Language Models

May 28, 2025
Via arXiv

SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors

May 20, 2025
Via arXiv

Scaling sparse feature circuit finding for in-context learning

Apr 18, 2025
Via arXiv

Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness

Mar 03, 2025
Via arXiv

Do Sparse Autoencoders Generalize? A Case Study of Answerability

Feb 27, 2025
Via arXiv

Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs

Feb 18, 2025
Via arXiv

Rethinking AI Cultural Evaluation

Jan 13, 2025
Via arXiv

Open Problems in Machine Unlearning for AI Safety

Jan 09, 2025
Via arXiv

Best-of-N Jailbreaking

Dec 04, 2024
Via arXiv