Danny Halawi

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Jun 28, 2024
Via arXiv

Dominion: A New Frontier for AI Research

May 10, 2024
Via arXiv

Approaching Human-Level Forecasting with Language Models

Feb 28, 2024
Via arXiv

Overthinking the Truth: Understanding how Language Models Process False Demonstrations

Jul 18, 2023
Via arXiv

Eliciting Latent Predictions from Transformers with the Tuned Lens

Mar 15, 2023
Via arXiv