Jacob Steinhardt

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Jun 28, 2024
Via arXiv

Monitoring Latent World States in Language Models with Propositional Probes

Jun 27, 2024
Via arXiv

Adversaries Can Misuse Combinations of Safe Models

Jun 20, 2024
Via arXiv

Interpreting the Second-Order Effects of Neurons in CLIP

Jun 06, 2024
Via arXiv

Approaching Human-Level Forecasting with Language Models

Feb 28, 2024
Via arXiv

Feedback Loops With Language Models Drive In-Context Reward Hacking

Feb 09, 2024
Via arXiv

Describing Differences in Image Sets with Natural Language

Dec 05, 2023
Via arXiv

How do Language Models Bind Entities in Context?

Oct 26, 2023
Via arXiv

Interpreting CLIP's Image Representation via Text-Based Decomposition

Oct 10, 2023
Via arXiv

Overthinking the Truth: Understanding how Language Models Process False Demonstrations

Jul 18, 2023
Via arXiv