Eric Wallace

Trading Inference-Time Compute for Adversarial Robustness

Jan 31, 2025

OpenAI o1 System Card

Dec 21, 2024

Deliberative Alignment: Reasoning Enables Safer Language Models

Dec 20, 2024

Predicting Emergent Capabilities by Finetuning

Nov 25, 2024

GPT-4o System Card

Oct 25, 2024

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Jun 28, 2024

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Apr 19, 2024

Unfamiliar Finetuning Examples Control How Language Models Hallucinate

Mar 08, 2024

What Evidence Do Language Models Find Convincing?

Feb 19, 2024

Scalable Extraction of Training Data from (Production) Language Models

Nov 28, 2023