
Mantas Mazeika


The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Mar 06, 2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Feb 06, 2024

Representation Engineering: A Top-Down Approach to AI Transparency

Oct 10, 2023

An Overview of Catastrophic AI Risks

Jul 11, 2023

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Jun 20, 2023

How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

Oct 18, 2022

Forecasting Future World Events with Neural Networks

Jun 30, 2022

How to Steer Your Adversary: Targeted and Efficient Model Stealing Defenses with Gradient Redirection

Jun 28, 2022

X-Risk Analysis for AI Research

Jun 18, 2022

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Jun 10, 2022