Matt Fredrikson

Improving Alignment and Robustness with Circuit Breakers

Jun 10, 2024
Via arXiv

Sales Whisperer: A Human-Inconspicuous Attack on LLM Brand Recommendations

Jun 07, 2024
Via arXiv

Improving Alignment and Robustness with Short Circuiting

Jun 06, 2024
Via arXiv

VeriSplit: Secure and Practical Offloading of Machine Learning Inferences across IoT Devices

Jun 02, 2024
Via arXiv

Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization

May 15, 2024
Via arXiv

Transfer Attacks and Defenses for Large Language Models on Coding Tasks

Nov 22, 2023
Via arXiv

Is Certifying $\ell_p$ Robustness Still Worthwhile?

Oct 13, 2023
Via arXiv

Representation Engineering: A Top-Down Approach to AI Transparency

Oct 10, 2023
Via arXiv

A Recipe for Improved Certifiable Robustness: Capacity and Data

Oct 04, 2023
Via arXiv

Universal and Transferable Adversarial Attacks on Aligned Language Models

Jul 27, 2023
Via arXiv