
Javier Rando

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

Jun 12, 2024

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Apr 22, 2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Apr 15, 2024

Universal Jailbreak Backdoors from Poisoned Human Feedback

Nov 24, 2023

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

Nov 06, 2023

Personas as a Way to Model Truthfulness in Language Models

Oct 30, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023

PassGPT: Password Modeling and (Guided) Generation with Large Language Models

Jun 14, 2023

Red-Teaming the Stable Diffusion Safety Filter

Oct 11, 2022

Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO

Jun 23, 2022