Javier Rando

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Apr 15, 2024
Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramèr, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger

Universal Jailbreak Backdoors from Poisoned Human Feedback

Nov 24, 2023
Javier Rando, Florian Tramèr

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

Nov 06, 2023
Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando

Personas as a Way to Model Truthfulness in Language Models

Oct 30, 2023
Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, He He

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023
Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell

PassGPT: Password Modeling and (Guided) Generation with Large Language Models

Jun 14, 2023
Javier Rando, Fernando Perez-Cruz, Briland Hitaj

Red-Teaming the Stable Diffusion Safety Filter

Oct 11, 2022
Javier Rando, Daniel Paleka, David Lindner, Lennard Heim, Florian Tramèr

Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO

Jun 23, 2022
Javier Rando, Nasib Naimi, Thomas Baumann, Max Mathys
