Tomasz Korbak

Aligning language models with human preferences

Apr 18, 2024
Tomasz Korbak

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Apr 15, 2024
Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Apr 01, 2024
Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo

Towards Understanding Sycophancy in Language Models

Oct 27, 2023
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez

Compositional preference models for aligning LMs

Oct 17, 2023
Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Marc Dymetman

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

Sep 22, 2023
Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans

Taken out of context: On measuring situational awareness in LLMs

Sep 01, 2023
Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
