Jonathan Nöther

Policy Teaching via Data Poisoning in Learning from Human Preferences

Mar 13, 2025
Via arXiv

Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints

Jan 14, 2025
Via arXiv

Implicit Poisoning Attacks in Two-Agent Reinforcement Learning: Adversarial Policies for Training-Time Attacks

Feb 27, 2023
Via arXiv