
Michael Backes

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Apr 09, 2026

OrgAgent: Organize Your Multi-Agent System like a Company

Apr 01, 2026

When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm

Mar 25, 2026

Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks

Mar 12, 2026

Real Money, Fake Models: Deceptive Model Claims in Shadow APIs

Mar 05, 2026

Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks

Mar 03, 2026

Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Feb 09, 2026

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

Dec 30, 2025

JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring

Aug 28, 2025

Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions

Jul 30, 2025