Xudong Pan

Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems

May 23, 2025

ReasoningShield: Content Safety Detection over Reasoning Traces of Large Reasoning Models

May 22, 2025

Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction

May 19, 2025

OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation

Apr 18, 2025

StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models

Apr 14, 2025

Frontier AI systems have surpassed the self-replicating red line

Dec 09, 2024

No-Skim: Towards Efficiency Robustness Evaluation on Skimming-based Language Models

Dec 18, 2023

BELT: Old-School Backdoor Attacks can Evade the State-of-the-Art Defense with Backdoor Exclusivity Lifting

Dec 08, 2023

JADE: A Linguistics-based Safety Evaluation Platform for LLM

Nov 02, 2023

MIRA: Cracking Black-box Watermarking on Deep Neural Networks via Model Inversion-based Removal Attacks

Sep 07, 2023