An Zhang

RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards
Jun 09, 2025

AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint
Jun 08, 2025

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems
May 26, 2025

MPO: Multilingual Safety Alignment via Reward Gap Optimization
May 22, 2025

Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs
May 16, 2025

AlphaFuse: Learn ID Embeddings for Sequential Recommendation in Null Space of Language Embeddings
Apr 29, 2025

AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Apr 13, 2025

SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models
Apr 09, 2025

Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs
Feb 28, 2025

Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models
Nov 19, 2024