Ruixuan Huang

GuidedBench: Equipping Jailbreak Evaluation with Guidelines

Feb 24, 2025
Via arXiv

Evaluating Concept-based Explanations of Language Models: A Study on Faithfulness and Readability

Apr 30, 2024
Via arXiv

Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector

Apr 18, 2024
Via arXiv