
Aaron J. Li

Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations

May 21, 2025

More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness

Apr 29, 2024