Suraj Srinivas

Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations
May 21, 2025

Towards Interpretable Soft Prompts
Apr 02, 2025

Towards Unifying Interpretability and Control: Evaluation via Intervention
Nov 07, 2024

Generalized Group Data Attribution
Oct 13, 2024

How much can we forget about Data Contamination?
Oct 04, 2024

All Roads Lead to Rome? Exploring Representational Similarities Between Latent Spaces of Generative Image Models
Jul 18, 2024

Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
Feb 16, 2024

Certifying LLM Safety against Adversarial Prompting
Sep 06, 2023

Verifiable Feature Attributions: A Bridge between Post Hoc Explainability and Inherent Interpretability
Jul 27, 2023

Efficient Estimation of the Local Robustness of Machine Learning Models
Jul 26, 2023