Himabindu Lakkaraju

How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior

May 21, 2025
Via arXiv

Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations

May 21, 2025
Via arXiv

Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models

May 19, 2025
Via arXiv

Soft Best-of-n Sampling for Model Alignment

May 06, 2025
Via arXiv

Towards Interpretable Soft Prompts

Apr 02, 2025
Via arXiv

Detecting LLM-Written Peer Reviews

Mar 20, 2025
Via arXiv

Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models

Dec 31, 2024
Via arXiv

On the Impact of Fine-Tuning on Chain-of-Thought Reasoning

Nov 22, 2024
Via arXiv

Towards Unifying Interpretability and Control: Evaluation via Intervention

Nov 07, 2024
Via arXiv

Generalized Group Data Attribution

Oct 13, 2024
Via arXiv