Himabindu Lakkaraju

Inference-Time Reward Hacking in Large Language Models

Jun 24, 2025
Via arXiv

Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations

May 21, 2025
Via arXiv

How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior

May 21, 2025
Via arXiv

Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models

May 19, 2025
Via arXiv

Soft Best-of-n Sampling for Model Alignment

May 06, 2025
Via arXiv

Towards Interpretable Soft Prompts

Apr 02, 2025
Via arXiv

Detecting LLM-Written Peer Reviews

Mar 20, 2025
Via arXiv

Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models

Dec 31, 2024
Via arXiv

On the Impact of Fine-Tuning on Chain-of-Thought Reasoning

Nov 22, 2024
Via arXiv

Towards Unifying Interpretability and Control: Evaluation via Intervention

Nov 07, 2024
Via arXiv