Satyapriya Krishna

From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs
Nov 18, 2025

Self-Correcting Large Language Models: Generation vs. Multiple Choice
Nov 12, 2025

Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL
Oct 16, 2024

Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Jul 20, 2024

More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
Apr 29, 2024

Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
Apr 10, 2024

Understanding the Effects of Iterative Prompting on Truthfulness
Feb 09, 2024

Black-Box Access is Insufficient for Rigorous AI Audits
Jan 25, 2024

On the Intersection of Self-Correction and Trust in Language Models
Nov 06, 2023

Are Large Language Models Post Hoc Explainers?
Oct 10, 2023