He He

Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

Oct 01, 2025
Via arXiv

Jailbreak Strength and Model Similarity Predict Transferability

Jun 15, 2025
Via arXiv

Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

Jun 12, 2025
Via arXiv

Unsupervised Elicitation of Language Models

Jun 11, 2025
Via arXiv

Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models

Apr 13, 2025
Via arXiv

Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification

Apr 07, 2025
Via arXiv

Transformers Struggle to Learn to Search

Dec 06, 2024
Via arXiv

Beyond the Binary: Capturing Diverse Preferences With Reward Regularization

Dec 05, 2024
Via arXiv

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

Nov 26, 2024
Via arXiv

Spontaneous Reward Hacking in Iterative Self-Refinement

Jul 05, 2024
Via arXiv