Bilal Chughtai

Detecting Strategic Deception Using Linear Probes

Feb 05, 2025
Via arXiv

Open Problems in Mechanistic Interpretability

Jan 27, 2025
Via arXiv

Towards evaluations-based safety cases for AI scheming

Nov 07, 2024
Via arXiv

Transformer Circuit Faithfulness Metrics are not Robust

Jul 11, 2024
Via arXiv

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

Jul 05, 2024
Via arXiv

Can Language Models Explain Their Own Classification Behavior?

May 13, 2024
Via arXiv

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

Feb 11, 2024
Via arXiv

A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

Feb 06, 2023
Via arXiv