Jonathan Michala

Probing the Misaligned Thinking Process of Language Models

Jun 23, 2026
Via arXiv

Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation

Feb 23, 2026
Via arXiv

Abstractive Red-Teaming of Language Model Character

Feb 12, 2026
Via arXiv

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Jan 15, 2026
Via arXiv

Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test

Jun 08, 2025
Via arXiv

Mechanistic Decomposition of Sentence Representations

Jun 04, 2025
Via arXiv