Zhijing Jin

Decomposing and Measuring Evaluation Awareness

May 21, 2026

Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution

May 04, 2026

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

Apr 16, 2026

Evaluating Cooperation in LLM Social Groups through Elected Leadership

Apr 13, 2026

CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs

Mar 22, 2026

When Do Language Models Endorse Limitations on Human Rights Principles?

Mar 04, 2026

GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

Feb 12, 2026

IV Co-Scientist: Multi-Agent LLM Framework for Causal Instrumental Variable Discovery

Feb 08, 2026

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Feb 06, 2026

Fluid Representations in Reasoning Models

Feb 04, 2026