Picture for Manas Joglekar

Manas Joglekar

Tony

Training LLMs for Honesty via Confessions

Add code
Dec 22, 2025
Figure 1 for Training LLMs for Honesty via Confessions
Figure 2 for Training LLMs for Honesty via Confessions
Figure 3 for Training LLMs for Honesty via Confessions
Figure 4 for Training LLMs for Honesty via Confessions
Viaarxiv icon

OpenAI GPT-5 System Card

Add code
Dec 19, 2025
Viaarxiv icon

OpenAI o1 System Card

Add code
Dec 21, 2024
Figure 1 for OpenAI o1 System Card
Figure 2 for OpenAI o1 System Card
Figure 3 for OpenAI o1 System Card
Figure 4 for OpenAI o1 System Card
Viaarxiv icon

Deliberative Alignment: Reasoning Enables Safer Language Models

Add code
Dec 20, 2024
Viaarxiv icon

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Add code
Dec 14, 2023
Figure 1 for Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Figure 2 for Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Figure 3 for Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Figure 4 for Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Viaarxiv icon