RLHF
MTI: A Behavior-Based Temperament Profiling System for AI Agents

Apr 02, 2026

DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment

Apr 02, 2026

The Silicon Mirror: Dynamic Behavioral Gating for Anti-Sycophancy in LLM Agents

Apr 02, 2026

Quantifying Self-Preservation Bias in Large Language Models

Apr 02, 2026

Offline Constrained RLHF with Multiple Preference Oracles

Mar 31, 2026

Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE

Mar 31, 2026

Reward Hacking as Equilibrium under Finite Evaluation

Mar 30, 2026

The Last Fingerprint: How Markdown Training Shapes LLM Prose

Mar 27, 2026

Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control

Mar 26, 2026

Why Safety Probes Catch Liars But Miss Fanatics

Mar 26, 2026