RLHF


Democratic Preference Alignment via Sortition-Weighted RLHF
Feb 04, 2026

Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models
Feb 04, 2026

SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for RLHF
Feb 04, 2026

SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization
Feb 03, 2026

Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment
Feb 02, 2026

Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking
Feb 02, 2026

How RLHF Amplifies Sycophancy
Feb 01, 2026

GOPO: Policy Optimization using Ranked Rewards
Feb 01, 2026

Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis
Jan 31, 2026

Unifying Adversarial Robustness and Training Across Text Scoring Models
Jan 31, 2026