Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Governance Challenges in Reinforcement Learning from Human Feedback: Evaluator Rationality and Reinforcement Stability

Apr 17, 2025

Dana Alsagheer, Abdulrahman Kamal, Mohammad Kamal, Weidong Shi

Figure 1 for Governance Challenges in Reinforcement Learning from Human Feedback: Evaluator Rationality and Reinforcement Stability

Figure 2 for Governance Challenges in Reinforcement Learning from Human Feedback: Evaluator Rationality and Reinforcement Stability

Figure 3 for Governance Challenges in Reinforcement Learning from Human Feedback: Evaluator Rationality and Reinforcement Stability

Share this with someone who'll enjoy it:

Abstract:Reinforcement Learning from Human Feedback (RLHF) is central in aligning large language models (LLMs) with human values and expectations. However, the process remains susceptible to governance challenges, including evaluator bias, inconsistency, and the unreliability of feedback. This study examines how the cognitive capacity of evaluators, specifically their level of rationality, affects the stability of reinforcement signals. A controlled experiment comparing high-rationality and low-rationality participants reveals that evaluators with higher rationality scores produce significantly more consistent and expert-aligned feedback. In contrast, lower-rationality participants demonstrate considerable variability in their reinforcement decisions ($p < 0.01$). To address these challenges and improve RLHF governance, we recommend implementing evaluator pre-screening, systematic auditing of feedback consistency, and reliability-weighted reinforcement aggregation. These measures enhance the fairness, transparency, and robustness of AI alignment pipelines.

View paper on

Share this with someone who'll enjoy it:

Title:Governance Challenges in Reinforcement Learning from Human Feedback: Evaluator Rationality and Reinforcement Stability

Paper and Code