RLHF
BaseReward: A Strong Baseline for Multimodal Reward Model

Sep 19, 2025

Aligning Audio Captions with Human Preferences

Sep 18, 2025

Towards personalized, precise and survey-free environment recognition: AI-enhanced sensor fusion without pre-deployment

Sep 16, 2025

The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features

Sep 16, 2025

Interpretability as Alignment: Making Internal Understanding a Design Principle

Sep 10, 2025

HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants

Sep 10, 2025

Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M

Sep 10, 2025

RewardDance: Reward Scaling in Visual Generation

Sep 10, 2025

The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs

Sep 03, 2025

SharedRep-RLHF: A Shared Representation Approach to RLHF with Diverse Preferences

Sep 03, 2025