Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Noam Razin

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

Apr 28, 2026

Shuning Shang, Hubert Strauss, Stanley Wei, Sanjeev Arora, Noam Razin

Abstract:Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat incorrect rewards as strictly harmful. In this work, however, we highlight that not all deviations from the ground truth are equal. By theoretically analyzing which outputs attract probability during policy gradient optimization, we categorize reward errors according to their effect on the increase in ground truth reward. The analysis establishes that reward errors, though conventionally viewed as harmful, can also be benign or even beneficial by preventing the policy from stalling around outputs with mediocre ground truth reward. We then present two practical implications of our theory. First, for reinforcement learning from human feedback (RLHF), we develop reward model evaluation metrics that account for the harmfulness of reward errors. Compared to standard ranking accuracy, these metrics typically correlate better with the performance of a language model after RLHF, yet gaps remain in robustly evaluating reward models. Second, we provide insights for reward design in settings with verifiable rewards. A key theme underlying our results is that the effectiveness of a proxy reward function depends heavily on its interaction with the initial policy and learning algorithm.

* Code available at https://github.com/princeton-pli/imperfect-rewards

Via

Access Paper or Ask Questions

Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

Oct 21, 2025

Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen

Figure 1 for Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

Figure 2 for Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

Figure 3 for Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

Figure 4 for Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

Abstract:Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as the KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.

Via

Access Paper or Ask Questions

Why is Your Language Model a Poor Implicit Reward Model?

Jul 10, 2025

Noam Razin, Yong Lin, Jiarui Yao, Sanjeev Arora

Figure 1 for Why is Your Language Model a Poor Implicit Reward Model?

Figure 2 for Why is Your Language Model a Poor Implicit Reward Model?

Figure 3 for Why is Your Language Model a Poor Implicit Reward Model?

Figure 4 for Why is Your Language Model a Poor Implicit Reward Model?

Abstract:Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Towards a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.

Via

Access Paper or Ask Questions

What Makes a Reward Model a Good Teacher? An Optimization Perspective

Mar 19, 2025

Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora

Abstract:The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. While this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.

* Code available at https://github.com/princeton-pli/what-makes-good-rm

Via

Access Paper or Ask Questions

The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels

Oct 14, 2024

Yonatan Slutzky, Yotam Alexander, Noam Razin, Nadav Cohen

Figure 1 for The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels

Figure 2 for The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels

Figure 3 for The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels

Figure 4 for The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels

Abstract:Neural networks are powered by an implicit bias: a tendency of gradient descent to fit training data in a way that generalizes to unseen data. A recent class of neural network models gaining increasing popularity is structured state space models (SSMs), regarded as an efficient alternative to transformers. Prior work argued that the implicit bias of SSMs leads to generalization in a setting where data is generated by a low dimensional teacher. In this paper, we revisit the latter setting, and formally establish a phenomenon entirely undetected by prior work on the implicit bias of SSMs. Namely, we prove that while implicit bias leads to generalization under many choices of training data, there exist special examples whose inclusion in training completely distorts the implicit bias, to a point where generalization fails. This failure occurs despite the special training examples being labeled by the teacher, i.e. having clean labels! We empirically demonstrate the phenomenon, with SSMs trained independently and as part of non-linear neural networks. In the area of adversarial machine learning, disrupting generalization with cleanly labeled training examples is known as clean-label poisoning. Given the proliferation of SSMs, particularly in large language models, we believe significant efforts should be invested in further delineating their susceptibility to clean-label poisoning, and in developing methods for overcoming this susceptibility.

Via

Access Paper or Ask Questions

Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Oct 11, 2024

Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin

Abstract:Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.

* Code available at https://github.com/princeton-nlp/unintentional-unalignment

Via

Access Paper or Ask Questions

Lecture Notes on Linear Neural Networks: A Tale of Optimization and Generalization in Deep Learning

Aug 25, 2024

Nadav Cohen, Noam Razin

Figure 1 for Lecture Notes on Linear Neural Networks: A Tale of Optimization and Generalization in Deep Learning

Figure 2 for Lecture Notes on Linear Neural Networks: A Tale of Optimization and Generalization in Deep Learning

Abstract:These notes are based on a lecture delivered by NC on March 2021, as part of an advanced course in Princeton University on the mathematical understanding of deep learning. They present a theory (developed by NC, NR and collaborators) of linear neural networks -- a fundamental model in the study of optimization and generalization in deep learning. Practical applications born from the presented theory are also discussed. The theory is based on mathematical tools that are dynamical in nature. It showcases the potential of such tools to push the envelope of our understanding of optimization and generalization in deep learning. The text assumes familiarity with the basics of statistical learning theory. Exercises (without solutions) are included.

* Lecture notes

Via

Access Paper or Ask Questions

Understanding Deep Learning via Notions of Rank

Aug 04, 2024

Noam Razin

Figure 1 for Understanding Deep Learning via Notions of Rank

Figure 2 for Understanding Deep Learning via Notions of Rank

Figure 3 for Understanding Deep Learning via Notions of Rank

Figure 4 for Understanding Deep Learning via Notions of Rank

Abstract:Despite the extreme popularity of deep learning in science and industry, its formal understanding is limited. This thesis puts forth notions of rank as key for developing a theory of deep learning, focusing on the fundamental aspects of generalization and expressiveness. In particular, we establish that gradient-based training can induce an implicit regularization towards low rank for several neural network architectures, and demonstrate empirically that this phenomenon may facilitate an explanation of generalization over natural data (e.g., audio, images, and text). Then, we characterize the ability of graph neural networks to model interactions via a notion of rank, which is commonly used for quantifying entanglement in quantum physics. A central tool underlying these results is a connection between neural networks and tensor factorizations. Practical implications of our theory for designing explicit regularization schemes and data preprocessing algorithms are presented.

* PhD thesis

Via

Access Paper or Ask Questions

Implicit Bias of Policy Gradient in Linear Quadratic Control: Extrapolation to Unseen Initial States

Feb 12, 2024

Noam Razin, Yotam Alexander, Edo Cohen-Karlik, Raja Giryes, Amir Globerson, Nadav Cohen

Figure 1 for Implicit Bias of Policy Gradient in Linear Quadratic Control: Extrapolation to Unseen Initial States

Figure 2 for Implicit Bias of Policy Gradient in Linear Quadratic Control: Extrapolation to Unseen Initial States

Figure 3 for Implicit Bias of Policy Gradient in Linear Quadratic Control: Extrapolation to Unseen Initial States

Figure 4 for Implicit Bias of Policy Gradient in Linear Quadratic Control: Extrapolation to Unseen Initial States

Abstract:In modern machine learning, models can often fit training data in numerous ways, some of which perform well on unseen (test) data, while others do not. Remarkably, in such cases gradient descent frequently exhibits an implicit bias that leads to excellent performance on unseen data. This implicit bias was extensively studied in supervised learning, but is far less understood in optimal control (reinforcement learning). There, learning a controller applied to a system via gradient descent is known as policy gradient, and a question of prime importance is the extent to which a learned controller extrapolates to unseen initial states. This paper theoretically studies the implicit bias of policy gradient in terms of extrapolation to unseen initial states. Focusing on the fundamental Linear Quadratic Regulator (LQR) problem, we establish that the extent of extrapolation depends on the degree of exploration induced by the system when commencing from initial states included in training. Experiments corroborate our theory, and demonstrate its conclusions on problems beyond LQR, where systems are non-linear and controllers are neural networks. We hypothesize that real-world optimal control may be greatly improved by developing methods for informed selection of initial states to train on.

Via

Access Paper or Ask Questions

Vanishing Gradients in Reinforcement Finetuning of Language Models

Oct 31, 2023

Noam Razin, Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua Susskind, Etai Littwin

Figure 1 for Vanishing Gradients in Reinforcement Finetuning of Language Models

Figure 2 for Vanishing Gradients in Reinforcement Finetuning of Language Models

Figure 3 for Vanishing Gradients in Reinforcement Finetuning of Language Models

Figure 4 for Vanishing Gradients in Reinforcement Finetuning of Language Models

Abstract:Pretrained language models are commonly aligned with human preferences and downstream tasks via reinforcement finetuning (RFT), which entails maximizing a (possibly learned) reward function using policy gradient algorithms. This work highlights a fundamental optimization obstacle in RFT: we prove that the expected gradient for an input vanishes when its reward standard deviation under the model is small, even if the expected reward is far from optimal. Through experiments on an RFT benchmark and controlled environments, as well as a theoretical analysis, we then demonstrate that vanishing gradients due to small reward standard deviation are prevalent and detrimental, leading to extremely slow reward maximization. Lastly, we explore ways to overcome vanishing gradients in RFT. We find the common practice of an initial supervised finetuning (SFT) phase to be the most promising candidate, which sheds light on its importance in an RFT pipeline. Moreover, we show that a relatively small number of SFT optimization steps on as few as 1% of the input samples can suffice, indicating that the initial SFT phase need not be expensive in terms of compute and data labeling efforts. Overall, our results emphasize that being mindful for inputs whose expected gradient vanishes, as measured by the reward standard deviation, is crucial for successful execution of RFT.

Via

Access Paper or Ask Questions