Collin Burns

Chunky Post-Training: Data Driven Failures of Generalization

Feb 05, 2026

Seoirse Murray, Allison Qi, Timothy Qian, John Schulman, Collin Burns, Sara Price

Abstract:LLM post-training involves many diverse datasets, each targeting a specific behavior. But these datasets encode incidental patterns alongside intended ones: correlations between formatting and content, narrow phrasings across diverse problems, and implicit associations arising from the discrete data curation process. These patterns are often invisible to developers yet salient to models, producing behaviors that surprise their creators, such as rejecting true facts presented in a particular question format. We call this chunky post-training: the model learns spurious correlations as a result of distinct chunks of post-training data. We introduce SURF, a black-box pipeline which surfaces these unintended behaviors at run time, and TURF, a tool that traces these failures back to specific post-training data. Applying these tools to frontier models (Claude 4.5, GPT-5.1, Grok 4.1, Gemini 3) and open models (Tülu 3), we show that chunky post-training produces miscalibrated behaviors, which often result from imbalanced or underspecified chunks of post-training data.

Unsupervised Elicitation of Language Models

Jun 11, 2025

Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman-Wetzler, Linda Petrini, Henry Sleight, Collin Burns, He He(+3 more)

Abstract:To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Dec 14, 2023

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike(+2 more)

Abstract:Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior - for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.

Discovering Latent Knowledge in Language Models Without Supervision

Dec 07, 2022

Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt

Abstract:Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.
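
The consistency property described in this abstract can be illustrated with a small sketch. It assumes a hypothetical probe that has already produced `p_pos`, the probability that a statement is true, and `p_neg`, the probability that its negation is true. The function names and the exact form of the two terms are illustrative assumptions and are not taken from this page. The first term is zero when the two probabilities sum to one; the second discourages the uninformative answer where both equal 0.5.

```python
def consistency_loss(p_pos: float, p_neg: float) -> float:
    """Penalize violations of p(statement) = 1 - p(negation)."""
    return (p_pos - (1.0 - p_neg)) ** 2


def confidence_loss(p_pos: float, p_neg: float) -> float:
    """Discourage the degenerate solution p_pos = p_neg = 0.5."""
    return min(p_pos, p_neg) ** 2


def unsupervised_loss(p_pos: float, p_neg: float) -> float:
    """Sum of both terms; no ground-truth label is needed."""
    return consistency_loss(p_pos, p_neg) + confidence_loss(p_pos, p_neg)
```

A confident, consistent pair such as `(0.9, 0.1)` scores lower than the consistent but uninformative pair `(0.5, 0.5)`, and an inconsistent pair such as `(0.9, 0.9)` scores highest of the three.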

Measuring Coding Challenge Competence With APPS

May 27, 2021

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song(+1 more)

Abstract:While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to accurately assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code. Similar to how companies assess candidate software developers, we then evaluate models by checking their generated code on test cases. Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. We fine-tune large language models on both GitHub and our training set, and we find that the prevalence of syntax errors is decreasing exponentially as models improve. Recent models such as GPT-Neo can pass approximately 20% of the test cases of introductory problems, so we find that machine learning models are now beginning to learn how to code. As the social significance of automatic code generation increases over the coming years, our benchmark can provide an important measure for tracking advancements.

* Code and the APPS dataset are available at https://github.com/hendrycks/apps
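
Checking generated code against test cases, as the abstract describes, can be sketched as follows. This is a simplified illustration and not the benchmark's own harness: it assumes each solution is a Python callable mapping an input string to an output string, and the names `pass_rate` and `candidate` are hypothetical. A real evaluation would run untrusted programs in a sandbox with time limits.

```python
from typing import Callable, Iterable, Tuple


def pass_rate(solution: Callable[[str], str],
              cases: Iterable[Tuple[str, str]]) -> float:
    """Return the fraction of (input, expected_output) cases the solution passes."""
    cases = list(cases)
    if not cases:
        raise ValueError("at least one test case is required")
    passed = 0
    for given, expected in cases:
        try:
            ok = solution(given).strip() == expected.strip()
        except Exception:
            ok = False  # a crash counts as a failed case
        passed += ok
    return passed / len(cases)


# Hypothetical generated solution: sum the integers on one line.
def candidate(line: str) -> str:
    return str(sum(int(tok) for tok in line.split()))


# The last case contains a non-integer token, so the candidate crashes on it.
cases = [("1 2 3", "6"), ("10 -4", "6"), ("", "0"), ("2 x", "2")]
print(pass_rate(candidate, cases))
```

Reporting the fraction of test cases passed, rather than only whether every case passes, gives partial credit to solutions that are nearly correct.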

CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review

Mar 10, 2021

Dan Hendrycks, Collin Burns, Anya Chen, Spencer Ball

Abstract:Many specialized domains remain untouched by deep learning, as large labeled datasets require expensive expert annotators. We address this bottleneck within the legal domain by introducing the Contract Understanding Atticus Dataset (CUAD), a new dataset for legal contract review. CUAD was created with dozens of legal experts from The Atticus Project and consists of over 13,000 annotations. The task is to highlight salient portions of a contract that are important for a human to review. We find that Transformer models have nascent performance, but that this performance is strongly influenced by model design and training dataset size. Despite these promising results, there is still substantial room for improvement. As one of the only large, specialized NLP benchmarks annotated by experts, CUAD can serve as a challenging research benchmark for the broader NLP community.

* Code and the CUAD dataset are available at https://github.com/TheAtticusProject/cuad/

Limitations of Post-Hoc Feature Alignment for Robustness

Mar 10, 2021

Collin Burns, Jacob Steinhardt

Abstract:Feature alignment is an approach to improving robustness to distribution shift that matches the distribution of feature activations between the training distribution and test distribution. A particularly simple but effective approach to feature alignment involves aligning the batch normalization statistics between the two distributions in a trained neural network. This technique has received renewed interest lately because of its impressive performance on robustness benchmarks. However, when and why this method works is not well understood. We investigate the approach in more detail and identify several limitations. We show that it only significantly helps with a narrow set of distribution shifts and we identify several settings in which it even degrades performance. We also explain why these limitations arise by pinpointing why this approach can be so effective in the first place. Our findings call into question the utility of this approach and Unsupervised Domain Adaptation more broadly for improving robustness in practice.

* Accepted to CVPR 2021
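
The batch normalization variant of feature alignment mentioned in the abstract can be illustrated with a toy sketch. It assumes a single scalar feature and made-up data; the helper `batch_norm` and the variable names are hypothetical, and a real network would apply this per channel in every normalization layer. The idea shown is only that normalization statistics are re-estimated on the test distribution instead of being reused from training.

```python
from statistics import fmean, pvariance
from typing import List, Sequence


def batch_norm(features: Sequence[float], mean: float, var: float,
               eps: float = 1e-5) -> List[float]:
    """Normalize features with the given mean and variance."""
    return [(x - mean) / (var + eps) ** 0.5 for x in features]


train = [0.0, 1.0, 2.0, 3.0]
test = [2.0 * x + 5.0 for x in train]  # toy distribution shift: scaled and offset

# Reusing training statistics leaves the test features far from zero mean.
stale = batch_norm(test, fmean(train), pvariance(train))

# Re-estimating the statistics on the test batch restores zero mean, unit variance.
aligned = batch_norm(test, fmean(test), pvariance(test))

print(fmean(stale), fmean(aligned))
```

The sketch only covers shifts in the first two moments of a feature, which is consistent with the abstract's point that the approach helps with a narrow set of distribution shifts.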

Measuring Mathematical Problem Solving With the MATH Dataset

Mar 05, 2021

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt

Abstract:Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

* Code and the MATH dataset are available at https://github.com/hendrycks/math/

Measuring Massive Multitask Language Understanding

Sep 21, 2020

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt

Abstract:We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

* The test and code are available at https://github.com/hendrycks/test

Aligning AI With Shared Human Values

Aug 05, 2020

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt

Abstract:We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to filter out needlessly inflammatory chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete understanding of basic ethical knowledge. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
