Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Andy Kim

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Jan 15, 2026

Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa, Kyle O'Brien

Abstract:Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners pretrain for alignment as well as capabilities. Our models and datasets are available at alignmentpretraining.ai

Via

Access Paper or Ask Questions

CheXbreak: Misclassification Identification for Deep Learning Models Interpreting Chest X-rays

Mar 24, 2021

Emma Chen, Andy Kim, Rayan Krishnan, Jin Long, Andrew Y. Ng, Pranav Rajpurkar

Figure 1 for CheXbreak: Misclassification Identification for Deep Learning Models Interpreting Chest X-rays

Figure 2 for CheXbreak: Misclassification Identification for Deep Learning Models Interpreting Chest X-rays

Figure 3 for CheXbreak: Misclassification Identification for Deep Learning Models Interpreting Chest X-rays

Figure 4 for CheXbreak: Misclassification Identification for Deep Learning Models Interpreting Chest X-rays

Abstract:A major obstacle to the integration of deep learning models for chest x-ray interpretation into clinical settings is the lack of understanding of their failure modes. In this work, we first investigate whether there are patient subgroups that chest x-ray models are likely to misclassify. We find that patient age and the radiographic finding of lung lesion, pneumothorax or support devices are statistically relevant features for predicting misclassification for some chest x-ray models. Second, we develop misclassification predictors on chest x-ray models using their outputs and clinical features. We find that our best performing misclassification identifier achieves an AUROC close to 0.9 for most diseases. Third, employing our misclassification identifiers, we develop a corrective algorithm to selectively flip model predictions that have high likelihood of misclassification at inference time. We observe F1 improvement on the prediction of Consolidation (0.008 [95\% CI 0.005, 0.010]) and Edema (0.003, [95\% CI 0.001, 0.006]). By carrying out our investigation on ten distinct and high-performing chest x-ray models, we are able to derive insights across model architectures and offer a generalizable framework applicable to other medical imaging tasks.

* Accepted to ACM Conference on Health, Inference, and Learning (ACM-CHIL) Workshop 2021

Via

Access Paper or Ask Questions