We examine how large language models (LLMs) generalize from abstract declarative statements in their training data. As an illustration, consider an LLM that is prompted to generate weather reports for London in 2050. One possibility is that the temperatures in the reports match the mean and variance of reports from 2023 (i.e. matching the statistics of pretraining). Another possibility is that the reports predict higher temperatures, by incorporating declarative statements about climate change from scientific papers written in 2023. An example of such a declarative statement is "global temperatures will increase by $1^{\circ} \mathrm{C}$ by 2050". To test the influence of abstract declarative statements, we construct tasks in which LLMs are finetuned on both declarative and procedural information. We find that declarative statements influence model predictions, even when they conflict with procedural information. In particular, finetuning on a declarative statement $S$ increases the model likelihood for logical consequences of $S$. The effect of declarative statements is consistent across three domains: aligning an AI assistant, predicting weather, and predicting demographic features. Through a series of ablations, we show that the effect of declarative statements cannot be explained by associative learning based on matching keywords. Nevertheless, the effect of declarative statements on model likelihoods is small in absolute terms and increases surprisingly little with model size (i.e. from 330 million to 175 billion parameters). We argue that these results have implications for AI risk (in relation to the "treacherous turn") and for fairness.
It is an important problem in trustworthy machine learning to recognize out-of-distribution (OOD) inputs which are inputs unrelated to the in-distribution task. Many out-of-distribution detection methods have been suggested in recent years. The goal of this paper is to recognize common objectives as well as to identify the implicit scoring functions of different OOD detection methods. We focus on the sub-class of methods that use surrogate OOD data during training in order to learn an OOD detection score that generalizes to new unseen out-distributions at test time. We show that binary discrimination between in- and (different) out-distributions is equivalent to several distinct formulations of the OOD detection problem. When trained in a shared fashion with a standard classifier, this binary discriminator reaches an OOD detection performance similar to that of Outlier Exposure. Moreover, we show that the confidence loss which is used by Outlier Exposure has an implicit scoring function which differs in a non-trivial fashion from the theoretically optimal scoring function in the case where training and test out-distribution are the same, which again is similar to the one used when training an Energy-Based OOD detector or when adding a background class. In practice, when trained in exactly the same way, all these methods perform similarly.
When applying machine learning in safety-critical systems, a reliable assessment of the uncertainy of a classifier is required. However, deep neural networks are known to produce highly overconfident predictions on out-of-distribution (OOD) data and even if trained to be non-confident on OOD data one can still adversarially manipulate OOD data so that the classifer again assigns high confidence to the manipulated samples. In this paper we propose a novel method where from first principles we combine a certifiable OOD detector with a standard classifier into an OOD aware classifier. In this way we achieve the best of two worlds: certifiably adversarially robust OOD detection, even for OOD samples close to the in-distribution, without loss in prediction accuracy and close to state-of-the-art OOD detection performance for non-manipulated OOD data. Moreover, due to the particular construction our classifier provably avoids the asymptotic overconfidence problem of standard neural networks.
Deep neural networks are known to be overconfident when applied to out-of-distribution (OOD) inputs which clearly do not belong to any class. This is a problem in safety-critical applications since a reliable assessment of the uncertainty of a classifier is a key property, allowing to trigger human intervention or to transfer into a safe state. In this paper, we are aiming for certifiable worst case guarantees for OOD detection by enforcing not only low confidence at the OOD point but also in an $l_\infty$-ball around it. For this purpose, we use interval bound propagation (IBP) to upper bound the maximal confidence in the $l_\infty$-ball and minimize this upper bound during training time. We show that non-trivial bounds on the confidence for OOD data generalizing beyond the OOD dataset seen at training time are possible. Moreover, in contrast to certified adversarial robustness which typically comes with significant loss in prediction performance, certified guarantees for worst case OOD detection are possible without much loss in accuracy.
Neural networks have led to major improvements in image classification but suffer from being non-robust to adversarial changes, unreliable uncertainty estimates on out-distribution samples and their inscrutable black-box decisions. In this work we propose RATIO, a training procedure for Robustness via Adversarial Training on In- and Out-distribution, which leads to robust models with reliable and robust confidence estimates on the out-distribution. RATIO has similar generative properties to adversarial training so that visual counterfactuals produce class specific features. While adversarial training comes at the price of lower clean accuracy, RATIO achieves state-of-the-art $l_2$-adversarial robustness on CIFAR10 and maintains better clean accuracy.
It has recently been shown that ReLU networks produce arbitrarily over-confident predictions far away from the training data. Thus, ReLU networks do not know when they don't know. However, this is a highly important property in safety critical applications. In the context of out-of-distribution detection (OOD) there have been a number of proposals to mitigate this problem but none of them are able to make any mathematical guarantees. In this paper we propose a new approach to OOD which overcomes both problems. Our approach can be used with ReLU networks and provides provably low confidence predictions far away from the training data as well as the first certificates for low confidence predictions in a neighborhood of an out-distribution point. In the experiments we show that state-of-the-art methods fail in this worst-case setting whereas our model can guarantee its performance while retaining state-of-the-art OOD performance.