Daryl Watson

Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models

Aug 11, 2025

Kyle Moore, Jesse Roberts, Daryl Watson

Abstract: There has been much recent interest in evaluating large language models for uncertainty calibration to facilitate model control and modulate user trust. Inference-time uncertainty, which may provide a real-time signal to the model or external control modules, is particularly important for applying these concepts to improve LLM-user experience in practice. While many existing papers consider model calibration, comparatively little work has sought to evaluate how closely model uncertainty aligns with human uncertainty. In this work, we evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with both human group-level uncertainty and traditional notions of model calibration. We find that numerous measures show evidence of strong alignment with human uncertainty, despite the lack of alignment with human answer preference. For those successful metrics, we find moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis.

* preprint, under review

Investigating Human-Aligned Large Language Model Uncertainty

Mar 16, 2025

Kyle Moore, Jesse Roberts, Daryl Watson, Pamela Wisniewski

Abstract: Recent work has sought to quantify large language model uncertainty to facilitate model control and modulate user trust. Previous works focus on measures of uncertainty that are theoretically grounded or reflect the average overt behavior of the model. In this work, we investigate a variety of uncertainty measures in order to identify measures that correlate with human group-level uncertainty. We find that Bayesian measures and a variation on entropy measures, top-k entropy, tend to agree with human behavior as a function of model size. We find that some strong measures decrease in human-similarity with model size, but, by multiple linear regression, we find that combining multiple uncertainty measures provides comparable human-alignment with reduced size-dependency.
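
The abstract names top-k entropy but does not define it. As a rough illustration only, the sketch below takes one plausible reading: keep the k largest next-token probabilities, renormalize them to sum to 1, and compute the Shannon entropy of that truncated distribution. The function name `top_k_entropy`, the renormalization step, and the use of nats are assumptions made here, not details taken from the paper.

```python
import math


def top_k_entropy(probs, k):
    """Entropy (in nats) of the k largest probabilities after renormalizing.

    Hypothetical helper: one possible reading of "top-k entropy",
    not necessarily the definition the authors use.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Keep only the k most likely outcomes.
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    if total <= 0:
        raise ValueError("the top-k probabilities must have positive mass")
    # Renormalize so the truncated distribution sums to 1, then apply
    # the usual Shannon entropy, skipping zero-probability entries.
    return -sum((p / total) * math.log(p / total) for p in top if p > 0)
```

Under this reading, a peaked distribution such as `[0.97, 0.01, 0.01, 0.01]` yields a value near zero for any k, while a flat distribution over the top k tokens yields the maximum, `log(k)`.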
