Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

John Winnicki

Allocate Marginal Reviews to Borderline Papers Using LLM Comparative Ranking

Feb 04, 2026

Elliot L. Epstein, Rajat Dwaraknath, John Winnicki, Thanawat Sornwanee

Abstract:This paper argues that large ML conferences should allocate marginal review capacity primarily to papers near the acceptance boundary, rather than spreading extra reviews via random or affinity-driven heuristics. We propose using LLM-based comparative ranking (via pairwise comparisons and a Bradley--Terry model) to identify a borderline band \emph{before} human reviewing and to allocate \emph{marginal} reviewer capacity at assignment time. Concretely, given a venue-specific minimum review target (e.g., 3 or 4), we use this signal to decide which papers receive one additional review (e.g., a 4th or 5th), without conditioning on any human reviews and without using LLM outputs for accept/reject. We provide a simple expected-impact calculation in terms of (i) the overlap between the predicted and true borderline sets ($ρ$) and (ii) the incremental value of an extra review near the boundary ($Δ$), and we provide retrospective proxies to estimate these quantities.

* 13 pages

Via

Access Paper or Ask Questions

LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval

Oct 30, 2025

Elliot L. Epstein, John Winnicki, Thanawat Sornwanee, Rajat Dwaraknath

Abstract:Large language models (LLMs) excel at numerical estimation but struggle to correctly quantify uncertainty. We study how well LLMs construct confidence intervals around their own answers and find that they are systematically overconfident. To evaluate this behavior, we introduce FermiEval, a benchmark of Fermi-style estimation questions with a rigorous scoring rule for confidence interval coverage and sharpness. Across several modern models, nominal 99\% intervals cover the true answer only 65\% of the time on average. With a conformal prediction based approach that adjusts the intervals, we obtain accurate 99\% observed coverage, and the Winkler interval score decreases by 54\%. We also propose direct log-probability elicitation and quantile adjustment methods, which further reduce overconfidence at high confidence levels. Finally, we develop a perception-tunnel theory explaining why LLMs exhibit overconfidence: when reasoning under uncertainty, they act as if sampling from a truncated region of their inferred distribution, neglecting its tails.

* 8 pages

Via

Access Paper or Ask Questions

Score-Debiased Kernel Density Estimation

Apr 27, 2025

Elliot L. Epstein, Rajat Dwaraknath, Thanawat Sornwanee, John Winnicki, Jerry Weihong Liu

Figure 1 for Score-Debiased Kernel Density Estimation

Figure 2 for Score-Debiased Kernel Density Estimation

Figure 3 for Score-Debiased Kernel Density Estimation

Figure 4 for Score-Debiased Kernel Density Estimation

Abstract:We propose a novel method for density estimation that leverages an estimated score function to debias kernel density estimation (SD-KDE). In our approach, each data point is adjusted by taking a single step along the score function with a specific choice of step size, followed by standard KDE with a modified bandwidth. The step size and modified bandwidth are chosen to remove the leading order bias in the KDE. Our experiments on synthetic tasks in 1D, 2D and on MNIST, demonstrate that our proposed SD-KDE method significantly reduces the mean integrated squared error compared to the standard Silverman KDE, even with noisy estimates in the score function. These results underscore the potential of integrating score-based corrections into nonparametric density estimation.

* ICLR 2025 Workshop on Frontiers of Probabilistic Inference

Via

Access Paper or Ask Questions