Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Emmanuel J. Candès

Probably Approximately Correct Labels

Jun 12, 2025

Emmanuel J. Candès, Andrew Ilyas, Tijana Zrnic

Figure 1 for Probably Approximately Correct Labels

Figure 2 for Probably Approximately Correct Labels

Figure 3 for Probably Approximately Correct Labels

Figure 4 for Probably Approximately Correct Labels

Abstract:Obtaining high-quality labeled datasets is often costly, requiring either extensive human annotation or expensive experiments. We propose a method that supplements such "expert" labels with AI predictions from pre-trained models to construct labeled datasets more cost-effectively. Our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. This solution enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large language models, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.

Via

Access Paper or Ask Questions

Posterior Conformal Prediction

Sep 29, 2024

Yao Zhang, Emmanuel J. Candès

Figure 1 for Posterior Conformal Prediction

Figure 2 for Posterior Conformal Prediction

Figure 3 for Posterior Conformal Prediction

Figure 4 for Posterior Conformal Prediction

Abstract:Conformal prediction is a popular technique for constructing prediction intervals with distribution-free coverage guarantees. The coverage is marginal, meaning it only holds on average over the entire population but not necessarily for any specific subgroup. This article introduces a new method, posterior conformal prediction (PCP), which generates prediction intervals with both marginal and approximate conditional validity for clusters (or subgroups) naturally discovered in the data. PCP achieves these guarantees by modelling the conditional conformity score distribution as a mixture of cluster distributions. Compared to other methods with approximate conditional validity, this approach produces tighter intervals, particularly when the test data is drawn from clusters that are well represented in the validation data. PCP can also be applied to guarantee conditional coverage on user-specified subgroups, in which case it achieves robust coverage on smaller subgroups within the specified subgroups. In classification, the theory underlying PCP allows for adjusting the coverage level based on the classifier's confidence, achieving significantly smaller sets than standard conformal prediction sets. We evaluate the performance of PCP on diverse datasets from socio-economic, scientific and healthcare applications.

* 63 pages, 18 figures

Via

Access Paper or Ask Questions

RandALO: Out-of-sample risk estimation in no time flat

Sep 15, 2024

Parth T. Nobel, Daniel LeJeune, Emmanuel J. Candès

Abstract:Estimating out-of-sample risk for models trained on large high-dimensional datasets is an expensive but essential part of the machine learning process, enabling practitioners to optimally tune hyperparameters. Cross-validation (CV) serves as the de facto standard for risk estimation but poorly trades off high bias ($K$-fold CV) for computational cost (leave-one-out CV). We propose a randomized approximate leave-one-out (RandALO) risk estimator that is not only a consistent estimator of risk in high dimensions but also less computationally expensive than $K$-fold CV. We support our claims with extensive simulations on synthetic and real data and provide a user-friendly Python package implementing RandALO available on PyPI as randalo and at https://github.com/cvxgrp/randalo.

* 25 pages, 9 figures

Via

Access Paper or Ask Questions

Can Unconfident LLM Annotations Be Used for Confident Conclusions?

Aug 27, 2024

Kristina Gligorić, Tijana Zrnic, Cinoo Lee, Emmanuel J. Candès, Dan Jurafsky

Abstract:Large language models (LLMs) have shown high agreement with human raters across a variety of tasks, demonstrating potential to ease the challenges of human data collection. In computational social science (CSS), researchers are increasingly leveraging LLM annotations to complement slow and expensive human annotations. Still, guidelines for collecting and using LLM annotations, without compromising the validity of downstream conclusions, remain limited. We introduce Confidence-Driven Inference: a method that combines LLM annotations and LLM confidence indicators to strategically select which human annotations should be collected, with the goal of producing accurate statistical estimates and provably valid confidence intervals while reducing the number of human annotations needed. Our approach comes with safeguards against LLM annotations of poor quality, guaranteeing that the conclusions will be both valid and no less accurate than if we only relied on human annotations. We demonstrate the effectiveness of Confidence-Driven Inference over baselines in statistical estimation tasks across three CSS settings--text politeness, stance, and bias--reducing the needed number of human annotations by over 25% in each. Although we use CSS settings for demonstration, Confidence-Driven Inference can be used to estimate most standard quantities across a broad range of NLP problems.

Via

Access Paper or Ask Questions

Large language model validity via enhanced conformal prediction methods

Jun 14, 2024

John J. Cherian, Isaac Gibbs, Emmanuel J. Candès

Abstract:We develop new conformal inference methods for obtaining validity guarantees on the output of large language models (LLMs). Prior work in conformal language modeling identifies a subset of the text that satisfies a high-probability guarantee of correctness. These methods work by filtering claims from the LLM's original response if a scoring function evaluated on the claim fails to exceed a threshold calibrated via split conformal prediction. Existing methods in this area suffer from two deficiencies. First, the guarantee stated is not conditionally valid. The trustworthiness of the filtering step may vary based on the topic of the response. Second, because the scoring function is imperfect, the filtering step can remove many valuable and accurate claims. We address both of these challenges via two new conformal methods. First, we generalize the conditional conformal procedure of Gibbs et al. (2023) in order to adaptively issue weaker guarantees when they are required to preserve the utility of the output. Second, we show how to systematically improve the quality of the scoring function via a novel algorithm for differentiating through the conditional conformal procedure. We demonstrate the efficacy of our approach on both synthetic and real-world datasets.

* 20 pages, 8 figures

Via

Access Paper or Ask Questions

Boosted Conformal Prediction Intervals

Jun 11, 2024

Ran Xie, Rina Foygel Barber, Emmanuel J. Candès

Abstract:This paper introduces a boosted conformal procedure designed to tailor conformalized prediction intervals toward specific desired properties, such as enhanced conditional coverage or reduced interval length. We employ machine learning techniques, notably gradient boosting, to systematically improve upon a predefined conformity score function. This process is guided by carefully constructed loss functions that measure the deviation of prediction intervals from the targeted properties. The procedure operates post-training, relying solely on model predictions and without modifying the trained model (e.g., the deep network). Systematic experiments demonstrate that starting from conventional conformal methods, our boosted procedure achieves substantial improvements in reducing interval length and decreasing deviation from target conditional coverage.

* 22 pages, 9 figures

Via

Access Paper or Ask Questions

Active Statistical Inference

Mar 05, 2024

Tijana Zrnic, Emmanuel J. Candès

Figure 1 for Active Statistical Inference

Figure 2 for Active Statistical Inference

Figure 3 for Active Statistical Inference

Figure 4 for Active Statistical Inference

Abstract:Inspired by the concept of active learning, we propose active inference$\unicode{x2013}$a methodology for statistical inference with machine-learning-assisted data collection. Assuming a budget on the number of labels that can be collected, the methodology uses a machine learning model to identify which data points would be most beneficial to label, thus effectively utilizing the budget. It operates on a simple yet powerful intuition: prioritize the collection of labels for data points where the model exhibits uncertainty, and rely on the model's predictions where it is confident. Active inference constructs provably valid confidence intervals and hypothesis tests while leveraging any black-box machine learning model and handling any data distribution. The key point is that it achieves the same level of accuracy with far fewer samples than existing baselines relying on non-adaptively-collected data. This means that for the same number of collected samples, active inference enables smaller confidence intervals and more powerful p-values. We evaluate active inference on datasets from public opinion research, census analysis, and proteomics.

Via

Access Paper or Ask Questions

Cross-Prediction-Powered Inference

Oct 11, 2023

Tijana Zrnic, Emmanuel J. Candès

Figure 1 for Cross-Prediction-Powered Inference

Figure 2 for Cross-Prediction-Powered Inference

Figure 3 for Cross-Prediction-Powered Inference

Figure 4 for Cross-Prediction-Powered Inference

Abstract:While reliable data-driven decision-making hinges on high-quality labeled data, the acquisition of quality labels often involves laborious human annotations or slow and expensive scientific measurements. Machine learning is becoming an appealing alternative as sophisticated predictive techniques are being used to quickly and cheaply produce large amounts of predicted labels; e.g., predicted protein structures are used to supplement experimentally derived structures, predictions of socioeconomic indicators from satellite imagery are used to supplement accurate survey data, and so on. Since predictions are imperfect and potentially biased, this practice brings into question the validity of downstream inferences. We introduce cross-prediction: a method for valid inference powered by machine learning. With a small labeled dataset and a large unlabeled dataset, cross-prediction imputes the missing labels via machine learning and applies a form of debiasing to remedy the prediction inaccuracies. The resulting inferences achieve the desired error probability and are more powerful than those that only leverage the labeled data. Closely related is the recent proposal of prediction-powered inference, which assumes that a good pre-trained model is already available. We show that cross-prediction is consistently more powerful than an adaptation of prediction-powered inference in which a fraction of the labeled data is split off and used to train the model. Finally, we observe that cross-prediction gives more stable conclusions than its competitors; its confidence intervals typically have significantly lower variability.

Via

Access Paper or Ask Questions

Statistical Inference for Fairness Auditing

May 05, 2023

John J. Cherian, Emmanuel J. Candès

Figure 1 for Statistical Inference for Fairness Auditing

Figure 2 for Statistical Inference for Fairness Auditing

Figure 3 for Statistical Inference for Fairness Auditing

Figure 4 for Statistical Inference for Fairness Auditing

Abstract:Before deploying a black-box model in high-stakes problems, it is important to evaluate the model's performance on sensitive subpopulations. For example, in a recidivism prediction task, we may wish to identify demographic groups for which our prediction model has unacceptably high false positive rates or certify that no such groups exist. In this paper, we frame this task, often referred to as "fairness auditing," in terms of multiple hypothesis testing. We show how the bootstrap can be used to simultaneously bound performance disparities over a collection of groups with statistical guarantees. Our methods can be used to flag subpopulations affected by model underperformance, and certify subpopulations for which the model performs adequately. Crucially, our audit is model-agnostic and applicable to nearly any performance metric or group fairness criterion. Our methods also accommodate extremely rich -- even infinite -- collections of subpopulations. Further, we generalize beyond subpopulations by showing how to assess performance over certain distribution shifts. We test the proposed methods on benchmark datasets in predictive inference and algorithmic fairness and find that our audits can provide interpretable and trustworthy guarantees.

* 47 pages, 8 figures

Via

Access Paper or Ask Questions

Selection by Prediction with Conformal p-values

Oct 04, 2022

Ying Jin, Emmanuel J. Candès

Figure 1 for Selection by Prediction with Conformal p-values

Figure 2 for Selection by Prediction with Conformal p-values

Figure 3 for Selection by Prediction with Conformal p-values

Figure 4 for Selection by Prediction with Conformal p-values

Abstract:Decision making or scientific discovery pipelines such as job hiring and drug discovery often involve multiple stages: before any resource-intensive step, there is often an initial screening that uses predictions from a machine learning model to shortlist a few candidates from a large pool. We study screening procedures that aim to select candidates whose unobserved outcomes exceed user-specified values. We develop a method that wraps around any prediction model to produce a subset of candidates while controlling the proportion of falsely selected units. Building upon the conformal inference framework, our method first constructs p-values that quantify the statistical evidence for large outcomes; it then determines the shortlist by comparing the p-values to a threshold introduced in the multiple testing literature. In many cases, the procedure selects candidates whose predictions are above a data-dependent threshold. We demonstrate the empirical performance of our method via simulations, and apply it to job hiring and drug discovery datasets.

Via

Access Paper or Ask Questions