Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhiwei Steven Wu

Dimension-Free Decision Calibration for Nonlinear Loss Functions

Apr 22, 2025

Jingwu Tang, Jiayun Wu, Zhiwei Steven Wu, Jiahao Zhang

Abstract:When model predictions inform downstream decision making, a natural question is under what conditions can the decision-makers simply respond to the predictions as if they were the true outcomes. Calibration suffices to guarantee that simple best-response to predictions is optimal. However, calibration for high-dimensional prediction outcome spaces requires exponential computational and statistical complexity. The recent relaxation known as decision calibration ensures the optimality of the simple best-response rule while requiring only polynomial sample complexity in the dimension of outcomes. However, known results on calibration and decision calibration crucially rely on linear loss functions for establishing best-response optimality. A natural approach to handle nonlinear losses is to map outcomes $y$ into a feature space $\phi(y)$ of dimension $m$, then approximate losses with linear functions of $\phi(y)$. Unfortunately, even simple classes of nonlinear functions can demand exponentially large or infinite feature dimensions $m$. A key open problem is whether it is possible to achieve decision calibration with sample complexity independent of~$m$. We begin with a negative result: even verifying decision calibration under standard deterministic best response inherently requires sample complexity polynomial in~$m$. Motivated by this lower bound, we investigate a smooth version of decision calibration in which decision-makers follow a smooth best-response. This smooth relaxation enables dimension-free decision calibration algorithms. We introduce algorithms that, given $\mathrm{poly}(|A|,1/\epsilon)$ samples and any initial predictor~$p$, can efficiently post-process it to satisfy decision calibration without worsening accuracy. Our algorithms apply broadly to function classes that can be well-approximated by bounded-norm functions in (possibly infinite-dimensional) separable RKHS.

Via

Access Paper or Ask Questions

Validating LLM-as-a-Judge Systems in the Absence of Gold Labels

Mar 07, 2025

Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, Alexandra Chouldechova

Figure 1 for Validating LLM-as-a-Judge Systems in the Absence of Gold Labels

Figure 2 for Validating LLM-as-a-Judge Systems in the Absence of Gold Labels

Figure 3 for Validating LLM-as-a-Judge Systems in the Absence of Gold Labels

Figure 4 for Validating LLM-as-a-Judge Systems in the Absence of Gold Labels

Abstract:The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, has come to play a critical role in scaling and standardizing GenAI evaluations. To validate judge systems, evaluators collect multiple human ratings for each item in a validation corpus, and then aggregate the ratings into a single, per-item gold label rating. High agreement rates between these gold labels and judge system ratings are then taken as a sign of good judge system performance. In many cases, however, items or rating criteria may be ambiguous, or there may be principled disagreement among human raters. In such settings, gold labels may not exist for many of the items. In this paper, we introduce a framework for LLM-as-a-judge validation in the absence of gold labels. We present a theoretical analysis drawing connections between different measures of judge system performance under different rating elicitation and aggregation schemes. We also demonstrate empirically that existing validation approaches can select judge systems that are highly suboptimal, performing as much as 34% worse than the systems selected by alternative approaches that we describe. Based on our findings, we provide concrete recommendations for developing more reliable approaches to LLM-as-a-judge validation.

Via

Access Paper or Ask Questions

All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

Mar 03, 2025

Gokul Swamy, Sanjiban Choudhury, Wen Sun, Zhiwei Steven Wu, J. Andrew Bagnell

Figure 1 for All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

Figure 2 for All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

Figure 3 for All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

Figure 4 for All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

Abstract:From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g. human preferences) before using it to provide online feedback as part of a downstream reinforcement learning (RL) procedure, rather than directly optimizing the policy parameters on the dataset via offline maximum likelihood estimation. In fact, from an information-theoretic perspective, we can only lose information via passing through a reward model and cannot create any new information via on-policy sampling. To explain this discrepancy, we scrutinize several hypotheses on the value of RL in FT through both theoretical and empirical lenses. Of the hypotheses considered, we find the most support for the explanation that on problems with a generation-verification gap, the combination of the ease of learning the relatively simple RM (verifier) from the preference data, coupled with the ability of the downstream RL procedure to then filter its search space to the subset of policies (generators) that are optimal for relatively simple verifiers is what leads to the superior performance of online FT.

Via

Access Paper or Ask Questions

Kandinsky Conformal Prediction: Beyond Class- and Covariate-Conditional Coverage

Feb 24, 2025

Konstantina Bairaktari, Jiayun Wu, Zhiwei Steven Wu

Figure 1 for Kandinsky Conformal Prediction: Beyond Class- and Covariate-Conditional Coverage

Figure 2 for Kandinsky Conformal Prediction: Beyond Class- and Covariate-Conditional Coverage

Figure 3 for Kandinsky Conformal Prediction: Beyond Class- and Covariate-Conditional Coverage

Abstract:Conformal prediction is a powerful distribution-free framework for constructing prediction sets with coverage guarantees. Classical methods, such as split conformal prediction, provide marginal coverage, ensuring that the prediction set contains the label of a random test point with a target probability. However, these guarantees may not hold uniformly across different subpopulations, leading to disparities in coverage. Prior work has explored coverage guarantees conditioned on events related to the covariates and label of the test point. We present Kandinsky conformal prediction, a framework that significantly expands the scope of conditional coverage guarantees. In contrast to Mondrian conformal prediction, which restricts its coverage guarantees to disjoint groups -- reminiscent of the rigid, structured grids of Piet Mondrian's art -- our framework flexibly handles overlapping and fractional group memberships defined jointly on covariates and labels, reflecting the layered, intersecting forms in Wassily Kandinsky's compositions. Our algorithm unifies and extends existing methods, encompassing covariate-based group conditional, class conditional, and Mondrian conformal prediction as special cases, while achieving a minimax-optimal high-probability conditional coverage bound. Finally, we demonstrate the practicality of our approach through empirical evaluation on real-world datasets.

Via

Access Paper or Ask Questions

The Cost of Shuffling in Private Gradient Based Optimization

Feb 05, 2025

Shuli Jiang, Pranay Sharma, Zhiwei Steven Wu, Gauri Joshi

Figure 1 for The Cost of Shuffling in Private Gradient Based Optimization

Figure 2 for The Cost of Shuffling in Private Gradient Based Optimization

Figure 3 for The Cost of Shuffling in Private Gradient Based Optimization

Figure 4 for The Cost of Shuffling in Private Gradient Based Optimization

Abstract:We consider the problem of differentially private (DP) convex empirical risk minimization (ERM). While the standard DP-SGD algorithm is theoretically well-established, practical implementations often rely on shuffled gradient methods that traverse the training data sequentially rather than sampling with replacement in each iteration. Despite their widespread use, the theoretical privacy-accuracy trade-offs of private shuffled gradient methods (\textit{DP-ShuffleG}) remain poorly understood, leading to a gap between theory and practice. In this work, we leverage privacy amplification by iteration (PABI) and a novel application of Stein's lemma to provide the first empirical excess risk bound of \textit{DP-ShuffleG}. Our result shows that data shuffling results in worse empirical excess risk for \textit{DP-ShuffleG} compared to DP-SGD. To address this limitation, we propose \textit{Interleaved-ShuffleG}, a hybrid approach that integrates public data samples in private optimization. By alternating optimization steps that use private and public samples, \textit{Interleaved-ShuffleG} effectively reduces empirical excess risk. Our analysis introduces a new optimization framework with surrogate objectives, adaptive noise injection, and a dissimilarity metric, which can be of independent interest. Our experiments on diverse datasets and tasks demonstrate the superiority of \textit{Interleaved-ShuffleG} over several baselines.

* 54 pages, 6 figures

Via

Access Paper or Ask Questions

Position: LLM Unlearning Benchmarks are Weak Measures of Progress

Oct 03, 2024

Pratiksha Thaker, Shengyuan Hu, Neil Kale, Yash Maurya, Zhiwei Steven Wu, Virginia Smith

Figure 1 for Position: LLM Unlearning Benchmarks are Weak Measures of Progress

Figure 2 for Position: LLM Unlearning Benchmarks are Weak Measures of Progress

Figure 3 for Position: LLM Unlearning Benchmarks are Weak Measures of Progress

Figure 4 for Position: LLM Unlearning Benchmarks are Weak Measures of Progress

Abstract:Unlearning methods have the potential to improve the privacy and safety of large language models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning research community has increasingly turned toward empirical benchmarks to assess the effectiveness of such methods. In this paper, we find that existing benchmarks provide an overly optimistic and potentially misleading view on the effectiveness of candidate unlearning methods. By introducing simple, benign modifications to a number of popular benchmarks, we expose instances where supposedly unlearned information remains accessible, or where the unlearning process has degraded the model's performance on retained information to a much greater extent than indicated by the original benchmark. We identify that existing benchmarks are particularly vulnerable to modifications that introduce even loose dependencies between the forget and retain information. Further, we show that ambiguity in unlearning targets in existing benchmarks can easily lead to the design of methods that overfit to the given test queries. Based on our findings, we urge the community to be cautious when interpreting benchmark results as reliable measures of progress, and we provide several recommendations to guide future LLM unlearning research.

Via

Access Paper or Ask Questions

Multi-group Uncertainty Quantification for Long-form Text Generation

Jul 25, 2024

Terrance Liu, Zhiwei Steven Wu

Figure 1 for Multi-group Uncertainty Quantification for Long-form Text Generation

Figure 2 for Multi-group Uncertainty Quantification for Long-form Text Generation

Figure 3 for Multi-group Uncertainty Quantification for Long-form Text Generation

Figure 4 for Multi-group Uncertainty Quantification for Long-form Text Generation

Abstract:While large language models are rapidly moving towards consumer-facing applications, they are often still prone to factual errors and hallucinations. In order to reduce the potential harms that may come from these errors, it is important for users to know to what extent they can trust an LLM when it makes a factual claim. To this end, we study the problem of uncertainty quantification of factual correctness in long-form natural language generation. Given some output from a large language model, we study both uncertainty at the level of individual claims contained within the output (via calibration) and uncertainty across the entire output itself (via conformal prediction). Moreover, we invoke multicalibration and multivalid conformal prediction to ensure that such uncertainty guarantees are valid both marginally and across distinct groups of prompts. Using the task of biography generation, we demonstrate empirically that having access to and making use of additional group attributes for each prompt improves both overall and group-wise performance. As the problems of calibration, conformal prediction, and their multi-group counterparts have not been extensively explored previously in the context of long-form text generation, we consider these empirical results to form a benchmark for this setting.

Via

Access Paper or Ask Questions

Jogging the Memory of Unlearned Model Through Targeted Relearning Attack

Jun 19, 2024

Shengyuan Hu, Yiwei Fu, Zhiwei Steven Wu, Virginia Smith

Figure 1 for Jogging the Memory of Unlearned Model Through Targeted Relearning Attack

Figure 2 for Jogging the Memory of Unlearned Model Through Targeted Relearning Attack

Figure 3 for Jogging the Memory of Unlearned Model Through Targeted Relearning Attack

Figure 4 for Jogging the Memory of Unlearned Model Through Targeted Relearning Attack

Abstract:Machine unlearning is a promising approach to mitigate undesirable memorization of training data in ML models. However, in this work we show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of targeted relearning attacks. With access to only a small and potentially loosely related set of data, we find that we can 'jog' the memory of unlearned models to reverse the effects of unlearning. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study.

* 17 pages, 8 figures, 12 tables

Via

Access Paper or Ask Questions

Multi-Agent Imitation Learning: Value is Easy, Regret is Hard

Jun 06, 2024

Jingwu Tang, Gokul Swamy, Fei Fang, Zhiwei Steven Wu

Figure 1 for Multi-Agent Imitation Learning: Value is Easy, Regret is Hard

Figure 2 for Multi-Agent Imitation Learning: Value is Easy, Regret is Hard

Figure 3 for Multi-Agent Imitation Learning: Value is Easy, Regret is Hard

Figure 4 for Multi-Agent Imitation Learning: Value is Easy, Regret is Hard

Abstract:We study a multi-agent imitation learning (MAIL) problem where we take the perspective of a learner attempting to coordinate a group of agents based on demonstrations of an expert doing so. Most prior work in MAIL essentially reduces the problem to matching the behavior of the expert within the support of the demonstrations. While doing so is sufficient to drive the value gap between the learner and the expert to zero under the assumption that agents are non-strategic, it does not guarantee robustness to deviations by strategic agents. Intuitively, this is because strategic deviations can depend on a counterfactual quantity: the coordinator's recommendations outside of the state distribution their recommendations induce. In response, we initiate the study of an alternative objective for MAIL in Markov Games we term the regret gap that explicitly accounts for potential deviations by agents in the group. We first perform an in-depth exploration of the relationship between the value and regret gaps. First, we show that while the value gap can be efficiently minimized via a direct extension of single-agent IL algorithms, even value equivalence can lead to an arbitrarily large regret gap. This implies that achieving regret equivalence is harder than achieving value equivalence in MAIL. We then provide a pair of efficient reductions to no-regret online convex optimization that are capable of minimizing the regret gap (a) under a coverage assumption on the expert (MALICE) or (b) with access to a queryable expert (BLADES).

Via

Access Paper or Ask Questions

Orthogonal Causal Calibration

Jun 04, 2024

Justin Whitehouse, Christopher Jung, Vasilis Syrgkanis, Bryan Wilder, Zhiwei Steven Wu

Abstract:Estimates of causal parameters such as conditional average treatment effects and conditional quantile treatment effects play an important role in real-world decision making. Given this importance, one should ensure these estimators are calibrated. While there is a rich literature on calibrating estimators of non-causal parameters, very few methods have been derived for calibrating estimators of causal parameters, or more generally estimators of quantities involving nuisance parameters. In this work, we provide a general framework for calibrating predictors involving nuisance estimation. We consider a notion of calibration defined with respect to an arbitrary, nuisance-dependent loss $\ell$, under which we say an estimator $\theta$ is calibrated if its predictions cannot be changed on any level set to decrease loss. We prove generic upper bounds on the calibration error of any causal parameter estimate $\theta$ with respect to any loss $\ell$ using a concept called Neyman Orthogonality. Our bounds involve two decoupled terms - one measuring the error in estimating the unknown nuisance parameters, and the other representing the calibration error in a hypothetical world where the learned nuisance estimates were true. We use our bound to analyze the convergence of two sample splitting algorithms for causal calibration. One algorithm, which applies to universally orthogonalizable loss functions, transforms the data into generalized pseudo-outcomes and applies an off-the-shelf calibration procedure. The other algorithm, which applies to conditionally orthogonalizable loss functions, extends the classical uniform mass binning algorithm to include nuisance estimation. Our results are exceedingly general, showing that essentially any existing calibration algorithm can be used in causal settings, with additional loss only arising from errors in nuisance estimation.

* 44 pages

Via

Access Paper or Ask Questions