Kaiwen Wu

CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery

Jun 02, 2026

Bo Peng, Kaiwen Wu, Sirui Chen, Zhiheng Wang, Yu Qiao, Chaochao Lu

Abstract: Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as statistical indistinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, existing LLM-augmented methods are vulnerable to LLM errors and incur high token costs. Moreover, reliance on a single data-centric algorithm can make results sensitive to algorithm-specific biases. To address these limitations, we propose CauTion, a framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms through consensus filtering and LLM reliability estimation. CauTion proceeds in three stages. First, an algorithm ensemble uses consensus voting to resolve up to 96% of edges on which algorithms agree, achieving near-perfect accuracy on the filtered consensus edges. Second, a trust-calibrated arbitration mechanism estimates the relative reliability of the LLM and the algorithms via an annotation-free trust calibration procedure; the estimate then governs a trust-weighted voting process that restricts LLM arbitration exclusively to edges with unreliable algorithmic evidence. Third, a cycle repair step guarantees that the final causal graph is acyclic. Experiments on six datasets demonstrate that CauTion consistently outperforms both data-centric and LLM-augmented baselines, with larger gains on larger graphs and strong robustness to LLM errors. Code is available at https://github.com/OpenCausaLab/CauTion.
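
The three stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function names, the vote threshold `min_votes`, and the trust weights are all hypothetical, and the abstract's trust calibration procedure and cycle repair are replaced here by fixed weights and a plain acyclicity check.

```python
from collections import Counter


def split_by_consensus(graphs, min_votes):
    """Split directed edges into agreed and disputed sets.

    graphs: list of sets of (cause, effect) tuples, one set per algorithm.
    min_votes: hypothetical threshold; an edge is "agreed" when at least
    this many algorithms output it.
    """
    votes = Counter(edge for graph in graphs for edge in graph)
    agreed = {edge for edge, n in votes.items() if n >= min_votes}
    disputed = set(votes) - agreed
    return agreed, disputed


def weighted_vote(edge, graphs, algo_trust, llm_says_edge, llm_trust):
    """Decide a disputed edge by a trust-weighted vote.

    algo_trust and llm_trust are assumed to be given; the paper estimates
    reliability with its own calibration procedure, not shown here.
    """
    score_for = sum(w for g, w in zip(graphs, algo_trust) if edge in g)
    score_against = sum(w for g, w in zip(graphs, algo_trust) if edge not in g)
    if llm_says_edge:
        score_for += llm_trust
    else:
        score_against += llm_trust
    return score_for > score_against


def is_acyclic(edges):
    """Check acyclicity with Kahn's algorithm (repeatedly remove sources)."""
    nodes = {node for edge in edges for node in edge}
    indegree = {node: 0 for node in nodes}
    children = {node: [] for node in nodes}
    for cause, effect in edges:
        indegree[effect] += 1
        children[cause].append(effect)
    ready = [node for node in nodes if indegree[node] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    # Every node can be removed only if no cycle exists.
    return removed == len(nodes)
```

In this sketch the LLM is consulted only for edges in `disputed`, which mirrors the abstract's point that arbitration is restricted to edges where the algorithms disagree.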

Knowledge Gradient for Preference Learning

Jan 29, 2026

Kaiwen Wu, Jacob R. Gardner

Abstract: The knowledge gradient is a popular acquisition function in Bayesian optimization (BO) for optimizing black-box objectives with noisy function evaluations. Many practical settings, however, allow only pairwise comparison queries, yielding a preferential BO problem where direct function evaluations are unavailable. Extending the knowledge gradient to preferential BO is hindered by computational challenges. At its core, the look-ahead step in the preferential setting requires computing a non-Gaussian posterior, which was previously considered intractable. In this paper, we address this challenge by deriving an exact and analytical knowledge gradient for preferential BO. We show that the exact knowledge gradient performs strongly on a suite of benchmark problems, often outperforming existing acquisition functions. We also present a case study illustrating the limitation of the knowledge gradient in certain scenarios.

Mixed Likelihood Variational Gaussian Processes

Mar 06, 2025

Kaiwen Wu, Craig Sanders, Benjamin Letham, Phillip Guan

Abstract: Gaussian processes (GPs) are powerful models for human-in-the-loop experiments due to their flexibility and well-calibrated uncertainty. However, GPs modeling human responses typically ignore auxiliary information, including a priori domain expertise and non-task performance information like user confidence ratings. We propose mixed likelihood variational GPs to leverage auxiliary information, which combine multiple likelihoods in a single evidence lower bound to model multiple types of data. We demonstrate the benefits of mixing likelihoods in three real-world experiments with human participants. First, we use mixed likelihood training to impose prior knowledge constraints in GP classifiers, which accelerates active learning in a visual perception task where users are asked to identify geometric errors resulting from camera position errors in virtual reality. Second, we show that leveraging Likert scale confidence ratings by mixed likelihood training improves model fitting for haptic perception of surface roughness. Lastly, we show that Likert scale confidence ratings improve human preference learning in robot gait optimization. The modeling performance improvements found using our framework across this diverse set of applications illustrate the benefits of incorporating auxiliary information into active learning and preference learning by using mixed likelihoods to jointly model multiple inputs.

* 16 pages

Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference

Nov 01, 2024

Jonathan Wenger, Kaiwen Wu, Philipp Hennig, Jacob R. Gardner, Geoff Pleiss, John P. Cunningham

Abstract: Model selection in Gaussian processes scales prohibitively with the size of the training dataset, both in time and memory. While many approximations exist, all incur inevitable approximation error. Recent work accounts for this error in the form of computational uncertainty, which enables, at the cost of quadratic complexity, an explicit tradeoff between computation and precision. Here we extend this development to model selection, which requires significant enhancements to the existing approach, including linear-time scaling in the size of the dataset. We propose a novel training loss for hyperparameter optimization and demonstrate empirically that the resulting method can outperform SGPR, CGGP and SVGP, state-of-the-art methods for GP model selection, on medium to large-scale datasets. Our experiments show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU. As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty, a fundamental prerequisite for optimal decision-making.

* Advances in Neural Information Processing Systems (NeurIPS 2024)

A Fast, Robust Elliptical Slice Sampling Implementation for Linearly Truncated Multivariate Normal Distributions

Jul 15, 2024

Kaiwen Wu, Jacob R. Gardner

Abstract: Elliptical slice sampling, when adapted to linearly truncated multivariate normal distributions, is a rejection-free Markov chain Monte Carlo method. At its core, it requires analytically constructing an ellipse-polytope intersection. The main novelty of this paper is an algorithm that computes this intersection in $\mathcal{O}(m \log m)$ time, where $m$ is the number of linear inequality constraints representing the polytope. We show that an implementation based on this algorithm enhances numerical stability, speeds up running time, and is easy to parallelize for launching multiple Markov chains.

* 13 pages
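
To make the ellipse-polytope intersection concrete, here is a minimal sketch based on sorting. It is not the paper's algorithm, whose details the abstract does not give. It assumes the ellipse is parameterized as $x(\theta) = x \cos\theta + \nu \sin\theta$, so each constraint $a^\top x(\theta) \le b$ reduces to $p \cos\theta + q \sin\theta \le b$ with $p = a^\top x$ and $q = a^\top \nu$. It also assumes the current point is strictly feasible ($p < b$ for every constraint). The function names are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def forbidden_arc(p, q, b):
    """Return the open arc (lo, hi) where p*cos(t) + q*sin(t) > b.

    Returns None when the constraint holds for every angle.
    Assumes strict feasibility at t = 0, i.e. p < b.
    """
    r = math.hypot(p, q)
    if r <= b:
        return None  # r is the maximum of the left-hand side over t
    phi = math.atan2(q, p)  # left-hand side equals r * cos(t - phi)
    half = math.acos(max(-1.0, min(1.0, b / r)))
    return ((phi - half) % TWO_PI, (phi + half) % TWO_PI)


def feasible_arcs(constraints):
    """Intersect all constraints; constraints is a list of (p, q, b).

    Returns sorted feasible intervals inside [0, 2*pi]. Sorting the
    forbidden arcs dominates the cost, giving O(m log m) for m constraints.
    """
    pieces = []
    for p, q, b in constraints:
        arc = forbidden_arc(p, q, b)
        if arc is None:
            continue
        lo, hi = arc
        if lo <= hi:
            pieces.append((lo, hi))
        else:
            # Defensive: split an arc that wraps through 2*pi.
            pieces.append((lo, TWO_PI))
            pieces.append((0.0, hi))
    pieces.sort()
    feasible, cursor = [], 0.0
    for lo, hi in pieces:
        if lo > cursor:
            feasible.append((cursor, lo))  # gap between forbidden arcs
        cursor = max(cursor, hi)
    if cursor < TWO_PI:
        feasible.append((cursor, TWO_PI))
    return feasible
```

Because angle 0 and angle $2\pi$ are the same point, the first and last returned intervals form a single arc around the current state. A sampler would draw $\theta$ uniformly from the returned intervals, which is why no proposals are rejected.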

Understanding Stochastic Natural Gradient Variational Inference

Jun 04, 2024

Kaiwen Wu, Jacob R. Gardner

Abstract: Stochastic natural gradient variational inference (NGVI) is a popular posterior inference method with applications in various probabilistic models. Despite its wide usage, little is known about the non-asymptotic convergence rate in the stochastic setting. We aim to narrow this gap and provide a better understanding. For conjugate likelihoods, we prove the first $\mathcal{O}(\frac{1}{T})$ non-asymptotic convergence rate of stochastic NGVI. The complexity is no worse than stochastic gradient descent (a.k.a. black-box variational inference) and the rate likely has better constant dependency that leads to faster convergence in practice. For non-conjugate likelihoods, we show that stochastic NGVI with the canonical parameterization implicitly optimizes a non-convex objective. Thus, a global convergence rate of $\mathcal{O}(\frac{1}{T})$ is unlikely without some significant new understanding of optimizing the ELBO using natural gradients.

* ICML 2024

Large-Scale Gaussian Processes via Alternating Projection

Oct 26, 2023

Kaiwen Wu, Jonathan Wenger, Haydn Jones, Geoff Pleiss, Jacob R. Gardner

Abstract: Gaussian process (GP) hyperparameter optimization requires repeatedly solving linear systems with $n \times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative numerical methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the corresponding kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling mini-batching. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove our method enjoys linear convergence and empirically we demonstrate its robustness to ill-conditioning. On large-scale benchmark datasets with up to four million data points, our approach accelerates training by a factor of $2\times$ to $27\times$ compared to CG.
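
One way to picture a solver that touches only subblocks of the kernel matrix is a block coordinate update on $Kx = b$. The sketch below is an illustration under that assumption and may differ from the paper's algorithm in its details. It is written in pure Python for clarity, and the names `solve_small` and `block_solve` are hypothetical.

```python
def solve_small(a, rhs):
    """Solve a small dense system by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def block_solve(kernel, b, blocks, sweeps):
    """Approximately solve kernel @ x = b with block coordinate updates.

    kernel: symmetric positive definite matrix as a list of rows.
    blocks: list of index lists that partition range(len(b)).
    Each update reads only the columns of one block, so its cost is
    linear in n for a fixed block size.
    """
    n = len(b)
    x = [0.0] * n
    residual = b[:]  # residual = b - kernel @ x, with x = 0 initially
    for _ in range(sweeps):
        for block in blocks:
            sub = [[kernel[i][j] for j in block] for i in block]
            step = solve_small(sub, [residual[i] for i in block])
            for i, d in zip(block, step):
                x[i] += d
            # Keep the residual consistent using only this block's columns.
            for row in range(n):
                residual[row] -= sum(
                    kernel[row][j] * d for j, d in zip(block, step)
                )
    return x, residual
```

Each block update zeroes the residual on that block, which is the sense in which the iteration alternates between projections. In a real GP setting the block columns would be computed from the kernel function on demand rather than stored.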

The Behavior and Convergence of Local Bayesian Optimization

May 24, 2023

Kaiwen Wu, Kyurae Kim, Roman Garnett, Jacob R. Gardner

Abstract: A recent development in Bayesian optimization is the use of local optimization strategies, which can deliver strong empirical performance on high-dimensional problems compared to traditional global strategies. The "folk wisdom" in the literature is that the focus on local optimization sidesteps the curse of dimensionality; however, little is known concretely about the expected behavior or convergence of Bayesian local optimization routines. We first study the behavior of the local approach, and find that the statistics of individual local solutions of Gaussian process sample paths are surprisingly good compared to what we would expect to recover from global methods. We then present the first rigorous analysis of such a Bayesian local optimization algorithm recently proposed by Müller et al. (2021), and derive convergence rates in both the noisy and noiseless settings.

* 24 pages

Black-Box Variational Inference Converges

May 24, 2023

Kyurae Kim, Kaiwen Wu, Jisu Oh, Yian Ma, Jacob R. Gardner

Abstract: We provide the first convergence guarantee for full black-box variational inference (BBVI), also known as Monte Carlo variational inference. While preliminary investigations worked on simplified versions of BBVI (e.g., bounded domain, bounded support, or only optimizing for the scale), our setup does not need any such algorithmic modifications. Our results hold for log-smooth posterior densities with and without strong log-concavity and the location-scale variational family. Our analysis also reveals that certain algorithm design choices commonly employed in practice, in particular nonlinear parameterizations of the scale of the variational approximation, can result in suboptimal convergence rates. Fortunately, running BBVI with proximal stochastic gradient descent fixes these limitations, and thus achieves the strongest known convergence rate guarantees. We evaluate this theoretical insight by comparing proximal SGD against other standard implementations of BBVI on large-scale Bayesian inference problems.

* under review

Practical and Matching Gradient Variance Bounds for Black-Box Variational Bayesian Inference

Mar 18, 2023

Kyurae Kim, Kaiwen Wu, Jisu Oh, Jacob R. Gardner

Abstract: Understanding the gradient variance of black-box variational inference (BBVI) is a crucial step for establishing its convergence and developing algorithmic improvements. However, existing studies have yet to show that the gradient variance of BBVI satisfies the conditions used to study the convergence of stochastic gradient descent (SGD), the workhorse of BBVI. In this work, we show that BBVI satisfies a matching bound corresponding to the $ABC$ condition used in the SGD literature when applied to smooth and quadratically-growing log-likelihoods. Our results generalize to nonlinear covariance parameterizations widely used in the practice of BBVI. Furthermore, we show that the variance of the mean-field parameterization has provably superior dimensional dependence.

* under review
