Larry Wasserman

Simultaneous inference for generalized linear models with unmeasured confounders

Sep 26, 2023
Jin-Hong Du, Larry Wasserman, Kathryn Roeder

Tens of thousands of simultaneous hypothesis tests are routinely performed in genomic studies to identify differentially expressed genes. However, due to unmeasured confounders, many standard statistical approaches may be substantially biased. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects. Under arbitrary confounding mechanisms, we propose a unified statistical estimation and inference framework that harnesses orthogonal structures and integrates linear projections into three key stages. It begins by disentangling marginal and uncorrelated confounding effects to recover the latent coefficients. Next, latent factors and primary effects are jointly estimated through lasso-type optimization. Finally, we incorporate projected and weighted bias-correction steps for hypothesis testing. Theoretically, we establish identification conditions for the various effects and non-asymptotic error bounds, and we show effective Type-I error control of asymptotic $z$-tests as the sample and response sizes approach infinity. Numerical experiments demonstrate that the proposed method controls the false discovery rate when combined with the Benjamini-Hochberg procedure and is more powerful than alternative methods. By comparing single-cell RNA-seq counts from two groups of samples, we demonstrate the benefit of adjusting for confounding effects when significant covariates are absent from the model.
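
The final stage pairs the bias-corrected $z$-statistics with standard multiple-testing machinery. A minimal sketch of that last step only (the construction of the debiased statistics is the paper's contribution and is not reproduced here): the Benjamini-Hochberg step-up procedure applied to two-sided $z$-test p-values.

```python
import numpy as np
from scipy.stats import norm

def bh_reject(z, alpha=0.05):
    """Benjamini-Hochberg step-up applied to two-sided z-test p-values."""
    p = 2 * norm.sf(np.abs(z))                      # two-sided p-values
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True                      # reject the k smallest p-values
    return rejected
```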

* 67 pages, 10 figures 

The Fundamental Limits of Structure-Agnostic Functional Estimation

May 06, 2023
Sivaraman Balakrishnan, Edward H. Kennedy, Larry Wasserman

Many recent developments in causal inference, and functional estimation problems more generally, have been motivated by the fact that classical one-step (first-order) debiasing methods, or their more recent sample-split double machine-learning avatars, can outperform plugin estimators under surprisingly weak conditions. These first-order corrections improve on plugin estimators in a black-box fashion, and consequently are often used in conjunction with powerful off-the-shelf estimation methods. However, these first-order methods are provably suboptimal in a minimax sense for functional estimation when the nuisance functions live in Hölder-type function spaces. This suboptimality has motivated the development of "higher-order" debiasing methods. The resulting estimators are, in some cases, provably optimal over Hölder-type spaces, but both the minimax-optimal estimators and their analyses are crucially tied to properties of the underlying function space. In this paper we investigate the fundamental limits of structure-agnostic functional estimation, where relatively weak conditions are placed on the underlying nuisance functions. We show that there is a strong sense in which existing first-order methods are optimal: we formalize the problem of functional estimation with black-box nuisance function estimates and derive minimax lower bounds for this problem. Our results highlight a clear tradeoff in functional estimation: if we wish to remain agnostic to the underlying nuisance function spaces, impose only high-level rate conditions, and maintain compatibility with black-box nuisance estimators, then first-order methods are optimal. When we understand the structure of the underlying nuisance functions, carefully constructed higher-order estimators can outperform first-order ones.
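
For context, the kind of first-order, sample-split correction at issue can be sketched for the average treatment effect; the gradient-boosting learners below are arbitrary stand-ins for black-box nuisance estimators, not choices made in the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold

def aipw_ate(X, A, Y, n_splits=2, eps=1e-3):
    """Cross-fitted one-step (AIPW) estimate of E[Y(1)] - E[Y(0)]."""
    psi = np.zeros(len(Y))
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        # fit black-box nuisances on the training fold only
        pi = GradientBoostingClassifier().fit(X[train], A[train])
        mu1 = GradientBoostingRegressor().fit(X[train][A[train] == 1], Y[train][A[train] == 1])
        mu0 = GradientBoostingRegressor().fit(X[train][A[train] == 0], Y[train][A[train] == 0])
        # evaluate on the held-out fold
        e = np.clip(pi.predict_proba(X[test])[:, 1], eps, 1 - eps)
        m1, m0 = mu1.predict(X[test]), mu0.predict(X[test])
        # plugin term plus the first-order (influence-function) correction
        psi[test] = (m1 - m0
                     + A[test] * (Y[test] - m1) / e
                     - (1 - A[test]) * (Y[test] - m0) / (1 - e))
    return psi.mean()
```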

* 32 pages 

Feature Importance: A Closer Look at Shapley Values and LOCO

Mar 10, 2023
Isabella Verdinelli, Larry Wasserman

There has been much recent interest in explainability in statistics and machine learning. One aspect of explainability is quantifying the importance of the various features (or covariates). Two popular methods for defining variable importance are LOCO (Leave Out COvariates) and Shapley values. We examine the properties of these methods and their advantages and disadvantages, with particular attention to the effect of correlation between features, which can obscure interpretability. Contrary to some claims, Shapley values do not eliminate feature correlation. We critique the game-theoretic axioms for Shapley values and propose new, more statistically oriented axioms for feature importance, together with measures that satisfy them. However, correcting for correlation is a Faustian bargain: removing the effect of correlation creates other forms of bias. Ultimately, we recommend a slightly modified version of LOCO. We also briefly consider how to modify Shapley values to better address feature correlation.
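
As a reference point, a minimal sketch of a LOCO-style estimate for feature $j$: refit without the covariate and compare held-out prediction errors. The random forest here is an arbitrary stand-in for any regressor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def loco(X, y, j, seed=0):
    """LOCO for feature j: increase in held-out squared error when j is dropped."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    full = RandomForestRegressor(random_state=seed).fit(X_tr, y_tr)
    drop = RandomForestRegressor(random_state=seed).fit(np.delete(X_tr, j, axis=1), y_tr)
    err_full = np.mean((y_te - full.predict(X_te)) ** 2)
    err_drop = np.mean((y_te - drop.predict(np.delete(X_te, j, axis=1))) ** 2)
    return err_drop - err_full
```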

Data blurring: sample splitting a single sample

Dec 21, 2021
James Leiner, Boyan Duan, Larry Wasserman, Aaditya Ramdas

Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2021) offer an alternative route to accomplishing this task through randomization of $X$ with additive Gaussian noise, which enables post-selection inference in finite samples for Gaussian-distributed data and asymptotically for non-Gaussian additive models. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data blurring, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.
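
A minimal sketch of the Gaussian prototype of such a split, assuming $X \sim N(\mu, \sigma^2)$ with $\sigma$ known: adding scaled noise and subtracting it at the reciprocal scale yields two independent parts that together recover $X$.

```python
import numpy as np

def blur(X, sigma, tau=1.0, rng=None):
    """Split a N(mu, sigma^2) sample X into independent parts f(X), g(X).

    f = X + tau*Z and g = X - Z/tau are jointly Gaussian with zero
    covariance, hence independent, and X = (f + tau**2 * g) / (1 + tau**2).
    The parameter tau trades off information between the two parts.
    """
    rng = np.random.default_rng(rng)
    Z = rng.normal(0.0, sigma, size=np.shape(X))
    return X + tau * Z, X - Z / tau
```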

* 45 pages, 31 figures 

Decorrelated Variable Importance

Nov 21, 2021
Isabella Verdinelli, Larry Wasserman

Because of the widespread use of black-box prediction methods such as random forests and neural nets, there is renewed interest in developing methods for quantifying variable importance as part of the broader goal of interpretable prediction. A popular approach is to define a variable importance parameter, known as LOCO (Leave Out COvariates), based on dropping covariates from a regression model. This is essentially a nonparametric version of R-squared. The parameter is very general and can be estimated nonparametrically, but it can be hard to interpret because it is affected by correlation between covariates. We propose a method for mitigating the effect of correlation by defining a modified version of LOCO. This new parameter is difficult to estimate nonparametrically, but we show how to estimate it using semiparametric models.
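
In symbols, in one common formulation, writing $\mu(x)=\mathbb{E}[Y\mid X=x]$ for the full regression function and $\mu_{-j}$ for the regression omitting covariate $j$, the unmodified LOCO parameter is
$$\psi_j \;=\; \mathbb{E}\big[(Y-\mu_{-j}(X_{-j}))^2\big] \;-\; \mathbb{E}\big[(Y-\mu(X))^2\big],$$
which, after normalization by $\mathrm{Var}(Y)$, gives the nonparametric analog of R-squared mentioned above. The decorrelated version proposed in the paper modifies this definition to remove the dependence induced by correlation between $X_j$ and the remaining covariates.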

Universal Inference Meets Random Projections: A Scalable Test for Log-concavity

Nov 17, 2021
Robin Dunn, Larry Wasserman, Aaditya Ramdas

Shape constraints yield flexible middle grounds between fully nonparametric and fully parametric approaches to modeling distributions of data. The specific assumption of log-concavity is motivated by applications across economics, survival modeling, and reliability theory. However, valid tests of whether the underlying density of given data is log-concave have been lacking. The recent universal likelihood ratio test provides one: it relies only on maximum likelihood estimation (MLE), and efficient methods already exist for computing the log-concave MLE. This yields the first test of log-concavity that is provably valid in finite samples in any dimension, and we also establish asymptotic consistency results. Empirically, we find that the highest power is obtained by using random projections to convert the $d$-dimensional testing problem into many one-dimensional problems, leading to a simple procedure that is statistically and computationally efficient.
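
A minimal sketch of the split likelihood ratio construction in one dimension; the kernel estimate for the alternative and the `logconcave_mle` routine (assumed to wrap existing log-concave MLE software) are illustrative choices, not prescriptions of the paper.

```python
import numpy as np
from scipy.stats import gaussian_kde

def split_lrt_logconcave(X, logconcave_mle, alpha=0.05, rng=None):
    """Universal (split) LRT for H0: the density of X is log-concave.

    `logconcave_mle` is assumed to take a sample and return a density
    function (the log-concave MLE); it is a placeholder for existing
    log-concave estimation software.
    """
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))
    D0, D1 = X[idx[: len(X) // 2]], X[idx[len(X) // 2 :]]
    p1 = gaussian_kde(D1)            # any alternative density estimate, fit on D1
    p0 = logconcave_mle(D0)          # null-constrained MLE, fit on D0
    log_T = np.sum(np.log(p1(D0)) - np.log(p0(D0)))  # likelihood ratio on D0
    return log_T >= np.log(1 / alpha)  # reject H0 iff T >= 1/alpha
```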

Plugin Estimation of Smooth Optimal Transport Maps

Jul 26, 2021
Tudor Manole, Sivaraman Balakrishnan, Jonathan Niles-Weed, Larry Wasserman

We analyze a number of natural estimators for the optimal transport map between two distributions and show that they are minimax optimal. We adopt the plugin approach: our estimators are simply optimal couplings between measures derived from our observations, appropriately extended so that they define functions on $\mathbb{R}^d$. When the underlying map is assumed to be Lipschitz, we show that computing the optimal coupling between the empirical measures, and extending it using linear smoothers, already gives a minimax optimal estimator. When the underlying map enjoys higher regularity, we show that the optimal coupling between appropriate nonparametric density estimates yields faster rates. Our work also provides new bounds on the risk of corresponding plugin estimators for the quadratic Wasserstein distance, and we show how this problem relates to that of estimating optimal transport maps using stability arguments for smooth and strongly convex Brenier potentials. As an application of our results, we derive a central limit theorem for a density plugin estimator of the squared Wasserstein distance, which is centered at its population counterpart when the underlying distributions have sufficiently smooth densities. In contrast to known central limit theorems for empirical estimators, this result easily lends itself to statistical inference for Wasserstein distances.
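
A minimal sketch of the Lipschitz-case recipe: compute the optimal coupling between the two empirical measures, then extend the induced map off the sample. The POT library and the one-nearest-neighbor extension below are illustrative choices standing in for the linear smoothers discussed above.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def plugin_ot_map(X, Y):
    """Estimate the optimal transport map from samples X (n x d) to Y (n x d)."""
    n = len(X)
    a = b = np.full(n, 1.0 / n)
    M = ot.dist(X, Y)                  # squared Euclidean cost matrix
    G = ot.emd(a, b, M)                # optimal coupling (a permutation here)
    match = G.argmax(axis=1)           # X[i] is matched to Y[match[i]]

    def T(x):
        i = np.argmin(np.sum((X - x) ** 2, axis=1))  # nearest sample point
        return Y[match[i]]             # 1-NN extension of the empirical map
    return T
```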

Forest Guided Smoothing

Mar 08, 2021
Isabella Verdinelli, Larry Wasserman

We use the output of a random forest to define a family of local smoothers with spatially adaptive bandwidth matrices. The smoother inherits the flexibility of the original forest but, since it is a simple linear smoother, it is highly interpretable and can be used for tasks that would be intractable for the original forest. These tasks include bias correction, confidence intervals, variable importance assessment, and methods for exploring the structure of the forest. We illustrate the method on synthetic examples and on data related to Covid-19.
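
A minimal sketch of one simple smoother in this spirit, using leaf co-membership frequencies across trees as the weights; the paper's construction, via forest-derived spatially adaptive bandwidth matrices, is more refined than this.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_smoother(X, y, seed=0):
    """Linear smoother with weights from random-forest leaf co-membership."""
    rf = RandomForestRegressor(random_state=seed).fit(X, y)
    train_leaves = rf.apply(X)                     # (n, n_trees) leaf indices

    def smooth(x):
        leaves = rf.apply(x.reshape(1, -1))        # leaves containing x, per tree
        w = (train_leaves == leaves).mean(axis=1)  # co-membership frequency
        w = w / w.sum()
        return w @ y                               # explicit linear smoother
    return smooth
```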

The huge Package for High-dimensional Undirected Graph Estimation in R

Jun 26, 2020
Tuo Zhao, Han Liu, Kathryn Roeder, John Lafferty, Larry Wasserman

We describe an R package named huge which provides easy-to-use functions for estimating high-dimensional undirected graphs from data. This package implements recent results in the literature, including Friedman et al. (2007), Liu et al. (2009, 2012) and Liu et al. (2010). Compared with the existing graph estimation package glasso, the huge package provides extra features: (1) instead of using Fortran, it is written in C, which makes the code more portable and easier to modify; (2) besides fitting Gaussian graphical models, it also provides functions for fitting high-dimensional semiparametric Gaussian copula models; (3) it offers additional functions for data-dependent model selection, data generation and graph visualization; (4) a minor convergence problem of the graphical lasso algorithm is corrected; (5) the package allows the user to apply both lossless and lossy screening rules to scale up to large problems, trading off computational and statistical efficiency.

* Published in JMLR in 2012 