I propose kernel ridge regression estimators for long term causal inference, where a short term experimental data set containing randomized treatment and short term surrogates is fused with a long term observational data set containing short term surrogates and long term outcomes. I propose estimators of treatment effects, dose responses, and counterfactual distributions with closed form solutions in terms of kernel matrix operations. I allow covariates, treatment, and surrogates to be discrete or continuous, and low, high, or infinite dimensional. For long term treatment effects, I prove $\sqrt{n}$ consistency, Gaussian approximation, and semiparametric efficiency. For long term dose responses, I prove uniform consistency with finite sample rates. For long term counterfactual distributions, I prove convergence in distribution.
I construct and justify confidence intervals for longitudinal causal parameters estimated with machine learning. Longitudinal parameters include long term, dynamic, and mediated effects. I provide a nonasymptotic theorem for any longitudinal causal parameter estimated with any machine learning algorithm that satisfies a few simple, interpretable conditions. The main result encompasses local parameters defined for specific demographics as well as proximal parameters defined in the presence of unobserved confounding. Formally, I prove consistency, Gaussian approximation, and semiparametric efficiency. The rate of convergence is $n^{-1/2}$ for global parameters, and it degrades gracefully for local parameters. I articulate a simple set of conditions to translate mean square rates into statistical inference. A key feature of the main result is a new multiple robustness to ill posedness for proximal causal inference in longitudinal settings.
Recent years have seen a growing adoption of Transformer models such as BERT in Natural Language Processing and even in Computer Vision. However, due to their size, there has been limited adoption of such models within resource-constrained computing environments. This paper proposes novel pruning algorithm to compress transformer models by eliminating redundant Attention Heads. We apply the A* search algorithm to obtain a pruned model with strict accuracy guarantees. Our results indicate that the method could eliminate as much as 40% of the attention heads in the BERT transformer model with no loss in accuracy.
I propose kernel ridge regression estimators for nonparametric dose response curves and semiparametric treatment effects in the setting where an analyst has access to a selected sample rather than a random sample; only for select observations, the outcome is observed. I assume selection is as good as random conditional on treatment and a sufficiently rich set of observed covariates, where the covariates are allowed to cause treatment or be caused by treatment -- an extension of missingness-at-random (MAR). I propose estimators of means, increments, and distributions of counterfactual outcomes with closed form solutions in terms of kernel matrix operations, allowing treatment and covariates to be discrete or continuous, and low, high, or infinite dimensional. For the continuous treatment case, I prove uniform consistency with finite sample rates. For the discrete treatment case, I prove root-n consistency, Gaussian approximation, and semiparametric efficiency.
We propose kernel ridge regression estimators for mediation analysis and dynamic treatment effects over short horizons. We allow treatments, covariates, and mediators to be discrete or continuous, and low, high, or infinite dimensional. We propose estimators of means, increments, and distributions of counterfactual outcomes with closed form solutions in terms of kernel matrix operations. For the continuous treatment case, we prove uniform consistency with finite sample rates. For the discrete treatment case, we prove root-n consistency, Gaussian approximation, and semiparametric efficiency. We conduct simulations then estimate mediated and dynamic treatment effects of the US Job Corps program for disadvantaged youth.
We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)^2B. The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken. The goal is to sequentially choose actions for arms so as to maximize the expected value of the cumulative rewards collected. Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy. Our policy is well-defined even if the underlying MDPs are not indexable. We prove that it is asymptotically optimal when the activation budget and number of arms are scaled up, while keeping their ratio as a constant. For the case when the system parameters are unknown, we develop a learning algorithm. Our learning algorithm uses the principle of optimism in the face of uncertainty and further uses a generative model in order to fully exploit the structure of Occupancy-Measured-Reward Index Policy. We call it the R(MA)^2B-UCB algorithm. As compared with the existing algorithms, R(MA)^2B-UCB performs close to an offline optimum policy, and also achieves a sub-linear regret with a low computational complexity. Experimental results show that R(MA)^2B-UCB outperforms the existing algorithms in both regret and run time.
We consider inference problems for a class of continuous state collective hidden Markov models, where the data is recorded in aggregate (collective) form generated by a large population of individuals following the same dynamics. We propose an aggregate inference algorithm called collective Gaussian forward-backward algorithm, extending recently proposed Sinkhorn belief propagation algorithm to models characterized by Gaussian densities. Our algorithm enjoys convergence guarantee. In addition, it reduces to the standard Kalman filter when the observations are generated by a single individual. The efficacy of the proposed algorithm is demonstrated through multiple experiments.
Even the most carefully curated economic data sets have variables that are noisy, missing, discretized, or privatized. The standard workflow for empirical research involves data cleaning followed by data analysis that typically ignores the bias and variance consequences of data cleaning. We formulate a semiparametric model for causal inference with corrupted data to encompass both data cleaning and data analysis. We propose a new end-to-end procedure for data cleaning, estimation, and inference with data cleaning-adjusted confidence intervals. We prove root-n consistency, Gaussian approximation, and semiparametric efficiency for our estimator of the causal parameter by finite sample arguments. Our key assumption is that the true covariates are approximately low rank. In our analysis, we provide nonasymptotic theoretical contributions to matrix completion, statistical learning, and semiparametric statistics. We verify the coverage of the data cleaning-adjusted confidence intervals in simulations.