Abstract:Integrating probability and nonprobability survey samples is an important problem in modern survey sampling. Nonprobability samples often contain rich outcome information but may lack population representativeness, whereas probability samples provide design-based auxiliary information but may not contain the study variable. We propose a deep neural network (DNN)-assisted doubly robust framework for estimating the finite population mean from these two data sources. The proposed method models the logit sampling score for the nonprobability sample as an unknown nonparametric function and estimates it by maximizing a pseudo-likelihood that combines information from the nonprobability sample and a reference probability sample. The DNN parameters are optimized using the ADAM algorithm. The resulting DNN-estimated sampling scores are incorporated into a DNN-assisted inverse-probability weighted estimator and a deep doubly robust estimator. We establish consistency and convergence rates under regularity conditions and evaluate the finite-sample performance of the proposed estimators through simulation studies and an empirical application using Pew Research Center and Behavioral Risk Factor Surveillance System data. The results suggest that the proposed estimators can improve robustness to parametric propensity-score misspecification, especially when the true selection mechanism is nonlinear.
Abstract:As a favorable tool for explainable artificial intelligence (XAI), Shapley value has been widely used to interpret deep learning based predictive models. However, accurate and efficient estimation of Shapley value is a difficult task since the computation load grows exponentially with the increase of input features. Most existing accelerated Shapley value estimation methods have to compromise on estimation accuracy with efficiency. In this article, we present EmSHAP(Energy model-based Shapley value estimation), which can effectively approximate the expectation of Shapley contribution function/deep learning model under arbitrary subset of features given the rest. In order to determine the proposal conditional distribution in the energy model, a gated recurrent unit(GRU) is introduced by mapping the input features onto a hidden space, so that the impact of input feature orderings can be eliminated. In addition, a dynamic masking scheme is proposed to improve the generalization ability. It is proved in Theorems 1, 2 and 3 that EmSHAP achieves tighter error bound than state-of-the-art methods like KernelSHAP and VAEAC, leading to higher estimation accuracy. Finally, case studies on a medical application and an industrial application show that the proposed Shapley value-based explainable framework exhibits enhanced estimation accuracy without compromise on efficiency.