Abstract:Conditional selective inference (SI) has been studied intensively as a new statistical inference framework for data-driven hypotheses. The basic concept of conditional SI is to make the inference conditional on the selection event, which enables an exact and valid statistical inference to be conducted even when the hypothesis is selected based on the data. Conditional SI has mainly been studied in the context of model selection, such as vanilla lasso or generalized lasso. The main limitation of existing approaches is the low statistical power owing to over-conditioning, which is required for computational tractability. In this study, we propose a more powerful and general conditional SI method for a class of problems that can be converted into quadratic parametric programming, which includes generalized lasso. The key concept is to compute the continuum path of the optimal solution in the direction of the selected test statistic and to identify the subset of the data space that corresponds to the model selection event by following the solution path. The proposed parametric programming-based method not only avoids the aforementioned major drawback of over-conditioning, but also improves the performance and practicality of SI in various respects. We conducted several experiments to demonstrate the effectiveness and efficiency of our proposed method.
Abstract:In practical data analysis under noisy environment, it is common to first use robust methods to identify outliers, and then to conduct further analysis after removing the outliers. In this paper, we consider statistical inference of the model estimated after outliers are removed, which can be interpreted as a selective inference (SI) problem. To use conditional SI framework, it is necessary to characterize the events of how the robust method identifies outliers. Unfortunately, the existing methods cannot be directly used here because they are applicable to the case where the selection events can be represented by linear/quadratic constraints. In this paper, we propose a conditional SI method for popular robust regressions by using homotopy method. We show that the proposed conditional SI method is applicable to a wide class of robust regression and outlier detection methods and has good empirical performance on both synthetic data and real data experiments.
Abstract:Conformal prediction constructs a confidence region for an unobserved response of a feature vector based on previous identically distributed and exchangeable observations of responses and features. It has a coverage guarantee at any nominal level without additional assumptions on their distribution. However, it requires a refitting procedure for all replacement candidates of the target response. In regression settings, this corresponds to an infinite number of model fit. Apart from relatively simple estimators that can be written as pieces of linear function of the response, efficiently computing such sets is difficult and is still considered as an open problem. We exploit the fact that, \emph{often}, conformal prediction sets are intervals whose boundaries can be efficiently approximated by classical root-finding software. We investigate how this approach can overcome many limitations of formerly used strategies and achieves calculations that have been unattainable so far. We discuss its complexity as well as its drawbacks and evaluate its efficiency through numerical experiments.
Abstract:Many cases exist in which a black-box function $f$ with high evaluation cost depends on two types of variables $\bm x$ and $\bm w$, where $\bm x$ is a controllable \emph{design} variable and $\bm w$ are uncontrollable \emph{environmental} variables that have random variation following a certain distribution $P$. In such cases, an important task is to find the range of design variables $\bm x$ such that the function $f(\bm x, \bm w)$ has the desired properties by incorporating the random variation of the environmental variables $\bm w$. A natural measure of robustness is the probability that $f(\bm x, \bm w)$ exceeds a given threshold $h$, which is known as the \emph{probability threshold robustness} (PTR) measure in the literature on robust optimization. However, this robustness measure cannot be correctly evaluated when the distribution $P$ is unknown. In this study, we addressed this problem by considering the \textit{distributionally robust PTR} (DRPTR) measure, which considers the worst-case PTR within given candidate distributions. Specifically, we studied the problem of efficiently identifying a reliable set $H$, which is defined as a region in which the DRPTR measure exceeds a certain desired probability $\alpha$, which can be interpreted as a level set estimation (LSE) problem for DRPTR. We propose a theoretically grounded and computationally efficient active learning method for this problem. We show that the proposed method has theoretical guarantees on convergence and accuracy, and confirmed through numerical experiments that the proposed method outperforms existing methods.
Abstract:Conditional Selective Inference (SI) has been actively studied as a new statistical inference framework for data-driven hypotheses. For example, conditional SI framework enables exact (non-asymptotic) inference on the features selected by stepwise feature selection (SFS) method. The basic idea of conditional SI is to make inference conditional on the selection event. The main limitation of existing conditional SI approach for SFS method is the loss of power due to over-conditioning for computational tractability. In this paper, we develop more powerful and general conditional SI method for SFS by resolving the over-conditioning issue by homotopy continuation approach. We conduct several experiments to demonstrate the effectiveness and efficiency of our proposed method.
Abstract:Given a set of sequences comprised of time-ordered events, sequential pattern mining is useful to identify frequent sub-sequences from different sequences or within the same sequence. However, in sport, these techniques cannot determine the importance of particular patterns of play to good or bad outcomes, which is often of greater interest to coaches. In this study, we apply a supervised sequential pattern mining algorithm called safe pattern pruning (SPP) to 490 labelled event sequences representing passages of play from one rugby team's matches from the 2018 Japan Top League, and then evaluate the importance of the obtained sub-sequences to points-scoring outcomes. Linebreaks, successful lineouts, regained kicks in play, repeated phase-breakdown play, and failed opposition exit plays were identified as important patterns of play for the team scoring. When sequences were labelled with points scoring outcomes for the opposition teams, opposition team linebreaks, errors made by the team, opposition team lineouts, and repeated phase-breakdown play by the opposition team were identified as important patterns of play for the opposition team scoring. By virtue of its supervised nature and pruning properties, SPP obtained a greater variety of generally more sophisticated patterns than the well-known unsupervised PrefixSpan algorithm.
Abstract:In the past few years, various approaches have been developed to explain and interpret deep neural network (DNN) representations, but it has been pointed out that these representations are sometimes unstable and not reproducible. In this paper, we interpret these representations as hypotheses driven by DNN (called DNN-driven hypotheses) and propose a method to quantify the reliability of these hypotheses in statistical hypothesis testing framework. To this end, we introduce Selective Inference (SI) framework, which has received much attention in the past few years as a new statistical inference framework for data-driven hypotheses. The basic idea of SI is to make conditional inferences on the selected hypotheses under the condition that they are selected. In order to use SI framework for DNN representations, we develop a new SI algorithm based on homotopy method which enables us to derive the exact (non-asymptotic) conditional sampling distribution of the DNN-driven hypotheses. We conduct experiments on both synthetic and real-world datasets, through which we offer evidence that our proposed method can successfully control the false positive rate, has decent performance in terms of computational efficiency, and provides good results in practical applications.
Abstract:We consider active learning (AL) in an uncertain environment in which trade-off between multiple risk measures need to be considered. As an AL problem in such an uncertain environment, we study Mean-Variance Analysis in Bayesian Optimization (MVA-BO) setting. Mean-variance analysis was developed in the field of financial engineering and has been used to make decisions that take into account the trade-off between the average and variance of investment uncertainty. In this paper, we specifically focus on BO setting with an uncertain component and consider multi-task, multi-objective, and constrained optimization scenarios for the mean-variance trade-off of the uncertain component. When the target blackbox function is modeled by Gaussian Process (GP), we derive the bounds of the two risk measures and propose AL algorithm for each of the above three problems based on the risk measure bounds. We show the effectiveness of the proposed AL algorithms through theoretical analysis and numerical experiments.
Abstract:In many product development problems, the performance of the product is governed by two types of parameters called design parameter and environmental parameter. While the former is fully controllable, the latter varies depending on the environment in which the product is used. The challenge of such a problem is to find the design parameter that maximizes the probability that the performance of the product will meet the desired requisite level given the variation of the environmental parameter. In this paper, we formulate this practical problem as active learning (AL) problems and propose efficient algorithms with theoretically guaranteed performance. Our basic idea is to use Gaussian Process (GP) model as the surrogate model of the product development process, and then to formulate our AL problems as Bayesian Quadrature Optimization problems for probabilistic threshold robustness (PTR) measure. We derive credible intervals for the PTR measure and propose AL algorithms for the optimization and level set estimation of the PTR measure. We clarify the theoretical properties of the proposed algorithms and demonstrate their efficiency in both synthetic and real-world product development problems.
Abstract:Active learning - the field of machine learning (ML) dedicated to optimal experiment design, has played a part in science as far back as the 18th century when Laplace used it to guide his discovery of celestial mechanics [1]. In this work we focus a closed-loop, active learning-driven autonomous system on another major challenge, the discovery of advanced materials against the exceedingly complex synthesis-processes-structure-property landscape. We demonstrate autonomous research methodology (i.e. autonomous hypothesis definition and evaluation) that can place complex, advanced materials in reach, allowing scientists to fail smarter, learn faster, and spend less resources in their studies, while simultaneously improving trust in scientific results and machine learning tools. Additionally, this robot science enables science-over-the-network, reducing the economic impact of scientists being physically separated from their labs. We used the real-time closed-loop, autonomous system for materials exploration and optimization (CAMEO) at the synchrotron beamline to accelerate the fundamentally interconnected tasks of rapid phase mapping and property optimization, with each cycle taking seconds to minutes, resulting in the discovery of a novel epitaxial nanocomposite phase-change memory material.