Department of Computer Science, Aalto University, Department of Computer Science, University of Manchester
Abstract:A central concern in an interactive intelligent system is optimization of its actions, to be maximally helpful to its human user. In recommender systems for instance, the action is to choose what to recommend, and the optimization task is to recommend items the user prefers. The optimization is done based on earlier user's feedback (e.g. "likes" and "dislikes"), and the algorithms assume the feedback to be faithful. That is, when the user clicks "like," they actually prefer the item. We argue that this fundamental assumption can be extensively violated by human users, who are not passive feedback sources. Instead, they are in control, actively steering the system towards their goal. To verify this hypothesis, that humans steer and are able to improve performance by steering, we designed a function optimization task where a human and an optimization algorithm collaborate to find the maximum of a 1-dimensional function. At each iteration, the optimization algorithm queries the user for the value of a hidden function $f$ at a point $x$, and the user, who sees the hidden function, provides an answer about $f(x)$. Our study on 21 participants shows that users who understand how the optimization works, strategically provide biased answers (answers not equal to $f(x)$), which results in the algorithm finding the optimum significantly faster. Our work highlights that next-generation intelligent systems will need user models capable of helping users who steer systems to pursue their goals.
Abstract:Stochastic gradient MCMC methods, such as stochastic gradient Langevin dynamics (SGLD), have emerged as one of the dominant approaches for posterior sampling in large-scale settings. While gradient evaluations based on only a small fraction of the data significantly reduce the computational cost, they may suffer from high variance, leading to slow convergence. In distributed settings, where the data lie scattered across a number of workers, the problem of high variance is particularly imminent and is even worse if the data subsets held by the workers are very heterogeneous. The impact of variance reduction has been studied in serial settings but not in distributed scenarios so far. In this work, we derive variance bounds for distributed SGLD and introduce the concept of conducive gradients, zero-mean stochastic gradients that serve as a mechanism for sharing probabilistic information between workers. We introduce a novel stochastic gradient estimator which incorporates the inducive gradients, and show both theoretically and empirically that it reduces variance, and hence improves convergence.
Abstract:Matrix factorization is a very common machine learning technique in recommender systems. Bayesian Matrix Factorization (BMF) algorithms would be attractive because of their ability to quantify uncertainty in their predictions and avoid over-fitting, combined with high prediction accuracy. However, they have not been widely used on large-scale data because of their prohibitive computational cost. In recent work, efforts have been made to reduce the cost, both by improving the scalability of the BMF algorithm as well as its implementation, but so far mainly separately. In this paper we show that the state-of-the-art of both approaches to scalability can be combined. We combine the recent highly-scalable Posterior Propagation algorithm for BMF, which parallelizes computation of blocks of the matrix, with a distributed BMF implementation that users asynchronous communication within each block. We show that the combination of the two methods gives substantial improvements in the scalability of BMF on web-scale datasets, when the goal is to reduce the wall-clock time.
Abstract:In many high dimensional classification or regression problems set in a biological context, the complete identification of the set of informative features is often as important as predictive accuracy, since this can provide mechanistic insight and conceptual understanding. Lasso and related algorithms have been widely used since their sparse solutions naturally identify a set of informative features. However, Lasso performs erratically when features are correlated. This limits the use of such algorithms in biological problems, where features such as genes often work together in pathways, leading to sets of highly correlated features. In this paper, we examine the performance of a Lasso derivative, the exclusive group Lasso, in this setting. We propose fast algorithms to solve the exclusive group Lasso, and introduce a solution to the case when the underlying group structure is unknown. The solution combines stability selection with random group allocation and introduction of artificial features. Experiments with both synthetic and real-world data highlight the advantages of this proposed methodology over Lasso in comprehensive selection of informative features.
Abstract:Encoding domain knowledge into the prior over the high-dimensional weight space is challenging in Bayesian neural networks. Two types of domain knowledge are commonly available in scientific applications: 1. feature sparsity (number of relevant features); 2. signal-to-noise ratio, quantified, for instance, as the proportion of variance explained (PVE). We show both types of domain knowledge can be encoded into the widely used Gaussian scale mixture priors with Automatic Relevance Determination. Specifically, we propose a new joint prior over the local (i.e., feature-specific) scale parameters to encode the knowledge about feature sparsity, and an algorithm to determine the global scale parameter (shared by all features) according to the PVE. Empirically, we show that the proposed informative prior improves prediction accuracy on publicly available datasets and in a genetics application.
Abstract:Likelihood-free inference for simulator-based statistical models has recently grown rapidly from its infancy to a useful tool for practitioners. However, models with more than a very small number of parameters as the target of inference have remained an enigma, in particular for the approximate Bayesian computation (ABC) community. To advance the possibilities for performing likelihood-free inference in high-dimensional parameter spaces, here we introduce an extension of the popular Bayesian optimisation based approach to approximate discrepancy functions in a probabilistic manner which lends itself to an efficient exploration of the parameter space. Our method achieves computational scalability by using separate acquisition procedures for the discrepancies defined for different parameters. These efficient high-dimensional simulation acquisitions are combined with exponentiated loss-likelihoods to provide a misspecification-robust characterisation of the marginal posterior distribution for all model parameters. The method successfully performs computationally efficient inference in a 100-dimensional space on canonical examples and compares favourably to existing Copula-ABC methods. We further illustrate the potential of this approach by fitting a bacterial transmission dynamics model to daycare centre data, which provides biologically coherent results on the strain competition in a 30-dimensional parameter space.
Abstract:Bayesian optimization is an effective method for finding extrema of a black-box function. We propose a new type of Bayesian optimization for learning user preferences in high-dimensional spaces. The central assumption is that the underlying objective function cannot be evaluated directly, but instead a minimizer along a projection can be queried, which we call a projective preferential query. The form of the query allows for feedback that is natural for a human to give, and which enables interaction. This is demonstrated in a user experiment in which the user feedback comes in the form of optimal position and orientation of a molecule adsorbing to a surface. We demonstrate that our framework is able to find a global minimum of a high-dimensional black-box function, which is an infeasible task for existing preferential Bayesian optimization frameworks that are based on pairwise comparisons.
Abstract:Differential privacy allows quantifying privacy loss from computations on sensitive personal data. This loss grows with the number of accesses to the data, making it hard to open the use of such data while respecting privacy. To avoid this limitation, we propose privacy-preserving release of a synthetic version of a data set, which can be used for an unlimited number of analyses with any methods, without affecting the privacy guarantees. The synthetic data generation is based on differentially private learning of a generative probabilistic model which can capture the probability distribution of the original data. We demonstrate empirically that we can reliably reproduce statistical discoveries from the synthetic data. We expect the method to have broad use in sharing anonymized versions of key data sets for research.
Abstract:Understanding each other is the key to success in collaboration. For humans, attributing mental states to others, the theory of mind, provides the crucial advantage. We argue for formulating human--AI interaction as a multi-agent problem, endowing AI with a computational theory of mind to understand and anticipate the user. To differentiate the approach from previous work, we introduce a categorisation of user modelling approaches based on the level of agency learnt in the interaction. We describe our recent work in using nested multi-agent modelling to formulate user models for multi-armed bandit based interactive AI systems, including a proof-of-concept user study.
Abstract:The framework of cognitively bounded rationality treats problem solving as fundamentally rational, but emphasises that it is constrained by cognitive architecture and the task environment. This paper investigates a simple decision making heuristic, Take The Best (TTB), within that framework. We formulate TTB as a likelihood-based probabilistic model, where the decision strategy arises by probabilistic inference based on the training data and the model constraints. The strengths of the probabilistic formulation, in addition to providing a bounded rational account of the learning of the heuristic, include natural extensibility with additional cognitively plausible constraints and prior information, and the possibility to embed the heuristic as a subpart of a larger probabilistic model. We extend the model to learn cue discrimination thresholds for continuous-valued cues and experiment with using the model to account for biased preference feedback from a boundedly rational agent in a simulated interactive machine learning task.