Dan Suciu

CHORUS: Foundation Models for Unified Data Discovery and Exploration

Jun 16, 2023
Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, Dan Suciu

We explore the application of foundation models to data discovery and exploration tasks. Foundation models are large language models (LLMs) that show promising performance on a range of diverse tasks unrelated to their training. We show that these models are highly applicable to the data discovery and data exploration domain. When used carefully, they deliver superior performance on three representative tasks: table-class detection, column-type annotation, and join-column prediction. On all three tasks, we show that a foundation-model-based approach outperforms the task-specific models and hence the state of the art. Further, our approach often surpasses human-expert performance. This suggests a future direction in which disparate data management tasks can be unified under foundation models.
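
As an illustration of how a foundation model can be pressed into service for one of these tasks, the sketch below asks an LLM to annotate a column's semantic type from a few sample cell values. This is a minimal sketch rather than the CHORUS pipeline: the prompt wording, the candidate type list, and the call_llm helper are placeholders for whatever model API is actually available.

    # Minimal sketch of LLM-based column-type annotation (illustrative, not the
    # CHORUS pipeline). `call_llm` is a hypothetical helper that sends a prompt
    # to some large language model and returns its text completion.

    CANDIDATE_TYPES = ["person name", "country", "date", "price", "email address"]

    def annotate_column(column_values, call_llm):
        """Ask the model to pick a semantic type for a column, given sample values."""
        sample = ", ".join(str(v) for v in column_values[:10])
        prompt = (
            "You are given values from one column of a table.\n"
            f"Values: {sample}\n"
            f"Pick the best semantic type from: {', '.join(CANDIDATE_TYPES)}.\n"
            "Answer with the type only."
        )
        answer = call_llm(prompt).strip().lower()
        # Fall back to the first candidate if the model answers off-list.
        return answer if answer in CANDIDATE_TYPES else CANDIDATE_TYPES[0]

Table-class detection and join-column prediction can be phrased as similarly small prompt-and-parse calls, which is what makes a single foundation model a plausible unified backend for all three tasks.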

Computing Rule-Based Explanations by Leveraging Counterfactuals

Oct 31, 2022
Zixuan Geng, Maximilian Schleich, Dan Suciu

Sophisticated machine learning models are increasingly used for high-stakes decisions in everyday life, creating an urgent need for effective techniques to explain such automated decisions. Rule-based explanations have been proposed for high-stakes decisions such as loan applications because they increase users' trust in the decision. However, rule-based explanations are very inefficient to compute, and existing systems sacrifice their quality in order to achieve reasonable performance. We propose a novel approach to computing rule-based explanations by using a different type of explanation, counterfactual explanations, for which several efficient systems have already been developed. We prove a Duality Theorem showing that rule-based and counterfactual-based explanations are dual to each other, and use this observation to develop an efficient algorithm for computing rule-based explanations that uses a counterfactual-explanation system as an oracle. We conduct extensive experiments showing that our system computes rule-based explanations of higher quality, and with the same or better performance, than two previous systems, MinSetCover and Anchor.
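
The duality can be made concrete with a small sketch: a rule is sufficient exactly when no counterfactual can flip the outcome while leaving the rule's features untouched, so a counterfactual system can be queried as an oracle while the rule is grown. The code below is only an illustration of that idea, assuming a hypothetical counterfactual_oracle; the algorithm in the paper is considerably more refined.

    # Illustrative sketch of building a rule-based explanation with a
    # counterfactual oracle. `counterfactual_oracle(instance, frozen=...)` is a
    # hypothetical function that returns a list of (feature, new_value) changes
    # flipping the outcome without touching the frozen features, or None if no
    # such counterfactual exists.

    def rule_from_counterfactuals(instance, counterfactual_oracle, max_rounds=20):
        """Grow a rule (fixed feature-value pairs) until no counterfactual avoids it."""
        rule = {}                                 # feature -> value required by the rule
        for _ in range(max_rounds):
            changes = counterfactual_oracle(instance, frozen=set(rule))
            if changes is None:                   # outcome cannot be flipped: rule is sufficient
                return rule
            # The outcome can still be flipped, so pin down one feature that the
            # counterfactual had to change (it necessarily lies outside the rule).
            feature, _new_value = changes[0]
            rule[feature] = instance[feature]
        return rule

Viewed this way, the duality is roughly a hitting-set relationship: a sufficient rule must intersect every minimal counterfactual, which is why an efficient counterfactual system can serve as the oracle.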

GeCo: Quality Counterfactual Explanations in Real Time

Jan 05, 2021
Maximilian Schleich, Zixuan Geng, Yihong Zhang, Dan Suciu

Figure 1 for GeCo: Quality Counterfactual Explanations in Real Time
Figure 2 for GeCo: Quality Counterfactual Explanations in Real Time
Figure 3 for GeCo: Quality Counterfactual Explanations in Real Time
Figure 4 for GeCo: Quality Counterfactual Explanations in Real Time

Machine learning is increasingly applied to high-stakes decision making that directly affects people's lives, and this leads to an increased demand for systems that can explain their decisions. Explanations often take the form of counterfactuals, which convey to the end user what they need to change in order to improve the outcome. Computing counterfactual explanations is challenging because of the inherent tension between a rich semantics of the domain and the need for real-time response. In this paper we present GeCo, the first system that can compute plausible and feasible counterfactual explanations in real time. At its core, GeCo relies on a genetic algorithm, which is customized to favor counterfactual explanations with the smallest number of changes. To achieve real-time performance, we introduce two novel optimizations: $\Delta$-representation of candidate counterfactuals, and partial evaluation of the classifier. We compare GeCo empirically against four other systems described in the literature, and show that it is the only system that can achieve both high-quality explanations and real-time answers.

* 13 pages, 7 figures, 3 tables, 3 algorithms 
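
A minimal sketch of the kind of genetic search described here is shown below, under simplifying assumptions: the classifier is a black box returning 0 or 1, candidates are stored only as their dictionary of changed features (loosely imitating the $\Delta$-representation), there is no crossover, and the partial-evaluation optimization is omitted entirely.

    # Minimal sketch of a genetic search for counterfactuals, in the spirit of
    # GeCo but heavily simplified. Candidates are sparse "deltas" holding only
    # the changed features; partial evaluation of the classifier is omitted.
    import random

    def genetic_counterfactual(instance, classify, feature_domains,
                               population=50, generations=100, seed=0):
        rng = random.Random(seed)
        target = 1 - classify(instance)              # we want to flip the prediction
        pool = [{} for _ in range(population)]       # each candidate = dict of changes

        def apply(delta):
            return {**instance, **delta}

        def fitness(delta):
            flipped = classify(apply(delta)) == target
            # Prefer candidates that flip the outcome with the fewest changes.
            return (1 if flipped else 0, -len(delta))

        for _ in range(generations):
            pool.sort(key=fitness, reverse=True)
            if fitness(pool[0])[0] == 1:
                return apply(pool[0])                # counterfactual found
            survivors = pool[: population // 2]
            children = []
            for parent in survivors:                 # mutate each survivor once
                child = dict(parent)
                feature = rng.choice(list(feature_domains))
                child[feature] = rng.choice(feature_domains[feature])
                children.append(child)
            pool = survivors + children
        return None                                  # no counterfactual within budget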

On the Tractability of SHAP Explanations

Sep 18, 2020
Guy Van den Broeck, Anton Lykov, Maximilian Schleich, Dan Suciu

SHAP explanations are a popular feature-attribution mechanism for explainable AI. They use game-theoretic notions to measure the influence of individual features on the prediction of a machine learning model. Despite considerable recent interest from both academia and industry, it is not known whether SHAP explanations of common machine learning models can be computed efficiently. In this paper, we establish the complexity of computing the SHAP explanation in three important settings. First, we consider fully-factorized data distributions and show that the complexity of computing the SHAP explanation is the same as the complexity of computing the expected value of the model. This fully-factorized setting is often used to simplify the SHAP computation, yet our results show that the computation can be intractable for commonly used models such as logistic regression. Going beyond fully-factorized distributions, we show that computing SHAP explanations is already intractable in a very simple setting: trivial classifiers over naive Bayes distributions. Finally, we show that even computing SHAP over the empirical distribution is #P-hard.
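
For reference, the quantity whose complexity is being analyzed is the Shapley value of a feature. In the standard game-theoretic formulation (notation here is generic, not necessarily the paper's), the SHAP score of feature $i$ for model $M$, instance $x$, and feature set $F$ is

    $\mathrm{SHAP}(i) = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\,\big(v(S \cup \{i\}) - v(S)\big)$, with $v(S) = \mathbb{E}\big[M(X) \mid X_S = x_S\big]$,

where $v(S)$ is the expected prediction when the features in $S$ are fixed to their values in $x$ and the remaining features are drawn from the data distribution. The three settings above differ in the distribution used for this expectation: fully-factorized, naive Bayes, and empirical.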

Causal Relational Learning

Apr 07, 2020
Babak Salimi, Harsh Parikh, Moe Kayali, Sudeepa Roy, Lise Getoor, Dan Suciu

Causal inference is at the heart of empirical research in the natural and social sciences and is critical for scientific discovery and informed decision making. The gold standard in causal inference is performing randomized controlled trials; unfortunately, these are not always feasible due to ethical, legal, or cost constraints. As an alternative, methodologies for causal inference from observational data have been developed in statistical studies and social sciences. However, existing methods rely critically on restrictive assumptions such as the study population consisting of homogeneous elements that can be represented in a single flat table, where each row is referred to as a unit. In contrast, in many real-world settings the study domain naturally consists of heterogeneous elements with complex relational structure, where the data is naturally represented in multiple related tables. In this paper, we present a formal framework for causal inference from such relational data. We propose a declarative language called CaRL for capturing causal background knowledge and assumptions, and for specifying causal queries using simple Datalog-like rules. CaRL provides a foundation for inferring causality and reasoning about the effect of complex interventions in relational domains. We present an extensive experimental evaluation on real relational data to illustrate the applicability of CaRL in social sciences and healthcare.

Causality-based Explanation of Classification Outcomes

Mar 15, 2020
Leopoldo Bertossi, Jordan Li, Maximilian Schleich, Dan Suciu, Zografoula Vagena

We propose a simple definition of an explanation for the outcome of a classifier based on concepts from causality. We compare it with previously proposed notions of explanation, and study their complexity. We conduct an experimental evaluation with two real datasets from the financial domain.

* 15 pages, 5 figures, 1 table 
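
One classical causality-based notion that definitions of this kind build on (the paper's precise definition may differ) is counterfactual causation with responsibility: a feature value $x_i$ is a counterfactual cause of the outcome if there is a contingency set $\Gamma$ of other feature values that can first be changed so that subsequently changing $x_i$ alone flips the classification, and its responsibility is $1/(1+|\Gamma|)$ for a minimum-size $\Gamma$. Features with responsibility 1 flip the outcome on their own; larger contingencies indicate weaker causal influence.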

Mosaic: A Sample-Based Database System for Open World Query Processing

Jan 10, 2020
Laurel Orr, Samuel Ainsworth, Walter Cai, Kevin Jamieson, Magda Balazinska, Dan Suciu

Data scientists have relied on samples to analyze populations of interest for decades. Recently, with the increase in the number of public data repositories, sample data has become easier to access. It has not, however, become easier to analyze. This sample data is arbitrarily biased with an unknown sampling probability, meaning data scientists must manually debias the sample with custom techniques to avoid inaccurate results. In this vision paper, we propose Mosaic, a database system that treats samples as first-class citizens and allows users to ask questions over populations represented by these samples. Answering queries over biased samples is non-trivial as there is no existing, standard technique to answer population queries when the sampling probability is unknown. In this paper, we show how our envisioned system solves this problem by having a unique sample-based data model with extensions to the SQL language. We propose how to perform population query answering using biased samples and give preliminary results for one of our novel query answering techniques.

* CIDR 2020 
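
For context on why the unknown-probability case is hard, the sketch below shows the textbook estimator available when each row's inclusion probability is known: a Horvitz-Thompson-style inverse-probability-weighted sum. Mosaic targets precisely the setting where these probabilities are not available, so this is background rather than the system's technique, and the rows and probabilities are made up for illustration.

    # Background sketch: Horvitz-Thompson-style estimate of a population total
    # from a biased sample whose per-row inclusion probabilities ARE known.
    # Mosaic addresses the harder case where these probabilities are unknown.

    def population_total(sample_rows, value, inclusion_prob):
        """Estimate sum(value) over the full population from a weighted sample."""
        return sum(value(row) / inclusion_prob(row) for row in sample_rows)

    # Hypothetical example: high-income rows were sampled with twice the probability.
    rows = [{"income": 30_000, "p": 0.01}, {"income": 250_000, "p": 0.02}]
    estimate = population_total(rows, lambda r: r["income"], lambda r: r["p"])

Without the p column, the per-row weights must themselves be inferred or modeled, which is the gap the population query answering techniques in the paper aim to fill.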

Data Management for Causal Algorithmic Fairness

Oct 01, 2019
Babak Salimi, Bill Howe, Dan Suciu

Fairness is increasingly recognized as a critical component of machine learning systems. However, it is the underlying data on which these systems are trained that often reflects discrimination, suggesting a data management problem. In this paper, we first make a distinction between associational and causal definitions of fairness in the literature and argue that the concept of fairness requires causal reasoning. We then review existing works and identify future opportunities for applying data management techniques to causal algorithmic fairness.

* arXiv admin note: text overlap with arXiv:1902.08283 
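
To make the associational-versus-causal distinction concrete, two standard definitions from the fairness literature can be contrasted (these are illustrative, not necessarily the paper's exact formalization). Statistical parity is purely associational: for a protected attribute $A$ and prediction $\hat{Y}$, it requires $P(\hat{Y}=1 \mid A=a) = P(\hat{Y}=1 \mid A=a')$ for all groups $a, a'$. Counterfactual fairness, in contrast, requires a causal model: $P(\hat{Y}_{A \leftarrow a} = 1 \mid X=x, A=a) = P(\hat{Y}_{A \leftarrow a'} = 1 \mid X=x, A=a)$, i.e. the prediction distribution must be unchanged under an intervention that sets the protected attribute to a different value. Evaluating the latter condition requires reasoning over interventions rather than conditioning on observed data, which is the sense in which fairness requires causal reasoning and where data management techniques for causal inference become relevant.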