We explore the application of foundation models to data discovery and exploration tasks. Foundation models are large language models (LLMs) that show promising performance on a range of diverse tasks unrelated to their training. We show that these models are highly applicable to the data discovery and data exploration domain. When carefully used, they have superior capability on three representative tasks: table-class detection, column-type annotation and join-column prediction. On all three tasks, we show that a foundation-model-based approach outperforms the task-specific models and so the state of the art. Further, our approach often surpasses human-expert task performance. This suggests a future direction in which disparate data management tasks can be unified under foundation models.
Automatic machine learning (AutoML) is a key enabler of the mass deployment of the next generation of machine learning systems. A key desideratum for future ML systems is the automatic selection of models and hyperparameters. We present a novel method of selecting performant configurations for a given task by performing offline autoML and mining over a diverse set of tasks. By mining the training tasks, we can select a compact portfolio of configurations that perform well over a wide variety of tasks, as well as learn a strategy to select portfolio configurations for yet-unseen tasks. The algorithm runs in a zero-shot manner, that is without training any models online except the chosen one. In a compute- or time-constrained setting, this virtually instant selection is highly performant. Further, we show that our approach is effective for warm-starting existing autoML platforms. In both settings, we demonstrate an improvement on the state-of-the-art by testing over 62 classification and regression datasets. We also demonstrate the utility of recommending data-dependent default configurations that outperform widely used hand-crafted defaults.
Causal inference is at the heart of empirical research in natural and social sciences and is critical for scientific discovery and informed decision making. The gold standard in causal inference is performing randomized controlled trials; unfortunately these are not always feasible due to ethical, legal, or cost constraints. As an alternative, methodologies for causal inference from observational data have been developed in statistical studies and social sciences. However, existing methods critically rely on restrictive assumptions such as the study population consisting of homogeneous elements that can be represented in a single flat table, where each row is referred to as a unit. In contrast, in many real-world settings, the study domain naturally consists of heterogeneous elements with complex relational structure, where the data is naturally represented in multiple related tables. In this paper, we present a formal framework for causal inference from such relational data. We propose a declarative language called CaRL for capturing causal background knowledge and assumptions and specifying causal queries using simple Datalog-like rules.CaRL provides a foundation for inferring causality and reasoning about the effect of complex interventions in relational domains. We present an extensive experimental evaluation on real relational data to illustrate the applicability of CaRL in social sciences and healthcare.