Juliana Freire

ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models

Nov 06, 2023
Benjamin Feuer, Yurong Liu, Chinmay Hegde, Juliana Freire

Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types which are fixed at training time; require a large number of training samples per type and incur large run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes a new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark. Our code is available at https://github.com/penfever/ArcheType.

* 17 pages, 8 figures 
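
The abstract names four stages: context sampling, prompt serialization, model querying, and label remapping. The sketch below illustrates three of them in plain Python under stated assumptions; it is not the authors' implementation. The function names (`sample_context`, `serialize_prompt`, `remap_label`), the prompt wording, and the string-similarity fallback are all hypothetical, and the model query itself is left out.

```python
from difflib import SequenceMatcher


def sample_context(column, k=5):
    """Context sampling (assumed strategy): keep the first k distinct values."""
    seen = []
    for value in column:
        if value not in seen:
            seen.append(value)
        if len(seen) == k:
            break
    return seen


def serialize_prompt(samples, labels):
    """Prompt serialization (assumed wording): list the values and allowed types."""
    values = ", ".join(str(v) for v in samples)
    options = ", ".join(labels)
    return f"Column values: {values}\nOptions: {options}\nAnswer with one option."


def remap_label(raw_answer, labels):
    """Label remapping (assumed rule): map a free-form answer onto an allowed label.

    An exact case-insensitive match wins; otherwise the label with the
    highest character-level similarity is returned.
    """
    cleaned = raw_answer.strip().lower()
    for label in labels:
        if label.lower() == cleaned:
            return label
    return max(
        labels,
        key=lambda label: SequenceMatcher(None, cleaned, label.lower()).ratio(),
    )


labels = ["city", "country", "person name"]
samples = sample_context(["Brazil", "Peru", "Brazil", "Chile"], k=3)
prompt = serialize_prompt(samples, labels)
# A model call would go here; "Countries" stands in for an off-vocabulary answer.
predicted = remap_label("Countries", labels)
```

The remapping step matters because a zero-shot model is free to answer outside the label set; a fallback of this kind guarantees that every prediction is one of the allowed types.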

eTOP: Early Termination of Pipelines for Faster Training of AutoML Systems

Apr 17, 2023
Haoxiang Zhang, Juliana Freire, Yash Garg

Recent advancements in software and hardware technologies have enabled the use of AI/ML models in everyday applications, which has significantly improved the quality of service rendered. However, for a given application, finding the right AI/ML model is a complex and costly process that involves the generation, training, and evaluation of multiple interlinked steps (called pipelines), such as data pre-processing, feature engineering, selection, and model tuning. These pipelines are complex (in structure) and costly (both in compute resources and time) to execute end-to-end, with a hyper-parameter associated with each step. AutoML systems automate the search for these hyper-parameters but are slow, as they rely on optimizing the pipeline's end output. We propose the eTOP framework, which works on top of any AutoML system and decides whether to execute the pipeline to the end or terminate it at an intermediate step. In an experimental evaluation on 26 benchmark datasets, integrating eTOP with MLBox reduces the training time of the AutoML system by up to 40x compared with baseline MLBox.

AlphaD3M: Machine Learning Pipeline Synthesis

Nov 03, 2021
Iddo Drori, Yamuna Krishnamurthy, Remi Rampin, Raoni de Paula Lourenco, Jorge Piazentin Ono, Kyunghyun Cho, Claudio Silva, Juliana Freire

We introduce AlphaD3M, an automatic machine learning (AutoML) system based on meta reinforcement learning using sequence models with self play. AlphaD3M is based on edit operations performed over machine learning pipeline primitives providing explainability. We compare AlphaD3M with state-of-the-art AutoML systems: Autosklearn, Autostacker, and TPOT, on OpenML datasets. AlphaD3M achieves competitive performance while being an order of magnitude faster, reducing computation time from hours to minutes, and is explainable by design.

* ICML 2018 AutoML Workshop 

Correlation Sketches for Approximate Join-Correlation Queries

Apr 07, 2021
Aécio Santos, Aline Bessa, Fernando Chirigati, Christopher Musco, Juliana Freire

The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column $Q$ and a join column $K_Q$ from a query table $\mathcal{T}_Q$, retrieve tables $\mathcal{T}_X$ in a dataset collection such that $\mathcal{T}_X$ is joinable with $\mathcal{T}_Q$ on $K_Q$ and there is a column $C \in \mathcal{T}_X$ such that $Q$ is correlated with $C$. A naive approach to evaluate these queries, which first finds joinable tables and then explicitly joins and computes correlations between $Q$ and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and 2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings.

* Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21) 
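
To make the query definition concrete, here is a minimal sketch of the naive evaluation the abstract describes as prohibitively expensive: join each candidate column with the query on the key, then compute a correlation on the joined rows. It does not implement the paper's sketching method. The dictionary representation, the `min_overlap` threshold, the use of Pearson correlation, and ranking by absolute value are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / sqrt(var_x * var_y)


def naive_join_correlation(query, candidates, min_overlap=3):
    """Evaluate a join-correlation query by explicit joins.

    query:      dict mapping join key (K_Q) -> numeric value (Q)
    candidates: dict mapping column name -> dict of join key -> numeric value (C)
    Returns (name, correlation) pairs, strongest absolute correlation first.
    """
    results = []
    for name, column in candidates.items():
        shared = sorted(query.keys() & column.keys())  # the explicit join
        if len(shared) < min_overlap:
            continue  # treat as not joinable
        r = pearson([query[k] for k in shared], [column[k] for k in shared])
        results.append((name, r))
    results.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return results


query = {"a": 1, "b": 2, "c": 3, "d": 4}
candidates = {
    "up": {"a": 10, "b": 20, "c": 30, "d": 40},
    "noise": {"a": 5, "b": 1, "c": 4, "d": 2},
    "tiny": {"a": 1},
}
ranking = naive_join_correlation(query, candidates)
```

Every candidate column is joined and scanned in full, so the cost grows with the total size of the collection; that is the expense an index of per-column sketches is meant to avoid.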

Auctus: A Dataset Search Engine for Data Augmentation

Feb 10, 2021
Fernando Chirigati, Rémi Rampin, Aécio Santos, Aline Bessa, Juliana Freire

Machine Learning models are increasingly being adopted in many applications. The quality of these models critically depends on the input data on which they are trained, and by augmenting their input data with external data, we have the opportunity to create better models. However, the massive number of datasets available on the Web makes it challenging to find data suitable for augmentation. In this demo, we present our ongoing efforts to develop a dataset search engine tailored for data augmentation. Our prototype, named Auctus, automatically discovers datasets on the Web and, different from existing dataset search engines, infers consistent metadata for indexing and supports join and union search queries. Auctus is already being used in a real deployment environment to improve the performance of ML models. The demonstration will include various real-world data augmentation examples and visitors will be able to interact with the system.

Debugging Machine Learning Pipelines

Feb 11, 2020
Raoni Lourenço, Juliana Freire, Dennis Shasha

Machine learning tasks entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous or uninformative outputs, the pipeline may fail or produce incorrect results. Inferring the root cause of failures and unexpected behavior is challenging, usually requiring much human thought, and is both time-consuming and error-prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our source code and experimental data will be available for reproducibility and enhancement.

* Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning, June 2019, Article No.: 3  
* 10 pages 

AutoML using Metadata Language Embeddings

Oct 08, 2019
Iddo Drori, Lu Liu, Yi Nian, Sharath C. Koorathota, Jie S. Li, Antonio Khalil Moretti, Juliana Freire, Madeleine Udell

As a human choosing a supervised learning algorithm, it is natural to begin by reading a text description of the dataset and documentation for the algorithms you might use. We demonstrate that the same idea improves the performance of automated machine learning methods. We use language embeddings from modern NLP to improve state-of-the-art AutoML systems by augmenting their recommendations with vector embeddings of datasets and of algorithms. We use these embeddings in a neural architecture to learn the distance between best-performing pipelines. The resulting (meta-)AutoML framework improves on the performance of existing AutoML frameworks. Our zero-shot AutoML system using dataset metadata embeddings provides good solutions instantaneously, running in under one second of computation. Performance is competitive with AutoML systems OBOE, AutoSklearn, AlphaD3M, and TPOT when each framework is allocated a minute of computation. We make our data, models, and code publicly available.

* NeurIPS Workshop on Meta-Learning, 2019  

Visus: An Interactive System for Automatic Machine Learning Model Building and Curation

Jul 05, 2019
Aécio Santos, Sonia Castelo, Cristian Felix, Jorge Piazentin Ono, Bowen Yu, Sungsoo Hong, Cláudio T. Silva, Enrico Bertini, Juliana Freire

While the demand for machine learning (ML) applications is booming, there is a scarcity of data scientists capable of building such models. Automatic machine learning (AutoML) approaches have been proposed that help with this problem by synthesizing end-to-end ML data processing pipelines. However, these follow a best-effort approach and a user in the loop is necessary to curate and refine the derived pipelines. Since domain experts often have little or no expertise in machine learning, easy-to-use interactive interfaces that guide them throughout the model building process are necessary. In this paper, we present Visus, a system designed to support the model building process and curation of ML data processing pipelines generated by AutoML systems. We describe the framework used to ground our design choices and a usage scenario enabled by Visus. Finally, we discuss the feedback received in user testing sessions with domain experts.

* Accepted for publication in the 2019 Workshop on Human-In-the-Loop Data Analytics (HILDA'19), co-located with SIGMOD 2019 

Automatic Machine Learning by Pipeline Synthesis using Model-Based Reinforcement Learning and a Grammar

May 24, 2019
Iddo Drori, Yamuna Krishnamurthy, Raoni Lourenco, Remi Rampin, Kyunghyun Cho, Claudio Silva, Juliana Freire

Automatic machine learning is an important problem at the forefront of machine learning. The strongest AutoML systems are based on neural networks, evolutionary algorithms, and Bayesian optimization. Recently, AlphaD3M reached state-of-the-art results with an order of magnitude speedup using reinforcement learning with self-play. In this work, we extend AlphaD3M by using a pipeline grammar and a pre-trained model which generalizes from many different datasets and similar tasks. Our results demonstrate improved performance compared with our earlier work and existing methods on AutoML benchmark datasets for classification and regression tasks. In the spirit of reproducible research, we make our data, models, and code publicly available.

* ICML Workshop on Automated Machine Learning 