Alert button
Picture for Aécio Santos

Aécio Santos

Alert button

Correlation Sketches for Approximate Join-Correlation Queries

Apr 07, 2021
Aécio Santos, Aline Bessa, Fernando Chirigati, Christopher Musco, Juliana Freire

Figure 1 for Correlation Sketches for Approximate Join-Correlation Queries
Figure 2 for Correlation Sketches for Approximate Join-Correlation Queries
Figure 3 for Correlation Sketches for Approximate Join-Correlation Queries
Figure 4 for Correlation Sketches for Approximate Join-Correlation Queries

The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities~to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column $Q$ and a join column $K_Q$ from a query table $\mathcal{T}_Q$, retrieve tables $\mathcal{T}_X$ in a dataset collection such that $\mathcal{T}_X$ is joinable with $\mathcal{T}_Q$ on $K_Q$ and there is a column $C \in \mathcal{T}_X$ such that $Q$ is correlated with $C$. A na\"ive approach to evaluate these queries, which first finds joinable tables and then explicitly joins and computes correlations between $Q$ and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and 2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings.

* Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21) 
Viaarxiv icon

Auctus: A Dataset Search Engine for Data Augmentation

Feb 10, 2021
Fernando Chirigati, Rémi Rampin, Aécio Santos, Aline Bessa, Juliana Freire

Figure 1 for Auctus: A Dataset Search Engine for Data Augmentation
Figure 2 for Auctus: A Dataset Search Engine for Data Augmentation

Machine Learning models are increasingly being adopted in many applications. The quality of these models critically depends on the input data on which they are trained, and by augmenting their input data with external data, we have the opportunity to create better models. However, the massive number of datasets available on the Web makes it challenging to find data suitable for augmentation. In this demo, we present our ongoing efforts to develop a dataset search engine tailored for data augmentation. Our prototype, named Auctus, automatically discovers datasets on the Web and, different from existing dataset search engines, infers consistent metadata for indexing and supports join and union search queries. Auctus is already being used in a real deployment environment to improve the performance of ML models. The demonstration will include various real-world data augmentation examples and visitors will be able to interact with the system.

Viaarxiv icon

Visus: An Interactive System for Automatic Machine Learning Model Building and Curation

Jul 05, 2019
Aécio Santos, Sonia Castelo, Cristian Felix, Jorge Piazentin Ono, Bowen Yu, Sungsoo Hong, Cláudio T. Silva, Enrico Bertini, Juliana Freire

Figure 1 for Visus: An Interactive System for Automatic Machine Learning Model Building and Curation
Figure 2 for Visus: An Interactive System for Automatic Machine Learning Model Building and Curation
Figure 3 for Visus: An Interactive System for Automatic Machine Learning Model Building and Curation
Figure 4 for Visus: An Interactive System for Automatic Machine Learning Model Building and Curation

While the demand for machine learning (ML) applications is booming, there is a scarcity of data scientists capable of building such models. Automatic machine learning (AutoML) approaches have been proposed that help with this problem by synthesizing end-to-end ML data processing pipelines. However, these follow a best-effort approach and a user in the loop is necessary to curate and refine the derived pipelines. Since domain experts often have little or no expertise in machine learning, easy-to-use interactive interfaces that guide them throughout the model building process are necessary. In this paper, we present Visus, a system designed to support the model building process and curation of ML data processing pipelines generated by AutoML systems. We describe the framework used to ground our design choices and a usage scenario enabled by Visus. Finally, we discuss the feedback received in user testing sessions with domain experts.

* Accepted for publication in the 2019 Workshop on Human-In-the-Loop Data Analytics (HILDA'19), co-located with SIGMOD 2019 
Viaarxiv icon

A Topic-Agnostic Approach for Identifying Fake News Pages

May 02, 2019
Sonia Castelo, Thais Almeida, Anas Elghafari, Aécio Santos, Kien Pham, Eduardo Nakamura, Juliana Freire

Figure 1 for A Topic-Agnostic Approach for Identifying Fake News Pages
Figure 2 for A Topic-Agnostic Approach for Identifying Fake News Pages
Figure 3 for A Topic-Agnostic Approach for Identifying Fake News Pages
Figure 4 for A Topic-Agnostic Approach for Identifying Fake News Pages

Fake news and misinformation have been increasingly used to manipulate popular opinion and influence political processes. To better understand fake news, how they are propagated, and how to counter their effect, it is necessary to first identify them. Recently, approaches have been proposed to automatically classify articles as fake based on their content. An important challenge for these approaches comes from the dynamic nature of news: as new political events are covered, topics and discourse constantly change and thus, a classifier trained using content from articles published at a given time is likely to become ineffective in the future. To address this challenge, we propose a topic-agnostic (TAG) classification strategy that uses linguistic and web-markup features to identify fake news pages. We report experimental results using multiple data sets which show that our approach attains high accuracy in the identification of fake news, even as topics evolve over time.

* Accepted for publication in the Companion Proceedings of the 2019 World Wide Web Conference (WWW'19 Companion). Presented in the 2019 International Workshop on Misinformation, Computational Fact-Checking and Credible Web (MisinfoWorkshop2019). 6 pages 
Viaarxiv icon