
Renan Souza

Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability

Aug 17, 2023
Renan Souza, Tyler J. Skluzacek, Sean R. Wilkinson, Maxim Ziatdinov, Rafael Ferreira da Silva

Modern large-scale scientific discovery requires multidisciplinary collaboration across diverse computing facilities, including High Performance Computing (HPC) machines and the Edge-to-Cloud continuum. Integrated data analysis plays a crucial role in scientific discovery, especially in the current AI era, by enabling Responsible AI development, adherence to the FAIR principles, reproducibility, and user steering. However, the heterogeneous nature of science poses challenges such as dealing with multiple supporting tools, cross-facility environments, and efficient HPC execution. Building on data observability, adapter system design, and provenance, we propose MIDA: an approach for lightweight runtime Multi-workflow Integrated Data Analysis. MIDA defines data observability strategies and adaptability methods for various parallel systems and machine learning tools. With observability, it intercepts the dataflows in the background without requiring instrumentation, while integrating domain, provenance, and telemetry data at runtime into a unified database ready for user steering queries. We conduct experiments showing end-to-end multi-workflow analysis integrating data from Dask and MLflow in a real distributed deep learning use case for materials science that runs on multiple environments with up to 276 GPUs in parallel. We show near-zero overhead running up to 100,000 tasks on 1,680 CPU cores on the Summit supercomputer.

* 19th IEEE International Conference on e-Science (eScience) 2023 - Limassol, Cyprus  
* 10 pages, 5 figures, 2 Listings, 42 references, Paper accepted at IEEE eScience'23 

Context-aware Execution Migration Tool for Data Science Jupyter Notebooks on Hybrid Clouds

Jul 01, 2021
Renato L. F. Cunha, Lucas V. Real, Renan Souza, Bruno Silva, Marco A. S. Netto

Interactive computing notebooks, such as Jupyter notebooks, have become a popular tool for developing and improving data-driven models. Such notebooks tend to be executed either on the user's own machine or in a cloud environment, and each approach has drawbacks and benefits. This paper presents a solution, developed as a Jupyter extension, that automatically selects which cells should be migrated to a more suitable platform for execution, and in which scenarios. We describe how we reduce the execution state of the notebook to decrease migration time, and we exploit knowledge of user interactivity patterns with the notebook to determine which blocks of cells should be migrated. Using notebooks from Earth science (remote sensing), image recognition, and handwritten digit identification (machine learning), our experiments show notebook state reductions of up to 55x and migration decisions leading to performance gains of up to 3.25x when user interactivity with the notebook is taken into consideration.

* 10 pages 

Workflow Provenance in the Lifecycle of Scientific Machine Learning

Sep 30, 2020
Renan Souza, Leonardo G. Azevedo, Vítor Lourenço, Elton Soares, Raphael Thiago, Rafael Brandão, Daniel Civitarese, Emilio Vital Brazil, Marcio Moreno, Patrick Valduriez, Marta Mattoso, Renato Cerqueira, Marco A. S. Netto

Machine Learning (ML) has already fundamentally changed several businesses. More recently, it has also been profoundly impacting computational science and engineering domains, such as geoscience, climate science, and health science. In these domains, users need to perform comprehensive data analyses combining scientific data and ML models to meet critical requirements, such as reproducibility, model explainability, and experiment data understanding. However, scientific ML is multidisciplinary, heterogeneous, and affected by the physical constraints of the domain, making such analyses even more challenging. In this work, we leverage workflow provenance techniques to build a holistic view to support the lifecycle of scientific ML. We contribute (i) a characterization of the lifecycle and a taxonomy for data analyses; (ii) design principles to build this view, with a W3C PROV compliant data representation and a reference system architecture; and (iii) lessons learned after an evaluation in an Oil & Gas case using an HPC cluster with 393 nodes and 946 GPUs. The experiments show that the principles enable queries that integrate domain semantics with ML models while keeping low overhead (<1%), high scalability, and an order of magnitude of query acceleration under certain workloads compared with not using our representation.

* 21 pages, 10 figures, Under review in a scientific journal since June 30th, 2020. arXiv admin note: text overlap with arXiv:1910.04223 

Managing Data Lineage of O&G Machine Learning Models: The Sweet Spot for Shale Use Case

Mar 10, 2020
Raphael Thiago, Renan Souza, L. Azevedo, E. Soares, Rodrigo Santos, Wallas Santos, Max De Bayser, M. Cardoso, M. Moreno, Renato Cerqueira

Machine Learning (ML) has increased its role, becoming essential in several industries. However, questions around training data lineage, such as "where has the dataset used to train this model come from?"; the introduction of several new data protection laws; and the need to meet data governance requirements have hindered the adoption of ML models in the real world. In this paper, we discuss how data lineage can be leveraged to benefit the ML lifecycle when building ML models to discover sweet spots for shale oil and gas production, a major application in the Oil and Gas (O&G) industry.

* 2020 European Association of Geoscientists and Engineers (EAGE) Digitalization Conference and Exhibition  
* Author preprint of paper accepted at the 2020 European Association of Geoscientists and Engineers (EAGE) Digitalization Conference and Exhibition 

Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering

Oct 21, 2019
Renan Souza, Leonardo Azevedo, Vítor Lourenço, Elton Soares, Raphael Thiago, Rafael Brandão, Daniel Civitarese, Emilio Vital Brazil, Marcio Moreno, Patrick Valduriez, Marta Mattoso, Renato Cerqueira, Marco A. S. Netto

Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes infeasible to recreate an ML model from scratch or to explain to stakeholders how it was created. The main limitation of provenance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows in the lifecycle while keeping the provenance capture overhead low. To handle this problem, in this paper we contribute a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that tracks provenance from multiple workflows to address the characteristics of ML and CSE, and to allow for provenance queries with a standard vocabulary. We show a practical use in a real case in the Oil and Gas industry, along with its evaluation using 48 GPUs in parallel.

* 10 pages, 7 figures, Accepted at Workflows in Support of Large-scale Science (WORKS) co-located with the ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) 2019, Denver, Colorado 

A Hybrid Architecture for Multi-Party Conversational Systems

May 04, 2017
Maira Gatti de Bayser, Paulo Cavalin, Renan Souza, Alan Braz, Heloisa Candello, Claudio Pinhanez, Jean-Pierre Briot

Multi-party Conversational Systems are systems with natural language interaction between one or more people or systems. From the moment an utterance is sent to a group to the moment a member replies to it in the group, the system must perform several activities: utterance understanding, information search, and reasoning, among others. In this paper, we present the challenges of designing and building multi-party conversational systems, the state of the art, our proposed hybrid architecture using both rules and machine learning, and some insights gained after implementing and evaluating one in the finance domain.
