



Abstract: Recently, businesses have started using MapReduce as a popular computation framework for processing large amounts of data, such as spam detection and various data mining tasks, in both public and private clouds. Two of the challenging questions in such environments are (1) choosing suitable values for MapReduce configuration parameters, e.g., the number of mappers, the number of reducers, and the DFS block size, and (2) predicting the amount of resources that a user should lease from the service provider. Currently, both choosing the configuration parameters and estimating the required resources are solely the user's responsibility. In this paper, we present an approach to provision the total CPU usage in clock cycles of jobs in a MapReduce environment. For a MapReduce job, a profile of total CPU usage in clock cycles is built from the job's past executions with different values of two configuration parameters, namely the number of mappers and the number of reducers. A polynomial regression is then used to model the relation between these configuration parameters and the total CPU usage in clock cycles of the job. We also briefly study the influence of input data scaling on the measured total CPU usage in clock cycles. The derived model, together with the scaling result, can then be used to provision the total CPU usage in clock cycles of the same job with a different input data size. We validate the accuracy of our models using three realistic applications (WordCount, Exim MainLog parsing, and TeraSort). Results show that the predicted total CPU usage in clock cycles of the generated resource provisioning options deviates by less than 8% from the measured total CPU usage in clock cycles on our 20-node virtual Hadoop cluster.
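
To make the modeling step concrete, the following sketch is illustrative only: the regression degree, the sample profile values, and the use of scikit-learn are our own assumptions, not details from the paper. It fits a polynomial regression that maps the number of mappers and the number of reducers to the measured total CPU usage in clock cycles, then predicts the usage for an untried configuration.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import make_pipeline

    # Hypothetical profile of past executions of one MapReduce job:
    # each row of X is (number of mappers, number of reducers); y holds
    # the measured total CPU usage in clock cycles for that run.
    X = np.array([[4, 2], [8, 2], [8, 4], [16, 4], [16, 8], [32, 8]])
    y = np.array([2.1e11, 2.4e11, 2.3e11, 2.9e11, 3.1e11, 3.8e11])

    # Degree-2 polynomial regression over the two configuration parameters
    # (the degree is an assumption for this sketch).
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(X, y)

    # Provision (predict) total CPU clock cycles for an untried configuration.
    predicted = model.predict(np.array([[24, 6]]))
    print(f"Predicted total CPU usage: {predicted[0]:.3e} clock cycles")
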




Abstract: In this paper, we study the CPU utilization time patterns of several Map-Reduce applications. After extracting the running patterns of several applications, the patterns, together with their statistical information, are saved in a reference database to be used later to tweak system parameters so that unknown applications can be executed efficiently in the future. To achieve this goal, the CPU utilization patterns of new applications, along with their statistical information, are compared with the already known ones in the reference database to find/predict their most probable execution patterns. Because the patterns have different lengths, Dynamic Time Warping (DTW) is used for this comparison; a statistical analysis is then applied to the DTW outcomes to select the most suitable candidates. Moreover, under a stated hypothesis, another algorithm is proposed to classify applications with similar CPU utilization patterns. Three widely used text processing applications (WordCount, Distributed Grep, and Terasort) and another application (Exim Mainlog parsing) are used to evaluate our hypothesis for tweaking system parameters when executing similar applications. Results were very promising and showed the effectiveness of our approach on a 5-node Map-Reduce platform.
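
As a minimal illustration of the DTW comparison described above (our own sketch; the sample traces and the plain dynamic-programming implementation are assumptions, not the paper's code), the function below aligns two CPU utilization traces of different lengths and returns their cumulative alignment cost; a smaller cost indicates more similar execution patterns.

    def dtw_distance(a, b):
        """Classic dynamic-programming DTW between two CPU utilization
        traces sampled at the same rate but with different lengths."""
        n, m = len(a), len(b)
        INF = float("inf")
        # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
        cost = [[INF] * (m + 1) for _ in range(n + 1)]
        cost[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(a[i - 1] - b[j - 1])
                cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                     cost[i][j - 1],      # deletion
                                     cost[i - 1][j - 1])  # match
        return cost[n][m]

    # Hypothetical CPU utilization traces (e.g., percent CPU per sampling interval)
    known_pattern = [5, 20, 80, 85, 90, 40, 10]
    new_pattern = [4, 25, 78, 88, 35, 12]
    print(dtw_distance(known_pattern, new_pattern))
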