International Conference on Computing, Communication and Automation (ICCCA2016)

Observations on Factors Affecting Performance of MapReduce based Apriori on Hadoop Cluster

Sudhakar Singh, Department of Computer Science, Institute of Science, BHU, Varanasi, India
Rakhi Garg, Department of Computer Science, Mahila Mahavidyalaya, BHU, Varanasi, India
P. K. Mishra, Department of Computer Science, Institute of Science, BHU, Varanasi, India

Abstract—Designing fast and scalable algorithms for mining frequent itemsets has always been an eminent and promising problem of data mining. Apriori is one of the most widely used and popular algorithms of frequent itemset mining. Designing efficient algorithms on the MapReduce framework to process and analyze big datasets is contemporary research nowadays. In this paper, we focus on the performance of MapReduce based Apriori on homogeneous as well as heterogeneous Hadoop clusters. We have investigated a number of factors that significantly affect the execution time of MapReduce based Apriori running on homogeneous and heterogeneous Hadoop clusters. The factors concern both algorithmic and non-algorithmic improvements. The factors specific to algorithmic improvements are filtered transactions and data structures. Experimental results show how an appropriate data structure and the filtered transactions technique drastically reduce the execution time. The non-algorithmic factors include speculative execution, nodes with poor performance, data locality & distribution of data blocks, and parallelism control with input split size. We have applied strategies against these factors and fine tuned the relevant parameters for our particular application. Experimental results show that if cluster specific parameters are taken care of, there is a significant reduction in execution time. We also discuss issues regarding the MapReduce implementation of Apriori which may significantly influence the performance.

Keywords—Frequent Itemset Mining; Apriori; Heterogeneous Hadoop Cluster; MapReduce

I. INTRODUCTION

Frequent itemset mining on big data sets is one of the most contemporary research problems in Data Mining [1] and Big Data [2]. In order to mine intelligence from big data sets, data mining algorithms are being re-designed on the MapReduce framework to be executed on Hadoop clusters. Hadoop [3] is an extremely large scale and fault tolerant parallel and distributed system for managing and processing big data, and MapReduce is its computational framework. Big data is dumb if we do not have algorithms to make use of it [4]; it is an algorithm that transforms the data into valuable and precise information. Frequent itemset mining is one of the most important techniques of data mining. Apriori [5] is the most famous, simple and well-known algorithm for mining frequent itemsets using candidate itemset generation. Many parallel and distributed versions of the Apriori algorithm have been designed to enhance the speed and to mine large scale datasets [6-7]. These algorithms are efficient in analyzing data but not in managing large scale data. Hadoop is an excellent infrastructure that provides an integrated service for managing and processing excessive volumes of datasets. The core constituents of Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce [8-9]. HDFS provides scalable and fast access to its virtually unlimited storage of data. MapReduce is a parallel programming model that provides efficient and scalable processing of the large volumes of data stored in HDFS. An application executes as a MapReduce job on a Hadoop cluster. MapReduce provides high scalability, as a job is partitioned into a number of smaller tasks that run in parallel on multiple nodes of the cluster. The MapReduce programming model is so simplified that programmers only need to focus on processing data rather than on parallelism related details, e.g. data & task partitioning, load balancing etc.

The performance of a MapReduce job running on a Hadoop cluster can be optimized in two ways. The first is algorithm specific, where algorithmic optimizations can be incorporated directly. The second is cluster specific, where one can adjust some parameters of the cluster configuration and the input size for datasets. Many techniques have been proposed to optimize the performance of the Apriori algorithm on the MapReduce framework. Hadoop is designed on the implicit assumption that the nodes in the cluster are homogeneous. But in practice it is not always possible to have homogeneous nodes; most laboratories and institutions tend to have heterogeneous machines. So it becomes essential to adopt proper strategies when running a MapReduce job on a heterogeneous Hadoop cluster.

The performance of a MapReduce job running on a Hadoop cluster is greatly affected by the tuning of various parameters specific to the cluster configuration. For CPU intensive algorithms like Apriori, the granularity of the input split may lead to a major difference in execution times. In this paper we have incorporated two algorithm specific techniques, data structures and filtered transactions, in the Apriori algorithm, which greatly reduce the execution time on both homogeneous and heterogeneous clusters. We have also investigated some factors specific to the cluster configuration to make the execution faster. Factors central to our discussion are speculative execution, performance of physical nodes versus virtual nodes, distribution of data blocks, and parallelism control using input split size.

© 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Citation: S. Singh, R. Garg and P. K. Mishra, "Observations on factors affecting performance of MapReduce based Apriori on Hadoop cluster," 2016 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, 2016, pp. 87-94. doi: 10.1109/CCAA.2016.7813695
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7813695&isnumber=7813678


Moreover, we have also discussed an issue regarding the MapReduce implementation of Apriori that may quite possibly influence the execution time. We have executed different variations of MapReduce based Apriori on our local heterogeneous Hadoop cluster and found that we can achieve faster execution by tuning the cluster specific parameters, without making algorithmic improvements.

The rest of the paper is organized as follows. Section 2 introduces some fundamental concepts regarding the Apriori algorithm, Hadoop cluster and MapReduce programming paradigm. Section 3 summarizes works related to the optimization of Apriori on the MapReduce framework and the performance improvement of MapReduce jobs on heterogeneous clusters. The experimental platform is described in section 4. Factors affecting the performance of MapReduce based Apriori and the strategies adopted to improve the performance, along with the experimental results, are discussed in section 5. Finally, section 6 concludes the paper.

II. BASIC CONCEPTS

A. Apriori Algorithm

Apriori is an iterative algorithm proposed by R. Agrawal and R. Srikant [5], which finds frequent itemsets by generating candidate itemsets. The name Apriori is based on the apriori property, which states that all the (k-1)-subsets of a frequent k-itemset must also be frequent [5]. Apriori first scans the database and counts the support of each item, and then checks it against the minimum support threshold to generate the set of frequent 1-itemsets L1. In the kth iteration (k ≥ 2), candidate k-itemsets Ck are generated from the frequent (k-1)-itemsets Lk-1. The entire database is scanned again to count the support of the candidates Ck, which are tested against the minimum support to generate the frequent k-itemsets Lk. These generate and test steps are carried out in each iteration until no new candidates can be generated. Candidates Ck are generated from the frequent itemsets Lk-1 using join and prune actions. The frequent itemsets Lk-1 are conditionally joined with themselves such that two itemsets of Lk-1 are joined if and only if their first (k-2) items are equal and the (k-1)th item of the first itemset is lexicographically smaller than the respective item of the other itemset. Pruning based on the apriori property reduces the size of the candidate set Ck by removing infrequent itemsets.
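To make the join condition concrete, the following is a minimal, self-contained Java sketch (not taken from the paper's implementation) of the join step of candidate generation. It assumes each (k-1)-itemset is kept as a sorted array of integer item IDs and that Lk-1 is lexicographically ordered; the subsequent prune step is only indicated by a comment.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the join step: two frequent (k-1)-itemsets are joined when
// their first k-2 items are equal and the last item of the first is lexicographically
// smaller than the last item of the second.
public class CandidateJoin {
  static List<int[]> join(List<int[]> lkMinus1) {
    List<int[]> candidates = new ArrayList<>();
    for (int i = 0; i < lkMinus1.size(); i++) {
      for (int j = i + 1; j < lkMinus1.size(); j++) {
        int[] a = lkMinus1.get(i), b = lkMinus1.get(j);
        boolean joinable = true;
        for (int p = 0; p < a.length - 1; p++) {          // first k-2 items must match
          if (a[p] != b[p]) { joinable = false; break; }
        }
        if (joinable && a[a.length - 1] < b[b.length - 1]) { // (k-1)th item smaller
          int[] c = new int[a.length + 1];
          System.arraycopy(a, 0, c, 0, a.length);
          c[a.length] = b[b.length - 1];
          candidates.add(c);   // the prune step (checking all (k-1)-subsets) would follow
        }
      }
    }
    return candidates;
  }
}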
B. Hadoop and MapReduce

Hadoop is designed on the fundamental principle of moving computing power to where the data is, rather than moving the data, as is done in traditional parallel and distributed computing systems using MPI (Message Passing Interface). Hadoop is an extremely scalable and highly fault tolerant distributed infrastructure that automatically handles parallelization, load balancing and data distribution [10]. The core components of Hadoop are HDFS and MapReduce, which are inspired by the Google File System (GFS) [11] and Google's MapReduce model [12]. HDFS is capable of storing excessive volumes of data with fast access to the stored data, and it provides high throughput and high availability of data. Files are stored in HDFS after being broken into smaller data blocks (the default block size is 64 MB). Blocks are replicated across multiple nodes of the cluster (the default replication factor is 3). A Hadoop cluster works on a master-slave architecture in which one node is the master node and the remaining nodes are slave nodes. The master node, known as the NameNode, controls the slave nodes, known as DataNodes. The slave nodes hold all the data blocks and perform the map and reduce tasks [10].

A computational application runs as a MapReduce job on input datasets residing in the HDFS of a Hadoop cluster. A MapReduce job consists of map and reduce tasks, and both work on data in the form of (key, value) pairs. Map and reduce tasks are executed by the Mapper and Reducer classes respectively of the MapReduce framework. An additional Combiner class may also be used to execute the reduce task and is known as a mini reducer. A number of instances of Mapper and Reducer run in parallel, but a Reducer starts only when all the Mappers have completed. A Mapper takes its assigned dataset as input, processes it and produces a number of (key, value) pairs as output. These (key, value) pairs are assigned to Reducers after sorting and shuffling by MapReduce's underlying system. The shuffling procedure assigns a key and the list of values associated with this key to a particular Reducer. A Reducer takes (key, list of values) pairs as input and produces new (key, value) pairs. A Combiner works on the output of the Mappers of one node to reduce the data transfer load from Mappers to Reducers. In the MapReduce framework, communication occurs only once, when the output of the Mappers is transferred to the Reducers [10].

Apriori is an iterative algorithm which generates the frequent k-itemsets in the kth iteration. Corresponding to each iteration of Apriori, one has to trigger a new MapReduce job/phase. In each MapReduce phase, the input data is read from HDFS and the frequent itemsets of the previous phase are read from the DistributedCache [9] to generate the frequent itemsets of the current phase.

III. RELATED WORKS

Apriori has been re-designed on the MapReduce framework by many authors [13-17], but most of these are straightforward implementations. The FPC (Fixed Passes Combined-counting) and DPC (Dynamic Passes Combined-counting) algorithms combine multiple consecutive passes of SPC (Single Pass Counting) to enhance the performance [18]; SPC is a straightforward implementation of Apriori on the MapReduce framework. The algorithm proposed by F. Kovacs and J. Illes [19] invokes candidate generation inside the Reducer, where it is usually done inside the Mapper. The authors also proposed a triangular matrix data structure for separate counting of 1- and 2-itemsets in a single iteration. L. Li and M. Zhang [20] proposed a dataset distribution strategy for heterogeneous Hadoop clusters and used it in a single MapReduce phase implementation of Apriori. Honglie Yu et al. [21] proposed an algorithm on Hadoop that uses a Boolean matrix and the AND operator on this matrix. PARMA, a parallel randomized algorithm proposed by Matteo Riondato et al. [22], discovers approximate frequent itemsets from a sample of the dataset.


Many works have been done to improve MapReduce performance on heterogeneous Hadoop clusters. J. Xie et al. [23] developed a data placement management mechanism and incorporated two algorithms (named Initial Data Placement and Data Distribution) into HDFS. The first algorithm divides a large input file into even-sized fragments and then assigns the fragments to the heterogeneous nodes of a cluster according to the data processing speed of the nodes; the processing speed of a node is quantified by calculating its computing ratio. The second algorithm overcomes the dynamic data load balancing problem. The default Hadoop job scheduler, FIFO, degrades performance on heterogeneous clusters. A scheduling algorithm, LATE (Longest Approximate Time to End), proposed by M. Zaharia et al. [24], can improve the Hadoop response time by a factor of 2 in a cluster of 200 virtual machines on Amazon's Elastic Compute Cloud (EC2). This algorithm provides a solution for robustly performing speculative execution to maximize performance. The LATE scheduler does not address the problem resulting from the phenomenon of dynamic loading, which is addressed by the LA (Load-Aware) scheduler proposed by Hsin-Han You et al. [25]. Faraz Ahmad et al. [26] proposed Tarazu, a suite of optimizations to improve MapReduce performance on heterogeneous clusters. Tarazu consists of a set of three schemes: Communication-Aware Load Balancing of Map computation (CALB) across the nodes, Communication-Aware Scheduling of Map computation (CAS) to avoid network traffic, and Predictive Load Balancing of Reduce computation (PLB) across the nodes.

A white paper on Hadoop performance tuning [27] explains the tuning of various configuration parameters of a Hadoop cluster which directly affect the performance of a MapReduce job. Some major parameters described in that paper are block size, mapper output compression, speculative execution, maximum map/reduce tasks, buffer size for sorting, temporary space and JVM tuning.

IV. EXPERIMENTAL PLATFORM

A local Apache Hadoop-2.6.0 heterogeneous cluster consisting of five nodes was installed and configured. One node is fully devoted to the NameNode (NN) and the remaining four nodes serve as DataNodes (DNs). The cluster has both physical and virtual nodes, with different numbers of cores and amounts of RAM, but all run Ubuntu 14.04. The five node cluster has been set up using four physical machines; VMware is used to create the virtual machine environment. Table I shows the architecture of the physical machines used in the cluster, while Table II shows the configuration of the heterogeneous cluster with a description of each node.

TABLE I. ARCHITECTURE OF MACHINES USED IN CLUSTER

Machine   | CPU Type                      | # Cores    | RAM   | Operating System
Machine A | Intel Xeon E5-2620 @ 2.10 GHz | 2 × 6 = 12 | 16 GB | Windows 7
Machine B | Intel Xeon E5504 @ 2.00 GHz   | 1 × 4 = 4  | 2 GB  | Ubuntu 14.04
Machine C | Intel Xeon E5504 @ 2.00 GHz   | 1 × 4 = 4  | 2 GB  | Ubuntu 14.04
Machine D | Intel Xeon E5-2630 @ 2.30 GHz | 2 × 6 = 12 | 32 GB | Windows Server

TABLE II. CONFIGURATION OF HETEROGENEOUS HADOOP CLUSTER

Node            | Node Type        | Hosted on | # Cores | RAM
NameNode (NN)   | Virtual Machine  | A         | 4       | 4 GB
DataNode1 (DN1) | Physical Machine | B         | 4       | 2 GB
DataNode2 (DN2) | Physical Machine | C         | 4       | 2 GB
DataNode3 (DN3) | Virtual Machine  | D         | 4       | 4 GB
DataNode4 (DN4) | Virtual Machine  | D         | 4       | 4 GB

All the algorithms are implemented using Java and the MapReduce 2.0 APIs. The Hadoop-2.x versions introduced an improved and optimized framework, MapReduce 2.0 (MRv2), also known as NextGen MapReduce or YARN (Yet Another Resource Negotiator) [28], which controls job scheduling and manages cluster resources. Experiments were carried out on two click-stream datasets, BMS_WebView_1 and BMS_WebView_2, from a web store [29]. The default settings of the number of Mappers and Reducers are 12 and 4 respectively; any change is explicitly stated.

V. FACTORS AFFECTING THE PERFORMANCE

As mentioned in the earlier section, the performance of a MapReduce job can be enhanced either by improving the algorithm running as the job or by fine tuning various parameters specific to the cluster. An algorithm showing good performance on a homogeneous cluster can become drastically poorer on a heterogeneous cluster. In this section we discuss various factors that influence the performance of MapReduce based Apriori on homogeneous and heterogeneous Hadoop clusters. We have also applied techniques against these factors to improve the performance.

A. Data Structures and Filtered Transactions

These two techniques are incorporated in the Apriori algorithm. The hash tree and the trie (prefix tree) [30] are the central data structures of the Apriori algorithm. F. Bodon [31] proposed the hash table trie data structure by applying hashing techniques to the trie. The hash table trie was promising theoretically but failed to perform experimentally. The trie is the best performing data structure for sequential implementations of Apriori; the significant influence of the data structures can be found in [30-31]. In our earlier study [32], we investigated the influence of the three data structures on MapReduce based Apriori when executed on our local Hadoop cluster. Experimental results showed that the hash table trie drastically outperforms the trie and the hash tree with respect to execution time. Here we present only a part of those experimental results and compare them with the filtered transactions method.

Transaction filtering was first used by C. Borgelt [33] in an efficient sequential implementation of Apriori. If t is a transaction of the database, then the filtered transaction of t is the itemset obtained by removing the infrequent items from transaction t. Filtered transactions are sufficient to determine all the frequent itemsets [34-35]. Here we only give the pseudo code for Apriori with filtered transactions. Algorithms 1 to 7 depict the pseudo code of the driver class that contains three jobs, and of the Mapper, Reducer and Combiner classes of the three jobs; Algorithm 8 depicts the pseudo code of the filtered transaction method.


Algorithm 1. DriverApriori
  // Find frequent 1-itemsets L1
  Job1: // submitted a single time
    OneItemsetMapper
    ItemsetCombiner
    ItemsetReducer
  end Job1
  // Find filtered transactions
  JobFT: // submitted a single time
    FT-ItemsetMapper
    ItemsetCombiner
    FT-ItemsetReducer
  end JobFT
  // Find frequent k-itemsets Lk
  for (k = 2; Lk-1 ≠ ϕ; k++)
    Job2: // submitted multiple times
      K-ItemsetMapper
      ItemsetCombiner
      ItemsetReducer
    end Job2
  end for

Algorithm 2. OneItemsetMapper, k = 1
  Input: a block bi of the database
  key: byte offset of the line, value: a transaction ti
  for each ti ∈ bi do
    for each item i ∈ ti do
      write(i, 1);
    end for
  end for

Algorithm 3. ItemsetCombiner
  key: itemset, value: key's value list
  for each key k do
    for each value v of k's value list
      sum += v;
    end for
    write(k, sum)
  end for

Algorithm 4. ItemsetReducer
  key: itemset, value: key's value list
  for each key k do
    for each value v of k's value list
      sum += v;
    end for
    if sum >= min_supp_count
      write(k, sum)
    end if
  end for

Algorithm 5. FT-ItemsetMapper
  Input: a block bi of the database and L1
  key: byte offset of the line, value: a transaction ti
  read frequent items from the cache file into L1 // L1 may be a trie or hash table trie
  for each ti ∈ block bi do
    Ft = filterTransaction(L1, ti); // Ft is the filtered transaction
    write(Ft, 1);
  end for

Algorithm 6. FT-ItemsetReducer
  key: itemset, value: key's value list
  for each key k do
    for each value v of k's value list
      sum += v;
    end for
    write(k, sum)
  end for

Algorithm 7. K-ItemsetMapper, k ≥ 2
  Input: a block bi of the set of filtered transactions and Lk-1
  key: byte offset of the line, value: a filtered transaction ti
  read frequent (k-1)-itemsets from the cache file into Lk-1 // Lk-1 may be a trie or hash table trie
  Ck = apriori-gen(Lk-1); // Ck may be a trie or hash table trie
  for each ti ∈ block bi do
    occurrences = removeAtEnd(ti);
    Ct = subset(Ck, ti); // Ct may be a List
    for each candidate c ∈ Ct do
      write(c, occurrences);
    end for
  end for

Algorithm 8. Filtered Transaction
  filterTransaction(L1, ti)
  // parameters: trie L1 with frequent items and a transaction ti
  // return value: filteredItems
  filteredItems = ""; // empty string
  for each item i of transaction ti
    for each child node c of the root of L1
      if (i < c) break;
      else if (i > c) continue;
      else // append item to filteredItems
        filteredItems = filteredItems + " " + i;
        break;
    end for
  end for
  return filteredItems;
  end filterTransaction
We have introduced a job for transaction filtering, named JobFT, in between Job1 and Job2. Job1 generates the frequent 1-itemsets L1 and Job2 generates the frequent k-itemsets Lk in the kth iteration for k ≥ 2. FT-ItemsetMapper of JobFT reads L1 from the distributed cache and the input file from HDFS. It checks each transaction ti against L1 to filter the infrequent items out of ti and produces (Ft, 1) key-value pairs, where Ft is the filtered transaction. FT-ItemsetReducer sums up the values associated with the same filtered transaction and produces the filtered transaction with its occurrence frequency. JobFT thus writes the filtered transactions with their occurrence frequencies to HDFS. K-ItemsetMapper of Job2 reads the frequent itemsets Lk-1 from the distributed cache and the filtered transactions from HDFS. The method removeAtEnd() modifies the transaction by removing the occurrence frequency and returns this occurrence frequency. Candidates are then produced as key-value pairs with this occurrence frequency. We have examined the algorithms for both data structures, i.e. trie and hash table trie. ItemsetCombiner is the same for all three jobs since it computes the local sum on one node. ItemsetReducer computes the sum of the local counts of the candidates received from all the nodes, checks the count against the minimum support threshold, and produces the candidates and their counts as key-value pairs.
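To relate the pseudo code to the Java/MRv2 implementation mentioned in Section IV, the following is a minimal, self-contained sketch of Job1 only (OneItemsetMapper of Algorithm 2 and ItemsetReducer of Algorithm 4). Class names, paths and the min_supp_count job property are illustrative choices of ours, not the exact code used in the experiments; a combiner can reuse the reducer logic without the minimum support test.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of Job1 of Algorithm 1: find the frequent 1-itemsets L1.
public class OneItemsetJobSketch {

  // Algorithm 2: emit (item, 1) for every item of a transaction line.
  public static class OneItemsetMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text item = new Text();

    @Override
    protected void map(LongWritable offset, Text transaction, Context context)
        throws IOException, InterruptedException {
      StringTokenizer items = new StringTokenizer(transaction.toString());
      while (items.hasMoreTokens()) {
        item.set(items.nextToken());
        context.write(item, ONE);
      }
    }
  }

  // Algorithm 4: sum the local counts and keep itemsets meeting min_supp_count.
  public static class ItemsetReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text itemset, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int minSuppCount = context.getConfiguration().getInt("min_supp_count", 1); // assumed property
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      if (sum >= minSuppCount) context.write(itemset, new IntWritable(sum));
    }
  }

  // Usage (assumed): hadoop jar apriori.jar OneItemsetJobSketch <input> <output> <minSuppCount>
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("min_supp_count", Integer.parseInt(args[2]));
    Job job = Job.getInstance(conf, "apriori-L1");
    job.setJarByClass(OneItemsetJobSketch.class);
    job.setMapperClass(OneItemsetMapper.class);
    job.setReducerClass(ItemsetReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}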


We have executed MapReduce based Apriori using the trie and hash table trie (HTtrie) data structures, first without filtered transactions and then with filtered transactions. Figures 1 and 2 show the execution times corresponding to trie, hash table trie (HTtrie), trie on filtered transactions (TrieOnFT) and hash table trie on filtered transactions (HTtrieOnFT) for the datasets BMS_WebView_1 and BMS_WebView_2 respectively.

Fig. 1. Execution time of four variations of Apriori on BMS_WebView_1

From both Fig. 1 and Fig. 2 we can see that the two techniques, hash table trie and filtered transactions, drastically reduce the execution time when applied independently and add further improvement when applied jointly.

Fig. 2. Execution time of four variations of Apriori on BMS_WebView_2

B. Speculative Execution

Speculative execution is a strategy of Hadoop that provides fault tolerance and reduces job execution time [10]. When a TaskTracker (a daemon process running on the slave nodes that executes map and reduce tasks) performs poorly or crashes, the JobTracker (a daemon process running on the master node that accepts jobs and submits tasks to the TaskTrackers) launches a backup task on another node to accomplish the task faster and successfully. This process of redundant execution is called speculative execution, and the backup copy of the task is called a speculative task [25]. Whichever of the original and speculative tasks completes first is kept, while the other is killed since it is no longer needed. Speculative execution is beneficial most of the time, but it affects cluster efficiency by duplicating tasks, so it should be disabled (it is enabled by default) when a cluster has limited resources. Speculative execution is best suited to homogeneous environments but degrades job performance in heterogeneous environments [24]. One can disable speculative execution for Mappers and Reducers by setting the values of "mapreduce.map.speculative" and "mapreduce.reduce.speculative" to false, either in the mapred-site.xml file of the cluster configuration or in the job configuration of the MapReduce code [10].
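As a concrete illustration of the second option, the fragment below is a minimal driver sketch (the class and job names are ours) that turns off both properties in the job configuration before the job is created.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Illustrative driver fragment: disabling speculative execution for map and reduce
// tasks through the job configuration instead of mapred-site.xml.
public class SpeculationOffDriver {
  public static Job createJob() throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", false);    // no backup map tasks
    conf.setBoolean("mapreduce.reduce.speculative", false); // no backup reduce tasks
    return Job.getInstance(conf, "apriori-speculation-off"); // job name is illustrative
  }
}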

We have executed MapReduce based Apriori with the trie data structure with speculation turned on and off. Fig. 3 shows the observed execution times on the dataset BMS_WebView_1 for varying values of minimum support.

Fig. 3. Execution time of trie based Apriori when speculative execution is enabled and disabled

In Fig. 3 it can be seen that speculative execution comes into action only for a job that takes a longer time to complete. Speculative tasks are not launched for short jobs; a speculative task is launched only after all the tasks of a job have been assigned resources, and then only for a task that has been running for a long time or has failed to complete.

C. Detection and Elimination of Slower Nodes

Nodes in a heterogeneous cluster are of different capability. Heterogeneity may arise due to differences in hardware as well as from the use of virtualization technology. Virtualization facilitates efficient utilization of resources and provides environments for different operating systems. There are many benefits of VM-hosted (virtual machine hosted) Hadoop, such as lower cost of installation, on demand cluster setup, reuse of remaining physical infrastructure, and on demand expansion and contraction of cluster size [36]. In our local cluster (Tables I and II) the NN (NameNode), DN3 (DataNode3) and DN4 (DataNode4) are virtual machines, while DN1 (DataNode1) and DN2 (DataNode2) are physical machines. Virtualization provides efficient re-use and management of resources, but at the cost of performance: virtual machines are slower in comparison to physical machines.

We have observed that eliminating a slower node from the cluster improves the performance. Detecting a slower node is not always obvious. We have used the idea of measuring heterogeneity in terms of data processing speed proposed in [23], where the same MapReduce job with the same amount of data is executed separately on each node and the running time of each node is recorded. We did not execute the job separately on each node but on the whole cluster. We executed the job with 12 Mappers, corresponding to frequent 2-itemset generation, for a higher and a lower value of minimum support on the BMS_WebView_1 dataset.



Fig. 4 and Fig. 5 show snapshots of the execution times of the 12 Mappers on the 4 DNs of the cluster for the two values of minimum support.

Fig. 4. Snapshot of execution time of 12 Mappers on 4 DNs for higher value of minimum support

Fig. 5. Snapshot of execution time of 12 Mappers on 4 DNs for lower value of minimum support

In Fig. 4 and 5, the DN on which a particular task executed can be found by exploring the task's link. In Fig. 4, the tasks showing elapsed times of 28, 29 and 30 sec. are on the VM DataNodes, DN3 and DN4, whereas the tasks showing elapsed times of 19 and 20 sec. are on the physical DataNodes, DN1 and DN2. A similar difference can be seen in Fig. 5, where tasks taking around 15 minutes are on the VM DataNodes and those taking around 8 minutes are on the physical DataNodes.

Removing both DN3 and DN4 greatly slows down the performance, but removing only one of them significantly enhances the performance. Fig. 6 shows the relative influence of the two cluster configurations.

Fig. 6. Execution time of trie based Apriori on two different clusters

A natural question may arise here: is it a better idea to remove a slower DN? Here we would like to mention two perspectives on Hadoop: the developer's perspective and the user's perspective. As a Hadoop developer, one can design and incorporate an efficient data distribution scheme that assigns data blocks to DNs in the ratio of their processing speeds; similarly, an efficient load balancing scheme can be incorporated to migrate tasks from busy or slower DNs to idle or faster DNs. Our case is the user's perspective, where we focus only on MapReduce based applications and not on the underlying algorithms of the Hadoop system. So it is not a bad idea to eliminate a slower DN if that reduces the execution time.

D. Data Locality and Data Block Distribution

HDFS breaks a large file into smaller data blocks of 64 MB and stores them as 3 replicated copies on the DNs of the cluster. Data blocks are the physical division of the data; a data file can also be logically divided using input splits. The number of Mappers run for a job is equal to the number of logical splits, and if no input split is defined then the block size is the default split size. In all the earlier cases discussed above, we specified the split size, which resulted in 12 Mappers, and the input file was not divided physically. In this case we have not used a split size but instead divided the input file into smaller blocks of 200 KB, so for the dataset BMS_WebView_1 there are 5 data blocks: block0, block1, block2, block3 and block4. This requires 5 Mappers to be executed, corresponding to these 5 data blocks. When an input file is put into HDFS, it is automatically split into blocks and distributed to different DNs. Which block is allocated to which DN is not in the user's control, and each time the input file is put into HDFS a different distribution results. We set the replication factor (RF) to 1 so that a block resides on only one DN, put the same input file into HDFS twice, and obtained two different block distributions (BD), BD1 and BD2. We then set the replication factor to 4 so that all blocks are available on every DN. Table III describes the three block distributions BD1, BD2 and BD3. The influence of the three block distributions on the execution time of trie based Apriori for different values of minimum support is shown in Fig. 7.

TABLE III. THREE BLOCK DISTRIBUTIONS OF SAME FILE ON 4 DNS

BD  | RF | Blocks on DN1  | Blocks on DN2  | Blocks on DN3 | Blocks on DN4
BD1 | 1  | block1, block2 | block0, block3 | No Block      | block4
BD2 | 1  | block2         | block0, block3 | block4        | block1
BD3 | 4  | All 5 blocks   | All 5 blocks   | All 5 blocks  | All 5 blocks
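The block layouts above depend on how the file is written into HDFS. The sketch below (not from the paper; file names, paths and sizes are assumptions) shows one way to place the dataset with a client-side block size of 200 KB and a chosen replication factor using the HDFS Java API. Note that the NameNode's dfs.namenode.fs-limits.min-block-size limit (1 MB by default in Hadoop 2.x) has to be lowered before such small blocks are accepted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: writing the input file into HDFS with a 200 KB block size
// and RF = 1 (as for BD1/BD2); setting "dfs.replication" to "4" would reproduce BD3.
public class PutWithReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.blocksize", "204800"); // 200 KB blocks (requires a lowered min-block-size limit)
    conf.set("dfs.replication", "1");    // one copy per block, as in BD1 and BD2
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("BMS_WebView_1.dat"),                      // assumed local file
                         new Path("/user/hadoop/input/BMS_WebView_1.dat"));  // assumed HDFS path
    fs.close();
  }
}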



Fig. 7. Execution time of trie based Apriori on three different block distributions

In Fig. 7, BD1 exhibits the minimum execution time compared to BD2 and BD3. In BD1, blocks are located on only three DNs, i.e. two physical DNs and one VM DN, so Mappers are not running on both slower DNs, which makes the execution faster. The execution time for BD2 is poorer than that of BD1 because both VM DNs are used. BD3 exhibits the worst performance since all the blocks are available on every DN. MapReduce processes a data block locally on the DN where the block is present; for block distribution BD3, all Mappers run on the same DN since all blocks are locally available to every node. In different attempts at running the job the DN may differ, but all the Mappers are run on a single DN. With all Mappers running on the same DN, the available resources are not used, which leads to increased execution time. Here it can be seen that, due to a higher replication factor, data locality may become a hurdle that slows down the execution.

E. Controlling Parallelism with Split Size

Hadoop is designed to process big datasets, but that does not mean one cannot benefit from it for small datasets. Apriori is a CPU-intensive algorithm and consumes significant time even for smaller datasets. To reduce the execution time we need more than one task running in parallel. The input split is used to control the number of map tasks for a MapReduce job: a split may consist of multiple blocks and there may be multiple splits within a single block. So, without changing the block size, the user can control the number of Mappers run for a particular job. We have used the method setNumLinesPerSplit(Job job, int numLines) of the class NLineInputFormat from the MapReduce library to set the number of lines per split. In our earlier cases we were running multiple Mappers against different parts of the same block. Here we set the split size to 5K lines on block distribution BD3, which contains 5 data blocks. This creates 12 splits, i.e. 12 Mappers over 5 blocks. If we do not define a split size for BD3, then the blocks are taken as the input splits. Fig. 8 shows the difference in execution time for these two cases.

Fig. 8. Execution time of trie based Apriori with Input Split and without Input Split on BD3

Here it can be seen how the split size controls the parallelism: a smaller split size launches a greater number of Mappers, which consequently increases the parallelism. This does not mean that a greater number of Mappers always results in better performance; increasing the number of Mappers beyond a particular point starts to degrade the performance due to unnecessary overheads and shortage of resources [32]. To achieve the right level of parallelism, one must take into account whether the map task is CPU-intensive or CPU-light, as well as the size of the dataset to be processed.
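A minimal driver fragment using this API is sketched below; only the input-format wiring is shown, and the Mapper/Reducer classes and HDFS paths are placeholders rather than our exact code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative sketch: fixing the split size at 5,000 lines with NLineInputFormat,
// so the number of Mappers is controlled independently of the HDFS block layout.
public class SplitSizeDriverSketch {
  public static Job createJob(Configuration conf) throws Exception {
    Job job = Job.getInstance(conf, "apriori-5k-line-splits");
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 5000);                     // one Mapper per 5K lines
    NLineInputFormat.addInputPath(job, new Path("/apriori/filtered"));   // assumed input path
    FileOutputFormat.setOutputPath(job, new Path("/apriori/out"));       // assumed output path
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // job.setMapperClass(...); job.setReducerClass(...);  // application classes omitted
    return job;
  }
}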

F. Issues Regarding MapReduce Implementation

The efficiency of an algorithm running as a MapReduce job is extensively influenced by the data structure used and by the algorithm itself. A third factor that cannot be ignored is the implementation technique. The implementation technique may concern the implementation of the various modules of Apriori (e.g. candidate generation, support counting of candidates against each transaction, pruning of infrequent itemsets) or the MapReduce implementation of Apriori; the latter is central to the discussion here. A major issue in MapReduce based Apriori is invoking candidate generation, i.e. apriori-gen() (Algorithm 7), at the appropriate place inside the Mapper class. In our implementations we have invoked apriori-gen() inside the customized method map() of the Mapper class. In the Mapper class, two methods, setup() and map(), are customized and one method, apriori-gen(), is defined. The method setup() is called once at the beginning of a task; it is customized to read the frequent itemsets of the previous iteration from the distributed cache and to initialize the prefix tree. The method apriori-gen() generates candidates using the prefix tree containing the frequent itemsets. The method map() is invoked for each line of the input split of the dataset: if there are 100 lines of input assigned to a Mapper, then the map() method is invoked 100 times and consequently invokes apriori-gen() each time. Since the apriori-gen() method produces candidates independently of the input instance, it need not be invoked repeatedly inside the map() method. The apriori-gen() method is computation intensive and increases the execution time when invoked repeatedly. This repeated computation could be avoided if we invoked apriori-gen() outside of map(). Theoretically this sounds good, but it did not work when apriori-gen() was invoked inside the setup() method. We also tried another way, in which apriori-gen() is invoked inside the overridden method run() of the Mapper class, but again could not achieve the expected reduction in execution time.
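The skeleton below illustrates the structure being discussed. It is a simplified sketch, not our exact class, and aprioriGen() here is a hypothetical stand-in for the apriori-gen() of Algorithm 7; it shows the placement we used (candidate generation inside map()) and, in the comments, the alternative placement in setup() that did not give the expected gain in our experiments.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Simplified skeleton of the K-ItemsetMapper structure discussed above.
public class KItemsetMapperSketch extends Mapper<LongWritable, Text, Text, LongWritable> {

  private Object lkMinus1; // L(k-1) held in a trie or hash table trie (placeholder type)

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Called once per map task: read L(k-1) from the distributed cache and build the
    // prefix tree (details omitted). Moving the aprioriGen() call here is the
    // alternative placement discussed in the text.
  }

  @Override
  protected void map(LongWritable offset, Text filteredTransaction, Context context)
      throws IOException, InterruptedException {
    Object ck = aprioriGen(lkMinus1); // as implemented: candidates regenerated on every call
    // Subset-count the candidates of ck contained in filteredTransaction and
    // emit (candidate, occurrences) pairs (omitted here).
  }

  // Hypothetical stand-in for apriori-gen() of Algorithm 7.
  private Object aprioriGen(Object lk) {
    return new Object();
  }
}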
VI. CONCLUSIONS


In this paper, we have investigated a number of factors affecting the performance of the MapReduce based Apriori algorithm on homogeneous and heterogeneous Hadoop clusters, and presented strategies to improve the performance. It has been shown how the hash table trie data structure and the transaction filtering technique can significantly enhance the performance. Factors like speculative execution, physical & VM DataNodes, data locality & block distribution, and split size are such that their proper tuning can directly enhance the performance of a MapReduce job even without making algorithmic optimizations. The approach taken to the MapReduce implementation of Apriori is another important factor that also influences the performance.

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2006.
[2] J. S. Ward and A. Barker, "Undefined by data: a survey of big data definitions," http://arxiv.org/abs/1309.5821v1.
[3] Apache Hadoop, http://hadoop.apache.org
[4] Big data is useless without algorithms, Gartner says, http://www.zdnet.com/article/big-data-is-useless-without-algorithms-gartner-says/, Retrieved Nov. 2015.
[5] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings Twentieth International Conference on Very Large Databases, Santiago, 1994, pp. 487–499.
[6] M. J. Zaki, "Parallel and distributed association mining: a survey," IEEE Concurrency, vol. 7, no. 4, pp. 14–25, 1999.
[7] K. Bhaduri, K. Das, K. Liu, H. Kargupta and J. Ryan, Distributed Data Mining Bibliography, 2008.
[8] HDFS Architecture Guide, https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html, Retrieved Sept. 2015.
[9] MapReduce Tutorial, http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html, Retrieved Sept. 2015.
[10] Yahoo! Hadoop Tutorial, http://developer.yahoo.com/hadoop/tutorial/index.html
[11] S. Ghemawat, H. Gobioff and S. Leung, "The Google file system," ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 29–43, 2003.
[12] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107–113, 2008.
[13] J. Li, P. Roy, S. U. Khan, L. Wang and Y. Bai, "Data mining using clouds: an experimental implementation of apriori over mapreduce," http://sameekhan.org/pub/L-K-2012-SCALCOM.pdf
[14] X. Lin, "MR-Apriori: association rules algorithm based on mapreduce," IEEE, 2014.
[15] X. Y. Yang, Z. Liu and Y. Fu, "MapReduce as a programming model for association rules algorithm on hadoop," in Proceedings 3rd International Conference on Information Sciences and Interaction Sciences (ICIS), 2010, vol. 99, no. 102, pp. 23–25.
[16] N. Li, L. Zeng, Q. He and Z. Shi, "Parallel implementation of apriori algorithm based on mapreduce," in Proceedings 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel & Distributed Computing, IEEE, 2012, pp. 236–241.
[17] S. Oruganti, Q. Ding and N. Tabrizi, "Exploring Hadoop as a platform for distributed association rule mining," in FUTURE COMPUTING 2013, the Fifth International Conference on Future Computational Technologies and Applications, pp. 62–67.
[18] M.-Y. Lin, P.-Y. Lee and S.-C. Hsueh, "Apriori-based frequent itemset mining algorithms on mapreduce," in Proceedings 6th International Conference on Ubiquitous Information Management and Communication (ICUIMC '12), ACM, New York, 2012, Article 76.
[19] F. Kovacs and J. Illes, "Frequent itemset mining on Hadoop," in Proceedings IEEE 9th International Conference on Computational Cybernetics (ICCC), Hungary, 2013, pp. 241–245.
[20] L. Li and M. Zhang, "The strategy of mining association rule based on cloud computing," in Proceedings IEEE International Conference on Business Computing and Global Informatization (BCGIN), 2011, pp. 29–31.
[21] Honglie Yu, Jun Wen, Hongmei Wang and Li Jun, "An improved apriori algorithm based on the boolean matrix and Hadoop," Procedia Engineering, vol. 15, pp. 1827–1831, Elsevier, 2011.
[22] Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca and Eli Upfal, "PARMA: a parallel randomized algorithm for approximate association rules mining in mapreduce," in Proceedings 21st ACM International Conference on Information and Knowledge Management, 2012, pp. 85–94.
[23] Jiong Xie et al., "Improving mapreduce performance through data placement in heterogeneous hadoop clusters," in IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010, pp. 1–9.
[24] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz and Ion Stoica, "Improving mapreduce performance in heterogeneous environments," in 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2008, vol. 8, no. 4, pp. 29–42.
[25] Hsin-Han You, Chun-Chung Yang and Jiun-Long Huang, "A load-aware scheduler for MapReduce framework in heterogeneous cloud environments," in Proceedings of the ACM Symposium on Applied Computing, 2011, pp. 127–132.
[26] Faraz Ahmad, Srimat Chakradhar, Anand Raghunathan and T. N. Vijaykumar, "Tarazu: optimizing MapReduce on heterogeneous clusters," ACM SIGARCH Computer Architecture News, vol. 40, no. 1, pp. 61–74, 2012.
[27] Hadoop Performance Tuning, white paper, Impetus Technologies Inc., October 2009, https://hadoop-toolkit.googlecode.com/files/White%20paper-HadoopPerformanceTuning.pdf
[28] Apache Hadoop NextGen MapReduce (YARN), http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html, Retrieved Sept. 2015.
[29] SPMF Datasets, http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php
[30] F. Bodon and L. Rónyai, "Trie: an alternative data structure for data mining algorithms," Mathematical and Computer Modelling, vol. 38, no. 7, pp. 739–751, 2003.
[31] Ferenc Bodon, "A fast apriori implementation," in Proceedings IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), vol. 90, 2003.
[32] Sudhakar Singh, Rakhi Garg and P. K. Mishra, "Performance analysis of apriori algorithm with different data structures on hadoop cluster," International Journal of Computer Applications, vol. 128, no. 9, pp. 45–51, 2015.
[33] Christian Borgelt, "Efficient implementations of apriori and eclat," in Proceedings IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), 2003.
[34] Ferenc Bodon, "Surprising results of trie-based FIM algorithms," FIMI, 2004.
[35] Ferenc Bodon, "A trie-based APRIORI implementation for mining frequent item sequences," in Proceedings 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations, ACM, 2005.
[36] Hadoop Wiki, Virtual Hadoop, https://wiki.apache.org/hadoop/Virtual%20Hadoop
