Integrating MapReduce with Apriori and Genetic Algorithms for Groceries Transactions

Mohammed A. Almorsy, Mohammed A. El-dosuky, Sameh Abd Elghani, Hazem M. El Bakry
Faculty of Computer & Information Sciences, Mansoura University, Egypt

International Journal of Advanced Research in Computer Science & Technology (IJARCST 2019), Vol. 7, Issue 3 (July - Sept. 2019)
ISSN: 2347-8446 (Online), ISSN: 2347-9817 (Print)

Abstract
Grocery outlets continually produce massive volumes of transaction data that deserve analysis for customer preferences. This paper reviews big data, Hadoop, association rules, and genetic algorithms. A fusion algorithm for mining groceries transactions is then proposed. The algorithm combines MapReduce, Apriori, and genetic algorithms. Simulation results indicate the efficiency of the presented algorithm.

Keywords
Big Data, MapReduce, Hadoop, Association Rule, Apriori Algorithm, Genetic Algorithm.

I. Introduction
Big data refers to data sources whose volume exceeds the capability of conventional database software tools to capture, store, manage, and analyze [1].

Big data is processed using MapReduce, which has two functions: a map function that processes key/value pairs to produce a collection of intermediate key/value pairs, and a reduce function that merges all intermediate values that share the same intermediate key [11-169].

Hadoop, whose early development was driven largely by Yahoo, is an open-source platform for distributed computing that stores data in a distributed file system, HDFS [2].

Association rule mining finds rules in a database that satisfy a minimum confidence and a minimum support constraint [3]. The Apriori algorithm finds the frequent item sets in big databases using a series of iterations. In each iteration it generates candidate item sets, computes their support, and then prunes the candidates down to the frequent item sets [4].

A genetic algorithm is a heuristic approach for solving search and optimization problems. It uses a fitness function to evaluate solutions and iterates crossover, mutation, and selection to find an optimal solution [10].

The rest of this paper reviews big data, Hadoop, association rules, and genetic algorithms in Section 2, before proposing a combination of MapReduce with the Apriori algorithm and a genetic algorithm for groceries transactions in Section 3.

II. Previous work
An Apriori-Map/Reduce algorithm has been presented together with its time complexity, which illustrates theoretically that the algorithm gains performance over sequential algorithms as map and reduce nodes are added. The resulting item sets can be used to compute association rules for market basket analysis [5]. An improved Apriori algorithm has been implemented on the MapReduce model on Hadoop; the improved algorithm can deal with large data sets at lower cost [6]. A recent paper gives an overview of algorithms designed for parallel mining of all frequent item sets using Hadoop [7]. Association rules have been extracted from a data set and then passed to a genetic algorithm, but the results are complex [8]. An a-priori query technique has been combined with a genetic algorithm to solve the association rule mining problem [9]. A-priori has also been combined with a genetic algorithm (GA) to solve two classical NP-hard location problems [10].

III. Proposed system
Figure 1 depicts the block diagram.

First, use the MapReduce algorithm to deal with large data sets. Load the database and split it into small chunks of data.
Second, the mapper function maps input key/value pairs to a set of intermediate key/value pairs, which are then shuffled.
Third, the reducer function iterates through the values associated with each key and produces zero or more outputs.
Fourth, the output of MapReduce consists of keys with their associated values. This output is easier to deal with.
Fifth, use the Apriori algorithm to extract association rules. It finds the frequent item sets in big databases using a series of iterations that generate candidate item sets, compute their support, and then prune the candidates that fall below the minimum support.
Finally, extract strong association rules using the genetic algorithm. Use crossover, mutation, and selection functions to produce new populations of rules, then use the fitness function to evaluate them and find the optimal solution.

Fig. 1: Combination of MapReduce with the Apriori algorithm and genetic algorithm

IV. Simulation Results
The proposed system is implemented on the Groceries data set. The Groceries data set is a collection of receipts in which each line represents one receipt and the items purchased. Each line is called a transaction, and each column in a row represents an item. The Groceries data set encompasses 9835 transactions.

First: use the MapReduce algorithm to manage large database operations such as capture, store, and manipulate. It consists of two functions: a map function that splits the data into small input pieces and maps the unordered input key/value pairs, then a reduce function that arranges the data with associated values, which makes the Groceries data set easier to deal with. The Groceries data set is represented in the "groceries.csv" file.

Table 1: Sample of the input Groceries data set
citrus fruit,semi-finished bread,margarine,ready soups
tropical fruit,yogurt,coffee
whole milk
pip fruit,yogurt,cream cheese,meat spreads

Second: implement the Apriori algorithm on the Groceries data set using RStudio. Begin by installing the "arules" and "arulesViz" packages and loading them together with the "Groceries" data, then display the size and length of the Groceries data set by writing size(Groceries) and length(Groceries). Then write the following code, which uses a series of iterations to generate candidate item sets, finds the frequent item sets that meet the support and confidence thresholds, and extracts strong association rules.

    library(arules)      # apriori(), inspect(), itemFrequencyPlot()
    library(arulesViz)
    data("Groceries")
    size(Groceries)      # number of items in each transaction
    length(Groceries)    # number of transactions

    # Plot the ten most frequent items by absolute count
    itemFrequencyPlot(Groceries, topN = 10, type = "absolute")

    # Frequent item sets of length 1, 2, and 3 with support >= 0.02
    itemsets <- apriori(Groceries, parameter = list(minlen = 1, maxlen = 1, support = 0.02, target = "frequent itemsets"))
    inspect(head(sort(itemsets, by = "support"), 10))
    itemsets <- apriori(Groceries, parameter = list(minlen = 2, maxlen = 2, support = 0.02, target = "frequent itemsets"))
    inspect(head(sort(itemsets, by = "support"), 10))
    itemsets <- apriori(Groceries, parameter = list(minlen = 3, maxlen = 3, support = 0.02, target = "frequent itemsets"))
    inspect(head(sort(itemsets, by = "support"), 10))

    # `rules` must already hold the mined association rules;
    # the call that creates it does not appear in this text
    rules
    strong_rules <- sort(rules, by = "confidence", decreasing = TRUE)
    inspect(strong_rules)

To write the association rules to an external file "r.txt" that contains the output of the Apriori implementation, write:

    sink("r.txt")                 # redirect console output to r.txt
    r <- inspect(strong_rules)
    sink()                        # restore console output
    savehistory("C:/Users/hp/Desktop/m.txt")

Table 2: Sample of the output in the r.txt file
LHS                                RHS            Support    Confidence   Count
{rice, sugar}                      {whole milk}   0.0012     1            12
{canned fish, hygiene articles}    {whole milk}   0.001118   1            11

This output is extracted and pipelined to the genetic subsystem, which is implemented in Python, to generate new rules.

Third: use the genetic algorithm on the external file (r.txt), which contains the strong rules, using Anaconda Python. In this system the fitness is the confidence, and strong rules with a confidence of at least 1 are extracted.

Fig. 2: Genetic algorithm phase of the proposed system

Then cross over between the strong rules whose confidence is at least 1.

    c1 = p1
    c2 = p2
    cross_1= len (lhs[c1])//

Table 3: Example of the final output of the genetic algorithm on the r.txt file
{rice, other vegetables, yogurt, oil} => {whole milk}
{tropical fruit, root vegetables, sugar} => {whole milk}

The accuracy of the final results is measured and compared with the results in the r.txt file (before and after using the genetic algorithm). The genetic algorithm produces 435 samples in total: 267 samples agree with r.txt with an accuracy of exactly 100%, and 168 samples have an accuracy below our confidence threshold.

V. Conclusion and Future work
Grocery outlets continually produce massive volumes of transaction data that deserve analysis for customer preferences. The paper has reviewed big data, Hadoop, association rules, and genetic algorithms. A new fusion algorithm for groceries transactions has been presented. The algorithm combines MapReduce with Apriori and genetic algorithms to get the best association rules. Future directions may encompass a parallel implementation of the genetic algorithm and/or its application to distributed data sets from different stores.

References
[1] Trnka, Andrej. "Big data analysis." European Journal of Science and Theology 10.1 (2014): 143-148.
[2] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: a flexible data processing tool." Communications of the ACM 53.1 (2010): 72-77.
[3] Liu, Bing, Wynne Hsu, and Yiming Ma. "Integrating classification and association rule mining." Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. 1998.
[4] Ye, Yanbin, and Chia-Chu Chiang. "A parallel apriori algorithm for frequent itemsets mining." Software Engineering Research, Management and Applications, 2006. Fourth International Conference on. IEEE, 2006.
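As an illustration of the first five steps of Section III, the following is a minimal single-process sketch in Python of how the map, shuffle, and reduce phases could supply the support counts that Apriori needs in each iteration. It is not the authors' implementation and it does not use Hadoop. The function names (map_phase, shuffle, reduce_phase, apriori_mapreduce), the chunking scheme, and the toy transactions are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(chunk, candidates):
    """Map: emit (itemset, 1) for every candidate contained in a transaction."""
    for transaction in chunk:
        for candidate in candidates:
            if candidate <= transaction:
                yield candidate, 1


def shuffle(pairs):
    """Shuffle: group the intermediate values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values of each key to get its absolute support count."""
    return {key: sum(values) for key, values in grouped.items()}


def apriori_mapreduce(transactions, min_support, n_chunks=2):
    """Return {frequent itemset: count} using map/reduce-style counting."""
    transactions = [frozenset(t) for t in transactions]
    # Split the database into small chunks (hypothetical round-robin split).
    chunks = [transactions[i::n_chunks] for i in range(n_chunks)]
    min_count = min_support * len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        pairs = [p for chunk in chunks for p in map_phase(chunk, candidates)]
        counts = reduce_phase(shuffle(pairs))
        # Keep only the candidates that reach the minimum support.
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
        # Join: unions of frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Toy transactions invented for illustration (not taken from the Groceries data).
toy = [
    {"whole milk", "rice", "sugar"},
    {"whole milk", "rice", "sugar"},
    {"whole milk", "yogurt"},
    {"yogurt", "coffee"},
]

if __name__ == "__main__":
    for itemset, count in sorted(apriori_mapreduce(toy, 0.5).items(),
                                 key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), count)
```

In a real Hadoop job the map and reduce functions would run on separate nodes over HDFS blocks; here the chunks are processed in one loop only to show the data flow.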

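The genetic phase of Section IV (fitness equal to confidence, crossover between strong rules, then selection) can be sketched in the same spirit. The paper's own crossover code is cut off in this text, so everything below is an assumption rather than the authors' method: the midpoint split in crossover, the mutation operator and its rate, the selection rule, and the toy transactions are hypothetical choices made for this example.

```python
import random


def confidence(lhs, rhs, transactions):
    """confidence(lhs => rhs) = count(lhs and rhs) / count(lhs).

    `transactions` must be a list of sets; returns 0.0 if lhs never occurs.
    """
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    if n_lhs == 0:
        return 0.0
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    return n_both / n_lhs


def crossover(lhs_a, lhs_b):
    """One-point crossover on sorted LHS item lists.

    Splitting each parent at its midpoint is an assumption of this sketch.
    """
    a, b = sorted(lhs_a), sorted(lhs_b)
    cut_a, cut_b = len(a) // 2, len(b) // 2
    return frozenset(a[:cut_a] + b[cut_b:]), frozenset(b[:cut_b] + a[cut_a:])


def mutate(lhs, items, rng, rate=0.1):
    """With probability `rate`, swap one LHS item for a random item."""
    lhs = set(lhs)
    if lhs and rng.random() < rate:
        lhs.remove(rng.choice(sorted(lhs)))
        lhs.add(rng.choice(items))
    return frozenset(lhs)


def evolve(rules, transactions, generations=5, min_conf=1.0, seed=0):
    """Breed new rules and keep those whose confidence reaches min_conf."""
    rng = random.Random(seed)
    items = sorted({i for t in transactions for i in t})
    population = {(frozenset(l), frozenset(r)) for l, r in rules}
    for _ in range(generations):
        # Sort so that the run is reproducible for a given seed.
        parents = sorted(population, key=lambda r: (sorted(r[0]), sorted(r[1])))
        if len(parents) < 2:
            break
        offspring = set()
        for _ in range(len(parents)):
            (l1, r1), (l2, _r2) = rng.sample(parents, 2)
            for child in crossover(l1, l2):
                child = mutate(child, items, rng) - r1  # keep LHS and RHS disjoint
                if child:
                    offspring.add((child, r1))
        # Selection: fitness is the rule's confidence.
        population = {
            r for r in population | offspring
            if confidence(r[0], r[1], transactions) >= min_conf
        }
    return population


# Toy data invented for illustration; the seed rules mirror the shape of Table 2.
toy_transactions = [
    {"whole milk", "rice", "sugar"},
    {"whole milk", "rice", "sugar"},
    {"whole milk", "canned fish", "hygiene articles"},
    {"yogurt", "coffee"},
    {"whole milk", "rice", "hygiene articles"},
]
seed_rules = [
    ({"rice", "sugar"}, {"whole milk"}),
    ({"canned fish", "hygiene articles"}, {"whole milk"}),
]

if __name__ == "__main__":
    for lhs, rhs in sorted(evolve(seed_rules, toy_transactions),
                           key=lambda r: sorted(r[0])):
        print(sorted(lhs), "=>", sorted(rhs))
```

Because selection keeps every rule whose confidence reaches the threshold, the seed rules with confidence 1 always survive, and any offspring that survives is a new rule with the same confidence on the given transactions.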