Big Technologies
By Kushagra Trivedi

Contents

• Introduction to big data and big data mining
• Apache Hadoop for big data mining
• Apache S4 for big data mining
• Apache Mahout for big data mining
• Some other tools of machine learning and data mining
• Comparison of big data mining technologies
• Conclusion
• References

Introduction To Big Data And Big Data Mining

• Data of large volume and greater complexity
• Definition of big data
• Sources of data expansion
• Definition of data mining
• Why data mining is necessary
• Some of the technologies used for data mining

Apache Hadoop

• Data-intensive distributed architecture
• Centralized server vs. distributed servers
• Built on MapReduce and the Hadoop Distributed File System (HDFS)
• HDFS divides data into blocks and distributes them among DataNodes
• Lets developers write applications that rapidly process large amounts of data in parallel on large clusters of compute nodes
• Applications - Yahoo, Facebook and other Fortune 50 companies are using Apache Hadoop

Hadoop Distributed File System

Figure 1 HDFS block distribution: a NameNode holding metadata for DataNodes 1-4, with data blocks b1, b2 and b3 replicated across them

• NameNode maintains all meta information about the DataNodes
• DataNodes contain the actual data blocks
• HDFS distributes and replicates data blocks among the DataNodes
• A client executing a query first goes to the NameNode, then locates the actual data by looking at this meta information (see the read sketch below)
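To make the client-side flow concrete, here is a minimal read through the HDFS Java API (org.apache.hadoop.fs.FileSystem); the block lookup against the NameNode happens transparently inside open(). The path /user/demo/input.txt is a hypothetical example.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Configuration picks up fs.defaultFS (the NameNode address) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Opening the file asks the NameNode for block locations,
        // then streams the blocks directly from the DataNodes
        Path file = new Path("/user/demo/input.txt"); // hypothetical path
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}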

MapReduce Algorithm

Figure 2 MapReduce distribution [2]: a word-count example in which MAP emits (word, 1) pairs from each input sentence, GROUP collects the pairs by word, and REDUCE sums the counts for each word

• Uses two functions: map and reduce (a minimal word-count sketch follows this list)
• Data are fed into the map function to produce intermediate key-value pairs
• The intermediate results are then given to the reduce function to produce the final result
• TaskTracker - does the work assigned to it by the JobTracker
• JobTracker - assigns tasks to TaskTrackers and, if a TaskTracker fails, reassigns its tasks to another node
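As a sketch of the map and reduce functions described above, here is the classic Hadoop word-count job in Java. The input and output directories are taken from the command line, and the mapper/reducer class names are the usual illustrative ones rather than anything mandated by Hadoop.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts grouped under each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}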

Apache S4

• S4 stands for Simple Scalable Streaming System
• Uses MapReduce and the Actor model for computation
• Data processing is done through processing elements
• The S4 framework provides a way to create and route processing elements as needed
• Applications - Yahoo, LinkedIn, A9 and Quantbench are several companies that use Apache S4 for big data mining


Figure 3 S4 word count sample [6]


• Processing elements (PEs) are the basic computational units
• A processing element only consumes events carrying the key for which it was created
• A special keyless processing element accepts any type of input
• Processing nodes are the logical hosts of processing elements
• S4 routes events to processing nodes based on a hash of the keyed attributes in those events (an illustrative sketch of this keyed routing follows)
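The routing rule in the last bullet can be sketched in plain Java. This is illustrative only, not the actual Apache S4 API, and all class and method names are hypothetical: events carrying a keyed attribute (here, a word) are dispatched to one processing-element instance per key, created on demand.

import java.util.HashMap;
import java.util.Map;

// Illustrative only: plain Java, not the Apache S4 API.
// One counter "processing element" is created per distinct key (here, per word),
// mirroring how S4 routes keyed events to keyed processing elements.
public class KeyedDispatchSketch {

    // A minimal keyed processing element: counts events for the single key it owns
    static class WordCountPE {
        private final String word;
        private long count = 0;

        WordCountPE(String word) {
            this.word = word;
        }

        void onEvent() {
            count++;
            System.out.println(word + " -> " + count);
        }
    }

    private final Map<String, WordCountPE> peByKey = new HashMap<>();

    // Route an event on its keyed attribute, creating the PE on first use
    void dispatch(String word) {
        peByKey.computeIfAbsent(word, WordCountPE::new).onEvent();
    }

    public static void main(String[] args) {
        KeyedDispatchSketch app = new KeyedDispatchSketch();
        for (String w : "the league of evil villains the league".split(" ")) {
            app.dispatch(w);
        }
    }
}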

Apache Mahout

• Open source project of the Apache foundation that allows programmers to write machine learning algorithms
• Works on three kinds of algorithms: clustering, classification and collaborative filtering
• Includes several distributed clustering algorithms such as k-Means, Fuzzy k-Means, Dirichlet, Mean-Shift and Canopy
• Applications - recommending products you want to buy, people you might want to connect with, potential life partners and songs you might like


1) Building a recommendation engine
• Mahout currently provides the "Taste" library for building recommendation engines
• The library supports both user-based and item-based recommendations
• Five primary components - DataModel, UserSimilarity, ItemSimilarity, Recommender, UserNeighborhood
• Using these components, developers can build applications that give online and offline recommendations (a short sketch follows this list)
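Below is a minimal user-based recommender assembled from the Taste components listed above; the data file name preferences.csv, the neighborhood size of 10 and the Pearson similarity are illustrative assumptions, not requirements of the library.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteUserBasedExample {
    public static void main(String[] args) throws Exception {
        // preferences.csv (hypothetical file): lines of userID,itemID,preference
        DataModel model = new FileDataModel(new File("preferences.csv"));

        // How similar are two users, based on their preference values?
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // The 10 most similar users form each user's neighborhood
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        // User-based recommender built from the components above
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for the user with ID 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}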


2) Clustering with Apache Mahout
• The clustering algorithms are written using the MapReduce paradigm
• Canopy, k-Means, Mean-Shift and Dirichlet are the available clustering algorithms
• Select the data and convert it into a numerical representation
• Select any one of the algorithms above
• Evaluate the result (a plain-Java sketch of one k-Means iteration follows this list)
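This is not the Mahout driver API itself, but a plain-Java sketch of what one k-Means iteration does with numerical data, mirroring the map step (assign each point to its nearest centroid) and reduce step (recompute each centroid as the mean of its points) structure mentioned above; the data and class names are made up for illustration.

import java.util.ArrayList;
import java.util.List;

// Illustrative only: one k-Means iteration over 1-D points, not the Mahout API.
public class KMeansIterationSketch {

    // "Map" step: index of the nearest centroid for a point
    static int nearest(double point, double[] centroids) {
        int best = 0;
        for (int i = 1; i < centroids.length; i++) {
            if (Math.abs(point - centroids[i]) < Math.abs(point - centroids[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 1.8, 8.0, 8.3, 9.1};
        double[] centroids = {1.0, 9.0};

        // Group points by their nearest centroid
        List<List<Double>> assigned = new ArrayList<>();
        for (int i = 0; i < centroids.length; i++) {
            assigned.add(new ArrayList<>());
        }
        for (double p : points) {
            assigned.get(nearest(p, centroids)).add(p);
        }

        // "Reduce" step: recompute each centroid as the mean of its assigned points
        for (int i = 0; i < centroids.length; i++) {
            double sum = 0;
            for (double p : assigned.get(i)) {
                sum += p;
            }
            if (!assigned.get(i).isEmpty()) {
                centroids[i] = sum / assigned.get(i).size();
            }
            System.out.println("centroid " + i + " -> " + centroids[i]);
        }
    }
}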


3) Categorizing content with Mahout
• Two approaches to categorization - the naive Bayes classifier and the complementary naive Bayes classifier
• One part of the naive Bayes process keeps track of the words associated with a particular document and category
• The second part predicts the category of new content using the counts from part one
• The complementary naive Bayes classifier is similar to the naive Bayes approach and retains its simplicity while trying to correct some of its shortcomings (a counting sketch follows this list)
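Here is a toy, plain-Java sketch of the two parts described above (counting words per category, then scoring new content against those counts); it is not the Mahout classifier API, and class priors are left uniform for brevity.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: a tiny naive Bayes-style counter and scorer, not the Mahout classifier API.
public class NaiveBayesSketch {

    // Part one: word counts per category, built from labelled documents
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    void train(String category, String document) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String word : document.toLowerCase().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Part two: score a new document against each category using smoothed word likelihoods
    String predict(String document) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : wordCounts.keySet()) {
            double score = 0.0;
            for (String word : document.toLowerCase().split("\\s+")) {
                int count = wordCounts.get(category).getOrDefault(word, 0);
                // Laplace smoothing so unseen words do not zero out the score
                score += Math.log((count + 1.0) / (totalWords.get(category) + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("sports", "goal match team score win");
        nb.train("tech", "hadoop cluster data node mapreduce");
        System.out.println(nb.predict("data node cluster"));   // expected: tech
    }
}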

Some Other Tools of Machine Learning and Data Mining

• R is used for high-performance statistical computing on big data
• Massive Online Analysis (MOA) is a framework of machine learning algorithms for mining data streams
• MOA supports classification, regression, clustering, frequent item set mining and frequent graph mining
• Vowpal Wabbit is able to handle terabytes of data and aims to exceed the throughput of a single machine's network interface
• Pegasus is a big graph mining tool that finds patterns and anomalies in massive graphs
• GraphLab is a high-level parallel data mining system built without using MapReduce

Comparison

• Apache Hadoop is used for batch processing
• Data is divided into large blocks, which makes it easy to handle
• Segmentation adds extra overhead
• Apache S4 is used for streaming data
• No need for segmentation of the data
• Nodes cannot be added to or removed from running clusters
• Apache Mahout is used for writing machine learning algorithms
• No lack of community support; documentation and examples are provided

Conclusion

• Big data is a crucial concern, as the amount of data will keep increasing in the future
• Different techniques are needed for mining this big data
• Apache Mahout gives recommendations to users according to their past preferences
• Hadoop is used for data mining using MapReduce and HDFS
• Apache S4 is used for mining streams of data
• All techniques have their own significance for different types of companies

References

[1] R. Natarajan. Apache Hadoop Fundamentals - HDFS and MapReduce Explained with a Diagram. January 4, 2012.
[2] Guruzon.com. Pros and Cons of Hadoop. June 1, 2013.
[3] HDFS: Facebook has the world's largest Hadoop cluster!
[4] S4 Distributed Stream Computing Platform - Overview.
[5] A. Bradic. S4 Distributed Stream Computing Platform.
[6] W. Zhou. Streaming Big Data. William Zhou's Blog, September 24, 2012.
[7] G. Ingersoll. Introducing Apache Mahout - Scalable, Commercial-Friendly Machine Learning for Building Intelligent Applications. September 8, 2009.
[8] G. Ingersoll. Introduction to Scalable Machine Learning with Apache Mahout. September 15, 2010.
[9] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis.
[10] J. Langford. Vowpal Wabbit, 2011.
[11] U. Kang, D. H. Chau, and C. Faloutsos. PEGASUS: Mining Billion-Scale Graphs in the Cloud. 2012.
[12] R. Smolan and J. Erwitt. The Human Face of Big Data. Sterling Publishing Company Incorporated, 2012.