Big Technologies
By Kushagra Trivedi

Contents

• Introduction to big data and big data mining
• Apache Hadoop for big data mining
• Apache S4 for big data mining
• Apache Mahout for big data mining
• Some other tools of machine learning and data mining
• Comparison of big data mining technologies
• Conclusion
• References

Introduction To Big Data And Big Data Mining

• Data of large volume and greater complexity
• Definition of big data
• Sources of data expansion
• Definition of data mining
• Why data mining is necessary
• Some of the technologies used for data mining

Apache Hadoop

• Data-intensive distributed architecture
• Centralized server vs. distributed servers
• Built on MapReduce and the Hadoop Distributed File System (HDFS)
• HDFS divides data into blocks and distributes them among DataNodes
• Lets developers write applications that rapidly process large amounts of data in parallel on large clusters of compute nodes
• Applications - Yahoo, Facebook and other Fortune 50 companies are using Apache Hadoop

Hadoop Distributed File System

Figure 1 HDFS block distribution: a NameNode holding metadata for DataNodes 1-4, with data blocks b1, b2 and b3 replicated across them

• NameNode maintains all meta information about the DataNodes
• DataNodes contain the actual data blocks
• HDFS distributes and replicates data blocks among the DataNodes
• A client executing a query first goes to the NameNode, then locates the actual data by looking at this meta information (see the read sketch below)
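To make the client-side flow concrete, here is a minimal read through the HDFS Java API (org.apache.hadoop.fs.FileSystem); the block lookup against the NameNode happens transparently inside open(). The path /user/demo/input.txt is a hypothetical example.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Configuration picks up fs.defaultFS (the NameNode address) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Opening the file asks the NameNode for block locations,
        // then streams the blocks directly from the DataNodes
        Path file = new Path("/user/demo/input.txt"); // hypothetical path
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}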

MapReduce Algorithm

Figure 2 MapReduce distribution [2]: a word-count example in which MAP emits (word, 1) pairs from each input sentence, GROUP collects the pairs by word, and REDUCE sums the counts for each word

• Uses two functions: map and reduce (a minimal word-count sketch follows this list)
• Data are fed into the map function to produce intermediate key-value pairs
• The intermediate results are then given to the reduce function to produce the final result
• TaskTracker - does the work assigned to it by the JobTracker
• JobTracker - assigns tasks to TaskTrackers and, if a TaskTracker fails, reassigns its tasks to another node
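As a sketch of the map and reduce functions described above, here is the classic Hadoop word-count job in Java. The input and output directories are taken from the command line, and the mapper/reducer class names are the usual illustrative ones rather than anything mandated by Hadoop.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts grouped under each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}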

Apache S4

• S4 stands for Simple Scalable Streaming System
• Uses MapReduce and the Actor model for computation
• Data processing is done through processing elements
• The S4 framework provides a way to create and route processing elements as needed
• Applications - Yahoo, LinkedIn, A9 and Quantbench are several companies that use Apache S4 for big data mining


Figure 3 S4 word count sample [6]


• Processing elements (PEs) are the basic computational units
• A processing element only consumes events carrying the key for which it was created
• A special keyless processing element accepts any type of input
• Processing nodes are the logical hosts of processing elements
• S4 routes events to processing nodes based on a hash of the keyed attributes in those events (an illustrative sketch of this keyed routing follows)
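The routing rule in the last bullet can be sketched in plain Java. This is illustrative only, not the actual Apache S4 API, and all class and method names are hypothetical: events carrying a keyed attribute (here, a word) are dispatched to one processing-element instance per key, created on demand.

import java.util.HashMap;
import java.util.Map;

// Illustrative only: plain Java, not the Apache S4 API.
// One counter "processing element" is created per distinct key (here, per word),
// mirroring how S4 routes keyed events to keyed processing elements.
public class KeyedDispatchSketch {

    // A minimal keyed processing element: counts events for the single key it owns
    static class WordCountPE {
        private final String word;
        private long count = 0;

        WordCountPE(String word) {
            this.word = word;
        }

        void onEvent() {
            count++;
            System.out.println(word + " -> " + count);
        }
    }

    private final Map<String, WordCountPE> peByKey = new HashMap<>();

    // Route an event on its keyed attribute, creating the PE on first use
    void dispatch(String word) {
        peByKey.computeIfAbsent(word, WordCountPE::new).onEvent();
    }

    public static void main(String[] args) {
        KeyedDispatchSketch app = new KeyedDispatchSketch();
        for (String w : "the league of evil villains the league".split(" ")) {
            app.dispatch(w);
        }
    }
}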

Apache Mahout

• Open source project of the Apache foundation that allows programmers to write machine learning algorithms
• Works on three kinds of algorithms: clustering, classification and collaborative filtering
• Includes several distributed clustering algorithms such as k-Means, Fuzzy k-Means, Dirichlet, Mean-Shift and Canopy
• Applications - recommending products you want to buy, people you might want to connect with, potential life partners and songs you might like


1) Building a recommendation engine
• Mahout currently provides the "Taste" library for building recommendation engines
• The library supports both user-based and item-based recommendations
• Five primary components - DataModel, UserSimilarity, ItemSimilarity, Recommender, UserNeighborhood
• Using these components, developers can build applications that give online and offline recommendations (a short sketch follows this list)
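Below is a minimal user-based recommender assembled from the Taste components listed above; the data file name preferences.csv, the neighborhood size of 10 and the Pearson similarity are illustrative assumptions, not requirements of the library.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteUserBasedExample {
    public static void main(String[] args) throws Exception {
        // preferences.csv (hypothetical file): lines of userID,itemID,preference
        DataModel model = new FileDataModel(new File("preferences.csv"));

        // How similar are two users, based on their preference values?
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // The 10 most similar users form each user's neighborhood
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        // User-based recommender built from the components above
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for the user with ID 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}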


2) Clustering with Apache Mahout
• The clustering algorithms are written using the MapReduce paradigm
• Canopy, k-Means, Mean-Shift and Dirichlet are the available clustering algorithms
• Select the data and convert it into a numerical representation
• Select any one of the algorithms above
• Evaluate the result (a plain-Java sketch of one k-Means iteration follows this list)
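This is not the Mahout driver API itself, but a plain-Java sketch of what one k-Means iteration does with numerical data, mirroring the map step (assign each point to its nearest centroid) and reduce step (recompute each centroid as the mean of its points) structure mentioned above; the data and class names are made up for illustration.

import java.util.ArrayList;
import java.util.List;

// Illustrative only: one k-Means iteration over 1-D points, not the Mahout API.
public class KMeansIterationSketch {

    // "Map" step: index of the nearest centroid for a point
    static int nearest(double point, double[] centroids) {
        int best = 0;
        for (int i = 1; i < centroids.length; i++) {
            if (Math.abs(point - centroids[i]) < Math.abs(point - centroids[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 1.8, 8.0, 8.3, 9.1};
        double[] centroids = {1.0, 9.0};

        // Group points by their nearest centroid
        List<List<Double>> assigned = new ArrayList<>();
        for (int i = 0; i < centroids.length; i++) {
            assigned.add(new ArrayList<>());
        }
        for (double p : points) {
            assigned.get(nearest(p, centroids)).add(p);
        }

        // "Reduce" step: recompute each centroid as the mean of its assigned points
        for (int i = 0; i < centroids.length; i++) {
            double sum = 0;
            for (double p : assigned.get(i)) {
                sum += p;
            }
            if (!assigned.get(i).isEmpty()) {
                centroids[i] = sum / assigned.get(i).size();
            }
            System.out.println("centroid " + i + " -> " + centroids[i]);
        }
    }
}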


3) Categorizing content with Mahout
• Two approaches to categorization - the naive Bayes classifier and the complementary naive Bayes classifier
• One part of the naive Bayes process keeps track of the words associated with a particular document and category
• The second part predicts the category of new content using the counts from part one
• The complementary naive Bayes classifier is similar to the naive Bayes approach and retains its simplicity while trying to correct some of its shortcomings (a counting sketch follows this list)
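Here is a toy, plain-Java sketch of the two parts described above (counting words per category, then scoring new content against those counts); it is not the Mahout classifier API, and class priors are left uniform for brevity.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: a tiny naive Bayes-style counter and scorer, not the Mahout classifier API.
public class NaiveBayesSketch {

    // Part one: word counts per category, built from labelled documents
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    void train(String category, String document) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String word : document.toLowerCase().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Part two: score a new document against each category using smoothed word likelihoods
    String predict(String document) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : wordCounts.keySet()) {
            double score = 0.0;
            for (String word : document.toLowerCase().split("\\s+")) {
                int count = wordCounts.get(category).getOrDefault(word, 0);
                // Laplace smoothing so unseen words do not zero out the score
                score += Math.log((count + 1.0) / (totalWords.get(category) + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("sports", "goal match team score win");
        nb.train("tech", "hadoop cluster data node mapreduce");
        System.out.println(nb.predict("data node cluster"));   // expected: tech
    }
}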

Some Other Tools of Machine Learning and Data Mining

• R is used for high-performance statistical computing on big data
• Massive Online Analysis (MOA) is a framework of machine learning algorithms for mining data streams
• MOA supports classification, regression, clustering, frequent item set mining and frequent graph mining
• Vowpal Wabbit is able to handle terabytes of data and aims to exceed the throughput of a single machine's network interface
• Pegasus is a big graph mining tool that finds patterns and anomalies in massive graphs
• GraphLab is a high-level parallel data mining system built without using MapReduce

Comparison

• Apache Hadoop is used for batch processing
• Data is divided into large blocks, which makes it easy to handle
• Segmentation adds extra overhead
• Apache S4 is used for streaming data
• No need for segmentation of the data
• Nodes cannot be added to or removed from running clusters
• Apache Mahout is used for writing machine learning algorithms
• No lack of community support; documentation and examples are provided

Conclusion

• Big data is a crucial concern, as the amount of data will keep increasing in the future
• Different techniques are needed for mining this big data
• Apache Mahout gives recommendations to users according to their past preferences
• Hadoop is used for data mining using MapReduce and HDFS
• Apache S4 is used for mining streams of data
• All techniques have their own significance for different types of companies

References

[1] R. Natarajan. Apache Hadoop Fundamentals - HDFS and MapReduce Explained with a Diagram. January 4, 2012.
[2] Guruzon.com. Pros and Cons of Hadoop. June 1, 2013.
[3] HDFS: Facebook has the world's largest Hadoop cluster!
[4] S4 Distributed Stream Computing Platform - Overview.
[5] A. Bradic. S4 Distributed Stream Computing Platform.
[6] W. Zhou. Streaming Big Data. William Zhou's Blog, September 24, 2012.
[7] G. Ingersoll. Introducing Apache Mahout - Scalable, Commercial-Friendly Machine Learning for Building Intelligent Applications. September 8, 2009.
[8] G. Ingersoll. Introduction to Scalable Machine Learning with Apache Mahout. September 15, 2010.
[9] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis.
[10] J. Langford. Vowpal Wabbit, 2011.
[11] U. Kang, D. H. Chau, and C. Faloutsos. PEGASUS: Mining Billion-Scale Graphs in the Cloud. 2012.
[12] R. Smolan and J. Erwitt. The Human Face of Big Data. Sterling Publishing Company Incorporated, 2012.