Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]
Hadoop and MapReduce, Spark, Streaming Analytics

Overview
Big data, Hadoop, MapReduce, Spark, SparkSQL, MLlib, streaming analytics and other trends

The landscape is incredibly complex

Heard about Hadoop? Spark? H2O?
Many vendors with their own "big data and analytics" stack: Amazon, Cloudera, Datameer, DataStax, Dell, Oracle, IBM, MapR, Pentaho, Databricks, Microsoft, Hortonworks, EMC2
There's always "roll your own"
Open source, or walled garden? Support? What's up to date? Which features?

Two sides emerge
Infrastructure: "Big Data", "Integration", "Architecture", "Streaming"
Analytics: "Data Science", "Machine Learning", "AI"
There's a difference

Previously
In-memory analytics, together with some intermediate techniques: mostly based on disk swapping and directed acyclic execution graphs
Now: moving to the world of big data
Managing, storing, querying data
Storage and computation in a distributed setup: over multiple machines
And (hopefully still, though it will be difficult for a while...): analytics

Hadoop
We all know of Hadoop and hear it mentioned every time a team is talking about some big, daunting task related to the analysis or management of big data
Have lots of volume? Hadoop! Unstructured data? Hadoop! Streaming data? Hadoop! Want to run super-fast machine learning in parallel? You guessed it… Hadoop!
So what is Hadoop, and what is it not?

History
The genesis of Hadoop was the Google File System paper published in 2003, which spawned another research paper from Google: MapReduce
Hadoop itself, however, started as part of a search engine called Nutch, which was being worked on by Doug Cutting and Mike Cafarella: about 5k lines of code for NDFS (Nutch Distributed File System) and 6k lines of code for MapReduce
In 2006, Cutting joined Yahoo! to work in its search engine division. The part of Nutch that dealt with distributed computing and processing (initially built to handle the simultaneous parsing of enormous amounts of web links in an efficient manner) was split off and renamed "Hadoop", after the toy elephant of Cutting's son
In 2008, Yahoo! open-sourced Hadoop, and it became part of an ecosystem of technologies managed by the non-profit Apache Software Foundation
Today, most of the hype around Hadoop has passed, for reasons we'll see later on

Hadoop
Even when talking about "raw" Hadoop, it is important to know that the name describes a stack containing four core modules:
1. Hadoop Common (a set of shared libraries)
2. Hadoop Distributed File System (HDFS), a Java-based file system to store data across multiple machines
3. MapReduce (a programming model to process large sets of data in parallel)
4. YARN (Yet Another Resource Negotiator), a framework to schedule and handle resource requests in a distributed environment
In MapReduce version 1 (Hadoop 1), HDFS and MapReduce were tightly coupled, which didn't scale well to really big clusters
In Hadoop 2, the resource management and scheduling tasks are separated from MapReduce by YARN

HDFS
HDFS is the distributed file system used by Hadoop to store data in the cluster
HDFS lets you connect nodes (commodity personal computers, which was a big deal at the time) over which data files are distributed
You can then access and store the data files as one seamless file system
HDFS is fault tolerant and provides high-throughput access
Theoretically, you don't need to have it running, and files could instead be stored elsewhere
HDFS is composed of a NameNode, an optional SecondaryNameNode (for data recovery in the event of failure), and DataNodes which hold the actual data
The NameNode holds all the metadata regarding the stored files, manages namespace operations like opening, closing, and renaming files and directories, and maps data blocks to DataNodes
DataNodes handle read and write requests from HDFS clients and also create, delete, and replicate data blocks according to instructions from the governing NameNode
A typical installation cluster has a dedicated machine that runs the name node and at least one data node; data nodes continuously loop, asking the name node for instructions
HDFS supports a hierarchical file organization of directories and files inside them

HDFS
HDFS replicates file blocks for fault tolerance
An application can specify the number of replicas of a file at the time it is created, and this number can be changed at any time after that. The name node makes all decisions concerning block replication
One of the main goals of HDFS is to support large files. The size of a typical HDFS block is 64MB; therefore, each HDFS file consists of one or more 64MB blocks
HDFS tries to place each block on separate data nodes

HDFS
(architecture diagrams)

HDFS
HDFS provides a native Java API and a native C-language wrapper for the Java API, as well as shell commands to interface with the file system

    byte[] fileData = readFile();
    String filePath = "/data/course/participants.csv";

    Configuration config = new Configuration();
    org.apache.hadoop.fs.FileSystem hdfs = org.apache.hadoop.fs.FileSystem.get(config);
    org.apache.hadoop.fs.Path path = new org.apache.hadoop.fs.Path(filePath);
    org.apache.hadoop.fs.FSDataOutputStream outputStream = hdfs.create(path);
    outputStream.write(fileData, 0, fileData.length);

In layman's terms: a massive, distributed C:-drive…
And note: reading in a massive file in a naive way will still end up in trouble
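To make the block and replica bookkeeping described above concrete, here is a small, self-contained Python sketch. The 64MB block size and the idea of spreading replicas over distinct data nodes come from the slides; the node names, replication factor of 3, and the round-robin placement policy are simplified assumptions, not Hadoop's actual placement logic.

    import itertools

    BLOCK_SIZE = 64 * 1024 * 1024   # typical HDFS block size: 64 MB
    REPLICATION = 3                 # replicas per block (configurable per file in HDFS)
    DATA_NODES = ['datanode1', 'datanode2', 'datanode3', 'datanode4']  # hypothetical nodes

    def n_blocks(file_size):
        # A file is stored as one or more fixed-size blocks
        return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

    def place_blocks(file_size):
        # Toy placement: give each block REPLICATION replicas on distinct nodes,
        # handing out nodes round-robin (in HDFS the NameNode does this bookkeeping)
        nodes = itertools.cycle(DATA_NODES)
        return {block: [next(nodes) for _ in range(REPLICATION)]
                for block in range(n_blocks(file_size))}

    # A 200 MB file ends up as 4 blocks, each stored on 3 different data nodes
    print(place_blocks(200 * 1024 * 1024))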
MapReduce
What is MapReduce?
A "programming framework" for coordinating tasks in a distributed environment
HDFS uses this behind the scenes to make access fast: reading a file is converted to a MapReduce task that reads across multiple DataNodes and streams the resulting file
Can be used to construct scalable and fault-tolerant operations in general
HDFS provides a way to store files in a distributed fashion; MapReduce allows you to do something with them in a distributed fashion

MapReduce
The concepts of "map" and "reduce" existed long before Hadoop and stem from the domain of functional programming
Map: apply a function to every item in a list; the result is a new list of values

    numbers = [1,2,3,4,5]
    numbers.map(λ x : x * x)  # [1,4,9,16,25]

Reduce: apply a function to a list; the result is a scalar

    numbers.reduce(λ x : sum(x))  # 15

MapReduce
A Hadoop map-reduce pipeline works over lists of (key, value) pairs
The map operation maps each pair to a list of output key-value pairs (zero, one, or more); this operation can be run in parallel over the input pairs
The input list could also contain a single key-value pair
Next, the output entries are shuffled and distributed so that all output entries belonging to the same key are assigned to the same worker
All of these workers then apply a reduce function to each group, producing a final key-value pair for each distinct key
The resulting, final outputs are then (optionally) sorted per key to produce the final outcome

MapReduce

    # Input: a list of key-value pairs
    documents = [
        ('document1', 'two plus two does'),
        ('document2', 'not equal two'),
    ]

    def map(key, value):
        # For each word, produce an output pair
        for word in value.split(' '):
            yield (word, 1)

    for (key, value) in documents:
        map(key, value)
    # [ (two, 1), (plus, 1), (two, 1), (does, 1) ]
    # [ (not, 1), (equal, 1), (two, 1) ]

    def reduce(key, values):
        # For each key, produce output as the sum of the values
        yield (key, sum(values))

    reduce('two', [1, 1, 1])  # ('two', 2)
    reduce('plus', [1])       # ('plus', 1)
    # ... and so on

MapReduce: word count example
(diagrams)

MapReduce: averaging example

    def map(key, value):
        yield (value['genre'], value['nrPages'])

MapReduce: averaging example

    # Minibatch-style approach would also be possible
    def map(key, value):
        for record in value:
            yield (record['genre'], record['nrPages'])

MapReduce: averaging example

    def reduce(key, values):
        yield (key, sum(values) / len(values))

MapReduce: averaging example
There's a gotcha, however: the reduce operation should work on partial results and be able to be applied multiple times in a chain (see the sketch below)

MapReduce: averaging example
(diagram)
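As a quick illustration of why this matters, the following sketch applies the naive averaging reduce once over all values and then again in a chain over partial results (the genre label and page counts are made-up illustration values); the chained result no longer equals the true average.

    def naive_reduce(key, values):
        # The naive averaging reduce from above
        yield (key, sum(values) / len(values))

    pages = [100, 200, 600]   # hypothetical page counts for one genre

    # Reducing everything at once gives the true average
    print(list(naive_reduce('thriller', pages)))   # [('thriller', 300.0)]

    # Reducing in a chain (first two values, then a "reduce of reduces") goes wrong:
    _, partial_avg = next(naive_reduce('thriller', pages[:2]))              # 150.0
    _, chained_avg = next(naive_reduce('thriller', [partial_avg, pages[2]]))
    print(chained_avg)   # 375.0, not the true average of 300.0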
MapReduce
The reduce operation should work on partial results and be able to be applied multiple times in a chain:
1. The reduce function should output the same structure as emitted by the map function, since this output can be used again in an additional reduce operation
2. The reduce function should provide correct results even if called multiple times on partial results

MapReduce: correct averaging example

    def map(key, value):
        yield (value['genre'], (value['nrPages'], 1))

    def reduce(key, values):
        total, newcount = 0, 0
        for (nrPages, count) in values:
            total = total + nrPages * count
            newcount = newcount + count
        yield (key, (total / newcount, newcount))

Instead of using a running average as a value, our value will now itself be a pair of (running average, number of records already seen)

MapReduce: correct averaging example
(diagram)

Testing it out
A small piece of Python code, map_reduce.py, will be made available as background material
You can use this to play around with the MapReduce paradigm without setting up a full Hadoop stack

Testing it out: word count example

    from map_reduce import runtask

    documents = ['een twee drie drie drie vier twee acht',
                 'acht drie twee zes vijf twee',
                 'zes drie acht']

    # Provide a mapping function of the form mapfunc(value)
    # Must yield (k,v) pairs
    def mapfunc(value):
        for x in value.split():
            yield (x, 1)

    # Provide a reduce function of the form reducefunc(key, list_of_values)
    # Must yield (k,v) pairs
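The reduce function to pair with this can mirror the earlier word-count reduce: sum the counts emitted per word. Since map_reduce.py itself is not reproduced here, the exact runtask call is not shown; the sketch below instead uses a small stand-in driver, run_pipeline, that performs the same map, shuffle, and reduce steps in plain Python (the driver and its interface are illustrative assumptions, not the actual API of the course's helper).

    from collections import defaultdict

    # Word-count reduce: sum the counts emitted by the map phase for each word
    def reducefunc(key, values):
        yield (key, sum(values))

    # Stand-in driver (not map_reduce.py): map every input value, shuffle the
    # intermediate pairs by key, then reduce each group of values per key
    def run_pipeline(values, mapfunc, reducefunc):
        groups = defaultdict(list)
        for value in values:
            for k, v in mapfunc(value):   # map phase
                groups[k].append(v)       # shuffle: group by key
        results = []
        for k, vs in groups.items():      # reduce phase, one call per distinct key
            results.extend(reducefunc(k, vs))
        return sorted(results)            # optional sort per key

    # Using documents and mapfunc as defined above:
    print(run_pipeline(documents, mapfunc, reducefunc))
    # e.g. [('acht', 3), ('drie', 5), ('een', 1), ('twee', 4), ...]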