
Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]
Hadoop and Spark, Streaming Analytics

Overview
- Introduction
- Hadoop: HDFS and MapReduce
- Spark: SparkSQL and MLlib
- Streaming analytics and other trends

Recall...

Two sides emerge.

Infrastructure
- Big Data Integration
- Architecture
- NoSQL and NewSQL
- Streaming
- AI and ML ops

Analytics
- Data Science
- Machine Learning
- AI
- NLP
- But also still: BI and Visualization

There's a difference between the two.

Previously: in-memory analytics, together with some intermediate techniques (Dask and friends) based on disk swapping and directed acyclic execution graphs.

Now we move to the world of big data:
- Managing, storing, and querying data
- Storage and computation in a distributed setup: a setting with multiple machines is assumed
- And (hopefully) the same for analytics: distributed data frames, distributed model training

Hadoop

At some point, Hadoop was mentioned every time a team was talking about some big, daunting task related to the analysis or management of big data:
- Have lots of volume? Hadoop!
- Unstructured data? Hadoop!
- Streaming data? Hadoop!
- Want to run super-fast machine learning in parallel? You guessed it... Hadoop!
So what is Hadoop, and what is it not?

The genesis of Hadoop came from the Google File System paper published in 2003, which spawned another research paper from Google: MapReduce. Hadoop itself, however, started as part of a search engine called Nutch, which was being worked on by Doug Cutting and Mike Cafarella: about 5k lines of code for NDFS (Nutch Distributed File System) and 6k lines of code for MapReduce.

In 2006, Cutting joined Yahoo! to work in its search engine division. The part of Nutch that dealt with distributed computing and processing (initially built to handle the simultaneous parsing of enormous amounts of web links in an efficient manner) was split off and renamed "Hadoop", after the toy elephant of Cutting's son. In 2008, Yahoo! open-sourced Hadoop, and it became part of an ecosystem of technologies managed by the non-profit Apache Software Foundation. Today, most of the hype around Hadoop has passed, for reasons we'll see later.

"Raw" Hadoop contains four core modules:
1. Hadoop Common: a set of shared libraries
2. Hadoop Distributed File System (HDFS): a Java-based file system to store data across multiple machines
3. MapReduce: a programming model to process large sets of data in parallel
4. YARN (Yet Another Resource Negotiator): a framework to schedule and handle resource requests in a distributed environment

In MapReduce version 1 (Hadoop 1), HDFS and MapReduce were tightly coupled, which didn't scale well to really big clusters. In Hadoop 2, the resource management and scheduling tasks are separated from MapReduce by YARN.

HDFS

HDFS is the distributed file system used by Hadoop to store data in the cluster. It lets you connect nodes (commodity computers) contained within clusters over which data files are distributed, and you can then access and store those files as one seamless file system. Theoretically, you don't need to have it running, and files could instead be stored elsewhere.

HDFS replicates file blocks for fault tolerance and high-throughput access. An application can specify the number of replicas of a file at the time it is created, and this number can be changed at any time after that. A "name node" makes all decisions concerning block replication; "data nodes" hold the actual data blocks.

One of the main goals of HDFS is to support large files. The size of a typical HDFS block is 64 MB, so a file consists of one or more 64 MB blocks, and HDFS tries to place each block on a separate data node.
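To make this concrete, here is a toy sketch in plain Python (not actual HDFS or Hadoop code: the node names, the random placement policy, and the plan_blocks helper are all invented for illustration) of how a large file is split into 64 MB blocks and how each block's replicas could be spread over data nodes.

import random

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the classic HDFS default block size
REPLICATION = 3                 # replication factor, configurable per file in HDFS

data_nodes = ["datanode-1", "datanode-2", "datanode-3", "datanode-4", "datanode-5"]

def plan_blocks(file_size_bytes, replication=REPLICATION):
    """Toy placement plan: one entry per block, each listing its replica nodes."""
    n_blocks = max(1, -(-file_size_bytes // BLOCK_SIZE))  # ceiling division
    plan = []
    for block_id in range(n_blocks):
        # The real name node applies placement rules; here we simply pick
        # `replication` distinct nodes at random for each block.
        replicas = random.sample(data_nodes, k=replication)
        plan.append((block_id, replicas))
    return plan

# A 200 MB file becomes four blocks (the last one only partially filled),
# each stored on three different data nodes.
for block_id, replicas in plan_blocks(200 * 1024 * 1024):
    print(f"block {block_id}: {replicas}")

The real name node uses more elaborate placement rules (rack awareness, for instance), but the key point is the same: replication and placement are decided per block, not per file.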
HDFS provides a native Java API and a native C-language wrapper for the Java API, as well as shell commands to interface with the file system:

byte[] fileData = readFile();
String filePath = "/data/course/participants.csv";
Configuration config = new Configuration();
org.apache.hadoop.fs.FileSystem hdfs = org.apache.hadoop.fs.FileSystem.get(config);
org.apache.hadoop.fs.Path path = new org.apache.hadoop.fs.Path(filePath);
org.apache.hadoop.fs.FSDataOutputStream outputStream = hdfs.create(path);
outputStream.write(fileData, 0, fileData.length);

In layman's terms: a massive, distributed C:-drive. Note, however, that reading in a massive file in a naive way will still end badly: HDFS is distributed storage, not distributed computing!

MapReduce

What is MapReduce? A "programming framework" for coordinating tasks in a distributed environment. HDFS uses it "behind the scenes": reading a file is converted to a MapReduce task that reads across multiple data nodes and streams the resulting file. More generally, MapReduce can be used to construct scalable and fault-tolerant operations. In short: HDFS provides a way to store files in a distributed fashion, and MapReduce allows you to do something with them in a distributed fashion.

The concepts of "map" and "reduce" existed long before Hadoop and stem from the domain of functional programming.

Map: apply a function to every item in a list; the result is a new list of values.

numbers = [1, 2, 3, 4, 5]
numbers.map(λ x : x * x)      # [1, 4, 9, 16, 25]

Reduce: apply a function over a list; the result is a scalar.

numbers.reduce(λ x : sum(x))  # 15

A Hadoop MapReduce pipeline works over lists of (key, value) pairs:
- The map operation maps each input pair to a list of output key-value pairs (zero, one, or more). This operation can be run in parallel over the input pairs. The input list could also consist of a single key-value pair.
- Next, the output entries are shuffled and distributed so that all output entries belonging to the same key are assigned to the same worker.
- All of these workers then apply a reduce function to each group, producing a final key-value pair for each distinct key.
- The resulting, final outputs are then (optionally) sorted per key to produce the final outcome.

MapReduce: word count example
(figures; a runnable sketch based on map_reduce.py follows at the end of this section)

MapReduce: averaging example

def map(key, value):
    yield (value['genre'], value['nrPages'])

def reduce(key, values):
    yield (key, sum(values) / len(values))

There's a gotcha, however: the reduce operation should work on partial results and be able to be applied multiple times in a chain.
1. The reduce function should output the same structure as emitted by the map function, since this output can be used again in an additional reduce operation.
2. The reduce function should provide correct results even if called multiple times on partial results.
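To see why the second requirement matters, here is a minimal sketch with invented page counts: averaging partial averages directly gives the wrong answer as soon as the partial groups have different sizes.

# Three page counts for one genre (illustrative numbers only).
pages = [200, 220, 140]

true_average = sum(pages) / len(pages)   # 186.66...

# Suppose the naive reduce first ran on two partitions of the data,
# and a second (chained) reduce then averaged the partial results:
partial_a = sum(pages[:2]) / 2           # 210.0 (average of two records)
partial_b = sum(pages[2:]) / 1           # 140.0 (average of one record)
chained = (partial_a + partial_b) / 2    # 175.0, not 186.66...

# The partial averages lost the group sizes, so the chained reduce
# weights them equally.
print(true_average, chained)

The corrected version below therefore carries the number of records along with the running average.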
def map(key, value):
    yield (value['genre'], (value['nrPages'], 1))

def reduce(key, values):
    # each incoming value is a (running average, count) pair;
    # a raw record from map is simply an average over 1 record
    sum, newcount = 0, 0
    for (nrPages, count) in values:
        sum = sum + nrPages * count
        newcount = newcount + count
    yield (key, (sum / newcount, newcount))

Instead of using a running average as a value, our value will now itself be a pair of (running average, number of records already seen).

(figure: correct averaging example)

Testing it out

map_reduce.py will be made available as background material. You can use it to play around with the MapReduce paradigm without setting up a full Hadoop stack (a word count version follows at the end of this section).

# Minimum per group example
from map_reduce import runtask

documents = [
    ('drama', 200), ('education', 100), ('action', 20), ('thriller', 20),
    ('drama', 220), ('education', 150), ('action', 10), ('thriller', 160),
    ('drama', 140), ('education', 160), ('action', 20), ('thriller', 30)
]

# Provide a mapping function of the form mapfunc(value)
# Must yield (k, v) pairs
def mapfunc(value):
    genre, pages = value
    yield (genre, pages)

# Provide a reduce function of the form reducefunc(key, list_of_values)
# Must yield (k, v) pairs
def reducefunc(key, values):
    yield (key, min(values))

# Pass your input list, mapping and reduce functions
runtask(documents, mapfunc, reducefunc)

Back to Hadoop

On Hadoop, MapReduce tasks are written in Java. Bindings for Python and other languages exist as well, but Java is the "native" environment. The Java program is packaged as a JAR archive and launched using the command:

hadoop jar myfile.jar ClassToRun [args...]
hadoop jar wordcount.jar RunWordCount /input/dataset.txt /output/

public static class MyReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    IntWritable result = new IntWritable();
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

hadoop jar wordcount.jar WordCount /users/me/dataset.txt /users/me/output/

$ hadoop fs -ls /users/me/output
Found 2 items
-rw-r--r--   1 root hdfs          0 2017-05-20 15:11 /users/me/output/_SUCCESS
-rw-r--r--   1 root hdfs       2069 2017-05-20 15:11 /users/me/output/part-r-00000

$ hadoop fs -cat /users/me/output/part-r-00000
and     2
first   1
is      3
line    2
second  1
the     2
this    3

MapReduce tasks can consist of more than mappers and reducers alone: there are also partitioners, combiners, shufflers, and sorters.

Constructing MapReduce programs requires a certain skillset in terms of programming (to put it lightly); there's a reason why most tutorials don't go much further than counting words. There are tradeoffs in terms of speed, memory consumption, and scalability, and big does not mean fast. Does your use case really align with that of a search engine?
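As referenced earlier, here is the word count example expressed against the map_reduce.py helper shown above. The runtask(documents, mapfunc, reducefunc) interface is taken from the "Testing it out" snippet; the input lines are invented for illustration (so the counts differ from the sample part-r-00000 output), and it is assumed the runner accepts any number of yielded pairs per input value, in line with the MapReduce definition given earlier.

from map_reduce import runtask

documents = [
    "this is the first line",
    "this is the second line",
    "and this is the third",
]

# Map: emit (word, 1) for every word in a line.
def mapfunc(value):
    for word in value.split():
        yield (word, 1)

# Reduce: sum the counts per word. The output has the same shape as the map
# output, and summing partial sums stays correct, so this reduce can safely
# be chained.
def reducefunc(key, values):
    yield (key, sum(values))

runtask(documents, mapfunc, reducefunc)

This mirrors what the Java WordCount job does: the mapper tokenizes each line and emits (word, 1), and the reducer sums the counts per key. Because summing is associative, the same reduce logic could also serve as a combiner, one of the extra components mentioned above.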
YARN

How is a MapReduce program coordinated amongst the different nodes in the cluster?