Data-Intensive Distributed Computing CS 431/631 (Fall 2021)

Part 3: From MapReduce to Spark (1/3)

Ali Abedi

These slides are available at https://www.student.cs.uwaterloo.ca/~cs451/

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

The datacenter is the computer! What’s the instruction set?


The analogy: Map/Reduce (with Combine/Partition) is the abstraction, playing the role of the instruction set; the cluster of computers plays the role of the CPU.


We need a solution for both storage and computing.

So you like programming in assembly?

Source: Wikipedia (ENIAC)

So is programming in MapReduce like programming in assembly?! How can we do better?

What’s the solution?

Design a higher-level language
Write a compiler


Hadoop is great, but it’s really waaaaay too low level!

What we really need is SQL! Answer: Hive
What we really need is a scripting language! Answer: Pig


Yahoo (with Pig) and Facebook (with Hive) designed their own solutions on top of Hadoop to make it easier to use for their engineers.

SQL: Hive
Scripting: Pig

Both are open-source projects today!

Hive    Pig

MapReduce

HDFS


At the end of the day, Pig and Hive programs are compiled into MapReduce jobs.

Pig!

Source: Wikipedia (Pig)

Pig: Example
Task: Find the top 10 most visited pages in each category

Visits                         URL Info

User  Url         Time         Url         Category  PageRank
Amy   cnn.com     8:00         cnn.com     News      0.9
Amy   bbc.com     10:00        bbc.com     News      0.8
Amy   flickr.com  10:05        flickr.com  Photos    0.7
Fred  cnn.com     12:00        espn.com    Sports    0.9

Pig slides adapted from Olston et al. (SIGMOD 2008)

Pig: Example Script

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';

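To make the dataflow concrete, the same computation can be sketched in plain Python. This is only an illustrative stand-in, not how Pig executes; the `visits` and `url_info` lists are the sample rows from the tables above, and all names here are made up for the example:

```python
from collections import Counter, defaultdict

# In-memory stand-ins for /data/visits and /data/urlInfo (sample rows from above)
visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
url_info = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
            ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]

# group visits by url; foreach gVisits generate url, count(visits)
visit_counts = Counter(url for _, url, _ in visits)

# join visitCounts by url, urlInfo by url
joined = [(url, category, visit_counts[url])
          for url, category, _ in url_info if url in visit_counts]

# group visitCounts by category; foreach gCategories generate top(visitCounts, 10)
by_category = defaultdict(list)
for url, category, count in joined:
    by_category[category].append((url, count))
top_urls = {cat: sorted(urls, key=lambda x: -x[1])[:10]
            for cat, urls in by_category.items()}
# top_urls == {"News": [("cnn.com", 2), ("bbc.com", 1)], "Photos": [("flickr.com", 1)]}
```

Each line of the Pig script corresponds to one step here; Pig's job is to compile exactly this kind of dataflow into MapReduce jobs for you.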

Pig Query Plan

load visits

group by url

foreach url generate count

load urlInfo

join on url

group by category

foreach category generate top(urls, 10)


Pig: MapReduce Execution

Map1: load visits
Reduce1: group by url
Map2: foreach url generate count; load urlInfo
Reduce2: join on url
Map3: group by category
Reduce3: foreach category generate top(urls, 10)



But isn’t Pig slower? Sure, but C can be slower than assembly too…


The datacenter is the computer! What’s the instruction set? Okay, let’s fix this!

Source: Google

Having to formulate the problem in terms of map and reduce only is restrictive.

MapReduce Workflows

HDFS → map → reduce → HDFS → map → reduce → HDFS → map → reduce → HDFS → map → reduce → HDFS

What’s wrong?

There is a lot of disk I/O involved, which significantly slows down MapReduce workflows like this.

Want MM?

HDFS → map → HDFS ✔

HDFS → map → map → HDFS ✗

It’s okay not to have a reduce, but the output of one map cannot go directly to another map.

Want MRR?

HDFS → map → reduce → HDFS ✔

HDFS → map → reduce → reduce → HDFS ✗

Similarly, we cannot directly feed the output of one reduce into another reduce.

The datacenter is the computer! Let’s enrich the instruction set!

Source: Google

Can we add more operations to make the instruction set more flexible?

Spark
Answer to “What’s beyond MapReduce?”

Brief history:
Developed at UC Berkeley AMPLab in 2009
Open-sourced in 2010
Became a top-level Apache project in February 2014


Spark vs. Hadoop

[Google Trends chart: search interest in Spark vs. Hadoop, with September 2014 marked]


Spark is more popular than Hadoop today.

MapReduce

List[(K1,V1)]

map f: (K1, V1) ⇒ List[(K2, V2)]

reduce g: (K2, Iterable[V2]) ⇒ List[(K3, V3)]

List[(K3, V3)]


This is the only mechanism we had in MapReduce.
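This two-function contract can be modeled on a single machine in a few lines of plain Python; `run_mapreduce`, `wc_map`, and `wc_reduce` are illustrative names for this sketch, not part of Hadoop:

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Minimal single-machine model of the MapReduce contract:
    mapper: (k1, v1) -> list of (k2, v2); reducer: (k2, [v2]) -> list of (k3, v3)."""
    # Map phase
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            intermediate[k2].append(v2)   # shuffle: group values by key
    # Reduce phase
    output = []
    for k2, values in sorted(intermediate.items()):
        output.extend(reducer(k2, values))
    return output

# Word count, the canonical example
def wc_map(doc_id, text):
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

result = run_mapreduce([("d1", "a b a"), ("d2", "b c")], wc_map, wc_reduce)
# result == [("a", 2), ("b", 2), ("c", 1)]
```

Everything must be squeezed through this one mechanism, which is exactly the restriction Spark relaxes.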

Map-like Operations

map:            RDD[T] ⇒ RDD[U]    f: (T) ⇒ U
filter:         RDD[T] ⇒ RDD[T]    f: (T) ⇒ Boolean
mapPartitions:  RDD[T] ⇒ RDD[U]    f: (Iterator[T]) ⇒ Iterator[U]
flatMap:        RDD[T] ⇒ RDD[U]    f: (T) ⇒ TraversableOnce[U]


But Spark provides many more operations (enriched instruction set).
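The semantics of these four operations can be sketched with plain Python lists standing in for RDDs (ignoring distribution; no Spark installation assumed):

```python
data = [1, 2, 3, 4]

# map: exactly one output element per input element
mapped = [x * 10 for x in data]                       # [10, 20, 30, 40]

# filter: keep elements where the predicate is true; element type unchanged
filtered = [x for x in data if x % 2 == 0]            # [2, 4]

# flatMap: each element yields zero or more outputs, flattened together
flat = [y for x in data for y in range(x)]            # [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]

# mapPartitions: the function sees a whole partition (an iterator) at once
def part_sum(it):
    yield sum(it)                                     # one output per partition

partitions = [[1, 2], [3, 4]]                         # an "RDD" with 2 partitions
per_partition = [v for p in partitions for v in part_sum(iter(p))]  # [3, 7]
```

Note how `mapPartitions` differs from `map`: it can amortize per-partition setup (e.g., opening a connection) because it is called once per partition, not once per element.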

Reduce-like Operations

groupByKey:      RDD[(K, V)] ⇒ RDD[(K, Iterable[V])]
reduceByKey:     RDD[(K, V)] ⇒ RDD[(K, V)]     f: (V, V) ⇒ V
aggregateByKey:  RDD[(K, V)] ⇒ RDD[(K, U)]     seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U

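The semantics of the three operations can likewise be sketched in plain Python, with lists of pairs standing in for RDDs (distribution ignored; all names here are for illustration only):

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# groupByKey: collect all values per key
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)                       # {"a": [1, 3], "b": [2, 4]}

# reduceByKey: fold values per key with f: (V, V) => V
reduced = {}
for k, v in pairs:
    reduced[k] = v if k not in reduced else reduced[k] + v   # {"a": 4, "b": 6}

# aggregateByKey: seqOp (U, V) => U folds within a partition,
# combOp (U, U) => U merges partial results across partitions.
# Here U is a (sum, count) pair, so the result type differs from V.
zero = (0, 0)
seq_op = lambda acc, v: (acc[0] + v, acc[1] + 1)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

parts = [pairs[:2], pairs[2:]]                # two "partitions"
partials = []
for part in parts:                            # seqOp runs per partition
    local = {}
    for k, v in part:
        local[k] = seq_op(local.get(k, zero), v)
    partials.append(local)
agg = {}
for local in partials:                        # combOp merges partitions
    for k, u in local.items():
        agg[k] = comb_op(agg[k], u) if k in agg else u
# agg == {"a": (4, 2), "b": (6, 2)}
```

`reduceByKey` and `aggregateByKey` can combine values map-side before the shuffle (like a MapReduce combiner), which is why they are usually preferred over `groupByKey` when the final result is an aggregate.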

And many other operations!
