Stream Processing

COS 418: Distributed Systems – Lecture 22
Michael Freedman

Simple processing

• Single node
  – Read data from socket
  – Process
  – Write output

Examples: Stateless conversion

CtoF

• Convert Celsius temperature to Fahrenheit
  – Stateless operation: emit (input * 9 / 5) + 32

Examples: Stateless filtering

Filter

• Function can filter inputs
  – if (input > threshold) { emit input }
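A minimal Python sketch of the two stateless operators, assuming a simple per-tuple call model; the function names and the threshold value are illustrative, not from the lecture:

def ctof(celsius):
    # Stateless conversion: the output depends only on the current input.
    return celsius * 9 / 5 + 32

def over_threshold(value, threshold=100.0):
    # Stateless filter: emit the input only if it exceeds the threshold.
    return value > threshold

for reading in [20.0, 45.0, 100.0]:          # toy input stream (Celsius)
    f = ctof(reading)
    if over_threshold(f):
        print(f)                              # emit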

Examples: Stateful conversion

EWMA

• Compute EWMA of Fahrenheit temperature
  – new_temp = α * CtoF(input) + (1 – α) * last_temp
  – last_temp = new_temp
  – emit new_temp

Examples: Aggregation (stateful)

Avg

• E.g., Average value per window
  – Window can be # elements (10) or time (1s)
  – Windows can be disjoint, i.e., “tumbling” (a 5s window every 5s)
  – Windows can be overlapping, i.e., “sliding” (a 5s window every 1s)
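A minimal sketch of the two stateful operators, assuming α = 0.5 and a tumbling count window of 10 elements (both values and all names are illustrative):

class EWMA:
    # Stateful conversion: keeps last_temp across inputs.
    def __init__(self, alpha=0.5):
        self.alpha, self.last_temp = alpha, None

    def process(self, celsius):
        f = celsius * 9 / 5 + 32                                  # CtoF(input)
        new_temp = f if self.last_temp is None else \
            self.alpha * f + (1 - self.alpha) * self.last_temp    # EWMA update
        self.last_temp = new_temp
        return new_temp                                           # emit new_temp

class WindowedAvg:
    # Stateful aggregation: average over a tumbling window of `size` elements.
    def __init__(self, size=10):
        self.size, self.buf = size, []

    def process(self, value):
        self.buf.append(value)
        if len(self.buf) == self.size:        # window full
            avg = sum(self.buf) / self.size
            self.buf = []                     # disjoint (tumbling) windows
            return avg                        # emit one result per window
        return None                           # nothing to emit yet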

Stream processing as chain

[Figure: a single pipeline of operators: CtoF → Avg → Filter]

Stream processing as directed graph

[Figure: a directed graph with sources “sensor type 1” and “sensor type 2”, operators CtoF, KtoF, Avg, and Filter, and sinks “alerts” and “storage”]
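One way to read the chain figure is as operator composition; here is a toy single-node rendering in Python, using generators so each stage pulls from the one before it (operator bodies, window size, and threshold are illustrative):

def ctof(stream):
    for c in stream:
        yield c * 9 / 5 + 32

def avg(stream, size=10):
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) == size:                  # tumbling window of `size` elements
            yield sum(buf) / size
            buf = []

def over_threshold(stream, threshold=80.0):
    for x in stream:
        if x > threshold:
            yield x

readings = (20.0 + i for i in range(30))      # fake Celsius sensor stream
for alert in over_threshold(avg(ctof(readings))):   # CtoF -> Avg -> Filter
    print("alert:", alert)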

The challenge of stream processing

• Large amounts of data to process in real time

• Examples
  – Social network trends (#trending)
  – Intrusion detection systems (networks, datacenters)
  – Sensors: Detect earthquakes by correlating vibrations of millions of smartphones
  – Fraud detection
    • Visa: 2,000 txn/sec on average, peak ~47,000/sec

Enter “BIG DATA”

Scale “up”

• Tuple-by-tuple:
    input ← read
    if (input > threshold) {
      emit input
    }

• Micro-batch:
    inputs ← read
    out = []
    for input in inputs {
      if (input > threshold) {
        out.append(input)
      }
    }
    emit out

Scale “up”

                Tuple-by-tuple    Micro-batch
  Latency       Lower             Higher
  Throughput    Lower             Higher

• Why? Each read/write is a system call into the kernel. More cycles are spent on kernel/application transitions (context switches), and fewer are actually spent processing data.
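A runnable illustration of that tradeoff, using a local socket pair and newline-delimited tuples (all sizes here are assumptions): a tiny receive buffer forces roughly one kernel crossing per tuple, while a large buffer lets one crossing deliver a whole micro-batch.

import socket

def consume(batch_bytes):
    a, b = socket.socketpair()
    a.sendall(b"42\n" * 200)                  # 200 small newline-delimited tuples
    a.close()
    syscalls = tuples = 0
    while True:
        data = b.recv(batch_bytes)            # one kernel crossing per call
        if not data:
            break
        syscalls += 1
        tuples += data.count(b"\n")
    b.close()
    return syscalls, tuples

print(consume(4))          # tuple-by-tuple-ish: roughly one syscall per tuple, lower latency
print(consume(64 * 1024))  # micro-batch: a few syscalls for the same 200 tuples, higher throughput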

Scale “out”

Stateless operations: trivially parallelized

[Figure: three parallel operator instances, each converting C → F]
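Because the operator keeps no state, any worker can handle any record; a minimal sketch using a process pool (pool size and chunking are arbitrary choices, not from the lecture):

from concurrent.futures import ProcessPoolExecutor

def ctof(celsius):
    # Stateless: each record can be converted by any worker, in any order.
    return celsius * 9 / 5 + 32

if __name__ == "__main__":
    readings = [float(i) for i in range(10_000)]
    # Three workers mirror the three parallel C -> F boxes in the figure.
    with ProcessPoolExecutor(max_workers=3) as pool:
        fahrenheit = list(pool.map(ctof, readings, chunksize=1_000))
    print(fahrenheit[:3])                    # [32.0, 33.8, 35.6]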

State complicates parallelization

• Aggregations:
  – Need to join results across parallel computations

[Figure: the single pipeline CtoF → Avg → Filter]

State complicates parallelization

• Aggregations:
  – Need to join results across parallel computations

[Figure: three parallel CtoF → Filter pipelines, each emitting a per-partition Sum and Cnt that are joined into a global Avg]
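A sketch of why the join step is needed: each parallel pipeline can only produce a partial sum and count for its partition, and the true average exists only after those partials are merged (partition contents are made up):

def partial_sum_cnt(partition):
    # Runs independently on each parallel pipeline.
    total = cnt = 0
    for value in partition:
        total += value
        cnt += 1
    return total, cnt

def global_avg(partials):
    # The join: combine per-partition (Sum, Cnt) pairs into one Avg.
    total = sum(s for s, _ in partials)
    cnt = sum(c for _, c in partials)
    return total / cnt if cnt else None

partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
print(global_avg([partial_sum_cnt(p) for p in partitions]))   # 3.5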

Parallelization complicates fault-tolerance

• Aggregations:
  – Need to join results across parallel computations

[Figure: parallel CtoF → Filter pipelines with per-partition Sum and Cnt feeding a global Avg; one worker blocks]

Parallelization complicates fault-tolerance

• Can we ensure exactly-once semantics?

[Figure: the same parallel pipeline, again with one worker blocked]

Can parallelize joins

• Compute trending keywords
  – E.g., tweets

[Figure: each worker computes Sum / key over its portion of tweets; the per-partition results are then merged and sorted for the top-k (one stage marked “- blocks -”)]

Can parallelize joins

[Figure: hash-partitioned tweets; each partition computes Sum / key, followed by a Sort top-k stage: 1. merge, 2. sort, 3. top-k]
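A sketch of the partitioned computation in the figures: keywords are hash-partitioned so every occurrence of a word reaches the same worker, each worker keeps a Sum per key, and the per-partition results are then merged and sorted for the top-k (the partition count and sample data are made up):

from collections import Counter
import heapq

NUM_PARTITIONS = 3

def hash_partition(words):
    # All occurrences of a keyword land in the same partition.
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for w in words:
        parts[hash(w) % NUM_PARTITIONS].append(w)
    return parts

def sum_per_key(part):
    return Counter(part)                      # per-partition Sum / key

def trending(per_partition_counts, k=2):
    merged = Counter()                        # 1. merge
    for c in per_partition_counts:
        merged.update(c)
    return heapq.nlargest(k, merged.items(),  # 2. sort, 3. top-k
                          key=lambda kv: kv[1])

tweets = "spark storm flink spark storm spark".split()
print(trending([sum_per_key(p) for p in hash_partition(tweets)]))
# [('spark', 3), ('storm', 2)]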

Parallelization complicates fault-tolerance

[Figure: hash-partitioned tweets; each partition computes Sum / key, followed by merge, sort, and top-k]

A Tale of Four Frameworks

1. Record acknowledgement (Storm)
2. Micro-batches (Spark Streaming, Storm Trident)
3. Transactional updates (Google Cloud Dataflow)
4. Distributed snapshots (Flink)

Apache Storm

• Architectural components
  – Data: streams of tuples, e.g., Tweet = ⟨…⟩
  – Sources of data: “spouts”
  – Operators to process data: “bolts”
  – Topology: Directed graph of spouts & bolts

Apache Storm: Parallelization

• Multiple processes (tasks) run per bolt

• Incoming streams split among tasks
  – Shuffle Grouping: Round-robin distribute tuples to tasks
  – Fields Grouping: Partitioned by key / field
  – All Grouping: All tasks receive all tuples (e.g., for joins)
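A toy model (not Storm's API) of how a grouping decides which task of a downstream bolt receives each tuple, contrasting shuffle and fields grouping; the task count and field name are illustrative:

import itertools

NUM_TASKS = 4
_round_robin = itertools.cycle(range(NUM_TASKS))

def shuffle_grouping(tup):
    # Round-robin: spreads load evenly; fine for stateless bolts.
    return next(_round_robin)

def fields_grouping(tup, field="word"):
    # Partition by a field: tuples with the same key always reach the same task,
    # so that task can hold the per-key state (counts, windows, ...).
    return hash(tup[field]) % NUM_TASKS

t = {"word": "spark", "count": 1}
print(shuffle_grouping(t), fields_grouping(t))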

Fault tolerance via record acknowledgement
(Apache Storm -- at-least-once semantics)

• Goal: Ensure each input is “fully processed”

• Approach: DAG / tree edge tracking
  – Record edges that get created as a tuple is processed
  – Wait for all edges to be marked done
  – Inform the source (spout) of the data when complete; otherwise, the spout resends the tuple

• Challenge: “at least once” means:
  – Bolts can receive a tuple > once
  – Replay can be out-of-order
  – ... the application needs to handle this

Fault tolerance via record acknowledgement
(Apache Storm -- at-least-once semantics)

• Spout assigns a new unique ID to each tuple

• When a bolt “emits” a dependent tuple, it informs the system of the dependency (new edge)

• When a bolt finishes processing a tuple, it calls ACK (or can FAIL)

• Acker tasks:
  – Keep track of all emitted edges and receive ACK/FAIL messages from bolts
  – When messages have been received about all edges in the graph, inform the originating spout

• Spout garbage-collects the tuple or retransmits it

• Note: Best-effort delivery achieved by not generating dependencies on downstream tuples
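A toy model of the bookkeeping just described, with the acker tracking outstanding edges per root tuple ID; this is a conceptual sketch, not Storm's actual acker protocol (the real acker also handles timeouts and uses a more compact encoding):

class Acker:
    def __init__(self):
        self.pending = {}                     # root tuple ID -> outstanding edge IDs

    def add_edge(self, root_id, edge_id):
        # Called when the spout emits the root tuple, or a bolt emits a dependent tuple.
        self.pending.setdefault(root_id, set()).add(edge_id)

    def ack(self, root_id, edge_id):
        # Called when a bolt finishes processing the tuple behind this edge.
        edges = self.pending.get(root_id, set())
        edges.discard(edge_id)
        if edges:
            return "pending"
        self.pending.pop(root_id, None)       # all edges done: "fully processed";
        return "fully processed"              # the spout may garbage-collect the tuple

acker = Acker()
acker.add_edge("t1", "spout->split")          # spout emits root tuple t1
acker.add_edge("t1", "split->count")          # a bolt emits a dependent tuple
print(acker.ack("t1", "spout->split"))        # pending
print(acker.ack("t1", "split->count"))        # fully processed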

Spark Streaming: Discretized Stream Processing

• Split stream into a series of small, atomic batch jobs (each of X seconds)

• Process each individual batch using the Spark “batch” framework
  – Akin to in-memory MapReduce

• Emit each micro-batch result
  – RDD = “Resilient Distributed Dataset”

[Figure: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

Apache Spark Streaming: Dataflow-oriented programming

# Create a local StreamingContext with batch interval of 1 second
# (assumes an existing SparkContext sc; StreamingContext comes from pyspark.streaming)
ssc = StreamingContext(sc, 1)

# Create a DStream that reads from a network socket
lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(" "))  # Split each line into words

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

Apache Spark Streaming: Dataflow-oriented programming

# Create a local StreamingContext with batch interval of 1 second
ssc = StreamingContext(sc, 1)

# Create a DStream that reads from a network socket
lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(" "))  # Split each line into words

# Count each word over a sliding window: a 3-second window that slides every 2 seconds.
# The second lambda is the "inverse reduce": counts of the batch leaving the window
# are subtracted instead of recounting the whole window.
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKeyAndWindow(
    lambda x, y: x + y, lambda x, y: x - y, 3, 2)

wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

Fault tolerance via micro-batches
(Apache Spark Streaming, Storm Trident)

• Can build on batch frameworks (Spark) and tuple-by-tuple (Storm)
  – Tradeoff between throughput (higher) and latency (higher)

• Each micro-batch may succeed or fail
  – Original inputs are replicated (memory, disk)
  – At failure, the latest micro-batch can simply be recomputed (trickier if stateful)

• DAG is a graph of transformations from micro-batch to micro-batch
  – Lineage info in each RDD specifies how it was generated from other RDDs

• To support failure recovery:
  – Occasionally checkpoint RDDs (state) by replicating to other nodes
  – To recover: another worker (1) gets the last checkpoint, (2) determines the upstream dependencies, then (3) starts recomputing using those upstream dependencies starting at the checkpoint (downstream might filter)
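A toy sketch of lineage-based recovery: each micro-batch records how it was derived from its parent, so if its data is lost it can be recomputed from the (replicated) inputs rather than restored from a full copy (the class and the transformation are illustrative):

class Batch:
    # A micro-batch that remembers its lineage: parent batch + transformation.
    def __init__(self, data=None, parent=None, transform=None):
        self.data, self.parent, self.transform = data, parent, transform

    def compute(self):
        if self.data is None:                      # lost: recompute via lineage
            self.data = [self.transform(x) for x in self.parent.compute()]
        return self.data

inputs = Batch(data=[1.0, 2.0, 3.0])               # original inputs are replicated
ctof = Batch(parent=inputs, transform=lambda c: c * 9 / 5 + 32)

print(ctof.compute())      # [33.8, 35.6, 37.4]
ctof.data = None           # simulate losing this micro-batch on a failed worker
print(ctof.compute())      # recomputed from lineage: same result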

Fault Tolerance via transactional updates
(Google Cloud Dataflow)

• For each intermediate record at an operator:
  – Create a commit record including the input record, the state update, and the derived downstream records generated
  – Write the commit record to a transactional log / DB

• On failure, replay the log to:
  – Restore a consistent state of the computation
  – Replay lost records (further downstream might filter)

• Requires: High-throughput writes to a distributed store

Fault Tolerance via distributed snapshots
(Apache Flink)

• Computation is a long-running DAG of continuous operators

• Rather than log each record for each operator, take system-wide snapshots

• Snapshotting:
  – Determine a consistent snapshot of system-wide state (includes in-flight records and operator state)
  – Store state in durable storage

• Recovery:
  – Restore the latest snapshot from durable storage
  – Rewind the stream source to the snapshot point, and replay inputs

• Algorithm is based on Chandy-Lamport distributed snapshots, but also captures stream topology
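A conceptual sketch of the per-record transaction: before results flow downstream, the operator writes one commit record capturing the input, its state update, and the derived outputs, so replaying the log restores a consistent state (the in-memory list stands in for the transactional log / DB, and the counting operator is made up):

import json

log = []                                           # stand-in for a transactional log / DB

def process(state, record):
    key = record["key"]
    new_count = state.get(key, 0) + 1              # state update
    outputs = [{"key": key, "count": new_count}]   # derived downstream records

    commit = {"input": record,
              "state_update": {key: new_count},
              "outputs": outputs}
    log.append(json.dumps(commit))                 # write the commit record first...
    state[key] = new_count                         # ...then apply it and emit outputs
    return outputs

def recover():
    # Replay the log to restore a consistent state of the computation.
    state = {}
    for entry in log:
        state.update(json.loads(entry)["state_update"])
    return state

s = {}
process(s, {"key": "visa"})
process(s, {"key": "visa"})
print(recover())                                   # {'visa': 2}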

Fault Tolerance via distributed snapshots
(Apache Flink)

• Use markers (barriers) in the input data stream to tell downstream operators when to consistently snapshot
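A toy sketch of how a barrier works at one operator: the operator waits until the barrier for snapshot n has arrived on every input channel, snapshots its local state, and forwards the barrier downstream (a simplification; real implementations also buffer records on already-aligned channels):

class Operator:
    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.state = 0                       # e.g., a running count
        self.aligned = set()                 # input channels whose barrier has arrived
        self.snapshots = {}                  # snapshot id -> copy of local state

    def on_record(self, channel, value):
        self.state += value                  # normal processing between barriers

    def on_barrier(self, channel, snapshot_id):
        self.aligned.add(channel)
        if len(self.aligned) < self.num_inputs:
            return None                      # still waiting for other channels
        self.snapshots[snapshot_id] = self.state    # consistent local snapshot
        self.aligned.clear()
        return ("barrier", snapshot_id)      # forward the barrier downstream

op = Operator(num_inputs=2)
op.on_record(0, 5); op.on_record(1, 7)
print(op.on_barrier(0, snapshot_id=1))       # None: waiting for channel 1
print(op.on_barrier(1, snapshot_id=1))       # ('barrier', 1); state 12 snapshotted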

Optimizing stream processing

• Keeping the system performant:
  – Careful optimizations of the DAG
  – Scheduling: Choice of parallelization, use of resources
  – Where to place computation
  – …

• Often, many queries and systems use the same cluster concurrently: “Multi-tenancy”

Wednesday lecture

Cluster Scheduling

