Large-Scale Stream Processing in the Hadoop Ecosystem

Gyula Fóra, Márton Balassi

This talk

§ Stream processing by example
§ Open source stream processors
§ Runtime architecture and programming model
§ Counting words…
§ Fault tolerance and stateful processing
§ Closing

2015-09-28 Apache: Big Data Europe

Stream processing by example

Streaming applications

ETL style operations
• Filter incoming data, Log analysis
• High throughput, connectors, at-least-once processing

Window aggregations
• Trending tweets, User sessions, Stream joins
• Window abstractions

Streaming applications

Machine learning
• Fitting trends to the evolving stream, Stream clustering
• Model state, cyclic flows

Pattern recognition
• Fraud detection, Triggering signals based on activity
• Exactly-once processing

Open source stream processors

Apache Streaming landscape

Apache Storm

§ Started in 2010, development driven by BackType, then Twitter
§ Pioneer in large scale stream processing
§ Distributed dataflow abstraction (spouts & bolts)

Apache Flink

§ Started in 2008 as a research project (Stratosphere) at European universities
§ Unique combination of low latency streaming and high throughput batch analysis
§ Flexible operator states and windowing

Stream data: Kafka, RabbitMQ, ...

Batch data: HDFS, JDBC, ...

Apache Spark

§ Started in 2009 at UC Berkeley, Apache since 2013
§ Very strong community, wide adoption
§ Unified batch and stream processing over a batch runtime
§ Good integration with batch programs

Apache Samza

§ Developed at LinkedIn, open sourced in 2013
§ Builds heavily on Kafka’s log based philosophy
§ Pluggable messaging system and execution backend

System comparison

                    Storm               Spark               Samza               Flink
Streaming model     Native              Micro-batching      Native              Native
API                 Compositional       Declarative         Compositional       Declarative
Fault tolerance     Record ACKs         RDD-based           Log-based           Checkpoints
Guarantee           At-least-once       Exactly-once        At-least-once       Exactly-once
State               Only in Trident     State as DStream    Stateful operators  Stateful operators
Windowing           Not built-in        Time based          Not built-in        Policy based
Latency             Very-Low            Medium              Low                 Low
Throughput          Medium              High                High                High

Runtime and programming model

Native Streaming

Distributed dataflow runtime

§ Storm, Samza and Flink
§ General properties
• Long standing operators
• Pipelined execution
• Usually possible to create cyclic flows

Pros
• Full expressivity
• Low-latency execution
• Stateful operators

Cons
• Fault-tolerance is hard
• Throughput may suffer
• Load balancing is an issue

Distributed dataflow runtime

§ Storm
• Dynamic typing + Kryo
• Dynamic topology rebalancing
§ Samza
• Almost every component pluggable
• Full task isolation, no backpressure (buffering handled by the messaging layer)
§ Flink
• Strongly typed streams + custom serializers
• Flow control mechanism
• Memory management

Micro-batching

Micro-batch runtime

§ Implemented by Apache Spark
§ General properties
• Computation broken down to time intervals
• Load aware scheduling
• Easy interaction with batch

Pros
• Easy to reason about
• High-throughput
• FT comes for “free”
• Dynamic load balancing

Cons
• Latency depends on batch size
• Limited expressivity
• Stateless by nature
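The idea of breaking computation down into time intervals can be sketched in plain Java. This is only an illustration under stated assumptions, not Spark's scheduler: the class `MicroBatcher`, the method `discretize` and the timestamps are hypothetical. Each record is assigned to the interval its timestamp falls into, and every interval is then handed to a batch computation as a whole, which is why latency is bounded below by the batch size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of discretizing a stream into micro-batches.
class MicroBatcher {
    // Group records by the start of the batch interval their timestamp falls into.
    static Map<Long, List<String>> discretize(long[] timestampsMs, String[] records,
                                              long batchIntervalMs) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (int i = 0; i < records.length; i++) {
            long batchStart = (timestampsMs[i] / batchIntervalMs) * batchIntervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(records[i]);
        }
        return batches;
    }

    public static void main(String[] args) {
        long[] ts = {0, 400, 999, 1000, 2500};
        String[] records = {"a", "b", "c", "d", "e"};
        // With a 1000 ms interval, "a" is not processed before its batch closes.
        System.out.println(discretize(ts, records, 1000));
    }
}
```

A smaller interval lowers latency but produces more, smaller batches to schedule.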

Programming model

Compositional
§ Offer basic building blocks for composing custom operators and topologies
§ Advanced behavior such as windowing is often missing
§ Topology needs to be hand-optimized

Declarative
§ Expose a high-level API
§ Operators are higher order functions on abstract data stream types
§ Advanced behavior such as windowing is supported
§ Query optimization

Programming model

DStream, DataStream
§ Transformations abstract operator details
§ Suitable for engineers and data analysts

Spout, Consumer, Bolt, Task, Topology
§ Direct access to the execution graph / topology
• Suitable for engineers

Counting words…

WordCount

Input
storm budapest flink
apache storm spark
streaming samza storm
flink apache flink
bigdata storm
flink streaming

Output
(storm, 4)
(budapest, 1)
(flink, 4)
(apache, 2)
(spark, 1)
(streaming, 2)
(samza, 1)
(bigdata, 1)

Storm

Assembling the topology

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new SentenceSpout(), 5);
    builder.setBolt("split", new Splitter(), 8).shuffleGrouping("spout");
    builder.setBolt("count", new Counter(), 12)
           .fieldsGrouping("split", new Fields("word"));

Rolling word count bolt

    public class Counter extends BaseBasicBolt {
        // Running count per word, kept in memory by this bolt instance
        Map<String, Integer> counts = new HashMap<>();

        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }
    }

Samza

Rolling word count task

    public class WordCountTask implements StreamTask {
        private KeyValueStore<String, Integer> store;

        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
            String word = (String) envelope.getMessage();
            Integer count = store.get(word);
            if (count == null) { count = 0; }
            // Increment first so that the stored and the emitted count agree
            count = count + 1;
            store.put(word, count);
            collector.send(new OutgoingMessageEnvelope(
                new SystemStream("kafka", "wc"), Tuple2.of(word, count)));
        }
    }

Flink

    case class Word(word: String, frequency: Int)

Rolling word count

    val lines: DataStream[String] = env.fromSocketStream(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .groupBy("word").sum("frequency")
      .print()

Window word count

    val lines: DataStream[String] = env.fromSocketStream(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
      .groupBy("word").sum("frequency")
      .print()
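What the window word count computes (a 5 second window evaluated every second) can be sketched without any framework. The following is a minimal plain-Java model, not Flink code: the class `SlidingWindowCount`, the sample timestamps and the choice of a half-open window `[end - size, end)` are assumptions made for illustration.

```java
// Hypothetical model of a sliding time window count.
class SlidingWindowCount {
    // Count occurrences of `word` among events with timestamp in [windowEnd - sizeSec, windowEnd).
    static int count(long[] timestampsSec, String[] words, String word,
                     long windowEnd, long sizeSec) {
        int n = 0;
        for (int i = 0; i < words.length; i++) {
            boolean inWindow = timestampsSec[i] >= windowEnd - sizeSec
                    && timestampsSec[i] < windowEnd;
            if (inWindow && words[i].equals(word)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        long[] ts = {0, 1, 2, 6, 7};
        String[] words = {"flink", "storm", "flink", "flink", "storm"};
        // One result per slide: the same event contributes to several overlapping windows.
        for (long end = 5; end <= 8; end++) {
            System.out.println("window ending at " + end + "s: flink="
                    + count(ts, words, "flink", end, 5));
        }
    }
}
```

A real engine maintains the window incrementally instead of rescanning all events; the rescan here only keeps the sketch short.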

Spark

Window word count

Rolling word count (kind of)

Fault tolerance and stateful processing

Fault tolerance intro

§ Fault-tolerance in streaming systems is inherently harder than in batch
• Can’t just restart computation
• State is a problem
• Fast recovery is crucial
• Streaming topologies run 24/7 for a long period
§ Fault-tolerance is a complex issue
• No single point of failure is allowed
• Guaranteeing input processing
• Consistent operator state
• Fast recovery
• At-least-once vs Exactly-once semantics

Storm record acknowledgements

§ Track the lineage of tuples as they are processed (anchors and acks)
§ Special “acker” bolts track each lineage DAG (efficient xor based algorithm)
§ Replay the root of failed (or timed out) tuples
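The xor based tracking can be sketched as follows. This is a simplified single-process model and not Storm's acker implementation; the class `XorAckTracker` and its methods are hypothetical. The tracker keeps one 64-bit value per root tuple and XORs in every tuple id once when the tuple is anchored and once when it is acked. Since `x ^ x == 0`, the value returns to zero exactly when every anchored tuple has been acked (ignoring the very unlikely case of random ids cancelling by chance). The sketch assumes that children are anchored before their parent is acked.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of XOR-based lineage tracking.
class XorAckTracker {
    // One 64-bit checksum per root tuple id.
    private final Map<Long, Long> pending = new HashMap<>();

    // Called when a tuple is emitted into the tuple tree of rootId.
    void anchor(long rootId, long tupleId) {
        pending.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    // Called when a tuple is acked; XOR-ing the same id a second time cancels it out.
    // Returns true once the whole tuple tree has been processed.
    boolean ack(long rootId, long tupleId) {
        long value = pending.merge(rootId, tupleId, (a, b) -> a ^ b);
        if (value == 0L) {
            pending.remove(rootId);
            return true;
        }
        return false;
    }

    // A root that stays pending past a timeout would be replayed by the spout.
    boolean isPending(long rootId) {
        return pending.containsKey(rootId);
    }

    public static void main(String[] args) {
        XorAckTracker tracker = new XorAckTracker();
        tracker.anchor(1L, 1L);   // root tuple
        tracker.anchor(1L, 2L);   // two child tuples
        tracker.anchor(1L, 4L);
        tracker.ack(1L, 1L);
        tracker.ack(1L, 2L);
        System.out.println("fully processed: " + tracker.ack(1L, 4L));
    }
}
```

The memory cost is constant per root tuple, regardless of how large the tuple tree grows.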

Samza offset tracking

§ Exploits the properties of a durable, offset based messaging layer
§ Each task maintains its current offset, which moves forward as it processes elements
§ The offset is checkpointed and restored on failure (some messages might be repeated)
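Why offset checkpointing gives at-least-once rather than exactly-once processing can be seen in a small plain-Java model. It is a sketch under assumptions, not Samza code: `OffsetCheckpointTask` and its methods are hypothetical, and an in-memory list stands in for the durable log. Messages processed after the last checkpoint are processed again after recovery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of offset-based recovery over a durable, replayable log.
class OffsetCheckpointTask {
    private final List<String> log;        // stands in for a durable partition
    private int checkpointedOffset = 0;    // survives a failure
    private int currentOffset = 0;         // in-memory position, lost on failure
    final List<String> processed = new ArrayList<>();

    OffsetCheckpointTask(List<String> log) {
        this.log = log;
    }

    // Process up to n messages starting from the current offset.
    void process(int n) {
        for (int i = 0; i < n && currentOffset < log.size(); i++) {
            processed.add(log.get(currentOffset));
            currentOffset++;
        }
    }

    void checkpoint() {
        checkpointedOffset = currentOffset;
    }

    // Simulated crash and restart: resume from the last checkpointed offset.
    void failAndRestore() {
        currentOffset = checkpointedOffset;
    }

    public static void main(String[] args) {
        OffsetCheckpointTask task = new OffsetCheckpointTask(Arrays.asList("a", "b", "c", "d"));
        task.process(2);
        task.checkpoint();
        task.process(1);        // "c" is processed but not yet checkpointed
        task.failAndRestore();
        task.process(2);        // "c" is processed a second time
        System.out.println(task.processed);
    }
}
```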

Flink checkpointing

§ Based on consistent global snapshots
§ Algorithm designed for stateful dataflows (minimal runtime overhead)
§ Exactly-once semantics
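The property a consistent snapshot provides can be illustrated with a single-operator model in plain Java. It is not the distributed snapshot algorithm itself and not Flink code; `SnapshotCounter` and its methods are hypothetical. The point of the sketch is that the input position and the operator state are saved together, so after recovery every record is reflected in the state exactly once.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-operator model: input offset and state are snapshotted together.
class SnapshotCounter {
    private final List<String> log;
    private int offset = 0;
    private Map<String, Integer> counts = new HashMap<>();
    private int snapshotOffset = 0;
    private Map<String, Integer> snapshotCounts = new HashMap<>();

    SnapshotCounter(List<String> log) {
        this.log = log;
    }

    void process(int n) {
        for (int i = 0; i < n && offset < log.size(); i++) {
            counts.merge(log.get(offset), 1, Integer::sum);
            offset++;
        }
    }

    // Save a consistent pair: the state as of exactly this input position.
    void checkpoint() {
        snapshotOffset = offset;
        snapshotCounts = new HashMap<>(counts);
    }

    // Roll back both the input position and the state to the last snapshot.
    void failAndRestore() {
        offset = snapshotOffset;
        counts = new HashMap<>(snapshotCounts);
    }

    int count(String word) {
        return counts.getOrDefault(word, 0);
    }

    public static void main(String[] args) {
        SnapshotCounter counter =
                new SnapshotCounter(Arrays.asList("storm", "flink", "storm", "storm"));
        counter.process(2);
        counter.checkpoint();
        counter.process(1);
        counter.failAndRestore();
        counter.process(2);
        System.out.println("storm=" + counter.count("storm") + " flink=" + counter.count("flink"));
    }
}
```

Records replayed after the failure do not double count, because the state they had already changed was rolled back with the offset.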

Spark RDD recomputation

§ Immutable data model with repeatable computation
§ Failed RDDs are recomputed using their lineage
§ Checkpoint RDDs to reduce lineage length
§ Parallel recovery of failed RDDs
§ Exactly-once semantics
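Recomputation from lineage can be sketched with a tiny plain-Java stand-in for an immutable dataset. This is an illustration only, not Spark's RDD API: `LineageDataset`, `source`, `map`, `collect` and `lose` are hypothetical names. Each dataset remembers its parent and the deterministic function that derives it, so a lost result can be rebuilt on demand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for an immutable dataset that remembers how it was derived.
class LineageDataset {
    private final LineageDataset parent;                 // null for a source dataset
    private final Function<Integer, Integer> transform;  // deterministic per-element function
    private final List<Integer> sourceData;              // only set for a source dataset
    private List<Integer> materialized;                  // may be lost on failure
    int computations = 0;

    private LineageDataset(LineageDataset parent, Function<Integer, Integer> transform,
                           List<Integer> sourceData) {
        this.parent = parent;
        this.transform = transform;
        this.sourceData = sourceData;
    }

    static LineageDataset source(List<Integer> data) {
        return new LineageDataset(null, null, data);
    }

    LineageDataset map(Function<Integer, Integer> f) {
        return new LineageDataset(this, f, null);
    }

    // Return the materialized result, recomputing it from the lineage if it was lost.
    List<Integer> collect() {
        if (materialized == null) {
            computations++;
            List<Integer> out = new ArrayList<>();
            if (parent == null) {
                out.addAll(sourceData);
            } else {
                for (Integer x : parent.collect()) {
                    out.add(transform.apply(x));
                }
            }
            materialized = out;
        }
        return materialized;
    }

    // Simulate losing the computed partition.
    void lose() {
        materialized = null;
    }

    public static void main(String[] args) {
        LineageDataset doubled = source(Arrays.asList(1, 2, 3)).map(x -> x * 2);
        LineageDataset plusOne = doubled.map(x -> x + 1);
        System.out.println(plusOne.collect());
        plusOne.lose();
        System.out.println(plusOne.collect());  // rebuilt from its parent
    }
}
```

Keeping a materialized copy of an intermediate dataset plays the role of a checkpoint here: recomputation stops at the first ancestor that is still available.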

State in streaming programs

§ Almost all non-trivial streaming programs are stateful

§ Stateful operators (in essence): f: in, state ⟶ out, state

§ State hangs around and can be read and modified as the stream evolves

§ Goal: Get as close as possible to this model while maintaining scalability and fault-tolerance
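The stateful operator signature, a function from an input and the current state to an output and the next state, can be written down directly. The following is a minimal plain-Java sketch, not any framework's API: `StatefulFunction`, `Result` and `RollingCount` are hypothetical names. It expresses the rolling word count as such a function.

```java
import java.util.HashMap;
import java.util.Map;

// f: (in, state) -> (out, state'), written as an explicit interface.
interface StatefulFunction<I, S, O> {
    Result<O, S> apply(I input, S state);
}

// Pair of the emitted output and the updated state.
final class Result<O, S> {
    final O output;
    final S state;

    Result(O output, S state) {
        this.output = output;
        this.state = state;
    }
}

class RollingCount {
    // Rolling word count: the state is the map of counts seen so far.
    static final StatefulFunction<String, Map<String, Integer>, Integer> COUNT =
        (word, counts) -> {
            Map<String, Integer> next = new HashMap<>(counts);  // leave the old state untouched
            int count = next.merge(word, 1, Integer::sum);
            return new Result<>(count, next);
        };

    public static void main(String[] args) {
        Map<String, Integer> state = new HashMap<>();
        for (String word : new String[] {"storm", "flink", "storm"}) {
            Result<Integer, Map<String, Integer>> result = COUNT.apply(word, state);
            state = result.state;
            System.out.println("(" + word + ", " + result.output + ")");
        }
    }
}
```

Copying the state on every call keeps the sketch purely functional; a real system would update state in place and rely on its fault tolerance mechanism for consistency.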

Storm

§ States available only in Trident API
§ Dedicated operators for state updates and queries
§ State access methods
• stateQuery(…)
• partitionPersist(…)
• persistentAggregate(…)
§ Exactly-once guarantee
§ It’s very difficult to implement transactional states

Spark

§ Stateless runtime by design
• No continuous operators
• UDFs are assumed to be stateless
§ State can be generated as a separate stream of RDDs: updateStateByKey(…)

f: Seq[in_k], state_k ⟶ state'_k

§ f is scoped to a specific key
§ Exactly-once semantics
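The per-key update described above can be modelled in plain Java. This is a sketch of the semantics, not Spark's API: the class `MicroBatchState` and its helper are hypothetical, and as a simplification only keys that appear in the batch are updated. For each key, the update function receives all values of the current batch for that key together with the previous state of the key, and returns the new state. The previous state map is never modified, which mirrors state being produced as a new immutable dataset per batch.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical model of a per-key state update applied to one micro-batch.
class MicroBatchState {
    // Apply update(values of key k in this batch, previous state of k) for every key in the batch.
    static <K, V, S> Map<K, S> updateStateByKey(List<Map.Entry<K, V>> batch,
                                                Map<K, S> previous,
                                                BiFunction<List<V>, S, S> update) {
        // Group the values of the batch by key.
        Map<K, List<V>> grouped = new HashMap<>();
        for (Map.Entry<K, V> record : batch) {
            grouped.computeIfAbsent(record.getKey(), k -> new ArrayList<>())
                   .add(record.getValue());
        }
        // Build a new state map; the previous one stays untouched.
        Map<K, S> next = new HashMap<>(previous);
        for (Map.Entry<K, List<V>> entry : grouped.entrySet()) {
            K key = entry.getKey();
            next.put(key, update.apply(entry.getValue(), previous.get(key)));
        }
        return next;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> batch = new ArrayList<>();
        batch.add(new AbstractMap.SimpleEntry<>("flink", 1));
        batch.add(new AbstractMap.SimpleEntry<>("storm", 1));
        batch.add(new AbstractMap.SimpleEntry<>("flink", 1));
        Map<String, Integer> previous = new HashMap<>();
        previous.put("storm", 3);
        // Rolling count: add the number of new occurrences to the previous count.
        Map<String, Integer> next = updateStateByKey(batch, previous,
                (values, state) -> (state == null ? 0 : state) + values.size());
        System.out.println(next);
    }
}
```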

Samza

§ Stateful dataflow operators (Any task can hold state)
§ State changes are stored as a log by Kafka
§ Custom storage engines can be plugged in to the log
§ f is scoped to a specific task
§ At-least-once processing semantics
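Storing state changes as a log can be sketched in plain Java. This is an illustrative model, not Samza's `KeyValueStore` or Kafka: the class `ChangelogStore` is hypothetical and a shared in-memory list stands in for the durable changelog. Every write is appended to the log before it is applied locally, and a restarted task rebuilds its local store by replaying the log.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of task-local state backed by a replayable changelog.
class ChangelogStore {
    private final Map<String, Integer> local = new HashMap<>();
    private final List<Map.Entry<String, Integer>> changelog;  // stands in for a durable log

    // Opening a store replays the changelog to rebuild the local state.
    ChangelogStore(List<Map.Entry<String, Integer>> changelog) {
        this.changelog = changelog;
        for (Map.Entry<String, Integer> change : changelog) {
            local.put(change.getKey(), change.getValue());
        }
    }

    // Every write goes to the log first, then to the local store.
    void put(String key, Integer value) {
        changelog.add(new AbstractMap.SimpleEntry<>(key, value));
        local.put(key, value);
    }

    Integer get(String key) {
        return local.get(key);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> log = new ArrayList<>();
        ChangelogStore store = new ChangelogStore(log);
        store.put("storm", 1);
        store.put("storm", 2);
        // A restarted task sees the same state after replaying the log.
        ChangelogStore restored = new ChangelogStore(log);
        System.out.println("storm=" + restored.get("storm"));
    }
}
```

Because only the latest value per key matters for recovery, such a log can be compacted without changing the restored state.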

Flink

§ Stateful dataflow operators (conceptually similar to Samza)
§ Two state access patterns
• Local (Task) state
• Partitioned (Key) state
§ Proper API integration
• Java: OperatorState interface
• Scala: mapWithState, flatMapWithState…
§ Exactly-once semantics by checkpointing

Performance

§ Throughput/Latency
• The cost of a network hop is 25+ msecs
• 1 million records/sec/core is nice
§ Size of Network Buffers/Batching
§ Buffer Timeout
§ Cost of Fault Tolerance
§ Operator chaining/Stages
§ Serialization/Types

Closing

Comparison revisited

                    Storm               Spark               Samza               Flink
Streaming model     Native              Micro-batching      Native              Native
API                 Compositional       Declarative         Compositional       Declarative
Fault tolerance     Record ACKs         RDD-based           Log-based           Checkpoints
Guarantee           At-least-once       Exactly-once        At-least-once       Exactly-once
State               Only in Trident     State as DStream    Stateful operators  Stateful operators
Windowing           Not built-in        Time based          Not built-in        Policy based
Latency             Very-Low            Medium              Low                 Low
Throughput          Medium              High                High                High

Summary

§ Streaming applications and stream processors are very diverse
§ 2 main runtime designs
• Dataflow based (Storm, Samza, Flink)
• Micro-batch based (Spark)
§ The best framework varies based on application specific needs
§ But high-level APIs are nice :)

Thank you!

List of Figures (in order of usage)

§ https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/CPT-FSM-abcd.svg/326px-CPT-FSM-abcd.svg.png
§ https://storm.apache.org/images/topology.png
§ https://databricks.com/wp-content/uploads/2015/07/image11-1024x655.png
§ https://databricks.com/wp-content/uploads/2015/07/image21-1024x734.png
§ https://people.csail.mit.edu/matei/papers/2012/hotcloud_spark_streaming.pdf, page 2.
§ http://www.slideshare.net/ptgoetz/storm-hadoop-summit2014, page 69-71.
§ http://samza.apache.org/img/0.9/learn/documentation/container/checkpointing.svg
§ https://databricks.com/wp-content/uploads/2015/07/image41-1024x602.png
§ https://storm.apache.org/documentation/images/spout-vs-state.png
§ http://samza.apache.org/img/0.9/learn/documentation/container/stateful_job.png
