DSP Frameworks DSP Frameworks We Consider

Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica DSP Frameworks Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini DSP frameworks we consider • Apache Storm (with lab) • Twitter Heron – From Twitter as Storm and compatible with Storm • Apache Spark Streaming (lab) – Reduce the size of each stream and process streams of data (micro-batch processing) • Apache Flink • Apache Samza • Cloud-based frameworks – Google Cloud Dataflow – Amazon Kinesis Streams Valeria Cardellini - SABD 2017/18 1 Apache Storm • Apache Storm – Open-source, real-time, scalable streaming system – Provides an abstraction layer to execute DSP applications – Initially developed by Twitter • Topology – DAG of spouts (sources of streams) and bolts (operators and data sinks) Valeria Cardellini - SABD 2017/18 2 Stream grouping in Storm • Data parallelism in Storm: how are streams partitioned among multiple tasks (threads of execution)? • Shuffle grouping – Randomly partitions the tuples • Field grouping – Hashes on a subset of the tuple attributes Valeria Cardellini - SABD 2017/18 3 Stream grouping in Storm • All grouping (i.e., broadcast) – Replicates the entire stream to all the consumer tasks • Global grouping – Sends the entire stream to a single task of a bolt • Direct grouping – The producer of the tuple decides which task of the consumer will receive this tuple Valeria Cardellini - SABD 2017/18 4 Storm architecture • Master-worker architecture Valeria Cardellini - SABD 2017/18 5 Storm components: Nimbus and Zookeeper • Nimbus – The master node – Clients submit topologies to it – Responsible for distributing and coordinating the topology execution • Zookeeper – Nimbus uses a combination of the local disk(s) and Zookeeper to store state about the topology Valeria Cardellini - SABD 2017/18 6 Storm components: worker • Task: operator instance – The actual work for a bolt or a spout is done in the task • Executor: smallest schedulable entity – Execute one or more tasks related to same operator • Worker process: Java process running one or more executors worker process • Worker node: computing JAVA PROCESS executor executor resource, a container for THREAD THREAD one or more worker processes task task task task task Valeria Cardellini - SABD 2017/18 7 Storm components: supervisor • Each worker node runs a supervisor The supervisor: • receives assignments from Nimbus (through ZooKeeper) and spawns workers based on the assignment • sends to Nimbus (through ZooKeeper) a periodic heartbeat; • advertises the topologies that they are currently running, and any vacancies that are available to run more topologies Valeria Cardellini - SABD 2017/18 8 Twitter Heron • Realtime, distributed, fault-tolerant stream processing engine from Twitter • Developed as direct successor of Storm – Released as open source in 2016 https://twitter.github.io/heron/ – De facto stream data processing engine inside Twitter • Goal of overcoming Storm’s performance, reliability, and other shortcomings • Compatibility with Storm – API compatible with Storm: no code change is required for migration Valeria Cardellini - SABD 2017/18 9 Heron: in common with Storm • Same terminology of Storm – Topology, spout, bolt • Same stream groupings – Shuffle, fields, all, global • Example: WordCount topology Valeria Cardellini - SABD 2017/18 10 Heron: design goals • Isolation – Process-based topologies rather than thread-based – Each process should run in isolation (easy debugging, profiling, and troubleshooting) – Goal: overcoming Storm’s performance, reliability, and other shortcomings • Resource constraints – Safe to run in shared infrastructure: topologies use only initially allocated resources and never exceed bounds • Compatibility – Fully API and data model compatible with Storm Valeria Cardellini - SABD 2017/18 11 Heron: design goals • Backpressure – Built-in rate control mechanism to ensure that topologies can self-adjust in case components lag – Heron dynamically adjusts the rate at which data flows through the topology using backpressure • Performance – Higher throughput and lower latency than Storm – Enhanced configurability to fine-tune potential latency/throughput trade-offs • Semantic guarantees – Support for both at-most-once and at-least-once processing semantics • Efficiency – Minimum possible resource usage Valeria Cardellini - SABD 2017/18 12 Heron topology architecture • Master-work architecture • One Topology Master (TM) – Manages a topology throughout its entire lifecycle • Multiple Containers – Each Container multiple Heron Instances, a Stream Manager, and a Metrics Manager – A Heron Instance is a process that handles a single task of a spout or bolt – Containers communicate with TM to ensure that the topology forms a fully connected graph Valeria Cardellini - SABD 2017/18 13 Heron topology architecture Valeria Cardellini - SABD 2017/18 14 Heron topology architecture • Stream Manager (SM): routing engine for data streams – Each Heron connects to its local SM, while all of the SMs in a given topology connect to one another to form a network – Responsbile for propagating back pressure Valeria Cardellini - SABD 2017/18 15 Topology submit sequence Valeria Cardellini - SABD 2017/18 16 Self-adaptation in Heron • Dhalion: framework on top of Heron to autonomously reconfigure topologies to meet throughput SLOs, scaling resource consumption up and down as needed • Phases in Dhalion: - Symptom detection (backpressure, skew, …) - Diagnosis generation - Resolution • Adaptation actions: parallelism changes Valeria Cardellini - SABD 2017/18 17 Heron environment • Heron supports deployment on Apache Mesos • Can also run on Mesos using Apache Aurora as a scheduler or using a local scheduler Valeria Cardellini - SABD 2017/18 18 Batch processing vs. stream processing • Batch processing is just a special case of stream processing Valeria Cardellini - SABD 2017/18 19 Batch processing vs. stream processing • Batched/stateless: scheduled in batches – Short-lived tasks (Hadoop, Spark) – Distributed streaming over batches (Spark Streaming) • Dataflow/stateful: continuous/scheduled once (Storm, Flink, Heron) – Long-lived task execution – State is kept inside tasks Valeria Cardellini - SABD 2017/18 20 Native vs. non-native streaming Valeria Cardellini - SABD 2017/18 21 Apache Flink • Distributed data flow processing system • One common runtime for DSP applications and batch processing applications – Batch processing applications run efficiently as special cases of DSP applications • Integrated with many other projects in the open-source data processing ecosystem • Derives from Stratosphere project by TU Berlin, Humboldt University and Hasso Plattner Institute • Support a Storm-compatible API Valeria Cardellini - SABD 2017/18 22 Flink: software stack • On top: libraries with high-level APIs for different use cases, still in beta Valeria Cardellini - SABD 2017/18 23 Flink: programming model • Data stream – An unbounded, partitioned immutable sequence of events • Stream operators – Stream transformations that generate new output data streams from input ones Valeria Cardellini - SABD 2017/18 24 Flink: some features • Supports stream processing and windowing with Event Time semantics – Event time makes it easy to compute over streams where events arrive out of order, and where events may arrive delayed • Exactly-once semantics for stateful computations • Highly flexible streaming windows Valeria Cardellini - SABD 2017/18 25 Flink: some features • Continuous streaming model with backpressure • Flink's streaming runtime has natural flow control: slow data sinks backpressure faster sources Valeria Cardellini - SABD 2017/18 26 Flink: APIs and libraries • Streaming data applications: DataStream API – Supports functional transformations on data streams, with user-defined state, and flexible windows – Example: how to compute a sliding histogram of word occurrences of a data stream of texts WindowWordCount in Flink's DataStream API Sliding time window of 5 sec length and 1 sec trigger interval Valeria Cardellini - SABD 2017/18 27 Flink: APIs and libraries • Batch processing applications: DataSet API • Supports a wide range of data types beyond key/value pairs, and a wealth of operators Core loop of the PageRank algorithm for graphs Valeria Cardellini - SABD 2017/18 28 Flink: program optimization • Batch programs are automatically optimized to exploit situations where expensive operations (like shuffles and sorts) can be avoided, and when intermediate data should be cached Valeria Cardellini - SABD 2017/18 29 Flink: control events • Control events: special events injected in the data stream by operators • Periodically, the data source injects checkpoint barriers into the data stream by dividing the stream into pre- checkpoint and post-checkpoint • More coarse-grained approach than Storm: acks sequences of records instead of individual records • Watermarks signal the progress of event-time within a stream partition • Flink does not provide ordering guarantees after any form of stream repartitioning or broadcasting – Dealing with out-of-order records is left to the operator implementation Valeria Cardellini - SABD 2017/18 30 Flink: fault-tolerance • Based on Chandy-Lamport distributed snapshots • Lightweight mechanism – Allows to maintain high throughput rates and provide strong consistency guarantees at the same time Valeria Cardellini - SABD 2017/18 31 Flink: performance and memory management • High performance and low latency • Memory management

DSP Frameworks DSP Frameworks We Consider

Working with Storm Topologies Date of Publish: 2018-08-13

Apache Flink™: Stream and Batch Processing in a Single Engine

Large-Scale Learning from Data Streams with Apache SAMOA

Comparative Analysis of Data Stream Processing Systems

Apache Storm Tutorial

Network Traffic Profiling and Anomaly Detection for Cyber Security

HDP 3.1.4 Release Notes Date of Publish: 2019-08-26

Hdf® Stream Developer 3 Days

ADMI Cloud Computing Presentation

Installing and Configuring Apache Storm Date of Publish: 2018-08-30

Perform Data Engineering on Microsoft Azure Hdinsight (775)

A Performance Comparison of Open-Source Stream Processing Platforms