Real-Time Processing with Hadoop

Hortonworks. We do Hadoop.

Topics – Hortonworks and Apache Storm

• Hortonworks and Enterprise Hadoop

• Stream Processing with Apache Storm – Introduction

• Apache Storm – Key Constructs

• Apache Storm – Data Flows and Architecture

• Apache Storm Use Cases and Real-World Walkthrough

• Apache Storm Demo and How To

Hortonworks and Enterprise Hadoop


Our Mission: Power your modern data architecture with HDP and Enterprise Hadoop

Who we are
– June 2011: Original 24 architects, developers, and operators of Hadoop from Yahoo!
– June 2014: An enterprise software company with 420+ employees

Our model: Innovate and deliver Apache Hadoop as a complete enterprise data platform, completely in the open, backed by a world-class support organization

Key Partners

…and hundreds more

HDP delivers a comprehensive data management platform

YARN is the architectural center of HDP (HDP 2.1 – Hortonworks Data Platform)
• Enables batch, interactive and real-time workloads
• Single SQL engine for both batch and interactive
• Enables existing ISV apps to plug directly into Hadoop via YARN

Provides comprehensive enterprise capabilities
• Governance
• Security
• Operations

The widest range of deployment options
• Linux & Windows
• On-premise & cloud

[Diagram: the HDP 2.1 stack – data access engines (Pig, Hive on Tez, HBase, Accumulo, Storm, Solr, Spark, ISV engines) running on YARN over HDFS, surrounded by governance & integration (Falcon, Flume, WebHDFS, NFS), security (authentication, authorization, data protection, Knox), and operations (Ambari, Zookeeper, Oozie)]

Choice of Tool for Stream Processing

Storm
§ Event-by-event processing
§ Suited for sub-second real-time processing – e.g., fraud detection

Spark Streaming
§ Micro-batch processing
§ Suited for near real-time processing – e.g., trending Facebook likes

[Diagram: Hadoop 2.x as a multi-use data platform (batch, interactive, online, streaming, …) – Storm (streaming), Spark Streaming (micro-batch), and Tez (interactive) running on YARN (cluster resource management) over HDFS2 (redundant, reliable storage)]

Stream Processing with Apache Storm

Why stream processing IN Hadoop?

Stream processing has emerged as a key use case.

What is the need?
§ Exponential rise in real-time data
§ The ability to act on real-time data opens new business opportunities

Why now?
§ Economics of open source software & commodity hardware
§ YARN allows multiple computing paradigms to co-exist in the data lake

[Diagram: Hadoop 2.x as a multi-use data platform – Storm (streaming), Spark (micro-batch), and Tez (interactive) on YARN over HDFS2]

Storm Use Cases – Business View

Monitor real-time data (sentiment, clickstream, machine/sensor logs, geo-location) to…

Prevent vs. Optimize, by industry:
– Finance: prevent securities fraud and compliance violations; optimize order routing and pricing
– Telco: prevent security breaches and network outages; optimize bandwidth allocation and customer service
– Retail: prevent shrinkage and stock-outs; optimize offers and pricing
– Manufacturing: prevent device failures; optimize the supply chain
– Transportation: prevent driver & fleet issues; optimize routes and pricing
– Web: prevent application failures and operational issues; optimize site content

Stream processing is very different from batch

Factors, streaming vs. batch:
– Freshness: streaming is real-time (usually < 15 min); batch is historical (usually more than 15 min old)
– Data location: streaming keeps data primarily in memory (moved to disk after processing); batch keeps data primarily on disk (moved to memory for processing)
– Processing speed: streaming takes sub-second to a few seconds; batch takes a few minutes to hours
– Frequency: streaming is always running; batch runs sporadically to periodically
– Who?: streaming serves automated systems only; batch serves human & automated systems
– Client type: streaming clients are primarily operational applications; batch clients are primarily analytical systems

Key Requirements of a Streaming Solution

Data Ingest • Extremely high ingest rates – millions of events/second

Processing • Ability to easily plug in different processing frameworks • Guaranteed processing – at-least-once processing semantics

Persistence • Ability to persist data to multiple relational and non-relational data stores

Operations • Security, HA, fault tolerance & management support

Apache Storm – Leading for Stream Processing

An open source, real-time event stream processing platform that provides fast, continuous, & low-latency processing for very high-frequency streaming data

Highly scalable • Horizontally scalable like Hadoop • E.g., a 10-node cluster can process 1M tuples per second

Fault-tolerant • Automatically reassigns tasks on failed nodes

Guarantees processing • Supports at-least-once & exactly-once processing semantics

Language agnostic • Processing logic can be defined in any language

Apache project • Brand, governance & a large active community

Storm Use Cases – IT Operations View

Continuous processing • Continuously ingest high-rate messages, process them, and update data stores

High-speed data aggregation • Aggregate multiple data streams that emit data at extremely high rates into one central data store

Data filtering • Filter out unwanted data on the fly before it is persisted to a data store (see the bolt sketch after this list)

Distributed RPC / response-time reduction • Extremely resource-intensive (CPU, memory, or I/O) processing that would take a long time on a single machine can be parallelized with Storm to reduce response times to seconds
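As an illustration of the data-filtering case, a bolt can simply drop tuples that do not match a predicate. A minimal sketch (the field names and the "Normal" event type are assumptions, not the deck's actual code):

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical filter: only non-"Normal" events are forwarded for persistence.
public class EventFilterBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String eventType = input.getStringByField("eventType");  // assumed field name

        if (!"Normal".equals(eventType)) {
            // Pass the interesting events downstream; everything else is dropped on the fly
            collector.emit(new Values(input.getStringByField("driverId"), eventType));
        }
        // Dropped tuples are still acked automatically by BaseBasicBolt
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("driverId", "eventType"));
    }
}
```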

Apache Storm Key Constructs


Key Constructs in Apache Storm

• Basic Concepts of a Topology: Tuples and Streams, Spouts, Bolts
• Fields Grouping
• Architecture and Topology Submission
• Parallelism
• Processing Guarantees

Basic Concepts

Spouts: Generate streams.

Tuple: The most fundamental data structure; a named list of values that can be of any data type

Streams: Groups of tuples

Bolts: Contain data processing, persistence and alerting logic. Can also emit tuples for downstream bolts

Tuple Tree: First spout tuple and all the tuples that were emitted by the bolts that processed it

Topology: Group of spouts and bolts wired together into a workflow
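To make these constructs concrete, here is a minimal sketch of wiring a topology with Storm's Java API. TruckEventSpout and MonitoringBolt are hypothetical class names (minimal versions are sketched on the Spouts and Bolts pages that follow); this is an illustration, not the deck's actual code.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.utils.Utils;

public class TruckEventTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout: the source of the stream (hypothetical class, sketched on the Spouts page)
        builder.setSpout("truck-event-spout", new TruckEventSpout(), 1);

        // Bolt: processing/alerting logic, subscribed to the spout's stream
        // (hypothetical class, sketched on the Bolts page)
        builder.setBolt("monitoring-bolt", new MonitoringBolt(), 2)
               .shuffleGrouping("truck-event-spout");

        // Run in-process for a quick test; a real deployment submits to Nimbus instead
        Config conf = new Config();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("truck-event-topology", conf, builder.createTopology());

        Utils.sleep(30000);   // let the topology run for a while
        cluster.shutdown();
    }
}
```

LocalCluster is only a developer convenience; submitting to a real cluster via Nimbus is shown later in the Topology Submission slide.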

Tuples and Streams

What is a Tuple? § The fundamental data structure in Storm: a named list of values that can be of any data type

What is a Stream? § An unbounded sequence of tuples § The core abstraction in Storm – streams are what you “process” in Storm

Spouts

What is a Spout? § Generates, or is a source of, streams – e.g., JMS, Twitter, log, or Kafka spouts § Multiple instances of a spout can be spun up and dynamically adjusted as needed
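A minimal spout sketch, assuming a hypothetical TruckEventSpout that fabricates random truck events in memory instead of reading from JMS/Twitter/Kafka; it shows the three methods a spout typically implements and the tuple schema it declares.

```java
import java.util.Map;
import java.util.Random;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

// Hypothetical spout that emits random "truck events" as a stream of tuples.
public class TruckEventSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // Called once per spout instance; keep a handle to the collector
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        // Called repeatedly by Storm; emit one tuple per call
        String driverId = "driver-" + random.nextInt(100);
        int speed = 40 + random.nextInt(60);
        collector.emit(new Values(driverId, speed, System.currentTimeMillis()));
        Utils.sleep(100);  // throttle the demo source
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Tuple schema: a named list of values of any data type
        declarer.declare(new Fields("driverId", "speed", "timestamp"));
    }
}
```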

Bolts

What is a Bolt? § Processes any number of input streams and produces output streams § Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic § Multiple instances of a bolt can be spun up and dynamically adjusted as needed

Examples of Bolts:
1. HBaseBolt: persist the stream in HBase
2. HDFSBolt: persist the stream into HDFS as Avro files using Flume
3. MonitoringBolt: read from HBase and create alerts via email and messaging queues if given thresholds are exceeded
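A simplified sketch in the spirit of the MonitoringBolt above. The real bolt reads thresholds from HBase; this hypothetical version hard-codes one and just emits alert tuples for downstream bolts.

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical bolt: checks each event against a threshold and emits an alert tuple
// for downstream bolts (e.g., one that sends email or writes to a queue).
public class MonitoringBolt extends BaseBasicBolt {
    private static final int SPEED_THRESHOLD = 80;  // assumed; the real bolt would read this from HBase

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String driverId = input.getStringByField("driverId");
        int speed = input.getIntegerByField("speed");

        if (speed > SPEED_THRESHOLD) {
            // Emit an alert tuple onto this bolt's output stream
            collector.emit(new Values(driverId, speed, "SPEED_THRESHOLD_EXCEEDED"));
        }
        // BaseBasicBolt acks the input tuple automatically after execute() returns
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("driverId", "speed", "alertType"));
    }
}
```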

Fields Grouping

Question: If multiple instances of a single bolt can be created, how do you control which instance an emitted tuple goes to? Answer: Fields grouping

What is a fields grouping? § Provides various ways to control tuple routing to bolts § Many groupings exist, including shuffle, fields, global, …

Example of Field grouping in Use Case § Events for a given driver/truck must be sent to the same monitoring bolt instance.
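A sketch of that routing (reusing the hypothetical TruckEventSpout and MonitoringBolt from the earlier sketches): a fields grouping on driverId guarantees that all events for one driver land on the same bolt task, so per-driver state stays on one instance.

```java
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class GroupingExample {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("truck-event-spout", new TruckEventSpout(), 2);

        // All tuples with the same "driverId" value are routed to the same
        // MonitoringBolt task, regardless of how many instances exist.
        builder.setBolt("monitoring-bolt", new MonitoringBolt(), 4)
               .fieldsGrouping("truck-event-spout", new Fields("driverId"));
    }
}
```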

Different Fields Groupings

Grouping type – what it does – when to use:
– Shuffle grouping: sends tuples to bolt tasks in a random, round-robin sequence. Use for atomic operations, e.g., math operations.
– Fields grouping: sends tuples to a bolt based on one or more fields in the tuple. Use for segmentation of the incoming stream and for counting tuples of a certain type.
– All grouping: sends a single copy of each tuple to all instances of a receiving bolt. Use to send a signal to all bolts (e.g., clear cache or refresh state) or to send a ticker tuple signalling bolts to save state.
– Custom grouping: implement your own grouping so tuples are routed based on custom logic. Use for maximum flexibility to change processing sequence or logic based on factors such as data type, load, seasonality, etc.
– Direct grouping: the source decides which bolt task will receive the tuple. Use depends on the case.
– Global grouping: sends tuples generated by all instances of the source to a single target instance (specifically, the task with the lowest ID). Use for global counts.

Data Processing Guarantees in Storm

Processing guarantee – how it is achieved – applicable use cases:
– At least once: achieved by replaying tuples on failure. Suited to unordered, idempotent operations and sub-second-latency processing.
– Exactly once: achieved with Trident. Suited to ordered processing, global counts, context-aware (causality-based) processing, and cases where latency is not important.
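In plain Storm, at-least-once processing comes from anchoring and acking: if any tuple in the tuple tree is failed or times out, the spout replays the original tuple (assuming the spout emitted it with a message ID). A sketch of a bolt that participates explicitly (a BaseRichBolt; BaseBasicBolt does the acking automatically):

```java
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Sketch of explicit anchoring and acking for at-least-once semantics.
public class ReliableEnrichBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String driverId = input.getStringByField("driverId");
            // Anchor the emitted tuple to the input tuple: if a downstream bolt
            // fails it, the whole tuple tree is replayed from the spout.
            collector.emit(input, new Values(driverId, System.currentTimeMillis()));
            collector.ack(input);   // mark this input as fully processed here
        } catch (Exception e) {
            collector.fail(input);  // ask the spout to replay this tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("driverId", "processedAt"));
    }
}
```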

Architecture

Nimbus (management server)
• Similar to the JobTracker
• Distributes code around the cluster
• Assigns tasks
• Handles failures

Supervisor (slave nodes)
• Similar to the TaskTracker
• Runs bolts and spouts as ‘tasks’

ZooKeeper
• Cluster coordination
• Nimbus HA
• Stores cluster metrics
• Consumption-related metadata for Trident topologies

Parallelism in a Storm Topology

Supervisor
§ Listens for work assigned to its machine and starts and stops worker processes as necessary, based on what Nimbus has assigned to it

Worker (JVM process)
§ A JVM process that executes a subset of a topology
§ Belongs to a specific topology and may run one or more executors (threads) for one or more components (spouts or bolts) of that topology

Executor (thread in a JVM process)
§ A thread spawned by a worker process that runs within the worker’s JVM
§ Runs one or more tasks for the same component (spout or bolt)

Task
§ Performs the actual data processing and is run within its parent executor’s thread of execution
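A sketch of how these layers map onto the Java API (component classes are the earlier hypothetical ones): the parallelism hint sets the number of executors, setNumTasks sets the number of tasks, and setNumWorkers sets the number of worker JVMs.

```java
import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class ParallelismSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // parallelism hint = number of executors (threads) for the component
        builder.setSpout("truck-event-spout", new TruckEventSpout(), 5);

        // 8 executors but 16 tasks: each executor thread runs 2 tasks of this bolt
        builder.setBolt("monitoring-bolt", new MonitoringBolt(), 8)
               .setNumTasks(16)
               .fieldsGrouping("truck-event-spout", new Fields("driverId"));

        Config conf = new Config();
        conf.setNumWorkers(4);  // 4 worker JVM processes spread across the supervisors
    }
}
```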

Storm Components and Topology Submission

Submitting the storm-event-processor topology: the client submits the topology to Nimbus (the management/master node, shown here on a YARN application master node). Nimbus, similar to the JobTracker, distributes code around the cluster, assigns tasks, and handles failures. The supervisors (worker/slave nodes), similar to TaskTrackers, run the topology’s bolts and spouts as tasks, while ZooKeeper provides cluster coordination, Nimbus HA, and stores cluster metrics.

[Diagram: Kafka spouts plus HBase, HDFS, and monitoring bolts spread across five supervisor (slave) nodes, coordinated by Nimbus and a three-node ZooKeeper quorum]
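Submitting the topology shown in the diagram could look like this sketch (the topology name matches the diagram; the spout class is the earlier hypothetical one, and the bolt wiring is elided):

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class SubmitStormEventProcessor {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new TruckEventSpout(), 5);
        // ... bolt wiring as in the earlier sketches ...

        Config conf = new Config();
        conf.setNumWorkers(8);  // worker JVMs spread across the supervisor slots

        // Ships the topology jar and configuration to Nimbus, which assigns
        // tasks to supervisor slots and handles failures from then on.
        StormSubmitter.submitTopology("storm-event-processor", conf, builder.createTopology());
    }
}
```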

Relationship Between Supervisors, Workers, Executors & Tasks

Each supervisor machine in Storm has specific predefined ports (supervisor.slots.ports in storm.yaml) to which a worker process is assigned.

[Diagram: a supervisor node hosting worker processes (JVMs), each running executors (threads) that in turn run one or more tasks]

Source: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/

Parallelism Example

Example Configuration in Use Case: § Supervisors = 4 § Workers = 8 § spout.count = 5 § bolt.count = 16

How many worker processes per supervisor? § 8 / 4 = 2
How many total executors (threads)? § 16 + 5 = 21
How many executors (threads) per worker? § 21 / 8 ≈ 2 to 3
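In code, those numbers could come from a configuration like this sketch (the 4 supervisors are a property of the cluster, not of the topology; component classes are the earlier hypothetical ones):

```java
import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class UseCaseParallelism {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("truck-event-spout", new TruckEventSpout(), 5);   // spout.count = 5 executors

        builder.setBolt("monitoring-bolt", new MonitoringBolt(), 16)       // bolt.count = 16 executors
               .fieldsGrouping("truck-event-spout", new Fields("driverId"));

        Config conf = new Config();
        conf.setNumWorkers(8);  // 8 worker JVMs over 4 supervisors => 2 workers per supervisor
        // Total executors = 16 + 5 = 21, so each worker runs roughly 21/8 = 2 to 3 threads.
    }
}
```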

Apache Storm Data Flows and Architecture


Flexible Architecture with Apache Storm

Storm connects to:
– Any application development platform: simplify the development of Storm topologies
– Kafka or JMS: pump data into Storm and send notifications from Storm
– Any NoSQL database: real-time views for operational dashboards
– Any search platform: search interface for analysts & dashboards
– In-memory caching platforms: temporary data storage
– Any RDBMS: provide reference data for Storm topologies
– HDFS: the data lake

An Apache Storm Solution Architecture Example

[Diagram: real-time data feeds flow into stream processing (Storm), which feeds online data processing (HBase, Accumulo) and search (Solr), all running on YARN over HDFS in the HDP 2.x data lake]

HDP Data Flow and Architecture

[Diagram: an input feed flows into Storm; Storm feeds HBase & Solr (backing a UI), an output feed, and HDFS/Hive (backing a query UI)]

The same diagram is then walked through in four steps:
1. The source input stream arrives on the input feed.
2. Storm ingests and processes the stream (normalization / micro-ETL) in real time.
3. Results are published to the output feed (available via a RESTful API), pushed to HBase & Solr, and sunk to HDFS/HCatalog (meta model) for ad-hoc SQL.
4. Data is available for search and query (RESTful API calls are an option as well).

Apache Storm – Code Walkthrough


Automated Stock Insight

Monitor Tweet Volume For Early Detection of Stock Movement – Monitor Twitter stream for S&P 500 companies to identify & act on unexpected increase in tweet volume

Solution components
§ Ingest: listen for Twitter streams related to S&P 500 companies
§ Processing: monitor tweets for unexpected volume; volume thresholds are managed in HBase
§ Persistence: HBase & HDFS
§ Refine: update threshold values based on historical analysis of tweet volumes
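A skeleton of how such a topology could be wired is sketched below. Every class name here (TwitterStreamSpout, TweetVolumeCountBolt, ThresholdAlertBolt, HBasePersistBolt, HdfsPersistBolt) is hypothetical; the deck's actual implementation is not shown.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class StockInsightTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Ingest: hypothetical spout listening to the Twitter stream for S&P 500 symbols
        builder.setSpout("twitter-spout", new TwitterStreamSpout(), 1);

        // Processing: count tweets per symbol per time window; fields grouping keeps
        // each symbol's counts on one bolt instance
        builder.setBolt("volume-count-bolt", new TweetVolumeCountBolt(), 4)
               .fieldsGrouping("twitter-spout", new Fields("symbol"));

        // Compare counts against thresholds kept in HBase and raise alerts
        builder.setBolt("threshold-alert-bolt", new ThresholdAlertBolt(), 4)
               .fieldsGrouping("volume-count-bolt", new Fields("symbol"));

        // Persistence: hypothetical bolts writing counts and raw tweets to HBase and HDFS
        builder.setBolt("hbase-bolt", new HBasePersistBolt(), 2)
               .shuffleGrouping("volume-count-bolt");
        builder.setBolt("hdfs-bolt", new HdfsPersistBolt(), 2)
               .shuffleGrouping("twitter-spout");

        Config conf = new Config();
        conf.setNumWorkers(4);
        StormSubmitter.submitTopology("stock-insight", conf, builder.createTopology());
    }
}
```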

High Level Architecture

Trucks T(1)…T(N) stream truck events into a topic on a messaging grid (WMQ, ActiveMQ, Kafka). From there the platform provides:
– High-speed ingestion and stream processing with Storm: a Kafka spout feeds a bolt that writes driver events to HBase, a bolt that writes events to HDFS, and a monitoring bolt that watches the dangerous-driver event count and raises alerts
– Real-time serving with HBase
– Interactive query with Tez: perform ad-hoc queries on driver/truck events and other related data sources
– Batch analytics with MR2: do batch analysis / build models and update HBase with the right thresholds for alerts
– Distributed storage on HDFS
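For the ingest side, the Kafka spout could be configured with the storm-kafka module; a sketch, in which the ZooKeeper address, topic name, zkRoot, and consumer id are all assumed values:

```java
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class TruckEventIngest {
    public static void main(String[] args) {
        // ZooKeeper quorum where the Kafka brokers register (assumed address)
        ZkHosts zkHosts = new ZkHosts("zk1.example.com:2181");

        // topic, zkRoot for offset storage, and a consumer id (all assumed values)
        SpoutConfig spoutConfig =
                new SpoutConfig(zkHosts, "truck_events", "/kafka_storm", "truck-event-consumer");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());  // deserialize messages as strings

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 5);
        // ... bolts that parse the events and write to HBase/HDFS would follow ...
    }
}
```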

HDP Provides a Single Data Platform

This is the same architecture viewed as an HDP data lake: YARN enables the four different apps/workloads (stream processing, real-time serving, interactive query, and batch analytics) on a single cluster over shared distributed storage (HDFS).


Q&A
