Apache Apex: Next Gen Big Data Analytics

Thomas Weise @thweise PMC Chair , Architect DataTorrent Apache Big Data Europe, Sevilla, Nov 14th 2016 Stream Data Processing

Data Delivery Transform / Analytics Real-time visualization, …

Declarative

SQL API

Data

Beam Beam SAMOA Operator SAMOA DAG API Sources Library

Events Logs Oper1 Oper2 Oper3 Sensor Data Social Databases CDC

(roadmap)

2 Industries & Use Cases

Financial Services Ad-Tech Telecom Manufacturing Energy IoT

Real-time Call detail record customer facing (CDR) & Supply chain Fraud and risk Smart meter Data ingestion dashboards on extended data planning & monitoring analytics and processing key performance record (XDR) optimization indicators analysis

Understanding Reduce outages Credit risk Click fraud customer Preventive & improve Predictive assessment detection behavior AND maintenance resource analytics context utilization

Packaging and Improve turn around Asset & Billing selling Product quality & time of trade workforce Data governance optimization anonymous defect tracking settlement processes management customer data

HORIZONTAL

• Large scale ingest and distribution • Enforcing data quality and data governance requirements • Real-time ELTA (Extract Load Transform Analyze) • Real-time data enrichment with reference data • Dimensional computation & aggregation • Real-time machine learning model scoring

3 Apache Apex

• In-memory, distributed • Application logic broken into components (operators) that execute distributed in a cluster • Unobtrusive Java API to express (custom) logic • Maintain state and metrics in member variables • Windowing, event-time processing • Scalable, high throughput, low latency • Operators can be scaled up or down at runtime according to the load and SLA • Dynamic scaling (elasticity), compute locality • Fault tolerance & correctness • Automatically recover from node outages without having to reprocess from beginning • State is preserved, checkpointing, incremental recovery • End-to-end exactly-once • Operability • System and application metrics, record/visualize data • Dynamic changes and resource allocation, elasticity

4 Native Hadoop Integration

• YARN is the resource manager • HDFS for storing persistent state

5 Application Development Model

A Stream is a sequence of data Directed Acyclic Graph (DAG) tuples A typical Operator takes one or Filtered Enriched more input streams, performs Stream Stream computations & emits one or more output streams Output Operator • Each Operator is YOUR custom Stream business logic in java, or built-in operator from our open source library Tuple Operator Operator Operator Operator • Operator has many instances that run in parallel and each instance is single-threaded Filtered Enriched Stream Stream Directed Acyclic Graph (DAG) is Operator made up of operators and streams

6 Development Process

Kafka Word JDBC Parser Filter Input Counter Output

Lines Words Filtered Counts Kafka Database Apex Application

• Operators from library or develop for custom logic • Connect operators to form application • Configure operator properties • Configure scaling and other platform attributes • Test functionality, performance, iterate 7 Application Specification

DAG API (compositional)

Java Stream API (declarative)

8 Developing Operators

9 Operator Library

Messaging NoSQL RDBMS • Kafka • Cassandra, HBase • JDBC • JMS (ActiveMQ, …) • Aerospike, Accumulo • MySQL • Kinesis, SQS • Couchbase/ CouchDB • Oracle • Flume, NiFi • , MongoDB • MemSQL • Geode File Systems Parsers Transformations • HDFS/ Hive • XML • Filter, Expression, Enrich • NFS • JSON • Windowing, Aggregation • S3 • CSV • Join • Avro • Dedup • Parquet Analytics Protocols Other • Dimensional Aggregations • HTTP • Elastic Search (with state management for • FTP • Script (JavaScript, Python, R) historical data + query) • WebSocket • Solr • MQTT • Twitter • SMTP

10 Stateful Processing with Event Time

Event Stream

k=A k=B k=B k=A k=A t=4:00 t=5:00 t=5:59 t=4:30 t=5:00

Processing Time +30s +60s +90s

(All) : 1 (All) : 4 (All) : 5 t=4:00 : 1 t=4:00 : 2 t=4:00 : 2 State t=5:00 : 2 t=5:00 : 3 k=A, t=4:00 : 1 k=A, t=4:00 : 2 k=A, t=4:00 : 2 k=A, t=5:00 : 1 K=B, t=5:00 : 2 k=B, t=5:00 : 2

11 Windowing - Model

Event-time Session windows Watermarks Accumulation Triggers Keyed or Not Keyed Allowed Lateness Accumulation Mode Merging streams ApexStream stream = StreamFactory .fromFolder(localFolder) .flatMap(new Split()) .window(new WindowOption.GlobalWindow(), new TriggerOption().withEarlyFiringsAtEvery(Duration.millis(1000)).accumulatingFiredPanes()) .countByKey(new ConvertToKeyVal()).print();

12 Fault Tolerance

• Operator state is checkpointed to persistent store ᵒ Automatically performed by engine, no additional coding needed ᵒ Asynchronous and distributed ᵒ In case of failure operators are restarted from checkpoint state • Automatic detection and recovery of failed containers ᵒ Heartbeat mechanism ᵒ YARN process status notification • Buffering to enable replay of data from recovered point ᵒ Fast, incremental recovery, spike handling • Application master state checkpointed ᵒ Snapshot of physical (and logical) plan ᵒ Execution layer change log

13 Checkpointing State

. Distributed, asynchronous . No artificial latency . Periodic callbacks . Pluggable storage

14 Buffer Server & Recovery • In-memory PubSub Container 1 Container 2 • Stores results until committed Operator Buffer Operator • Backpressure / spillover to disk 1 Server 2 • Ordering, idempotency Node 1 Node 2

Independent pipelines Downstream Operators reset (can be used for speculative execution)

15 Recovery Scenario

sum … EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 0

sum … EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 7

sum … EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 10

sum … EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 7

16 Processing Guarantees At-least-once • On recovery data will be replayed from a previous checkpoint ᵒ No messages lost ᵒ Default, suitable for most applications • Can be used to ensure data is written once to store ᵒ Transactions with meta information, Rewinding output, Feedback from external entity, Idempotent operations At-most-once • On recovery the latest data is made available to operator ᵒ Useful in use cases where some data loss is acceptable and latest data is sufficient Exactly-once ᵒ At-least-once + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly once behavior

17 End-to-End Exactly Once

• Important when writing to external systems • Data should not be duplicated or lost in the external system in case of application failures • Common external systems ᵒ Databases ᵒ Files ᵒ Message queues • Exactly-once = at-least-once + idempotency + consistent state • Data duplication must be avoided when data is replayed from checkpoint ᵒ Operators implement the logic dependent on the external system ᵒ Platform provides checkpointing and repeatable windowing

18 Scalability

Unifier NxM Partitions Logical DAG

Logical Diagram 0 1 2 3

0 1 2 Physical DAG with (1a, 1b, 1c) and (2a, 2b): Bottleneck on intermediate Unifier

1a 2a

Physical Diagram with operator 1 with 3 partitions 0 1b Unifier Unifier 3

1 2b 1c

Physical DAG with (1a, 1b, 1c) and (2a, 2b): No bottleneck 0 1 Unifier 2 1a Unifier 2a

1 0 1b Unifier 3

Unifier 2b 1c

19 Advanced Partitioning

Parallel Partition Cascading Unifiers Logical DAG Logical Plan 0 1 2 3 4 uopr dopr

Execution Plan, for N = 4; M = 1

Physical DAG uopr1 1a

uopr2 Container

0 Unifier 2 3 4 unifier dopr NIC 1b uopr3

uopr4

Physical DAG with Parallel Partition Execution Plan, for N = 4; M = 1, K = 2 with cascading unifiers

uopr1 Container

1a 2a 3a

unifier

NIC NIC

0 Unifier 4 uopr2 Container

unifier dopr NIC

1b 2b 3b uopr3 Container

unifier

NIC NIC uopr4 20 Dynamic Partitioning

2a 2a

1a 2a 1a 2b 1a 2b 3a

3 3

1b 2b 1b 2c 1b 2c 3b

2d 2d

Unifiers not shown • Partitioning change while application is running ᵒ Change number of partitions at runtime based on stats ᵒ Determine initial number of partitions dynamically • Kafka operators scale according to number of kafka partitions ᵒ Supports re-distribution of state when number of partitions change ᵒ API for custom scaler or partitioner 21

How dynamic partitioning works

• Partitioning decision (yes/no) by trigger (StatsListener) ᵒ Pluggable component, can use any system or custom metric ᵒ Externally driven partitioning example: KafkaInputOperator • Stateful! ᵒ Uses checkpointed state ᵒ Ability to transfer state from old to new partitions (partitioner, customizable) ᵒ Steps: • Call partitioner • Modify physical plan, rewrite checkpoints as needed • Undeploy old partitions from execution layer • Release/request container resources • Deploy new partitions (from rewritten checkpoint) ᵒ No loss of data (buffered) ᵒ Incremental operation, partitions that don’t change continue processing • API: Partitioner interface

22 Compute Locality

• By default operators are deployed in containers (processes) on different nodes across the Hadoop cluster • Locality options for streams ᵒ RACK_LOCAL: Data does not traverse network switches ᵒ NODE_LOCAL: Data transfer via loopback interface, frees up network bandwidth ᵒ CONTAINER_LOCAL: Data transfer via in memory queues between operators, does not require serialization ᵒ THREAD_LOCAL: Data passed through call stack, operators share thread • Host Locality ᵒ Operators can be deployed on specific hosts • (Anti-)Affinity ᵒ Ability to express relative deployment without specifying a host

23 Performance: Throughput vs. Latency?

https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming- computation-engines-at http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

24 High-Throughput and Low-Latency

Apex, Flink w/ 4 Kafka brokers 2.7 million events/second, Kafka latency limit

Apex w/o Kafka and Redis: 43 million events/second with more than 90 percent of events processed with the latency less than 0.5 seconds https://www.datatorrent.com/blog/throughput-latency-and-yahoo/ 25 Recent Additions & Roadmap

• Declarative Java API • Windowing Semantics following Beam model • Scalable state management • SQL support using • Apache Beam Runner, SAMOA integration

• Enhanced support for Batch Processing • Support for Mesos • Encrypted Streams • Python support for operator logic and API • Replacing operator code at runtime • Dynamic attribute changes • Named checkpoints

26 Enterprise Tools

27 Monitoring Console Logical View Physical View

28 Real-Time Dashboards

29 Maximize Revenue w/ real-time insights

PubMatic is the leading marketing automation software company for publishers. Through real-time analytics, yield management, and workflow automation, PubMatic enables publishers to make smarter inventory decisions and improve revenue performance

Business Need Apex based Solution Client Outcome

• Ingest and analyze high volume clicks & • DataTorrent Enterprise platform, • Helps PubMatic deliver ad performance views in real-time to help customers powered by Apache Apex insights to publishers and advertisers in improve revenue • In-memory stream processing real-time instead of 5+ hours - 200K events/second data • Comprehensive library of pre-built • Helps Publishers visualize campaign flow operators including connectors performance and adjust ad inventory in • Report critical metrics for campaign • Built-in fault tolerance real-time to maximize their revenue monetization from auction and client • Dynamically scalable • Enables PubMatic reduce OPEX with logs • Real-time query from in-memory state efficient compute resource utilization - 22 TB/day data generated • Management UI & Data Visualization • Built-in fault tolerance ensures • Handle ever increasing traffic with console customers can always access ad efficient resource utilization network • Always-on ad network

30 Industrial IoT applications

GE is dedicated to providing advanced IoT analytics solutions to thousands of customers who are using their devices and sensors across different verticals. GE has built a sophisticated analytics platform, Predix, to help its customers develop and execute Industrial IoT applications and gain real-time insights as well as actions.

Business Need Apex based Solution Client Outcome

• Ingestion application using DataTorrent • Helps GE improve performance and lower • Ingest and analyze high-volume, high speed Enterprise platform cost by enabling real-time Big Data data from thousands of devices, sensors • Powered by Apache Apex analytics per customer in real-time without data loss • In-memory stream processing • Helps GE detect possible failures and • Predictive analytics to reduce costly • Built-in fault tolerance minimize unplanned downtimes with maintenance and improve customer • Dynamic scalability centralized management & monitoring of service • Comprehensive library of pre-built devices • Unified monitoring of all connected sensors operators • Enables faster innovation with short and devices to minimize disruptions • Management UI console application development cycle • Fast application development cycle • No data loss and 24x7 availability of • High scalability to meet changing business applications and application workloads • Helps GE adjust to scalability needs with auto-scaling

31 Smart energy applications

Silver Spring Networks helps global utilities and cities connect, optimize, and manage smart energy and smart city infrastructure. Silver Spring Networks receives data from over 22 million connected devices, conducts 2 million remote operations per year

Business Need Apex based Solution Client Outcome

• Helps Silver Spring Networks ingest & • Ingest high-volume, high speed data from • DataTorrent Enterprise platform, powered analyze data in real-time for effective load millions of devices & sensors in real-time by Apache Apex management & customer service without data loss • In-memory stream processing • Helps Silver Spring Networks detect • Make data accessible to applications • Pre-built operators/connectors possible failures and reduce outages with without delay to improve customer service • Built-in fault tolerance centralized management & monitoring of • Capture & analyze historical data to • Dynamically scalable devices understand & improve grid operations • Management UI console • Enables fast application development for • Reduce the cost, time, and pain of faster time to market integrating with 3rd party apps • Helps Silver Spring Networks scale with • Centralized management of software & easy to partition operators operations • Automatic recovery from failures

32 Who is using Apex?

• Powered by Apex • http://apex.apache.org/powered-by-apex.html • Also using Apex? Let us know to be added: [email protected] or @ApacheApex • Pubmatic • https://www.youtube.com/watch?v=JSXpgfQFcU8 • GE • https://www.youtube.com/watch?v=hmaSkXhHNu0 • http://www.slideshare.net/ApacheApex/ge-iot-predix-time-series-data-ingestion-service-using- apache-apex-hadoop • SilverSpring Networks • https://www.youtube.com/watch?v=8VORISKeSjI • http://www.slideshare.net/ApacheApex/iot-big-data-ingestion-and-processing-in-hadoop-by- silver-spring-networks

33 Q&A

34 Resources

• http://apex.apache.org/ • Learn more - http://apex.apache.org/docs.html • Subscribe - http://apex.apache.org/community.html • Download - http://apex.apache.org/downloads.html • Follow @ApacheApex - https://twitter.com/apacheapex • Meetups - https://www.meetup.com/topics/apache-apex/ • Examples - https://github.com/DataTorrent/examples • Slideshare - http://www.slideshare.net/ApacheApex/presentations • https://www.youtube.com/results?search_query=apache+apex • Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/

35