Apache Apex: Next Gen Big Data Analytics
Thomas Weise
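[Figure: platform overview. Data sources (events, logs, sensor data, social, databases, CDC) feed a DAG of operators (Oper1, Oper2, Oper3) for transform / analytics, followed by data delivery (real-time visualization, …). Application layers shown: operator library, DAG API, declarative API, SQL, Beam, SAMOA; some items are marked (roadmap).]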
Industries & Use Cases
• Financial Services: fraud and risk monitoring; credit risk assessment; improve turnaround time of trade settlement processes
• Ad-Tech: real-time customer facing dashboards on key performance indicators; click fraud detection; billing optimization
• Telecom: call detail record (CDR) & extended data record (XDR) analysis; understanding customer behavior AND context; packaging and selling anonymous customer data
• Manufacturing: supply chain planning & optimization; preventive maintenance; product quality & defect tracking
• Energy: smart meter analytics; reduce outages & improve resource utilization; asset & workforce management
• IoT: data ingestion and processing; predictive analytics; data governance
HORIZONTAL
• Large scale ingest and distribution
• Enforcing data quality and data governance requirements
• Real-time ELTA (Extract Load Transform Analyze)
• Real-time data enrichment with reference data
• Dimensional computation & aggregation
• Real-time machine learning model scoring
Apache Apex
• In-memory, distributed stream processing
  ᵒ Application logic broken into components (operators) that execute distributed in a cluster
  ᵒ Unobtrusive Java API to express (custom) logic
  ᵒ Maintain state and metrics in member variables
  ᵒ Windowing, event-time processing
• Scalable, high throughput, low latency
  ᵒ Operators can be scaled up or down at runtime according to the load and SLA
  ᵒ Dynamic scaling (elasticity), compute locality
• Fault tolerance & correctness
  ᵒ Automatically recover from node outages without having to reprocess from beginning
  ᵒ State is preserved, checkpointing, incremental recovery
  ᵒ End-to-end exactly-once
• Operability
  ᵒ System and application metrics, record/visualize data
  ᵒ Dynamic changes and resource allocation, elasticity
Native Hadoop Integration
• YARN is the resource manager
• HDFS for storing persistent state
Application Development Model
• A Stream is a sequence of data tuples
• A typical Operator takes one or more input streams, performs computations & emits one or more output streams
  ᵒ Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
  ᵒ An Operator has many instances that run in parallel and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
[Figure: Directed Acyclic Graph (DAG) of operators connected by streams of tuples (filtered stream, enriched stream, output stream).]
Development Process
[Figure: Apex application. Kafka → Kafka Input → (Lines) → Parser → (Words) → Filter → (Filtered) → Word Counter → (Counts) → JDBC Output → Database.]
• Operators from library or develop for custom logic
• Connect operators to form application
• Configure operator properties
• Configure scaling and other platform attributes
• Test functionality, performance, iterate
Application Specification
DAG API (compositional)
Java Stream API (declarative)
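The difference between the two specification styles can be sketched in plain Java. This is only an analogy under stated assumptions: `WordCountStyles` and its methods are hypothetical, `java.util.stream` is used as a stand-in and is not the Apex stream API, and the filter rule is made up. The pipeline mirrors the Parser, Filter and Word Counter steps of the Development Process slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java analogy only: none of these types are part of the Apex API.
final class WordCountStyles {

    // Hypothetical stand-in for the Parser step: split a line into words.
    static List<String> parse(String line) {
        return Arrays.asList(line.toLowerCase().split("\\s+"));
    }

    // Hypothetical stand-in for the Filter step (illustrative rule).
    static boolean keep(String word) {
        return word.length() > 2;
    }

    // Compositional style: build each step and connect it to the next explicitly.
    static Map<String, Long> compositional(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        Consumer<String> counter = word -> counts.merge(word, 1L, Long::sum);
        Consumer<String> filter = word -> { if (keep(word)) counter.accept(word); };
        Consumer<String> parser = line -> parse(line).forEach(filter);
        lines.forEach(parser);
        return counts;
    }

    // Declarative style: describe the pipeline as a chain of transformations.
    static Map<String, Long> declarative(List<String> lines) {
        return lines.stream()
                .flatMap(line -> parse(line).stream())
                .filter(WordCountStyles::keep)
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick fox", "the lazy dog and the fox");
        System.out.println(compositional(lines));
        System.out.println(declarative(lines));
    }
}
```

Both methods produce the same counts; the compositional form makes the wiring between steps explicit, while the declarative form leaves it implicit in the chain.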
Developing Operators
Operator Library
• Messaging: Kafka; JMS (ActiveMQ, …); Kinesis, SQS; Flume, NiFi
• NoSQL: Cassandra, HBase; Aerospike, Accumulo; Couchbase/CouchDB; Redis, MongoDB; Geode
• RDBMS: JDBC; MySQL; Oracle; MemSQL
• File Systems: HDFS/Hive; NFS; S3
• Parsers: XML; JSON; CSV; Avro; Parquet
• Transformations: Filter, Expression, Enrich; Windowing, Aggregation; Join; Dedup
• Analytics: Dimensional Aggregations (with state management for historical data + query)
• Protocols: HTTP; FTP; WebSocket; MQTT; SMTP
• Other: Elastic Search; Script (JavaScript, Python, R); Solr; Twitter
Stateful Processing with Event Time
Event Stream: (k=A, t=4:00), (k=B, t=5:00), (k=B, t=5:59), (k=A, t=4:30), (k=A, t=5:00)
State at processing time +30s:
• (All) : 1
• t=4:00 : 1
• k=A, t=4:00 : 1
State at processing time +60s:
• (All) : 4
• t=4:00 : 2
• t=5:00 : 2
• k=A, t=4:00 : 2
• k=B, t=5:00 : 2
State at processing time +90s:
• (All) : 5
• t=4:00 : 2
• t=5:00 : 3
• k=A, t=4:00 : 2
• k=A, t=5:00 : 1
• k=B, t=5:00 : 2
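The state shown on this slide can be reproduced with a small plain-Java sketch. It assumes hourly event-time windows (inferred from the t=4:00 and t=5:00 buckets) and uses a hypothetical `EventTimeCounts` class; it is not Apex's windowing implementation.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of keyed, event-time windowed counting; not the Apex API.
final class EventTimeCounts {
    // State is kept in member variables, per window and per (key, window).
    final Map<String, Integer> perWindow = new TreeMap<>();
    final Map<String, Integer> perKeyAndWindow = new TreeMap<>();
    int all = 0;

    // Assumption: hourly windows. An event with time "H:MM" belongs to window "H:00",
    // no matter when it arrives in processing time.
    static String windowOf(String eventTime) {
        return eventTime.substring(0, eventTime.indexOf(':')) + ":00";
    }

    void process(String key, String eventTime) {
        String window = windowOf(eventTime);
        all++;
        perWindow.merge(window, 1, Integer::sum);
        perKeyAndWindow.merge(key + "@" + window, 1, Integer::sum);
    }

    public static void main(String[] args) {
        EventTimeCounts state = new EventTimeCounts();
        String[][] events = {{"A", "4:00"}, {"B", "5:00"}, {"B", "5:59"}, {"A", "4:30"}, {"A", "5:00"}};
        for (String[] e : events) {
            state.process(e[0], e[1]);
        }
        System.out.println(state.all + " " + state.perWindow + " " + state.perKeyAndWindow);
    }
}
```

The late event (k=A, t=4:30) arrives after the 5:00 events but is still counted in the 4:00 window, which is the point of event-time processing.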
Windowing - Apache Beam Model
• Event-time
• Session windows
• Watermarks
• Accumulation
• Triggers
• Keyed or Not Keyed
• Allowed Lateness
• Accumulation Mode
• Merging streams
• ApexStream
Fault Tolerance
• Operator state is checkpointed to persistent store
  ᵒ Automatically performed by engine, no additional coding needed
  ᵒ Asynchronous and distributed
  ᵒ In case of failure operators are restarted from checkpoint state
• Automatic detection and recovery of failed containers
  ᵒ Heartbeat mechanism
  ᵒ YARN process status notification
• Buffering to enable replay of data from recovered point
  ᵒ Fast, incremental recovery, spike handling
• Application master state checkpointed
  ᵒ Snapshot of physical (and logical) plan
  ᵒ Execution layer change log
Checkpointing State
• Distributed, asynchronous
• No artificial latency
• Periodic callbacks
• Pluggable storage
Buffer Server & Recovery
• In-memory PubSub
• Stores results until committed
• Backpressure / spillover to disk
• Ordering, idempotency
[Figure: Operator 1 and its Buffer Server in Container 1 on Node 1, streaming to Operator 2 in Container 2 on Node 2.]
• Independent pipelines (can be used for speculative execution)
• Downstream Operators reset
Recovery Scenario
sum … EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 0
sum … EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 7
sum … EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 10
sum … EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 7
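One reading of the four states above (sum = 0, 7, 10, 7) is sketched below in plain Java. It assumes BW/EW are begin/end window markers, the checkpoint is taken at the end of window 1, and the failure happens after the first tuple of window 2. `SumRecovery` is a hypothetical class, not Apex code.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of checkpoint, failure and replay for a sum operator.
final class SumRecovery {
    int sum = 0;        // operator state
    int checkpoint = 0; // last state saved at a window boundary

    void process(int... tuples) {
        for (int t : tuples) {
            sum += t;
        }
    }

    void saveCheckpoint() { checkpoint = sum; }

    void restore() { sum = checkpoint; }

    // Returns the sums observed at each step of the scenario.
    static List<Integer> run() {
        List<Integer> observed = new ArrayList<>();
        SumRecovery op = new SumRecovery();
        observed.add(op.sum);  // before any data
        op.process(1, 2, 4);   // window 1 (BW1 ... EW1)
        op.saveCheckpoint();   // assumed: checkpoint at end of window 1
        observed.add(op.sum);
        op.process(3);         // first tuple of window 2, then the operator fails
        observed.add(op.sum);
        op.restore();          // restart from checkpointed state
        observed.add(op.sum);
        op.process(3, 1);      // window 2 replayed in full from the buffer
        observed.add(op.sum);
        return observed;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because the replay starts at the window boundary that matches the checkpoint, the partially processed tuple 3 is not counted twice.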
Processing Guarantees
At-least-once
• On recovery data will be replayed from a previous checkpoint
  ᵒ No messages lost
  ᵒ Default, suitable for most applications
• Can be used to ensure data is written once to store
  ᵒ Transactions with meta information, Rewinding output, Feedback from external entity, Idempotent operations
At-most-once
• On recovery the latest data is made available to operator
  ᵒ Useful in use cases where some data loss is acceptable and latest data is sufficient
Exactly-once
• At-least-once + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly once behavior
End-to-End Exactly Once
• Important when writing to external systems
• Data should not be duplicated or lost in the external system in case of application failures
• Common external systems
  ᵒ Databases
  ᵒ Files
  ᵒ Message queues
• Exactly-once = at-least-once + idempotency + consistent state
• Data duplication must be avoided when data is replayed from checkpoint
  ᵒ Operators implement the logic dependent on the external system
  ᵒ Platform provides checkpointing and repeatable windowing
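A minimal sketch of the "at-least-once + idempotency" idea, assuming the output operator stores the last committed window id together with the data (as it could inside one database transaction). `IdempotentSink` and its in-memory map are hypothetical stand-ins for an external store; this is not an Apex output operator.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch: replayed windows are detected by id and skipped.
final class IdempotentSink {
    // Stand-in for the external system; the committed window id is stored with the data.
    private final Map<String, Long> store = new HashMap<>();
    private long lastCommittedWindow = -1;

    // Applies all updates of one window as a unit; returns false for a replayed window.
    boolean writeWindow(long windowId, Map<String, Long> increments) {
        if (windowId <= lastCommittedWindow) {
            return false; // already written before the failure, ignore the replay
        }
        increments.forEach((key, value) -> store.merge(key, value, Long::sum));
        lastCommittedWindow = windowId;
        return true;
    }

    long get(String key) {
        return store.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        IdempotentSink sink = new IdempotentSink();
        sink.writeWindow(1, Collections.singletonMap("a", 2L));
        sink.writeWindow(2, Collections.singletonMap("a", 3L));
        sink.writeWindow(2, Collections.singletonMap("a", 3L)); // replay after recovery
        System.out.println(sink.get("a"));
    }
}
```

The sketch relies on windows being repeatable: a replayed window must carry the same id and the same content as the original.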
Scalability
[Figure: NxM partitions and unifiers.
 Logical DAG: 0 → 1 → 2 → 3.
 Physical DAG with operator 1 with 3 partitions (1a, 1b, 1c), merged by a Unifier.
 Physical DAG with (1a, 1b, 1c) and (2a, 2b): bottleneck on intermediate Unifier.
 Physical DAG with (1a, 1b, 1c) and (2a, 2b): no bottleneck, with a Unifier in front of each of 2a and 2b.]
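The role of partitions and a unifier can be sketched in plain Java: keys are spread over N partitions, each partition counts independently, and a unifier merges the partial results. Hash partitioning and the `PartitionedCount` class are assumptions for illustration, not the Apex partitioning API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of N partitions followed by a unifier.
final class PartitionedCount {

    // Assumed routing rule: hash of the key modulo the partition count.
    static int partitionFor(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    static Map<String, Long> countPartitioned(List<String> keys, int partitions) {
        // Each partition keeps its own partial state.
        List<Map<String, Long>> partials = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            partials.add(new HashMap<>());
        }
        for (String key : keys) {
            partials.get(partitionFor(key, partitions)).merge(key, 1L, Long::sum);
        }
        // Unifier: merge the partial results into one output.
        Map<String, Long> unified = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> unified.merge(key, count, Long::sum));
        }
        return unified;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(countPartitioned(keys, 3));
    }
}
```

The unified result does not depend on the number of partitions, which is what allows the partition count to be chosen for throughput alone.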
Advanced Partitioning
[Figure: Parallel Partition. Logical DAG: 0 → 1 → 2 → 3 → 4. Physical DAG: operator 1 partitioned (1a, 1b) with a Unifier before 2. Physical DAG with Parallel Partition: pipelines 1a → 2a → 3a and 1b → 2b → 3b, with a Unifier before 4.]
[Figure: Cascading Unifiers. Logical Plan: uopr → dopr. Execution Plan for N = 4, M = 1: uopr1 to uopr4 feed one unifier in the dopr container through a single NIC. Execution Plan for N = 4, M = 1, K = 2 with cascading unifiers: intermediate unifiers run in their own containers before the final unifier and dopr.]
Dynamic Partitioning
[Figure: physical plan before and after repartitioning at runtime, with operator 2 growing from partitions 2a, 2b to 2a, 2b, 2c, 2d and operator 3 split into 3a, 3b. Unifiers not shown.]
• Partitioning change while application is running
  ᵒ Change number of partitions at runtime based on stats
  ᵒ Determine initial number of partitions dynamically
    • Kafka operators scale according to number of Kafka partitions
  ᵒ Supports re-distribution of state when number of partitions changes
  ᵒ API for custom scaler or partitioner
How dynamic partitioning works
• Partitioning decision (yes/no) by trigger (StatsListener)
  ᵒ Pluggable component, can use any system or custom metric
  ᵒ Externally driven partitioning example: KafkaInputOperator
• Stateful!
  ᵒ Uses checkpointed state
  ᵒ Ability to transfer state from old to new partitions (partitioner, customizable)
  ᵒ Steps:
    • Call partitioner
    • Modify physical plan, rewrite checkpoints as needed
    • Undeploy old partitions from execution layer
    • Release/request container resources
    • Deploy new partitions (from rewritten checkpoint)
  ᵒ No loss of data (buffered)
  ᵒ Incremental operation, partitions that don't change continue processing
• API: Partitioner interface
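The "transfer state from old to new partitions" step can be illustrated with a plain-Java sketch that redistributes keyed counts when the partition count changes. The `Repartition` class and the hash-based routing are assumptions for illustration; the sketch does not implement or call the Apex Partitioner interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch: rewrite per-partition state for a new partition count.
final class Repartition {

    // Moves every (key, count) entry from the old partitions to the partition
    // that owns the key under the new count (assumed rule: hash modulo count).
    static List<Map<String, Long>> repartition(List<Map<String, Long>> oldPartitions, int newCount) {
        List<Map<String, Long>> fresh = new ArrayList<>();
        for (int i = 0; i < newCount; i++) {
            fresh.add(new HashMap<>());
        }
        for (Map<String, Long> old : oldPartitions) {
            old.forEach((key, count) ->
                    fresh.get(Math.floorMod(key.hashCode(), newCount)).merge(key, count, Long::sum));
        }
        return fresh;
    }

    // Sum of all counts, used to check that no state is lost.
    static long total(List<Map<String, Long>> partitions) {
        long sum = 0;
        for (Map<String, Long> partition : partitions) {
            for (long count : partition.values()) {
                sum += count;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Long> p0 = new HashMap<>();
        p0.put("a", 3L);
        p0.put("b", 2L);
        Map<String, Long> p1 = new HashMap<>();
        p1.put("c", 1L);
        List<Map<String, Long>> fresh = repartition(Arrays.asList(p0, p1), 3);
        System.out.println(fresh + " total=" + total(fresh));
    }
}
```

In the sketch the old state plays the role of the checkpointed state, and the returned list plays the role of the rewritten checkpoints from which new partitions would be deployed.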
Compute Locality
• By default operators are deployed in containers (processes) on different nodes across the Hadoop cluster
• Locality options for streams
  ᵒ RACK_LOCAL: Data does not traverse network switches
  ᵒ NODE_LOCAL: Data transfer via loopback interface, frees up network bandwidth
  ᵒ CONTAINER_LOCAL: Data transfer via in memory queues between operators, does not require serialization
  ᵒ THREAD_LOCAL: Data passed through call stack, operators share thread
• Host Locality
  ᵒ Operators can be deployed on specific hosts
• (Anti-)Affinity
  ᵒ Ability to express relative deployment without specifying a host
Performance: Throughput vs. Latency?
• https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
• http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
High-Throughput and Low-Latency
• Apex, Flink w/ 4 Kafka brokers: 2.7 million events/second, Kafka latency limit
• Apex w/o Kafka and Redis: 43 million events/second, with more than 90 percent of events processed with a latency of less than 0.5 seconds
• https://www.datatorrent.com/blog/throughput-latency-and-yahoo/
Recent Additions & Roadmap
• Declarative Java API
• Windowing Semantics following Beam model
• Scalable state management
• SQL support using Apache Calcite
• Apache Beam Runner, SAMOA integration

• Enhanced support for Batch Processing
• Support for Mesos
• Encrypted Streams
• Python support for operator logic and API
• Replacing operator code at runtime
• Dynamic attribute changes
• Named checkpoints
Enterprise Tools
Monitoring Console
[Figure: monitoring console screenshots, Logical View and Physical View.]
Real-Time Dashboards
Maximize Revenue w/ real-time insights
PubMatic is the leading marketing automation software company for publishers. Through real-time analytics, yield management, and workflow automation, PubMatic enables publishers to make smarter inventory decisions and improve revenue performance.
Business Need
• Ingest and analyze high volume clicks & views in real-time to help customers improve revenue
  ᵒ 200K events/second data flow
• Report critical metrics for campaign monetization from auction and client logs
  ᵒ 22 TB/day data generated
• Handle ever increasing traffic with efficient resource utilization
• Always-on ad network
Apex based Solution
• DataTorrent Enterprise platform, powered by Apache Apex
• In-memory stream processing
• Comprehensive library of pre-built operators including connectors
• Built-in fault tolerance
• Dynamically scalable
• Real-time query from in-memory state
• Management UI & Data Visualization console
Client Outcome
• Helps PubMatic deliver ad performance insights to publishers and advertisers in real-time instead of 5+ hours
• Helps Publishers visualize campaign performance and adjust ad inventory in real-time to maximize their revenue
• Enables PubMatic to reduce OPEX with efficient compute resource utilization
• Built-in fault tolerance ensures customers can always access ad network
Industrial IoT applications
GE is dedicated to providing advanced IoT analytics solutions to thousands of customers who are using their devices and sensors across different verticals. GE has built a sophisticated analytics platform, Predix, to help its customers develop and execute Industrial IoT applications and gain real-time insights as well as actions.
Business Need
• Ingest and analyze high-volume, high speed data from thousands of devices, sensors per customer in real-time without data loss
• Predictive analytics to reduce costly maintenance and improve customer service
• Unified monitoring of all connected sensors and devices to minimize disruptions
• Fast application development cycle
• High scalability to meet changing business and application workloads
Apex based Solution
• Ingestion application using DataTorrent Enterprise platform
• Powered by Apache Apex
• In-memory stream processing
• Built-in fault tolerance
• Dynamic scalability
• Comprehensive library of pre-built operators
• Management UI console
Client Outcome
• Helps GE improve performance and lower cost by enabling real-time Big Data analytics
• Helps GE detect possible failures and minimize unplanned downtimes with centralized management & monitoring of devices
• Enables faster innovation with short application development cycle
• No data loss and 24x7 availability of applications
• Helps GE adjust to scalability needs with auto-scaling
Smart energy applications
Silver Spring Networks helps global utilities and cities connect, optimize, and manage smart energy and smart city infrastructure. Silver Spring Networks receives data from over 22 million connected devices and conducts 2 million remote operations per year.
Business Need
• Ingest high-volume, high speed data from millions of devices & sensors in real-time without data loss
• Make data accessible to applications without delay to improve customer service
• Capture & analyze historical data to understand & improve grid operations
• Reduce the cost, time, and pain of integrating with 3rd party apps
• Centralized management of software & operations
Apex based Solution
• DataTorrent Enterprise platform, powered by Apache Apex
• In-memory stream processing
• Pre-built operators/connectors
• Built-in fault tolerance
• Dynamically scalable
• Management UI console
Client Outcome
• Helps Silver Spring Networks ingest & analyze data in real-time for effective load management & customer service
• Helps Silver Spring Networks detect possible failures and reduce outages with centralized management & monitoring of devices
• Enables fast application development for faster time to market
• Helps Silver Spring Networks scale with easy to partition operators
• Automatic recovery from failures
Who is using Apex?
• Powered by Apex
  ᵒ http://apex.apache.org/powered-by-apex.html
  ᵒ Also using Apex? Let us know to be added: [email protected] or @ApacheApex
• Pubmatic
  ᵒ https://www.youtube.com/watch?v=JSXpgfQFcU8
• GE
  ᵒ https://www.youtube.com/watch?v=hmaSkXhHNu0
  ᵒ http://www.slideshare.net/ApacheApex/ge-iot-predix-time-series-data-ingestion-service-using-apache-apex-hadoop
• SilverSpring Networks
  ᵒ https://www.youtube.com/watch?v=8VORISKeSjI
  ᵒ http://www.slideshare.net/ApacheApex/iot-big-data-ingestion-and-processing-in-hadoop-by-silver-spring-networks
Q&A
Resources
• http://apex.apache.org/
• Learn more - http://apex.apache.org/docs.html
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups - https://www.meetup.com/topics/apache-apex/
• Examples - https://github.com/DataTorrent/examples
• Slideshare - http://www.slideshare.net/ApacheApex/presentations
• https://www.youtube.com/results?search_query=apache+apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/