Flash Storage Complementing a Data Lake for Real-Time Insight

Dr. Sanhita Sarkar, Global Director, Analytics Software Development

August 7, 2018

Agenda

1 Delivering insight along the entire spectrum from edge to core
2 Coupling big data with fast data: complementing a data lake for real-time insight
3 Stream ingestion to ActiveScale™ object storage system as a component within a data lake

4 IoT speed layer – implementation

5 A real-world IoT use case

Evolving Landscape from Core to Edge

[Figure: storage portfolio from core to edge, spanning consumer devices, platforms & systems, embedded & removable storage, and solid state & hard disk drives]

Delivering Value to Data from Core to Edge
Enable tight coupling between Big Data and Fast Data

[Figure: Big Data (data aggregation, batch analytics, machine learning, modeling, artificial intelligence; scale) coupled with Fast Data (streaming analytics; performance), delivering insight, mobility, real-time prediction, results, prescription, and smart machines]

Delivering Value to Data from Core to Edge
Enable data sharing and utilization of system resources across diverse applications

[Figure: in the past, data was held captive by a single application, each app performing its own capture, create, store, transform, and deliver steps; currently and in the future, data is pooled and shared by multiple applications across the create, store, transform, access, and deliver stages]

Agenda

1 Delivering insight along the entire spectrum from edge to core
2 Coupling big data with fast data: complementing a data lake for real-time insight
3 Stream ingestion to ActiveScale™ object storage system as a component within a data lake

4 IoT speed layer – implementation

5 A real-world IoT use case

A Unified Workflow for Analytics on the 3rd Platform

[Figure: a lambda architecture on the 3rd platform. Streaming event data from business processes (vehicles, transactions, …) is captured into two paths. In the batch layer (big data), Hadoop® stores all raw data and Map/Reduce produces batch views. In the speed layer (fast data), streaming analytics on flash transforms the data into incremental real-time views. A serving layer merges the batch and real-time views for visualization, machine learning, queries, models, and metrics.]

An Enhanced Workflow for Analytics on the 3rd Platform

[Figure: the same lambda architecture, enhanced. The ActiveScale™ object storage system is the repository of all raw data, metadata, and process data, accessed over S3 / NFS; other non-Map/Reduce apps and data sources (images, backups, documents, …, business process models) can use ActiveScale™ as well. Streaming event data from business processes (vehicles, transactions, …) is ingested, transformed, and persisted by Apache Kafka® through an S3 connector. In the batch layer, the Hadoop® data lake (YARN) reads all raw data through the ActiveScale™ S3A connector to build batch views. In the speed layer, streaming analytics on rack-scale flash (a shared PCIe all-flash enclosure, IntelliFlash™, with NVMe™ SSDs providing memory-class latency, I/O throughput, and IOPS) turns transformed data into incremental real-time views. The serving layer merges batch views (index/tag/search/ML) and real-time views to serve real-time and batch queries, search, machine learning, and dashboards over smart data.]
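To make the S3A path concrete: a batch job can query raw data in place on the object store by pointing the Hadoop S3A connector at its S3 endpoint. The sketch below is a minimal PySpark illustration, assuming a hypothetical endpoint activescale.example.com, bucket datalake, and placeholder credentials; the fs.s3a.* keys are standard Hadoop S3A options, and the deck does not prescribe this exact job.

```python
# Minimal PySpark batch query over data held in an S3-compatible object store
# (e.g., ActiveScale) via the Hadoop S3A connector. Endpoint, bucket, and
# credentials below are placeholders, not values from the presentation.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("batch-views-over-s3a")
    # Point S3A at the object store's S3 endpoint instead of AWS.
    .config("spark.hadoop.fs.s3a.endpoint", "https://activescale.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    # Path-style access is commonly required by non-AWS S3 implementations.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read raw event data in place and compute a simple batch view,
# without copying the data out of the object store first.
events = spark.read.json("s3a://datalake/raw/events/")
batch_view = events.groupBy("sensor_id").count()
batch_view.write.mode("overwrite").parquet("s3a://datalake/views/batch/event_counts/")
```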

Aggregated vs. Disaggregated Architectures for Analytics

Co-located compute and storage
• Applications in the application tier communicate over the data network with an integrated compute-and-storage (HDD/flash) data store tier and with a back-end object storage system.
• Homogeneous across resources.

Disaggregated compute and storage
• Applications communicate over the data network with a disaggregated data store tier: a pool of compute, a pool of HDDs, a shared pool of flash, and an object storage system.
• Independent scaling of resources.
• Heterogeneous and interoperable across resources.

Agenda

1 Delivering insight along the entire spectrum from edge to core
2 Coupling big data with fast data: complementing a data lake for real-time insight
3 Stream ingestion to ActiveScale™ object storage system as a component within a data lake

4 IoT speed layer – implementation

5 A real-world IoT use case

Stream Ingestion from Apache Kafka® to ActiveScale™

[Architecture diagram: eight concurrent client nodes stream data over 10 GigE connections to a four-node Apache Kafka® cluster, with 2 NVMe™ SSDs in a shared PCIe all-flash enclosure allocated to each Kafka® node. A Kafka® Connect cluster of six workers, each hosting 32 connectors (192 in total), sinks the data over 100 GigE connections to an ActiveScale™ system fronted by three Scaler nodes. The diagram shows ~8 GB/s flowing into the Kafka® cluster and ~4.5 GB/s into ActiveScale™.]

• For each benchmark run on 8 clients: the record size is 500 bytes/message; each run sends 800 M messages at 8 GB/s, publishing to a Kafka® topic 32 M messages at a time.
• For each benchmark run, the Kafka® cluster can process 16 M messages/sec for the message size of 500 bytes, with an average latency of acknowledgment to the writing clients of 75 ms/message.
• Each connector in the Kafka® Connect cluster works as a subscriber to a Kafka® topic in the Kafka® cluster and sinks the data to ActiveScale™.
• A total of 192 connectors writes to ActiveScale™, saturating the write throughput to a single ActiveScale™ system.
• The achieved average and maximum parallel write throughput to ActiveScale™ are ~4.28 GB/s and 6 GB/s, respectively.
• The projected throughput will increase by 2x with 12 worker nodes in the Kafka® Connect cluster.
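As a rough illustration of the client side of such a run, the sketch below publishes fixed-size 500-byte records to a Kafka topic with the kafka-python client. The broker list, topic name, message count, and producer tuning are placeholder assumptions; the presentation does not describe its actual benchmark harness.

```python
# Sketch of a load-generating client: publish fixed-size 500-byte records
# to a Kafka topic. Broker list, topic, and counts are placeholders; this
# is not the benchmark harness behind the reported numbers.
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka-node-1:9092", "kafka-node-2:9092"],
    acks=1,                    # wait for the partition leader's acknowledgment
    linger_ms=5,               # allow a small batching delay to build batches
    batch_size=256 * 1024,     # larger batches amortize per-request overhead
)

payload = b"x" * 500           # 500-byte record, matching the benchmark's size

for _ in range(1_000_000):     # placeholder count; the runs sent 800 M messages
    producer.send("ingest-topic", payload)

producer.flush()               # block until all buffered records are delivered
producer.close()
```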

Apache Kafka® S3 Connector to ActiveScale™: Performance

[Charts: average and maximum write throughput (GB/s; higher is better) of the Kafka®-S3 sink connector as a function of (1) the number of connectors, one connector per partition, from 3 to 384, peaking at ~4.28 GB/s average and ~6.09 GB/s maximum at 192 connectors; (2) the S3 upload part size, from 32 MB to 256 MB, at a commit size of 1 million messages or 500 MB, peaking at 128 MB; and (3) the file/object commit size, from 5 MB to 2000 MB, at a 64 MB S3 upload part size, peaking at 500 MB.]

• An optimal throughput is achieved with a total of 192 connectors writing to a single ActiveScale™ system.
• An optimal and stable performance is achieved with an S3 upload part size of 128 MB and a commit size of 1 million messages (500 MB), irrespective of the number of connector partitions. 48 connectors were used for these performance tests.
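These three knobs map onto standard Kafka Connect S3 sink settings: tasks.max (connectors/partitions), s3.part.size (upload part size in bytes), and flush.size (records per committed object, i.e., the commit or flush size). Below is a minimal sketch of registering such a connector through the Connect REST API, assuming the Confluent S3 sink connector is installed; the host names, bucket, topic, and region are placeholders, not values from the deck.

```python
# Register an S3 sink connector against an S3-compatible object store through
# the Kafka Connect REST API. Host names, bucket, and topic are placeholders.
import requests

config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "ingest-topic",
    "tasks.max": "192",                      # one task/connector per partition
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "s3.bucket.name": "kafka-sink",
    "s3.region": "us-east-1",                # placeholder; ignored semantics vary
    "store.url": "https://activescale.example.com",  # S3-compatible endpoint
    "s3.part.size": str(128 * 1024 * 1024),  # 128 MB multipart upload part size
    "flush.size": "1000000",                 # commit an object every 1 M records
}

resp = requests.post(
    "http://connect.example.com:8083/connectors",
    json={"name": "activescale-s3-sink", "config": config},
)
resp.raise_for_status()
print(resp.json())
```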

Apache Kafka® Performance on Flash vs. HDDs

[Chart: Kafka® acknowledgment latency per node (ms/message; lower is better): ~160 ms with 10 HDDs (5 HDDs/LUN, 2 LUNs; 6 TB), ~133 ms with 1 SSD (3 TB), 75 ms with 2 SSDs (6 TB), and ~60 ms with 3 SSDs (9 TB).]

• With 16 M messages/sec, for the message size of 500 bytes, the average latency of acknowledgment to the clients by Kafka® is 75 ms/message, with 2 NVMe™ SSDs in a shared PCIe all-flash enclosure allocated to each of the Kafka® nodes.
• The low latency of Kafka® with 2 SSDs allocated to each node of a 4-node cluster has allowed achieving a rate of 16 M messages/s.
• With 1 SSD ≈ 10 HDDs in terms of latency, it would be necessary to use 20 HDDs per node for Kafka® to achieve the above message rate.

Agenda

1 Delivering insight along the entire spectrum from edge to core
2 Coupling big data with fast data: complementing a data lake for real-time insight
3 Stream ingestion to ActiveScale™ object storage system as a component within a data lake

4 IoT speed layer – implementation

5 A real-world IoT use case

IoT Speed Layer: Stream Processing and Storing

Requirements: Stream Processing
• Process events in “near-real time”: low latency;
• High velocity;
• Guaranteed processing;
• Fault-tolerant;
• Scales ‘easily’.

Choices
• Apache Spark™ Streaming
  – Low latency, high throughput, fault-tolerant;
  – Uses the Spark core for large-scale stream processing and in-flight messages;
  – A component of the Apache Spark™ ecosystem;
  – Uses: ETL on streaming data, operational dashboards, anomaly & fraud detection, NLP analysis, sensor data analytics;
  – High adoption rate.
• Others: Apache Storm™, Apache Flink®, Apache Apex™, Apache Samza™, Apache NiFi™, etc.

Requirements: Storing
• Fast/real-time lookup storage;
• Support for write-intensive workloads;
• Support for concurrent reads and writes of multiple data sizes within a minimal response time.

Choices
• NoSQL is a good option: Apache Cassandra®
  – Best-in-class, scalable NoSQL data management system;
  – Parallel, distributed, high-throughput, 100% fault-tolerant;
  – No dependency on Hadoop®; has a Spark connector (used in the sketch below);
  – Typical applications are IoT, fraud detection, recommendation engines, and messaging apps.
• Others: HBase®, Redis, Accumulo®, MongoDB™, etc.
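To make the speed-layer wiring concrete, here is a minimal sketch of consuming a Kafka topic and appending rows to a Cassandra table, written in today's PySpark Structured Streaming form (the deck predates this style) with the DataStax spark-cassandra-connector. The broker addresses, topic, schema, keyspace, and table are assumptions, not details from the presentation.

```python
# Minimal speed-layer sketch: read events from Kafka and persist incremental
# real-time views to Cassandra. Requires the spark-sql-kafka and
# spark-cassandra-connector packages on the Spark classpath; names below
# (brokers, topic, keyspace, table) are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = (
    SparkSession.builder.appName("iot-speed-layer")
    .config("spark.cassandra.connection.host", "cassandra-node-1")
    .getOrCreate()
)

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("reading", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-node-1:9092")
    .option("subscribe", "ingest-topic")
    .load()
    # Kafka delivers raw bytes; parse the JSON payload into typed columns.
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

def write_to_cassandra(batch_df, batch_id):
    # Each micro-batch is appended to the real-time view table.
    (batch_df.write.format("org.apache.spark.sql.cassandra")
        .options(keyspace="iot", table="realtime_view")
        .mode("append")
        .save())

query = events.writeStream.foreachBatch(write_to_cassandra).start()
query.awaitTermination()
```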

Apache Spark™ Streaming Performance

[Charts: Spark Streaming processing time (ms; lower is better) vs. the rate of incoming events (10K to 120K events/sec) and vs. the number of Kafka® partitions/streams, falling from ~55,189 ms at 1 partition/1 stream to ~45,901 ms at 96 partitions/96 streams for 100K events/sec; and scalability of concurrent streaming applications at 100K events/s/app on 4 nodes (24c, 48t, 96 GB/node): ~47 s processing time at ~2% CPU/node for 1 app, ~48 s at ~15% for 8 apps, and ~55 s at ~75% for 32 apps (projected).]

• Spark Streaming is implemented on 4 nodes (24c, 48t, 96 GB) with 2 NVMe™ SSDs in a shared PCIe all-flash enclosure allocated to each node.
• The processing time of a “single” streaming application reaches a threshold at 100K events/s.
• Increasing the number of partitions or streams for a “single” application does not improve the processing time, due to the overhead of aggregating results across multiple partitions.
• From the measured data, it is estimated that multiple streaming applications could easily execute concurrently (a total of 3.2 M events/s) within a 55-sec response time and a CPU usage of 75-80%.

Agenda

1 Delivering insight along the entire spectrum from edge to core
2 Coupling big data with fast data: complementing a data lake for real-time insight
3 Stream ingestion to ActiveScale™ object storage system as a component within a data lake

4 IoT speed layer – implementation

5 A real-world IoT use case

Real-world Use Case: Unified Streaming and Batch Analytics on Climate IoT Data

[Figure: the enhanced lambda architecture applied to climate IoT data. Raw data feeds from weather stations are ingested by Apache Kafka® running on a shared PCIe all-flash enclosure and persisted through the Kafka® S3 connector to the ActiveScale™ object storage system, the storage repository of all raw data. In the batch layer, a Hadoop® data lake (YARN, Apache Hive™) reads the raw data through the ActiveScale™ S3A connector for Hadoop® to build batch views over tornadoes' past history. In the speed layer, Spark™ Streaming transforms the data into incremental real-time views held in Cassandra®, both on shared PCIe all-flash enclosures, producing tornado alerts. A serving layer merges the batch and real-time views for smart analytics: tornadoes present and past, forecasting, and event-driven data.]
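As an illustration only, a real-time view such as the tornado alerts could map onto a Cassandra® table along the following lines; the keyspace, columns, replication settings, and sample row are hypothetical, not taken from the deck.

```python
# Hypothetical Cassandra schema and write path for the tornado-alert
# real-time view; keyspace, table, and columns are illustrative only.
from datetime import datetime, timezone
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["cassandra-node-1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS climate
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS climate.tornado_alerts (
        station_id text,
        alert_time timestamp,
        severity   int,
        details    text,
        PRIMARY KEY (station_id, alert_time)
    ) WITH CLUSTERING ORDER BY (alert_time DESC)
""")

# The streaming job would append one row per detected alert; the clustering
# order makes "latest alerts per station" a fast, natural query.
session.execute(
    "INSERT INTO climate.tornado_alerts (station_id, alert_time, severity, details) "
    "VALUES (%s, %s, %s, %s)",
    ("KS-042", datetime.now(timezone.utc), 3, "rotation signature detected"),
)
```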

Summary and Best Practices

For efficient coupling between big data and fast data in IoT use cases, it is recommended to:
• Implement the ActiveScale™ object storage system as a component within a data lake, for storing all the incoming raw streaming data from the ingestion layer.
• Tune the number of connectors configured in the Kafka® Connect cluster, the S3 upload part size, and the file/object commit size (also known as flush size), for ingesting streaming data from Apache Kafka® to the ActiveScale™ object storage system at an optimal write throughput.
• Leverage the ActiveScale™ S3A connector for Hadoop® to perform long-running batch analytics and ad-hoc queries against the petabyte-scale data stored in ActiveScale™, without any physical data movement.
• Implement a “shared” all-flash storage for the speed layer, allocating it to compute nodes for the balanced throughput, IOPS, and capacity required by the speed-layer applications, including ingestion, stream processing, and the real-time store.
• Implement the “shared” all-flash storage to complement an ActiveScale™ object storage system, achieving disaggregation of compute and storage for the whole solution without compromising long-term data persistence, centralization, quality, durability, high availability, or cost.
• Tune the number of concurrent applications, the number of streams per application, and the allocation of compute and JVM memory to achieve optimal performance in the speed layer.

Flash Memory Summit 2018, Santa Clara, CA ©2018 Western Digital Corporation or its affiliates. All rights reserved. 20 Western Digital, the Western Digital logo, ActiveScale, and IntelliFlash are registered trademarks or trademarks of Western Digital Corporation or its affiliates in the US and/or other countries. Apache, Apache Apex, , Apache Samza, Apache Spark, Apache Storm, Accumulo, Cassandra, Flink, Hadoop, HBase, and Kafka are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. The NVMe word mark is a trademark of NVM Express, Inc. All other marks are the property of their respective owners.

Flash Memory Summit 2018, Santa Clara, CA ©2018 Western Digital Corporation or its affiliates. All rights reserved.