AWS Analytics Modernization
Modernize Your Big Data Platform with AWS Analytics Services

Jay Elango Analytics Specialist Architect

© 2021, Amazon Web Services, Inc. or its Affiliates.

Topics

• Challenges associated with on-premises Hadoop Big Data Platform

• New realities facing the organizations

• Lake house architecture on AWS

• Value of moving to a managed service with Amazon EMR

• EMR Migration Programs

• Customer examples

Challenges with on-premises Hadoop Big Data platform

Compute and storage grow together

• Storage grows along with compute
• Compute requirements vary

[Figure label: Tightly coupled]

Replication adds to cost

• Data is replicated several times (3x)
• Typically only in one

Under-utilized or scarce resources

[Figure: load chart (y-axis 0 to 120, x-axis 1 to 26) with three annotated levels: steady state, weekly peaks, and re-processing]

Contention for the same resources

[Figure labels: Compute bound, Memory bound]

Separation of resources creates data silos

Team A

Limits on fast-following application versions

With a monolithic cluster, dependencies among downstream applications can make it hard to upgrade versions. By not upgrading, organizations could be limiting innovation.

• Large-scale transformation: MapReduce, Hive, Pig, Spark
• Interactive queries: Impala, Spark SQL, Presto
• Machine learning: Spark ML, MXNet, TensorFlow
• Interactive notebooks: Jupyter, Zeppelin
• NoSQL: HBase

The new realities organizations are facing

New realities – Organizations want more value from their data

• Growing exponentially
• From new sources
• Increasingly diverse
• Used by many people
• Analyzed by many applications

What are organizations looking to build? A modernized data platform

Modernization goals:
▪ Drive innovation: enable Customer Experience/Journey Analytics/360 Analytics and build product intelligence by deriving insights from various data sources and formats.
▪ Business agility: enable the business to scale infrastructure, manage performance, and optimize cost.

New realities – Organization’s modernized platform needs

[Diagram labels: data sources (OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social); DW Silo 1, DW Silo 2, Data Lake; Relational databases, Non-relational databases, Big data processing, Business Intelligence, Log analytics, Machine learning, Data warehousing]

New realities - Organizations moving to Lake House architecture

• Scalable data lakes
• Purpose-built data services
• Seamless data movement
• Unified governance
• Performant and cost-effective

[Diagram labels: Data Lake; Relational databases, Non-relational databases, Big data processing, Log analytics, Machine learning, Data warehousing]

Lake House architecture on AWS

• Scalable data lakes
• Purpose-built data services
• Seamless data movement
• Unified governance
• Performant and cost-effective

[Diagram labels: Amazon S3; Amazon Aurora, Amazon DynamoDB, Amazon EMR, Amazon Athena, Amazon Elasticsearch Service, Amazon SageMaker, Amazon Redshift]

Move to managed AWS Analytics services

• Amazon EMR: Spark, Hive, Presto, Hudi, HBase
• Amazon Elasticsearch Service: operational analytics (Elasticsearch, Logstash, Kibana)
• Amazon Managed Streaming for Apache Kafka: real-time analytics
• Amazon Kinesis Data Analytics for Apache Flink: real-time analytics

Amazon EMR
Easily run Spark, Hadoop, Hive, Presto, HBase, and other big data frameworks

• Easy setup, management, and monitoring: automate provisioning, configuring, and tuning
• Latest open-source framework updates within 30 days: get the latest, stable, open-source releases
• Automatically scale up and down: manage cluster size based on utilization to reduce costs
• Simple and predictable pricing: per-second pricing, and save 50%–80% with Amazon EC2 Spot and Reserved Instances

The value of moving to managed Amazon EMR for big data platforms

Foundation 1: Decouple storage and compute

Foundation 2: Amazon S3 is your persistent data store

• Unmatched durability, availability, and scalability
• Strong read-after-write consistency
• Support for transactions
• Easiest to use with cost optimization: Intelligent-Tiering
• Broadest portfolio of analytics tools
• Best security (including row and column level), compliance, and audit capabilities
• Most ways to get data in
• Cold storage and archive capabilities

EMR Benefit 1: Turn off clusters

[Diagram label: Amazon S3]

EMR Benefit 2: Built-in Disaster Recovery

[Diagram labels: Cluster 1, Cluster 2, Cluster 3, Cluster 4; Amazon S3; two Availability Zones]

EMR Benefit 3: Logical separation of jobs/applications

Re-architect monolithic to purpose-built clusters by:
• Creating transient and/or persistent clusters
• Separating clusters by application
• Separating clusters by application version
• Isolating department-specific clusters

Design considerations:
• How do you submit jobs or build pipelines
• Persisting your data in S3 vs.
• Storing metadata off the cluster
• How long does the job run
• What applications are needed

[Figure labels: Traditional Monolithic Cluster, Purpose-built Clusters, Ad-Hoc]

EMR Benefit 4: Auto-scaling Clusters (Persistent / Transient)

Amazon EMR Cluster

Amazon EMR Managed Scaling: Reduce costs by up to 60%

• Completely managed environment for automatically scaling clusters
• No configurations required except min/max capacity
• More data points and faster reaction time
• Can save 20%-60% costs depending on the workload pattern
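To make the "min/max capacity" point concrete, here is a minimal Python sketch of what such a policy could look like. The helper name `build_managed_scaling_policy` and the capacity values are illustrative assumptions, not taken from the slides; the dictionary follows the `ManagedScalingPolicy` shape accepted by the boto3 EMR client's `put_managed_scaling_policy` call, which is shown only as a comment because it needs AWS credentials and a real cluster ID.

```python
def build_managed_scaling_policy(min_units, max_units, unit_type="Instances"):
    """Build a managed scaling policy that sets only the min/max limits.

    Hypothetical helper for illustration; the returned dict mirrors the
    ManagedScalingPolicy structure used by the EMR API.
    """
    if unit_type not in ("Instances", "InstanceFleetUnits", "VCPU"):
        raise ValueError(f"unsupported unit type: {unit_type}")
    if min_units < 1 or max_units < min_units:
        raise ValueError("need 1 <= min_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": unit_type,
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


# Example values are assumptions: scale between 2 and 20 instances.
policy = build_managed_scaling_policy(min_units=2, max_units=20)
print(policy)

# Applying the policy (placeholder cluster ID, requires AWS credentials):
# import boto3
# boto3.client("emr").put_managed_scaling_policy(
#     ClusterId="j-XXXXXXXXXXXXX", ManagedScalingPolicy=policy)
```

Everything else about scaling behavior is left to the service, which is the point the slide makes.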

EMR Benefit 5: Leverage Spot Instances & Instance Fleets

Auto scale with Spot Instances to reduce cost and run-time

10-node cluster running for 14 hours
Cost = $1 * 10 nodes * 14 hours = Total $140

20-node cluster running for 7 hours (add 10 more nodes of Spot at a 50% discount)
On-Demand cost = $1 * 10 nodes * 7 hours = $70
Spot cost = $0.5 * 10 nodes * 7 hours = $35
Total = $105
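The arithmetic for the two cluster configurations can be checked with a short Python sketch. The $1 per node-hour rate and the 50% Spot discount come from the slide; the function and variable names are illustrative, and the sketch assumes, as the slide does, that doubling the node count halves the run-time.

```python
def cluster_cost(nodes, hours, hourly_rate):
    """Cost in dollars of running `nodes` instances for `hours` hours."""
    return nodes * hours * hourly_rate


ON_DEMAND_RATE = 1.00  # dollars per node-hour (slide's example rate)
SPOT_DISCOUNT = 0.50   # slide assumes Spot at a 50% discount

# Baseline: 10 On-Demand nodes for 14 hours.
baseline = cluster_cost(10, 14, ON_DEMAND_RATE)

# Scaled out: the same 10 On-Demand nodes plus 10 Spot nodes, for 7 hours.
on_demand_part = cluster_cost(10, 7, ON_DEMAND_RATE)
spot_part = cluster_cost(10, 7, ON_DEMAND_RATE * (1 - SPOT_DISCOUNT))
scaled = on_demand_part + spot_part

cost_saving_pct = 100 * (baseline - scaled) / baseline
runtime_saving_pct = 100 * (14 - 7) / 14

print(f"baseline ${baseline:.0f}, scaled ${scaled:.0f}")
print(f"{runtime_saving_pct:.0f}% less run-time, {cost_saving_pct:.0f}% less cost")
```

Real Spot discounts vary by instance type and over time, so the 50% figure should be read as an example input rather than a guaranteed price.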

Results: 50% less run-time (14 hrs → 7 hrs), 25% less cost ($140 → $105)

Diversify Spot and On-Demand Instances via Instance Fleets
• Can mix different instance types and markets (On-Demand or Spot) in one group
• Don’t specify an AZ and we will find the cheapest one

EMR Benefit 6: EMR Self-service with AWS Service Catalog

Configure:
• Standardize
• Enforce consistency and compliance
• Limit access
• Enforce tagging, security groups

Consume:
• Developer autonomy
• One-stop shop
• Automate deployments
• Agile governance

EMR Benefit 7: Amazon EMR differentiated performance

1.7x faster performance than standard Apache Spark 3.0 at 40% of the cost

Up to 2.6x faster performance than open-source Presto 0.238 at 80% of the cost

11.5% average performance improvement with Graviton2

25.7% average cost reduction with Graviton2

EMR Benefit 8: Fully Managed EMR Notebooks

1. Provide an end-to-end data engineering and data science environment using EMR Notebooks, which are based on the popular open-source Jupyter Notebook, to build applications with Apache Spark
2. Attach to and detach from individual clusters; notebooks are automatically backed up to S3
3. Tag-based permissions
4. Support for PySpark, Spark SQL, SparkR, and Scala
5. New features include a visual experience to debug and monitor Spark jobs directly in the off-cluster, persistent Apache Spark History Server using the EMR console; association with Git repositories such as GitHub and Bitbucket; and comparing and merging two different notebooks using the nbdime utility

EMR Benefit 8: EMR Studio integrated development environment

• Easily build and deploy data science code without logging in to the AWS console
• Start notebooks in seconds, run jobs later
• Build production pipelines simply and flexibly
• Save debugging time with native application UIs in one place

EMR Benefit 9: Analysts confirm lowest TCO in the industry

Nov. 2018, IDC report confirms: “EMR provides 57% reduced costs vs. on premise resulting in 342% ROI over 5 years.”

Dec. 2018, Gartner suggests: “AWS remains the largest Hadoop provider in terms of both revenue and user base.”

Feb. 2019, Forrester recognizes: AWS EMR as the Hadoop/Spark (HARK) Leader.

The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave™. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.

EMR Benefit 10: Leverage AWS Lake House Analytic ecosystem

Data Lake on AWS
• Lake Formation: Blueprints, ML Transforms, Data Catalog, Access Control
• AWS Glue

Analytics
• Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight

Machine Learning
• Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly

On-premises Data Movement
• AWS Direct Connect, AWS Storage Gateway, AWS Snowball, AWS Snowmobile

Real-time Data Movement
• AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams

EMR Migration Programs

EMR Migration Program

EMR Migration Guide
• Technical advice to help plan your migration

Free EMR Migration Workshop
• Jumpstart your migration to the cloud

Visit aws.amazon.com/emr/emr-migration/

Email [email protected]

Customer Examples

NHS Digital

Challenge: NHS Digital wanted to modernize its data access environment for users across the UK. The legacy system was slow and expensive to maintain, and users were frustrated with performance issues.

Solution: NHS Digital migrated the dataset from its legacy systems, converted the data into Parquet format, and loaded it into Amazon S3, encrypted with AWS KMS. Amazon EMR processes the data from S3.

Benefit: Performance improved from 137 minutes to 137 seconds (60x faster) using Amazon EMR.

Customer Examples - High impact results with Amazon EMR

near real-time analytics for 140M players

scales 3,000 transient clusters on a daily basis

powers the Predix solution processing 1,000,000 data executions/day

achieves cost savings of 55% when compared to On-Demand pricing and 40% savings when compared to Reserved Instances

computes Zestimates on 100M+ homes in hours instead of 1 day

Customer Examples - On-premises migrations to Amazon EMR

processes 135B events/day and has cost savings of 60% (~$20M)

decreased costs by $600k in less than 5 months

reduced cost of operation and improved Spark performance 3x

saves 75% and is 60% more efficient

re-architects 1 monolithic pipeline into 3 purpose-built clusters

Thank You