AWS Analytics Modernization
Modernize Your Big Data Platform with AWS Analytics Services
Jay Elango, Analytics Specialist Architect
© 2021, Amazon Web Services, Inc. or its Affiliates.
Topics
• Challenges associated with on-premises Hadoop Big Data Platform
• New realities facing organizations
• Lake house architecture on AWS
• Value of moving to a managed service with Amazon EMR
• EMR Migration Programs
• Customer examples
Challenges with on-premises Hadoop Big Data platform
Compute and storage grow together
• Storage grows along with compute
• Compute requirements vary
Tightly coupled
Replication adds to cost
• Data is replicated several times (3x)
• Typically only in one data center
Underutilized or scarce resources
[Chart: resource usage (y-axis 0 to 120) over 26 time periods, annotated with steady state, weekly peaks, and re-processing]
Contention for the same resources
[Diagram labels: Compute bound, Memory bound]
Separation of resources creates data silos
Team A
Limited ability to fast-follow application versions
With a monolithic cluster, dependencies among downstream applications can prevent version upgrades. By not upgrading, organizations could be limiting innovation.
• Large-scale transformation: MapReduce, Hive, Pig, Spark
• Interactive queries: Impala, Spark SQL, Presto
• Machine learning: Spark ML, MXNet, TensorFlow
• Interactive notebooks: Jupyter, Zeppelin
• NoSQL: HBase
The new realities organizations are facing
New realities – Organizations want more value from their data
• Growing exponentially
• From new sources
• Increasingly diverse
• Used by many people
• Analyzed by many applications
What are organizations looking to build? A modernized data platform
Modernization goals:
▪ Drive innovation
▪ Enable Customer Experience/Journey Analytics/360 Analytics and build product intelligence by deriving insights from various data sources and formats.
▪ Business agility
▪ Enable the business to scale infrastructure, manage performance, and optimize cost.
New realities – Organization’s modernized platform needs
[Diagram: data silos (DW Silo 1 and DW Silo 2, each with Business Intelligence) versus a Data Lake connected to relational databases, non-relational databases, big data processing, log analytics, machine learning, and data warehousing. Sources: OLTP, ERP, CRM, LOB, devices, web, sensors, social.]
New realities – Organizations moving to Lake House architecture
• Scalable data lakes
• Purpose-built data services
• Seamless data movement
• Unified governance
• Performant and cost-effective
[Diagram: Data Lake at the center, surrounded by relational databases, non-relational databases, big data processing, log analytics, machine learning, and data warehousing]
Lake House architecture on AWS
• Scalable data lakes
• Purpose-built data services
• Seamless data movement
• Unified governance
• Performant and cost-effective
[Diagram: Amazon S3 at the center, surrounded by Amazon Aurora, Amazon DynamoDB, Amazon EMR, Amazon Athena, Amazon Elasticsearch Service, Amazon SageMaker, and Amazon Redshift]
Move to managed AWS Analytics services
• Amazon EMR: Spark, Hive, Presto, Hudi, HBase
• Amazon Elasticsearch Service: operational analytics (Elasticsearch, Logstash, Kibana)
• Amazon Managed Streaming for Apache Kafka: real-time analytics
• Amazon Kinesis Data Analytics for Apache Flink: real-time analytics (Apache Flink)
Amazon EMR
Easily run Spark, Hadoop, Hive, Presto, HBase, and other big data frameworks
• Automate provisioning, configuring, and tuning: easy setup, management, and monitoring
• Get the latest, stable, open-source releases: latest open-source framework updates within 30 days
• Automatically scale up and down: manage cluster size based on utilization to reduce costs
• Simple and predictable pricing: per-second pricing, and save 50%–80% with Amazon EC2 Spot and Reserved Instances
[Diagram: Amazon EMR alongside Amazon Aurora, Amazon DynamoDB, Amazon Athena, Amazon S3, Amazon Elasticsearch Service, Amazon SageMaker, and Amazon Redshift]
The value of moving to a managed service with Amazon EMR for big data platforms
Foundation 1: Decouple storage and compute
Foundation 2: Amazon S3 is your persistent data store
Amazon S3:
• Unmatched durability, availability, and scalability
• Strong read-after-write consistency
• Support for transactions
• Easiest to use with cost optimization: Intelligent tiering
• Broadest portfolio of analytics tools
• Best security (including row and column level), compliance, and audit capabilities
• Most ways to get data in
• Cold storage and archive capabilities
EMR Benefit 1 : Turn off clusters
[Diagram: data persisted in Amazon S3]
EMR Benefit 2 : Built-in Disaster Recovery
[Diagram: Clusters 1 to 4 in two Availability Zones, sharing Amazon S3]
EMR Benefit 3 : Logical separation of jobs/applications
Re-architect from monolithic to purpose-built clusters by:
• Creating transient and/or persistent clusters
• Separating clusters by application
• Separating clusters by application version
• Isolating department-specific clusters
Design considerations:
• How you submit jobs or build pipelines
• Persisting your data in S3
• Storing metadata off the cluster
• How long the job runs
• Which applications are needed
[Diagram: traditional monolithic cluster vs. purpose-built clusters, including ad hoc]
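As an illustration of a transient, single-application cluster, the sketch below builds a request that could be passed to boto3's EMR `run_job_flow` call. The cluster name, release label, instance types, bucket paths, and default role names are assumptions chosen for the example, not values from this deck. The part that makes the cluster transient is `KeepJobFlowAliveWhenNoSteps: False`, which ends the cluster once its steps finish.

```python
def build_transient_cluster_request(name, release_label, applications,
                                    spark_script_uri, log_uri):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All concrete values (instance types, counts, role names) are
    illustrative assumptions, not recommendations.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        # Install only the applications this workload needs.
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": [
                {"Name": "primary", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
            ],
            # Transient cluster: shut down when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         spark_script_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_transient_cluster_request(
    name="nightly-etl",                           # hypothetical name
    release_label="emr-6.3.0",                    # assumed release
    applications=["Spark"],
    spark_script_uri="s3://example-bucket/jobs/etl.py",  # hypothetical path
    log_uri="s3://example-bucket/emr-logs/",             # hypothetical path
)

# With AWS credentials configured, the request could be submitted with:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Because input and output live in S3 and metadata is kept off the cluster, nothing is lost when the cluster terminates after the step completes.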
EMR Benefit 4 : Auto-scaling Clusters (Persistent / Transient)
Amazon EMR Cluster
Amazon EMR Managed Scaling: Reduce costs by up to 60%
• Completely managed environment for automatically scaling clusters
• No configuration required except min/max capacity
• More data points and faster reaction time
• Can save 20%–60% in costs, depending on the workload pattern
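To make the "min/max capacity" point concrete, here is a minimal sketch of the policy structure accepted by boto3's EMR `put_managed_scaling_policy` call. The helper name, the capacity values, and the cluster ID in the comment are hypothetical.

```python
def build_managed_scaling_policy(min_units, max_units, unit_type="Instances"):
    """Return a ManagedScalingPolicy dict with only min/max limits set."""
    if unit_type not in {"Instances", "InstanceFleetUnits", "VCPU"}:
        raise ValueError(f"unsupported unit type: {unit_type}")
    if not 0 < min_units <= max_units:
        raise ValueError("need 0 < min_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": unit_type,
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


policy = build_managed_scaling_policy(min_units=2, max_units=20)

# With AWS credentials configured, the policy could be attached with:
#   import boto3
#   boto3.client("emr").put_managed_scaling_policy(
#       ClusterId="j-XXXXXXXXXXXXX",   # placeholder cluster ID
#       ManagedScalingPolicy=policy,
#   )
```

The service then decides when to add or remove capacity within those bounds.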
EMR Benefit 5 : Leverage Spot Instances & Instance Fleets
Auto scale with Spot Instances to reduce cost and run-time
• 10-node cluster running for 14 hours: Cost = $1 * 10 nodes * 14 hours = Total $140
• 20-node cluster running for 7 hours (add 10 more nodes of Spot at a 50% discount): Cost = $1 * 10 nodes * 7 hours = $70, plus $0.5 * 10 nodes * 7 hours = $35, Total $105
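The arithmetic in this example can be reproduced with a short script. It uses the slide's illustrative figures ($1 per node-hour On-Demand, a 50% Spot discount) and the slide's implicit assumption that doubling the node count halves the run-time. The function and variable names are made up for the sketch.

```python
def cluster_cost(node_groups, hours):
    """Total cost for groups of (hourly_rate, node_count) over a run."""
    return sum(rate * nodes for rate, nodes in node_groups) * hours


ON_DEMAND_RATE = 1.00   # $ per node-hour (slide's illustrative price)
SPOT_RATE = 0.50        # 50% discount (slide's illustrative figure)

# 10 On-Demand nodes for 14 hours.
baseline = cluster_cost([(ON_DEMAND_RATE, 10)], hours=14)

# 10 On-Demand plus 10 Spot nodes for 7 hours.
with_spot = cluster_cost([(ON_DEMAND_RATE, 10), (SPOT_RATE, 10)], hours=7)

runtime_reduction = 1 - 7 / 14
cost_reduction = 1 - with_spot / baseline

print(f"baseline=${baseline:.0f}, with_spot=${with_spot:.0f}")
print(f"run-time -{runtime_reduction:.0%}, cost -{cost_reduction:.0%}")
# prints:
#   baseline=$140, with_spot=$105
#   run-time -50%, cost -25%
```

Actual savings depend on current Spot prices and on how well the job scales with added nodes.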
Results: 50% less run-time (14 hrs → 7 hrs), 25% less cost ($140 → $105)
Diversify Spot and On-Demand Instances via Instance Fleets
• Can mix different instance types and markets (On-Demand or Spot) in one group
• Don’t specify an AZ and we will find the cheapest one
EMR Benefit 6 : EMR Self-service with AWS Service Catalog
Configure:
• Standardize
• Enforce consistency and compliance
• Limit access
• Enforce tagging, security groups
Consume:
• Developer autonomy
• One-stop shop
• Automate deployments
• Agile governance
EMR Benefit 7 : Amazon EMR differentiated performance
1.7x faster performance than standard Apache Spark 3.0 at 40% of the cost
Up to 2.6x faster performance than open-source Presto 0.238 at 80% of the cost
11.5% average performance improvement with Graviton2
25.7% average cost reduction with Graviton2
EMR Benefit 8 : Fully Managed EMR Notebooks
1. Provide an end-to-end data engineering and data science experience using EMR Notebooks, which are based on the popular open-source Jupyter Notebook, to build applications with Apache Spark
2. Attach to or detach from individual clusters; notebooks are automatically backed up to S3
3. Tag-based permissions
4. Support for PySpark, Spark SQL, SparkR, and Scala
5. New features: debug and monitor Spark jobs visually in the off-cluster, persistent Apache Spark History Server from the EMR console; associate Git repositories such as GitHub and Bitbucket; and compare and merge two different notebooks using the nbdime utility
EMR Benefit 8 : EMR Studio integrated development environment
• Easily build and deploy data science code without logging in to the AWS console
• Start notebooks in seconds, run jobs later
• Build production pipelines simply and flexibly
• Save debugging time with native application UIs in one place
EMR Benefit 9 : Analysts confirm Lowest TCO in the Industry
• Nov. 2018, IDC report confirms: “EMR provides 57% reduced costs vs. on premise resulting in 342% ROI over 5 years.”
• Dec. 2018, Gartner suggests: “AWS remains the largest Hadoop provider in terms of both revenue and user base.”
• Feb. 2019, Forrester recognizes: AWS EMR as the Cloud Hadoop/Spark (HARK) Leader.
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave™. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
EMR Benefit 10 : Leverage AWS Lake House Analytic ecosystem
Data Lake on AWS:
• Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight
• Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
• Lake Formation and AWS Glue: Blueprints, ML Transforms, Data Catalog, Access Control
• On-premises data movement: AWS Direct Connect, AWS Storage Gateway, AWS Snowball, AWS Snowmobile
• Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
EMR Migration Programs
EMR Migration Program
EMR Migration Guide
• Technical advice to help plan your migration
Free EMR Migration Workshop
• Jumpstart your migration to the cloud
Visit aws.amazon.com/emr/emr-migration/
Email [email protected]
Customer Examples
NHS Digital
Challenge: NHS Digital wanted to modernize its data access environment for users across the UK. The legacy system was too slow and expensive to maintain, and users were frustrated with performance issues.
Solution: NHS Digital migrated the dataset from its legacy systems, converted the data to Parquet format, and loaded it into Amazon S3, using AWS KMS to encrypt the data. Amazon EMR processes the data from S3.
Benefit: Performance improved from 137 minutes to 137 seconds (60x faster) using Amazon EMR.
Customer Examples - High impact results with Amazon EMR
near real-time analytics for 140M players
scales 3,000 transient clusters on a daily basis
powers the Predix solution processing 1,000,000 data executions/day
achieves costs savings of 55% when compared to on-demand pricing and 40% savings when compared to Reserved Instances
computes Zestimates on 100M+ homes in hours instead of 1 day
Customer Examples - On-premises migrations to Amazon EMR
processes 135B events/day with cost savings of 60% (~$20M)
decreased costs by $600k in less than 5 months
reduced cost of operation and improved Spark performance 3x
saves 75% and is 60% more efficient
re-architects 1 monolithic pipeline into 3 purpose-built clusters
Thank You