AWS Analytics Modernization Modernize Your Big Data Platform with AWS Analytics Services
Jay Elango, Analytics Specialist Architect
© 2021, Amazon Web Services, Inc. or its Affiliates.

Topics
• Challenges associated with on-premises Hadoop Big Data Platform
• New realities facing the organizations
• Lake house architecture on AWS
• Value in move to managed service with AWS EMR
• EMR Migration Programs
• Customer examples

Challenges with on-premises Hadoop Big Data platform

Compute and storage grow together
• Storage grows along with compute (tightly coupled)
• Compute requirements vary (tightly coupled)

Replication adds to cost
• Data is replicated several times (3x)
• Typically only in one data center

Underutilized or scarce resources
[Chart: resource usage over periods 1 to 26 on a 0 to 120 scale, with levels labeled Steady state, Weekly peaks, and Re-processing]

Contention for the same resources
[Figure: compute-bound and memory-bound workloads]

Separation of resources creates data silos
[Figure: Team A]

Limited on Fast Following Application Versions
With a monolithic cluster, dependencies of downstream applications may prevent version upgrades. By not upgrading, organizations could be limiting innovation.
• Large Scale Transformation: Map/Reduce, Hive, Pig, Spark
• Interactive Queries: Impala, Spark SQL, Presto
• Machine Learning: Spark ML, MXNet, TensorFlow
• Interactive Notebooks: Jupyter, Zeppelin
• NoSQL: HBase

The new realities organizations are facing

New realities – Organizations want more value from their data
Data is:
• Growing exponentially
• From new sources
• Increasingly diverse
• Used by many people
• Analyzed by many applications
What are organizations looking to build? A modernized data platform.
Modernization Goals:
▪ Drive innovation
  ▪ Enable organizations for Customer Experience/Journey Analytics/360 Analytics and build product intelligence by deriving insights from various data sources and formats.
▪ Business agility
  ▪ Enable business to scale infrastructure, manage performance, and optimize cost.

New realities – Organization’s modernized platform needs
[Diagram labels: Data silos (DW Silo 1 and DW Silo 2, each with Business Intelligence); Data Lake with Relational databases, Big data processing, Non-relational databases, Log analytics, Machine learning, Data warehousing; sources: OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social]

New realities – Organizations moving to Lake House architecture
• Scalable data lakes
• Purpose-built data services
• Seamless data movement
• Unified governance
• Performant and cost-effective
[Diagram labels: Data Lake with Relational databases, Non-relational databases, Big data processing, Log analytics, Machine learning, Data warehousing]

Lake House architecture on AWS
• Scalable data lakes
• Purpose-built data services
• Seamless data movement
• Unified governance
• Performant and cost-effective
[Diagram labels: Amazon S3 with Amazon Aurora, Amazon DynamoDB, Amazon EMR, Amazon Athena, Amazon Elasticsearch Service, Amazon SageMaker, Amazon Redshift]

Move to managed AWS Analytics services
• Amazon Elasticsearch Service: operational analytics (Elasticsearch, Logstash, Kibana)
• Amazon Managed Streaming for Apache Kafka: real-time analytics
• Amazon Kinesis Data Analytics for Apache Flink: real-time analytics (Apache Flink)
• Amazon EMR: Spark, Hive, Presto, Hudi, HBase

Amazon EMR
Easily run Spark, Hadoop, Hive, Presto, HBase, and other big data frameworks
• Automate provisioning, configuring, and tuning: easy setup, management, and monitoring
• Get the latest, stable, open-source releases: latest open-source framework updates within 30 days
• Automatically scale up and down: manage cluster size based on utilization to reduce costs
• Simple and predictable pricing: per-second pricing, and save 50%–80% with Amazon EC2 Spot and Reserved Instances
[Diagram labels: Amazon S3 with Amazon Aurora, Amazon DynamoDB, Amazon EMR, Amazon Athena, Amazon Elasticsearch Service, Amazon SageMaker, Amazon Redshift]

The value in the move to managed Amazon EMR for big data platforms

Foundation 1: Decouple storage and compute

Foundation 2: Amazon S3 is your persistent data store
• Unmatched durability, availability, and scalability
• Strong read-after-write consistency
• Support for transactions
• Easiest to use with cost optimization: Intelligent tiering
• Broadest portfolio of analytics tools
• Best security (including Row & Column level), Compliance, and audit capabilities
• Most ways to get data in
• Cold storage and archive capabilities

EMR Benefit 1: Turn off clusters
[Figure: Amazon S3]

EMR Benefit 2: Built-in Disaster Recovery
[Diagram labels: Cluster 1, Cluster 2, Cluster 3, Cluster 4, two Availability Zones, Amazon S3]

EMR Benefit 3: Logical separation of jobs/applications
[Diagram: Traditional Monolithic Cluster vs. Purpose-built Clusters, including Ad-Hoc]
Re-architect Monolithic to Purpose-built clusters by:
• Creating Transient and/or Persistent clusters
• Separating clusters by Application
• Separating clusters by Application Version
• Isolating Department specific clusters
Design considerations are given to:
• How do you submit jobs or build pipelines
• Persisting your data in S3
• Storing metadata off the cluster
• How long does the job run
• What applications are needed

EMR Benefit 4: Auto-scaling Clusters (Persistent / Transient)
[Figure: Amazon EMR Cluster]

Amazon EMR Managed Scaling: Reduce costs by up to 60%
• Completely managed environment for automatically scaling clusters
• No configurations required except min/max capacity
• More data points and faster reaction time
• Can save 20%-60% costs depending on the workload pattern

EMR Benefit 5: Leverage Spot Instances & Instance Fleets
Auto Scale with Spot Instances to reduce cost and run-time
• 10 node cluster running for 14 hours
  Cost = $1 * 10 nodes * 14 hours = Total $140
• 20 node cluster running for 7 hours (add 10 more nodes of Spot at 50% discount)
  On-demand cost = $1 * 10 nodes * 7 hours = $70
  Spot cost = $0.5 * 10 nodes * 7 hours = $35
  Total $105
Results: 50% less run-time (14hrs → 7hrs), 25% less cost ($140 → $105)
Diversify Spot and On-demand Instances via Instance Fleets
• Can mix different instance types, markets (On-demand or Spot) in one group
• Don’t specify an AZ and we will find the cheapest one
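
The Spot Instance cost comparison from EMR Benefit 5 can be reproduced with a few lines of arithmetic. The following is only a sketch: the $1 per node-hour rate, the 50% Spot discount, and the assumption that doubling the node count halves the run-time are the slide's example figures, while the function and variable names are hypothetical and chosen for illustration.

```python
# Sketch of the slide's Spot cost arithmetic. Rates and discounts are the
# slide's example figures, not real EC2 prices; all names are hypothetical.

def cluster_cost(nodes, hours, rate_per_node_hour):
    """Cost of running `nodes` instances for `hours` at a flat hourly rate."""
    return nodes * hours * rate_per_node_hour

ON_DEMAND_RATE = 1.00  # dollars per node-hour (slide's example figure)
SPOT_DISCOUNT = 0.50   # slide assumes Spot capacity at a 50% discount

# Baseline: 10 On-Demand nodes for 14 hours.
baseline_cost = cluster_cost(10, 14, ON_DEMAND_RATE)

# Scaled out: 10 On-Demand nodes plus 10 Spot nodes, finishing in 7 hours
# (assumes the job scales linearly with node count, as the slide does).
on_demand_part = cluster_cost(10, 7, ON_DEMAND_RATE)
spot_part = cluster_cost(10, 7, ON_DEMAND_RATE * (1 - SPOT_DISCOUNT))
scaled_cost = on_demand_part + spot_part

cost_saving = 1 - scaled_cost / baseline_cost
runtime_saving = 1 - 7 / 14

print(f"Baseline cost: ${baseline_cost:.0f}")
print(f"Scaled-out cost: ${scaled_cost:.0f}")
print(f"Cost saving: {cost_saving:.0%}, run-time saving: {runtime_saving:.0%}")
```

Real savings depend on actual Spot prices, interruption rates, and how well the workload parallelizes, so the 25% figure should be read as an illustration rather than a guarantee.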

EMR Benefit 6: EMR Self-service with AWS Service Catalog
Configure
• Standardize
• Enforce Consistency and Compliance
• Limit Access
• Enforce Tagging, Security Groups
Consume
• Developer Autonomy
• One-Stop Shop
• Automate Deployments
• Agile Governance

EMR Benefit 7: Amazon EMR differentiated performance
• 1.7x faster performance than standard Apache Spark 3.0 at 40% of the cost
• Up to 2.6x faster performance than open-source Presto 0.238 at 80% of the cost
• 11.5% average performance improvement with Graviton2
• 25.7% average cost reduction with Graviton2

EMR Benefit 8: Fully Managed EMR Notebooks
1. Provide end-to-end data engineering and data science with EMR Notebooks, which are based on the popular open-source Jupyter Notebooks, to build applications with Apache Spark
2. Attach / Detach from individual clusters; automatically backed up to S3
3. Tag-based Permissions
4. Support for PySpark, Spark SQL, Spark R, and Scala
5. NEW features include a visual experience to debug and monitor Spark jobs directly in the off-cluster, persistent Apache Spark History Server using the EMR Console, the ability to associate Git repositories such as GitHub and Bitbucket, and the ability to compare and merge two different notebooks using the nbdime utility.

EMR Benefit 8: EMR Studio integrated development environment
• Easily build and deploy data science code without logging in to AWS console
• Start notebooks in seconds, run jobs later
• Build production pipelines simply and flexibly
• Save debugging time with native application UIs in one place

EMR Benefit 9: Analysts confirm Lowest TCO in the Industry
• Nov. 2018, IDC report confirms: “EMR provides 57% reduced costs vs. on premise resulting in 342% ROI over 5 years.”
• Dec. 2018, Gartner suggests: “AWS remains the largest Hadoop provider in terms of both revenue and user base.”
• Feb. 2019, Forrester recognizes: AWS EMR as the Cloud Hadoop/Spark (HARK) Leader.
The Forrester Wave™ is copyrighted by Forrester Research,