Amazon EMR Migration Guide: How to Move Apache Spark and Apache Hadoop from On-Premises to AWS
December 2, 2020

Notices

Customers are responsible for making their own independent assessment of the information in this document. This document: (a) is for informational purposes only, (b) represents current AWS product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.

© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved.

Contents

Overview
Starting Your Journey
  Migration Approaches
  Prototyping
  Choosing a Team
  General Best Practices for Migration
Gathering Requirements
  Obtaining On-Premises Metrics
Cost Estimation and Optimization
  Optimizing Costs
  Storage Optimization
  Compute Optimization
  Cost Estimation Summary
  Optimizing Apache Hadoop YARN-based Applications
Amazon EMR Cluster Segmentation Schemes
  Cluster Characteristics
  Common Cluster Segmentation Schemes
  Additional Considerations for Segmentation
Securing your Resources on Amazon EMR
  EMR Security Best Practices
  Authentication
  Authorization
  Encryption
  Perimeter Security
  Network Security
  Auditing
  Software Patching
  Software Upgrades
  Common Customer Use Cases
Data Migration
  Using Amazon S3 as the Central Data Repository
  Large Quantities of Data on an Ongoing Basis
  Event and Streaming Data on a Continuous Basis
  Optimizing an Amazon S3-Based Central Data Repository
  Optimizing Cost and Performance
Data Catalog Migration
  Hive Metastore Deployment Patterns
  Hive Metastore Migration Options
Multitenancy on EMR
  Silo Mode
  Shared Mode
  Considerations for Implementing Multitenancy on Amazon EMR
Extract, Transform, Load (ETL) on Amazon EMR
  Orchestration on Amazon EMR
  Migrating Apache Spark
  Migrating Apache Hive
  Amazon EMR Notebooks
Incremental Data Processing
  Considerations for using Apache Hudi on Amazon EMR
  Sample Architecture
Providing Ad Hoc Query Capabilities
  Considerations for Presto
  HBase Workloads on Amazon EMR
  Migrating Apache Impala
Operational Excellence
  Upgrading Amazon EMR Versions
  General Best Practices for Operational Excellence
Testing and Validation
  Data Quality Overview
  Check your Ingestion Pipeline
  Overall Data Quality Policy
  Estimating Impact of Data Quality
  Tools to Help with Data Quality
Amazon EMR on AWS Outposts
  Limitations and Considerations
Support for Your Migration
  Amazon EMR Migration Program
  AWS Professional Services
  AWS Partners
  AWS Support
Contributors
Additional Resources