From Apache Hadoop/Spark to Amazon EMR: Best Migration Practices and Cost Optimization Strategies

Executive Summary

As more businesses look to technologies like machine learning and predictive analytics to drive business outcomes more efficiently, they seek to migrate on-premises Spark and Hadoop clusters to the cloud. Wrestling with rising costs, maintenance uncertainties, and administrative headaches, they turn to Amazon Elastic MapReduce (EMR) to support their Spark and Hadoop environments.

Amazon EMR is a cost-efficient, high-performance service that improves business agility and keeps down resource utilization costs. As a foundation for building data streaming and machine learning infrastructure, EMR enables organizations to transform into AI-driven enterprises, to become better at targeting prospects, supporting customers, building effective products, and responding to market needs.

This whitepaper will walk you through Amazon EMR’s key features, look into its business and technology benefits, and explore core scenarios for migrating Apache Spark & Hadoop clusters to Amazon EMR:

Hadoop on-prem to EMR on AWS through Lift & Shift or Re-Architect

Next-Gen architecture on AWS: Containers, non-HDFS, or Streaming

Existing Hadoop on-prem to Hadoop on AWS through Rehost, Replatform, or Redistribution

We will also look into major cost optimization strategies with Amazon EMR for businesses running workloads on Apache Hadoop and Apache Spark.

Amazon EMR Overview

Amazon EMR is the industry-leading cloud-native Big Data platform for improving resource utilization by Apache Hadoop, Hive, Spark, MapReduce, and machine learning workloads. Designed to process vast amounts of data quickly and cost-effectively at scale, it is powered by open-source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto, coupled with Amazon EC2 and Amazon S3.

Amazon EMR gives analytical teams the capability to run PB-scale analysis for a fraction of the cost of traditional on-premises clusters. It eliminates the complexity involved in manual provisioning and setup of data lake resources, environment tuning and fine-tuning, and all other operational challenges.

Amazon EMR integrates with services for data analysis, analytics, data lake management, and machine learning, including Amazon Redshift, Amazon Athena, AWS Glue, Amazon Kinesis, and Amazon SageMaker (as well as Jupyter notebooks, Spark ML, and TensorFlow).

Amazon EMR helps organizations resolve complex technical and business challenges, from clickstream and log analysis to real-time and predictive analytics.

Key Features

1. Efficient Data Storage. Amazon S3 is designed for 99.999999999% (11 nines) durability and stores a wide variety of data types. Because it separates storage from compute, it lets you manage multi-tenancy for both performance and chargeback to different business units.

2. Reduced Operational Cost. EMR automates cluster provisioning (cluster setup, Hadoop configuration, and cluster tuning), which reduces overall operational cost and improves your operations team's productivity.

3. High Performance. Amazon EMR's built-in Auto Scaling increases the performance of various types of workloads while keeping the overall costs low. This feature also improves the price-performance ratio.

4. Cost-Efficiency. Scale worker nodes out or back in on separate, purpose-built clusters for ephemeral, long-running, and smaller workloads. This enables pay-per-use instead of paying for large clusters that often sit idle.

© 2020 Provectus. All rights reserved | provectus.com

Major Use Cases

Create new products

Extract Transform Load (ETL)

Clickstream Analysis

Real-Time Streaming

Interactive Analytics

Genomics

Apache Hadoop/Spark and Amazon EMR

Over 10 years ago, the advent of Hadoop gave a fresh start to data lake technology. By offering a large-scale data collection environment with massive, cluster-based parallel processing, Hadoop enabled organizations that once relied on expensive high-end systems to work with their data for a fraction of the cost.

As data lakes started to evolve beyond simple data collection and organization towards advanced data processing and analytics, on-premises clusters became a liability. IT teams struggled to efficiently manage data among multiple clusters, and businesses were forced to buy and deploy systems that were rarely used. Rising costs, maintenance uncertainties, and administrative headaches became ordinary issues for organizations running Apache Hadoop/Spark workloads. Eventually, they started to consider the cloud as an alternative to on-premises Hadoop clusters.

Industry Trends

The appeal of a cloud-based data lake has massively grown in recent years. The growth is primarily driven by a combination of four factors:

1. Large and growing Hadoop market. By 2024, the Hadoop market is projected to reach $9.4B, with an annual revenue growth reaching 33%. The trend coincides with growth in such technology application areas as Data Lakes, Intelligent Systems of Engagement, and Self-Tuning Systems of Intelligence.

2. Rapid growth of cloud adoption in the Big Data space. According to Forrester Research, global spending on Big Data solutions via cloud subscriptions will grow almost 7.5x faster than on-prem subscriptions.

3. Uncertainty with leading Hadoop vendors. Cloudera, Hortonworks, MapR, and other vendors offering Big Data Hadoop solutions are at a crossroads, as their clients explore the cloud to take advantage of lower cost, flexibility, scalability, and performance.

4. Availability of resources in the cloud. Both data lake engineers and Big Data engineers prefer cloud to on-premises, since cloud resources are easily accessible, can be scaled up or down, and, in most cases, are more cost-effective for irregular workloads.

It’s time to move to the cloud, especially if Big Data is in the picture.

Business Benefits

Issues of efficiency, scale, and management have always been on the agenda of organizations that deploy data lakes using Apache Hadoop/Spark. Amazon EMR is one of the managed services that help address these issues while:

Reducing infrastructure costs

Increasing productivity of engineers and IT staff

Improving the availability of Big Data/Hadoop/Spark environments

In 2018, IDC released comprehensive research on business benefits of using Amazon EMR for Apache Hadoop/Spark, titled The Economic Benefits of Migrating Apache Spark and Hadoop to Amazon EMR. Overall, it was concluded that organizations migrating from Apache Hadoop/Spark to Amazon EMR could reduce their total cost of ownership by 57%.

Organizations that migrate their Big Data/Hadoop/Spark environments to Amazon EMR reduce total cost of ownership by 57% through improved business agility and lower costs.

Other business benefits include:

342% increase in ROI over the course of five years

60% reduction in IT infrastructure costs over the course of five years

8 months from migration to breakeven, on average

33% more efficient Big Data teams

$2.9 million of additional new revenue gained per year

46% more efficient Big Data/Hadoop environment management staff

$18.1 million of total annual benefits per organization

99% reduction in unplanned downtime

Overall, IDC found that users of Amazon EMR increased the number of useful applications and capacity, realized better performance, and reduced staff time required for routine operations, all while realizing considerable cost savings.

Technology Benefits

Managing data lake technologies, including data collection, data processing, and data analytics systems in the on-premises data center poses certain challenges:

1. Tightly coupled compute and storage. Storage is bound to grow with compute, and vice versa. Compute requirements usually vary a lot, and it is highly inefficient to bind the two together.

2. Data gets replicated. Not only does replication add to cost, but it also pushes the limits of a data center. 3x replication within a single data center is common for HDFS clusters.

3. Underutilized or scarce resources. It is not easy to scale up or down a large monolithic cluster. There is contention for resources at peak times, and massive underutilization of resources at steady state.

4. Contention for the same resources. Compute-bound and memory-bound workloads, such as Apache Spark and Apache Hive jobs with different resource profiles, cannot be separated on-premises and are forced to contend for the same resources.

5. Down-the-line apps may not receive updates in time. With a monolithic cluster, there may be dependencies of downstream applications that do not allow for upgrades. This lack of flexibility limits innovation.

6. Separation of data creates data silos. Though separation of data helps resolve some of the above challenges, it creates a challenge of its own — data silos between teams using Hive, Spark, and other frameworks.
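The coupling and replication points in the list above come down to simple arithmetic. The following is a minimal sketch, not a sizing tool: the 3x replication factor comes from the text, while the dataset size, the per-node disk capacity, and the function names are hypothetical values chosen for illustration.

```python
import math


def hdfs_raw_storage_tb(logical_tb, replication_factor=3):
    """Raw disk capacity needed to hold `logical_tb` of data at a given replication factor."""
    return logical_tb * replication_factor


def nodes_needed_for_storage(logical_tb, disk_tb_per_node, replication_factor=3):
    """Nodes required purely to hold the data when storage lives on the compute nodes."""
    raw_tb = hdfs_raw_storage_tb(logical_tb, replication_factor)
    return math.ceil(raw_tb / disk_tb_per_node)


# Hypothetical example: 500 TB of data, 48 TB of usable disk per node.
raw = hdfs_raw_storage_tb(500)
nodes = nodes_needed_for_storage(500, 48)
```

With these assumed inputs, the cluster has to be sized for storage first, whether or not the jobs need that many nodes' worth of compute. This is the inefficiency that decoupling storage from compute is meant to remove.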

The challenges of on-premises Apache Hadoop/Spark can be summarized as follows:

[Figure: challenges of on-premises Apache Hadoop/Spark. Labels: Fixed Cost; Always On; Self Serve; Storage and Compute; Static: No Scalable; Outages Impact; Production Upgrade; Slow Deployment Cycle]

Moving Apache Hadoop/Spark to the cloud using a managed service such as Amazon EMR helps organizations avoid these drawbacks, and enables them to maximize their potential.

The technology benefits of migrating to Amazon EMR include:

Decoupled compute and storage

Leverage Spot pricing for unused EC2 capacity

Autoscale nodes with Spot instances

EMR is surrounded by the industry’s broadest analytics ecosystem

Turn off your clusters through transient clusters

Diversify instance types using instance fleets

Agility in auto-scaling persistent clusters

Self-service with AWS Service Catalog

Logical separation of jobs and applications

Spark performance improvements

Fully managed EMR Notebooks

Built-in disaster recovery

Amazon EMR allows organizations to take maximum advantage of the cloud environment while continuing to benefit from Apache Hadoop/Spark. It brings with it such major benefits as the ability to decouple compute and storage, flexible pricing, availability of resources for various environments, and reliable performance.
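To make Spot pricing and instance fleets concrete, here is a minimal sketch of a task fleet definition that spreads Spot capacity across several instance types. The dictionary keys follow the `InstanceFleets` structure accepted by the boto3 EMR `run_job_flow` call. The fleet name, instance types, capacities, and timeout are hypothetical, and the helper `task_fleet_with_spot` is illustrative rather than part of any library.

```python
def task_fleet_with_spot(instance_types, spot_capacity, on_demand_capacity=0):
    """Build a TASK instance fleet that spreads Spot requests across several instance types."""
    return {
        "Name": "task-fleet",  # hypothetical name
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": on_demand_capacity,
        "TargetSpotCapacity": spot_capacity,
        # One entry per instance type; more types give EMR more Spot pools to draw from.
        "InstanceTypeConfigs": [
            {"InstanceType": t, "WeightedCapacity": 1} for t in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                # If Spot capacity is not fulfilled in time, fall back to On-Demand.
                "TimeoutDurationMinutes": 20,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


# Hypothetical instance types and capacity.
fleet = task_fleet_with_spot(["m5.xlarge", "m5a.xlarge", "r5.xlarge"], spot_capacity=8)
```

Listing several instance types is a design choice: a single type ties the fleet to one Spot capacity pool, while a diversified list lowers the chance that an interruption in one pool stalls the job.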

Migration Scenarios and Risk Mitigation Strategies

Migration from on-premises Apache Hadoop/Spark to Amazon EMR may present its own difficulties. Some of the most commonly cited challenges are deteriorating performance, a declining price-to-performance ratio, provisioning pipelines that are hard to build and manage, and disaster recovery issues.

Fortunately, all it takes to migrate successfully is to (a) select an appropriate migration scenario, both technology-wise and business-wise; (b) stick to the industry’s best risk mitigation strategies.

Migration Scenarios

Amazon Web Services distinguishes six distinct Hadoop migration scenarios among three major patterns.

#1 Pattern A

Re-purchase: Hadoop distro on-prem to EMR on AWS (also referred to as “Lift & Shift”).

Re-architect: Hadoop distro on-prem to EMR on AWS with completely new architecture and complementary services, to provide additional functionality, scalability, flexibility, cost, etc.

#2 Pattern B

Next Gen architecture: Moving Hadoop workload from on-prem to AWS, but with a new architecture which may include Containers, non-HDFS, Streaming, etc. The workload remains the same or has added functionality.

#3 Pattern C

Re-host / Lift & Shift: Any Hadoop distro on-prem to the same Hadoop distro on AWS. For example, Cloudera on-prem to Cloudera on AWS.

Re-platform: Hadoop distro on-prem to Hadoop distro on AWS with additional optimizations, such as separation of compute and storage, use of complementary services such as Glue, Athena, etc. to optimize environment, while using the same distro that was used on-prem. For example, Cloudera on-prem to Cloudera on AWS with optimizations.

Re-distro: Hadoop distro A on-prem to Hadoop distro B on AWS, with completely new architecture and complementary services, to provide additional functionality, scalability, flexibility, cost, etc. For example, from MapR to Cloudera on AWS.

While all of these patterns and scenarios are viable, practice has shown that only three scenarios enable organizations to take maximum advantage of migration.

EMR Migration Scenarios pros and cons:

1. Lift & Shift: Migrate to Amazon EMR by lifting and shifting your on-premises Hadoop distro to the AWS cloud.

Low Risk & Lowest migration cost

Very high ongoing cost

Low business value addition

Quickest time to market

2. Re-Architect: Migrate to Amazon EMR with a new architecture and complementary services to optimize cost and to provide additional functionality, scalability, flexibility, etc.

Medium risk, medium migration cost

Medium ongoing cost

High business value addition

Medium time to market

3. Next Gen Architecture: Migrate to Amazon EMR with a completely new architecture, which may include streaming and containers, with added functionality, scalability, flexibility, etc.

High risk, highest migration cost

Lowest ongoing cost

Highest business value addition

Longest time to market

Each of these scenarios has its advantages and disadvantages, but bear in mind that only the latter two that imply architecture rework deliver tangible results in terms of total cost of ownership, as summarized in the image below.

[Figure: True TCO comparison. Stages: On-Premises, Lift & Shift, Instance Right-Sizing, S3 vs. HDFS, Transient clusters, Auto-scaling, Spot Pricing, Automated Orchestration, EMR Optimized]

(We will talk more about cost reduction opportunities of the above-mentioned scenarios further down the line.)

Risk Mitigation Strategies

Correctly balancing risk and reward is of substantial importance to any organization that needs to migrate their Big Data, Apache Hadoop, or Apache Spark workloads to Amazon EMR. Be it the simplest Lift & Shift or a radically more complex re-architecture, the following strategies will help:

Analyze all applications and workloads to ascertain the exact amount of compute, memory, and storage each one needs. Also, identify when each workload runs (time of day, week, or month) and any other infrastructure needs.

Develop a business value model and an implementation complexity model for all applications and workloads. Create a business value vs complexity prioritization matrix to be aware of all potential pain points.

Mirror data loads from the on-premises Apache Hadoop cluster onto the Amazon EMR cluster in an organized way.

Create and share a detailed migration plan among all stakeholders. Move workloads to Amazon EMR in an orderly fashion.

Identify excited innovators within each business unit to spread the word about the potential of Amazon EMR. Use their help to move from on-prem to Amazon EMR.
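The business value vs complexity prioritization matrix mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed method: the 1 to 5 scoring scale, the threshold, the quadrant labels, and the sample workloads are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    business_value: int  # assumed scale: 1 (low) to 5 (high)
    complexity: int      # assumed scale: 1 (easy) to 5 (hard)


def quadrant(w, threshold=3):
    """Place a workload in one of four quadrants of the value/complexity matrix."""
    high_value = w.business_value >= threshold
    high_complexity = w.complexity >= threshold
    if high_value and not high_complexity:
        return "migrate first"
    if high_value and high_complexity:
        return "plan carefully"
    if not high_value and not high_complexity:
        return "migrate when convenient"
    return "defer or retire"


def migration_order(workloads):
    """Order workloads by highest value first, then by lowest complexity."""
    return sorted(workloads, key=lambda w: (-w.business_value, w.complexity))


# Hypothetical workloads and scores.
workloads = [
    Workload("legacy-report", business_value=1, complexity=4),
    Workload("clickstream-sessions", business_value=4, complexity=4),
    Workload("nightly-etl", business_value=5, complexity=2),
]
ordered = [w.name for w in migration_order(workloads)]
```

Scoring every workload the same way gives stakeholders one shared list to argue about, which makes the detailed migration plan easier to agree on.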

By following the tips above (and by choosing the correct migration scenario right off the bat), you safeguard your organization from critical mistakes that can cost you time and money. In the next section, we will learn about best practices to optimize Amazon EMR migration costs.

Cost Optimization for Hadoop/Spark Workloads with Amazon EMR

Companies that migrate their Apache Hadoop/Spark workloads to Amazon EMR can expect to reduce IT infrastructure costs while benefiting from better team productivity and performance.

Given a variety of migration patterns and scenarios, company-to-company cost optimization opportunities also vary significantly. Let’s look at cost optimization factors that impact how much an organization can save across three major migration scenarios.

1. Lift & Shift: 10-20% cost reduction

CapEx to OpEx

Maintenance overhead

On-prem license fees

Uncertainty in Hadoop vendors

2. Re-Architect: 10-40% cost reduction

Decoupled Storage & Compute

Optimized hardware

Spot pricing

Transient clusters

Amazon S3 lifecycle

Autoscaling

Proprietary Spark EMR engine

3. Next Gen Architecture: 20-90% cost reduction

Data Pipelines optimization

Serverless ETL

Serverless Data Catalog

Workload decomposition

Serverless ad-hoc queries

Streaming processing (EMR, Redshift, Athena, SageMaker)
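As a rough illustration of what these ranges mean in practice, the sketch below applies them to a baseline cost. The percentage ranges are the ones quoted in this section. The baseline figure and the function name `projected_cost_range` are hypothetical, and actual savings depend on the workload.

```python
# Cost-reduction ranges quoted in this section, as (low, high) fractions.
REDUCTION_RANGES = {
    "Lift & Shift": (0.10, 0.20),
    "Re-Architect": (0.10, 0.40),
    "Next Gen Architecture": (0.20, 0.90),
}


def projected_cost_range(baseline_cost, scenario):
    """Return (lowest, highest) projected cost after migration for a scenario."""
    low, high = REDUCTION_RANGES[scenario]
    # The largest reduction gives the lowest projected cost, and vice versa.
    return baseline_cost * (1 - high), baseline_cost * (1 - low)


# Hypothetical baseline: 1,000,000 per year on-premises.
best, worst = projected_cost_range(1_000_000, "Re-Architect")
```

The width of each range is worth noting: Next Gen Architecture has both the highest ceiling and the widest spread, which matches its higher risk profile.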

[Figure: Cost Reduction & Performance Impact by migration scenario]

At this point, it is clear that an organization’s potential to save by migrating to Amazon EMR primarily depends on the migration scenario and its ability to risk investing time and resources into a full-scale rework of architecture, from Lift & Shift to EMR Optimized.

However, no matter which scenario your organization chooses to pursue, you will be able to optimize costs, simply by using Amazon S3 as your persistent storage, or by taking advantage of autoscaling, or by turning your clusters on and off.
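A minimal sketch of the "turn your clusters on and off" idea follows: a request for a transient cluster that reads from and writes to Amazon S3 and shuts down once its single step finishes. The keys follow the boto3 EMR `run_job_flow` parameters. The bucket, script path, release label, instance types, and the helper name `transient_cluster_request` are assumptions for illustration, and the default role names assume the standard EMR roles exist in the account.

```python
def transient_cluster_request(name, log_uri, script_uri, input_uri, output_uri):
    """Build keyword arguments for a run-once EMR cluster that terminates after its step."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.30.0",  # assumption: use a release you have validated
        "Applications": [{"Name": "Spark"}],
        "LogUri": log_uri,
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # False makes the cluster transient: it shuts down when all steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "spark-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", script_uri, input_uri, output_uri],
                },
            }
        ],
    }


# Hypothetical S3 locations; data and logs persist in S3 after the cluster is gone.
request = transient_cluster_request(
    name="nightly-etl",
    log_uri="s3://example-bucket/logs/",
    script_uri="s3://example-bucket/jobs/etl.py",
    input_uri="s3://example-bucket/raw/",
    output_uri="s3://example-bucket/curated/",
)
```

With AWS credentials configured, a dictionary like this could be passed as `boto3.client("emr").run_job_flow(**request)`. The sketch only builds the request, so it runs without an AWS account.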

The key advantage of Amazon EMR from a cost optimization standpoint is that it exists within a next-generation ecosystem that is designed to support each and every task pertaining to AI/ML, Big Data, Analytics, and more.

Just consider two of the most common Apache Hadoop pipelines below:

[Figure: Apache Hadoop pipeline #1, Heavy Processing dependent. Stages: Sources, Ingestion, Heavy Processing, Lightweight Processing, Query. Labels: Full Scans, Big Joins, Data re-shuffle, From-scratch processing (90% of your Hadoop costs); Flat SQL transforms; MPP Queries, Fast Selects]

[Figure: Apache Hadoop pipeline #2, Query dependent. Stages: Sources, Ingestion, Heavy Processing, Lightweight Processing, Query. Labels: Full Scans, Big Joins, Data re-shuffle, Re-partitioning (90% of your Hadoop costs)]

Both pipelines do their job but are costly when it comes to full scans, big joins, data re-shuffling, from-scratch processing, and re-partitioning — because all of these operations are performed on-premises.

Compare those to an EMR-based pipeline for Hadoop:

[Figure: Hadoop pipeline on Amazon EMR. Stages: Sources, Ingestion, Stream & Batch Processing, Query. Labels: CDC, Incremental Processing, Flat SQL transforms, Reports, Serving; 2-3x cost reduction]

As seen from the image above, going for incremental processing and flat SQL transforms at the Stream & Batch Processing stage of your pipeline enables organizations to reduce costs by 2-3X. This is possible because of the ecosystem that is built around data lakes on AWS.
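The difference between from-scratch and incremental processing can be shown with a small, framework-free sketch. It is not EMR-specific: the record layout, the integer timestamps, and the function names are hypothetical, and a real pipeline would keep the watermark and state in durable storage.

```python
def incremental_batch(records, last_watermark):
    """Return only records newer than the watermark, plus the advanced watermark."""
    new_records = [r for r in records if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in new_records), default=last_watermark)
    return new_records, new_watermark


def apply_changes(state, changes):
    """Upsert changed rows into the current state, keyed by id (a CDC-style merge)."""
    for row in sorted(changes, key=lambda r: r["updated_at"]):
        state[row["id"]] = row
    return state


# Hypothetical change log; updated_at is a simple integer timestamp.
source = [
    {"id": 1, "amount": 10, "updated_at": 100},
    {"id": 2, "amount": 20, "updated_at": 105},
    {"id": 1, "amount": 15, "updated_at": 110},
]
state = {1: {"id": 1, "amount": 10, "updated_at": 100}}

# Only rows changed after the last run (watermark 100) are processed.
batch, watermark = incremental_batch(source, last_watermark=100)
state = apply_changes(state, batch)
```

Each run touches only the rows that changed since the previous watermark, so the work scales with the volume of changes, not with the size of the full dataset. That is the effect the pipeline comparison attributes to incremental processing.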

Machine Learning: Amazon SageMaker, Amazon Rekognition, Amazon Textract, Amazon Comprehend, Amazon Personalize, Amazon Forecast

Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch, Amazon Kinesis, Amazon QuickSight

On-prem Data Movement: AWS Direct Connect, AWS Storage Gateway, AWS Snowball, AWS Snowmobile

Real-time Data Movement: AWS IoT Core, AWS Kinesis Firehose, AWS Kinesis Data Streams, AWS Kinesis Video Streams

The ecosystem around Data Lake on AWS

Overall, AWS and Amazon EMR offer a wide selection of means and methods to realize cost savings, from cloud’s innate agility and flexibility to architecture improvements to “big picture” ecosystem advantages.

Summary

Amazon EMR is the industry-leading cloud Big Data platform. Envisioned as the ultimate tool for Apache Hadoop, Hive, Spark, MapReduce, and machine learning workloads, Amazon EMR is coupled with Amazon EC2 and Amazon S3 within the AWS ecosystem to process vast amounts of data quickly and cost-effectively, at scale.

The service is ideal for organizations willing to run large-scale analysis for a fraction of the cost of traditional on-prem clusters, while avoiding complexities associated with manual provisioning and setup of data lake resources, tuning and fine-tuning of environments, and other operational challenges.

Amazon EMR is a golden opportunity for businesses and IT teams looking to migrate to the cloud, reduce costs, increase team productivity, and eliminate administrative uncertainty. Analyst firms, from Gartner and Forrester to IDC, have documented the advantages of Amazon EMR over on-premises Apache Hadoop and Apache Spark.

Amazon Web Services distinguishes six distinct Hadoop migration scenarios among three major patterns, including Lift & Shift, Re-Architect, and Next Gen Architecture.

Every migration scenario has its pros and cons: Lift & Shift is the quickest and easiest but only reduces costs by up to 20%; Re-Architect is the middle ground; Next Gen Architecture is the most complex option, but it can reduce costs by up to 90%. Specific strategies can help organizations mitigate migration risks and move more quickly from on-premises to the cloud.

Amazon EMR offers organizations an arsenal of features to realize cost savings. These include decoupled compute and storage, built-in disaster recovery, transient clusters, autoscaling of persistent clusters, autoscaled nodes on Spot instances, a variety of instance types and instance fleets, EMR Notebooks, and more.

At Provectus, we are dedicated to helping businesses reduce the cost and increase the performance of their Big Data solutions on Apache Hadoop/Spark by migrating to Amazon EMR. We use industry best practices to assess the scope of migration and craft a smart migration strategy around architecture improvements and cost optimization opportunities.

Learn more about Amazon EMR Migration Program and request the Amazon EMR Migration Workshop

About Provectus

Provectus is an Artificial Intelligence consultancy and solutions provider helping businesses achieve their objectives through AI. Provectus is a Premier Consulting Partner with AWS competencies in Machine Learning, Data & Analytics, and DevOps.

Contact Us

125 University Avenue, Suite 290
Palo Alto, California, 94301
+1 (800) 950-9840
provectus.com

Index

1. aws.amazon.com/emr/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc

2. aws.amazon.com/getting-started/hands-on/optimize-amazon-emr-clusters-with-ec2-spot/

3. aws.amazon.com/aws-cost-management/aws-cost-optimization/

4. aws.amazon.com/blogs/big-data/optimize-amazon-emr-costs-with-idle-checks-and-automatic-resource-termination-using-advanced-amazon-cloudwatch-metrics-and-aws-lambda/

5. wikibon.com/2016-2026-worldwide-big-data-market-forecast/

6. go.forrester.com/blogs/insight-paas-accelerate-big-data-cloud/

7. d1.awsstatic.com/analyst-reports/IDC%20Economic%20Benefits%20of%20Migrating%20to%20EMR%20White%20Paper.pdf

8. d0.awsstatic.com/whitepapers/aws-amazon-emr-best-practices.pdf