From Apache Hadoop/Spark to Amazon EMR: Best Migration Practices and Cost Optimization Strategies Executive Summary

As more businesses look to technologies like machine learning and predictive analytics to drive business outcomes more efficiently, they seek to migrate on-premises Apache Spark and Hadoop clusters to the cloud. Wrestling with rising costs, maintenance uncertainties, and administrative headaches, they turn to Amazon Elastic MapReduce (EMR) to support Big Data, Spark, and Hadoop environments.

Amazon EMR is a cost-efficient, high-performance service that improves business agility and keeps resource costs down. As a foundation for building data streaming and machine learning infrastructure, EMR enables organizations to transform into AI-driven enterprises and become better at targeting prospects, supporting customers, building effective products, and responding to market needs.

This whitepaper walks you through Amazon EMR’s key features, looks into its business and technology benefits, and explores core scenarios for migrating Apache Spark and Hadoop clusters to Amazon EMR:

1. Hadoop on-prem to EMR on AWS through Lift & Shift or Re-Architect
2. Next-Gen architecture on AWS: Containers, non-HDFS, or Streaming
3. Existing Hadoop on-prem to Hadoop on AWS through Rehost, Replatform, or Redistribution

We will also look into major cost optimization strategies with Amazon EMR for businesses running workloads on Apache Hadoop and Apache Spark.

Amazon EMR Overview

Amazon EMR is the industry-leading cloud-native Big Data platform for improving resource utilization by Apache Hadoop, Hive, Spark, MapReduce, and machine learning workloads. Designed to process vast amounts of data quickly and cost-effectively at scale, it is powered by open-source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto, coupled with Amazon EC2 and Amazon S3.
Amazon EMR gives analytical teams the capability to run PB-scale analysis for a fraction of the cost of traditional on-premises clusters. It eliminates the complexity involved in manual provisioning and setup of data lake resources, environment tuning, and other operational challenges.

Amazon EMR works with services for data analysis, analytics, data lake management, and machine learning, including Amazon Redshift, Amazon Athena, AWS Glue, Amazon Kinesis, and Amazon SageMaker (also: Jupyter notebooks, Spark ML, TensorFlow).

Amazon EMR helps organizations resolve complex technical and business challenges, from clickstream and log analysis to real-time and predictive analytics.

Key Features

1. Efficient Data Storage. Amazon S3 is designed for 11 nines (99.999999999%) of durability and stores a wide variety of data types. Keeping data in S3 separates storage from compute, which makes it possible to manage multi-tenancy for both performance and chargeback to different business units.
2. Reduced Operational Cost. EMR's automated cluster provisioning (cluster setup, Hadoop configuration, and cluster tuning) reduces overall operational cost. It also improves your Operations team’s productivity.
3. High Performance. Amazon EMR's built-in Auto Scaling increases the performance of various types of workloads while keeping overall costs low, which improves the price-performance ratio.
4. Cost-Efficiency. Scale worker nodes out or back in on separate, purpose-built clusters for ephemeral, long-running, and smaller workloads. This enables pay-per-use instead of large clusters that often sit idle.

© 2020 Provectus. All rights reserved | provectus.com

Major Use Cases

- Create new products
- Extract Transform Load (ETL)
- Clickstream Analysis
- Real-Time Streaming
- Interactive Analytics
- Genomics

Apache Hadoop/Spark and Amazon EMR

Over 10 years ago, the advance of Hadoop gave a fresh start to data lake technology.
By offering a large-scale data collection environment with massive, cluster-based parallel processing, Hadoop enabled organizations that once relied on expensive high-end systems to work with their data for a fraction of the cost.

As data lakes started to evolve beyond simple data collection and organization towards advanced data processing and analytics, the cluster model began to backfire. IT teams struggled to efficiently manage data among multiple clusters, and businesses were forced to buy and deploy systems that were rarely used. Rising costs, maintenance uncertainties, and administrative headaches became ordinary issues for organizations running Apache Hadoop/Spark workloads. Eventually, they started to consider the cloud as an alternative to on-premises Hadoop clusters.

Industry Trends

The appeal of a cloud-based data lake has grown massively in recent years. The growth is primarily driven by a combination of four factors:

1. Large and growing Hadoop market. By 2024, the Hadoop market is projected to reach $9.4B, with annual revenue growth reaching 33%. The trend coincides with growth in such technology application areas as Data Lakes, Intelligent Systems of Engagement, and Self-Tuning Systems of Intelligence.
2. Rapid growth of cloud adoption in the Big Data space. According to Forrester Research, global spending on Big Data solutions via cloud subscriptions will grow almost 7.5x faster than on-prem subscriptions.
3. Uncertainty with leading Hadoop vendors. Cloudera, Hortonworks, MapR, and other vendors offering Big Data Hadoop solutions are at a crossroads, as their clients explore the cloud to take advantage of lower cost, flexibility, scalability, and performance.
4. Availability of resources in the cloud.
Both data lake engineers and Big Data engineers prefer cloud to on-premises, since cloud resources are easily accessible, can be scaled up or down, and, in most cases, are more cost-effective for irregular workloads.

It’s time to move to the cloud, especially if Big Data is in the picture.

Business Benefits

Issues of efficiency, scale, and management have always been on the agenda of organizations that deploy data lakes using Apache Hadoop/Spark. Amazon EMR is one of the managed services that help address these issues while:

- Reducing infrastructure costs
- Increasing productivity of engineers and IT staff
- Improving the availability of Big Data/Hadoop/Spark environments

In 2018, IDC released comprehensive research on the business benefits of using Amazon EMR for Apache Hadoop/Spark, titled "The Economic Benefits of Migrating Apache Spark and Hadoop to Amazon EMR". It concluded that organizations migrating their Big Data/Hadoop/Spark environments to Amazon EMR could reduce their total cost of ownership by 57% through improved business agility and lower costs.

Other business benefits include:

- 342% increase in ROI over the course of five years
- 60% reduction in IT infrastructure costs over the course of five years
- 8 months from migration to breakeven, on average
- 33% more efficient Big Data teams
- $2.9 million of additional new revenue gained per year
- 46% more efficient Big Data/Hadoop environment management staff
- $18.1 million of total annual benefits per organization
- 99% reduction in unplanned downtime

Overall, IDC found that users of Amazon EMR increased the number of useful applications and capacity, realized better performance, and reduced staff time required for routine operations, all while realizing considerable cost savings.
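As a rough illustration of how figures like these combine, the sketch below applies the 57% TCO reduction reported by IDC to a hypothetical on-premises budget and derives a simple breakeven time. The dollar amounts and function names are assumptions made for this example, not numbers from the IDC study, and the breakeven formula (one-off migration cost divided by monthly savings) is only one plausible way to define it.

```python
def tco_after_migration(annual_tco_usd: float, reduction_pct: int = 57) -> float:
    """Annual TCO after applying a percentage reduction (57% per the IDC figure)."""
    return annual_tco_usd * (100 - reduction_pct) / 100


def breakeven_months(migration_cost_usd: float, annual_savings_usd: float) -> float:
    """Months until cumulative savings cover a one-off migration cost."""
    if annual_savings_usd <= 0:
        raise ValueError("annual savings must be positive")
    return migration_cost_usd / (annual_savings_usd / 12)


if __name__ == "__main__":
    on_prem_tco = 1_000_000   # hypothetical annual on-premises TCO, USD
    migration_cost = 380_000  # hypothetical one-off migration cost, USD

    emr_tco = tco_after_migration(on_prem_tco)
    savings = on_prem_tco - emr_tco
    print(f"Annual TCO on EMR: ${emr_tco:,.0f}")
    print(f"Annual savings:    ${savings:,.0f}")
    print(f"Breakeven after:   {breakeven_months(migration_cost, savings):.1f} months")
```

Substituting your own TCO and migration estimates shows how sensitive the breakeven point is to the size of the one-off migration effort.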
Technology Benefits

Managing data lake technologies, including data collection, data processing, and data analytics systems, in the on-premises data center poses certain challenges:

1. Tightly coupled compute and storage. Storage is bound to grow with compute, and vice versa. Compute requirements usually vary a lot, and it is highly inefficient to bind the two together.
2. Data gets replicated. Not only does replication add to cost, but it also pushes the limits of a data center. A 3X replication on one data center is a common occurrence for HDFS clusters.
3. Underutilized or scarce resources. It is not easy to scale up or down a large monolithic cluster. There is contention for resources at peak times, and massive underutilization of resources at steady state.
4. Contention for the same resources. Apache Spark is compute-bound, and Apache Hive is memory-bound. They cannot be separated on-premises, and are forced to contend for the same resources.
5. Down-the-line apps may not receive updates in time. With a monolithic cluster, there may be dependencies of downstream applications that do not allow for upgrades. This lack of flexibility limits innovation.
6. Separation of data creates data silos. Though separation of data helps resolve some of the above challenges, it creates a challenge of its own: data silos between teams using Hive, Spark, and other frameworks.

The challenges of on-premises Apache Hadoop/Spark can be summarized as follows:

[Figure: summary of on-premises challenges. Labels as extracted from the original graphic: Fixed Cost, Storage, Always On, Self Serve, Compute, Static, Outages, Production, Slow Deployment, No Scalable, Impact, Upgrade Cycle]

Moving Apache Hadoop/Spark to the cloud using a managed service such as Amazon EMR helps organizations avoid these drawbacks, and enables them to maximize their potential.
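To make the replication and underutilization challenges more concrete, here is a minimal Python sketch. It estimates the raw disk an HDFS cluster needs at 3X replication and the average utilization of a cluster that is permanently sized for its peak hour. The function names, the 500 TB data set, and the hourly demand profile are all hypothetical values chosen for illustration.

```python
from typing import Sequence


def raw_hdfs_capacity_tb(logical_tb: float, replication_factor: int = 3) -> float:
    """Raw disk needed to hold `logical_tb` of data at a given HDFS replication factor."""
    return logical_tb * replication_factor


def fixed_cluster_utilization(hourly_demand_nodes: Sequence[int]) -> float:
    """Average utilization of a cluster that is permanently sized for its peak hour."""
    peak = max(hourly_demand_nodes)
    if peak <= 0:
        raise ValueError("demand profile must contain at least one positive value")
    return sum(hourly_demand_nodes) / (peak * len(hourly_demand_nodes))


if __name__ == "__main__":
    # Hypothetical day: 10 nodes of demand for 20 hours, 40 nodes for a 4-hour batch window.
    demand = [10] * 20 + [40] * 4
    print(f"Raw HDFS capacity for 500 TB of data: {raw_hdfs_capacity_tb(500):.0f} TB")
    print(f"Utilization of a peak-sized cluster:  {fixed_cluster_utilization(demand):.1%}")
```

In this made-up profile the peak-sized cluster is busy well under half the time, which is the kind of steady-state underutilization described in item 3.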
The technology benefits of migrating to Amazon EMR include:

- Decoupled compute and storage
- Leverage Spot pricing for unused EC2 capacity
- Autoscale nodes with Spot instances
- EMR is surrounded by the industry’s broadest analytics ecosystem
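One way these benefits can come together is a transient cluster that reads its job and data from Amazon S3, runs task nodes on Spot capacity, and terminates when its steps finish. The sketch below only builds the request arguments for boto3's EMR `run_job_flow` call; it does not contact AWS. The helper name, bucket names, script path, instance types, node counts, and release label are assumptions for illustration, and autoscaling rules are left out for brevity.

```python
import json
from typing import Any, Dict


def build_transient_cluster_request(log_bucket: str, job_script_uri: str,
                                    task_nodes: int = 4) -> Dict[str, Any]:
    """Build keyword arguments for boto3's EMR `run_job_flow` (hypothetical helper)."""

    def group(name: str, role: str, market: str, count: int) -> Dict[str, Any]:
        return {"Name": name, "InstanceRole": role, "Market": market,
                "InstanceType": "m5.xlarge",  # assumption: pick a type that fits the workload
                "InstanceCount": count}

    return {
        "Name": "nightly-etl-example",
        "ReleaseLabel": "emr-5.30.0",  # assumption: use a release available in your region
        "Applications": [{"Name": "Spark"}],
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "Instances": {
            "InstanceGroups": [
                group("master", "MASTER", "ON_DEMAND", 1),
                group("core", "CORE", "ON_DEMAND", 2),
                group("task-spot", "TASK", "SPOT", task_nodes),  # Spot for stateless task nodes
            ],
            # Terminate once all steps finish, so nothing stays "always on".
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", job_script_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    # Hypothetical bucket and script names.
    request = build_transient_cluster_request("example-logs-bucket",
                                              "s3://example-code-bucket/jobs/etl.py")
    print(json.dumps(request, indent=2))
    # To submit for real (requires boto3 and AWS credentials):
    #   import boto3
    #   boto3.client("emr").run_job_flow(**request)
```

Because the script, input, and output all live in S3, the cluster itself holds no state that must outlive the job, which is what makes Spot task nodes and automatic termination reasonable choices here.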
