Immersion Day – Building a Data Lake on AWS

Jason Moldan, Solutions Architect

Norman Owens, Solutions Architect

Arun Shanmugam, Data Architect

Sreeji Gopal, Data Architect

Matt Atwater, Principal Architect, Clear Scale

© 2021, Amazon Web Services, Inc. or its Affiliates.

Workshop Agenda

1. Overview of the Data Lake

2. Hydrating the Data Lake

   Lab1 – Hydrating the Data Lake with DMS

3. Working Within the Data Lake

   Lab2 – ETL with AWS Glue

   Lab3 – Querying the Data Lake with Amazon Athena and Amazon QuickSight

4. Consuming the Data Lake – Reporting, …

5. Introduction to DataBrew

What is a Data Lake?

• A centralized repository for both structured and unstructured data

• Store data as-is in open-source file formats to enable direct analytics

Why a Data Lake?

• Decouple storage from compute, allowing you to scale

• Enable advanced analytics across all of your data sources

• Reduce complexity in ETL and operational overhead

• Future extensibility as new database and analytics technologies are invented

Traditionally, Analytics Looked Like This

• Relational data

• Business intelligence at TB–PB scale

• Schema defined prior to data load

• Operational and ad hoc reporting

• Data warehouse: large initial capex + $$K / TB / year

• Sources: OLTP, ERP, CRM, LOB

Data Lakes Extend the Traditional Approach

• Business intelligence and machine learning at TB–EB scale

• All data in one place, a single source of truth

• Relational and non-relational data

• Decouples (low-cost) storage and compute

• Schema on read

• Diverse analytical engines: DW, interactive queries, real-time processing, catalog

• Data lake sources: OLTP, ERP, CRM, LOB, devices, web, sensors, social

Store Data in the Format You Want

Open and comprehensive

• Text files like CSV

• Columnar like Apache Parquet and Apache ORC

• Logstash like Grok

• JSON (simple, nested), Avro

• And more…

Services shown: Amazon S3, Amazon Glacier, AWS Glue
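As a minimal sketch of what "the format you want" can mean in practice, the snippet below writes the same records as CSV and as JSON Lines (one JSON object per line) using only the Python standard library. The sensor records, field names, and helper functions are hypothetical and exist only for illustration; columnar formats such as Parquet and ORC need third-party libraries and are not shown.

```python
import csv
import io
import json

# Hypothetical sensor readings used only for illustration.
records = [
    {"device_id": "sensor-1", "temperature_c": 21.5},
    {"device_id": "sensor-2", "temperature_c": 19.0},
]


def to_csv(rows):
    """Serialize rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json_lines(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


csv_text = to_csv(records)
jsonl_text = to_json_lines(records)

# JSON keeps numeric types on the round trip; CSV values come back as
# strings until a reader applies a schema to them.
round_trip = [json.loads(line) for line in jsonl_text.splitlines()]
```

Either text could be stored as an object in the lake unchanged; which format suits a workload depends on the engines that will read it.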

Any Scale

Scalable and durable

• S3 has trillions of objects and exabytes of data

• Built to store any amount of data

• Run analytic engines at the largest scale by spinning up any amount of compute resources in minutes

• Runs on the world’s largest global cloud infrastructure

Why Amazon S3 for a Data Lake?

• Durable: designed for 11 9s of durability

• Available: designed for 99.99% availability

• High performance: multiple upload, Range GET

• Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies

• Scalable: store as much as you need, scale storage and compute independently, no minimum usage commitments

• Integrated: Amazon EMR, Amazon Redshift, Amazon DynamoDB, Amazon SageMaker, many more
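To put the 11 9s durability and 99.99% availability design targets in concrete terms, here is a rough back-of-the-envelope sketch. It assumes "11 9s" can be read as an average annual per-object loss rate of 1e-11; the function names and the 10-million-object example are illustrative, not taken from the slides.

```python
# Back-of-the-envelope arithmetic for the design targets quoted on the slide.
# Assumption: "11 9s" is read as an average annual per-object loss rate of 1e-11.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def expected_annual_object_loss(object_count: int, nines: int = 11) -> float:
    """Expected number of objects lost per year at `nines` of durability."""
    annual_loss_rate = 10.0 ** -nines
    return object_count * annual_loss_rate


def max_annual_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be unavailable at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    loss = expected_annual_object_loss(10_000_000)  # hypothetical object count
    print(f"Expected objects lost per year: {loss:.4f}")
    print(f"Years per single expected loss: {1 / loss:,.0f}")
    print(f"Downtime budget at 99.99%: {max_annual_downtime_minutes(99.99):.2f} min/year")
```

Under these assumptions, storing 10 million objects corresponds to roughly one expected object loss every 10,000 years, and 99.99% availability leaves a budget of a little under an hour of unavailability per year.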

Pay Only for the Resources You Use as You Scale

Lowest cost

• Pay-as-you-go for the resources you consume

• As low as $0.05/GB scanned with Athena

• EMR and Athena can automatically scale down resources after a job completes, saving you costs

• Commit to a set term and save up to 75% with Reserved Instances

• Run on spare compute capacity with EMR and save up to 90% with Spot

Chart: Traditional (rigid) versus AWS (elastic). The traditional approach leads to wasted capacity (excess capacity, wasted $$$) and unmet demand (upset players, missed revenue). The AWS approach: pay for the capacity you use.
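The "up to 75%" Reserved Instance and "up to 90%" Spot figures quoted above are ceilings, so the arithmetic below gives a best-case floor rather than an estimate. This is a minimal sketch: the hourly rate and job duration are hypothetical placeholders, not AWS prices, and `discounted_cost` is an illustrative helper.

```python
# All rates below are hypothetical placeholders, not AWS prices.
# Only the "up to" discount ceilings (75% and 90%) come from the slide.

RESERVED_MAX_DISCOUNT = 0.75
SPOT_MAX_DISCOUNT = 0.90


def discounted_cost(on_demand_hourly_rate, hours, discount):
    """Cost of `hours` of compute after applying a fractional discount."""
    if not 0 <= discount <= 1:
        raise ValueError("discount must be between 0 and 1")
    return on_demand_hourly_rate * hours * (1 - discount)


hourly_rate = 1.00  # hypothetical on-demand rate in dollars per hour
hours = 100         # hypothetical job duration in hours

on_demand = discounted_cost(hourly_rate, hours, 0.0)
reserved_floor = discounted_cost(hourly_rate, hours, RESERVED_MAX_DISCOUNT)
spot_floor = discounted_cost(hourly_rate, hours, SPOT_MAX_DISCOUNT)

if __name__ == "__main__":
    print(f"On-demand:       ${on_demand:.2f}")
    print(f"Reserved (best): ${reserved_floor:.2f}")
    print(f"Spot (best):     ${spot_floor:.2f}")
```

Actual savings depend on instance type, term, region, and Spot market conditions, so real costs will usually sit somewhere between these floors and the on-demand figure.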

Lowest Total Cost of Ownership (TCO)

Cost-effective

• Less admin time to manage and support

• No up-front costs: hardware acquisition, installation

• Save on operating costs: data center space, power, cooling

• Business value: cost of delays, risk premium, competitive abilities, governance, etc.

Typical steps for building a data lake

Implementing a Data Lake architecture requires a broad set of tools and technologies to serve an increasingly diverse set of applications and use cases.

1. Set up storage

2. Move data

3. Cleanse, prep, and catalog data

4. Configure and enforce security and compliance policies

5. Make data available for analytics

Services shown:

• Processing & Analytics: EMR (Hadoop, Spark, Presto), Athena (query service), Redshift

• Real-time Analytics: Kinesis Data Analytics, Kinesis Data Streams, Elasticsearch Service, Spark Streaming on EMR, Apache Flink on EMR, Apache Storm on EMR, AWS Lambda

• AI & Predictive: Amazon Lex (speech recognition), Amazon Polly (text to speech), Rekognition, Amazon Machine Learning (predictive analytics), SageMaker

• Transactional & RDBMS: DynamoDB (NoSQL DB), Aurora (relational database)

Data Lake on AWS

• Central Storage (scalable, secure, cost-effective): Amazon S3

• Data Ingestion: AWS Snowball, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service, AWS Storage Gateway

• Catalog & Search: AWS Glue, Amazon DynamoDB, Amazon Elasticsearch Service

• Access & User Interfaces: AWS AppSync, Amazon API Gateway, Amazon Cognito

• Analytics & Serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon DynamoDB, Amazon QuickSight, Amazon Kinesis, Amazon Elasticsearch Service, Amazon Neptune, Amazon RDS

• Manage & Secure: AWS KMS, AWS IAM, AWS CloudTrail, Amazon CloudWatch

Other diagram labels: ETL, Enrich, SST, Temp, SVT


Data Lake Architectures on AWS

• Scalable data lakes

• Purpose-built data services

• Seamless data movement

• Unified governance

• Performant and cost-effective

Services shown: Amazon S3, Amazon Aurora (relational databases), Amazon DynamoDB (non-relational databases), Amazon EMR (big data processing), Amazon Athena, Amazon Elasticsearch Service (log analytics), Amazon SageMaker (machine learning), Amazon Redshift (data warehousing)

Benefits of a Data Lake – All Data in One Place

“Why is the data distributed in many locations? Where is the single source of truth?”

Store and analyze all of your data, from all of your sources, in one centralized location.

Benefits of a Data Lake – Quick Ingest

“How can I collect data quickly from various sources and store it efficiently?”

Quickly ingest data without needing to force it into a pre-defined schema.

Benefits of a Data Lake – Storage vs Compute

“How can I scale up with the volume of data being generated?”

Separating your storage and compute allows you to scale each component as required.

Benefits of a Data Lake – Schema on Read

“Is there a way I can apply multiple analytics and processing frameworks to the same data?”

A Data Lake enables ad-hoc analysis by applying schemas on read, not write.
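Schema on read can be illustrated with a small sketch: the raw data is stored once, untyped, and each consumer applies its own schema when it reads. The order data, the two schemas, and the `read_with_schema` helper below are all hypothetical; a real data lake would typically rely on a catalog and a query engine rather than hand-written parsers.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw data stored as-is; nothing about types is fixed at write time.
RAW_CSV = """order_id,amount,created_at
1001,19.99,2021-03-01T10:15:00
1002,5.00,2021-03-01T11:40:00
1003,42.50,2021-03-02T09:05:00
"""

# Two schemas over the same bytes: each maps a column name to a parser.
FINANCE_SCHEMA = {"order_id": int, "amount": float}
TIMELINE_SCHEMA = {"order_id": str, "created_at": datetime.fromisoformat}


def read_with_schema(raw_text, schema):
    """Parse raw CSV text, keeping and typing only the columns in `schema`."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [{col: parse(row[col]) for col, parse in schema.items()} for row in reader]


# The same raw text is interpreted twice, once per consumer.
finance_rows = read_with_schema(RAW_CSV, FINANCE_SCHEMA)
timeline_rows = read_with_schema(RAW_CSV, TIMELINE_SCHEMA)

total_revenue = sum(row["amount"] for row in finance_rows)

orders_per_day = {}
for row in timeline_rows:
    day = row["created_at"].date().isoformat()
    orders_per_day[day] = orders_per_day.get(day, 0) + 1
```

Neither consumer needed the data to be reshaped at write time, and a third schema could be added later without touching the stored text.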

What can you do with a Data Lake?

Query Directly with Amazon Athena

Analyze with Hadoop on Amazon EMR

Create Visualizations with Amazon QuickSight

Train ML Models with Amazon SageMaker

Create a Central Data Catalog with AWS Glue

Load into Downstream Services

• Amazon Redshift: Run complex analytic queries against petabytes of structured data

• Amazon Aurora: A MySQL and PostgreSQL compatible relational database built for the cloud

• Amazon DynamoDB: A NoSQL database service that delivers consistent, single-digit millisecond latency at any scale

• Amazon Elasticsearch: Delivers Elasticsearch’s real-time analytics capabilities alongside the availability, scalability, and security that production workloads require

• Amazon SageMaker: A fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly

Thank You!