Building Your Data Lake on AWS

Luke Anderson Business Development, AWS

© 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved.

What to expect from the session

1. Defining the Data Lake
2. Reducing Costs
3. Increasing Performance
4. Planning for the Future

Rethink how to become a data-driven business

• Business outcomes
• Experimentation
• Agile and timely

Traditionally, analytics looked like this (Duplication & Sprawl)

[Diagram: separate silos for databases, storage arrays, data warehouse, Hadoop, Spark, and NoSQL. Structured data reaches SQL through ETL; raw data reaches advanced analytics through ETL.]

Defining the AWS data lake

A data lake is an architecture with a virtually limitless, centralized storage platform capable of categorizing, processing, analyzing, and consuming heterogeneous data sets.

Key data lake attributes
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read

AWS Data Lake Components

Any analytic workload, any scale, at the lowest possible cost

• Insights: QuickSight, SageMaker, Comprehend
• Analytics (DW, processing, interactive, real-time): Redshift + Spectrum, EMR, Athena, Elasticsearch Service, Kinesis Data Analytics
• Data lake: S3/Glacier (storage), Glue (ETL & Data Catalog)
• Data movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams

Reasons to choose Amazon S3 for a data lake

• Unmatched durability, availability, and scalability
• Best security, compliance, and audit capability
• Object-level control at any scale
• Business insight into your data
• Most ways to bring data in
• Twice as many partner integrations

Reducing Data Lake Costs

Optimize costs with data tiering

• Use EMR/Hadoop with local HDFS for hottest data sets
• Store cooler data in S3 and cold in Glacier to reduce costs
• Use S3 Analytics to optimize tiering strategy

Tiers, hot to cold: HDFS, Amazon S3 Standard, Amazon S3 Infrequent Access, Amazon Glacier
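The tiering strategy above can be automated with an S3 lifecycle configuration. The following is a minimal sketch, not the presenter's setup: the prefix, the bucket name, and the 30/90-day thresholds are illustrative assumptions. The rule builder is plain Python; only `apply_tiering` touches AWS, through boto3's `put_bucket_lifecycle_configuration`.

```python
import json


def build_tiering_rules(prefix, days_to_ia=30, days_to_glacier=90):
    """Return a lifecycle configuration that moves objects to cooler tiers.

    The day thresholds are illustrative assumptions, not values from the talk.
    """
    if not 0 < days_to_ia < days_to_glacier:
        raise ValueError("objects must reach infrequent access before Glacier")
    return {
        "Rules": [
            {
                "ID": "tier-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_to_ia, "StorageClass": "STANDARD_IA"},
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_tiering(bucket, rules):
    """Send the rules to S3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without it

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=rules
    )


if __name__ == "__main__":
    # "logs/" is a hypothetical prefix used only for illustration.
    print(json.dumps(build_tiering_rules("logs/"), indent=2))
```

In practice the thresholds would come from what S3 Analytics reports about access patterns rather than from fixed defaults.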

Process data in place…

[Diagram: Amazon Athena, AWS Glue, Amazon EMR, and Amazon Redshift Spectrum all work on data held in Amazon S3]

Amazon EMR: Decouple compute & storage

• Highly distributed processing frameworks such as Hadoop/Spark
• Aggregate small files (S3distcp “group-by” clause)
• Compress datasets
• Columnar file formats
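As a sketch of the small-file aggregation point, the function below builds an EMR step that runs `s3-dist-cp` with its `--groupBy` option. The paths, pattern, and cluster ID are hypothetical. `preview_groups` is only a rough local approximation of the grouping, assuming the pattern has to match the whole path; it is not how S3DistCp itself is implemented.

```python
import re
from collections import defaultdict


def build_s3distcp_step(src, dest, group_by, target_size_mib=128):
    """Build an EMR step definition that concatenates small files."""
    return {
        "Name": "aggregate-small-files",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", src,
                "--dest", dest,
                "--groupBy", group_by,
                "--targetSize", str(target_size_mib),
            ],
        },
    }


def preview_groups(keys, group_by):
    """Approximate which keys would be merged together (local preview only)."""
    groups = defaultdict(list)
    for key in keys:
        match = re.fullmatch(group_by, key)
        if match:
            # Keys sharing the same capture-group text land in one output file.
            groups["".join(match.groups())].append(key)
    return dict(groups)


def submit(cluster_id, step):
    """Add the step to a running cluster (needs boto3 and AWS credentials)."""
    import boto3

    return boto3.client("emr").add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
```

For example, with the pattern `.*/(\d{4}-\d{2}-\d{2})-\d{2}\.log`, hourly log files from the same day share one capture-group value and would be merged into a single daily file.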

Amazon Redshift Spectrum: Exabyte-scale query-in-place

• Structured data w/ joins
• Multiple on-demand clusters (scale concurrency)
• Columnar file formats
• Data partitioning
• Better query performance with predicate pushdown

Amazon Athena: Query without ETL

• Serverless service
• Schema on read
• Optimize file sizes
• Compress datasets
• Columnar file formats
• Optimize querying (Presto backend)
• Query data in Glacier (coming)
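To make "schema on read" with columnar, partitioned data concrete, here is a small sketch that generates an Athena `CREATE EXTERNAL TABLE` statement and submits SQL through boto3's `start_query_execution`. The table, column, database, and bucket names are hypothetical placeholders, and Parquet is just one possible columnar format.

```python
def external_table_ddl(table, columns, partition_cols, location):
    """Build DDL for a partitioned Parquet table over data already in S3."""
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns)
    parts = ", ".join(f"{name} {typ}" for name, typ in partition_cols)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


def run_query(sql, database, output_location):
    """Submit a query to Athena (needs boto3 and AWS credentials)."""
    import boto3

    response = boto3.client("athena").start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # All names below are made up for illustration.
    print(
        external_table_ddl(
            "access_logs",
            [("ip", "string"), ("bytes", "bigint")],
            [("dt", "string")],
            "s3://example-bucket/access-logs/",
        )
    )
```

A query that filters on the partition column, such as `WHERE dt = '2018-03-01'`, lets the engine skip every other partition, and the columnar format means only the referenced columns are read.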

Today: All of these tools…

retrieve a lot of data they don’t need and do the heavy lifting

Today: You need to…

restore the entire object from Amazon Glacier to Amazon S3 and then use it.

[Diagram: Amazon Glacier to Amazon S3]

Select

Amazon S3 Select and Amazon Glacier Select

Select a subset of data from an object based on a SQL expression

Motivation Behind S3 Select

GET all the data from S3 objects, and my application will filter the data that I need

Redshift Spectrum example:
• Beta customer: ran 50,000 queries
• Amount of data fetched from S3: 6 PB
• Amount of data used in Redshift: 650 TB

Data needed from S3: 10%
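The 10% figure follows directly from the two volumes quoted above. A quick check, assuming decimal units (1 PB = 1,000 TB):

```python
fetched_tb = 6 * 1000  # 6 PB fetched from S3, expressed in TB (decimal units assumed)
used_tb = 650          # data actually used in Redshift

fraction_used = used_tb / fetched_tb
unused_tb = fetched_tb - used_tb

print(f"{fraction_used:.1%} of the fetched data was used")
print(f"{unused_tb} TB was transferred and then discarded")
```

So roughly nine tenths of the bytes moved out of S3 were filtered away by the query engine, which is the work S3 Select pushes down to the storage layer.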

Amazon S3 Select

SELECT a filtered set of data from within an object using standard SQL statements

• First content-aware API within Amazon S3
• Unlike Amazon Athena and Spectrum, operates within the Amazon S3 system
• SQL statement operates on a per-object basis, not across a group of objects
• Works and scales like GET requests
• Accessible via SDK (Java, Python), AWS CLI, and Presto connector (others to follow)
• Who will use it?
  • Amazon Redshift Spectrum, Amazon Athena, Presto, and other custom query engines
  • Everyone doing log mining

Amazon S3 Select

Input
• Format: delimited text (CSV, TSV), JSON …
• Compression: GZIP …

Output
• Format: delimited text (CSV, TSV), JSON …

Supported SQL
• Clauses: Select, From, Where
• Data types: String; Integer, Float, Decimal; Timestamp; Boolean
• Operators: Conditional, Math, Logical, String (Like, ||)
• Functions: String, Cast, Math, Aggregate

Amazon S3 Select: Simple pattern matches

…get-object …object… | awk -F '{ if($4=="x") print $1}'

...select-object …object… 'SELECT o._1 WHERE o._4 == "x"…'
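For comparison with the abbreviated commands above, here is a sketch of the same column-4 filter through the Python SDK, using boto3's `select_object_content`. The bucket and key are whatever the caller passes, and treating the object as a headerless CSV is an assumption. The value is interpolated into the SQL string for brevity only; real code should validate it first.

```python
import csv
import io


def select_first_column(s3_client, bucket, key, value):
    """Return column 1 of every CSV row whose column 4 equals `value`."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        # Illustration only: do not interpolate untrusted input like this.
        Expression=f"SELECT o._1 FROM S3Object o WHERE o._4 = '{value}'",
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
        OutputSerialization={"CSV": {}},
    )
    # Results arrive as an event stream; a row may be split across two
    # Records events, so join the raw bytes before parsing.
    chunks = [e["Records"]["Payload"] for e in response["Payload"] if "Records" in e]
    text = b"".join(chunks).decode("utf-8")
    return [row[0] for row in csv.reader(io.StringIO(text)) if row]
```

A caller would pass `boto3.client("s3")` as `s3_client`. Only the matching rows cross the network, instead of the whole object being downloaded and filtered locally.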

Amazon S3 Select: Serverless applications

[Diagram: Amazon S3 (Lambda trigger), AWS Lambda, S3 Select, Amazon SNS]
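One way to read the diagram above is: an object lands in S3, the event triggers a Lambda function, the function filters the object with S3 Select, and any matches are published to SNS. The sketch below assumes that flow. The `ERROR` filter on column 3, the `TOPIC_ARN` environment variable, and the message layout are hypothetical choices made for illustration.

```python
import os
from urllib.parse import unquote_plus


def handle_s3_event(event, s3_client, sns_client, topic_arn):
    """Filter each newly created object with S3 Select and publish matches."""
    published = 0
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = s3_client.select_object_content(
            Bucket=bucket,
            Key=key,
            ExpressionType="SQL",
            Expression="SELECT * FROM S3Object o WHERE o._3 = 'ERROR'",
            InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
            OutputSerialization={"CSV": {}},
        )
        matches = b"".join(
            e["Records"]["Payload"] for e in response["Payload"] if "Records" in e
        ).decode("utf-8")
        if matches:
            # Large results would need truncating to fit the SNS size limit.
            sns_client.publish(
                TopicArn=topic_arn,
                Subject="S3 Select matches",
                Message=f"s3://{bucket}/{key}\n{matches}",
            )
            published += 1
    return published


def lambda_handler(event, context):
    """Lambda entry point; builds real clients (needs boto3 and an IAM role)."""
    import boto3

    return handle_s3_event(
        event, boto3.client("s3"), boto3.client("sns"), os.environ["TOPIC_ARN"]
    )
```

Passing the clients in as arguments keeps the filtering logic testable without AWS access.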

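Amazon S3 Select: Serverless MapReduce

Before: 200 seconds and 11.2 cents

    # Download and process all keys
    for key in src_keys:
        response = s3_client.get_object(Bucket=src_bucket, Key=key)
        # The body is bytes; decode before splitting into lines
        contents = response['Body'].read().decode('utf-8')
        for line in contents.split('\n')[:-1]:
            line_count += 1
            try:
                data = line.split(',')
                srcIp = data[0][:8]
                # … (rest of the loop body elided on the slide)

After: 95 seconds and 2.8 cents

    # Select IP address and keys
    for key in src_keys:
        response = s3_client.select_object_content(
            Bucket=src_bucket, Key=key,
            ExpressionType='SQL',
            Expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object AS obj",
            # Serialization settings are assumed: headerless CSV in, CSV out
            InputSerialization={'CSV': {}},
            OutputSerialization={'CSV': {}})
        # Results arrive as an event stream in 'Payload', not as a 'Body';
        # a line can span two events, so join the chunks before splitting
        contents = b''.join(event['Records']['Payload']
                            for event in response['Payload']
                            if 'Records' in event).decode('utf-8')
        for line in contents.splitlines():
            line_count += 1
            try:
                # … (rest of the loop body elided on the slide)

2X faster at 1/5 of the cost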

Demo – S3 Select Timing

Amazon S3 Select with Presto

[Diagram: Presto issuing S3 Select requests to Amazon S3]

• Works with your existing Hive Metastore
• Automatically converts predicates into S3 Select requests

Amazon S3 Select: Accelerating big data

[Chart: before and after comparison]

5X faster with 1/40 of the CPU

Using Amazon Glacier Select

How Amazon Glacier Select Works

Delivering Results Faster

Optimizing data lake performance

• Aggregate small files: EMR S3distcp, Amazon Kinesis Firehose
• S3 Select: big data cheaper, faster; up to 400% faster
• Data formats: columnar formats
• EMRFS consistent view

Services shown: Amazon S3, Amazon DynamoDB

Amazon Kinesis: Real Time

Easily collect, process, and analyze video and data streams in real time

• Kinesis Video Streams (new): capture, process, and store video streams for analytics
• Kinesis Data Streams: build custom applications that analyze data streams
• Kinesis Data Firehose: load data streams into AWS data stores
• Kinesis Data Analytics: analyze data streams with SQL
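As an illustration of loading a stream into the data lake, the sketch below batches JSON events and sends them to a Kinesis Data Firehose delivery stream with boto3's `put_record_batch`. The stream name is supplied by the caller, and the newline-delimited JSON layout is an assumption about what downstream query tools expect.

```python
import json

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_batches(events, batch_size=MAX_BATCH):
    """Encode events as newline-delimited JSON records and split into batches."""
    records = [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def send(firehose_client, stream_name, events):
    """Send all events; return how many records Firehose reported as failed."""
    failed = 0
    for batch in to_batches(events):
        result = firehose_client.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        # Failed records are not retried here; real code would resend them.
        failed += result["FailedPutCount"]
    return failed
```

A caller would pass `boto3.client("firehose")` as `firehose_client`. Firehose buffers the records and writes them to S3 as larger objects, which also helps with the small-files problem.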

Data preparation accounts for ~80% of the work

Building training sets

Cleaning and organizing data

Collecting data sets

Mining data for patterns

Refining algorithms

Other

AWS Glue: Serverless Data Catalog & ETL service

• Automatically discovers data and stores schema
• Makes data searchable and available for ETL
• Generates customizable code in Python and Spark
• Schedules and runs your ETL jobs
• Serverless

Components
• Data Catalog: discover data and extract schema
• ETL job authoring: auto-generates customizable ETL code

Amazon SageMaker (GA)

The quickest and easiest way to get ML models from idea to production

• End-to-end platform
• Zero setup
• Flexible model training
• Pay by the second

Planning for the Future

Evolve As Needed!

[Architecture diagram. Pipeline stages: Collect, Store, ETL, Analyze, Visualize. Sources: web apps, mobile apps (iOS, Android), applications, logging, IoT. Data types: transactional data, file data, stream data, search data. Storage categories: cache, SQL, NoSQL, search, file storage, stream storage, with hot, warm, and cold temperature labels and fast/slow labels. Processing: interactive, batch, streaming, stream processing, ML predictions. Services and tools shown: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES, Amazon S3, Amazon Glacier, Amazon Kinesis, Apache Kafka, Logstash, Amazon Elastic MapReduce, Impala, Pig, AWS Lambda, Amazon ML, Amazon Redshift, Amazon QuickSight, notebooks, IDE, apps & APIs.]

AWS Training Offer

Make your data-driven decisions count, and make a career in Big Data on AWS. Follow the Big Data Specialty learning path and become a specialist in Big Data:
• Implement core AWS Big Data services according to best practices
• Design and maintain Big Data
• Leverage tools to automate data analysis

Who should attend
• Enterprise solutions architects
• Big Data solutions architects
• Data scientists
• Data analysts

Learning path items shown
• Free AWS digital training: Foundational knowledge
• Certified Cloud Practitioner
• Associate-level Certification
• Free AWS digital training: Big Data Technology Fundamentals
• Big Data on AWS – 3-day Classroom Training
• AWS Certified Big Data - Specialty

Visit www.aws.training to find out more.

Thank You For Attending AWS Data Driven Decisions Webinar Series.

We hope you found it interesting! A kind reminder to complete the survey. Let us know what you thought of today’s event and how we can improve the event experience for you in the future.

.com/user/AmazonWebServices
.com/AWSCloud
slideshare.net/AmazonWebServices
facebook.com/AmazonWebServices
twitch.tv/aws
