Building Your Data Lake on AWS

Luke Anderson, Business Development, AWS

1. Defining the Data Lake
2. Reducing Costs
3. Increasing Performance
4. Planning for the Future

• Business outcomes • Experimentation • Agile and timely

Databases | Storage Arrays | Hadoop | Spark | Data Warehouse | NoSQL

Structured Data Raw Data Advanced Analytics SQL ETL ETL

A data lake is an architecture with a virtually limitless, centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous data sets.

Key data lake attributes
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read
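To make "schema on read" concrete: the raw data is stored untouched, and column names and types are applied only when the data is read. The following is a minimal sketch using only the Python standard library. The log layout (IP, status, bytes) and all names in it are assumptions for illustration, not something taken from the slides.

```python
import csv
import io

# Hypothetical raw object as it might sit in the lake: plain CSV, no header.
RAW_OBJECT = "10.0.0.1,200,512\n10.0.0.2,404,0\n10.0.0.1,200,2048\n"

# The schema lives with the reader, not with the stored data.
# Column names and types are assumptions for illustration.
SCHEMA = [("ip", str), ("status", int), ("bytes", int)]


def read_with_schema(raw_text, schema):
    """Apply column names and types to raw rows at read time."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


rows = list(read_with_schema(RAW_OBJECT, SCHEMA))
total_ok_bytes = sum(r["bytes"] for r in rows if r["status"] == 200)
print(total_ok_bytes)
```

A different consumer could read the same raw object with a different schema, which is the point of deferring the schema to read time.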

Insights: QuickSight | SageMaker | Comprehend

Analytics: DW (Redshift + Spectrum) | Big data processing (EMR) | Interactive (Athena) | Real-time (Elasticsearch Service, Kinesis Data Analytics)

Data Lake: S3/Glacier (Storage) | Glue (ETL & Data Catalog)

Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams

• Unmatched durability, availability, and scalability
• Best security, compliance, and audit capability
• Object-level control at any scale
• Business insight into your data
• Most ways to bring data in
• Twice as many partner integrations

Storage tiers, hot to cold: HDFS | Amazon S3 Standard | Amazon S3 Infrequent Access | Amazon Glacier
• Use EMR/Hadoop with local HDFS for hottest data sets
• Store cooler data in S3 and cold data in Glacier to reduce costs
• Use S3 Analytics to optimize tiering strategy
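One way to automate the hot-to-cold tiering described above is an S3 lifecycle rule. The sketch below builds the configuration dictionary that boto3's `put_bucket_lifecycle_configuration` accepts. The prefix, the rule ID scheme, and the 30/90-day thresholds are assumptions for illustration; the slides do not prescribe any values. The helper that calls AWS takes a client supplied by the caller, so nothing here contacts AWS when the block is run.

```python
def build_tiering_rule(prefix, ia_after_days=30, glacier_after_days=90):
    """Return a lifecycle configuration dict for one key prefix."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                # Hypothetical naming scheme for the rule ID.
                "ID": "tier-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_tiering_rule(s3_client, bucket, prefix):
    """Send the rule to S3 using a boto3 S3 client supplied by the caller.

    Note: this call replaces the bucket's existing lifecycle configuration.
    """
    config = build_tiering_rule(prefix)
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
    return config


config = build_tiering_rule("logs/")
print(config["Rules"][0]["Transitions"])
```

With real credentials, `apply_tiering_rule(boto3.client("s3"), "my-bucket", "logs/")` would apply it; the bucket name is a placeholder. S3 Analytics output can inform the day thresholds.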

Amazon Athena | Amazon Redshift Spectrum | AWS Glue | Amazon EMR | Amazon S3

• Highly distributed processing frameworks such as Hadoop/Spark
• Aggregate small files
• Compress datasets
• S3distcp “group-by” clause
• Columnar file formats

• Structured data w/ joins
• Multiple on-demand clusters-scale concurrency
• Columnar file formats
• Better query performance
• Data partitioning with predicate pushdown
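The benefit of data partitioning can be illustrated with a small sketch of partition pruning: when objects are laid out under Hive-style `key=value` prefixes, a filter on the partition columns lets a query engine skip whole prefixes without reading them. The table name, the year/month/day layout, and both helper functions are assumptions for illustration only. Real engines do this through their catalog rather than by string matching.

```python
from datetime import date


def partition_prefix(table, day):
    """Build a Hive-style partition prefix such as logs/year=2018/month=01/day=15/."""
    return "{}/year={:04d}/month={:02d}/day={:02d}/".format(
        table, day.year, day.month, day.day
    )


def prune(prefixes, year, month):
    """Keep only partitions that can satisfy `year = ? AND month = ?`."""
    wanted = "/year={:04d}/month={:02d}/".format(year, month)
    return [p for p in prefixes if wanted in p]


days = [date(2018, 1, 30), date(2018, 1, 31), date(2018, 2, 1)]
prefixes = [partition_prefix("logs", d) for d in days]

# Only the January partitions need to be scanned for a January query.
selected = prune(prefixes, 2018, 1)
print(selected)
```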

• Serverless service
• Schema on read
• Optimize file sizes
• Compress datasets
• Columnar file formats
• Optimize querying (Presto backend)
• Query Data in Glacier (Coming)
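The "aggregate small files" and "compress datasets" advice above can be sketched locally with the Python standard library. This is not S3DistCp, which does the equivalent at scale on EMR. It only illustrates the idea of merging many small text files into one compressed file. File names and contents are made up for the example.

```python
import gzip
import tempfile
from pathlib import Path


def aggregate_and_compress(small_files, output_path):
    """Concatenate small text files into one gzip file; return uncompressed bytes written."""
    total = 0
    with gzip.open(output_path, "wb") as out:
        for path in sorted(small_files):
            data = Path(path).read_bytes()
            if data and not data.endswith(b"\n"):
                data += b"\n"  # keep records from adjacent files on separate lines
            out.write(data)
            total += len(data)
    return total


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    parts = []
    for i in range(5):
        part = tmp_dir / "part-{:05d}.csv".format(i)
        part.write_text("10.0.0.{},200\n".format(i))
        parts.append(part)

    merged = tmp_dir / "merged.csv.gz"
    raw_bytes = aggregate_and_compress(parts, merged)
    with gzip.open(merged, "rt") as fh:
        merged_lines = fh.read().splitlines()

print(len(merged_lines), raw_bytes)
```

Fewer, larger objects mean fewer requests and less per-file overhead for engines such as EMR, Athena, and Redshift Spectrum.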

retrieve a lot of data they don’t need and do the heavy lifting

entire object from Amazon Glacier to Amazon S3 and then use it.

Amazon Glacier | Amazon S3

Amazon S3 Select and Amazon Glacier Select

Select subset of data from an object based on a SQL expression

GET all the data from S3 objects, and my application will filter the data that I need

Redshift Spectrum Example:
• Beta customer: Run 50,000 queries
• Amount of data fetched from S3: 6 PB
• Amount of data used in Redshift: 650 TB

Data needed from S3: 10%
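The 10% figure follows from the two numbers in the example. A quick check, assuming decimal units (1 PB = 1000 TB), which the slide does not state:

```python
TB_PER_PB = 1000  # assumption: decimal units

fetched_tb = 6 * TB_PER_PB   # 6 PB fetched from S3
used_tb = 650                # 650 TB actually used in Redshift

fraction_used = used_tb / fetched_tb
print("{:.1%}".format(fraction_used))  # prints 10.8%
```

Roughly nine tenths of the bytes moved out of S3 were discarded after transfer, which is the waste S3 Select is meant to cut.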

SELECT a filtered set of data from within an object using standard SQL Statements

• First content-aware API within Amazon S3
• Unlike Amazon Athena and Spectrum, operates within the Amazon S3 system
• SQL statement operates on a per-object basis, not across a group of objects
• Works and scales like GET requests
• Accessible via SDK (Java, Python), AWS CLI and Presto Connector, with others to follow
• Who will use it?
  • Amazon Redshift Spectrum, Amazon Athena, Presto and other custom query engines
  • Everyone doing log mining
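As a sketch of what the per-object SQL call looks like from the Python SDK, the helper below wraps boto3's `select_object_content` and reassembles the result from the response event stream. The bucket, key, query, and the `FakeS3Client` stand-in are assumptions for illustration. The fake exists only so the block runs without AWS access. With real credentials, a client from `boto3.client("s3")` would be passed instead.

```python
def select_rows(s3_client, bucket, key, expression):
    """Run an S3 Select query on one headerless CSV object and return result lines."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
        OutputSerialization={"CSV": {}},
    )
    # The payload is an event stream; a Records event may carry partial rows,
    # so join the raw bytes before decoding and splitting into lines.
    chunks = []
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8").splitlines()


class FakeS3Client:
    """Stand-in for a boto3 S3 client so the sketch runs without AWS access."""

    def select_object_content(self, **kwargs):
        self.last_request = kwargs
        return {
            "Payload": [
                {"Records": {"Payload": b"10.0.0.1,GET\n10.0"}},
                {"Records": {"Payload": b".0.2,PUT\n"}},
                {"End": {}},
            ]
        }


client = FakeS3Client()
lines = select_rows(
    client,
    "example-bucket",
    "logs/part-00000.csv",
    "SELECT s._1, s._2 FROM S3Object s WHERE s._4 = 'x'",
)
print(lines)
```

Because the statement runs per object, a caller that needs results across many objects loops over keys and issues one request each, much as it would with GET.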

Input
• Format: delimited text (CSV, TSV), JSON …
• Compression: GZIP …

Output
• Format: delimited text (CSV, TSV), JSON …

Clauses: Select, From, Where
Data types: String, Integer, Float, Decimal, Timestamp, Boolean
Operators: Conditional, Math, Logical, String (Like, ||)
Functions: String, Cast, Math, Aggregate

…get-object …object… | awk -F '{ if($4=="x") print $1}'

...select-object …object… 'SELECT o._1 WHERE o._4 == "x"…'

Lambda Trigger

Amazon S3 | AWS Lambda | S3 Select | Amazon SNS

Before: 200 seconds and 11.2 cents
After: 95 seconds and 2.8 cents
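The Lambda trigger pattern above (an S3 event invokes a function, which processes the object and notifies SNS) might be wired up roughly as follows. This is a sketch under stated assumptions: the function names, the summary message, the topic ARN, and the `FakeSNS` stand-in are all hypothetical, and the per-object work is passed in as a callable so the block runs without AWS access.

```python
from urllib.parse import unquote_plus


def objects_from_s3_event(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        s3_info = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3_info["bucket"]["name"], unquote_plus(s3_info["object"]["key"])


def handle(event, process_object, sns_client, topic_arn):
    """Process each new object, then publish one summary message to SNS."""
    processed = 0
    for bucket, key in objects_from_s3_event(event):
        process_object(bucket, key)  # e.g. an S3 Select query on the object
        processed += 1
    sns_client.publish(
        TopicArn=topic_arn,
        Message="processed {} object(s)".format(processed),
    )
    return processed


class FakeSNS:
    """Stand-in for a boto3 SNS client; records what would be published."""

    def __init__(self):
        self.messages = []

    def publish(self, **kwargs):
        self.messages.append(kwargs)
        return {"MessageId": "fake"}


sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "logs/2018/part+1.csv"}}}
    ]
}
seen = []
sns = FakeSNS()
count = handle(
    sample_event,
    lambda bucket, key: seen.append((bucket, key)),
    sns,
    "arn:aws:sns:us-east-1:123456789012:example-topic",  # placeholder ARN
)
print(count, seen)
```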

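Before:

    # Download and process all keys
    for key in src_keys:
        response = s3_client.get_object(Bucket=src_bucket, Key=key)
        # Body is bytes; decode before splitting into lines
        contents = response['Body'].read().decode('utf-8')
        for line in contents.split('\n')[:-1]:
            line_count += 1
            try:
                data = line.split(',')
                srcIp = data[0][:8]
                ….

After:

    # Select IP Address and Keys
    for key in src_keys:
        response = s3_client.select_object_content(
            Bucket=src_bucket, Key=key,
            ExpressionType='SQL',
            Expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object AS obj",
            InputSerialization={'CSV': {}},  # assumption: uncompressed CSV input
            OutputSerialization={'CSV': {}})
        # The result is an event stream; row data arrives in 'Records' events
        contents = b''.join(event['Records']['Payload']
                            for event in response['Payload']
                            if 'Records' in event).decode('utf-8')
        for line in contents.split('\n')[:-1]:
            line_count += 1
            try:
                ….

2X Faster at 1/5 of the cost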

S3 Select

Amazon S3

• Works with your existing Hive Metastore
• Automatically converts predicates into S3 Select requests


5X Faster with 1/40 of the CPU

Amazon S3 | Amazon DynamoDB

Aggregate small files
• EMR: S3distcp
• Amazon Kinesis Firehose

S3 Select
• Big data cheaper, faster
• Up to 400% faster

Data Formats
• Columnar formats
• EMRFS consistent view

Easily collect, process, and analyze video and data streams in real time

• Kinesis Video Streams: Capture, process, and store video streams for analytics
• Kinesis Data Streams: Build custom applications that analyze data streams
• Kinesis Data Firehose: Load data streams into AWS data stores
• Kinesis Data Analytics: Analyze data streams with SQL
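As an illustration of loading a stream into AWS data stores with Kinesis Data Firehose, the sketch below serializes events as newline-delimited JSON and sends them with boto3's `put_record_batch`, which accepts at most 500 records per call. The stream name, the event shape, and the `FakeFirehose` stand-in are assumptions for illustration. A real client would come from `boto3.client("firehose")`, and production code would also retry the records reported as failed.

```python
import json

MAX_BATCH = 500  # put_record_batch accepts at most 500 records per call


def to_records(events):
    """Serialize events as newline-delimited JSON, one Firehose record each."""
    return [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]


def send_in_batches(firehose_client, stream_name, events):
    """Send events in batches; return the total count Firehose reported as failed."""
    records = to_records(events)
    failed = 0
    for start in range(0, len(records), MAX_BATCH):
        response = firehose_client.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records[start:start + MAX_BATCH],
        )
        failed += response.get("FailedPutCount", 0)
    return failed


class FakeFirehose:
    """Stand-in for a boto3 Firehose client; records batch sizes only."""

    def __init__(self):
        self.batch_sizes = []

    def put_record_batch(self, DeliveryStreamName, Records):
        self.batch_sizes.append(len(Records))
        return {"FailedPutCount": 0, "RequestResponses": []}


fake = FakeFirehose()
events = [{"id": i, "status": 200} for i in range(1200)]
failed = send_in_batches(fake, "example-delivery-stream", events)
print(fake.batch_sizes, failed)
```

Firehose buffers incoming records and writes larger objects to the destination, which is why the summary above lists it as a way to aggregate small files.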

• Building training sets • Cleaning and organizing data • Collecting data sets • Mining data for patterns • Refining algorithms • Other

Data Catalog
• Automatically discovers data and stores schema
• Data searchable, and available for ETL
• Discover data and extract schema

ETL Job authoring
• Auto-generates customizable ETL code in Python and Spark
• Generates customizable code
• Schedules and runs your ETL jobs

Serverless

End-to-End Machine Learning Platform | Zero setup | Flexible Model Training | Pay by the second

Collect Store ETL Analyze Visualize

[Architecture diagram, stages as listed above]
Data sources: Web apps, mobile apps (iOS, Android), applications, logging, Logstash, IoT
Data types: Transactional data, search data, file data, stream data
Storage: Amazon ElastiCache (cache), Amazon RDS (SQL), Amazon DynamoDB (NoSQL), Amazon ES (search), Amazon S3 (file storage), Amazon Glacier (cold), Apache Kafka, Amazon Kinesis, Amazon DynamoDB Streams (stream storage)
Processing: Amazon ML (predictions), Amazon Redshift, Impala (interactive), Pig (batch), Amazon Elastic MapReduce, AWS Lambda (stream processing)
Analysis & visualization: Amazon QuickSight, notebooks, IDE, apps & APIs
Labels: hot, warm, cold; fast, slow

Make your data-driven decisions count, and make a career in Big Data on AWS. Follow the Big Data Specialty learning path and become a specialist in Big Data:
• Implement core AWS Big Data services according to best practices
• Design and maintain Big Data
• Leverage tools to automate data analysis

Who should attend
• Enterprise solutions architects
• Big Data solutions architects
• Data scientists
• Data analysts

Learning path
• Free AWS digital training: Foundational knowledge
• Certified Cloud Practitioner
• Associate-level Certification
• Free AWS digital training: Big Data Technology Fundamentals
• Big Data on AWS – 3-day Classroom Training
• AWS Certified Big Data - Specialty

Visit www.aws.training to find out more.


youtube.com/user/AmazonWebServices

twitter.com/AWSCloud slideshare.net/AmazonWebServices

facebook.com/AmazonWebServices twitch.tv/aws