Best Practices for Implementing a Data Lake on Amazon S3
STG359-R: Best practices for implementing a Data Lake on Amazon S3

Speakers:
• Amy Che — Principal Technical Program Manager, Amazon S3, Amazon Web Services
• John Mallory — Storage Business Development Manager, Amazon Web Services
• Gerard Bartolome — Data Platform Architect, Sweetgreen

Data at scale
Data is growing exponentially, arriving from new sources, increasingly diverse, used by many people, and analyzed by many applications.

Agenda
• Data at scale and Data Lakes
• Sweetgreen's Data Lake best practices
• Data Lake foundation best practices
• Data Lake performance best practices
• Data Lake security best practices

Defining the Data Lake
A Data Lake sits alongside the data warehouse, bringing together data from OLTP, ERP, CRM, and LOB systems as well as devices, web, sensors, and social feeds, and serving business intelligence, machine learning, DW queries, big data processing, and interactive and real-time workloads through a shared catalog.

Amazon S3 as the foundation for Data Lakes
• Durable, available, exabyte scalable
• Secure, compliant, auditable
• High performance
• Low cost storage and analytics
• Broad ecosystem integration — Amazon Athena, Amazon Elasticsearch Service, Amazon Kinesis (Data Streams and Data Firehose), Amazon EMR, AWS Lake Formation and AWS Glue, Amazon Redshift, Amazon Comprehend, Amazon SageMaker, Amazon Rekognition, AWS Snowball, and AWS Snowmobile all integrate with Amazon S3

Best practices for implementing a Data Lake on Amazon S3 — Gerard Bartolome, Sweetgreen
• Ecosystem
• Extraction — "Adapt the language to the data; DON'T adapt the data to the language"
• Transform: S3 security
• Transform: ECS / serverless
• Transform: EMR
• Usage
• California Consumer Privacy Act — anonymize users
• We are hiring! https://grnh.se/b4116be81
Data Lake on AWS
• Central storage: Amazon S3 — scalable, secure, cost-effective
• Catalog & search: Amazon DynamoDB, Amazon Elasticsearch Service, AWS Glue, AWS Lake Formation
• Access & user interfaces: AWS AppSync, Amazon API Gateway, Amazon Cognito
• Data ingestion: AWS Direct Connect, AWS Database Migration Service, AWS Storage Gateway, AWS Snowball, Amazon Kinesis Data Firehose
• Manage & secure: AWS Identity and Access Management, AWS Key Management Service, AWS CloudTrail, Amazon CloudWatch
• Analytics, machine learning & serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon DynamoDB, Amazon QuickSight, Amazon Kinesis, Amazon Elasticsearch Service, Amazon Rekognition, Amazon SageMaker, Amazon RDS, Amazon Neptune

Data Lake ingest and transform patterns
Pipelined architectures improve governance, data management, and efficiency:
• Raw data lands in Amazon S3 Standard
• ETL with AWS Glue or Amazon EMR turns it into production data in the Data Lake (Amazon S3 Intelligent-Tiering), which feeds the data warehouse (Amazon Redshift)
• Each stage is driven by triggered code (AWS Lambda), with ETL and catalog management handled by AWS Glue and AWS Lake Formation

Data management at scale best practices
• Utilize S3 object tagging — granularly control access, analyze usage, manage lifecycle policies, and replicate objects
• Implement lifecycle policies — automated, policy-driven archive and data expiration
• Utilize batch operations — manage millions to billions of objects with a single request
Plan for rapid growth and automate management at any scale.

Choosing the right Data Lake storage class
Select the storage class by data pipeline stage:
• Raw data — Amazon S3 Standard: small log files, overwrites if synced, short lived, moved and deleted, batched and archived
• ETL — Amazon S3 Standard: data churn, small intermediates, multiple transforms, deletes in under 30 days, output to the Data Lake
• Production data (Data Lake) — Amazon S3 Intelligent-Tiering: optimized object sizes (MBs), many users, unpredictable access, long-lived assets, hot to cool
• Online cool data — Amazon S3 Standard-IA / One Zone-IA: replicated DR data, infrequently accessed, infrequent queries, ML model training
• Historical data — Amazon S3 Glacier or Glacier Deep Archive: historical assets, ML model training, compliance/audit, data protection, planned restores
Optimize costs for all stages of Data Lake workflows.

Efficiently ingest data from all sources
• Real time: IoT and sensor data, clickstream data, social media feeds, and streaming logs ingested over AWS Direct Connect and Amazon Kinesis for large-scale data collection — supporting predictive analytics, IoT, sentiment analysis, and recommendation engines
• Batch: on-premises Data Lakes, EDW, ERP, mainframes, lab equipment, and NAS storage via AWS DataSync and AWS Storage Gateway, plus Oracle, MySQL, MongoDB, DB2, SQL Server, and Amazon RDS via AWS Database Migration Service — supporting BI reporting, log analysis, data warehousing, and usage optimization
• Bulk offline: sensor data, NAS, and on-premises Hadoop via AWS Snowball Edge — supporting machine learning model training, ad hoc data discovery, and data annotation
An S3 Data Lake accommodates a wide variety of concurrent data sources.

Batch relational data ingestion: event-driven batch ingest pipeline
Let Amazon CloudWatch Events and AWS Lambda drive the pipeline (see the sketch below):
1. New raw data arrives in Amazon S3 (SLA deadline: before 22:00 UTC) — start the crawler
2. Crawl the raw dataset — when the crawler succeeds, start (or trigger) the 'optimize' job
3. When the job succeeds, start a second crawler to crawl the optimized dataset
4. The reporting dataset is ready for reporting by 02:00 UTC
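As a minimal sketch of how the event-driven pipeline above could be wired together, the Lambda handlers below start a Glue crawler when data arrives and chain the 'optimize' job off crawler and job state-change events. The crawler and job names are hypothetical placeholders, and the EventBridge/CloudWatch Events rules that invoke each handler are assumed rather than shown.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical resource names -- substitute your own crawler and job names.
RAW_CRAWLER = "raw-dataset-crawler"
OPTIMIZE_JOB = "optimize-raw-dataset"
OPTIMIZED_CRAWLER = "optimized-dataset-crawler"


def on_raw_data_arrival(event, context):
    """Invoked (e.g., by a CloudWatch Events rule on S3 object creation) when raw data lands."""
    glue.start_crawler(Name=RAW_CRAWLER)


def on_crawler_succeeded(event, context):
    """Invoked by a Glue crawler state-change event; kicks off the 'optimize' ETL job."""
    if event.get("detail", {}).get("crawlerName") == RAW_CRAWLER:
        glue.start_job_run(JobName=OPTIMIZE_JOB)


def on_job_succeeded(event, context):
    """Invoked by a Glue job state-change event; crawls the optimized dataset for reporting."""
    if event.get("detail", {}).get("jobName") == OPTIMIZE_JOB:
        glue.start_crawler(Name=OPTIMIZED_CRAWLER)
```

Each handler does one small thing and exits, so the SLA is enforced by the event rules and retries rather than by long-running orchestration code.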
Real-time data ingestion
Collect, process, analyze, and aggregate data streams in real time:
• Amazon Kinesis Data Streams ingests the data streams; Kinesis Data Analytics aggregates, filters, and enriches them and provides real-time SQL insights and queries; Kinesis Data Firehose egresses them to the data store
• Streaming data is collected and processed in a fast layer (Spark on Amazon EMR, Amazon DynamoDB)
• Data is aggregated and batched before ingesting into S3, and the aggregated raw data is stored for further analysis

Running analytics on AWS Data Lakes
Lift & shift
• What: run third-party analytic tools on EC2; use EBS and S3 as data stores; self-managed environments
• Why: simplify on-premises migrations; use existing tools, code, and customizations; minimize application changes
• Consider: you provision, manage, and scale; you monitor and manage availability; you own upgrades and versioning
AWS managed services (Redshift, Glue, EMR, Athena)
• What: AWS managed and serverless platforms — Glue, Athena, EMR, Redshift; more options to process data in place
• Why: focus on data outcomes, not infrastructure; speed adoption of new capabilities; more tightly integrated with AWS security
• Consider: utilizing AWS Lake Formation; flexibility and choice with open data formats; leverage the AWS pace of innovation
Amazon S3 is the storage foundation for both approaches.

AWS Lake Formation — build a secure Data Lake in days
• Build Data Lakes quickly: move, store, catalog, and clean your data faster; transform to open formats like Parquet and ORC; ML-based de-duplication and record matching
• Simplify security management: centrally define security, governance, and auditing policies; enforce policies consistently across multiple services; integrates with IAM and KMS
• Provide self-service access to data: build a data catalog that describes your data; enable analysts and data scientists to easily find relevant data; analyze with multiple analytics services without moving data

Optimizing Data Lake performance: scaling request rates on S3
• S3 automatically scales to thousands of transactions per second in request performance
• At least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket
• Horizontally scale parallel requests to S3 endpoints to distribute load over multiple paths through the network
• 10 prefixes in an S3 bucket will scale read performance to 55,000 read requests per second
• Use the AWS SDK Transfer Manager to automate horizontally scaling connections (see the sketch below)
• There is no limit to the number of prefixes in a bucket
• The vast majority of analytic use cases don't require prefix customization

Using the AWS SDK Transfer Manager, the vast majority of applications can use any prefix naming scheme and get thousands of RPS on ingest and egress. AWS SDK retry logic handles occasional 503 errors while S3 automatically scales for sustained high load. Only consider prefix customization if:
• Your application increases its RPS exponentially within seconds or a few minutes (e.g., 0 RPS to 600K GET RPS in 5 minutes)
• Your application requires a high RPS on another S3 API such as LIST
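In Python's AWS SDK (boto3), the Transfer Manager behavior referenced above is configured through TransferConfig. The snippet below is a small illustration of parallelizing a large upload without any prefix tuning; the bucket name, key, local filename, and part-size/concurrency values are illustrative assumptions, not recommendations from the deck.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Multipart settings: split objects larger than ~64 MB into 64 MB parts
# and upload up to 10 parts in parallel over separate connections.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
)

# Placeholder bucket/key -- the SDK's built-in retry logic absorbs
# occasional 503s while S3 scales request capacity behind the scenes.
s3.upload_file(
    Filename="events-2019-12-02.parquet",
    Bucket="my-data-lake-raw",
    Key="raw/events/dt=2019-12-02/events.parquet",
    Config=config,
)
```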
Automatic request rate scaling on Amazon S3
In an autonomous driving Data Lake example, five cars (CAR01–CAR05) uploading in parallel are all throttled at around 3,500 PUTs/sec in total; as S3 creates new index partitions, the maximum TPS rises automatically.

Optimizing Data Lake performance: use optimized object sizes and data formats
• Aim for 16–256 MB object sizes to optimize throughput and cost
  • Also reduces LISTs, metadata operations, and job setup time
• Aggregate during ingest with Kinesis, or during ETL with Glue or EMR + Spark
• Utilize Parquet or ORC formats
  • Compressed by default and splittable
  • Parquet enables parallel queries

Utilize caching and tiering where appropriate
• Utilize the EMR HDFS namespace for small-file Spark workloads
• Consider Amazon DynamoDB and ElastiCache for low-latency data presentation
• Utilize Amazon CloudFront for distributing frequently accessed data
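One way to apply the object-size and format guidance above is a small Spark job on EMR or Glue that compacts many small raw files into a handful of Parquet objects in the target size range. This is a hedged sketch: the S3 paths and the partition count of 8 are illustrative placeholders that would be tuned to the actual dataset.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-to-parquet").getOrCreate()

# Read many small raw JSON objects for one ingest date (illustrative path).
raw = spark.read.json("s3://my-data-lake-raw/raw/events/dt=2019-12-02/")

# Coalesce to a few partitions so each output object lands roughly in the
# 16-256 MB range, then write compressed, splittable Parquet for the Data Lake.
(raw
    .coalesce(8)  # tune so each output file ends up roughly 16-256 MB
    .write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet("s3://my-data-lake-prod/events/dt=2019-12-02/"))
```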