Accelerate Insights and Streamline Ingestion with a Data Lake on AWS
Accelerate insights and streamline ingestion with a data lake on AWS
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Presenters
● Helen Beal, Chief Ambassador, DevOps Institute
● Kanchan Waikar, Solutions Architect, AWS

Agenda
Learn how to get the full benefits of cloud data lakes without compromising the productivity of your DevOps team. The presenters will explain how a modern data lake can:
○ Realize the elasticity of the cloud and reduce costs through consumption-based utilization
○ Prepare your structured, unstructured, and semi-structured data for integration with analytics and machine learning tools
○ Support data democratization and enable collaboration while maintaining security and data governance

About Helen Beal
● Human
● DevOps and ways-of-working practitioner
● Top DevOps Evangelist, 2020 (DevOps Dozen)
● Chief Ambassador: DevOps Institute
● Chair: Value Stream Management Consortium
● Strategic Advisor
● DevOps Editor: InfoQ
● Ambassador: CD Foundation
● Analyst: Accelerated Strategies
● Mission: Bringing joy to work

Talk map
Data is everywhere → 3rd-party data → Extracting insights → The 3 Vs → Stream processing → Batch processing → ML, AI and DataOps → DevOps practices

Data is Everywhere
Every application, every service, every environment generates data.

Data from Third Parties
Find, subscribe to, and use third-party data in the cloud.

Extracting Data Insights
An insight is only of value when it has a positive outcome: data leads to insights, and insights lead to outcomes. Why is this so hard to do?

Different Data, Different Needs
The 3 Vs of big data:
1. VOLUME
2. VELOCITY
3. VARIETY
It's a scaling problem.

Stream Processing
The goal is to process big data volumes and provide useful insights into the data prior to saving it to long-term storage. Use cases for stream processing are found where systems handle big data volumes and real-time results matter: if the value of the information contained in the data stream decreases rapidly as it gets older, stream processing is appropriate.

Dimension                  Batch               Stream
History                    Traditional         Modern
Data processing location   System of record    Source
In event of failure        Restart batch       Retry increment
Pros                       Simple and robust   Live, scalable, fault tolerant
Cons                       Latency             Complex, expensive?

Example use cases:
● Real-time analytics
● Anomaly, fraud or pattern detection
● Complex event processing
● Real-time statistics and dashboards
● Real-time extract, transform, load (ETL)
● Implementing event-driven architectures
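For the real-time analytics and streaming ETL use cases above, records are typically written to and read from a managed stream by lightweight producers and consumers. Below is a minimal sketch using boto3 with Amazon Kinesis Data Streams; the stream name ("clickstream"), region, and record fields are illustrative assumptions, not part of the deck.

```python
# Minimal producer/consumer sketch for Amazon Kinesis Data Streams using boto3.
# The stream name, region, and record fields are hypothetical; the stream is
# assumed to already exist.
import json
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "clickstream"  # hypothetical, pre-created stream

# Producer: the partition key determines which shard receives the record.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({"user_id": "u-123", "event": "page_view", "ts": time.time()}).encode("utf-8"),
    PartitionKey="u-123",
)

# Consumer: tail the first shard, starting from the latest records.
shard_id = kinesis.list_shards(StreamName=STREAM)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while iterator:
    response = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in response["Records"]:
        event = json.loads(record["Data"])
        print(event)  # in practice: aggregate, detect anomalies, or feed a live dashboard
    iterator = response.get("NextShardIterator")
    time.sleep(1)  # stay under the per-shard read limits
```

In production this polling loop is usually replaced by the Kinesis Client Library or an AWS Lambda trigger, but the flow is the same: partitioned writes, ordered per-shard reads.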
Aggregation and Batch Processing
● Use batch processing jobs to prepare large, bulk datasets for downstream analytics
● Avoid simply lifting and shifting existing batch processing to AWS
● Automate and orchestrate everywhere
● Use Spot Instances to save on flexible batch processing jobs
● Continuously monitor and improve batch processing

Structured and Unstructured Data
Roughly 20% of data is structured; 80%+ is unstructured. Semi-structured data uses tagging systems or other markers that separate different elements and enable search (it is self-describing); think JSON, CSV, XML.

Dimension              Structured                               Unstructured
Format                 Defined                                  Undefined
Type                   Quantitative                             Qualitative
Usually stored in      Data warehouses                          Data lakes
Search and analyze     Easy                                     More work
Database               RDBMS                                    NoSQL
Programming language   SQL                                      Various
Analysis               Regression, classification, clustering   Data mining, data stacking
Insights               High level                               Deeper insights into sentiment and behavior

Examples of unstructured data:
● Documents
● Publications
● Reports
● Emails
● Social media
● Videos
● Images
● Audio
● Mobile activity
● Satellite imagery
● Sensors
● Customer data

Centrally Managed Data
Data warehouses, data lakes and lake houses are key to enabling analytics:
● Predictive analysis
● Join lines of business (LoBs)
● Cross-organizational insights
● Make better business decisions
● Automate decision support systems (DSS)
● Improve customer interactions
● Improve R&D innovation choices
● Increase operational efficiencies

ML, AI and DataOps
Capabilities: deep exploration, personalized insights, real-time queries, transparency, context, anomaly detection, causal relationships, trend isolation, noise reduction, segmentation.
The business doesn't need to be data engineers or scientists to search, using natural language, for the answers they need and to gain the insights that will enable them to make intelligent, data-driven business decisions. A DevOps team quickly builds a high-quality, device-friendly app on a cloud-based platform designed with developer and consumer usability in mind, and makes it available via self-service.

Leveraging DevOps Practices
How each DevOps practice translates into DataOps:
● Incremental, continuous change: Data needs to be mined and business intelligence analyzed at speed and with adaptability too. Systems from backlog to deployment must handle data needs.
● CICD and DevOps toolchains: Teams working with data need to leverage the power of automation to maximise throughput and stability, and to provide CICD capabilities with a limited blast radius.
● The Three Ways: We want to accelerate flow, amplify feedback and use our data to drive experiments too. Monitoring and observability are key, with AI for feedback.
● A high-trust, collaborative culture: In order to build trust in a DevOps culture we have data-driven, not opinion-driven, conversations. Data must be available in real time, on demand and via self-service.
● Value-stream-centric working: Truly understanding flow means all people in the value stream have a profound understanding of the end-to-end system, and this is driven by data insights.
● "We build it, we own it": Teams must be multifunctional, cross-skilling must be standard practice, and it must be quick and easy to get results from tools - choose those designed with usability in mind.
● Focus on value outcomes: Insights lead to decisions, which lead to measurable experience improvements for the customer: AI accelerates mean time to outcome (MTTO).
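The CICD point above is often made concrete by putting automated data quality checks into the delivery pipeline, so a bad extract never reaches the lake's curated zone. A minimal sketch, assuming a hypothetical Parquet extract and illustrative column names (none of these appear in the original deck):

```python
# Minimal DataOps-style data quality gate, e.g. run from a CI job before
# promoting a dataset. The file path and column names are illustrative;
# pandas with a Parquet engine (pyarrow) is assumed to be installed.
import sys

import pandas as pd

REQUIRED_COLUMNS = {"order_id", "customer_id", "order_total", "created_at"}
MAX_NULL_RATE = 0.01  # fail the pipeline if more than 1% of a column is missing


def validate(path: str) -> list:
    """Return a list of human-readable data quality failures for one extract."""
    df = pd.read_parquet(path)
    failures = []

    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")

    for column in REQUIRED_COLUMNS & set(df.columns):
        null_rate = df[column].isna().mean()
        if null_rate > MAX_NULL_RATE:
            failures.append(f"{column}: null rate {null_rate:.2%} exceeds {MAX_NULL_RATE:.0%}")

    if "order_total" in df.columns and (df["order_total"] < 0).any():
        failures.append("order_total contains negative values")

    return failures


if __name__ == "__main__":
    problems = validate(sys.argv[1] if len(sys.argv) > 1 else "orders.parquet")
    if problems:
        print("Data quality gate failed:")
        for problem in problems:
            print(f"  - {problem}")
        sys.exit(1)  # non-zero exit stops promotion, keeping the blast radius limited
    print("Data quality gate passed.")
```

Run as a CI stage or an orchestration step, a non-zero exit code halts the pipeline, which is one way to realize the automation and limited blast radius described above.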
Key Takeaways: Accelerate Insights and Streamline Ingestion with a Data Lake on AWS

The 3 Vs
● There is a LOT of data, coming from many different sources, in multiple formats, at variable speeds
● Remember the 3 Vs: Volume, Velocity and Variety; these demand scalability
● Different data has different needs

Data as a Service
● Making data available centrally is key for efficient processing and access
● The objective is to make better business decisions
● Those business decisions must result in sublime customer experiences

Augmented Analytics
● AI/ML and predictive analytics accelerate time to insight and time to outcome
● This makes more innovation time available to build differentiating features
● DataOps accelerates the data pipeline

THANK YOU

Streamline ingestion and extract insights from your data lake
Kanchan Waikar, Solutions Architect, AWS
© 2021, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Key steps of an end-to-end analytics process
[Diagram slide]

Advantages of a data lake on AWS
● Flexibility
● Agility
● Security and compliance
● Broad and deep capabilities

Data lake infrastructure
[Diagram slide]

Data lake storage – Amazon S3
● Amazon S3 is designed for 99.999999999% (11 9s) of data durability
● Security by design
● Scalability on demand
● Durable against the failure of an entire AWS Availability Zone
● Integrations with third-party service providers

AWS data lake architecture
[Diagram] Sources such as web app logs, Amazon RDS, other databases, on-premises data and streaming data land in Amazon Simple Storage Service (Amazon S3). AWS Glue and the AWS Glue Crawler populate the AWS Glue Data Catalog, and the data is consumed through Amazon Athena, Amazon QuickSight, Amazon EMR, Amazon SageMaker and Amazon Redshift.

AI Services
● Vision: Rekognition Image, Rekognition Video, Textract
● Speech: Polly, Transcribe
● Language: Translate, Comprehend, Comprehend Medical
● Chatbots: Lex
● Forecasting: Forecast
● Recommendations: Personalize
Pre-trained AI services that require no ML skills or training; easily add intelligence to your existing apps and workflows; quality and accuracy from continuously learning APIs.

Data ingestion

Data sources
Source types: HDFS, unstructured data, real-time data, video/audio, database/data warehouse, …
Sources feeding the lake: Amazon Redshift, Amazon RDS, Amazon DynamoDB, Amazon Aurora, Amazon Kinesis, Amazon CloudWatch Logs, on-premises data, third-party data, CRM systems and marketing data, all landing in Amazon S3.
Data movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Streams | Kinesis Data Firehose | Managed Streaming for Apache Kafka

Use AWS Data Exchange for procuring third-party data on AWS
Data providers:
● Reach millions of AWS customers
● Distribute data in a secure and compliant way
● No longer need to maintain data storage, delivery, billing, or entitling technology
● Migrate existing subscriptions at no additional cost
Data subscribers:
● Quickly find diverse data in one place
● Easily analyze data as it's published
● Automatically access new data
● Migrate existing subscriptions at no additional cost
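In the architecture above, once a Glue crawler has registered tables in the Data Catalog, the data can be queried in place on S3 with Amazon Athena. A minimal sketch with boto3; the region, Glue database, table name and results bucket are hypothetical assumptions, not values from the deck.

```python
# Minimal sketch of querying the data lake through Amazon Athena with boto3.
# The region, S3 output location, Glue database, and table name are hypothetical.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
    SELECT status, COUNT(*) AS requests
    FROM web_app_logs          -- hypothetical table registered by an AWS Glue crawler
    GROUP BY status
    ORDER BY requests DESC
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_data_lake"},                 # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)
query_id = execution["QueryExecutionId"]

# Athena is asynchronous: poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:  # the first row holds the column headers
        print([col.get("VarCharValue") for col in row["Data"]])
else:
    print(f"Query finished with state {state}")
```

The same query can be run interactively from the Athena console or visualized in QuickSight; the boto3 form is convenient inside automated DataOps pipelines.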
AWS Marketplace: 8,000+ … 1,600+ … 24 …