Mining Your Data Lake for Analytics Insights
Directly access the richness of your data lake for advanced analytics

INTRODUCTION
Unify analytics across your business

For years, companies have dumped data into their data lake and deferred organizing the data until later. To make this data useful for analytics, it must be carefully structured and cataloged, and as more data is rapidly introduced from log files and IoT sources, it must be structured in the same way. Delta Lake on Databricks provides a way to streamline these data pipelines so the data is instantly available for analysis.

In this eBook, learn more about using Delta Lake and Amazon Web Services (AWS) to prepare and deliver data lake data directly to drive valuable analytics insights, and read use cases from LoyaltyOne and Comcast to see how they are getting value out of Delta Lake.

The challenge of scaling a data lake for analytics

With each day, there is ever more data to manage—think streaming data, IoT data, event data, and social media data. A fleet of trucks can provide sensor readings every five seconds. A corporate intrusion detection program can track every IP address that enters a network, and its actions. These types of use cases create hundreds of terabytes of information daily.

A data lake can provide companies a place to store all that data, but it does not provide that data in a way that is analytics-ready. The data may pass through reformatting processes and into other storage systems as it is prepared for analysis, a process that can take several hours to several days. Meanwhile, trucks can break down and network infiltrators can wreak havoc. Organizations need instant access to all that data to keep their business running.

Data reliability challenges and data lakes

When it comes to making data in data lakes accessible for analytics, there are a number of issues.

FAILED WRITES: If a production job that is writing data experiences failures, which are inevitable in large distributed environments, the result can be data corruption through partial or multiple writes, leaving partial datasets littering the bottom of the data lake. What is needed is a mechanism that ensures a write either takes place completely or not at all (and not multiple times, adding spurious data). Recovering to a clean state after a failed job can impose a considerable burden.

SCHEMA MISMATCH: When ingesting content from multiple sources, typical of large, modern big data environments, it can be difficult to ensure that the same data is encoded in the same way; in other words, that the schema matches. A similar challenge arises when the formats of data elements are changed without informing the data engineering team. Both result in low-quality, inconsistent data that requires cleanup before it is usable. Schema enforcement is one key to consistency; being able to read a data set while it is being updated is the other.
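To make the schema-mismatch problem concrete, the following is a minimal PySpark sketch, assuming the Delta Lake package (introduced later in this eBook) is available on the cluster; the table path and column names are illustrative, not a prescribed implementation. It shows an append being rejected when its schema does not match the table, rather than silently writing inconsistent data.

```python
# Minimal sketch: Delta Lake fails a write whose schema does not match the table,
# instead of silently corrupting it. Assumes the Delta Lake package is configured
# on the cluster (e.g., via --packages io.delta:delta-core_2.12:<version>);
# the path below is illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("schema-enforcement-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# The initial load defines the table schema: device_id (long), reading (double).
readings = spark.createDataFrame([(1, 72.5), (2, 68.0)], ["device_id", "reading"])
readings.write.format("delta").mode("overwrite").save("/tmp/delta/readings")

# A later batch encodes "reading" as a string -- a typical schema mismatch.
bad_batch = spark.createDataFrame([(3, "71.9")], ["device_id", "reading"])
try:
    bad_batch.write.format("delta").mode("append").save("/tmp/delta/readings")
except Exception as err:
    # The whole append is rejected; no partial or inconsistent data is committed.
    print(f"Write rejected: {err}")
```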
LACK OF CONSISTENCY: To provide insights, it is necessary to combine historical batch data with new streaming data to show current behavior. Trying to read data while it is being appended presents a challenge: on the one hand there is a desire to keep ingesting new data, while on the other hand anyone reading the data prefers a consistent view. This is especially an issue when there are multiple readers and writers at work. It is undesirable and impractical to stop read access while writes complete, or to stop write access while reads are in progress.
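As a rough illustration of what a transactional storage layer buys here, below is a hedged PySpark sketch, assuming Delta Lake is available on the cluster; the paths, ingest rate, and sleep-based pacing are illustrative only. A streaming job keeps appending to a Delta table while a separate batch query reads the same path; because readers see only fully committed transactions, the count reflects a consistent snapshot rather than a partial write.

```python
# Minimal sketch: a batch reader gets a consistent snapshot of a Delta table
# while a streaming writer keeps appending to it. Assumes the Delta Lake package
# is configured on the cluster; paths and rates are illustrative.
import time
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("consistent-reads-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Seed the table with an initial batch so the path exists as a Delta table.
spark.range(0, 1000).toDF("event_id") \
    .write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Writer: keep appending new events from a synthetic streaming source.
stream = (spark.readStream.format("rate").option("rowsPerSecond", 100).load()
          .selectExpr("value AS event_id"))
writer = (stream.writeStream.format("delta")
          .option("checkpointLocation", "/tmp/delta/_checkpoints/events")
          .start("/tmp/delta/events"))

time.sleep(10)  # let a few micro-batches commit (sketch only)

# Reader: a batch query on the same path sees only committed transactions,
# so it counts a consistent snapshot even while the stream keeps appending.
count_now = spark.read.format("delta").load("/tmp/delta/events").count()
print(f"Rows visible in this snapshot: {count_now}")

writer.stop()
```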
Data engineering drivers for advanced analytics

Your data engineers are responding to several different drivers in adopting advanced analytics. They include:

• GETTING MORE VALUE FROM CORPORATE ASSETS: Advanced analytics, including methods based on ML, have evolved to such a degree that organizations seek to use them to derive far more value from their corporate assets.

• WIDESPREAD ADOPTION: Advanced approaches are being adopted across a multitude of industries, and across private and public sector organizations. This further drives the need for strong data engineering practices.

• REGULATION REQUIREMENTS: There is increased interest in how growing amounts of data are protected and managed. Regulations such as GDPR (General Data Protection Regulation) are very specific.

• TECHNOLOGY INNOVATION: The move to cloud-based analytics architectures is propelled further by new innovations such as analytics-focused chipsets, pipeline automation, and the unification of data and machine learning.

• FINANCIAL SCRUTINY: Analytics initiatives are subject to increasing financial scrutiny. There is a greater understanding of data as an asset: deriving value from data must be done in a financially responsible way, add value to the enterprise, and generate ROI.

• ROLE EVOLUTION: Reflecting the importance of managing data and maximizing value, the Chief Data Officer (CDO) role is more prominent, and new roles such as Data Curator are emerging. They must balance the needs of governance, security, and democratization.

Evolving data pipelines for advanced analytics

Your data engineers need to account for a broad set of dependencies and requirements as they design and build their data pipelines. Making quality data available in a reliable manner drives success for data analytics initiatives, whether they are regular dashboards or reports, or advanced analytics projects drawing on state-of-the-art ML techniques. Three primary goals drive data engineers as they work to enable analytics in their organizations:

1. DELIVER QUALITY DATA IN LESS TIME: When it comes to data, quality and timeliness are key. Data with gaps or errors—which can arise for many reasons—is unreliable, can lead to incorrect conclusions, and is of diminished value to downstream users. Many applications are of limited value without up-to-date information.

2. ENABLE FASTER QUERIES: Wanting fast responses to queries is natural in today's online world. Achieving this is particularly challenging when queries run against very large data sets.

3. SIMPLIFY DATA ENGINEERING AT SCALE: It is manageable to achieve high reliability and performance in a limited development or test environment. What matters more is the ability to support robust production data pipelines at large scale without requiring high operational overhead.

Apache Spark: A Unified Data Analytics Engine

Apache Spark™ was originally developed at UC Berkeley in 2009. Uniquely bringing data and AI technologies together, Apache Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.

Since its release, Apache Spark has seen rapid adoption by enterprises across a wide range of industries. Netflix, Yahoo, and eBay have deployed Apache Spark at massive scale, collectively processing multiple petabytes (PBs) of data on clusters of over 8,000 nodes—making it the de facto choice for new analytics initiatives. It has quickly become the largest open-source community in big data, with over 1,000 contributors from more than 250 organizations. The founders of Databricks donated Apache Spark to the open-source big data community and continue to contribute 75% of the code to Apache Spark.

Delta Lake storage layer serializes, compacts, and cleanses data

Your data scientists and data engineers need to focus on writing pipelines and algorithms. Delta Lake—which is open-source, open format, and compatible with Apache Spark APIs—automates many of the tasks required to prepare data for analytics, helping to speed time to value for analytics projects. Working in conjunction with your data lake, Delta Lake:

• Automatically compacts data and executes de-duplication tasks.
• Makes ETL processes much faster on the front end and streamlines the porting of data into an ML model.
• Provides a storage layer that brings ACID (atomicity, consistency, isolation, durability) transactions to Apache Spark™ and data lakes.

Delta Lake data flow and refinement

A common architecture uses tables that correspond to different quality levels in the data engineering pipeline, progressively adding structure to the data: data ingestion ("Bronze" tables), transformation/feature engineering ("Silver" tables), and machine learning training or prediction ("Gold" tables). Combined, these tables are referred to as a "multi-hop" architecture. It allows data engineers to build a pipeline that begins with raw data as a "single source of truth" from which everything flows. Subsequent transformations and aggregations can be recalculated and validated to ensure that business-level aggregate tables still reflect the underlying data, even as downstream users refine the data and introduce context-specific structure.

[Figure: Delta Lake data flow and refinement. Batch and streaming data from your existing data lake are ingested into Bronze (ingestion) tables, refined into Silver tables, and aggregated into Gold (feature/aggregate) tables that feed analytics and ML.]

Delta Lake enforces schema and supports versioning

Features of Delta Lake include:

• TIME TRAVEL (DATA VERSIONING): Delta Lake provides snapshots of data, enabling developers to revert to earlier versions of data for audits or rollbacks, or to reproduce experiments. For more details on versioning, please read Introducing Delta Time Travel for Large Scale Data Lakes.
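As a brief illustration of the time travel feature, here is a minimal PySpark sketch; it assumes a Delta table with several committed versions exists at the illustrative path used in the earlier sketches, and the timestamp value is a placeholder.

```python
# Minimal sketch of Delta Lake time travel: query an earlier snapshot of a table
# for an audit, a rollback check, or to reproduce an experiment.
# Assumes the Delta Lake package is configured; path and timestamp are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("time-travel-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read the table as of a specific commit version ...
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

# ... or as of a point in time.
snapshot = (spark.read.format("delta")
            .option("timestampAsOf", "2019-06-01 00:00:00")
            .load("/tmp/delta/events"))

print(v0.count(), snapshot.count())
```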