Mining Your Data Lake for Analytics Insights
Directly access the richness of your data lake for advanced analytics
INTRODUCTION

Unify analytics across your business

For years, companies have dumped data into their data lakes and deferred organizing the data until later. To make this data useful for analytics, the data must be carefully structured and cataloged. And as more data is rapidly introduced, from log files and IoT sources, it must be structured in the same way. Delta Lake on Databricks provides a way to streamline these data pipelines so the data is instantly available for analysis.

In this eBook, learn more about using Delta Lake and Amazon Web Services (AWS) to prepare data lake data and deliver it directly to drive valuable analytics insights. Read use cases from LoyaltyOne and Comcast to see how they are getting value out of Delta Lake.

The challenge of scaling a data lake for analytics

With each day, there is more and more data to manage—think streaming data, IoT data, event data, and social media data. A fleet of trucks can provide sensor readings every five seconds. A corporate intrusion detection program can track every IP address that enters a network, along with its actions. Use cases like these create hundreds of terabytes of information daily.

A data lake gives companies a place to store all that data, but it does not provide that data in an analytics-ready form. The data may pass through processes that reformat it and move it into other storage systems as it is prepared for analysis, which can take anywhere from several hours to several days. Meanwhile, trucks break down and network infiltrators wreak havoc. Organizations need instant access to all that data to keep their business running.

Data reliability challenges and data lakes

When it comes to making data in data lakes accessible for analytics, a number of issues arise.

FAILED WRITES
If a production job that is writing data experiences failures, which are inevitable in large distributed environments, the result can be data corruption through partial or duplicate writes, leaving partial datasets littering the bottom of the data lake. What is needed is a mechanism that ensures a write either takes place completely or not at all (and not multiple times, adding spurious data). Recovering to a clean state after a failed job can impose a considerable burden.

SCHEMA MISMATCH
When ingesting content from multiple sources, as is typical of large, modern big data environments, it can be difficult to ensure that the same data is encoded in the same way—in other words, that the schemas match. A similar challenge arises when the formats of data elements are changed without informing the data engineering team. Both can result in low-quality, inconsistent data that requires cleanup before it is usable. Schema enforcement is one key to consistency; being able to read a data set while it is being updated is the other.

LACK OF CONSISTENCY
To provide insights, it is necessary to combine historical batch data with new streaming data to show current behavior. Trying to read data while it is being appended is a challenge: on the one hand there is a desire to keep ingesting new data, while on the other hand anyone reading the data prefers a consistent view. This is especially an issue when there are multiple readers and writers at work. It is undesirable and impractical to stop read access while writes complete, or to stop write access while reads are in progress.
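Delta Lake addresses these failure modes with transactional writes and schema validation. The following is a minimal PySpark sketch of that behavior, assuming the open-source delta-spark package is installed; the table path, column names, and sample values are illustrative.

```python
# Minimal sketch: Delta Lake's atomic writes and schema enforcement.
# Assumes the delta-spark package is installed; paths, columns, and
# values are illustrative.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-reliability-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta/sensor_readings"  # illustrative table location

# An append either commits atomically or not at all; a failed job
# leaves no partial files visible to readers.
readings = spark.createDataFrame(
    [("truck-17", 98.6), ("truck-42", 101.2)], ["truck_id", "temp_f"]
)
readings.write.format("delta").mode("append").save(path)

# Schema enforcement: a write whose schema does not match the table's
# is rejected instead of silently producing inconsistent data.
bad = spark.createDataFrame([("truck-17", "hot")], ["truck_id", "status"])
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as e:
    print("Write rejected by schema enforcement:", type(e).__name__)

# Readers always see the last committed snapshot, even while other
# writers are appending new data.
spark.read.format("delta").load(path).show()
```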
Data engineering drivers for advanced analytics

Your data engineers are responding to several different drivers in adopting advanced analytics. They include:

• GETTING MORE VALUE FROM CORPORATE ASSETS Advanced analytics, including methods based on ML, have evolved to such a degree that organizations seek to use them to derive far more value from their corporate assets.

• WIDESPREAD ADOPTION Advanced approaches are being adopted across a multitude of industries, and across private and public sector organizations. This further drives the need for strong data engineering practices.

• REGULATION REQUIREMENTS There is increased interest in how growing amounts of data are protected and managed. Regulations such as the GDPR (General Data Protection Regulation) impose very specific requirements.

• TECHNOLOGY INNOVATION The move to cloud-based analytics architectures is propelled further by new innovations such as analytics-focused chipsets, pipeline automation, and the unification of data and machine learning.

• FINANCIAL SCRUTINY Analytics initiatives are subject to increasing financial scrutiny, and there is a greater understanding of data as an asset. Deriving value from data must be done in a financially responsible way, add value to the enterprise, and generate ROI.

• ROLE EVOLUTION Reflecting the importance of managing data and maximizing its value, the Chief Data Officer (CDO) role is more prominent, and new roles such as Data Curator are emerging. These roles must balance the needs of governance, security, and democratization.

Evolving data pipelines for advanced analytics

Your data engineers need to account for a broad set of dependencies and requirements as they design and build their data pipelines. Making quality data available in a reliable manner drives success for data analytics initiatives, whether they are regular dashboards or reports, or advanced analytics projects drawing on state-of-the-art ML techniques. Three primary goals drive data engineers as they work to enable analytics in their organizations:

1. DELIVER QUALITY DATA IN LESS TIME When it comes to data, quality and timeliness are key. Data with gaps or errors—which can arise for many reasons—is unreliable, can lead to incorrect conclusions, and is of diminished value to downstream users. Many applications are of limited value without up-to-date information.

2. ENABLE FASTER QUERIES Wanting fast responses to queries is natural in today's online world. Achieving this is particularly challenging when queries run against very large data sets (see the sketch after this list).

3. SIMPLIFY DATA ENGINEERING AT SCALE It is manageable to achieve high reliability and performance in a limited development or test environment. What matters more is the ability to support robust production data pipelines at large scale—without requiring high operational overhead.
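On the "faster queries" goal, Delta Lake offers file compaction and data clustering to reduce the amount of data a query must scan. A minimal sketch, assuming delta-spark 2.0 or later (where OPTIMIZE and Z-ordering are available in the open-source release) and an existing Delta table; the path and column name are illustrative.

```python
# Sketch: compacting small files and clustering rows to speed up
# queries on a large Delta table. Assumes delta-spark >= 2.0; the
# table path and column name are illustrative.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-optimize-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

events = DeltaTable.forPath(spark, "/tmp/delta/events")

# Coalesce many small files into fewer, larger ones.
events.optimize().executeCompaction()

# Co-locate rows with similar event_time values so queries filtering
# on event_time can skip unrelated files.
events.optimize().executeZOrderBy("event_time")
```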
Apache Spark: A Unified Data Analytics Engine

Apache Spark™ was originally developed at UC Berkeley in 2009. Uniquely bringing data and AI technologies together, Apache Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.

Since its release, Apache Spark has seen rapid adoption by enterprises across a wide range of industries. Netflix, Yahoo, and eBay have deployed Apache Spark at massive scale, collectively processing multiple petabytes (PBs) of data on clusters of over 8,000 nodes—making it the de facto choice for new analytics initiatives. It has quickly become the largest open-source community in big data, with over 1,000 contributors from more than 250 organizations. The founders of Databricks donated Apache Spark to the open-source big data community and continue to contribute 75% of the code to Apache Spark.

Delta Lake storage layer serializes, compacts, and cleanses data

Your data scientists and data engineers need to focus on writing pipelines and algorithms. Delta Lake—which is open source, open format, and compatible with Apache Spark APIs—automates many of the tasks required to prepare data for analytics, helping to speed time to value for analytics projects. Working in conjunction with your data lake, Delta Lake:

• Automatically compacts data and executes de-duplication tasks.

• Makes ETL processes much faster on the front end and streamlines the porting of data into an ML model.

• Provides a storage layer that brings ACID (atomicity, consistency, isolation, durability) transactions to Apache Spark™ and data lakes.

Delta Lake data flow and refinement

A common architecture uses tables that correspond to different quality levels in the data engineering pipeline, progressively adding structure to the data: data ingestion ("Bronze" tables), transformation/feature engineering ("Silver" tables), and machine learning training or prediction ("Gold" tables). Combined, these tables are referred to as a "multi-hop" architecture. It allows data engineers to build a pipeline that begins with raw data as a "single source of truth" from which everything flows. Subsequent transformations and aggregations can be recalculated and validated to ensure that business-level aggregate tables still reflect the underlying data, even as downstream users refine the data and introduce context-specific structure.

[Figure: streaming and batch sources land in Bronze ingestion tables, are refined into Silver tables, and are aggregated into Gold feature/aggregate tables on your existing data lake, which feed analytics and ML.]

Delta Lake enforces schema and supports versioning

Features of Delta Lake include:

• TIME TRAVEL (DATA VERSIONING) Delta Lake provides snapshots of data, enabling developers to revert to earlier versions of data for audits or rollbacks, or to reproduce experiments. For more details on versioning, please read Introducing Delta Time Travel for Large Scale Data Lakes.
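In practice, time travel is exposed through read options on the Delta source. A minimal sketch, reusing the Delta-enabled SparkSession from the first sketch; the path, version number, and timestamp are illustrative.

```python
# Sketch: reading earlier snapshots of a Delta table (time travel).
# `spark` is a Delta-enabled SparkSession as in the first sketch; the
# table is assumed to have more than one committed version.
path = "/tmp/delta/sensor_readings"  # illustrative table location

# Latest committed snapshot.
current = spark.read.format("delta").load(path)

# The table as it existed at version 0...
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# ...or as of a point in time.
past = (spark.read.format("delta")
        .option("timestampAsOf", "2019-06-01 00:00:00")
        .load(path))
```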
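The multi-hop architecture described above can also be sketched end to end. Below is a minimal batch version, again reusing the Delta-enabled SparkSession from the first sketch; the paths, schema, and cleansing and aggregation rules are illustrative stand-ins, not a prescribed pipeline.

```python
# Sketch: a minimal Bronze -> Silver -> Gold multi-hop flow on Delta
# tables. `spark` is a Delta-enabled SparkSession as in the first
# sketch; paths, columns, and rules are illustrative.
bronze = "/tmp/delta/readings_bronze"
silver = "/tmp/delta/readings_silver"
gold = "/tmp/delta/readings_gold"

# Bronze: land raw source data as-is, the "single source of truth".
raw = spark.read.json("/data/landing/sensor_readings/")  # illustrative source
raw.write.format("delta").mode("append").save(bronze)

# Silver: cleanse and de-duplicate for downstream consumers.
cleaned = (spark.read.format("delta").load(bronze)
           .dropDuplicates(["reading_id"])
           .filter("truck_id IS NOT NULL"))
cleaned.write.format("delta").mode("overwrite").save(silver)

# Gold: business-level aggregates ready for analytics and ML.
per_truck = (spark.read.format("delta").load(silver)
             .groupBy("truck_id")
             .agg({"temp_f": "avg"}))
per_truck.write.format("delta").mode("overwrite").save(gold)
```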