LOOK BEFORE YOU LEAP INTO THE DATA LAKE
By Rash Gandhi, Sanjay Verma, Elias Baltassis, and Nic Gordon
To fully capture the tremendous value of using big data, organizations need nimble and flexible data architectures able to liberate data that could otherwise remain locked within legacy technologies and organizational processes.

Rapid advances in technology and analytical processing have enabled companies to harness and mine an explosion of data generated by smartphone apps, website click trails, customer support audio feeds, social media messages, customer transactions, and more. Traditional enterprise data warehouse and business intelligence tools excel at organizing the structured data that businesses capture, but they stumble badly when it comes to storing and analyzing data of the variety and quantity captured today and doing so at the speed now required. Companies need data architectures that can handle the diversity of data available now (semistructured data, unstructured data, log files, documents, videos, and audio, for example) and yield even more accurate predictive modeling and customer insight at a highly detailed level.

Enter the "data lake," a term that refers to a large repository of data in a "natural," unprocessed state. Data lakes' flexibility and size allow for substantially easier storage of raw data streams that today include a multitude of data types. Data can be collected and later sampled for ideas, tapped for real-time analytics, and even potentially treated for analysis in traditional structured systems. But before organizations dive into the data lake, it's important to understand what makes this new architecture unique, the challenges organizations can face during implementation, and ways to address those challenges.

What Exactly Is a Data Lake?

Historically, organizations have invested heavily in building data warehouses. Significant up-front time, effort, and cost go into identifying all the source data required for analysis and reporting, defining the data model and the database structure, and developing the programs. The process often follows a sequence of steps known as ETL: extract source data, transform it, and load it into the data warehouse. Making changes to an existing data warehouse requires sizable additional investment to redesign the programs that extract, transform, and load data; we estimate that 60% to 75% of development costs come in the ETL layer.
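To make the ETL sequence concrete, here is a minimal sketch of a conventional extract-transform-load job, written in PySpark as one plausible tool; the source file, column names, and warehouse table are hypothetical, chosen to illustrate the pattern described above, and are not taken from the article.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read source data that already conforms to a known layout
# (hypothetical file and columns).
orders = spark.read.csv("source/orders.csv", header=True, inferSchema=True)

# Transform: cleaning and reshaping happen before the data is stored,
# so the warehouse schema and these rules must be designed up front.
cleaned = (orders
           .dropna(subset=["order_id", "customer_id"])
           .withColumn("order_date", F.to_date("order_date"))
           .withColumn("amount", F.col("amount").cast("double")))

# Load: write into the warehouse table with its predefined structure.
cleaned.write.mode("append").saveAsTable("warehouse.fact_orders")

Because the transformation rules are fixed before anything is stored, changing them means redesigning and retesting jobs like this one, which is why such a large share of development cost sits in the ETL layer.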
Moreover, data warehouse solutions typically provide historical, or backward-looking, views. And here is where the challenge arises: organizations today are demanding that data tell them not just what happened in the past but also what is likely to happen in the future. They seek predictive and actionable insights, gleaned from a variety of data accessed through both batch and real-time processing, to inform their strategies.

Traditional data warehouses are not ideal solutions to this challenge. They are slow to change and costly to operate, and they can't be scaled cost-efficiently to process the growing volume of data. Data lakes can fill the void.

Given companies' storage requirements (to house vast amounts of data at low cost) and computing requirements (to process and run analytics on this volume of data), data lakes typically use low-cost commodity servers in a scale-out architecture. Servers can be added as needed to increase processing power and data capacity. These systems are typically configured with data redundancy to ensure high resilience and availability. Much of the big data software is open source, which drives down costs. The total cost of establishing and running a data lake can be five to ten times lower than the cost of using traditional SQL-based systems.

A company's data lake can be built on any of multiple technology ecosystems (for example, Hadoop, Drill, and Cassandra), the most notable of which is the well-established Hadoop. Both upstarts (including Cloudera, MapR, and Hortonworks) and traditional IT players (such as IBM, HP, Microsoft, and Intel) have used Hadoop in constructing their data lakes.

Data lakes are highly flexible, and they enable a responsive "fail fast" approach to analytics that can drive significant value. In their simplest form, data lakes have three core functions (see Exhibit 1):

• To ingest structured and unstructured data from multiple sources into a data repository

• To manage data by cleaning, describing, and improving it

• To model data to produce insights that can be visualized or integrated into operating systems

Exhibit 1 | How a Data Lake Works: source systems feed the data lake, which ingests structured and unstructured raw data into a repository, manages it by capturing business, operational, and technical metadata and by cleaning and improving it, and models it to deliver results such as reporting, visualization, and analytics. Source: BCG analysis.

What Is Different About Data Lakes?

A data lake brings new approaches to data management on several fronts.

Content Variety. A data lake is designed to store and process content in a wide variety of states (including multistructured, unstructured, and structured content), unlike traditional data warehouses, which can meaningfully store and process only structured content. As an example of the power of data lakes, an analysis of unstructured data such as e-mails or voice calls linked to customer transactions can help identify fraudulent trading activity. A data warehouse is limited to storing only structured data, making such analyses difficult and time-consuming.

Data Structure. Traditional data architectures mandate a database structure that is defined up front. Data architects prescriptively model and define the physical database prior to transforming and loading data into it, a process referred to as "schema on write."

But companies increasingly need an architecture in which users are free to access and structure data dynamically, on the fly; that process is sometimes referred to as "schema on read" or "late-binding execution." Instead of using the sequence common to data warehouses (extract, transform, load, or ETL), it employs the ELT approach, swapping the load and transform steps so that the raw loaded data is cleaned and transformed in the data lake. That offers an advantage when the accuracy of the data is imperative, such as for regulatory reporting. Because data quality validation happens as needed in the data lake, you don't need to create a big IT project to clean all the data, thus saving time and cost.

The flexibility of a schema-on-read model enables users to experiment with a variety of data and create innovative business insights dynamically. The schema-on-read attribute of a data lake can be one of its major strengths, especially when the properties of the data are correctly described and when data quality is fully understood. Failure to adequately govern those properties, however, can quickly undermine the data lake's business value. When data is not properly described, for example, it can't be understood and searched; when its quality isn't measured, it can't be trusted.
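As a rough illustration of the schema-on-read (ELT) pattern described above, the sketch below lands raw data in the lake unchanged and applies structure and cleaning only at query time; the paths, field names, and use of PySpark are assumptions made for the example, not details from the article.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Extract and load: land the raw feed in the lake exactly as received
# (assumed here to be newline-delimited JSON); no schema is imposed and
# nothing is cleaned at load time.
raw = spark.read.text("incoming/payments/2016-05-01.json")
raw.write.mode("append").text("lake/raw/payments")

# Transform happens later, at read time: structure is inferred only when
# an analyst needs it, and cleaning rules run as part of the query itself.
payments = spark.read.json("lake/raw/payments")
trusted = (payments
           .filter(F.col("amount").isNotNull())   # hypothetical field names
           .withColumn("amount", F.col("amount").cast("double")))
trusted.groupBy("counterparty").agg(F.sum("amount").alias("total")).show()

The same raw files remain available for any other analysis that wants to read them with a different schema, which is what makes the lake friendly to experimentation, and it is also why the description and quality governance discussed above matter so much.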
Users of data tures mandate a database structure that warehouses may, therefore, not have the is defined up front. Data architects pre- most up-to-date information, which is a scriptively model and define the physical shortcoming that can undermine the database prior to transforming and load- quality of the insights derived through data ing data into it, a process referred to as analytics. “schema on write.” Data Quality and Validation. Because raw But companies increasingly need an archi- data gets loaded into a data lake, its quality tecture in which users are free to access and trustworthiness is tested at the time it and structure data dynamically on the fly; is accessed for analysis. Traditional SQL- that process is sometimes referred to as based data warehouse programs, on the “schema on read” or “late-binding execu- other hand, undergo extensive testing of tion.” Instead of using the sequence com- the programs used to extract, transform, mon to data warehouses—extract, trans- and load data, which can help ensure the form, load (ETL)—it employs the ELT high quality of the data moved into them. approach, swapping the load and transform steps so that the raw loaded data is cleaned Access and Security. Tools to capture basic and transformed in the data lake. That of- technical metadata for the data ingested in a fers an advantage when the accuracy of the data lake are available, but that information data is imperative, such as for regulatory still must be enriched with business and reporting. Because data quality validation operational metadata that enables users to happens as needed in the data lake, you access and fully exploit data. The rich meta- don’t need to create a big IT project to data and associated security policies en- clean all the data, thus saving time and forced in a traditional data warehouse cost. enable construction of complex user-access models. Fine-grained security policies and The flexibility of a schema-on-read model role-based access privileges grant effective enables users to experiment with a variety control of user access to content. | Look Before You Leap into the Data Lake 3 Effort and Cost. Data lakes are significantly Companies simply won’t be able to model easier and less expensive to implement data in the ways they need to in the future than traditional SQL-based data ware- without the flexible architecture that build- houses, owing to the commoditized nature ing a data lake creates. But to get the great- of the platform, the lower cost of open- est value from these efforts, we recom- source technologies, and the deferment of mend the following steps. data modeling until the user needs to analyze the data. In contrast, the cost of a Identify the highest-value opportunities. data warehouse solution can reach millions The shift toward big data architectures has of dollars for large companies as a result of forced many senior executives to take a the need for up-front data modeling, the fresh look at the major components of a long period of time to design and build, the data architecture strategy and the game- requirement for data integration, and the changing capabilities they enable.