LOOK BEFORE YOU LEAP INTO THE DATA LAKE
By Rash Gandhi, Sanjay Verma, Elias Baltassis, and Nic Gordon

To fully capture the tremendous value of using big data, organizations need nimble and flexible data architectures able to liberate data that could otherwise remain locked within legacy technologies and organizational processes.

Rapid advances in technology and analytical processing have enabled companies to harness and mine an explosion of data generated by smartphone apps, website click trails, customer support audio feeds, social media messages, customer transactions, and more. Traditional enterprise data warehouse and business intelligence tools excel at organizing the structured data that businesses capture, but they stumble badly when it comes to storing and analyzing data of the variety and quantity captured today, and doing so at the speed now required. Companies need data architectures that can handle the diversity of data available now (for example, semistructured data, unstructured data, log files, documents, videos, and audio) and yield even more accurate predictive modeling and customer insight at a highly detailed level.

Enter the “data lake,” a term that refers to a large repository of data in a “natural,” unprocessed state. Data lakes’ flexibility and size allow for substantially easier storage of raw data streams that today include a multitude of data types. Data can be collected and later sampled for ideas, tapped for real-time analytics, and even potentially treated for analysis in traditional structured systems. But before organizations dive into the data lake, it’s important to understand what makes this new architecture unique, the challenges organizations can face during implementation, and ways to address those challenges.

What Exactly Is a Data Lake?
Historically, organizations have invested heavily in building data warehouses. Significant up-front time, effort, and cost go into identifying all the source data required for analysis and reporting, defining the data model and the database structure, and developing the programs. The process often follows a sequence of steps known as ETL: extract source data, transform it, and load it into the data warehouse.
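To make the ETL sequence concrete, here is a minimal sketch in Python, using only the standard library. The file, table, and field names are hypothetical, and a production pipeline would use dedicated warehouse tooling rather than SQLite; the point is only the order of the steps.

```python
# A minimal ETL sketch: extract, transform, THEN load into a fixed schema.
# File, table, and field names are hypothetical.
import csv
import sqlite3

# Extract: read raw source records from a hypothetical export file.
with open("transactions.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: cleanse and conform the data BEFORE it is stored, because the
# warehouse table structure is defined up front.
cleaned = [
    (r["txn_id"], r["customer_id"], float(r["amount"]), r["date"])
    for r in rows
    if r["amount"]  # drop records that fail basic validation
]

# Load: insert the conformed records into the predefined warehouse table.
conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS fact_transactions "
    "(txn_id TEXT, customer_id TEXT, amount REAL, txn_date TEXT)"
)
conn.executemany("INSERT INTO fact_transactions VALUES (?, ?, ?, ?)", cleaned)
conn.commit()
conn.close()
```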

Exhibit 1 | How a Data Lake Works

[Diagram: internal and external source data (business, operational, and technical) flows from source systems into the data lake, where it is ingested into a repository, managed (raw, fuzzy data is transformed and improved until it is clean and accurate), and modeled (data is discovered and modeled); results are delivered through reporting, visualization, and analytics.]

Source: BCG analysis.

Making changes to an existing data warehouse requires sizable additional investment to redesign the programs that extract, transform, and load data; we estimate that 60% to 75% of development costs come in the ETL layer.

Moreover, data warehouse solutions typically provide historical, or backward-looking, views. And here is where the challenge arises: organizations today are demanding that data tell them not just what happened in the past but also what is likely to happen in the future. They seek predictive and actionable insights, gleaned from a variety of data accessed through both batch and real-time processing, to inform their strategies.

Traditional data warehouses are not ideal solutions to this challenge. They are slow to change and costly to operate, and they can’t be scaled cost-efficiently to process the growing volume of data. Data lakes can fill the void.

Given companies’ storage requirements (to house vast amounts of data at low cost) and computing requirements (to process and run analytics on this volume of data), data lakes typically use low-cost commodity servers in a scale-out architecture. Servers can be added as needed to increase processing power and data capacity. These systems are typically configured with data redundancy to ensure high resilience and availability. Much of the big data software is open source, which drives down costs. The total cost of establishing and running a data lake can be five to ten times lower than the cost of using traditional SQL-based systems.

A company’s data lake can be built on any of multiple technology ecosystems (for example, Hadoop, Drill, and Cassandra), the most notable of which is the well-established Hadoop. Both upstarts (including Cloudera, MapR, and Hortonworks) and traditional IT players (such as IBM, HP, Microsoft, and Intel) have used Hadoop in constructing their data lakes.

Data lakes are highly flexible, and they enable a responsive “fail fast” approach to analytics that can drive significant value. In their simplest form, data lakes have three core functions, sketched in the example after this list (see Exhibit 1):

•• To ingest structured and unstructured data from multiple sources into a data repository

•• To manage data by cleaning, describing, and improving it

•• To model data to produce insights that can be visualized or integrated into operating systems
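The sketch below illustrates the three core functions in PySpark, assuming a Hadoop/Spark-based lake; the storage paths and column names are hypothetical.

```python
# A minimal PySpark sketch of ingest / manage / model in a data lake.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

# 1. Ingest: land raw, untransformed events in the repository as-is.
raw = spark.read.json("s3://lake/raw/clickstream/")

# 2. Manage: clean, describe, and improve the data inside the lake.
clean = (raw.dropDuplicates(["event_id"])
            .na.drop(subset=["customer_id"])
            .withColumn("event_date", F.to_date("event_ts")))
clean.write.mode("overwrite").parquet("s3://lake/clean/clickstream/")

# 3. Model: derive insights for visualization or downstream systems.
daily = (clean.groupBy("event_date")
              .agg(F.countDistinct("customer_id").alias("visitors")))
daily.write.mode("overwrite").parquet("s3://lake/models/daily_visitors/")
```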


What Is Different About Data Lakes?
A data lake brings new approaches to data management on several fronts.

Content Variety. A data lake is designed to store and process content in a wide variety of states (including multistructured, unstructured, and structured content), unlike traditional data warehouses, which can meaningfully store and process only structured content. As an example of the power of data lakes, an analysis of unstructured data such as e-mails or voice calls linked to customer transactions can help identify fraudulent trading activity. A data warehouse is limited to storing only structured data, making such analyses difficult and time-consuming.

Data Structure. Traditional data architectures mandate a database structure that is defined up front. Data architects prescriptively model and define the physical database prior to transforming and loading data into it, a process referred to as “schema on write.”

But companies increasingly need an architecture in which users are free to access and structure data dynamically, on the fly; that process is sometimes referred to as “schema on read” or “late-binding execution.” Instead of using the sequence common to data warehouses (extract, transform, load, or ETL), it employs the ELT approach, swapping the load and transform steps so that the raw loaded data is cleaned and transformed in the data lake. Because data quality validation happens as needed in the data lake, you don’t need to create a big IT project to clean all the data, which saves time and cost.

The flexibility of a schema-on-read model enables users to experiment with a variety of data and create innovative business insights dynamically (the sketch at the end of this section illustrates the idea). The schema-on-read attribute of a data lake can be one of its major strengths, especially when the properties of the data are correctly described and data quality is fully understood. Failure to adequately govern those properties, however, can quickly undermine the data lake’s business value. When data is not properly described, for example, it can’t be understood and searched; when its quality isn’t measured, it can’t be trusted.

Timeliness of Data. Information can be streamed directly into a data lake, providing an almost real-time view of the data. This can be important when business decisions rely on real-time analytics, such as making credit decisions. With traditional SQL data architectures, a delay typically occurs as source data is cleaned and then loaded into a data warehouse in hourly, daily, or weekly batches. Furthermore, traditional SQL systems use heavy controls to insert (that is, to store) data, which slows down data throughput. Users of data warehouses may therefore not have the most up-to-date information, a shortcoming that can undermine the quality of the insights derived through data analytics.

Data Quality and Validation. Because raw data gets loaded into a data lake, its quality and trustworthiness are tested at the time it is accessed for analysis. Traditional SQL-based data warehouse programs, on the other hand, undergo extensive testing of the programs used to extract, transform, and load data, which can help ensure the high quality of the data moved into them. That rigor offers an advantage when the accuracy of the data is imperative, such as for regulatory reporting.

Access and Security. Tools to capture basic technical metadata for the data ingested in a data lake are available, but that information still must be enriched with business and operational metadata that enables users to access and fully exploit the data. The rich metadata and associated security policies enforced in a traditional data warehouse enable construction of complex user-access models. Fine-grained security policies and role-based access privileges grant effective control of user access to content.
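As a rough illustration of schema on read, the PySpark sketch below binds two different schemas to the same raw files at analysis time; nothing about the structure was fixed when the data was loaded. The paths, fields, and schemas are hypothetical.

```python
# "Schema on read": structure is applied only at analysis time.
# Paths and field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# No up-front model: raw JSON was loaded first (extract and load),
# untransformed. One analysis binds a schema suited to fraud review...
fraud_view = StructType([
    StructField("txn_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("voice_note", StringType()),
])
fraud = spark.read.schema(fraud_view).json("s3://lake/raw/transactions/")

# ...while another reads the very same files with a different, simpler shape.
revenue_view = StructType([
    StructField("amount", DoubleType()),
    StructField("region", StringType()),
])
revenue = spark.read.schema(revenue_view).json("s3://lake/raw/transactions/")
```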


Effort and Cost. Data lakes are significantly easier and less expensive to implement than traditional SQL-based data warehouses, owing to the commoditized nature of the platform, the lower cost of open-source technologies, and the deferment of data modeling until the user needs to analyze the data. In contrast, the cost of a data warehouse solution can reach millions of dollars for large companies as a result of the need for up-front data modeling, the long period of time required to design and build, and the need to customize database, server, data integration, and analytics technologies.

Overcoming the Challenges of Data Lakes
Today’s data lakes present two overarching challenges.

The first is the lack of tools, or at least mature tools, available for the Hadoop environment. Data warehouses have built these tools over more than two decades; data lakes have much catching up to do. For example, data lakes don’t yet have the level of security that users of data warehouses are accustomed to, although significant improvements have been made and the environment is vastly different from that of just a year ago. The same situation exists in terms of data quality and validation. Metadata tools within the Hadoop environment used for checking data quality have been maturing over the past several years. But without robust controls, users could lose trust in the accuracy of the data needed to derive value, such as for regulatory reporting.

The second is the skills gap. Not enough people have sufficient experience working in the data architecture of the Hadoop environment. Eighty-three percent of respondents to a 2016 survey by CrowdFlower said there is a shortage of data scientists, up from 79% in 2015. Gartner predicts that in 2020 the US will face a shortage of 200,000 data scientists.

The good news is that, once these challenges are addressed, the process of developing a data lake is useful in and of itself. Companies simply won’t be able to model data in the ways they need to in the future without the flexible architecture that building a data lake creates. But to get the greatest value from these efforts, we recommend the following steps.

Identify the highest-value opportunities. The shift toward big data architectures has forced many senior executives to take a fresh look at the major components of a data architecture strategy and the game-changing capabilities they enable. The first step in this work is to think through the highest-value use cases for big data. Across industries, we often find uses in real-time customer ad targeting, real-time risk and fraud alert monitoring, regulatory reporting, and IT performance optimization. (See “Big Data and Beyond,” a collection of BCG articles about big data.) Once companies determine these use cases, they must identify the target organization and the processes and technologies required for the transformation.

Keep the right goals in mind. A data lake typically has three uses.

The first, and currently primary, use is insight generation: pulling global data for reporting or visualization purposes, for example, or running machine-learning jobs to determine new connections and relationships.

The second use is operational analytics, which is more time sensitive than insight generation. An example, sketched below, is assessing the credit risk associated with a transaction while the transaction is being processed; the risk associated with the transaction amount, location, or type is scored in real time based on the customer’s profile and history.

The big data landscape is constantly changing, and new components (such as Mahout, Tez, and Pig) are maturing, making it possible to mostly replace traditional data warehouses for these first two uses.
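The following sketch illustrates that operational-analytics use in plain Python. The thresholds, profile fields, and rules are invented for illustration, not an actual scoring model.

```python
# An illustrative sketch of real-time transaction risk scoring.
# The rules and field names are hypothetical.
from dataclasses import dataclass

@dataclass
class CustomerProfile:
    home_country: str
    avg_amount: float  # historical average transaction amount

def score_risk(amount: float, country: str, profile: CustomerProfile) -> float:
    """Return a 0-1 risk score in real time, before the transaction completes."""
    score = 0.0
    if amount > 3 * profile.avg_amount:  # unusually large for this customer
        score += 0.5
    if country != profile.home_country:  # unusual location
        score += 0.3
    return min(score, 1.0)

profile = CustomerProfile(home_country="GB", avg_amount=120.0)
if score_risk(amount=950.0, country="RO", profile=profile) > 0.6:
    print("flag for review before authorizing")
```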


Exhibit 2 | Four Big Data Operating Models

[Diagram of the four models. Complement: a new data lake holding new analytical workloads and data sits alongside the existing data warehouse, which keeps the existing reporting workload and data. Carve out: ETL workloads and staged and archived data are migrated from the existing data warehouse to the new data lake. Transform: existing reporting and ETL workloads and data are transformed and moved to the new data lake, leaving only a reporting “layer.” Outsource: data analytics is consumed as a service on cloud technology.]

Source: BCG analysis.
Note: ETL = extract, transform, and load. In each model, infrastructure and data analytics services can be sourced internally and/or externally (for example, in the cloud).

Indeed, today’s big data technologies provide the scope and scale to cover many of the analytics needs of today’s enterprises, opening the door to the third, and currently nascent, use of the data lake: transaction processing, which is time sensitive and must guarantee the integrity and consistency of data access. Take, for example, a payment made from your bank account: the money needs to leave your account immediately and, likewise, the account balance and the record of payment must be instantly updated (the sketch after the list below illustrates this guarantee). With the emergence and continued evolution of SQL technologies, it may eventually be possible to cover some portion of transaction-processing needs with big data platforms. Hadoop-based big data platforms are still maturing, but the risk of any associated growing pains should not deter a potentially radical architectural vision. A forward-looking implementation allows a company to embrace the true potential of data lakes as they mature.

Select the right operating model. We typically see four key models for implementing a data lake: complement, carve out, transform, and outsource. (See Exhibit 2 for an illustration of the models.)

•• With the complement model, a company builds a data lake alongside a data warehouse to support new use cases that cannot be accomplished effectively with a traditional data warehouse, such as multistructured data analysis and predictive analytics.

•• With the carve-out model, companies build a data lake to replace parts of the existing data warehouse solution that are better suited for storage and processing in a data lake. This is typically done to reduce IT costs, such as by off-loading expensive ETL development to the data lake.

•• In the more radical transform model, the data lake progressively replaces the broader suite of relational-database platforms that process data and deliver insights across customer, product, and business management processes. The motivations for such a radical move include transforming a company into a data-driven digital business, increasing agility, and improving efficiency.

•• With the outsource model, a company frees itself from building and maintaining a data lake. It can achieve this by adopting cloud technology, thereby reducing capital investments in infrastructure and specialist skills. Furthermore, companies can leverage analytics as a service by sending data to a vendor, which processes the data and returns results or insights.
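As a minimal illustration of the transaction-processing guarantee described above, the sketch below uses SQLite standing in for any transactional store; the table and account data are hypothetical. The debit and the credit must succeed or fail together.

```python
# A sketch of transactional integrity: debit and credit commit as one
# atomic unit. SQLite stands in for any transactional store; the data
# is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 500.0), ("bob", 100.0)])

try:
    with conn:  # one atomic transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 'bob'")
except sqlite3.Error:
    pass  # on failure, neither balance changed: money never vanishes mid-flight

print(conn.execute("SELECT id, balance FROM accounts").fetchall())
```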


Ensure data quality, security, and governance. Regardless of where data is stored, businesses cannot make effective data-driven decisions if their information is not robust, secure, and trustworthy. Leading organizations enforce data governance and data quality policies, processes, tools, and stewardship to ensure that data is fully described, understood, and of high quality. (See “How to Avoid the Big Bad Data Trap,” BCG article, June 2015.) They also ensure full traceability and lineage of data. Effective implementations make certain that relevant data is sufficiently anonymized and that access is strictly controlled and restricted to relevant authorized users.

Build the organization. With secure, high-quality data in place, companies must next build an organization that is ready and able to make the best use of that data. They need to embed a data consciousness on the business side of the house, adding data scientist roles and skills there, not just in IT, so that their business team can tap the data lake whenever it needs to. This reduces reliance on IT for analytics, visualization, and the production of reports. (See Exhibit 3.) This kind of organization structure enables flexible consumption of data based on business needs, adding agility and subtracting cost. Leading organizations are already positioning themselves for self-service analytics on the business side. In fact, to ensure that they are capturing the utmost value from the data lake, they are building these capabilities company-wide, adding roles and skills that focus on data ownership and stewardship.

Data lakes offer huge potential to transform businesses. Using them well requires understanding their strengths and limitations and taking a pragmatic approach to implementation. (See “Changing the Game with a Data Lake: An Interview with Centrica’s David Cooper and Daljit Rehal,” BCG article, September 2016.) Companies that implement data lakes will have the opportunity to build a flexible architecture that enables data delivery at the right time and in the right format. The ultimate goal: accurate and actionable insights that drive business value.

Exhibit 3 | Build Data Capabilities on the Business Side of the Organization, Not Just in IT

[Diagram of the future model, business-driven analytics: source data from operating systems and data stores is extracted and loaded into a data lake (with a metadata layer), where it is transformed and improved. Analytics, visualization, reporting, and data science capabilities sit on top, aligned to the business and its functions. IT roles (for example, data analyst, custodian, ETL developer) support the lower layers; business roles (for example, data scientist, steward, owner, reporting) consume from the top.]

Source: BCG analysis.
Note: ETL = extract, transform, and load.


About the Authors
Rash Gandhi is a principal in the London office of The Boston Consulting Group and a member of the Technology Advantage practice. His areas of expertise include business-aligned IT strategies and architectures, big data enablement, transformation of application development and maintenance functions, and performance of technical health checks and reviews to derisk programs. You may contact him by e-mail at [email protected].

Sanjay Verma is a partner and managing director in the firm’s San Francisco office. He is a core member of the Technology, Media & Telecommunications practice. Prior to joining the firm, Verma launched and led Cloud Labs, a business unit of Flex. He also was the architect of the Oracle TimesTen In-Memory Database. You may contact him by e-mail at [email protected].

Elias Baltassis is a director in BCG’s Paris office and a core member of the Technology Advantage and Financial Institutions practices. Prior to joining BCG, he was a founding member and managing director of a world-leading big data company. He has led a broad range of big data and analytics projects in financial services, private equity, retail, and telecommunications. You may contact him by e-mail at [email protected].

Nic Gordon is an associate director in the firm’s London office and a core member of the Financial Institutions practice. Prior to joining the firm, he held the senior positions of chief data officer, global head of business intelligence and analytics, global head of data services, and global head of data strategy and architecture for leading banks around the world. You may contact him by e-mail at [email protected].

The Boston Consulting Group (BCG) is a global management consulting firm and the world’s leading advisor on business strategy. We partner with clients from the private, public, and not-for-profit sectors in all regions to identify their highest-value opportunities, address their most critical challenges, and transform their enterprises. Our customized approach combines deep insight into the dynamics of companies and markets with close collaboration at all levels of the client organization. This ensures that our clients achieve sustainable competitive advantage, build more capable organizations, and secure lasting results. Founded in 1963, BCG is a private company with 85 offices in 48 countries. For more information, please visit bcg.com.

© The Boston Consulting Group, Inc. 2016. All rights reserved. 9/16
