Hadoop and Data Lakes

Use Cases, Benefits and Limitations

BARC Research Study

Authors

Timm Grosser, Senior Analyst
Melanie Mack, Head of Market Research
Jacqueline Bloemen, Senior Analyst
Jevgeni Vitsenko, Research Analyst

This independent study was conducted and written by BARC, an objective market analyst. We thank Cloudera, SAS and Teradata for sponsoring the survey. Thanks to SAS, this study is available in English.

©2016 BARC - Business Application Research Center, a CXP Group Company

Table of contents

Preface
Demographics
Management Summary
Key Findings
  Usage
  Drivers
  Reasons for Use and Benefits
  Data Lakes
  Implementation
  Challenges
Hadoop Theories Put to the Test
Sponsors
SAS
About BARC

Preface

Discussion surrounding Hadoop and data lakes is as relevant as ever. The Hadoop ecosystem is considered THE technological breakthrough for enabling companies to capitalize on the revolution. A data lake, in turn, is viewed as a broad data management concept and a prerequisite for data-driven companies. It promises a fast, efficient, low-cost way to manage, use and analyze any amount of data from different systems with varying structures. As a source for any type of analytic task, it can also claim to be the technological backbone of digitalization and the (big) datafication of the entire economy.

Hadoop, a top-level project of The Apache Software Foundation, is an open source Java framework for scalable, distributed applications. It includes a collection of components for administering, accessing and analyzing structured and unstructured data. Hadoop is capable of managing huge amounts of polystructured data and adding value to new or established IT technologies. It is especially well-suited as a platform for implementing big data projects and is often viewed as a technology for data lake deployments. The concept of a data lake, however, can extend far beyond that, depending on how broadly the term is defined. A data lake often focuses on data availability and providing downstream applications with schema-free data that is close to its raw format regardless of origin.

Enterprise projects with Hadoop technology and data lakes in production have only emerged in recent years. Accordingly, many companies find it hard to distinguish between media hype and the benefits that can realistically be enjoyed. There is limited experience in terms of how and where it really makes sense to implement, which obstacles can arise during implementation, and what potential benefits are delivered in real-world scenarios.

The following BARC user survey provides answers and insights. It explores the status quo of Hadoop and data lakes in general and real experiences from Hadoop use cases across the globe. It tackles important questions including:

• How widespread is the current usage of Hadoop and data lakes in companies? What are their plans for the future?
• How do companies utilize Hadoop or plan to use it?
• How do companies currently use data lakes?
• What problems do they face?
• What real-world benefits does Hadoop bring? What projects have companies implemented already?
• How is the technological implementation set up?

This study was conducted independently by BARC. Thanks to sponsorship from Cloudera, SAS and Teradata, it can be published free of charge. BARC would like to thank you, our readers, in advance for your participation in future surveys as well. Your help and support are essential to fuel discussion through empirical data.

Demographics

Over 380 respondents

Company size
Under 250 employees | 23%
250 - 2,500 employees | 33%
Over 2,500 employees | 45%

Region
Europe | 257 (77%)
North America | 58 (18%)

Industry
Public Sector | 6%
Banking | 14%
Industry | 24%
Services | 22%
IT | 16%
Retail | 9%
Other | 9%

Management Summary

Hot Spot #1: Hadoop is a technology trend with great potential

The number of Hadoop projects in production is growing, especially in Europe. However, the number of companies that do not want to use Hadoop is also increasing. As more companies understand the potential benefits of Hadoop use cases, they can make better informed decisions on whether to use the technology. There is a surprisingly broad spectrum of Hadoop systems in production. Companies of all sizes with varying volumes of data, data types and demands for data currency are using Hadoop. That makes it a potential technology for a wide range of use cases in all types of companies. Hadoop is increasingly evolving from a simple file system into a runtime environment for analytic applications.

Hot Spot #2: Opinion on data lakes is divided

Two sides have emerged in the debate on data lakes. One group views the data lake as an important concept and a prerequisite for data-driven companies. The other considers it marketing buzz, a new name for an old concept, or a topic with no real relevance. It is possible that this gap is propelled by insecurities, which opens the opportunity to find a definition and create transparency in terms of benefits and best practices. The term 'data lake' is still not used consistently. There is a tendency, however, to use this term when referring to data preparation/storage or explorative environments.

Hot Spot #3: BICC and data science teams are driving Hadoop and data lake projects

BI competence centers and data science teams are currently the main drivers behind Hadoop and data lake projects. They, in turn, are bringing more business relevance to these topics.

Hot Spot #4: Companies primarily use commercial tools and Hadoop distributions for project implementations

Commercial tools and distributions received higher marks than Apache components in most tool categories and are implemented much more frequently. The one exception is for streaming. Cost savings, functional performance/innovative capacity and operations improvement are the main reasons for choosing Apache Hadoop.

Hot Spot #5: Customer intelligence and predictive analysis are clear Hadoop use cases

Customer intelligence and predictive analysis are by far the most common Hadoop implementations. Analyzing data from heterogeneous, divergent sources, predicting customer behavior, strengthening customer loyalty, and increasing flexibility are the most common benefits.

Hot Spot #6: Hadoop generates analytical benefits

Hadoop's greatest benefits lie in the analysis of heterogeneous data from divergent sources, the prediction of customer behaviors/customer retention and in increasing flexibility. Hadoop not only takes on the role of the file system. It also serves as a platform and runtime environment with core data and predictive analysis functionalities.

Hot Spot #7: Hadoop enables use cases that were previously not possible

Users view Hadoop as a potential technology for implementing new types of use cases that were previously not possible with existing systems. Saving costs and implementing a technically better platform play a minor role when deciding whether to implement Hadoop.

Hot Spot #8: Lack of skills and insecurities pose the greatest challenges

It appears that not much has changed since the last BARC Hadoop survey. A lack of professional know-how and technical skills still leads the list of challenges. European companies, in particular, are uncertain how to use it properly. Their counterparts in North America, in comparison, are more likely to complain about low usability, immature ecosystems and the high cost of training and development.

Key Findings: Usage

The number of Hadoop projects in production is growing along with the number of critics

Hadoop is still being put to the test. The benefits of the framework, however, appear to be delivering good arguments in its favor.

The numbers show that two groups are emerging for and against Hadoop. Accordingly, Hadoop will not go down in history as the cure for all analytic requirements. It has advantages and disadvantages depending on the use case.

40 percent of participants worldwide are Hadoop supporters (i.e. with projects in production, a pilot phase or in planning). They show a clear interest in the technology. 12 percent of all participants already have Hadoop in production. Many supporters view Hadoop as a potential component to help build analytic environments for special applications.

27 percent of respondents categorize themselves as Hadoop opponents. Another 34 percent currently do not have Hadoop but can imagine using it in the future.

Hadoop initiative by region (n=371/302/196)
Region | Hadoop is in production | Hadoop pilot project is running | Hadoop initiative planned | No, but conceivable in future | No, and we have no plans for one
2016 (worldwide) | 12% | 16% | 12% | 34% | 27%
Europe | 9% | 16% | 10% | 33% | 32%
North America | 17% | 16% | 12% | 43% | 12%
2016 DACH region | 8% | 17% | 9% | 33% | 33%
2015 DACH region | 4% | 14% | 11% | 49% | 21%

Clear jump in Hadoop usage throughout Germany, Austria and Switzerland

Germany, Austria and Switzerland, also known as the DACH region, recorded a clear jump in usage (4 percent to 8 percent) in comparison to last year. The number of Hadoop prospects (i.e. those that have Hadoop initiatives or are planning pilot projects) remained steady at 26 percent. However, the percentage of respondents who cannot imagine using Hadoop rose from 21 percent to 33 percent.

Widespread Hadoop use in North America

The number of Hadoop installations in production is nearly twice as high in North America (17 percent) as in Europe (9 percent). The North American market has a reputation for early adoption and higher maturity rates in using information technology, so this is not altogether surprising. However, when it comes to planned adoption of Hadoop, a similar proportion of North American and European companies have initiatives in the pipeline.

Hadoop usage depends on the use case. Company size is irrelevant.

Survey findings show that not only large companies and those with large data volumes see the relevance of using Hadoop as part of a big data strategy. This is reflected in the distribution of our survey participants. Hadoop is in production or in a pilot phase at large (41 percent), small (20 percent) and midsize companies (16 percent).

Hadoop initiative by company size (n=370)
Response | Less than 250 employees | 250 - 2,500 employees | More than 2,500 employees
Yes, Hadoop is in production | 7% | 7% | 18%
Yes, we have a Hadoop pilot project running | 13% | 9% | 23%
Yes, we are planning a Hadoop initiative | 8% | 9% | 15%
No, but it is conceivable we may have one in the future | 36% | 41% | 27%
No, and we have no plans for one in the future | 35% | 34% | 17%
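The shares quoted for large, small and midsize companies (41, 20 and 16 percent) appear to match the combined "in production" and "pilot project" responses per company size. The following Python sketch reproduces that arithmetic from the chart values; the dictionary keys and the function name are illustrative choices, not terms from the study.

```python
# Shares (percent) taken from "Hadoop initiative by company size" (n=370).
# Assumption: "active" means Hadoop in production or in a running pilot.
by_size = {
    "Less than 250 employees": {"production": 7, "pilot": 13, "planned": 8},
    "250 - 2,500 employees": {"production": 7, "pilot": 9, "planned": 9},
    "More than 2,500 employees": {"production": 18, "pilot": 23, "planned": 15},
}


def active_share(row):
    """Return the combined share of production and pilot responses."""
    return row["production"] + row["pilot"]


active = {size: active_share(row) for size, row in by_size.items()}

for size, share in active.items():
    print(f"{size}: {share}% in production or pilot")
```

Adding the "planned" value to each sum would give the broader share of companies with any kind of Hadoop initiative.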

Hadoop usage does not depend on data volume or currency

Data volume in Hadoop (n=251)
Data volume | Today | Next 12 months
Small (<25 TB) | 59% | 19%
Medium (>25 TB) | 29% | 48%
Large (>500 TB) | 11% | 25%
Very large (>1 PB) | 1% | 8%

Data processing frequency in Hadoop applications (n=143)
Build a data archive with non time-critical access to historical data | 8%
Near/Real-time requirements (i.e. "Streaming"/"Event Processing") | 21%
Near-time requirements (i.e. prompt processing of data in seconds/minutes) | 35%
Daily - or less frequent - loading of the Hadoop cluster | 36%

The majority of Hadoop applications process small amounts of data

The survey reveals that Hadoop usage does not depend on data volume and currency.

59 percent of the Hadoop scenarios described were implemented with "small" data volumes up to 25 terabytes (TB). Scenarios exceeding 1 petabyte (PB) are few and far between (1 percent). As expected, higher data volumes are anticipated in the near future (i.e. the next 12 months). Companies start with smaller-scale Hadoop initiatives in order to first see the potential for further development. The largest increase was reported among scenarios exceeding 25 TB.

Hadoop usage is no longer limited to batch applications

A closer look at data currency reveals high responses for streaming (21 percent) or near-time usage (35 percent), especially in combination with customer intelligence applications. This shows that Hadoop and MapReduce are no longer limited to batch applications and are increasingly being used to address high demands on data currency.

Transactional data is the most commonly used data type

The survey asked what types of data companies use in Hadoop applications. More than half of respondents (53 percent) use transactional data, followed by log data (40 percent) and clickstream data/Web analytics (33 percent). More than a quarter (27 percent) use documents and text files in Hadoop applications. This seems surprisingly high in comparison to the findings of BARC studies from years past. A closer look at the responses for planned usage shows ambitious endeavors in almost every data category.

The findings do not show that Hadoop is especially suitable for certain data types. Instead, it appears to be a potential technology for processing all types of data. It can also be assumed that Hadoop primarily uses data that can also be used in traditional platforms. If data volumes or types do not play a role, the reasons for using Hadoop still remain open. Cost, functionality for particular use cases, know-how and the availability of a technical infrastructure or existing IT processes, however, are also points to examine.

Data types in Hadoop applications (n=143)
Data type | In use | Planned
Data from transactional systems | 53% | 35%
Log data from IT systems | 40% | 42%
Clickstream data/web analytics | 33% | 49%
Documents/text (e.g. emails) | 27% | 38%
Sensor, RFID or other machine data | 22% | 33%
Open data | 17% | 48%
Social media data | 15% | 53%
Videoclips/pictures | 10% | 35%
Other | 5% | 9%

Hadoop use cases still cover a broad spectrum

Usage of Hadoop (n=144)
Usage | Worldwide | Europe | North America
Runtime environment (sandbox) for advanced analysis, discovery, exploration | 65% | 69% | 58%
Storage for raw data/detail data | 60% | 76% | 15%
Data preparation/integration | 57% | 57% | 58%
Data archive | 40% | 42% | 35%
Runtime environment for classic BI | 30% | 24% | 42%
Support of transactional systems | 19% | 20% | 19%

Hadoop is used for many different purposes, especially as a runtime environment for new types of advanced analytics (65 percent), as storage for raw/detailed data (60 percent), and for data processing and integration (57 percent). Other scenarios, however, are conceivable and have been implemented as well.

A breakdown by region paints an interesting picture.

North America uses Hadoop more intensively as a runtime environment for BI.

Using Hadoop as storage for raw/detailed data varies greatly between North America and Europe. 76 percent of respondents in Europe use it for this purpose in comparison to only 15 percent in North America.

One possible explanation is that with growing experience and maturity, Hadoop usage shifts from simple data storage more towards an analytic engine as a runtime environment for BI. This would also mean that North America has more experience in analytic Hadoop usage as a whole and, therefore, focuses on other use cases.

Furthermore, using Hadoop as a runtime environment for advanced analytics is more popular in Europe than in North America (a difference of 11 percentage points). It appears that Hadoop is not "just" used in North America for advanced analytics, exploration or other new forms of analysis but rather as the technology itself to support both old and new analytic tasks. One could assume, therefore, broader usage and, in certain cases, better utilization of potential opportunities in this market as well.

Key Findings: Driving Factors

BI competence centers and data science teams are driving Hadoop initiatives

The BARC Hadoop Survey 2015 showed that IT departments were the drivers and visionaries behind Hadoop technologies and data lake initiatives. But times have changed. With 58 percent in Europe and 56 percent in North America, business intelligence competence centers have now emerged as the driving force in companies by some distance.

Special organizational units, such as data science teams and big data labs, are becoming more commonplace, especially in Europe. In contrast, IT application development and independent departments for digitalization and innovation are driving Hadoop initiatives in North America. On the whole, pure IT departments are losing ground as a key driver. One reason could be the high demand for accessing data in business departments. This development primarily involves organizational units that have closer ties to business departments or specialize in analytics.

Driver for Hadoop (n=110)
Driver | Europe | North America
BI organization (BICC) | 58% | 56%
Data Science Team | 27% | 24%
(Big) Data Lab | 26% | 24%
Business department | 22% | 8%
IT – Application development | 21% | 36%
A specific department for digitalization and innovation initiatives | 20% | 32%
IT – Innovation area | 19% | 16%
IT – Other areas | 9% | 4%

Key Findings: Reasons for Use and Benefits

Commercial software and in-house Hadoop distributions are the top choice for implementations

Tools for implementing Hadoop (n=348)
Commercial products | 31%
Apache Hadoop | 27%
Not applicable | 42%

Tools for implementing Hadoop by software category (n=141)
Software category | Apache Hadoop | Commercial products | Not applicable
Data storage & access | 40% | 45% | 14%
Data integration & quality | 28% | 48% | 23%
System management | 25% | 41% | 34%
Streaming | 25% | 22% | 53%
Governance and security | 23% | 31% | 46%
Advanced analysis & visualization | 20% | 64% | 16%

Almost a third of participants use commercial products or Hadoop distributions to implement Hadoop projects. Over a quarter (27 percent) build their own Hadoop ecosystems with Apache components. That number is surprisingly high because it takes in-depth knowledge to harmonize and manage the components as part of the implementation, maintenance and operations.

Over 40 percent of respondents stated that they do not know or cannot clearly identify the components for the implemented application scenarios. One can assume that the technical implementation is not always clear-cut, for example, due to the many different products in a Hadoop ecosystem.

Users prefer on-premises installations

Further analysis of the software selection process shows that 61 percent choose in-house deployments of Hadoop distributions. Only a few rely on managed services (11 percent), cloud platforms (9 percent) or appliances (10 percent).

Analysis and visualization is clearly the domain of commercial tools (64 percent).

A closer look at the tool categories reveals other interesting insights. Commercial tools and Hadoop distributions clearly dominate the categories for data integration and data quality (48 percent), system management (41 percent) and, especially, advanced analytics and visualization (64 percent).

Data storage is the only area where the usage of Apache Hadoop as an open source framework is very high in comparison to commercial tools or Hadoop distributions. Data storage with the Hadoop Distributed File System (HDFS) is one of the original basic functions of Hadoop and is therefore better known.

About half of participants use no tools for streaming (53 percent) or governance and security (46 percent). There appear to be no clearly defined products for these categories.

Cost effectiveness, functional performance/innovative capacity and operations improvement are the main reasons for deploying Apache Hadoop

Reasons for using Apache Hadoop and commercial products and Hadoop distributions
Reason | Tool type | Data integration & quality (n=121) | Streaming (n=80) | Data storage (n=120) | System management (n=91) | Governance & security (n=80) | Advanced analysis (n=116)
Cost effectiveness/Cost savings | Apache Hadoop | 61% | 59% | 64% | 59% | 45% | 47%
Cost effectiveness/Cost savings | Commercial products | 37% | 29% | 33% | 33% | 27% | 23%
Functional performance and innovative capacity | Apache Hadoop | 42% | 33% | 49% | 22% | 38% | 47%
Functional performance and innovative capacity | Commercial products | 47% | 48% | 40% | 40% | 32% | 52%
Operations improvement | Apache Hadoop | 30% | 41% | 36% | 38% | 55% | 35%
Operations improvement | Commercial products | 49% | 39% | 43% | 62% | 37% | 41%
Existing skills/know-how | Apache Hadoop | 30% | 26% | 28% | 31% | 31% | 29%
Existing skills/know-how | Commercial products | 25% | 32% | 27% | 25% | 34% | 52%
Integration in the system landscape | Apache Hadoop | 27% | 26% | 26% | 31% | 38% | 41%
Integration in the system landscape | Commercial products | 35% | 32% | 37% | 29% | 32% | 40%
Ease-of-use of the application | Apache Hadoop | 9% | 22% | 8% | 22% | 24% | 18%
Ease-of-use of the application | Commercial products | 24% | 45% | 18% | 27% | 15% | 44%
Flexibility to adapt the application | Apache Hadoop | 24% | 19% | 21% | 9% | 28% | 41%
Flexibility to adapt the application | Commercial products | 21% | 19% | 18% | 8% | 12% | 33%
Quick to implement | Apache Hadoop | 6% | 11% | 13% | 13% | 24% | 35%
Quick to implement | Commercial products | 31% | 32% | 22% | 15% | 29% | 24%
Easy to maintain | Apache Hadoop | 9% | 11% | 13% | 13% | 24% | 35%
Easy to maintain | Commercial products | 22% | 32% | 22% | 15% | 29% | 24%
Governance and management | Apache Hadoop | 13% | 13% | 18% | 12% | 27% | 15%
Governance and management | Commercial products | 6% | 4% | 9% | 9% | 10% | 18%
Risk prevention | Apache Hadoop | 6% | 4% | 9% | 9% | 10% | 18%
Risk prevention | Commercial products | 7% | 6% | 15% | 15% | 29% | 7%
Scalability | Apache Hadoop | 36% | 19% | 23% | 13% | 10% | 24%
Scalability | Commercial products | 28% | 23% | 30% | 17% | 24% | 21%

Participants cited cost effectiveness/savings but also functional performance/innovative capacity and operations improvement as the main reasons for choosing the open source components from Apache. In fact, these same reasons were quoted more frequently for Apache components than for commercial tools and Hadoop distributions across all tool categories. The reasons for using commercial tools vary by tool category. Functional performance/innovative capacity and operations improvement are popular reasons for choosing commercial products, as is ease of use in the areas of streaming and advanced analytics.

Customer intelligence and predictive analytics are the most common Hadoop use cases

Customer intelligence/experience projects (32 percent), closely followed by predictive analytics projects (31 percent), top the list of Hadoop use cases. Customer intelligence already has many use cases since customers were at the heart of the discussion at the start of the hype around big data, and vast amounts of data on customers, their behavior and channels are available. Many of these applications, such as next-best offers in Web portals or POS data analysis, are already in production.

Predictive analysis has many use cases as well. Viewed as the epitome of explorative analysis, it helps uncover the hidden potential in data.

An analysis of use cases by company size uncovers more findings:

• Small companies focus much more on technical use cases such as offloading.
• Customer intelligence is a major topic in midsize companies while predictive analysis is the focus of large companies.
• Midsize and large companies use significantly more clickstream, sensor and social media data.

Use cases for Hadoop projects (n=148)
Customer Intelligence/Experience | 32%
Predictive analytics (analysis of sensor data) | 31%
Recommendation/next best action | 13%
Technical use case, i.e. Data Warehouse offloading | 12%
Innovation discovery | 5%
Fraud detection | 4%
Other | 3%

Better heterogeneous data analysis, increased customer understanding and greater flexibility are the main benefits of Hadoop

Using Hadoop has advantages almost across the board. Among the named benefits of Hadoop initiatives, comprehensive analytics and improved data integration (59 percent worldwide) ranked first, followed by building a platform to better predict customer behavior and improve customer retention (53 percent worldwide). Increased analytic flexibility (47 percent worldwide) came in third.

Regional differences can be found in these responses. Aside from predicting customer behavior, the North American sample uses Hadoop more heavily for predicting the success of sales or products (42 percent), monitoring and optimizing IT systems (38 percent), and increasing the effectiveness of operational processes (35 percent).

Another interpretation is that North America uses Hadoop on a broader scale and analyzes areas beyond customers (for example, products) in order to make operational processes more efficient. This last point, in particular, is where BARC sees the greatest potential for data. In order to become a data-driven company, it is important to act on the results of data analysis in operational processes instead of just following predefined workflows and rules. Users should be able to access additional functionality within an individual process and take actions based on insights. This closes the classic management loop from information to action.

Benefits of Hadoop initiative (n=144)
Benefit | Worldwide | Europe | North America
Analysis of data from heterogeneous, divergent data sources | 59% | 65% | 50%
We can predict customer behavior, improving customer retention | 53% | 64% | 46%
More flexible use of data and advanced analytics | 47% | 51% | 42%
Increased competitive capability | 43% | 48% | 42%
Data analysis can be done more often and more cheaply | 33% | 35% | 27%
Faster response to changes in the market | 33% | 34% | 31%
We can predict sales and product success | 27% | 25% | 42%
We can predict fraud and/or other financial risks | 26% | 30% | 27%
We can monitor machinery and proactively maintain it | 26% | 28% | 23%
We can do sentiment and trend analysis | 25% | 28% | 23%
Increased revenues | 20% | 19% | 23%
We can monitor and optimize IT systems and risks | 19% | 14% | 38%
Improved operational efficiency | 18% | 14% | 35%
We currently cannot determine the benefits of our Hadoop initiative | 6% | 6% | 4%

Key Findings: Data Lakes

Almost half of users worldwide confirm the benefits of data lakes

User opinion about Data Lakes (n=384)
Opinion | Worldwide | Europe | North America
A data lake is a requirement for a data-driven company | 15% | 13% | 21%
The data lake is the central point for all data | 32% | 33% | 21%
A data lake is the new term for data warehouse, but covers the same things | 24% | 23% | 31%
The data lake is a pure marketing term with no technical innovation | 11% | 14% | 7%
Data lakes are not relevant to me | 13% | 12% | 12%

Data lakes are a highly debated concept yet still lack a clear definition. Many view them as places to store structured, semi-structured and unstructured data, schema-free and close to its raw data format. The structure comes with usage, when the data is needed. This definition, however, still leaves many unanswered questions:

• Is the term "data lake" a synonym for Hadoop and big data technologies or is it a collection of all data storage concepts available within a company?
• Is a data lake physical storage or a logical concept?
• Which governance requirements apply to data lakes?

Regardless of the answers to these questions, the concept is relevant. 47 percent of users worldwide confirm the benefits of data lakes.

35 percent of respondents still view the data lake as a new term for an old concept or a pure marketing term. 13 percent feel that the data lake concept is irrelevant.

Variations by region

In a regional comparison, over one fifth of respondents in North America view the concept as a prerequisite for a data-driven company (compared to 13 percent in Europe).

In Europe (46 percent supporters, 37 percent opponents) and North America (42 percent supporters, 38 percent opponents), respondents are split into two factions.

On the whole, there still appear to be insecurities regarding the benefits of data lakes. Its unclear definition and/or lofty promises from vendors can make it even harder to make a realistic assessment of this concept.
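The supporter and opponent shares quoted in this section can be traced back to the "User opinion about Data Lakes" chart by adding up the individual statements. The Python sketch below shows that arithmetic; it assumes that "supporters" combine the first two statements and "opponents" the next two, and the dictionary keys are shorthand labels invented for this illustration.

```python
# Shares (percent) from the "User opinion about Data Lakes" chart (n=384).
# Keys are shorthand labels for the five statements, not the study's wording.
opinions = {
    "requirement": {"Worldwide": 15, "Europe": 13, "North America": 21},
    "central_point": {"Worldwide": 32, "Europe": 33, "North America": 21},
    "new_term": {"Worldwide": 24, "Europe": 23, "North America": 31},
    "marketing_term": {"Worldwide": 11, "Europe": 14, "North America": 7},
    "not_relevant": {"Worldwide": 13, "Europe": 12, "North America": 12},
}

# Assumed grouping, inferred from the percentages quoted in the text.
SUPPORTERS = ("requirement", "central_point")
OPPONENTS = ("new_term", "marketing_term")


def group_share(statements, region):
    """Sum the shares of the given statements for one region."""
    return sum(opinions[name][region] for name in statements)


summary = {
    region: (group_share(SUPPORTERS, region), group_share(OPPONENTS, region))
    for region in ("Worldwide", "Europe", "North America")
}

for region, (supporters, opponents) in summary.items():
    print(f"{region}: {supporters}% supporters, {opponents}% opponents")
```

Under this grouping the sums reproduce the figures in the text: 47 and 35 percent worldwide, 46 and 37 percent in Europe, and 42 and 38 percent in North America.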

Diversity of data lake usage

A precise definition of the term "data lake" is yet to be established, but it is clear that it is a broad concept that is not confined to a single usage scenario. There is, however, a tendency to use the term to refer primarily to data preparation and storage, or to an explorative environment.

A differentiated view of data lake usage by region reveals further insights. Data lakes are more likely to be deployed for preparing and using data in North America than in Europe. In Europe, however, a response of 78 percent clearly shows that the data lake concept is primarily used as storage for raw and detail data.

Usage of Data Lake (n=100)
Usage | Europe | North America
Storage for raw data/detail data | 78% | 38%
Runtime environment for advanced analysis, discovery, exploration | 54% | 54%
Data preparation/data integration | 53% | 65%
Data archive | 48% | 46%
Runtime environment for classic BI | 23% | 31%
Support of transactional systems | 17% | 19%

Key Findings: Implementation

Users primarily view Hadoop as a (potential) technology for implementing new types of use cases

Main driver for using Hadoop by company size (n=144)
Main driver | Average | Less than 250 employees | 250 - 2,500 employees | More than 2,500 employees
Technically better platform than the one we used previously | 29% | 22% | 40% | 27%
New technical platform for things we have not previously managed to do | 60% | 65% | 53% | 62%
Low cost replacement for certain tasks that we were doing previously | 10% | 13% | 3% | 11%

The question of the most important drivers for using Hadoop shows a clear trend. Six out of ten responded: "Hadoop is a new technical platform for things we have not previously managed to do." Three out of ten answered that Hadoop is a "technically better platform".

Only in midsize companies were these two drivers ("optimization through a technically better platform" with 40 percent and "implementation of new types of use cases" with 53 percent) relatively close together.

Only one in ten respondents listed costs as the main driver. This number is surprisingly low because costs were identified as a reason for using Hadoop components (see "Reasons for Use and Benefits").

BARC's findings show that Hadoop usage is primarily driven by new analytic requirements. Costs are not viewed as a main driver. Although cost savings have been considered a decisive advantage of Hadoop for a long time, there seems to be a more differentiated view today. Hadoop can be, but is not necessarily, less expensive depending on the use case.

Key Findings: Challenges

Lack of know-how and usage insecurities are the biggest challenges

BARC’s previous Hadoop Survey identified usage insecurities and a lack of technical and professional know-how as the main challenges for Hadoop. It is a similar story in this survey. Lack of professional (54 percent) and technical (50 percent) know-how topped the list of challenges.

A regional comparison offers further insights. A lack of know-how to use new findings from data research (50 percent) was more frequently viewed as a challenge in Europe. North American participants, in contrast, had more problems with a lack of sponsors and support from executive level (36 percent) and the usability of Hadoop (26 percent).

It seems that while many talk about Hadoop, few use it. And those who do apparently do not want to talk about it. That is understandable. After all, Hadoop is about gaining a competitive advantage, which the results of this survey confirm. The market still lacks experience, security and, especially, knowledge when it comes to Hadoop.

Criticisms of the Hadoop system, such as data protection and security, were only listed as a challenge in every fourth or fifth case. Broken down by industry, data security was a concern primarily in finance (33 percent) and the public sector (38 percent) in comparison to the manufacturing/processing industries (18 percent) and retail (11 percent).

Challenges when implementing Hadoop (n=309)

Challenge                                                                        Worldwide   Europe   North America
Lack of professional know-how                                                    54%         54%      55%
Lack of know-how in the construction of a big data architecture                  50%         52%      45%
Lack of know-how to use new findings from data research                          41%         50%      26%
A lack of compelling use cases                                                   33%         34%      33%
Benefits of a Hadoop initiative are not clear and cannot be communicated clearly 27%         30%      26%
Lack of sponsors/Lack of support from management level                           27%         24%      36%
Concerns regarding data protection and/or data security                          22%         20%      28%
Costs for implementing the new technology are too high                           21%         19%      26%
Immature ecosystem                                                               19%         16%      24%
Low usability of Hadoop for end users                                            16%         12%      31%
Costs for training and development are too high                                  14%         10%      26%
There are no challenges                                                          4%          4%       0%

Hadoop Theories Put to the Test

Hadoop and data lakes require further examination

After the hype comes disillusionment and the growing realization that Hadoop and data lakes are not the answer for all analytic tasks. On the other hand, the technology and concept have great potential. There is still a need to further clarify this area and make the usage and benefits of Hadoop and data lakes more transparent and tangible based on real experiences. This still appears to be one of the greatest challenges in the debate. To help address these insecurities, BARC analysts listed some theories about Hadoop and the data lake concept prior to this survey, compared them with the survey findings, and added their closing comments.

Theory: Hadoop is the preferred technology for implementing a data lake.

Survey findings: The findings show no clear answer. Respondents view Hadoop as a more or less suitable technology to implement a data lake concept.

BARC analysis: There are no clear guidelines for building a data lake. There are many open questions when it comes to designing a data lake, especially regarding metadata management and requirements for a virtual/logical data lake. As a result, Hadoop cannot be called a “preferred” technology in general.

Theory: Hadoop has functional advantages over “classic” BI/DW tools.

Survey findings: Survey respondents viewed this neither as a main advantage nor disadvantage of Hadoop. Basic suitability can be assumed just as in commercial tools and Hadoop distributions.

BARC analysis: Hadoop offers advantages in terms of programming flexibility. However, standard software tools all have their own benefits so the decision on which approach to adopt should depend on the project and the resources and skills available. The prerequisite is to have the basic equipment regardless of the actual requirements.


Theory: Hadoop is fast, flexible and easy to implement.

Survey findings: The survey shows that flexible application design is a strong reason for choosing Apache. A fast, simple implementation (implementation efficiency) was more of a reason for choosing commercial tools and Hadoop distributions.

BARC analysis: Hadoop offers advantages in terms of programming flexibility, and it can also be used for predictive analytics. However, standard software tools all have their own benefits so the decision on which approach to adopt should depend on the project and the resources and skills available. It also depends on the available knowledge of massive parallel processing (MPP).

Theory: Hadoop supports data with many different structures.

Survey findings: This has been confirmed both in this and previous surveys.

BARC analysis: Yes, in the sense of a simple file system saving different formats. The schema comes with usage.

Theory: Hadoop is cost-efficient.

Survey findings: This is generally true, but not the main reason for using Hadoop.

BARC analysis: It can be, but not always. Many simply think of license costs. Costs, however, also depend on the implementation, hardware and operations.


Theory: Hadoop easily scales with growing data volumes and workloads in parallel environments.

Survey findings: Survey participants viewed this neither as a main advantage nor as a disadvantage of Hadoop. Basic suitability, therefore, can be assumed.

BARC analysis: Generally speaking, yes, Hadoop offers advantages in terms of programming flexibility. However, standard software tools all have their own benefits. A basic set of equipment should be guaranteed regardless of the actual requirements.

Theory: Hadoop can be used for analytics as well as online/real-time processing.

Survey findings: Hadoop is mainly used as a technology for analytics and is rarely deployed for online/real-time processing.

BARC analysis: Generally speaking, yes, Hadoop offers advantages in terms of programming flexibility. It needs to be considered that analytics and transactional applications require different designs, components and system configurations.

Sponsors

SAS business profile www.sas.com

With more than $3 billion in sales, SAS is one of the world’s largest software companies and the leading vendor of big data analytics solutions. At more than 80,000 locations around the world, enterprises rely on SAS analytics solutions for a competitive edge in strategic and operational decisions by tapping a wide range of business data, both separately and in conjunction with external data of any scale, for solid business insights.

Big data analytics is the key to not only managing, but profiting from the digital transformation and successfully putting the disruptive processes it entails in place. Thanks to 40 years of experience in the field of data analysis, SAS not only has sweeping vision, but also technology that is pragmatic, proven, secure and built for swift, productive deployment.

SAS systems can be found throughout the business world and in public administration. Its core industries are banking, insurance, trade and manufacturing. Banks use SAS to control their processes and ensure regulatory compliance. Insurance companies use it to detect fraudsters. Retailers rely on SAS to optimize customer communication and campaign management, and to improve online shoppers’ customer experience. Industrial enterprises use it to manage their service and maintenance processes, for instance to ensure that components are replaced before they cause unplanned downtime.

SAS big data analytics helps enterprises extract maximum value from their data. No matter how large and how complex the data sets are, SAS software identifies the relevant structures and relationships. Data become insights, and thus a foundation for solid and prescient business decisions.

SAS high-performance analytics takes full advantage of Hadoop and in-memory computing for fast, economical big data processing. SAS also offers enterprises a platform to analyze, enhance and review data, a major contribution to data quality and governance.

All SAS solutions are also available as managed services and can be deployed in the public cloud, the private cloud or in hybrid cloud environments. One focus here is on solutions for self-service and mobile business analytics that enable individual departments and the management level to gain valuable insights from data without having special knowledge of statistics or requiring support from the IT department.

Background: SAS arose out of a research project at North Carolina State University. Established in 1976 and headquartered in Cary, North Carolina, SAS employs around 14,000 people in 59 countries worldwide. Heidelberg has been the home of SAS’ German headquarters since 1982. The subsidiary has offices in Berlin, Frankfurt, Hamburg, Cologne and Munich and currently employs 520 people. German customers include Allianz, Continental, Commerzbank, HUK Coburg, Fraport, DER Touristik, Nestlé, Galeria Kaufhof, BASF and Meyer Werft.

About BARC

BARC - Business Application Research Center
A CXP Group Company

BARC is a leading enterprise software industry analyst and consulting firm delivering information to more than 1,000 customers each year. Major companies, government agencies and financial institutions rely on BARC’s expertise in software selection, consulting and IT strategy projects.

For over twenty years, BARC has specialized in core research areas including Data Management (DM), Business Intelligence (BI), Customer Relationship Management (CRM) and Enterprise Content Management (ECM).

BARC’s expertise is underpinned by a continuous program of market research, analysis and a series of product comparison studies to maintain a detailed and up-to-date understanding of the most important software vendors and products, as well as the latest market trends and developments.

BARC research focuses on helping companies find the right software solutions to align with their business goals. It includes evaluations of the leading vendors and products using methodologies that enable our clients to easily draw comparisons and reach a software selection decision with confidence. BARC also publishes insights into market trends and developments, and dispenses proven best practice advice.

BARC consulting can help you find the most reliable and cost effective products to meet your specific requirements, guaranteeing a fast return on your investment. Neutrality and competency are the two cornerstones of BARC’s approach to consulting. BARC also offers technical architecture reviews and coaching and advice on developing a software strategy for your organization, as well as helping software vendors with their product and market strategy.

BARC organizes regular conferences and seminars on Business Intelligence, Enterprise Content Management and Customer Relationship Management software. Vendors and IT decision-makers meet to discuss the latest product updates and market trends, and take advantage of valuable networking opportunities.

Along with CXP and Pierre Audoin Consultants (PAC), BARC forms part of the CXP Group, the leading European IT research and consulting firm with 140 staff in eight countries including the UK, France, Germany, Austria and Switzerland. CXP and PAC complement BARC’s expertise in software markets with their extensive knowledge of technology for IT Service Management, HR and ERP.

Other BARC research reports bring transparency to the market

Surveys

The BI Survey 16 is the world’s largest survey of business intelligence software users. Based on a sample of over 3,000 responses, it offers an unsurpassed level of user feedback on 37 leading BI products.

The BARC Big Data Use Cases Survey explores the usage of big data in companies worldwide. 559 business and IT decision-makers completed the survey in the first quarter of 2015.

The BI Trend Monitor 2016 from BARC reflects on the trends currently driving the BI and data management market from a users’ perspective. We asked close to 2,800 users, consultants and vendors for their views on the most important BI trends.


Business Application Research Center – BARC GmbH

Germany
BARC GmbH
Berliner Platz 7
D-97080 Würzburg
+49 (0) 931 880651-0
www.barc.de

Austria
BARC GmbH
Goldschlagstraße 172 / Stiege 4 / 2.OG
A-1140 Wien
+43 (1) 8901203-451
www.barc.de

Switzerland
BARC Schweiz GmbH
Täfernstrasse 22a
CH-5405 Baden-Dättwil
+41 76 340 35 16
www.barc.ch

Rest of the World
+44 1536 772 451
www.barc-research.com