
June 30, 2016

REPORT: BIG IDEA
Democratize Big Data: How to Bring Order and Accessibility to Data Lakes
Success Depends on Adequate Governance, Cataloging and Hands-On Access to Data in Hadoop

Doug Henschen
Vice President and Principal Analyst

Content Editor: R "Ray" Wang
Copy Editor: Maria Shao
Layout Editor: Aubrey Coggins

Produced exclusively for Constellation Research clients

TABLE OF CONTENTS

EXECUTIVE SUMMARY

BROADER ACCESS DRIVES INSIGHTS AND ACTIONS

HADOOP EMERGES AS A CORPORATE STANDARD FOR BIG DATA MANAGEMENT AND INSIGHT

DATA LAKE SUCCESS DEPENDS ON A MATURE APPROACH

SEEK EASE OF USE, REPEATABILITY AND AUTOMATION

LOOK TO NEXT-GENERATION VENDORS TO FILL DATA LAKE GAPS

CONSIDER INCUMBENTS FOR BROADER NEEDS

RECOMMENDATION: TARGET THE CENTER OF ANALYTICAL GRAVITY

RECOMMENDATION: CONSIDER CONSULTANTS AND SYSTEMS INTEGRATORS

TAKEAWAY: PREVENT DATA SWAMPS AND MAKE BIG DATA ACCESSIBLE

ANALYST BIO

ABOUT CONSTELLATION RESEARCH

EXECUTIVE SUMMARY

This report explores key trends in the popularization of Hadoop and strategies being employed to manage data lakes and make them more accessible. Companies have embraced the concept of the data lake or data hub that spans analytical and data-driven application needs. But gaps remain in the maturity and capability of the Hadoop stack, leaving organizations to struggle with how to ingest, cleanse, transform, discover, enrich, blend, catalog and govern data in data lakes.

If the data lake concept is to succeed, Constellation Research believes organizations need three key capabilities:

1. Data management and governance
2. Data cataloging and metadata management
3. Self-service discovery and data preparation

This report examines these three capabilities and the mix of tools available from Hadoop distributions and from next-generation and incumbent data management vendors. In particular, it looks at categories of tools and suites that not only abstract the complexities of Hadoop but also open up access to enterprise IT and business professionals. These users expect unified, enterprise-grade interfaces, self-service data discovery and prep capabilities, and prebuilt integrations to popular business intelligence and analytics tools.

Business Themes: Technology Optimization, Data to Decisions

BROADER ACCESS DRIVES INSIGHTS AND ACTIONS

Innovators and industry leaders enable data-driven decisions across their organizations. However, companies must do preparatory work before data-driven decisions can be achieved.

First, organizations must ingest and organize a vast variety of data originating from numerous sources, including, and increasingly from, external sources. This data often arrives in different formats, which is fine for a modern platform such as Hadoop, but context and metadata must still be maintained, ingested, enriched and applied to support the orchestration of information and alignment with business processes. Over time, insight and more metadata are gained as queries and algorithms are applied to the information to identify correlations, trends and patterns. These insights then guide decisions and actions (see Figure 1). When the orchestration and insight steps are closed off to small groups of experts, the entire process takes longer.

Figure 1. Better Orchestration and Business User Access Drive Insights and Actions in Constellation's Data-to-Decision Framework.

Source: Constellation Research

The broader and more collaborative the access, the more quickly insights can be discovered and decisions can be driven by the data.

Technology Optimization and Innovation is about investing in innovation and strategic advantage while lessening the cost of providing ongoing support. New economic realities necessitate that organizations become smarter in adopting new technologies that can deliver business value while reducing the cost of delivering IT services. Hadoop is viewed by some as a lower-cost alternative to relational databases and storage for managing information at scale. That's one benefit, but a key role of the data lake is to support exploration and correlation of varied data types to uncover new insights that drive innovation and competitive advantage.

HADOOP EMERGES AS A CORPORATE STANDARD FOR BIG DATA MANAGEMENT AND INSIGHT

The sheer scale and variety of data now generated and used in business have created demand for new platforms for managing information. Relational databases are not going away, but they are too costly and too rigid to handle data in all its varieties and at today's massive scale. Apache Hadoop has emerged as the leading platform for Big Data management, thanks to its ability to handle high volumes and diverse velocities and varieties of data in a cost-effective way. NoSQL databases (a separate topic not addressed in this report) have emerged as the leading alternative to relational databases for running high-scale transactional applications.

More than 3,000 organizations are now paying customers of the three big Hadoop software distributors: Cloudera, Hortonworks and MapR. Tens of thousands more are either using Hadoop cloud services, such as Amazon Elastic MapReduce and Microsoft Azure HDInsight, or experimenting with open source distributions on-premises or in the cloud.

Hadoop is now routinely showing up in the enterprise, and it's a given at big banks, insurers, retailers, telecommunications companies and other large, data-intensive organizations. Nonetheless, Hadoop is far from mature. Giant internet firms such as Yahoo and Facebook have led the way on Hadoop innovation, but with their deep engineering teams, these firms are tolerant of low-level management tools and coding-intensive data-prep and data-analysis methods. More mainstream organizations often lack the skills and resources to spin up and maintain code-intensive systems.

Within these more mainstream enterprises, early Hadoop use cases have often been limited to cost savings tied to data warehouse and storage optimization. The data warehouses aren't being eliminated, but organizations are moving older data out of these comparatively expensive platforms built on relational databases and are putting that data into Hadoop for archival storage and, in more sophisticated cases, analytics. These strategies require a disciplined approach to metadata management for retrieval or operational use.

In other cases, firms are reducing the amount of new data that is loaded into the warehouse. They're using Hadoop as a lower-cost alternative for extract, transform and load (ETL) data processing. The raw data at scale stays in Hadoop for exploratory analytics while structured data is refined from the raw data and moved into the warehouse or served up through integrations or through APIs as data services. Metadata and lineage are required to maintain connectivity to Hadoop and the provenance of the data.
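To make this pattern concrete, here is a minimal PySpark sketch of the ETL-offload approach described above. It is illustrative only: the paths, column names and aggregation are hypothetical, and the final hop into the relational warehouse (via Sqoop or a JDBC write, not shown) is assumed.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-offload").getOrCreate()

# The raw clickstream stays in Hadoop, at full fidelity, for exploration.
raw = spark.read.json("hdfs:///lake/raw/clickstream/2016/06/")

# Refine a structured, warehouse-ready extract: filter and aggregate.
daily_visits = (raw
    .filter(F.col("event") == "page_view")
    .groupBy("customer_id", F.to_date("timestamp").alias("visit_date"))
    .count())

# Land the refined extract in the lake; a Sqoop job or JDBC write would
# move it into the relational warehouse from here.
daily_visits.write.mode("overwrite").parquet("hdfs:///lake/refined/daily_visits/")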

As Hadoop deployments mature, the number of use cases and data-driven applications starts to expand. This expansion goes hand-in-hand with growth in the number and variety of data sets in Hadoop, and it results in rising complexity in data-management needs. It is at this critical point that a project that may have started as Hadoop cluster experimentation must evolve into a data lake. Veterans have been here before; it's similar to the difference between a data warehouse and the underlying database management system.

DATA LAKE SUCCESS DEPENDS ON A MATURE APPROACH

The rough idea of the data lake is to serve as the first destination for data in all its forms, including structured transactional records and unstructured and semi-structured data types such as log files, clickstreams, email, images, social streams and text documents. Some label unstructured and semi-structured as "new" data types, but most have been around a long time. We just couldn't afford to retain or analyze this information until now.

The data lake is not synonymous with Hadoop. A lake should be part of the total analytic ecosystem. Thus, it must be integrated with and support incumbent data-management infrastructure, including data warehouses, data marts and the applications that feed and use the data lake.

Data lakes can handle all forms of data, including structured data, but they are not a replacement for an enterprise data warehouse that supports predictable production queries and reports against well-structured data. The value in the data lake is in exploring and blending data and using the power of data at scale to find correlations, model behaviors, predict outcomes, make recommendations, and trigger smarter decisions. Breakthrough ad hoc discoveries can also be turned into automated data pipelines that can feed data-driven applications and trigger smart actions.

The key challenge is that a Hadoop deployment does not magically turn into a data lake. As the number of use cases and data diversity increases over time, a data lake can turn into a swamp if you fail to plan and implement a well-ordered data architecture.

If Hadoop-based data lakes are to succeed, you'll need to ingest and retain raw data in a landing zone with enough metadata tagging to know what it is, when it originated or changed and where it's from. You'll want zones for refined data that has been cleansed and normalized for broad use. You'll want zones for application-specific data that you develop by aggregating, transforming and enriching data from multiple sources. And you'll want zones for data experimentation. Finally, for governance reasons, you'll need to be able to track audit trails, versions and data lineage as required by regulations that apply to your industry and organization.

To support such an architecture, organizations need a robust and mature set of capabilities for data ingestion, transformation, profiling, cataloging, governance, exploration, discovery and preparation. The Apache Hadoop community is working on all of the above as well as more fundamental challenges, including Hadoop systems management and security-and-access controls. But many of the tools and projects at the core of Hadoop are coding-intensive and unfamiliar to enterprise IT and data-management professionals. What's more, Hadoop-native tools may only address Hadoop, whereas data pipelines often span multiple platforms, including NoSQL databases and relational sources and targets as well as Hadoop.

Fortunately, we're seeing a rich ecosystem emerging around Hadoop to address data lakes and larger analytical ecosystems. Options include both open source and commercial software designed to integrate with and complement Hadoop. Some tools also work with Apache Spark, NoSQL and NewSQL databases, and conventional relational databases. This report addresses emerging data management, governance, profiling, cataloging, discovery and self-service prep options that are democratizing the data lake (see Figure 2). In the pages that follow, Constellation Research will examine each category, associated vendors, what to look for, and strategic buying considerations.

SEEK EASE OF USE, REPEATABILITY AND AUTOMATION

Data lakes can turn into data swamps as the volume and variety of data retained and managed grows. That's why it's so important to have a well-planned data architecture. You can't just dump everything into the lake and trust that you'll be able to find or effectively use the information.
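As a concrete illustration of the landing-zone ingestion and metadata tagging described above, here is a minimal Python sketch. The zone names, directory layout and metadata fields are hypothetical; in practice a catalog service or Hadoop-native tooling would record this metadata rather than a sidecar file.

import json
import shutil
import time
import uuid
from pathlib import Path

# One directory per zone: landing (raw), refined (cleansed and normalized),
# application (aggregated for specific uses) and sandbox (experimentation).
LAKE_ROOT = Path("/data/lake")  # e.g., an HDFS mount or staging area

def ingest_to_landing(source_file, source_system):
    """Copy a file into the landing zone and record minimal provenance:
    what it is, where it came from and when it arrived."""
    dataset_id = str(uuid.uuid4())
    target_dir = LAKE_ROOT / "landing" / source_system / dataset_id
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(source_file, target_dir)
    metadata = {
        "dataset_id": dataset_id,
        "zone": "landing",
        "source_system": source_system,
        "original_name": Path(source_file).name,
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    # Sidecar metadata file stored alongside the data it describes.
    (target_dir / "_metadata.json").write_text(json.dumps(metadata, indent=2))
    return dataset_id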

Figure 2. Three Ways to Better Manage and Democratize Data Lakes.

Source: Constellation Research

A well-planned data architecture will include the ability to ingest and retain raw data with enough metadata to illuminate the substance and provenance of the data. You'll want to establish a zone for refined data that has been cleansed and normalized to serve as standard data (a single source of truth on customers, for example) that can be used in many applications. And you'll want to establish a zone for application-specific data that might have been transformed, aggregated and enriched for specific use cases. For governance reasons, you may need to show audit trails and track data lineage as required by regulations that apply to specific industries.

There are multiple projects within the Apache Hadoop framework that address fundamental challenges, including data access, security controls, data profiling, cataloging and data lineage tracking. Unfortunately, some of these projects are fractured along distributor lines, a fracturing that contributes to gaps and differences in data lake management.

Cloudera, for example, touts Apache Sentry for unified role-based security access control and RecordService for the centralized enforcement of these permissions across the Hadoop platform. Cloudera Manager provides perimeter security, with Active Directory and Kerberos integration. Cloudera Navigator provides data governance and data management capabilities that offer better visibility into and validation of the data available, including audit, column-level lineage, discovery, metadata management, and policy enforcement. Cloudera also provides encryption across all data and key management as part of its platform.

Hortonworks touts Ranger, Knox and Atlas as its mix of security and data-governance software. Ranger addresses centralized, policy-based authorization, authentication, audit and security controls across the Hadoop stack, while Knox provides perimeter security. Hortonworks introduced Apache Atlas in 2015 as a metadata service for accessing and exchanging data inside and outside of Hadoop. Atlas addresses data-lineage tracking across Sqoop, Hive, Falcon, Storm and Kafka as well as metadata management and classification by asset type and business language, giving organizations a better understanding of data inside the data lake. Classification-based security enables data stewards to apply tags such as PII or PCI to data assets such as columns, tables or databases.

MapR follows yet another path on data governance and delivers these capabilities at the platform level. The company replaces the Hadoop Distributed File System (HDFS) with a commercial POSIX file system with read/write capabilities as well as other data organization constructs such as "volumes" to govern data. Volumes facilitate the best practice of organizing data in different zones, and MapR touts its approach as offering higher scalability and easier management than Hadoop distributions that rely on HDFS. The POSIX file system also supports Linux pluggable authentication and flexible access control modules known as Access Control Expressions (ACEs).
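To illustrate the idea behind the classification-based tagging described above, here is a toy Python sketch of tag-aware column filtering. The tags, columns and policy are hypothetical and not tied to Atlas, Ranger or any other vendor API; real enforcement happens in the platform's security layer.

# Tags applied by data stewards to columns of a hypothetical table.
COLUMN_TAGS = {
    "customer_name": {"PII"},
    "card_number": {"PII", "PCI"},
    "visit_date": set(),
    "page_url": set(),
}

def allowed_columns(user_clearances):
    """Return the columns a user may see: every tag on a column must be
    covered by the user's clearances."""
    return [col for col, tags in COLUMN_TAGS.items() if tags <= user_clearances]

print(allowed_columns(set()))           # ['visit_date', 'page_url']
print(allowed_columns({"PII", "PCI"}))  # all four columns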

The basics of data-access control and security are addressed by all Hadoop distributions, but at this writing, available governance tools are not complete. For example, governance has to account for metadata for data linkage, master data referencing and more. Hadoop vendors rely on third-party vendors for these capabilities, providing APIs to their Hadoop-native tools. What's more, the components typically used for data processing and transformation, namely MapReduce and the Apache Spark Core, are comparatively low-level and coding-intensive, requiring specialized expertise.

To address the critical gaps in capabilities required to manage the data lake, commercial vendors are extending and complementing the tools and components native to Hadoop. Two camps of vendors have emerged -- next-generation firms and incumbent data-management firms -- that are attempting to democratize the data lake by making it easier for enterprise data management teams to ingest, cleanse, combine, transform, catalog and govern data in data lakes. Many are also attempting to make data accessible to a broader community of users, including data analysts and data-savvy business users, by adding their own versions of the style of self-service data exploration, discovery and prep tools pioneered by next-generation vendors.

The next-generation data management vendors, which have all emerged within the last five years, are mostly focused on supporting Hadoop and helping to manage data lakes. Incumbent data management vendors have served SQL-centric, relational needs for a decade or more, but most have either added technologies or reengineered their technologies to work with Hadoop, NoSQL databases and a wider variety of high-scale data. Both camps promise to simplify and democratize the data lake through three crucial traits:

• Ease of use: The next-generation and incumbent data management vendors both promise a simplified and unified data-management experience that opens up the data lake to more professionals by abstracting them from the complexities of coding and myriad, low-level interfaces. This reduces the need for specialized skills and, in some cases, introduces self-service capabilities for business users. Many of these vendors expose data catalogs and self-service data-prep capabilities directly to data analysts and data-savvy business users to broaden access to the data lake.

• Repeatability: Both camps help professionals build data pipelines for ingesting data, capturing metadata, applying data-quality rules and executing parsing, filtering and transformation steps in a repeatable way. This ensures consistency and eliminates repetitive, labor-intensive tasks. Self-service data-prep tools also contribute to repeatability, enabling data analysts and business users to find new insights and combinations of data that can be adapted to production data pipelines.

• Automation: Beyond creating repeatable workflows with job-scheduling features, next-generation and incumbent data management vendors often automate tasks such as data ingestion, integration, quality filtering, data transformation, data profiling and data cataloging. Some tools also apply machine learning, natural language processing and semantic technologies to automatically discover, categorize, prepare and recommend data sets. A minimal profiling sketch follows the technology list below.

The Technologies:
· Data ingestion
· Workflow
· Data cataloging
· Metadata management
· Data lineage
· Data sampling and profiling
· Self-service data prep
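As promised above, here is a minimal PySpark sketch of automated profiling feeding a catalog entry. The path and the shape of the catalog record are hypothetical; commercial tools layer machine learning and semantic analysis on top of basic statistics like these.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("auto-profile").getOrCreate()
df = spark.read.parquet("hdfs:///lake/refined/daily_visits/")

def profile(df):
    """Build a minimal catalog entry: schema, row count and per-column
    null counts."""
    null_counts = df.select(
        [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
    ).first().asDict()
    return {
        "columns": [(f.name, f.dataType.simpleString()) for f in df.schema.fields],
        "row_count": df.count(),
        "null_counts": null_counts,
    }

print(profile(df))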

LOOK TO NEXT-GENERATION VENDORS TO FILL DATA LAKE GAPS

As the data lake concept has taken root, organizations lacking Hadoop skills and experience have longed for easier-to-use and more unified tools for data lake management. Next-generation vendors were the first to identify the gaps in open-source distributions, and many have created tools, like data catalogs and self-service prep tools, that have since become must-have data lake capabilities. Some of these vendors and tool sets are Hadoop-centric while others also address data-management needs across relational platforms.

• Data management and governance: Data ingestion, transformation, profiling, cataloging, quality and metadata management are all core capabilities required for data lake management. The most comprehensive offerings from next-generation vendors are from Podium Data and Zaloni. Both vendors also supplement Hadoop-native security and data-lineage capabilities, and both added self-service data-prep modules last year.

• Data cataloging and metadata management: Data catalogs and metadata enable users to see what data sets are available, what they contain, where they're from, how they're used and more. These capabilities are provided by Podium Data and Zaloni, but the focused, best-of-breed players in this space are Alation, Collibra and Waterline Data. These vendors continuously scan available data, capture and enhance metadata, and support settings, alerts and notifications on data policies. The technology also supplements Hadoop-native data-lineage and security capabilities.

Alation is particularly adept at cataloging structured data and can scan relational sources such as Amazon Redshift, IBM Netezza, MySQL and Teradata. Alation has a SQL-centric understanding of keys, indexes and query usage patterns. It's also certified with Cloudera and Hortonworks, with Hadoop implementations being primarily Hive-focused, with support for both Impala and Tez.

© 2016 Constellation Research, Inc. All rights reserved. 13 Collibra is aimed at data governance and and repeatable workfows. These tools stewardship, providing a business semantics also provide data access to personal and glossary and collaborative tools for corporate business intelligence, data managing data structures, hierarchies visualization and analytical tools. and mapping. Pioneers in this category include Datameer, Waterline is a Hadoop-centric vendor that Paxata, Tamr and Trifacta, all of which have uses machine learning and semantic analysis evolved their technologies to work with technology to automatically profle and Apache Spark. Paxata and Trifacta are close add metadata to structured data and to competitors while Datameer and Tamr differ unstructured and semi-structured data sets in that they also offer analytical capabilities. including clickstreams, log fles, JSON data, Datameer offers general-purpose Big Avro and Parquet formats. Data visualization and analysis tools while Tamr offers focused customer and clinical • Self-service discovery and data , procurement and media preparation: This is a popular route to analytics. Podium Data and Zaloni joined the democratization of data because it’s self-service push last year with their releases about letting data analysts and data- of Podium Prepare and Zaloni Mica. savvy business users profle, explore and statistically sample data, typically with Constellation Research Analysis the help of machine learning, natural language processing, and semantic analysis No two organizations are exactly alike, so technologies. The next step is self-service priorities for one frm may be less important cleansing, joining, enrichment, quality, to another. These differences can relate to mastering and transformation of data the nature of the business and variety of data without help from IT. Prep work is usually in use. Firms differ in their talent pools and supported by collaborative features ability to attract new expertise. They also

© 2016 Constellation Research, Inc. All rights reserved. 14 differ in sophistication and current technology service data prep tool if the suite vendor is investments. Finally, companies differ in their far behind. ability and appetite to spread data-driven decision-making to business users. Keeping • Ease of use: “Easy to use” is a relative these variables in mind, Constellation Research term, with some products being easy to believes that next-generation vendors should use and intuitive to data scientists and be considered in the following contexts: others being straightforward to data- savvy business users. Before looking at • Breadth of capabilities: In the best-of-breed products, frst consider what classes of versus suite wars, Constellation Research users and how many users will need to use believes that suites usually win…eventually. the tools. This applies to both administrative But when new categories like data and management tools and data-access, cataloging and self-service data prep are exploration and prep tools. Complexity and still emerging, the innovators tend to have lack of familiarity kill adoption and create differentiating features and functions while bottlenecks. Make sure would-be users try the suite vendors are still playing catch up. (and perhaps even train) before you buy. Another consideration is that organizations don’t always need the breadth of capabilities • Enterprise-grade capabilities: Consider offered by a suite. Those considering the maturity of products to be deployed Podium and Zaloni, for example, might also in complex, shared services environments

consider the ingestion, transformation, with strict service level agreements and cleansing and workfow capabilities of an governance and compliance requirements. incumbent vendor that they may already Next-generation tools must support use (see below). What’s more, having a collaborative data workfows and suite does not mean you have to use all of governance activities while also meeting its components. A company might want to scalability, multi-tenancy, and automated add a best-of-breed data cataloging or self- data-delivery capabilities.

© 2016 Constellation Research, Inc. All rights reserved. 15 • SQL-centric versus Big-Data-centric supporting both styles of analysis. Thus, the needs: Even Hadoop vendors like Cloudera, next-generation vendors should be considered which offers a high-scale data mart and incumbent vendors assessed in terms alternative with its Impala SQL engine for of how far they’ve come in developing next- Hadoop, acknowledge that a data lake generation capabilities. is no replacement for an enterprise data warehouse (EDW). On the other hand, CONSIDER INCUMBENTS an EDW is not well suited to fexible, FOR BROADER NEEDS schema-on-read analysis in which users can explore and fnd value in variable and Well-established data management vendors, semi-structured data such as clickstreams, including IBM, Informatica, Oracle, Pentaho log fles, sensor data, mobile app data, social (now a Hitachi Group company), SnapLogic, data and so on. Syncsort and Talend, have long focused on simplifying, streamlining and automating A key question is whether your data lake data management tasks. All are evolving their will drive predominantly SQL-centric or suites to work with modern Big Data platforms exploratory, Big-Data-centric analysis. Next- including Hadoop, Apache Spark and NoSQL generation tools are usually more adept at databases. The question is how far they’ve handling big, messy data sets while incumbent progressed in their evolutionary journey and data management tools typically shine in SQL- how they might ft with your data lake goals centric settings. Realistically, many enterprises and larger analytics strategy. lack and are fnding it hard to hire people with experience with Big Data analysis techniques, Incumbents are worthy of consideration, so they’re falling back on familiar, SQL-centric particularly if you’re using one of these suites analyses. Nonetheless, Constellation Research and your people are already familiar with its believes companies should set their sights on pre-built connectors to source systems, visual

Incumbents are worthy of consideration, particularly if you're using one of these suites and your people are already familiar with its pre-built connectors to source systems, visual data-flow orchestration interfaces, out-of-the-box data-parsing and data-transformation routines, job scheduling modules and services-based data-delivery capabilities. All of the above save data-management professionals time and deliver on the promise of ease of use, repeatability and automation. Some of these vendors have truly comprehensive suites incorporating data quality and master data management capabilities, and these, too, are being evolved to address Big Data and data lake needs.

Most of the vendors mentioned above emerged in the relational database era while a few, including IBM and Syncsort, got started in the mainframe era. Several have also added their answers to the data cataloging and self-service data prep modules popularized by next-generation vendors.

Constellation Research Analysis

The incumbents, by definition, have the advantage of incumbency. If you're already invested with a particular vendor and plan to stick with it for your SQL-centric needs, it can't hurt you to consider that vendor's Hadoop and data lake-focused capabilities. If those capabilities build on and have the look and feel of the rest of the suite, you'll at least have product familiarity on your side and training needs might be reduced. Here are four other evaluation criteria to consider:

• Breadth of capabilities: While some suites are focused tightly on the path from data ingestion, transformation and enrichment to data delivery (the data lake equivalent of ETL), others offer more comprehensive capabilities including data quality, master data management, data cataloging and self-service data prep. In some cases, suites have been assembled through multiple acquisitions, so consider whether they present consistent user, deployment and administrative experiences.

Some vendors, notably SnapLogic, have recently re-architected to deliver their technology software-as-a-service style, simplifying and speeding deployment and ongoing administration.

Still other vendors, notably Syncsort, have deep capabilities for offloading and integrating mainframe workloads to the data lake. Consider short-term and long-term needs and choose the portfolio and deployment style that best fits your needs.

• Hadoop compatibility: Most, but not all, incumbent vendors have adapted their suites to run on Hadoop. This step is crucial, as it's the key to handling data at Hadoop scale with all the advantages of distributed, massively parallel processing. Develop a detailed understanding of how the technology runs on and works with Hadoop, including compatibility with YARN, native security, governance modules, MapReduce processing, open-source data-connection and data-movement tools such as Hive and HCatalog, as well as Hadoop-centric data formats such as Parquet and ORC. Also inquire about licensing terms and any requirements to run software or data-processing steps on servers that are separate from the Hadoop cluster. Finally, investigate data-pipeline capacity constraints when moving data between Hadoop and other platforms. You don't want a monolithic, hard-to-change, bottlenecked system constraining your data lake.

• Spark compatibility: Apache Spark has become a standard component of Hadoop distributions, thanks to its versatility and in-memory processing power. The most advanced incumbent vendors can not only connect to Spark but invoke Spark as part of data processing pipelines. Some systems can automatically assign workloads to the best-fit processing engine for a given job, whether that's MapReduce for high-scale batch processing or Spark for low-latency, in-memory processing of smaller batches or streaming data. A toy sketch of such engine selection follows this bullet.
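As a toy Python sketch of the engine-selection idea just described: the thresholds and job attributes are hypothetical, and a real product would submit the job to YARN rather than print a decision.

def choose_engine(job):
    """Route a job to a best-fit engine, mimicking tools that assign
    work to MapReduce or Spark automatically."""
    if job["latency_sensitive"] or job["input_gb"] < 500:
        return "spark"       # low-latency, in-memory processing
    return "mapreduce"       # high-scale, disk-based batch processing

jobs = [
    {"name": "nightly_clickstream_rollup", "input_gb": 4000, "latency_sensitive": False},
    {"name": "fraud_score_refresh", "input_gb": 20, "latency_sensitive": True},
]
for job in jobs:
    print(job["name"], "->", choose_engine(job))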

• Data cataloging and self-service data prep: Incumbent data management vendors have typically focused on the needs of IT and data professionals, but some have responded to the demand for data-analyst and business-user compatible data cataloging and data-prep tools. The key questions are just how business-user friendly and capable these tools are. Are they better suited to advanced data analysts or are they truly usable by reasonably data-savvy business users? It's important for would-be users to try before the company buys. Further, can self-service data-prep workflows be put directly into production or are they representations of workflows that have to be built out and connected to data by IT types? If it's the latter, IT bottlenecks may remain.

RECOMMENDATION: TARGET THE CENTER OF ANALYTICAL GRAVITY

Hadoop vendors talk about the "center of data gravity" moving to Hadoop, but Constellation Research believes you also have to consider the center of analytical gravity. That is, on what analytical platforms (Hadoop, Spark, relational databases, embedded apps, etc.) will companies derive the greatest return on insight, and will that change?

When companies are starting from scratch, it's much easier to make technology selections. Internet startups building from scratch to harness Big Data, for example, can start with open source distributions and add commercial or home-grown tools as needed. Greenfield scenarios like this are rare.

Within many companies, particularly mainstream firms, Hadoop and open source platforms in general have tended to enter through the back door. Pilots and proof-of-concept projects were initiated in isolation, with small teams assembled apart from mainstream data management staffs.

Times are changing. As Hadoop has evolved into a corporate standard, as use cases have multiplied, and as the data lake concept has taken root, data management and data analysis must become more of a single, cohesive ecosystem. That does not mean that one platform takes over. Rather, the data lake must support evolved analytical tools and evolved applications and application infrastructure. Ideally, it should also support a wide spectrum of analytic talent, from spreadsheet warriors to data scientists.

Data marts might move to the Hadoop environment, or they might remain in relational environments for many years to come.

© 2016 Constellation Research, Inc. All rights reserved. 19 Advanced analytical analyses might move to Hadoop or The data lake’s Spark, or they might move to distributed database or in- ability to handle memory grids. The question is where will the bulk of the varied data at high analysis take place and what are the breakthrough insights scale has changed and production workloads of tomorrow? the game, driving dynamic, 360-degree SQL has been around a long time and it was never terribly customer views effective at cracking variable and semi-structured data and breakthrough types like clickstreams, log fles, sensor data, mobile insights based on app data, social data and so on. The data lake’s ability to correlations of social, handle varied data types at high scale has changed the mobile, machine and game, driving dynamic, 360-degree customer views and transactional data. breakthrough insights based on correlations of social, mobile, machine and transactional data.

While it’s important for the data lake to support SQL- centric analysis, don’t think of it as a modern, high-scale replacement for the data warehouse. If you’re not tapping new data and coming up with new insights, you’re not doing it right.

Two Other Important Trends to Consider: Cloud and Streaming

There are two other important trends to weigh when considering data lake architecture and management technologies:

• Cloud deployment options: Cloud is the fastest-growing deployment mode for Hadoop, for databases, and for applications and infrastructure. It's important to consider private-cloud and public-cloud deployment options and cloud-service compatibility. A starting-point question is whether a tool or suite is available as software as a service. If not, are certain components available in the cloud? If it's a software-only offering, does it have a micro-services architecture and RESTful APIs that facilitate cloud deployment?

• Streaming support: Among the hottest trends emerging in 2016 in the data-to-decisions domain is support for streaming data processing and analysis. Marketers and e-commerce vendors have been among the pioneers of real-time use cases because presenting the right content, offer or recommendation at the right time can make a huge dollars-and-cents difference in campaign and sales performance.

New streaming use cases are popping up all over the place, with connected cars, smart oil fields, smart utilities and precision medicine being popular examples. A key point is that historical data, such as that held in a data lake and in data warehouses, is invariably needed to bring context to the real-time insights. Therefore, streaming data management and streaming data delivery capabilities -- to apps and real-time analytical tools -- should be data lake capabilities.
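A minimal sketch of a stream enriched with historical context, using Spark Structured Streaming as one example engine; the paths, schema and join key are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("stream-with-context").getOrCreate()

event_schema = (StructType()
    .add("customer_id", StringType())
    .add("event", StringType())
    .add("amount", DoubleType()))

# Micro-batch stream of new event files arriving in a landing directory.
events = spark.readStream.schema(event_schema).json("hdfs:///lake/landing/events/")

# Historical profiles from the lake bring context to each real-time event.
profiles = spark.read.parquet("hdfs:///lake/refined/customer_profiles/")

enriched = events.join(profiles, "customer_id", "left_outer")

# Stream the enriched events to the console; a real pipeline would feed
# apps or real-time analytical tools instead.
query = enriched.writeStream.format("console").outputMode("append").start()
query.awaitTermination()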

RECOMMENDATION: CONSIDER CONSULTANTS AND SYSTEMS INTEGRATORS

How might your firm take advantage of Big Data? Experienced consultants and systems integrators tend to start with the big questions, such as what are the business objectives and competitive challenges? Next, what insights might lead to a deeper understanding of customers, optimization of current practices, monetization of data or, even better, breakthrough products or services that might yield entirely new revenue streams? What's the total inventory of data available -- even if it's not currently used -- and how might it drive breakthroughs and innovations?

Big-picture thinking and experience are reasons organizations consider the services of a consulting firm or systems integrator such as Accenture, Deloitte, IBM, CapGemini, Infosys, Wipro or a smaller firm such as Persistent Systems or Teradata's Think Big Analytics unit. These firms tend to have analytics practices and experienced Big Data deployment teams that can draw on lessons learned across scores, if not hundreds, of deployments. What's more, these consultancies and integrators are experienced at defining and implementing data architecture strategies.

In some cases, integrators support Hadoop and data lake deployments with technology and automation capabilities as well as frameworks, blueprints and best-practice methodologies. Infosys, for example, has the Infosys Information Platform, which offers a unified interface to both Hadoop and Spark, supporting data pipelines that might make use of MapReduce, Spark and Infosys' own machine learning technologies. Infosys has partnerships with both Waterline and Trifacta.

In another example, Think Big Analytics focuses mainly on consulting services for building and maintaining data lakes and data science with open source technologies. Think Big offers tools such as a data lake framework for simplifying ingestion and governance using diverse processing tools, and Dashboard Engine, which opens up access to data through REST, SQL and MDX interfaces with support for session-based analysis, time-series analysis, dashboarding and reporting through mainstream tools such as Tableau and MicroStrategy.

Consultants and systems integrators often have partnerships and experience deploying open-source distributions and incumbent commercial offerings, but check whether they have experience with next-generation options, which aren't as well known. Some next-generation vendors contend that consultants and integrators have a bias toward incumbents or pure open-source software because of the consultants' experience set.

Also helpful is consultant and systems integrator experience with optimization of data warehouses and offloading of archival and data-processing workloads to Hadoop. They can help firms assess current-state and plan future-state deployments, identifying opportunities for innovation and determining the appropriate center of analytical gravity.

TAKEAWAY: PREVENT DATA SWAMPS AND MAKE BIG DATA ACCESSIBLE

If the data lake is to succeed, you can't just ingest data and create new data sets indiscriminately. You'll need to create zones for landing raw data, cleansing and normalizing data, and aggregating and enhancing application-specific data. In short, you need a mature approach to data lake management and governance.

Unfortunately, the capabilities and technologies of the Hadoop stack are incomplete, and many tools are low-level, coding-intensive and unfamiliar to enterprise data management teams. Commercial vendors are closing the gaps with:

1. Data management and governance tools that are accessible to enterprise IT
2. Data cataloging and metadata management tools that are accessible to IT, analysts and data stewards
3. Self-service discovery and data-prep tools that are accessible to analysts and data-savvy business users.

The choice of tools and suites that might best complement the Hadoop stack will depend on the data, business goals, sophistication, and analytical focus of the organization. But the success of the data lake depends on a mature approach and technology-empowered data management, analysis and business professionals.

ANALYST BIO

Doug Henschen
Vice President and Principal Analyst

Doug Henschen is Vice President and Principal Analyst at Constellation Research, Inc., focusing on data-driven decision making. His Data-to-Decisions research examines how organizations employ data analysis to reimagine their business models and gain a deeper understanding of their customers. Data insights also figure into tech optimization and innovation in human-to-machine and machine-to-machine business processes in manufacturing, retailing and services industries.

Henschen’s research acknowledges the fact that innovative applications of data analysis require a multi-disciplinary approach, starting with information and orchestration technologies, continuing through business intelligence, data visualization, and analytics, and moving into NoSQL and Big Data analysis, third-party data enrichment, and decision management technologies. Insight-driven business models and innovations are of interest to the entire C-suite.

Previously, Henschen led analytics, Big Data, business intelligence, optimization, and smart applications research and news coverage at InformationWeek. His experiences include leadership in analytics, business intelligence, database, data warehousing, and decision-support research and analysis for Intelligent Enterprise.

Further, Henschen led business process management and enterprise content management research and analysis at Transform magazine. At DM News, he led the coverage of database marketing and digital marketing trends and news.

@DHenschen | www.constellationr.com/users/doug-henschen | www.linkedin.com/in/doughenschen

ABOUT CONSTELLATION RESEARCH

Constellation Research is an award-winning, Silicon Valley-based research and advisory firm that helps organizations navigate the challenges of digital disruption through business model transformation and the judicious application of disruptive technologies. Unlike the legacy analyst firms, Constellation Research is disrupting how research is accessed, what topics are covered and how clients can partner with a research firm to achieve success. Over 350 clients have joined from an ecosystem of buyers, partners, solution providers, C-suite, boards of directors and vendor clients. Our mission is to identify, validate and share insights with our clients.

Organizational Highlights

· Named Institute of Industry Analyst Relations (IIAR) New Analyst Firm of the Year in 2011 and #1 Independent Analyst Firm for 2014 and 2015.

· Experienced research team with an average of 25 years of practitioner, management and industry experience.

· Organizers of the Constellation Connected Enterprise – an innovation summit and best practices knowledge-sharing retreat for business leaders.

· Founders of Constellation Executive Network, a membership organization for digital leaders seeking to learn from market leaders and fast followers.

www.ConstellationR.com @ConstellationR

[email protected] [email protected]

Unauthorized reproduction or distribution in whole or in part in any form, including photocopying, faxing, image scanning, e-mailing, digitization, or making available for electronic downloading is prohibited without written permission from Constellation Research, Inc. Prior to photocopying, scanning, and digitizing items for internal or personal use, please contact Constellation Research, Inc. All trade names, trademarks, or registered trademarks are trade names, trademarks, or registered trademarks of their respective owners.

Information contained in this publication has been compiled from sources believed to be reliable, but the accuracy of this information is not guaranteed. Constellation Research, Inc. disclaims all warranties and conditions with regard to the content, express or implied, including warranties of merchantability and fitness for a particular purpose, and assumes no legal liability for the accuracy, completeness, or usefulness of any information contained herein. Any reference to a commercial product, process, or service does not imply or constitute an endorsement of the same by Constellation Research, Inc.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold or distributed with the understanding that Constellation Research, Inc. is not engaged in rendering legal, accounting, or other professional service. If legal advice or other expert assistance is required, the services of a competent professional person should be sought. Constellation Research, Inc. assumes no liability for how this information is used or applied nor makes any express warranties on outcomes. (Modified from the Declaration of Principles jointly adopted by the American Bar Association and a Committee of Publishers and Associations.)

Your trust is important to us, and as such, we believe in being open and transparent about our financial relationships. With our clients’ permission, we publish their names on our website.

San Francisco | Belfast | Boston | Colorado Springs | Cupertino | Denver | London | New York | Northern Virginia | Palo Alto | Pune | Sacramento | Santa Monica | Sydney | Toronto | Washington, D.C.

© 2016 Constellation Research, Inc. All rights reserved.