Building a Data Lake for the Enterprise

Arcadia PAGE 20 HOW TO GET MORE VALUE FROM YOUR DATA LAKE AND EXTEND ACCESS TO MORE USERS HPE Data Security PAGE 22 SOLVING FOR BUILDING A DATA SECURITY AND PRIVACY IN IoT Cask DATA LAKE FOR PAGE 24 5 KEY INGREDIENTS OF A SUCCESSFUL THE ENTERPRISE DATA LAKE Informatica PAGE 26 BUILDING AN ENTERPRISE DATA LAKE WITH BRAINS, NOT BRAWN IBM PAGE 27 HARNESSING A LOGICAL DATA WAREHOUSE AND DATA RESERVOIR FOR BIG DATA SUCCESS Franz PAGE 28 TRANSFORMING DATA LAKES INTO KNOWLEDGEABLE PLATFORMS Cambridge PAGE 29 SMART DATA LAKES: REVOLUTIONIZING ENTERPRISE ANALYTICS Best Practices Series BUILDING A DATA LAKE FOR THE ENTERPRISE Best Practices Series Data lakes are forming as a response to today’s big data to launch a data lake initiative and another 15% had already sub- challenges, offering a cost-effective way to maintain and manage mitted a budget for adoption. On top of that, 35% were research- immense data resources that hold both current and future ing and considering a data lake approach. In addition, the Uni- potential to the enterprise. However, enterprises need to build sphere survey found clear early use cases exist for the data lake. these environments with great care and consideration, as these However, governance and security are still top of mind as key potentially critical business resources could quickly lose their way challenges and success factors for the data lake. with loose governance, insecure protocols, and redundant data. Besides the mechanical considerations of data storage, there In their book, BI and Analytics on a Data Lake: The Definitive is a need to integrate complex datasets to build more intelligent Guide, Sameer Nori and Jim Scott identify the role of a data lake, analytic frameworks for more targeted and intelligent decisions. serving as a repository for raw data “which was previously too The data lake helps move big data analytics beyond handling large expensive to store and process.” It can then be moved to a data files of flat data. Advanced analytics, sitting on a data lake, will warehouse, or simply serve as “as an online archive for infrequently enable the creation of a robust set of business capabilities, such as accessed data.” The benefits of a data lake are far-reaching, from predictive and prescriptive analytics. boosting big data analytics capabilities to simply serving as a hold- For example, a key area in which data lakes are proving their ing tank for raw data that can be refined at some future point by potential is the healthcare sector. A semantic data lake for health- more advanced systems. The best part is data lakes are agnostic care is underway at Montefiore Medical Center, which involves to the format of the data being supported—data can come in any a sophisticated machine learning project that is slated to go live format, from any vendor, for any purpose. for patient care in the summer of 2017. The data lake supports In a survey of 385 IT and data managers, Unisphere Research predictive analytics to clinicians, starting with the ability to flag found that the data lake is increasingly recognized as both a viable patients entering the health system at high risk of experiencing and compelling component within a data strategy, with compa- a serious crisis event within 48 hours. The system also generates nies large and small continuing to move toward adoption (“Data customized checklists of intervention tasks sent to clinicians that Lake Adoption and Maturity Survey Findings Report,” Unisphere may help to avert or lessen the impact of the crisis. Research/Radiant Advisors, October 2015). At the time of the sur- The following are best practices for making the most of data vey, more than 32% of enterprises already had an approved budget lakes in the enterprise: 18 BIG DATA QUARTERLY | SPRING 2017 sponsored content THINK ABOUT THE BUSINESS also be part of the architectural equation. There are many different As with any technology endeavor, a data lake that is created types of data-level solutions that can be considered and designed, simply for the purpose of having a data lake will accomplish little. including Hadoop, Spark, NoSQL databases, and data as a service. The Unisphere survey finds 70% of data managers see data lakes It will also be necessary to eventually be able to accommodate as a way to boost data discovery, data science, and big data proj- next-generation approaches, such as real-time data streaming, ects—functions critical to today’s analytics-driven enterprises. machine learning, artificial intelligence, and 3D interfaces that A majority of respondents, 58%, also view data lakes as key to will become an essential part of business processes. Also important enabling real-time analytics or operationalized insights—again, are the significant technical distinctions between lakes and ware- functions that are critical to businesses that seek to succeed in the houses. While data warehouse data is highly structured, data lakes 21st century. Thus, alignment with business objectives—better store data in its original format—structured, unstructured, and understanding customer needs or creating new revenue streams— semi-structured. Because data warehouse data is loaded through is vital, and the data lake needs to accomplish purposes that data extract, transform, and load processes, it is more expensive per warehouses may not be capable of addressing. byte than data lakes, which do not thoroughly vet incoming data. In addition, data warehouses are more likely supported by com- THINK EXPERTISE mercial vendors, while data lakes tend to be home-grown affairs. As with any major data environment, it’s critical to have compe- tent professionals. Hadoop skills are still notoriously hard to come THINK SECURE by, as is expertise in data governance, data architecture, and data One of the most pressing concerns about data lakes is that security. Managing data lakes requires renaissance professionals they may end up as “data swamps.” The worry is that they may with skills in a broad range of areas. Needed are people who have end up storing untrustworthy data, sensitive corporate data that strong backgrounds in data management and analytics. Data sci- may be subject to unintentional exposure, or even data that vio- entists or high-level data analysts are also required to help business lates corporate policies or ethics, such as personally identifiable users get the most out of their data experiences. information or pornography. Accordingly, the Unisphere survey finds that respondents report the most vexing challenges are meta- THINK ABOUT WHAT DATA REALLY NEEDS data management issues (71%), security (67%), and governance TO BE CAPTURED AND STORED (71%). Rules and policies that outline access and permissions In today’s on-demand environment, many third-party data need to be established, just as in a standard database environ- sources are available almost instantaneously from their original ment. The same safeguards and processes employed to protect sources, such as Google or Twitter Analytics, or other sites with traditional data environments—such as data encryption—need indexed information that are made available for analysis. These to be extended to data lakes. Regulations and mandates—such sources are readily available to enterprise decision makers, and it as HIPAA or Sarbanes-Oxley—will apply to data lakes as well. may not be necessary to capture and store the data in a separate Data held within data lakes is also subject to legal discovery or corporate location. Data that may need to be stored and main- proceedings, in the same manner as all other electronic data. In tained is that which is transitory—such as readings from sensors. addition, disaster recovery and business continuity should be part Before making decisions that will require storage capacity and of a security strategy. In the event of system failure, what datasets security considerations, determine what can be best maintained get priority in terms of recovery? Which datasets should be repli- on the web and what needs to be captured. Another consideration cated in real time, immediately available to end users without so is the fact that established data is often very difficult to move— much as a hiccup? As many data lakes may serve as repositories for data already existing in transactional systems, such as mainframe historical data, they may not require priority, real-time replication. platforms, may need to stay where it is. THINK SELF-SERVICE THINK LONG-TERM, THINK ARCHITECTURALLY All across the data and IT space, there is a growing emphasis on Enterprise architecture and future infrastructure requirements enabling end users to access IT services as they need them. A data are part of the equation in looking forward over the next 3, 5, or lake that delivers value would enable business users to make inqui- 10 years. While data lakes can be quite fluid and requirements and ries with little or no help from the IT department. Framed within technologies are changing quickly beneath data managers’ feet, a permissions, users need to be able to see all the datasets available for data lake still must be part of a long-term, comprehensive strat- analysis or reporting. Ideally, users should be able to, at the click of egy. Since the data lake needs to map to the business, it’s import- a button, view an on-demand catalog of datasets and be able to run ant to plan how and what kind of data will be maintained in the analysis or reports at any time. They should also be able to identify repository, who will have access to it, and how it will be governed. potential new data sources as well, and understand the trustworthi- Achieving a robust infrastructure capable of scaling for large ness of the data.

Building a Data Lake for the Enterprise

Amazon Connect Data Lake Best Practices AWS Whitepaper Amazon Connect Data Lake Best Practices AWS Whitepaper

Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility

Cost Modeling Data Lakes for Beginners How to Start Your Journey Into Data Analytics

A Comprehensive Study of Recent Metadata Models for Data Lake

Harness the Power of Your Data

Lake Data Warehouse Architecture for Big Data Solutions

LOOK BEFORE YOU LEAP INTO the DATA LAKE by Rash Gandhi, Sanjay Verma, Elias Baltassis, and Nic Gordon

Solution Brief Data-Driven Transformation on AWS: a Blueprint

A Big Data Lake for Multilevel Streaming Analytics

Data Lakes Efficiently Consolidate Your Data

Sub-Second Analytics for User-Facing Applications with Apache Spark™ and Rockset Venkat Venkataramani CEO and Co-Founder, Rockset About Me

Essential Guide to Data Lakes