
Defining the Lake

An Introduction to Lakes and Common Use Cases

www.zaloni.com

Introduction

A data lake is a central location in which to store all your data, regardless of its source or format. It is typically, although not always, built using Hadoop. The data can be structured or unstructured. You can then use a variety of storage and processing tools—typically tools in the extended big data ecosystem—to extract value quickly and inform key organizational decisions.

Because of the growing variety and volume of data, data lakes are an emerging and powerful architectural approach, especially as enterprises turn to mobile, cloud-based applications, and the Internet of Things (IoT) as right-time delivery mediums for big data.

Data Lake versus EDW

The differences between enterprise data warehouses (EDW) and data lakes are significant. An EDW is fed data from a broad variety of enterprise applications. Naturally, each application’s data has its own schema, requiring the data to be transformed to conform to the EDW’s own predefined schema. Because it is designed to collect only data that is controlled for quality and conforms to an enterprise data model, the EDW is capable of answering only a limited number of questions.

Data lakes, on the other hand, are fed data in its native form. Little or no processing is performed to adapt the structure to an enterprise schema. The biggest advantage of data lakes is flexibility. By allowing the data to remain in its native format, a far greater—and timelier—stream of data is available for analysis.
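As an illustrative sketch only (not from Zaloni's platform), this "schema-on-read" idea can be shown in a few lines of plain Python: records land in the lake exactly as the source emitted them, and a structure is imposed only at query time. The record shapes and field names below are hypothetical.

```python
import json

# Hypothetical raw records, landed exactly as two different sources emitted
# them: different shapes, no upfront schema enforcement.
raw_records = [
    '{"customer_id": 7, "channel": "crm", "note": "renewal call"}',
    '{"user": "@acme_fan", "channel": "social", "text": "love the product"}',
]

def query(records, wanted_field):
    """Schema-on-read: interpret each raw record only when a query runs,
    keeping the records that happen to carry the requested field."""
    results = []
    for line in records:
        record = json.loads(line)  # parse lazily, at read time
        if wanted_field in record:
            results.append(record[wanted_field])
    return results

# Two different "schemas" can be projected from the same raw store.
print(query(raw_records, "note"))   # CRM-shaped view
print(query(raw_records, "text"))   # social-media-shaped view
```

An EDW, by contrast, would force both sources through one predefined schema at load time, discarding any fields that do not fit.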

Some of the benefits of a data lake include:

• Ability to derive value from unlimited types of data
• Ability to store all types of structured and unstructured data in a data lake, from CRM data to social media posts
• More flexibility—you don’t have to have all the answers up front
• Ability to store raw data—you can refine it as your understanding and insight improves
• Unlimited ways to query the data
• Application of a variety of tools to gain insight into what the data means
• Elimination of data silos
• Democratized access to data via a single, unified view of data across the organization when using an effective data management platform


Key Attributes of a Data Lake

To be classified as a data lake, a big data repository should exhibit three key characteristics:

1. A single shared repository of data, typically stored within a distributed file system (DFS). Data lakes preserve data in its original form and capture changes to data and contextual semantics throughout the data lifecycle. This approach is especially useful for compliance and internal auditing activities. It is an improvement over the traditional EDW, where data that has undergone transformations, aggregations and updates is challenging to piece together when needed, and organizations struggle to determine its provenance.

2. Includes orchestration and job scheduling capabilities (e.g., via YARN). Workload execution is a prerequisite for the enterprise, and YARN provides resource management and a central platform to deliver consistent operations, security and tools across Hadoop clusters, ensuring analytic workflows have access to the data and the computing power they require.

3. Contains a set of applications or workflows to consume, process or act upon the data. Easy user access is one of the hallmarks of a data lake, because organizations preserve the data in its original form. Whether structured, unstructured or semi-structured, data is loaded and stored as-is. Data owners can then consolidate customer, supplier and operations data, eliminating technical—and even political—roadblocks to sharing data.
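A minimal sketch of the first characteristic, preserving data in its original form while capturing provenance, might look like the following. The `ingest_raw` helper, the zone layout and the manifest fields are hypothetical illustrations, not Zaloni Bedrock's actual API.

```python
import hashlib
import json
import time
from pathlib import Path

def ingest_raw(payload: bytes, source: str, fmt: str,
               lake_root: str = "lake/raw") -> Path:
    """Land a record in the raw zone unchanged, alongside a small manifest
    capturing contextual semantics: source, format, arrival time, content hash."""
    zone = Path(lake_root) / source
    zone.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(payload).hexdigest()
    data_path = zone / f"{digest[:12]}.{fmt}"
    data_path.write_bytes(payload)  # original bytes, no transformation
    manifest = {
        "source": source,
        "format": fmt,
        "sha256": digest,
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    data_path.with_suffix(".manifest.json").write_text(json.dumps(manifest))
    return data_path

path = ingest_raw(b'{"claim_id": 42}', source="claims_app", fmt="json")
print(path.read_bytes())  # the payload is preserved byte-for-byte
```

Because every landed file carries its own provenance record, later questions about lineage (which system produced this, when, and has it changed?) can be answered without reverse-engineering transformations, which is the compliance and auditing benefit described above.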

Zaloni’s reference data lake architecture highlights a zone-based approach to governing data in the data lake.

Data lakes are becoming more and more central to enterprise data strategies. Data lakes best address today’s data realities: much greater data volumes and varieties, higher expectations from users, and the rapid globalization of economies.


Most common use cases for data lakes today

Although the potential use cases are limitless, data lakes are seeing success today with these common use cases:

1. EDW augmentation: Offloading data from a traditional enterprise data warehouse (EDW) to Hadoop or the cloud brings storage cost savings and increases bandwidth in the EDW for business intelligence (BI) processes.

2. Agile analytics: A “fail fast” approach in which hypotheses, testing, iterating and improvements are in a constant cycle using real-time data, which can result in more innovative insights that can add business value.

3. Enterprise reporting: The ability to do ad hoc reporting using an enterprise-wide data source is key to understanding the business in real-time and reducing risk.

4. Data monetization: More enterprises are leveraging their data to better understand current customers and also develop new products and services that resonate with consumers.

5. Data science: Some enterprises are working to support an enterprise-wide data science capability, particularly for predictive modeling and analytics using machine learning and large datasets.

You want a data lake, not a data swamp

Data lakes can provide big wins for an organization, but implementing, managing and scaling a data lake can be problematic. The key to realizing continuous business value from a data lake is applying, from day one, the appropriate data management and governance needed for data visibility, reliability, security and privacy, which can then allow broader access to data by multiple users (with permissions). Organizations that apply data management best practices to their big data lakes are experiencing significant benefits, whether it’s reducing costs, enhancing or adding new revenue streams, or reducing business risk.

Case studies in data lake success

Companies are making serious strides with big data implementations, addressing old problems in new ways and creating new opportunities to enhance their business, their customer loyalty and their competitive advantage. Below we outline four distinct use cases in four different industries that leveraged an integrated data lake management platform to build a clean, actionable data lake and were well-rewarded for the effort:

1. Fraud detection in the insurance industry
2. EDW augmentation in market research
3. Hospital re-admission risk reduction in healthcare
4. Network analytics in telecommunications

1. Improving fraud detection and reduction

Workers’ compensation fraud, including employee, employer and medical provider fraud, is estimated to cost states $1 billion to $3 billion each annually. One reason is that many fraudsters are able to manipulate billing faster than investigators can audit. To support a new real-time anti-fraud analytics solution for one of the country’s largest providers of workers’ compensation insurance, Zaloni built a data lake solution.

At a Glance:
• Industry: Insurance
• Company Description: Large US-based provider of workers’ compensation
• Technical Use Case: Data Lake 360˚: Agile Analytics
• Business Use Case: Fraud and Payment Integrity
• Deployment: On premises

Before the data lake: The company’s business analysts were using manual processes to build and run SQL queries from several systems, which could take hours or days to get results. The company wanted to unify its data silos and legacy data platforms to support a new anti-fraud analytics solution and enable non-technical staff to quickly perform self-service data analytics through an intuitive user interface.

With the data lake: Zaloni’s big data management platform enabled real-time data ingestion and automated fraud detection. We also created a centralized data catalog that provided an intuitive user interface for search and ad-hoc analytics of all data.

Results: The new system improved the company’s ability to detect patterns of potential fraud hidden in vast amounts of claims data, and enabled attorneys to quickly build cases with original claim documents. Data was used to prosecute more than $150 million in fraudulent cases. With queries now taking seconds versus hours or days, the company was able to analyze more than 10 million claims, 100 million bill line item details and other records.

2. Enjoying more cost-effective storage and faster data processing

A leading provider of market and shopper information, predictive analytics and business intelligence to 95% of the Fortune 500 consumer packaged goods (CPG) and retail companies needed a more cost-effective and efficient approach to process, analyze and manage data. Zaloni designed and built the backend solution architecture for an enterprise data warehouse (EDW) augmentation solution to provide a more cost-effective, flexible and expandable data processing and storage environment.

At a Glance:
• Industry: Market Research
• Company Description: Top 10 Market Research Firm
• Technical Use Case: EDW Augmentation
• Deployment: On premises

Before the data lake: Tasked with managing massive volumes of data (ingesting hundreds of gigabytes of data from external data providers every week), the company struggled to keep costs low while providing clients with state-of-the-art analytic and intelligence tools.

With the data lake: Using Hadoop for the aggregate point-of-sale (POS) dataset and servicing the extractions that populate the company’s custom, in-memory analytics farm enabled the company to offload more data, faster, and realize substantial savings in a very short period of time.

Results: The company realized $5.2 million in annual savings and $4.4 million in projected additional savings (upon completion of the fine-grained data-at-rest modification project). Additionally, the company saw a nearly 50% reduction in mainframe MIPS (millions of instructions per second, a measure of processing power and CPU consumption) and better throughput with Zaloni Bedrock managing the data lake.

3. Reducing re-admission risk

One of the US’ leading healthcare companies wanted to more effectively use big data to deliver the best care more affordably. Specifically, it wanted to be able to better determine readmission risk for patient discharges and predictively classify readmissions as potentially preventable or not preventable. Zaloni helped them achieve this by developing a Hadoop-based managed data lake housing more than 60 million records of historical patient data.

At a Glance:
• Industry: Healthcare
• Company Description: Healthcare Provider and Health Plan
• Technical Use Case: Data Lake 360°: Agile Analytics
• Business Use Case: Population Health: Re-admission Risk
• Deployment: On premises

Before the data lake: The company was unable to seamlessly access high volumes of both new and existing data from disparate sources and analyze it for patterns and trends.

With the data lake: The healthcare customer had a clear view of what data was in the data lake, could track its source, format and lineage, and could enable users to more efficiently search, browse and find the data needed for analytics.

Results: Zaloni’s solution enabled the healthcare customer to reduce costs and improve outcomes by ultimately lowering readmission rates. This was due to a significantly improved understanding of potentially preventable readmissions, the ability to develop better algorithms to more accurately predict readmissions, and a new ability to identify patients with the highest risk of readmission early in their initial hospitalization and proactively adjust treatment plans sooner to account for that risk.

4. Meeting reporting compliance and improving customer experience

A large telecom company was required to archive large and growing volumes of wireless call detail records (CDRs) to comply with government regulations. Although it was a significant cost to store all of this data, the company identified an opportunity to leverage the data to provide better customer service, grow its business and ensure it was getting the expected return on billions invested in capital expenses. To enable this goal, Zaloni architected and built a managed data lake to serve as the single repository for all traffic-related, inventory and provisioning data (CDRs, SNMP, server logs).

At a Glance:
• Industry: Telecommunications
• Company Description: South American telecommunications giant
• Technical Use Case: Data Lake 360°: Agile Analytics
• Business Use Case: NOC/SOC: Network Analytics
• Deployment: On premises

Before the data lake: The company’s wireless network generated 4 terabytes of data per day from voice, data and SMS CDRs. This data was created by 11 different servers and switches with 8-10 different record layouts. As a result, there was no centralized repository to store all these CDRs. In addition, the upstream mediation system, which is responsible for merging CDR records into a single record for each session, was sending duplicate CDRs or not stitching the call records together completely.

With the data lake: The company was better able to perform load balancing, government reporting, network analysis, and parallel processing that removed duplicate data and reconstructed call records from partial or lost records. With a managed data lake, the telecom company not only gained a compliance solution but was also able to more efficiently manage the network and continue to provide a great customer experience as the volume of subscriber usage continued to grow significantly.

Results: The company was able to double data ingestion, to 8 terabytes per day. The company also avoided costly fines and penalties by meeting the immediate need for government reporting requirements, and gained insights into network utilization to avoid network congestion in near real-time.

Data Lake Management Platform
Bedrock is a fully integrated data lake management platform that provides visibility, governance, and reliability. By simplifying and automating common data management tasks, customers can focus time and resources on building the insights and analytics that drive their business.

Self Service
Mica provides the on-ramp for self-service data discovery, curation, and governance of data in the data lake. Mica provides business users with an enterprise-wide data catalog through which to discover data sets, interact with them and derive real business insights.

Zaloni Professional Services
Your trusted partner for building production data lakes

Zaloni has more than 400 staff-years of big data experience working globally across the US, Latin America, Europe, the Middle East and Asia. Zaloni Professional Services offers expert big data consulting and training services, helping clients plan, prepare, implement and deploy data lake solutions.

Professional Services Include:
• Big Data Use Case Discovery and Definition
• Data Lake Assessment Services
• Solution Architecture Services
• Data Lake Build Services
• Data Lake Analytics Application Development
• Data Science Services

About Zaloni
Delivering on the Business of Big Data

Zaloni is a provider of enterprise data lake management solutions. Our software platforms, Bedrock and Mica, enable customers to gain competitive advantage through organized, actionable big data lakes. Serving the Fortune 500, Zaloni has helped its customers build production implementations at many of the world’s leading companies.

To learn more about Zaloni: www.zaloni.com
Call us: +1 919.323.4050 | E-mail: [email protected]
Find us on social media: Twitter @zaloni
