IBM Hybrid Data Management Building a robust, governed data lake for AI

Scale to handle across hybrid deployments with federation, governance and storage options.

Contents

The data lake’s role in an AI-focused information architecture
Providing choice, scalability and integration for data lakes
Automating governance in the data lake
Choosing the right data lake storage technology
Real-world data lake applications across industries
Building your robust, governed data lake with IBM

80% of worldwide data will be unstructured by 2025. [1]

Several trends are converging that make the data lake more important than ever before. The first is the rise of artificial intelligence (AI) and machine learning (ML). Both require an abundance of data from a variety of different sources in order to produce the most accurate models and insights. This has been true of all forms of analytics, but the level of automation in AI and ML places more importance on robust and varied data.

The data lake’s role in an AI-focused information architecture

The second trend boosting the data lake’s importance is the sheer volume and variety of data being produced, particularly unstructured and semi-structured data. For example, the Internet of Things (IoT) is playing an increasingly prominent role as more devices become connected and relay data back to organizations. This streaming IoT data is also being joined by social media and other written information such as patient notes, alongside audio or visual data like that from call centers or customer complaints.

YouTube has over 500 hours of new content uploaded every minute. [2]

The number of IoT devices could rise to 41.6 billion by 2025. [3]

The net result is that companies need a data lake that is capable of sitting alongside their databases and data warehouses while handling these new forms and amounts of data. Yet it’s important to find a data lake that stands up to the challenges of the enterprise. This ebook introduces key components of building a successful data lake including flexibility, governance and storage technology, with several real-world examples provided.


According to a recent Ventana research study, enterprises with a data lake variously reported that they gained competitive advantage, noted lowered costs, saw better communication and knowledge sharing, enjoyed improved customer experiences, believed it helped them respond better to opportunities and threats, and revealed an increase in sales.


The data lake is prized for its flexibility in accepting data “as-is” without imposing a structure on it at the time of ingestion. However, the data lake’s ability to change to suit business needs should extend beyond that to include high levels of scalability and multiple deployment options. Federation across other components of the information architecture is also vital to ensure that the data lake can be used as a key component of the data management strategy no matter where it lives.
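The idea of accepting data “as-is” and applying structure only when it is read can be illustrated with a minimal Python sketch. The record contents, the `ingest` and `read_with_schema` helpers and the field names are all hypothetical and are not taken from any IBM product; the sketch only shows the general pattern.

```python
import json

# Hypothetical raw records, landed in the lake exactly as they arrived.
RAW_RECORDS = [
    '{"device_id": "sensor-1", "temp_c": 21.5}',
    '{"device_id": "sensor-2", "temp_c": 19.0, "humidity": 0.41}',
    "free-text complaint: the unit keeps rebooting",
]


def ingest(raw_lines):
    """Store every record unchanged; no structure is imposed at ingestion."""
    return list(raw_lines)


def read_with_schema(stored, fields):
    """Apply a structure only at read time.

    Records that are not JSON objects are skipped by this particular reader,
    but they stay in the lake for other consumers (for example text analytics).
    """
    rows = []
    for line in stored:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        # Missing fields become None instead of causing a rejection.
        rows.append({name: record.get(name) for name in fields})
    return rows


lake = ingest(RAW_RECORDS)
readings = read_with_schema(lake, ["device_id", "temp_c", "humidity"])
for row in readings:
    print(row)
```

All three records are kept at ingestion, and the reader decides which fields it cares about, so a later consumer can ask for different fields without re-ingesting anything.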

Providing choice, scalability and integration for data lakes

Scalability

In order to accommodate the heightened volumes of data being created, the data lake must be able to scale extensively, quickly, and at a low cost. Often this is accomplished by using clusters of inexpensive commodity hardware that alleviate over-provisioning and make scaling easier. This will help ensure that there are no complications in gathering the wide array of data required for AI and ML to create insight. Quick scalability also means that the ingestion of real-time data is less likely to be interrupted due to lack of room. As an added bonus, cheap, extensive scalability also opens the opportunity to store “cold” historical data at a lower cost than in a database or data warehouse.

Deployment options

Enterprises must be able to deploy data lakes on premises, across multiple clouds or as part of a hybrid solution. These options are necessary to address the various needs enterprises face. For instance, data that is heavily regulated may need to reside on premises behind the firewall for compliance, while the time and cost benefits of a cloud “pay-as-you-go” model may be more advantageous for other use cases. Hybrid provides the best of both worlds.

85% of companies are operating in a multicloud environment. [4]

45% of businesses worldwide are running at least one of their Big Data workloads in the cloud. [5]

Data federation

Federating data across the entire information architecture is essential for AI and ML; it doesn’t matter how much data you have for models and insights if that data is siloed and difficult to combine. Federating the data that resides in data warehouses, data lakes and even external sources with a SQL-on-Hadoop engine like IBM Db2 Big SQL helps data scientists easily select the data they want and use it quickly in a self-service model. That includes unstructured, semi-structured and structured data, which are all needed for models to be created with a holistic perspective.

As a result, data scientists spend less time preparing the data and more time creating the AI and ML models that will propel the business forward. In addition, federation is much quicker than previous enterprise service bus (ESB) or extract, transform, load (ETL) options that were handled as batch processes. Instead, federation provides low-latency capabilities that are well suited to capturing up-to-date streaming data from IoT and similar sources.
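To make the federation idea concrete, here is a minimal Python sketch in which a single SQL statement joins data held in two separate stores. It does not use IBM Db2 Big SQL; two in-memory SQLite databases stand in for a warehouse and a data lake, and the table names, column names and values are invented for illustration.

```python
import sqlite3

# Assumption: two SQLite databases stand in for separate systems.
# "main" plays the data warehouse, "lake" plays the data lake.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

# Structured customer data in the "warehouse".
conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# Device readings in the "lake".
conn.execute("CREATE TABLE lake.device_events (customer_id INTEGER, reading REAL)")
conn.executemany(
    "INSERT INTO lake.device_events VALUES (?, ?)",
    [(1, 20.0), (1, 22.0), (2, 18.5)],
)

# One query spans both stores; nothing is copied into a staging area first.
query = """
SELECT c.name, COUNT(*) AS events, AVG(e.reading) AS avg_reading
FROM main.customers AS c
JOIN lake.device_events AS e ON e.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
"""
results = conn.execute(query).fetchall()
for name, events, avg_reading in results:
    print(name, events, avg_reading)
```

The point of the sketch is the shape of the query rather than the engine: the person writing it addresses both sources by name in one statement instead of waiting for a batch job to move data between them.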

Poor data quality costs the US economy up to $3.1 trillion yearly. [6]

A successful AI and ML practice requires more than just the flexibility to ingest multiple types of data at high volumes and integrate them across deployments. Without proper governance and cataloging, the data being used to train models may be inaccurate, leading to biased or untrustworthy results. Moreover, a lack of proper data cleansing and implementation could lead to data scientists spending more time preparing the data instead of using it to drive insight. A data lake must be capable of providing accurate data to those authorized to access it through self-service. Some key capabilities that enable this are listed below.

Automating governance in the data lake

Data integration

Modern data integration solutions require a data delivery platform supporting different delivery styles. It should efficiently integrate data across multicloud environments with in-line data quality and active metadata and policy enforcements. Often this includes an agile architecture based on containers and microservices, and ML-powered capabilities that synchronize data from databases to cloud data warehouses, providing faster access to high-quality and high-volume data for analytics.

Automation capabilities operationalize high-quality data throughout the enterprise and build self-service data pipelines by providing the right people with access to the right data at the right time. One of the best examples is IBM DataStage®, which is AI-powered to fulfill multicloud and hybrid cloud data integration needs.

By 2022, manual data integration tasks will be reduced by 45% thanks to ML and automated service-level management. [7]

Data cataloging

An enterprise data catalog facilitates the inventory of all structured and unstructured enterprise information assets. By using an intelligent metadata catalog, you can define data in business terms, track the lineage of your data and visually explore it to better understand the data in your data lake. The catalog can serve multiple stakeholders in the organization, eliminating inefficiencies associated with “lost in translation” issues.

IBM Watson® Knowledge Catalog in particular takes data cataloging to the next level with automated data discovery and metadata generation, ML-extracted business glossaries and automated scanning and risk assessments of unstructured data.



Data governance

Keep your data compliant and audit-ready by building a clean, governed data lake. A data catalog can automate the classification and profiling of data assets and automatically enforce data protection rules established to anonymize and restrict access to sensitive information. More importantly, if something goes wrong, controls allow the organization to rapidly respond to an issue, whether that means flagging sensitive data, identifying and remediating issues, or collecting information in response to an audit.

The data lineage capabilities, common vocabulary and ability for business users and data scientists to search and explore data visually with products like IBM Watson Knowledge Catalog also provide deeper insight through governance.

Prepare now to meet future compliance and audit challenges.

Self-service data access

Data scientists’ time is valuable. They should spend it finding patterns and trends, not preparing data. Provide reliable, high-quality data to your data scientists, data stewards and governance and compliance teams and empower them to reach your organization’s analytics goals.

80% of data scientists’ time is spent finding, cleaning and organizing data. [8]

The latest development in modern data science is an AutoAI capability that automates the data preparation and modeling stages of the data science lifecycle. Now, not only can more data scientists use their specialized skills the way they were intended, but more businesses can benefit from data science, from prediction to optimization, with solutions such as IBM Watson Studio Premium for IBM Cloud Pak for Data.
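The automated classification and protection rules described under data governance can be sketched in a few lines of Python. This is an illustrative toy, not how IBM Watson Knowledge Catalog works: the rule patterns, the labels, and the `classify`, `pseudonymize` and `apply_policy` helpers are all assumptions made for the example.

```python
import hashlib
import re

# Hypothetical protection rules: column-name patterns mapped to a classification.
CLASSIFICATION_RULES = [
    (re.compile(r"(ssn|social_security)", re.IGNORECASE), "sensitive"),
    (re.compile(r"(email|phone)", re.IGNORECASE), "personal"),
]


def classify(column_name):
    """Return the first matching classification, or 'public' if none match."""
    for pattern, label in CLASSIFICATION_RULES:
        if pattern.search(column_name):
            return label
    return "public"


def pseudonymize(value):
    """Replace a value with a truncated SHA-256 digest.

    This is simple pseudonymization for illustration only; a real deployment
    would need salting or tokenization to resist guessing attacks.
    """
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def apply_policy(record, authorized=False):
    """Enforce the rules on one record for a given caller."""
    protected = {}
    for column, value in record.items():
        label = classify(column)
        if label == "public" or authorized:
            protected[column] = value
        elif label == "personal":
            protected[column] = pseudonymize(value)
        else:
            # "sensitive" columns are withheld entirely from unauthorized callers.
            protected[column] = None
    return protected


record = {"customer_id": 7, "email": "a@example.com", "ssn": "123-45-6789"}
masked = apply_policy(record)
print(masked)
```

Because the rules live in one place and are applied at access time, the same stored record can be served in full to an authorized steward and in masked form to a self-service user.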

Leaders are 74% more likely to use extensive data cleansing. [9]

Cloud object storage, file storage, and Hadoop are the three primary data storage options for data lakes due to their high scalability, ability to accept data in various formats and easy accessibility by data scientists and analysts. While all are valuable as part of a data lake, each is best suited for certain use cases; there are important differences that can guide your choice of one over the other.

Choosing the right data lake storage technology

Object storage

Object storage manages and manipulates distinct units called objects, which contain the data, metadata and a unique identifier. Data remains in its native format and there’s no limit to the metadata that is used. The main benefit of object storage is that it enables users to scale computing power and storage independently. This provides considerable cost efficiencies in rapidly changing, dynamic environments. For these reasons, it is best used for line-of-business applications, websites, mobile apps, IoT data stores, analytics initiatives, and long-term archives.

File storage

A Global Parallel File System provides a multi-protocol interface to allow both transparent HDFS access and file access to the same storage capacity. Object storage solutions can be integrated into the global parallel file system data for additional big data simplicity. Capacity expands with scalability into the yottabytes and is flash accelerated for faster performance and data access. Automated storage lifecycle management including data reduction provides cost saving efficiency and additional capacity savings. Security is enhanced to ensure data privacy, authenticity, and auditability with live event notification, end-to-end encryption and WORM or immutable data.

Hadoop

Apache Hadoop is built on open source foundations, relying on community support to drive continual improvements and innovation. Like object storage, Hadoop efficiently stores and processes large datasets and ingests semi-structured and unstructured data before transformation. It is built on the reliable Hadoop Distributed File System (HDFS) storage layer, offering fault tolerance, reliability and high availability. A key distinction between Hadoop and object storage is that processing and storage capacity are scaled together, which may cause a lack of flexibility in dynamic markets.
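The object model described above (data, metadata and a unique identifier) can be sketched as a small in-memory Python class. This is a teaching toy under stated assumptions, not the API of any real object storage service; `StoredObject`, `ObjectStore` and the metadata keys are hypothetical names.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object: payload in its native format, free-form metadata, unique ID."""
    data: bytes
    metadata: dict = field(default_factory=dict)
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ObjectStore:
    """A flat namespace of objects addressed by identifier, not by file path."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        """Store the bytes unchanged and return the generated identifier."""
        obj = StoredObject(data=data, metadata=dict(metadata))
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id):
        return self._objects[object_id]

    def find(self, **criteria):
        """Return objects whose metadata matches every given key/value pair."""
        return [
            obj
            for obj in self._objects.values()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


store = ObjectStore()
image_id = store.put(b"\x89PNG...", content_type="image/png", source="patient-upload")
note_id = store.put(
    "Patient reports mild fever.".encode("utf-8"),
    content_type="text/plain",
    source="patient-upload",
    language="en",
)
print(len(store.find(source="patient-upload")))
```

Each object carries whatever metadata its producer chooses, which is why an image and a text note can sit side by side and still be found with the same metadata query.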

62% of data lake users deploy their data lakes on Hadoop. [10]

Data lakes have gained popularity in customer-related industries including financial services, healthcare and communications. However, the data lake can provide a cost-effective and scalable architecture for the collection and processing of data for advanced analytics across all industries. Think about how the following examples might apply to your own industry.

Real-world data lake applications across industries

Financial services

While the financial services industry may seem like the perfect example of a business that would primarily rely on structured data, the reality is much different. Semi-structured and unstructured data such as location data, IoT data, clickstreams, and social media can help improve customer insights and fight fraud.

The right data lake can use real-time data to take these unstructured data sources and use them to reveal customer behaviors, attitudes and experiences. This can, in turn, lead to the development of profitable products and services that more accurately reflect the needs of customers. Furthermore, real-time data from IoT sensors and social media can provide a more accurate view of customer actions, helping to prevent fraud by revealing inconsistencies in user behavior. For example, a bank in Asia was able to increase conversions by 300% and decrease fraud incidents by 30% using a data lake, advanced analytics, and AI.

Healthcare

The rapid increase in telemedicine and cross-organizational healthcare collaboration in response to customer desires and worldwide health crises has put an even greater emphasis on the semi-structured and unstructured medical data being created. Remote diagnosing and monitoring benefits from a data lake that can collect images and written descriptions of issues provided by the patient and deliver a more detailed data set for analysis.

Similarly, the doctors’ notes and other unstructured data can be included in electronic health records (EHR), providing a more robust view of the patient for other practitioners and insurance companies. In aggregate, these patient records provide an invaluable set of information that collaborating companies can collect within a highly scalable data lake and use to more accurately research potential avenues for more effective treatments.

Communications service providers

A plethora of new data is available for communications service providers (CSP), such as data from IoT and connected devices, clickstreams, image and video feeds, and a variety of external sources. A highly scalable data lake is necessary to make the most use out of these sources of semi-structured and unstructured big data. Using these new sources of data gives a more holistic, 360-degree view of customers, which in turn supports churn prediction and prevention, more personalized marketing, and improved customer experiences.

In addition, the data lake’s support for fast analysis of large data sets can help make more efficient use of the CSP’s network, which typically accounts for 40% of its capital and operating budgets. Bottlenecks, peak loads, logs and other information can be tracked and used either in real-time or predictively to optimize the network, driving lower costs and delighting customers.


Big data analytics in healthcare could be worth $67.82 billion by 2025. [11]

AI and ML need a data lake to be part of an organization’s information architecture so that they can take full advantage of semi-structured and unstructured data from sources such as IoT. Only then can they achieve the all-encompassing perspective necessary to deliver complete insights and unbiased models.

IBM provides enterprise-grade data lake solutions with the flexibility, governance and storage options you require and decades of industry experience to help provide the right setup for your business. IBM’s partnership with Cloudera also gives customers the ability to select the Cloudera Data Platform with IBM and augment it with IBM offerings such as IBM Db2 Big SQL and IBM Big Replicate as well as multi-vendor support and consulting services.

IBM and Cloudera have partnered to deliver the best enterprise-grade Hadoop solution available, complete with multi-vendor support. Learn more in the Total Value of Ownership analyst report on Cloudera and IBM.

Have questions or want to know more about any of the concepts in this ebook? IBM experts are just a click away and are happy to help guide you to a better, AI-ready data lake.

© Copyright IBM Corporation 2020

IBM Corporation
New Orchard Road
Armonk, NY 10504

Produced in the United States of America
December 2020

IBM, the IBM logo, ibm.com, Db2, DataStage, IBM Watson, InfoSphere, and IBM Cloud Pak are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

The performance data and client examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions.

THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

The client is responsible for ensuring compliance with laws and regulations applicable to it. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the client is in compliance with any law or regulation.

Statement of Good Security Practices: IT system security involves protecting systems and information through prevention, detection and response to improper access from within and outside your enterprise. Improper access can result in information being altered, destroyed, misappropriated or misused or can result in damage to or misuse of your systems, including for use in attacks on others. No IT system or product should be considered completely secure and no single product, service or security measure can be completely effective in preventing improper use or access. IBM systems, products and services are designed to be part of a lawful, comprehensive security approach, which will necessarily involve additional operational procedures, and may require other systems, products or services to be most effective. IBM DOES NOT WARRANT THAT ANY SYSTEMS, PRODUCTS OR SERVICES ARE IMMUNE FROM, OR WILL MAKE YOUR ENTERPRISE IMMUNE FROM, THE MALICIOUS OR ILLEGAL CONDUCT OF ANY PARTY.

BQRDKWYZ

[1] https://solutionsreview.com/data-management/80-percent-of-your-data-will-be-unstructured-in-five-years
[2] https://www.cnbc.com/2018/03/14/with-over-1-billion-users-heres-how-youtube-is-keeping-pace-with-change.html
[3] https://techjury.net/blog/big-data-statistics/#gref
[4] https://www.ibm.com/thought-leadership/institute-business-value/report/multicloud
[5] https://techjury.net/blog/big-data-statistics/#gref
[6] https://techjury.net/blog/big-data-statistics/#gref
[7] https://www.gartner.com/en/newsroom/press-releases/2019-02-18-gartner-identifies-top-10-data-and-analytics-technolo
[8] https://www.ibm.com/analytics/data-science
[9] https://www.ibm.com/downloads/cas/K1OGEMA9
[10] https://insidebigdata.com/2018/07/24/new-survey-reveals-businesses-bullish-data-lakes/
[11] https://techjury.net/blog/big-data-statistics/#gref