Building a Robust, Governed Data Lake for AI

IBM Hybrid Data Management

Scale to handle big data across hybrid deployments with federation, governance and storage options.

Contents

The data lake’s role in an AI-focused information architecture
Providing choice, scalability and integration for data lakes
Automating governance in the data lake
Choosing the right data lake storage technology
Real-world data lake applications across industries
Building your robust, governed data lake with IBM

80% of worldwide data will be unstructured by 2025.[1]

Several trends are converging that make the data lake more important than ever before. The first is the rise of artificial intelligence (AI) and machine learning (ML). Both require an abundance of data from a variety of different sources in order to produce the most accurate models and insights. This has been true of all forms of analytics, but the level of automation in AI and ML places more importance on robust and varied data.

The data lake’s role in an AI-focused information architecture

The second trend boosting the data lake’s importance is the sheer volume and variety of data being produced, particularly unstructured and semi-structured data. For example, the Internet of Things (IoT) is playing an increasingly prominent role as more devices become connected and relay data back to organizations. This streaming IoT data is also being joined by social media and other written information such as patient notes, alongside audio or visual data like that from call centers or customer complaints.

YouTube has over 500 hours of new content uploaded every minute.[2]

The number of IoT devices could rise to 41.6 billion by 2025.[3]

The net result is that companies need a data lake that is capable of sitting alongside their databases and data warehouses while handling these new forms and amounts of data. Yet it’s important to find a data lake that stands up to the challenges of the enterprise. This ebook introduces key components of building a successful data lake including flexibility, governance and storage technology, with several real-world examples provided.

According to a recent Ventana research study, enterprises with a data lake:

reported that they gained competitive advantage
noted lowered costs
saw better communication and knowledge sharing
enjoyed improved customer experiences
believed it helped them respond better to opportunities and threats
revealed an increase in sales

Learn more about Ventana’s research

The data lake is prized for its flexibility in accepting data “as-is” without ascribing a structure onto it at the time of ingestion. However, the data lake’s ability to change to suit business needs should extend beyond that to include high levels of scalability and multiple deployment options. Federation across other components of the information architecture is also vital to ensure that the data lake can be used as a key component of the data management strategy no matter where it lives.

Providing choice, scalability and integration for data lakes

Scalability

In order to accommodate the heightened volumes of data being created, the data lake must be able to scale extensively, quickly, and at a low cost. Often this is accomplished by using clusters of inexpensive commodity hardware that alleviate over-provisioning and make scaling easier. This will help ensure that there are no complications in gathering the wide array of data required for AI and ML to create insight. Quick scalability also means that the ingestion of real-time data is less likely to be interrupted due to lack of room. As an added bonus, cheap, extensive scalability also opens the opportunity to store “cold” historical data at a lower cost than in a database or data warehouse.

Deployment options

Enterprises must be able to deploy data lakes on premises, across multiple clouds or as part of a hybrid solution. These options are necessary to address the various needs enterprises face. For instance, data that is heavily regulated may need to reside on premises behind the firewall for compliance, while the time and cost benefits of a cloud “pay-as-you-go” model may be more advantageous for other use cases. Hybrid provides the best of both worlds.

85% of companies are operating in a multicloud environment.[4]

45% of businesses worldwide are running at least one of their Big Data workloads in the cloud.[5]

Data federation

Federating data across the entire information architecture is essential for AI and ML; it doesn’t matter how much data you have for models and insights if that data is siloed and difficult to combine. Federating the data that resides in data warehouses, data lakes and even external sources with a SQL-on-Hadoop engine like IBM Db2 Big SQL helps data scientists easily select the data they want and use it quickly in a self-service model. That includes unstructured, semi-structured and structured data, which are all needed for models to be created with a holistic perspective.

As a result, data scientists spend less time preparing the data and more time creating the AI and ML models that will propel the business forward. In addition, federation is much quicker than previous enterprise service bus (ESB) or extract, transform, load (ETL) options that were handled as batch processes. Instead, federation provides low-latency capabilities that are well suited to capturing up-to-date streaming data from IoT and similar sources.
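
To make the idea of data federation more concrete, the sketch below joins records held in two separate stores with a single SQL statement, without first copying one store into the other. It is only an illustration: Python's built-in sqlite3 module and its ATTACH DATABASE statement stand in for a federation engine, and it is not IBM Db2 Big SQL or its syntax. The store names, table names and sample values (customers, device_events and so on) are hypothetical.

```python
import sqlite3


def build_federated_connection():
    """Create two independent stores and expose them through one connection.

    Assumption: two in-memory SQLite databases stand in for a data warehouse
    ("main") and a data lake ("lake"). A real federation engine would connect
    to remote systems instead.
    """
    conn = sqlite3.connect(":memory:")
    # ATTACH makes a second, separate database reachable from the same session.
    conn.execute("ATTACH DATABASE ':memory:' AS lake")

    # Structured reference data, as it might sit in a warehouse (hypothetical).
    conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO main.customers VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex")],
    )

    # Device readings, as they might land in a data lake (hypothetical).
    conn.execute("CREATE TABLE lake.device_events (customer_id INTEGER, reading REAL)")
    conn.executemany(
        "INSERT INTO lake.device_events VALUES (?, ?)",
        [(1, 20.5), (1, 22.5), (2, 18.0)],
    )
    return conn


def average_reading_by_customer(conn):
    """Join across both stores in one query and return (name, average) rows."""
    return conn.execute(
        """
        SELECT c.name, AVG(e.reading)
        FROM main.customers AS c
        JOIN lake.device_events AS e ON e.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()


if __name__ == "__main__":
    connection = build_federated_connection()
    for name, avg_reading in average_reading_by_customer(connection):
        print(name, avg_reading)
```

The point of the sketch is the shape of the query: the analyst names both sources in one statement and the engine resolves where each table lives, which is the self-service pattern described above.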
Poor data quality costs the US economy up to $3.1 trillion yearly.[2]

A successful AI and ML practice requires more than just the flexibility to ingest multiple types of data at high volumes and integrate them across deployments. Without proper governance including data integration and cataloging, the data being used to train models may be inaccurate, leading to biased or untrustworthy results. Moreover, a lack of proper data cleansing and metadata implementation could lead to data scientists spending more time preparing the data instead of using it to drive insight. A data lake must be capable of providing accurate data to those authorized to access it through self-service. Some key capabilities that enable this are listed below.

Automating governance in the data lake

Data integration

Modern data integration solutions require a data delivery platform supporting different delivery styles. It should efficiently integrate data across multicloud environments with in-line data quality and active metadata and policy enforcements. Often this includes an agile architecture based on containers and microservices, and ML-powered capabilities that synchronize data from databases to cloud data warehouses, providing faster access to high-quality and high-volume data for analytics.

Automation capabilities operationalize high-quality data throughout the enterprise and build self-service data pipelines by providing the right people with access to the right data at the right time. One of the best examples is IBM DataStage®, which is AI-powered to fulfill multicloud and hybrid cloud data integration needs.

Read our eBook to learn more about multicloud data integration and IBM DataStage

By 2022, manual data integration tasks will be reduced by 45% thanks to ML and automated service-level management.[7]

Data cataloging

An enterprise data catalog facilitates the inventory of all structured and unstructured enterprise information assets. By using an intelligent metadata catalog, you can define data in business terms, track the lineage of your data and visually explore it to better understand the data in your data lake. The catalog can serve multiple stakeholders in the organization, eliminating inefficiencies associated with “lost in translation” issues.

IBM Watson® Knowledge Catalog in particular takes data cataloging to the next level with automated data discovery and metadata generation, ML-extracted business glossaries and automated scanning and risk assessments of unstructured data.

Learn more about data cataloging benefits and use cases by reading this eBook

Data governance

Prepare now to meet future compliance and audit challenges.

Keep your data compliant and audit-ready by building a clean, governed data lake. A data catalog can automate the classification and profiling of data assets and automatically enforce data protection rules established to anonymize and restrict access to sensitive information. More importantly, if something goes wrong, controls allow the organization to rapidly respond to an issue, whether that means flagging sensitive data, identifying and remediating issues, or collecting information in response to an audit.

The data lineage capabilities, common vocabulary and ability for business users and data scientists to search and explore data visually with products like IBM Watson Knowledge Catalog also provide deeper insight through governance.

Self-service data access

80% of data scientists’ time is spent finding, cleaning and organizing data.[8]

Data scientists’ time is valuable. They should spend it finding patterns and trends, not preparing data. Provide reliable, high-quality data to your data scientists, data stewards and governance and compliance teams and empower them to reach your organization’s analytics goals.

The latest development in modern data science is an AutoAI capability that automates the data preparation and modeling stages of the data science lifecycle. Now, not only can more data scientists use their specialized skills the way they were intended, but more businesses can benefit from data science, from prediction to optimization with solutions such as IBM Watson Studio Premium for IBM Cloud Pak for Data.

Watch this webinar to see how to give your AI and data science a boost

Leaders are 74% more likely to use extensive data cleansing.[9]

Cloud object storage, file storage, and Apache Hadoop are the three primary data storage options for data lakes due to their high scalability, ability to accept data in various formats and easy accessibility by data scientists and analysts.
