Eliminating Data Silos, Connecting All Data for AI
Modern Data Management as the Foundation of Artificial Intelligence (original title: Modernes Datenmanagement als Basis künstlicher Intelligenz)
Overcome Data Siloes through Data Virtualization on IBM Cloud Pak for Data

Claus Huempel, Technical Sales, Hybrid Data Management, IBM Deutschland GmbH
DOAG 2019 Konferenz und Ausstellung, Nuernberg

Slides courtesy of:
Sam Lightstone, Chief Technology Officer for Data, IBM Fellow & Master Inventor
Mukta Singh, Program Director, Data & AI Product Management

© 2019 IBM Corporation

Please note

IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s sole discretion.

Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.

The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

What if you could tap into all of your critical data assets no matter where they physically are?
What if you could query 2 or 2,000 data systems with a single query?
The AI Ladder
• INFUSE: Operationalize AI throughout the business
• ANALYZE: Build & scale AI with trust and transparency
• ORGANIZE: Create a business-ready analytics foundation
• COLLECT: Make data simple and accessible
• MODERNIZE: Make your data ready for an AI and hybrid cloud world
One Platform, Any Cloud
Data of every type: your critical data assets no matter where they physically are

Eliminating data silos will be transformative: To your team + your business
• Smart decisions: Speed time to insight to make better decisions
• On-Demand: Reduce IT/Operational Technology cost; increase agility
• Data Driven: Trusted, real-time data
• Connected Experience: Connectivity and access anywhere on any device
• Trusted: Security and privacy
• Innovative Workforce: Workplace transformation

IBM Data Virtualization
• Unified data asset catalog, lineage and provenance
• Unified access control and security policies
• Data Virtualization [+ caching layer]
• Sources: Data Warehouses & Marts, Big Data (Hadoop), Relational Databases & NoSQL, Spreadsheets, Text files
• Locations: Private and public clouds, standalone systems, worldwide.

Alternative: Physically centralize data (serious cost, complexity, latency, security risk, and GDPR blocker)
• No longer viable to move all data (using extract, transform, load: ETL)
• Risk to data security
• Data inconsistency
• Rigid, limits business agility
• High latency
• Complex & costly
(Diagram: an ETL Server loading a Data Lake or Warehouse)

IBM Data Virtualization

Query across multiple data sources:
• Oracle, Db2, SQL Server, Informix, Netezza, MySQL, PostgreSQL, Big SQL, Apache Hive, HDP Hive, Cloudera Impala and more!

Rich application capabilities
• Connect to Data Virtualization with your favorite SQL apps and tools
• RStudio, Jupyter Notebook, Cognos, Tableau, MicroStrategy

Scale! 1 or 1,000 at once
• More than 10x better for several important use cases

Secure!
• Strict access controls
• Fully encrypted communications.
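The idea of one SQL statement spanning several data systems can be illustrated with a small stand-in. The sketch below is not the IBM Data Virtualization API: it uses Python's built-in sqlite3 module and attaches two independent in-memory databases to one connection, so that a single query joins across both. The database names (crm, erp), tables, and rows are hypothetical and exist only for illustration.

```python
import sqlite3


def build_federated_connection() -> sqlite3.Connection:
    """Attach two separate in-memory databases to one connection.

    'crm' and 'erp' are hypothetical stand-ins for two independent
    source systems; a real virtualization layer would reach remote
    systems through its own connectors instead.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("ATTACH DATABASE ':memory:' AS erp")
    conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE erp.orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    conn.executemany(
        "INSERT INTO erp.orders VALUES (?, ?)",
        [(1, 120.0), (1, 30.0), (2, 75.5)],
    )
    conn.commit()
    return conn


def revenue_per_customer(conn: sqlite3.Connection) -> list:
    """Run one query that joins tables living in two different databases."""
    sql = """
        SELECT c.name, SUM(o.amount) AS total
        FROM crm.customers AS c
        JOIN erp.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for name, total in revenue_per_customer(build_federated_connection()):
        print(name, total)
```

The application sees one connection and one SQL dialect; where each table physically lives is hidden behind the schema prefix. That is the property the slide describes, shown here at toy scale.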
Schema discovery and folding
• Automatically find and match tables across systems so you can query them as a single virtual table.

Deeply integrated with Cloud Pak for Data
• Enterprise Data Catalog, governance and security, e.g. automatic publishing of virtualized data into the data catalog.
• Immediate access via Cognos and Watson Studio

Broad support for common data source types. More to be added in the future.
▪ Db2 family for HDM
▪ Big SQL
▪ IIAS, PDA (Netezza)
▪ Informix
▪ Derby
▪ Oracle
▪ SQL Server
▪ MySQL
▪ PostgreSQL
▪ Apache Hive, HDP Hive
▪ Cloudera Impala
▪ Excel, CSV, Text
▪ Db2 for iSeries
▪ Db2 for z/OS
  ▪ Requires DB2 Connect License
▪ Teradata
  ▪ Requires separately licensed JDBC driver

Coming soon
▪ Data Virtualization Manager for other Z sources
▪ Snowflake
▪ BigQuery
▪ MapR (Hive)
▪ Mongo
▪ Apache Drill

Schema Folding: Simplify your data
• Common or similar schemas appear in multiple databases.
• E.g. branch database for a bank or retailer.
(Diagram: Database #1 and Database #2 combined into a Folded Schema that simplifies access)

Real System test growing a Constellation
• Video of constellation growing to 349 Nodes.
• Network stays compact.
• Between 2 and 10 links per node
• No manual configuration.
• Latency aware connection between nodes
• Which nodes connect to which others? Fastest reply strategy
• Diameter of the constellation (i.e. the number of hops between the two furthest nodes) grows logarithmically. Small diameter is ideal for communications.
Actual system test performed by Emerging Technology Services, IBM Hursley, United Kingdom

Data and Result Caching

Powerful
• Cache results (common SQL statements)
• Cache data (data or aggregates, etc.)
• Define refresh rate
• Monitor usage/effectiveness
(Diagram: an Application sends SQL to the IBM Data Virtualization Cluster, which sits in front of sources such as Oracle and Db2)

Under the hood
• Advanced query compiler determines whether to use cached data and results for part or all of a query result.
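The caching bullets above (cache results, define a refresh rate, monitor effectiveness) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the product's implementation: the class name ResultCache, its parameters, and the exact-text matching rule are all hypothetical. In particular, the real query compiler is described as able to use cached data for part of a query, while this sketch only reuses a result when the whole SQL text matches.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


class ResultCache:
    """Hypothetical whole-query result cache with a fixed refresh interval."""

    def __init__(
        self,
        run_query: Callable[[str], List[Any]],
        refresh_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._run_query = run_query          # function that hits the real source
        self._refresh_seconds = refresh_seconds
        self._clock = clock                  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, List[Any]]] = {}
        self.hits = 0                        # counters for usage monitoring
        self.misses = 0

    def query(self, sql: str) -> List[Any]:
        """Return cached rows if still fresh, otherwise re-run and store."""
        key = sql.strip()
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._refresh_seconds:
            self.hits += 1
            return entry[1]
        self.misses += 1
        rows = self._run_query(sql)
        self._entries[key] = (now, rows)
        return rows


if __name__ == "__main__":
    def slow_backend(sql: str) -> List[Any]:
        # Stand-in for a remote source; returns a fixed row.
        return [("row",)]

    cache = ResultCache(slow_backend, refresh_seconds=60)
    cache.query("SELECT 1")
    cache.query("SELECT 1")
    print("hits:", cache.hits, "misses:", cache.misses)
```

Keeping hit and miss counters next to the entries is the simplest way to support the "monitor usage/effectiveness" point: an entry that is refreshed often but rarely hit is a candidate for removal.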
Data & Results Cache (diagram sources include Hive)
• Cache creation: 3-step process for Admins to create & define a cache entry with a periodic refresh schedule
• Cache Storage: Shows the total storage capacity allocated for caching (100GB in this example) and the storage consumed by active and inactive cache entries
• Query Responsiveness: Shows a histogram of total queries that have been executed in the past X days against their response (execution) times. Shows the distribution of queries that used/did not use cache.
• Query History: Dashboard to see history of queries over a period of time along with details such as query text, execution time (with cache if applicable), query owner etc.

Governed Data Fabric for Enterprise Use
Data Virtualization with Knowledge Catalog
Getting your Data ready for Multicloud and AI Journey

Gartner predicts that by 2020, about 35 percent of enterprises will make Data Virtualization a part of their data integration strategy

A modern Information Architecture
One data fabric, on a cloud native architecture

Architecture layers (diagram):
• Data Consumers: IBM Cognos, Tableau, MicroStrategy, BI Tools, Plotly, R, Jupyter
• Governed Data Virtualization: Consumer Layer (Interface Provisioning), Virtualization Layer (Metadata), Connection Layer (Adaptors), with the Enterprise Data Catalog
• Data Sources: Existing Data Warehouses, Hadoop Data Sources, Data Marts
• Ability to connect to data sources globally
• For the Data Consumer layer, data appears as if it were inside a single database

Impact of modern Information Architecture
− Multiple, global data sources appear as one database
− Centralized control & governance
− Reduce data copies and movement
− Sharp reduction in ETL and duplicative storage
− Enable self-service for data

Cloud Pak for Data: Self-serve ready
Foundational “out of the box” multicloud data & AI services
• Open, Extensible Platform
• Personalized, Collaborative Platform
• Users: App Developers; Business Partners; Data Stewards; Data Scientists; Business Users & Analytics; Ops; Data Engineers
• The Ladder to AI
• APIs
• Integrated User Experience
• Extensible: “add-ons”, accelerators and solutions
• Modular: provision services & scale out when needed

Collect & Connect
o Data virtualization
o Provision SQL & NoSQL Databases
o Warehouses & Marts
o Event Ingestion & Streaming Analytics

Organize & Integrate
o Discovery & search
o Data transformation
o Data catalogs, quality & curation
o Business glossary
o Policies, rules & privacy

Analyze & Infuse
o Data Science & Visualization
o Dashboards & reporting
o AutoAI, ML deployments & operations
o AI Trust and Transparency: Explainability & Bias detection
o Distributed compute: Apache Spark
o AI services: Chat, NLU

MODERNIZE your data estate for AI in a multi-cloud world

Core Services
• Logging
• Metering
• Auditing
• Identity Access Mgmt.
• Monitoring
• Storage Volumes
• Security
• Docker Registry, Helm
Deployment targets shown: IBM Cloud, Hyperconverged System

Data Virtualization patterns at clients
• Access to organization data remains a bottleneck (Multiple Clients)
• Realize self-service of data for business users (Large European Bank)
• Improve the data management capability with a Virtual Data Fabric that allows for data governance, data security, data quality, and data lineage of metadata (Large European Bank)
• Create an enterprise-grade single Data Fabric, cutting data silos, for the business (Worldwide Mass Media Conglomerate)
• Rationalize, standardize and simplify the data and technology landscape to move away from the current fragmented architecture toward an Enterprise Architecture (South African Bank)
• Enable enterprise Advanced Analytics on media assets in the organization’s repository based on some parameters (Large American entertainment company)

Governed Data Virtualization in Cloud Pak for Data
The ability to view, access, manipulate and analyze data without the need to know or understand