Ein Unternehmen der Daimler AG

DATA CATALOG INSTEAD OF METADATA MANAGEMENT

Andreas Buckenhofer, Data Analytics, 2019 ABOUT ME

Andreas Buckenhofer Contact/Connect Senior DB Professional

Since 2009 at Daimler TSS Department: Machine Learning Solutions

Business Unit: Analytics vcard

• Over 20 years experience with • Oracle ACE Associate database technologies • DOAG responsible for InMemory DB • Over 20 years experience with Data • Lecturer at DHBW Warehousing • Certified Data Vault Practitioner 2.0 • International project experience • Certified Oracle Professional • Certified IBM Big Data Architect CUSTOMER JOURNEY 360° DIGITAL PRODUCTS Create an integrated customer Increase sales and understanding through value creation through DAIMLER trustable data collec- attractive digital ELECTRIC MOBILITY PLATFORM ECONOMY tion & analysis. offers. Accelerate intelligent drive Interact directly with Daimler systems with modern IT. customers in the digital ecosystem. FOCUS MODERN IT UNDERSTANDING AUTONOMOUS DRIVING Ensure the fundamentals for the digital Include AI as decisive game changer. transformation with advanced IT. INTEGRATION LEARNING OF 3RD PARTIES THROUGH Encourage profitable GLOBAL PRESENCE cooperation through Be effective among the the technical exchange. digital, international competition. CUSTOMER

PRODUCT LINES. OUR VALUE FOR DAIMLER TRANSFORMED MOBILITY DIGITAL VEHICLE VEHICLE SALES & CARE DIGITAL PRODUCTION CYBER SECURITY STRATEGIC INITATIVES

Daimler TSS Company Presentation | Feb 2019 3 FLAGSHIP PROJECTS

• car2go

• InCar-Delivery

• EQ-Ready

• Mercedes me

• PrivateCarSharing

Daimler TSS  1. METADATA 2. TOOL SELECTION 3. CATALOGING agenda 4. DATA CATALOGS

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 5 MAKING IT EASIER TO DISCOVER DATASETS AVAILABLE SINCE 05-SEP-2018

Source: announcement https://www.blog.google/products/search/making-it-easier-discover-datasets/

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 6 MAKING IT EASIER TO DISCOVER DATASETS AVAILABLE SINCE 05-SEP-2018

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 7 FIND THE RIGHT DATA – MASTERING THE DATA EXPLOSION

• With data science and analytics on the rise and under way to being democratized, the importance of being able to find the right data to investigate hypotheses and derive insights is paramount Source: https://www.zdnet.com/article/google-can-now-search-for-datasets-first-research-then-the-world • Google Dataset search helps to find external data • Schema.org defines open metadata format; dataset itself may not be open/free • Search engines can interpret the format • Ranking of data • Help users discover where the data is and user can access it directly from the source What about internal data? TYPES OF METADATA

Technical Metadata Business Metadata Tagging / Linkage • Schema • Business terms • Table • Domain model • Column • Glossary, etc • etc Profiling Legal Metadata • Density • Sensitive data • Cardinality • GDPR • Min, Max, Median • Security classification • Popularity / Ranking Critical parts Daimler TSS Data Catalog instead of Metadata Management | 03-2019 9 TECHNICAL METADATA MANAGEMENT VERY OFTEN NOT SUCCESSFUL Who enriches / tags technical metadata Metadata Repository with legal and business relevant information???

OLTP-1 Microservice-1 Data Lake Microservice-1 Microservice-1 Microservice-1 OLTP-2 DWH DOES METADATA MANAGEMENT PROVIDE ANSWERS TO SUCH QUESTIONS ACROSS THE WHOLE WORKFLOW?

How to get access What table contains How is this Is the data to the data? production dates? column reliable? calculated? Is FIN unique? Find Understand Trust Access Write Search for data Work with data What are What is the difference important? between production_date and prod_dt? How to join the Who knows about tables? the data?

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 11 DATA CATALOG A HOT TOPIC

• New Data Catalog vendors are entering the market • Established vendors rebrand and enrich their existing tools Daimler TSS Data Catalog instead of Metadata Management | 03-2019 12 1. METADATA  2. TOOL SELECTION 3. CATALOGING agenda 4. DATA CATALOGS

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 13 GARTNER MAGIC QUADRANT FOR METADATA MANAGEMENT SOLUTIONS 2018

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 14 FORRESTER WAVE MACHINE LEARNING DATA CATALOGS Q2/2018

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 15 EVALUATION CRITERIA

Technical Metadata Data access Business Metadata incl. Glossary Lineage Tagging (Linkage) API Collective Intelligence(Collaboration) Versioning Search Architecture Security Components Source connectors Prerequisites Data profiling Licencing

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 16 ALATION ARCHITECTURE

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 17 INFORMATICA ARCHITECTURE EDC + AXON (+ DATA QUALITY)

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 18 POC – ALATION VS INFORMATICA Alation Informatica „Self-service data marketplace“ „Comprehensive directory of assets“ Main users: data scientists Main users: corporate governance •  (Hadoop) Connectors •  Mix of tools •  Excellent capabilities regarding •  Complex architecture with Hbase collaboration for metadata storage •  Query Log •  Machine Learning capabilities •  /  Maturity •  MITI (Meta Integration) support • Logical SAP BW matadata integration  in Theory: ; in Practice:  •  /  Maturity

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 19 EVALUATION CRITERIA

Technical Metadata Data access Many tools in the market Business Metadata• Newincl. „data Glossary catalog“ vendors Lineage Tagging• Old(Linkage „metadata) mangement“ vendors API Collective Intelligence• „Others(Collaboration“ marketing) their tools, e,g. VirtualizationVersioning tools NoSearch vendor has currently a satisfactoryArchitecture coverage. Tools lack •SecurityFunctionality Components • Maturity (+ often enterprise-readiness) Source connectors Prerequisites Data profilingFind your own best fit Licencing

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 20 WHAT TYPE OF DATA CATALOG IS YOUR ORGANIZATION CONSIDERING INVESTMENT IN?

Source: Matt Aslett: From out of nowhere: the unstoppable rise of the data catalog, 2018, 451 Research Daimler TSS Data Catalog instead of Metadata Management | 03-2019 21 1. METADATA 2. TOOL SELECTION  3. CATALOGING agenda 4. DATA CATALOGS

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 22 CATALOGING SOURCE SYSTEMS MANY FORMATS = MANY CONNECTORS

• RDBMS (Oracle, Db2, SQL Server, Teradata, …) • Hadoop (HDFS, Hive, …; on-premises, Cloud) • NoSQL DBs • (Excel, csv, …) • Powerdesigner, Erwin, and other data modeling tools METADATA DIFFERENCES: BOTH TABLES CONTAIN SAME INFORMATION, BUT DIFFERENT METADATA

Vehicle_id Engine_id Cab_id Axle_id

WDB123 1234 ABCD XY12

Vehicle_id Type Id

WDB123 Engine 1234

WDB123 Cab ABCD

WDB123 Axle XY12

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 24 SOME CHALLENGES: METADATA IMPORT OF JSON, CSV, ETC

Where is the data and where is the metadata in this logfile?

Data Lake: decentralized control of the data DATA LAKE / HADOOP

• Easy approach: Access Hive Metastore and import metadata • Prerequisite: all data/files in HDFS require Hive access • But unrealistic prerequisite • Many logs (esp. Sensor data) are just dumped into the file system: • HDFS Default-Filenames like 00000.txt, 00001.txt  added value??? • Interpreting ALL files by catalog SW unrealistic, too. • Huge computing power • Huge number of variations (Cloud, on-premises, SW versions) lacks support of vendors for catalog SW • Sources should deliver metadata CATALOGING @GOOGLE

Heavy usage of Automation and Machine Learning

Source: https://ai.google/research/pubs/pub45390 Daimler TSS Data Catalog instead of Metadata Management | 03-2019 27 CATALOGING AT NETFLIX, TWITTER, LINKEDIN, ETC.

Company Link

https://medium.com/netflix-techblog/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520 Netflix (Metacat) https://github.com/Netflix/metacat Twitter https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html

https://github.com/linkedin/WhereHows LinkedIn (WhereHows) https://github.com/linkedin/WhereHows/wiki

https://ai.google/research/pubs/pub45390 Google (Goods) https://www.buckenhofer.com/2016/10/goods-how-to-post-hoc-organize-the-data-lake/ Uber https://eng.uber.com/databook/ ebay https://www.ebayinc.com/stories/blogs/tech/bigdata-governance-hive-metastore-listener-for-apache-atlas-use-cases/ 1. METADATA 2. TOOL SELECTION 3. CATALOGING agenda  4. DATA CATALOGS

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 29 CATALOGS ARE EVERYWHERE … GOOGLE, AMAZON

INVENTORY USER EXPERIENCE

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 30 INVENTORY VS USER EXPERIENCE

Suppliers provide inventory •A catalog should list everything that is actually available Consumers require user experience •A catalog should provide data usage statistics, ratings, data samples, statistical profiles, lineage, lists of users and stewards, and tips on how the data should be interpreted AUTOMATION, CROWD KNOWLEDGE, AND EXPERTS

•  limitation of permissions to a trusted group • A trusted group documents few datasets very well • But most of the metadata is not documented • Failure of many past approaches •  Automation, crowd knowledge and experts required • Automation to get a broad coverage and use existing information like query logs • Crowd to increase broad coverage • Experts to confirm or reject „guesses“ -> Combination of coverage and accuracy TYPES OF METADATA

Technical Metadata Business Metadata Tagging / Linkage • Schema • List of business • Table terms • Column • Glossary, etc • etc Profiling Legal Metadata • Density • Sensitive data Automation, crowd knowledge and experts • • GDPR Cardinality required • Min, Max, Median • Security classification • Popularity / Ranking Critical parts Daimler TSS Data Catalog instead of Metadata Management | 03-2019 33 WORKING WITH ALATION

• Onboarding of data sources • IT just creates cover and grants access • Self-service: Done by application owners including source system connection • No need for central password management • Scales for onboarding of many systems

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 34 DATA SOURCES

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 35 CONNECTIONS TO DATA SOURCES

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 36 EXTRACTION AUTOMATION

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 37 MANUAL CSV IMPORT

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 38 DATA STEWARD VS DATA ANALYST

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 39 DATA STEWARD

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 40 DATA ANALYSTS - CATALOG SEARCH

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 41 SCHEMA AND ITS TABLES

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 42 CENTRAL VS LOCAL DATA CATALOGS

Source 1 Central data catalog

Source 2 catalog • Integrated views • Mammoth task Source 3 Data • No redundancy

Source 1

Local data catalogs (reality) Data Source 2 catalog • Legal requirements • Feasibility Source 3

• Tool support very weak  Data catalog

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 43 CHALLENGES

•3rd party tool and enterprise requirements •SAML configuration / OpenID connect Security •Firewall configuration for zones + port ranges

•GDPR •Associate legal tags to domain model Governance •Huge manual effort but huge benefit

•Data sharing & collaboration Data culture •System onboarding

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 44 BIMA-STUDIE 2018 (BARC + SOPRA STERIA CONSULTING) DIGITIZATION HOT SPOTS

Data quality and meta data management

Domain knowledge

Data culture

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 45 IS THE DATA CATALOG A “METADATA MANAGEMENT RELOADED”?

Name it as you like, but there are some critical developments • Automation, Collective intelligence and expert knowledge • Enable crowd sourcing and get help from other users • Help to understand quality of data and usage of datasets • Rating of information • Web application for search / collaboration and API to access metadata • Enabling self-service, multi-location management (especially data lakes) • Governance and legal framework for e.g. GDPR scenarios • Capture metadata for security and end-user data consumption • Identify the owner of the dataset and get access to source data

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 46 DATA CATALOG – AMAZON FOR INFORMATION

Technical Metadata Inventory Business Automation Metadata

User Machine Data Collective experience Learning Intelligence & Catalog enrichment

Expert Governance Sourcing

Data Access

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 47 Daimler TSS GmbH Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99 [email protected] / Internet: www.daimler-tss.com/ Intranet-Portal-Code: @TSS Sitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Chairperson), Steffen Bäuerle OVER 75% • Of time is spent for • Say they least enjoy DATA PREPARATION

Data Consumers

Daimler TSS Data Catalog instead of Metadata Management | 03-2019 49 CATALOGING @UBER

Source: https://eng.uber.com/databook/ Daimler TSS Data Catalog instead of Metadata Management | 03-2019 50 CATALOGING @TWITTER

Source: https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html Daimler TSS Data Catalog instead of Metadata Management | 03-2019 51 CATALOGING @LINKEDIN (OPEN SOURCE)

Source: https://github.com/LinkedIn/Wherehows/wiki Daimler TSS Data Catalog instead of Metadata Management | 03-2019 52