DATA LINEAGE: How do you map your data across your information system?

The data catalog for data-driven companies

1 www.zeenea.com - @DataXDay and IA

Matthieu Blanc (@matthieublanc)

Data Architect, Cofounder

Hadoop Instructor, Spark Instructor, Kafka Instructor

DATA LINEAGE?

Definition

“Data lineage is defined as a data life cycle that includes the data’s origins and where it moves over time. It describes what happens to data as it goes through diverse processes.”

[Wikipedia]

Evolution

The size of the digital universe will double every two years

[Figure: timeline, 1960 to 2020, of data storage technologies. Labels: flat files, relational DBMS, OLTP, OLAP, operational, NoSQL, NewSQL, Hadoop, data lakes, cloud, ???]

Big Data...

… or Big Mess?

Which data sets and table columns are driving key performance indicators?

How is privacy-sensitive data being used?

Do we have sensitive data that’s being propagated unsafely?

Where is the data coming from?

How is my data actually calculated and transformed?

What systems and reports would be impacted by a change in that particular process?

Trust in data

Regulatory compliance

Extract value from data

Offensive data governance

Time Money Opportunities

Solutions? Enterprise architecture

Solutions? ETL tools

[Pentaho visual Dataflows]

Big Data

The new ecosystem and paradigm pose a challenge and leave a gap to fill.

You need a Data Catalog

You need a good understanding of the data stores you have: databases, data warehouses, data lakes.

You need to understand the metadata:

● Business metadata
● Technical metadata
● Operational metadata

Big Picture

[Figure: data stewards and data users discover, document, share and trust data through the data catalog, which provides cataloging, monitoring, analyses and data lineage on top of the underlying storages: cloud storage, data lake, RDBMS, NoSQL]

Graph DBs are a natural fit for data lineage:

● Easy to model the flow of data in a graph
● Query relationships with ease and in real time
● Adapt your schema to accommodate new data and relationships
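To make the graph idea concrete, here is a minimal in-memory sketch, not a graph database and not Zeenea's implementation. The class name, method names and dataset names are all hypothetical; the point is only that lineage questions ("what is impacted by this dataset?", "where does this report come from?") become graph traversals.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph: an edge src -> dst means dst is derived from src."""

    def __init__(self):
        self.downstream = defaultdict(set)  # src -> datasets derived from it
        self.upstream = defaultdict(set)    # dst -> datasets it is derived from

    def add_flow(self, src, dst):
        self.downstream[src].add(dst)
        self.upstream[dst].add(src)

    def _reach(self, start, edges):
        # Breadth-first traversal; `seen` also protects against cycles.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def impacted_by(self, dataset):
        """Everything that would be affected by a change to `dataset`."""
        return self._reach(dataset, self.downstream)

    def origins_of(self, dataset):
        """Everything `dataset` was directly or indirectly computed from."""
        return self._reach(dataset, self.upstream)


# Hypothetical datasets, for illustration only.
graph = LineageGraph()
graph.add_flow("raw_orders", "clean_orders")
graph.add_flow("clean_orders", "daily_revenue")
graph.add_flow("customers", "daily_revenue")
graph.add_flow("daily_revenue", "kpi_dashboard")

print(sorted(graph.impacted_by("clean_orders")))
print(sorted(graph.origins_of("kpi_dashboard")))
```

A real graph database would answer the same two questions with a path query instead of hand-written traversal code, and would let new node and edge types be added without a schema migration.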

Dataset-level lineage

[Figure: a data flow between datasets in which "stuff happens here..." and the output is result = 42]

Naive visualization: Directed Graph

● Missing context of transform
● Inconsistent node placement: poor visual navigation
● Clutter when too many edges

A better approach

Timeline + genealogy visualization, based on work from the Stanford Vis Group (now the UW Interactive Data Lab)

Tracing Genealogical Data with TimeNets. Nam Wook Kim, Stuart K. Card, Jeffrey Heer.

[Figure: XKCD plot lines, Munroe 2009]

Principles

● Timeline: transformation steps
● Rows: lifelines of data fields
● Solid lines: value dependencies
● Dotted lines: conditional dependencies

Manual documentation

A necessary evil

Automation, Visualization and Exploration

Is there a way to know exactly how new data was produced?

Impractical: coders happen to be somewhat expensive and hard to work with

SQL Parsing (Hive, Presto, Dremel, BigQuery…)

SELECT a.deptno "Department",
       a.num_emp / b.total_count "Employees",
       a.sal_sum / b.total_sal "Salary"
FROM (SELECT deptno, COUNT(*) num_emp, SUM(sal) sal_sum
      FROM scott.emp
      GROUP BY deptno) a,
     (SELECT COUNT(*) total_count, SUM(sal) total_sal
      FROM scott.emp) b
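As a rough illustration of what SQL parsing can yield, the sketch below extracts dataset-level lineage from the query above with a regular expression. This is a deliberately naive stand-in for a real SQL parser: it only picks up a table name that directly follows FROM or JOIN, so it misses comma-separated table lists, CTEs and quoted identifiers. The function names and the target name `dept_ratios` are hypothetical.

```python
import re

SQL = """
SELECT a.deptno "Department",
       a.num_emp / b.total_count "Employees",
       a.sal_sum / b.total_sal "Salary"
FROM (SELECT deptno, COUNT(*) num_emp, SUM(sal) sal_sum
      FROM scott.emp
      GROUP BY deptno) a,
     (SELECT COUNT(*) total_count, SUM(sal) total_sal
      FROM scott.emp) b
"""

# A table name directly after FROM or JOIN. "FROM (" introduces a subquery
# and is skipped, because "(" cannot start an identifier.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def source_tables(sql):
    """Return the distinct physical tables read by the query, sorted."""
    return sorted(set(TABLE_REF.findall(sql)))


def dataset_lineage(sql, target):
    """Return (source, target) edges for a query whose result is stored in `target`."""
    return [(table, target) for table in source_tables(sql)]


# "dept_ratios" is a made-up name for wherever the query result is written.
print(dataset_lineage(SQL, "dept_ratios"))
```

Column-level lineage (for example, that "Salary" depends on the sal column of scott.emp) needs a full parse tree with alias resolution, which is where a proper SQL parser for the target dialect becomes necessary.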


Execution plan

Spark Job and Spline

import za.co.absa.spline.core.SparkLineageInitializer._
spark.enableLineageTracking()

[Figure: the Spline Spark library hooks into the Spark session; the job's transformations and actions generate an execution plan, which is sent to Spline and displayed in the Spline UI]


Summary

[Figure: code analysis of Spark jobs and SQL queries produces {lineage} JSON, which is stored in a graph DB as links between datasets (dataset1, dataset2, dataset3, dataset4, ...) and exposed in the data catalog]
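The pipeline in this summary can be sketched as follows. The JSON layout (`job`, `engine`, `inputs`, `outputs`) is an assumption made for illustration, not the format used by Spline or by the catalog, and the links between dataset1 to dataset4 are invented since the slide does not spell them out. The sketch only shows the step from lineage records to graph edges.

```python
import json

# Hypothetical lineage records, as a code-analysis step might emit them.
LINEAGE_JSON = """
[
  {"job": "spark_job_1", "engine": "spark",
   "inputs": ["dataset1", "dataset2"], "outputs": ["dataset4"]},
  {"job": "sql_query_1", "engine": "sql",
   "inputs": ["dataset3"], "outputs": ["dataset4"]}
]
"""


def to_edges(lineage_json):
    """Turn lineage records into (source, target, job) edges for a graph store."""
    edges = []
    for record in json.loads(lineage_json):
        for src in record["inputs"]:
            for dst in record["outputs"]:
                edges.append((src, dst, record["job"]))
    return edges


edges = to_edges(LINEAGE_JSON)
for src, dst, job in edges:
    print(f"{src} -[{job}]-> {dst}")
```

Keeping the job name on each edge preserves the context of the transformation, which is exactly what the naive directed-graph view was missing.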

Dataset-level data lineage

Column-level data lineage

The data catalog for data-driven companies

Data Lake Centric

Access to a business and technical view of your data from your Big Data systems.

Plug & Play

● Inventory your data
● Document the data
● Collaborate with other users
● Equip your data governance with tools
● Organize your data
● Federate your teams around the data


We democratize access to data

We are the metadata search engine of the company.

We are the data marketplace of the company.

We are the collaborative data network of the company.

Follow us

156 Boulevard Haussmann Paris 75008 France

[email protected] | www.zeenea.com

The data catalog for data-driven companies