DATA LINEAGE: How do you map your data across your information system?
The data catalog for data-driven companies
www.zeenea.com - @DataXDay
Matthieu Blanc (@matthieublanc)
Cofounder
Data Architect
Machine Learning and AI
Hadoop Instructor
Spark Instructor
Kafka Instructor
DATA LINEAGE?
Definition
“Data lineage is defined as a data life cycle that includes the data’s origins and where it moves over time. It describes what happens to data as it goes through diverse processes.”
[Wikipedia]
Evolution
The size of the digital universe will double every two years
[Figure: timeline from 1960 to 2020 of operational and analytics data stores. Labels: flat files, relational DBMS, OLTP, OLAP, NoSQL, NewSQL, Hadoop, data lakes, cloud, Big Data, ???]
Big Data...
… or Big Mess?
Which data sets and table columns are driving key performance indicators?
How is privacy-sensitive data being used?
Do we have sensitive data that’s being propagated unsafely?
Where is the data coming from?
How are my data actually calculated and transformed?
What systems and reports would be impacted by a change in that particular process?
Trust in data
Regulatory compliance
Extract value from data
Offensive data governance
Time Money Opportunities
Solutions? Enterprise architecture
Solutions? ETL tools
[Pentaho visual Dataflows]
Big Data
The new ecosystem and paradigm fill a gap, but they also pose a challenge.
You need a Data Catalog
You need a good understanding of the data stores you have: databases, data warehouses, data lakes.
You need to understand the metadata:
● Business metadata
● Technical metadata
● Operational metadata
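To make the three kinds of metadata concrete, here is a minimal Python sketch of what a catalog entry could look like. Every class and field name (`CatalogEntry`, `owner`, `produced_by`, and so on) is a hypothetical illustration, not the schema of any particular data catalog product, and the split of fields between the three categories is only one reasonable reading.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BusinessMetadata:
    """What the data means to the business (hypothetical fields)."""
    description: str
    owner: str
    tags: List[str] = field(default_factory=list)


@dataclass
class TechnicalMetadata:
    """Where and how the data is stored (hypothetical fields)."""
    storage: str                      # e.g. "data lake", "data warehouse"
    location: str                     # path or table identifier
    columns: Dict[str, str] = field(default_factory=dict)  # name -> type


@dataclass
class OperationalMetadata:
    """How the data is produced and refreshed (hypothetical fields)."""
    produced_by: Optional[str] = None
    last_refresh: Optional[str] = None
    row_count: Optional[int] = None


@dataclass
class CatalogEntry:
    name: str
    business: BusinessMetadata
    technical: TechnicalMetadata
    operational: OperationalMetadata = field(default_factory=OperationalMetadata)


def find_entries(entries, keyword):
    """Return the entries whose name, description or tags mention the keyword."""
    keyword = keyword.lower()
    return [
        entry for entry in entries
        if keyword in entry.name.lower()
        or keyword in entry.business.description.lower()
        or any(keyword in tag.lower() for tag in entry.business.tags)
    ]


# Example entry with made-up values.
orders = CatalogEntry(
    name="clean_orders",
    business=BusinessMetadata("Deduplicated customer orders", "sales team", ["sales"]),
    technical=TechnicalMetadata("data lake", "/lake/clean/orders",
                                {"order_id": "bigint", "amount": "double"}),
    operational=OperationalMetadata(produced_by="nightly dedup job"),
)
matches = find_entries([orders], "Sales")
```

Keeping the three groups as separate types makes it easier to see which part of an entry is filled by people (business) and which parts could be harvested automatically (technical, operational).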
Big Picture
[Figure: data stewards and data users discover, document, share and trust data through the data catalog, which performs cataloging, monitoring and analyses over cloud storage, data lakes, RDBMS and NoSQL stores]
Data lineage storage
Graph DBs are perfect for data lineage:
● Easy to model the flow of data in a graph
● Query relationships with ease and in real time
● Adapt your schema to accommodate new data and relationships
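The points above can be illustrated without a graph database. The sketch below is an in-memory stand-in, assuming datasets are nodes and each data flow is a directed edge; the class `LineageGraph`, its methods and the dataset names are all hypothetical. A real graph DB would express the same traversals in its own query language.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage store: nodes are datasets, edges are data flows."""

    def __init__(self):
        self._downstream = defaultdict(set)   # source -> targets
        self._upstream = defaultdict(set)     # target -> sources
        self._transforms = {}                 # (source, target) -> description

    def add_flow(self, source, target, transform):
        """Record that `target` is computed from `source` by `transform`."""
        self._downstream[source].add(target)
        self._upstream[target].add(source)
        self._transforms[(source, target)] = transform

    @staticmethod
    def _walk(start, neighbours):
        """Breadth-first traversal returning every node reachable from start."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def origins(self, dataset):
        """Where is the data coming from?"""
        return self._walk(dataset, self._upstream)

    def impacted_by(self, dataset):
        """What would be impacted by a change to this dataset?"""
        return self._walk(dataset, self._downstream)


graph = LineageGraph()
graph.add_flow("raw_orders", "clean_orders", "deduplication")
graph.add_flow("clean_orders", "sales_report", "daily aggregation")
graph.add_flow("customers", "sales_report", "join on customer id")

report_origins = graph.origins("sales_report")
raw_impact = graph.impacted_by("raw_orders")
```

Adding a new kind of relationship only means adding edges, which is the schema flexibility the slide refers to.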
Dataset-level lineage
[Figure: input datasets flow into a black box labeled "Stuff happens here...", which outputs result = 42]
Naive visualization: directed graph
● Missing context of transform
● Inconsistent node placement: poor visual navigation
● Clutter when too many edges
A better approach
Timeline + genealogy visualization: based on work by the Stanford Vis Group (now UW Interactive Data Lab)
Tracing Genealogical Data with TimeNets, Nam Wook Kim, Stuart K. Card and Jeffrey Heer
[Figure: XKCD Plot Lines [Munroe 2009]]
Principles
● Timeline: transformation steps
● Rows: lifelines of data fields
● Solid lines: value dependencies
● Dotted lines: conditional dependencies
Manual documentation
A necessary evil
Automation, Visualization and Exploration
Is there a way to know exactly how a new piece of data was produced?
Impractical: coders happen to be somewhat expensive and hard to work with
SQL Parsing (Hive, Presto, Dremel, BigQuery…)
SELECT a.deptno "Department",
       a.num_emp/b.total_count "Employees",
       a.sal_sum/b.total_sal "Salary"
FROM (SELECT deptno, COUNT(*) num_emp, SUM(SAL) sal_sum
      FROM scott.emp
      GROUP BY deptno) a,
     (SELECT COUNT(*) total_count, SUM(sal) total_sal
      FROM scott.emp) b
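As a rough illustration of what SQL parsing can yield, the Python sketch below pulls the source tables out of the query from the slide. It is deliberately naive: it uses a regular expression rather than a real SQL parser, so it misses comma-separated table lists, CTEs, quoted identifiers and comments. The function name `source_tables` and the target name `report` are hypothetical.

```python
import re

# Matches an identifier (optionally schema-qualified) right after FROM or JOIN.
# A subquery such as "FROM (SELECT ..." is skipped because "(" is not an identifier.
_TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

QUERY = """
SELECT a.deptno "Department",
       a.num_emp/b.total_count "Employees",
       a.sal_sum/b.total_sal "Salary"
FROM (SELECT deptno, COUNT(*) num_emp, SUM(SAL) sal_sum
      FROM scott.emp
      GROUP BY deptno) a,
     (SELECT COUNT(*) total_count, SUM(sal) total_sal
      FROM scott.emp) b
"""


def source_tables(sql):
    """Return the set of table names read by the statement (naive heuristic)."""
    return {match.group(1).lower() for match in _TABLE_REF.finditer(sql)}


def dataset_lineage(sql, target):
    """Build dataset-level lineage edges (source, target) for one statement."""
    return {(table, target) for table in source_tables(sql)}


tables = source_tables(QUERY)
edges = dataset_lineage(QUERY, "report")
```

Both subqueries read the same table, so the whole statement collapses to a single dataset-level edge. Tracing each output column back through the subquery aliases would require a proper parser.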
Execution plan
Spark Job and Spline
import za.co.absa.spline.core.SparkLineageInitializer._
spark.enableLineageTracking()
[Figure: Spark job (Spark session, transformations, actions) -> Spark library generates the execution plan -> Spline -> Spline UI]
Summary
[Figure: code analysis of Spark jobs and SQL queries produces {lineage} JSON, which is stored in a graph DB and exposed in the data catalog as a lineage graph of datasets (dataset1, dataset2, dataset3, dataset4, ...)]
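The pipeline in the summary can be sketched as follows: each analyzer emits a small JSON lineage document, and the documents are merged into one set of edges ready to be loaded into a graph DB. The JSON layout (`extractor`, `inputs`, `output`) is an assumption made for this example, not a format defined by Spline or by any catalog, and the edges between the datasets are invented.

```python
import json
from collections import defaultdict


def merge_lineage(documents):
    """Merge JSON lineage documents into {(source, target): {extractors}}.

    Each document is assumed to look like
    {"extractor": "...", "inputs": ["..."], "output": "..."}.
    """
    edges = defaultdict(set)
    for raw in documents:
        record = json.loads(raw)
        for source in record["inputs"]:
            edges[(source, record["output"])].add(record["extractor"])
    return dict(edges)


# Hypothetical documents, one from a Spark job and one from a SQL query.
spark_doc = json.dumps(
    {"extractor": "spark", "inputs": ["dataset1", "dataset2"], "output": "dataset3"}
)
sql_doc = json.dumps(
    {"extractor": "sql", "inputs": ["dataset3"], "output": "dataset4"}
)

lineage_edges = merge_lineage([spark_doc, sql_doc])
```

Recording which extractor reported each edge keeps the origin of every piece of lineage visible once everything sits in the same graph.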
Dataset-level data lineage
Column-level data lineage
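The two levels are related: dataset-level lineage can be derived from column-level lineage by dropping the column part of each edge. A minimal Python sketch, assuming columns are written as `dataset.column` and using two dependencies read off the SQL query shown earlier (the target dataset name `report` is hypothetical):

```python
def to_dataset_level(column_edges):
    """Collapse (source_column, target_column) edges into dataset edges."""
    dataset_edges = set()
    for source, target in column_edges:
        # Everything before the last dot is taken to be the dataset name.
        source_dataset = source.rsplit(".", 1)[0]
        target_dataset = target.rsplit(".", 1)[0]
        if source_dataset != target_dataset:
            dataset_edges.add((source_dataset, target_dataset))
    return dataset_edges


column_edges = {
    ("scott.emp.deptno", "report.Department"),  # a.deptno "Department"
    ("scott.emp.sal", "report.Salary"),         # a.sal_sum/b.total_sal "Salary"
}

dataset_edges = to_dataset_level(column_edges)
```

The reverse is not possible: column-level detail has to be captured at extraction time, which is why it is the more demanding of the two views.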
The data catalog for data-driven companies
Data Lake Centric: access to a business and technical view of your data from your Big Data systems.
Plug & Play
● Inventory your data
● Document the data
● Collaborate with other users
● Equip your data governance with tools
● Organize your data
● Federate your teams around the data
We democratize access to data
We are the company's metadata search engine.
We are the company's data marketplace.
We are the company's collaborative data network.
Follow us
156 Boulevard Haussmann, 75008 Paris, France
www.zeenea.com
The data catalog for data-driven companies