Tracer: a Machine Learning Approach to Data Lineage Felipe Alex Hofmann

Tracer: A Machine Learning Approach to Data Lineage by Felipe Alex Hofmann Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY May 2020 ○c Massachusetts Institute of Technology 2020. All rights reserved. Author................................................................ Department of Electrical Engineering and Computer Science May 20, 2020 Certified by. Kalyan Veeramachaneni Principal Research Scientist Thesis Supervisor Accepted by . Katrina LaCurts Chair, Master of Engineering Thesis Committee 2 Tracer: A Machine Learning Approach to Data Lineage by Felipe Alex Hofmann Submitted to the Department of Electrical Engineering and Computer Science on May 20, 2020, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science Abstract The data lineage problem entails inferring the source of a data item. Unfortunately, most of the existing work in this area relies either on metadata, code analysis or data annotations. In contrast, our primary focus is to present a machine learning solution that uses the data itself to infer the lineage. This thesis will formally define the data lineage problem, specify the underlying assumptions under which we solved it, as well as provide a detailed description of how our system works. Thesis Supervisor: Kalyan Veeramachaneni Title: Principal Research Scientist 3 4 Acknowledgments I would like to thank my supervisor, Kalyan, for his insights, ideas, and instruction during this project; his support and guidance were invaluable for this thesis. I would also like to thank Kevin and Carles for their help with the design and development of the system. Finally, I would like to thank my family — Ricardo, Leila, Gustavo, and Nathália — for their unwavering support over the years. 5 6 Contents 1 Introduction 17 1.1 Data Lineage Tracing . 17 1.1.1 Definition . 17 1.1.2 Why does it matter? . 18 1.2 Tracer . 19 1.2.1 Overview . 19 1.2.2 Why is Tracer needed? . 20 1.2.3 Scope . 20 1.3 Thesis Road Map . 21 2 Related Work 23 2.1 Definition . 23 2.2 Classification . 23 2.3 Setting . 24 3 System Design 25 3.1 Overview . 25 3.2 Tracer . 26 3.3 Terminology . 27 4 Tracer Primitives 29 4.1 Overview . 29 4.2 Primary Key Discovery . 30 7 4.2.1 Design choices . 31 4.3 Foreign Key Discovery . 32 4.3.1 Design choices . 33 4.4 Column Map Discovery . 34 4.4.1 Examples . 34 4.4.2 Inputs and Outputs . 36 4.4.3 Design choices . 37 4.5 Putting It All Together . 38 5 Modeling Methods 41 5.1 Primary Key Discovery . 41 5.2 Foreign Key Discovery . 42 5.3 Column Map Discovery . 43 5.3.1 Transform . 44 5.3.2 Solver . 44 5.3.3 Inverse Transform . 45 6 Datasets 47 6.1 Column Mapping Simulation . 48 6.1.1 Software Design . 49 6.1.2 Atoms . 50 6.1.3 Sieve . 51 7 Experiments 53 7.1 Primary Key . 53 7.1.1 Evaluation . 53 7.1.2 Experiments . 53 7.2 Foreign Key . 55 7.2.1 Evaluation . 55 7.2.2 Experiment . 56 7.3 Column Map . 57 8 7.3.1 Evaluation . 57 7.3.2 Experiments . 57 8 Conclusion 61 A Column Map without Foreign Keys 63 B Relational Dataset 67 C Column Map Discovery Performance 71 9 10 List of Figures 1-1 Example of data lineage under our definitions. Input tables I1 and I2 are going through transformations T1, T2, T3, T4 and generating output tables O1, O2. Therefore, the lineage of O1, O2 is the set containing I1 and I2.................................... 18 1-2 This figure shows a user who is interested in a statistic and wantsto know the source(s) of the data: in other words, the data lineage. The goal of this project is to answer this question in the context of tabular databases. 19 3-1 This figure provides a brief overview of how Tracer works and itsrole in the system context. Tables without lineage (in the left) go through the Data Ingestion Engine to be converted into the appropriate format, and then fed into Tracer — the focus of this thesis — which discovers the lineage of the tables. 26 3-2 This figure shows how all the primitives are put together into asin- gle pipeline, as well as their standard inputs and outputs. Further clarification regarding the data types can be found in Table 3.1.. 26 4-1 This figure showcases the functionality of the Primary Key Discovery module. It takes as input a set of tables (on the left) and returns as output the name of the columns that it predicts are primary keys (on the right). 30 11 4-2 This is an example of how the Foreign Key Discovery module works. It takes a set of tables as input (on the left), as well as their corre- sponding primary keys (represented in green), and outputs the foreign key relationships that it predicts, in sorted order from most likely to least (on the right). 32 4-3 This figure shows an example of the functionality of the Column Map Discovery primitive, where the module learns that the target column "Total" was generated by adding “Cost 1” and “Cost 2”. 34 4-4 This example shows the functionality of the Column Map Discovery primitive, where the module discovers that “Purchase Date” was generated from “Quantity” (the amount of rows for each key in “Purchase Date” corresponds to the values in “Quantity”). 35 4-5 This figure shows the Column Map Discovery primitive being applied, where the module learns that “Total” was generated from adding all values from “Cost 1” and “Cost 2” with the same foreign keys. 35 4-6 This example shows the functionality of the Column Map Discovery primitive, where the module discovers that “Min/Max” was generated from “Prices” (the values in “Min/Max” are the minimum and maxi- mum values from each key of “Prices”). 36 5-1 This figure demonstrates how Column Map Discovery works in three steps: (1) the model expands the columns of the dataset (e.g. “Date” becomes “Day”, “Month”, “Year”); (2) the model attempts to predict which of these columns generated the target column, assigning scores to each of them; (3) the model aggregates the columns that were ex- panded, selecting the largest scores as lineage of the target. 43 6-1 This figure shows the schema for the MovieLens dataset [24], wherethe arrows represent foreign key relationships, the first column describes the columns in each table, and the second column informs the data type of each column. 47 12 6-2 A basic example on how to use DataReactor. 48 A-1 This figure shows an example of the functionality of the Column Map Discovery primitive, where the module learns that "Total" was generated by adding “Cost 1” and “Cost 2” without knowing the foreign keys..................................... 63 13 14 List of Tables 2.1 A summary of the research done to each data lineage field, separated by type (how-lineage or why-lineage) and granularity (schema-level or instance-level) [19]. 24 3.1 This table defines the standard inputs and outputs of the Primary Key, Foreign Key and Column Map Discovery primitives. 28 4.1 This table shows an example of the correct input/output format for the Primary Key Discovery primitive. 31 4.2 This table shows an example of the correct input/output format for the Foreign Key Discovery primitive. 33 4.3 This table shows an example of the correct input/output format for the Column Map Discovery primitive. 37 6.1 Description of the Dataset and DerivedColumn data structures. 50 7.1 Number of tables per dataset, as well as the accuracy of our implementation of the Primary Key Discovery primitive in three situations, (1) normally, (2) without utilizing the column names, and (3) without utilizing the position of the columns. 54 7.2 Example of the Foreign Key Discovery evaluation method, when ground truth is A, B, C. 55 15 7.3 Number of foreign keys per dataset, F-measure, precision and recall of our implementation of the Foreign Key Discovery primitive when tested against eighteen datasets, using the evaluation method described in Section 7.2.1. 56 7.4 This table shows the performance of our current implementation of the Column Map Discovery primitive. These results were obtained by aggregating the values from Table C.1 for each dataset. 58 7.5 This table shows the F-measure, precision and recall of our implementation of the Column Map Discovery primitive. These results were obtained by aggregating the values from Table C.1 for the two transformations generated by DataReactor. 58 B.1 This table provides some details regarding the 73 relational datasets from [24]. 69 C.1 This table shows the F-measure, precision and recall of our implementation of the Column Map Discovery primitive when tested against eleven relational datasets using the evaluation method proposed in Sec- tion 7.3.1. 75 16 Chapter 1 Introduction In data science, the inability to trace the lineage of particular data is a common and fundamental problem that undermines both security and reliability [22]. Under- standing data lineage enables auditing and accountability, and allows users to better understand the flow of data through a system. However, by the time data makesits way through a system, its lineage has often been lost. Our goal in this thesis is to present a machine learning solution to the data lineage problem.

Tracer: a Machine Learning Approach to Data Lineage Felipe Alex Hofmann

GAVS' Blockchain-Based

Data Governance with Oracle

Data Management Capability

Achieving Regulatory Compliance with Data Lineage Solutions

Lineage Tracing for General Data Warehouse Transformations

Effective Data Governance

Harness the Power of Your Data

Metadata Management on a Hadoop Eco-System

Data Lineage Management: Impact and Value

Data Warehouse Optimization with Hadoop

Solution Brief Intelligent Data Cataloging for Cloud Data

Data Governance 101