Tracer: A Machine Learning Approach to Data Lineage

by Felipe Alex Hofmann

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

May 2020

© Massachusetts Institute of Technology 2020. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 20, 2020

Certified by: Kalyan Veeramachaneni, Principal Research Scientist, Thesis Supervisor

Accepted by: Katrina LaCurts, Chair, Master of Engineering Thesis Committee

Tracer: A Machine Learning Approach to Data Lineage
by Felipe Alex Hofmann

Submitted to the Department of Electrical Engineering and Computer Science on May 20, 2020, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

The data lineage problem entails inferring the source of a data item. Unfortunately, most of the existing work in this area relies on metadata, code analysis, or data annotations. In contrast, our primary focus is to present a machine learning solution that uses the data itself to infer the lineage. This thesis formally defines the data lineage problem, specifies the underlying assumptions under which we solved it, and provides a detailed description of how our system works.

Thesis Supervisor: Kalyan Veeramachaneni Title: Principal Research Scientist

Acknowledgments

I would like to thank my supervisor, Kalyan, for his insights, ideas, and instruction during this project; his support and guidance were invaluable for this thesis. I would also like to thank Kevin and Carles for their help with the design and development of the system. Finally, I would like to thank my family — Ricardo, Leila, Gustavo, and Nathália — for their unwavering support over the years.

Contents

1 Introduction
  1.1 Data Lineage Tracing
    1.1.1 Definition
    1.1.2 Why does it matter?
  1.2 Tracer
    1.2.1 Overview
    1.2.2 Why is Tracer needed?
    1.2.3 Scope
  1.3 Thesis Road Map

2 Related Work
  2.1 Definition
  2.2 Classification
  2.3 Setting

3 System Design
  3.1 Overview
  3.2 Tracer
  3.3 Terminology

4 Tracer Primitives
  4.1 Overview
  4.2 Primary Key Discovery
    4.2.1 Design choices
  4.3 Foreign Key Discovery
    4.3.1 Design choices
  4.4 Column Map Discovery
    4.4.1 Examples
    4.4.2 Inputs and Outputs
    4.4.3 Design choices
  4.5 Putting It All Together

5 Modeling Methods
  5.1 Primary Key Discovery
  5.2 Foreign Key Discovery
  5.3 Column Map Discovery
    5.3.1 Transform
    5.3.2 Solver
    5.3.3 Inverse Transform

6 Datasets
  6.1 Column Mapping Simulation
    6.1.1 Software Design
    6.1.2 Atoms
    6.1.3 Sieve

7 Experiments
  7.1 Primary Key
    7.1.1 Evaluation
    7.1.2 Experiments
  7.2 Foreign Key
    7.2.1 Evaluation
    7.2.2 Experiment
  7.3 Column Map
    7.3.1 Evaluation
    7.3.2 Experiments

8 Conclusion

A Column Map without Foreign Keys

B Relational Dataset

C Column Map Discovery Performance

List of Figures

1-1 Example of data lineage under our definitions. Input tables I1 and I2 go through transformations T1, T2, T3, T4 and generate output tables O1, O2. Therefore, the lineage of O1, O2 is the set containing I1 and I2.

1-2 This figure shows a user who is interested in a statistic and wants to know the source(s) of the data: in other words, the data lineage. The goal of this project is to answer this question in the context of tabular databases.

3-1 This figure provides a brief overview of how Tracer works and its role in the system context. Tables without lineage (on the left) go through the Data Ingestion Engine to be converted into the appropriate format, and then fed into Tracer — the focus of this thesis — which discovers the lineage of the tables.

3-2 This figure shows how all the primitives are put together into a single pipeline, as well as their standard inputs and outputs. Further clarification regarding the data types can be found in Table 3.1.

4-1 This figure showcases the functionality of the Primary Key Discovery module. It takes as input a set of tables (on the left) and returns as output the name of the columns that it predicts are primary keys (on the right).

4-2 This is an example of how the Foreign Key Discovery module works. It takes a set of tables as input (on the left), as well as their corresponding primary keys (represented in green), and outputs the foreign key relationships that it predicts, in sorted order from most likely to least (on the right).

4-3 This figure shows an example of the functionality of the Column Map Discovery primitive, where the module learns that the target column “Total” was generated by adding “Cost 1” and “Cost 2”.

4-4 This example shows the functionality of the Column Map Discovery primitive, where the module discovers that “Purchase Date” was generated from “Quantity” (the number of rows for each key in “Purchase Date” corresponds to the values in “Quantity”).

4-5 This figure shows the Column Map Discovery primitive being applied, where the module learns that “Total” was generated from adding all values from “Cost 1” and “Cost 2” with the same foreign keys.

4-6 This example shows the functionality of the Column Map Discovery primitive, where the module discovers that “Min/Max” was generated from “Prices” (the values in “Min/Max” are the minimum and maximum values from each key of “Prices”).

5-1 This figure demonstrates how Column Map Discovery works in three steps: (1) the model expands the columns of the dataset (e.g. “Date” becomes “Day”, “Month”, “Year”); (2) the model attempts to predict which of these columns generated the target column, assigning scores to each of them; (3) the model aggregates the columns that were expanded, selecting the largest scores as lineage of the target.

6-1 This figure shows the schema for the MovieLens dataset [24], where the arrows represent foreign key relationships, the first column describes the columns in each table, and the second column informs the data type of each column.

6-2 A basic example of how to use DataReactor.

A-1 This figure shows an example of the functionality of the Column Map Discovery primitive, where the module learns that “Total” was generated by adding “Cost 1” and “Cost 2” without knowing the foreign keys.

List of Tables

2.1 A summary of the research done in each data lineage field, separated by type (how-lineage or why-lineage) and granularity (schema-level or instance-level) [19].

3.1 This table defines the standard inputs and outputs of the Primary Key, Foreign Key and Column Map Discovery primitives.

4.1 This table shows an example of the correct input/output format for the Primary Key Discovery primitive.

4.2 This table shows an example of the correct input/output format for the Foreign Key Discovery primitive.

4.3 This table shows an example of the correct input/output format for the Column Map Discovery primitive.

6.1 Description of the Dataset and DerivedColumn data structures.

7.1 Number of tables per dataset, as well as the accuracy of our implementation of the Primary Key Discovery primitive in three situations, (1) normally, (2) without utilizing the column names, and (3) without utilizing the position of the columns.

7.2 Example of the Foreign Key Discovery evaluation method, when ground truth is A, B, C.

7.3 Number of foreign keys per dataset, F-measure, precision and recall of our implementation of the Foreign Key Discovery primitive when tested against eighteen datasets, using the evaluation method described in Section 7.2.1.

7.4 This table shows the performance of our current implementation of the Column Map Discovery primitive. These results were obtained by aggregating the values from Table C.1 for each dataset.

7.5 This table shows the F-measure, precision and recall of our implementation of the Column Map Discovery primitive. These results were obtained by aggregating the values from Table C.1 for the two transformations generated by DataReactor.

B.1 This table provides some details regarding the 73 relational datasets from [24].

C.1 This table shows the F-measure, precision and recall of our implementation of the Column Map Discovery primitive when tested against eleven relational datasets using the evaluation method proposed in Section 7.3.1.

Chapter 1

Introduction

In data science, the inability to trace the lineage of particular data is a common and fundamental problem that undermines both security and reliability [22]. Understanding data lineage enables auditing and accountability, and allows users to better understand the flow of data through a system. However, by the time data makes its way through a system, its lineage has often been lost. Our goal in this thesis is to present a machine learning solution to the data lineage problem. We argue that our approach provides a more general and more extensible solution to this problem than existing approaches, which rely on handcrafted heuristics [14], code analysis [23], and/or manual annotation of data [9].

1.1 Data Lineage Tracing

1.1.1 Definition

To approach the problem, we need a formal definition of data lineage. For the purposes of this thesis, we will be using the definition provided in [14]: Define O as a set of output tables, I as a set of input tables, o as a data item (row or column) from the output set, and define transformation T as a procedure that takes a set of tables as input and produces another set of tables as output. Then, given T(I) = O and an output item o ∈ O, the set I′ ⊆ I of input data items that contributed to o's derivation is the lineage of o, denoted as I′ = T′(o, I). Figure 1-1 provides an example of lineage under this definition.

Figure 1-1: Example of data lineage under our definitions. Input tables I1 and I2 go through transformations T1, T2, T3, T4 and generate output tables O1, O2. Therefore, the lineage of O1, O2 is the set containing I1 and I2.

With this, we can define a lineage graph to be a directed acyclic graph of the lineage, where nodes are entries of tables and edges are the lineage of these entries, pointing from the original entry to the entry that was generated. Notice that under this definition, T′ only provides the immediate parents of a data item, and the function has to be called recursively to fully compute the lineage graph.

1.1.2 Why does it matter?

Reliable data is fundamental for effective decision making. However, as data — specifically big relational data — is sampled, mixed, altered, and passed through systems and between stakeholders, its lineage is frequently lost [22]. If the original source of data is obscured, the quality of the derived data may be questioned, along with the reliability of any decisions based on it. A method for tracing the origin of data and its movement between people, processes, and systems would help address this issue.

Furthermore, more control over data is essential for regulatory compliance. As issues like privacy and control over personal data receive more attention, data protection agencies around the world are requiring that companies and other data-collecting entities provide users a higher level of control over their own data. Some recent examples of legislation are the GDPR (General Data Protection Regulation) [2], which was implemented in the European Union in 2018, and the CCPA (California Consumer Privacy Act) [1], which takes effect in 2020. With automatic data lineage tracing, it becomes easier to answer the question “Where did this data come from?” and to satisfy these regulatory compliance requirements.

Figure 1-2: This figure shows a user who is interested in a statistic and wants to know the source(s) of the data: in other words, the data lineage. The goal of this project is to answer this question in the context of tabular databases.

Understanding the flow of data through systems has many benefits for stakeholders. Accurate data lineage can help identify gaps in reporting, highlight issues, and help trace errors back to their root cause. It can help address questions such as how valuable the data is, whether the data is duplicated or redundant, and whether it is internal or external. Being able to trace the lineage of data is, overall, indispensable for enterprise applications.

1.2 Tracer

1.2.1 Overview

In light of the issues described above, we propose Tracer, a machine learning solution to the data lineage problem. At a high level, our system traces the lineage of data by (1) taking a tabular dataset as input, (2) analyzing the tables with statistical methods, machine learning techniques, and hand-crafted heuristics, and (3) returning the rows and columns that generated the tables. An end-to-end solution to the data lineage problem consists of:

1. Identifying parent-child relationships between tables.

2. Understanding how each row/column in a parent table maps to a row/column in the child table.

3. Presenting this information in a human-readable manner which allows users to derive actionable insights, as well as in a machine-readable manner which can be utilized by downstream data science tools.

1.2.2 Why is Tracer needed?

While many solutions to the data lineage problem have been proposed before, all available attempts at tackling the problem rely on handcrafted heuristics (such as [14], which provides algorithms for lineage tracing in data warehouses), code analysis (such as [23], which provides a system for lineage tracing relying on source code analysis), and/or manual annotation of data (such as [9], which provides a data annotation system for relational databases). While these are all valid approaches to lineage tracing, they all have limitations. Code analysis requires access to the code, which may be inaccessible when dealing with external applications or if the code was lost or deprecated. Handcrafted algorithms require a strict set of conditions in order to run, such as knowing ahead of time all the possible transformations, which cannot always be satisfied. Finally, solutions that require annotations are great for future lineage tracing, but they were not designed to answer the question of where existing, non-annotated data came from.

Tracer does not have these limitations. Unlike these approaches, Tracer learns the lineage from the data itself, and so (1) it doesn't require code access, (2) it places few constraints on the data, since it uses machine learning instead of fixed algorithms, and (3) it doesn't need any annotations, since it only requires the data itself to learn the lineage.

1.2.3 Scope

Trying to solve the data lineage problem for all types of data would be a massive undertaking, and we would likely encounter issues achieving high accuracy and speed.

Therefore, we will focus our attention on a subset of more common situations, as enumerated below:

1. All transformations must be deterministic, since adding randomization to the problem makes inferring the lineage significantly harder.

2. All transformations must be reducible to operations from columns to columns or from rows to rows. For example, merging two columns together is valid, but merging a column with a row is not.

3. The lineage must be unambiguous. For example, let’s assume the input column [1, 2, 3] generated the output column [4, 8, 12]. If there is another input [2, 4, 6], the model will not be able to infer which input generated the output, since there are equally simple transformations for both inputs that would result in the output.

4. All the data must be provided in tables. Tracer currently does not support conversion of other data types into the appropriate format.

1.3 Thesis Road Map

This thesis is organized as follows:

∙ Chapter 2 introduces related work about the data lineage problem.

∙ Chapter 3 presents the high-level design of the Tracer system.

∙ Chapter 4 describes the API of our system.

∙ Chapter 5 clarifies the technical workings of our system.

∙ Chapter 6 presents the datasets used to test our system.

∙ Chapter 7 demonstrates the performance of our system.

∙ Chapter 8 highlights promising areas for future work on this topic.

Chapter 2

Related Work

In this section, we provide an overview of previous work on the data lineage problem.

2.1 Definition

People have been working to keep track of data lineage for several decades. Back in 1997, Woodruff and Stonebraker [28] defined the lineage of data as its entire processing history, and provided a novel system to generate lineage without relying on metadata. Buneman, Khanna and Tan discussed the issue as well [11]. The field was further advanced when Cui and Widom published their work on the topic [15], [13], [14], defining data lineage in the context of data warehouses and providing algorithms for lineage tracing.

2.2 Classification

Research regarding data lineage can be divided into four distinct fields, based on the lineage type (how-lineage or where-lineage) and granularity (schema-level or instance-level). Lineage type refers to which question is being answered: either “where” the data comes from (i.e. which tables generated some output) or “how” the data was transformed (i.e. the specific transformations which generated some output). Granularity refers to the format of the lineage: schema-level (coarse-grained) lineage means the model looks at whole tables at a time, or even entire datasets, while instance-level (fine-grained) lineage means that each data item is analyzed individually. This is further discussed by Ikeda and Widom [19], who show some of the existing body of work for each category, which we reproduce in Table 2.1.

    Why and Where:  [12], General Warehouse [14], Warehouse View [15]
    Where:          Workflow [17], Non-Answers [18], Approximate [25], Trio [6], Curated [10]
    How:            Curated [10], Workflow [17]

Table 2.1: A summary of the research done in each data lineage field, separated by type (how-lineage or why-lineage) and granularity (schema-level or instance-level) [19].

2.3 Setting

Data context determines what type of data lineage tracing can be performed. For example, in data warehouses, where information is properly organized and well-annotated, the lineage can be discovered with basic heuristics, such as those described in [14]. On the other hand, in malicious environments, intentional and unintentional leakages must be taken into consideration, and entirely different frameworks must be developed, as [7] and [22] discuss. Other examples include lineage in the context of arrays, as studied in [21], and in the setting of relational databases containing uncertain data, as described by [8] and [6].

Chapter 3

System Design

At a high level, the Tracer library works by executing pipelines that contain modules called primitives. Each of these primitives is responsible for solving a particular subproblem related to the data lineage task, from primary and foreign key discovery to identifying column mappings. In this chapter, we describe how these pipelines and primitives interact to solve the data lineage problem.

3.1 Overview

The Tracer library was developed to solve the data lineage problem in the context of tabular databases. To understand the purpose and functionality of the library, we must first understand how Tracer interacts with other modules within the same context. Figure 3-1 provides a general overview of the system design. On the left side, the figure shows various systems generating tables. We don’t always know which system generated which tables, which can lead to a loss of data lineage. These systems can also be both complex and diverse — for example, System A could be a MySQL database, and System B a Hadoop database, with each one requiring distinct handling. The tables are fed into the Data Ingestion Engine. The two components of this module — the System Connector and the Input Generator — help to transform the various possible inputs into a standardized format called Metadata.JSON [3].


Figure 3-1: This figure provides a brief overview of how Tracer works and its role in the system context. Tables without lineage (on the left) go through the Data Ingestion Engine to be converted into the appropriate format, and then fed into Tracer — the focus of this thesis — which discovers the lineage of the tables.

Tracer then takes the datasets, along with the Metadata.JSON file, and attempts to identify useful properties pertinent to the data lineage problem. It does this by applying various procedures called primitives which extract information (such as primary keys and foreign keys) and store it in the metadatabase, metadb. This data is then processed by the UI/UX library, which provides a graphical interface, the Data Lineage Dashboard, that facilitates the user's interaction with Tracer.

3.2 Tracer


Figure 3-2: This figure shows how all the primitives are put together into a single pipeline, as well as their standard inputs and outputs. Further clarification regarding the data types can be found in Table 3.1.

Now that we have an overview of the system at large, let's take a closer look at the Tracer module. Tracer is an end-to-end pipeline that takes in a set of data tables and outputs the lineage of that data. It accomplishes this by running through a series of steps, called primitives. Each primitive generates relevant information that is then fed to the next primitive in the pipeline, and so on down the line until the data lineage is discovered. This interaction is graphically represented in Figure 3-2. As the diagram shows, we start with a set of tables, which are fed to the Primary Key Discovery primitive to generate the primary keys of the dataset. Next, the Foreign Key Discovery primitive uses the tables and the generated primary keys to detect any foreign key relationships that may exist in the dataset. Finally, the Column Map Discovery primitive uses these foreign keys, as well as the original tables, to discover which columns in the dataset generated which other columns. The design and functionality of each of these primitives is described in Chapter 4.

The above process, if accurate, is sufficient to provide the lineage of any item in a database. This is due to one of the constraints we presented previously: Tracer only deals with transformations on rows and columns. Therefore, if we know the mapping of every row (through foreign keys), and the mapping of every column (through the column map), we can pinpoint the exact origin of each cell of a table.

3.3 Terminology

As can be seen in Figure 3-2, each primitive has well-defined inputs and outputs. These data structures are defined in Table 3.1 and will be used throughout the rest of this thesis.

Name: Tables
Description: A set of tables, each of which has (1) a unique name, (2) one or more named columns, and (3) zero or more rows. Data Structure: dictionary mapping table names to dataframes.
Example:
{
    table1: DataFrame,
    table2: DataFrame,
}

Name: PrimaryKeys
Description: Indicates which columns of a set of tables are the primary keys. Data Structure: dictionary mapping table names to column names.
Example:
{
    table1: user_id,
    table2: prod_id,
}

Name: ForeignKeys
Description: All the foreign key mappings that exist within a set of tables, where each referenced table maps to all tables that refer to it. Data Structure: a sorted list of dictionaries, each of which contains five elements: table, column, ref_table, ref_column, score.
Example:
[
    {
        "table": users,
        "column": user_id,
        "ref_table": posts,
        "ref_column": key,
        "score": .9
    },
    ...
]

Name: ColumnMap
Description: The mapping between a target column and the columns from the parent table that generated it. Data Structure: a sorted list of dictionaries.
Example:
[
    {
        "table": costs,
        "column": cost_1,
        "score": .9
    },
    ...
]

Table 3.1: This table defines the standard inputs and outputs of the Primary Key, Foreign Key and Column Map Discovery primitives.

Chapter 4

Tracer Primitives

This chapter aims to provide further context about the building blocks of Tracer: the primitives. As explained in Chapter 3, the goal of each primitive is to solve a subproblem of the data lineage task. When combined, the solutions to these subproblems allow us to answer any data lineage query on a standard relational database. In this chapter we will (1) define what a primitive is, (2) explain how each of them works, and (3) show how they work together to solve a data lineage problem.

4.1 Overview

While Section 3.2 explained what a primitive does, it did not answer the question of what a primitive is. A primitive is a well-defined Python class which contains methods aimed at solving a specific part of the data lineage problem. There are three types of primitives implemented in Tracer: Primary Key Discovery, Foreign Key Discovery, and Column Map Discovery. Because the output of each successive primitive is required as an input for the one that follows, as shown in Figure 3-2, they must be called in this specific order. As for their actual implementation, each Tracer primitive class satisfies the following properties (a minimal sketch of this interface is shown after the list):

1. Contains a fit function, which is optional and corresponds to training the model on some external datasets.

2. Contains a solve function, which is required and uses the fitted model to produce the appropriate output.

3. Has standard inputs/outputs as shown in Figure 3-2.

4. Is easily extensible. In the scenario where a superior method to solve a primitive task is discovered (we describe our current implementation in Chapter 5), only the code for that particular primitive has to be substituted, since each of the primitives is independent of the others.
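To make this interface concrete, the following is a minimal sketch of what such a primitive class could look like. The class and method layout below is illustrative only and is not the exact Tracer API.

from abc import ABC, abstractmethod


class Primitive(ABC):
    """Illustrative base class for a Tracer-style primitive (hypothetical API)."""

    def fit(self, training_databases):
        """Optional: train the underlying model on external datasets."""
        pass

    @abstractmethod
    def solve(self, *inputs):
        """Required: use the (possibly fitted) model to produce the standard output."""
        raise NotImplementedError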

In the following sections we will describe (1) the three main types of primitives implemented in the Tracer library, (2) their respective APIs, and (3) the reasoning behind their design.

4.2 Primary Key Discovery


Figure 4-1: This figure showcases the functionality of the Primary Key Discovery module. It takes as input a set of tables (on the left) and returns as output the name of the columns that it predicts are primary keys (on the right).

The goal of the Primary Key Discovery primitive is to discover all the primary keys within a dataset. It takes as input a dictionary mapping table names to dataframes (Tables) and returns as output a dictionary mapping table names to the column name that describes the primary key (PrimaryKeys). If a table has no primary key, it will map to "None". Figure 4-1 demonstrates the functionality of this primitive. Consider the scenario shown in Table 4.1. In this example there are two tables, “users” and “posts”, with primary key columns “user_id” and “post_id”, respectively.

Input:
{
    users: pd.DataFrame(user_id, user_name, ...),
    posts: pd.DataFrame(post_id, post_title, ...)
}

Output:
{
    users: user_id,
    posts: post_id
}

Table 4.1: This table shows an example of the correct input/output format for the Primary Key Discovery primitive.

Now assume we lost the information indicating which columns are the primary keys. To rediscover them, we can use the Primary Key Discovery primitive by calling the solve function on these tables, and the model will return the primary keys for us.
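As a rough illustration of this input/output contract, the snippet below implements a naive, heuristic stand-in for the solve call on the example above. It is only a sketch; the real primitive relies on a trained classifier, as described in Chapter 5.

import pandas as pd

def naive_primary_keys(tables):
    """Toy heuristic: pick a unique column, preferring key-like names."""
    keys = {}
    for name, df in tables.items():
        unique_columns = [c for c in df.columns if df[c].is_unique]
        named = [c for c in unique_columns if c.endswith("_id") or c.endswith("_key")]
        keys[name] = (named or unique_columns or [None])[0]
    return keys

tables = {
    "users": pd.DataFrame({"user_id": [1, 2], "user_name": ["alice", "bob"]}),
    "posts": pd.DataFrame({"post_id": [10, 11], "post_title": ["a", "b"]}),
}
print(naive_primary_keys(tables))  # {'users': 'user_id', 'posts': 'post_id'}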

4.2.1 Design choices

The Primary Key Discovery primitive takes a dataset as input, as opposed to only a single table. We designed it as such because some methods for discovering primary keys rely on looking at relationships between tables. For example, consider a dataset (Tables) where the primary key for every table always has the same column name. Under our implementation, a primary key discovery model would be able to use this fact to simplify the key tracing. However, this argument does not extend to multiple datasets: there is little that can be learned across multiple datasets that could not be done within one, and so the input is a single dataset instead of a list. In addition, PrimaryKeys only stores one primary key per table, because the current implementation of Tracer can only discover up to one key per table. This will be enhanced in a future version of the project.

Another relevant decision concerns why this module only takes Tables as input and not other potentially useful information, such as foreign keys. There are two reasons for this. First, the Foreign Key Discovery module takes the primary keys as input, so if Primary Key Discovery required the foreign keys, we would have a loop. Second, the current module already achieves near-perfect accuracy, as is shown in Section 7.1, which means the added complexity of alternative implementations (e.g. merging both Key Discovery primitives to solve the looping issue) is not justifiable.

4.3 Foreign Key Discovery


Figure 4-2: This is an example of how the Foreign Key Discovery module works. It takes a set of tables as input (on the left), as well as their corresponding primary keys (represented in green), and outputs the foreign key relationships that it predicts, in sorted order from most likely to least (on the right).

The goal of the Foreign Key Discovery primitive is to discover all the foreign keys within a dataset. It takes as input a dictionary mapping table names to dataframes (Tables) and a dictionary mapping these tables to their primary keys (PrimaryKeys), and returns a list of dictionaries representing the foreign keys of the dataset, sorted by how likely they are according to the model (ForeignKeys). Notice that the majority of the keys in ForeignKeys don't actually exist. Because ForeignKeys contains all possible foreign key relationships, sorted from most to least likely to exist based on their “score” (see Table 3.1), only the initial values of the list correspond to actual foreign keys, while the other values must be thresholded. Figure 4-2 demonstrates the functionality of this primitive.

32 all possible foreign keys, sorted by most to least likely based on their “score.” Notice that the example in Table 4.2 only shows the first foreign key for the output, but the list contains four items, where the other possible foreign keys — (“user_id,” “key”), (“user_name,” “id”) and (“user_name,” “key”) — have lower scores.

Input:
{  # Tables
    users: pd.DataFrame(user_id, user_name, ...),
    posts: pd.DataFrame(id, key, nb_posts, ...)
},
{  # PrimaryKeys
    users: user_id,
    posts: key
}

Output:
[
    {
        "table": users,
        "column": user_id,
        "ref_table": posts,
        "ref_column": key,
        "score": .95
    },
    ...
]

Table 4.2: This table shows an example of the correct input/output format for the Foreign Key Discovery primitive.
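Downstream consumers can then threshold this sorted list by its scores. The snippet below illustrates the idea on the four candidates of the example above; the scores (other than the .95 shown in Table 4.2) and the 0.5 threshold are made up for illustration.

foreign_key_candidates = [
    {"table": "users", "column": "user_id",
     "ref_table": "posts", "ref_column": "key", "score": 0.95},
    {"table": "users", "column": "user_id",
     "ref_table": "posts", "ref_column": "id", "score": 0.20},
    {"table": "users", "column": "user_name",
     "ref_table": "posts", "ref_column": "id", "score": 0.05},
    {"table": "users", "column": "user_name",
     "ref_table": "posts", "ref_column": "key", "score": 0.03},
]

# Keep only the candidates that score above the chosen threshold.
likely_foreign_keys = [fk for fk in foreign_key_candidates if fk["score"] >= 0.5]
print(likely_foreign_keys)  # only the (user_id, key) relationship survives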

4.3.1 Design choices

Unlike the Primary Key Discovery primitive, this primitive does allow any number of keys per table. We decided to prioritize doing so in this module because, from our observations, most relational databases contain at least one compound foreign key, while compound primary keys are not as widespread.

Another major decision about this API was the output format. In earlier versions of Tracer, Foreign Key Discovery generated a dictionary mapping the column names of each foreign key to a list of tuples containing (“parent_table”, “child_table”, “child_column”), corresponding to the matching foreign keys. While the dictionary data structure allowed for easier access to a specific foreign key, it was a very unintuitive format. Changing it to a list of objects made it more approachable, while also facilitating the sorting of the foreign keys by their scores, which simplifies the job of the Column Map Discovery when it receives the ForeignKeys.

Finally, we need to explain our decision to generate all possible foreign keys and their respective scores. Initially, the output of this module was solely the thresholded foreign keys, and so no scores were provided to the next module. While this method generated cleaner outputs, we had trouble achieving high accuracy in the Column Map Discovery primitive, because results generated from Foreign Key Discovery were failing. This is caused by the error compounding multiplicatively in the pipeline: if the predicted foreign keys are incorrect, then the next module will also be incorrect, because it was fed the wrong information. To fix this issue, we now provide all possible foreign keys with their respective scores, which gives the next primitive more information with which to make decisions.

4.4 Column Map Discovery

This primitive type attempts to discover which columns of the dataset generated some target column chosen by the user. It takes as input a dictionary mapping table names to dataframes (Tables), a list of dictionaries representing the foreign keys of the table (ForeignKeys), as well as the target column, and outputs a sorted list of dictionaries representing the columns that generated the target (ColumnMap), as shown in Figure 3-2. This primitive is significantly more complex than the other two, so we will study it by analyzing four distinct examples of problems this module is capable of solving.

4.4.1 Examples


Figure 4-3: This figure shows an example of the functionality of the Column Map Discovery primitive, where the module learns that the target column "Total" was generated by adding “Cost 1” and “Cost 2”.

For the first example, consider the situation in Figure 4-3. The image shows the Column Map Discovery learning that the lineage of the column “Total” consists of the columns “Cost 1” and “Cost 2.” Note that it doesn't know the exact transformation (a combination of columns, where Total = Cost 1 + Cost 2), only that there exists some correlation between the columns.


Figure 4-4: This example shows the functionality of the Column Map Discovery primitive, where the module discovers that “Purchase Date” was generated from “Quantity” (the number of rows for each key in “Purchase Date” corresponds to the values in “Quantity”).

Another type of transformation Column Map Discovery is capable of recognizing is shown in Figure 4-4. In this example, the elements from column “Purchase Date” were generated from column “Quantity,” where the number in each cell of “Quantity” represents how many rows for each key there will be in “Purchase Date.” Note that in this example the column “Name” serves no purpose. The primitive should be able to figure out that there is no correlation between “Name” and “Purchase Date.”


Figure 4-5: This figure shows the Column Map Discovery primitive being applied, where the module learns that “Total” was generated from adding all values from “Cost 1” and “Cost 2” with the same foreign keys.

The third example is shown in Figure 4-5. Similarly to the first example, “Total” is a function of columns “Cost 1” and “Cost 2.” Unlike in the first example, multiple rows also combine, so that “Total” is generated by the summation of all elements of both columns that share the same key. Column Map Discovery supports this type of transformation by converting the multiple rows into one through the Transform stage, as explained in Section 5.3.
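To make this kind of aggregate relationship concrete, the toy snippet below computes a per-key “Total” in the spirit of Figure 4-5; the numbers are made up and the code is not part of Tracer.

import pandas as pd

costs = pd.DataFrame({
    "Key":    [2, 2, 1, 1, 2],
    "Cost 1": [1, 0, 5, 0, -1],
    "Cost 2": [2, 1, 10, 3, 2],
})

# "Total" is the per-key sum of "Cost 1" and "Cost 2" across all matching rows.
totals = (costs["Cost 1"] + costs["Cost 2"]).groupby(costs["Key"]).sum()
print(totals.to_dict())  # {1: 18, 2: 5}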


Figure 4-6: This example shows the functionality of the Column Map Discovery primitive, where the module discovers that “Min/Max” was generated from “Prices” (the values in “Min/Max” are the minimum and maximum values from each key of “Prices”).

A last example is shown in Figure 4-6. In this situation, the elements from column “Min/Max” were generated from column “Prices” in the first table by selecting the smallest and largest values for each key, and then placing the maximum value below the minimum one. This example really underscores the value of Tracer, seeing as even very unorthodox transformations can still be discovered by the Column Map Discovery primitive.

4.4.2 Inputs and Outputs

Consider the scenario shown in Table 4.3. In this example, there are two tables, “users” and “posts” (where “nb_posts” is generated by combining the columns from “post1” and “post2”), as well as a foreign key relationship between the columns “user_id” and “key.” Now assume we lost this information. To rediscover the lineage of the target column “nb_posts”, Tracer will first call the Primary Key Discovery primitive, which will feed its output to the Foreign Key Discovery primitive, which then generates the ForeignKeys object we need. Now we can simply use the Column Map Discovery's solve function with Tables and ForeignKeys as input to generate the lineage of “nb_posts.”

Input:
{  # Tables
    users: pd.DataFrame(user_id, nb_posts, ...),
    posts: pd.DataFrame(key, post1, post2, ...)
},
nb_posts,  # target column
[  # ForeignKeys
    {
        "table": users,
        "column": user_id,
        "ref_table": posts,
        "ref_column": key,
        "score": .95
    },
    ...
]

Output:
[
    {
        "table": posts,
        "column": post1,
        "score": .95
    },
    {
        "table": posts,
        "column": post2,
        "score": .9
    }
]

Table 4.3: This table shows an example of the correct input/output format for the Column Map Discovery primitive.

4.4.3 Design choices

One of the most important design decisions in this project involved the API of the Column Map Discovery. While the Primary and Foreign Key Discovery primitives deal with standard relational database structures, the Column Map Discovery is unique to Tracer, and very flexible both in implementation and functionality. This was the first module that was implemented, and the whole architecture of Tracer was designed around what this module would require to function. One important decision about the module is what type of inputs it should take. Theoretically it could work by only taking Tables as input (see Appendix A), but the lack of row mapping (i.e. foreign key relationship) would make the model impractically slow, since allowing uncertainty in both row and column mappings at the same time is an immense search space.

We dealt with this issue by providing the Column Map Discovery primitive with the ForeignKeys object. While it is true that this creates the issue of multiplicative error (i.e. if the Foreign Key primitive is incorrect, the Column Map will also be incorrect, since it is acting on false data), we concluded that predicting foreign keys is sufficiently easy, and there is enough research on the topic, that we will always be able to achieve sufficiently accurate results on that module.

Another important design choice is how dependent Column Map Discovery should be on ForeignKeys. In earlier versions of Tracer, Column Map Discovery was actually a class of four distinct primitives, each one aimed at dealing with a different type of foreign key. While this division facilitates the creation of primitives, we noticed that it limited the potential for solutions that use multiple foreign key types at the same time. Instead, we decided not to limit our framework as such, and in the current version of Tracer there is no distinction between the types of foreign keys. This means that any Column Map Discovery primitive must find a way to deal with the various foreign keys. For an example of how our current implementation accomplishes this, refer to Section 5.3.

4.5 Putting It All Together

During the fit stage, the pipeline is given a list of databases. Each “database” contains not only Tables with raw data, but also the metadata, which includes the foreign keys, primary keys, and constraints. Then, the execution pipeline performs the following:

1. Fit the Primary Key Discovery module by building a training set using the given primary keys.

2. Fit the Foreign Key Discovery module by building a training set using the given primary and foreign keys.

Note that the models can be saved to disk and used for inference (i.e. we can train the system on a large collection of databases, store the resulting pipeline, and provide the pre-trained pipelines to users to apply to their databases).

During the solve stage, the pipeline takes a dictionary of tables as well as a (table_name, column_name) tuple which specifies the target column for which we want to trace the lineage. The execution pipeline runs the following:

1. Apply the Primary Key Discovery module to the tables and identify the primary keys.

2. Apply the Foreign Key Discovery module to the tables and primary keys and identify the foreign keys.

3. Apply the Column Map Discovery module to the tables, foreign keys, and target column in order to identify the lineage of the target column.

Note that to obtain the lineage for multiple columns, the user doesn't need to rerun the first two steps, only the third step, which takes as input the target column. This is implemented by caching the results during pipeline execution — i.e. since the inputs to the primary and foreign key discovery modules do not change, the results will be retrieved from the cache.
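The following sketch outlines this solve-stage flow with caching. It assumes three primitive objects exposing the solve interface from Chapter 4; the class name and caching scheme are illustrative, not the actual Tracer implementation.

class LineagePipeline:
    """Illustrative pipeline: chain the three primitives and cache key discovery."""

    def __init__(self, pk_primitive, fk_primitive, column_map_primitive):
        self._pk = pk_primitive
        self._fk = fk_primitive
        self._cmap = column_map_primitive
        self._cache = {}

    def solve(self, tables, target):
        # Primary/foreign key discovery depends only on the tables, so its
        # result is cached and reused when tracing additional target columns.
        cache_key = id(tables)
        if cache_key not in self._cache:
            primary_keys = self._pk.solve(tables)
            self._cache[cache_key] = self._fk.solve(tables, primary_keys)
        foreign_keys = self._cache[cache_key]
        return self._cmap.solve(tables, foreign_keys, target)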

Chapter 5

Modeling Methods

In this chapter, we will delve further into the modeling algorithms used to support each primitive. While Chapter 4 went through what primitives are and how they are implemented, this chapter will focus on some of the algorithms Tracer uses to solve each primitive task.

5.1 Primary Key Discovery

At a high level, the primary key discovery module works by generating a feature vector for each column in a table, and predicting which column (if any) is most likely to be the primary key. Note that in this version of Tracer, we do not aim to support composite primary keys, only single-column primary keys. The feature vector contains handcrafted features describing the column, chosen based on common patterns observed among primary keys (a small sketch of such a feature vector follows the list). These features include:

1. Ordinal position. Primary keys tend to be one of the first columns in the table.

2. Uniqueness. Primary keys must be unique.

3. Data type. Primary keys are typically strings or integers.

41 4. Column name. We encode binary features for whether the name of the column contains common terms used for primary keys such as “_id” or “_key”.
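The sketch below shows roughly what such a per-column feature vector could look like; the exact features and their encoding in Tracer may differ.

import pandas as pd

def column_features(df, column_name):
    """Toy feature vector for one column of a table."""
    column = df[column_name]
    return {
        "ordinal_position": list(df.columns).index(column_name),
        "is_unique": float(column.is_unique),
        "is_int_or_str": float(column.dtype.kind in ("i", "u", "O")),
        "name_has_id": float("id" in column_name.lower()),
        "name_has_key": float("key" in column_name.lower()),
    }

users = pd.DataFrame({"user_id": [1, 2, 3], "city": ["NYC", "LA", "NYC"]})
print(column_features(users, "user_id"))
# {'ordinal_position': 0, 'is_unique': 1.0, 'is_int_or_str': 1.0,
#  'name_has_id': 1.0, 'name_has_key': 0.0}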

During the fit stage, the Primary Key Discovery module trains a classification model to predict primary key columns using these features; it performs hyperparameter optimization and cross-validation using the BTB library [16] to identify the best model. During the solve stage, it uses the model to identify the most promising candidate (if any) for the primary key.

5.2 Foreign Key Discovery

As discussed in [29] and [26], foreign keys typically satisfy the following six properties:

1. High cardinality.

2. Contain most of the primary keys.

3. Tend to be a primary key of few foreign keys.

4. Are not a subset of too many primary keys.

5. Share average lengths (mostly for strings).

6. Have similar column names.

By using the above properties, we came up with the following implementation to discover foreign keys. First, the model generates all the possible pairs of columns (i.e. for every two tables, create all the pairings between any column of one table and a column of the other), as long as a foreign key is viable (e.g. if the data types don't match, then there can be no foreign key relationship, so this pairing is not created). Then the model has to decide which pairs are actually foreign keys. It accomplishes this by using the following handcrafted features, based on the six properties described above (a sketch of a few of these features follows the list):

1. Intersection. Foreign keys should have significant intersection.

42 2. Inclusion dependency. One of the foreign keys tends to be a subset of the other.

3. Uniqueness. At least one of the foreign keys tends to be unique (i.e. a primary key).

4. Column name. We encode binary features for whether the name of either column contains common terms used for foreign keys, such as “_id” or “_key”.

5. Data type. Foreign keys are typically strings or integers.
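As an illustration, the snippet below computes simple versions of a few of these pairwise features for one candidate column pair; the exact feature definitions used by Tracer may be computed differently.

import pandas as pd

def candidate_pair_features(child, parent):
    """Toy features for a candidate foreign key (child column -> parent column)."""
    child_values = set(child.dropna())
    parent_values = set(parent.dropna())
    intersection = len(child_values & parent_values)
    return {
        "intersection_ratio": intersection / max(len(child_values), 1),
        "inclusion_dependency": float(child_values <= parent_values),
        "parent_is_unique": float(parent.is_unique),
        "dtypes_match": float(child.dtype == parent.dtype),
    }

posts_key = pd.Series([1, 1, 2, 3], name="key")
users_id = pd.Series([1, 2, 3], name="user_id")
print(candidate_pair_features(posts_key, users_id))
# {'intersection_ratio': 1.0, 'inclusion_dependency': 1.0,
#  'parent_is_unique': 1.0, 'dtypes_match': 1.0}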

Finally, much like for the Primary Key Discovery primitive, the fit procedure of the Foreign Key Discovery module trains a random forest classifier to predict foreign keys using these features; it then performs hyperparameter optimization and cross-validation using the BTB library [16] to identify the best model. During the solve stage, it then uses the model to identify the most likely foreign keys.

5.3 Column Map Discovery


Figure 5-1: This figure demonstrates how Column Map Discovery works in three steps: (1) the model expands the columns of the dataset (e.g. “Date” becomes “Day”, “Month”, “Year”); (2) the model attempts to predict which of these columns generated the target column, assigning scores to each of them; (3) the model aggregates the columns that were expanded, selecting the largest scores as lineage of the target.

The Column Map Discovery primitive provides the following functionality: given a target column, it identifies the columns that the target column is derived from. At a high level, as can be seen in Figure 5-1, it works by (1) transforming the database into an (X, y) representation where X and y contain numerical values representing the table and the target column, respectively, (2) identifying the relationships between X and y, and (3) reversing the transformation to identify which of the original columns contributed to the target column.

43 5.3.1 Transform

The transform step is meant to lead to more useful representations of the data. As an example, let’s consider the situation in Figure 5-1, where one of the columns is a date field. In this scenario, the transformation step would expand this one column into three: day, month and year. If the target column happens to be the year field, then this transformation has made it much easier for the model to identify that the date field is related to the target column. More specifically, the transform step uses reversible data transforms [4] to generate a new set of columns, which can then be further processed in the next steps. The functionality of this step can be further improved with the addition of new features, such as computing the length of a string. This procedure works for any type of foreign key relationship. If the foreign keys are many-to-many (i.e. multiple rows of the parent table match to multiple rows of the child table), for example, the module simply computes various aggregate transforms (e.g. number of rows in each child table), and the end result is still an (X, y) pair which contains numerical values. Therefore, the output is indistinguishable from the one-to-one case, and the next step doesn’t need to be concerned with the foreign keys.
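The snippet below hand-rolls the date expansion described above with plain pandas, purely as an illustration; the actual implementation relies on reversible data transforms [4].

import pandas as pd

table = pd.DataFrame({"Date": pd.to_datetime(["2019-05-03", "2019-07-21"])})

# Expand the single "Date" column into three numerical columns.
expanded = pd.DataFrame({
    "Day": table["Date"].dt.day,
    "Month": table["Date"].dt.month,
    "Year": table["Date"].dt.year,
})
print(expanded)
#    Day  Month  Year
# 0    3      5  2019
# 1   21      7  2019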

5.3.2 Solver

This is the step of the process in which it is determined which columns generated the target. The Solver takes the (X, y) generated from the Transform step (i.e. the expanded table and the corresponding target column) and assigns scores to each column indicating how likely they are to have generated the target column. It does so by training a random forest regressor to identify the feature importance of each column.

To more formally explain why the above procedure works, let us consider the following situation. If the input table contains two columns A and B and only column A contributed to some output Y, then given sufficient data and assuming that Y is not conditionally independent of A given B (e.g. the simplest case where this assumption is violated would be if A and B are identical/redundant), then a model with sparsity regularization will only use column A. Also, if the transformations are deterministic (e.g. there is no randomness in how the output column is generated) and we have enough data, then a universal function estimator (e.g. deep learning) will achieve perfect accuracy on the training set. It is important to observe that in this context overfitting isn't a problem, it's a feature. To identify the set of input columns that contributed to the output column, we simply need to solve:

\operatorname{argmin}_{S} \Big[ \min_{F} \, \lVert F(X[S]) - y \rVert^{2} \Big] + \lambda \lVert S \rVert_{0} \qquad (5.1)

where S is a subset of columns and F is an arbitrary machine learning model. Intuitively, this corresponds to identifying the smallest set of variables that results in the best prediction accuracy.
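In practice, the current solver approximates this objective with a random forest's feature importances. The toy example below, which is not the Tracer code itself, shows the idea on synthetic data where the target depends on only two of three columns; the hyperparameters are arbitrary.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))      # columns "A", "B", "C" after the transform
y = 2 * X[:, 0] + X[:, 1]          # target derived from A and B only

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
scores = dict(zip(["A", "B", "C"], model.feature_importances_))
print(scores)  # A and B receive high importance scores, C is close to zero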

5.3.3 Inverse Transform

Finally, now that we have identified how useful each column is for generating the target column, we can reverse our initial transform. For example, if a date field was transformed into three columns and those columns are assigned importance scores (0.0, 0.1, 0.4), then we would aggregate them into a total importance score of 0.5 and assign it to the original date field. We explored two different ways of aggregating the importance scores — adding them and taking the max over them — and found addition to work better and more consistently. This results in a score for each column which indicates how likely it is to have contributed to the target column.
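A minimal sketch of this aggregation step is shown below; the column names and scores are invented for illustration.

# Sum the importance scores of expanded columns back into a score for the
# original column each of them came from.
expanded_scores = {"Day": 0.0, "Month": 0.1, "Year": 0.4, "Quantity": 0.5}
expansion_map = {"Day": "Date", "Month": "Date", "Year": "Date",
                 "Quantity": "Quantity"}

original_scores = {}
for column, score in expanded_scores.items():
    origin = expansion_map[column]
    original_scores[origin] = original_scores.get(origin, 0.0) + score
print(original_scores)  # {'Date': 0.5, 'Quantity': 0.5}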

Chapter 6

Datasets

To evaluate the performance of our data lineage system, we compiled a suite of relational datasets from [24]. This repository contains 73 multi-table relational datasets covering a diverse range of applications, from molecular chemistry to supply chain management; it also includes standard benchmark datasets that have been used in related work on data processing systems [5].


Figure 6-1: This figure shows the schema for the MovieLens dataset [24], where the arrows represent foreign key relationships, the first column describes the columns in each table, and the second column informs the data type of each column.

Each dataset consists of multiple tables which are related through various primary and foreign key relationships. For example, the schema for the MovieLens dataset is shown in Figure 6-1. We store these datasets (and the associated metadata) using the Metadata.JSON format proposed in [3]. We can then use these datasets to evaluate the performance of our Primary and Foreign Key Discovery modules. For further information about these datasets, please consult Appendix B.

However, these datasets are not sufficient to evaluate the Column Map Discovery primitive. To assess this primitive, we developed a system to transform the relational datasets and generate new "derived" columns where the column mapping is known. Then, we can use this augmented dataset to evaluate the performance of the Column Map Discovery primitive.

6.1 Column Mapping Simulation

The DataReactor project aims to take a relational dataset and intelligently generate derived columns in order to aid in the evaluation of the Tracer library and guide the design of Metadata.JSON [3]. These derived columns are a function of the other columns in their table as well as the associated rows in the parent and child tables. DataReactor is implemented as an independent Python package which can be used as follows:

from datareactor import DataReactor

reactor = DataReactor()
reactor.transform(
    source="/path/to/dataset",
    destination="/path/to/dataset_new"
)

Figure 6-2: A basic example of how to use DataReactor.

The transform operation takes in a path to a dataset which is stored in a Metadata.JSON compatible format and produces a new Metadata.JSON compatible dataset where:

1. It has added derived columns to the tables which have useful statistical properties.

2. It has updated the metadata object with the lineage of these derived columns.

For example, suppose we have a simple database containing users and transactions. One of the derived columns produced by DataReactor might be an additional column in the users table which stores the number of transactions (i.e. the number of corresponding rows in the transactions table for each user). Other examples of derived columns produced by DataReactor include summary statistics (i.e. the average amount spent per transaction for each user), transforms on datetime fields (i.e. extracting the day of the week), and transforms on string fields (i.e. the length of the string).

6.1.1 Software Design

At a high level, DataReactor performs the following operations:

1. Load the tables and metadata.

2. Generate derived columns for each table.

(a) This functionality is provided by the DataReactor.Atoms sub-module.

(b) Each Atom is responsible for generating a type of derived column. The Atom API is easily extensible, allowing us to quickly add new transforms in the future.

(c) By randomly applying these Atoms to different tables, we can generate a wide range of derived columns.

3. Prune the derived columns to remove meaningless values.

(a) This functionality is provided by the DataReactor.Sieve sub-module.

(b) Each Sieve provides a different strategy for filtering derived columns, from more machine learning-oriented approaches to feature selection to simpler heuristics such as removing duplicate and constant-valued columns.

4. Write the expanded tables and updated metadata to disk.

We describe the modules responsible for steps 2 and 3 in greater detail in the following sections; the data structures defined in DataReactor are described in Table 6.1.

Class Name: Dataset
Description: The Dataset object is responsible for reading and writing datasets. It manages both the metadata and the tables. It also provides the ability to add derived columns to a dataset (by updating the metadata and target table).

Class Name: DerivedColumn
Description: The DerivedColumn object represents a derived column and contains (1) the name of the table it belongs to, (2) the actual values in the column, (3) information about the name and type of the column, and (4) a constraint object indicating the lineage of the column.

Table 6.1: Description of the Dataset and DerivedColumn data structures.

6.1.2 Atoms

In this section, we introduce the atom module, which implements the logic for generating derived columns. Each atom inherits from the Atom base class (see Appendix C) and is required to implement a derive method which takes as input a dataset and target table and returns a sequence of DerivedColumn objects containing new columns for that table. One simple Atom would be the RowCountAtom, which creates a derived column that counts the number of rows in the child table. For example, suppose there is a foreign key relationship between users and transactions. This atom would propose a new derived column for the users table which contains the number of transactions for each user. Another example of an Atom is the FeatureToolsAtom, which uses the FeatureTools [20] library to generate features, identifies the lineage of the features, and converts them to the derived column format. By integrating with FeatureTools, we are able to generate a broad variety of derived columns, from data type-specific transforms (e.g. converting datetime strings into time zones) to more general aggregate transforms (e.g. identifying the mode of a categorical column).
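A minimal sketch of this API is given below; the Atom base class, the derive method, and RowCountAtom follow the description above, but the signatures and the dataset helpers (get_table, get_children) are assumptions, and it reuses the hypothetical DerivedColumn sketched after Table 6.1.

```python
from abc import ABC, abstractmethod
from typing import List


class Atom(ABC):
    """Base class for derived-column generators (sketch)."""

    @abstractmethod
    def derive(self, dataset, table_name: str) -> List["DerivedColumn"]:
        """Return DerivedColumn objects containing new columns for the table."""


class RowCountAtom(Atom):
    """Counts, for each row of the parent table, its rows in each child table."""

    def derive(self, dataset, table_name):
        derived = []
        parent = dataset.get_table(table_name)  # assumed helper returning a DataFrame
        # Assumed helper: iterate over (child table, foreign key, primary key) triples.
        for child_name, foreign_key, primary_key in dataset.get_children(table_name):
            child = dataset.get_table(child_name)
            counts = (child[foreign_key].value_counts()
                      .reindex(parent[primary_key]).fillna(0).astype(int))
            derived.append(DerivedColumn(
                table=table_name,
                values=counts.reset_index(drop=True),
                field={"name": f"nb_rows_in_{child_name}", "type": "numerical"},
                constraint={"transform": "nb_rows", "related_table": child_name},
            ))
        return derived
```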

6.1.3 Sieve

Finally, the sieve module filters the derived columns. The generative process can result in columns that are constant-valued (for example, a transform that extracts the year from a datetime field will always return the same result if all the data comes from the same year) or even columns that are duplicates of one another. The sieve module applies filters to remove these edge cases and produce a final set of derived columns for analysis.
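As a sketch of the simpler heuristics mentioned above (removing duplicate and constant-valued columns), assuming each derived column exposes its values as a pandas Series as in the earlier sketches:

```python
def sieve(derived_columns):
    """Drop constant-valued columns and exact duplicates (sketch)."""
    kept, seen_values = [], []
    for column in derived_columns:
        values = column.values
        if values.nunique(dropna=False) <= 1:
            continue  # constant-valued: carries no information
        if any(values.equals(other) for other in seen_values):
            continue  # duplicate of a column we already kept
        seen_values.append(values)
        kept.append(column)
    return kept
```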

Chapter 7

Experiments

In this chapter we test each of our primitives using the datasets described in Chapter 6.

7.1 Primary Key

7.1.1 Evaluation

As discussed in Chapter 4, this primitive is designed to discover all the primary keys of a dataset. To assess how successful it is at achieving this goal, we use the following evaluation method: (1) disregard all tables that contain composite keys or no keys at all, since our current implementation returns single-column keys; (2) on this restricted dataset, check what percentage of the predicted keys are actually correct; (3) return the accuracy.
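A sketch of this computation is shown below; the representation of ground truth and predictions as dictionaries mapping table names to a single key column (with None or a tuple of columns marking the tables we disregard) is an assumption for illustration.

```python
def primary_key_accuracy(true_keys, predicted_keys):
    """Accuracy over tables with a single-column primary key (sketch)."""
    # (1) Disregard tables with composite keys or no key at all.
    eligible = {table: key for table, key in true_keys.items()
                if key is not None and not isinstance(key, tuple)}
    if not eligible:
        return None
    # (2)-(3) Fraction of eligible tables whose predicted key matches.
    correct = sum(predicted_keys.get(table) == key
                  for table, key in eligible.items())
    return correct / len(eligible)
```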

7.1.2 Experiments

Primary keys generally follow very strict rules, as discussed in Section 5.1. These constraints make the identification of primary keys fairly trivial, and so any reasonable model should achieve near-perfect accuracy on most datasets. Indeed, as can be seen in Table 7.1, our model achieves near-perfect accuracy under the current evaluation procedure.

Dataset              Number of Tables   Accuracy   Accuracy w/o Names   Accuracy w/o Ordinals
AustralianFootball   4                  1          1                    1
Biodegradability     5                  1          1                    1
Dunur                17                 1          1                    1
Elti                 11                 1          1                    1
Hepatitis_std        7                  0.25       0.25                 1
Mesh                 29                 1          1                    1
Mooney_Family        68                 1          1                    1
NBA                  4                  1          1                    1
PremierLeague        4                  1          1                    1
Same_gen             4                  1          1                    1
Toxicology           4                  1          1                    1
UW_std               4                  1          1                    1
Walmart              4                  1          1                    1
classicmodels        8                  1          1                    1
cs                   8                  1          1                    1
imdb_MovieLens       7                  1          1                    1
pubs                 11                 1          1                    1
university           5                  1          1                    1

Table 7.1: Number of tables per dataset, as well as the accuracy of our implementation of the Primary Key Discovery primitive in three situations, (1) normally, (2) without utilizing the column names, and (3) without utilizing the position of the columns.

To test whether our model was actually looking at the data rather than overly depending on the metadata (around 40% of the column names contain indicators of the nature of the column, such as id or key), we removed the column names from the previous datasets and tested again. As Table 7.1 shows, our results are invariant, with only one poor performance across the eighteen datasets. This outcome is positive, since it indicates that our model is not overly reliant on the metadata.

We also tested how well our model performs without the ordinal feature (i.e. without knowing which order the columns are in). As the table shows, the model achieved perfect performance under this circumstance, which is a little surprising considering it outperforms the original model. Upon further investigation, the issue seems to be that in almost all training samples the primary key is the first column, which causes the model to overestimate the importance of this feature.

7.2 Foreign Key

7.2.1 Evaluation

As discussed in Section 4.3, the Foreign Key Discovery primitive attempts to discover all the foreign keys of a dataset. As with the Primary Key Discovery primitive, we have to design a metric to evaluate how successful the model is. In this case, though, accuracy is not a reasonable choice of evaluation, as some tables may contain multiple foreign keys while others have none, which would not be properly captured by such a simple metric. Instead, we use precision, recall and F-measure for evaluation, as recommended in [29].

We compute these values as follows. First, recall that the Foreign Key Discovery primitive returns a sorted list of foreign keys, from most to least likely. Therefore, we can select the top X% most likely foreign keys predicted by our algorithm, for various values of X. Then we compute the precision, recall and F-measure of our method for each value of X, and finally select the best F-measure achieved, as well as the corresponding value of X.

To clarify this idea, let us go through the example in Table 7.2. Imagine a dataset containing three foreign key mappings, A, B and C. Now imagine the Foreign Key Discovery module predicts the following possible foreign keys, sorted from most to least likely: A, D, B, C, E. To evaluate the performance of the model, we select the top 20% of our predictions (i.e. we believe A is the only foreign key relationship in the dataset) and compute the recall, precision and F-measure for it.

Predicted   X      Recall   Precision   F-Measure
A           20%    0.33     1.00        0.50
D           40%    0.33     0.50        0.40
B           60%    0.66     0.66        0.66
C           80%    1.00     0.75        0.86
E           100%   1.00     0.60        0.80

Table 7.2: Example of the Foreign Key Discovery evaluation method, when ground truth is A, B, C.

We then repeat this process for multiple values of X and select the best-performing result, which in this example is an F-measure of 0.86 at X = 80%.
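A sketch of this sweep is shown below; the function name and the representation of foreign keys as hashable labels are illustrative assumptions. On the worked example it recovers the best F-measure of roughly 0.86 at X = 80%.

```python
def best_f_measure(ranked_predictions, true_keys):
    """Sweep over top-X% prefixes and keep the best F-measure (sketch)."""
    truth = set(true_keys)
    best = {"f_measure": 0.0, "precision": 0.0, "recall": 0.0, "x": 0.0}
    for i in range(1, len(ranked_predictions) + 1):
        selected = set(ranked_predictions[:i])
        true_positives = len(selected & truth)
        precision = true_positives / len(selected)
        recall = true_positives / len(truth)
        if precision + recall == 0:
            continue
        f_measure = 2 * precision * recall / (precision + recall)
        if f_measure > best["f_measure"]:
            best = {"f_measure": f_measure, "precision": precision,
                    "recall": recall, "x": i / len(ranked_predictions)}
    return best


# Worked example from Table 7.2: ground truth {A, B, C}, ranked predictions A, D, B, C, E.
print(best_f_measure(["A", "D", "B", "C", "E"], ["A", "B", "C"]))
# -> best F-measure of about 0.86 with precision 0.75 and recall 1.0, at X = 0.8
```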

7.2.2 Experiment

Dataset              Foreign Keys   F-Measure   Precision   Recall
AustralianFootball   5              1.00        1.00        1.00
Biodegradability     5              1.00        1.00        1.00
Dunur                32             1.00        1.00        1.00
Elti                 20             1.00        1.00        1.00
Hepatitis_std        6              1.00        1.00        1.00
Mesh                 33             0.43        0.77        0.30
Mooney_Family        134            0.70        0.61        0.83
NBA                  5              0.83        0.71        1.00
PremierLeague        5              0.75        1.00        0.60
Same_gen             6              1.00        1.00        1.00
Toxicology           5              0.75        1.00        0.60
UW_std               4              0.86        1.00        0.75
Walmart              3              1.00        1.00        1.00
classicmodels        8              0.73        1.00        0.57
cs                   10             0.74        0.78        0.70
imdb_MovieLens       6              1.00        1.00        1.00
pubs                 10             0.82        0.75        0.90
university           4              1.00        1.00        1.00

Table 7.3: Number of foreign keys per dataset, F-measure, precision and recall of our implementation of the Foreign Key Discovery primitive when tested against eighteen datasets, using the evaluation method described in Section 7.2.1.

Foreign keys are allowed much more flexibility than primary keys. Each table can have multiple foreign keys or none at all, the column names are much less indicative than for primary keys, and their overall structure is more complex. As such, discovering foreign keys is significantly harder than discovering primary keys. Our results for this task, using the evaluation method previously discussed, can be seen in Table 7.3. Overall, the model achieves an average F-measure of 0.87, an average precision of 0.92, and an average recall of 0.85. One thing to note is that the precision and recall are fairly similar to each other, which means we chose an appropriate value for X, seeing as, in general, a higher X increases recall and decreases precision, while a lower X does the opposite. A balance between the two values is desirable given the harmonic nature of the F-measure.

When comparing our results in Table 7.3 with the structure of the datasets in Table B.1, we notice they are highly correlated. For example, if we limit our results to the 13 datasets containing fewer than 10 tables, our average F-measure, precision and recall become 0.90, 0.96, and 0.86, respectively, which is a measurable improvement. This superior result on smaller datasets implies an inferior result on the more complex ones, something we will continue working to improve in future versions of Tracer.

7.3 Column Map

7.3.1 Evaluation

For the final primitive, the goal is to trace all the columns from the dataset that helped generate some target column. In order to evaluate the success rate of this module, we use a fixed threshold value of 0: we select all column maps predicted by our algorithm with feature importance greater than 0, then compute the precision, recall and F-measure and report our results.
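A sketch of this threshold rule is given below; it assumes the primitive exposes one feature-importance score per candidate source column, and the names are chosen for illustration.

```python
def evaluate_column_map(importances, true_sources):
    """Select columns with importance > 0 and score them (sketch)."""
    predicted = {column for column, importance in importances.items()
                 if importance > 0}
    truth = set(true_sources)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```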

7.3.2 Experiments

To evaluate the Column Map Discovery primitive we will need to utilize DataReactor. As explained in Chapter 6, DataReactor takes a relational dataset and intelligently generates derived columns. These derived columns are a function of other columns in their respective datasets, so we can use them to test Column Map Discovery. For the following experiment, DataReactor was used to generate two types of target columns: “nb_rows” which is based on foreign key relationships and corresponds to the number of rows in the child table for each of the keys in the parent table, and “add_numerical” which corresponds to a random linear combination of the numerical columns in the target table. Target columns generated by DataReactor follow a specific naming convention.

To understand it, consider an example where DataReactor generated a target column called “molecule.nb_rows_in_atom”. This means the column was derived from some column in table “atom” through the transformation “nb_rows”. This transformation counts the number of rows in the “atom” table which belong to each row of the “molecule” table, where the “atom” table is related to the “molecule” table through a foreign key relationship. We test the Column Map Discovery primitive by using eleven datasets from the Relational Dataset Repository [24], with target columns generated by DataReactor and evaluation as described in Section 7.3.1. Our full results are reported in Table C.1. We also aggregate the results by dataset (Table 7.4) and by transformation (Table 7.5). Notice that the precision and recall are averaged across target columns, while the F-measure is computed from these two averages.

Dataset              F-Measure   Precision   Recall
AustralianFootball   0.69        0.61        0.79
Biodegradability     0.60        0.43        1.00
Dunur                0.29        0.17        1.00
Elti                 0.31        0.18        1.00
Hepatitis_std        0.61        0.51        0.77
Mesh                 0.62        0.45        1.00
Mooney_Family        0.05        0.03        1.00
NBA                  0.23        0.13        0.93
Same_gen             0.50        0.34        1.00
Toxicology           0.35        0.21        1.00
UW_std               0.67        0.53        0.90

Table 7.4: This table shows the performance of our current implementation of the Column Map Discovery primitive. These results were obtained by aggregating the values from Table C.1 for each dataset.

Transformation   F-Measure   Precision   Recall
add_numerical    0.78        0.76        0.79
nb_rows          0.38        0.23        1.00

Table 7.5: This table shows the F-measure, precision and recall of our implementation of the Column Map Discovery primitive. These results were obtained by aggregating the values from Table C.1 for the two transformations generated by DataReactor.

As shown in Table 7.4, our current implementation of the Column Map Discovery primitive achieves an average F-measure of 0.45, an average precision of 0.33, and an average recall of 0.94. Given the extensibility of our system, we expect these numbers to improve as additional transformations and primitives are implemented in future versions of Tracer. Furthermore, we note that the F-measure is only weakly correlated with the real-world utility provided by our system. With our system, a user can look at the top X% of the column maps suggested by our model to quickly figure out the lineage of the target column. This is a significant improvement over manually digging through thousands of combinations of columns.

Chapter 8

Conclusion

In this thesis we proposed Tracer, a novel machine learning approach to data lineage discovery. We presented a framework that accomplishes end-to-end lineage tracing by separating the tracing process into its primitive components. While the work we have accomplished is extensive, there is much that can be improved.

As of yet, we mostly assume columns and rows to be completely separate from each other, while in reality the lineage of each item in a table is highly correlated with that of other items. Rewriting our primitives to exploit this fact should speed up the process considerably. Tracer also does not incorporate past lineage searches into future computations. Given that the data from a company shares many of the same patterns, a specific module to aggregate and analyze this information could be an interesting line of research.

Another topic that needs further work is the addition of new primitives. Currently, Tracer only contains the most basic primitive modules, and leveraging current and future research to write new and more sophisticated primitives is essential for Tracer to achieve its full potential. One simple example is in the Primary Key Discovery module: currently, Tracer does not support composite keys, even though multi-column primary keys are commonly found in relational datasets.

Better usage of metadata and data annotations would also improve Tracer. While Tracer does make use of column names for key discovery, the Column Map Discovery module does not. Further exploration of how we can leverage natural language

processing techniques for this problem is essential for obtaining better results.

Another potential line of research centers on multi-parent tables. For example, suppose we have a users table, a locations table, and a companies table. Each user works at a company and has a location. Currently, Tracer computes the lineage for these tables separately: first users/locations, then users/companies. Taking both tables into consideration at the same time could allow for more accurate results, and should be implemented in a future version of the project.

Finally, another aspect that could be further improved is how we handle loops in datasets. Taking the example from above, with a users table, a locations table, and a companies table, it is possible that each user works at a company, each user has a location, and each company has a location. This forms a loop, which is a situation we have not considered sufficiently and which may require further research.

By adding these additional features to the Tracer project, we can further enhance the system and enable much more general tracing than is provided by alternatives.

Appendix A

Column Map without Foreign Keys

Input table:
Key   Cost 1   Cost 2
?     100      200
?     50       150
?     0        100

Output table:
Key   Total
?     100
?     200
?     300

Figure A-1: This figure shows an example of the functionality of the Column Map Discovery primitive, where the module learns that “Total” was generated by adding “Cost 1” and “Cost 2” without knowing the foreign keys.

In this Appendix we describe an alternate algorithm for the Column Map Discovery primitive which discovers the ColumnMap without using ForeignKeys. Notice that ForeignKeys is still provided, since it is the standard input of the primitive, but it is not used. This module is useful in situations where the foreign keys cannot be trusted, so we need to derive ColumnMap directly from Tables. Figure A-1 demonstrates a situation where this module might be useful. This method is purely theoretical and is not currently implemented in Tracer. To solve this problem, we propose an iterative procedure which alternates between (1) learning a distribution over row mappings based on the optimal column mapping and (2) learning the optimal column mapping based on the distribution over row mappings. Under weak assumptions about the types of transformations that can be applied to the data, we can show that, after each iteration, the distribution over the row mapping converges toward the true row mapping.

At a high level, the iterative estimation procedure works as follows:

1. Sample from a distribution over row mappings.

2. Assuming the row mapping is correct, find the optimal column mapping and compute the corresponding loss.

3. If the loss is lower than average, update the distribution over row mappings to increase the probability of each (input, output) pair in the mapping.

4. Otherwise, update the distribution over row mappings to decrease the probability of each (input, output) pair in the mapping.

5. Repeat.

In order for this procedure to work, we need to make a few assumptions. Primarily, we need to assume that the loss associated with the optimal column mapping is a decreasing function of the number of correctly matched rows. In other words, if our row mapping has more correct matches, then our model will be more successful at learning the relationship between the inputs and outputs. Formally, we can use a regularization-based approach to solve the problem, where our goal is to find the row mapping M and subset of columns S which minimize the loss

\arg\min_{M,S}\Big[\min_{F}\,\lVert F(X[S]) - M(y)\rVert^{2}\Big] + \lambda\,\lVert S\rVert_{0} \qquad \text{(A.1)}

where S is a subset of columns and F is an arbitrary machine learning model. The biggest obstacle to implementing this procedure is figuring out how to efficiently sample from and iteratively update the distribution over row mappings. In general, exactly solving the bipartite matching problem (e.g. finding the maximum-likelihood matching between two sets of rows) is intractable since, given an input and output table with N rows, there are N! possible row mappings. However, given a solution for sampling from and iteratively updating a distribution over row mappings [27], we can then concurrently estimate both the row mappings and the column mappings for a pair of tables, allowing us to identify the set of input cells which contributed to a particular output cell.
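To make the outline above more concrete, here is a heavily simplified toy sketch of the alternating procedure. It is not implemented in Tracer (the text above notes the method is purely theoretical); it substitutes ordinary least squares for the arbitrary model F, omits the lambda * ||S||_0 subset-selection penalty, and uses a crude greedy sampler in place of the bipartite sampling scheme of [27].

```python
import numpy as np


def sample_permutation(weights, rng):
    """Greedily sample a row mapping (permutation) without replacement."""
    n = weights.shape[0]
    permutation = np.full(n, -1)
    available = np.ones(n, dtype=bool)
    for i in rng.permutation(n):
        scores = weights[i].copy()
        scores[~available] = -np.inf
        probs = np.exp(scores - scores[available].max())
        probs /= probs.sum()
        j = rng.choice(n, p=probs)
        permutation[i] = j
        available[j] = False
    return permutation


def estimate_mappings(X, y, n_iters=500, step=0.5, seed=0):
    """Alternate between sampling row mappings and fitting a column mapping."""
    rng = np.random.default_rng(seed)
    n = len(y)
    weights = np.zeros((n, n))   # log-scores of the row-mapping distribution
    running_loss = None
    best = (np.inf, None, None)  # (loss, row mapping, least-squares coefficients)
    for _ in range(n_iters):
        perm = sample_permutation(weights, rng)             # step 1
        coef, *_ = np.linalg.lstsq(X, y[perm], rcond=None)  # step 2 (stand-in for F)
        loss = float(np.sum((X @ coef - y[perm]) ** 2))
        if loss < best[0]:
            best = (loss, perm, coef)
        running_loss = loss if running_loss is None else 0.9 * running_loss + 0.1 * loss
        # Steps 3-4: reward or penalize every (input, output) pair in this mapping.
        delta = step if loss <= running_loss else -step
        weights[np.arange(n), perm] += delta
    return best
```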

Appendix B

Relational Dataset

Dataset Name         Size (MB)   Num of Tables   Temporal?   Numeric?   String?

Accidents            234.5       3               Yes         Yes        Yes
AdventureWorks       233.8       71              Yes         Yes        Yes
Airline              464.5       19              Yes         Yes        Yes
Atherosclerosis      3.2         4               No          Yes        Yes
AustralianFootball   34.1        4               Yes         Yes        Yes
BasketballMen        18.3        9               No          Yes        Yes
BasketballWomen      2.3         8               No          Yes        Yes
Biodegradability     3.7         5               No          Yes        Yes
Bupa                 .3          9               No          Yes        Yes
Carcinogenesis       21          6               No          Yes        Yes
CCS                  15.9        6               Yes         Yes        Yes
Chess                .3          2               Yes         Yes        Yes
ClassicModels        .5          8               Yes         Yes        Yes
CORA                 4.6         3               No          Yes        Yes
Countries            8.6         4               No          Yes        Yes
Credit               317.9       9               Yes         Yes        Yes
CS                   .3          8               Yes         Yes        Yes
DCG                  .3          2               No          Yes        Yes
Dunur                .8          17              No          Yes        Yes
Elti                 .6          11              No          Yes        Yes
Employee             197.4       6               Yes         Yes        Yes
Facebook             2           2               No          Yes        No
Financial            78.8        8               Yes         Yes        Yes
FNHK                 129.8       3               Yes         Yes        Yes
FTP                  7.5         2               Yes         Yes        Yes
Geneea               61.4        19              Yes         Yes        Yes
Genes                1.8         3               No          Yes        Yes
Hepatitis            2.2         7               No          Yes        Yes
Hockey               15.6        22              No          Yes        Yes
IMDb                 477.1       7               No          Yes        Yes
KRK                  .1          1               No          Yes        Yes
Lahman               74.1        25              No          Yes        Yes
LegalActs            240.2       5               Yes         Yes        Yes
Mesh                 1.1         29              No          No         Yes
Mondial              3.2         40              Yes         Yes        Yes
MooneyFamily         3.2         68              No          No         Yes
MovieLens            154.9       7               No          Yes        Yes
Musk                 .4          2               No          Yes        Yes
Mutagenesis          .9          3               No          Yes        Yes
Nations              1.2         3               No          Yes        Yes
NBA                  .3          4               Yes         Yes        Yes
NCAA                 35.8        9               Yes         Yes        Yes
Northwind            1.1         29              Yes         Yes        Yes
Pima                 .7          9               No          Yes        Yes
PremiereLeague       11.3        4               Yes         Yes        Yes
PTC                  8.1         4               No          No         Yes
PTE                  4.4         38              No          Yes        Yes
Pubs                 .4          11              Yes         Yes        Yes
Pyrimidine           ?           2               No          Yes        No
Restbase             3           3               No          Yes        Yes
Sakila               6.5         16              Yes         Yes        Yes
SalesDB              584.3       4               No          Yes        Yes
SameGen              .3          4               No          Yes        Yes
SAP                  246.9       5               Yes         Yes        Yes
SAT                  2.7         37              No          Yes        Yes
Seznam               146.8       4               Yes         Yes        Yes
Stats                658.4       8               Yes         Yes        Yes
StudentLoan          .9          10              No          Yes        Yes
Thrombosis           1.9         3               Yes         Yes        Yes
TPCC                 165.8       9               Yes         Yes        Yes
TPCD                 2500        8               Yes         Yes        Yes
TPCDS                4800        24              Yes         Yes        Yes
TPCH                 2000        8               Yes         Yes        Yes
Trains               .1          2               No          Yes        Yes
Triazine             .2          2               No          Yes        No
University           .3          5               No          Yes        Yes
UTube                .2          2               No          Yes        Yes
UW-CSE               .2          4               No          Yes        Yes
VisualGenome         198.1       6               No          Yes        Yes
VOC                  4.7         8               Yes         Yes        Yes
Walmart              167.3       4               Yes         Yes        Yes
WebKP                12.8        3               No          No         Yes
World                .7          3               No          Yes        Yes

Table B.1: This table provides some details regarding the 73 relational datasets from [24].

Appendix C

Column Map Discovery Performance

Dataset  Target Column  F1  Prec.  Rec.

AustralianFootball  matches.add_numerical  0.75  1.00  0.60
AustralianFootball  players.add_numerical  0.50  1.00  0.33
AustralianFootball  players.nb_rows_in_match_stats  1.00  1.00  1.00
AustralianFootball  teams.nb_rows_in_match_stats  0.03  0.02  1.00
AustralianFootball  teams.nb_rows_in_matches  0.03  0.02  1.00
Biodegradability  atom.nb_rows_in_bond  0.40  0.25  1.00
Biodegradability  atom.nb_rows_in_gmember  0.67  0.50  1.00
Biodegradability  bond.add_numerical  1.00  1.00  1.00
Biodegradability  group.nb_rows_in_gmember  0.67  0.50  1.00
Biodegradability  molecule.add_numerical  0.36  0.22  1.00
Biodegradability  molecule.nb_rows_in_atom  0.20  0.11  1.00
Dunur  person.nb_rows_in_aunt  0.15  0.08  1.00
Dunur  person.nb_rows_in_brother  0.22  0.13  1.00
Dunur  person.nb_rows_in_daughter  0.17  0.09  1.00
Dunur  person.nb_rows_in_dunur  0.25  0.14  1.00
Dunur  person.nb_rows_in_father  0.33  0.20  1.00
Dunur  person.nb_rows_in_husband  0.29  0.17  1.00
Dunur  person.nb_rows_in_husband2  0.22  0.13  1.00
Dunur  person.nb_rows_in_mother  0.33  0.20  1.00
Dunur  person.nb_rows_in_nephew  0.14  0.08  1.00
Dunur  person.nb_rows_in_niece  0.18  0.10  1.00
Dunur  person.nb_rows_in_sister  0.20  0.11  1.00
Dunur  person.nb_rows_in_son  0.18  0.10  1.00
Dunur  person.nb_rows_in_target  0.04  0.02  1.00
Dunur  person.nb_rows_in_uncle  0.18  0.10  1.00
Dunur  person.nb_rows_in_wife  0.22  0.13  1.00
Dunur  person.nb_rows_in_wife2  0.20  0.11  1.00
Dunur  target.add_numerical  1.00  1.00  1.00
Elti  person.nb_rows_in_brother  0.11  0.06  1.00
Elti  person.nb_rows_in_daughter  0.33  0.20  1.00
Elti  person.nb_rows_in_elti  0.13  0.07  1.00
Elti  person.nb_rows_in_father  0.12  0.06  1.00
Elti  person.nb_rows_in_husband  0.25  0.14  1.00
Elti  person.nb_rows_in_mother  0.09  0.05  1.00
Elti  person.nb_rows_in_sister  0.12  0.06  1.00
Elti  person.nb_rows_in_son  0.33  0.20  1.00
Elti  person.nb_rows_in_target  0.10  0.05  1.00
Elti  person.nb_rows_in_wife  0.25  0.14  1.00
Elti  target.add_numerical  1.00  1.00  1.00
Hepatitis_std  Bio.add_numerical  0.46  0.30  1.00
Hepatitis_std  Bio.nb_rows_in_rel11  0.18  0.10  1.00
Hepatitis_std  dispat.add_numerical  0.25  0.25  0.25
Hepatitis_std  dispat.nb_rows_in_rel11  0.25  0.14  1.00
Hepatitis_std  dispat.nb_rows_in_rel12  0.13  0.07  1.00
Hepatitis_std  dispat.nb_rows_in_rel13  0.40  0.25  1.00
Hepatitis_std  indis.add_numerical  0.33  0.50  0.25
Hepatitis_std  inf.add_numerical  0.67  1.00  0.50
Hepatitis_std  rel11.add_numerical  0.67  1.00  0.50
Hepatitis_std  rel12.add_numerical  1.00  1.00  1.00
Hepatitis_std  rel13.add_numerical  1.00  1.00  1.00
Mesh  element.nb_rows_in_circuit  0.67  0.50  1.00
Mesh  element.nb_rows_in_circuit_hole  0.50  0.33  1.00
Mesh  element.nb_rows_in_cont_loaded  0.67  0.50  1.00
Mesh  element.nb_rows_in_equal  0.50  0.33  1.00
Mesh  element.nb_rows_in_fixed  0.67  0.50  1.00
Mesh  element.nb_rows_in_free  0.67  0.50  1.00
Mesh  element.nb_rows_in_half_circuit  0.67  0.50  1.00
Mesh  element.nb_rows_in_half_circuit_hole  0.67  0.50  1.00
Mesh  element.nb_rows_in_llong  0.50  0.33  1.00
Mesh  element.nb_rows_in_long_for_hole  0.40  0.25  1.00
Mesh  element.nb_rows_in_mesh  0.13  0.07  1.00
Mesh  element.nb_rows_in_mesh_test  0.13  0.07  1.00
Mesh  element.nb_rows_in_mesh_test_Neg  0.14  0.08  1.00
Mesh  element.nb_rows_in_neighbour_xy  0.67  0.50  1.00
Mesh  element.nb_rows_in_neighbour_yz  0.50  0.33  1.00
Mesh  element.nb_rows_in_neighbour_zx  0.67  0.50  1.00
Mesh  element.nb_rows_in_noload  0.67  0.50  1.00
Mesh  element.nb_rows_in_notimportant  0.67  0.50  1.00
Mesh  element.nb_rows_in_one_side_fixed  0.67  0.50  1.00
Mesh  element.nb_rows_in_one_side_loaded  0.67  0.50  1.00
Mesh  element.nb_rows_in_opposite  0.67  0.50  1.00
Mesh  element.nb_rows_in_quarter_circuit  0.50  0.33  1.00
Mesh  element.nb_rows_in_short_for_hole  0.67  0.50  1.00
Mesh  element.nb_rows_in_sshort  0.67  0.50  1.00
Mesh  element.nb_rows_in_two_side_fixed  0.67  0.50  1.00
Mesh  element.nb_rows_in_two_side_loaded  1.00  1.00  1.00
Mesh  element.nb_rows_in_usual  0.67  0.50  1.00
Mesh  mesh_test_Neg.add_numerical  1.00  1.00  1.00
Mesh  mesh_test.add_numerical  0.67  0.50  1.00
Mesh  mesh.add_numerical  1.00  1.00  1.00
Mesh  nnumber.add_numerical  0.29  0.17  1.00
Mesh  nnumber.nb_rows_in_mesh  0.29  0.17  1.00
Mooney_Family  person.nb_rows_in_aunt  0.02  0.01  1.00
Mooney_Family  person.nb_rows_in_brother  0.05  0.03  1.00
Mooney_Family  person.nb_rows_in_daughter  0.08  0.04  1.00
Mooney_Family  person.nb_rows_in_father  0.04  0.02  1.00
Mooney_Family  person.nb_rows_in_husband  0.08  0.04  1.00
Mooney_Family  person.nb_rows_in_mother  0.04  0.02  1.00
Mooney_Family  person.nb_rows_in_nephew  0.02  0.01  1.00
Mooney_Family  person.nb_rows_in_niece  0.02  0.01  1.00
Mooney_Family  person.nb_rows_in_sister  0.03  0.02  1.00
Mooney_Family  person.nb_rows_in_son  0.11  0.06  1.00
Mooney_Family  person.nb_rows_in_uncle  0.02  0.01  1.00
Mooney_Family  person.nb_rows_in_wife  0.08  0.04  1.00
NBA  Actions.add_numerical  0.40  0.40  0.40
NBA  Game.add_numerical  0.05  0.03  1.00
NBA  Game.nb_rows_in_Actions  0.04  0.02  1.00
NBA  Player.add_numerical  0.40  0.25  1.00
NBA  Player.nb_rows_in_Actions  0.50  0.33  1.00
NBA  Team.add_numerical  0.01  0.00  1.00
NBA  Team.nb_rows_in_Actions  0.02  0.01  1.00
NBA  Team.nb_rows_in_Game  0.02  0.01  1.00
Same_gen  person.nb_rows_in_parent  0.17  0.09  1.00
Same_gen  person.nb_rows_in_same_gen  0.25  0.14  1.00
Same_gen  person.nb_rows_in_target  0.20  0.11  1.00
Same_gen  target.add_numerical  1.00  1.00  1.00
Toxicology  atom.nb_rows_in_connected  0.50  0.33  1.00
Toxicology  molecule.nb_rows_in_atom  0.29  0.17  1.00
Toxicology  molecule.nb_rows_in_bond  0.25  0.14  1.00
UW_std  advisedBy.add_numerical  0.80  0.67  1.00
UW_std  course.add_numerical  1.00  1.00  1.00
UW_std  course.nb_rows_in_taughtBy  0.22  0.13  1.00
UW_std  person.add_numerical  0.50  1.00  0.33
UW_std  person.nb_rows_in_advisedBy  0.29  0.17  1.00
UW_std  person.nb_rows_in_taughtBy  0.15  0.08  1.00
UW_std  taughtBy.add_numerical  0.80  0.67  1.00

Table C.1: This table shows the F-measure, precision and recall of our implementation of the Column Map Discovery primitive when tested against eleven relational datasets using the evaluation method proposed in Section 7.3.1.

Bibliography

[1] CCPA. https://oag.ca.gov/privacy/ccpa. Accessed: 2020-05-04.

[2] GDPR. https://gdpr-info.eu/. Accessed: 2020-05-04.

[3] Metadata.json. https://github.com/HDI-Project/MetaData.json. Accessed: 2020-05-04.

[4] RDT. https://sdv-dev.github.io/RDT/. Accessed: 2020-05-14.

[5] TPC. http://www.tpc.org/default5.asp. Accessed: 2020-05-03.

[6] Parag Agrawal, Omar Benjelloun, Anish Das Sarma, Chris Hayworth, Shubha Nabar, Tomoe Sugihara, and Jennifer Widom. Trio: A system for data, uncertainty, and lineage. Proc. of VLDB 2006 (demonstration description), 2006.

[7] Michael Backes, Niklas Grimm, and Aniket Kate. Data lineage in malicious environments. IEEE Transactions on Dependable and Secure Computing, 13(2):178–191, 2015.

[8] Omar Benjelloun, Anish Das Sarma, Alon Halevy, and Jennifer Widom. ULDBs: Databases with uncertainty and lineage. In Proceedings of the 32nd international conference on Very large data bases, pages 953–964. VLDB Endowment, 2006.

[9] Deepavali Bhagwat, Laura Chiticariu, Wang-Chiew Tan, and Gaurav Vijayvargiya. An annotation management system for relational databases. The VLDB Journal, 14(4):373–396, 2005.

[10] Peter Buneman, Adriane Chapman, and James Cheney. Provenance management in curated databases. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 539–550, 2006.

[11] Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan. Data provenance: Some basic issues. In International Conference on Foundations of Software Technology and Theoretical Computer Science, pages 87–93. Springer, 2000.

[12] Peter Buneman, Sanjeev Khanna, and Tan Wang-Chiew. Why and where: A characterization of data provenance. In International conference on database theory, pages 316–330. Springer, 2001.

[13] Yingwei Cui and Jennifer Widom. Practical lineage tracing in data warehouses. In Proceedings of 16th International Conference on Data Engineering (Cat. No. 00CB37073), pages 367–378. IEEE, 2000.

[14] Yingwei Cui and Jennifer Widom. Lineage tracing for general transformations. The VLDB Journal—The International Journal on Very Large Data Bases, 12(1):41–58, 2003.

[15] Yingwei Cui, Jennifer Widom, and Janet L Wiener. Tracing the lineage of view data in a warehousing environment. ACM Transactions on Database Systems (TODS), 25(2):179–227, 2000.

[16] Laura Gustafson. Bayesian tuning and bandits: an extensible, open source library for AutoML. PhD thesis, Massachusetts Institute of Technology, 2018.

[17] Thomas Heinis and Gustavo Alonso. Efficient lineage tracking for scientific workflows. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1007–1018, 2008.

[18] Jiansheng Huang, Ting Chen, AnHai Doan, and Jeffrey F Naughton. On the provenance of non-answers to queries over extracted data. Proceedings of the VLDB Endowment, 1(1):736–747, 2008.

[19] Robert Ikeda and Jennifer Widom. Data lineage: A survey. Technical report, Stanford InfoLab, 2009.

[20] James Max Kanter and Kalyan Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10. IEEE, 2015.

[21] Arunprasad P Marathe. Tracing lineage of array data. Journal of Intelligent Information Systems, 17(2-3):193–214, 2001.

[22] Patrick McDaniel. Data provenance and security. IEEE Security & Privacy, 9(2):83–85, 2011.

[23] JP Moresmau, Fabian Schyns, Uta Sommerweiss, Lothar Grabowsky, Jens-Uwe Richter, Henric Gomes, Gerald Csapo, Karsten Baensch, Gunter Wiedemer, Micha Treiber, et al. Intelligent metadata management and data lineage tracing, September 21 2017. US Patent App. 15/457,808.

[24] Jan Motl and Oliver Schulte. The CTU Prague relational learning repository. arXiv preprint arXiv:1511.03086, 2015.

[25] Christopher Ré and Dan Suciu. Approximate lineage for probabilistic databases. Proceedings of the VLDB Endowment, 1(1):797–808, 2008.

[26] Alexandra Rostin, Oliver Albrecht, Jana Bauckmann, Felix Naumann, and Ulf Leser. A machine learning approach to foreign key discovery. In WebDB, 2009.

[27] Maksims Volkovs and Richard S Zemel. Efficient sampling for bipartite matching problems. In Advances in Neural Information Processing Systems, pages 1313–1321, 2012.

[28] Allison Woodruff and Michael Stonebraker. Supporting fine-grained data lineage in a database visualization environment. In Proceedings 13th International Conference on Data Engineering, pages 91–102. IEEE, 1997.

[29] Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M Procopiuc, and Divesh Srivastava. On multi-column foreign key discovery. Proceedings of the VLDB Endowment, 3(1-2):805–814, 2010.
