DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Semantic Integration across Heterogeneous Databases: Finding Data Correspondences using Agglomerative Hierarchical Clustering and Artificial Neural Networks

MARK HOBRO

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Computer Science
Date: April 11, 2018
Supervisor: John Folkesson
Examiner: Hedvig Kjellström
Swedish title: Semantisk integrering mellan heterogena databaser: Hitta datakopplingar med hjälp av hierarkisk klustring och artificiella neuronnät
School of Computer Science and Communication


Abstract

The process of data integration is an important part of the database field when it comes to database migrations and the merging of data. The research in the area has grown with the addition of machine learning approaches in the last 20 years. Due to the complexity of the research field, no go-to solutions have appeared. Instead, a wide variety of ways of enhancing database migrations have emerged. This thesis examines how well a learning-based solution performs for the semantic integration problem in database migrations.

Two algorithms are implemented. One is based on information retrieval theory, with the goal of yielding a matching result that can be used as a benchmark for measuring the performance of the machine learning algorithm. The machine learning approach is based on grouping data with agglomerative hierarchical clustering and then training a neural network to recognize patterns in the data. This allows making predictions about potential data correspondences across two databases.

The results show that agglomerative hierarchical clustering performs well in the task of grouping the data into classes. The classes can in turn be used for training a neural network. The matching algorithm gives a high recall of matching tables, but improvements are needed to achieve both high recall and high precision.

The conclusion is that the proposed learning-based approach, using agglomerative hierarchical clustering and a neural network, works as a solid base for semi-automating the data integration problem seen in this thesis. But the solution needs to be enhanced with scenario-specific algorithms and rules to reach the desired performance.

Sammanfattning

Data integration is an important part of the database field when it comes to database migrations and the merging of data. Research in the area has grown as machine learning has become an attractive approach over the last 20 years. Due to the complexity of the research field, no optimal solutions have been found. Instead, several different techniques have been developed that together can improve database migrations. This thesis examines how well a solution based on machine learning performs for the data integration problem in database migrations.

Two algorithms have been implemented. One is based on information retrieval theory and is mainly used as a performance baseline for the algorithm based on machine learning. That algorithm consists of a first step where data is grouped using hierarchical clustering. An artificial neural network is then trained to find patterns in these groupings, in order to predict whether different data instances have a correspondence between two databases.

The results show that agglomerative hierarchical clustering performs well in the task of classifying the data that has been used. The result of the matching algorithm shows that a large share of the matching tables can be found, but improvements are needed to achieve both a high recall of matches and a high precision for the matches that are found.

The conclusion is that a learning-based approach, in this case using agglomerative hierarchical clustering and then training an artificial neural network, works well as a basis for partly automating a data integration problem like the one presented in this thesis. To obtain better results, the solution needs to be improved with more situation-specific algorithms and rules.

Acknowledgements

I would first like to thank my supervisor at KTH, John Folkesson, for all the support during the project as well as my examiner, Hedvig Kjellström, for making this thesis possible. I also want to express my gratitude to my supervisor at Sokigo, Kevin James, for giving me the opportunity to do my thesis at their office. Finally, I would like to thank my family and friends who have supported me throughout the entire process.

Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Definition
  1.3 Limitation
  1.4 Sustainability and Ethics
  1.5 Outline of Report

2 Background
  2.1 Semantic Integration
    2.1.1 Schema Matching
    2.1.2 Instance Matching
    2.1.3 Match Cardinality
  2.2 Rule-based Matching
  2.3 Information Retrieval
    2.3.1 Tf-idf and Cosine Similarity
  2.4 Learning-based Approach
    2.4.1 Data Representation
    2.4.2 Agglomerative Hierarchical Clustering
    2.4.3 Neural Network
  2.5 Difficulties in Schema Integration
    2.5.1 Linguistic Challenges
    2.5.2 Structural Ambiguity
  2.6 Related Work

3 Method
  3.1 Setup and Development Tools
  3.2 Dataset
    3.2.1 Preprocessing
  3.3 Tf-idf and Cosine Similarity
  3.4 Agglomerative Hierarchical Clustering
    3.4.1 Effect of Normalization
    3.4.2 Analysing Clustering Outcome
    3.4.3 Testing Dissimilarity Measures
  3.5 Neural Network
    3.5.1 Defining the Model
    3.5.2 Grid Search
    3.5.3 Name Matcher
    3.5.4 Training Setup
  3.6 Evaluating Performance
    3.6.1 Application in Database Environments
    3.6.2 Evaluating Testing Accuracy

4 Result
  4.1 Tf-idf and Cosine Similarity
    4.1.1 Precision and Recall
  4.2 Agglomerative Hierarchical Clustering
    4.2.1 Clustering Execution Times
    4.2.2 Clustering Visualization
  4.3 Neural Network
    4.3.1 Grid Search
    4.3.2 Model Evaluation
    4.3.3 Precision and Recall

5 Discussion
  5.1 Tf-idf and Cosine Similarity
  5.2 Agglomerative Hierarchical Clustering
  5.3 Neural Network
  5.4 About the Semantic Integration Problem
  5.5 Methodology

6 Conclusion
  6.1 Future Work

Bibliography

A Agglomerative Hierarchical Clustering

B Cluster Content Explanation

Chapter 1

Introduction

Semantic integration has been a widely discussed topic in the last three decades. The earlier work focused on structural integration, i.e. building global data models with the goal of integrating well-structured data [39]. With the continuous growth of information on the Web, semi-structured and unstructured data became more common. Semantic integration of data was born out of the problems that came with this evolution. Dealing with heterogeneity across different data sources is complex, especially since it needs to be done on several levels, such as the schema and instance data level [39]. Therefore, semantic integration will continue to be a research field that requires attention, and it is nowadays an important area of database research [12].

1.1 Motivation

Semantic integration has evolved into a significant research area in the database community with the constant need of processing complex data. The research has been taking two different paths. Firstly, it has been investigated whether semantic integration is applicable to integrate data automatically (or at least semi-automatically) in specific scenarios. Secondly, the research has been focused on finding generic solutions, where semantic integration can be applied to build ontologies that can integrate a wider scope of heterogeneous data. The latter is rather complex, due to the immense diversity and range of data sources that can be encountered. Because of the diversity of data today, there are many data scenarios in which semantic integration has not been attempted for automatic schema and data integration.


Therefore, there is a need to examine how well different semantic integration techniques perform for real-world data. Automation of database migrations is something that can be widely useful. It is not uncommon for companies to migrate data between databases, and since modern technology evolves so swiftly, it is interesting to investigate how data migrations can be made more efficient.

1.2 Problem Definition

Earlier research in semantic integration for database systems has shown that it is possible to automate the integration process to a certain extent. But in many studies, the test data has been generic and has not covered many of the problems that come with matching complex and heterogeneous data. In this study, one goal is to build a generic solution for the integration process of complex and field-specific data and analyse the difficulties that may appear.

It is interesting to investigate how different strategies, such as rule- and learning-based ones, scale with increased data complexity. Will deep neural networks show better performance? Other questions that will be considered are how well a learning-based approach can be applied to the relational data and whether a learning-based approach alone is enough to solve the integration problem. Is agglomerative hierarchical clustering a preferable solution for establishing classes that can be used for training a neural network?

The hypothesis is that both agglomerative hierarchical clustering and neural networks can be used to semi-automate the semantic integration problem, but will need the help of well-constructed rules when the complexity of the data increases. Semi-automation here means that the learning-based approach can distinguish the patterns between databases, but is not sufficient on its own to yield optimal results (finding all data correspondences, which would mean full automation).

1.3 Limitation

The data used in this study is restricted to real-world relational data from SQL database management systems (DBMS). The data is taken from Sokigo AB and two of their databases (one legacy system and a newer system), which partly contain similar data.

Although a wide variety of approaches to semantic integration are discussed in this thesis, the work is restricted to further investigating a learning-based implementation and how well it can perform in the task of finding corresponding data. For easier performance evaluation, another algorithm which is based on information retrieval techniques will be implemented. This gives another perspective of the semantic integration problem and what the expectations can be for the performance results of the learning-based approach. Database migrations can be a tedious process consisting of many other challenges apart from data integration. But this project is focused on finding data correspondences between heterogeneous databases.

1.4 Sustainability and Ethics

The work in this thesis focuses on investigating an approach to data integration in database migrations with the aim to make the process more time efficient. Speeding up database migrations and increasing automation can lead to a reduction in the need of human resources. This can be beneficial mainly from an economic perspective. It is difficult to advocate for the environmental advantages of the work in this thesis. The economic growth that efficient database migrations could lead to does not imply any positive consequences for environmental sustainability. Database migrations are also conducted infrequently.

One ethical aspect that should be taken into account when performing this kind of study on real-world data is how to handle personal information. It is common to store customer data in relational databases and keeping the integrity of this data is important. In this thesis, a learning-based approach is proposed to improve the automation of data integration in database migrations. A positive property of this approach is that it does not necessarily need to work with the actual data, but only the metadata. This means that the sensitive parts of the data will not be exposed.

The conducted work lacks a connection with environmental sustainability. But the previous paragraphs show that the work at least touches the subject of social and economic sustainability.

1.5 Outline of Report

Chapter two contains theory and related work in the semantic integration research area that is important for the subsequent chapters. Chapter three explains the experiments that have been conducted, followed by chapter four, which shows the results based on these experiments. In chapter five, the results are analysed and discussed. The last chapter, chapter six, contains a conclusion and proposes what could be improved in future extensions of this project.

Chapter 2

Background

The following chapter contains background theory and a brief overview of related work in the research field of semantic integration.

2.1 Semantic Integration

As proposed by Rahm and Bernstein [31], it is easier to talk about semantic integration divided into two categories, namely schema and instance level matching. Schema level matching focuses on dependencies and relations between components of a schema. Instance specific information, such as the actual data content, is not taken into consideration in schema level matching but instead in instance level matching.

2.1.1 Schema Matching

Schema matching consists of looking at a combination of numerous elements, to find relations between different data tables or databases [31]. An example would be the process of integrating a legacy database into a new database where the database structures have substantial differences. To make this possible, any relationship between the legacy database and the new database needs to be distinguished. The structures of the databases are likely to cause ambiguity for the semantic relationship between them.

In a database, there are two main properties that should be considered when matching schemas. Firstly, the metadata properties of certain elements like attribute name, data type and description can be used to find similarities between heterogeneous data. Secondly, matches can be found by looking at the structural dependencies in a schema. An example of a technique that can be used is to consider the constraints between tables, to figure out which tables have relations between them [31].

2.1.2 Instance Matching

Instance data can also be used for semantic integration. Finding patterns between instance data in different sources falls into the category of instance matching [31]. Information retrieval is one approach that is of use here. One of the most trivial approaches is to match on attribute names, but this will fall short in many scenarios, especially when there is an ambiguity between corresponding data. In this case it can be interesting to calculate the similarity between the attribute data. Information retrieval algorithms are applicable here. It is also common to exploit language properties of the given attributes (looking at synonyms, edit distances etc.) [31]. However, the instance matching approaches assume that both databases contain instance specific data, which might not always be the case.

2.1.3 Match Cardinality

The matching problem can be divided into classes of cardinality. The most basic matching is of the 1:1 cardinality. This could be the matching of the attributes property and estate. The 1:n or n:1 cardinality specifies the matching of multiple elements to one element, e.g. firstName and lastName with name. The n:m cardinality means matching multiple elements with multiple other elements, with both sets of elements having the same context. The matching of property and owner with propertyName, propertyID, propertyOwner and propertyOwnerID is an example of an n:m matching [15][31]. The 1:1 and 1:n cardinalities will be focused on in this report. The n:m cardinality is rather complex and not widely researched [15].

2.2 Rule-based Matching

To carry out integration of relational data, expert knowledge is in many cases needed. Real-world data can be complex and field specific and in these cases supervision from experts in the area might be needed. The foundation of semantic integration is based on tailored rules and the performance of the rules is strictly dependent on the dataset. The research field of semantic integration is constantly evolving, and the scope of the available matching rules is rather wide. Therefore, only a few common rules will be explained below.

Attribute equality is one of the simplest rules. It tries to match two elements based on their attribute name. This could either be substrings of the names, the whole name or an edit distance [31].

Data type comparisons can be effective to reduce the complexity of the matching problem by clustering elements with the same data type together. Li and Clifton [24] use a clustering technique to find a better distinction of which elements are possible matches.

Data length can give a lot of information about corresponding data. A column containing strings (for example varchar) with length 4 and one with length 256 are probably not going to be a good match [24]. However, this relies on the databases using a similar scheme for defining lengths.
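To make these rules concrete, the sketch below shows minimal versions of the three rules above: name matching via an edit-distance ratio, data type grouping and a length comparison. The column representation, groupings and thresholds are assumptions made for this example, not rules taken from the cited systems.

from difflib import SequenceMatcher

# Hypothetical column description: (name, data_type, max_length), e.g. ("OwnerName", "nvarchar", 256)
ColumnMeta = tuple

def name_similarity(a: str, b: str) -> float:
    """Attribute equality rule: edit-distance based ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

TYPE_GROUPS = {  # assumed grouping of comparable data types
    "int": "numeric", "bigint": "numeric", "tinyint": "numeric",
    "decimal": "float", "float": "float",
    "nvarchar": "text", "varchar": "text", "char": "text", "text": "text",
    "datetime": "datetime", "bit": "bit",
}

def same_type_group(t1: str, t2: str) -> bool:
    """Data type rule: types are comparable if they fall in the same group."""
    return TYPE_GROUPS.get(t1) == TYPE_GROUPS.get(t2)

def length_compatible(l1: int, l2: int, max_ratio: float = 4.0) -> bool:
    """Data length rule: reject pairs whose declared lengths differ too much."""
    lo, hi = sorted((l1, l2))
    return lo > 0 and hi / lo <= max_ratio

def rule_based_match(c1: ColumnMeta, c2: ColumnMeta, name_threshold: float = 0.7) -> bool:
    """A column pair is a candidate match only if all three rules agree."""
    (n1, t1, l1), (n2, t2, l2) = c1, c2
    return (name_similarity(n1, n2) >= name_threshold
            and same_type_group(t1, t2)
            and length_compatible(l1, l2))

print(rule_based_match(("OwnerName", "nvarchar", 128), ("owner_name", "varchar", 256)))

A real system would combine many more rules and weight them, but even this small example shows how strongly the outcome depends on dataset-specific choices such as the type groups and the length ratio.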

2.3 Information Retrieval

In the case when instance data is available for both data sources, it is possible to use information retrieval techniques to compare the similarity of the respective data sources.

2.3.1 Tf-idf and Cosine Similarity

A common information retrieval approach to represent data in documents is called tf-idf (Term frequency-inverse document frequency). Tf-idf is used together with a popular similarity measure called cosine similarity. This approach is based on comparing term frequency vectors, which contain word occurrences in documents. The term frequency measure can also be extended to take the inverse document frequency into account (this is where the name tf-idf comes from). The inverse document frequency mitigates the importance of terms that appear frequently across several documents, by normalizing the term frequency for such values [18]. Based on the tf-idf representation, it is possible to calculate the similarity between these vectors using cosine similarity.

Cosine similarity calculates the cosine of the angle between two vectors. A resulting value of 1 means that the two documents are identical. A resulting value of 0 means that the vectors have no intersection of content at all [18].

\text{similarity}(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}   (2.1)

Equation 2.1 denotes the cosine similarity between vectors x and y [32]. The cosine similarity measure can be applied to columns in database tables. The columns are perceived as documents and the term frequency is calculated for each column.
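As a small illustration of equation 2.1, the snippet below computes the cosine similarity of two term-frequency vectors with NumPy; the toy vectors stand in for the term frequencies of two columns.

import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine of the angle between x and y, as in equation 2.1."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Toy term-frequency vectors for two columns over a shared vocabulary.
col_a = np.array([3, 0, 1, 2], dtype=float)
col_b = np.array([2, 0, 0, 3], dtype=float)
print(cosine_similarity(col_a, col_b))  # 1.0 = identical direction, 0.0 = no overlap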

2.4 Learning-based Approach

When it comes to semantic integration, machine learning approaches have the benefit of improving future predictions by learning to see patterns in the data. A rule-based approach demands expert knowledge of the data and a deeper analysis prior to constructing a system for conducting the integration. If a learning-based approach is used, more focus can be put on analysing the result and confirming matches afterwards. Trying to construct rules to match complex heterogeneous data can be tedious, which is a process that machine learning can simplify [31][38].

2.4.1 Data Representation

To build a robust machine learning model, it is important to optimize the model for the dataset that is used. This means avoiding overfitting and underfitting. Overfitting is when a model is not generalized enough, i.e. the model has been influenced too much by the training data. Underfitting is when a model is too generalized, i.e. it is not catching enough of the behaviour of the data. The consequence of both overfitting and underfitting is that the model will not be able to make good predictions on test data [36].

The first step to realize the training and test data is to model what metadata is desirable for the problem in question. There are numerous ways the metadata can be exploited to train a model to find correspondences between data sources. It is common to make a numerical representation of the data, in the range 0 to 1 [24]. The problem lies in translating different types of metadata to the same numerical representation (relational data is generally diverse). To do this, normalization and scaling of metadata is needed for certain cases. Li and Clifton [24] mention three main classes of data and their respective normalization method. The classes are binary, category and range values. Binary values can be categorized as either 1 or 0 (true or false), for example whether a field is a primary key or not. Data types can be classed as category values, i.e. the values are nominal. Therefore, different data types cannot be given values between 0 and 1, because then assumptions would have to be made about how similar different data types are [24]. Instead, Li and Clifton [24] propose dividing the data types into a vector of binary values. In a neural network, this means that each data type will have its own input value, either 0 or 1, depending on whether the specific metadata corresponds to that certain data type or not. The last class is values that can be put in a range (for example length values). These values can be normalized with a function to fit the criteria of being in the range [0, 1] [24]. Li and Clifton [24] propose a normalization function that is based on the Sigmoid [16] function.

f(x) = 2\left(\frac{1}{1 + e^{-cx}} - 0.5\right)   (2.2)

Equation 2.2 can be used to normalize values from a specific range to the range [0, 1]. The coefficient c decides what range the function can handle [24]. The advantage with this function is that outliers will not affect the total distribution of the normalized values. For example, if c = 0.01, values between 0 and 500 are well distributed in the range [0, 1]. Values that are larger than 500 will still be normalized to a value close to 1 and therefore not have a negative impact on the total distribution. This property can be desirable when working with database values.

Another alternative to normalize data is the min-max normalization [19]. It is a standard method of normalizing values in a given range. Given a set of values X and x ∈ X, the min-max normalization maps x into the range [0, 1] according to the minimum and maximum values of X.

norm(x) = \frac{x - \min(X)}{\max(X) - \min(X)}   (2.3)
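A minimal sketch of the two normalization alternatives follows. The sigmoid-based function uses the reconstructed form of equation 2.2 above and should be read as an assumed version of Li and Clifton's function rather than a verbatim copy.

import numpy as np

def sigmoid_normalize(x: np.ndarray, c: float = 0.01) -> np.ndarray:
    """Equation 2.2: maps non-negative values into [0, 1); outliers saturate near 1."""
    return 2.0 * (1.0 / (1.0 + np.exp(-c * x)) - 0.5)

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Equation 2.3: linear rescaling of x into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

lengths = np.array([1, 32, 256, 500, 8000], dtype=float)  # e.g. declared column lengths
print(sigmoid_normalize(lengths))   # 8000 saturates close to 1 instead of skewing the scale
print(min_max_normalize(lengths))   # 8000 dominates; the smaller values end up close to 0

The example output illustrates the point made above: with min-max normalization a single large length value compresses all other values towards 0, while the sigmoid-based function keeps them spread out.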

2.4.2 Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering can be used to find patterns in data. All data points start as a cluster and are sequentially merged together using a dissimilarity measure. The cluster pair with the shortest distance between them is merged next in the sequence. This continues until one large cluster is formed [26]. The advantage with this type of clustering technique is that a fixed number of clusters does not need to be defined in advance. The optimal number of clusters can be found afterwards. This can be done by simply interrupting the clustering sequence at any given point that yields a desirable number of clusters.

Dissimilarity Measures

To decide the dissimilarity between two clusters/data points, a dissimilarity measure is used to calculate the distance between the clusters. The goal is to find the two clusters that are the closest to each other, i.e. have the shortest distance. This is done to figure out which clusters to merge in each iteration [32]. There are multiple alternatives of such methods. A few of the most commonly used dissimilarity measures are explained below.

D(X, Y) = \min_{x \in X,\, y \in Y} d(x, y)   (2.4)

Equation 2.4 is called Single linkage and denotes the minimal distance between two data points x and y in clusters X and Y [32].

D(X, Y) = \max_{x \in X,\, y \in Y} d(x, y)   (2.5)

Equation 2.5 is the Complete linkage distance measure. Complete linkage takes the maximum distance d(x, y) (the data points farthest away from each other) for the clusters X and Y and then looks for the cluster pair for which that distance is the shortest [32].

D(X, Y) = \frac{1}{|X||Y|} \sum_{x \in X} \sum_{y \in Y} d(x, y)   (2.6)

Equation 2.6 is the Average linkage, which calculates the average distance between all the data points in clusters X and Y [32].

D(X, Y) = \lVert c_X - c_Y \rVert   (2.7)

Equation 2.7, with the respective cluster centers seen in equation 2.8, denotes the distance between the cluster centers c_X and c_Y and is called Centroid linkage.

c_X = \frac{1}{|X|} \sum_{x \in X} x \quad \text{and} \quad c_Y = \frac{1}{|Y|} \sum_{y \in Y} y   (2.8)

The last dissimilarity measure is called Ward's method and can be seen in equation 2.9.

D(X, Y) = \frac{|X||Y|}{|X| + |Y|} \lVert c_X - c_Y \rVert^2   (2.9)

There are different interpretations and implementations of the algorithm, but they have a common methodology [27]. Ward's method looks at the resulting cluster of a merge of two clusters and tries to minimize the aggregate deviation. This is done by first calculating a centroid for the merged cluster X ∪ Y and then a sum of squared deviations for all the data points to the centroid of that cluster. This will be done for all potential new clusters and the resulting cluster with the minimal deviation is the one to be merged.

To calculate the distance between two points, the Euclidean distance is commonly used.

d(x, y) = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2}   (2.10)

The distance between the two points x and y in n-dimensional space is given by the Pythagorean formula specified in equation 2.10 [8].

Finding Optimal Number of Clusters

To decide when the clustering process should stop, there are several methods that can be used. One example is called the Elbow method. It looks at the variance between each cluster merge. The ratio between each cluster pair merge and the total variance will indicate which step gives the optimal clustering state (basically where further merges do not improve the model) [3].

Another method is to plot a dendrogram and analyse which iteration of the agglomerative hierarchical clustering yields an optimal state for the use case. A dendrogram is also a useful alternative for plotting high dimensional data, which otherwise is difficult to visualize. Finding a breaking point where the number of clusters is optimal for a specific use case is difficult [22], but by analysing the dendrogram of the clustering, it is easier to get an idea of desirable breaking points.
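The clustering, dendrogram and cutting steps described in this section map onto SciPy's hierarchy module roughly as in the sketch below; the random feature matrix and the distance threshold are placeholders for illustration only.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Stand-in feature matrix: one row per element to cluster, one column per metadata feature.
rng = np.random.default_rng(0)
features = rng.random((40, 6))

# Agglomerative hierarchical clustering with Ward's method and Euclidean distances.
Z = linkage(features, method="ward", metric="euclidean")

# Cut the merge sequence at a chosen distance threshold
# (found e.g. by inspecting the dendrogram: dendrogram(Z) draws the merge tree).
labels = fcluster(Z, t=1.5, criterion="distance")
print("number of clusters:", len(set(labels)))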

2.4.3 Neural Network

Neural networks have a wide range of uses and their applicability in semantic integration has been shown in past work such as the research conducted by Li and Clifton [24]. This work has been a building block for recent research.

A neural network consists of three or more layers: the input, output and hidden layers. If there is more than one hidden layer, it is called a deep neural network (an example of this can be seen in figure 2.1). The network consists of neurons and synapses. The synapses take an input value from a neuron, multiply it with a weight and output it to the next neuron. Each neuron adds up all the values from the input synapses and uses an activation function, e.g. Sigmoid or ReLU (Rectified Linear Units), to calculate a new value to be sent further in the network. Different activation functions can help to find non-linear patterns in data [23]. The input neurons are dependent on the dimensionality of the training data. Using three metadata properties to characterize data in a database yields three input neurons, one for each metadata property. The output neurons rely on what kind of classification problem is attempted to be solved. If the model should predict whether the input data forms an address or not, one output neuron can be used. The goal of the model is to recognize the patterns in the input data and then output either 0 or 1, depending on it being true or false (that the input is an address).

The process mentioned above of how the neurons and synapses work is called forward propagation. This definition does not include how the neural network actually trains itself on recognizing patterns in data to be able to carry out good predictions. The training part is done using back-propagation. After forward propagation, a loss/error function is calculated to give an indication of how much the model miscalculated according to the expected output. This error function is propagated back in the network to help adjust the weights to possibly get better predictions. This is done with a computed gradient. The gradient tells in what direction to step to move towards a minimal error measure. By knowing in what direction the error is changing, the process of finding a global minimum for the error can be done efficiently. Gradient descent is a method for this, but it is quite computationally expensive. Stochastic gradient descent (SGD) can help with this, by estimating the gradient and not fully calculating it. For example, it can take a subset of the training examples as input and then calculate the average gradient for those examples repeatedly until the error converges to a minimum (this is commonly called mini-batch) [23].

w_{t+1} = w_t - \eta_t \nabla Q(w_t)   (2.11)

Equation 2.11 denotes the update step of gradient descent, where Q(w_t) is an error measure and \nabla Q(w_t) is the respective gradient (\eta_t is the learning rate).

Q(w_t) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w_t)   (2.12)

\nabla Q(w_t) = \frac{1}{n} \sum_{i=1}^{n} \nabla Q_i(w_t)   (2.13)

At each step of the SGD algorithm, the weight should be recalculated according to the gradient. The algorithm depends on the training examples being randomly picked at all times [4]. Otherwise it could happen that the algorithm cannot generalize the entirety of the dataset and fails to find the minimum. Another method that is commonly used to minimize the loss function is called Adam [21].

Figure 2.1: Example of a deep neural network
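To make equations 2.11–2.13 concrete, the following sketch runs a few epochs of mini-batch SGD on a linear least-squares problem; the model and data are placeholders chosen only to show the update rule.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                  # training inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)   # noisy targets

w = np.zeros(3)        # w_t
eta = 0.1              # learning rate eta_t
batch_size = 20

for epoch in range(50):
    order = rng.permutation(len(X))            # examples must be picked randomly
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = X[idx] @ w - y[idx]              # per-example errors on the mini-batch
        grad = X[idx].T @ err / len(idx)       # average gradient, cf. equation 2.13
        w = w - eta * grad                     # update step, cf. equation 2.11

print(w)  # approaches true_w as the squared-error loss converges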

Loss Functions

The loss function measures the error of a certain prediction and this is the function that SGD wants to minimize [4]. By minimizing this, there is a higher chance of making good predictions, and the SGD algorithm should iterate until the error is low enough and not much improvement is made by further iterations. There exist numerous loss functions, but two of the most common ones are called Cross-entropy and Mean squared error. Mean squared error (MSE) calculates the mean of the squares of the error values.

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{X}_i - X_i)^2   (2.14)

Here \hat{X}_i denotes the predicted value and X_i is the observed value [17]. Cross-entropy can be used in a similar way as MSE, to calculate how far off the model's predictions are from the true values [7]. Cross-entropy does this by looking at a predicted and a true distribution, where p(x_i) denotes the desired probability and p(y_i) the actual probability.

H(x, y) = -\sum_{i} p(x_i) \log p(y_i)   (2.15)
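A small NumPy illustration of the two loss functions in equations 2.14 and 2.15; the example vectors are invented.

import numpy as np

def mse(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Mean squared error, equation 2.14."""
    return float(np.mean((y_pred - y_true) ** 2))

def cross_entropy(p_true: np.ndarray, p_pred: np.ndarray, eps: float = 1e-12) -> float:
    """Cross-entropy between a desired and a predicted distribution, equation 2.15."""
    return float(-np.sum(p_true * np.log(p_pred + eps)))

print(mse(np.array([0.9, 0.2]), np.array([1.0, 0.0])))
print(cross_entropy(np.array([1.0, 0.0, 0.0]), np.array([0.7, 0.2, 0.1])))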

Activation Functions

Every neuron in the hidden layer uses an activation function applied to the sum of all the weighted inputs plus a bias. This is done to form a representable output to the next layer. The goal of the activation function is to give a clear distinction between the paths taken in the neural network and the calculated outputs. To give an example, using a binary activation function yielding 0 or 1, it would be possible to have several outputs with the same value. This is a bad consequence of the activation function not being able to make good distinctions. The ReLU activation function is commonly used due to its ability to learn faster than other activation functions like Sigmoid and Tanh [23].

f(x_i) = \begin{cases} x_i & \text{if } x_i \geq 0 \\ 0 & \text{if } x_i < 0 \end{cases}   (2.16)

Equation 2.16 is the definition of ReLU, which is commonly seen in the notation f(x) = \max(0, x) [37]. Below, equation 2.17 denotes the Sigmoid function.

f(x) = \frac{1}{1 + e^{-x}}   (2.17)

In comparison to Sigmoid, ReLU will not have the same problem with vanishing gradients [16]. For example, if sigmoid is used in a multilayer network, the gradients will be multiplied for each layer. Using sigmoid, gradients will be between 0 and 1 (it converges to 1 for large values and 0 for small values as can be seen in figure 2.3) and this can lead to the gradients "vanishing". ReLU is also more sparse than Sigmoid. In the case when x ≤ 0, the ReLU function will just be 0, while a sigmoid function would yield a non-zero value [16].

Figure 2.2: The ReLU function.

Figure 2.3: The Sigmoid function.

2.5 Difficulties in Schema Integration

Integrating data across heterogeneous databases is not a trivial task and it comes with several difficulties that need consideration. Both schema and instance level integration come with their respective challenges. On schema level, complex structures can cause problems, while on instance level, complex language properties can be problematic.

2.5.1 Linguistic Challenges

One recurrent problem with databases and semantic integration is the often-complex compound schema attributes or field specific terms. If the data is complex, there could be a need to construct complex schema attribute names, which leads to difficulties when trying to process it [30]. The same problem appears with abbreviations. Terms might be used that are only known by a small number of people with expertise in the area. Processing this data is problematic, unless very specific rules are defined.

2.5.2 Structural Ambiguity

Integrating large data models comes with structural complexity. It is possible that you want to find matches between two databases that are using two different modeling structures. If one database uses a hierarchical structure and the other one a simpler relational model, the structural differences consequently lead to difficulties in utilizing structural characteristics for matching the schemas.

Lack of constraints, such as defined foreign keys for the relationships in the database, can also create confusion about which tables have a relation. Depending on the naming convention, it can be difficult to distinguish the relations between tables. If two tables have an incremental primary key with the same name, it is not trivial to decide whether or not these fields have a relation. But if a foreign key constraint is defined, it is easier to distinguish these relations.

2.6 Related Work

The research in the semantic integration area stretches from the earlier work with rule-based approaches to the more modern addition of machine learning [2][12]. Apart from the research in developing better rules and learning strategies, using techniques from the field of information retrieval and ontology research is common [28][10]. Due to the wide scope of data integration research, the related work mentioned in this report is just a few examples of what has been done earlier. The articles presented below cover both work done in the early 2000s (which laid a foundation to this field) and more recent research. It is interesting to include the early work, since the theory it is based on is still relevant and used in the more recently proposed solutions [2]. Many of the new schema matching approaches are combinations of techniques discussed in the earlier semantic integration research.

Semantic Integrator - SEMINT

SEMINT [24] uses a neural network with the motivation that "Clearly, there is no perfect procedure or known set of rules that solves the problems of identifying attribute correspondences. This is because that attribute relationships are usually fuzzy and the availability of database information may vary." [24]. They mention factors like fuzzy relationships between attributes, limited metadata and the lack of adaptability of pre-defined rules to different databases as reasons why learning-based algorithms are better suited for automatic matching in heterogeneous databases [24]. SEMINT takes advantage of more than ten metadata properties in the proposed approach. These include data types, length, primary keys, foreign keys, candidate keys etc. A Self-Organizing Map is used for classification of the properties before training in a neural network is carried out [24].

Cupid

Madhavan, Bernstein, and Rahm [25] implemented a schema matching system called Cupid [25]. One goal of the study was to develop a schema matching tool that is useful in more general situations and not only for a specific use case. Cupid took inspiration from SEMINT's clustering technique [24] and DIKE's [29] and ARTEMIS' [5] hybrid schema-based matchers. Cupid is schema-based only (using for example element-, structure- and linguistic matching), and not instance-based. Since Cupid is a generic solution, it is possible to improve the algorithms to handle n:m cardinality, but in the version proposed in [25], 1:1 and 1:n cardinality are the supported mappings [25].

COMA

COMA (the name comes from the two words combining and match) [10] is a matching system with the focus on combining different matching algorithms and is therefore a generic solution which is applicable for heterogeneous data sources. COMA has three different types of matchers. Simple matchers look at element names (using e.g. n-grams, edit distances or synonym relations) and data types to calculate a similarity measure. Hybrid matchers look at a combination of properties, e.g. element names together with data types or paths, child elements and leaf elements. The last matcher type is the reuse matcher, which takes previous matching results into account when matching new schemas. This technique relies on the assumption that previously matched schemas have similar matching patterns as future schemas [10].

QuickMig

QuickMig [14] is a data migration tool with the purpose of integrating data from one or more legacy databases into a new database. The architecture of QuickMig is based on COMA [10] with a few extensions. The first is an equality matcher, which matches equal instances across schemas. The second technique looks for elements that can be matched across schemas by either concatenating or splitting fields. The last improvement is a matcher based on a domain ontology [14].

Automatch

Automatch is a schema integration system based on Bayesian learning [1]. Automatch uses a knowledge base of schema attributes against which the schemas that are to be matched are compared. Both schemas calculate a score of the specific schema attribute against the knowledge base. These scores are then added together and form a total score for the matching. The schema matching with the highest score is predicted to be the best match [1]. The knowledge base consists of a finite set of possible attribute values with a corresponding probability. This structure allows the usage of Bayesian learning on the dataset. To reduce the number of calculations of possible matches between two schemas, a minimum-cost-maximum-flow graph is used. To reduce the size of the knowledge base dictionary, feature selection is used. Mutual Information, Information Gain and Likelihood Ratio are strategies tested to achieve fewer schema attributes in the knowledge base dictionary [1].

Learning Source Description - LSD

LSD approaches the schema integration problem by using base learners and a meta learner [11]. The predictions of the base learners are processed by the meta learner, which learns about the importance of the different base learners. The base learners implemented in LSD include a name matcher, a content matcher (matches on the data content), a Naive Bayes learner and an instance specific name recognizer. LSD adds to the field of schema integration by having the possibility to add or remove learning techniques. This allows the system's base learners to be customized for different data scenarios [11].

Analysing Database Queries

Ding and Sun [9] propose an approach where the structure of database queries is analysed to help with schema matching. Information about queries is extracted by reading logs and then a statistical evaluation is carried out for the position of attributes in each individual query. Using two scoring algorithms, the statistics are used to find an optimal schema mapping [9].

GLUE

When it comes to research on ontology-based approaches to semantic integration, GLUE [13] uses machine learning to generate semantic mappings. The name GLUE is based on how semantic correspondences act like glue to keep ontologies together [13]. It is based on calculating joint probability distributions of the concepts to be mapped. The probability of the conjunction of two schemas is sought. To calculate this, instances of one schema are used to train a classifier for the same schema. This classifier is used to do the same thing for the other schema. A multi-strategy approach is used for training the classifiers, where the results of different strategies are combined together in a meta-learner [13].

Chapter 3

Method

This chapter explains the experiments that have been carried out. The first algorithm is solely implemented to solve the matching problem with the given instance data and is based on information retrieval. The learning-based solution is a more generic implementation, which learns about the metadata patterns to find possible matching fields in the databases.

3.1 Setup and Development Tools

The work was done on a Windows 10 Professional 64-bit machine with the following specifications: Dell Precision M6500, Intel(R) Core(TM) i5 M650 2.67GHz (4 CPUs), 8 GB RAM and AMD Mobility Radeon HD 5800. Python was used for the implementation. Python provides useful libraries for machine learning development. In this thesis, the most important libraries that have been used are SciPy [34], Scikit-learn [33] and Keras [20] (with a TensorFlow backend).

3.2 Dataset

The dataset consists of data from SQL Server databases. The data is in the form of table and column metadata for the learning-based approach. For the information retrieval approach, the data contents of all the columns are used. The database used as training data is constructed independently of the legacy database, i.e. the structure is not based on the legacy database. The new database, which is called Ecos2, has a more hierarchical structure, where several tables belong to the same object. The legacy database, Ecos1, mostly has tables that belong to their own objects. Therefore, there is a greater chance of seeing a direct correspondence between two columns than two tables across the databases. Ecos1 consists of 183 tables and Ecos2 consists of 238 tables. Both databases contain around 2000 columns. Although Ecos1 and Ecos2 lack a clear structural dependency, there are some similarities between them which the solutions are supposed to figure out. For scientific purposes, the dataset taken from the databases is the result of an already completed database migration. It simplifies the process of measuring the performance of both solutions, since all the possible correspondences can be distinguished by analysing the instance data. The information retrieval approach also relies on having instance data in both databases.

3.2.1 Preprocessing

The metadata in its original form is not suited for training. It is very diverse, since it consists of integer values, decimal values, dates, attribute names etc. and therefore needs to be scaled and normalized into a workable representation.

The data types can be represented using a categorical binary vector as mentioned in section 2.4.1. The categories can be tailored according to a specific need. If two heterogeneous data sources are to be merged, it could happen that some columns use different data types (but still contain the same data). This can be solved by grouping different data types together to the same element in the binary vector. To give an example, categorizing the data types seen in table 3.1 could yield the vector seen below.

{1, 0, 0, 0, 0}

This vector says that the data type is either int, bigint, tinyint, uniqueidentifier or numeric. In this case there are five different groups: int, bigint, tinyint, uniqueidentifier and numeric form one group, decimal and float another group, nvarchar, char, varchar and text also form one group and lastly, datetime and bit are separate groups.

Table 3.1: Data types encountered in the databases

int, bigint, tinyint, uniqueidentifier, numeric, decimal, float, nvarchar, char, varchar, text, datetime, bit

Apart from the data types mentioned in table 3.1, the data types real, varbinary and image appeared less than five times altogether and the decision was made to discard them to reduce the complexity of the model. Decisions about grouping data types together to both reduce the complexity of the model and possibly increase the performance are important to how the training will turn out. The reason is that when for example a floating-point value is normalized, the normalization function should generalize floating-point values appearing in all the different data sources. With the categorical values, a bias can be added by saying that numeric fields in one database often are int fields in the other database. This will affect the model's chances of predicting those fields to have a connection.

There are several other metadata properties that can be utilized, apart from data types. The data used for training in this thesis can be divided into two groups. Firstly, the information schema metadata in a database consists of several properties of each column (data type is one of these). The main properties that have been utilized in this thesis are name, data type, length and numeric precision. For names, a simple string matcher was implemented. A common naming convention was the use of "id" in names and this was adopted when processing the data. Whether a field is a primary key or not is also utilized. Secondly, properties of data instances can also be utilized. The difference here is that information schema metadata is not dependent on the contents of the database, only the specifications of the schema. Only using information schema metadata opens the possibility of creating a generic solution. Instance data properties can only be used if the target schema already contains data. The columns do not necessarily have to be filled with data, but they must contain enough data to represent the possible variations. In this thesis, the focus has been to investigate the generic alternatives, but instance data properties have also been tested. These properties are count values for columns, and min and max length. From the min and max length of entries in one column it is possible to calculate a length difference or averages for that column. This gives information about how the data in the column looks.

The non-binary metadata, like the length of fields, needs to be normalized. As mentioned in section 2.4.1, the min-max normalization is a standard method for normalizing data. Equation 2.2 can be used similarly, to normalize values in a set, but disregarding the minimum and maximum values, with the goal to construct a uniform distribution in a specific range. During preprocessing of the data, both normalization functions have been tested.
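The sketch below illustrates how a single column could be turned into the grouped binary type vector plus a normalized length value described above. The grouping follows the example around table 3.1, while the exact feature layout is an assumption made for illustration, not the thesis implementation.

import numpy as np

# Data type groups as in the example above; each group is one position in the binary vector.
TYPE_GROUPS = [
    {"int", "bigint", "tinyint", "uniqueidentifier", "numeric"},
    {"decimal", "float"},
    {"nvarchar", "char", "varchar", "text"},
    {"datetime"},
    {"bit"},
]

def type_vector(data_type: str) -> list:
    """One-hot vector over the data type groups, e.g. 'int' -> [1, 0, 0, 0, 0]."""
    return [1 if data_type in group else 0 for group in TYPE_GROUPS]

def column_features(data_type: str, length: float, is_primary_key: bool,
                    c: float = 0.01) -> np.ndarray:
    """Feature row for one column: type group, sigmoid-normalized length, key flag."""
    # Negative lengths (e.g. -1 used for "max" in Ecos2) are clamped here for simplicity.
    norm_length = 2.0 * (1.0 / (1.0 + np.exp(-c * max(length, 0.0))) - 0.5)
    return np.array(type_vector(data_type) + [norm_length, float(is_primary_key)])

print(column_features("nvarchar", 256, False))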

3.3 Tf-idf and Cosine Similarity

Which rules to base an implementation on is dependent on the nature of the data. It is difficult to build an algorithm based on rules that are generic and applicable for a wide variety of datasets. The rules have to be decided upon after analysing the patterns in the data. The dataset used in this thesis is complex in the sense of lacking useful metadata characteristics and structural dependencies. Therefore, not much structural information could be used to match on the schema level. Instead, information about individual columns and tables was focused on for finding data correspondences. The principle behind applying cosine similarity to the problem of data integration in databases relies on both data sources containing data. All columns that are empty have been excluded from the dataset (both training and test data). The workflow of the implementation is as follows (a sketch of the pipeline follows the list):

• Extract data column-wise, creating two datasets with data for all columns of each table in the two databases

• Construct tf-idf vectors for both databases (term frequencies are calculated for a specific column)

• Calculate the cosine similarity between all the vectors in both databases and return the ones that are the most similar (the number of resulting matches is decided by a similarity threshold)

• Filter out false positives that are easy to spot, by adding a rule which removes special cases (for example, datetime values gave high similarity scores with specific decimal values)
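A hedged sketch of this pipeline with scikit-learn, treating the concatenated instance data of each column as one document. The column names, toy values and the similarity threshold are invented for the example.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each "document" is the concatenated instance data of one column (toy data here).
ecos1_columns = {
    "Property.OwnerName": "anna berg erik lund anna berg",
    "Property.Area": "120 85 240 61",
}
ecos2_columns = {
    "Estate.Owner": "erik lund anna berg maria ek",
    "Estate.LivingArea": "85 240 61 47",
}

vectorizer = TfidfVectorizer()
docs = list(ecos1_columns.values()) + list(ecos2_columns.values())
tfidf = vectorizer.fit_transform(docs)                    # shared vocabulary for both databases

n1 = len(ecos1_columns)
similarities = cosine_similarity(tfidf[:n1], tfidf[n1:])  # Ecos1 columns x Ecos2 columns

threshold = 0.3                                           # assumed similarity threshold
for i, src in enumerate(ecos1_columns):
    for j, dst in enumerate(ecos2_columns):
        if similarities[i, j] >= threshold:
            print(f"{src} -> {dst}: {similarities[i, j]:.2f}")

The filtering rule for special cases (such as datetime values colliding with decimal values) would be applied to the surviving pairs after this loop.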

3.4 Agglomerative Hierarchical Clustering

As seen in section 2.4.2, agglomerative hierarchical clustering is an unsupervised clustering technique that can find patterns in data without defining a desired number of output clusters. This characteristic, together with it being a commonly used method, motivates the choice of using agglomerative hierarchical clustering for this project. To evaluate the performance of the clustering, an experiment measuring the computation time and an analysis of the resulting clusters were performed.

3.4.1 Effect of Normalization

Since dissimilarity measures are being used when clustering the data, disregarding normalization could lead to one data entry having too big an impact on the distance calculations. The length field is one example of a metadata property that can cause problems. Ecos1 and Ecos2 use different specifications for the length of columns. Ecos2 uses the value −1 to specify maximum length, while Ecos1 uses the specific maximum value directly in the metadata. Mixing binary values with large length values will result in the clustering being biased towards the length value compared to the other metadata. As seen in the dendrogram in figure 3.1, the clusters are mostly formed at two ends, where one end contains the objects with small length values and the other end contains the objects with large length values. It is also notable that the clusters are merged with very small distance differences at both ends. This is the effect of the length values affecting the distance measures too much and giving a bad representation across all metadata. In figure 3.2 the cluster distribution is more uniform, and the maximum distance of a cluster merge is significantly smaller.

Figure 3.1: Agglomerative Hierarchical clustering without normalization

Figure 3.2: Agglomerative Hierarchical clustering with normalization

3.4.2 Analysing Clustering Outcome

Automating the decision of when to stop the clustering can be difficult. The reason for this is that the optimal number of clusters is dependent on the situation and this is difficult to evaluate automatically.

Elbow Method

As mentioned in section 2.4.2, there are methods that can find the optimal number of clusters. The Elbow method is one of those and was attempted on the clustered data in this thesis. Figure 3.3 suggests that the optimal number of clusters would be 3, according to the second derivative (the change in acceleration for a cluster merge is the greatest at this point). The problem with this is that the Elbow method does not take into account how modeling a neural network according to the clustering result will play out. The easiest way to control this is to cut off the clustering at a specific distance by analysing a dendrogram.

Figure 3.3: Elbow method of Agglomerative hierarchical clustering using Ward's method

Dendrograms

A dendrogram can give a visualization of high-dimensional data which can help to find an optimal number of clusters for a specific use case. As seen in figure 3.4, the goal is to have as many cluster merges as possible around the same distance. Clusters that were merged at a distance far from the threshold might not be optimal (since the data points in the cluster might differ a lot). But a good threshold can be found by trial and error, testing different cutting points.

Figure 3.4: Example of a Dendrogram with a cluster threshold using Ward’s method

3.4.3 Testing Dissimilarity Measures

To see how the agglomerative hierarchical clustering acts, several dissimilarity measures have been tested. As explained in section 2.4.2, the dissimilarity measures are different ways of finding the best cluster to merge at each iteration. The experiment was conducted by testing the Single linkage, Complete linkage, Average linkage, Centroid linkage and Ward's method as dissimilarity measures for the agglomerative hierarchical clustering. All objects start as their own cluster and at each iteration, two clusters are merged together according to the dissimilarity measure. This continues until the cutting point is reached.

For calculating the distances between single data points, the Euclidean distance has been used, which is shown in equation 2.10.
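This experiment can be expressed compactly with SciPy by looping over the five linkage methods on the same feature matrix and cutting each tree to a fixed number of clusters; the random data below is only a stand-in for the real metadata features.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
features = rng.random((60, 8))   # stand-in metadata feature matrix

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(features, method=method, metric="euclidean")
    labels = fcluster(Z, t=10, criterion="maxclust")   # cut to 10 clusters
    sizes = np.bincount(labels)[1:]
    print(f"{method:>8}: cluster sizes {sorted(sizes, reverse=True)}")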

3.5 Neural Network

Using a neural network to predict possible matches is supposed to give a solution that is generic enough to be applicable to different databases without having to analyze the data in advance. But the training parameters might produce worse results for one database compared to another; therefore, a combination of metadata that will be good for training in different scenarios is important.

3.5.1 Defining the Model

There are numerous factors that have to be taken into account when setting up the model. They will directly affect how well the model will learn and make predictions. To experiment with different parameters, a basic grid search was conducted, which will be discussed at the end of this section. The properties and parameters that have been considered are:

• Network complexity (number of layers and neurons)

• Number of epochs (full training steps for the whole dataset), batch size and learning rate

• Activation functions

• Loss functions

• Optimizers

• Regularization, dropout and early stopping

Activation Functions

The activation functions that have been used during testing are ReLU and Sigmoid. Sigmoid has the property of giving outputs between 0 and 1, which makes it viable as activation function in the output layer.

This is required to output a categorical classification (binary vector). ReLU has been used in the input and hidden layers.

Loss Functions

Mean squared error and Cross-entropy are two of the more popular loss functions and have been used in this thesis. For Cross-entropy, the variation used is the Categorical Cross-entropy, which is an adaptation for multi-class classification problems.

Epochs, Batch Size and Learning Rate

The number of epochs affects how many complete training cycles the model gets to train on the training data. The training data is shuffled and then reused for every epoch. At every iteration, a fixed batch size is used which takes as many training samples as specified and then trains the network repeatedly until all the training samples have been used (this happens in one epoch). The learning rate can be adjusted to increase or decrease the step size of the gradient descent (to potentially lead to a higher rate of minimizing the loss function).

Regularization, Dropout and Early Stopping

Regularizers and dropout can be used to prevent overfitting. Dropout can be explained as randomly excluding some neurons and their connections from the network, thus reducing the complexity of the model and making it more difficult to overfit the training data [35]. Another way of reducing the risk of overfitting is to apply weights or penalties to the loss function. During the tests, both of these utilities have been experimented with. Early stopping has been attempted as well, to avoid ending up with an overfit model after running an excessive number of epochs. Early stopping is simply stopping the training when the model has not made improvements in minimizing the loss function in the last few epochs.
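As an illustration of how these pieces fit together, the sketch below defines one candidate Keras model with ReLU hidden layers, a Sigmoid output layer, L2 regularization, dropout and early stopping. The layer sizes, input width and number of output classes are placeholders, not the model actually selected by the grid search.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

n_features, n_classes = 8, 10   # placeholder input width and number of cluster classes

model = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(n_features,),
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dropout(0.5),
    layers.Dense(32, activation="relu"),
    layers.Dense(n_classes, activation="sigmoid"),   # sigmoid output as described above
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Stop training when the validation loss has not improved for a few epochs.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)

# Toy training data: metadata feature rows with one-hot cluster labels.
X = np.random.random((200, n_features))
y = keras.utils.to_categorical(np.random.randint(0, n_classes, 200), n_classes)
model.fit(X, y, epochs=50, batch_size=20, validation_split=0.2,
          callbacks=[early_stop], verbose=0)

A Softmax output is more common for multi-class problems with categorical cross-entropy; the Sigmoid output is kept here only because it is the choice described in section 3.5.1.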

3.5.2 Grid Search

A grid search was carried out to find a model that optimizes the training for the data used in this thesis. The grid search used the following setup:

Table 3.2: Grid search parameters

Layers: 1, 2, 3, 4 or 5 (with 10 to 64 neurons)
Epochs: 5, 10, 50, 100
Batches: 10, 20, 35
Learning rate: 0.1, 0.01, 0.001
Optimizers: Stochastic gradient descent and Adam
Activations: ReLU and Sigmoid
Loss functions: Mean squared error and Categorical cross-entropy
L2-regularization: 0, 0.01
Dropout: 0, 0.5
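A condensed sketch of how a grid like the one in table 3.2 could be swept is shown below; it covers only a subset of the parameter values, uses random placeholder data, and is not the exact code used in the thesis.

```python
# A condensed, illustrative grid sweep over part of table 3.2 using Keras.
from itertools import product
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD, Adam

x_train = np.random.rand(500, 6)                      # placeholder features
y_train = np.eye(15)[np.random.randint(0, 15, 500)]   # placeholder one-hot labels

def build_model(hidden_layers, learning_rate, optimizer_name, loss_name):
    model = Sequential()
    model.add(Dense(32, activation="relu", input_dim=x_train.shape[1]))
    for _ in range(hidden_layers - 1):
        model.add(Dense(32, activation="relu"))
    model.add(Dense(y_train.shape[1], activation="sigmoid"))
    optimizer = Adam(learning_rate) if optimizer_name == "adam" else SGD(learning_rate)
    model.compile(optimizer=optimizer, loss=loss_name, metrics=["accuracy"])
    return model

best = None
for layers, epochs, batch, lr, opt, loss in product(
        [1, 2, 3], [50, 100], [20, 35], [0.01, 0.001],
        ["sgd", "adam"], ["mean_squared_error", "categorical_crossentropy"]):
    model = build_model(layers, lr, opt, loss)
    model.fit(x_train, y_train, epochs=epochs, batch_size=batch, verbose=0)
    _, accuracy = model.evaluate(x_train, y_train, verbose=0)
    if best is None or accuracy > best[0]:
        best = (accuracy, layers, epochs, batch, lr, opt, loss)

print("Best configuration:", best)
```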

3.5.3 Name Matcher

In an attempt to increase the precision of the predictions of the model, a simple name matcher was added post-training, to filter out column pairs that have similar properties but actually have no correspondence. Since the attribute names were in two different languages (Swedish and English), this data could not really be used for training; therefore, this matcher was constructed. The matcher takes advantage of keywords in both databases and adds a bias to the columns that might have a connection. If the model predicts the columns to belong to the same group and the name matcher says that the two columns could have a correspondence, then this is a potential match (a sketch of this logic follows the keyword lists below). Some of these keywords are words that have a clear correspondence between the databases (basically the same word, even though the databases use different languages):

• radon and rado

• tele and tele

Other keywords (most of the keywords) have a binding between them, which indicates that they belong to a similar object (in some cases the direct translation between the languages):

• intress and address

• rapp and report

• arende and caseset
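A minimal sketch of the keyword-based filtering is given here; the helper names and example column names are invented, and the keyword pairing is checked in both directions since the sketch does not assume which database uses which language.

```python
# A minimal sketch of the keyword-based name matcher (hypothetical helpers).
KEYWORD_PAIRS = [
    ("radon", "rado"),
    ("tele", "tele"),
    ("intress", "address"),
    ("rapp", "report"),
    ("arende", "caseset"),
]

def names_correspond(name_a: str, name_b: str) -> bool:
    """True if the two column/table names share one of the keyword pairings."""
    a, b = name_a.lower(), name_b.lower()
    return any((kw1 in a and kw2 in b) or (kw2 in a and kw1 in b)
               for kw1, kw2 in KEYWORD_PAIRS)

def filter_predictions(predicted_pairs):
    """Keep only predicted pairs on which the name matcher also agrees."""
    return [(a, b) for a, b in predicted_pairs if names_correspond(a, b)]

print(filter_predictions([("RadonMatning", "RadoMeasurement"),
                          ("Foo", "Bar")]))  # keeps only the radon pair
```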

3.5.4 Training Setup

The experiments consisted of 30 separate runs, where the dissimilarity measures and cutting points varied for the clustering (the cutting points differed between the dissimilarity measures but resulted in the same number of clusters), while the neural network used for training was the same. Each test was carried out five times and the best training result was chosen for prediction (this was done to avoid any inconsistencies where one model did not train efficiently and got stuck in a local minimum). If needed, the learning rate was altered to avoid the training getting stuck. All models reached a training accuracy of about 99%. The different tests were carried out with the following setup:

• Dissimilarity measures:

– Single linkage
– Complete linkage
– Average linkage
– Centroid linkage
– Ward’s method

• Resulting number of clusters:

– 5, 10, 15, 20, 25 and 30

• Neural Network:

– The best performing model from the grid search

Below, in figure 3.5, is an overview of the flow of data in the implementation. There are several steps, which have been explained earlier in this chapter, and the figure attempts to visualize the whole workflow.

Figure 3.5: Data flow of the learning-based prediction process

3.6 Evaluating Performance

The results of the experiments are evaluated using a precision and recall metric, which is a common approach for evaluating the performance of both information retrieval and machine learning algorithms [6]. The prerequisite for being able to calculate the precision and recall is that all the correspondences between the databases are known. The following concepts are of use when applied to finding data correspondences:

• True positive (TP) is when two objects are correctly predicted to have a correspondence.

• False positive (FP) is when two objects are incorrectly predicted to have a correspondence.

• True negative (TN) is when two objects are correctly predicted to not have a correspondence.

• False negative (FN) is when two objects are incorrectly predicted to not have a correspondence.

According to these definitions, the precision and recall can be calculated in the following manner:

Precision = TP / (TP + FP)    (3.1)

Recall = TP / (TP + FN)    (3.2)
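As a small illustration, the two equations can be computed directly from sets of predicted and known correspondences; the function below is a sketch with made-up example pairs, not code from the thesis.

```python
# A sketch of equations 3.1 and 3.2, with correspondences represented as
# sets of (legacy_column, target_column) pairs.
def precision_recall(predicted: set, actual: set):
    tp = len(predicted & actual)   # correctly predicted correspondences
    fp = len(predicted - actual)   # predicted correspondences that are not real
    fn = len(actual - predicted)   # real correspondences that were missed
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

predicted = {("ecos1.a", "ecos2.a"), ("ecos1.b", "ecos2.c")}
actual = {("ecos1.a", "ecos2.a"), ("ecos1.b", "ecos2.b")}
print(precision_recall(predicted, actual))  # (0.5, 0.5)
```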

3.6.1 Application in Database Environments

A thorough analysis of the two databases used for the experiments has been carried out. To be able to calculate the recall, knowledge about all the correspondences between the two databases is needed. The TP + FN value has been estimated by going through all the tables and columns and building knowledge about the possible matching columns and tables. Estimating this value enables a good performance measure for the algorithms. After analysing the databases, it was concluded that the estimated TP + FN equals 114, which stands for 114 matching tables.

3.6.2 Evaluating Testing Accuracy

Due to how the data integration problem is constructed for database migrations (integrating data from a legacy database into a newer database), the training data consists of metadata/data from the target schema (Ecos2) and the test data is taken from the legacy database (Ecos1). Therefore, the evaluation/testing accuracy is how well the model classifies legacy data (test data).

Chapter 4

Result

In the following chapter, the results from the experiments are presented.

4.1 Tf-idf and Cosine Similarity

The experiments for calculating the cosine similarity between the data in the Ecos1 and Ecos2 databases conform to the definition explained in section 3.3.

4.1.1 Precision and Recall

A threshold has been used to give a variation in the matching precision. This threshold determines how similar two columns have to be to be considered a match. The threshold stretches from a low value (0.05), which yields a low precision but a higher recall, to a threshold of 1.0, which means that the column pairs contain the exact same data.
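A sketch of this thresholding step is shown below; it assumes a cosine-similarity matrix between Ecos1 and Ecos2 columns has already been computed from the tf-idf vectors (section 3.3), and a random placeholder matrix is used in its place.

```python
# An illustrative sketch of selecting column matches above a threshold.
import numpy as np

sim = np.random.rand(300, 250)  # placeholder similarity matrix, values in [0, 1]

def matches_above_threshold(sim_matrix, threshold):
    """Return (row, col) index pairs whose similarity reaches the threshold."""
    rows, cols = np.where(sim_matrix >= threshold)
    return list(zip(rows.tolist(), cols.tolist()))

for t in [0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(t, len(matches_above_threshold(sim, t)))
```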

Table 4.1: Results of the approach using cosine similarity

Threshold   0.05    0.1     0.3     0.5     0.7     0.9     1.0
Precision   0.047   0.072   0.220   0.373   0.534   0.585   0.750
Recall      0.738   0.670   0.534   0.427   0.379   0.233   0.087



Figure 4.1: Precision and Recall of the cosine similarity algorithm

4.2 Agglomerative Hierarchical Clustering

The results of the clustering algorithm are presented in the form of execution times and dendrograms using the different dissimilarity measures. For these results, the data has been preprocessed and normalized using the Sigmoid-like function presented in equation 2.2.

4.2.1 Clustering Execution Times

The execution times shown in table 4.2 are averages over 10 consecutive runs for every dissimilarity measure. Although the clustering only has to be done once for the training data, it is interesting to see how the dissimilarity measures compare to each other.

Table 4.2: Agglomerative Hierarchical Clustering Execution Times

Diss. Measure   Single    Complete   Average   Centroid   Ward
Time (in ms)    121 ± 2   149 ± 5    153 ± 9   167 ± 18   178 ± 27
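For reference, a timing comparison like the one in table 4.2 could be reproduced roughly as follows; this is an illustrative sketch with placeholder data, not the measurement code used in the thesis.

```python
# Time each SciPy linkage method over 10 runs and report mean ± std in ms.
import time
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.rand(500, 6)  # placeholder metadata matrix

for method in ["single", "complete", "average", "centroid", "ward"]:
    runs = []
    for _ in range(10):
        start = time.perf_counter()
        linkage(X, method=method, metric="euclidean")
        runs.append((time.perf_counter() - start) * 1000.0)  # milliseconds
    print("%s: %.0f ± %.0f ms" % (method, np.mean(runs), np.std(runs)))
```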

4.2.2 Clustering Visualization

The clustering results are most easily analysed by plotting dendrograms. These graphs can be seen in Appendix A. The dendrograms show at what distances the different clusters have been merged together to form a new cluster. The dendrograms are truncated, which means that not all merges are visible in the graph. But the x-axis shows the distribution of the cluster merges, which gives an idea of how the clustering has behaved during run-time.
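A truncated dendrogram like those in Appendix A can be plotted with SciPy as sketched below; the linkage matrix and data are placeholders.

```python
# A sketch of plotting a truncated dendrogram with SciPy and matplotlib.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(200, 6)
Z = linkage(X, method="ward", metric="euclidean")

# truncate_mode="lastp" keeps only the last p merges visible in the plot.
dendrogram(Z, truncate_mode="lastp", p=30)
plt.xlabel("cluster merges")
plt.ylabel("distance")
plt.show()
```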

4.3 Neural Network

The result was generated by first establishing classes by clustering the training data and then training the neural network based on the results of the grid search, which can be seen in table 4.3. All dissimilarity measures were tested, and the constructed neural network was solely a result of which parameters performed well in the grid search.

4.3.1 Grid Search

The following table summarizes the top results from the grid search. Categorical Cross-entropy is abbreviated as cc. An important note is that this grid search only aims to find the parameters that result in fast convergence to a minimal loss (high training accuracy). Dropout and regularization were not taken into consideration in the final grid search, due to the lack of ability to automatically distinguish when they were actually needed.

Table 4.3: Top results of the grid search

Training Accuracy  Loss      Activation  Batch Size  Epochs  Hidden Layers  Learning Rate  Loss function  Optimizer
0.997551           0.001997  relu        35          100     2              0.001          cc             adam
0.996735           0.003056  relu        35          50      2              0.01           cc             adam
0.995918           0.003051  relu        35          100     2              0.01           cc             adam
0.995918           0.003051  relu        20          100     3              0.01           cc             adam
0.995918           0.003051  relu        20          50      2              0.01           cc             adam
0.995918           0.003051  relu        20          100     2              0.001          cc             adam
0.995918           0.003051  relu        20          100     1              0.001          cc             adam

4.3.2 Model Evaluation

All the models that converged to a global minimum loss value (the ones that did not get stuck in a local minimum) gave an evaluation accuracy of 100% after just ∼3 epochs (which means that the models could give predictions for all the test data). Since the test dataset was separate from the training set from the start, the evaluation was performed as explained in section 3.6.2, using precision and recall measurements.

4.3.3 Precision and Recall

The following results consist of generating clusters of different sizes for all five dissimilarity measures (Single linkage, Complete linkage, Average linkage, Centroid linkage and Ward’s method) and subsequently using the output to train a neural network (the network resulting from the optimal hyperparameters found in the grid search). This process is repeated at different cutting points for each dissimilarity measure.

An important note is that the different dissimilarity measures react differently to the varying cutting points. To treat all the dissimilarity measures equally, the same number of clusters has been used at each interval. The chosen number of clusters might not be optimal for all dissimilarity measures, but it is otherwise difficult to make a comparison. For example, using 10 clusters might give a good distribution for Single linkage, while Complete linkage would have benefited more from e.g. 12 clusters. All the models have been trained until convergence to a minimal loss. The runs that got stuck while learning have been rerun. Early stopping was used for all the runs. The precision and recall calculations in this chapter are based on equations 3.1 and 3.2.

Single Linkage

Table 4.4: Training results using Single linkage clustering

Number of Clusters   5       10      15      20      25      30
Precision            0.361   0.400   0.443   0.464   0.530   0.561
Recall               0.807   0.667   0.579   0.342   0.307   0.202

Complete Linkage

Table 4.5: Training results using Complete linkage clustering

Number of Clusters   5       10      15      20      25      30
Precision            0.382   0.366   0.481   0.477   0.653   0.629
Recall               0.842   0.553   0.570   0.368   0.281   0.193

Average Linkage

Table 4.6: Training results using Average linkage clustering

Number of Clusters   5       10      15      20      25      30
Precision            0.423   0.415   0.450   0.491   0.475   0.667
Recall               0.798   0.596   0.316   0.246   0.167   0.088

Centroid Linkage

Table 4.7: Training results using Centroid linkage clustering

Number of Clusters   5       10      15      20      25      30
Precision            0.383   0.413   0.442   0.429   0.452   0.574
Recall               0.833   0.684   0.570   0.500   0.246   0.237

Ward’s Method

Table 4.8: Training results using Ward’s method

Number of Clusters   5       10      15      20      25      30
Precision            0.472   0.430   0.490   0.548   0.621   0.674
Recall               0.807   0.596   0.439   0.351   0.316   0.254

Figure 4.2 shows the precision and recall after making predictions with the neural network that is modelled based on the results from the grid search. This was done for all five dissimilarity measures. The name matcher is also used during these experiments.


Figure 4.2: Precision and Recall of all training results

Training Results without a Name Matcher

These results were obtained using Ward’s method and stopping at 13 clusters. Not using a name matcher discards the possibility of using just a few clusters (for example 5 clusters), since the precision would be too low to efficiently evaluate the result. Table 4.9 shows the precision and recall for the trained neural network on these classes (without using the name matcher). Table B.1 in Appendix B explains what kinds of objects have been grouped into every cluster in this case.

Table 4.9: Training results using Ward’s method without the name matcher

Number of Clusters   13
Precision            0.046
Recall               0.886

Chapter 5

Discussion

In this chapter, the results will be analysed and discussed.

5.1 Tf-idf and Cosine Similarity

As seen in table 4.1 and figure 4.1, the trend is either low precision and high recall or high precision and low recall. The information retrieval solution is a good foundation for evaluating the performance of the machine learning results in section 4.3. Apart from that, this solution alone does not yield satisfactory results when it comes to solving the problem of semantic integration. Having a high recall is more important than having a high precision, and it is shown that at around the same precision (0.046 and 0.047) the baseline neural network without a name matcher (table 4.9) outperforms cosine similarity (table 4.1). The cosine similarity could reach close to 100 percent precision, but the recall would then be too low to make it relevant for the use case. This approach misses several table correspondences that lack a clear linguistic connection in the instance data. Consequently, coming close to 100 percent recall is not possible, even when a low similarity threshold is used.

5.2 Agglomerative Hierarchical Clustering

The dendrograms in Appendix A give a good indication about reasonable cutting points and they also help to understand how the data has been grouped into classes. One important consequence of the choice of dissimilarity measure is the distribution of the data. It can be seen that Ward’s method gives a better distribution of objects across different clusters, while for example Single linkage groups many of the objects into the same cluster early in the clustering sequence. The effect of this is that Single linkage would not work well with a trained model that does not use a name matcher. The name matcher can filter out many of the false positives while still keeping the number of true positives at about the same level. But the more objects that are grouped together in one cluster, the bigger the chance of incorrect matches. In the same manner, having the objects well distributed over all the clusters might have the consequence that tables that are supposed to be matched end up in adjacent clusters, when they in fact should belong in the same cluster. This makes it impossible for a model to be trained to match those specific objects.

5.3 Neural Network

There are several machine learning approaches that could have been attempted for this problem. The reason a neural network was chosen was based on what has been attempted in earlier work in this research field, for example what Li and Clifton [24] did with SEMINT. This work showed that using a neural network is a good approach to the problem of semantic integration. Tables 4.4 to 4.8 and figure 4.2 indicate that the trained model together with the name matcher yields decent precision, even at higher recall. As seen in table 4.9, the trained model cannot be expected to get better than about 90% recall. The name matcher improves the precision about ten times, with the trade-off of losing 5 to 10% recall. For this kind of data integration problem, it is desirable to find all the possible correspondences in the data. Therefore, the main goal is to reach a high recall while still having a decent precision. In this case, the approach using a neural network performs better in comparison to the cosine similarity approach. The reason for using 13 clusters for the test in table 4.9 is that around distance 5, many clusters have been merged. This is good, because it means that the objects in one cluster are just as similar to each other as the objects are in the other clusters. If one cluster got merged at distance 1 and another at distance 5, the objects in the first cluster are a lot more similar than the objects in the cluster that got merged at distance 5.

5.4 About the Semantic Integration Problem

There is one factor that affects the results of the learning-based approach on a large scale: the scarcity of useful metadata properties. The fact that the databases do not have a clear connection in their attribute names (since they are in different languages and the names do not use the same naming convention) increases the importance of having additional metadata for training. The decisions about how Ecos2 was modelled were taken without regard to the Ecos1 database. Therefore, properties like whether a column was nullable, unique, had a default value or had any constraints were discarded; there were no patterns or correspondences between the two databases for these properties. The Ecos1 database also had no defined constraints between tables, which made it difficult to analyse which tables were connected even within the same database.

One case where the cosine similarity algorithm falls flat is when fields in the tables look different even though they belong to a similar object, i.e. there is a lack of clear correspondence between the two columns. This means that the Ecos1 database can have a completely different table structure in table x compared to table y in the Ecos2 database, but table x and y still have a correspondence. For example, a common difference between the two databases is that IDs in Ecos1 are most of the time incremental integers starting from 1, or fixed-length strings containing integers, while Ecos2 mostly has IDs in the shape of hash numbers of a fixed length. These differences, together with a change in table structure, can lead to cosine similarity failing to find any similarity at all between two columns or even tables that have a correspondence. Since the learning-based approach did not look at instance data but only the metadata properties, cases like the one previously mentioned could still be solved and a correspondence could be found.

The importance of normalizing the data before clustering can be seen in figures 3.1 and 3.2. They show that clustering the data without normalization will lead to outliers being difficult to handle. Take for example the length property, which sometimes has the max value 2147483647. If this parameter is not normalized, it would just divide all the objects into two groups: one where the length is max and one where the length is small (maybe between 1 and 1000). The length property would therefore have a greater impact than all the other metadata, which is something we want to avoid. Therefore, normalizing all relevant metadata is important. The min-max normalization does not deal with the case when there are outliers, but still works decently to normalize metadata in a given range. What can happen is that if, for example, all length values except one are in the range 1 to 1000 and there is one outlier with length 10000, the outlier will affect the distribution of all values when normalizing. This yields a skewed representation of the data.

Agglomerative hierarchical clustering works well for building classes that can be used for training a neural network. Table B.1 shows that we get a decent distribution of objects across the clusters, where each cluster contains very similar objects. We can also see that using 13 clusters in this case yields adjacent clusters that are fairly similar (in some cases only differing in the length of the field). The table also shows that there is a relationship between the number of metadata properties and the number of clusters: the more metadata properties we can use, the more clusters we can form without having too similar objects in different clusters. Take table B.1 as an example. If 5 clusters had been used instead of 13, the objects would have been mixed up: cluster 1 would contain all primary keys (regardless of any other features), cluster 2 would contain integers and IDs that are not primary keys, cluster 3 would contain all character fields, cluster 4 would contain a mix of integers and decimal values, and cluster 5 would contain datetimes and bits. This distribution is not that good, since it is desirable to have objects with very small feature differences in each cluster.

The conducted grid search (table 4.3) shows that having more than one hidden layer in the network results in slightly better training. For the metadata that has been used in this thesis, the model has not needed to be that complex; using two or three hidden layers is only a small improvement over using one hidden layer. To summarize the rest of the results from the grid search, there were many models that learned fast with just a small variation of batch size, hidden layers, learning rate and epochs. In general, 50 or more epochs were needed to reach a high training accuracy. The batch size mostly affected convergence speed, but in some cases a bigger batch size also led to the training getting stuck in a local minimum. Having a smaller learning rate also helped to avoid getting stuck during training. The Sigmoid activation function was used for the output layer to ensure output in the range [0, 1], but apart from that, ReLU performed better. Categorical Cross-entropy performed better than Mean squared error (the loss was minimized quicker, resulting in faster training). Lastly, faster convergence was achieved by using the Adam optimizer in comparison to Stochastic gradient descent.

The range of metadata parameters that could be used to train the model was not wide enough to cause any problems with overfitting. Due to the lack of input parameters, it was not beneficial to build more complex models, and therefore dropout or regularization were not really needed.
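To make the normalization discussion above concrete, the following sketch contrasts min-max normalization with a sigmoid-like squashing on the length property; the exact function used in the thesis is equation 2.2, which the logistic form below only approximates.

```python
# Min-max normalization vs. a sigmoid-like squashing of the length property.
import numpy as np

lengths = np.array([1.0, 10.0, 250.0, 1000.0, 2147483647.0])

def min_max(x):
    # The single huge outlier dominates the range, pushing everything else to ~0.
    return (x - x.min()) / (x.max() - x.min())

def sigmoid_like(x, scale=100.0):
    # Maps non-negative values into [0, 1] while keeping small values apart.
    return 2.0 / (1.0 + np.exp(-x / scale)) - 1.0

print(min_max(lengths))       # roughly [0, 0, 0, 0, 1]
print(sigmoid_like(lengths))  # small lengths remain distinguishable
```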

5.5 Methodology

The work in this thesis used field-specific data and the learning-based model was built and designed accordingly. Therefore, the results can only be reproduced using the same data, and the model will probably perform worse with other datasets. But the methodology should regardless work well for relational data of a similar nature, and the same approach (tweaked with different parameter values) should yield similar results.

Building an optimal model for a specific use case (like the one studied in this thesis) is not trivial, and there was a need to decide upon a few options to try out for this classification problem. For example, there are more optimizers, activation functions and loss functions that could have been utilized, but they were left out of the grid search due to the inability to try out all the different parameter options that are available. Stochastic gradient descent was used because it is one of the most widely discussed optimizers, with Adam being another popular option. The reason why ReLU is a good activation function is explained in Section 2.4.3, and Sigmoid is used in the output layer to ensure output values between 0 and 1. Mean squared error and Cross-entropy are two of the most widely used loss functions, and the Categorical cross-entropy variation works well with the classification task in this thesis.

The dissimilarity measures used in the agglomerative hierarchical clustering are all commonly used. It can be argued that the dissimilarity measures are quite similar and that it is redundant to test all of them, but since it was not a complex task to test them all, they were all included.

Another discussion point is the training data and how it should be handled. In this project, the metadata to train on is extracted from the Ecos2 database. This limits the training data to a few thousand objects, which does not necessarily have to be negative, since the model is not that complex and can still yield good training results, as shown in the previous chapter. But due to the nature of the approach, where the metadata chosen to train on is quite generic, it could be relevant to extract metadata from several databases (databases that are similar to Ecos2). This was attempted but not included in the report, since it yielded about the same training results (just slightly slower training). If there had been more metadata properties to utilize in training, increasing the training set could possibly have resulted in better training results, but for the scope of this project, the training set has been sufficient. The reason why taking training data from several databases would be viable in this project is that numerous databases were made available by Sokigo for this project. These databases had almost identical structure but contained different data. In this case, data from different databases could be merged into one training set. But it might not be beneficial if most of the entries are identical, since this resembles what already happens when training is done across several epochs. It was also attempted to use instance data for training, like count values of columns. But since these properties made the solution less generic (instance data may differ a lot across different databases), this was not researched further.

Chapter 6

Conclusion

The goal of this thesis was to investigate whether agglomerative hierarchical clustering and neural networks can be used to assist in semantic integration across databases containing complex real-world data, and whether there are any advantages to using deep neural networks. The approach based on cosine similarity was implemented to use as a benchmark in comparison with the learning-based approach. The experiments and results show that agglomerative hierarchical clustering is a viable option for grouping relational data into classes. The classes constructed are well formed for the classification task and for training a neural network. The results show that a neural network performs well in the task of finding corresponding attributes in heterogeneous databases, but needs assistance to work towards the goal of fully automating the integration process. The neural network can find good patterns in the data, but the lack of metadata properties limits the performance of the approach. Adding a name matcher helps to filter out undesirable matches. It is also shown that increasing the depth of the neural network slightly increases the performance. But for deep neural networks to excel, more metadata would be needed.

The semantic integration problem can be partly automated using agglomerative hierarchical clustering and neural networks. But the results have shown that for a fully automated semantic integration approach to be possible, the solution would need to be enhanced with additional algorithms and rules.


6.1 Future Work

To extend the work conducted in this thesis, it would be interesting to evaluate a few different datasets, to see which properties are important to utilize when training a model. Although it was a part of this project to work with complex data, it would have been interesting to evaluate the model on data that was better defined (more usable metadata, better structural dependencies etc.). Being able to predict classes more precisely means being less dependent on a name matcher, thus giving a more generic solution. Working with data that has clearer patterns and structure opens opportunities to implement other solutions that utilize schema properties. Ontology matching is one approach that would have been interesting to research further. More rules could also have been implemented to improve the performance for this use case. But since most of the rules that could be added would have reduced the genericity of the solution, it was less interesting to add many new rules compared to trying to improve the performance of the neural network and the name matcher.

The implementation in this thesis only attempts to find the best matching columns and tables across the two databases. Therefore, a future extension could be to perform the integration in practice and build an application for this. Many new factors would have to be taken into consideration when the integration is carried out in practice. For example, it is fine that the precision is not 100% in the experiments in this thesis, but if this data were to be integrated in practice, there would be a need for a layer in the application where a user can confirm or discard the matches (since falsely matching two tables is bad).

Bibliography

[1] Jacob Berlin and Amihai Motro. “Database schema matching using machine learning with feature selection”. In: International Conference on Advanced Information Systems Engineering. Springer. 2002, pp. 452–466.
[2] Philip A Bernstein, Jayant Madhavan, and Erhard Rahm. “Generic schema matching, ten years later”. In: Proceedings of the VLDB Endowment 4.11 (2011), pp. 695–701.
[3] Purnima Bholowalia and Arvind Kumar. “EBK-means: A clustering technique based on elbow method and k-means in WSN”. In: International Journal of Computer Applications 105.9 (2014).
[4] Léon Bottou. “Large-scale machine learning with stochastic gradient descent”. In: Proceedings of COMPSTAT’2010. Springer, 2010, pp. 177–186.
[5] Silvana Castano and Valeria De Antonellis. “Global viewing of heterogeneous data sources”. In: IEEE Transactions on Knowledge and Data Engineering 13.2 (2001), pp. 277–297.
[6] Jesse Davis and Mark Goadrich. “The relationship between Precision-Recall and ROC curves”. In: Proceedings of the 23rd international conference on Machine learning. ACM. 2006, pp. 233–240.
[7] Pieter-Tjerk De Boer et al. “A tutorial on the cross-entropy method”. In: Annals of operations research 134.1 (2005), pp. 19–67.
[8] Michel Marie Deza and Elena Deza. “Encyclopedia of distances”. In: Encyclopedia of Distances. Springer, 2009, pp. 1–583.
[9] Guohui Ding and Tianhe Sun. “Schema matching based on position of attribute in query statement”. In: Knowledge-Based Systems 75 (2015), pp. 41–51.


[10] Hong-Hai Do and Erhard Rahm. “COMA: a system for flexible combination of schema matching approaches”. In: Proceedings of the 28th international conference on Very Large Data Bases. VLDB Endowment. 2002, pp. 610–621.
[11] AnHai Doan, Pedro Domingos, and Alon Y Halevy. “Reconciling schemas of disparate data sources: A machine-learning approach”. In: ACM Sigmod Record. Vol. 30. 2. ACM. 2001, pp. 509–520.
[12] AnHai Doan and Alon Y Halevy. “Semantic integration research in the database community: A brief survey”. In: AI magazine 26.1 (2005), p. 83.
[13] AnHai Doan et al. “Ontology matching: A machine learning approach”. In: Handbook on ontologies. Springer, 2004, pp. 385–403.
[14] Christian Drumm et al. “Quickmig: automatic schema matching for data migration projects”. In: Proceedings of the sixteenth ACM conference on Conference on information and . ACM. 2007, pp. 107–116.
[15] Fabien Duchateau and Zohra Bellahsene. “Designing a benchmark for the assessment of schema matching tools”. In: Open Journal of Databases (OJDB) 1.1 (2014), pp. 3–25.
[16] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. “Deep sparse rectifier neural networks”. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011, pp. 315–323.
[17] Hoshin V Gupta et al. “Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling”. In: Journal of Hydrology 377.1-2 (2009), pp. 80–91.
[18] Anna Huang. “Similarity measures for text document clustering”. In: Proceedings of the sixth new zealand computer science research student conference (NZCSRSC2008), Christchurch, New Zealand. 2008, pp. 49–56.
[19] Y Kumar Jain and Santosh Kumar Bhandare. “Min max normalization based data perturbation method for privacy protection”. In: International Journal of Computer & Communication Technology 2.8 (2011), pp. 45–50.

[20] Keras. https://keras.io/. [Accessed Oct. 6, 2017].
[21] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014).
[22] Peter Langfelder, Bin Zhang, and Steve Horvath. “Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R”. In: Bioinformatics 24.5 (2007), pp. 719–720.
[23] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: Nature 521.7553 (2015), pp. 436–444.
[24] Wen-Syan Li and Chris Clifton. “SEMINT: A tool for identifying attribute correspondences in heterogeneous databases using neural networks”. In: Data & Knowledge Engineering 33.1 (2000), pp. 49–84.
[25] Jayant Madhavan, Philip A Bernstein, and Erhard Rahm. “Generic schema matching with cupid”. In: vldb. Vol. 1. 2001, pp. 49–58.
[26] Daniel Müllner et al. “fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python”. In: Journal of Statistical Software 53.9 (2013), pp. 1–18.
[27] Fionn Murtagh and Pierre Legendre. “Ward’s hierarchical clustering method: clustering criterion and agglomerative algorithm”. In: arXiv preprint arXiv:1111.6285 (2011).
[28] Natalya F Noy. “Semantic integration: a survey of ontology-based approaches”. In: ACM Sigmod Record 33.4 (2004), pp. 65–70.
[29] Luigi Palopoli, Giorgio Terracina, and Domenico Ursino. “The System DIKE: Towards the Semi-Automatic Synthesis of Cooperative Information Systems and Data Warehouses.” In: ADBIS-DASFAA Symposium. 2000, pp. 108–117.
[30] Laura Po and Serena Sorrentino. “Automatic generation of probabilistic relationships for improving schema matching”. In: Information Systems 36.2 (2011), pp. 192–208.
[31] Erhard Rahm and Philip A Bernstein. “A survey of approaches to automatic schema matching”. In: the VLDB Journal 10.4 (2001), pp. 334–350.
[32] Thomas A Runkler. Data Analytics. Springer, 2012, pp. 113–115.
[33] Scikit-learn. http://scikit-learn.org/stable/. [Accessed Oct. 3, 2017].

[34] Scipy. https://www.scipy.org/. [Accessed Oct. 4, 2017].
[35] Nitish Srivastava et al. “Dropout: a simple way to prevent neural networks from overfitting.” In: Journal of machine learning research 15.1 (2014), pp. 1929–1958.
[36] Wil MP Van der Aalst et al. “Process mining: a two-step approach to balance between underfitting and overfitting”. In: Software & Systems Modeling 9.1 (2010), p. 87.
[37] Bing Xu et al. “Empirical evaluation of rectified activations in convolutional network”. In: arXiv preprint arXiv:1505.00853 (2015).
[38] Huimin Zhao. “ across heterogeneous data sources”. In: Communications of the ACM 50.1 (2007), pp. 45–50.
[39] Patrick Ziegler and Klaus R Dittrich. “Three decades of data integration-All problems solved?” In: IFIP congress topical sessions. Springer. 2004, pp. 3–12.

Appendix A

Agglomerative Hierarchical Clustering

Figure A.1: Truncated Single linkage dendrogram


Figure A.2: Truncated Complete linkage dendrogram

Figure A.3: Truncated Average linkage dendrogram

Figure A.4: Truncated Centroid linkage dendrogram

Figure A.5: Truncated Ward’s method dendrogram

Appendix B

Cluster Content Explanation

Table B.1: Description of cluster distribution

Cluster number   Description
1    All objects are IDs, not null, primary keys and with a fixed length of 16
2    All objects are IDs, not null, primary keys and with a length up to 16
3    All objects are IDs, null values are allowed in most cases, no primary keys and with bigger field lengths
4    All objects are IDs, null values are allowed in most cases, no primary keys and with smaller field lengths
5    All objects are nvarchars, a mix between nullable and not nullable and with mostly small length values
6    All objects are nvarchars, a mix between nullable and not nullable and with mostly middle size length values (30-100)
7    All objects are nvarchars, a mix between nullable and not nullable and with bigger length values (above 100)
8    All objects are decimal point values, a mix between nullable and not nullable, and mostly with a fixed length somewhere between 4 and 8
9    All objects are integers, nullable mostly allowed and with a fixed length of 4
10   All objects are integers or tinyints, nullable mostly disallowed and with a small length (fixed to 1 in many cases)
11   All objects are of the type datetime, a mix between nullable and not nullable and length varies
12   All objects are of the type bit, a mix between nullable and not nullable and no limit on length
13   All objects are of the type bit, a mix between nullable and not nullable and mostly with a fixed length of 1
