Semantic Integration across Heterogeneous Databases: Finding Data Correspondences using Agglomerative Hierarchical Clustering and Artificial Neural Networks
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

MARK HOBRO

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Computer Science
Date: April 11, 2018
Supervisor: John Folkesson
Examiner: Hedvig Kjellström
Swedish title: Semantisk integrering mellan heterogena databaser: Hitta datakopplingar med hjälp av hierarkisk klustring och artificiella neuronnät
School of Computer Science and Communication

Abstract

The process of data integration is an important part of the database field when it comes to database migrations and the merging of data. Research in the area has grown with the addition of machine learning approaches in the last 20 years. Due to the complexity of the research field, no go-to solutions have appeared. Instead, a wide variety of ways of enhancing database migrations have emerged. This thesis examines how well a learning-based solution performs for the semantic integration problem in database migrations.

Two algorithms are implemented. The first is based on information retrieval theory, with the goal of yielding a matching result that can be used as a benchmark for measuring the performance of the machine learning algorithm. The second, the machine learning approach, is based on grouping data with agglomerative hierarchical clustering and then training a neural network to recognize patterns in the data. This allows making predictions about potential data correspondences across two databases.

The results show that agglomerative hierarchical clustering performs well in the task of grouping the data into classes.
The classes can in turn be used for training a neural network. The matching algorithm gives a high recall of matching tables, but improvements are needed to achieve both high recall and high precision.

The conclusion is that the proposed learning-based approach, using agglomerative hierarchical clustering and a neural network, works as a solid base to semi-automate the data integration problem seen in this thesis. However, the solution needs to be enhanced with scenario-specific algorithms and rules to reach the desired performance.

Sammanfattning

Data integration is an important part of the database field when it comes to database migrations and the merging of data. Research in the area has grown as machine learning has become an attractive approach over the last 20 years. Due to the complexity of the research field, no optimal solutions have been found. Instead, several different techniques have been developed that together can improve database migrations. This thesis examines how well a solution based on machine learning performs for the data integration problem in database migrations.

Two algorithms have been implemented. One is based on information retrieval theory and is mainly used to provide a performance baseline for the algorithm based on machine learning. That algorithm consists of a first step in which data is grouped using hierarchical clustering. An artificial neural network is then trained to find patterns in these groupings, in order to predict whether different data instances correspond across two databases.

The results show that agglomerative hierarchical clustering performs well in the task of classifying the data used. The results of the matching algorithm show that a large share of the matching tables can be found. However, improvements are needed to achieve both high recall of matches and high precision for the matches that are found.

The conclusion is that a learning-based approach, in this case using agglomerative hierarchical clustering and then training an artificial neural network, works well as a basis for partially automating a data integration problem like the one presented in this thesis. To achieve better results, the solution needs to be improved with more situation-specific algorithms and rules.

Acknowledgements

I would first like to thank my supervisor at KTH, John Folkesson, for all the support during the project, as well as my examiner, Hedvig Kjellström, for making this thesis possible. I also want to express my gratitude to my supervisor at Sokigo, Kevin James, for giving me the opportunity to do my thesis at their office. Finally, I would like to thank my family and friends who have supported me throughout the entire process.

Contents

1 Introduction
    1.1 Motivation
    1.2 Problem Definition
    1.3 Limitation
    1.4 Sustainability and Ethics
    1.5 Outline of Report
2 Background
    2.1 Semantic Integration
        2.1.1 Schema Matching
        2.1.2 Instance Matching
        2.1.3 Match Cardinality
    2.2 Rule-based Matching
    2.3 Information Retrieval
        2.3.1 Tf-idf and Cosine Similarity
    2.4 Learning-based Approach
        2.4.1 Data Representation
        2.4.2 Agglomerative Hierarchical Clustering
        2.4.3 Neural Network
    2.5 Difficulties in Schema Integration
        2.5.1 Linguistic Challenges
        2.5.2 Structural Ambiguity
    2.6 Related Work
3 Method
    3.1 Setup and Development Tools
    3.2 Dataset
        3.2.1 Preprocessing
    3.3 Tf-idf and Cosine Similarity
    3.4 Agglomerative Hierarchical Clustering
        3.4.1 Effect of Normalization
        3.4.2 Analysing Clustering Outcome
        3.4.3 Testing Dissimilarity Measures
    3.5 Neural Network
        3.5.1 Defining the Model
        3.5.2 Grid Search
        3.5.3 Name Matcher
        3.5.4 Training Setup
    3.6 Evaluating Performance
        3.6.1 Application in Database Environments
        3.6.2 Evaluating Testing Accuracy
4 Result
    4.1 Tf-idf and Cosine Similarity
        4.1.1 Precision and Recall
    4.2 Agglomerative Hierarchical Clustering
        4.2.1 Clustering Execution Times
        4.2.2 Clustering Visualization
    4.3 Neural Network
        4.3.1 Grid Search
        4.3.2 Model Evaluation
        4.3.3 Precision and Recall
5 Discussion
    5.1 Tf-idf and Cosine Similarity
    5.2 Agglomerative Hierarchical Clustering
    5.3 Neural Network
    5.4 About the Semantic Integration Problem
    5.5 Methodology
6 Conclusion
    6.1 Future Work
Bibliography
A Agglomerative Hierarchical Clustering
B Cluster Content Explanation

Chapter 1 Introduction

Semantic integration has been a widely discussed topic in the last three decades. The earlier work focused on structural integration, i.e. building global data models with the goal of integrating well-structured data [39]. With the continuous growth of information on the Internet, semi-structured and unstructured data became more common. Semantic integration of data was born out of the problems that came with this evolution. Dealing with heterogeneity across different data sources is complex, especially since it needs to be done on several levels, such as the schema level and the instance data level [39]. Therefore, semantic integration will continue to be a research field that requires attention, and it is nowadays an important area of database research [12].

1.1 Motivation

Semantic integration has evolved into a significant research area in the database community, driven by the constant need to process complex data. The research has taken two different paths. Firstly, it has investigated whether semantic integration can be applied to integrate data automatically (or at least semi-automatically) in specific scenarios.
Secondly, the research has focused on finding generic solutions, where semantic integration can be applied to build ontologies that can integrate a wider scope of heterogeneous data. The latter is rather complex, due to the immense diversity and range of data sources that can be encountered. Because of the diversity of data today, there are many data scenarios in which semantic integration has not been attempted for automatic schema and data integration. Therefore, there is a need to examine how well different semantic integration techniques perform on real-world data.

Automation of database migrations can be widely useful. It is not uncommon for companies to migrate data between databases, and since modern technology evolves so swiftly, it is interesting to investigate how data migrations can be made more efficient.

1.2 Problem Definition

Earlier research in semantic integration for database systems has shown that it is possible to automate the integration process to a certain extent. But in many studies, the test data has been generic and has not covered many of the problems that come with matching complex and heterogeneous data. In this study, one goal is to build a generic solution for the integration process of complex and field-specific data and to analyse the difficulties that may appear.

It is interesting to investigate how different strategies, such as rule-based and learning-based ones, scale with increased data complexity. Will deep neural networks show better performance? Other questions that will be considered are how well a learning-based approach can be applied to relational data and whether a learning-based approach alone is enough to solve the integration problem. Is agglomerative hierarchical clustering a preferable solution for establishing classes that can be used for training a neural network?
The hypothesis is that both agglomerative hierarchical clustering and neural networks can be used to semi-automate the semantic integration problem, but that they will need the help of well-constructed rules when the complexity of the data increases. Semi-automation here means that the learning-based approach can distinguish the patterns between databases but is not sufficient on its own to yield optimal results (finding all data correspondences, which would mean full automation).
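To make the idea of establishing classes through agglomerative hierarchical clustering more concrete, the following is a minimal sketch and not the implementation used in this thesis. The function name `agglomerative_clusters`, the two-dimensional feature vectors, the choice of average linkage, and the Euclidean distance are all assumptions made for illustration only. Each data point starts in its own cluster, and the two closest clusters are merged repeatedly until the requested number of clusters remains.

```python
from itertools import combinations
from math import dist


def agglomerative_clusters(points, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain.

    Hypothetical helper for illustration; returns one class label per point.
    """
    # Start with every point in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]

    def average_linkage(a, b):
        # Mean Euclidean distance over all cross-cluster pairs of points.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # combinations() guarantees a < b, so deleting b keeps index a valid.
        clusters[a].extend(clusters[b])
        del clusters[b]

    # Turn cluster membership into one class label per point.
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Assumed toy features for six database columns:
# (fraction of digit characters, mean value length / 100).
features = [
    (0.95, 0.05), (0.90, 0.06), (0.92, 0.04),  # numeric-looking columns
    (0.05, 0.30), (0.02, 0.35), (0.08, 0.28),  # text-looking columns
]
labels = agglomerative_clusters(features, n_clusters=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The resulting labels are the kind of classes that could serve as training targets for a neural network, which would then predict the class of data from a second database. This naive version recomputes all pairwise linkages on every merge, so it is only suitable for very small inputs.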