Services for Connecting and Integrating Big Number of Linked Datasets

UNIVERSITY OF CRETE DEPARTMENT OF COMPUTER SCIENCE FACULTY OF SCIENCES AND ENGINEERING Services for Connecting and Integrating Big Number of Linked Datasets by Michalis Mountantonakis PhD Dissertation Presented in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy Heraklion, March 2020 UNIVERSITY OF CRETE DEPARTMENT OF COMPUTER SCIENCE Services for Connecting and Integrating Big Number of Linked Datasets PhD Dissertation Presented by Michalis Mountantonakis in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy APPROVED BY: Author: Michalis Mountantonakis Supervisor: Yannis Tzitzikas, Associate Professor, University of Crete, Greece Committee Member: Dimitrios Plexousakis, Professor, University of Crete, Greece Committee Member: Kostas Magoutis, Associate Professor, University of Crete, Greece Committee Member: Grigoris Antoniou, Professor, University of Huddersfield, United Kingdom Committee Member: Manolis Koubarakis, Professor, National and Kapodistrian University of Athens, Greece Committee Member: Sören Auer, Professor, Leibniz University of Hannover, Germany Committee Member: Giorgos Flouris, Principal Researcher, FORTH-ICS, Greece Department Chairman: Angelos Bilas, Professor, University of Crete, Greece Heraklion, March 2020 Acknowledgments First, I would like to express my deepest gratitude to my supervisor, Associate Professor Yannis Tzitzikas for everything that I have learned from him and for all the support and encouragement he has given to me. During the past four years, he has devoted a lot of time and effort for helping me with my PhD dissertation and for writing together several research papers that were published in prestigious journals and conferences. Indeed, his immense knowledge, motivation, passion and patience have given me more power and spirit to improve my research and presentation skills and to finish my dissertation. Apart from my supervisor, I would like to deeply thank the other two members of my Advisory Committee, Professor Dimitris Plexousakis and Associate Professor Kostas Magoutis, for their valuable comments and suggestions all these years, which helped me to improve my PhD thesis and my research skills. I am also grateful to Professor Grigoris Antoniou, Professor Manolis Koubarakis, Professor Soren¨ Auer and Principal Researcher Giorgos Flouris, for accepting the invitation to become members of my examination committee, and for their valuable comments and suggestions, that helped me to improve my dissertation. Finally, I would like also to thank the anonymous reviewers of our papers for providing us valuable feedback, which was crucial for improving our work. I gratefully acknowledge the funding received towards my PhD thesis from Hellenic Foundation for Research and Innovation and the General Secretariat for Research and Technology for the period August 2017 - March 2020. In particular, this dissertation was supported by the Hellenic Foundation for Research and Innovation (HFRI) and the Gen- eral Secretariat for Research and Technology (GSRT), under the HFRI PhD Fellowship grant (GA. No. 166). Moreover, I would like to acknowledge the support of the Institute of Com- puter Science of the Foundation of Research and Technology (FORTH-ICS), and especially the Information Systems Laboratory (ISL), for the financial support (from the last year of my undergraduate studies until July 2017), for the facilities, and for giving me the opportunity to travel in several places for presenting our research work. In particular, I would like to thank the Head of ISL, Professor Dimitris Plexousakis, and also all the faculty and staff members of ISL. Furthermore, I would like to acknowledge the support of BlueBridge (H2020, 2015-2018), since part of my dissertation was supported by this project. Additionally, I want to express my deepest thanks to the undergraduate and postgraduate students of Computer Science Department, and to the staff of University of Crete. Es- pecially, I desire to thank the secretaries of Computer Science Department and ELKE, for helping me during my undergraduate and postgraduate studies. Moreover, I would like v to thank okeanos GRNET’s cloud service (https://okeanos.grnet.gr/), for giving me the opportunity to have access and to use 14 machines, during the past four years. By using these machines, I managed to run most of the experiments which are included in this dissertation and to host several online services. Furthermore, I want to thank the people that supported me all these years. Especially in some difficult situations during my postgraduate studies, they encouraged me to con- tinue and to complete my dissertation. First, I want to deeply thank my beloved Efstathia for her love, support and patience, and for the many wonderful moments we have spent together. Moreover, I want to thank my family and especially the following persons: my grandfather Antonis, who is my biggest supporter, my parents Rena and Lefteris and my sister Maria, for believing in me and encouraging me all these years. Finally, I want to thank all my friends for all the good times we have shared all the years. vi Στην αγαπημένη μου Ευσταθία, στον παππού μου Αντώνη, στους γονείς μου Ρένα και Λευτέρη και στην αδερφή μου Μαρία, για την αμέριστη αγάπη, την ενθάρρυνση και την υποστήριξη τους. vii viii Abstract Linked Data is a method for publishing structured data that facilitates their sharing, linking, searching and re-use. A big number of such datasets (or sources), has already been published and their number and size keeps increasing. Although the main objective of Linked Data is linking and integration, this target has not yet been satisfactorily achieved. Even seemingly simple tasks, such as finding all the available information for an entity is challenging, since this presupposes knowing the contents of all datasets and performing cross-dataset identity reasoning, i.e., computing the symmetric and transitive closure of the equivalence relationships that exist among entities and schemas. Another big chal- lenge is Dataset Discovery, since current approaches exploit only the metadata of datasets, without taking into consideration their contents. In this dissertation, we analyze the research work done in the area of Linked Data In- tegration, by giving emphasis on methods that can be used at large scale. Specifically, we factorize the integration process according to various dimensions, for better understand- ing the overall problem and for identifying the open challenges. Then, we propose indexes and algorithms for tackling the above challenges, i.e., methods for performing cross- dataset identity reasoning, for finding all the available information for an entity, methods for offering content-based Dataset Discovery, and others. Due to the large number and volume of datasets, we propose techniques that include incremental and parallelized algorithms. We show that content-based Dataset Discovery is reduced to solving optimization problems, and we propose techniques for solving them in an efficient way. The aforementioned indexes and algorithms have been implemented in a suite of services that we have developed, called LODsyndesis, which offers all these services in real time. Furthermore, we present an extensive connectivity analysis for a big subset of LOD cloud datasets. In particular, we introduce measurements (concerning connectivity and efficiency) for 2 billion triples, 412 million URIs and 44 million equivalence relationships derived from 400 datasets, by using from 1 to 96 machines for indexing the datasets. Just indicatively, by using the proposed indexes and algorithms, with 96 machines it takes less than 10 minutes to compute the closure of 44 million equivalence relationships, and 81 minutes for indexing 2 billion triples. Furthermore, the dedicated indexes, along with the proposed incremental algorithms, enable the computation of connectivity metrics for 1 million subsets of datasets in 1 second (three orders of magnitude faster than conven- tional methods), while the provided services offer responses in a few seconds. These services enable the implementation of other high level services, such as services for Data En- ix richment which can be exploited for Machine-Learning tasks, and techniques for Knowl- edge Graph Embeddings, and we show that this enrichment improves the prediction of machine-learning problems. Keywords: Linked Data, RDF,Data Integration, Dataset Discovery and Selection, Con- nectivity, Lattice of Measurements, Big Data, Data Quality Supervisor: Yannis Tzitzikas Associate Professor Computer Science Department University of Crete x Περίληψη Τα ∆ιασυνδεδεμένα ∆εδομένα (Linked Data) είναι ένας τρόπος δημοσίευσης δεδο- μένων που διευκολύνει το διαμοιρασμό, τη διασύνδεση, την αναζήτηση και την επα- ναχρησιμοποίησή τους. ´Ηδη υπάρχουν χιλιάδες τέτοια σύνολα δεδομένων, στο εξής πηγές, και ο αριθμός και το μέγεθος τους αυξάνεται. Αν και ο κύριος στόχος των ∆ια- συνδεδεμένων ∆εδομένων είναι η διασύνδεση και η ολοκλήρωση τους, αυτός ο στόχος δεν έχει επιτευχθεί ακόμα σε ικανοποιητικό βαθμό. Ακόμα και φαινομενικά απλές εργασίες, όπως η εύρεση όλων των πληροφοριών για μία συγκεκριμένη οντότητα αποτελούν πρόκληση διότι αυτό προϋποθέτει γνώση των περιεχομένων όλων των πη- γών, καθώς και την ικανότητα συλλογισμού επί των συναθροισμένων περιεχομένων τους, συγκεκριμένα απαιτείται ο υπολογισμός του συμμετρικού και μεταβατικού κλεισίματος των σχέσεων ισοδυναμίας μεταξύ των ταυτοτήτων των οντοτήτων και των οντολογιών. Η

Services for Connecting and Integrating Big Number of Linked Datasets

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support