Revisiting Data Partitioning for Scalable RDF Graph Processing

Jorge Armando Galicia Auyon

To cite this version:

Jorge Armando Galicia Auyon. Revisiting Data Partitioning for Scalable RDF Graph Processing. Other [cs.OH]. ISAE-ENSMA École Nationale Supérieure de Mécanique et d'Aérotechnique - Poitiers, 2021. English. NNT: 2021ESMA0001. tel-03167657.

HAL Id: tel-03167657 https://tel.archives-ouvertes.fr/tel-03167657 Submitted on 12 Mar 2021

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

THÈSE

for obtaining the degree of Doctor of the École Nationale Supérieure de Mécanique et d'Aérotechnique (Diplôme National – Arrêté du 25 mai 2016)

Doctoral School (École Doctorale): Sciences et Ingénierie pour l'Information et Mathématiques
Research field: Computer Science and Applications

Presented by:

Jorge Armando GALICIA AUYÓN

************************

Revisiting Data Partitioning for Scalable RDF Graph Processing

************************

Thesis supervisor: Ladjel BELLATRECHE
Co-supervisor: Amin MESMOUDI

************************

Defended on 12 January 2021 before the examination committee

************************

JURY

President:
Emmanuel GROLLEAU, Professor, ISAE-ENSMA, Poitiers

Reviewers:
Yannis MANOLOPOULOS, Professor, Open University of Cyprus, Cyprus
Farouk TOUMANI, Professor, Université Blaise Pascal, Clermont-Ferrand

Jury members:
Genoveva VARGAS-SOLAR, Chargée de Recherche, CNRS, Grenoble
Carlos ORDONEZ, Associate Professor, University of Houston, USA
Patrick VALDURIEZ, Directeur de Recherche, INRIA, Montpellier
Amin MESMOUDI, Maître de Conférences, Université de Poitiers, Poitiers
Ladjel BELLATRECHE, Professor, ISAE-ENSMA, Poitiers

It always seems impossible until it's done. — Nelson Mandela

Acknowledgements

This thesis would not have been possible without the inspiration and support of so many individuals:

- I would like to thank my supervisors, Prof. Ladjel Bellatreche and Dr. Amin Mesmoudi, for their guidance, patience, and support during my thesis. The insightful discussions I had with each of you definitely enlightened my way while exploring new ideas. Your careful editing contributed greatly to the production of this manuscript. You have helped me learn how to do research.

- I would also like to show my sincere gratitude to Professors Yannis Manolopoulos and Farouk Toumani for taking their precious time to be the rapporteurs of my thesis. I greatly appreciate your feedback, which I carefully considered to improve the quality of my work. In addition, I would like to thank Professors Genoveva Vargas-Solar, Patrick Valduriez, Carlos Ordonez, and Emmanuel Grolleau, who honored me by being part of the examination committee and whose questions have enriched my study from many different angles.

- I also thank all the members of the LIAS laboratory, who welcomed me and with whom I shared so many moments of conviviality throughout these three years. Many thanks to Allel Hadjali, Brice Chardin, Henri Bauer, Michael Richard, and Yassine Ouhammou for mentoring me during the first steps of my teaching career. I wish to extend my special thanks to Mickaël Baron for the technical support throughout my work, and for his availability and enthusiasm. I sincerely thank Bénédicte Boinot for her kindness, generosity, and all the administrative assistance in coordinating my thesis project.

- I want to thank my past and present Ph.D. colleagues and friends: Abir, Amna, Anaïs, Anh Toan, Chourouk, Fatma, Houssameddine, Ibrahim, Ishaq, Issam, Khedidja, Louise, Matheus, Mohamed, Nesrine, Réda, Richard, Sana, Simon, and Soulimane.

Many thanks to everyone for all the shared moments in which we exchanged ideas about our academic work, and for the diverse conversations that cleared my mind when I needed it most.

- Many thanks to my group of friends, Anita, Abril, Gad, Gledys, Laura, Lina, Marcos, Mariana, Melissa, Moditha, and Suela, on whom I was able to unload a lot of pressure thanks to so many shared laughs. Your sense of humor was the ray of sunshine on many cloudy days during this journey.

- Finally, my deep and sincere gratitude to my family for their unconditional and unparalleled love, help, and support. Many thanks to my parents, Veronica and Jorge, my sisters Daniela and Jimena, my brother José, and my aunt Carmen for always being there to restore my optimism when it was lacking. I am forever indebted to Régis, Marie France, Caroline, Thomas, and Marie José, who became my family; I never felt alone while being so far from my beloved Guatemala. Finally, I would like to express my deepest gratitude to Gabriel, who encouraged, supported, and loved me without fail and stood by my side every time. I will be forever grateful to you; this journey would not have been possible without all your support.

¡Muchas gracias a tod@s!

Table of Contents

Introduction

Part I: Preliminaries

1 Data Partitioning Foundations
    1.1 Introduction
    1.2 Data Partitioning Fundamentals
        1.2.1 Partitioning definition and development overview
        1.2.2 Partitioning concept evolution
    1.3 Partitioning dimensions
        1.3.1 Type
        1.3.2 Main objective
        1.3.3 Mechanism
        1.3.4 Algorithm
        1.3.5 Cost Model
        1.3.6 Constraints
        1.3.7 Platform
        1.3.8 System Element
        1.3.9 Adaptability
        1.3.10 Data model
    1.4 Partitioning approaches
        1.4.1 Partitioning by type, platform and mechanism
        1.4.2 Partitioning by data model
        1.4.3 Partitioning by adaptability
        1.4.4 Partitioning by constraints
    1.5 Partitioning in large-scale platforms
        1.5.1 Hadoop ecosystem
        1.5.2 Apache Spark
        1.5.3 NoSQL stores
        1.5.4 Hybrid architectures
    1.6 Conclusion

2 Graph Data: Representation and Processing
    2.1 Introduction
    2.2 Graph models
        2.2.1 Logical graph data structures
        2.2.2 Data storage
        2.2.3 Query and manipulation languages
        2.2.4 Query processing
        2.2.5 Graph partitioning
        2.2.6 Graph
    2.3 Resource Description Framework
        2.3.1 Background
        2.3.2 Storage models
        2.3.3 Processing strategies
        2.3.4 Data partitioning
    2.4 Conclusion

Part II: Contributions

3 Logical RDF Partitioning
    3.1 Introduction
    3.2 RDF partitioning design process
    3.3 Graph fragments
        3.3.1 Grouping the graph by instances
        3.3.2 Grouping the graph by attributes
    3.4 From logical fragments to physical structures
    3.5 Allocation problem
        3.5.1 Problem definition
        3.5.2 Graph partitioning heuristic
    3.6 RDF partitioning example
    3.7 Dealing with large fragments
    3.8 Conclusion

4 RDFPartSuite in Action
    4.1 Introduction
    4.2 RDF QDAG
        4.2.1 System architecture
        4.2.2 Storage model
        4.2.3 Execution model
    4.3 Loading costs
        4.3.1 Tested datasets
        4.3.2 Configuration setup
        4.3.3 Pre-processing times
    4.4 Evaluation of the fragmentation strategies
        4.4.1 Data coverage
        4.4.2 Exclusive comparison of fragmentation strategies
        4.4.3 Combining fragmentation strategies
    4.5 Evaluation of the allocation strategies
        4.5.1 Data skewness comparison
        4.5.2 Communication costs study
        4.5.3 Distributed experiments
    4.6 Partitioning language
        4.6.1 Notations
        4.6.2 CREATE KG statement
        4.6.3 LOAD DATA statement
        4.6.4 FRAGMENT KG statement
        4.6.5 ALTER FRAGMENT statement
        4.6.6 ALLOCATE statement
        4.6.7 ALTER ALLOCATION statement
        4.6.8 DISPATCH statement
        4.6.9 Integration of the language to other systems
    4.7 RDF partitioning advisor
        4.7.1 Main functionalities
        4.7.2 System architecture
        4.7.3 Use case
    4.8 Conclusion

Conclusions and Perspectives

Résumé

References

A Logic fragmentation example
    A.1 Raw data & encoding
    A.2 Forward graph fragment
    A.3 Backward graph fragment

B Queries
    B.1 Watdiv
        B.1.1 Prefixes
        B.1.2 Queries
    B.2 LUBM
        B.2.1 Prefixes
        B.2.2 Queries
    B.3 DBLP
        B.3.1 Prefixes
        B.3.2 Queries
    B.4 Yago
        B.4.1 Prefixes
        B.4.2 Queries

List of Figures

List of Tables

Introduction

Context

Nowadays, we live in a hyper-connected world in which large amounts of data and knowledge are issued from several providers such as social media, mobile devices, e-commerce, sensors, the Internet of Things, and many others. Much of this data is produced and available on the Internet. To date, the Web holds more than 1.8 billion websites1 and its size is steadily expanding. The data on the Web is characterized, among others, by its volume (e.g., 4 PB of data created by Facebook daily); its variety of sources with different formats (e.g., text, tables, blog posts, tweets, pictures, videos, etc.); its veracity, where issues such as data origin, authenticity, uncertainty, imprecision, completeness, data quality, and ambiguity have to be dealt with; and its velocity (the fast generation of data must correspond to the speed of its processing). The information and knowledge associated with this data, if well prepared, can be valuable to consumers (companies, governments, private users, etc.) who can obtain relevant knowledge and eventually useful wisdom. However, the availability of the data on the Web does not imply its direct and straightforward exploitation. Tim Berners-Lee, the inventor of the World Wide Web, was aware of this since the mid-1990s. His vision was that machines could manipulate and, more importantly, interpret the information published on the Web. The concept flourished in the following years, and the Semantic Web emerged from this vision. It seeks to organize the Web, composed of hyperlinked documents and resources, based on semantic annotations described and formalized in ontologies. The presence of ontologies offers reasoning capabilities over this data. These annotations let different actors such as scientific communities, companies, governments, end-users, etc. link, exchange, and share data on the Web. To satisfy these noble objectives, the data on the Web must be represented and queryable in an intuitive and clear way. With this motivation in mind, the World Wide Web Consortium (W3C) led many efforts to specify, develop, and deploy guidelines, data models, and languages that lay the foundations for the semantic interconnection of Web data. Among this set of guidelines, the W3C published a recommendation [W3C04] presenting some principles, constraints, and good practices for the core design components of the Web. In this document, the W3C established how to identify Web resources using a single global identifier: the URI (Uniform Resource Identifier). URIs, whose generalization into IRIs (Internationalized Resource Identifiers) was defined later in [W3C08], are the base component of the Semantic Web architecture [iIaC05]. When it comes to data models for the Semantic Web, the W3C proposed and largely promoted the Resource Description Framework (RDF) [RC14] to express information about resources. This data model is the cornerstone of other W3C models and languages such as (i) RDF Schema (RDFS) [DB14], providing classes and vocabularies to describe groups of related resources and their relationships; (ii) SKOS (Simple Knowledge Organization System) [AM09], a common data model to represent knowledge systems like taxonomies and classification schemas; and (iii) the Web Ontology Language (OWL) [Gro12], giving means to define and represent ontologies.

1 https://www.internetlivestats.com/


(Figure 1: Examples for RDF and SPARQL. (a) An RDF graph linking resources from http://dbpedia.org/ and http://example.org/ — Airbus, Aircraft, Aerospace, Guillaume, Engineer, and Person — through predicates such as builds, industry, works_in, profession, name, born_on, and is_a. (b) A SPARQL query: @prefix db: <http://dbpedia.org/> . @prefix ex: <http://example.org/> . SELECT ?x ?y WHERE { ?x ex:works_in db:Airbus . ?x ex:profession ?y . })

All the previous models and languages are built using RDF triples. A triple is the smallest unit of data in RDF. A triple models a single statement about resources with the structure ⟨subject, predicate, object⟩. It indicates that a relationship identified by the predicate (also known as property) holds between the subject and the object, which depict Web resources (things, documents, concepts, numbers, strings, etc.).

Example 1. The statement "Airbus builds aircrafts" can be represented by a triple as ⟨Airbus, builds, Aircraft⟩2. This triple can be represented logically as a graph where two nodes (subject and object) are joined by a directed arc (predicate), as shown in Figure 1a.

Besides, the interconnection characteristic gives RDF the ability to link triples from different datasets via their IRIs. The result of merging triples constitutes a Knowledge Graph (KG). For instance, our triple ⟨Airbus, builds, Aircraft⟩ can easily be linked to ⟨Guillaume, works_in, Airbus⟩3 through the predicate works_in (Figure 1a). To query KGs and RDF datasets, the W3C defined SPARQL [SH13a] as the standard query language for RDF. SPARQL allows expressing queries across diverse datasets. The simplest SPARQL queries are those formed as a conjunction of triple patterns (known as Basic Graph Patterns, BGPs). Triple patterns are similar to RDF triples except that the subject, predicate, or object may be a variable.

Example 2. The query in Figure 1b asks for all the engineers working at the Airbus company. The solution of this query is a subgraph of the queried graph(s) in which the variable terms are mapped to the values of the resulting subgraph. Processing a SPARQL query can be viewed as a subgraph matching problem. The results for the previous query are the mappings ?x → ex:Guillaume and ?y → Engineer.

The popularity of RDF is due to its flexibility, simplicity, and the availability of a query language. Therefore, several organizations and governments have spent a lot of effort to publish their data on the Web in RDF. This phenomenon is known as Linked Open Data (LOD). The LOD cloud is one of the most well-known initiatives that allows available datasets on the Web to be referenced. The number of datasets in the LOD cloud has increased rapidly. Currently, it counts more than 1260 datasets, where a large subset of them contains several billions of triples4, as shown in Figure 2. These datasets are accessed via SPARQL endpoints (RESTful web services exposing RDF data queried with SPARQL) or by downloading the data as data dumps. Furthermore, the spread of RDF and SPARQL is evidenced by their usage to process and query recent valuable data to increase decision power and crisis management. For example, with the recent Covid-19 pandemic, different organisms, local health authorities, press structures, research laboratories, etc. generated large amounts of data on a daily basis. The U.S.

2 This triple is a part of DBpedia's Knowledge Graph.
3 A part of a KG: http://example.org/Guillaume
4 https://lod-cloud.net/


(Figure 2: Linked Open Data (LOD) evolution. Left: the number of datasets in the LOD cloud from 2007 to 2018. Right: dataset sizes in millions of triples — DBLP: 207, Yago: 284, Freebase: 2,000, DBpedia'14: 3,000, Wikidata: 12,000.)

Government's open data portal published daily stats in RDF format to facilitate their integration5. Another example is the work developed by the INRIA research team Wimmics6. They developed a Knowledge Graph named Covid-on-the-Web [WRT20], merging academic articles from the COVID-19 Open Research Dataset. The graph counts more than 674 million triples describing parts of scientific publications (title, abstract, and body) and information that is enriched by annotations obtained from other sources like DBpedia. Today, Knowledge Graphs (KGs) have become a popular means for academia and industry for capturing, representing, and exploring structured knowledge. The pioneer KG, DBpedia [LIJ+15], was released at the beginning of 2007 by a group of academics. It aims at collecting the semi-structured information available on Wikipedia. It was followed by Freebase [BEP+08], which served as the basis for Google's KG, Yago [SKW07], Wikidata [Vra12], and many other academic graphs. The launch of the first commercial KG by Google in 2012 coined the term Knowledge Graph and established its crucial role in web data management. Google's team uses it to enhance its search engine's results with infoboxes (presenting a subset of facts about people, places, things, companies, and many other entities that are relevant to a particular query). The information in Google's KG is harvested from a variety of sources, and today it relates facts from billions of entities that are exploited not only by their search engine but also by smart devices like Google Assistant and Google Home. Since then, several Web contenders have presented their own KGs. Among them, we can mention the Yahoo KG [Tor14] in 2013, the Bing KG [Tea15], the LinkedIn KG [He16], and many more. This large panoply of KGs can be divided into three main classes [BSG18]:

(i) Generalized KGs, whose sources cover a variety of topics; here we find KGs such as Freebase, Wikidata, NELL, DBpedia, YAGO, MetaWeb, Prospera, Knowledge Vault, GeoNames, ConceptNet, etc.;

(ii) Specialized KGs, gathering information from similar subject matters, such as the Facebook KG (a social graph with people, places, and things combined with information from Wikipedia), the Amazon Knowledge Graph (which started as a product categorization ontology), Wolfram Alpha (linking world facts and mathematics), the LinkedIn KG, and the Recruit Institute of Technology KG (connecting people, skills, and recruiting agencies);

(iii) Enterprise KGs, like those developed specifically for some companies such as banks and credit rating agencies.

This variety and wealth of KGs motivate researchers from other fields of science to integrate them in different phases of their life cycles. Recently, KGs have been used in recommendation systems for explanation purposes of recommended items [GZQ+20]. In Business Intelligence systems, KGs have also been used to curate the data and augment data warehouses to reach higher value [BBKO20].

5 https://catalog.data.gov/dataset/covid-19-daily-cases-and-deaths
6 https://team.inria.fr/wimmics/

In Question-Answering Systems, KGs are used in several phases of the life cycle [SRB+18]. In [BBB19], they are used for handling ambiguity problems in Natural Language queries.

The Ecosystem of Knowledge Graphs

The above presentation shows the complexity of the environment of KGs. From research and learning perspectives, it has become urgent to deeply understand this environment in order to identify its different elements. Making these elements explicit contributes to understanding some research issues. The main elements of a KG are: actors, requirements, constraints, services, and storage and deployment infrastructures. In Figure 3, we propose a detailed ecosystem of a KG using an Entity-Relationship diagram. We comment on this diagram as follows: a KG is the entry point of our diagram. It has a name (KG Name) and a nature (certain/uncertain, like NELL [CBK+10] and Knowledge Vault [DGH+14]). A KG concerns one or several domains/topics (Medicine, Environment, Education, etc.), and it is built from one or several data sources. It is defined and designed by one or multiple creators. A creator owns the intellectual property of the KG; it can be any organization or user. A creator has a name (Creator Name) and a type (academic/industrial). A KG is administered by a manager that can be different from its creator. The manager is in charge of implementing internal (e.g., storage, loading, maintenance, archive, security) and external (e.g., access, availability, extraction) services. Each service has a name (Service Name) and a means to manipulate and reach the data (using either a SPARQL endpoint or a data dump in the case of external services). The KG is stored in a storage infrastructure (a triple store). It has a name (Store Name) and information about its internal storage (native, non-native). This store is deployed on a given platform that may be either centralized, distributed, or parallel. Finally, a KG is designed to be exploited by consumers using the external services that extract fragments/views of the referred KG. The extracted data can be materialized at the consumer level in a different triple store. A consumer can be any entity covering normal users, bots, learning institutions, researchers, industrials, etc.

(Figure 3: Knowledge Graphs Ecosystem — an Entity-Relationship diagram relating the Knowledge Graph entity (KG Name, Nature) to Creator (Creator Name, Creator Type), Domain (DName), Source (SName), Manager (M Name), Consumer (Cons Name), Service (Service Name, Mean), Triple Store (Store Name, Storage), Platform (Platform Name, Platform Type), and Query/View.)

Figure 4 gives an instantiation of our model for the Yago KG, where we assume that the LIAS Laboratory is one of its consumers.

(Figure 4: An example of an instantiation of our model — the case of the Yago KG: domains Geography and Economy; sources Wikipedia and Geonames; creator Max Planck Institute, academic; nature certain; manager Telecom ParisTech; consumer LIAS Lab, through an endpoint access service; stored in Virtuoso (non-native) on a centralized platform P1 and in RDF QDAG (native) on a centralized platform P2.)

Triple Stores with some numbers

The storage infrastructure that includes triple stores is an important element of the KG ecosystem, since it has to provide the following services:

• A scalable storage of the Web data deluge (e.g., academic KGs: Yago2 → 284 million triples, DBLP → 207 million triples, DBpediaEN → 538 million triples; commercial KGs: eBay product KG → 1 billion triples, Google's KG → 500 billion triples),

• A quick response time of SPARQL queries to meet consumer requirements,

• A high availability of this data and these services (e.g., the DBpedia SPARQL endpoint receives more than 177k queries per day [VHM+14]).

The Race for Efficient Triple Stores

As suggested in the above sections, it is important to notice that the triple store is at the heart of the KG ecosystem. This situation motivates and pushes researchers from the core databases, Semantic Web, and information systems infrastructure communities to develop triple stores, which can be classified into two main groups: non-native and native, as illustrated in Figure 5.

(Figure 5: Triple stores classification — by type: non-native vs. native; by platform: centralized vs. distributed.)

Non-native Systems

The managers of these systems choose existing storage solutions to build their triple stores. Relational Database Management Systems (RDBMSs) were widely used to design such systems. Virtuoso [EM09] is one of the most popular triple stores adopting this strategy. Its storage model consists of a single table naturally including three columns for the attributes of the RDF triple (subject S, predicate P, and object O).


At query run-time, any SPARQL query is translated to SQL. This strategy generates multiple self-joins (one for each triple pattern) of the single table, which significantly degrades the system's performance. To reduce the number of self-joins induced by this type of approach, Jena [Wil06] proposed a new storage of the traditional triples as a flat table (called a property table) including the subject and all predicates as attributes. This storage strategy generates the problem of NULL values and struggles to represent multi-valued properties. To overcome the above drawbacks, other systems used one of the oldest optimization techniques, known under the name of data partitioning. SW-Store [AMMH09] is one of these systems. It vertically partitions the above-cited property table into L fragments (where L represents the number of predicates). Each fragment is associated with one predicate and composed of two columns (subject, object), known as binary tables. This strategy solved the storage problems caused by NULL values but suffered overhead costs when many predicates were involved in a single SPARQL query. The emerging schemas generated by the strategy proposed in [PPEB15] are the result of splitting the triple table into several three-column tables according to induced predicate sets.
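To make the self-join pattern concrete, here is a minimal sketch (not Virtuoso's actual translation) of how the query of Figure 1b could be rewritten over a single triple table t(s, p, o), assuming IRIs are stored as plain strings:

-- One table instance per triple pattern: t1 matches ?x ex:works_in db:Airbus,
-- t2 matches ?x ex:profession ?y; the shared variable ?x becomes the join
-- condition on subjects.
SELECT t1.s AS x,   -- binding of ?x
       t2.o AS y    -- binding of ?y
FROM   t AS t1
JOIN   t AS t2 ON t2.s = t1.s
WHERE  t1.p = 'ex:works_in' AND t1.o = 'db:Airbus'
AND    t2.p = 'ex:profession';

A BGP with n triple patterns induces n − 1 such self-joins, which is why this storage model degrades quickly as queries grow.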

Summary. Non-native systems are considered naive solutions to store RDF data, taking advantage of the different services offered by RDBMSs in terms of storage, optimization, maintenance, etc. Due to the limited schema of a triple (having only three attributes), indexing techniques were widely used to speed up SPARQL queries. Despite this, few attempts were made to vertically partition the triple table. This partitioning is performed on the structure of the triples and surprisingly ignores the graph structure of the KG, which represents its logical schema.

Native systems

The managers of these systems do not rely on existing RDBMSs and build the infrastructure of their RDF stores from scratch. Two main storage strategies are used in these systems: table- and graph-driven strategies. Systems using table-driven strategies store triples in tables. Contrary to non-native storage systems, the data in native systems is encoded and eventually compressed. RDF-3X [NW08] is the most representative system in this category. It stores triples in three-column tables in compressed clustered B+-trees. For performance reasons, intensive indexes are built considering all the permutations of the columns (e.g., SPO, OPS, PSO), among others. Systems following graph-driven strategies, like gStore [ZÖC+14], use physical implementations of graphs such as adjacency lists. The optimizations of these systems are strongly dependent on the graph implementation structures (e.g., VS*-tree indexes in gStore).

Summary. Native systems follow the same philosophy as the non-native ones in using intensive indexing techniques to speed up SPARQL queries. In both kinds of systems, optimizations are performed at the physical level. Logical optimizations (such as data partitioning), which have been widely used in relational databases, have not had their rightful place.

Researchers and managers of triple stores have grasped the great opportunity offered by the large availability of new programming paradigms, big data frameworks, cloud computing

6 Introduction platforms to develop scalable stores. Unfortunately, this opportunity was not available in the golden age of Relational Databases. In the following, a short introduction is given for distributed and parallel triple stores.

Distributed and Parallel Triple Stores

For scalability reasons, a large panoply of distributed and parallel triple stores exists. The life cycle of developing such a store can easily be inspired by the one used for designing traditional distributed and parallel databases [BBC14]. It has the following steps: (1) choosing the hardware architecture, (2) partitioning the target KG (by the component we call the fragmenter), (3) allocating the generated fragments over the available nodes (the allocator), (4) replicating these fragments for efficiency purposes, (5) defining efficient query processing strategies, (6) defining efficient load balancing strategies, and (7) monitoring and detecting changes. Data partitioning is a precondition to ensure the deployment, efficiency, scalability, and fault tolerance of these systems [ÖV11]. The sensitivity of RDF data partitioning and its impact on the performance of RDF stores was recently discussed in a paper presented at a SIGMOD workshop in 2020 [JSL20]. Data partitioning has to be transparent for the consumers of triple stores, in the sense that they should not have to care about how the data was partitioned. In contrast, for the managers, data partitioning is not transparent. This is because they have to define, evaluate, and tune the different data partitioning techniques that fit their data. By analyzing the major distributed and parallel triple stores, we propose to classify them into two main categories based on the degree of transparency granted to the designers of these systems: transparent7 and opaque systems. In transparent systems, the data partitioning is delegated to the host (like a cloud provider) offering services to manage the triple store. For instance, the Cliquesquare system [GKM+15] is built on top of the Hadoop framework, which is in charge of performing the data partitioning. In these systems, the RDF data partitioning is hidden from local managers. In contrast, in opaque systems, all elements of the data partitioning environment are mastered by the managers. The different partitioning modes of RDF data are part of this environment (e.g., hashing, graph partitioning, and semantic hashing [JSL20]). These systems are usually deployed in a master/slave architecture in which a fragment of triples is distributed to each slave (e.g., RDF-3X). At query runtime, the master node (which has a catalog of the data) sends queries to each slave and, if necessary, combines the results to produce the final answer. From the above presentation covering centralized, distributed, and parallel triple stores, three main lessons are learned:

1. A strong demand from managers and consumers for developing triple stores by exploiting the technological opportunities in terms of software and hardware is noticed. More concretely, the DB-Engines website8 ranked 19 RDF stores in October 2020. These stores span centralized and parallel triple store systems.

2. Any initiative to use data partitioning in triple stores should benefit from the large experience of academia, commercial DBMS editors, and open-source developers (e.g., the case of PostgreSQL, which nicely supports data partitioning) in defining and using data partitioning in the context of traditional databases. This experience has to be capitalized on to augment the reuse and reproducibility of the findings on data partitioning for RDF data. This capitalization is achieved by making the whole environment of data partitioning explicit.

3. The necessity of developing an RDF data partitioning framework for managers willing to design centralized and parallel triple stores.

7 Transparency is associated with the data partitioning.
8 https://db-engines.com/en/ranking/rdf+store

(Figure 6: Horizontal partitioning evolution in Oracle RDBMS, across versions 8 (1997) through 11g (2007): single modes (hash, range, list); composite modes (range-list, range-hash, range-range, list-range, list-list, list-hash); global and local indexing; and, later, reference partitioning, virtual column partitioning, hash partitioning, and a partition advisor.)

Data Partitioning in Triple Stores: Why Reinvent the Wheel?

To make explicit the different elements of RDF data partitioning, let us overview data partitioning in the context of traditional databases in order to reuse its success stories. Data partitioning was first considered as a logical technique for designing relational databases and then as a physical optimization after the amplification of physical design motivated by the arrival of data warehousing technology [ANY04]. It is used both in centralized and in distributed/parallel databases. Three main types of data partitioning exist (vertical, horizontal, and hybrid). These types will be discussed at length in Chapter 1. Several partitioning modes have been proposed, evolving with each generation of databases. Two types of modes exist: single modes (range, list, and hash partitioning) and composite modes (combining the single ones). Figure 6 shows the spectacular adoption of data partitioning and the services offered by DBMS editors such as Oracle. Native support is available for horizontal data partitioning (e.g., CREATE TABLE ... PARTITION BY RANGE(...)). The decision of partitioning a table is delegated to the database administrator. The main characteristic of data partitioning is its ability to be combined with some redundant optimization techniques such as indexes and materialized views. This implies that the declaration of the partitions of a table is performed before deploying a given database. Two versions of horizontal partitioning are available [CNP82]: primary and derived horizontal partitioning (a SQL sketch of both versions is given after the list below). Primary horizontal partitioning is performed on a single table T using simple predicates defined on T. A simple predicate has the following form: Attr θ value, where Attr is an attribute of T, θ ∈ {<, >, =, ≤, ≥}, and value ∈ domain(Attr). It can be performed using the different fragmentation modes cited above. Derived horizontal partitioning is the result of the propagation of a partitioning schema of a table T onto a table R. Derived horizontal partitioning is feasible if a parent-child relationship exists between T and R. Due to the difficulty of selecting an optimal data partitioning scheme for a given database, several algorithms with different strategies (including affinity-based [NCWD84], graph-based [NR89], minterm-based [ÖV11], and cost-model-based [RZML02]) have been proposed to satisfy a set of non-functional requirements such as query performance, network transfer cost, number of final fragments, etc. Two main classes of these algorithms are distinguished:

(i) Query-driven algorithms, which require a priori knowledge of a representative workload as an input [ÖV11, NR89];

(ii) Data-driven algorithms, which are independent of any representative query workload.
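As announced above, the following sketch illustrates the two versions of horizontal partitioning in SQL. It uses Oracle-style syntax (range and reference partitioning); the tables, columns, and threshold are illustrative and not taken from any system discussed here:

-- Primary horizontal partitioning: table T = Customer is split using
-- the simple predicate "balance < 1000" (the Attr θ value form).
CREATE TABLE Customer (
    cust_id  NUMBER PRIMARY KEY,
    country  VARCHAR2(30),
    balance  NUMBER
)
PARTITION BY RANGE (balance) (
    PARTITION p_low  VALUES LESS THAN (1000),     -- balance < 1000
    PARTITION p_high VALUES LESS THAN (MAXVALUE)  -- balance >= 1000
);

-- Derived horizontal partitioning: the child table R = Orders inherits
-- Customer's partitioning schema through the parent-child (foreign key)
-- relationship; in Oracle this is REFERENCE partitioning.
CREATE TABLE Orders (
    order_id NUMBER PRIMARY KEY,
    cust_id  NUMBER NOT NULL,
    amount   NUMBER,
    CONSTRAINT fk_cust FOREIGN KEY (cust_id) REFERENCES Customer (cust_id)
)
PARTITION BY REFERENCE (fk_cust);

Each Orders row is then stored with the partition of the Customer row it references, so a join between the two tables can be executed partition-wise.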

Due to the complexity of performing data partitioning, several academics and DBMS editors have proposed wizards for assisting designers (database administrators) in their day-to-day partition management tasks. The most important advisors are shown in Table 1. From this table, we can see that data partitioning and indexes are supported in all the mentioned advisors. This is due to their effectiveness in improving query performance.


Name                                       | Type       | P | I | MV | C
SQL Server Database Tuning Advisor (DTA)   | Commercial | ✓ | ✓ | ✓  |
Oracle SQL Access Advisor                  | Commercial | ✓ | ✓ | ✓  |
DB2 Index Advisor tool [RZML02]            | Commercial | ✓ | ✓ | ✓  | ✓
Parinda [MDA+10]                           | Academic   | ✓ | ✓ |    |
SimulPh.D [BBA09]                          | Academic   | ✓ | ✓ |    |
(P = Partitioning, I = Indexes, MV = Materialized views, C = Clustering)

Table 1: Partitioning advisors in [Bel18]

The abundance and the popularity (in terms of research and education) of data partitioning in traditional databases allow us to propose its environment, described in an Entity-Relationship diagram in Figure 7 and commented as follows: Data Partitioning (DP) is the entry point of our diagram. A DP has a name (Part Name), a type (Part Type, i.e., horizontal, vertical, or hybrid), a version (i.e., primary or derived), and an inventor described either by the title of her/his published paper(s) describing the method or by the name of the system implementing it. A DP strategy is applied to one or many Entities. An Entity is identified by a schema (e.g., a table, a class) and an extension (e.g., the tuples of a table, the instances of a class) stored in a given Entity Store (e.g., a DBMS) that is deployed on a platform. A DP strategy is designed to meet one or many objective functions measured by mathematical cost models [OOB18]. Each Cost Model has a name (Cost Name) and a metric corresponding, for instance, to CPU, input/output, or transfer costs. A DP strategy has to satisfy one or several Constraints and must be implemented by a set of algorithms. An Algorithm has a name (Algo Name) and a type (Algo Type, i.e., data-driven or query-driven). Finally, a DP strategy has one or many partitioning modes (Mode). A Mode has a name (Mode Name) and a type (Mode Type, i.e., simple or composite).

(Figure 7: Partitioning environment in relational databases — an Entity-Relationship diagram relating Data Partitioning (Part Name, Part Type, Version) to Cost Model (Cost Name, Metric), Objective (Obj Name), Algorithm (Algo Name, Algo Type), Mode (Mode Name, Mode Type), Constraint (Constr Name), Entity (Schema, Extension), Entity Store (Store Name), Platform (Type Platform), and Inventor (Title).)

Figure 8 gives an excerpt of an instantiation of our model for the horizontal partitioning method defined in [BBRW09]. The authors of this paper proposed primary and derived horizontal partitioning schemes for a relational data warehouse schema stored in a centralized RDBMS. A cost-driven algorithm is given to generate a number of final fragments of the fact table (which must not exceed a threshold fixed by the data warehouse administrator). The obtained fragments have to minimize the number of inputs/outputs (I/Os) when processing a set of OLAP queries known in advance. The traditional modes (range, list, and hash) are applied to the obtained final fragments of the fact and dimension tables.

(Figure 8: Horizontal partitioning in [BBRW09], instantiating the model of Figure 7 — objective: minimize response time; cost model: I/Os; algorithm: cost-driven, data-driven; entity store: RDBMS; platform: centralized; mode: hash, single; data partitioning: derived horizontal on a data warehouse; constraint: number of fragments; entity: table (tuples); inventor: [BBRW09].)

The Reuse of Data Partitioning in Relational Databases in Other Generations

The lessons of relational databases served as a basis for many other data models that came after. Let us take, for example, the object-oriented model, whose partitioning (of classes) was initially explored by [CWZ94] and then implemented in [KL95, KL00]. Later, horizontal and vertical partitioning methods for classes were proposed in the literature by [BKS97] (referential horizontal partitioning [BKL98]) and [FKL03], respectively. Similar examples can be mentioned in data warehouses (e.g., horizontal fragmentation of the fact table in a star schema [SMR00] and the use of referential partitioning in the dimensional data model [BBRW09]) and in semi-structured models like XML (e.g., [BLS09]). So we wonder: why has this type of approach not been seen so far in RDF databases? Without a doubt, data partitioning is currently at the center of the debate regarding its great influence on the performance of triple stores. Making a partitioning environment explicit for triple store managers can help not only to understand current strategies but also to set the grounds for future extensions. The Entity-Relationship diagram shown in Figure 7 does not have any component related to the physical storage structure of the data. Indeed, DP strategies are applied to the entities that logically represent the data. The manager of a relational database should not have to worry about how exactly the data are physically represented on disk. In fact, the physical storage structures are proper to each RDBMS. The key to data partitioning in these models is their logical layer. This layer allowed the definition of algorithms, cost models, constraints, etc., that are independent of particular system implementations. In the following, we address how this layer could be introduced for RDF datasets.

The Logical Layer as a Key to RDF Partitioning

To analyze the factor that allowed many models to take ideas, techniques, algorithms, etc. from the relational model and adapt them to their liking, let us consider the example illustrated in Figure 9. In this figure, we compare the partitioning of a table, a class, an XML document, and a set of RDF triples. In Figures 9a, 9b, and 9c, we see that data partitioning is applied first at the logical level, regardless of the tuples inside the table, the instances of the class, or the values inside the tags of the XML document. Data partitioning is thus applied to high-level entities (such as a table, a class, or the schema of the XML document). This characteristic made it possible to naturally reuse/map/adapt the partitioning strategies of relational databases to all these other models. In contrast, in RDF, the data are partitioned at a much finer granularity (as shown in Figure 9d). Let us remember that the triple is the basic representation unit in RDF. This gives the

(Figure 9: Partitioning example by data model — the same Company data (ID: 1, Name: Airbus, Product: Aircraft, Industry: Aerospace) partitioned in (a) the relational model, (b) the object-oriented model, (c) XML, and (d) RDF.)

model a lot of flexibility, but it certainly complicates the reuse of data partitioning techniques as has been done in other models. In RDF, the data are partitioned with merely physical strategies using the triple as the fragmentation unit. This does not mean that RDF triples do not have any schema. This schema exists, but contrary to the models illustrated in Figures 9a, 9b, and 9c, it is implicit. This implicit schema can be identified with various techniques, either based on the hierarchy of classes and ontologies of the predicates (using the RDFS and OWL schemas, for example), or relying only on the structure of the data. So, if logical entities can be detected, why not use/adapt them to propose partitioning techniques inspired by the relational model, so widely studied?

This motivates us to provide triple store designers with a framework to partition RDF data from a logical level first. We do not seek to change the features that made RDF flourish, but to incorporate a common logical layer into the design process of RDF partitions. The logical layer allows designers to work on structures with a higher level of abstraction than that offered by the triples. To regroup the triples, we propose two strategies: by instances and by attributes. These structures are defined in Chapter 3 of this thesis. The first strategy analyzes each node of the RDF graph and its outgoing edges. The set of predicates of a subject node characterizes it [NM11]. In the example of Figure 9d, the Airbus node is characterized by the predicates has_name, has_product, and industry. We refer to a node and its outgoing triples as a forward data star. Making an analogy with the object-oriented model, a forward data star is an instance of some class. To identify to which high-level entity a forward data star belongs, we can apply two types of techniques, using structural or semantic clustering. Groups of forward data stars form forward graph fragments, which harmonize with the notion of horizontal partitions. Vertical partitions, on the other side, are identified by grouping the nodes with their incoming edges into backward data stars. The backward stars are further grouped into backward graph fragments. These structures allow defining an explicit schema for RDF data (a SQL sketch of this grouping is given below). They are the baseline of our framework, allowing triple store managers to partition RDF graphs based on logical structures. Our framework allows managers (designers) to partition in an informed way, considering the inherent graph structure of the data, its connectivity, and other constraints related to the partitioning environment (e.g., system constraints). In the following sections, we describe the components of this partitioning framework.
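As a rough illustration of the instance-based strategy, assume the triples are stored in a plain three-column table triples(s, p, o) — an assumption made for this sketch only, not the storage model of any system discussed here. The predicate set that assigns each data star to a graph fragment can then be computed with a simple aggregation (PostgreSQL syntax):

-- Forward data stars: each subject node with its outgoing triples.
-- Subjects sharing the same (sorted) set of outgoing predicates are
-- grouped into the same forward graph fragment.
SELECT s AS subject_node,
       string_agg(DISTINCT p, ',' ORDER BY p) AS characteristic_set
FROM triples
GROUP BY s;

-- Backward data stars: symmetrically, each object node with its
-- incoming predicates; their groups form the backward graph fragments
-- (the analogue of vertical partitions).
SELECT o AS object_node,
       string_agg(DISTINCT p, ',' ORDER BY p) AS characteristic_set
FROM triples
GROUP BY o;

This corresponds to a purely structural clustering; semantic clustering (e.g., exploiting rdf:type and the class hierarchy of the ontology) is the second family of techniques mentioned above.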


(Figure 10: Fusing data partitioning into both worlds — the strengths of data partitioning in the relational databases universe meet the constraints of the Knowledge Graph universe.)

Thesis Vision

In this thesis, we position ourselves as the manager of a triple store who must satisfy the creator's needs described in a manifest (Figure 11). This manifest contains the KG, non-functional requirements, constraints, and offered services imposed by the creator. We favor the usage of data partitioning in designing triple stores by reusing its strong points identified in traditional databases. Our RDF data partitioning is proposed to be used either in new triple stores (the case of our RDF QDAG store, currently developed in our LIAS Laboratory) or in already designed ones (such as gStoreD). We claim that the implementation of our vision passes through the following tasks:

1. A deep understanding of data partitioning as proposed in the context of centralized/parallel/distributed traditional databases.

2. The reuse of its strong points in terms of techniques and tools. To reach this objective, the establishment of a concise and complete survey of data partitioning is required, followed by a reproduction of its principles and strong aspects for RDF data. Figure 10 shows the connection between the data partitioning and KG environments.

The objectives that we set in our thesis are:

(i) The definition of a common framework for centralized and parallel triple stores, with comprehensive components implementing our vision.

(ii) The proposal of efficient data-driven partitioning algorithms defined on logical structures of RDF, as in relational databases.

(iii) The instantiation of our framework in centralized and parallel triple stores, taking into account the requirements of consumers in terms of constraints and non-functional requirements such as the performance of SPARQL queries.

(iv) Making tools available for designers, including a fragment manipulation language and a wizard (an advisor) to assist them in their tasks. Both are based on the logical fragments that we previously defined.

Contributions

The main contributions of our work are:

• With the motivation of increasing the reproduction and reuse of the important findings of data partitioning in the context of traditional centralized and parallel databases in mind, we propose a complete survey of this problem covering the important generations of the database world.

• The definition of a framework, called RDFPartSuite, illustrated in Figure 11. It supports our vision in designing centralized and parallel triple stores. RDFPartSuite can be personalized to the type of the platform. One of the main characteristics of RDFPartSuite

(Figure 11: The RDFPartSuite Framework.)

is that it follows the "outside a DBMS" paradigm [Ord13], in which all technical efforts are performed externally to the triple stores. Based on the type of the platform, two scenarios are possible:

1. If the platform is centralized, our framework activates the fragmenter module, in charge of partitioning triples based on their logical representation. The definition of these fragments is sent to the target triple store. After that, the triples of the creator's KG are loaded into this store (in our case, RDF QDAG).

2. When the platform is distributed or parallel, the fragmenter sends the obtained fragmentation schema to the allocator component. The latter's role is to assign the different fragments to the nodes of the target platform. This allocation is performed based on a data-driven approach (i.e., considering the inherent connectivity between fragments). Once the allocation schema is obtained, the dispatcher component sends the detailed description of this schema to the different nodes and then loads the triples according to their definitions.

3. Assistance tools are proposed to help managers in their tasks. These tools take into account the expertise level of the manager. If she/he is an expert in the studied problems, she/he can easily perform the different tasks offered by the framework; for that, she/he needs a language to describe and define the different schemas (fragmentation/allocation/dispatching). In case she/he is not an expert, we propose an advisor, as in traditional DBMS editors.

• The definition of a set of logical entities (named graph fragments) used to perform the fragmentation of RDF datasets at a logical level in centralized and parallel environments. This is a step forward with respect to the current strategies used by triple stores, which rely on merely physical fragmentation strategies that depend on the specific storage structure used to represent the triples.

• The deployment of this framework in the loading modules of a centralized system (RDF QDAG) and of a distributed triple store (gStoreD [PZÖ+16]).


Thesis Outline

The remainder of our manuscript is composed of two main parts, as illustrated in Figure 12.

First part: State-of-the-art and Objectives

The first part presents our state of the art and background concepts, revolving around two chapters. In Chapter 1, we develop a complete survey of data partitioning. We conduct a comprehensive review of the literature on the data partitioning problem with the aim of letting the reader get a complete overview of the problem in terms of definitions, variants, and constraints. We define the data partitioning foundations along with ten dimensions to classify the spectrum of works related to this matter. This places the reader in the whole history of the evolution of the data partitioning concept. Then, Chapter 2 starts by giving a general overview of the graph data model. We detail its logical and storage structures, query processing strategies, and languages. Next, to better delimit our work, we focus on the many existing RDF systems. We start by giving some background concepts about RDF storage, processing, and partitioning strategies. We classify the existing systems according to their fragmentation, allocation, and replication strategies. We conclude this part by comparing the partitioning strategies seen in Chapter 1 with those of the RDF systems discussed in Chapter 2. This comparison serves as the basis for setting out the objectives of our contributions.

Second Part: Contributions

We detail in this part our two main contributions. In Chapter 3, we describe the fragmenter and allocator components of our framework. We start by formally defining the logical entities used to partition a KG. We detail the algorithms to generate them from raw RDF datasets. Then, we define the allocation problem for the fragments identified previously. We show that the connectivity of graph fragments can be expressed as a graph. Thereby, we propose some heuristics to partition the fragments based on graph-partitioning techniques. In Chapter 4, we implement our framework in a centralized and a parallel triple store. We first integrate our findings into the loading module of a graph-based centralized triple store. This system, RDF QDAG [KMG+20], stores the triples using the logical structures defined in the previous chapter. We start by giving an overview of this system, describing its storage, optimization, and processing modules. Then, we evaluate RDF QDAG in three main stages: data loading, fragmentation, and allocation. We compare its loading costs and query performance with respect to other representative systems of the state of the art. We use real and synthetic datasets of variable sizes to test the system's scalability. Then, we show the integration process of our framework into the loading stage of gStoreD [PZÖ+16]. Finally, we present the assistance tools as extensions to our framework. These tools include a Data Definition Language and an RDF partitioning advisor.

Last Part: Conclusions and Perspectives

This chapter concludes the thesis by providing a summary and an evaluation of the presented work. Lastly, it discusses several opportunities for future work.

(Figure 12: Breakdown of thesis chapters. Part I: State-of-the-art and Objectives — Chapter 1, Data Partitioning Foundations (fundamentals, partitioning dimensions, approaches) and Chapter 2, Graph Data: Representation and Processing (graph models, RDF background, RDF partitioning strategies), whose strengths feed the objectives. Part II: Contributions — Chapter 3, Logical RDF Partitioning (motivation, graph fragments, allocation problem) and Chapter 4, RDFPartSuite in Action (integration into RDF QDAG, integration into gStoreD).)

Publications

International journals

1. Abdallah KHELIL, Amin MESMOUDI, Jorge GALICIA AUYON, Ladjel BELLATRECHE, Mohand-Saïd HACID, Emmanuel COQUERY. Combining Graph Exploration and Fragmentation for RDF Processing. Information Systems Frontiers (Q1), Springer, (2020), DOI: 10.1007/s10796-020-09998-z.

International conferences

1. Ishaq ZOUAGHI, Amin MESMOUDI, Jorge GALICIA AUYON and Ladjel BELLATRECHE. Query Optimization for Large Scale Clustered RDF Data. Proceedings of the 22nd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP, Core B), Copenhagen, Denmark, (2020), pp. 56–65, URL: http://ceur-ws.org/Vol-2572/paper18.pdf.

2. Jorge GALICIA AUYON, Amin MESMOUDI and Ladjel BELLATRECHE. RDFPartSuite: Bridging Physical and Logical RDF Partitioning. Proceedings of the 21st International Conference on Big Data Analytics and Knowledge Discovery (DaWaK, Core B), Vienna, Austria, (2019), pp. 136–150, DOI: 10.1007/978-3-030-27520-4_10.

3. Jorge GALICIA AUYON, Amin MESMOUDI, Ladjel BELLATRECHE and Carlos ORDONEZ. Reverse Partitioning for SPARQL Queries: Principles and Performance Analysis. Proceedings of the 30th International Conference on Database and Expert Systems Applications (DEXA, Core B), Vienna, Austria, (2019), pp. 174–183, DOI: 10.1007/978-3-030-27618-8_13.

4. Abdallah KHELIL, Amin MESMOUDI, Jorge GALICIA AUYON and Mohamed SENOUCI. Should We Be Afraid of Querying Billions of Triples in a Graph-Based Centralized System? Proceedings of the 9th International Conference on Model and Data Engineering (MEDI), Toulouse, France, (2019), pp. 251–266, DOI: 10.1007/978-3-030-32065-2_18.

National conferences

1. Abdallah KHELIL, Amin MESMOUDI, Jorge GALICIA, and Ladjel BELLATRECHE. EXGRAF: Exploration et Fragmentation de Graphes au Service du Traitement Scalable


de Requêtes RDF. 16e journées EDA "Business Intelligence & Big Data", Lyon, France, (2020), pp. 47–60.

2. Jorge GALICIA AUYON, Amin MESMOUDI and Ladjel BELLATRECHE. RDFPartSuite: Bridging Physical and Logical RDF Partitioning. Actes de la 35e Conférence Internationale sur la Gestion de Données – Principes, Technologies et Applications (BDA), Lyon, France, (2019).

Submitted papers

1. Ishaq ZOUAGHI, Amin MESMOUDI, Jorge GALICIA AUYON and Ladjel BELLATRECHE. GoFast: Graph-based Optimization for Efficient and Scalable Query Evaluation. Information Systems (Q1).

2. Jorge GALICIA AUYON, Amin MESMOUDI, Ishaq ZOUAGHI and Ladjel BELLATRECHE. The Wisdom of Logical Partitioning as a Foundation for Large-Scale RDF Processing. Data and Knowledge Engineering (Q2), Elsevier.

Part I

Preliminaries

Chapter 1

Data Partitioning Foundations

Contents

1.1 Introduction
1.2 Data Partitioning Fundamentals
    1.2.1 Partitioning definition and development overview
    1.2.2 Partitioning concept evolution
1.3 Partitioning dimensions
    1.3.1 Type
    1.3.2 Main objective
    1.3.3 Mechanism
    1.3.4 Algorithm
    1.3.5 Cost Model
    1.3.6 Constraints
    1.3.7 Platform
    1.3.8 System Element
    1.3.9 Adaptability
    1.3.10 Data model
1.4 Partitioning approaches
    1.4.1 Partitioning by type, platform and mechanism
    1.4.2 Partitioning by data model
    1.4.3 Partitioning by adaptability
    1.4.4 Partitioning by constraints
1.5 Partitioning in large-scale platforms
    1.5.1 Hadoop ecosystem
    1.5.2 Apache Spark
    1.5.3 NoSQL stores
    1.5.4 Hybrid architectures
1.6 Conclusion


Summary. In our Introduction, we discussed the importance of describing in detail the partitioning environment for designing opaque systems. In this chapter, we conduct a comprehensive survey of the literature regarding the partitioning problem in relational databases. We look at the fundamentals of the problem and its evolution over all the database generations. Our aim is to give a complete overview of how the problem was treated in these systems, identifying its main elements and strong points. The chapter is organized as follows. We start in Section 1.2 by citing the works that led to the explicit definition of partitioning, stating the data partitioning foundations. Then, in Section 1.3, we define the problem and its main dimensions, which are then used to classify the surveyed works. Next, Section 1.4 organizes the works according to these dimensions, giving a high-level comparison and classification of existing partitioning approaches. We introduce the partitioning approaches in the most recent large-scale platforms in Section 1.5. Finally, in Section 1.6, we give insights into open problems and compare the approaches presented in this chapter.


1.1 Introduction

Among the most studied optimization techniques in databases, data partitioning occupies a very important spot, not only for its effectiveness in improving database performance, but also because in modern large-scale parallel and distributed architectures it is a mandatory step. Data partitioning notions were introduced in the seventies in [HS75], shortly after the definition of database indexes. Along with materialized views and indexes, data partitioning was at an early stage part of the same pool of optimization strategies sketched during the physical design phase [LGS+79]. These strategies seek to improve the speed of retrieval operations by reducing the data transferred over the network in a distributed database, or between primary and secondary storage in a centralized system. In relational databases, partitioning refers to splitting what is logically one large table into smaller physical pieces. Similar to indexes, the data partitioning problem has been covered in all generations of databases, from deductive [Spy87, NH94] and object-oriented [BKS97] to data warehouses [SMR00]. Still, contrary to other optimization strategies, and before the introduction of dynamic partitioning techniques, partitions are created during the declaration of the database tables. This feature distinguishes data partitioning from other optimization techniques that can be created on demand while the database is running; recreating partitions after their definition is costly and impractical. Numerous tools have been developed to facilitate the creation of partitions, for instance physical design advisors (e.g., [RZML02], the DB2 advisor [ZRL+04]) and the data definition languages offered by some Database Management Systems (e.g., Oracle, PostgreSQL, MySQL). For example, let us consider the table Airplane in Figure 1.1, which will be used throughout this chapter. This table, partitioned on attribute Dev, is declared in PostgreSQL with the following Data Definition Language (DDL) statement:

CREATE TABLE Airplane (
    ID     integer,
    Model  varchar(15),
    Dev    varchar(15),
    Length real,
    Cost   real,
    PRIMARY KEY (ID, Dev)  -- PostgreSQL requires the partition key in the primary key
) PARTITION BY LIST (Dev);

-- The two partitions (names are illustrative) are then created explicitly:
CREATE TABLE Airplane_Airbus PARTITION OF Airplane FOR VALUES IN ('Airbus');
CREATE TABLE Airplane_Boeing PARTITION OF Airplane FOR VALUES IN ('Boeing');

In this statement, the table Airplane is partitioned by listing the key values for each partition, creating two groups based on the airplane's developer (Dev): Airbus and Boeing. The PARTITION BY clause in this DDL should not be confused with the SQL operator PARTITION BY used in OLAP window functions. Although declaring partitions is very simple, choosing the attribute(s) and the partitioning method that optimize performance and efficiency at query runtime is a very complex problem due to the large number of variables and possible solutions to consider.

ID    Model     Dev     Length  Cost
1609  A320      Airbus  37.57   101
1752  A350-900  Airbus  64.75   317.4
1909  A340-600  Airbus  75.36   300.1
1912  B737-800  Boeing  39.5    102.2
2212  B777-300  Boeing  73.86   361.5

Figure 1.1: Airplane table


We characterize the partitioning problem according to its:

• Maturity: the problem has existed almost since the definition of the first database systems and has been largely explored,

• Coverage: it has been implemented for all generations of databases,

• Evolution: it has been adapted to the different architectures, needs and constraints, and

• Complexity: finding an optimal partitioning scheme requires considering several alternatives and sometimes conflicting objective functions.

The research devoted to this topic has been very rich and extensive. Some surveys in the literature are devoted to specific partitioning types; for example, [GDQ92, MS98] cover certain partitioning strategies whose main objective is called declustering. In this chapter we conduct a comprehensive review of the literature on the partitioning problem with the aim of letting the reader get a complete overview of the problem. We start in Section 1.2 by citing the works that led to the explicit definition of partitioning, stating the data partitioning foundations. Then, in Section 1.3 we define the problem and its main dimensions, which are then used to classify the surveyed works. Next, Section 1.4 organizes the works according to these dimensions, giving a high-level comparison and classification of existing partitioning approaches. We introduce the partitioning approaches in the most recent large-scale platforms in Section 1.5. Finally, in Section 1.6 we give insights into open problems and compare the approaches presented in this chapter.

1.2 Data Partitioning Fundamentals

In this section we recall the main features and characteristics of the partitioning problem to unify the terminology used throughout the thesis. We first mention the works that led to the definition of what we call the core partitioning problem. Then we give definitions and examples of the different variants of partitioning in relational databases; each variant is motivated by highlighting its performance impact at query runtime. Finally, we explain the dimensions used to classify the partitioning algorithms. The database community strives to provide systems that store, update, administrate and retrieve information, ensuring the integrity, consistency and security of the stored data and transactions while offering reasonable performance. Performance optimization strategies for database systems have been largely studied, especially the ones applied during the database design stage. Three optimization strategies for centralized systems were already clearly distinguished in 1976 in [ES76]:

1. Hardware: physically increase the speed at which data are transferred between primary and secondary memory,

2. Encoding: improve data encoding strategies to logically increase the information content transferred to main memory,

3. Physical design: selectively transfer only those physical records or data attributes which are actually required by the application.

Even if these strategies originally targeted centralized database systems, they fit the parallel and distributed contexts when the network constraint is considered as well. The last category of techniques (physical design) comprises indexes, materialized views and partitions, all part of the physical design process conceived by the database designer. Partitioning was defined as an optimization stage that evolved into a mandatory design step in distributed and parallel systems. Next, we introduce the definition and evolution of the partitioning problem.


1.2.1 Partitioning definition and development overview

Let us define the partitioning process in general as the action or state of dividing or being divided into parts¹. In the database context, the data are divided to form partitions. The earliest partitioning studies in information systems dealt with the optimal distribution of entire files to nodes of a computer network. These works (e.g., [Chu69, Esw74]) solve the problem which arises in the design of distributed information systems where the data files are shared by a number of users in remote locations. Partitioning in relational databases was introduced later, disclosing how partitioning a relation contributes to improving the system's performance and implicitly describing how partitions are built. In 1975, an article by Jeffrey Hoffer and Denis Severance [HS75] motivated data partitioning as a technique to reduce the query execution time and implicitly introduced two types of partitioning configurations. These configurations use clustering notions to distribute the data in subfiles. They showed that the query response time in a centralized database is minimized if each relation is partitioned either at the attribute or at the record level, presenting the first notions of horizontal and vertical partitions. Their main motivation is the awareness that a query's retrieval time is dominated by the time spent transporting data from the disk to main memory before processing. They formalized their ideas and proposed an algorithm clustering the attributes of a relation in a centralized database system. This first formal definition of a partition is shown in Definition 1.1.

Definition 1.1 (Partition [HS75]) Given a set of attributes A of a relation R, consider the collection of subfiles S = {(A_i, R_i)} with A_i ⊆ A and R_i ⊆ R, specifying the attributes and entities represented in a subfile i. S clusters the data in two basic forms:

1. By attributes, in which R_i = R and A_1 ∪ ··· ∪ A_n = A, for i = 1..n.
2. By records, in which A_i = A and R_1 ∪ ··· ∪ R_n = R, for i = 1..n.

A cluster S in which A_i ∩ A_j = ∅ (respectively R_i ∩ R_j = ∅) for all i ≠ j is named an attribute (respectively record) partition of the database.
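To make Definition 1.1 concrete, the following Python sketch (our own illustration, not part of [HS75]) builds an attribute partition and a record partition of the Airplane table of Figure 1.1 and checks the covering and disjointness conditions of the definition.

# Illustration of Definition 1.1 on the Airplane table of Figure 1.1.
attributes = ["ID", "Model", "Dev", "Length", "Cost"]
records = [
    (1609, "A320",     "Airbus", 37.57, 101.0),
    (1752, "A350-900", "Airbus", 64.75, 317.4),
    (1909, "A340-600", "Airbus", 75.36, 300.1),
    (1912, "B737-800", "Boeing", 39.50, 102.2),
    (2212, "B777-300", "Boeing", 73.86, 361.5),
]

# Clustering by attributes: every subfile keeps all records (R_i = R) and
# the attribute subsets are pairwise disjoint and cover A.
attribute_partition = [["ID", "Model", "Dev"], ["Length", "Cost"]]

# Clustering by records: every subfile keeps all attributes (A_i = A) and
# the record subsets are pairwise disjoint and cover R.
record_partition = [[r for r in records if r[2] == "Airbus"],
                    [r for r in records if r[2] == "Boeing"]]

def is_partition(subsets, whole):
    # The union must rebuild the whole set; subsets must be pairwise disjoint.
    union = [x for subset in subsets for x in subset]
    return set(union) == set(whole) and len(union) == len(set(union))

assert is_partition(attribute_partition, attributes)   # attribute partition
assert is_partition(record_partition, records)         # record partition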

The same team of researchers extended its work, proposing an integer programming formulation of the problem considering the case where the database is shared by a group of users [Hof76, ES76, MS78]. The terms horizontal and vertical partitions were adopted by the database community to define partitions performed at the record and attribute level of a relation, respectively. The works of M. Eisner, J. Hoffer and D. Severance described previously were the first efforts towards creating partitioning algorithms for centralized database systems. The previous findings work for both centralized and distributed database systems. In both cases the main objective is to exploit data locality by clustering the records or attributes according to a defined degree of togetherness. Centralized and distributed partitioning strategies promote data locality, a strategy in which related records/attributes are stored in close proximity to minimize the performance overhead of access path navigation. In centralized systems, attribute and record clustering are used to gather in main memory the most frequently accessed records or record subsets, reducing the number of inputs/outputs (I/Os) to primary storage. In a distributed database system, the objective is likewise to cluster attributes or records, but with the aim of decreasing the number of distributed transactions, considering that for these systems the network traffic is the bottleneck of query processing. A special kind of distributed database system that exploits parallelism to deliver high-performance and highly available systems was introduced not long after the boom of distributed databases in the eighties. This kind of database system, named parallel, is formed by a set of processing units connected by a very fast network to process large databases efficiently. Three forms of parallelism are exploited in these systems: i) inter-query parallelism, which enables the parallel execution of multiple queries; ii) intra-query parallelism, enabling the parallel execution of independent operations within a single query; and iii) intra-operation parallelism, in which the same operation is executed as many sub-operations [ÖV96]. Contrary to distributed databases, localizing the query execution in a single node is not the preferred solution. In this case, the performance might be degraded due to excessive queuing delays at the node, and the available parallelism will not be exploited. Data partitioning for parallel databases implements the so-called declustering, a strategy which spreads the data among the processing nodes to ensure parallelism and load balance. The declustering term was introduced in 1986 in [FLC86] to solve the multi-disk data allocation problem. It was later used in the early parallel database systems (e.g., Gamma [DGG+86]) that horizontally partitioned and distributed the data across multiple processing nodes. Still, in many situations the system is really a hybrid between parallel and distributed, especially in shared-nothing architectures. As mentioned in [ÖV96], the exact differences between parallel and distributed database systems are somewhat unclear. In this thesis we distinguish the partitioning strategies of both systems, but as seen in the next sections, it is the main objective of the system which guides the choice of the best distribution strategy. Table 1.1 organizes the information described above. In the following section we explain the evolution of the problem and the dimensions that we apply to describe the partitioning environment.

¹ Taken from the Oxford dictionary.


Table 1.1: High-level variants of the partitioning problem

Stage      Platform     Type                  Mechanism      System element
Optional   Centralized  Horizontal, Vertical  Clustering     Memory & Disk
Mandatory  Distributed  Horizontal, Vertical  Clustering     Memory, Disk & Network
Mandatory  Parallel     Horizontal            De-clustering  Memory, Disk & Network

1.2.2 Partitioning concept evolution

Partitioning a relational database implies dividing tables into parts (or partitions). As we previously mentioned, the partitions are subsets of the relation at the attribute or record level. The core definition of the problem has not changed over time; however, the problem has evolved with the consideration of additional constraints, different platforms and hence new objectives. The first notions of data partitioning are related to the distribution of entire files in a centralized or distributed system. Then, as a result of the work of [HS75], the notion of a partition as a subset of a relation started being considered. At first, partitioning was used as a strategy to optimize the system in the same way as indexes, in a non-mandatory process. With the introduction of parallel systems, the main objective of partitioning changed, and the partitioning algorithms were therefore adapted to this type of system. In this section we detail the works leading to the definition of partition and its types. We mention as well the first algorithms for each partitioning type.

1.2.2.1 File allocation

Partitioning has its origins in file allocation models performed on the basis of entire files. Distributed computer systems were introduced before relational databases entered the market. Optimal file allocation models seek to minimize the overall operating costs of storage and transmission of distributed files across the nodes of a computer network. In 1969, W. Chu modeled file allocation in a computer network as an integer programming problem [Chu69]; his work was later enriched by K. P. Eswaran in [Esw74]. These works, along with others, served as the basis for

the very early works on partitioning algorithms for relational databases.

1.2.2.2 Implicit partitioning definition

The work of J. Hoffer and D. Severance published in 1975 [HS75] was the first to intuitively show that subdividing a relation into sub-files (at the attribute or record level) minimized the retrieval, storage and maintenance costs of data in a centralized database. They developed a method to cluster the relation's attributes and vertically partition the relation, but no method was established to generate sub-files at the record level (i.e., horizontal fragments). The problem was mathematically formalized by J. Hoffer the next year in [Hof76]. The approach modeled the allocation of subsets of a relation on different physical areas of the system (i.e., primary and secondary memory) using a weighted combination of operational costs for secondary storage, read-only and update requests, summed across memory areas. M. Eisner and D. Severance published in 1976 an article [ES76] with a model using a network (graph) representation of the users' interactions, reducing the retrieval cost in a database shared by multiple users. The work was extended in [MS77], and a complete mathematical approach to the automatic selection of database designs was presented by the same authors in [MS78].

1.2.2.3 Explicit partitioning definition

Fragments were explicitly defined as the sub-relations used as distribution units among the nodes of a distributed database system in 1977, in the technical report describing the distributed system SDD-1 [JG77]. The term was reused by Stonebraker et al. [ESW78] in 1978 to describe the partitioning stage in the distributed version of INGRES. In the same year, a database design workshop [LGS+79] brought together many database researchers to discuss important issues and future research problems. The terms horizontal and vertical partitions were established at this workshop, defining them as the clustering (or fragmentation, using the term introduced in [JG77]) strategies to perform data partitioning and placement. Horizontal partitions refer to record partitioning while vertical partitions refer to partitions at the attribute level. It is important to remark that the term fragmentation was used in distributed database systems to separate the process of partition creation for a relation from the allocation process of physical files to nodes. The separation of the data partitioning process into fragmentation and allocation stages is still carried out in distributed systems [ÖV11, RG00].

Former partitioning algorithms We briefly recall the first algorithms proposed to create vertical and horizontal partitions. Vertical partitioning algorithms were introduced sooner than algorithms to partition the database at the record level. The first algorithm using a similarity measure to group attributes according to their togetherness was proposed by J. Hoffer et al. [HS75] in 1975. The algorithm measured the attributes' affinity using the bond energy algorithm (BEA) developed in [JSW72]. Based on [HS75]'s findings, S. Navathe et al. published in 1984 [NCWD84] a two-phase approach for vertical partitioning. Their algorithm was extended by Cornell and Yu in [CY87]. The detailed list of vertical partitioning algorithms is found in Section 1.4.1.1 of our work; here we present only the works that served as the fundamentals for other approaches. Horizontal partitioning algorithms were introduced a few years after the definition of horizontal fragments in 1978. In 1982, S. Ceri et al. published their work [CNP82] presenting a horizontal partitioning algorithm based on the min-term predicates defined in their paper. Given a relation R with n attributes represented as R(A_1, .., A_n), a simple predicate p_j is defined as p_j : A_j θ value, where θ ∈ {=, ≠, <, ≤, >, ≥} and value is chosen from the domain of A_j. A min-term predicate is a conjunction of simple predicates (or their negations) describing a horizontal partition (or fragment, as called in distributed databases) of the relation R. In general, the algorithm consists in finding the minimal set of min-term predicates describing the queries of a workload, and then creating horizontal partitions based on them.


[Figure 1.2: Database partitioning origins. The diagram organizes the foundational works around the partitioning core into four threads: file allocation models (data locality [Chu69, Esw74]); implicit definition (data fragmentation [HS75]; centralized system [Hof76]; shared database of users [ES76]; mathematical model [MS78]); explicit definition (fragments [JG77, ESW78]; horizontal and vertical [LGS+79]; vertical algorithms [HS75, NCWD84, CY87]; horizontal algorithm [CNP82]); and declustering (multi-attribute files [JL74, LY77]; Cartesian product file [LLD79]; declustering definition [FLC86]; declustering applications [DGG+86]).]

This algorithm was extended by Özsu and Valduriez in [ÖV11]. Before [CNP82], several works were published concerning file allocation (a complete survey is found in [For83]), but none of them proposed a characterization of the access patterns defining a horizontal partition. The complete list of horizontal partitioning approaches is presented in Section 1.4.1.2.

Declustering algorithms In parallel to the works handling horizontal and vertical partitioning in centralized and distributed databases, approaches dealing with the organization and distribution of data on multiple-disk systems started their development in the seventies. These works subsequently served as the starting point for the development of partitioning approaches for parallel database systems. Parallel databases exploit hardware architectures in multiprocessor environments. In general, individual database servers are connected with a fast network, exploiting parallelism (especially in I/O) by distributing database files across multiple processors and/or disks. The partitioning problem in parallel systems exhibits similarities with centralized and distributed database systems; an evident one is that a relation is likewise divided into parts to increase performance. However, contrary to distributed and centralized systems that privilege data locality, the data are rather spread among the nodes to be processed in parallel, avoiding execution skewness. As formalized by Saccà and Wiederhold in [SW85] and mentioned in [ÖV11], the partitioning problem in these systems is much more complex. Before describing partitioning strategies for multiple-disk systems, let us first mention some multi-attribute file organizations whose foundations were used to establish the partitioning strategies for parallel database systems. These file organizations contributed to the development of Cartesian product files, to which several partitioning heuristics were devoted. Rothnie et al. introduced in 1974 [JL74] a file organization strategy hashing multiple keys of a record. The authors of [LY77] then proposed a multi-dimensional file structure, together with multi-dimensional attribute indexes, to reduce the retrieval, update and storage costs of data in which the record address is determined by multiple keys. Their proposal combines a multi-dimensional directory (MDD) and a single-key index to efficiently find the answers to partial-match queries in files whose attributes have discrete domains. Let us consider for example a file storing records (A_1, .., A_d) with d attributes. A partial-match query defined on this file is a query q : A_i = a_i ∧ A_j = a_j ∧ ··· ∧ A_k = a_k. The best strategies of some multi-attribute file

organizations were unified in the specification of Cartesian product files in 1978 [LLD79]. A Cartesian product file, as defined in [MS98], stores records as ordered d-tuples (a_1, .., a_d) of values where each attribute a_i has a finite domain. Let us define D_i as the domain of the i-th attribute. A d-attribute file is a subset of D_1 × D_2 × ··· × D_d. When this file is stored on disk, the records are partitioned into buckets or pages. The file is called a Cartesian product file if, when each attribute's domain is partitioned into subsets, all the records belonging to a given product of domain subsets are stored in a single bucket. When a multi-attribute query is defined on the file, the buckets containing records qualified by the query are retrieved from secondary storage. The query cost then depends on the number of buckets retrieved from the disk. To improve the performance, a Cartesian product file is stored on multiple disks, and since each disk is assumed to be independently accessible, the time to answer a query corresponds to the maximum number of blocks retrieved from a single disk of the system instead of the total number of blocks retrieved. The problem of distributing files across multiple disks is called declustering. The term was introduced in 1986 in [FLC86], although parallel systems like Gamma [DGG+86] already used declustering strategies in their first versions. A more detailed description of the declustering strategies is given in Section 1.3.3. It took around ten years to establish the bases of the partitioning problem. The works leading to its definition are organized in Figure 1.2, which shows the structure of the works we previously described. In the following section we detail the dimensions used to describe the partitioning features and to organize the partitioning approaches described in Section 1.4.
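As a concrete taste of declustering a Cartesian product file over independently accessible disks, the Python sketch below implements the Disk Modulo heuristic of [DS82] (listed later in Table 1.6): the bucket with grid coordinates (i_1, ..., i_d) is placed on disk (i_1 + ··· + i_d) mod M. The 4×4 grid and the choice of M = 3 disks are invented for illustration.

from itertools import product

def disk_modulo(bucket, num_disks):
    # Disk Modulo allocation [DS82]: the bucket with coordinates
    # (i1, ..., id) in the Cartesian product file goes to disk
    # (i1 + ... + id) mod M.
    return sum(bucket) % num_disks

# A 2-attribute file whose domains are each split into 4 subsets gives a
# 4x4 grid of buckets, declustered here over M = 3 independent disks.
M = 3
allocation = {b: disk_modulo(b, M) for b in product(range(4), range(4))}

# A partial-match query fixing the first attribute (i1 = 2) touches one grid
# row; Disk Modulo spreads its 4 buckets over all 3 disks, so the response
# time is bounded by the maximum number of buckets read from a single disk.
print([allocation[(2, j)] for j in range(4)])   # [2, 0, 1, 2]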

1.3 Partitioning dimensions

We model the partitioning environment using the star-schema diagram shown in Figure 1.3. The dimensions with a double border in this figure correspond to the ones defining the core partitioning problem. In the following sections we organize the partitioning approaches by means of the ten dimensions detailed below, which are used as classification criteria.

1.3.1 Type

This dimension describes the alternative ways of dividing a table into smaller ones. Two alternatives are clearly noticeable: dividing it horizontally or vertically. In brief, logical entities are represented as relations in the relational model. Each relation groups sets of tuples that are instances of a logical entity, which are depicted by a set of attributes. For example, the entity Airplane in Figure 1.1 is defined by the attributes model, developer, length (in m), cost (in M$) and an identifier. The table is divided as shown in Figures 1.4a and 1.4b, at the record level (horizontally) or at the attribute level (vertically) respectively.

[Figure 1.3: A star-schema partitioning environment. A central Partitioning fact is surrounded by ten dimensions: Type, Main objective, Mechanism, Algorithm, Cost model, Constraints, Platform, System element, Adaptability and Data model.]


(a) Horizontal partitions:

ID    Model     Dev     Length  Cost
1609  A320      Airbus  37.57   101
1752  A350-900  Airbus  64.75   317.4
1909  A340-600  Airbus  75.36   300.1

ID    Model     Dev     Length  Cost
1912  B737-800  Boeing  39.5    102.2
2212  B777-300  Boeing  73.86   361.5

(b) Vertical partitions:

ID    Model     Dev          ID    Length  Cost
1609  A320      Airbus       1609  37.57   101
1752  A350-900  Airbus       1752  64.75   317.4
1909  A340-600  Airbus       1909  75.36   300.1
1912  B737-800  Boeing       1912  39.5    102.2
2212  B777-300  Boeing       2212  73.86   361.5

Figure 1.4: Partitioning types for table Airplane

The primary key is replicated in each partition in Figure 1.4b to enable the reconstruction of the original table if necessary. This condition is known as the reconstruction rule. More details about the correctness of fragmentation rules are found in [ÖV11, RG00, SKS+97]. Horizontal partitioning has been explored more widely by commercial DBMSs, and there are many versions of it. First, we distinguish horizontal partitions performed on a single table, named single horizontal partitioning. This strategy is supported by most commercial database systems in their latest versions. The single-level partitioning methods are:

• Range: as stated in [BBRW09], range partitioning is defined by a tuple (c, V), where c is a column and V is an ordered sequence of values from the domain of c. A relation is then split according to ranges of values for a given set of columns.

• Hash: this mode decomposes the data by applying a system-provided hash function to the partitioning columns.

• List: this mode splits a table according to a list of discrete values for the partitioning key column.

Another horizontal type, combining the above single-level partitioning modes, is named composite horizontal partitioning. In this strategy, a table is partitioned by one data distribution method and then each partition is subdivided into sub-partitions by applying a secondary distribution method. This comprises Range-Hash, Range-List, List-Hash, Range-Range, and all possible combinations of the single-level methods. For example, the Airplane table of our example could be partitioned by developer first (list partitioning) and each partition then range-partitioned on the Length attribute; a small sketch of this two-level routing is given below. Referential horizontal partitioning allows a table to be partitioned by leveraging an existing parent-child relationship [BBRW09], established according to a foreign key between two relations named the member and the owner. It was introduced by Ceri et al. at the beginning of the eighties [CNW83, CNP82] to optimize equi-join queries between the member and owner relations. The drawback of this partitioning mode is the fact that member tables may only be partitioned using a unique owner, even if their values reference more than one owner table. More details about this strategy are given in Section 1.4.1.2. The list of partitioning types in this section is not exhaustive; the missing strategies are, however, variations of the types already presented. In Table 1.2 we show the supported partitioning types in the most popular DBMSs. The hybrid partitioning column refers to a virtual column-based partitioning that allows the partitioning key to be an expression built on the row partitioning; it is efficient for storing columns individually in partitions.
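The following Python sketch illustrates the composite (List-Range) example just described. The two-level routing function, the 50 m bound on Length, and the partition numbering are our own illustrative choices, not a particular DBMS's implementation.

from bisect import bisect_right

def composite_list_range(dev, length):
    # First level: list partitioning on Dev; second level: range
    # partitioning on Length with the single bound 50 (metres).
    list_part = {"Airbus": 0, "Boeing": 1}[dev]
    range_part = bisect_right([50.0], length)   # 0 if length < 50, else 1
    return (list_part, range_part)

rows = [("A320", "Airbus", 37.57), ("A350-900", "Airbus", 64.75),
        ("B737-800", "Boeing", 39.5), ("B777-300", "Boeing", 73.86)]
for model, dev, length in rows:
    print(model, "->", composite_list_range(dev, length))
# A320 -> (0, 0)   A350-900 -> (0, 1)   B737-800 -> (1, 0)   B777-300 -> (1, 1)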


DBMS                  Version  Hash  Range  List  Composite  Referential  Hybrid
Oracle                19c       ✓     ✓      ✓       ✓           ✓
MySQL                 8         ✓     ✓      ✓       ✓
Microsoft SQL-Server  2019            ✓
PostgreSQL            9.4             ✓      ✓
IBM DB2               11.5      ✓     ✓              ✓
Teradata              15.1      ✓     ✓              ✓                      ✓

Table 1.2: Partitioning types by DBMS

A third partitioning type, named hybrid, is a combination of the previous partitioning techniques. The hybrid partitioning process can be represented as a tree structure, as seen in [ÖV11], in which vertical partitioning is followed by a horizontal fragmentation or vice versa. This partitioning strategy creates much finer-grained partition files and is a good fit for applications needing the information at a lower granularity. A hybrid partitioning applied to the Airplane table and its partition tree are shown in Figures 1.5a and 1.5b respectively. This partitioning strategy is supported by systems like Teradata, as described in [ASAB16].

ID    Model     Dev          ID    Model     Dev
1609  A320      Airbus       1912  B737-800  Boeing
1752  A350-900  Airbus       2212  B777-300  Boeing
1909  A340-600  Airbus

ID    Length  Cost           ID    Length  Cost
1609  37.57   101            1912  39.5    102.2
1752  64.75   317.4          2212  73.86   361.5
1909  75.36   300.1

(a) Hybrid partitioning of table Airplane

Airplane
├── σ_{Dev='Airbus'}
│     ├── Π_{ID,Model,Dev}
│     └── Π_{ID,Length,Cost}
└── σ_{Dev='Boeing'}
      ├── Π_{ID,Model,Dev}
      └── Π_{ID,Length,Cost}

(b) Tree structure

Figure 1.5: Hybrid partitioning example

All of the approaches mentioned above are summarized in Figure 1.6.

1.3.2 Main objective

The problem of optimally partitioning data over a centralized, distributed or parallel database system could at first glance seem the same. Even if the expected output is a set of horizontal or vertical partitions, the main objective behind creating each partition is quite different according to the system's requirements. For example, let us consider a partitioning strategy in a distributed database replicating most of the tables on all sites to ensure good performance in terms of retrieval query time. This strategy would produce a very poor result if the partitioning objective were to improve the system's performance in terms of storage efficiency. We categorize the partitioning objectives as:


Partitioning Type
├── Vertical
├── Horizontal
│     ├── Single: Range, Hash, List
│     ├── Composite: Range-Range, Range-Hash, Range-List, List-List, List-Range, List-Hash
│     └── Referential
└── Hybrid

Figure 1.6: Type dimension schema

• Response-time: the main objective consists in reducing the query access costs. This is measured as an average response time to solve a workload (queries and frequencies). We distinguish approaches that seek to minimize the time of the most frequent queries, and others that look to cover the greatest spectrum of queries with acceptable performance.

• Concurrency: this objective deals with executing the maximum number of transactions in the minimum time. It is measured in terms of the system’s throughput.

• Maintenance: the storage and update criteria are included in this category. This objective seeks to maximize the efficiency of the utilization of resources like the hard drive, network or primary memory. It also considers, for example, the cost reduction of idle machines in a distributed system.

• Combined: this objective considers the minimization of custom cost models (e.g., weighted combinations) of the operational costs and measures of the query response time. They provide a much more balanced solution, but at the cost of complexity.

1.3.3 Mechanism

This dimension refers to the means used by partitioning techniques to accomplish the previously mentioned objectives. It comprises two alternatives:

• Clustering: partitioning in centralized database systems seeks to minimize the transfer costs between primary and secondary memory by retrieving the records or attributes that are pertinent to the user's request. In distributed database systems the main goal is to avoid costly network communications by localizing the executions at the distributed system nodes where the data resides. In both cases, the data are clustered to form groups of tuples or attributes that are usually retrieved together. For centralized systems, the record or attribute clusters are created in such a way that ideally only the clusters containing the data requested by the user are loaded into main memory, avoiding unnecessary disk accesses. In distributed systems, the data clusters are created in such a way that distributed transactions are averted.

• Declustering: in parallel database systems there is no need to maximize local processing at each node. On the contrary, this strategy could be self-defeating for the system's load balance, making one node perform all the work while all the others remain idle. Data partitioning in parallel systems comes down to a trade-off between minimizing the response time of each individual query, at the cost of execution skew, and maximizing intra-


query and inter-query parallelism by means of partitioning. Intra-query parallelism decomposes the query into smaller tasks that execute concurrently on multiple processors, while inter-query parallelism executes several queries concurrently to improve the overall throughput of the system. The partitioning strategy for these systems is called full partitioning or declustering. Declustering strategies partition each relation across all the nodes of the system, letting parallel database systems exploit the I/O bandwidth of multiple disks by reading and writing them in parallel [DG92].

1.3.4 Algorithm

Finding the right set of partitions has been shown to be a very complex problem. The horizontal partitioning problem was proven to be NP-hard [SW85, SW83], as was the problem of finding vertical partitions using an affinity graph [LOZ93]. Horizontal and vertical partitioning were mathematically formalized and posed as optimization problems whose solutions do not scale properly without the use of heuristics. We categorize the solution algorithms into the following classes:

• Exact solutions: integer programming formulations of the partitioning problem. They were first introduced to create vertical partitions using attribute clusters measuring their pairwise affinity [ES76]. Other mathematical techniques for record partitioning were proposed later. All of these formulations served as the basis for heuristics used to prune alternative solutions and boost the algorithms' scalability. The objective functions and constraints of these algorithms belong to the categories detailed in Sections 1.3.5 and 1.3.6 respectively.

• Heuristic solutions: these approaches are not guaranteed to be optimal but find an approximate solution in a reasonable time.

– Cost based: these algorithms use heuristics to prune the solution space of alternatives evaluated with a cost model, as detailed in Section 1.3.5. Some alternatives use the query optimizer's cost model to rank solutions by their performance.
– Sophisticated: these methods adapt heuristics from other domains to solve the data partitioning problem. For instance, graph partitioning heuristics are used by some partitioning algorithms to create fragments of attributes or records.

1.3.5 Cost Model

To objectively assess whether a partitioning strategy improves the performance of the database system, a set of metrics is established to define the criteria that determine a performant system. After all, without a well-defined metric it is impossible to decide which partitioning alternative is better than another. Also, cost models are essential for predicting the real execution costs of a query a priori, without actually evaluating it. Cost models provide a simplified vision of a system, and are related to the objectives described above in the main objective dimension. We categorize the cost models as:

• Response-time: these cost models use the query response time to measure the system's performance based on the query access costs. The query response time is usually measured as the weighted average response time to solve a workload. The response time can be directly measured or estimated as a linear combination of the processing, I/O and network costs, as shown in the following equation [RZML02]:

Cost(Q) = α · Cost_CPU + β · Cost_I/O + γ · Cost_Network

This equation assumes that there is some overlap among the three components. To calculate the costs, the number of rows is estimated based on statistics. Among them we find



the table cardinality, distinct values per column, the number of pages in a table, and more data collected from histograms. For centralized systems the response time is dominated by the I/O costs, which are estimated by the number of loaded disk blocks. The communication costs (i.e., Cost_Network) are the bottleneck in distributed database systems. For parallel database systems, the performance is dominated by the slowest machine of the cluster. For a given workload W, the costs are calculated for each of the queries and the global cost is given by:

Cost(W) = Σ_{q∈W} w(q) · Cost(q)

where w(q) corresponds to the weight of each query in the workload, which can be, for example, its frequency. A short sketch combining these two formulas is given after this list.

• Throughput: the traditional database problem is to maximize throughput subject to constraints on response time. As claimed by [GHK92], the problem can also be formulated as minimizing the response time subject to throughput constraints. In this context, an unrestrained use of resources to reduce only the response time can penalize the overall system performance. To limit this phenomenon, some models bound the resources devoted to reducing the response time. Similarly, [RJ17] proposed minimizing the following ratio between the load assigned to a site B and its relative performance:

assignedLoad(B) / load(B)

• Maintenance-based: the storage and update costs are included in this category. These cost models penalize replication and measure efficiency in terms of the utilization of resources like the hard drive, network or primary memory.

• Global-system: these cost models are weighted combinations of the operational costs for secondary and primary storage and a measure of the query response time. They provide a much more balanced solution, but at the cost of complexity. Let us consider for example the cost model in [PCZ12]:

Cost(D, W) = (α × CoordinationCost(D, W) + β × SkewFactor(D, W)) / (α + β)

In this case, the coordination cost is combined with a factor measuring the skewness of the partitioning strategy.
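The response-time formulas above combine directly, as in the hedged Python sketch below; the weights and per-query component estimates are illustrative placeholders, not values taken from [RZML02] or [PCZ12].

def query_cost(cpu, io, network, alpha=0.1, beta=1.0, gamma=2.0):
    # Cost(Q) = alpha*Cost_CPU + beta*Cost_I/O + gamma*Cost_Network.
    # In a centralized system the I/O weight dominates; in a
    # distributed one the network weight does.
    return alpha * cpu + beta * io + gamma * network

def workload_cost(workload):
    # Cost(W) = sum over q in W of w(q) * Cost(q), with w(q) the
    # weight of the query (here its frequency).
    return sum(freq * query_cost(*components) for freq, components in workload)

# (frequency, (CPU units, I/O blocks, network transfers)) -- invented numbers.
W = [(120, (5.0, 40.0, 0.0)),   # frequent local scan
     (3,   (9.0, 15.0, 80.0))]  # rare distributed join
print(workload_cost(W))          # 5387.7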

1.3.6 Constraints The partitioning algorithms seek to optimize an objective function with respect to some variables in the presence of constraints. The following list summarizes the most common constraints:

• Partition size: this constraint is related to limits in space or processing resources.

• Partition number: this value is associated with the number of sites of the distributed or parallel architecture that will store the final data.

• Redundancy: whether the system allows multiple copies of the data to improve query performance or ensure fault tolerance.

• Workload: this constraint refers to whether a predefined workload is known a priori to design the partitions.


1.3.7 Platform This dimension considers the three types of database architectures:

• Centralized: only the interactions between the processor, primary and secondary memory are considered.

• Distributed: this dimension includes Peer-to-Peer, Client/Server and Multi-Database Systems.

• Parallel: sub-divided into shared-nothing, shared-memory, shared-disk and hybrid architectures.

1.3.8 System Element

In this dimension we consider the hardware element targeted by the partitioning approach. Among the components we can mention: primary and secondary memory, local area network, internet, cache and the processor.

1.3.9 Adaptability

This dimension classifies partitioning strategies into static and dynamic. The difference depends on whether the partitioning describes a procedure performed once when the database is declared, never changing afterwards, or strategies carried out after executing several workloads, where the system is able to adapt its partitioning. The details of both strategies are given below.

• Offline: the database designer decides at the creation of the database how the data will be partitioned. This is achieved thanks to a requirements analysis phase in which the needs of the final users and the available resources are gathered. More recently, automatic physical design advisors select the proper partitioning configuration for a given workload. These advisors are supported by many commercial and academic database systems to help non-expert designers select the most adequate partitions, indexes and materialized views. As mentioned in the introduction of this chapter, recreating partitions can be very expensive since the database needs to be disrupted, and this is not always possible.

• Online: approaches following this strategy monitor and periodically adapt the database to fit the observed workload and incoming data. Ideally the system is able to adapt to changing workloads and data, maintaining the database up to date at a reasonable performance.

1.3.10 Data model

In this section we classify the partitioning strategies according to the data model used to logically represent the data. We include in this category the relational model and the models directly derived from it:

• Relational: introduced in the early seventies by Edgar Codd in [Cod70], where the data are represented as tuples grouped in relations. Most relational database systems use the SQL query language, providing a declarative method for specifying data and declaring queries. The entity-relationship (ER) data model introduced by Peter Chen in [Che76] is well suited to represent relational data; its components are readily translated into relations. ER modeling is based on two concepts:

– Entities: defined as tables that hold information, and


– Relationships: associations or interactions between entities.

• Object-oriented: in this data model information is represented as objects, as is done in object-oriented programming. This data model, introduced in the early eighties, is a hybrid of the relational model and object-oriented programming.

• Deductive: the deductive data model involves the application of formal logic to the problems of data definition, manipulation and integrity. The model provides the possibility to retrieve not only explicitly stored data but logically inferred data as well. Its declarative semantics were given in the work of [Rei82]. The query and declaration language used for this model is based on the declarative language Datalog, a subset of the logic programming language Prolog. Among the query languages we have ESQL, introduced by Gardarin and Valduriez in [GV90].

• Dimensional: it is the model embraced in data warehouse design. It was introduced by Ralph Kimball in [KS95] as a solution for decision support systems and business intelligence, in contrast to the entity-relationship modeling already established in the nineties. In this work, the author claimed that entity-relationship modeling works fine for Online Transaction Processing (OLTP) workloads but fails to solve analytical queries. As described in [GMR98], the dimensional data model consists of a set of fact schemes whose basic elements are facts, measures, dimensions and hierarchies. Facts model an event occurring in the company (e.g., a sale), measures are typically numeric attributes describing a fact, and dimensions are hierarchically organized discrete values giving context to facts.

1.4 Partitioning approaches

This section describes partitioning approaches organized using the partitioning dimensions described in Section 1.3. All of the dimensions are summarized in Table 1.3, in which they are grouped in two categories: i) dimensions contextualizing the problem and ii) dimensions describing the solution. We start in Section 1.4.1 (whose schema is given in Figure 1.7) by giving the fundamental algorithms chronologically by type; each type is subdivided by platform and by partitioning mechanism. In each subsection we describe the exact and heuristic algorithms presented in each approach. Then, in Section 1.4.2 we present the approaches by data model. Next, in Section 1.4.3 we present the approaches by adaptability, detailing some partitioning wizards. Finally, in Section 1.4.4 we present a summary of the approaches by constraints.

Table 1.3: Partitioning approaches

Group     Dimension        Values
Problem   Platform         Centralized / Distributed / Parallel
          Main objective   Response-time / Concurrency / Maintenance / Combined
          System element   Disk / Memory / Network
          Data model       Relational / Object-oriented / Deductive / Dimensional
          Constraints      Partition size / Partition number / Redundancy / Workload
Solution  Type             Horizontal / Vertical / Hybrid
          Mechanism        Clustering / Declustering
          Algorithm        Exact / Heuristic
          Adaptability     Offline / Online
          Cost model       Response-time / Throughput / Maintenance-based / Global system


[Figure 1.7: Organization schema of Section 1.4.1. Approaches are classified along four axes: Type (Vertical, Horizontal), Platform (Centralized, Distributed, Parallel), Mechanism (Clustering, Declustering) and Algorithm (Exact, Heuristic).]

1.4.1 Partitioning by type, platform and mechanism

The schema of this section is given in Figure 1.7. We start by explaining the vertical partitioning algorithms in distributed and centralized systems. Then we detail horizontal approaches in distributed and centralized architectures. Finally, we discuss horizontal declustering strategies in parallel platforms. We highlight whether each approach is a heuristic or an exact algorithm. In what follows we use the words fragment and partition interchangeably to denote each of the sub-relations obtained after a partitioning process.

1.4.1.1 Vertical

This partitioning strategy divides a relation along its attributes. Specifically, given a relation R, vertical partitioning produces the sub-relations R_1, ..., R_v such that each R_i contains a subset of R's attributes and its primary key attribute. Formally, consider C(R) = {c_1*, c_2, ··· , c_n} as the set of R's attributes and c_1* as the primary key attribute. For simplicity we consider only a single-attribute primary key; if there are more key attributes, c_1* contains the set of primary key attributes. The set of sub-tables of R is {g_1, ..., g_k} such that each g_i contains a disjoint subset of the non-primary-key attributes as well as the primary key of R [ÖV11, ANY04], and g_1 ⋈ g_2 ⋈ ··· ⋈ g_k = R. The primary key in each g_i makes it possible to rebuild the original table. This partitioning strategy is more complex than horizontal fragmentation because of the bigger number of available choices [ÖV11]. It is useful when the application queries access only a small subset of the columns in a table, reducing the amount of data to be scanned to answer the query [ANY04]. Approaches to create vertical partitions have existed since the seventies, when Hoffer and Severance [HS75] measured the affinity between pairs of attributes to cluster them according to their pairwise affinity. They found the attribute clusters using a heuristic based on the bond energy algorithm. Their motivation is to allow queries to deal with smaller relations, loading sub-relations with only the necessary attributes into main memory when a memory hierarchy is supported. This approach later served as the basis for the algorithm detailed in [ÖV11] and the work of [MASB18]. Their work was later modeled as an integer programming problem in [Hof76]. The exact algorithm presented in this paper does not use any heuristic and consequently does not scale to bigger relations. Eisner and Severance proposed in [ES76] a heuristic technique using a network representation of the interactions of the users with the attributes of each record. The complexity of the problem was studied by Niamir in 1978 in [Nia78], who proved that the number of possible partitions of m attributes equals B(m), the m-th Bell number. For instance, a relation of 10 attributes can be partitioned in more than 115,000 ways. B. Niamir and M. Hammer proposed in [HN79] a system to automatically select attribute partitions using the hill-climbing heuristic to select the candidate partition whose evaluation cost is minimal. The candidates are obtained using a pairwise grouping-attribute regrouping


         ID  Model  Dev  Length  Cost
ID        -    15    20     30     40
Model    15     -     5      4     12
Dev      20     5     -      2      4
Length   30     4     2      -     22
Cost     40    12     4     22      -

Figure 1.8: Affinity matrix of the Airplane table

heuristic, named grouping (such approaches are also known as bottom-up), which starts by assigning each attribute to one fragment and at each step joins some of the fragments to the initial one if the cost is minimal. Similar to this work, thirty years later the HillClimb algorithm [HP03], focused on data layouts, proceeds as follows: it starts by finding the two partitions (initially single attributes) which, when merged, provide the best improvement in terms of expected query costs. The algorithm stops iterating when there is no improvement in expected query costs. Similarly, [SW85] proposes grouping heuristics for centralized and distributed databases. A more recent model was proposed in [Amo10], in which the costs in terms of read, written and transferred bytes are estimated with a cost model, given a database schema and a workload. The proposed cost model is extended to handle load balancing as well, instead of just minimizing the sum of transfer/access costs. The author showed that finding the minimum-cost solution is an NP-hard problem and proposed two solution strategies: quadratic programming and simulated annealing. Their experiments performed using the H-Store database showed the feasibility and effectiveness of the cost model. The results of the teams of Hoffer and Severance were extended by Navathe et al. in 1984 in [NCWD84]. In this work the authors propose a two-phase approach to obtain vertical partitions. The method performs clustering using an affinity matrix of attributes such as the one shown in Figure 1.8. The numbers in the matrix represent the number of transactions in which both attributes appear together; for example, the attributes ID and Model appear together in 15 transactions of an assumed workload. Then, with this result, the method uses a binary partitioning technique on the clustered attributes to determine the partitions. The work is used as a basis in [CY87], where it is adapted to relational databases, and later in [CPW89] the work was extended by decomposing the design process into several sub-design problems. These heuristic algorithms, named splitting (also known as top-down) heuristics, generally start with a relation and partition it at each step based on some objective function. More recently, the grouping approach detailed in [BCN17] proposes an algorithm that, independently of the workload and using only the database logical schema, derives functional dependencies between the attributes with TANE, a popular functional dependency extraction algorithm. More sophisticated approaches to create vertical partitions using graphical algorithms are found in the works of Navathe et al. [NR89]. In their work the authors mapped the affinity matrix to a graph with weighted edges that is partitioned using a technique expanding the graph's spanning tree. Similar techniques are applied in [LZ93]. More recently, further heuristics have been considered to obtain vertical fragments: a genetic algorithm was used in [DAB06] to explore the set of candidate solutions, and association rules from data mining have been used to create clusters of partitions in [RG06], for instance. All of the approaches described above seek to minimize the response time by clustering the attributes according to a workload. A more recent approach [DPP+19] considers the feasibility of general machine learning techniques, specifically deep reinforcement learning, to find the right vertical partitioning scheme matching a workload.
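As a small illustration of how an affinity matrix such as the one in Figure 1.8 can be derived from a workload, in the spirit of [NCWD84], the Python sketch below sums the frequencies of the queries in which two attributes appear together. The queries and frequencies are invented for the example.

from itertools import combinations

def affinity_matrix(workload):
    # aff(a, b) = summed frequency of the queries in which both a and b
    # appear, the usual input of splitting algorithms such as [NCWD84].
    aff = {}
    for used_attributes, freq in workload:
        for a, b in combinations(sorted(used_attributes), 2):
            aff[(a, b)] = aff.get((a, b), 0) + freq
            aff[(b, a)] = aff.get((b, a), 0) + freq
    return aff

# Invented query mix: the attribute set used by each query and its frequency.
workload = [({"ID", "Model"}, 15),
            ({"ID", "Length", "Cost"}, 22),
            ({"Model", "Dev"}, 5)]
aff = affinity_matrix(workload)
print(aff[("ID", "Model")], aff[("Length", "Cost")])   # 15 22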
Some works present automatic database attribute clustering for single and distributed computers. For example, [LG12] proposed an algorithm called AutoClust to automatically vertically partition a database. Their algorithm is based on closed item sets mined from the queries of the

workload and their attributes using rule mining. It can be divided into five steps:

1. Attribute usage matrix generation.
2. Attribute mining.
3. Primary key addition to the proposed partitions.
4. Execution tree generation, with a candidate attribute clustering in the leaves.
5. Submission of the solutions to the query optimizer to select the optimal vertical partitioning schema.

Their algorithm is optimizer-integrated, meaning that it uses the query optimizer as a black box to perform the optimization using what-if calls. A complete experimental survey of some of the works mentioned in this section was done in [JPPD13]. This work discusses six vertical partitioning algorithms under the same experimental setting. They concluded that the HillClimb algorithm presented in [HP03] is the best one in terms of the criteria defined in their paper. Some of the partitioning techniques of this study are presented in Sections 1.4.3 and 1.5, where we detail online partitioning algorithms and partitioning in recent large-scale platforms. The results of this section are summarized in Table 1.4.

Table 1.4: Vertical partitioning approaches

Approach   Algorithm  Strategy   Details
[Hof76]    Exact      Splitting  Integer programming formulation
[Nia78]    Exact      Grouping   Complexity study
[Amo10]    Exact      Grouping   Global cost model
[HS75]     Heuristic  Splitting  Bond energy algorithm (BEA)
[ES76]     Heuristic  Splitting  Network representation
[NCWD84]   Heuristic  Splitting  Affinity matrix
[NR89]     Heuristic  Splitting  Graph partitioning
[LZ93]     Heuristic  Splitting  Graph partitioning
[BCN17]    Heuristic  Splitting  Functional dependency
[HN79]     Heuristic  Grouping   Hill-climbing
[HP03]     Heuristic  Grouping   Hill-climbing
[DAB06]    Heuristic  Grouping   Genetic algorithm
[RG06]     Heuristic  Grouping   Association rule mining
[LG12]     Heuristic  Grouping   Association rule mining
[DPP+19]   Heuristic  Grouping   Deep reinforcement learning

1.4.1.2 Horizontal Horizontal partitioning splits a relation along its tuples, it was explicitly defined shortly after vertical partitioning in 1978. However, contrarily to the works describing vertical partitions, the algorithms to partition the data horizontally were introduced a few years later.

Centralized and distributed platforms The efforts towards the characterization of hor- izontal fragments started with Ceri et al. in 1982 in [CNP82] with the introduction of the min-term predicates. A min-term predicate is the combination of simple predicates whose defi- nition is given in Definition 1.2. A minterm predicate mj is the conjunction of simple predicates in their natural or negated form (i.e. ¬pj) such that mj does not contain contradicting simple predicates.

Definition 1.2 (Simple predicate) [OV11]¨ Given a relation R and one of its attributes ci, a simple predicate pj is defined on R as pj : ci θ V where θ ∈ {=, ̸=, <, >, ≤, ≥, } and V is any value chosen from the domain of ci. The horizontal partitioning process of a relation R is described as the set of tuple groups G = {g1, ··· , gk} such that each group g ∈ G is defined by a set M = {m1, ··· , mk} of min-term

37 CHAPTER 1. DATA PARTITIONING FOUNDATIONS predicates. Each g corresponds to a partition of the relation R if ⋃︁ g = R and ⋂︁ g = i gi∈G i gi∈G i ∅. The previous conditions are known as completeness and reconstruction fragmentation rules respectively. Ideally the partition groups are balanced and they correspond to the application groups queried by the application, however the selection of the optimal minterm predicates has been showed to be an NP hard [SW83] problem. The min-term predicates algorithm was complemented by Ozsu¨ and Valduriez in [OV11]¨ where the COM MIN and PHORIZONTAL algorithms are described. In simple words, the main steps of the algorithm are:

1. Generation of min-term predicates (COM MIN algorithm),

2. Simplification of min-terms eliminating redundant and useless termsPHORIZONTAL ( algorithm), and

3. Generation of partitions (or fragments).

Following the example of Figure 1.4a, the predicates σDev=′Airbus′ and σDev=′Boeing′ are the min-predicates of the partitions shown in the Figure. To avoid redundancy, in this section we discuss only the generalities of horizontal partitioning approaches. Specific horizontal strategies are discussed in Section 1.4.2 to avoid redundancy.

Parallel platforms In parallel processing systems there is no need to localize the query execution in a single node. The objective of parallel systems is to use partitioning to parallelize and load-balance query execution across the nodes of the cluster, as a means to improve database performance when answering a workload. The problem was first formalized by Sacca D. and Wiederhold G. in [SW83]. In this work, the differences between parallel and distributed systems were clearly stated. The authors proved that finding a feasible and optimal solution for the partitioning problem in a parallel environment is NP-hard. An important trade-off should be considered when only a few tuples satisfy the selection operator of a query. Ideally, in this situation, the partitioning strategy should localize the tuples that satisfy the query on only a few processors, minimizing the communication overhead associated with synchronization and scheduling. Conversely, for complex queries a full declustering may seem like the most interesting strategy. This compromise makes partitioning in parallel systems much more complex [ÖV11]. We clearly distinguish two main horizontal partitioning strategies in parallel platforms: single-attribute and multi-attribute declustering. The first high-performance multiprocessor DBMSs introduced in the eighties comprise systems like Gamma [DGG+86], Bubba [WB87] and Volcano [Gra94]. In such systems, relations are generally horizontally partitioned across multiple processors. When the entire relation is distributed among all the nodes of the cluster, the mechanism is called declustering, as we previously mentioned. The most widely used single-attribute declustering strategies are the following [DG92] (a sketch of all three appears after the list):

• Round-robin: this strategy sequentially sends the $i$-th tuple to partition $(i \bmod n)$.

• Range: distributes the tuples according to intervals of some attribute.

• Hashing: a hash function applied to one of the tuple's attributes determines the disk where the tuple is placed.
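The following is a minimal sketch of the three single-attribute declustering strategies, assuming $n$ partitions and tuples with a numeric attribute; the range bounds and the hash function are illustrative choices of the kind left to the designer:

```python
import bisect

n = 4                                      # number of partitions / disks

def round_robin(i: int) -> int:
    """Send the i-th tuple (by arrival order) to partition i mod n."""
    return i % n

def range_part(value: float, bounds=(100, 200, 300)) -> int:
    """Distribute tuples by intervals of the partitioning attribute."""
    return bisect.bisect_right(bounds, value)      # partition 0..n-1

def hash_part(value) -> int:
    """Apply a hash function to one of the tuple's attributes."""
    return hash(value) % n

tuples = [("A320", 101.0), ("B737", 290.5), ("A350", 64.8)]
for i, (model, cost) in enumerate(tuples):
    print(model, round_robin(i), range_part(cost), hash_part(model))
```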

The decision of which hash function to apply or which ranges to use on a specific key is left to the database designer. The strengths and drawbacks of each strategy are summarized in Table 1.5. A major drawback of the hashing and range strategies supported by parallel systems is that neither can decluster a relation on more than one attribute.


Table 1.5: Single-attribute horizontal declustering strategies

| Strategy | Strengths | Drawbacks |
|---|---|---|
| Round-robin | Sequential access queries. Uniform data distribution. | Direct access to individual tuples. |
| Hash | Exact match queries. | Range queries. |
| Range | Exact match queries. Range queries. | Partition size disparity. |

Several multi-attribute strategies were proposed to overcome this issue. We classify these approaches based on their objectives, as presented in [GGGK03]. They are organized in Table 1.6, and we give an overview of each of the approaches.

Table 1.6: Multi-attribute declustering

| Objective | Strategy name | Reference |
|---|---|---|
| Localize execution to as few nodes as possible. | MAGIC | [GD94] |
| | BERD | [BAC+90] |
| Reduce execution skewness. | Disk Modulo | [DS82] |
| | Fieldwise XOR | [KP88] |
| | Error Correcting Codes (ECC) | [FM91] |
| | Hilbert Curve Allocation | [FB93] |
| Optimize processing of join operators. | DYOP | [OO85] |
| | Multi-attribute partitioning | [HL90] |

Let us start with the strategies striving to localize the execution of a query referencing a partitioning attribute to as few nodes as possible. This group of strategies is appropriate for systems that suffer from the overhead of coordinating multi-site queries. We describe two of the main works: Multi-Attribute GrId deClustering (MAGIC) [GD94] and Bubba's Extended Range Declustering (BERD) [BAC+90]. MAGIC is an extension of the Hybrid-Range partitioning strategy published in [GD90] that strikes a compromise between the sequential execution paradigm of range declustering and the intra-query parallelism achieved with hash and round-robin. MAGIC builds a grid directory on a relation such that each entry in the two-dimensional grid represents a fragment of the relation. To determine which attributes and ranges should be used to build the grid, MAGIC uses the frequencies of queries containing individual attributes and the average resource requirements (e.g., disk accesses, network). An example of such a grid is shown in Figure 1.9, where each entry of the grid represents a fragment of the relation. Each partition would then be assigned to a single processor to enhance the system's throughput.

| Cost \ Developer | Boeing | Airbus |
|---|---|---|
| 0-150 | 1 | 2 |
| 150-300 | 3 | 4 |
| 300-450 | 5 | 6 |

Figure 1.9: MAGIC grid example
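As an illustration, a hypothetical grid-directory lookup matching Figure 1.9 can be sketched as follows; the attribute names and the cell numbering are ours, not taken from [GD94]:

```python
import bisect

developers = ["Boeing", "Airbus"]     # first grid dimension
cost_bounds = [150, 300, 450]         # upper bounds of the Cost ranges

def fragment_of(developer: str, cost: float) -> int:
    """Return the fragment (grid cell) number of a tuple, as in Figure 1.9."""
    col = developers.index(developer)               # 0 or 1
    row = bisect.bisect_left(cost_bounds, cost)     # 0, 1 or 2
    return row * len(developers) + col + 1          # cells numbered 1..6

assert fragment_of("Airbus", 100) == 2   # Cost 0-150, Airbus
assert fragment_of("Boeing", 400) == 5   # Cost 300-450, Boeing
```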

BERD [BAC+90] first fully partitions a relation across the nodes of the cluster using the value of the primary partitioning attribute. Then, for each of the secondary attributes, an auxiliary relation is formed from the attribute's values with their respective identifiers and

location. The tuples of these relations are range-partitioned over multiple locations, and the partitions at each location are indexed with a B-tree. When a query is submitted to the system, if it involves the primary attribute it is directed to the relevant location. For any other attribute, the system first uses the auxiliary relation to determine the appropriate location of the tuples. A study comparing BERD and MAGIC is presented in [GDQ92], in which the superiority of MAGIC over BERD is demonstrated. The second type of approach is appropriate for workloads in which the overheads due to parallelism are overshadowed by the performance gain of splitting the work across different nodes. These approaches include: Disk Modulo (DM) [DS82], Fieldwise XOR (FX) [KP88], Error Correcting Codes (ECC) [FM91] and the Hilbert Curve Allocation Method (HCAM) [FB93]. They are clearly described in [MS98] using Cartesian product files, a multi-attribute file structure for partial match and best match queries very similar to the grid files presented before. Let us start with the Disk Modulo method, which assigns a bucket $[i_1, i_2, \ldots, i_d]$ to a disk unit as $(i_1 + i_2 + \cdots + i_d) \bmod M$, where $M$ is the number of available disks. This strategy is optimal for partial match range queries. For the matrix of Figure 1.9, before adding up the coordinate values, each value (or range) is mapped to an integer. The XOR declustering is optimal when the number of disks and the size of each field are powers of two. The assignment formula of this approach applies the bitwise XOR operator $\oplus$ to the binary values of the bucket coordinates $[i_1, i_2, \ldots, i_d]$ as:

$(i_1 \oplus i_2 \oplus \cdots \oplus i_d) \bmod M$, where $M$ is the number of available disks. The vector-based declustering generates a pair of integer vectors for a given number of disks and aligns the buckets of a Cartesian product file with the vectors. The Hilbert Curve Allocation method (HCAM) uses the Hilbert space-filling curve to impose a linear ordering on the buckets of the Cartesian product files. It then traverses the buckets in that order, assigning each bucket to a disk unit in a round-robin way. As shown in [MS98], this strategy outperforms the previous strategies for small range queries and large numbers of disks. Finally, we mention the group of strategies striving to optimize the processing of the join operator. The DYOP technique [OO85] partitions the data by repeatedly subdividing the tuple space of multiple attributes' domains. In order to execute a hash-join query efficiently, the size of each partition equals the aggregate memory of the processors in the system, and the tuples' order is preserved as well. [HL90] builds a grid file on the attributes used to join the partitioned relation with others. For example, let us consider the relations R(A, B, C), S(B, D) and T(C, E). Building a grid file on the B and C attributes to partition R would minimize the number of tuples of R that are redistributed when it is joined with either S or T. A simulation study of data placement issues in a shared-nothing (SN) system is presented in [MD97]. The authors performed experiments in a simulation written in CSIM/C++ that was configured to simulate an SN database of 128 nodes. The drawback of their model is that the cost of shuffling data on the network is disregarded. They performed experiments to establish the optimal degree of declustering (number of partitions). They presented results on declustering, response time for joins, indexes, skewness and different CPU configurations for parallel and sequential systems. They concluded that full declustering is a viable strategy for placing relations in an SN parallel database system, using the average response time as metric.
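Both assignment formulas are easy to state in code; the following minimal sketch (with an illustrative bucket and M = 4) mirrors the two formulas above:

```python
from functools import reduce

M = 4  # number of available disks

def disk_modulo(bucket):
    """DM [DS82]: (i1 + i2 + ... + id) mod M."""
    return sum(bucket) % M

def fieldwise_xor(bucket):
    """FX [KP88]: (i1 XOR i2 XOR ... XOR id) mod M."""
    return reduce(lambda a, b: a ^ b, bucket) % M

bucket = [2, 5, 1]                               # integer-mapped coordinates
print(disk_modulo(bucket), fieldwise_xor(bucket))  # 0 and 2 for M = 4
```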

Referential horizontal partitioning This strategy uses the database schema, specifically the member-owner relationship between tables, to generate partitions. The member-owner relationship is illustrated in Figure 1.10, which diagrams two tables in which a relation called the member is linked to the information stored in another relation called the owner. An integrity constraint is defined between the two tables using the primary key of the owner relation and an attribute of the member relation (e.g., c1o and c1m in Figure 1.10, respectively).


Figure 1.10: Owner and member relations. The owner relation has attributes c1o, c2o and c3o; the member relation has attributes c1m through c5m.

Referential partitioning was first described by Ceri et al. in [CNW83]. In this partitioning method, the owner relation is initially partitioned, and the partitions of the member relation are obtained with a semi-join operation. Formally, let us consider two linked relations $O$ and $M$, being the owner and member relations respectively. If $O$ is partitioned into $\{O_1, \cdots, O_w\}$, the horizontal partitioning of $M$ is defined as $M_j = M \ltimes O_j,\ 1 \leq j \leq w$. When the member table has foreign keys pointing to more than one relation, the choice of the attribute used to partition the member relation is made according to: i) the attribute that would maximize performance for most applications, and ii) the attribute that minimizes the join costs between both relations [ÖV11]. This partitioning strategy has been supported by the commercial DBMS Oracle since its 11gR1 version. Its implementation is described in [ECS+08], which highlights the key concepts of referential partitioning, the implementation challenges, and an experimental study showing the performance gains and other benefits of this partitioning strategy. The authors also explained how referential partitioning can be used to simulate vertical partitioning by breaking a table into two or more smaller tables with fewer columns in a parent-child relationship.
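A minimal sketch of this derivation, modeling relations as lists of dictionaries with the attribute names of Figure 1.10, could look as follows:

```python
# Semi-join based derivation of member partitions: Mj = M ⋉ Oj.
owner_partitions = [
    [{"c1o": 1}, {"c1o": 2}],        # O1
    [{"c1o": 3}],                    # O2
]
member = [{"c1m": 1, "c2m": "x"},
          {"c1m": 3, "c2m": "y"},
          {"c1m": 2, "c2m": "z"}]

member_partitions = []
for oj in owner_partitions:
    keys = {row["c1o"] for row in oj}
    # Keep the member tuples whose foreign key matches a key of Oj.
    member_partitions.append([m for m in member if m["c1m"] in keys])

print(member_partitions)
# [[{'c1m': 1, ...}, {'c1m': 2, ...}], [{'c1m': 3, ...}]]
```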

1.4.1.3 Hybrid

As described previously, this partitioning, also known as nested partitioning, occurs when a table is vertically partitioned and then partitioned horizontally (or vice versa). As mentioned in [ÖV11], the number of levels of nesting can be large, but it stops when each fragment consists of a single tuple. The only commercial DBMS supporting this partitioning strategy is Teradata. Its strategy, detailed in [ASAB16], relies on the file system storage layer, which handles row, column, or combined partitions in the same internal way. In contrast to horizontal or vertical systems, which are row- or column-based respectively, the Teradata file system is row-id based. A row-id identifies a row or parts of it, and at runtime the data are searched using this id. In this way, the file system is agnostic to the partitioning scheme. When a row is inserted into a table, a row-id is generated based on a hash value of the primary index columns.

1.4.2 Partitioning by data model

In this section we present the most relevant approaches by data model. We start chronologically with the relational data model, presenting the approaches applied to centralized and distributed platforms, since the parallel approaches were described in the previous section. Then we move to the object-oriented and deductive models, and finally to the dimensional data model.

1.4.2.1 Relational data model

Here we detail the most recent horizontal partitioning algorithms. We do not include vertical approaches in this section since they were fully discussed in Section 1.4.1.1.


Figure 1.11: Schism data representation. (a) The workload as an inverted table mapping each TupleID to the queries accessing it (1 → 1,2,3,6; 2 → 1,4,6; 3 → 1,2,4,5; 4 → 3,5). (b) The resulting Schism graph, in which a red dotted line marks the partitions produced by the graph partitioning heuristic.

For horizontal partitioning, we have already discussed the basic approaches in which the main concepts and algorithms were introduced. Let us start by describing the approaches dealing with systems characterized by numerous short online transactions, in which partitioning generally seeks to increase the system's throughput. This is achieved mostly by reducing the number of distributed transactions. The approaches presented in this section are summarized in Table 1.7, which highlights the main strategy of each approach. A very influential workload-aware partitioning approach named Schism was proposed by Curino et al. in [CZJM10]. Their system partitions the database using graph partitioning techniques and decision trees. They first consider an OLTP workload containing read and update queries and map the database to a graph. In their model, a node corresponds to a tuple, and edges connect tuples that are used within the same query. The node weights correspond to the number of times the tuple is accessed in the workload or to its size in bytes. This is exemplified in Figure 1.11. The interactions between tuples 1 to 4 are shown in the inverted table listing the IDs of the tuples and the queries they are part of (Figure 1.11a). The edge weights represent the number of times in the workload both nodes are connected. They expanded the graph to consider replication by adding n more copies of each node, where n is the number of edges connected to the node. When a partition is created, if two nodes of the same tuple are in the same partition, they are clearly not replicated. Conversely, if two nodes of the same tuple are found in two different partitions, then the tuple is replicated to ensure data locality. Using this graph, a partitioning heuristic (e.g., METIS) is applied, and the data are divided into partitions. A decision tree is used to explain the obtained partitions and to recreate a path leading to the final partitions using simple predicates. The red dotted line of Figure 1.11b represents the partitions after applying the graph partitioning heuristic. They evaluated their work using TPC-E and other databases taken from social networks. In [PCZ12], Pavlo et al. presented an approach named Horticulture to automatically create horizontal partitions. Its main objective is to optimize throughput by minimizing the number of distributed transactions and the access skew across servers. Their workload-aware partitioning algorithm is based on an adaptation of the large-neighborhood search technique, in which a cost model is used to estimate the coordination cost and load distribution for a sample workload. Their method starts with an initial best solution, obtained using measures from the workload and access patterns. It generates a set of neighbor solutions (by relaxing the initial solution) that are evaluated against the cost model (composed of the coordination and skewness costs). If a neighbor improves on the cost model, it is considered the new best solution and the search continues iterating. Consens M. et al. presented divergent designs in [CILP12]. A divergent design installs a different physical configuration (e.g., indexes and materialized views) on each database replica, specializing replicas for different subsets of the workload. Queries are routed to the most suitable replica at runtime. They formalized the problem by representing the cost of evaluating the workload and then using this cost formula to decide the best solution.
In their experiments they used the DB2 Design Advisor to generate candidate solutions, and later each one was evaluated using

the total cost formula that considers the trade-off between load balance and specialized replicas. They presented a pruning strategy that was tested with two TPC workloads. The system HOPE, presented in [CGZT14], applies a hypergraph clustering algorithm to generate partitions of a database considering a transactional workload. In their work, the authors generated tuple groups using min-term predicates. These groups of tuples are the nodes of the hypergraph, weighted with two values: the size of the tuple group and the number of transactions in the workload accessing the group. Each transaction in the workload is mapped to a weighted hyper-edge of the graph. HOPE partitions the hypergraph using the hMetis4 package. The skewness of the partitions is regulated by a skewness factor. Tran K. et al. proposed a Join Extension, Code-Based approach (JECB) in [TNST14]. Their method examines the workload and the database schema to derive a partitioning strategy. It is called code-based since it inspects the SQL code of stored procedures to define the join paths used to partition. The algorithm leverages partitioning by foreign-key relationships to automatically identify the best way to partition tables using the attributes. Afterwards, it combines the solutions for each transaction in the workload to find the best global partitioning solution. Zamanian et al. presented in [ZBS15] a partitioning scheme called predicate-based reference partitioning (PREF) that allows co-partitioning sets of tables. Their method is a modified version of referential horizontal partitioning, but it ensures full data locality by replicating tuples in different partitions. Besides, they presented two partitioning algorithms (one schema-driven and the other workload-aware) that incorporate PREF to automatically find the best partitioning scheme for a database. In the schema-driven algorithm, they modeled the given database schema as an undirected graph whose edges are the referential constraints of the schema. Based on this graph, they decide which relations will be initially partitioned (using a hash function, for instance) and which relations will be partitioned following the same schema. They maximize data locality while trying to reduce data replication. In [DLL+17], the authors present an approach to partition data that exploits the fact that in many recent database systems data is replicated to increase robustness and availability. Their method replicates data, but each replica has a different partition structure. They start by calculating the distance between queries of the workload based on the shared attributes. They then use a k-medoids algorithm to cluster the workload, generate a partitioning plan for each cluster, and organize the replicas with these plans. They assume that the number of clusters is equal to the number of replicas. Rabl and Jacobsen presented in [RJ17] an allocation strategy for shared-nothing data clusters. Their model allows replication to improve the performance of read-only queries but balances it to avoid low performance for update queries. The model seeks to maximize throughput while secondarily minimizing the update and disk consumption overheads due to replication. For this, each query is run on a single node, and replication of data between nodes is needed. Their model starts by classifying the update or read queries of the workload according to the data they access. Each class corresponds to a different type of data fragment (e.g., a horizontal or vertical fragment).
Based on this classification, each group of queries is assigned to one or more data nodes. The allocation is calculated using heuristics to solve a linear programming problem, balancing the load of query classes across the nodes and reducing the overall data replication. Most recently, the authors of [GLL+20] proposed a general strategy named AlCo (Allocate fragments based on Cost) for allocating fragments in a distributed database. AlCo evaluates multiple candidate allocation plans based on a cost model, realized with a modified version of the genetic algorithm employed by PostgreSQL. Their cost model jointly considers various factors to achieve good generalization. Also, to reduce the risks caused by the randomization of the genetic algorithm, AlCo provides an upper bound computed through current heuristic methods to improve the algorithm's robustness.

4http://glaros.dtc.umn.edu/gkhome/metis/hmetis/overview

Table 1.7: Recent horizontal partitioning approaches

| Approach | Main strategy |
|---|---|
| Schism [CZJM10] | Graph partitioning |
| Horticulture [PCZ12] | Iterative neighbor search |
| Divergent designs [CILP12] | Use of the query optimizer |
| SWORD [QKD13] | Hypergraph partitioning; repartitioning technique |
| HOPE [CGZT14] | Hypergraph partitioning |
| JECB [TNST14] | Join-extension, code-based; partial solutions combination |
| PREF [ZBS15] | Referential, predicate-based |
| Replica-aware [DLL+17] | K-medoids clustering |
| Query-centric [RJ17] | Linear programming |
| AlCo [GLL+20] | Modified genetic algorithm; custom cost model |


1.4.2.2 Object-oriented data model

The partitioning concept was smoothly adapted to this kind of database, in which an entity corresponds to a class instead of a relation. Partitioning strategies were first explored by Cook J. et al. in [CWZ94], investigating methods to improve the performance of the garbage collector, in which a subset of the entire database is collected independently from the rest. A framework for class partitioning in OODBs was proposed by Karlapalem K. and Li Q. in [KL95, KL00]. Their framework devises partitioning schemes based on the different types of methods and their classification. The horizontal fragmentation process in OODBs is defined as a process for reducing the number of disk accesses needed to execute a query by reducing the number of irrelevant objects accessed. Vertical fragmentation strives to reduce the irrelevant attributes accessed when querying a class.

Horizontal class partitioning: It was pioneered by Bellatreche et al. in [BKS97] and extended in [BKS00]. In these works the authors presented horizontal fragmentation based on a set of queries and developed strategies for primary horizontal partitioning. In [BKL98], the problem of derived class partitioning was also studied.

Vertical class partitioning: A cost-driven approach studying the effectiveness of data partitioning in OODBs, in terms of reducing the number of disk accesses when executing a query, was proposed by Fung C. et al. in [FKL03]. [CG97] proposes a unified view of the vertical partitioning problem and a set of transformation rules for various vertical partitioning methods.

1.4.2.3 Deductive data model

The partitioning problem in deductive databases was covered by Mohania M. et al. in [MS94]. Deductive database design addresses the distribution of both the database and the rules. In this work the authors considered the minimization of data communication cost as the primary objective of rule allocation. The problem is modeled as a directed acyclic graph, where nodes represent rules or relations and arcs represent dependencies or the usage of relations by rules. A heuristic based on successively combining adjacent nodes is then proposed. Another approach was presented by Neumann K. et al. in [NH94]. This paper presents the algorithms necessary to partition a deductive database represented as an Extended Predicate Connection Graph (EPCG). The objectives the system seeks to attain are equalizing storage and processing costs, distributing the base relation nodes in an effective manner, and preserving locality.

1.4.2.4 Dimensional data model

As mentioned in Section 1.3.10, this data model is embraced by data warehouses. It is illustrated in Figure 1.12, in which a fact table (the Maintenance table) is surrounded by dimension tables (Aircraft, Developer, Date and Procedure) in a star schema. Data warehouses had their boom in the nineties, and it was at the end of this decade that the first partitioning strategy for parallel data warehouses was proposed by Stöhr T. et al. in [SMR00]. In their work, the authors presented an allocation scheme for relational data warehouses based on a star schema and utilizing bitmap index structures. Their experiments were developed on a shared-disk architecture. They proposed a multi-dimensional hierarchical fragmentation of the fact table based on multiple dimension attributes. To fragment, they used a technique called "point fragmentation" that creates one horizontal fragment per value of the dimension. If it is done with, for example, two dimensions having m and n distinct values respectively, the maximum number of fragments is $n \times m$. Bitmap indexes are fragmented using the same logic as the fact table, and a simple round-robin allocation of fact fragments to the disks is used.


Figure 1.12: Star-schema example. The Maintenance fact table (AircraftID, DeveloperID, DateID, ProcID, NbHours, Cost) is surrounded by the Aircraft (AircraftID, Model, Developer, Length, Cost), Developer (DeveloperID, Name, Location), Procedure (ProcID, Detail, Duration, CostHr) and Date (DateID, Year, Month, Day) dimension tables.

In parallel, [BKMS00] presented an algorithm for fragmenting the tables of a star schema (e.g., Figure 1.12). The authors mentioned that during the fragmentation process, the choice of the dimension tables used to fragment the fact table plays an important role in the overall performance. They developed a greedy algorithm to choose the "best" dimension tables to perform the partitioning of the fact table. Bellatreche et al. [BBM07] followed Stöhr's direction, focusing on horizontal partitioning and bitmap join indexes in the context of centralized data warehouses. They proposed an algorithm to select these structures simultaneously in order to reduce the query processing cost. The algorithm injects the horizontal partitioning schema into a genetic algorithm to prune the search space when selecting the bitmap join indexes. They also proposed a greedy algorithm to select bitmap join indexes under a storage bound. In [BBRW09], referential horizontal partitioning was formalized for the dimensional data model. In this study, the authors formalized the referential fragmentation schema selection problem in data warehouses, studying the hardness of selecting an optimal solution. They also proposed two heuristics for the selection problem, hill climbing and simulated annealing, with several variants to select a near-optimal partitioning schema. The work of [LMV10] introduces a fine-grained virtual partitioner that dynamically adapts partition sizes without any knowledge of the database or the DBMS. Their proposal is to have an initial number of virtual partitions greater than the number of participating nodes. In this way, each cluster node processes a set of small, lightweight sub-queries. In their algorithm, each node of the shared-nothing cluster contains a copy of the database. The idea is to partition each query (e.g., by range) on each node, such that each node is in charge of processing one specific query range. Thus, at query runtime, many sub-queries of the original query are produced per node. The goal is to achieve intra-query parallelism in OLAP query processing. To ensure good performance, there must be a clustered index associated with the partitioning attribute, as well as a uniform value distribution on the partitioning attribute. Finally, a more recent work [NKH18] proposed a novel graph-based database partitioning method called GPT that improves query performance with lower data redundancy. The authors claim that existing partitioning methods, specifically PREF [ZBS15], have a few major drawbacks, such as a large amount of data redundancy and, despite it, failing to support shuffle-free join processing in many cases. They pointed out that the PREF strategy is not optimal for analytic (OLAP) queries and that GPT is better suited for star and snowflake schemas in data warehouses. They attribute these drawbacks to the tree-based partitioning schemes of PREF.


Table 1.8: Partitioning advisors

| Advisor | P^a | I^b | M^c | C^d | S^e |
|---|---|---|---|---|---|
| Parallel DB2 Index Advisor [RZML02] | ✓ | | | | |
| DB2 Index Advisor [ZRL+04] | ✓ | ✓ | ✓ | ✓ | |
| Microsoft SQL Server [ANY04] | ✓ | ✓ | ✓ | | |
| AutoPart [PA04] | ✓ | | | | |
| Parallel Microsoft SQL Server [NB11] | ✓ | | | | ✓ |
| Parinda [MDA+10] | ✓ | ✓ | | | |
| Oracle 11g | ✓ | ✓ | ✓ | ✓ | |
| SOAP [CZC15] | ✓ | | | | |
| Deep Reinforcement Learning [HBR19] | ✓ | | | | |

^a Partitioning, ^b Index, ^c Materialized views, ^d Clustering, ^e Storage management

1.4.3 Partitioning by adaptability

We enumerate the offline and online partitioning strategies. We start with the offline strategies, specifically the available design advisors used to initially partition a relational database; partitioning wizards are a complement to the strategies previously presented. We then depict dynamic partitioning, or online data reorganization, as it is called in [SKS+97].

1.4.3.1 Partitioning advisors

Due to the complexity of selecting the optimal partitioning design for a database, a number of commercial and academic tools have been proposed to assist administrators and non-expert users in the task of choosing the most adequate partitioning schema for a workload. As mentioned in Section 1.2.1, data partitioning is not the only optimization technique for enhancing system performance. Indexes and materialized views are suggested as well by some advisors, considering the impact of the interactions and dependencies between these optimization strategies. The cited works are summarized in Table 1.8, in which we mention, for each advisor, the optimization strategies it proposes. As shown in the table, partitioning is the only optimization technique supported by all advisors. The first automatic selection of table partitioning in parallel shared-nothing database systems was presented by Rao J. et al. in [RZML02]. They presented a solution that, given a workload and the frequency of occurrence of its queries, determines the optimal partitioning using a solution integrated with the query optimizer of the DB2 DBMS. In contrast to what was previously proposed, they built a partition advisor that automates the process of partition selection by exploiting the cost model of the query optimizer itself. To do this, they added two additional modes to the query optimizer: the recommend partition and evaluate partition modes. In the first mode, the optimizer generates candidate partitions for each table involved in a query of the workload. The candidate set is not limited to these partitions; in recommend mode the optimizer also generates candidate partitions considering replication and combinations of individual

columns in the tables involved in a query. Afterwards, the query optimizer generates a plan based on these candidate partitions and stores the plan in a table. The cost of each partition is evaluated by the optimizer using two different strategies, sampling and deriving, both of which adjust the value obtained when applying the cost function to the original statistics of the database. The evaluate partition mode performs a rank-based enumeration of the whole set of partitioning configurations proposed in the candidate partitions table. Then, using a customized cost function, it proposes the best partitioning model. They explored other ways of combining the models, like genetic and other search algorithms, but rank-based enumeration gave the best results. The DB2 advisor presented in [ZRL+04] was the first tool recommending indexes, materialized views and partitions by considering their interactions. The authors identified two types of dependencies between indexes, materialized query tables, data partitioning, and multi-dimensional clustering: weak and strong. A technique $t_i$ is "strongly" dependent on technique $t_j$ if a change in the selection of $t_j$ often results in a change in that of $t_i$; otherwise, we say $t_i$ "weakly" depends on $t_j$. They showed that a weak dependency exists between data partitioning, indexes and multi-dimensional clustering, in contrast to the strong dependency between data partitioning and materialized query tables. Knowing these interactions allows coupling data partitioning with other optimization techniques. For instance, when data partitioning cannot optimize all queries (due to the constraint on the number of final fragments generated by a data partitioning algorithm), it can be augmented by other techniques. The work on Fractured Mirrors by Ramamurthy R. et al., detailed in [RDS02], is not specifically a partitioning advisor; however, it was used later in the advisor proposed in [ANY04]. The Fractured Mirrors paper presents a partitioning model that takes advantage of the redundant storage used in some database systems to provide tolerance to disk failures. They considered two disks in a mirror being logically identical but physically different. In particular, one copy of each table is stored using the Decomposition Storage Model (vertical), and one is stored in an N-ary Storage Model (horizontal). In their work they first revisited some performance problems associated with the vertical model. They then proposed an indexing strategy, based on a B-tree, to overcome the drawbacks presented in the last section, especially those of the reconstruction algorithms; it showed a dramatic performance increase over the naive implementation. Agrawal S. et al. proposed a workload-aware solution to the problem of automating physical design on a single node in [ANY04]. Their solution takes both performance and manageability into account. In their paper they presented an integrated approach to automate physical design, considering horizontal and vertical partitioning, indexes and materialized views. They presented a complexity proof of the problem and the need to apply pruning techniques to reduce the large search space of solutions. They defined the physical design problem and presented the interactions arising from the inclusion of horizontal and vertical partitioning.
Their system architecture is composed of four main modules: 1) column group restriction, 2) candidate selection, 3) merging, and 4) enumeration. In the column group restriction step, the system selects column groups that are relevant for the workload (a column is relevant if it can be used to answer one or more queries in the workload). This step eliminates columns that will never help optimize the workload. The candidate selection step uses a Greedy(m,k) algorithm to select a relatively optimal configuration for each query. In the merging step, new physical structures are added to the set of candidates. These new physical structures are generated by merging vertical partitions output by candidate selection. Next, for each vertical partition, they merge all indexes and materialized views relevant for that vertical partition, taking horizontal partitioning into account. The idea is to avoid over-specialized physical design structures and to build structures that benefit the entire workload. The enumeration step takes the candidates as input and produces the final solution. Nehme R. and Bruno N. proposed in [NB11] a partitioning advisor for parallel database systems

that recommends the best partitioning design for an expected workload. Their tool recommends which tables should be replicated and which ones should be distributed according to specific columns. The developed techniques are deeply integrated with the parallel query optimizer, yielding much more accurate recommendations in a shorter time. They assume that when a table is partitioned it is hash-partitioned on a single column, and that the database statistics for cost estimation are always available. They claim that similar approaches do not use the query optimizer as they do, since previous works consider it as a black box and only use it to perform what-if analyses (e.g., rank-based and genetic algorithms generating possible solutions to be evaluated by the optimizer). In their solution the query optimizer is an active part of the recommender: they use the physical and logical trees produced for each query to generate a data structure containing the candidate tables to be replicated and partitioned using the interesting columns (also defined in the paper). Parinda [MDA+10] is a partitioning advisor for an open-source DBMS (PostgreSQL). Like the previously mentioned advisors, it uses the query optimizer to estimate the benefit of a design structure by simulating the design features efficiently. Their algorithm builds on AutoPart [PA04], an algorithm that automatically partitions the database tables to optimize sequential access, assuming prior knowledge of a representative workload. The simulations are achieved not by writing the actual optimization features to disk but rather by modifying the statistics used by the optimizer to solve a query. The advisor proposes both partitions and indexes to the database administrator. Since Oracle 11g, a Partition Advisor5 has been part of the SQL Access Advisor, recommending a partitioning strategy for a table based on a supplied workload of SQL statements. Most recently, the work of [HBR19] introduced a learned partitioning advisor for analytical OLAP workloads based on Deep Reinforcement Learning (DRL). The main idea is that a DRL agent learns from experience by monitoring the behavior of different workloads and partitioning schemes. Their experiments showed that the advisor is able to adapt to different deployments and that it outperforms existing automatic partitioning approaches.
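Several of the advisors above share the optimizer-as-a-black-box pattern. The following hypothetical sketch illustrates it: candidate configurations are enumerated and the one with the lowest estimated workload cost is kept. Here, estimate_cost stands in for a real what-if optimizer call, and the cost formula is a deliberately naive placeholder:

```python
from itertools import combinations

def estimate_cost(config, workload):
    """Placeholder for a what-if optimizer call returning an estimated cost."""
    return sum(len(q) for q in workload) / (1 + len(config))  # dummy model

candidate_columns = ["Dev", "Model", "Cost"]
workload = ["SELECT ... WHERE Dev = ?", "SELECT ... WHERE Cost > ?"]

best_config, best_cost = None, float("inf")
for k in range(1, len(candidate_columns) + 1):
    for config in combinations(candidate_columns, k):
        cost = estimate_cost(config, workload)   # the what-if call
        if cost < best_cost:
            best_config, best_cost = config, cost

print(best_config, best_cost)
```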

1.4.3.2 Online partitioning

Dynamic partitioning approaches address the fact that the load on a DBMS does not remain static but fluctuates constantly according to changes in users' needs. The changes come either from the workload or from new data that is being added and must be assigned to a specific partition. Next, we present partitioning strategies that create elastic systems by automatically reorganizing the data depending on load or data changes. The approaches are summarized in Table 1.9. Let us start with the work of [LMV10], previously detailed in Section 1.4.2.4. This work depicts a virtual partitioner for efficient OLAP query processing that tunes the partition sizes without requiring any knowledge of the database or the DBMS. A dynamic partitioning method was proposed in [JD11]. In this work the authors proposed AutoStore, a system that monitors the current workload and automatically partitions the data at checkpoint time intervals. It creates the partitions based on an affinity matrix of attributes calculated for each query of the workload. The authors considered changes in the workload but no changes in the data. The cost model denotes the execution cost of the workload estimated by a cost-based optimizer. In [LAP+13], Liroz-Gistau et al. proposed DynPart (introduced in [LAP+12]) and DynPartGroup, two dynamic partitioning algorithms for continuously growing datasets. In their paper the authors formalized the partitioning problem, defining an imbalance factor that leaves some flexibility in the size of each partition. Their partitioning method is workload-aware, defining an operation that, based on the fragments relevant to a query, calculates the efficiency of

5https://docs.oracle.com/database/121/VLDBG/GUID-E864E9E2-0456-4FB5-860B-44444337D7D8.htm


Table 1.9: Online partitioning approaches

| Approach | Details | Changing workload | Changing data |
|---|---|---|---|
| Virtual part. [LMV10] | Full database replication. Specialized partitions for OLAP queries. | ✓ | |
| AutoStore [JD11] | Attribute affinity matrix per query. Partitioning at checkpoint time intervals. | ✓ | |
| DynPart [LAP+13] | Partitioning guided by an efficiency measure: #relevant fragments / #fragments accessed. | | ✓ |
| SOAP [CZC15] | Cost-based repartitioning. Triggered by a system-throughput threshold. | ✓ | |
| Cumulus [FMS15] | Time-series tools to predict the workload. Graph-based heuristic repartitioning. | ✓ | |
| Clay [STE+16] | Identify clumps of tuples to migrate. | ✓ | |
| GridFormation [DPP+18] | Reinforcement learning. | ✓ | |

a partitioning for that respective query. The defined efficiency is intuitively the ratio between the minimum number of relevant fragments of the query and the number of fragments that are actually accessed under the given partitioning. The second method groups the data to be added beforehand. In contrast to Clay, the workload remains static while the data is constantly added to the original dataset. They proposed a measure to calculate the affinity between the added data and the fragments. The approach named SOAP, a system framework for scheduling online database repartitioning for OLTP workloads, was presented in [CZC15]. This system seeks to minimize the time frame for executing the repartitioning operations while guaranteeing the performance of the transactions currently running in the system. The repartitioner determines at which moment the database should be repartitioned by periodically extracting the frequency of transactions and the partitions they access. A cost-based repartitioning is triggered if the system's throughput, estimated with this information, is under a predefined threshold. A repartitioning task may move individual tuples or groups of tuples (based on ranges, for instance). To enable the database to continue running while data are being moved, the system relies on replication. To this end, the system generates three types of repartitioning operators: replica creation, replica deletion and object migration. Their prototype is built on top of PostgreSQL, and the authors conducted an experimental study on Amazon EC2 validating SOAP's significant performance advantages. Cumulus [FMS15] is an adaptive data partitioning approach able to identify characteristic access patterns of transactions and use them to partition initially and to repartition dynamically if these access patterns change. This strategy is specifically tailored for applications that need strong consistency guarantees with OLTP workloads. Partitions in this strategy are created with the objective of reducing distributed transactions while distributing the load across sites. First, Cumulus collects data from the current workload and uses the exponential moving average (a time-series prediction technique) to anticipate future access patterns. Then, a filter is applied to preserve only the most frequent access patterns, which are organized in a weighted graph. The graph is partitioned using the METIS [KK98a] heuristic, and finally a cost model decides whether repartitioning is worthwhile based on the gain between the cost to repartition and the reduced number of distributed transactions. The online partitioning system called Clay is presented in [STE+16]. This system dynamically creates blocks of tuples to migrate among servers during repartitioning. It does not consider replication. The system identifies and groups into clumps the tuples that it calls hot tuples.


These are the tuples that the DBMS wants to migrate to another partition. The clump is enlarged with the tuples frequently accessed together with the hot tuple, and finally the clump is moved to another node. Most recently, [DPP+18] presented preliminary work on an online self-managing partitioning and layout selection system based on reinforcement learning. Their goal is to build a general solution capable of leveraging experience and mimicking specialized methods. The paper presents the agents, environments and actions, but since it is at an early stage, it does not offer a prototype or comprehensive evaluations.

1.4.4 Partitioning by constraints

In this section we summarize all of the previously described partitioning techniques by constraint. The complete list is organized in Table 1.10, in which the main dimensions of the partitioning problem are mentioned. The approaches whose constraint column is marked with a ✓ consider the workload, the number of replicas (higher than 1) or the database schema as part of the problem's inputs. The schema constraint refers to the database schema, or to the platform schema indicating for instance the desired number of replicas.

Table 1.10: Partitioning approaches

The table classifies each approach along five dimensions: constraints (workload-aware, replication, schema), main objective (response time, concurrency, load balance, combined), mechanism (clustering, declustering), algorithm (exact, heuristics) and adaptability (online, offline). The approaches covered are the following.

Vertical: [Hof76], [Nia78], [HN79], [NCWD84], [CY87], [CPW89], [NR89], [LZ93], [RG06], [DPP+19], [Amo10], [BCN17]; object-oriented vertical: [FKL03], [CG97]; hybrid: [ASAB16]; advisors: [RZML02], [ZRL+04], [RDS02], [ANY04], [NB11], [MDA+10], [HBR19].

Horizontal, relational on parallel platforms: [SW83], [DG92], [GD94], [BAC+90], [DS82], [KP88], [FM91], [FB93], [OO85]; relational on centralized and distributed platforms: [HL90], [CNP82], [CNW83], [ECS+08], [CZJM10], [PCZ12], [CILP12], [CGZT14], [TNST14], [ZBS15], [DLL+17], [RJ17], [JD11], [CZC15], [FMS15], [STE+16], [DPP+18], [GLL+20]; object-oriented horizontal: [CWZ94], [KL95], [BKS00], [BKL98]; deductive: [MS94], [NH94]; dimensional: [SMR00], [BKMS00], [BBM07], [BBRW09], [LMV10], [NKH18], [LAP+13].


1.5 Partitioning in large-scale platforms

Back in the early 2000s, the need to incorporate, store and query large amounts of Web data, with different sources and formats and at the lowest cost, contributed to the development of new kinds of data management tools. Systems based on the relational model, especially parallel databases, scale and can cope with vast volumes of data, as demonstrated in [PPR+09]. However, the relational model is too rigid to adapt to the variability of sources and formats of current data. Also, since the data schema is rarely available beforehand, it is impossible to perform the conceptual design stage that is essential when building a relational database. Therefore, systems applying a schema-later strategy have been largely accepted. These systems initially store the data in its raw format and parse it at runtime. The pioneer presenting a novel scalable data management tool was Google, introducing the Google File System [GGL03] and the MapReduce processing model [DG04]. These works served as the basis for Hadoop, one of the most popular large-scale data frameworks to this day. In this section, we start by giving a brief overview of Hadoop, focusing on the partitioning strategy applied by this framework. Then, we explain and detail the most relevant systems built on top of Hadoop and describe their partitioning strategies.

1.5.1 Hadoop ecosystem

As defined in [Had], the Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across shared-nothing clusters of computers using simple programming models. Inspired by the Google File System and MapReduce papers, it is designed to scale up from single servers to thousands of machines, without relying on main memory to execute computations. Additionally, the framework runs programs while tolerating the faults frequently encountered in distributed environments. Hadoop consists of two main modules: i) a storage and distribution layer (the Hadoop Distributed File System, HDFS) and ii) a data processing layer (Hadoop MapReduce and YARN). Part of its architecture is shown in Figure 1.13, in which the systems outlined in blue are part of the Hadoop ecosystem.

Figure 1.13: Large-scale systems. The Hadoop ecosystem (Pig, Hive, Spark, HBase, YARN and HDFS) next to other systems (key-value, document and graph-based stores).

1.5.1.1 Data storage layer

The Hadoop Distributed File System (HDFS) is a distributed data storage system based on a master-slave architecture (the nodes are called name node and data nodes for master and slaves, respectively). HDFS is designed to store structured and unstructured data in a scalable, highly available, and fault-tolerant way. Files are split into blocks of a configurable size (128 MB by default) which are distributed and stored among the data nodes. To ensure fault tolerance and load balancing, data blocks of the same file are replicated and stored on different nodes. The name node (master) manages the file system and regulates access to file blocks and their locations using the file's metadata.


Figure 1.14: MapReduce schema. The input data is split among Map tasks, which emit ⟨k, v⟩ tuples that are grouped by key and sent to Reduce tasks producing the results.

Data partitioning When a file is imported into HDFS, it is sequentially split into small blocks of a fixed storage size. Each block is replicated in a customizable, fixed number of copies, and a hash function is used to distribute the data to the nodes. This strategy is quite simple and distributes the data while controlling its balance. However, it does not consider the schema of the data and may produce network overhead at retrieval.
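A minimal sketch of this import path, following the strategy described above (fixed-size blocks, hash-based placement, replicas on distinct nodes); the placement details are illustrative, since real HDFS placement is also rack-aware:

```python
BLOCK_SIZE = 128 * 1024 * 1024          # 128 MB default block size
REPLICATION = 3
nodes = ["node1", "node2", "node3", "node4"]

def import_file(path):
    """Split a file into blocks and place each block and its replicas."""
    placement = {}
    with open(path, "rb") as f:
        block_id = 0
        while chunk := f.read(BLOCK_SIZE):
            start = hash((path, block_id)) % len(nodes)
            # Replicas of the same block go to distinct consecutive nodes.
            placement[block_id] = [nodes[(start + r) % len(nodes)]
                                   for r in range(REPLICATION)]
            block_id += 1
    return placement
```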

1.5.1.2 Data processing layer

Since Hadoop version 2, the processing layer is divided into two main components: MapReduce and YARN. Both are designed to manage job scheduling, resources and the cluster. In version 1, however, MapReduce provided both resource management and data processing. In Hadoop 2, the resources of the cluster are managed by YARN, and the data processing models, including MapReduce, lean on YARN. In this way, YARN can be used with different processing models while taking advantage of HDFS functionalities.

MapReduce It is a programming framework following a master-slave architecture to process data on large clusters in a parallel, reliable and fault-tolerant manner. The user designs a job that must specify custom-built Map and Reduce functions. When the job is executed, the parallelization of resources and data is transparent to the user. First, the input data is split into independent chunks on which the Map function emits key-value pairs. These results are processed by the Mapper component, in charge of distributing the chunks to the processing nodes, collecting the intermediate key-value pairs, and sorting and grouping them by key before sending them as input to the Reducer. Next, the Reduce function aggregates the values associated with a key according to the predefined program. This procedure is illustrated in Figure 1.14. The first version of MapReduce was based on two components: i) the Job Tracker (master), performing resource management and job scheduling, and ii) the Task Tracker (slave), supervising the execution of the Map and Reduce functions. In the latest version (called YARN), the Job Tracker was split into two separate daemons: i) a global Resource Manager and ii) a per-application Application Master.
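The classic word-count example illustrates this flow. The following minimal sketch runs the Map, shuffle and Reduce phases sequentially for illustration, whereas Hadoop parallelizes each phase across the cluster:

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit one <word, 1> pair per word in the input chunk.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: aggregate all values associated with a key.
    return key, sum(values)

inputs = ["partition the data", "replicate the data"]

# Shuffle: group intermediate pairs by key before the Reduce phase.
groups = defaultdict(list)
for line in inputs:
    for key, value in map_fn(line):
        groups[key].append(value)

results = [reduce_fn(k, vs) for k, vs in groups.items()]
print(results)  # [('partition', 1), ('the', 2), ('data', 2), ('replicate', 1)]
```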

YARN Its name stands for Yet Another Resource Negotiator. The purpose of this resource management and task scheduling technology is to allocate the system's resources to the different applications running in a Hadoop cluster. Before YARN, Hadoop could only run MapReduce applications; adding YARN greatly increased the potential uses of the framework. It separates resource management and planning from the MapReduce data processing component, enabling Hadoop to support more applications and different types of processing. Concretely, the JobTracker of Hadoop 1 is separated into two daemons: 1) the ResourceManager, which allocates and manages resources across the cluster, and 2) the Application Master, designed to schedule tasks, match them with TaskTrackers and monitor their progress.


1.5.1.3 Data querying layer

Several projects were built on top of the Hadoop architecture to avoid the implementation of MapReduce programs to query and analyze data stored in HDFS. If at the beginning MapReduce sought to be an alternative to relational databases, little by little notions of relational databases have been incorporated into the framework. For instance, SQL-like languages have been created to avoid the constraint of programming Map and Reduce functions. Also, other systems built on top of Hadoop incorporate some of the ACID (Atomicity, Consistency, Isolation, Durability) properties. In this section we give an overview of two of the most relevant projects, Apache Hive and Apache Pig, focusing on the partitioning strategy applied by these approaches.

Hive6 is a data warehouse built on top of Hadoop that facilitates reading, writing, and managing large datasets using a SQL-like declarative language (HiveQL). At execution time, queries are translated into MapReduce jobs sent to the Job Tracker of a Hadoop cluster. Tables and databases are declared with a Data Definition Language (DDL), and then the data are loaded into the specified location. Tables in Hive are either internal or external; the first type owns its data, meaning that the data are deleted when the table is dropped. Hive supports only horizontal fragmentation of the data. To partition, it creates subdirectories in HDFS reflecting the partitioning structure. For example, horizontally partitioning the Aircraft table of Figure 1.1 on two attributes will create the following directories in HDFS:

.../aircraft/Dev=Airbus/Model=A320
.../aircraft/Dev=Airbus/Model=A350-900
.../aircraft/Dev=Airbus/Model=A340-600
.../aircraft/Dev=Boeing/Model=B737-800
.../aircraft/Dev=Boeing/Model=B777-300

Data in Hive are therefore only fragmented, since allocation is controlled by the distributed file system: HDFS splits and replicates the files in each directory into blocks that are distributed among the nodes of the cluster. Partitions are declared either statically (at the creation of the table) or dynamically (when inserting the data).
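The mapping from partition-column values to subdirectories can be sketched as follows (a simplified illustration of the layout above, not Hive's actual implementation):

```python
import os

def partition_path(table_root, partition_cols, row):
    """Build the subdirectory storing a row, e.g. .../Dev=Airbus/Model=A320."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return os.path.join(table_root, *parts)

row = {"Dev": "Airbus", "Model": "A320", "Cost": 101}
print(partition_path("/warehouse/aircraft", ["Dev", "Model"], row))
# /warehouse/aircraft/Dev=Airbus/Model=A320
```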

Pig7 is an alternative to Hive, introducing an ETL-like language (named Pig Latin) rather than a query language like HiveQL. Pig Latin statements are transparently transformed into MapReduce jobs that perform the desired transformations on the data. Pig does not allow creating partitions on the working files. Still, it allows reading only the sections relevant to a given job if the data are stored in HDFS partitioned with a tool like HCatalog8 (a table and storage management layer built on top of Hive that incorporates Hive's DDL).

1.5.2 Apache Spark

Spark9 is a unified analytics engine for large-scale data processing. It is based on a master-slave architecture with three main components: the driver program and cluster manager (master), and the worker nodes. In contrast to Hadoop, Spark is based on in-memory strategies to improve performance. Spark's main abstraction is the resilient distributed dataset (RDD), an immutable collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by gathering information from a distributed system, in general

6https://hive.apache.org/ 7https://pig.apache.org/ 8https://cwiki.apache.org/confluence/display/Hive/HCatalog 9https://spark.apache.org/


HDFS, hence data partitioning is not controlled by Spark directly. When the data are already in main memory, an RDD can be partitioned and repartitioned using different methods, such as hash, range, and custom partitioning. Like Hadoop, the Spark project consists of a core engine on top of which several components were developed. The main components are:

Spark SQL This module is specialized for working with structured data. It allows running SQL queries with complex analytics over both data imported from external sources (e.g., Hive tables) and data stored in existing RDDs.

MLlib It is Spark's machine learning (ML) library, providing ready-to-use classification, regression, clustering, and other machine learning algorithms. It is quite similar to Mahout10 for Hadoop.

GraphX Library for manipulating graphs and executing graph-parallel computations, offering built-in graph operators (e.g., subgraph and mapVertices) and graph algorithms (e.g., PageRank).

Streaming This component provides automatic parallelization, as well as scalable and fault-tolerant stream processing in Spark.
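The sketch announced above shows the repartitioning methods on a small pair RDD; it is a minimal local example written against the public PySpark API, not code taken from any of the surveyed systems.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("rdd-partitioning").getOrCreate()
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("c", 3), ("a", 4)])

# Hash partitioning: each key goes to partition hash(key) % numPartitions.
hashed = pairs.partitionBy(8)

# Custom partitioning: any function mapping a key to a partition index.
custom = pairs.partitionBy(2, partitionFunc=lambda k: 0 if k < "b" else 1)

# Range-flavored repartitioning with per-partition sorting.
ranged = pairs.repartitionAndSortWithinPartitions(
    numPartitions=2, partitionFunc=lambda k: 0 if k < "b" else 1)

print(hashed.getNumPartitions(), custom.getNumPartitions())  # 8 2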

1.5.3 NoSQL stores

As mentioned in the introduction of this section, the increase of the data collected from the web, social networks, and mobile and connected devices motivated the need for other data management systems. These systems sought to achieve horizontal scalability, high availability, fault tolerance, and database schema maintainability. However, achieving all of the above requirements with traditional relational database systems is not a simple task, and building newer systems with a different paradigm was necessary. The need was for systems that could cope with the evolution of schemas while handling massive data volumes at the lowest cost. Some of the above requirements are mutually exclusive, and for some applications one may be ready to make trade-offs, giving up one objective for another. This resulted in the emergence of NoSQL stores, some built on top of Apache Hadoop, which offer flexibility and relax some of the guarantees of traditional RDBMSs to boost performance in other respects. The following classification is taken from the work of [DCL18].

1.5.3.1 Key-value

In these systems the data are represented as (key, value) pairs stored in highly efficient lookup data structures. They guarantee very fast lookups and are suitable for applications using a single key to access the data. Data partitioning in these systems is generally performed by hashing the key.
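A minimal Python sketch of this key-hashing scheme is shown below. The four-node cluster and the MD5-based placement are illustrative assumptions; production systems use variants of this idea (consistent hashing, or fixed hash slots as in Redis Cluster).

import hashlib

NUM_NODES = 4  # hypothetical cluster size

def node_for_key(key: str) -> int:
    # Hash the key and map the digest onto one of the nodes.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

# All operations on the same key are routed to the same node.
print(node_for_key("user:42"), node_for_key("user:43"))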

Redis11 It is an in-memory data structure store, used as a database, cache, and message broker. It supports data structures like strings, hashes, lists, sets, sorted sets with range queries, bitmaps, etc. The database keeps the data in main memory but allows persisting it to disk when certain conditions are satisfied.

10 https://mahout.apache.org/
11 https://redis.io/


[Figure 1.15: Internal structure of an HBase row (example) — the Aircraft row key maps to a value organized into column families (Main, Quantity, Others) that group the columns Model, Dev, Length, Cost, and Qualifiers; several versions of each value may be kept.]

1.5.3.2 Column-family In this type of system the data are represented as rows and a fixed number of column families. A column family is a set of attributes (columns) logically related to each other and mostly accessed together. These systems are inspired by another Google project, the Bigtable system [CDG+08]. Data partitioning in these systems is done both horizontally and vertically (by column families). We describe one of the most popular column stores: HBase.

HBase12: it is a distributed and scalable data store in which data are stored as key-value pair structures. The key is the row identifier and the value contains the row's attributes. The row's attributes are further organized into column families, which are physically stored together in a single file in the distributed file system. Also, HBase keeps a customizable number of versions of the attribute values of each row. HBase uses HDFS as its storage layer, but other distributed file systems could be used. Data are fragmented applying a hybrid approach. First, horizontal fragments (called regions) are built on the keys; these fragments are used as the balancing unit by HBase. Then, the row's attributes are vertically partitioned in terms of column families, storing each family in a different file. An example of the data organization of HBase is illustrated in Figure 1.15, and a small client-side sketch follows.
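The sketch below illustrates this organization from the client side, assuming the happybase Python client and a running HBase Thrift server; the table name, column families, and option values follow the Aircraft example and are illustrative.

import happybase

connection = happybase.Connection('localhost')  # hypothetical Thrift endpoint

# Column families are fixed at creation time; each family is stored in its own
# file(s), and HBase can keep several versions of each value.
connection.create_table('aircraft', {
    'main':   dict(max_versions=3),   # e.g., model, dev
    'others': dict(max_versions=1),   # e.g., qualifiers
})

table = connection.table('aircraft')
# The row key drives the horizontal fragmentation into regions.
table.put(b'A320', {b'main:model': b'A320',
                    b'main:dev': b'Airbus',
                    b'others:qualifiers': b'short-haul'})
print(table.row(b'A320'))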

1.5.3.3 Document stores These systems are more elaborate key-value stores in which the value is represented in a structure named a document, encoded in standard semi-structured formats such as XML or JSON.

MongoDB13 This system is the most popular document store; it provides a powerful query language that allows filtering and sorting by any document field. Like key-value stores, it partitions the data based on a key. It uses the shard key associated with a collection to partition the data into chunks. A chunk consists of a subset of the sharded data, and each chunk covers a range with an inclusive lower and an exclusive upper bound on the key.
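The chunk-range logic can be sketched in a few lines; the numeric shard key and the boundary values below are hypothetical, and real deployments let the balancer split and migrate chunks automatically.

from bisect import bisect_right

# Chunk i covers [bounds[i], bounds[i+1]): inclusive lower, exclusive upper.
bounds = [0, 100, 500, 1000]

def chunk_for(shard_key: int) -> int:
    assert bounds[0] <= shard_key < bounds[-1], "key outside the sharded range"
    return bisect_right(bounds, shard_key) - 1

print(chunk_for(42))   # 0 -> chunk [0, 100)
print(chunk_for(100))  # 1 -> chunk [100, 500)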

1.5.3.4 Graph stores These systems efficiently store datasets represented as graphs, providing effective operations for their querying and analysis. They represent a graph as a set of vertices representing entities and edges representing relationships between them. They are divided into native and non-native graph stores, according to whether they are purpose-built for graphs or built on top of a non-graph data store (e.g., relational database, document store, key-value store). The most popular native system is Neo4j [Web12], a highly scalable native graph database, purpose-built to leverage not only data but also data relationships. More details about graph storage will be given in Chapter 2.

12 https://hbase.apache.org/
13 https://www.mongodb.com/


1.5.4 Hybrid architectures In a traditional relational database management system, a triple is retrieved by sending a query request to the database. The query is forwarded to the file system (controlled by the database), which retrieves the corresponding disk blocks and sends them back to the system. The DBMS has much more direct control over the location and the way the data are partitioned among the disk(s). In the previous approaches, the file system (HDFS) and the query engine (e.g., MapReduce, HBase, Spark) do not form a single unit, and even if some approaches fragment the data into distinct files, the data allocation and replication tasks are left to the distributed file system. In this section we overview some hybrid architectures that intentionally place the data on specific nodes of the distributed file system or of another type of storage layer.

HadoopDB [ABA+09] is a hybrid architecture that uses an existing RDBMS to store the data in each worker, instead of the HDFS used by Hive. This allows exploiting the partitioning, indexing, buffering, and compression capabilities offered by DBMSs. The data are partitioned using the horizontal approaches presented in the last section. The approach named CoHadoop [ETÖ+11] locates data intentionally on specific data nodes. This is achieved by extending the HDFS with a file-level property (locator) such that all files with the same locator are placed on the same set of DataNodes. The NameNode is augmented with an index (i.e., a location table) specifying where the data must be placed. Partitioning is then decided by the user (using the workload, for example). Their approach is similar to Hadoop++ [DQJ+10], which also supports data co-partitioning. The Trojan Data Layouts were proposed in [JQD11]. This approach internally organizes data blocks into attribute groups according to the workload in order to improve data access times. The system fully preserves the fault-tolerance properties of MapReduce, since it keeps the replicated data blocks and only adapts the way each block organizes data internally. The algorithm first applies a column grouping method to cluster the queries of a workload according to their access patterns. It then maps each resulting query group to one data block replica to compute the Trojan layout for that replica.

Other approaches have proposed improvements to the Hadoop framework based on the physical data layouts in the HDFS. Data in HDFS are usually organized using a row-oriented layout, since a pure column store has severe drawbacks: the data for different columns may reside on different nodes, leading to high network costs [DQ12]. Still, some systems proposed to store the data as columns. Llama [LAC+11] stores the data as columnar files. Its performance is comparable to databases, being better for queries involving a small number of attributes. This system was outperformed by [Che10], which also employs a columnar format to store the data but implements a Partition Attributes Across (PAX [ADHS01]) layout, in which all data belonging to a record are stored in the same block. In [RHAF15], Romero et al. proposed a system to solve analytic queries with a cost-based approach. Their system is implemented as a three-level architecture running HDFS at the lowest level, HBase as the storage manager, and MapReduce as the query execution engine. In their approach, the records were indexed and partitioned in HBase.

1.6 Conclusion

With the data deluge and the explosion of deployment architectures, data partitioning has become an essential technique that database actors (designers, administrators, architects, analysts, students, researchers) have to know in detail. At the same time, data partitioning has been studied since the advent of databases, evolving with each of their generations. In this chapter we sought to give a historic overview of the data partitioning problem: its history, evolution, and constraints, and how it behaves after the arrival of each new data representation and deployment architecture. To do so, we proposed a data partitioning foundation that includes implicit and explicit definitions and types, as well as data partitioning dimensions comprising: (1) partitioning type,


(2) algorithm, (3) main objective, (4) adaptability, (5) mechanism, (6) cost model, (7) system element, (8) platform, (9) constraints, and (10) data model. These dimensions offer a graphical (star-schema-like) representation of data partitioning. For each dimension, we described its characteristics and depicted its elements and labels. We focused on presenting the state of the art of data partitioning in relational databases and other close models in centralized, distributed, and parallel architectures. We cited the works that led to an explicit definition of the partitioning problem and showed the gradual development of the main data partitioning forms. Subsequently, we exposed the algorithms, strategies, and heuristics introduced to partition a database according to the previously defined dimensions. This comprehensive review of data partitioning can serve as a torchlight for readers and actors working on this exciting and durable problem. Finally, we showed how the problem has evolved in current cloud platforms and how it adapted to cope with large-scale data. In the following chapter we show how the problem has been treated specifically for graphs.


Chapter 2

Graph Data : Representation and Processing

Contents

2.1 Introduction ...... 63

2.2 Graph database models ...... 63

2.2.1 Logical graph data structures ...... 64

2.2.2 Data storage ...... 65

2.2.3 Query and manipulation languages ...... 66

2.2.4 Query processing ...... 69

2.2.5 Graph partitioning ...... 71

2.2.6 Graph databases ...... 77

2.3 Resource Description Framework ...... 80

2.3.1 Background ...... 80

2.3.2 Storage models ...... 82

2.3.3 Processing strategies ...... 86

2.3.4 Data partitioning ...... 87

2.4 Conclusion ...... 91


Summary In the previous chapter we surveyed the partitioning problem in relational databases. Our goal is to provide a partitioning environment to managers of Knowledge Graphs. To do this, we cannot disregard how Knowledge Graphs have been treated in the literature. In this chapter, we start by giving a general overview of the graph data model. In Section 2.2, we detail the logical and storage structures for graphs, describe the query processing strategies and languages (Sect. 2.2.1 - Sect. 2.2.4), and report some of the most relevant currently available systems. We also dedicate Section 2.2.5 to the graph partitioning algorithms used by some parallel systems. Next, in Section 2.3, we focus on the Resource Description Framework (RDF). We start by giving some background concepts (Sect. 2.3.1), and similarly to what we did with the pure graph systems, we describe RDF storage (Sect. 2.3.2), processing (Sect. 2.3.3), and its partitioning strategies (Sect. 2.3.4). We conclude by summarizing the presented works and comparing the strengths of both worlds (relational and graph).


2.1 Introduction

In the previous chapter we detailed the partitioning problem in relational databases. We showed its evolution and how it was adapted to the object-oriented, deductive, and multi-dimensional data models. We explained the development of large-scale platforms and how they adapted to the recent need to efficiently deal with big data volumes. However, the data models discussed derive directly from the widely studied relational design. In this chapter we detail a data model that allows describing more complex relationships: graph-oriented models, which represent the data as graphs and whose manipulation is expressed with graph-oriented operations. They have gained momentum in the past years with the development of the Semantic Web and social networks, and with the increased ability to capture data from disciplines like biological and transportation networks. In these applications, the interconnectivity of the data is a key feature and is as important as the data itself. Modeling the data as a graph allows handling these applications in a more natural way. In graph-based systems, queries refer directly to the original graph structure, and graph operations like finding the shortest path can be directly expressed with specific query languages. In this chapter we start by giving more insights about the graph data model in general (Section 2.2), discussing its storage, processing, and partitioning. Then, in Section 2.3, we focus on the RDF model introduced to represent the data of the Semantic Web, giving some main definitions and describing the most relevant storage and processing approaches.

2.2 Graph database models

The research around graph database models is not new; it actually had a peak in the late eighties/early nineties, but it was eclipsed by the development of tree-based structures like the XML modeling of first-generation websites. The use of graphs in databases dates back to the beginning of relational databases in the mid-1970s, when graph models were proposed to expand the information stored in databases with semantic networks (first defined in [Sim72]). Also, graphs were used by the legacy data models (hierarchical and network models) to depict the navigation of their records. The historically influential network data model [TF76] represents data as physical records that are navigated using a graph-based structure (a network). This model, also known as CODASYL, browses through the records' parents (owners) and children (members), remaining tied to the physical implementation of the data. In what follows we give a brief overview of the early graph database approaches. The list of approaches is not exhaustive; we mention only a few of them to give a general idea of the concept's evolution. A complete survey is found in [AG08]. In chronological order and according to the kind of system in which they appeared, the first graph database models were the following:

Semantic networks The work of [RM75] proposed the use of a semantic network extending Codd's original model to represent knowledge about the database. The semantic network defines the meaning of the data in a labeled directed graph (described in Section 2.2.1) where both nodes and edges are labeled. This network contributes to the generation of the relational schema and to the definition of semantic operators. Similarly, a semantic network was used by [Shi81] to describe the database schema in their functional data model and language named DAPLEX. DAPLEX's basic components are entities and functions; in the graph structure, the nodes correspond to entity types and the arrows depict functions. The work of [KV84] used a directed graph to represent the database structure before defining a logic and an algebraic query language generalizing the relational, hierarchical, and network data models. In this line, the system G-Base [Kun87] represented the database in terms of knowledge structures using a labeled directed graph, also introducing a data manipulation language for the stored graph.


Object-oriented As mentioned in the previous chapter, object-oriented databases (OODBs) had their boom in the late eighties. Graph-oriented models for OODBs were proposed in the systems O2 and GOOD, described in [LRV88] and [GPdBG94] respectively. A directed graph is used in both systems to represent the objects and the relationships between them. GOOD (Graph-Oriented Object Database) is a theoretical support for systems in which the data representation and its manipulation are graph-based. GOOD introduces basic operations and a language based on pattern matching to manipulate graph-shaped object bases. This work served as a basis for several research projects, for example G-Log [PPT95], which introduced a declarative query language for graphs.

Former graph models The GraphDB system [Güt94] proposed a data model and a query language for graphs in a standard database environment. Their goal was to smoothly integrate graph notions into the popular database systems of the time (object-oriented, specifically). To achieve this, they defined a graph database and a query language on top of classes and hierarchies derived from OODBs. In parallel, the work of [GPST94], presented at the same conference, introduced Graph Views to define and manipulate graphs independently of the system used to persist the data (e.g., relational, object-oriented, and file systems). To manipulate the graphs, they proposed derivation operators defining new graph views upon existing ones. Other works used hypergraphs to represent the data in a much more general model; for example, Hy+ [CM93] provides a user interface to formulate queries as a set of graph patterns.

The beginning of the 1990s marked the emergence of the best-known aspect of the Internet today: the Web as a set of pages in HTML format incorporating text, links, and images, addressable via a URL and accessible using the HTTP protocol. The Internet boom undoubtedly diverted the efforts devoted to graph-based models towards the semi-structured models originally used to publish data on the Web. The success and exponential growth of Web content forced the creation of standards to order and exploit the information available on the Internet. The World Wide Web Consortium (W3C), founded in 1994, is in charge of developing and maintaining these standards, among which we find HTML, XML, RDF, and SPARQL. The Web is naturally modeled as a graph: Web pages are nodes interconnected via hyperlinks. The famous PageRank search algorithm used by Google applies this modeling to measure the importance of webpages in its search engine. Web standards (e.g., RDF) also use a graph to model the interconnection of resources on the Web. More recently, the massive use of social networks and the possibility to recover data from many domains pushed the return and development of graph models. In the following section, we explain the logical structures used to represent graphs.

2.2.1 Logical graph data structures

As previously mentioned, graph data models represent the data as graphs, providing efficient traversal operations to query and analyze them. Graphs are composed of entities, representing something that exists as a single or composed unit, and relations, establishing connections between two or more entities. In this section we describe the data structures used to model entities and relations in the graph data model. These structures are not mutually exclusive; some are combinations of others. A few of the structures are illustrated in Figure 2.1.

• Directed/Undirected graphs In an undirected graph like the one shown in Figure 2.1a, all relationships are symmetric. In contrast, the edges of directed graphs, illustrated in Figure 2.1b, have a single source and a single destination vertex. The numbers of incoming and outgoing edges define the in-degree and out-degree of each node, respectively.


[Figure 2.1: Graph data structures — (a) undirected graph; (b) directed graph; (c) directed labeled graph over the nodes a, b, c, e, f, g, h with labeled edges p1-p6; (d) property graph: persons Gabriel (born 1989), Lisa (born 1985), and Thomas (born 1990) connected by :knows edges (with a since property) and a :worksFor edge; (e) hypergraph with three hyperedges e1, e2, e3.]

• Labeled graphs In this type of structure, vertices and/or edges are tagged with labels representing their respective roles, types, or some metadata. A particular type of labeled graph, named edge-labeled graph, in which labeled directed edges indicate the different types of relationships, is shown in Figure 2.1c. This variant is the one embraced by the Resource Description Framework described in Section 2.3.

• Multigraph These graphs allow multiple edges between the same pair of vertices, as well as self-loops.

• Attributed graphs In this structure, the nodes and edges have properties expressed as a variable-length list of key-value pairs. An attributed, directed, labeled multigraph named property graph is given in Figure 2.1d. The property graph model is implemented by many graph-based systems and is used as a reference model in several research works. An in-depth description of property graphs is given in [BFVY18].

• Hypergraphs This model is said to be a generalization of a graph in which an edge can join any number of vertices instead of only two. The set of nodes connected by the same edge is called a hyperedge; when the hyperedge is directed, the set of nodes is ordered. A simple hypergraph with 3 hyperedges is shown in Figure 2.1e.

2.2.2 Data storage Whether a system was built specifically to treat graphs or on top of another platform, each of the structures described previously is stored according to one of the strategies described in this section.

• Adjacency matrix This matrix is a 2-dimensional array of size n × n, where n is the number of vertices in the graph. The matrix stores a 1-bit in cell mi,j if there is an edge from vertex i to vertex j, and a 0-bit otherwise. It is the representation with the simplest implementation; however, it suffers from space overheads, even for sparse graphs. An example of this structure is given in Figure 2.2a. The figure shows a variation of the adjacency matrix in which the edge ids between two nodes are indicated directly in the cells.

• Adjacency list This representation stores the graph as a set of lists. Each row of the array corresponds to a vertex paired with a list of its neighboring nodes. Usually, it stores a node with its out-neighbors; however, some systems enable replication, storing


both in- and out-neighbors of a node. Efficient index structures over vertices and edges are created so that random access to single elements is fast. This storage scheme is ideal for graph systems built on top of key-value stores and for systems prioritizing online transaction processing (OLTP) workloads [DCL18]. An outward adjacency list for the directed graph is illustrated in Figure 2.2b.

• Edge list In contrast to the adjacency list, this storage structure represents a graph as a list of edges, indicating for each edge its corresponding incident vertices. This structure, in which all the information associated with an edge is clustered, is useful for the representation of hypergraphs, in which an edge is associated with a set of vertices. Some recent popular graph stores, like Neo4j [Web12], use this representation. It is also preferred by some RDF processing systems. An example of such a structure is given in Figure 2.2c.

• Compressed sparse row (CSR) It is one of the most widely used storage structures, thanks to its efficient lookups. To store a graph it uses two arrays of integers; more arrays can be added to encode further information about nodes and edges (e.g., weights). This is the representation used by the METIS [KK98a] graph partitioner discussed in Section 2.2.5. The first array (named the edge array) stores the adjacency lists of all vertices together, keeping all edges contiguous; its size is therefore the total number of edges in the graph. The second array, named the vertex array, is an index into the previous array, mapping a vertex ID to the ID of its first outgoing edge. Consider the example in Figure 2.2d, in which the node and edge labels are encoded before appending them to the edge and vertex arrays. In this example, the CSR represents the directed labeled graph of Figure 2.1c (a Python sketch of this construction is given after this list).

• Complex indexes: This category comprises index structures more sophisticated than a regular adjacency list. We can mention for instance the compressed bitmap indexes of the system DEX [MMG+07], used to efficiently combine multiple adjacency lists with bit operations.
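To make the CSR layout concrete, the following Python sketch builds the two arrays for the directed graph of Figure 2.1c. The integer vertex encoding (a = 0, ..., h = 6) mirrors Figure 2.2d and is an assumption of this sketch; edge labels are omitted for brevity.

# Edges of Figure 2.1c as (source, destination) pairs; a=0, b=1, c=2, e=3, f=4, g=5, h=6.
edges = sorted([(0, 1), (1, 2), (2, 3), (2, 4), (3, 5), (3, 6)])
n = 7

# vertex_array[v] is the index in edge_array of v's first outgoing edge, so
# v's out-neighbors are edge_array[vertex_array[v]:vertex_array[v + 1]].
vertex_array = [0] * (n + 1)
for s, _ in edges:
    vertex_array[s + 1] += 1
for v in range(n):
    vertex_array[v + 1] += vertex_array[v]

edge_array = [d for _, d in edges]  # edges sorted by source give the CSR order

def out_neighbors(v):
    return edge_array[vertex_array[v]:vertex_array[v + 1]]

print(out_neighbors(2))  # [3, 4]: c's out-neighbors are e and f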

2.2.3 Query and manipulation languages Among graph data models there are several works on query languages. These works present collections of operators and inference rules to manipulate and query the graph data structure. Considering the vast number of available approaches and the scope of this thesis, in this section we describe only the most recent and popular graph query languages. An in-depth analysis of query languages for graph databases is given in [AAB+17]. We start by describing the main types of query workloads and then give an overview of the query languages.

2.2.3.1 Query workloads Similar to relational databases, the workloads can be divided into transactional and analytical. In the first type of queries, small fractions of the graph are explored and a quick response time is needed. They can be classified into:

• Path queries: This type seeks to determine whether a vertex v can be reached from another vertex w given a set of conditions. A path query has the form P = x −α→ y, where α specifies the conditions on the path and x and y can be variables or specific nodes; examples of conditions are satisfying a regular expression over the node or edge labels, or finding a shortest path between both nodes. Path queries using regular expressions are known as regular path queries.

• Navigational/Pattern matching queries: These queries are more complex than the previous ones, since their objective is to find the subgraphs of the data graph that are


[Figure 2.2: Storage structures for the graph of Figure 2.1c — (a) adjacency matrix (cells hold edge ids, e.g., m_{a,b} = p1); (b) adjacency list; (c) edge list (p1: a→b, p2: b→c, p3: c→e, p4: c→f, p5: e→g, p6: e→h); (d) compressed sparse row, with vertex encoding a=0, b=1, c=2, e=3, f=4, g=5 and edge encoding p1=0, ..., p6=5, giving edge-array = [1, 2, 3, 4, 4, 5] and node-array = [0, 1, 2, 4, 5, 6].]

isomorphic to a given query pattern. They are graph-structured queries that must be matched against the database. The other type of graph queries expresses analytical graph computations, where a significant fraction of the vertices and edges of the graph is accessed. Similar to SQL in relational databases, some languages offer aggregation operators, giving the possibility to group values and compute statistics over these groups.

2.2.3.2 Query languages In this section we give, for the most popular modern query languages, their main characteristics, strengths, and drawbacks. All of the languages presented here share graph pattern matching operations.

• SPARQL [SH13a]: This language was standardized by the W3C in 2008 to query RDF graphs. Its basic structures are triple patterns, which are RDF triples composed of a subject, a predicate, and an object that may be bound values or variables. Triple patterns can be combined using conjunctions to form basic graph patterns. In addition to the evaluation of graph matching queries, it offers graph-creating functionality with the CONSTRUCT operator. The language does not allow query composition (i.e., using the output of a query as the input of another query without persisting the result). The query represented in Figure 2.3a is expressed as:

SELECT ?c ?f
WHERE { ?c p3 ?e .
        ?c p4 ?f .
        ?e p5 ?g . }

• Cypher [FGG+18]: This language was developed by the team behind the Neo4j graph system to query property graphs, using patterns as its main building block. Patterns are


[Figure 2.3: Graph query examples — (a) RDF query with variables ?c, ?e, ?f, ?g connected by the predicates p3, p4, p5; (b) property graph query binding x1 and x3 around the node with id:2 (name = 'Lisa') via :knows edges.]

expressed with a graphical intuition, encoding nodes and edges with arrows between them. The MATCH clause specifies the basic graph pattern, with nodes and edges expressed inside parentheses and brackets, respectively. Specific values for properties are specified inside a node within braces, and node labels are separated with a ":" symbol or can also be expressed in the WHERE clause. The RETURN clause expresses the projected output variables. The language supports selection, projection, grouping, aggregation, and ordering. The Cypher query for the property graph query shown in Figure 2.3b is expressed as:

MATCH (x1) -[:knows]-> (x2 {name: "Lisa"})
MATCH (x2) -[:knows]-> (x3)
RETURN x1.name AS name1, x3.name AS name2

• Gremlin [Pro19]: This language is part of the Apache TinkerPop3 graph computing framework1. It was originally specified to query property graphs, but it is quite different from the previous languages. Instead of having SQL-like operators forming a declarative language, it provides a much more programming-like interface focusing on graph traversal operators. For example, in the following query over the property graph of Figure 2.3b, the call g.V() returns the set of all nodes in the graph, and then sequences of selections are applied on that set. The step out retrieves all nodes that can be reached through an edge with the label knows.

g.V().has('name','Gabriel').out('knows')
     .has('name','Lisa').out('knows')
     .has('name','Thomas')

• Oracle PGQL [vRHK+16]: a graph query language built on top of SQL, providing pattern matching capabilities. It is part of the Oracle Spatial and Graph products2. It also allows creating and manipulating existing graphs. The following query expresses the graph query shown in Figure 2.3b.

SELECT x1.name AS name1, x3.name AS name2
FROM MATCH (x1) -[:knows]-> (x2),
     MATCH (x2) -[:knows]-> (x3)
WHERE x2.name = 'Lisa'

• G-Core [AAB+18]: it is a practical language developed by an industry and academia consortium with the aim of providing a natural syntax for the most important graph query features. It allows not only querying but also creating and composing new graphs. For example, a nested query constructing a new graph pattern from the graph of Figure 2.1d is shown here:

1 https://tinkerpop.apache.org/
2 https://pgql-lang.org/


SELECT id(m), m2
MATCH (id:2) -[:knows]-> (h), (h) -[:knows]-> (m)
ON ( CONSTRUCT (h) -[:friendOf]-> (m2), (m2) -[:knows]-> (m) )

2.2.4 Query processing

The query processing strategies depend on the type of graph query targeted by the application. These strategies can be as simple as finding the direct neighbors of a node or finding isomorphic graphs given a query pattern, or as complex as the traversal paths used in algorithms like PageRank. In this section, we give an overview of the processing strategies used for subgraph matching queries, which are the core of several query languages. Understanding how this type of query is solved serves as a base to understand how more complex graph queries are processed. A detailed description of each algorithm, and of more complex strategies dealing with reachability, is found in [BFVY18]. Let us first give some background concepts that will be used throughout this section. Given a directed labeled graph G = (V, E), a conjunctive query CQ allows identifying substructures/patterns in G. It asks for all the subgraphs of G matching a given CQ pattern.

Definition 2.1 (Conjunctive query) [BFVY18] Let 𝒱 be a set of vertex variables. A conjunctive query graph is the expression:

(z1, ..., zm) ← a1(x1, y1), ..., an(xn, yn)

such that:

• x1, y1, ..., xn, yn ∈ 𝒱;

• a1, ..., an ∈ L (the set of labels);

• every zi ∈ {x1, y1, ..., xn, yn}.

Now, let us define the mapping function that, as indicated by its name, maps nodes and edges of a query to nodes and edges of the graph. More formally:

Definition 2.2 (Mapping) [BFVY18] A mapping is a function µ : 𝒱 → V such that for all 1 ≤ i ≤ n there exists an edge ei ∈ E with ei = (µ(xi), µ(yi)) and label ai. Evaluating a conjunctive query r over a graph G is expressed as:

⟦r⟧G = {(µ(z1), ..., µ(zm)) | µ is a mapping of r on G}

Two graphs K = (VK, EK) and H = (VH, EH) are isomorphic if there is a mapping µ : VK → VH such that (x, y) ∈ EK ⟺ (µ(x), µ(y)) ∈ EH. In this case, µ maps edges in K to edges in H and non-edges in K to non-edges in H. In contrast, non-edge mappings are not enforced in homomorphic mappings, for which only (x, y) ∈ EK ⇒ (µ(x), µ(y)) ∈ EH holds. Let us consider the graph given in Figure 2.1c and the following conjunctive query: q = (?c, ?f, ?g) ← p3(?c, ?e), p4(?c, ?f), p5(?e, ?g). The evaluation of the query on the given graph G is obtained from the mapping µ = {?c ↦ c, ?e ↦ e, ?f ↦ f, ?g ↦ g}, yielding ⟦q⟧G = {(c, f, g)}.

[Figure 2.4: Node visiting orders of the two traversal strategies — (a) depth-first search (DFS), (b) breadth-first search (BFS).]

Subgraph matching The process of finding all subgraphs of a given graph G that are iso(homo)morphic to a given conjunctive query r is called subgraph matching. In this case, each subgraph is encoded by a mapping µ of nodes/edges from r to G. In general, the subgraph matching process consists of combining partial mappings in order to obtain the final subgraph match. A partial mapping is formally defined as follows:

Definition 2.3 (Partial mapping) [BFVY18] Given a graph G, a graph pattern r[z1, ..., zm], and a mapping µ of r on G, a partial mapping is a sequence of pairs ⟨(v1, µ(v1)), ..., (vk, µ(vk))⟩ such that v1, ..., vk is a sequence of distinct vertices of r. Subgraph matching algorithms are based on the depth-first search (DFS) or breadth-first search (BFS) techniques described in the following sections.

2.2.4.1 Depth-first search (DFS) DFS-based algorithms for subgraph matching rely on the backtracking principle. This principle is illustrated in Figure 2.4a: starting from an arbitrary node as the root, the algorithm explores as far as possible along each branch before backtracking. Its implementation is usually based on recursion, but such an implementation fails to treat large graphs in a reasonable time; therefore, it is emulated using an iterative, state-based graph search. The search is guided by a plan, an ordered list of vertices indicating the sequence of query nodes assigned to graph vertices while exploring. An example of such an algorithm is matrix-based subgraph matching, which models the query and data graphs as adjacency matrices and solves a conjunctive query using matrix operations. For given query and data adjacency matrices, it defines a permutation matrix P expressing the assignment of query vertices to graph vertices. In this matrix, each row i contains exactly one cell pij equal to 1 (indicating a mapping between the variable node i and the data node j), while all other entries are equal to 0. A permutation matrix for the graph and query of Figures 2.1c and 2.3a, respectively, is shown in Figure 2.5a. Even if for the previous query there is only a single valid permutation matrix, this is not the general case, since a single query might have multiple valid permutation matrices. To check that the permutation matrix encodes an isomorphic mapping, the following expression must hold:

A_Q = P · A_G · Pᵀ

where A_Q and A_G represent the query and graph adjacency matrices, respectively. The algorithm finding all possible mappings relies on filling P progressively, given that the isomorphism condition must be fulfilled in the partial permutation matrices.

[Figure 2.5: Examples of Sections 2.2.4.1 and 2.2.4.2 — (a) the permutation matrix mapping the query variables ?c, ?e, ?f, ?g of Figure 2.3a to the vertices c, e, f, g of Figure 2.1c; (b) a relational logical query tree joining the selections σ G1.Edge=p4, σ G2.Edge=p3, and σ G3.Edge=p5 on their shared node columns before the final projection.]

A naïve approach would start at the first row, setting the first column to 1 and all the other values to 0. If the isomorphism condition is fulfilled, it continues to the next row and the procedure is called recursively. If the isomorphism condition is violated, it backtracks to the previous row, sets the next column to 1 and all the other values to 0, and calls the procedure again. It should be noted that the intermediate results are not stored in any structure, making DFS-based algorithms memory efficient. More details are found in [BFVY18].
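The check A_Q = P · A_G · Pᵀ can be reproduced with a few lines of NumPy. This is an illustrative sketch: edge labels are ignored, and the vertex orders (?c, ?e, ?f, ?g) for the query and (a, ..., h) for the graph are assumptions matching Figure 2.5a.

import numpy as np

A_G = np.zeros((7, 7), dtype=int)              # graph of Figure 2.1c, a=0 ... h=6
for s, d in [(0, 1), (1, 2), (2, 3), (2, 4), (3, 5), (3, 6)]:
    A_G[s, d] = 1

A_Q = np.zeros((4, 4), dtype=int)              # query of Figure 2.3a, ?c=0 ?e=1 ?f=2 ?g=3
for s, d in [(0, 1), (0, 2), (1, 3)]:          # ?c->?e, ?c->?f, ?e->?g
    A_Q[s, d] = 1

P = np.zeros((4, 7), dtype=int)                # permutation matrix of Figure 2.5a
for q, g in [(0, 2), (1, 3), (2, 4), (3, 5)]:  # ?c->c, ?e->e, ?f->f, ?g->g
    P[q, g] = 1

print(np.array_equal(A_Q, P @ A_G @ P.T))      # True: the mapping is valid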

2.2.4.2 Breadth-first search (BFS) Several subgraph matching algorithms are based on the BFS search strategy illustrated in Figure 2.4b. Contrary to DFS, it does not use recursion; its iterative nature makes it less complex, and it can even be optimized with dynamic programming. However, it needs a structure to memorize the partial solutions that, when combined, could become a subgraph match. Algorithms based on the relational model follow a BFS-like graph exploration. Consider the query of Figure 2.3a and assume that the data are stored in a relational database in a single table like the one shown in Figure 2.2c; one possible logical execution tree is given in Figure 2.5b. The self-joins on the table are executed pairwise, in a strategy equivalent to BFS-like subgraph matching. In this case, the intermediate results are stored in main memory, causing excessive memory consumption: in the example of Figure 2.5b, at least two intermediate result structures must be kept in main memory before projecting the values of the final result.
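The pairwise self-join strategy can be sketched with pandas standing in for the relational engine; the edge table below is a hypothetical encoding of Figure 2.2c. Note how each merge materializes an intermediate result in memory.

import pandas as pd

edges = pd.DataFrame({
    "edge": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "src":  ["a", "b", "c", "c", "e", "e"],
    "dst":  ["b", "c", "e", "f", "g", "h"],
})

g1 = edges[edges.edge == "p4"]  # ?c -p4-> ?f
g2 = edges[edges.edge == "p3"]  # ?c -p3-> ?e
g3 = edges[edges.edge == "p5"]  # ?e -p5-> ?g

# First self-join on the shared subject ?c, then a join chaining ?e to p5.
step1 = g1.merge(g2, on="src", suffixes=("_1", "_2"))
step2 = step1.merge(g3, left_on="dst_2", right_on="src", suffixes=("", "_3"))

# Bindings: src = ?c, dst_1 = ?f, dst_2 = ?e, dst = ?g.
print(step2[["src", "dst_1", "dst_2", "dst"]])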

2.2.5 Graph partitioning Over the past years, the amount of data stored as graphs has increased considerably. For example, the Facebook data graph counts a billion nodes and about two hundred billion edges [LNP16]. Distributed and parallel solutions emerged to cope with the demand for efficient graph storage and processing. In these systems, cutting the graph into smaller pieces is a fundamental step to enable parallelism. Graph partitioning must find a balance between the following objectives: • Minimize communication overhead: The communication between hosts dominates query execution in large-scale systems, and optimizing this factor is key to performance. In general, this objective is mapped to the minimization of the total edge cut (i.e., the number of crossing edges whose endpoints lie in different partitions).

• Maximize load balance: this objective refers to avoiding data and execution skewness, i.e., concentrating most of the data in just a few sites. The problem described above is also known as the edge-cut partitioning problem, in which the vertices of a graph are divided into disjoint clusters such that each cluster has almost the same size while the number of edges with endpoints in two partitions is minimized.


[Figure 2.6: Partitioning problems — (a) edge-cut, (b) vertex-cut.]

A different approach, named vertex-cut, consists in dividing the edges into equal-size clusters such that both vertices of an edge are always assigned to the same cluster. In this approach the vertices are not unique, since they might be replicated due to the distribution of a node's edges across distinct partitions. In this case, the goals are to: i) balance the partitions in terms of number of edges, and ii) minimize the number of replicated vertices. Both problems are illustrated in Figure 2.6 (a small computational sketch of the two objectives is given below). The graph partitioning problem has been proven to be NP-complete [GJS76], and finding a reasonable solution to the balanced partitioning problem is very hard. In this section we give an overview of the most recent and essential works in graph partitioning. The mathematical foundations and an in-depth description of each approach are given in the survey [BMS+16].
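Both objectives can be made concrete on a toy graph. The following sketch is an illustration only (not a published algorithm): it computes the edge cut of a vertex assignment and the number of vertex replicas induced by an edge assignment.

# Undirected toy graph with two natural clusters {0, 1, 2} and {3, 4, 5}.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]

# Edge-cut partitioning: vertices get disjoint parts; count crossing edges.
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
edge_cut = sum(1 for u, v in edges if part[u] != part[v])
print("edge cut:", edge_cut)             # 1 (only the edge (2, 3) crosses)

# Vertex-cut partitioning: edges get parts; boundary vertices are replicated.
edge_part = {e: (0 if max(e) <= 2 else 1) for e in edges}
parts_of = {}
for (u, v), p in edge_part.items():
    parts_of.setdefault(u, set()).add(p)
    parts_of.setdefault(v, set()).add(p)
replicas = sum(len(ps) - 1 for ps in parts_of.values())
print("replicated vertices:", replicas)  # 1 (vertex 2 appears in both parts)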

2.2.5.1 Algorithms In this section we describe the most relevant graph partitioning algorithms. We divide them into two groups: the first one takes an initial partitioning of the graph as input; these algorithms are known as graph growing methods. On the other hand, global partitioning algorithms start with the entire original graph and directly compute the partitions. This classification summarizes the one presented in [BMS+16]. A schema of the classification of the approaches briefly described in this section is shown in Figure 2.7.

Global partitioning algorithms As previously mentioned, these algorithms start with the entire graph and compute a solution directly. Due to their complexity, they are mainly used to partition small graphs; in fact, some of these approaches are used to produce the initial partitioning required by graph growing algorithms. We first describe exact approaches, which find the optimal solution to the graph partitioning problem using exhaustive enumeration techniques (e.g., branch and bound). Due to their complexity, most of these methods are dedicated to bipartitioning the graph; still, they can be generalized to a k-way partitioning by recursion. Most of the exact algorithms use branch and bound to enumerate the candidate solutions. To estimate the bounds of the optimal solution, graph partitioning has been modeled as a linear program [BCR97] or a quadratic program [HK00], to name a few. All of the previous methods are able to find the optimal partitions only for graphs of very limited size. To speed up the running time, heuristics have been proposed to find solutions that approximate the optimal graph partitioning. These solutions comprise:

• Spectral partitioning: This strategy is applied to split a graph into two groups by analyzing the spectrum of a matrix representing the graph. The spectrum is the set of eigenvectors ordered by the magnitude of their corresponding eigenvalues. The strategy builds the Laplacian matrix of the graph and calculates the eigenvector associated with its second smallest eigenvalue, which is proven to be a solution to a simplified (relaxed) version of the graph partitioning problem.

[Figure 2.7: Graph partitioning algorithms — global algorithms (exact methods, and heuristics: spectral, graph growing, flows, geometric, streaming) versus graph growing algorithms (heuristics: binary node swapping, k-way swapping, tabu search, k-flows, bubble, random walks and diffusion, multi-level).]

Spectral methods were introduced in the seventies in [DH72] and later improved, for example, in works like [BS93].

• Graph growing: The algorithms in this category (e.g., [GL81]) explore the graph using a breadth-first search (BFS), assigning each visited node to the same group. The process stops when half of the original node weights are assigned to one partition; the remaining nodes are assigned to the other.

• Flows: This type uses the max-flow min-cut theorem to compute the partitions. The work of [BCLS87], for instance, calculates the maximum flow between the nodes of the same partition, but it ignores the load balance constraint.

• Geometric partitioning: The approaches in this category use the coordinates of the graph's nodes in space. Nodes that are geometrically close are mapped to the same partition, with the aim of reducing the global edge cut. They comprise, for example, the well-known space-filling curves [PB94].

• Streaming partitioning: This strategy has gained importance in the past years with the development of large-scale streaming applications. These strategies trade solution quality for partitioning speed. Among them we find [NU13] and FENNEL [TGRV14] (a toy sketch of this family of heuristics follows this list).
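The flavor of these streaming heuristics can be captured in a few lines. The following is a simplified greedy rule in the spirit of LDG/FENNEL, not the exact published algorithms: each arriving vertex joins the partition holding most of its already-placed neighbors, subject to a capacity constraint.

def stream_partition(vertices, adjacency, k, capacity):
    parts = [set() for _ in range(k)]
    assignment = {}
    for v in vertices:  # vertices arrive one by one, as a stream
        candidates = [i for i in range(k) if len(parts[i]) < capacity]
        # Favor partitions already holding neighbors of v, discounted by size.
        best = max(candidates, key=lambda i: sum(
            1 for u in adjacency[v] if assignment.get(u) == i
        ) * (1.0 - len(parts[i]) / capacity))
        parts[best].add(v)
        assignment[v] = best
    return assignment

adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
print(stream_partition(range(5), adjacency, k=2, capacity=3))
# {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}: a single cut edge, (2, 3)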

Graph growing algorithms These algorithms iteratively improve an initial partitioning solution. Most current solvers use this type of heuristic. They can be classified as:

• Binary node swapping: This strategy iterates over a set of nodes and their neighbors and swaps nodes between the two partitions whenever the swap decreases the total number of cut edges. The solution was first presented by Kernighan and Lin in [KL70] and later improved by many other authors.

• K-way swapping: These approaches are a workaround to the recursive application of bisection methods. They directly partition the graph into k parts by using priority queues [San93] or a global priority queue [KK98a] to control the number of swaps between nodes.


[Figure 2.8: Graph algorithm schemas — (a) Bubble: 1. select seed nodes, 2. grow partitions by BFS, 3. update the seeds, 4. repeat the BFS partitioning; (b) METIS: 1. coarsen the initial graph G into G′, 2. compute an initial partitioning of G′, 3. uncoarsen and refine the partitioned graph.]

• Tabu search: As the name indicates, these strategies use tabu search to explore the set of solutions (e.g., [GBF11]). Their computation is more expensive than the previous k-way swapping.

• K-Flows: this method, presented in [SS11], improves a given bipartition by growing the area around the boundary nodes/cut edges.

• Bubble: This method, presented in [DPSW00], is quite similar to k-means: it randomly selects k seed nodes that are evenly distributed over the graph, then uses BFS to grow a partition around each of the k seeds, after which the seeds are recalculated. The algorithm stops when no improved partition is found for more than 10 iterations or when the seed nodes do not change. A schema of this strategy is shown in Figure 2.8a.

• Random walks and diffusion: these approaches start at a node and randomly choose, among its neighbors, another node to visit, using a probabilistic transition matrix. The idea is to detect dense graph regions, based on the intuition that if a region is dense, it is quite difficult to leave it after several steps. This intuition is described in [Sch07].

• Multi-level: The reference implementation of this partitioning approach (METIS [KK98a]) is, until today, one of the most successful and most widely applied graph partitioning techniques. The technique splits the graph partitioning process into three stages. In the first phase, it coarsens the original graph G to form a condensed representative graph of G. Then it applies a partitioning algorithm (k-way swapping, for instance) to the compressed graph. Finally, the coarsened graph is projected back to its initial shape and the partitions are refined. A schema of this procedure is shown in Figure 2.8b, and a small usage sketch follows.
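In practice the multi-level scheme is rarely re-implemented from scratch; METIS is invoked through bindings such as pymetis. The following sketch assumes pymetis is installed; the graph and the expected output are illustrative.

import pymetis

# Adjacency lists of a small undirected graph with two natural clusters.
adjacency = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]

# Ask METIS for 2 parts; it returns the edge cut and a part id per vertex.
n_cuts, membership = pymetis.part_graph(2, adjacency=adjacency)
print(n_cuts)      # expected: 1 (the single edge between the clusters)
print(membership)  # e.g., [0, 0, 0, 1, 1, 1]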


Table 2.1: Dynamic graph partitioning approaches

System            | Main technique                      | Replication | Changing workload | Changing graph
ParMETIS [SKK97]  | Diffusion schemes                   |             |                   | ✓
xDGP [VCLM13]     | Greedy vertex migration             |             |                   | ✓
Hermes [NKDC15]   | Neighboring vertices                |             |                   | ✓
Leopard [HA16]    | Neighboring vertices                | ✓           |                   | ✓
Inc [FLT+20]      | Incrementalization                  |             |                   | ✓
Fennel [TGRV14]   | Stream graphs; neighboring vertices |             |                   | ✓
Spinner [MLLS17]  | Pregel label propagation            |             |                   | ✓
Sedge [YYZK12]    | Two-level partition                 | ✓           | ✓                 |
[TÖ15]            | Graph access patterns               |             | ✓                 |
TAPER [FM17]      | Stability; random walks             |             | ✓                 |
IOGP [DZC17]      | Hashing; edge-cut and split         |             |                   | ✓

For current large-scale graphs, the previous methods are unable to produce partitions in a reasonable time. Some efforts were made towards parallelizing some of them, for example the distributed version ParMETIS [KK98b]. There exist other partitioning strategies that are highly efficient in terms of loading times and memory costs; however, they do not consider the edge-cut and load balance constraints. These approaches are mainly based on hashing, like the 2D-hash strategy of the GraphX framework [GXD+14]. The approach named Ja-Be-Ja was introduced in [RPG+15], claiming to outperform METIS for large graphs. The algorithm uses local search and simulated annealing to partition the graph using an edge-cut or vertex-cut strategy (both illustrated in Figure 2.6). The algorithm does not need to know the entire graph to generate the partitions, since it is based on operations at the vertex level. Each node has information about its neighbors and is initially assigned a color (partition). At each step, a node's color can change according to the most prevalent color in its neighborhood. Another scalable graph partitioning algorithm, named Spinner, was presented in [MLLS17]. It is based on the Label Propagation Algorithm and runs on top of the large-scale graph platform Apache Giraph [Fou19] (described in Section 2.2.6). If k partitions should be formed, it starts by assigning each vertex to one of the k partitions randomly. Then, it iterates a modified k-way balanced label propagation algorithm, propagating the vertices' labels to their neighbors until it converges. Whenever data are added to or removed from the graph, Spinner reiterates to find a converging assignment. Approaches of this type, considering changing workloads or variations of the original graph, are named dynamic algorithms and are described in the following section.

Dynamic algorithms Current graph-based applications deal with data which is not only large, but also frequently updated. Additionally, the users' needs may change, leaving the initial workload out of date. Repartitioning the graph is complex and expensive, hence partitioning the graph from scratch is not always an option. There have been several works on the dynamic partitioning of graphs. The following list is not exhaustive; more details on other dynamic algorithms are given in [DCL18]. All of the works presented in this section are summarized in Table 2.1. The creators of METIS introduced in [SKK97] a repartitioning technique for changing graphs based on diffusion. The works presented previously on streaming techniques are also dynamic approaches; in these works the changes occur in the graph and not in the workload.


[Figure 2.9: Partitioning in Sedge — from left to right: the initial partitioning, the complementary partitioning, and the secondary (replica-based) partitioning.]

The work of Tüfekçi and Özturan [TÖ15] introduces a framework for changing workloads based on access patterns. The proposed system partitions a graph database and provides a fully functional distributed graph database system.

The dynamic replicated algorithm Sedge (Self Evolving Distributed Graph Management Environment), proposed in [YYZK12], creates new partitions or replicas to cope with variations in the workload. It proposes a two-level graph partitioning architecture with complementary primary partitions and dynamic secondary partitions. The starting point is an initial partitioning; then, based on the query workload, it creates secondary partitions (as replicas), reducing the original cross-partition edges. The secondary replicas are either full replicas of the initial partitions or replicas of the cross-partition hotspots; they are either created as complements to the initial partitioning or created on demand. Sedge is built to work especially within the in-memory computation model of Pregel. An illustration of the partitioning strategies of Sedge is shown in Figure 2.9.

In [VCLM13] the authors introduced xDGP, a system that dynamically repartitions massive graphs to adapt to structural changes of the graph, inside a system based on Pregel. The system adopts an iterative vertex migration algorithm relying on local information only, making complex coordination unnecessary. It starts by hash partitioning the graph and defining a capacity constraint to avoid skewness. It then uses a greedy vertex migration strategy that buffers the nodes to be moved based on information locally available to the vertex, with the goal of reducing the global edge cut.

Another approach considering evolving graphs is Hermes [NKDC15], which is built on top of the Neo4j platform. This approach dynamically repartitions the existing data partitions on the fly across multiple servers. Its algorithm is based on a number-of-neighbors measure to decide whether a vertex should be moved. The approach named Leopard [HA16] is a lightweight dynamic graph partitioning approach that simultaneously considers a replication mechanism to reduce the edge cut. The intuition behind the method is to assign a vertex to the partition containing most of its neighbors, penalizing partitions when they are too large. It continuously revisits vertices and edges that have already been assigned, to keep vertices close to their neighbors.

In [FM17], Firth and Missier present TAPER, a graph partitioning system that is sensitive to evolving workloads. Their system takes any initial partitioning as input and iterates to adjust the partitioning taking the workload into account. For this, it uses the notion of stability used in community detection: when a partition is stable, a random walk inside that partition should remain in the same partition for a long time.

In IOGP [DZC17], the authors presented a partitioning algorithm optimized for transactional workloads. Its methodology is divided into three stages: quiet, vertex reasoning, and edge splitting.

In the quiet stage, the vertices are placed using a hash function. The vertex reasoning phase checks, as more vertices are inserted, how connected they are to other vertices, reassigning them if necessary. The last stage, edge splitting, splits the edges of high-degree vertices across multiple servers to enable parallelism. More recently, [FLT+20] proposed to incrementalize current batch partitioners instead of developing yet another dynamic graph partitioning method. In their work, they formalized the incremental graph partitioning problem, adding an objective to the ones presented at the beginning of this section: minimizing the changes, considering that a full repartition is expensive. They managed to incrementalize some edge-cut and streaming partitioners, and their experiments verified the effectiveness of the approach.

2.2.6 Graph databases

In the previous sections we characterized the following dimensions of the graph data model: i) logical data structure, ii) data storage, iii) query and manipulation languages, iv) query processing strategies, and v) graph partitioning. In this section we use these dimensions to classify some of the most popular graph stores. Our list is not exhaustive; we considered only the most relevant approaches to show the evolution of the available graph-oriented technologies. Systems built to manage RDF data will be treated in detail in Section 2.3, hence we do not describe them in this section to avoid redundancy. We classify the approaches into two major groups: native and non-native. The first group consists of systems specifically designed to store and query data represented as graphs. In contrast, non-native approaches are built on top of existing platforms like relational databases, key-value stores, and other cloud systems. In this section we describe the most relevant systems of both categories by platform (centralized or parallel). A summary of all the presented systems is given in Table 2.2.

2.2.6.1 Centralized systems

These systems perform the graph computation on a single machine. The early graph-based systems described in the introduction of this section fall into this category (e.g., [Güt94]). We also find here the leading graph database system that made the property graph data model popular, Neo4j [Web12]. Neo4j is a Java-based system in which property graphs are represented through edge lists. Each edge has an entry composed of two linked lists, one for the source vertex s and another for the destination vertex d, storing the previous and next edges connected to s and d, respectively. The system allows querying the data directly with the Cypher language or via REST, Java, JavaScript, or Python APIs. Before Neo4j, a large number of RDF systems had already been developed. The system DEX, introduced in [MMG+07], is an in-memory graph database allowing queries over multiple sources. It was written in C++ (although an API is provided to facilitate access to the data) and efficiently processes graph operations using bitmaps. Another in-memory system, HypergraphDB [Ior10], proposed to model the data as hypergraphs. Its structures are stored as adjacency lists in the BerkeleyDB key-value store [OBS99]. Queries are defined as traversal operations using iterator APIs. More recently, the system GraphChi-DB [KG14] proposed an efficient data structure, basically an immutable flat array named Parallel Adjacency Lists, to manage and analyze graphs with billions of edges on a single machine. The data are stored in a CSR structure which is partitioned into P intervals such that a single partition Pi fits into main memory. Each partition is split by destination vertex IDs and ordered by source vertex ID. This strategy guarantees an efficient lookup of vertices when querying with the Cypher or SPARQL query languages.


Table 2.2: Graph databases mentioned in Section 2.2.6

System             | T  | Graph struct. | Storage   | Part.      | Mech. | Query lang.    | Proc.

Centralized
DEX [MMG+07]       | N  | DLG           | Bit index | ✗          | Im    | API            | BFS
HypergrDB [Ior10]  | N  | Hyp           | Adj. list | ✗          | Im    | API            | BFS, DFS
Neo4j [Web12]      | N  | PG            | Edge list | ✗          | D     | Cypher, API    | BFS
GraphChi [KG14]    | N  | DLG           | CSR       | ✗          | D     | Cypher, SPARQL | BFS

Distributed
Trinity [SWL13]    | Nn | PG            | Adj. list | Hash       | Im    | API            | K-v
TAO [BAC+13]       | Nn | PG            | Edge list | Flex       | D     | API            | Rel
Titan [Dat15]      | Nn | PG            | Adj. list | Hash       | D     | Gremlin        | K-v
GraphX [XGFS13]    | Nn | PG            | CSR       | Vertex-cut | Im    | API            | Spark
Pregel [MAB+10]    | N  | DLG           | Adj. list | Hash       | Im    | API            | BSP
Giraph [Fou19]     | N  | DLG           | Adj. list | Hash       | D     | API            | BSP, Hadoop
GraphLab [LGK+12]  | N  | DLG           | Adj. list | Vertex-cut | Im    | API            | GAS

D: Disk, DLG: Directed labeled graph, Hyp: Hypergraph, Im: In memory, K-v: key-value store, N: Native, Nn: Non-native, Part.: Partitioning strategy, Proc.: Processing strategy, Rel: Relational.

2.2.6.2 Parallel systems

The systems in this category can handle extremely large graphs with billions of edges on a cluster of machines by partitioning the graph so that each machine stores a part of the graph to process it in parallel.

Non-native The systems in this category are built on top of cloud platforms, which avoids building systems from scratch and simplifies storage and processing. In these systems, the graph is stored using the specific data structures and operators of the cloud system. Microsoft Trinity [SWL13], for instance, is built on top of an existing shared-memory key-value store. Trinity represents the graph in adjacency lists that are fully processed in memory, achieving low latency when solving queries. Similarly, Titan [Dat15] is built on top of HBase and Cassandra, a key-value store and a wide-column system respectively. The system TAO [BAC+13] was introduced to treat read-only queries on Facebook's social graph. It is built on top of a distributed NoSQL database, adding an intermediary layer that accesses the data stored in MySQL through caches. GraphX [XGFS13] is a graph computing library built on top of Apache Spark3. It stores vertices and edges

3https://spark.apache.org/

in two RDDs, which are Spark's distributed collections. Graph operations are transformations applied to the vertex and edge RDDs.

Native Most of the systems in this category are based on the Bulk Synchronous Parallel (BSP) programming model, which uses a message passing interface (MPI). A BSP computation consists of a series of supersteps that are synchronized when moving from one step to another (across machines). Google researchers pioneered the use of the BSP model in an in-memory graph system named Pregel [MAB+10]. The programming model became known as vertex-centric since it is based on message propagation from vertex to vertex. At runtime, each vertex receives messages from and sends messages to its neighbors, updating its current state and the state of its edges. The source code of Pregel was never made public; however, many systems followed the execution model described in detail in the Pregel paper. Systems following this BSP model are called synchronous. Apache Giraph [Fou19] is an open-source Java implementation of Pregel. It stores the data as vertex adjacency lists in the Hadoop Distributed File System (HDFS) and runs the operations as map jobs. Several approaches have been built on top of Giraph; one extension is Giraph++ [TBC+13], which proposes a graph-centric paradigm in which messages propagate faster within close neighborhoods of the graph. An experimental survey of these systems is found in [HDA+14]. Asynchronous systems claim to be faster than the previous approaches but are much more complex [CEK+15]. GraphLab [LGK+12], a system developed in C++ for shared-memory architectures, belongs to this category. Computation in this system is done through shared state instead of message passing: a vertex reads/updates its own data or that of its neighbors directly instead of using messages.
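To illustrate the vertex-centric BSP model, here is a toy single-machine Python sketch of Pregel-style supersteps computing single-source shortest paths. The driver loop, the graph and the message lists are simplified stand-ins, not Pregel's or Giraph's actual API.

import math

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}
dist = {v: (0.0 if v == "a" else math.inf) for v in graph}   # vertex state
inbox = {v: [] for v in graph}
active = {"a"}                          # only the source is active at first

superstep = 0
while active:
    outbox = {v: [] for v in graph}
    for v in active:
        # compute(): fold the received messages into the local state; if the
        # distance improved, propagate the new value to the out-neighbors
        new = min([dist[v]] + inbox[v])
        if new < dist[v] or superstep == 0:
            dist[v] = new
            for u, w in graph[v]:
                outbox[u].append(new + w)
    # synchronization barrier: messages become visible in the next superstep
    inbox = outbox
    active = {v for v, msgs in inbox.items() if msgs}
    superstep += 1

print(dist)  # {'a': 0.0, 'b': 1.0, 'c': 2.0}

A vertex with no pending messages "votes to halt" by dropping out of the active set; the computation ends when no messages are in flight.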

Commercial graph databases The commercial offer of systems able to store and query graphs has grown considerably in recent years. The most popular graph database system is Neo4j4, which offers graph processing compliant with ACID transactions. It offers reasonable performance for Cypher and traversal queries in a centralized environment. Distributed solutions comprise systems like DGraph5, introducing a JSON-based query interface (GraphQL). Other distributed graph systems like Virtuoso6 and Stardog7 support RDF queries in SPARQL. Besides, several systems are part of multi-model databases built on top of recent cloud platforms, for example: CosmosDB8 (Microsoft Azure), Amazon Neptune9 (AWS), OrientDB10 (SAP) and ArangoDB11. So far we have not discussed the systems specifically designed to store and query Web data standardized as a graph. In the following section we give an overview of this standard and detail the systems conceived to treat this data model specifically.

4https://neo4j.com/ 5https://dgraph.io/ 6https://virtuoso.openlinksw.com/ 7https://www.stardog.com 8https://azure.microsoft.com/en-us/services/cosmos-db/ 9https://aws.amazon.com/neptune/ 10https://orientdb.com/ 11https://www.arangodb.com/


[Figure 2.10 pairs a drawing of an RDF graph (a) with its serialization as an RDF document (b) in N-Triples:
<:Airplane1> <:has_model> <:A340>.
<:Airbus> <:has_seat> <:France>.
<:Airbus> <:office_in> "Toulouse".
<:Airplane1> <:has_type> <:Airliner>.
<:Airplane20> <:has_length> "70.1".
<:Airplane20> <:manufacturer> <:Boeing>.
<:Airplane1> <:has_length> "75.4".
<:Airplane1> <:nb_motors> "4".
<:France> <:has_city> "Toulouse".
<:Airplane1> <:manufacturer> <:Airbus>.
<:Boeing> <:has_seat> <:USA>.
...]

(a) RDF graph (b) RDF document

Figure 2.10: RDF example

2.3 Resource Description Framework

The Resource Description Framework (RDF) [RC14] has been widely accepted as the standard data model for data interchange on the Web. RDF is flexible enough to facilitate the integration of data with different schemas. The model uses triples consisting of a subject, a predicate and an object (s, p, o) as the core abstract data structure to represent information. A set of such triples forms a directed labeled graph (described in Section 2.2.1) called an RDF graph. The nodes represent IRIs (international resource identifiers), blank nodes (unspecified resources) or literals. Edges, on the other hand, characterize a resource by linking subject and object nodes through a predicate. An RDF graph is shown in Figure 2.10a. We formalize the previous concepts in Section 2.3.1, then we present the storage models in Section 2.3.2. The storage models described are used in both centralized and distributed RDF systems. Then, in Section 2.3.3 we classify RDF systems by processing strategy. Finally, in Section 2.3.4 we describe some of the distributed systems that scale to massive RDF datasets, highlighting the partitioning strategies applied by these systems. The purpose of this entire section is to give an overview of RDF, focusing on the storage and partitioning techniques. Our goal is not to describe in detail each of the current RDF processing systems, since there are several surveys dedicated to this (e.g., [KM15, Özs16, AHKK17, WHCS18, ASYNN20]). In this section, we mention only some of the most relevant systems.

2.3.1 Background

An RDF statement is a triple representing a relationship between a subject and an object. It is defined as follows:

Definition 2.4 (RDF Triple) [KM15] A triple (s, p, o) is a tuple from (U ∪ B) × U × (U ∪ L ∪ B) where s is known as the subject, o the object and p is a relationship between s and o known as the predicate. U, B, L are the sets of IRIs, blank nodes and literals respectively.

As previously mentioned, IRIs are the identifiers of Web resources, which can be reused in other datasets to represent the same resource and avoid redundancy. We will not delve further into their construction; more details are given in [W3C08]. Blank nodes represent anonymous resources in the subject or object position of a triple. Their syntax is not regulated by the RDF standard, and their identification varies across datasets. Literals are values like strings, numbers and dates. A literal is composed of a lexical form, a datatype IRI and a language tag, even though some syntaxes may support simple literals with just the lexical form.
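As a concrete illustration of triples combining IRIs and typed literals, the following short Python sketch uses the rdflib library; the example.org IRIs mirror the running example of Figure 2.10 and are illustrative only.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org#")
g = Graph()

# a triple (subject, predicate, object): IRI x IRI x (IRI | literal)
g.add((EX.Airplane1, EX.has_model, EX.A340))
g.add((EX.Airplane1, EX.has_length, Literal("75.4", datatype=XSD.decimal)))

for s, p, o in g:   # iterate over the triples of the graph
    print(s, p, o)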


[Figure 2.11 depicts an RDFS layer above the RDF data: the classes :Vehicule and :PopulatedPlace, connected by subClassOf edges to :Airliner, :Airplane and :Country, with type edges linking the RDF-layer instances :Airplane1, :Airbus and :France (related by manufacturer and has seat) to those classes.]

Figure 2.11: RDF-Schema associated to graph of Figure 2.10a

The IRIs of the nodes of the RDF graph shown in Figure 2.10a were simplified by a ":" prefix standing for the complete IRI. Also, the datatypes were omitted in the literal nodes. For example, the triple (:Airplane1, has_length, "75.4") is actually: (http://example.org#Airplane1, http://example.org#has_length, "75.4"^^xs:decimal). A set of RDF triples forms an RDF graph, defined in [Özs16] as follows:

Definition 2.5 (RDF Graph) [Özs16] An RDF graph is a six-tuple G = ⟨V, LV, fV, E, LE, fE⟩ such that:

• V and LV are the set of vertices and their labels respectively. V corresponds to all subjects and objects in the data.

• E and LE are the collection of directed edges connecting subjects and objects, and their corresponding labels respectively. The label of an edge corresponds to the property connecting a subject and an object in the data.

• fV : V → LV and fE : E → LE are bijective functions assigning vertices and edges their respective labels.

An RDF graph is encoded in a document using concrete syntaxes such as N-Triples, Turtle, TriG and JSON-LD. The most basic syntax is N-Triples, representing one triple per line as shown in Figure 2.10b. The RDF data model was extended with annotations like RDF Schema (RDFS) [DB14] and OWL [Gro12]. The RDFS language, inspired by object-oriented programming models, allows the schema of RDF data to be defined more precisely. It provides a vocabulary to describe resources with classes, class hierarchies and properties, adding semantics that enable reasoning over the data. Resources are divided into groups called classes; the members of a class are known as the class's instances. Classes are themselves resources represented as triples, which can be seen as a graph. The principal built-in class definitions are rdfs:Class, rdfs:subClassOf, rdfs:domain and rdfs:range. An example of RDFS annotations for the data of Figure 2.10a is given in Figure 2.11. If a more complete schema is needed, ontology languages offering more expressiveness like OWL [Gro12] can be a good solution. The standard language to query RDF is SPARQL [SH13b], which supports querying required and optional graph patterns, using conjunctions, disjunctions, subqueries and aggregations. An example of a SPARQL query and its representation as a query graph is given in Section 2.2.3.2. SPARQL is based on graph pattern matching. Basic Graph Pattern (BGP) queries are the most commonly used. A BGP query is a set of triple patterns (s, p, o) of the form (U ∪ B ∪ V) × (U ∪ V) × (U ∪ L ∪ B ∪ V) where V is a set of variables. A sequence of triple patterns with optional filters comprises a BGP query; its syntax (not considering FILTER constraints) is expressed in [KM15] as:

SELECT ?v1 ... ?vm WHERE {t1 ... tn}


where {t1, ..., tn} is a set of triple patterns and ?v1...?vm is the set of variables occurring in {t1...tn}.

An example SPARQL query for the RDF graph of Figure 2.10a finding the airplanes whose length is greater than 65.0 meters is specified as follows:

SELECT ?a ?m ?l WHERE {
  ?a :type :Airliner .
  ?a :has_length ?l .
  ?a :has_model ?m .
  FILTER (?l > 65.0)
}

In this query, the first three lines in the WHERE clause are triple patterns containing the variables ?a, ?l and ?m. The FILTER clause restricts the solutions matching the triple patterns based on numeric values or strings (with regular expressions). The formal definition of a SPARQL query match is given in Definition 2.6.

Definition 2.6 (SPARQL match) [Özs16] Considering an RDF graph G and a SPARQL query Q having {v1, ..., vn} distinct subjects and objects, the set {u1, ..., un} ∈ G is called a match of Q if there exists a bijective function µ : G(U ∪ B ∪ L) → Q(B ∪ V) where vi = µ(ui) for all 1 ≤ i ≤ n, and for every vi the following conditions are fulfilled:

• Literals/IRIs: if vi is a literal or an IRI → ui has the same value in G.

• Variables: if vi is a variable vertex → ui must satisfy the filter constraint (if any) over the parameter vi.

• Edges: all edges and edge labels in Q must have a corresponding mapping in G: ∀(vi, vj) ∈ Q → (ui, uj) ∈ G. Also, the edge label must satisfy the filter constraint, if any.

SPARQL supports more complex operators (e.g., OPTIONAL, ASK, UNION, ORDER BY, LIMIT) that are not considered in this chapter but whose details are found in [SH13b]. We focus our work on the core Basic Graph Patterns supported by most of the state-of-the-art systems described in the next section.
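To make the match semantics of Definition 2.6 concrete, here is a naive Python backtracking matcher for BGP queries over a set of triples. It is a didactic sketch (terms starting with "?" are variables; FILTER constraints are omitted), not how production SPARQL engines evaluate queries.

G = {(":Airplane1", ":type", ":Airliner"),
     (":Airplane1", ":has_length", "75.4"),
     (":Airplane1", ":has_model", ":A340")}

bgp = [("?a", ":type", ":Airliner"),
       ("?a", ":has_length", "?l")]

def is_var(term):
    return term.startswith("?")

def match(patterns, binding):
    if not patterns:
        yield dict(binding)          # every pattern matched: emit the solution
        return
    (s, p, o), rest = patterns[0], patterns[1:]
    for (ts, tp, to) in G:           # try each data triple against the pattern
        trial = dict(binding)
        ok = True
        for term, value in ((s, ts), (p, tp), (o, to)):
            if is_var(term):
                if trial.get(term, value) != value:
                    ok = False       # variable already bound to another value
                    break
                trial[term] = value
            elif term != value:
                ok = False           # constant does not match the triple
                break
        if ok:
            yield from match(rest, trial)

print(list(match(bgp, {})))  # [{'?a': ':Airplane1', '?l': '75.4'}]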

2.3.2 Storage models

In contrast to relational databases, in which the storage layout is easily represented in a table, in RDF the storage structure depends on the system and on the strategy used to further process the data. Let us recall the most popular relational database storage layouts: i) row-wise or traditional layout (N-ary storage model, NSM), ii) columnar or decomposition storage model (DSM) [CK85], and iii) partitioning attributes across (PAX) [ADH02]. All of the previous layouts derive directly from the core data structure of the relational model, the table. They store the table's data row by row (NSM), column by column (DSM), or first row by row and finally partitioned by attributes (PAX). The storage model of early RDF systems is based on the relational model. Such approaches rely on relational database techniques to efficiently solve SPARQL queries translated to SQL. These storage approaches are described in Section 2.3.2.1. More recently, some systems represent the data directly as a graph, without using an intermediate representation like a table. These approaches are detailed in Section 2.3.2.2.

2.3.2.1 Relational-based

This storage model stores triples (s, p, o) in relational database tables using one of the following strategies:


• Triple table: this approach, initially embraced by Sesame [BKvH02], stores the data in a single table of three columns (subject, predicate, object). An example of this storage is illustrated in Figure 2.12a. In terms of space complexity it equals the size of the graph, and without indexes, lookups require scanning the whole table. This is in fact its major drawback: the processing of self-joins becomes quite expensive when SPARQL queries grow more complex. To improve performance, triple tables are lexicographically ordered, and several optimization strategies have been proposed, for example:

– Indexing: establishing a lexicographical order of the table is possible on single columns. If the query involves a lookup on one of the unordered columns, it may require scanning the entire table. The naive strategy to overcome this problem creates a primary index on the other columns despite the replication. A cheaper option in terms of space are secondary indexes and single-column indexes adding a column with pointers to the triple table. The popular RDF system Virtuoso [EM09], built on top of a relational database, stores the data in a table of 4 columns (graph G, subject S, predicate P, object O). It is composed of a main index GSPO and a secondary OPGS bitmap index. Another popular indexing strategy stores the permutations of the single table (SPO, OPS, PSO, SOP, POS, OSP). This technique is strongly inspired by the indexing strategies proposed in the seventies in [Lum70, Shn77]. It avoids a full scan of the table for any single query pattern. Also, the index can be built on only a subset of the columns (e.g., SP, PO). This strategy, named projection indexes, allows queries to be solved using only merge joins. The most popular system using a combination of both strategies (exhaustive and projected indexes) is RDF-3X [NW08], which indexes all the permutations of the table's attributes and some of its projections. More recently, RDFox [NPM+15] uses only three index columns and three secondary projection indexes (a small sketch of permutation indexes is given after this list).

– Dictionaries & encoding: to reduce the size of a triple table, which can become very large due to redundancy, several compression techniques are proposed to improve the reading efficiency. Using dictionary tables mapping each individual value to a dense domain of positive integers reduces considerably the lookup time, reducing it to a logarithmic search instead of a full scan. Deeper encoding techniques based on byte encoding reduce even more the space of the encoded data storing only the increment (delta) of one row to another. For example, in the triple table of Figure 2.12a, the subject of the first row will be encoded, while a 0 byte will be assigned to encode the subject of the second row since it is the same as the first one. The byte delta encoding technique is embraced by the RDF-3X [NW08] system.

• Binary table: this strategy creates a two-column table (subject, object) per predicate. It is illustrated in Figure 2.12b. The strategy is applied by SW-Store [AMMH09] and HadoopRDF [DWNY12], for example. It suffers from join overheads when a query involves several predicates.

• Property table: this approach, also known as a pivoted table, aims to reduce the number of self-joins by storing the data in a wider table whose dimensions correspond to the number of distinct subjects and predicates. The drawbacks of this approach are the great number of null values in the table causing storage overheads, the complexity of treating multi-valued properties, and its fixed nature, which does not support the schema-flexible nature of RDF (e.g., adding a triple with a new property is complex). An excerpt of a property table is shown in Figure 2.12c. This strategy is applied by the Jena2 [Wil06] system. Many strategies have been proposed to alleviate some of the problems described previously, among which we find:


[Figure 2.12 illustrates the relational-based layouts for the graph of Figure 2.10a: (a) a triple table (Subject, Predicate, Object) with rows such as (Airplane1, has model, A340), (Airplane1, has length, 75.4), (Airplane1, nb motors, 4) and (Airplane1, type, Airliner); (b) binary tables, one two-column (Subject, Object) table per predicate, e.g. nb motors and nb version; (c) a property table with one column per predicate (has model, has length, nb motors, type, nb version, ...) and null values for absent properties.]

(a) Triple table (b) Binary tables (c) Property table

Figure 2.12: Relational-based storage of the graph of Figure 2.10a

– Emerging schemas: this strategy reduces the number of null values in the table by partitioning it. Each partition gathers the data sharing a similar schema. The idea is that although RDF data do not define an explicit schema, their schema is implicit in the data and can be found with diverse approaches. An example of a property table partitioned using emerging schemas is shown in Figure 2.13a. Some of the most relevant approaches to find emergent schemas are: ∗ The clustering strategy for RDBMSs described by Chu et al., named hidden schemas in [CBN07], can be very useful to find emerging schemas. In their work, the authors mapped the table's attributes to a weighted graph and then used the k-nearest neighbor partitioning algorithm to create the sub-relations, along with the Jaccard similarity to measure the strength of co-occurrence between two attributes. ∗ The term emergent schema for RDF was introduced in [PPEB15]. In this work, the authors used the characteristic sets cs defined in [NM11] and formally described in Definition 2.7 to cluster triples and create the implicit schema tables. The characteristic sets of the entire table are merged based on semantic and structural scores. The structural score identifies discriminating properties in each cs using an adapted TF/IDF similarity score. The distributed RDF system EAGRE [ZCTW13] also used characteristic sets to distinguish entities and store the data of similar entities together. ∗ More recently, the approach Cinderella [HVL14] proposed a strategy to partition incoming data on the fly based on their schema similarity to an existing partitioning, which can be obtained with either of the previously described methods. The system fixes a maximum size for partitions, and new entities are assigned to the closest partition. – Hashing: this technique, embraced by DB2 for RDF [BDK+13], maps the predicate and the object of a triple to a column group using a hashing function. The number of column groups is given as a parameter; collisions are stored in new rows (named spill rows) adding a flag column, as shown in Figure 2.13b. A reverse primary hash can be applied by hashing the values to groups from the triple's object. – Bit encoding: the system BitMat [ACZH10] proposes a bit-wise matrix representation of property tables. In this matrix, each cell is either 0 or 1, representing the absence or presence of a triple. An example of this strategy is shown in Figure 2.13c. During query processing, queries are executed directly on the compressed data.
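The following Python sketch combines the two triple-table optimizations referenced earlier in this list: dictionary encoding of terms, and sorted permutations of the table (SPO, POS, OSP) so that any single triple pattern can be answered with a range scan instead of a full scan. It is a simplified illustration on toy data, not RDF-3X's implementation.

import bisect

triples = [(":Airplane1", ":has_model", ":A340"),
           (":Airplane1", ":nb_motors", "4"),
           (":Airplane2", ":has_model", ":B747")]

# dictionary encoding: every distinct term gets a dense integer id
dictionary, ids = {}, []
def encode(term):
    if term not in dictionary:
        dictionary[term] = len(dictionary)
        ids.append(term)
    return dictionary[term]

enc = [tuple(encode(t) for t in triple) for triple in triples]

# one sorted copy per permutation; e.g. POS answers (?s, p, o) patterns
spo = sorted(enc)
pos = sorted((p, o, s) for s, p, o in enc)
osp = sorted((o, s, p) for s, p, o in enc)

def subjects_of(p, o):
    # all subjects s such that (s, p, o) is in the data, via the POS index
    key = (dictionary[p], dictionary[o])
    i = bisect.bisect_left(pos, key + (-1,))
    out = []
    while i < len(pos) and pos[i][:2] == key:
        out.append(ids[pos[i][2]])
        i += 1
    return out

print(subjects_of(":has_model", ":A340"))  # [':Airplane1']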


[Figure 2.13 illustrates the property table optimizations for the data of Figure 2.10a: (a) emerging schemas, splitting the property table into one table per implicit schema (e.g., airplanes with has model, has length, nb motors, type; manufacturers with has seat, office in); (b) primary hash, mapping predicate-object pairs to a fixed number of column groups, with collisions stored in flagged spill rows; (c) bit encoding, where each cell is a bit sequence marking the presence of a triple; each bit sequence represents the object sequence (A340, 75.4, 4, Airliner, B747, 70.1, 5).]

(a) Emerging schemas (b) Primary hash (c) Bit encoding

Figure 2.13: Property table optimizations of Figure 2.10a

Definition 2.7 (Characteristic set) [NM11] Each subject s in an RDF graph G has a characteristic set defined as cs(s) = {p | ∃o : (s, p, o) ∈ G}.
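A direct reading of Definition 2.7 is easy to turn into code. The Python sketch below computes the characteristic set of every subject in one pass and groups the subjects sharing a set, which is the essence of the emerging-schema approaches above (toy data, illustrative only).

from collections import defaultdict

triples = [(":Airplane1", ":has_model", ":A340"),
           (":Airplane1", ":has_length", "75.4"),
           (":Airplane2", ":has_model", ":B747"),
           (":Airplane2", ":has_length", "70.1"),
           (":Airbus", ":has_seat", ":France")]

cs = defaultdict(set)
for s, p, _ in triples:
    cs[s].add(p)                     # cs(s) = set of predicates emitted by s

# subjects sharing a characteristic set belong to the same implicit entity
by_cs = defaultdict(list)
for s, props in cs.items():
    by_cs[frozenset(props)].append(s)

for props, subjects in by_cs.items():
    print(sorted(props), "->", subjects)
# [':has_length', ':has_model'] -> [':Airplane1', ':Airplane2']
# [':has_seat'] -> [':Airbus']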

2.3.2.2 Graph-based

This storage strategy maintains the graph structure of RDF, representing the data using the adjacency structures for directed graphs defined in Section 2.2.2 of this chapter. At query runtime, these approaches exploit the graph nature of RDF and SPARQL, reducing query execution to a subgraph matching problem. These structures allow efficient lookups of the neighboring vertices of a given vertex. We will not detail each of the structures again to avoid redundancy. However, we describe some of the most relevant optimization strategies used by graph-based systems:

• Encoding: similarly to triple tables, data in adjacency structures are compressed and grouped to streamline the lookup process. Along this line, ordering the vertices lexicographically should be complemented with other ordering strategies that reflect data locality (close neighbors should be placed close together in the data structure). The adjacency lists can be encoded with dictionaries as in the relational-based approaches, using bit compression deltas to encode neighbors [BV04] or more complex tree-based structures (K2-trees [BLN09]). gStore, described below, is an example of a system applying such encodings.

• Indexing: the simplest strategy indexes the keys of the adjacency lists in a B+Tree if the vertices are lexicographically sorted. More complex secondary structures address the problem of finding all vertices that are reachable from a given vertex, which is useful for long path and distance queries. Among these strategies we find the 2-Hop indexing scheme [CHKZ03] and the tree-based index FERRARI [SABW13].

The first system that considered RDF directly as a graph was presented in [BHS03]. This system was built on top of an object-oriented database. Later, the system BitMap [ASH08] used an adjacency matrix: it encodes the triples of an RDF graph completely in a bit matrix that is flattened in two dimensions.


Both of the optimization strategies described previously are applied in the graph-based RDF processing system gStore [ZÖC+14]. This system encodes the entities in adjacency lists, grouping a vertex's label and its adjacent edge labels into a fixed-length bit string. The encoded data are stored in a VS∗-tree index (a height-balanced tree where each node is a bit string corresponding to an encoded vertex). An incoming SPARQL query is also encoded using a similar structure, and the query problem is reduced to a sub-graph matching problem. Systems built on top of key-value stores also belong to this category. Trinity.RDF [ZYW+13] is built on top of the in-memory key-value store Trinity. For each node it stores the adjacency lists of outgoing and incoming edges, supporting query evaluation seen as a graph exploration process. Similarly, the commercial graph system Stardog12 supports SPARQL over data stored in the RocksDB13 key-value store. The major issue confronted by these approaches is scalability to large RDF graphs [AHKK17, Özs16].
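The Trinity.RDF-style layout just described can be sketched with a plain Python dictionary standing in for the key-value store: every node keys two adjacency lists, one for outgoing and one for incoming edges, supporting graph exploration in both directions (illustrative only, not Trinity.RDF's actual structures).

from collections import defaultdict

kv = defaultdict(lambda: {"out": [], "in": []})

def load(s, p, o):
    kv[s]["out"].append((p, o))   # outgoing edge of the subject
    kv[o]["in"].append((p, s))    # incoming edge of the object

load(":Airplane1", ":manufacturer", ":Airbus")
load(":Airbus", ":has_seat", ":France")

# one graph-exploration step: follow :Airplane1's manufacturer edge, then
# inspect the manufacturer's own outgoing edges
for p, o in kv[":Airplane1"]["out"]:
    if p == ":manufacturer":
        print(o, "->", kv[o]["out"])   # :Airbus -> [(':has_seat', ':France')]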

2.3.3 Processing strategies

Processing SPARQL queries is a complex task for which a wide variety of approaches are available. The earliest systems were built on top of relational databases, storing triples in tables as described in the previous section and directly translating SPARQL to SQL. Later, systems built specifically to treat RDF data came to light, implementing their own physical storage and operators. More recently, parallel RDF systems follow a similar strategy, building on top of cloud platforms and transforming SPARQL queries into the equivalent operators of the platform (e.g., MapReduce jobs). In general, RDF systems can be divided into two large groups, native and non-native, whose characteristics are given below:

• Native systems are specifically built to treat RDF data (e.g., RDF-3X [NW08], gStore [ZÖC+14], TriAD [GSMT14]). These systems use custom physical layouts, native indexing strategies, efficient communication protocols and explicit replication. They may use relational-based storage layouts, but their execution core is independent from a relational database. They certainly require a greater development effort, yet they have proven to outperform the non-native strategies [AHKK17].

• Non-native systems are built on top of existing technologies like relational databases (e.g., Virtuoso [EM09]), key-value stores (e.g., Trinity.RDF [ZYW+13]), cloud platforms like MapReduce (e.g., EAGRE [ZCTW13], SHARD [RS11]) and Spark14 (e.g., S2RDF [SPSL16]), among others. As previously mentioned, query processing in these systems consists of transforming a SPARQL query into the known operators of the parent system. For instance, SPARQL is translated to SQL or to MapReduce jobs if the system is built on top of a relational database or a Hadoop15 distribution respectively (a sketch of such a translation is given after this list).
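As an illustration of the SPARQL-to-SQL translation mentioned above, here is a minimal Python sketch for a triple table T(s, p, o): each triple pattern becomes one alias of T, shared variables become join conditions, and constants become filters. This is a simplified illustration; real systems add dictionary encoding, indexes and a query optimizer.

def bgp_to_sql(patterns):
    froms, conds, seen = [], [], {}   # seen: variable -> first column using it
    for i, (s, p, o) in enumerate(patterns):
        froms.append(f"T t{i}")
        for col, term in (("s", s), ("p", p), ("o", o)):
            ref = f"t{i}.{col}"
            if term.startswith("?"):
                if term in seen:
                    conds.append(f"{ref} = {seen[term]}")   # join on shared var
                else:
                    seen[term] = ref
            else:
                conds.append(f"{ref} = '{term}'")           # constant filter
    cols = ", ".join(f"{ref} AS {v[1:]}" for v, ref in seen.items())
    return f"SELECT {cols} FROM {', '.join(froms)} WHERE {' AND '.join(conds)}"

print(bgp_to_sql([("?a", ":type", ":Airliner"), ("?a", ":has_length", "?l")]))
# SELECT t0.s AS a, t1.o AS l FROM T t0, T t1
# WHERE t0.p = ':type' AND t0.o = ':Airliner' AND t1.s = t0.s
#   AND t1.p = ':has_length'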

Complete surveys on both types of systems exist. For example, [KM15] surveys the systems built on top of different cloud platforms. An experimental comparison of both types of systems was performed in [AHKK17]. More general surveys, in which the systems are further classified according to their storage, indexing strategy and query execution, are [Özs16, WHCS18, ASYNN20].

12https://www.stardog.com/ 13https://rocksdb.org/ 14https://spark.apache.org/ 15https://hadoop.apache.org/


[Figure 2.14 depicts the partitioning pipeline: a raw RDF file goes through Fragmentation (unit: triple; triple groups by subject, object or predicate; or characteristic set), an optional Fusion stage (graph partitioning, workload-based or semantic-based, possibly combining fragments), Allocation (hash, round-robin or cloud-platform dependent) and an optional Replication stage (permutation, combine fragments, k-hop guarantee or cloud platform).]

Figure 2.14: Partitioning Process in RDF Systems

2.3.4 Data partitioning

To cope with the increase in available RDF data and the need to process it efficiently, RDF systems resorted to the distribution and parallel processing techniques extensively explored in relational databases. Within distributed RDF systems we find two groups, federated and clustered systems, which are described as follows:

• Federated systems run SPARQL queries over multiple SPARQL endpoints. A SPARQL endpoint is a protocol service which allows querying data stored in RDF using the SPARQL language. Such systems require on-the-fly data integration of multiple heterogeneous RDF sources. Federated systems are out of the scope of our study.

• Clustered RDF systems distribute the data among different data nodes that are part of a single RDF storage solution. Over the years, a large number of distributed systems offering efficient RDF data processing have come to light. We focus our study on this type of system. In this section we classify these works according to the partitioning strategy applied to distribute the data. We present a scheme summarizing the process of data partitioning in RDF processing systems regardless of the execution model. We divide the process illustrated in Figure 2.14 into three phases: Fragmentation, Allocation and Replication. The dashed squares in the figure represent the optional phases of the process. We use this common framework, inspired by the strategy used for distributed databases described in [ÖV11], to guide our classification. In what follows we detail the inputs, outputs and algorithms of each of these stages.

2.3.4.1 Fragmentation

The fragmentation process consists of setting the fragmentation unit to be allocated over the sites of the distributed system. There is no consensus regarding how RDF data should be distributed. Most current solutions use the triple as the fragmentation unit. However, it has been claimed that grouping the triples first might improve performance by discovering implicit structures in the input RDF graph [BDK+13, SGK+08]. Consequently, other solutions propose to group the triples by subject, predicate or object first and use these groups as fragmentation units. Other solutions go further and group the sets of triples (grouped by subject or object) into entities identified, for instance, with characteristic sets [PPEB15, NM11]. Regardless of how the data are physically represented by each system (e.g., triple table, property table), and of the execution engine (e.g., relational-based, subgraph matching), the fragmentation units shown in Figure 2.14 cover all the approaches of our state of the art.


Fusion This optional stage consists of creating broader groups of triples to be allocated over the sites of the distributed system. The fragmentation units are merged by applying one of the following techniques:

• Graph partitioning: the fragmentation units are mapped to a graph in order to apply graph partitioning heuristics clustering them. The graph is built by mapping the fragmentation units, like triples (in H-RDF-3X [HAR11]) or groups of triples (entities in EAGRE [ZCTW13]), to nodes of the graph. The edges (which can be weighted) represent the connectivity between the fragmentation units. A graph partitioning heuristic, most commonly METIS [KK98a], is then used to generate the groups of fragmentation units.

• Workload-based: in these techniques the fragmentation units are grouped according to a given query workload. Triples that are queried together are gathered in the same fragments to optimize the query execution.

• Semantic-based: this approach gathers fragmentation units based on the semantic similar- ity (using ontologies for instance) of the concepts represented in the graph.

The fusion process is not applied in all the proposed systems and it is therefore an optional step (bounded by a dashed rectangle in Figure 2.14). Many systems choose to apply allocation strategies directly to the fragmentation units (e.g., triples, groups of triples by subject).

2.3.4.2 Allocation

The allocation stage consists in finding the distribution of a set of fragments {F1, ..., Fn} (which are the output of the previous stage) to a set of sites {S1, ..., Sm}. The distribution of fragments is done applying the following techniques:

• Hashing: a hashing function is applied to the fragmentation unit or the fragment's attributes (e.g., gStoreD [PZÖ+16]).

• Round-robin: the fragments from the last stage are placed on the sites in turn, without distinction. This is the case for the fragments obtained after a graph partitioning fusion.

• Cloud-platform dependent: in this case the allocation task is left to the cloud platform (e.g., Hadoop, Spark) on which the distributed RDF system is built. Most of these platforms use distributed file systems, partitioning the data into blocks and finally sharding them to the worker nodes.
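A minimal Python sketch of the hash-based allocation above: each fragment, keyed for instance by its subject, is assigned to site hash(key) mod m. CRC32 is used only to keep the example deterministic across runs; it is not prescribed by any particular system.

import zlib

def allocate(fragments, num_sites):
    # fragments: mapping from a fragment key (e.g., a subject) to its triples
    sites = [[] for _ in range(num_sites)]
    for key, triples in fragments.items():
        sites[zlib.crc32(key.encode()) % num_sites].extend(triples)
    return sites

fragments = {":Airplane1": [(":Airplane1", ":has_model", ":A340")],
             ":Airbus":    [(":Airbus", ":has_seat", ":France")]}
for i, site in enumerate(allocate(fragments, 2)):
    print(f"site {i}: {site}")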

2.3.4.3 Replication

Data replication is a frequent, yet not mandatory, optimization strategy applied by several clustered RDF distributed systems. The replication strategies are as follows:

• Permutation: in this strategy, copies with different sort orders are maintained in systems representing the data as a single triple table. In general the copies are not exact replicas of the data, but encoded and compressed indexed versions. For example, RDF-3X [NW10] keeps 9 indexes (e.g., subject-predicate-object SPO, object-predicate-subject OPS) corresponding to the extended permutations of the triple table's attributes (subject, predicate, object).

• Combine fragments: this strategy consists of gathering triples into different fragmentation units (e.g., grouping by subject, then grouping by object) and creating a copy of the data for each organization. This fragmentation strategy is embraced by the system RDF QDAG [KMG+20].


• K-hop guarantee: this strategy is used specifically in systems whose fragmentation units are merged with a graph partitioning strategy. For any vertex v assigned to a worker node f, all vertices up to k hops away and their corresponding edges are replicated in f. Systems applying this strategy are H-RDF-3X [HAR11] and SHAPE [LL13] (a BFS sketch of this rule is given after this list).

• Cloud platform: this replication strategy takes place in systems built on top of a cloud platform; in this case the replication rules are established by the cloud platform. For example, the systems using the Hadoop Distributed File System (HDFS) replicate with a factor of 3 by default. Systems using HDFS as their storage layer are Hadoop, Apache Spark and key-value stores like HBase16. Distributed RDF systems in this category are HadoopRDF [DWNY12], S2RDF [SPSL16] and H2RDF [PKT+13], just to name a few.
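The k-hop guarantee referenced in the list above amounts to a bounded BFS from the vertices owned by a worker. The Python sketch below collects the replication set for one worker (illustrative; not H-RDF-3X's or SHAPE's actual code).

from collections import deque

def k_hop_closure(adj, frontier, k):
    # adj: vertex -> neighbors; frontier: vertices owned by the worker
    seen, queue = set(frontier), deque((v, 0) for v in frontier)
    while queue:
        v, depth = queue.popleft()
        if depth == k:
            continue                      # reached the replication horizon
        for u in adj.get(v, []):
            if u not in seen:
                seen.add(u)
                queue.append((u, depth + 1))
    return seen

adj = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(k_hop_closure(adj, {"a"}, 2))   # {'a', 'b', 'c'}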

Some efforts have also been made to build a system to benchmark partitioning strategies. The system Koral [JST17], for example, allows the integration of different RDF graph partitioning techniques to investigate their behavior.

In Table 2.3 we classify the systems of the state of the art based on the partitioning framework introduced previously. The table includes approaches built on top of cloud platforms as well as systems built specifically to deal with RDF data.

16https://hbase.apache.org/

[Table 2.3: State of the art systems. The table (landscape in the original document) classifies each system — Apache Rya [PCR15], AdPart [AAK+16], Clique-Square [GKM+15], DiploCloud [WC16], DREAM [HRN+15], EAGRE [ZCTW13], gStoreD [PZÖ+16], H-RDF-3X [HAR11], H2RDF [PKT+13], HadoopRDF [DWNY12], JenaHbase [KKTC12], Partout [GHS14], PigSparql [SPL11], PRoST [CFL18], S2RDF [SPSL16], S2X [SPBL15], Sedge [YYZK12], Sempala [SPNL14], SHAPE [LL13], SHARD [RS11], SPARQLGX [GJGL16], TriAD [GSMT14], Trinity.RDF [ZYW+13], WARP [HS13] — by type, fragmentation unit (T, or group by S, P, O or characteristic set), fusion (GP, W, SM), allocation (H, RR, CP) and replication (Perm, Comb, K-Hop, CP). Legend: T: Triple, GP: Graph partitioning, W: Workload-based, SM: Semantic-based, RR: round-robin, H: Hash, CP: Cloud Platform; Native: systems built specifically for RDF data; On top: systems built on top of cloud platforms.]


2.4 Conclusion

In this chapter, we first gave a general overview of the graph data model. We detailed its logical and storage structures, described the query processing strategies and languages, and reported some of the most relevant currently available systems. In addition, we dedicated a section to the graph partitioning algorithms used by some parallel systems. This section allowed us to grasp the number of solutions available within systems merely treating data as graphs. We saw the advantages of the model but also its limits, especially due to the variability of the methods used to partition and process the data. Next, to better delimit our work, we focused on one of the many existing graph-based data models: RDF, the data model for data interchange on the Web. We started by giving some background concepts and, similarly to what we did with graphs in general, we described RDF storage, processing and partitioning strategies. In RDF, partitioning depends on the physical organization of the data (i.e., stored in a triple table, property table or graph structure). We showed that RDF partitioning strategies are tied to a specific system and that, contrary to relational databases, there is no common logical layer covering all existing partitioning variations. We say that RDF data partitioning is system-driven, in contrast to partitioning in relational databases, which we call data-driven. In relational databases, there are tools (e.g., advisors, definition languages) giving the database designer the comfort of creating partitions according to the structure of the tables, or a partitioning scheme that meets the demands of some workload. This is not the case in RDF: the data are loaded into a system which, in general, automatically partitions them with a predefined strategy. The process is similar to what happens in the cloud platforms described in the last chapter, in which the data are split into blocks and hash-distributed among the working nodes.

Table 2.4: Main strengths of relational and RDF partitioning

Relational: − Integration of data not following the same initial schema is complex.
RDF: + Flexible integration of new data.

Relational: + DBMS-independent: partitioning is independent of the database management system.
RDF: − System-dependent: the execution system and the partitioning strategy are coupled.

Relational: + Partitioning at a logical level: independent of the physical model used to store the data (e.g., NSM, DSM).
RDF: − Dependent on the physical storage model (e.g., triple-based, graph-based).

Relational: + Native database definition languages support declaring horizontal partitions.
RDF: − Partitioning performed transparently during the loading stage.

Relational: + Several automatic partitioning advisors are available to help the designer create partitions.
RDF: − Data partitioning is mostly imposed by the triple store.

The partitioning strategies adopted by RDF systems also have several advantages. For example, most approaches allow for simple data integration. Indeed, one of RDF's assets is its flexibility to integrate data with different underlying schemas. The relational model, on the other hand, is often too rigid to handle data without an upfront schema. This rigidity, however, is what makes it possible to tailor partitioning advisor tools that help the designer create partitions based on the data structure and workload needs. Some approaches tried to model RDF data in a relational database schema before storing the data in an RDBMS. However, as demonstrated

by surveys, relational-based systems are not optimal when dealing with highly complex queries. The strengths and drawbacks (mostly mutually exclusive) of both data models are summarized in Table 2.4. We consider it necessary to provide RDF system designers with tools mimicking the comfort offered in relational database systems. This led us to set the objective of this thesis: proposing an RDF data partitioning framework based on implicit logical structures. The development of this framework is covered throughout the next chapters.

Part II

Contributions


Chapter 3

Logical RDF Partitioning

Contents

3.1 Introduction ...... 97

3.2 RDF partitioning design process ...... 97

3.3 Graph fragments ...... 99

3.3.1 Grouping the graph by instances ...... 99

3.3.2 Grouping the graph by attributes ...... 104

3.4 From logical fragments to physical structures ...... 107

3.5 Allocation problem ...... 109

3.5.1 Problem definition ...... 109

3.5.2 Graph partitioning heuristic ...... 111

3.6 RDF partitioning example ...... 112

3.7 Dealing with large fragments ...... 116

3.8 Conclusion ...... 117


Summary

Based on the analysis of the literature presented in the first part of this thesis, we propose a partitioning environment for the design of opaque systems. We draw on the lessons learned in the relational model, which partitions the data based on logical structures. In this chapter, we give the formal definition and detail the algorithms to create these logical entities, which we named graph fragments (Gf). These entities harmonize with the notion of partitions by instances (horizontal) and by attributes (vertical) in the relational model. We start by giving an overview of this logical layer in Section 3.2, comparing our strategy to the current state of the art. Then, we formally define the logical entities in Section 3.3. This section includes the algorithms to group RDF triples into graph fragments. Next, in Section 3.4 we discuss the usage of these fragments as physical storage structures. Section 3.5 formalizes the allocation problem of these fragments in opaque systems. Section 3.6 presents a complete example of the creation and allocation algorithms described in the previous sections. Then, Section 3.7 presents strategies to deal with large logical fragments. Finally, Section 3.8 concludes this chapter.


3.1 Introduction

The Resource Description Framework (RDF) has been widely accepted as the standard data model for data interchange on the Web. Formerly, RDF was intended to model exclusively information on the Web, but its flexibility allowed the standard to be used in other domains (e.g., genetics, biology). Currently, collections of RDF triples are extensive sources of information. Many RDF datasets exceed a billion triples (e.g., DBPedia: 3 billion triples1, Bio2RDF: 5 billion triples2), pushing the development of scalable processing systems. To this end, several systems rely on parallel and distributed architectures, partitioning the data on distinct processing workers. Most of the distributed and parallel RDF systems (RDFDS3) partition the data using triples as distribution units and apply, for example, hashing functions sharding the triples at their finest granularity. These strategies do not keep the graph structure of the model, or triples belonging to the same entity, together, impairing the system's performance. Furthermore, most systems are dependent on a single partitioning strategy. We draw on the lessons learned in the relational model, in which data partitioning is performed at a logical level, independent of the storage approach. In the following we introduce a logical layer to the partitioning process of opaque RDFDS. We start by giving an overview of this logical layer, comparing our strategy to the current state of the art. Then, we formally define the logical entities and propose the algorithms to group RDF triples into graph fragments. Next, we discuss the usage of these fragments as physical storage structures and formalize their allocation problem. Finally, we present a complete example of the creation and allocation heuristics defined in this chapter.

3.2 RDF partitioning design process

The design of a distributed system entails the distribution of data and programs to the sites of the computer network in which the system is deployed. We focus on the problem of data distribution, seeking to find and allocate the optimal distribution units. In distributed relational databases, the design process starts similarly to centralized systems, with a requirements analysis specifying the data and processing needs. The requirements are the input of the conceptual design stage that maps requirements to entities, attributes and relationships. The result of this stage is the global conceptual schema, which is in turn the input of the distribution design step. This step determines the distribution units and their location. Entire tables could be used as distribution units; however, it is preferable to use table subsets named partitions or fragments. To make this process more understandable, the distribution design is subdivided in [ÖV11] into two substeps. The first one, named fragmentation, determines the distribution units; the allocation step then places them on the sites of the network. The previous process is summarized in Figure 3.1a. The conceptual design phase explicitly declaring the high-level entities, as in relational databases, does not occur in RDF systems. Instead, the data are organized in a schema-free model using triples whose global structure is not explicitly stated. This characteristic gives RDF more flexibility to integrate data with different schemas, but it pays the price during the physical design [AÖD14]. Furthermore, the data related to the same high-level entity can easily be scattered through the batch of data. When it comes to the distribution of RDF data, there is no consensus about how the data should be fragmented and allocated. Currently, the partitioning strategies are highly dependent on the backend system used to store the data. For example, the systems built on top of cloud platforms delegate the partitioning choice to the distributed file system, which usually hashes the data by chunks. Specialized systems use triples as distribution units and usually apply hashing functions to decide each triple's location.

1https://wiki.dbpedia.org/about/facts-figures 2https://www.lod-cloud.net/dataset/bio2rdf-pubmed 3From this point we use the abbreviation RDFDS to denote both, parallel and distributed RDF systems


[Figure 3.1: Partitioning design process — (a) Relational databases: requirements analysis → conceptual design → partitioning, where partitioning happens in the logical layer; (b) RDF systems: raw RDF file → schema deduction → fragments → partitioning, our logical layer.]

There have been some efforts to explicitly add a class hierarchy schema to RDF through annotations (e.g., RDF Schema and ontologies). However, as shown in [PPEB15], i) the entities in a single dataset can be described with multiple ontologies, ii) not all the entities in a dataset are annotated with the same metadata and, iii) not all SPARQL query patterns consider them. Using these annotations to identify implicit entities is therefore not very effective. The identification of higher-level entities in RDF has also been researched in the context of query optimization, specifically to collect statistics to estimate the cardinality of triple patterns. Instead of collecting statistics on the entire dataset, approaches like the characteristic set [NM11] collect statistics on higher-level entities. In [NM11], these entities are detected using the structure of the original dataset, grouping the triples by subject first and further gathering the data according to their predicates. This strategy does not rely on annotations (e.g., RDFS) to identify entities, and it allows considering some of the correlations missed by other optimization approaches based on global statistics. This technique has been reused to derive a relational schema from the data, as we will see below. Grouping the triples into higher-level entities has also been explored in relational-based RDF systems. The goal of these approaches is to build a relational schema and map the triples to multiple relational tables. For example, the hidden schemas [CBN07] introduced in DB2RDF inferred an RDF schema to vertically partition a property table via attribute clustering. The emergent schemas approach in [PPEB15] combines the characteristic set with other clustering techniques to identify implicit structures within the data. More recently, [HVL14] introduced Cinderella to incrementally partition bulk data inserts, assigning a schema with a similarity measure between attributes. The techniques described previously could be generalized in a framework allowing the discovery of implicit entities used by optimization strategies (like indexes) or as data fragments. Our work introduces a logical dimension to the partitioning process of RDF graphs. We

draw on the lessons learned in the relational model, in which data partitioning is performed at a logical level, independently of the storage approach used to persist the data. This wise strategy has proven its success thanks to its great balance between conceptual and intuitive simplicity [OB19]. The proposed process is illustrated in Figure 3.1b. We borrow some of the techniques in the literature used to identify higher-level entities in RDF datasets and define logical RDF fragments. Then, based on these fragments, we propose allocation strategies considering the initial connectivity of the data, keeping the original graph structure and avoiding data skewness.

3.3 Graph fragments

Many studies have shown that grouping triples contributes to improving the performance of centralized (e.g., [CBN07, PPEB15]) and distributed (e.g., [AHKK17]) RDF systems. In this section we propose to cluster the data in groups inspired by the partitions of the relational model. In relational databases, an entity (a table) is partitioned at the instance (horizontal) or attribute (vertical) level. Indeed, horizontal and vertical partitions allow retrieving only subsets of a relation, avoiding full table scans. Considering that in RDF there is no conceptual design stage as in relational databases, our proposal seeks to detect implicit entities based on the structure of the RDF graph. Such entities represent the implicit schema of an RDF graph. In fact, the characteristic set (defined in Definition 2.7) of a node in an RDF graph allows grouping the instances of the same high-level entity to form a partition. In the following section we define two kinds of entities, following the idea of fragmenting the entities at the instance or attribute levels, but with an actual graph connotation.

3.3.1 Grouping the graph by instances

To gather the triples of the same high-level entity, we first group the triples by their subjects. These structures, defined in Definition 3.1, are named forward graph stars. Let us first define the functions fs(t) → s, fp(t) → p and fo(t) → o returning the subject, predicate and object of an RDF triple t = ⟨s, p, o⟩ respectively. These functions are used in some of the definitions of this chapter.

Definition 3.1 (Forward graph star) A forward graph star, denoted Gs(s), is a subset of G in which the triples share the same subject s. Formally, Gs(s) ⊆ G, such that Gs(s) = {t | ∀i≠j (fs(ti) = fs(tj) = s)}. We name the subject s the head of the forward star.

A forward graph star gathers all the properties associated with a node describing a single occurrence of a certain entity. As shown in Theorem 3.1, a triple can only belong to a single forward graph star. The entities can be uniquely identified by a subset of their emitting edges [NM11]. To characterize them we use the notion of characteristic set defined in Definition 2.7.

Theorem 3.1 A triple belongs to one and only one forward graph star. Let t be a triple, t ∈ G; if t ∈ Gsi, then ∀j≠i (t ∉ Gsj).

Proof. By contradiction, let us assume that a triple t ∈ Gsi and t ∈ Gsj. Let us consider two distinct triples t1, t2, both different from t, such that t1 ∈ Gsi and t2 ∈ Gsj. Based on the forward star definition, fs(t1) = fs(t) and fs(t2) = fs(t). Using the transitivity of equality, fs(t1) = fs(t2), which is impossible if Gsi ≠ Gsj.

A forward graph fragment, formally described in Definition 3.2, gathers forward graph stars according to their characteristic sets. As previously mentioned, we can characterize an entity by its emitting edges. A forward graph fragment is a logical structure used to gather the data related to the same high-level entity.


[Figure 3.2: Forward graph fragment example — (a) Graph: stars with heads s1, s2, s3, each emitting predicates p1 and p2 towards objects o1..o6, forming the forward entity {p1, p2}; (b) Table: rows (s1, o1, o2), (s2, o3, o4), (s3, o5, o6); (c) Adjacency list: { s1: {p1:o1, p2:o2}, s2: {p1:o3, p2:o4}, s3: {p1:o5, p2:o6} }.]

As shown in Figure 3.2, this structure is independent of the storage layer of the system. A forward graph fragment can be used only as a logical structure to distribute the data, or to physically store the data in this format as shown in Figures 3.2b and 3.2c. In general, two forward graph stars belong to the same entity if the labels of their emitting edges (i.e., the predicates) are the same. Yet, sometimes two instances belong to the same entity even though their characteristic sets are not exactly the same. This is the case when two characteristic sets differ only in non-discriminating predicates. To avoid creating a great number of fragments with a very strict similarity criterion, we use a similarity score and a threshold. The similarity scores are described in Section 3.3.1.1. The characteristic set of a graph fragment is therefore the union of the characteristic sets of its graph stars. To simplify the notation, the function cs(Gf) returns the characteristic set (which is the identifier) of the fragment.

Definition 3.2 (Forward graph fragment) A forward graph fragment Gf is a set of forward graph stars Gs that have similar characteristic sets according to a threshold τ. A forward graph star Gsi belongs to one and only one forward graph fragment Gfi. Formally, Gf = {Gs(s) | ∀ Gs(si) ≠ Gs(sj) : Sim(cs(si), cs(sj)) ≥ τ}.

3.3.1.1 Similarity scores

The similarity function Sim(cs(si), cs(sj)) returns a score measuring how similar two characteristic sets are. Its calculation can consider just the structure of the characteristic sets and their relationships, or more complex semantic features. Among the structural similarity functions we find:

• Supersets: this similarity strategy merges two characteristic sets if there is a larger set that contains all the properties of the smaller one. It is calculated as follows:

Sim(cs(si), cs(sj)) = |cs(si) ∩ cs(sj)| / Min(|cs(si)|, |cs(sj)|)

To ensure that only full subsets are merged with supersets, the threshold τ is fixed to 1.

• Jaccard similarity: this measure evaluates how many predicates both sets have in common. Contrary to the previous function, it is not restricted to the superset case.

Sim(cs(si), cs(sj)) = |cs(si) ∩ cs(sj)| / Max(|cs(si)|, |cs(sj)|)

• Tf-idf [PPEB15]: this criterion, used in Information Retrieval, expresses the frequency of a term taking into account how often the term is used in the corpus. The measure is


interesting for comparing characteristic sets since it can be adapted to weight the predicates according to their discriminatory power, giving less weight to predicates shared by almost all the characteristic sets of the dataset. The tf-idf measure of a predicate in a characteristic set is calculated as follows:

tf idf(p, cs) = (1/|p|) · log(#TotalCSs / (1 + #countCSs(p)))

where the tf factor is always 1, since every property occurs once per characteristic set, #TotalCSs is the total number of characteristic sets, and #countCSs(p) is the total number of characteristic sets having property p in their property list. The measure can be normalized to take into account the length of the characteristic set:

tf idf(p, cs) = (0.5 + 0.5 · (1/|cs|)) · log(#TotalCSs / (1 + #countCSs(p)))

Finally, the tf-idf values for each predicate are used to calculate the cosine similarity:

Sim(csi, csj) = Σ p∈(csi ∩ csj) tf idf(p, csi) × tf idf(p, csj) / ( √(Σ pi∈csi tf idf(pi, csi)²) × √(Σ pj∈csj tf idf(pj, csj)²) )

On the other hand, if the similarity is based on semantic measures there are two alternatives:

• Same label: this strategy merges two characteristic sets if their semantic label (identified by the predicate rdfs:label) is the same. This similarity is binary: if both characteristic sets have the same label the similarity is 1, and 0 otherwise.

• Ancestor-based: this strategy checks the hierarchy of classes in the ontology and assigns a similarity of 1 if both sets share a common ancestor. However, the similarity score weighs whether the common ancestor is too general in the ontology. Given a common ancestor, the similarity is calculated as follows:

$$Sim(cs_i, cs_j) = \log\left(\frac{\#TotalInstO}{\#InstC}\right)$$

where $\#TotalInstO$ is the total number of instances covered by the ontology and $\#InstC$ is the total number of instances covered by the ancestor.

Both of the previous similarities are calculated only if a common ontology exists between both characteristic sets; otherwise the structural similarity is preferred.

To assign a forward graph star $\overrightarrow{Gs}(s)$ to a forward graph fragment, a pairwise similarity is calculated between the characteristic set of the forward star and those of the fragments already added to $\overrightarrow{Gf}$. The graph star is assigned to the fragment with the highest similarity only if its similarity score is higher than the given threshold $\tau$. Let us consider for example a forward graph star $\overrightarrow{Gs}(s_i)$ and the set of forward fragments $\overrightarrow{Gf} = \{\overrightarrow{Gf}_j, \overrightarrow{Gf}_k\}$. When calculating the similarity one can encounter the following cases:

(i) If $Sim(cs(s_i), cs(\overrightarrow{Gf}_j)) > Sim(cs(s_i), cs(\overrightarrow{Gf}_k)) > \tau$, then $\overrightarrow{Gs}_i \in \overrightarrow{Gf}_j$.

(ii) If both scores have the same value, $Sim(cs(s_i), cs(\overrightarrow{Gf}_j)) = Sim(cs(s_i), cs(\overrightarrow{Gf}_k)) > \tau$, then $\overrightarrow{Gs}_i$ is assigned to either of the fragments $\overrightarrow{Gf}_j$ or $\overrightarrow{Gf}_k$.

(iii) If the forward graph star does not have a similarity score greater than the threshold with any of the graph fragments, formally expressed as $\forall_j \; Sim(cs(s_i), cs(\overrightarrow{Gf}_j)) < \tau$, then the forward star forms a forward graph fragment on its own (a new fragment $\overrightarrow{Gf} = \overrightarrow{Gs}_i$ is created).
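To make these scores concrete, the following Python sketch computes the structural similarities between two characteristic sets represented as sets of predicate labels. The helper names and the toy data are illustrative only; they are not part of the thesis implementation.

import math

def superset_sim(cs_i, cs_j):
    # |cs_i ∩ cs_j| / min(|cs_i|, |cs_j|): equals 1 exactly when one set contains the other
    return len(cs_i & cs_j) / min(len(cs_i), len(cs_j))

def jaccard_sim(cs_i, cs_j):
    # |cs_i ∩ cs_j| / max(|cs_i|, |cs_j|), as defined above
    return len(cs_i & cs_j) / max(len(cs_i), len(cs_j))

def tfidf(p, cs, total_css, count_css):
    # normalized variant: (0.5 + 0.5/|cs|) * log(#TotalCSs / (1 + #countCSs(p)))
    return (0.5 + 0.5 / len(cs)) * math.log(total_css / (1 + count_css[p]))

def tfidf_cosine_sim(cs_i, cs_j, total_css, count_css):
    num = sum(tfidf(p, cs_i, total_css, count_css) * tfidf(p, cs_j, total_css, count_css)
              for p in cs_i & cs_j)
    den_i = math.sqrt(sum(tfidf(p, cs_i, total_css, count_css) ** 2 for p in cs_i))
    den_j = math.sqrt(sum(tfidf(p, cs_j, total_css, count_css) ** 2 for p in cs_j))
    return num / (den_i * den_j)

# Two characteristic sets differing only in one (non-discriminating) predicate:
cs1 = {"has_model", "has_length", "nb_motors", "type"}
cs2 = {"has_model", "has_length", "nb_motors", "type", "manufacturer"}
print(superset_sim(cs1, cs2))   # 1.0, so the superset strategy with tau = 1 merges them
print(jaccard_sim(cs1, cs2))    # 0.8, merged only if tau <= 0.8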


3.3.1.2 Forward fragmentation algorithm

The generation of forward graph fragments is described in Algorithm 1. The inputs of the algorithm are an RDF file in SPO order and a similarity threshold $\tau$. We consider an ordered file since many triple stores employ bulk loading techniques that efficiently encode and organize the data. The algorithm starts by initializing an empty graph star, a variable storing the subject of the first triple, and an empty map structure mapping characteristic sets to graph fragments. Then it makes a single pass through the triples of the dataset (Step 2). The loop grows a graph star until it detects a change in the subject of the current triple (Step 3). At this point, it generates a key by combining, with a hashing function, the set of IDs of the predicates that co-occur in the graph star (Step 6). If the key is already in the map of characteristic sets, it assigns the star directly to the corresponding fragment and appends it to the set of graph fragments with the function appendGf (Step 9). If there is no entry in the map for the star's key, it uses the function updateGf, which applies the similarity functions to decide to which fragment the star should be appended (Step 11). The functions appendGf and updateGf are described in Algorithms 2 and 3 respectively. The algorithm returns the set of graph fragments, the map of characteristic sets and fragments, and a map assigning each subject to the corresponding graph fragment (Step 18).

Algorithm 1 Generation of forward graph fragments
Input: RDF file D of triples $\{t_1, ..., t_n\}$, $t = \langle s, p, o \rangle$, in SPO order; similarity threshold $\tau$
Output: Set of fragments $\overrightarrow{Gf} = \{\overrightarrow{Gf}_1, ..., \overrightarrow{Gf}_p\}$; characteristic set map CSM⟨key, $\overrightarrow{Gf}_k$⟩; subject map SM⟨$s_{i-1}$, $\overrightarrow{Gf}_k$⟩
1: Initialize $\overrightarrow{Gs} = \emptyset$, $s_{i-1} = f_s(t_1)$, CSM⟨key, $\overrightarrow{Gf}_k$⟩ = ∅, SM⟨$s_{i-1}$, $\overrightarrow{Gf}_k$⟩ = ∅
2: for each $t_i$ in D do
3:   if $s_{i-1} = f_s(t_i)$ then
4:     $\overrightarrow{Gs}(s_{i-1}) = \overrightarrow{Gs}(s_{i-1}) \cup \{t_i\}$
5:   else
6:     key = Hash($cs(s_{i-1})$)
7:     if key ∈ CSM.keySet() then
8:       $\overrightarrow{Gf}_k$ = CSM.get(key)
9:       $\overrightarrow{Gf}$ = appendGf($\overrightarrow{Gs}(s_{i-1})$, $\overrightarrow{Gf}_k$, $\overrightarrow{Gf}$)
10:    else
11:      ($\overrightarrow{Gf}$, CSM, $\overrightarrow{Gf}_k$) = updateGf($\overrightarrow{Gs}(s_{i-1})$, $\overrightarrow{Gf}$, $\tau$, CSM)
12:    end if
13:    $\overrightarrow{Gs}(s_{i-1}) = \{t_i\}$
14:    SM.put($s_{i-1}$, $\overrightarrow{Gf}_k$)
15:  end if
16:  $s_{i-1} = f_s(t_i)$
17: end for
18: return $\overrightarrow{Gf}$, CSM, SM
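A minimal Python sketch of this single pass, assuming the triples arrive already sorted by subject and stubbing the similarity-based assignment behind a callback, could look as follows (all names are illustrative):

def forward_fragments(triples, update_gf):
    # triples: iterable of (s, p, o) tuples in SPO order.
    # update_gf(star, key, fragments): similarity-based assignment (Alg. 3);
    # it must return the key of the fragment that received the star.
    fragments, subject_map = {}, {}
    star, prev_s = [], None

    def flush():
        key = frozenset(p for _, p, _ in star)        # stands in for Hash(cs(s))
        if key in fragments:                          # exact characteristic-set hit (Step 9)
            fragments[key].append(list(star))
            return key
        return update_gf(list(star), key, fragments)  # similarity path (Step 11)

    for s, p, o in triples:
        if prev_s is None or s == prev_s:
            star.append((s, p, o))                    # same subject: grow the star (Step 4)
        else:
            subject_map[prev_s] = flush()             # subject changed: close the star
            star = [(s, p, o)]
        prev_s = s
    if star:
        subject_map[prev_s] = flush()                 # close the last star
    return fragments, subject_map

# Simplest callback: every unmatched characteristic set founds a new fragment.
def new_fragment(star, key, fragments):
    fragments[key] = [star]
    return key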

The appendGf function (Alg. 2) takes a given graph star and appends it to the indicated graph fragment (Step 1). Finally, it updates the global set of graph fragments (Steps 2 - 3). The updateGf function (Alg. 3) uses the similarity functions described in Section 3.3.1.1 to decide whether to append the graph star to an existing fragment or to create a new one. The algorithm considers the case in which two similarity functions are prioritized: for example, a semantic similarity (which is binary) is applied first and, if it does not find any match, a structural similarity is applied. The algorithm makes a single pass through the already assigned graph fragments (Step 2). It uses the first similarity measure and, if it finds a graph fragment whose similarity is 1, it assigns the graph star to that fragment and stops the search (Steps 3 - 5). The second similarity measure is used in case the first measure did not find any match (Steps 7 - 9). If after going through all the assigned fragments it still does not find any match, it creates a new graph fragment (Steps 13 - 15). It returns the set of updated graph fragments, the updated characteristic set map, and the fragment assigned to the graph star (Step 20).

Algorithm 2 appendGf function
Input: A graph star $\overrightarrow{Gs}(s)$, a graph fragment $\overrightarrow{Gf}_k \in \overrightarrow{Gf}$, a set of graph fragments $\overrightarrow{Gf}$
Output: Updated set of graph fragments $\overrightarrow{Gf}$
1: $\overrightarrow{Gf}_k = \overrightarrow{Gf}_k \cup \overrightarrow{Gs}(s)$
2: $\overrightarrow{Gf} = \overrightarrow{Gf} - \overrightarrow{Gf}_k$
3: $\overrightarrow{Gf} = \overrightarrow{Gf} \cup \overrightarrow{Gf}_k$
4: return $\overrightarrow{Gf}$

Algorithm 3 updateGf function
Input: A graph star $\overrightarrow{Gs}(s)$, a set of graph fragments $\overrightarrow{Gf}$, a similarity threshold $\tau$, map of graph fragment keys CSM⟨key, $\overrightarrow{Gf}_k$⟩
Output: ($\overrightarrow{Gf}$, CSM⟨key, $\overrightarrow{Gf}_k$⟩, $\overrightarrow{Gf}_{tmp}$)
1: Initialize max = 0, $\overrightarrow{Gf}_{tmp} = \emptyset$, key = Hash($cs(s_i)$)
2: for each $\overrightarrow{Gf}_i$ in $\overrightarrow{Gf}$ do
3:   if $Sim_A(cs(s), cs(\overrightarrow{Gf}_i)) = 1$ then
4:     $\overrightarrow{Gf}_{tmp} = \overrightarrow{Gf}_i$
5:     break
6:   else
7:     if $Sim_B(cs(s), cs(\overrightarrow{Gf}_i)) > \tau$ AND $Sim_B(cs(s), cs(\overrightarrow{Gf}_i)) > $ max then
8:       max = $Sim_B(cs(s), cs(\overrightarrow{Gf}_i))$
9:       $\overrightarrow{Gf}_{tmp} = \overrightarrow{Gf}_i$
10:    end if
11:  end if
12: end for
13: if $\overrightarrow{Gf}_{tmp} = \emptyset$ then
14:   $\overrightarrow{Gf}_{tmp} = \overrightarrow{Gs}(s)$
15:   $\overrightarrow{Gf} = \overrightarrow{Gf} \cup \overrightarrow{Gf}_{tmp}$
16: else
17:   $\overrightarrow{Gf}$ = appendGf($\overrightarrow{Gs}(s)$, $\overrightarrow{Gf}_{tmp}$, $\overrightarrow{Gf}$)
18: end if
19: CSM.put(key, $\overrightarrow{Gf}_{tmp}$)
20: return ($\overrightarrow{Gf}$, CSM, $\overrightarrow{Gf}_{tmp}$)
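The two-tier logic of Algorithm 3 can be sketched in Python as follows. Fragments are keyed by their characteristic set; sim_a is a binary semantic score and sim_b a structural one; updating the fragment's characteristic set to the union after a merge is omitted for brevity. All names are illustrative:

def update_gf(star, key, fragments, sim_a, sim_b, tau):
    cs_star = frozenset(p for _, p, _ in star)
    best_key, best_score = None, 0.0
    for frag_cs in fragments:
        if sim_a(cs_star, frag_cs) == 1:          # semantic match wins immediately (Steps 3-5)
            best_key = frag_cs
            break
        score = sim_b(cs_star, frag_cs)
        if score > tau and score > best_score:    # best structural match above tau (Steps 7-9)
            best_key, best_score = frag_cs, score
    if best_key is None:                          # no match: new fragment (Steps 13-15)
        fragments[key] = [star]
        return key
    fragments[best_key].append(star)              # append to the chosen fragment (Step 17)
    return best_key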

3.3.1.3 Forward fragments as allocation units

In this section we prove that the set of forward graph fragments $\overrightarrow{Gf}$ partitions an RDF graph into a correct set of partitions. This is stated in Theorem 3.2.

Theorem 3.2 The set $\overrightarrow{Gf} = \{\overrightarrow{Gf}_1, ..., \overrightarrow{Gf}_l\}$ of all forward graph fragments for the graph G is a correct⁴ partition set of the graph G.

⁴ According to the correctness fragmentation rules in [ÖV11].


Proof. We show that the three correctness fragmentation rules mentioned in [ÖV11] are enforced.

• Completeness: if $t \in G$, then $\exists \overrightarrow{Gs}_i$ such that $t \in \overrightarrow{Gs}_i$, $\overrightarrow{Gs}_i \in \overrightarrow{Gf}_i$, and $\overrightarrow{Gf}_i \in \overrightarrow{Gf}$. By contradiction, if $\forall \overrightarrow{Gs}_i, \; t \notin \overrightarrow{Gs}_i$, then the subject $f_s(t)$ of triple t does not equal any of the subjects of the forward stars $\overrightarrow{Gs}$. Since the forward stars $\overrightarrow{Gs}$ are built by first grouping all triples of the graph G by subject, the subject of t is not equal to the subject of any triple in G; therefore $t \notin G$.

• Reconstruction: it is possible to define an operator $\nabla$ such that $G = \nabla \overrightarrow{Gf}_i, \; \forall \overrightarrow{Gf}_i \in \overrightarrow{Gf}$. For the forward fragments, the operator $\nabla$ is the union operator $\cup$; in other words, $\bigcup_{i=1}^{l} \overrightarrow{Gf}_i = G$. By contradiction, if $\exists t \in \overrightarrow{Gf}_i$ such that $t \notin G$, then the subject of a triple in $\overrightarrow{Gf}_i$ does not belong to G. This is impossible because, by definition, the forward stars are created by grouping all triples of the initial graph G by subject. This would be possible only if $t \notin \overrightarrow{Gf}_i$, and therefore the set of forward entities would not be complete.

• Disjointness: $\forall_{i \neq j} (\overrightarrow{Gf}_i \cap \overrightarrow{Gf}_j = \emptyset)$. By contradiction, if $\exists (\overrightarrow{Gf}_i \cap \overrightarrow{Gf}_j = \{t\})$, then $t \in \overrightarrow{Gs}_i$ and $t \in \overrightarrow{Gs}_j$ (with $\overrightarrow{Gs}_i \in \overrightarrow{Gf}_i$, $\overrightarrow{Gs}_j \in \overrightarrow{Gf}_j$), which is impossible unless the same triple had two different subjects.

3.3.2 Grouping the graph by attributes

The organization of an RDF graph in forward graph fragments identifies the sets of instances for implicit entities in the graph using characteristic sets. The fragments created with this strategy resemble the tuple groups generated when horizontally partitioning a relational database. As in the relational model, fragmenting a graph into forward fragments is useful for a certain type of queries. For example, if the workload is composed of star-shaped queries, this type of organization is ideal. However, this single fragmentation strategy does not optimize the whole query spectrum. Partitioning an entity by its attributes in the relational model, known as vertical partitioning, optimizes other types of queries that involve a small number of attributes. Several RDF processing systems use a similar strategy, where a fragment is created for each property (e.g., SW-Store [AMMH09], S2RDF [SPSL16]). In relational-based systems, the strategy stores triples in different tables per property. This strategy is efficient when solving queries with a few attributes but suffers overheads when several predicates are joined in one query. In this section we describe the creation of fragments by regrouping the nodes by their properties. Instead of strictly regrouping each property in a different fragment, we use the notion of characteristic sets to group the attributes affecting the same node. We name these structures backward graph fragments; their construction is very similar to that of the fragments described in the previous section. We start by defining a backward graph star, which groups a node and its incoming edges. This structure makes it possible to identify properties that affect the same node in order to later cluster them. It is formally defined in Definition 3.3. A triple belongs to a single backward graph star, as shown in Theorem 3.3.

Definition 3.3 (Backward graph star) A backward graph star denoted as $\overleftarrow{Gs}(o)$ is a subset of the original RDF graph G in which the triples share the same object o. Formally $\overleftarrow{Gs} \subseteq G$ such that $\overleftarrow{Gs} = \{t \mid \forall_{i \neq j} (f_o(t_i) = f_o(t_j))\}$. We name the object o the head of the backward star.

Theorem 3.3 A triple belongs to one and only one backward graph star. Let t be a triple, $t \in G$: if $t \in \overleftarrow{Gs}_i$, then $\forall_{i \neq j} (t \notin \overleftarrow{Gs}_j)$.


Figure 3.3: Backward graph fragment example. (a) Graph, (b) table, (c) adjacency list.

Proof. By contradiction. Let us assume that a triple $t \in \overleftarrow{Gs}_i$ and $t \in \overleftarrow{Gs}_j$. Let us consider two distinct triples $t_1, t_2$, both different from t, such that $t_1 \in \overleftarrow{Gs}_i$ and $t_2 \in \overleftarrow{Gs}_j$. Based on the backward graph star definition, $f_o(t_1) = f_o(t)$ and $f_o(t_2) = f_o(t)$. By the transitivity of equality, $f_o(t_1) = f_o(t_2)$, which is impossible if $\overleftarrow{Gs}_i \neq \overleftarrow{Gs}_j$.

The backward graph stars are grouped using characteristic sets. We extend in Definition 3.4 the definition given in Chapter 2 to consider the characteristic sets of objects. A backward graph fragment is therefore a group of backward graph stars whose heads (objects) share a similar characteristic set. This structure gathers the predicates that point to the same type of objects, creating broader groups than the ones obtained when splitting the dataset by predicates. We can retrieve the characteristic set of a backward fragment with the function $cs(\overleftarrow{Gf})$. The backward graph fragments are illustrated in Figure 3.3a. Similarly, they could be physically stored in tables (Figure 3.3b) or adjacency lists (Figure 3.3c). They are formally defined in Definition 3.5.

Definition 3.4 (Characteristic set extension) Each subject s and object o in an RDF graph G has a characteristic set defined as $\overrightarrow{cs}(s) = \{p \mid \exists o : (s, p, o) \in G\}$ and $\overleftarrow{cs}(o) = \{p \mid \exists s : (s, p, o) \in G\}$.

Definition 3.5 (Backward graph fragment) A backward graph fragment $\overleftarrow{Gf}$ is a set of backward graph stars $\overleftarrow{Gs}$ with similar characteristic sets according to a threshold $\tau$.
Formally, $\overleftarrow{Gf} = \{\overleftarrow{Gs}(o) \mid \forall_{\overleftarrow{Gs}(o_i) \neq \overleftarrow{Gs}(o_j)} \; Sim(cs(o_i), cs(o_j)) \geq \tau\}$.

The similarity functions used to group characteristic sets are the same as those described in Section 3.3.1.1, although structural functions are mainly used to form backward fragments.
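Both characteristic sets of Definition 3.4 can be collected in a single pass over the triples; a minimal Python sketch with illustrative data:

from collections import defaultdict

def characteristic_sets(triples):
    # forward: predicates emitted by each subject; backward: predicates received by each object
    cs_fwd, cs_bwd = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        cs_fwd[s].add(p)
        cs_bwd[o].add(p)
    return cs_fwd, cs_bwd

triples = [("Airplane1", "has_model", "A340"),
           ("Airplane2", "has_model", "B747"),
           ("A340", "nb_version", "4")]
fwd, bwd = characteristic_sets(triples)
# fwd["A340"] == {"nb_version"} while bwd["A340"] == {"has_model"}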

3.3.2.1 Backward fragmentation algorithm

The generation of backward graph fragments is described in Algorithm 4. The inputs of the algorithm are the same as for Alg. 1, except that the input file must be in OPS (object, predicate, subject) order. This ordering facilitates the creation of backward stars and fragments. The algorithm starts by initializing an empty graph star, a variable storing the object of the first triple, and an empty map structure mapping characteristic sets to graph fragments. Then it makes a single pass through the triples of the dataset (Step 2). The loop grows a graph star until it detects a change in the object of the current triple (Step 3). At this point, it generates a key by combining, with a hashing function, the set of IDs of the predicates that co-occur in the graph star (Step 6). If the key is already in the map of characteristic sets, it assigns the star directly to the corresponding fragment and appends it to the set of graph fragments with the function appendGf (Step 9). If there is no entry in the map for the star's key, it uses the function updateGb, which applies the similarity function to decide to which fragment the star should be appended (Step 11). The function appendGf is not detailed in a separate algorithm, since the steps are the same as in Alg. 2 but for backward fragments. The function updateGb is described in Algorithm 5. The algorithm returns the set of graph fragments, the map of characteristic sets and graph fragments, and a map assigning each object to the corresponding backward fragment (Step 18).

Algorithm 4 Generation of backward graph fragments
Input: RDF file D of triples $\{t_1, ..., t_n\}$, $t = \langle s, p, o \rangle$, in OPS order; similarity threshold $\tau$
Output: Set of fragments $\overleftarrow{Gf} = \{\overleftarrow{Gf}_1, ..., \overleftarrow{Gf}_p\}$; characteristic set map CSM⟨key, $\overleftarrow{Gf}_k$⟩; object map OM⟨o, $\overleftarrow{Gf}_k$⟩
1: Initialize $\overleftarrow{Gs} = \emptyset$, $o_{i-1} = f_o(t_1)$, CSM⟨key, $\overleftarrow{Gf}_k$⟩ = ∅, OM⟨o, $\overleftarrow{Gf}_k$⟩ = ∅
2: for each $t_i$ in D do
3:   if $o_{i-1} = f_o(t_i)$ then
4:     $\overleftarrow{Gs}(o_{i-1}) = \overleftarrow{Gs}(o_{i-1}) \cup \{t_i\}$
5:   else
6:     key = Hash($cs(o_{i-1})$)
7:     if key ∈ CSM.keySet() then
8:       $\overleftarrow{Gf}_k$ = CSM.get(key)
9:       $\overleftarrow{Gf}$ = appendGf($\overleftarrow{Gs}(o_{i-1})$, $\overleftarrow{Gf}_k$, $\overleftarrow{Gf}$)
10:    else
11:      ($\overleftarrow{Gf}$, CSM, $\overleftarrow{Gf}_k$) = updateGb($\overleftarrow{Gs}(o_{i-1})$, $\overleftarrow{Gf}$, $\tau$, CSM)
12:    end if
13:    $\overleftarrow{Gs}(o_{i-1}) = \{t_i\}$
14:    OM.put($o_{i-1}$, $\overleftarrow{Gf}_k$)
15:  end if
16:  $o_{i-1} = f_o(t_i)$
17: end for
18: return $\overleftarrow{Gf}$, CSM, OM

The updateGb function is described in Alg. 5. The algorithm makes a single pass through the already assigned graph fragments (Step 2). Based on a similarity measure on the characteristic sets, it searches for the most similar already assigned graph fragment (Steps 3 - 5). If after going through all the assigned fragments it does not find any match, it creates a new graph fragment (Steps 8 - 10). It returns the updated set of backward fragments, the updated characteristic set map, and the graph fragment assigned to the input graph star (Step 15). The set of backward graph fragments also forms a partitioning set of the RDF graph, as shown in Theorem 3.4.

Theorem 3.4 The set $\overleftarrow{Gf} = \{\overleftarrow{Gf}_1, ..., \overleftarrow{Gf}_m\}$ of all backward graph fragments for the graph G is a correct partition set of the graph G.

The proof follows the same structure as the one for the forward graph fragments. We have shown that the sets of forward and backward fragments ($\overrightarrow{Gf}$, $\overleftarrow{Gf}$) induce correct partitions (i.e., complete, disjoint and rebuildable) of the original RDF dataset G. The elements of each set correspond to fragments of the RDF graph G that become distribution units during the allocation step. The allocation problem is described and formalized in Section 3.5. In the next section we compare the creation of fragments with other techniques commonly used to persist RDF graphs.


Algorithm 5 updateGb function
Input: A graph star $\overleftarrow{Gs}(o)$, a set of graph fragments $\overleftarrow{Gf}$, a similarity threshold $\tau$, map of graph fragment keys CSM⟨key, $\overleftarrow{Gf}_k$⟩
Output: ($\overleftarrow{Gf}$, CSM⟨key, $\overleftarrow{Gf}_{tmp}$⟩, $\overleftarrow{Gf}_{tmp}$)
1: Initialize max = 0, $\overleftarrow{Gf}_{tmp} = \emptyset$, key = Hash($cs(o_i)$)
2: for each $\overleftarrow{Gf}_i$ in $\overleftarrow{Gf}$ do
3:   if $Sim_B(cs(o), cs(\overleftarrow{Gf}_i)) > \tau$ AND $Sim_B(cs(o), cs(\overleftarrow{Gf}_i)) > $ max then
4:     max = $Sim_B(cs(o), cs(\overleftarrow{Gf}_i))$
5:     $\overleftarrow{Gf}_{tmp} = \overleftarrow{Gf}_i$
6:   end if
7: end for
8: if $\overleftarrow{Gf}_{tmp} = \emptyset$ then
9:   $\overleftarrow{Gf}_{tmp} = \overleftarrow{Gs}(o)$
10:  $\overleftarrow{Gf} = \overleftarrow{Gf} \cup \overleftarrow{Gf}_{tmp}$
11: else
12:   $\overleftarrow{Gf}$ = appendGf($\overleftarrow{Gs}(o)$, $\overleftarrow{Gf}_{tmp}$, $\overleftarrow{Gf}$)
13: end if
14: CSM.put(key, $\overleftarrow{Gf}_{tmp}$)
15: return ($\overleftarrow{Gf}$, CSM, $\overleftarrow{Gf}_{tmp}$)

3.4 From logical fragments to physical structures

The organization of RDF data in graph fragments makes it possible to detect implicit logical entities that are used as distribution units in RDFDS. Furthermore, using these logical fragments as physical structures can significantly improve the performance of centralized and distributed systems. The work of Pham et al. [PPEB15] showed that explicitly storing the data using an automatically discovered relational schema boosts the performance of Virtuoso [EM09], a relational-based triple store. Organizing the data into forward and backward graph fragments, regardless of the structure used to persist the data (e.g., tables, indexes or adjacency lists), avoids scanning the whole dataset several times for a single query, as is done by most systems storing the entire RDF graph in a single data structure. This intuition is illustrated in Figure 3.4, which compares query execution in systems with distinct storage structures. Let us start with the execution in systems storing the data in a single triple table, as shown in Figure 3.4a. In these systems, the execution is in general divided by triple patterns (TPs). The query in this figure is divided into four triple patterns. The matches for each TP are found by scanning the entire table or index storing the whole dataset. Then, the matches for each TP are joined to find the final results. Some systems have optimized this strategy by indexing the data in different orders (e.g., SPO, OPS, PSO) to join the results of triple patterns using more efficient algorithms (e.g., merge-join in RDF-3X [NW08]). Other strategies, like the property table, reduce the number of joins by solving star-shaped patterns with a single scan. For example, the yellow pattern shown in Figure 3.4b, composed of two single patterns, is solved with a single scan of the property table. Still, the property table must be entirely scanned to solve every star-shaped pattern, which may cause overheads. Partitioning the RDF graph into physical fragments can avoid scanning the whole dataset for each query pattern. To do this, the relevant partitions must be identified based on the information available in the query. Forward and backward fragments make it possible to identify the relevant partitions based on the query's predicates, which are known in most SPARQL queries. The process to solve a query in a system persisting the data as forward graph fragments is illustrated in Figure 3.4c. We assume that the data are stored in tables, but the storage structure could have been any other (e.g., indexes, adjacency lists).


Figure 3.4: Query execution by storage. (a) Triple table, (b) property table, (c) $\overrightarrow{Gf}$ only, (d) $\overrightarrow{Gf}$ and $\overleftarrow{Gf}$.

If the data are organized as forward fragments, the query should also be grouped into forward star-shaped patterns (denoted as $\overrightarrow{SQ}$) so that each pattern group has a candidate set of forward fragments. The candidate set contains the forward fragments in which a match to the query is likely to be found. To obtain them one can apply simple rules, for example that the characteristic set of the star-shaped pattern must be contained in the characteristic set of any candidate fragment. More formally, the set of candidate fragments, denoted as $C(\overrightarrow{SQ})$, is defined as $C(\overrightarrow{SQ}) = \{\overrightarrow{Gf}_k \mid cs(\overrightarrow{SQ}) \subseteq cs(\overrightarrow{Gf}_k)\}$. Ideally, the total number of triples read from the candidate set is much smaller than the total number of triples of the dataset. Since not all the data are read at the same time, the system's throughput is also improved. In addition, the fragments in the candidate set can be scanned in parallel, reducing the queries' response time. The order in which the triple patterns are executed also plays a very important role in query performance. Storing the data in physical graph fragments makes it easy to collect statistics about them to estimate the most efficient join order. Fragmented data also allow more efficient filtering from one pattern to another; as we will see below, using forward and backward graph fragments simultaneously allows pruning invalid scans from one triple pattern to the next.
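The containment rule above is straightforward to sketch. In the following Python fragment, each fragment is identified by its characteristic set (a frozenset of predicates) and a star-shaped query pattern by its own predicate set; names and data are illustrative only:

def candidate_fragments(query_star_preds, fragments):
    # Keep a fragment only if it emits every predicate of the star pattern:
    # cs(SQ) is a subset of cs(Gf_k)
    return [cs for cs in fragments if query_star_preds <= cs]

fragments = {
    frozenset({"type", "has_length", "has_model", "nb_motors", "manufacturer"}): "Gf1",
    frozenset({"has_seat", "office_in"}): "Gf2",
}
star = {"type", "has_length", "has_model"}
print([fragments[cs] for cs in candidate_fragments(star, fragments)])  # ['Gf1']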

Combining forward and backward fragments The integration of horizontal and vertical partitions in the relational model has been explored by many researchers in the past, for example the physical design advisors described in [ANY04] or the Fractured Mirrors [RDS02]. Meanwhile, many massive processing systems propose replication strategies not only for recovery and fault tolerance but also to improve query response times. The Hadoop distributed framework, for example, stores the data with a default replication factor of 3. We consider an approach similar to the Fractured Mirrors [RDS02] in which a system

stores two copies of the data. One copy organizes the data as forward graph fragments and the other as backward graph fragments. This configuration is especially useful when the workload is unknown at the initial partitioning stage. Also, as shown in Figure 3.4d, considering both types of fragments allows filtering the data scanned at each candidate fragment using the matches of the previous query pattern. For instance, let us consider that the execution starts by finding the matches for the yellow star pattern in Figure 3.4d. The matches found for this pattern can be used as filters to scan the rows (which are the objects) of the candidate backward fragments of the green and red patterns.

3.5 Allocation problem

The allocation of fragments is a mandatory step in distributed systems. The problem consists in finding the optimal distribution of data fragments over the sites of a computer network. The problem was initially studied in the context of file distribution and was later deepened in relational databases, where it was proven to be NP-complete [Esw74, SW85, LY80]. In the relational model, the estimation of how optimal a distribution strategy is depends on various criteria which are evaluated on the basis of cost models. These models estimate the storage and maintenance costs along with performance metrics like the system's throughput and response time. Since the complexity of the problem does not allow exact solutions to be calculated in a reasonable time, a number of different heuristics used in operational research (e.g., the knapsack problem [CMP82]) have been applied in RDBMSs. These techniques are barely used in modern distributed architectures like Hadoop, which, despite being based on other execution paradigms, face the same data distribution problem. These systems split the data into chunks and use hashing functions to distribute them among the sites of the cluster, since the data lack a defined schema. Similarly, most RDFDS opt for very simple distribution solutions. These solutions do not guarantee that triples closely related to each other end up on the same site. A query that merges intermediate results that are not found on the same machine is very inefficient, mainly due to the high transfer costs. As in distributed relational databases, we consider the network costs as the processing bottleneck at query runtime. Therefore, the strategies to improve the system's performance should seek to maximize data locality. This can be achieved through indexing, partitioning and replication techniques. As shown in the previous section, using graph fragments as physical storage structures reduces the disk costs and enables parallelism in centralized and distributed systems. In this section, we consider the use of forward and backward graph fragments to reduce the network cost in RDFDS. We define the allocation problem and its constraints. We seek to allocate triples as close as possible to their neighbors so that the intermediate results generated for a query can be pruned locally at each site. We assume that the workload is not available, as is often the case when analyzing RDF data. We build our distribution model on the innate connectivity of the data.

3.5.1 Problem definition

Inputs Let $V = \{V_1, ..., V_p\}$ be the set of forward, backward, or both graph fragments, $V = \{\overrightarrow{Gf}_1, ..., \overrightarrow{Gf}_l\} \cup \{\overleftarrow{Gf}_1, ..., \overleftarrow{Gf}_m\}$, with cardinality $p = l + m$. The sites are represented by the set $S = \{S_1, ..., S_q\}$. We assume that the size of a single graph fragment is always smaller than the maximum capacity of a site, $|S_i| \geq |V_j|$. The available space on a site $S_k$ is expressed as $|S_k|$ and the average size of a triple as $|t|$. An imbalance factor denoted by $\epsilon$ is used to avoid a high imbalance in the number of triples between partitions.

Problem Let us consider the functions $W_V : V \rightarrow \mathbb{N}^+$, returning the number of triples per graph fragment, and $W_F : V \times V \rightarrow \mathbb{N}$, returning the number of triples shared by two graph fragments. The procedure to calculate the values of both functions is described in Section 3.5.1.1.


The allocation function $x_{iS_k} : V \rightarrow \{0, 1\}$ is defined as:

$$x_{iS_k} = \begin{cases} 1 & \text{if } V_i \text{ is stored on site } S_k \\ 0 & \text{otherwise} \end{cases}$$

and $\oplus : x_{iS_k} \rightarrow \{0, 1\}$ is the XOR operator.

The partitioning problem then consists in finding an allocation function $x_{iS_k} : V \rightarrow S$ that minimizes the total number of triples shared by two sites. Given a set of fragments V, sites S, and an imbalance factor $\epsilon$:

$$\text{minimize} \sum_{\substack{i,j \in \{1,...,p\} \\ k \in \{1,...,q\}}} (x_{iS_k} \oplus x_{jS_k}) \cdot W_F(V_i, V_j) \quad (3.1)$$

Subject to:

(i) Imbalance constraint:

$$\forall_{k \neq l \in \{1..q\}} \; \left| \sum_{i=1}^{p} x_{iS_k} \cdot W_V(V_i) - \sum_{j=1}^{p} x_{jS_l} \cdot W_V(V_j) \right| \leq \epsilon \quad (3.2)$$

(ii) Replication factor:

$$\forall_{i \in \{1..p\}} \left( \sum_{k=1}^{q} x_{iS_k} \right) = R \quad (3.3)$$

(iii) Available space on sites:

$$\forall_{k \in \{1,...,q\}} \; \sum_{i=1}^{p} x_{iS_k} \cdot \frac{W_V(V_i)}{|t|} \leq |S_k| \quad (3.4)$$

The objective function seeks to minimize, at each site, the number of triples whose nodes (subject or object) belong to graph fragments located on distinct sites. The constraints are expressed in Equations 3.2, 3.3 and 3.4. They refer, respectively, to the maximum imbalance allowed per site, to the number of copies of each fragment in the system (set by default to R = 1), and to the sites' size constraint.
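To make the objective function tangible, here is a small Python sketch evaluating Eq. (3.1) for a given assignment with R = 1; the fragment and site names are illustrative:

def cut_cost(assignment, w_f):
    # assignment: fragment -> site; w_f: (frag_i, frag_j) -> shared triple count.
    # A pair contributes its weight exactly when its two fragments live on
    # different sites, which is what the XOR term in Eq. (3.1) captures.
    return sum(w for (i, j), w in w_f.items() if assignment[i] != assignment[j])

w_f = {("Gf1", "Gf2"): 2, ("Gf1", "Gf3"): 2, ("Gf2", "Gf4"): 1}
print(cut_cost({"Gf1": 0, "Gf2": 0, "Gf3": 1, "Gf4": 1}, w_f))  # 2 + 1 = 3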

3.5.1.1 Estimation of weights

To find the weights $W_V$ and $W_F$ it is necessary to make a second pass over the SPO (or OPS) RDF file to look up the information. The algorithm finding the weights between forward graph fragments is described in Algorithm 6. The algorithm takes as input an RDF file in SPO order, the map of characteristic sets, and the map of subjects defined in Alg. 1. It iterates through the file and retrieves the forward fragment to which the subject of the current triple belongs (Step 3). Then, it increments the map of vertex weights (Steps 4-9). Finally, it checks whether the object of the current triple is the subject of a triple in another fragment (Step 10). If so, it increments the counter and updates the map (Steps 12-17). The procedure to estimate the weights of backward graph fragments is quite similar, using the respective CSM and OM maps defined in Alg. 4. If the system considers both sets of fragments simultaneously, the weights are calculated according to the cases shown in Table 3.1. Finding exact solutions is, as in relational-based systems, computationally infeasible; therefore heuristics producing sub-optimal results are the most convenient strategies. In the following section we present a heuristic that maps the fragments to the nodes of a graph.


Algorithm 6 Generation of weights for forward fragments
Input: RDF file D of triples $\{t_1, ..., t_n\}$, $t = \langle s, p, o \rangle$, in SPO order; characteristic set map CSM⟨key, $\overrightarrow{Gf}_k$⟩; subject map SM⟨s, $\overrightarrow{Gf}$⟩
Output: $W_V$: ⟨$\overrightarrow{Gf}$, count(t)⟩, $W_F$: ⟨($\overrightarrow{Gf}_{src}$, $\overrightarrow{Gf}_{dest}$), count⟩
1: Initialize src = null, dest = null, $W_V$ = ∅, $W_F$ = ∅, count = 0
2: for each ⟨s, p, o⟩ in D do
3:   $\overrightarrow{Gf}_{src}$ = SM.get(s)
4:   if $\overrightarrow{Gf}_{src}$ ∈ $W_V$.keySet() then
5:     count = $W_V$.get($\overrightarrow{Gf}_{src}$)
6:     $W_V$.put($\overrightarrow{Gf}_{src}$, count + 1)
7:   else
8:     $W_V$.put($\overrightarrow{Gf}_{src}$, 1)
9:   end if
10:  if o ∈ SM.keySet() then
11:    $\overrightarrow{Gf}_{dest}$ = SM.get(o)
12:    if ($\overrightarrow{Gf}_{src}$, $\overrightarrow{Gf}_{dest}$) ∈ $W_F$.keySet() then
13:      count = $W_F$.get(($\overrightarrow{Gf}_{src}$, $\overrightarrow{Gf}_{dest}$))
14:      $W_F$.put(($\overrightarrow{Gf}_{src}$, $\overrightarrow{Gf}_{dest}$), count + 1)
15:    else
16:      $W_F$.put(($\overrightarrow{Gf}_{src}$, $\overrightarrow{Gf}_{dest}$), 1)
17:    end if
18:  end if
19: end for
20: return $W_V$, $W_F$
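The same second pass can be written compactly in Python, with dictionaries playing the role of $W_V$ and $W_F$ and the subject map coming from Algorithm 1. This is a sketch, not the thesis implementation:

from collections import Counter

def fragment_weights(triples, subject_map):
    # subject_map: subject -> forward fragment id (the SM map of Algorithm 1)
    w_v, w_f = Counter(), Counter()
    for s, p, o in triples:
        src = subject_map[s]
        w_v[src] += 1                  # one more triple stored in fragment src
        dest = subject_map.get(o)
        if dest is not None:           # the object is itself the subject of other triples
            w_f[(src, dest)] += 1      # a triple crossing from src to dest
    return w_v, w_f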

3.5.2 Graph partitioning heuristic

The set of graph fragments is mapped to a weighted graph, transforming the allocation problem into a graph partitioning problem. The graph partitioning problem has been proved to be a very complex and computationally expensive problem. However, many efficient heuristics have been developed, as detailed in Chapter 2 (e.g., METIS [KK98a]). Let us map the set of graph fragments Gf into a directed weighted graph represented by the quadruple $G = (V, E, W_V, W_E)$. V corresponds to the set of nodes, which are the sets of forward and backward graph fragments. The node weights are represented in the set $W_V$; each element $w(V_i)$ corresponds to the number of triples in the corresponding fragment. E represents the set of edges and $W_E$ the set of weights ($w(E_{ij})$ between the nodes i and j). The weights are calculated following the strategies presented in the previous section. An example graph is shown in Figure 3.5.

Table 3.1: Edge weights in the fragment graph G

Edge type                                              Condition (a)
($\overrightarrow{Gf}_i$, $\overrightarrow{Gf}_j$)     object($t_i$) = subject($t_j$)
($\overleftarrow{Gf}_i$, $\overleftarrow{Gf}_j$)       subject($t_i$) = subject($t_j$)
($\overleftarrow{Gf}_i$, $\overleftarrow{Gf}_j$)       object($t_i$) = subject($t_j$)
($\overrightarrow{Gf}_i$, $\overleftarrow{Gf}_j$)      subject($t_i$) = subject($t_j$)
($\overrightarrow{Gf}_i$, $\overleftarrow{Gf}_j$)      object($t_i$) = subject($t_j$)

(a) The weight of an edge W(edge) is the number of triples $t_i$ such that $t_i \in G_i$, $t_j \in G_j$, and $t_i \wedge t_j$ fulfill the condition in this column.


Figure 3.5: Graph partitioning example

In this case the graph is split into two partitions; the goal is to minimize the cuts in order to implicitly diminish the data that would be transferred if the fragments were joined. Given a graph G and a number $p \in \mathbb{N}^+$ indicating the number of partitions, the graph partitioning problem asks for blocks of nodes $V_1, ..., V_p$ such that $V_1 \cup ... \cup V_p = V$ and $V_i \cap V_j = \emptyset \; \forall i \neq j$. As defined before, a balance parameter $\epsilon$ is used to regulate the size difference between partitions.

Objective function We seek a partition that minimizes the total cut.

$$\text{minimize} \sum_{i,j} w(E_{ij}) \quad (3.5)$$

Subject to:

$$\forall_{k \neq l} \; (V_i \in V_k \wedge V_j \in V_l) \quad (3.6)$$

$$\forall_{i \neq j} \; |w(E_i) - w(E_j)| \leq \epsilon \quad (3.7)$$

The first constraint states that a graph fragment must be assigned to exactly one block of nodes (i.e., partition). Replication is therefore not considered in the graph heuristic approach. The second constraint refers to the imbalance factor between partitions. The constraint that the size of a given partition must not be exceeded, defined in Eq. 3.4, is not considered by most graph partitioning heuristics. Accordingly, we do not explicitly declare it here. If the size of a partition exceeds the available space on a site, re-partitioning strategies should be considered. These strategies are described in Section 3.7.
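In practice the fragment graph is handed to a heuristic such as METIS; the greedy bisection below is only a toy stand-in illustrating the trade-off between cut weight and balance. All names and the eps bound are illustrative:

def greedy_bisect(nodes, w_v, w_e, eps=0):
    # nodes: fragment ids; w_v: node -> triple count; w_e: (u, v) -> edge weight
    total = sum(w_v[n] for n in nodes)
    cap = total / 2 + eps                            # rough balance bound per block
    blocks, loads = ([], []), [0, 0]
    for n in sorted(nodes, key=lambda x: -w_v[x]):   # place heaviest fragments first
        # attraction = weight of edges toward nodes already placed in each block
        attract = [sum(w_e.get((n, m), 0) + w_e.get((m, n), 0) for m in b)
                   for b in blocks]
        k = 0 if attract[0] >= attract[1] else 1     # follow the heaviest connections
        if loads[k] + w_v[n] > cap:                  # but respect the balance bound
            k = 1 - k
        blocks[k].append(n)
        loads[k] += w_v[n]
    return blocks

w_v = {"Gf1": 10, "Gf2": 4, "Gf3": 2, "Gf4": 5}
w_e = {("Gf1", "Gf2"): 2, ("Gf3", "Gf4"): 2, ("Gf1", "Gf3"): 1}
print(greedy_bisect(["Gf1", "Gf2", "Gf3", "Gf4"], w_v, w_e, eps=2))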

3.6 RDF partitioning example

The following example gives an overview of our approach. Let us consider the RDF graph G illustrated in Figure 3.6. This graph is a miniature version of the graph presented in the previous chapter in Figure 2.10. We start by identifying forward graph fragments. Next, we show how physically storing the data with this structure contributes at query runtime. Then, we build the backward graph fragments for this graph. Finally, considering forward and backward fragments, we show an example of our allocation heuristic.

Figure 3.6: Miniature of graph G of Figure 2.10


Forward graph fragments Let us start by gathering the triples into forward graph fragments. As described in Section 3.3.1, we first group the triples by their subjects. We obtain 8 groups of triples (Airplane1, Airplane2, B747, A340, Airbus, Boeing, France and USA). We name each of these groups a forward graph star $\overrightarrow{Gs}$, representing instances of higher-level entities (e.g., an Airplane, a Manufacturer, an Airplane Model). We observe that two instances of the same high-level entity share the same (or almost the same) set of predicates. In general, the forward graph fragments can be uniquely identified by the set of their emitting edges (i.e., their characteristic set). In the example, the forward stars Airplane1 and Airplane2 share the same predicates (has_model, has_length, manufacturer, nb_motors and type). Consequently, Airplane1 and Airplane2 belong to the same group, which we name a forward graph fragment $\overrightarrow{Gf}_i$ of G. There are in total 4 $\overrightarrow{Gf}$ in our example; two of them are illustrated in Figure 3.7a, showing forward graph fragments physically stored in tabular form.

The query Q1:

SELECT ?a ?m ?l WHERE {
  ?a :type :Airliner .
  ?a :has_length ?l .
  ?a :has_model ?m .
  FILTER (?l > 65.0)
}

could be efficiently solved by looking up the relevant graph fragment whose characteristic set contains the predicates type, has_length and has_model, avoiding a full scan of the data.

Still, an organization of the data in forward graph fragments is suboptimal when solving queries with a very small number of predicates, like for instance the query Q2:

SELECT AVG(?y) WHERE {
  ?x :nb_motors ?y .
}

To solve this query, all the triples of the forward fragment Airplane must be read, even the triples with predicates other than nb_motors. When exploring the graph for matches, 10 triples in the Airplane forward graph fragment must be scanned when only 2 triples are relevant.

Backward graph fragments The previous problem is quite similar to the one that motivated vertical partitions in relational databases. Partitioning a table by attributes can be more efficient for queries that involve few attributes. In our case, we create groups of triples by attributes, which we name backward graph fragments. To build them, we first gather the triples by their incoming edges. For the graph of Figure 3.6 we obtain 13 groups: "5", "4", "75.4", "70.1", Boeing, Airbus, France, USA, "Toulouse", "Chicago", A340, B747 and Airliner. We name these groups backward graph stars $\overleftarrow{Gs}$. Similarly to what was done for the forward fragments, we use the characteristic set to identify the groups of attributes. We name them backward graph fragments $\overleftarrow{Gf}$.

For the example graph of Figure 3.6 we obtain 8 backward graph fragments: the group of city names with the has_city predicate (2 triples), the group of manufacturers (2 triples), lengths (2 triples), locations (2 triples), models (2 triples) and types (2 triples). There is a group gathering the triples having only the nb_version predicate (1 triple), and another group gathering the triples with the predicates nb_version and nb_motors (3 triples). The predicates could be stored physically in adjacency lists in a JSON file, as shown in Figure 3.7b.


Subject     has_model  has_length  nb_motors  type
Airplane1   A340       75.4        4          Airliner
Airplane2   B747       70.1        4          Airliner

Subject   has_seat  office_in
Airbus    France    Toulouse
Boeing    USA       Chicago

(a) Forward graph fragments stored as relations

{ "nb_version,nb_motors": {
    "4": {"nb_version": [A340], "nb_motors": [Airplane1, Airplane2]}},
  "nb_version": {
    "5": {"nb_version": [B747]}},
  "has_city,office_in": {
    "Chicago":  {"has_city": [USA],    "office_in": [Boeing]},
    "Toulouse": {"has_city": [France], "office_in": [Airbus]}},
  ... }

(b) Backward graph fragments stored in an adjacency list

Figure 3.7: Graph fragments examples

The query computing the average number of motors in an entity could be solved by scanning only the backward graph fragment storing triples with the predicate nb_motors. The execution engine checks which characteristic sets contain the given predicate and scans only the backward graph fragment (nb_version, nb_motors) $\overleftarrow{Gb}_5$, which gathers 3 triples. Compared to the result obtained when the data are organized as forward graph fragments, execution on the backward fragments is more efficient. The process saves, in this case, the resources used to scan 7 triples. However, as is the case for vertical partitions in the relational model, performance can degrade for SPARQL queries joining many single patterns.

Combining forward and backward fragments Some queries require joining several forward or backward graph fragments to find a solution.

For example, let us consider the query Q3:

SELECT ?y ?z WHERE {
  ?x :has_model :A340 .
  ?x :has_length ?y .
  ?x :manufacturer ?m .
  ?m :has_seat ?z .
  ?z :has_city "Toulouse" .
}

If the data are physically stored as forward graph fragments, the query can be solved by scanning at least 3 candidate sets, as shown in Figure 3.8a. In this example, we start by scanning the fragments whose characteristic set contains the predicates has_model, manufacturer and has_length. The results of these matches are joined with the candidate fragments assigned to the nodes with outgoing edges in the query. On the other hand, if the data are stored physically as backward graph fragments, the execution would involve many more joins. A candidate set of fragments is needed for each node with incoming edges in the query graph, as shown in Figure 3.8b. Even if a backward graph fragment would be much more efficient to find the bounded value in the fifth query pattern of Q3, the great number of joins needed impairs the overall performance. To fix this problem, the strategy presented in Section 3.4 considers storing the data using both types of physical fragments. The execution schema for Q3 using this strategy is shown in Figure 3.8c.


Figure 3.8: Execution strategies for query Q3. (a) $\overrightarrow{Gf}$ only, (b) $\overleftarrow{Gf}$ only, (c) $\overrightarrow{Gf}$ and $\overleftarrow{Gf}$ (⋆: candidate set).

We do not go into more detail about query execution modeling using the graph structures presented here, nor about the selection of an optimal execution plan; both topics are out of the scope of this thesis.

Graph fragments allocation As seen in Section 3.5, forward and backward graph fragments can be used as distribution units to allocate triples in an RDFDS. We propose to use graph partitioning heuristics to find solutions to the allocation problem in a reasonable time. The graph shown in Figure 3.9a is the weighted graph of forward graph fragments for the graph of Figure 3.6. The graph has four nodes, one for each fragment, whose weight is the number of triples stored in it. An edge is drawn between two nodes if there is a triple in the source fragment whose object is the subject of a triple in the other fragment. The weight of each edge is the number of triples shared between both fragments; it represents the number of triples that would be transferred between two fragments when they are joined. Let us consider for example the edge between the forward graph fragments $\overrightarrow{Gf}_1$ and $\overrightarrow{Gf}_2$. The weight in this case is 2 since the subjects :Airplane1 and :Airplane2 stored in $\overrightarrow{Gf}_1$ are connected to :A340 and :B747, which are the subjects of triples stored in $\overrightarrow{Gf}_2$.

Figure 3.9b illustrates the weighted graph of backward graph fragments for the graph of Figure 3.6. The node weights are the number of triples in each fragment, and the edges are likewise drawn based on the data shared by triples in two distinct fragments. The cases are detailed in Table 3.1. The detail of the data in each fragment can be found in Appendix A. We do not show the weighted graph combining forward and backward graph fragments, for readability. Finally, the allocation is decided using a graph partitioning heuristic such as METIS [KK98a]. If the chosen graph partitioning heuristic works only for undirected graphs, the conversion is very simple.


Figure 3.9: Extract of weighted graphs of fragments for graph G. (a) Weighted graph of $\overrightarrow{Gf}$, (b) weighted graph of $\overleftarrow{Gf}$.

3.7 Dealing with large fragments

When the size of a forward or backward graph fragment exceeds a predefined threshold, the fragment needs to be repartitioned. The threshold can be set to the space available on a single site of a distributed system, because if the size of a graph fragment surpasses the available space on a site, we will not be able to place it anywhere. To partition within a fragment, we apply a function that sub-partitions it while keeping the data belonging to the same graph star together. In other words, the function must keep triples with the same subject or object in the same forward or backward graph fragment respectively. This keeps the previously found logical structure intact, so that the allocation methods based on data connectivity can still be applied. The re-partitioning function is either obtained by hashing the triples' subject or object, or by using a range strategy on the predicates, as seen below. Let us start by defining the fragments that need to be re-fragmented. Consider a set of sites $\{s_1, ..., s_s\}$ and graph fragments $Gf = \{Gf_1, ..., Gf_m\}$, where a site's capacity and a graph fragment's size are denoted $|s_i|$ and $|Gf_i|$ respectively. A graph fragment needs to be re-partitioned if $|Gf_i| > |s_i|$. The number of sub-partitions is obtained as:

$$n = \left\lceil \frac{|Gf_i|}{|s_i|} \right\rceil$$

Let us consider that the size of a graph fragment $Gf_i$ is bigger than the available space on a site. The hashing strategy applies a hashing function to every graph star such that:

$$\forall Gs(h) \in Gf_i : H(Gs(h)) \bmod n$$

where n is the number of sub-partitions. The advantage of this strategy is the creation of uniform partitions with similar sizes. The next sub-partitioning strategy, named range re-partitioning, maps the forward or backward fragments to a partition according to the values of one or more predicates. If a fragment involving k predicates $(p_1, ..., p_k)$ is too large, we propose to repartition it. To do so, we first consider one of its numerical predicates (having the following form: $p_n \theta value_n$, where $value_n$ is an integer or a float) and then re-split the initial fragment into two new fragments based on the domain of $value_n$. We iterate this procedure until the fragments have a reasonable size. The information related to the domain of the different predicates could be available in the triple store's statistics module [ZMG+20]. In our case, we compute this information from the KG. The process is iterative: for example, if after computing the re-partition operations on a fragment its size is still too large, the user can decide to apply either of the previous strategies again.
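A minimal sketch of the hash strategy follows; the names are illustrative, and a stable hash function would be used in practice instead of Python's process-seeded hash:

import math

def subpartition(fragment, capacity):
    # fragment: head -> list of triples of its graph star
    # capacity: maximum number of triples a site can hold
    size = sum(len(star) for star in fragment.values())
    n = math.ceil(size / capacity)              # n = ceil(|Gf_i| / |s_i|)
    parts = [dict() for _ in range(n)]
    for head, star in fragment.items():
        parts[hash(head) % n][head] = star      # a whole star stays together
    return parts

# A fragment of 10 triples with a site capacity of 4 yields ceil(10/4) = 3 sub-partitions.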


3.8 Conclusion

In the relational database world, data partitioning has long been identified as a key optimization and manageability technique. In this context, data partitioning, characterized by its simplicity, is independent of the storage approach of the data. Consequently, a user does not have to deal with the physical storage layer of the system hosting the data, even though the partitions seriously impact the physical layer. These findings represent the wisdom of traditional relational data partitioning. In this chapter, we aimed to reproduce this wisdom by tackling the partitioning problem for RDF data. Unlike traditional partitioning techniques, RDF techniques are dependent on the partitioning strategy and are difficult to generalize across systems. We drew on the philosophy adopted by the relational model to address partitioning within distributed RDF systems. Precisely, we introduced a logical layer on top of the merely physical process of distributing triples to a set of sites. We formalized and detailed the algorithms used to create the logical entities that we named graph fragments (Gf). Our entities extend the notions of partitioning by instances and by attributes from the relational model and offer a great balance between conceptual and intuitive simplicity, in addition to their logical expressiveness. We formalized the allocation problem and presented a graph-based heuristic to minimize communication costs.


Chapter 4

RDFPartSuite in Action

Contents
4.1 Introduction ...... 121
4.2 RDF QDAG ...... 121
4.2.1 System architecture ...... 122
4.2.2 Storage model ...... 123
4.2.3 Execution model ...... 125
4.3 Loading costs ...... 128
4.3.1 Tested datasets ...... 128
4.3.2 Configuration setup ...... 129
4.3.3 Pre-processing times ...... 129
4.4 Evaluation of the fragmentation strategies ...... 130
4.4.1 Data coverage ...... 130
4.4.2 Exclusive comparison of fragmentation strategies ...... 131
4.4.3 Combining fragmentation strategies ...... 134
4.5 Evaluation of the allocation strategies ...... 136
4.5.1 Data skewness comparison ...... 136
4.5.2 Communication costs study ...... 137
4.5.3 Distributed experiments ...... 140
4.6 Partitioning language ...... 143
4.6.1 Notations ...... 143
4.6.2 CREATE KG statement ...... 144
4.6.3 LOAD DATA statement ...... 144
4.6.4 FRAGMENT KG statement ...... 144
4.6.5 ALTER FRAGMENT statement ...... 145
4.6.6 ALLOCATE statement ...... 145
4.6.7 ALTER ALLOCATION statement ...... 146
4.6.8 DISPATCH statement ...... 146
4.6.9 Integration of the language to other systems ...... 146
4.7 RDF partitioning advisor ...... 147
4.7.1 Main functionalities ...... 147
4.7.2 System architecture ...... 148
4.7.3 Use case ...... 149
4.8 Conclusion ...... 152


Summary In the previous chapter we defined a logical layer to partition RDF datasets. This layer allows partitioning RDF data independently of its physical storage implementation while preserving its logical graph structure. In this chapter, we present the partitioning framework RDFPartSuite. This framework is built upon the logical structures defined in the previous chapter. It is composed of three main modules: fragmenter, allocator and dispatcher. These modules provide functionalities to assist managers (triple stores' administrators) in partitioning Knowledge Graphs based on their requirements. We detail the incorporation of this framework into a centralized (RDF QDAG) and a distributed (gStoreD) triple store. We conducted several experiments that demonstrate the virtues and the costs of incorporating our structures while loading and processing RDF data. Finally, we present a set of assistance tools proposed by our framework. These tools relieve managers from the partitioning design tasks with a declarative partitioning language and a partitioning advisor. This chapter is organized as follows. We start in Section 4.2 by giving an overview of the centralized triple store in which we incorporated our framework. Then, in Section 4.3 we compare the loading costs of our framework with respect to other fragmentation strategies from the state of the art. Next, in Section 4.4 we evaluate our fragmentation strategies in terms of data coverage and query performance. The fragmentation process is evaluated in a centralized and also in a distributed triple store to which we incorporated our framework. In Section 4.5, we evaluate the allocation strategies for graph fragments. Then, we detail the assistance tools provided by RDFPartSuite: a partitioning language (Section 4.6) and an advisor (Section 4.7). Finally, Section 4.8 concludes this chapter.


4.1 Introduction

In the previous chapter we promoted a logical layer for RDF data partitioning. We defined a set of structures that allow partitioning RDF data regardless of how it is stored. These structures preserve the logical graph nature of RDF data and are used as fragmentation units. The logical fragments are placed on the sites of a distributed system according to data-driven algorithms that we have also detailed. In this chapter we introduce a framework named RDFPartSuite. The framework provides generic functionalities built upon the logical structures defined in the previous chapter. It can be adapted according to the manager's specifications (creator's requirements, triple store, available infrastructure, etc.). RDFPartSuite provides a standard way to build and deploy RDF partitioning schemas in a universal and reusable environment. The framework is composed of three main modules:

(i) Fragmenter: this component is in charge of partitioning the triples using their logical representation as graph fragments (i.e., $\overrightarrow{Gf}$ or $\overleftarrow{Gf}$). The fragments can be built using structural or semantic rules, as described in Section 3.3.1.1.

(ii) Allocator: the allocator builds a distribution schema for the fragments built by the fragmenter component. This schema is built using the data-driven strategies detailed in Section 3.5.

(iii) Dispatcher: this component sends the fragments to the sites of a distributed system following the allocation schema produced by the allocator. The dispatcher is also in charge of loading the data into the target triple store.

In addition to these modules, our framework integrates a set of assistance tools for the administrators (managers) of triple stores. These tools include a declarative partitioning language to fragment, allocate, and dispatch a Knowledge Graph to a target triple store and, for non-expert users, a partitioning wizard to help them build a partitioning schema for their data. This chapter is dedicated to this framework. We start by describing how to incorporate our framework into a centralized triple store (RDF QDAG [KMG+20]). This system uses data partitioning and graph exploration techniques to accelerate the execution of SPARQL queries in a centralized environment. We also show how to incorporate our framework into a distributed triple store (gStoreD [ZÖC+14]). We carried out a significant number of experiments that show the feasibility of our partitioning strategies, their effectiveness, and their limits. Finally, we present RDFPartSuite's assistance tools. The sections of this chapter are organized as follows. We start in Section 4.2 by detailing the incorporation of our framework into RDF QDAG. We describe its main modules along with their storage and processing strategies. This system uses the logical fragments (i.e., $\overrightarrow{Gf}$ and $\overleftarrow{Gf}$) as physical storage units, which allows us to compare the performance gain of our model versus alternative partitioning strategies applied in other state-of-the-art systems. We detail our implementation choices and carefully examine the pre-processing costs (Section 4.3). In Section 4.4, we evaluate our fragmentation strategies in terms of query performance. We performed these experiments in RDF QDAG, a centralized relational-based system, and a graph-based distributed triple store (gStore). Then, in Section 4.5 we discuss at length the allocation strategies supported by our system and the results of our experimental comparisons. Finally, in Sections 4.6 and 4.7 we detail our assistance tools, namely the declarative partitioning language and the partitioning advisor.

4.2 RDF QDAG

Modern RDF processing systems can be divided into two groups, as studied in Section 2.3.2. The first group uses the relational model to store RDF triples in tables with diverse configurations.


Figure 4.1: RDF QDAG architecture

The systems in this group are more easily scalable as they can use optimization strategies, such as indexes and partitions, available for relational databases. Unfortunately, the performance of these systems degrades rapidly, especially when dealing with complex SPARQL queries. The relational model is not suitable for handling RDF data inherently represented as a graph. The second group comprises RDF processing systems maintaining the graph structure of RDF data. These systems store the triples in data structures specifically designed to store graphs (cf. Section 2.2.2). The major issue confronted by these approaches is scalability to large RDF graphs [AHKK17, Özs16]. They fail to efficiently manage the use of main memory and are not scalable in infrastructures with limited resources. RDF QDAG [KMG+20] combines the virtues of data partitioning in relational-based systems with an execution model based on the exploration of the RDF graph. The system supports Basic Graph Pattern (BGP) SPARQL queries as well as wildcard, aggregation and sorting operators. The data in RDF QDAG are physically partitioned into forward and backward graph fragments. The system is able to prune irrelevant graph fragments using the predicates' bounded values in the query. This feature allows exploring only the fragments in which a match is likely to be found. As mentioned before, query execution in this system is based on graph exploration. To avoid memory overflows, the execution engine of RDF QDAG is based on the Volcano parallel query evaluation system [Gra94]. In what follows we give an overview of RDF QDAG, describing its general architecture and its storage, optimization and execution layers.

4.2.1 System architecture

The system is composed of three main modules: Data Loader, Query Optimizer and Executor. The architecture is illustrated in Figure 4.1. Below we describe each of the system's components.

4.2.1.1 Data loader

This module prepares, indexes and partitions the raw RDF data received as input. It is composed of two main components: pre-processing and storage. The first one transforms the RDF data (stored in N-Triples, N3 or Turtle files), encoding all the strings and arranging the encoded data into forward and backward graph fragments. The string transformation takes place in the Encoder unit. The result of this stage is given as input to the Fragmenter component, which splits the data into forward and backward graph fragments using Algorithms 1 and 4. The Data refiner adds to each triple in each fragment information related to its direct neighbors, as will be detailed later in Section 4.2.2. The pre-processing module is entirely coded in Java. The storage component,

coded in C++, takes the encoded refined file from the pre-processing as input and sets up the data in indexes. The result of this step is a set of binary files used by the query optimizer and executor modules.

4.2.1.2 Query optimizer

This module is responsible for finding the optimal execution plan for a SPARQL query. It is composed of three main sub-modules: parser, plan generator and plan selector. The parser transforms the SPARQL query into a series of structures that are understandable by the plan generator sub-component. The plan generator enumerates a series of "acceptable plans", defined in Section 4.2.3, that are later ranked based on a cost model in the plan selector component. The best plan is selected using heuristics that take simple statistics from the fragmented data to estimate the execution cost of each plan. The optimization strategies applied by this component are described in detail in [ZMG+20].

4.2.1.3 Executor

This component receives the chosen execution plan from the optimizer and executes it. The matcher sub-component interrogates the B+Tree index (cf. Section 4.2.2) and gets the encoded results. This component might interrogate the string dictionary (using the dictionary matcher sub-component) during the execution to enhance the index search. More often, however, the dictionary matcher is called at the end of the query execution to decode the original string values. The data are aggregated and sorted by the respective sub-components when necessary. The output of results (e.g., to the console or to a file) is managed by the result writer.

4.2.2 Storage model

As we mentioned previously, the data in RDF QDAG are physically stored as graph fragments. The system stores two copies of the data, one as forward and another as backward graph fragments. This choice will be justified when presenting the execution model. In this section we present how the data are compressed and how the triples are physically represented on disk. Let us start by describing the output files of the pre-processing phase. First, the predicates of the input dataset are gathered, with their identifiers, in a single file named pred_index. An example of this file for the graph data of Figure 2.10 is shown in Figure 4.2a. Then, the identifiers of the forward and backward graph fragments are stored in separate files, as shown in Figures 4.2b and 4.2c respectively. Lastly, the triples assigned to each graph fragment are stored in separate files. For each graph fragment (identified by a fragment id fid), the data are separated into three files:

• Dictionary (fid.dic): This file stores a map of the strings of the triples assigned to the graph fragment with their respective IDs. The strings are sorted in lexicographical order and the IDs are integer values. An example of this file is given in Figure 4.2d. The dictionary file is used as input by the Dictionary Builder component to store the data in a compressed string index.

• Schema (fid.schema): This file stores the list of predicates (encoded with the IDs from the pred_index file) along with the data type of the triple's object if the schema file corresponds to a forward graph fragment, or of the subject otherwise. The first line of the file stores the data type of the encoded subjects (in the case of a forward fragment) or of the encoded objects (in the case of a backward fragment) stored in the graph fragment. The schema file for the graph fragment of our running example is illustrated in Figure 4.2e.

[Figure 4.2: Storage file examples for the graph of Figure 2.10. (a) predicate identifiers (pred_index): has_city 0, has_length 1, has_model 2, has_seat 3, manufacturer 4, nb_motors 5, nb_version 6, office_in 7, type 8; (b) forward fragment identifiers; (c) backward fragment identifiers; (d) dictionary file; (e) schema file; (f) data file.]

• Data (fid.data): This file stores the encoded subject, predicate and object of each triple in the fragment. Also, since the evaluation method of RDF QDAG is based on the exploration of the RDF graph, we materialize the information indicating where the direct neighbors of the subject and object of each triple are stored. The data in each fragment are first sorted by subject (or by object in backward graph fragments). Then, each triple is stored on a separate line of the data file, with the following fields separated by spaces:

– Node1: the encoded subject (or object in a backward graph fragment), represented using 8 bytes.

– Gf: the ID (fid) of the fragment (if any) storing the incoming edges of Node1 (or its outgoing edges in a backward graph fragment). If no such fragment exists, this value equals zero.

– Predicate: the predicate's ID of the triple, represented using 4 bytes.

– Node2: the encoded object (or subject in a backward graph fragment), represented using 8 bytes.

– Gfin: the ID (fid) of the fragment storing the incoming edges (if any) of Node2, represented using 4 bytes.

– Gfout: the ID (fid) of the fragment storing the outgoing edges (if any) of Node2, represented using 4 bytes.

In the example of Figure 4.2f, the first line corresponds to the triple <:Airplane1> <:has_model> <:A340>, encoded as:

40            0         2             10        6         2
:Airplane1    (←Gf)     :has_model    :A340     (←Gf)     (→Gf)
Node1         Gf        Predicate     Node2     Gfin      Gfout
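To make the layout concrete, the following minimal Java sketch (our illustration; the record and method names are hypothetical, not RDF QDAG's actual code) parses one line of a fid.data file into its six fields:

// Hypothetical record mirroring one line of a fid.data file.
record FragmentTriple(long node1, int gf, int predicate,
                      long node2, int gfIn, int gfOut) {

    // Parses a space-separated line such as "40 0 2 10 6 2".
    static FragmentTriple parse(String line) {
        String[] f = line.trim().split("\\s+");
        return new FragmentTriple(
            Long.parseLong(f[0]),    // encoded subject (object in a backward fragment)
            Integer.parseInt(f[1]),  // fragment with Node1's opposite edges (0 = none)
            Integer.parseInt(f[2]),  // predicate ID from pred_index
            Long.parseLong(f[3]),    // encoded object (subject in a backward fragment)
            Integer.parseInt(f[4]),  // fragment with Node2's incoming edges (0 = none)
            Integer.parseInt(f[5])); // fragment with Node2's outgoing edges (0 = none)
    }

    public static void main(String[] args) {
        // Prints the decoded first line of the data file of Figure 4.2f.
        System.out.println(parse("40 0 2 10 6 2"));
    }
}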

The data file can be very dense and include much redundant information. The compression strategy for this file is presented in the following section.


Field:  Indicator   Predicate   Node1       Gf          Node2       Gfin        Gfout
Size:   17 bits     1 bit       0-8 bytes   0-4 bytes   0-8 bytes   0-4 bytes   0-4 bytes

Figure 4.3: Compression structure in RDF QDAG

4.2.2.1 Data compression

The files storing the triples and their neighbors described in the previous section gather all the information needed to traverse the data from one fragment to another. To make the exploration even more efficient, each data fragment is stored in a clustered B+Tree index. To avoid storing redundant data and to limit the size of the indexes, the data are compressed using techniques very similar to the ones applied in RDF-3X [NW08]. Let us recall some features of the data stored in a graph fragment file: i) the triples are sorted by subject (or object in backward graph fragments) and by predicate id, ii) the triples share the same characteristic set, iii) the characteristic set is unique for a single subject (or object in a backward graph fragment). These features allow devoting a single bit per triple (0 or 1) to indicate whether it has the same predicate as the previous one. Thus, it is not necessary to store the predicate value for each triple. This idea is reproduced to store Node1 and Node2, where the bit 0 denotes that their values do not change with respect to the previous triple. If the value has changed, 8 bytes are devoted to storing the identifier or value of Node1 and of Node2. The 8 bytes are enough to support the Integer (4 bytes), Long (8 bytes), Float (4 bytes) and Double (8 bytes) datatypes. The values of Gf, Gfin and Gfout are represented using the same state logic. The state 0 indicates that the information does not change with respect to the previous triple, or that the value does not exist (as in the first line of the example in Figure 4.2f). The following 4 bytes are used to indicate the identifiers of the fragments, if any. An indicator of 17 bits is assigned to each triple to state the number of bytes allocated to each element of the triple. The compression structure for each triple is shown in Figure 4.3.
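The following Java sketch illustrates the spirit of this scheme. It is our simplification, not the system's actual C++ code: a 32-bit flag word stands in for the 17-bit indicator, and all names are hypothetical. A field is written only when it differs from the previous triple:

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Sketch of the delta-style scheme described above: flag bits record which
// fields equal those of the previous triple; only changed fields are stored.
final class FragmentCompressor {
    private long prevN1, prevN2;
    private int prevPred = -1, prevGf, prevGfIn, prevGfOut;

    byte[] compress(long n1, int gf, int pred, long n2, int gfIn, int gfOut) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int flags = 0;                           // stands in for the 17-bit indicator
        if (pred  == prevPred)  flags |= 1;      // 1 bit: predicate repeats
        if (n1    == prevN1)    flags |= 1 << 1; // Node1 unchanged -> 0 bytes stored
        if (n2    == prevN2)    flags |= 1 << 2;
        if (gf    == prevGf)    flags |= 1 << 3; // 0 also encodes "no such fragment"
        if (gfIn  == prevGfIn)  flags |= 1 << 4;
        if (gfOut == prevGfOut) flags |= 1 << 5;
        out.writeBytes(ByteBuffer.allocate(4).putInt(flags).array());
        if ((flags & (1 << 1)) == 0) out.writeBytes(ByteBuffer.allocate(8).putLong(n1).array());
        if ((flags & (1 << 3)) == 0) out.writeBytes(ByteBuffer.allocate(4).putInt(gf).array());
        if ((flags & 1)        == 0) out.writeBytes(ByteBuffer.allocate(4).putInt(pred).array());
        if ((flags & (1 << 2)) == 0) out.writeBytes(ByteBuffer.allocate(8).putLong(n2).array());
        if ((flags & (1 << 4)) == 0) out.writeBytes(ByteBuffer.allocate(4).putInt(gfIn).array());
        if ((flags & (1 << 5)) == 0) out.writeBytes(ByteBuffer.allocate(4).putInt(gfOut).array());
        prevN1 = n1; prevGf = gf; prevPred = pred;
        prevN2 = n2; prevGfIn = gfIn; prevGfOut = gfOut;
        return out.toByteArray();
    }
}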

4.2.3 Execution model

In this section we present the execution model of RDF QDAG which, as mentioned in the previous sections, is conceived as a graph exploration process. Let us start by recalling the evaluation model used by most state-of-the-art RDF systems. In these systems, the evaluation of a SPARQL query is modeled as a succession of triple patterns. Let us consider, for instance, the following SPARQL query:

SELECT ?x ?y ?m WHERE {
  ?x :has_length ?y .          # tp1
  ?x :manufacturer ?m .        # tp2
  ?m :office_in "Toulouse" .   # tp3
}

This query is composed of three triple patterns, and its execution is modeled as a sequence of joins: $tp_1 \bowtie_{?x} tp_2 \bowtie_{?m} tp_3$. The optimization problem in these systems consists in finding the optimal join order. On the other hand, query execution in graph-based systems is modeled as a graph exploration process. These systems model the query as a graph, as illustrated in Figure 4.4. An exploration plan defines the order in which the query vertices are assigned to graph vertices [BFVY18]. As with join-based plans, a graph-based plan determines the order of the operations performed by the executor component of the system to find the query matches.


[Figure 4.4: Query graph example. Vertices ?x, ?m, "Toulouse" and ?y; edges: manufacturer from ?x to ?m, office_in from ?m to "Toulouse", has_length from ?x to ?y.]

The nodes in the query graph can be grouped with their incoming or outgoing edges. This simplifies the execution plans and also allows modeling the query exploration with a logic similar to the one used to group triples into graph fragments in RDF QDAG. We name these structures Forward Query Stars and Backward Query Stars if they group the query's nodes by their outgoing or incoming edges respectively. They are formally described in Definition 4.1. The sets of forward and backward query stars for the query of Figure 4.4 are $\overrightarrow{QS} = \{\overrightarrow{QS}(?x), \overrightarrow{QS}(?m)\}$ and $\overleftarrow{QS} = \{\overleftarrow{QS}(?y), \overleftarrow{QS}(?m), \overleftarrow{QS}(\text{"Toulouse"})\}$.

Definition 4.1 (Query Star) Let Q be the SPARQL query graph. A Forward Query Star $\overrightarrow{QS}(x)$ is the set of triple patterns such that $\overrightarrow{QS}(x) = \{(x, p, o) \mid \exists p, o : (x, p, o) \in Q\}$; x is named the head of the Query Star. Likewise, a Backward Query Star $\overleftarrow{QS}(x)$ is $\overleftarrow{QS}(x) = \{(s, p, x) \mid \exists s, p : (s, p, x) \in Q\}$. We use $\overrightarrow{QS}$, $\overleftarrow{QS}$ to denote the sets of forward and backward query stars, and qs to denote indistinctly a forward or backward query star.
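As an illustration of Definition 4.1, the following Java sketch (hypothetical types; not part of RDF QDAG) builds the forward query stars of a BGP by grouping its triple patterns by subject; running it on the example query yields the heads ?x and ?m, as in the text:

import java.util.*;

record TriplePattern(String s, String p, String o) {}

class QueryStars {
    // Returns, for each head x, the forward query star QS(x) = {(x,p,o) in Q}.
    static Map<String, List<TriplePattern>> forwardStars(List<TriplePattern> bgp) {
        Map<String, List<TriplePattern>> stars = new LinkedHashMap<>();
        for (TriplePattern tp : bgp)
            stars.computeIfAbsent(tp.s(), k -> new ArrayList<>()).add(tp);
        return stars;
    }

    public static void main(String[] args) {
        List<TriplePattern> q = List.of(
            new TriplePattern("?x", ":has_length", "?y"),
            new TriplePattern("?x", ":manufacturer", "?m"),
            new TriplePattern("?m", ":office_in", "\"Toulouse\""));
        // {?x=[tp1, tp2], ?m=[tp3]}, matching QS(?x) and QS(?m) in the text
        System.out.println(forwardStars(q));
    }
}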

The query execution is then modeled as a sequence of Query Stars in an Execution Plan, as stated in Definition 4.2. Indeed, the execution plan for a query is not unique. It is the job of the query optimizer to rank the execution plans according to a given cost function and choose the one with the lowest cost.

Definition 4.2 (Execution plan) An execution plan is an order function applied on a set of Query Stars. The function denotes the order in which the mappings for each Query Star will be found. We denote by $P = [QS_1, QS_2, \ldots, QS_n]$ the plan formed by executing $QS_1$, then $QS_2$, ..., and finally $QS_n$.

In order for an execution plan to be acceptable it must fulfill both of the following conditions:

(i) Coverage: all the nodes and edges in the query graph must be covered by the sequence of query stars in the plan.

(ii) Bounded heads: the heads of all the query stars in the plan, except the first one, must have a bounded value before appearing in the plan. This condition avoids performing a Cartesian product between the matches of two query stars and ensures that the results of the previous query stars prune irrelevant mappings in the following ones. This is illustrated in Figures 4.5a and 4.5b, showing an acceptable and a non-acceptable query plan respectively (a minimal sketch of this check follows Figure 4.5).

[Figure 4.5: Query execution plan examples. (a) the acceptable plan $[\overrightarrow{QS}(?x), \overleftarrow{QS}(\text{"Toulouse"})]$; (b) the non-acceptable plan $[\overrightarrow{QS}(?x), \overrightarrow{QS}(?m)]$, marked in the figure with the join ($\bowtie$) it would require.]
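A minimal Java sketch of the bounded-heads condition (our illustration, with a deliberately simplified plan representation): every star after the first must have its head already bound by an earlier star.

import java.util.*;

class PlanCheck {
    // Each star is summarized by its head and the set of nodes it binds.
    record Star(String head, Set<String> boundNodes) {}

    static boolean boundedHeads(List<Star> plan) {
        Set<String> bound = new HashSet<>();
        for (int i = 0; i < plan.size(); i++) {
            if (i > 0 && !bound.contains(plan.get(i).head())) return false;
            bound.addAll(plan.get(i).boundNodes());
        }
        return true;
    }

    public static void main(String[] args) {
        Star a = new Star("?x", Set.of("?x", "?y", "?m")); // e.g. QS(?x) of Figure 4.4
        Star b = new Star("?m", Set.of("?m"));
        Star c = new Star("?w", Set.of("?w"));
        System.out.println(boundedHeads(List.of(a, b)));   // true: ?m is bound by a
        System.out.println(boundedHeads(List.of(a, c)));   // false: ?w is never bound
    }
}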


The generation and selection of the optimal execution plan are out of the scope of this thesis. We only give an overview of the most relevant definitions to give the reader an idea of how the system works. A detailed description of the cost model and the plan selection algorithm is given in [ZMG+20]. In the following section we explain the execution of a graph-based execution plan.

4.2.3.1 Volcano execution model

Let us suppose that the chosen execution plan for a query is $P = [QS(x_1), \ldots, QS(x_m)]$. For each query star of the plan, the execution engine must create a list of candidate graph fragments. Forward query stars admit only forward graph fragments in their candidate sets. Similarly, backward query stars admit only backward graph fragments. This set of candidates was already introduced in Section 3.6, but here we give a more formal definition. A graph fragment is part of the candidate set of a given query star if its characteristic set contains the characteristic set of the query star. More formally, $Cand(QS(x_i), \mathbf{Gf}) = \{Gf_i \mid cs(QS(x_i)) \subseteq cs(Gf_i) \wedge Gf_i \in \mathbf{Gf}\}$. For each candidate graph fragment of a query star, the execution engine searches for the relevant triples in the B+Tree index used to store the data of the fragment. Since for every triple we stored the location (fragment identifier) of its incoming and outgoing edges, the engine knows exactly to which fragment to send the intermediate mappings of the next query star. This is illustrated in Figure 4.6. In this case, the first query star ($\overrightarrow{QS}(?x)$) has two candidate forward fragments ($\overrightarrow{Gf}_{11}$ and $\overrightarrow{Gf}_{12}$). The matches found in these fragments are sent to the candidate fragments of the second query star ($\overrightarrow{QS}(?m)$). Knowing exactly to which graph fragment the results of a query star should be sent is possible because we stored the location of the direct forward and backward neighbors of each triple. Finally, the results are sent to the Result Writer without needing an extra coordination step (as is the case in several systems).
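The candidate-set test reduces to a containment check between characteristic sets, as in this Java sketch (hypothetical types; the predicate IDs follow the pred_index of Figure 4.2a):

import java.util.*;

class CandidateSelection {
    record Fragment(int fid, Set<Integer> characteristicSet) {}

    static List<Fragment> candidates(Set<Integer> queryStarCs, List<Fragment> fragments) {
        List<Fragment> cand = new ArrayList<>();
        for (Fragment gf : fragments)
            if (gf.characteristicSet().containsAll(queryStarCs)) // cs(QS) subset of cs(Gf)
                cand.add(gf);
        return cand;
    }

    public static void main(String[] args) {
        // Predicate IDs as in the running example: 1 = has_length, 4 = manufacturer.
        var gf11 = new Fragment(11, Set.of(1, 4, 2));
        var gf12 = new Fragment(12, Set.of(1, 4));
        var gf13 = new Fragment(13, Set.of(4));
        // QS(?x) uses predicates {has_length, manufacturer}:
        System.out.println(candidates(Set.of(1, 4), List.of(gf11, gf12, gf13)));
        // prints fragments 11 and 12, the two candidates of Figure 4.6
    }
}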

[Figure 4.6: Execution pipeline example. The matches found in the candidate fragments $\overrightarrow{Gf}_{11}$ and $\overrightarrow{Gf}_{12}$ of $\overrightarrow{QS}(?x)$ are sent to the candidate fragments $\overrightarrow{Gf}_{13}$, $\overrightarrow{Gf}_{14}$ and $\overrightarrow{Gf}_{15}$ of $\overrightarrow{QS}(?m)$, whose results go to the result writer.]

The core execution engine is based on the Volcano parallel query evaluation system. The model presented in [Gra94] allows controlling the amount of data loaded into main memory, avoiding any overflow. RDF QDAG uses two memory buffers per query star: an input and an output buffer. The input buffers load the data sequentially from the B+Tree indexes of each fragment. The matching triples are sent to an output buffer that forwards the matching data to the following input buffer. This process is illustrated in Figure 4.7.


[Figure 4.7: Volcano execution in RDF QDAG. Example pipeline in which $\overrightarrow{QS}(?x)$ and $\overrightarrow{QS}(?m)$ each read from an input buffer fed by the dataset and dictionaries and write their matches to an output buffer, the last one feeding the result writer.]
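The following Java sketch gives the flavor of this bounded-buffer pipeline. It is a strong simplification under our own assumptions: integer mappings, a parity test standing in for fragment matching, and one thread per query star.

import java.util.concurrent.*;

class VolcanoPipeline {
    public static void main(String[] args) {
        BlockingQueue<Integer> buf1 = new ArrayBlockingQueue<>(4);  // Buffer_in1
        BlockingQueue<Integer> buf2 = new ArrayBlockingQueue<>(4);  // Buffer_in2
        ExecutorService pool = Executors.newFixedThreadPool(2);

        pool.submit(() -> {                 // stage for QS(?x): scan fragment data
            for (int id : new int[]{11, 12, 14, 17}) putQuietly(buf1, id);
            putQuietly(buf1, -1);           // end-of-stream marker
        });
        pool.submit(() -> {                 // stage for QS(?m): match and forward
            for (int id; (id = takeQuietly(buf1)) != -1; )
                if (id % 2 == 0) putQuietly(buf2, id);  // stand-in matching test
            putQuietly(buf2, -1);
        });
        for (int id; (id = takeQuietly(buf2)) != -1; )  // result writer
            System.out.println("result: " + id);
        pool.shutdown();
    }
    static void putQuietly(BlockingQueue<Integer> q, int v) {
        try { q.put(v); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
    static int takeQuietly(BlockingQueue<Integer> q) {
        try { return q.take(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); return -1; }
    }
}

Because every buffer has a fixed capacity, a slow downstream stage naturally throttles its producer, which is how the engine keeps memory usage bounded.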

4.3 Loading costs

The identification and formulation of forward and backward graph fragments in RDF QDAG is certainly a more complex process than the one applied by systems loading the data directly into a single relational table, for instance. The dataset is read several times to identify and encode the data in the desired format. In this section we compare the pre-processing times of RDF systems with different storage and execution paradigms. We compared the loading module of RDF QDAG against relational, graph and cloud-based systems, in terms of loading time and of their ability to scale to large datasets on architectures with limited resources. Indeed, we noticed that experimental comparisons of RDF systems are usually performed on platforms in which the datasets fit up to 3x in main memory (e.g., in [AHKK17]). The loading module of RDF QDAG is able to partition very large RDF graphs even when the entire dataset does not fit in the available main memory. The compared systems are: i) Virtuoso [EM09], ii) RDF-3X [NW08], iii) gStore [ZÖC+14] and iv) CliqueSquare [GKM+15] in a single node (from this point referred to as CliqueSquareS). We evaluated them using the real and synthetic datasets described below.

4.3.1 Tested datasets

We used two real and two synthetic datasets of diverse sizes. Among the synthetic datasets we have LUBM1 [GPH05], a customizable data generator describing information related to universities, departments and faculties. Similarly, WatDiv2 [AHÖD14] is a data generator offering the possibility to customize the entities and associations of the dataset triples. WatDiv also offers a SPARQL query generator to vary the structural characteristics and selectivities of the tested queries. Among the real datasets, we have the DBLP dataset, collecting an extract of the information in this computer science bibliography3. This dataset4 is stored in the N-Triples format, connecting research papers, authors and venues. Lastly, the Yago2 dataset5 gathers information from Wikipedia, WordNet and GeoNames. We used the version of this dataset published in [AHKK17]. The characteristics of the datasets are summarized in Table 4.1. We generated synthetic datasets of different sizes to test the scalability of the tested loading approaches.

1http://swat.cse.lehigh.edu/projects/lubm/ 2https://dsg.uwaterloo.ca/watdiv/ 3https://dblp.uni-trier.de/ 4Available at: http://dblp.l3s.de/dblp.rdf.gz 5Available at https://github.com/ecrc/rdf-exp


Table 4.1: Experimental datasets

Dataset      Triples (M)   Size (GB)   #S (M)   #P   #O (M)
Watdiv100M   109           15          5.21     86   9.76
Watdiv1B     1000          149         52.12    86   179.09
LUBM100M     100           17          16.27    17   12.10
LUBM500M     500           83          81.38    17   60.54
LUBM1B       1367          224         222.21   17   165.29
Yago         284           42          10.12    98   52.37
DBLP         207           32          6.84     27   35.52
M: millions, GB: gigabytes

4.3.2 Configuration setup

The tested systems perform the data pre-processing on a single site (the master node in a distributed system with a master/slave architecture). The master site in our system runs a 64-bit Linux with 32GB of RAM, an Intel(R) Xeon(R) Gold 5118 @ 2.30 GHz processor and 2TB of hard disk. As previously mentioned, the Data Loading module of RDF QDAG is coded in Java (only the indexing sub-component loading the data into the B+Tree is coded in C++). We implemented a fragmentation module in Scala that runs on a Spark cluster. However, the tests in this section are performed with the centralized version of the fragmentation module, which successfully managed to load all the datasets.

4.3.3 Pre-processing times

The pre-processing results are summarized in Table 4.2 and are illustrated in a bar chart comparing their logarithmic times in Figure 4.8. The loading times of RDF QDAG are comparable to the ones obtained by other systems loading the data directly into triple tables. Clearly, Virtuoso and RDF-3X loaded the data faster, yet the detection and division of the data into graph fragments does not make loading a significantly more expensive step. As we will discuss later (in Section 4.4.3), partitioning the data into graph fragments enhances the performance of some queries. In addition, as we have discussed throughout this thesis, the identification of logical entities brings other benefits, such as a simpler design of other optimization strategies. Furthermore, the loading module of RDF QDAG was the only one capable of loading all the datasets given the limited main memory (32GB). The graph-based system gStore was unable to load datasets whose size is greater than the available main memory. The relational-based systems managed to load all the datasets except LUBM1B, for which we got a memory overflow error. In addition, RDF QDAG loaded datasets with a large number of predicates more efficiently than Cliquesquare. Let us recall that Cliquesquare groups the triples in different files inside the Hadoop Distributed File System (HDFS) based on the triples' predicates. When the number of predicates increases, this partitioning becomes less efficient.


Table 4.2: Loading times comparisons

Loading times (minutes)
Dataset      Virtuoso   RDF-3X   gStore   CliquesquareS   RDF QDAG
Watdiv100M   45         27       382      107             71
Watdiv1B     188        329      ✗        ✗               1,080
LUBM100M     284        20       ✗        143             348
LUBM500M     1,390      114      ✗        ✗               1,490
LUBM1B       ✗          ✗        ✗        ✗               3,320
Yago         608        520      ✗        ✗               815
DBLP         97         59       ✗        184             127
✗: unable to load

[Figure 4.8: Logarithmic loading times per dataset for Virtuoso, RDF-3X, gStore, Cliquesquare and RDF QDAG.]

4.4 Evaluation of the fragmentation strategies

In this section we present an evaluation of the use of forward and backward graph fragments, initially as logical entities and then as physical structures in RDF processing systems. We begin in Section 4.4.1 by verifying that the number of fragments generated for each dataset is reasonable. We show how many fragments are generated and what percentage of the dataset they represent. In addition, we show the number of fragments needed to cover eighty percent of the data. This is to show that the groups of graph fragments give a broad picture of the input data and could later be used to propose other optimization strategies. Then, in Section 4.4.2, we compare the performance of systems physically storing the data as forward or as backward graph fragments. We analyze the types of queries for which each fragmentation strategy works best. We name this section exclusive comparison since we compare the forward and backward fragmentation strategies separately. Finally, in Section 4.4.3, we compare the performance of RDF QDAG, which uses both types of fragments simultaneously, with representative systems of the state of the art.

4.4.1 Data coverage

The results of this section are shown in Table 4.3. The forward and backward graph fragments are obtained using τ = 1 and considering the structural similarity measure. In this way, we analyze the boundary cases with the maximum number of graph fragments. For all the real datasets and for Watdiv, the number of forward graph fragments is greater than the number of backward graph fragments. This is due to the fact that, on average, the outdegree of the vertices in these datasets is larger than their indegree. Thus, there are more possibilities to combine predicates as forward fragments, and the number of outward characteristic sets is greater than the number of inward ones. For both synthetic datasets, the number of graph fragments covering 80% of the data is almost invariable with respect to the size of the dataset. This makes sense because our fragmentation strategy allows us to detect the seed entities used to generate more data.


Actually, the dataset's size is increased by varying slightly the features of these seed entities, identified as forward graph fragments (≈ 1030 in Watdiv and ≈ 10 in LUBM). LUBM has a very limited number of predicates and, being a synthetic database, the generated data are not very variable. This definitely influences the number of forward and backward fragments found for this dataset (less than 40 fragments in the dataset of more than a billion triples). Overall, the number of fragments is always finite and reasonable with respect to the size of the dataset it represents. As we will see later when analyzing the allocation strategies, the use of the fragments that hold 80% of the data facilitates the use of graph partitioning algorithms to decide the location of the fragments in the system.

Table 4.3: Data coverage per dataset

Dataset      Forward Gf        Backward Gf
Watdiv100M   39,855 (1,028)    1,181 (17)
Watdiv1B     96,344 (1,030)    4,725 (19)
LUBM100M     11 (6)            13 (8)
LUBM500M     18 (9)            17 (9)
LUBM1B       36 (19)           33 (18)
Yago         25,511 (100)      1,216 (24)
DBLP         247 (10)          26 (14)
(): number of fragments covering 80% of the data

4.4.2 Exclusive comparison of fragmentation strategies

The objective of this section is to analyze the impact of organizing the data in forward or backward graph fragments and to determine which type of fragmentation strategy is more pertinent for each query type. Let us recall that RDF QDAG stores the data in forward and backward graph fragments simultaneously. At query runtime, the system moves through the inward (or outward) edges of each node in the graph to find matches for a given query pattern. This search is made very efficient thanks to the partitioning of the data into forward and backward graph fragments that include information about the direct neighbors of a node. However, in this section we are interested in evaluating both partitioning strategies separately. For this we use two scenarios, the first one, described in Section 4.4.2.1, built on a relational database, and the second one, in Section 4.4.2.2, using a graph-based system.

4.4.2.1 Incorporating our framework into a relational-based centralized store

In this section we evaluate the influence of physically storing the data as backward graph fragments in a relational database. The results described in this section are part of the experimental study of the article Reverse Partitioning for SPARQL Queries [GMBO19]. We stored RDF datasets in a relational database using three different strategies: i) a single big table of three columns (subject, predicate, object), similar to RDF-3X's and Virtuoso's strategy, ii) vertical partitioning (one table per predicate), similar to the strategy applied by SW-Store [AMMH09], and iii) our reverse partitioning strategy, gathering the data by incoming edges. We implemented a customized query parser transforming SPARQL statements into SQL in accordance with each of the previous schemas. We preferred to create the query translator instead of using the systems implementing the previous approaches in order to evaluate strictly the influence of the storage configurations. We evaluated on each schema the execution time of queries of different forms (the tested queries are available in Appendix B).
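For illustration, the following Java sketch (a hypothetical translator, far simpler than the one implemented for the study) shows how a single triple pattern such as ?x :manufacturer ?m can be rewritten for the first two schemas:

class SparqlToSql {
    // Single big table triples(s, p, o): the predicate becomes a filter.
    static String bigTable(String predicate) {
        return "SELECT s AS x, o AS m FROM triples WHERE p = '" + predicate + "'";
    }
    // Vertical partitioning: one two-column table (s, o) per predicate.
    static String vertical(String predicate) {
        return "SELECT s AS x, o AS m FROM " + predicate.substring(1); // drop ':'
    }
    public static void main(String[] args) {
        System.out.println(bigTable(":manufacturer"));
        System.out.println(vertical(":manufacturer"));
    }
}

Under vertical partitioning a BGP of n patterns becomes a join of n such small tables, which is why the predicate-bound queries of the benchmark favor that schema.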


Table 4.4: Experimental datasets in the relational-based system

Dataset     Size (GB)   #S          #P   #O          #Backward Gf
Watdiv1M    0.148       52,505      87   105,492     222
Watdiv10M   1.54        521,585     87   1,003,136   587
Watdiv20M   3.28        1,042,785   87   2,473,723   641
M: millions

Configuration setup The partitioning module, derived from the RDF QDAG loading module, is implemented in Java. The translation module from SPARQL to SQL was also implemented in Java, and the data were stored on PostgreSQL 11. The experiments were performed on the master node of the hardware described in Section 4.3.2. We tested our approach with the WatDiv framework since it offers a wide variety of queries. We generated datasets of 1, 10 and 20 million triples. More details about the datasets are shown in Table 4.4. For each of them we generated 80 queries (20 of each query type).

Experimental results The results are shown in Figure 4.9. Creating vertical partitions on the predicates gives the best execution times for the majority of queries, considering that there was no intensive intermediary indexing strategy as is the case for RDF-3X. The major drawback of the vertical partitioning strategy is the distribution of the data over the tables. The performance of the data stored as backward graph fragments is almost as good as that of vertical partitioning, especially when the dataset size grows and exploring a single table becomes more costly. The performance of backward graph fragmentation suffers in queries with patterns in which both the subject and the object are unknown. In a relational-based system, storing the data in graph fragments is ideal for pruning queries with patterns with bounded values at the object position. This is the case of queries 9-12.

[Figure 4.9: Performance of the partitioning configurations in the relational-based system (logarithmic execution times). Series: big table, binary tables, backward fragments; panels: (a) Watdiv1M (queries 1-15), (b) Watdiv10M (queries 9-12), (c) Watdiv20M (queries 9-12).]


[Figure 4.10: Query performance of forward and backward graph fragments. Execution time in seconds for queries 1-20, grouped as linear, star, snowflake and complex.]

4.4.2.2 Incorporating our framework into a graph-based distributed store

In this section we compare the performance of organizing the data as forward or as backward graph fragments in a graph-based triple store. We determine for which types of queries organizing the data as forward graph fragments is more suitable than as backward graph fragments, and vice versa. To the best of our knowledge, there is no graph-based RDF system storing the data exclusively as backward graph fragments. Indeed, RDF QDAG uses both structures (i.e., $\overrightarrow{Gf}$ and $\overleftarrow{Gf}$) and gStore stores the data in adjacency lists grouping the data by subjects. For the experiments in this section, we adapted the distributed version of gStore [PZÖ+16] to support the storage of forward graph fragments first and then of backward graph fragments. The results in this section are part of the experimental study of our framework: RDFPartSuite: Bridging Physical and Logical RDF Partitioning, published in [GMB19].

Configuration setup gStoreD was deployed on a 5-machine cluster connected by a 10Gbps Ethernet switch. The cluster runs a 64-bit Linux, and each site has 8GB of RAM, an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz processor and 100GB of hard disk. The latest version of gStoreD6 was configured to store the data in adjacency lists on the subjects, and a modified version stores the data in adjacency lists on the objects. The adjacency lists on the subjects are the default structures employed by gStore's V*-Tree index. At the loading stage, the user indicates in a file the location of each subject (the data star head in our notation, cf. Definition 3.1), and all the triples with that subject are assigned to that location id. We make sure that the triples belonging to the same forward graph fragment are located in the same partition. Then, to load the data in adjacency lists by objects, we slightly change the input file containing the raw RDF data. The data are injected into gStore in a file in N-Triples format whose triples are ordered as (subject predicate object). To force the creation of adjacency lists from the object, we change the order of the input file to (object predicate subject). We changed the file with the subjects' locations (giving the location of each distinct object) and the SPARQL queries accordingly. For example, a query pattern ?x p ?y in the original query list is changed to ?y p ?x in the queries used to test backward fragments. We performed the evaluation with the four query types (linear, star, snowflake and complex queries) of the Watdiv benchmark with 10 million triples. The list of queries is found in Appendix B.
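The reordering step can be done with a one-pass rewrite of the N-Triples file, as in this Java sketch (our simplification: it assumes whitespace-separated terms and no literals containing spaces):

import java.io.*;

// Rewrites each N-Triples line "subject predicate object ." as
// "object predicate subject ." so gStoreD builds adjacency lists from objects.
class ReverseTriples {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
            in.lines().forEach(line -> {
                String[] t = line.split("\\s+", 4);      // s, p, o, trailing "."
                if (t.length == 4)
                    out.println(t[2] + " " + t[1] + " " + t[0] + " .");
            });
        }
    }
}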

Experimental results The results are shown in Figure 4.10. The system performs much better when the data are organized as backward graph fragments when solving linear queries: it performed on average 1.4x faster than when the data were organized as forward fragments. For star queries, the use of backward fragments enhanced the performance by up to 1.2x on average. For snowflake-shaped queries, neither strategy consistently works better than the other.

6Available at https://github.com/bnu05pp/gStoreD

We found queries for which backward graph fragments worked better because there was a bounded value in the object position of the query; when this condition was not fulfilled, forward fragments worked better. Organizing the data as forward fragments was definitely better to solve the workload of complex queries (up to 1.2x faster for query 18). Both strategies complement each other in the processing of certain query types. In the following section we show that the combination of both structures in RDF QDAG makes RDF data processing in a centralized system efficient and scalable.

4.4.3 Combining fragmentation strategies

In this section we present the results of an experimental study comparing RDF QDAG with three representative systems of the state of the art. We compared them in terms of query performance (for BGP queries) and scalability. The compared systems are:

(i) Virtuoso7 [EM09]: the relational-based system par excellence, storing the data in a triple table.

(ii) RDF-3X8 [NW08]: an index-intensive system able to scale to billions of triples, used as the baseline of many optimization approaches (e.g., characteristic sets [NM11]).

(iii) gStore9 [ZÖC+14]: a graph-based system storing the data in adjacency lists.

To evaluate their scalability we used datasets of several hundred million triples. Specifically, we used Watdiv100M, Watdiv1B, LUBM500M, Yago and DBLP, whose characteristics are described in Section 4.3.1. The tests were performed on the master node of our testing platform described in Section 4.3.2. The results presented in this section are an excerpt from our paper describing RDF QDAG's main components [KMG+20]. The queries are in Appendix B of this thesis. For all the experiments, we ran the queries 3 times (excluding a first cold-start run to avoid warm-up bias). Let us start by discussing the results of the queries performed on the Watdiv benchmark, shown in Table 4.5. As explained when we compared the loading times in Section 4.3.3, gStore was unable to load Watdiv1B. Furthermore, at query runtime, gStore loads the V*-Tree index into main memory before looking for query matches, slowing down the execution of most of the queries. Complex queries in Watdiv100M are solved on average 5.3x faster in RDF QDAG compared to Virtuoso and RDF-3X. For the dataset of more than a billion triples (i.e., Watdiv1B), RDF QDAG is clearly faster at solving complex queries (130x faster on average). For snowflake queries on the Watdiv100M dataset, Virtuoso, RDF-3X and RDF QDAG performed very similarly. However, the same queries on the dataset of 1 billion triples were solved 4.9x and 12.8x faster in RDF QDAG than in RDF-3X and Virtuoso respectively. Linear queries were solved 0.4x and 45x faster in our system than in RDF-3X and Virtuoso for Watdiv1B, and all three systems performed similarly on Watdiv100M. Finally, star queries are solved faster in Virtuoso on the smaller dataset, but on average 4.9x faster in RDF QDAG than in RDF-3X and 12.8x faster than in Virtuoso on the 1 billion triple dataset. The other synthetic dataset, LUBM500M, whose results are shown in Table 4.6, could not be loaded by gStore for the same reasons as the previous one. RDF-3X performed slightly better for queries 8, 11, 12 and 14, which are snowflake-shaped queries. Still, for all the other queries RDF QDAG outperformed both systems, solving queries 9.8x and 1.1x faster than Virtuoso and RDF-3X respectively.

7http://vos.openlinksw.com/owiki/wiki/VOS 8https://gitlab.db.in.tum.de/dbtools/rdf3x 9https://github.com/pkumod/gStore

[Table 4.5: Watdiv BGP query results in seconds for Watdiv100M and Watdiv1B. gS: gStore, 3X: RDF-3X, V: Virtuoso, QD: RDF QDAG; bold: lowest execution time; ✗: error.]

[Table 4.6: LUBM500M BGP query results in seconds. 3X: RDF-3X, V: Virtuoso, QD: RDF QDAG; bold: lowest execution time.]

[Table 4.7: DBLP BGP query results in seconds. 3X: RDF-3X, V: Virtuoso, QD: RDF QDAG; bold: lowest execution time.]


[Figure 4.11: Yago BGP query results. Logarithmic execution times of Virtuoso, RDF-3X and RDF QDAG for queries Y1-Y5.]

Among the real datasets, we designed the DBLP workload with queries of different shapes and selectivities. As shown in Table 4.7, RDF QDAG outperformed RDF-3X and Virtuoso for most of the queries (1.6x faster than Virtuoso and 13.8x faster than RDF-3X). The queries for which the systems had similar performance are the very selective ones (e.g., L1, S1 and S5). Finally, the results for Yago are plotted on a logarithmic scale in Figure 4.11. RDF QDAG outperformed the other systems and managed to filter the results faster based on the known bounded values of the queries (e.g., query Y2). It achieved up to 4x faster performance compared to Virtuoso.

Summary The virtues of the logical partitioning strategy have been shown throughout these sections. Undoubtedly, a logical partitioning strategy grants much more freedom to designers and managers: they no longer depend on a specific system to partition their data. We have shown the feasibility of incorporating our framework into centralized and distributed triple stores. Furthermore, the experiments using RDF QDAG reveal that the use of these structures in a centralized native triple store ensures its scalability and a good performance for certain queries.

4.5 Evaluation of the allocation strategies

In the previous section we evaluated whether both fragmentation strategies proposed in this thesis are effective at covering the complete data in a reasonable manner. Then, we compared their effectiveness at query runtime with other physical organization strategies of the state of the art. In this section, we evaluate the allocation strategies proposed to place the forward and backward graph fragments on the sites of a distributed triple store. We start in Section 4.5.1 by comparing three allocation strategies in terms of data skewness prior to query execution. Then, in Section 4.5.2 we present an extensive study of the communication costs of state-of-the-art partitioning strategies versus the allocation heuristic proposed for our logical structures. This study is based on the execution logs of RDF QDAG. Finally, in Section 4.5.3, we compare the performance of a graph-based system, a cloud-based system and a system applying the partitioning strategy proposed in this study. Part of the results in this section were published in our work [GMB19] introducing the RDFPartSuite framework.

4.5.1 Data skewness comparison

In this section we compare the round-robin, linear programming and min-cut allocation strategies in terms of data skewness. The first strategy assigns the fragments (forward and backward) to the sites in a round-robin manner. The linear programming strategy uses a solver to find the optimal solution of the problem defined in Section 3.5.1. Finally, the min-cut strategy refers to the graph-based heuristic described in Section 3.5.2.

We give as inputs to the linear programming and min-cut strategies only the fragments necessary to cover 80% of the data (cf. Table 4.3). The other 20% are assigned in a round-robin manner. The results are shown in Figure 4.12, in which we represent the distributions obtained when 5 and 20 partitions are created for some of the datasets described in Section 4.3.1. For each partitioning strategy we show the proportion of triples assigned to each site in a stacked bar and, on a secondary axis, the precision of the distribution.

[Figure 4.12: Triple distribution by allocation method. GP: graph partitioning, RR: round-robin, LP: linear programming; panels: (a) WatDiv100M (5), (b) LUBM100M (5), (c) Yago (5), (d) DBLP (5), (e) WatDiv100M (20), (f) LUBM100M (20), (g) Yago (20), (h) DBLP (20); the secondary axis shows the precision of each distribution.]

The linear programming solution was implemented using the Gurobi10 optimization library [GO18]. To obtain a solution in a reasonable time, we set the imbalance parameter in the first constraint of the problem to ϵ = 500M triples. Even with this large tolerance, the majority of the data was allocated to one partition. Also, we validated that the round-robin distribution strategy leads to data skewness problems compared with the graph partitioning heuristic. For example, for LUBM100M, the number of triples stored on the second site is significantly larger than the number of triples on any of the other partitions. The precision of each partitioning strategy is shown on the secondary axis of the graphs of Figure 4.12. It is calculated as $\frac{w(e_c)}{w(e)}$, in which $w(e_c)$ and $w(e)$ are respectively the sum of the weights of the cut edges and the total sum of the edge weights. In other words, it is the ratio of the number of triples that share a subject (or object) in two fragments located on different sites with respect to the number of triples on the site. The results are shown as the red line of Figure 4.12. As expected, the highest precision is obtained by the linear program; however, the scalability of this program is very low and in the end it puts most of the data on a single site. The min-cut partitioning algorithm, used by systems like EAGRE [ZCTW13] to distribute forward fragments, gets better results than the round-robin strategy, not only in terms of data distribution but also in precision.
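The precision measure can be computed directly from the fragment graph and the fragment-to-site assignment, as in this Java sketch (hypothetical representation of edges and sites):

import java.util.*;

// Computes w(e_c)/w(e): edges link fragments sharing subjects/objects,
// weighted by the number of shared triples; an edge is "cut" when its
// endpoints are assigned to different sites.
class Precision {
    record Edge(int fragA, int fragB, long weight) {}

    static double precision(List<Edge> edges, Map<Integer, Integer> siteOf) {
        long cut = 0, total = 0;
        for (Edge e : edges) {
            total += e.weight();
            if (!siteOf.get(e.fragA()).equals(siteOf.get(e.fragB())))
                cut += e.weight();   // weight of edges crossing site boundaries
        }
        return (double) cut / total;
    }

    public static void main(String[] args) {
        List<Edge> edges = List.of(new Edge(1, 2, 10), new Edge(2, 3, 5));
        Map<Integer, Integer> siteOf = Map.of(1, 0, 2, 0, 3, 1);
        System.out.println(precision(edges, siteOf));  // 5/15 = 0.33...
    }
}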

4.5.2 Communication costs study

In this section, we present an extensive study of the communication costs induced by different allocation strategies. To fairly evaluate and isolate the impact of an allocation strategy at query runtime, we run the queries on the same execution engine.

10Available at: https://www.gurobi.com/downloads/gurobi-optimizer-eula/

We measured precisely, for each query, the number of intermediate results exchanged. For this purpose, we use the execution logs of RDF QDAG [KMG+20]. Even though it is a centralized system, we collected for each query the amount of data sent from one fragment to another, as explained when we described Figure 4.6. Then, we calculated the total number of mappings exchanged between fragments located on different sites. We analyzed the communication costs of the following allocation techniques:

• Hash on subject (H): as shown in the Related Work section, several systems follow this strategy, applying a hashing function to the triple's subject (a minimal sketch of this strategy and of round-robin allocation is given after this list).

• Linear programming (LP): the problem formalized in Section 3.5 was programmed and solved with the Gurobi [GO18] optimization library, as in the previous section.

• Graph partitioning (GP): the data are first grouped into graph fragments that are distributed using a graph partitioning heuristic (METIS [KK98a]).

• Graph partitioning with small groups (GPS): in this strategy, the size of the graph fragments was restricted to form as many fragments as possible. These smaller fragments are then distributed with a graph partitioning heuristic. We seek to see whether regulating the size of the partitions influences the overall performance of the system.

• Round-robin allocation (RR): the data, first grouped into graph fragments, are distributed in a round-robin manner.
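As announced above, here is a minimal Java sketch contrasting the two simplest strategies (hypothetical types; H assigns a triple by hashing its subject, RR cycles fragments over the sites):

import java.util.*;

class Allocation {
    // H strategy: the site of a triple is a hash of its subject.
    static int hashOnSubject(String subject, int nbSites) {
        return Math.floorMod(subject.hashCode(), nbSites);
    }
    // RR strategy: fragments are dealt out to the sites in turn.
    static Map<Integer, Integer> roundRobin(List<Integer> fragmentIds, int nbSites) {
        Map<Integer, Integer> siteOf = new HashMap<>();
        for (int i = 0; i < fragmentIds.size(); i++)
            siteOf.put(fragmentIds.get(i), i % nbSites);
        return siteOf;
    }
    public static void main(String[] args) {
        System.out.println(hashOnSubject(":Airplane1", 5));
        System.out.println(roundRobin(List.of(1, 2, 3, 4, 5, 6, 7), 5));
    }
}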

All of the previous allocation strategies were evaluated on the Watdiv benchmark since it offers queries with different structures (complex, snowflake, linear and star). We generated a dataset of about 100k triples to guarantee that the linear program finds a solution in a reasonable time. We partitioned the data into 5, 10 and 20 sites to evaluate the performance of the solutions as the number of sites grows. The results comparing the number of mappings exchanged by fragments on different sites for each query are shown in Figure 4.13. The linear programming strategy (identified as LP in the graphs) is, as expected, the one with the lowest number of mappings exchanged between sites for most of the queries. There are still some queries where the graph partitioning strategy with smaller groups (GPS) or hashing by subject performed better. This is because the function optimized by LP considers moving the initial graph fragments from one site to another, but does not consider making smaller fragments. For those queries, making smaller fragments reduced the number of exchanged mappings (e.g., queries 1 and 13 in the case of 5 partitions). The hashing strategy performed well for linear queries but induced several exchanges for the other query types. In general, for linear queries we did not find an absolute winning strategy; since these queries are quite simple, the allocation strategy did not considerably change the number of mappings. The round-robin allocation strategy was the one with the highest number of mappings for all of the query types in the tests with 5, 10 and 20 partitions. Furthermore, the plain graph partitioning strategy offers a reasonable performance without modifying the sizes of the loaded groups. This strategy worked best for snowflake and star-shaped queries. Another interesting finding was that the number of generated partitions (e.g., 5, 10 or 20) did not influence the number of mappings sent between sites. Our experiments show that allocation strategies like graph partitioning heuristics are a good compromise to reduce the amount of data interchanged at query runtime. Still, if the queries are mostly simple linear queries, a hashing function has also shown an advantageous performance. Another essential insight is that regulating the size of the groups of graph fragments is crucial to achieve a good performance, as shown by the number of exchanged results in complex queries.


[Figure 4.13: Mappings analysis for Watdiv100k. Logarithmic number of mappings exchanged per query (complex, snowflake, linear, star) for (a) 5 partitions, (b) 10 partitions and (c) 20 partitions. LP: linear programming, RR: round-robin, H: hashing on the subject, GPS: graph partitioning with small groups, GP: graph partitioning.]


4.5.3 Distributed experiments

We used synthetic and real datasets to compare three systems with distinct partitioning and execution strategies. Contrary to the experiments detailed in Section 4.4.3, we compare RDF QDAG with two distributed triple stores. Our goal is not to compare the performance of systems physically storing the data in different structures, as in Section 4.4.3. Here, our focus is to compare systems distributing the triples with different strategies. The compared distribution strategies are: i) cloud-based, ii) simple hashing and iii) our allocation strategy of graph fragments. The tested systems are:

(i) gStoreD [PZÖ+16]: a graph-based system partitioning the data by subject. We use the centralized version of this system as a reference as well.

(ii) Cliquesquare [GKM+15]: a cloud-based system fragmenting the data by predicates. We configured the system in single-node and multi-node modes.

(iii) RDF QDAG [KMG+20]: a centralized system partitioning the data into forward and backward graph fragments.

4.5.3.1 Configuration setup

The experiments were conducted on a cluster of 5 machines connected by a 10Gbps Ethernet switch. The cluster runs a 64-bit Linux, and each site has 16GB of main memory (32GB on the master node), an Intel(R) Xeon(R) Gold 5118 @ 2.30 GHz processor and 100GB of hard disk (2TB on the master node). As in the communication study, we use Yago and DBLP as real datasets, and Watdiv100M and LUBM100M as synthetic datasets. Also, to be able to compare with gStore, which was unable to load LUBM100M, we also tested LUBM with 20 million triples.

4.5.3.2 Experimental results

The results are shown in Tables 4.8-4.10. We use the following abbreviations for each system: Cliquesquare in a single node (Cs), Cliquesquare in multiple nodes (CsD), gStore (gS), distributed gStore (gSD) and RDF QDAG. In the tables, we show the execution times in seconds, highlighting in bold the best result for each query. Below we discuss the results for each dataset.

Watdiv For most of the queries, the execution time of RDF QDAG is up to 60x lower than that of all the other systems. It performed on average 400x, 40x, 348x, and 40x faster for linear, star, snowflake and complex queries respectively. However, the complex query C3, which involves several graph patterns without any known subject or object, was performed almost 2x faster by gStore. This dataset fits in main memory and therefore, for in-memory systems (like gStore), the match processing is faster, especially for queries that are not very selective or that do not have a bounded subject or object that RDF QDAG could use to prune. RDF QDAG's processing is based on the Volcano execution model, exchanging data between the disk and main memory to guarantee scalability while at the same time ensuring a good performance. As shown next with the real datasets, gStore was not able to handle datasets whose size is bigger than the available main memory.

LUBM The results for both datasets are shown in Table 4.9. RDF QDAG outperformed all the other tested systems (centralized or distributed versions) for most of the queries on both dataset sizes. As shown by the results in Table 4.9, for the 20 million triple dataset, Cliquesquare (single or multi-node) is the slowest system due to the start-up cost of the MapReduce engine, which is not offset by the performance gain of parallelism.

Even if the gStore systems (centralized and distributed) are able to load the 20M dataset, RDF QDAG performs better for all queries except the sixth, which is a simple but not very selective query. However, the same query on the dataset of 100 million triples is solved 15x faster by RDF QDAG. For this dataset of 100 million triples, RDF QDAG is faster for almost 80% of the queries. The queries (2, 5, 6, 14) for which Cliquesquare performed on average 2.4x faster than RDF QDAG were simple, unselective queries with many results. In this case, the parallel processing offered by MapReduce in Cliquesquare works best. Let us recall that RDF QDAG partitioned the graph but worked in a centralized configuration; the parallel version of this system is part of our future perspectives.

Yago and DBLP The results are shown in Table 4.10. On Yago, Cliquesquare and the centralized version of gStore were unable to process the tested queries. RDF QDAG outperformed gStoreD by almost 1000x. Concerning DBLP, the distributed versions of gStore and Cliquesquare, as well as RDF QDAG, were able to process queries on this dataset. RDF QDAG performed on average 500x better than all the tested systems.

Summary These experiments complement those that evaluated data fragmentation. We showed the advantages of our allocation strategies in terms of balanced data distribution and query performance. Our experimental study showed that incorporating our logical partitioning strategies into a centralized triple store can make it more performant than many of the current distributed systems. Our partitioning strategies enable a graph-based system (RDF QDAG) to scale, contrary to other centralized graph-based systems like gStore.

[Table 4.8: Execution time (in seconds) for Watdiv100M queries C1-C3, F1-F5, L1-L5 and S1-S7.]

[Table 4.9: Execution time (in seconds) for LUBM queries 1-14 on the 20M and 100M datasets.]

[Table 4.10: Execution time (in seconds) for DBLP queries (C1-C3, L1-L4, S1-S4) and Yago queries (1-5).]

Dataset (Ds.), System (Sys.), Cliquesquare in a single node (Cs), Cliquesquare in multiple nodes (CsD), gStore (gS), distributed gStore (gSD) and RDF QDAG (QDAG). Bold cells represent the lowest execution time; ✗ indicates that the system was unable to perform the query.


4.6 Partitioning language

The creation of forward and backward graph fragments before partitioning a raw RDF file contributes to integrating a logical dimension into the purely physical partitioning process. This logical layer allows designers to deal with much more manageable structures than the ones offered by single triples. These structures can be used not only by system developers but also by triple store managers. In this section, we present a user-friendly extension to the RDFPartSuite framework. We propose a declarative partitioning language that allows designers, based on their expertise level, to manually partition a KG. The declarations are made similarly to those of tables and partitions with a DDL (data definition language) available in some database management systems (e.g., Oracle11, PostgreSQL12). The language uses the logical structures and hides the implementation of the fragmentation and allocation algorithms presented in the previous chapter. In the following section we define the grammar and syntactical elements of this language. RDFPartSuite keeps a meta-data file with information related to the managed KGs. This includes their names and the identifiers of their fragments and predicates. The specifications of the infrastructure used by the triple store (number of sites, available space on each site, IP of each site, etc.) are also kept in a separate file by the framework.

4.6.1 Notations

Before defining the grammar, let us consider the notation used to define the syntax of the definition language. This declarative language is not case-sensitive. While describing the grammar, we write some commands in uppercase just to facilitate readability.

– ⟨element⟩: nonterminal element;

– element: terminal element;

– [element]: optional element;

– {element}: is an element that can be repeated 0 or n times;

– | : represents an alternative;

– ⟨element list⟩: an element defined as ⟨element⟩{, ⟨element⟩}.

The main lexical elements are described as follows:

⟨non quote characters⟩ ::= {a-z}
⟨quote characters⟩ ::= '{a-z}'
⟨unsigned integer⟩ ::= ⟨init digit⟩[{⟨digit⟩}]
⟨unsigned float⟩ ::= ⟨init digit⟩[{⟨digit⟩}].{⟨digit⟩}
⟨digit⟩ ::= 0|1|2|3|4|5|6|7|8|9
⟨init digit⟩ ::= 1|2|3|4|5|6|7|8|9

The language is composed of seven main statements from the declaration of the KG to the allocation of graph fragments to the sites of the system. Each of these statements is described next.

11https://docs.oracle.com/cd/E18283_01/server.112/e16541/part_admin001.htm 12https://www.postgresql.org/docs/10/ddl-partitioning.html

143 CHAPTER 4. RDFPARTSUITE IN ACTION

4.6.2 CREATE KG statement

This statement creates an empty Knowledge Graph and stores its name in a metadata file maintained by the framework. If the KG already exists (identified by kg name), it returns an error.

⟨kg definition⟩ ::= CREATE KG ⟨kg name⟩
⟨kg name⟩ ::= ⟨non quote characters⟩

Let us consider for example the following statement declaring a KG named yago.

CREATE KG yago

4.6.3 LOAD DATA statement

This statement indicates that the data in a file (in .NT or .ttl format) belong to an existing knowledge graph.

⟨load data⟩ ::= LOAD DATA INFILE ⟨file name⟩ INTO ⟨kg name⟩
⟨file name⟩ ::= ⟨non quote characters⟩

In the previous statement, the file name is that of the raw RDF file stored in the working directory13 of the framework. For example, the following statement states that the triples of the file yago-knowledge.nt belong to the yago KG declared previously.

LOAD DATA INFILE yago-knowledge.nt INTO yago

4.6.4 FRAGMENT KG statement

This statement fragments the data of an existing KG into forward or backward graph fragments.

⟨fragment kg⟩ ::= FRAGMENT KG ⟨kg name⟩ FORWARDLY [⟨partition options⟩]
              | FRAGMENT KG ⟨kg name⟩ BACKWARDLY [⟨partition options⟩]
⟨partition options⟩ ::= BY ⟨fragment mode⟩ [⟨max frag⟩][⟨similarity⟩]
⟨fragment mode⟩ ::= SUPERSET | JACCARD | TFIDF | LABEL | ANCESTOR
⟨max frag⟩ ::= MAX ⟨unsigned integer⟩
⟨similarity⟩ ::= SIM ⟨unsigned float⟩

The language allows partitioning a KG (identified by kg name) into forward or backward graph fragments. The default parameter values are structural partitioning (using characteristic sets only), a similarity threshold of 1, and a maximum size of 50 million triples per partition. The fragmentation mode can be changed to any of the modes presented in Section 3.3.1.1. It is also possible to limit the maximum size of fragments (in number of triples) using the keyword MAX. Finally, it is possible to change the similarity threshold (SIM) defined in Sect. 3.3.1.1.

13Indicated in a configuration file

This value is a float between 0 and 1; if the user enters a similarity threshold greater than 1, the framework reports an error and aborts the command. For example, the yago KG declared previously is partitioned below into forward fragments using the JACCARD mode, restricting the number of triples per fragment to 10,000 and setting the similarity threshold to 0.8.

FRAGMENT KG yago FORWARDLY BY JACCARD MAX 10000 SIM 0.8
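All partition options are optional and may be left to their defaults. As a sketch, the minimal statement

FRAGMENT KG yago FORWARDLY

would then behave like the fully spelled-out form below, where mapping the structural default to the SUPERSET keyword is our assumption (the text does not name the default mode keyword):

FRAGMENT KG yago FORWARDLY BY SUPERSET MAX 50000000 SIM 1.0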

4.6.5 ALTER FRAGMENT statement

A fragment that has become too large can be re-split using the methods described in Sect. 3.7.

⟨alter fragment⟩ ::= ALTER FRAGMENT ⟨frag id⟩ IN ⟨kg name⟩ SUBPARTITION BY ⟨subpartition mode⟩
⟨frag id⟩ ::= ⟨unsigned integer⟩
⟨subpartition mode⟩ ::= HASH | ⟨range⟩
⟨range⟩ ::= RANGE ⟨pred id⟩ LESS THAN ⟨range value⟩
⟨pred id⟩ ::= ⟨unsigned integer⟩
⟨range value⟩ ::= ⟨unsigned integer⟩ | ⟨unsigned float⟩

All fragments and predicates are identified by an integer identifier (frag id and pred id, respectively) automatically generated by RDFPartSuite. A fragment can be sub-partitioned using two main modes: a hashing function or range partitioning, as defined in Section 3.7. Consider the example in which we want to partition the fragment of the yago KG identified by ID = 345. The fragment is split into two parts according to the values of the predicate identified by ID = 8 (whose range is the integers).

ALTER FRAGMENT 345 IN yago SUBPARTITION BY RANGE 8 LESS THAN 4000
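The HASH mode takes no further arguments. A minimal sketch re-splitting the same fragment with the default hashing function of Section 3.7 would be:

ALTER FRAGMENT 345 IN yago SUBPARTITION BY HASH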

4.6.6 ALLOCATE statement

This statement allocates the fragments created by the previous instructions to the sites of the system, either according to a round-robin rule or using the graph partitioning heuristic described in Section 3.5.2.

⟨allocate fragments⟩ ::= ALLOCATE KG ⟨kg name⟩ BY ⟨allocation mode⟩ ⟨nb sites⟩
⟨allocation mode⟩ ::= ROUND-ROBIN | METIS
⟨nb sites⟩ ::= SITES ⟨unsigned integer⟩

For example, continuing our running example, we can allocate the fragments of the yago KG to 5 sites using the graph partitioning heuristic.

ALLOCATE KG yago BY METIS SITES 5


4.6.7 ALTER ALLOCATION statement

This statement allocates a given fragment (identified by a fragment ID) to a specific site (identified by a site ID).

⟨alter allocation⟩ ::= ALTER ALLOCATION FRAGMENT ⟨frag id⟩ IN ⟨kg name⟩ ⟨site id⟩
⟨site id⟩ ::= TO SITE ⟨unsigned integer⟩

Continuing the yago example, we change the allocation of fragment 2 to the site identified by ID 5.

ALTER ALLOCATION FRAGMENT 2 IN yago TO SITE 5

4.6.8 DISPATCH statement

This statement calls the dispatcher component of RDFPartSuite. It sends the fragments of a KG to the sites according to the allocation schema, and can therefore be called only after the allocation schema has been created with the ALLOCATE statement. The data are sent to the sites indicated in the configuration file of the framework.

⟨dispatch allocation⟩ ::= DISPATCH KG ⟨kg name⟩

In our example, the KG is dispatched to the sites of the system as follows:

DISPATCH KG yago
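For illustration, the statements of this section can be chained into a complete partitioning session. The sketch below reuses the running yago example; the fragment, predicate, and site identifiers are illustrative:

CREATE KG yago
LOAD DATA INFILE yago-knowledge.nt INTO yago
FRAGMENT KG yago FORWARDLY BY JACCARD MAX 10000 SIM 0.8
ALTER FRAGMENT 345 IN yago SUBPARTITION BY RANGE 8 LESS THAN 4000
ALLOCATE KG yago BY METIS SITES 5
ALTER ALLOCATION FRAGMENT 2 IN yago TO SITE 5
DISPATCH KG yago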

4.6.9 Integration of the language with other systems

In this section we show that the proposed language can express the partitioning strategies used by most distributed RDF systems. For example, HadoopRDF [DWNY12], which applies a hashing function on the subject, is mimicked by creating forward star graph fragments that are allocated in a round-robin fashion. If there are 15 sites in the system, the partitioning is declared as:

FRAGMENT KG yago FORWARDLY
ALLOCATE KG yago BY ROUND-ROBIN SITES 15

If the system uses vertical partitions (e.g., SW-Store [AMMH09]), backward graph fragments are declared instead. For example, the declaration of backward fragments with a maximum partition size of 1000 triples is:

FRAGMENT KG yago BACKWARDLY BY SUPERSET MAX 1000
ALLOCATE KG yago BY ROUND-ROBIN SITES 5

The gStoreD [PZÖ+16] system creates the forward entities and leaves the user free to choose the allocation strategy. Assuming a graph partitioning heuristic is chosen, the declaration is:

FRAGMENT KG yago FORWARDLY
ALLOCATE KG yago BY METIS SITES 15

The RDF QDAG [KMG+20] system uses both forward and backward fragments and allocates them with a graph partitioning heuristic, declared as:


FRAGMENT KG yago FORWARDLY
FRAGMENT KG yago BACKWARDLY
ALLOCATE KG yago BY METIS SITES 15

4.7 RDF partitioning advisor

In this section we present a partitioning wizard that integrates, in separate modules, each of the fragmentation and allocation approaches proposed so far. Our goal is to provide user-friendly tools for both experienced and non-expert users. An expert user (manager) can integrate the interfaces of RDFPartSuite directly into her/his code, or manually partition the data using the declarative partitioning language proposed in the previous section. Our partitioning advisor is designed to assist non-expert designers or administrators of centralized or distributed triple stores. It produces a partitioning schema for a KG based on the user's input requirements, providing a degree of comfort similar to that offered by partitioning advisors in the relational model (Sect. 1.4.3.1).

4.7.1 Main functionalities

RDFPartSuite’s advisor functionalities are summarized as follows:

• Collect manager's requirements: the advisor provides means for managers of centralized and distributed triple stores to record the creator's requirements (functional and non-functional), constraints, and the KGs to be considered in the fragmentation, allocation, and loading processes.

• Smart KG Fragmentation: the advisor proposes a fragmentation schema for the input KG considering the manager’s explicit requirements and using default parameters (e.g., similarity threshold).

• Smart KG Allocation: an allocation schema is proposed by the advisor. This schema considers the manager's requirements made explicit via the first functionality.

• Tune fragmentation and allocation: the advisor leaves the final decision of how to fragment and how to allocate to the managers. The manager can choose fragmentation and allocation strategies different from the ones proposed automatically by the advisor. Additionally, the advisor lets the manager change the values of the input parameters and manually sub-partition large graph fragments.

• Load the data to a triple store: this functionality concerns the data transfer and loading to the target triple store. At the moment we support RDF QDAG and gStoreD.

A summary of the functionalities of the system is shown in Figure 4.14. Basically, the manager seeks to fragment a given raw RDF dataset into graph fragments. The tool adjusts the partitioning strategy according to a set of configuration parameters (e.g., system catalog, constraints). It allows exploring the set of fragments and their distribution and, if necessary, adjusting the fragmentation strategy or re-partitioning some fragments. Finally, the set of graph fragments is distributed/loaded to the sites of the system. The following section describes the architecture of this system, which we name RDFPartSuite GUI.


(Use-case diagram: the system manager can fragment raw RDF data, explore fragmented/allocated data, tune the fragmentation, allocate fragments, and transfer/load fragments.)

Figure 4.14: Partitioning advisor use case

4.7.2 System architecture

In this section we describe the architecture of the RDFPartSuite advisor illustrated in Figure 4.15. It is composed of five modules: the features extractor, fragmenter, allocator, dispatcher, and the Graphical User Interface (GUI). The framework provides a Web user interface to manage the partitioning process, similar to the interfaces available to administer current big data frameworks. The back-end layer was implemented in the Java programming language, chosen for its robustness, portability, and available libraries. The communication between the back end and the graphical interface is performed through a REST API. The back end is built following the MicroProfile specification using JAX-RS and CDI components. The client layer (i.e., front end) is implemented with Vue.js and Bootstrap. Below, we detail each of the RDFPartSuite components.

4.7.2.1 Features extractor

This component is in charge of processing the information collected from the GUI. It gathers, for instance, the configuration parameters for the fragmentation algorithms, such as the similarity threshold, the type of similarity score, and the allocation method. It also processes information related to the system's hardware (e.g., available hard-disk space and main memory per site).

4.7.2.2 Fragmenter

This component is in charge of creating the forward and backward graph fragments. It groups the data following Algorithms 1 and 4, respectively. By default it creates groups of triples using the structural similarity measure on the characteristic sets. The similarity threshold is initially fixed to 1, but it can be tuned in the next component. This component also calculates a set of statistics characterizing the fragments: it counts the number of triples per fragment and computes the distribution of the data over the fragments. It also detects which fragments should be re-partitioned because their size exceeds the maximum available space on a single site. It communicates with both the fragmenter and dispatcher components, since it collects the information used to create the plots that guide the user in choosing the optimal fragmentation and allocation strategies. For the distributor component, it calculates the precision of the distribution from the relative number of cuts across the sites.


(Architecture diagram: a front-end Web application (GUI client) communicates through a REST API with the back-end modules: the features extractors, fed by the configuration parameters and system catalog, the fragmenter, the allocator, and the dispatcher/loader.)

Figure 4.15: Partitioning advisor architecture

4.7.2.3 Allocator

This component first computes the distribution of the final graph fragments (the ones created after re-partitioning) to the sites of the system. It lets the user choose between the graph partitioning heuristic and the traditional hashing function. It sends the amount of data exchanged between two partitions to the statistics generator component, which calculates the allocation's precision. The output of this component is a file indicating the site to which each fragment should be assigned.
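The chapter does not fix the layout of this file; a minimal sketch, assuming one fragment identifier followed by its assigned site identifier per line, could be:

345 1
346 3
2 5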

4.7.2.4 Dispatcher/Loader

This component is in charge of sending the fragments to the sites of the distributed system according to the allocation schema produced by the previous component. It connects to the loading module of a centralized (RDF QDAG) or distributed triple store (gStoreD) and loads the data according to the fragmentation schema (in the centralized case) or the allocation schema (in the distributed case).

4.7.3 Use case

In this section we present the functionalities of RDFPart following the sequence of screens in the Web graphical user interface. For each screen we show the expected inputs, options, and outputs of the processing modules in the back end. The welcome screen of our interface is shown in Figure 4.16. In this interface, the user selects the raw RDF file to be partitioned (in N-Triples format), then selects the directory in which to store the temporary files generated during the fragmentation stage. By default, this directory is set to the directory of the raw RDF file. Next, the screen collects information about the infrastructure on which the data will be partitioned: the number of sites (workers and master) and their available hard-disk space and main memory. If the master site has different specifications (it sometimes has more main memory and disk), this information is recorded in a separate form. Once this data is set, the Start Partitioning action button is activated and the system performs an initial scan of the data to encode it and detect the distinct predicates. Finally, the screen allows selecting the domain and range of each predicate, which by default are set to strings. After this stage, the fragmentation using Algorithms 1 and 4 begins.


Figure 4.16: RDFPart advisor’s welcome screen

The fragmentation progress can be tracked with the progress bar of the second screen, shown in Figure 4.17a. Once the fragmentation process finishes, the forward and backward graph fragments are persisted to a temporary directory and a set of statistics is displayed. These statistics include the number of distinct predicates and the numbers of forward and backward graph fragments. By default the fragmentation is performed using the structural similarity measure (by far the simplest). We also display, in a line chart, the distribution of the total data over the forward and backward fragments. This chart shows, for the total number of fragments (normalized to a percentage), the percentage of the total data stored in them. This visualization makes it possible to detect whether most of the data is condensed into a few fragments and should therefore be redistributed. The re-fragmentation module is illustrated in Figure 4.17b. This screen allows changing the default structural similarity score and selecting a different similarity threshold. Since the interface was developed with Vue.js, the chart of Figure 4.17a changes when the similarity score is changed. A comparison between the total number of fragments of the default fragmentation and of the tested one is shown in a horizontal bar chart in Figure 4.17b. Finally, the fragments whose size exceeds the available space on a single site are shown at the bottom of the screen in the Big fragments section. This menu allows choosing the re-partitioning strategy, which by default is set to a hashing function as described in Section 3.7. Alternatively, the data of a fragment can be re-partitioned using a single predicate and the definition language of the previous section. To save the changes the user clicks the Re-partition! button.

The final screen concerns the allocation of the forward and backward fragments to the sites of the distributed system; it is shown in Figure 4.18. When a workload is available, the user selects the directory storing the SPARQL query files (one file per query). After clicking the Test allocation button, the tool uses a customized precision function to compare the distribution obtained when the fragments are allocated with a hashing function on the predicates against the one obtained with the graph partitioning heuristic defined in Sect. 3.5. The final allocation method is selected from the right-hand drop-down list, and the tool generates in the temporary directory a file with each graph fragment identifier and its location ID.


(a) RDFPart initial statistics

(b) RDFPart re-fragmentation module

Figure 4.17: RDFPart statistics and re-fragmentation modules

Figure 4.18: RDFPart allocation interface


4.8 Conclusion

In this chapter we presented our framework in action. We used the fragmentation and allocation modules to load data into a centralized and a distributed triple store, and performed a series of tests on each system to show the strengths, features, and limits of our framework. We started by giving an overview of the centralized RDF QDAG triple store, which stores data as forward and backward graph fragments. We integrated our findings into the loading module of this system and compared the loading costs of RDF QDAG with other representative systems of the state of the art in terms of time and scalability. We showed that the fragmentation process with our entities (Gf) is feasible, even on architectures with limited resources. Even though the loading times were high compared to lighter fragmentation techniques (e.g., the triple table), this extra cost pays off at query runtime. Then, we performed a complete evaluation of the fragmentation strategies, first showing that the number of fragments remains reasonable with respect to the datasets' sizes. Next, we used our framework to fragment the data into forward and backward graph fragments in a graph-based distributed triple store (gStoreD), which we used to compare the forward and backward fragmentation strategies. We then showed that combining both types of fragments in a single system contributes not only to scalability but also to query performance. We performed an extensive study of the communication costs, comparing our partitioning strategy with the most relevant state-of-the-art techniques, and showed that our strategies offer a promising compromise between the allocation cost, measured in terms of loading time, the distribution's quality, and query performance. Finally, we introduced user-friendly extensions to our framework: a declarative partitioning language built on top of the logical structures of the previous chapter, and a partitioning advisor.

Conclusions and Perspectives

Conclusion

Currently, our world is filled with data, a large part of which is freely available on the Web. The W3C devoted many efforts to developing standards that facilitate the exploitation, exchange, and reuse of this data. Among these standards, RDF excels for its flexibility, simplicity, and the expressiveness of its query language (SPARQL). RDF statements are triples logically represented as a graph. These graphs interconnect data coming from multiple datasets, constituting a Knowledge Graph (KG). The complexity of the latter concerns both its size and its ecosystem, which involves several actors (e.g., creators, consumers, managers) and infrastructures. To satisfy the requirements of these actors, each KG has to be efficiently stored and queried to facilitate the different processes of its exploitation. As Clifford Stoll put it, "Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom"; reaching wisdom requires the availability of both data and knowledge. The storage infrastructure, and specifically the triple stores, is a cornerstone in satisfying these needs. This pushed researchers from diverse communities (e.g., Semantic Web, Core Databases, Information Systems Infrastructures) to propose a large collection of triple stores. Some are built on top of existing solutions (like RDBMSs), others are built from scratch using customized storage implementations (e.g., clustered indexes, adjacency lists). They all have in common the use of physical optimization strategies that are specific to each system. These strategies usually ignore the underlying graph-based logical schema of KGs, far from the optimizations based on logical structures broadly studied in relational databases. Among these strategies, data partitioning stands in the first line. Data partitioning grants managers means to enhance the performance, manageability, and availability of centralized triple stores by splitting the data into parts (fragments). Still, data partitioning in centralized triple stores has not had its rightful place in the design of these systems. To scale to large KGs, many parallel and distributed triple stores came to light; in these systems, data partitioning becomes a mandatory step. Many systems offer managers the possibility to control the partitioning of their data (opaque systems) while others delegate this task to a host such as a cloud provider (transparent systems). For the former, several partitioning modes exist (hashing, graph partitioning, semantic hashing, etc.); they are applied at the triple level, distributing the data at its finest granularity. Recently, there has been a research boom studying the performance impact of these data placement strategies in triple stores [JSL20, ANS18]. The need to address the problem of RDF data partitioning is evident and crucial for performance and manageability. Our vision encourages the reuse and reproduction of the strong points of data partitioning applied in the database generations (relational, object-oriented, XML, data warehouses, etc.) that preceded RDF databases. To do so, it is necessary to make the partitioning environment explicit, identifying its strengths, its evolution, and how its main elements can be reused in the context of KGs.

We surveyed the data partitioning problem to establish the bases that guided the development of our partitioning framework. It is important to notice that reproducibility is the hallmark of scientific research. A recent initiative from the Information Systems Journal (Elsevier) identified that much of the research published in computer science journals and presented at conferences cannot be readily and fully reproduced, which motivated the editors of this journal to establish a new article type: the Invited Reproducibility Paper14. This reproducibility concerns only experiments. At the same time, several valuable conceptual ideas and visions already published and implemented in systems/products over the last decades are not well reproduced in current trends in the database world. The wealth of data partitioning findings in traditional databases is evidenced by the availability of a large body of partitioning techniques, algorithms, data structures, optimizations, deployments (in commercial and open-source DBMSs), advisors, connections with other optimization techniques (such as indexes, materialized views, etc.), fragment manipulation languages, adaptations to different database generations, and a large number of Ph.D. theses (a quick search for "Fragmentation Bases de Données Informatique" on theses.fr, a website listing all Ph.D. theses defended in France since 1985, returns around 875 theses). With this thesis, we attempt to promote the vision of reproducing the findings of data partitioning in the context of triple stores. This vision is embedded in our RDF data partitioning framework for centralized, distributed, and parallel triple stores. The framework follows the outside-the-DBMS paradigm so it can be reused, improved, and extended, as we will discuss later. Our framework works on a logical representation of the fragments: we define a set of logical entities that allow partitioning KGs independently of the storage representation of each system. Finally, we showed how this framework was introduced into centralized and distributed triple stores, and we conducted several experiments demonstrating the scenarios in which these partitioning strategies are most efficient. Our contributions are summarized as follows.

• Data partitioning survey. We provide a complete survey of the partitioning problem in relational databases. We clarified the foundations of the problem and its evolution across all database generations. We characterized the problem through a star-schema-like representation of data partitioning defining 10 main dimensions: (1) Partitioning type, (2) Algorithm, (3) Main objective, (4) Adaptability, (5) Mechanism, (6) Cost model, (7) System element, (8) Platform, (9) Constraints, and (10) Data model. These dimensions make the elements of the partitioning environment explicit, and they are used to classify the surveyed works. Secondly, since we are dealing with Knowledge Graphs, we cannot disregard how graphs have been treated in the literature; we surveyed the logical representation, storage, and querying strategies of this data model. Then, we focused on RDF systems, giving some background concepts and an overview of the current storage and partitioning strategies applied by triple stores. Thanks to both surveys, we found that the study of data partitioning in triple stores has not followed the same philosophy as in relational databases. The partitioning environment for RDF datasets is still not made explicit. Several methods exist to partition an RDF dataset, but they are hardly reproducible or extensible to other systems, because these methods depend on the physical storage structure of triples in a given triple store.

• Foundations of RDF Logical Partitioning. Drawing on the lessons learned from the partitioning environment in relational databases, our first concern was the definition of logical structures to partition a KG. These structures should (i) offer more expressiveness than single triples, (ii) represent the overall structure of the data, and (iii) maintain its graph connectivity. In this thesis, we formally defined these structures.

14https://www.elsevier.com/connect/new-article-type-verifies-experimental-reproducibility


We distinguish two kinds: forward graph fragments $\overrightarrow{Gf}$ and backward graph fragments $\overleftarrow{Gf}$. They correspond to the notions of horizontal and vertical partitions in the relational model, respectively; however, they are defined on graph abstractions and not on tables. We proved that partitioning a KG into graph fragments enforces the correctness rules of fragmentation [ÖV11]. Also, graph fragments are independent of the storage structure of a triple store (table- or graph-driven). Besides, since they are built from the original KG and not from the storage representation of the triples, they preserve the inherent graph structure of a KG.

• RDFPartSuite Framework. We define a common partitioning framework for centralized and parallel triple stores. Our framework follows the outside-the-DBMS paradigm (outside the triple store in our case), which lets users reuse and extend it independently of a given triple store. The framework is composed of three main modules: fragmenter, allocator, and dispatcher. The first module partitions a given dataset into fragments according to their logical representation as forward or backward graph fragments. This logical representation is sent to the target triple store (if the system is centralized) or to the allocator module (in distributed systems). The allocator assigns the fragments to the different sites of the system using data-driven algorithms; it outputs an allocation schema that is sent to the dispatcher, which is in charge of distributing the triples according to this schema. Finally, our framework offers managers support tools that take the manager's level of expertise into account when designing a partitioning scheme. For beginners, it offers a partitioning advisor that automatically chooses the partitioning scheme; for experts, a declarative partitioning language to partition a KG manually.

• RDFPartSuite in Action. Our framework was first incorporated into the loading module of a centralized triple store, RDF QDAG [KMG+20], currently developed in our LIAS laboratory. We showed the scalability of the fragmenter module and compared the performance of non-partitioned centralized stores with that of this system leveraging data partitioning. Next, we used our framework to load data into a distributed triple store (gStoreD [PZÖ+16]) and compared the query performance of configurations in which the data are loaded as forward versus backward graph fragments. Our experimental study revealed that the use of these structures in triple stores not only grants designers more freedom, but also ensures scalability and good performance for certain queries.

Final Takeaways

In this section, we share the experiences that led to the vision materialized in our partitioning framework. Our aim is to give future researchers an overview of the process followed in the development and implementation of our framework so they can reuse and improve it in their future work. The need for a partitioning framework arose after exploring the papers and source code of several triple stores. Our initial aim was to reproduce and understand how the systems tested in several surveys (e.g., [AHKK17, KM15]) work. Many of these systems had not been maintained for a long time and have barely been used in real deployments. Testing every single system was impossible, since the number of existing triple stores is huge; improving an existing triple store is often more expensive than starting from scratch. At this point, we asked why this has been the case for triple stores and not for RDBMSs. Certainly, there are many RDBMSs on the market, yet the study of their optimization strategies remains anchored to the relational model.

We decided to reproduce a similar methodology for the design of triple stores. We focused on data partitioning because this optimization strategy has recently gained momentum in centralized and, especially, in parallel and distributed triple stores. Our goal was then to review in detail the main components of data partitioning in order to reuse them in the context of triple stores. We sought to imitate the relational partitioning methodology for RDF data, not the exact implementations used by RDBMSs. Our idea is to take advantage of the many lessons learned over the years on the creation of partitions in the relational model and incorporate them into the creation of our framework. Our framework can be improved and extended; the main perspectives of our work are detailed in the following section.

Perspectives

Extensions to our framework

Calibrator component: The calibrator component would be in charge of refining the different schemes generated by our framework's data-driven approaches. This calibration is performed by exploiting the workloads running on the target store, which allows a mixed approach combining the connectivity of the data and the workloads while designing our stores: the workload is used to boost the quality of the initial allocation. This task could be divided into two sub-modules, a workload handler and a smart re-allocator. In the following, we give a brief overview of how both modules could work.

Workload handler: This sub-component would be in charge of measuring the closeness between graph fragments based on the information available in the workload. We could adapt some of the workload-aware partitioning algorithms of the state of the art; for example, the techniques of [HS13] and [GHS14] could be used to extract the most relevant information from a representative workload. This is illustrated in Figure 4.19, in which a query (Fig. 4.19a) is normalized using the techniques of [HS13]. To measure the connectivity between graph fragments in the workload, we can detect a set of candidate graph fragments for each part of the query. In the example shown in Fig. 4.19b, we can then calculate the interactions between the candidate fragments. This information is transmitted to the following module, which adjusts the cost model used by the allocator component.

(a) Original query:

SELECT ?a ?m ?z WHERE {
  ?a :type :Airliner .
  ?a :nbMotors "4" .
  ?a :hasModel ?m .
  ?a :manufacturer ?z .
  ?z :has_seat :France .
  ?z :office_in ?t .
  FILTER regex(?t, "Toulouse") }

(b) Normalized query:

Ω1 :type :Airliner
Ω1 :nbMotors Ω2
Ω1 :hasModel Ω3
Ω1 :manufacturer Ω4
Ω4 :has_seat Ω5
Ω4 :office_in Ω6

Candidate fragments: $\overrightarrow{Gf}_A$: {$\overrightarrow{Gf}_{11}$, $\overrightarrow{Gf}_{12}$, $\overrightarrow{Gf}_{13}$}; $\overrightarrow{Gf}_B$: {$\overrightarrow{Gf}_{21}$, $\overrightarrow{Gf}_{22}$}

Figure 4.19: Normalization process

Smart re-partitioner: This component considers the initial data-driven partitioning proposed by the allocator and the insights from the workload handler to determine whether a fragment should be transferred or replicated to a given site. To solve the fragment re-allocation problem, one of the following actions must be chosen for each fragment (Vi):

• Replicate: the fragment Vi is replicated to as many sites of S as the replication, space availability, and imbalance constraints allow.


• Move: this action moves the fragment Vi from its former site Sj to another site in the network.

• Keep: this action maintains the fragment in its original location Sj.

The interaction between both sub-components is illustrated in Figure 4.20.

Figure 4.20: Calibration component extension

Other allocation rules

Currently, the allocator component of our framework implements data-driven algorithms to place graph fragments on the sites of a distributed system, a strategy based on the inherent connectivity of the triples. However, other data-driven rules could extend our framework. For example, triple stores have recently been extended for the efficient management of spatial data; for datasets containing geospatial information, one may want to place graph fragments together according to geospatial variables. Another data-driven strategy to consider targets triple stores that support reasoning: the allocation should consider the semantic closeness of the fragments with respect to a given set of ontologies. This would keep on the same site not only triples connected in the graph, but also triples that are semantically close without being directly connected by an arc.

Consideration of other optimization techniques

For the moment, our framework is used to partition KGs in triple stores. We are aware that data partitioning is just one of a pool of optimization strategies to consider while designing efficient storage systems. Considering the impact of data partitioning and its interaction with indexes, replicas, materialized views, etc. is the next step. Following what happened in RDBMSs, partitioning advisors first propose a partitioning scheme that is then tuned by subsequent optimization structures (e.g., indexes) [NB11]. Our framework can be used as a baseline for other components in charge of proposing indexes or the replication of some fragments.

Implementation in transparent triple stores

We have introduced the notion of data transparency associated with data partitioning. In transparent distributed triple stores, the partitioning task is delegated to the host where the system is deployed. What if the host offered clients more visibility and control to partition their own data? Incorporating our partitioning framework into existing distributed frameworks should be considered. This need is evidenced, for example, by the overheads suffered by many Hadoop-based systems due to the high communication costs of MapReduce joins [AHKK17].


Dynamic optimization strategies

Today there are many very efficient main-memory distributed triple stores, in which fast dynamic optimizations of the data distribution can be considered. These optimizations may use query streams to decide on re-partitioning or replication strategies. A component could be added to our framework to deal specifically with this online reorganization.

Reproducibility of our framework for other types of data

In this thesis we presented a vision that encompasses the design process of triple stores. Our vision can be reused and adapted to other data types that exist today (e.g., property graphs) or to data models yet to come.

A repository for data partitioning findings

Throughout this thesis, we did our best to promote the reproduction in RDF stores of the strong points of data partitioning in traditional databases (deployed in centralized and parallel infrastructures). Going further, the development of a repository (knowledge-graph-like) containing all findings of data partitioning in both traditional and RDF databases would be a great asset for students, academia, industry, and developers of open-source systems, to name a few. These findings would be associated with their scientific papers and/or the systems implementing them. Such a repository would allow any user to navigate through its different dimensions to find what they need, complementing a traditional survey.

Summary

Introduction

Context

We live in a hyper-connected world in which large quantities of data and knowledge are emitted by many providers such as social networks, mobile devices, e-commerce, sensors, the Internet of Things, and many others. A large part of this data is produced and made available on the Internet. To date, the Web contains more than 1.8 billion sites15 and its size is constantly expanding. Web data is characterized, among other things, by its volume, variety, veracity, and velocity. The information and knowledge associated with this data, if well prepared, can be precious for consumers (companies, governments, private users, etc.), who can obtain relevant knowledge and possibly added value. However, the availability of Web data does not imply its direct exploitation. Computers must manipulate and interpret the information published on the Web; this idea gave rise to the Semantic Web. Web data must be represented and queryable in an intuitive and clear manner. The World Wide Web Consortium (W3C) has led many efforts to specify, develop, and deploy guidelines, data models, and languages. Among these models, the Resource Description Framework (RDF) [RC14] was designed to express information about Web resources; it is the cornerstone of the other W3C models and languages. The triple is the smallest unit of data in RDF. A triple models a single statement about resources with the structure ⟨subject, predicate, object⟩: it states that a relation identified by the predicate (also known as the property) holds between the subject and the object, which represent Web resources (entities, documents, concepts, etc.). This interconnection capability gives RDF the possibility to link triples from different datasets through their IRIs (International Resource Identifiers). The result of merging triples constitutes a Knowledge Graph (KG). To query RDF datasets and KGs, the W3C defined SPARQL [SH13a] as the standard query language; SPARQL allows expressing queries over diverse datasets. The popularity of RDF is due to its flexibility, simplicity, and the availability of a query language. Today, knowledge graphs (KGs) have become popular means for academic and industrial researchers to capture, represent, and exploit structured knowledge. This variety and richness of KGs has motivated researchers from other domains to integrate them into the different phases of their life cycle.

15https://www.internetlivestats.com/


The race for efficient Triple Stores

With the rise of knowledge graphs, two problems arise: (i) scalable storage of and (ii) efficient access to KGs. The systems in charge of storing and accessing triples are called Triple Stores. These systems can be classified into two main groups: non-native and native.

Non-native: These systems build their triple stores on existing storage solutions. Relational database management systems (RDBMSs) have been widely used to design such systems. The earliest systems store the triples in a single three-column table (subject S, predicate P, and object O). To reduce the number of self-joins induced by this approach, other systems use a flat table called a property table, which stores each property in a different column. This storage strategy raises the problem of null values and of constraints for representing multi-valued properties. To overcome these drawbacks, some systems vertically partition the property table into L fragments (where L is the number of predicates); each fragment is associated with one predicate and composed of two columns (subject, object).

Native: These systems were specifically designed to process RDF data. Two main storage strategies are used in these systems: tree-based structures and adjacency lists. The RDF-3X system [NW08] is the most representative of the first category: it stores the triples in B+Trees and, to improve performance, stores all permutations of the columns (e.g., SPO, OPS, PSO). The systems of the second category use physical graph implementations such as adjacency lists; the optimizations of these systems strongly depend on the graph implementation structures.

Distributed and parallel Triple Stores: To ensure scalability, a large number of distributed and parallel triple stores exist. The development cycle of such a system can readily draw on the one used for designing traditional databases. Data partitioning is a prerequisite for the deployment, efficiency, scalability, and fault tolerance of these systems [ÖV11]. The sensitivity of RDF data partitioning and its impact on triple store performance were recently discussed in a paper presented at a SIGMOD 2020 workshop [JSL20]. We classify these systems into three groups: homogeneous, heterogeneous, and cloud-based. Homogeneous and heterogeneous systems are generally implemented in master/slave architectures: the former clone a centralized triple store on the nodes of a distributed system, while the latter use different triple stores on each node. In cloud-based systems, data storage and access are delegated to the host. Several partitioning modes for RDF data have also been implemented (e.g., hashing, graph partitioning, and semantic hashing). From the above presentation, covering centralized, distributed, and parallel triple stores, three main lessons are drawn:

1. There is a strong demand from managers and consumers to develop triple stores that exploit the technological opportunities in terms of software and hardware.

2. Any initiative aiming to use data partitioning in triple stores should benefit from the rich experience gathered in traditional databases. This experience must be capitalized on to increase the reuse and reproducibility of data partitioning results on RDF data.


3. There is a need to develop an RDF data partitioning framework for managers wishing to design centralized and parallel triple stores.

Our vision, objectives, and contributions

In this thesis, we first aim to make the partitioning problem explicit for traditional databases. We advocate the use of data partitioning in the design of triple stores by reusing its strong points. Our vision is to build a partitioning framework based on an analysis of the strengths and constraints of both worlds (traditional databases and KGs). This requires an in-depth study of the data partitioning approaches proposed in the context of traditional centralized/parallel/distributed databases.

The objectives we set in our thesis are:
(i) The definition of a common framework for centralized and parallel triple stores, with complete components implementing our vision.
(ii) The proposal of efficient data-driven partitioning algorithms defined on previously defined logical structures.
(iii) Demonstrating the feasibility of the framework by instantiating it in a centralized and a distributed triple store.
(iv) Providing tools for designers, including a fragment manipulation language and an automatic partitioning advisor, based on logical fragments.
The main contributions of our work are:
• With the motivation of increasing the reproduction and reuse of the important results of data partitioning in traditional centralized and parallel databases, we propose a complete study of this problem covering the major generations of the database world.

• The definition of a framework, called RDFPartSuite, that supports our vision in the design of centralized and parallel triple stores. RDFPartSuite can be customized according to the type of platform and follows the outside-the-DBMS paradigm [Ord13], in which all technical efforts are performed externally. Depending on the type of platform, two scenarios are possible: 1. If the platform is centralized, our framework activates the fragmenter module, in charge of partitioning the triples according to their logical representation. 2. When the platform is distributed or parallel, the fragmenter sends the obtained fragmentation schema to the allocator component, which assigns the different fragments to the nodes of the target platform. Once the allocation schema is obtained, the dispatcher component sends the detailed description of this schema to the different nodes and then loads the triples according to their definitions. In addition, support tools (a fragment manipulation language and an automatic partitioning advisor) are proposed to help managers in their tasks.

• The definition of a set of logical entities (called graph fragments) used to fragment RDF datasets at the logical level in centralized and parallel environments.

• The deployment of this framework in the loading module of a centralized system (RDF QDAG [KMG+20]) and of a distributed triple store (gStoreD [PZÖ+16]).


(Star schema with Partitioning at the center and the dimensions: Main objective, Adaptability, Algorithm, Mechanism, Cost model, Data model, System element, Constraints, Platform, and Type.)

Figure 4.21: Star schema of the partitioning environment

Foundations of Data Partitioning

Among the most studied optimization techniques in databases, data partitioning occupies a very important place, not only for its effectiveness in improving database performance, but also because it is a mandatory step in modern large-scale parallel and distributed architectures. The notions of data partitioning were introduced in the seventies in [HS75], shortly after the definition of database indexes. In relational databases, partitioning consists of dividing what is logically one large table into smaller physical pieces. As with indexes, the data partitioning problem has been covered in all database generations: deductive [Spy87, NH94], object-oriented [BKS97], and data warehouses [SMR00]. Yet, unlike other optimization strategies and before the introduction of dynamic partitioning techniques, partitions are created when the database tables are declared. This feature distinguishes data partitioning from other optimization techniques that can be created on demand while the database is running: recreating partitions after their definition is costly and impractical. We characterized the partitioning problem according to its:

• Maturity: the problem has existed almost since the definition of the first database systems and has been widely explored,

• Coverage: it has been implemented for all database generations,

• Evolution: it has been adapted to different architectures, needs, and constraints, and

• Complexity: finding an optimal partitioning schema must consider several alternatives and sometimes mutually exclusive objective functions.

Partitioning dimensions

We modeled the partitioning environment using the star schema illustrated in Figure 4.21. The dimensions with a double border in this figure are those defining the basic partitioning problem. We organize the partitioning approaches of this chapter using the ten dimensions detailed below, which serve as classification criteria. This schema is used to organize the works studied in this chapter.


Type: This dimension describes the alternative ways of dividing a table into smaller parts. Two alternatives are clearly visible: dividing it horizontally or vertically. A third type of partitioning, called hybrid, combines the previous techniques. In short, logical entities are represented as relations (tables) in the relational model; each relation groups sets of tuples that are instances of a logical entity represented by a set of attributes. Horizontal partitioning has been explored most widely by DBMSs and exists in many versions. First, we distinguish horizontal partitions performed on a single table, called simple horizontal partitioning. Several partitioning modes are implemented:
• Range partitioning: a relation is divided according to a range of values for a given set of attributes.

• List partitioning: this strategy divides a table according to a list of discrete values for the partitioning key of a column.

• Hash partitioning: this mode decomposes the data by applying a hash function to the partitioning attributes.
Another horizontal partitioning mode, called composite, combines the simple partitioning modes (e.g., range-hash, hash-list). Finally, derived horizontal partitioning allows partitioning a table by exploiting an existing parent-child relationship established through a foreign key between two relations [BBRW09].

Main objective: The problem of optimally partitioning the data of a centralized, distributed, or parallel database system might at first sight seem the same. Even if the expected output is a set of horizontal or vertical partitions, the main objective behind creating each partition differs considerably according to the system's requirements. We classify the partitioning objectives as follows: (i) response time, (ii) concurrency, (iii) maintenance, and (iv) combined.

Mechanism: This dimension refers to the means used by partitioning techniques to reach the aforementioned objectives. It consists of two alternatives: clustering and declustering.
• Clustering refers to partitioning in centralized and distributed database systems. In the former, data partitioning seeks to minimize the transfer costs between primary and secondary memory by retrieving the records or attributes relevant to the user's request. In distributed database systems, the main objective is to avoid costly network communications by localizing executions on the nodes of the distributed system where the data reside. In both cases, the data are grouped to form sets of tuples or attributes that are usually retrieved together.

• The declustering strategy is preferred by parallel database systems, where maximizing local processing at each node is not necessary. Data partitioning in parallel systems results in a trade-off between optimizing the response time of each individual query, at the cost of execution skew, and maximizing intra-query parallelism.


Algorithm: Finding the right set of partitions has proven to be a very complex problem. Horizontal partitioning has been shown to be NP-hard [SW85, SW83], as has the problem of finding vertical partitions using an affinity graph. Horizontal and vertical partitioning have been mathematically formalized and posed as optimization problems whose solutions are found through heuristics. We classify the algorithms into: (i) exact solutions and (ii) heuristics.

Cost model: To objectively assess whether a partitioning strategy improves the performance of the database system, a set of metrics is established to define the criteria that characterize a well-performing system. After all, without a well-defined metric, it is impossible to decide whether one partitioning alternative is better than another. Moreover, cost models are essential to predict the actual execution costs of a query a priori, without actually evaluating it. Cost models offer a simplified view of a system and are tied to the objectives described above. We classify cost models as follows: (i) based on response time, (ii) based on throughput, (iii) based on maintenance, and (iv) based on overall system cost.

Constraints: Partitioning algorithms seek to optimize an objective function with respect to certain variables in the presence of constraints. The following list summarizes the most common constraints: (i) the maximum/minimum partition size, (ii) the number of partitions, (iii) redundancy, and (iv) the query workload.

Platform: This dimension considers the three types of database architectures: (i) centralized, (ii) distributed, and (iii) parallel.

System element: In this dimension, we consider the hardware element targeted by the partitioning approach. Among the components, we can mention: primary and secondary memory, local network, and processor.

Adaptability
This dimension classifies partitioning strategies into static and dynamic. The difference depends on whether the partitioning describes a procedure that is performed when the database is declared and never changes, or whether it describes strategies carried out after several workloads have been executed, the system being able to adapt its partitioning accordingly.

Data model
In this section, we classify partitioning strategies according to the data model used to logically represent the data. We include in this category the relational model and the models directly derived from it: (i) relational, (ii) object-oriented, (iii) deductive, and (iv) dimensional.


Partitioning in recent large-scale platforms

In the early 2000s, the need to ingest, store, and query large amounts of Web data, coming from different sources and formats and at a lower cost, drove the development of new kinds of data management tools. Systems based on the relational model, in particular parallel databases, scale and can handle vast volumes of data, as shown in [PPR+09]. However, the relational model is too rigid to adapt to the variability of current data sources and formats. Moreover, since the data schema is rarely available beforehand, it is impossible to carry out the design step that is essential when building a relational database. As a consequence, systems applying a schema-later strategy have been widely adopted. These systems initially store the data in raw format and parse them at runtime. The pioneer introducing a new scalable data management tool was Google, with the Google File System [GGL03] and the MapReduce processing model [DG04]. Their work served as the foundation of Hadoop, one of the most popular large-scale data frameworks to this day. In this section, we first give a brief overview of Hadoop with an emphasis on the partitioning strategy applied by this framework. Then, we explain and detail the most relevant systems built on top of Hadoop and describe their partitioning strategies.

Partitioning in the Hadoop ecosystem

Storage in Hadoop is handled by the Hadoop Distributed File System (HDFS) component. HDFS is a distributed data storage system based on a master-slave architecture. It is designed to store structured and unstructured data in a scalable, highly available, and fault-tolerant manner. Files are split into blocks of a configurable size (128 MB by default) that are distributed and stored among the data nodes. To guarantee fault tolerance and load balancing, blocks of the same file are replicated and stored on different nodes. The name node (master) manages the file system and maps file blocks to their locations using the file metadata. When a file is imported into HDFS, it is split sequentially into small blocks of fixed storage size. Each block is replicated into a customizable, fixed number of copies, and a hash function is used to distribute the data to the nodes. This strategy is quite simple and distributes the data while keeping it balanced. However, it does not take the data schema into account and may incur network overhead at retrieval time.
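To make this block-placement strategy concrete, here is a minimal sketch assuming hypothetical node names; it mimics the sequential splitting and hash-based placement described above (real HDFS placement is rack-aware and more elaborate).

```python
# Minimal sketch (not HDFS code) of block splitting and hash-based placement.
# Block size and replication factor match the defaults mentioned in the text;
# the node names are hypothetical.

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the HDFS default
REPLICATION = 3
NODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]

def split_into_blocks(file_size):
    """Sequentially cut a file of `file_size` bytes into block ids."""
    return list(range((file_size + BLOCK_SIZE - 1) // BLOCK_SIZE))

def place_block(block_id):
    """Pick REPLICATION distinct nodes for one block via hashing."""
    start = hash(block_id) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION)]

placement = {b: place_block(b) for b in split_into_blocks(500 * 1024 * 1024)}
# Balanced but schema-oblivious: related records may land on different nodes,
# which is the source of the retrieval-time network overhead noted above.
```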

Conclusion

With the data deluge and the explosion of deployment architectures, data partitioning has become a technique that database actors (designers, administrators, architects, analysts, students, researchers) must know in detail. Data partitioning has been studied since the advent of databases and has evolved with their generations. In this chapter, we sought to give a historical overview of the data partitioning problem: its history, its evolution, its constraints, and its behavior after the arrival of each new architecture and data deployment platform. To do so, we proposed a data partitioning foundation comprising definitions and implicit and explicit types, as well as data partitioning dimensions, namely: (1) partitioning type, (2) algorithm, (3) main objective, (4) adaptability, (5) mechanism, (6) cost model, (7) system element, (8) platform, (9) constraints, and (10) data model. These dimensions offer a graphical representation of data partitioning (like a star schema).


For each dimension, we described its characteristics and represented its elements and labels.
We focused on presenting the state of the art of data partitioning in relational databases and closely related models, in centralized, distributed, and parallel architectures. We cited the works that led to an explicit definition of the partitioning problem and showed the progressive development of the main forms of data partitioning. We then presented the algorithms, strategies, and heuristics introduced to partition a database along the previously defined set of dimensions. We offer a comprehensive review of data partitioning that can serve as a torch for readers and practitioners working on this exciting and long-lasting problem. Finally, we showed how the problem has evolved in current cloud platforms and how it has adapted to cope with large-scale data. In the next chapter, we show how the problem has been addressed specifically for graph processing.

Graph-Oriented Data: Representation and Processing

In the previous chapter, we detailed the partitioning problem in relational databases. We showed its evolution and how it was adapted to object-oriented, deductive, and multidimensional data models. We explained the development of large-scale platforms and how they adapted to the recent need to manage large volumes of data efficiently. However, the data models discussed so far derive directly from the extensively studied relational design. In this chapter, we detail a data model that allows describing more complex relationships: graph-oriented models, which represent data as a graph whose manipulation is expressed through graph operators. They have gained momentum in recent years with the development of the Semantic Web and social networks, and with the increased capacity to capture data from different disciplines such as biology and transportation. In these applications, the interconnectivity of the data is a key characteristic, as important as the data themselves, and modeling the data as a graph allows a more natural way to handle such applications. In this chapter, we start by giving more background on the graph data model in general. Then, we focus on the RDF model introduced to represent data on the Semantic Web, giving some main definitions and describing the most relevant storage and processing approaches.

Graph database models
Research on graph database models is not new; in fact, it peaked in the late 1980s and early 1990s, but it was eclipsed by the development of tree-shaped structures such as XML modeling in first-generation websites. The use of graphs in computer science dates back to the beginnings of relational databases in the mid-1970s, when graph models were proposed to extend the information stored in databases with semantic networks (first defined in [Sim72]).

Data structures, languages, and query processing
Graph-oriented data models represent data as graphs, providing efficient traversal operations to query and analyze them. Graphs are composed of entities, representing something that exists as a single or composite unit, and relationships establishing connections between two or more entities. Some data structures used to model entities and relationships in this model are: the adjacency matrix, the adjacency list, the edge list, the compressed sparse row (CSR), and complex indexes.
Among research studies, there are several works on query languages. These works present collections of operators and inference rules to manipulate and query graph-oriented data structures. Given the large number of available approaches and the scope of this thesis, we only mention in this section the most recent and popular graph query languages: SPARQL, Cypher, Gremlin, Oracle PGQL, and G-Core. Query processing strategies depend on the type of query targeted by the application. They can be as simple as finding the direct neighbors of a node or finding isomorphic graphs matching a query pattern, or involve more complex traversal paths used in algorithms such as PageRank. We gave an overview of the graph processing strategies that are at the heart of several query languages. A detailed description of each algorithm, and of more complex strategies dealing with reachability, can be found in [BFVY18].
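The following sketch contrasts two of the structures just listed, the adjacency list and the compressed sparse row (CSR), on a small hypothetical graph.

```python
# Illustrative sketch: the same small graph stored as an adjacency list and
# in compressed sparse row (CSR) form. The edges are hypothetical.

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]     # directed edges (src, dst)

# Adjacency list: one neighbor list per node.
adj = {}
for src, dst in edges:
    adj.setdefault(src, []).append(dst)       # {0: [1, 2], 1: [2], 2: [0]}

# CSR: all neighbors in one flat array plus per-node offsets into it.
num_nodes = 3
offsets, neighbors = [0], []
for node in range(num_nodes):
    neighbors.extend(sorted(adj.get(node, [])))
    offsets.append(len(neighbors))
# offsets == [0, 2, 3, 4]; node n's neighbors are neighbors[offsets[n]:offsets[n+1]]
assert neighbors[offsets[2]:offsets[3]] == [0]
```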

Graph partitioning
In recent years, data stored as graphs have grown considerably. For instance, the Facebook data graph counts a billion nodes and around two hundred billion edges [LNP16]. Distributed and parallel solutions have emerged to meet the demand for efficient graph storage and processing. In these systems, cutting the graph into small pieces is a fundamental step to enable parallelism. Graph partitioning must strike a balance between the following objectives:

• minimizing communication costs, and

• maximizing load balance.

Graph-oriented databases
In the previous sections, we characterized the dimensions of the graph-oriented database model: i) logical data structure, ii) data storage, iii) query and manipulation languages, and iv) query processing strategies. We use these dimensions to classify the most popular systems into two broad groups: native and non-native. The first group consists of systems specifically designed to store and query data represented as graphs. In contrast, non-native approaches are built on top of existing platforms such as relational databases, key-value systems, and other systems such as, for example, the cloud.

The Resource Description Framework (RDF)
The Resource Description Framework (RDF) [RC14] has been widely accepted as the standard data model for exchanging data on the Web. RDF is flexible enough to ease the integration of data with different schemas. The model uses triples composed of a subject, a predicate, and an object (s, p, o) as its basic abstract data structure to represent information. A set of such triples forms a knowledge graph (KG). Nodes represent IRIs (International Resource Identifiers), blank nodes (unspecified resources), or literals. Edges characterize a resource by connecting the subject and object nodes through a predicate. This chapter classifies the most representative systems into native and non-native. We focus here on the partitioning strategies applied by most of these systems.
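A minimal sketch of the (s, p, o) abstraction follows: a set of triples and the labeled graph they induce. The IRIs are hypothetical.

```python
# Minimal sketch: RDF triples (s, p, o) as the abstract data structure, and
# the knowledge graph they induce. The IRIs below are hypothetical.

triples = [
    ("ex:alice", "ex:worksFor", "ex:acme"),    # subject, predicate, object
    ("ex:alice", "ex:knows",    "ex:bob"),
    ("ex:bob",   "ex:name",     '"Bob"'),      # the object may be a literal
]

# Each predicate is a labeled edge from the subject node to the object node.
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
edges = [(s, o, p) for s, p, o in triples]
assert ("ex:alice", "ex:bob", "ex:knows") in edges
```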


RDF data partitioning
To cope with the growth of available RDF data and the need to process them efficiently, RDF systems have resorted to the distribution and parallel processing techniques widely explored in relational databases. Among distributed RDF systems we find two groups, federated systems and clustered systems, described as follows:

• Federated systems: these systems execute SPARQL queries over several SPARQL endpoints16. Such systems must perform on-the-fly data integration from multiple heterogeneous RDF sources. Federated systems are out of the scope of our study.

• Clustered systems: these systems distribute the data among different nodes that form a single storage solution. Over the past several years, a large number of distributed systems offering efficient RDF data processing have emerged. We focus our study on this type of system.

We divide the partitioning process into three phases: fragmentation, allocation, and replication. We then detail the inputs, outputs, and algorithms of each of these steps:

• Fragmentation: the fragmentation process consists in defining the fragmentation unit to be allocated to the sites of the distributed system. There is no consensus on how RDF data should be distributed. Most current solutions use the triple as the fragmentation unit. However, it has been argued that grouping triples first could improve performance by uncovering implicit structures in the RDF graph [BDK+13, SGK+08]. Consequently, other solutions propose to first group triples by subject, predicate, or object and to use these groups as fragmentation units.

• Allocation: the allocation step consists in finding a distribution of a set of fragments {F1, ..., Fn} (the result of the previous step) over a set of sites {S1, ..., Sm}. Fragments are distributed by applying one of the following techniques: a hash function, round-robin, or a scheme dependent on the cloud platform (a small sketch of the first two follows this list).

• Replication: data replication is a frequent, though not mandatory, optimization strategy applied by several distributed RDF systems. It also provides fault tolerance in these systems.
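As announced in the allocation item above, here is a small sketch of hash-based and round-robin fragment allocation; the fragment and site names are hypothetical.

```python
# Sketch of the two simplest allocation techniques named above: hash-based
# and round-robin placement of fragments onto sites. Names are hypothetical.

fragments = [f"F{i}" for i in range(1, 8)]   # {F1, ..., F7}
sites = ["S1", "S2", "S3"]                   # {S1, S2, S3}

def allocate_hash(fragments, sites):
    """Each fragment's site depends only on its own hash (a real system
    would use a stable hash rather than Python's salted built-in)."""
    return {f: sites[hash(f) % len(sites)] for f in fragments}

def allocate_round_robin(fragments, sites):
    """Fragments are dealt out to sites in turn, which balances counts
    but, like hashing, ignores how strongly fragments are connected."""
    return {f: sites[i % len(sites)] for i, f in enumerate(fragments)}

print(allocate_round_robin(fragments, sites))
# {'F1': 'S1', 'F2': 'S2', 'F3': 'S3', 'F4': 'S1', ...}
```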

Conclusion
In this chapter, we first gave a general overview of the graph-oriented data model. We detailed its logical and storage structures, its query processing strategies, and its query languages, and finally we described some current systems. We saw the advantages of the model but also its limits, notably due to the variability of the methods used for data partitioning and processing. Then we focused on RDF. We gave some basic concepts and also described storage, processing, and partitioning strategies. Partitioning in triple stores depends on the physical organization of the data (i.e., stored in a single table, a property table, or an adjacency list). We showed that RDF partitioning strategies are system-dependent. Moreover, unlike in relational databases, there is no common logical layer: RDF data partitioning is system-driven, in contrast with partitioning in relational databases, where it is data-driven.

16 RESTful web services exposing RDF data to be queried with SPARQL


In relational databases, tools exist (e.g., automatic partitioning advisors, definition languages) that let the database designer create partitions based on the structure of the tables, or build a partitioning schema that meets the requirements of a given query workload.

Relational vs. RDF:

• Relational (−): integrating data with different schemas is complex. RDF (+): flexible integration of new data.

• Relational (+): DBMS-independent, partitioning is independent of the database management system. RDF (−): DBMS-dependent, the execution system and the partitioning strategy are coupled.

• Relational (+): partitioning at the logical level, independently of the physical model used to store the data (e.g., NSM, DSM). RDF (−): depends on the physical storage model (e.g., triple-based).

• Relational (+): native database definition languages support the declaration of horizontal partitions. RDF (−): partitioning is performed transparently during the loading phase.

• Relational (+): several automatic partitioning advisors are available to help the designer create partitions. RDF (−): data partitioning is mostly imposed by the triple store.

The partitioning strategies adopted by RDF systems also present several advantages. For instance, most approaches enable data integration. Indeed, one of RDF's strengths is its flexibility to integrate data with different underlying schemas. The relational model, by contrast, is often too rigid to handle data without an initial schema. Some RDF approaches have tried to model RDF data in a relational database schema before storing the data in an RDBMS. However, as several experimental studies demonstrate, relational systems are not optimal when processing very complex queries. The strengths and drawbacks (mostly mutually exclusive) of the two data models are summarized in the preceding table. We consider it necessary to provide tools for RDF system designers that mimic the comfort offered by relational database systems. This led us to set the objective of this thesis: to propose an RDF data partitioning framework based on implicit logical structures. The elaboration of this framework (RDFPartSuite) is addressed in the next chapters.

Logical partitioning of RDF triples

We build on the lessons learned from the relational model, in which data partitioning is carried out at a logical level, independently of the storage approach. In what follows, we introduce a logical layer into the partitioning process of homogeneous distributed triple stores. We start by giving an overview of this logical layer, comparing our strategy to the current state of the art. We then formally define the logical entities and propose algorithms to group RDF triples into graph fragments. Next, we discuss the use of these fragments as physical storage structures and formalize their allocation problem. Finally, we present a complete example of the creation and allocation heuristics defined in this chapter.

Designing a distributed system involves distributing data and programs over the sites of the computer network on which the system is deployed. We focus on the data distribution problem, seeking to find and allocate the optimal distribution units. To make this process more understandable, data distribution is subdivided in [OV11] into two sub-steps: the first, named fragmentation, determines the distribution units; the allocation step then places the fragments on the network sites. The design phase that explicitly declares high-level entities, as in relational databases, does not occur in RDF systems. On the contrary, the data are organized in a schema-free model using triples as distribution units. Yet data related to the same high-level entity can easily be scattered across the dataset. There is no consensus on how the data should be fragmented and allocated; currently, partitioning strategies depend heavily on the underlying system used to store the data. There have been efforts to explicitly add a class hierarchy schema to RDF through annotations (e.g., RDFS schemas and ontologies). However, as noted in [PPEB15], i) the entities of a single dataset may be described with several ontologies, ii) not all entities of a dataset are annotated with the same metadata, and iii) not all SPARQL query patterns consider them. Using these annotations to identify implicit entities is therefore not very effective. Our strategy identifies entities using characteristic sets [NM11]. A characteristic set first groups triples by subject and then collects the data according to their predicates. This strategy does not rely on annotations to identify entities, and it captures certain correlations between triples. We reuse this notion to detect the entities described below.

Graph fragments
In this section, we propose to group the data in a way inspired by relational partitioning. In relational databases, an entity (a table) is partitioned at the instance level (horizontal) or at the attribute level (vertical). Considering that in RDF there is no design step as in relational databases, our proposal seeks to detect the implicit entities in an RDF dataset. The characteristic set proposed in [NM11] makes it possible to group the instances of the same high-level entity to form a partition. We define two types of entities, by instances and by attributes, described below.

Grouping by instances
To gather the triples of the same high-level entity, we first group the triples by their subjects. These structures are named forward data stars. We first define the functions fs(t) → s, fp(t) → p, and fo(t) → o returning, respectively, the subject, the predicate, and the object of an RDF triple t = ⟨s, p, o⟩. These functions are applied in several definitions of this chapter. A forward data star gathers all the properties associated with a node, describing a single occurrence of a certain entity. Entities can be uniquely identified by a subset of their outgoing edges [NM11]. A forward graph fragment gathers forward data stars according to their characteristic set. As mentioned previously, we can characterize an entity by its outgoing edges. A forward graph fragment is a logical structure used to gather the data related to the same high-level entity, and it can be used as a logical structure to distribute the data. In general, two forward data stars belong to the same entity if the labels of their outgoing edges (i.e., the predicates) are the same. Yet sometimes two instances belong to the same entity even though their characteristic sets are not exactly the same; this is the case when two characteristic sets differ only in non-discriminating predicates. To avoid creating a large number of fragments under overly strict similarity criteria, we use a similarity score and a threshold. To simplify notation, the function cs(→Gf) returns the characteristic set (which is the identifier) of a fragment.
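A minimal sketch follows, assuming hypothetical triples, of how forward data stars and characteristic sets yield forward graph fragments; fs, fp, and fo mirror the functions defined above.

```python
# Sketch of forward data stars and characteristic sets. fs/fp/fo mirror the
# functions defined in the text; the triples are hypothetical.

from collections import defaultdict

def fs(t): return t[0]   # subject of t = (s, p, o)
def fp(t): return t[1]   # predicate
def fo(t): return t[2]   # object

triples = [
    ("ex:alice", "ex:name", '"Alice"'), ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob",   "ex:name", '"Bob"'),   ("ex:bob",   "ex:worksFor", "ex:inc"),
    ("ex:acme",  "ex:city", '"Lyon"'),
]

# Forward data star: one subject together with all of its outgoing triples.
stars = defaultdict(list)
for t in triples:
    stars[fs(t)].append(t)

# Characteristic set of a star = the set of its predicates (outgoing labels).
def cs(star): return frozenset(fp(t) for t in star)

# Forward graph fragment: all stars sharing the same characteristic set.
fragments = defaultdict(list)
for subject, star in stars.items():
    fragments[cs(star)].append(subject)
# ex:alice and ex:bob share {ex:name, ex:worksFor}: one fragment, one entity.
assert sorted(fragments[frozenset({"ex:name", "ex:worksFor"})]) == ["ex:alice", "ex:bob"]
```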

Similarity scores
The similarity function Sim(cs(si), cs(sj)) returns a score measuring how similar two characteristic sets are. Its computation may consider only the structure of the characteristic sets and their relationships, or more complex semantic features. Among structural similarity functions we find supersets and tf-idf, and finally semantic measures (same label, and ancestor-based).
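The thesis names supersets and tf-idf among the structural similarity functions; the sketch below uses plain Jaccard similarity as a simple stand-in to show how a score and a threshold drive fragment merging. All values are hypothetical.

```python
# Illustrative structural similarity between characteristic sets, with a
# threshold for merging fragments. Jaccard is a stand-in here, not one of
# the measures studied in the thesis.

def sim(cs_i, cs_j):
    """Jaccard similarity of two characteristic sets (predicate sets)."""
    return len(cs_i & cs_j) / len(cs_i | cs_j)

THRESHOLD = 0.6   # hypothetical merging threshold

cs1 = {"ex:name", "ex:worksFor", "ex:email"}
cs2 = {"ex:name", "ex:worksFor"}             # differs by one predicate only

# Fragments whose characteristic sets differ only in non-discriminating
# predicates score high and are merged into a single fragment.
if sim(cs1, cs2) >= THRESHOLD:
    merged = cs1 | cs2
```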

Grouping by attributes
Organizing an RDF graph into forward graph fragments identifies the set of implicit instances using characteristic sets. The fragments created with this strategy resemble the groups of tuples generated when horizontally partitioning a relational database. As in the relational model, fragmenting at the instance level is useful for a certain type of queries: for example, if the workload is composed of star-shaped queries, this kind of organization is ideal. However, this single fragmentation strategy does not optimize the whole spectrum of queries. Partitioning an entity by its attributes in the relational model (vertical partitioning) optimizes other types of queries, those involving a small number of attributes. Several RDF processing systems use a similar strategy, in which a fragment is created for each property (e.g., SW-Store [AMMH09], S2RDF [SPSL16]). In relational systems, this strategy stores the triples in a different table per property. It is effective when solving queries with few attributes, but suffers from overhead when several predicates are joined in a query. In this section, we describe the creation of fragments by grouping nodes by their properties. Instead of strictly grouping each property into a different fragment, we use the notion of characteristic sets to group the attributes affecting the same node. We name these structures backward graph fragments; their construction is very similar to that of the forward graph fragments described previously. We start by defining a backward data star, which groups a node and its incoming edges. This structure identifies the properties affecting the same node so they can later be grouped, using the same similarity thresholds as for forward graph fragments.
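For symmetry, here is the backward counterpart of the earlier sketch, under the same assumptions: a backward data star keys on the object node and collects its incoming edges.

```python
# Backward counterpart of the earlier sketch: a backward data star groups a
# node with its *incoming* edges, i.e., triples sharing the same object.

from collections import defaultdict

triples = [
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob",   "ex:worksFor", "ex:acme"),
    ("ex:carol", "ex:manages",  "ex:acme"),
]

backward_stars = defaultdict(list)
for s, p, o in triples:
    backward_stars[o].append((s, p, o))        # key on the object node

# Backward characteristic set = labels of the incoming edges.
cs_back = {node: frozenset(p for _, p, _ in star)
           for node, star in backward_stars.items()}
assert cs_back["ex:acme"] == {"ex:worksFor", "ex:manages"}
```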

From logical fragments to physical structures
Organizing RDF data into graph fragments detects implicit logical entities that are used as distribution units in distributed triple stores. Moreover, using these logical fragments as physical structures can considerably improve the performance of centralized and distributed systems. The work of Pham et al. in [PPEB15] showed that explicitly storing the data using an automatically discovered relational schema improves the performance of Virtuoso [EM09], a non-native triple store. Organizing the data into forward and backward graph fragments, whatever the structure used to store them (e.g., tables, trees, or adjacency lists), avoids scanning the whole dataset several times for a single query, as most systems do. Partitioning the RDF graph into physical fragments can thus avoid scanning the entire dataset with every query. To do so, the relevant partitions must be identified based on the information available in the query. Forward and backward fragments allow relevant partitions to be identified from the query's predicates, which are known in most SPARQL queries.

Combining forward and backward fragments
The integration of horizontal and vertical partitions in the relational model has been explored by many researchers in the past, for example the physical database design advisors described in [ANY04] or fractured mirrors [RDS02]. Meanwhile, many massive processing systems propose replication strategies not only for recovery and fault tolerance, but also to improve query response times. We consider an approach similar to fractured mirrors [RDS02], in which a system stores two copies of the data: one copy organizes the data as forward fragments and the other as backward fragments. This configuration is particularly useful when the workload is unknown at the initial partitioning phase.

Allocation problem

Fragment allocation is a mandatory step in distributed systems. The problem consists in finding the optimal distribution of fragments over the sites of a computer network. It was first studied in the context of file distribution and then deepened in relational databases, where it was proven NP-complete [Esw74, SW85, LY80]. In the relational model, the optimality of a strategy depends on various criteria evaluated through cost models. These models estimate storage and maintenance costs as well as performance measures such as throughput and response time. Since the complexity of the problem does not allow exact solutions to be computed in reasonable time, a number of heuristics used in operations research (e.g., for the knapsack problem [CMP82]) have been applied in RDBMSs. These techniques are also used in modern distributed architectures such as Hadoop which, although based on other execution paradigms, face the same data distribution problem. Likewise, most triple stores opt for very simple distribution solutions. These solutions do not guarantee that triples closely related to each other end up on the same site, and a query that merges intermediate results not located on the same machine is very inefficient, mainly because of high transfer costs. As in distributed relational databases, we consider network costs as the processing bottleneck during query execution. Consequently, strategies aiming to improve system performance should seek to maximize data locality. This can be achieved through indexing, partitioning, and replication techniques. Using graph fragments as physical storage structures reduces disk costs and enables parallelism in centralized and distributed systems. We consider the use of forward and backward graph fragments to reduce the network cost of these systems.


We seek to allocate triples as close as possible to their neighbors, so that the intermediate results generated by a query can be pruned locally on each site. We assume that the workload is not available, and we therefore build our distribution model on the innate connectivity of the data. The problem is subject to a series of constraints: imbalance, replication, and available disk space.

Graph partitioning heuristic
The set of graph fragments is mapped to an undirected weighted graph, turning the allocation problem into a graph partitioning problem. Graph partitioning has proven to be a very complex and computationally expensive problem; however, many efficient heuristics have been developed, for example the METIS package [KK98a]. We look for a partition that minimizes the total number of edge cuts, subject to the same constraints defined previously.
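The sketch below makes this mapping concrete on hypothetical data: fragments become weighted nodes, and a naive greedy heuristic (a stand-in for METIS-grade algorithms, not the thesis's implementation) places them under a balance constraint while trying to keep heavy edges uncut.

```python
# Sketch: fragments as nodes of an undirected weighted graph (edge weight =
# number of triples connecting two fragments), split by a naive greedy
# heuristic. Fragments, sizes, and weights below are hypothetical.

sizes = {"F1": 30, "F2": 30, "F3": 30, "F4": 30}
edges = {("F1", "F2"): 90, ("F3", "F4"): 70, ("F1", "F3"): 5, ("F2", "F4"): 8}

def greedy_partition(sizes, edges, k=2, imbalance=1.1):
    """Place each fragment on the site holding its heaviest neighbors,
    subject to a capacity bound enforcing the balance constraint."""
    capacity = imbalance * sum(sizes.values()) / k
    parts, load = {}, [0.0] * k
    for frag in sizes:
        gain = [0.0] * k
        for (a, b), w in edges.items():
            if a == frag and b in parts: gain[parts[b]] += w
            if b == frag and a in parts: gain[parts[a]] += w
        feasible = [s for s in range(k) if load[s] + sizes[frag] <= capacity]
        best = max(feasible, key=lambda s: (gain[s], -load[s]))
        parts[frag] = best
        load[best] += sizes[frag]
    return parts

parts = greedy_partition(sizes, edges)
cut = sum(w for (a, b), w in edges.items() if parts[a] != parts[b])
# The heavy pairs (F1,F2) and (F3,F4) are co-located; only the light edges
# (weight 5 + 8 = 13) are cut, which is the minimized objective above.
assert parts["F1"] == parts["F2"] and parts["F3"] == parts["F4"] and cut == 13
```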

Conclusion
In the relational database world, data partitioning has long been identified as a key optimization and management technique, and data partitioning in RDBMSs is independent of the data storage technique. In this chapter, we set out to reproduce this strategy by addressing the RDF data partitioning problem. Unlike traditional partitioning techniques, RDF techniques depend on each system's partitioning strategy and are hard to generalize across systems. We build on the philosophy adopted by the relational model to address partitioning within distributed RDF systems. Precisely, we introduced a logical layer into what was previously a purely physical process of distributing triples over a set of sites. We formalized and detailed the algorithms used to create the logical entities we named graph fragments (Gf). Our entities extend the notion of partitioning by instances and by attributes from the relational model and offer a good balance between conceptual simplicity, intuitiveness, and logical expressiveness. We formalized the allocation problem and presented a graph heuristic to minimize communication costs.

Our RDFPartSuite Framework in Action

In the previous chapter, we promoted a logical layer for RDF data partitioning. We defined a set of structures that allow partitioning RDF data independently of how they are stored. These structures preserve the logical graph nature of RDF data and are used as fragmentation units. The logical fragments are placed on the sites of a distributed system according to data-driven algorithms that we also detailed. In this chapter, we introduce a framework named RDFPartSuite. The framework provides generic functionalities based on the logical structures defined in the previous chapter. It can be adapted according to the administrator's specifications (creator requirements, triple store, available infrastructure, etc.). RDFPartSuite provides a standard way to build and deploy RDF partitioning schemas in a universal and reusable environment. The framework is composed of three main modules:

(i) Fragmenter: this component is in charge of partitioning the triples using their logical representation as graph fragments (i.e., →Gf or ←Gf). Fragments can be built using structural or semantic rules.


(ii) Allocator: this component creates a distribution schema for the fragments created by the fragmenter. This schema is built using the data-driven strategies detailed previously.

(iii) Dispatcher: this component sends the fragments to the sites of a distributed system following the allocation schema produced by the allocator. The dispatcher also loads the data into the target triple store.

In addition to these modules, our framework integrates a set of assistance tools for triple store administrators. These tools include a declarative partitioning language to fragment, allocate, and dispatch a knowledge graph to a target triple store and, for non-expert users, a partitioning advisor to help them create a partitioning schema.

We start by describing how to integrate our framework with a centralized triple store (RDF QDAG [KMG+20]). This system uses data partitioning and graph exploration techniques to speed up query execution. We also show how to integrate our framework into a distributed triple store (gStoreD [ZÖC+14]). We carried out a substantial number of experiments that showed the feasibility of our partitioning strategies, their efficiency, and their limits.
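A hypothetical sketch of the three-module flow follows; the class and method names are illustrative and do not reflect RDFPartSuite's actual API.

```python
# Hypothetical sketch of the fragment/allocate/dispatch pipeline described
# above. Names are illustrative, not RDFPartSuite's real interface.

class Fragmenter:
    def fragment(self, triples, mode="forward"):
        """Group triples into forward or backward graph fragments
        (e.g., via the characteristic-set grouping sketched earlier)."""
        ...

class Allocator:
    def allocate(self, fragments, sites):
        """Map each fragment to a site using a data-driven strategy
        (hashing, round-robin, or the graph-partitioning heuristic)."""
        ...

class Dispatcher:
    def dispatch(self, allocation, target_store):
        """Ship each fragment to its site and load it into the target
        triple store (centralized or distributed)."""
        ...

# End-to-end flow: fragment, then allocate, then dispatch.
# fragments = Fragmenter().fragment(triples)
# allocation = Allocator().allocate(fragments, ["S1", "S2", "S3"])
# Dispatcher().dispatch(allocation, target_store="gStoreD")
```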

Incorporation into RDF QDAG
Modern RDF processing systems can be divided into two groups. The first group uses the relational model to store RDF triples in tables. Systems in this group scale more easily because they can use optimization strategies, such as indexes and partitions, available in relational databases. Unfortunately, the performance of these systems degrades quickly, in particular when dealing with complex SPARQL queries: the relational model is not well suited to managing RDF data, which are inherently represented as a graph. The second group comprises RDF processing systems that maintain the graph structure of the RDF data. These systems store triples in specially designed data structures. The major problem faced by these approaches is scalability [AHKK17, Özs16]; they fail to manage main memory efficiently on infrastructures with limited resources.

RDF QDAG [KMG+20] combines the virtues of data partitioning in relational systems with an execution model based on exploring the RDF graph. The system supports SPARQL Basic Graph Pattern (BGP) queries as well as wildcard, aggregation, and sorting operators. Data in RDF QDAG are physically partitioned into forward and backward graph fragments. The system is able to prune irrelevant graph fragments using the predicate values in the query; this feature allows exploring only the fragments in which a match is likely to be found. As mentioned previously, query execution in this system is based on graph exploration. To avoid memory overflows, the execution engine of RDF QDAG is based on the Volcano parallel query evaluation system [Gra94].

This system is used to evaluate loading times and data coverage, and to compare performance against other systems using different partitioning strategies. The loading module prepares, indexes, and partitions the raw RDF data received as input. It is composed of two main components: the preprocessing module and the storage module. The former transforms the RDF data (stored in N-Triples, N3, or Turtle files), encoding all character strings and organizing the data into forward and backward graph fragments; it is entirely coded in Java. The storage component, coded in C++, takes the refined, encoded file produced by the preprocessing step as input and arranges the data into the indexes. The result of this step is a set of binary files used by the query processing modules.

Loading costs
Identifying and building forward and backward graph fragments in RDF QDAG is certainly a more complex process than the one applied by systems that load the data directly into, for example, a single relational table. The dataset is read several times to identify and encode the data into the desired format. In this section, we compare the preprocessing time of RDF systems with different storage and execution paradigms. The loading module of RDF QDAG was the only one able to load all the datasets given the limited main memory (32 GB). The graph-based system gStore could not load datasets whose size exceeds the available main memory. The relational systems managed to load all the datasets except LUBM1B, on which we got a memory overflow error.

Query performance
The virtues of the logical partitioning strategy have been shown throughout these sections. Without a doubt, a logical partitioning strategy grants far more freedom to designers and administrators: they no longer depend on a specific system to partition their data. We showed the feasibility of integrating our framework into centralized and distributed triple stores. Moreover, the experiments using RDF QDAG reveal that using these structures in a centralized native triple store guarantees its scalability and good performance for certain queries.

Incorporation into gStoreD
We compare the performance of organizing the data as forward or backward graph fragments in a distributed triple store. We determine for which types of queries organizing the data as forward graph fragments is more appropriate than as backward graph fragments, and vice versa. To our knowledge, there is no graph-based RDF system storing data exclusively as backward graph fragments. Indeed, RDF QDAG uses both structures (i.e., →Gf and ←Gf), and gStore stores the data in adjacency lists grouping the data by subjects. For the experiments in this section, we adapted the distributed version of gStore [PZÖ+16] to support storing forward graph fragments first and then backward graph fragments. The results of this section are part of the experimental study of our work: RDFPartSuite: Bridging Physical and Logical RDF Partitioning [GMB19].

Conclusion
In this chapter, we presented our framework in action. We used the fragmentation and allocation modules to load data into a centralized and a distributed triple store, and we ran a series of tests on each system to show the strengths and limits of our framework. We started by giving an overview of the RDF QDAG triple store. This system stores data as forward and backward graph fragments. We integrated our results into this system's loading module. We compared the loading costs of RDF QDAG with other systems representative of the state of the art in terms of time and scalability. We showed that the fragmentation process with our entities (Gf) is feasible, even on architectures with limited resources. Even though loading times were high compared to other, lighter fragmentation techniques (e.g., the triple table), we saw at query execution time the benefits of this extra cost.


Next, we carried out a complete evaluation of the fragmentation strategies. We began by showing that the number of fragments is reasonable with respect to the size of the datasets. We then used our framework to fragment the data into forward and backward graph fragments in a graph-based distributed triple store (gStoreD). This system was used to compare the forward and backward fragmentation strategies. We then showed that combining both types of fragments in a single system contributes not only to scalability but also to improving query performance. We conducted an in-depth study of communication costs, comparing our partitioning strategy with the most relevant techniques. We showed that our strategies offer a trade-off between allocation cost, measured in loading time, distribution quality, and performance. Finally, we introduced user-friendly extensions to our framework: a declarative partitioning language built on the logical structures of the previous chapter, and a partitioning advisor.

Conclusions and Perspectives

Our world today is filled with data, a large part of which is freely available on the Web. The W3C has devoted many efforts to developing standards that ease the exploitation, exchange, and reuse of these data. Among these standards, RDF stands out for its flexibility, its simplicity, and the expressiveness of its query language (SPARQL). RDF statements are triples logically represented in a graph. These graphs interconnect data from several datasets, constituting a knowledge graph (KG). The complexity of the latter concerns both its ecosystem, which involves several actors (e.g., creators, consumers, administrators) and infrastructures, and its size. To satisfy the requirements of these actors, each KG must be efficiently stored and queried to facilitate the different processes of its exploitation. The storage infrastructure, and in particular triple stores, is a cornerstone in meeting these needs. This has driven researchers from various communities to propose a large collection of such systems. Some are built on existing solutions (such as RDBMSs); others are built using custom storage implementations. They all share the use of physical optimization strategies that are specific to each system. These strategies generally ignore the logical schema of the KG, which is far from the optimizations based on logical structures widely studied in relational databases. Among these strategies, we find data partitioning in the first line.

Data partitioning grants administrators the means to improve the performance, manageability, and availability of centralized triple stores by dividing the data into fragments. Yet data partitioning in centralized triple stores has not had the place it deserves in the design of these systems. For scalability reasons, many parallel and distributed triple stores have emerged, in which data partitioning becomes a necessity. Several partitioning modes exist (hashing, graph partitioning, semantic hashing, etc.), applied at the triple level. Recently, there has been a research boom studying the performance impact of these data placement strategies in triple stores [JSL20, ANS18].

Our vision encourages reusing and reproducing the strengths of data partitioning as applied in the database generations that preceded RDF databases. To do so, it is necessary to make the partitioning environment explicit, identifying its strengths, its evolution, and the way its main elements can be reused in the context of KGs. We studied the data partitioning problem to establish the foundations guiding the development of our partitioning framework. In this thesis, we promote the vision of reproducing data partitioning results in the context of triple stores.

This vision is embodied in our RDF data partitioning framework for centralized, distributed, and parallel triple stores. The framework follows the paradigm of living outside the DBMS so that it can be reused, improved, and extended. It operates on a logical representation of fragments that allows partitioning KGs independently of each system's storage representation. Finally, we showed how this framework was implemented in centralized and distributed triple stores, and we conducted several experiments that showed when these partitioning strategies are most effective.

Perspectives

Extensions of our framework

Calibration component
This component would be in charge of monitoring the different schemas generated by our framework using data-driven approaches. Calibration is achieved by exploiting the query workloads running on the triple store; we use the query workload to improve the quality of the initial allocation. This task could be divided into two sub-modules: a workload manager and a smart reallocator.

Other allocation rules

Currently, the allocator component of our framework implements data-driven algorithms to place graph fragments on the sites of a distributed system. This strategy is based on the inherent connectivity of the triples. However, we can imagine other data-driven rules to extend our framework, for example for datasets containing geospatial information.

Taking other optimization techniques into account

For the moment, our framework is used to partition KGs. We are aware that data partitioning is only one of the optimization strategies to consider when designing efficient storage systems. Taking into account the impact of data partitioning and its interaction with indexes, replicas, materialized views, etc. is the next step.

Dynamic optimization strategies

Today, there are many very efficient main-memory distributed triple stores. In these systems, fast dynamic optimizations of the data distribution can be envisaged. These optimizations can take query streams into account to decide on re-partitioning or replication strategies. A component could be added to our framework to deal specifically with this online reorganization.

Reproducibility of our framework for other types of data

In this thesis, we present a vision that includes the design process of triple stores. Our vision can be reused and adapted to the contexts of other data types that exist today (e.g., property graphs) or of data models yet to come.


A repository for data partitioning results
In presenting our thesis, we did our best to promote the idea of reproducing the strengths of data partitioning in traditional databases (deployed on centralized and parallel infrastructures) within triple stores. Going further, the development of a repository (a knowledge graph) containing all the findings on data partitioning in traditional and RDF databases would be a great asset for students, universities, industry, and open-source system developers, to name only a few.

References

[AAB+17] Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan L. Reutter, and Domagoj Vrgoc. Foundations of modern query languages for graph databases. ACM Comput. Surv., 50(5):68:1–68:40, 2017. (Cited in page 66)

[AAB+18] Renzo Angles, Marcelo Arenas, Pablo Barceló, Peter A. Boncz, George H. L. Fletcher, Claudio Gutierrez, Tobias Lindaaker, Marcus Paradies, Stefan Plantikow, Juan F. Sequeda, Oskar van Rest, and Hannes Voigt. G-CORE: A core for future graph query languages. In Proceedings of the International Conference on Management of Data, SIGMOD, Houston, TX, USA, June 10-15, pages 1421–1432, 2018. (Cited in page 68)

[AAK+16] Razen Al-Harbi, Ibrahim Abdelaziz, Panos Kalnis, Nikos Mamoulis, Yasser Ebrahim, and Majed Sahli. Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning. VLDB Journal, 25(3):355–380, 2016. (Cited in page 90)

[ABA+09] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin, and Avi Silberschatz. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB, 2(1):922–933, 2009. (Cited in page 58)

[ACZH10] Medha Atre, Vineet Chaoji, Mohammed J. Zaki, and James A. Hendler. Matrix ”bit” loaded: a scalable lightweight join query processor for RDF data. In 19th International Conference on World Wide Web, WWW, Raleigh, North Carolina, USA, April 26-30, pages 41–50, 2010. (Cited in page 84)

[ADH02] Anastassia Ailamaki, David J. DeWitt, and Mark D. Hill. Data page layouts for relational databases on deep memory hierarchies. VLDB J., 11(3):198–215, 2002. (Cited in page 82)

[ADHS01] Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, and Marios Skounakis. Weaving relations for cache performance. In Proceedings of 27th International Conference on Very Large Data Bases VLDB, September 11-14, Roma, Italy, pages 169–180, 2001. (Cited in page 58)

[AG08] Renzo Angles and Claudio Gutiérrez. Survey of graph database models. ACM Comput. Surv., 40(1):1:1–1:39, 2008. (Cited in page 63)

[AHKK17] Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat, and Panos Kalnis. A survey and experimental comparison of distributed SPARQL engines for very large RDF data. Proc. VLDB Endow., 10(13):2049–2060, 2017. (Cited in page 80), (Cited in page 86), (Cited in page 99), (Cited in page 122), (Cited in page 128), (Cited in page 155), (Cited in page 157), (Cited in page 174)

[AHÖD14] Günes Aluç, Olaf Hartig, M. Tamer Özsu, and Khuzaima Daudjee. Diversified stress testing of RDF data management systems. In Proceedings of the 13th International Semantic Web Conference - ISWC, Riva del Garda, Italy, October 19-23, pages 197–212, 2014. (Cited in page 128)

[AM09] Alistair Miles and Sean Bechhofer. SKOS Simple Knowledge Organization System. https://www.w3.org/TR/skos-reference/, 2009. (Cited in page 1)

[AMMH09] Daniel J. Abadi, Adam Marcus, Samuel Madden, and Kate Hollenbach. SW-Store: a vertically partitioned DBMS for semantic web data management. VLDB Journal, 18(2):385–406, 2009. (Cited in page 6), (Cited in page 83), (Cited in page 104), (Cited in page 131), (Cited in page 146), (Cited in page 171)

[Amo10] Rasmus Resen Amossen. Vertical partitioning of relational OLTP databases using integer programming. In Workshops Proceedings of the 26th International Conference on Data Engineering, ICDE, March 1-6, Long Beach, California, USA, pages 93–98, 2010. (Cited in page 36), (Cited in page 37), (Cited in page 51)

[ANS18] Adnan Akhter, Axel-Cyrille Ngonga Ngomo, and Muhammad Saleem. An empirical evaluation of RDF graph partitioning techniques. In 21st International Conference on Knowledge Engineering and Knowledge Management - EKAW, Nancy, France, November 12-16, pages 3–18, 2018. (Cited in page 153), (Cited in page 176)

[ANY04] Sanjay Agrawal, Vivek R. Narasayya, and Beverly Yang. Integrating vertical and horizontal partitioning into automated physical database design. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18, pages 359–370, 2004. (Cited in page 8), (Cited in page 35), (Cited in page 47), (Cited in page 48), (Cited in page 51), (Cited in page 108), (Cited in page 172)

[AÖD14] Güneş Aluç, M. Tamer Özsu, and Khuzaima Daudjee. Workload matters: Why RDF databases need a new design. Proc. VLDB Endow., 7(10):837–840, 2014. (Cited in page 97)

[ASAB16] Mohammed Al-Kateb, Paul Sinclair, Grace Au, and Carrie Ballinger. Hybrid row-column partitioning in Teradata. PVLDB, 9(13):1353–1364, 2016. (Cited in page 29), (Cited in page 41), (Cited in page 51)

[ASH08] Medha Atre, Jagannathan Srinivasan, and James A. Hendler. BitMat: A main-memory bit matrix of RDF triples for conjunctive triple pattern queries. In 7th International Semantic Web Conference (ISWC), Karlsruhe, Germany, October 28, 2008. (Cited in page 85)

[ASYNN20] Waqas Ali, Muhammad Saleem, Bin Yao, and Axel-Cyrille Ngonga Ngomo. Storage, indexing, query processing, and benchmarking in centralized and distributed RDF engines: A survey. May 2020. (Cited in page 80), (Cited in page 86)

[BAC+90] Haran Boral, William Alexander, Larry Clay, George P. Copeland, Scott Danforth, Michael J. Franklin, Brian E. Hart, Marc G. Smith, and Patrick Valduriez. Prototyping Bubba, a highly parallel database system. IEEE Trans. Knowl. Data Eng., 2(1):4–24, 1990. (Cited in page 39), (Cited in page 52)

[BAC+13] Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry C. Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkateshwaran Venkataramani. TAO: Facebook’s distributed data store for the social graph. In USENIX Annual Technical Conference, San Jose, CA, USA, June 26-28, pages 49–60, 2013. (Cited in page 78)

[BBA09] Ladjel Bellatreche, Kamel Boukhalfa, and Zaia Alimazighi. SimulPh.D.: A physical design simulator tool. In 20th International Conference on Database and Expert Systems Applications, DEXA, Linz, Austria, August 31 - September 4, pages 263–270, 2009. (Cited in page 9)

[BBB19] Wissem Bouarroudj, Zizette Boufaïda, and Ladjel Bellatreche. Welink: A named entity disambiguation approach for a QAS over knowledge bases. In 13th International Conference on Flexible Query Answering Systems, FQAS, Amantea, Italy, July 2-5, pages 85–97, 2019. (Cited in page 4)

[BBC14] Soumia Benkrid, Ladjel Bellatreche, and Alfredo Cuzzocrea. A global paradigm for designing parallel relational data warehouses in distributed environments. Trans. Large Scale Data Knowl. Centered Syst., 15:64–101, 2014. (Cited in page 7)

[BBKO20] Nabila Berkani, Ladjel Bellatreche, Selma Khouri, and Carlos Ordonez. A model-agnostic recommendation explanation system based on knowledge graph. In 31st International Conference on Database and Expert Systems Applications - DEXA, Bratislava, Slovakia, September 14-17, pages 149–163, 2020. (Cited in page 4)

[BBM07] Ladjel Bellatreche, Kamel Boukhalfa, and Mukesh K. Mohania. Pruning search space of physical database design. In Proceedings of the 18th International Conference on Database and Expert Systems Applications DEXA, Regensburg, Germany, September 3-7, pages 479–488, 2007. (Cited in page 46), (Cited in page 52)

[BBRW09] Ladjel Bellatreche, Kamel Boukhalfa, Pascal Richard, and Komla Yamavo Woameno. Referential horizontal partitioning selection problem in data warehouses: Hardness study and selection algorithms. IJDWM, 5(4):1–23, 2009. (Cited in page 9), (Cited in page 10), (Cited in page 28), (Cited in page 46), (Cited in page 52), (Cited in page 163), (Cited in page XV)

[BCLS87] Thang Nguyen Bui, Soma Chaudhuri, Frank Thomson Leighton, and Michael Sipser. Graph bisection algorithms with good average case behavior. Combinatorica, 7(2):171–191, 1987. (Cited in page 73)

[BCN17] Nikita Bobrov, George A. Chernishev, and Boris Novikov. Workload-independent data-driven vertical partitioning. In Proceedings of New Trends in Databases and Information Systems - ADBIS, Nicosia, Cyprus, September 24-27, pages 275–284, 2017. (Cited in page 36), (Cited in page 37), (Cited in page 51)

[BCR97] Lorenzo Brunetta, Michele Conforti, and Giovanni Rinaldi. A branch-and-cut algorithm for the equicut problem. Math. Program., 77:243–263, 1997. (Cited in page 72)

[BDK+13] Mihaela A. Bornea, Julian Dolby, Anastasios Kementsietsidis, Kavitha Srinivas, Patrick Dantressangle, Octavian Udrea, and Bishwaranjan Bhattacharjee. Building an efficient RDF store over a relational database. In Proceedings of the ACM SIGMOD International Conference on Management of Data, New York, NY, USA, June 22-27, pages 121–132, 2013. (Cited in page 84), (Cited in page 87), (Cited in page 168)

[Bel18] Ladjel Bellatreche. Optimization and tuning in data warehouses. In Encyclopedia of Database Systems, Second Edition, edited by Ling Liu and M. Tamer Özsu. 2018. (Cited in page 9), (Cited in page XVII)

[BEP+08] Kurt D. Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, June 10-12, pages 1247–1250, 2008. (Cited in page 3)

[BFVY18] Angela Bonifati, George H. L. Fletcher, Hannes Voigt, and Nikolay Yakovets. Querying Graphs. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2018. (Cited in page 65), (Cited in page 69), (Cited in page 70), (Cited in page 71), (Cited in page 125), (Cited in page 167)

[BHS03] Valerie Bönström, Annika Hinze, and Heinz Schweppe. Storing RDF as a graph. In 1st Latin American Web Congress (LA-WEB 2003), Empowering Our Web, 10-12 November 2003, Santiago, Chile, pages 27–36, 2003. (Cited in page 85)

[BKL98] Ladjel Bellatreche, Kamalakar Karlapalem, and Qing Li. Derived horizontal class partitioning in OODBs: Design strategies, analytical model and evaluation. In Proceedings of the 17th International Conference on Conceptual Modeling ER, Singapore, November 16-19, pages 465–479, 1998. (Cited in page 10), (Cited in page 45), (Cited in page 52)

[BKMS00] Ladjel Bellatreche, Kamalakar Karlapalem, Mukesh K. Mohania, and Michel Schneider. What can partitioning do for your data warehouses and data marts? In Proceedings of the International Database Engineering and Applications Symposium, IDEAS, September 18-20, Yokohama, Japan, pages 437–446, 2000. (Cited in page 46), (Cited in page 52)

[BKS97] Ladjel Bellatreche, Kamalakar Karlapalem, and Ana Simonet. Horizontal class partitioning in object-oriented databases. In Proceedings of the 8th International Conference on Database and Expert Systems Applications, DEXA, Toulouse, France, September 1-5, pages 58–67, 1997. (Cited in page 10), (Cited in page 21), (Cited in page 45), (Cited in page 162)

[BKS00] Ladjel Bellatreche, Kamalakar Karlapalem, and Ana Simonet. Algorithms and support for horizontal class partitioning in object-oriented databases. Distributed and Parallel Databases, 8(2):155–179, 2000. (Cited in page 45), (Cited in page 52)

[BKvH02] Jeen Broekstra, Arjohn Kampman, and Frank van Harmelen. Sesame: A generic architecture for storing and querying RDF and RDF schema. In Proceedings of the First International Semantic Web Conference - ISWC, Sardinia, Italy, June 9-12, pages 54–68, 2002. (Cited in page 83)

[BLN09] Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. k2-trees for compact web graph representation. In 16th International Symposium on String Processing and Information Retrieval, SPIRE, Saariselkä, Finland, August 25-27, pages 18–30, 2009. (Cited in page 85)

[BLS09] Rajesh Bordawekar, Lipyeow Lim, and Oded Shmueli. Parallelization of XPath queries using multi-core processors: challenges and experiences. In 12th International Conference on Extending Database Technology EDBT, Saint Petersburg, Russia, March 24-26, pages 180–191, 2009. (Cited in page 10)

[BMS+16] Aydin Buluç, Henning Meyerhenke, Ilya Safro, Peter Sanders, and Christian Schulz. Recent advances in graph partitioning. In Algorithm Engineering - Selected Results and Surveys, pages 117–158. 2016. (Cited in page 72)

[BS93] Stephen T. Barnard and Horst D. Simon. A fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems. In Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, PPSC, Norfolk, Virginia, USA, March 22-24, pages 711–718, 1993. (Cited in page 73)

[BSG18] Luigi Bellomarini, Emanuel Sallinger, and Georg Gottlob. The Vadalog system: Datalog-based reasoning for knowledge graphs. Proc. VLDB Endow., 11(9):975–987, 2018. (Cited in page 3)

[BV04] Paolo Boldi and Sebastiano Vigna. The webgraph framework I: compression techniques. In Proceedings of the 13th international conference on World Wide Web, WWW, New York, NY, USA, May 17-20, pages 595–602, 2004. (Cited in page 85)

[CBK+10] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the Twenty-Fourth Conference on Artificial Intelligence, AAAI, Atlanta, Georgia, USA, July 11-15, 2010. (Cited in page 4)

[CBN07] Eric Chu, Jennifer L. Beckmann, and Jeffrey F. Naughton. The case for a wide-table approach to manage sparse relational data sets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, June 12-14, pages 821–832, 2007. (Cited in page 84), (Cited in page 98), (Cited in page 99)

[CDG+08] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1–4:26, 2008. (Cited in page 57)

[CEK+15] Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi Muthukrishnan. One trillion edges: Graph processing at Facebook-scale. Proc. VLDB Endow., 8(12):1804–1815, 2015. (Cited in page 79)

[CFL18] Matteo Cossu, Michael Färber, and Georg Lausen. PRoST: Distributed execution of SPARQL queries using mixed partitioning strategies. In EDBT, pages 469–472, 2018. (Cited in page 90)

[CG97] Gajanan S. Chinchwadkar and Angela Goh. Method transformations for vertical partitioning in parallel and distributed object databases. In Proceedings of the Third International Parallel Processing Conference Euro-Par, Passau, Germany, August 26-29, pages 1135–1143, 1997. (Cited in page 45), (Cited in page 51)

[CGZT14] Yu Cao, Xiaoyan Guo, Baoyao Zhou, and Stephen Todd. HOPE: iterative and interactive database partitioning for OLTP workloads. In IEEE 30th International Conference on Data Engineering ICDE, Chicago, IL, USA, March 31 - April 4, pages 1274–1277, 2014. (Cited in page 43), (Cited in page 44), (Cited in page 52)

[Che76] Peter P. Chen. The entity-relationship model - toward a unified view of data. ACM Trans. Database Syst., 1(1):9–36, 1976. (Cited in page 33)

[Che10] Songting Chen. Cheetah: A high performance, custom data warehouse on top of MapReduce. PVLDB, 3(2):1459–1468, 2010. (Cited in page 58)

[CHKZ03] Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick. Reachability and distance queries via 2-hop labels. SIAM J. Comput., 32(5):1338–1355, 2003. (Cited in page 85)

[Chu69] Wesley W. Chu. Optimal file allocation in a multiple computer system. IEEE Trans. Computers, 18(10):885–889, 1969. (Cited in page 23), (Cited in page 24), (Cited in page 26)

[CILP12] Mariano P. Consens, Kleoni Ioannidou, Jeff LeFevre, and Neoklis Polyzotis. Divergent physical design tuning for replicated databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Scottsdale, AZ, USA, May 20-24, pages 49–60, 2012. (Cited in page 42), (Cited in page 44), (Cited in page 52)

[CK85] George P. Copeland and Setrag Khoshafian. A decomposition storage model. In Proceedings of ACM SIGMOD International Conference on Management of Data, Austin, Texas, USA, May 28-31, pages 268–279, 1985. (Cited in page 82)

[CM93] Mariano P. Consens and Alberto O. Mendelzon. Hy+: A hygraph-based query and visualization system. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, May 26-28, pages 511–516, 1993. (Cited in page 64)

[CMP82] Stefano Ceri, Giancarlo Martella, and Giuseppe Pelagatti. Optimal file allocation in a computer network: a solution method based on the knapsack problem. Comput. Networks, 6(5):345–357, 1982. (Cited in page 109), (Cited in page 172)

[CNP82] Stefano Ceri, Mauro Negri, and Giuseppe Pelagatti. Horizontal data partitioning in database design. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Orlando, Florida, USA, June 2-4, pages 128–136, 1982. (Cited in page 8), (Cited in page 25), (Cited in page 26), (Cited in page 28), (Cited in page 37), (Cited in page 52)

[CNW83] Stefano Ceri, Shamkant B. Navathe, and Gio Wiederhold. Distribution design of logical database schemas. IEEE Trans. Software Eng., 9(4):487–504, 1983. (Cited in page 28), (Cited in page 41), (Cited in page 52)

[Cod70] E. F. Codd. A relational model of data for large shared data banks. Commun. ACM, 13(6):377–387, 1970. (Cited in page 33)

[CPW89] Stefano Ceri, Barbara Pernici, and Gio Wiederhold. Optimization problems and solution methods in the design of data distribution. Inf. Syst., 14(3):261–272, 1989. (Cited in page 36), (Cited in page 51)

[CWZ94] Jonathan E. Cook, Alexander L. Wolf, and Benjamin G. Zorn. Partition selection policies in object database garbage collection. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, USA, May 24-27, pages 371–382, 1994. (Cited in page 10), (Cited in page 45), (Cited in page 52)

[CY87] Douglas W. Cornell and Philip S. Yu. A vertical partitioning algorithm for rela- tional databases. In Proceedings of the Third International Conference on Data Engineering, February 3-5, 1987, Los Angeles, California, USA, pages 30–35, 1987. (Cited in page 25), (Cited in page 26), (Cited in page 36), (Cited in page 51)

[CZC15] Kaiji Chen, Yongluan Zhou, and Yu Cao. Online data partitioning in distributed database systems. In Proceedings of the 18th International Conference on Extending Database Technology, EDBT, Brussels, Belgium, March 23-27, pages 1–12, 2015. (Cited in page 47), (Cited in page 50), (Cited in page 52)

[CZJM10] Carlo Curino, Yang Zhang, Evan P. C. Jones, and Samuel Madden. Schism: a workload-driven approach to database replication and partitioning. PVLDB, 3(1):48–57, 2010. (Cited in page 42), (Cited in page 44), (Cited in page 52)

[DAB06] Jun Du, Reda Alhajj, and Ken Barker. Genetic algorithms based approach to database vertical partition. J. Intell. Inf. Syst., 26(2):167–183, 2006. (Cited in page 36), (Cited in page 37)

[Dat15] DataStax. TITAN: Distributed graph database. http://titan.thinkaurelius.com, 2015. (Cited in page 78)

[DB14] Dan Brickley and R.V. Guha. RDF Schema 1.1. https://www.w3.org/TR/rdf-schema/, 2014. (Cited in page 1), (Cited in page 81)

[DCL18] Ali Davoudian, Liu Chen, and Mengchi Liu. A survey on NoSQL stores. ACM Comput. Surv., 51(2):40:1–40:43, 2018. (Cited in page 56), (Cited in page 66), (Cited in page 75)

[DG92] David J. DeWitt and Jim Gray. Parallel database systems: The future of high performance database systems. Commun. ACM, 35(6):85–98, 1992. (Cited in page 31), (Cited in page 38), (Cited in page 52)

[DG04] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In 6th Symposium on Operating Systems Design and Implementation (OSDI 2004), San Francisco, California, USA, December 6-8, pages 137–150, 2004. (Cited in page 53), (Cited in page 165)

[DGG+86] David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B. Kumar, and M. Muralikrishna. GAMMA - A high performance dataflow database machine. In Proceedings of the Twelfth International Conference on Very Large Data Bases VLDB, August 25-28, Kyoto, Japan, pages 228–237, 1986. (Cited in page 24), (Cited in page 26), (Cited in page 27), (Cited in page 38)

[DGH+14] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In The 20th ACM International Conference on Knowledge Discovery and Data Mining SIGKDD, New York, NY, USA - August 24 - 27, pages 601–610, 2014. (Cited in page 4)

[DH72] William E. Donath and Alan J. Hoffman. Algorithms for partitioning of graphs and computer logic based on eigenvectors of connection matrices. IBM Technical Disclosure Bulletin, 15(3):938–944, 1972. (Cited in page 73)

[DLL+17] Liming Dong, Weidong Liu, Renchuan Li, Tiejun Zhang, and Weiguo Zhao. Replica-aware partitioning design in parallel database systems. In Proceedings of Euro-Par: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28 - September 1, pages 303–316, 2017. (Cited in page 43), (Cited in page 44), (Cited in page 52)

[DPP+18] Gabriel Campero Durand, Marcus Pinnecke, Rufat Piriyev, Mahmoud Mohsen, David Broneske, Gunter Saake, Maya S. Sekeran, Fabián Rodriguez, and Laxmi Balami. GridFormation: Towards self-driven online data partitioning using reinforcement learning. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM@SIGMOD, Houston, TX, USA, June 10, pages 1:1–1:7, 2018. (Cited in page 50), (Cited in page 51), (Cited in page 52)

[DPP+19] Gabriel Campero Durand, Rufat Piriyev, Marcus Pinnecke, David Broneske, Balasubramanian Gurumurthy, and Gunter Saake. Automated vertical partitioning with deep reinforcement learning. In Proceedings of New Trends in Databases and Information Systems, ADBIS, Bled, Slovenia, September 8-11, pages 126–134, 2019. (Cited in page 36), (Cited in page 37), (Cited in page 51)

[DPSW00] Ralf Diekmann, Robert Preis, Frank Schlimbach, and Chris Walshaw. Shape-optimized mesh partitioning and load balancing for parallel adaptive FEM. Parallel Comput., 26(12):1555–1581, 2000. (Cited in page 74)

[DQ12] Jens Dittrich and Jorge-Arnulfo Quiané-Ruiz. Efficient big data processing in Hadoop MapReduce. PVLDB, 5(12):2014–2015, 2012. (Cited in page 58)

[DQJ+10] Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, and Jörg Schad. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 3(1):518–529, 2010. (Cited in page 58)

[DS82] David Hung-Chang Du and John S. Sobolewski. Disk allocation for cartesian product files on multiple-disk systems. ACM Trans. Database Syst., 7(1):82–101, 1982. (Cited in page 39), (Cited in page 40), (Cited in page 52)

[DWNY12] Jin-Hang Du, Haofen Wang, Yuan Ni, and Yong Yu. HadoopRDF: A scalable semantic data analytical engine. In 8th International Conference on Intelligent Computing Theories and Applications (ICIC), pages 633–641, 2012. (Cited in page 83), (Cited in page 89), (Cited in page 90), (Cited in page 146)

[DZC17] Dong Dai, Wei Zhang, and Yong Chen. IOGP: an incremental online graph partitioning algorithm for distributed graph databases. In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, HPDC, Washington, DC, USA, June 26-30, pages 219–230, 2017. (Cited in page 75), (Cited in page 76)

[ECS+08] George Eadon, Eugene Inseok Chong, Shrikanth Shankar, Ananth Raghavan, Jagannathan Srinivasan, and Souripriya Das. Supporting table partitioning by reference in Oracle. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD, Vancouver, BC, Canada, June 10-12, pages 1111–1122, 2008. (Cited in page 41), (Cited in page 52)

[EM09] Orri Erling and Ivan Mikhailov. Virtuoso: RDF support in a native RDBMS. In Semantic Web Information Management - A Model-Based Perspective, pages 501–519. 2009. (Cited in page 5), (Cited in page 83), (Cited in page 86), (Cited in page 107), (Cited in page 128), (Cited in page 134), (Cited in page 172)

[ES76] Mark J. Eisner and Dennis G. Severance. Mathematical techniques for efficient record segmentation in large shared databases. J. ACM, 23(4):619–635, 1976. (Cited in page 22), (Cited in page 23), (Cited in page 25), (Cited in page 26), (Cited in page 31), (Cited in page 35), (Cited in page 37)

[Esw74] Kapali P. Eswaran. Placement of records in a file and file allocation in a computer. In IFIP Congress, pages 304–307, 1974. (Cited in page 23), (Cited in page 24), (Cited in page 26), (Cited in page 109), (Cited in page 172)

[ESW78] Robert S. Epstein, Michael Stonebraker, and Eugene Wong. Distributed query processing in a relational data base system. In Proceedings of the 1978 ACM SIGMOD International Conference on Management of Data, Austin, Texas, USA, May 31 - June 2, 1978, pages 169–180, 1978. (Cited in page 25), (Cited in page 26)

[ETÖ+11] Mohamed Y. Eltabakh, Yuanyuan Tian, Fatma Özcan, Rainer Gemulla, Aljoscha Krettek, and John McPherson. CoHadoop: Flexible data placement and its exploitation in Hadoop. PVLDB, 4(9):575–585, 2011. (Cited in page 58)

[FB93] Christos Faloutsos and Pravin Bhagwat. Declustering using fractals. In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems (PDIS), Issues, Architectures, and Algorithms, San Diego, CA, USA, January 20-23, pages 18–25, 1993. (Cited in page 39), (Cited in page 40), (Cited in page 52)

[FGG+18] Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. Cypher: An evolving query language for property graphs. In Proceedings of the International Conference on Management of Data, SIGMOD, Houston, TX, USA, June 10-15, pages 1433–1445, 2018. (Cited in page 67)

[FKL03] Chi-Wai Fung, Kamalakar Karlapalem, and Qing Li. Cost-driven vertical class partitioning for methods in object oriented databases. VLDB J., 12(3):187–210, 2003. (Cited in page 10), (Cited in page 45), (Cited in page 51)

[FLC86] M. T. Fang, Richard C. T. Lee, and Chin-Chen Chang. The idea of de-clustering and its applications. In Proceedings of the Twelfth International Conference on Very Large Data Bases VLDB, August 25-28, Kyoto, Japan, pages 181–188, 1986. (Cited in page 24), (Cited in page 26), (Cited in page 27)

[FLT+20] Wenfei Fan, Muyang Liu, Chao Tian, Ruiqi Xu, and Jingren Zhou. Incrementalization of graph partitioning algorithms. Proc. VLDB Endow., 13(8):1261–1274, 2020. (Cited in page 75), (Cited in page 77)

[FM91] Christos Faloutsos and Dimitris N. Metaxas. Disk allocation methods using error correcting codes. IEEE Trans. Computers, 40(8):907–914, 1991. (Cited in page 39), (Cited in page 40), (Cited in page 52)

[FM17] Hugo Firth and Paolo Missier. TAPER: query-aware, partition-enhancement for large, heterogeneous graphs. Distributed and Parallel Databases, 35(2):85–115, 2017. (Cited in page 75), (Cited in page 76)

[FMS15] Ilir Fetai, Damian Murezzan, and Heiko Schuldt. Workload-driven adaptive data partitioning and distribution - the Cumulus approach. In IEEE International Conference on Big Data, Big Data, Santa Clara, CA, USA, October 29 - November 1, pages 1688–1697, 2015. (Cited in page 50), (Cited in page 52)

[For83] Surveyors’ Forum. Comments on "Comparative models of the file assignment problem". ACM Comput. Surv., 15(1):81–82, 1983. (Cited in page 26)

[Fou19] The Apache Software Foundation. Apache Giraph. https://giraph.apache.org/, 2019. (Cited in page 75), (Cited in page 78), (Cited in page 79)

[GBF11] Philippe Galinier, Zied Boujbel, and Michael Coutinho Fernandes. An efficient memetic algorithm for the graph partitioning problem. Annals OR, 191(1):1–22, 2011. (Cited in page 74)

[GD90] Shahram Ghandeharizadeh and David J. DeWitt. Hybrid-range partitioning strategy: A new declustering strategy for multiprocessor database machines. In Proceedings of the 16th International Conference on Very Large Data Bases, August 13-16, Brisbane, Queensland, Australia, pages 481–492, 1990. (Cited in page 39)

[GD94] Shahram Ghandeharizadeh and David J. DeWitt. MAGIC: A multiattribute declustering mechanism for multiprocessor database machines. IEEE Trans. Parallel Distrib. Syst., 5(5):509–524, 1994. (Cited in page 39), (Cited in page 52)

[GDQ92] Shahram Ghandeharizadeh, David J. DeWitt, and Waheed Qureshi. A performance analysis of alternative multi-attribute declustering strategies. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 2-5, pages 29–38, 1992. (Cited in page 22), (Cited in page 40)

[GGGK03] Shahram Ghandeharizadeh, Shan Gao, Chris Gahagan, and Russ Krauss. High performance parallel database management systems. In Handbook on Data Management in Information Systems. Springer, Berlin, Heidelberg, 2003. (Cited in page 39)

[GGL03] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles SOSP, Bolton Landing, NY, USA, October 19-22, pages 29–43, 2003. (Cited in page 53), (Cited in page 165)

[GHK92] Sumit Ganguly, Waqar Hasan, and Ravi Krishnamurthy. Query optimization for parallel execution. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 2-5, pages 9–18, 1992. (Cited in page 32)

[GHS14] Luis Galárraga, Katja Hose, and Ralf Schenkel. Partout: a distributed engine for efficient RDF processing. In WWW, pages 267–268, 2014. (Cited in page 90), (Cited in page 156)

[GJGL16] Damien Graux, Louis Jachiet, Pierre Genevès, and Nabil Layaïda. SPARQLGX: efficient distributed evaluation of SPARQL with Apache Spark. In International Conference on Semantic Web (ISWC), pages 80–87, 2016. (Cited in page 90)

[GJS76] M. R. Garey, David S. Johnson, and Larry J. Stockmeyer. Some simplified NP-complete graph problems. Theor. Comput. Sci., 1(3):237–267, 1976. (Cited in page 72)

[GKM+15] François Goasdoué, Zoi Kaoudi, Ioana Manolescu, Jorge-Arnulfo Quiané-Ruiz, and Stamatis Zampetakis. CliqueSquare: Flat plans for massively parallel RDF queries. In ICDE, pages 771–782, 2015. (Cited in page 7), (Cited in page 90), (Cited in page 128), (Cited in page 140)

[GL81] Alan George and Joseph W. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice Hall Professional Technical Reference, 1981. (Cited in page 73)

[GLL+20] Jin-Tao Gao, Wenjie Liu, Zhanhuai Li, Jian Zhang, and Li Shen. A general fragments allocation method for join query in distributed database. Inf. Sci., 512:1249–1263, 2020. (Cited in page 43), (Cited in page 44), (Cited in page 52)

[GMB19] Jorge Galicia, Amin Mesmoudi, and Ladjel Bellatreche. RDFPartSuite: Bridging physical and logical RDF partitioning. In Proceedings of the 21st International Conference on Big Data Analytics and Knowledge Discovery - DaWaK, Linz, Austria, pages 136–150, 2019. (Cited in page 133), (Cited in page 136), (Cited in page 175)

[GMBO19] Jorge Galicia, Amin Mesmoudi, Ladjel Bellatreche, and Carlos Ordonez. Reverse partitioning for SPARQL queries: Principles and performance analysis. In Proceedings of the 30th International Conference in Database and Expert Systems Applications - DEXA, Linz, Austria, August 26-29, pages 174–183, 2019. (Cited in page 131)

[GMR98] Matteo Golfarelli, Dario Maio, and Stefano Rizzi. The dimensional fact model: A conceptual model for data warehouses. Int. J. Cooperative Inf. Syst., 7(2-3):215–247, 1998. (Cited in page 34)

[GO18] Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2018. (Cited in page 137), (Cited in page 138)

[GPdBG94] Marc Gyssens, Jan Paredaens, Jan Van den Bussche, and Dirk Van Gucht. A graph-oriented object database model. IEEE Trans. Knowl. Data Eng., 6(4):572–586, 1994. (Cited in page 64)

[GPH05] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. LUBM: A benchmark for OWL knowledge base systems. J. Web Semant., 3(2-3):158–182, 2005. (Cited in page 128)

[GPST94] Alejandro Gutiérrez, Philippe Pucheral, Hermann Steffen, and Jean-Marc Thévenin. Database graph views: A practical model to manage persistent graphs. In Proceedings of 20th International Conference on Very Large Data Bases VLDB, September 12-15, Santiago de Chile, Chile, pages 391–402, 1994. (Cited in page 64)

[Gra94] Goetz Graefe. Volcano - an extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng., 6(1):120–135, 1994. (Cited in page 38), (Cited in page 122), (Cited in page 127), (Cited in page 174)

[Gro12] W3C OWL Working Group. OWL 2 web ontology language. https://www.w3.org/TR/2012/REC-owl2-overview-20121211/, 2012. (Cited in page 2), (Cited in page 81)

[GSMT14] Sairam Gurajada, Stephan Seufert, Iris Miliaraki, and Martin Theobald. TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing. In International Conference on Management of Data, SIGMOD, Snowbird, UT, USA, June 22-27, pages 289–300, 2014. (Cited in page 86), (Cited in page 90)

[Güt94] Ralf Hartmut Güting. GraphDB: Modeling and querying graphs in databases. In Proceedings of 20th International Conference on Very Large Data Bases VLDB, September 12-15, Santiago de Chile, Chile, pages 297–308, 1994. (Cited in page 64), (Cited in page 77)

[GV90] Georges Gardarin and Patrick Valduriez. ESQL: an extended SQL with object and deductive capabilities. In Proceedings of the International Conference on Database and Expert Systems Applications, Vienna, Austria, August 29-31, pages 299–306, 1990. (Cited in page 34)

[GXD+14] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI, Broomfield, CO, USA, October 6-8, pages 599–613, 2014. (Cited in page 75)

[GZQ+20] Qingyu Guo, Fuzhen Zhuang, Chuan Qin, Hengshu Zhu, Xing Xie, Hui Xiong, and Qing He. A survey on knowledge graph-based recommender systems. CoRR, abs/2003.00911, 2020. (Cited in page 3)

[HA16] Jiewen Huang and Daniel Abadi. LEOPARD: lightweight edge-oriented partitioning and replication for dynamic graphs. Proc. VLDB Endow., 9(7):540–551, 2016. (Cited in page 75), (Cited in page 76)

[Had] Welcome to Apache Hadoop! https://hadoop.apache.org/. (Cited in page 53)

[HAR11] Jiewen Huang, Daniel J. Abadi, and Kun Ren. Scalable SPARQL querying of large RDF graphs. PVLDB, 4(11):1123–1134, 2011. (Cited in page 88), (Cited in page 89), (Cited in page 90)

[HBR19] Benjamin Hilprecht, Carsten Binnig, and Uwe Röhm. Towards learning a partitioning advisor with deep reinforcement learning. In Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM@SIGMOD, Amsterdam, The Netherlands, July 5, pages 6:1–6:4, 2019. (Cited in page 47), (Cited in page 49), (Cited in page 51)

[HDA+14] Minyang Han, Khuzaima Daudjee, Khaled Ammar, M. Tamer Özsu, Xingfang Wang, and Tianqi Jin. An experimental comparison of Pregel-like graph processing systems. Proc. VLDB Endow., 7(12):1047–1058, 2014. (Cited in page 79)

[He16] Qi He. Building the LinkedIn knowledge graph. https://engineering.linkedin.com/blog/2016/10/building-the-linkedin-knowledge-graph, 2016. (Cited in page 3)

[HK00] Bruce Hendrickson and Tamara G. Kolda. Graph partitioning models for parallel computing. Parallel Comput., 26(12):1519–1534, 2000. (Cited in page 72)

[HL90] Kien A. Hua and Chiang Lee. An adaptive data placement scheme for parallel database computer systems. In Proceedings of the 16th International Conference on Very Large Data Bases, August 13-16, Brisbane, Queensland, Australia, pages 493–506, 1990. (Cited in page 39), (Cited in page 40), (Cited in page 52)

[HN79] Michael Hammer and Bahram Niamir. A heuristic approach to attribute partitioning. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts, USA, May 30 - June 1, pages 93–101, 1979. (Cited in page 35), (Cited in page 37), (Cited in page 51)

[Hof76] Jeffrey A. Hoffer. An integer programming formulation of computer data base design problems. Inf. Sci., 11(1):29–48, 1976. (Cited in page 23), (Cited in page 25), (Cited in page 26), (Cited in page 35), (Cited in page 37), (Cited in page 51)

[HP03] Richard A. Hankins and Jignesh M. Patel. Data morphing: An adaptive, cache-conscious storage technique. In Proceedings of 29th International Conference on Very Large Data Bases, VLDB, Berlin, Germany, September 9-12, pages 417–428, 2003. (Cited in page 36), (Cited in page 37)

[HRN+15] Mohammad Hammoud, Dania Abed Rabbou, Reza Nouri, Seyed-Mehdi-Reza Beheshti, and Sherif Sakr. DREAM: distributed RDF engine with adaptive query planner and minimal communication. PVLDB, 8(6):654–665, 2015. (Cited in page 90)

[HS75] Jeffrey A. Hoffer and Dennis G. Severance. The use of cluster analysis in physical data base design. In Proceedings of the International Conference on Very Large Data Bases, September 22-24, Framingham, Massachusetts, USA, pages 69–86, 1975. (Cited in page 21), (Cited in page 23), (Cited in page 24), (Cited in page 25), (Cited in page 26), (Cited in page 35), (Cited in page 37), (Cited in page 162)

[HS13] Katja Hose and Ralf Schenkel. WARP: workload-aware replication and partitioning for RDF. In Workshops Proceedings of the 29th IEEE ICDE, pages 1–6, 2013. (Cited in page 90), (Cited in page 156)

[HVL14] Kai Herrmann, Hannes Voigt, and Wolfgang Lehner. Cinderella - adaptive online partitioning of irregularly structured data. In Workshops Proceedings of the 30th International Conference on Data Engineering Workshops, ICDE 2014, Chicago, IL, USA, March 31 - April 4, 2014, pages 284–291, 2014. (Cited in page 84), (Cited in page 98)

[iIaC05] W3C Office in Italy at C.N.R. Representing knowledge in the semantic web. http://www.w3c.it/talks/2005/openCulture/slide1-0.html, 2005. (Cited in page 1)

[Ior10] Borislav Iordanov. HyperGraphDB: A generalized graph database. In International Workshops: Web-Age Information Management - WAIM IWGD, XMLDM, WCMT, Jiuzhaigou Valley, China, July 15-17, pages 25–36, 2010. (Cited in page 77), (Cited in page 78)

[JD11] Alekh Jindal and Jens Dittrich. Relax and let the database do the partitioning online. In 5th International Workshop Enabling Real-Time Business Intelligence - BIRTE, 37th International Conference on Very Large Databases, VLDB, Seattle, WA, USA, September 2, pages 65–80, 2011. (Cited in page 49), (Cited in page 50), (Cited in page 52)

[JG77] James B. Rothnie Jr. and Nathan Goodman. An overview of the preliminary design of SDD-1: A system for distributed databases. In Berkeley Workshop, pages 39–57, 1977. (Cited in page 25), (Cited in page 26)

[JL74] James B. Rothnie Jr. and Tomas Lozano. Attribute based file organization in a paged memory environment. Commun. ACM, 17(2):63–69, 1974. (Cited in page 26)

[JPPD13] Alekh Jindal, Endre Palatinus, Vladimir Pavlov, and Jens Dittrich. A comparison of knives for bread slicing. Proc. VLDB Endow., 6(6):361–372, 2013. (Cited in page 37)

[JQD11] Alekh Jindal, Jorge-Arnulfo Quian´e-Ruiz, and Jens Dittrich. Trojan data layouts: right shoes for a running elephant. In ACM Symposium on Cloud Computing in conjunction with SOSP, SOCC, Cascais, Portugal, October 26-28, page 21, 2011. (Cited in page 58)

[JSL20] Daniel Janke, Steffen Staab, and Martin Leinberger. Data placement strategies that speed-up distributed graph query processing. In Proceedings of The International Workshop on Semantic Big Data, SBD@SIGMOD, Portland, Oregon, USA, June 19, pages 2:1–2:6, 2020. (Cited in page 7), (Cited in page 153), (Cited in page 160), (Cited in page 176)

[JST17] Daniel Janke, Steffen Staab, and Matthias Thimm. Koral: A glass box profiling system for individual components of distributed RDF stores. In 2nd International Workshop on Benchmarking Linked Data and NLIWoD3: Natural Language Interfaces for the Web, 2017. (Cited in page 89)

[JSW72] William T. McCormick Jr., Paul J. Schweitzer, and Thomas W. White. Problem decomposition and data reorganization by a clustering technique. Operations Research, 20(5):993–1009, 1972. (Cited in page 25)

[KG14] Aapo Kyrola and Carlos Guestrin. GraphChi-DB: Simple design for a scalable graph database system - on just a PC. CoRR, abs/1403.0701, 2014. (Cited in page 77), (Cited in page 78)

[KK98a] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Scientific Computing, 20(1):359–392, 1998. (Cited in page 50), (Cited in page 66), (Cited in page 73), (Cited in page 74), (Cited in page 88), (Cited in page 111), (Cited in page 115), (Cited in page 138), (Cited in page 173)

[KK98b] George Karypis and Vipin Kumar. A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. J. Parallel Distributed Comput., 48(1):71–95, 1998. (Cited in page 75)

[KKTC12] Vaibhav Khadilkar, Murat Kantarcioglu, Bhavani M. Thuraisingham, and Paolo Castagna. Jena-HBase: A distributed, scalable and efficient RDF triple store. In Proceedings of the ISWC 2012 Posters & Demonstrations Track, Boston, USA, November 11-15, 2012. (Cited in page 90)

[KL70] Brian W. Kernighan and Shen Lin. An efficient heuristic procedure for partitioning graphs. Bell Syst. Tech. J., 49(2):291–307, 1970. (Cited in page 73)

[KL95] Kamalakar Karlapalem and Qing Li. Partitioning schemes for object oriented databases. In Proceedings RIDE-DOM ’95, Fifth International Workshop on Research Issues in Data Engineering - Distributed Object Management, Taipei, Taiwan, March 6-7, pages 42–49, 1995. (Cited in page 10), (Cited in page 45), (Cited in page 52)

[KL00] Kamalakar Karlapalem and Qing Li. A framework for class partitioning in object-oriented databases. Distributed and Parallel Databases, 8(3):333–366, 2000. (Cited in page 10), (Cited in page 45)

[KM15] Zoi Kaoudi and Ioana Manolescu. RDF in the clouds: a survey. VLDB J., 24(1):67–91, 2015. (Cited in page 80), (Cited in page 81), (Cited in page 86), (Cited in page 155)

[KMG+20] Abdallah Khelil, Amin Mesmoudi, Jorge Galicia, Ladjel Bellatreche, Mohand-Saïd Hacid, and Emmanuel Coquery. Combining graph exploration and fragmentation for scalable RDF query processing. Information Systems Frontiers, 2020. (Cited in page 14), (Cited in page 88), (Cited in page 121), (Cited in page 122), (Cited in page 134), (Cited in page 138), (Cited in page 140), (Cited in page 146), (Cited in page 155), (Cited in page 161), (Cited in page 174)

[KP88] Myoung-Ho Kim and Sakti Pramanik. Optimal file distribution for partial match retrieval. In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 1-3, pages 173–182, 1988. (Cited in page 39), (Cited in page 40), (Cited in page 52)

[KS95] Ralph Kimball and Kevin Strehlo. Why decision support fails and how to fix it. SIGMOD Rec., 24(3):92–97, 1995. (Cited in page 34)

[Kun87] H. S. Kunii. DBMS with graph data model for knowledge handling. In Proceedings of the Fall Joint Computer Conference on Exploring technology: today and tomorrow, pages 138–142, 1987. (Cited in page 63)

[KV84] Gabriel M. Kuper and Moshe Y. Vardi. A new approach to database logic. In Proceedings of the Third ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, April 2-4, Waterloo, Ontario, Canada, pages 86–96, 1984. (Cited in page 63)

[LAC+11] Yuting Lin, Divyakant Agrawal, Chun Chen, Beng Chin Ooi, and Sai Wu. Llama: leveraging columnar storage for scalable join processing in the MapReduce framework. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Athens, Greece, June 12-16, pages 961–972, 2011. (Cited in page 58)

[LAP+12] Miguel Liroz-Gistau, Reza Akbarinia, Esther Pacitti, Fábio Porto, and Patrick Valduriez. Dynamic workload-based partitioning for large-scale databases. In Proceedings of the 23rd International Conference on Database and Expert Systems Applications - DEXA, Vienna, Austria, September 3-6, pages 183–190, 2012. (Cited in page 49)

[LAP+13] Miguel Liroz-Gistau, Reza Akbarinia, Esther Pacitti, Fábio Porto, and Patrick Valduriez. Dynamic workload-based partitioning algorithms for continuously growing databases. Trans. Large-Scale Data- and Knowledge-Centered Systems, 12:105–128, 2013. (Cited in page 49), (Cited in page 50), (Cited in page 52)

[LG12] Liangzhe Li and Le Gruenwald. Autonomous database partitioning using data mining on single computers and cluster computers. In 16th International Database Engineering & Applications Symposium, IDEAS, Prague, Czech Republic, August 8-10, pages 32–41, 2012. (Cited in page 36), (Cited in page 37)

[LGK+12] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Distributed GraphLab: A framework for machine learning in the cloud. Proc. VLDB Endow., 5(8):716–727, 2012. (Cited in page 78), (Cited in page 79)

[LGS+79] Vincent Y. Lum, Sakti P. Ghosh, Mario Schkolnick, Robert W. Taylor, D. Jefferson, Stanley Y. W. Su, James P. Fry, Toby J. Teorey, B. Yao, D. S. Rund, B. Kahn, Shamkant B. Navathe, D. Smith, L. Aguilar, W. J. Barr, and P. E. Jones. 1978 New Orleans data base design workshop report. In Proceedings of the Fifth International Conference on Very Large Data Bases, October 3-5, Rio de Janeiro, Brazil, pages 328–339, 1979. (Cited in page 21), (Cited in page 25), (Cited in page 26)

[LIJ+15] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015. (Cited in page 3)

[LL13] Kisung Lee and Ling Liu. Scaling queries over big RDF graphs with semantic hash partitioning. PVLDB, 6(14):1894–1905, 2013. (Cited in page 89), (Cited in page 90)

[LLD79] W. C. Lin, Richard C. T. Lee, and David Hung-Chang Du. Common properties of some multiattribute file systems. IEEE Trans. Software Eng., 5(2):160–174, 1979. (Cited in page 26), (Cited in page 27)

[LMV10] Alexandre A. B. Lima, Marta Mattoso, and Patrick Valduriez. Adaptive virtual partitioning for OLAP query processing in a database cluster. JIDM, 1(1):75–88, 2010. (Cited in page 46), (Cited in page 49), (Cited in page 50), (Cited in page 52)

[LNP16] Andrew Lenharth, Donald Nguyen, and Keshav Pingali. Parallel graph analytics. Commun. ACM, 59(5):78–87, 2016. (Cited in page 71), (Cited in page 167)

[LOZ93] Xuemin Lin, Maria E. Orlowska, and Yanchun Zhang. A graph based cluster approach for vertical partitioning in database design. Data Knowl. Eng., 11(2):151–169, 1993. (Cited in page 31)

[LRV88] Christophe Lécluse, Philippe Richard, and Fernando Vélez. O2, an object-oriented data model. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 1-3, pages 424–433, 1988. (Cited in page 64)

[Lum70] Vincent Y. Lum. Multi-attribute retrieval with combined indexes. Commun. ACM, 13(11):660–665, 1970. (Cited in page 83)

[LY77] J. H. Liou and S. Bing Yao. Multi-dimensional clustering for data base organizations. Inf. Syst., 2(4):187–198, 1977. (Cited in page 26)

[LY80] K. Lam and Clement T. Yu. An approximation algorithm for a file-allocation problem in a hierarchical distributed system. In Proceedings of the 1980 ACM SIGMOD International Conference on Management of Data, Santa Monica, California, USA, May 14-16, pages 125–132, 1980. (Cited in page 109), (Cited in page 172)

[LZ93] Xuemin Lin and Yanchun Zhang. A new graphical method of vertical partitioning in database design. In Advances in Database Research - Proceedings of the 4th Australian Database Conference, ADC Griffith University, Brisbane, Queensland, Australia, February 1-2, pages 131–144, 1993. (Cited in page 36), (Cited in page 37), (Cited in page 51)

[MAB+10] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD, Indianapolis, Indiana, USA, June 6-10, pages 135–146, 2010. (Cited in page 78), (Cited in page 79)

[MASB18] Shikha Mehta, Parul Agarwal, Prakhar Shrivastava, and Jharna Barlawala. Differential bond energy algorithm for optimal vertical fragmentation of distributed databases. Journal of King Saud University - Computer and Information Sciences, 2018. (Cited in page 35)

[MD97] Manish Mehta and David J. DeWitt. Data placement in shared-nothing parallel database systems. VLDB J., 6(1):53–72, 1997. (Cited in page 40)

[MDA+10] Cristina Maier, Debabrata Dash, Ioannis Alagiannis, Anastasia Ailamaki, and Thomas Heinis. PARINDA: an interactive physical designer for PostgreSQL. In Proceedings of the 13th International Conference on Extending Database Technology EDBT, Lausanne, Switzerland, March 22-26, pages 701–704, 2010. (Cited in page 9), (Cited in page 47), (Cited in page 49), (Cited in page 51)

[MLLS17] Claudio Martella, Dionysios Logothetis, Andreas Loukas, and Georgos Siganos. Spinner: Scalable graph partitioning in the cloud. In 33rd IEEE International Conference on Data Engineering, ICDE, San Diego, CA, USA, April 19-22, pages 1083–1094, 2017. (Cited in page 75)

[MMG+07] Norbert Martínez-Bazan, Victor Muntés-Mulero, Sergio Gómez-Villamor, Jordi Nin, Mario-A. Sánchez-Martínez, and Josep-Lluís Larriba-Pey. DEX: high-performance exploration on large graphs for information retrieval. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM, Lisbon, Portugal, November 6-10, pages 573–582, 2007. (Cited in page 66), (Cited in page 77), (Cited in page 78)

[MS77] Salvatore T. March and Dennis G. Severance. The determination of efficient record segmentations and blocking factors for shared data files. ACM Trans. Database Syst., 2(3):279–296, 1977. (Cited in page 25)

[MS78] Salvatore T. March and Dennis G. Severance. A mathematical modeling approach to the automatic selection of database designs. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Austin, Texas, USA, May 31 - June 2, pages 52–65, 1978. (Cited in page 23), (Cited in page 25), (Cited in page 26)

[MS94] Mukesh K. Mohania and Nandlal L. Sarda. Some issues in design of distributed deductive databases. In Proceedings of 20th International Conference on Very Large Data Bases VLDB, September 12-15, Santiago de Chile, Chile, pages 60–71, 1994. (Cited in page 45), (Cited in page 52)

[MS98] Bongki Moon and Joel H. Saltz. Scalability analysis of declustering methods for multidimensional range queries. IEEE Trans. Knowl. Data Eng., 10(2):310–327, 1998. (Cited in page 22), (Cited in page 27), (Cited in page 40)

[NB11] Rimma V. Nehme and Nicolas Bruno. Automated partitioning design in parallel database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Athens, Greece, June 12-16, pages 1137–1148, 2011. (Cited in page 47), (Cited in page 48), (Cited in page 51), (Cited in page 157)

[NCWD84] Shamkant B. Navathe, Stefano Ceri, Gio Wiederhold, and Jinglie Dou. Vertical partitioning algorithms for database design. ACM Trans. Database Syst., 9(4):680–710, 1984. (Cited in page 8), (Cited in page 25), (Cited in page 26), (Cited in page 36), (Cited in page 37), (Cited in page 51)

[NH94] Kathleen Neumann and Lawrence J. Henschen. Partitioning algorithms for a distributed deductive database. In Proceedings of the ACM 22nd Annual Computer Science Conference on Scaling up: Meeting the Challenge of Complexity in Real-World Computing Applications, CSC ’94, Phoenix, Arizona, USA, March 8-10, pages 288–295, 1994. (Cited in page 21), (Cited in page 45), (Cited in page 52), (Cited in page 162)

[Nia78] Bahram Niamir. Attribute partitioning in a self-adaptive relational database system. Massachusetts Institute of Technology, Technical Report 192, 1978. (Cited in page 35), (Cited in page 37), (Cited in page 51)

[NKDC15] Daniel Nicoara, Shahin Kamali, Khuzaima Daudjee, and Lei Chen. Hermes: Dynamic partitioning for distributed social network graph databases. In Proceedings of the 18th International Conference on Extending Database Technology, EDBT, Brussels, Belgium, March 23-27, pages 25–36, 2015. (Cited in page 75), (Cited in page 76)

[NKH18] Yoon-Min Nam, Min-Soo Kim, and Donghyoung Han. A graph-based database partitioning method for parallel OLAP query processing. In 34th IEEE International Conference on Data Engineering, ICDE, Paris, France, April 16-19, pages 1025–1036, 2018. (Cited in page 46), (Cited in page 52)

[NM11] Thomas Neumann and Guido Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In Proceedings of the 27th International Conference on Data Engineering, ICDE, April 11-16, Hannover, Germany, pages 984–994, 2011. (Cited in page 11), (Cited in page 84), (Cited in page 85), (Cited in page 87), (Cited in page 98), (Cited in page 99), (Cited in page 134), (Cited in page 170), (Cited in page 171)

[NPM+15] Yavor Nenov, Robert Piro, Boris Motik, Ian Horrocks, Zhe Wu, and Jay Banerjee. RDFox: A highly-scalable RDF store. In 14th International Semantic Web Conference - ISWC, Bethlehem, PA, USA, October 11-15, pages 3–20, 2015. (Cited in page 83)

[NR89] Shamkant B. Navathe and Minyoung Ra. Vertical partitioning for database design: A graphical algorithm. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Portland, Oregon, USA, May 31 - June 2, pages 440–450, 1989. (Cited in page 8), (Cited in page 36), (Cited in page 37), (Cited in page 51)

[NU13] Joel Nishimura and Johan Ugander. Restreaming graph partitioning: simple versatile algorithms for advanced balancing. In The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, Chicago, IL, USA, August 11-14, pages 1106–1114, 2013. (Cited in page 73)

[NW08] Thomas Neumann and Gerhard Weikum. RDF-3X: a RISC-style engine for RDF. Proc. VLDB Endow., 1(1):647–659, 2008. (Cited in page 6), (Cited in page 83), (Cited in page 86), (Cited in page 107), (Cited in page 125), (Cited in page 128), (Cited in page 134), (Cited in page 160)

[NW10] Thomas Neumann and Gerhard Weikum. The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91–113, 2010. (Cited in page 88)

[OB19] Carlos Ordonez and Ladjel Bellatreche. Guest editorial: DaWaK 2018 special issue - trends in big data analytics. Data Knowl. Eng., 2019. (Cited in page 99)

[OBS99] Michael A. Olson, Keith Bostic, and Margo I. Seltzer. Berkeley DB. In Proceedings of the FREENIX Track: 1999 USENIX Annual Technical Conference, June 6-11, Monterey, California, USA, pages 183–191, 1999. (Cited in page 77)

[OO85] Esen A. Ozkarahan and Aris M. Ouksel. Dynamic and order preserving data partitioning for database machines. In Proceedings of 11th International Conference on Very Large Data Bases, August 21-23, Stockholm, Sweden, pages 358–368, 1985. (Cited in page 39), (Cited in page 40), (Cited in page 52)

[OOB18] Abdelkader Ouared, Yassine Ouhammou, and Ladjel Bellatreche. QoSMos: QoS metrics management tool suite. Comput. Lang. Syst. Struct., 54:236–251, 2018. (Cited in page 9)

[Ord13] Carlos Ordonez. Can we analyze big data inside a DBMS? In Proceedings of the sixteenth international workshop on Data warehousing and OLAP, DOLAP, pages 85–92, 2013. (Cited in page 13), (Cited in page 161)

[ÖV96] M. Tamer Özsu and Patrick Valduriez. Distributed and parallel database systems. ACM Comput. Surv., 28(1):125–128, 1996. (Cited in page 24)

[ÖV11] M. Tamer Özsu and Patrick Valduriez. Principles of distributed database systems. Springer Science & Business Media, 2011. (Cited in page 7), (Cited in page 8), (Cited in page 25), (Cited in page 26), (Cited in page 28), (Cited in page 29), (Cited in page 35), (Cited in page 37), (Cited in page 38), (Cited in page 41), (Cited in page 87), (Cited in page 97), (Cited in page 103), (Cited in page 104), (Cited in page 155), (Cited in page 160), (Cited in page 170)

[Özs16] M. Tamer Özsu. A survey of RDF data management systems. Frontiers Comput. Sci., 10(3):418–432, 2016. (Cited in page 80), (Cited in page 81), (Cited in page 82), (Cited in page 86), (Cited in page 122), (Cited in page 174)

[PA04] Stratos Papadomanolakis and Anastassia Ailamaki. AutoPart: Automating schema design for large scientific databases using data partitioning. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management SSDBM, 21-23 June, Santorini Island, Greece, pages 383–392, 2004. (Cited in page 47), (Cited in page 49)

[PB94] John R. Pilkington and Scott B. Baden. Partitioning with space filling curves. Technical report, 1994. (Cited in page 73)

[PCR15] Roshan Punnoose, Adina Crainiceanu, and David Rapp. SPARQL in the cloud using Rya. Information Systems, 48:181–195, 2015. (Cited in page 90)

[PCZ12] Andrew Pavlo, Carlo Curino, and Stanley B. Zdonik. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, pages 61–72, 2012. (Cited in page 32), (Cited in page 42), (Cited in page 44), (Cited in page 52)

[PKT+13] Nikolaos Papailiou, Ioannis Konstantinou, Dimitrios Tsoumakos, Panagiotis Karras, and Nectarios Koziris. H2RDF+: high-performance distributed joins over large-scale RDF graphs. In IEEE International Conference on Big Data, pages 255–263, 2013. (Cited in page 89), (Cited in page 90)

[PPEB15] Minh-Duc Pham, Linnea Passing, Orri Erling, and Peter A. Boncz. Deriving an emergent relational schema from RDF data. In Proceedings of the 24th International Conference on World Wide Web, WWW, Florence, Italy, May 18-22, pages 864– 874, 2015. (Cited in page 6), (Cited in page 84), (Cited in page 87), (Cited in page 98), (Cited in page 99), (Cited in page 100), (Cited in page 107), (Cited in page 170), (Cited in page 172)

[PPR+09] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Providence, Rhode Island, USA, June 29 - July 2, pages 165–178, 2009. (Cited in page 53), (Cited in page 165)

[PPT95] Jan Paredaens, Peter Peelman, and Letizia Tanca. G-log: A graph-based query language. IEEE Trans. Knowl. Data Eng., 7(3):436–453, 1995. (Cited in page 64)

[Pro19] ISO Graph Query Language Proponents. GQL standard. https://www.gqlstandards.org/, 2019. (Cited in page 68)

[PZÖ+16] Peng Peng, Lei Zou, M. Tamer Özsu, Lei Chen, and Dongyan Zhao. Processing SPARQL queries over distributed RDF graphs. VLDB Journal, 25(2):243–268, 2016. (Cited in page 13), (Cited in page 14), (Cited in page 88), (Cited in page 90), (Cited in page 133), (Cited in page 140), (Cited in page 146), (Cited in page 155), (Cited in page 161), (Cited in page 175)

[QKD13] Abdul Quamar, K. Ashwin Kumar, and Amol Deshpande. SWORD: scalable workload-aware data placement for transactional workloads. In Proceedings of EDBT/ICDT Conferences, Genoa, Italy, March 18-22, pages 430–441, 2013. (Cited in page 44)

[RC14] Richard Cyganiak, David Wood, and Markus Lanthaler. RDF 1.1 concepts and abstract syntax. https://www.w3.org/TR/rdf11-concepts/, 2014. (Cited in page 1), (Cited in page 80), (Cited in page 159), (Cited in page 167)

[RDS02] Ravishankar Ramamurthy, David J. DeWitt, and Qi Su. A case for fractured mirrors. In Proceedings of 28th International Conference on Very Large Data Bases, VLDB, August 20-23, Hong Kong, China, pages 430–441, 2002. (Cited in page 48), (Cited in page 51), (Cited in page 108), (Cited in page 172)

[Rei82] Raymond Reiter. Towards a logical reconstruction of relational database theory. In On Conceptual Modelling, Perspectives from Artificial Intelligence, Databases, and Programming Languages, Book resulting from the Intervale Workshop 1982, pages 191–233, 1982. (Cited in page 34)

[RG00] Raghu Ramakrishnan and Johannes Gehrke. Database management systems. McGraw Hill, 2000. (Cited in page 25), (Cited in page 28)

[RG06] Boris Rozenberg and Ehud Gudes. Association rules mining in vertically partitioned databases. Data Knowl. Eng., 59(2):378–396, 2006. (Cited in page 36), (Cited in page 37), (Cited in page 51)

[RHAF15] Oscar Romero, Victor Herrero, Alberto Abelló, and Jaume Ferrarons. Tuning small analytics on big data: Data partitioning and secondary indexes in the Hadoop ecosystem. Inf. Syst., 54:336–356, 2015. (Cited in page 58)

[RJ17] Tilmann Rabl and Hans-Arno Jacobsen. Query centric partitioning and allocation for partially replicated database systems. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD, Chicago, IL, USA, May 14-19, pages 315–330, 2017. (Cited in page 32), (Cited in page 43), (Cited in page 44), (Cited in page 52)

[RM75] Nick Roussopoulos and John Mylopoulos. Using semantic networks for database management. In Proceedings of the International Conference on Very Large Data Bases, September 22-24, Framingham, Massachusetts, USA, pages 144–172, 1975. (Cited in page 63)

[RPG+15] Fatemeh Rahimian, Amir H. Payberah, Sarunas Girdzijauskas, Márk Jelasity, and Seif Haridi. A distributed algorithm for large-scale graph partitioning. ACM Trans. Auton. Adapt. Syst., 10(2):12:1–12:24, 2015. (Cited in page 75)

[RS11] Kurt Rohloff and Richard E. Schantz. Clause-iteration with MapReduce to scalably query datagraphs in the SHARD graph-store. In Proceedings of the Fourth International Workshop on Data-intensive Distributed Computing, DIDC, San Jose, CA, USA, June 8, pages 35–44, 2011. (Cited in page 86), (Cited in page 90)

[RZML02] Jun Rao, Chun Zhang, Nimrod Megiddo, and Guy M. Lohman. Automating physical database design in a parallel database. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, June 3-6, pages 558–569, 2002. (Cited in page 8), (Cited in page 9), (Cited in page 21), (Cited in page 31), (Cited in page 47), (Cited in page 51)

[SABW13] Stephan Seufert, Avishek Anand, Srikanta J. Bedathur, and Gerhard Weikum. FERRARI: flexible and efficient reachability range assignment for graph indexing. In 29th IEEE International Conference on Data Engineering, ICDE, Brisbane, Australia, April 8-12, pages 1009–1020, 2013. (Cited in page 85)

[San93] Laura A. Sanchis. Multiple-way network partitioning with different cost functions. IEEE Trans. Computers, 42(12):1500–1504, 1993. (Cited in page 73)

[Sch07] Satu Elisa Schaeffer. Graph clustering. Comput. Sci. Rev., 1(1):27–64, 2007. (Cited in page 74)

[SGK+08] Lefteris Sidirourgos, Romulo Goncalves, Martin L. Kersten, Niels Nes, and Stefan Manegold. Column-store support for RDF data management: not all swans are white. PVLDB, 1(2):1553–1563, 2008. (Cited in page 87), (Cited in page 168)

[SH13a] Steve Harris and Andy Seaborne. SPARQL 1.1 overview. https://www.w3.org/TR/sparql11-overview/, 2013. (Cited in page 2), (Cited in page 67), (Cited in page 159)

[SH13b] Steve Harris and Andy Seaborne. SPARQL 1.1 query language. https://www.w3.org/TR/sparql11-query/, 2013. (Cited in page 81), (Cited in page 82)

[Shi81] David W. Shipman. The functional data model and the data language DAPLEX. ACM Trans. Database Syst., 6(1):140–173, 1981. (Cited in page 63)

[Shn77] Ben Shneiderman. Reduced combined indexes for efficient multiple attribute retrieval. Inf. Syst., 2(4):149–154, 1977. (Cited in page 83)

[Sim72] Robert F. Simmons. Semantic networks: their computation and use for understanding English sentences. Department of Computer Sciences and Computer-Assisted Instruction Laboratory, University of Texas, 1972. (Cited in page 63), (Cited in page 166)

[SKK97] Kirk Schloegel, George Karypis, and Vipin Kumar. Multilevel diffusion schemes for repartitioning of adaptive meshes. J. Parallel Distributed Comput., 47(2):109–124, 1997. (Cited in page 75)

[SKS+97] Abraham Silberschatz, Henry F Korth, Shashank Sudarshan, et al. Database system concepts, volume 4. McGraw-Hill New York, 1997. (Cited in page 28), (Cited in page 47)

[SKW07] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW, Banff, Alberta, Canada, May 8-12, pages 697–706, 2007. (Cited in page 3)

[SMR00] Thomas Stöhr, Holger Märtens, and Erhard Rahm. Multi-dimensional database allocation for parallel data warehouses. In Proceedings of 26th International Conference on Very Large Data Bases, VLDB, Cairo, Egypt, September 10-14, pages 273–284, 2000. (Cited in page 10), (Cited in page 21), (Cited in page 45), (Cited in page 52), (Cited in page 162)

[SPBL15] Alexander Schätzle, Martin Przyjaciel-Zablocki, Thorsten Berberich, and Georg Lausen. S2X: graph-parallel querying of RDF with GraphX. In Biomedical Data Management and Graph Online Querying Workshop, pages 155–168, 2015. (Cited in page 90)


[SPL11] Alexander Schätzle, Martin Przyjaciel-Zablocki, and Georg Lausen. PigSPARQL: mapping SPARQL to Pig Latin. In International Workshop on Semantic Web Information Management (SWIM), page 4, 2011. (Cited in page 90)

[SPNL14] Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, and Georg Lausen. Sempala: Interactive SPARQL query processing on Hadoop. In International Conference on Semantic Web (ISWC), pages 164–179, 2014. (Cited in page 90)

[SPSL16] Alexander Schätzle, Martin Przyjaciel-Zablocki, Simon Skilevic, and Georg Lausen. S2RDF: RDF querying with SPARQL on Spark. Proc. VLDB Endow., 9(10):804–815, 2016. (Cited in page 86), (Cited in page 89), (Cited in page 90), (Cited in page 104), (Cited in page 171)

[Spy87] Nicolas Spyratos. The partition model: A deductive database model. ACM Trans. Database Syst., 12(1):1–37, 1987. (Cited in page 21), (Cited in page 162)

[SRB+18] Kuldeep Singh, Arun Sethupat Radhakrishna, Andreas Both, Saeedeh Shekarpour, Ioanna Lytra, Ricardo Usbeck, Akhilesh Vyas, Akmal Khikmatullaev, Dharmen Punjani, Christoph Lange, Maria-Esther Vidal, Jens Lehmann, and Sören Auer. Why reinvent the wheel: Let's build question answering systems together. In Proceedings of the World Wide Web Conference, WWW, Lyon, France, April 23-27, pages 1247–1256, 2018. (Cited in page 4)

[SS11] Peter Sanders and Christian Schulz. Engineering multilevel graph partitioning algorithms. In Proceedings of the 19th Annual European Symposium on Algorithms, ESA, Saarbrücken, Germany, September 5-9, pages 469–480, 2011. (Cited in page 74)

[STE+16] Marco Serafini, Rebecca Taft, Aaron J. Elmore, Andrew Pavlo, Ashraf Aboulnaga, and Michael Stonebraker. Clay: Fine-grained adaptive partitioning for general database schemas. PVLDB, 10(4):445–456, 2016. (Cited in page 50), (Cited in page 52)

[SW83] Domenico Saccà and Gio Wiederhold. Database partitioning in a cluster of processors. In Proceedings of the 9th International Conference on Very Large Data Bases, October 31 - November 2, Florence, Italy, pages 242–247, 1983. (Cited in page 31), (Cited in page 38), (Cited in page 52), (Cited in page 164)

[SW85] Domenico Saccà and Gio Wiederhold. Database partitioning in a cluster of processors. ACM Trans. Database Syst., 10(1):29–56, 1985. (Cited in page 26), (Cited in page 31), (Cited in page 36), (Cited in page 109), (Cited in page 164), (Cited in page 172)

[SWL13] Bin Shao, Haixun Wang, and Yatao Li. Trinity: a distributed graph engine on a memory cloud. In Proceedings of the ACM SIGMOD International Conference on Management of Data, New York, NY, USA, June 22-27, pages 505–516, 2013. (Cited in page 78)

[TBC+13] Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, and John McPherson. From ”think like a vertex” to ”think like a graph”. Proc. VLDB Endow., 7(3):193–204, 2013. (Cited in page 79)

[Tea15] Bing Team. Bing announces availability of the knowledge and action graph API. https://blogs.bing.com/search/2015/08/20/bing-announces-availability-of-the-knowledge-and-action-graph-api-for-developers/, 2015. (Cited in page 3)

[TF76] Robert W. Taylor and Randall L. Frank. CODASYL data-base management systems. ACM Comput. Surv., 8(1):67–103, 1976. (Cited in page 63)

[TGRV14] Charalampos E. Tsourakakis, Christos Gkantsidis, Bozidar Radunovic, and Milan Vojnovic. FENNEL: streaming graph partitioning for massive scale graphs. In Seventh ACM International Conference on Web Search and Data Mining, WSDM, New York, NY, USA, February 24-28, pages 333–342, 2014. (Cited in page 73), (Cited in page 75)

[TNST14] Khai Q. Tran, Jeffrey F. Naughton, Bruhathi Sundarmurthy, and Dimitris Tsirogiannis. JECB: a join-extension, code-based approach to OLTP data partitioning. In International Conference on Management of Data, SIGMOD, Snowbird, UT, USA, June 22-27, pages 39–50, 2014. (Cited in page 43), (Cited in page 44), (Cited in page 52)

[TÖ15] Volkan Tüfekçi and Can Özturan. Partitioning graph databases by using access patterns. In Adaptive Resource Management and Scheduling for Cloud Computing - Second International Workshop, ARMS-CC, Held in Conjunction with ACM Symposium on Principles of Distributed Computing, PODC, Donostia-San Sebastián, Spain, July 20, pages 158–176, 2015. (Cited in page 75), (Cited in page 76)

[Tor14] Nicolas Torzec. The Yahoo knowledge graph, presented at the 10th Semantic Technology Business Conference. https://www.slideshare.net/NicolasTorzec/the-yahoo-knowledge-graph, 2014. (Cited in page 3)

[VCLM13] Luis M. Vaquero, Félix Cuadrado, Dionysios Logothetis, and Claudio Martella. xDGP: A dynamic graph processing system with adaptive partitioning. CoRR, abs/1309.1049, 2013. (Cited in page 75), (Cited in page 76)

[VHM+14] Ruben Verborgh, Olaf Hartig, Ben De Meester, Gerald Haesendonck, Laurens De Vocht, Miel Vander Sande, Richard Cyganiak, Pieter Colpaert, Erik Mannens, and Rik Van de Walle. Querying datasets on the web with high availability. In 13th International Semantic Web Conference - ISWC, Riva del Garda, Italy, October 19-23, pages 180–196, 2014. (Cited in page 5)

[Vra12] Denny Vrandecic. Wikidata: a new platform for collaborative data collection. In Proceedings of the 21st World Wide Web Conference, WWW, Lyon, France, April 16-20, pages 1063–1064, 2012. (Cited in page 3)

[vRHK+16] Oskar van Rest, Sungpack Hong, Jinha Kim, Xuming Meng, and Hassan Chafi. PGQL: a property graph query language. In Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems, Redwood Shores, CA, USA, June 24 - 24, page 7, 2016. (Cited in page 68)

[W3C04] W3C. Architecture of the world wide web, volume one. https://www.w3.org/TR/webarch/#identification, 2004. (Cited in page 1)

[W3C08] W3C. An introduction to multilingual web addresses. https://www.w3.org/International/articles/idn-and-iri/, 2008. (Cited in page 1), (Cited in page 80)

[WB87] W. Kevin Wilkinson and Haran Boral. KEV - A kernel for Bubba. In Proceedings of the 5th International Workshop in Database Machines and Knowledge Base Machines, Tokyo, Japan, pages 31–44, 1987. (Cited in page 38)


[WC16] Marcin Wylot and Philippe Cudré-Mauroux. DiploCloud: Efficient and scalable management of RDF data in the cloud. IEEE Trans. Knowl. Data Eng., 28(3):659–674, 2016. (Cited in page 90)

[Web12] Jim Webber. A programmatic introduction to neo4j. In Conference on Systems, Programming, and Applications: Software for Humanity, SPLASH ’12, Tucson, AZ, USA, October 21-25, pages 217–218, 2012. (Cited in page 57), (Cited in page 66), (Cited in page 77), (Cited in page 78)

[WHCS18] Marcin Wylot, Manfred Hauswirth, Philippe Cudré-Mauroux, and Sherif Sakr. RDF data storage and query processing schemes: A survey. ACM Comput. Surv., 51(4):84:1–84:36, 2018. (Cited in page 80), (Cited in page 86)

[Wil06] Kevin Wilkinson. Jena property table implementation, 2006. (Cited in page 6), (Cited in page 83)

[WRT20] Wimmics Research Team, Inria, CNRS, Université Côte d'Azur. Covid-on-the-web dataset. https://github.com/Wimmics/CovidOnTheWeb, 2020. (Cited in page 3)

[XGFS13] Reynold S. Xin, Joseph E. Gonzalez, Michael J. Franklin, and Ion Stoica. GraphX: a resilient distributed graph system on Spark. In First International Workshop on Graph Data Management Experiences and Systems, GRADES, co-located with SIGMOD/PODS, New York, NY, USA, June 24, page 2, 2013. (Cited in page 78)

[YYZK12] Shengqi Yang, Xifeng Yan, Bo Zong, and Arijit Khan. Towards effective partition management for large graphs. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Scottsdale, AZ, USA, May 20-24, pages 517–528, 2012. (Cited in page 75), (Cited in page 76), (Cited in page 90)

[ZBS15] Erfan Zamanian, Carsten Binnig, and Abdallah Salama. Locality-aware partitioning in parallel database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, pages 17–30, 2015. (Cited in page 43), (Cited in page 44), (Cited in page 47), (Cited in page 52)

[ZCTW13] Xiaofei Zhang, Lei Chen, Yongxin Tong, and Min Wang. EAGRE: towards scalable I/O efficient SPARQL query evaluation on the cloud. In 29th IEEE ICDE, Brisbane, Australia, April 8-12, pages 565–576, 2013. (Cited in page 84), (Cited in page 86), (Cited in page 88), (Cited in page 90), (Cited in page 137)

[ZMG+20] Ishaq Zouaghi, Amin Mesmoudi, Jorge Galicia, Ladjel Bellatreche, and Taoufik Aguili. Query optimization for large scale clustered RDF data. In Proceedings of the 22nd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data co-located with EDBT/ICDT, Copenhagen, Denmark, March 30, pages 56–65, 2020. (Cited in page 116), (Cited in page 123), (Cited in page 127)

[ZÖC+14] Lei Zou, M. Tamer Özsu, Lei Chen, Xuchuan Shen, Ruizhe Huang, and Dongyan Zhao. gStore: a graph-based SPARQL query engine. VLDB Journal, 23(4):565–590, 2014. (Cited in page 6), (Cited in page 86), (Cited in page 121), (Cited in page 128), (Cited in page 134), (Cited in page 174)

[ZRL+04] Daniel C. Zilio, Jun Rao, Sam Lightstone, Guy M. Lohman, Adam J. Storm, Christian Garcia-Arellano, and Scott Fadden. DB2 design advisor: Integrated automatic physical database design. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3, pages 1087–1097, 2004. (Cited in page 21), (Cited in page 47), (Cited in page 48), (Cited in page 51)

[ZYW+13] Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao, and Zhongyuan Wang. A distributed graph engine for web scale RDF data. Proc. VLDB Endow., 6(4):265–276, 2013. (Cited in page 86), (Cited in page 90)


Appendix A Logic fragmentation example

A.1 Raw data & encoding

<:Airplane1> <:has_model> <:A340> .
<:Airplane1> <:has_length> "75.4" .
<:Airplane1> <:nb_motors> "4" .
<:Airplane1> <:type> <:Airliner> .
<:Airplane1> <:manufacturer> <:Airbus> .
<:Airplane2> <:has_model> <:B747> .
<:Airplane2> <:has_length> "70.1" .
<:Airplane2> <:nb_motors> "4" .
<:Airplane2> <:type> <:Airliner> .
<:Airplane2> <:manufacturer> <:Boeing> .
<:Airbus> <:office_in> "Toulouse" .
<:Airbus> <:has_seat> <:France> .
<:Boeing> <:office_in> "Chicago" .
<:Boeing> <:has_seat> <:USA> .
<:USA> <:has_city> "Chicago" .
<:France> <:has_city> "Toulouse" .
<:B747> <:nb_version> "5" .
<:A340> <:nb_version> "4" .

String        ID        Predicate      ID
:A340         10        has_city       0
:Airbus       20        has_length     1
:Airliner     30        has_model      2
:Airplane1    40        has_seat       3
:Airplane2    50        manufacturer   4
:B747         60        nb_motors      5
:Boeing       70        nb_version     6
Chicago       80        office_in      7
:France       90        type           8
Toulouse      100
:USA          110
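To make the correspondence between this encoding and the fragments of Sections A.2 and A.3 easier to follow, the following minimal Python sketch (illustrative only, not the implementation used in the thesis) reproduces the dictionary encoding above: non-numeric terms receive the identifiers 10, 20, 30, ... in lexical order, predicates the identifiers 0, 1, 2, ..., and numeric literals remain inline with a type tag such as 4^int. Applied to the full triple set of Section A.1 it yields the two tables above; the excerpt below only encodes a few triples.

    # Illustrative dictionary encoding (assumed from the tables above).
    raw = [
        (":Airplane1", "has_model", ":A340"),
        (":Airplane1", "has_length", "75.4"),
        (":Airbus", "office_in", "Toulouse"),
        (":France", "has_city", "Toulouse"),
        # ... remaining triples of Section A.1
    ]

    def is_numeric(term):
        try:
            float(term)
            return True
        except ValueError:
            return False

    # Non-numeric terms get IDs 10, 20, 30, ... in lexical order (colon ignored).
    terms = sorted({t for s, _, o in raw for t in (s, o) if not is_numeric(t)},
                   key=lambda t: t.lstrip(":"))
    node_id = {t: 10 * (i + 1) for i, t in enumerate(terms)}
    # Predicates get IDs 0, 1, 2, ... in lexical order.
    pred_id = {p: i for i, p in enumerate(sorted({p for _, p, _ in raw}))}

    def encode(s, p, o):
        # Resources are replaced by their IDs; numeric literals pass through.
        return (node_id[s], pred_id[p], node_id.get(o, o))

    print([encode(*t) for t in raw])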


A.2 Forward graph fragment

{"0":{ "id":1, "90":{"0":[100]}, "110":{"0":[80]} }, "6":{ "id":2, "10":{"6":[4^int]}, "60":{"6":[5^int]} }, "3-7":{ "id":3, "20":{"3":[90],"7":[100]}, "70":{"3":[110],"7":[60]} }, "1-2-4-5-8":{ "id":4, "40":{"1":[75.4^float],"2":[10],"4":[20],"5":[4^int],"8":[30]}, "50":{"1":[70.1^float],"2":[60],"4":[70],"5":[4^int],"8":[30]} } }

A.3 Backward graph fragment

{"1":{ "id":5, "75.4^float":{"1":[40]}, "70.1^float":{"1":[50]} }, "2":{ "id":6, "10":{"2":[40]}, "60":{"2":[50]} }, "3":{ "id":7, "90":{"3":[20]}, "110":{"3":[70]} }, "4":{ "id":8, "20":{"4":[40]}, "70":{"4":[50]} }, "6":{ "id":9, "5^int":{"6":[60]}

II A.3. BACKWARD GRAPH FRAGMENT

}, "8":{ "id":10, "30":{"8":[40,50]} }, "0-7":{ "id":11, "100":{"0":[20],"7":[90]}, "80":{"0":[70],"7":[110]} }, "5-6":{ "id":12, "4^int":{"5":[40,50],"6":[10]} } }


Appendix B Queries

B.1 Watdiv

B.1.1 Prefixes

@prefix pr:
@prefix prg:
@prefix foaf:
@prefix sc:
@prefix prt:
@prefix pro:
@prefix rdfs:
@prefix w:
@prefix ogp:
@prefix geo:

C: Complex S: Star F: Snowflake L: Linear

B.1.2 Queries

C1
SELECT ?v0 ?v4 ?v6 ?v7 WHERE {
  ?v0 sc:caption ?v1 .
  ?v0 sc:text ?v2 .
  ?v0 sc:contentRating ?v3 .
  ?v0 pr:hasReview ?v4 .
  ?v4 pr:title ?v5 .
  ?v4 pr:reviewer ?v6 .
  ?v7 sc:actor ?v6 .
  ?v7 sc:language ?v8 .
}

C2
SELECT ?v0 ?v3 ?v4 ?v8 WHERE {
  ?v0 sc:legalName ?v1 .
  ?v0 prg:offers ?v2 .
  ?v2 sc:eligibleRegion ?v10 .
  ?v2 prg:includes ?v3 .
  ?v4 sc:jobTitle ?v5 .
  ?v4 foaf:homepage ?v6 .
  ?v4 w:makesPurchase ?v7 .
  ?v7 w:purchaseFor ?v3 .
  ?v3 pr:hasReview ?v8 .
  ?v8 pr:totalVotes ?v9 .
}


C3
SELECT ?v0 WHERE {
  ?v0 w:likes ?v1 .
  ?v0 w:friendOf ?v2 .
  ?v0 prt:Location ?v3 .
  ?v0 foaf:age ?v4 .
  ?v0 w:gender ?v5 .
  ?v0 w:givenName ?v6 .
}

S1
SELECT ?v0 ?v1 ?v3 ?v4 ?v5 ?v6 ?v7 ?v8 ?v9 WHERE {
  ?v0 prg:includes ?v1 .
  ?v2 prg:offers ?v0 .
  ?v0 prg:price ?v3 .
  ?v0 prg:serialNumber ?v4 .
  ?v0 prg:validFrom ?v5 .
  ?v0 prg:validThrough ?v6 .
  ?v0 sc:eligibleQuantity ?v7 .
  ?v0 sc:eligibleRegion ?v8 .
  ?v0 sc:priceValidUntil ?v9 .
}

S2
SELECT ?v0 ?v1 ?v3 WHERE {
  ?v0 prt:Location ?v1 .
  ?v0 sc:nationality w:Country14 .
  ?v0 w:gender ?v3 .
  ?v0 rdfs:type w:Role2 .
}

S3
SELECT ?v0 ?v2 ?v3 ?v4 WHERE {
  ?v0 rdfs:type w:ProductCategory14 .
  ?v0 sc:caption ?v2 .
  ?v0 w:hasGenre ?v3 .
  ?v0 sc:publisher ?v4 .
}

S4
SELECT ?v0 ?v2 ?v3 WHERE {
  ?v0 foaf:age w:AgeGroup1 .
  ?v0 foaf:familyName ?v2 .
  ?v3 pro:artist ?v0 .
  ?v0 sc:nationality w:Country1 .
}

S5
SELECT ?v0 ?v2 ?v3 WHERE {
  ?v0 rdfs:type w:ProductCategory7 .
  ?v0 sc:description ?v2 .
  ?v0 sc:keywords ?v3 .
  ?v0 sc:language w:Language0 .
}

S6
SELECT ?v0 ?v1 ?v2 WHERE {
  ?v0 pro:conductor ?v1 .
  ?v0 rdfs:type ?v2 .
  ?v0 w:hasGenre w:SubGenre83 .
}

S7
SELECT ?v0 ?v1 ?v2 WHERE {
  ?v0 rdfs:type ?v1 .
  ?v0 sc:text ?v2 .
  w:User145869 w:likes ?v0 .
}

F1
SELECT ?v0 ?v2 ?v3 ?v4 ?v5 WHERE {
  ?v0 ogp:tag w:Topic172 .
  ?v0 rdfs:type ?v2 .
  ?v3 sc:trailer ?v4 .
  ?v3 sc:keywords ?v5 .
  ?v3 w:hasGenre ?v0 .
  ?v3 rdfs:type w:ProductCategory2 .
}

F2
SELECT ?v0 ?v1 ?v2 ?v4 ?v5 ?v6 ?v7 WHERE {
  ?v0 foaf:homepage ?v1 .
  ?v0 ogp:title ?v2 .
  ?v0 rdfs:type ?v3 .
  ?v0 sc:caption ?v4 .
  ?v0 sc:description ?v5 .
  ?v1 sc:url ?v6 .
  ?v1 w:hits ?v7 .
  ?v0 w:hasGenre w:SubGenre55 .
}


F3
SELECT ?v0 ?v1 ?v2 ?v4 ?v5 ?v6 WHERE {
  ?v0 sc:contentRating ?v1 .
  ?v0 sc:contentSize ?v2 .
  ?v0 w:hasGenre w:SubGenre42 .
  ?v4 w:makesPurchase ?v5 .
  ?v5 w:purchaseDate ?v6 .
  ?v5 w:purchaseFor ?v0 .
}

F4
SELECT ?v0 ?v1 ?v2 ?v4 ?v5 ?v6 ?v7 ?v8 WHERE {
  ?v0 foaf:homepage ?v1 .
  ?v2 prg:includes ?v0 .
  ?v0 ogp:tag w:Topic71 .
  ?v0 sc:description ?v4 .
  ?v0 sc:contentSize ?v8 .
  ?v1 sc:url ?v5 .
  ?v1 w:hits ?v6 .
  ?v1 sc:language w:Language0 .
  ?v7 w:likes ?v0 .
}

F5
SELECT ?v0 ?v1 ?v3 ?v4 ?v5 ?v6 WHERE {
  ?v0 prg:includes ?v1 .
  w:Retailer28001 prg:offers ?v0 .
  ?v0 prg:price ?v3 .
  ?v0 prg:validThrough ?v4 .
  ?v1 ogp:title ?v5 .
  ?v1 rdfs:type ?v6 .
}

L1
SELECT ?v0 ?v2 ?v3 WHERE {
  ?v0 w:subscribes w:Website18627 .
  ?v2 sc:caption ?v3 .
  ?v0 w:likes ?v2 .
}

L2
SELECT ?v1 ?v2 WHERE {
  w:City107 geo:parentCountry ?v1 .
  ?v2 w:likes w:Product0 .
  ?v2 sc:nationality ?v1 .
}

L3
SELECT ?v0 ?v1 WHERE {
  ?v0 w:likes ?v1 .
  ?v0 w:subscribes ?v2 .
}

L4
SELECT ?v0 ?v2 WHERE {
  ?v0 ogp:tag w:Topic132 .
  ?v0 sc:caption ?v2 .
}

L5
SELECT ?v0 ?v1 ?v3 WHERE {
  ?v0 sc:jobTitle ?v1 .
  w:City24 geo:parentCountry ?v3 .
  ?v0 sc:nationality ?v3 .
}


B.2 LUBM

B.2.1 Prefixes

@prefix rdfs:
@prefix lb:
@prefix u1:
@prefix u2:

B.2.2 Queries

1
SELECT ?x WHERE {
  ?x rdfs:type lb:GraduateStudent .
  ?x lb:takesCourse u1:GraduateCourse0 .
}

2
SELECT ?x ?y WHERE {
  ?x rdfs:type lb:GraduateStudent .
  ?y rdfs:type lb:University .
  ?z rdfs:type lb:Department .
  ?x lb:memberOf ?z .
  ?z lb:subOrganizationOf ?y .
  ?x lb:undergraduateDegreeFrom ?y .
}

3
SELECT ?x WHERE {
  ?x rdfs:type lb:Publication .
  ?x lb:publicationAuthor u1:AssistantProfessor0 .
}

4
SELECT ?x ?y1 ?y3 WHERE {
  ?x lb:worksFor u1 .
  ?x lb:name ?y1 .
  ?x lb:emailAddress ?y2 .
  ?x lb:telephone ?y3 .
}

5
SELECT ?x WHERE {
  ?x lb:memberOf u1 .
}

6
SELECT ?x WHERE {
  ?x rdfs:type lb:UndergraduateStudent .
}

7
SELECT ?x ?y WHERE {
  ?x rdfs:type lb:UndergraduateStudent .
  ?x lb:takesCourse ?y .
  ?y rdfs:type lb:Course .
  u1:AssociateProfessor0 lb:teacherOf ?y .
}

8
SELECT ?x ?y WHERE {
  ?x rdfs:type lb:UndergraduateStudent .
  ?x lb:memberOf ?y .
  ?x lb:emailAddress ?z .
  ?y rdfs:type lb:Department .
  ?y lb:subOrganizationOf u1 .
}


9
SELECT ?x ?y WHERE {
  ?x rdfs:type lb:UndergraduateStudent .
  ?y rdfs:type lb:FullProfessor .
  ?z rdfs:type lb:Course .
  ?x lb:advisor ?y .
  ?x lb:takesCourse ?z .
  ?y lb:teacherOf ?z .
}

10
SELECT ?x WHERE {
  ?x rdfs:type lb:GraduateStudent .
  ?x lb:takesCourse u1:GraduateCourse0 .
}

11
SELECT ?x ?y WHERE {
  ?x rdfs:type lb:ResearchGroup .
  ?x lb:subOrganizationOf ?y .
  ?y lb:subOrganizationOf u2 .
}

12
SELECT ?x ?y WHERE {
  ?x lb:headOf ?y .
  ?y rdfs:type lb:Department .
  ?y lb:subOrganizationOf u2 .
}

13
SELECT ?x WHERE {
  ?x lb:undergraduateDegreeFrom u2 .
}

14
SELECT ?x WHERE {
  ?x rdfs:type lb:UndergraduateStudent .
}

B.3 DBLP

B.3.1 Prefixes

@prefix foaf:
@prefix el:
@prefix o:
@prefix te:
@prefix owl:
@prefix dblp:
@prefix rdfs:
@prefix book:
@prefix author:

C: Complex S: Star L: Linear

B.3.2 Queries

C1
SELECT ?v0 ?v4 ?v6 ?v5 WHERE {
  ?v0 foaf:maker ?v1 .
  ?v3 el:creator ?v1 .
  ?v5 o:editor ?v2 .
  ?v2 te:tableOfContent ?v4 .
  ?v2 te:issued ?v6 .
}

C2
SELECT ?v0 ?v4 ?v6 ?v5 WHERE {
  ?v0 te:issued ?v1 .
  ?v0 el:identifier ?v2 .
  ?v0 o:volume ?v3 .
  ?v5 el:type ?v4 .
  ?v6 el:subject ?v4 .
}


C3
SELECT ?v0 ?v1 ?v2 WHERE {
  ?v0 el:title ?v1 .
  ?v0 owl:sameAs ?v3 .
  ?v0 te:bibliographicCitation ?v2 .
}

L1
SELECT ?v0 ?v1 ?v2 ?v3 WHERE {
  ?v0 el:creator ?v1 .
  ?v0 foaf:maker ?v2 .
  ?v0 foaf:homepage ?v3 .
}

L2
SELECT ?v0 ?v1 WHERE {
  ?v0 te:tableOfContent ?v1 .
  ?v0 te:bibliographicCitation dblp:Helbig2006 .
}

L3
SELECT ?v0 ?v1 ?v2 ?v3 WHERE {
  ?v0 o:month ?v1 .
  ?v2 el:subject ?v1 .
  ?v3 o:isbn ?v1 .
}

L4
SELECT ?v0 ?v1 WHERE {
  ?v0 rdfs:seeAlso book:Shasha92 .
  ?v0 te:tableOfContent ?v1 .
}

S1
SELECT ?v0 ?v1 ?v2 ?v5 WHERE {
  ?v0 el:creator ?v1 .
  ?v0 foaf:maker ?v2 .
  ?v0 o:journal ?v3 .
  ?v0 rdfs:type ?v4 .
  ?v0 rdfs:label ?v5 .
}

S2
SELECT ?v0 ?v4 ?v6 WHERE {
  ?v0 te:tableOfContent ?v1 .
  ?v0 o:editor author:Rodney_W._Topor .
  ?v0 o:number ?v3 .
  ?v0 rdfs:label ?v4 .
  ?v0 te:issued ?v5 .
  ?v0 te:bibliographicCitation ?v6 .
}

S3
SELECT ?v0 ?v3 ?v1 WHERE {
  ?v0 rdfs:seeAlso ?v1 .
  ?v3 foaf:homepage ?v1 .
}

S4
SELECT ?v0 ?v1 ?v3 WHERE {
  ?v0 rdfs:seeAlso ?v1 .
  ?v3 foaf:page ?v1 .
}

B.4 Yago

B.4.1 Prefixes

@prefix yago:
@prefix rdfs:


B.4.2 Queries

1
SELECT ?GivenName ?FamilyName WHERE {
  ?p yago:hasGivenName ?GivenName .
  ?p yago:hasFamilyName ?FamilyName .
  ?p yago:wasBornIn ?city .
  ?p yago:hasAcademicAdvisor ?a .
  ?a yago:wasBornIn ?city .
}

2
SELECT ?GivenName ?FamilyName WHERE {
  ?p yago:hasGivenName ?GivenName .
  ?p yago:hasFamilyName ?FamilyName .
  ?p yago:wasBornIn ?city .
  ?p yago:hasAcademicAdvisor ?a .
  ?a yago:wasBornIn ?city .
  ?p yago:isMarriedTo ?p2 .
  ?p2 yago:wasBornIn ?city .
}

3
SELECT ?name1 ?name2 WHERE {
  ?a1 yago:hasPreferredName ?name1 .
  ?a2 yago:hasPreferredName ?name2 .
  ?a1 yago:actedIn ?movie .
  ?a2 yago:actedIn ?movie .
}

4
SELECT ?name1 ?name2 WHERE {
  ?p1 yago:hasPreferredName ?name1 .
  ?p2 yago:hasPreferredName ?name2 .
  ?p1 yago:isMarriedTo ?p2 .
  ?p1 yago:wasBornIn ?city .
  ?p2 yago:wasBornIn ?city .
}

5
SELECT ?p WHERE {
  ?p yago:hasGivenName ?gn .
  ?p rdfs:type ?h .
  ?p yago:wasBornIn ?city .
  ?city yago:isLocatedIn ?i .
  ?p yago:livesIn ?a .
  ?a yago:wasBornIn ?city2 .
  ?city2 yago:isLocatedIn ?j .
}




List of Figures

1 Examples for RDF and SPARQL ...... 2
2 Linked Open Data (LOD) evolution ...... 3
3 Knowledge Graphs Ecosystem ...... 4
4 An Example of an instantiation of our Model: The Case of the Yago KG ...... 5
5 Triple stores classification ...... 6
6 Horizontal partitioning evolution in Oracle RDBMS ...... 8
7 Partitioning environment in relational databases ...... 9
8 Horizontal partitioning in [BBRW09] ...... 10
9 Partitioning example by data model ...... 11
10 Fusing data partitioning to both worlds ...... 12
11 RDFPartSuite Framework ...... 13
12 Breakdown of thesis chapters ...... 15

1.1 Aircraft table ...... 21
1.2 Database partitioning origins ...... 26
1.3 A star-schema partitioning environment ...... 27
1.4 Partitioning types for table Airplane ...... 28
1.5 Hybrid partitioning example ...... 29
1.6 Type dimension schema ...... 30
1.7 Organization of schema Section 1.4.1 ...... 35
1.8 Affinity matrix of Airplane table ...... 36
1.9 MAGIC grid example ...... 39
1.10 Owner and member relations ...... 41
1.11 Schism data representation ...... 42
1.12 Star-schema example ...... 46
1.13 Large-scale systems ...... 53
1.14 MapReduce schema ...... 54
1.15 Internal structure of HBase row example ...... 57

2.1 Graph data structures ...... 65
2.2 Storage structures for graph of Figure 2.1c ...... 67
2.3 Graph query examples ...... 68
2.4 Nodes visiting order ...... 70
2.5 Examples of Sections 2.2.4.1 and 2.2.4.2 ...... 71
2.6 Partitioning problems ...... 72
2.7 Graph partitioning algorithms ...... 73
2.8 Graph algorithms schemas ...... 74
2.9 Partitioning in Sedge ...... 76
2.10 RDF example ...... 80


2.11 RDF-Schema associated to graph of Figure 2.10a ...... 81
2.12 Relational-based storage of the graph of Figure 2.10a ...... 84
2.13 Property table optimizations of Figure 2.10a ...... 85
2.14 Partitioning Process in RDF Systems ...... 87

3.1 Partitioning design process ...... 98
3.2 Forward graph fragment example ...... 100
3.3 Backward graph fragment example ...... 105
3.4 Query execution by storage ...... 108
3.5 Graph partitioning example ...... 112
3.6 Miniature of graph G of Figure 2.10 ...... 112
3.7 Graph fragments examples ...... 114
3.8 Execution strategies for query Q3 ...... 115
3.9 Extract of weighted graphs of fragments for graph G ...... 116

4.1 RDF QDAG architecture ...... 122
4.2 Storage files examples for graph of Figure 2.10 ...... 124
4.3 Compression structure in RDF QDAG ...... 125
4.4 Query graph example ...... 126
4.5 Query execution plans example ...... 126
4.6 Execution pipeline example ...... 127
4.7 Volcano execution in RDF QDAG example ...... 128
4.8 Logarithmic loading times ...... 130
4.9 Performance of partitioning configurations in relational-based system ...... 132
4.10 Query performance of forward and backward graph fragments ...... 133
4.11 Yago BGP queries results ...... 136
4.12 Triple distribution by allocation method. GP: graph partitioning, RR: round-robin, LP: linear programming ...... 137
4.13 Mappings analysis for Watdiv100k. LP: linear programming, RR: round-robin, H: hashing on the subject, GPS: graph partitioning - small groups, GP: graph partitioning ...... 139
4.14 Partitioning advisor use case ...... 148
4.15 Partitioning advisor architecture ...... 149
4.16 RDFPart advisor's welcome screen ...... 150
4.17 RDFPart statistics and re-fragmentation modules ...... 151
4.18 RDFPart allocation interface ...... 151
4.19 Normalization process ...... 156
4.20 Calibration component extension ...... 157
4.21 Schéma en étoile de l'environnement de partitionnement ...... 162

List of Tables

1 Partitioning advisors in [Bel18] ...... 9

1.1 High-level variants of the partitioning problem ...... 24
1.2 Partitioning type by DBMS ...... 29
1.3 Partitioning approaches ...... 34
1.4 Vertical partitioning approaches ...... 37
1.5 Single-attribute horizontal declustering strategies ...... 39
1.6 Multi-attribute declustering ...... 39
1.7 Recent horizontal partitioning approaches ...... 44
1.8 Partitioning advisors ...... 47
1.9 Online partitioning approaches ...... 50
1.10 Partitioning approaches ...... 51

2.1 Dynamic graph partitioning approaches ...... 75
2.2 Graph databases mentioned in Section 2.2.6 ...... 78
2.3 State of the art systems ...... 90
2.4 Main strengths of relational and RDF partitioning ...... 91

3.1 Edge’s weights in fragment graph G ...... 111

4.1 Experimental datasets ...... 129
4.2 Loading times comparisons ...... 130
4.3 Data coverage per dataset ...... 131
4.4 Experimental datasets in relational-based system ...... 132
4.5 Watdiv BGP queries results in seconds ...... 135
4.6 LUBM500M BGP queries results in seconds ...... 135
4.7 DBLP BGP queries results in seconds ...... 135
4.8 Execution time (in seconds) for Watdiv100M queries ...... 142
4.9 Execution time (in seconds) for LUBM queries ...... 142
4.10 Execution time (in seconds) for DBLP and YAGO queries ...... 142

Résumé

Le Resource Description Framework (RDF) et SPARQL sont des standards très populaires basés sur des graphes, initialement conçus pour représenter et interroger des informations sur le Web. La flexibilité offerte par RDF a motivé son utilisation dans d'autres domaines. Aujourd'hui les jeux de données RDF sont d'excellentes sources d'information. Ils rassemblent des milliards de triplets dans des Knowledge Graphs qui doivent être stockés et exploités efficacement. La première génération de systèmes RDF a été construite sur des bases de données relationnelles traditionnelles. Malheureusement, les performances de ces systèmes se dégradent rapidement car le modèle relationnel ne convient pas au traitement des données RDF intrinsèquement représentées sous forme de graphe. Les systèmes RDF natifs et distribués cherchent à surmonter cette limitation. Les premiers utilisent principalement l'indexation comme stratégie d'optimisation pour accélérer les requêtes. Les deuxièmes recourent au partitionnement des données. Dans le modèle relationnel, la représentation logique de la base de données est cruciale pour concevoir le partitionnement. La couche logique définissant le schéma explicite de la base de données offre un certain confort aux concepteurs. Cette couche leur permet de choisir manuellement ou automatiquement, via des assistants automatiques, les tables et les attributs à partitionner. Aussi, elle préserve les concepts fondamentaux sur le partitionnement qui restent constants quel que soit le système de gestion de base de données. Ce schéma de conception n'est plus valide pour les bases de données RDF car le modèle RDF n'applique pas explicitement un schéma aux données. Ainsi, la couche logique est inexistante et le partitionnement des données dépend fortement des implémentations physiques des triplets sur le disque. Cette situation contribue à avoir des logiques de partitionnement différentes selon le système cible, ce qui est assez différent du point de vue du modèle relationnel. Dans cette thèse, nous promouvons l'idée d'effectuer le partitionnement de données au niveau logique dans les bases de données RDF. Ainsi, nous traitons d'abord le graphe de données RDF pour prendre en charge le partitionnement basé sur des entités logiques. Puis, nous proposons un framework pour effectuer les méthodes de partitionnement. Ce framework s'accompagne de procédures d'allocation et de distribution des données. Notre framework a été incorporé dans un système de traitement des données RDF centralisé (RDF QDAG) et un système distribué (gStoreD). Nous avons mené plusieurs expériences qui ont confirmé la faisabilité de l'intégration de notre framework aux systèmes existants en améliorant leurs performances pour certaines requêtes. Enfin, nous concevons un ensemble d'outils de gestion du partitionnement de données RDF dont un langage de définition de données (DDL) et un assistant automatique de partitionnement.

Mots-clés : Bases de données – Conception ; Bases de données – Gestion ; Graphes, Théorie des ; Partitionnement de graphes ; Resource Description Framework (informatique) ; SPARQL (langage de programmation) ; Structure logique ; Systèmes experts (informatique)

Abstract

The Resource Description Framework (RDF) and SPARQL are very popular graph-based standards initially designed to represent and query information on the Web. The flexibility offered by RDF motivated its use in other domains, and today RDF datasets are great information sources. They gather billions of triples in Knowledge Graphs that must be stored and efficiently exploited. The first generation of RDF systems was built on top of traditional relational databases. Unfortunately, the performance of these systems degrades rapidly because the relational model is not suitable for handling RDF data inherently represented as a graph. Native and distributed RDF systems seek to overcome this limitation. The former mainly use indexing as an optimization strategy to speed up queries; distributed and parallel RDF systems resort to data partitioning. In the relational model, the logical representation of the database is crucial to design data partitions. The logical layer defining the explicit schema of the database provides a degree of comfort to database designers. It lets them choose manually or automatically (through advisors) the tables and attributes to be partitioned. Besides, it allows the partitioning core concepts to remain constant regardless of the database management system. This design scheme is no longer valid for RDF databases, essentially because the RDF model does not explicitly enforce a schema: RDF data is mostly implicitly structured. Thus, the logical layer is nonexistent and data partitioning depends strongly on the physical implementation of the triples on disk. This situation leads to different partitioning logics depending on the target system, which is quite different from the relational model's perspective. In this thesis, we promote the novel idea of performing data partitioning at the logical level in RDF databases. Thereby, we first process the RDF data graph to support logical entity-based partitioning. After this preparation, we present a partitioning framework built upon these logical structures. This framework is accompanied by data fragmentation, allocation, and distribution procedures. The framework was incorporated into a centralized (RDF QDAG) and a distributed (gStoreD) triple store. We conducted several experiments that confirmed the feasibility of integrating our framework into existing systems, improving their performance for certain queries. Finally, we design a set of RDF data partitioning management tools including a data definition language (DDL) and an automatic partitioning wizard.

Keywords : Database design; Database management; Graph theory; Knowledge graph; RDF (Document markup language); SPARQL (Computer program language); Logic design; Expert systems (Computer science); Performance

Secteur de recherche : Informatique et applications

Laboratoire d'Informatique et d'Automatique pour les Systèmes
Ecole Nationale Supérieure de Mécanique et d'Aérotechnique
Téléport 2 – 1 avenue Clément Ader – BP 40109 – 86961 FUTUROSCOPE CHASSENEUIL CEDEX
Tél : 05.49.49.80.63 – Fax : 05.49.49.80.64