<<

Big Graph Processing : Partitioning and Aggregated Querying Ghizlane Echbarthi

To cite this version:

Ghizlane Echbarthi. Big Graph Processing : Partitioning and Aggregated Querying. Databases [cs.DB]. Université de Lyon, 2017. English. ￿NNT : 2017LYSE1225￿. ￿tel-01707153￿

HAL Id: tel-01707153 https://tel.archives-ouvertes.fr/tel-01707153 Submitted on 12 Feb 2018

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. No d’ordre NNT : 2017LYSE1225

THESEDEDOCTORATDEL’UNIVERSIT` EDELYON´ op´er´ee au sein de l’Universit´e Claude Bernard Lyon 1

Ecole´ Doctorale ED512 Infomath

Sp´ecialit´e de doctorat : Informatique

Soutenue publiquement le 23/10/2017, par : Ghizlane ECHBARTHI

Big Graph Processing: Partitioning and Aggregated Querying

Devant le jury compos´ede:

Nom Pr´enom, grade/qualit´e, ´etablissement/entreprise Pr´esident(e)

Karine Zeitouni, Professeur, Universit´e de Versaille Rapporteure Raphael Couturier, Professeur, Universit´e de Franche-Comt´e Rapporteur Angela Bonifati, Professeur, Universit´e Lyon 1 Examinatrice Mohand Boughanem, Professeur, Universit´e Paul Sabatier Examinateur

Hamamache Kheddouci, Professeur, Universit´e Lyon 1 Directeur de th`ese

Remerciements

Tout d’abord je tiensa ` remercier mon directeur de th`ese Hamamache KHEDDOUCI, de m’avoir acceuilli au sein de l’´equipe GOAL, d’avoir constamment veill´e sur la qualit´e de mon travail et de m’avoir apport´e un soutien scientifique et morale d’une grande importance. Mes sinc`eres remerciements vont aux membres de jury, pour avoir accept´e d’´evaluer mon travail et de r´eviser ce manuscrit.

Mes remerciements chaleureux vonta ` l’ensemble du personnel du laboratoire LIRIS et de l’universit´e de Lyon, et en particulier le personnel administratif que j’ai pu cˆotoyer, Isabelle, Brigitte, Jean-Pierre, Catherine, pour leur efficacit´e et leur bienveillance.

Je tiensa ` remercier mon amie Kaoutar Klaye, pour sa fid´elit´e et son soutien intarissable et aussi pour l’inspiration qu’elle me donne pour toujours pers´ev´erer et donner le meilleur de moi-mˆeme.

Je remercie ma famille, mon p`ere, ma m`ere, mon fr`ere et sa petite famille, ma soeur et sa petite famille, mes deux soeurs Fati et Meriem d’ˆetre l`a pour moi et de m’apporter toujours un soutien tr`es chaleureux.

Je tiensa ` remercier mon mari Ouadie, je lui dois beaucoup dans ce travail de th`ese, je tiensa ` le remercier pour sa patience, son ´ecoute, sa disponibilit´e, sa g´en´erosit´eetson pr´ecieux soutien.

Ghizlane

ii

Abstract

The advent of ”big data” has tremendous repercussions on a broad range of data related domains, and it has impelled the use of novel techniques that achieve the best tradeoff between computational cost and precision.

In the graph theory field, graphs represent a powerful tool for data and problem modeling where all kinds of problems, starting from the very simple to the very complicated, can be effectively formalized. To resolve NP-complete or NP-hard problems in this field, research is being directed to approximation algorithms and heuristic solutions rather than exact solutions, which exhibit a very high computational cost that makes them impossible to use for massive graph datasets.

In this thesis, we tackle two main problems: first, the graph partitioning problem is studied in the context of ”big data”, where the focus is put on large graph streaming partitioning. In fact, the graph partitioning is an important and challenging problem when performing computation tasks over large distributed graphs, the reason is that a good partitioning leads to faster computations.

We studied and proposed several streaming partitioning models and heuristics, and we studied their performances theoretically and experimentally as well.

The second problem that we tackle in this thesis, is the querying of partitioned/dis- tributed graphs. In fact, querying graph datasets proves to be an important task as most of the real-life datasets are represented by networks of labeled entities.

In this context, we study the problem of ”aggregated graph search” that aims to answer queries on several graph fragments, and has the task to build a coherent final answer such that it forms an ”approximate matching” to the initial query. We proposed a new method for ”aggregated graph search”, and we studied its performances theoretically and experimentally on different real word graph datasets.

Key words: Balanced graph partitioning, Streaming partitioning, Streaming heuristics, Graph querying, Graph matching, Graph similarity , Aggregated search.

iii

R´esum´e

Avec l’av`enement du ”big data”, de nombreuses r´epercussions ont eu lieu dans tous les domaines de la technologie de l’information, pr´econisant des solutions innovantes rem- portant le meilleur compromis entre coˆuts et pr´ecision. En th´eorie des graphes, o`ules graphes constituent un support de mod´elisation puissant qui permet de formaliser des probl`emes allant des plus simples aux plus complexes, la recherche pour des probl`emes NP-complet ou NP-difficil se tourne plutˆot vers des solutions approch´ees, mettant ainsi en avant les algorithmes d’approximation et les heuristiques alors que les solutions ex- actes deviennent extrˆemement coˆuteuses et impossible d’utilisation.

Nous abordons dans cette th`ese deux probl´ematiques principales: dans un premier temps, le probl`eme du partitionnement des graphes est abord´e d’une perspective ”big data”, o`u les graphes massifs sont partitionn´es en streaming. En effet, le partitionnement de graphe est une tˆache cruciale et importante lors de la mise en place des graphes de donn´ees distribu´ees, pour la principale raison est qu’un bon partitionnement permet d’acc´elerer les calculs sur ces graphes de donn´ees distribu´ees.

Dans ce contexte, nous ´etudions et proposons plusieurs mod`eles et heuristiques de partitionnement en streaming et nous ´evaluons leurs performances autant sur le plan th´eorique qu’empirique.

Dans un second temps, nous nous int´eressons au requˆetage des graphes distribu´es ou par- titionn´es. En effet, une grande majorit´ededonn´ees est mod´elis´ee sous formes d’entit´es ´etiquet´ees et interconnect´ees, ce qui fait du requˆetage des graphes distribu´es une tˆache importante dans une large liste d’applications.

Dans ce cadre, nous ´etudions la probl´ematique de la ”recherche agr´egative dans les graphes” qui a pour but de r´epondreadesrequˆ ` etes interrogeant plusieurs fragments de graphes et qui se charge de la reconstruction de la r´eponse finale tel que l’on obtient un ”matching approch´e” avec la requˆete initiale.

Mots cl´es: Requˆete de graphes, Matching de graphes, Mesure de similarit´edansles graphes, Recherche agr´egative dans les graphes, Partitionnement des graphes, Partition- nement en streaming, Heuristiques de streaming, Partitionnement ´equilibr´e des graphes.

iv

Contents

Remerciements ii

Abstract iii

R´esum´e iv

List of Figures ix

List of Tables xi

1 Introduction 1 1.1 Scope of the thesis and contributions ...... 2

I Massive Graph Partitioning 4

2 Graph Partitioning Problem 6 2.1 Graph Partitioning: Problem Definition and Application ...... 6 2.1.1 Notations ...... 6 2.1.2 Problem Definition ...... 7 2.1.2.1 Balanced & Unbalanced Graph Partitioning ...... 7 2.1.2.2 Hardness Results and Approximation ...... 8 2.1.3 Partitioning ...... 8 2.1.4 Graph Partitioning Applications ...... 9 2.1.4.1 Parallel Processing ...... 10 2.1.4.2 Complex networks ...... 10 2.1.4.3 Road networks ...... 11 2.1.4.4 Image Segmentation ...... 11 2.1.4.5 VLSI Physical Design ...... 11 2.2 Offline Partitioning ...... 12 2.2.1 Spectral Partitioning ...... 12 2.2.2 Multilevel Partitioning ...... 14 2.2.2.1 Coarsening phase ...... 15 2.2.2.2 Partitioning phase ...... 17 2.2.2.3 Uncoarsening and refinement phase ...... 18 2.2.3 Geometric Partitioning ...... 19 2.2.3.1 Geometric partitioning: Linear Embedding based Graph Partitioning ...... 20 2.3 Online Partitioning: Streaming Graph Partitioning ...... 21 v Contents vi

2.3.1 Problem Setting & Definition ...... 21 2.3.1.1 Streaming Model ...... 21 2.3.1.2 Problem Definition ...... 23 2.3.2 One Pass Streaming Partitioning ...... 23 2.3.2.1 Linear Deterministic Greedy heuristic ...... 24 2.3.2.2 FENNEL heuristic ...... 24 2.3.2.3 Fractional Greedy heuristic ...... 25 2.3.3 Restreaming Partitioning ...... 25 2.3.3.1 Restreaming the one-pass streaming partitioning heurtis- tics ...... 26 2.4 Chapter Summary ...... 27

3 Fractional Greedy Heuristic & Partial Restreaming Partitioning 28 3.1 Fractional Greedy: a proposed streaming partitioning heuristic for large graphs ...... 29 3.1.1 Preliminaries ...... 29 3.1.1.1 Streaming Model ...... 29 3.1.1.2 Motivation ...... 30 3.1.2 Fractional Greedy Heuristic ...... 31 3.1.2.1 Objective function ...... 31 3.1.2.2 Fractional Greedy algorithm ...... 32 3.1.3 Experimental Evaluation ...... 33 3.1.3.1 Experimental Set up ...... 33 3.1.3.2 Experimental results ...... 34 3.2 Partial Restreaming Partitioning ...... 35 3.2.1 Restreaming Partitioning: a brief introduction ...... 36 3.2.2 Simple Partial Restreaming Partitioning ...... 36 3.2.3 Selective Partial Restreaming Partitioning ...... 38 3.2.4 Experimental Evaluation ...... 39 3.2.4.1 Experimental Set up ...... 39 3.2.4.2 Experimental results ...... 40 3.3 Chapter summary ...... 42

4 Streaming METIS Partitioning: an Online version of METIS Heuristic for Big Graph Partitioning 44 4.1 Introduction & Context ...... 45 4.2 METIS: Multilevel Graph Partitioning Heuristic ...... 46 4.2.1 Coarsening phase ...... 47 4.2.2 Partitioning phase ...... 47 4.2.3 Uncoarsening and refinement phase ...... 47 4.3 Streaming METIS Partitioning Approach ...... 48 4.3.1 Notations ...... 48 4.3.2 Overview of the method ...... 48 4.3.3 Streaming METIS Partitioning Approach ...... 49 4.3.3.1 Atomic graph construction ...... 49 4.3.3.2 Partitioning ...... 50 4.3.3.3 Complexity Analysis ...... 53 Contents vii

4.4 Experimental Evaluation ...... 54 4.4.1 Settings ...... 54 4.4.2 Performance Metrics ...... 55 4.4.3 Experimental Results ...... 56 4.5 Chapter Summary ...... 60

II Aggregated Graph Search 61

5 Aggregated search: General definition and overview 63 5.1 Aggregated search: Motivation and definition ...... 64 5.1.1 Motivation ...... 65 5.1.2 Aggregated search definition ...... 65 5.2 Aggregated search: related research and IR disciplines ...... 68 5.2.1 Federated search ...... 69 5.2.2 Question answering ...... 69 5.2.3 Natural language generation ...... 70 5.2.4 Composite retrieval ...... 70 5.3 Graph aggregation: Miscellaneous approaches around ”aggregation” in graphs ...... 71 5.4 Relational Aggregated Search ...... 71 5.4.1 Relational Aggregated search framework ...... 72 5.4.1.1 Query Dispatching ...... 72 5.4.1.2 Relation Retrieval ...... 73 5.4.1.3 Result Aggregation ...... 73 5.5 Chapter summary ...... 74

6 Graph search methods 76 6.1 Preliminaries ...... 76 6.1.1 Graph search ...... 77 6.1.1.1 Graph Indexing ...... 77 6.1.1.2 Graph Matching ...... 78 6.2 Filtering and verification methods ...... 82 6.3 Graph querying methods based on graph matching ...... 83 6.3.1 (Sub)graph matching ...... 83 6.3.2 Graph matching based on similarity metrics ...... 84 6.3.2.1 SAGA: a subgraph matching tool for biological graphs .. 84 6.3.2.2 NEMA ...... 87 6.4 Querying semi-structured data (RDF graphs) ...... 91 6.5 Aggregated search in graph databases...... 91 6.6 A taxonomy on graph search methods and discussion ...... 92 6.7 Chapter summary ...... 93

7 LaSaS: A new Approximate Graph Matching framework based on Ag- gregated Search 94 7.1 Graph search using Aggregated Search ...... 95 7.1.1 Aggregated search in graph databases: Definition ...... 95 7.1.2 Aggregated search in graph databases: Motivation ...... 96 Contents viii

7.2 Preliminaries ...... 97 7.2.1 Query, Target graph, Answer ...... 98 7.2.2 Objective function ...... 98 7.2.2.1 Label similarity component ...... 98 7.2.2.2 Structure similarity component ...... 99 7.2.3 Problem Formulation ...... 100 7.3 Query Processing Algorithm ...... 100 7.3.1 Selection step ...... 100 7.3.2 Aggregation step ...... 103 7.3.3 Refinement step ...... 104 7.4 Experimental evaluation ...... 105 7.4.1 Experimental set up ...... 105 7.4.2 Experimental results ...... 107 7.5 Chapter summary ...... 110

8 Conclusion and Perspectives 111

Bibliography 113 List of Figures

1.1 An example of directed and undirected graphs...... 1

2.1 Example of a hypergraph [1]...... 9 2.2 Multilevel Graph Partitioning [2]...... 15 2.3 The streaming model used in the streaming graph partitioning problem. .22

3.1 Streaming model used in the streaming graph partitioning problem. ... 30 3.2 Cost function g.Thex-axis represents the size of a part and the y-axis represents the cost value...... 32 3.3 Average fraction of edge cut λ for FG,LDG,Fennel and METIS over ten graph datasets...... 35 3.4 Variation of λ with the growing size of the portion being restreamed represented by β for Webgoogle graph and LiveJournal. On the left PartFennel and on the right PartLDG and at the bottom PartFG. ... 41

4.1 Constructing the atomic graph at iteration t...... 51 4.2 Coarsening phase: an example with a focus on nodes d, M2 and M3. ... 53 4.3 SMP method: Variation of λ for different values of p for Enron and Slash- dot graph datasets...... 59

5.1 Example of an Aggregated search result in Google web search engine. .. 66 5.2 Aggregated search framework...... 68 5.3 A taxonomy of Data Aggregation...... 75

6.1 A taxonomy of Graph Search methods...... 92

7.1 Aggregated Graph Search vs. Classical Graph Search...... 96 7.2 Vertex weight update vs. Maximum Common Subgraph...... 102 7.3 Pruning process of the aggregate...... 105 7.4 F1-measure and Runtime (in seconds) reported on DBpedia graph, upon four different query sets with 100 query within each set while varying the label noise, having the structural noise set to 10%...... 107 7.5 F1-measure and Runtime (in seconds) reported on DBpedia upon four different query sets with 100 query within each set while varying the structural noise, having the label noise set to 10%...... 108 7.6 F1-measure and Runtime (in seconds) reported upon four different query sets with 100 query within each set while varying the matching nodes ratio in the B set...... 108

ix List of Figures x

7.7 F1-measure in terms of node matches, graphs matches and runtime re- ported upon four different query sets with 100 query within each set while varying the parameter λ...... 109 7.8 F1-measure in terms of node matches and runtime of LaSaS, NEMA, SAGA and BLINKS over all datasets. Results are reported for query set 3 (n=7,d=3), where k = 1 and label and structural noise were set to 10%. 109 7.9 Runtime reported upon four different query sets for all three datasets while varying the parameter k (the number of expected aggregates). ...110 List of Tables

3.1 Graph Datasets used for our tests...... 34 3.2 Fractionofedgecutλ and maximum load normalized ρ for 3 stream- ing heuristics LDG and FENNEL and FG and METIS,(1.001) indi- cates that the slackness allowed is 001. Results are obtained for 10 graph datasets where k =40...... 34 3.3 Comparison of the fraction of edge cut λ and the normalized maximum load ρ for ReFG and PartFG, ReLDG and PartReLDG, ReFennel and k PartFennel. k =40ands =10andβ = 2 ...... 41 3.4 Comparison of the fraction of edge cut λ and the normalized maximum load ρ for Partial restreaming methods and Selective partial restreaming methods PartFGvsPSelectFG, P artLDG vs P SelectLDG, P artF ennel vs P SelectF ennel. Results are obtained for 7 graph datasets where k =40...... 42 3.5 Runtime gain computed for PartLDGand P artF ennel and PartFGover executions on ten graphs with k =40ands = 10...... 42

4.1 Graph Datasets used for our tests...... 55 4.2 Fraction of edge cut λ given by SMP vs METIS. The balance ρ is 1.001 for both methods...... 56 4.3 Fractionofedgecutλ and normalized maximum load ρ given by SMP vs. online competitors: LDG, FENNEL and FG...... 57 4.4 Fractionofedgecutλ and normalized maximum load ρ given by SMP vs. restreaming methods: ReLDG, ReFENNEL and ReFG...... 58 4.5 Fractionofedgecutλ given by SMP in the BFS, DFS and Random orders. 58 4.6 Fraction of edge cut λ given by SMP vs. online competitors: LDG, FEN- NEL and FG under the DFS streaming order...... 59 4.7 Execution runtime Texec of SMP and METIS given in seconds...... 60

xi Chapter 1

Introduction

Graphs are the core subject of study in graph theory. The trivial structure of a graph, which consists of nodes and edges, was first used by J. J. Sylvester in 1878 [3].

An inclusive definition of graphs is found on Wikipedia: ”a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense ”related”. The objects correspond to mathematical abstractions called vertices (also called nodes or points) and each of the related pairs of vertices is called an edge (also called an arc or line). Typically, a graph is depicted in diagrammatic form as a set of dots for the vertices, joined by lines or curves for the edges.” [4]. Two major types of graphs exist: the undirected and directed graphs (see Figure 1.1). The undirected graphs represent nodes that share mutual relations represented by edges, whereas the directed graphs represent nodes with directed relations, i.e., not reciprocal. A simple example of a directed graph is a graph where nodes represent father and son, and the relation ”is the father of ”is represented by a directed edge, from the father node to the son node. An example for undirected graphs is where nodes represent siblings, and two nodes representing sister and brother have a mutual relation ”is the sibling of” represented by an undirected edge.

 

   

 

   

Figure 1.1: An example of directed and undirected graphs.

1 Introduction 2

Formally, graphs are usually defined as follows: we consider a graph G =(V,E), where V denotes the vertex set (or the nodes set), and E denotes the edge set, where each edge is represented by two vertices and we have E = V × V .

After this brief introduction to graphs, let us discuss on how graphs became so popular and what are the main surrounding challenges in the field.

Since the 19th century, graphs have attracted the interest of many researchers, and a rich literature has arisen throughout the years. In fact, graphs have been proved as an effective way of representing objects and modeling structured data [5]. Moreover, they represent a pervasive and flexible representation formalism suitable for a broad range of problems in intelligent information processing.

With the advent of big data, recent years have witnessed a sheer increase in the size of graph datasets emerging in different applications such as social networks. For instance, Facebook amounts to 1.284 billion of daily active users [6] and Twitter totals up to 328 millions of active users [7], not to mention the world wide web consisting of 4.61 billion of hyperlinks [8].

1.1 Scope of the thesis and contributions

The scope of this thesis could be seen as the intersection of the graph theory with big data in two main problems: the Graph Partitioning problem and the distributed graph querying problem, also known as Aggregated Graph Search.

The actual sheer increase in data has led to a plethora of graph datasets, which has attracted the interest in creating new frameworks for big graph processing: Graphlab [9], Microsoft’s Trinity [10], Pregel [11] and its open-source version Giraph [12], to name but a few. These platforms usually perform a partitioning of the graph as a preprocessing step, i.e., it distributes the graph dataset across several machines and performs parallel computation afterward. However, graph processing frameworks use a hash function to partition the graph, which gives balanced partitions but incurs a big communication volume causing the computations to slow down.

The graph partitioning problem tackles this issue, it aims to divide the graph data into several distinct sets of equal sizes with the constraint of minimizing the number of edges crossing these sets. In fact, balancing the partition ensures that each machine is given the same workload, and the minimized number of crossing edges reduces the network overhead which ultimately makes the graph partitioning an essential preprocessing step aiming to speed up the computations on graph datasets. Introduction 3

In parallel, data proliferation favors the use of graphs as a storage support, and as a result, graphs become a popular data model that enables efficient data processing. In considering this matter, graph querying has attracted the interest of many researchers as it represents a crucial task to explore the knowledge in these datasets.

Personal contributions during this thesis are as follows: we introduce a new heuristic for streaming partitioning that brings significant improvements over the state-of-the-art heuristics. We also introduce the partial restreaming partitioning which is a hybrid streaming model that improves the overall performances and lowers the computational costs.

In addition, we present a novel streaming heuristic for graph partitioning, termed Streaming METIS Partitioning (SMP), based on an adaptation of the well-known of- fline METIS [2]. The new heuristic extends METIS to online setting and brings about significant benefits to streaming graph partitioning.

Last but not least, we propose a framework for approximate graph matching called Label and Structure Similarity Search (LaSaS). The proposed framework enables an effective RDF graph querying using the aggregated search paradigm in the context of graphs.

Organization of the thesis. The first part of this thesis, from Chapter 2 to 4 is about massive graph partitioning. In Chapter 2, we present the graph partitioning problem, introduce many of the underlying concepts, and give an overview of the literature on the Graph Partitioning problem as well.

In Chapter 3, we present our proposed heuristic for streaming graph partitioning named Fractional Greedy, and present our proposed partial restreaming partitioning model.

In Chapter 4, we present our proposed approach called Streaming METIS Partitioning (SMP), intended for the streaming graph partitioning problem.

The second part of the thesis, from Chapter 5 through 7, is about the Aggregated Graph Search. In Chapter 5, we give a general overview of the general aggregated search concept. In Chapter 6, we review foremost algorithms and frameworks for the graph querying or the graph search task. Chapter 7 presents our proposed framework for performing aggregated graph search. Finally, in Chapter 8,wegiveasummaryofthe thesis and raise important future work directions and perspectives.

Part I

Massive Graph Partitioning

4 Graph partitioning (GP) is a crucial task as a preprocessing step on large-scale graph processing frameworks. There is a broad range of applications for the graph par- titioning problem, and the well-known is in parallel computing, where GP plays a key optimization role by accelerating computations and balancing the workload among par- allel processors or machines. Another reason for the GP to have gained such popularity is the advent of big data in the last decades, which gave rise to a plethora of datasets that are being stored and exploited as graphs.

In this part of the thesis, we will focus on the graph partitioning problem for massive graphs. The first chapter focuses on introducing the graph partitioning problem along with surrounding notions and concepts. We also give definitions and reviews on foremost works on the problem.

Chapter 3 and 4 present personal contributions on the field of large graph partitioning. Precisely, in Chapter 3, we present our proposed heuristic and model for the streaming graph partitioning, the latter being a partitioning setting that is suitable for processing massive graphs.

In Chapter 4, our proposed method called Streaming Metis Partitioning in presented. The novelty of this approach is that it has a hybrid feature, from both the offline and the online setting in graph partitioning, which makes it a well-performing heuristic for GP solving.

5 Chapter 2

Graph Partitioning Problem

This chapter is dedicated to introducing the graph partitioning problem, and give def- initions about underlying concepts, along with a synoptic list of foremost works and methods for solving the graph partitioning problem. Section 2.1 introduces the graph partitioning, and gives a formal definition of the problem. Section 2.2 reviews related work on graph partitioning within the offline class of methods, while in Section 2.3,we review related works belonging to the online class of methods. Section 2.4 summarizes the chapter and outlines important arisen ideas.

2.1 Graph Partitioning: Problem Definition and Applica- tion

This section is dedicated to introduce and present the graph partitioning problem, and related notions as well. We identify different variants of the problem, give definitions of related notions, and we finally give important applications of the problem of graph partitioning.

2.1.1 Notations

We will be using the following notations throughout the chapter. Let G be a graph, and we have G =(V,E), where V is the vertex set and E the edge set with |V | = n (the number of vertices in G)and|E| = m (the number of edges in G). Let k be a positive integer that represents the number of parts into which the graph G is partitioned. We define the partition P of graph G as the partition of V into k subsets of equal size,   (i.e.{V1,V2, ··· ,Vk} such that Vi Vj = ∅ for i = j and Vi = V ).

6 Graph Partitioning Problem 7

We refer to the k sets as parts, clusters or machines interchangeably. Let [k] denotes the set of positive integers {1, ··· ,k}.

In the partition P ,thecut or edge cut refers to the number of edges in E whose incident vertices belong to different subsets Vi and Vj with i = j.Wedenotebyλ the cut ratio, which is the fraction of the edge cut over the total number of edges in the graph.

2.1.2 Problem Definition

Graph partitioning is a well-studied problem, and it is commonly known as the k-way graph partitioning. It asks to divide a given graph into k balanced parts while minimizing the number of edges running across these parts.

In other words, given a graph G =(V,E), the balanced k-way partitioning aims to partition V into k subsets ({V1,V2, ··· ,Vk}) of equal size such that they form a partition P of V , which minimizes the edge cut.

Cut Objective function. The partition that is often sought in the graph partitioning problem, adheres to an objective function that should be either minimized or maximized. The prominent and mostly used objective function is the edge cut.

Let P be a partition of V . The edge cut is the number of edges in E whose incident vertices belong to different subsets of the partition P .Lete(V ,V −V ),i∈ [k] be the set i  i k |e(Vi,V −Vi)| ∈ of edges with ends belonging to different clusters. We define λ = i=1 m ,i [k] as the fraction of edge cut or cut ratio. In most cases, λ should be minimized during partitioning.

Another formulation of the graph partitioning problem is (k, ν)-balanced graph parti- tioning, where ν represents the imbalance factor and we have ν =1+,where is a positive real number. For a partition to be balanced, each part among the k ones, has ×n  to be of size ν k . Moreover, the edge cut should be minimized.

2.1.2.1 Balanced & Unbalanced Graph Partitioning

Also known as the constrained and unconstrained graph partitioning, the balanced and the unbalanced graph partitioning problems are two variants of the original problem. The main difference between these two problem variants, is that the balanced graph partitioning adheres to a strict constraint regarding the balance, i.e., the parts should be of equal sizes and imbalances are tolerated to a tight and firm extent. Graph Partitioning Problem 8

On the contrary, the unbalanced graph partitioning relaxes the constraint of strict bal- ance, and aims to solely minimize or maximize an objective function. In a way, unbal- anced graph partitioning problem can be seen as a clustering problem, where the priority is given to minimizing (or maximizing) an objective function regardless of the clusters’ sizes to be balanced or not.

2.1.2.2 Hardness Results and Approximation

As a decision problem, the k-way graph partitioning was proved to be NP-Complete [13, 14]. Moreover, the balanced graph partitioning was proven to be NP-Hard in [15], as an optimization problem. In fact, if we consider the setting of (k, ν)-balanced graph ×n  partitioning, when k = 2 and ν = 1 (every part has to be of size ν k ), then balanced graph partitioning problem is reduced to the problem of minimum bisection, which is also an NP-hard problem [16]. Due to the hardness of the problem, research works are mostly directed toward approximation algorithms and mainly toward heuristic solutions.

Andreev et al. have shown, in [16], that there is no constant-factor approximation for the exactly balanced version (when ν =1+ and  = 0) of this problem on general graphs. If the balance constrained is slightly relaxed by tolerating some imbalance, particularly if  ∈ (0, 1], then an O(log2n) factor approximation can be achieved. If the imbalance is larger, i.e. >1, an approximation ratio of O(logn)ispossible[17].

Several studies have suggested algorithms with approximation guarantees [16, 18, 19]. In [18], authors present an approximation algorithm with a polylogarithmic approximation guarantee.

Other contributions inspired from tasks settings propose more effective approximations: In [16], an O(−2log1.5n) approximation is proposed for a chosen and fixed >0. Besides,  another O( log(k)log(n)) approximation algorithm is proposed based on semidefinite programming [19].

In the practical point of view, most of the approximation algorithms are not imple- mented, and if so, they show poor performances and are particularly too slow when used for large graphs. Therefore, mostly heuristic solutions are used in practice.

2.1.3 Hypergraph Partitioning

Hypergraphs represent a generalization of simple graphs, where an edge can connect not only two nodes, but a group of them. Let H =(V ,E) be a hypergraph, then V is the set of nodes, and E is the set of hyperedges (also called net). In Figure 2.1,we Graph Partitioning Problem 9 present an example of a hypergraph, where e1, ··· ,e5 represent the hyperedges, and they connect several groups of vertices.

Figure 2.1: Example of a hypergraph [1].

Similarly to simple graph partitioning, hypergraph partitioning aims to find a partition of nodes into different subsets of equal size. However, the corresponding objective function expression is different. The edge cut in is called the hyperedge cut, and it represents the number of hyperedges connecting different subsets of nodes. A metric used to solve the hypergraph partitioning is called (λ − 1) metric, where (λ − 1) metric  = e∈E λe where λe is the number of subsets connected by the hyperedge e.

Compared to simple graph partitioning, hypergraph partitioning has a major drawback of using complex algorithms due to the model complexity. Therefore, hypergraph par- titioning should be used solely when the hypergraph model significantly benefits the underlying application.

In this thesis, we focus on the simple graph partitioning problem and therefore, omit giving more detailed information about the hypergraph partitioning. However, it is noteworthy to mention that many methods for simple graph partitioning have been translated to hypergraph partitioning [20–23], and the main application domain for hypergraph partitioning is VLSI design (Subsection 2.1.4.5).

2.1.4 Graph Partitioning Applications

The graph partitioning problem has many applications in different domains. In the following, we list some of the important and well-known applications of the problem. Graph Partitioning Problem 10

2.1.4.1 Parallel Processing

The k-way graph partitioning has major importance in distributed computation systems as it directly affects their performance: the k-way partitioning aims to reduce the overall runtime of the application, by assigning to each processor (or machine) the same amount of data while reducing the communication overhead by minimizing the edge cut.

2.1.4.2 Complex networks

Processing involving complex networks that arise in many domains benefit significantly from the graph partitioning problem. In the following, we focus on three well-known types of complex networks: social networks, power grids and biological networks.

Social Networks. A trivial problem that is studied in social network domain is the identification of community structure, also known as the community detection problem. In the context of these kinds of problems, the number of desired clusters is not specified as an input, unlike the graph partitioning problem, where the number of desired parts or clusters is fixed a priori. Even though, graph partitioning technics have largely inspired the community detection algorithms [24], and graph partitioning methods are usually used to give a first approximation of them. Readers interested in more examples of graph partitioning methods used for the community detection problem are directed to [25].

Power grids. In the realm of power grids, two major concerns are cascading failures and disturbances, which if controlled, can avoid calamitous blackouts. To prevent the propagation of failures in power grids, there is an approach that consists in partitioning the grid into independent subsets [26]. This partitioning uses an objective function that is combined with other optimization components such as the load shedding to intensify robustness and lessen the impact of cascading failures [27].

Biological Networks. A wide range of complex biological systems are modeled by graph representations, for instance, we can cite the protein-protein interactions network and gene co-expression network. In graph representation of a biological network, nodes correspond to biological entities (proteins, genes), and edges to a common feature in some biological process. The partitioning of these networks can have numerous goals. For example, the partitioning gathers, in one cluster, nodes that exhibit identical behavior to each other, which can be used for data reduction. Moreover, partitioning can help in the detection of some biological processes by identifying clusters of nodes participating in the process. Additional information about the biological networks could be found in [28]. Graph Partitioning Problem 11

2.1.4.3 Road networks

In such networks, nodes usually correspond to places, and edges represent road segments. In this context, graph partitioning is widely used for route planning speed up [29–33]. An algorithm called arc-flags was proposed in [29], which performs geometric partitioning as a preprocessing step to reduce the search space of Dijkstra’s algorithm. In fact, finding good partitions in road networks plays a crucial role in reducing the post-processing costs.

2.1.4.4 Image Segmentation

In the field of image processing, image segmentation corresponds to the task of parti- tioning the pixels of an image into several groups that correspond to distinct objects, where the graph partitioning have been widely used as solution technique.

Since early 90’s, representing images as graph structures was very popular, and many cut-based methods for image processing have arisen in this context. The representation of an image as a graph is such that a vertex corresponds to a pixel or group of pixels, and edges, usually weighted, represent similarities between them. Usually, the edge weights quantify a similarity or dissimilarity degree between nodes, like the difference of color intensity between the two pixels. Hence, the objective function used in graph partitioning in this context can be expressed differently depending on the applications and the explicitness needed to distinguish segments. Additional information about graph partitioning and image segmentation could be found in [34, 35].

2.1.4.5 VLSI Physical Design

Physical design of digital circuits for very large-scale integration (VLSI) systems are known for having intensively used graph partitioning technics for achieving general per- formance optimization. The graph representation of such circuits is such that nodes represent the cells, which correspond to atomic units of the circuit (gates), and edges represent the wires (cables). The main goal of partitioning these circuits is to reduce the VLSI design complexity, by partitioning it into smaller components and keeping the overall length of all the wires (cables) short. Moreover, there are additional con- straints for the circuits partitioning to ensure even better results such as: fixing the set of nodes that should end up in the same part (cluster), and fixing a firm threshold on the size between parts (clusters). However, it is noteworthy to point out that hypergraphs (see 2.1.3) model the circuit more precisely, as the gates (nodes) are Graph Partitioning Problem 12 connected with wires (edges) with more than two endpoints. Further information about VLSI circuits partitioning can be found in [36].

2.2 Offline Partitioning

The offline partitioning designates a typical feature of all early proposed methods for graph partitioning. This feature refers to the fact that in order to partition a graph, the latter should be entirely memory resident. In other words, traditional or classical methods for graph partitioning necessitate complete information of the whole graph to be partitioned, which is justifiable since graph datasets used at the time were relatively of small sizes. However, with the advent of the internet and big data, using such meth- ods for graph partitioning becomes extremely challenging, which has led to design new methods and technics that we refer to as the ”online partitioning” (see the following section).

The practical importance and NP-hardness of the graph partitioning problem have caused many heuristics and methods to be proposed: the spectral methods, known to produce excellent partitions for a wide class of problems despite their high expense. Geometric methods which use geometric information about the graph to partition it. These algorithms are known to be fast but usually produce partitions of worse quality compared to spectral methods. There is also the local spectral partitioning methods, e.g.,EvoCut[37], but it still requires information about large portions of the graph and performs large computations after loading the graph data. Multilevel methods such as METIS [2], are known to be fast and achieve high-quality partitioning.

2.2.1 Spectral Partitioning

The spectral method is an ancient way of partitioning a graph. It was first used by W. Donath and A. Hoffman [38, 39]. By this time, the spectral method was very popular, and was widely used to solve graph partitioning problems before multilevel methods took over.

The spectral method uses the , the Fiedler vector, eigenvalues to achieve a partitioning of the graph. We define in the following each of the aforementioned elements.

Definition 2.1. Laplacian matrix. To define the Laplacian matrix, let us first define the adjacency and the degree matrices of a graph G. Graph Partitioning Problem 13

Adjacency matrix. Let A be the of G =(V,E). Let w(i, j)bethe weight on edge (i, j)inE, for a simple graph, we have w(i, j)=1. ThenA is a square 2 th matrix n × n, and for every i, j ∈{1, ··· ,n} , we have the element ai,j in the i line and jth column in the matrix:  0ifi = j ai,j = w(i, j)else

Degree matrix. The degree matrix D of the graph G is defined as follows:   n deg(i)= l=1 w(i, l)ifi = j di,j = 0else

Where deg(i) is the degree of the ith node.

Laplacian matrix. Finally, the Laplacian matrix L of G is defined as follows:

L = D − A

In order to partition a graph, the spectral method uses the eigenvectors of the Laplacian matrix along with their corresponding eigenvalues. Let us define the eigenvectors and eigenvalues.

Definition 2.2. Eigenvector and Eigenvalue. Let E be a vector space on a division ring K.Letu be an endomorphism of E. Assume a vector x of E and θ in R such that u(x)=θx,thenu(x)=θx is called eigenvector of u.Andθ is called eigenvalue of u.

Let consider a matrix M of an endomorphism on real numbers; then we call eigenvector of M every vector x (column matrix) of real numbers that verifies ∃θ ∈ IR ,Mx= θx.

The spectral partitioning method uses the Fiedler eigenvector and eigenvalue of the Laplacian matrix L to find a bisection of G. Nevertheless; the spectral method can perform k-way partitioning through recursive bisections, where k =2i.

Fiedler eigenvectors and eigenvalues. M. Fiedler was the first one to study the properties of these eigenvectors and values [40]. Therefore, this group of eigenvectors is usually called the Fiedler vectors. The first Fiedler vector is the unit vector, and its eigenvalue is 0. The second Fiedler vector has the smallest non-zero eigenvalue of this eigenvalues group. Particularly, it is the second Fiedler vector that is used to find a bisection of the graph. Graph Partitioning Problem 14

There are several methods that can be used to find the eigenvectors of a matrix, we cite: the Lanczos iterative algorithm, the Rayleigh quotient iteration method, the Jacobi iter- ative algorithm and the QR decomposition algorithm. All these methods are presented in [41]. The Lanczos iterative algorithm was widely used [42–44]. However, the Rayleigh quotient iteration method is also used [45], and has gained some popularity when it is coupled with a multilevel method, which turns out to accelerate the computations [46].

Spectral method. Thespectralpartitioningaimstofinda2i-partition (with i ≥ 1) for a graph using the eigenvalues and eigenvectors of the laplacian matrix of the graph. Actually, we focus on the second smallest eigenvalue and the corresponding eigenvector, called Fiedler vector. The fiedler vector is of size n (the degree of the graph), and we take the median m of the vector and use it as a splitting value to get a bisection of the graph, such that: Let v(i)betheith component of the Fiedler vector v,thenwehave:

• If v(i) ≤ m then place the ith vertex in V1

• Else place the ith vertex in V2

Where V1 and V2 are the two parts of the bisection.

For partitioning into 2i parts where i>1, the spectral method recursively partitions the graphs into successive bisections. For instance, when i =2,i.e., we want to partition a graph into 4 parts, then the graph is first partitioned into 2 parts, then each obtained part is considered a graph, where the laplacian matrix is computed to find the second fiedler vector and obtain the second 2 parts.

Though they are known for achieving good quality results, the spectral methods suffer the major problem of high expense, the reason is that the algorithms used for finding eigenvectors are computationally very expensive. Even though the process is faster when coupled with a multilevel method, they are still too expensive to use for large actual graphs.

2.2.2 Multilevel Partitioning

Multilevel graph partitioning approaches are considered as the most successful heuristics for partitioning graphs. This type of methods is a combination of several borrowed algo- rithms that could be used independently in other contexts. These algorithms correspond to local optimization methods, called refinement methods in the graph partitioning con- text, and also direct partitioning methods such as the spectral method or region growing method. The multilevel method consists of three main phases: i) coarsening phase, ii) Graph Partitioning Problem 15 partitioning phase and iii) uncoarsening/refinement phase. The coarsening phase aims to contract the graph nodes to reduce the overall size of the graph. Then, a partitioning of the coarse graph is done using some algorithms for partitioning, e.g., the spectral method. Finally, the uncoarsening phase maps the found partition to the original graph and perform a refinement on the partition through local optimization swaps on parts. The multilevel partitioning is depicted in Figure 2.2.

In the following, we detail each of the three phases along with algorithms used.

Figure 2.2: Multilevel Graph Partitioning [2].

2.2.2.1 Coarsening phase

The coarsening phase aims to successively reduce the size of the graph to be partitioned.

Let G0 denote the original graph; then during coarsening, a series of smaller graphs

G1,G2, ··· ,Gq will be created such that V0 >V1 >V2 > ···>Vq where Gq is called the coarsest graph and consists of fewer vertices.

Usually, this can be achieved by finding a maximal matching of the graph, the reason is that a maximal matching of the graph has the ability to preserve many properties of the original graph [2, 47].

Definition 2.3. Maximal matching. The matching of a graph, also known as the independent edge set, is a set of edges without common vertices. Moreover, a matching Graph Partitioning Problem 16

M of a graph G is maximal if every edge in G has a non-empty intersection with at least one edge in M [48].

There are several ways to compute the maximal matching: Random matching (RM), Heavy edge matching (HEM), Light edge matching (LEM), Heavy clique matching (HCM).

Random matching (RM). Random matching consists in finding a matching using a randomized algorithm as follows: A vertex u is randomly selected. If u is not matched yet, then an unmatched neighbor v is randomly selected. Once v selected, the edge (u, v) is added to the matching, and vertices u and v are marked as matched. In the case of not finding the unmatched neighbor v,thevertexu is then marked unmatched in the final matching.

Heavy edge matching (HEM). HEM is similar to RM, however, it aims at finding a maximal matching with the higher edge-weight, which in turn, will result in a coarse graph with a lower edge-weight. The reason is that a graph with a low edge-weight has a low edge-cut as shown in an analysis given in [49].

HEM is computed using a randomized algorithm similarly to RM described earlier, the vertices are selected randomly. However, a vertex u is matched with vertex v such that the weight of the edge (u, v) is maximum overall incident edges. Though the algorithm does not guarantee that the matching obtained will have maximum weight (overall possible matchings), experiments showed that it works well. The complexity of computing a HEM is O(m)(m is the number of edges).

Light edge matching (LEM). Contrarily to the HEM, the LEM aims to maximize the total edge-weight of the coarser graph. This is achieved by finding a matching with the smallest edge weight, leading to a small reduction in the edge weight of the coarser graph Gi+1. It might be surprising to use LEM knowing that it will not reduce the edge-cut during partitioning, however, having a graph with a high average degree is suitable for some algorithms such as KL [50] in order to produce good partitions faster.

In order to compute LEM, there is only a minimal modification to do in the algorithm for computing HEM. Instead of selecting an edge (u, v) that has the largest weight, the edge that has the smallest weight is selected. Hence, the complexity of computing LEM is also O(m).

Heavy clique matching (HCM). In the heavy clique matching, vertices that have high edge density are collapsed. In other words, vertices that form cliques (or almost) will form a multinode and hence, such a clique will not be cut during the partitioning phase, which results in a good partition. Graph Partitioning Problem 17

Assuming we have a coarse graph Gi =(Vi,Ei), and we have u and v two vertices in

Gi, which means that u and v are multinodes (several vertices are collapsed into them during previous coarsenings) . The edge density between u and v is as follows:

2(C (u)+C (v)+W (u, v)) e e e (2.1) (Vw(u)+Vw(v))(Vw(u)+Vw(v) − 1)

Where Vw(u) (respectively Vw(v)) is the total weight on multinode u (respectively v), which is the sum of weights on vertices that have been collapsed into u (respectively v).

And Ce(u) (respectively Ce(v)) represents the weight of collapsed edges within multinode u (respectively v), and We(u, v) is the weight on the edge (u, v) which corresponds to the sum of the weights on all edges collapsed into (u, v).

The HCM is computed using a randomized algorithm that works as follows: Vertices are randomly selected. The algorithm matches an unmatched vertex u with its unmatched neighbor v, such that the density of the multinode formed by u and v has the largest edge density among all possible multinodes involving v.

It is noteworthy that HCM is similar to the HEM scheme. However, HEM matches vertices regarding only if they are connected with a heavy edge, while HCM matches two vertices if they are connected using a heavy edge and also if each of these two vertices has high contracted edge-weight within.

After the maximal matching is computed, the two vertices of each edge in the matching set are collapsed into one node, thereby reducing the graph size. The process of coars- ening is iterated until the coarse graph reaches the desired small size of hundred vertices for example.

2.2.2.2 Partitioning phase

During the partitioning phase, the coarsest graph Gq is partitioned, precisely, the vertex set Vq is partitioned into k parts. And since the coarsest graph Gq consists of few vertices, it is possible to use expensive methods for graph partitioning such as spectral partitioning method to give optimal partitions.

By the same time, various algorithms can be used in this phase, such as spectral method [51], KL algorithm [50], GGP and GGGP algorithms[2].

KL algorithm. The KL algorithm [50] is an iterative algorithm that starts with an initial bipartition of the graph, and through each iteration, it aims at finding a subset of vertices such that if swapped, the overall edge-cut of the partition will be reduced. When these vertices are swapped, they form the partition for the subsequent iteration. The Graph Partitioning Problem 18 same process repeats until no vertices to swap could be found. The complexity of each iteration of the KL algorithm in [50]isO(mlog(m)), where m is the number of edges in thegraph,anditwasimprovedin[52] where the use of special data structures reduced the complexity to O(m). Moreover, the KL algorithm achieves optimal partitions when the initial partition is good and the average degree of the graph is large [53]. If there is any good partition to start with, KL algorithm performs several instances (runs) where each instance starts with a randomly generated partition, and the final partition that achieves the lowest edge-cut is selected. Though it might seem expensive to perform multiple runs of the KL algorithm when the graph in question is large, it is relatively inexpensive since the partitioning is done on the coarse graph which consists of fewer vertices.

Graph Growing Partitioning algorithm (GGP).

GGP algorithm is another alternative to partition the graph. GGP algorithm starts from a vertex and builds a region around it following a breadth-first expansion, until the fixed number of vertices per part have been included [54, 55]. The final result of GGP algorithm depends on the choice of the start vertex, and each start vertex leads to a different partition with a different edge-cut. In order to alleviate this issue, the GGP algorithm is usually used with several different starting vertices, and the best partition (with the smallest edge-cut) is selected. Then, the partition found is refined by passing it as an input to the KL algorithm. Finally, the overall cost if this partitioning is relatively low since it processes the coarsest graph.

Greedy Graph Growing Partitioning algorithm (GGGP). Similarly to the GGP algorithm described in the previous subsection, the GGGP algorithm starts for a vertex and grows a region around, but not in a breadth-first search way. It rather computes an ordering of the vertices to include in the region according to improvements they bring to the partition. In other words, the vertex that will be first included in the region is the one that brings the largest decrease in the edge-cut. The GGGP algorithm is less sensitive to the choice of the start vertex than the GGP algorithm. Moreover, GGGP yields better partitions results compared to GGP algorithm.

2.2.2.3 Uncoarsening and refinement phase

Once the partition on the coarse graph Gq is obtained, it is projected back to G0 passing through intermediate graphs Gq−1,Gq−2, ··· ,G1. Moreover, a refinement is performed to enhance the partition quality in the finer graph. The refinement task consists in swapping any two nodes positions whenever it leads to a lower edge cut while preserving the balance of the partition. The Kernighan-Lin (KL) algorithm, described in subsection Graph Partitioning Problem 19

2.2.2.2 is usually used for this purpose. In other words, the partition of the graph Gi+1 is mapped onto the graph Gi, and it is used as the initial partition for the KL algorithm. In fact, the KL algorithm as used in this case will converge to a better partition faster as it starts from a good partition.

2.2.3 Geometric Partitioning

Another class of partitioning technics is called ”geometric partitioning”. The geometric feature refers to the fact that in some special graphs, spatial coordinates of the nodes can be used to perform a partitioning. This is the case of meshes, which are grids that approximate a geometric domain by dividing it into smaller subdomains. In other words, a mesh could be defined as ”the scaffolding upon which a function is decomposed into smaller pieces” [56].

In such partitioning methods, coordinates information about the graph are used to project or to make an embedding of the nodes on a geometric line or plane, and use a geometric technique to make a partitioning, e.g., project the nodes into a line, and use a median of this line to make a bisection.

The simplest geometric method for graph partitioning is recursive coordinate bisection RCB [43], where in each step of the recursion, RCB projects graph nodes onto the coordinate axis and bisects them through the median of their projections. In case of RCB, the bisecting plane is orthogonal to the coordinate axis, which can lead to partitions with large separators in case of meshes with skewed dimensions. The inertial partitioning [57, 58] alleviates this issue. In inertial partitioning, the bisecting plane is orthogonal to a plane L that minimizes the moments of inertia of nodes. That is to say, the plane L is chosen such that the sum of squared distances to all nodes is minimized.

Another interesting algorithm is the random spheres algorithm [59, 60]. It consists on projecting the d dimensional nodes to a random d + 1 dimensional sphere, and bi- secting it by a plane L through its center point. This method is a generalization of the RCB method, and has performance guarantees for k-nearest neighbor graphs and planar graphs.

We give in the following, a highlight of a novel and recently proposed method in the geometric class. Graph Partitioning Problem 20

2.2.3.1 Geometric partitioning: Linear Embedding based Graph Partition- ing

A recent work in [61], tackles the balanced graph partitioning problem in a distributed setting by using linear embedding technics. The proposed algorithm starts by embedding the nodes onto a line and then processes them in a distributed manner based on the embedding order and using several technics such as minimum cuts and local swaps.

In other words, Aydin et al. proposed in [61], a three steps algorithm:

1. The first task is to find a mapping of the graph vertices to a line, where a map- ping gives an ordering of the vertices that presumably places neighbors close to each other, which could reduce the minimum-cut partitioning problem to a local optimization problem. The mapping task could be achieved either through ran- dom mapping, Hilbert curve mapping or affinity-based mapping. Wegiveabrief description of these mapping technics in the following:

Random mapping. A naïve method to produce an ordering of vertices is to take a random permutation of them. This method does not help in finding good cuts, even though it can be very fast.

Hilbert curve mapping. In Hilbert curve mapping, geographic/geometric information is needed to construct an ordering using a space-filling curve. This method is known to capture proximity well, that is to say, nodes that are close in space are assumed to be placed nearby on the line. Although there are no theoretical guarantees on the cut generated from Hilbert curves, various assumptions on the distributions of edge lengths and node positions make it possible to bound the cut ratio, showing that it is significantly less than the cut ratio of a random ordering [62–64].

Affinity-based mapping. The affinity method takes into account the affinity of vertices. Informally, every node starts in a singleton cluster, and then vertices that are closely connected are grouped into the same cluster, thereby building a hierarchy of connections. The similarity between constructed clusters is computed by a given function of the similarities between the vertices in the clusters.

The final ordering is produced by sorting the vertex labels as follows: let the label of each vertex be the concatenation of the cluster ID strings from the root to the corresponding leaf. Sorting the constructed labels places the vertices belonging to the same cluster in a contiguous piece of the line.

2. The second step consists in improving the obtained ordering through local vertex swaps, such that nodes are placed next to most of their neighbors.

3. The final step is about finding the cut points and refining the partition to improve the cut size. To do so, the authors use local post-processing optimization in the “split windows” (a split window is a small interval around the cut points, taking the permissible imbalance into account). A minimal sketch of the overall pipeline is given after this list.
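As announced above, here is a minimal, illustrative Python sketch of the pipeline, under the simplifying assumption that a vertex ordering has already been produced by one of the three mappings; the swap and split-window refinements of steps 2 and 3 are omitted:

```python
def embedding_partition(order, k):
    # Cut a linear ordering of the vertices into k contiguous,
    # (near-)balanced parts; this reduces the cut problem to
    # choosing good cut points on the line.
    n = len(order)
    size = -(-n // k)  # ceiling division: capacity of one part
    return [order[i * size:(i + 1) * size] for i in range(k)]

# Example: 10 vertices already ordered on a line, split into 3 parts.
parts = embedding_partition(list(range(10)), 3)  # [[0..3], [4..7], [8, 9]]
```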

2.3 Online Partitioning: Streaming Graph Partitioning

The graph partitioning problem is a well-studied problem with a rich history, which witnesses the importance of this problem and its relevance in solving a wide range of domain-related problems. Nowadays, with the advent of big data, traditional methods for graph partitioning are not tailored to process big graphs and incur an exorbitant cost when dealing with them. In order to alleviate this problem, the streaming graph partitioning problem was introduced in 2012 by Stanton et al.

In this section we introduce the streaming graph partitioning problem, present the surrounding notions and concepts, and then discuss related work. This section is organized as follows: in Subsection 2.3.1, we first present and describe the streaming model used in streaming graph partitioning and give the streaming graph partitioning problem definition. Then, Subsection 2.3.2 reviews and discusses the one-pass streaming heuristics for graph partitioning. In Subsection 2.3.3, we discuss the restreaming graph partitioning problem, a method derived from streaming partitioning that intends to refine and optimize partition quality. Finally, we give a brief summary at the end of the section.

2.3.1 Problem Setting & Definition

This section is dedicated to giving definitions surrounding the problem of streaming graph partitioning. First, we define the streaming model, i.e., the setting of the streaming process. Then, we give the definition of the streaming graph partitioning problem.

2.3.1.1 Streaming Model

Streaming graph partitioning was first introduced by Stanton and Kliot [65]. It consists in processing the graph stream in one pass (see Definition 2.4), where the graph vertices arrive in a certain order (Random, Breadth First Search, Depth First Search, ..., see Definition 2.5), and each vertex is accompanied by its adjacency list.

Definition 2.4. One-pass streaming partitioning. The streaming partitioning is said to be ”one-pass” iff the partitioning process performs only one pass over the graph data stream, i.e., every vertex is seen no more than once.

Definition 2.5. Stream ordering. The stream ordering represents the ordering in which the vertices of the graph being streamed are explored. We cite three main stream orderings that are usually used in the graph streaming partitioning problem:

• Breadth-First Search BFS: is obtained by randomly selecting a node in each connected component of the graph and starting a breadth-first search from the given node. Component ordering is done at random in case of multiple connected components in the graph.

• Depth-First Search DFS: is similar to BFS but a depth-first search is performed instead of BFS.

• Random: is given by a random permutation of the vertex set.

Figure 2.3: The streaming model used in the streaming graph partitioning problem.

As depicted in Figure 2.3, the streaming model used in streaming graph partitioning consists of three components: (i) the data stream, (ii) the partitioner and (iii) the target machines. In the following, we provide further details about these three components.

i) Data Stream. The data stream, which represents the input of the whole partitioning system, consists of data structures that correspond to the graph vertices along with their adjacency lists. In other words, each data item in the stream is nothing but a vertex accompanied by its adjacency list, where the adjacency list of a vertex v is the list of its neighbors. In particular, the adjacency list in the item structure includes only references to the neighbors, to avoid storing large structures and to increase the processing speed. In the following, we will refer to data items in the stream as vertices to avoid cluttered sentences.

ii) Partitioner. The vertices in the data stream are serially loaded into the partitioner. The partitioner is a program whose task is to assign each vertex in the stream to one selected machine. Precisely, the partitioner selects the machine where the current vertex will be placed according to the adjacency of the vertex and the available capacity on the machines. This machine selection task, as performed by the partitioner, is further detailed in Section 2.3.2.

iii) Target Machines. As depicted in Figure 2.3, there are k target machines, following k-way partitioning, where the aim is to partition the graph data being streamed into k parts, each part being stored on one machine. Suppose that each machine has a storage capacity C; then the total capacity of the k machines should be large enough to handle the whole graph being streamed. In other words, if the size of the graph being streamed is n (the number of vertices), then an essential condition for the streaming model to be operational is that n ≤ k × C.
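For illustration, a minimal Python sketch of this three-component model is given below; the names and the `score` callback are our own assumptions, the latter standing for any one-pass heuristic objective such as those reviewed in Subsection 2.3.2 (it also assumes n ≤ k × C, so a non-full machine always exists):

```python
def stream_partition(stream, k, C, score):
    # (i) data stream: an iterable of (vertex, adjacency_list) items.
    # (ii) partitioner: scores every non-full machine and picks the best.
    # (iii) target machines: k parts, each of capacity C.
    parts = [set() for _ in range(k)]
    for v, neighbors in stream:
        candidates = [i for i in range(k) if len(parts[i]) < C]
        best = max(candidates, key=lambda i: score(parts, v, neighbors, i))
        parts[best].add(v)
    return parts
```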

2.3.1.2 Problem Definition

The streaming partitioning problem is as follows: when a vertex is loaded into the partitioner (along with its adjacency list), the partitioner decides on which cluster the vertex will be placed, and never relocates it once the assignment is done. The partitioning process is done on the fly, without access to the whole graph, i.e., the graph does not have to be memory resident. In fact, streaming partitioning uses very limited computational resources in terms of memory and computation time. This property makes the streaming model very suitable when it comes to partitioning billion-node graphs, which neither the approximation algorithms nor the heuristics proposed for classical graph partitioning are tailored to process.

2.3.2 One Pass Streaming Partitioning

In this section, we review the core methods proposed in the literature for streaming partitioning in one pass. We highlight the following three heuristics: the Linear Deterministic Greedy heuristic (LDG), the FENNEL heuristic and the Fractional Greedy heuristic. For each heuristic, we present and explain the objective function used, and discuss the performance achieved by each heuristic.

2.3.2.1 Linear Deterministic Greedy heuristic

In [65], Stanton and Kliot proposed the first and foremost work in streaming graph partitioning. The authors propose a variety of online partitioning heuristics and conduct a comparative analysis of their performance, concluding that Linear Deterministic Greedy (LDG) remains the most promising heuristic.

As mentioned in the previous section, the partitioning heuristic runs in the partitioner and has to select the machine where the currently loaded vertex will be located. The LDG heuristic selects the machine that maximizes its objective function. In other words, LDG greedily assigns the vertices to machines (clusters) while adding a penalization for big clusters to emphasize balance. The objective function to maximize in LDG is as follows:

LDG = argmax_i |P_i^t ∩ N(v)| · (1 − |P_i^t| / C)    (2.2)

where i ∈ [k] is the index of a given machine, C is the capacity of a machine, v is the current vertex to place, and N(v) is its adjacency list. P_i^t is the part of index i at time t, and it refers to the set of vertices on the machine with index i. Given a vertex v, the LDG heuristic works as follows: it selects the machine that has the maximum number of neighbors of v and, at the same time, has free space available to receive additional data.
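As an illustration, the LDG objective of Equation (2.2) can be transcribed directly; the function name and argument layout below are our own, and the function plugs into the driver sketched earlier via, e.g., `lambda parts, v, nb, i: ldg_score(parts, v, nb, i, C)`:

```python
def ldg_score(parts, v, neighbors, i, C):
    # Equation (2.2): neighbors of v already in part i, discounted
    # by how full part i currently is.
    return len(parts[i] & set(neighbors)) * (1 - len(parts[i]) / C)
```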

LDG yields the best results in terms of edge cut on Finite Element Mesh (FEM) datasets; this is due to the structure of FEMs, whose edges are highly local, which makes it possible to obtain very good partitions. However, the performance achieved by LDG is far from optimal. In fact, in a theoretical study on streaming partitioning algorithms [66], Stanton shows that it is impossible for online algorithms to approximate the optimal cut on a single pass over the graph stream within O(n) (where n is the number of vertices of the graph to be partitioned), whatever the streaming order is.

2.3.2.2 FENNEL heuristic

Tsourakakis et al. proposed in [67] an online greedy heuristic for streaming partitioning named FENNEL. This heuristic achieves good performance thanks to an adjustable objective function.

The objective function of FENNEL is as follows:

FENNEL = argmax_i |P_i^t ∩ N(v)| − αγ |P_i^t|^(γ−1)    (2.3)

From the objective function, we can see that the first term is similar to LDG (see Equation 2.2), i.e., FENNEL selects a machine that has the most neighbors of the vertex to be assigned, but the second term differs, with two parameters α and γ that are used for tuning the FENNEL heuristic and whose best-performing values were fixed empirically. FENNEL outperforms LDG by giving partitions of lower edge cut; however, the balance achieved by LDG is more accurate. This represents the main drawback of FENNEL, as the balance criterion in the graph partitioning problem is extremely important: an unbalanced partition can cause the computations to slow down, since not all machines will have the same workload.
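Equation (2.3) transcribes just as directly (again with illustrative names; α and γ are the tunable parameters mentioned above):

```python
def fennel_score(parts, v, neighbors, i, alpha, gamma):
    # Equation (2.3): same neighbor-count term as LDG, but with the
    # tunable penalty alpha * gamma * |P_i|^(gamma - 1).
    return (len(parts[i] & set(neighbors))
            - alpha * gamma * len(parts[i]) ** (gamma - 1))
```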

2.3.2.3 Fractional Greedy heuristic

Another greedy heuristic named Fractional Greedy (FG) for streaming partitioning was proposed in [68]. It aims to enhance partition quality and yields exactly balanced partitions thanks to a special penalizing term. The Fractional Greedy heuristic was proposed and developed in the context of this thesis and will be further detailed in Chapter 3.

2.3.3 Restreaming Partitioning

In order to get better partitions, Nishimura et al. [69] relaxed the constraint of a single pass over the graph stream and introduced the restreaming graph partitioning as a variant of the original problem which allows for multiple passes over the graph stream. Indeed, the restreaming model enhances the partition quality by reducing the cut and emphasizing the balance.

The restreaming process consists in performing several passes over the data stream. In other words, a first streaming partitioning is done, then a second streaming partitioning is performed by exploring the graph stream a second time; the results of the previous partitioning are stored and used in the current partitioning to improve the final partition quality.

The performance achieved by restreaming methods competes with METIS, a well-known offline method for graph partitioning. However, while restreaming partitioning leads to exact balance (guaranteed for FENNEL), it can be too expensive for partitioning massive graphs, since it must restream the whole graph several times, which incurs a high computational time and memory cost. Thus, a tradeoff is to be made between computational cost and desired partition quality.

2.3.3.1 Restreaming the one-pass streaming partitioning heuristics

The authors took the two streaming heuristics LDG and FENNEL and tweaked them in order to perform restreaming partitioning. In the following, we give the objective functions for LDG and FENNEL in the restreaming setting:

ReLDG = argmax_i |P_i^{t−1} ∩ N(v)| · (1 − |P_i^t| / C)    (2.4)

ReFennel = argmax_i |P_i^{t−1} ∩ N(v)| − αγ |P_i^t|^(γ−1)    (2.5)

In both aforementioned equations, for restreaming LDG and FENNEL respectively, only the first term differs from the original objective function in the one-pass setting. In the one-pass setting, this term computes the number of neighbors of the current vertex in the part (machine) of index i, whereas in the restreaming setting, the term computes the number of neighbors in the previously obtained partition. This can be viewed as a learning process, where the previously obtained partitioning is exploited in order to enhance and improve the current partitioning. The second term of the objective function, also called the penalizing term, is unchanged and is relative to the current part (machine) at iteration t.
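A minimal sketch of the restreaming loop, here instantiated with the ReLDG objective of Equation (2.4), could look as follows (illustrative names; the first pass falls back to plain one-pass LDG since no previous partition exists yet, and at least one pass is assumed):

```python
def restream(stream_items, k, C, passes):
    # Multi-pass streaming partitioning: the neighbor term looks at
    # the partition from the previous pass, the penalty at the part
    # sizes of the current pass (Equation 2.4).
    prev = None
    for _ in range(passes):
        cur = [set() for _ in range(k)]
        for v, neighbors in stream_items:
            nb = set(neighbors)
            ref = cur if prev is None else prev  # first pass: plain LDG
            best = max((i for i in range(k) if len(cur[i]) < C),
                       key=lambda i: len(ref[i] & nb) * (1 - len(cur[i]) / C))
            cur[best].add(v)
        prev = cur
    return prev
```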

Although restreaming partitioning has the strong point of refining partition quality, by reducing the edge cut and improving the balance among parts, it suffers from an important weakness: it incurs a high cost, since considerable memory space is needed to store information about the partitioning iterations, in addition to the computational time consumed.

Partial Restreaming Partitioning. In order to alleviate the cost incurred by the whole-graph restreaming setting, we have proposed and developed in the context of this thesis another variant of the problem, named partial restreaming partitioning [68, 70], where several passes over the graph stream are allowed on a portion of the graph, and the rest of the graph stream is processed in one pass. This work will be further detailed and explained in Chapter 3.

Summary. The current data proliferation is correlated with a sheer increase in the size of graph datasets, which impels the use of distributed graph processing frameworks that should consider a good partitioning of the graph dataset for their performance to be enhanced. In this regard, the graph partitioning task turns out to be crucial, and as a result, recent research in the field is focused on methods adapted to big graphs. For this reason, the streaming graph partitioning problem was introduced to alleviate the excessive cost incurred by traditional graph partitioning methods, which are well suited only for relatively small graphs.

In this section, we defined the problem of streaming graph partitioning. We highlighted the streaming setting and discussed the main advantages of such a streaming model. We presented and reviewed recent and leading studies on the streaming graph partitioning problem and discussed the one-pass streaming heuristics for graph partitioning. Besides, we presented the restreaming partitioning model as a way of improving the partition quality given by one-pass streaming partitioning.

2.4 Chapter Summary

In this chapter, we gave an overview of the graph partitioning problem, including definitions of related key notions and reviews of foremost works in the literature. We also presented different variants of the problem, and presented the hardness results and approximation guarantees proposed so far. Moreover, we presented the two main classes of solving methods for the graph partitioning problem: offline and online partitioning methods. Offline partitioning is the traditional class of methods and is suitable for processing relatively small graphs, since the methods in this class are known to be computationally expensive. Online partitioning was proposed recently, and it offers the possibility of handling today's large graphs in a streaming fashion, i.e., the graph to be partitioned, usually of massive size, does not have to be memory resident. This feature makes online partitioning very efficient for partitioning large graphs, as it incurs only a minimal memory cost. Moreover, online methods are equally known for being faster than offline methods for graph partitioning.

Chapter 3

Fractional Greedy Heuristic & Partial Restreaming Partitioning

The graph partitioning problem is a well-studied problem with an active present and a rich history. The breadth of this research field witnesses the importance of the graph partitioning problem in a wide range of application domains. In this chapter, we present two of our contributions to the graph partitioning problem, namely the Fractional Greedy heuristic and the Partial Restreaming partitioning model. The Fractional Greedy heuristic was proposed in the context of this thesis to address the shortcoming of partition balance in other state-of-the-art heuristics and also to improve the edge cut among partitions. On the other hand, the Partial Restreaming partitioning model reduces the computational cost incurred in full restreaming partitioning settings while achieving competitive performance.

The chapter is organized as follows: in Section 1, we start by presenting our proposed Fractional Greedy heuristic. In Subsection 1, we introduce and give definitions of key concepts in the streaming graph partitioning problem; we also outline the motivation behind the proposal of a new heuristic for streaming graph partitioning such as Fractional Greedy. Subsection 2 presents the Fractional Greedy (FG) heuristic: we first describe and explain the objective function of the heuristic, then we describe the FG algorithm. Finally, Subsection 3 presents the experimental evaluation and discusses the obtained results.

In Section 2, we present the Partial Restreaming Partitioning model. We first start by defining the restreaming model, and then we define and present the Partial Restreaming Partitioning model in its two proposed versions: the simple partial restreaming and the selective partial restreaming models. The same section continues by presenting the

experimental evaluation of the partial restreaming model. Finally, we summarize the chapter in Section 3.

3.1 Fractional Greedy: a proposed streaming partitioning heuristic for large graphs

In this section, we present our proposed method called Fractional Greedy (FG). First, we introduce the streaming model concept and present the motivation behind the proposal of such a heuristic. Second, we present the actual FG heuristic, describing the objective function used and the whole FG algorithm. Last but not least, we present our experimental evaluation of FG over numerous graph datasets and compare it with state-of-the-art heuristics for streaming graph partitioning.

3.1.1 Preliminaries

The Fractional Greedy (FG) heuristic is a streaming graph partitioning method intended to process large graphs in a streaming fashion. Therefore, we begin by briefly explaining the streaming model used in such methods. Then, we give the context related to our proposed FG method and give the motivation behind it.

Notations. We will be using the following notation throughout the chapter. We consider a simple undirected graph G = (V, E); let |V| = n be the number of vertices in G and |E| = m be the number of edges of G. Let the currently loaded vertex be v, and let N(v) denote its neighbors. k is the number of clusters or parts into which we wish to divide the graph. Let P^t = (P_1^t, ..., P_k^t) be a partition of the graph. P_1^t, ..., P_k^t are called clusters, such that P_i ⊆ V and P_i ∩ P_j = ∅ for every i ≠ j.

Let e(P, V − P) be the set of edges with ends belonging to different clusters. We define λ = |e(P, V − P)| / m as the fraction of edge cut, which should be minimized during partitioning. We define ρ as the normalized maximum load; it expresses the balance between cluster sizes and is given by ρ = maximum load / (n/k), where the maximum load is the size of the biggest cluster. Each part P_i is of size C. In our work we set C = n/k.
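For concreteness, the two quality measures λ and ρ can be computed from a finished partition as in the following sketch (illustrative names, assuming an undirected edge list):

```python
def cut_and_balance(parts, edges, n, k):
    # λ = |e(P, V − P)| / m  and  ρ = maximum load / (n / k).
    owner = {v: i for i, part in enumerate(parts) for v in part}
    cut = sum(1 for u, v in edges if owner[u] != owner[v])
    lam = cut / len(edges)
    rho = max(len(p) for p in parts) / (n / k)
    return lam, rho
```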

3.1.1.1 Streaming Model

Streaming graph partitioning was first introduced by Stanton and Kliot [65]. In this work, the authors presented the streaming model used, in which the graph stream is processed in one pass, where the graph vertices arrive in a certain order

(Random, Breadth First Search, Depth First Search), each vertex accompanied by its adjacency list. In Figure 3.1, we depict the streaming model: it consists of a graph stream, a partitioner, and k machines, the goal being to partition the streamed graph into k parts.

Partitioner. When a vertex is loaded (along with its adjacency list), a program called the partitioner decides on which cluster the vertex will be placed and never relocates it once the assignment is done. The partitioning process is done on the fly, without access to the whole graph, i.e., the graph does not have to be memory resident. In fact, streaming partitioning uses very limited computational resources in terms of memory and computation time. This property makes the streaming model very suitable when it comes to partitioning billion-node graphs.

A natural and evident constraint for such streaming models is that if the size of the streamed graph is n, then the whole capacity of the k machines, denoted by k × C (where C is the maximal capacity of one machine), should be greater than or equal to n for the partitioning process to be feasible.

Figure 3.1: Streaming model used in the streaming graph partitioning problem.

3.1.1.2 Motivation

The streaming graph partitioning is a fair alternative for partitioning large graphs; the main reason is that it considerably lowers the computational cost when compared to the traditional offline setting for graph partitioning, in which the whole graph has to be memory resident in order to perform the partitioning. Since its introduction in [65], streaming partitioning has won the interest of many researchers. We cite the Linear Deterministic Greedy (LDG) heuristic [65], the FENNEL heuristic [67] and the restreaming partitioning model [69]. The LDG heuristic is known to yield good quality partitions, where a good quality partition has a low edge cut and is exactly balanced. Although LDG achieves exact balance in partitions, it incurs an important edge cut. FENNEL achieves a lower edge cut compared to LDG, but at the same time it allows for a deviation in the partition balance, which can be very problematic and is a potential cause for computations to slow down, as not all the machines are assigned the same workload. Taking this into consideration, we proposed the Fractional Greedy heuristic, which achieves a lower edge cut and exact balance among partitions. We will detail in the following how FG outperforms the state-of-the-art heuristics thanks to its special objective function.

3.1.2 Fractional Greedy Heuristic

Used in the context of a streaming model, the FG heuristic partitions the graph by processing each vertex in the stream sequentially. The processing of each vertex consists in finding the target machine, i.e., the machine to which the vertex will be assigned. In other words, FG computes the index of the target machine using limited information, namely the adjacency list of the vertex and the vertex indexes of each of the k parts.

3.1.2.1 Objective function

Our proposed heuristic is designed for the constrained graph partitioning problem, which aims to balance the size of the parts as a priority, and then to minimize the crossing edges between parts [71]. Formally, the Fractional Greedy heuristic maximizes the following objective function:

f(v, P_i^t) = |P_i^t ∩ N(v)| − g(P_i^t)    (3.1)

In other words, for each vertex loaded, denoted by v, we compute the index ind of the part (or the cluster) as follows:

ind = argmax_i |P_i^t ∩ N(v)| − g(P_i^t)    (3.2)

The objective function f has two components: the first component computes the intra-part edges, and the second, g, computes a penalization cost with regard to the size of the part. The main idea is to assign the loaded vertex v to a part that contains most of its neighbors and that has not reached the maximal capacity. The cost function g is computed as follows:

g(P_i^t) = 1 / (1 − |P_i^t| / C)    (3.3)

Figure 3.2: Cost function g. The x-axis represents the size of a part and the y-axis represents the cost value.

The cost function g penalizes parts of large sizes. For a given vertex v, the objective function f is computed for each of the k parts/machines; if a part/machine that has the most neighbors of v has reached the maximal capacity C, then it should not be selected, i.e., its objective value should not be maximal. For that reason, we designed the cost function of FG so that it rapidly penalizes large parts, even when they contain most of the neighbors of the vertex v.

As seen in Figure 3.2, the cost increases gradually as the part's size increases. When the size approaches C, the cost tends to infinity, ensuring that parts of size C will never be assigned additional vertices.

3.1.2.2 Fractional Greedy algorithm

We describe the Fractional Greedy algorithm in Algorithm 1. The input parameters are the number of parts k, the graph stream, and the maximal capacity of each machine/part, denoted by C. Initially, the parts are empty. Then, FG begins processing the stream. It sequentially loads vertices from the stream, such that when a vertex v is loaded along with its adjacency list N(v), the objective function f is computed for every part. Then, we compute the index ind of the machine to which the vertex v will be assigned; ind is the index of the part/machine that maximizes the function f (line 6 in Algorithm 1). The partition is then updated by adding the vertex v to the selected part/machine. The same process is repeated for each vertex in the stream. Ultimately, FG outputs the final partition P = (P_1, P_2, ..., P_k).

Algorithm 1 Fractional Greedy
Input: k machines represented by k parts P_i^t at time t, with i ∈ [k]; the maximal capacity of each machine C; the graph stream, represented sequentially by each vertex v and its adjacency list N(v).
Output: Partition P = (P_1, P_2, ..., P_k).

1. Initialization: t = 0; P_1 = P_2 = ... = P_k = ∅;
2. for each v in the stream
3.   for each part P_i^t
4.     compute f(v, P_i^t) = |P_i^t ∩ N(v)| − 1 / (1 − |P_i^t| / C);
5.   end for
6.   Select the index ind of the appropriate machine: ind = argmax_i f(v, P_i^t);
7.   Assign v to P_ind: P_ind = P_ind ∪ {v};
8.   t++;
9. end for
10. return P;
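For readers who prefer executable code, the following is a direct Python transcription of Algorithm 1; the tie-breaking rule (lowest part index) is our own assumption, as the pseudocode leaves it open:

```python
def fractional_greedy(stream, k, C):
    # Each stream item is a pair (v, N(v)).
    parts = [set() for _ in range(k)]
    for v, neighbors in stream:
        nb = set(neighbors)
        best, best_score = None, float("-inf")
        for i in range(k):
            if len(parts[i]) >= C:
                continue  # g tends to infinity: a full part is never chosen
            score = len(parts[i] & nb) - 1.0 / (1.0 - len(parts[i]) / C)
            if score > best_score:
                best, best_score = i, score
        parts[best].add(v)
    return parts
```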

3.1.3 Experimental Evaluation

In this section, we present our experimental evaluation of the FG heuristic. First, we present the experimental setup, i.e., the graph datasets used and the evaluation methodology. Second, we discuss the obtained results.

3.1.3.1 Experimental Setup

1. Evaluation datasets:

Two types of graph datasets were used: web and social. We tested our methods on ten graph datasets listed in Table 3.1, all obtained from the SNAP repository [72]. Vertices of degree 0 and self-loops were removed. All the graphs were made undirected by reciprocating the edges. The graph datasets were chosen to be small enough that we can find offline solutions with METIS, yet big enough to capture the behavior of the online heuristics.

2. Methodology:

We first run FG, LDG and Fennel in the one-pass streaming setting on our graph datasets for k = 40, in order to compare the fraction of edge cut, represented by λ, and the balance, represented by ρ. For Fennel, the parameter γ was empirically set to 5 in order to give parts as balanced as those given by FG and LDG. We compare FG to the other online heuristics LDG and Fennel, and to the offline METIS heuristic, which is used as a baseline.

Table 3.1: Graph Datasets used for our tests.

Graph          |N|       |M|         Avg deg  Type
wikivote       7115      100762      14.16    social
enron          36692     183831      5.01     social
Astro ph       18771     198050      10.55    social
slashdot       77360     469180      6.06     social
Web nd         325729    1090108     3.34     web
stanford       281903    1992636     7.06     web
Web google     875713    4322053     4.93     web
Web berkstan   685230    6649470     9.7      web
Live journal   4846609   42851237    8.84     social
orkut          3072441   117185085   38.14    social

3.1.3.2 Experimental results

Obtained results show that Fractional Greedy outperforms LDG and Fennel in most cases with exactly balanced clusters and lower edge cut.

Table 3.2: Fraction of edge cut λ and normalized maximum load ρ for the 3 streaming heuristics FG, LDG and FENNEL, and for METIS; (1.001) indicates that the allowed slackness is 0.001. Results are obtained for 10 graph datasets with k = 40.

Graphs         FG            LDG           Fennel        Metis(1.001)
               λ      ρ      λ      ρ      λ      ρ      λ      ρ
wikivote       0.844  1      0.867  1      0.862  1      0.822  1.001
enron          0.589  1      0.610  1      0.612  1      0.855  1.001
astro ph       0.555  1      0.619  1      0.578  1      0.535  1.001
slashdot       0.758  1      0.787  1      0.777  1      0.711  1.001
webnd          0.249  1      0.261  1      0.270  1      0.036  1.001
stanford       0.349  1      0.392  1      0.347  1.043  0.123  1.001
webgoogle      0.310  1      0.308  1      0.313  1.023  0.009  1.001
web berkstan   0.386  1      0.342  1      0.367  1.023  0.117  1.001
live journal   0.442  1      0.462  1      0.546  1.009  0.309  1.001
orkut          0.627  1      0.639  1      0.696  1.076  0.376  1.001

Performance discussion. Our proposed heuristic Fractional Greedy outperforms LDG and Fennel in terms of both balance and edge cut. Table 3.2 shows our results: FG yields partitions with exact balance (ρ = 1) and a lower edge cut. However, LDG outperforms FG on web-google (0.310 vs 0.308) and web-berkstan (0.386 vs 0.342). Our proposed heuristic is the most adapted to the setting of constrained graph partitioning: it yields exactly balanced partitions and also minimizes the edge cut.

Figure 3.3: Average fraction of edge cut λ for FG, LDG, Fennel and METIS over ten graph datasets.

In Figure 3.3 we show the average fraction of edge cut for FG, LDG, FENNEL and METIS over the 10 graph datasets. METIS is the best-performing heuristic, owing to its offline setting, while FG is the second best, outperforming LDG and Fennel.

3.2 Partial Restreaming Partitioning

In this section, we introduce partial restreaming partitioning. First, we give a brief introduction to the restreaming partitioning model. Then, we present the partial restreaming model used. Afterward, we present instances of the partial restreaming partitioning model using Linear Deterministic Greedy (LDG) [65], Fennel [67] and Fractional Greedy [68] as partitioning heuristics. Finally, we give another variant of partial restreaming, called Selective Partial Restreaming partitioning, where the portions to be restreamed are selected depending on their degree and density.

Notations. The following notations are used in this section, in addition to the notations of Section 3.1.1. Let s be the number of streaming iterations. P^{t−1} = (P_1^{t−1}, ..., P_k^{t−1}) represents the partition obtained from the previous stream iteration. C represents the size of the portion to be restreamed (C also represents the maximal capacity of a part/machine). Several portions of the graph may be restreamed; β denotes the number of portions to be restreamed (each portion is of size C).

3.2.1 Restreaming Partitioning: a brief introduction

In order to get better partitions, Nishimura et al. [69] relaxed the constraint of a single pass over the graph stream and introduced the restreaming graph partitioning as a variant of the original problem, which allows for multiple passes over the graph stream.

The restreaming process consists in performing several passes over the data stream. In other words, a first streaming partitioning is done, then a second streaming partitioning is performed by exploring the graph stream a second time; the results of the previous partitioning are stored and used in the current partitioning to improve the final partition quality. Indeed, according to experimental results, the restreaming model enhances the partition quality by reducing the cut and emphasizing the balance. Restreaming partitioning is further defined in Chapter 2, Section 2.3.3.

3.2.2 Simple Partial Restreaming Partitioning

In this section, we present Simple Partial Restreaming Partitioning. First, we present the partial restreaming model; then, we present and explain the simple partial restreaming setting.

Partial Restreaming Model. We consider a simple streaming model as described in Subsection 3.1.1.1, where vertices arrive in random order along with their adjacency lists. The heuristic used for partitioning must decide on the machine/part in which to place the current vertex. In the partial restreaming model, there are two major phases: the restreaming phase and the one-pass streaming phase. Namely, a first loaded portion of the graph dataset of size β × C is restreamed, and the rest is processed in the simple streaming phase. In other words, in this model, multiple passes over the stream are allowed for only a part of the dataset. Let P^t be the partition obtained at time t; P^{t−1} represents the partition obtained at the previous iteration of the restream. Once the number of restreaming iterations allowed for the concerned portion is reached, we continue streaming the rest of the graph dataset normally, without using information about the last partitioning to build a new one. In the following, we describe the partitioning heuristics used in our model.
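A minimal sketch of the two-phase driver follows, here instantiated with the PartLDG objective given in Equations (3.5) and (3.6) below; names are illustrative, s ≥ 1 is assumed, and capacity checks are kept explicit:

```python
def partial_restream(stream_items, k, C, beta, s):
    # Phase 1: restream the first beta*C vertices s times, scoring
    # against the previous pass. Phase 2: stream the rest in one pass.
    head, tail = stream_items[:beta * C], stream_items[beta * C:]
    prev = None
    for _ in range(s):                      # restreaming phase
        cur = [set() for _ in range(k)]
        for v, neighbors in head:
            nb = set(neighbors)
            ref = cur if prev is None else prev
            i = max((j for j in range(k) if len(cur[j]) < C),
                    key=lambda j: len(ref[j] & nb) * (1 - len(cur[j]) / C))
            cur[i].add(v)
        prev = cur
    for v, neighbors in tail:               # one-pass phase
        nb = set(neighbors)
        i = max((j for j in range(k) if len(prev[j]) < C),
                key=lambda j: len(prev[j] & nb) * (1 - len(prev[j]) / C))
        prev[i].add(v)
    return prev
```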

Partial Restreaming Partitioning. As a partitioning strategy, we use the state-of-the-art heuristics [65, 67] in order to compare them with our proposed heuristic FG. In this section, we describe how we adapt these heuristics (LDG, FENNEL, FG) to our partial restreaming model.

1. Linear Deterministic Greedy. LDG is the best-performing heuristic in [65]. It greedily assigns the vertices to clusters while adding a penalization for big clusters to emphasize balance. LDG assigns vertices to the clusters that maximize:

LDG = argmax_i |P_i^t ∩ N(v)| · (1 − |P_i^t| / C)    (3.4)

In our streaming model, we consider 2 phases: the restreaming phase and the simple streaming phase, which consists in one pass. In the first phase, the LDG function uses information about the last partitioning to decide on the placement of the current vertex, such that:

PartLDG = argmax_i |P_i^{t−1} ∩ N(v)| · (1 − |P_i^t| / C)    (3.5)

where PartLDG stands for partially restreaming LDG. In the second phase (the one-pass streaming phase), the vertices are assigned following the LDG function as follows:

PartLDG = argmax_i |P_i^t ∩ N(v)| · (1 − |P_i^t| / C)    (3.6)

2. Fennel. Fennel yields partitions of good quality compared to LDG, with a lower fraction of edge cut, and also emphasizes balance [67]. We adapt the Fennel function to the two phases of our streaming model. In the first (restreaming) phase, PartFennel is as follows:

PartFennel = argmax_i |P_i^{t−1} ∩ N(v)| − αγ |P_i^t|^(γ−1)    (3.7)

where PartFennel stands for partially restreaming Fennel. In the second phase, PartFennel is as follows:

PartFennel = argmax_i |P_i^t ∩ N(v)| − αγ |P_i^t|^(γ−1)    (3.8)

The parameter setting of γ is presented in the experimental evaluation (Subsection 3.2.4).

3. Fractional Greedy. The adaptation of FG (see Section 3.1 for the description of FG) to the partial restreaming model is as follows:

In the restreaming phase, PartFG corresponds to:

PartFG = argmax_i |P_i^{t−1} ∩ N(v)| − 1 / (1 − |P_i^t| / C)    (3.9)

And in the second phase, PartFG is defined as:

PartFG = argmax_i |P_i^t ∩ N(v)| − 1 / (1 − |P_i^t| / C)    (3.10)

3.2.3 Selective Partial Restreaming Partitioning

Instead of restreaming the first loaded portion of the graph dataset, we try to select portions of size C that would lead to a good quality partitioning. We set degree and density parameters as the criteria for selecting the portions to be restreamed in the first phase of the model. In other words, when a portion is loaded, we check its average degree and its average density; if they are higher than the average degree (respectively, average density) of the whole graph, we select this portion for restreaming; otherwise, it will be processed in the second, one-pass streaming phase.

Selection criteria. In order to select a portion for restreaming, two criteria are considered:

1. Average degree: The average degree of a portion is the average of the vertex degrees inside the portion. Notice that the degree of a vertex is the number of its neighbors, whether they are inside or outside the portion concerned. We take the average degree as a criterion to make sure that the portion which is going to guide the partitioning of the graph influences the partitioning decision of a large number of vertices, thanks to its elevated average degree.

2. Average density: The average density of a portion represents the number of edges within the portion. It is important to have edges inside the portion to make better partitioning decisions, as the objective functions used make decisions depending on edges; otherwise, the partitioning of the portion will be done at random, which will lead to a lower partitioning quality for the whole graph.

The average degree and density of the whole graph are information that is not always available; we can substitute for this by progressively accumulating degree and density information as the data is loaded. A portion with high density and high-degree vertices should act as a kernel that yields partitions of good quality. In fact, portions with high-degree vertices attract and influence the partitioning of a large number of other vertices, while the density criterion inside the portion ensures that edges are taken into consideration to make better partitioning decisions. In our experimental evaluation, we show that by selecting portions with high average degree and high average density, we obtain partitions of better quality than those obtained by simply restreaming the first loaded portion.
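As an illustration, the selection test could be sketched as follows (names are ours; we take the density of a portion as its number of internal edges per vertex, one reasonable reading of the criterion above):

```python
def select_for_restream(portion, graph_avg_degree, graph_avg_density):
    # portion: list of (vertex, adjacency_list) items of size C.
    # Restream it only if both its average degree (neighbors inside or
    # outside the portion) and its internal density beat the graph-wide
    # (or running) averages.
    inside = {v for v, _ in portion}
    avg_degree = sum(len(nb) for _, nb in portion) / len(portion)
    internal = sum(1 for v, nb in portion for u in nb if u in inside) / 2
    density = internal / len(portion)
    return avg_degree >= graph_avg_degree and density >= graph_avg_density
```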

3.2.4 Experimental Evaluation

In this section, we present the experimental evaluation of the partial restreaming partitioning model. First, we present the experimental setup, i.e., the graph datasets used and the evaluation methodology. Second, we discuss the obtained results.

3.2.4.1 Experimental Setup

1. Datasets:

The graph datasets used for the following experiments are the same as those used in Subsection 3.1.3, and are all listed in Table 3.1.

2. Methodology:

We vary the restreamed portion size to assess a possible correlation between the restreamed portion size and the cut. In other words, we examine the edge cut for different values of β, the portion size being β × C. We begin by running PartLDG, PartFennel and PartFG (partially restreaming LDG, Fennel and FG, respectively) on WebGoogle and LiveJournal for different values of β, and observe how the fraction of edge cut reacts to the change in β. After that, we run our methods on the ten graph datasets, for k = 40, s = 10 and β = k/2. Notice that the ordering of vertices is done at random. β = k/2 means that we are restreaming half of the graph. We compare our results to the whole-graph restreaming methods [69] and to METIS, which represents the offline methods and is considered our baseline. Afterwards, we evaluate the partial restreaming methods (PartLDG, PartFennel) and the selective partial restreaming methods (PartSLDG, PartSFennel) on seven graphs. Last but not least, we show the runtime gain of the partial methods (PartLDG, PartFennel and PartFG) over the ten graphs for k = 40 and s = 10.

The runtime gain is computed as follows:

Gain_PartLDG = (ReLDG − PartLDG) / (ReLDG − LDG)

where ReLDG refers to the execution time of the version of LDG in which the whole graph is restreamed, and LDG is the one-pass streaming version.

Gain_PartFennel = (ReFENNEL − PartFENNEL) / (ReFENNEL − FENNEL)

As for Gain_PartLDG, ReFENNEL refers to the execution time of the version of FENNEL in which the whole graph is restreamed, and FENNEL is the one-pass streaming version.

Gain_PartFG = (ReFG − PartFG) / (ReFG − FG)

Gain_PartFG is the runtime gain for Fractional Greedy, where ReFG refers to restreaming FG and FG is the one-pass streaming version; PartFG is the partial restreaming version of FG. We note that the runtime computed includes solely the partitioning runtime.

3.2.4.2 Experimental results

In the following, we present and discuss our results. Before delving into the results, we give a brief summary of them.

Summary of our results

• The results obtained show that by increasing β, i.e., by increasing the size of the portion to be restreamed, we obtain better quality partitions.

• By restreaming the first loaded half of a graph, we obtain partitions of similar quality to those yielded by restreaming the whole graph.

• The selective partial restreaming method preserves the quality of a partition, unlike the partial methods, which depend on the stream order. The reason is that, regardless of the order, the selective methods always pick good portions to restream.

• Restreaming only a portion of a graph dataset incurs a lower computational cost than restreaming the whole graph; for example, restreaming only half of the graph takes half of the runtime taken by restreaming the whole graph, while the results are almost the same.

Performance discussion. In Figure 3.4, we see that the bigger the size β of the restreamed portion, the lower the fraction of edge cut. This shows that information about the previous stream iteration allows vertices to be better oriented toward the best cluster, leading to a lower edge cut. In other words, the more information we have about the previous streaming iteration, the better the edge cut.

Figure 3.4: Variation of λ with the growing size of the restreamed portion, represented by β, for the WebGoogle and LiveJournal graphs. On the left PartFennel, on the right PartLDG, and at the bottom PartFG.

Table 3.3: Comparison of the fraction of edge cut λ and the normalized maximum load ρ for ReFG and PartFG, ReLDG and PartLDG, ReFennel and PartFennel. k = 40, s = 10 and β = k/2.

Graph          ReFG          PartFG        ReLDG         PartLDG       ReFENNEL      PartFENNEL
               λ      ρ      λ      ρ      λ      ρ      λ      ρ      λ      ρ      λ      ρ
wikivote       0.812  1      0.828  1      0.835  1      0.850  1      0.813  1.023  0.826  1.022
enron          0.479  1      0.509  1      0.475  1      0.507  1      0.476  1.098  0.482  1.087
astro ph       0.433  1      0.475  1      0.418  1      0.501  1      0.413  1.019  0.443  1.019
slashdot       0.711  1      0.705  1      0.713  1      0.722  1      0.703  1.041  0.692  1.106
webnd          0.164  1      0.207  1      0.113  1      0.207  1      0.143  1.048  0.193  1.056
stanford       0.200  1      0.267  1      0.204  1      0.319  1      0.193  1.025  0.216  1.109
webgoogle      0.163  1      0.219  1      0.161  1      0.217  1      0.160  1.087  0.222  1.012
web berkstan   0.241  1      0.276  1      0.212  1      0.276  1      0.254  1.037  0.282  1.073
live journal   0.325  1      0.331  1      0.313  1      0.331  1      0.330  1.006  0.319  1.018
orkut          0.398  1      0.503  1      0.395  1      0.503  1      0.410  1.005  0.451  1.017

Table 3.3 shows the results in terms of fraction of edge cut and balance. The difference between the edge cut of fully restreaming methods and partial restreaming methods is minimal in most cases.

Table 3.4 shows the difference in performance between the partial restreaming methods and the selective partial methods on seven different datasets. The results show that in most cases, the selective partial restreaming methods yield better quality partitions, leading to a lower edge cut. However, in some datasets, portions that conform to the criteria do not exist, which prevents the selective partial method from giving the best results. For this reason, the simple partial restreaming method can be a better alternative for both speeding up computations and improving partition quality.

Table 3.4: Comparison of the fraction of edge cut λ and the normalized maximum load ρ for partial restreaming methods and selective partial restreaming methods: PartFG vs PSelectFG, PartLDG vs PSelectLDG, PartFennel vs PSelectFennel. Results are obtained for 7 graph datasets with k = 40.

Graphs       PSelectFG     PartFG        PSelectLDG    PartLDG       PSelectFENNEL  PartFENNEL
             λ      ρ      λ      ρ      λ      ρ      λ      ρ      λ      ρ       λ      ρ
wikivote     0.826  1      0.828  1      0.849  1      0.850  1      0.845  1.005   0.826  1.022
enron        0.503  1      0.509  1      0.502  1      0.507  1      0.509  1.003   0.482  1.008
astro ph     0.475  1      0.475  1      0.501  1      0.501  1      0.468  1.008   0.443  1.019
slashdot     0.703  1      0.705  1      0.722  1      0.722  1      0.716  1.011   0.692  1.106
webnd        0.213  1      0.207  1      0.214  1      0.207  1      0.217  1.013   0.193  1.056
stanford     0.271  1      0.267  1      0.339  1      0.319  1      0.258  1.024   0.216  1.109
webgoogle    0.189  1      0.219  1      0.188  1      0.217  1      0.187  1.009   0.222  1.012

Table 3.5: Runtime gain computed for PartLDG, PartFennel and PartFG over executions on ten graphs with k = 40 and s = 10.

Graphs         PartLDG Gain  PartFennel Gain  PartFG Gain
Wikivote       47.2%         58.5%            45.9%
Enron          50%           48.3%            47.6%
Astro ph       44.5%         44%              47.4%
Slashdot       39.6%         47.5%            45.2%
Web nd         51.4%         49.5%            56.2%
Stanford       54.5%         52.4%            56.8%
Web google     57.5%         52.7%            56.1%
Web berkstan   50.6%         49.6%            65%
Live journal   40.5%         45.6%            42.1%
Orkut          52.7%         51.2%            39.9%


The runtime gains are reported in Table 3.5. Over all our graph datasets, PartFennel has an average gain of 49.93%, PartLDG 48.85% and PartFG 50.26%. This shows that the partial restreaming model reduces the runtime by half.

3.3 Chapter summary

In this chapter, we discussed our proposed methods for the balanced graph partitioning problem. Specifically, we introduced a new online heuristic for streaming partitioning, and we showed that we improve the partition quality by partially restreaming several portions of the graph. We showed that our proposed heuristic produces partitions of significantly enhanced quality in terms of balance and edge cut: the partitions produced are exactly balanced and have a lower edge cut.

We also introduced partial restreaming graph partitioning, which restreams several portions of the graph in order to improve the partition quality. We evaluated our proposed methods on ten different graph datasets and compared them with state-of-the-art partitioning heuristics (Fennel, LDG and METIS). We showed that our proposed partial restreaming methods produce partitions of similar quality to those produced by fully restreaming methods, with the advantage of incurring lower runtime and memory costs. We also presented another variant of partial restreaming partitioning, called selective partial restreaming partitioning, and showed that selecting relevant portions of the graph, using degree and density as selection criteria, does improve the quality of the partitions.

Chapter 4

Streaming METIS Partitioning: an Online version of METIS Heuristic for Big Graph Partitioning

Graph datasets being generated and stored nowadays amount to several terabytes, which makes the task of processing and even storing them very challenging. Even though efficient tools for graph partitioning exist, they become outdated when facing the big data issue. Thus, there is a pressing need to design new tools that achieve both effectiveness and efficiency when handling this kind of graph dataset. In this chapter, we present an online alternative, called Streaming METIS Partitioning (SMP), to the offline METIS heuristic for graph partitioning, known to be fast and effective. We show theoretically and experimentally that SMP yields results of similar quality to the original METIS, while the time complexity of SMP is lower than that of METIS, making SMP faster.

This chapter is organized as follows: Section 1 introduces and gives the general context for the SMP approach. We highlight the METIS algorithm in Section 2, and in Section 3 we introduce the Streaming METIS Partitioning approach. Section 4 shows the experimental evaluation of SMP, and finally, Section 5 gives a chapter summary.

4.1 Introduction & Context

The last two decades have witnessed a sheer increase in the size of the graph datasets that emerge in different applications such as social networks and mobile applications, to name but a few. For example, the world wide web consists of at least one trillion hyperlinks [8], the well-known social network Facebook has 1.083 billion daily active users [6], and Twitter totals 305 million users [7].

This plethora of graph datasets has led many researchers to create new frameworks for big graph processing: Pregel [11] and its open-source version Giraph [12], Graphlab [9] and Microsoft's Trinity [10], to name a few. The aforementioned platforms perform a partitioning of the graph as a pre-processing step, where the graph dataset is distributed across several machines, and parallel computations are then run on each machine.

Even though these graph processing platforms seem considerably efficient, they use a hash function to partition the graph, i.e., the graph nodes are assigned to the machines uniformly at random, which leads to a perfectly balanced partition but incurs a significant inter-machine communication overhead that causes the computations to slow down.

The graph partitioning problem addresses this issue, having the main goal of dividing the graph data into several distinct sets of equal size with the constraint of minimizing the number of edges crossing these sets.

In fact, the reason for having these two constraints (balance and minimum edge cut) is that: (i) balancing the partition (parts of equal size) ensures that each machine is given the same workload, and (ii) minimizing the number of crossing edges reduces the network overhead, which ultimately makes graph partitioning an essential pre-processing step for speeding up the subsequent computations.

What is the difference between parts and partitions? It is worth mentioning and clearing up the difference between parts and partitions. A partition is the set of parts which are subsets of the graph vertices; at the end of a partitioning task, we obtain a partition of the graph consisting of several parts.

In its formal form, the graph partitioning problem is known as k-way balanced graph partitioning and is an NP-hard problem [15]. It is a problem that has been well studied and has a rich history, and though many works and heuristics have been proposed in the literature, a wide range of them are cumbersome and not tailored to process large-scale graphs.

Nevertheless, a new variant of graph partitioning which enables fast and efficient big graph partitioning was introduced recently and is known as streaming graph partitioning (SGP).

With this motivation in mind, we proposed in the context of this thesis a novel streaming heuristic for graph partitioning, termed Streaming METIS Partitioning (SMP) [73], based on an adaptation of the well-known offline METIS [2]. The new heuristic extends METIS to the online setting and brings significant enhancements to streaming graph partitioning.

On the one hand, SMP adapts a high-quality partitioning heuristic to an online setting to reduce the computational cost on large-scale graphs, while achieving near-optimal partition quality.

On the other hand, SMP performs graph partitioning using METIS, known to be one of the most powerful offline partitioning methods as seen in a number of recent experimental studies over a wide range of graph datasets ([65, 67, 68]).

In a nutshell, the main idea of SMP is to apply the METIS partitioning method on small subgraphs in order to reduce computations and fit the memory capacity limitations of the partitioner, and also to extend a powerful offline partitioning method to a streaming setting while simultaneously ensuring a balanced partition along with a minimal edge cut.

The use of the METIS algorithm promotes the quality of the partition (i.e., the number of crossing edges) and leads to minimal crossing edges within the partition. To ensure balance within the partition, the size of the parts is controlled through a multinode weight update strategy that controls the partition growth. Besides, we show through a complexity analysis of the new SMP approach that its computational time complexity is lower than that of METIS, while it maintains competitive performance compared to offline METIS. Last but not least, we conducted extensive experiments on various benchmark graph datasets, and the obtained results demonstrate that the SMP method enjoys considerable advantages compared to state-of-the-art methods [65, 67, 68].

4.2 METIS: Multilevel Graph Partitioning Heuristic

This section is dedicated to introducing the multilevel mechanism for graph partitioning. To do so, we focus on the METIS multilevel heuristic [2], which consists of three main phases: A) the coarsening phase, B) the partitioning phase and C) the uncoarsening/refinement phase.

4.2.1 Coarsening phase

The coarsening phase aims to successively reduce the size of the graph to be partitioned. Let G0 denote the original graph; then, during coarsening, a series of smaller graphs G1, G2, ..., Gq is created such that |V0| > |V1| > |V2| > ... > |Vq|, where Gq is called the coarsest graph and consists of fewer vertices.

Usually, this can be achieved by finding a maximal matching of the graph; the reason is that a maximal matching of the graph has the ability to preserve many properties of the original graph [2, 47].

Definition (Maximal matching). A matching of a graph, also known as an independent edge set, is a set of edges without common vertices. Moreover, a matching M of a graph G is maximal if every edge in G has a non-empty intersection with at least one edge in M [48].

After the maximal matching is computed, the two endpoints of each edge in the matching are collapsed into one node, thereby reducing the graph size. The coarsening process is iterated until the coarse graph reaches the desired small size, of a few hundred vertices for example.
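For illustration, one coarsening level based on a greedy maximal matching could be sketched as follows (METIS itself uses more refined matching schemes, e.g., heavy-edge matching; the names below are ours):

```python
def coarsen_once(adj):
    # adj: dict mapping each vertex to an iterable of its neighbors.
    # Greedily build a maximal matching: for each unmatched vertex,
    # match it with its first unmatched neighbor.
    matched, pairs = set(), []
    for u in adj:
        if u in matched:
            continue
        for v in adj[u]:
            if v not in matched and v != u:
                matched.update((u, v))
                pairs.append((u, v))
                break
    # Unmatched vertices are carried over as singleton multinodes.
    singletons = [u for u in adj if u not in matched]
    return pairs, singletons  # nodes of the next, coarser graph
```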

We encourage readers interested in more details to consult [2] as well as references therein.

4.2.2 Partitioning phase

During the partitioning phase, the coarsest graph Gq is partitioned; precisely, the vertex set Vq is partitioned into k parts. Since the coarsest graph Gq consists of few vertices, it is possible to use expensive graph partitioning methods, such as the spectral partitioning method, to give optimal partitions.

Various algorithms can be used in this phase, such as the spectral method [51], the KL algorithm [50], and the GGP and GGGP algorithms [2].

4.2.3 Uncoarsening and refinement phase

Once the partition of the coarse graph Gq is obtained, it is projected back to G0, passing through the intermediate graphs Gq−1, Gq−2, ..., G1. Moreover, a refinement is performed to enhance the partition quality in the finer graph. The refinement task consists in swapping two nodes' positions whenever it leads to a lower edge cut while preserving the balance of the partition. The Kernighan-Lin algorithm is usually used for this purpose [2].

4.3 Streaming METIS Partitioning Approach

This section presents our proposed Streaming METIS Partitioning Heuristic. First, we define key notations used along the section and give an overview of the heuristic. Then, we detail the different phases of SMP, and we finally give a complexity analysis of the approach.

4.3.1 Notations

Let G be the graph to be partitioned such that G is simple and G =(V,E)whereV is the vertex set with size n = |V | and E the edge set with size m = |E|. Assuming that u and v are vertices in G, w(v) denotes the weight on vertex v,andw(u, v)theweight on edge (u, v). Let k be the desired number of parts.

t t ··· t We denote by P =(S1, ,Sk) the partition of the graph at time/iteration t where t ··· t t ··· t S1, ,Sk are called parts, and are represented by a set of multinodes M1, ,Mk such that at the final iteration we have: k i) Si ⊆ V , ii) Si ∩ Sj = ∅ for every i = j and iii) Si = V . Wedefinethefraction i=1 |Ec| of edge cut to be minimized during the partitioning process as λ = m ,whereEc = {e(Si,V\Si),i∈{1, ··· ,k}} represents the set of edges with ends belonging to different parts. The normalized maximum load is defined by ρ which represents the balance of maximumload the partition: ρ = n ;wheremaximumload is the size of the biggest part. k

Finally, we denote by p the number of vertices loaded from the graph data stream at each iteration.

4.3.2 Overview of the method

The proposed method SMP is a natural extension of the well-known METIS to the online setting. It aims to partition the full graph into k parts in an online fashion, i.e.,the graph is not memory resident but arrives in a stream where each vertex is accompanied by its adjacency list.

SMP builds an accurate partition of the streamed graph using the METIS heuristic over small subgraphs loaded from the graph data stream. In other words, at each iteration Streaming METIS Partitioning: an Online version of METIS Heuristic for Big Graph Partitioning 49 of the method, SMP loads a graph portion of p vertices from the data stream, where p remains constant during the overall execution of the algorithm and we have p<

For each newly loaded subgraph, SMP first builds a weighted graph connecting new loaded nodes to previously obtained partitions (multinodes Mi). This intermediate graph is called the atomic graph, and is weighted on both edges and vertices. Then, METIS is used to partition the atomic graph, where multinodes update their weights. Until the end of the data stream, the same process is performed: p vertices are loaded, the atomic graph is constructed, then METIS is used to partition it and multinodes weights are updated. In the following, we elaborate these steps in more details.

4.3.3 Streaming METIS Partitioning Approach

The SMP method works in two main stages. The first is specific to the graph partition a t = 0, and the second concerns subsequent operations where t ≥ 1.

At t =0,p vertices are loaded from the graph data stream, that we call simple nodes.

METIS is then used to partition the subgraph and produce k parts S1, ··· ,Sk, corre- sponding respectively to multinodes M1, ··· ,Mk. Then, SMP assigns a weight to each multinode Mi that represents the number of vertices in Si. Finally, the cut λ0 for the first iteration t = 0 is recorded.

At t ≥ 1, SMP processes two node groups (i.e. the previously obtained k multinodes and the p new loaded simple nodes). To address this case, SMP considers jointly the two groups as an atomic graph, which is a weighted graph, with weights on both vertices and edges. Weight on edges connecting a simple node v to multinode Mi is measured as the number of direct neighbors of v in Si. Besides, edges connecting two simple nodes are weighted 1. On another side, edges between two multinodes are intentionally omitted (see subsections 1) Atomic graph construction and 2) Partitioning). Then, METIS is used to partition the new weighted graph of p + k vertices. Finally, the multinodes weights are updated according to the partition obtained.

4.3.3.1 Atomic graph construction

As aforementioned, the atomic graph is constructed for t ≥ 1, which is a weighted graph that is going to be partitioned using METIS. The atomic graph consists of the multinodes obtained from previous iteration and a p new nodes loaded from the graph data stream. The new p nodes are simple nodes with weight equal to 1 since that the original streamed graph G is simple. On the other hand, the k multinodes have weights Streaming METIS Partitioning: an Online version of METIS Heuristic for Big Graph Partitioning 50 referring to the size of their corresponding parts. Multinodes update their weights after each partitioning, meaning that their weights increase along the iterations.

Edges connecting the p simple nodes are weighted by 1 as they constitute a subgraph of G. Besides, edges connecting simple nodes to multinodes are weighted according to the number of neighbours in multinode, e.g., for a new loaded vertex v with 5 neighbours in part Si which is represented by multinode Mi, the edges connecting v to Mi is rated w(v, Mi) = 5. Figure 4.1 illustrates the process of creating the atomic graph.

Inter-multinode edges are omitted because they represent the cut obtained on the pre- vious partitioning λt−1. Moreover, considering those edges in the atomic graph would skew the partitioning process: as stated in section 4.2, the coarsening phase of METIS uses the heavy edge matching strategy which starts on a random vertex u and matches it with a neighbor v such that (u, v) has the largest weight among the adjacency list. Inter-multinode edges have high weight compared to other edges in the atomic graph,if they are not omitted, there is a high probability that two multinodes will be selected and added to the matching and hence the partition balance will be compromised. Therefore, inter-multinode edges are omitted in the atomic graph.

4.3.3.2 Partitioning

Once the atomic graph constructed, it is partitioned using METIS where the coarsening, partitioning and uncoarsening phases of METIS are performed (as explained in section 4.2).

However, at every iteration, the partitioning should satisfy a partitioning condition: each part should contain one multinode plus several simple nodes, in other words, sim- ple nodes are assigned to the k existing parts which are represented by multinodes. Moreover, placing two multinodes in the same part is equivalent to merging two parts, which is not allowed in order to prevent high imbalance among the partition and also to avoid replacing nodes. In fact, two multinodes can end up in the same part due to either the coarsening or partitioning phase.

To address this shortcoming, two constraints will be added to the algorithm: (i) the atomic graph should not consider inter-multinodes edges and (ii) the coarsening phase of METIS is skipped or executed once.

Constraints.

This subsection will explain how these two constraints ensure the partitioning condition (each obtained part consists of one multinode and several simple nodes) at each iteration, at the coarsening and the partitioning phases. Streaming METIS Partitioning: an Online version of METIS Heuristic for Big Graph Partitioning 51

p loaded vertices

v1 v2 v3 v4 vp

2 2 1 1 3

3 3 4 M1 M2 Mk

k Multinodes

Figure 4.1: Constructing the atomic graph at iteration t.

At the coarsening phase, a maximal matching is found using heavy edge matching strat- egy which starts from a random vertex u and matches its neighbor v iftheedge(u, v) has the largest weight on the adjacency list of u [2]. Every edge in the matching is collapsed, and the two correspondent vertices are combined to form one node.

Thus, to avoid collapsing edges connecting two multinodes, we explicitly ignore them while constructing the atomic graph. Nevertheless, the problem still persists as several coarsening iterations are performed. For instance, in figure 4.2, at the first iteration of coarsening, starting from node d, the edge (d, M2) is selected and added to the matching since it has a large weight, then nodes d and M2 are combined to form d-M2 node, but at the second coarsening iteration, the edge (d-M2,M3) will be selected and added to the matching set because of its large weight, and then the two multinodes M2 and M3 will end up placed in the same part. Hence, we either allow only one iteration of coarsening or completely skip the coarsening phase since the atomic graph processed is of a small size (p + k<

The partitioning phase in METIS can lead to a similar issue (more than one multinode in one part) without incurring partition imbalance. In fact, the partitioning algorithm used (Graph Growing Partitioning GGP [2]) starts from a random vertex and grows a region around it until the sum of vertices weights reaches the maximum allowed weight, then it selects randomly another vertex from the remaining ones and repeats the same process until k parts are obtained. This partitioning algorithm could place two or more multinodes in the same part if the sum of their weights is lower than the maximum weight per part. Hence no partition imbalance would occur but still, the nodes will be replacedwhichisnotallowedinoursetting.

Proposition 4.1. Given a positive integer t that denotes the iteration of the algorithm (t =0refers to the initial iteration), at t>1, no two (or more) multinodes will be assigned to the same part using the SMP method. Streaming METIS Partitioning: an Online version of METIS Heuristic for Big Graph Partitioning 52

t Proof. Let k be the number of parts and Pi be the part of index i at iteration t (initial iteration is 0), p is the block size of loaded vertices at each iteration, k the number of t t t p∗t t parts. Let w(Pi )betheweightofPi ,thenw(Pi )= k . The weight of two parts Pi t t t 2p∗t and Pj is then w(Pi )+w(Pj )= k since the partition is balanced.

The GGP algorithm used by METIS, yields an exactly balanced partition, which means k−1 t i=0 w(Pi ) that in our case, at time t, each part should be of weight k .

t t Two multinodes representing Pi and Pj won’t be assigned to the same part at t+1 after running the GGP algorithm if the sum of their weights exceeds the maximum allowed weight per part, in other words:

 −1 k w(P t+1) (i, j ∈{0, .., k − 1} ,i= j) w(P t)+w(P t) > i=0 i i j k

2p ∗ t p ∗ (t +1) ⇔ > ⇔ t>1 k k

Therefore, at t>1 no two (or more) multinodes will be placed in the same part, which p is tolerable because the parts’ weight at t = 1 (equal to k ) is a minimum compared to parts’ weight in further iterations, and as a result, the partition balance would not be skewed and nodes replacing operations would not be costly.

For the uncoarsening phase, several vertices may switch parts if that is going to improve the cut and preserve the balance, or just improve the balance without improving the cut. Thus, it is not possible for two (or more) multinodes to end up in the same part because it directly affects the partition balance.

• Multinode weight update

After METIS partitioning is done at iteration t, the weights on multinodes are

updated. As aforesaid, each part obtained at time t consists of one multinode Mi

and α simple nodes where α is a positive integer. The weight on multinode Mi is incremented by α.

• Edge cut tf The final edge cut is represented by λ, and we have λ = λt where λt is the t=0 cut obtained at iteration t,andtf is the final iteration by which the graph data  n  stream ends, and we have tf = p . Streaming METIS Partitioning: an Online version of METIS Heuristic for Big Graph Partitioning 53

11 First coarsening a b c d 5 d,M2 M3 1 1 5 Second coarsening 5

d,M2,M3 M1 M2 M3

Figure 4.2: Coarsening phase: an example with a focus on nodes d, M2 and M3.

4.3.3.3 Complexity Analysis

The aim of extending METIS to an online/streaming setting was to enable high-quality METIS partitioning for large scale graphs in a streaming setting, i.e.,withacheaper computational cost.

• Time complexity

Originally, the complexity of METIS is O(n + m + klog(k)) [74]. In our setting, the algorithm has the following atomic complexity (according to the atomic graph):

O(ni + mi + klog(k)) where ni and mi are the the number of vertices and edges in

the atomic graph, k is the number of parts. Please note that ni is fixed to a constant

value such that ni << n and mi << m, and we have ni = p + k.However,the algorithm performs METIS partitioning several times, as long as it takes to process all the graph. Specifically, it takes  n  times until all the graph is processed, hence ni the complexity of the SMP algorithm would be: O(n + n (m + klog(k))). The ni i highest will be the complexity when the processed graph is complete, i.e., m = n(n−1) ni(ni−1) 2 and mi = 2 for the atomic graph.Letf(n)(resp fi(n)) represents n(n−1) the complexity of METIS (resp SMP ): f(n)=n + 2 + klog(k)andfi(n)= n ni(ni−1) fi(n) n + ( 2 + klog(k)). We have lim ( ) =0,thus,O(fi(n))

• Space complexity

The space complexity of both METIS and SMP are the same since SMP uses METIS to partition the graph. However, it is apparent that SMP requires smaller memory space than METIS, the reason is that SMP processes the graph in small portions during several iterations, whereas METIS processes the whole graph at once. Streaming METIS Partitioning: an Online version of METIS Heuristic for Big Graph Partitioning 54

4.4 Experimental Evaluation

We conducted an extensive experimental evaluation to assess the performance of our proposed SMP method on various real graph datasets and compared it to alternative methods. In this section, we first give a description of the evaluation settings such as the datasets used and the evaluation methodology. Also, we present the performance metrics used and finally, we present and discuss the obtained results.

4.4.1 Settings

Datasets.

We used 11 real graphs for our experiments, all obtained from the SNAP repository [72]. The graph datasets were chosen to be small enough so that we can find offline solutions with METIS and still big enough to capture the behavior of the online heuristics, whereas the largest graph considered in this work is ComFriendster with more than 65 millions of nodes. Table 4.1 summarizes basic statistics of the graph datasets: the number of vertices (|V |), the number of edges (|E|)andalsothegraphtypewheretwotypesofgraph datasets were used, web and social. All graphs were made undirected by reciprocating the edges, and vertices with 0 degree and self-loops were removed.

Methodology.

Through all experiments, k is set to 40.

WefirstrunourSMPmethodonallgraphdatasetslistedinTable4.1, and compare it to its offline counterpart METIS to assess whether the same partition quality is conserved.

Secondly, we compare SMP to the state-of-the-art streaming partitioning heuristics: LDG, FENNEL and FG. The comparison is also extended to restreaming methods: ReLDG, ReFENNEL and ReFG where the number of restreaming iterations is set to 10.

In order to verify the stability of our method, we run SMP on all graph datasets with three different streaming orders:

• Breath First Search BFS: is obtained by randomly selecting a node on each con- nected component of the graph and starting a breath first search from the given node. Component ordering is done at random in case of multiple connected com- ponents in the graph. Streaming METIS Partitioning: an Online version of METIS Heuristic for Big Graph Partitioning 55

• Depth First Search DFS: is similar to BFS but a depth-first search is performed instead of BFS.

• Random: is given by a random permutation of the vertex set.

Random ordering is a standard ordering in streaming literature and a general assumption when theoretically analyzing streaming algorithms. On the other hand, BFS and DFS orderings have the main advantage of allowing to see edges immediately in the stream, unlike adversarial orders.

Finally, we check the impact of the atomic graph size on SMP performances: we run SMP on two graph datasets while varying p, the number of nodes loaded from the graph data stream at each iteration.

All algorithms have been implemented in C++, and all experiments were performed on a single machine, with Intel Xeon(R) CPU at 1.9GHz, and 16GB of main memory, except for Comfriendster graph dataset (file size of 63GB) where a machine of 100GB of main memory has been used to find the offline solution (METIS). Reported runtime includes only the partitioning execution, excluding the required time to load the graph into memory.

Table 4.1: Graph Datasets used for our tests.

Graph |V | |E| type Wikivote 7115 100762 social Enron 36692 183831 social Astro ph 18771 198050 social Slashdot 77360 469180 social Web nd 325729 1090108 web Stanford 281903 1992636 web Web google 875713 4322053 web Web berkstan 685230 6649470 web Live journal 4846609 42851237 social Orkut 3072441 117185085 social ComFriendster 65608367 1806067137 social

4.4.2 Performance Metrics

Three metrics are used to show and compare performances of our proposed method SMP and other competitors:

• λ refers to the fraction of edge cut and should be minimized for optimal results. Streaming METIS Partitioning: an Online version of METIS Heuristic for Big Graph Partitioning 56

• ρ represents the partition balance (see section 4.3.1); when it is equal to 1, the partition is said exactly balanced, i.e., the closest ρ to 1, the more balanced is the partition.

• Texec reports the partitioning execution time without considering other program tasks, e.g., loading the graph into memory.

4.4.3 Experimental Results

SMP vs. METIS. Table 4.2 reports the fraction of edge cut obtained by SMP and METIS, the online vs. the offline version.

As expected, METIS outperforms SMP, due to the offline setting (in METIS) where the whole graph is memory resident which allows it to give optimal results.

However, results of both methods are comparable and the average difference of the obtained cut between SMP and METIS over all datasets is 26%.

Moreover, SMP outperforms METIS in Wikivote, 0.761 vs 0.770. The remaining results in the table show the closeness between the two methods, e.g., for Slashdot, SMP cuts only 7% more edges than METIS. However, the worst performances of SMP are obtained in Web google and Stanford, where SMP cuts more than 50% of edges compared to METIS.

Table 4.2: Fraction of edge cut λ given by SMP vs METIS. The balance ρ is 1.001 for both methods.

Graph SMP METIS Wikivote 0.761 0.770 Enron 0.612 0.412 Astro ph 0.557 0.372 Slashdot 0.761 0.694 Web nd 0.200 0.034 Stanford 0.650 0.114 Web google 0.585 0.014 Web berkstan 0.256 0.108 Live journal 0.443 0.300 Orkut 0.672 0.347 ComFriendster 0.460 0.283

SMP vs. Online methods. Results in table 4.3 show that our proposed SMP method outperforms all other online competitors (LDG, FENNEL and FG) for almost all graph datasets. Streaming METIS Partitioning: an Online version of METIS Heuristic for Big Graph Partitioning 57

Nevertheless, FG outperforms SMP in Enron, Astro ph and Slashdot with a slight difference, i.e., for Enron resp.(Astro ph, Slashdot), FG cuts 2.3% resp.(0.2%, 0.3% ) less edges than SMP. For Stanford and Web google, SMP gives the worst results, cutting up to 20% of additional edges.

In table 4.4, we report the results of SMP vs restreaming methods ReLDG, ReFENNEL and ReFG. The table shows comparable performances of SMP and restreaming methods where the average difference between SMP and restreaming methods is 17%.

Table 4.3: Fraction of edge cut λ and normalized maximum load ρ given by SMP vs. online competitors: LDG, FENNEL and FG.

2*Graphs SMP LDG FENNEL FG λ ρ λ ρ λ ρ λ ρ Wikivote 0.761 1,001 0.867 1 0.862 1.000 0,844 1 Enron 0.612 1,001 0.610 1 0.612 1.000 0,589 1 Astro ph 0.557 1,001 0.619 1 0.578 1.000 0,555 1 Slashdot 0.761 1.001 0.787 1 0.777 1.000 0,758 1 Web nd 0.200 1.001 0.261 1 0.270 1.000 0,249 1 Stanford 0.650 1.001 0.392 1 0.347 1.043 0,349 1 Web google 0.585 1.001 0.308 1 0.313 1.023 0,310 1 Web berkstan 0.256 1.001 0.342 1 0.367 1.023 0,386 1 Live journal 0.443 1.001 0.462 1 0.546 1.009 0,442 1 Orkut 0.672 1.001 0.639 1 0.696 1.076 0,627 1 ComFriendster 0.460 1.001 0.619 1 0.535 1.095 0,582 1

Stream orders. In order to investigate the stream order impact on SMP performances, we run SMP on eight datasets each with three different orders: BFS, DFS and Random. Reported results in table 4.5 reveal the considerable influence of the stream order on SMP. For example, for Web google, on the random order, it cuts 58.5% of edges whereas only 7.7% of edges are cut on DFS order.

According to Table 4.5, SMP gives optimal results when it is run under DFS order: it actually outperforms all other online methods (FG, FENNEL, FG) and is the closest to METIS, with an average difference of 4.3% (instead of 26% on a random order). On the other hand, BFS order doesn’t improve the results w.r.t. the random order, it makes the results even worst for some datasets, e.g., Web nd and Web berkstan.

SMP and online competitors were all run under a DFS order to assess their performances. Reported results in table 4.6 show that SMP outperforms all competitors (LDG, FENNEL and FG) on all datasets when they are run under DFS order. For some datasets, the difference is so important: Streaming METIS Partitioning: an Online version of METIS Heuristic for Big Graph Partitioning 58

Table 4.4: Fraction of edge cut λ and normalized maximum load ρ given by SMP vs. restreaming methods: ReLDG, ReFENNEL and ReFG. 2*Graphs SMP ReLDG ReFENNEL ReFG λ ρ λ ρ λ ρ λ ρ Wikivote 0.761 1.001 0.835 1 0.813 1.023 0.812 1 Enron 0.612 1.001 0.475 1 0.476 1.098 0.479 1 Astro ph 0.557 1.001 0.418 1 0.413 1.019 0.433 1 Slashdot 0.761 1.001 0.713 1 0.703 1.041 0.711 1 Web nd 0.200 1.001 0.113 1 0.143 1.048 0.164 1 Stanford 0.650 1.001 0.204 1 0.193 1.025 0.200 1 Web google 0.585 1.001 0.161 1 0.160 1.087 0.163 1 Web berkstan 0.256 1.001 0.212 1 0.254 1.037 0.241 1 Live journal 0.443 1.001 0.313 1 0.330 1.006 0.325 1 Orkut 0.672 1.001 0.395 1 0.41 1.005 0.398 1 ComFriendster 0.460 1.001 0.425 1 0.431 1.001 0.428 1 for Stanford, SMP cuts 11.8% of edges whereas LDG cuts 48.2%, FENNEL 35.1% and FG 37.6%. On average, competitors cut 16% more edges than SMP on DFS order.

DFS order enables SMP to have great performances mainly because it sends connected portions of the graph to SMP and in fact, this allows SMP to conduct a better parti- tioning process hence giving low edge cut partitions.

On the contrary, when there are no edges in the stream, i.e.,theatomic graph is made of a set of isolated vertices, where the best SMP can do is to balance the partition, and in later iterations when edges will finally arrive, the atomic graph will be highly connected such that a high edge cut partition becomes unavoidable.

Table 4.5: Fraction of edge cut λ given by SMP in the BFS, DFS and Random orders.

Graph BFS DFS Random Wikivote 0.955 0.752 0.761 Enron 0.687 0.463 0.612 Astro ph 0.588 0.496 0.557 Slashdot 0.897 0.719 0.761 Web nd 0.678 0.078 0.200 Stanford 0.426 0.118 0.650 Web google 0.481 0.077 0.585 Web berkstan 0.312 0.093 0.256

Impact of parameter p. We run SMP on Enron and Slashdot graph datasets while varying p (the size of the graph portion loaded from the data stream at each iteration) to see the impact on the obtained cut. In figure 4.3, one can clearly see that the cut Streaming METIS Partitioning: an Online version of METIS Heuristic for Big Graph Partitioning 59

Table 4.6: Fraction of edge cut λ given by SMP vs. online competitors: LDG, FENNEL and FG under the DFS streaming order. Graph SMPDFS LDGDFS FENNELDFS FGDFS Wikivote 0.752 0.863 0.841 0.822 Enron 0.463 0.726 0.555 0.583 Astro ph 0.496 0.813 0.540 0.537 Slashdot 0.719 0.848 0.737 0.761 Web nd 0.078 0.204 0.148 0.141 Stanford 0.118 0.482 0.351 0.376 Web google 0.077 0.545 0.174 0.195 Web berkstan 0.093 0.391 0.280 0.286 decreases when p is bigger for both Enron and Slashdot. In other words, when we process big portions of the graph at each iteration, the cut is more likely to be minimal, with the best cut being obtained necessarily when p = n (METIS).

Enron 0.8 Slashdot

0.75

0.7

0.65

λ 0.6

0.55

0.5

0.45

0.4

n/10 n/5 n/4 n/3 n/2 n p

Figure 4.3: SMP method: Variation of λ for different values of p for Enron and Slashdot graph datasets.

Runtime. In Table 4.7 we show the execution time of SMP and METIS over all datasets except Comfriendster which was run on a different machine. Results show that for small graphs, i.e., Wikivote, Enron, Astro ph, Web nd, Stanford, Web google and Web berkstan, METIS is faster though it has higher complexity compared to SMP. However, when n (number of vertices) grows and approaches values in the order of millions, SMP is faster than METIS. Hence, SMP is faster for big instances of n, i.e., large graphs. Those results confirm our complexity analysis in section 4.3. Streaming METIS Partitioning: an Online version of METIS Heuristic for Big Graph Partitioning 60

Table 4.7: Execution runtime Texec of SMP and METIS given in seconds.

Graph SMP METIS Wikivote 0.472 0.260 Enron 0.684 0.315 Astro ph 0.496 0.275 Slashdot 1.367 1.766 Web nd 1.244 0.660 Stanford 3.900 1.194 Web google 9.387 3.647 Web berkstan 7.522 2.590 Live journal 59.630 165.100 Orkut 97.220 354.709

4.5 Chapter Summary

In this chapter, we present our proposed streaming graph partitioning heuristic based on the offline de facto METIS and called SMP: Streaming METIS Partitioning [73]. We investigated the expansion of the offline multilevel heuristic METIS to an online setting (where the graph dataset is streamed) in order to obtain approximate partition quality and also to alleviate the challenging computational cost incurred when processing large scale graphs in the offline setting.

Our complexity analysis showed that SMP has a lower time complexity than METIS and conducted experiments confirm it. Moreover, SMP gives high-quality partitions that are competitive to METIS, especially when it is run within a Depth-First Search (DFS) streaming order. In fact, the streaming order of the graph has a considerable impact on SMP performances and DFS order allows it to give optimal results.

On the other hand, experiments showed that SMP outperforms all online competitors by giving minimal edge cut.

Part II

Aggregated Graph Search

61 This part of the thesis is about the Aggregated Search in Graphs (ASG). The novelty brought by this work is that the ASG enables simultaneous querying of multiple graphs, a deeper granularity graph search and building an answer or a match from multiple graph fragments.

Organization of the part. This part consists of three chapters: in the first chapter we give a general definition of the aggregated search and gives an overview of related research disciplines and all research work related to the keyword aggregation encountered in the literature. The second chapter focuses on the search methods in graphs. It gives definitions of related concepts, as well as a taxonomy of graph search methods. It gives a defined context of the aggregated search in graphs among other graph search proposed in the literature. The third chapter presents the core contribution work. We propose and present a new framework for graph search based on aggregated search.

62 Chapter 5

Aggregated search: General definition and overview

Aggregated search was first introduced as a new search paradigm in information retrieval (IR) domain. The need for such a search framework is to bring more focus and relevance to user query answering. In other words, when a user submits its query in classical web search engines, the latter give back a ranked list of links which is to the user to go through in order to find an answer. For some queries, the answer is definite and could be widely found in one link (e.g. eiffel tower construction year). However, for other types of queries, finding an answer in one document would be striking (e.g. Mariah Carey ) because of the ambiguity carried by the query due to the user confusedness about its information need. For such queries, it would be interesting to give the user a short biography about the artist, a photo, several videos and related news. The building of the answer makes the search engine very effective as the user won’t have to scroll all the list of links and try to summarize his or her information need. The aggregated search enables this feature of answer building from multiple sources of information to enhance and improve the effectiveness.

In addition to the aforementioned aggregated search concept, the keyword aggregation is encountered in literature with a meaning of summarization [75–77].

This chapter will cover the aggregated search paradigm in the realm of Information Retrieval (IR) First, we will give a definition of the aggregated search, then we give an overview about related (IR) disciplines and tools.

63 Aggregated search: General definition and overview 64

5.1 Aggregated search: Motivation and definition

Aggregated search is a novel search paradigm that arose in the IR field. It presents numerous advantages and opens a wide range of functionalities to improve the search engines performances in a way that better results (results are more precise and more relevant) would be given in response to a user query. This section starts by giving an introduction about IR, what is IR, and we describe the main trends in this field of research as well. Then, we give the main motivation behind the use of aggregated search in IR and for the aggregated search paradigm in general. Then, we give a formal definition of the aggregated search as a novel and promising paradigm search in IR.

A brief introduction to Information Retrieval (IR). Information Retrieval could be any action taken to extract information from given resources. For instance, it could be just the simple act of checking the weather of the day in a smart phone. But, Information Retrieval as an academic and research field could be defined as:

• ”Information retrieval is often regarded as being synonymous with document re- trieval, and nowadays, with text retrieval, implying that the task of an IR system is to retrieve documents or texts with an information content that is relevant to a user’s information need.” (Sparck Jones and Willett 1997)

• ”Information retrieval (IR) is finding material (usually documents) of an unstruc- tured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).” (Manning et al. 2008)

These two definitions agree that IR is the task of finding some information material that is relevant to an information need expressed by the user, from a large collection of information resources.

Definition 5.1. Information material. The information material is the output of an IR system following an information need expressed by the user. The information material is defined by its format, which could be a textual document, a video, an image ···; and is also defined by its granularity: a whole document or a paragraph, a video sequence, an image portion, etc.

Definition 5.2. Information need and User query. The information need of a user is the abstract form of information he intends to obtain, and the query is a representation or an expression of this need. The query is usually on a textual form for web search engines for example, or it could in any structured form or special language, e.g.,the Aggregated search: General definition and overview 65 query could be in SPARQL language in order to retrieve information from an RDF graph.

One of the main challenges in IR is the maintaining of the large collections of resources, especially with the advent of big data and the repercussion that it implies in the storing and the processing task. Also, building efficient indexing such that the retrieval task will result in high efficiency and effectiveness is another main challenge in IR.

5.1.1 Motivation

Classical web search engines select a list of relevant documents or links and return it to the user, who should go through it in order to find an answer to his query. Nowadays, this task of scrolling the list of documents could be very cumbersome as the most majority of users are using mobile devices and on the other hand, the plethora of information available on internet in this era of big data, makes the list really long which makes these classical engines out dated with regard to the satisfaction of users information need.

Furthermore, the matching between the query and the document, or the document rank- ing task in classical IR engines is done on a document-level, i.e., the whole document is said to be relevant or not; and this is achieved by using several models for documents scoring like the vector space model [78], the probabilistic model [79] and language mod- els [80]. For instance, in the vector space model [78], the query and documents are represented by vectors and the angle between these vectors represents their similarity. Specifically, the cosine of the angle is used instead of computing the angle itself, and it is given by the dot product of the two vectors.

The document-level ranking, as discussed above, might not suit the information need of the user, as it might be just a section of a document, a mix of some multimedia and text content or an overlap of several documents. Thus, there is a need of more in-depth search methods that (i) go in a deeper granularity level than the document level and that (ii) enable building an answer by aggregating several documents parts. The aggregated search paradigm gathers these two main features (i & ii), which will be elucidated in the following section.

5.1.2 Aggregated search definition

Aggregated search was defined for the first time in [81]as”the task of searching and assembling information from a variety of sources, placing it into a single interface”. Aggregated search: General definition and overview 66

Figure 5.1: Example of an Aggregated search result in Google web search engine.

In other words, aggregated search is a process that retrieves fragments of information from different sources and build a consistent answer to be returned to the user, where the information fragments could be of any granularity (document, paragraph, image, video sequence,...) or format(text, video, image,...). Figure 5.1 gives an example of aggregated search result in Google search, where the red rectangle represents the aggregated search result, which consists of images, name and role of the entity ”Paul Bowles” in the example, as well as a brief extract from his biography along with a short list of his most famous artwork. Moreover, the aggregated search is also seen as a threefold process [82]: (1) Query dispatching, (2) Fragment retrieval and (3) Result aggregation, which will be detailed in the following.

1. Query dispatching: This step refers to the preliminary processing of the query before proceeding to the query matching. This includes three main operations: source selection, query interpretation and query reformulation. Each of these operations is detailed in the following:

• Source selection: is about the selection of the sources that are prone to answer the query given a set of sources. One usual way to select the appropriate source is the identification of special keywords in the user query. It is true that special keywords are useful to predict the kind of answers the user is Aggregated search: General definition and overview 67

expecting [82], for example, if the query is like ”what is the weather like in Bali in ten days ?”, then relevant words would be weather and Bali,andthey will be used as special keywords for selecting relevant sources fast. • Query interpretation: is the task of analyzing the query to discover the intent of the question which is useful in order to determine the type of the question (e.g. what, where, when, ...) and also to identify semantic relations between the query and potential answers. For instance, from the query ”what is happening now in Paris ?”, the word Paris will be matched semantically with the Paris entity (in RDF graphs), and all related triples will be selected. Also, the words what and now will also give indications about the intent of the query: what would expect that the user is expecting an answer that is different from a time or a location type answer, and now gives an indication about actuality and news. • Query reformulation: When the user submits its query to a general web search engine, and after selecting the source, it is necessary to add some features to the query to adapt it to the source format that will be queried. To illustrate this, consider a search engine that interrogates an RDF graph, then the user query that is usually typed in a textual format should be translated to a SPARQL format by adding special keywords.

2. Fragment retrieval: As depicted in figure 5.2, Fragment retrieval is the task that follows the Query dispatching where the selected source(s) matches the query to potential information fragments. The task of fragment retrieval depends on the type of the fragment to be returned, i.e., there is a distinction between document retrieval when the whole document is expected to be retrieved and passage/focused retrieval which corresponds to a section of a document. Also, fragment retrieval depends on the multimedia type of the fragment (text, image, ...) as there is a difference between textual retrieval and multimedia retrieval. Upon completion, this step is supposed to output information fragments that will form the final answer to be returned to the user.

3. Result aggregation: Once the fragments are given, the task is to put them together in a coherent way to build the final result that will be returned to the user:thistaskiscalledresult aggregation. There are numerous ways of aggregating content, here we cite five among them and which will be further detailed in the following: grouping, merging, sorting, splitting and extracting.

• Grouping: Like the ”groupBy” operation in SQL (Structured Query Lan- guage) [83] which tends to make group of results that share some features, the grouping task in result aggregation aims at grouping together fragments Aggregated search: General definition and overview 68

that have several features in common, e.g., same multimedia format, same location, etc. • Merging: Slightly similar to grouping, the merging consist in merging all the fragments into one aggregate output, e.g., one document; unlike the grouping, where the output is a set of fragments groups. • Sorting: the sorting action takes the list of information fragments and sorts it according to some feature, e.g., relevance score, time, etc. The output of the sorting action is also a list of fragments. • Splitting: given a list of information fragments, the goal is to split the frag- ments into an even more smaller ones. This action is the opposite of the merging action, and its output is a list of even more smaller fragments. The splitting task is used when the fragments involved presents a relatively ele- vated size, and is usually accompanied by another action, like the sorting or grouping. • Extracting: This step is often used in conjunction with another actions and it consists in identifying the main semantic information in the fragments. The output of this action could be a named entity (Paris, Michael Jackson...), images, videos or pronouns, verbs in a textual fragment.

 

 

  



 

       

Figure 5.2: Aggregated search framework.

5.2 Aggregated search: related research and IR disciplines

Aggregated search finds its roots in numerous disciplines in IR such as federated search, vertical search, to name a few. It is straightforward that aggregated search is not a fully novel research area but rather a conjunction of many existing ones. In the following, we explain how federated search, query answering and natural language generation areas are related to the aggregated search. Aggregated search: General definition and overview 69

5.2.1 Federated search

Federated search, also known as distributed information retrieval, is an IR paradigm with multiple distributed data sources and search algorithms [84–86]. As for the aggregated search framework, the query dispatching step of federated search has to select appropriate sources that are most likely to answer the query. Besides, to make this source selection task fast and lightweight, every source has its own local representation or summary which is used to swiftly identify potential sources. Finally, the obtained results from each source are assembled into a ranked list.

It is worth mentioning that Metasearch is considered as the equivalent of Federated search in Web search context [87], and uses multiple web search engines as an intermediary to query different sources on the web. Then, the obtained results from each engine are assembled into one single interface.

Federated and aggregated search are similar in a way that they both have to query distributed multi-sources and have to assemble the final result. Federated search tools do not bring an improvement over mono-source IR tools, however, they are preferred when the available sources are too heterogeneous to be assembled into one central source.

5.2.2 Question answering

Instead of giving a ranked list of documents as in traditional IR engines, Question Answering (QA) paradigm aims to provide a set of answers [88] that are made up through information extraction and assembling.

Similar to the query dispatching step in the aggregated search framework, the queries in QA are in the form of questions with different taxonomies/types, e.g., ”what”, ”when”, ”yes/no” etc. Also, QA uses technics for identifying named entities and facts present in the question to help in understanding the query type which is useful for determining potential sources. The fragment retrieval step in QA consists in having a set of matching documents and the main task is to extract text sections and build the answers which is usually a critical and error prone process. Finally, for the result aggregation step, QA returns a list of answers supported with text passages extracted from the matching doc- uments. Interested readers are referred to [89] for an interesting study of the similarities between aggregated search and QA. Aggregated search: General definition and overview 70

5.2.3 Natural language generation

The similarity between aggregated search and Natural Language Generation (NLG) is that the expected output is a document built by using information fragments rather than a list of ranked documents, as in the setting of classical IR engines. Specifically, NLG is a research area that focuses on content aggregation using predefined ways to organize textual information in a consistent linguistic way [90]. Consequently, tasks of query dispatching and fragments retrieving are absent in NLG as the information fragments are given beforehand from search engines or databases [91]. Once the information fragments given, NLG uses different ways of information organization called discourse strategies. Those strategies could be either predefined/static, e.g., organizing the information using a cause-effet or chronological form; or learned/dynamic through the use of learning models on example documents that serve as a training base. However, the learning strategy is concept dependent and should be specific to one known domain [92], and that clearly explain why the NLG is successful solely in specific domain applications. Further information about NLG could be found in an interesting survey in [90]and references therein.

5.2.4 Composite retrieval

Composite retrieval is similar to the aggregated search in a way that it avoids returning a list of ranked documents but rather a coherent built result, represented by a set of item bundles. Composite retrieval was first introduced in [93], and is defined as the problem of returning k bundles of complementary items, where each bundle adheres to a budget constraint, and the k bundles have to be diverse.

Let us depict the composite retrieval process in the light of the following application scenario: consider a user that intends to plan a trip; he has to check for flights, hotels, important places to visit in the corresponding city, restaurants, the weather, etc. The task of composite retrieval will be to return k bundles (k is a user fixed parameter) of complementary items, e.g., within one bundle we need to have items such as a flight to Paris, a hotel, a restaurant, etc; whereas each bundle should be valid provided a budget constraint that could be for example a time interval of the trip or literally a budget (money). Moreover, the whole answer should present diversity where each bundle should be about a different destination city in the context of our example.

Authors in [93] show that the composite retrieval problem is NP-hard and propose two distinct heuristic approaches to solve it: first, the Produce and Choose (PAC) approach produces many valid bundles and selects k among them accordingly to the problem of Aggregated search: General definition and overview 71

Maximum edge subgraph. Second, the Cluster and Pick approach finds a k-clustering of bundles, then select one valid bundle from each cluster. It was shown through ex- periments in [93], that the CAP approach is more appropriate when diversity among bundles is highly required, otherwise, the PAC approach would be more appropriate.

5.3 Graph aggregation: Miscellaneous approaches around ”aggregation” in graphs

The keyword ”graph aggregation” have a different meaning in several works encountered in the literature, where the aim is usually to perform graph summarization or compres- sion by merging nodes that share common features [75–77]. In [75], ”graph aggregation” is considered as a process of grouping together nodes that share or have the same feature value, and the links between these nodes groups exist only if all the nodes in the two groups are related.

In [76], authors propose and formalize rules, that they called graph aggregation rules, to best summarize preferences and choices within a group of users. The main idea is, from N input oriented graphs that all have the same vertex set and that represent preferences or choices for N users, to build one (oriented) graph that represents all N graphs.

Similarly, numerous works for ontologies aggregating and merging [77] find inspiration in social choice theory [94] in order to aggregate opinions, preferences and judgments to optimally summarize and determine the global choice of a group.

5.4 Relational Aggregated Search

In this section, we highlight a special direction in aggregated search field called Relational Aggregated Search (RAS). In a way, RAS is related to graph search due to the focus given to relations between entities in RAS which is similar to the graph structure where entities or vertices are related to each other via edges. Since this thesis is originally situated in the graph theory research field, we decided to highlight and focus on Relational Aggregated Search as an intersection between Aggregated Search and Graph Search. We present key notions used in Relational Aggregated search along with a description of the according framework. Aggregated search: General definition and overview 72

5.4.1 Relational Aggregated search framework

Relational Aggregated Search (RAS) considers relations between information fragments, and uses them to perform an even more coherent result aggregation. RAS has several benefits such as it gives more relevance and focus by answering some queries directly (for example, the query ex-president of US? a direct answer of Barack Obama could be given in a direct way). Moreover, relations between information fragments can help in giving more structured results, by giving a list of attributes related to the entity in the query (for the entity ”Paul Bowles”, the attributes like name, date and place of birth/death, novels... can be returned).

Similarly to the aggregated search framework depicted in Figure 5.2, the Relational Aggregated Search fits into the same framework, with the same three main steps (Query Dispatching, Information Retrieval and Result Aggregation), however, RAS uses special approaches in these steps as it focuses on the relation between entities of information. For instance, the second step in RAS is called Relation retrieval instead of information retrieval.

5.4.1.1 Query Dispatching

As mentioned in subsection 5.1.2, the first step in the aggregated search framework is the query dispatching. The main goal of this first step is to prepare the query before proceeding to the matching with sources. It consists in selecting the relevant sources, interpreting and reformulating the query.

Query types. In [95], authors identify and propose a simple taxonomy of query types that are most likely to benefit from RAS. In the following, we describe the relational query types that benefit the most from RAS.

1. Attribute query: This type of query is the most succinct, as it asks for a given value of an attribute, for example ”the birth place of Paul Bowles”, the ”address of university Claude Bernard”, etc.

2. Instance query: The instance query is about one known instance of a class, and all related attributes will be returned as an answer. As an example for instance query, we cite: ”Paul Bowles”, which is an instance of the class Artist, ”MacBook Pro”, an instance of Laptop class, etc.

3. Class query: This query type has a wider scope than the types above mentioned. A class query could be as ”amphibious animals”, ”Pop singers”, etc. To this kind Aggregated search: General definition and overview 73

of query, a more voluminous answer would be returned (the list of all instances in the class, along with their attributes).

The query type is an important information such that, depending on it, an according solution for result aggregation will be triggered (see 5.4.1.3).

5.4.1.2 Relation Retrieval

In RAS, relations between information fragments are the core information to be re- trieved. Hence, the main goal in this step is to retrieve relations between information fragments. To do so, it is possible to use special sources that have already data items that are structured as entities along with their inter-relations, these are known as knowl- edge bases (ontologies). Another possible way to retrieve relation between information fragments is to borrow techniques of the Information Extraction research field. In a nutshell, Information Extraction field is about studying techniques and rules that ex- tract classes, instances and attributes and their relations as well whenever applied to simple documents or any unstructured information material [96]. We will not focus on Information Extraction, however, we direct interested readers to [95, 96] and references therein for additional information about this research field. In the following, review three types of relations, as a taxonomy that was proposed in [95].

5.4.1.3 Result Aggregation

When the search results or the information fragments are retrieved, they are aggregated and combined to form one compact answer. In RAS, the aggregation of the search result depends on the query type as follows:

1. Result aggregation of the Attribute query: The attribute query type requires the correct value of the attribute being requested. However, there may be many candidates to answer such a query, while not being able to decide which one is correct and most relevant. In general, attribute queries are answered by returning the attribute value right away or to by returning a list of the candidates along with textual elements with each candidate.

2. Result aggregation of the Instance query: Generally, the instance queries are answered by giving values of all the attributes of the instance. The attributes of the instance could be extracted from several pages and summarized. There are also specific methods that are related to people search [97], product search [? ], bibliographic search [? ], etc. Aggregated search: General definition and overview 74

3. Result aggregation of the Class query:

This type of queries is usually answered by a list of instances of the class in the query. The list of instances could be presented in two-dimensional tables, where each row represents an instance of the class, and each column the attributes [98].

5.5 Chapter summary

In this chapter, we tackled a general definition of aggregated search. We define the aggregated search in its original context, the Information Retrieval, as a process of in- formation retrieval that provides the user with a more exploratory experience by giving results in a combined manner rather than a ranked list of documents. The main mo- tivation between aggregated search is to avoid the user the burden of checking all the documents in the ranked list.

Then, we cite some important and related work to the aggregated search in the IR context, such as the federated search where the aim is to query multiple sources and gather the according results in one final list. We also noticed the similarity between the aggregated search and other research disciplines such as question answering, natural language generation.

We also explained how aggregated search is similar to the composite retrieval problem, as the latter aims to provide the user with a number of diverse item bundles that satisfy several constraints.

We further define the keyword ”graph aggregation” in a more general fashion by citing other works where the aim of aggregation is different from the context of aggregated search in IR. In these works, the meaning of graph aggregation is to perform some graph compression or a graph summary by merging nodes that have features in common. Also, graph aggregation may refer to the aggregation of choices and opinions of a group of users accordingly to some rules borrowed from the social choice theory.

The following chart in figure 5.3 gives a brief summary and classification of related works citedinthischapter.Asdepictedinfigure5.3, Aata Aggregation can be divided into two categories: the (semi)structured data aggregation and the unstructured data aggre- gation. The unstructured data aggregation concern the aggregation in the IR context and the composite retrieval problem, mainly because the contents being aggregated are unstructured (text, images, videos). The second category is the semi-structured data ag- gregation which is about graph aggregation (graph summarization, choices aggregation) or about performing aggregated search in graphs. The latter problem, i.e., aggregated Aggregated search: General definition and overview 75 search in graphs, is the main problem considered in this second part of the thesis, and will be presented along with definitions and related work in the next chapter.

 

         

   !            "#  $ "#   

   

Figure 5.3: A taxonomy of Data Aggregation.

Chapter 6

Graph search methods

Graphs represent an efficient and a broadly used representation formalism in information processing. The advent of big data in the last two decades results on a plethora of information that is stored and exploited as graph datasets. For this reason, the graph querying or the graph search task becomes extremely challenging as traditional methods incur an exorbitant computational cost. As a result, most of research works in the field are more concerned about designing and developing new techniques that will achieve the best tradeoff between results precision and the incurred computational cost (time and memory). Furthermore, the graph search task is very important as the performances of most applications significantly relies on the efficiency of the querying technique being used. In this chapter, we will discuss the graph search task and give an overview of graph querying methods in the literature. We first give a general definition of the graph search task. We also give definition about related concepts to the graph search such as the graph matching problem. Furthermore, we highlight and give a taxonomy of the most important works and methods related to the graph search task.

Chapter organization. The chapter is organized as follows: Section 1 introduces and defines the graph search problem and gives a classification of used approaches, also, it further defines the problem by giving definitions of key concepts related to the graph search problem. The following sections are dedicated each to define and review related work to each class of graph search approaches.

6.1 Preliminaries

This section is dedicated to introduce and define key concepts related to graph search that will be used throughout the chapter. First, we define the problem of graph search

in its general form and setting, then we give a classification of the methods and approaches used to search graph databases. We introduce and explain each class as well as the underlying concepts.

6.1.1 Graph search

In its general definition, graph search is the task of searching for given information, usually in the form of a query graph, in a large graph, often called the target graph. In other words, let q be a query graph and G a target graph; the graph search task consists in finding all occurrences of q in G. In a wide range of graph querying frameworks, the target is a set of graphs, and the aim is then to find all graphs in the set that contain the query as a subgraph. In such settings, a classical way of querying graph databases is to perform a sequential scan through all the graphs in the database and check subgraph isomorphism between each graph and the query, which is very costly. Hence, most graph querying approaches follow the filtering-and-verification framework, which prunes false candidates from the graph database using an efficient graph indexing technique in the filtering step, and then checks each remaining candidate in the verification step. In the following, we discuss graph indexing as a preprocessing step toward efficient graph querying.

6.1.1.1 Graph Indexing

Graph indexing is a major step for a large class of graph querying methods. It plays an essential role in query processing and is generally used as a first step in filtering-and-verification frameworks. Its main aim is to discard, in the filtering step, false candidate graphs in order to reduce the graph database and thus enable an efficient verification step, where (sub)graph isomorphism testing is performed in reduced time. The idea behind graph indexing is to organize graphs into patterns, each representing a number of graphs. Specifically, given a graph g, a graph index $i_g$ of g is a graph or a structure that fairly represents the initial graph g but with fewer vertices and edges. The interest of using a graph index in the filtering step is to accelerate the detection of false candidate graphs: instead of spending time comparing the whole query with each candidate graph, one can use an index of the query and compare it to the candidate graphs (see Section 6.2).

There is a variety of works in the literature on graph indexing [99-102], and the main challenge in this field is to find effective graph indexes with a lightweight structure that is fast to process.

Moreover, different indexing techniques have been proposed for graph queries. In [103], Shasha et al. proposed a path-based technique termed GraphGrep, whose idea is to enumerate, from each graph, all paths up to a threshold length. In GraphGrep, the index table is constructed so that each row stands for a path and each column for a graph; each entry in the index table is the number of occurrences of the path in the graph.
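To make the indexing idea concrete, here is a minimal Python sketch of a GraphGrep-style path index: label paths up to a threshold length are enumerated per graph, and a graph survives filtering only if it contains every query path at least as often as the query does. All names and the toy data are illustrative, not GraphGrep's actual implementation.

from collections import defaultdict

def enumerate_label_paths(adj, labels, max_len):
    # Enumerate the label sequences of all simple paths with up to max_len nodes.
    paths = defaultdict(int)
    def dfs(node, visited, seq):
        paths[tuple(seq)] += 1
        if len(seq) < max_len:
            for nxt in adj[node]:
                if nxt not in visited:
                    dfs(nxt, visited | {nxt}, seq + [labels[nxt]])
    for v in adj:
        dfs(v, {v}, [labels[v]])
    return paths

def build_index(graphs, max_len=3):
    # Index table: row = label path, column = graph id, entry = occurrence count.
    index = defaultdict(dict)
    for gid, (adj, labels) in graphs.items():
        for path, count in enumerate_label_paths(adj, labels, max_len).items():
            index[path][gid] = count
    return index

def filter_candidates(index, query, graphs, max_len=3):
    # A graph survives filtering only if it contains every query path often enough.
    q_adj, q_labels = query
    survivors = set(graphs)
    for path, qcount in enumerate_label_paths(q_adj, q_labels, max_len).items():
        survivors = {g for g in survivors if index.get(path, {}).get(g, 0) >= qcount}
    return survivors

# Toy database: two graphs given as (adjacency dict, node labels).
g1 = ({0: [1], 1: [0, 2], 2: [1]}, {0: "A", 1: "B", 2: "C"})   # path A-B-C
g2 = ({0: [1], 1: [0]}, {0: "A", 1: "A"})                      # edge A-A
db = {"g1": g1, "g2": g2}
idx = build_index(db)
q = ({0: [1], 1: [0]}, {0: "A", 1: "B"})                       # query: edge A-B
print(filter_candidates(idx, q, db))                            # {'g1'}

Only g1 survives here, since g2 contains no path with label sequence (A, B); the verification step would then check subgraph isomorphism on the survivors alone.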

In [99], Yan et al. proposed a different indexing alternative that uses frequent patterns as index features. The idea is to reduce the index space as well as to improve the filtering rate. Experimental results in [99] showed that this technique has a smaller index space than GraphGrep's.

Nevertheless, index construction requires an exhaustive enumeration of paths or fragments, with high space and time overhead. Furthermore, since paths or fragments carry little information about a graph, this may lead to a loss of information. They are also inadequate for graphs whose edge or vertex attributes take continuous values, since the index features require an exact match with the query.

Berretti et al. [104] proposed a metric-based index on Attributed Relational Graphs (ARGs) for content-based image retrieval. The main idea is to cluster the graphs hierarchically according to their mutual distances so they can be indexed by M-trees [105]. The query is then routed along the reference graphs of the clusters in a top-down manner.

Besides, [101] proposed the Closure-tree indexing technique, whose goal is to organize the graphs hierarchically, with each node summarizing its descendants by a graph closure. The purpose of using the Closure-tree is to take advantage of its ability to support both subgraph queries and similarity queries: subgraph queries aim to find graphs containing a specific subgraph, whereas similarity queries aim to find graphs similar to a query graph.

6.1.1.2 Graph Matching

Graph matching is the process of finding a mapping between the nodes and edges of two graphs that satisfies some (more or less stringent) constraints, ensuring that similar substructures in one graph correspond to similar substructures in the other.

Graph matching is a crucial process in many applications where structured information is represented by graphs. It aims to compare two objects, or an object and a model to which the object could be related.

In the graph literature, a graph matching between $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, where $V_i$ and $E_i$ are respectively the node set and the edge set, is defined as a mapping $f : V_1 \to V_2$ such that $\forall (u, v) \in E_1$ we have $(f(u), f(v)) \in E_2$.
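As a quick illustration of this definition, the following Python sketch checks whether a candidate mapping f is edge-preserving; the graphs and the mapping are toy examples, not part of the thesis.

def is_edge_preserving(E1, E2, f):
    # Matching condition: for every (u, v) in E1, (f(u), f(v)) must be
    # an edge of E2 (graphs are taken as undirected here).
    edges2 = {frozenset(e) for e in E2}
    return all(frozenset((f[u], f[v])) in edges2 for (u, v) in E1)

# A triangle mapped into a square: the image of edge (0, 2) would be the
# diagonal (a, c), which the square does not have, so the check fails.
E1 = [(0, 1), (1, 2), (0, 2)]
E2 = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
f = {0: "a", 1: "b", 2: "c"}
print(is_edge_preserving(E1, E2, f))  # False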

Graph matching methods are classified into two broad categories. The first category con- cerns exact matching methods, which require a strict and exact correspondence among the two graphs to be associated, or at least within their subparts. The second category represents inexact matching or approximate methods. In this category of methods, a matching can occur even if the two compared graphs are structurally different to some extent.

• Exact graph matching

Exact graph matching methods are characterized by mappings that are edge-preserving for every node of the two graphs: if two nodes in the first graph are linked by an edge, their correspondents in the second graph must also be linked by an edge. In the most stringent form of exact matching, graph isomorphism, this property must hold in both directions and the mapping must be bijective; that is, a one-to-one correspondence must exist between the nodes of the two graphs, so that $|V_1| = |V_2|$, i.e., the two graphs have the same number of nodes.

A weaker form of graph matching is subgraph isomorphism, which requires that an isomorphism holds between one of the two graphs and a node-induced subgraph of the other.

Subgraph isomorphism can also be used in a slightly weaker sense, as in [106], where the condition that the mapping be edge-preserving in both directions is dropped. The resulting matching, also known as monomorphism [107], requires that each node of the first graph has a distinct corresponding node in the second graph and that each edge of the first graph has a corresponding edge in the second graph; however, nothing prevents the second graph from having extra nodes and extra edges.

Finally, another interesting matching variant maps a subgraph of the first graph to an isomorphic subgraph of the second graph. Usually, the goal of the matching algorithm is to find the largest subgraph for which such a mapping exists. This problem is commonly known in the literature as finding the maximum common subgraph (MCS) of the two graphs [108].

It is important to highlight that, though very costly in practice, graph isomorphism has not yet been settled in complexity terms: it is not known whether it is polynomially solvable or NP-complete [108]. For particular graph structures, polynomial isomorphism algorithms have been developed (trees [109], planar graphs [110], bounded-valence graphs [111]); however, no polynomial algorithm is known for the general case.

Therefore, exact graph matching has exponential time complexity in the worst case. Nevertheless, in many applications the computation time can still be acceptable, since the graphs encountered in practice usually differ from the worst cases for these algorithms. Furthermore, node and edge attributes can be used to reduce the search time.

• Inexact or approximate Graph Matching

The stringent conditions imposed by exact matching are in some circumstances too strict for associating two graphs. In many application contexts, the data graphs are subject to deformations due to variation in the data patterns, noise in the data acquisition process, or missing and incomplete information. These deformations create differences between the actual query graph and its ideal answer graph (e.g., $|V_1| < |V_2|$).

Thus, the matching should accommodate such differences by relaxing, to some extent, the constraints defining the matching scheme. Even when such deformations are not expected, robust matching can be useful. Furthermore, exact graph matching algorithms (except for special classes of graphs) require exponential time in the worst case, so it may be more practical to use algorithms that do not guarantee the best solution but at least reach a good approximate solution in reasonable time. These two needs (which can also occur simultaneously) have led to the rise of inexact graph matching algorithms. Usually, in this category of algorithms, the association of two nodes that does not satisfy the edge-preservation requirement is not excluded; instead, it is penalized in the cost of the matching, where the matching cost is a function that also takes other differences into account, such as differences between node and edge attributes. In this case, the algorithm must find a mapping that minimizes the matching cost.

Among approximate graph matching algorithms, optimal inexact matching algorithms try to find the solution achieving the global minimum of the matching cost; in other words, if an exact solution exists, it will be found by the algorithm. These algorithms can be seen as a generalization of exact matching algorithms.

Approximate or suboptimal matching algorithms, on the other hand, only ensure finding a local minimum of the matching cost. Usually the obtained minimum is not far from the global one, but there is no guarantee of reaching the exact solution even if it exists. The suboptimality of the solution is amply compensated by a shorter, usually polynomial, matching time.

Usually, inexact graph matching algorithms formulate the matching cost based on the notion of matching errors, such as missing nodes or edges. Such approximate matching algorithms are known as error-correcting or error-tolerant.

Similarly, the matching cost can be defined through a set of graph edit operations (node insertion, node deletion, etc.) between the two graphs. Since each operation is assigned a cost, the cheapest sequence of operations required to transform one of the two graphs into the other is estimated; the cost of this sequence of operations is called the graph edit cost.

Graph Simulation. An alternative that alleviates the prohibitive cost of graph isomorphism is graph simulation. It can be solved in quadratic time and is less restrictive than graph isomorphism, so it enables the approximate matching that is widely needed in current applications, e.g., plagiarism detection.

We say that a graph G matches a pattern Q, via subgraph simulation, if there exists a binary relation $M \subseteq V_Q \times V$ such that (1) for each $(u, v) \in M$, u and v have the same label, and (2) for each node u in Q there exists v in G such that (a) $(u, v) \in M$ and (b) for each edge $(u, u')$ in Q there exists an edge $(v, v')$ in G such that $(u', v') \in M$.
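A common way to compute such a relation is fixpoint refinement: start with all label-compatible candidates and repeatedly discard candidates that cannot simulate some query edge. The Python sketch below illustrates this reading of the definition for directed adjacency lists; it is an illustrative sketch, not code from the thesis.

def simulation(q_adj, q_labels, g_adj, g_labels):
    # For each query node u, compute the set of graph nodes that simulate it.
    # Returns None when some query node ends up with no candidate.
    sim = {u: {v for v in g_adj if g_labels[v] == q_labels[u]} for u in q_adj}
    changed = True
    while changed:
        changed = False
        for u in q_adj:
            for u2 in q_adj[u]:                       # query edge (u, u2)
                # v survives only if some successor of v simulates u2
                keep = {v for v in sim[u]
                        if any(v2 in sim[u2] for v2 in g_adj[v])}
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True
        if any(not s for s in sim.values()):
            return None
    return sim

# Query: A -> B; target: chain A -> B -> B (directed adjacency lists).
q_adj, q_labels = {0: [1], 1: []}, {0: "A", 1: "B"}
g_adj, g_labels = {0: [1], 1: [2], 2: []}, {0: "A", 1: "B", 2: "B"}
print(simulation(q_adj, q_labels, g_adj, g_labels))  # {0: {0}, 1: {1, 2}}

Note how both B-labeled target nodes simulate the query node 1, which is exactly the flexibility (and the loss of structural precision) discussed next.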

To reduce complexity and meet the needs of novel applications, graph simulation has been adopted for pattern matching. It is less restrictive than subgraph isomorphism and can be determined in quadratic time. Graph simulation and its extensions play a critical role in the analysis of social positions/roles in social networks [112, 113]. Several extensions of graph simulation exist and have cubic time complexity. In [114], the authors propose a new variant of the graph simulation problem, called strong simulation, in order to rectify the loss of structural similarity in simple simulation. Strong simulation adds constraints that improve the quality of the matching while preserving the cubic time complexity of graph simulation extensions.

Graph Homomorphism. A still weaker type of matching is the homomorphism, which drops the condition that nodes of the first graph have to be mapped to distinct nodes of the second graph; hence the correspondence can be many-to-one. Given two node-labeled graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, the graph homomorphism problem is to find a mapping from $V_1$ to $V_2$ such that each node in $V_1$ is mapped to a node in $V_2$ with the same label and each edge in $E_1$ is mapped to an edge in $E_2$. Graph homomorphism has been revisited for graph matching: identical label matching is overkill, and edge-to-edge mappings only allow strikingly similar graphs to be matched, so these conventional notions are often too restrictive in emerging applications. In a nutshell, graph matching is to decide whether a graph G matches another graph $G_p$, although the two are not necessarily identical.

6.2 Filtering and verification methods

Owing to the prohibitive cost of sequentially scanning the database, a large class of graph querying methods follows the filtering-and-verification framework [99, 103, 115, 116], whose key idea is to reduce the search space, i.e., the graph database, by quickly discarding all negative candidates based on a graph indexing technique. The graph database is thus reduced, and the verification step then performs subgraph isomorphism checks on the graphs remaining after the filtering step. These methods generally depend on the indexing technique used: an inefficient index may not filter the database enough, resulting in a costly verification step since the number of graphs in the database does not decrease considerably.

In the following, we cite some of the well-known filtering and verification techniques.

GraphGrep [103] is a basic filtering-and-verification method that uses a path-based index structure, in which all paths in the graph database up to a threshold length are enumerated and indexed. The path-based index is then used to select candidate graphs. Although this method shows good performance, it suffers from a considerable loss of structural information, which often leads to skewed results.

To alleviate this issue, Yan et al. proposed gIndex [99], which has a graph-feature-based index, a more elaborate index structure aiming to improve pruning capabilities. Although gIndex outperforms GraphGrep, it incurs a higher computational cost due to the more complicated structure of the index.

Alternatively, Zhang et al. proposed TreePi [115], which uses a tree-feature-based index. Trees are easier to handle than general graphs and incur a lower computational cost, while still capturing much of the structural information of the original graphs in the database. Zhao et al. proposed Tree+Δ [116], which extends TreePi by adding some frequent discriminative graph features to the tree-based index in order to enhance its pruning power.

While the above-mentioned methods treat the relationship between index quality and runtime cost as a tradeoff, Gouda et al. proposed an efficient feature-based indexing approach that improves the pruning capabilities of the index by compressing multiple features into one feature [117]. Their feature construction technique yields effective indexing performance along with an efficient runtime cost.

Interested readers are directed to [100, 118, 119] and references therein for a survey on graph indexing methods and other filtering and verification graph querying methods.

6.3 Graph querying methods based on graph matching

Graph querying is a crucial task, as most current datasets are stored and exploited as graphs. Many graph querying techniques exist in the literature, and our proposed framework is closely related to aggregated search and to approximate (sub)graph matching, the latter being an elementary operation in our matching framework. In this section, we therefore provide a brief description of graph querying methods based on (sub)graph matching and on aggregated search in graphs, with a highlight on three graph querying methods: SAGA [120], NeMa [121] and BLINKS [122].

6.3.1 (Sub)graph matching

Subgraph matching is a well-studied problem with a rich literature. Two main categories fall under the subgraph matching problem: exact and inexact subgraph matching. Exact subgraph matching finds exact answers to the query via graph isomorphism [119, 123]. Approaches in this group are criticized for their intractability [124], their prohibitive cost, and for being restrictive w.r.t. the number of answers obtained for the query.

On the other hand, inexact graph matching allows slight differences between the query and the matches, which suits actual needs well since graph datasets are usually noisy and relevant answers can be found through approximate matching. Under the category of inexact (sub)graph matching we find subgraph matching based on graph simulation [112, 114], which can be determined in quadratic time and is defined as a relation between the query nodes and the target nodes. However, it suffers a considerable loss of structural similarity, which makes it unsuitable for many applications, e.g., in bioinformatics.

Another variant of inexact graph matching is graph homomorphism [125]. The idea is to find mappings of the query such that the node label difference falls under a threshold, while query edges are mapped to paths of a given maximum length.

6.3.2 Graph matching based on similarity metrics

There is another class of (sub)graph matching based on graph similarity metrics [126], where one aims to measure and quantify the similarity between two graphs, an important task in many applications such as object identification in image processing. A graph similarity metric can be given or computed in different ways [127], e.g., by error-correcting graph matching, also known as graph edit distance, where the edit distance is defined as the number of operations needed to transform one graph into another; these operations include node/edge insertion, deletion and relabelling. A graph similarity metric can also be based on the maximum/minimum common subgraph (MCS), where graph similarity is proportional to the size of the maximum common subgraph, i.e., the bigger the MCS, the more similar the two graphs.

6.3.2.1 SAGA: a subgraph matching tool for biological graphs

Tian et al. proposed a tool for approximate graph matching called SAGA [120]. The approach allows approximate matching by using a distance measure for similarity computation, and by breaking the query into subgraphs and finding small hits that are assembled to form a match.

SAGA relies on an index-based algorithm to accelerate the exploration of graphs in the database. The algorithm proceeds in three steps: the query is first divided into several fragments and the index is probed; then, the subgraphs found using the index are assembled to form larger matches to the query; finally, the obtained matches are filtered to produce the final results.

1. Distance measure for subgraph matching

SAGA uses a distance value to measure graph similarity, such that similar graphs have a smaller distance. Consider two graphs $G_1 = (V_1, E_1, o_1)$ and $G_2 = (V_2, E_2, o_2)$, where $V_i$ is the node set, $E_i$ the edge set, and $o_i$ a labeling function that associates labels to nodes. Let $\lambda$ be the matching between the two graphs $G_1$ and $G_2$, where $\lambda$ is a bijection $\lambda : V_1' \leftrightarrow V_2'$ with $V_1' \subseteq V_1$ and $V_2' \subseteq V_2$. The subgraph distance, denoted SGD, with respect to $\lambda$, is as follows:

$$SGD_\lambda(G_1, G_2) = w_e \times StructDist_\lambda + w_n \times NodeMismatches_\lambda + w_g \times NodeGaps_\lambda \qquad (6.1)$$

In Equation 6.1, $w_e$, $w_n$ and $w_g$ are the weights of the components StructDist, NodeMismatches and NodeGaps; they are used to tune the emphasis on the different components of the similarity distance measure. We detail each component of the distance measure of Equation 6.1 in the following.

The "StructDist" component. The StructDist component estimates the structural differences for the matched node pairs in the two graphs $G_1$ and $G_2$:

$$StructDist_\lambda = \sum_{u, v \in V_1',\, u < v} \left| d_{G_1}(u, v) - d_{G_2}(\lambda(u), \lambda(v)) \right| \qquad (6.2)$$

The distance between nodes u and v of $G_1$ is given by the length of the shortest path between them. The StructDist component compares the distance between each pair of matched nodes $(u, v)$ in one graph with the distance between the corresponding nodes $(\lambda(u), \lambda(v))$ in the other graph, and adds up the differences.

The "NodeMismatches" component.

$$NodeMismatches_\lambda = \sum_{u \in V_1'} mismatch(o_1(u), o_2(\lambda(u))) \qquad (6.3)$$

In Equation 6.3, the mismatch() function gives a penalty for nodes with different labels. Hence, the NodeMismatches component, with regard to the matching $\lambda$, accumulates all the penalties incurred by matching nodes with different labels.

The "NodeGaps" component. This component computes the penalties associated with the gap nodes in the query graph, i.e., nodes that were not matched to any node in the target graph $G_2$. The $gap_{G_1}(u)$ function in Equation 6.4 associates a penalty with each gap node of the query graph $G_1$, and the NodeGaps component sums the penalties of all gap nodes:

$$NodeGaps_\lambda = \sum_{u \in V_1 - V_1'} gap_{G_1}(u) \qquad (6.4)$$

Each of the aforementioned components is minimal when the two graphs $G_1$ and $G_2$ are similar; hence the subgraph distance SGD is minimal when the two graphs are similar.

2. The index-based matching algorithm

SAGA uses an index-based algorithm to explore the target graph database quickly, as sequentially checking the query against each graph in the database is extremely expensive. The algorithm works as follows: an index is computed over small substructures of the graphs in the database; fragments of the query are then matched to fragments in the database using the index; in the final step, the obtained fragments are assembled to form larger matches.

The index structures. The index used is referred to as the FragmentIndex. The FragmentIndex is built over sets of x nodes from the database graphs, where x is a user-fixed parameter, usually a small number ranging from 2 to 4. Each x-node set is formed by randomly selecting nodes from graphs in the database, and the distance between a pair of nodes decides whether the nodes are added to the index: a parameter $d_{max}$ is used such that, when the distance between two nodes u and v is less than $d_{max}$, nodes u and v are added to the FragmentIndex and connected by a pseudo-edge. Finally, a fragment is valid when it is connected by the pseudo-edges. The reason for using pseudo-edges in the index is to allow gap nodes to be part of the node selection.

Items in the FragmentIndex have the following structure: [nodeSeq, groupSeq, distSeq, sumDist, gid], where nodeSeq is a series of node IDs for the fragment nodes, groupSeq is the list of group labels linked to the nodes, distSeq is the sequence of distances between fragment nodes, sumDist is the sum of these pairwise distances, and gid denotes a unique graph ID. Another index, called the DistanceIndex, is additionally used to efficiently estimate the subgraph distance between a query graph and a database graph; its purpose is to look up the distance between any pair of nodes in a graph.

The matching algorithm. Once the index is computed, the matching algorithm works as follows: the query is first split into small subgraphs; then the hits from the index probes are combined into larger candidate matches; finally, each candidate is checked to produce the final results. In the following, we give further details about the algorithm.

• Finding small hits: In this first step, the query is divided into small fragments, and the FragmentIndex is probed to find fragments in the database that match or are similar to the query fragments. The query is divided into fragments by enumerating all k-node sets, as was done for the database graphs when building the FragmentIndex. Then, for each query fragment, the groupSeq, nodeSeq, sumDist and distSeq values are calculated. Next, each query fragment is examined against the FragmentIndex through a multi-level filtering strategy. First, the groupSeq and sumDist values are used to filter out non-matching fragments; then, the candidate list is further filtered using the distSeq values. In the first filtering, database fragments are selected only if they have the same groupSeq as the query fragment and their sumDist values lie within the bounds $f_q.sumDist \pm \frac{k(k-1)}{2} \times \frac{MaxPairDist}{w_e}$, where $f_q$ is the query fragment and k the fragment size. MaxPairDist is a user-fixed parameter that restricts the weighted pairwise distance difference between the query and the database fragments as follows: $w_e \times |d_{G_1}(u, v) - d_{G_2}(\lambda(u), \lambda(v))| \le MaxPairDist$. At this point, a list of candidate database fragments has been generated for every query fragment. This list can be filtered even further by using the distSeq information (which contains the pairwise distances) to verify that all pairwise distances satisfy the MaxPairDist constraint.

• Assembling small fragments: The candidate fragments produced in the previous step are grouped into bigger matches as follows. First, the fragments are classified by database graph ID. Next, a fragment-compatibility graph is built for each matching graph, where each node of the fragment-compatibility graph represents a pair of matching database and query fragments. Two nodes in the fragment-compatibility graph are connected by an edge iff the two query fragments share no nodes, or share nodes and the corresponding database fragments also share the same nodes. An edge connecting two nodes indicates that the two fragments concerned can be joined to form a larger fragment, since there is no conflict in their union. Therefore, a clique in the fragment-compatibility graph represents a set of fragments that can be joined. Once the fragment-compatibility graph is formed, the fragment assembly problem reduces to maximal clique detection, and the set of fragments in each maximal clique is considered a candidate match.

• Examining candidates: In this step, each candidate match is examined in order to produce the set of final matches. A parameter $P_g$, defined by the user, denotes the maximum percentage of gap nodes allowed in a match. Once $P_g$ is fixed, the matches having a percentage of gap nodes greater than $P_g$ are eliminated. For the remaining candidate matches, the DistanceIndex is probed and the subgraph matching distance SGD defined above is computed; the matching chosen is the one that minimizes the SGD distance.

Although the SAGA method is based on an efficient index to speed up processing, it ultimately reduces to solving the maximum clique problem, which is a costly operation.

6.3.2.2 NEMA

NeMa (Network Match), proposed by Khan et al. in [121], is a subgraph matching technique based on neighborhood information for querying real-life network datasets.

NeMa consists in finding the (top-k) matches of the query graph with minimum cost w.r.t. the target graph. The cost function used is a metric that estimates the similarity between the query graph and the target graph and hence should be minimized. NeMa is a heuristic based on an inference model for solving the underlying problem. In the following, we describe the cost metric used in NeMa, along with the heuristic and the inference model.

1. Cost metric

Before delving into the definition of the cost metric used in NeMa, let us first introduce the notation used throughout this section.

Used notations. A target graph usually represents a network dataset and is defined as a labeled undirected graph $G = (V, E, L)$, where V denotes the node set, E the edge set, and L a labeling function that assigns labels to nodes. A query graph is an undirected labeled graph $Q = (V_Q, E_Q, L_Q)$, where $V_Q$ is the node set, $E_Q$ the edge set, and $L_Q$ a labeling function that associates each query node v with a label $L_Q(v)$. Given a query graph Q and a target graph G, the set M(v) of a query node v is the set of candidate target nodes, i.e., the target nodes that are likely to match query node v. Formally, a target node u is considered a candidate for node v ($u \in M(v)$) only if the difference between their respective labels falls under a given threshold. A subgraph matching function $\phi$ is defined as a many-to-one function $\phi : V_Q \to V$ such that $\forall v \in V_Q, \phi(v) \in M(v)$.

Neighborhood vectorization. Given a node u in a graph G, the neighborhood of u is represented by a neighborhood vector $R_G(u) = \{\langle u', P_G(u, u') \rangle\}$, where $u'$ is a neighbor of u within h hops, and $P_G(u, u')$ denotes the proximity of $u'$ to u in G. $P_G(u, u')$ is defined as follows:

$$P_G(u, u') = \begin{cases} \alpha^{d(u, u')} & \text{if } d(u, u') \le h \\ 0 & \text{otherwise} \end{cases} \qquad (6.5)$$

where $\alpha$ is a propagation factor between 0 and 1, $h > 0$ is the number of hops of the neighborhood used for vectorization, and $d(u, u')$ is the distance between nodes u and $u'$. The neighborhood vector encodes the distance information about node u and its h-hop neighbors.
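A minimal Python sketch of this vectorization step: BFS up to h hops around u, then proximity $\alpha^{d(u, u')}$ for each reached neighbor. The parameter values and the toy graph are illustrative assumptions.

from collections import deque

def neighborhood_vector(adj, u, alpha=0.5, h=2):
    # R_G(u): proximity alpha**d(u, u') for every u' within h hops of u (Eq. 6.5).
    dist, queue = {u: 0}, deque([u])
    while queue:
        x = queue.popleft()
        if dist[x] == h:            # do not expand beyond h hops
            continue
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return {v: alpha ** d for v, d in dist.items() if 0 < d <= h}

# Path graph 0-1-2-3: from node 0, node 1 is one hop away, node 2 two hops.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(neighborhood_vector(adj, 0))  # {1: 0.5, 2: 0.25}; node 3 is beyond h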

Neighborhood cost. The neighborhood cost is defined from the neighborhood vectors and corresponds to the cost of matching the neighborhoods of a query node and a target node. Let N(v) denote the neighbor nodes of v within h hops, and consider a matching function $\phi$. The neighborhood matching cost between v and $u = \phi(v)$, denoted $N_\phi(v, u)$, is defined as:

$$N_\phi(v, u) = \frac{\sum_{v' \in N(v)} dig\big(P_Q(v, v'), P_G(u, \phi(v'))\big)}{\sum_{v' \in N(v)} P_Q(v, v')} \qquad (6.6)$$

where the function dig is defined as:

$$dig(x, y) = \begin{cases} x - y & \text{if } x > y \\ 0 & \text{otherwise} \end{cases} \qquad (6.7)$$

A function $\delta_l$ that computes the label difference is considered; it is assumed to range between 0 and 1. The matching cost of each node under a matching function $\phi$ is then defined as the aggregation of the label difference function and the neighborhood matching cost function. The matching cost for individual nodes v and $u = \phi(v)$ is defined as follows:

$$F_\phi(v, \phi(v)) = \Lambda \cdot \delta_l(L_Q(v), L(u)) + (1 - \Lambda) \cdot N_\phi(v, u) \qquad (6.8)$$

where $\Lambda$ is a parameter in [0, 1].

The subgraph matching cost. The subgraph matching cost function sums the individual matching costs of all query nodes and is defined as follows:

$$C(\phi) = \sum_{v \in V_Q} F_\phi(v, \phi(v)) \qquad (6.9)$$
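Putting Equations 6.6 to 6.9 together, the Python sketch below computes the node matching costs and the total cost $C(\phi)$ for a toy matching. The proximity vectors are hard-coded here, whereas in NeMa they come from the vectorization above, and the 0/1 label difference is an illustrative simplification of $\delta_l$.

def dig(x, y):
    # Penalize only when query proximity exceeds target proximity (Eq. 6.7).
    return x - y if x > y else 0.0

def neighborhood_cost(v, phi, P_Q, P_G):
    # N_phi(v, phi(v)) as in Eq. 6.6, over v's h-hop query neighbors.
    num = sum(dig(P_Q[v][v2], P_G[phi[v]].get(phi[v2], 0.0)) for v2 in P_Q[v])
    den = sum(P_Q[v].values())
    return num / den if den else 0.0

def matching_cost(phi, q_labels, g_labels, P_Q, P_G, lam=0.5):
    # C(phi) = sum over query nodes of F_phi (Eqs. 6.8 and 6.9);
    # the label difference is a crude 0/1 stand-in for delta_l.
    total = 0.0
    for v in phi:
        delta_l = 0.0 if q_labels[v] == g_labels[phi[v]] else 1.0
        total += lam * delta_l + (1 - lam) * neighborhood_cost(v, phi, P_Q, P_G)
    return total

# Two-node query a-b mapped onto target nodes x-y; proximity vectors hard-coded.
q_labels, g_labels = {"a": "A", "b": "B"}, {"x": "A", "y": "B"}
P_Q = {"a": {"b": 0.5}, "b": {"a": 0.5}}
P_G = {"x": {"y": 0.5}, "y": {"x": 0.5}}
print(matching_cost({"a": "x", "b": "y"}, q_labels, g_labels, P_Q, P_G))  # 0.0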

The minimum cost subgraph matching problem. The underlying problem that NeMa aims to solve is called the minimum cost subgraph matching (MCSM) problem. MCSM consists in finding a matching $\phi$ that incurs the lowest subgraph matching cost $C(\phi)$. The MCSM problem was proved to be APX-hard. In other words, the MCSM problem seeks the matching $\Phi$ given in Equation 6.10:

$$\Phi = \operatorname*{argmin}_{\phi} C(\phi) \qquad (6.10)$$

2. Heuristic algorithm

NeMa is based on an inference model, referred to as NemaInfer, which computes minimum-cost matchings by analogy with the max-sum inference problem. In a nutshell, the max-sum inference problem seeks values of variables $x_1, x_2, \ldots, x_M$ that maximize $p(X) = \prod_i f_i(X_i)$, or equivalently $\log p(X) = \sum_i \log f_i(X_i)$, where $p(X)$ is a joint probability distribution over the set of variables $X = \{x_1, x_2, \ldots, x_M\}$. Since the goal of the minimum cost subgraph matching problem is to minimize the overall subgraph matching cost $C(\phi)$, the similarity with the max-sum inference problem is apparent.

• NemaInfer algorithm: Consider a query graph Q and a target graph G. First, NemaInfer computes the set of candidates for each query node using the node label similarity function. An inference cost $U_0(v, u)$ is then initialized with the minimum possible value of the node matching cost $F_\phi(v, u)$ over all possible matching functions $\phi$ with $\phi(v) = u$. Next, the algorithm computes the inference cost for each query node v and its candidate matching nodes, and picks the optimal match of v as the candidate u with the minimum inference cost. Note that NemaInfer records the optimal matches for each query node. The process iterates until it reaches a fixpoint, i.e., until the optimal matches of a fixed number of query nodes stay the same over two successive iterations.

• Inference cost and optimal match: In each iteration, NemaInfer improves the quality of the matching based on the joint notions of inference cost and optimal match. The inference cost, computed at every iteration i of NemaInfer for each $v \in V_Q$ and $u \in M(v)$ (M(v) being the candidate node set of v), is denoted $U_i(v, u)$ and defined as follows:

$$U_0(v, u) = \min_{\{\phi : \phi(v) = u\}} F_\phi(v, u) \qquad (6.11)$$

$$U_i(v, u) = \min_{\{\phi : \phi(v) = u\}} \Big[ F_\phi(v, u) + \sum_{v' \in N(v)} U_{i-1}(v', \phi(v')) \Big] \qquad (6.12)$$

In Equation 6.12, we suppose that $i > 0$. The inference cost is thus the minimum, over all matching functions $\phi$ with $\phi(v) = u$, of the node matching cost $F_\phi(v, u)$ plus the previous iteration's inference costs $U_{i-1}(v', \phi(v'))$ for all neighbors $v'$ of v.

Optimal match. At every iteration, each query node has a match called the optimal match. We define the optimal match of a node v at the i-th iteration, denoted $O_i(v)$, as follows:

$$O_i(v) = \operatorname*{argmin}_{u \in M(v)} U_i(v, u), \quad i \ge 0 \qquad (6.13)$$

NeMa proves to be an efficient subgraph matching tool based on a graph matching cost metric that considers node label similarity; however, it does not address the problem of multi-target graphs. Moreover, some subroutines of NeMa can be very expensive, such as the neighborhood vectorization process, the neighborhood matching cost, and the inference-based query processing algorithm NemaInfer.

On the other hand, the notion of h-hop neighbors considered in NeMa is, unless h = 1, of limited relevance and mostly adds complexity and computational expense, especially for values of h greater than 1, although it does allow a certain form of generalization in the overall method.

It is also worth mentioning the parallel line of work on approximate graph matching in bioinformatics. This category of methods includes PathBlast [128], NetAlign [129] and IsoRank [130], to name but a few. These methods are efficient in tasks of their application domain, such as protein interaction network alignment [128, 129]. Besides, a different type of approximate structural matching has been proposed in [131, 132]; these works are based on concept propagation [133] and spreading activation [134] instead of classical matching schemes (graph isomorphism and similarity metrics). Nevertheless, they require strict node label matching, which is still too restrictive.

Finally, approximate and inexact (sub)graph matching encompasses decades of research work. Further details about the problem can be found in [135] and references therein.

6.4 Querying semi-structured data (RDF graphs)

In the realm of querying RDF graphs, SPARQL is a widely used query language. It requires complete knowledge of the schema of the graph database, i.e., the structure, node labels and entity types of the graph. Moreover, writing SPARQL queries proves to be a complicated task that requires users to be familiar with the language. To alleviate this, Zou et al. studied the problem of answering SPARQL queries via subgraph matching and proposed gStore [136], which allows approximate node label matching but adheres to strict structural matching, making the method restrictive.

6.5 Aggregated search in graph databases.

Aggregated search is a familiar search paradigm in Information Retrieval in the context of documents [82]. However, few works have tackled the problem of aggregated search in graph databases. Elghazel et al. [137, 138] studied the problem of aggregated search in labeled graphs. Their method looks for exact matches, which is restrictive on the one hand and, on the other hand, very expensive due to the maximum common subgraph search and maximum clique detection tasks.

6.6 A taxonomy on graph search methods and discussion

To give a synoptic view of work on the graph search task and associated problems, we propose a taxonomy of graph search methods. We divide graph search methods into four classes: filtering-and-verification frameworks, (semi)structured data querying, keyword graph search, and the most important class, graph matching methods. Graph matching splits into two sub-categories: exact matching, also known as the graph isomorphism problem, and inexact graph matching, which is the most suitable for actual graph search applications since the datasets are huge. Figure 6.1 depicts the proposed taxonomy organized as a hierarchy.

Figure 6.1: A taxonomy of Graph Search methods.

Most of the aforementioned works (described in the previous sections) suffer from several critical shortcomings: (i) they provide a restrictive tool for query answering, imposing either strict node label similarity or strict structural similarity, which prevents approximate yet relevant answers from being discovered; (ii) graph matching is usually done by computing the maximum common subgraph, which is a computationally costly task [139]; (iii) methods for querying semi-structured data require complete knowledge of the graph schema and are often complicated; (iv) few works consider answering a query with several graphs jointly, and those that do seek exact matches, which is still restrictive. To address these shortcomings, we propose a new graph matching framework that answers a query by allowing approximate label and structural similarity through a lightweight similarity metric, thereby avoiding the maximum common subgraph search. Moreover, our proposed framework enables efficient RDF graph querying without any knowledge of the graph schema. Last but not least, our framework is based on aggregated search, enabling a query to be answered jointly by several heterogeneous graphs in order to benefit from the wealth of information within these graphs.

6.7 Chapter summary

In this chapter, we tackled the problem of graph search, also known as graph querying. We first defined the process of graph search as well as the underlying concepts, such as graph matching, graph isomorphism, graph simulation and graph homomorphism. We also proposed a taxonomy of graph search methods, as depicted in Figure 6.1.

As the main focus of this part of the thesis is aggregated search in graph databases, the method we propose in Chapter 7, named LaSaS (Label and Structure similarity using aggregated Search), falls under the category of approximate graph matching methods based on similarity metrics.

Chapter 7

LaSaS: A new Approximate Graph Matching framework based on Aggregated Search

Querying graph datasets proves to be an important task, as most real-life datasets are represented as networks of labeled entities. Moreover, the noisy nature of these graph datasets makes approximate graph matching tools highly desirable in order to avoid overly restrictive query answering. In this chapter, we introduce a new framework for graph querying based on aggregated search, called Label and Structure Similarity Aggregated Search (LaSaS). First, we give the main motivation behind the use of the aggregated search paradigm in graph databases, and explain how aggregated search in graphs benefits the querying process compared to the classical graph querying setting. We then present our aggregated search graph matching framework in three main sections: we introduce and define the theoretical terms and concepts used in the framework; we present the query processing algorithm and explain all the underlying routines and steps, from query acquisition to result output; and we finally give an extensive experimental evaluation of the proposed method, exploring its performance under a variety of settings and comparing it to related state-of-the-art works. Results from our experimental evaluation confirm the effectiveness and stability of the proposed method over different parameter settings. The results also indicate that LaSaS significantly improves over related state-of-the-art approaches by finding more precise matches in a shorter amount of time. Finally, in the last section, we give a chapter summary and outline several perspectives and future work directions.


7.1 Graph search using Aggregated Search

Recent years have witnessed a sheer increase in the data generated by numerous applications. This data proliferation favors the use of graphs as a storage support that fully exploits the knowledge within a dataset. Indeed, graphs are a popular data model that enables efficient data processing, benefiting from the findings and techniques of graph theory. In this light, graph querying, or graph search, has attracted the interest of many researchers, as it is an essential task for exploring the knowledge in these datasets.

To query these graph databases, a graph matching task is generally performed using either graph isomorphism, to find exact answers to the query, or other techniques that allow approximate graph matching. In the case of graph isomorphism, seeking exact answers to the query can be very expensive as well as restrictive, since actual datasets are usually noisy; moreover, it requires the user to have complete knowledge of the data structure, which is not always the case. To bypass these restrictions, approximate graph matching is widely used in many real-life applications such as web anomaly detection [140], search result classification [141] and spam detection [142], to name a few.

However, few works on graph querying have addressed the graph matching problem by building an answer to the query from an aggregation of heterogeneous graphs. The idea is to find a matching for a given query by combining several subgraphs that, when aggregated, form an approximate match for the query. This search paradigm is known as aggregated search in graphs [137, 138]: given a query graph q and a set B of graphs, aggregated search aims to find matches for q by combining, or aggregating, subgraphs of B. Aggregated search differs from the classical graph matching problem, where all occurrences of the query are to be found in a single target graph G. In the following, we first define aggregated search in graph databases, then give the key motivation behind the aggregated search paradigm in graphs.

7.1.1 Aggregated search in graph databases: Definition

Aggregated search is a recent search paradigm that was first introduced in the field of Information Retrieval [81]. In a nutshell, aggregated search aims at building a result for a query from several complementary information shards, where an information shard is an elementary piece of information found in a source or document: a paragraph, an image, a video section, etc. Aggregated search is mainly distinguished by its ability to search at any granularity level of information, and to merge these information shards into a coherent final result. In the realm of graph datasets, aggregated search still refers to the process that retrieves information at deeper granularity levels and merges it to build a result; however, there are some particularities, since the data is in a format and context different from IR. Aggregated Graph Search (AGS) is defined as follows:

Given a query graph q and a set of fragment graphs B, where each fragment $f_i$ possibly has a common subgraph with q, the aim of the aggregated graph search problem is to build a graph called the aggregate a that matches q, through exact matching (graph isomorphism) or inexact matching, where a is the union of several fragments $f_i$. In other words, aggregated graph search builds an aggregate from complementary fragments that are all relevant to the query, such that the final aggregate matches the query exactly or approximately. The aggregated graph search process is further detailed and explained in the following sections; an illustration of the process is given in Figure 7.1.

Figure 7.1: Aggregated Graph Search vs. Classical Graph Search.

Figure 7.1 depicts the difference between aggregated graph search and classical graph search, e.g., graph isomorphism. One can clearly see that with aggregated graph search we can still find matches for a given query, whereas traditional graph search methods find no results. This is because traditional graph search is query-wise, i.e., each graph or fragment either contains the query or does not, while in aggregated graph search each fragment is explored for substructures of the query.

7.1.2 Aggregated search in graph databases: Motivation

Graph search methods cover a broad range of approaches, from exact algorithms to heuristics. Since the graph model is a very efficient data storage model, it is widely used in almost all application domains (social networks, the world wide web, user-generated data, institutional data, etc.), so methods for querying these graphs, in order to explore the knowledge within them and perform processing over the data, are necessary. A commonly known graph querying concept is the graph matching problem, where a query graph is to be matched against several graphs from the target graph database. The graph matching problem generally involves searching for the whole query through subgraph matching or graph isomorphism, which are high-cost operations (NP-complete [143]) on graphs. This compels research to be directed toward heuristic solutions for computing a graph matching, and several problems have been defined in the light of a heuristic approach: in inexact/approximate graph matching, a query graph matches some graphs under a fixed threshold and does not have to be fully isomorphic to them. The relevance and usefulness of each version of the graph matching problem (exact or approximate) depend on the application domain and the context in which it is used. For example, in biology, exact matching schemes are usually preferred due to the high precision required by the domain; on the contrary, approximate matching schemes are favored for querying big graph datasets, as in social network analysis (graph pattern mining, etc.) [121].

On the other hand, few works have tackled the problem of aggregated search in the graph context [137, 138]. Aggregated search in the context of graphs can be seen as a graph matching task, in that it takes a query graph and tries to find corresponding matches; and similarly to aggregated search in the information retrieval context, aggregated graph search finds a matching for a query graph by building a match from several subgraphs called fragments. While the usual graph matching setting seeks an answer to the query as a whole, aggregated graph search mines query subgraphs and then builds an answer that matches the query. The motivation behind aggregated graph search is twofold. First, it is a great tool for supplementing graph query processing frameworks, finding more answers by building matches from graph fragments that each match a query subgraph. Second, aggregated graph search is very useful for discovering scattered patterns in distributed or distinct graph datasets: for example, in plagiarism detection, aggregated graph search can detect cases where several parts (in text format) were copied from different sources and combined into a refined, original-looking document.

7.2 Preliminaries

This section introduces the notions used in our proposed method. First, we give the formal definitions of the data structures used in the rest of the chapter: the query, the target graphs and the answer set. Then we introduce a new graph similarity metric and the objective function used in our proposed matching scheme. Finally, the section ends with the formulation of the aggregated graph search problem.

7.2.1 Query, Target graph, Answer

Query. The query q is an undirected labeled graph $q = (V_q, E_q, \mu_q)$, where $V_q$ is the vertex set and $E_q$ the edge set, while $\mu_q : V_q \to L_{V_q}$ is a function associating labels to vertices, with $L_{V_q}$ the vertex label set.

Target graphs. The target graphs are represented by a set $B = \{f_1, \ldots, f_p\}$ of p graphs referred to as fragments $f_i$, $i \in [p]$. Each fragment $f_i = (V_i, E_i, \mu_i)$ is an undirected labeled graph such that $V_i$ (resp. $E_i$) is the vertex set (resp. edge set) of $f_i$, and $\mu_i : V_i \to L_{V_i}$ is a function associating labels to vertices, with $L_{V_i}$ the vertex label set of $f_i$.

Answer set. The expected answer set A is defined as $A = \{a_1, \ldots, a_k\}$, where the $a_i$ are called aggregates and k is the number of aggregates. We have $a_i = \bigcup_{j=1}^{p_i} f_j$, where $p_i$ is the number of fragments in $a_i$ and $p_i \le p$; i.e., $a_i$ is constituted by as few fragments as possible.

Though our method focuses on undirected labeled graphs, it is straightforward to extend it to process other kinds of graphs.

7.2.2 Objective function

Graph similarity metric. We first introduce a graph similarity metric that computes the similarity between two graphs based on two main features: (i) node label similarity and (ii) structure similarity. The similarity function is denoted $s(f_i, q)$, where $s : B \times \{q\} \to [0, 1]$ quantifies the similarity that $f_i$ shares with q in terms of label and structure, such that the closer s is to 1, the more similar $f_i$ and q are. In the following, we present the two components of the similarity metric s: the label similarity and structure similarity components.

7.2.2.1 Label similarity component

The label similarity between fragment $f_i$ and query q is denoted $\Delta_L(f_i, q)$ and is computed as follows:

$$\Delta_L(f_i, q) = \frac{1}{n_i} \sum_{v \in V_{f_i}} J\big(\mu(v), \mu(Q(v))\big) \qquad (7.1)$$

where $n_i$ is the number of vertices in $f_i$, and J is the Jaccard similarity coefficient, $J(l_1, l_2) = \frac{|W_{l_1} \cap W_{l_2}|}{|W_{l_1} \cup W_{l_2}|}$, with $W_{l_1}$ (resp. $W_{l_2}$) the set of words in label $l_1$ (resp. $l_2$). Q is a mapping $Q : V_i \to V_q$ that associates vertices of the fragment with their counterparts in the query, with $Q(v) = u$ iff $J(\mu_i(v), \mu_q(u)) > \tau$, where $\tau$ is a user-fixed threshold.

7.2.2.2 Structure similarity component

On the other hand, $\Delta_D(f_i, q)$ denotes the structural similarity between fragment $f_i$ and query q, and is defined as follows:

$$\Delta_D(f_i, q) = \frac{1}{n_i} \sum_{u, v \in V_{f_i},\, u < v} \delta_t\big(d(u, v), d(Q(u), Q(v))\big) \qquad (7.2)$$

Taking both components into account, we define the similarity function s as follows:

$$s(f_i, q) = \lambda \cdot \Delta_L(f_i, q) + (1 - \lambda) \cdot \Delta_D(f_i, q) \qquad (7.3)$$

where $\lambda$ is a tunable parameter in [0, 1].
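The Python sketch below illustrates the metric of Equations 7.1 to 7.3 on toy data: a word-level Jaccard for $\Delta_L$ and, standing in for $\delta_t$, the indicator that the two distances differ by at most t. The choice of $\delta_t$ as an indicator and the assumption of connected fragments are illustrative; the helper names are not from the thesis.

from collections import deque
from itertools import combinations

def jaccard(l1, l2):
    # Word-set Jaccard coefficient between two labels (Eq. 7.1).
    w1, w2 = set(l1.split()), set(l2.split())
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def bfs_dist(adj, src):
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def similarity(frag, query, Q, lam=0.5, t=0):
    # s(f, q) = lam * Delta_L + (1 - lam) * Delta_D (Eqs. 7.1-7.3);
    # Q maps fragment nodes to query nodes; graphs assumed connected.
    (f_adj, f_lab), (q_adj, q_lab) = frag, query
    n = len(f_adj)
    dL = sum(jaccard(f_lab[v], q_lab[Q[v]]) for v in Q) / n
    df = {u: bfs_dist(f_adj, u) for u in Q}
    dq = {u: bfs_dist(q_adj, u) for u in Q.values()}
    dD = sum(abs(df[u][v] - dq[Q[u]][Q[v]]) <= t
             for u, v in combinations(sorted(Q), 2)) / n
    return lam * dL + (1 - lam) * dD

frag = ({0: [1], 1: [0]}, {0: "big data", 1: "graph store"})
query = ({0: [1], 1: [0, 2], 2: [1]}, {0: "big data", 1: "graph store", 2: "index"})
print(similarity(frag, query, {0: 0, 1: 1}))  # 0.75: labels match, distances agree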

Objective function. Let $\eta$ denote the matching that associates the aggregates $a_i$ of the answer set A with the query q (matching query nodes to target nodes). Based on the similarity function s, we define the objective function of the matching $\eta$ as follows:

$$\Psi(\eta) = \frac{1}{k} \sum_{i=1}^{k} s(a_i, q) \qquad (7.4)$$

7.2.3 Problem Formulation

Given a set of fragments B and a query q, the aggregated graph search problem consists in finding the matching $\eta$ that associates an answer set A of k aggregates with q such that $\Psi(\eta) = 1$.

Proposition 7.1. The aggregated graph search problem is NP-Hard.

Proof. Consider the simple case $k = 1$; that is, the expected answer set consists of one aggregate $a_1 = \bigcup_{j=1}^{p_1} f_j$ with $\Psi(\eta) = 1$, i.e., $s(a_1, q) = 1$, where $\eta$ is the matching that associates aggregate $a_1$ with q. By definition, $s(a_1, q) = 1$ means that we are looking for a minimum number of elements of B (see Section 7.2.1) whose union covers q: a label similarity component equal to one means that all nodes are matched and their labels are exactly similar, and a structure similarity component equal to one means that the query and the aggregate have a similar structure. This reduces to solving the NP-Hard set cover problem, where the universe is the query q that we aim to cover with a minimum number of elements of B. This ends the proof.

7.3 Query Processing Algorithm

In this section, we present our query processing algorithm, named LaSaS, a heuristic solution to the aggregated graph search problem. We describe the proposed algorithm and its different steps in detail.

The LaSaS algorithm works in three distinct steps. First, the Selection step, the most important one, selects the most relevant fragments from B based on the similarity metric s (see Section 7.2.2). Second, the Aggregation step combines the relevant fragments found by the Selection step to form the aggregate a. Third, the Refinement step enhances the quality of the aggregate by pruning irrelevant nodes; furthermore, in the case of unmapped edges, this step finds mapping paths of length under a fixed threshold. In the following, we present each step in detail. Note that this process yields one aggregate; in the general case of k aggregates, the process is repeated k times. LaSaS is described in Algorithm 2.

7.3.1 Selection step

The first step consists in selecting the most relevant fragments, as many as necessary to cover the whole query q. The selection is based on successive iterations of two main substeps: (i) similarity checking and (ii) query updating, until a stopping condition is verified. In the similarity checking substep, each fragment $f_i \in B$ is ranked by the similarity metric $s(f_i, q)$ (see Section 7.2.2), and the fragment $f_{max}$ with maximal similarity is selected: $f_{max} = \operatorname*{argmax}_{f_i \in B} s(f_i, q)$. The query is then updated according to the best-ranked fragment $f_{max}$, so that the part of the query covered by $f_{max}$ is withdrawn, under conditions detailed in the following. Once the query is updated, the selection process repeats: the similarity checking and query update substeps are performed until at least one of the stopping conditions is verified: (i) $|V_q| < \epsilon$, meaning that almost all query nodes have been matched, or (ii) the fragment set B is empty, given that a fragment is removed from B once chosen during the selection step.

Algorithm 2 Label and Structure Similarity Aggregated Search
Input: Fragment set B, query graph q, number of expected aggregates k.
Output: Answer set A.
1. i = 1;
2. while i ≤ k do
3.     Select the most relevant fragments from B into S (similarity checking and query update);
4.     Remove each selected fragment from B;
5.     Aggregate all fragments in S to form an aggregate a_i;
6.     Refine a_i:
7.         Prune irrelevant nodes;
8.         If there are unmapped edges in a_i:
9.             Map paths to missing edges;
10.    A = A ∪ {a_i};
11.    i = i + 1;
12.    S = ∅;
13. end while
14. return A;
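As a compact illustration of the selection loop, the Python sketch below greedily picks the best-ranked fragment, removes it from the pool, and updates the query until the query is covered or the pool is empty. The set-based similarity and update functions in the toy run are stand-ins for s and the query update defined in this chapter, not the actual LaSaS subroutines.

def select_fragments(B, q, sim, update, eps=1):
    # Greedy Selection step: rank fragments in B by s(f, q), take the best,
    # withdraw the covered part of q, repeat until q is (almost) covered.
    S, pool = [], dict(B)                 # copy: a fragment leaves B once chosen
    while pool and len(q) >= eps:
        best = max(pool, key=lambda fid: sim(pool[fid], q))
        S.append(pool.pop(best))
        q = update(q, S[-1])              # e.g. MCS removal or QUWU
    return S, q

# Toy run: fragments and the query are plain label sets; similarity is the
# overlap size and the update removes covered labels.
B = {"f1": {"A", "B"}, "f2": {"C"}, "f3": {"B"}}
q = {"A", "B", "C"}
S, rest = select_fragments(B, q, sim=lambda f, q: len(f & q),
                           update=lambda q, f: q - f)
print(S, rest)  # [{'A', 'B'}, {'C'}] set(): two complementary fragments cover q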

Query Update. The foremost role of the query update step is to ensure complementarity among selected fragments. The query is updated by pruning elements covered by the selected fragment, so that subsequent selections choose fragments covering complementary parts of the query. The update can be done in two distinct ways: using either the Maximum Common Subgraph (MCS) or Weight Update (WU).

Query update using MCS. As aforementioned, the query is updated once the best fragment $f_{max}$ is found. First, the MCS between q and $f_{max}$ is computed; then the MCS is withdrawn from q while keeping the boundary nodes that belong to the MCS, as follows:

Let $MCS(q, f_{max})$ denote the maximum common subgraph between $q$ and $f_{max}$, and let $q'$ denote the updated query. The vertex set of $q'$ is $V_{q'} = V_1(q') \cup V_2(q')$, where $V_1(q') = \{v \mid v \in V_q \wedge v \notin MCS(q, f_{max})\}$ and $V_2(q') = \{u \mid (u, v) \in E_q \wedge v \notin MCS(q, f_{max})\}$. On the other hand, the edge set of $q'$ is given by $E_{q'} = \{(u, v) \in V_{q'}^2 \mid (u, v) \in E_q\}$.

Query Update using Weight Update (QUWU). Computing the maximum common subgraph is known to be NP-complete, by reduction from the maximum clique problem. Hence, it is cost-prohibitive in a repetitive operation such as the query update. Alternatively, we perform a weight update on the query vertices, with a polynomial time complexity. The QUWU process builds on the assumption that all query vertices are initially weighted by 1, where a weight of 1 stands for an uncovered query node and a weight of 0 means that the node is covered. As described in Algorithm 3, QUWU works in two steps:

1. Weights on covered vertices are set to 0.5 (lines 1 to 6). At this point of the QUWU algorithm, i.e., prior to the weight binarization step, only vertices belonging to the MCS between q and $f_{max}$ are weighted by 0.5. The equivalence between the MCS and this first step of QUWU is depicted in Figure 7.2. It is straightforward that this step of QUWU has, in the worst case, a polynomial time complexity (w.r.t. the number of vertices in the query and the fragment), provided that the function Q has a linear time complexity (w.r.t. the number of query vertices, $O(n_q)$); whereas the MCS problem is known to be NP-complete, and the related algorithms have either an exponential or a factorial time complexity [144].

2. The query q goes through a weight binarization process (described in Algorithm 4) where: the nodes weighted by 0.5 whose adjacent vertices are all weighted either by 0.5 or 0 have their weights set to 0, and the nodes weighted by 0.5 that have at least one adjacent vertex weighted by 1 have their weights set to 1.

The stopping condition of the selection step is that all vertices of q are weighted by 0, i.e., $W \leq \epsilon$, where $W = \sum_{v \in V_q} w(v)$.

Figure 7.2: Vertex weight update vs. Maximum Common Subgraph.

Since QUWU does not reduce the query size, as it does not withdraw vertices but updates their weights, it is necessary to tweak the similarity function s so that it considers only the vertices that are weighted by 1, i.e., not yet covered.

Similarity function using QUWU. The above-mentioned similarity function s is slightly modified in order to take the weight condition on the query into account when comparing it to a fragment. In other words, when computing the similarity between a fragment f and the query q, the function s should consider only the nodes and edges that are weighted by 1 and omit the remaining ones. Consequently, the selected fragments are complementary. When using the weight update, the two components of the similarity function, $\Delta_L$ and $\Delta_D$, are computed as follows:

$$\Delta_L(f_i, q) = \frac{1}{n_i} \sum_{\substack{v \in V_{f_i} \\ w(Q(v)) \neq 0}} J\big(\mu(v), \mu(Q(v))\big) \qquad (7.5)$$

And ΔD becomes:

$$\Delta_D(f_i, q) = \frac{1}{n_i} \sum_{\substack{u, v \in V_{f_i},\ u < v \\ w(Q(u)) \neq 0,\ w(Q(v)) \neq 0}} \delta_t\big(d(u, v), d(Q(u), Q(v))\big) \qquad (7.6)$$
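As an illustration, the weighted label component of Equation 7.5 can be computed as follows. This is a minimal C++ sketch under our own assumptions: J is taken to be the Jaccard coefficient over the word sets of two labels, and the matching Q and the weights w are stored as plain maps; the data layout and names are ours, not the thesis code.

#include <map>
#include <set>
#include <sstream>
#include <string>

// Split a label into its set of words.
static std::set<std::string> words(const std::string& label) {
    std::istringstream in(label);
    std::set<std::string> out;
    for (std::string w; in >> w; ) out.insert(w);
    return out;
}

// Jaccard coefficient between the word sets of two labels.
static double jaccard(const std::string& a, const std::string& b) {
    std::set<std::string> A = words(a), B = words(b);
    std::size_t inter = 0;
    for (const auto& w : A) inter += B.count(w);
    std::size_t uni = A.size() + B.size() - inter;
    return uni == 0 ? 0.0 : static_cast<double>(inter) / uni;
}

// Delta_L(f_i, q): average label similarity over fragment nodes whose
// matched query node is still uncovered (weight != 0).
double deltaL(const std::map<int, std::string>& fragLabels,   // mu on V_f
              const std::map<int, std::string>& queryLabels,  // mu on V_q
              const std::map<int, int>& Q,                    // matching V_f -> V_q
              const std::map<int, double>& w) {               // weights on V_q
    double sum = 0.0;
    for (const auto& [v, label] : fragLabels) {
        auto it = Q.find(v);
        if (it == Q.end() || w.at(it->second) == 0.0) continue;  // skip covered nodes
        sum += jaccard(label, queryLabels.at(it->second));
    }
    return fragLabels.empty() ? 0.0 : sum / fragLabels.size();   // 1/n_i factor
}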

Algorithm 3 Query Update using Weight Update (QUWU)
Input: Query graph q, Fragment graph f.
Output: Query graph q with updated weights.
1. foreach (u, v) ∈ E_f:
2.     u' = Q(u), v' = Q(v);
3.     if (u', v') ∈ E_q, then:
4.         w(u') = w(v') = 0.5;
5. end foreach
6. weight_binarization(q);
7. return q;

7.3.2 Aggregation step

The fragments obtained upon successful completion of the selection step are stored in the solution set denoted by S. The aggregation step constructs an aggregate $a_i$ from the solution set S as follows: $V_{a_i} = \bigcup_{f_i \in S} V_{f_i}$ and $E_{a_i} = \bigcup_{f_i \in S} E_{f_i}$.
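In code, the aggregation step is a plain set union. The sketch below uses an illustrative vertex/edge-set representation and assumes, on our part, that vertex identifiers are globally consistent across fragments.

#include <set>
#include <utility>
#include <vector>

struct Fragment {
    std::set<int> V;                       // vertex set
    std::set<std::pair<int, int>> E;       // edge set
};

// V_a = union of the V_f, E_a = union of the E_f, for all f in S.
Fragment aggregate(const std::vector<Fragment>& S) {
    Fragment a;
    for (const Fragment& f : S) {
        a.V.insert(f.V.begin(), f.V.end());
        a.E.insert(f.E.begin(), f.E.end());
    }
    return a;
}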

Algorithm 4 Weight binarization
Input: Query graph q.
Output: Query graph q with binary weights.
1. foreach v ∈ V_q:
2.     if w(v) = 0.5 and all neighbors of v are weighted either by 0 or 0.5, then:
3.         w(v) = 0;
4. end foreach
5. foreach v ∈ V_q:
6.     if w(v) = 0.5 and there are neighbors of v weighted by 1, then:
7.         w(v) = 1;
8. end foreach
9. return q;
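For concreteness, Algorithms 3 and 4 can be sketched together in C++ as follows; the adjacency-map layout, the map-based matching Q, and all names are ours, not the thesis code. The second pass exploits the fact that, after the first pass, any node still weighted 0.5 necessarily has a neighbor weighted 1.

#include <map>
#include <utility>
#include <vector>

struct WeightedQuery {
    std::map<int, std::vector<int>> adj;  // adjacency list of q
    std::map<int, double> w;              // vertex weights (1 = uncovered)

    bool hasEdge(int u, int v) const {
        auto it = adj.find(u);
        if (it == adj.end()) return false;
        for (int x : it->second) if (x == v) return true;
        return false;
    }
};

// Algorithm 4: turn the provisional 0.5 weights into 0/1 weights.
void weightBinarization(WeightedQuery& q) {
    // Pass 1: a 0.5-weighted node with no neighbor weighted 1 becomes covered.
    for (auto& [v, wv] : q.w) {
        if (wv != 0.5) continue;
        bool hasUncoveredNeighbor = false;
        for (int u : q.adj[v]) {
            auto it = q.w.find(u);
            if (it != q.w.end() && it->second == 1.0) { hasUncoveredNeighbor = true; break; }
        }
        if (!hasUncoveredNeighbor) wv = 0.0;
    }
    // Pass 2: remaining 0.5-weighted nodes are boundary nodes; keep them uncovered.
    for (auto& [v, wv] : q.w)
        if (wv == 0.5) wv = 1.0;
}

// Algorithm 3 (QUWU): mark query vertices covered by the fragment edges.
void quwu(WeightedQuery& q,
          const std::vector<std::pair<int, int>>& fragmentEdges,  // E_f
          const std::map<int, int>& Q) {                          // matching V_f -> V_q
    for (auto [u, v] : fragmentEdges) {
        auto iu = Q.find(u), iv = Q.find(v);
        if (iu == Q.end() || iv == Q.end()) continue;
        if (q.hasEdge(iu->second, iv->second))
            q.w[iu->second] = q.w[iv->second] = 0.5;
    }
    weightBinarization(q);
}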

7.3.3 Refinement step

The last step consists in improving the quality of the aggregate $a_i$. This is achieved by (i) connecting the aggregate whenever it is disconnected, i.e., when there are unmapped edges, our method maps them to paths of a bounded length; and (ii) pruning irrelevant nodes and edges from $a_i$.

Connecting the aggregate. The aggregate $a_i$ may be disconnected when the selected fragments cover disjoint areas of the query q, leaving some query edges unmapped. Our algorithm maps these edges to paths by performing a path search in B in order to interconnect the connected components (CCs) of $a_i$. To this end, we search for a path between any two nodes belonging to different CCs of $a_i$ in the graph $G = \bigcup_{f_i \in B} f_i$, where all nodes and edges of G are weighted by 1. The path search stops either when $a_i$ becomes connected or when no path can be found. For the path finding task, we used the Dijkstra algorithm with a constraint on the maximal path length: paths longer than a fixed threshold are ignored. That is to say, we consider only paths of relatively small length in order to preserve semantic relevance (a long path is considered semantically irrelevant). As the Dijkstra algorithm searches for paths incrementally, the search for a path is abandoned as soon as its length exceeds the threshold, and as a result, the algorithm takes less time to execute.
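A minimal C++ sketch of this thresholded search is given below; since all weights equal 1, Dijkstra degenerates into a breadth-first expansion, and a branch is abandoned as soon as its length reaches the threshold. The adjacency representation and names are ours, not the thesis code.

#include <functional>
#include <map>
#include <queue>
#include <utility>
#include <vector>

// Shortest path from src to dst of length at most maxLen (unit edge weights),
// or an empty vector if none exists within the threshold.
std::vector<int> boundedShortestPath(
        const std::map<int, std::vector<int>>& adj,
        int src, int dst, int maxLen) {
    std::map<int, int> dist, parent;
    using Entry = std::pair<int, int>;                        // (distance, node)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    dist[src] = 0;
    pq.push({0, src});
    while (!pq.empty()) {
        auto [d, u] = pq.top();
        pq.pop();
        if (d > dist[u]) continue;                            // stale queue entry
        if (u == dst) break;
        if (d == maxLen) continue;                            // abandon paths beyond threshold
        auto it = adj.find(u);
        if (it == adj.end()) continue;
        for (int v : it->second) {
            int nd = d + 1;                                   // all edge weights are 1
            if (!dist.count(v) || nd < dist[v]) {
                dist[v] = nd;
                parent[v] = u;
                pq.push({nd, v});
            }
        }
    }
    if (!dist.count(dst)) return {};
    std::vector<int> path;
    for (int v = dst; v != src; v = parent[v]) path.push_back(v);
    path.push_back(src);
    return std::vector<int>(path.rbegin(), path.rend());
}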

Pruning. The aggregate $a_i$ is further refined by withdrawing irrelevant nodes. To do so, we iteratively prune irrelevant leaves from $a_i$ until no irrelevant leaf is left, where a leaf is a vertex of degree 1. As depicted in Figure 7.3, not all irrelevant nodes are pruned from $a_i$ by the end of the pruning process; however, by limiting the pruning to irrelevant leaves, the connectivity of the aggregate is preserved, since the removal of an irrelevant node with degree > 1 may disconnect the aggregate.

Figure 7.3: Pruning process of the aggregate.
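The leaf-pruning loop admits a direct sketch; the relevance test is kept abstract here, as its definition is independent of the pruning mechanics. The graph is assumed simple (no self-loops), and the layout is again our own.

#include <algorithm>
#include <functional>
#include <map>
#include <vector>

// Iteratively remove irrelevant leaves (degree-1 vertices) from the aggregate.
void pruneIrrelevantLeaves(std::map<int, std::vector<int>>& adj,
                           const std::function<bool(int)>& relevant) {
    bool changed = true;
    while (changed) {                                   // until no irrelevant leaf is left
        changed = false;
        std::vector<int> toRemove;
        for (const auto& [v, nbrs] : adj)
            if (nbrs.size() == 1 && !relevant(v))       // leaf = vertex of degree 1
                toRemove.push_back(v);
        for (int v : toRemove) {
            for (int u : adj[v]) {                      // detach v from its neighbor
                auto& nu = adj[u];
                nu.erase(std::remove(nu.begin(), nu.end(), v), nu.end());
            }
            adj.erase(v);
            changed = true;                             // removals may expose new leaves
        }
    }
}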

7.4 Experimental evaluation

In this section, we evaluate the performance of the proposed approach by analyzing its stability w.r.t. its inner parameters; we then compare it against recently proposed related works. We first describe the experimental setup along with the graph datasets and the evaluation methodology, then we present the evaluation metrics, and finally we present and discuss the obtained results.

7.4.1 Experimental setup

Graph Datasets. In order to evaluate our method, we used three real-life datasets: (i) the DBpedia Knowledge Base [145], which consists of RDF triples extracted from Wikipedia; we considered a total of 666,043 triples from entities of senators, anime, awards, beauty queens, castles, chess players, film festivals, Hollywood cartoons, novels and Olympics data. (ii) The IMDB Network [146], the Internet movie database, which consists of entities of movies, TV series, actors, directors, producers, etc., as well as their relationships. (iii) The YAGO Entity Relationship Graph [147] (Yet Another Great Ontology), a knowledge base with information extracted from Wikipedia, WordNet and GeoNames, with a total of 120 million triples.

Query generation. Two parameters are necessary to generate a query q: the number of nodes, denoted by n, and the query diameter d, which represents the largest distance between any two nodes. A query is then generated by extracting a subgraph from the dataset and introducing some label and structural noise into it. The label noise is generated by randomly modifying some words in the original labels, while the structural noise is introduced by randomly removing or adding some edges.
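A possible sketch of the structural-noise injection is given below; this is one plausible reading of "randomly removing or adding some edges" (flip each edge with probability p and replace it by a random one), and the exact procedure used in the experiments may differ.

#include <algorithm>
#include <random>
#include <set>
#include <utility>
#include <vector>

// Structural noise: with probability p, replace an existing edge by a
// randomly drawn one over the n query nodes (self-loops are skipped).
void addStructuralNoise(int n, std::set<std::pair<int, int>>& edges,
                        double p, std::mt19937& rng) {
    std::bernoulli_distribution flip(p);
    std::uniform_int_distribution<int> node(0, n - 1);
    std::vector<std::pair<int, int>> current(edges.begin(), edges.end());
    for (const auto& e : current) {
        if (!flip(rng)) continue;
        edges.erase(e);                                  // remove the edge...
        int u = node(rng), v = node(rng);                // ...and add a random one
        if (u != v) edges.insert({std::min(u, v), std::max(u, v)});
    }
}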

Fragments generation. In order to form the fragment set B, we partition each of the three datasets, with part sizes randomly ranging from 5 to 10 nodes, in such a way that no part is isomorphic to the query q; the obtained parts constitute the fragments. To avoid crossing edges between fragments, the nodes belonging to crossing edges are duplicated in the fragments involved.

Evaluation metrics. The F1-measure is used as the main evaluation metric to show the effectiveness of our method. It combines the recall (R) and the precision (P), where the recall is the ratio of correctly found node matches over all correct matches, and the precision is the ratio of correctly found matches over all found matches. The F1-measure is computed as follows:

$$F1 = \frac{2}{1/R + 1/P}$$

In addition to the F1-measure, the runtime is also reported.
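For reference, a tiny C++ helper computing this metric from match counts, with a worked example: for 8 correctly found matches out of 10 correct matches and 9 found matches, R = 0.8, P ≈ 0.889, and F1 ≈ 0.842.

#include <cstdio>

// F1 as the harmonic mean of recall and precision: 2 / (1/R + 1/P) == 2PR / (P + R).
double f1(double correctFound, double allCorrect, double allFound) {
    double R = correctFound / allCorrect;   // recall
    double P = correctFound / allFound;     // precision
    return 2.0 / (1.0 / R + 1.0 / P);
}

int main() {
    std::printf("%.3f\n", f1(8, 10, 9));    // prints 0.842
    return 0;
}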

Methodology. Our experiments follow these guidelines. We create 4 query sets and generate 100 queries within each set. We limit LaSaS to selecting the top-1 aggregate by fixing the parameter k to 1, to better emphasize the impact of the other parameters on the performance. First, we assess the influence of the label noise on our method by fixing the structural noise and varying the label noise over all 4 query sets. Then, to assess the influence of the structural noise, we fix the label noise, vary the structural noise, and run our method on all 4 query sets. Next, we vary the ratio of matching nodes in the query, i.e., the query nodes that are surely present in the fragment set B, and observe the repercussions on the performance of our method. Furthermore, we run our method while varying the parameter λ of the similarity metric (see equation 7.3), to see the influence of the emphasis on either the label similarity or the structural similarity. Moreover, we compare LaSaS to state-of-the-art tools: SAGA [120], NEMA [121] and BLINKS [122]. BLINKS is a ranked keyword graph search method that relies on an extensive indexing technique to accelerate the search process and focuses solely on node label similarity. We compare LaSaS to BLINKS to show that our framework differs from keyword graph search, and that jointly taking node label and structure similarities into consideration leads to a significant improvement in result quality.

Last but not least, we evaluate the performance of our method for different values of the parameter k, i.e., the number of expected aggregates, over all query sets and datasets.

All algorithms have been implemented in C++, and all experiments were performed on a single machine with an Intel Xeon(R) CPU at 1.9GHz and 16GB of main memory.

7.4.2 Experimental results

Label noise influence. We vary the label noise and report the F1-measure in Figure 7.4-(a). With the structural noise fixed at 10%, the F1-measure does not reach 1 even when the label noise equals 0%. Results show that LaSaS maintains the same performance from 0% to 10% of label noise, meaning that it efficiently discovers the correct matches despite the noise. However, when the label noise goes up to 30%, the F1-measure drops considerably, which shows that LaSaS is sensitive to label noise.

In Figure 7.4-(b), we show the runtime for all query sets while varying the label noise. For each query set, the runtime is nearly constant across the different values of label noise; a slight increase is noticed when the label noise reaches 30%, which is expected, as the method spends additional time looking for other candidate matches. Altogether, increasing the label noise does not incur a considerable computational cost. Although not reported here due to space limitations, results on the YAGO and IMDB graphs present similar trends.

Figure 7.4: F1-measure and runtime (in seconds) reported on the DBpedia graph, over four different query sets with 100 queries in each set, while varying the label noise, with the structural noise set to 10%.


Structural noise influence. Figure 7.5-(a) reports the F1-measure for different values of structural noise while the label noise was set to 10%. When no structural noise is added, the F1-measure does not attain 1 because of the 10% label noise. Moreover, the F1-measure does not vary abruptly for any query set (ranging between a lower bound of 0.8 and an upper bound of 0.86), which reflects the capacity of LaSaS to find the correct matches despite the noise added to the structure.

Figure 7.5-(b) reports the runtime while varying the structural noise; results show that the runtime is almost steady for all query sets, meaning that the computational cost incurred by the added structural noise is negligible.

Ratio of query matching nodes. We denote by Φ the ratio of query matching nodes that are already present in the fragment set B, and we investigate its influence on the performance of LaSaS. Figure 7.6-(a) reports the F1-measure and shows that the performance

Figure 7.5: F1-measure and runtime (in seconds) reported on DBpedia over four different query sets with 100 queries in each set, while varying the structural noise, with the label noise set to 10%.


is optimal when the ratio of matching nodes equals 100%, and decreases gradually as the ratio decreases. This shows the effectiveness of our method, as it finds the best results when they are available.

The runtime, as shown in Figure 7.6-(b), increases with the ratio Φ, which means that the method takes more time as additional relevant answers are present in the B set.

Figure 7.6: F1-measure and runtime (in seconds) reported over four different query sets with 100 queries in each set, while varying the ratio of matching nodes in the B set.


Influence of parameter λ. We investigate the influence of the parameter λ of the similarity metric (see equation 7.3).

For all query sets, the F1-measure reported in Figure 7.7-(a) increases as λ increases (we recall that increasing λ means that the emphasis is given to the label component $\Delta_L$ when computing the similarity metric). This is expected, as emphasizing label similarity favors nodes that have a similar label, hence increasing the F1-measure. In Figure 7.7-(b), we show F1', which reports the F1-measure in terms of graph matches instead of node matches. We obtain the best performance when λ equals 0.5, i.e., when the label similarity and the structure similarity are given the same importance.

Figure 7.7-(c) reports the runtime while varying the parameter λ. The runtime is nearly constant for all query sets, as varying λ does not incur an additional computational cost. It is also worth noting that we obtain similar results over the IMDB and YAGO datasets.

Figure 7.7: F1-measure in terms of node matches and graph matches, and runtime, reported over four different query sets with 100 queries in each set, while varying the parameter λ.


LaSaS vs. SAGA, NEMA and BLINKS. In Figure 7.8, we show the comparison of our method LaSaS to three state-of-the-art tools: (i) SAGA [120] and (ii) NEMA [121], two tools for approximate subgraph matching, and (iii) BLINKS [122], a graph querying tool based on keyword search.

Results in Figure 7.8 show that our proposed method LaSaS outperforms all three competitors in effectiveness, achieving a greater F1-measure (Figure 7.8-(a)), and in efficiency, finding results faster (Figure 7.8-(b)).

Figure 7.8: F1-measure in terms of node matches and runtime of LaSaS, NEMA, SAGA and BLINKS over all datasets. Results are reported for query set 3 (n=7,d=3), where k = 1 and label and structural noise were set to 10%.


Number of aggregates k. In figure 7.9, we report the runtime on all four query sets over each graph dataset while varying the number of expected aggregates k.

Generally, the runtime increases with k, as the algorithm takes more time to search for additional aggregates. However, since the number of fragments in the B set decreases whenever k increases, the selection step gets faster when building subsequent aggregates; indeed, in the majority of cases, the time for finding k aggregates is lower than the time for finding one aggregate k times. We omit the F1-measure while varying k; we note, however, that the F1-measure decreases as k increases, since the B set runs out of potential fragments for building good aggregates.

Figure 7.9: Runtime reported upon four different query sets for all three datasets while varying the parameter k (the number of expected aggregates).

(a) DBpedia (b) IMDB (c) YAGO

7.5 Chapter summary

In this chapter, we discussed a novel framework for approximate graph matching based on aggregated search, called Label and Structure Similarity Aggregated Search (LaSaS). The proposed approach enables effective graph querying without any knowledge of the schema of the data graph. The framework joins ideas from aggregated search and graph matching to effectively find approximate matches for a query from a set of heterogeneous graphs. LaSaS is based on three key ideas: (i) an aggregated search strategy that enriches the set of answers, (ii) a lightweight graph similarity metric that takes into account both node label and graph structure similarity to enable finding approximate matches, and (iii) a graph weight update that replaces the maximum common subgraph search and reduces the complexity cost. Our method enables approximate matching by allowing slight label differences and by mapping edges to paths of thresholded length. This feature makes our method unrestrictive and hence enables it to find more answers than existing strict matching schemes like graph isomorphism. Empirical evaluation over the three real-life datasets illustrates the effectiveness of our method under different parameter settings and corroborates the stability of LaSaS. Moreover, the experimental results indicate that the proposed method outperforms the state-of-the-art related approaches by finding more precise matches. Future work will build an index on the query and fragments to further accelerate the selection step. We also plan to extend the empirical study by comparing our method to additional matching tools.

Chapter 8

Conclusion and Perspectives

In this thesis, we addressed two important problems arising in the graph theory field. First, we tackled the problem of graph partitioning, specifically for massive graphs and in the streaming setting, which has a light memory footprint. Second, we addressed the aggregated graph search problem, which consists in simultaneously querying several target graphs and answering the query through a coherent combination of fragments from the target graphs.

Our main contributions are: (i) a new heuristic named Fractional Greedy for streaming graph partitioning, achieving results competitive with state-of-the-art heuristics; (ii) a new restreaming model called the Partial Restreaming Partitioning model, a hybrid streaming model that combines one-pass and multiple-pass streaming partitioning strategies, with the goal of reducing the computation costs while maintaining performance similar to the full restreaming setting; (iii) a novel heuristic named Streaming Metis Partitioning (SMP), which combines two powerful features for optimal graph partitioning: the accuracy of the multilevel method Metis and the lightness of the streaming setting; and (iv) last but not least, a new graph querying framework named LaSaS (Label and Structure Similarity Aggregated Search), based on the concept of aggregated search.

We would like to point out that the objective function in a streaming partitioning heuristic strongly influences the overall performance of the heuristic. For this reason, our proposed Fractional Greedy heuristic outperforms other state-of-the-art streaming heuristics: its special penalizing term in the objective function plays an essential role in reducing the edge-cut while preserving exact balance.

An important finding in the thesis regards partial restreaming partitioning. Instead of restreaming the whole graph, one can restream just a part of it and still achieve almost similar results, with the important advantage of considerably reducing time and memory consumption. Moreover, streaming partitioning results depend on the first nodes being processed, as they influence the partitioning of the remaining nodes by attracting their neighbors. This is shown in the selective partial restreaming model, where the results improve considerably when the restreamed portions are selected based on sensible criteria.

It is noteworthy that, in the context of graph partitioning, combining a powerful offline heuristic with a lightweight and fast online setting into one hybrid approach, which we call SMP, achieves a fair tradeoff between precision and computation cost. In fact, the studies in this thesis showed that SMP is faster than its offline counterpart Metis, both theoretically and empirically, while the quality of the results produced by SMP and offline Metis is closely similar.

Furthermore, the theoretical and empirical studies on the LaSaS framework stress the fact that applying the aggregated search concept to graph databases brings many benefits to the classical graph querying task. In fact, dispatching the query over several graphs and building the answer from numerous fragments allows additional relevant answers to be found.

An interesting follow-up to our work would be to extend our theoretical and experimental results on the streaming graph partitioning problem, especially by investigating performance guarantees under a given streaming order for the SMP method. Another interesting direction for future work is the study of the convergence of the partial restreaming model, as well as the investigation of other selection criteria for selective partial restreaming partitioning. For the aggregated graph search framework LaSaS, it would be interesting to extend the empirical study by using additional graph datasets and by comparing with more graph querying tools.

Bibliography

[1] A. Kaveh and B. Alinejad. Hypergraph products for structural mechanics. Advances in Engineering Software, 80:72 – 81, 2015. ISSN 0965-9978. doi: http://dx.doi.org/10.1016/j.advengsoft.2014.09.017.

[2] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20(1):359–392, December 1998.

[3] J. J. Sylvester. On an application of the new atomic theory to the graphical representation of the invariants and covariants of binary quantics, with three appendices. American Journal of Mathematics, 1(1):64–104, 1878. ISSN 00029327, 10806377. URL http://www.jstor.org/stable/2369436.

[4] https://en.wikipedia.org/wiki/Graph_(discrete_mathematics).

[5] M. A. Eshera and K. S. Fu. An image understanding system using attributed symbolic representation and inexact graph-matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(5):604–618, Sept 1986. ISSN 0162-8828. doi: 10.1109/TPAMI.1986.4767835.

[6] https://www.statista.com/statistics/346167/facebook-global-dau/.

[7] https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/.

[8] http://www.worldwidewebsize.com.

[9] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), 2010.

[10] http://research.microsoft.com/trinity, January 2012.

[11] Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010.

[12] http://giraph.apache.org.

[13] L. Hyafil and R. Rivest. Graph partitioning and constructing optimal decision trees are polynomial complete problems. Technical Report 33, IRIA – Laboratoire de Recherche en Informatique et Automatique, 1973. URL https://people.csail.mit.edu/rivest/pubs/HR73.pdf.

[14] M.R. Garey, D.S. Johnson, and L. Stockmeyer. Some simplified np-complete graph problems. Theoretical Computer Science, 1(3):237 – 267, 1976. ISSN 0304-3975. doi: http://dx.doi.org/10.1016/0304-3975(76)90059-1. URL http://www.sciencedirect.com/science/article/pii/0304397576900591.

[15] K. Andreev and H. Räcke. Balanced graph partitioning. In Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '04, pages 120–124, 2004. ISBN 1-58113-840-7.

[16] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. 1979.

[17] Guy Even. Fast approximate graph partitioning algorithms. SIAM J. Comput., 28(6):2187–2214, August 1999. ISSN 0097-5397. doi: 10.1137/S0097539796308217. URL http://dx.doi.org/10.1137/S0097539796308217.

[18] U. Feige and R. Krauthgamer. A polylogarithmic approximation of the minimum bisection. SIAM J. Comput., 2002.

[19] Robert Krauthgamer, Joseph (Seffi) Naor, and Roy Schwartz. Partitioning graphs into balanced components. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2009.

[20] Umit Catalyurek and Cevdet Aykanat. A hypergraph-partitioning approach for coarse-grain decomposition. In Proceedings of the 2001 ACM/IEEE Conference on Supercomputing, SC ’01, pages 28–28, New York, NY, USA, 2001. ACM. ISBN 1-58113-293-X. doi: 10.1145/582034.582062. URL http://doi.acm.org/10.1145/582034.582062.

[21] David A Papa and Igor L Markov. Hypergraph partitioning and clustering, 2007.

[22] Aleksandar Trifunović and William J Knottenbelt. Parallel multilevel algorithms for hypergraph partitioning. Journal of Parallel and Distributed Computing, 68(5):563–581, 2008.

[23] Bas Fagginger Auer and Rob H Bisseling. Abusing a hypergraph partitioner for unweighted graph partitioning. Graph Partitioning and Graph Clustering, 588: 19–35, 2012.

[24] Santo Fortunato. Community detection in graphs. Physics reports, 486(3): 75–174, 2010.

[25] Mark EJ Newman. Community detection and graph partitioning. EPL (Europhysics Letters), 103(2):28003, 2013.

[26] Hao Li, Gary W Rosenwald, Juhwan Jung, and Chen-Ching Liu. Strategic power infrastructure defense. Proceedings of the IEEE, 93(5):918–933, 2005.

[27] Juan Li and Chen-Ching Liu. Power system reconfiguration based on multilevel graph partitioning. In PowerTech, 2009 IEEE Bucharest, pages 1–5. IEEE, 2009.

[28] Björn H Junker and Falk Schreiber. Analysis of biological networks, volume 2. John Wiley & Sons, 2011.

[29] Ulrich Lauther. An extremely fast, exact algorithm for finding shortest paths in static networks with geographical background. Geoinformation und Mobilität – von der Forschung zur praktischen Anwendung, 22:219–230, 2004.

[30] Tim Kieritz, Dennis Luxen, Peter Sanders, and Christian Vetter. Distributed time-dependent contraction hierarchies. In SEA, pages 83–93. Springer, 2010.

[31] Daniel Delling, Andrew V. Goldberg, Thomas Pajor, and Renato F. Werneck. Customizable route planning. In Proceedings of the 10th International Conference on Experimental Algorithms, SEA’11, pages 376–387, Berlin, Heidelberg, 2011. Springer-Verlag. ISBN 978-3-642-20661-0. URL http://dl.acm.org/citation.cfm?id=2008623.2008657.

[32] Dennis Luxen and Dennis Schieferdecker. Candidate sets for alternative routes in road networks. J. Exp. Algorithmics, 19:2.7:1–2.7:28, January 2015. ISSN 1084-6654. doi: 10.1145/2674395. URL http://doi.acm.org/10.1145/2674395.

[33] Daniel Delling and Renato F Werneck. Faster customization of road networks. In International Symposium on Experimental Algorithms, pages 30–42. Springer, 2013.

[34] Bo Peng, Lei Zhang, and David Zhang. A survey of graph theoretical approaches to image segmentation. Pattern Recogn., 46(3):1020–1038, March 2013. ISSN 0031-3203. doi: 10.1016/j.patcog.2012.09.015. URL http://dx.doi.org/10.1016/j.patcog.2012.09.015.

[35] K Santle Camilus and VK Govindan. A review on graph based segmentation. International Journal of Image, Graphics and Signal Processing, 4(5):1, 2012.

[36] Andrew B Kahng, Jens Lienig, Igor L Markov, and Jin Hu. VLSI physical design: from graph partitioning to timing closure. Springer Science & Business Media, 2011.

[37] R. Andersen and Y. Peres. Finding sparse cuts locally using evolving sets. In Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing, 2009.

[38] W.E. Donath and A.J. Hoffman. Algorithms for partitioning graphs and computer logic based on eigenvectors of connection matrices. IBM Technical Disclosure Bulletin, 15(3):938–944, 1972.

[39] W.E. Donath and A.J. Hoffman. Lower bounds for the partitioning of graphs. IBM Journal of Research and Development, 17(5):420–425, 1973.

[40] Miroslav Fiedler. A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czechoslovak Mathematical Journal, 25(4): 619–633, 1975. URL http://eudml.org/doc/12900.

[41] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd Ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0-8018-5414-8.

[42] Alex Pothen, Horst D Simon, and Kang-Pu Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM journal on matrix analysis and applications, 11(3):430–452, 1990.

[43] Horst D Simon. Partitioning of unstructured problems for parallel processing. Computing systems in engineering, 2(2-3):135–148, 1991.

[44] Lars Hagen and Andrew B Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE transactions on computer-aided design of integrated circuits and systems, 11(9):1074–1085, 1992.

[45] Stephen T Barnard and Horst D Simon. Fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems. Concurrency and computation: Practice and Experience, 6(2):101–117, 1994.

[46] Bruce Hendrickson and Robert Leland. A multilevel algorithm for partitioning graphs. In Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, Supercomputing '95, New York, NY, USA, 1995. ACM. ISBN 0-89791-816-9. doi: 10.1145/224170.224228. URL http://doi.acm.org/10.1145/224170.224228.

[47] Richard J. Lipton and Robert E Tarjan. A separator theorem for planar graphs. Technical report, Stanford, CA, USA, 1977.

[48] https://en.wikipedia.org/wiki/Matching_(graph_theory).

[49] George Karypis and Vipin Kumar. Analysis of multilevel graph partitioning. In Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, Supercomputing ’95, New York, NY, USA, 1995. ACM. ISBN 0-89791-816-9. doi: 10.1145/224170.224229. URL http://doi.acm.org/10.1145/224170.224229.

[50] B.W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(2):291–307, 1970.

[51] L. Hagen and A. B. Kahng. New spectral methods for ratio cut partitioning and clustering. Trans. Comp.-Aided Des. Integ. Cir. Sys., 2006.

[52] C. M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving network partitions. In Proceedings of the 19th Design Automation Conference, DAC ’82, pages 175–181, Piscataway, NJ, USA, 1982. IEEE Press. ISBN 0-89791-020-6. URL http://dl.acm.org/citation.cfm?id=800263.809204.

[53] Thang Nguyen Bui and Curt Jones. A heuristic for reducing fill-in in sparse matrix factorization. Technical report, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA (United States), 1993.

[54] Todd Goehring and Yousef Saad. Heuristic algorithms for automatic graph partitioning. Technical report, Technical report, Department of Computer Science, University of Minnesota, Minneapolis, 1994.

[55] P. Ciarlet and F. Lamour. On the validity of a front-oriented approach to partitioning large sparse graphs with a connectivity constraint. Numerical Algorithms, 12(1):193–214, Mar 1996. ISSN 1572-9265. doi: 10.1007/BF02141748. URL http://dx.doi.org/10.1007/BF02141748.

[56] Bruce Hendrickson. Graph partitioning and parallel solvers: Has the emperor no clothes? (extended abstract). In Proceedings of the 5th International Symposium on Solving Irregularly Structured Problems in Parallel, IRREGULAR '98, pages 218–225, London, UK, 1998. Springer-Verlag. ISBN 3-540-64809-7. URL http://dl.acm.org/citation.cfm?id=646012.677019.

[57] Roy D Williams. Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurrency and Computation: Practice and Experience, 3(5):457–481, 1991.

[58] Charbel Farhat and Michel Lesoinne. Automatic partitioning of unstructured meshes for the parallel solution of problems in computational mechanics. International Journal for Numerical Methods in Engineering, 36(5):745–764, 1993.

[59] Gary L Miller, S-H Teng, and Stephen A Vavasis. A unified geometric approach to graph separators. In Foundations of Computer Science, 1991. Proceedings., 32nd Annual Symposium on, pages 538–547. IEEE, 1991.

[60] John R Gilbert, Gary L Miller, and Shang-Hua Teng. Geometric mesh partitioning: Implementation and experiments. SIAM Journal on Scientific Computing, 19(6):2091–2110, 1998.

[61] Kevin Aydin, Mohammadhossein Bateni, and Vahab Mirrokni. Distributed balanced partitioning via linear embedding. In WSDM 2016: Ninth ACM International Conference on Web Search and Data Mining, 2016.

[62] C. Gotsman and M. Lindenbaum. On the metric properties of discrete space-filling curves. Trans. Img. Proc., 5(5):794–797, May 1996. ISSN 1057-7149. doi: 10.1109/83.499920. URL http://dx.doi.org/10.1109/83.499920.

[63] Rolf Niedermeier, Klaus Reinhardt, and Peter Sanders. Towards optimal locality in mesh-indexings. Discrete Applied Mathematics, 117(1):211 – 237, 2002. ISSN 0166-218X. doi: http://dx.doi.org/10.1016/S0166-218X(00)00326-7. URL http://www.sciencedirect.com/science/article/pii/S0166218X00003267.

[64] Peter Fishburn, Prasad Tetali, and Peter Winkler. Optimal linear arrangement of a rectangular grid. Discrete Mathematics, 213(1):123 – 139, 2000. ISSN 0012-365X. doi: http://dx.doi.org/10.1016/S0012-365X(99)00173-9. URL http://www.sciencedirect.com/science/article/pii/S0012365X99001739.

[65] I. Stanton and G. Kliot. Streaming graph partitioning for large distributed graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, 2012.

[66] Isabelle Stanton. Streaming balanced graph partitioning algorithms for random graphs. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014.

[67] C. Tsourakakis, C. Gkantsidis, B. Radunovic, and M. Vojnovic. Fennel: Streaming graph partitioning for massive scale graphs. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, 2014.

[68] Ghizlane Echbarthi and Hamamache Kheddouci. Fractional greedy and partial restreaming partitioning: New methods for massive graph partitioning. In 2014 IEEE International Conference on Big Data, 2014.

[69] J. Nishimura and J. Ugander. Restreaming graph partitioning: Simple versatile algorithms for advanced balancing. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.

[70] Ghizlane Echbarthi and Hamamache Kheddouci. Partial restreaming approach for massive graph partitioning. In Tenth International Conference on Signal-Image Technology and Internet-Based Systems, SITIS 2014.

[71] Charles-Edmond Bichot and Patrick Siarry. Graph Partitioning, September 2011. URL http://liris.cnrs.fr/publis/?id=5317. ISTE-Wiley, 368 pages, 13 chapters, ISBN 978-1-84821-233-6.

[72] http://snap.stanford.edu/data/.

[73] G. Echbarthi and H. Kheddouci. Streaming metis partitioning. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 17–24, Aug 2016. doi: 10.1109/ASONAM.2016.7752208.

[74] http://glaros.dtc.umn.edu/gkhome/node/419.

[75] Yuanyuan Tian, Richard A Hankins, and Jignesh M Patel. Efficient aggregation for graph summarization. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 567–580. ACM, 2008.

[76] Ulle Endriss and Umberto Grandi. Graph aggregation. Artificial Intelligence, 245:86 – 114, 2017. ISSN 0004-3702. doi: https://doi.org/10.1016/j.artint.2017.01.001. URL http://www.sciencedirect.com/science/article/pii/S0004370217300024.

[77] Daniele Porello and Ulle Endriss. Ontology merging as social choice. In International Workshop on Computational Logic in Multi-Agent Systems, pages 157–170. Springer, 2011.

[78] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620, November 1975. ISSN 0001-0782. doi: 10.1145/361219.361220. URL http://doi.acm.org/10.1145/361219.361220.

[79] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '94, pages 232–241, New York, NY, USA, 1994. Springer-Verlag New York, Inc. ISBN 0-387-19889-X. URL http://dl.acm.org/citation.cfm?id=188490.188561.

[80] Jay M. Ponte and W. Bruce Croft. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’98, pages 275–281, New York, NY, USA, 1998. ACM. ISBN 1-58113-015-5. doi: 10.1145/290941.291008. URL http://doi.acm.org/10.1145/290941.291008.

[81] Vanessa Murdock and Mounia Lalmas. Workshop on aggregated search. SIGIR Forum, 42(2):80–83, November 2008. ISSN 0163-5840. doi: 10.1145/1480506.1480520. URL http://doi.acm.org/10.1145/1480506.1480520.

[82] Arlind Kopliku, Karen Pinel-Sauvagnat, and Mohand Boughanem. Aggregated search: A new information retrieval paradigm. ACM Comput. Surv., 46(3): 41:1–41:31, January 2014. ISSN 0360-0300. doi: 10.1145/2523817. URL http://doi.acm.org/10.1145/2523817.

[83] https://fr.wikipedia.org/wiki/Structured_Query_Language.

[84] Aditya Pal and Jaya Kawale. Leveraging query associations in federated search. In Proceedings of the SIGIR 2008 Workshop on Aggregated Search, volume 3, 2008.

[85] Thi Truong Avrahami, Lawrence Yau, Luo Si, and Jamie Callan. The FedLemur project: Federated search in the real world. J. Am. Soc. Inf. Sci. Technol., 57(3):347–358, February 2006. ISSN 1532-2882. doi: 10.1002/asi.v57:3. URL http://dx.doi.org/10.1002/asi.v57:3.

[86] Jamie Callan. Distributed information retrieval. Advances in information retrieval, pages 127–150, 2002.

[87] Mounia Lalmas. Chapter: Aggregated search. Advanced topics on information retrieval, 2011.

[88] Hoa Trang Dang, Diane Kelly, and Jimmy J Lin. Overview of the TREC 2007 question answering track. In TREC, volume 7, page 63, 2007.

[89] Véronique Moriceau and Xavier Tannier. FIDJI: using syntax for validating answers in multiple documents. Information Retrieval, 13(5):507–533, 2010. ISSN 1573-7659. doi: 10.1007/s10791-010-9131-y. URL http://dx.doi.org/10.1007/s10791-010-9131-y.

[90] Cécile Paris, Stephen Wan, and Paul Thomas. Focused and aggregated search: A perspective from natural language generation. Inf. Retr., 13(5):434–459, October 2010. ISSN 1386-4564. doi: 10.1007/s10791-009-9121-0. URL http://dx.doi.org/10.1007/s10791-009-9121-0.

[91] Cécile Paris, Stephen Wan, Ross Wilkinson, and Mingfang Wu. Generating personal travel guides – and who wants them? In International Conference on User Modeling, pages 251–253. Springer, 2001.

[92] Christina Sauper and Regina Barzilay. Automatically generating wikipedia articles: A structure-aware approach. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1, ACL ’09, pages 208–216, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics. ISBN 978-1-932432-45-9. URL http://dl.acm.org/citation.cfm?id=1687878.1687909.

[93] Sihem Amer-Yahia, Francesco Bonchi, Carlos Castillo, Esteban Feuerstein, Isabel Mendez-Diaz, and Paula Zabala. Composite retrieval of diverse and complementary bundles. IEEE Transactions on Knowledge and Data Engineering, 26(11):2662–2675, 2014.

[94] Amartya Sen. Social choice theory. Handbook of mathematical economics,3: 1073–1181, 1986.

[95] Arlind Kopliku, Karen Pinel-Sauvagnat, and Mohand Boughanem. Aggregated search: a new information retrieval paradigm. ACM Computing Surveys, 46(3):1–31, 2014. ISSN 0360-0300.

[96] Fabian M. Suchanek, Mauro Sozio, and Gerhard Weikum. Sofie: A self-organizing framework for information extraction. In Proceedings of the 18th International Conference on World Wide Web, WWW ’09, pages 631–640, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-487-4. doi: 10.1145/1526709.1526794. URL http://doi.acm.org/10.1145/1526709.1526794.

[97] Craig Macdonald. The voting model for people search. SIGIR Forum, 43(1): 73–73, June 2009. ISSN 0163-5840. doi: 10.1145/1670598.1670616. URL http://doi.acm.org/10.1145/1670598.1670616.

[98] Arlind Kopliku, Karen Pinel-Sauvagnat, and Mohand Boughanem. Attribute retrieval from relational web tables. In String Processing and Information Retrieval, 18th International Symposium, SPIRE 2011, Pisa, Italy, October 17-21, 2011. Proceedings, pages 117–128, 2011. doi: 10.1007/978-3-642-24583-1_12. URL https://doi.org/10.1007/978-3-642-24583-1_12.

[99] Xifeng Yan, Philip S. Yu, and Jiawei Han. Graph indexing: A frequent structure-based approach. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD '04, pages 335–346, New York, NY, USA, 2004. ACM. ISBN 1-58113-859-8. doi: 10.1145/1007568.1007607. URL http://doi.acm.org/10.1145/1007568.1007607.

[100] Sherif Sakr and Ghazi Al-Naymat. Graph indexing and querying: a review. International Journal of Web Information Systems, 6(2):101–120, 2010.

[101] Huahai He and Ambuj K. Singh. Closure-tree: An index structure for graph queries. In Proceedings of the 22Nd International Conference on Data Engineering, ICDE ’06, pages 38–, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2570-9. doi: 10.1109/ICDE.2006.37. URL http://dx.doi.org/10.1109/ICDE.2006.37.

[102] David W Williams, Jun Huan, and Wei Wang. Graph database indexing using structured graph decomposition. In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, pages 976–985. IEEE, 2007.

[103] Rosalba Giugno and Dennis E. Shasha. Graphgrep: A fast and universal method for querying graphs. In ICPR (2), pages 112–115. IEEE Computer Society, 2002. ISBN 0-7695-1695-5. URL http://dblp.uni-trier.de/db/conf/icpr/icpr2002-2.html#GiugnoS02.

[104] Stefano Berretti, Alberto Del Bimbo, and Enrico Vicario. Efficient matching and indexing of graph models in content-based retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(10):1089–1105, 2001.

[105] Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd International Conference on Very Large Data Bases, VLDB ’97, pages 426–435, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc. ISBN 1-55860-470-7. URL http://dl.acm.org/citation.cfm?id=645923.671005.

[106] Andrew KC Wong, Manlai You, and SC Chan. An algorithm for graph optimal monomorphism. IEEE Transactions on Systems, Man, and Cybernetics, 20(3): 628–638, 1990.

[107] David E Ghahraman, Andrew KC Wong, and Tung Au. Graph optimal monomorphism algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 10(4):181–188, 1980.

[108] Donatello Conte, Pasquale Foggia, Carlo Sansone, and Mario Vento. Thirty years of graph matching in pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence, 18(03):265–298, 2004.

[109] Alfred V Aho and John E Hopcroft. The design and analysis of computer algorithms. Pearson Education India, 1974.

[110] John E Hopcroft and Jin-Kue Wong. Linear time algorithm for isomorphism of planar graphs (preliminary report). In Proceedings of the sixth annual ACM symposium on Theory of computing, pages 172–184. ACM, 1974.

[111] Eugene M Luks. Isomorphism of graphs of bounded valence can be tested in polynomial time. Journal of computer and system sciences, 25(1):42–65, 1982.

[112] Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, Yinghui Wu, and Yunpeng Wu. Graph pattern matching: From intractable to polynomial time. VLDB, 2010.

[113] Joel Brynielsson, Johanna Högberg, Lisa Kaati, Christian Mårtenson, and Pontus Svenson. Detecting social positions using simulation. In Advances in Social Networks Analysis and Mining (ASONAM), 2010 International Conference on, pages 48–55. IEEE, 2010.

[114] Shuai Ma, Yang Cao, Wenfei Fan, Jinpeng Huai, and Tianyu Wo. Capturing topology in graph pattern matching. Proc. VLDB Endow., 2011.

[115] Shijie Zhang, Meng Hu, and Jiong Yang. TreePi: A novel graph indexing method. In Proc. of ICDE, 2007.

[116] Peixiang Zhao, Jeffrey Xu Yu, and Philip S. Yu. Graph indexing: Tree + delta = graph. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB ’07, pages 938–949. VLDB Endowment, 2007. ISBN 978-1-59593-649-3. URL http://dl.acm.org/citation.cfm?id=1325851.1325957.

[117] Karam Gouda and Mosab Hassaan. Compressed feature-based filtering and verification approach for subgraph search. In Proceedings of the 16th International Conference on Extending Database Technology, EDBT ’13, pages 287–298, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1597-5. doi: 10.1145/2452376.2452411. URL http://doi.acm.org/10.1145/2452376.2452411.

[118] James Cheng, Yiping Ke, Wilfred Ng, and An Lu. FG-index: Towards verification-free query processing on graph databases. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07, pages 857–872, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-686-8. doi: 10.1145/1247480.1247574. URL http://doi.acm.org/10.1145/1247480.1247574.

[119] Haichuan Shang, Ying Zhang, Xuemin Lin, and Jeffrey Xu Yu. Taming verification hardness: An efficient algorithm for testing subgraph isomorphism. VLDB, 2008.

[120] Yuanyuan Tian, Richard C. McEachin, Carlos Santos, Jignesh M. Patel, et al. SAGA: a subgraph matching tool for biological graphs. Bioinformatics, 2007.

[121] Arijit Khan, Yinghui Wu, Charu C. Aggarwal, and Xifeng Yan. NEMA: fast graph search with label similarity. In Proceedings of the 39th International Conference on Very Large Data Bases, PVLDB'13, 2013.

[122] Hao He, Haixun Wang, Jun Yang, and Philip S. Yu. BLINKS: Ranked keyword searches on graphs. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07.

[123] Luigi P. Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans. Pattern Anal. Mach. Intell., 26, 2004.

[124] J. R. Ullmann. An algorithm for subgraph isomorphism. J. ACM, 23(1):31–42, 1976.

[125] Wenfei Fan, Jianzhong Li, Shuai Ma, Hongzhi Wang, and Yinghui Wu. Graph homomorphism revisited for graph matching. Proc. VLDB Endow., 3, 2010.

[126] Horst Bunke and Xiaoyi Jiang. Graph Matching and Similarity, pages 281–304. Springer US, Boston, MA, 2000. ISBN 978-1-4615-4401-2. doi: 10.1007/978-1-4615-4401-2 10. URL http://dx.doi.org/10.1007/978-1-4615-4401-2_10.

[127] Laura A. Zager and George C. Verghese. Graph similarity scoring and matching. Applied Mathematics Letters, 21(1):86 – 94, 2008. ISSN 0893-9659. doi: http://doi.org/10.1016/j.aml.2007.01.006. URL http://www.sciencedirect.com/science/article/pii/S0893965907001012.

[128] Brian P. Kelley, Bingbing Yuan, Fran Lewitter, Roded Sharan, Brent R. Stockwell, and Trey Ideker. Pathblast: a tool for alignment of protein interaction networks. Nucleic Acids Res, 2004.

[129] Zhi Liang, Meng Xu, Maikun Teng, and Liwen Niu. Netalign: A web-based tool for comparison of protein interaction networks. Bioinformatics, 2006.

[130] Rohit Singh, Jinbo Xu, and Bonnie Berger. Pairwise global alignment of protein interaction networks by matching neighborhood topology. In Proc. of RECOMB'11.

[131] Venkata Snehith Cherukuri and Kasim Selçuk Candan. Propagation-vectors for trees (PVT): Concise yet effective summaries for hierarchical data and trees. In Proc. of the LSDS-IR'08 ACM Workshop.

[132] Arijit Khan, Nan Li, Xifeng Yan, Ziyu Guan, Supriyo Chakraborty, and Shu Tao. Neighborhood based fast graph search in large networks. In Proc. of the 2011 ACM SIGMOD.

[133] Jong Wook Kim and K. Selçuk Candan. CP/CV: Concept similarity mining without frequency information from domain describing taxonomies. In Proc. of CIKM'15.

[134] John R Anderson. A spreading activation theory of memory. Journal of verbal learning and verbal behavior, 1983.

[135] Brian Gallagher. Matching structure and semantics: A survey on graph-based pattern matching. AAAI FS, 2006.

[136] Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, and Dongyan Zhao. gStore: Answering SPARQL queries via subgraph matching. Proc. VLDB Endow., 2011.

[137] Haytham Elghazel and Mohand-Said Hacid. Aggregated search in graph databases: preliminary results. In International Workshop on Graph-Based Representations in Pattern Recognition, 2011.

[138] Thanh-Huy Le, Haytham Elghazel, and Mohand-Saïd Hacid. A relational-based approach for aggregated search in graph databases. In International Conference on Database Systems for Advanced Applications, 2012.

[139] John W Raymond and Peter Willett. Maximum common subgraph isomorphism algorithms for the matching of chemical structures. Journal of computer-aided molecular design, 2002.

[140] Panagiotis Papadimitriou, Ali Dasdan, and Hector Garcia-Molina. Web graph similarity for anomaly detection (poster). WWW ’08.

[141] Adam Schenker, Mark Last, Horst Bunke, and Abraham Kandel. Classification of web documents using graph matching. International Journal of Pattern Recognition and Artificial Intelligence, 2004.

[142] Manu Aery and Sharma Chakravarthy. emailsift: Email classification based on structure and content. ICDM ’05.

[143] Michael R Garey, David S. Johnson, and Larry Stockmeyer. Some simplified NP-complete graph problems. Theoretical Computer Science, 1(3):237–267, 1976.

[144] Donatello Conte, Pasquale Foggia, and Mario Vento. Challenging complexity of maximum common subgraph detection algorithms: A performance analysis of three algorithms on a wide database of graphs. J. Graph Algorithms Appl., 2007.

[145] http://dbpedia.org.

[146] http://www.imdb.com/interfacesplain.

[147] https://datahub.io/dataset/yago.