Indexing Graph Structured Data
MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Indexing Graph Structured Data

PH.D. THESIS

Stanislav Bartoň

Brno, January 2007

Acknowledgement

Though the following dissertation is an individual work, I could never have reached the heights or explored the depths without the help, support, guidance and efforts of a lot of people. Firstly, I would like to thank my supervisor Prof. Pavel Zezula for instilling in me the qualities of being a good researcher and scientist. A very special thank you to my friends and colleagues Dr. Vlastislav Dohnal, Dr. Michal Batko and David Novák for the support they have lent me over all these years and also for providing me with contributions and comments on this work that have made a difference. Finally, I would like to acknowledge and sincerely thank my beloved parents for their inexhaustible faith and confidence in me and my abilities.

Abstract

This Ph.D. thesis concerns indexing techniques for the efficient discovery of complex relationships among entities in graph structured data. We propose a novel structure for evaluating a special type of operator, the ρ-path operator, which denotes queries for all paths up to a certain limiting length lying between a pair of inspected vertices in the indexed graph. We have compared it to various approaches suitable for this task and conducted numerous experiments to verify its properties.

The proposed approach is based on a graph simplification method that we call graph segmentation. Using the recursive process of the graph segmentation, a multilevel tree-like indexing structure called ρ-index is acquired. In this tree, each level represents a simplified graph and each node a path type matrix describing the particular graph segment. An algorithm for evaluating ρ-path queries using the ρ-index is proposed and evaluated.

The experiments are conducted on synthetic randomly generated graphs. These were acquired using our own incremental algorithm.
These graphs represent the most general case for studying the ρ-index's properties since they lack any structural behavior. This thesis also contains an evaluation on real-life data, represented by a citation graph of 30,000 scientific publications taken from the CiteSeer database. That part of the thesis also proposes novel approaches to the discovery of important publications in the network, influenced by the user's predefined context.

Supervisor: prof. Ing. Pavel Zezula, CSc.

Keywords

index structures, graph structured data, path search algorithms, semantic web, citation networks, citation analysis, context citation publication search

Contents

1 Introduction
  1.1 Statement of the Problem
  1.2 Research Objectives
2 Background
  2.1 Graph Queries
    2.1.1 Graph Containment Query
    2.1.2 ρ-operators
    2.1.3 Vertex Reachability Query
  2.2 Graph Theory
    2.2.1 Basic Definitions
    2.2.2 Graph Segmentation
  2.3 Segmentation Hypotheses
    2.3.1 Correctness of a Proper Sequence of Segments Representation
    2.3.2 Connecting Path of a Sequence of Segments
    2.3.3 Imposing a Weight Limit l
    2.3.4 Iteration Step
3 Indices for Graph Structured Data and Related Work
  3.1 Algorithmic Approaches
    3.1.1 Graph Algorithms
      Single Source Path Expression Problem
    3.1.2 Transitive Closure Computation Algorithms
      Matrix-based Direct Algorithms
      Graph-based Direct Algorithms
      Hybrid Algorithms
    3.1.3 Summary
  3.2 Graph Structured Data Indices
    3.2.1 Graph containment queries
    3.2.2 Reachability queries
      Interval Based Approach
      2-hop Approach
      Hierarchical Labeling of Sub-Structures
    3.2.3 Pathway Oriented Indexing Schemes
      Class and Path Index
      An Indexing Scheme for RDF and RDF Schema Based on Suffix Arrays
    3.2.4 Summary
4 ρ-index
  4.1 Structure of the Index
    4.1.1 Path Type Matrix
    4.1.2 Tables of Transitions Among Segments
    4.1.3 ρ-index's Structure Outline
  4.2 Transcription Graph
    4.2.1 Formal Transcription Methods
      Transforming the existPathTo Type of Edge
      Transforming the transitionTo Type of Edge
      Transforming the Dependency Types of Edges
      Soft and Hard Minimal Path Weights
    4.2.2 Strategy of the Transcription Process
      Maintaining and Utilizing the Soft and Hard Minimal Weights
  4.3 ρ-index Creation Algorithm
    4.3.1 Graph Segmentation Methods and Strategies
      General Vertex Clustering Method
      Segmentation Using Topological Order
      Summary
    4.3.2 Transitive Closure Computation of the Path Type Matrix
      Sequence of Segments Weight Computation
      Suffix Tree for Disconnected Sequences of Segments
  4.4 Search Algorithms
    4.4.1 ρ-path Algorithm
      The Initial State of the Transcription Graph
      The Result
    4.4.2 ρ-connection Algorithm Outline
5 ρ-index Evaluation
  5.1 Data Collection
    5.1.1 Random Graph Models
    5.1.2 Incremental Random Graph Generation Algorithm
  5.2 ρ-index Creation Time
  5.3 Search Complexity
    5.3.1 Search Complexity of Positive Search
    5.3.2 Search Complexity of Negative Search
    5.3.3 Search Complexity of Queries with Limited Maximal Path Length
    5.3.4 Search Complexity Affected by the Parameter Settings
    5.3.5 Summary
6 Applying ρ-index in Citation Analysis
  6.1 Bibliometrics
    6.1.1 Citation Analysis
    6.1.2 Material Search Strategies
    6.1.3 Indirect Citation Relationships
    6.1.4 Ranking of Publications
      HITS
      PageRank
      SCEAS
    6.1.5 Topological Studies of a Citation Network
    6.1.6 Mapping of Science
  6.2 Implementing Indirect Complex Relationships Using ρ-operators
    6.2.1 Data Set
    6.2.2 ρ-path Results
    6.2.3 Semantic Analysis
      Implementing Forward Chaining by ρ-index
    6.2.4 Summary
7 Conclusion
  7.1 Summary
  7.2 Contribution
  7.3 Further Research Directions
Bibliography

List of Figures

2.1 Graph G′ is isomorphic to the subgraph of a graph G that is induced by a set of vertices {b, c, f} where the function f is defined as follows: f(b) = x, f(c) = y, f(d) = z.
2.2 ρ-path applied to vertices a and b. The result comprises all possible paths lying between the inspected vertices: ρ-path(a, b) = {e1e2, e3e4, e5e6}.
2.3 A result of ρ-connectionTo(a, b). One connection is denoted by a pair of paths (e1e2, e5e4) terminated in vertex g and the other by (e3e6, e7e8) terminated in vertex h.
2.4 A result of ρ-connectionFrom(g, h). One connection is denoted by a pair of paths (e1e2, e3e6) initiated in vertex a and the other by (e5e4, e7e8) initiated in vertex b.
2.5 Segmentation of a graph G and its segment graph SG(G).
2.6 An example of a segmentation where the sequence of segments (S5S6S4S3) does not represent any path in G.
2.7 An example of partial connecting paths PCPi and a set of minimal connecting paths CPs for a sequence of segments.
2.8 Demonstration of a segment weight assignment.
3.1 A short example of a directed acyclic graph (DAG) and its transitive closure matrix.
3.2 Two sets of vertices connected through a single vertex and the corresponding submatrix in the transitive closure matrix.
3.3 An example illustrating Definition 3.2.3.
The solid edges belong to spanning forest T. Grey vertices indicate exposed vertices. Vertex 6 is not exposed but it is the out-portal of 3 since it is the least common ancestor of all 3's exposed descendants, vertices 8 and 9.
3.4 An example of an RDF graph.
3.5 A directed acyclic graph together with two extracted path expressions. The character . denotes a delimiter of characters in the path expression.
3.6 All possible suffixes generated from the pair of extracted path expressions from Figure 3.5. The suffixes are then lexicographically ordered and duplicates are removed. The resulting suffix array is [1,1], [2,1], [1,3], [2,3], [1,5], [2,5], [1,7], [1,9], [1,2], [2,2], [1,4], [2,4], [1,6], [2,6], [1,8].
4.1 A path type matrix M for a directed graph.
4.2 One step in the computation of the transitive closure of the path type matrix.
4.3 A fragment of a graph segmentation accompanied by the segment graph and the segment transition tables for each of the participating segments.
4.4 Reversed transition tables for segments S1, S2 and S3 from Figure 4.3.
4.5 Visual outline of the ρ-index's structure.
4.6 Initial state of a transcription graph for a search for all paths between vertices 1 and 10.
4.7 Transcription of a transition to a lower level.
4.8 Transformation of an existsPathTo type of edge where the entry pXY of the path type matrix P is pXY = {(XA1A2A4A5Y), (XA1A3A4A5Y), (XB1A2A4A5Y), (XB1B2Y)}.
4.9 Transformation of a transition where the border edges between the segments X and Y are EDGES_OUT(X) ∩ EDGES_IN(Y) = {(A1, B1), (A2, B2), ..., (An, Bn′)}.