MASARYK UNIVERSITY

FACULTY OF INFORMATICS


Indexing Graph Structured Data

PH.D. THESIS

Stanislav Bartoň

Brno, January 2007

Acknowledgement

Though the following dissertation is an individual work, I could never have reached the heights or explored the depths without the help, support, guidance and efforts of a lot of people. Firstly, I would like to thank my supervisor Prof. Pavel Zezula for instilling in me the qualities of being a good researcher and scientist. A very special thank you to my friends and colleagues Dr. Vlastislav Dohnal, Dr. Michal Batko and David Novák for the support they have lent me over all these years and also for providing me with contributions and comments on this work that have made a difference. Finally, I would like to acknowledge and sincerely thank my beloved parents for their inexhaustible faith and confidence in me and my abilities.


Abstract

This Ph.D. thesis concerns indexing techniques for the efficient discovery of complex relationships among entities in graph structured data. We propose a novel structure for evaluating a special type of operator, the ρ-path operator, which queries for all paths up to a certain limiting length lying between a pair of inspected vertices in the indexed graph. We have compared it to various approaches suitable for this task and conducted numerous experiments to verify its properties.

The proposed approach is based on a graph simplification method that we call graph segmentation. Using the recursive process of graph segmentation, a multilevel tree-like indexing structure called ρ-index is acquired. In this tree, each level represents a simplified graph and each node a path type matrix describing the particular graph segment. An algorithm for evaluating ρ-path queries using the ρ-index is proposed and evaluated.

The experiments are conducted on synthetic, randomly generated graphs acquired using our own incremental algorithm. These graphs represent the most general case for studying the properties of the ρ-index since they lack any structural behavior. This thesis also contains an evaluation on real-life data represented by a citation graph of 30,000 scientific publications taken from the CiteSeer database. That part of the thesis also proposes novel approaches to the discovery of important publications in the network influenced by the user's predefined context.

Supervisor: prof. Ing. Pavel Zezula, CSc.


Keywords

index structures

graph structured data

path search algorithms

semantic web

networks

context citation publication search


Contents

1 Introduction
1.1 Statement of the Problem
1.2 Research Objectives

2 Background
2.1 Graph Queries
2.1.1 Graph Containment Query
2.1.2 ρ-operators
2.1.3 Vertex Reachability Query
2.2 Graph Theory
2.2.1 Basic Definitions
2.2.2 Graph Segmentation
2.3 Segmentation Hypotheses
2.3.1 Correctness of a Proper Sequence of Segments Representation
2.3.2 Connecting Path of a Sequence of Segments
2.3.3 Imposing a Weight Limit l
2.3.4 Iteration Step

3 Indices for Graph Structured Data and Related Work
3.1 Algorithmic Approaches
3.1.1 Graph Algorithms
Single Source Path Expression Problem
3.1.2 Transitive Closure Computation Algorithms
Matrix-based Direct Algorithms
Graph-based Direct Algorithms
Hybrid Algorithms
3.1.3 Summary
3.2 Graph Structured Data Indices
3.2.1 Graph containment queries
3.2.2 Reachability queries
Interval Based Approach
2-hop Approach
Hierarchical Labeling of Sub-Structures
3.2.3 Pathway Oriented Indexing Schemes
Class and Path Index
An Indexing Scheme for RDF and RDF Schema Based on Suffix Arrays
3.2.4 Summary

4 ρ-index
4.1 Structure of the Index
4.1.1 Path Type Matrix
4.1.2 Tables of Transitions Among Segments
4.1.3 ρ-index's Structure Outline
4.2 Transcription Graph
4.2.1 Formal Transcription Methods
Transforming the existPathTo Type of Edge
Transforming the transitionTo Type of Edge
Transforming the Dependency Types of Edges
Soft and Hard Minimal Path Weights
4.2.2 Strategy of the Transcription Process
Maintaining and Utilizing the Soft and Hard Minimal Weights
4.3 ρ-index Creation Algorithm
4.3.1 Graph Segmentation Methods and Strategies
General Vertex Clustering Method
Segmentation Using Topological Order
Summary
4.3.2 Transitive Closure Computation of the Path Type Matrix
Sequence of Segments Weight Computation
Suffix Tree for Disconnected Sequences of Segments
4.4 Search Algorithms
4.4.1 ρ-path Algorithm
The Initial State of the Transcription Graph
The Result
4.4.2 ρ-connection Algorithm Outline

5 ρ-index Evaluation
5.1 Data Collection
5.1.1 Random Graph Models
5.1.2 Incremental Random Graph Generation Algorithm
5.2 ρ-index Creation Time
5.3 Search Complexity
5.3.1 Search Complexity of Positive Search
5.3.2 Search Complexity of Negative Search
5.3.3 Search Complexity of Queries with Limited Maximal Path Length
5.3.4 Search Complexity Affected by the Parameter Settings
5.3.5 Summary

6 Applying ρ-index in Citation Analysis
6.1
6.1.1 Citation Analysis
6.1.2 Material Search Strategies
6.1.3 Indirect Citation Relationships
6.1.4 Ranking of Publications
HITS
PageRank
SCEAS
6.1.5 Topological Studies of a Citation Network
6.1.6 Mapping of Science
6.2 Implementing Indirect Complex Relationships Using ρ-operators
6.2.1 Data Set
6.2.2 ρ-path Results
6.2.3 Semantic Analysis
Implementing Forward Chaining by ρ-index
6.2.4 Summary

7 Conclusion
7.1 Summary
7.2 Contribution
7.3 Further Research Directions

Bibliography

List of Figures

2.1 Graph G′ is isomorphic to the subgraph of a graph G that is induced by a set of vertices {b, c, f} where the function f is defined as follows: f(b) = x, f(c) = y, f(d) = z.
2.2 ρ-path applied to vertices a and b. The result comprises of all possible paths lying between inspected vertices: ρ-path(a, b) = {e1e2, e3e4, e5e6}.
2.3 A result of ρ-connectionTo(a, b). One connection is denoted by a pair of paths (e1e2, e5e4) terminated in vertex g and the another denoted by (e3e6, e7e8) terminated in vertex h.
2.4 A result of ρ-connectionFrom(g, h). One connection is denoted by a pair of paths (e1e2, e3e6) initiated in vertex a and the another denoted by (e5e4, e7e8) initiated in vertex b.
2.5 Segmentation of a graph G and its segment graph SG(G).
2.6 An example of a segmentation where sequence of segments (S5S6S4S3) does not represent any path in G.
2.7 An example of partial connecting paths PCPi and a set of minimal connecting path CPs for a sequence of segments.
2.8 Demonstration of a segment weight assignment.
3.1 A short example of a directed acyclic graph (DAG) and its transitive closure matrix.
3.2 Two sets of vertices connected through a single vertex and the corresponding submatrix in transitive closure matrix.
3.3 An example illustrating Definition 3.2.3. The solid edges belong to spanning forest T. Grey vertices indicate exposed vertices. Vertex 6 is not exposed but it is the out-portal of 3 since it is the least common ancestor all 3's exposed descendants, vertices 8 and 9.
3.4 An example of a RDF graph.
3.5 A directed acyclic graph together with two extracted path expressions. The character . denotes a delimiter of characters in the path expression.
3.6 All possible suffixes generated from the pair of extracted path expressions from Figure 3.5. The suffixes are then lexicographically ordered and duplicates are removed. The resulting suffix array is [1,1], [2,1], [1,3], [2,3], [1,5], [2,5], [1,7], [1,9], [1,2], [2,2], [1,4], [2,4], [1,6], [2,6], [1,8].
4.1 A path type matrix M for a directed graph.
4.2 One step in the computation of the transitive closure of the path type matrix.
4.3 A fragment of a graph segmentation accompanied with the segment graph and a segment transition tables for each of the participating segment.
4.4 Reversed transition tables for segments S1, S2 and S3 from Figure 4.3.
4.5 Visual outline of the ρ-index's structure.
4.6 Initial state of a transcription graph for a search for all paths between vertices 1 and 10.
4.7 Transcription of a transition to a lower level.
4.8 Transformation of a existsPathTo type of edge where the entry pXY of the path type matrix P is pXY = {(XA1A2A4A5Y), (XA1A3A4A5Y), (XB1A2A4A5Y), (XB1B2Y)}.
4.9 Transformation of a transition where the border edges between the segments X and Y are EDGES_OUT(X) ∩ EDGES_IN(Y) = {(A1, B1), (A2, B2), ..., (An, Bn′)}.
4.10 Transformation of a collection of dependency edges at one segment.
4.11 Demonstration of a cycling segment sequence problem. An acyclic path (e1e2e3) in the indexed graph is represented by its cycling proper sequence of segments (S1S2S1S2).
4.12 A topological order of a directed acyclic graph. Consequent graph segmentation according to the position of a vertex in the order. In resulting segmentation only ES1 ≠ ∅.
4.13 A visualization of the spanning problem. If the vertices v1 and v10 were assigned to one common segment and either of vertices v4 or v9 were not, the resulting segment graph would not be DAG and therefore the topological order would be spoiled.
4.14 Initial transcription graph for a computation of a weight of a segment of sequence. The segmentation concerning the sequence of segments is included.
4.15 Final state of the transcription graph. For brevity, the segmentation depicts only the segments vertices and edges that are needed to transform the transcription graph. The result represents the set of minimal connecting paths of the segment sequence (WXYZ).
4.16 A sequence of segments (CD) has a connection path thus is connected. A segment sequence (BCD) is not connected and is put to the suffix forest. By the time the (ABCD) is checked, it is immediately pronounced as disconnected since it has (BCD) as its suffix.
4.17 A suffix tree built using following sequences of segments (ACDE), (BCDE), (CDE), (BDE), (CDF) and (BDF).
4.18 An initial state of the transcription graph where the arrays of segments for vertices v1 is [E, K, X] and v10 is [F, L, Y]. The entry of the top-most matrix P, pXY = {(XA1A2A4A5Y), (XA1A3A4A5Y), (XB1A2A4A5Y), (XB1B2Y)}.
4.19 Demonstration of how the top-most ρ-index matrix is consulted in case of ρ-connectionTo (I) and ρ-connectionFrom (II) algorithm implementation.
5.1 Vertex degree distribution in the synthetic random graph G5000.
5.2 Vertex degree distribution in the synthetic random graph G10000.
5.3 Vertex degree distribution in the synthetic random graph G20000.
5.4 Vertex degree distribution in the synthetic random graph G30000.
5.5 Creation time of ρ-index having four levels with parameters set at second level to ten and third to five. The parameter set at the first level is represented by the x-axis value.
5.6 Creation time of ρ-index having four levels with parameter setting at second level represented by the x-axis value. The parameter setting for the first level for graph G5000 is five, for G10000 is fifteen, for G20000 is thirty-five and for G30000 is forty-five. The third level parameter setting is five.
5.7 Creation time of ρ-index having four levels with parameter setting at third level represented by the x-axis value. The parameter setting for the first level for graph G5000 is five, for G10000 is fifteen, for G20000 is thirty-five and for G30000 is forty-five. The second level parameter setting is ten.
5.8 A ρ-index search complexity with respect to the graph size.
5.9 Amounts of paths found by the search executed on the testing graphs.
5.10 A complexity of a search when no paths were present in the ρ-index with respect to the graph size.
5.11 A ρ-index search complexity of queries with different maximal search length.
5.12 A ρ-index percentage of paths longer than softL not found.
5.13 A ρ-index search complexity related to the parameter setting.
6.1 Direct relationships between a pair of inspected vertices, 1 and 2, identified in the citation analysis.
6.2 Vertex degree distribution in the citation graph.
6.3 Publication search result visualization between the seed publication 1 and core publication 32. For the readability purposes, the result comprises only from paths up to the length of 6.
6.4 Publication search result visualization between the seed publication 61 and core publication 32. Grey nodes denote the intersection with the result of search from Figure 6.3.
6.5 A generalization of the influence of two publication search results conducted between seed publications S1, S2 and a core publication C.

List of Tables

5.1 A summary of connectivity thresholds computed for the testing graphs generated by the random graph generation algorithm.
5.2 Parameter setting used to build ρ-indexes for the testing graphs.
6.1 A summary of paths found between the reference [19] and the core publication.
6.2 A summary of paths found between the reference [58] and the core publication.
6.3 A summary of publications that form the vertices of the search between the publication [19] and core publication [73] concerning the intersection with paths of the search with reference publication [58]. Part 1
6.4 A summary of publications that form the vertices of the search between the publication [19] and core publication [73] concerning the intersection with paths of the search with reference publication [58]. Part 2
6.5 Filtered out publications from the first publication search using the second publications search.

Chapter 1 Introduction

Analyzing and querying graph structured data has become very important in many research areas, including the emerging Semantic Web, the biological sciences and any other field that incorporates data in a rather general, unstructured way. These also include bibliometrics and the social sciences, where social network analysis is becoming very popular. The graph, as a general data structure for representing relationships among entities, has therefore attracted substantial research interest, and a wide range of searching and mining problems have arisen along with this research.

Concerning the life sciences, there are several well-known projects in bioinformatics, e.g. Biopathways Graph Data Manager [25] or the BioCyc project [68]. These projects model data such as protein interactions, metabolic pathways or gene regulatory networks as directed graphs. Vertices in them represent entities like compounds, promoters or proteins, whereas edges specify relations between those entities. Reachability questions are asked, such as whether a reactant u indirectly activates or inhibits protein v through some chain of reactions. Other biology applications focus on discovering complex structural patterns [62] in graphs forming a graph database. The graph database may represent a set of complex compounds, and queries answer whether a molecule represented by a graph is present in that database. This task is an instance of the subgraph isomorphism decision problem, which is NP-complete [28]. Therefore a need for efficient heuristics and access methods arises, and various methods have been introduced [79, 80, 81].

Complex relationships [6, 72] have been identified and studied since the Semantic Web [14] emerged as the successor of the present-day World Wide Web. Querying these complex relationships has been studied in [7]. The problem of searching for complex relationships among entities in graph structured data can be generalized into the problem of searching for paths in directed graphs. Since searching for paths in graphs has great computational complexity, indexing structures have been developed to evaluate path queries in directed graphs efficiently. The efforts regarding specific RDF [45] graphs have been presented in [7, 51]. Since these methods exploit features unique to RDF graphs, they cannot be applied to general graph structured data.

In recent years, digital libraries like DBLP [49], CiteSeer [15, 16] or CORA [52] have established large repositories of scientific literature. The publications in these repositories, together with the citation and reference information, generate large graphs referred to as citation networks, in which mining has become very popular both from the structural point of view [47, 82, 83] and from the point of view of discovering important vertices in the network [59, 63].

1.1 Statement of the Problem

This thesis focuses on the problem of searching for complex relationships among entities in graph structured data as they are denoted by the family of ρ-operators, which are discussed more thoroughly in Section 2.1.2. The core of the problem is to be able to efficiently retrieve all indirect relationships, represented by all possible acyclic paths lying between the two inspected entities, from the graph formed by the analyzed graph structured data.

Intuitively, the importance of a complex relationship may decrease as the length of the path representing it increases. Therefore it is desirable for the user to be able to impose a limit on the maximal length of the path that denotes the complex relationship.

1.2 Research Objectives

Even though there are algorithmic approaches that can be used to process a ρ-path query, such as Tarjan's solution to the single source path problem [70, 71] or a rather naive breadth-first search for all paths lying between a pair of inspected vertices, they exhibit drawbacks with respect to what ρ-path query optimization requires: the former does not allow the path length to be limited, and the latter has exponential computational complexity.

From the introduced set of queries that can be issued on graph structured data, this thesis focuses on the queries established by the family of ρ-operators. The most studied type is the ρ-path operator. The aim of this thesis is to discover an approach to efficient query processing for this operator, with a possible extension to the rest of the family of ρ-operators, represented by the pair of ρ-connection operators.

The research presented in this thesis therefore concentrates on the design of an indexing structure that would ease the processing of ρ-path queries. The efficient discovery of all paths lying between a pair of inspected vertices in the indexed graph should both allow the search to be limited to a certain path length and still have a computational complexity similar to that of the algorithmic approach to solving the single source path problem described by Tarjan, which is O(n ∗ log m), where n denotes the number of the graph's vertices and m the number of edges.

This thesis is organized as follows. Chapter 2 covers the background and definitions from graph theory necessary for a proper presentation of the developed graph segmentation methods. This chapter also summarizes various types of queries that are becoming relevant in the context of graph structured data. In Chapter 3, a survey of existing algorithmic approaches and indexing structures for graph structured data is provided. The algorithms and indexing methods are categorized according to the type of query they aim to process efficiently. In Chapter 4, a novel indexing structure for processing the ρ-path type of queries is proposed that combines the graph simplification method with the matrix approach as a graph access method. Next, the proposed structure's properties are experimentally evaluated and compared with similar methods. The whole thesis is concluded and some future research directions are outlined in Chapter 7.

Chapter 2 Background

In this chapter we provide the reader with the information necessary for understanding various graph search problems. Firstly, we present the types of graph queries that are used in information retrieval and give examples of such queries. Secondly, we introduce the graph theory necessary to formally state the environment. Finally, we introduce graph segmentation, a new method of partitioning graphs that is necessary for the further study of the problems stated in this thesis.

2.1 Graph Queries

The different sorts of graph queries are defined in this section. Although more kinds of queries can be identified in graph structured data, in this section we present those that have attracted the most attention in recent years and those that are interesting and related to the effort described in this thesis.

2.1.1 Graph Containment Query

One of the problems that is becoming increasingly important in modeling complicated structures is answering graph containment queries. The query is represented by a graph, and the answer to the query consists of all graphs from the graph database that contain the query graph as a subgraph. More precisely, the database returns all the graphs that have a subgraph that is isomorphic to the query graph according to Definition 2.1.1.

Definition 2.1.1 (Subgraph isomorphism). A subgraph isomorphism of G = (V, E) and G′ = (V′, E′) is a surjective function f : V → V′ such that:

\[ \forall u \in V:\ f(u) \in V' \wedge l(u) = l'(f(u)) \]

\[ \forall (u, v) \in E:\ (f(u), f(v)) \in E' \wedge l(u, v) = l'(f(u), f(v)) \]

where l and l′ are the label functions of G and G′, respectively. Function f is called an embedding of G in G′.

A brief example of the subgraph isomorphism is illustrated in Figure 2.1. The vertices in G′ are renamed in a way that all edges of G′ have their counterparts in G.

Figure 2.1: Graph G′ is isomorphic to the subgraph of a graph G that is induced by a set of vertices {b, c, f} where the function f is defined as follows: f(b) = x, f(c) = y, f(d) = z.
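The containment test can be illustrated with a small brute-force sketch. It is not the method of any of the cited indexing approaches: it simply tries every assignment of query vertices to distinct vertices of a data graph, which is exponential and only practical for toy inputs. The function name `find_embedding`, the argument layout, the restriction to vertex labels (edge labels are left out) and the choice of an injective mapping from the query graph into the data graph are all assumptions made for this illustration.

```python
from itertools import permutations


def find_embedding(query_vertices, query_edges, graph_vertices, graph_edges,
                   query_labels=None, graph_labels=None):
    """Return a dict mapping query vertices to graph vertices, or None.

    Hypothetical helper: the mapping must send every query edge to an edge of
    the data graph and, when labels are given, preserve vertex labels.
    """
    query_vertices = list(query_vertices)
    graph_edges = set(graph_edges)
    # Try every ordered choice of distinct data-graph vertices.
    for image in permutations(graph_vertices, len(query_vertices)):
        f = dict(zip(query_vertices, image))
        if query_labels is not None and any(
                query_labels[u] != graph_labels[f[u]] for u in query_vertices):
            continue  # a vertex label is not preserved
        if all((f[u], f[v]) in graph_edges for (u, v) in query_edges):
            return f
    return None


# Toy data graph (a directed path) and a single-edge query graph.
data_vertices = ["a", "b", "c", "d"]
data_edges = [("a", "b"), ("b", "c"), ("c", "d")]
embedding = find_embedding(["x", "y"], [("x", "y")], data_vertices, data_edges)
```

A containment query over a graph database would call such a test once per stored graph, which is exactly the cost that the access methods surveyed later try to avoid.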

2.1.2 ρ-operators

In the context of the Semantic Web, ρ-operators are proposed in [7] as a means to explore complex relationships [72] between entities. The problem of searching for complex relationships can be modeled as the process of searching for paths in a graph where vertices represent entities and edges the relationships between them.

We recognize two kinds of complex relationships that can be observed in general directed graphs. The first one is represented by a path lying between two inspected vertices. It is defined as an operator that returns all paths lying between two vertices in a graph. For a more precise definition see Definition 2.1.2. Contrary to the reachability type of queries, the ρ-path operator returns the set of all paths going from vertex x to vertex y instead of a single boolean value representing the reachability of vertex y from vertex x.

Figure 2.2: ρ-path applied to vertices a and b. The result comprises all possible paths lying between the inspected vertices: ρ-path(a, b) = {e1e2, e3e4, e5e6}.

Definition 2.1.2 (ρ-path).

\[
\rho\text{-path}(x, y) = \{\, p = (v_1 e_1 v_2 e_2 \ldots e_n v_{n+1}) \mid v_1 = x \wedge v_{n+1} = y \wedge e_1, \ldots, e_n \in E \wedge p \text{ is acyclic} \,\}
\]
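To make Definition 2.1.2 concrete, the following is a minimal sketch of the naive evaluation: a depth-first enumeration of all acyclic paths with an optional length limit. It is not the ρ-index algorithm proposed in this thesis. The name `rho_path`, the edge dictionary format and the `max_len` parameter are assumptions of the sketch, and the edge endpoints of the example graph are assumed so that they agree with the caption of Figure 2.2.

```python
def rho_path(edges, x, y, max_len=None):
    """Enumerate all acyclic paths from x to y as tuples of edge names.

    edges maps an edge name to its (initial, terminal) vertex pair.
    max_len, if given, bounds the number of edges on a returned path.
    """
    out = {}
    for name, (u, v) in edges.items():
        out.setdefault(u, []).append((name, v))

    results = []

    def extend(vertex, visited, path):
        if vertex == y:
            results.append(tuple(path))
            return
        if max_len is not None and len(path) == max_len:
            return  # the length limit is reached without arriving at y
        for name, nxt in out.get(vertex, []):
            if nxt not in visited:  # keep the path acyclic
                visited.add(nxt)
                path.append(name)
                extend(nxt, visited, path)
                path.pop()
                visited.remove(nxt)

    if x != y:  # initial and terminal vertices must differ
        extend(x, {x}, [])
    return results


# Assumed endpoints: three two-edge routes from a to b.
example_edges = {
    "e1": ("a", "c"), "e2": ("c", "b"),
    "e3": ("a", "d"), "e4": ("d", "b"),
    "e5": ("a", "e"), "e6": ("e", "b"),
}
paths = rho_path(example_edges, "a", "b")
```

The number of paths, and therefore the running time of such an enumeration, can grow exponentially with the size of the graph, which is the drawback that motivates an index.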

An example of ρ-path applied to vertices a and b is depicted in Figure 2.2. The result of ρ-path(a, b) is a set of three possible paths between vertices a and b: {e1e2, e3e4, e5e6}.

The second ρ-operator is ρ-connection. We recognize two types of ρ-connections. The first one is defined as follows:

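Definition 2.1.3 (ρ-connectionTo).

\[
\begin{aligned}
\rho\text{-connectionTo}(x, y) = \{\, (p_1, p_2) \mid{} & p_1 = (v_1 e_1 v_2 e_2 \ldots e_n v_{n+1}), \\
& p_2 = (w_1 h_1 w_2 h_2 \ldots h_m w_{m+1}) \\
& \wedge v_1 = x \wedge w_1 = y \wedge v_{n+1} = w_{m+1} \\
& \wedge e_1, \ldots, e_n, h_1, \ldots, h_m \in E \\
& \wedge p_1, p_2 \text{ are acyclic} \,\}
\end{aligned}
\]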

Figure 2.3: A result of ρ-connectionTo(a, b). One connection is denoted by a pair of paths (e1e2, e5e4) terminated in vertex g and the other by (e3e6, e7e8) terminated in vertex h.

This type of ρ-connection returns all pairs of paths for a couple of inspected vertices, where the inspected vertices are the origins of the respective paths in each pair. An illustration of this kind of operator is depicted in Figure 2.3. The other type is very similar; its formal definition is:

Definition 2.1.4 (ρ-connectionFrom).

\[
\begin{aligned}
\rho\text{-connectionFrom}(x, y) = \{\, (p_1, p_2) \mid{} & p_1 = (v_1 e_1 v_2 e_2 \ldots e_n v_{n+1}), \\
& p_2 = (w_1 h_1 w_2 h_2 \ldots h_m w_{m+1}) \\
& \wedge v_1 = w_1 \wedge v_{n+1} = x \wedge w_{m+1} = y \\
& \wedge e_1, \ldots, e_n, h_1, \ldots, h_m \in E \\
& \wedge p_1, p_2 \text{ are acyclic} \,\}
\end{aligned}
\]

The only difference from the former kind of ρ-connection is that the inspected vertices are always the terminals of the respective paths in each returned pair.

The ρ-operators represent a very important type of query. In the first case it is a query for all paths lying between two vertices in a graph, and in the other it is a query for a common feature of the two inspected vertices: a set of common successors or predecessors together with all the paths denoting them.
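Both ρ-connection operators can be sketched by pairing up exhaustively enumerated paths. This is again a naive illustration rather than the indexed algorithm developed later. The function names, the edge dictionary format and the restriction to non-empty paths are assumptions of the sketch, and the edge endpoints of the example graph are assumed so that they agree with the captions of Figures 2.3 and 2.4.

```python
from collections import defaultdict


def _acyclic_paths_from(adj, start):
    """Yield (end_vertex, edge_names) for every non-empty acyclic path from start."""
    stack = [(start, (start,), ())]
    while stack:
        vertex, visited, path = stack.pop()
        for name, nxt in adj[vertex]:
            if nxt not in visited:
                new_path = path + (name,)
                yield nxt, new_path
                stack.append((nxt, visited + (nxt,), new_path))


def rho_connection_to(edges, x, y):
    """Pairs (p1, p2): p1 starts in x, p2 starts in y, both end in the same vertex."""
    adj = defaultdict(list)
    for name, (u, v) in edges.items():
        adj[u].append((name, v))
    by_end = defaultdict(list)
    for end, path in _acyclic_paths_from(adj, y):
        by_end[end].append(path)
    return {(p1, p2)
            for end, p1 in _acyclic_paths_from(adj, x)
            for p2 in by_end.get(end, [])}


def rho_connection_from(edges, x, y):
    """Pairs (p1, p2): both start in the same vertex, p1 ends in x, p2 ends in y."""
    # Search the reversed graph, then restore the original edge order.
    reversed_edges = {name: (v, u) for name, (u, v) in edges.items()}
    return {(p1[::-1], p2[::-1])
            for p1, p2 in rho_connection_to(reversed_edges, x, y)}


# Assumed endpoints for the graph of Figures 2.3 and 2.4.
example_edges = {
    "e1": ("a", "c"), "e2": ("c", "g"), "e3": ("a", "d"), "e6": ("d", "h"),
    "e5": ("b", "e"), "e4": ("e", "g"), "e7": ("b", "f"), "e8": ("f", "h"),
}
to_result = rho_connection_to(example_edges, "a", "b")
from_result = rho_connection_from(example_edges, "g", "h")
```

Evaluating ρ-connectionFrom on the reversed graph mirrors the symmetry between the two definitions: common predecessors in G are common successors once every edge is turned around.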

Figure 2.4: A result of ρ-connectionFrom(g, h). One connection is denoted by a pair of paths (e1e2, e3e6) initiated in vertex a and the other by (e5e4, e7e8) initiated in vertex b.

2.1.3 Vertex Reachability Query

Contrary to the ρ-path operator, sometimes the mere existence of a path is a satisfying answer to a query and the path or paths themselves are not important to the user. Therefore we introduce another type of query, the reachability query. This type of query is popular in the context of processing XML documents, where the structural relationship between nodes in the documents is investigated, e.g. the ancestor-descendant type of relationship. Graph reachability is the following decision problem: given two nodes u and v in a graph G, is there a path from u to v? If there is such a path, we say that u reaches v, written u →* v.
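Without any index, the reachability decision problem can be answered by a plain breadth-first search, as in the minimal sketch below. The function name `reaches` and the adjacency dictionary format are assumptions; the sketch also treats every vertex as reaching itself through the empty path, which the text above does not specify.

```python
from collections import deque


def reaches(adj, u, v):
    """Return True if v can be reached from u by following directed edges.

    adj maps a vertex to an iterable of its direct successors.
    """
    seen = {u}
    queue = deque([u])
    while queue:
        w = queue.popleft()
        if w == v:
            return True
        for nxt in adj.get(w, ()):
            if nxt not in seen:  # visit every vertex at most once
                seen.add(nxt)
                queue.append(nxt)
    return False


# Toy graph: a -> b -> c, plus an isolated edge d -> a.
toy = {"a": ["b"], "b": ["c"], "d": ["a"]}
answer = reaches(toy, "d", "c")
```

Each query costs time linear in the size of the graph, whereas a precomputed transitive closure answers in constant time at the price of quadratic storage; reachability indices aim for a compromise between these two extremes.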

2.2 Graph Theory

In this section we introduce graph theory through some of its terms and definitions, since we would like to present and discuss all sorts of search problems on graph structured data. Afterwards we present our contribution to graph theory, graph segmentation, because it is the crucial basis of our proposed indexing structure for indexing paths in a graph, which is presented in Chapter 4 of this thesis.

2.2.1 Basic Definitions

A directed graph is defined as an ordered pair of two sets. The first one is the set of objects that represent vertices in the graph. The second one is a set of ordered pairs that represent edges between the vertices in the graph:

Definition 2.2.1 (Vertices, Edges and Directed Graph).

• Vertices V = {v1, ..., vn}

• Edges E = {e1, ..., em}, E ⊆ V × V, ei = (v, w), v, w ∈ V

• Graph G = (V, E)

Because the pairs in E are ordered, the resulting graph is directed. That means that the existence of an edge (x, y) does not automatically imply the existence of an edge (y, x); that one has to appear explicitly in E. A special case of a directed graph is a directed acyclic graph, defined in Definition 2.2.2. Its distinguishing property is that it contains no cycles, i.e. no path that is initiated and terminated in the same vertex. A DAG can be topologically ordered with respect to the direction of its edges.

Definition 2.2.2 (DAG and a topological order).

• Directed acyclic graph also known as DAG is a graph in which no directed cycles are present. In other words, it is a directed graph without a path that is initiated and terminated in the same vertex.

• Topological order on DAG: Vertices ordered according to a numbering of the vertices of the directed acyclic graph such that every edge from a vertex numbered i to a vertex numbered j satisfies i < j.
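A topological numbering in the sense of Definition 2.2.2 can be computed, for example, with Kahn's algorithm. The sketch below is a standard textbook technique and not a method introduced by this thesis; the function name `topological_order` and the return convention (a numbering starting at 1, or None when a directed cycle exists) are assumptions of the illustration.

```python
from collections import deque


def topological_order(vertices, edges):
    """Return a dict vertex -> number such that every edge (u, v) has
    number[u] < number[v], or None if the graph contains a directed cycle."""
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:  # all predecessors of v are already numbered
                ready.append(v)

    if len(order) != len(indegree):
        return None  # some vertices lie on a cycle and were never released
    return {v: i for i, v in enumerate(order, start=1)}


dag_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
numbering = topological_order(["a", "b", "c", "d"], dag_edges)
```

The order is generally not unique: in the example, b and c may be numbered in either order.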

Obviously, an arbitrary directed graph cannot be topologically ordered in general. The existence of a topological order places DAGs between general graphs and tree structures from the structure hierarchy point of view. For the brevity of the following definitions of graph theory terms, we define the initial and terminal vertex of an edge in Definition 2.2.3.

Definition 2.2.3 (Initial and terminal vertex of an edge).

• Initial vertex of an edge e: LEFT_VTX(e) = v1 ⇔ e = (v1, v2)

• Terminal vertex of an edge e: RIGHT_VTX(e) = v2 ⇔ e = (v1, v2)

In our work, we will consider only those paths that are acyclic. That means that no vertex along a path appears twice in it, and that the initial and terminal vertices are also different from each other. The precise definition of an acyclic path in graph G in Definition 2.2.4 also states a simplified notation that comprises only the edges, with the vertices left out.

Definition 2.2.4 (Path definitions).

• Acyclic path p = (v1 e1 v2 e2 ... en vn+1) in G:

\[
1 \le i \le n,\ 1 \le j \le n+1,\ i \ne j:\quad e_i \in E \wedge v_i, v_j \in V \wedge v_i = \mathrm{LEFT\_VTX}(e_i) \wedge v_{i+1} = \mathrm{RIGHT\_VTX}(e_i) \wedge v_i \ne v_j
\]

• Simplified path notation: instead of (v1 e1 v2 e2 ... en vn+1) we sometimes use (e1 e2 ... en)

• Initial vertex of a path p = (e1 e2 ... en): INITIAL(p) = LEFT_VTX(e1)

• Terminal vertex of a path p = (e1 e2 ... en): TERMINAL(p) = RIGHT_VTX(en)

• Path concatenation in G:

\[
(v_1 e_1 v_2 e_2 \ldots e_n v_{n+1}).(v'_1 e'_1 v'_2 e'_2 \ldots e'_m v'_{m+1}) = (v_1 e_1 v_2 e_2 \ldots e_n v_{n+1}\, e'\, v'_1 e'_1 v'_2 e'_2 \ldots e'_m v'_{m+1}) \Leftrightarrow e' = (v_{n+1}, v'_1) \in E
\]

The path concatenation operator . in Definition 2.2.4 simply makes one path from two others under the condition that it is possible to concatenate them so that the result is again a valid path in G.
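The path definitions translate into a few lines of code. In the sketch below a path is stored in the simplified notation, as a tuple of edges, and each edge is a (LEFT_VTX, RIGHT_VTX) pair; this representation and the function names `is_acyclic_path` and `concatenate` are assumptions made for the illustration.

```python
def is_acyclic_path(path, E):
    """Check that path (a tuple of (u, v) edges) is an acyclic path in a
    graph with edge set E."""
    if not path or any(e not in E for e in path):
        return False
    # Consecutive edges must share a vertex: RIGHT_VTX(e_i) = LEFT_VTX(e_i+1).
    if any(path[i][1] != path[i + 1][0] for i in range(len(path) - 1)):
        return False
    vertices = [path[0][0]] + [v for _, v in path]
    return len(set(vertices)) == len(vertices)  # no vertex appears twice


def concatenate(p, q, E):
    """Join p and q through the connecting edge (TERMINAL(p), INITIAL(q)).

    Returns None when that edge is not in E. The result may still contain a
    repeated vertex, so callers can validate it with is_acyclic_path.
    """
    connecting = (p[-1][1], q[0][0])
    if connecting not in E:
        return None
    return tuple(p) + (connecting,) + tuple(q)


E = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")}
joined = concatenate((("a", "b"),), (("c", "d"),), E)
```

Note that the concatenation inserts the connecting edge e′ between the two operands, so the two paths being joined do not share a vertex.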


2.2.2 Graph Segmentation

Graph segmentation as a graph partitioning method was first introduced in [11]. Contrary to the common notion of a subgraph, see Definition 2.2.5, a segment's set of edges ES also contains those edges from the supergraph's set E that have only one endpoint in the segment's set of vertices VS. A segment can therefore be envisioned as a vertex-induced subgraph extended with border edges that point into or out of that subgraph. A vertex-induced subgraph has its set of edges built so that it contains all the edges of G lying between the vertices in V′.

Definition 2.2.5 (Subgraph). Subgraphs of a graph G:

• Subgraph H = (V′, E′) of G = (V, E):

\[
V' \subseteq V \wedge E' \subseteq E \wedge \big( (v_1, v_2) \in E' \Rightarrow v_1, v_2 \in V' \big)
\]

• Vertex-induced subgraph H = (V′, E′) of G = (V, E):

\[
V' \subseteq V \wedge E' = \{\, e \in E \mid \mathrm{RIGHT\_VTX}(e) \in V' \wedge \mathrm{LEFT\_VTX}(e) \in V' \,\} \wedge \nexists e \in (E - E'):\ \mathrm{RIGHT\_VTX}(e) \in V' \wedge \mathrm{LEFT\_VTX}(e) \in V'
\]

Definition 2.2.6 states the exact definition of a graph segment. As we already mentioned, it is an enhanced vertex-induced subgraph of G. Note that the definitions of the vertex-induced subgraph and the graph's segment differ virtually only in the logical connective used to build the edge set. Definition 2.2.6 also formalizes the notion of border edges. For each segment there are two kinds of border edges: the edges that point into the segment and the edges that point out of it.

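Definition 2.2.6 (Segment). Segment S in a graph G: $S = (V_S, E_S)$ : $V_S \subseteq V \wedge E_S = \{e \in E \mid \mathrm{RIGHT\_VTX}(e) \in V_S \vee \mathrm{LEFT\_VTX}(e) \in V_S\} \wedge \nexists e \in (E - E_S) : \mathrm{RIGHT\_VTX}(e) \in V_S \vee \mathrm{LEFT\_VTX}(e) \in V_S$

• $\mathrm{EDGES\_OUT}(S) = \{e \mid e \in E_S \wedge \mathrm{LEFT\_VTX}(e) \in V_S \wedge \mathrm{RIGHT\_VTX}(e) \notin V_S\}$

• $\mathrm{EDGES\_IN}(S) = \{e \mid e \in E_S \wedge \mathrm{RIGHT\_VTX}(e) \in V_S \wedge \mathrm{LEFT\_VTX}(e) \notin V_S\}$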
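The edge sets of Definition 2.2.6 can be illustrated with a minimal Python sketch. It assumes that a directed edge is stored as a pair (u, v), with u in the role of LEFT_VTX and v in the role of RIGHT_VTX; the function names and the example graph are hypothetical and serve illustration only.

```python
def segment_edges(edges, vs):
    """Return (E_S, EDGES_OUT, EDGES_IN) for the segment with vertex set vs."""
    vs = set(vs)
    # Assumption: an edge (u, v) has LEFT_VTX = u and RIGHT_VTX = v.
    e_s = {(u, v) for (u, v) in edges if u in vs or v in vs}
    edges_out = {(u, v) for (u, v) in e_s if u in vs and v not in vs}
    edges_in = {(u, v) for (u, v) in e_s if v in vs and u not in vs}
    return e_s, edges_out, edges_in


def induced_edges(edges, vs):
    """Edge set of the vertex-induced subgraph: both end vertices lie in vs."""
    vs = set(vs)
    return {(u, v) for (u, v) in edges if u in vs and v in vs}


# Hypothetical example graph with five directed edges.
E = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("b", "d")}
e_s, edges_out, edges_in = segment_edges(E, {"a", "b"})
```

For the vertex set {a, b}, the vertex-induced subgraph keeps only the edge (a, b), whereas the segment additionally keeps the border edges (b, c), (b, d) and (d, a).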


The reason for introducing the notion of a graph segment is to obtain a lossless partitioning of a graph into segments. Imagine a total partitioning of a graph G = (V, E) into vertex-induced subgraphs. If we take the union of the vertex sets of all subgraphs in the partition, we get V, since the partition is total. But if we do the same with the edge sets of all subgraphs in the partition, we get E′ ⊂ E whenever the partition comprises more than one subgraph, because the border edges between subgraphs are not present in any of the subgraphs concerned. By defining a graph segmentation, formalized in Definition 2.2.7, we overcome this inconvenience of partitioning a graph into subgraphs.

Definition 2.2.7 (Segmentation). A graph segmentation S(G) of a graph G: $S(G) = \{S \mid S \text{ is a segment of } G\} \wedge \forall S, S' \in S(G), S \neq S' : V_S \cap V_{S'} = \emptyset \wedge \bigcup_{S \in S(G)} V_S = V$

The totality $\bigcup_{S \in S(G)} E_S = E$ follows directly from the fact that each segment is vertex-induced and that all vertices take part in the partition. Since we have a total partition of G, we can create a new graph that has a lot in common with the original graph G. In this graph, the vertices represent the collapsed segments of G and the edges represent the existence of a border edge between a pair of segments in the respective direction. We call such a graph the segment graph of the graph G and the segmentation S(G). This notion is formalized in Definition 2.2.8. Figure 2.5 depicts a partition of a graph G into segments together with its segment graph SG(G).

Definition 2.2.8 (Segment Graph). Segment graph SG(G) of G and segmentation S(G): $SG(G) = (S(G), X)$, $X = \{h \mid h = (S_i, S_j) \Leftrightarrow 1 \leq i, j \leq |S(G)| \wedge \mathrm{EDGES\_OUT}(S_i) \cap \mathrm{EDGES\_IN}(S_j) \neq \emptyset\}$
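As a hedged illustration of Definitions 2.2.7 and 2.2.8, the following Python sketch builds the edge set X of a segment graph and compares the edge coverage of a segmentation with that of a partition into vertex-induced subgraphs. The representation of a segmentation as a mapping from vertices to segment labels, the function names and the example data are assumptions made for this sketch.

```python
def segment_graph(edges, segment_of):
    """Edge set X of SG(G), given a mapping vertex -> segment label."""
    x = set()
    for u, v in edges:
        su, sv = segment_of[u], segment_of[v]
        if su != sv:
            # (u, v) lies in EDGES_OUT(su) and in EDGES_IN(sv).
            x.add((su, sv))
    return x


def covered_edges(edges, segment_of, as_segments):
    """Union of the edge sets of all parts of the partition."""
    covered = set()
    for u, v in edges:
        same_part = segment_of[u] == segment_of[v]
        # A segment keeps every edge that touches it; a vertex-induced
        # subgraph keeps only edges with both end vertices inside it.
        if as_segments or same_part:
            covered.add((u, v))
    return covered


# Hypothetical example: four vertices split into two segments.
E = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("b", "d")}
segment_of = {"a": "S1", "b": "S1", "c": "S2", "d": "S2"}
X = segment_graph(E, segment_of)
```

In this example the segmentation covers all five edges, while the two vertex-induced subgraphs together cover only (a, b) and (c, d); the three border edges collapse into the two segment graph edges (S1, S2) and (S2, S1).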

Having defined the segment graph SG(G), we can compare its properties with those of the original graph G. The main property that the two graphs have in common is that all the paths present in the graph G are also present, in a simplified way, in its segment graph SG(G). This assertion is proved in the following section. To distinguish between the paths in a graph G and the paths in its segment graph SG(G), we introduce the sequence of segments, which denotes a path in SG(G). This new term is formalized in Definition 2.2.9.

Figure 2.5: Segmentation of a graph G and its segment graph SG(G).

Definition 2.2.9 (Sequence of Segments). Sequence of segments $(S_1 \ldots S_l)$ where $S_1, \ldots, S_l \in S(G)$, $\forall i : 1 \leq i \leq l - 1 : \mathrm{EDGES\_OUT}(S_i) \cap \mathrm{EDGES\_IN}(S_{i+1}) \neq \emptyset$

Each path in G is encoded in a certain way into a sequence of segments and, as we show in the following section, this mapping is many-to-one: more than one path in G may share the same sequence of segments representation, but for each path there is only one such representation.

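Definition 2.2.10 (Proper Sequence of Segments). Proper segment sequence for a path $p = (v_1 e_1 v_2 e_2 \ldots e_n v_{n+1})$: $S(p) = (S_1 \ldots S_l)$ : $S(p)$ is a segment sequence $\wedge\ 1 \leq i_1 < i_2 < \ldots < i_l \leq n + 1 : \{v_1, \ldots v_{i_1}\} \subseteq V_{S_1} \wedge \{v_{i_1}, \ldots v_{i_2}\} \subseteq V_{S_2} \wedge \ldots \wedge \{v_{i_l}, \ldots v_{n+1}\} \subseteq V_{S_l}$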


In Definition 2.2.10 we defined the proper sequence of segments of a path p as the encoding of the path p in G in the segment graph SG(G). As mentioned earlier, this representation is unique for each path present in G, therefore each path has only one proper sequence of segments.

Finally, we present the weights of vertices and sequences of segments in Definition 2.2.11. We have chosen to assign the weights to vertices rather than to edges because we want to keep track of the cost of traversing a vertex rather than the cost of following a particular edge. Recall that a vertex in the segment graph SG(G) is in fact a collapsed segment of the graph G, therefore the cost of traversing this vertex is more interesting to us than the knowledge of traversing a single border edge.

Definition 2.2.11 (Weights). Weights of a vertex, path and sequence of segments:

• Weight of a vertex v: $w(v) \in \mathbb{N}$

• Weight of a path $p = (v_1 e_1 v_2 e_2 \ldots e_n v_{n+1})$: $w(p) = \sum_{i=1}^{n+1} w(v_i)$

• Set of weights of paths of a sequence of segments: $|(S_1 \ldots S_l)| = \{w(p) \mid p \in (S_1 \ldots S_l)\}$

• Weight of a sequence of segments: $\|(S_1 \ldots S_l)\| = \min(|(S_1 \ldots S_l)|)$

With the weight definitions we are able to focus our attention only on paths having a certain weight. Assuming that all the vertices in G have the weight 1, the weight of a path represents the number of vertices in the path, which can be understood as an alternative definition of path length, usually defined as the number of edges in the path. Notice that a sequence of segments has both a length and a weight. The length is the number of segments in the sequence, whereas the weight of a sequence of segments represents the length of a shortest path that this sequence represents. In the next section we discuss the relation between the length and the weight of a sequence of segments.
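As a small worked instance of Definition 2.2.11, consider the unit-weight assumption mentioned above and a purely hypothetical sequence of segments $(S_1 S_2)$ whose paths have the weights 3, 4 and 6:

```latex
% Assumption: every vertex has weight w(v_i) = 1, so a path with n edges has
w(p) = \sum_{i=1}^{n+1} w(v_i) = n + 1
% Hypothetical sequence of segments whose paths have weights 3, 4 and 6:
|(S_1 S_2)| = \{3, 4, 6\}, \qquad \|(S_1 S_2)\| = \min\{3, 4, 6\} = 3
```

The length of this sequence is 2, since it consists of two segments, while its weight is 3.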

2.3 Segmentation Hypotheses

We have already defined the ideas of graph segmentation. In this section we state and prove the hypotheses that are the cornerstones of our further work on indexing graph structured data presented in Chapter 4. First of all, we discuss the correctness of the definition of the proper sequence of segments representation. Then we introduce the notion of a connecting path of a sequence of segments as a means to measure the weight of the sequence of segments. Afterwards, we impose a limit l on the set of sequences of segments and explore the properties of the paths represented by such a set. Finally, the repeated application of graph segmentation is investigated.

2.3.1 Correctness of a Proper Sequence of Segments Representation

In the previous section we defined a proper sequence of segments in SG(G) for any path p in a graph G. We stated that this encoding is a many-to-one mapping: the encoding is unique for each path, yet more than one path can share one common encoding. We now prove this assertion in Lemma 1.

Lemma 1. If a graph G = (V, E) has a segmentation S(G) that forms a graph SG(G), any path $p = (v_1 e_1 v_2 e_2 \ldots e_n v_{n+1})$ in G can be represented by its proper segment sequence in S(G) and this representation is unique.

Proof. This assertion follows immediately from the definitions of a segment graph and a sequence of segments. As mentioned, a sequence of segments is another representation of a path in SG(G). Thus, we show that any path in G can be transformed into a path in SG(G) and that this path in fact represents a sequence of segments which is the proper segment sequence for the original path.

From the definition of S(G) we know that each vertex in G belongs to exactly one segment of S(G). Therefore, we take the path p in G and rename the vertices by the names of the segments they belong to:

$p = (v_1 e_1 v_2 e_2 \ldots e_n v_{n+1}) \Longrightarrow (S_1 e_1 S_2 e_2 \ldots e_n S_{n+1})$


If $S_i = S_{i+1}$ then $e_i$ is not a border edge, therefore we omit the part $(e_i, S_{i+1})$ from the transformed path. We repeat this step until $S_i \neq S_{i+1}$ holds for every pair of neighboring labels. According to the definition of SG(G), we drew an edge $h = (S_i, S_j)$ in SG(G) whenever $\mathrm{EDGES\_OUT}(S_i) \cap \mathrm{EDGES\_IN}(S_j) \neq \emptyset$. The presence of $e_i$ between $S_i$ and $S_{i+1}$ in $(S_1 e_1 S_2 e_2 \ldots e_n S_{n+1})$ implies the presence of such an edge in G connecting two vertices from the particular segments, where $S_i \neq S_{i+1}$. This implies that $\mathrm{EDGES\_OUT}(S_i) \cap \mathrm{EDGES\_IN}(S_{i+1}) \neq \emptyset$ and therefore there exists an edge h in SG(G) from $S_i$ to $S_{i+1}$.

Now we have $(S_1 e_{i_1} S_{i_1} e_{i_2} \ldots e_{i_{l-1}} S_l)$, where $1 \leq i_1 < i_2 \ldots \leq l \leq n + 1$. In this expression, we replace all $e_i$ by the respective edges h from SG(G). The result is a correct path $(S_1 h_1 S_{i_1} h_2 \ldots h_l S_l)$ in SG(G) that represents a proper segment sequence $(S_1 \ldots S_l)$ for the path p in G. The uniqueness of the proper sequence of segments of a path p results from the definition of the segmentation S(G), because no vertex in V can be assigned to two different segments at the same time. Thus, the chain of segment labels is unique for any path in G.
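The transformation used in the proof of Lemma 1 (renaming vertices by segment labels and dropping repeated neighboring labels) can be sketched in Python as follows. The sketch assumes that a path is given as its list of vertices and that the segmentation is a mapping from vertices to segment labels; both the function name and the example data are hypothetical.

```python
def proper_sequence(path_vertices, segment_of):
    """Rename vertices by segment labels and collapse repeated neighbors."""
    sequence = []
    for v in path_vertices:
        label = segment_of[v]
        # Equal neighboring labels mean the edge between them is not a
        # border edge, so the repeated label is omitted.
        if not sequence or sequence[-1] != label:
            sequence.append(label)
    return sequence


# Hypothetical segmentation of four vertices into two segments.
segment_of = {"a": "S1", "b": "S1", "c": "S2", "d": "S2"}
seq_long = proper_sequence(["a", "b", "c", "d"], segment_of)
seq_short = proper_sequence(["b", "d"], segment_of)
```

Both example paths map to the same sequence (S1 S2), which illustrates that several paths may share one proper sequence of segments while each path has exactly one.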

2.3.2 Connecting Path of a Sequence of Segments

To investigate the various weight properties of sequences of segments, we define a connecting path for a segment sequence $(S_1 \ldots S_l)$ in Definition 2.3.1. We will show that only the connecting paths of a sequence of segments are significant for determining the weight of that sequence.

To define the connecting path, we use the notion of common edges between two segments. These form a subset of all border edges of a pair of segments, namely those edges that point from the preceding segment to the following segment in the sequence. From the definition of a sequence of segments we know that there must be at least one such edge.

Definition 2.3.1 (Connecting path). Connecting path $p = (e_1 e_2 \ldots e_n)$ in a sequence of segments $(S_1 \ldots S_l)$:

• Common edges $CE_i$ for $(S_1 \ldots S_l)$: $1 \leq i \leq l - 1 : CE_i = \mathrm{EDGES\_OUT}(S_i) \cap \mathrm{EDGES\_IN}(S_{i+1})$

• $p \in (S_1 \ldots S_l) : e_1 \in CE_1 \wedge e_n \in CE_{l-1} \wedge \exists i_2, i_3, \ldots i_{l-1} : 1 < i_2 < i_3 < \ldots < i_{l-1} < n : \{e_2, \ldots e_{i_2}\} \subseteq E_{S_2} \wedge \{e_{i_2}, \ldots e_{i_3}\} \subseteq E_{S_3} \wedge \ldots \wedge \{e_{i_{l-2}}, \ldots e_{i_{l-1}}\} \subseteq E_{S_{l-1}}$

Figure 2.6: An example of a segmentation where the sequence of segments $(S_5 S_6 S_4 S_3)$ does not represent any path in G.

Nonetheless, the limits set by the definition themselves do not ensure that each segment sequence represents a particular path in the original graph G. See Figure 2.6 for an example of such a situation. Consider the sequence of segments $(S_5 S_6 S_4 S_3)$: each $CE_i$ for $1 \leq i \leq 3$ is non-empty, but there is no connecting path for this sequence of segments. Therefore, the existence of a connecting path in a segment sequence indicates that this sequence of segments is important to us.

Whenever a sequence of segments has at least one connecting path, it also has a weight, given by the smallest weight of a connecting path of this sequence. Since each sequence of segments can have more than one connecting path, and seemingly we would have to find all connecting paths to find the one with the smallest weight, we discuss the upper bound on the number of connecting paths that have to be computed to find the one with the minimal weight.

In Definition 2.3.2 we define the partial connecting paths of a sequence of segments, which are used to build the set of minimal connecting paths CPs for a sequence of segments. As we show in Lemma 2, the cardinality of CPs is predictable and forms the upper bound on the number of connecting paths to be computed to get the one with the minimal weight.

Definition 2.3.2 (Set of minimal connecting paths). Partial connecting paths and set of minimal connecting paths:

• Partial connecting paths for $(S_1 \ldots S_l)$: $1 \leq i \leq l - 2 : PCP_i = \{p \mid ce_1 \in CE_i, ce_2 \in CE_{i+1} : p = (e_1 e_2 \ldots e_n) \wedge w(p) \text{ is minimal} \wedge e_1 = ce_1 \wedge \mathrm{TERMINAL}(e_n) = \mathrm{LEFT\_VTX}(ce_2) \wedge 1 \leq k \leq n : e_k \in E_{S_{i+1}}\}$

• Set of minimal connecting paths CPs for $(S_1 \ldots S_l)$:

$CPs = \begin{cases} \emptyset & l = 1 \\ CE_1 & l = 2 \\ \{p \mid p = (p_1.p_2 \ldots p_{l-2}.e) : 1 \leq i \leq l - 2 : p_i \in PCP_i \wedge e \in NE_{l-1} \wedge 2 \leq k \leq l - 2 : (\mathrm{TERMINAL}(p_{k-1}), \mathrm{INITIAL}(p_k)) \in NE_k \wedge \mathrm{TERMINAL}(p_{l-2}) = \mathrm{LEFT\_VTX}(e)\} & l > 2 \end{cases}$

A set of partial connecting paths $PCP_i$ for a sequence of segments $(S_1 \ldots S_l)$ represents a set of paths, each of which is initiated by a common edge of segments $S_i$ and $S_{i+1}$ and is terminated by a common edge of segments $S_{i+1}$ and $S_{i+2}$. It is thus a set of paths with minimal weights that traverse the segment $S_{i+1}$ in the context of the two neighboring segments $S_i$ and $S_{i+2}$. Figure 2.7 demonstrates a short example of the sets of common edges $CE_i$, partial connecting paths $PCP_i$ and minimal connecting paths for a particular sequence of segments. In the example, $P_j$ denotes a path of minimal weight between two border vertices. Although more than one such path may exist, for brevity the figure represents each set of paths only by the one of minimal weight. As we see from Lemma 2, this method is sufficient to compute the whole set of minimal connecting paths.

Figure 2.7: An example of partial connecting paths $PCP_i$ and a set of minimal connecting paths CPs for a sequence of segments.

Lemma 2 (An upper bound on the maximal number of connecting paths computed to find the one with the lowest weight). A connecting path for a segment sequence $(S_1 \ldots S_l)$ with the lowest weight is a path in CPs with the lowest weight.

Proof. CPs is defined as a combination of traversals of lowest weights for each triple of neighboring segments through the middle segment in $(S_1 \ldots S_l)$.

A connecting path p of $(S_1 \ldots S_l)$ with the lowest weight must lead through one edge in each $CE_i$, or it would not be a connecting path for $(S_1 \ldots S_l)$ at all. The weight of each subpath sp of p having its first edge from $CE_i$ and TERMINAL(sp) equal to some $\mathrm{LEFT\_VTX}(e_2)$, where $e_2 \in CE_{i+1}$, is always minimal. If it were not, this would contradict the fact that p has minimal weight, because we would be able to construct a path p′ with a weight lower than that of p.
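The notion of a connecting path and of the weight of a sequence of segments from this section can be illustrated by a Python sketch. Instead of enumerating the set CPs, the sketch runs a Dijkstra-style search over pairs (vertex, position in the sequence); this is a swapped-in technique chosen for brevity, not the construction of Definition 2.3.2. The function name, the data layout and the example graph are assumptions, and the example only resembles the situation of Figure 2.6 rather than reproducing it.

```python
import heapq


def sequence_weight(edges, segment_of, weight, sequence):
    """Minimal weight of a connecting path of `sequence`, or None if none exists."""
    if len(sequence) < 2:
        return None  # a single segment has no connecting path
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    last = len(sequence) - 1
    # State (cost, v, i): a partial path ending in vertex v of sequence[i].
    heap = [(weight[v], v, 0) for v in segment_of if segment_of[v] == sequence[0]]
    heapq.heapify(heap)
    done = set()
    while heap:
        cost, v, i = heapq.heappop(heap)
        if (v, i) in done:
            continue
        done.add((v, i))
        if i == last:
            return cost  # the last edge entered the final segment
        for u in succ.get(v, []):
            if segment_of[u] == sequence[i + 1]:
                heapq.heappush(heap, (cost + weight[u], u, i + 1))
            elif i > 0 and segment_of[u] == sequence[i]:
                # Inner segments may be traversed by several edges; the
                # first edge must leave the first segment immediately.
                heapq.heappush(heap, (cost + weight[u], u, i))
    return None


# Hypothetical graph: all common edge sets are non-empty, yet b and c in S6
# are not connected, so (S5 S6 S4) has no connecting path.
segment_of = {"a": "S5", "b": "S6", "c": "S6", "d": "S4"}
weight = {v: 1 for v in segment_of}
edges = [("a", "b"), ("c", "d")]
no_path = sequence_weight(edges, segment_of, weight, ["S5", "S6", "S4"])
with_path = sequence_weight(edges + [("b", "c")], segment_of, weight, ["S5", "S6", "S4"])
```

Without the edge (b, c) the sequence has no weight at all; once that edge is added, the connecting path visits a, b, c and d, so under unit weights the sequence has weight 4.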

2.3.3 Imposing a Weight Limit l

In this section we explore the length and weight properties of paths in G and of their proper sequences of segments in SG(G). As already mentioned, there is a difference between the notions of length and weight of a path and their counterparts for a sequence of segments. In the case of a sequence of segments, the length is the length of the sequence, i.e. the number of segments in the sequence, while the weight of the sequence of segments is the minimal weight of its connecting paths.

In Lemma 3 we prove that the length of a path in G is always greater than or equal to the length of its proper sequence of segments representation.

Lemma 3. If a graph G = (V, E) has a segmentation S(G) that forms a graph SG(G), the length of a path in SG(G) that represents a proper segment sequence of a path p is always less than or equal to the length of p.


Proof. From Lemma 1 we know that each path in G has its proper segment sequence representation. From the proof of this lemma we know that during the transformation of the path in G into a path in SG(G) we omit zero or more edges. Some edges from G are also replaced by edges from SG(G), always one edge for another. This implies that the resulting path in SG(G), i.e. the sequence of segments, can be at most as long as the path in G.

Lemma 3 assures us that the encoding represented by the graph segmentation is contractive. In other words, the path representation in the segment graph always gets smaller or remains the same and never gets bigger. In Corollary 1 we prove a similar contractive relation between the weight of a path p and the weight of its proper sequence of segments.

Corollary 1. If a graph G = (V, E) has a segmentation S(G) that forms a graph SG(G), the weight of the segment sequence representing a path p is always less than or equal to the weight of p.

Proof. From Lemma 1 we know that each path in G has exactly one proper sequence representation. Because the proper segment sequence is unique, the path p can be built from a connecting path by adding a prefix and a suffix: $p = (v_1 e_1 v_2 e_2 \ldots e_n v_{n+1})$, $S(p) = (S_1 \ldots S_l)$ : $\exists i, j : 1 \leq i \leq j \leq n + 1 : (v_1 e_1 v_2 \ldots v_i e_i v_{i+1} \ldots v_{j-1} e_{j-1} v_j \ldots v_{n+1}) \wedge \{v_1, \ldots v_i\} \subseteq S_1 \wedge \{v_j, \ldots v_{n+1}\} \subseteq S_l \wedge (v_i e_i v_{i+1} \ldots v_{j-1} e_{j-1} v_j) \in (S_1 \ldots S_l)$. Then the weight of p is $w(p) = w((v_1 e_1 v_2 \ldots v_i)) + w(v_i e_i v_{i+1} \ldots v_{j-1} e_{j-1} v_j) + w(v_j e_j v_{j+1} \ldots v_{n+1})$. For the weight of p it holds that $w(p) \geq w((v_1 e_1 v_2 \ldots v_i)) + \|(S_1 \ldots S_l)\| + w(v_j e_j v_{j+1} \ldots v_{n+1})$, from which we can conclude that $w(p) \geq \|(S_1 \ldots S_l)\|$, because the weights are non-negative. The values $w(p)$ and $\|(S_1 \ldots S_l)\|$ can be equal when both the prefix and the suffix are of zero length.

Graphs can represent large structures having thousands of vertices and edges. In such a large graph, paths can have great lengths. Let us assume that we are interested only in paths up to a certain length, since a path becomes less interesting to us as its length increases. Lemma 4 proves that the sequences of segments with weight up to a limit l represent all paths in G having weight less than or equal to that limit.

Lemma 4. If a graph G = (V, E) has a segmentation S(G) that forms a graph SG(G) then for a limit l, segment sequences in S(G) having weight ≤ l represent all paths in G that have weight ≤ l.

Proof. Let us assume that there is a path p with $w(p) \leq l$ that is not present in the result represented by the segment sequences with $\|(S_1 \ldots S_l)\| \leq l$. This would imply that $\|S(p)\| > w(p)$, but this contradicts Corollary 1.

The last corollary in this section states an interesting assertion about the paths that are encoded into SG(G). It declares that not only the paths with weight less than or equal to a limit l are encoded, but also a certain number of paths whose weight exceeds l.

Corollary 2. All possible sequences of segments with weight ≤ l represent all paths with weights ≤ l and some paths with weight > l.

Proof. Immediately from the proofs of Lemma 1, Corollary 1 and Lemma 4.

As we will see in Chapter 5, the number of paths whose weight is greater than the limit l and that are represented by proper sequences of segments with weights lower than the limit l depends heavily on the way the graph has been divided into segments, and the number of such paths longer than l is considerable.

2.3.4 Iteration Step

From our point of view, the iteration step means the repeated application of graph segmentation to the segment graph of some graph G. We follow the idea that by applying the segmentation method to an already segmented graph we get an even simpler graph, in terms of a lower number of vertices and edges, that still has properties similar to those of the original graph G.

Figure 2.8: Demonstration of a segment weight assignment. The annotation in the figure reads: $\|(ABC)\| = 3$, $\|(ABX)\| = 4 \Rightarrow w(B) = 1$ or $2$?

Intuitively, the recursive application of graph segmentation is possible, but if we consider the weights of vertices and segment sequences, we come across certain problems that have to be solved. Specifically, the weight of a segment is problematic. Since each vertex in SG(G) represents some collapsed segment of G, which comprises a set of vertices and edges, what weight should be assigned to each vertex in SG(G), i.e. to each segment? At first glance, it would be logical to assign each segment the minimal weight of a path by which the segment can be traversed. Nonetheless, as we see in Figure 2.8, the weight of such a path, and therefore the weight of the segment, depends on the context in which the segment is traversed. In the example in Figure 2.8, the minimal path traversing the segment B in the context of segments A and C has weight 1, but the minimal weight of a path that traverses the same segment B in the context of segments A and X is 2. Which of those two numbers should be assigned as the weight of the segment B? Either way, the weights of sequences of segments in SG(SG(G)) would be inaccurate with respect to the weights of paths in G. The solution that we propose is not to assign weights to segments at all and to let the weight of the connecting sequence of segments for SG(SG(G)) be computed from the base weights of vertices in G.

Therefore we have to slightly alter the definition of the weight of a sequence of segments. To make the iteration step possible, we also define the connecting segment sequence in Definition 2.3.3.

Definition 2.3.3 (Altered weight definitions). Altered weight definitions for the iteration step:

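• $G = (V, E)$, $G' = SG(G) = (S(G), E')$, $G'' = SG(SG(G)) = (S(S(G)), E'')$

• Connecting segment sequence $(A_1 \ldots A_l) \in G'$ in $(S_1 \ldots S_l) \in G''$: $(A_1 \ldots A_l) \in (S_1 \ldots S_l) : e_1 \in CE_1 \wedge e_n \in CE_{l-1} \wedge \exists i_2, i_3, \ldots i_{l-1} : 1 < i_2 < i_3 < \ldots < i_{l-1} < n : \{e_2, \ldots e_{i_2}\} \subseteq E_{S_2} \wedge \{e_{i_2}, \ldots e_{i_3}\} \subseteq E_{S_3} \wedge \ldots \wedge \{e_{i_{l-2}}, \ldots e_{i_{l-1}}\} \subseteq E_{S_{l-1}}$

• Set of weights of a sequence of segments:

$|(S_1 \ldots S_l)| = \begin{cases} \{w(p) \mid p \in (S_1 \ldots S_l)\}, & S \in S(G) \\ \{|(A_1 \ldots A_l)| \mid (A_1 \ldots A_l) \in (S_1 \ldots S_l)\}, & S \in S(S(G)) \end{cases}$

• Weight of a sequence of segments: $\|(S_1 \ldots S_l)\| = \min(|(S_1 \ldots S_l)|)$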

First, in Lemma 5 we prove that the repeated application of graph segmentation encodes the paths in a graph G into the graph SG(SG(G)).

Lemma 5. Let a graph G be segmented by a segmentation S(G) that forms a graph G′ = SG(G), which in turn has a segmentation S(S(G)) and a graph representation G″ = SG(SG(G)). Any path in G can be represented by some segment sequence in S(S(G)).

Proof. From Lemma 1 we know that any path p in G has a proper segment sequence S(p) in S(G) and that this representation is unique. If we apply the same lemma again, we get the proper segment sequence S(S(p)). This representation is also unique due to the segmentation definitions.


Consequently, in Lemma 6 we prove that the multiple application of the graph segmentation does not spoil the property of the imposed limit l of the graph segmentation.

Lemma 6. If a graph G = (V,E) has a segmentation S(G) that forms a graph SG(G) then for a limit l, segment sequences in S(S(G)) having weight ≤ l represent all segment sequences in S(G) having weight ≤ l that represent all paths in G that have weight ≤ l.

Proof. From Lemma 5 we know that all paths in G can be represented by a segment sequence from S(S(G)). Let us assume that there is a path p with $w(p) \leq l$ that is not present in the result represented by the segment sequences with $\|(S_1 \ldots S_l)\| \leq l$ from S(S(G)). That contradicts Corollary 1 applied to G and S(G) and to G′ = SG(G) and S(G′). The only difference between G and G′ is that the vertices in G′ do not have weights, because the weight depends on the context in which the segment (vertex) is used in a segment sequence (path); yet the weight of a sequence of segments is computed recursively from the base weights of the vertices in G.

Hence, when the segmentation method is applied recursively n times to the graph G, we denote the final segmentation $S^n(G)$ and the respective segment graph $SG^n(G)$, so that we can refer to these segmentations and their segment graphs. Thus the segmentation and the segment graph after the i-th application are $S^i(G)$ and $SG^i(G)$ respectively, where $1 \leq i \leq n$.

Chapter 3 Indices for Graph Structured Data and Related Work

In this chapter, we provide an overview of existing indexing techniques for different graph mining purposes and of algorithmic approaches that could be used to answer different kinds of queries issued on graph structured data. The chapter is organized into two parts. The first part presents various graph and matrix algorithms. Next, an insight into known indexing structures is presented. All the presented time and space complexities are adopted from the source papers.

3.1 Algorithmic Approaches

This section presents the algorithmic approaches that could be used for answering reachability queries and for implementing the ρ-operators presented in the previous chapter.

3.1.1 Graph Algorithms

The work of Tarjan [71] demonstrates a mapping of the problem of finding path expressions to all sorts of other path problems, e.g. the shortest path search problem. This means that once a sufficiently efficient solution to the problem of path expressions is found, one also has a sufficiently efficient solution to all sorts of other path problems. Therefore, when looking for techniques that answer path queries on graph structured data by means of graph algorithms, it is sufficient to consider the solutions to the problem of searching for path expressions.


Single Source Path Expression Problem

The most fundamental path expression problem is the single source path expression problem. Given a graph G = (V, E) and a distinguished source vertex s, for each vertex v find a regular expression P(s, v) which represents all paths from s to v in G. It is demonstrated in [71] that by reinterpreting the operations ∪, . and ∗ used to construct the regular expressions, the solution of the single source path expression problem can be used to solve other kinds of path problems. Therefore, we introduce an algorithm that solves the single source path expression problem and take it as a universal solution to the problem of searching Semantic Associations through a graph algorithm approach.

The formal description of this algorithm can be found in [70]. Basically, the algorithm can be divided into several phases. In the first phase, the nodes of the graph are topologically ordered with respect to the chosen start node. The procedure of topological ordering can be found for example in [43]. Put simply, it assigns lower numbers to nodes from which no edge leaves and greater numbers to nodes into which no edge enters, keeping the greatest number reserved for the starting node. The nodes are then linearly ordered according to the numbers assigned to them.

Once the nodes are properly ordered, the next phase, the decomposition of the graph, is applied. The decomposition described in [48] is used. It constructs a dominator tree of the original graph. The domination relation between a pair of nodes says that node a dominates node b when a lies on every path leading to b. The dominator tree is then used together with the original graph to build a derived graph in which the domination in the original graph is represented in a simplified way. The edges of the derived graph are referred to as derived edges. The strong components in the derived graph are called dominator strong components.
A strong component is defined as a maximal subgraph where each node is reachable from every other node.

In the next phase, the algorithm uses Gaussian elimination to compute a path sequence for each dominator strong component of the derived graph, and combines these path sequences to form a path sequence for the original graph. The path sequence for the original graph is then used to compute the path expressions from the starting node, representing all paths between this node and every other node in the graph. The path sequence for a directed graph G is a sequence of path expressions sufficient to find any path in G by combining the proper expressions in the sequence.

The computational complexity of the outlined algorithm is O(m log n) + l, where l is the total length of the path sequences for the dominator strong components, n is the number of nodes in the original graph and m is the number of edges.

The algorithm solving the single source path expression problem has been implemented and further studied in [31]. The search complexity of the implementation confirmed the anticipated computational complexity, yet it revealed a major bottleneck of this approach: the space complexity of $O(n^2)$, which makes the approach unusable for large graphs. The quadratic space complexity is attributed to an internal representation of the graph by an adjacency matrix that is used to build the path expression for the inspected pair of vertices.
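The domination relation used in the decomposition phase above (node a dominates node b when a lies on every path from the start node to b) can be made concrete with a short Python sketch. It uses the simple iterative data-flow formulation, which is an assumption made for readability and is not claimed to be the decomposition procedure of [48]; the function name and the example graph are hypothetical.

```python
def dominator_sets(succ, start):
    """Map every node reachable from start to the set of its dominators."""
    # Collect the reachable nodes together with their predecessors.
    nodes, stack = {start}, [start]
    pred = {start: set()}
    while stack:
        u = stack.pop()
        for v in succ.get(u, ()):
            pred.setdefault(v, set()).add(u)
            if v not in nodes:
                nodes.add(v)
                stack.append(v)
    # Start from "everything dominates everything" and shrink the sets.
    dom = {v: set(nodes) for v in nodes}
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for v in nodes:
            if v == start:
                continue
            # v is dominated by itself and by whatever dominates
            # all of its predecessors.
            new = {v} | set.intersection(*(dom[p] for p in pred[v]))
            if new != dom[v]:
                dom[v] = new
                changed = True
    return dom


# Hypothetical example: two alternative routes from s to c, then on to d.
succ = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}
dom = dominator_sets(succ, "s")
```

In the example, neither a nor b dominates c, because c can be reached through either of them, while c dominates d.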

3.1.2 Transitive Closure Computation Algorithms

In this subsection we introduce algorithmic approaches to answering reachability queries through the computation of the transitive closure of a binary relation. Roughly, three different kinds of algorithms used for the transitive closure computation are presented. The presented algorithms differ in the means they use for the computation: the matrix representation of the relation, the graph representation, or a combination of both.

Matrix-based Direct Algorithms

The algorithms of this type compute the transitive closure of a binary relation using the matrix representation of the relation, called the adjacency matrix. An n × n adjacency matrix of a graph having n nodes is a matrix of elements $a_{ij}$, where $a_{ij}$ has the value 1 if there is an arc from i to j and 0 otherwise. The Warshall algorithm presented in [75] forms the basis of all matrix-based algorithms and computes the transitive closure as follows: the algorithm traverses the matrix from the top left corner to the bottom right one and processes each $a_{ij}$. Processing an element involves examining whether $a_{ij}$ is 1 and, if it is, making every successor of j a successor of i. Thus the Warshall algorithm processes each element of the matrix exactly once. It has been shown, for example in [2], that the matrix elements can be processed in any order that keeps the following two constraints true:

1. For all i, j, k, the processing of the element $a_{ik}$ precedes the processing of the element $a_{ij}$, if k < j

2. For all i, j, k, the processing of the element $a_{jk}$ precedes the processing of the element $a_{ij}$, if k < j

Various processing orders thus lead to a whole family of Warshall-derived algorithms that have been designed for many purposes. For example, the algorithm introduced in [2] is designed to reduce the I/O traffic between disk and memory as much as possible.
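The element processing step described above can be sketched in Python. The sketch assumes the classic formulation of the Warshall algorithm, in which the elements are processed column by column (pivot order); this particular order is a choice made for the sketch and is not taken from [75] or [2]. The function name and the example matrix are hypothetical.

```python
def warshall(adj):
    """Transitive closure of a 0/1 adjacency matrix given as a list of lists."""
    n = len(adj)
    closure = [row[:] for row in adj]  # keep the input matrix untouched
    for k in range(n):                 # pivot column
        for i in range(n):
            if closure[i][k]:
                # Element a_ik is 1: every successor of k becomes
                # a successor of i.
                for j in range(n):
                    if closure[k][j]:
                        closure[i][j] = 1
    return closure


# Hypothetical example: a chain 0 -> 1 -> 2 -> 3.
chain = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
closure = warshall(chain)
```

For the chain, node 0 ends up with the successors 1, 2 and 3, and node 3 with none.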

Graph-based Direct Algorithms

The algorithms that compute the transitive closure of a graph work under the assumption that the input graph is directed and acyclic. Because this assumption is very limiting at first sight, we should mention the work of Tarjan presented in [69], which takes an arbitrary directed graph as input and transforms it into a directed acyclic graph (DAG) by identifying its strongly connected components and replacing each of them with a single node.

Two key observations from [60] form the basis for these algorithms:

1. During the computation of a transitive closure of a DAG, if node A precedes node B in a topological sort of the nodes in the graph, additions to the successor set (the set of reachable vertices) of node A cannot affect the successor set of node B. We could, therefore, compute the successor set of B before that of A. This means that, when processing nodes in reverse topological order, one needs to add to a node only the successor sets of its immediate successors, since the latter would already have been expanded.

2. All nodes within a strongly connected component in a graph have identical reachability properties and the condensed graph obtained by collapsing all the nodes in each strongly connected component into a single node is acyclic.

A by-product of the previously mentioned work of Tarjan is also a topological order of the strongly connected components. Many works based on this effort, e.g. [36], show that Tarjan’s algorithm can be modified so that the successor lists are expanded while the strongly connected components are being determined, which allows the transitive closure to be computed.
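The first observation above can be sketched in Python for an input that is already a DAG. The depth-first traversal finishes nodes in reverse topological order, so each node only merges the already expanded successor sets of its immediate successors. The function name, the dictionary-based graph representation and the example graph are assumptions of this sketch; collapsing strongly connected components is not included, and the recursion does not terminate on cyclic input.

```python
def dag_successor_sets(succ):
    """Successor sets of a DAG given as node -> list of immediate successors."""
    closure = {}

    def visit(node):
        if node in closure:
            return
        reach = set()
        for nxt in succ.get(node, ()):
            visit(nxt)             # immediate successors are finished first
            reach.add(nxt)
            reach |= closure[nxt]  # their sets are already fully expanded
        closure[node] = reach

    for node in succ:
        visit(node)
    return closure


# Hypothetical DAG: a diamond a -> {b, c} -> d.
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
successors = dag_successor_sets(diamond)
```

In the diamond example the set of d is computed first and is then reused unchanged for b, c and a.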

Hybrid Algorithms

This type of algorithm is inspired by both of the preceding ideas. The algorithms presented in [3] basically work in two passes. In the first pass, a condensed graph for a given graph is obtained, in which each strongly connected component is collapsed into a single node. A topological sort of the condensed graph is obtained at the same time. In the second pass, the transitive closure is computed. It assumes that the nodes of the condensed graph are numbered according to the topological sort. Thus, the source node of any edge has a higher node number than its terminal node. Therefore, the adjacency matrix of G is a lower triangular one. Then, an algorithm similar to the Warshall one is used to process the elements in row order. However,

1. only those elements $a_{ij}$ which were 1 to begin with result in the addition of the successors of j to i (immediate successor optimization),

2. while a row is being processed, some elements which were 1 are treated as if they were 0,

3. matrix elements are processed from right to left.


Rule 1 implies that the algorithm behaves like the graph-based algorithms, since it adds only the immediate successors to the successor sets. However, unlike the graph-based algorithms, which are depth-first and recursive, the algorithm described in [3] is breadth-first, which makes it amenable to efficient blocking, i.e. tuning the algorithm for optimal I/O traffic.

All these observations lead to the conclusion, described in [3], that the hybrid algorithms outperform both former approaches.

3.1.3 Summary

As we can see, graph queries can be answered by combining algorithmic solutions of partial problems, but the main inconvenience is their unsuitable complexity on large graphs. Roughly speaking, reachability queries can be answered using the single-source shortest path algorithm with a computational complexity of O(|E|). However, such a high computational complexity makes it infeasible for efficient query processing. On the other hand, we could precompute and store the transitive closure of the binary relation, but then we face a high space complexity of O(|V|²). Therefore, we introduce some more efficient indexing techniques in the next section.

3.2 Graph Structured Data Indices

In this section, we present indexing techniques that support solving the various search problems in graph structured data presented in the previous sections. First, we describe an indexing technique that helps to process graph containment queries. Next, we present indexing techniques for efficient reachability query answering. Finally, we demonstrate existing indexing structures that can be used to implement the ρ-operators.

3.2.1 Graph containment queries

The work in [81] is based on discriminative frequent structure analysis and follows the idea of identifying small, important subgraphs in the graph database. These small subgraphs are referred to as fragments, and only those fragments that are both frequent and discriminative are indexed, see Definitions 3.2.1 and 3.2.2.

Definition 3.2.1 (Frequency). Given a graph set $D = \{g_1, g_2, \ldots, g_n\}$ and a graph $f$, the frequency of $f$ in $D$ is the percentage of graphs in $D$ containing $f$,

\[ \mathit{frequency}(f) = \frac{|D_f|}{|D|}, \]

where $D_f$ denotes all graphs from $D$ that contain $f$.

Definition 3.2.2 (Discriminative fragment). A fragment $x$ is discriminative with respect to $F$ if $D_x$ is much smaller than $\bigcap_{f \in F \wedge f \subseteq x} D_f$, where $F$ is a feature set of fragments.

Then, graph query processing can be divided into two steps. First, the graph database D is mined to identify the feature set F containing the most discriminative fragments for the whole database. This step is performed only once, before the actual query processing. Second, the query is processed as follows: the features in the query graph q are enumerated, then the graphs from the database containing those features are retrieved to form a candidate set

\[ C_q = \bigcap_{f \subseteq q \wedge f \in F} D_f, \]

and finally the containment of the query graph is verified in the graphs from the candidate set to prune the false positives.

The proposed indexing structure, which efficiently stores the feature set, is based on the idea of sequentializing graphs and holding the resulting sequences in a prefix tree. Translating a graph into a sequence, called a canonical label, follows the idea that if two fragments are the same then they must share the same canonical label. A traditional sequentializing method is to concatenate the rows or columns of the adjacency matrix of a graph into an integer sequence. However, other, more efficient methods were used in the implementation of the presented indexing structure [79, 80].

In the search phase, given a query graph q, the frequent fragments in q are enumerated and checked for presence in the indexing structure. Then the candidate set is created, and afterwards each graph from this set is verified by a subgraph isomorphism test.

In the experimental evaluation, the authors use as the baseline an indexing structure called GraphGrep [62], which follows the same idea of indexing small patterns in the graph but, contrary to the frequent discriminative fragments, uses paths. The frequent discriminative fragments produce better results on queries that form more complex graph structures, since their structure is more robust than the segmentation of the query graph into paths.

         a  b  c  d  e  f
    a    1  0  0  1  1  1
    b    0  1  0  1  0  1
    c    0  0  1  1  0  1
    d    0  0  0  1  0  0
    e    0  0  0  0  1  0
    f    0  0  0  0  0  1

Interval labels shown at the vertices of the graph: a: [1,8]; d: [2,3]; e: [4,5]; f: [6,7]; b: [9,10], [2,3], [6,7]; c: [11,12], [2,3], [6,7].

Figure 3.1: A short example of a directed acyclic graph (DAG) and its transitive closure matrix.
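The filtering step of the containment query can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the index is modelled as a plain dictionary from a canonical label to the set $D_f$ of graph identifiers, the enumeration of fragments in the query graph is assumed to happen elsewhere, and the labels and function names are hypothetical rather than those of [81].

```python
def frequency(fragment, index, database_size):
    """frequency(f) = |D_f| / |D| for an indexed fragment."""
    return len(index.get(fragment, set())) / database_size


def candidate_set(query_fragments, index, all_graph_ids):
    """Intersect D_f over all indexed fragments f found in the query."""
    candidates = set(all_graph_ids)
    for fragment in query_fragments:
        if fragment in index:  # only features f in F take part
            candidates &= index[fragment]
    return candidates


# Hypothetical index: canonical label -> ids of graphs containing it.
index = {"C-C": {1, 2, 3}, "C-O": {2, 3}, "C-N": {3, 4}}
result = candidate_set(["C-C", "C-O", "O-O"], index, {1, 2, 3, 4})
```

The graphs in `result` are only candidates; each of them would still have to pass the subgraph isomorphism test to prune the false positives.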

3.2.2 Reachability queries

Testing reachability between vertices in a graph relies on two widely known techniques. Both are based on assigning labels to the vertices of a graph, by which the reachability test is decided. The labeling schemes presented in this section are an interval based approach, inspired by a similar approach known for tree structures, a 2-hop approach, and finally a hierarchical labeling that combines the advantages of both previous approaches.

Interval Based Approach

The interval based approach labels the vertices of a graph with intervals whose containment relationships encode the ancestor-descendant relationships among nodes in a tree. This approach was originally designed for trees and was presented in [20]. Each node in a tree is assigned an interval [start(u), end(u)] by a depth-first tree traversal algorithm that maintains a counter, which is incremented whenever the traversal enters or leaves a node. The values start(u) and end(u) are the values of the counter when the traversal enters and leaves the node u, respectively. The labeling scheme then has the following property: given two tree nodes u and v, u −→∗ v ⇐⇒ [start(u), end(u)] ⊇ [start(v), end(v)].

Agrawal in [1] extends this approach to directed acyclic graphs. Each vertex in a graph is assigned a set of non-overlapping intervals L(u), and u −→∗ v ⇐⇒ every interval in L(v) is contained in some interval in L(u). The labeling is created by first identifying a spanning forest of the graph and assigning labels to the vertices in the forest. Next, to capture the reachability relationships through the edges outside the spanning forest, additional intervals are added to the labels in reverse topological order of the directed acyclic graph. Precisely, if (u, v) is an edge not in the spanning forest, then all intervals in L(v) are added into L(u).

Let us now present an example of the interval approach on the graph in Figure 3.1. Using the spanning tree rooted in vertex a, the labels for a, d, e and f would be [1, 8], [2, 3], [4, 5] and [6, 7], respectively. Since b and c belong to different spanning trees in the spanning forest, the labels they are assigned are [9, 10] and [11, 12], respectively. Both b and c also receive intervals from d and f, resulting in L(b) = {[9, 10], [2, 3], [6, 7]} and L(c) = {[11, 12], [2, 3], [6, 7]}.

The main drawback of the interval approach is that when the graph gets more complicated, the size of the label becomes linear in the graph size. For example, if d and f from our example have many non-spanning forest descendants, their labels become larger, which in turn makes the labels of their ancestors b and c larger as well. Since answering a reachability query involves checking interval containment for all intervals in the label, large labels can seriously impact the performance of query answering.
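The labeling of the example from Figure 3.1 can be reproduced with a short Python sketch. It is a simplified illustration rather than the algorithm of [1]: it assumes the graph is a DAG given as successor lists, that the hypothetical `roots` argument lists the roots of the spanning forest in the desired order, and, as a further assumption, it propagates intervals along every edge while skipping intervals that are already covered, so that ancestors in the spanning forest also inherit intervals gained through non-tree edges.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def contains(outer, inner):
    """True if interval `inner` lies within interval `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def interval_labels(graph, roots):
    labels, visited, counter = {}, set(), 0

    def visit(u):
        # Depth-first traversal; the counter grows on entering and leaving.
        nonlocal counter
        visited.add(u)
        counter += 1
        start = counter
        for v in graph[u]:
            if v not in visited:
                visit(v)
        counter += 1
        labels[u] = [(start, counter)]

    for root in roots:
        if root not in visited:
            visit(root)

    # Successor lists passed as "predecessors" give reverse topological order.
    for u in TopologicalSorter(graph).static_order():
        for v in graph[u]:
            for interval in labels[v]:
                if not any(contains(own, interval) for own in labels[u]):
                    labels[u].append(interval)
    return labels


def reaches(labels, u, v):
    """u reaches v iff every interval of L(v) lies in some interval of L(u)."""
    return all(any(contains(iu, iv) for iu in labels[u]) for iv in labels[v])


# The DAG of Figure 3.1, read off its transitive closure matrix.
graph = {"a": ["d", "e", "f"], "b": ["d", "f"], "c": ["d", "f"],
         "d": [], "e": [], "f": []}
labels = interval_labels(graph, roots=["a", "b", "c"])
```

With this input, `labels["b"]` contains [9, 10], [2, 3] and [6, 7], matching L(b) in the example above.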

[Figure: the vertices a, b and c reach the vertices d, e and f through the single vertex x.]

         x  d  e  f
    a    1  1  1  1
    b    1  1  1  1
    c    1  1  1  1
    x    1  1  1  1

Figure 3.2: Two sets of vertices connected through a single vertex and the corresponding submatrix in the transitive closure matrix.

2-hop Approach

The 2-hop approach was introduced by Cohen et al. in [18] as an alternative to the interval based approach presented above. This approach also assigns each vertex a label, and a reachability query is answered by consulting the labels of the two vertices involved.

For each vertex u, Cin(u) and Cout(u) denote the sets of vertices which can reach u and which are reachable from u, respectively. The key idea is that each vertex from Cin(u) can reach any vertex in Cout(u). For example, in Figure 3.2, Cin(x) = {a, b, c, x} and Cout(x) = {x, d, e, f}. Vertices drawn from these two sets form the in-label and the out-label of each particular vertex. Thus, u −→∗ v ⇐⇒ out-label(u) ∩ in-label(v) ≠ ∅. The reachability relationships from the vertices in Cin(x) to the vertices in Cout(x) can be succinctly encoded by adding x into the out-label of every vertex in Cin(x) and into the in-label of every vertex in Cout(x).

From the point of view of compressing the transitive closure matrix, the 2-hop approach aims to compress the submatrix induced by vertex x, which consists of all ones. The rows of such a submatrix correspond to the vertices in Cin(x) and the columns correspond to the vertices in Cout(x), as illustrated in Figure 3.2. Therefore, this approach works especially well on graphs with many well-connected hop vertices. The effectiveness depends on the area-to-circumference ratio of the submatrices identified for compression: the larger the area compared with the circumference, the better the compression ratio. The labeling algorithm repeatedly and greedily encodes the submatrix induced by the vertex x that maximizes

\[ \frac{|C_{in}(x)| \times |C_{out}(x)| - k}{|C_{in}(x)| + |C_{out}(x)|}, \]

where k denotes the number of ones in the submatrix that have been previously encoded.

The problem of this approximative algorithm is that, from the matrix compression point of view, it can miss many submatrices that are good candidates for compression, because it only considers submatrices induced by hop vertices. For example, in Figure 3.1 the submatrix spanned by rows {a, b, c} and columns {d, f} consists of all ones and would be a good candidate for compression, but it is not induced by any hop vertex.
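The encoding step and the reachability test for the situation in Figure 3.2 can be sketched as follows. The sketch covers only the label bookkeeping for a hop vertex that has already been chosen; the greedy selection of hop vertices from [18] is not implemented, and the class and method names are hypothetical.

```python
from collections import defaultdict


class TwoHopLabels:
    def __init__(self):
        self.in_label = defaultdict(set)
        self.out_label = defaultdict(set)

    def add_hop(self, x, c_in, c_out):
        """Encode all pairs (u, v) with u in Cin(x) and v in Cout(x)."""
        for u in c_in:
            self.out_label[u].add(x)
        for v in c_out:
            self.in_label[v].add(x)

    def reaches(self, u, v):
        """u reaches v iff out-label(u) and in-label(v) share a hop vertex."""
        return bool(self.out_label[u] & self.in_label[v])


labels = TwoHopLabels()
labels.add_hop("x", c_in={"a", "b", "c", "x"}, c_out={"x", "d", "e", "f"})
```

A single call to `add_hop` stores |Cin(x)| + |Cout(x)| = 8 label entries while covering the |Cin(x)| × |Cout(x)| = 16 ones of the submatrix, which is the area-to-circumference trade-off described above.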

Hierarchical Labeling of Sub-Structures

The hierarchical labeling of sub-structures (HLSS) introduced in [33] is inspired by both preceding labeling schemes. It tries to utilize their strengths and to suppress their weaknesses, which mainly lie in the types of graphs to which they can be applied. HLSS assigns the labels in two phases, each of which focuses on exploiting different characteristics of the input graph G.

The first phase is a tree-reachability reduction. It incorporates a preprocessing step which identifies the strongly connected components (SCCs) and collapses each into one representative vertex, since from the reachability point of view all vertices in an SCC are indistinguishable. The result of the preprocessing is denoted G′. The second step of this phase then identifies a spanning forest in the preprocessed graph and assigns interval labels to the vertices based on the spanning forest. During this step, the remainder graph Gr is also computed. Gr captures the remaining reachability relationships that are not encoded by the interval labels; it is defined in Definition 3.2.3. Specifically, a vertex can reach another one through portals in Gr. The vertices are labeled with their portals to enable the reachability checking.

Definition 3.2.3 (Portals and Remainder Graph). Given a spanning forest T of G′, a vertex u ∈ G′ is exposed if there exists an edge (u, v) (or (v, u)) in G′ such that u is not v's ancestor (or descendant, respectively) in T.

• The in-portal of u, $lp^{in}(u)$, is u's lowest exposed ancestor in T, if any.

• The out-portal of u, $lp^{out}(u)$, is the lowest common ancestor of all u's exposed descendants in T, if any.

[Figure: the preprocessed graph G′ with vertices a to k (left) and its remainder graph Gr (right).]

Figure 3.3: An example illustrating Definition 3.2.3. The solid edges belong to the spanning forest T. Grey vertices indicate exposed vertices. Vertex 6 is not exposed but it is the out-portal of 3 since it is the least common ancestor of all 3's exposed descendants, vertices 8 and 9.

An illustration of Definition 3.2.3 is depicted in Figure 3.3. The remainder graph Gr of G′ consists of the vertices that are in-portals or out-portals of some vertices in G′. There is an edge between two vertices u and v in Gr iff u −→∗ v in G′.

The second phase of the labeling algorithm, the remainder graph reachability encoding, compresses the reachability information in the remainder graph Gr that results from the first phase. It does so by assigning additional labels to portals so that the reachability among them can be checked efficiently by comparing their labels. Various techniques for assigning these labels are presented in [33], including an enhanced version of the 2-hop approach.


Since each of the vertices in G is assigned a number of different labels, the query algorithm is as follows. Given two vertices u and v, to test whether u −→∗ v, we first test whether u and v belong to a common SCC. If not, their interval labels are checked. If the answer is still not affirmative, their portal labels are looked up and the reachability of u's out-portal and v's in-portal is checked, which involves testing whether their remainder labels intersect. All steps take constant time except the last test, which takes time linear in the lengths of the remainder labels.

The labeling scheme proposed in [33] is evaluated on various synthetically generated graphs with different properties. The main aspect measured is the compression factor, which is defined as the total number of entries in the transitive closure matrix, i.e. the number of bits needed to represent the matrix, divided by the number of bits required by the labeling scheme. Using this factor, the results are compared to the interval labeling and 2-hop labeling approaches. The results show that in most cases HLSS outperforms the previously mentioned approaches by eliminating their weaknesses, namely their sensitivity to substructures they cannot handle.
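The three-stage query test described above can be sketched as a cascade of checks. The sketch assumes that all labels have already been computed; the record layout, the field names and the tiny hand-labeled example (a tree edge r → s plus a non-tree edge s → t, with 2-hop style remainder labels) are hypothetical and only serve to show the order of the tests, not the labeling algorithm of [33].

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class VertexLabels:
    scc: int                          # id of the strongly connected component
    interval: Tuple[int, int]         # interval label in the spanning forest
    in_portal: Optional[str] = None   # lowest exposed ancestor, if any
    out_portal: Optional[str] = None  # LCA of exposed descendants, if any


def reaches(u, v, labels, rem_out, rem_in):
    lu, lv = labels[u], labels[v]
    if lu.scc == lv.scc:                     # 1. same SCC
        return True
    if lu.interval[0] <= lv.interval[0] and lv.interval[1] <= lu.interval[1]:
        return True                          # 2. reachable inside the forest
    if lu.out_portal is None or lv.in_portal is None:
        return False                         # no portals to go through
    # 3. remainder labels of the two portals must intersect
    return bool(rem_out[lu.out_portal] & rem_in[lv.in_portal])


# Hand-assigned labels for the hypothetical example.
labels = {
    "r": VertexLabels(0, (1, 4), out_portal="s"),
    "s": VertexLabels(1, (2, 3), in_portal="s", out_portal="s"),
    "t": VertexLabels(2, (5, 6), in_portal="t", out_portal="t"),
}
rem_out = {"s": {"t"}, "t": {"t"}}
rem_in = {"s": {"s"}, "t": {"t"}}
```

Only the last step depends on the label sizes, which mirrors the remark that all steps but the final intersection take constant time.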

3.2.3 Pathway Oriented Indexing Schemes

To answer path queries represented by the ρ-path operator, path oriented indices can be used. In this section we present two approaches that design indexing structures for searching paths in RDF graphs.

Class and Path Index

This approach is described in [7] and employs a Schema Path Index (SPI), which provides quick access to all possible paths between any two classes of a given schema. It is represented by a matrix whose entries are all possible paths between each pair of classes in the particular schema. The paths in schemas are looked up first, since schemas are relatively small compared to their data component. However, paths in the schema do not necessarily give information about the actual paths that resources participate in, because multiple classification of entities is allowed. The paths need to be validated at the data layer to find which of them are actually present in the model base.

[Figure: two RDF schemas (classes Artist, Sculptor, Painter, Artifact, Sculpture, Painting, Museum and ExtResource, with properties such as creates, paints, sculpts, exhibited, title and technique) above a knowledge base of resources &r1 to &r11; the legend distinguishes subClassOf/subPropertyOf (is-a) links from typeOf (instance) links.]

Figure 3.4: An example of an RDF graph.

One issue that arose in this approach is that schema graphs by themselves do not provide complete information about the paths in which entities might participate at the data level. The reason is primarily the multiple classification of resources allowed by the RDF data model. Consequently, there may exist paths involving entities at the data layer that will not be found in the schema's SPI. For example, in Figure 3.4, the paints.exhibited.title sequence is not a sequence in either the left or the right schema, but it has an instance in the model base, i.e. between &r1 and the literal node "Reina Sofia Museum". The reason is that the node &r3 has membership in both the Museum and the Ext.Resource classes. This situation could be solved by creating an intermediate class node that collapses the Museum and Ext.Resource classes and consequently links the paints.exhibited sequence to the title sequence.

This approach deals with this situation by managing the connections between classes created by multiple classification separately. Basically, it migrates links at the data level to the schema by creating artificial nodes that collapse the two class nodes. It does not create these nodes explicitly; instead, it stores the information about the schemas that are linked due to multiple classification in an InterClass Index (ISI). These nodes represent candidates for collapsing when searching for paths. Then, when a query involves resources that belong to classes that do not have any paths between them or that belong to different schemas, the ISI is consulted to find candidate nodes that, if collapsed, may result in a path. If no candidate nodes exist, an empty set is returned as the result, because all possibilities have been considered.

This approach follows the simple idea of precomputing all paths between classes at the schema level and storing them in matrices. But what if even the schema part formed a large graph? In the case of a complete graph and precomputation of all possible acyclic paths, the computational complexity of the approach is $O(|V|^{|V|})$ and the storage complexity is $O(|E|^{|V|})$.

An Indexing Scheme for RDF and RDF Schema Based on Suffix Arrays

This indexing structure, originally presented in [51], is based on a similar work designed to index paths in XML documents [78]. The main extension over the previous work is the possibility of applying the suffix array approach also to DAGs that represent RDF graphs. As indicated, this indexing scheme is usable only for RDF graphs that are DAGs. This is limiting, since only a small portion of graphs, and of RDF graphs in particular, are DAGs. The authors perform their tests on one such RDF graph, a portion of WordNet [55]. WordNet is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each of which represents one underlying lexical concept.

This approach is based on suffix arrays [50], which are data structures for full-text search on documents built over one-dimensional character strings. The concept of the presented approach is to extract all path expressions from the RDF graph and then to create suffix arrays for all the extracted path expressions to make efficient query processing possible. The algorithm traverses the graph from each root vertex in a depth-first manner, generating all possible path expressions along the way. The algorithm presented to extract all path expressions for a given DAG therefore works with a computational complexity of O(|R||E|), where R denotes the set of all sources in G. An example of a DAG and its path expressions can be found in Figure 3.5. A definition of a suffix array for a DAG follows in Definition 3.2.4. Figure 3.6 illustrates the process of generating all suffixes from the two path expressions extracted in Figure 3.5. Subsequently, the suffixes are lexicographically sorted and the duplicates are removed.

[Figure: a DAG with vertices A to F and edges a: A → B, b: B → C, c: B → D, d: C → E, e: D → E and f: E → F.]

    position   1 2 3 4 5 6 7 8 9
    path 1     A.a.B.b.C.d.E.f.F
    path 2     A.a.B.c.D.e.E.f.F

Figure 3.5: A directed acyclic graph together with two extracted path expressions. The character . denotes a delimiter of characters in the path expression.

Definition 3.2.4 (Suffix array for DAGs). Let G = (V, E) be a directed acyclic graph. Let R ⊂ V be the set of vertices whose in-degree is equal to 0 and L ⊂ V be the set of vertices whose out-degree is equal to 0. We call R and L the roots and leaves, respectively.

Given a path in G from a root $s_{t,1} \in R$ to a leaf $s_{t,2k_t-1} \in L$, it can be represented as $p_t = s_{t,1}.s_{t,2}.\ldots.s_{t,2k_t-2}.s_{t,2k_t-1}$, where:

• $t$ is the identifier of the path

• $k_t$ is the length of the path

• $s_{t,2h-1} \in V : 1 \le h \le k_t$

• $s_{t,2h} = (s_{t,2h-1}, s_{t,2h+1}) \in E : 1 \le h \le k_t - 1$

    generated suffixes               sorted suffixes
    A.a.B.b.C.d.E.f.F : (1,1)        (1,1) : A.a.B.b.C.d.E.f.F
    a.B.b.C.d.E.f.F : (1,2)          (2,1) : A.a.B.c.D.e.E.f.F
    B.b.C.d.E.f.F : (1,3)            (1,3) : B.b.C.d.E.f.F
    b.C.d.E.f.F : (1,4)              (2,3) : B.c.D.e.E.f.F
    C.d.E.f.F : (1,5)                (1,5) : C.d.E.f.F
    d.E.f.F : (1,6)                  (2,5) : D.e.E.f.F
    E.f.F : (1,7)                    (1,7) : E.f.F
    f.F : (1,8)                      (2,7) : E.f.F
    F : (1,9)                        (1,9) : F
    A.a.B.c.D.e.E.f.F : (2,1)        (2,9) : F
    a.B.c.D.e.E.f.F : (2,2)          (1,2) : a.B.b.C.d.E.f.F
    B.c.D.e.E.f.F : (2,3)            (2,2) : a.B.c.D.e.E.f.F
    c.D.e.E.f.F : (2,4)              (1,4) : b.C.d.E.f.F
    D.e.E.f.F : (2,5)                (2,4) : c.D.e.E.f.F
    e.E.f.F : (2,6)                  (1,6) : d.E.f.F
    E.f.F : (2,7)                    (2,6) : e.E.f.F
    f.F : (2,8)                      (1,8) : f.F
    F : (2,9)                        (2,8) : f.F

Figure 3.6: All possible suffixes generated from the pair of extracted path expressions from Figure 3.5. The suffixes are then lexicographically ordered and duplicates are removed. The resulting suffix array is [1,1], [2,1], [1,3], [2,3], [1,5], [2,5], [1,7], [1,9], [1,2], [2,2], [1,4], [2,4], [1,6], [2,6], [1,8].

When a query is processed, binary searches on the suffix array are performed. For this reason, the computational complexity of the search algorithm is O(log₂(n + 1)). Since this approach indexes all the path expressions extracted from the RDF graph, various path queries can be answered using this indexing structure. Besides the queries defined by ρ-operators, these include path queries that specify only a starting vertex, a starting edge, or a combination of both.

As already mentioned, the experimental evaluation of this approach was conducted on a part of WordNet's RDF graph. The biggest portion of the graph comprised about 40,000 vertices with a very similar number of edges among them. These numbers indicate that the graphs used are very close to trees: very few edges need to be removed from the DAG to obtain a tree. This is also supported by the stated number of distinct paths that the biggest DAG contained, about 90,000, considering that the number of paths in a graph grows exponentially with the connectivity of its vertices.

During the evaluation, the proposed approach is compared to the RDF management tool RDFSuite [4] and its RDF store and management algorithms. In general, the results demonstrate that the indexing structure is about four times faster than the database tool and that this ratio grows in favor of the indexing structure with the size of the indexed and searched RDF graph.
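The construction illustrated in Figures 3.5 and 3.6 can be retraced with a minimal Python sketch. It is not the implementation of [51]: path expressions are kept as tuples of tokens instead of character strings, the array stores the sorted suffixes next to their (path, position) pairs, and all function names are hypothetical. The graph is the DAG read off the two path expressions of Figure 3.5.

```python
from bisect import bisect_left


def path_expressions(graph, roots):
    """Depth-first extraction of all root-to-leaf path expressions."""
    paths = []

    def walk(vertex, tokens):
        if not graph[vertex]:                 # a leaf closes the expression
            paths.append(tuple(tokens))
            return
        for edge, target in graph[vertex]:
            walk(target, tokens + [edge, target])

    for root in roots:
        walk(root, [root])
    return paths


def build_suffix_array(paths):
    """Sort all suffixes lexicographically and drop duplicate suffixes."""
    entries = sorted(
        (path[pos:], t + 1, pos + 1)          # 1-based ids as in Figure 3.6
        for t, path in enumerate(paths)
        for pos in range(len(path))
    )
    keys, positions = [], []
    for suffix, t, pos in entries:
        if not keys or keys[-1] != suffix:
            keys.append(suffix)
            positions.append((t, pos))
    return keys, positions


def find_prefix(keys, positions, prefix):
    """Binary search for the first suffix starting with `prefix`."""
    prefix = tuple(prefix)
    i = bisect_left(keys, prefix)
    hits = []
    while i < len(keys) and keys[i][:len(prefix)] == prefix:
        hits.append(positions[i])
        i += 1
    return hits


graph = {"A": [("a", "B")], "B": [("b", "C"), ("c", "D")],
         "C": [("d", "E")], "D": [("e", "E")], "E": [("f", "F")], "F": []}
paths = path_expressions(graph, roots=["A"])
keys, positions = build_suffix_array(paths)
```

For this input, `positions` reproduces the suffix array listed in the caption of Figure 3.6, and a query such as `find_prefix(keys, positions, ["B"])` returns the positions of all indexed paths that start in vertex B.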

3.2.4 Summary

In this section, we presented two indexing approaches that could be used to solve the problem of answering queries represented by the ρ-operators. The first one is a rather naive and direct application of brute force whose space requirements grow linearly with the size of the indexed graph. It takes advantage of the special structure of the RDF graph, which speeds up the search algorithm, in particular by exploiting the schema part of the RDF graph that represents the semantics the graph carries. However, we would like to extend such a solution to completely unstructured data represented by general directed graphs, where no schema or semantic information is present.

The latter proposal, which uses suffix arrays and path expressions, represents a better approach to indexing directed graphs. However, its main drawback, as we see it, is that its application is limited to a special kind of directed graph, DAGs. Although the authors present an extension of the indexing structure to general directed graphs, the source paper lacks an evaluation of such an extension. In our opinion, the number of paths, and thus the number of path expressions, that can be found in a general directed graph grows much faster (exponentially with the average connectivity of the vertices) than in DAGs. Therefore, we presume that the computational complexity of building the index and of the search algorithm will grow as well.

Hence, we would like to present our own indexing structure that would solve the problem of answering queries concerning the ρ-operators in general unstructured directed graphs.


Chapter 4 ρ-index

As we derived from the conclusions of the previous chapters, we would like to design our own indexing structure for efficient answering of graph queries concerning the ρ-operators. Therefore, in this chapter we present the design of our structure, called the ρ-index since it is aimed at easing the processing of ρ-operator queries.

The ρ-index is based on the idea of recursive graph simplification. The simplification method that we have chosen is based on the graph segmentation theory presented in Section 2.2.2. During the simplification phase, the information regarding the transformations is stored to enable the search algorithm to reconstruct the original information about paths in the indexed graph. This information is stored in matrices that represent the transitive closure of the subgraphs represented by the graph segments, and in inverted lists of the border edges that lie among segments.

In this way, a balanced tree structure is obtained in which nodes represent the graph segments and the individual levels represent the successive graph simplifications, i.e. the graph segmentations.

Hence, this chapter consists of two parts. The first part discusses the creation phase of the ρ-index and the auxiliary structures used during this phase. The second part presents the search algorithm that is used to answer ρ-operator queries.

4.1 Structure of the Index

The ρ-index utilizes two main structures that are introduced later in this section. The first one is a path type matrix, which stores the information about paths inside a segment. The second one is a set of inverted files, which store the information about the transitions among vertices that are assigned to different segments.

[Figure: a directed graph with vertices A, B, C, D and E and edges e1 to e5, together with its 5 × 5 path type matrix M. The non-empty entries of M, listed by row, are: row A: {(e1)}; row B: {(e2)}, {(e3)}; row C: {(e4)}; row D: {(e5)}; row E has no entries.]

Figure 4.1: A path type matrix M for a directed graph.

4.1.1 Path Type Matrix

Graph theory has shown that a very handy representation of a directed graph is its adjacency matrix. The adjacency matrix is a square matrix that has a column and a row for each vertex in the graph, and each entry $a_{ij}$ of the matrix contains either zero or one, depending on the existence of an edge between vertices i and j. Using matrix algebra we can comfortably study the graph's properties. For instance, if the adjacency matrix is raised to the power of two, each entry of the resulting matrix contains the number of paths of length two between the corresponding two vertices in the original graph. If the computation continued with higher powers, the results would contain the numbers of paths of arbitrary lengths.

The path type matrix differs from the usual number matrix in that each entry contains a set of paths instead of a single number representing their count. Each path from the set is represented by a list of edges from the set of edges E of the graph G. An example of a path type adjacency matrix for a graph G is in Figure 4.1. Certainly, we also have to modify the matrix operations × and + to be able to compute the transitive closure of the path type matrix. The usual matrix multiplication is defined as:

\[ (A \cdot B)_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj} \]

where $a_{ik}, b_{kj} \in \mathbb{N}$. In our type of matrix, each entry is supposed to contain a set of paths. The modification is straightforward: the multiplication and addition of numbers are replaced by path concatenation and set union, respectively. The formal definition is stated in Definition 4.1.1.

Definition 4.1.1 (Path type matrix multiplication). The modification of the product definition for the path type matrix of a graph G = (V, E):

\[ (A \times B)_{ij} = \bigcup_{k=1}^{n} a_{ik} \times b_{kj} \]

where $a_{ij} = \{p_1, p_2, \ldots, p_l\}$, $b_{ij} = \{q_1, q_2, \ldots, q_m\}$, $p, q \in \bigcup_{s=0}^{\infty} E^s$ and

\[ \{p_1, p_2, \ldots, p_l\} \times \{q_1, q_2, \ldots, q_m\} = \{p_1.q_1, p_1.q_2, \ldots, p_1.q_m, \ldots, p_l.q_1, p_l.q_2, \ldots, p_l.q_m\}, \]

where the operation . denotes the path concatenation defined as

\[ p.q = \begin{cases} \text{empty path} & |p| = 0 \vee |q| = 0 \\ (e_1 e_2 \ldots e_m h_1 h_2 \ldots h_n) & |p| \ge 1 \wedge |q| \ge 1 \end{cases} \]

where $p = (e_1 e_2 \ldots e_m)$, $q = (h_1 h_2 \ldots h_n)$, $1 \le i \le m : e_i \in E$, $1 \le j \le n : h_j \in E$.
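The modified operations can be tried out with a small Python sketch. It assumes a sparse representation in which a matrix is a dictionary from a (row, column) pair to a set of paths and each path is a tuple of edge names; empty entries are simply absent, so the empty-path case of the concatenation never arises. The representation, the function names and the five-edge toy graph are hypothetical and are not claimed to be the graph of Figure 4.1.

```python
from collections import defaultdict


def multiply(a, b):
    """(A x B)_ij = union over k of all concatenations p.q."""
    product = defaultdict(set)
    for (i, k), left_paths in a.items():
        for (k2, j), right_paths in b.items():
            if k == k2:
                product[(i, j)] |= {p + q for p in left_paths
                                    for q in right_paths}
    return dict(product)


def add(a, b):
    """Entry-wise set union, the path type counterpart of +."""
    total = defaultdict(set)
    for matrix in (a, b):
        for cell, paths in matrix.items():
            total[cell] |= paths
    return dict(total)


# Hypothetical graph: e1: A->B, e2: B->C, e3: B->D, e4: C->E, e5: D->E.
m = {("A", "B"): {("e1",)}, ("B", "C"): {("e2",)}, ("B", "D"): {("e3",)},
     ("C", "E"): {("e4",)}, ("D", "E"): {("e5",)}}
m2 = multiply(m, m)        # all paths of length two
up_to_two = add(m, m2)     # all paths of length one or two
```

In this toy graph the entry of `m2` for (B, E) holds two paths, (e2 e4) and (e3 e5), which shows why the entries have to be sets of paths rather than single paths.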

Figure 4.2 demonstrates the first iteration of the computation of the transitive closure of the path type matrix of the graph G from Figure 4.1. First, the matrix M² is computed. This matrix represents all paths of length two in the graph G. After adding M² to the matrix M, each entry of the resulting matrix M + M² contains all paths up to length two inclusive.

[Figure: the matrices M² and M + M² for the graph from Figure 4.1. Non-empty entries of M², listed by row: row A: {(e1e2)}, {(e1e3)}; row B: {(e3e5)}, {(e2e4)}. Non-empty entries of M + M², listed by row: row A: {(e1)}, {(e1e2)}, {(e1e3)}; row B: {(e3e5)}, {(e2)}, {(e3)}, {(e2e4)}; row C: {(e4)}; row D: {(e5)}.]

Figure 4.2: One step in the computation of the transitive closure of the path type matrix.

The main difficulty of the matrix representation of a graph is that its use is limited to fairly small graphs, since the matrix requires quadratic space and the matrix multiplication has cubic time complexity. Therefore, we introduced the graph segmentation to enable the use of the matrix approach on graphs of arbitrary size. Instead of representing the graph by one large matrix, the graph is divided into smaller pieces, segments, each of which is represented by its own path type matrix. The complexity of the transitive closure computation can thereby be lowered: e.g. the multiplication complexity of a matrix of size n = 1000 is n³ = 1,000,000,000, but if it is divided into 10 matrices of size 100, the multiplication complexity is 10 ∗ 100³ = 10,000,000. In general, $c \cdot (n/c)^3 = n^3/c^2 < n^3$ for every integer $c > 1$.

62 4. ρ-INDEX

S S 1 2 S1 S2

v1 v2

S3 v3 v4 SG(G)

S3

S1 OUT S2 OUT S3 IN v5 v6

v1 v2 ,v 4 v2 v3 v5 v3

v3 v4 ,v 5 v4 v6 v6 v4 IN IN S(G) v3 v2 v2 v1

v4 v1,v 3

Figure 4.3: A fragment of a graph segmentation accompanied with the segment graph and a segment transition tables for each of the participating segment.

4.1.2 Tables of Transitions Among Segments

To capture the information about the border edges among segments, tables of transitions are used. The table structure is inspired by the inverted file indexing structure well known in information retrieval. The general inverted file index [32] is used for full-text search over a collection of documents. It consists of two parts: a vocabulary containing all distinct values to be indexed and, for each item of the vocabulary, an inverted list of the identifiers of the documents that contain that particular item. Queries are evaluated by fetching the inverted lists for the query terms and intersecting them.

In ρ-index the notion of inverted files is used to capture the information about the border edges that lie between vertices assigned to different segments. Each segment is assigned two inverted files. The vocabulary of the first one consists of the names of the vertices that are terminal vertices of some border edge, i.e. it captures the incoming border edges; its inverted lists contain the names of the vertices that are the initial vertices of those edges. The other inverted file captures the border edges that point out of the particular segment.

Figure 4.3 demonstrates an example of the segment transition tables for three segments of a segmentation. Notice that the job of the pair of transition tables could be done using only one of the two kinds; the other one is present only to optimize the use of ρ-index.

This structure is consulted whenever information about the border edges leading from one segment to another is needed, e.g. all border edges pointing from S1 to S2, which is equivalent to EDGES_OUT(S1) ∩ EDGES_IN(S2). It is acquired by intersecting the set of key vertices of the out-table of segment S1 with each inverted list in the rows of the in-table of segment S2. Each time the intersection is non-empty, the edges are reconstructed from the key of the corresponding row of the in-table of S2 (the terminal vertex) and the result of the intersection (the initial vertices). For example, to acquire all border edges leading from S2 to S1 in Figure 4.3, the inverted list of each row of the in-table of S1 ({v2}) is intersected with the key vertices of the out-table of S2 ({v2, v4}): {v2, v4} ∩ {v2} = {v2}, which together with the row key v3 yields (v2, v3) = EDGES_OUT(S2) ∩ EDGES_IN(S1).

[Figure 4.4 lists the following tables. Each row gives a key vertex followed by its inverted list.]

S1 REVOUT: v2 -> v1; v4 -> v1, v3; v5 -> v3
S1 REVIN: v2 -> v3
S2 REVOUT: v3 -> v2; v6 -> v4
S2 REVIN: v1 -> v2, v4; v3 -> v4
S3 REVIN: v3 -> v5; v4 -> v6

Figure 4.4: Reversed transition tables for segments S1, S2 and S3 from Figure 4.3.

To speed up the edge reconstruction from the transition tables, an additional pair of reversed transition tables is assigned to each segment. Again, each takes care of one direction of the border edges of the particular segment. The difference is that the vocabulary does not consist of the vertices that belong to the particular segment but of the vertices that are assigned to the other segments. The inverted lists then contain the vertices of the segment to which the table is assigned. The retrieval of the border edges can be sped up by using the vocabulary of the reversed transition tables to limit the number of intersected inverted lists.

[Figure 4.5 depicts the levels of the structure from Level 0 (the indexed graph G) up to Level 3; its legend distinguishes a single vertex from a graph segment (a cluster of vertices).]

Figure 4.5: Visual outline of the ρ-index's structure.

For example, consider reconstructing the border edges pointing from S2 to S1 for the segments from Figure 4.3 using also the reversed transition tables depicted in Figure 4.4. The intersection needs to be made only with the inverted lists of those rows of S1.IN whose keys lie in S2.REVOUT.keySet ∩ S1.IN.keySet, instead of checking all the inverted lists there.
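The border edge reconstruction described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the tables are plain dictionaries filled with the rows as read from Figures 4.3 and 4.4, and the names `OUT`, `IN`, `REV_OUT`, `border_edges` and `border_edges_fast` are assumptions made for this example.

```python
# Transition tables as dictionaries: key vertex -> inverted list.
# Only the tables shown in the figures are filled in.
OUT = {
    "S1": {"v1": {"v2", "v4"}, "v3": {"v4", "v5"}},
    "S2": {"v2": {"v3"}, "v4": {"v6"}},
}
IN = {
    "S1": {"v3": {"v2"}},
    "S2": {"v2": {"v1"}, "v4": {"v1", "v3"}},
    "S3": {"v5": {"v3"}, "v6": {"v4"}},
}
REV_OUT = {
    "S1": {"v2": {"v1"}, "v4": {"v1", "v3"}, "v5": {"v3"}},
    "S2": {"v3": {"v2"}, "v6": {"v4"}},
}


def border_edges(src, dst):
    """Border edges from segment src to segment dst.

    Intersects the key set of the out-table of src with every
    inverted list of the in-table of dst.
    """
    out_keys = set(OUT.get(src, {}))
    edges = set()
    for terminal, initials in IN.get(dst, {}).items():
        for initial in out_keys & initials:
            edges.add((initial, terminal))
    return edges


def border_edges_fast(src, dst):
    """Same result, but only rows of the in-table whose key also
    appears in the reversed out-table of src are intersected."""
    out_keys = set(OUT.get(src, {}))
    in_table = IN.get(dst, {})
    candidate_rows = set(REV_OUT.get(src, {})) & set(in_table)
    edges = set()
    for terminal in candidate_rows:
        for initial in out_keys & in_table[terminal]:
            edges.add((initial, terminal))
    return edges


print(border_edges("S2", "S1"))
print(border_edges_fast("S1", "S2"))
```

For the tables above, the reversed table restricts the S2 to S1 lookup to the single row v3 of S1.IN, which matches the worked example in the text.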

4.1.3 ρ-index's Structure Outline

Using the graph segmentation, one large graph (G) can be transformed into a smaller simplified graph (SG(G)) by identifying a certain number of segments and collapsing them into single vertices. The size of a segment, meaning the number of vertices in it, can be easily controlled. If the transformed graph is still too big to be described by its path type matrix, the whole procedure can be applied again, taking the already simplified graph as its input. Thus a multilevel tree-like indexing structure is acquired where each node represents a graph on the lower level.


Hence, the creation of the ρ-index consists of a graph segmentation followed by a computation of the path type matrix for each segment. This step is repeated until a graph that can be described by its path type adjacency matrix is obtained. The size of the segments may vary on every level. Therefore, the maximal sizes of the segments at each level form the parameter settings of the ρ-index. They depend on the chosen method and strategy of putting the vertices into segments. The methods and strategies together with their settings are discussed in detail in Section 4.3.1. The visual outline of the indexing structure is in Figure 4.5.

4.2 Transcription Graph

Whenever a path query is processed using the ρ-index, a special graph structure, first introduced in [13], is used to represent the result throughout the query processing. It is a transcription graph in which the vertices and edges are replaced by subgraphs retrieved from the ρ-index. The vertices in the transcription graph are either the segments of the ρ-index or the vertices of the indexed graph. The vertices and edges of the indexed graph G are considered to form the lowest level of the ρ-index. The transcription graph contains four special kinds of edges:

transitionTo denotes the existence of a transition between segments or of an edge between vertices at the particular level of the ρ-index.

existsPathTo indicates an edge that can be replaced by a subgraph from ρ-index consisting of vertices at the same level and transitions between them, representing all sequences of segments lying between these two vertices. This edge may lie only between two vertices that are assigned to one common segment on a higher level.

belongsToRight represents the relationship of containment: a vertex from a lower level belongs to a vertex on a higher level. In other words, a vertex on a directly lower level is assigned to a segment on the higher level.

isSuperiorToRight is the opposite of the previous relationship: the vertex at a higher level contains the vertex on a lower level.

The formal definition of the transcription graph for the ρ-index of a graph G follows in Definition 4.2.1. Notice that the definition of the transcription graph contains level restrictions on the vertices between which the edges of the four described types can lie.

Definition 4.2.1 (Transcription graph). The transcription graph on a graph $G = (V, E)$ and its set of segmentations $S(G), \ldots, S^n(G)$ is $TG = (W, R)$, where $W = \bigcup_{k=0}^{n} W_k$ with $W_0 = V \times \mathbb{N}$ and $W_i = S^i(G) \times \mathbb{N}$ for $1 \le i \le n$, and $R = \{transitionTo, existsPathTo, belongsToRight, isSuperiorToRight\}$ is a set of binary edge relations where

• $transitionTo = \{(a, b) \mid 0 \le i \le n : a, b \in W_i \wedge a \ne b\}$

• $existsPathTo = \{(a, b) \mid (0 \le i \le n-1 : a, b \in W_i \wedge a, b \in S^{i+1}(G)) \vee (a, b \in W_n)\}$

• $belongsToRight = \{(a, b) \mid 0 \le i \le n-1 : a \in W_i \wedge b \in W_{i+1}\}$

• $isSuperiorToRight = \{(a, b) \mid 0 \le i \le n-1 : a \in W_{i+1} \wedge b \in W_i\}$

Figure 4.6 demonstrates the initial state of the transcription graph for a search of all paths between vertices 1 and 10 in a ρ-index having four levels. The vertices are assigned to their respective segments on the upper levels, and on the top-most level the existence of a path is supposed between the segments.

[Figure 4.6 depicts, from left to right, the vertices v1 (0,0,1), E (1,1,1), K (2,2,1), X (3,3,1), Y (4,3,2), L (5,2,2), F (6,1,2) and v10 (7,0,2). Each vertex is labelled with its name and the triple (order from left, level number, minimal path weight from start); the legend distinguishes the edge types transitionTo, existsPathTo, isSuperiorToRight and belongsToRight.]

Figure 4.6: Initial state of a transcription graph for a search for all paths between vertices 1 and 10.

Each vertex in the transcription graph is assigned two important numbers which are kept updated throughout the whole computation. The first number is the vertex's order from left and the other one is the minimal path weight between the start vertex and this particular vertex. The order from left makes it possible to keep the vertices sorted by their position in the transcription graph, as the algorithm processes the vertices strictly from left to right. Since the order from left is a floating point number, whenever the process needs to insert a vertex between two other vertices there is always a gap between their orders from left. Therefore, the transcription graph forms a special type of a directed graph referred to as a network, which is also a DAG: the vertices can be partially ordered by their order from left, and there is no edge pointing from a vertex with a greater order from left to a vertex with a lower one.

The concept of the transcription process is to take the initial transcription graph and transform it into a graph that consists only of vertices at the lowest level with all edges of the transitionTo type. To achieve this, all the segments and edges at the higher levels need to be processed (transcribed) into entities at lower levels until the stop condition of the algorithm is reached. The process takes the leftmost vertex that is at the same time the highest one having an existsPathTo edge assigned. Firstly, it replaces the existsPathTo edge by the respective subgraph of sequences of segments lying between the two vertices, in which all the edges are transitions. Secondly, each of the transitions concerning the particular vertex (the segment that is to be transformed into entities on the lower level) is replaced by a subgraph of segments at a lower level connected to this segment by the type of edges binding together segments on different levels. This transformation is demonstrated in Figure 4.7 as the first step of the process. The transition between the segments X and Z is transformed into a transition between segments L and K but on a lower level. This fact indicates that there exists a border edge between segments X and Y which originates in K and terminates in L, where K belongs to segment X and L is assigned to segment Y. If there existed any other border edges, they would also appear in the transcription graph at this point.

[Figure 4.7 shows three consecutive states (1., 2., 3.) of the transcription graph from Figure 4.6. In state 1 the segment Z (3.5,3,2) is inserted between X (3,3,1) and Y (4,3,3); in state 2 the vertices M and N appear; in state 3 the segment X is no longer present.]

Figure 4.7: Transcription of a transition to a lower level.

Once the segment has only edges connecting it to other segments on a lower level, it is transformed into lower level entities by connecting each entity on the left side with each entity on the right side by an existsPathTo edge going from left to right. This is demonstrated in Figure 4.7 by steps 2 and 3. The transformed segment and all its connecting edges to lower levels are removed from the graph.

4.2.1 Formal Transcription Methods

In the following text the above examples are summarized in the formal concepts of transcription of the special edges of the transcription graph.

[Figure 4.8 shows the edge from X (x,k,m) to Y (x+1,k,m+1) before the transformation and, after it, the vertices X (x,k,m), A1 (x+1,k,m+1), B1 (x+1,k,m+1), A2 (x+2,k,m+2), A3 (x+2,k,m+2), B2 (x+2,k,m+2), A4 (x+3,k,m+3), A5 (x+4,k,m+4) and Y (x+5,k,m+3).]

Figure 4.8: Transformation of an existsPathTo type of edge where the entry $p_{XY}$ of the path type matrix $P$ is $p_{XY} = \{(X A_1 A_2 A_4 A_5 Y), (X A_1 A_3 A_4 A_5 Y), (X B_1 A_2 A_4 A_5 Y), (X B_1 B_2 Y)\}$.

Transforming the existsPathTo Type of Edge

The formal concept of transforming the existsPathTo type of edge is depicted in Figure 4.8. The edge is replaced by a set of paths retrieved from the ρ-index, in particular from the transitive closure of the path type adjacency matrix of the common segment to which the two vertices are assigned. The common vertices of the paths from the set are shared (merged). Yet, the condition of increasing order from left must always be satisfied: for any edge, the order from left of the initial vertex is less than the order from left of the terminal vertex. If this condition cannot be satisfied, a duplicate vertex has to be used. A duplicate vertex is assigned its own order, weight and level numbers. The two instances of one vertex in the transcription graph are independent.

If the ρ-index returns an empty set of paths, the existsPathTo edge is removed and its initial and terminal vertices are tested for removal as well. If the initial vertex (X) has out-degree equal to 0, all of its incoming edges are removed from the transcription graph and the initial vertices of those edges are in turn tested for removal. This process continues recursively until a vertex with out-degree greater than 0 or the starting vertex is reached. The same process is applied to the terminal vertex (Y), only the removal progresses to the right: if the in-degree of Y is equal to 0, all of its outgoing edges are removed and their terminal vertices are tested for removal from the transcription graph.
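The recursive removal described above can be sketched as follows. This is a minimal illustration under the assumption that the transcription graph is stored as plain successor and predecessor sets; the class and method names are hypothetical, and edge types, orders and weights are left out.

```python
from collections import defaultdict


class TranscriptionGraph:
    """Bare-bones directed graph with a fixed start and end vertex."""

    def __init__(self, start, end):
        self.start, self.end = start, end
        self.succ = defaultdict(set)
        self.pred = defaultdict(set)

    def add_edge(self, a, b):
        self.succ[a].add(b)
        self.pred[b].add(a)

    def remove_edge(self, a, b):
        self.succ[a].discard(b)
        self.pred[b].discard(a)

    def edges(self):
        return {(a, b) for a, targets in self.succ.items() for b in targets}

    def prune_left(self, v):
        """If v has no outgoing edge left, drop its incoming edges and
        test the initial vertices of those edges in turn."""
        if v == self.start or self.succ[v]:
            return
        for p in list(self.pred[v]):
            self.remove_edge(p, v)
            self.prune_left(p)

    def prune_right(self, v):
        """Mirror image of prune_left, progressing to the right."""
        if v == self.end or self.pred[v]:
            return
        for s in list(self.succ[v]):
            self.remove_edge(v, s)
            self.prune_right(s)

    def drop_path_edge(self, x, y):
        """Called when the index returns no paths for the edge (x, y)."""
        self.remove_edge(x, y)
        self.prune_left(x)
        self.prune_right(y)


g = TranscriptionGraph("s", "e")
for a, b in [("s", "a"), ("a", "x"), ("x", "y"), ("y", "b"),
             ("b", "e"), ("s", "c"), ("c", "e")]:
    g.add_edge(a, b)
g.drop_path_edge("x", "y")
print(sorted(g.edges()))
```

In the usage example only the branch through c survives, because every vertex on the other branch becomes a dead end once the edge (x, y) is gone.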

[Figure 4.9 shows the transition from X (x,k,m) to Y (x+1,k,m+1) before the transformation and, after it, the vertices A1, ..., An, each labelled (x+1/3,k-1,m), and B1, ..., Bn', each labelled (x+2/3,k-1,m+1), placed between X and Y.]

Figure 4.9: Transformation of a transition where the border edges between the segments X and Y are EDGES_OUT(X) ∩ EDGES_IN(Y) = {(A1, B1), (A2, B2), ..., (An, Bn')}.

The minimal path weight of the terminal vertex (Y) of the existsPathTo edge needs to be recomputed because it could be affected by the insertion of the new vertices. To compute the minimal path weight of Y, a minimum is searched over all edges of which Y is the terminal vertex: for each such edge, the minimal path weight of its initial vertex is taken and possibly increased depending on the type of the edge:

• transitionTo - increase the weight by 1.

• existsPathTo - if the initial vertex has the same label as the terminal vertex, do not increase; otherwise increase the weight by 1.

• dependency edges - do not increase.
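The recomputation rule can be sketched as below. This is one possible reading of the rule, in which the increment is applied per incoming edge before the minimum is taken; the function names and the data layout (dictionaries of weights and labels) are assumptions made for the example.

```python
import math

DEPENDENCY_EDGES = {"belongsToRight", "isSuperiorToRight"}


def edge_increment(kind, initial_label, terminal_label):
    """Weight added when a path is extended over an edge of this kind."""
    if kind == "transitionTo":
        return 1
    if kind == "existsPathTo":
        return 0 if initial_label == terminal_label else 1
    if kind in DEPENDENCY_EDGES:
        return 0
    raise ValueError(f"unknown edge type: {kind}")


def recompute_min_weight(terminal, incoming, weight, label):
    """Minimal path weight of `terminal`.

    incoming: list of (initial_vertex, edge_kind) pairs ending in terminal
    weight:   current minimal path weight of every vertex
    label:    vertex label (duplicates of one vertex share a label)
    """
    candidates = [
        weight[u] + edge_increment(kind, label[u], label[terminal])
        for u, kind in incoming
    ]
    return min(candidates) if candidates else math.inf


# Vertex Y of Figure 4.8 with m = 0: reached from A5 (weight 4)
# and from B2 (weight 2), both over transitionTo edges.
weight = {"A5": 4, "B2": 2}
label = {"A5": "A5", "B2": "B2", "Y": "Y"}
incoming = [("A5", "transitionTo"), ("B2", "transitionTo")]
print(recompute_min_weight("Y", incoming, weight, label))
```

With m = 0 the result agrees with the label (x+5,k,m+3) of Y in Figure 4.8, since the shortest sequence runs through B1 and B2.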

Transforming the transitionTo Type of Edge

The formal concept of transforming the transitionTo type of edge is demonstrated in Figure 4.9. Firstly, the ρ-index is used to retrieve a subgraph consisting of the border edges of the segments X and Y lying on the lower level. Afterwards, the edge at the initial vertex X is replaced by a collection of isSuperiorToRight edges connecting it to the initial vertices of the retrieved border edges. At the vertex Y a collection of belongsToRight edges terminates; the initial vertices of that collection of edges are the terminal vertices of the retrieved border edges.

[Figure 4.10 shows the segment X (x,k,m) with the vertices A1, ..., An, each labelled (x,k-1,m), and B1, ..., Bn', each labelled (x+1,k-1,m), before the transformation and, after it, the vertices Ai labelled (x,k-1,m) and Bj labelled (x+1,k-1,m+1) without X.]

Figure 4.10: Transformation of a collection of dependency edges at one segment.

The order from left numbers of the newly added vertices are derived from the order from left $x$ of X by adding $\frac{1}{3}$ and $\frac{2}{3}$, respectively. In the case when the order from left number of segment Y is less than $x + \frac{1}{3}$, the general formulas $x + \frac{y - x}{3}$ and $x + \frac{2(y - x)}{3}$, where $y$ is the order from left of Y, can be used to compute the order from left numbers of the newly added vertices.
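The general formula for the order from left can be sketched as follows. The function name is hypothetical, and exact fractions are used here instead of floating point numbers only to keep the printed values exact.

```python
from fractions import Fraction


def border_vertex_orders(x, y):
    """Orders from left for the vertices inserted between X and Y.

    Returns the order of the initial vertices of the border edges and
    the order of their terminal vertices, i.e. x + (y - x) / 3 and
    x + 2 (y - x) / 3.
    """
    if not x < y:
        raise ValueError("X must lie strictly to the left of Y")
    gap = y - x
    return x + gap / 3, x + 2 * gap / 3


# Neighbouring segments one unit apart: the offsets are 1/3 and 2/3.
print(border_vertex_orders(Fraction(3), Fraction(4)))
# Segments only half a unit apart, as X (3) and Z (3.5) in Figure 4.7.
print(border_vertex_orders(Fraction(3), Fraction(7, 2)))
```

Because both new orders always fall strictly between x and y, the left-to-right ordering of the transcription graph is preserved by the insertion.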

Transforming the Dependency Types of Edges

When all the transitionTo and existsPathTo edges at the particular segment are removed and the segment does not have any dependencies to an upper level, its dependencies to the lower level can be transformed in the way demonstrated in Figure 4.10, and consequently the vertex and its dependencies can be removed from the graph. The dependency relations are replaced by a collection of existsPathTo edges connecting each vertex $A_i$ to each vertex $B_j$, where $1 \le i \le n$ and $1 \le j \le n'$. The semantics of this transformation is that within one segment X a path from a border vertex $A_i$ to a border vertex $B_j$ needs to be checked further in the transformation process. There is no need to update the order from left, but the minimal weight of a path from the start vertex has to be updated whenever $A_i \ne B_j$. The equality $A_i = B_j$ holds when the vertex is a terminal vertex of a border edge between the preceding segment and X and also an initial vertex of a border edge between X and the following segment in the segment sequence. The weight is updated by adding 1 to each $B_j$ that qualifies, since this is the smallest possible weight. The actual value is obtained when the existsPathTo edge is transformed.

Soft and Hard Minimal Path Weights

Each minimal weight at each vertex in the transcription graph is assigned a flag that distinguishes whether the number is soft or hard. A hard number was calculated only from hard numbers, with no dependency edge along the way. A soft number indicates that somewhere along the minimal-weight path between the starting vertex and the vertex with the soft minimal weight there is a vertex whose minimal weight is also flagged as soft. These flags were left out of the formal description of the transcription of the individual edge types for legibility, because the notion refers to the transcription graph as a whole and depends on the context of the particular vertex.

The hard minimal weight of a vertex can only be increased. An increase can arise only when a piece of the minimal path that was used to compute the minimal path weight is removed. The hard minimal weight denotes the actual minimal weight of a path between the starting vertex and the particular vertex. The soft minimal weight of a path, in contrast, is only informative and can be either increased or decreased. The strategy of using the hard and soft weight numbers is discussed in the following section.

4.2.2 Strategy of the Transcription Process

The transformation strategy and method are described in Algorithm 1. The algorithm refers to the transitivity of a vertex in a graph: a vertex is transitive if it is the terminal vertex of a belongsToRight edge and, simultaneously, the initial vertex of another edge of the same type. The transitive vertices in Figure 4.7 are E and K in steps 1 and 2. After the transformation done in step 2, only the vertex E remains transitive. By proceeding strictly from left to right we always process the leftmost existsPathTo or transitionTo type of edge, which reduces the left context as much as possible. By the left context we mean the set of vertices lying between the start vertex and the currently processed vertex. With the reduced context, each of the presented transformations involves the smallest possible number of vertices.

Algorithm 1 Transcription graph transformation algorithm.
PQ = priority queue of vertices ordered by their order from left number
add all vertices to PQ
while PQ is not empty do
    currentVertex = first(PQ)
    while currentVertex is transitive do
        currentVertex = getNext(PQ)
    end while
    for all currentEdge ∈ currentVertex.existsPathTo do
        Transform currentEdge
        Add each new vertex to PQ
    end for
    for all currentEdge ∈ currentVertex.transitionTo do
        Transform currentEdge
        Add each new vertex to PQ
    end for
    if currentVertex has only dependencies to lower level then
        Transform vertex's dependencies
        Remove vertex from PQ
        Remove vertex from graph
    end if
    if currentVertex is involved only in the transitionTo edges then
        Remove vertex from PQ
    end if
end while

Algorithm 1 stops when the priority queue is empty. Then, the final transcription graph contains only vertices from the lowest level (the indexed graph) and, among them, only transitionTo edges that represent the actual edges of the indexed graph G. The algorithm always stops since the transcription graph is a DAG and it is limited by the maximal path length l.

During the transformation process, the minimal weight of a path from the start vertex is used to limit the weight of the segment sequences that are retrieved from the index to replace the path edges in the transcription graph. It reflects the length of the already computed piece of a path from the start vertex to the particular vertex. At any time the algorithm uses only those segment sequences whose weight is at most the difference between the maximal length l of a desired path and the already computed piece of the result. This assures that the algorithm actually stops for any input: if it is not possible to reach the end vertex from a segment by a sequence of segments with a weight less than l, considering the already computed minimal length of a path, the whole branch is removed from the transcription graph.

When the process finishes, the resulting transcription graph represents a network of all paths initiated in the start vertex and terminated in the end vertex with a length lower than or equal to the predefined l, possibly together with some paths longer than l due to the nature of the graph segmentation. All of them are paths of the indexed graph. If there are no such paths between the start and end vertex, the resulting transcription graph has only two vertices and no edges.
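The weight limit applied to the retrieved segment sequences can be sketched as a simple filter. The representation of a retrieved sequence as a pair (sequence, weight) and the function name are assumptions made for this illustration.

```python
def usable_sequences(sequences, min_weight_so_far, limit):
    """Keep the segment sequences that still fit into the length limit.

    sequences:         list of (segment_sequence, weight) pairs
    min_weight_so_far: minimal path weight already accumulated at the
                       initial vertex of the replaced edge
    limit:             maximal path length l of the query
    """
    budget = limit - min_weight_so_far
    return [seq for seq, weight in sequences if weight <= budget]


retrieved = [(("X", "Z", "Y"), 2), (("X", "Y"), 1), (("X", "W", "Z", "Y"), 4)]
print(usable_sequences(retrieved, min_weight_so_far=2, limit=4))
```

When the budget is exhausted the filter returns an empty list, which corresponds to the case where the edge and its dead-end vertices are removed from the transcription graph.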

Maintaining and Utilizing the Soft and Hard Minimal Weights

If the above strategy of transcribing strictly from the left is applied using the definition of soft and hard weights, the following holds:

• a hard weight cannot become a soft weight again

• hard weights appear only at the vertices obtained from the lowest level of the ρ-index

Initially, all the weights of the minimal path in the transcription graph are set to soft; only the minimal weight of the starting vertex is set to hard.

The weight of the minimal path leading from the starting vertex to a given vertex needs to be updated whenever the minimal path could be changed by the transcription process, especially when vertices are removed from the transcription graph during the recursive removal of vertices. Each time a vertex is tested for removal (for having in-degree or out-degree equal to zero, respectively) and is not actually removed, it is put into a set of affected vertices whose minimal path weight needs to be recomputed. The set is ordered by the vertices' order from left. The minimal path weight of each affected vertex is computed in the way described in Section 4.2.1.


Algorithm 2 ρ-index creation algorithm.
Input: Graph G = (V, E), minSegmentSizes, maxSegmentSizes, weight limit l
Output: ρ-index for G
G = indexed graph G
for i = 0; i < size(minSegmentSizes); i++ do
    segment the input graph G with parameters minSegmentSizes[i] and maxSegmentSizes[i]
    create the segment graph SG(G)
    create the path type adjacency matrix for each segment S in S(G)
    compute the transitive closure of each path type matrix from the previous step with weight limit l
    create the transition tables for each segment S in S(G)
    G = SG(G)
end for
create a path type adjacency matrix for G
compute transitive closure of path type matrix from the previous step

4.3 ρ-index Creation Algorithm

In this section we describe the process of creating the ρ-index for an indexed graph G. Firstly, the segmentation methods implemented to cluster the vertices are presented. Secondly, the procedure of computing the transitive closure of the path type matrix is discussed in detail.

Algorithm 2 summarizes the creation of the ρ-index for the indexed graph G. Notice that the input of the procedure is, besides the indexed graph, a pair of arrays of integers that represent the minimal and maximal sizes of the created segments at each level of the indexing structure. The sizes of those arrays are equal and determine the height of the ρ-index. The path type matrix created in the last part of the algorithm, outside the loop, is the top-most matrix of the indexing structure. Therefore, the user controls the height of the structure, which is always equal to the size of the input arrays plus one level containing the top-most matrix. The size of the top-most matrix depends on the parameter settings, so it cannot be tightly controlled; yet, according to the experiments, it is predictable.

4.3.1 Graph Segmentation Methods and Strategies

Various ways of assigning the vertices to segments have been identified and studied. One of them was a graph to forest of trees transformation, whose result is a forest of trees, proposed in [8, 9]. A combination of vertex clustering and the graph to forest of trees transformation, together with its preliminary evaluation, can be found in [11]. Further implementation and evaluation showed that the graph to forest of trees transformation makes the resulting indexing structure very tangled, and the search algorithm therefore did not perform well. For this reason the graph segmentation was introduced, and this graph transformation is the basis of all the presented clustering methods.

General Vertex Clustering Method

The first presented method of assigning the vertices to segments is general vertex clustering. It is supposed to partition the set of vertices V into segments that all have roughly the same number of vertices assigned. Therefore, the input parameters of this clustering method are the minimal and maximal number of vertices assigned to one segment.

Firstly, the general vertex clustering method randomly chooses a first (leading) vertex and processes all its incoming and outgoing edges until the set of vertices assigned to this vertex is filled. All the vertices assigned to the leading vertex are removed from the graph. In each edge of a removed vertex, the removed vertex is replaced by the leading vertex. The process stops when there are no unprocessed vertices left in V. The resulting graph represents SG(G). The only difference is that there can be multiple edges between two vertices; the number of those edges is the number of border edges between the two particular segments.

This method has a second pass that checks whether all the acquired segments have a sufficient number of vertices assigned. If the number of vertices assigned to a segment does not reach the minimal number, the segment is forced to merge with its smallest neighboring segment. The result is a set of segments with an almost uniform number of assigned vertices.

A few other variants of this general method have been designed and implemented. They varied in a few details of the segmentation strategy and in the way they handled segments smaller than the minimal number of vertices. The experimental evaluation of the implemented variants showed no significant performance improvement compared to the general method. Nevertheless, we designed a variant that, during the first phase of the algorithm, took into account the best neighbor to be clustered with by counting the number of common edges lying between the two vertices. The vertex with the greatest number of common edges was then assigned to the same segment. The aim was to minimize the number of border edges between segments and to maximize the number of edges inside a segment, i.e. in the subgraph that the segment defines. Yet, even this method did not show any major speed-up over the previously mentioned clustering methods.
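The two passes of the general method can be sketched as follows. This is a simplified illustration, not the thesis implementation: the segment is grown breadth-first from the leading vertex over incoming and outgoing edges alike, the forced merge in the second pass is allowed to exceed the maximal size, and the function name and parameters are assumptions.

```python
import random


def general_vertex_clustering(vertices, edges, min_size, max_size, seed=0):
    """Partition vertices into segments of roughly uniform size."""
    rng = random.Random(seed)
    neighbours = {v: set() for v in vertices}
    for a, b in edges:  # edge direction is ignored while clustering
        neighbours[a].add(b)
        neighbours[b].add(a)

    # First pass: grow a segment around a randomly chosen leading vertex.
    unassigned = set(vertices)
    segments = []
    while unassigned:
        lead = rng.choice(sorted(unassigned))
        unassigned.remove(lead)
        segment, frontier = {lead}, [lead]
        while frontier and len(segment) < max_size:
            v = frontier.pop(0)
            for n in sorted(neighbours[v]):
                if n in unassigned and len(segment) < max_size:
                    unassigned.remove(n)
                    segment.add(n)
                    frontier.append(n)
        segments.append(segment)

    def adjacent(seg):
        touched = {n for v in seg for n in neighbours[v]} - seg
        return [s for s in segments if s is not seg and s & touched]

    # Second pass: merge undersized segments into their smallest neighbour.
    while True:
        mergeable = [(s, adjacent(s)) for s in segments if len(s) < min_size]
        mergeable = [(s, adj) for s, adj in mergeable if adj]
        if not mergeable:
            return segments
        seg, adj = mergeable[0]
        min(adj, key=len).update(seg)
        segments[:] = [s for s in segments if s is not seg]


path_vertices = list(range(1, 8))
path_edges = [(i, i + 1) for i in range(1, 7)]
print(general_vertex_clustering(path_vertices, path_edges, 2, 3))
```

The exact segments depend on the random choice of leading vertices, so only structural properties (a partition of V, no undersized segment with a neighbour) should be relied upon.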

Segmentation Using Topological Order

As presented in Chapter 2, directed acyclic graphs are a special kind of directed graphs. Due to the restrictive acyclicity condition, the structure of these graphs is closer to trees than that of arbitrary directed graphs. Sometimes the indexed graph is very close to being a DAG, so we designed segmentation methods that take advantage of the DAG-like structure of the indexed graph.

The problem that led us to design these special methods is that with the general clustering family of methods a cycle can appear in the segment graph SG(G) even where no cycle was present in the indexed graph G. This situation is demonstrated in Figure 4.11. Using the methods that take advantage of the topological order of the input graph, we would like to eliminate this kind of problem as much as possible.

Figure 4.11: Demonstration of a cycling segment sequence problem. An acyclic path (e1e2e3) in the indexed graph is represented by its cycling proper sequence of segments (S1S2S1S2).

If a graph is a DAG, a topological order of its vertices can be acquired. To recall, a topological order on a set of vertices means that there is no edge pointing from the jth vertex to the ith vertex where i < j. A topological order can be acquired from any input graph by ignoring certain edges that form cycles in the graph. The number of these problem edges determines how cyclic the graph is. If this number is low compared to the total number of edges in the indexed graph, one of the following methods can be used.

The idea of the two presented segmentation methods is to preserve the topological order of the vertices in the segment graph throughout the segmentation process. This means that the resulting segment graph should also be close to a DAG.

The first method takes the topologically ordered list of vertices and divides it into sublists, each having the size equal to the maximal number of vertices allowed in a segment. This method is straightforward and fast. The problem is that most of the time the set of a segment's edges remains empty, since adjacent vertices rarely appear close together in the topological order. This situation is presented in Figure 4.12. As we see, the topological order keeps vertices close together according to their distance from the origin.
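The first method can be sketched as below. The topological order is computed with Kahn's algorithm, which is a choice made for this example; the graph is a hypothetical tree-shaped DAG on nine vertices, and the sketch assumes that the cycle-forming edges have already been dropped.

```python
from collections import deque


def topological_order(vertices, edges):
    """Kahn's algorithm; raises ValueError if the graph has a cycle."""
    indegree = {v: 0 for v in vertices}
    succ = {v: [] for v in vertices}
    for a, b in edges:
        succ[a].append(b)
        indegree[b] += 1
    queue = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for n in succ[v]:
            indegree[n] -= 1
            if indegree[n] == 0:
                queue.append(n)
    if len(order) != len(vertices):
        raise ValueError("graph contains a cycle")
    return order


def segment_by_order(order, max_size):
    """Cut the ordered list into consecutive sublists of max_size."""
    return [order[i:i + max_size] for i in range(0, len(order), max_size)]


def inner_edges(segment, edges):
    """Edges with both endpoints inside the segment."""
    members = set(segment)
    return [(a, b) for a, b in edges if a in members and b in members]


# Hypothetical DAG: v1 is the origin, v2..v4 are its children.
VERTICES = [f"v{i}" for i in range(1, 10)]
EDGES = [("v1", "v2"), ("v1", "v3"), ("v1", "v4"), ("v2", "v5"),
         ("v2", "v6"), ("v3", "v7"), ("v4", "v8"), ("v4", "v9")]

for seg in segment_by_order(topological_order(VERTICES, EDGES), 3):
    print(seg, inner_edges(seg, EDGES))
```

On this example only the first segment contains any inner edges, which illustrates the weakness of the straightforward method discussed above.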

[Figure 4.12 shows a directed acyclic graph on the vertices v1, ..., v9 with the topological order (v1, v2, v3, v4, v5, v6, v7, v8, v9), the assignment of the vertices to the segments S1, S2 and S3, and the resulting segment graph SG(G).]

Figure 4.12: A topological order of a directed acyclic graph and the consequent graph segmentation according to the position of a vertex in the order. In the resulting segmentation only $E_{S_1} \ne \emptyset$.

The problem of putting adjacent vertices into one segment while respecting the topological order of the graph's vertices is solved by the second designed segmentation method. It is a breadth-first search method that selects a leading vertex and then chooses an edge to follow. This edge makes a span on the topologically ordered list of vertices. If there is no other path connecting the leading vertex with the potential vertex, they can be put together. If there is such a path, putting them together could break the topological order of the segments and therefore the acyclicity of the segment graph. The spanning principle is depicted in Figure 4.13. This segmentation method is computationally more expensive than the previous straightforward method: for each edge the method tries to follow, all the paths initiated in the leading vertex need to be checked as to whether they terminate in the potential vertex. Nonetheless, the topological order of G offers one optimization: if the topological order number of the leading vertex is i and that of the potential vertex is j, then all the vertices along each checked path must have a topological number k such that i < k < j.

[Figure 4.13 shows a directed acyclic graph on the vertices v1, ..., v10 with the topological order (v1, v2, v3, v4, v5, v6, v7, v8, v9, v10); the legend marks the leading vertex and the potential vertex.]

Figure 4.13: A visualization of the spanning problem. If the vertices v1 and v10 were assigned to one common segment and either of the vertices v4 or v9 were not, the resulting segment graph would not be a DAG and therefore the topological order would be spoiled.
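The span check with the topological-order optimization can be sketched as follows. The function name, the adjacency representation and the small example graph (modelled on the caption of Figure 4.13) are assumptions; the sketch looks for a path from the leading vertex to the potential vertex that passes through at least one intermediate vertex.

```python
def has_indirect_path(succ, position, lead, candidate):
    """True if a path lead -> ... -> candidate with at least one
    intermediate vertex exists.

    succ:     dict vertex -> iterable of successors
    position: dict vertex -> topological order number
    Only vertices whose position lies strictly between the positions
    of lead and candidate are explored.
    """
    i, j = position[lead], position[candidate]
    stack = [n for n in succ.get(lead, ()) if i < position[n] < j]
    seen = set(stack)
    while stack:
        v = stack.pop()
        for n in succ.get(v, ()):
            if n == candidate:
                return True
            if i < position[n] < j and n not in seen:
                seen.add(n)
                stack.append(n)
    return False


# Hypothetical graph: the edge v1 -> v10 spans the path v1 -> v4 -> v9 -> v10.
position = {f"v{k}": k for k in range(1, 11)}
succ = {"v1": ["v2", "v4", "v10"], "v4": ["v9"], "v9": ["v10"]}

print(has_indirect_path(succ, position, "v1", "v10"))  # spanning edge
print(has_indirect_path(succ, position, "v1", "v2"))   # safe to merge
```

In this reading, v1 and v10 should not be merged unless v4 and v9 join the same segment, whereas v1 and v2 can be put together directly.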

Summary

In this section we presented two different approaches to segmenting the indexed graph. The first approach emphasizes a relatively uniform decomposition of the vertices of an arbitrary directed graph into segments. The second approach takes advantage of the topological order of a DAG: it also takes into account the direction of the edges in the indexed graph and preserves the acyclicity of the DAG on the upper levels of the ρ-index.

The selection of the segmentation method influences not only the complexity of the ρ-index creation but also the complexity of the query processing. Other elements that influence the segmentation quality are the values of the particular parameter settings, meaning the minimal and maximal number of vertices assigned to one segment. These settings can differ at each level. Intuitively, setting small segment sizes creates a slim and high tree, whereas using a large size at the first level yields a wide and low tree. The evaluation of different parameter settings and how they affect the search itself is presented in Chapter 5.


4.3.2 Transitive Closure Computation of the Path Type Matrix

When the graph is partitioned into segments, the next step in the ρ-index creation procedure is the computation of the transitive closure of the segments' path type matrices. The computation is limited by the weight limit l for which the ρ-index is being created. Lemma 4 implies that it is necessary and sufficient to store all the sequences of segments whose weight is less than or equal to the imposed weight limit l in order to reconstruct all paths of length less than or equal to l in the indexed graph G.

After the ith iteration of the transitive closure computation, the usual number matrix reflects the number of paths of length < i. Intuitively, if we intended to compute the numbers of all paths shorter than l, we would perform l iterations of this computation. Yet, the usual number matrices do not distinguish between cycling paths and paths without cycles during the transitive closure computation. Each entry contains the number of all paths between the respective two vertices in the graph.

Since the problem of disconnected sequences of segments was discussed earlier and since such sequences should not appear in the ρ-index, the transitive closure of the path type matrix incorporates the computation of the minimal set of connecting paths for each sequence of segments that is to be stored in the ρ-index. Thus, the connectedness of the sequence of segments is tested, and this computation also yields its weight, which is the minimal weight of a connecting path of the sequence of segments, as stated in Lemma 2.

When a transitive closure of a relation is computed using the usual number matrix, the number of iterations is limited by the size of the matrix, because a path that is longer than the number of vertices in the graph is surely cyclic. When the maximal length of a path is given, the maximal iteration number equals the maximal path length.
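The remark that the usual number matrix does not distinguish cycling paths from paths without cycles can be illustrated with a small sketch. The function name and the convention of counting walks of length 1 up to max_len are assumptions made for this example only; the code simply accumulates powers of a 0/1 adjacency matrix.

```python
def count_walks_up_to(adj, max_len):
    """Sum the powers adj^1 .. adj^max_len of a 0/1 adjacency matrix.
    Entry [i][j] of the result is the number of walks from i to j with
    1..max_len edges, cyclic walks included."""
    n = len(adj)
    total = [[0] * n for _ in range(n)]
    power = [[int(i == j) for j in range(n)] for i in range(n)]  # adj^0
    for _ in range(max_len):
        power = [[sum(power[i][k] * adj[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
    return total


# Two vertices joined in both directions: 0 -> 1 and 1 -> 0.
adj = [[0, 1],
       [1, 0]]
print(count_walks_up_to(adj, 3))
```

For this graph the entry for the pair (0, 1) counts two walks, the single edge and the walk 0, 1, 0, 1 that runs through the cycle once, although only one acyclic path exists. This is the reason why the path type matrix has to track the sequences themselves instead of plain counts.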
The stop condition of the transitive closure computation using the path type matrix is different. It processes the matrix until no further progress is possible or the iteration number reaches the limit l. No progress is possible when the matrix resulting from an iteration is empty. Let us sum up the reasons why an entry PA_ij in the path type matrix PA can become empty:

1. the processed row i and column j are both already empty,


[Figure 4.14 shows the segmentation S(G) and the initial transcription graph. Each node of the transcription graph is labeled with its segment name and a triple (order from left, min path weight from start, level number), e.g. Y (4,3,1).]

Figure 4.14: Initial transcription graph for the computation of the weight of a sequence of segments. The segmentation concerning the sequence of segments is included.

2. all the sequences of segments in the entry get over the limit l and therefore they are not included in the result, or

3. the resulting sequences of segments are not connected

The transitive closure computation using the path type matrix always stops, according to Lemmas 1 and 3 and Corollary 1. There is only a limited number of possible sequences of segments representing all acyclic paths up to the limit l that need to be checked for connectedness.
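The stop condition and the three reasons for an empty entry can be sketched in a strongly simplified form. Everything here is an assumption made for illustration: the names are hypothetical, entries are modeled as sets of segment tuples keyed by their first and last segment, and the weight of a sequence is supplied by a callable weight_of that returns None for a disconnected sequence (the default, len, uses the sequence length as a stand-in for the real weight).

```python
def path_type_closure(edges, limit, weight_of=len):
    """Collect acyclic segment sequences whose weight does not exceed `limit`.
    `edges` is a set of (a, b) pairs between segments of one level."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    result = {}
    frontier = {(a, b) for a, b in edges if a != b}
    iteration = 1
    # Stop when an iteration produces nothing or the limit is reached.
    while frontier and iteration <= limit:
        kept = set()
        for seq in frontier:
            w = weight_of(seq)
            if w is None or w > limit:  # disconnected or over the limit
                continue
            kept.add(seq)
            result.setdefault((seq[0], seq[-1]), set()).add(seq)
        # Only kept sequences are extended; repeated segments are refused.
        frontier = {seq + (nxt,)
                    for seq in kept
                    for nxt in succ.get(seq[-1], ())
                    if nxt not in seq}
        iteration += 1
    return result


closure = path_type_closure({("A", "B"), ("B", "C"), ("C", "A")}, limit=3)
print(closure[("A", "C")])
```

The loop ends either because the frontier becomes empty or because the iteration counter exceeds the limit, mirroring the two stop conditions named in the text.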

Sequence of Segments Weight Computation

When a sequence of segments is generated during the transitive closure computation of the path type matrix, its weight has to be computed. The connecting path with the smallest weight is computed and this weight is assigned to the sequence of segments. If a sequence of segments has no connecting path, it is left out of the result.

A transcription graph is used to compute the weight of the sequence of segments. It differs slightly from the standard transcription graph in that there may generally be a set of start vertices and a set of end vertices. Those are the vertices taking part in border edges of the first and the last segment in the sequence. Figure 4.14 depicts an example of an initial transcription graph. The set of start vertices is generated by descending to inferior segments and checking their border edges. In this example the set of start vertices is {v1, v3} and the set of end vertices is {v10}. By the nature of segment sequences, the weight of the connecting path with the smallest weight (assuming, for brevity, that each vertex in the indexed graph is assigned a weight of 1) cannot be less than the length of the sequence.

In the example depicted in Figure 4.14, notice that the min path weight number reflects the lowest number of vertices on a path between the particular segment or vertex and one of the start vertices, and that the min path weight of the end vertices reflects the smallest possible weight of the inspected sequence of segments. In this case the minimal number of vertices traversed between the start and the end vertex is 4, which is the smallest possible weight of the sequence of segments (WXYZ) with respect to the segmentation S(G). In fact, it is the smallest possible weight of any sequence of segments of length 4 under any graph segmentation, according to Corollary 1.
The initial transcription graph for the sequence of segments is acquired by repeatedly transforming the leftmost and rightmost transitions of the transcription graph (in our example these are initially the transitions between segments W and X and between segments Y and Z) until we reach the lowest level of the ρ-index. Thus, to acquire the set of start vertices, the transitions that had to be transformed were W-X, followed by K-L and K-M, and finally the pair A-B and C-D. According to Lemma 2, the final transcription graph contains the set of minimal paths at the lowest level for the particular sequence of segments. This goal is accomplished by repeatedly transforming the leftmost transitionTo or existsPathTo edge. Contrary to the usual transcription graph, where this type of edge was transformed by replacing it with all paths up to a certain length, in this case the existsPathTo edge is replaced only by the shortest path, i.e. the path with the smallest weight.

[Figure 4.15 shows the segmentation S(G) and the final transcription graph, whose nodes are lowest-level vertices labeled with triples (order from left, min path weight from start, level number).]

Figure 4.15: Final state of the transcription graph. For brevity, the segmentation depicts only the segments' vertices and edges that are needed to transform the transcription graph. The result represents the set of minimal connecting paths of the segment sequence (WXYZ).

The final state of the transcription graph for the example from Figure 4.14 is depicted in Figure 4.15. For brevity, the segmentation depicted there contains only the vertices and edges needed to transform the initial transcription graph into this final state. There can be additional vertices and edges in the depicted segments, but they have no importance for the transformation, since the transformation takes into account only the shortest paths between each two vertices or segments, and those are included in the figure. Notice that the smallest weight can be immediately read from the end vertex. In general, the set of end vertices is scanned for the minimal weight of a connecting path.

[Figure 4.16 shows the segments A, B, C, D, E and F.]

Figure 4.16: A sequence of segments (CD) has a connecting path and thus is connected. A segment sequence (BCD) is not connected and is put into the suffix forest. By the time (ABCD) is checked, it is immediately pronounced disconnected since it has (BCD) as its suffix.
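The weight computation described above, where the smallest weight is read from the set of end vertices, can be approximated by a multi-source breadth-first search. This is a hedged sketch and not the transcription graph procedure itself: the function name and the example edges are hypothetical, every vertex is assumed to have weight 1, and the graph passed in is assumed to already contain only the edges that the inspected sequence of segments permits.

```python
from collections import deque


def min_connecting_weight(succ, starts, ends):
    """Smallest number of vertices on a path from any start vertex to any
    end vertex, or None if no connecting path exists."""
    weight = {s: 1 for s in starts}  # a start vertex alone weighs 1
    queue = deque(starts)
    while queue:
        v = queue.popleft()
        for w in succ.get(v, ()):
            if w not in weight:
                weight[w] = weight[v] + 1
                queue.append(w)
    reached = [weight[e] for e in ends if e in weight]
    return min(reached) if reached else None


# Hypothetical lowest-level edges with start set {v1, v3} and end set {v10}.
succ = {"v1": ["v2"], "v2": ["v9"], "v3": ["v4"], "v4": ["v9"], "v9": ["v10"]}
print(min_connecting_weight(succ, {"v1", "v3"}, {"v10"}))
```

A return value of None corresponds to a disconnected sequence of segments, which would be left out of the result.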

Suffix Tree for Disconnected Sequences of Segments

In the case when the generated sequence of segments turns out to be disconnected, i.e. it has no connecting path, it is stored in a suffix tree of disconnected sequences of segments at the particular level. Each newly generated sequence of segments is checked against the suffix tree for containment of a sequence of segments that was proven to be disconnected earlier. Such containment would mean that the newly generated sequence of segments is also disconnected. Containment means that the newly generated sequence of segments has some disconnected sequence of segments as a suffix. An example of this pruning assumption is demonstrated in Figure 4.16.

This pruning assumption is based on two principles. Firstly, a disconnected sequence of segments cannot become connected by adding other segments to either its beginning or its end. Secondly, a segment sequence of smaller weight is always checked for connectedness before a sequence of greater weight.

The former principle follows from the algorithm computing the connecting path. The algorithm considers all possible combinations of border edges between the segments in the sequence, so extending the sequence cannot add any new combinations for the shorter sequence. In fact, adding a segment to either end of the segment sequence can only limit the number of combinations. For example, in Figure 4.16 there are two border edges between the segments of the sequence (BC). In contrast, in the segment sequence (ABC) only one of those border edges can be used, and finally, when checking (ABCD), none of the border edges between B and C can be used.

The validity of the latter principle results from the transitive closure computation algorithm, in which the length of the generated sequences of segments grows throughout the whole computation.

Figure 4.17: A suffix tree built using the following sequences of segments: (ACDE), (BCDE), (CDE), (BDE), (CDF) and (BDF).


As for the suffix tree itself, it is based on an indexing structure introduced in [53]. It is a tree structure that, for a set of strings, eases the decision whether an input string has a suffix in the set, i.e. in the suffix tree. The tree is consulted by traversing the inspected string from end to start and checking whether a leaf node of the suffix tree can be reached by following this input. If so, the inspected string has a suffix in the tree; otherwise it does not. The computational complexity of the search algorithm is O(n log n). In our case, the alphabet is formed by the names of the segments at the particular level and the strings are the sequences of segments. An example of such a suffix tree covering a few sequences of segments is depicted in Figure 4.17. Searching only the suffixes is sufficient: when a segment sequence is found to be disconnected, it is not used for further transitive closure computation and thus cannot be extended at its end. It can only become a part of some other sequence of segments as its suffix during further transitive closure computation.
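The suffix check can be illustrated with a small trie over reversed sequences. This is a sketch under stated assumptions and not the structure from [53]: the class and method names are hypothetical, segment names are single characters so that a sequence can be written as a string, and an explicit end marker is used instead of relying on leaf nodes, because one stored sequence may itself be the suffix of another.

```python
class DisconnectedSuffixes:
    """Trie keyed by segments read from the end of a sequence."""

    _END = object()  # marks the end of a stored sequence

    def __init__(self):
        self.root = {}

    def add(self, sequence):
        node = self.root
        for segment in reversed(sequence):
            node = node.setdefault(segment, {})
        node[self._END] = True

    def has_stored_suffix(self, sequence):
        """True if some stored sequence is a suffix of `sequence`."""
        node = self.root
        for segment in reversed(sequence):
            node = node.get(segment)
            if node is None:
                return False
            if self._END in node:
                return True
        return False


# The sequences listed in the caption of Figure 4.17.
index = DisconnectedSuffixes()
for seq in ("ACDE", "BCDE", "CDE", "BDE", "CDF", "BDF"):
    index.add(seq)

print(index.has_stored_suffix("XBDE"), index.has_stored_suffix("DE"))
```

A lookup in this sketch touches at most one trie node per segment of the inspected sequence; nothing is implied here about the cost of the structure used in the thesis.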

4.4 Search Algorithms

In this section we describe the ρ-path algorithm for the discovery of all paths up to a certain length between two vertices in the indexed graph using the ρ-index. Subsequently, an outline of the ρ-connection algorithm is provided.

4.4.1 ρ-path Algorithm

This algorithm is used to find all paths lying between two inspected vertices up to a certain limit l. Due to the nature of the ρ-index, the probability that the result will also contain some paths longer than the limit l is high. The algorithm uses the transcription graph and its transcription concepts as introduced earlier in Section 4.2. The idea of the algorithm is to find all candidate sequences of segments that can represent the paths between the two inspected vertices and to check whether such paths exist.


Algorithm 3 ρ-path algorithm to find all paths lying between vertices s and e up to a certain limit l.
Input: starting vertex s, ending vertex e
Output: graph of all paths lying between s and e stored in ρ-index
1: get an array of segments [S1, S2, ..., Sk] for s
2: get an array of segments [E1, E2, ..., Ek] for e
3: retrieve the entry pSkEk from the top-most path type matrix P
4: build the initial state of the transcription graph
5: transform the transcription graph to the final state using Algorithm 1

The Initial State of the Transcription Graph

The initial state of the transcription graph used to check the connectedness of a generated sequence of segments was built from top to bottom: we descended from the currently built level to the bottom of the structure. In contrast, the initial state of the transcription graph that implements the ρ-path operator is acquired in reverse order, from bottom to top. Firstly, the computation needs to get the sequences of segments to which the inspected vertices belong. This is represented by steps 1 and 2 in Algorithm 3.

These steps represent the consultation of the ρ-index for an array of segments to which the start and end vertices belong at each particular level of the ρ-index. The result of this process is an array of segments, one for each level, because the segmentation definition says that each vertex may belong to only one segment at an upper level. Therefore, the index k in every array of segments is equal to the number of levels of the ρ-index.

In step 3 of Algorithm 3, the entry from the top-most path type matrix is retrieved. According to the transitive closure computation of the path type matrix, this entry contains all the sequences of segments up to the limit l that can represent all paths between the two inspected vertices s and e.

An example of the initial state of the transcription graph for vertices v1 and v10 is depicted in Figure 4.18. On both ends there are the arrays of segments to which the respective inspected vertices belong. In the middle part of the transcription graph, the contents of the entry pXY are connected to the top-most segments X and Y. Again, the goal of the transcription process is to end with all vertices of the graph belonging to the lowest level of the ρ-index and all edges being of the transitionTo type.

[Figure 4.18 shows, from left to right, the array of segments for v1, the entry from the top-most matrix, and the array of segments for v10. Nodes are labeled with a segment name and a triple (order from left, min path weight from start, level number), e.g. Y (4,3,1); the edge types are transitionTo, existsPathTo, isSuperiorToRight and belongsToRight.]

Figure 4.18: An initial state of the transcription graph where the array of segments for vertex v1 is [E, K, X] and for v10 is [F, L, Y]. The entry of the top-most matrix P is pXY = {(X A1 A2 A4 A5 Y), (X A1 A3 A4 A5 Y), (X B1 A2 A4 A5 Y), (X B1 B2 Y)}.
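Steps 1 to 3 of Algorithm 3 can be sketched with plain dictionaries, using the values given in the caption of Figure 4.18. The representation is an assumption made for illustration only: a hypothetical parent map stores, for each vertex or segment, the segment it belongs to at the next level, and the top-most matrix is a dictionary keyed by pairs of segments.

```python
def segments_for(vertex, parent):
    """Climb from a vertex to the top-most level, collecting one segment
    per level (steps 1 and 2 of Algorithm 3)."""
    chain = []
    node = vertex
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def candidate_sequences(s, e, parent, top_matrix):
    """Return both arrays of segments and the top-most matrix entry (step 3)."""
    s_chain = segments_for(s, parent)
    e_chain = segments_for(e, parent)
    entry = top_matrix.get((s_chain[-1], e_chain[-1]), set())
    return s_chain, e_chain, entry


# Values taken from the caption of Figure 4.18.
parent = {"v1": "E", "E": "K", "K": "X", "v10": "F", "F": "L", "L": "Y"}
top_matrix = {
    ("X", "Y"): {
        ("X", "A1", "A2", "A4", "A5", "Y"),
        ("X", "A1", "A3", "A4", "A5", "Y"),
        ("X", "B1", "A2", "A4", "A5", "Y"),
        ("X", "B1", "B2", "Y"),
    }
}

s_chain, e_chain, entry = candidate_sequences("v1", "v10", parent, top_matrix)
print(s_chain, e_chain, len(entry))
```

The two arrays and the retrieved entry are exactly the three parts from which the initial transcription graph is assembled; the transcription itself (steps 4 and 5) is not covered by this sketch.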

The Result

The resulting transcription graph either is an empty graph (apart from the pair of start and end vertices) or it contains the indexed paths up to the given limit l and possibly some longer paths. All vertices belong to the lowest level of the ρ-index and all edges are of the transitionTo type. The resulting graph is also a DAG in which the vertices are ordered according to their order-from-left number. The acyclicity is ensured by the transformation process, in which each produced segment sequence is transformed into sequences of segments that are again free of cycles at the lower levels.

[Figure 4.19 shows two copies (I and II) of a path type matrix over the segments S1 to S4. Row S1 holds the entries e1, e2 and e3, row S2 holds e4 and e5, and row S3 holds e6 and e7.]

Figure 4.19: Demonstration of how the top-most ρ-index matrix is consulted in the case of the ρ-connectionTo (I) and ρ-connectionFrom (II) algorithm implementation.

4.4.2 ρ-connection Algorithm Outline

Although the main focus of this thesis is the efficient processing of the ρ-path operator, we also provide an outline of how the ρ-index can be exploited to implement the ρ-connection algorithm for processing such queries.

The ρ-connection algorithm would differ from the algorithm described above mainly in two points. The first is the information retrieved from the top-most matrix of the ρ-index. The second follows from the first: the different information retrieved requires the processing of that information to be altered.

Contrary to the ρ-path algorithm, the top-most matrix is consulted in a different way, as described in Figure 4.19 and demonstrated in [9]. In the case of ρ-connectionTo, the rows of the matrix belonging to the pair of inspected vertices are checked, i.e. the rows of segments S2 and S3, which have to be non-empty in a common column. Only the column denoted by the segment S4 satisfies this condition. In other words, if there is some vertex in which two paths originating in segments S2 and S3, respectively, terminate, it lies in the area delimited by the segment S4, and the paths are represented by their proper sequences of segments contained in entries e5 and e7, respectively.

The situation labeled II in Figure 4.19 simulates the retrieval of the pair of entries for the same input segments but for the ρ-connectionFrom operator. In this case we are looking for an area of the indexed graph from which it is possible to reach the two input vertices represented by their assigned top-most segments; here the area is delimited by segment S1. Again, the retrieved entries e2 and e3 represent the proper sequences of segments for the possible paths forming the answer to the processed query.

In the corresponding step of the ρ-path algorithm, the top-most matrix was consulted in the usual fashion, i.e. by the intersection of a row and a column, and thus at most one matrix entry was retrieved. In this case the result of the procedure is a set of pairs of entries, of which there can be at most n, where n denotes the size of the top-most matrix.

To process the retrieved segment sequences, the transcription graph can be used again, yet the transcription process and strategy would need to be altered to suit the specific needs. First, a separate transcription graph would have to be used for each retrieved pair of entries. Then, the transcription graph would have to be redefined to have two starting vertices instead of one in the case of the ρ-connectionTo operator. Also, the definition of the ending vertex would need to be changed into an ending segment from the top-most level. The transcription process then incorporates two processes similar to the one presented for the ρ-path operator, but with the starting or ending vertex not firmly stated and replaced by a segment from the top-most level. These processes would have to cooperate to preserve the acyclicity of the pairs of paths transcribed into the terms of the lowest level of the indexing structure.
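The two ways of consulting the top-most matrix can be sketched as a search for a common column (ρ-connectionTo) and a common row (ρ-connectionFrom). The nested-dictionary layout and the function names are hypothetical. Only the four entries named in the text are modeled, and the assignment of e2 to the cell (S1, S2) and of e3 to the cell (S1, S3) is an assumption; the positions of e1, e4 and e6 are not recoverable here and are left out.

```python
def connection_to(matrix, seg_a, seg_b):
    """Columns in which the rows of both segments are non-empty,
    together with the pair of entries found there."""
    row_a = matrix.get(seg_a, {})
    row_b = matrix.get(seg_b, {})
    return {col: (row_a[col], row_b[col]) for col in row_a if col in row_b}


def connection_from(matrix, seg_a, seg_b):
    """Rows that are non-empty in the columns of both segments."""
    return {row: (cols[seg_a], cols[seg_b])
            for row, cols in matrix.items()
            if seg_a in cols and seg_b in cols}


# matrix[row][column] = name of the entry holding the segment sequences.
matrix = {
    "S1": {"S2": "e2", "S3": "e3"},
    "S2": {"S4": "e5"},
    "S3": {"S4": "e7"},
}

print(connection_to(matrix, "S2", "S3"))
print(connection_from(matrix, "S2", "S3"))
```

Each key of the returned dictionary identifies one pair of entries, so the number of retrieved pairs is bounded by the number of rows or columns of the matrix, in line with the bound n mentioned above.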

Chapter 5 ρ-index Evaluation

In this chapter we present practical results and an experimental evaluation of the designed indexing structure and the search algorithm. Firstly, the nature of the data used for the evaluation and for the search is introduced and described. Subsequently, the issues of the ρ-index building phase are discussed. Finally, the practical results gained using the ρ-index to search for complex relationships in the indexed graph are presented.

For most of the experiments described in the following sections, we set the maximal indexing length, i.e. the indexing limit l, to 10. The ρ-index was then built to index all the paths up to this limit, and the search returned all the paths up to this length and some longer paths. What "some" means in more precise numbers is addressed in Section 5.3.3.

All our experiments concerning the ρ-index were executed on a computer with two dual-core Athlon Opteron 2.4 GHz processors and 12 GB of RAM. The computer was not dedicated solely to this task while the tests were run, so all the experiments were run multiple times and the results depicted are averages of the results thus gained.

5.1 Data Collection

The first data set that we used for the evaluation of the ρ-index is synthetic, randomly generated data. Firstly, we discuss the process of data generation. Then, we present our own incremental random graph generation algorithm. Finally, we discuss the nature of the generated data set. The first experiments were conducted on graphs of similar sizes as in [12], yet the similarity after enlargement has been studied further and modified sizes are used here because a better method for comparing the similarity of graphs' properties than the simple m/n ratio was found.

Graph size (n)    Threshold value    Parameter value m    m/n ratio
5,000             21,000             9,000                1.8
10,000            46,000             20,000               2
20,000            99,000             44,000               2.2
30,000            154,000            69,000               2.3

Table 5.1: A summary of connectivity thresholds computed for the testing graphs generated by the random graph generation algorithm.

5.1.1 Random Graph Models

The study of random graphs dates back to the work of Erdős and Rényi published in the seminal papers [23, 24]. They introduced two standard models that can be called the uniform random graph models. The uniformity refers to the distribution of the edges in the graph. Each of the graph models has two parameters. One controls the number of vertices in the graph and the other controls the density or the number of edges. For example, the random graph model G(n, m) uniformly places m edges among n vertices, while in the random graph model G(n, p) each possible edge of the complete graph with n vertices appears in G with probability p. In both models, self-loops and multiple edges between the same pair of vertices are not allowed.

Intuitively, the two presented random graph models are almost the same. The difference is that in the former case the number of edges is always the same, whereas in the latter the number of edges varies due to the additional randomness introduced by the edge probability.

In [24] the threshold for connectivity of the produced random graph is studied. In other words, the authors study how many edges there must be, or how large m has to be, to make the resulting random graph connected, i.e. to have all vertices in one component. The authors arrive at a general formula stating that, for all vertices to fall into a single graph component, the probability must satisfy $p \ge \frac{\log n}{n}$ or, in the case of the G(n, m) model, $m \ge \frac{1}{2}(n - 1)\log n$.

The connectivity threshold makes it possible to compare randomly generated graphs of various sizes with each other, because linearly growing graphs do not have to share the same properties.

The testing random graphs were generated with n set stepwise to 5,000, 10,000, 20,000 and 30,000. The number of edges m for the graph with 10,000 vertices was set to 20,000, which gives an m/n ratio of 2. The 20,000 edges in the graph of 10,000 vertices represent about 45% of the connectivity threshold value, which is about 46,000 edges. Hence, the parameters of the other testing graphs were computed by evaluating the connectivity threshold and then taking 45% of it. Table 5.1 lists the computed numbers and the parameters used for the random graph generation algorithm. As can be seen from Table 5.1, it would be a mistake to use just the m/n ratio to compute the number of edges for the generation algorithm. The larger graphs would be sparser than the smaller ones.
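The numbers in Table 5.1 can be reproduced with a few lines of arithmetic. The sketch below assumes that the logarithm in the threshold formula is the natural logarithm and that the values were truncated to whole thousands; both are assumptions made here (the rounding rule is not stated in the text), and the helper names are hypothetical.

```python
import math


def connectivity_threshold(n):
    """Edge count (1/2)(n - 1) ln n for the G(n, m) model."""
    return 0.5 * (n - 1) * math.log(n)


def floor_to_thousand(x):
    """Truncate a value to whole thousands (assumed rounding rule)."""
    return int(x // 1000) * 1000


for n in (5000, 10000, 20000, 30000):
    threshold = connectivity_threshold(n)
    m = floor_to_thousand(0.45 * threshold)
    print(n, floor_to_thousand(threshold), m, round(m / n, 1))
```

Under these assumptions the printed thresholds and edge counts agree with the second and third columns of Table 5.1, and the growing m/n ratio in the last column shows why a fixed ratio would make the larger graphs comparatively sparser.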

5.1.2 Incremental Random Graph Generation Algorithm

The graph generating algorithm is given in Algorithm 4. As is apparent from the algorithm, the graphs are generated iteratively using a smaller graph that is a subgraph of the generated one. The generated graph is enlarged in such a way that the newly added edges lie either between the newly added vertices or between a newly added vertex and a vertex of the smaller graph. The probability that a new edge will lie between a new vertex and a vertex of the smaller graph is equal to the ratio |V1| / maxVtxNumber, where maxVtxNumber is the intended number of vertices in the generated graph and V1 denotes the set of vertices of the input graph.

An empty graph is used as the first smaller graph. In this phase the result of the algorithm is an almost uniform random graph. The strict uniformity of the edge distribution is broken by the random approach to placing edges in the graph. There exist various algorithms for generating uniform random graphs, like those presented in [37], but strict uniformity was not the primary property of the graph for us. By generating the graphs randomly, we wanted the testing data to exhibit all sorts of pathological cases, such as disconnected vertices on the one hand and vertices with high degrees on the other hand, yet in a reasonable number of occurrences.

Algorithm 4 Synthetic random graph generating algorithm.
Input: Graph G1 = (V1, E1), int maxVtxNumber where |V1| < maxVtxNumber, int maxEdgeNumber where |E1| < maxEdgeNumber
Output: Graph G2 = (V2, E2), where |V2| = maxVtxNumber, V1 ⊆ V2, E1 ⊆ E2
  generate (maxVtxNumber - |V1|) vertices ⇒ V′
  V2 = V1 ∪ V′
  E2 = E1
  odds = |V1| / maxVtxNumber
  while |E2| < maxEdgeNumber do
    V = randomlySelectSet(odds, V1, V′)
    v1 = randomlySelectVtxFrom(V)
    v2 = randomlySelectVtxFrom(V′)
    if v1 ∈ V1 then
      (v1, v2) = randomlySwitchDirection((v1, v2))
    end if
    if v1 ≠ v2 then
      E2 = E2 ∪ {(v1, v2)}
    end if
  end while

We generated four graphs using this generation algorithm. The graphs were generated as follows:
  G5000 = generateGraph((∅, ∅), 5000, 9000)
  G10000 = generateGraph(G5000, 10000, 20000)
  G20000 = generateGraph(G10000, 20000, 44000)
  G30000 = generateGraph(G20000, 30000, 69000)

Notice that the number in the name of the graph always represents the number of vertices contained in that graph. The vertex degree distribution summary for the testing graphs G5000, G10000, G20000 and G30000 is illustrated in Figures 5.1, 5.2, 5.3 and 5.4. The generated graphs have all the properties intended for studying and evaluating the behavior of the ρ-index: they contain a certain proportion of vertices with a degree of one, a majority of vertices with a degree between two and five, and a proportion of vertices with high degrees ranging from six to thirteen. The vertices with a total degree of one are considered special; they are called sinks or sources, respectively, and they have properties similar to those of the root and the leaves of a tree.

The graphs also have the property that a smaller graph is always a subgraph of any of the larger graphs. This property is very important when we evaluate the experiments that compare the search results in graphs of different sizes, because the result of a search performed on a smaller graph, i.e. the set of paths retrieved, is also a subset of the result of the same search performed on any larger testing graph. So it is true that G5000 ⊆ G10000 ⊆ G20000 ⊆ G30000.

[Figures 5.1 to 5.4 each consist of three histograms (inDegree, outDegree, degree) plotting the number of vertices against the indegree, the outdegree and the total degree.]

Figure 5.1: Vertex degree distribution in the synthetic random graph G5000.

Figure 5.2: Vertex degree distribution in the synthetic random graph G10000.

Figure 5.3: Vertex degree distribution in the synthetic random graph G20000.

Figure 5.4: Vertex degree distribution in the synthetic random graph G30000.
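Algorithm 4 can be turned into a short runnable sketch. Several details are assumptions made for this illustration: vertices are the integers 0 to n - 1, so a graph is passed as a vertex count and a set of directed edges; randomlySwitchDirection is taken to flip the edge with probability 1/2; and the random generator is passed in explicitly so that runs can be repeated. The requested number of edges is assumed to be small enough to be reachable, otherwise the loop would not terminate.

```python
import random


def generate_graph(n_old, edges_old, max_vtx, max_edges, rng):
    """Enlarge a graph with vertices 0..n_old-1 to max_vtx vertices and
    max_edges directed edges; the old graph stays a subgraph."""
    old = range(n_old)            # V1
    new = range(n_old, max_vtx)   # V', the newly generated vertices
    edges = set(edges_old)        # E2 starts as a copy of E1
    odds = n_old / max_vtx
    while len(edges) < max_edges:
        # With probability `odds` the first endpoint is an old vertex.
        pool = old if rng.random() < odds else new
        a = rng.choice(pool)
        b = rng.choice(new)       # the second endpoint is always new
        if pool is old and rng.random() < 0.5:
            a, b = b, a           # randomly switch the direction
        if a != b:                # no self-loops; the set drops duplicates
            edges.add((a, b))
    return max_vtx, edges


rng = random.Random(0)
n_small, e_small = generate_graph(0, set(), 50, 90, rng)
n_large, e_large = generate_graph(n_small, e_small, 100, 200, rng)
print(n_large, len(e_large), e_small <= e_large)
```

The small sizes used here are only for demonstration; the same chaining of calls corresponds to the four generateGraph invocations listed above, and the subset check at the end reflects the subgraph property that the evaluation relies on.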

5.2 ρ-index Creation Time

First of all, we present the experiments concerning the time needed to create the ρ-index for the testing synthetic random graphs under various parameter settings. Figure 5.5 shows the experiments performed with a varying maximal size of the segment at the lowest level and fixed parameter settings at the higher levels. The parameter setting for the second level is ten and for the third level it is five; recall that the size of the top-most fourth level depends on the size of the SG(G) created at the third level, so it is not controlled by a parameter. Therefore, the x-axis denotes the parameter for the first level of the created indexing structure. The segmentation method used was the general vertex clustering method described in detail in Section 4.3.1.

[Figure 5.5 plots the creation time [mins] against the lowest level cluster size for the graphs G5000, G10000, G20000 and G30000.]

Figure 5.5: Creation time of a ρ-index with four levels, with the parameter set to ten at the second level and to five at the third level. The parameter set at the first level is represented by the x-axis value.

The parabolic shapes of the curves representing the creation times in Figure 5.5 demonstrate that the creation process is sensitive to both too small and too large clusters at the lowest level. When there are small segments at the lowest level, there is a large top-most matrix at the fourth level of the structure, and it takes a lot of time to compute the transitive closure of this matrix. Conversely, when there are unnecessarily large segments at the lowest level, the top-most matrix shrinks to a small size, which leaves the level underused, so its function is undesirably transferred to lower levels.

However, the parabolic shapes of the curves make it possible to find the optimal parameter setting for each testing graph. Yet, the parameter setting is usable only with the particular segmentation method. Although the experiments with other segmentation methods proved that the shape of the curve is very similar, the actual best parameter setting varies slightly from method to method.

[Figure 5.6 plots the creation time [mins] for the graphs G5000, G10000, G20000 and G30000.]

Figure 5.6: Creation time of a ρ-index with four levels, with the parameter setting at the second level represented by the x-axis value. The parameter setting for the first level is five for graph G5000, fifteen for G10000, thirty-five for G20000 and forty-five for G30000. The third level parameter setting is five.

Figure 5.6 depicts the creation time of the ρ-index for the testing graphs with respect to various parameter settings at the second level. The setting at the first level was always chosen to be the best one read from the results of the experiments in Figure 5.5: five for G5000, fifteen for G10000, thirty-five for G20000 and forty-five for G30000. In contrast to the results gained by the set of experiments with a varying setting at the lowest level, it can be observed that there is almost no shift on the x-axis with respect to the size of the indexed graph. This is because the different parameter settings at the lowest level transform the indexed graph into an SG(G) of more or less the same size, so the curves appear above each other.


Unexpected steps can be observed in the curves depicted in Figure 5.6 between x values of fifteen and twenty-five. This behavior can be observed in all curves. It arises because the graphs that are the input to the second segmentation are of almost the same size for all testing graphs, as discussed earlier, and because the larger graphs contain the smaller ones. Yet, the main cause of the unexpected accumulation in the y-axis values is the general segmentation method and its optimization. It traverses the resulting SG(G) and forces the underfilled segments to merge with the smallest neighboring segment, which can sometimes result in the creation of one segment that is undesirably large. This bottleneck then slows the creation process down.

Finally, Figure 5.7 presents the results of a set of experiments that fixed the parameter settings at the first and second level of the created structure and varied the parameter setting at the third level. The parameter settings for the first level are the same as for the previous set of experiments and the parameter setting for the second level for all graph sizes is ten. The curves in this figure behave in the same way as those in Figure 5.6, so the same conclusions apply.

The experiments presented in this section showed that the creation time grows almost linearly with the size of the indexed graph. Yet, it must be considered that the graphs' sizes grow linearly with the connectivity threshold but the |E|/|V| ratio grows faster for larger graphs, which is reflected in the size of the graph representation, so the times needed to create the respective indexing structures follow this behavior.

5.3 Search Complexity

The group of experiments performed and evaluated in this section tackles the issue of the computational complexity of the ρ-path search algorithm limited to a certain path length with respect to the size of the graph. The tests were again performed on the synthetic random graphs presented earlier in this section. The parameter settings used to build the ρ-indexes for each of the testing graphs were the optimal ones for each particular graph.

[Figure 5.7 plot: creation time [mins] against the axis labeled "lowest level cluster size"; curves for G5000, G10000, G20000 and G30000.]

Figure 5.7: Creation time of ρ-index having four levels with parameter setting at third level represented by the x-axis value. The parameter setting for the first level for graph G5000 is five, for G10000 is fifteen, for G20000 is thirty-five and for G30000 is forty-five. The second level parameter setting is ten.

The parameter settings are summarized in Table 5.2. Again, only the maximum size of the segment on the lowest level varied and the rest of the parameter settings remained the same for all testing graphs.

The complexity of the search algorithm can be examined from several points of view. Firstly, there is a difference in the possible result of the search. The result can either be a set of found paths or no paths. Even when no paths are returned, the graph has to be scanned to confirm that the result is empty, so these two cases are discussed separately. Secondly, the search itself can be limited to various lengths independently of the limit used for the ρ-index creation. Finally, the computational complexity of the search algorithm is investigated when the parameter settings are modified.

Graph     1st level   2nd level   3rd level
G5000     5           10          5
G10000    15          10          5
G20000    35          10          5
G30000    45          10          5

Table 5.2: Parameter setting used to build ρ-indexes for the testing graphs.

[Figure 5.8 plots, two panels: vertices processed against the number of vertices in a graph, with y-axis scales up to 700,000 (left) and up to 80,000 (right); curves rhoIndex, n*log(m), seq(10) and seq(12).]

Figure 5.8: A ρ-index search complexity with respect to the graph size.

5.3.1 Search Complexity of Positive Search

Figure 5.8 demonstrates the experiments where the parameter settings were fixed and the size of the graph grew. As we have mentioned earlier in this chapter, the nature of the testing graphs is such that the result of the search on a larger graph always contains all the search results of the smaller graphs, thus the results of the searches are comparable with respect to the vertices and edges they comprise.

Both parts of Figure 5.8 refer to the same results of the same experiments. The difference is in the y-axis scale. The left part depicts the results in the whole scale, the right part depicts the results in the range from 0 to 80,000 processed vertices.

[Figure 5.9 plots, two panels: number of paths against path length, with y-axis scales up to 300,000 (left) and up to 10,000 (right); curves G5k, G10k, G20k, G30k and G5k-found, G10k-found, G20k-found, G30k-found.]

Figure 5.9: Amounts of paths found by the search executed on the testing graphs.

The particular curves represent the algorithms used to perform the search of all paths lying between two pre-selected vertices. The lines labeled with the prefix seq represent a sequential algorithm. This algorithm represents an upper bound on the cost of searching all paths between two vertices up to a certain length. It is a depth-first search that tries to recursively build a path of some maximum length. The number in the label states the maximal length of a searched path.

Figure 5.8 contains two results for the sequential algorithm seq. The reason lies in the nature of the ρ-index and its search algorithm: the search returns all paths up to the length l for which the ρ-index was built, and it also returns some of the paths longer than l. For the result of the search using the ρ-index the following therefore holds: seq(l) ⊆ ρ-index ⊆ seq(l + k) for a particular k, where the set inclusion is meant on the results of the search algorithms. For that reason we also present the complexity of the algorithm seq(12), which represents the sequential scan for all paths up to the length 12. From a rough comparison of the complexity measured for the sequential algorithm with the length set to 10 and 12 we can observe that the growth is exponential.

Another approach to the problem of searching for all paths lying between two vertices in a directed graph is a direct computation using Tarjan's algorithms described in [70] and [71]. The algorithm works in a time complexity n ∗ log(m), where n represents the number of vertices in the graph and m the number of edges in the graph. The algorithm takes a flow graph and a start vertex on the input and returns on the output the path expressions (regular expressions where the letters are edges of the flow graph) representing all paths to all vertices in the graph. A flow graph is a special type of a directed graph which allows only one source vertex in the graph and no cycles. There exists a non-trivial transformation of an arbitrary directed graph into a flow graph. The computational overhead of this graph transformation is not included in the complexity of Tarjan's algorithm. The complexity denoted by n ∗ log(m) was verified in [31] and represents the actual values gained by that implementation on our testing graphs for the same input vertices.

The graphs in Figure 5.9 summarize the results of the searches performed on the testing graphs in terms of the amounts of paths of particular lengths found. Here, too, the curves are presented in two different scales. The left graph depicts the search results in the global view, the right one focuses mainly on the results of the performed searches. Both parts contain two groups of curves. The first one represents the real amount of paths of the particular lengths in each of the testing graphs. The other group represents the amounts of paths actually found by the ρ-path search algorithm using the ρ-index. The curves representing the real amounts of paths grow exponentially, from 3 paths of length 5 to around 200,000 paths of length 17. The amounts grow beyond the length of 17, but they would only blur the results by extending the scale, so they are left out.
The decrease in the amounts of paths found beyond the weight limit l for the larger graphs is due to the increasing size of the segments on the lowest level, which leads to a smaller amount of indexed paths whose weight is greater than l. This observation is confirmed by the experiments in Section 5.3.4.
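As an illustration of the sequential baseline seq described in this section, the following is a minimal sketch of a depth-first search that enumerates all paths between two vertices up to a maximal length. It is not the implementation used in the experiments: the function name seq_paths, the adjacency-list representation, unit edge weights and the restriction to simple paths (no repeated vertices) are assumptions made only for this example.

```python
from typing import Dict, Hashable, Iterator, List

# Assumed representation: adjacency lists of a directed graph.
Graph = Dict[Hashable, List[Hashable]]


def seq_paths(graph: Graph, source, target, max_len: int) -> Iterator[list]:
    """Yield every simple path from source to target with at most max_len edges."""
    path = [source]      # vertices of the path built so far
    on_path = {source}   # same vertices as a set, for fast membership tests

    def extend(vertex, edges_left):
        if vertex == target:
            yield list(path)
            return  # a simple path cannot pass through the target again
        if edges_left == 0:
            return  # length limit reached without hitting the target
        for successor in graph.get(vertex, ()):
            if successor in on_path:
                continue  # skip vertices already used on this path
            path.append(successor)
            on_path.add(successor)
            yield from extend(successor, edges_left - 1)
            path.pop()
            on_path.remove(successor)

    yield from extend(source, max_len)
```

Running the sketch with two different limits on the same graph gives nested result sets, which is the inclusion seq(l) ⊆ seq(l + k) used above to bracket the result of the ρ-index search.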

5.3.2 Search Complexity of Negative Search

Contrary to the set of experiments performed in the previous section, the results of the searches conducted in this section are empty graphs: no path was present in the index for the pair of inspected vertices. This set of experiments studies the ability to process queries for which no answer is given. The respective ρ-indexes were created using the same parameter settings as presented in Table 5.1.

[Figure 5.10 plot: vertices processed against graph size; curves rhoIndex, seq(10), seq(12) and reachable vertices.]

Figure 5.10: A complexity of a search when no paths were present in the ρ-index with respect to the graph size.

The vertices that were the input for the search were carefully selected. The criterion for selection was the amount of vertices reachable from each of the inspected vertices on the lowest level, i.e. in the indexed graph, because the extreme situation in which a vertex is a member of a single-vertex component was undesired. This situation would blur the results, because in such cases the sequential algorithm beats any other approach: it finishes immediately since it cannot follow any outgoing edge. Therefore, Figure 5.10 also shows the number of vertices reachable from the starting vertex for each of the testing graphs. As a matter of fact, this number also represents the number of vertices reachable from the ending vertex.

The same amounts of reachable vertices for the starting and ending vertices indicate that the inspected vertices fell into one component and that there exists a path between them, yet one whose length is greater than the weight limit l.

The results of these experiments proved that the ρ-index also works well in these tasks, namely when the result of the search is minimal yet the ratio of the size of the reachable graph to the size of the indexed graph is significant.

[Figure 5.11 plot: vertices processed against user defined maximal length; curves rhoIndex, sequential and n*log(m).]

Figure 5.11: A ρ-index search complexity of queries with different maximal search length.

5.3.3 Search Complexity of Queries with Limited Maximal Path Length

To this point the maximal weight of the path searched by the ρ-path search algorithm was always considered to be the same as the weight limit l that was used to create the ρ-index. In this section, the behavior of the ρ-path search algorithm is explored when the maximal weight of the searched path is its parameter.

As we refer to the weight limit of the indexed path as l, the maximal weight of a searched path is referred to as softL. Setting this parameter does not prevent the search from returning paths longer than softL, but again it does not necessarily find all of them.

[Figure 5.12 plot: percentage of paths not found against path length; curves softL=5, softL=7, softL=9, softL=10 and softL=11.]

Figure 5.12: A ρ-index percentage of paths longer than softL not found.

Figure 5.11 represents searches executed on the graph G10000 with the parameters set to 30, 10 and 5. The ρ-index was created with l equal to 10. The x-axis represents the values of the softL parameter and the curves represent the respective algorithms used to process the ρ-path query.

To make Tarjan's approach comparable with ours and with the sequential algorithm, we approximated its computational complexity by limiting the input graph to only those vertices and edges that are reachable within softL steps. The curve representing Tarjan's algorithm is much more promising than the other approaches as it gets to greater softL values. Nonetheless, it has to be noted that the result of the limited Tarjan algorithm contains only the paths up to the particular softL length, whereas the result of the ρ-index approach also contains a considerable set of paths longer than the particular softL.

As for the number of the found paths, the sequential algorithm finds all paths up to the length of softL, while our algorithm finds all the paths up to the length of softL and some of the paths that are longer than that. Figure 5.12 represents the percentage of paths not found that have length greater than softL for each particular length. Although we ran the experiment for the softL value of 3, the curve representing its results is not present here since the search returns no paths for this softL value. For the softL value of 5 it finds no path longer than 5, so the curve immediately reaches 100 percent at length 6.

The amount of paths increases exponentially. In terms of specific numbers from the indexed graph G10000, the amount of paths of length 12 between the testing start and end vertex actually present in G10000 is 1821 and the amount of paths of length 14 between the same two vertices is 12,644. So even if the amount of paths found represents a low percentage of the paths present in the indexed graph, the actual amount can easily reach tens of thousands. For illustration, for softL = l = 10 and a path length of 24 the amount of found paths is on average 72,000 and the longest found path has a length of 42.

[Figure 5.13 plot: vertices processed / minutes against lowest level cluster size; curves search complexity and creation time (mins*1000).]

Figure 5.13: A ρ-index search complexity related to the parameter setting.


5.3.4 Search Complexity Affected by the Parameter Settings

The ρ-index can be created for one particular graph using different parameter settings and, as could be seen from the experimental results of the previous section, the resulting structures have different properties. In this section, the correlation between a certain parameter setting and the complexity of the searches performed on the respective indexing structures built upon one particular testing graph is explored and evaluated.

As the testing graph, the synthetic random graph called G10000 was chosen. The parameter settings varied in the maximal size of a segment on the lowest level, while the upper level settings remained the same for all tests. Figure 5.13 depicts the relation between the parameter settings and the average search complexity for the ρ-index thus created. The parameter setting for the lowest level is represented by the x-axis values. The curve falls as the segment size on the lowest level increases.

The dashed curve in Figure 5.13 reflects the creation time of the ρ-index for that particular parameter setting. The time is in minutes multiplied by a constant of 1000 to make the curve visible in this scale. In contrast, this curve is rising. We have already seen this behavior in Figure 5.5 for all graph sizes in the rising part of the parabolas.

These facts represent a tradeoff between creation and search. We gain better creation times for certain parameter settings but, on the other hand, we get worse search complexity. This tradeoff has one more dimension, which is the amount of returned paths that are longer than l. This fact has already been observed in one of the previous sections, and the reason for it is that the probability that a path longer than l is indexed decreases hand in hand with the increasing size of the segment.

5.3.5 Summary

The experiments conducted and their evaluation presented in this section express the behavior of the designed indexing structure called ρ-index. The first set of experiments, concerning the creation phase of the indexing structure, demonstrates that the results depend strongly on the selected segmentation method and its optimizations and also on the parameter settings for each tested graph. Yet, the presented results indicate the possible scalability of the structure, since the time consumed by the creation phase of the ρ-index grows almost linearly with the size of the indexed graph.

The second set of experiments regarded the complexity estimation of the ρ-path query processing. Two cases were considered due to the nature of the indexing structure. Firstly, it was the case when the answer to the ρ-path query was a substantial set of paths representing one complex relationship. In this case, it was observed that the number of paths having weight greater than the limit l decreases every time the size of the segment is increased. Secondly, the search algorithm complexity was studied when the answer to a query was an empty graph, yet with a high probability of finding some complex relationship, which was represented by a significant number of vertices reachable from the pair of inspected vertices. Also in this case, the ρ-index proved useful for its prediction of the best direction between a pair of inspected vertices. Thirdly, the behavior of the ρ-path query processing with respect to the weight limit softL imposed on the maximal weight of a searched path was inspected. This weight limit can be used to virtually limit the maximal weight of paths representing a complex relationship between the inspected vertices, yet due to the nature of the ρ-index that limit is not strict and paths having weights greater than softL can take part in the result. Finally, a further study of the nature of the ρ-index is presented to state and evaluate the amount of paths that are beyond the weight limit l for which the ρ-index was created.
The observations made in this section proved that the amount of paths indexed by the ρ-index whose weight is greater than l is very large and depends on the selected parameter setting. In some cases, paths whose weight is four times greater than the limit l are also indexed.


Chapter 6 Applying ρ-index in Citation Analysis

For an application of the ρ-index to real-life data, the domain of citation analysis was chosen. Citation analysis is a part of information retrieval and it basically mines the citation graph for important information about the vertices in the graph. Hence this chapter discusses the applicability of the ρ-index in this field of science and presents the results gained. Besides, it also presents an introduction to the various techniques for mining the citation graph, sometimes also referred to as the citation network due to its special features.

The notion of the citation network was first introduced in the mid-fifties in [29] with the purpose of representing the evolution in science. The author believes that there is semantics behind the citation association between two scientific papers, in other words an association of ideas. The citing relation ties two publications together.

6.1 Bibliometrics

Bibliometrics [56, 57] aims to study and measure texts and information. Historically, bibliometric methods have been used to trace relationships among academic journal citations. Citation analysis, which is based on the citation relation introduced earlier, is used in searching for materials and analyzing their merit. Citation indices, such as the Institute for Scientific Information's Web of Science, allow users to search forward in time from a known article to more recent publications which cite the known item. In other words, they allow the user to trace the paths in the citation graph formed by the publications in the mentioned index.

The citation graph can be analyzed to determine the popularity and impact of particular entities, which may be the publications, their authors or the journals they were published in. Using citation analysis, the importance or impact of an entity can be quantified and used for further comparison or ranking.

[Figure 6.1 diagram with four panels: citation relation, longitudinal coupling, bibliographic coupling, co-citation coupling.]

Figure 6.1: Direct relationships between a pair of inspected vertices, 1 and 2, identified in the citation analysis.

Scientometrics emerged from bibliometrics and focuses essentially on the field of scientific publications. In recent years an even narrower domain, informetrics, has evolved. It focuses only on the publications concerning informatics and computer science.

6.1.1 Citation Analysis

In this section, the most popular relationships studied in citation analysis are introduced. All of them are based on the citation relation between two publications in the graph. An overview of all relationships is depicted in Figure 6.1.

The first of them is co-citation, first introduced by Henry Small [65] in the mid-seventies. This relationship binds together two publications that are directly cited by at least one common publication. In other words, the co-cited pair of publications appears in the reference list of at least one publication. Speaking in terms of graph theory, in the citation network there exists at least one direct common predecessor for the inspected pair of vertices.

An alternative relationship is bibliographic coupling, identified by Fano [27] and Kessler [38]. It binds two papers together if they both cite at least one common publication. Speaking in terms of graph theory, edges from the two vertices point into one common vertex in the citation network; equivalently, the intersection of their reference lists is not empty.

The final coupling method is called longitudinal coupling [66]. It connects a pair of publications, an older and a newer one, by taking two steps in the same direction along the direct citation relation. In other words, it puts together two publications between which there exists an indirect citation relation represented by a path of length two.

The quality of these relationships is represented by the number of instances of the particular relationships.
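The three coupling relationships of this section, and the number of their instances, can be made concrete with a small sketch. The representation of the citation graph as a dictionary of reference sets and all function names are assumptions chosen for this illustration; they do not come from the cited works.

```python
from typing import Dict, Hashable, Set

# Assumed representation: cites[x] is the set of publications that x cites.
Citations = Dict[Hashable, Set[Hashable]]


def co_citation(cites: Citations, p, q) -> int:
    """Number of publications whose reference list contains both p and q."""
    return sum(1 for refs in cites.values() if p in refs and q in refs)


def bibliographic_coupling(cites: Citations, p, q) -> int:
    """Number of publications cited by both p and q."""
    return len(cites.get(p, set()) & cites.get(q, set()))


def longitudinal_coupling(cites: Citations, newer, older) -> int:
    """Number of two-step citation paths newer -> z -> older."""
    return sum(
        1 for z in cites.get(newer, set()) if older in cites.get(z, set())
    )
```

Each function returns a count, so a value of zero means that the relationship does not hold for the inspected pair, while larger values indicate a stronger relationship.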

6.1.2 Material Search Strategies

The techniques introduced here cope with the task of searching for interesting material in the citation graph. The task is to start the search from one or more seed publications and to retrieve the important publications found. The basic techniques, introduced in [22], are the following:

Backward chaining: a method that follows edges in their direction, i.e. the references of a given publication are retrieved.

Forward chaining: a method that follows edges against their direction, i.e. the publications that cite a given document are retrieved. To be successful, the document should be several years old to have had a chance to be cited.

Citation cycling: a backward chaining search from some recent documents followed by a forward chaining. This search can have several iterations.
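The three search techniques listed above can be sketched as simple set operations on the citation graph. The sketch below is only an illustration under assumptions: the graph is a dictionary mapping each publication to the set of publications it cites, and the function names are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set

# Assumed representation: cites[x] is the set of publications that x cites.
Citations = Dict[Hashable, Set[Hashable]]


def backward_chain(cites: Citations, seeds: Iterable[Hashable]) -> Set[Hashable]:
    """Follow edges in their direction: collect the references of the seeds."""
    found: Set[Hashable] = set()
    for seed in seeds:
        found |= cites.get(seed, set())
    return found


def forward_chain(cites: Citations, seeds: Iterable[Hashable]) -> Set[Hashable]:
    """Follow edges against their direction: collect publications citing a seed."""
    seed_set = set(seeds)
    return {pub for pub, refs in cites.items() if refs & seed_set}


def citation_cycle(cites: Citations, seeds: Iterable[Hashable],
                   iterations: int = 1) -> Set[Hashable]:
    """Backward chaining followed by forward chaining, repeated iterations times."""
    current = set(seeds)
    for _ in range(iterations):
        current = forward_chain(cites, backward_chain(cites, current))
    return current
```

Starting a citation cycle from one recent paper returns that paper together with other recent papers that share at least one reference with it.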

There are various ways to utilize these techniques. For instance, forward chaining can be used to identify the publications that should represent the ongoing research concerning the same topic discussed by the seed publication.

An important notion here is polyrepresentation [35], which represents the hypothesis that overlaps between different cognitive representations of both users' information needs and documents can be exploited for reducing the uncertainties inherent in IR [information retrieval], and thereby improve the performance of IR systems. In other words, it states the hypothesis of assigning a higher ranking to those entities that are reached by the above techniques from more of the different entities picked by the user. According to the notion of polyrepresentation, the overlaps of the searches originating from a set of seed documents picked by the user are studied to find important material that would satisfy the user's needs.

An effort introduced in [44] bypasses the sometimes difficult construction of the set of seed documents. It combines cyclic chaining with subject search to find the relevant recent papers. Firstly, it performs a certain number of similar subject searches and afterwards it studies their overlaps which, due to polyrepresentation, are believed to contain the most relevant seed publications. Next, backward chaining is initiated from this set of seed documents. Again, the overlaps of the resulting sets of publications are studied, and from those the final forward chaining is initiated to find the most recent and most relevant publications concerning the user's initial demand.

6.1.3 Indirect Citation Relationships

The direct relationships introduced earlier in this chapter have been extended to general indirect methods in [26] and [21]. The direct citation relation between two publications x → y is generalized to the indirect citation relation between two publications x →* y. Speaking in terms of graph theory, the notion of an edge is replaced with the notion of a path. Using this indirect citation relation instead of the direct one in the coupling methods depicted in Figure 6.1, the coupling methods can also be generalized to indirect ones.

Formally, the indirect longitudinal coupling then represents all paths lying between the pair of inspected publications present in the citation graph. The indirect co-citation coupling binds together each pair of publications that have a common predecessor in the network. Similarly, the indirect bibliographic relationship is identified between each pair of publications that have a common successor in the citation network.

Contrary to the direct citation relation, the indirect citation relation forms a partial order on the citation network. It has the important transitivity property that the direct citation relation does not have. The level on which publication A indirectly cites publication B denotes the minimal path length between A and B in the citation network. As for the reflexivity of the indirect citation relation, we say that each publication D cites itself on level 0. As can be seen, the citation network together with the indirect citation relation forms a partially ordered set. Intuitively, antisymmetry also holds: A cites B and B cites A only if A is B.

Important mathematical structures called lattices can be identified in partially ordered sets and, as is studied in [26], these structures can also be found as substructures in citation networks with the indirect citation relation. As will be shown in one of the following sections, the ρ-index becomes a very handy tool for finding these substructures in the citation network as well. The definition of the lattice structure is the following:

Definition 6.1.1 (Lattice). A lattice is a partially ordered set (poset) in which any two elements have a supremum and an infimum. The supremum of a subset S of a poset T is the least element of T that is greater than or equal to each element of S. The infimum of S is the greatest element of T, not necessarily in S, that is less than or equal to each element of S. Consequently, the supremum is also referred to as the least upper bound (also lub and LUB) and the infimum as the greatest lower bound (or glb and GLB).
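The level of an indirect citation defined earlier in this section is a shortest-path distance, so it can be computed with a breadth-first search. The following sketch assumes a citation graph stored as a dictionary of cited publications; the function names citation_levels and common_successors are hypothetical and serve only as an illustration.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set

# Assumed representation: cites[x] lists the publications that x cites.
Citations = Dict[Hashable, Iterable[Hashable]]


def citation_levels(cites: Citations, start) -> Dict[Hashable, int]:
    """Map each publication indirectly cited by start to its citation level.

    The level is the minimal path length from start; start itself has
    level 0, which mirrors the reflexivity convention of the text.
    """
    levels = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for cited in cites.get(current, ()):
            if cited not in levels:
                levels[cited] = levels[current] + 1
                queue.append(cited)
    return levels


def common_successors(cites: Citations, p, q) -> Set[Hashable]:
    """Publications indirectly cited by both p and q (levels of 0 included)."""
    return set(citation_levels(cites, p)) & set(citation_levels(cites, q))
```

A non-empty result of common_successors corresponds to the indirect bibliographic relationship; running the same search on the reversed graph would give the common predecessors needed for the indirect co-citation coupling.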

6.1.4 Ranking of Publications

Another important issue in searching for material in the citation network is the problem of assigning the publications ranks that reflect their importance.


A very naive and intuitive method is the count of citations that each publication receives. Although this method gives more or less good results, it does not reflect in any way the importance of the publications that cite the ranked publication. Therefore, more advanced methods have been introduced to propagate the importance of publications throughout the network.

These methods try to overcome this problem by taking into account a chain of citations. Methods primarily aimed at scoring the importance of web pages on the World Wide Web can be used to rank the publications in the citation network. The most popular ones are PageRank [17] and HITS [40, 42, 39]. Yet, there exists a method called SCEAS [64] that takes its inspiration from PageRank and is intended to be used directly in bibliometrics, since the previous two methods were shown to give biased results on citation networks [63].

HITS

The HITS algorithm is based on the importance of hubs and authorities in the graph representing the web space. According to [41], the authorities on the topic are the most prominent sources of primary content. Other pages, equally intrinsic to the structure, assemble high-quality guides and resource lists that act as focused hubs, directing users to recommended authorities. In particular, the scores of hubs and authorities are calculated using the equations:

    \vec{a}' = A^{T} \vec{h}
    \vec{h}' = A \vec{a}

where \vec{a} and \vec{h} are vectors containing the respective scores of publications as authorities and hubs and the matrix A is the usual adjacency matrix. At the end of the computation each publication is assigned two numbers representing its role as an authority and as a hub in the graph.
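The two update equations can be illustrated with a short iterative sketch. It is an assumption-laden example rather than the implementation evaluated in the cited works: the citation graph is a dictionary of reference lists, both vectors start at one, and each vector is rescaled to unit length after every step (a normalization that the equations above do not state explicitly).

```python
import math
from typing import Dict, Hashable, Iterable, Tuple

# Assumed representation: cites[x] lists the publications that x cites.
Citations = Dict[Hashable, Iterable[Hashable]]
Scores = Dict[Hashable, float]


def _normalize(scores: Scores) -> Scores:
    """Scale the score vector to unit Euclidean length (assumed step)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(cites: Citations, iterations: int = 50) -> Tuple[Scores, Scores]:
    """Return (authority, hub) scores after a fixed number of updates."""
    nodes = set(cites) | {c for refs in cites.values() for c in refs}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new_auth = {n: 0.0 for n in nodes}
        new_hub = {n: 0.0 for n in nodes}
        for citing, refs in cites.items():
            for cited in refs:
                new_auth[cited] += hub[citing]   # a' = A^T h
                new_hub[citing] += auth[cited]   # h' = A a
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return auth, hub
```

On a citation network, a publication with a long reference list of highly cited works obtains a high hub score, while a publication cited by many such lists obtains a high authority score.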


PageRank

This algorithm is also very popular in the context of the World Wide Web. It computes the score S_j for each object j in the web space using the following formula:

    S_j = (1 - d) + d \sum_{i \to j} \frac{S_i}{N_i}

where d is a damping factor usually set to 0.85 and N_i is the number of citations made by object i, i.e. its out-degree. PageRank associates a high score with object i when there exists a big strongly-connected component C in which some nodes point to i. The more and larger the cycles in C in which i participates, the bigger the score i receives. Nevertheless, cycles do not often appear in citation networks and if they do, they usually represent self-citations. Thus, the resulting score boosted by the self-citation cycles is, according to [63], biased.
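The PageRank formula of this section can be evaluated by repeated substitution. The sketch below is a minimal illustration under assumptions: the citation graph is a dictionary of reference lists without duplicates, all scores start at one, and a fixed number of iterations replaces a convergence test. The function name is hypothetical.

```python
from typing import Dict, Hashable, Sequence

# Assumed representation: cites[i] lists the publications cited by i,
# so N_i in the formula equals len(cites[i]).
Citations = Dict[Hashable, Sequence[Hashable]]


def pagerank_scores(cites: Citations, d: float = 0.85,
                    iterations: int = 100) -> Dict[Hashable, float]:
    """Iterate S_j = (1 - d) + d * sum over i -> j of S_i / N_i."""
    nodes = set(cites) | {c for refs in cites.values() for c in refs}
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new_score = {n: 1.0 - d for n in nodes}
        for citing, refs in cites.items():
            if not refs:
                continue  # a publication without references passes nothing on
            share = d * score[citing] / len(refs)
            for cited in refs:
                new_score[cited] += share
        score = new_score
    return score
```

In an acyclic citation network the iteration settles after as many steps as the longest citation chain, and a publication that receives no citations keeps the minimal score 1 - d.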

SCEAS

This method computes the ranking score using the following formula:

    S_j = (1 - d) + d \sum_{i \to j} \frac{S_i + b}{N_i} a^{-1}    (a \geq 1, b > 0)

where b denotes the direct citation enforcement factor and a represents the speed with which an indirect citation enforcement converges to zero. It means that a change of i's score affects the score of j that is x vertices away by the factor a^{-x}, favoring the scores of closer objects over those of farther ones. In the tests, a was picked to be equal to e, which means that a^{-x} is close to zero for x > 7.

An overall comparison of the presented ranking methods on a citation network is presented by the authors of SCEAS in [63]. The citation network was formed from the DBLP [49] digital library. The results of the experiments were verified against the VLDB 10 Year Award and the SIGMOD Test of Time Award. At the time of the experiment the data set represented, at first glance, a fair amount of publications: 588,865.


However, the amount of publications that had an out-degree greater than zero, i.e. the publications that referenced some publication that is also present in DBLP, was only 8,183, roughly 1.4% of the overall amount of publications present in the digital library.

All the publications published at these two conferences that were eligible for these awards were ranked using all three dynamic ranking methods. Afterwards, it was checked how the award-winning publications actually stood in the lists of scores. The static method based on the plain direct citation count was used as the baseline for the evaluation.

The results show that the best method is SCEAS, followed by the PageRank scoring algorithm, and that, surprisingly, the plain citation count outperforms the HITS method. Yet, for some particular award winners the discrepancies in ranks were high. The authors attribute this phenomenon to two major reasons. Firstly, the citation network is incomplete: only a small fraction of the overall number of publications in DBLP have their citations also included in the digital library. Secondly, all these methods are based only on the citation relation. However, the awards are by definition subjective and combine, besides the objective reasons, also other measures and indicators of impact.
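To make the role of the factors a and b in the SCEAS formula of this section more tangible, the following sketch evaluates the formula by repeated substitution. It is not the implementation of the SCEAS authors: the function name, the dictionary representation of the citation graph, the fixed iteration count and the default values d = 0.85 and b = 1 are assumptions made for this example; only a = e is taken from the text.

```python
import math
from typing import Dict, Hashable, Sequence

# Assumed representation: cites[i] lists the publications cited by i,
# so N_i in the formula equals len(cites[i]).
Citations = Dict[Hashable, Sequence[Hashable]]


def sceas_scores(cites: Citations, d: float = 0.85, a: float = math.e,
                 b: float = 1.0, iterations: int = 100) -> Dict[Hashable, float]:
    """Iterate S_j = (1 - d) + d * sum over i -> j of (S_i + b) / (N_i * a)."""
    if a < 1 or b <= 0:
        raise ValueError("the formula requires a >= 1 and b > 0")
    nodes = set(cites) | {c for refs in cites.values() for c in refs}
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new_score = {n: 1.0 - d for n in nodes}
        for citing, refs in cites.items():
            if not refs:
                continue  # nothing to distribute without references
            share = d * (score[citing] + b) / (len(refs) * a)
            for cited in refs:
                new_score[cited] += share
        score = new_score
    return score
```

Because every step along a citation chain divides the propagated score by a, a score change x steps away is damped by a^{-x}; for a = e and x = 7 this factor is already below one thousandth.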

6.1.5 Topological Studies of a Citation Network

Topological studies of large networks are used to gain insight into the structure of the many fields that form such networks. Examples are the power grid of the United States, the World Wide Web and also citation networks [77]. Since the pioneering work of [29], which unveiled the structure underlying scientific work, the properties of citation networks have been studied in a number of works [47, 74, 61].

Firstly, the citation network is examined from the statistical point of view. In [61], numerical data on the distribution of citations are examined. The examination was made on two data sets. The first represents almost 800,000 papers with nearly 7,000,000 citations among them, taken from the ISI's¹ Science Journal Citation Reports. The other data set was formed from Physical Review D, volumes 11 to 50, taken from the SPIRES database, and represents about 25,000 papers with around 350,000 citations. The distribution of citations was investigated in order to estimate the relative popularity of an individual paper in the citation graph as an alternative to the usual methods, e.g. the plain citation count. The main conclusion of this investigation is that the distribution follows the power law, yet exhibits some fluctuations, since papers are cited over a certain period of time and the data comprised papers in different phases of this life cycle.

¹ Institute for Scientific Information, Philadelphia

The application of the power law to the citation distribution is further refined in [47], where the power law distribution is defined differently for two parts of the data set so that the curve fits the estimated distribution better. Up to this point, only the in-degree distribution, representing the cite relation, had been studied. The author of [74] focuses on both the in-degree and the out-degree distribution and verifies the thesis that both follow the same behavior.

The conclusions of both previously cited works are again verified in [83], which focuses directly on the publications that appeared in the journal Scientometrics between 1978 and 2004. Notice that scientometrics is in fact bibliometrics applied to scientific publications, as was mentioned earlier. The topological properties of the whole journal and their evolution in time are studied, and conclusions similar to those on the previously studied data sets are stated. The results are closely compared to the small world phenomenon, which was first introduced in social studies [54] and afterwards put into the context of network studies in [76].
Another way to study the shape and topology of a citation network is to apply to it the sociometric methods developed for social networks. The authors of [34] try to identify the critical path in DNA theory through sociometric network analysis. The analysis is carried out on DNA theory represented by 40 carefully selected milestone papers and the citations among them. The connectivity among vertices and the properties of paths are studied to find the general structure and the social process of the DNA theory network. In particular, the frequency with which particular edges appear in the paths of the network was studied. These methods then made it possible to identify the most important sequence of publications in DNA theory, called the critical path of the DNA theory. The investigated paths lay between the oldest paper in the theory and all the terminators in the network, which is analogous to evaluating the ρ-path operator for a pair of starting and terminating vertices. This gives another strong motivation for the discovery of a structure for efficient evaluation of ρ-operators in directed graphs.

6.1.6 Mapping of Science

The mapping of science is a spatial representation of how scientific fields, specialties, papers or authors are related to each other by means of object proximity and relative location.

The first effort in mapping science is Garfield's own structure, called the historiograph, introduced in [30]. It is a chronologically ordered network diagram whose nodes are visualized publications. Two types of edges are recognized: a strong arrow, representing a direct relationship with a publication recognized as important, and a weak arrow, representing an acknowledgement of relevant work without explicitly citing the particular paper. Again, the paths composed of edges representing strong relationships between publications are investigated and are supposed to represent significant evolution in science.

Other approaches to mapping aim to discover clusters in the database of examined publications, i.e. to put closely related publications together. Yet publications intrinsically have no coordinates. The domain of the visualized objects, the publications, therefore resembles a metric space [84], and the crucial challenge of the visualization effort is to find the best dissimilarity measure between two publications in the visualized database.

In [67], the proposed approach stems from the co-citation similarity measure. The authors also discuss the alternative use of other coupling-based similarity measures, or even the combination of several coupling methods into one measure. The proposed methodology then transforms the similarity coefficients into distances such that closely related objects are short distances apart and weakly related objects appear farther away. The chosen ordination method is triangulation [46]. In a two-dimensional space, the triangulation begins arbitrarily with one of the objects, which is placed at the origin of the coordinate system. Then the object closest to it is found and placed at the specified distance, in an arbitrary direction in the plane. The location of the third object is fixed by its distances to the first two objects.

So the overall visualizing procedure consists of three steps:

1. the creation of a multilevel hierarchy of clusters or partitions, starting with individual papers,
2. an ordination of the objects within each cluster, and
3. the integration of the local structures into a global one in a common coordinate system.

The resulting maps resemble a volvox, a kind of small organism, because smaller, lower-level objects are represented as circles within larger, higher-level objects. The linkages between particular clusters are formed by strong co-citations between those clusters.
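The placement of the first three objects in the triangulation described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of [46] or [67]: the function name `triangulate` is hypothetical, the second object is placed on the positive x-axis (the text only says that the direction is arbitrary), and of the two mirror-image solutions for the third object the one with a non-negative y coordinate is chosen.

```python
import math


def triangulate(d12, d13, d23):
    """Place three objects in the plane from their pairwise distances.

    Object 1 goes to the origin, object 2 on the x-axis at distance d12
    (d12 must be positive), and object 3 where its distances to the
    first two objects are d13 and d23.
    """
    p1 = (0.0, 0.0)
    p2 = (d12, 0.0)
    # Intersection of the circles around p1 (radius d13) and p2 (radius d23)
    x = (d13 ** 2 - d23 ** 2 + d12 ** 2) / (2.0 * d12)
    # Clamp at zero: distances that violate the triangle inequality
    # cannot be embedded exactly, so the point is projected onto the x-axis
    y = math.sqrt(max(d13 ** 2 - x ** 2, 0.0))
    return p1, p2, (x, y)
```

For the distances 3, 4 and 5 the sketch places the third object at (0, 4), so all three pairwise distances are reproduced exactly.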

6.2 Implementing Indirect Complex Relationships Using ρ-operators

Considering the indirect relationships in citation analysis introduced in Section 6.1.3, there is an obvious correlation between them and the ρ-operators defined for directed graphs, as was first mentioned in [10]. The ρ-path operator represents all paths forming the indirect citation relationship between two inspected publications, i.e. vertices in the citation network.

The indirect co-citation relationship can be implemented by the ρ-connectionTo operator, since the indirect coupling is defined as a pair of chains of publications that originate in the two inspected publications and have one common terminal vertex. This connection represents a common successor, a publication that is indirectly co-cited by the two inspected publications.

Lastly, the bibliographic coupling is a reversed form of the indirect co-citation coupling and can be implemented by the ρ-connectionFrom operator. It returns all pairs of paths that originate in the common connection and terminate in the two inspected vertices, respectively. The semantic representation of the connection is the publication that indirectly cites the two inspected publications.

[Figure 6.2: Vertex degree distribution in the citation graph. Three panels plot the number of vertices (log10) against the in-degree, the out-degree and the total degree.]

The difference between the citing relationships in citation analysis and the ρ-operators for graph structured data is in the form of the result: a citing relationship yields pairs of publications that are in the particular relation, whereas a ρ-operator yields a set of paths, or pairs of paths, that represent the particular relationship found. This gives the user the possibility to further examine the quality of the relationship found between the pair of investigated vertices (publications).

The key idea presented in this chapter is to apply the indexing techniques developed for general graph structured data to a citation network and to study the semantics of the retrieved paths and publications. Since this thesis focuses mainly on efficient query answering for the ρ-path operator, the next section presents the results gained by querying the ρ-path operator in a citation network.
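To make the two connection operators described in this section more concrete, the sketch below enumerates them naively on a small directed graph. It does not use the ρ-index and is not the formal definition of the operators: the function names, the exclusion of single-vertex paths and the absence of any disjointness requirement between the two paths of a pair are assumptions made for illustration. Edges are stored as `successors[x]`, the list of vertices that x points to, and the path length limit counts vertices.

```python
def paths_from(successors, start, limit):
    """All simple paths with at most `limit` vertices that begin at `start`."""
    found = []

    def extend(path):
        found.append(tuple(path))
        if len(path) == limit:
            return
        for nxt in successors.get(path[-1], ()):
            if nxt not in path:  # keep the path simple
                extend(path + [nxt])

    extend([start])
    return found


def connection_to(successors, u, v, limit):
    """Pairs of paths starting in u and v that end in a common vertex."""
    by_end = {}
    for p in paths_from(successors, u, limit):
        if len(p) > 1:  # assumption: ignore the trivial one-vertex path
            by_end.setdefault(p[-1], []).append(p)
    return [(p, q)
            for q in paths_from(successors, v, limit) if len(q) > 1
            for p in by_end.get(q[-1], [])]


def connection_from(successors, u, v, limit):
    """Pairs of paths starting in a common vertex and ending in u and v."""
    reverse = {}
    for a, targets in successors.items():
        for b in targets:
            reverse.setdefault(b, []).append(a)
    # Search in the reversed graph, then turn the paths around again
    return [(p[::-1], q[::-1])
            for p, q in connection_to(reverse, u, v, limit)]
```

The reversal trick in `connection_from` mirrors the remark above that one operator is the reversed form of the other.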


6.2.1 Data Set

The data set representing the citation network is a part of the CiteSeer [15, 16] database of scientific publications and the citations among them. The data set used for evaluation was created by taking one core publication and running a breadth-first search for all publications weakly accessible from the core one until the desired size of the set was reached. Weak accessibility ignores the orientation of edges in the graph. Therefore, publications that are not connected to the core by a directed path in the whole citation network also appear in the data set. Yet those publications are related to the core one, because they lie in the same component due to the weak connectivity.

The number of publications in the data set was set to 30,000. This number was selected to allow comparison with the synthetic random graphs used for the evaluation of the ρ-index in Chapter 5. The number of edges among the vertices in the data set is 63,584. The distribution of the respective degrees of the vertices in this citation graph is shown in Figure 6.2. The x-axis represents the particular vertex degree, i.e. the number of edges initiated in the vertex, terminated in it, or the total of both, and the y-axis represents the number of vertices having this degree. The x-axis is drawn using a logarithmic scale to give a clearer view of the curve's progress. The logarithm was also applied to the values on the y-axis to make the distribution more readable.

The in-degree represents the number of citations of each particular publication. Notice that this distribution follows the power law already observed in Section 6.1.5: in the testing citation graph there is a small number of vertices with a large in-degree and a large number of vertices whose in-degree is very small.
This fact conforms exactly with reality, where most publications receive a small number of citations, and was also presented in [5]. In contrast, the out-degree represents the number of references that the particular publication makes. This number is not always accurate, since CiteSeer does not contain all the references of each publication in its database.

The testing citation graph was built using Van Rijsbergen's Information Retrieval [73] as the core publication. It was identified as a very important publication in the IR field because of its high number of citations by other publications and also because of its recognition among information retrieval experts. Thus the testing citation graph represents the scientific field of information retrieval.

Path length   Amount of paths   Distinct vertices
     4               2                  5
     5               7                 13
     6              17                 32
     7              27                 51
     8              33                 58
     9              48                 59
    10              62                 60
    11              41                 61
    12              24                 61
    13              20                 61
    14               8                 61
  Total            289

Table 6.1: A summary of paths found between the reference [19] and the core publication.
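The construction of the data set described in this section (a breadth-first search from the core publication that ignores edge orientation and stops at a target size) and the degree statistics plotted in Figure 6.2 can be sketched as follows. The function names and the representation of the citation graph as a list of (citing, cited) pairs are assumptions made for this example; the actual extraction from CiteSeer is not described at this level of detail.

```python
from collections import Counter, deque


def weakly_reachable_sample(edges, core, size):
    """Collect up to `size` vertices by BFS from `core`, ignoring direction."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)
    visited = {core}
    queue = deque([core])
    while queue and len(visited) < size:
        u = queue.popleft()
        for w in neighbours.get(u, []):
            if w not in visited:
                visited.add(w)
                queue.append(w)
                if len(visited) == size:  # desired size reached
                    break
    return visited


def degree_histograms(edges, vertices):
    """Histograms {degree: number of vertices} of the induced subgraph."""
    induced = [(u, v) for u, v in edges if u in vertices and v in vertices]
    in_degree = Counter(v for _, v in induced)
    out_degree = Counter(u for u, _ in induced)
    in_hist = Counter(in_degree.get(x, 0) for x in vertices)
    out_hist = Counter(out_degree.get(x, 0) for x in vertices)
    return in_hist, out_hist
```

Plotting the logarithm of the histogram values against the degree gives curves of the kind shown in Figure 6.2.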

6.2.2 ρ-path Results

The idea followed in the conducted experiments is this: if we come across a newer publication that we consider interesting for our research and that falls into the same scientific field as the core book, then there is a high probability that it cites the core book either directly or indirectly. If there is an indirect citation, then it is also possible that more than one indirect citation path can be found. In this case we would like to study all paths up to a certain length lying between our recent (seed) publication and the core book. The vertices on these paths form a set of publications that deserve a further study of their importance by the user.

For our experiment we chose W. Bruce Croft's Predicting query performance [19] as the seed publication. The creation phase of the ρ-index for the testing citation network took on average around 20 minutes on the same machine on which the experiments with the synthetic random graphs were conducted. The slightly longer time is attributed to the different graph segmentation method used to assign the vertices to their clusters. The segmentation method used took into account the topological order, which had to be pre-computed for each graph at each level. This method is described in more detail in Section 4.3.1.

The ρ-index was created for the testing citation graph with the limit l again set to ten. Subsequently, the ρ-path queries were evaluated. Owing to the nature of the ρ-index discussed earlier in Chapter 4, we got all the paths up to the length of ten and some longer ones. Table 6.1 summarizes the numbers of paths found according to their length, together with the total number of distinct vertices on all paths up to that length.

[Figure 6.3: Publication search result visualization between the seed publication 1 and core publication 32. For readability, the result comprises only paths up to the length of 6. The vertices are placed on a timeline running from 2002 back to 1993, with the core publication in 1979.]

Legend of Figure 6.3:
 1  Predicting Query Performance
 2  Relevance-Based Language Models
 3  Relating the New Language Models of Information Retrieval to the Traditional Retrieval Models
 4  Bridging the Lexical Chasm: Statistical Approaches to Answer-Finding
 5  Improving the Effectiveness of Informational Retrieval with Local Context Analysis
 6  Statistical Models for Tracking and Detection
 7  The Mirror DBMS at TREC-8
 8  OCELOT: A system for summarizing web pages
 9  Dragon's Tracking and Detection Systems for the TDT2000 Evaluation
10  Topic Tracking in a News Stream
11  Probabilistic Latent Semantic Indexing
12  Automated Text Summarization in SUMMARIST
13  A General Language Model for Information Retrieval
14  A Hidden Markov Model Information Retrieval System
15  Summarizing Text Documents: Sentence Selection and Evaluation Metrics
16  Unsupervised Learning from Dyadic Data
17  MiRRor: Multimedia Query Processing in Extensible Databases
18  Effective Retrieval with Distributed Collections
19  Topic Detection and Tracking Pilot Study Final Report
20  On-line New Event Detection and Tracking
21  Boosting and Rocchio Applied to Text Filtering
22  Study on Retrospective and On-Line Event Detection
23  The TREC-5 Filtering Track
24  Context-Sensitive Learning Methods for Text Categorization
25  Text-Based Information Retrieval Using Exponentiated Gradient Descent
26  Training Algorithms for Linear Text Classifiers
27  Providing Government Information on the Internet: Experiences with THOMAS
28  Evaluating and Optimizing Autonomous Text Classification Systems
29  Corpus-Specific Stemming using Word Form Co-occurrence
30  Optimizing Ranking Functions: A Connectionist Approach to Adaptive Information Retrieval
31  Using Statistical Testing in the Evaluation of Retrieval Experiments
32  Information Retrieval

The search complexity was 408 transformed vertices. This number is less than the numbers presented in Chapter 5 for the searches

in the synthetic random graphs. This is attributed to the fact that the citation network is better ordered with respect to the topological order on DAGs than the randomly generated graphs, even though the numbers of vertices and edges are alike in both cases. The number of vertices accessible from the starting vertex was 483 with the depth of search being ten.

Figure 6.3 shows the network of paths up to length 6, where by the length of a path we mean the number of vertices in the path. In that figure, the vertices represent publications and are placed on the background of a timeline to make the result more readable. Although the ρ-index was created to index all the paths up to the length of ten and, as Table 6.1 shows, it indexes some longer ones as well, Figure 6.3 shows only the paths up to length 6, since it would be very hard to follow if it contained all the paths obtained from the ρ-index.

Path length   Amount of paths   Distinct vertices
     3               2                  4
     4               2                  6
     5               4                  9
     6               4                 14
     7               9                 22
     8               9                 26
     9               8                 29
    10               5                 34
    11               4                 34
    12               2                 34
  Total             49

Table 6.2: A summary of paths found between the reference [58] and the core publication.

[Figure 6.4: Publication search result visualization between the seed publication 61 and core publication 32. Grey nodes denote the intersection with the result of the search from Figure 6.3. The vertices are placed on a timeline running from 2002 back to 1992, with the core publication in 1979.]

Legend of Figure 6.4:
61  Discovering Word Senses from Text
62  An Incremental Multi-Centroid, Multi-Run Sampling Scheme for k-medoids-based Algorithms
63  A Comparison of Document Clustering Techniques
64  Natural Language Information Retrieval: TREC-8 Report
65  A Practical Clustering Algorithm for Static and Dynamic Information Organization
66  CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling
67  A Fuzzy Relative of the k-Medoids Algorithm with Application to Web Document and Snippet Clustering
68  On the Merits of Building Categorization Systems By Supervised Clustering
69  Document Categorization and Query Generation on the World Wide Web Using WebACE
70  Hypergraph Based Clustering in High-Dimensional Data Sets: A Summary of Results
71  Web Document Clustering: A Feasibility Demonstration
72  Static and Dynamic Information Organization with Star Clusters
73  WebACE: A Web Agent for Document Categorization and Exploration
74  CURE: An Efficient Clustering Algorithm for Large Databases
75  Clustering Large Datasets in Arbitrary Metric Spaces
76  Principal Direction Divisive Partitioning
77  Clustering In A High-Dimensional Space Using Hypergraph Models
78  Evaluation of Syntactic Phrase Indexing - CLARIT NLP Track Report
79  Incremental Clustering and Dynamic Information Retrieval
80  Computing Dense Clusters On-line for Information Organization
81  Almost-Constant-Time Clustering of Arbitrary Corpus Subsets
82  An Evaluation of Techniques for Clustering Search Results
83  Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results
84  Using Linear Algebra for Intelligent Information Retrieval
85  Rich Interaction in the Digital Library
30  Optimizing Ranking Functions: A Connectionist Approach to Adaptive Information Retrieval
87  Automatic Combination of Multiple Ranked Retrieval Systems
88  An Association Thesaurus for Information Retrieval
89  Viewing Morphology as an Inference Process
90  An Interface for Navigating Clustered Document Sets Returned by Queries
91  Dimensions of Meaning
92  Lexical Ambiguity and Information Retrieval
93  Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections
32  Information Retrieval

A second search was conducted for another reference publication, Discovering word senses from text [58]. The statistics of the result of the ρ-path operator applied to this pair of vertices in the citation network are summarized in Table 6.2. The paths found by the second search are visualized in Figure 6.4. The grey vertices in this figure denote the vertices that appear in both the first and the second search. The black vertices are the vertices that do not lie on any path with at least one grey vertex.
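The kind of result summarized in Tables 6.1 and 6.2 can be reproduced on a small graph with the naive sketch below. It enumerates the paths by a plain depth-first search and therefore says nothing about how the ρ-index evaluates the query; unlike the ρ-index, it also returns no paths longer than the limit. The function names are hypothetical, `successors[x]` is assumed to list the publications cited by x, and the length of a path is the number of its vertices, as in this section.

```python
from collections import Counter


def rho_path_naive(successors, start, end, limit):
    """All simple paths from `start` to `end` with at most `limit` vertices."""
    paths = []

    def extend(path):
        last = path[-1]
        if last == end:
            paths.append(tuple(path))
            return
        if len(path) == limit:
            return
        for nxt in successors.get(last, ()):
            if nxt not in path:  # no vertex may repeat on a path
                extend(path + [nxt])

    extend([start])
    return paths


def summarize(paths):
    """Rows (length, number of paths, distinct vertices up to that length)."""
    counts = Counter(len(p) for p in paths)
    rows, seen = [], set()
    for length in sorted(counts):
        for p in paths:
            if len(p) == length:
                seen.update(p)
        rows.append((length, counts[length], len(seen)))
    return rows
```

The third column of each row is cumulative, which is why the number of distinct vertices in the tables stops growing once the longer paths only reuse publications already seen on shorter ones.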

6.2.3 Semantic Analysis

As was foreshadowed in the previous subsection, the results of the two conducted publication searches can be combined to filter out some of the vertices and paths from the result. How the result is influenced by the context of the other search result can be seen in Figure 6.4. The important vertices are the grey ones, which denote the altered result. Only those paths that contain at least one of those vertices, excluding the core publication that is common to all paths, form the influenced result. The white vertices on those paths then form the context-influenced result, and the black vertices are the ones filtered out according to the result of the first search.

Similarly, the same pruning process can be done with the first publication search result with respect to the second search result. The summary of this is given in Tables 6.3, 6.4 and 6.5. Tables 6.3 and 6.4 present the vertices that take part in the pruned result, where the marker (X) identifies the vertices common to both search results. Table 6.5 then gives the summary of the vertices that were filtered out from the first result by the second search result.

Year   #    Title
2002    1   Predicting Query Performance
2001    2   Relevance-Based Language Models
2000    3   Relating the New Language Models of Information Retrieval to the Traditional Retrieval Models
        4   Bridging the Lexical Chasm: Statistical Approaches to Answer-Finding
        5   Improving the Effectiveness of Informational Retrieval with Local Context Analysis
        6   Statistical Models for Tracking and Detection
        7   The Mirror DBMS at TREC-8
        8   OCELOT: A system for summarizing web pages
        9   Dragon's Tracking and Detection Systems for the TDT2000 Evaluation
1999   10   Topic Tracking in a News Stream
       13   A General Language Model for Information Retrieval
       15   Summarizing Text Documents: Sentence Selection and Evaluation Metrics
       33   The Beta-Binomial Mixture Model and Its Application to TDT Tracking and Detection
       34   The Beta-Binomial Mixture Model For Word Frequencies In Documents With Applications To Information Retrieval
       35   The Mirror MMDBMS architecture
       36   Statistical Models for Text Segmentation
1998   18   Effective Retrieval with Distributed Collections
       19   Topic Detection and Tracking Pilot Study Final Report
       20   On-line New Event Detection and Tracking
       37   INQUERY and TREC-8
1997   38   Text Segmentation by Topic
       39   Learning Routing Queries in a Query Zone

Table 6.3: A summary of publications that form the vertices of the search between the publication [19] and core publication [73] concerning the intersection with paths of the search with reference publication [58]. Part 1

Let us consider Figure 6.4, where the second search result is influenced by the first one. It can be seen that the vertices belonging to the intersection of the two results are placed rather near the common core publication, while the filtered-out vertices are spread along the full length of the resulting network. In this particular example, two branches of the network formed by the publications number 90 and 83 were filtered out. As is apparent from the titles of the filtered-out publications, their common subject is text document clustering.

The lattice-like structure of the search result has already been discussed in Section 6.1.3. The filtered out part of the resulting network

can be conceived as a sublattice. In this particular case, the sublattice represents the publications on the topic of text document clustering.

Year   #        Title
1996   25       Text-Based Information Retrieval Using Exponentiated Gradient Descent
       26       Training Algorithms for Linear Text Classifiers
       40       Pivoted Document Length Normalization
       41       Incremental Relevance Feedback for Information Filtering
       42       Query Expansion Using Local and Global Document Analysis
1995   27       Providing Government Information on the Internet: Experiences with THOMAS
       29       Corpus-Specific Stemming using Word Form Co-occurrence
       43       Execution Performance Issues in Full-Text Information Retrieval
       44       Searching Distributed Collections With Inference Networks
       45       TREC and TIPSTER Experiments With INQUERY
       46       Recent Experiments with INQUERY
1994   30       Optimizing Ranking Functions: A Connectionist Approach to Adaptive Information Retrieval
       47       Document Retrieval and Routing Using the INQUERY System
       88   X   An Association Thesaurus for Information Retrieval
1993   31       Using Statistical Testing in the Evaluation of Retrieval Experiments
       89   X   Viewing Morphology as an Inference Process
       48       Learning Strategies for an Adaptive Information Retrieval System using Neural Networks
1992   91   X   Dimensions of Meaning
       92   X   Lexical Ambiguity and Information Retrieval
       93   X   Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections
1979   32   X   Information Retrieval

Table 6.4: A summary of publications that form the vertices of the search between the publication [19] and core publication [73] concerning the intersection with paths of the search with reference publication [58]. Part 2

Year   Publication
1999   A Theory of Proximity Based Clustering: Structure Detection by Optimization
       Topic-Based Language Models Using EM
       A Hidden Markov Model Information Retrieval System
       Automated Text Summarization in SUMMARIST
       Probabilistic Latent Semantic Analysis
       Probabilistic Latent Semantic Indexing
1998   Boosting and Rocchio Applied to Text Filtering
       The TREC-7 Filtering Track: Description and Analysis
       MiRRor: Multimedia Query Processing in Extensible Databases
       A Study on Retrospective and On-Line Event Detection
       Unsupervised Learning from Dyadic Data
1997   The TREC-5 Filtering Track
       Using and Combining Predictors That Specialize
1996   Context-Sensitive Learning Methods for Text Categorization
       Xerox Site Report: Four TREC-4 Tracks
1995   A Comparison of Classifiers and Document Representations for the Routing Problem
       Evaluating and Optimizing Autonomous Text Classification System

Table 6.5: Filtered out publications from the first publication search using the second publications search.

Figure 6.5 depicts the generalization of the above observations made on the two publication searches. The grey area denotes the intersection of the two lattices. The black area identifies the sublattices that were filtered out. Finally, the white area is formed by the paths that contain at least one of the vertices from the grey part representing the intersection. Because the black area is a sublattice of the result, it is likely that it represents some closely related cluster of publications.

According to the notion of polyrepresentation discussed in Section 6.1.2, these pieces of knowledge can be used in two ways. Firstly, the publications in the intersection were acquired by two different searches, and thus they represent something more important than those that were filtered out from the search result. Secondly, these results can be used as a whole in combination with another type of publication search or recommendation system.
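The pruning of one search result by the context of another, as described in this subsection, can be sketched as a few set operations on the retrieved paths. The function names are hypothetical and the sketch assumes that each search result is available as a list of paths (tuples of publication identifiers) ending in the same core publication; the colours in the comments refer to Figures 6.4 and 6.5.

```python
def vertices_of(paths):
    """Set of all vertices that appear on the given paths."""
    return {v for p in paths for v in p}


def filter_by_context(paths, context_paths, core):
    """Prune `paths` using the vertices it shares with `context_paths`."""
    # Grey: vertices found by both searches, the common core excluded
    shared = (vertices_of(paths) & vertices_of(context_paths)) - {core}
    # Paths that contain at least one grey vertex form the influenced result
    kept = [p for p in paths if shared.intersection(p)]
    # White: the remaining vertices on the kept paths
    influenced = vertices_of(kept) - shared - {core}
    # Black: vertices that lie on no path with a grey vertex
    filtered_out = vertices_of(paths) - vertices_of(kept) - {core}
    return shared, kept, influenced, filtered_out
```

In the small example used to check the sketch, a whole branch of the second result that shares nothing but the core with the first result ends up in `filtered_out`, which corresponds to the filtered-out sublattice discussed above.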

Implementing Forward Chaining by ρ-index

The main drawback of this publication recommendation system is the same as that of the systems based on bibliographic coupling [44]: it is unable to find publications newer than the seed publication. Our approach is unable to find any publication that does not take part in any of the paths lying between the pair of input publications. This could be solved by implementing the forward chaining presented in Section 6.1.2, that is, by having only one input vertex and still being able to conduct a search for all paths terminated at this vertex.

This task could also be solved using the ρ-index and an alteration of the ρ-path search algorithm. Recall how the algorithm worked: it first determined all segments to which the start and end vertices belonged at all levels of the structure, and then it transformed the paths stored in the top-most matrix entry into paths of the lowest level of the structure.

The altered algorithm would read not only one entry in the top-most matrix but the whole column denoted by the ending vertex, because the retrieved sequences of segments denote exactly those sequences of segments of all indexed paths that are terminated in the known ending vertex. The results of searches initiated from the publications identified as important by the publication search introduced earlier could again be subject to overlap studies from the polyrepresentation point of view.

[Figure 6.5: A generalization of the influence of two publication search results conducted between seed publications S1, S2 and a core publication C.]

6.2.4 Summary

The proposed approach used one core publication and, as we have seen, it yielded a fair amount of publications from the citation graph because the core publication is well known. If a less well-known publication were used as the core, the system would not be able to retrieve a reasonably large set of publications. For this reason, the approach could be improved to carry out the search not with a single core publication but with a set of core publications. The ρ-index would then find all the paths to each publication from the core set and put the results together. This improvement raises another interesting issue, since the retrieved networks can overlap and that information can also be used in the further recommendation process.

Other ranking methods based on the paths in the citation network, like those discussed in Section 6.1.5, can further be applied to the search result to identify the important paths and vertices within it. The vertices may, for example, be ordered according to the number of paths in the resulting network in which they appear. Yet this work is beyond the scope of this thesis, which mostly focuses on the efficiency of searching and indexing graph structured data.
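The ordering of vertices by the number of paths they appear in, mentioned above as one possible ranking, can be sketched in a few lines. The function name is hypothetical, the tie-breaking by identifier is an arbitrary choice made only to keep the output deterministic, and the sketch is not evaluated anywhere in this thesis.

```python
from collections import Counter


def rank_by_path_count(paths):
    """Order vertices by the number of result paths they lie on."""
    counts = Counter(v for p in paths for v in set(p))
    # Most frequent first; ties are broken by the vertex identifier
    return sorted(counts.items(), key=lambda item: (-item[1], str(item[0])))
```

The seed and the core publication lie on every path and therefore always come first, so in practice they would probably be left out of the ranking.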

Chapter 7

Conclusion

This thesis addresses problems that are emerging in many fields that have recently become very popular in computer science, e.g. the Semantic Web or various network analyses. The thesis focuses on the processing of queries denoted by the ρ-operators, especially on efficient query processing of the ρ-path operator, with a future extension to both ρ-connection operators that would cover the query processing of the whole family of ρ-operators.

This chapter summarizes the tackled issues, draws conclusions, sums up the achieved contributions and finally presents the directions of future research on indexing graph structured data.

7.1 Summary

Firstly, the reader is provided with the basic necessities of graph theory. Examples are given of various graph queries that can be processed on graph structured data: the simplest is the reachability query, followed by the family of ρ-operators representing the search for complex relationships, and finally the graph containment decision queries. Nevertheless, the most important part of Chapter 2 presents the formalism of the cornerstone of the research presented in this thesis, which is the concept of graph segmentation together with the theory of its recursive application towards path-preserving graph simplification.

The problem of searching for complex relationships has proven to be an important and interesting one. As presented in Chapter 3, there are just a few algorithmic approaches to this problem. The most promising of them, Tarjan's algorithm for the solution of the single source path problem, turned out to have a major stumbling block in poor scalability, which is attributed to the internal matrix representation of the graph during the computation of the regular expression representing all paths between the selected vertex and all the remaining ones.

Some indexing structures have also been presented, but their usefulness for general graph structured data and their scalability are either questionable or not studied. Therefore, this thesis presents the design and evaluation of an indexing structure aimed at efficient processing of the various path queries represented by the ρ-operators.

Since other kinds of graph queries are also presented, various algorithmic or indexing techniques that process each kind of query efficiently are presented as well. This gives a wider view of the indexing problems in graph structured data, which are all closely related; in solving the problems that arose while designing the indexing structure for ρ-operators, the author took inspiration from the efforts made for the other presented queries, which are not actually solved by the ρ-index.

The designed indexing structure proved very useful in processing the ρ-path queries, as is shown in Chapter 5 by the results gained using this indexing structure on the synthetic data. An integral part of the design is the development of an auxiliary structure used both during the creation phase of the index and during the query processing itself. The transcription graph, designed for this purpose, is general enough to make the evaluation of an arbitrary query a transparent process that can be easily controlled and adapted to immediate needs.

Chapter 6 gives a brief introduction to bibliometrics, and to citation analysis in particular, because the concepts studied there provided the real-life data for the evaluation of the ρ-index, which is the most important part of that chapter.
Various techniques for searching the citation network are presented together with possible ranking algorithms. The indirect extensions of various coupling methods and their correlation to the ρ-operators are then discussed. In the subsequent evaluation part of this chapter, the ρ-index is applied to the real-life data and the semantic conclusions of this application are presented. Completely new directions of semantic studies derived from the use of the ρ-index in the field of citation analysis conclude this chapter.
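
To make the query type summarized above concrete, the following minimal sketch enumerates all paths of at most a given length between two vertices by plain depth-first search. It is only a brute-force baseline, not the ρ-index search algorithm. The adjacency-dictionary representation, the restriction to simple paths, the measurement of length in edges, the function name `rho_path_naive` and the sample graph are all assumptions made for illustration.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # hypothetical adjacency-list representation


def rho_path_naive(graph: Graph, source: str, target: str, max_len: int) -> List[List[str]]:
    """Return all simple paths from source to target with at most max_len edges."""
    results: List[List[str]] = []

    def dfs(path: List[str]) -> None:
        vertex = path[-1]
        if vertex == target:
            results.append(list(path))
            return  # a simple path cannot leave the target and come back to it
        if len(path) - 1 == max_len:
            return  # length limit reached without hitting the target
        for successor in graph.get(vertex, []):
            if successor not in path:  # assumption: only simple paths are wanted
                path.append(successor)
                dfs(path)
                path.pop()

    dfs([source])
    return results


# Hypothetical sample graph: two paths of length 2 and one of length 3 lead from a to d.
example = {"a": ["b", "c"], "b": ["d"], "c": ["b", "d"], "d": []}
paths = rho_path_naive(example, "a", "d", 2)
```

The running time of such an exhaustive search grows with the number of paths explored from the source, which is the kind of cost an index over the graph is meant to reduce.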

7.2 Contribution

This thesis contributes a novel indexing structure for graph structured data that aims to ease the search for complex relationships among entities in the graph formed by this data. The ρ-index recursively simplifies the indexed graph by applying graph transformations designed to lower the number of vertices and edges while preserving the original properties of the indexed graph, in the sense that the paths present in the indexed graph are present, in a simplified form, in each of the transformed graphs.

In Chapter 2, this thesis proposes the formalism for the simplification and transformation process and introduces a new notion of a graph segment, which eases the understanding and description of the whole process of creating the ρ-index for an arbitrary directed graph. The graph segmentation further gives rise to the segment graph, which represents the graph simplified by a particular graph segmentation. The proposed formalism is summed up in a set of lemmas that substantiate the theory.

Each segment is represented internally in the ρ-index by a path type matrix, which is an altered numerical matrix meant to store the paths themselves instead of only the numbers of paths between the respective pairs of vertices in the segment. The transitions among the segments at a particular level, which form the segment graph of that particular graph transformation, are represented by a set of inverted lists assigned to each of the identified segments.

The transcription graph, introduced for the purpose of retrieving information from the ρ-index, in other words for reconstructing the paths stored in the ρ-index, is a novel structure that constitutes another important contribution of the work presented in this thesis. The transcription graph gives the ρ-index the ability to parallelize and distribute the search algorithm, as is discussed in more detail in the next section.
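
The difference between counting paths and storing them, as mentioned above for the path type matrix, can be pictured with a small sketch. It assumes that a matrix cell holds a set of vertex sequences, that joining two paths takes the place of numeric multiplication, and that set union takes the place of addition. These choices, the sparse dictionary layout and the names `edge_matrix` and `multiply` are illustrative assumptions; the definition used in the thesis may differ.

```python
from typing import Dict, List, Set, Tuple

Path = Tuple[str, ...]
PathMatrix = Dict[Tuple[str, str], Set[Path]]  # sparse matrix: (row, column) -> set of paths


def edge_matrix(edges: List[Tuple[str, str]]) -> PathMatrix:
    """Build the matrix whose cell (u, v) holds the one-edge path (u, v)."""
    matrix: PathMatrix = {}
    for u, v in edges:
        matrix.setdefault((u, v), set()).add((u, v))
    return matrix


def multiply(left: PathMatrix, right: PathMatrix) -> PathMatrix:
    """Matrix product in which '*' joins two paths and '+' is set union."""
    product: PathMatrix = {}
    for (u, k), left_paths in left.items():
        for (k2, v), right_paths in right.items():
            if k != k2:
                continue  # inner indices must agree, as in ordinary multiplication
            for p in left_paths:
                for q in right_paths:
                    joined = p + q[1:]  # drop the shared middle vertex once
                    if len(set(joined)) == len(joined):  # assumption: simple paths only
                        product.setdefault((u, v), set()).add(joined)
    return product


# Hypothetical segment with two parallel routes from a to d.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
one_step = edge_matrix(edges)
two_steps = multiply(one_step, one_step)
```

In this example a plain numerical matrix would record only the count 2 in the cell for the pair (a, d), whereas the sketch keeps the two paths themselves.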
The advantage of the ρ-index over the other indexing techniques presented for path type queries is that it can be used without limitation on arbitrary graph structured data and the graphs they represent. The advantage of the ρ-index over the algorithmic approach represented by Tarjan's solution of the single-source path problem is mainly scalability. The ρ-index can be used on larger data than Tarjan's approach because it lacks the bottleneck of the cubic space complexity in the phase of preparing the regular expression representing all paths between two inspected vertices, a bottleneck that makes Tarjan's approach inapplicable to large-scale data.

The application and evaluation of the ρ-index on real-life data in the context of citation analysis unveiled a tight correlation between the ρ-operators and the extended indirect concepts of various widely known and used direct coupling methods. The work presented in this thesis provides the citation analyst with a tool for fast retrieval of indirect longitudinal coupling and, what is more, enables the analyst to study the quality of the retrieved relationship further, in contrast to only knowing that such a relationship exists. These facts lead to the outline of a further study of the semantics of the retrieved complex relationships and of context-influenced publication search and recommendation.

7.3 Further Research Directions

The presented evaluation of the ρ-index together with the processing of the ρ-path queries proved it to be suitable also for the second family of ρ-operators, represented by the pair of ρ-connection operators. The most immediate direction of future work is therefore the implementation of a search algorithm for the evaluation of this type of query, whose cornerstones will definitely be the notions, concepts, discoveries and conclusions gained from the work on the ρ-path operator.

A concurrent effort will be the optimization of the transcription of the transcription graph by means of its parallelization. Intuitively, the strategy of processing the graph from the left can be extended by parallel processing of those vertices in the waiting queue whose processing would not influence each other. Further, by extending the notion of the transcription graph with counterparts of the introduced specially typed edges and of the numbers concerning the order from the left and the shortest path from the start, a symmetric process could also be initiated from the ending vertex of the ρ-path query evaluation.

The evaluation on real-life data taken from citation analysis also brought new requirements on query processing. One example is the missing capability to search for complex relationships without specifying the whole pair of inspected vertices. This would allow the user to query for the entities with which the inspected vertex has the studied kind of complex relationship, in addition to the implemented verification of the existence of such a relationship. This issue also reveals the other important direction of future study, which relates not so much to the primary indexing effort as to the semantics of the use of the designed indexing structure, as addressed in Section 6.2.3.
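
The missing capability described above, a query that is given only one inspected vertex, could be phrased as in the following minimal sketch. It is an exhaustive search shown purely to illustrate the intended query semantics, not a proposal for how the ρ-index would evaluate it. The function name `related_entities`, the adjacency-dictionary representation, the restriction to simple paths and the sample citation data are assumptions.

```python
from collections import defaultdict
from typing import Dict, List

Graph = Dict[str, List[str]]  # hypothetical adjacency-list representation


def related_entities(graph: Graph, source: str, max_len: int) -> Dict[str, List[List[str]]]:
    """Map each vertex reachable from source by a simple path of 1..max_len edges to those paths."""
    found: Dict[str, List[List[str]]] = defaultdict(list)

    def dfs(path: List[str]) -> None:
        if len(path) - 1 == max_len:
            return  # length limit reached
        for successor in graph.get(path[-1], []):
            if successor in path:
                continue  # assumption: only simple paths are of interest
            path.append(successor)
            found[successor].append(list(path))  # record the relationship with this entity
            dfs(path)
            path.pop()

    dfs([source])
    return dict(found)


# Hypothetical citation data: an edge p -> q means that publication p cites q.
citations = {"p1": ["p2", "p3"], "p2": ["p4"], "p3": ["p4"], "p4": []}
related = related_entities(citations, "p1", 2)
```

Here `related` lists, for every publication within two citation steps of `p1`, the paths that connect it to `p1`, so the user obtains both the related entities and the relationships themselves.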


Bibliography

[1] R. Agrawal, A. Borgida, and H. V. Jagadish. Efficient manage- ment of transitive relationships in large data and knowledge bases. In Proceedings of the 1989 ACM SIGMOD international con- ference on Management of data, pages 253–262. ACM Press, 1989.

[2] Rakesh Agrawal, Shaul Dar, and H. V. Jagadish. Direct tran- sitive closure algorithms: design and performance evaluation. ACM Transactions on Database Systems, 15(3):427–458, 1990.

[3] Rakesh Agrawal and H. V. Jagadish. Hybrid transitive closure algorithms. In Dennis McLeod, Ron Sacks-Davis, and Hans- J¨org Schek, editors, 16th International Conference on Very Large Data Bases, August 13-16, 1990, Brisbane, Queensland, Australia, Proceedings, pages 326–334. Morgan Kaufmann, 1990.

[4] Sofia Alexaki, Vassilis Christophides, Gregory Karvounarakis, Dimitris Plexousakis, and Karsten Tolle. The ICS-FORTH RDFSuite: Managing voluminous RDF description bases. In SemWeb, 2001.

[5] Yuan An, Jeannette Janssen, and Evangelos E. Milios. Charac- terizing and mining the citation graph of the computer science literature. In Knowledge Information Systems, volume 6, pages 664–678, New York, NY, USA, 2004. Springer-Verlag New York, Inc.

[6] Kemafor Anyanwu and Amit Sheth. The rho operator: discov- ering and ranking associations on the semantic web. SIGMOD Record, 31(4):42–47, 2002.

[7] Kemafor Anyanwu and Amit Sheth. The ρ-operator: Enabling querying for semantic associations on the semantic web. In Pro-

141 7. CONCLUSION

ceedings of the twelfth international conference on World Wide Web, pages 690–699. ACM Press, 2003. [8] Stanislav Bartoˇn. Designing indexing structure for discovering relationships in RDF graphs. In Proceedings of the Dateso 2004 Annual International Workshop on DAtabases, TExts, Specifications and Objects, pages 1–11, 2004. [9] Stanislav Bartoˇn. Indexing structure for discovering relation- ships in RDF graph recursively applying tree transformation. In Proceedings of the Semantic Web Workshop at 27th Annual Inter- national ACM SIGIR Conference, pages 58–68, 2004. [10] Stanislav Bartoˇn. Searching indirect relationships in citation analysis using an index for graph structured data. In Proceedings of 2nd Doctoral Workshop on Mathematical and Engineering Meth- ods in Computer Science MEMICS 2006, pages 9–16, Brno, 2006. Faculty of Information Technology. [11] Stanislav Bartoˇnand Pavel Zezula. Rho-index - an index for graph structured data. In 8th International Workshop of the DE- LOS Network of Excellence on Digital Libraries, pages 57–64, 2005. [12] Stanislav Bartoˇnand Pavel Zezula. Designing and evaluating an index for graph structured data. In Proceedings of The Sec- ond International Workoshop on Mining Complex Data at 6th IEEE ICDM Conference, pages 253–257, Los Alamitos, CA, USA, 2006. IEEE Computer Society. [13] Stanislav Bartoˇnand Pavel Zezula. rhoIndex – designing and evaluating an indexing structure for graph structured data. Technical Report FIMU-RS-2006-07, Faculty of Informat- ics, Masaryk University, 2006. [14] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web - a new form of web content that is meaningful to com- puters will unleash a revolution of new possibilities. Scientific American, 2001. [15] Kurt Bollacker, Steve Lawrence, and C. Lee Giles. CiteSeer: An automatic citation indexing system. In Ian Witten, Rob Akscyn,

142 7. CONCLUSION

and Frank M. Shipman III, editors, Digital Libraries 98 - The Third ACM Conference on Digital Libraries, pages 89–98, Pittsburgh, PA, June 23–26 1998. ACM Press.

[16] Kurt Bollacker, Steve Lawrence, and C. Lee Giles. CiteSeer: An autonomous web agent for automatic retrieval and identifica- tion of interesting publications. In Katia P. Sycara and Michael Wooldridge, editors, Proceedings of the Second International Con- ference on Autonomous Agents, pages 116–123, New York, 1998. ACM Press.

[17] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.

[18] Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick. Reachability and distance queries via 2-hop labels. In Proceed- ings of the 13th ACM-SIAM SODA, pages 937–946, 2002.

[19] Steve Cronen-Townsend, Yun Zhou, and W. Bruce Croft. Pre- dicting query performance. In Proceedings of the 25th Annual In- ternational ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR 2002), pages 299–306, August 2002.

[20] Paul F. Dietz. Maintaining order in a linked list. In STOC ’82: Proceedings of the fourteenth annual ACM symposium on Theory of computing, pages 122–127, New York, NY, USA, 1982. ACM Press.

[21] L. Egghe and R. Rousseau. Co-citation, bibliographic coupling and a characterization of lattice citation networks. In Scientomet- rics, volume 55, pages 349–361, 2002.

[22] D. Ellis. A behavioural approach to information retrieval system design. Journal of Documentation, 45:171–212, 1989.

[23] P. Erd¨os and A. R´enyi. On random graphs. Publicationes Math- emticae (Debrecen), 6:290–297, 1959.

[24] P. Erd¨os and A. R´enyi. On the strength of connectedness of a random graph. Acta Math. Acad. Sci. Hungary, 12:261–267, 1961.

143 7. CONCLUSION

[25] Frank Olken et al. The Biopathways Graph Data Manager project. http://pueblo.lbl.gov/∼olken/graphdm/graphdm.htm.

[26] Y. Fang and R. Rousseau. Lattices in citation networks: An in- vestigation into the structure of citation graphs. In Scientomet- rics, volume 50, pages 273–287, 2001.

[27] R. M. Fano. Information theory and the retrieval of recorded information. In Documentation in action, pages 238–244, 1956.

[28] Michael R. Garey and David S. Johnson. Computers and In- tractability: A Guide to the Theory of NP-Completeness. W. H. Free- man & Co., New York, NY, USA, 1979.

[29] E. Garfield. Citation indexes for science: A new dimension in documentation through association of ideas. In Science, volume 122, pages 108–111, 1955.

[30] E. Garfield. Citation Indexing - Its Theory and Application in Sci- ence, Technology, and Humanities. John Wiley & Sons, Inc., New York, NY, USA, 1979.

[31] Filip Goldefus. Analysis and implementation of a path expres- sion search problem in directed graphs. Master’s thesis, Faculty of Informatics, Masaryk University, Brno, Czech republic, 2007.

[32] Donna Harman, R. Baeza-Yates, Edward Fox, and W. Lee. In- verted files. Information retrieval: data structures and algorithms, pages 28–43, 1992.

[33] Hao He, Haixun Wang, Jun Yang, and Philip S. Yu. Compact reachability labeling for graph-structured data. In CIKM ’05: Proceedings of the 14th ACM international conference on Information and knowledge management, pages 594–601, New York, NY, USA, 2005. ACM Press.

[34] N. P. Hummon and P. Doreian. Connectivity in a citation net- work: The development of dna theory. Social Networks, 11:39–63, 1989.

144 7. CONCLUSION

[35] Peter Ingwersen. Cognitive perspectives of information re- trieval interaction: elements of a cognitive ir theory. Journal of the American Society for Information Science, 52(1):3–50, 1996.

[36] Yannis E. Ioannidis and Raghu Ramakrishnan. Efficient transi- tive closure algorithms. In Proceedings of the Fourteenth Interna- tional Conference on Very Large Data Bases, pages 382–394. Morgan Kaufmann Publishers Inc., 1988.

[37] M. Jerrum and A. Sinclair. Fast uniform generation of regular graphs. Theoretical Computer Science, 73(1):91–100, 1990.

[38] M. M. Kessler. Bibliographic coupling between scientific papers. In American Documentaion, volume 14, pages 10–15, 1963.

[39] Jack P. C. Kleijnen and Willem J. H. Van Groenendaal. Mea- suring the quality of publications: new methodology and case study. Information Processing and Management, 36(4):551–570, 2000.

[40] Jon M. Kleinberg. Authoritative sources in a hyperlinked en- vironment. Journal of the Association for Computing Machinery, 46(5):604–632, 1999.

[41] Jon M. Kleinberg. Hubs, authorities, and communities. ACM Computing Surveys, 31(4es):5, 1999.

[42] Jon M. Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew S. Tomkins. The Web as a graph: Measurements, models and methods. Proceedings of 5th CO- COON Conference, pages 1–17, 1999.

[43] Donald E. Knuth. The art of computer programming, volume 1 (3rd ed.): fundamental algorithms. Addison Wesley Longman Publish- ing Co., Inc., 1997.

[44] B. Larsen. Exploiting citation overlaps for information retrieval: Generating a boomerang effect from the network of scientific papers. In Scientometrics, volume 54, pages 155–178, 2002.

145 7. CONCLUSION

[45] O. Lassila and R. R. Swick. Resource Description Framework: Model and Syntax specification. 1999.

[46] Richard C. T. Lee, James R. Slagle, and H. Blum. A triangula- tion method for the sequential mapping of points from -space to two-space. IEEE Transactions on Computers, 26(3):288–292, 1977.

[47] S. Lehman, B. Lautrup, and A. D. Jackson. Citation networks in high energy physics. Physical Rewiev E, 68(2), 2003.

[48] Thomas Lengauer and Robert Endre Tarjan. A fast algorithm for finding dominators in a flowgraph. ACM Transactions on Pro- gramming Languages and Systems, 1(1):121–141, 1979.

[49] Michael Ley. DBLP – Digital Bibliography & Library Project. http://www.informatik.uni-trier.de/∼ley/db.

[50] Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. In SODA ’90: Proceedings of the first an- nual ACM-SIAM symposium on Discrete algorithms, pages 319– 327, Philadelphia, PA, USA, 1990. Society for Industrial and Ap- plied Mathematics.

[51] A. Matono, T. Amagasa, M. Yoshikawa, and S. Uemura. An in- dexing scheme for RDF and RDF Schema based on suffix arrays. In Proceedings of SWDB’03, The first International Workshop on Se- mantic Web and Databases, Co-located with VLDB 2003, 2003.

[52] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet por- tals with machine learning. Inf. Retr., 3(2):127–163, 2000.

[53] Edward M. McCreight. A space-economical suffix tree construc- tion algorithm. Journal of the Association for Computing Machinery, 23(2):262–272, 1976.

[54] Stanley Milgram. The small world problem. Psychology Today, 2:60–67, 1967.

[55] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. Introduction to WordNet:

146 7. CONCLUSION

An on-line lexical database. International Journal on Lexicography, 3(4):235–244, January 1990. [56] Farideh Osareh. Bibliometrics, citation analysis and co-citation analysis: A review of literature I. Libri, 46:149–158, 1996. [57] Farideh Osareh. Bibliometrics, citation analysis and co-citation analysis: A review of literature II. Libri, 46:217–225, 1996. [58] Patrick Pantel and Dekang Lin. Discovering word senses from text. In KDD ’02: Proceedings of the eighth ACM SIGKDD inter- national conference on Knowledge discovery and data mining, pages 613–619, New York, NY, USA, 2002. ACM Press. [59] Dongqing Yang Peixiang Zhao, Ming Zhang and Shiwei Tang. Finding hidden semantics behind reference linkages : an onto- logical approach for scientific digital libraries. In The 10th In- ternational Conference on Database Systems for Advanced Applica- tions (DASFAA 2005), pages 699–710, New York, NY, USA, 2005. Springer-Verlag. LNCS 3453. [60] Paul Walton Purdom. A transitive closure algorithm. BIT, 10:76– 94, 1970. [61] S. Redner. How popular is your paper? An empirical study of the citation distribution. The European Physical Journal B, 4:131, 1998. [62] Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno. Al- gorithmics and applications of tree and graph searching. In PODS ’02: Proceedings of the twenty-first ACM SIGMOD-SIGACT- SIGART symposium on Principles of database systems, pages 39–52, New York, NY, USA, 2002. ACM Press. [63] Antonis Sidiropoulos and Yannis Manolopoulos. A citation- based system to assist prize awarding. SIGMOD Record, 34(4):54–60, 2005. [64] Antonis Sidiropoulos and Yannis Manolopoulos. A new per- spective to automatically rank scientific conferences using dig- ital libraries. Information Processing and Management, 41(2):289– 312, 2005.

147 7. CONCLUSION

[65] Henry Small. Co-citation in the scientific literature: A new mea- sure of the relationship between two documents. In Journal of the American Society of Information Science, volume 24, pages 265– 269, 1973.

[66] Henry Small. Navigating the citation network. In Proceedings of the 58th Annual Meeting of the American Society for Information Science: Forging new partnership in information, volume 50, pages 118–126, 1999.

[67] Henry Small. Visualizing science by citation mapping. Journal of the American Society of Information Science, 50(9):799–813, 1999.

[68] SRI. The BioCyc project. http://biocyc.org/.

[69] Robert Endre Tarjan. Depth first search and linear graph algo- rithms. SIAM Journal on computing, pages 146–160, 1972.

[70] Robert Endre Tarjan. Fast algorithms for solving path problems. Journal of the Association for Computing Machinery, 28(3):594–614, 1981.

[71] Robert Endre Tarjan. A unified approach to path problems. Journal of the Association for Computing Machinery, 28(3):577–593, 1981.

[72] Sanjeev Thacker, Amit Sheth, and Shuchi Patel. Complex re- lationships for the semantic web. In D. Fensel, J. Hendler, H. Liebermann, and W. Wahlster, editors, Spinning the Seman- tic Web. MIT Press, 2002.

[73] C. J. Van Rijsbergen. Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow, 1979.

[74] Alexei Vazquez. Statistics of citation networks, May 2001.

[75] Stephen Warshall. A theorem on boolean matrices. Journal of the Association for Computing Machinery, 9(1):11–12, 1962.

[76] D. J. Watts and S. H. Strogatz. Collective dynamics of ’small- world’ networks. Nature, 393(6684):440–442, June 1998.

148 7. CONCLUSION

[77] Duncan J. Watts. Six Degrees: The Science of a Connected Age. W. W. Norton & Company, February 2004.

[78] Y. Yamamoto, M. Yoshikawa, and S. Umeura. On indices for XML documents with namespaces. In Proceedings of the Confer- ence on Markup Technologies, pages 235–243, 1999.

[79] Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In Proceeddings of the 2002 International Confer- ence on Data Mining (ICDM’02), pages 721 – 724, 2002.

[80] Xifeng Yan and Jiawei Han. CloseGraph: Mining closed fre- quent graph patterns. In Proceedings of the ACM SIGKDD In- ternational Conference on Knowledge Discovery and Data Mining, 2003.

[81] Xifeng Yan, Philip S. Yu, and Jiawei Han. Graph indexing based on discriminative frequent structure analysis. ACM Transactions on Database Systems, 30(4):960–993, 2005.

[82] Dang Yaru. Structural modeling of network systems in citation analysis. J. Am. Soc. Inf. Sci., 48(10):946–952, 1997.

[83] L.C. Yin, H Kretschmer, R. Hanneman, and Z. Liu. The evolu- tion of citation network topology: The development of the jour- nal Scientometrics. International Workshop on Webometrics, Infor- metrics and Scientometrics & Seventh COLLNET Meeting, 2005.

[84] Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, and Michal Batko. Similarity Search: The Metric Space Approach, volume 32 of Advances in Database Systems. Springer-Verlag, 2006.

149