
ABSTRACT

MIRANSHAH, SAHIBI. Integrating a Path Operator in Apache Jena for Generalized Graph Pattern Matching. (Under the direction of Dr. Kemafor Anyanwu Ogan.)

The emergence of large heterogeneous networks, largely driven by the W3C standards for representing metadata and relationships on the Web, has introduced the need for more flexible querying methods than the standard pattern matching paradigm. For such large graphs, users will commonly know only part of the structure they are interested in finding. In some cases, finding the structure is in fact the goal of the query. Such scenarios call for flexible matching paradigms that can match both fixed-structure and variable-structure patterns. The present infrastructure for supporting such generalized queries exists only in a very limited form. In this thesis we explore some approaches for achieving this. There are two common approaches to implementing generalized graph pattern matching queries, i.e. queries that can match not just edges but paths between source and destination node bindings. The first is the traditional graph pattern matching approach, which roughly corresponds to relational query processing and evaluates the query using relational join expressions. The second uses graph traversal algorithms, which convert the graph pattern to a regular expression and execute its finite automaton on the dataset graph. Both approaches are limited in the expressiveness of the queries they support and in their performance on large and diverse datasets such as the Semantic Web. In our research we use a hybrid approach that decomposes the generalized query into parts: traditional graph pattern matching evaluates one part of the query, and paths are evaluated using the path algebraic algorithm.
We then integrate support for such queries in Apache Jena, the popular open-source Semantic Web framework for Java, by extending the Jena API to support compilation and execution of such queries and by integrating a path operator. The implementation was evaluated on multiple datasets; the overhead in query compilation time was minimal compared to queries that do not perform path matching.

© Copyright 2016 by Sahibi Miranshah
All Rights Reserved

Integrating a Path Operator in Apache Jena for Generalized Graph Pattern Matching

by Sahibi Miranshah

A thesis submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Master of Science

Computer Science

Raleigh, North Carolina
2016

APPROVED BY:
Dr. Raju Vatsavai
Dr. Guoliang Jin
Dr. Kemafor Anyanwu Ogan, Chair of Advisory Committee

DEDICATION

To my parents.

BIOGRAPHY

Sahibi Miranshah received her Bachelor's degree in Computer Science and Engineering from Dr. B. R. Ambedkar National Institute of Technology, Jalandhar, India in 2014. She enrolled at North Carolina State University in 2014 for her Master of Science degree in Computer Science.

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Kemafor Anyanwu Ogan, for her guidance in my thesis work. Her knowledge, consistent motivation, humble attitude and passion for research and innovation made my first steps in academic research an exciting and fulfilling experience. I would like to thank Dr. Raju Vatsavai and Dr. Guoliang Jin for being part of my graduate committee. I would like to thank my colleagues in the semantic computing research lab, Sidan Gao, HyeongSik Kim, Shalki Shrivastava and Avimanyu Mukhopadhyay, for their support. I would like to thank Sidan for her invaluable suggestions and feedback and for fruitful research-oriented discussions. Finally, I would like to thank my family and friends, who have supported me in all my endeavors and without whom this would not have been possible.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1 INTRODUCTION
  1.1 Structure of the Thesis
Chapter 2 BACKGROUND
  2.1 RDF
    2.1.1 RDF Model
    2.1.2 RDF Serialization Formats
  2.2 SPARQL
    2.2.1 Querying Data With SPARQL
  2.3 Graph Pattern Matching
    2.3.1 General Definition of Graph Pattern Matching
    2.3.2 Types of Graph Patterns
    2.3.3 Data Models and Graph Pattern Matching
    2.3.4 Systems supporting the Graph Pattern Matching problem
  2.4 Generalized Graph Pattern Matching
    2.4.1 SPARQ2L
  2.5 Solve Algorithm
  2.6 Apache Jena
    2.6.1 Jena Architecture
Chapter 3 APPROACH
  3.1 Generic SPARQL Query Engine
  3.2 Extending Jena
    3.2.1 ARQ Query Processing
    3.2.2 Main Query Engine
    3.2.3 Custom Query Engine
    3.2.4 Algebra Extensions
    3.2.5 Expression Functions and Property Functions
  3.3 Query Plan Generation
  3.4 Architecture of Extended Jena
  3.5 Integrating the Path Operator
    3.5.1 Path Operator 'OpFindAllPaths'
    3.5.2 Integration of the Path Operator
    3.5.3 User Workflow
  3.6 Path Query Syntax
  3.7 Related Work
Chapter 4 EVALUATION
  4.1 Query Compilation Time
  4.2 Query Execution Time
  4.3 Usability
  4.4 List of Queries
Chapter 5 CONCLUSION AND FUTURE WORK
BIBLIOGRAPHY
APPENDIX
  Appendix A Code

LIST OF TABLES

Table 4.1 Compile Times of queries with and without path operator
Table 4.2 Execution Times of queries with and without path operator

LIST OF FIGURES

Figure 2.1 The graph describes two people, and each one has properties name, gender and knows.
Figure 2.2 The graph uses the FOAF ontology, describes two people, and each one has properties name, gender and knows.
Figure 2.3 The algorithm SOLVE to solve single source path expression problem. [Tar81]
Figure 2.4 The algorithm ELIMINATE is used to pre-compute the path sequence for an input graph G whose vertices are numbered from 1 to n. [Tar81]
Figure 2.5 The Jena architecture illustrating interaction between the APIs. [Arc]
Figure 3.1 Data graph to illustrate query types.
Figure 3.2 The intuition behind the integration of the path operator is to decompose the path query and use the existing Jena framework to execute graph pattern matching, and use a path algebraic approach to extract paths.
Figure 3.3 A generic SPARQL Query processing, optimization and execution framework.
Figure 3.4 Phases of ARQ query processing.
Figure 3.5 Query plan for a simple example query without path operator integrated into the query engine.
Figure 3.6 Query plan for the example query after integrating the path operator into the query engine.
Figure 3.7 Query plan for the example query after integrating the path operator into the query engine.
Figure 3.8 The diagram shows the architecture of the Path-Extraction Enhanced Jena API.
Figure 3.9 The flowchart shows how the Jena API can be used to execute SPARQL queries with or without path computation, by using the object of class 'ExtendedModel' and by specifying boolean arguments that reflect the user's choice of whether to compute paths or not.
Figure 4.1 Graph to illustrate the comparison of compile time in queries with and without path computations.
Figure 4.2 Graph to illustrate the comparison of execution time in queries with and without path computations.

CHAPTER 1

INTRODUCTION

With the exponential growth in data being contributed to the Semantic Web and its representation as large heterogeneous graphs, and the reliance of a large number of applications on the ability to query