Semantic Provenance: Modeling, Querying, and Application in Scientific Discovery

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering

By

SATYA SANKET SAHOO
Master of Computer Applications, Goa University, 2003

2010 Wright State University

WRIGHT STATE UNIVERSITY
SCHOOL OF GRADUATE STUDIES

July 29, 2010

I HEREBY RECOMMEND THAT THE DISSERTATION PREPARED UNDER MY SUPERVISION BY Satya Sanket Sahoo ENTITLED Semantic Provenance: Modeling, Querying, and Application in Scientific Discovery BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in Computer Science and Engineering.

Amit P. Sheth, Ph.D. Dissertation Director

Arthur Ardeshir Goshtasby, Ph.D.
Director, Computer Science and Engineering Ph.D. Program

Andrew T. Hsu, Ph.D.

Dean, School of Graduate Studies

Committee on Final Examination

Krishnaprasad Thirunarayan, Ph.D.

Michael Raymer, Ph.D.

Nicholas V. Reo, Ph.D.

Olivier Bodenreider, Ph.D.

William S. York, Ph.D.

ABSTRACT

Sahoo, Satya Sanket. Ph.D., Department of Computer Science and Engineering, Wright State University, 2010. Semantic Provenance: Modeling, Querying, and Application in Scientific Discovery

Provenance, describing the history or lineage of an entity, is essential for ensuring data quality, correctness of process execution, and computing trust values. Traditionally, provenance management issues have been dealt with in the context of workflow or relational database systems. However, existing provenance systems are inadequate to address the requirements of an emerging set of applications in the new eScience or Cyberinfrastructure paradigm and the Semantic Web. Provenance in these applications incorporates complex domain semantics on a large scale with a variety of uses, including accurate interpretation by software agents, trustworthy data integration, reproducibility, attribution for commercial or legal applications, and trust computation. In this dissertation, we introduce the notion of “semantic provenance” to address these requirements for eScience and Semantic Web applications. In addition, we describe a framework for the management of semantic provenance that addresses the three issues of (a) provenance representation, (b) query & analysis, and (c) scalable implementation. First, we introduce a foundational model of provenance called Provenir to serve as an upper-level reference ontology to facilitate provenance interoperability. Second, we define a classification scheme for provenance queries based on query characteristics and use this scheme to define a set of specialized provenance query operators. Third, we describe the implementation of a highly scalable query engine to support the provenance query operators, which uses a new class of materialized views based on the Provenir ontology, called Materialized Provenance Views (MPV), for query optimization. We also define a novel provenance tracking approach called Provenance Context Entity (PaCE) for the Resource Description Framework (RDF) model used in Semantic Web applications. PaCE, defined in terms of the Provenir ontology, is an effective and scalable approach for RDF provenance tracking in comparison to the currently used RDF reification vocabulary. Finally, we describe the application of the semantic provenance framework in biomedical and oceanography research projects.


Contents

1 Introduction
1.1 Metadata: Syntax, Structure, and Semantics
1.2 Semantic Web
1.3 Requirements for Provenance Management in Data Driven Science and the Semantic Web
1.4 Dissertation Goals
1.5 Summary and Dissertation Outline

2 Provenance Management: Approaches and Challenges
2.1 Introduction
2.2 Semantic Provenance
2.3 Conclusions

3 Provenance Representation and Provenir Ontology
3.1 Introduction
3.2 ProPreO Ontology: Modeling Provenance in Proteomics Experiments
3.3 Provenir Ontology: An Upper-level Provenance Ontology
3.4 A Modular Approach for Creating Provenance Ontologies
3.5 Related Work
3.6 Conclusions

4 Provenance Query and Analysis
4.1 Introduction
4.2 Provenance Query Characteristics and Classification Scheme
4.3 Provenance Query Operators
4.4 Related Work
4.5 Conclusions

5 Implementing the Provenance Query Infrastructure
5.1 Introduction
5.2 Provenance Query Engine
5.3 Evaluation Strategy
5.4 Results
5.5 Materialized Provenance Views: Optimizing Provenance Queries
5.6 Related Work
5.7 Conclusions

6 Provenance Context Entity (PaCE): Scalable Provenance Tracking for Scientific RDF Data
6.1 Introduction
6.2 Foundations of Provenance Context Entity
6.3 Model Theoretic Semantics of PaCE Inferencing
6.4 Implementation: Using PaCE Approach in the BKR Project
6.5 Evaluation
6.6 Conclusions

7 Application: A Framework for Provenance Management in Translational Research
7.1 Introduction
7.2 The Semantic Problem Solving Environment for T.cruzi project
7.3 T.cruzi Provenance Management System
7.4 Query Infrastructure of T.cruzi PMS
7.5 Evaluation
7.6 Related Work
7.7 Conclusions

8 Conclusions and Future Work
8.1 Future Work

Appendix A

Bibliography

List of Figures

1-1 Overview of "data-driven science" research
2-1 Protocol for proteomics data analysis using mass spectrometer
2-2 Evolution of semantic provenance from role of metadata in (a) data integration in distributed environments; (b) verification of data and validation of processes in eScience
2-3 The three dimensions of semantic provenance framework
3-1 A schematic representation of the proteomics data analysis protocol using ProPreO ontology concepts and the named relationships linking them
3-2 Provenir ontology schema
3-3 Extending Provenir ontology to create Parasite Experiment ontology
3-4 Extending Provenir ontology to create Janus ontology for Taverna workflow engine
3-5 Extending Provenir ontology to create Trident ontology for oceanography
4-1 Schematic representation of the provenance() query operator execution strategy
4-2 Mapping provenance query operators with existing database and workflow provenance
5-1 Architecture of the provenance query engine
5-2 RDF transitive closure using SPARQL ASK function
5-3 A simulated oceanography scenario from Neptune Project
5-4 Contribution of each provenance query engine component (in %)
5-5 Performance results for the three SPARQL functions to compute transitive closure
5-6 Evaluation results for expression complexity of provenance queries illustrating the significant increase in query time with increasing expression complexity
5-7 Evaluation results for data complexity of provenance queries illustrating the significant increase in query time with increasing data complexity
5-8 A MPV corresponds to one logical unit of provenance in the given application domain
5-9 Comparing expression complexity results with and without using MPV
5-10 Comparing data complexity results with and without using MPV
6-1 Provenance tracking on the Semantic Web using RDF reification
6-2 Schematic representation of provenance context for the BKR project and a sensor application
6-3 Implementation of the PaCE mechanism using (a) exhaustive approach, and (b) minimalist approach
6-4 Implementation of the PaCE mechanism using the intermediate approach
6-5 PaCE inferencing
6-6 The relative number of provenance-specific triples created using PaCE and RDF reification
6-7 Query performance for fixed values
6-8 Query performance for query patterns using a set of 100 values
6-9 (a) Result of time profiling the provenance query for the therapeutic use of the drug Thalidomide, (b) performance of the query using RDF reification and the PaCE mechanism
7-1 Alleles of target genes are used to create knockout plasmids
7-2 Schematic representation of GKO and SP experiment protocols
7-3 The architecture of the T.cruzi provenance system addressing four aspects of provenance management
7-4 Use of the provenance() query operator to answer example provenance queries
7-5 The result of Query 4 corresponds to a MPV in the T.cruzi provenance system
7-6 The response time of the provenance query engine with (a) increasing size of RDF dataset and (b) increasing complexity of queries
7-7 Comparative results for (a) increasing size of RDF datasets over the underlying database and MPV and (b) provenance queries with increasing complexity

List of Tables

5-1 SPARQL query details to evaluate expression complexity using the provenance() query operator
5-2 RDF datasets to evaluate the data complexity using the provenance() query operator
5-3 Performance gain for the provenance() query operator using MPV
7-1 The four RDF datasets used to evaluate scalability of the T.cruzi provenance system
7-2 Details of SPARQL queries with increasing expression complexity


Acknowledgements

I would like to thank the team members of the Bioinformatics for Glycan Expression (glycomics) project - at the University of Georgia: Will York, James Atwood, Lin Lin, and John Miller; and at the Kno.e.sis Center (Wright State University): Cory Henson and Christopher Thomas. The glycomics project was a very exciting real world research context for being introduced to biomedical ontologies, large scale data integration, scientific workflows, and provenance. I would also like to thank the team members of the Semantic Problem Solving Environment for Trypanosoma Cruzi project - at the University of Georgia: Brent Weatherly, Todd Mining, Rick Tarleton, and Prashant Doshi; at the Kno.e.sis Center: Vinh Nguyen and Priti Parikh; and other members of the Kno.e.sis Center: Pascal Hitzler, Guozhu Dong, Pramod Anantraman, Harshal Patni, Ajith Ranabahu, Prateek Jain, Delroy Cameron, Pablo Mendes, Hemant Purohit, Wenbo Wang, Ramakanth Kavaluru, Ashwin Manjunatha, Meena Nagarajan, Pavan Kapanipati, Ashutosh Jadav, and Raghava Mutharaju, with whom I have enjoyed stimulating academic and social conversations over the past several years.

Working with Olivier Bodenreider at the Lister Hill Center (National Institutes of Health) over the past four years has been a source of immense joy that has not only translated into satisfying research results but has also been a tremendous learning experience in working on biomedical research challenges. I was fortunate to have had the opportunity to work with Roger Barga at Microsoft Research (MSR), who provided the unique insight of bridging the gap between traditional data management and provenance research. Interactions with Roger and Jonathan Goldstein (MSR) played a critical role in formulating and implementing many components of the provenance management framework presented in this thesis.

Thanks especially to Will York for guiding me through the challenges of cutting edge life sciences research and helping me acquire the essential skills for inter-disciplinary research. Thanks also to Krishnaprasad Thirunarayan for helping me gain an understanding of logic languages and for working with me to formalize the various components of this thesis. Thanks to Michael Raymer and Nick Reo for serving on my PhD committee. Thanks to Sushmita Sheshadri for patiently explaining the fundamentals of cell biology. Special thanks to Boanerges Aleman-Meza for being an amazing role model and for always being willing to help.

Most importantly, I am indebted to my advisor, Amit Sheth, who has shaped my research philosophy, work ethic, and career objectives. With his ability to “think big”, his relentless focus on quality, and the unprecedented opportunities he gives his students to expand their research horizons, it has been a fascinating experience for me to explore and learn. I am deeply influenced by his energy and his dedication to the success of his students, and I sincerely hope to advise my own future students in a similar manner.


To my parents, Maheswar and Bijanbala, and sister, Sonal


Chapter 1 Introduction

The phenomenal growth in computing capabilities is transforming scientific research from an experiment-driven discipline to a “data-driven” science [1]. Data driven science is enabled by the synergistic use of a new generation of scientific instruments along with the exponentially growing World Wide Web (WWW) infrastructure. This infrastructure, also called eScience or Cyberinfrastructure [2], enables scientists to generate and access resources such as “raw” experiment data, online structured databases, repositories of textual data, and Web-based computing resources (Web services). The move towards eScience has created a deluge of scientific data [2], exemplified by the datasets available at the Cancer Biomedical Informatics Grid (caBIG) [3] and the Biomedical Informatics Research Network (BIRN) [4] [5]. This paradigm shift places a greater emphasis on the processing, integration, and analysis of data, with significant impact on scientific research, as compared to the earlier focus on the generation of data.

Figure 1-1: Overview of "data-driven science" research

Scientists require a new generation of tools and applications to leverage the large amount of data to not only guide ongoing projects, but also help drive future research. The size of scientific data presents a set of unique challenges to the database, data mining, and Semantic Web research communities to create scalable solutions for analysis, scientific visualization, question answering, and knowledge discovery tasks (Figure 1-1). The role of metadata in addressing the challenge of managing large scale data and supporting user services with interactive response times has been the subject of research in a variety of domains including multimedia [6], relational databases, sensor networks, and eScience. Metadata, a term coined by Jack Myers in 1969 [7], is information that describes the attributes of data, for example, the size of a file, the schema of a relational database, or the spatial and temporal information associated with a sensor reading. Metadata’s critical role in processing and analyzing large volumes of data has long been understood in library management [8], geography [9], multimedia [10], the biological sciences [11], and the Web [12].

1.1 Metadata: Syntax, Structure, and Semantics

The primary advantage of metadata is the level of abstraction it provides as compared to the data itself, which allows data management applications to use metadata for a variety of tasks including data selection, processing, visualization, integration, sharing, and retrieval. The database community, in particular, has extensively explored the use of metadata to exchange, share, and integrate data from heterogeneous information sources [13] [14]. There are many types of metadata including annotations, content-specific metadata, security specifications, processing instructions, usage information, data quality, and descriptions [15] [16]. There is a large body of work on the classification of metadata [17], [18], [19]; for example, the classification scheme proposed in [15] categorizes metadata into two broad categories:

1. Content independent metadata: This class of metadata does not depend on the content of the associated dataset, for example, the type of sensor or the location of a file.

2. Content dependent metadata: This class of metadata depends on the content of the associated dataset and can be further subdivided into
   a) Direct content based metadata (for example, a document vector for full-text indices)
   b) Content descriptive metadata (for example, the encoding used to create a multimedia file)

The research in data management (representation, integration, processing, and querying) has evolved from a purely syntactic approach to a structural one, and then to the use of semantics to effectively reflect domain-specific complexities [20]. A similar evolution has been argued for metadata management, resulting in the definition of semantic metadata. Semantic metadata is “metadata that describes contextually relevant or domain-specific information about content (optionally) based on a[n]… ontology” [21]. Semantic metadata not only mitigates terminological heterogeneity but also enables software applications to “understand” (through explicit and precise description of terms) and reason over it. Specific motivating factors for using semantic metadata are:


1. To create a conceptual context to “capture domain knowledge and help impose a conceptual semantic view on the underlying data”
2. To support accurate data interpretation
3. To support metadata interoperability

The use of semantic metadata on the WWW [10] was an important precursor to the introduction of the Semantic Web initiative [22].

1.2 Semantic Web

In contrast to the traditional content on the WWW that is meant for human consumption, the Semantic Web initiative, led by the World Wide Web Consortium (W3C), envisions the creation of a “Web of data” [23] that is not only accessible to humans but is also “understood” by software agents through the use of semantic metadata. The Semantic Web impacts a vast range of applications and tasks including large scale data integration, enhanced Web search, support for complex domain-specific queries, data visualization, and knowledge discovery tasks. The Semantic Web is underpinned by a number of recommended standards, with a pivotal role played by the Resource Description Framework (RDF) [24]. RDF is a representation model consisting of the three elementary components of Subject, Predicate, and Object (S,P,O) that can be used to describe any entity and its attributes on the Web [24]. An example of an RDF statement is “protein→encoded_by→gene”, which states that a gene encodes a protein, where the protein is the S, the gene is the O, and the S and O are linked by a named relationship, or P, encoded_by. The RDF model, along with RDF Schema (RDFS) [25] and the Web Ontology Language (OWL) [26], has formal semantics that enable software applications such as reasoners to process, interpret, and discover implicit knowledge in a data repository [27] [28].

An increasing number of scientific applications are storing and disseminating their datasets using the RDF format [29] [30] [31]. RDF is also being used as an information integration platform in multiple scientific domains. For example, the Biomedical Knowledge Repository (BKR) project at the U.S. National Library of Medicine seeks to create a comprehensive repository of integrated biomedical data from a variety of sources such as biomedical literature (textbooks and journal articles), structured databases (for example, the NCBI Entrez system [32]), and terminological knowledge sources (for example, the Unified Medical Language System (UMLS) [33]). BKR represents the integrated information as RDF statements.

The resources on the Web (including the Semantic Web and eScience) have disparate levels of quality and trust; hence metadata describing the origin or lineage of each resource is critical to a range of applications in data mining, knowledge discovery, and data integration. This category of metadata describing the history or lineage of an entity is called provenance, derived from the French word provenir, which means “to come from.” In addition to their central role in computer science, provenance and trust metadata are also key enablers for realizing broader objectives in data-driven science, as enunciated by the National Institutes of Health (NIH) “Translational Research” roadmap that seeks to improve human health by translating “scientific discoveries” (bench) to “practical applications” (bedside) (Figure 1-1). We consider three example scenarios to illustrate the importance of provenance information in real world applications:

1. Patient Health Care: Provenance information associated with a patient helps doctors and other caregivers arrive at correct decisions affecting the treatment and well-being of patients. For example, the original medical condition that led to the patient being admitted, the diagnosis made by specialist doctors, the medical procedures performed on the patient, and the response of the patient to treatment plans are all essential provenance information.

2. Sensor Networks: In a typical distributed sensor network, sensor observations are collected by sensors of differing capabilities, which are located at different geospatial locations and may be active only during certain time intervals. These are all examples of necessary provenance information that enables both data processing applications and users to interpret sensor observations in the correct context.

3. eGovernance: Many government departments release certain types of datasets, for example land surveys, water usage, etc., to other agencies for specific and pre-approved applications such as the construction of new apartments. To ensure that the released datasets are only used for the intended purposes, the usage history or provenance of the data use is collected and used by the releasing department. This use of provenance information for ensuring compliance or conformance is essential in regulatory and legal application domains.

1.3 Requirements for Provenance Management in Data Driven Science and the Semantic Web

Provenance metadata has long been used in the cultural heritage domain to trace the origin of a painting, a sculpture, or other cultural artifacts [34]. In science, provenance of experiment protocols is often recorded manually in laboratory notebooks and used to verify the quality of data, validate the experiment process, and associate trust values with scientific results. Provenance metadata overlaps with both the content dependent and content independent metadata categories (using the classification scheme described in the previous section [15]). In this document, we define provenance as metadata describing the history of an entity, using this temporal aspect to differentiate provenance from other forms of metadata. In other words, metadata describing past information is categorized as provenance metadata, which records the how, where, what, when, why, which, and by whom of an entity.


Provenance has been studied from multiple perspectives in computer science, such as database provenance [35] [36] [37] and scientific workflow provenance [38] [39], but many research challenges in provenance management are yet to be addressed. In the following sections, we identify the four major challenges faced in provenance management.

1.3.1 Provenance Representation

A common representation model for provenance will facilitate provenance interoperability and provenance integration across different projects, and enforce consistent use of provenance terminology. A common provenance model should also closely reflect domain-specific details (domain semantics) that are essential to answer end user queries in real world applications. Ontologies are considered a suitable modeling solution for these requirements; in addition, they support reasoning to discover implicit knowledge over large datasets [40] [26]. Ontologies are used in many scientific domains [41] and are gaining rapid community acceptance; for example, the National Center for Biomedical Ontologies (NCBO) [11] lists 201 ontologies in the life science domain. Using ontologies to model provenance information will not only reduce terminological heterogeneity to facilitate interoperability, but also support discovery of implicit knowledge over large datasets using reasoning tools.

1.3.2 Provenance Query and Analysis

A variety of provenance queries addressing the requirements of specific applications have been described in the literature [42] [43]. To create a flexible and effective query infrastructure for provenance information, we need to understand the specific characteristics of provenance queries. A classification or taxonomy of provenance queries will help us define focused query operators. A set of provenance query operators creates a standardized query interface with well-defined functionality that can be uniformly implemented across application domains. The provenance query operators enable users to query provenance information without having to learn different query languages or manually compose complex query patterns. In addition, optimization techniques defined for the standard execution semantics of the query operators can be reused for provenance queries in different application domains.

1.3.3 Scalable Provenance Systems

The provenance queries in Semantic Web applications, which represent information using the RDF model, are graph traversal operations involving path computations. Path computation over graphs is an expensive operation, especially in the case of provenance queries, which require computation of fixed paths, recursive pattern-based paths, and neighborhood retrieval. Further, the large scale of scientific data increases query execution costs and adversely affects the response time of provenance systems. To enable real world scientific applications to incorporate provenance tracking, practical provenance systems are needed that can handle complex provenance queries over very large datasets.
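To make the shape of these computations concrete, the sketch below expresses a recursive derivation-path query over a small RDF graph using the Python rdflib library. The namespace, entity URIs, and property name are hypothetical stand-ins; this is only an illustration, not the query engine described in Chapter 5.

```python
from rdflib import Graph, Namespace

# Hypothetical namespace and URIs, for illustration only.
PROV = Namespace("http://example.org/provenir#")
EX = Namespace("http://example.org/data/")

g = Graph()
# A three-step derivation chain: raw_data <- peak_list <- peptide_list
g.add((EX.peak_list, PROV.derives_from, EX.raw_data))
g.add((EX.peptide_list, PROV.derives_from, EX.peak_list))

# Recursive path computation: every entity that peptide_list
# (transitively) derives from. An equivalent SPARQL 1.1 query would
# use the property path `prov:derives_from+`.
for node in g.transitive_objects(EX.peptide_list, PROV.derives_from):
    if node != EX.peptide_list:  # transitive_objects yields the start node too
        print(node)
```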

1.3.4 Provenance Tracking on the Semantic Web

The RDF reification vocabulary [24] has traditionally been used by Semantic Web applications to track provenance in RDF documents. The RDF reification vocabulary consists of the four entities rdf:Statement, rdf:subject, rdf:predicate, and rdf:object. A variety of problems have been identified in the use of the RDF reification vocabulary for provenance tracking in Semantic Web applications, including the lack of formal semantics associated with the vocabulary, a disproportionate increase in the total number of RDF triples without a corresponding increase in information content, and the use of blank nodes [44].
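A minimal sketch of this triple blow-up, again in Python with rdflib: reifying a single statement requires four additional triples and a node to stand for the statement before any provenance can be attached. The entity URIs and the source property are hypothetical.

```python
from rdflib import Graph, Namespace, BNode
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")  # hypothetical URIs
g = Graph()

# The original statement: a single triple.
g.add((EX.protein_x, EX.encoded_by, EX.gene_y))

# RDF reification: four extra triples just to denote the statement,
# here anchored on a blank node.
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.protein_x))
g.add((stmt, RDF.predicate, EX.encoded_by))
g.add((stmt, RDF.object, EX.gene_y))

# Only now can provenance be asserted about the statement.
g.add((stmt, EX.derived_from_source, EX.journal_article_42))

print(len(g))  # 6 triples to carry one statement and one provenance fact
```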

1.4 Dissertation Goals

Existing work on provenance management has often addressed only parts of the above listed challenges. The pace of data generation and consumption in science has continued to accelerate and is expected to do so for the foreseeable future. This, coupled with the Web based eScience paradigm, is a unique opportunity for the research community to conduct transformational science, and provenance is the key enabler to realize this vision. The overarching goal of this dissertation is to define a comprehensive framework for provenance management that defines a common representation model for provenance, a dedicated provenance query infrastructure, a scalable provenance system, and an effective mechanism to track provenance on the Semantic Web. In essence, this framework will enable the creation of practical provenance systems to support scientific discovery in data driven science. The four sub-goals of this dissertation are:

1. A common model of provenance with well-defined formal semantics that facilitates provenance interoperability and is based on Semantic Web standards,

2. A query infrastructure for provenance information consisting of a classification scheme for provenance queries and a set of query operators to support a variety of provenance queries,

3. A practical provenance system that scales to provenance queries with high expression complexity and to very large scientific datasets, and

4. An efficient and effective provenance tracking mechanism for Semantic Web applications that does not require the use of the RDF reification vocabulary and blank nodes.

The research presented in this thesis, and related work, has been published in four conference papers [45] [44] [46] [47]; three journal papers [5] [48] [49]; five workshop papers [50] [51] [52] [53] [54]; and two book chapters [55] [56].

1.5 Summary and Dissertation Outline

This dissertation is organized into the following chapters: Chapter 2 reviews existing work in provenance management and introduces the notion of semantic provenance; Chapter 3 defines the foundational model of provenance that serves as a common provenance representation model; Chapter 4 describes a provenance query classification scheme and a set of provenance query operators; Chapter 5 describes the implementation of a scalable provenance query engine; Chapter 6 introduces a new approach for provenance tracking on the Semantic Web; Chapter 7 describes the use of the provenance management framework in a real world application; and we conclude and discuss future work in Chapter 8.

Chapter 2 Provenance Management: Approaches and Challenges

2.1 Introduction

In the previous chapter we discussed the exponential increase in the scale and complexity of experiments made possible by the cyberinfrastructure paradigm. Figure 2-1 illustrates a high-throughput scientific workflow, used in the NCRR-funded integrated technology resource for biomedical glycomics project [57], for the processing and analysis of proteomics data using mass spectrometry. The workflow generates hundreds of files per sample run.

Figure 2-1 Protocol for proteomics data analysis using mass spectrometer

The rapidly increasing volume of data raises multiple issues, such as:


a) How to leverage the data for critical insights that will in turn drive future research.
b) How to seamlessly manage (compare, integrate, and process) large volumes of data generated by hundreds of distributed laboratories using heterogeneous starting materials, equipment, protocols, and parameters.

We discussed the role of provenance information in addressing the above issues in the previous chapter. In the eScience informatics community, sustained research in provenance has led to many models for provenance creation, representation, storage, and querying [38]. Most current eScience approaches to provenance creation center on a “workflow engine perspective of the world,” that is, an experiment as viewed by a workflow engine. Hence the operations (in the form of Web services or scripts) orchestrated by the workflow engine are the principal actors in the resulting provenance descriptions, along with information about the input files and output files. This approach not only ignores the multiple domain specific relationships that link together data, processes, and equipment, but also imposes a system level view on what is essentially a scientific procedure. We term this category of provenance information system provenance, also sometimes called workflow provenance [58].

Figure 2-2 Evolution of semantic provenance from role of metadata in (a) data integration in distributed environments; (b) verification of data and validation of processes in eScience


2.2 Semantic Provenance

To be used effectively for management of the growing volume of scientific data, provenance information must be:

a) Expressive: A provenance model should be expressive enough to incorporate explicit domain semantics that closely reflect the complexities of scientific domains. This will facilitate the evolution of a scientific community backed provenance vocabulary and also enable software agents to accurately interpret the provenance metadata.

b) Software-interpretable: Human mediation is inadequate to process, analyze, integrate, store, and query the petabytes of data and associated metadata generated by the industrial-scale processes in eScience. To enable the provenance metadata to be used in eScience data management, it must be “computable” by software agents [59].

To achieve these two objectives, we extend the notion of provenance information and combine it with two important attributes of semantic metadata, namely domain knowledge and formal ontological underpinning (Figure 2-2), to define “semantic provenance” as follows: “semantic provenance information is created with reference to a formal knowledge model or an ontology that imposes a domain-specific conceptual view on scientific data. It consists of formally defined concepts linked together using named relationships with a set of rules to represent domain constraints.” We illustrate the distinction between system provenance and semantic provenance using a set of queries:

a) Queries answered using system provenance: An example is, “Find the original data from which result data X was derived.” This query utilizes the workflow-centric provenance information that documents the invocation order of processes, the input data, and the output data for each of the processes. Thus, using the links connecting the output data of a process to its input data, the original data entity for result data X can be traced and identified. Queries in this category are typically used to investigate the protocol that generated the data and to re-run a scientific workflow if needed for validation.

b) Queries answered using semantic provenance: Queries in this category are complex and involve relationships that tie data, processes, and instruments together using a domain-specific conceptual view. An example from the proteomics domain is, “Find proteins composed of peptides with the N-glycosylation consensus sequence {*N[^P][S/T]*} identified in samples labeled with O18.” This query utilizes relationships between data entities that are not modeled in a workflow view of provenance information, such as “a peptide is derived from a protein” and “proteins are identified from a particular sample.” Further, the query constrains the samples (introduced in detection equipment such as a mass spectrometer) to be labeled with O18 (an isotope of oxygen), which is again a domain-specific relationship. Note that in addition to incorporating domain-specific details, semantic provenance can also answer queries of type (a) described above. Hence, semantic provenance features richer structural and semantic details as compared to system provenance.

(The regular expression represents an amino acid sequence pattern with Asparagine (N), followed by any amino acid except Proline (P), and then followed by either Serine (S) or Threonine (T). This pattern can be preceded or followed by any number of amino acids; a runnable check is sketched below.)
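As an illustration of the domain constraint in the example query, the following Python snippet checks toy peptide sequences against the N-glycosylation consensus pattern described in the note above; the sequences are invented.

```python
import re

# Asparagine (N), then any amino acid except Proline (P),
# then Serine (S) or Threonine (T); any context on either side.
CONSENSUS = re.compile(r"N[^P][ST]")

for peptide in ["AANGSK", "AANPSK", "LMNVTR"]:  # toy sequences
    hit = CONSENSUS.search(peptide)
    print(peptide, "->", "consensus site found" if hit else "no site")
```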

We describe the semantic provenance framework for eScience (Figure 2-3) along three fundamental dimensions:

a) Semantic provenance creation: This involves a set of specialized tools plugged into a scientific workflow on demand to create semantic provenance metadata. Extraction of comprehensive metadata from multiple sources, such as generated scientific data and Web forms (e.g., for parameter specifications, equipment details, project details), is another important element of this dimension.

b) Knowledge models or ontologies: Domain-specific provenance ontologies to model scientific processes, data (including temporal information), and agents as formally defined concepts linked together using named relationships.

c) Query, analysis, and visualization: Using reasoning tools, software agents can process the semantic provenance information to answer complex domain queries. Semantic provenance information will also be used for comparison, integration, retrieval, and visualization of scientific data.

Figure 2-3 The three dimensions of semantic provenance


This semantic provenance framework supports the important requirements identified by the proposed Open Provenance Model (OPM) [60] (as part of the international provenance challenge) and also addresses many non-functional requirements using the rich set of publicly available resources created by the Semantic Web research community. Requirements addressed by semantic provenance are:

a) Provenance information interoperability: Using ontology schema mapping and merging techniques [61], semantic provenance from different workflows can be shared, integrated, and made interoperable.

b) Ease of application development: The wide availability of tools for Semantic Web resources, such as the Jena toolkit [62] and Sesame [63], makes it easier to develop applications.

c) Precise description of provenance: The well-defined semantics of the RDF model [27] and the expressive OWL language enable precise description of provenance information.

d) Inference capability and digital representation of provenance: Reasoning tools such as Racer [64], Pellet [65], and FaCT++ [66] can be used by software applications to perform reasoning over semantic provenance. Since digital representation is a foundational characteristic of the Semantic Web, semantic provenance supports digital representation of provenance information.

Additional advantages offered by semantic provenance:

a) Publicly available ontologies: The set of publicly available ontologies listed in the open biomedical ontologies (OBO) collection at NCBO [11] represents a tremendous research effort and should be reused for modeling provenance in the life sciences. Many other eScience domains are also developing high quality ontologies, for example in the geospatial sciences [67] and environmental sciences [68].

b) Storage and querying resources: The SPARQL query language [69] has been accepted as a W3C recommendation for querying RDF resources. There are multiple storage solutions available for Semantic Web resources including Oracle 11g [70], Kowari [71], and Virtuoso RDF [72].

c) Visualization tools for Semantic Web resources: Many open source applications have been developed for visualization and browsing of Semantic Web data. Some example projects include Welkin [73], multiple plug-in tools for the Protégé environment [74], and Semantic Analytics Visualization (SAV) [75].

2.3 Conclusions

In this chapter, we introduced the notion of semantic provenance, which, unlike database and workflow provenance, is expressive enough to incorporate domain semantics and is underpinned by formal ontology modeling to be computable by software agents. We demonstrated that semantic provenance is essential to create a metadata-based framework to manage scientific data, including support for complex user-defined provenance queries in scientific domains. The use of Semantic Web technologies to implement the semantic provenance framework allows applications to leverage a large number of existing resources that range from modeling languages such as OWL to storage solutions such as the Virtuoso RDF store.


Chapter 3 Provenance Representation and Provenir Ontology

3.1 Introduction

An expressive and extensible model of provenance, with well-defined formal semantics, is essential to support the increasing need for provenance interoperability and integration across research projects and institutions. In addition, a provenance model also needs to incorporate domain-specific details, also known as domain semantics, to support user and application requirements in different scientific domains. For example, the provenance queries composed by researchers in biomedicine use terms that reflect the complexities and details specific to the biomedical domain [5]. Hence provenance models, such as workflow or database provenance models, which do not incorporate domain semantics in their provenance representation, are inadequate for creating provenance management infrastructure in eScience and related application domains [5]. Further, provenance in scientific experiments has traditionally been recorded as hand-written notes to keep track of experiment conditions, results, and project details. In the current scenario of high-throughput experiment protocols that generate extremely large volumes of data, provenance models need to be amenable to automated processing and consistent interpretation by software agents. Ontologies are a suitable modeling solution for these requirements of provenance representation, and they support reasoning to discover implicit knowledge over large datasets [76] [40]. Ontologies are shared vocabularies modeled using a formal language, such as OWL [40], that represent knowledge of a particular domain. Ontologies are used in many scientific domains [11] and, as part of the Semantic Web initiative, are gaining rapid community acceptance. For example, the National Center for Biomedical Ontologies (NCBO) [11] lists 201 ontologies in the life science domain. Provenance ontologies not only reduce terminological heterogeneity to facilitate interoperability, but also support knowledge discovery tasks through reasoning. ProPreO was one of the first provenance ontologies developed to model provenance in high-throughput proteomics experiments [45].

3.2 ProPreO Ontology: Modeling Provenance in Proteomics Experiments

We developed the ProPreO ontology as a formal representation of proteomics processes and attendant data. The two critical aspects of any ‘-omics’ experiment are the identification of biological entities (What is it?) and their quantification (How much of it is there?). It is extremely difficult, especially in a high-throughput experiment, to answer these questions by querying datasets that are heterogeneous, developed by multiple researchers who use different methodologies, parameters, and data formats. Although provenance can form a foundation to compare different datasets and enable researchers to repeat experiments and track the attendant data [77] [78], workflow provenance is inadequate to support such queries [5]. In the case of experimental data, the context for comparison and correlation is provided by multiple factors such as the origin of the sample (e.g., malignant or benign tumor cells), the experimental methods used in the generation of the data (e.g., the chromatography method used to separate peptides), the settings of individual instruments (e.g., the laser intensity of an ion source), or the database used in the identification of peptides.

The starting point in the development of ProPreO was the Pedro UML schema [79], which models four stages of experimental proteomics, namely, Sample Generation, Sample Processing, Mass Spectrometry, and MS Results Analysis. However, the goals of ProPreO are distinct from those of the Pedro UML schema [79], and hence these four stages are not defined as top-level concepts in ProPreO. We iteratively evolved the current top-level concepts of ProPreO through multiple use cases of the applications listed above. We now discuss the top-level concepts of ProPreO and illustrate its ability to describe experimental hardware, data processing applications, laboratory tasks, computational tasks, parameter sets, and the resulting data (Figure 3-1). We also discuss its extensibility, which allows new classes of the above listed concepts to be included in the ontology. This reflects the state of real experimental protocols in biology, i.e., they must be sufficiently dynamic to keep pace with new technologies and paradigms.

1. data - We have modeled the various types of datasets generated at different phases of a glycoproteomics experiment. For example, data generated by analytical techniques such as mass spectrometry exist in multiple forms. Typically, the ‘raw data’ initially generated by a mass spectrometry instrument is in a proprietary format. Subsequently, it is processed to generate another subclass of data, that is, a list of glycopeptides. The metadata required to provide the relevant experimental context for comparison of processed datasets includes parameter values for the tasks that generated them. The rich set of relationships in ProPreO (Figure 3-1) allows data instances to be compared by associating them with instances of other relevant concepts such as parameter_lists.

2. data_processing_tool - There are many standard and locally developed software applications used to generate or process data at various stages of the experiment. Metadata semantically annotates data instances, associating each with specific software applications and forming an appropriate context for interpretation and processing.


Figure 3-1 A schematic representation of the proteomics data analysis protocol using ProPreO ontology concepts and the named relationships linking them

3. hardware - The hardware class has two subclasses, namely, instrument and instrument_component. This enables ProPreO to capture metadata describing the states of the various components of the instrument that generated a given instance of data. This metadata is also necessary to determine whether two datasets can be directly compared. For example, the instrument component HPLC_diode_array_detector has two properties that specify the range of wavelengths that are accessible to the detector (has_wavelength_detection_max and has_wavelength_detection_min). These properties allow a reasoner to infer that data generated using an ultra-violet (UV) detector cannot be directly compared with data generated using a visible light detector (see the sketch after this list).

4. molecule - This defines the very broad classes of molecules that are analyzed in a proteomics experiment, for example, glycans, proteins, and peptides. These classes are themselves defined (at least in part) by their equivalence to analogous classes in more specialized ontologies such as GlycO [45]. ProPreO extends these class definitions by adding properties that provide an experimental context of reference. This framework allows, for example, a peptide to be associated with its experimental_chemical_mass, a property that may not be defined in the referenced ontology.

5. organism - This class describes the taxonomic classification of a biological species, again by reference to a more specialized ontology, providing the biological context of a given sample instance.

6. parameter_list - Each instance of data has a set of parameter values that are associated with its generation. These include the instrumental settings, environmental parameters, and variable setup parameters used by software applications to process data. ProPreO models parameters relating to database searching, HPLC runs, and mass spectral analysis. The experimental context of a parameter (which is a type of data) can be inferred by its inclusion in a particular parameter_list. Parameters are further classified according to their relationships to specific experimental tasks, providing a rich framework for analyzing their relevance with regard to experimentally obtained data.

7. task - A glycoproteomics experiment can be viewed as a set of human-mediated or automated tasks that generate real-world samples or data to be used as input for the next step.
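The wavelength-range inference in item 3 can be paraphrased procedurally. The sketch below is a plain Python approximation, not a description-logic reasoner; only the two property names come from ProPreO, and the wavelength values are invented.

```python
# Invented values for two detectors; only the property names come from ProPreO.
uv_detector = {"has_wavelength_detection_min": 190,   # nm
               "has_wavelength_detection_max": 400}
visible_detector = {"has_wavelength_detection_min": 450,
                    "has_wavelength_detection_max": 800}

def directly_comparable(a, b):
    """Datasets are directly comparable only if the detectors'
    wavelength ranges overlap."""
    return (a["has_wavelength_detection_min"] <= b["has_wavelength_detection_max"]
            and b["has_wavelength_detection_min"] <= a["has_wavelength_detection_max"])

print(directly_comparable(uv_detector, visible_detector))  # False
```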

ProPreO currently has approximately 490 classes and 35 named relationships, with 145 constraints such as class-level restrictions. It is also populated with 3.1 million instances of propreo:tryptic_peptide. ProPreO has been released for community use and is listed in the OBO collection at NCBO. Similar efforts in creating provenance ontologies include the Microarray Gene Expression Data (MGED) ontology, created to track provenance in microarray experiments [80], and the Functional Genomics Investigation Ontology (FuGO), created to model provenance in functional genomics. These multiple projects in provenance management highlight the need for interoperability of provenance information, which can be facilitated if the provenance ontologies share a common modeling basis and a uniform use of terms. Foundational upper-level ontologies have traditionally been used to enforce such consistent design principles and to facilitate domain ontology interoperability and integration [81]. In the next section, we describe the creation of the upper-level provenance ontology called Provenir.


3.3 Provenir Ontology: An Upper-level Provenance Ontology

The primary challenge in defining the structure of the Provenir ontology is to strike a balance between the abstract upper-level ontologies, such as the Suggested Upper Merged Ontology (SUMO) [82], and the detailed domain-specific provenance ontologies. The Provenir ontology represents the set of provenance terms that are common across domains and can easily be extended by developers of domain-specific provenance ontologies according to their requirements. This not only significantly reduces the workload of ontology developers, but also ensures modeling consistency.

Figure 3-2 Provenir ontology schema

The top-level classes in the Provenir ontology are derived from the two primitive concepts of “occurrent” and “continuant” defined in philosophical ontology [83]. Continuant represents “… entities which endure, or continue to exist, through time while undergoing different sorts of changes, including changes of place” [83] and occurrent represents “…entities that unfold themselves in successive temporal phases” [83]. It is easy to see that occurrent corresponds to the “process” entities used in science, such as experiment processes and data processing. Other entities can be categorized into two sub-types of continuants, namely, “agents” such as a temperature sensor and “data” such as visualization charts. The three concepts of process, data, and agent form the top-level classes in the Provenir ontology (Figure 3-2), representing the primary components of provenance. The two base classes, “data” and “agent”, are defined as specializations (sub-classes) of “continuant”. We present the definition of each class as follows:

1. data: This class models continuant entities that represent the starting material, intermediate material, and end products of a scientific experiment, as well as parameters that affect the execution of a scientific process. The class data inherits the properties of continuants, such as enduring or existing while undergoing changes.

2. process: This class models the occurrent entities that affect individuals of data (for example, process, modify, create, and delete among other dynamic activities).

3. agent: This class models the continuant entities that causally affect the individuals of process.

In addition to these three base classes, the Provenir ontology also differentiates between the data values that undergo change in a process, for example, conversion of sensor data to create visualization charts, and parameter values that influence the behavior of a process (Figure 3-2):

a) data_collection: This class represents atomic or composite data entities that are acted upon during a process.

b) parameter: parameter is a class of values that affect the behavior of a process or agent as constraints. There are three subclasses of parameter, defined along the three dimensions of space, time, and theme (domain-specific):

   i) temporal_parameter: This class captures the temporal details associated with individuals of the data_collection class (for example, the timestamp associated with a sensor reading), process (for example, the duration of a protein analysis process), and agent (for example, the time period during which a sensor was working correctly).

   ii) spatial_parameter: This class represents the spatial details associated with individuals of the process, agent, or data_collection classes. The geographical location of an ocean buoy is an example of a spatial_parameter.

   iii) domain_parameter: The domain_parameter class is used to model domain-specific parameters (for example, tolerable salinity levels for ocean buoys).

Further, the explicit modeling of relations as first class entities is an important characteristic of Semantic Web modeling languages. For the Provenir ontology we adapted the properties defined in the Relation Ontology (RO) [83] to link provenance terms. The RO was created by the Open Biomedical Ontologies (OBO) Foundry. RO defines a set of ten primitive properties with well-defined “domain” and “range” values. These properties were mapped to appropriate Provenir classes with restrictions on their domain and range values as required for provenance modeling. For example, the class data is linked to a process as either an input or output value, and this is modeled by the relation “has_participant” (domain: process, range: data). Similarly, the notion that an agent initiates, modifies, or terminates a process is captured by the relation “has_agent” (domain: process, range: agent). We note that the class data is linked to agent only through an intermediary process; for example, a list of peptides (data) is linked to a mass spectrometer (agent) through the protein analysis process. In the next sections, we describe the properties of the Provenir ontology (Figure 3-2).

1. part_of – This property is defined for each of the three base classes of the Provenir ontology. The restriction for this property is that the domain and range values belong to the same class. For example, if data is the domain (range) of the property, the corresponding range (domain) is also data. As defined in the RO [83], this property satisfies the standard axioms of mereology, that is, reflexivity, anti-symmetry, and transitivity.

2. contained_in – The contained_in property is defined with similar constraints as part_of, that is, the domain and range values belong to the same class and do not overlap. The property is defined for the data and agent classes. Consistent with its definition in RO, the property is also defined to be non-transitive.

3. adjacent_to – This property is defined for disjoint continuants in RO. In the Provenir ontology, it is defined only for the agent class, where the adjacent spatial location of individuals of the agent class may have an effect on data values. For example, the presence of a sensor generating a magnetic field may affect the quality of observations made by an adjacent sensor. We note that, similar to the mereotopological relations defined in RO [83], such as partial overlap and tangential proper part, corresponding properties can be added to extensions of the Provenir ontology.

4. transformation_of – RO defines transformation_of as a property between two entities that preserve their identity between the two transformation stages. For example, let there be an individual that belongs to a subclass of data, say d1, at time t1, and that also belongs to another subclass of data, say d2, at a time t2 with t1 > t2. If at no time instant the individual is a member of both d1 and d2, then there is a transformation_of property linking the two individuals.

5. derives_from – This RO property represents the derivation history of data entities as a chain or pathway. Unlike the transformation_of property, which links identical entities, derives_from links distinct individuals of data. For example, a peptide sample is derived from a protein sample.

6. preceded_by – This RO temporal property is defined for distinct individuals of the process class. Similar to its interpretation in [83], more specific types of properties, such as “immediately_preceded_by” [83], with more precise semantics may be defined in extensions to the Provenir ontology.

7. has_participant – This is the primary property linking data to process, where the individual of the data class participates in a process.

8. has_agent – This is a causal property that links agent to process, where the agent is directly responsible for the change in state of the process. Similar to the description used in [83], the Provenir ontology also allows the use of this property to “capture the directionality” of scientific experiments, for example, to capture which agent caused the activation of a process.

9. has_parameter – This property is not present in RO [83]. It links an individual of the class parameter to an individual of the classes data_collection, agent, and process. Two sub-properties of has_parameter, describing the temporal and spatial parameters, are also defined:

   i) has_temporal_value – This is a specific property to assign a temporal value to individuals of the data_collection, process, and agent classes. For example, d1 has_temporal_value t1.

   ii) located_in – An instance of data or agent is associated with exactly one spatial region that is its exact location at a given instant of time. In the Provenir ontology, this relation has the two domain classes agent and data_collection, with spatial_parameter as the range class.

The Provenir ontology is represented in OWL DL, a decidable profile of OWL [26], with an expressivity of ALCH. The ontology is under active consideration at the OBO Foundry for listing as a foundational model of provenance.
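Putting several of these properties together, the following minimal sketch (Python with rdflib) encodes the running example from this section: a peptide list linked to a mass spectrometer only through the intermediary protein analysis process. The namespace URI, instance names, and timestamp are hypothetical; the published Provenir URIs may differ.

```python
from rdflib import Graph, Literal, Namespace

# Hypothetical namespace; the published Provenir URI may differ.
PROV = Namespace("http://example.org/provenir#")
LAB = Namespace("http://example.org/lab/")

g = Graph()
g.bind("provenir", PROV)

# data participates in a process; the agent is linked to the same process,
# so data and agent are connected only through the intermediary process.
g.add((LAB.protein_analysis, PROV.has_participant, LAB.peptide_list))
g.add((LAB.protein_analysis, PROV.has_agent, LAB.mass_spectrometer))

# Derivation history of the data, and a temporal value on the process.
g.add((LAB.peptide_list, PROV.derives_from, LAB.protein_sample))
g.add((LAB.protein_analysis, PROV.has_temporal_value,
       Literal("2010-07-29T10:00:00")))

print(g.serialize(format="turtle"))
```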

3.4 A Modular Approach for Creating Provenance Ontologies

Domain-specific details are an important component of provenance metadata. But, a single monolithic provenance ontology that models all possible details from different domains is clearly not feasible. Hence, our proposed modular approach involves integrated use of multiple ontologies, each modeling provenance metadata specific to a particular domain (for example, the Open Biomedical Investigation (OBI) ontology represents biomedical domain-specific provenance [80]). These multiple ontologies will use the Provenir ontology as the common reference model, hence making it easier to interoperate with each other. This modular framework represents a scalable and flexible approach to provenance modeling that can be adapted to the specific requirement of different domains. The Provenir ontology forms the core


component of a modular approach to our provenance management framework. We also demonstrate that the Provenir ontology is a well-engineered foundational ontology that can be easily extended.

3.4.1 Parasite Experiment Ontology: Provenance Tracking in Parasite Research

The Parasite Experiment (PE) ontology was developed as part of the NIH-funded T.cruzi Semantic Problem Solving Environment (SPSE) project [46]. The PE ontology extends the classes and relationships of the Provenir ontology to model provenance information associated with the “Gene Knockout” (GKO) and “Strain Creation” (SC) experiment protocols (Figure 3-3).

Figure 3-3 Extending Provenir ontology to create Parasite Experiment ontology


First, we discuss the modeling of the process entities that constitute the GKO and SC experiment protocols. Two classes, namely gene_knockout_process and strain_creation_process, are created as subclasses of the provenir:process class to model generic gene knockout and strain creation experiment processes. The knockout_project_protocol and strain_creation_protocol classes represent the particular protocols used in the Tarleton research group [46]. The GKO and SC protocols consist of multiple sub-processes, which are also modeled in the PE ontology, for example, sequence_extraction, plasmid_construction, transfection, drug_selection, and cell_cloning (Figure 3-3). Next, we describe the PE ontology concepts that model the datasets and parameters used in the GKO and SC experiment protocols. Like the Provenir ontology, the PE ontology models data and parameter values as distinct entities; for example, for the transfection process, its input value Tcruzi_sample is modeled as a subclass of provenir:data_collection, whereas the parameter value transfection_buffer is modeled as a subclass of provenir:parameter. Further, the parameter values in the PE ontology are categorized along the space, time, and theme (domain-specific) dimensions (Figure 3-3). The third set of PE ontology concepts extends the provenir:agent class to model the researchers and instruments involved in an experiment. For example, transfection_machine and microarray_plate_reader are instruments modeled as subclasses of provenir:agent; researcher is an example of a human agent; and knockout_plasmid is an example of a biological agent. Finally, we describe the properties used to connect the PE ontology classes. In addition to the eleven relationships in the Provenir ontology, new object and datatype properties specific to the GKO and SC experiment protocols were created. For example, four new object properties are defined to model the similarity relationships between two genomic regions, namely is_paralogous_to, is_orthologous_to, is_homologous_to, and is_identical_to. The PE ontology is modeled using the OWL DL language with a DL expressivity of ALCHQ(D), and contains 118 classes and 23 named relations. The PE ontology is available as open source through the NCBO.
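The extension pattern itself reduces to a small set of rdfs:subClassOf assertions. The following sketch (with hypothetical namespace IRIs) illustrates how PE ontology classes specialize the Provenir base classes; under the standard RDFS entailment rules, any instance typed with a PE class can then be inferred to be an instance of the corresponding Provenir class.

PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX provenir: <http://example.org/provenir#>   # hypothetical namespace IRI
PREFIX pe:       <http://example.org/pe#>         # hypothetical namespace IRI

INSERT DATA {
  # Process entities of the GKO and SC protocols specialize provenir:process
  pe:gene_knockout_process rdfs:subClassOf provenir:process .
  pe:transfection          rdfs:subClassOf provenir:process .

  # Datasets and parameter values are modeled as distinct entities
  pe:Tcruzi_sample         rdfs:subClassOf provenir:data_collection .
  pe:transfection_buffer   rdfs:subClassOf provenir:parameter .

  # Instruments and biological agents specialize provenir:agent
  pe:transfection_machine  rdfs:subClassOf provenir:agent .
  pe:knockout_plasmid      rdfs:subClassOf provenir:agent .
}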

3.4.2 Janus Ontology: Modeling Workflow Provenance in the Taverna Scientific Workflow Engine

In collaboration with researchers at the University of Manchester working on Taverna [84], arguably one of the most popular open-source scientific workflow engines, we extended the Provenir ontology to create the Janus ontology for modeling workflow provenance [50]. The existing Taverna provenance model defined in [50] consists of a static component describing the workflow specification (processors,


processor ports, and data dependencies as links between ports) [50], while the dynamic component of the provenance model represents entities associated with the workflow execution [50]. As illustrated in Figure 3-4, the classes janus:workflow_spec, janus:processor_spec, and janus:port are used to model the static component of the existing Taverna provenance model [50]. Similar to the approach used in the PE ontology, the Janus ontology re-uses classes defined in four public ontologies listed at the NCBO, namely BioPAX, the National Cancer Institute Thesaurus, the Foundational Model of Anatomy (FMA), and the Sequence Ontology, as well as a fifth ontology, OWL Time, available from the W3C. This approach allows the interoperability of provenance graphs generated as part of the Janus framework with existing biological data resources such as KEGG [85], Reactome [4], and BioCyc [86]. As part of our ongoing collaboration, we plan to extend the Janus ontology with domain terms used to annotate the default set of services in the Taverna release version [50].

Figure 3-4 Extending Provenir ontology to create Janus ontology for Taverna workflow engine

3.4.3 Trident Ontology: Provenance Tracking in Oceanography

The provenance information for the Neptune project can be divided into two categories: (a) workflow provenance corresponding to the components of a scientific workflow, and (b) domain-specific


provenance capturing the domain semantics, such as details of the sensors and ocean buoys. Existing provenance systems have primarily focused on workflow provenance alone [87], while partially or completely ignoring domain semantics. For example, in the Neptune oceanography project (discussed in Chapter 5), the descriptions of the sensors and the locations of the ocean buoys are critical provenance information. The Provenir ontology and its extension in the form of the Trident ontology incorporate both the domain semantics and the workflow-specific provenance. The Trident ontology extends the Provenir agent class to model the temperature and ocean current sensors as well as the ocean buoys. Further, the sensors are linked to specific ocean buoys using the contained_in relation, which enables tracking of sensor data from damaged ocean buoys. Figure 3-5 gives an overview of the Trident ontology schema and maps its classes to the Provenir ontology.

Figure 3-5 Extending Provenir ontology to create Trident ontology for oceanography

The formal OWL-based representation of provenance information in the Trident ontology enables the use of the rich set of Semantic Web reasoning mechanisms [76] for provenance query and analysis.
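As a concrete illustration, the following SPARQL sketch (with hypothetical namespace IRIs and instance identifiers) uses the contained_in relation to retrieve the sensors housed in a damaged ocean buoy, so that their observations can subsequently be traced and discarded.

PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX provenir: <http://example.org/provenir#>   # hypothetical namespace IRI
PREFIX trident:  <http://example.org/trident#>    # hypothetical namespace IRI

SELECT ?sensor
WHERE {
  # Sensors are agents contained in a specific ocean buoy
  ?sensor rdf:type provenir:agent .
  ?sensor provenir:contained_in trident:oceanBuoy7044 .
}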


3.5 Related Work

Multiple provenance representation models have been proposed, including the Open Provenance Model (OPM) [88], the provenance vocabulary [89], and the proof markup language [90]. Provenance representations in the context of relational databases extend the relational data model with annotations [91], provenance and uncertainty [92], and semirings of polynomials [37]. The Provenir ontology can be extended to model the provenance of tuple(s) in relational databases, relying on mappings defined between description logic and relational algebra [93]. OPM, similar to the Provenir ontology, defines “agent”, “process”, and “artifact” as its three top-level concepts. But, in contrast to the Provenir ontology, which models 11 fundamental named relationships, the OPM core specification models only “causal relations” linking provenance entities [88]. This is a significant drawback of OPM, since it makes it difficult to model the partonomy, containment, and non-causal participatory provenance properties needed in many eScience applications (discussed earlier in Section 3.4). The provenance vocabulary, unlike the Provenir ontology, has been specifically created to model provenance in Linked Open Data (LOD) applications [89], but it does not incorporate domain semantics, which is essential for using provenance information in an application. The Provenir ontology has been extended to create a provenance ontology for sensor observations being published as part of the LOD cloud [51]. In addition to these drawbacks related to the modeling of provenance information, the OPM is defined as a generic graph structure without an underlying formal representation language [94]. Hence, unlike the Provenir ontology, it does not support a set of standard entailment rules (for example, the RDFS [27] or OWL [40] inferencing rules). Further, the inference rules defined in the OPM core specification are limited, and the OPM specification does not define a clear mechanism for defining additional rules that may be required by an application domain [88]. In contrast, the Provenir ontology takes advantage of the Semantic Web inferencing layer [76] to support the definition of any domain-specific inference rules.

3.6 Conclusions

In this chapter, we addressed the challenge of defining an expressive model for provenance representation that incorporates domain-specific details and has well-defined formal semantics. We first described the ProPreO ontology, one of the first domain-specific provenance ontologies, which models provenance information for the proteomics domain. Since extending the ProPreO ontology to model provenance in other application domains was not practical, we defined a modular approach for provenance representation in various domains underpinned by a common reference ontology. We created the Provenir ontology as an upper-level ontology for provenance representation. The Provenir ontology models a minimal set of common provenance concepts and relationships that can be extended to create domain-specific provenance ontologies in multiple scientific domains. The suite of domain-specific provenance


ontologies uses uniform modeling patterns consistent with the Provenir ontology, thereby facilitating provenance interoperability. The extensibility of the Provenir ontology is demonstrated through the creation of three domain-specific provenance ontologies: in the T.cruzi biomedical informatics project for parasite research, the Taverna workflow system, and the Neptune oceanography project.


Chapter 4 Provenance Query and Analysis

4.1 Introduction

In addition to provenance representation, a well-defined and expressive query mechanism is essential to enable real-world applications to effectively analyze provenance information. The provenance literature discusses a variety of queries that are often executed using generic or project-specific query mechanisms that are difficult to re-use. Provenance queries in workflow systems focus on the execution of computational processes and their input/output values (for example, the queries featured as part of the Provenance Challenge [95] are focused on workflow systems). Provenance queries in relational databases trace the history of a tuple or data entity [96]. In contrast, scientists formulate provenance queries that incorporate domain semantics using application-specific terminology [5]. Further, domain scientists are often not familiar with query languages such as SQL and face difficulty in composing effective provenance queries. Provenance queries discussed in the literature [37, 97, 98] are characterized by complex query patterns; for example, queries in SQL are recursive as well as nested [99, 100]. In addition to the query pattern complexity, the increasingly large size of scientific datasets incorporating provenance information [101] makes provenance query optimization an area of active research [102] [103]. We argue for a dedicated provenance query infrastructure that (a) supports provenance queries using domain-specific vocabulary, (b) can be re-used in different provenance applications, (c) supports composition of provenance query patterns without requiring scientists to learn complex query languages, and (d) supports optimizations for efficient execution of provenance queries over both large datasets and complex query patterns. We address the first three issues, that is, (a), (b), and (c), in this chapter and discuss approaches for the optimization of provenance queries in Chapter 5. Provenance queries have characteristics that can be exploited to create an effective query infrastructure that supports queries from multiple domains and projects. We defined one of the first classification schemes for provenance queries [43], which uses the input/output values of a query as the basis for categorization. Further, we use this classification scheme to define a set of provenance query operators. The provenance query operators are defined in terms of the Provenir ontology, which enables them to be used with any domain-specific provenance ontology that extends the Provenir ontology. First, we discuss the characteristics of provenance queries and then introduce the provenance query classification.


4.2 Provenance Query Characteristics and Classification Scheme

We characterize provenance queries based on the input/output values of the queries. For example, a common query is to retrieve the provenance information associated with a given entity (input = entity, output = provenance trace). If we interchange the input and output values of a query (input = provenance trace, output = entity), another type of query can be identified that retrieves a set of entities that satisfy a set of provenance information assertions. In addition to these two categories, we identify two other categories of provenance queries for manipulating (comparing and combining) provenance information in our classification scheme:

Category I: Queries to Retrieve the Provenance of an Entity: Given a data entity, this category of queries retrieves the complete provenance information that influenced the current state of the data entity. This category of provenance queries is characterized by a transitive closure operation. If the provenance information is represented as a graph, a path is computed from the input entity, through intermediate data entities and processes, to the initial or starting point of the provenance graph. This path, representing the provenance of the input value, may either be composed of only the data and processes that are directly linked to the input value or may contain additional provenance information, such as parameter values and the types of instruments used in an experiment. An example query in the biomedical informatics domain is the retrieval of the provenance of a new “stem cell line”. The result will include the starting material, the conditions under which the experiment was conducted, the experiment protocol that was followed, the instruments used, and the researcher(s) who conducted the experiment.

Category II: Queries to Retrieve Entities (satisfying provenance constraints): These queries retrieve entities that satisfy a given set of provenance constraints. The provenance constraints can be interpreted as contextual metadata used to identify a subset of entities, and are collectively termed the “provenance context”. The results of this category of queries are interpreted to have been generated or processed under (a) similar or (b) equivalent conditions, depending on the specificity of the provenance context structure. An example query from the Neptune oceanography project [104] is the retrieval of “chart visualization” files created using data from a damaged ocean buoy with identifier “oceanBuoy7044”, located at the geographical coordinates “475111N:1222118W”, between “April 21, 2003 and May 2, 2003”. This class of queries does not figure frequently in the provenance literature and has recently been categorized as more complex than the first category of queries [100].

Category III: Queries to Compare Provenance Traces: Comparison of the provenance traces of two (or more) entities is the third category of provenance queries. The provenance traces can be tested for identity or inequality, or checked for various degrees of similarity. (Note that sub-graph isomorphism is NP-complete, while graph isomorphism is not known to be solvable in polynomial time.) An example query of this category is the comparison of two temperature sensor observations from the Neptune oceanography project [104], which requires the comparison of the associated provenance information to verify that the observations were created under equivalent experimental conditions and generated by the same type of sensors.

Category IV: Merging of Provenance Traces: In many application domains, entities are generated in multi-step processes or phases. For example, in parasite research [105], an avirulent strain of a parasite is created using two experiment protocols, namely Gene Knockout and Strain Creation. During the Gene Knockout experiment, gene(s) are “knocked out” to create “knockout construct plasmids”, which are subsequently used in the Strain Creation experiment to create a new strain of the parasite. Identifying the particular gene(s) that were “knocked out” to create a specific strain of the parasite requires merging the provenance traces of the two experiments [46].

The four broad categories of provenance queries can also be further sub-divided into more specific types of queries. For example, Category I queries can be restricted to retrieve provenance information describing instruments used in a biomedical experiment. This sub-division is driven by application requirements and will vary across projects and domains. There is more recent work in characterizing provenance queries. For example, [100] characterizes the Category I (provenance retrieval) and Category II (data retrieval) as “backward” and “forward” provenance queries. In addition to facilitating a better understanding of provenance query characteristics, we use the classification scheme to create a practical provenance query infrastructure consisting of a set of specialized query operators and a query engine that efficiently implements the query operators.

4.3 Provenance Query Operators

We introduce the provenance query operators as a standard query mechanism that can be consistently implemented in provenance management applications. The query operators also allow users to leverage their well-defined semantics to issue queries without having to repeatedly compose complex query patterns. We first discuss the provenance model for the query operators, followed by a description of their syntax and semantics.

4.3.1 Using the Provenir ontology as the Provenance Model

The provenance query operators are defined in terms of the Provenir ontology schema (discussed in the previous chapter) and are executed over instance data represented as an RDF document [24]. The former corresponds to the description logic Terminological Box (TBox) and the latter to the Assertional Box (ABox) [106]. The provenance RDF graph is a grounded graph with no blank nodes, since blank nodes have several disadvantages, including the lack of any global meaning outside the enclosing RDF document [27]. Further, the values in the provenance RDF graph are “typed” using the rdf:type property in terms of concepts and relations in either the Provenir ontology or a domain-specific provenance ontology (extending the Provenir ontology). In case a domain-specific provenance ontology is used to “type” instance values, the standard RDFS entailment rules [27] allow applications to infer the corresponding Provenir ontology concept or property (by following the rdfs:subClassOf or rdfs:subPropertyOf properties). Hence, the definition of the provenance query operators in terms of the Provenir ontology allows the query operators to be executed over any provenance RDF graph that conforms to a domain-specific ontology extending the Provenir ontology. This enables the provenance query operators to be seamlessly ported to different application domains. We use the standard RDF graph syntax and semantics [27] to define the provenance query operators. Further, we describe a provenance graph as a quadruple consisting of vertices, edges, a mapping function to label vertices and edges, and a mapping function to map vertices to Provenir classes:

G = (V, E, l, m), where
E ⊆ V × V,
l : V ∪ E → L (L is a set of Uniform Resource Identifiers), and
m : V → Nm (Nm is the set of Provenir ontology classes).

4.3.2 provenance ( ) Query Operator

The first query operator is defined to support Category I provenance queries and is a closure operation on the provenance information. The operator takes as input any given data entity and returns the complete provenance information associated with the data entity. The provenance () query operator defines a pattern constructed in two steps:

(a) Initialization phase: The individuals of the class process linked to the input value by the property has_participant (Figure 4-1) are added to an intermediate result set.

(b) Recursive phase: The process individuals are used to locate all individuals of the class process linked by the property preceded_by, which is implemented as a transitive closure function for the (process, preceded_by) class-property pair. Further, all individuals of the agent, parameter, and data_collection classes linked to the process individuals are added to the query result (Figure 4-1).


The query operator is defined formally using a set-theoretic representation, where the notation (s, p, o) stands for a subject, predicate, object triple as used in a provenance RDF graph.

Definition 1: provenance ()

SS: Search Space (the set of all available triples)
Proc  = {p | (p, rdf:type, process) ∈ SS}
Agent = {a | (a, rdf:type, agent) ∈ SS}
DataC = {d | (d, rdf:type, data) ∈ SS}
Param = {pa | (pa, rdf:type, parameter) ∈ SS}
Input: dc

IN = {t ∈ SS | t = (s, has_participant, dc)}

P is the smallest set such that
P = ({p | ∃ x, y. (p, x, y) ∈ IN} ∪ {p | ∃ x ∈ P. (x, preceded_by, p)}) ∩ Proc
A = {a | (p, has_agent, a) ∈ SS ∧ p ∈ P} ∩ Agent
D = {d | (p, has_participant, d) ∈ SS ∧ p ∈ P} ∩ DataC

OUT = IN ∪ {(p1, preceded_by, p2) ∈ SS | p1, p2 ∈ P}
         ∪ {(p, has_agent, a) ∈ SS | p ∈ P ∧ a ∈ A}
         ∪ {(p, has_participant, d) ∈ SS | p ∈ P ∧ d ∈ D}
         ∪ {(p1, part_of, p2) ∈ SS | p1, p2 ∈ P}
         ∪ {(a, has_parameter, pa) ∈ SS | a ∈ A ∧ pa ∈ Param}
         ∪ {(a1, adjacent_to, a2) ∈ SS | a1, a2 ∈ A}
         ∪ {(a1, part_of, a2) ∈ SS | a1, a2 ∈ A}
         ∪ {(a1, contained_in, a2) ∈ SS | a1, a2 ∈ A}
         ∪ {(d1, part_of, d2) ∈ SS | d1, d2 ∈ D}
         ∪ {(d1, contained_in, d2) ∈ SS | d1, d2 ∈ D}
         ∪ {(d1, transformation_of, d2) ∈ SS | d1, d2 ∈ D}
         ∪ {(d1, derives_from, d2) ∈ SS | d1, d2 ∈ D}

Note: The definition of the set P is recursive; it can be verified that P is well-defined.

The provenance ( ) query operator is expensive to implement and is not required in most scenarios; for example, the provenance metadata generated by a scientific workflow represents only a subset of the results returned by this operator. Hence, we define a specialized form of this operator, with additional constraints, called provenance_pathway ( ) to retrieve workflow-specific provenance metadata.

1. provenance_pathway ( ): This operator returns the provenance information of a data entity consisting only of individuals belonging to the data_collection and process classes. It reflects the requirements of workflow provenance, where provenance consists of a series of “activities” (where an “activity” is any subclass of process) and a set of “data entities” (where a “data entity” is any subclass of the data_collection class) that form the inputs and outputs of the processes. A set of constraints, defined as an expression over individuals belonging to data_collection, agent, process, and parameter, can also be specified as part of the input. For example, a query involving the workflow provenance of ocean current values may have a time-interval constraint of April 21, 2003 to May 2, 2003.

Definition 2: provenance_pathway ()

SS: Search Space (the set of all available triples)
Proc  = {p | (p, rdf:type, process) ∈ SS}
DataC = {d | (d, rdf:type, data_collection) ∈ SS}
Input: dc

IN = {t ∈ SS | t = (s, has_participant, dc)}

P is the smallest set such that
P = ({p | ∃ x, y. (p, x, y) ∈ IN} ∪ {p | ∃ x ∈ P. (x, preceded_by, p)}) ∩ Proc
D = {d | (p, has_participant, d) ∈ SS ∧ p ∈ P} ∩ DataC

OUT = IN ∪ {(p1, preceded_by, p2) ∈ SS | p1, p2 ∈ P}
         ∪ {(p, has_participant, d) ∈ SS | p ∈ P ∧ d ∈ D}
         ∪ {(p1, part_of, p2) ∈ SS | p1, p2 ∈ P}
         ∪ {(d1, part_of, d2) ∈ SS | d1, d2 ∈ D}
         ∪ {(d1, contained_in, d2) ∈ SS | d1, d2 ∈ D}
         ∪ {(d1, transformation_of, d2) ∈ SS | d1, d2 ∈ D}
         ∪ {(d1, derives_from, d2) ∈ SS | d1, d2 ∈ D}

Note: The definition of the set P is recursive; it can be verified that P is well-defined.

Figure 4-1 Schematic representation of provenance () query operator execution strategy


Next, we define the provenance_context ( ) query operator.

4.3.3 provenance_context ( ) Query Operator

This query operator supports the provenance queries in Category II. The query operator takes provenance values as input constraints and returns the data entities that satisfy those provenance constraints. In effect, the query input values define a formal “provenance context” composed of constraint values defined over all available provenance information. For example, provenance details such as the ocean buoy identifier “oceanBuoy7044”, the geographical coordinates “475111N:1222118W”, and the temporal constraint “April 21, 2003 to May 2, 2003” form a very specific contextual structure that is used to identify the data entities satisfying these provenance constraints. The data entities in the query result have similar or equivalent provenance and, hence, can be interpreted with an equal level of trust. The query operator can be formally defined using the provenance ( ) query operator.

Definition 3: provenance_context ()

SS: Search Space (the set of all available triples)
DataC = {d | (d, rdf:type, data) ∈ SS}

Input: a set of triples pc = pcg ∪ pcv, where pcg describes provenance values directly connected to data individuals and pcv includes variables in the input triples.

Output = {dc | pcg ⊆ provenance (dc)} ∩
         {dc | for all (v, x, y) ∈ pcv we have (dc, x, y) ∈ SS} ∩
         {dc | for all (x, y, v) ∈ pcv we have (x, y, dc) ∈ SS} ∩
         {dc | (dc, rdf:type, DataC) ∈ SS}
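A minimal SPARQL sketch of the oceanography example above is shown below; the namespace IRIs and the class name trident:chart_visualization are hypothetical. The constraints on the buoy, the sensor, and the observation time together play the role of the provenance context pc.

PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd:      <http://www.w3.org/2001/XMLSchema#>
PREFIX provenir: <http://example.org/provenir#>   # hypothetical namespace IRI
PREFIX trident:  <http://example.org/trident#>    # hypothetical namespace IRI

SELECT ?chart
WHERE {
  ?chart   rdf:type trident:chart_visualization .
  ?process provenir:has_participant ?chart ;
           provenir:has_agent ?sensor ;
           provenir:has_temporal_value ?time .
  # The sensor must be contained in the damaged ocean buoy
  ?sensor  provenir:contained_in trident:oceanBuoy7044 .
  # Restrict to the period after the buoy was damaged
  FILTER (?time >= "2003-04-21T00:00:00"^^xsd:dateTime &&
          ?time <= "2003-05-02T23:59:59"^^xsd:dateTime)
}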

Similar to the provenance_context ( ) operator that returns individuals of class data_collection, specialized operators can be defined for individuals of process and agent classes.

1. pc_process ( ): For a given set of constraints, retrieve all instances of the process class that satisfy the constraints. An example query from the oceanography scenario is “Find all invocations of the computational process ‘HyperCubetoDataTable’ with the parameter ‘InverseData’ set to false”.


Definition 4: pc_process ()

SS: Search Space (the set of all available triples)
ProcessC = {p | (p, rdf:type, process) ∈ SS}

Input: a set of triples pc = pcg ∪ pcv, where pcg describes provenance values directly connected to process individuals and pcv includes variables in the input triples.

Output = {dc | pcg ⊆ provenance (dc)} ∩
         {dc | for all (v, x, y) ∈ pcv we have (dc, x, y) ∈ SS} ∩
         {dc | for all (x, y, v) ∈ pcv we have (x, y, dc) ∈ SS} ∩
         {dc | (dc, rdf:type, ProcessC) ∈ SS}

2. pc_agent ( ): For a given set of constraints, retrieve all instances of the agent class that satisfy the constraints.

Definition 5: pc_agent ()

SS: Search Space (the set of all available triples)
AgentC = {a | (a, rdf:type, agent) ∈ SS}

Input: a set of triples pc = pcg ∪ pcv, where pcg describes provenance values directly connected to agent individuals and pcv includes variables in the input triples.

Output = {dc | pcg ⊆ provenance (dc)} ∩
         {dc | for all (v, x, y) ∈ pcv we have (dc, x, y) ∈ SS} ∩
         {dc | for all (x, y, v) ∈ pcv we have (x, y, dc) ∈ SS} ∩
         {dc | (dc, rdf:type, AgentC) ∈ SS}

4.3.4 provenance_compare ( ) Query Operator

Accurate comparison of scientific results requires the comparison of the associated provenance information. For example, two ocean visualization charts are said to be comparable if the associated provenance information (the types of sensors and the parameters used in the scientific workflow) is identical. We use the RDF graph equivalence definition [107], with the added functionality of “coloring” the nodes and labeling the edges using the Provenir ontology schema, to define equivalence between two provenance graphs. The formal definition of the query operator is as follows:


Definition 6: provenance_compare ()

Input: G1, G2, where
G1 = (V1, E1, l1, m1) and
G2 = (V2, E2, l2, m2).

The two provenance graphs G1 and G2 are equivalent if there are bijections
i : V1 → V2 and
j : E1 → E2 such that
∀ (e1, e2) ∈ E1 we have j((e1, e2)) = (i(e1), i(e2)) ∈ E2, and
∀ x ∈ V1 : m1(x) = m2(i(x)) holds.

The next query operator enables the merging of provenance information. For example, provenance from different stages of an experiment protocol can be merged to construct a unified view of the experiment process.

4.3.5 provenance_merge ( ) Query Operator

The query operator takes as input two provenance graphs and gives as output a single, merged provenance graph. The merged graph does not include any duplicates. A formal definition of the merge operator is given below.

Definition 7: provenance_merge ()

Input: G1, G2

G1 = (V1, E1, l1, m1)

G2 = (V2, E2, l2, m2)

Output: (V1 ∪ V2, E1 ∪ E2, l, m), where

l(x) = l1(x) if x ∈ V1 ∪ E1,
l(x) = l2(x) if x ∈ V2 ∪ E2,

and m : V1 ∪ V2 → Nm, where

m(x) = m1(x) if x ∈ V1,
m(x) = m2(x) if x ∈ V2.

4.4 Related Work

Provenance has been studied from multiple perspectives across a number of domains. In the following subsections, we correlate the provenance query operators introduced in this chapter to existing work in database and workflow provenance, as illustrated in Figure 4-2.

4.4.1 Database Provenance

Database provenance, or data provenance, often termed “fine-grained” provenance [58], has been extensively studied in the database community. Early work includes the use of annotations to associate a “data source” and “intermediate source” with data (the polygen model) in a federated database environment


to resolve conflicts [108], and the use of “attribution” for data extracted from Web pages [109]. More recent work has defined database provenance in terms of “Why provenance”, “Where provenance” [36], and “How provenance” [37]. A restricted view of “Where provenance” identifies each piece of input data that contributes to a given element of the result set returned by a database query. “Why provenance” was first described in [35]. We use the syntactic definition of “Why provenance” [36], which defines a “proof” for a data entity; the proof consists of a query, representing a set of constraints, over a data source with “witness” values that result in a particular data output. The semantics of the provenance ( ) query operator closely relates to both “Where provenance” and “Why provenance” [36]. To address the limitation of “Why provenance”, whose inclusion of the “…set of all contributing input tuples” leads to ambiguous provenance, [37] introduced semiring-based “How provenance”. The provenance ( ) query operator over a “weighted” provenance model, which reflects the individual contribution of each component (for example, process loops or the repeated use of single-source data), is comparable to “How provenance”. The Trio project [96] considers three aspects of the lineage information of a given tuple, namely, how a tuple in the database was derived, along with a time value (when) and the data sources used. A subset of queries in Trio, the “lineage queries” discussed in [96], can be mapped to either the provenance ( ) or the provenance_context ( ) query operator, depending on the input value. The SPIDER system [110], built on top of Clio [111], uses provenance information modeled as “routes” (schema mappings) between source and target data to capture aspects of both “Why provenance” and “How provenance”. Hence, it closely relates to the semantics of the provenance ( ) query operator.

Figure 4-2 Mapping provenance query operators with existing database and workflow provenance


4.4.2 Workflow Provenance

The rapid adoption of scientific workflows to automate scientific processes has catalyzed a large body of work on recording provenance information for the generated results. Simmhan et al. [38] survey different approaches for the collection, representation, and management of workflow provenance. The participants in the Provenance Challenge [39] collected provenance at different levels of granularity, ranging from comprehensive workflow system traces in PASS [112] and the use of semantic annotations of services by Taverna [113] to the recording of data value details and service invocations in Karma [87]. Recent work has also recognized the need to include domain semantics, in the form of domain-specific provenance metadata [5], along with workflow provenance. The provenance queries supported by these systems map to the semantics of the provenance ( ) query operator.

4.5 Conclusions

In this chapter, we discussed the foundations of a provenance query infrastructure by (1) analyzing the characteristics of provenance queries, (2) introducing a classification scheme for provenance queries, and (3) defining a set of provenance query operators. The provenance classification scheme identifies four categories of provenance queries. Using the classification scheme, we described a set of provenance query operators that use the Provenir ontology as the reference model (corresponding to a TBox) and are executed over provenance RDF graphs (corresponding to the ABox). This allows the provenance query operators to be used in multiple application domains that use the Provenir ontology to represent provenance information.


Chapter 5 Implementing the Provenance Query Infrastructure

5.1 Introduction

In the previous chapter, we introduced a set of provenance query operators as a step toward creating a provenance query infrastructure. The primary objective of the implementation is to create an infrastructure that can be easily added to existing scientific applications using Semantic Web technologies and is reusable across multiple domains. Hence, complementing the use of the Provenir ontology schema (expressed in OWL [40]) and the RDF representation model to store provenance information, we represent the provenance queries using the SPARQL RDF query language [69]. In this chapter, we discuss (a) the implementation of the query operators in a provenance query engine, (b) an analysis of the complexity of provenance query patterns in the SPARQL RDF query language, which makes a naïve implementation of the provenance query engine impractical for use in real-world applications, and (c) the use of a novel materialization strategy that leverages the Provenir ontology to optimize provenance queries. We first describe the architecture of the provenance query engine.

5.2 Provenance Query Engine

The provenance query engine is designed as a Java-based Application Programming Interface (API) to support the provenance query operators over an RDF data store. The query engine described in this section is integrated with an Oracle 10g (release 10.2.0.3.0) data store, but we note that the query engine can be used with any RDF data store that supports SPARQL [69] and inference rules. The Oracle 10g RDF data store uses a SQL table function (RDF_MATCH) to efficiently query RDF data [114]. The default Oracle 10g query interface does not support all SPARQL functions; hence, we used Oracle's Jena-based plug-in [62], which supports the full SPARQL specification. The provenance query engine consists of three functional components (Figure 5-1):

1. A query composer: The query composer maps the provenance query operators to SPARQL syntax according to the semantics of the query operators.

2. A function to compute transitive closure over RDF: The SPARQL query language does not support transitive closure over a node-edge combination. Hence, we have implemented a function to efficiently compute the transitive closure using the SPARQL ASK function. The result of this function is used by the query composer.


3. A query optimizer using Materialized Provenance Views: Using a new class of materialized views based on the Provenir ontology schema, called Materialized Provenance Views (MPV), a query optimizer has been implemented that enables the query engine to scale to very large RDF datasets.

We now describe the SPARQL query composer and the transitive closure function. First, we briefly describe the SPARQL language structure and the approach used by the query composer to map the provenance query operators to SPARQL. A SPARQL query is composed of a set of triples (each of the form (subject, predicate, object)) that form a query pattern called a basic graph pattern (BGP), where any one of the three constituents may be a variable. For example, (?x, rdf:type, provenir:process) is a triple pattern with the variable '?x'. The BGP is used to find a sub-graph of the RDF store such that, when the values in the sub-graph are substituted for the variables in the BGP, the result is an RDF graph equivalent to the sub-graph [69]. If no suitable instantiation of the variables in the BGP is found in the RDF store, no results are returned [69].
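For example, the following minimal SPARQL query (with a hypothetical namespace IRI) uses a two-triple BGP to bind every process individual together with its participating data entity:

PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX provenir: <http://example.org/provenir#>   # hypothetical namespace IRI

SELECT ?p ?d
WHERE {
  ?p rdf:type provenir:process .      # bind ?p to process individuals
  ?p provenir:has_participant ?d .    # bind ?d to their participants
}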

Figure 5-1 Architecture of the provenance query engine

The provenance query operators represent the complete result set by defining an exhaustive set of dependencies among data, process, and agent. However, in real-world scenarios, the provenance


information available can be incomplete due to application-specific or cost-based limitations. Hence, a straightforward mapping of the provenance query operators to SPARQL as a BGP is not desirable: such a BGP-based query expression pattern may not return a result in the presence of incomplete provenance information. The OPTIONAL function in SPARQL can be used to specify query expression patterns that succeed with partial instantiation, yielding a maximal “best match” result graph. Hence, the query composer uses the OPTIONAL function to create the query expression pattern. Further, as discussed in Section 5.1.1, the query composer also needs to compute the transitive closure over the (process, preceded_by) combination to retrieve all individuals of the process class linked to the input value. But, unlike many graph database query languages such as Lorel or GraphLog [115], SPARQL does not provide an explicit function for transitive closure to answer reachability queries.2 We now describe the transitive closure function implemented in the provenance query engine.
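The following sketch (with hypothetical IRIs) illustrates the “best match” pattern generated by the query composer: the agent and parameter information is returned when present, but its absence does not cause the whole query to fail.

PREFIX provenir: <http://example.org/provenir#>   # hypothetical namespace IRI
PREFIX ex:       <http://example.org/neptune#>    # hypothetical instance namespace

SELECT ?process ?agent ?param
WHERE {
  ?process provenir:has_participant ex:chartFile42 .
  OPTIONAL { ?process provenir:has_agent ?agent . }
  OPTIONAL { ?process provenir:has_parameter ?param . }
}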

Figure 5-2 RDF transitive closure using SPARQL ASK function

5.2.1 Computing Transitive Closure

We had two options for implementing the transitive closure function: a function tightly coupled to the RDF store or a generic function. We chose a generic implementation using the SPARQL ASK function, which allows the provenance query engine to be used over multiple RDF stores. The

2 The W3C DAWG postponed a decision on transitive closure in SPARQL.


SPARQL ASK function allows an “application to test whether or not a query pattern has a solution” [69] without returning a result set or graph. The transitive closure function starts with the process instance (p1) linked to the input value and then recursively expands the SPARQL query expression using the ASK function until a false value is returned, thereby terminating the function (Figure 5-2). The SPARQL ASK function, in contrast to the SELECT and CONSTRUCT functions, does not bind the results of the query to variables in the query pattern; hence, it is a low-overhead function for computing transitive closure. The results of a comparative evaluation of the SPARQL functions, along with the performance evaluation of an implementation of the provenance query engine, are presented in the next section.
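To make the strategy concrete, the iterative expansion can be sketched as a sequence of ASK queries of growing path length (with hypothetical IRIs); each query only tests whether the preceded_by chain extends one step further, and the function terminates at the first false answer.

PREFIX provenir: <http://example.org/provenir#>   # hypothetical namespace IRI
PREFIX ex:       <http://example.org/neptune#>    # hypothetical instance namespace

# Depth 1: is there a process preceding p1?
ASK { ex:p1 provenir:preceded_by ?p2 . }

# Depth 2: does the preceded_by chain extend one step further?
ASK { ex:p1 provenir:preceded_by ?p2 .
      ?p2   provenir:preceded_by ?p3 . }

# ...the pattern is expanded by one triple per iteration until ASK returns false.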

5.3 Evaluation Strategy

The objectives of evaluating the provenance query engine were threefold:

1. Evaluate the functionality of the provenance query operators in terms of their ability to accurately answer provenance queries,

2. Evaluate the scalability of the provenance query engine with increasing data size and query pattern complexity, and

3. Compare implementations of the transitive closure function using the SPARQL ASK, SELECT, and CONSTRUCT functions.

The data and queries used in the evaluation are from the Neptune oceanography project.

5.3.1 The Neptune Oceanography Project

The Neptune project [104], led by the University of Washington, is an ongoing initiative to create a network of instruments widely distributed across, above, and below the seafloor in the northeast Pacific Ocean. We consider a simulated scenario, illustrated in Figure 5-3, involving data collection by ocean buoys (each containing a temperature sensor and an ocean current sensor) and data processing by a scientific workflow that creates visualization charts as output. We consider two scenarios that require the use of provenance information for analyzing and managing the data in this project:

1. One of the ocean buoys is found to be damaged due to a recent storm. Hence, all visualization charts created using data from this ocean buoy after it was damaged need to be discarded, and new charts should be created using correct sensor data. This is a typical provenance-related query and can be addressed using two approaches. In the first approach, the provenance information of each visualization chart can be analyzed to locate the relevant set of possibly inaccurate visualization charts. In the second approach, a set of constraints can be defined on the provenance information to


retrieve the data entities that satisfy these constraints, for example, “given the identifier of the damaged ocean buoy, locate all visualization charts that used data from this buoy” (generated after the ocean buoy was damaged).

2. Another typical scenario involves the comparison of provenance information associated with given results. For example, oceanography researchers can query for two visualization charts that were generated from sensor data collected under equivalent weather conditions. The user can define the context for comparison of the provenance information in terms of wind speed, temperature, and precipitation.

Figure 5-3 A simulated oceanography scenario from Neptune Project

We use the data and the associated provenance information from this scenario to evaluate the performance of the provenance query engine.


5.3.2 Evaluating the Data and Expression Complexity of Provenance Query Operators

We use the two standard complexity measures of (a) query or expression complexity (varying the syntax of the query with a fixed data size) and (b) data complexity (varying the data size with a fixed query expression) [116] to characterize the performance of the provenance query engine. The evaluation results presented in this section are for the provenance ( ) query operator, which represents the majority of provenance queries. The expression complexity of SPARQL graph patterns using the OPTIONAL function is PSPACE-complete [117]. As discussed earlier in this section, the SPARQL query pattern for the provenance ( ) query operator requires the use of the OPTIONAL function with multiple levels of nesting. The other components that affect the evaluation of a SPARQL query are the total number of variables and the number of triples defined in the query pattern. Hence, to evaluate the expression complexity of the provenance ( ) query operator, we use five queries (listed in Table 5.1) with increasing numbers of variables, triple patterns, and OPTIONAL nesting levels over a fixed dataset size. The five queries, from the Neptune oceanography project, involve the retrieval of the provenance information associated with different data entities. The full SPARQL query patterns corresponding to Query 1 and Query 5 (in Table 5.1) are listed in Appendix A.

Table 5.1: Queries (expressed in SPARQL syntax) used to evaluate expression complexity of the provenance ( ) query operator

Query: Retrieve Provenance of given Input Value   Number of Variables   Number of Triples   Nesting using OPTIONAL
Q1. codar_mnty_908294932772185.nc                 31                    86                  4 levels
Q2. NetCDFReader90829493474170                    45                    126                 5 levels
Q3. HyperCubeSchema90829493462995                 45                    126                 5 levels
Q4. HyperCube90829493567757                       59                    166                 6 levels
Q5. ChartDataTable90829493849637                  73                    206                 7 levels

5.3.3 Experiment Setup and Dataset

The experiments were conducted using the Oracle 10g (Release 10.2.0.3.0) DBMS on a Sun Fire V490 server running 64-bit Solaris 9 with four 1.8 GHz UltraSPARC IV processors and 8 GB of main memory. The database used an 8 KB block size and was configured with a 512 MB buffer cache. The timings reported are the mean results from five runs with a warm cache. The dataset for the evaluations was generated from the oceanography scenario described in Section 5.3.1. The scientific workflow was executed using the Trident workflow workbench [118]. The Trident log file, augmented with additional details such as the temperature sensors, ocean current sensors, and ocean buoys, was used as a template to generate the RDF data.


Five datasets, corresponding to 100, 1,000, 10,000, 100,000, and 1 million experiment cycles, are used (Table 5.2). The largest dataset, corresponding to 1 million experiment cycles, contains about 308 million RDF triples. A rule index [114] was defined for inferencing over the datasets using the standard RDFS rules [25] and a user-defined rule to infer that “if the input value of a process (p1) is the same as the output value of another process (p2), then p1 is linked to p2 by the property preceded_by”. Table 5.2 lists the total number of new RDF triples inferred using the RDFS and user-defined rules for each of the five datasets.

Table 5.2: RDF datasets to evaluate the data complexity using the provenance ( ) query operator

Dataset Id   Number of Experiment Cycles   Number of Inferred RDF Triples   Total Number of RDF Triples
DS1          100                           23,620                           32,182
DS2          1,000                         226,671                          309,459
DS3          10,000                        2,257,096                        3,082,184
DS4          100,000                       22,560,776                       30,808,427
DS5          1,000,000                     225,596,929                      308,069,953
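The user-defined rule can be sketched as a SPARQL CONSTRUCT query (the actual implementation uses Oracle's rule index rather than CONSTRUCT); the has_input and has_output sub-properties of has_participant are hypothetical names introduced here only to distinguish the direction of participation.

PREFIX provenir: <http://example.org/provenir#>   # hypothetical namespace IRI

CONSTRUCT { ?p1 provenir:preceded_by ?p2 . }
WHERE {
  ?p1 provenir:has_input  ?d .   # hypothetical sub-property of has_participant
  ?p2 provenir:has_output ?d .   # hypothetical sub-property of has_participant
}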

5.4 Results

Figure 5-4 illustrates the individual contributions of the three components of the query engine (the transitive closure, query composer, and query execution modules) to the total query time for SPARQL query patterns of increasing complexity (Table 5.1). We note that, as the expression complexity increases, the query execution time dominates the total query time compared to the time taken for computing the transitive closure and composing the query.

Figure 5-4 Contribution of each provenance query engine component (in %)


We now discuss the results comparing the use of the three SPARQL functions - ASK, SELECT, and CONSTRUCT, followed by the query performance of an implementation of the provenance query engine.

5.4.1 Transitive Closure Computation using SPARQL Functions

This experiment compares the performance of transitive closure computation using the SPARQL “SELECT”, “CONSTRUCT”, and “ASK” functions. The transitive closure for the five queries, Q1 to Q5 (in Table 5.1), is computed using the largest dataset DS5 (in Table 5.2) of 308 million RDF triples.

Figure 5-5 Performance results for the three SPARQL functions to compute transitive closure on dataset DS5

The results in Figure 5-5 show that the ASK function consistently performs better than both the SELECT and the CONSTRUCT functions. This is because SELECT and CONSTRUCT bind the query results to query pattern variables and to a graph pattern, respectively. In contrast, the ASK function returns a Boolean value indicating whether a graph corresponding to the query pattern exists in the RDF store. Hence, the ASK-based transitive closure function implemented in the query engine is a low-overhead solution.

5.4.2 Query Expression Complexity

This experiment characterizes the expression complexity of the provenance ( ) query operator in SPARQL syntax. The results are for the total query time, including the time for transitive closure computation, query composition, and query execution, for the five provenance queries Q1 to Q5 (in Table 5.1). The queries are executed against a fixed-size dataset, DS5 (in Table 5.2), with about 308 million RDF triples representing 1 million experiment cycles. The results (Figure 5-6) demonstrate that the query time


increases sharply with increasing expression complexity of the queries. Thus, a straightforward implementation of the query engine is unusable for provenance management systems in eScience projects.

Figure 5-6 Evaluation results for expression complexity of provenance queries illustrating the significant increase in query time with increasing expression complexity

5.4.3 Data Complexity

This experiment characterizes the data complexity of the provenance ( ) query operator in SPARQL syntax. The query Q5 (in Table 5.1), which has the maximum expression complexity, is used as the fixed query for evaluation over RDF datasets of varying sizes (in Table 5.2). Figure 5-7 illustrates that the data complexity of the provenance ( ) query operator is also high, and an effective optimization technique is necessary for the provenance query engine to be used in a practical provenance management system.

Figure 5-7 Evaluation results for data complexity of provenance queries illustrating the significant increase in query time with increasing data complexity


Overall, the evaluation results clearly demonstrate that a straightforward implementation of the provenance query engine cannot scale with both increasing complexity of provenance query expressions and size of provenance data. In the next section, we introduce a new class of materialized views for query optimization to address both these issues.

5.5 Materialized Provenance Views: Optimizing Provenance Queries

Provenance queries are graph traversal operations for path computation. Path computation over graphs is an expensive operation, especially in the case of provenance queries, which require the computation of fixed paths, recursive pattern-based paths, and neighborhood retrievals. The results of our evaluation show that even industrial-strength RDF databases face severe limitations in terms of response time for complex provenance queries over large scientific datasets. To address this, we defined a new class of materialized views, called Materialized Provenance Views (MPV), that materialize provenance sub-graphs for selected classes of input values. We considered two constraints in deciding what data should be materialized: (a) the cost of maintaining the materialized views, that is, whether a materialized view needs to be recalculated when new RDF triples are added to the database, and (b) the number and complexity of the provenance queries that can be satisfied by an MPV. Provenance metadata by definition describes past events and is therefore not subject to frequent updates, except in the presence of errors; materialized views are therefore a suitable approach for provenance query optimization. For the second constraint, we use the Provenir model to identify one logical unit of provenance information that can be used to satisfy not one but multiple provenance queries. The provenance graph of the final output of an experiment cycle, the “Chart Visualization” file (Figure 5-8), represents such a logical unit of provenance information for the Neptune oceanography project scenario. The provenance query engine computes this unit of provenance information using the Provenir ontology. The MPV (Figure 5-8) can satisfy not only the provenance query for “Chart Visualization” files but also the provenance queries for all entities occurring in that one experiment cycle. A B-tree index is used to index all instances of the data_collection class that occur in that experiment cycle. Given an input value, the query optimizer looks up the B-tree index to decide whether the query can be satisfied by an MPV or must be answered by the underlying database. MPVs are created using a memoization approach instead of an eager strategy, since the total size of the MPVs created for all final output values would equal that of the database itself. In the next section, we discuss the evaluation results using MPV for query optimization.
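Before turning to the evaluation results, we sketch the materialization step itself (with hypothetical IRIs): a CONSTRUCT query extracts the provenance sub-graph reachable from one experiment cycle's final output, and the resulting graph is stored as the MPV. Only the first hop is shown; the engine recursively expands the pattern along preceded_by, as described in Section 5.2.1.

PREFIX provenir: <http://example.org/provenir#>   # hypothetical namespace IRI
PREFIX ex:       <http://example.org/neptune#>    # hypothetical instance namespace

CONSTRUCT { ?process provenir:has_participant ?data .
            ?process provenir:has_agent       ?agent . }
WHERE {
  ?process provenir:has_participant ex:ChartDataTable90829493849637 .
  OPTIONAL { ?process provenir:has_participant ?data . }
  OPTIONAL { ?process provenir:has_agent ?agent . }
}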


Figure 5-8 A MPV corresponds to one logical unit of provenance in the given application domain

5.5.1 Query Expression Complexity using MPV for Query Optimization

We use the straightforward implementation (discussed earlier in Section 5.3.2 and Section 5.3.3) as the benchmark against which to discuss the performance of the provenance query engine using MPV. In the first experiment, the queries from Section 5.3.2 are evaluated against an MPV. The MPV used in this experiment corresponds to one experiment cycle with the final output “ChartDataTable90829493849637”; the MPV contained 86 RDF triples and occupied 12 KB.


Figure 5-9 Comparing expression complexity results with and without using MPV

Figure 5-9 clearly demonstrates the significant reduction in total query time achieved through the use of the MPV. Note that a single MPV is used to satisfy all the provenance queries in Table 5.1. The percentage gain in performance with the MPV, as compared to query execution against the database, is listed in Table 5.3.

5.5.2 Data Complexity using MPV for Query Optimization

Similarly, the results in Figure 5-10 show a significant speed-up in query time over RDF datasets of varying sizes using the fixed query Q5 (Table 5.1). Table 5.3 lists the performance improvement for data complexity using MPV for query optimization.

Figure 5-10 Comparing data complexity results with and without using MPV


Table 5.3 Performance gain for provenance ( ) query operator using MPV

For Expression Complexity (gain in %)    For Data Complexity (gain in %)
Q1    99.98                              DS1    98.40
Q2    99.95                              DS2    98.42
Q3    99.95                              DS3    98.80
Q4    99.91                              DS4    99.26
Q5    99.90                              DS5    99.90

5.6 Related Work

The Query Language for Provenance (QLP) [98] is roughly comparable in functionality to the provenance query engine discussed in this chapter. QLP defines query constructs that allow easier composition of provenance queries. Unlike the RDF representation model supported by the provenance query engine, QLP assumes an XML-based provenance representation model and is primarily focused on workflow provenance. The vtPQL language, used in the VisTrails project [119], also defines query constructs for provenance queries but, similar to QLP, is primarily focused on workflow provenance [120].

5.7 Conclusions

In this chapter, we discussed the implementation of the provenance query operators in a provenance query engine. The provenance query engine consists of three components, namely, (a) a query composer, (b) a function to efficiently compute transitive closure over a node-edge combination, and (c) a novel materialized-view-based optimization technique that uses the Provenir ontology schema to identify the specific RDF sub-graph to be materialized. The MPV represents a logical unit of provenance and is computed automatically for different domains by the provenance query engine using the standard RDFS entailment rules. The provenance query engine is a practical application that can be easily added to existing scientific applications to support complex provenance queries over datasets of increasing size.


Chapter 6 Provenance Context Entity (PaCE): Scalable Provenance Tracking for Scientific RDF Data

6.1 Introduction

An increasing number of scientific applications are storing and disseminating their datasets using the Resource Description Framework (RDF) format [29] [85] [4]. RDF is also being used as an information integration platform in multiple scientific domains. The Biomedical Knowledge Repository (BKR) project at the U.S. National Library of Medicine is creating a comprehensive repository of integrated biomedical data from a variety of sources, such as the biomedical literature (textbooks and journal articles), structured databases (for example, the NCBI Entrez system [32]), and terminological knowledge sources (for example, the Unified Medical Language System (UMLS) [121]) [122]. BKR represents the integrated information in RDF; for example, the RDF statement “lipoprotein→affects→inflammatory_cells”, extracted by a text mining tool from a journal article (with PubMed identifier PMID: 17209178), states that lipoprotein (the “subject” of the RDF triple3) affects (the “property” of the triple) inflammatory_cells (the “object” of the triple). In addition to the advantages of more expressive modeling [123], storing information as RDF statements enables BKR to be compatible with the rapidly growing Linked Open Data (LOD) initiative, which currently has more than 4.2 billion RDF statements representing a large number of domains, including biomedicine, census data, chemistry, and geography [23]. In addition to the biomedical data, BKR also records and uses provenance metadata describing the history or lineage of the RDF statements. The provenance information identifies the source of an extracted RDF triple, temporal information (for example, the date of publication of a source article), version information for a database, and the confidence value associated with a triple (indicated by a text mining tool). The provenance information is essential in the BKR project to ensure the quality of the data and to associate trust values with the RDF triples. It has specific applications in the four services offered by BKR, namely, enhanced information retrieval (search based on the named relationship linking two entities), multi-document summarization, question answering, and knowledge discovery. We describe example scenarios that highlight the use of provenance information in these four services:

3 We use the terms RDF statement, RDF triple, and triple interchangeably.


1) Enhanced information retrieval service: Locate all documents that mention the RDF statement lipoprotein→affects→inflammatory_cells. A similar query uses the provenance metadata to identify all RDF triples extracted from a particular document.

2) Multi-document summarization: Rank RDF statements from multiple documents using the confidence associated with each statement (indicated by the text mining tool).

3) Question answering service: Specify that the answers should be sourced only from reputable entities (for example, curated databases) or extracted from journal articles published recently (e.g., during the past year).

4) Knowledge discovery service: Using reasoning rules, implicit knowledge can be inferred from existing RDF triples in the BKR project. Often, the provenance of the original triples is required to accurately interpret the new triples. The application of reasoning rules can also be restricted to a specific set of RDF triples based on their provenance.

To address the above requirements, BKR collects the provenance information associated with an RDF triple at two levels. At the first level, provenance information directly associated with an RDF triple is collected, including the source of the triple (journal article, database) or a confidence value associated with it. At the second level, BKR records additional provenance information collectively associated with a set of triples. For example, all triples extracted from a given journal article inherit the date of publication, author names, and set of index terms of that article (for example, in Medline [124]). The RDF reification vocabulary [24] has traditionally been used by Semantic Web applications to track provenance in RDF documents. The RDF reification vocabulary consists of the four terms rdf:Statement,4 rdf:subject, rdf:predicate, and rdf:object. Figure 6-1 illustrates the two levels of provenance recorded for the triple “lipoprotein→affects→inflammatory_cells” using RDF reification. A variety of problems have been identified in the use of the RDF reification vocabulary for provenance tracking in Semantic Web applications.
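To make the reification scheme of Figure 6-1 concrete, the following sketch, written in Python using the rdflib library, records the two levels of provenance for the example triple. The bkr and provenir namespaces, the statement identifier, the property names, and the literal values are hypothetical placeholders, not the actual BKR vocabulary.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

BKR = Namespace("http://example.org/bkr/")        # hypothetical namespace
PROV = Namespace("http://example.org/provenir/")  # hypothetical namespace

g = Graph()

# The domain triple itself: lipoprotein -> affects -> inflammatory_cells.
g.add((BKR.lipoprotein, BKR.affects, BKR.inflammatory_cells))

# Reification: four extra triples that carry no provenance information.
stmt = BKR.TripleID1432
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, BKR.lipoprotein))
g.add((stmt, RDF.predicate, BKR.affects))
g.add((stmt, RDF.object, BKR.inflammatory_cells))

# First level: provenance attached directly to the reified statement.
g.add((stmt, PROV.derives_from, BKR.PMID17209178))

# Second level: provenance shared by all triples from the same article
# (illustrative value only).
g.add((BKR.PMID17209178, BKR.date_of_publication, Literal("2007")))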

6.1.1 Limitations of the RDF Reification and Related Approaches

The limitations of the RDF reification vocabulary are discussed along two dimensions, namely (a) formal semantics and (b) implementation issues for real-world applications. The RDF specification [27] states that the RDF formal semantics does not extend to the reification vocabulary, and the intended interpretation of an RDF document using reification is application dependent (i.e., it may vary across applications) [24]. In addition to the limited formal semantics, the RDF syntax does not provide a mechanism

4 The rdf namespace represents the http://www.w3.org/1999/02/22-rdf-syntax-ns# Internationalized Resource Identifier (IRI).


to link the reified triple to the RDF statement itself [24]. For example, there is no support in the RDF syntax to link “TripleID1432” in Figure 6-1 to the triple lipoprotein→affects→inflammatory_cells. The lack of support for consistent interpretation of RDF documents using reification is a significant challenge for scientific projects such as BKR that aim to serve a large community of researchers and to support multiple applications. The RDF specification describes a “conventional use” of the reification vocabulary [24] in which “the subject of a reification triple” is a specific RDF triple in a particular RDF document and not any RDF triple that happens to have the same subject (S), predicate (P), and object (O). Specifically, the assertion TripleID1432→derives_from→PMID17209178 in Figure 6-1 is not applicable to all triples sharing the same S, P, and O.

Figure 6-1 Provenance tracking on the Semantic Web using RDF reification

Inference rules are an important component of Semantic Web applications, especially for knowledge discovery tasks in projects such as the BKR. However, the RDF specification states that entailment rules do not hold between an RDF triple and its reification [27]. Further, the use of blank nodes, which have no “global meaning” outside a particular RDF graph [27] and have no corresponding real-world entities in scientific domains, is a significant challenge for Semantic Web applications relying on reification. The use of blank nodes makes it difficult to use reasoning [125] and increases the complexity of query patterns,


since the queries have to explicitly take into account an extra entity (one that cannot be “typed” as an instance of a domain ontology class) in the query pattern.

We now describe the implementation-specific limitations of RDF reification. Though incorporating additional metadata descriptions in the form of provenance information necessarily increases the total size of an RDF document, the RDF reification approach leads to a disproportionate increase in the total size of the RDF document without a corresponding enhancement in its information content. For example, as illustrated in Figure 6-1, reification of a single RDF triple leads to the creation of four extra RDF triples that do not model any provenance-related information but are merely artifacts of the RDF syntax. This adversely affects the scalability of large projects, such as BKR, that track the provenance of hundreds of millions of RDF triples.

We now briefly describe two approaches, namely RDF named graphs [126] and RDF molecules [127], that enable RDF provenance tracking at different levels of granularity. The named graph approach associates an identifier with an RDF graph, which allows applications to make assertions about the set of RDF triples contained in the graph [126]; the proposal also defines the syntax and semantics of named graphs and their relationship to RDF triples. The limitations of the named graph approach include its coarse granularity (which makes it impractical for use in real-world applications) and its use of blank nodes. The RDF molecule is a similar approach that tracks provenance information at a finer level of granularity through lossless decomposition of an RDF graph into sub-graphs, but not for individual triples, using blank nodes [127]. In [128], a generalization of RDF named graphs called “colored RDF triples” is proposed that uses a semi-group structure to reason over provenance information, but this work does not address the primary disadvantage of the named graph approach, that is, the use of blank nodes.

In this chapter, we introduce a new approach called Provenance Context Entity (PaCE) to enable provenance tracking in Semantic Web applications using neither the RDF reification vocabulary nor blank nodes. The PaCE approach creates the S, P, and O RDF entities that reflect the provenance requirements of a Semantic Web application. PaCE is part of a broader framework for provenance management in scientific applications that was introduced in previous chapters (Chapters 2, 3, and 4).

6.2 Foundations of Provenance Context Entity

The intuition behind the PaCE approach is that the provenance associated with RDF statements provides the necessary contextual information for applications to interpret two RDF statements as equivalent or distinct. Contexts as formal objects have long been used in Artificial Intelligence (AI) applications, such as Cyc [129], and to a limited extent in the Semantic Web, to facilitate processing of information that does not have a global frame of reference [130]. The next section reviews the existing work on context theory in AI.


6.2.1 Context Theory in AI

Contexts were introduced as formal objects in AI systems in the 1990s [131] [129] to allow applications to process statements only in specific frames of reference. Using the construct ist(c, p), which asserts that a statement p is true in a context c, context theory also defined mechanisms called “lifting rules” to process statements in different contexts [131] [129]. The advantages of using contexts include (a) the ability to make domain-specific assumptions, (b) the selection of a manageable subset of the knowledge base, and (c) maintaining consistency within a context without the need to maintain global consistency [129]. There has been substantial work in context research, including appropriate extensions to the model theory and descriptions of the associated computational complexity [132-134]. Contexts have been introduced for use in the Semantic Web to address challenges faced by data aggregation applications such as the TAP project [135]. For example, two apparently contradictory statements, “John Kennedy is president of USA” and “Barack Obama is president of USA”, can be reconciled using contextual metadata describing the temporal information associated with the statements. The context mechanism in the TAP project associates a context with each Web data source, and all information extracted from the source is assumed to be true (in the given context) [135]. The context mechanism for the Semantic Web uses the ist(c, p) notation with appropriate extensions to the RDF model theory [135]. As discussed in [126], these extensions to the RDF model theory require significant changes to existing implementations of Semantic Web inference systems. In contrast, existing implementations can process RDF documents that use the PaCE approach, which is defined in the next section, to track provenance information.

6.2.2 Provenance Context and RDF Generation

The contextual information in the BKR project consists of the provenance information about the source of an RDF statement, that is, the journal identifier, the UMLS identifier, or the Entrez Gene identifier. In other words, this provenance context is a formal object instantiated in the form of a set of concepts and relationships that capture the contextual provenance information necessary to enable applications to correctly interpret RDF statements. Similar to the provenance context defined in the BKR project, other Semantic Web applications can also define a relevant provenance context for interpreting their RDF datasets. For example, an application in the sensor domain can define its provenance context to consist of the sensor used to collect data readings, the geographical location of the sensor, and the timestamp value associated with a data reading. To formalize the notion of provenance context, we define it in terms of the Provenir ontology. An application can define its provenance context either in terms of the provenir ontology or in terms of a domain-specific provenance ontology that extends the provenir ontology. For example, the BKR project uses the UMLS Semantic Network (SN) [121] as the domain ontology and


hence defines its provenance context in terms of the SN (in Section 6.4 we describe the mapping of relevant SN terms to the provenir ontology).

Figure 6-2 Schematic representation of the provenance contexts for (a) the BKR project and (b) a sensor application

Figure 6-2 (a) illustrates the provenance context for the BKR project and Figure 6-2 (b) illustrates the provenance context for the sensor domain (consisting of the sensor identifier, the geographical location of a sensor, and the date-time value associated with a sensor reading). It is important to note that the classes and properties used to define the provenance contexts for the BKR and sensor domain applications can be linked to the provenir ontology using either “subclass” or “sub-property” relations. The use of the provenir ontology to define a provenance context has several advantages, including the flexibility to model domain-specific provenance at a fine level of granularity while ensuring consistent modeling, and support for RDF and OWL inferencing [27]. Further, applications can leverage a set of dedicated provenance query operators defined in terms of the Provenir ontology.

Though provenance context as a notion is derived from the context theory used in AI systems, it is distinct in terms of both formal semantics and implementation. These differences are listed below:

1. A provenance context is used only for generating the S, P, and O of an RDF triple; this approach of generating “provenance-aware” RDF triples is called the Provenance Context Entity (PaCE) approach. In contrast, traditional AI systems use context primarily during processing or interpreting data.

2. The PaCE approach involves a priori use of the context object during RDF triple generation; hence, it does not use the ist(c, p) construct to interpret RDF statements. In addition, unlike the context mechanism introduced in [135], the PaCE approach does not require extensive modifications to the RDF model theory.


3. The PaCE approach defines a single application-wide provenance context and, unlike traditional AI systems, does not include multiple context objects. Hence, the PaCE approach does not require the use of lifting rules to process RDF statements in different contexts [129].

The PaCE approach allows an application to decide the level of granularity in modeling the provenance of an RDF triple. For example, Figures 6-3 and 6-4 illustrate three possible implementations of the PaCE approach in the BKR project that create distinct RDF triples extracted from two separate journal articles (even though they share the same S, P, and O). The first implementation (Figure 6-3 (a)) is an exhaustive approach that explicitly links the S, P, and O to the source journal article, and the second implementation (Figure 6-3 (b)) is a minimalist approach that links only the S of an RDF triple to the source article. Though the first implementation creates three provenance-specific triples, in contrast to just one triple for the second implementation, there is no ambiguity in correctly interpreting the provenance of the triples. The second implementation, on the other hand, requires the application to make the additional assumption, while processing the RDF triples, that the whole triple is extracted from the same source as the S. Hence, the additional complexity associated with the second implementation may make it unsuitable for some applications.

Figure 6-3 Implementation of the PaCE mechanism using (a) exhaustive approach, and (b) minimalist approach
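The exhaustive and minimalist variants of Figure 6-3 differ only in how many components of a triple receive an explicit derives_from link. The following Python (rdflib) sketch contrasts the two variants; the namespaces and IRIs are illustrative placeholders rather than the BKR vocabulary, which is described in Section 6.4.

from rdflib import Graph, Namespace

BKR = Namespace("http://example.org/bkr/")        # hypothetical namespace
PROV = Namespace("http://example.org/provenir/")  # hypothetical namespace

source = BKR.PMID17209178
s, p, o = BKR.lipoprotein, BKR.affects, BKR.inflammatory_cells

def pace_exhaustive(g: Graph):
    # Four triples in total: the statement plus one derives_from link
    # for each of S, P, and O (Figure 6-3 (a)).
    g.add((s, p, o))
    for component in (s, p, o):
        g.add((component, PROV.derives_from, source))

def pace_minimalist(g: Graph):
    # Two triples in total: the statement plus a derives_from link on the
    # subject only (Figure 6-3 (b)); the application must assume that P
    # and O share the source of S.
    g.add((s, p, o))
    g.add((s, PROV.derives_from, source))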

The third implementation (Figure 6-4 (c)) takes an intermediate approach that creates two additional provenance-specific triples but requires the application to assume that the source of the O is the same as that of the S and P. Similar to the minimalist approach, this approach reduces the total number of provenance-specific RDF triples but introduces additional complexity that may make it unsuitable for


some applications. The choice to associate an explicit “derives_from” property with one particular RDF component (S, P, or O) in the minimalist (Figure 6-3 (b)) and intermediate (Figure 6-4 (c)) approaches is arbitrary and has minimal impact on the provenance tracking functionality of the application.

Figure 6-4 Implementation of the PaCE mechanism using the intermediate approach

It is important to note that, in contrast to the reification approach, none of the three variants of the PaCE approach requires the use of the RDF reification vocabulary or of blank nodes. Further, the reification approach creates a total of six RDF triples (Figure 6-1) for each RDF triple, while the exhaustive implementation of the PaCE approach creates a total of four triples for one RDF triple. This difference in the total number of RDF triples generated by the reification approach versus the PaCE approach has a significant impact on the scalability of applications incorporating provenance information in real-world settings, such as the BKR project, which includes millions of RDF triples; it is further discussed in Section 6.5 (Evaluation). Overall, the PaCE approach is an incremental and simple mechanism that does not define additional vocabulary or require changes to existing RDF data stores. In addition, the PaCE approach does not require modifications to the existing RDF or OWL inference systems used in many Semantic Web applications. We now introduce the formal specification of provenance context.

6.3 Model Theoretic Semantics of PaCE Inferencing

The primary motivating factor for defining the formal semantics of PaCE is to provide a way to determine the validity of the inferencing process for Semantic Web applications that use the PaCE approach to track


provenance. For example, the BKR project can derive a ranking of RDF statements extracted from journal articles by inferring the confidence value of an RDF statement from the precision and recall values indicated by a text mining tool. To define the formal semantics of PaCE we use model theory. Specifically, we build on the approach used to define the formal semantics for RDF and RDF Schema (RDFS) [27]. Our definition is therefore based on the notions of interpretations and models, which are structures that enable us to capture the truth values (true or false) of assertions [27]. In other words, if a particular interpretation I satisfies a specific assertion s ∈ V, then we call I a model of s and write I ╞ s (where V is a vocabulary and ╞ is the satisfaction relation). Following [27], a simple interpretation I of a vocabulary V consists of:

- a non-empty set of resources IR that constitutes the domain or universe of I,
- a set of properties of I, called IP,
- a function IEXT: IP → 2^(IR×IR) that maps each property in IP to a set of pairs of resources in IR,
- a function IS: V → IR ∪ IP that maps IRIs in V to the union of IR and IP,
- a function IL that maps typed literals from V to resources in IR, and
- a subset of IR, called the set of literal values LV, containing all untyped literals from V.

Each interpretation I then gives rise to an interpretation function ·I, which maps each triple (over IR, IP, V, and LV) to a truth value (true or false) in a canonical way (see [27]). An interpretation I of a graph R is said to be a model of R if I maps each triple in R to the truth value true; we write I ╞ R in this case.

Simple interpretations allow us to define an entailment relation between graphs: a graph R1 (simply) entails a graph R2 if every simple interpretation that is a model of R1 is also a model of R2. A simple interpretation of a vocabulary V ∪ VRDF ∪ VRDFS is called an RDFS interpretation of V if it satisfies a number of additional constraints specified in [27]. We say that a graph R1 RDFS-entails a graph R2 if every RDFS interpretation that is a model of R1 is also a model of R2.

The definition of the model-theoretic semantics of PaCE is a straightforward modification of the existing RDFS semantics and allows us to infer additional provenance information for triples by virtue of their having a common source. Let the provenance context pc of an RDF triple α (= (S, P, O)) be the common object of the predicate provenir:derives_from associated with the triple. We define an RDFS-PaCE-interpretation I of a vocabulary V to be an RDFS-interpretation of the vocabulary V ∪ VPaCE that satisfies the following additional condition (meta-rule):

For RDF triples α = (S1, P1, O1) and β = (S2, P2, O2), (provenance-determined) predicates p, and entities v:

if pc(α) = pc(β), then (S1, p, v) = (S2, p, v), (P1, p, v) = (P2, p, v), and (O1, p, v) = (O2, p, v).

Provenance-determined predicates and entities are specific to the application domain.

Furthermore, a graph R1 PaCE-entails a graph R2 if every RDFS-PaCE-interpretation that is a model of R1 is also a model of R2. To illustrate the PaCE inference process, we consider two RDF statements in the BKR project (Figure 6-5). Given that the two RDF statements have equal provenance contexts (PubMed identifier: PMID17209178), additional provenance information associated with one of the triples, such as the confidence score (formalized via the provenance-related predicate has_confidence_value and the value confidence_value_2), can be inferred for the other RDF triple (flow_cytometry→measures→interleukin-1_beta), denoted by dotted arrows in Figure 6-5.

Figure 6-5 PaCE inferencing
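The meta-rule above can be read operationally: triples with equal provenance contexts share their provenance-determined assertions, position-wise across S, P, and O. The following Python (rdflib) sketch applies one pass of this rule under the minimalist convention, in which pc is read from the subject's derives_from link; the namespace and the set of provenance-determined predicates are hypothetical placeholders.

from rdflib import Graph, Namespace

PROV = Namespace("http://example.org/provenir/")
PROVENANCE_DETERMINED = {PROV.has_confidence_value}  # application-specific

def pace_closure(g: Graph):
    # Group triples by provenance context pc, read from the subject's
    # derives_from link (the minimalist convention of Figure 6-3 (b)).
    by_pc = {}
    for t in list(g):
        if t[1] == PROV.derives_from:
            continue
        pc = g.value(t[0], PROV.derives_from)
        if pc is not None:
            by_pc.setdefault(pc, []).append(t)
    # For triples with equal pc, share each provenance-determined
    # assertion position-wise across the S, P, and O components.
    for triples in by_pc.values():
        for pos in range(3):  # 0 = S, 1 = P, 2 = O
            for pred in PROVENANCE_DETERMINED:
                vals = {v for t in triples for v in g.objects(t[pos], pred)}
                for t in triples:
                    for v in vals:
                        g.add((t[pos], pred, v))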

We note that PaCE-entailment is strictly stronger than RDFS-entailment, in the sense that all inferences that can be drawn using simple, RDF, or RDFS-entailment are also PaCE entailments. This is a deliberately conservative step on top of the existing Semantic Web recommendations that keeps PaCE compatible with existing OWL and RDF tools and applications, and also allows the PaCE semantics to be implemented using RDF reasoners as black boxes. In the next section, we describe the implementation of the BKR project using the PaCE approach.

6.4 Implementation: Using the PaCE Approach in the BKR Project

Implementing the PaCE approach in the BKR project involves two steps, namely (a) extending the provenir ontology with domain-specific concepts that serve as a reference for the definition of contextualized


instances in the BKR, and (b) generating instance-level RDF triples leveraging the provenance model for data from several sources. We first describe the extension of the provenir ontology.

6.4.1 Extending the Provenir Ontology with Domain-specific Concepts

The provenir ontology provides a domain-independent model for provenance, which needs to be extended with domain-specific concepts in order to support the creation of contextualized instances. We use the Unified Medical Language System (UMLS) as our main source of biomedical concepts [33]. More specifically, high-level categories (semantic types) from the UMLS Semantic Network can be integrated as subclasses (rdfs:subClassOf) of the provenir classes provenir:data or provenir:event. In turn, the two million concepts from the larger UMLS Metathesaurus are integrated as subclasses of the semantic types, using the categorization link provided by the UMLS. The provenir ontology is also extended with some 27,000 genes from Entrez Gene for better coverage of genomic entities. Instances in the BKR are defined in reference to these domain-specific concepts. Analogously, instance-level predicates in the BKR are defined as sub-properties (rdfs:subPropertyOf) of the 53 named relationships provided by the UMLS Semantic Network; a mapping between instance-level and Semantic Network predicates was created manually. In addition to the links (rdf:type) between instances in the BKR and the corresponding classes, the definition of the BKR provenance context is enabled through the provenir:derives_from relationship linking a triple to its source. The provenir:derives_from property is adapted from the derives_from property defined in the upper-level Relation Ontology and is used to model the “derivation history of data entities as a chain or pathway” [83].
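A minimal rdflib sketch of the extension pattern described above follows; all IRIs are hypothetical placeholders for the UMLS, Entrez Gene, and BKR vocabularies.

from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

PROV = Namespace("http://example.org/provenir/")  # hypothetical IRIs
SN = Namespace("http://example.org/umls-sn/")
META = Namespace("http://example.org/umls-meta/")
BKR = Namespace("http://example.org/bkr/")

g = Graph()
# A UMLS semantic type becomes a subclass of a provenir upper-level class.
g.add((SN.Lipid, RDFS.subClassOf, PROV.data))
# A Metathesaurus concept becomes a subclass of its semantic type,
# following the categorization link provided by the UMLS.
g.add((META.lipoprotein, RDFS.subClassOf, SN.Lipid))
# An instance-level predicate becomes a sub-property of one of the 53
# named relationships of the UMLS Semantic Network.
g.add((BKR.affects, RDFS.subPropertyOf, SN.affects))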

6.4.2 Generating RDF Triples using the PaCE Approach

Contextualized RDF triples in the BKR represent knowledge extracted from the biomedical literature, as well as relations from the UMLS Metathesaurus. RDF triple entities (S, P, O) are identified using a unique identifier called the Uniform Resource Identifier (URI). A practical challenge for implementing the PaCE approach in the BKR is to formulate an appropriate provenance context-based URI (URIp) scheme that also conforms to best practices for creating URIs for the Semantic Web, including support for use of the HTTP protocol [136].

The design principle of URIp is to incorporate a “provenance context string” as the identifying reference of an entity; it is a variation of the “reference by description” approach that uses a set of descriptions to identify an entity [136]. The syntax for URIp consists of the base namespace, the provenance context string, and the entity name. For example, the URIp for the entity lipoprotein is


http://mor.nlm.nih.gov/bkr/PUBMED_17209178/lipoprotein

where the PUBMED_17209178 provenance context string identifies the source of a specific instance of lipoprotein. This approach to creating URIs for RDF entities also enables BKR (and other Semantic Web applications using the PaCE approach) to group together entities with the same provenance context. For example,

http://mor.nlm.nih.gov/bkr/PUBMED_17209178/lipoprotein
http://mor.nlm.nih.gov/bkr/PUBMED_17209178/affects
http://mor.nlm.nih.gov/bkr/PUBMED_17209178/inflammatory_cells

are entities extracted from the same journal article. Using this URI scheme, RDF statements were generated for the original triples (extracted from the biomedical literature by a text-mining application or found in the UMLS Metathesaurus). In the next section, we evaluate the implementation of the PaCE approach to track provenance in the BKR project as compared to the RDF reification approach.
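The URIp scheme itself reduces to simple string composition, as the following Python sketch illustrates; the helper names are illustrative and not part of the BKR implementation, while the base namespace is the one shown above.

BASE = "http://mor.nlm.nih.gov/bkr/"

def mint_urip(context_string: str, entity_name: str) -> str:
    # e.g. mint_urip("PUBMED_17209178", "lipoprotein") returns
    # "http://mor.nlm.nih.gov/bkr/PUBMED_17209178/lipoprotein"
    return f"{BASE}{context_string}/{entity_name}"

def same_provenance_context(uri_a: str, uri_b: str) -> bool:
    # Entities minted with the same context string share a URI prefix.
    return uri_a.rsplit("/", 1)[0] == uri_b.rsplit("/", 1)[0]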

6.5 Evaluation

The objective of our experiment is to evaluate the advantages of using the PaCE approach in place of the RDF reification approach to track provenance in the BKR project. Three specific aspects are investigated:

1. Measure the burden of representing provenance information, in the number of triples required, compared to a “base dataset” (B) with no provenance information;
2. Analyze the performance of four BKR provenance queries; and
3. Demonstrate the use of provenance information to support analytical queries in the BKR project and measure the associated cost in performance.

The base dataset (B) comprises 23,433,657 RDF triples extracted from two sources: the biomedical literature (PubMed) and the UMLS Metathesaurus. The open source Virtuoso RDF store version 06.00.3123 was used for the experiments, running on a Dell 2950 server (dual Xeon processors) with 8GB of memory. A total of 500,000 9kB buffers were allocated to the Virtuoso RDF store.

6.5.1 Number of Provenance-specific RDF Triples Generated

To evaluate the number of provenance-specific RDF triples generated using the two approaches, we augment the base dataset B with provenance information representing the source of each triple. For the PaCE approach, we create three datasets representing the exhaustive (E_PaCE), minimalist (M_PaCE), and intermediate (I_PaCE) approaches illustrated in Figures 6-3 (a), 6-3 (b), and 6-4 (c), respectively. For the RDF reification dataset (R), we use the standard method (presented in Section 6.1).

Figure 6-6 shows that the reification approach requires twice as many RDF triples (~152 million) for the representation of provenance information compared to the E_PaCE approach (~89 million). This 49% difference between E_PaCE and R represents a significant reduction in storage requirements (~85 million fewer triples) for the BKR project and, more generally, clearly demonstrates the practical benefits of using the PaCE approach over reification to track provenance in Semantic Web applications. Analogously, the M_PaCE and I_PaCE approaches create 72% and 59% fewer provenance-specific triples, respectively, compared to the reification approach.

Figure 6-6 The relative number of provenance-specific triples created using PaCE and RDF reification

6.5.2 Performance of Provenance Queries

We use four representative categories of provenance queries in the BKR project to evaluate query performance for the four datasets (E_PaCE, M_PaCE, I_PaCE, and Reification). We describe the pattern of the four queries and their significance in the BKR project:

Query Pattern 1: List all the RDF triples extracted from a given journal article (e.g., the journal article identified by PMID17209178). This query is used to retrieve all the triples from a given source.

Query Pattern 2: List all the journal articles from which a given RDF triple was extracted (e.g., lipoprotein→affects→inflammatory_cells). This query identifies the source(s) of a given triple.

Query Pattern 3: Count the number of triples in each source (biomedical literature and UMLS Metathesaurus) for the therapeutic use (predicate = treats) of a given drug (e.g., Thalidomide). This complex query illustrates the use of the BKR as a knowledge base for a query answering application (e.g., which diseases are treated by a particular drug?).


Query Pattern 4: Count the number of journal articles published between two dates (e.g., 2000-01-01 and 2000-12-31) for a given triple (e.g., thalidomide→treats→multiple_myeloma). This typical information retrieval query leverages the provenance information associated with each triple. A more complex version of this query is used in Section 6.5.3 for time series analysis.

We conducted the query performance evaluation in two phases. In the first phase, the four queries are evaluated for fixed values, namely the values underlined in the query descriptions above. In the second phase, the queries are evaluated using a larger set of values. The queries are expressed in SPARQL, the RDF query language [69], and primarily utilize SPARQL basic graph patterns (BGP) with FILTER conditions. The queries are not listed here due to space constraints and are available online along with the result set.5 The numbers reported for the “fixed” value queries (first phase) are the average of the last 5 of a total of 20 runs. The first phase of the evaluation starts with a “cold” cache for each query pattern.
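The evaluated queries are linked in footnote 5; as an illustration only, the following SPARQL sketches (shown as Python string constants) approximate Query Patterns 1 and 2 over a minimalist PaCE dataset. The provenir and source IRIs are placeholders, and matching a triple across provenance contexts by the local names of its components is an assumption of this sketch, not necessarily the evaluated formulation.

# Query Pattern 1: all triples extracted from a given journal article.
PATTERN_1 = """
PREFIX prov: <http://example.org/provenir/>
SELECT ?s ?p ?o
WHERE {
  ?s prov:derives_from <http://example.org/bkr/PMID17209178> .
  ?s ?p ?o .
  FILTER (?p != prov:derives_from)
}
"""

# Query Pattern 2: all articles from which a given triple was extracted,
# matching PaCE-minted URIs by their local names.
PATTERN_2 = """
PREFIX prov: <http://example.org/provenir/>
SELECT DISTINCT ?source
WHERE {
  ?s ?p ?o .
  ?s prov:derives_from ?source .
  FILTER (regex(str(?s), "lipoprotein$") &&
          regex(str(?p), "affects$") &&
          regex(str(?o), "inflammatory_cells$"))
}
"""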

Figure 6-7 Query performance for fixed values

The results in Figure 6-7 demonstrate that query performance for PaCE is generally better than or similar to reification. As expected (Figure 6-7), M_PaCE generally performs better than I_PaCE, and

5 Query and result set available at: http://wiki.knoesis.org/index.php/ProvenanceContextEntity


I_PaCE better than E_PaCE. However, reification performs better than I_PaCE for Query 1 and better than both I_PaCE and E_PaCE for Query 3. Query 4 is a complex query that uses the SPARQL FILTER to restrict publication dates to a particular range (January 1 to December 31, 2000). For this query, the performance of E_PaCE is more than two orders of magnitude better than that of R. In the second phase of the evaluation, we aim to reflect the real-world requirements of the BKR project. Toward this end, each of the four query patterns is executed with different values, as if by different users. In practice, we use sets of 100 values for each query pattern. The resulting set of 100 queries is run 5 times (immediately following the first phase of evaluation for each dataset) and the average over the 100 queries for the last run is presented (Figure 6-8). The results confirm the trend seen in the first phase of the evaluation, with the added observation that for Query Pattern 3 the difference between E_PaCE and R has decreased (R no longer significantly outperforms E_PaCE). In contrast, for the complex Query Pattern 4, the query performance of E_PaCE has further improved and is more than three orders of magnitude better than that of R. The second phase of the evaluation also confirms that in a real-world scenario the query performance of PaCE is comparable to reification for simple provenance queries and significantly better for complex provenance queries.

Figure 6-8 Query performance for query patterns using a set of 100 values


In the next section, we evaluate the query performance for an analytical query in the BKR project that uses provenance information to identify the publication pattern of journal articles on a specific topic of interest.

6.5.3 Performance of an Analytical Provenance Query

An important objective for many applications and funding agencies is to understand the trend in research focused on a specific topic in biomedicine over a period of time. We extend Query Pattern 4, discussed in the previous section, to define a query that collates the number of journal articles published over a period of 10 years (i.e., the span of the current BKR). As an example, we focus on mentions of the therapeutic use of the drug Thalidomide over time. This translates to a complex SPARQL query that uses aggregate functions to count the number of publications per year. Figure 6-9 (a) shows a histogram created directly from the query results. The query performance is similar to what was observed for Query Pattern 4; that is, E_PaCE is three orders of magnitude faster than R (Figure 6-9 (b)). This example demonstrates the feasibility of using RDF and SPARQL for representing and exploiting provenance information in large triple stores serving real-world applications.
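A hedged sketch of such an aggregation query follows, written in SPARQL 1.1 syntax, which may differ from the original Virtuoso formulation; all IRIs and property names are illustrative placeholders.

# Count, per publication year, the articles from which the triple
# thalidomide -> treats -> multiple_myeloma was extracted.
TIME_PROFILE = """
PREFIX prov: <http://example.org/provenir/>
PREFIX bkr:  <http://example.org/bkr/>
SELECT ?year (COUNT(DISTINCT ?article) AS ?publications)
WHERE {
  ?s ?p ?o .
  ?s prov:derives_from ?article .
  ?article bkr:date_of_publication ?date .
  FILTER (regex(str(?s), "thalidomide$") && regex(str(?p), "treats$"))
  BIND (SUBSTR(str(?date), 1, 4) AS ?year)
}
GROUP BY ?year
ORDER BY ?year
"""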


Figure 6-9 (a) Result of the time-profiling provenance query for the therapeutic use of the drug Thalidomide, and (b) performance of the query using the RDF reification and PaCE mechanisms

6.6 Conclusions

We show that the challenge of provenance tracking in RDF datasets can be effectively and efficiently addressed by using the PaCE approach in place of the RDF reification vocabulary. The PaCE approach addresses many of the issues associated with RDF reification, including the lack of formal semantics, the use of blank nodes, and the application-dependent interpretation of RDF documents. The PaCE approach uses formal objects called provenance contexts, defined in terms of the provenir upper-level provenance ontology, to create provenance-aware RDF triple entities of S, P, and O. The model-theoretic semantics of PaCE is defined through a simple extension of the existing RDFS formal semantics. We implemented the PaCE approach in the BKR project. The evaluations demonstrate that using the PaCE


approach to create provenance-specific RDF triples not only reduces the number of triples by at least 49% but also improves the performance of complex provenance queries by three orders of magnitude. We plan to extend BKR with data from additional sources and use the PaCE approach for provenance tracking.



Chapter 7 Application: A Framework for Provenance Management in Translational Research

7.1 Introduction

The NIH Translational Research roadmap seeks to strengthen the research infrastructure for accelerating the delivery of “bench-side” discoveries to the patient's “bedside”. The key notion of translational research is the flow of information resources (experiment data, publications/literature, clinical trial data, or patient records) across organizations, domains, and projects that impacts both patient care and (through a feedback process) basic research. This necessitates keeping track of provenance metadata describing the complete life cycle of information resources, from the point of their creation, through intermediate processing, to their end use. For example, provenance information describing the experiment design, including details about the biological or technical replication (RNA extracts or cDNA clones) in microarray experiments [137], the type of parasite used to create an avirulent (non-virulent) strain in a gene knockout experiment [105], or the demographic information of subjects in a clinical trial [138], is essential for drawing valid conclusions from the relevant information resources. Similarly, provenance information about the experiment platform (type of instruments used, instrument settings) and the tools used to process or analyze data (algorithms, statistical software) is necessary for ensuring data quality and thereby associating confidence or trust with results in a way that accounts for data lineage.

Traditionally, provenance has been collected manually or using custom-built software tools, which have several drawbacks, including the difficulty of ensuring that an adequate amount of provenance is collected and of supporting provenance interoperability across projects. Further, the use of high-throughput data generation technologies in basic research, such as high-throughput sequencing, microarrays, mass spectrometry (MS), and nuclear magnetic resonance (NMR), along with large datasets in clinical research, such as electronic health care records (EHR) and clinical trials data, is introducing additional challenges for traditional approaches to provenance management, including scalability and support for knowledge discovery tasks using automated reasoning. Broadly, we can classify the challenges of provenance management in translational research along four dimensions, namely:

(a) Collecting provenance information in high-throughput environments that is also adequate to support complex queries,


(b) Representing provenance information using a model that supports interoperability across projects, is expressive enough to capture the complexities of a specific domain (domain semantics), and allows the use of reasoning for provenance analysis,

(c) Providing a dedicated query infrastructure that allows composition of provenance queries with minimal user effort and addresses the requirements specific to provenance queries (support for transitive closure, neighborhood retrieval, and computing reachability), and

(d) Storing provenance using a highly scalable implementation to support complex user queries over large volumes of data.

In this chapter, we describe a provenance management framework that addresses these four aspects in an integrated manner and can be adapted for use in a wide variety of application domains in translational research. The framework and its implementation are discussed in the context of an exemplar biomedical research project on the human parasite Trypanosoma cruzi (T.cruzi), which we describe in the next section.

7.2 The Semantic Problem Solving Environment for T.cruzi Project

T.cruzi is the principal causative agent of human Chagas disease and affects approximately 18 million people, predominantly in Latin America. About 40 percent of affected persons are predicted to eventually suffer from Chagas disease, which is the leading cause of heart disease and sudden death in middle-aged adults in the region. Research on T.cruzi has reached a critical juncture with the publication of its genome in 2005 [139] and can potentially improve human health significantly. But, mirroring the challenges in other translational research projects, current efforts to identify vaccine candidates in T.cruzi and to develop diagnostic techniques for identifying the best antigens depend on the analysis of vast amounts of information from diverse sources. The Semantic Problem Solving Environment (SPSE) for T.cruzi project is a collaborative effort between the Ohio Center of Excellence for Knowledge-enabled Computing (Kno.e.sis) at Wright State University, the Center for Tropical and Emerging Global Diseases at the University of Georgia, and the National Center for Biomedical Ontology (NCBO) at Stanford University to address this challenge through the creation of an integrated environment that allows dynamic ontology-driven integration of multi-modal local and public data to answer biological queries at multiple levels of granularity [105].6 In contrast to traditional information integration approaches, the T.cruzi SPSE project systematically incorporates detailed provenance of the information resources that are added to its parasite knowledge repository (PKR). The provenance information is leveraged to achieve multiple objectives, which are discussed in the next section.

6 The T.cruzi SPSE project is funded by the National Heart, Lung, and Blood Institute of the National Institutes of Health (NIH).


7.2.1 Provenance Requirements in the T.cruzi SPSE Project

The T.cruzi SPSE PKR integrates information resources represented in heterogeneous formats and from multiple sources, such as experiment data from reverse genetics protocols used to create avirulent strains of T.cruzi, data stored in structured databases (for example, the whole-organism proteomics data for the four life cycle stages of T.cruzi [139]), and plain text comments by researchers regarding an ongoing or newly started experiment run or project. The provenance information associated with these sources includes the version information of a database, the name of the researcher and the date on which comments were created, and the conditions under which an experiment dataset was produced. Among these three examples, the provenance information associated with the experiment datasets is the most complex to manage.

An important approach to the study of T.cruzi infection is the use of reverse genetics to create avirulent (non-virulent) strains of the T.cruzi parasite in the laboratory. The creation of such parasite strains requires the identification of genes that control a core biochemical function. These genes can be deleted from the genome of the parasite (gene “knock-out”) in order to ablate the biochemical function, possibly resulting in an avirulent strain. The two experiment processes used in the creation of T.cruzi avirulent strains are (a) the Gene Knockout (GKO) Protocol and (b) the Strain Project (SP) Protocol. Given a list of genes for the creation of potential avirulent T.cruzi strains, each gene forms an input to the GKO experiment protocol. To totally ablate (or at a minimum reduce) the function of a gene, each of its alleles is a target of knock-out (Figure 7-1). The output of the GKO experiment protocol is a “knockout construct plasmid”, which is created using the appropriate sequences of the target gene and a chosen antibiotic resistance gene. This plasmid is used in the SP experiment protocol to create a new strain of T.cruzi (Figure 7-1).

Figure 7-1 Alleles of target genes are used to create knockout plasmids

The SP Protocol is composed of three sub-processes (described in Figure 7-2), namely Transfection, Drug Selection, and Cloning. Briefly, during transfection the knockout construct plasmid replaces


the target gene in the T.cruzi genome with a selected antibiotic resistance gene, resulting in a “Transfected Sample”. The expression of the antibiotic resistance gene allows parasites that were successfully transfected to survive drug treatment (Selection) with an antibiotic such as Neomycin. Researchers treat the Transfected Sample with the antibiotic for the period of time that kills all parasites in a non-transfected sample. Individual parasites within the resulting “Selected Sample” are then cloned to create “Cloned Samples”, which are then used to infect model animals such as mice to assess strain phenotype and attenuation.

Figure 7-2 Schematic representation of the GKO and SP experiment protocols

7.2.2 Querying Provenance Information of Experiment Data

The provenance information collected during GKO and SP experiments is used by multiple users with different requirements:

1) Technicians performing the lab-related work,
2) Project managers or principal investigators who want to track progress and/or view the strains successfully created,
3) New researchers, such as visiting faculty or post-docs, who want to learn the lab-specific methods, and


4) Researchers in the parasite research community who can infer the phenotype of related organisms that they study from the work done on T.cruzi.

The current informatics infrastructure in the Tarleton research group is built using multiple relational databases that are accessed via custom web pages to store, track, and view data for a project. We describe how an example set of provenance queries is executed using the existing infrastructure.

Query 1: List all groups using the “target_region_plasmid_Tc00.1047053504033.170_1” target region plasmid. Current Approach: This query cannot be executed by a user with the current infrastructure in the Tarleton research group. The informatics specialist has to search for data from different databases and then create a set of customized SQL queries and scripts to answer the query.

Query 2: Find the name of the researcher who created the knockout plasmid “plasmid66”. Current Approach: Answering this query requires access to three tables from two database schemas and the use of PHP-based web tools. A custom query builder tool is used to search for plasmids with the identifier “plasmid66”. Next, in a three-step process, including searching for the “Strain” record associated with the given plasmid identifier and gene details, the name of the researcher is located.

Query 3: “cloned_sample66” is not episomal. How many transfection attempts are associated with this sample? Current Approach: Using a custom query builder tool, a SQL query joining two tables is generated to list the strains with cloned samples with confirmed episomal insertion of the plasmid. From this list the user can obtain the number of transfection attempts made to create the strain.

Query 4: Which gene was used to create the cloned sample “cloned_sample66”? Current Approach: To answer this query, the researcher again uses a custom query builder to select the “KO Gene Name” in a tabular result view and search for “cloned_sample66”. The custom query builder performs the joins necessary to combine data from three tables and creates the SQL query automatically. However, changes to the underlying database require modification of the PHP code by the informatics specialist.

These example queries demonstrate the limitations of the current infrastructure, which either cannot answer a query (Query 1) or requires the user to follow a multi-step process to retrieve the result. These limitations, especially the manual effort required, become significant in a high-throughput experiment environment with multiple concurrent projects and the need to integrate provenance information across projects to guide future research. In the next section, we describe the ontology-driven provenance management infrastructure that has been created to address these challenges.


Figure 7-3 The architecture of the T.cruzi provenance system addressing four aspects of provenance management

7.3 T.cruzi Provenance Management System

The T.cruzi Provenance Management System (PMS) infrastructure addresses four aspects of provenance management in the T.cruzi SPSE project (Figure 7-3):

1. Provenance Capture – The provenance information associated with the SP and GKO experiment protocols is captured via web pages used by researchers during an experiment. This data is transformed into RDF instance data corresponding to the PE ontology schema.

2. Provenance Representation – The Parasite Experiment (PE) ontology is used to model the provenance information of the GKO and SP experiment protocols. The integrated provenance


information from both these experiment protocols is represented as a “ground” RDF graph, that is, one without any blank nodes [27].

3. Provenance Storage – The provenance information is stored in an Oracle 10g (release 10.2.0.3.0) RDF database management system (DBMS). Oracle 10g was chosen due to its widespread use in biomedical informatics applications [140] [141] and because it satisfies the requirements of the T.cruzi PMS: it supports the full SPARQL query specification [69] and the use of RDFS [27] as well as user-defined reasoning rules, and it is a proven platform for large-scale RDF storage [114]. We note that the T.cruzi PMS can be implemented over any RDF DBMS that supports these requirements.

4. Provenance Query Analysis – In addition to storing provenance information, the T.cruzi PMS supports querying of provenance information using a set of specialized provenance query operators. The query operators are implemented in a query engine deployed over the Oracle RDF database.

7.3.1 The Parasite Experiment Ontology

In addition to the description in Chapter 3 of the creation of the Parasite Experiment Ontology (PEO) by extending the Provenir ontology, we briefly describe the interoperability of the PEO with existing biomedical ontologies. The NCBO currently lists 137 publicly available biomedical ontologies [11], and it is important that new ontologies, such as the PE ontology, are interoperable with these existing ontologies. Hence, the PE ontology re-uses relevant classes and relationships from four public ontologies, namely the Sequence Ontology (SO) [142], the National Cancer Institute (NCI) Thesaurus [143], the Parasite Life Cycle ontology (PL) [144], and the W3C OWL Time ontology [145]. The SO models biological sequences and is a joint effort by genome annotation centers and users of sequence annotation data [142]. The PE ontology re-uses multiple SO classes, including so:plasmid and so:genome along with its subclasses such as so:chromosome, so:gene, and so:flanking_region. Similarly, NCI:gene_function, PL:organism_strain, and Time:DateTimeDescription are some of the other classes re-used in the PE ontology from the NCI Thesaurus, PL, and OWL Time ontologies, respectively. In addition, the PE ontology re-uses the object property PL:has_base_strain from the PL ontology. Therefore, the PE ontology not only allows interoperability with domain-specific provenance ontologies by extending the Provenir ontology, but also ensures interoperability with existing biomedical ontologies listed at the NCBO.

7.4 Query Infrastructure of the T.cruzi PMS

In contrast to the existing informatics infrastructure in the Tarleton research group, the T.cruzi PMS uses the provenance query operators (implemented in a query engine) to execute provenance queries. Given an input value, the query operators compose the corresponding SPARQL query pattern. Figure 7-4 describes


the use of the provenance() query operator to answer the example provenance queries introduced in Section 7.2.2.

Figure 7-4 Use of the provenance() query operator to answer the example provenance queries
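As an illustration of the composition step, the sketch below shows how a provenance() operator might expand an input entity into a SPARQL 1.1 property-path query that walks the provenance chain; the property IRIs, the function name, and the example entity URI are assumptions, and the actual operator is defined over the Provenir ontology as described in earlier chapters.

# Compose a SPARQL query that retrieves the transitive provenance of an
# input entity. Property paths ("+" for transitive closure) are SPARQL 1.1
# syntax; both property IRIs are illustrative placeholders.
def provenance_query(entity_uri: str) -> str:
    return f"""
PREFIX prov: <http://example.org/provenir/>
SELECT DISTINCT ?ancestor ?p ?v
WHERE {{
  <{entity_uri}> (prov:derives_from|prov:preceded_by)+ ?ancestor .
  ?ancestor ?p ?v .
}}
"""

# e.g. provenance_query("http://example.org/tcruzi/cloned_sample66")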

Provenance queries are path computations over RDF graphs and are expensive operations that require the computation of fixed paths, recursive pattern-based paths, and neighborhood retrieval. As discussed in [146], a straightforward implementation does not scale to large datasets for complex provenance queries; hence, a new class of materialized views called materialized provenance views (MPV) has been defined in the PMF [146]. Conceptually, an MPV corresponds to a single logical unit of provenance in a given domain, for example one complete experiment cycle in the T.cruzi domain. A logical unit of provenance information is identified using the domain-specific ontology used by an application. The MPV in the T.cruzi PMS is defined using the PE ontology as a set of processes starting with the sequence_extraction class and terminating with the cell_cloning class (Figure 7-5). An important advantage of defining an MPV using the PE ontology is the ability of a single MPV to satisfy all queries for data entities created or used in a single experiment cycle. The query engine uses a B-tree index to identify query inputs that can be satisfied by an MPV instead of being executed against the underlying database. The use of MPVs results in a significant gain in query response time with increasing data size and complexity of provenance query expression patterns. The next section discusses the evaluation results for the provenance queries.
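The routing step can be pictured as follows: a sorted index over the entities covered by each materialized view decides whether a query input can be answered from an MPV or must fall through to the database. The Python sketch below uses the bisect module in place of the B-tree index; the class and method names are illustrative, not the PMS implementation.

import bisect

class MPVIndex:
    def __init__(self):
        self._keys = []    # sorted entity URIs covered by some MPV
        self._views = []   # the MPV (an RDF subgraph) for each key

    def register(self, mpv, covered_entities):
        # Index every entity created or used in the experiment cycle
        # that the MPV materializes.
        for uri in sorted(covered_entities):
            pos = bisect.bisect_left(self._keys, uri)
            self._keys.insert(pos, uri)
            self._views.insert(pos, mpv)

    def lookup(self, uri):
        pos = bisect.bisect_left(self._keys, uri)
        if pos < len(self._keys) and self._keys[pos] == uri:
            return self._views[pos]  # answer from the materialized view
        return None                  # fall through to the database

def answer(index, store, entity_uri, sparql):
    # Both the MPV and the store are assumed to expose a query() method,
    # as rdflib Graphs do.
    target = index.lookup(entity_uri)
    return (target if target is not None else store).query(sparql)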


Figure 7-5 The result of Query 4 corresponds to an MPV in the T.cruzi provenance system

7.5 Evaluation

The objective of our evaluation of the T.cruzi PMS is three-fold:

1. Verify that the example provenance queries (Section 7.2.2) can be answered correctly in the T.cruzi PMS,
2. Evaluate the scalability of the T.cruzi PMS with increasing size of RDF data, and
3. Evaluate the ability of the T.cruzi PMS to answer increasingly complex provenance queries.

7.5.1 Experiment Setup, Queries, and Dataset

The experiments were conducted using the Oracle 10g (Release 10.2.0.3.0) DBMS on a Sun Fire V490 server running 64-bit Solaris 9 with four 1.8 GHz UltraSPARC IV processors and 8GB of main memory. The database used an 8 KB block size and was configured with a 512 MB buffer cache.


The datasets (Table 7.1) correspond to a number of experiment cycles and were generated in the Tarleton research group. The standard RDFS entailment rules and two user-defined rules were used to create new inferred triples (Table 7.1). The first user-defined rule asserts that “if the input value of a process (p1) is the same as the output value of another process (p2), then p1 is linked to p2 by the property provenir:preceded_by”. The second user-defined rule asserts that “if a process (p1) is part of another process (p2) and pa1 is a parameter for p2, then pa1 is also a parameter for process p1”.
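As an illustration, the first rule can be expressed as a simple forward-chaining pass over the RDF graph. The Python (rdflib) sketch below uses hypothetical provenir:has_input and provenir:has_output properties standing in for whatever properties the PE ontology actually uses to record process inputs and outputs; the deployed system expresses such rules in Oracle's rule framework.

from rdflib import Graph, Namespace

PROV = Namespace("http://example.org/provenir/")

def apply_preceded_by_rule(g: Graph):
    # Index processes by their output values.
    producers = {}
    for p2, value in g.subject_objects(PROV.has_output):
        producers.setdefault(value, set()).add(p2)
    # Any process consuming such a value is preceded_by its producer.
    for p1, value in list(g.subject_objects(PROV.has_input)):
        for p2 in producers.get(value, ()):
            if p1 != p2:
                g.add((p1, PROV.preceded_by, p2))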

Table 7.1 The four RDF datasets used to evaluate the scalability of the T.cruzi provenance system

Dataset ID   Number of RDF Triples   Total Number of RDF Triples (including inferred triples)
DS 1         2,673                   3,553
DS 2         3,470                   4,490
DS 3         4,988                   6,288
DS 4         47,133                  60,912

The SPARQL query patterns corresponding to the example provenance queries represent varying levels of query pattern complexity in terms of the total number of variables, the total number of triples, and the use of the SPARQL OPTIONAL function [117]. This complexity is also called “expression complexity” [116], and Table 7.2 lists the expression complexity of the example queries expressed in SPARQL syntax.

Table 7.2 Details of SPARQL queries with increasing expression complexity

Query ID                         Number of Variables   Total Number of Triples   Nesting Levels using OPTIONAL
Query 1: Target plasmid          25                    84                        4
Query 2: Plasmid_66              38                    110                       5
Query 3: Transfection attempts   67                    190                       7
Query 4: cloned_sample66         67                    190                       7

7.5.2 Experiment 1

This experiment involved the verification, by the informatics specialist in the Tarleton research group, of the results for the four queries executed using the T.cruzi PMS. The results of the four queries are:

1) “Group1” used the “target_region_plasmid_Tc00.1047053504033.170_1” target region plasmid to create cloned samples,


2) The researcher with user ID “1” created the knockout plasmid “plasmid66”,
3) “cloned_sample66”, which is not episomal, involved 1 transfection attempt, and
4) The gene with identifier “Tc00.1047053506727.100” was used to create the cloned sample “cloned_sample66”.

7.5.3 Experiment 2

The four queries, Q1 to Q4 (Table 7.2), were executed against a fixed RDF dataset, DS4 (Table 7.1), to evaluate the performance of the query engine for provenance queries of increasing complexity. Figure 7-6(a) shows that the response time increases with increasing complexity of the provenance queries. Similarly, to evaluate the performance of the query engine with increasing size of data, query Q4 (Table 7.2) was executed against the four RDF datasets, DS1 to DS4 (Table 7.1). Figure 7-6(b) shows that the response time of the query engine increases with increasing size of the RDF data.

Figure 7-6 The response time of the provenance query engine with (a) increasing complexity of queries and (b) increasing size of the RDF dataset

The two sets of results demonstrate the need for effective optimization techniques to enable practical use of the query engine in the T.cruzi PMS. The next experiment discusses the results of using the MPV for query optimization.

7.5.4 Experiment 3

Using the results of Experiment 2 as a baseline, this experiment demonstrates the significant improvement in the response time of the provenance query engine using the MPV. Figure 7-7(a) shows the comparative results for provenance queries of increasing complexity executed against the underlying database and the MPV. As in Experiment 2, Figure 7-7(a) describes the result of executing the four queries (Table 7.2) using the fixed dataset DS4 (Table 7.1). The MPV used in Figure 7-7(a) corresponds to the provenance result for “cloned_sample66”, consisting of 139 RDF triples and occupying 27 KB of space. We


note that this single MPV is used to answer all four queries, Q1 to Q4 (Table 7.2). Figure 7-7(b) shows the benefit of using the MPV for query Q4 (Table 7.2) over the RDF datasets of increasing size, DS1 to DS4 (Table 7.1).

Figure 7-7 Comparative results over the underlying database and the MPV for (a) provenance queries of increasing complexity and (b) RDF datasets of increasing size

The results demonstrate that the use of the MPV leads to a significant improvement in response time, both for increasing complexity of provenance queries and for increasing size of the RDF dataset.

7.6 Related Work

Provenance has been studied in both the eScience [38] and database [58] communities. In the eScience community, provenance management has been addressed primarily in the context of workflow engines [95] [88], but recent work has argued for the use of domain semantics in eScience provenance [5]. Simmhan et al. [38] survey the provenance management issues in eScience. The database community has also addressed the issue of provenance and defined various types of provenance, for example “why provenance” [36] and “where provenance” [36]; database provenance is also described as fine-grained provenance [58]. A detailed comparison of the PMF (which underpins the T.cruzi PMS) with both workflow and database provenance is presented in [146]. The Semantic Provenance Capture in Data Ingest Systems (SPCDIS) project [147] is an example of an eScience project with a dedicated infrastructure for provenance management. In contrast to the T.cruzi PMS, the SPCDIS project uses the Proof Markup Language (PML) [148] to capture provenance information; the Inference Web toolkit [148] features a set of tools to generate, register, and search proofs encoded in PML. Both the T.cruzi PMS and SPCDIS have common objectives but use different approaches to achieve them; specifically, the T.cruzi PMS uses an ontology-driven approach with a robust query infrastructure for provenance management.


7.7 Conclusions

This chapter introduced an in-use, ontology-driven provenance management infrastructure for parasite research called the T.cruzi PMS. The following conclusions are drawn from the experience described in this chapter:

1. Domain-specific provenance ontologies, such as the PE ontology, are the key component of an eScience provenance management infrastructure. Further, by extending the Provenir ontology, the PE ontology can interoperate with other domain-specific provenance ontologies to facilitate sharing and integration of provenance information from different domains and projects.

2. The provenance query operators effectively support provenance queries and provide users with a well-defined and robust mechanism to execute complex provenance queries.

3. The T.cruzi PMS, using MPV-based query optimization, is an infrastructure that scales with both increasing data size and increasing complexity of provenance queries.

In the future, we plan to integrate other experiment protocols used in the Tarleton research group, such as proteome analysis and sample characterization, into the T.cruzi PMS.


Chapter 8 Conclusions and Future Work

The eScience paradigm is creating a deluge of scientific data, and provenance metadata is gaining critical importance in managing and analyzing these datasets and associating trust with them. The number of research projects, special interest groups (for example, the W3C Provenance Incubator Group), and journal special issues demonstrates the burgeoning interest and work in provenance-related research. Traditionally, provenance has been studied in computer science in the context of relational databases, but the rapid adoption of scientific workflows to automate scientific processes has resulted in work on workflow provenance. In this dissertation, we identified the shortcomings of database and workflow provenance in addressing the unique set of challenges presented by eScience, which requires the use of "domain semantics" in the management of provenance information. To address this challenge we introduced "Semantic Provenance," which not only incorporates domain semantics in provenance models but also, through its well-defined formal semantics, enables software applications to interpret large volumes of scientific data correctly and consistently. We identified three aspects of Semantic Provenance that need to be defined to facilitate the use of the Semantic Provenance framework in real-world eScience applications, namely (1) an expressive, formal, and extensible model of provenance that can be used as a reference model in different scientific domains, thereby facilitating provenance interoperability, (2) a mechanism that supports complex user-defined provenance queries without requiring users to learn a specific query language, and (3) a practical implementation that can scale with increasing size of scientific datasets and complexity of provenance query patterns.

Provenance Representation

To meet the requirements of provenance management in proteomics data analysis using mass spectrometry, we created one of the first provenance ontologies, called ProPreO. Using the W3C OWL-DL ontology language, the ProPreO ontology modeled the comprehensive provenance information associated with proteomics data analysis, including the experiment datasets, the analytical instruments, and the experimental conditions in which the datasets were generated. The ProPreO ontology, with 490 classes, 35 named relationships, and 3.1 million instance values, was released for public use through the NCBO. We realized that extending the ProPreO ontology to model provenance information in other scientific domains was not practical; hence, as presented in Chapter 3, we defined a modular approach for creating interoperable domain-specific provenance ontologies using an upper-level provenance ontology as reference. The Provenir upper-level provenance ontology was defined using primitive ontological concepts from the abstract Basic Formal Ontology and models a minimal set of provenance terms that are common across multiple scientific domains. The Provenir ontology, like other upper-level ontologies, enables consistent design principles and facilitates domain-ontology interoperability and integration in provenance representation. We validated the flexibility and extensibility of the Provenir ontology by creating three domain-specific provenance ontologies that extend it: the Parasite Experiment ontology (biomedicine), the Janus ontology (scientific workflows), and the Trident ontology (oceanography).
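To illustrate this modular design, the sketch below shows how a domain ontology extends Provenir by subclassing its core classes; the IRIs are hypothetical placeholders rather than the actual PEO or Provenir terms. Under standard RDFS entailment, a query phrased purely in Provenir terms then also retrieves the domain-level instances:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX provenir: <http://example.org/provenir#>
PREFIX peo: <http://example.org/peo#>

# A domain-specific class extends an upper-level Provenir class, and an
# experiment step is recorded as an instance of the domain class.
INSERT DATA {
  peo:transfection rdfs:subClassOf provenir:process .
  peo:transfection_run_42 rdf:type peo:transfection .
}

# Evaluated separately against an RDFS-entailment-enabled store, the
# Provenir-level pattern below also returns peo:transfection_run_42:
#   SELECT ?p WHERE { ?p rdf:type provenir:process . }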

Provenance Query

In addition to provenance representation, we also addressed the issue of querying provenance data. To create a provenance query infrastructure, we first classified the variety of provenance queries into four categories in Chapter 4. This classification scheme used the input and output values of the provenance queries to distinguish between the categories, namely (a) queries that retrieve the provenance information associated with an entity, (b) queries that retrieve entities satisfying a set of constraints defined on the provenance information, (c) queries that compare the provenance information associated with two or more entities to determine whether the entities were created under equivalent, similar, or dissimilar conditions, and (d) queries that combine two or more provenance traces associated with an entity. We used the query classification scheme to define a set of provenance query operators that provide a well-defined and easy-to-use mechanism for domain users to query provenance information without having to manually compose complex query patterns. The four provenance query operators, (a) provenance(), (b) provenance_context(), (c) provenance_compare(), and (d) provenance_merge(), support provenance queries in each of the four categories defined in the classification scheme. In Chapter 5, we implemented a provenance query engine over an RDF data store to support the provenance query operators using the SPARQL RDF query language. Following the standard approach for evaluating new query operators, we used a set of increasingly complex provenance query patterns and large RDF datasets (the largest with more than 308 million RDF triples) to test the expression and data complexity of the provenance query operators. We found that a straightforward implementation of the provenance query engine, using the Oracle 10g RDF data store, does not scale, so we defined a novel class of materialized views based on the Provenir ontology, called Materialized Provenance Views (MPV), for query optimization. Because the MPVs are defined in terms of the Provenir ontology, the provenance query engine can automatically compute the relevant MPV for different application domains using the standard RDFS entailment rules.
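As an illustration, a drastically simplified version of the SPARQL pattern that the provenance() operator expands into is sketched below; the entity IRI and namespace are placeholders, and the full machine-generated expansion for one entity is reproduced in Appendix A:

PREFIX provenir: <http://example.org/provenir#>

# provenance(<entity>): reconstruct the provenance trace of an entity as an
# RDF graph. OPTIONAL blocks keep the query from failing when parts of the
# trace (agents, parameters, preceding processes) were not recorded.
CONSTRUCT {
  ?p1 provenir:has_participant <http://example.org/entity_1> .
  ?p1 provenir:has_agent ?agent .
  ?p1 provenir:preceded_by ?p0 .
}
WHERE {
  ?p1 provenir:has_participant <http://example.org/entity_1> .
  OPTIONAL { ?p1 provenir:has_agent ?agent . }
  OPTIONAL { ?p1 provenir:preceded_by ?p0 . }
}

The actual expansion recursively follows preceded_by, part_of, derives_from, and the other Provenir relationships, which produces the deeply nested OPTIONAL structure shown in Appendix A.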


Provenance Tracking in Semantic Web Applications

The use of the RDF model to store and disseminate scientific datasets is rapidly gaining momentum in the scientific community, with several large databases, such as UniProt and KEGG, and projects such as the Biomedical Knowledge Repository at the US National Library of Medicine. An important challenge in these projects is to efficiently keep track of the provenance information associated with the data values represented in RDF. In Chapter 6, we identified the drawbacks of the RDF reification approach and proposed the Provenance Context Entity (PaCE) approach as an efficient mechanism for tracking the provenance of RDF triples. The PaCE approach defines a formal context structure, consisting of the appropriate provenance information associated with the entities of an RDF triple, to interpret the RDF triple accurately. The provenance context is application-specific and is defined using the concepts and relationships of the Provenir ontology or of a domain-specific provenance ontology that extends it. The evaluation of the PaCE approach demonstrated that it reduces the total number of RDF triples required to track provenance in scientific datasets by 49% and improves query performance by three orders of magnitude for complex provenance queries.
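The contrast between the two approaches can be sketched as follows; the IRIs are hypothetical and the rendering of PaCE is simplified (Chapter 6 gives the actual construction):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX provenir: <http://example.org/provenir#>
PREFIX ex: <http://example.org/>

# (a) RDF reification: four bookkeeping triples per statement before any
#     provenance is attached to the statement node.
INSERT DATA {
  ex:stmt1 rdf:type rdf:Statement ;
           rdf:subject ex:heart ;
           rdf:predicate ex:affects ;
           rdf:object ex:myocardium ;
           ex:hasSource ex:source_document_1 .
} ;

# (b) PaCE: mint provenance-context-specific IRIs for the triple's
#     constituents and link them to the provenance context entity;
#     the assertion itself remains a single triple.
INSERT DATA {
  ex:heart_src1 ex:affects_src1 ex:myocardium_src1 .
  ex:heart_src1 provenir:derives_from ex:source_document_1 .
}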

A Framework for Provenance Management in Translational Medicine Research

In Chapter 7, we discussed the application of the Semantic Provenance framework in an example translational research project to identify vaccine targets for the human pathogen T.cruzi, which affects more than 12 million people in Latin America. We demonstrated the use of the provenance system to address the different stages of the provenance information lifecycle in the T.cruzi SPSE, namely (a) provenance capture, (b) provenance representation, (c) provenance storage, and (d) provenance querying. The provenance system includes a domain-specific provenance ontology, the Parasite Experiment Ontology (PEO), that was created by extending the Provenir ontology to model the provenance information associated with the experiment protocols used in the Tarleton research group. The evaluation of the provenance system validated the use of the Semantic Provenance framework in real-world applications.
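For example, a single captured step of an experiment protocol, recorded in PEO/Provenir terms, might look like the sketch below; the class, property, and instance IRIs are illustrative placeholders rather than the actual PEO vocabulary:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX provenir: <http://example.org/provenir#>
PREFIX peo: <http://example.org/peo#>
PREFIX ex: <http://example.org/tcruzi#>

# One provenance record for a transfection step: the process, the researcher
# who performed it, an input parameter, and the sample that participated.
INSERT DATA {
  ex:transfection_run_7 rdf:type peo:transfection ;
                        provenir:has_agent ex:researcher_1 ;
                        provenir:has_parameter ex:plasmid_concentration_7 ;
                        provenir:has_participant ex:cloned_sample_66 .
}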

Executive Summary

We conclude that using provenance metadata to manage the growing deluge of scientific data in the eScience paradigm requires us to address multiple aspects of provenance management. In this dissertation, we defined a new category of provenance called Semantic Provenance and addressed the associated challenges of provenance representation, through the creation of the Provenir upper-level provenance ontology, and of provenance querying, by defining provenance query operators and implementing a highly scalable query engine, to create practical provenance systems for real-world eScience applications.


8.1 Future Work

The use of the Provenir ontology in an increasing number of scientific domains, including ongoing projects in the Semantic Sensor Web (SSW) and patient health care, addresses the issue of a common model for provenance representation to facilitate provenance interoperability. For the provenance query infrastructure, we believe that specialized query operators can be defined by extending the four provenance query operators introduced in this thesis (three examples of such specialized query operators are introduced in Chapter 4). The other aspect of provenance querying is the challenge of query optimization. In this dissertation, we used materialization informed by the Provenir ontology for query optimization, but the definition of indexing structures that take into account the characteristics of provenance queries is an exciting area of research. Recent index-based provenance query optimization approaches have largely targeted relational database provenance or XML; we believe new indexing approaches need to be defined for provenance information represented in RDF. Further, we did not explore the use of SPARQL query optimization techniques in this dissertation and plan to do so in the future. The use of provenance information as a foundational layer for creating trust models or computing trust values is another exciting area of research. Current trust models often use arbitrary values to initialize trust computation, but we believe that using provenance information to bootstrap the computation of trust values, in conjunction with trust models, will significantly improve the quality of the trust values. The close integration of the trust and provenance layers in the Semantic Web has significant implications for multiple domains, including sensor networks, biomedicine, eGovernance, library information systems, and clinical informatics.



Appendix A Provenance Query Patterns in SPARQL Syntax

The queries below are the first and fifth provenance queries, expressed in SPARQL syntax, that were used in Chapter 5 to evaluate the expression complexity of the provenance() query operator.

Query 5:

PREFIX provenir:
PREFIX trident:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

CONSTRUCT { ?p1 provenir:has_participant trident:ChartDataTable90829493849637 . ?p1 rdf:type provenir:process . ?p1 provenir:has_parameter ?pa1 . ?pa1 rdf:type provenir:parameter . ?p1 provenir:part_of ?p2 . ?p1 rdf:type provenir:process . ?p2 rdf:type provenir:process . ?p1 provenir:has_agent ?a1 . ?p1 rdf:type provenir:process . ?a1 rdf:type provenir:agent . ?a1 provenir:has_parameter ?pa2 . ?pa2 rdf:type provenir:parameter . ?a1 provenir:part_of ?a2 . ?a1 rdf:type provenir:agent . ?a2 rdf:type provenir:agent . ?a1 provenir:contained_in ?a3 . ?a1 rdf:type provenir:agent . ?a3 rdf:type provenir:agent . ?a1 provenir:adjacent_to ?a4 . ?a1 rdf:type provenir:agent . ?a4 rdf:type provenir:agent . ?a5 provenir:part_of ?a1 . ?a5 rdf:type provenir:agent . ?a1 rdf:type provenir:agent . ?a6 provenir:contained_in ?a1 . ?a6 rdf:type provenir:agent . ?a1 rdf:type provenir:agent . ?a7 provenir:adjacent_to ?a1 .


?a7 rdf:type provenir:agent . ?a1 rdf:type provenir:agent . ?p1 provenir:preceded_by ?p3 . ?p1 rdf:type provenir:process . ?p3 rdf:type provenir:process . ?p3 provenir:has_parameter ?pa3 . ?pa3 rdf:type provenir:parameter . ?p3 provenir:part_of ?p4 . ?p3 rdf:type provenir:process . ?p4 rdf:type provenir:process . ?p3 provenir:has_participant ?d1 . ?p3 rdf:type provenir:process . ?d1 rdf:type provenir:data_collection . ?d1 provenir:derives_from ?d2 . ?d1 rdf:type provenir:data_collection . ?d2 rdf:type provenir:data_collection . ?d1 provenir:transformation_of ?d3 . ?d1 rdf:type provenir:data_collection . ?d3 rdf:type provenir:data_collection . ?p3 provenir:has_agent ?a8 . ?p3 rdf:type provenir:process . ?a8 rdf:type provenir:agent . ?a8 provenir:has_parameter ?pa4 . ?pa4 rdf:type provenir:parameter . ?a8 provenir:part_of ?a9 . ?a8 rdf:type provenir:agent . ?a9 rdf:type provenir:agent . ?a8 provenir:contained_in ?a10 . ?a8 rdf:type provenir:agent . ?a10 rdf:type provenir:agent . ?a8 provenir:adjacent_to ?a11 . ?a8 rdf:type provenir:agent . ?a11 rdf:type provenir:agent . ?a12 provenir:part_of ?a8 . ?a12 rdf:type provenir:agent . ?a8 rdf:type provenir:agent . ?a13 provenir:contained_in ?a8 . ?a13 rdf:type provenir:agent . ?a8 rdf:type provenir:agent . ?a14 provenir:adjacent_to ?a8 . ?a14 rdf:type provenir:agent . ?a8 rdf:type provenir:agent . ?p3 provenir:preceded_by ?p5 .


?p3 rdf:type provenir:process . ?p5 rdf:type provenir:process . ?p5 provenir:has_parameter ?pa5 . ?pa5 rdf:type provenir:parameter . ?p5 provenir:part_of ?p6 . ?p5 rdf:type provenir:process . ?p6 rdf:type provenir:process . ?p5 provenir:has_participant ?d4 . ?p5 rdf:type provenir:process . ?d4 rdf:type provenir:data_collection . ?d4 provenir:derives_from ?d5 . ?d4 rdf:type provenir:data_collection . ?d5 rdf:type provenir:data_collection . ?d4 provenir:transformation_of ?d6 . ?d4 rdf:type provenir:data_collection . ?d6 rdf:type provenir:data_collection . ?p5 provenir:has_agent ?a15 . ?p5 rdf:type provenir:process . ?a15 rdf:type provenir:agent . ?a15 provenir:has_parameter ?pa6 . ?pa6 rdf:type provenir:parameter . ?a15 provenir:part_of ?a16 . ?a15 rdf:type provenir:agent . ?a16 rdf:type provenir:agent . ?a15 provenir:contained_in ?a17 . ?a15 rdf:type provenir:agent . ?a17 rdf:type provenir:agent . ?a15 provenir:adjacent_to ?a18 . ?a15 rdf:type provenir:agent . ?a18 rdf:type provenir:agent . ?a19 provenir:part_of ?a15 . ?a19 rdf:type provenir:agent . ?a15 rdf:type provenir:agent . ?a20 provenir:contained_in ?a15 . ?a20 rdf:type provenir:agent . ?a15 rdf:type provenir:agent . ?a21 provenir:adjacent_to ?a15 . ?a21 rdf:type provenir:agent . ?a15 rdf:type provenir:agent . ?p5 provenir:preceded_by ?p7 . ?p5 rdf:type provenir:process . ?p7 rdf:type provenir:process . ?p7 provenir:has_parameter ?pa7 .


?pa7 rdf:type provenir:parameter . ?p7 provenir:part_of ?p8 . ?p7 rdf:type provenir:process . ?p8 rdf:type provenir:process . ?p7 provenir:has_participant ?d7 . ?p7 rdf:type provenir:process . ?d7 rdf:type provenir:data_collection . ?d7 provenir:derives_from ?d8 . ?d7 rdf:type provenir:data_collection . ?d8 rdf:type provenir:data_collection . ?d7 provenir:transformation_of ?d9 . ?d7 rdf:type provenir:data_collection . ?d9 rdf:type provenir:data_collection . ?p7 provenir:has_agent ?a22 . ?p7 rdf:type provenir:process . ?a22 rdf:type provenir:agent . ?a22 provenir:has_parameter ?pa8 . ?pa8 rdf:type provenir:parameter . ?a22 provenir:part_of ?a23 . ?a22 rdf:type provenir:agent . ?a23 rdf:type provenir:agent . ?a22 provenir:contained_in ?a24 . ?a22 rdf:type provenir:agent . ?a24 rdf:type provenir:agent . ?a22 provenir:adjacent_to ?a25 . ?a22 rdf:type provenir:agent . ?a25 rdf:type provenir:agent . ?a26 provenir:part_of ?a22 . ?a26 rdf:type provenir:agent . ?a22 rdf:type provenir:agent . ?a27 provenir:contained_in ?a22 . ?a27 rdf:type provenir:agent . ?a22 rdf:type provenir:agent . ?a28 provenir:adjacent_to ?a22 . ?a28 rdf:type provenir:agent . ?a22 rdf:type provenir:agent . ?p7 provenir:preceded_by ?p9 . ?p7 rdf:type provenir:process . ?p9 rdf:type provenir:process . ?p9 provenir:has_parameter ?pa9 . ?pa9 rdf:type provenir:parameter . ?p9 provenir:part_of ?p10 . ?p9 rdf:type provenir:process .


?p10 rdf:type provenir:process . ?p9 provenir:has_participant ?d10 . ?p9 rdf:type provenir:process . ?d10 rdf:type provenir:data_collection . ?d10 provenir:derives_from ?d11 . ?d10 rdf:type provenir:data_collection . ?d11 rdf:type provenir:data_collection . ?d10 provenir:transformation_of ?d12 . ?d10 rdf:type provenir:data_collection . ?d12 rdf:type provenir:data_collection . ?p9 provenir:has_agent ?a29 . ?p9 rdf:type provenir:process . ?a29 rdf:type provenir:agent . ?a29 provenir:has_parameter ?pa10 . ?pa10 rdf:type provenir:parameter . ?a29 provenir:part_of ?a30 . ?a29 rdf:type provenir:agent . ?a30 rdf:type provenir:agent . ?a29 provenir:contained_in ?a31 . ?a29 rdf:type provenir:agent . ?a31 rdf:type provenir:agent . ?a29 provenir:adjacent_to ?a32 . ?a29 rdf:type provenir:agent . ?a32 rdf:type provenir:agent . ?a33 provenir:part_of ?a29 . ?a33 rdf:type provenir:agent . ?a29 rdf:type provenir:agent . ?a34 provenir:contained_in ?a29 . ?a34 rdf:type provenir:agent . ?a29 rdf:type provenir:agent . ?a35 provenir:adjacent_to ?a29 . ?a35 rdf:type provenir:agent . ?a29 rdf:type provenir:agent .} WHERE { OPTIONAL { ?p1 provenir:has_participant trident:ChartDataTable90829493849637 ; rdf:type provenir:process . OPTIONAL { ?p1 provenir:has_parameter ?pa1 . ?pa1 rdf:type provenir:parameter . } OPTIONAL { ?p1 provenir:part_of ?p2 ;


rdf:type provenir:process . ?p2 rdf:type provenir:process . } OPTIONAL { ?p1 provenir:has_agent ?a1 ; rdf:type provenir:process . ?a1 rdf:type provenir:agent . OPTIONAL { ?a1 provenir:has_parameter ?pa2 . ?pa2 rdf:type provenir:parameter . } OPTIONAL { ?a1 provenir:part_of ?a2 ; rdf:type provenir:agent . ?a2 rdf:type provenir:agent . } OPTIONAL { ?a1 provenir:contained_in ?a3 ; rdf:type provenir:agent . ?a3 rdf:type provenir:agent . } OPTIONAL { ?a1 provenir:adjacent_to ?a4 ; rdf:type provenir:agent . ?a4 rdf:type provenir:agent . } OPTIONAL { ?a5 provenir:part_of ?a1 ; rdf:type provenir:agent . ?a1 rdf:type provenir:agent . } OPTIONAL { ?a6 provenir:contained_in ?a1 ; rdf:type provenir:agent . ?a1 rdf:type provenir:agent . } OPTIONAL { ?a7 provenir:adjacent_to ?a1 ; rdf:type provenir:agent . ?a1 rdf:type provenir:agent . } } OPTIONAL


{ ?p1 provenir:preceded_by ?p3 ; rdf:type provenir:process . ?p3 rdf:type provenir:process . OPTIONAL { ?p3 provenir:has_parameter ?pa3 . ?pa3 rdf:type provenir:parameter . } OPTIONAL { ?p3 provenir:part_of ?p4 ; rdf:type provenir:process . ?p4 rdf:type provenir:process . } OPTIONAL { ?p3 provenir:has_participant ?d1 ; rdf:type provenir:process . ?d1 rdf:type provenir:data_collection . OPTIONAL { ?d1 provenir:derives_from ?d2 ; rdf:type provenir:data_collection . ?d2 rdf:type provenir:data_collection . } OPTIONAL { ?d1 provenir:transformation_of ?d3 ; rdf:type provenir:data_collection . ?d3 rdf:type provenir:data_collection . } } OPTIONAL { ?p3 provenir:has_agent ?a8 ; rdf:type provenir:process . ?a8 rdf:type provenir:agent . OPTIONAL { ?a8 provenir:has_parameter ?pa4 . ?pa4 rdf:type provenir:parameter . } OPTIONAL { ?a8 provenir:part_of ?a9 ; rdf:type provenir:agent . ?a9 rdf:type provenir:agent . } OPTIONAL { ?a8 provenir:contained_in ?a10 ; rdf:type provenir:agent .


?a10 rdf:type provenir:agent . } OPTIONAL { ?a8 provenir:adjacent_to ?a11 ; rdf:type provenir:agent . ?a11 rdf:type provenir:agent . } OPTIONAL { ?a12 provenir:part_of ?a8 ; rdf:type provenir:agent . ?a8 rdf:type provenir:agent . } OPTIONAL { ?a13 provenir:contained_in ?a8 ; rdf:type provenir:agent . ?a8 rdf:type provenir:agent . } OPTIONAL { ?a14 provenir:adjacent_to ?a8 ; rdf:type provenir:agent . ?a8 rdf:type provenir:agent . } } OPTIONAL { ?p3 provenir:preceded_by ?p5 ; rdf:type provenir:process . ?p5 rdf:type provenir:process . OPTIONAL { ?p5 provenir:has_parameter ?pa5 . ?pa5 rdf:type provenir:parameter . } OPTIONAL { ?p5 provenir:part_of ?p6 ; rdf:type provenir:process . ?p6 rdf:type provenir:process . } OPTIONAL { ?p5 provenir:has_participant ?d4 ; rdf:type provenir:process . ?d4 rdf:type provenir:data_collection . OPTIONAL { ?d4 provenir:derives_from ?d5 ; rdf:type provenir:data_collection .


?d5 rdf:type provenir:data_collection . } OPTIONAL { ?d4 provenir:transformation_of ?d6 ; rdf:type provenir:data_collection . ?d6 rdf:type provenir:data_collection . } } OPTIONAL { ?p5 provenir:has_agent ?a15 ; rdf:type provenir:process . ?a15 rdf:type provenir:agent . OPTIONAL { ?a15 provenir:has_parameter ?pa6 . ?pa6 rdf:type provenir:parameter . } OPTIONAL { ?a15 provenir:part_of ?a16 ; rdf:type provenir:agent . ?a16 rdf:type provenir:agent . } OPTIONAL { ?a15 provenir:contained_in ?a17 ; rdf:type provenir:agent . ?a17 rdf:type provenir:agent . } OPTIONAL { ?a15 provenir:adjacent_to ?a18 ; rdf:type provenir:agent . ?a18 rdf:type provenir:agent . } OPTIONAL { ?a19 provenir:part_of ?a15 ; rdf:type provenir:agent . ?a15 rdf:type provenir:agent . } OPTIONAL { ?a20 provenir:contained_in ?a15 ; rdf:type provenir:agent . ?a15 rdf:type provenir:agent . } OPTIONAL { ?a21 provenir:adjacent_to ?a15 ;


rdf:type provenir:agent . ?a15 rdf:type provenir:agent . } } OPTIONAL { ?p5 provenir:preceded_by ?p7 ; rdf:type provenir:process . ?p7 rdf:type provenir:process . OPTIONAL { ?p7 provenir:has_parameter ?pa7 . ?pa7 rdf:type provenir:parameter . } OPTIONAL { ?p7 provenir:part_of ?p8 ; rdf:type provenir:process . ?p8 rdf:type provenir:process . } OPTIONAL { ?p7 provenir:has_participant ?d7 ; rdf:type provenir:process . ?d7 rdf:type provenir:data_collection . OPTIONAL { ?d7 provenir:derives_from ?d8 ; rdf:type provenir:data_collection . ?d8 rdf:type provenir:data_collection . } OPTIONAL { ?d7 provenir:transformation_of ?d9 ; rdf:type provenir:data_collection . ?d9 rdf:type provenir:data_collection . } } OPTIONAL { ?p7 provenir:has_agent ?a22 ; rdf:type provenir:process . ?a22 rdf:type provenir:agent . OPTIONAL { ?a22 provenir:has_parameter ?pa8 . ?pa8 rdf:type provenir:parameter . } OPTIONAL { ?a22 provenir:part_of ?a23 ; rdf:type provenir:agent .


?a23 rdf:type provenir:agent . } OPTIONAL { ?a22 provenir:contained_in ?a24 ; rdf:type provenir:agent . ?a24 rdf:type provenir:agent . } OPTIONAL { ?a22 provenir:adjacent_to ?a25 ; rdf:type provenir:agent . ?a25 rdf:type provenir:agent . } OPTIONAL { ?a26 provenir:part_of ?a22 ; rdf:type provenir:agent . ?a22 rdf:type provenir:agent . } OPTIONAL { ?a27 provenir:contained_in ?a22 ; rdf:type provenir:agent . ?a22 rdf:type provenir:agent . } OPTIONAL { ?a28 provenir:adjacent_to ?a22 ; rdf:type provenir:agent . ?a22 rdf:type provenir:agent . } } OPTIONAL { ?p7 provenir:preceded_by ?p9 ; rdf:type provenir:process . ?p9 rdf:type provenir:process . OPTIONAL { ?p9 provenir:has_parameter ?pa9 . ?pa9 rdf:type provenir:parameter . } OPTIONAL { ?p9 provenir:part_of ?p10 ; rdf:type provenir:process . ?p10 rdf:type provenir:process . } OPTIONAL { ?p9 provenir:has_participant ?d10 ;


rdf:type provenir:process . ?d10 rdf:type provenir:data_collection . OPTIONAL { ?d10 provenir:derives_from ?d11 ; rdf:type provenir:data_collection . ?d11 rdf:type provenir:data_collection . } OPTIONAL { ?d10 provenir:transformation_of ?d12 ; rdf:type provenir:data_collection . ?d12 rdf:type provenir:data_collection . } } OPTIONAL { ?p9 provenir:has_agent ?a29 ; rdf:type provenir:process . ?a29 rdf:type provenir:agent . OPTIONAL { ?a29 provenir:has_parameter ?pa10 . ?pa10 rdf:type provenir:parameter . } OPTIONAL { ?a29 provenir:part_of ?a30 ; rdf:type provenir:agent . ?a30 rdf:type provenir:agent . } OPTIONAL { ?a29 provenir:contained_in ?a31 ; rdf:type provenir:agent . ?a31 rdf:type provenir:agent . } OPTIONAL { ?a29 provenir:adjacent_to ?a32 ; rdf:type provenir:agent . ?a32 rdf:type provenir:agent . } OPTIONAL { ?a33 provenir:part_of ?a29 ; rdf:type provenir:agent . ?a29 rdf:type provenir:agent . } OPTIONAL { ?a34 provenir:contained_in ?a29 ;


rdf:type provenir:agent . ?a29 rdf:type provenir:agent . } OPTIONAL { ?a35 provenir:adjacent_to ?a29 ; rdf:type provenir:agent . ?a29 rdf:type provenir:agent . } } } } } } } OPTIONAL { trident:ChartDataTable90829493849637 provenir:derives_from ?d13 . ?d13 rdf:type provenir:data_collection . } OPTIONAL { trident:ChartDataTable90829493849637 provenir:contained_in ?d14 . ?d14 rdf:type provenir:data_collection . } OPTIONAL { ?d15 provenir:contained_in trident:ChartDataTable90829493849637 ; rdf:type provenir:data_collection . } OPTIONAL { trident:ChartDataTable90829493849637 provenir:part_of ?d16 . ?d16 rdf:type provenir:data_collection . } OPTIONAL { ?d17 provenir:part_of trident:ChartDataTable90829493849637 ; rdf:type provenir:data_collection . } OPTIONAL { trident:ChartDataTable90829493849637 provenir:transformation_of ?d18 . ?d18 rdf:type provenir:data_collection . } }


Bibliography

1. Smalheiser, N.R.: Informatics and hypothesis-driven research. EMBO Reports 3 (2002) 702

2. Hey, T., Trefethen, A.E.: Cyberinfrastructure for e-Science. Science 308 (2005) 817-821

3. Cancer Biomedical Informatics Grid (caBIG), https://cabig.nci.nih.gov/, retrieved on July 15, 2010.

4. Vastrik, I., D'Eustachio, P., Schmidt, E., Joshi-Tope, G., Gopinath, G., Croft, D., de Bono, B., Gillespie, M., Jassal, B., Lewis, S., Matthews, L., Wu, G., Birney, E., Stein, L.: Reactome: a knowledge base of biologic pathways and processes. Genome Biology 8 (2007)

5. Sahoo, S.S., Sheth, A., Henson, C.: Semantic Provenance for eScience: Managing the Deluge of Scientific Data. IEEE Internet Computing 12 (2008) 46-54

6. Sheth, A., Klas, W.: Multimedia Data Management: Using Metadata to Integrate and Apply Digital Media. McGraw-Hill (1998)

7. Myers, J.D., Pancerella, C., Lansing, C., Schuchardt, K., Didier, B.: Multi-scale Science: Supporting Emerging Practice with Semantically-Derived Provenance. Semantic Web Technologies for Searching and Retrieving Scientific Data Workshop at the 2nd International Semantic Web Conference (2003)

8. The Library of Congress, www.loc.gov/standards, retrieved on July 15, 2010.

9. OGC GML, www.opengeospatial.org/standards/gml, retrieved on July 15, 2010.

10. Boll, S., Klas, W., Sheth, A.P.: Overview on Using Metadata to manage Multimedia Data. Multimedia Data Management (1998) 1-24

11. The National Center for Biomedical Ontology, http://www.bioontology.org/, retrieved on July 15, 2010.

12. Lassila, O.: Web Metadata: A Matter of Semantics. IEEE Internet Computing 2 (1998) 30-37

13. Sheth, A.P., Larson, J.A.: Federated database systems for managing distributed, heterogeneous and autonomous databases. ACM Computing Surveys 22 (1990) 183-236

14. Kashyap, V., Sheth, A.: Information brokering across heterogeneous digital data - a metadata-based approach. Kluwer Academic, Boston, MA, USA (2000)

15. Kashyap, V., Shah, K., Sheth, A.: Metadata for building the multimedia patch quilt. Multimedia Database Systems: Issues and Research Directions (1996) 297-319

16. McCarthy, J.L.: Metadata Management for Large Statistical Databases. 8th International Conference on Very Large Data Bases (1982) 234-243

17. Kashyap, V., Sheth, A.: Schematic and Semantic Similarities between Database Objects: A Context-based Approach. VLDB Journal 5 (1996) 276–304


18. Böhm, K., Rakow, T. C.: Metadata for multimedia documents. SIGMOD Rec. 23 (1994) 21-26

19. Understanding Metadata. National Information Standards Organization (2004)

20. Sheth, A.: Data Semantics: What, Where, and How? In: Mark, R.M.a.L. (ed.): Data Semantics (IFIP Transactions). Chapman & Hall, London (1996) 601–610

21. Sheth, A.: Semantic Meta Data for Enterprise Information Integration. Data Management Review (2003)

22. Berners-Lee, T., Hendler, J., Lassila, O. : The Semantic Web. Scientific American Magazine. (2001)

23. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems, Special Issue on Linked Data (2009)

24. Manola, F., Miller, E., "RDF Primer." McBride, B. (ed.): W3C Recommendation (2004)

25. Brickley, D., Guha, R.V., "RDF Schema." McBride, B. (ed.): W3C Recommendation (2004)

26. Hitzler, P., Krötzsch, M., Parsia, B., Patel-Schneider, P.F., Rudolph, S.: OWL 2 Web Ontology Language Primer. W3C (2009)

27. Hayes, P., "RDF Semantics," McBride, B. (ed.), W3C Recommendation (2004)

28. Patel-Schneider, P.F., Hayes, P., Horrocks, I.: OWL Web Ontology Language Semantics and Abstract Syntax. W3C Recommendation (2004)

29. Protein knowledgebase: Uniprot, http://www.uniprot.org/, retrieved on July 15, 2010.

30. Wishart, D.S., Knox, C., Guo, A.C., Cheng, D., Shrivastava, S., Tzur, D., Gautam, B., Hassanali, M.: DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 36 (2008) 901-906

31. BioPAX : Biological Pathways Exchange, , http://www.biopax.org/, retrieved on July 15, 2010.

32. Entrez, http://ww.ncbi.nlm.nih.gov/Entrez/, retrieved on July 15, 2010.

33. Unified Medical Language System (UMLS), http://www.nlm.nih.gov/research/umls/, retrieved on July 15, 2010.

34. Buneman, P., Khanna, S., Tan, W.C.: Data Provenance: Some Basic Issues. Lecture Notes in Computer Science 1974 (2000) 87-93

35. Cui, Y., Widom, J.: Practical Lineage Tracing in Data Warehouses. 16th ICDE. IEEE Computer Society, San Diego, California (2000)

36. Buneman, P., Khanna , S., Tan, W.C.: Why and Where: A Characterization of Data Provenance. 8th International Conference on Database Theory (2001) 316 - 330

37. Green, T.J., Karvounarakis, G., Tannen, V.: Provenance Semirings. ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS) (2007) 675-686


38. Simmhan, Y.L., Plale, B., Gannon, D.: A survey of data provenance in e-science. SIGMOD Rec. 34 (2005) 31-36

39. IPAW 2008, www.sci.utah.edu/ipaw2008/, retrieved on July 15, 2010.

40. McGuinness, D.L., van Harmelen, F., "OWL Web Ontology Language Overview." W3C Recommendation (2004)

41. Schulze-Kremer, S.: Ontologies for molecular biology and bioinformatics. Silico Biol. 2 (2002) 179-193

42. Miles, S.: Electronically Querying for the Provenance of Entities. Third International Provenance and Annotation Workshop, Chicago, USA (2006) 184-192

43. Sahoo, S.S., Barga, R.S., Goldstein, J., Sheth, A. : Provenance Algebra and Materialized View- based Provenance Management. Microsoft Research Technical Report (2008)

44. Sahoo, S.S., Bodenreider, O., Hitzler, P., Sheth, A., Thirunarayan, K.: Provenance Context Entity (PaCE): Scalable Provenance Tracking for Scientific RDF Data. Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM 2010), Heidelberg, Germany (2010) 461-470

45. Sahoo, S.S., Thomas, C., Sheth, A., York, W. S., and Tartir, S.: Knowledge modeling and its application in life sciences: a tale of two ontologies. Proceedings of the 15th international Conference on World Wide Web WWW '06 Edinburgh, Scotland (2006) 317-326

46. Sahoo, S.S., Weatherly, D.B., Muttharaju, R., Anantharam, P., Sheth, A., Tarleton, R.L.: Ontology-driven Provenance Management in eScience: An Application in Parasite Research. In: R. Meersman, T.D.e.a. (ed.): The 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE 09). Springer Verlag, Vilamoura, Algarve-Portugal (2009) 992-1009

47. Sahoo, S.S., Zeng, K., Bodenreider, O., Sheth, A.P.: From "glycosyltransferase" to "congenital muscular dystrophy": Integrating knowledge from NCBI Entrez Gene and the Gene Ontology. In: Kuhn, K., et al. (eds.): Medinfo. IOS Press, Brisbane, Australia (2007) 1260-1264

48. Sahoo, S.S., Bodenreider, O., Rutter, J.L.,Skinner, K.J., Sheth, A.P.: An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence. Journal of Biomedical Informatics 41 (2008) 752-765

49. Sahoo, S.S., Thomas, C., Sheth, A., Henson, C., York, W.S.: GLYDE-an expressive XML standard for the representation of glycan structure. Carbohydr Res. 340 (2005) 2802-2807

50. Missier, P., Sahoo, S.S., Zhao, J., Goble, C., Sheth, A.: Janus: from Workflows to Semantic Provenance and Linked Open Data. IPAW 2010, Troy, NY (2010)

51. Patni, H., Sahoo, S.S., Henson, C., Sheth, A.: Provenance Aware Linked Sensor Data. 2nd Workshop on Trust and Privacy on the Social and Semantic Web, Co-located with ESWC2010, Heraklion Greece (2010)

52. Sahoo, S.S., Sheth, A. : Provenir ontology: Towards a Framework for eScience Provenance Management. Microsoft eScience Workshop. Microsoft Research, Pittsburgh, USA (2009)


53. Sahoo, S.S., Bodenreider, O., Zeng, K., Sheth, A.P.: An Experiment in Integrating Large Biomedical Knowledge Resources with RDF: Application to Associating Genotype and Phenotype Information. Workshop on Health Care and Life Sciences Data Integration for the Semantic Web at the 16th International World Wide Web Conference (WWW2007), Banff, Canada (2007)

54. Sahoo, S.S., Sheth, A.P., York, W.S., Miller J.A.: Semantic Web Services for N-glycosylation Process. International Symposium on Web Services for Computational Biology and Bioinformatics, Blacksburg, VA, USA (2005)

55. Sahoo, S.S., Sheth, A.: Bioinformatics applications of Web Services, Web Processes and role of Semantics. In: Cardoso, J., Sheth, A. (ed.): Semantic Web Processes and Their Applications. Springer (2006) 305–322

56. Sahoo, S.S., Sheth, A.P., Hunter, B., York, W.S.: SemBOWSER - Semantic Biological Web Services Registry. In: Cheung, C.J.O.B.a.K.-H. (ed.): Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences. Springer (2007)

57. Pierce, M.J., (PI), Sheth, A., York, W.S., Miller, J.A., Kochut, K. (Co-PI): Integrated Technology Resource for Biomedical Glycomics. NIH (July 1, 2003 - June 30, 2008.)

58. Tan, W.C.: Provenance in Databases: Past, Current, and Future. IEEE Data Eng. Bull. 30 (2007) 3-12

59. Goble, C.: Position Statement: Musings on Provenance, Workflow and (Semantic Web) Annotations for Bioinformatics. Workshop on Data Derivation and Provenance, Chicago (2002)

60. Wong, S.C., Miles, S., Fang, W., Groth, P. and Moreau, L. : Validation of E-Science Experiments using a Provenance-based Approach.: 4th UK e-Science All Hands Meeting (AHM), Nottingham, UK (2005)

61. Ontology Matching, http://www.ontologymatching.org/, retrieved on July 15, 2010.

62. McBride, B.: Jena: A Semantic Web Toolkit. IEEE Internet Computing 6 (Nov. 2002) 55-59

63. Sesame, http://openRDF.org, retrieved on July 15, 2010.

64. Haarslev, V., Möller, R.: Racer: A Core Inference Engine for the Semantic Web. 2nd International Workshop on Evaluation of Ontology-based Tools (EON2003), located at the 2nd International Semantic Web Conference ISWC 2003, Sanibel Island, Florida, USA (2003) 27–36

65. Sirin, E., Parsia, B., Cuenca Grau, B., Kalyanpur, A., Katz, Y.: Pellet: A practical OWL-DL reasoner. Journal of Web Semantics 2 (2007)

66. FaCT++, http://owl.man.ac.uk/factplusplus/, retrieved on July 15, 2010.

67. Geospatial ontology, http://www.w3.org/2005/Incubator/geo/XGR-geo-ont/, retrieved on July 15, 2010.

68. Environment ontology, http://sweet.jpl.nasa.gov/, retrieved on July 15, 2010.

69. Prud'hommeaux, E., Seaborne, A., "SPARQL Query Language for RDF", W3C Recommendation. W3C (2008)


70. Das, S.: RDF Support in Oracle RDBMS. Oracle Technical Presentation

71. Wood, D., Gearon, P., Adams, T.: Kowari: A platform for semantic web storage and analysis.: 14th International WWW Conference, Chiba Japan (2005)

72. Mapping SQL Data to RDF, http://virtuoso.openlinksw.com/wiki/main/Main/VOSRDF, retrieved on July 15, 2010

73. welkin, http://simile.mit.edu/welkin/, retrieved on July 15, 2010.

74. Noy, N.F., Sintek, M., Decker, S., Crubezy, M., Fergerson, R.W., Musen, M.A.: Creating Semantic Web Contents with Protege-2000. IEEE Intelligent Systems 16 (2001) 60-71

75. Semantic Analytics Visualization, http://lsdis.cs.uga.edu/projects/semvis/, retrieved on July 15, 2010.

76. Arnold, J., Schuttler, H.-B, Logan, D., Battogtokh, D., Griffith, J., Arpinar, B., Bhandarkar, S., Datta, S., Kochut, K.J., Kraemer, E., Miller, J.A., Sheth, A., Strobel, G., Taha, T., Aleman-Meza, B., Doss, J., Harris, L., Nyong, A.: Metabolomics. Handbook of Industrial Mycology. Marcel Dekker, NY (2004) 597-633

77. Zhao, J., Goble, C., and Stevens, R.: Semantic web applications to e-science in silico experiments.: Proceedings of the 13th international World Wide Web Conference on Alternate Track Papers &Amp; Posters ACM Press, New York, NY, New York, NY, USA (2004) 284-285

78. Zhao, J., Wroe, C., Goble, C., Stevens, R., Quan, D., Greenwood, M.: Using Semantic Web Technologies for Representing e-Science Provenance. 3rd International Semantic Web Conference ISWC2004, Vol. LNCS 3298. Springer, Hiroshima, Japan (2004)

79. Taylor, C.F., Paton, N.W., Garwood, K.L., Kirby, P.D., Stead, D.A., Yin, Z., Deutsch, E.W., Selway, L., Walker, J., Riba-Garcia, I., Mohammed, S., Deery, M.J., Howard , Dunkley, T., Aebersold, R., Kell, D.B., Lilley, K.S., Roepstorff, P., Yates, J.R. 3rd, Brass, A., Brown, A.J., Cash, P., Gaskell, S.J., Hubbard, S.J., Oliver, S.G.: A systematic approach to modeling, capturing, and disseminating proteomics experimental data. Nat. Biotechnol. 21 ( 2003) 247-254

80. The Ontology for Biomedical Investigations, http://obi-ontology.org, retrieved on July 15, 2010.

81. Oberle, D., Ankolekar, A., Hitzler, P., Cimiano, P., Schmidt, C., Weiten, M., Loos, B., Porzel, R., Zorn, H.-P., Micelli, V., Sintek, M., Kiesel, M., Mougouie, B., Vembu, S., Baumann, S., Romanelli, M., Buitelaar, P., Engel, R., Sonntag, D., Reithinger, N., Burkhardt, F., Zhou, J.: DOLCE ergo SUMO: On Foundational and Domain Models in SWIntO (SmartWeb Integrated Ontology). Journal of Web Semantics: Science, Services and Agents on the World Wide Web (2007)

82. Niles, I., Pease, A. : Towards a Standard Upper Ontology. In: Welty, C., Smith, B. (ed.): 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), Ogunquit, Maine (2001)

83. Smith, B., Ceusters, W., Klagges, B., Kohler, J., Kumar, A., Lomax, J., Mungall, C., Neuhaus, F., Rector, A.L., Rosse, C.: Relations in biomedical ontologies. Genome Biol 6 (2005) R46


84. Goble, C., Wolstencroft, K., Goderis, A., Hull, D., Zhao, J., Alper, P., Lord, P., Wroe, C., Belhajjame, K., Turi, D., Stevens, R., Oinn, T., De Roure, D.: Knowledge discovery for biology with Taverna: Producing and consuming semantics in the Web of Science. In: Baker, C.J.O., Cheung, K.-H. (eds.): Semantic Web: Revolutionizing knowledge discovery in the life sciences. Springer, New York (2007) 355-395

85. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., and Hattori, M.: The KEGG resources for deciphering the genome. Nucleic Acids Res. 32 (2004) D277-D280

86. Romero, P., Wagg, J., Green, M.L., Kaiser, D., Krummenacker, M., and P.D. Karp: Computational prediction of human metabolic pathways from the complete human genome. Genome Biology 6 (2004) 1-17

87. Simmhan, Y.L., Plale, B., Gannon, D.: Karma2: Provenance management for data driven workflows. International Journal of Web Services Research 5 (2008)

88. Open Provenance Model, http://openprovenance.org/, retrieved on July 15, 2010.

89. Hartig, O., Zhao, J.: Publishing and Consuming Provenance Metadata on the Web of Linked Data. 3rd International Provenance and Annotation Workshop (IPAW), Troy, New York, USA (2010)

90. McGuinness, D.L., Pinheiro da Silva, P.: Explaining Answers from the Semantic Web: The Inference Web Approach. Journal of Web Semantics 1 (2004) 397-413

91. Chiticariu, L., Tan, W., Vijayvargiya, G.: DBNotes: a post-it system for relational databases based on provenance.: ACM SIGMOD international Conference on Management of Data. ACM, New York, NY, Baltimore, Maryland (2005) 942-944

92. Widom, J.: Trio: A System for Data, Uncertainty, and Lineage. In: Aggarwal, C. (ed.): Managing and Mining Uncertain Data. Springer (2008)

93. Borgida, A.: Description Logics in Data Management. IEEE Transactions on Knowledge and Data Engineering 7 (1995) 671-682

94. Moreau, L., Clifford, B., Freire, J., Gil, Y., Groth, P., Futrelle, J., Kwasnikowska, N., Miles, S., Missier, P., Myers, J., Simmhan, Y., Stephan, E., Van den Bussche, J.: The Open Provenance Model - Core Specification (v1.1). (2009)

95. Provenance Challenge, http://twiki.ipaw.info/, retrieved on July 15, 2010

96. Widom, J.: Trio: A System for Integrated Management of Data, Accuracy, and Lineage. Second Biennial Conference on Innovative Data Systems Research (CIDR '05), Pacific Grove, California (2005)

97. Karvounarakis, G., Ives, Z.G., Tannen, V.: Querying Data Provenance. Proceedings of the 2010 International Conference on Management of Data. ACM, New York, NY, Indianapolis, Indiana, USA (2010) 951-962

98. Anand, M.K., Bowers, S., Ludäscher, B.: Techniques for efficiently querying scientific workflow provenance graphs. In: Manolescu, I., Spaccapietra, S., Teubner, J., Kitsuregawa, M., Leger, A., Naumann, F., Ailamaki, A., Ozcan, F. (eds.): Proceedings of the 13th International Conference on Extending Database Technology, Vol. 426. ACM, New York, NY, Lausanne, Switzerland (2010) 287-298


99. Geerts, F., Kementsietsidis, A., Milano, D.: MONDRIAN: Annotating and querying databases through colors and blocks. International Conference on Data Engineering (ICDE), Atlanta, GA (2006)

100. Kementsietsidis, A., Wang, M.: Provenance query evaluation: what's so special about it? Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, New York, NY, Hong Kong, China (2009) 681-690

101. Chapman, A.P., Jagadish, H. V., Ramanan, P.: Efficient provenance storage. ACM SIGMOD international Conference on Management of Data. ACM, New York, NY, Vancouver, Canada (2008) 993-1006

102. Davidson, S.B., Freire, J.: Provenance and scientific workflows: challenges and opportunities. ACM SIGMOD international conference on Management of data. ACM New York, NY, USA Vancouver, Canada (2008)

103. Simmhan, Y., Plale, B., Gannon, D., Marru, S.: Performance Evaluation of the Karma Provenance Framework for Scientific Workflows. In: Moreau, L., Foster, I. (ed.): Provenance and Annotation of Data. Springer Berlin / Heidelberg (2006) 222-236

104. Project Neptune, http://www.neptune.washington.edu/, retrieved on July 15, 2010.

105. Semantics and Services enabled Problem Solving Environment for Tcruzi, http://wiki.knoesis.org/index.php/Trykipedia, retrieved on July 15, 2010.

106. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P.F (ed.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press (2003)

107. Klyne, G., Carroll, J. J.: Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Recommendation (2004)

108. Wang, Y.R., Madnick, S. E. : A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective. 16th VLDB Conference (1990) 519–538

109. Lee, T., Bressan, S: Multimodal Integration of Disparate Information Sources with Attribution. Entity Relationship Workshop on Information Retrieval and Conceptual Modeling (1997)

110. Alexe, B., Chiticariu, L., Tan. W.-C.: SPIDER: a Schema mapPIng DEbuggeR. . VLDB (2006) 1179–1182

111. Haas, L.M., Hernández, M.A., Ho, H., Popa, L., Roth, M.: Clio Grows Up: From Research Prototype to Industrial Tool. ACM SIGMOD, Baltimore, MD (2005) 805-810

112. Holland, D.A., Seltzer, M.I., Braun, U., Muniswamy-Reddy, K.: PASSing the provenance challenge. Concurrency and Computation: Practice and Experience 20 (2008) 531-540

113. Zhao, J., Goble, C., Stevens, R., Turi, D. : Mining taverna's semantic web of provenance. Journal of Concurrency and Computation:Practice and Experience (2007)


114. Chong, E.I., Das, S., Eadon, G., and Srinivasan, J.: An efficient SQL-based RDF querying scheme.: 31st international Conference on Very Large Data Bases. VLDB Endowment, Trondheim, Norway (2005) 1216-1227

115. Angles, R., Gutierrez, C.: Survey of graph database models. ACM Comput. Surv. 40 (2008) 1-39

116. Vardi, M.: The Complexity of Relational Query Languages. 14th Ann. ACM Symp. Theory of Computing (STOC '82) (1982) 137-146

117. Pérez, J., Arenas, M., Gutiérrez, C. : Semantics and Complexity of SPARQL. Int'l Semantic Web Conf. (ISWC '06), Vol. 4273/2006, Athens, GA (2006) 30-43

118. Trident Workflow Workbench, http://www.microsoft.com/mscorp/tc/trident.mspx, retrieved on July 15, 2010.

119. The VisTrails Project, http://www.vistrails.org/, retrieved on July 15, 2010.

120. Scheidegger, C., Koop, D., Santos, E., Vo, H., Callahan, S., Freire, J., Silva, C.: Tackling the Provenance Challenge One Layer at a Time. Concurrency and Computation:Practice and Experience 20 (2007) 473-483

121. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32 (2004) 267-270

122. Bodenreider, O., Rindflesch, T.C.: Advanced library services: Developing a biomedical knowledge repository to support advanced information management applications. Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, Maryland (2006)

123. RDB2RDF XG Final Report. World Wide Web Consortium (2009).

124. MedlinePlus. Vol. 2010. National Library of Medicine (US), Bethesda (MD)

125. ter Horst, H.J.: Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary. Journal of Web Semantics 3 (2005) 79-115

126. Carroll, J.J., Bizer, C., Hayes, P., Stickler, P.: Named graphs, provenance and trust. 14th international Conference on World Wide Web. ACM, New York, NY, Chiba, Japan (2005) 613-622

127. Ding, L., Finin, T., Joshi, A., Peng, J., Pinheiro da Silva, P., McGuinness, D.L.: Tracking RDF Graph Provenance using RDF Molecules. Proceedings of the 4th International Semantic Web Conference (2005)

128. Flouris, G., Fundulaki, I., Pediaditis, P., Theoharis, Y., Christophides, V.: Coloring RDF Triples to Capture Provenance. International Semantic Web Conference 2009, Washington D.C., USA (2009) 196-212

129. Guha, R.V.: Contexts: A Formalization and Some Applications. PhD Thesis, Stanford University (1991)

130. Guha, R.V., McCarthy, J.: Varieties of Contexts. CONTEXT 2003 (2003) 164–177


131. McCarthy, J.: Generality in artificial intelligence. Formalizing Common Sense: Papers by John McCarthy (1990) 226–236

132. Nayak, P.P.: Representing multiple theories. In: B. Hayes-Roth, R.K. (ed.): AAAI-94, Menlo Park, CA, USA (1994) 1154–1160

133. Buvač, S., Mason, I.: Propositional logic of context. AAAI (1993) 412-419

134. Giunchiglia, F., Ghidini, C.: Local models semantics, or contextual reasoning = locality + compatibility. In: Cohn, A.G., Schubert, L., Shapiro, S.C. (eds.): KR'98: Principles of Knowledge Representation and Reasoning. Morgan Kaufmann, San Francisco, CA, USA (1998) 282-289

135. Guha, R., McCool, R., Fikes, R.: Contexts for the Semantic Web. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.): International Semantic Web Conference, Vol. 3298. Springer, Hiroshima, Japan (2004) 32-46

136. Ayers, A., Völkel, M.: Cool URIs for the Semantic Web. In: Sauermann, L., Cyganiak, R. (ed.): Working Draft. W3C (2008)

137. Wikipedia, http://www.wikipedia.org/, retrieved on July 15, 2010.

138. Clinical Trials, http://clinicaltrials.gov/, retrieved on July 15, 2010.

139. Weatherly, B., Atwood, J., Minning, T., Cavola, C., Tarleton, R., Orlando, R.: A heuristic method for assigning a false-discovery rate for protein identifications from Mascot database search results. Mol. Cell. Proteomics 4 (2005) 762-772

140. Kelly, B.K., Anderson, P. E., Reo, N. V., DelRaso, N. J. , Doom, T. E., Raymer, M. L.: A proposed statistical protocol for the analysis of metabolic toxicological data derived from NMR spectroscopy. 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE 2007), Vol. II, Cambridge - Boston, Massachusetts, USA (2007) 1414-1418

141. Oracle Life Sciences Users Group, http://www.olsug.org/, retrieved on July 15, 2010.

142. Aurrecoechea, C., M. Heiges, H. Wang, Z. Wang, S. Fischer, P. Rhodes, J. Miller, E. Kraemer, C. J. Stoeckert, Jr., D. S. Roos, and J. C. Kissinger: ApiDB: integrated resources for the apicomplexan bioinformatics resource center. Nucleic Acids Research 35 (2007) 427-430

143. NCI Thesaurus, http://ncit.nci.nih.gov, retrieved on July 15, 2010.

144. Nguyen, V., Sahoo, S.S., Parikh, P., Minning, T., Weatherly, B., Logan, F., Sheth, A., Tarleton, R., "Biomedical Ontologies for Parasite Research," ISMB2010, July 11-13, Boston, MA, USA.

145. Hobbs, J.R., Pan, F.: Time Ontology in OWL W3C Working Draft (2006)

146. Sahoo, S.S., Barga, R.S., Sheth, A.P., Thirunarayan, K., Hitzler, P.: PrOM: A Semantic Web Framework for Provenance Management in Science. Kno.e.sis Center, Wright State University (2009)

147. SPCDIS: Semantic Provenance Capture in Data Ingest Systems, http://tw.rpiscrews.us/wiki/SPCDIS, retrieved on July 15, 2010.

148. Inference Web, http://inference-web.org/, retrieved on July 15, 2010.

109

53. Sahoo, S.S., Bodenreider, O., Zeng, K., Sheth, A.P.: An Experiment in Integrationg Large Biomedical Knowledge Resources with RDF: Application to Associating Genotype and Phenotype Information. workshop on Health Care and Life Sciences Data Integration for the Semantic Web at the 16th International World Wide Web Conference (WWW2007), Banff, Canada (2007)

54. Sahoo, S.S., Sheth, A.P., York, W.S., Miller J.A.: Semantic Web Services for N-glycosylation Process. International Symposium on Web Services for Computational Biology and Bioinformatics, Blacksburg, VA, USA (2005)

55. Sahoo, S.S., Sheth, A.: Bioinformatics applications of Web Services, Web Processes and role of Semantics. In: Cardoso, J., Sheth, A. (ed.): Semantic Web Processes and Their Applications. Springer (2006) 305–322

56. Sahoo, S.S., Sheth, A.P., Hunter, B., York, W.S.: SemBOWSER - Semantic Biological Web Services Registry. In: Cheung, C.J.O.B.a.K.-H. (ed.): Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences. Springer (2007)

57. Pierce, M.J., (PI), Sheth, A., York, W.S., Miller, J.A., Kochut, K. (Co-PI): Integrated Technology Resource for Biomedical Glycomics. NIH (July 1, 2003 - June 30, 2008.)

58. Tan, W.C.: Provenance in Databases: Past, Current, and Future. IEEE Data Eng. Bull. 30 (2007) 3 -12

59. Goble, C.: Position Statement: Musings on Provenance, Workflow and (Semantic Web) Annotations for Bioinformatics. Workshop on Data Derivation and Provenance, Chicago (2002)

60. Wong, S.C., Miles, S., Fang, W., Groth, P. and Moreau, L. : Validation of E-Science Experiments using a Provenance-based Approach.: 4th UK e-Science All Hands Meeting (AHM), Nottingham, UK (2005)

61. Ontology Matching, http://www.ontologymatching.org/, retrieved on July 15, 2010.

62. McBride, B.: Jena: A Semantic Web Toolkit. IEEE Internet Computing 6 (Nov. 2002) 55-59

63. Sesame, http://openRDF.org, retrieved on July 15, 2010.

64. Haarslev, V., Möller, R.: Racer: A Core Inference Engine for the Semantic Web. 2nd International Workshop on Evaluation of Ontology-based Tools (EON2003), located at the 2nd International Semantic Web Conference ISWC 2003, Sanibel Island, Florida, USA (2003) 27–36

65. Sirin, E., Parsia, B., Cuenca Grau, B., Kalyanpur, A., Katz, Y.: Pellet: A practical OWL-DL reasoner. Journal of Web Semantics 2 (2007)

66. FaCT++, http://owl.man.ac.uk/factplusplus/, retrieved on July 15, 2010.

67. Geospatial ontology, http://www.w3.org/2005/Incubator/geo/XGR-geo-ont/, retrieved on July 15, 2010.

68. Environment ontology, http://sweet.jpl.nasa.gov/, retrieved on July 15, 2010.

69. Prud'hommeaux, E., Seaborne, A., "SPARQL Query Language for RDF", W3C Recommendation. W3C (2008)

110

70. Das, S.: RDF Support in Oracle RDBMS. Oracle Technical Presentation

71. Wood, D., Gearon, P., Adams, T.: Kowari: A platform for semantic web storage and analysis.: 14th International WWW Conference, Chiba Japan (2005)

72. Mapping SQL Data to RDF, http://virtuoso.openlinksw.com/wiki/main/Main/VOSRDF, retrieved on July 15, 2010

73. welkin, http://simile.mit.edu/welkin/, retrieved on July 15, 2010.

74. Noy, N.F., Sintek, M.,Decker, S.,Crubezy, M.,Fergerson, R.W.,Musen, M.A.: Creating Semantic Web Contents with Protege-2000. IEEE Intelligent Systems 16 60-71

75. Semantic Analytics Visualization, http://lsdis.cs.uga.edu/projects/semvis/, retrieved on July 15, 2010.

76. Arnold, J., Schuttler, H.-B, Logan, D., Battogtokh, D., Griffith, J., Arpinar, B., Bhandarkar, S., Datta, S., Kochut, K.J., Kraemer, E., Miller, J.A., Sheth, A., Strobel, G., Taha, T., Aleman-Meza, B., Doss, J., Harris, L., Nyong, A.: Metabolomics. Handbook of Industrial Mycology. Marcel Dekker, NY (2004) 597-633

77. Zhao, J., Goble, C., and Stevens, R.: Semantic web applications to e-science in silico experiments.: Proceedings of the 13th international World Wide Web Conference on Alternate Track Papers &Amp; Posters ACM Press, New York, NY, New York, NY, USA (2004) 284-285

78. Zhao, J., Wroe, C., Goble, C., Stevens, R., Quan, D., Greenwood, M.: Using Semantic Web Technologies for Representing e-Science Provenance. 3rd International Semantic Web Conference ISWC2004, Vol. LNCS 3298. Springer, Hiroshima, Japan (2004)

79. Taylor, C.F., Paton, N.W., Garwood, K.L., Kirby, P.D., Stead, D.A., Yin, Z., Deutsch, E.W., Selway, L., Walker, J., Riba-Garcia, I., Mohammed, S., Deery, M.J., Howard , Dunkley, T., Aebersold, R., Kell, D.B., Lilley, K.S., Roepstorff, P., Yates, J.R. 3rd, Brass, A., Brown, A.J., Cash, P., Gaskell, S.J., Hubbard, S.J., Oliver, S.G.: A systematic approach to modeling, capturing, and disseminating proteomics experimental data. Nat. Biotechnol. 21 ( 2003) 247-254

80. The Ontology for Biomedical Investigations, http://obi-ontology.org, retrieved on July 15, 2010.

81. Oberle, D., Ankolekar, A., Hitzler, P., Cimiano, P., Schmidt, C., Weiten, M., Loos, B., Porzel, R., Zorn, H.-P., Micelli, V., Sintek, M., Kiesel, M., Mougouie, B., Vembu, S., Baumann, S., Romanelli, M., Buitelaar, P., Engel, R., Sonntag, D., Reithinger, N., Burkhardt, F., Zhou, J.: DOLCE ergo SUMO: On Foundational and Domain Models in SWIntO (SmartWeb Integrated Ontology). Journal of Web Semantics: Science, Services and Agents on the World Wide Web (2007)

82. Niles, I., Pease, A. : Towards a Standard Upper Ontology. In: Welty, C., Smith, B. (ed.): 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), Ogunquit, Maine (2001)

83. Smith, B., Ceusters, W., Klagges, B., Kohler, J., Kumar, A., Lomax, J., Mungall, C., Neuhaus, F., Rector, A.L., Rosse, C.: Relations in biomedical ontologies. Genome Biol 6 (2005) R46

111

84. Goble, C., Wolstencroft, K., Goderis, A., Hull, D., Zhao, J., Alper, P., Lord, P., Wroe, C., Belhajjame, K., Turi, D., Stevens, R., Oinn, T., De Roure, D.: Knowledge discovery for biology with Taverna: Producing and consuming semantics in the Web of Science. In: Baker, C.J.O., Cheung, K.-H. (eds.): Semantic Web: Revolutionizing knowledge discovery in the life sciences. Springer, New York (2007) 355-395

85. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., and Hattori, M.: The KEGG resources for deciphering the genome. Nucleic Acids Res. 32 (2004) D277-D280

86. Romero, P., Wagg, J., Green, M.L., Kaiser, D., Krummenacker, M., and P.D. Karp: Computational prediction of human metabolic pathways from the complete human genome. Genome Biology 6 (2004) 1-17

87. Simmhan, Y.L., Plale, B., Gannon, D.: Karma2: Provenance management for data driven workflows. International Journal of Web Services Research 5 (2008)

88. Open Provenance Model, http://openprovenance.org/, retrieved on July 15, 2010.

89. Hartig, O., Zhao, J.: Publishing and Consuming Provenance Metadata on the Web of Linked Data. 3rd International Provenance and Annotation Workshop (IPAW), Troy, New York, USA (2010)

90. McGuinness, D.L., Pinheiro da Silva, P.: Explaining Answers from the Semantic Web: The Inference Web Approach. Journal of Web Semantics 1 (2004) 397-413

91. Chiticariu, L., Tan, W., Vijayvargiya, G.: DBNotes: a post-it system for relational databases based on provenance. ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, Baltimore, Maryland (2005) 942-944

92. Widom, J.: Trio: A System for Data, Uncertainty, and Lineage. In: Aggarwal, C. (ed.): Managing and Mining Uncertain Data. Springer (2008)

93. Borgida, A.: Description Logics in Data Management. IEEE Transactions on Knowledge and Data Engineering 7 (1995) 671-682

94. Moreau, L., Clifford, B., Freire, J., Gil, Y., Groth, P., Futrelle, J., Kwasnikowska, N., Miles, S., Missier, P., Myers, J., Simmhan, Y., Stephan, E., Van den Bussche, J.: The Open Provenance Model Core Specification (v1.1) (2009)

95. Provenance Challenge, http://twiki.ipaw.info/, retrieved on July 15, 2010.

96. Widom, J.: Trio: A System for Integrated Management of Data, Accuracy, and Lineage. Second Biennial Conference on Innovative Data Systems Research (CIDR '05), Pacific Grove, California (2005)

97. Karvounarakis, G., Ives, Z.G., Tannen, V.: Querying Data Provenance. Proceedings of the 2010 International Conference on Management of Data. ACM, New York, NY, Indianapolis, Indiana, USA (2010) 951-962

98. Anand, M.K., Bowers, S., Ludäscher, B.: Techniques for efficiently querying scientific workflow provenance graphs. In: Manolescu, I., Spaccapietra, S., Teubner, J., Kitsuregawa, M., Léger, A., Naumann, F., Ailamaki, A., Özcan, F. (eds.): Proceedings of the 13th International Conference on Extending Database Technology, Vol. 426. ACM, New York, NY, Lausanne, Switzerland (2010) 287-298

99. Geerts, F., Kementsietsidis, A., Milano, D.: MONDRIAN: Annotating and querying databases through colors and blocks. International Conference on Data Engineering (ICDE), Atlanta, GA (2006)

100. Kementsietsidis, A., Wang, M.: Provenance query evaluation: what's so special about it? Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, New York, NY, Hong Kong, China (2009) 681-690

101. Chapman, A.P., Jagadish, H.V., Ramanan, P.: Efficient provenance storage. ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, Vancouver, Canada (2008) 993-1006

102. Davidson, S.B., Freire, J.: Provenance and scientific workflows: challenges and opportunities. ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, Vancouver, Canada (2008)

103. Simmhan, Y., Plale, B., Gannon, D., Marru, S.: Performance Evaluation of the Karma Provenance Framework for Scientific Workflows. In: Moreau, L., Foster, I. (eds.): Provenance and Annotation of Data. Springer, Berlin/Heidelberg (2006) 222-236

104. Project Neptune, http://www.neptune.washington.edu/, retrieved on July 15, 2010.

105. Semantics and Services enabled Problem Solving Environment for T.cruzi, http://wiki.knoesis.org/index.php/Trykipedia, retrieved on July 15, 2010.

106. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press (2003)

107. Klyne, G., Carroll, J. J.: Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Recommendation (2004)

108. Wang, Y.R., Madnick, S.E.: A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective. 16th VLDB Conference (1990) 519-538

109. Lee, T., Bressan, S.: Multimodal Integration of Disparate Information Sources with Attribution. Entity Relationship Workshop on Information Retrieval and Conceptual Modeling (1997)

110. Alexe, B., Chiticariu, L., Tan, W.-C.: SPIDER: a Schema mapPIng DEbuggeR. VLDB (2006) 1179-1182

111. Haas, L.M., Hernández, M.A., Ho, H., Popa, L., Roth, M.: Clio Grows Up: From Research Prototype to Industrial Tool. ACM SIGMOD, Baltimore, MD (2005) 805-810

112. Holland, D.A., Seltzer, M.I., Braun, U., Muniswamy-Reddy, K.: PASSing the provenance challenge. Concurrency and Computation: Practice and Experience 20 (2008) 531-540

113. Zhao, J., Goble, C., Stevens, R., Turi, D.: Mining Taverna's semantic web of provenance. Concurrency and Computation: Practice and Experience (2007)

114. Chong, E.I., Das, S., Eadon, G., Srinivasan, J.: An efficient SQL-based RDF querying scheme. 31st International Conference on Very Large Data Bases. VLDB Endowment, Trondheim, Norway (2005) 1216-1227

115. Angles, R., Gutierrez, C.: Survey of graph database models. ACM Comput. Surv. 40 (2008) 1-39

116. Vardi, M.: The Complexity of Relational Query Languages. 14th Ann. ACM Symp. Theory of Computing (STOC '82) (1982) 137-146

117. Pérez, J., Arenas, M., Gutiérrez, C.: Semantics and Complexity of SPARQL. Int'l Semantic Web Conf. (ISWC '06), Vol. 4273/2006, Athens, GA (2006) 30-43

118. Trident Workflow Workbench, http://www.microsoft.com/mscorp/tc/trident.mspx, retrieved on July 15, 2010.

119. The VisTrails Project, http://www.vistrails.org/, retrieved on July 15, 2010.

120. Scheidegger, C., Koop, D., Santos, E., Vo, H., Callahan, S., Freire, J., Silva, C.: Tackling the Provenance Challenge One Layer at a Time. Concurrency and Computation: Practice and Experience 20 (2007) 473-483

121. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32 (2004) D267-D270

122. Bodenreider, O., Rindflesch, T.C.: Advanced library services: Developing a biomedical knowledge repository to support advanced information management applications. Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, Maryland (2006)

123. RDB2RDF XG Final Report. World Wide Web Consortium (2009).

124. MedlinePlus. National Library of Medicine (US), Bethesda, MD (2010)

125. ter Horst, H.J.: Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary. Journal of Web Semantics 3 (2005) 79-115

126. Carroll, J.J., Bizer, C., Hayes, P., Stickler, P.: Named graphs, provenance and trust. 14th International Conference on World Wide Web. ACM, New York, NY, Chiba, Japan (2005) 613-622

127. Ding, L., Finin, T., Joshi, A., Peng, J., Pinheiro da Silva, P., McGuinness, D.L.: Tracking RDF Graph Provenance using RDF Molecules. Proceedings of the 4th International Semantic Web Conference (2005)

128. Flouris, G., Fundulaki, I., Pediaditis, P., Theoharis, Y., Christophides, V.: Coloring RDF Triples to Capture Provenance. International Semantic Web Conference 2009, Washington D.C., USA (2009) 196-212

129. Guha, R.V.: Contexts: A Formalization and Some Applications. PhD Thesis, Stanford University (1991)

130. Guha, R.V., McCarthy, J.: Varieties of Contexts. CONTEXT 2003 (2003) 164-177

131. McCarthy, J.: Generality in artificial intelligence. Formalizing Common Sense: Papers by John McCarthy (1990) 226-236

132. Nayak, P.P.: Representing multiple theories. In: Hayes-Roth, B., Korf, R.E. (eds.): AAAI-94, Menlo Park, CA, USA (1994) 1154-1160

133. Buvač, S., Mason, I.: Propositional logic of context. AAAI (1993) 412-419

134. Giunchiglia, F., Ghidini, C.: Local models semantics, or contextual reasoning = locality+compatibility. In: Cohn, A.G., Schubert, L., Shapiro, S.C. (eds.): KR'98: Principles of Knowledge Representation and Reasoning. Morgan Kaufmann, San Francisco, CA, USA (1998) 282-289

135. Guha, R., McCool, R., Fikes, R.: Contexts for the Semantic Web. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.): International Semantic Web Conference, Vol. 3298. Springer, Hiroshima, Japan (2004) 32-46

136. Ayers, D., Völkel, M.: Cool URIs for the Semantic Web. In: Sauermann, L., Cyganiak, R. (eds.): Working Draft. W3C (2008)

137. Wikipedia, http://www.wikipedia.org/, retrieved on July 15, 2010.

138. Clinical Trials, http://clinicaltrials.gov/, retrieved on July 15, 2010.

139. Weatherly, B., Atwood, J., Minning, T., Cavola, C., Tarleton, R., Orlando, R.: A heuristic method for assigning a false-discovery rate for protein identifications from Mascot database search results. Mol. Cell. Proteomics 4 (2005) 762-772

140. Kelly, B.K., Anderson, P.E., Reo, N.V., DelRaso, N.J., Doom, T.E., Raymer, M.L.: A proposed statistical protocol for the analysis of metabolic toxicological data derived from NMR spectroscopy. 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE 2007), Vol. II, Cambridge - Boston, Massachusetts, USA (2007) 1414-1418

141. Oracle Life Sciences Users Group, http://www.olsug.org/, retrieved on July 15, 2010.

142. Aurrecoechea, C., Heiges, M., Wang, H., Wang, Z., Fischer, S., Rhodes, P., Miller, J., Kraemer, E., Stoeckert, C.J., Jr., Roos, D.S., Kissinger, J.C.: ApiDB: integrated resources for the apicomplexan bioinformatics resource center. Nucleic Acids Research 35 (2007) D427-D430

143. NCI Thesaurus, http://ncit.nci.nih.gov, retrieved on July 15, 2010.

144. Nguyen, V., Sahoo, S.S., Parikh, P., Minning, T., Weatherly, B., Logan, F., Sheth, A., Tarleton, R.: Biomedical Ontologies for Parasite Research. ISMB 2010, July 11-13, Boston, MA, USA (2010)

145. Hobbs, J.R., Pan, F.: Time Ontology in OWL. W3C Working Draft (2006)

146. Sahoo, S.S., Barga, R.S., Sheth, A.P., Thirunarayan, K., Hitzler, P.: PrOM: A Semantic Web Framework for Provenance Management in Science. Kno.e.sis Center, Wright State University (2009)

147. SPCDIS: Semantic Provenance Capture in Data Ingest Systems, http://tw.rpiscrews.us/wiki/SPCDIS, retrieved on July 15, 2010.

148. Inference Web, http://inference-web.org/, retrieved on July 15, 2010.
