A Context Sensitive Model for Querying Linked Scientific Data
Total Page:16
File Type:pdf, Size:1020Kb
A context sensitive model for querying linked scientific data Peter Ansell Bachelor of Science/Bachelor of Business (Biomedical Science and IT) Avondale College December 2005 Bachelor of IT (Hons., IIA) Queensland University of Technology December 2006 A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy November, 2011 Principal Supervisor: Professor Paul Roe Associate Supervisor: A/Prof James Hogan * School of Information Technology Faculty of Science and Technology Queensland Univesity of Technology Brisbane, Queensland, AUSTRALIA c Copyright by Peter Ansell 2011. All Rights Reserved. The author hereby grants permission to the Queensland University of Technology to reproduce and redistribute publicly paper and electronic copies of this thesis document in whole or in part. Keywords: Semantic web, RDF, Distributed databases, Linked Data iii iv Abstract This thesis provides a query model suitable for context sensitive access to a wide range of distributed linked datasets which are available to scientists using the Internet. The model is designed based on scientific research standards which require scientists to pro- vide replicable methods in their publications. Although there are query models available that provide limited replicability, they do not contextualise the process whereby different scientists select dataset locations based on their trust and physical location. In different contexts, scientists need to perform different data cleaning actions, independent of the overall query, and the model was designed to accommodate this function. The query model was implemented as a prototype web application and its features were verified through its use as the engine behind a major scientific data access site, Bio2RDF.org. The prototype showed that it was possible to have context sensitive behaviour for each of the three mirrors of Bio2RDF.org using a single set of configuration settings. The prototype provided executable query provenance that could be attached to scientific publications to fulfil replicability requirements. The model was designed to make it simple to independently interpret and execute the query provenance documents using context specific profiles, without modifying the original provenance documents. Exper- iments using the prototype as the data access tool in workflow management systems confirmed that the design of the model made it possible to replicate results in different contexts with minimal additions, and no deletions, to query provenance documents. v vi Contents Acknowledgements xvii 1 Introduction 1 1.1 Scientific data . .5 1.1.1 Distributed data . .8 1.1.2 Science example . .9 1.2 Problems . 14 1.2.1 Data quality . 15 1.2.2 Data trust . 17 1.2.3 Context sensitivity and replication . 20 1.3 Research questions . 21 1.4 Thesis contributions . 21 1.5 Publications . 21 1.6 Research artifacts . 22 1.7 Thesis outline . 23 2 Related work 25 2.1 Overview . 25 2.2 Early Internet and World Wide Web . 30 2.2.1 Linked documents . 31 2.3 Dynamic web services . 31 2.3.1 SOAP Web Service based workflows . 32 2.4 Scientific data formats . 33 2.5 Semantic web . 35 2.5.1 Linked Data . 36 2.5.2 SPARQL . 37 2.5.3 Logic . 38 2.6 Conversion of scientific datasets to RDF . 39 2.7 Custom distributed scientific query applications . 40 2.8 Federated queries . 43 3 Model 47 3.1 Overview . 47 3.2 Query types . 52 vii 3.2.1 Query groups . 57 3.3 Providers . 58 3.3.1 Provider groups . 59 3.3.2 Endpoint implementation independence . 60 3.4 Normalisation rules . 61 3.4.1 URI compatibility . 64 3.5 Namespaces . 66 3.5.1 Contextual namespace identification . 67 3.6 Model configurability . 68 3.6.1 Profiles . 68 3.6.2 Multi-dataset locations . 69 3.7 Formal model algorithm . 70 3.7.1 Formal model specification . 72 4 Integration with scientific processes 75 4.1 Overview . 75 4.2 Data exploration . 76 4.2.1 Annotating data . 77 4.3 Integrating Science and Medicine . 78 4.4 Case study: Isocarboxazid . 79 4.4.1 Use of model features . 84 4.5 Web Service integration . 89 4.6 Workflow integration . 90 4.7 Case Study : Workflow integration . 92 4.7.1 Use of model features . 95 4.7.2 Discussion . 97 4.8 Auditing semantic workflows . 98 4.9 Peer review . 101 4.10 Data-based publication and analysis . 102 5 Prototype 107 5.1 Overview . 107 5.2 Use on the Bio2RDF website . 108 5.3 Configuration schema . 109 5.4 Query type . 112 5.4.1 Template variables . 115 5.4.2 Static RDF statements . 118 5.5 Provider . 119 5.6 Namespace . 119 5.7 Normalisation rule . 121 5.7.1 Integration with other Linked Data . 124 5.7.2 Normalisation rule testing . 126 5.8 Profiles . 126 viii 6 Discussion 129 6.1 Overview . 129 6.2 Model . 130 6.2.1 Context sensitivity . 131 6.2.2 Data quality . 133 6.2.3 Data trust . 137 6.2.4 Provenance . 139 6.3 Prototype . 141 6.3.1 Context sensitivity . 141 6.3.2 Data quality . 144 6.3.3 Data trust . 150 6.3.4 Provenance . 153 6.4 Distributed network issues . 155 6.5 Prototype statistics . 158 6.6 Comparison to other systems . 162 7 Conclusion 165 7.1 Critical reflection . 167 7.1.1 Model design review . 167 7.1.2 Prototype implementation review . 170 7.1.3 Configuration maintenance . 171 7.1.4 Trust metrics . 172 7.1.5 Replicable queries . 173 7.1.6 Prominent data quality issues . 173 7.2 Future research . 175 7.3 Future model and prototype extensions . 178 7.4 Scientific publication changes . 181 A Glossary 185 B Common Ontologies 187 ix x List of Figures 1.1 Example Scientific Method . .3 1.2 Flow of information in living cells . .4 1.3 Integrating scientific knowledge . .5 1.4 Use of information by scientists . .6 1.5 Storing scientific data . .7 1.6 Data issues . .9 1.7 Datasets related to flow of information in living cells . 10 1.8 Different methods of referencing Uniprot in Entrez Gene dataset . 11 1.9 Different methods of referencing Entrez Gene in Uniprot dataset . 12 1.10 Different methods of referencing Entrez Gene and Uniprot in HGNC dataset 12 1.11 Indirect links between DailyMed, Drugbank, and KEGG . 13 1.12 Direct links between DailyMed, Drugbank, and KEGG . 14 1.13 Bio2RDF website . ..