Modelling and Computing the Quality of Scientific Information on the Web of Data
Total Page:16
File Type:pdf, Size:1020Kb
MODELLING AND COMPUTING THE QUALITY OF SCIENTIFIC INFORMATION ON THE WEB OF DATA A thesis submitted to the University of Manchester for the degree of Doctor of Philosophy in the Faculty of Engineering and Physical Sciences 2014 By Matthew Gamble School of Computer Science Contents List of Tables 7 List of Figures 11 List of Code Listings 13 Abstract 14 Declaration 15 Copyright 16 Table of Artefacts 21 Acknowledgements 24 1 Introduction 26 1.1 Problem . 26 1.1.1 A Motivating Question . 30 1.1.2 Quality Knowledge and the Information Quality Life-cycle 31 1.2 Examples of Quality Knowledge for Scientific Data on the Web . 32 1.2.1 Minimum Information Checklists . 32 1.2.2 Gene Ontology Annotations Quality Score . 34 1.2.3 Online Chemical Structure Repositories . 36 1.3 Research Aims and Objectives . 37 1.4 Objective, Predictive, and Subjective . 37 1.5 Research Hypotheses . 39 1.6 Methodology and Approach . 40 1.7 Research Contributions . 42 2 1.8 Publications and Research Activity . 44 1.8.1 Publications . 45 1.8.2 Research Activity . 46 1.9 Thesis Organization . 46 2 Background 48 2.1 Chapter Introduction . 48 2.2 What is Data Quality? . 49 2.2.1 The IQ Life-Cycle . 53 2.3 Quality Knowledge and the Web of Data . 58 2.3.1 The Linked Data Approach . 58 2.3.2 Examining Quality Knowledge . 63 2.4 OPS IQ: an Objective, Predictive and Subjective Classification . 69 2.4.1 Using the Classification . 72 2.5 The Role of Provenance . 75 2.5.1 What is Provenance? . 75 2.5.2 State of Provenance in the Web of Data . 78 2.6 Related Work in Information Quality Assessment . 82 2.7 Summary and Conclusions . 95 3 MIM: a Minimum Information Model 97 3.1 Chapter Introduction . 97 3.2 Minimum Information Checklists . 98 3.3 A WikiProject Chemicals Case-Study . 102 3.3.1 Extending the IQ Life-cycle . 107 3.3.2 chemmim: A Checklist for Chemical Compound Data . 108 3.4 Minimum Information Checklist Structural Meta-Study . 112 3.5 The MIM Vocabulary . 115 3.5.1 Describing a Checklist . 117 3.5.2 Reporting Against a Checklist . 120 3.5.3 Checklist Satisfaction . 123 3.6 Implementation . 125 3.6.1 Checklist Satisfaction using SPIN . 127 3.7 Evaluating chembox Data with Chemmim . 129 3.7.1 Iterative Assessment - The IQ Life-cycle . 133 3.8 Comparison with other approaches . 134 3 3.8.1 minim . 134 3.8.2 OWL and OWL Integrity Constraints . 136 3.9 Discussion . 138 3.10 Future Work for MIM . 139 3.11 Chapter Summary . 141 4 Quality Fragments: a Probabilistic Approach 142 4.1 Chapter Introduction . 142 4.2 Modelling Predictive Quality Knowledge . 143 4.3 Bayesian Networks . 146 4.3.1 Modelling the Variables of a Metric: Continuous vs. Discrete150 4.3.2 Modelling the GAQ Metric . 151 4.3.3 Modular Metrics . 154 4.4 Multi-Entity Bayesian Networks . 156 4.5 Implementation . 163 4.5.1 Quality Knowledge Encoding with PR-OWL and the Evi- dent Vocabulary. 166 4.6 Assessing Bio2RDF data with Quality Fragments . 169 4.6.1 Data Preparation . 170 4.6.2 Replicating the GAQ Metric. 171 4.6.3 GAQ Assessment with Missing Metadata . 173 4.7 Discussion . 179 4.7.1 Some Notes on Implementation . 181 4.8 Comparison with Related Work . 182 4.9 Future Work for Quality Fragments . 183 4.10 Chapter Summary . 184 5 Procedurally Building Quality Fragments using Provenance 186 5.1 Chapter Introduction . 186 5.2 Provenance-based Quality Knowledge . 187 5.2.1 Our Intuition . 190 5.3 PROV . 192 5.4 Influence Factor . 199 5.4.1 Modelling evident:influenceFactor . 200 5.5 Quality Fragment Generation Procedure . 202 5.5.1 Assumptions and Restrictions . 203 4 5.5.2 Structure Generation . 204 5.5.3 Combination Strategy . 206 5.5.4 Influence Factor Normalization . 208 5.6 Evaluation . 212 5.6.1 The Zeng Network . 213 5.6.2 Data Preparation . 217 5.6.3 The Evident Generator . 220 5.6.4 The Evident Vocabulary for Provenance-aware Quality Frag- ments. 221 5.6.5 The Generated Quality Fragment . 222 5.6.6 Comparison of the Generated Network With the Zeng Net- work.............................. 225 5.7 Comparison with Related Work . 232 5.8 Conclusions and Future Work for Quality Fragment Generation . 233 5.9 Chapter Summary . 235 6 Conclusions 236 6.1 Summary of Research Contributions . 236 6.2 Future Work . 241 Bibliography 245 A Minimum Information Checklist Analysis 275 B The MIM Vocabulary 279 C The chemmim Checklist 285 D The mimspin MIM Validation Rules 288 E The Evident Vocabulary 305 F GAQ Multi-Entity Bayesian Network 308 G Wikiprov RDF Serialization Example 324 5 Word Count: 65331 6 List of Tables 3 Table of Artefacts that Support this Thesis . 21 2.1 Information Quality Dimensions Classified using OPS IQ . 74 2.2 Provenance Vocabularies . 79 2.3 Provenance Vocabulary Usage in the Web of Data . 81 2.4 IQ Assessment Frameworks . 83 3.1 Summary of Minimum Information Checklist Analysis . 112 4.1 Summary of Evaluation Datasets . 173 4.2 Results for GAQ Mfrags compared with GA2GAQ.pl . 174 5.1 The Core Elements of the PROV Data Model . 196 5.2 Summary of PROV Influence Types in ProvBench . 198 5.3 tAi CPT................................ 215 5.4 AuthorTrust CPT . 215 5.5 Results from the Zeng Network . 216 5.6 Statistics of the Training Dataset . 218 5.7 Statistics of the Test Dataset . 218 5.8 Results from the article Q Quality Fragment compared with the Zeng Network . 226 5.9 article Q Results for Featured and Cleanup Articles in Training Set230 5.10 article Q Results for Featured and Cleanup Articles in Test Set . 231 7 List of Figures 1.1 The Minimum Information about and RNAi Experiment (MIARE) Checklist (extract). 33 1.2 Annotation of UniprotKB P06727 with the term \lipoprotein metabolic process" (GO:0042157) in the Gene Ontology Annotations Database. 34 1.3 Position of the term \lipoprotein metabolic process" (GO:0042157) in the Gene Ontology. 35 1.4 The Three Aspects of Quality Knowledge. 37 1.5 A Readers Guide to the Structure and Dependencies in this Thesis 41 2.1 Two-dimensional matrix illustrating a spectrum of applicability and effectiveness of IQ metrics. 51 2.2 Map of Concepts used in this Thesis . 54 2.3 The Information Quality Life-cycle [Mis08] . 54 2.4 Abstract Representation of the GAQ score Assessment processes using a Reusable Quality Component for the GAQ Score . 55 2.5 Abstract Representation of a Quality View reusing the GAQ Score Assessment . 57 2.6 The Linking Open Data cloud diagram from [CJ11] . 59 2.7 Abstract Representation of a Quality Standard based Assessment 65 2.8 Evidence Code Ranks (ECRs) as Defined by Buza et al. as an Example of Prior Knowledge . 67 2.9 Abstract Representation of a Quality Fragment based Assessment 67 2.10 Abstract Representation of a Quality View based Assessment . 69 2.11 Using the OPS IQ classification. 72 2.12 Core Elements of the PROV Provenance Model [GM13] . 77 2.13 Example PROV Provenance Graph for a GO Annotation. 78 2.14 The WIQA Filtering Processes. 84 2.15 Intrinsic vs. Provenance-based Quality Knowledge in the GAQ score. 89 8 2.16 Example Bayesian Network Metric from Wang et al. [WV05] . 92 3.1 The Minimum Information about and RNAi Experiment (MIARE) Checklist (Assay Requirement Set). 99 3.2 A Minimum Information Checklist Solution in the Web of Data. 101 3.3 The Wikipedia Article and Chembox for Bicarbonate . 103 3.4 Interaction Between the IQ Life-cycle and Linked Data Extraction processes. 108 3.5 The 11 Requirements of the Chemmim Checklist. 109 3.6 Comparison of Wikipedia Article for Aluminium Hydroxide Oxide (Stub) and Acetic Acid (A). 111 3.7 The Anatomy of a Minimum Information Checklist. 113 3.8 The Minimum Information Model Vocabulary. 116 3.9 Creating an ObjectReport and Aligning it with a Checklist Re- quirement. 121 3.10 Creating a DataReport using a blank node and Aligning it with a Checklist Requirement. 121 3.11 Creating a ReportSet and Aligning it with a Checklist Requirement Set. 122 3.12 Associating the InChI Requirement with a SPARQL Query to Au- tomatically Generate Reports. 123 3.13 Requirement Satisfaction. 124 3.14 Requirement Satisfaction. ..