A context sensitive model for querying linked scientific data

Peter Ansell Bachelor of Science/Bachelor of Business (Biomedical Science and IT) Avondale College December 2005

Bachelor of IT (Hons., IIA) Queensland University of Technology December 2006

A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy November, 2011

Principal Supervisor: Professor Paul Roe
Associate Supervisor: A/Prof James Hogan

School of Information Technology, Faculty of Science and Technology, Queensland University of Technology, Brisbane, Queensland, AUSTRALIA

© Copyright by Peter Ansell 2011. All Rights Reserved. The author hereby grants permission to the Queensland University of Technology to reproduce and redistribute publicly paper and electronic copies of this thesis document in whole or in part.

Keywords: Semantic web, RDF, Distributed databases, Linked Data

Abstract

This thesis provides a query model suitable for context sensitive access to a wide range of distributed linked datasets which are available to scientists using the Internet. The model is designed based on scientific research standards which require scientists to provide replicable methods in their publications. Although there are query models available that provide limited replicability, they do not contextualise the process whereby different scientists select dataset locations based on their trust and physical location. In different contexts, scientists need to perform different data cleaning actions, independent of the overall query, and the model was designed to accommodate this function. The query model was implemented as a prototype web application and its features were verified through its use as the engine behind a major scientific data access site, Bio2RDF.org. The prototype showed that it was possible to have context sensitive behaviour for each of the three mirrors of Bio2RDF.org using a single set of configuration settings. The prototype provided executable query provenance that could be attached to scientific publications to fulfil replicability requirements. The model was designed to make it simple to independently interpret and execute the query provenance documents using context specific profiles, without modifying the original provenance documents. Experiments using the prototype as the data access tool in workflow management systems confirmed that the design of the model made it possible to replicate results in different contexts with minimal additions, and no deletions, to query provenance documents.

Contents

Acknowledgements

1 Introduction
  1.1 Scientific data
    1.1.1 Distributed data
    1.1.2 Science example
  1.2 Problems
    1.2.1 Data quality
    1.2.2 Data trust
    1.2.3 Context sensitivity and replication
  1.3 Research questions
  1.4 Thesis contributions
  1.5 Publications
  1.6 Research artifacts
  1.7 Thesis outline

2 Related work
  2.1 Overview
  2.2 Early Internet and World Wide Web
    2.2.1 Linked documents
  2.3 Dynamic web services
    2.3.1 SOAP Web Service based workflows
  2.4 Scientific data formats
  2.5 Semantic web
    2.5.1 Linked Data
    2.5.2 SPARQL
    2.5.3 Logic
  2.6 Conversion of scientific datasets to RDF
  2.7 Custom distributed scientific query applications
  2.8 Federated queries

3 Model
  3.1 Overview
  3.2 Query types
    3.2.1 Query groups
  3.3 Providers
    3.3.1 Provider groups
    3.3.2 Endpoint implementation independence
  3.4 Normalisation rules
    3.4.1 URI compatibility
  3.5 Namespaces
    3.5.1 Contextual namespace identification
  3.6 Model configurability
    3.6.1 Profiles
    3.6.2 Multi-dataset locations
  3.7 Formal model algorithm
    3.7.1 Formal model specification

4 Integration with scientific processes
  4.1 Overview
  4.2 Data exploration
    4.2.1 Annotating data
  4.3 Integrating Science and Medicine
  4.4 Case study: Isocarboxazid
    4.4.1 Use of model features
  4.5 Web Service integration
  4.6 Workflow integration
  4.7 Case study: Workflow integration
    4.7.1 Use of model features
    4.7.2 Discussion
  4.8 Auditing semantic workflows
  4.9 Peer review
  4.10 Data-based publication and analysis

5 Prototype
  5.1 Overview
  5.2 Use on the Bio2RDF website
  5.3 Configuration schema
  5.4 Query type
    5.4.1 Template variables
    5.4.2 Static RDF statements
  5.5 Provider
  5.6 Namespace
  5.7 Normalisation rule
    5.7.1 Integration with other Linked Data
    5.7.2 Normalisation rule testing
  5.8 Profiles

6 Discussion
  6.1 Overview
  6.2 Model
    6.2.1 Context sensitivity
    6.2.2 Data quality
    6.2.3 Data trust
    6.2.4 Provenance
  6.3 Prototype
    6.3.1 Context sensitivity
    6.3.2 Data quality
    6.3.3 Data trust
    6.3.4 Provenance
  6.4 Distributed network issues
  6.5 Prototype statistics
  6.6 Comparison to other systems

7 Conclusion
  7.1 Critical reflection
    7.1.1 Model design review
    7.1.2 Prototype implementation review
    7.1.3 Configuration maintenance
    7.1.4 Trust metrics
    7.1.5 Replicable queries
    7.1.6 Prominent data quality issues
  7.2 Future research
  7.3 Future model and prototype extensions
  7.4 Scientific publication changes

A Glossary

B Common Ontologies

List of Figures

1.1 Example Scientific Method
1.2 Flow of information in living cells
1.3 Integrating scientific knowledge
1.4 Use of information by scientists
1.5 Storing scientific data
1.6 Data issues
1.7 Datasets related to flow of information in living cells
1.8 Different methods of referencing Uniprot in Entrez Gene dataset
1.9 Different methods of referencing Entrez Gene in Uniprot dataset
1.10 Different methods of referencing Entrez Gene and Uniprot in HGNC dataset
1.11 Indirect links between DailyMed, Drugbank, and KEGG
1.12 Direct links between DailyMed, Drugbank, and KEGG
1.13 Bio2RDF website

2.1 Linked Data Naive Querying
2.2 Non Symmetric Linked Data
2.3 Distributed data access versus local data silo
2.4 Heterogeneous datasets
2.5 Linked Data access methods
2.6 Using Linked Data to retrieve information

3.1 Comparison of model to federated RDF query models
3.2 Query parameters
3.3 Example: Search for Marplan in Drugbank
3.4 Search for all references to disease
3.5 Search for references to disease in a particular namespace
3.6 Search for references to disease using a new query type
3.7 Search for all references in a local curated dataset
3.8 Query groups
3.9 Providers
3.10 Provider groups
3.11 Normalisation rules
3.12 Single query template across homogeneous datasets

4.1 Medicine related RDF datasets

4.2 Links between datasets in Isocarboxazid case study
4.3 Integration of prototype with Semantic Web Pipes
4.4 Semantic tagging process
4.5 Peer review process
4.6 Scientific publication cycle

5.1 URL resolution using prototype
5.2 Simple system configuration in Turtle RDF file format
5.3 Optimising a query based on context
5.4 Public namespaces and private identifiers
5.5 Template parameters
5.6 Uses of static RDF statements
5.7 Context sensitive use of providers by prototype
5.8 Namespace overlap between configurations

6.1 Enforcing syntactic data quality
6.2 Different paging strategies

7.1 Overall solutions

List of Tables

2.1 Bio2RDF dataset sizes

3.1 Normalisation rule stages

5.1 Normalisation rule methods in prototype

6.1 Context change scenarios
6.2 Comparison of data quality abilities in different systems
6.3 Comparison of data trust capabilities of various systems
6.4 Bio2RDF prototype statistics

Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signature: Peter Ansell

Date:

Acknowledgements

I dedicate this thesis to my wife, Karina Ansell. She has been very supportive of my research throughout my candidature. I would also like to thank my supervisors for the effort they put into guiding me through the process.

Chapter 1

Introduction

There are many different ways of sharing information using current communications technology. A business may write a marketing letter to a customer, or they may email the information with a link to their website. A grandparent may send a birthday card to their grandchild, or they may write a note on a social networking website. A politician may make a speech on television, or they may self-publish a video on a video-sharing website. A scientist may publish an article in a paper based journal, while simultaneously publishing it on a website. In each case, the information needs to be shared with others to have an effect. In some cases, the information needs to be processed by humans, while in other cases computers must be able to process and integrate the data. In all cases, the data must be accessible by the interested parties.

In a social network a person may want to plan and share details about an upcoming event. For example, they may want to plan a menu based on the expected guests at a dinner party. If the social network made it possible for people to share their meal preferences with their friends, the host may be able to use this data to determine which dishes would be acceptable. Although the host could print out a list of meal preferences from the social networking site and compare them to each of their recipes, it would be more efficient for the host to automatically derive a list of recipes that match the guests' criteria. For the process to work effectively, the preferences on the social networking site must be matched with the ingredients on the recipe website.

Business and social networks do not require their information to be publicly available for it to be useful, especially as the privacy of information is paramount in attracting users. In the dinner party example, the guests do not need to publicly disclose preferences, and the recipe website does not require information about each guest individually. In order for the menu planning system to work, it needs the guests to include information about which ingredients and recipes they like and dislike. This information is then accessible to their social networked friends, who can use the recipe site without their friends having to be members.

Social networks and business-to-business networks rely heavily on data integration; for example, the foods on the social networking site need to be matched to the meal ingredients on the recipe site. However, in both cases there are a small number of types of data to represent.

For example, social networks may only need to be able to represent people or agents, and networks or groups, while businesses may only need to exchange information about suppliers, customers and stock. These conditions may have contributed to the lack of a large set of interlinked data available about business or social networks, as the low level of complexity makes it possible for humans to examine and manually link datasets as needed. The dinner party example is not functional at this time, as there are currently no datasets available with information about recipe ingredients that have been used by social networks to create meal preference profiles. However, both social networks and businesses may be encouraged to publish more information in the future based on schemas such as FOAF [57] and GoodRelations [68], including the recently published LinkedOpenCommerce site¹.

Networks are also important for scientists as they provide a fast and effective way for them to publicly provide and exchange all of the relevant data for their publications with other scientists in order for their data and results to be peer reviewed and replicated [51]. They perform experiments using data from public and private datasets to derive new results using variations of the scientific method, possibly similar to those described by Kuhn [81], Crawford and Stucki [43], or Godfrey-Smith [54]. An example of a workflow describing one of the many scientific methods is shown in Figure 1.1. In some sciences, such as biology and chemistry, there are a large number of different sources of curated data that contain links between datasets, as compared to other sciences where there may be a limited number of sources of data for scientists to integrate and crosscheck their experiments against. In cellular biology scientists need to integrate various types of data, including data about genetic material, genes, translated genes, proteins, regulatory networks, protein functions, and others, as shown in Figure 1.2. This data is generally distributed across a number of datasets that focus either on a particular disease or organism, such as cancer [120] or flies [150], or on a particular type of information, such as proteins [19].

These biological datasets contain data from many different scientists and it is important that it is accurately described and regularly updated. Scientists continually update these datasets using new data along with links to relevant data items in other datasets. For example, Figure 1.3 shows the changes necessary to reflect the new discovery that a gene found in humans is effectively identical to a gene in mice.

There are a wide range of ways that well linked data can be used by scientists, including the motivation and hypothesis for an experiment, during the design of the experiment, and in the publication of the results, as shown in Figure 1.4. For example, the scientist would use the fact that the gene was equivalent in humans and mice as the motivation, with the hypothesis that the gene may cause cancer in mice. The characteristics of the cancer, including the fact that it occurs in bones, would be used as part of the experiment design, and the results would contain the implication that the gene in mice causes cancer. The combined knowledge about the genes, the cancer, and the resulting implications would be used by peer reviewers to determine whether the resulting knowledge was publishable.

¹ http://linkedopencommerce.com/

[Figure 1.1: Example Scientific Method]

[Figure 1.2: Flow of information in living cells. Original figure source: Kenzelmann, Rippe, and Mattick (2006). doi:10.1038/msb4100086]

As the amount of data grows, and more scientists contribute data about different types of concepts, it is typical for scientists to store data about concepts such as genes and diseases in different locations. Scientists are then able to access both datasets efficiently based on the types of data that they need for their research. The combined data is discoverable by scientists accessing multiple datasets, using the links contained in each dataset for correlation, as shown in Figure 1.5. In that example, the data needs to be integrated for future researchers to study the similar effects of the gene in both humans and mice. Scientists can use the link specifying that the genes are identical to imply that the gene may cause cancer in mice, before testing this hypothesis with an experiment.

Although most scientific publications only demonstrate the final set of steps that were used to determine the novel result, the process of determining goals and eliminating strategies which are not useful is important. The resulting publications would contain links to the places where the knowledge could be found. The information in the publication would make it possible for other scientists to reproduce the experiment, and provide evidence for the knowledge that the gene causes cancer in both humans and mice.

Scientific peers may want to reproduce or extend an experiment using their own resources. These experiments are very similar in structure to the original experiments, but the data access and processing resources need to be changed to fit the scientist's context.

[Figure 1.3: Integrating scientific knowledge]

For example, they may have personally curated a copy of a dataset, and wish to substitute it for the publicly available copy that was originally used. This dataset may not have the same structure, and it may not be accessible using the same method as the public dataset. In the example, a peer may have a dataset that contains empirical data related to the expression of a gene in various mice and wish to use this as part of the experiment. If the curated dataset contains enough information to satisfy the goals of the experiment, the peer should be able to easily substitute the datasets to replicate the experiment.

1.1 Scientific data

There are a large number of publicly available scientific datasets that are useful in various disciplines. In disciplines such as physics, datasets are mostly made up of direct numeric observations, and there is little relationship between the raw data from different experiments. In other sciences, particularly those based around biology, most data describes non-numeric information such as gene sequences or relationships between proteins in a cell, and there is a clear relationship between the concepts in different datasets. In the case of biology, there is a clear incentive to integrate different types of data, compared to physics where it is possible to perform isolated experiments and process the raw data without direct correlations to shared concepts such as curated gene networks.

[Figure 1.4: Use of information by scientists]

[Figure 1.5: Storing scientific data]

In biology, an experiment relating to the functioning of a cell in one organism may share a number of conceptual similarities with another experiment examining the effects of a drug on a different organism. The similarities between the experiments are identifiable, and are commonly shared between scientists by including links in published datasets. For example, a current theory about how genetic information influences the behaviour of cells in living organisms is shown in Figure 1.2, illustrating a small number of the data types that biologists are required to integrate when processing their data.

In situations where data is heavily tied to particular experiments there may not be a case for adding links to other datasets. In other cases the experimental results may only be relevant when they are interpreted in terms of shared concepts and links between datasets that can be identified separately from the experiment. If the links and concepts cannot easily be recognised, then the experimental results may not be interpreted fully. A scientist needs to be able to access data using links and recognise the meaning of the link.

Given the magnitude and complexity of the scientific data available, which includes a range of small and large datasets shown in Figure 4.1 and a range of concepts shown in Figure 1.2, scientists cannot possibly maintain it all in a single location. In practice, scientists distribute their data across different datasets, in most cases based on the type of information that the data represents. For example, some datasets contain information about genes, while others contain information about diseases such as cancer. Scientists typically use the World Wide Web to access these datasets, although there are other methods, including Grids [85] and custom data access systems such as LSIDs [41], that can be used to access distributed data.

1.1.1 Distributed data

Science is recognised to be fundamentally changing as a result of the electronic publication of datasets that supplement paper and electronic journal articles [71]. It is simple to publish data electronically, as the World Wide Web is essentially egalitarian, with very low barriers to entry. In comparison, academic journals, which are based on peer review and authoritarian principles, have higher barriers to entry, although they should not be designed to distinguish the source of an article based on social standing. Although it is simple to publish datasets, it is comparatively hard to get the data recognised and linked by other datasets. In practice, there are a limited number of scientists curating these distributed datasets, as most scientists are users rather than curators or submitters of data. In addition, the scientists using these distributed datasets may be assisted by research assistants, so the ideal "scientist" performing experiments or analysing results using different distributed datasets will likely be a group of collaborating researchers and assistants.

There are both technical and infrastructure problems that may make it hard for a new dataset to be linked to from existing scientific datasets. These issues range from outdated information, to the size of the dataset, and the way the data is made available for use by scientists. Data may be outdated or not linked if it is difficult for the maintainer of a dataset to obtain or correlate the data from another dataset. If a dataset maintainer is not sure that a piece of data from another dataset is directly relevant, they are less likely to link to the data.

As distributed data requires that datasets give labels to pieces of information, it is important that different datasets use compatible methods for identifying each item. Although most datasets allow other datasets to freely reuse their identifiers without copyright restrictions, there are some closed, commercial datasets that put restrictions on reuse of their identifiers. This limits the usefulness of these identifiers, as scientists cannot openly critique the closed datasets because the act of referring to the closed dataset identifier in an open dataset could be illegal². By comparison, these datasets freely link to other datasets, so if they were open, they would be valuable sources of data for the scientific community.

If there are no accurate, well-linked sources of data, a scientist may need to manually curate the available data and host it locally. In some cases scientists may have no issues trusting a dataset to be updated regularly and contain accurate links to other datasets, but there may be a contextual reason why they prefer to use data from a different location, including a third party middleware service or their own computing resources. The contextual reasons may include a more regularly operational data provider or a locally available copy of the data. Knowledge of these and other operational constraints is necessary for other scientists to verify and replicate the research in future.

² http://lists.w3.org/Archives/Public/public-lod/2011Aug/0117.html

1.1.2 Science example

The science of biology offers a large number of public linked datasets, in comparison to chemistry, where there are a large number of datasets but the majority are privately maintained and commercially licensed. Some specific issues surrounding data access in biology are shown in Figure 1.6, along with the general issues affecting scientific distributed data. Medical patient datasets, for example, may be encoded using an HL7 file format standard³. These datasets, however, are generally private, and doctors may benefit from simple access to both biology and public medicine datasets for internal integration without widespread publication of the resulting documents. For example, datasets describing drugs, side effects and diseases are directly relevant, and would be very useful if they could be linked to and integrated easily [37].

[Figure 1.6: Data issues]

³ http://www.hl7.org/implement/standards/index.cfm?ref=nav

Figure 1.2 shows the major concepts currently recognised in the field of cellular biology, along with their relationships. The combined cycle describes how scientists link information about different parts of the genomic cycle, using references to other datasets containing linked concepts that are thought to be causally relevant. For example, as shown in Figure 1.7, the Entrez Gene dataset contains genomic information, and this genomic information can be translated to form proteins, which have relevant information in the Uniprot dataset, among others. The Uniprot dataset also attempts to compile relevant links to Entrez Gene, among other datasets, to allow scientists to identify the genes that are thought to be used to create particular proteins.

[Figure 1.7: Datasets related to flow of information in living cells. Original figure source: Kenzelmann, Rippe, and Mattick (2006). doi:10.1038/msb4100086]

These datasets are vital for scientists who require information from multiple sections of the biological cycle to complete their experiments. This thesis examines an example involving a scientist who is required to determine and accommodate genetic causes of side effects from drugs. The published information necessary to complete these types of experiments is available in linked scientific datasets including DrugBank [144], HGNC (a dataset assigning a single textual symbol to each human gene) [125], NCBI Entrez Gene (a dataset with information about genes) [93], Uniprot (a dataset with information about proteins) [139], and Sider (a dataset with side effect information for drugs) [80]. At a conceptual level, this experiment should be relatively easy to perform using current methods such as web browsing or workflows. However, there are various reasons why it is difficult for scientists to perform the example experiment and publish the results, including:

• References differ in each format and dataset

• Many file formats, including Genbank XML, Tab-separated-values and FASTA

• Lack of interlinking: one-way link from Sider to Drugbank

• Experimental replication methods vary according to the sources used

• Custom methods are generally hard to replicate in different contexts

[Figure 1.8: Different methods of referencing Uniprot in Entrez Gene dataset]

There are different methods of referencing between the Entrez Gene dataset and the Uniprot dataset. There are multiple file formats available for each item in the Entrez Gene dataset, with each format using a different method to reference the same Uniprot item, as shown in Figure 1.8. In a similar way, there are multiple file formats available for the equivalent Uniprot protein; however, each format also uses a different method to reference the same Entrez Gene item, as shown in Figure 1.9. Both of these datasets are curated and used by virtually every biologist studying the relationship between genes and proteins. The HGNC gene symbol that is equivalent to these items is represented using different file formats, each of which uses a different method of referencing external data links. In the case of Entrez Gene, HGNC also contains two separate properties, with very similar semantic meanings, that link to the same targets in Entrez Gene, as shown in Figure 1.10.

Although some datasets are not linked in either one or both directions, there may be ways to make the data useful. For instance, the DailyMed website contains links to DrugBank, but DrugBank does not link back to DailyMed directly. DrugBank does, however, link to KEGG [76], which links to DailyMed, as shown in Figure 1.11. There are links to PubChem [31] from DrugBank and KEGG; however, they are not identical, as KEGG links to a PubChem substance, while DrugBank links to both PubChem compounds and PubChem substances. In this example, described more fully in Section 4.4, a scientist or doctor may want to examine the effects of a drug in terms of its side effects and any genetic causes; however, this information is not simple to obtain given the links and data access methods that are available for distributed linked datasets.

In addition to the directly available data from these sites, there are restructured and annotated versions available from tertiary data providers including Bio2RDF [24], Chem2Bio2RDF [39], Neurocommons [117], and the LODD group [121].
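To make these format differences concrete, the following sketch extracts the Entrez Gene cross-reference from the plain-text format of a Uniprot record, using the source URL pattern and the "DR GeneID; 4129; -." line style listed in Figure 1.9. It is a minimal illustration only, not part of the thesis prototype: the function name and parsing approach are assumptions, the URL pattern may have changed since these examples were captured, and equivalent parsers would be needed for the XML, RDF/XML, and HTML formats.

```python
import urllib.request

def entrez_gene_refs(uniprot_accession):
    """Return Entrez Gene identifiers cross-referenced by a Uniprot record.

    Parses the plain-text format only; the URL pattern follows the example
    source URLs listed in Figure 1.9 and may no longer be current.
    """
    url = "http://www.uniprot.org/uniprot/%s.txt" % uniprot_accession
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    refs = []
    for line in text.splitlines():
        # Cross-reference lines look like: "DR   GeneID; 4129; -."
        if line.startswith("DR") and "GeneID" in line:
            refs.append(line.split(";")[1].strip())
    return refs

if __name__ == "__main__":
    print(entrez_gene_refs("P27338"))  # expected to include "4129"
```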

[Figure 1.9: Different methods of referencing Entrez Gene in Uniprot dataset]

[Figure 1.10: Different methods of referencing Entrez Gene and Uniprot in HGNC dataset]

[Figure 1.11: Indirect links between DailyMed, Drugbank, and KEGG]

These versions of the datasets contain more links to other datasets than the original data; however, the annotated information may not be as trusted as the original information that is directly accessible from the publisher's website. If scientists rely on extra annotations provided by these tertiary sources, the experiments may not be simple to replicate using the original information, making it difficult for peer reviewers to verify the conclusions unless they use the annotated datasets. Although tertiary sources may not be as useful as primary sources, they provide a simpler understanding of the links between different items, as shown by replicating the example from Figure 1.11 using the annotated datasets in Figure 1.12.

Although it is simpler to visualise the information using the annotated versions, there are still difficulties. For example, the DailyMed annotated dataset provided by the LODD group does not utilise the same identifiers for items as the official version, although the official version uses two different methods itself. The DailyMed item identified by both "AC387AA0-3F04-4865-A913-DB6ED6F4FDC5" and "6788" in the official version is identified as "2892" in the LODD version, and there are no references to indicate that the LODD published data is equivalent to the DailyMed published data.

[Figure 1.12: Direct links between DailyMed, Drugbank, and KEGG]

A tertiary provider may accidentally refer to two different parts of a dataset as if they were one set. For example, the Bio2RDF datasets do not currently distinguish between the PubChem compounds dataset and the PubChem substances dataset. This error results in the use of identifiers such as http://bio2rdf.org/pubchem:3759, which may refer either to a compound, http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=3759, or to an unrelated substance, http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=3759. In these cases, the two identifiers would be very difficult to disambiguate unless a statement clearly describes either a compound or a substance.
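A context sensitive query model can address this kind of ambiguity by normalising provider URIs into prefixes that keep the two PubChem datasets separate, in the spirit of the normalisation rules introduced in Chapter 3. The sketch below is a minimal, hypothetical illustration rather than the prototype's implementation; in particular, the pubchem_compound and pubchem_substance prefixes are invented here to show the distinction, and are not identifiers that Bio2RDF actually used.

```python
import re

# Regex-based rewriting of source URLs into namespace-prefixed URIs.
# The two target prefixes are hypothetical; the point is that compounds
# (cid) and substances (sid) map to distinct namespaces.
RULES = [
    (re.compile(r"^http://pubchem\.ncbi\.nlm\.nih\.gov/summary/summary\.cgi\?cid=(\d+)$"),
     r"http://bio2rdf.org/pubchem_compound:\1"),
    (re.compile(r"^http://pubchem\.ncbi\.nlm\.nih\.gov/summary/summary\.cgi\?sid=(\d+)$"),
     r"http://bio2rdf.org/pubchem_substance:\1"),
]

def normalise(uri):
    """Apply the first matching rule, leaving unrecognised URIs untouched."""
    for pattern, replacement in RULES:
        if pattern.match(uri):
            return pattern.sub(replacement, uri)
    return uri

print(normalise("http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=3759"))
print(normalise("http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=17396751"))
```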

1.2 Problems

A number of concepts need to be defined to understand the motivating factors for this research. These concepts will be used to identify and solve problems that were highlighted using examples in Section 1.1.2.

Data cleaning: The process of verifying and modifying data so that it is useful and coherent, including useful references to related data. For example, this may include manipulating the Entrez Gene and Uniprot datasets so that the links can be directly resolved to get access to information about the referenced data.

Data quality: A judgment about whether data is useful and coherent. For example, this may be the opinion of a scientist about whether the datasets were useful, along with which manipulations would be necessary to improve them.

Trust: A judgment by a scientist that they are comfortable using data from a particular source in the context of their current experiment. In the example, scientists need to be confident that the sources of data they are using will be both non-trivial and scientifically acceptable to their peers.

Provenance: The details of the retrieval method and locations of data items that were used in an experiment, as distinct from the provenance information related to the original author and curator of the data item [129]. The distributed nature of scientific data was illustrated by compartmentalising the logical data and links from Figure 1.3 to match the way it is physically stored, as shown in Figure 1.5. In terms of this example, the provenance of the asserted relationship between the mouse gene and the human cancer would include references to data located in each of the other datasets.

Context: The factors that are relevant to the way a scientist performs or replicates an experiment. These factors influence the way different scientists approach the same problems. For example, a scientist may publish a method using datasets distributed across many locations, while a peer may reproduce or extend the work solely using resources at their institution.

These factors are relevant to the way scientists work with data, as shown in Section 1.1. Scientists are not currently able to clean and query datasets to fix errors and produce accurate results using methods that are simple to replicate. The lack of replicability may result in lower data quality for published datasets due to the lack of reuse and verification. Although scientists can access and trust parts of a small number of large public datasets, they are not able to easily trust the large number of small datasets, or even all of a small selection of large datasets. Trust requires knowledge of the state of the dataset. In addition, scientists need to understand the usefulness of the data in reference to other related datasets. It is difficult to understand the usefulness of an approach while there are barriers to replicating existing research that uses similar linked datasets.

Some of the problems discussed here are more relevant to the use of workflows than to manually performing an experiment by hand. For example, it is more important that data is trusted for large scale workflows, as the intermediate steps are required to be correct for the final, published, results to be useful. In comparison, workflows make it easy to generate query provenance, as the precedents for each value can be explicitly linked through the structure of the workflow. In manual processes, the query provenance needs to be compiled by the scientist alongside their results. Contextual factors can adversely affect both workflow and manual processes: workflows can allow for context variables as workflow inputs, and if they do not, replication becomes difficult. Similarly, manual processes are designed to be intimately replicated, making it possible to modify context variables at the desired steps.

1.2.1 Data quality

Data should be accessible to different scientists in a consistent manner so that it can be easily reused. The usefulness of data can be compromised by errors or inconsistencies at different levels. Errors may occur when the data does not conform to its declared schema; when dataset links are not consistent or normalised; when properties are not standardised; or when the data is factually incorrect. Some of these errors can be corrected when a query is performed, while others require correction before queries can be executed.

Data may be compromised by one or more fields that do not match the relevant schema definition, possibly due to changes in the use of the dataset after its initial creation [69]. In some of these cases, automated methods may be available to identify and remove this information, including community accepted vocabularies that define the expected syntactic and semantic types of linked resources. In other cases, such as relational databases, the structure of the database may need to be modified in order to solve the issue.

The decentralised nature of the set of linked scientific datasets makes dynamic verification of multiple datasets hard. Although a syntactic validator may be able to verify the completeness of a record, a semantic validator may rely on either a total record, or the total dataset, in order to decide whether the data is consistent. In many cases the results of a simple query may not return a complete record, as a scientist may only be interested in a subset of the fields available.

The usefulness of linked data lies in the use of accurate, easily recognisable links between datasets. Datasets can be published with direct links to other records, or they may be published using identifiers that can be used to name a record without directly linking to it. In many cases, the direct links contain an identifier that can be used to name a record without linking to it, as shown in a variety of examples in Section 1.1.2. In some cases there are two equivalent identifiers for a record. Figure 1.11 shows two ways of referencing the relevant DailyMed record, using both "AC387AA0-3F04-4865-A913-DB6ED6F4FDC5" and "6788" as identifiers. It may be necessary for scientists to recognise identifiers inside links, independent of the range of links that could be constructed, as the same identifier may be present in two otherwise different links.

If two different identifiers can be used equivalently for a record, then normalising the identifier to a single form will not adversely affect the semantic meaning of the record. However, there are cases where the two identifiers denote two different records with data representing the same thing. In this case, the two identifiers are synonyms rather than equivalents, meaning that they need to be distinguished so that scientists can retain the ability to describe both items [112]. This is particularly important when integrating data from two heterogeneous sources, where it may be important in future that the identifiers are separate so that their provenance can be determined.

The issue of variable data quality is more prominent when data is aggregated from different sources, as incorrect or inconsistent information is either very visible, or causes large gaps in the results. Both of these outcomes make it difficult or impossible to trust the results. This is particularly relevant to the increasingly important discipline of data mining, where large databases are analysed to help scientists form opinions about apparent trends [112]. It is necessary in science predominantly because scientific datasets tend not to remove entries for incorrect items, and to import references at one point in time without regularly checking in future to verify that the references are still semantically valid [69].

The use of multiple linked datasets may highlight existing data quality issues; however, links between distributed datasets may still be useful when the data is factually incorrect, as the link may be important in determining the underlying issue. In comparison to textual search engines, which rely on ever more sophisticated algorithms to guess the meaning of documents, scientists generate purpose built datasets and curate the links to other datasets to improve the quality of their data.
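As a concrete illustration of recognising identifiers inside links, the sketch below extracts the DailyMed record identifier from either of the two official link forms shown in Figure 1.11 and maps the numeric form onto the setid form. It is hypothetical rather than part of the thesis prototype: the patterns cover only these two link styles, and the equivalence table stands in for curated knowledge, since nothing in the links themselves asserts that "6788" and "AC387AA0-3F04-4865-A913-DB6ED6F4FDC5" name the same record.

```python
import re

# The two official DailyMed link styles shown in Figure 1.11.
LINK_PATTERNS = [
    re.compile(r"drugInfo\.cfm\?id=(?P<id>\d+)"),
    re.compile(r"lookup\.cfm\?setid=(?P<id>[0-9A-Fa-f-]+)"),
]

# Curated equivalence between the two identifier forms; this mapping has to
# be maintained by hand, as described in the text above.
EQUIVALENT_IDS = {
    "6788": "AC387AA0-3F04-4865-A913-DB6ED6F4FDC5",
}

def dailymed_identifier(link):
    """Extract the record identifier and normalise it to the setid form."""
    for pattern in LINK_PATTERNS:
        match = pattern.search(link)
        if match:
            identifier = match.group("id")
            return EQUIVALENT_IDS.get(identifier, identifier)
    return None

print(dailymed_identifier("http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?id=6788"))
```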

1.2.2 Data trust

Data trust is a complex concept, with one paper, Gil and Artz [52], identifying 19 factors that may affect the trust that can be put into a particular set of data. In the context of this research, data trust is evaluated at the level of datasets and the usefulness of different queries on each dataset. At a high level, different scientists determine whether they trust a dataset, and peers use this information to determine whether or not the dataset is trusted enough to be used as a source for queries. Although this information is implicitly visible in publications, a computer understandable annotation would allow scientists to efficiently perform queries across only their trusted datasets. If queries do not return the expected results, then the dataset that was responsible for the incorrect data could be explicitly annotated as untrusted for future queries, and past queries would remain replicable regardless of the current situation, although not necessarily trusted.

It is difficult to define a global trust level, due to evidence that knowledge is not only segmented, but its meaning varies according to context [64]. In business process modelling, where it is necessary to explicitly represent methods and knowledge related to the business activities, one paper, Bechky [21], concluded that within a business it is hard to represent knowledge accurately for a variety of reasons, including job specificity and training differences. Some of these may map to science due to its similarities with expert knowledge holders and segregated disciplines. Another paper, Rao and Osei-Bryson [113], focused on analysis of the data quality in internal business knowledge management systems, with the conclusion that there are a number of analytical dimensions that are necessary to formally trust any knowledge, with some overlap between the dimensions and the factors identified in Gil and Artz [52].

Trust in a set of data is related to data quality, as any trust metric requires that datasets contain meaningful statements. If a data trust metric is to extend past overall datasets to particular data items, there is a need to precisely designate a level of trust to data items based on general categories of trustworthiness. The continuous evolution of knowledge and data, including different levels of trust for scientific theories throughout their lifespan compared to raw data, makes it hard to define the exact meaning of all data items in a dataset in the context of a relatively static shared meaning for an item. The lack of exact meaning at this level, due to social and evolutionary issues, means that there is no automated general mechanism for detecting and resolving differences in opinion about meaning.

The focus of many current distributed scientific data access models is on the use of scientific concept ontologies as the basis for distributing queries [4, 15, 17, 25, 30, 53, 74, 75, 82, 100, 103, 107, 108, 117, 118, 133, 143]. These systems do not contain methods for describing the trust that scientists have in the dataset, including different levels of trust based on the context of each query, as ontology researchers assume that datasets will be semantically useful enough to be integrated transparently in all relevant experiments. The datasets that are available to these systems need to remain independent from a query in order for the system to be able to automatically decide how to distribute the query based on high level scientific concepts and properties.
These systems assume that both the syntactic and semantic data quality will be high enough to support realistic and complex queries across multiple linked datasets without having scientists examine or filter the results at each stage of the query process. If a dataset is well maintained and highly trusted within a domain, it will be more consistent and accurate, and the way the data is represented should match the common theories about what meaning the data implies. If the level of trust in a dataset is not high across a community, the various degrees of trust that cannot be described in current ontologies, such as uncertainty, will lead to differences in implementation. The use of different implementations will lead to differences in the trust in the knowledge that is described using the resulting data.

Data trust in science is gained based on a particular historical context. A scientific theory may currently be accepted as part of the Real layer in Critical Realist [96] terms, due to a large degree of evidence pointing to realistic causal mechanisms, but if the context changes in future, the observations may be interpreted differently. Such a change in recognition does not mean that the world actually changed to suit the new theory; it does, however, mean that a community of scientists will gradually accept the new theory and regard knowledge based on the old theory as possibly suspect.

Strassner et al. [137] attempted to utilise ontologies as the basis for managing the knowledge about a computer network, and to use this knowledge to automate the process of network management in different contexts. They found that a lack of sufficiently diverse datasets to experiment on and the lack of congruence between different ontologies made it difficult to utilise an ontology based approach as part of a network management system. In comparison, there are a large number of diverse scientific linked datasets, and there are issues relating to the use of incongruent ontologies, as the datasets have been modelled by different organisations, each of which has a different view on the meaning of the data.

As the number of scientific datasets grows, the level of knowledge by particular scientists about the suitability of any particular set of datasets will likely decrease, due to the lack of knowledge by each scientist about exactly what each database contains. Not every piece of information is trustworthy, and even trustworthy datasets can contain inaccuracies, including out of date links to other datasets, or errors due to insufficient scientific evidence for some claims [35]. Scientists have to be free to reject sources of data that they feel are not accurate enough to give value to the investigations in their current context. However, there is no current method which allows them to do this using a distributed linked dataset query model. Current theoretical models include many different factors [52], but no method for scientists to use to integrate their knowledge of trust with their scientific experiments across linked datasets. Scientists can record and evaluate the data and process provenance of their workflows; however, this does not recognise the concept of trust in a dataset provided at a particular location [34, 95, 148]. Although some trust systems can be improved using community based ratings, these systems are still prone to trust issues based on the nature of the community.
Scientists require the ability to act autonomously to discover new results as necessary, without being forced to confine themselves to the average of the current community opinion about a dataset or data item. They require a system that provides access to multiple datasets without requiring that all scientists utilise it completely for it to be useful, as this may prevent them from extending it locally as they require. In order to remain simple to manage, a data access model designed with this in mind cannot hope to systematically assign trust values to every item in every dataset that will match all scientists' opinions. On one extreme it would be globally populated automatically by an algorithm, while on the other it would be sparsely populated by a range of scientists. Neither of these outcomes enables scientists to contextually trust different datasets more accurately than a manual review of relevant publications.

An autonomous trust system does not require or imply that the data is only meaningful to some scientists. It recognises that the factual accuracy, along with the completeness of the records, contributes to each scientist's trust in the dataset and its links to other datasets. This does not imply that private investigations are more valid or useful than published results or shared community based opinions. It does imply that, as part of the exploratory scientific method, trusted datasets must be developed and refined in local scenarios before being published. In this way it matches the traditional scientific publication methodology, where results are locally generated and peer reviewed before being published. In order to be useful, a linked scientific data access model may be able to provide for both pre- and post-published data, and trust based selection of published data to allow transitions when necessary.

Trust in the content of a document is distinct from trusting that the document was an unchanged representation of the original document. Typically, trust in the computer sciences, particularly networking and security, is defined as the ability to distinguish between unchanged and changed representations of information. Although it may be useful to include this aspect in a trusted scientific data access model, it is not necessary, as scientists decide on their level of trust based on the overall content available from a particular data provider. In addition, a scientist may have different levels of trust for the content derived from a particular provider depending on their research context, requiring that, in two different contexts, the same representation be assigned two different levels of trust.

1.2.3 Context sensitivity and replication

Modern scientists need to be able to formulate and execute computer understandable queries to analyse the data that is available to them within their particular context. In terms of this research, context is defined as the factors that affect the way the query is executed, and not necessarily different contextual results. In previous models, such as SemRef [133], context sensitivity is defined as the process of reinterpreting a complex query in terms of a number of schema mappings. Scientists may wish to perform a completely different query in a different context, and the limitation to reinterpreting a complex structured query makes it impossible to provide alternative, structurally incompatible queries to match a given scientific question. It is instead necessary for a context sensitive query system to provide simple mappings between the scientist’s query, given as a minimal set of parameters rather than a structured query, and any number of ways that the particular query is implemented on different datasets. Although this does not provide a semantically rich link between the scientist’s query and the actual queries that are executed on the data providers, it provides flexibility that is necessary in a system which is designed to be used in different contexts to generate equivalent results.
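To make the parameter-based approach concrete, the following sketch (not drawn from any existing system; the query type, provider names, and templates are hypothetical) shows how a single abstract query, expressed as a query type plus a minimal set of parameters, can be mapped onto structurally incompatible provider-specific implementations:

    # Hypothetical sketch: one abstract query type, several provider-specific
    # implementations. The scientist supplies only the query type and the
    # parameters; the context decides which providers and templates are used.
    QUERY_TEMPLATES = {
        "gene-by-symbol": {
            # Provider A implements the query as a SPARQL graph pattern.
            "provider-a": 'SELECT ?gene WHERE {{ ?gene <http://example.org/symbol> "{symbol}" }}',
            # Provider B implements the same query as a URL template.
            "provider-b": "http://provider-b.example.org/genes?symbol={symbol}&format=rdf",
        }
    }

    def resolve(query_type, parameters, providers):
        """Return the concrete query for each provider selected by the context."""
        templates = QUERY_TEMPLATES[query_type]
        return {name: templates[name].format(**parameters) for name in providers}

    # Two different contexts can choose different providers for the same question.
    for provider, query in resolve("gene-by-symbol", {"symbol": "MAOB"},
                                   ["provider-a", "provider-b"]).items():
        print(provider, "->", query)

The point of the sketch is only that the scientist's question is decoupled from the structure of each implementation, which is the flexibility argued for above.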

Context sensitivity in this research relates to both the scientist’s requirements and the locations and representations of data. If scientists have the resources to locally host all of the datasets, then they are in a position to think about optimising the performance by using a method such as that proposed by Souza [133]. However, in these cases, the scientist still has to determine their level of trust based on the intermediate results. This may not be available if the entire question is answered using joins across datasets in a single database without a scientist being able to verify the component steps. They could perform these checks if they use the datasets solely as a data access point, with higher level filters and joins on the results according to their context.

Although there are systems that may assist them when constructing queries across different datasets, the scientist needs to verify whether the results of each interdataset query match their expectations in order to develop trust in their results. The data quality, along with their prior level of trust in the datasets and the computer understandable meaning attached to the data, are all important to whether the scientist accepts the query and its results or chooses to formulate the query differently.

Some systems attempt to gather all possible sources for all queries into a single location; one example is SRS (Sequence Retrieval System) [46]. The aggregation of similar datasets into single locations provides for efficient queries and, assuming the datasets are not updated by the official providers too often, can be maintainable and trustworthy. In terms of this research, the minimum requirement for scientists to replicate queries is that the data items must be reliably identified independently of context.

1.3 Research questions

A list of research questions was created to highlight some of the current scientific data access problems described in Section 1.2.

1. What data quality issues are most prevalent for scientists working with multiple distributed datasets?

2. What data cleaning methods are suitable for scientists who need to produce results from distributed datasets that are easily replicable by other scientists?

3. What query model features are necessary for scientists to be able to identify trustworthy datasets and queries?

4. What query provenance documentation is necessary for scientists to perform queries across distributed datasets in a manner that can be replicated in a different context with a minimal amount of modification to the queries?

5. Can a distributed dataset query model be implemented, deployed, and used, to effectively access linked scientific datasets?

1.4 Thesis contributions

This thesis introduces a new context sensitive model for querying distributed linked scientific datasets that addresses the problems of data cleaning, trust, and provenance as discussed in Section 1.2. The model allows users to define contextual rules to automate the data cleaning process. It provides context sensitive profiles which define the trust that scientists have in datasets, queries, and the associated data cleaning methods. The model enables scientists to share methodologies through the publication and retrieval of their queries’ provenance. The thesis describes the implementation of the model in a prototype web application and validates the model based on examples that were difficult to resolve without the use of the model and the prototype implementation. The contributions will be evaluated by comparing the model and implementation features and methodology against other similar models in terms of their support for context-sensitive, clean, and documented access to heterogeneous, distributed, linked datasets. These features are necessary to provide support to current and future scientists who perform experiments and analyse data based on public datasets.

1.5 Publications

• Ansell. [2011] : Model and prototype for querying multiple linked scientific datasets. Future Generation Computer Systems. doi:10.1016/j.future.2010.08.016

• Ansell, Hogan, and Roe. [2010]. Customisable query resolution in biology and medicine. In Proceedings of the Fourth Australasian Workshop on Health Informatics and Knowledge Management (HIKM2010).

• Ansell. [2009]. Collaborative development of cross-database Bio2RDF queries. Presentation at EResearch Australasia 09.

• Ansell. [2008]. Bio2RDF: Providing named entity based search with a common biological database naming scheme. Presentation at BioSearch08 HCSNet Next-Generation Search Workshop on Search in Biomedical Information.

• Belleau, Ansell, Nolin, Idehen, and Dumontier. [2008]. Bio2RDF’s SPARQL Point and Search Service for Life Science Linked Data. Poster at Bio Ontologies 2008 workshop.

Other publications:

• Ansell, Buckingham, Chua, Hogan, Mann, and Roe. [2009]. Enhancing BLAST Comprehension with SilverMap. Presentation at 2009 Microsoft eScience Workshop.

• Ansell, Buckingham, Chua, Hogan, Mann, and Roe. [2009]. Finding Friends Outside the Species: Making Sense of Large Scale BLAST Results with Silvermap. Presentation at EResearch Australasia 09.

• Rosemann, Recker, Flender, and Ansell. [2006]. Understanding context-awareness in business process design. Presentation at Australasian Conference on Information Systems.

1.6 Research artifacts

The research was made up of a model and a prototype implementation of the model. The model was designed to overcome the issues related to data access in science that were identified in Section 1.1. The prototype was then implemented to provide evidence for the usefulness of the model. The prototype was implemented using Java and JavaServer Pages (JSP). It contained approximately 35,000 lines of program code4. The prototype included a human-understandable HTML interface, as shown in Figure 1.13, along with a variety of computer-understandable RDF file formats including RDF/XML, Turtle, and RDFa. The 0.8.2 version of the prototype was downloaded around 300 times from Sourceforge 5. The prototype was primarily tested on the Bio2RDF website. It provided the website with a way to resolve queries easily across its distributed linked datasets, including datasets using the normalised Bio2RDF resource URIs [24] and datasets using other methods for identifying data and links to other data. Bio2RDF is a project aimed at providing browsable, linked, versions of approximately 40 biological and chemical datasets, along with datasets produced by other primary and tertiary sources where possible. The datasets produced by Bio2RDF can be colocated in a single database, although the datasets are all publicly hosted in separate query endpoints.

4 http://www.ohloh.net/p/bio2rdf/analyses/latest
5 http://sourceforge.net/projects/bio2rdf/files/bio2rdf-server/bio2rdf-0.8.2/
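As a minimal illustration of the kind of access described above (a sketch only; it assumes the Python rdflib package and that the normalised Bio2RDF URI shown remains resolvable), a Bio2RDF identifier can be resolved and the returned RDF statements inspected directly:

    # Sketch: resolve a Bio2RDF Linked Data URI and list the statements that
    # describe it. Assumes the rdflib package is installed and the URI resolves.
    from rdflib import Graph, URIRef

    gene = URIRef("http://bio2rdf.org/geneid:4129")

    graph = Graph()
    graph.parse(gene)  # rdflib performs the HTTP request and parses the RDF response

    for predicate, obj in graph.predicate_objects(subject=gene):
        print(predicate, obj)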

Figure 1.13: Bio2RDF website

The configuration information necessary for the prototype to be used as the engine for the Bio2RDF website was represented using approximately 27,000 RDF statements. A large number of datasets were linked back to their original web interfaces and descrip- tions of the licenses defining the rights that users of particular datasets have. In total there were 170 namespaces that were configured to provide links back to the original web interface as part of the Bio2RDF query process, and there were 176 namespaces that were configured to provide links back to a web page that describes the rights that users of the dataset have. The Bio2RDF website was accessed 7,415,896 times during the testing period and the prototype performed 35,694,219 successful queries and 1,078,354 unsuccessful queries on the set of distributed datasets to resolve the user queries. The website was accessed by at least 10,765 unique IP addresses, although two of the three mirrors were located behind reverse proxy systems that did not provide information about the IP of the user accessing the website. The three instances of the prototype running on the mirrors were synchronised using a fourth instance of the software that provided the most up to date version of the Bio2RDF configuration information to each of the other mirrors at 12 hourly intervals.

1.7 Thesis outline

The thesis starts with an introduction to the problems that form the motivation for this research in Chapter 1. The related research areas are described in Chapter 2, starting from a broad historical perspective and going down to related work that attempts to solve some of the same issues. Chapter 3 contains a description of the model that was created in order to make it simpler for scientists to deal with the issues described in Chapter 1. Chapter 4 describes how the model and prototype implementation can be integrated with scientific practices. The prototype that was created as an implementation of the model is described in Chapter 5. Chapter 6 contains a discussion of the issues that were addressed by the model and prototype, along with outstanding issues and a description of the methodology and philosophy that were used to guide the research at a high level. A conclusion and a brief description of potential future research directions are given in Chapter 7.

Chapter 2

Related work

2.1 Overview

For most of scientific history, data and theories have been accumulated in personal notebooks and letters to fellow scientists. The advent of mechanical publishing in the Renaissance period allowed scientists to mass produce journals that detailed their discoveries. This provided a broader base on which to distribute information, making the discovery cycle shorter. However, data still needed to be processed manually, and publication restrictions made it hard to publish raw data along with results and theories. The use of electronic computers to store and process information gave scientists efficient ways to permanently store their data, allowing for future processing. However, initially electronic storage costs were prohibitive, and scientists generally did not share raw data due to these costs. The creation of a global network to electronically link computer systems provided the impetus for scientists to begin sharing large amounts of raw data, as the costs were finally reduced to economic levels. This chapter provides a short summary of a range of data access and query methods that have been proposed and implemented.

Networked computers, particularly those that make up the Internet, are now vital to the process of scientific research. The Internet is used to provide access to information that scientists can retrieve as they require, with over 1000 databases in the biology discipline [50]. A scientist needs to request information from many different locations to process it using their available computing resources. There are many ways that scientists could request the information, making it difficult for other scientists to replicate their research based on their publications. They can publish computer understandable instructions detailing ways for other scientists to replicate their research, but other scientists need to be able to execute the instructions locally. This is difficult as the contextual environment is different between locations, including different operating systems, database engines, and dataset versions. An initial step towards easy replication by peers may be to use workflow management systems to integrate the queries and results into replicable bundles. Applications such as Taverna [105] make it possible to use workflows to access distributed data sources. The methods that are most commonly used by scientific workflow management systems are Web Services, in particular SOAP XML Web Services. Although XML is a useful data format, it relies on users identifying the context that the data is given in, and documents from different locations may not be easy to integrate as the XML model only defines the document structure, and not the meaning of different parts of the document. Web Services can be useful as data access and processing tools, especially if scientists are regularly working with program code, so they are useful if the context is identifiable by the scientist. However, the use of Web Services does not make it simple to switch between the use of distributed services and local services, as users need to implement the computer code necessary to match the web service interface to their own data. The method of data access is not distinctly separated from either the queries or the data processing steps within workflows, so scientists need to know how to either reverse engineer the web service, modify a workflow to match their context if the web service is not available, or access the data locally using another method.

A scientist needs to be able to identify links between different pieces of data to develop scientifically significant theories across different datasets. The XML data format, for example, is used as the basis for Web Services and many other data formats. However, XML and other common scientific data formats do not contain a standard method for identifying data items or links to other data items in a way that is generically recognisable. In order to link between data items in different documents, a generic data format, RDF (Resource Description Framework), was developed. RDF is able to show links between different data items using Uniform Resource Identifiers (URIs) as identifiers for different data items. RDF statements, including URIs and literals, such as strings and numbers, provide the structure for links between relevant URIs and properties of data items, respectively. RDF is used as the basis for the Semantic Web, which is ideally characterised by computers using computer understandable data to provide intelligent responses to questions.

The original focus for the Semantic Web community was to define the way that community agreed meanings for links, commonly known as ontologies or vocabularies, would be represented. This resulted in the development of OWL (Web Ontology Language), as a way of defining precisely what meaning could be implied from a statement given in RDF. Although this was useful, it did not fulfil the goals of the Semantic Web, as it is a format, rather than a dataset, and the datasets suffered due to both data quality problems and a lack of contextual links between the datasets that were published. The W3C community recognised this as an issue. Tim Berners-Lee formed a set of guidelines, known as Linked Data (http://www.w3.org/DesignIssues/LinkedData.html), describing a set of best practices for using URIs in RDF documents. Linked Data uses HTTP URIs to access further RDF documents containing useful information about the link.

According to the best practices given by the Linked Data community, URIs should be represented using the HTTP protocol, which is widely known and implemented, so many different people can get access to information about an item using the URI that was used to identify the item. When a Linked Data URI is resolved, the results should contain relevant further Linked Data URIs that contain contextually related items, as shown in Figure 2.1. This provides a data access mechanism that can be used to browse between many different types of data in a similar way to the HTML document web. The Linked Data guidelines encourage scientists to publish results that contain useful contextual links, including links to data items in other scientific disciplines. Although some scientists wish there to be a single authoritative URI for each scientific data item, RDF is designed to allow unambiguous links between multiple representations of a data item using different properties. In addition, Linked Data does not specify that there needs to be a single URI for each data item. A scientist needs to be able to create alternative URIs so they can reinterpret data in terms of a different schema or new evidence, without requiring the relevant scientific community to accept and implement the change beforehand.

Figure 2.1: Linked Data Naive Querying (retrieve a Linked Data URI; parse the RDF document and store it in a local database; analyse the document for references to other Linked Data and retrieve any new references; then perform the query on the local database)
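The naive process in Figure 2.1 can be sketched in a few lines (a sketch under the assumption that the rdflib package is available and that the starting URI resolves; the depth limit is illustrative):

    # Sketch of the naive Linked Data querying loop shown in Figure 2.1:
    # resolve a URI, store the RDF locally, follow newly discovered URIs,
    # then query the accumulated local graph.
    from rdflib import Graph, URIRef

    def crawl(start_uri, max_uris=10):
        graph, seen, queue = Graph(), set(), [URIRef(start_uri)]
        while queue and len(seen) < max_uris:
            uri = queue.pop(0)
            if uri in seen:
                continue
            seen.add(uri)
            try:
                graph.parse(uri)              # retrieve and parse one Linked Data URI
            except Exception:
                continue                      # unresolvable URIs are skipped
            for obj in graph.objects():       # analyse the document for further URIs
                if isinstance(obj, URIRef) and obj not in seen:
                    queue.append(obj)
        return graph

    local = crawl("http://bio2rdf.org/geneid:4129")
    # Perform the query on the local database once enough URIs have been resolved.
    for row in local.query(
            "SELECT ?p ?o WHERE { <http://bio2rdf.org/geneid:4129> ?p ?o } LIMIT 5"):
        print(row.p, row.o)

The sketch also makes the inefficiency visible: every potentially relevant URI must be resolved and stored before the query can be answered, which is the limitation discussed next.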

Despite its general usefulness, plain Linked Data is not an effective method for systematic querying of distributed datasets, particularly where links are not symmetric, as shown in Figure 2.2. Even when all Linked Data links are symmetric, many URIs need to be resolved and locally stored before there is enough information to make complex queries possible. In the traditional document web, textual searches on large sets of data are made possible using a subset of the documents on each site. Although this sparse search methodology is very useful for textual documents, it is not useful for very specific scientific queries on large, structured, linked scientific datasets. The RDF community developed a query language named SPARQL (SPARQL Query Language for RDF) 2 to query RDF datasets using graph matching techniques. SPARQL provides a way to perform queries on a remote RDF dataset without first discovering and resolving all of the Linked Data URIs to RDF documents. However, it is important that resolvable Linked Data URIs are present so that the data at each step is independently accessible outside of the context of a SPARQL query.

SPARQL is useful for constructing complex graph based queries on distributed datasets; however, it was not designed to provide concurrent query access to multiple distributed datasets. In order for scientists to take advantage of publicly accessible scientific datasets, without requiring them to store all of the data locally, queries must be distributed across different locations. Figure 2.3 shows the concept of data producers interlinking their datasets with other sources before offering the resulting data on the network, contrasted with the data silo approach where all data is stored locally and the maintenance process is performed by a single organisation rather than by each of the data producers. Federated SPARQL systems distribute queries across many datasets by splitting SPARQL queries up according to the way each dataset is configured [40]. An example of a dataset configuration syntax is VoiD 3, which includes predicates and types, labels for overall datasets, and statistics about interlinks between datasets. The component queries are aggregated by the local SPARQL application before returning the results to the user. Although there is no current standard defining Federated SPARQL, a future standard may not be useful for scientists if it does not recognise the need for the dataset to be filtered or cleaned prior to, or in the act of, querying. In addition, the method and locations used to access distributed datasets are embedded directly into queries with current Federated SPARQL implementations4, in a similar way to web services and workflow management systems. Federated SPARQL strategies allow scientists to perform queries across distributed linked datasets, but the cost is that the datasets must all use a single URI to identify an item in each dataset, or they must all provide equivalency statements to all of the other URIs for an item. This requirement is not trivial, and a solution has not been proposed that provides access to heterogeneous datasets in a case where the providers do not agree on a single URI structure. Proposals for single global URIs for each dataset have not so far been implemented by multiple organisations, and even when they are implemented in the future, all datasets would have to follow them for the system to be effective.

2 http://www.w3.org/TR/rdf-sparql-query/
3 http://vocab.deri.ie/void/
4 http://esw.w3.org/SPARQL/Extensions/Federation

Figure 2.2: Non Symmetric Linked Data (an existing link points from one Linked Data URI about a concept X to another Linked Data URI about the same concept, but no link exists in the reverse direction)

Figure 2.3: Distributed data access versus local data silo (in distributed dataset maintenance, each publisher curates and interlinks its own source with the others, and relevant data is accessed across the network; in data silo maintenance, each source is copied across the network and a single organisation curates and interlinks the sources, which are then stored and accessed locally)

Although Linked Data URIs are useful, they provide a disincentive to single global URIs. If a URI needs to be resolved through a proxy provided by a single organisation, there are many different social and economic issues to deal with. This organisation may or may not continue to have funding in the future, and it may not respond promptly to change requests. Each of these issues reduces the usefulness that the single URI may otherwise provide.

There are non-SPARQL based models and implementations that allow scientists to make use of multiple datasets as part of their experiments. They are generally based on the concept of wrapping currently available SQL, SOAP Web Service, and distributed processing grids to provide for complex cross-dataset queries. The wrappers rely on a schema mapping between the references in the query and the resources that are available in different datasets. In some systems distributed queries rely on the ability to map high level concepts from ontologies to the low level data items that are actually used in datasets. These types of mappings require the datasets to be factually correct according to the high level conceptual structure before the query can be completed, as there is no way to simply ask for data independent of the high level concepts. In these cases it may be possible to encode the low level data items directly as high level concepts, but this reduces the effectiveness of the high level concepts in other queries.
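A minimal sketch of the federation strategy discussed above — splitting one question into per-endpoint sub-queries and joining the results locally — is shown below (assuming the SPARQLWrapper package; the endpoint addresses, predicates, and the assumption that both endpoints share the same gene URIs are illustrative only):

    # Sketch: manual federation across two SPARQL endpoints with a local join.
    # Endpoints and graph patterns are hypothetical placeholders.
    from SPARQLWrapper import SPARQLWrapper, JSON

    def select(endpoint, query):
        client = SPARQLWrapper(endpoint)
        client.setQuery(query)
        client.setReturnFormat(JSON)
        return client.query().convert()["results"]["bindings"]

    # Sub-query 1: genes targeted by a drug, from the first endpoint.
    genes = select("http://endpoint-a.example.org/sparql",
                   'SELECT ?gene WHERE { ?drug <http://example.org/name> "Marplan" ; '
                   '<http://example.org/targetsGene> ?gene }')

    # Sub-query 2: labels for each gene, from the second endpoint, joined locally.
    for binding in genes:
        gene_uri = binding["gene"]["value"]
        labels = select("http://endpoint-b.example.org/sparql",
                        "SELECT ?label WHERE { <%s> "
                        "<http://www.w3.org/2000/01/rdf-schema#label> ?label }" % gene_uri)
        for row in labels:
            print(gene_uri, row["label"]["value"])

The join only works because both endpoints are assumed to use the same URI for the gene, which is exactly the requirement that the text above identifies as non-trivial.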

2.2 Early Internet and World Wide Web

Although it was originally sponsored by ARPANET, part of the US defence department, the distributed network of computers that is now known as the Internet has been used to transmit scientific information in primitive forms since it was created. Email, the initial communication medium, was created to enable people in different locations to share electronic letters, including attached electronic documents. By its nature, email is generally private, although many public email lists are used to communicate within communities. Initially, the lack of data storage space made it hard for scientists to create and curate large datasets, resulting in a bias towards communication of scientific theories compared to raw data.

The World Wide Web was created by scientists at CERN in Switzerland to share information using the Internet. It is based on the HTML (HyperText Markup Language) 5 and HTTP (Hypertext Transfer Protocol) 6 specifications. The HTML specification enables document creators to specify layouts for their documents, and to include links to other documents. The use of HTTP, and in some cases HTTPS and FTP, forms the core transport across which HTML and other documents are served using the Internet. HTML links, generally HTTP URIs, are created without a specific purpose, other than to provide a navigational tool.

HTML is regularly used by scientists to browse through data. HTML versions of scientific datasets display links to scientific data both within the dataset and in other datasets using HTTP URLs. These interfaces enable scientists to navigate between datasets using the links that the data producer determined were relevant. With recent improvements in computing infrastructure, including very cheap storage space and large communication bandwidths, scientists are now able to collaborate in real time with many different peers using the Internet. The massive scale electronic sharing of data, which allows this collaboration, has been described as a "4th paradigm" for science [71]. The first three paradigms are generally recognised as being based on theory, empirical experimentation, and numerical simulation, respectively. Real-time collaboration and data sharing enable scientists to process information more efficiently, arguably resulting in shorter times between hypotheses, experiments, and publications.

5 http://www.w3.org/TR/html4/
6 http://tools.ietf.org/html/rfc2616

2.2.1 Linked documents

Initially the Web was used to publish static documents containing hardcoded links between documents. To overcome the maintenance effort required to keep these static documents up to date, systems were created to enable queries on remote datasets, without having to download the entire dataset. The results of these queries may be provided in many different formats, but there is no general method for identifying links between datasets, and in some cases no description of the meaning for links that are given. In HTML, where links are commonly provided, the URL link function is not necessarily standardised between datasets, making it potentially hard to match the reference to references in an HTML page from a different data provider. Linked documents, most commonly created using HTML, are popular due to the ease with which they can be produced. However, the semantic content of HTML documents is limited to structural knowledge about where the link is in the document, and what the surrounding elements are. Clearly linked documents provide information about the existence of a relationship between documents, but they do not propose meanings for the relationship. They can be annotated with meaningful information, with a popular example being the Dublin Core specifications 7. Dublin Core can be used to annotate documents with standard properties such as author and title. Textual search engines, supplemented with knowledge about the existence of links, make it possible to efficiently index the entire web, including annotations such as those provided by Dublin Core.
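For example, a document identifier can be annotated with Dublin Core properties in a few lines (a sketch assuming the rdflib package; the document URI and values are hypothetical):

    # Sketch: attaching Dublin Core title and creator annotations to a document URI.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DC

    doc = URIRef("http://example.org/documents/linked-data-overview")

    graph = Graph()
    graph.add((doc, DC.title, Literal("An overview of linked scientific data")))
    graph.add((doc, DC.creator, Literal("A. Scientist")))

    print(graph.serialize(format="turtle"))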

2.3 Dynamic web services

Dynamic processing services were created to avoid having to retrieve entire data files across the Internet before being able to determine if the data is useful. Initially these were created using ad hoc scripts that processed users' inputs using a local datastore and returned the results. However, these services required each new service to be understood by humans and implemented using custom computer programs. Web services can be created that enable computers to understand the interface specification, and programmatically access the service, without a human programmer having to understand each of the parameters and code the interface by hand. Programmatic web services may be loosely or tightly specified. One example of a tight specification, where the interactions are specified in a tight contract, is the SOAP protocol 8. The SOAP protocol enables computers to directly obtain the interfaces that in prior cases needed to be understood by humans. Web services are used by scientists to avoid issues with finding data in other ways, as the data is able to be manipulated in program code with the SOAP interface being the specification of what to expect. Web services may be used just for data retrieval, but they are also commonly used as specific query interfaces to enable scientists to use remote computing resources to analyse their data. Although the SOAP specification allows for multiple distributed endpoints to be used with a single Web Service, in most cases only a single data provider is given, with alternative ways to access that single provider. This causes issues with overloading of particular services and can in some cases stop scientists from processing their data if the service is unavailable. The nature of SOAP web services, with a tightly bound XML data contract, enables them to be reverse engineered and implemented locally. However, this process requires the re-creation of the code that was used to get the data out of a database or processing application, something which is not necessarily trivial. Semantic markup to describe and compose Web Services has been used with limited success, although the lack of success may be due to the nature of the ontologies rather than a defect in the model [99].

Although web processing services are useful, the data may be returned in a format that is not understandable by someone outside of the domain. In the case of SOAP, users are required to understand XML, or alternatively, write a piece of code to translate it to their desired format. In the case of other web processing services, the data may be returned in any of a number of data formats, some of which may be common to similar processing services. Although the sharing of specific data using these dynamic web services is useful, scientists currently end up having to spend a large amount of their time translating between data formats as required by the different services. The translation between different data formats also makes it hard to recognise references between data items without first understanding the domain and the use of different methods by each domain to construct the references.

7 http://dublincore.org/
8 http://www.w3.org/TR/soap/
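As an illustration of this tightly specified style (a sketch assuming the Python zeep package; the WSDL address, operation name, and parameter are placeholders rather than a real service), a SOAP service can be invoked directly from its WSDL contract:

    # Sketch: calling a SOAP web service from its WSDL description, without
    # hand-writing the XML envelope. The WSDL URL and operation are placeholders.
    from zeep import Client

    client = Client("http://example.org/sequence-service?wsdl")

    # The generated proxy exposes each operation declared in the WSDL contract.
    result = client.service.GetSequence(accession="EXAMPLE-0001")
    print(result)

Reverse engineering such a service locally, as discussed above, would still require re-implementing whatever code sits behind the hypothetical GetSequence operation.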

2.3.1 SOAP Web Service based workflows

Current scientific workflow management systems focus on data access using a combination of SOAP Web Service calls and locally created code. Although this is a useful way to access data reliably, as it includes fault aware SOAP calls, it creates a reliance on XML syntax as the basic unit of data interchange. The tightly defined XML specification, detailed in the SOAP contract, prevents multiple results from being merged easily into single result documents if the specification did not allow for this initially. Although there is some work being done to enhance the semantic structure of the XML returned by web services, it does not contain a clear way to integrate web services which have been implemented by different organisations. A discovery protocol such as Simple Semantic Web Architecture and Protocol (SSWAP) provides a way of discovering the semantic content published in web services, but it does not provide a way for scientists to directly use the web services discovered by the protocol [102]. The ontology description of the inputs, operations, and outputs of web services is useful in addition to the WSDL description which defines the structure of the data inputs and outputs, but it needs to integrate with the data, and provide a mapping between data structures, before it would be useful as a data access protocol.

Perhaps the biggest results in this area have been attained by the manually coded Web Service wrappers created by the SADI project [143]. Although it is useful, the mechanism for identifying which datasets are relevant to each provider in the background is done on a case by case basis inside the wrappers. The scientist's high level query cannot be used in different contexts to integrate data that was not created using identical identifiers and the same properties. SADI uses a single query to determine which sub-queries are necessary. If datasets in future follow the exact structure and identifiers are not changed, then the single query can be adapted or reused to fit future data providers. The SADI model allows for future datasets to be represented using different data structures, as long as they can be transformed using a manually coded wrapper or an OWL based ontology transformation, as the data structure given in the query is fundamental to the central query distribution algorithm. This restriction ties the scientist closely to the actual data providers that are used. This makes it difficult to share or extend queries in different contexts. The parts of the query that need to be modified may not be easy to identify, given that each query is executed as a combined workflow.

For example, although the datasets are equivalent, SADI would find it difficult to query across the data providers shown in Figure 2.4, as there is both variation in the references and variation in the data structures. In the example, some information for the results needs to come from each dataset, but there is no single way to resolve the query, as there are different ways of tracing from the drug to the fact that the gene is located on chromosome X, depending on whether the information about its gene sequence is required. In addition to this, the property names are duplicated between datasets for different purposes. In the Drug dataset, the property "Target's gene" contains a two part reference to a dataset along with the identifier in the dataset, while in the Disease dataset, the property contains a symbol without a reference to a dataset, and the symbol is not identical to the identifier in the Drug dataset.
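The identifier heterogeneity in Figure 2.4 can be made concrete with a small sketch (illustrative only; the alias table, the default namespace for bare symbols, and the identifier pattern are assumptions, not part of SADI or any other system described here):

    # Sketch: normalising heterogeneous references such as the two part
    # reference ("ncbi_gene", "4129") and the bare symbol "MAOB" from
    # Figure 2.4 into a single namespace-prefixed identifier form.
    NAMESPACE_ALIASES = {"ncbi_gene": "geneid", "hgnc": "hgnc.symbol"}
    DEFAULT_SYMBOL_NAMESPACE = "hgnc.symbol"   # assumed for bare gene symbols

    def normalise(reference):
        """Map a dataset-specific reference to a normalised identifier string."""
        if isinstance(reference, tuple):       # (namespace, identifier) pair
            namespace, identifier = reference
            namespace = NAMESPACE_ALIASES.get(namespace, namespace)
        else:                                  # a bare symbol with no namespace
            namespace, identifier = DEFAULT_SYMBOL_NAMESPACE, reference
        return "%s:%s" % (namespace, identifier)

    print(normalise(("ncbi_gene", "4129")))    # -> geneid:4129
    print(normalise("MAOB"))                   # -> hgnc.symbol:MAOB

Even a mapping this small has to encode dataset-specific knowledge, which is why a single query structure cannot be reused unchanged across the providers in the figure.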

2.4 Scientific data formats

Scientific data that will be used by computer programs, as opposed to the HTML that is used by humans, is presented in many different formats. These formats are generally text formats, as binary specifications that are common for some programs are not transportable and useable in different scenarios, something which is important to scientists. Although in the past scientists created new file formats for each area, new scientific document formats are generally standardised on using XML as the basic format. The common use of XML means that documents are easily parseable, and if they follow a particular schema, they can be validated easily by programs. The use of XML, along with XML Schema 9, makes it simple to construct complete documents and verify that they are syntactically valid. However, the strict nature of XML schema verification does not allow extension of the document. In addition, XML does not contain a native method for describing links between documents. Although HTML contains a native method for linking between documents, it is not designed to describe the links in terms of recognisable properties. In addition to XML based formats, scientists share documents using the ASN.1 standard (ISO/IEC 8825-1:2008) 10 and custom, domain specific, data formats based on computer parseable syntax descriptions encoded using EBNF (ISO/IEC 14977:1996) 11. These documents may support links to different datasets, but they require knowledge of the way each link is specified to determine its destination, and the data item that the link refers to may not be directly resolvable using the link.

Figure 2.4: Heterogeneous datasets (a query for the name, location, genetic structure, and taxonomy information for the genes and diseases that are relevant to the drug Marplan must combine a Drug dataset record (Name: Marplan; Targets disease: Neuritis; Targets gene: ncbi_gene, 4129), a Disease dataset record (Name: Neuritis; Targets gene: hgnc, MAOB), an NCBI Gene dataset record (Name: Monoamine oxidase B; Identifier: 4129; Gene symbol: hgnc, 6834; Taxonomy: 9606; Genetic sequence: AGGCT...), and an HGNC Gene dataset record (Name: Monoamine oxidase B; Identifier: 6834; Gene symbol: MAOB; Taxonomy: Human; Chromosome: X; Location: Xp11.23) to produce the useful results: Monoamine oxidase B, X chromosome, AGGCT..., Human, Neuritis)

2.5 Semantic web

The Semantic Web was designed to give meaning to the masses of data that are represented using electronic documents. The term was defined using examples which focus on the ability of computerised agents to automatically reason across multi-discipline distributed datasets without human intervention, to efficiently process large amounts of information and generate novel complex conclusions to simple questions [27]. The Semantic Web is commonly described using idealistic scenarios where data quality issues are non-existent or are assumed not to affect the outcome, where data from different locations is generally trustworthy, and where different data providers all map their data to shared vocabularies. However, the slow process of developing the necessary tools has shown that the achievable operational level of a global Semantic Web may be very limited. For instance, Lord et al. [91] found that, based on their experience, "[the] inappropriateness of automated service invocation and composition is likely to be found in other scientific or highly technical domains."

A completely automated Semantic Web is still possible; however, it is more likely that human driven applications can derive novel benefits from the extra structure and conditions provided by semantically marked up, shared, data. In comparison to agent based approaches, human data review incorporates curation and query validation processes at each step. Although the Semantic Web (SW) may be a user navigable dataset which includes references to the current HTML web, the major features are independent of the way the web is used today. These features include a focus on computer understandability, and a common context between documents to use as the basis for interpreting any meaning that may be given.

A scientist needs to unambiguously describe the elements of a query using terminologies that are compatible with the datasets they have available, including some commonly used ontologies described in Appendix B. This process requires a way to apply rules to each of the terms in a query to determine whether they are recognisable and not ambiguous, and to convert them to fit the context of each dataset as necessary. This process is not trivial, and no general solution has been found, although some solutions focus on the use of Wikipedia as a central authority to disambiguate terms [18]. Unfortunately, Wikipedia only contains a subset of the terms needed by scientists, as it is defined by its community as a general knowledge reference, rather than a scientific knowledge reference. Although new terms could be added at any time to Wikipedia, they can just as quickly be deleted, meaning that it cannot be permanently used as a reference.

The Semantic Web is currently being developed using the RDF (Resource Description Framework) specification for data interchange 12. RDF is a graph based data format that provides a way to contextually link information from different datasets using URI links and predicates, which are also represented as URIs. These predicate URIs can be defined in vocabularies. In comparison to domain specific data formats and XML, RDF is designed to be an extensible model, so information from different locations can be meaningfully merged by computers. RDF is based on URIs, of which the HTTP URLs that are commonly used by the document web are a subset. The RDF model specifies that the use of a URI in different statements implies a link between the statements. The URI should contain enough information to make it easy for someone to get information about the item if they understand the protocol used by the URI. In the past, scientists have been hesitant to create HTTP URIs due to perceived technical and social issues surrounding the HTTP protocol and the DNS system that HTTP is commonly used with [41]. Some RDF distributions, such as the OBO schema, recommend instead that scientists represent links in RDF using other mechanisms, such as key-value pairs with namespace prefixes in one RDF triple and the identifier in another triple. This use of RDF does not enable computers to automatically merge different RDF statements, as there are no shared URIs between the statements.

9 http://www.w3.org/TR/xmlschema-1/
10 http://www.iso.org/iso/catalogue_detail.htm?csnumber=54011
11 http://www.iso.org/iso/catalogue_detail.htm?csnumber=26153
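The difference between the two styles can be seen in a short sketch (assuming the rdflib package; the URIs and property names are hypothetical): statements that share a URI merge automatically, while key-value style references do not.

    # Sketch: two fragments that reuse the same URI merge into one description;
    # the graphs are combined with a simple union.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import RDFS

    gene = URIRef("http://example.org/gene/4129")

    source_a = Graph()
    source_a.add((gene, RDFS.label, Literal("Monoamine oxidase B")))

    source_b = Graph()
    source_b.add((gene, URIRef("http://example.org/chromosome"), Literal("X")))

    merged = source_a + source_b    # graph union: both statements now describe one node
    for triple in merged:
        print(triple)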

2.5.1 Linked Data

The Semantic Web community has defined guidelines for creating resolvable, shared identifiers, in the form of HTTP URIs, under the banner "Linked Data" 13. The guidelines codify a way to gain access to factual data across many different locations, without relying on locally available databases or custom protocols. Although its specifications are independent of many of the goals of the Semantic Web, Linked Data is designed to produce a web of contextually linked data that may then be incrementally improved to represent a semantically meaningful web of information.

12 http://www.w3.org/TR/rdf-syntax-grammar/
13 http://www.w3.org/DesignIssues/LinkedData.html

In comparison to linked documents, Linked Data provides the basis for publishing links between data items that are not solely identifiers for electronic documents. Using Linked Data, an HTTP link can be resolved to a document that can be processed and interpreted as factual data by computers without intervention by a human. The Linked Data guidelines refer to the ability to use Uniform Resource Identifiers (URIs) to represent both an item and the way to get more information about the item. Although this in itself is not new, the Linked Data design recommends some conventions that enable virtually all current computing environments to get access to the information. These conventions are centred around the use of the Hyper Text Transfer Protocol (HTTP) and the Domain Name System (DNS), along with the reliance of these systems on the TCP/IP (Transport Control Protocol/Internet Protocol). The documents representing Linked Data are transferred in response to HTTP requests as files encoded in one of the available RDF syntaxes based on HTTP Content Negotiation (CN) headers. The RDF specification is ambivalent to the interpretation of data, other than to say that URIs used in different parts of a document should be merged into a single node in an abstract graph representing the document. There are a large number of RDF parsers available, and the RDF specification is not proprietary, making RDF-based Linked Data accessible to a large number of environments currently using the linked document web. Typical web documents can also be integrated as Linked Data, with the most common method being RDFa14 that can be layered onto a traditional HTML document. The use of RDFa makes it possible to include computer understandable data and visually marked up information in the same document, using HTTP URIs for semantic links to other documents. Scientists can utilise Linked Data to publish their data with appropriately annotated semantic links to other datasets. Linked Data systems are not centralised, so there is no single authority to define when datasets can be linked to other datasets. This means that scientists can use Linked Data to publish their results with reference to either accepted scientific theories or new theories as necessary, and have the links recognised and merged automatically by others.

2.5.2 SPARQL

Although Linked Data is useful as an alternative to linked documents, it is not a useful replacement for dynamic web services. In terms of its use by scientists, Linked Data focuses on resolving identifiers to complete data records, as opposed to the filtered or aggregated results of queries. In order to effectively query using Linked Data, scientists would need to resolve all of the URIs that they knew about and perform SPARQL queries on the resulting dataset. This is not efficient, as there may be millions of URIs published as part of each dataset, and there are no guarantees that the documents will not be updated in future.

Although Linked Data URIs are useful for random data access, they can be used in published datasets to enable queries over multiple datasets using SPARQL. SPARQL, a query language for RDF databases, is a graph based matching language that is based on the RDF triple. The RDF semantics specify that co-occurrences of a URI can be represented as a single node on an abstract graph representing a set of RDF triples. In some ways SPARQL is similar to SQL, which is used to perform queries on relational databases, although it is not predicated on a set of tables with defined columns on which to base valid queries. There are many SPARQL endpoints that are publicly available, including a large number of scientific datasets 15. SPARQL queries are constructed based on the structures of the RDF statements in each dataset. In addition, SPARQL queries can be used to generate results without using an RDF database. For example, relational databases can be exposed as RDF datasets using SPARQL queries [29].

SPARQL queries are difficult to transport between different data providers unless the RDF statements in each location are identical. They rely on merging different triples into graphs using either URIs or Blank Nodes (which are locally referenceable RDF nodes). It is rare that different datasets use the same URIs to denote every data record and link. In practice, different datasets use different URIs due to two main issues. Firstly, there is a lack of standardisation for URIs, as the original data provider may not define Linked Data HTTP URIs for their data records. Secondly, even if the original data provider defines Linked Data HTTP URIs, other entities may customise the data and wish to produce their own version using different URIs. In terms of SPARQL queries, it is very useful for all datasets to use the same URIs to identify an item, as it is otherwise difficult to transport SPARQL queries between data providers. However, it is more useful in terms of simple data access to use locally resolvable Linked Data HTTP URIs.

14 http://www.w3.org/TR/rdfa-syntax/
15 http://esw.w3.org/SparqlEndpoints
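The transport problem can be illustrated with a small sketch (the label predicate and the idea of templating are illustrative; the two gene URIs are the Bio2RDF and Neurocommons forms discussed later in this chapter): the "same" query must be rewritten for each provider's URI scheme.

    # Sketch: the same conceptual query, rewritten for two providers that use
    # different URIs for one gene record. The label predicate is illustrative.
    QUERY = ("SELECT ?label WHERE {{ <{gene}> "
             "<http://www.w3.org/2000/01/rdf-schema#label> ?label }}")

    PROVIDER_URIS = {
        "bio2rdf": "http://bio2rdf.org/geneid:4129",
        "neurocommons": "http://purl.org/commons/record/ncbi_gene/4129",
    }

    for provider, uri in PROVIDER_URIS.items():
        # Each provider needs its own variant of the query, differing only in
        # the URI used to identify the record being queried.
        print(provider, ":", QUERY.format(gene=uri))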

2.5.3 Logic

RDF and Linked Data are gradually expanding and improving to provide a basis on which semantically complex queries can be answered using formal logic. The logic component specifies the way that RDF statements relate to each other and to queries on the resulting sets of RDF statements. In the context of RDF this requires the use of computer understandable logic to identify the meaning of URIs and Blank Nodes based on the statements they appear in. The most popular and mature logic systems are currently created using Description Logics (DL), with the most popular language being the family of OWL (Web Ontology Language) variants. OWL-DL, a popular variant, is based on the theory of Description Logics, where the universe is represented using a set of facts, which are all assumed to be true for the purposes of the theory. The theory is by design monotonic, meaning that the addition of new statements will not change current entailments, where entailments are additional statements that are added based on the set of logic rules, which may also be statements under consideration. A non-monotonic theory would make it possible for extra statements to contradict, possibly override, or remove, current entailments, something that would currently require one or more statements to be discarded before any reasoning was able to be consistently performed. Given that scientists are forced to consider multiple theories as possibly valid before they are consistently proven to be valid using empirical experimentation, they should not have to apply a single logic to the entire Semantic Web (also termed by Tim Berners-Lee the Giant Global Graph (GGG) 16). However, there are projects that seek to apply ontologies to all interlinked scientific datasets to facilitate queries on the GGG. Scientists require the ability to decide whether statements, or entire datasets, are irrelevant to their query, including the consequences of logic based entailment.

2.6 Conversion of scientific datasets to RDF

RDF versions of scientific datasets have been created by projects such as Bio2RDF [24], Neurocommons [117], Flyweb [149], and Linked Open Drug Data (LODD) [121] to bootstrap the scientific Semantic Web, or at least the scientific Linked Data web. Where possible, the RDF documents produced by these organisations utilise HTTP URIs to link to RDF documents from other datasets as they are produced, including references to other organisations. This is useful, as it matches the basic Linked Data 17 goals which are designed to ensure that data represented in RDF is accessible and contextually linked to related data. In some cases, the same dataset is provided by two different organisations using two or more different URIs. For example, there is information about the NCBI Entrez Gene dataset in Neurocommons using the URI http://purl.org/commons/record/ncbi_gene/4129 and there is similar information in Bio2RDF, including the Neurocommons data, where it is available, using the URI http://bio2rdf.org/geneid:4129. If the schemas used by these organisations are different, this forces scientists to accept one representation at the expense of the other when they create SPARQL queries.

The data cleaning process is a vital part of the RDF conversion process. All data is dirty in some way [70]. The most essential data cleaning process is the identification of identical resources in different datasets. In non-RDF datasets, identifiers are not typically URIs, so they require context in order to be unambiguously linked with other datasets. The lack of context means that syntactically it is not easy to verify whether two identifiers refer to the same record. The successful conversion of scientific datasets to RDF, with defined links between the datasets, makes it possible for computers to automatically merge information. Even if the data is not semantically valid, there may be unexpected, but insightful, results due to the implications of merges and rules that are applied to the combined data. RDF conversion may incur a cost in relation to query performance and storage space, as an RDF version of a traditional relational database is likely to be larger than a relational database dump format such as SQL. However, the RDF version is functionally equivalent and can be automatically merged with other RDF datasets.
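One common way to record that two such URIs identify the same record is an explicit equivalence statement (a sketch assuming the rdflib package; whether owl:sameAs is the appropriate relationship is itself a modelling decision that the projects above make differently):

    # Sketch: recording that the Bio2RDF and Neurocommons URIs mentioned above
    # refer to the same gene record, so that RDF tools can merge the two.
    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL

    graph = Graph()
    graph.add((URIRef("http://bio2rdf.org/geneid:4129"),
               OWL.sameAs,
               URIRef("http://purl.org/commons/record/ncbi_gene/4129")))

    print(graph.serialize(format="turtle"))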

16 http://dig.csail.mit.edu/breadcrumbs/node/215
17 http://www.w3.org/DesignIssues/LinkedData.html

Compared to the equivalent relational database, an RDF database is typically larger, as RDF databases are typically very highly normalised, while relational database schemas tend to be designed with a tradeoff between normalisation and query efficiency. The size of the database limits the efficiency of queries, a factor which makes it hard to practically merge a large number of datasets into single RDF databases. In terms of scientific datasets the file size expansion, from experience, can be anything from 3 to 10 times, depending on the ratio of plain text to URIs in the datasets.

2.7 Custom distributed scientific query applications

A number of systems have been developed for the single purpose of querying distributed datasets. Some of these systems are focused on a single domain, such as the Distributed Annotation System (DAS) [110], which provides access to distributed protein and gene annotations, while others are focused on a mix of domains related to a single topic, such as the range of datasets related to cancer research that are accessible using the caGrid infrastructure. Other systems such as OGCS-DAI are built on grid computing resources, while others such as BioMart are designed to provide simple methods of mirroring datasets locally, along with the ability for scientists to visually construct queries by picking resources from different datasets from lists that are provided by the datasets.

The Distributed Annotation System allows distributed standardised querying on biological data. However, it does not use RDF, as it is focused on a very specific domain. The biological datasets are represented using discipline specific interfaces and file formats which can be understood by most software. In comparison to an RDF based query solution, the current DAS implementation suffers in that it requires software updates to all of the relevant sites to support any new classes of information. RDF based systems can be extended without having multiple data models implemented in their software. RDF based solutions enable users to extend an original record structure in a completely valid way using their own predicates, enabling customised configuration-based additions as well as the up to date annotation data that DAS is designed to provide.

A variety of different computing and storage grids have been set up in the past to allow scientists to get access to large scale computing resources without having to have a supercomputer available locally. An example of a large scientific grid is the caGrid, developed to enhance the transfer of information between cancer research sites [127]. It includes an example of a query system that requires a complex ontology to determine where queries and subqueries go. It is able to utilise SPARQL as a query language, but it translates queries directly to an internal query format using ontologies to define mappings between different services. It does not allow users to perform these queries outside of the caGrid computing resources, as it requires the use of its internal query engine to execute queries. Although software to communicate with caGrid is provided for general use, it is highly specific and would be difficult to replicate in other contexts. It is able to provide provenance, based on the internal queries that are executed as part of the query, but its emphasis on ontologies as the sole mapping tool means it is unable to be used as a generic model for querying many datasources, irrespective of their syntactic data quality. The mappings are defined in a central registry, making it difficult for scientists to define their own mappings, or integrate their own datasets with different, unpublished, mappings.

OGCS-DAI is a general purpose grid based mapping facility, but it does not have a single standard data format, so it relies on services providing mappings between all possible formats in a similar way to general workflow engines [59].
The use of the resulting documents by scientists to utilise links between datasets requires mapping between the identifiers used by the datasets. This is not simple, given that the system uses a single binary unique identifier property that is presumed to be large enough to be suitable for the distributed system but is not used in any other models, so mapping applications must rely on other properties to decide which data items in the system match linked data from other applications. The OGCS-DAI model allows users to integrate different datasets, but these changes may not be replicable, as the record of the changes may not be visible in any of the provenance information that could be produced by the system. It contains some degree of user context sensitivity, through the use of profiles that decide what level of ontology support is required by different users, but this context sensitivity does not map to the query or dataset sections. In particular, users must completely understand the structure of the underlying datasets to successfully perform a query, including the ontologies, the way the data is arranged, and its syntactic structure, such as whether the data appears in lists or in single items. The understanding of the generic underlying architecture does not necessarily include identifying links between datasets, or the concept of data transformations outside of the metadata that is provided by the service. The metadata is solely provided by services, although this may be extended through software extensions to support user defined metadata. The overall focus of the system on a grid infrastructure, as opposed to an abstract set of linked datasets, makes it unsuitable for general use in local contexts where users may not be able to simulate the grid infrastructure. The model does not seem to be designed for heterogeneous systems where users want to integrate datasets from grid sources, local sources, and other non-grid related sources.

A general purpose scientific example of software that can be used to mirror datasets locally, and to perform queries both locally and on foreign datasources, is BioMart [130]. It can be used to construct queries across different datasets, although there are no global references internally, so it requires users to know which fields map to other datasets, or to rely on the mapping provided by the dataset author. The datasets are curated by the authors, who then publish the resource and register it with the BioMart directory. The mapping language that is used to map queries between datasets does not provide the ability for users to customise the mappings that are used by the distributed datasets, or to utilise any dataset that is not available using either the BioMart format or a generic SQL database. As the mappings are based on the knowledge that scientists have chosen to query datasets from the mart, the context of the links is hard to define outside of the BioMart software. The query that is executed by the BioMart software can be used as part of a provenance record, as it contains references to the datasets that were used, but it focuses solely on the name of the dataset, and there are no references to where to potentially find the dataset other than to use the internal BioMart conventions for searching for datasets with that name.
The ability of the software to mirror datasets locally provides some degree of context, although the process relies on the scientist's ability to arrange for the data quality to be verified and corrected before they perform queries. Some systems aim to merge and clean scientific data for use in local repositories [4, 70, 74, 141]. Some systems attempt to dynamically merge data into single local repositories according to users' instructions [72]. Some projects have managed to provide localised strategies using high performance computing to handle the resulting large set of scientific data [28, 60, 106]. However, it is impractical to expect every dataset to be copied into a single local database for complex queries by users without access to these high performance computing facilities. These facilities incur an associated management overhead that is not practical for many researchers, particularly if the datasets are regularly updated. The BIRN project, although able to query across distributed datasets, relies on a complete ontology mapping to distribute and perform queries [17]. It relies on databases using the relational model, with a single ontology type field to denote the type of each record. This is not practical for large cross-discipline datasets, as BIRN is focused on a single domain, biomedicine, where the ontologies cleanly map between datasets. Arbitrary queries for items, where one does not know what properties will be available, are hard in a relational model due to the lack of recognition of links between datasets. In a similar way, [75] and [145] are useful systems, but they lack the ability to provide users with arbitrary relations, and therefore users must still understand the way the data is represented to distinguish links from other attributes. These models do not recognise the difference in syntactic data quality, except for cases where mappings can be derived using the structure of the record, and there is no allowance provided for semantic data quality normalisation methods. The majority of systems that distribute SPARQL queries across a number of endpoints convert single SPARQL queries into multiple sub-queries, before joining and filtering the results to match the original query. These systems, broadly known as Federated SPARQL, generally require that users configure the system with specific knowledge of properties and types used by each dataset [143]. In many cases they also assume that the URI for a particular item will be the same in all datasets to enable transparent, OWL (Web Ontology Language) inference based, joined results [1, 85, 111], in a similar way to BIRN, although using the RDF model. The SRS system [46] provides an integrated set of biological datasets, with a custom query language and internal addressing scheme. Although the internal identifiers are unambiguous in the context of the SRS system, they do not have a clear meaning when used in other contexts. In comparison to the many formats offered by SRS, and the native document formats of particular scientific datasets, the use of RDF for both documents and query results provides a single method, normalised namespace based URIs, to reference items from any of the involved datasets. SRS provides an internal query language that makes use of the localised database, giving it performance advantages over the similar distributed RDF datasets provided by Bio2RDF.
The approximate number of RDF statements that are required to represent each of the largest 14 databases in the Bio2RDF project is shown in Table 2.1, illustrating the scale of the information currently provided in distributed RDF datasets. In comparison to the variety of RDF datasets available, SRS requires that there is only a single internal identifier for each record, enabling efficient indexing and queries. Although SRS provides versions of each record in a range of commonly used formats, the central data structures are proprietary rather than open and standardised, so it is difficult to provide interoperability between sites. This reduces the ability of other scientists to replicate results if they do not have access to an institution with an SRS instance containing the same datasets. The OpenFlyData model aims to abstract over multiple RDF databases by using them as sources for a large local repository, before directing queries at that virtual dataset. However, it does not aim to systematically normalise the information or direct queries at particular component datasets [150], and it does not contain a method for scientists to express the amount of trust that they personally have in each dataset.

Database                      Approximate RDF statements
PDB                           14,000,000,000
Genbank                       5,000,000,000
Refseq                        2,600,000,000
Pubmed                        1,000,000,000
Uniprot Uniref                800,000,000
Uniprot Uniparc               710,000,000
Uniprot Uniprot               220,000,000
IProClass                     182,000,000
NCBI Entrez Geneid            156,000,000
Kegg Pathway                  52,000,000
Biocyc                        34,000,000
Gene Ontology (GO)            7,400,000
Chebi                         5,000,000
NCBI Homologene               4,500,000

Table 2.1: Bio2RDF dataset sizes

2.8 Federated queries

Many systems attempt to automatically reduce queries to their components, and distribute partial queries across multiple locations. These systems presume that a number of conditions will be satisfied by all of the accessible data providers, including data quality, semantic integrity, and complete results for queries. In science it is necessary to rely on datasets in multiple locations instead of locally loading all datasets. This is due to the size of the datasets, and the continual updating that occurs, including changes to links between datasets based on changes in knowledge.

In terms of this research into arbitrary multiple scientific dataset querying, the most common federated query systems are based on SPARQL queries that are split across a number of different RDF databases. Most systems also allow the translation of SPARQL queries to SQL queries [29, 40, 45]. Others are focused on a small number of cooperating organisations such as BIRN [17] and caGrid [120], or they are only focused on a single topic such as DAS [110]. These systems require estimates of the number of results that will be returned by any particular endpoint to optimise the way the results are joined. These specialised systems also require that there is a single authoritative schema for all users of the system, although there may be mappings between this schema and each dataset. Some federated SPARQL systems also require query designers to insert the URLs of each of the relevant SPARQL endpoints into their queries by redefining the meaning of a SPARQL keyword, making complex and non-RDF datasets inaccessible, and introducing a direct dependency on the endpoint which reduces the ability of scientists to transport the query according to their context [147]. Federated SPARQL systems that focus on RDF query performance improvements may not require users to specify which predicates they are accessing [63].

There are a number of patterns that are implemented by federated query systems, including Peer-to-Peer (P2P) data distribution, statistics based query distribution, and the use of search engines to derive knowledge about the location of relevant data.

The first type of federated query relies on an even distribution of facts across the locations to efficiently process queries in parallel. An even distribution of facts is generally achieved by registering the peers with a single authority, and distributing statements from a central location. However, there may be alternatives that self-balance the system without the use of a single authority. In either case, the locations all need to accept that they are part of the single system and they must all implement the same model to effectively distribute the information without knowledge about the nature of the information. This method is generally implemented in situations where queries need to be parallelised to be efficient, while the data quality is known, or not important to the results of the query. This method is not suitable for a large group of autonomous distributed datasets, as updates to any of the datasets will require that virtually every peer is updated.

The second type of federated query is designed to be used across autonomous, multi-location datasets. It requires knowledge about the nature of the facts in each dataset, and uses this knowledge to provide a mapping between the overall query and the dataset locations.
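
A minimal sketch of the second, statistics based, pattern is given below, assuming two hypothetical SPARQL endpoints and a toy predicate index; it is illustrative only and does not correspond to any particular federated engine discussed in this section.

# Minimal sketch of statistics-based sub-query distribution (not a specific engine).
# The endpoint URLs and the predicate-to-endpoint index are hypothetical placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

# A toy "statistics" index: which endpoints are known to hold triples for a predicate.
PREDICATE_INDEX = {
    "http://purl.org/dc/terms/title": ["http://example.org/sparql-a"],
    "http://xmlns.com/foaf/0.1/name": ["http://example.org/sparql-b"],
}

def run_subquery(endpoint: str, triple_pattern: str) -> list[dict]:
    """Send one sub-query to one endpoint and return its variable bindings."""
    client = SPARQLWrapper(endpoint)
    client.setQuery(f"SELECT * WHERE {{ {triple_pattern} }} LIMIT 100")
    client.setReturnFormat(JSON)
    return client.query().convert()["results"]["bindings"]

def federate(triple_patterns: list[tuple[str, str]]) -> list[dict]:
    """Route each (predicate, pattern) pair to the endpoints believed to hold it,
    then naively concatenate the bindings; a real engine would join on shared variables."""
    bindings = []
    for predicate, pattern in triple_patterns:
        for endpoint in PREDICATE_INDEX.get(predicate, []):
            bindings.extend(run_subquery(endpoint, pattern))
    return bindings

A real system would also estimate result cardinalities per endpoint, as noted above, to decide the join order; this sketch omits that step.
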
The majority of federated SPARQL systems are designed around the basic concept of a high level SPARQL query being mapped to one or more queries, including SPARQL, SQL, Web Services, and others. As the system requires detailed statistics to be efficient, most research focuses on this area, with the data quality assumed to be very high, and in most cases the semantic quality to be very good, resulting in simple schema mappings between different datasets based solely on the information in the overall query. These methods have difficulties with queries that include unknown predicates, as this aspect is used to make overall decisions about which locations will be relevant. The VoiD specification includes a basic description of prefixes that can be used to identify namespaces, but if the URI structure is complex, the unnormalised prefix alone is not useful in deciding which locations will be relevant [3, 40]. The VoiD specification assumes that all references to a dataset will have equivalent URIs, so any datasets that use their own URIs, perhaps to enable their users to resolve a slightly different representation of the document locally, will not work with the VoiD specification.

The third type of federated query is designed to be used against a central knowledge store. This store contains directions about where facts about particular resources can be found. The most common type of central knowledge store is in the form of a search engine. This strategy dramatically reduces the amount of information that is necessary for a client, while reducing the time required to find resources, assuming that the search engine has complete coverage of the relevant datasets. In many cases, items can be directly resolved using their identifier, especially if the authority is recognised, and the URI can be used to discover queryable sources of information about the item. However, a search engine may provide more information than a custom mashup that is created and maintained by users, at the expense of the maintenance and running costs surrounding a very large database. In contrast to federated query systems, a pure Linked Data approach is extremely inefficient in the long term, as it fails to appreciate the scale of the resources that are in use. It inevitably requires users to have alternative access to large datasets, for example, a SPARQL endpoint, for any useful, efficient queries. Some systems may provide a mix of federated query strategies, including a search engine together with a basic knowledge of which datasets are located in each endpoint. These methods are shown in Figure 2.5. For example, the Linked Data methods used to resolve information about a given URI are shown in Figure 2.6, including the possibility of an arbitrary crawl depth based on the discovered URIs.

These federated systems, along with Linked Data and SPARQL queries, are very brittle. If one of the SPARQL endpoints or Linked Data URIs is unavailable, the system may break down. Scientific results need to be replicable by peers as easily as possible. In order to overcome this brittleness, an ideal system needs to provide a context sensitive query model that scientists can personally trust by virtue of their knowledge of the involved datasets and the way different representations of a dataset are related. An ideal scientific data access system cannot rely on the use of specific properties or URIs. By decoupling the scientist’s query from the data locations, an ideal system is free to negotiate with alternative endpoints that may provide equivalent information using alternative URIs or properties. It can negotiate based on the semantic meaning of the query, recognising multiple Linked Data URIs as equivalent and different sets of properties as equivalent in the current context, without requiring the properties or URIs to always be equivalent.

Figure 2.5: Linked Data access methods
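
To make the Linked Data resolution path of Figure 2.6 concrete, the following sketch performs an HTTP GET with content negotiation, parses the returned RDF, and optionally crawls discovered URIs to a fixed depth. It is illustrative only: the starting URI is a hypothetical placeholder, and the code uses the requests and rdflib libraries rather than any system described above.

# Minimal sketch of Linked Data resolution: HTTP GET with content negotiation,
# parse the returned RDF, then optionally crawl discovered URIs to a fixed depth.
import requests
from rdflib import Graph

def resolve(uri: str) -> Graph:
    """Fetch one Linked Data URI as RDF/XML and parse it into an rdflib Graph."""
    response = requests.get(uri, headers={"Accept": "application/rdf+xml"}, timeout=30)
    response.raise_for_status()
    graph = Graph()
    graph.parse(data=response.text, format="xml")
    return graph

def crawl(start_uri: str, depth: int = 1) -> Graph:
    """Crawl outward from a starting URI, merging all retrieved triples."""
    merged, frontier = Graph(), {start_uri}
    for _ in range(depth + 1):
        next_frontier = set()
        for uri in frontier:
            try:
                g = resolve(uri)
            except requests.RequestException:
                continue  # unavailable URIs are skipped: the brittleness noted above
            merged += g
            next_frontier.update(str(o) for o in g.objects() if str(o).startswith("http"))
        frontier = next_frontier
    return merged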

Figure 2.6: Using Linked Data to retrieve information

Chapter 3

Model

3.1 Overview

Scientists face a number of difficulties when they attempt to access data from multiple linked scientific datasets, including data cleaning, data trust, and provenance related issues, as described in Chapter 1. A model was designed to make it easy for scientists to access data using a set of queries across a set of relevant data providers, with design features to support data cleaning operations and trusted, replicable queries. The model addresses the trust, quality, context sensitivity and replication issues discussed in Section 1.1.

The main design goal for the model is to provide replicability using a mapping layer between a scientist’s query and the actual queries that are executed. The mapping takes on two parallel forms, one of which maps the scientist’s query to a concrete query that could then be directly executed, and another which maps the notion of datasets to data locations. The scientist’s query can be mapped to different languages and interfaces as necessary, while the textual strings ensure that there can be more than one namespace inferred from any query, enabling future scientists to merge different uses of the model without affecting backwards compatibility. The mapping layers ensure that no one system or organisation can be a sole point of failure for replicating a query, as long as data is freely and openly accessible.

In addition, the queries are designed to be replicated based solely on textual provenance documents, as compared to similar mapping systems that require coded mapping programs as common components in the mapping workflow. Coded, particularly compiled, mapping programs would limit the ability to implement the model in future using different technologies. Scientists can choose to modify the system to suit their current local context, without affecting the way other scientists perform the same query using public data providers. This makes it possible for scientists to have contextually applicable data access to the relevant linked scientific datasets across their research. Scientists do not have to publicise their lack of trust or their opinion of the data quality in different datasets to extend the system, as there is no global point of entry for the system. When the scientist is ready to publish their results, other scientists should be able to examine their queries, and the provenance of their results, as textual documents, which should lower a technological barrier to reuse compared to implementation specific systems and global registries.

Datasets may not be trusted if they contain out of date information; they may not cleanly link to other datasets; or they may cleanly link to other datasets, and be up to date, but may still not be the best choice for a scientist in their context. The model is aimed at providing scientists with more direct control over which data providers are accessed when queries are replicated, as current Linked Data and distributed SPARQL query models do not fully support this notion, as shown in Figure 2.6. The model attempts to provide a way for scientists to describe the nature of the distributed linked dataset queries that are performed in response to a set of their queries, with an emphasis on collaboration with other scientists and replication by scientists with access to different facilities containing similar data.

The model enables context sensitive, distributed linked data access by mapping a single user query to many different locations, denormalising the query based on the location, and normalising the results to form a single set of homogeneous results. It maps user queries to query types based on query matching expressions. This mapping generates a set of parameters for each query type, including a set of namespaces that are unique to the query type. For each of these parameter sets, a set of providers which implement the query type is chosen, with providers included or excluded based on the namespaces identified in the query parameters. The query parameters are then processed using normalisation rules assigned to the provider in two stages: first on the query parameters, and then on the query after the parameters are inserted into the query template. The resulting query may optionally be compiled and transformed using abstract syntax transformation rules. The query is then executed based on the type of the provider. For example, a SPARQL query could be submitted using an SQL interface or an HTTP interface, or a query could be a simple HTTP GET to the endpoint URL. The results are normalised for each query on each provider in two additional stages, before and after parsing to RDF triples. The RDF triples from all of the queries on all of the chosen providers are then aggregated into a single pool of RDF statements with two additional normalisation stages, before and after serialising the results to an RDF representation.

A comparison of the basic components of the model to typical federated RDF based query models is shown in Figure 3.1. The typical query models, described in Section 2.8, all fail to see data quality as an inherent issue, preferring to focus on semantic or RDF statement level normalisation without allowing for any syntactic normalisation. The query model described here allows for a wide range of normalisation methods, along with the ability to transparently integrate data from providers which have not agreed on a particular URI model, as the typical query models all require that URIs be identical, or that all providers have access to, and have therefore agreed to, mappings between the different URIs.

Figure 3.1: Comparison of model to federated RDF query models
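
To make the matching and planning steps above concrete, the following minimal Python sketch maps a user query to query types, selects providers by namespace, and applies provider-level textual rules to the generated query. It is illustrative only; the class names, fields, and regular expression conventions are assumptions rather than the prototype's actual configuration schema.

# A minimal, illustrative sketch of the query planning pipeline described above.
import re
from dataclasses import dataclass, field

@dataclass
class QueryType:
    name: str
    matcher: str                      # regex applied to the user's query string
    template: str                     # template with {namespace}/{identifier} slots

@dataclass
class Provider:
    name: str
    endpoint: str
    query_types: list[str]
    namespaces: list[str]
    rules: list[tuple[str, str]] = field(default_factory=list)  # (pattern, replacement)

def plan(user_query: str, query_types: list[QueryType], providers: list[Provider]):
    """Match the user query against query types, pick providers by namespace,
    build the concrete query for each provider, and apply its textual rules."""
    planned = []
    for qt in query_types:
        match = re.match(qt.matcher, user_query)
        if not match:
            continue
        params = match.groupdict()           # e.g. {"namespace": ..., "identifier": ...}
        for provider in providers:
            if qt.name in provider.query_types and params.get("namespace") in provider.namespaces:
                concrete = qt.template.format(**params)
                for pattern, replacement in provider.rules:   # template-stage rules
                    concrete = re.sub(pattern, replacement, concrete)
                planned.append((provider.endpoint, concrete))
    return planned

In the full model, the results of each planned query would then pass through the remaining normalisation stages before being merged into a single RDF pool.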

In the example from Section 1.1.2, scientists need to access data from a variety of datasets including genomic, drug, and chemical data. However, they are unable to easily distinguish between trusted data and untrusted data, as some datasets may be useful for some queries but not others, and the datasets are very large, including numerous links between datasets. They are unable to systematically alert other scientists to irregularities in the data quality, as the current data access and publishing methods do not provide ways for scientists to provide corrections or modifications as part of the data access model.

The model is designed around the process of scientists asking questions that can be answered using data from linked datasets. The question contains parameters including the names of datasets and identifiers for data items that are relevant, along with other information that defines the nature of the query, as shown conceptually in Figure 3.2. For example, if the scientist in the example wanted to access information about a drug named Marplan in DrugBank, they could access the data using the steps shown in Figure 3.3. To distinguish between different parts of a dataset, each dataset in the model is given one or more namespaces based on the way the dataset has assigned identifiers to data items internally. The parameters, which in the example include the DrugBank Drug namespace and the search term, “Marplan”, are matched against the known types of queries and data providers to determine which locations and services can provide answers to the question. These parameters are applied to templates for the query types and providers. The data providers may not all be derived from the DrugBank dataset, as shown by the inclusion of an equivalent search on the DailyMed dataset in addition to the DrugBank query. Data providers are defined in terms of their query interfaces, which types of queries they are useful for, what namespaces they contain information about, and what data normalisation rules are necessary. This information is used to match each provider to a user’s query, before constructing queries for each provider based on the parameters, namespaces and data normalisation rules. Scientists can define different providers that select the relevant data cleaning rules in the context of both queries and providers, an improvement on previous models that require global or location specific data cleaning rules.

The model includes profiles that allow scientists to state their contextual preferences about which providers, query types, and rules are most useful to them. The providers, query types, and normalisation rules that are selected will be used to answer questions, even if previous scientists did not use any of the same methods to answer the question in the past. The profile can explicitly include or exclude any provider, query type, or normalisation rule, as well as specifying what behaviour to apply to elements that do not exactly match the profile. To allow more than one profile to be used consistently, the profiles are ordered so scientists can override specific parts of any other profiles without having to recreate or edit the existing profile. Profiles are designed to allow scientists to customise the datasets and queries that they want to support without having to change provenance records that were created by previous scientists. This feature is unavailable in other models that allow scientists to answer questions using multiple distributed linked datasets.
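
The ordered include/exclude behaviour of profiles can be sketched as follows; the field names and the example identifiers are illustrative assumptions, not the prototype's vocabulary.

# A small sketch of ordered profiles with explicit include/exclude decisions and a
# default behaviour for elements that no profile mentions.
from dataclasses import dataclass, field

@dataclass
class Profile:
    name: str
    include: set[str] = field(default_factory=set)   # provider/query type/rule identifiers
    exclude: set[str] = field(default_factory=set)
    default_include: bool = True                      # behaviour for unmatched elements

def is_enabled(element: str, profiles: list[Profile]) -> bool:
    """Earlier profiles in the list override later ones; the last profile's default
    behaviour applies when no profile mentions the element explicitly."""
    for profile in profiles:
        if element in profile.include:
            return True
        if element in profile.exclude:
            return False
    return profiles[-1].default_include if profiles else True

# Example: a scientist's local profile overrides a published profile without editing it.
published = Profile("public", include={"drugbank-sparql"}, default_include=True)
local = Profile("my-lab", exclude={"drugbank-sparql"}, include={"curated-drugbank-mirror"})
print(is_enabled("drugbank-sparql", [local, published]))          # False
print(is_enabled("curated-drugbank-mirror", [local, published]))  # True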

Figure 3.2: Query parameters

Figure 3.3: Example: Search for Marplan in DrugBank

Scientists can take advantage of the model to access a large range of linked datasets. They can then provide other scientists with computer understandable details describing how they answered questions, making it simple to replicate the answers. This goal, generally known as process provenance, is supported by the model through the use of a minimal number of components, with direct links between items, such as the links between providers and query types. The model can be used to expose the provenance for any scientist’s query consistently because each of the components is declared using rules or relationships in the provenance record. This enables other scientists to feed the provenance record into their implementation of the model to replicate the query either as it was originally executed, or with substitutions of providers, queries, and/or normalisation rules using profiles.

Complete provenance tracking requires a combination of the state of the datasets, the scientific question that was being answered, any normalisation or data cleaning processes that were applied to the data, and the places that the data was sourced from. The model is able to produce query provenance details containing the necessary features to replicate the data access, including substitutions and additions using profiles. The creation of annotations relating to the state of a dataset is difficult in general, as scientific datasets do not always contain identifiable revisions; however, if scientists have access to these annotations they can include them using query types and providers that complement their queries.

The model requires a single data format to integrate the results of queries from different information providers without ambiguity. This makes it possible to use and trust multiple data providers without requiring them to have the same or even a compatible data schema. Other distributed query models require a global schema, making it difficult to substitute data providers. The single data format, without a reliance on a single schema, enables scientists to easily integrate data from different locations, gradually modifying data quality rules as necessary, without initially having perfect data quality and a global schema, as is assumed by the theory proposed in Lambrix and Jakoniene [83]. A single extensible data format enables scientists to include provenance annotations with data, where most other provenance methods need to be separated into different data files [17, 35, 60, 74].
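
The point about a single extensible format carrying both data and provenance can be illustrated with a small rdflib sketch; the URIs, identifiers, and the choice of dcterms:source as the annotation property are invented placeholders, not the vocabulary used by the prototype.

# Data triples and provenance annotations living in one RDF graph.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

EX = Namespace("http://example.org/vocab/")
drug = URIRef("http://example.org/drugbank_drugs/1234")

g = Graph()
g.add((drug, RDF.type, EX.Drug))                       # a data statement
g.add((drug, EX.brandName, Literal("Marplan")))        # another data statement
g.add((drug, DCTERMS.source,                           # a provenance annotation, same graph
       URIRef("http://example.org/provider/drugbank-sparql")))

print(g.serialize(format="turtle"))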

3.2 Query types

The model directs query parameters to the relevant data providers and normalisation rules using query types. For example, a scientist needs to initially find all references to a disease, as shown in Figure 3.4, while later only needing to find references from datasets that they trust, as shown in Figure 3.7, or are interested in, as shown in Figure 3.5 and Figure 3.6. In each case the scientist needs to describe the way each query would be structured for each dataset. If there are different methods of data access, different structures, or different datasets, they need to be encapsulated in separate query types to determine which providers, namespaces, and normalisation rules are relevant to each query type.

Each query type represents a query, independent of the actual providers that it may be applied to, even if it is designed to be specific to a particular provider. However, each query type is able to recognise the namespaces that it is targeted at without reference to providers. If a namespace is recognised it may be used to restrict the set of providers that make the relevant query available in the context of that namespace. This structure makes it possible to extend or replace queries that other scientists have created for all namespaces, without changing the syntax. For example, a scientist may want to add annotations to the data description for a namespace. They do not have to modify any of the other query types that make it possible to get the base information in order to add their annotations, and they do not want their annotation query to be used as a datasource for any other namespaces. In the model they simply create a new query type that recognises the query parameters that are used for the base information query types, and restrict it to their namespace, as long as the namespace prefix is identifiable in the parameter set.

When other scientists wish to replicate the query in a different context, e.g., a different laboratory or using a different database, they can create semantically equivalent queries to use another data access interface. They can reuse the generic parameters as part of their new query type, minimising the number of external changes that need to be made for an application to support the new context. The model allows scientists to easily substitute query types using profiles, without having to modify the original query types. The profiles dynamically include and exclude query types without changes to provenance records, other than additions where necessary. In the example shown in Figure 3.3, there was one set of parameters defined for the two query types, with the profile defining which of the query types were used in each situation. It was necessary to create two query types so that the scientist had a reference to use when defining which strategy they preferred. A use case for this would be to enable a scientist to transparently substitute their limited query into a public provenance record which originally used any and all possible datasets as sources for the query. They would need to do this to be sure that they could validate the results using datasets that they had personally curated.

In comparison with typical Linked Data and Federated SPARQL query models, as shown in Figure 3.1, the model introduces a new layer between an overall query and the data providers.
This layer makes it possible for scientists to perform query dependent normalisation, to trust specific endpoints in relation to queries without having to specify a particular endpoint in their query, and to reliably recreate the query using information in the provenance record. The contextual normalisation and trust are not possible if the configured information is generically applied to all queries, as the replicated query would be linked back to datasets and endpoints as the basic model elements, instead of query types which can contextually define namespaces and parameter relevance. These namespaces are more abstract than datasets in models such as VoiD, which include references to URI structures and endpoints as part of the core dataset description.
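
The annotation scenario above, where a new query type reuses the parameters of an existing one but is restricted to a single namespace, can be sketched as follows; the names, regular expression, and namespaces are illustrative assumptions.

# Adding a namespace-restricted annotation query type without modifying the original.
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryType:
    name: str
    parameter_regex: str                 # same parameter set => same regex
    namespaces: frozenset[str] | None    # None means "any namespace"

    def handles(self, namespace: str) -> bool:
        return self.namespaces is None or namespace in self.namespaces

# Existing, generally applicable query type published by another scientist.
base_info = QueryType("base-information", r"^(?P<namespace>[\w_]+):(?P<identifier>\S+)$", None)

# New query type added locally: same parameters, restricted to one namespace.
annotations = QueryType("my-annotations", base_info.parameter_regex,
                        frozenset({"diseasome_diseases"}))

for qt in (base_info, annotations):
    print(qt.name, qt.handles("diseasome_diseases"), qt.handles("drugbank_drugs"))
# base-information True True
# my-annotations True False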

Figure 3.4: Search for all references to disease

Figure 3.5: Search for references to disease in a particular namespace

Figure 3.6: Search for references to disease using a new query type

Each query type defines a relevant set of parameters for each type of data access interface that may be used on a provider using the query type. If a provider with a new type of data access interface is included, current query type definitions can be updated to include a new parameterised template. These changes allow scientists to migrate between datasets gradually, perhaps while they test the trustworthiness or reliability of a new dataset or interface, without requiring them to change the way they use the model to access their data. This change is important as it provides a configuration driven method of migration, where other methods require scientists to change the way they access the data interface to match a new dataset.

Query types and data providers can be assigned namespaces. Each of the assigned namespaces is recognised by the model based on parameters that the scientist uses in their query. In some cases the assigned namespace does not need to match the namespace in the query in order for the query type or data provider to be relevant. In the example shown in Figure 3.4, the scientist indicated that they wanted references to an item, including references from other datasets. This meant that although it was important to know the namespace and identifier for the item, any query type or provider would be relevant. Similarly, if the scientist wanted to restrict their search for references to a specific namespace they could add another parameter, and specify that the parameter was a namespace that the model needed to use to plan which query types and data providers would be relevant, as shown in Figure 3.5. If they wanted to avoid changing the query parameters to ensure that the query was compatible with their current processing tools, they could create a new query type with the same parameter set and only apply it to datasets containing the namespace, as shown in Figure 3.6. Although it is easier to maintain the applications that use the model by creating a new query type, the model itself is easier to maintain if namespaces are given as parameters, as the query may be reusable in the context of other namespaces. It is important to note that the actual queries that would be executed on the data providers did not need to change in the example shown in Figure 3.6; however, the different requirements made it necessary to use the same query in different query types.

Figure 3.7: Search for all references in a local curated dataset

3.2.1 Query groups

In some cases there are different ways to make up queries to answer the same question, depending on the dataset that is used to answer the question. In some circumstances, only one of these queries needs to be performed for a scientist to get sufficient information to answer their query. However, together with normalisation rules, two queries may be able to generate the same results from two separate providers. In these cases a scientist may wish to group queries together to make their process more efficient, while still having redundancy in case a provider is unresponsive. Query types that are grouped together should have the same parameters, so that scientists do not have to know they are using different queries, unless they check the provenance.

To reduce the number of provider queries that are unnecessarily performed, the model includes an optional item that can be used to group query types into formal groups according to a scientist's knowledge that the query types are equivalent. The groups ensure that the same query is not performed on an equivalent provider more than once because it is known to produce a redundant set of results. The groups also ensure that the same query may only need to be executed on one of many providers, or provider groups. A query group must be treated as a transparent layer that only exists to exhibit the semantic equivalence of each of the query types it contains. A user can optimise the behaviour of the model by choosing all or any query types out of each group based on the local context. As shown in Figure 3.8, the query types in a group are semantically but not structurally equivalent: they answer the same question, but the query parameters need to be translated in four different ways to match the data provider interfaces. The example namespace, Digital Object Identifier (DOI), is inherently difficult to query due to the fact that there is no single organisation that maintains a complete copy of the dataset. There are currently only a small number of DOI registration agencies, and there is useful information about DOIs available in many different domain specific datasets, so there needs to be a variety of different types of queries.
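
A brief sketch of the grouping behaviour follows: only one member of each group needs to be executed, with the remaining members kept as redundant fallbacks. The group and query type names are illustrative assumptions loosely based on Figure 3.8.

# Selecting one query type per group for execution.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class GroupedQueryType:
    name: str
    group: str | None   # None means the query type is not grouped

def select_for_execution(query_types: list[GroupedQueryType]) -> list[GroupedQueryType]:
    """Pick every ungrouped query type, but only the first member of each group;
    remaining members are redundant and only needed if the chosen one fails."""
    chosen, seen_groups = [], set()
    for qt in query_types:
        if qt.group is None:
            chosen.append(qt)
        elif qt.group not in seen_groups:
            seen_groups.add(qt.group)
            chosen.append(qt)
    return chosen

doi_queries = [
    GroupedQueryType("bio2rdf-uri", "construct-document-from-doi"),
    GroupedQueryType("blue-obelisk-doi", "construct-document-from-doi"),
    GroupedQueryType("dublin-core-identifier", "construct-document-from-doi"),
    GroupedQueryType("unrelated-label-lookup", None),
]
print([qt.name for qt in select_for_execution(doi_queries)])
# ['bio2rdf-uri', 'unrelated-label-lookup']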

Figure 3.8: Query groups

3.3 Providers

Scientists gain access to datasets using many different methods; for example, they may use HTTP web based interfaces, Perl scripts which access SQL databases, or SOAP XML based web services. The model accommodates different access methods using data providers that are dataset endpoints which support one or more query types. These providers may be implemented in different ways, but each provider is linked to query types, namespaces and normalisation rules. These links give the model information about how a query is going to be formed, which datasets are relevant, and what data quality issues exist. Each provider represents a single method of access to either the whole, or part of, a dataset, and different providers can utilise the same access method and location for a dataset depending on the scientist's context. Each provider is associated with a set of data endpoints that can all be accessed in the same way. Each of the endpoints must handle all of the query types and namespaces that are given for the provider. In the example shown in Figure 3.9, there are three providers representing four endpoints, three actual queries, and two query types. Each of the endpoints connected to each provider can be used interchangeably, reducing the query load on redundant endpoints. The two endpoints connected to the Blue Obelisk DOI query type are not equivalent, as each provider can only have one templated SPARQL Graph URI, so they must be placed in a provider group before they can be considered to be equivalent.
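
The following sketch shows a provider as a thin layer over several interchangeable endpoints, where any one endpoint can serve a query and the others act as fallbacks. It uses the SPARQLWrapper library; the endpoint URLs are taken from the figures but should be treated as illustrative, and the class structure is an assumption rather than the prototype's implementation.

# A provider with interchangeable endpoints: pick one, fall back to the others.
import random
from dataclasses import dataclass, field
from SPARQLWrapper import SPARQLWrapper, JSON

@dataclass
class Provider:
    name: str
    endpoints: list[str]                       # interchangeable access points
    query_types: set[str] = field(default_factory=set)
    namespaces: set[str] = field(default_factory=set)

    def execute(self, sparql_query: str):
        """Try the endpoints in a random order, returning the first successful result."""
        for endpoint in random.sample(self.endpoints, k=len(self.endpoints)):
            try:
                client = SPARQLWrapper(endpoint)
                client.setQuery(sparql_query)
                client.setReturnFormat(JSON)
                return client.query().convert()
            except Exception:
                continue                        # unresponsive endpoint: try the next one
        raise RuntimeError(f"No endpoint for provider {self.name} responded")

cpath = Provider("Bio2RDF Mirrored CPath DOI",
                 ["http://cu.cpath.bio2rdf.org/sparql", "http://quebec.cpath.bio2rdf.org/sparql"],
                 {"construct-document-from-doi"}, {"doi"})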

Figure 3.9: Providers

3.3.1 Provider groups

Providers are thin encapsulation layers over endpoints. However, they do not provide the necessary elements on their own to encapsulate data sources that do not have identical interfaces. Although a simple solution is to implement two providers, this is not efficient, as they will both be used in parallel for all of the relevant queries. A more useful solution, which allows semantically equivalent providers to be grouped, allows the implementation to know that the providers do not all have to be implemented in the same way to be equivalent in terms of the RDF triples that they will return in response to the same user queries. In the example shown in Figure 3.10, there are two providers that have equivalent SPARQL Graph URIs, so they do not need to be grouped, as the interface URL can be duplicated on a single provider. There are two providers that match the same query and can be used equivalently, but they have different SPARQL Graph URIs, so it is necessary to use a provider group to make sure that the model can recognise that the two providers are equivalent.

Figure 3.10: Provider groups

Provider groups do not functionally change the nature of a query unless users decide not to use all redundant, equivalent providers as sources for a query. In this case, the provider groups would encapsulate a set of heterogeneous providers which are equivalent, and could be considered substitutes for any of the query types on all of the providers, based on the query groups that the query types are members of. Any of the providers could then be chosen at random, or using another strategy, to answer a given query.

3.3.2 Endpoint implementation independence

The system does not require SPARQL endpoint access, as other resolution methods could be designed and implemented as part of the current provider model; however, the result of each query needs to be in RDF to be included with the results of other queries. In particular, if RDF wrappers are written for the Web Services that currently form the bulk of the bioinformatics processing services, they can be included as sources of information for queries that they are relevant to. In some cases, there is a need to provide simple static mappings between data items and other known URIs. The number and scale of these mappings may make it more efficient to dynamically generate the mapping on request. In these cases, the mapping tool could be located in the model if the information content required for the mapping is provided in the query, or it can be created with a lightweight widget that takes the initial data item as input and provides the relevant RDF snippet as output. In Figure 3.8, there are four different providers. There is one direct HTTP URL that can be accessed using HTTP GET, while the others must be accessed using HTTP POST. The information available from each provider may be equivalent; however, the endpoints must all be accessed differently in response to the same query.
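
A sketch of the "lightweight widget" idea above is given below: a small function that takes a data item and dynamically generates an RDF snippet mapping it to another known URI scheme. The Bio2RDF URI pattern follows the {namespace}:{identifier} convention shown in the figures; the other URI pattern, the identifier, and the use of owl:sameAs are illustrative assumptions.

# Dynamically generate a static-mapping RDF snippet for one data item.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

def mapping_snippet(namespace: str, identifier: str) -> str:
    """Return a Turtle snippet linking a normalised URI to a source-specific URI."""
    normalised = URIRef(f"http://bio2rdf.org/{namespace}:{identifier}")
    source_specific = URIRef(f"http://example.org/{namespace}/resource/{identifier}")
    g = Graph()
    g.add((normalised, OWL.sameAs, source_specific))
    return g.serialize(format="turtle")

print(mapping_snippet("geneid", "4129"))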

3.4 Normalisation rules

Normalisation rules are transformations that allow scientists to define what changes are necessary to the results of queries from particular providers to form a clean set of results. Although the ability to assign transformations to the results of queries is not novel, in the context of this work it enables scientists to utilise simple alternatives to coding or arbitrary workflow technology in most cases, even though the transformations themselves may be complex depending on the situation. Normalisation rules allow different scientists to express their opinions about data by changing the way it is represented to fit their context. The changes may be simply related to data quality, but they may also be related to trust, as untrusted elements can be removed or the weight of an assertion can be changed using the rules. The level of trust that scientists put into the results of queries could be increased by publishing these modifications along with the data, so as not to portray the information as being directly sourced from the original provider. As well as allowing scientists to define and share their ideas about what changes need to be made to data sourced from particular providers, normalisation rules can be formulated to apply to more than one provider as applicable, making them available as reusable tools that may apply to many entities. This may reduce the complexity of dealing with a large number of datasets, if at least some of them are compatible, even if they do not match the normalised form that this model expects.

The data cleaning process is more difficult for science datasets compared to business datasets. The most well known data cleaning methods involve business scenarios, where the conventions for what clean data looks like are well recognised. For example, it is easy to identify whether a customer record is complete if it contains a valid address and phone number, and the email address fits a simple pattern. In comparison, in science it is difficult to identify whether a gene record is complete, as it may be valid but not be linked directly to a known gene regulation network, or it may be valid but not have a protein sequence, as it is a newly discovered type of gene that is not used to generate proteins. The scientific data quality conditions rely on a meaningful, semantic understanding of the data, whereas data is assumed to be clean in business scenarios if it is syntactically correct. The normalisation rules used with the model need to support both syntactic and semantic normalisation, within the bounds of the amount of data that is accessible to the model.

In the example shown in Figure 3.11, there are three different DOI schemas, the BIBO, Blue Obelisk and Prism vocabularies. The datasets in the example contain three different types of URIs, Bio2RDF, BioGUID and NMRShiftDB, so there must be both semantic and syntactic normalisation rules to integrate the results into a single homogeneous document. In the case of one property, the Dublin Core Identifier, there are four alternatives that have not been normalised, to demonstrate the range of data that is available. The results of the query may include data from the four endpoints; however, in the example, there are no results from NMRShiftDB or the Bio2RDF endpoints for this DOI. The schema that was chosen by the scientist as the most applicable was the Prism vocabulary, so the normalisation rules mapping the Prism vocabulary to the BIBO and Blue Obelisk vocabularies are not shown. The URI scheme that was preferred was the normalised Bio2RDF URI, so the mapping rules are shown for NMRShiftDB and BioGUID to Bio2RDF. The normalisation rules were only applied to the results from BioGUID in the example, as it was the only provider that returned information, and it contained redundant information given using two different vocabularies.

Figure 3.11: Normalisation rules

In this example the model is able to syntactically normalise the data, but without some further understanding of the way the DOI was meant to be represented and used, it cannot semantically verify that the data is accurate. Further rules, implemented after the data is imported into the system or after it is merged with other results, could contain the semantically valid conditions that could be used to remove or modify the results.

Normalisation rules are used at various stages of the query process, as highlighted in Table 3.1. Normalisation rules are applied at the level of data providers to provide the most contextual normalisation with the least disruption to generally used query types. For each query type on each provider, the normalisation rules are used on each of these separate streams. All normalisation rules from all streams may be applicable to the two final stages: after all results are merged into the final pool of results, and after these complete results are output to the desired file format.

Stage                                   Mode
1. Query variables                      Textual
2. Query template before parsing        Textual
3. Query template after parsing         In-Memory
4. Results before parsing               Textual
5. Results after parsing                In-Memory
6. Results after merging                In-Memory
7. Results after serialisation          Textual

Table 3.1: Normalisation rule stages

Initially, normalisation rules are used for each query type before template variables are substituted into the relevant template. After the template is ready, normalisation rules may be applicable to the entire template, both when the template is in textual form and after the template is parsed by the system to produce an in-memory representation of the query. This enables scientists to process the variables independently of the template, before processing the whole template to determine whether it is consistent, if that cannot be determined using the variables alone. The normalisation rules are then used on the textual representation of the results, if this is available with the relevant communication method. The normalised text is then parsed into memory using the single data model that the system recognises, and normalisation rules are applied to the in-memory representation of the results before merging results from different streams. After the results are merged, normalisation rules are applied to the in-memory representation of the results before serialising the complete pool of results to an output format. In most cases, this output format will be the requested format, unless the request was not compatible with the single data model, in which case the pool of results must be serialised in a different way based on the in-memory representation, or a textual transformation on the output would be used.

The same normalisation method would not be used for each stage. For example, the query variables are normalised using textual transformations, while the parsed in-memory results are normalised using algorithms based on the common data model. An implementation would need to support each of the different methods that were used with normalisation rules attached to each relevant provider before a query could be successfully executed. There is no guarantee that leaving out a normalisation rule will produce a consistent or useful result, unless the scientist has directed in their profiles that the normalisation rule is not useful to them.
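
The following sketch applies normalisation rules in two of the stages from Table 3.1: stage 4 (textual, before parsing) and stage 5 (in-memory, after parsing to RDF). The textual rule mirrors the BioGUID-to-Bio2RDF replacement in Figure 3.11, while the in-memory rule and its predicate are illustrative assumptions.

# Textual (stage 4) and in-memory (stage 5) normalisation of query results.
import re
from rdflib import Graph, URIRef

TEXTUAL_RULES = [
    # stage 4: rewrite provider-specific URIs in the raw result document
    (r"http://bioguid\.info/pmid:", "http://bio2rdf.org/pubmed:"),
]

def normalise_results(raw_rdf_xml: str) -> Graph:
    # Stage 4: textual normalisation on the serialised results.
    for pattern, replacement in TEXTUAL_RULES:
        raw_rdf_xml = re.sub(pattern, replacement, raw_rdf_xml)

    # Stage 5: in-memory normalisation on the parsed triples, e.g. dropping a
    # predicate that the scientist's profile treats as untrusted.
    graph = Graph()
    graph.parse(data=raw_rdf_xml, format="xml")
    untrusted = URIRef("http://example.org/vocab/unreviewedAnnotation")
    for triple in list(graph.triples((None, untrusted, None))):
        graph.remove(triple)
    return graph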

3.4.1 URI compatibility

Although normalisation rules are generic transformations, they are a useful tool when applied to the case of queries distributed across a set of linked datasets. Although many datasets contain references to other datasets, the format for references to the same dataset is not generally consistent across datasets. Even if a resolvable URI is given as a reference, the resolution point is contained in the reference, along with the context specific information that defines which item is being linked to. A naive approach to resolving the URI would be to use the protocol and endpoint that are given to find more information about the item. However, if the resolution point is not controlled by the user, any data quality measures using normalisation rules, or redundancy using multiple providers, may not be utilised.

To solve this issue, normalised URIs can be created that scientists can easily resolve using resources that they control. If normalised URIs are given as a basic standard, they can be further utilised to change the references in documents to match the scientist's context, making it simple to continue using the model described here to maintain control over the way the data is represented and the way references are resolved. This enables scientists to work with data from many datasets, but choose where to source the information from according to their context. Although this may tend to encourage the explosion of many different equivalent URIs, if the URIs are resolvable by other users, the new URIs may still be useful. If there is a standard format for the URIs, and a mapping between the local URIs and that set can be published as normalisation rules, scientists could recognise the link and normalise it before resolving it to useful information. This can be performed without continuously relying on one scientist for infrastructure, as long as a rough network is established to share information about providers and normalisation rules, along with any useful query types and namespace information.

A goal of this research is to both make information recognisable and trustworthy, while simultaneously making it accessible in many different contexts. Along with social agreements about link and record representations, the use of this model to represent what data is available, and what the access methods are, makes it possible to decentralise infrastructure. Even if there is still a global resolution point, the use of that resolution point can be mirrored in other circumstances to minimise interruptions to research. The model is able to do this without a new file format, access protocol or resolution interface.

By contrast, the LSID protocol was designed to solve data interoperability and access issues for the life sciences. However, it has not been extensively used. There are likely a range of reasons, but with respect to this research, it may not have been widely adopted due to its insistence on a completely new data access protocol, a range of resolution methods, and the lack of a single data format, although it insisted on the use of RDF for metadata.

The use of LSIDs would have enabled scientists to recognise the LSID protocol based on the URI, and to access original data format files using one of the supported interfaces. However, it did not include the ability to apply transformations to the information, as it insisted that data remain permanently static after its publication. In addition, scientists still needed to be able to process and supply data in different formats, making it difficult, or impossible, to merge data from different sources. Federated SPARQL approaches require that the URIs used on one endpoint to represent a resource must be exactly the same as the URIs used on other endpoints to represent that resource, as any other URI cannot be presumed to be equivalent without specific advice. The normalisation rules in this model are not explicitly linked to the namespace model because they are more relevant to providers than to namespaces, and the definition of whether the URI is equivalent is dependent on the provider rather than the namespace definition. An example of this can be seen in Figure 3.12, where different providers have different names for the same item and the same query template can be used with all of the providers.

[Figure 3.12 illustrates a "Find references" query for the data item diseasome_diseases:1689 expressed once as a SPARQL template, CONSTRUCT { ?s ?p <${normalisedURI}> . } WHERE { ?s ?p <${endpointSpecificURI}> . }, which is then normalised into endpoint-specific queries for the DrugBank, Dailymed, Sider and MediCare SPARQL endpoints at www4.wiwiss.fu-berlin.de and a private disease review endpoint at mycompany.com, with the results integrated into a single normalised document.]

Figure 3.12: Single query template across homogeneous datasets

The use of normalisation rules enables the URIs used on each provider to be normalised before being returned to users, so scientists can recognise the URI, and the software can recognise where to perform further queries that relate to that namespace. In order not to remove the link to the previous context, normalisation rules can be used together with a static statement of equivalency back to the alternative URIs that were used on any of the endpoints, so scientists can have a list of URIs that relate to particular resources, although they may still focus on the normalised URI to make further querying with the model easier. A Federated SPARQL system could be designed using the configuration model, without references to query types or parameters, by recognising which normalisation rules and namespaces are applied to each known provider, removing the reliance on preconfigured query types for arbitrary SPARQL endpoints. Although the namespaces are designed to be recognised in the context of the query types, the model could be implemented with named standard parameters to provide efficient access to the places where namespaces are known to be located. The model also allows the definition of rules that remove particular statements from the results. These rules make it possible to reduce the amount of information as necessary without changing the interface for the provider.
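To make this concrete, the following sketch (in Python, for illustration only) shows how a single output normalisation rule might rewrite the endpoint-specific Diseasome URIs from Figure 3.12 into the Bio2RDF form used later in Chapter 4, while emitting a static equivalence statement back to the original URI so that the previous context is retained. The exact endpoint URI pattern and the use of owl:sameAs as the equivalence predicate are assumptions made for the sketch, not part of the model definition.

```python
import re

# Hypothetical output normalisation rule for the Diseasome provider in Figure 3.12.
ENDPOINT_URI = re.compile(
    r"http://www4\.wiwiss\.fu-berlin\.de/diseasome/resource/diseases/(\d+)")
NORMALISED_FORM = "http://bio2rdf.org/diseasome_diseases:{0}"
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"   # assumed equivalence predicate

def normalise(uri):
    """Return the normalised URI plus an equivalence triple back to the
    endpoint-specific URI, so the link to the previous context is not lost."""
    match = ENDPOINT_URI.fullmatch(uri)
    if match is None:
        return uri, []                      # this rule does not apply
    normalised = NORMALISED_FORM.format(match.group(1))
    return normalised, [(normalised, SAME_AS, uri)]

print(normalise(
    "http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/1689"))
```

A Federated SPARQL implementation built on the same configuration could apply the inverse mapping when rewriting query patterns for a particular endpoint.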

3.5 Namespaces

The namespaces component is centred around identifying the ways that a dataset is structured. There are two main considerations for the decision to split a dataset into namespaces: the way that identifiers are used and the way that properties are used. In a traditional relational database, primary keys do not have to be given contexts, because they inherit their nature from the table they are located in. In non-relational datasets there is still a need to define unique identifiers for items; however, as the constraints are not given by the structure of the database, the identifiers must be assigned using some other method. This model does not seek to convert all datasets into a relationally compatible model, where each aspect is normalised and given a unique identifier set based on its data type, as there may be benefits to sharing an identifier set across different types of items in the same dataset. These benefits include improved accessibility stemming from a simpler method of searching the dataset without previously knowing its internal schema. Namespaces can be used to define the scope of an internal identifier, so scientists can know the context that the identifier was defined in, although the namespace may not have a clear semantic meaning if data items of different types are mixed in a single namespace.

The second consideration, relating to properties, stems from the requirement that the model supports multiple datasets that do not have a single schema for all data items. Scientists cannot be assured that a given property will exist, although queries for an item may not cause errors. To specify which properties actually exist, namespaces may be used to link terminologies, such as ontologies, to datasets, specifically through particular providers of those datasets. If a provider is an interface for a particular property, then a namespace can be used to indicate that the provider is an interface for that property. This is possible even if the namespace is not being used to identify the data items using the property or the location of the property definition. In some cases, the set of queries attached to a provider can be used to provide hints about which properties may be supported by a provider, particularly when the query type has been annotated as being relevant to a particular property via a known namespace.

The normalisation of URIs is based on the existence of identifiers that represent parts of datasets that have unique identifiers available inside of them. These identifiers need to be easily translated into different forms without ambiguity. As there is not always a direct link between the normalisation rule component of the model and the namespaces, any links that are made between namespaces and the normalisation rules that implement the translation are informative, rather than normative. Not all normalisation rules relate to namespaces and identifiers, and not all namespaces require normalisation rules. If the model required links between normalisation rules and namespaces, the coverage would not be complete, and the link may not be obvious outside of the context of providers. The use of the authority property in the namespace definitions enables scientists to define the provenance of a namespace along with its use, as the namespace URI is used as the link by all other parts of the model. This provides scientists with a point of contact if they have queries about the definition or use of the namespace.

3.5.1 Contextual namespace identification

The model does not require that namespaces be identified globally. A particular identifier, generally associated with a single namespace, may have different meanings when applied to different queries, so it may be identified as one namespace in one query, and another namespace in another query. The contextual nature of namespaces makes them useful for integrating different datasets into the same system. Although it may be more applicable to normalisation rules, the ability to decide what the meaning of a namespace is for a particular query makes it possible to modify the common representation for a document to match the locally understood representation. This may include avoiding making large assertions about things that are not necessarily applicable to one dataset or query, even though they may be applicable to other datasets. The design of the model allows scientists to include only specific namespaces in both query types and providers, so scientists can decide that a namespace is only applicable to a set of queries without requiring reference to providers of information, or the namespace can be deemed only relevant to a particular provider.

The namespace mechanism is designed with the idea that references from the environment would be mapped into the model in one or more ways, allowing different scientists to assign the same external reference, such as a short prefix, to more than one namespace in different contexts. This allows scientists to reuse other model based information without semantic conflicts with their own models. There are likely to be many situations where this support is necessary once the model is deployed independently across different areas by different authorities. To allow computational identification of conflicts, the namespace mechanism must contain a direct link to the authority that assigned the namespace. This allows scientists to identify cases where a single authority has assigned the same reference, such as a prefix, to more than one case, indicating that the authority did not have a clear purpose for the string. It also allows scientists to trust namespaces based on the authority, although if there are untrusted sources of data the reference to the authority may not carry the weight that a set of namespace definitions from a completely trusted authority would.

3.6 Model configurability

The system is designed so that semantically similar queries will be compatible with many different data site patterns. Public definitions of which queries are semantically similar, or relevant in a particular case, can be included or excluded, and local, private and efficient queries can be added or substituted using profiles. The namespace classification is designed so that it relies on functional keys, although the functional key may look like a URI. The functional key may match more than one namespace classification, enabling scientists to take advantage of currently published classifications in some cases, while concurrently defining their own internally recognised, and maintained, namespaces.

3.6.1 Profiles

The profile component was included to allow varying levels of flexibility with respect to the inclusion or exclusion of each of the profileable components, i.e., query types, providers and normalisation rules. Profiles make it simple for scientists to customise the local configuration by adding their own RDF configuration snippets into the set of RDF statements that are used to configure the system. The semantics surrounding profiles provide for input from both scientists and public configuration creators, although local scientists can always override the recommendations given by configuration creators. The profiles are also stacked, enabling scientists to define higher or lower level profiles to extend and filter the set of publicly published configurations. For a query to be executed on a particular provider, both the query and the provider must be included by one of the profiles. If this process results in the inclusion of the provider for that query, the profile list can also be used to determine which of the normalisation rules are used to transform the input and output. If a lower level, more relevant profile explicitly excludes or includes a component, then the higher level, less relevant profiles will not be processed for that component. If a profile specifies that it is able to implicitly accept query types, providers, or normalisation rules, then the default include or exclude behaviour specified by the author of the configuration will be used to decide whether the component matches the profile; the same default applies if no profile matches the component. The default include or exclude behaviour was designed to allow a configuration editor to suggest to scientists using the configuration that they are not likely to find the component relevant, or that the component relates to an endpoint that they are not likely to be able to access from the public web.

The simplest possible configuration consists of a single query type and a single provider, as shown in Figure 5.2. The query type needs to be configured with a regular expression that matches scientist queries. The provider needs to be configured with both a reference to the query type, and an endpoint URL that can be used to resolve queries matching the query type. Although the example is trivial, in that the user's query is directly passed to another location, it provides an overview of the features that make up a configuration. One particular feature to be noted is the use of the profile directive to process profile exclude instructions first, and then include in all other cases. In this example, there are no profiles defined, resulting in the query type and the provider being included in the processing of queries that match the definitions.
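The following minimal sketch (illustrative Python, not the prototype's implementation) shows one way the layered include/exclude semantics described above could be evaluated for a single component; the class and field names are assumptions, and the real configuration is expressed in RDF rather than Python objects.

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """Illustrative profile: explicit decisions plus an implicit-inclusion flag."""
    explicit_includes: set = field(default_factory=set)
    explicit_excludes: set = field(default_factory=set)
    allow_implicit: bool = True

def is_included(item_uri, item_default_include, profiles):
    """Walk profiles from most to least relevant: the first explicit decision
    wins; a profile that allows implicit acceptance defers to the default
    recommended by the configuration author; otherwise keep looking."""
    for profile in profiles:                       # ordered, most relevant first
        if item_uri in profile.explicit_excludes:
            return False
        if item_uri in profile.explicit_includes:
            return True
        if profile.allow_implicit:
            return item_default_include
    return item_default_include                    # no profile matched

# Example: a local profile excludes a public provider whose data the scientist
# already mirrors locally; an unrelated query type remains implicitly included.
local = Profile(explicit_excludes={"http://example.org/provider:publicMirror"})
print(is_included("http://example.org/provider:publicMirror", True, [local]))  # False
print(is_included("http://example.org/querytype:resolve", True, [local]))      # True
```

For a query to actually run on a provider, the same check would be applied to both the query type and the provider, and then to each candidate normalisation rule.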

Although profile decisions need to be binary to ensure replicability, the layering of different profiles makes it possible to have various levels of trust. A complex layered set of profiles may be confusing to users, but scientists can preview the included and excluded items independent of an actual query. This makes it possible to assert levels of trust before designing experiments.

If the set of configurations that the scientist uses for data access is refreshed periodically, possibly from an untrusted source, scientists would need to preview the changes, especially if any profiles have implicit includes and a configuration is not completely trusted. This may be a security issue, but if static configuration sources are used, then the profiles will have the effect of securing the system rather than making it more vulnerable. This security results from the ability to choose providers based on their relevancy and, if needed, to restrict all queries to a set of curated providers that will not change when a public configuration is updated, as may happen at regular intervals.

3.6.2 Multi-dataset locations

In the case where a particular query needs to operate on a provider where more than one dataset needs to be present, the model allows the query to specify which of its parameters are namespace parameters, along with a definition of how many of the namespace parameters need to match against namespaces on any eligible providers. This allows for the optimisation of queries, without relying on scientists to create new query types for each case.

This allows scientists to create local data silos where they store copies of the most used datasets, while still being able to link to distributed SPARQL query endpoints in some cases. Scientists thereby gain the benefits of distributed and aggregated datasets without losing the consistent linking and normalised documents that this model can provide.

The model allows groupings of both equivalent providers and endpoints, and allows scientists to contribute their information to a shared configuration. The shared configuration is not limited to a single authority, as the configurations for the model are designed to be shareable, and independent of the authority that created them originally, apart from their use of Linked Data URIs as identifiers for items in the configuration documents.

3.7 Formal model algorithm

The formal definition of how the model is designed to work is given in this section. It does not contain some features that were included in the prototype, such as the choices based on redundancy that are provided by provider and query groupings; however, these can be included by restricting the range of the loops over query groups (qg) and provider groups (pg). The use of query and provider groups is optional, to allow for the simplest mechanism of providing and publishing configuration information using the model. If a query type or a provider does not appear in any group, it should be included as long as it passes the tests for inclusion for the scientist query. A simplified code sketch of the main loop is given after the numbered steps.

1. Let u be the scientist query

2. Let r be the overall set of inputs given in u;

3. Let f be the ordered set of profiles that are enabled;

4. Let π be the pool of facts that will contain the results of the query;

5. Let Q be the set of query types in the global set of query types (GQ) that match u

6. Let QG be the set of query groups in the global set of query groups (QGS) that match u

7. Let P be the global set of providers

8. Let PG be the global set of provider groups

9. Let N be the global set of namespaces

10. Let GNR be the global set of normalisation rules to be used to modify queries and data resulting from queries

11. For each query group qg in QG

12. For each query type q in qg;

(a) Check that q matches the constraints of the set of profiles in f, and if it does not match then go to the next q

(b) If q matches the conditions in f;

(c) Let np be the set of namespace parameters identified from r in the context of q

(d) Let nd be a set of the set of namespace definitions in N which can be mapped from each element of np

(e) Let nc be the set of namespace conditions for q

(f) If nd does not match q according to nc, then go to the next q

(g) If nd matches q according to nc;

(h) For each provider group pg in PG

(i) For each provider p in pg;

i. Check that p matches the constraints of the set of profiles in f, and if it does not match then go to the next p

ii. If p matches the conditions in f;

iii. If p matches q according to nc and nd, including the condition that nd may not need to match p if this condition is included in nc;

iv. Let s be the set of normalisation rules that are attached to p, and let η be an empty set of facts for this provider;

v. Substitute inputs from r in any templates in q, using rules that are defined in p and relevant to the query parameter denormalisation stage

vi. Substitute inputs from r in any templates in p, using rules that are defined in p and relevant to the query parameter denormalisation stage

vii. Parse q into an in memory representation mq if necessary for the communication type of provider p

viii. Normalise mq based on any normalisation rules defined in p that are relevant to the parsed query normalisation stage

ix. Let α be the document that is returned from querying q on p in the context of r

x. Normalise α according to the rules defined for p that are relevant to the before results parsing stage

xi. Parse the normalised α and then normalise the facts based on the normalisation rules defined in p that are relevant to the parsed results before pool stage

xii. Then include the normalised results in η

xiii. Let β be the set of non-resolved facts found by parsing the static templates from q after substituting parameters from the combination of r, q, p, and nd

xiv. Normalise β according to s, and include the results in η

xv. Include η in π

13. Normalise statements from π which match the after pool results stage in the ordered set of normalisation rules relevant to all providers in P

14. Serialise facts in π

15. Normalise the serialised facts which match the after pool serialisation stage in the ordered set of normalisation rules relevant to all providers in P

16. Output the result of q as the serialised document
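The sketch referred to above is given below. It is a deliberately simplified, runnable Python illustration of the central loop only: profiles, groups, the fact pool normalisation stages (steps 13 to 15) and real SPARQL execution are omitted, execute() is a stub returning a canned result, and all class and variable names are assumptions rather than the prototype's actual vocabulary. The example data reuses the Diseasome endpoint from Figure 3.12 and the Bio2RDF URI form from Chapter 4.

```python
import re
from dataclasses import dataclass, field
from string import Template

@dataclass
class QueryType:
    input_pattern: re.Pattern     # recognises the scientist query u (steps 5, 12)
    template: Template            # query template with ${...} parameters (step v)

@dataclass
class Provider:
    endpoint: str
    namespaces: set
    query_types: list
    output_rules: list = field(default_factory=list)  # (regex, replacement) pairs

def execute(provider, query):
    """Stand-in for resolving the query on the provider (step ix)."""
    return ["<%s/resource/diseases/1689> <ex:p> <ex:o> ." % provider.endpoint]

def run(user_query, providers):
    pool = []                                             # pi (step 4)
    for provider in providers:
        for qt in provider.query_types:
            match = qt.input_pattern.match(user_query)
            if match is None or match.group("ns") not in provider.namespaces:
                continue                                  # steps (a)-(f) and iii
            query = qt.template.substitute(**match.groupdict())   # steps v-vii
            for fact in execute(provider, query):         # step ix
                for pattern, replacement in provider.output_rules:
                    fact = pattern.sub(replacement, fact)  # steps x-xi
                pool.append(fact)                          # steps xii-xv
    return pool                                            # steps 13-16 omitted

qt = QueryType(
    re.compile(r"^(?P<ns>\w+):(?P<identifier>[\w\.-]+)$"),
    Template("CONSTRUCT { ?s ?p ?o } WHERE { <http://www4.wiwiss.fu-berlin.de/"
             "diseasome/resource/diseases/${identifier}> ?p ?o }"))
diseasome = Provider(
    endpoint="http://www4.wiwiss.fu-berlin.de/diseasome",
    namespaces={"diseasome_diseases"},
    query_types=[qt],
    output_rules=[(re.compile(r"http://www4\.wiwiss\.fu-berlin\.de/diseasome/"
                              r"resource/diseases/(\d+)"),
                   r"http://bio2rdf.org/diseasome_diseases:\1")])
print(run("diseasome_diseases:1689", [diseasome]))
```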

3.7.1 Formal model specification

The model requires that the configuration of each item includes a number of mandatory features. The list of features that are required for each part is given in this section. This list of features may be extended by prototypes if they need information that is specific to the communication methods that they are using. For example, an HTTP based web application may require additional information for each query type to determine the way different user queries match based on their use of different HTTP methods such as HTTP GET and HTTP POST.

Query type

A query type should contain the following features:

• Input query parameters : The information from the user, including namespaces where required.

• Namespace distribution rules : i.e., which combinations of namespaces are acceptable, and, if it is not clear, how the namespace identifiers are given in the query.

• Whether namespaces are relevant to this query : If namespaces are not relevant, the namespace distribution parameters are ignored.

• Whether to include endpoints which do not fit the namespace distribution parameters but are marked as being general purpose.

• Query templates : The templates for the queries on different endpoint implementations should be given in the query, with parameters in the template where available.

• Static templates : Templates that are relevant to the use of the query, but do not rely on information from the endpoint, other than the provider parameters. These templates represent facts that can be directly derived from the queries, and will not be subject to any normalisation rules that are specific to this query type and the provider it is executed using, with the exception of normalisation rules that are applied after the facts are included in the output pool.
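As an informal illustration only, a query type carrying the features above might be recorded as follows; the key names paraphrase the feature list and are not the RDF vocabulary actually used by the prototype, and the ex: predicate in the static template is a placeholder.

```python
# Illustrative query type record; the ${...} parameters follow the template
# convention used in Figure 3.12.
query_type = {
    "input_parameters": r"^(?P<ns>\w+):(?P<identifier>[\w\.-]+)$",
    "namespace_distribution": {"namespace_parameters": ["ns"],
                               "minimum_matches": 1},
    "namespaces_relevant": True,
    "include_general_purpose_providers": False,
    "query_templates": {
        "sparql": "CONSTRUCT { ?s ?p ?o } "
                  "WHERE { <${endpointSpecificURI}> ?p ?o }",
    },
    "static_templates": [
        # A fact derivable from the query itself, independent of the endpoint
        # response (the ex:retrievedFrom predicate is illustrative).
        "<${normalisedURI}> <ex:retrievedFrom> <${endpointUrl}> .",
    ],
}
```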

Provider

A Provider should contain the following features:

• Namespaces : Which namespace combinations are available using this provider. This may be omitted in some cases if the provider may contain information about any namespace, but it is hard to pinpoint which ones it may actually contain at any point in time.

• Whether this provider is general purpose : This is used to match against any query types that are applied to this server that claim to be applicable to general purpose endpoints.

• Query types : Which query types are known to be relevant to this provider. If a system does not require preconfigured queries to function using the model, these may be omitted. An example of where they may be omitted is in the use of the normalisation, namespace and provider portions in a federated SPARQL implementation.

• Endpoint communication type : The communication framework that is used for this provider. For example, HTTP GET, HTTP POST, etc.

• Endpoint query protocol : The method that is used to query the endpoint. For example, SPARQL, SOAP/XML etc.

• Normalisation rules : Which rules are necessary to use with the given queries on the given namespaces for this endpoint.
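A provider record carrying the features above might, purely for illustration, look like the following; the endpoint URL is the DrugBank SPARQL endpoint from Figure 3.12, while the namespace list, the ex: identifiers and the rule references are assumed values rather than real configuration entries.

```python
# Illustrative provider record mirroring the feature list above.
provider = {
    "namespaces": ["drugbank_drugs", "drugbank_targets"],   # assumed contents
    "is_general_purpose": False,
    "query_types": ["ex:queryType_resolveRecord",           # referenced by name
                    "ex:queryType_findReferences"],
    "endpoint_communication_type": "HTTP POST",
    "endpoint_query_protocol": "SPARQL",
    "endpoint_url": "http://www4.wiwiss.fu-berlin.de/drugbank/sparql",
    "normalisation_rules": ["ex:rule_drugbankURIs"],         # applied per stage
}
```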

Normalisation rule

A normalisation rule should contain at least one of the following features:

• Input rule: used to map normalised information into whatever scheme the relevant endpoints require.

• Output rule: used to map endpoint specific data back into normalised information.

• Type of rule: For example, Regular expression matching or SPARQL.

• Relevant stages : When the rule is to be applied. For example, before query creation, after query creation, before results import to the normalised data model, after import to the normalised data model, or after serialisation to the output data format. SPARQL rules are only useful on results after they have been imported into the normalised data model, while regular expressions may be useful at any stage where the query or results are in textual form. Query optimisation functions are only relevant after the query has been created, rather than during the process of replacing templates in the query to form a full query, which may be parsed before being executed depending on the communication method of the relevant data provider. The normalisation rules that are applied to the results pool and the serialised results document are not specific to the query type or provider that they were included with, so variables such as the provider location that may be used in the earlier stages will not be available for these rules.

• Order : Indicates where the rule will be applied in a given set of rules within a stage.

Normalisation rule test

The model has an optional testing component that is used to validate the usefulness of normalisation rules. The expected input and expected output are not defined in a rule, as they are specific to each test. A normalisation rule test contains the following features (a sketch follows the list):

• Expected input : The input to use.

• Expected output : The output that is expected.

• Set of rules : The set of rules that must be applied to the input to derive the expected output.
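A rule and its accompanying test might, as a rough illustration, be represented and checked as follows; the stage name is taken from the algorithm in Section 3.7, while the dictionary keys and the tiny test runner are assumptions rather than the prototype's actual configuration format.

```python
import re

# Illustrative normalisation rule, reusing the Diseasome URI rewrite from the
# earlier sketches, and a test case that exercises its output direction.
rule = {
    "type": "regular expression",
    "relevant_stages": ["before results parsing"],
    "order": 100,
    "input_rule": (r"http://bio2rdf\.org/diseasome_diseases:(\d+)",
                   r"http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/\1"),
    "output_rule": (r"http://www4\.wiwiss\.fu-berlin\.de/diseasome/resource/diseases/(\d+)",
                    r"http://bio2rdf.org/diseasome_diseases:\1"),
}

test = {
    "expected_input": "http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/1689",
    "expected_output": "http://bio2rdf.org/diseasome_diseases:1689",
    "rules": [rule],
}

def run_test(test_case):
    """Apply each rule's output pattern in order and compare with the expectation."""
    value = test_case["expected_input"]
    for r in sorted(test_case["rules"], key=lambda x: x["order"]):
        pattern, replacement = r["output_rule"]
        value = re.sub(pattern, replacement, value)
    return value == test_case["expected_output"]

print(run_test(test))   # prints True if the rule behaves as expected
```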

Namespace

A Namespace should contain the following features:

• The identifier used to designate this namespace : These items can be mapped to both this and other namespaces as applicable. These may include traditional prefixes, but also other items such as predicates and types that are available.

• The expected format for identifiers : A regular expression that can be used to identify syntactically invalid identifiers for this namespace.

• The preferred URI structure : A template containing the authority, namespace prefix, identifier, and any other variables or constants necessary to construct a URI for this namespace.

• The authority that is responsible for assigning the scientist level items to this namespace : This is in effect a mapping authority, although multiple authorities can be used concurrently, so there is no central authority.
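For illustration, a namespace definition with the features above could be expressed and used as follows, using the diseasome_diseases prefix and the Bio2RDF URI form that appear in Chapter 4; the authority URI is a placeholder and the key names are not the prototype's actual vocabulary.

```python
import re
from string import Template

# Illustrative namespace record mirroring the feature list above.
namespace = {
    "identifiers": ["diseasome_diseases"],        # prefixes and other known keys
    "identifier_format": re.compile(r"^\d+$"),    # syntactic validity check
    "preferred_uri": Template("http://bio2rdf.org/${prefix}:${identifier}"),
    "authority": "http://example.org/authority/bio2rdf",   # placeholder URI
}

def build_uri(ns, identifier):
    """Validate an identifier against the namespace and build its preferred URI."""
    if not ns["identifier_format"].match(identifier):
        raise ValueError("syntactically invalid identifier: %r" % identifier)
    return ns["preferred_uri"].substitute(prefix=ns["identifiers"][0],
                                          identifier=identifier)

print(build_uri(namespace, "1689"))   # http://bio2rdf.org/diseasome_diseases:1689
```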

Profile

A Profile should contain the following features:

• Explicitly included items : A list of items that should be used with this profile, and the scientist should be told that they will use this if it is the first matching profile on their sorted profile list.

• Explicitly excluded items : A list of items that should not be used with this profile, and the scientist should be told they should not be used if this is the first matching profile on their sorted profile list.

• Default inclusion and exclusion preference : This is used to assign a default value to any items that do not specify inclusion or exclusion preferences, as the use of profiles is not mandatory.

• Implicit inclusion preference : Whether items in each category should be included if the result of the profile rules indicates the item can be implicitly included.

Chapter 4

Integration with scientific processes

4.1 Overview

Scientists need to be able to concurrently use different linked datasets in their research. The model described in Chapter 3 enables scientists to query various linked datasets using a single replicable method. This chapter contains a variety of ways scientists can integrate the results of these queries with their programs and workflows. The integration relies on the model to provide data quality, data trust, and provenance support. Scientists are able to use the model to reuse and modify work produced by other scientists to fit their current context, even if they need to use different methods and datasets from those used by the original scientist. The model allows scientists to explore linked data and integrate annotations relating to data they find interesting. In comparison to other techniques, the annotations do not need to modify the original dataset. The resulting RDF triples can be integrated into computer operated workflows, as the data resulting from the use of the model is mandated to be computer understandable and processable using a common query language regardless of which discipline or location the data originated in.

A case study is given in Section 4.4 describing the ways that the model and prototype, described in Chapter 5, can be used in an exploratory medical and scientific scenario. The case highlights one way of using the model to publish links to results along with publications. The use of resolvable URLs as footnotes in Section 4.4 is one method of integrating computer resolvable data references into publications, although other methods have also been proposed in the data-based publication literature, including embedding references into markup [37, 58] and adding comments into bibliographic datasets [123].

Scientists need a variety of tools to analyse their data. One method that scientists use to manage their processing is to integrate data and queries using workflow management systems. If the workflow is able to process and filter the single data model that is mandated by the query model, the scientist can avoid or simplify many of the common data transformation steps that are required to use workflow management systems to process and channel data between different services. A case study is given in Section 4.7 using an RDF based workflow to demonstrate the advantages of the model

and prototype compared to current workflow and data access systems.

In many cases it is difficult for a scientist to determine what the sources of data were in a large research project. Provenance information, including the queries and locations of data, is designed to be easily accessible using the model, as each of the query types is directly referenced in the relevant providers, and providers in turn reference normalisation rules and namespaces by name. These explicit links make it easy to generate and permanently store a portable provenance description. The provenance information provided by the model enables scientists to use provenance as part of their experiments. An example of how this information can be used to give evidence for the results of a scientific workflow is given in Section 4.7.1.

Scientific results, and the processes that were used to derive the results, need to be reviewed by peers before publication. These steps require other scientists to attempt to replicate the results, and to challenge any conclusions based on their interpretation of the assumptions, data, methods, and results that were described by the publishing scientist. The model is useful for abstracting queries away from high level processing systems, but the high level processing systems are required to integrate the results of different queries. Scientific peers need to replicate and challenge the results in their own contexts as part of the scientific review process. An example of how this can occur is given in Section 4.7.1.

4.2 Data exploration

Although most scientific publications only demonstrate the final set of steps that were used to determine the novel result, the process of determining goals and eliminating unsuccessful strategies is important. Historically, the scientific method required the scientist to propose a testable hypothesis explaining an observable phenomenon. This initial hypothesis may have come from a previous direct observation or a previous study. Ananiadou and McNaught [6] note that "hypothesis generation relies on background knowledge and is crucial in scientific discovery." The source of the initial hypothesis affects how data is collected and analysed. It may also affect how conclusions based on the results can be integrated into the public body of scientific knowledge [51]. Variations on this method in the current scientific environment may involve exploration through published datasets before settling on a hypothesis.

The model described in this thesis, focusing on replicating queries across linked scientific datasets, provides a useful ad hoc exploration tool. It is designed to provide consistent access to information regardless of the data access method. The model focuses on replication, so exploratory activities can be tracked and replicated. This provides a novel and powerful tool for scientists who rely on a range of linked scientific datasets for their work. The method used by the model is unique, as it provides an abstraction over the methods for distributing queries, using namespace prefixes and mapping scientists' queries to actual queries on datasets. Other RDF based strategies rely on scientists writing complete SPARQL queries which are resolved dynamically using a workflow approach to provide access to the

final results. These strategies may be simpler than writing workflows that rely on multiple different data formats, but they are still a barrier to adoption by the wider scientific community. They rely on datasets being normalised, and on scientists being able to identify a single location for each dataset to include it in their query. This makes it difficult to replicate queries in different contexts, as peers may need to have access to an equivalent set of RDF statements and they may need to modify the query to use their alternative location. By contrast, the model provides the necessary provenance details to enable replication using substitution and exclusion, without removing anything from the provenance details. An implementation could interactively provide query building features, using both known and ad hoc datasets as needed. The value of this data exploration is related to the number of links between datasets and how valuable those links are. It is useful to have the ability to continuously explore information, without first requiring an application to understand a separate interface for each dataset and data provider. Scientists then need to be able to send future queries to processing services based on the data they have previously retrieved from the linked datasets. These features are provided by the normalising of URIs within the prototype, which makes it possible to determine future query locations based on the URIs from past results.

4.2.1 Annotating data

Scientists can create annotations that contain direct references to the data along with new information about the data. This enables them to keep track of interesting items throughout the exploration process. If these annotations are stored in a suitable place they can be integrated into the description for each item using a query and provider that are targeted at the annotation data store. An annotation service was implemented and the model was used to access the annotations, as shown in Figure 4.4. It demonstrates that novel data can easily be included with published data for any item based on its URI. The annotations contained details about the author of the annotation, as the annotation prototype required scientists to log in with their OpenID, and the OpenID identifying URI was included in the annotation. Scientists can use tags to keep track of their level of trust in a dataset by annotating items in the dataset gradually, and then compiling reports about namespaces using the tag information. The reports can be selective, based on the annotations of trusted colleagues, and can be set up using new query types and profiles. These tags were used to annotate data records from the published Gene Ontology [16] and Uniprot Keywords [114] datasets, as well as an unpublished BioPatML dataset [92]. The tagging functionality was implemented in a web application that scientists can use to annotate any URIs 1. The tag information could be directly integrated with the document that they resolve if they know they want to see annotations together with the original data in their context. In current workflows, every instance where the dataset was accessed would

need to be subdivided into two steps, one for the original data and one for the annotated data. When used with normalisation rules from the model, annotations could be used to selectively remove case specific information, although other scientists would not receive this information until they resolved the annotation to RDF. If useful information was removed from the original dataset, annotations could be used to include it again.

1 http://www.mquter.qut.edu.au/RDFTag/NewTerm.aspx

4.3 Integrating Science and Medicine

The model is designed so that it can be easily customised by users. Extensions can range from additional sources and queries, to the removal of sources or queries for efficiency or other contextual reasons. The profiles in the model provide a simple way for scientists to select which sources of information they want to use without having to make choices about every published information source.

In the context of Health Informatics, a hospital may want to utilise information from drug information sites, such as DrugBank and DailyMed, together with their private medical files. The hospital could map references in their medical files to disease or drug datasets such as DrugBank or DailyMed and publish the resulting information in a single data model, or provide a compatible interface to their current system with a single reference data model. The data from their systems could then be integrated with the DrugBank information transparently, as the model requires that the data model be able to be used for transparent merging of data from arbitrary locations. The hospital could then create a mapping between the terminologies stored in their internal dataset and those used by DrugBank and related datasets, and use those to syntactically normalise their annotations. These mappings could be made using one of the available SQL to SPARQL converters, such as the Virtuoso RDF Views mechanism [45] or the D2RQ server [29].

To distinguish private records from external public datasets, hospitals would create a namespace for their internal records, along with providers matching the internal addresses used for queries about their records. This novel private information can be transparently included in the model through the use of private provider configurations indicating the source of the information. If the hospital then wanted to map a list of diseases into their files, they could find a source for disease descriptions, such as Diseasome [55], and either find existing links through DrugBank [144] and DailyMed 2, or they could use text mining to discover common disease names between their records and disease datasets [7]. For scientific research, resources like Diseasome are linked to bioinformatics datasets such as the NCBI Entrez Gene [93], PDB [26], PFAM [48], and OMIM [5] datasets. This linkage is defined explicitly using namespaces and identifiers in those namespaces. This enables the use of direct resolvable links and enables links to be discovered; for instance, there may be links from patients and clinical trials to genetic factors, implying that patients may be eligible for particular clinical trials. Patients could be directly linked to genes using the data model without reference to diseases, and new diseases could be described

internally without requiring a prior outside publication.

Medical procedures and drugs inevitably have some side effects. There is a list of potential side effects available in the Sider dataset [80], enabling patients and doctors to both have access to equal information about the potential side effects of a course of medication. Side effects that are discovered by the hospital could be recorded using references to the public Sider record, reducing the possibility that the effect would be missed in future cases.

The continuous use of a single data model enables scientists to transition between datasets using links without having to register a new file model for each dataset, although the single data model may have different file formats that are each associated with input and output from the model. For ease of reading, the URI links, which can be resolved to retrieve the relevant information used in the following case study, are footnoted. The links between the datasets were observed by resolving the HTTP URIs and finding relevant RDF statements inside the document that link to other URIs.

In comparison to other projects that aim to integrate Science and Medicine, such as BIRN [17], the model used here does not require each participating institution to publish their datasets using RDF, as long as the data is released using a license that allows RDF versions of the data to be created by others. This enables others to integrate datasets that may not be in the same field or may not be currently maintained by an organisation. The semantic quality of the resulting information may not be as high as in the specialised neuroscience datasets that BIRN was developed to support, as each of the organisations involved in BIRN is required to be active and maintained. The clinical information that was integrated in BIRN consisted of public de-identified patient data, whereas this model allows hospitals to retrieve public information and integrate it with their private information without having to de-identify it before being able to distribute queries across the data. The information about how to federate queries does not need to be publicly published to be recognised by the model or prototype, although it does need to be published in other systems that rely on centralised authorities for dataset locations and query federation. Private data is not ideal, as it hides semantically useful links from being reused. In some cases, notably hospitals, privacy laws in some countries require that personal health records be kept private unless explicitly released.

2 http://dailymed.nlm.nih.gov/dailymed/about.cfm

4.4 Case study: Isocarboxazid

This case study is given in the context of both Science and Medicine. It is based around a drug known generically as "Isocarboxazid". This drug is also known by the brand name "Marplan". The aims of this case study are to discover potential relationships between this drug and patients, with reference to publications, genes and proteins, which may possibly affect the course of their treatment. In cases where patients are known to have adverse reactions, or they do not respond positively to treatment, alternatives may be found by examining the usefulness of drugs that are designed for similar purposes. The doctor and patient could use the exploratory processes that are explained in the case study to determine whether current treatments are suited to the patient with respect to their history and genetic makeup. If necessary, the case study methods could also be used to find alternative treatments.

Figure 4.1: Medicine related RDF datasets

The case study utilises a range of datasets that are shown in Figure 4.1. These datasets are sourced from Bio2RDF, LODD, Neurocommons, and DBpedia. The original image for the LODD datasets map can be found online 3. A portion of the case study, highlighting the links between items in the relevant datasets, can be seen in Figure 4.2. This case study includes URIs that can be resolved using the Bio2RDF website, which contains an implementation of the model.

For the purposes of this case study, the relevant entry in DrugBank is known 4, although it could also be found using a search 5. According to this DrugBank record, Isocarboxazid is "[a]n MAO inhibitor that is effective in the treatment of major depression, dysthymic disorder, and atypical depression". The DrugBank entry for Isocarboxazid contains links to the CAS (Chemical Abstracts Service) registry 6, which in turn contains links to the KEGG (Kyoto Encyclopedia of Genes and Genomes) Drug dataset 7. The link back to DrugBank from the KEGG Drug dataset and others in this case could also have been discovered using only the original DrugBank namespace and identifier 8. The brand name drug dataset, Dailymed, also contains a description for Marplan 9, which is linked from Sider 10 and the US MediCare dataset 11.

3 http://www4.wiwiss.fu-berlin.de/lodd/lodd-datasets_2009-08-06.png
4 http://bio2rdf.org/drugbank_drugs:DB01247
5 http://bio2rdf.org/searchns/drugbank/marplan
6 http://bio2rdf.org/cas:59-63-2
7 http://bio2rdf.org/dr:D02580
8 http://bio2rdf.org/links/drugbank_drugs:DB01247
9 http://bio2rdf.org/dailymed_drugs:2892
10 http://bio2rdf.org/sider_drugs:3759

[Figure 4.2 shows the links traversed in the case study, from the MediCare, PubChem, CAS, KEGG Drug, Dailymed and Sider records for Isocarboxazid/Marplan, through the DrugBank drug, drug interaction and target records, to the Diseasome, HGNC, NCBI Entrez Gene, PDB, PFAM, MeSH and PubMed records for the related genes, proteins and diseases.]

Figure 4.2: Links between datasets in Isocarboxazid case study

These alternative URIs could be used to identify more datasets with information about the drug. The record for Isocarboxazid in the Sider dataset has a number of typical depression side-effects to watch for, but it also has a potential link to Neuritis 12, a symptom which is different to most of the other 39 side effects that are more clearly depression related. Along with side effects, there are also known drug interactions available using the DrugBank dataset. An example of these is an indication of a possible adverse reaction between Isocarboxazid and Fluvoxamine 13. If Fluvoxamine was already being given to the patient, other drugs may need to be investigated as alternatives, to prevent the possibility of a more serious Neuritis side effect. DrugBank contains a simple categorisation system that might reveal useful alternative Antidepressants 14, in this case, such as Nortriptyline 15. Dailymed contains a list of typically inactive ingredients in each brand-name drug, such as Lactose 16, which may factor into a decision to use one version of a drug over others.

The Drugbank entry for Isocarboxazid also contains links to Diseasome, for example, Brunner syndrome 17, which are linked to the OMIM (Online Mendelian Inheritance in Man) entry for Monoamine oxidase A (MAOA) 18. DrugBank also contains a list of biological targets that Isocarboxazid is known to affect 19. If Isocarboxazid was not suitable, drugs which also affect this gene, Monoamine oxidase B (MAOB) 20 21 22 23, the protein 24, or the protein family 25, might also cause a similar reaction. The negative link (in this case derived using text mining techniques) between the target gene, monoamine oxidase B 26, and Huntington's Disease 27 28, might influence a doctor's decision to give the drug to a patient with a history of Huntington's. The location of the MAOB gene on the X chromosome in Humans might warrant an investigation into gender related issues related to the original drug, Isocarboxazid. The homologous MAOB genes in Mice 29 and Rats 30 are also located on chromosome X, indicating that they might be useful targets for non-Human trials studying gender related differences in the effects of the drug.

11 http://bio2rdf.org/medicare_drugs:13323
12 http://bio2rdf.org/sider_sideeffects:C0027813
13 http://bio2rdf.org/drugbank_druginteractions:DB00176_DB01247
14 http://bio2rdf.org/drugbank_drugcategory:antidepressants
15 http://bio2rdf.org/drugbank_drugs:DB00540
16 http://bio2rdf.org/dailymed_ingredient:lactose
17 http://bio2rdf.org/diseasome_diseases:1689
18 http://bio2rdf.org/omim:309850
19 http://bio2rdf.org/drugbank_targets:3939
20 http://bio2rdf.org/symbol:MAOB
21 http://bio2rdf.org/hgnc:6834
22 http://bio2rdf.org/geneid:4129
23 http://bio2rdf.org/mgi:96916
24 http://bio2rdf.org/uniprot:P27338
25 http://bio2rdf.org/pfam:PF01593
26 http://bio2rdf.org/geneid:4129
27 http://bio2rdf.org/mesh:Huntington_Disease
28 http://bio2rdf.org/mesh:D006816
29 http://bio2rdf.org/geneid:109731
30 http://bio2rdf.org/geneid:25750

The Human gene MAOA can be found in the Traditional Chinese Medicine (TCM) dataset [47] 31, as can MAOB 32, although there were no direct links from the Entrez Geneid dataset to the TCM dataset. TCM has a range of herbal remedies listed as being relevant to the MAOB gene 33, including Psoralea corylifolia 34. Psoralea corylifolia is also listed as being relevant to another gene, Superoxide dismutase 1 (SOD1) 35 36. SOD1 is known to be related to Amyotrophic Lateral Sclerosis 37 38, although the relationship back to the Brunner Syndrome and Isocarboxazid, if any, may only be exploratory given the range of datasets in between.

LinkedCT is an RDF version of the ClinicalTrials.gov website that was set up to register basic information about clinical trials [65]. It provides access to clinical information, and consequently is a rough guide to the level of testing that various treatments have had. The drug and disease datasets mentioned above link to individual clinical interventions in LinkedCT, enabling a path between the drugs, affected genes and trials relating to the drugs. Although there are no direct links from LinkedCT to Marplan at the time of publication, a namespace based text search returns a list of potentially interesting items 39. An example of a result from this search is a trial 40 conducted by John S. March, MD, MPH 41 of Duke University School of Medicine and overseen by the US Government 42. The trial references published articles, including one titled "The case for practical clinical trials in psychiatry" 43 44. These articles are linked to textual MeSH (Medical Subject Headings) terms such as "Psychiatry - methods" 45, indicating an area that the study may be related to. The trial is linked to specific primary outcomes and the frequency with which the outcomes were tested, giving information about the scientific methods in use 46.

Although LinkedCT is a useful resource, as with any other resource, there are difficulties with the data being complete and correct. An example of a possible issue with completeness in LinkedCT is given by recent studies about the use of ClinicalTrials.gov, which indicate that a reasonable percentage of clinical trials either do not publish results, do not register with ClinicalTrials.gov, or do not reference the ClinicalTrials record in publications resulting from the research [94, 116]. These issues may be reduced if researchers were required to register all drug trials and reference the entry in any publications.

31 http://bio2rdf.org/linksns/tcm_gene/geneid:4128
32 http://bio2rdf.org/linksns/tcm_gene/geneid:4129
33 http://bio2rdf.org/linksns/tcm_medicine/tcm_gene:MAOB
34 http://bio2rdf.org/tcm_medicine:Psoralea_corylifolia
35 http://bio2rdf.org/tcm_gene:SOD1
36 http://bio2rdf.org/geneid:6647
37 http://bio2rdf.org/mesh:Amyotrophic_Lateral_Sclerosis
38 http://bio2rdf.org/mesh:D000690
39 http://bio2rdf.org/searchns/linkedct_trials/marplan
40 http://bio2rdf.org/linkedct_trials:NCT00395213
41 http://bio2rdf.org/linkedct_overall_official:12333
42 http://bio2rdf.org/linkedct_oversight:2283
43 http://bio2rdf.org/linkedct_reference:22113
44 http://bio2rdf.org/pubmed:15863782
45 http://bio2rdf.org/mesh:D011570Q000379
46 http://bio2rdf.org/linkedct_primary_outcomes:55439

Doctors and patients do not have to know what the URI for a particular resource is, as there is a search functionality available. This searching can either be focused on particular namespaces or it can be performed over the entire known set of RDF datasets, although the latter will inevitably be slower than a focused search, as some datasets are up to hundreds of gigabytes in size, representing billions of RDF statements. An example of this may be a search for "MAOB" 47, which reveals resources that were not included in this brief case study.

47 http://bio2rdf.org/search/MAOB

4.4.1 Use of model features

The case study about the drug "Isocarboxazid" in Section 4.4 utilises features from the model to access data from multiple heterogeneous linked datasets. The case study uses the query type feature to determine the meaning of each component of a question, and maps the question to the relevant query types. The query types from the Bio2RDF configuration that were utilised in the case study range from basic resolution of a record from a dataset, to identifying records that were linked to an item in particular datasets, and searches over both the global group of datasets and specific datasets. These methods make it simple to replicate the case study using the URIs, although additions of new query types and providers may be necessary to utilise local or alternate copies of the relevant datasets.

Basic data resolution

The Bio2RDF normalised URI form can be used to get access to any single data record in Bio2RDF. The pattern is "http://bio2rdf.org/namespace:identifier", where "namespace" identifies part of a dataset that contains an item identified by "identifier". This query provides access to Bio2RDF as Linked Data. It is simple to implement using SPARQL, with either a CONSTRUCT query or a DESCRIBE query being useful. The SPARQL Basic Graph Pattern (BGP) that is used in this query is "<${normalisedURI}> ?p ?o .", although the URI may be modified using Normalisation Rules as appropriate to each data provider. The SPARQL pattern may not be necessary if the query is resolved using a non-SPARQL provider.

An example of a question in the case study that may not use SPARQL is the initial query. The process of resolving the URI "http://bio2rdf.org/drugbank_drugs:DB01247" may require the model to use the Linked Data implementation at the Free University of Berlin site, provided by the LODD project, or it may use the SPARQL endpoint. In this case, the namespace is "drugbank_drugs", and the identifier is "DB01247". According to the Bio2RDF configuration, the namespace prefix "drugbank_drugs" is identified as being part of the namespace "http://bio2rdf.org/ns:drugbank_drugs" by matching the namespace prefix against the preferred prefix that is part of the RDF description for the namespace. This namespace is attached to the provider "http://bio2rdf.org/provider:fuberlindrugbankdrugslinkeddata", which will attempt to retrieve data by resolving the Linked Data URI

"http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB01247" to an RDF document. The DrugBank provider in Bio2RDF contains links to normalisation rules that are necessary to convert the URIs in the Free University of Berlin document to their Bio2RDF equivalents. This enables the referenced URIs to be resolved using the Bio2RDF system, which contains other redundant methods of resolving the URI besides the Linked Data method. In this case, the document is known to contain incorrectly formatted Bio2RDF URIs. This is a data quality issue that can be fixed using the normalisation rules method, in the same way that the useful, but non-Bio2RDF, Linked Data URIs were converted.

There are some situations where it may be difficult to get a single document to completely represent a data record. If the only data providers return partial data records, such as Web Services that map the record to another dataset, then constructing the complete record may require many Web Services to be resolved to produce a single document. In some cases, the complete record may have parts of its representation derived from different datasets, or at least different locations. However, in many cases the record is available from a single location, so it is simpler to maintain and modify the record. In both cases, the model allows the data record to be compiled and presented as a single document to the scientist, including any necessary changes based on their context.
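As a rough sketch of the resolution path just described (and not the prototype's actual code), the following splits a Bio2RDF URI into its namespace prefix and identifier, applies a hypothetical denormalisation mapping to obtain the endpoint-specific Free University of Berlin URI mentioned above, and builds the corresponding CONSTRUCT query.

```python
import re

BIO2RDF_URI = re.compile(r"^http://bio2rdf\.org/(?P<ns>\w+):(?P<id>[\w\.-]+)$")

# Hypothetical denormalisation mapping for the Free University of Berlin
# DrugBank Linked Data provider discussed above.
ENDPOINT_URI_TEMPLATES = {
    "drugbank_drugs":
        "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/{id}",
}

def build_construct_query(bio2rdf_uri):
    """Build a basic data resolution query against the provider-specific URI,
    constructing triples about the normalised Bio2RDF URI."""
    match = BIO2RDF_URI.match(bio2rdf_uri)
    if match is None:
        raise ValueError("not a Bio2RDF URI: %r" % bio2rdf_uri)
    endpoint_specific = ENDPOINT_URI_TEMPLATES[match.group("ns")].format(
        id=match.group("id"))
    return ("CONSTRUCT { <%s> ?p ?o } WHERE { <%s> ?p ?o }"
            % (bio2rdf_uri, endpoint_specific))

print(build_construct_query("http://bio2rdf.org/drugbank_drugs:DB01247"))
```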

References in other datasets

Scientists are able to take advantage of references that dataset providers have determined are relevant. They are not easily able to discover new references, as there has previously been no way of asking for all of the references to a particular item. Although the Linked Data proposals are useful as a minimum standard, they do not allow scientists to perform these operations directly, as they only focus on direct resolution of basic data items. Query types were created to handle both global and namespace targeted searches, extending the usefulness of the Linked Data approach to include link and search services in addition to simple record resolution. These queries are useful for scientists who utilise datasets that do not always contain complementary links to other datasets, and are therefore a way to increase the usability and quality of these datasets. The omission of a reference may not imply that the data is of a low semantic quality, but in syntactic terms it is useful to know where an item is referenced in another dataset.

The global reference search in Bio2RDF has the pattern "http://bio2rdf.org/links/namespace:identifier", where "namespace" identifies part of a specific dataset that contains an item identified by "identifier", and the query will find references to that item. In the case study, there was a need at one point to find all references to a drug, without knowing exactly where the references might come from. The Bio2RDF URI http://bio2rdf.org/links/drugbank_drugs:DB01247 was used to find any items that contained references to the drug. Although the canonical form for the data item of interest is http://bio2rdf.org/drugbank_drugs:DB01247, the data normalisation rules will change this to the URI that is known to be used on each data provider. This process is not easy to automate, as it requires a mapping step. A general mapping that created multiple queries for every known dataset would be very inefficient for any datasets that had more than one or two known URI forms. In the case of Bio2RDF the mapping process was manually curated, based on examinations of sample data records, to increase the efficiency of queries.

As the global reference search requires queries on all endpoints by design, it is not efficient across very large sets of data. A more efficient version was designed to take advantage of any knowledge about a namespace that a reference may be found in. This form has the pattern "http://bio2rdf.org/linksns/targetnamespace/namespace:identifier", where "namespace" identifies part of a specific dataset that contains an item identified by "identifier", and the query will find references to that item in data providers that are known to contain data items in the namespace "targetnamespace". This was used in many places in the case study where the scientist knew that there were links between datasets, and wanted to identify the targets efficiently. The SPARQL BGP that is used in both cases is likely to be similar to the pattern "?s ?p <${normalisedURI}> .", although in some cases the predicate (second part of the triple) will be necessary, and the object (third part of the triple) will be "identifier" or something similar. The model maintainer may be able to specify which data providers are known to contain some links to a namespace, to assist scientists in identifying linked records without having to do either a global reference search, or know which namespaces to use for a targeted search.
Using this knowledge of which providers contain links to a namespace, the model can be set up so that during the data resolution step there is a targeted search for references in a select number of namespaces, to dynamically include these references in the basic data resolution. In some cases, such as with widely used taxonomies, it is not efficient to utilise this step, but in general it is useful and can provide references that scientists may not have expected. This functionality is vital for namespaces that may not be derived from single datasets, or where the datasets are not available for querying, as the semantic quality of the referenced data can be increased using a generic set of syntax rules.

An example of this is the “gene” namespace that was referenced in the case study. Each identifier in the gene namespace is a symbol that is used to identify a gene in a particular species. Although these are curated by the HGNC (HUGO Gene Nomenclature Committee) organisation for humans, the use of these symbols is more widely spread, including in the gene namespace, where the symbols are denoted as being human using the NCBI Taxonomy identifier “9606”, in the form http://bio2rdf.org/gene:9606-symbol, where “symbol” is the name given to a human gene. An example of a mouse gene symbol is http://bio2rdf.org/gene:10090-ace, which is similar to the symbols that are available for the human version of the gene, http://bio2rdf.org/gene:9606-ace and http://bio2rdf.org/symbol:ACE. In cases where the identifiers are not standardised, or different namespaces have different standards for the same set of identifiers, there are likely to be data quality issues similar to this case. As URI paths are case sensitive, http://bio2rdf.org/symbol:ACE is different to http://bio2rdf.org/symbol:ace, and datasets using one version will not directly match datasets using the other version.

The use of RDF enables the addition of a statement linking the two identifiers. However, this is not a good solution, as it requires scientists to apply reasoning to the data while assuming that the data is of perfect semantic quality. In the prototype it is possible to apply normalisation rules to template variables, such as the “identifier” element in the symbol namespace, that modify queries on particular data providers based on a previous examination of the data to know which convention a dataset follows.

In some cases, RDF datasets do not contain URIs for particular referenced items. Dataset owners do this to avoid any impression that the URIs would have ongoing support from their organisation. Some organisations have best practices to publish references to other data items using a tuple formed from one property for a namespace prefix, such as “namespace” above, and another property for the identifier, such as “identifier” above. This restricts the usefulness of the resulting data, as there is no simple mechanism for using this tuple to get more information about the item. Although the lack of URIs is simple to work around in many cases, it makes queries harder, as the relevant properties need to be targeted along with a textual literal, which may not be handled as efficiently as URIs. In some cases a Linked Data RDF provider may argue that the underlying dataset did not contain URI references to another dataset, so they do not have to provide the links.
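Where references are published only as namespace/identifier literal pairs rather than URIs, an alternative query type can target the relevant properties directly. The sketch below illustrates the idea under invented property names; a real configuration would use the predicates actually observed in the target dataset.

    # Sketch: a query type for datasets that publish references as literal
    # namespace/identifier tuples instead of resolvable URIs.
    TUPLE_REFERENCE_SEARCH = """
    SELECT ?s WHERE {
      ?s <http://example.org/vocab/referenceNamespace> "%(namespace)s" .
      ?s <http://example.org/vocab/referenceIdentifier> "%(identifier)s" .
    }
    """

    def build_tuple_search(namespace, identifier):
        # Substitute the two template variables into the SPARQL template.
        return TUPLE_REFERENCE_SEARCH % {"namespace": namespace,
                                         "identifier": identifier}

    print(build_tuple_search("doi", "10.1234/example"))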
One case where this lack of URI links is prominent is the absence of references out from DBpedia to Bio2RDF, on the basis that Bio2RDF is not a clear representative of the relevant data providers, although DBpedia is not a representative of the Wikipedia group in the first instance either. In these cases, the model is useful in determining which DBpedia items reference a given Bio2RDF namespace and identifier and using these for reference searches; however, there is a lack of support in the SPARQL query language for creating URIs inside queries, so the resulting documents still contain the textual references. In all of the cases where the dataset does not use a URI to denote a reference to another item, the query types feature in the model is utilised to make it possible to use multiple different queries on the dataset, without the scientist having to specify in their request that they want to use the alternative query types.

As an example, these alternative queries are utilised for the inchi, inchikey 48, and doi 49 namespaces. These public namespaces are well defined, so there are not generally syntactic data quality issues; however, they do not have RDF Linked Data compatible URIs, so some datasets have resorted to using literals, along with a number of different predicates, for references to these namespaces. The use of a large number of different data providers for these public namespaces is difficult from a trust position, as it is not always easy to identify where statements originate from in RDF, unless one of the Named Graph extensions to RDF is utilised, such as TriG or TriX. In the case of pure RDF triples, it is not easy to identify after the fact what trust levels were relevant. In the case of high throughput utilisation of this information, it is not appropriate to rely on a scientist applying their trust policies to information after queries are performed.

48 http://www.iupac.org/inchi/
49 http://www.doi.org/about_the_doi.html

The use of profiles to restrict the set of data providers is the solution provided in the model. The profiles are designed so that they can be easily distributed to other users, so organisations can use the profiles to define their trust in different providers. However, profiles are user selected, so any profiles that appear in query provenance can be excluded during replication simply by not selecting them.
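The following is a minimal sketch of the kind of selection a profile performs; the provider and profile names are invented, and the real configuration is expressed in RDF rather than Python objects. It only illustrates the principle that a profile can be applied, or left unselected, without altering the underlying configuration.

    # Sketch: applying user-selected profiles to choose which providers are queried.
    from dataclasses import dataclass, field

    @dataclass
    class Profile:
        name: str
        include: set = field(default_factory=set)  # providers explicitly allowed
        exclude: set = field(default_factory=set)  # providers explicitly rejected

    def select_providers(all_providers, profiles):
        """Return the providers permitted by every selected profile."""
        chosen = set(all_providers)
        for p in profiles:
            if p.include:
                chosen &= p.include
            chosen -= p.exclude
        return chosen

    providers = {"bio2rdf-drugbank", "fu-berlin-drugbank", "local-mirror-drugbank"}
    lab_profile = Profile("lab", exclude={"fu-berlin-drugbank"})
    print(select_providers(providers, [lab_profile]))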

Text search

RDF URIs are useful, but they are not always easy to find. In particular, Linked Data does not specify any mechanism for finding initial URIs given a question. In some cases this may not be an issue, if identifiers are textual and a URI can easily be created with knowledge of the identifier. In many cases, however, identifiers are privately defined by datasets and have no semantic meaning apart from their use as part of the dataset. In these cases it is useful to be able to search across different datasets. One solution may be to rely on a search engine, but there are questions about the range of a search engine, and the ability to restrict it to what is useful and trusted.

The majority of the data providers for text search queries were SPARQL endpoints. Although the official SPARQL 1.0 specification only supports one form of text search, regular expressions, some SPARQL endpoints were known to support additional forms of full text search that are more efficient, so searches on these endpoints were performed using a different query type that was based on the same query structure. In order for scientists to utilise the system without having to know these details, the two forms of SPARQL search, along with other URL based search providers, were implemented to match against two standard URI patterns. The first pattern is for a text search across all datasets that have text search interfaces, http://bio2rdf.org/search/searchTerm, where “searchTerm” is the text that a scientist wants to search for. The second pattern is used to target the searches at datasets which contain a particular namespace, http://bio2rdf.org/searchns/targetNamespace/searchTerm, where the targeted namespace is “targetNamespace”. The “targetNamespace” is not guaranteed to be the only namespace that results are returned from, as the query and the interface may be generically applicable to any data in a provider. The process could be optimised further in cases where the interface allowed the query to only target data items that had URIs which matched the target namespace, but this process may not always be effective or efficient, as the URIs may not have a base structure that is easily or efficiently identifiable in a SPARQL query. An experiment to improve the accuracy in some cases was completed, but it was not efficient, and queries regularly timed out, returning no results.

The main efficiency problem was related to the use of SPARQL 1.0. It is not optimal for searches that rely on the structure of URIs, as it relies on regular expressions, which are not efficient when applied to large amounts of text. For the same reason, the targeted reference search only relies on the namespace being part of the data provider definition, so results can be returned from any namespace that was present in the dataset that was queried. In small scenarios, where the number of triples is quite small 50, the identifier-independent part of the URI can be identified as belonging to a particular namespace.

The text search function demonstrates a novel application of the model as an entry point to Linked Data models. In terms of data quality, it is perhaps less effective than a single large index, where results can be ranked across a global set of data, as the prototype implementation of the model only allows ranking inside data providers.
However, in terms of data trust, the sites that were utilised were able to be selected in a simple manner using profiles, so as to restrict the possibility that the search terms accidentally matched on an irrelevant topic. The exact provenance for a search using the model is available, and can be used in combination with the results to validate the usefulness of the results to a particular question. If the results have data quality issues, a scientist can encode rules that can be used to clean the results to suit their purposes, and view the results of this in their work, along with which rules were applied in their provenance. If a new text search method needs to be created to replicate results, it can be included using profiles. Profiles can also be used to filter out of date query types. For example, when SPARQL 1.1 is standardised, the SPARQL 1.0 query types can be filtered from historical query provenance and replaced with efficient SPARQL 1.1 queries.
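The sketch below illustrates how the two SPARQL text search forms described above could share one query structure while differing in the matching clause. The full text syntax shown (Virtuoso's bif:contains) is only one example of a vendor extension and is an assumption, not part of the model; the regular expression form follows standard SPARQL 1.0.

    # Sketch: choosing between a regex search and a vendor full-text search,
    # depending on what the endpoint is known to support.
    REGEX_SEARCH = """
    SELECT ?s ?label WHERE {
      ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
      FILTER(regex(?label, "%(term)s", "i"))
    } LIMIT 50
    """

    FULLTEXT_SEARCH = """
    SELECT ?s ?label WHERE {
      ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
      ?label <bif:contains> "'%(term)s'" .
    } LIMIT 50
    """

    def build_search(term, endpoint_supports_fulltext):
        template = FULLTEXT_SEARCH if endpoint_supports_fulltext else REGEX_SEARCH
        return template % {"term": term}

    print(build_search("glutamate decarboxylase", endpoint_supports_fulltext=False))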

4.5 Web Service integration

The query type and provider sections of the model can be transparently extended in future to allow scientists to patch into web services and use them directly as sources for queries. A similar ontology based architecture, SADI [143], annotates and wraps web services so that they can be queried as part of Federated SPARQL queries; however, it makes some assumptions that make it unsuitable as a replicable, context-sensitive, linked scientific data query model. The SADI method relies on service providers annotating their web services in terms of universally used predicates and classes. SADI also requires queries to be stated using known predicates for all parts of the query, with at least one known URI to start the search from, as the system is designed to replicate a workflow using a SPARQL query. This prevents queries for known predicates being executed without knowledge of any URIs, and it also prevents queries that have known URIs without scientists knowing what predicates exist for the URI. SADI is designed as an alternative to cumbersome workflow management systems. In this respect it is successful, as the combination of a starting URI and known predicates is all that is required for traditional web services to be invoked, as they are specifically designed for operations on known pieces of information. SADI is limited in this respect compared to arbitrary Federated SPARQL interpreters that allow any combination of known operations, known information, and/or unknown information and operations.

50 1000 triples is small in this context and 5 million+ is large

In comparison to the separate namespace and normalisation rules model described here, SADI requires wrapper designers to know beforehand what the URI structure is for all datasets, with the resulting transformations encoded into the program. In comparison, the normalisation rules part of this query model only requires configuration designers to know how to transform the URIs from a single endpoint to and from the normalised URI model, and scientists can extend that knowledge using their own rules. This reduces the amount of maintenance that is required by the system, as URIs that are passed between services are normalised to a known structure as part of the output step for prior services before being converted to the URIs that are expected by the next services.

If a scientist is able to locally host a SADI service registry, then they are able to trust each of the records. However, the SADI model does not contain a method for trusting data providers in the context of particular queries. It is important that scientists be able to trust datasets differently based on their queries, so that all known information may be retrieved in some contexts, but in other contexts only trusted datasets may be queried. Although this may come at the expense of scientists defining new templates for queries, the model described here is able to provide a definite level of trust for each query based on the scientist's profiles.

The design could be useful as a way to integrate documents derived from the model; however, the assumption that all queries can be resolved in a Linked Data fashion is inefficient, and there needs to be an allowance for more efficient queries that do not follow the cell-by-cell model. SADI technically follows a cell-by-cell data retrieval model to answer queries, although some queries may be executed intelligently behind the scenes in Web Services to reduce the overall time, making the system efficient in particular circumstances, but not by design. The model and prototype can both be used to perform efficient queries, although resolution of individual Linked Data URIs is also possible if scientists are not aware of a way to make a query more efficient.
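The per-endpoint transformation described above can be reduced to a pair of functions mapping to and from the normalised URI model. The sketch below uses a placeholder endpoint prefix (the real Free University of Berlin prefix is not reproduced here), so it should be read as an illustration of the configuration burden, not as an actual rule from the Bio2RDF configuration.

    # Sketch: only this pair of mappings needs to be configured per endpoint;
    # services never need to understand each other's URI schemes.
    NORMALISED_PREFIX = "http://bio2rdf.org/drugbank_drugs:"
    ENDPOINT_PREFIX = "http://example.org/drugbank/resource/drugs/"  # placeholder

    def to_endpoint_uri(normalised_uri):
        return normalised_uri.replace(NORMALISED_PREFIX, ENDPOINT_PREFIX, 1)

    def to_normalised_uri(endpoint_uri):
        return endpoint_uri.replace(ENDPOINT_PREFIX, NORMALISED_PREFIX, 1)

    # Round-tripping a URI through both directions should be lossless.
    assert to_normalised_uri(to_endpoint_uri(
        "http://bio2rdf.org/drugbank_drugs:DB01247")) \
        == "http://bio2rdf.org/drugbank_drugs:DB01247"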

4.6 Workflow integration

Workflow management systems provide a useful way to integrate different parts of a scientific experiment. Although they may not be used as often as ad-hoc scripts, they can make it easier to replicate experiments in different contexts, and they make it easier for scientists to understand the way the results are derived. The model and prototype were not designed to be substitutes for a workflow management system, as they are designed to abstract out the context specific details such as trust and data quality. By abstracting out the contextual details, each query can be replicated in future using the model configuration given by the replicating user, based on the original published configuration given by the original publisher.

An evaluation of a selection of workflows on the popular workflow sharing website myExperiment.org revealed that the majority of these scientific workflows have large components that are concerned with parsing text, creating lists of identifiers, pushing those lists into processing services and interpreting the results for insertion into other processing services. These workflows are unnecessarily complex due to the requirements in these steps to understand each output format and know how to translate it into the next input format. Hence they are not as viable for reuse in the general scientific community as a system that relies on a generic data format. In addition, each of the data access steps contains hardcoded references to the locations of the data providers. Even if the data locations are given as workflow parameters, the alternative data providers would need to subscribe to the same query method. The query format assumptions are encoded in the preceding steps of the workflow, and the data cleaning assumptions for the given data provider are encoded in the steps following the data access.

In the model, the query format assumptions and data cleaning assumptions are linked from the relevant providers. If another provider is available for a query, but the query format is different, a new query type can be created and linked to a new provider. This new provider can reuse the data cleaning rules that were linked to from the original provider, or new rules can be created for this provider. Each of these substitutions is controlled using profiles, so the entire configuration does not need to be recreated to replicate the workflow. In addition, if the user only wants to remove data cleaning rules, providers, or query types, they only need to create a new profile and combine it with the original configuration. All of these changes are independent of the workflow, as long as the location of the web application is specified as a workflow parameter so that future users can easily change it.

Scientists can use annotations and provenance in scientific workflows to provide novel conclusions that can be used to evaluate the workflow in terms of its scientific meaning and the trust in the data providers. The model provides query provenance, and scientists can create query types to insert data provenance information into the result set of a query. These provenance sources give a workflow system a clear understanding of the origin of the data that it is processing. In the model, the RDF syntax is used to specify provenance details. This is useful because it enables concepts to be described using arbitrary terminologies, without requiring a change to the model standard to support different aspects of provenance.
Scientists can also combine provenance and results RDF statements to explain the effect of workflow parameters on the workflow results.

There are domain specific biological ontologies available that aim to formalise hierarchies of concepts; however, these are not currently well integrated with workflow management systems [61, 104, 124, 146]. Researchers have effectively proven that it is practical to integrate semantic information into electronic laboratory books, such as those by Hughes et al. [73] and Pike and Gahegan [109]. There have also been attempts to formally describe the experimental process for various disciplines, such as geoscience and the microarray biology community [142]. There have been uses of simple dataset identifier information to provide enhancements to workflows, with the most common being PubMed and Gene Ontology overlaps, where PubMed articles are described in terms of the Gene Ontology [44, 86, 132]. However, these systems do not involve the context of the scientist in the process, making it hard for scientists to replicate the results if they need to extend them or process their provenance. Scientists can interpret the provenance of their data along with their results by including trusted, clean, structured data from various locations along with provenance records. The model and prototype described in this thesis allow for this integration using query types, and process provenance records in RDF can be used to configure the prototype to replicate the results.

4.7 Case Study : Workflow integration

The model links into scientific workflow management systems more easily than current systems due to its reliance on a single data model, where documents at each step are parsed using a common toolset. Although many workflow management systems do not include native support for RDF as a data format, the Semantic Web Pipes program [87] is designed with RDF as its native format, including SPARQL as a query language inside the workflow. It was used to demonstrate ways to use the results of queries on the prototype to trigger other queries based on the criteria that are specified in the workflow. A future version of the model could expand it into a workflow engine, but for the prototype it was decided to use an external tool for workflow purposes. The following case study examines the usefulness of an RDF based workflow management system to process data using the prototype as the data access system.

The basic input pattern that was used for these Semantic Web Pipes workflows contained a set of parameters, including the base URL for the user's instance of the prototype web application. An implementation of the model was used to resolve a query to an RDF document, as shown in Figure 4.3. The RDF document was then parsed into an in-memory RDF model, and SPARQL queries were performed to filter the useful information that required further investigation. In some cases where workflows were chained together, the results of the SPARQL query were used to form parameters for further queries on the prototype. The results of these further queries were included in the results of the workflow, as RDF statements from different locations can be combined in memory and output to form single documents. In some cases, the initial RDF triples were used as part of the output statements for the workflow. This makes it possible to keep track of what information was used to derive the final output, as the RDF model provides the necessary generic structure to enable different documents to be merged without any specific rules. These small workflows were then linked together using the Pipe Call mechanism in the Semantic Web Pipes program to create larger workflows. This structure makes it possible to remove internal dependencies in the model on the way each query is implemented, so data quality and data trust can be changed in the prototype, without having to change the workflow where possible.

These workflows were used to evaluate the effectiveness of the prototype as a data access tool specifically for biologists and chemists, as they focused on biological and chemical scientific datasets. The workflows were integrated with the aim of verifying the way that biological networks are annotated and curated using literature sources,

Figure 4.3: Integration of prototype with Semantic Web Pipes

and further to discover new networks that may not have been annotated previously, but which share common sources of knowledge.

The case study uses the Ecocyc dataset as the basis for the biological network knowledge about the E. coli bacterium. Ecocyc contains references out to the Pubmed literature dataset to verify the source of the knowledge. The Pubmed literature dataset is also used to annotate the NCBI Entrez gene information database. The Entrez dataset in turn is referenced from a large number of other datasets, such as the Uniprot protein dataset and the Affymetrix microarray dataset. The microarray dataset can be used to identify which methods are available for experimentally verifying which of the proteins from the Ecocyc dataset appear in a biological sample. Along with the links to Entrez gene, the Affymetrix dataset also includes links to Uniprot. These links are useful, as the results of the workflow suggest that they are highly specific, although the actual level of trust could be assigned by a scientist, who could use either this set of links or the set of links from Entrez gene to further identify the items. The Ecocyc dataset is also annotated using the Uniprot dataset, which provides a method of verifying the consistency of the data that was discovered using the Ecocyc dataset originally. In the Ecocyc dataset, the Uniprot links are given from the functional elements, such as genes and polypeptides (i.e., proteins).

The workflow was used with different promoters, and the results were manually verified to indicate that the results were scientifically relevant. The results generally contained proteins which were known to be in the same family, as the workflow was structured to select these relationships in favour of more general relationships. However, the resulting proteins were known to be experimentally verifiable, and the biological networks could be considered in terms of these relationships. Although the results of this workflow may be trivial, they highlight the ease with which these investigations can be made compared to non-RDF workflows. In traditional workflows, each of the data items must be manually formatted to match the input for a service. The results must also be parsed using knowledge of the expected document format, including the need to parse link identifiers, or have a special service created to perform the dataset link analysis and give these identifiers as simple values, and any data normalisation processes must be created in a way that suits the data format.

In the previous case study the links between data items were used for exploration, so it was suitable for a scientist to identify relevant links at each step of each case. In comparison, the use of the workflow in this case enables a scientist to repeat the experiment easily, with different inputs as necessary. The workflows in this case were created in a way that allowed scientists to create a workflow for a single purpose, and then use it in another workflow by linking to it. The relevant information from one step was processed into a list of parameters for the next step using an embedded SPARQL engine. The relevant inputs and results were then combined in the output as a single RDF document. The model and prototype were used to enable the workflow to avoid issues such as data quality and remote SPARQL queries. These factors are difficult to replicate and maintain if they are embedded into the workflow.
Each of the workflows includes a parameter that is used as the base URL for queries. This means that users do not have to utilise a globally accessible implementation of the model if they have one locally installed. This ensures that scientists can reliably replicate the workflows using their own context, if they have the datasets locally installed, with local endpoints and the local implementation used in their workflows as guided by their profile settings.

The workflow was used to verify the consistency of the statements about a particular promoter, known as “gadXp” 51, that is thought to be involved in several Ecocyc regulatory relationships. This promoter is known to be annotated with a reference to a single Pubmed article, titled “Functional characterization and regulation of gadX, a gene encoding an AraC/XylS-like transcriptional activator of the Escherichia coli glutamic acid decarboxylase system”. This article was used to identify 6 Entrez gene records which were annotated as being related to the article, using the query form “/linksns/geneid/pubmed:11976288” 52. The results of this query contained a set of RDF statements that indicated which genes were related to the article. These genes were matched against known sets of probes that could be experimentally verified with microarrays, using a query similar to the following, “/linksns/affymetrix/geneid:NNNNNN”, where NNNNNN is one of the genes that were identified as being sourced from common literature.

Each of the 8 microarray probes that were returned by this query was then examined specifically with reference to any Uniprot annotations it contained, to avoid transferring the entire documents across the network. This was performed using the query form “/linkstonamespace/uniprot/affymetrix:XXNNNNNN”, where each of the affymetrix URIs was a microarray probe that matched from the Entrez gene set. The 11 Uniprot links were then matched back against Ecocyc to determine which Ecocyc records were relevant. If the datasets were internally consistent, the resulting 8 Ecocyc records, and the Uniprot records they are linked to, should refer to similar proteins. In this example, the Ecocyc and Uniprot records all referred to proteins from the “glutamate decarboxylase” family, which gadXp is a member of.

51 http://bio2rdf.org/ecocyc:PM0-2441
52 http://bio2rdf.org/linksns/geneid/pubmed:11976288

Each of the Ecocyc and Uniprot records was then examined in the context of all of the known datasets to identify references that could be used by the scientist to verify the relevance of each record. This step was performed using the query form “/links/namespace:identifier”, where namespace was either ecocyc or uniprot, based on the set of proteins that were identified.

There may be cases where the Pubmed article was not specific to the promoter and contained a large number of irrelevant references to genes in the Entrez dataset. In these cases the workflow would need to be redesigned to verify the references using another strategy, such as a strategy using Uniprot references along with other datasets. Although the typical usage of the prototype is to resolve queries about single entities, there is no restriction on queries that combine multiple entities in a single query. This would reduce the time taken to resolve the query, as a large part of the resolution time for distributed queries on the Internet relates to the inherent latency in the communication channels.
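The following is a minimal sketch of the first two steps of this chain of queries, written as a plain script rather than a Semantic Web Pipes workflow. The base URL is treated as a parameter, mirroring the workflow design above; the identifier extraction is deliberately simplified and assumes the URI patterns quoted in the text.

    # Sketch: chain /linksns/ queries from a Pubmed article to Entrez genes,
    # and from those genes to Affymetrix probes, using a parameterised base URL.
    from rdflib import Graph

    BASE = "http://bio2rdf.org"   # a local instance could be substituted here

    def resolve(path):
        g = Graph()
        g.parse(BASE + path)      # content-negotiates an RDF document
        return g

    def identifiers(graph, prefix):
        """Collect namespace-local identifiers for URIs that start with prefix."""
        found = set()
        for s, p, o in graph:
            for node in (s, o):
                text = str(node)
                if text.startswith(prefix):
                    found.add(text[len(prefix):])
        return found

    genes = identifiers(resolve("/linksns/geneid/pubmed:11976288"),
                        BASE + "/geneid:")
    probes = set()
    for gene in genes:
        probes |= identifiers(resolve("/linksns/affymetrix/geneid:" + gene),
                              BASE + "/affymetrix:")
    print(len(genes), "genes,", len(probes), "probes")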

4.7.1 Use of model features

The workflow case study uses the basic query operations outlined in Section 4.4.1, along with the additional operations described here, which are useful for further optimising the way the scientist constructs and maintains the workflows related to their research.

References to other datasets

The URI model does not require that the leading part of a URI can be used to identify the namespace, although this assumption is used by other systems such as VoiD. This assumption is necessary if scientists want a simple method of determining which references in a data item are located in particular namespaces. This is an optimisation, but it is necessary to avoid having to resolve an entire document, and every reference. The optimisation is useful if the namespace can be identified from the leading part of the URI, and the URI can be normalised using this leading part of the URI.

The data normalisation rules implemented in the prototype can be used to identify links to other namespaces if a partial normalised URI, e.g., http://bio2rdf.org/namespace:, can be transformed to the prefix of the unnormalised URI. This is necessary, as the user does not know the identifier for any of the references that they are searching for. In cases where a data item contains references that can be identified easily, this query is useful for optimising the amount of information that needs to be transferred. This was used in the workflow to optimise the amount of information transferred, as the complete Affymetrix records were not required to determine their relevance to the workflow. An alternative to this method is to recognise the predicate, and rely on the predicate to optimise the amount of information that needs to be transferred. This alternative can also be used with services that do not support partial URI queries, but do support queries that can be optimised based on the relationship between the record and its references.

Human readable label for data

The references in linked datasets are given as plain, anonymous identifiers. This is not useful for scientists, as they need a human readable label to recognise what the reference refers to. This functionality is necessary to avoid retrieving entire documents describing each of the references in a document just to display the most informative interface for a scientist. Although this query may be optimised to avoid transferring entire records, it is also possible to filter an entire record using deletion rules. This is particularly relevant if the linked dataset is only accessible using Linked Data HTTP URI resolution, and the results need to be restricted to simple labels for the purposes of the query.

In the workflow, this query was used to identify Affymetrix records, as the entire record was not required for any part of the workflow, and the queries that were performed on the Affymetrix dataset did not include labels. It was useful to include human readable labels for scientists who wanted to verify the results. In the Bio2RDF configuration, this query is implemented as “http://bio2rdf.org/label/namespace:identifier”, where “namespace:identifier” is the identifying portion from the normalised URI, “http://bio2rdf.org/namespace:identifier”. In the example, the label for the Affymetrix data item identified by “http://bio2rdf.org/affymetrix:gadB_b1493_at” is found by resolving the URL “http://bio2rdf.org/label/affymetrix:gadB_b1493_at”.
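The sketch below shows one way a client might resolve the label query type just described and extract human readable labels. It assumes the returned document uses rdfs:label for the labels, which is an assumption about the Bio2RDF output rather than something the model prescribes.

    # Sketch: resolve a /label/ document and print the rdfs:label values it contains.
    from rdflib import Graph
    from rdflib.namespace import RDFS

    g = Graph()
    g.parse("http://bio2rdf.org/label/affymetrix:gadB_b1493_at")

    for subject, label in g.subject_objects(RDFS.label):
        print(subject, label)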

Provenance record integration

Scientists can use the provenance information in their workflow to provide for decisions based on which datasets and query types were used. There are many different cases where provenance information is useful and necessary, but in the context of distributed queries, the provenance record needs to contain at least the locations where the queries were performed, what the queries in each location were, and which datasets were potentially part of the results based on the queries and the data providers. The provenance record can then be used to determine the legitimacy of the query, including any trust and data quality metrics that can be created using annotations on the query types, data providers, and normalisation rules.

The prototype exposes the provenance record using the prefix “/queryplan/”. For example, the provenance for “http://bio2rdf.org/geneid:917300” can be found using the URL “http://bio2rdf.org/queryplan/geneid:917300”. In the workflows in this case study, these records were used to determine the source of statements, and the query types that were used, to provide information about the link between the Entrez gene and Uniprot datasets, including the version of each dataset that was used. If there are multiple versions in use, the information may not be trustworthy, as any deletions or

fixes that were provided between the versions may not be visible. If the versions do not match the most current versions, the datasets may also not be trustworthy. The RDF statements in the document were then filtered on the basis of a Dublin Core version property, “http://purl.org/dc/terms/version”, pointing to the release version of the dataset that was used. If data provenance information is to be included, it needs to be retrieved using a separate query type, as the provenance record is designed to rely only on information that was in the model configuration of the server that resolved the query, to provide replicability down to the level of dataset versions.

In the context of the model, data provenance is informal information about how a particular fact was derived, rather than how it was accessed. Current provenance models focus on integrating data provenance chains into the data store, which is possible with the model, as it aims to be independent of datastores to enable replicability in future if a datastore is not available. Although it is not the focus of this research, there are numerous studies available [34, 35, 95, 119, 149]. For this research, scientists need to choose which queries they want to record the provenance for, and retrieve, store, and process those provenance records before processing the information further.
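The following is a minimal sketch of the version check described above: it resolves a query provenance document and lists the dataset versions it records, so that statements from unexpected versions can be flagged. Only the use of dcterms:version is taken from the text; the structure of the rest of the provenance record is configuration specific.

    # Sketch: inspect a /queryplan/ provenance record for dataset version statements.
    from rdflib import Graph, URIRef

    DCTERMS_VERSION = URIRef("http://purl.org/dc/terms/version")

    provenance = Graph()
    provenance.parse("http://bio2rdf.org/queryplan/geneid:917300")

    versions = {(str(s), str(o))
                for s, o in provenance.subject_objects(DCTERMS_VERSION)}
    for provider, version in sorted(versions):
        print(provider, "->", version)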

4.7.2 Discussion

The workflow management system was used to process and integrate different queries on the prototype. The example workflows demonstrated the context-sensitive abilities of the model to make it simple to personalise the data access operations, including the use of different data normalisation rules and locations. They also demonstrated the way the prototype could be used to make components of the overall query more efficient when the scientist was aware of the nature of the data that was required, without requiring them to be aware of this aspect before starting to design the workflow. The workflow was used to identify the provenance of some queries, although the necessary size of the provenance records made it difficult to keep this information throughout the entire workflow.

It was simple to change the context by including a reference to the base URL for the prototype in each workflow. In contrast to other methods, such as direct access to Web Services, the base URL makes it possible for scientists to change the location of the data, and to normalise the data in the model to match the structure of the statements that are used by the workflow; otherwise, scientists would have to change the workflow to match a new set of Web Services if there were multiple sources of data, and encode each data normalisation rule into each workflow that required access to the Web Service.

The datasets and query definitions that were used were not identified at every stage, as the provenance records were found to be very large in some cases where a large part of the configuration was applicable to each query. The overall configuration file for the prototype was in the range of 2-3MB when represented using RDF/XML. A global links query, as was performed in the last step of the case study, returned provenance records that were about 500KB in RDF/XML. The other provenance queries returned documents ranging from 12KB, for the query plan for examining references to uniprot in affymetrix records, to 100KB for uniprot and ecocyc record resolutions, as the Uniprot dataset contains a large number of links to other namespaces. The size of the provenance records for ecocyc records was increased because of a number of automatically generated data providers that were created to insert links to HTML and other non-RDF representations of records. The size of the provenance record is proportional to the number of data providers, RDF normalisation rules, and namespaces that were relevant to each query.

The workflow, along with the HTTP URIs that form the identifiers for data in the workflow, was useful for linking the relevant services. It was useful for abstracting the details of the queries from the scientific experiment, so that the queries on the datasets could be changed, or optimised, without the scientist needing to change their workflow. A scientist could change the data providers, including new normalisation rules to fit the data quality expected by the workflows. They could redirect the location of the prototype that they used by changing a single parameter in the workflow, making it possible to change a public interface while providing the old interface for select purposes. The scientist could also choose to extend the workflow using the model further by following the pattern used by the workflow to make up new queries.

4.8 Auditing semantic workflows

Scientists can use the provenance information along with their results to audit the datasources, queries, and filters that were used to produce their results. The provenance model needs to be taken into account to validate the design of a workflow in terms of whether the queries were successful and accurate. Although the provenance model does not directly provide a proof, it can be used along with review methods to validate the set of queries, and the datasets that were accessed. Assertions based purely on provenance models are separated in this research from assertions which investigate the scientific meaning which provided the motivation for the workflow. Useful optimisation conclusions based on both the syntactic and semantic levels are available as part of the provenance model. However, review methods were not designed as part of the model, other than to provide a simple flag to confirm that the configuration for a query type or data provider had been curated.

The provenance model is based on the RDF configuration model, and as such is designed to allow for extensions, including new relationships between items in the model, while keeping backward compatibility. Any new relationships that use Linked Data HTTP URIs in RDF would be naively navigable, making it possible for scientists to use future provenance records in a similar way to their processing of the current model. The current model provides URI links from Providers to Query Types, Namespaces, and Normalisation Rules, links from Query Types directly to Namespaces, and links from Profiles to Providers, Query Types and Normalisation Rules.

If a scientist did want to make workflow optimisations, they could use either of two different methods, based on the number of observed statements about a specific item, or on a common semantic link between items. The first method is more suited to purely workflow based provenance, which attempts to streamline workflow executions based on knowledge about failures of specific processing tasks, the length of time taken to execute a specific workflow processor, or the amount of data which has to be transferred between different physical sites to accommodate grid or distributed processing. The second method relies on knowledge about the scientific tasks or data items which are being utilised by each of the workflow designs and executions which are under consideration by the auditing agent. The common link may refer to an antecedent item which was related to the current item by means of a key of some sort, as opposed to relating two items which were both processed by a given processor and were ontologically similar.

Even without reasoning about the meaning of information derived from the semantic review metadata, any data access methods embedded in the metadata can be utilised to provide a favourites or tagging system which can quickly identify widely trusted data providers and query types. This system may be integrated with other tagging systems, using the model to access multiple data providers that could provide tags for an item. In a scientific scenario where scientists are performing curation on data, for example gene identification in biology, the tagging could be used for both data and process annotations. An example of how this process may be designed can be seen in Figure 4.4.
With respect to the first set, it is possible that specific semantic tags can be accommodated within a provenance ontology, an example of which may be found in the myGrid provenance ontology 53. The myGrid ontology is not directly suitable, as it assumes that the workflow model being represented fits within the SCUFL model, as shown by its use of DataCollection elements, which form the basis for the distinction between single data elements and multiple data elements being the output from a SCUFL workflow. The most notable reason for not using the myGrid ontology, as is, for extended research into the topic is that it forms a number of key provenance elements using strings, where the model is able to give unambiguous identifiers to each item, and allow these to be referenced using URIs.

In contrast to the idea that only one ontology is acceptable in each domain, multiple ontologies can exist, with each referencing the other where applicable and sensible. This is particularly relevant where a process provider may have a description of their server published using an ontology which is compatible with one workflow system, but many parts of the description can also be represented the same way under another ontology. If this were the case, and someone published a structure outlining the relationships between the two ontologies, then a reasoner could present the common pieces of information for both, even though the scientist did not anticipate this. This information is also useful if the provider of the original server did not publish semantic descriptions of their process, requiring an external provider to publish the descriptions and keep them up to date. These external descriptions must then be relied upon to provide accurate information in relation to auditing a given provenance log. The model provides the basic elements that scientists can use to analyse a provenance log to determine the range of sources that were used.

53 http://www.mygrid.org.uk/ontology/provenance.owl

Figure 4.4: Semantic tagging process. (The diagram shows a scientist collecting samples, selecting an interesting region, annotating the region with a tag, giving the tag a meaning and extended semantic attributes, giving the region extended attributes, and creating a semantic wiki page for the resource; each data item, tag, workflow, and provenance record is identified by an HTTP URI, for example http://mquter.qut.edu.au/tags/xba2.)

4.9 Peer review

Direct access to scientific data is increasingly being required by peer reviewers to verify the results given in proposed publications, as the data is not easily replicable for disciplines where the experiment relies on data from a particular animal or plant, for example. This peer review process, shown in Figure 4.5, is predicated on the independence of the peers, and in some cases the authors are unknown, although in many fields it may not be difficult for peers to identify the authors of a paper based on the style and references. The transfer of this data to the peers in a completely transparent process would require the data to be transferred to the publisher, and the peers would receive the information in an anonymised form from the publisher. The model is useful for this process, as the namespace components of the URIs can be transparently redirected to the publisher's temporary server or to the peers' copy of the data, assuming that the publisher and peers require a detailed review of the as yet unpublished data. Although it is useful to provide temporary access to unpublished datasets, the dataset may eventually need to be published along with the publication. At this point, a final solution must be given to support the relevant queries, with a minimal number of modifications to the original processing methods.

Figure 4.5: Peer review process. (The diagram shows the peer reviewer reading the published material, critiquing the validity of the hypothesis and the design of the experiment, analysing previous experiments, replicating the experiment if needed, and verifying the analysis process; the peer reviewers return their opinions to the journal editor, who decides whether to publish the work; the article is then published as part of a journal issue, electronic and/or paper, and receives citations and post-publication reviews by peers.)

In some disciplines, the dataset may be either too large to individually publish each item, or the publication may refer to evaluations of public data. In these cases, the publication may provide a set of HTTP URI links to documents which can be used to explore the information. This ability is not new, as DOIs have been used for this purpose already, but the ability to link directly into the queries, with the ability of both peer reviewers and informal future reviewers to get information about the component queries that were used, could be a valuable part of the publication chain in disciplines such as genomics. It would be most valuable if there is a known relationship between the items in the query results and other data items, as is encouraged in this thesis by the use of HTTP URIs and RDF documents.

The prototype described in Chapter 5 publishes the details and results of individual queries using HTTP URIs, for example, http://bio2rdf.org/myexp_workflow:656. Related HTTP URIs are available for data access provenance records, for example, http://bio2rdf.org/queryplan/myexp_workflow:656. Higher level provenance details are handled by other systems, such as a workflow management system, that are suited to the task. For example, the MyExperiment website provides RDF descriptions and HTTP URIs for each of the workflows and datasets that are uploaded to their site, including the original URI for the workflow referred to earlier in this paragraph, http://www.myexperiment.org/workflows/656.

4.10 Data-based publication and analysis

In order for computers to be utilised as part of the scientific process, data needs to be in computer readable forms [98]. A major source of communication for scientists is published articles. As these publications restrict the amount of information that can be included, scientists have developed text mining algorithms to interpret and extract the natural language data from published articles [38, 49, 90, 126, 134]. However, recently a few journals and conferences have started supporting the use of computer readable publications, which include links to related data, and in some cases include the meaning of the text and links [128]. The use of semantic markup on text and links to related data provides opportunities for scientists to automate otherwise manually organised processes. The model is designed to take advantage of these extensions to provide scientists with integrated methods of solving their scientific questions using the interactions shown in Figure 4.6.

Marking up data, and its use in automating processes, presents many challenges that were not present in the initial world wide web, where the emphasis was to make data computer readable and able to be distributed between locations. Making data readable simply requires a specific format based on a few simple categories of data. For instance, true and false facts from boolean logic, integers and real numbers from mathematics, and textual forms from natural language, are sufficient as a basis for data readability. Other forms such as image and spatial data can be represented using these basic forms, although they will take more computational power to process and integrate. The creation of semi-accessible data using plaintext metadata creates a quasi-semantic form which may be easier for scientists to personally understand, but does not ensure that a computer will be able to unambiguously utilise the data [2].

Figure 4.6: Scientific publication cycle. (The diagram shows a scientist collecting data experimentally, using publicly accessible external datasets and local datasets, reading and integrating previously published work, collaborating on and reviewing publications, making conclusions and writing them up with links to datasets and citations to previously published material, and publishing papers that other scientists then read and cite.)

This plaintext metadata may come in the form of dataset fields or file annotations; however, these are difficult to relate intuitively to the real world properties that motivated the initial hypothesis. The creation of structures that group and direct the associations between metadata elements is the final pre-requisite for making clear decisions. Such decisions may be based on the domain structures that the computer can access, possibly using structures typically associated with scientific ontologies [14, 20, 22, 32, 77, 79, 84, 101, 104, 135, 146]. Although there may be disagreements about the use of data, semantic markup fixes the original meaning while allowing the scientist to interpret the data using their own knowledge of the meaning of the term [97, 109]. Although a computer will then be able to utilise the data more effectively, the domain ontologies may still not be universally relevant due to a lack of world knowledge introduced by their creators [122]. This lack is an inherent limitation to be overcome through cross-validation and evolution of the domain ontologies in an iterative manner, based on previous attempts to use the structures. The ability to gradually evolve, and to normalise the use of, domain ontologies is provided through the context sensitive normalisation rules in the model. The ontologies that are used in different locations can be normalised where possible to give a simple basis on which to further utilise data, without having to deal with complex mapping issues in each processing application.

References to data items need to be recognised by scientists regardless of the format, to remove the barriers to querying data based publications. This may not be a simple exercise for the scientist, as there may be different URI formats that are used in different locations to identify an item. The model and prototype provide simple ways to represent data references, and to resolve references from multiple locations. The prototype requires the use of the RDF format to integrate data from different locations. RDF does not have to be presented in specialised RDF files. The RDFa format allows scientists to mark up existing XML-like documents with structured content. This allows a transparent integration with current web documents, which allows scientists to present and publish documents, and maintain structured data in the same document, with the allowance for structured links to other documents using Linked Data principles.

The use of the model with RDFa is implemented in the HTML results pages generated by the prototype. The documents contain valid RDFa; however, the dynamic nature of the documents generated by the model makes it hard to take advantage of some of the advanced syntactic features of RDFa. RDFa is designed to allow for efficient manual markup of XML documents using compulsory shorthand namespace declarations, and dynamic generation of an entire automatically generated document reduces the usability of the information, as the namespaces must be artificially generated to fit with the RDFa design principles.

As part of the publication process, scientists are required to give details about the methods they used to generate their results.
Using the model, they are able to publish both the process and data details using a combination of workflow file formats and links to the relevant provenance records. Using these methods, scientists can integrate less curated data and workflows, without requiring methods of re-interpreting some pieces of data using statistical or interpretative processes such as text mining or other forms of pattern analysis. Although it may be necessary for scientists to continue to use these methods in some cases, including for past publications, the future use of semantic annotations on publications, in relation to data, processes, and results, will increase the scalability of the analysis processes. Semantically marked up publications contain links between textual language and defined concepts, which were previously not available and could not previously be used to validate or mediate the analysis process. These links can easily be generated using the model as the data access method, and these documents can be simply integrated with other RDF documents using standard RDF graph merging semantics.

The results may need to be derived through the use of a program that is not simple to represent in workflow terminology, such as a Java, C# or C++ program. In these cases, the prototype could still be used to resolve queries on datasets, given the wide range of tools available to resolve and process HTTP requests and RDF documents. However, the results may not be as simple to replicate as when the programs are represented in other formats which can be easily transported between different scientists.

Chapter 5 Prototype

5.1 Overview

A prototype web application was created to demonstrate the scientific usefulness of the data access model described in Chapter 3. The prototype was implemented as a stateless web application, driven by a dynamic configuration, that relies on RDF as its common document model. It was configured to provide access to a large number of scientific datasets from different disciplines, including biology, medicine and chemistry. It follows the Linked Data principles in making it possible for scientists to construct URIs using the HTTP protocol, and to make the URIs resolvable to useful information, including relevant links to other Linked Data URIs that can be resolved using the prototype. This enables it to be integrated easily with the cases described in Chapter 4.

The prototype makes it possible for scientists to access alternative descriptions of published dataset records using their own criteria, including adding and deleting statements from the record. HTTP URIs that identify different records in another scheme can be resolved by the prototype using normalisation rules to convert the URIs to locally resolvable versions. Scientists can then publish their own HTTP URIs that are resolved using their instance of the prototype. In terms of provenance, scientists are able to control the versions of the datasets that are used by the prototype, so they can then publish URIs that point to their instance and be able to personally ensure that the data in the resulting records will not materially change without changes to the configuration of the prototype. For example, a scientist may resolve http://localhost/prototype/namespace:identifier, and the prototype may be configured to interpret that URL as being equivalent to http://publicprototype.org/namespace:identifier. This functionality is necessary for scientists to be able to customise and trust their data, without having to publish their documents using the official globally authoritative URI for each record if it would not be resolved using their instance of the prototype.
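A minimal sketch of this local-to-public URI interpretation is given below, assuming a hypothetical LocalUriRewriter class and the example hosts used above; the real prototype derives such mappings from its RDF configuration and normalisation rules rather than from hard-coded strings.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: map a locally hosted prototype URI onto the public URI
// scheme it is configured to be equivalent to.
public final class LocalUriRewriter {
    // The prefix a scientist publishes for their own instance of the prototype.
    private static final String LOCAL_PREFIX = "http://localhost/prototype/";
    // The public URI scheme that the local instance is declared equivalent to.
    private static final String PUBLIC_PREFIX = "http://publicprototype.org/";
    // namespace:identifier, as in the Bio2RDF-style URIs used by the prototype.
    private static final Pattern QUERY = Pattern.compile("([\\w-]+):(.+)");

    public static String toPublic(String localUri) {
        if (!localUri.startsWith(LOCAL_PREFIX)) {
            throw new IllegalArgumentException("Not a local prototype URI: " + localUri);
        }
        String query = localUri.substring(LOCAL_PREFIX.length());
        Matcher m = QUERY.matcher(query);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a namespace:identifier query: " + query);
        }
        return PUBLIC_PREFIX + m.group(1) + ":" + m.group(2);
    }

    public static void main(String[] args) {
        // Prints http://publicprototype.org/go:0000345
        System.out.println(toPublic("http://localhost/prototype/go:0000345"));
    }
}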

The prototype was used extensively on the publicly accessible Bio2RDF website, allowing biologists and chemists to retrieve data from many different linked scientific

datasets. The prototype software was released as Open Source Software on the Sourceforge site (http://sourceforge.net/projects/bio2rdf/). Scientists can download this software and customise it with their own domain names to provide access to their local datasets without reference to the Bio2RDF datasets.

The prototype accepts and publishes its configuration files in all of the major RDF formats. This makes it possible for scientists to reuse both the Bio2RDF configuration, containing biology and chemistry datasets, and any other configuration sources as part of their implementation, so that they can avoid reconfiguring access to these public data providers. If a scientist has access to alternative data providers or query types compared to the Bio2RDF configuration, they can use the profile mechanism to selectively ignore the data providers that they are replacing without first having to remove them from the configuration document.

Scientists are able to share the details of their query with other scientists using the RDF provenance document for the query. The prototype provides access to the provenance record for any query using URIs similar to http://localhost/prototype/queryplan/namespace:identifier. The provenance record contains all of the relevant configuration information that the prototype relied on to process the query. It enables other scientists to replicate the results without requiring any further configuration. This makes it possible to set up an instance of the prototype using provenance records as configuration sources, possibly without having access to the entire original configuration that was initially used. This ability provides for direct replication of published experiments wherever possible.

The prototype can be configured to access data using both internal and external access methods, depending on the context of the user. Scientists can use this functionality to provide limited access to their results using the prototype as a proxy. Peers can replicate the experiment on the same datasets to review the results, without the scientist having to expose direct access to their organisation's entire database. The profiles, and the profile directives on each query type and provider, are used by peers to distinguish between the data providers that they can access, as opposed to those that the scientist used internally.
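As a sketch of how a provenance record might be retrieved for sharing, or for later use as a configuration source, the following hypothetical helper issues an HTTP GET against the queryplan URI form shown above; the Accept header and class name are assumptions, not part of the prototype's documented interface.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch: retrieve the provenance record (query plan) for a query.
public final class QueryPlanFetcher {
    public static String fetchQueryPlan(String prototypeBase, String query) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(prototypeBase + "queryplan/" + query))
                .header("Accept", "application/rdf+xml") // assumed serialisation
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Query plan request failed: " + response.statusCode());
        }
        return response.body(); // RDF document describing the query bundles that were used
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchQueryPlan("http://localhost/prototype/", "namespace:identifier"));
    }
}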

5.2 Use on the Bio2RDF website

The prototype was used as the engine for the Bio2RDF website, with an example shown in Figure 1.13. In this role, it was used to access both Bio2RDF and non-Bio2RDF hosted datasets. Queries on the Bio2RDF website are normalised using the Bio2RDF URI conventions, with the exception of equivalence references in results that link to non-Bio2RDF Linked Data URIs.

The prototype configuration created for the http://bio2rdf.org/ website contains references to 1603 namespaces, 103 query types, 386 providers and 166 normalisation rules. The Bio2RDF configuration provides links out from the RDF documents

to the HTML Web for 142 of the namespaces. These providers contain URL templates that are used to create URLs for the official HTML version of all identifiers inside the namespace. The Bio2RDF namespaces can be resolved using query types and providers in 12921 different permutations. The Bio2RDF configuration can be found in N3 format at http://config.bio2rdf.org/admin/configuration/n3.

Figure 5.1 shows an example of the steps required for the prototype to retrieve a list of labels for the Gene Ontology (GO) [62] item with identifier "0000345", known as "cytosolic DNA-directed RNA polymerase complex". It uses the Bio2RDF configuration and illustrates the combination of a generic query type with a query type that is customised for the GO dataset. The queries are designed so that the generic query can be useful on any information provider, while the custom GO query will be restricted to providers that contain GO information, because it uses an RDF predicate that only GO datasets contain. If another dataset were available to retrieve labels for GO terms using a different query, then a custom query definition could be added in parallel to these two queries without any side effects. Further examples of how the prototype is used on the Bio2RDF website can be found in Section 4.4.1 and Section 4.7.1.

The prototype made it possible for the Bio2RDF website to easily include and extend data from other RDF based scientific data providers. This made it possible to perform advanced queries on datasets using normalised, namespace-based URIs to locate datasets and distribute queries. The prototype implements some URI patterns that are unique to Bio2RDF, and these would need to be changed by modifying the URL rewriting configuration file inside the prototype. The query is interpreted by the prototype, and the actual query passed to the model excludes the file format, whether the query is requesting a query plan, and the page offset for the results. As these are prefixed in a known optional order, scientists can always consistently identify the different instructions to the prototype for each query, independent of their query types.
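A sketch of how these optional prefixes could be stripped from the path before the remaining query is handed to the model is given below. The exact prefix spellings (the format names, "queryplan", "pageoffsetN") and default values are assumptions for illustration; only the known optional order of format, query plan flag, and page offset is taken from the text.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of stripping prototype-specific prefixes from the path.
public final class PathInterpreter {
    private static final Pattern FORMAT = Pattern.compile("^(rdfxml|n3|ntriples|html|json)/(.*)$");
    private static final Pattern QUERY_PLAN = Pattern.compile("^queryplan/(.*)$");
    private static final Pattern PAGE_OFFSET = Pattern.compile("^pageoffset(\\d+)/(.*)$");

    public static void interpret(String path) {
        String format = "html";          // assumed default result format
        boolean planRequested = false;   // whether only the query plan is wanted
        int pageOffset = 1;              // first page of results by default

        Matcher m = FORMAT.matcher(path);
        if (m.matches()) { format = m.group(1); path = m.group(2); }
        m = QUERY_PLAN.matcher(path);
        if (m.matches()) { planRequested = true; path = m.group(1); }
        m = PAGE_OFFSET.matcher(path);
        if (m.matches()) { pageOffset = Integer.parseInt(m.group(1)); path = m.group(2); }

        System.out.printf("format=%s plan=%b offset=%d query=%s%n",
                format, planRequested, pageOffset, path);
    }

    public static void main(String[] args) {
        interpret("n3/label/go:0000345"); // format=n3 plan=false offset=1 query=label/go:0000345
    }
}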

5.3 Configuration schema

The configuration documents for the prototype were created based on a schema that can be found at http://purl.org/queryall/. The schema gives the definitions for each of the schema URIs, along with whether an RDF object or a literal is expected for each property. A simple configuration that was created using the model can be seen in Figure 5.2.

In order for the configuration schema to evolve without interrupting past implementations, the prototype was built to understand different versions and to process them accordingly. The prototype schema went through 4 revisions as part of this research, with each implementation able to provide backwards compatible documents based on schemas from previous versions. In some revisions, forwards compatibility with new versions was affected. Forwards compatibility is important in order for scientists to replicate queries using current software where possible.

Figure 5.1: URL resolution using prototype. The figure traces the resolution of http://bio2rdf.org/n3/label/go:0000345 (host http://bio2rdf.org/, response format N3, user query label/go:0000345). The query matches two query types against the regular expression ^label/([\w-]+):(.+): a generic label search (http://bio2rdf.org/query:labelsearch) that is useful for all namespaces, and a GO-specific label search (http://bio2rdf.org/query:labelsearchforgo) that is only useful for the namespace http://bio2rdf.org/ns:go, identified by the prefix "go" in the first matching group. Both query types select the provider http://bio2rdf.org/provider:mirroredobo, which is not a default provider but is able to resolve queries for the GO namespace; one of its endpoint URLs, the SPARQL endpoint http://cu.go.bio2rdf.org/sparql, is chosen and the SPARQL templates from the query types are populated with the variables from the user query. The generic query returns an rdfs:label statement ("cytosolic DNA-directed RNA polymerase complex [go:0000345]") and the GO-specific query returns a name statement ("cytosolic DNA-directed RNA polymerase complex"). The normalisation rule http://bio2rdf.org/rdfrule:ontologyhashtocolon is applied to the results from this provider to standardise the use of colons instead of hashes in the /ns/ ontology terms. The results of the matched queries are then parsed into an in-memory RDF database, and the combined results, containing rdfs:label, name and dc:title statements, are returned to the user in the requested N3 format.

@prefix query: .
@prefix provider: .
@prefix profile: .
@prefix : .

:myquery a query:Query ;
    query:inputRegex "(.*)" ;
    profile:profileIncludeExcludeOrder profile:excludeThenInclude .

:myprovider a provider:Provider ;
    provider:resolutionStrategy provider:proxy ;
    provider:resolutionMethod provider:httpgeturl ;
    provider:isDefaultSource "true"^^ ;
    provider:endpointUrl "http://myhost.org/${input_1}" ;
    provider:includedInQuery :myquery ;
    profile:profileIncludeExcludeOrder profile:excludeThenInclude .

Figure 5.2: Simple system configuration in Turtle RDF file format

For example, in the first revision, the RDF Type definitions for a provider did not include which protocols it supported, as these were only given in the resolution method. However, the current version requires the RDF Type definitions to recognise the type before parsing the rest of the configuration, independent of the resolution method. In this case, a scientist would need to add the RDF Type to each of the providers before loading the configuration when they wished to replicate queries using that version. In general, any new features that are added to the model may affect forwards compatibility if the default values for the relevant features are not accurate for past implementations.

The configuration schema is defined using RDF so that the same query engine can be used to query the configuration as the one that parses and queries the scientific data. In addition, the use of RDF enables previous versions to ignore elements that they do not recognise without affecting the other statements. This enables past implementations to highlight statements that they did not recognise, flagging possible compatibility issues with new configuration schemas, while enabling them to function using a best effort approach.

If the configuration is dynamically generated using the prototype, a particular configuration schema revision may be requested, and the prototype will make a best effort to translate the current model into something that will be useful to past versions. For example, the Bio2RDF mirrors all request their current implemented version from the configuration server, even if the configuration server supports a newer version. However, the use of RDF makes it possible to include new features in the returned documents without affecting the old clients, which are required to flag and ignore unrecognised properties in configuration documents.

Backwards and forwards compatibility issues have several implications for scientists. If a query provenance record, including the relevant definitions from the overall configuration, is published as a file attachment to a publication, it will use the current configuration version. The query provenance record contains a property for each of the query bundles that were relevant to the query, indicating which schema version was used to generate that bundle. Scientists can use this information to accurately identify a complying implementation to execute the query provenance record against when they wish to replicate it. It is recommended that scientists publish the actual query provenance records, or at minimum the entire configuration file that was relevant to their research. If the query provenance record is published using links to a live instance of the prototype, the links may not be active in the future, so replicability may be hampered. However, if the links are active, they may refer either to the current version of the configuration syntax, as the software may have been updated, or explicitly to the version that was active when the research was performed. The Bio2RDF prototype allows fetching of configuration information using past versions; for example, http://config.bio2rdf.org/admin/configuration/3/rdfxml will attempt to fetch the current configuration in RDF/XML format using version 3 of the configuration syntax, while http://config.bio2rdf.org/admin/configuration/rdfxml will attempt to retrieve the configuration using the latest version of the configuration syntax.
If possible, all three options should be simultaneously available to scientists when it comes time to publish research that uses the prototype, as all three have advantages and, together, they reduce the number of disadvantages. For example, if they just publish the file, scientists do not get access to any new features, or fixes to normalisation rules, that may have been found necessary since the publication. If they just publish the version specific configuration URL, they ensure that any future normalisation rule, provider and query type changes will be available, as long as the URL is resolvable. The version specific configuration URL should also limit the possibility for unknown future features to have a semantic impact. However, if it is not obvious how to change this URL to the current version, or to the version that the scientist has access to, it may not be as useful to them. If they just publish the version independent URL, then future scientists may not know how to get access to the past version. Overall, the availability of all three options conclusively identifies all of the necessary information: the configuration syntax version that was used, in the RDF statements making up the query provenance record; the original configuration information; the current configuration information available in the original configuration version; and the current configuration information available in the current configuration version.
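The following sketch simply assembles the three complementary artefacts discussed above for attachment to a publication; the config.bio2rdf.org URL layout follows the examples in the text, while the file name and class are illustrative assumptions.

// Hypothetical sketch of assembling the three artefacts to publish alongside a paper:
// the provenance/configuration file itself, the version-specific configuration URL,
// and the version-independent configuration URL.
public final class PublicationArtefacts {
    public static void main(String[] args) {
        String configBase = "http://config.bio2rdf.org/admin/configuration/";
        int schemaVersion = 3;      // the configuration syntax version in use at publication time
        String format = "rdfxml";

        String provenanceFile = "queryprovenance.rdf"; // illustrative file name for the attachment
        String versionSpecificUrl = configBase + schemaVersion + "/" + format;
        String versionIndependentUrl = configBase + format;

        System.out.println("Attach file:          " + provenanceFile);
        System.out.println("Version-specific URL: " + versionSpecificUrl);
        System.out.println("Latest-version URL:   " + versionIndependentUrl);
    }
}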

5.4 Query type

Query types are templated queries that contain the variables which are relevant to the query. The meaning of the variables is unique to each query type, as different query types may interpret the same question in different ways. Scientists can take advantage of this ability to create new compatible query types independently of other query types that may have previously been defined, and may still be used, in the context of some data providers. This has a direct effect on the ability of scientists to contextually normalise information, as the previous query interface may not have differentiated between factors that the scientist believes are relevant to their current context. An example of a situation where a scientist may want to optimise a current query to fit their context can be found in Figure 5.3. Depending on the scientist's trust and data quality preferences expressed in their profiles, all of the query types and providers illustrated in Figure 5.3 may be used to resolve their query.

This behaviour relies on the design of the pre-configured query type as a place where the namespace and data providers can be abstracted from the query. The abstraction makes it possible for the normalisation rules to be independent of the way the query is defined, apart from including markers in the query to determine where endpoint specific behaviour is required. In some situations a query may need to be modified to suit different data providers. This can be achieved, without changing the way the query is posed, using a new query type that is relevant to the namespaces in the data provider and the original query variables.

Figure 5.3: Optimising a query based on context. A typical query is handled by a widely available query type that matches *:* (namespace:identifier) and identifies a useful endpoint for the namespace "concept", while an optimised query is handled by a locally available concurrent query type that matches concept:A/* and performs the query on the scientist's local database of concepts known to start with A/.

The prototype implements query types as wrappers around templates that are either SPARQL queries or RDF/XML documents. These templates are used in conjunction with providers to include RDF statements in the results set. The wrappers require information derived from the input parameters. In the case of the prototype, the input parameters are derived by matching regular expressions against the application specific path section of the URL that was resolved by a scientist. For example, if the scientist attempted to resolve http://localhost/prototype/namespace:identifier, where "http://localhost/prototype/" was the path to the prototype application, then the input parameters would be derived by matching "namespace:identifier" against the regular expressions for each known query type. If a match occurred, then the query type would be utilised, and the matching groups from the regular expression would be used as parameters.

The prototype interprets each parameter as possibly being a namespace, and as either public or private, according to the configuration of each query type, as shown in Figure 5.4. Namespace parameters are matched against prefixes in all of the known namespaces to allow scientists to integrate their datasets with public datasets, without having to previously agree on a single authoritative URI for the involved namespaces. The distinction between public and private is used to automatically identify a set of parameters that could be normalised separately from the public namespace prefixes. A public parameter may be normalised to lower case characters if a template requires this, for instance, whereas the scientist may not want private, internal parameters to be normalised. This distinction enables the scientist to control the data quality of the private identifiers separately from the public namespace prefixes that they have more control over. In the prototype, matching groups default to being private, and not namespaces. In the example above, "namespace:identifier" may be matched against a query type using the regular expression "([\w-]+):(.+)". The query type may define matching group number 1, "([\w-]+)", to be a public parameter, and to be a namespace prefix. The second matching group, "(.+)", defaults to being private and not a namespace.

Figure 5.4: Public namespaces and private identifiers. The user query http://localhost/doi:10.1038/msb4100086 is matched by the Bio2RDF URI query type against the pattern ([\w-]+):(.+), giving two matching groups; only the first group is declared public and a namespace prefix, so "doi" is resolved to the Bio2RDF DOI namespace (prefix "doi", URI http://bio2rdf.org/ns:doi), while "10.1038/msb4100086" remains a private identifier. The BioGUID DOI provider, a DOI resolver to RDF that handles the namespace http://bio2rdf.org/ns:doi, fills its provider template URL http://bioguid.info/openurl.php?display=rdf&id={input_1}:{input_2} to give the replaced provider URL http://bioguid.info/openurl.php?display=rdf&id=doi:10.1038/msb4100086.

Scientists can extend the configuration to support new query methods by creating and using a new URI for the relevant parts of the query type, along with a new resolution method and resolution strategy on the compatible providers. This makes it possible to add a Web Service, for instance, where previously the query was resolved using a SPARQL endpoint, without formally agreeing on the interface in either case. Although this hides some of the semantic information about a query, it is important that scientists do not need community agreement on the semantic meaning of a query before they are able to use it. The use of the query in different scenarios may imply its meaning, but the data access model and prototype do not require the meaning to be established in order to use or share queries.

5.4.1 Template variables

Scientists need to be able to create generic query templates that include variables which are filled in when the query executes. This pattern is similar to the way resources are typically retrieved. The prototype, however, allows the variables to be defined in different ways for each of a set of query types that may be known to be relevant to the query. In contrast to typical methods that provide either named parameters (e.g., HTTP GET parameters and Object Oriented Programming languages) or complex structured objects (e.g., SOAP XML), the prototype allows scientists to define a parameter based on the path from the URL that the scientist tried to resolve. For example, the scientist may resolve http://localhost/prototype/concept:Q/212, and the prototype would use "concept:Q/212" as its input. This input could match in different ways depending on the purpose of the query. The prototype uses regular expression matching groups to recognise information in queries. These groups are available as variables to be used in templates in various ways, allowing denormalised queries and result normalisation to take place.

Template variables, of the general form "${variablename}", can be inserted in templates, and if they match in the context of the query type and the URL being resolved, they will be replaced. In the example shown in Figure 5.5, the two query types each match the same user query. The matching process creates two input variables in the BioGUID query, and one input variable in the general query. In the general case, the template variable does not represent a namespace, while in the BioGUID case, the first input variable represents a public variable, and it is known to be the prefix for a namespace. The prototype is then able to use the first input variable both to choose a provider based on the namespaces that have this prefix, and to replace the template variable in the URL using the input. In the general case, there is no namespace variable, so any default providers for the query type are chosen, and the template variable is replaced in the context of these providers.

The matching groups are available as template variables in different ways depending on the level of normalisation required. If the second matching group in the previous example needs to be converted to upper case, the variable "${uppercase_input_2}" could be included in the template. Variables need to be encoded so that the content of the variables will never interfere with the document syntax rules. For example, to properly encode the uppercased input variable used above, the template variable needs to be changed to "${urlEncoded_uppercase_input_2}".

Figure 5.5: Template parameters. The user query doi:10.1038/msb4100086, the path section of the resolved URL http://localhost/doi:10.1038/msb4100086, is matched by two query types: an HTTP GET query type that matches everything with the pattern (.+) (one group, no namespaces recognised, handling all namespaces to avoid having to know them all), and the Bio2RDF URI query type that matches the pattern ([\w-]+):(.+) (two groups, with the namespace prefix as the first group). The Bio2RDF URI match selects the BioGUID DOI provider, a DOI resolver to RDF handling the namespace http://bio2rdf.org/ns:doi, whose provider template URL http://bioguid.info/openurl.php?display=rdf&id={input_1}:{input_2} becomes http://bioguid.info/openurl.php?display=rdf&id=doi:10.1038/msb4100086. The HTTP GET match resolves the query using the public Bio2RDF mirrors, whose provider template URL http://bio2rdf.org/rdfxml/{input_1} becomes http://bio2rdf.org/rdfxml/doi:10.1038/msb4100086; note that {input_1} refers to a different matching group for this provider than for the BioGUID DOI provider.

This will replace all URI reserved characters with their "%XX" versions, where "XX" is the code for the reserved character according to the relevant RFC (http://tools.ietf.org/html/rfc3986#section-2.1). In the prototype, the %XX values are always represented in uppercase, so ":" is encoded as "%3A" and not "%3a".

In some cases data quality depends on the level of standardisation. In the case of URIs, some applications differ in the way they encode spaces. One set of applications encodes spaces (" ") as the plus ("+") character, while other applications encode them using percent encoding as "%20". The prototype allows for both cases: a URI can be encoded using the "+" character for spaces by changing "urlEncoded" in template variables to "plusUrlEncoded". For example, the example given above could be changed to "${plusUrlEncoded_uppercase_input_2}". If the dataset does not consistently use either method, then the prototype may be able to provide limited access to the data by using two different query types on the same provider to attempt to access data using both conventions.

For example, this template variable is used on the "dbpedia" namespace to retrieve information about resources starting with the prefix http://dbpedia.org/resource/. There are some URIs starting with this prefix which contain special characters in the name, such as http://dbpedia.org/resource/Category:People_associated_with_Birkbeck%2C_University_of_London, as the percent encoding is required to complete the name, but it interferes with the colon character after "Category". It is difficult to retrieve information from DBpedia if the item is a category and the name of the category contains a reserved character, as each case would require a normalisation rule to encode the reserved character. In the example above, the prototype will attempt to retrieve information about the non-percent-encoded URI, .../Category:People_associated_with_Birkbeck,_University_of_London.


When the identifier from the dbpedia namespace is fully percent-encoded, the URL would be .../Category%3APeople_associated_with_Birkbeck%2C_University_of_London. Neither of these URIs will return information, because they do not exactly match anything in the RDF database that is hosting DBpedia. The URI RFC (http://tools.ietf.org/html/rfc3986#section-2.2) includes a condition that URIs are equivalent if they only differ by a percent-encoded element in a part of the URI where the reserved characters will not interfere with the syntax, in this case, the path section of the URI. In this case, it may be satisfactory to make another namespace, for example "dbpediacategory", and associate the relevant part of the "dbpedia" namespace with that namespace instead to avoid the issue; for example, the URI would be http://myexample.org/dbpediacategory:People_associated_with_Birkbeck%2C_University_of_London, which does not contain conflicting percent encoded characters.

URLs in RDF/XML must also have the XML reserved characters encoded, as some of these characters are not encoded by the URI encoding (http://www.w3.org/TR/REC-xml/#sec-predefined-ent). This template variable in the example using RDF/XML would be "${xmlEncoded_urlEncoded_input_2}". The order of conversion is right to left, so the input is converted to uppercase before it is URL encoded, and then the resulting string is XML encoded.

A query type may define intermediate templates which take the input parameters and use them to create other templates. An example of this in the prototype is the ability to define, for each query, what a normalised URI would look like given the parameters. This gives scientists another variable to use in queries where they do not necessarily want to deal with each input individually, as that may make the template harder to port to other query types. The normalised URI in the previous example may be defined using "${defaultHostAddress}${input_1}${defaultSeparator}${input_2}". In the template, the expected public host is given, which may in some cases be resolvable to a Linked Data document. Along with the default separator, this template is used to derive the normalised URI from the abstract details, which may not be directly related to the URI that was used by the scientist to resolve the query with the prototype. This allows scientists to use the normalised URI matching the public URI scheme when they do not need to publish URIs that will resolve using their instance of the prototype. In many situations scientists may require others to resolve the document using their instance in order for others to easily replicate the queries; however, both options are available.

The ability for the prototype to recognise intermediate templates was necessary so that scientists can reliably normalise complete URIs to the appropriate form that is required for each endpoint, as the normalisation rules assume that they will be operating on the complete URI. For example, it is not possible to reliably normalise a URI, for example http://myorganisation.org/concept:B/222, to its endpoint specific version, for example http://worldconcepts.org/concepts.html?scheme=B&id=222, without including the protocol and host along with the parameters that were interpreted by the query type.

If only the parameters "concept:(A-D)/(\d+)" were included in the rule, then the normalisation phase might change http://worldconcepts.org/concepts.html?scheme=B&id=222 to a URL such as http://worldconcepts.org/concepts:B/222 without changing the host or protocol of the URI.
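The right-to-left conversion chain for a template variable such as "${xmlEncoded_urlEncoded_uppercase_input_2}" can be sketched as follows; the helper method names are assumptions, and only the conversion order, the uppercase %XX convention, and the %20-versus-"+" distinction come from the text.

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch of the conversion chain: uppercase first, then percent encoding, then XML escaping.
public final class TemplateEncoding {
    static String urlEncoded(String input) {
        // URLEncoder uses "+" for spaces; the urlEncoded form described in the text uses %20.
        return URLEncoder.encode(input, StandardCharsets.UTF_8).replace("+", "%20");
    }

    static String plusUrlEncoded(String input) {
        return URLEncoder.encode(input, StandardCharsets.UTF_8); // keeps "+" for spaces
    }

    static String xmlEncoded(String input) {
        return input.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                .replace("\"", "&quot;").replace("'", "&apos;");
    }

    public static void main(String[] args) {
        String input2 = "category:a b&c";
        // ${xmlEncoded_urlEncoded_uppercase_input_2}, applied right to left:
        String result = xmlEncoded(urlEncoded(input2.toUpperCase()));
        System.out.println(result); // CATEGORY%3AA%20B%26C
    }
}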

5.4.2 Static RDF statements

The prototype allows scientists to include pre-defined, static templates that can be used to insert information into a document in the context of a data provider and a user's query. Scientists can use this functionality to keep track of which URIs each endpoint used, along with their relationship to the URI used by the prototype. This makes it possible for scientists to describe the context of their document in relation to other documents, which may be describing equivalent real world things in different ways. This is in contrast to typical recommendations that specify that there should be a single URI for each concept, so that others can simply query for descriptions of the item without knowing in what context each alternative description is given. The RDF model is designed to provide ways of inter-relating concepts, including different descriptions of the same item. The prototype allows scientists to set up basic rules about the context surrounding their use of the URIs, and the RDF model, along with HTTP URIs, is used to implement this. Each of the alternative URIs may be a Linked Data URI that can be resolved to a subset of the data that was found using the normalised URI structure. Even traditional web URIs, which do not resolve to RDF information, can be useful. For example, a scientist's program may require a particular data format, and the traditional web URI can point to that location, as shown in Figure 5.6.

Figure 5.6: Uses of static RDF statements. The normalised URI http://example.org/namespace:identifier is statically linked to a Dublin Core identifier (namespace:identifier), to a scientific data format URL (http://namespacefoundation.edu/cgi-bin/formatX.cgi?id=identifier), to an alternative Linked Data URI (http://othersource.org/record/namespace2/identifier), and to the RDF data source URL from the provider endpoint (http://namespacefoundation.edu/cgi-bin/RDFendpoint.cgi?id=identifier).
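A minimal sketch of how such a static template might be filled in is given below; the predicate URI, host addresses and class are illustrative assumptions, while the ${...} template variables follow the convention used by the prototype.

import java.util.Map;

// Hypothetical sketch of a static N3 template attached to a query type.
public final class StaticRdfTemplate {
    // The predicate URI below is illustrative only; the prototype's actual vocabulary may differ.
    private static final String TEMPLATE =
            "<${defaultHostAddress}${input_1}${defaultSeparator}${input_2}> " +
            "<http://example.org/ns#hasAlternativeUri> " +
            "<${endpointSpecificUri}> .\n";

    static String fill(String template, Map<String, String> variables) {
        String result = template;
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            result = result.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Links the normalised URI to the URI actually used by the endpoint.
        System.out.print(fill(TEMPLATE, Map.of(
                "defaultHostAddress", "http://example.org/",
                "defaultSeparator", ":",
                "input_1", "namespace",
                "input_2", "identifier",
                "endpointSpecificUri",
                "http://namespacefoundation.edu/cgi-bin/RDFendpoint.cgi?id=identifier")));
    }
}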

5.5 Provider

Datasets are available in different locations, and the method of accessing each dataset may be different. Scientists are able to implement multiple access methods across multiple locations for a range of query types using providers. This makes it possible for scientists to change the method, location, and query that they use for data access without having to change their programs, as providers are contextually linked to the query types, which are linked to the overall query. The prototype makes it possible to use a range of methods for data access by substituting providers and query types, enabling scientists to replicate queries in their context without restrictions that may have been imposed on the original scientist.

Each dataset may need more than one data quality normalisation change in order for the results to be complete. Each provider can be configured with more than one normalisation rule, which makes it possible to layer normalisation rules. If scientists want to change the URIs that are given in a document to republish the results using the URI of their prototype, they can add a normalisation rule at a higher level than each of the other normalisation rules. This normalisation rule will change the normalised URI into one that points to the prototype. In some cases, the scientist may not know exactly which set of data normalisations is required for a particular dataset. In these cases the scientist may need to have multiple slightly different queries performed on the dataset as part of the overall query. Each provider can be used multiple times to respond to the same overall query, although it will only be used once in the context of a query type.

The prototype uses HTTP based methods, including HTTP GET, and an HTTP POST method for SPARQL endpoints. These make it possible to use a large range of data providers, including all of the Linked Data scientific data providers, and the SPARQL endpoints that have been made available as alternatives to the Linked Data interfaces of the scientific data providers. In Figure 5.7, there are two providers, which are used in different contexts. In one context, myexample.org uses the original data provider, with the associated normalisation rule, while a scientist who has curated and normalised the dataset could easily substitute their own provider and exclude the original data provider using their profile.
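The HTTP POST access method for SPARQL endpoints can be sketched as follows, using the standard SPARQL protocol form encoding; the endpoint URL and DESCRIBE query echo the illustrative values in Figure 5.7, and in the prototype both would come from the provider and query type configuration rather than being hard-coded.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Sketch of a SPARQL endpoint provider resolved with an HTTP POST request.
public final class SparqlPostProvider {
    public static String describe(String endpointUrl, String resourceUri) throws Exception {
        String body = "query=" + URLEncoder.encode(
                "DESCRIBE <" + resourceUri + ">", StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpointUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("Accept", "application/rdf+xml")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body(); // raw RDF, to be parsed and normalised afterwards
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe("http://localhost:1234/sparql",
                "http://myexample.org/namespace:identifier"));
    }
}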

5.6 Namespace

The concept of a namespace encompasses both the representation of data and the origin of the datasets. Although its use in the implementation is limited to simple declarations that a feature, such as a part of a dataset, exists on the provider in the context of the queries it is defined to operate on, it could easily be extended to include statistics. This may provide for a generalised model where queries are ad hoc, and the parameters are directly defined by the scientist instead of in the context of each query type.

In the prototype, regular expression matching groups are used to identify the namespace sections in the scientist's query, and these identify a set of namespaces.

Figure 5.7: Context sensitive use of providers by prototype. The user query http://myexample.org/namespace:identifier is resolved by myexample.org using a common profile and by Scientist A using their own profile. Under the common profile, an untemplated query is resolved using the original data provider (HTTP GET from http://namespacefoundation.edu/cgi-bin/RDFendpoint.cgi?id=identifier), whose original data requires a normalisation rule changing http://namespacefoundation.edu/ to http://myexample.org/. Scientist A's profile excludes the original data provider and instead resolves the query, primarily using a SPARQL DESCRIBE query, against a curated, normalised, local data provider (HTTP POST to http://localhost:1234/sparql).

The matching groups are prefixes that are mapped to a set of URIs from the configuration settings for the prototype. The current model and implementation require that scientist inputs are given as a single string. The use of a single arbitrary string enables semantically similar queries to choose the parts of the string that are relevant without affecting other queries. For example, one query type may take the input and send it as part of an HTTP GET query to an endpoint without investigating the meaning of the scientist's query, while other query types may require knowledge about the meaning of the query and only partially use the information in a query.

The use of indirect namespace references, using identical prefixes which are attached to multiple arbitrary URIs, makes it possible to integrate prototype configurations in ways that would not be possible if each namespace was directly named with a single URI. It is necessary to integrate different sources in a distributed query model, as there is no single authority which is able to name each of the namespaces authoritatively. A hypothetical example of where different configurations can be integrated seamlessly using the prototype is shown in Figure 5.8; it would be difficult to integrate them otherwise, as that would require changes to one of the configurations to make its namespace URIs match the other's. In addition, the ability to define aliases for namespaces makes it necessary to rely on the prefix instead of the URI, as there is not a single relationship between a URI and a prefix. The scientist needs to identify providers and query types, rather than namespaces, in their contextual profiles as the basis for selecting which providers will actually be used to resolve the query, as it is ambiguous from their query which one should automatically be chosen.

5.7 Normalisation rule

A principal issue in the area of linked scientific datasets relates to the inability of scientists to query easily across different datasets, due to the inability of the relevant communities to agree on both the structure of a universal URI scheme and the mechanism by which those URIs will be resolved. This model neatly avoids these issues by reducing URIs to their components, in this case a namespace prefix and a private identifier. The principal information components, which have been used by scientists in the past to embed non-resolvable references into their documents, are used to create a series of normalisation steps for each dataset, assuming a starting point that matches the scientist's preferred URI structure. This standard URI is then transformed using the normalisation rules as necessary, given the context of each provider, so that queries successfully execute with the correct results. The results are then normalised so that scientists can understand their meaning without having to manually translate between URIs that are not relevant to their scientific questions. Normalisation rules can be defined and applied to providers to access datasets that follow different conventions to what a scientist expects. Normalisation rules are initially applied to query templates using template variables, as well as to the query at other stages.

Figure 5.8: Namespace overlap between configurations. The user query http://myexample.org/doi:identifier can be resolved using two configuration sources: Configuration A contains namespace alpha, with preferred prefix "doi" and original producer http://www.doi.org/, while Configuration B contains namespace beta, with preferred prefix "doid", alias "doi", and original producer http://mydigitalorganisationid.example.org/.

The prototype implements the model's requirement for ordered normalisation rules with a staged rule application model. In the model, the query and the results go through different stages where normalisation can occur. For each stage from Table 3.1, the normalisation methods that the prototype supports are listed in Table 5.1. At each stage, the normalisation rules are sorted based on a global ordering variable, with the sort direction depending on whether the stage was designed for de-normalisation or normalisation. The reversal of the order priority in the later normalisation stages made it possible to integrate the normalisation and de-normalisation rules for a single purpose into a single rule. This made it easier to manage the configuration information, as many rules would otherwise have required splitting in two.

Stage                              Method                        Order Priority
1. Query variables                 Regular Expressions           Low to High
2. Query template before parsing   Regular Expressions           Low to High
3. Query template after parsing    None implemented              Low to High
4. Results before parsing          Regular Expressions, XSLT     High to Low
5. Results after parsing           SPARQL CONSTRUCT              High to Low
6. Results after merging           SPARQL CONSTRUCT              High to Low
7. Results after serialisation     Regular Expressions           High to Low

Table 5.1: Normalisation rule methods in prototype

In the prototype, URIs that should match what appears in an endpoint take the form of "${endpointSpecificUri}". This template variable is derived by applying a set of Regular Expression normalisation rules to the standard URI template that is available in the query type, after substituting the relevant variables. The prototype uses this method as a demonstration of the usefulness of high level normalisation rules across a set of distributed linked datasets. The use of the endpoint specific URI template, and its normalised version, "${normalisedStandardUri}", enables queries to be created that are not possible in other models.

The prototype initially supports three different types of rules: Regular Expression string replacements; SPARQL CONSTRUCT queries for selection or deletion of RDF triples; and XSLT transformations to transform XML results documents into RDF statements so they can be integrated with other RDF statements. Other types of rules can be created by implementing a subclass of the base normalisation rule class and creating properties to be used in configurations. Each rule needs to be serialisable to a set of RDF triples to enable replication solely based on the provenance record.

Regular Expression rules contain equivalent normalisation and de-normalisation sub-rules. The de-normalisation part only applies to template variables in queries, whereas the normalisation stage applies to entire result sets, although it is not applied to static RDF insertions, as they provide the essential references between normalised and non-normalised data. The same template variables that are applied to query templates are also available in static RDF insertions. This is important because, although normalisation is a useful, and in most cases syntactic, step, it is also useful to highlight the diversity of URIs, where scientists may only have been aware of the normalised version previously and may not have correlated other URIs with their local version.

The static RDF templates that are attached to query types are not normalised along with the results, although they will be normalised after the results are merged into the overall pool, as there is no way of knowing where individual RDF triples came from at that point. For example, the variables used in a SPARQL query may be modified using a regular expression prior to being substituted into the templates inside the query, before the query is then processed and modified using SPARQL Basic Graph Pattern manipulation, although there is currently no standard rule syntax for this manipulation. Although the parsed template query normalisation stage is necessary in the model for completeness, it was not viewed as essential to the prototype in terms of data quality, context, or data trust. In each case, namespaces or profiles could be used to eliminate query types where the template would not be useful, rather than using normalisation rules to fix the query at this stage.

The SPARQL rules operate in one of two modes, either deleting or keeping matches. The major reason for this restriction is the lack of software support for executing the yet to be standardised SPARQL Update Language (SPARUL) DELETE queries (http://www.w3.org/TR/2010/WD-sparql11-update-20100126/), which would give scientists more flexibility in the way deletions and retentions operate.

Each of the different stages of a query is applicable to different types of rules. For instance, SPARQL transformations are not applicable to most queries, as they require parsed RDF triples to operate on. In comparison, Regular Expressions are not relevant to parsed RDF triples, so they are only relevant to queries, the unparsed results of queries, and the final results document. The RDF results from a query may be modified as textual documents before being imported into an internal RDF document, and they may be modified again after all of the results are pooled in a single RDF pool. In combination with the order attribute and the SPARQL insert or delete modes, normalisation rules can be implemented to perform any set of data transformations on a scientist's data. In the absence of the standardisation of the SPARQL Update Language (SPARUL), and the lack of implementation of the draft in the OpenRDF Sesame software used by the prototype (http://www.openrdf.org/), normalisation rules in the prototype are limited to regular expressions and matching SPARQL CONSTRUCT or RDF triple results. When SPARUL and other SPARQL 1.1 changes are standardised, including textual manipulation and URI creation, the prototype could be easily extended to support the use of these features for arbitrary normalisation.
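The following sketch shows what a regular expression normalisation rule with both halves might look like, in the spirit of the ontologyhashtocolon rule used on Bio2RDF; the class shape, field names and example URIs are assumptions rather than the prototype's actual API.

import java.util.regex.Pattern;

// Sketch: the input (de-normalisation) half is applied to query templates, the
// output (normalisation) half to unparsed results.
public final class RegexNormalisationRule {
    private final int order;                 // global ordering within a stage
    private final Pattern inputMatch;        // de-normalisation
    private final String inputReplace;
    private final Pattern outputMatch;       // normalisation
    private final String outputReplace;

    public RegexNormalisationRule(int order, String inputMatch, String inputReplace,
                                  String outputMatch, String outputReplace) {
        this.order = order;
        this.inputMatch = Pattern.compile(inputMatch);
        this.inputReplace = inputReplace;
        this.outputMatch = Pattern.compile(outputMatch);
        this.outputReplace = outputReplace;
    }

    public int getOrder() { return order; }

    public String denormaliseQuery(String queryTemplate) {
        return inputMatch.matcher(queryTemplate).replaceAll(inputReplace);
    }

    public String normaliseResults(String unparsedResults) {
        return outputMatch.matcher(unparsedResults).replaceAll(outputReplace);
    }

    public static void main(String[] args) {
        // Illustrative rule: convert colon-form ontology term URIs to the hash form an
        // endpoint expects, and back again when the results are returned.
        RegexNormalisationRule rule = new RegexNormalisationRule(100,
                "http://bio2rdf\\.org/ns:go", "http://bio2rdf.org/ns/go#",
                "http://bio2rdf\\.org/ns/go#", "http://bio2rdf.org/ns:go");
        System.out.println(rule.denormaliseQuery("<http://bio2rdf.org/ns:go> ..."));
        System.out.println(rule.normaliseResults("<http://bio2rdf.org/ns/go#> ..."));
    }
}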

5.7.1 Integration with other Linked Data

It is a challenge to integrate and use Linked Data from various sources. Apart from operational issues, such as sites not responding or empty results due to silent SPARQL query failures, there are issues that scientists can consistently handle using the prototype, including URI conventions and known mistakes in individual datasets in the context of particular query types.

The prototype can be configured to use different URI conventions, including those matching the Bio2RDF normalised URI specification [24], and other proposals such as the Shared Names proposal, although Shared Names is not yet finalised. The prototype implements HTTP URI patterns that are not handled or seen by the core model implementation. For example, the choice of file format for the results is made by prefixing the query with "/fileformat/". In comparison, other systems append the file format to the URI string, but this prevents their models from handling cases where the actual query ends in the extension, for example ".rdf", as there may be no way to distinguish the suffix from an actual query for the literal string ".rdf". The model is designed to provide access to data using an arbitrary range of queries, whereas the majority of Linked Data sites solely provide access to single records using a single URI structure. If all of the identifiers in a dataset are numeric, for example, a real query would never end in ".rdf", so there is no confusion about the meaning of the query.

In some cases it may be necessary to integrate normalisation rules created using different normalised URI schemes. The ordered normalisation rule implementation allows for this easily, by creating a new normalisation rule with an order above any of the orders that the included normalisation rules use. This normalisation rule would then consistently convert the results of all of the imported normalisation rules into the locally accepted normalisation standard.

It would be simpler for single configuration maintainers if every dataset were assigned a single set of normalisation rules to correct queries. However, this would prevent scientists from integrating their provenance and configurations with other scientists without manually combining the rules that were relevant to each dataset. The prototype enables scientists to use other prototypes' configuration documents and provenance documents as configuration sources. This process is automatic as long as there are no URI collisions, where inconsistent definitions for query types and other items are provided by different sources. However, this should remain a social issue, as scientists should use URIs that are under their control if they expect their work to be consistent with that of other users. In cases where scientists wish to modify definitions for any model items, they should change the subject URI for the triples, and change any dependent items by changing the reference along with the subject URI for the dependent item. The original URIs then need to be added to the scientist's profile to ignore them and use the new items.

The migration method using new URIs and profiles enables scientists to attach their own data quality and trust rules to any providers. However, if they only wish to change data in the context of a particular query type on a particular provider, they would need to create at least one new provider to include the new normalisation rule. If the query type was still used, they could not ignore it completely without redefining all of its uses. In this case, the original provider would need to be ignored using the profile and a copy made that did not contain the query type.

The model allows different sets of normalisation rules to be used with the same query type on the same namespace on a single dataset.
This situation requires two providers with the same query types and namespaces assigned to them, with only the normalisation rules differing. The Bio2RDF prototype used this pattern in cases where datasets contained inconsistent triples and it was not known which normalisation rules would be necessary for a particular query on a particular namespace beforehand.

5.7.2 Normalisation rule testing

The prototype included tests that specify the example inputs and expected outputs for normalisation rules. The tests include the requirement that normalisation rules with both input and output (i.e., de-normalisation and normalisation) components must be able to transform the results of the input (de-normalisation) rules back into a normalised version using the output rules. However, if a normalisation rule contains only an input rule or only an output rule, the tests are performed only on the existing half of the rule. Rule testing makes it possible for scientists both to validate the way the rule works on their examples, and to demonstrate to other users what the rule is intended to do if that is not clear from the description and the rule itself. This independent validation is important for convincing other scientists that rules are useful for data cleaning, as there is no single authority to curate rules. Although the model is completely decentralised by design, to allow for context sensitive changes, authorities could produce and curate rules, and scientists could use all or some of the rules based on their profiles.
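The round-trip property checked by these tests can be sketched as follows; the regexes, URIs and plain assertion are illustrative stand-ins for the prototype's actual test harness.

import java.util.regex.Pattern;

// Sketch: for a rule with both halves, applying the input (de-normalisation) half
// to a normalised example and then the output (normalisation) half must give back
// the original.
public final class NormalisationRuleRoundTripTest {
    public static void main(String[] args) {
        Pattern inputMatch = Pattern.compile("http://bio2rdf\\.org/ns:go");
        String inputReplace = "http://bio2rdf.org/ns/go#";
        Pattern outputMatch = Pattern.compile("http://bio2rdf\\.org/ns/go#");
        String outputReplace = "http://bio2rdf.org/ns:go";

        String normalisedExample = "<http://bio2rdf.org/ns:go> rdfs:label \"example\" .";
        String endpointForm = inputMatch.matcher(normalisedExample).replaceAll(inputReplace);
        String roundTripped = outputMatch.matcher(endpointForm).replaceAll(outputReplace);

        if (!roundTripped.equals(normalisedExample)) {
            throw new AssertionError("Round trip failed: " + roundTripped);
        }
        System.out.println("Round trip succeeded: " + roundTripped);
    }
}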

5.8 Profiles

In the prototype, configuration information is serialisable to RDF, which can then be combined with configuration information from other sites to form a large pool of possible datasets and the associated infrastructure. This pool of information can be organised using profiles, so that scientists can exclude items that are not useful to them, and optionally replace the functionality by including an alternative, perhaps a trusted dataset that has a known provenance and is hosted locally.

The prototype was designed as a layered system, so that it can explicitly state which profiles are more relevant than others. The configuration files specify a global ordering number for each profile, and higher numbers take preference over lower numbers, so scientists can generally find a simple way to override all other profiles with their own. The profiles that will be used in a particular situation can be specified by the operator of the prototype system without reference to the order, which is obtained from each of the profiles. The global ordering makes sense because some profiles will contain instructions to always include anything that has not been restricted so far, and these profiles should be consistently visible so that other users can generate the same results when using the profile in their own set of profiles.
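A hypothetical sketch of how an ordered set of profiles could decide whether a provider is used follows: profiles are consulted from the highest order number down, and the first profile with an explicit opinion wins, with inclusion as the fallback. The class shape and resolution semantics are assumptions; only the "higher numbers take preference" behaviour comes from the text.

import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Sketch of ordered profile resolution for a single provider URI.
public final class ProfileResolution {
    record Profile(String uri, int order, Set<String> includedProviders, Set<String> excludedProviders) {}

    static boolean isProviderUsed(String providerUri, List<Profile> profiles) {
        List<Profile> byPreference = profiles.stream()
                .sorted(Comparator.comparingInt(Profile::order).reversed())
                .toList();
        for (Profile profile : byPreference) {
            if (profile.excludedProviders().contains(providerUri)) return false;
            if (profile.includedProviders().contains(providerUri)) return true;
        }
        return true; // no profile had an opinion: include by default
    }

    public static void main(String[] args) {
        Profile common = new Profile("http://example.org/profile:common", 100,
                Set.of("http://bio2rdf.org/provider:mirroredobo"), Set.of());
        Profile local = new Profile("http://example.org/profile:scientistA", 200,
                Set.of(), Set.of("http://bio2rdf.org/provider:mirroredobo"));
        // The local profile has the higher order, so its exclusion wins.
        System.out.println(isProviderUsed("http://bio2rdf.org/provider:mirroredobo",
                List.of(common, local))); // false
    }
}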

The overall implementation allows scientists to source statements from many different locations. Each of these locations should be trusted to safely execute the system. Untrusted configurations may include queries that are so complex that they time out, or that accidentally cause denial of service attacks on data providers. Ideally, scientists should review queries before they are executed, although they may not be able to recognise where issues will occur without experimenting with the different ways the profiles work. An example of a situation where the profiles may be accidentally misused is where there are multiple mirrors for a website, such as Bio2RDF, and the mirrors could hypothetically rely on each other for resolution of some queries. This could easily lead to circular dependencies, which the system could not detect without analysing the query plan for each query before executing it. The profiles are designed so that these potential dependencies can be removed by scientists.

The ordered profile system is useful in a range of scenarios. Its simplicity makes it valuable in many situations, as reasoning about the order of the profiles only requires knowledge of the desired effects and of the effects of each profile individually. This makes it possible for scientists who know the order in which the profiles are given to consistently determine which query types, providers, and normalisation rules will be used, without having to ask other users for information about how their prototype was set up internally. Although RDF provides a native facility for ordered lists, the implementation method makes it impossible to extend a list dynamically, as the last element of the list explicitly references a null value. It is important that future scientists be able to easily extend a list without having to modify the document containing the original definition. They can do this using numerically ordered profiles and normalisation rules. In addition, explicit ordering provides a degree of control over which processes can be performed in parallel, as items with the same order should not interfere with each other.

Chapter 6

Discussion

6.1 Overview

This chapter presents a discussion of the issues which were relevant to the design of the data access model described in Chapter 3, the use of the model by scientists as described in Chapter 4, and the implementation of the prototype web application described in Chapter 5. The model is designed to provide query access to multiple linked scientific datasets that are not necessarily cleanly or accurately linked. The model provides a set of concepts that enable scientists to normalise data quality, enforce data trust preferences, and provide a simple way of replicating their queries using provenance information. The prototype is a web application that implements the model to provide a way for scientists to get simple access to queries across linked datasets. This discussion includes an analysis of the advantages and disadvantages of the model and prototype with respect to the current problems that scientists face when querying these types of datasets.

Simple methods of querying across linked scientific datasets are important for scientists, particularly as the number of published datasets grows due to the use of the internet to efficiently distribute large amounts of data to different physical locations [67]. The model was evaluated on the degree to which it supports the needs of scientists to access multiple scientific datasets as part of their research. The model provided useful features that were not possible in other similar data access models, including abstract mapping from queries to query types and normalisation rules, using namespaces where necessary. This enables the model to substitute different, semantically equivalent queries in different contexts without requiring changes to the scientist's high level processing methods. Scientists can use the model to transform data based on their knowledge about data quality issues across the range of datasets they require for their research, without requiring each query to determine which data quality issues are relevant to its operation; this results in the ability to perform multiple semantically unique queries on a provider, with each query requiring different data quality modifications. The model allows scientists to trust that the results of their query were derived from a trusted location using an appropriate query and normalisation standards.

129 130 Chapter 6. Discussion level provenance for their queries, including the exact queries that were performed on each data provider, and they are able to interpret this information in a simple way due to the object and reference based design of the model. The prototype implementation was created to examine the practical benefits of the model in current scientific research. As part of this case studies were examined, including a case integrating science and medicine, in Section 4.4. An evaluation of the usefulness of the prototype together with workflow management systems in Section 4.7. These case studies, together with experience using and configuring the prototype revealed a number of advantages and disadvantages relating to the prototype and the initial model designs. The prototype enables scientists to easily construct alternative views of scientific data, and it enables them to publish these views using HTTP URIs that other scien- tists can use if they have access to the server the prototype is running on. The prototype was successful in acting as a central point to access numerous datasets, using typical Linked Data URIs for record access and atypical Linked Data URIs as ways to perform efficient queries on datasets. The prototype was found to be a useful way of distribut- ing configuration information, in both bulk form and in a condensed form as part of provenance records. This information was then able to be customised using a list of acceptable profiles that were statically configured for each instance of the prototype, making it possible for users to reliably trust information, as long as they already trusted the sites they obtained the overall configurations from. The prototype did not attempt to implement as access control system, so scientists would need to physically restrict access, or implement a proxy system that provided access control to each instance of the prototype. In terms of scientists using the system, it is important that data is able to be re- liably transferred between locations on demand. This process may be disrupted by issues such as network instability, query latency, and disruption to the data providers due to maintenance or equipment failure, while there are inherent limitations includ- ing bandwidth and download limits that affect both clients and data providers. The model was designed to provide abstractions to both data providers and query types, to provide redundancy across both types of queries and data locations. The prototype was implemented with a customisable algorithm to decide when a data provider had failed enough times to warrant ignoring it and focusing on other redundant methods of resolving the query. The redundancy in the model, together with users specifying profile preferences, made it possible to reduce latency on queries by choosing physically close data providers and optimising the amount of information that needed to be transferred.

6.2 Model

The model is evaluated based on the degree to which it fits the research questions that were outlined in Chapter 1, specifically context sensitivity, data quality, data trust, and provenance. The evaluation focuses on the advantages of the model design decisions in reference to the examples given in Chapter 1, with specific reference to the way the model can be used to replicate queries in different contexts.

6.2.1 Context sensitivity

Scientists work in many different contexts during their careers. These contexts include different universities and research groups at a high level, and many different data processing methods on different datasets. Some of these datasets may be sensitive, and the experiments may be restricted from public release, at least temporarily. Scientists use linked datasets in many different facets of their research, making it necessary to have a simple method of adjusting the way the data access occurs to match their current context.

Scientists require that their experiments can be replicated in the future, with the expectation that the results will not be materially different from the original results. The model includes the ability to easily switch or integrate new data providers or queries, along with the relevant normalisation rules. This means that even if the original data provider is unavailable, the overall experiment could still be replicated if the scientist can find an alternative method of retrieving the results. This is novel, compared to alternatives such as workflows, because it recognises the difference between the data location, queries on that data, and any changes that are required. Scientists can use this flexibility in the model to substitute components individually, according to their context, without requiring changes to the high level processing methods. The recognition of the context that data is accessed in, without requiring the use of a particular method, means that scientists can share their experimental workflows, and other scientists can replicate the results using their own infrastructure if necessary, with changes to the configuration of their prototype rather than workflow changes.

Scientific datasets may develop independently, making it difficult to integrate datasets, even if they are based on the same domain. For example, a scientist may wish to integrate two drug related datasets, DailyMed and DrugBank, but they may be unable to do so automatically because they do not contain the same data fields, although there are links between records in the two datasets. If a scientist requires the data from another set of results to be represented differently to fit the way their processes expect, the model allows the scientist to modify the query provenance without affecting the way the original set of results was derived. The scientist can use normalisation rules to avoid having to contact the original data provider to modify their results. They would do this using a data provider that accesses the original set of results and modifies the data using rules, without performing any further queries. In the DailyMed/DrugBank example, the scientist would find the common data fields based on their opinion, and normalise data from one or both datasets using normalisation rules.

There are a number of common scenarios in which the model is useful for transitioning between different contexts and between different datasets, as shown in Table 6.1. The table shows the changes necessary to keep the status quo, so scientists using the model for data access do not need to modify their processing systems where the model can be modified instead. In each case, a scientist would have to ensure that their profiles included the new providers, query types and normalisation rules, and excluded outdated providers, query types and normalisation rules.
An important thing to note with respect to these scenarios is that it may never be necessary to produce a new query type in reaction to a change in a current dataset, as query types are designed to be generic, where possible, to keep a stable interface between scientists and a large number of constantly updated datasets.

Scenario                 | Definite changes                                   | Possible changes
Data provider shuts down | New provider for replacement                       | New query types and normalisation rules
Data structures change   | New normalisation rules and add rules to providers | New query types
New scientific theory    | New normalisation rules and add rules to providers | New providers and query types
New dataset              | New providers                                      | New normalisation rules and query types
Data identifiers change  | New normalisation rules                            | New providers

Table 6.1: Context change scenarios
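As a hedged illustration of the first scenario in Table 6.1, the sketch below shows the kind of profile change involved when a data provider shuts down; the profile structure, field names, and URIs are hypothetical and are not taken from the actual Bio2RDF configuration:

    # Hypothetical profile fragment for the "data provider shuts down" scenario:
    # the replacement provider is a definite change, while new query types and
    # normalisation rules are only possible changes, so they start out empty.
    profile = {
        "identifier": "http://example.org/profile/my-lab-2011",
        "include_providers": ["http://example.org/provider/uniprot-mirror-b"],
        "exclude_providers": ["http://example.org/provider/uniprot-mirror-a"],
        "include_query_types": [],          # possible change
        "include_normalisation_rules": [],  # possible change
    }

    def accepts_provider(profile, provider_uri):
        """True/False for explicitly decided providers, None when undecided."""
        if provider_uri in profile["exclude_providers"]:
            return False
        if provider_uri in profile["include_providers"]:
            return True
        return None

    print(accepts_provider(profile, "http://example.org/provider/uniprot-mirror-a"))
    # -> False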

The ability to group query types and providers based on a contextual semantic understanding of functionality makes it possible for a scientist to easily find alternatives if they find their context makes it difficult to use a particular query type or provider. For example, a scientist may decide that the Bio2RDF published version of the UniProt dataset is easier to use than the official UniProt dataset, and they could link together the query types, providers and normalisation rules relevant to Bio2RDF and publish them together using a profile. Although this ability is useful, the model does not require scientists to specify why they did not find a particular query type or provider semantically useful. In each case, the scientist would make a decision about whether they want to use each item. Implementations of the model could include a range of inclusion metrics, such as those found in Mylonas et al. [101]. In contrast to Mylonas et al. [101] and other similar models, this model does not require that scientists and/or data providers adhere to a single global ontology. It does not even require them to specifically choose a single ontology for all of their research, as they would need to do in these systems just to access data from a range of distributed datasets. If different scientists interpret the same query in different ways, the provenance records will highlight the differences to other scientists, and other scientists may choose providers based on these records for some of their own experiments.

The model is designed based on the assumption that the data within a namespace that can be found using a particular identifier will be consistent over time, although this may not be the case in many datasets. Although the model makes it possible for scientists to modify and change queries and data providers transparently to overcome changes, it does not make it simple for scientists to determine which queries and providers are relevant at the current time, compared to in the past. It is assumed that scientists will keep track of the provenance of previous experiments if they want to be guaranteed that the model will use the same sources in resolving their query, and scientists need to manually verify that the dataset was not modified.

The issue of temporal reasoning, where scientists simultaneously specify that different data providers are relevant at different points in time, could be included in the profiles system through the use of time constrained profiles, but it would also need to be included in the data providers if they include multiple versions of items. If datasets change identifiers for a data record, or delete data describing a record, and the change could not be reverted using normalisation rules, scientists would need to modify their processes to replicate their past experiments, something which is not ideal, but necessary in some circumstances. For example, if a scientist discovered that a gene was not real, or that it was a duplicate of another gene, a genome dataset may choose to remove all of the information from the record describing the gene and replace references to the gene with references to the other gene. This would disrupt any current experiments that relied on the modified or deleted gene record containing meaningful information, and the scientists' workflows would need to be modified to allow for genes that have been replaced by references to new genes.
Although such a change would not reflect a change in the real world, the fact that scientists changed their opinion is important to the way the system functions. Scientists use labels for their concepts, and a label needs to be used consistently by a scientist in order for their work to be as successful as possible.

In comparison to other models, this means that scientists are able to have independent opinions about the meaning of different pieces of scientific data. This is never possible if scientists are required to adhere to a single global ontology, as there is no way to recognise additions or deletions in experiments which are designed to produce novel results. There may be benefits in well developed disciplines for scientists to adopt a single set of properties, even if the objects or classes were still variable. For example, in the field of medicine, health care providers may adopt a single standard to ensure that there is no confusion between documents produced by, and distributed between, different doctors. However, doctors participating in medical trials may need to forgo the benefits of this standardised data interpretation to test new theories using unconventional properties.

6.2.2 Data quality

Scientists need to reformat data to eliminate syntactic data quality mistakes, such as malformed properties or schemas, before queries across datasets will be consistent. This is a major issue for scientists, as these issues must be fixed before they can use higher level, semantic data quality procedures to verify that the information content is consistent. The model is designed to be useful before data quality issues have been resolved, along with being useful when the syntactic and semantic issues are resolved using permanent dataset changes or normalisation rules.

Many syntactic data quality issues require human interaction to both verify that there is an issue, and to decide what the easiest way to fix the issue is. Although this process is best done on a copy of the dataset that the scientist can modify, any modifications that the scientist can perform as part of a query can be accommodated using the model with normalisation rules on queries and results. The model makes it possible for scientists to then publish the rules along with their queries, so that other scientists can verify the results independently, omitting the normalisation rules that do not suit them using profiles as necessary, without having to redesign the query. If a scientist does need to republish the entire dataset locally, a custom profile can be used to isolate provider references to the original dataset and replace them with references to the cleaned dataset.

It is difficult to use automated mechanisms to diagnose semantic issues on distributed linked datasets, as the process may require information from multiple geographically different locations, which are not efficient to retrieve if the relevant parts of the datasets are highly interlinked. The model succeeds in cases where the necessary information can be retrieved to enable the model to semantically clean data locally, or where queries can be constructed in ways that allow scientists to clean the data in the query. In general, the model relies on being able to construct queries that match the semantic meaning contained in the data provider, with the normalisation occurring on the results. This is an issue for any distributed query methodology, as it is based around a fundamental distinction between locally available data that can be processed using efficient methods, including in memory and on disk, and distributed data that needs to be physically transported between locations before efficient localised methods can be used to verify that the data is semantically valid. If scientists require semantically valid trusted data that is not efficient to verify in a distributed scenario, they can set up a local data provider and co-locate the datasets to provide semantic data quality using efficient local processes.

In comparison to federated SPARQL systems, the model is able to be used to diagnose and fix semantic issues without requiring data to be provided using SPARQL in every location. The use of provider specific syntax rules to fix syntactic issues after a query is executed provides a more stable basis for semantic reasoning, as providers are free to change and migrate to new standards while queries that were applicable to past versions of the dataset may still be useful after syntactic normalisation. The model can be used to impose current standards on datasets regardless of the decision or timing of dataset publishers.
For example, the query model was used throughout this project to impose the current Bio2RDF standards on each of the datasets, even though the Bio2RDF project itself did not always have the resources to physically change the datasets each time the best practices in the community changed.

The model may be used to easily link together data of differing qualities, potentially lowering the quality of the more useful data. This factor is inevitable in a model that allows users to merge information from different sources. Although it may be useful to enforce a single schema on the results, this would only enforce syntactic data quality, as shown in Figure 6.1. In the example, the syntax normalisation does not result in more useful data, and may even make it more difficult for scientists to process the data if they assume that a disease reference will be a link, rather than just a textual label. The use of a single data format, without relying on a single schema, makes it possible for scientists to include their private annotations, which may not have been possible in other scenarios. If the user then republished their annotations along with the original document, the quality of the combined document may be lower than the original data. This is a feature of the model, but the resulting issues must be dealt with at a higher level, through the creation of methods that can be used to segregate the data based on its origin and the desired query, such as RDF quads for instance.

Many systems assume that the data quality issue can be solved in a context-independent way, relying on a single domain ontology to normalise all documents from every source to a common representation. However, the fact that an ontology is logically relevant to everything does not imply that it is syntactically relevant to everything. A number of limited scope systems, such as the BIRN network and the SADI network, have succeeded in integrating data from a range of biological sources, but they could not easily integrate annotations, or internal scientific studies that violated one of the core ontological principles of the system. The model is designed to provide a neutral platform on which scientists can pick and choose their desired sources using providers, and their desired ontologies, based on the results of normalisation rules. The cost of this neutral platform is that queries must be specified in terms of the goals, and not the query syntax, so the system cannot decompose complex queries into sets of sub-queries without users having first defined templates for the sub-queries, with links from the appropriate providers to the templates. However, the neutral platform, particularly including the representation of normalisation rules in textual formats, provides simple ways for any scientist to choose their personally relevant set of queries, providers and rules. They can group these sets using profiles, and other users can include or omit the sets based on the identifier that was given to the profile.

Workflow management systems are designed to allow users to filter and clean data at every stage. The filtering and cleaning stages, however, are typically embedded directly in the workflow, making it difficult to distinguish between data processing steps and cleaning steps. This is important if the cleaning steps are rendered unnecessary in a future context, but the data processing steps are still necessary to execute the workflow.
In addition, the locations of the datasets are embedded in workflows, so if users independently clean and verify a local copy of a dataset, the workflow cannot easily be modified to both ignore the cleaning steps and fetch data from the local copy. The model is designed with this goal in mind, so it natively supports omission and addition of both cleaning steps (normalisation rules) and dataset locations (providers). In the workflow integration described in Section 4.6, the prototype was used to produce workflows that focused on the semantic filtering process. The query provenance for each query was fetched when necessary by prefixing the query URI with "/queryplan/".
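A sketch of the kind of provider-scoped syntax normalisation rule discussed in this section is shown below; the rule format, the DrugBank URI pattern, and the record identifier are illustrative assumptions rather than the prototype's actual rule vocabulary:

    import re

    # Hypothetical rule attached to one provider: rewrite the provider's native
    # record URIs into normalised Bio2RDF-style URIs in the serialised results.
    rule = {
        "match": re.compile(r"http://www\.drugbank\.ca/drugs/(DB\d+)"),
        "replace": r"http://bio2rdf.org/drugbank:\1",
    }

    def normalise_results(serialised_rdf, rules):
        """Apply each of the provider's rules to its serialised query results."""
        for current in rules:
            serialised_rdf = current["match"].sub(current["replace"], serialised_rdf)
        return serialised_rdf

    results = ('<http://www.drugbank.ca/drugs/DB00000> '
               '<http://www.w3.org/2000/01/rdf-schema#label> "Example drug" .')
    print(normalise_results(results, [rule]))
    # -> <http://bio2rdf.org/drugbank:DB00000> ... "Example drug" .

Because the rule is attached to the provider rather than embedded in the query, another scientist can omit it with a profile, or replace it with their own rule, without redesigning the query itself.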

[Figure 6.1: Enforcing syntactic data quality. Data access parameters: query type "Find references", data item namespace diseasome_diseases, data item identifier 1689; the query searches DrugBank and DailyMed for any references to the disease. Dataset results: the DrugBank record (drug name "Marplan (Isocarboxazid)") contains the disease reference Diseasome/1689, while the DailyMed record (generic drug name "Isocarboxazid") is linked to the disease only through the text "Brunner syndrome". Syntax normalised results: DrugBank label "Marplan (Isocarboxazid)" with the disease reference diseasome_diseases:1689; DailyMed label "Isocarboxazid" with the disease reference given only as the text "Brunner syndrome".]

6.2.3 Data trust

The model is useful for restricting queries to trusted sources of data. It allows scientists to restrict their experiments to a set of data providers, along with the query types and normalisation rules that they find useful. The trust mechanism is novel in the context of restricting queries on linked datasets based on scientific trust. This enables scientists to predefine trusted sources, and then perform queries that are known to be contextually trustworthy, as profiles are limited in their scope. The trust factors include the ability to distinguish between trusted items and unreviewed, implicitly useful items, with both being available depending on the overall level of trust that the scientist places in the system. It is not practical or useful for a scientist to apply trust factors to every dataset. However, at the query level, in the context of particular datasets, a scientist is able to eliminate untrusted items, allow explicitly trusted items, and make it known to other scientists that other items are as yet unreviewed, but, in their opinion, another scientist could either use or not use them, depending on the context of the research. The range of undecided datasets may include datasets that are not linked to the scientist's discipline expertise, or that are not curated to an overall level that is satisfactory to the scientist, or they could simply include datasets that were not useful for the experiments that were currently being focused on.

The profiles are useful for temporary exploration, and are necessary if scientists are to easily apply widely used query types to particular data providers without redefining the query types or providers. This may be common in an open situation where other scientists may have knowledge of more query types or data providers that are not reviewed according to the profile, but can be included or excluded based on the overall criteria that the other scientist applies. This aspect of data trust is reflected in the way the provenance is applied in future situations, as scientists need to decide what preference they are going to apply to the extra data providers and query types that they know about, but do not find in a provenance record. Scientific experiments may be able to be replicated using different query types; for instance, the original scientist may have used SQL queries as compared to the SPARQL queries that are currently available, although both queries have the same semantic meaning, and can therefore be substituted.

If scientists were interested in determining trust in a social context, they could digitally sign and share their profiles, along with resolvable Linked Data URIs to indicate where to find the relevant data providers, query types and normalisation rules. The model and prototype were not evaluated at this level, as the model was designed to make it possible for scientists to define and share their personal trust preferences, rather than to decide at the community level whether something was trusted. The trust algorithm that was used for the prototype could be substituted with a community based algorithm that used numeric thresholds to represent the complex relationship between trust and the community preferences. In the trust system that was implemented for the prototype, the scientist only needed to define whether they wanted to use the item, rather than whether others should want to use the item. For instance, a review site such as that discussed by Heath and Motta [66] would make the model more flexible in social terms, but it would not change the fundamental methods used by the model to choose or ignore data providers, query types and normalisation rules based on the overall context of the current user.

The model may be extended easily to include a system of numeric thresholds, such as those used by many distributed trust and review models. The model relies on a verdict of either yes, maybe, or no, for each item on each profile, and maybe decisions are passed down through a set of layered profiles, based on the scientist's trust, until a yes or no is returned, or all of the profiles fail to match. A numeric threshold would require the scientist to arbitrarily assign an acceptable value to the maybe verdicts for each profile. The threshold could be defined as an average over a group of scientists, although there are debates about whether the wisdom of crowds [138] is more useful for trust purposes than the wisdom of known friends [131]. For example, the rating of known scientists may be more useful than the rating of a general crowd of scientists. The use of numeric thresholds would still follow the yes, maybe, and no system for each profile, in order for the system to allow for a complex definition of context sensitivity that could be directly replicated by others in the same way as the current model.

The profiles component of the model does not distinguish between model elements that are untrusted, those that are excluded because a scientist has a more efficient way to access the data provider, those where the implementation of the query type is thought to be incorrect, or those where the normalisation rule is not deemed to be necessary in a particular context, although it may be trusted in other contexts. The limited semantic information given by the profile is suitable for the model to consistently interpret the scientist's intentions, but it does not make the exact intentions clear to other scientists. An implementation of the model may provide different definitions for the semantic reasons behind a particular inclusion or exclusion, in order for other scientists to more accurately reason about the information. The essential steps required to resolve queries using the model would, however, be unchanged, as scientists would still need to define exact criteria in the implementation regarding which parameters specified inclusion and exclusion, and what to default to if the set of parameters did not match these sets.

In the model, the statements from different data providers are merged into a pool before returning the complete set of results to the scientist. It is not possible to distinguish between statements from different locations using the current RDF specification without using reification or RDF Quads. Reification is not practical in any large scale scenario, as it turns each triple into a set of four triples. In contrast, RDF quads implementations enlarge the size of the results by only one extra URI per triple, adding a graph URI to the current model and using that URI to denote the location of the data provider for the statement. There is, however, limited support for RDF quads in client applications, as the concept and the available formats (NQuads, TRiG, TRiX) have not yet been standardised. The prototype supports data available in quads, with the graph URIs as the URIs of providers that were used to derive each of the triples in the graph.
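The yes, maybe, and no layering described above can be sketched as follows; the function name, profile contents, and default behaviour are simplifications for illustration rather than the prototype's exact implementation:

    # Each profile returns "yes", "no", or "maybe" for an item (a provider,
    # query type, or normalisation rule). Profiles are consulted in the
    # scientist's order of preference; "maybe" defers to the next profile,
    # and a configurable default applies when no profile decides.
    def include_item(item_uri, profiles, default_include=True):
        for profile in profiles:
            verdict = profile.get(item_uri, "maybe")
            if verdict == "yes":
                return True
            if verdict == "no":
                return False
            # "maybe": fall through to the next, less preferred profile
        return default_include

    personal = {"http://example.org/provider/public-uniprot": "no"}
    group = {"http://example.org/provider/public-uniprot": "yes",
             "http://example.org/provider/local-uniprot": "yes"}

    # The personal profile is consulted first, so the public provider stays
    # excluded even though the group profile would have accepted it.
    print(include_item("http://example.org/provider/public-uniprot",
                       [personal, group]))
    # -> False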

6.2.4 Provenance

The model is designed to consistently distribute semantically similar queries across multiple datasets in a transparent manner, so the initial form, location and quality of the data does not need to be encoded into data processing scripts and tools. In doing so, it promotes the reuse and replication of the results in the future, by being able to publish the query as part of a process provenance record that contains information about the context of the query and what information is needed to replicate the query. The model can be used to track the provenance of the data, but that would require extra annotations on providers. This was not feasible, given that most datasets in science only contain simple data provenance annotations, such as the dates when records were last updated, and these dates are not useful for scientists in terms of allowing them to exactly replicate their experiments. In scientific workflow models that have previously explored the area of provenance, the location of the data, and the filters and normalisation rules applied to the data, are not separate parts of the model.

An advantage of the way the provenance information is derived from the model is the ability to locate the original data without requiring the use of data normalisation rules. This enables scientists to work with a data model that other scientists may be more familiar with in some cases, compared to their preferred normalised version. This is useful for cases where scientists require the original data to be unchanged, or the data is accessed in a different way to match their current collaboration. In the model, data normalisation is not a deciding factor in whether to use a provider, unlike systems that use single ontologies to distribute data, where the deciding factor is primarily whether the scientist's query matches the expected data normalisation rules. In a similar way, the normalised data can be useful if it contains references, such as HTTP URIs, that can be resolved by other scientists to get access to information that is used as evidence in publications. The normalised information may be cumbersome to reproduce without the standard model of specifying what queries were used on which datasources and which changes were made to the information. Scientists can publish links to the underlying query plan, and peers can access this information and process it in a similar way to a workflow, but with the benefit of knowing where they can substitute their own datasources and which normalisation elements were used, in order to review the work as part of the peer review process.

The design choice of separating the providers of information from the queries that scientists may want to pose on arbitrary endpoints improves the way that other scientists can understand the methodology that was used to derive the answer. This is an important step in comparison to current workflow tools, which rely on scientists to guess which parts of a workflow are related to data access, and which parts are related to data cleaning and format transformation. It allows scientists to choose which data providers are used for queries, and allows them to change the mix of providers without interfering with the methods used to integrate the information with processing tools.

In terms of provenance, the model allows scientists to completely recreate their queries, assuming that the data is still accessible at some location.
It does not specify the meaning of each query, as this is not relevant to the data access layer, and it does not specify the provenance of individual data items inside of data providers, as that is the responsibility of the dataset. In particular, it requires scientists to understand the way that the query parameters are linked to the overall question before the provenance can be interpreted, as this requires community negotiation to define how the parameters match the scientific question.

Other provenance systems focus on understanding the way that the linked datasets are related and what levels of trust can be placed in each data item, as opposed to focusing on what is required to reproduce the results of each query [149]. The provenance of the results does depend on this information, but it needs scientists to interpret the information that is available from the model, including any annotations on data providers, and any provenance information that is derived as part of the query. The inclusion of data item provenance in the model was restricted by the requirement that the model be simple to create and understand. If the model included annotations about data items in the overall configuration, it would not be maintainable or distributable. It is the responsibility of the scientist to store the provenance information describing their queries, using the information that is generated by the model. This is consistent with the design of the model as a way for scientists to define the way their queries are distributed across their trusted datasets.

The final complete configuration of 27,487 RDF statements for the Bio2RDF website was serialised in RDF/XML format for a file size of 3,468 kilobytes. It would be relatively efficient to store provenance records if redundant triples from the configuration that are duplicated between provenance records are not all stored multiple times. The provenance record also stays small because each of the elements, such as query types and providers, are linked using URIs, and it is easy to determine whether the information about the query type has already been included in the provenance record to avoid adding it multiple times. In addition, the RDF model does not recognise multiple copies of a statement as being distinct, so the number of statements that are included in the results can be optimised automatically.

In comparison with workflow provenance models, which attempt to exactly replicate the data processing actions that were included using the same data access interface in the future, the model provides the ability for scientists to merge and change the data processing actions using substitution, based on the user query parameter matching functions. The parameter matching functions allow for multiple query types to be used transparently, in comparison to workflows and SPARQL based projects such as SADI, which provide replicability by relying on a single input data structure producing a single output data structure using a single data location for each part of the processing cycle. The model makes it possible to integrate heterogeneous services at the data access level without having to change the processing steps, as is necessary in current models.

A provenance model for scientific data processing needs to allow scientists in the future to replicate and contrast results using independent data.
Although RDF provides the basis for this replication, current data query models that are based on SPARQL do not support independent replication without human intervention. The extension of these data access models to remove direct references to data locations and properties used in the inputs and results makes it possible for independent replication using configuration changes. The emphasis of the provenance literature on replication in the future ignores the evidence that data providers will always eventually cease to exist or become inaccessible. There are tools, such as WSsDAT [89], which are designed based on evidence that communications between data providers are variable in the current time, and may not last beyond the lifetime of the scientific grant related to the study. Although theoretically SPARQL based query models could generate the same results from different locations, in practice they rely on either hard coding this information into the data provider's service definition (see the SPARQL 1.1 Service Description draft specification1), or hard coding this information into the middleware or client's data processing code, as in SADI. Neither of these solutions is as flexible as a model that relies on separating the translation and data normalisation steps from the definitions of queries. The model proposed in this thesis allows scientists to replicate their data processing in the future without having to rely on particular data providers, and without facing the daunting task of changing their processing code to support alternative data providers for replication using independent data sources, as they can declaratively exchange a past provider for a current one with a change to the profile in use.

1http://www.w3.org/TR/sparql11-service-description/
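As an illustration of how the executable query provenance can be retrieved, the sketch below derives a provenance URI by prefixing the query path with "/queryplan/", as mentioned in Section 6.2.2; the host name and the exact URI layout are assumptions made for this example:

    from urllib.parse import urlsplit, urlunsplit

    def queryplan_uri(query_uri):
        """Derive the query provenance (query plan) URI by prefixing the path
        of a prototype query URI with /queryplan/."""
        parts = urlsplit(query_uri)
        return urlunsplit((parts.scheme, parts.netloc,
                           "/queryplan" + parts.path, parts.query, parts.fragment))

    # Hypothetical local prototype instance resolving a Bio2RDF-style query URI.
    print(queryplan_uri("http://localhost:8080/geneid:4129"))
    # -> http://localhost:8080/queryplan/geneid:4129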

6.3 Prototype

The prototype was implemented as a proof of concept for the model that can be used to validate the advantages and disadvantages of the model, as described in Section 6.2. The evaluation of the prototype is based on its usefulness both as a privately deployed software package and as the engine behind the Bio2RDF website. Some features of the model, such as context sensitivity and data trust, apply mostly to private deployments of the prototype, due to the emphasis on scientists forming opinions about datasets, while the others apply equally to both public and private deployments, as they relate to the broader scenario of scientists easily getting access to data while still understanding the provenance of the process involved, and any data quality changes that were applied in the process.

6.3.1 Context sensitivity

The use of RDF as an internal model makes it possible for scientists to translate results from any query into other formats as necessary. The output format is able to identify links explicitly, so any reference fields in other formats can be mapped from the RDF document, and other fields can be mapped as necessary from the structures in the document. The prototype provides access to the paging facility which is included in the model and a subset of the queries, so that scientists can choose to fetch only particular parts of the results set at a time, although they can get access to all of the information by resolving the document for each page offset until they recognise that no new information is being returned. A scientist can use these functions to direct the model to return a maximum number of results from each provider. There is no recognition of relevance across the results returned from different providers, although it could be provided using the deletion rules, to select the best results as necessary. This enables the prototype to support both gradual and complete resolution of queries, through the use of a number of data providers to represent the different stages of the query in cases where the data provider allows paging to occur.

The prototype allows scientists to host the application in arbitrary locations, as long as they add normalisation rules where necessary to change the structure of the URIs that are given in the results of each query. In the case of the HTML results pages, the location is automatically changed to match the way the scientist accessed the prototype, but this is not currently the case with the RDF representations, where it is viewed as more important to have standardised URIs in order for the scientist to easily identify the concept they were querying for without reference to the location of the prototype. It is particularly relevant that scientists can replicate a public Linked Data website using the prototype to avoid overburdening the public website with automated queries that can be performed by local software directly accessing local data.

The prototype allows scientists to extend public resources with any of their own resources, and keep their resources private as necessary. This enables scientists to extend what are traditionally very static Linked Data sources, without revealing their information, or hosting the entire dataset locally. This context sensitivity is an important feature on its own, but it also makes it possible to solve the data quality and data trust issues in local situations, and control the data provenance, rather than just obtain the information about what an independent data provider thinks about a dataset. In this way it is a generic version of DAS, in which the data must be genome or protein records and other annotations are not accessible.

The prototype software allows scientists to individualise the configuration and behaviour of the software according to their contextual needs. It includes the ability to resolve public Linked Data URIs using the prototype, while supporting a simple method for extending the public representation of the record with RDF information directing scientists to other queries that are related to the item. However, this behaviour is not common, and scientists may not fully understand the difference between their local query and the public version, including how to give a reference to the local version of the data in their research, especially if the RDF retrieved by the general public using the published URI never includes their information.

The prototype requires that scientists utilise a local prototype installation to resolve documents according to their trust settings. This either requires scientists to change the URIs in the results to match the DNS name for their local resolver, or it requires them to publish the documents using the local prototype URIs. This is not a design deficiency of the prototype, as it is necessary to fully support local installations of the prototype software. It is, however, an issue that needs to be resolved using a social strategy to support the data quality and data trust features, as local results can be intentionally different to the same query as executed on a public service.

Although it is simpler for scientists to use commonly recognised Linked Data URIs for a particular item, this requires other scientists to know how to change the URIs in any of their documents to work with the scientist's prototype resolver. If they do not know to change the URIs, then resolving the Linked Data URI will only give them the statements that were available from the authority. The prototype is designed to allow scientists to install their own prototype and utilise the configurations from other scientists to derive their own context sensitive options for resolving the URI, including their own trust settings to determine which sources of information they trust in the context of each query.
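A minimal sketch of the URI rewriting this implies is shown below, assuming that the scientist simply swaps the bio2rdf.org authority in a result document for the host name of their local resolver; the host name and the example triple are illustrative, and a real deployment would need more careful rules:

    import re

    def point_at_local_resolver(document, local_host):
        """Rewrite Bio2RDF URIs in a result document so that resolving them goes
        through a local prototype instance, and therefore through the scientist's
        own trust settings, instead of the public website."""
        return re.sub(r"http://bio2rdf\.org/", "http://%s/" % local_host, document)

    doc = ('<http://bio2rdf.org/geneid:4129> '
           '<http://www.w3.org/2000/01/rdf-schema#label> "myeloperoxidase" .')
    print(point_at_local_resolver(doc, "resolver.example.org"))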

Some recommendations for identifying data items, and links to items in other datasets, focus on standardising a single URI for each item in a dataset, and utilising that URI wherever the data item is referred to. A standard URI is useful for integrating datasets, but if there are contextual differences in the way the item is represented, including novel statements, then it is not appropriate to use the standard URI, as its trusted global meaning will be limited to statements that are published by the original authority. In order for the prototype to deliver the unique set of statements that a scientist believes to be trusted, it is more appropriate that the URI is changed, although the model and the prototype support both alternatives.

The Linked Data movement does not focus on supporting complex queries using HTTP URIs, although the principles of Linked Data apply to queries in a similar way as to raw data records. In terms of evaluating the contextual abilities of this prototype, there are no current recommendations about how to interlink data items with queries that may be performed using the data item. For example, in the Bio2RDF configuration, there are many different queries that may be performed depending on the context that the scientist desires.

It is not practical to enumerate all of the query options in an RDF document that is resolved using the data item URI, as it would expand the size of the document, and detract from the information content that is directly relevant to the item. An example of this is the resolution of the URI "http://bio2rdf.org/geneid:4129". This URI can be modified to suit a number of different operations, such as locating the HTML page that authoritatively defines this item, "http://bio2rdf.org/html/geneid:4129", locating the XML document that authoritatively defines this item, "http://bio2rdf.org/html/geneid:4129", and searching for references to the item in another namespace, such as the HGNC namespace, "http://bio2rdf.org/linksns/hgnc/geneid:4129".

If URIs that could be resolved to all of the possible queries for links in other namespaces were included, there would currently be at least 1622 extra statements corresponding to a link search for each of the namespaces in the Bio2RDF configuration. These URIs can be usefully created by scientists without impairing basic data item resolution. If a user requires these query options, extra statements could be included to identify the meaning of each URI. It is not necessarily evident to a computer that the URI "http://bio2rdf.org/linksns/hgnc/geneid:4129" will perform a query on the HGNC namespace. These extra statements would need to link the URI to the namespace, which in Bio2RDF is identified by the URI "http://bio2rdf.org/ns:hgnc". The computer could decide whether to resolve the namespace URI based on the predicates that were used in the extra RDF statements. It is not practical for Bio2RDF to include each of the possible query URIs in the document that is resolved using "http://bio2rdf.org/geneid:4129", although some are included, such as the location of the HTML page. In particular, there is a unique "linksns" URI that can be created dynamically for each namespace, resulting in a large number of RDF statements that would need to be created and transported but may never be used.
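Because the query URIs discussed in this section follow mechanical patterns, they can be constructed on demand rather than enumerated in the resolved document; the sketch below uses only the URI patterns already shown above, and the function itself is hypothetical:

    def bio2rdf_query_uris(namespace, identifier, link_namespaces=()):
        """Construct the Bio2RDF-style URIs discussed above: the data item URI,
        its HTML page, and a 'linksns' reference search in each named namespace."""
        base = "http://bio2rdf.org"
        uris = {"item": "%s/%s:%s" % (base, namespace, identifier),
                "html": "%s/html/%s:%s" % (base, namespace, identifier)}
        for other_ns in link_namespaces:
            uris["linksns-" + other_ns] = "%s/linksns/%s/%s:%s" % (
                base, other_ns, namespace, identifier)
        return uris

    for name, uri in bio2rdf_query_uris("geneid", "4129", ["hgnc"]).items():
        print(name, uri)
    # item          http://bio2rdf.org/geneid:4129
    # html          http://bio2rdf.org/html/geneid:4129
    # linksns-hgnc  http://bio2rdf.org/linksns/hgnc/geneid:4129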

6.3.2 Data quality

The use of the prototype on the Bio2RDF website provided insights into the data quality issues that exist across the current linked scientific datasets. The issues have a range of causes, including a lack of agreement on a standard Linked Data URI to use for each data item, a range of URI syntax differences in and across datasets, and the inability to retrieve complete results sets depending on the server, which cuts out possibly important results and makes it hard to consistently apply semantic normalisation rules to query results. There is a large variation in the way references are denoted across the range of scientific data formats, including differing levels of support for links. These issues are not necessarily fixed by the use of RDF as a format for the presentation of these datasets, as the link representation method in the original data format cannot always be fixed by generic rules. The prototype was useful for fixing a number of the different link representations, although the lack of support for string manipulation in SPARQL made it difficult to support complex rules that would enable the prototype to dynamically create URIs out of properties where the currently available RDF representation did not specify a URI2.

2http://www.w3.org/2009/sparql/wiki/Feature:IriBuiltIn

The prototype was used to change references and data items that did not conform to the normalised data structures that were decided on by the Bio2RDF project. These references included URIs that were created by other organisations that could simply be replaced with Bio2RDF URIs to provide scientists with a way to resolve further information within the Bio2RDF infrastructure. The model also provided the ability to define query templates, so that the normalised URI would be used in output, and the endpoint specific reference, whether it be a URI or another form, could be used in the query. It was successful in integrating the datasets provided by projects ranging from the Bio2RDF endpoints to the LODD, FlyWeb, and the datasets provided by the Pharmaceutical Biosciences group at Uppsala University in Sweden.

The prototype enables scientists to understand the way the information is represented in each dataset relevant to their query, as the query can be completely planned without executing it. This makes it possible for scientists to understand each step of their overall query, without relying on a single method to incrementally perform a large query on any of the available datasets, depending on the nature of the query. A major data quality factor related to the execution of distributed queries is the way results can be obtained from each data provider. In some cases, the entire results set can be returned without limitation, but many data providers set limits on the number of results that can be returned from each query. This restriction has a direct impact on data quality, as the scientist must know what the limits are, and repeatedly perform as many incremental queries as necessary to make the query function properly. The prototype provides a way to incrementally page through results from queries until the scientist is satisfied. Some query types will return generic information that is not page dependent. Query types can be created to include a parameter indicating whether paging is relevant. For other query types that are page dependent, scientists can incrementally retrieve results until the number of results remains constant and is less than the limit the server is internally configured to support. This method is compared to other models in Figure 6.2, with other systems providing a variety of methods ranging from returning a fixed sample of the top ranked results to always returning all results.

[Figure 6.2: Different paging strategies. For a user query asking for information linked to either concept X or concept Y, the strategies compared are: resolving the query using an automatic ranked results method, which returns information linked to both concepts first and then information about each concept if there are not enough results; returning all results, which has high bandwidth usage if there are a large number of results and may still hide issues with fixed size result sets; and returning only the top results, which provides no paging, is biased towards results containing both concepts, and may rely on a ranking method that is not scientifically useful.]
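A sketch of the incremental paging strategy contrasted in Figure 6.2 is shown below, assuming a hypothetical fetch_page(query, offset) callable that wraps one of the prototype's paged query types and returns the list of results for a given page offset:

    def fetch_all_pages(fetch_page, query, page_size):
        """Request successive page offsets until a page comes back smaller than
        the server's configured page size, which is taken to mean that no
        further results are available."""
        results, offset = [], 0
        while True:
            page = fetch_page(query, offset)
            results.extend(page)
            if len(page) < page_size:
                return results
            offset += 1

    # Toy stand-in for a paged provider holding 25 results served in pages of 10.
    def fake_fetch(query, offset, _data=tuple(range(25))):
        return list(_data[offset * 10:(offset + 1) * 10])

    print(len(fetch_all_pages(fake_fetch, "references to diseasome_diseases:1689", 10)))
    # -> 25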

Some public data providers are known to put both incremental restrictions and overall restrictions on the number of results that can be related to a given query. This may be necessary, as the data provider may be provided voluntarily for minor use. There may also be overall restrictions on the size of the documents returned by distributed queries. If the data provider is an HTTP server, there is the option for the server to return a status code indicating that the server's bandwidth is restricted for some reason, and that the document cannot be returned because of this. This restriction is in place for the public DBpedia provider, http://dbpedia.org, for example, although larger documents may be available if scientists enable GZIP compression on the results, something that is not required by the HTTP or Linked Data standards. Each query needs to be executed in a way that would minimise the chances of the size restriction being triggered, although it is not possible using current SPARQL standards to know how large a document may be. This may include constructing Linked Data URI resolutions as SPARQL queries with limits on the number of results that would be returned. The prototype avoids these issues by allowing scientists to pose multiple queries on the data provider, allowing them to iteratively retrieve as many results as necessary.

Scientists may find it necessary to create their own mirror of the data so that they can execute queries without the limitations that the public provider imposed. The prototype allows this, as the model is designed to make the data access methods transparent to the scientist's request, making it simple for scientists to override parts of the prototype configuration with their own context. For example, they could override the "wikipedia" namespace that DBpedia provides so that queries about it were redirected to their mirror instead of the public endpoint.

The prototype provides the ability for scientists to customise the quality of the data resulting from their queries. It is limited by several factors, including the lack of transformation functions in SPARQL, the lack of support in RDF and SPARQL for representing or normalising scientific units, and semantic ambiguity about results from different datasets. The use of RDF as the model to represent the knowledge provides a number of semantic advantages, as entities from different datasets can be merged, where other similar models do not allow or encourage universal references to items outside of a single dataset.

The prototype uses an in memory SPARQL implementation for high level data transformations. However, the current SPARQL 1.0 recommendation does not include support for many transformation operations. Notably, there is no way to create a URI out of one or more RDF Literal strings. This makes it currently hard to use string identifiers to normalise references in all documents to URIs if the original data producer did not originally use a URI, as string based regular expressions are not suitable for the process. This has an impact on the data quality normalisation rules that can be performed by the prototype, as URIs can only be created by inserting the value into a URI inside the original query, or into the statically inserted RDF. This only works for queries where the URI was known prior to the query being executed.

In terms of scientific data quality, it is important to know the units that are attached to particular numeric values.
Discussion 147 numbers, such as a experimental result of “3.35”, while knowing that the number is represented in units, such as “mL”. The RDF format has not been extended to include annotations for these units in a similar way to the native language annotation feature that is currently included in the RDF specification 3. This issue forms a large part of the normalisation process, as scientists expect data normalisation to result in a single unit being represented for each value. As units are not supported directly by the underlying data format, the prototype would need to have some other method of recognising the unit attached to a numeric value to use a rule to convert between units unless scientists are aware of the possible existence of unsuitable units. The prototype implements a simple normalisation rule mechanism. It support sim- ple regular expression transformations, but does not utilise transformations such as RIF, OWL reasoning. These transformations may be required in some cases, but the prototype demonstrates the practical value of the model in relation to linked scien- tific datasets using simple regular expressions for the majority of normalisations, with the ability to define SPARQL queries to perform more complex normalisation tasks. Some non-RDF transformation approaches, such as XQuery and XSLT, require a direct mapping between the dataset and the result formats. In these cases there is no oppor- tunity to utilise a single intermediate model, such as RDF, that can have statements unambiguously merged together in an unordered manner. Many RDF datasets have avoided issues relating to the creation of URIs by exten- sively using Blank Nodes, which are unaddressable by design. This affects data quality, as scientists cannot reference, or normalise a reference, to a particular node, and there is no way, at least in the current RDF 1.0 specification, to normalise Blank Nodes consis- tently to URIs, by design. The use of Blank Nodes makes it difficult to optimise results, as an entire RDF graph may need to be transferred between locations so that there are no elements that cannot be dereferenced in future if blank nodes are used extensively. For example, an optimised query using URIs as identifiers could result in one triple out of a thousand triples in a dataset. The same query on a dataset that uses Blank Nodes may need to transfer all one thousand triples so that the scientist could properly interpret the triple in terms of other information. If the graph uses URIs, the scientist can selectively determine which other triples are relevant to the results using the URIs as independent identifiers. The prototype provides the ability to optimise queries and return normalised URIs that can be matched against any suitable dataset to determine where other relevant statements are located. The use of RDF as the results format may highlight existing semantic data quality issues. The data quality issues may occur when information is improperly linked to other information, and the use of computer understandable RDF enables a comput- erised system to discover that the link is inconsistent with other knowledge. In science, however, there is a constant evolution of knowledge, so the use of RDF to automatically discover inconsistent information relies on an up to date encyclopaedia of verified knowl- edge. The data quality of the encyclopaedia must be assumed to be perfect before it

Data quality issues at the semantic level, such as these, are highlighted by the use of the model with syntax normalisation rules, as the sources of data can be integrated with ontologies that represent a high standard of curated knowledge, although there may not be enough information available to fix the issue without a higher level processing tool.

There may also be inherent record level issues with data when similar, but distinct, queries are executed on different datasources, with the multiple results being merged into the returned document. If there is likely to be ambiguity about the meaning of the information from different datasources, then ideally the queries should not be combined, as scientists would then need to recognise that multiple datasources were being used and accommodate this in the way they process the information, something that the model attempts to avoid. However, in practice, the different sets of results may both be valuable, and scientists would then need to notice the different structures used by the different datasources, customise their profiles to choose one query over the other, or allow for the different structures in their processing applications. The example shown in Figure 5.1 illustrates a data quality issue, as there are multiple RDF predicates which may be used to describe a label or title for the item, and scientists would need to recognise this. In the example, the datasources used a unique predicate for the name of an item in the Gene Ontology. Some scientists may find the distinction useful, but other scientists may create normalisation rules to convert the data to a standard predicate, and include those rules in their provenance records to highlight the difference to other scientists, as sketched below. In comparison to other projects, this allows scientists to specify their intent, and share this intent with other scientists, without having to embed the normalisation and access aspects into the other parts of their processing workflows.

In many cases, domain experts may not be experts in the methods of formally representing their knowledge in ontologies. Data created by the domain expert may be consistent, but it may not include many conditions that could be used to verify its consistency. In order for other scientists to trust the data quality, they can create new rules and apply them to the information. In other systems this would require either low level programming code or a change to the globally recognised ontology framework, neither of which can be used in provenance records to distinguish between the original statements and the changes that were applied by others.

In order for the model to be used to verify the semantic data quality of a piece of data, the context of the item in conjunction with other items may be required. This process may require that a large range of data items are included in the output so that rules can be processed. In practice, this level of rule based reasoning would require scientists to download entire datasets locally and preprocess the data to verify its consistency, before using it as an alternative to any of the publicly accessible versions of the dataset.
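As a concrete illustration of the predicate level normalisation sketched above, a rule could be expressed as a SPARQL CONSTRUCT query that maps several labelling predicates onto a single agreed predicate. The choice of rdfs:label as the target is purely illustrative and is not mandated by the model:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dc:   <http://purl.org/dc/elements/1.1/>

    # Rewrite both rdfs:label and dc:title statements to a single labelling predicate
    CONSTRUCT { ?item rdfs:label ?label . }
    WHERE {
      { ?item rdfs:label ?label . }
      UNION
      { ?item dc:title ?label . }
    }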
The use of RDF triples makes it difficult to delegate statements to particular data provenance histories, and therefore the data quality is harder to identify. However, it is possible if the scientist is willing to use different URIs for their annotations compared to the URI for the relevant data item. Different reasoning tools have different levels of reliance on the URI that is used for each statement about an item. In OWL, for instance, there is no link between a single URI and a distinct "Thing", as statements with different URIs can be merged transparently to form a set of facts about a single "Thing". If users do not rely on OWL reasoning, it is possible to distinguish between statements made by different scientists by examining the URI that is used. This is principally because RDF allocates a particular URI to the Subject position of an RDF triple, and users can create queries that rely on the URI being in the Subject position if they are not using reasoning tools. Reasoning tools such as OWL may discover an inverse relationship based on the URI of the Predicate in the RDF triple and interpret the RDF triple differently, with the item URI being in the Object position. This makes it difficult to conclusively use the order of the triple to define which authority was responsible for the statement.

In some cases, the prototype may need to be configured to perform multiple queries on a single endpoint to deal with inconsistencies between the structure of the data in the endpoint and the normalised form. This is inefficient, but it is necessary in some cases to properly access data where it is not known which form the data will be in for an endpoint. The prototype partially supports RDF Quads, as there are already datasets, including Bio2RDF datasets, which utilise RDF Quads to provide provenance information about records. However, the lack of standardisation for any of the various RDF Quads file serialisations, and the resulting lack of standardisation between the various toolsets, made it necessary to restrict the prototype to RDF triples for the current time. Any move to require RDF Quads should require a backwards compatible serialisation to RDF Triples, which is not yet available outside of the very verbose RDF Triple reification model.

A summary of the data quality support of a range of scientific data access systems is shown in Table 6.2. It describes a range of ways that different systems use to modify data, including rule based, workflow and custom code data normalisation. The prototype uses rules that are distributed across providers, as shown in Figure 3.1 with the context sensitive design model. The federated query model in the figure is representative of the general method used by the other rule based systems, as they rely on a single, complete decomposition of a complex query to determine which endpoints are relevant. The prototype is fundamentally different to other systems in the way it separates queries from the locations where the queries are to be executed, preventing users from specifying locally useful rules as global rules for the query. When this is used along with the contextual definition of namespaces it makes the overall query replicable in a number of locations, whereas the queries in other systems are localised to the structure and quality of the data as provided in the currently accessible location. The configuration uses URI links between query types, providers and normalisation rules that make it possible for data quality rules to be discoverable.

System               Advantages                          Disadvantages
Prototype            Rules, syntactic and semantic       –
SADI                 Rule based semantic                 Custom code for syntactic
BioGUID              Clean linked identifiers            Custom code for each service
Taverna              Custom workflows                    Hard to reuse and combine
Semantic Web Pipes   Custom workflows                    Not transportable

Table 6.2: Comparison of data quality abilities in different systems

Communities of scientists can take advantage of this behaviour to migrate or replace definitions of rules in response to changes in either data sources or current standards for data representation. This benefit is available without having to change references to the data normalisation rules in any queries, due to the separation of rules from query types, or to the rules in any providers where the new queries do not require new normalisation rules. This ability does not necessarily affect the provenance of queries, as the provenance recording system has full access to the exact rules that were used to normalise data when the query is executed, providing for exact replication on data sources that did not materially change.
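To make the separation between providers, query types and normalisation rules more tangible, the following Turtle sketch shows how a provider might reference a query type and a rule by URI, so that the rule remains independently discoverable and replaceable. The class and property names are hypothetical placeholders; the actual prototype configuration vocabulary is not reproduced here:

    @prefix ex: <http://example.org/queryall#> .   # hypothetical configuration vocabulary

    <http://example.org/provider/geneid-sparql>
        a ex:Provider ;
        ex:endpointUrl "http://example.org/sparql" ;            # placeholder endpoint
        ex:handlesQueryType <http://example.org/query/resolve-uri> ;
        ex:appliesNormalisationRule <http://example.org/rule/geneid-uri-prefix> .

    <http://example.org/rule/geneid-uri-prefix>
        a ex:RegexNormalisationRule ;
        ex:matchRegex   "http://www\\.ncbi\\.nlm\\.nih\\.gov/gene/" ;
        ex:replaceWith  "http://bio2rdf.org/geneid:" .

Because the rule is identified by its own URI, a community can publish a replacement rule and point providers at it without touching any query type definitions.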

6.3.3 Data trust

The prototype demonstrated the practical considerations that were necessary to support data trust, including the open source distribution of the program to enable scientists to create and modify configurations without having to negotiate with the community to get a central repository modified to suit their research. The configurations that were relevant to different scientists were published in different web locations, and scientists were able to configure the prototype to use their locally created configuration. These configurations were then used to set up the state of a server, so that the server knew about all of the providers, query types, and normalisation rules in the provided configurations. When a scientist performed a query on the prototype, the pool of configuration information was filtered using the profiles that the server was pre-configured to obey. The pool of configuration information was used to allow three servers in the Bio2RDF group to each choose the datasets and queries that they wanted to support by assigning each server a different profile.

In comparison to methods that use endpoint provided metadata to describe service functionality, the prototype can use a combination of local files and publicly accessible web documents for configuration information. The VoiD specification presents a way of specifying the structure of data available in different SPARQL endpoints, but it does not include trust information [3]. VoiD documents are produced using RDF, so they can be stored locally and interpreted directly, unless they rely on the URL that they were derived from to provide context, something that is popular in some RDF publishing environments. A scientist could trust VoiD information in a similar way to the configuration information for the prototype if the RDF files were available and verified locally. However, in contrast to VoiD, the prototype configurations can be used to trust specific queries on specific data providers, as shown by the inclusion of a layer between providers of data and the query in the context-sensitive model in Figure 3.1.

If a trust mechanism was included explicitly in VoiD it would not include the scientist's query as part of the context that was used to describe the trust. Trust, in VoiD, would need to be specified using the structure of the data as the basic reference for syntactic data trust, although semantic data trust could be recognised using a reference back to the original data provider. In comparison to the model described here, VoiD relies on a description of a direct link between datasets and the endpoints that can be used to access the datasets. In VoiD, endpoints are not described as entities in their own right, so scientists cannot easily substitute their own endpoints, given a description of the dataset. In the same way, they also cannot directly state their trust in a particular endpoint, whether it is a trust based on service availability, data quality, or the other trust factors described in Gil and Artz [52] and Rao and Osei-Bryson [113].

The mechanism that is used to describe services in the recent SADI model does not allow transparent reviews of the changes that were made to data items, as many of the rules related to data normalisation are only visible in code; the extensive use of Web Services makes it simpler to encapsulate this layer in auto-generated programming code, rather than expose it as queries that can be studied by scientists.
This means that a scientist needs to trust a datasource without knowing which method is being used to resolve their queries, something that is possible, but not as advantageous as the alternatives where scientists see all of the relevant information and can decide about it independently of the internal state of a server that is out of their control. In particular, SADI requires that scientists accept all data sources as semantically equal, and code is typically linked to a single data location, making it necessary to use a single syntax to describe each datasource in each location. The scientist must trust the results of the overall SPARQL query that is submitted to the SADI server, as the results are combined and filtered before being returned to scientists. In particular, any semantic rules that are applied to the data must exactly match or the data will be discarded or an error returned to the user in lieu of giving them access to whatever data is available. This makes it particularly difficult for SADI to provide for optimised queries where there is not enough data in the results of the query to satisfy the semantic rules for the classes of data that were returned. An entire RDF record needs to be resolved to reliably determine whether the item described by the record fits into a particular semantic class.

The provenance information for a SADI query, if available, would require the system to evaluate the entire query again to accurately indicate which services were used, as the model behind SADI requires incremental execution of the query to come to a final solution. This may be similar to the requirements for a query executed using VoiD, as it contains information about both SPARQL predicates, which may not be specified in a query, along with information about how to match the starting portion of a URI with a given dataset. In contrast to this, the prototype is designed to only execute the query in increments, so the information necessary to perform the query in future can be obtained without performing the query beforehand. This ability is vital if scientists are to trust the way a particular query will execute before performing the query.

The prototype implements the data trust feature using user specified preferences to define which profiles are trusted and where the configuration information is located. This information is used to provide the context for all queries on the instance of the prototype. A scientist does not need to decide about this information for every query on the prototype, as they should have a basic level of trust in the way the query is going to be performed before executing the query. The scientist is able to make a final decision about the trust that they are going to put into the results when the query is complete and the provenance information is available. However, in the prototype the provenance information is derived using a different HTTP request. This means that the HTTP requested query provenance may not exactly match the data providers and/or query types that were used to derive the results of the previous query, due to random choices of providers and network instability causing providers to be temporarily ignored. These issues are not present for internal Java based requests, where the provenance and the results can be returned from the same request.
The prototype does not insert information into the results about which provider and query type combinations were used to derive particular pieces of data. This means that scientists do not have a simple way of evaluating which of the sources of data were responsible for an incorrect result, if the identifiers for the data items do not contain this information. This is mostly due to the restrictions placed on the prototype by the RDF triple model in relation to annotations at the triple level. The RDF model provides a reification function, but it would expand the size of the results for a query by at least three hundred percent, as illustrated in the sketch below. Although an RDF quad format could be used to make it possible to add information about provider and query type combinations above the level of triples, with some quad syntaxes being relatively efficient, the RDF quad formats are not widely implemented and are not standardised, and changing would require all scientists to understand the RDF quad syntaxes, as there is no backwards compatibility with the triple models.

The method of configuring the prototype using multiple documents as sources of configuration information relies on the assumption that URIs for different elements, such as providers and profiles, are unique to a given document. Although this is not a realistic assumption to make in a broadly distributed scenario, the configuration sources were all trusted, and the configuration sources were set up so that within a particular document it was easy to verify whether the URI for an element was accidentally duplicated.

Another method needs to be used to authenticate a document to fully trust the configuration information given by a web document. The configuration documents could be authenticated by having scientists manually check through the contents of the file, and then having them sign the particular file using a digital signature. This is not viable in dynamic scenarios, as there are legitimate reasons for making both minor and major changes to the configuration file without the scientist having to lose trust in the configuration information. The major risk is that an item will be defined in more than one document, causing unintentional changes to the provider, query type, or other object. A future implementation could allow users to specify a list of URIs representing model objects that could be trusted from particular configuration documents. This strategy would ensure that scientists have the opportunity to verify that their configuration information does not clash with any other configuration information.
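To make the reification overhead concrete, the following Turtle sketch shows a single Bio2RDF style statement and the additional triples that standard RDF reification requires before any provider annotation can be attached to it; the ex: provenance property is a hypothetical placeholder:

    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/provenance#> .   # hypothetical vocabulary

    # Original single statement:
    <http://bio2rdf.org/geneid:4129> rdfs:label "monoamine oxidase B" .

    # Standard RDF reification of that statement:
    ex:statement1 a rdf:Statement ;
        rdf:subject   <http://bio2rdf.org/geneid:4129> ;
        rdf:predicate rdfs:label ;
        rdf:object    "monoamine oxidase B" ;
        ex:retrievedFrom <http://example.org/provider/geneid-sparql> .

Four reification triples, plus the annotation itself, accompany each original triple, which is the expansion of at least three hundred percent referred to above.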

System               Advantages                          Disadvantages
Prototype            Configured data trust               Multi-datasource results
SADI                 –                                   Assume all data is factual
BioGUID              Simple record translation           –
Taverna              Data tracing through workflow       Typically single locations
Semantic Web Pipes   Data tracing for single workflows

Table 6.3: Comparison of data trust capabilities of various systems

6.3.4 Provenance

The prototype can provide a detailed set of configuration information about the methods that would be used to resolve any query. This provenance record contains the necessary information for another implementation to completely replicate the query. However, the provenance information is not necessarily static. The model provides a unique way of ensuring that scientists can make changes to the provenance record without deleting the original data providers. It does this by loosely coupling query types with input parameters, and by linking from providers to the query types they support, so query types do not have to be updated to reflect the existence of new data providers. The prototype provides network resiliency across data providers, so that a single broken data provider will not result in partial loss of access to the data, even if different data providers use different query methods.

The provenance record for each query includes a plan for the query types and providers that would be used, even if the query was not executed. An implementation could generate the query plan link and insert it into the document, although many users may not require the query plan to use the document, so it was not inserted into the Bio2RDF documents by default. A complete solution would require a special application that knew how to decompose the URI into its namespace component along with any identifiers that were present in the item. It would use the namespace to get a list of query types and the relevant providers for the query types, using the namespace as the deciding factor. In particular, the provenance for the URI is retrieved using the implementation by modifying the URI and resolving the modified URI.

The provenance record includes information about the profiles, normalisation rules, query types, and providers, in the context of the query. It contains a subset of the overall configuration, which has been processed using the chosen profiles to create the relevant queries. The prototype relies on template variables from queries, which are replaced with values in the provenance record. These changes mean that the provenance record is customised for the query that it was created for. In cases where the URI format does not change and the namespace stays the same, the query plan would contain most of the relevant providers and query types for a similar query. For example, both http://bio2rdf.org/queryplan/geneid:4129 and http://bio2rdf.org/queryplan/geneid:4128 contain the same providers and query types, as there is no query type that distinguishes between identifiers in the "geneid" namespace.

An alternative to modifying provenance records to derive new possible queries would be to generate further provenance records using the prototype after it is configured using the provenance record as a configuration source. The scientist could then trust that the resulting queries were semantically accurate in terms of the query model. The prototype did not provide a way to discover what queries would be useful for a particular normalised URI without resolving the URI. The document that was resolved for the normalised URI could have static insertions that were used in the prototype to indicate which other queries may be applicable.
For example, the document resolved at http://bio2rdf.org/geneid:4129 contains a reference to the URL http://bio2rdf.org/asn1/geneid:4129, and the predicate indicates that the URL could be used to find an ASN.1 formatted document containing the given record.

The provenance implementation in the prototype does not include data provenance. There are many different ways of denoting data provenance, and they could be used together with the prototype as long as the sections of the record can be segregated. This is an issue with RDF, as the different segments of the record cannot easily be segregated without extending the RDF model from triples to quads, where the extra context URI for each triple forms a group of statements that can then be given data provenance. The model emphasises data quality, which includes normalisation, and data trust, which includes proper annotation of statements with trust estimations. These conflicting factors make it difficult for both the prototype and the model to support provenance at a very low granular level. In the prototype, the choice of which strategy to use is left to the scientist, although the prototype currently only supports RDF triples. The strategy taken for the Bio2RDF website is to emphasise data quality over the location of the datasets and the provenance of individual RDF triples. The low level data strategy, where the data provenance of each statement is given, makes it possible to analyse the data in different ways, including the ability to compose integrated provenance and data quality queries. However, a normalised strategy which does not include all of the data provenance as part of the results of queries allows processing applications to focus on the scientific data without having to allow for the extra processing complexity that is required if the RDF model includes the context reference.

The annotation of specific RDF triples is difficult, and some linked datasets require that scientists extend the original RDF triple model to a quad model, adding an extra Named Graph URI to each triple. This is possible using the model, but it is not encouraged in general as it reduces the number of places that the RDF documents will be recognised, as there is no recognised standard that defines RDF quad documents.

The data item provenance is useful as a justification of results in publications. However, if the data providers update the versions of the data items without keeping older data items available, the model is unable to recreate the queries, particularly if the data items are updated without changing the identifier. In part, this formed the basis for the LSID project [41], and implementations that used LSIDs, such as the myGrid project [148]. The social contract surrounding LSIDs required that the data resolved for each identifier be unchanged in future, to exactly reproduce past queries. LSIDs failed to gain a large following outside of biological science communities. However, many initial adopters are instead moving to HTTP URI Linked Data based systems. The use of more liberally defined HTTP URIs does not mean that the emphasis on keeping data accessible needs to be lost. The ability to reproduce large sets of queries exactly on large datasets is a basic requirement in the internet age, whereas textual details of a methodology were previously the only scientific requirement.
The model provides the ability to reproduce the same query, with the publication containing the information required to submit the same queries to the same datasets. It is flexible enough to work with both constantly updated and static datasets. To reproduce the exact results, the data provider object in a query provenance document may be changed from the updated endpoint to the location of a static version that has not been updated since the query was executed. The cost of continuously providing public query access to old copies of datasets may not be economical relative to the benefits of being able to exactly reproduce queries. This may be leading many data producers to avoid long term solutions such as Handles and DOIs, which rely on the data producer paying for the costs of keeping at least the metadata about a record available in perpetuity. The model aims to promote a context sensitive approach which does not rely on an established authority to provide continuous access to the provenance or metadata about records. It is possible to publish provenance documents using schemes such as DOIs if data providers are willing to pay the costs associated with the continual storage of metadata about items by the global DOI authority.

6.4 Distributed network issues

The prototype was designed to be resilient to data access issues that could affect the replicability and context sensitivity of queries resolved using the model. Issues such as computer networks being unavailable, or sites that were overloaded and failed to respond to a query, would have produced a noticeable effect on the results. The effect was reduced by monitoring and temporarily avoiding providers that were unresponsive. This monitoring was performed at the level of the provider and, at a lower level, by collating the DNS names that were responsible for a large number of errors. The prototype relies on simple random choices across the endpoint URLs that are given for each provider to balance the query load across all of the functioning endpoints if they are not included in query groups or provider groups.

In comparison to Federated SPARQL, where each of the services is defined as a single endpoint in the query, query groups and multiple providers for the same query types allow the system to be load balanced in a replicable manner. In addition, it was important for the Bio2RDF website that results were consistent across a large number of queries. In a large scale multi-location query system, such as the set of scientific databases described in Table 2.1, a broken or overwhelmed data provider will reduce both the correctness and speed of queries. The implementation includes the ability to detect errors based on the endpoint address and avoid contacting unresponsive endpoints after a certain threshold is exceeded. If there are backup data providers that can execute the same semantic query, they can be used to provide redundancy. Although this may have an effect on the completeness of queries, as long as the backup data providers are used for a query until a success is found, it was viewed as a suitable tradeoff. It avoids the very long timeouts that occur when sites do not instantly return errors, which would otherwise reduce query responsiveness for scientists. After a suitable period of time, the endpoints may be unblocked, although if they fail to respond again they would again be blocked.

The statistics gathering accumulated the total latency of queries made by the prototype on providers. The statistics derived from the Bio2RDF mirrors indicated that there were 298,807 queries that had low latency errors (i.e., the time for a query to fail was more than 0 and less than or equal to 300 milliseconds (ms)). By comparison, there were 328,857 queries that had a total error latency of more than 300 ms. This indicates that there was a roughly even number of cases where an endpoint had failed completely and quickly returned an error, and where it was working, but busy, and timed out before being able to complete a request. If an endpoint was blacklisted, it would appear a maximum number of times per blacklisting period in the statistics, making it difficult to interpret the statistics. During the research, the blacklisting period varied between 10 minutes and 60 minutes. The server timeout values across the length of the Bio2RDF monitoring period ranged from 90 seconds at the beginning to 30 seconds at the end. A small number of queries, 284, had server timeout values of less than 5 seconds. A server timeout only occurred if the endpoint stopped responding as it was processing the query, so higher latencies were possible. In some cases, the query did not time out after 30 seconds, presumably because data was still being transferred.

The highest total error latency for a query was 2,121,788 ms for a query on November 10, 2009. It contained two queries that failed to respond correctly, with the overall query taking 1,201,622 ms to complete. The highest response time for an overall query was 1,279,359 ms, for a query on November 4, 2009. It contained 1 failed query, which took 345,502 ms (345 seconds) to fail, while the 8 successful queries that made up part of the query took a combined total of 8,152,005 ms to complete. This indicated that there were queries that legitimately took a long time to complete. Many of the queries did complete relatively quickly considering the network communication that was required: 6,662,019 overall queries completed within 3 seconds of the user requesting the data, and 7,022,205 queries completed within 5 seconds.

The Bio2RDF website and datasets were not hosted using high performance servers; however, the total number of unsuccessful queries was still only 3 percent, as shown in Table 6.4, indicating that the datasets were relatively accessible with the commodity hardware that was used. Two of the mirrors were limited to single servers with 4 core processors and between 1 and 4 GB of RAM, while the other mirror was spread across two dual-quad-core machines with 32 GB and 16 GB of RAM respectively. The requests from users to Bio2RDF were spread equally across all of the servers using DNS round robin techniques, so there was no specific load balancing to prefer the system with the higher specifications.

The current implementation lacks the ability to notify scientists in a systematic manner when particular parts of their query failed; however, in future this information may also be sent to scientists along with the results. This would enable decisions about whether to trust the results based on which queries failed to execute. Since the prototype was designed as a stateless web service, this information is not easily retrievable outside of the comments included in the results, if they used a format that supports textual comments. The query is not actually executed when the scientist requests the provenance for a query, so that avenue is not applicable. Using RDF and SPARQL in the prototype, there is no way to estimate how many results each query will return, as SPARQL 1.0 does not contain a standard count facility, and the number cannot be easily derived from the query without previously having statistics available. It would also incur a performance hit to require each results set to be counted, either before or after the actual results are retrieved.

The model and prototype do not include support for specifying the statistics related to each dataset. In order to use the model as the basis for a federated query application, where a single query is submitted, planned, filtered, and joined across the data providers, the model would need to be extended to include descriptions of the statistics relevant to each namespace. This functionality was not included, as the model was designed to be relevant to the contextual needs of many different scientists, including scientists who do not have the internet resources available to process very large queries, as is required by similar large grid processing applications. The prototype is efficient in part due to its assumption that queries can be executed in parallel, as the system is stateless, and the results can be normalised and included in a pool of statements without having to join or reduce the results from different queries until after they are resolved and normalised. This design does not favour queries that require large amounts of information from geographically distant datasets.

The prototype was implemented using Java. The Java security settings do not allow an unprivileged program to specify which IP address is going to be used for a particular HTTP request, as this requires access to a lower layer of the TCP/IP stack than the high level HTTP interface allows. The main consequence of this limitation is that the prototype is not able to understand where failures occur in cases where single DNS names are distributed across multiple IP addresses. The mirrored Bio2RDF SPARQL endpoint DNS entries were changed in response to this to create a one-to-one relationship between DNS entries and IP addresses, which could then be used to reliably detect and avoid errors without affecting the functioning mirrors.

For example, there are two IP addresses for the DNS entry "geneid.bio2rdf.org", but there is only one IP address for each of the specific location DNS entries, i.e., "cu.geneid.bio2rdf.org" and "quebec.geneid.bio2rdf.org". This may be a problem for the use of the prototype with other websites that rely on DNS level abstractions to avoid scientists having to know how many mirrors are available for a given endpoint. Following a policy of having DNS entries map to single IP addresses enables an implementation to understand which endpoints are unresponsive. However, it restricts the ability of the scientist to choose which endpoints are geographically close. Each strategy gives the best response under different conditions. The actual endpoint that was used can be reported to the scientist in the provenance record, with any alternatives noted as possible substitutes.

6.5 Prototype statistics

This research could not be evaluated using numeric comparisons alone, as the research questions focus on determining the social and technical consequences of different model design features on queries across distributed linked datasets. However, the use of the prototype on the Bio2RDF website was monitored, and the results are shown here to indicate the extent to which the model features were relevant to Bio2RDF.

The prototype was configured to match the conventions for the http://bio2rdf.org/ website. The configuration necessary for the Bio2RDF website contains 1,626 namespaces, 107 query types, 447 providers and 181 normalisation rules. In order to provide links out from the RDF documents to the rest of the web, 169 of the namespaces have providers that are able to create URLs for the official HTML versions of the dataset. The Bio2RDF namespaces can be queried using query types and providers in 13,614 different combinations. The Bio2RDF website resolved 7,415,896 user queries between June 2009 and May 2010.

The prototype was used with three similar profiles on three mirrors of the Bio2RDF website; two in Canada and one in Australia. The configurations were sourced from a dedicated instance of the prototype 4. Profiles were used to provide access to each site's internally accessible datasets, while retaining redundant access to datasets in other locations. The reliability of the website was more important than its response latency, so redundant access to datasets at other locations increased the likelihood of getting results for a user's query.

Analysis of the Bio2RDF use of the prototype generated a number of statistics, including those shown in Table 6.4. These statistics revealed that the prototype was widely used, including an average over 10 months of 17 user queries per minute. These queries generated 35,694,219 successful queries on providers, with 1,078,354 unsuccessful queries (3%). A large number of unsuccessful queries were traced to a small number of the large Bio2RDF databases that were not handling the query load effectively. There were periods where the entire dataset hosting facility at individual Bio2RDF locations was out of use due to hardware failures or maintenance, which caused the other mirrors to regularly retry these providers during the outage.

4 http://config.bio2rdf.org/admin/configuration/n3

The statistics gathering system was not designed to provide complete information about every combination of query type and provider. In particular, there were at least 10,697,566 queries that were executed on providers that did not require external queries to complete, as the providers were designated as no-communication. These queries were not included in the successful or unsuccessful queries. No-communication providers are fillers for templates to generate RDF statements to add to the resulting documents, so they did not provide information about the efficiency of the system. The 36,772,573 remote queries were performed using 18,455,634 query types on providers that were configured to perform SPARQL queries.

Each of the user queries required the prototype to execute an average of 4.96 queries over an average of 3.73 of the 447 providers. The 447 providers do not represent distinct endpoints, due to the way query types, namespaces, and normalisation rules are all contextually related to a provider in the model, and not to an endpoint. The large number of unique providers and queries reflects the difficulty of constructing alternative workflows or processes to contain these queries, along with the normalisation rules that are applied to each provider.

Each of the 447 providers could be removed using profiles and, if necessary, replaced using alternative definitions, without needing to change the definitions of any query types, namespaces, or normalisation rules in the master Bio2RDF configuration. By contrast, current Federated SPARQL models expose endpoint URLs directly in queries. This creates a direct link between queries and endpoints, which means that queries cannot be replicated, based on query provenance, using any other endpoint without modifying the query provenance, as illustrated below.
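For comparison, the SPARQL 1.1 federated query extensions (a working draft at the time of this research) embed the endpoint URL in the query text itself, so replicating the query elsewhere means editing the query; the endpoint below is purely illustrative:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?label
    WHERE {
      # The endpoint URL is hard-coded into the query itself
      SERVICE <http://example.org/sparql> {
        <http://bio2rdf.org/geneid:4129> rdfs:label ?label .
      }
    }

In the prototype, the equivalent endpoint information lives in provider definitions, so a profile can substitute a different endpoint without touching the query or its provenance.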

Federated SPARQL models may use object types to optimise queries. This strategy works for systems where everyone uses the same URI to designate a record, and all records in a namespace are represented using consistent object type URIs. These assumptions hold in an environment where every namespace is similar to a typical normalised relational database table, with the same properties for every record, and numeric primary keys for each record. However, as the Bio2RDF case demonstrated, primary keys are not always numeric, so they may require syntactic normalisation. In addition, there may be multiple object types in a single namespace, even if the object types are in reality equivalent but represented using different URIs or schemas.

In Bio2RDF, there were a few namespaces that clearly highlighted issues with the relational-database-like assumptions that are used by Federated SPARQL implementations. For example, the Wikipedia dataset has been republished in RDF form by the DBpedia group. The record identifiers in DBpedia are derived from the titles of articles on Wikipedia, so references to the equivalent Wikipedia URL should be treated as equivalent to the DBpedia references. In the Bio2RDF normalised URI scheme, the main namespace prefix was "wikipedia", with an alternate prefix of "dbpedia", as the dataset is commonly referred to in the Semantic Web. In the prototype, two different URIs, http://bio2rdf.org/wikipedia:Test and http://bio2rdf.org/dbpedia:Test, would resolve these documents, although the resulting documents both contained http://bio2rdf.org/wikipedia:Test as the subject of their RDF triples. In Federated SPARQL, the whole community would need to decide on a single structure for URIs, or decide on a single object type for each namespace.

In Bio2RDF, different structures were supported by adding a new namespace and applying it, and any corresponding normalisation rules, to the differing providers. For example, the prototype was used by Bio2RDF to access and normalise the URIs in the Chembl dataset, available from Uppsala University in Sweden 5. There are a number of namespaces inside this dataset, and a set of namespaces was created to suit the structure. The prototype was also experimentally used to create a Linked Data infrastructure for the original URIs. In order to do this, alternative namespaces needed to be created for the same dataset, as the namespace given in the original URIs did not contain the "chembl_" prefix that Bio2RDF used to designate the namespace as being part of the Chembl dataset. It would not be possible using Federated SPARQL to create these alternative Linked Data access points, as Federated SPARQL requires single URIs, so each query could only contain one of the many published URIs. This would make it impossible for anyone except the original publisher to customise the data, and it is necessary to customise both the source and structure of data to satisfy the data quality, data trust and context sensitivity research questions for this thesis.

The 447 providers were mapped to 1,626 namespaces. The majority of these namespaces represented ontologies published by OBO, converted to Bio2RDF URIs and located in a single Bio2RDF SPARQL endpoint. In the OBO case, it would be very difficult to separate the different ontologies based on object types, as the OWL vocabulary is universally used to describe the objects in all of the namespaces. The only alternative method for segregating the dataset into namespaces is to search for a given prefix in each URI, as each of the ontologies contains a different URI prefix. In SPARQL 1.0, it is necessary to use regular expressions to query for a particular prefix in a URI. This is very inefficient and not necessarily simple. To avoid this, optimised SPARQL queries were constructed to use the Virtuoso free text search extension. In Federated SPARQL, these extensions would need to be located in the query, making it difficult to replicate sets of these queries on another system, as they would all need to be modified. However, in Bio2RDF, the equivalent regular expression query was created and used on non-Virtuoso endpoints, without changing the query that users refer to.
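A query of the following form illustrates the SPARQL 1.0 regular expression approach to selecting one namespace out of a shared endpoint; the OBO namespace prefix shown is illustrative, and the more efficient Virtuoso free text form is not reproduced here:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?s ?label
    WHERE {
      ?s rdfs:label ?label .
      # Keep only subjects whose URIs fall in one OBO-derived namespace
      FILTER REGEX(STR(?s), "^http://bio2rdf.org/obo_go:")
    }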
In the case of OBO, these query and namespace features enabled Bio2RDF to create a small number of providers to represent similar queries across all of the OBO ontologies.

There were at least 10,765 unique users of the Bio2RDF website during the statistics gathering period, although two of the mirrors did not recognise distinct users by their IP addresses, as a result of the web applications being hosted in a reverse proxy configuration.

5 http://rdf.farmbio.uu.se/chembl/sparql

Statistic                                                          Number
Total no. of resolutions                                           7,415,896
No. of resolutions within 3 seconds of request                     6,662,019
No. of resolutions within 5 seconds of request                     7,022,205
Average no. of resolutions per minute over 10 months               17.16
Average latency of queries (ms)                                    1,532
Resolutions with low latency errors (1-300 ms in total)            298,807
Resolutions with high latency errors (more than 300 ms in total)   328,857
Total number of queries on external endpoints                      36,772,573
No. of successful provider queries by prototype                    35,694,219
No. of unsuccessful provider queries by prototype                  1,078,354
Average no. of queries for each resolution                         4.96
Average no. of endpoints for each resolution                       3.73
Number of unique users by IP Address                               10,765
Current number of providers                                        447

Table 6.4: Bio2RDF prototype statistics

The prototype software was released as Open Source Software and had an aggregate of 2,685 downloads from the SourceForge site 6. The most recent version of the software released on SourceForge, 0.8.2, was downloaded around 300 times in its compiled form, and around 30 times in its source form, at the time of writing.

The normalisation rules in the Bio2RDF case provided a way to make the normalised Bio2RDF URI useful in cases where other data providers used their own URIs. This is necessary for data providers that independently publish their own Linked Data versions of datasets using their own DNS authority as the resolving entity for the URI. For example, the NCBI Entrez Gene data item about "monoamine oxidase B", with the identifier "4129", has at least three identifying URIs, including http://purl.org/commons/record/ncbi_gene/4129, http://bio2rdf.org/geneid:4129, and the NCBI URI http://www.ncbi.nlm.nih.gov/gene/4129. The normalisation rules made it possible to retrieve data using any of these URIs and integrate the data into a single document that can be interpreted reliably using a single URI as the reference for the item. In addition, the 150+ providers that were designed to provide links to the traditional document web did not require normalisation, as the identifiers were able to be placed directly into the links.

In the case of Bio2RDF, there was a relatively small number of normalisation rules compared to the number of providers, because in practice there were a limited number of locations that each dataset was available from, and the semantic structure of the data did not need to be normalised between providers in most cases. The majority of rules defined URI transformations so that data from different locations could be directly integrated based on RDF graph merging rules, as sketched below.

6 http://sourceforge.net/projects/bio2rdf/files/
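As an illustration of that URI normalisation, the sketch below shows the same gene record as it might be published under two different authorities, and the merged document after the prototype rewrites both references to the normalised Bio2RDF URI. The ex:symbol predicate is illustrative rather than taken from the actual datasets:

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/vocab#> .   # illustrative predicate only

    # As published by two different providers:
    <http://www.ncbi.nlm.nih.gov/gene/4129>          rdfs:label "monoamine oxidase B" .
    <http://purl.org/commons/record/ncbi_gene/4129>  ex:symbol  "MAOB" .

    # After URI normalisation, RDF graph merging yields a single document:
    <http://bio2rdf.org/geneid:4129> rdfs:label "monoamine oxidase B" ;
                                     ex:symbol  "MAOB" .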

In comparison, the SADI system requires that data normalisation be written in code, as opposed to being configured as a set of rules. It must encode each of these URIs and predicates into the relevant code and decide beforehand what the authoritative URI will be. This makes it simpler to deploy services to a single repository, but makes it very difficult for users to change the authoritative URIs and predicates to match their own unique context. Having said that, the SADI system provides SPARQL based access to Web Services using its mapping language, so it is useful for integrating RDF methods with the current pool of scientific Web Services. SADI was initially built on a selection of premapped BioMoby services, making it able to potentially access over 1500 web services when the mapping is complete. In comparison, the Bio2RDF configuration built for the prototype relies mostly on RDF database access using direct SPARQL queries, making it possible to avoid performing multiple queries to get a single record from a single database, as is necessary with many BioMoby Web Services. The SADI system, however, is being gradually integrated with the Bio2RDF databases, although the normalisation rules that make the prototype successful across the large range of Bio2RDF and non-Bio2RDF RDF datasources will need to be integrated into the SADI system manually. In addition, SADI, like its SQL based predecessor described in Lambrix and Jakoniene [83], has not dealt with the issue of context sensitivity or replicability when dataset structures materially change, as the Bio2RDF datasources have done significantly in the last two years.

6.6 Comparison to other systems

There are a number of other systems that are designed using a model based on a single list of unique characteristics of each datasource, such as the properties and classes of information available in each location. These systems use this information to decompose a query into subqueries and join the results based on the original query. The prototype focuses on the flexible, content-agnostic RDF model as its basis for automatically joining all results. Although some other systems are designed to resolve SPARQL queries, they do not focus on RDF as the sole model, with many utilising the SPARQL Results Format 7 as the standard format for query results in the same way that previous systems use SQL result rows.

The SPARQL Results Format is a useful method of communicating the results of complex queries without having to convert the traditional relational row-based results processing algorithms into the fully flexible RDF model. However, it does not enable the system to directly integrate the results from two locations without knowing what each of the results means in terms of the overall query. If the query is not a typical query looking for a set of result rows, but is instead looking for a range of information about items, then the variables from different results will not join easily, as they are defined in the query as text and they do not form RDF documents. However, any SPARQL SELECT query, which would otherwise return results using the SPARQL Results Format, can be rewritten as an equivalent SPARQL CONSTRUCT query that returns RDF triples, as illustrated below.
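A minimal sketch of that rewriting, using an illustrative labelling query:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Equivalent to: SELECT ?item ?label WHERE { ?item rdfs:label ?label . }
    # but returns RDF triples that can be merged directly with other RDF results.
    CONSTRUCT { ?item rdfs:label ?label . }
    WHERE     { ?item rdfs:label ?label . }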

7 http://www.w3.org/TR/rdf-sparql-XMLres/

The prototype can be integrated with other RDF based systems using SPARQL CONSTRUCT/DESCRIBE queries, URL resolution, or other methods as they are created in future. In all of the other frameworks the distribution of queries across different locations relies on knowledge about the structure of the data that will be returned, as queries are planned based on the way that the data from different endpoints will be joined. In the model and prototype, the distribution is based on knowledge about the high level datasources that are accessible using each endpoint, with the possibility that all datasources will be available using a single default endpoint without needing to specify each of the datasources. These methods make it possible to construct complex, efficient queries, without restricting users to particular types of queries or restricting the system to cases where it knows the complete details of the data accessible in each endpoint.

The prototype provides users with a way of integrating any of the RDF datasources that they have access to, including Web Services wrapped using SADI [143], distributed RDF databases wrapped using DARQ [111], and plain RDF documents that are accessible using URLs. This makes it possible for scientists to develop solutions in any infrastructure and for other scientists to directly examine or consume their results, whether or not the other scientists have a prior understanding of the ontologies used to describe the results. Although SADI provides direct access to RDF datasources using specifically designed code sources, its entire processing method relies on scientists knowing exactly which predicates were used to describe each dataset. This makes it necessary for each community to use a single set of predicates for any of the datasets that they wish to access with their query, or to produce code for the translation rules to make the SADI engine see a single set of predicates. The prototype provides an environment where multiple sources of data from different contexts, such as open academic and closed business models, can be combined, and recombined in different ways by others, without having to publish limited access datasets completely in globally accessible locations.

In DARQ, for example, the statistics relevant to each dataset must be accessible by each location to decide whether the query is possible, and if so whether there is an efficient way to join the results. In the prototype, the relevant pieces of information about each dataset can be published, and if the dataset is not intended to be directly accessed, the prototype can be used to publish the set of operations that the public can perform with the dataset. In SADI, it is necessary to use a single service repository, as the query distribution plan is compiled in a single location using both the coded programs and the service details, making it difficult to transport current queries or integrate service repositories, even if a scientist can establish their own repository for their own private needs.

The prototype makes it possible to integrate different datasets by encoding the normalisation rules in a transportable configuration file and making sure that namespaces are always recognised using an indirect method, so that there is not a single global definition of what URI is used to identify each namespace or dataset.
In SADI, namespaces are either recognised using custom code for each URI structure, based on the prefix definitions on the Life Science Resource Name website 8, or using the literal prefix string representing the namespace. In comparison, the prototype makes it possible for scientists to internally redefine namespaces using the same prefix as an external namespace, while linking the namespace to their own URI, making it simple to distinguish between their use of the prefix and other uses of it. It is not possible to use a prefix to name more than one namespace if one relies on Inverse Functional Properties to name an RDF object using a database name prefix and the identifier without using a unique predicate URI, as the Gene Ontology recommends in its RDF/XML format guidelines 9. The prototype encourages Linked Data style HTTP URI references to resources so that they can be accessed using the prototype's query methods, using HTTP resolution if available, or using the entire HTTP URI as an identifier if another method is used in the future.

8 http://lsrn.org/
9 http://www.geneontology.org/GO.format.rdfxml.shtml

Chapter 7

Conclusion

The research presented in this thesis aims to make it possible for scientists to work with distributed linked scientific datasets, as described in Section 1.1. Current systems, described in Chapter 2, that attempt to access these datasets encounter problems including data quality, data trust, and provenance, which were defined and described in Section 1.2. These problems, shown in Figure 1.6, are inherent in current distributed data access systems. Current systems assume that data quality is relatively high; that scientists understand and trust the datasets that are publicly available in various locations; and that other scientists will be able to replicate the entire results set using the exact method as published. In each case, scientists need to have control over the relevant methods and datasources to integrate, replicate, or extend the results in their particular context. They must be able to do this with minimal effort. Approaches such as copying all datasets to their local site before normalising and storing them in a specialised database, or alternatively relying exclusively on public web services that may not be locally available or reliable, do not form a good long term solution for access to both local and public datasets.

The query model described in Chapter 3 enables scientists to integrate semantically similar queries on multiple datasources into a single, high quality, normalised output format without relying exclusively on a particular set of data providers. The query model distinguishes between user queries, templated query types, and data providers, making it possible to add and modify the way existing user queries are resolved by adding or modifying templated query types or data providers. In addition, users can create new normalisation rules to modify queries or the results of queries without requiring any community negotiation, as is necessary with the single global ontology systems that have been created in the past.

The query model was implemented in the prototype web application as described in Chapter 5. The prototype makes it possible for scientists to expose their queries and configuration information as HTTP URIs. The use of HTTP URIs enables scientists to use the information in different processing applications, while sharing the computer understandable query methods and results with other scientists to enable further scientific research.


The prototype is designed to be configured in a simple manner. A prototype installation can be configured from scratch, or public sources of configuration information can be imported and selectively used based on the user's profile. An example of a public configuration is the Bio2RDF configuration1. Users can add and substitute new providers without requiring changes to current providers and queries. The use of profiles from the query model enables scientists to choose local sources in preference to non-local sources where they are available. Profiles allow scientists to ignore datasources, queries, or normalisation rules that are untrusted, without actually removing them from the published configuration files that they import into their prototype.

Namespaces, corresponding to sets of unique identifiers, can be assigned to data providers and queries to perform efficient queries using the prototype. Namespaces enable the prototype to distribute queries across providers that serve the same datasource, without relying on the method that was used to implement either of the providers, or on any declarations by the data producer about which datasets are present in a location.

The RDF based provenance and configuration information that is provided by the prototype can be processed together with the RDF data when it is necessary to explain the origin of the data along with the data cleaning and normalisation methods that were applied. This enables scientists to integrate the prototype with workflow management systems, although there is not yet widespread support for RDF processing in workflow management systems. Chapter 4 described the integration of the model with medicine and science, and showed that it can be used for exploratory purposes in addition to regular workflow processing tasks.

The model can be used to access untrusted datasets, which is necessary for exploration and evaluation purposes. However, it is simultaneously possible to restrict trusted queries to only specific datasources as deemed necessary by a scientist, reducing the number of datasets that they need to evaluate before trusting the results of a query.

The model is implemented as a prototype web application that can be used to perform complete queries on each data provider for a particular query, and to normalise the results. The scientist is then responsible for performing further queries. In comparison to other systems, the prototype does not attempt to automatically optimise the amount of information that is being transferred, and it does not answer arbitrary SPARQL queries. It has native support for de-normalising queries to match data in particular endpoints, and for normalising the results, something that other systems do not cater for, but which is important for consistent context-sensitive integration of distributed linked scientific datasets. The use of the prototype in both private and public scenarios, including the public Bio2RDF.org website2 and private uses of the open source software package3, provides evidence for the success of the model in improving the ability of scientists to perform queries across distributed linked scientific datasets.

1http://config.bio2rdf.org/admin/configuration/n3
2http://bio2rdf.org/
3http://sourceforge.net/projects/bio2rdf/

The scientific case studies described in Chapter 4, and the related systems described in Chapter 2, show that there are no other models or systems designed to enable scientists to independently access and normalise a wide variety of data from distributed scientific datasets. Other systems generally fail by assuming constant data quality and semantic meaning, or uniform references, or by requiring scientists to have all datasets locally available. The prototype was shown in Section 4.7 to be simple to integrate with workflow management systems, using HTTP URI resolution as the access method and RDF as the data model, with no normalisation steps necessary in the workflows. Workflows are ideally suited to filtering and parallelisation, and they were used in this way to link the results of queries using the prototype with further queries.

The discussion in Chapter 6 shows that the prototype is useful by virtue of the way it was used by the Bio2RDF project to integrate a large range of heterogeneous datasets published by different organisations. In addition, the Bio2RDF configuration was manipulated using profiles both to make the Bio2RDF mirrors work efficiently, and to allow private users to add and remove elements of the configuration using private configurations.

The model and prototype make it possible for scientists to understand and define the data quality, data trust, and provenance requirements related to their research, by being able to perform queries on datasets while understanding where the datasets are located and exactly which syntax and semantic conventions they use. In reference to the issues shown in Figure 1.6, the model and prototype provide practical support as illustrated in Figure 7.1. The model and prototype could be improved in the future to include more features, such as linked queries and named parameters, that may be beneficial to scientists, as described in Section 7.3. In particular, the prototype could be extended to support different types of normalisation, including logic-based rules and query transformations.
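As a minimal illustration of the HTTP URI resolution access method used for the workflow integration described above, the sketch below resolves a record URI and reads the returned RDF document. The record URI, the class name, and the assumption that the response format is negotiated through the Accept header are illustrative; a real workflow step would hand the response to an RDF parser rather than printing its length.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    /** Illustrative sketch of resolving a prototype HTTP URI from a workflow step. */
    public class ResolveExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical record URI following the namespace:identifier convention
            URL recordUrl = new URL("http://bio2rdf.org/geneid:4129");

            HttpURLConnection connection = (HttpURLConnection) recordUrl.openConnection();
            // Assumed: the server selects an RDF serialisation based on the Accept header
            connection.setRequestProperty("Accept", "application/rdf+xml");

            StringBuilder rdfDocument = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    rdfDocument.append(line).append('\n');
                }
            }
            // The normalised RDF results would now be passed to the next workflow step
            System.out.println("Retrieved " + rdfDocument.length() + " characters of RDF");
        }
    }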

7.1 Critical reflection

The model provided a simple way to align queries with datasets, and the prototype provided verification of the model's usefulness through its use as the manager for queries to the Bio2RDF website. The prototype phase influenced the design of the model, as the implementation of various features required changes to the model. In addition, the Bio2RDF datasets were simultaneously updated, requiring the configuration for the prototype to be maintained and updated out of step with the development and release cycle.

7.1.1 Model design review

The model provides a number of loosely linked elements, centralised around providers. Although they act as a central point of linkage, providers still have a many-to-many relationship with endpoints.

Figure 7.1: Overall solutions. The figure summarises how the model and prototype address the data, medicine, science, and biology issues from Figure 1.6, through data providers and profiles, preferred and alternative namespace labels, normalisation rules (Regular Expressions and SPARQL patterns), query types, and resolvable Linked Data HTTP URIs.

Each provider may have more than one, possibly redundant, endpoint, while each endpoint may be associated with more than one provider, in the context of different namespaces and/or query types. The entry points to the model, the query types, provide a many-to-many relationship between queries and namespaces. This enables the construction of arbitrary relationships between queries and namespaces, as namespaces can be recreated using different URIs to avoid collisions for the same prefix, and normalised using the preferred and alternate prefix structures.

Although the prototype only implemented a single query-to-namespace mapping method, using Regular Expressions, new mapping methods can be created to transform namespaces using more intelligent mechanisms, including database lookups and SPARQL queries, to transform the input in arbitrary ways. In comparison to provider-based normalisation rules, this mechanism is not easy to share between query types, but in all of the Bio2RDF cases the preferred and alternate prefix feature of namespace entries was enough to normalise namespaces. In a future revision these rules could be defined separately, or normalisation rules could be integrated with both query types and providers.

The loosely linked query elements are very useful, particularly in the way query replication can be performed without having to modify hard links between queries and data sources. However, this looseness makes it difficult to determine a priori which elements will be used for a particular type of query without a concrete example, particularly in a large configuration such as that used for Bio2RDF. For example, the DOI (Digital Object Identifier) namespace was largely represented in RDF datasets using textual references. Since a query can define itself as being relevant to all namespaces, the list of namespace URIs that are attached to any query type does not immediately identify the queries that were relevant. This made it difficult to audit the DOI datasets without performing queries to see which datasets were available.

As namespaces needed to be independent of the required query model, to support arbitrary queries, a similar mechanism was created for providers: the default provider setting. In addition, to enable queries to require explicit annotation of providers with a namespace before sending a query to a provider, query types were able to specify that they were not compatible with default providers. This design choice was very useful according to the criteria for this thesis, but in the majority of cases these features were not required by Bio2RDF, and they made it slightly more difficult to verify that the configuration was acting as desired.

However, in cases where there was a definite query, it was always possible to determine the relevant query types, namespace entries, providers and normalisation rules without performing any network operations. In comparison, some federated SPARQL systems, such as SADI, cannot determine which datasources will be required for a query, as they need to continue to plan each query while they are executing it. In addition, it is impossible, by design, to get all properties of a data record from a SADI SPARQL query, as there are no unknown predicates in a query. This inherent lack of knowledge about the extent of a single query makes it difficult to replicate a query using the resulting query provenance.
The query needs to be executed again using the full SADI service registry to be correct. In comparison, queries using the model described in this thesis are replicable solely using the query provenance information, provided that the content of the datasets does not change. This makes it practical to publish the query provenance for others to use in similar ways, even if they do not use the same query, or they want to selectively add or remove providers to match their context. If the namespace parameters do not change, the query plan for the first query will be identical for future queries, eliminating the necessity of a central registry to plan similar queries after the first query.

Although it would have been possible to define parameters at the query group level, this would have a negative effect on replicability and context sensitivity. If all queries in a query group were required to accept the same parameters, then general query types, which could answer a wide range of queries, would not be allowed, and these general query types may be necessary to provide redundant access to data. The model defines parameters at the level of the query type to provide for any possible mapping between a query and the parameters for that query. This makes it possible to pass queries through directly to lower levels, or to deconstruct them into parameter sets for direct resolution. If a query type inherited its parameter mappings from its query group, these situations might not be implementable without creating an entirely new query group, which would defeat the purpose of query groups as a semantic grouping of similarly purposed query types.
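As a rough sketch of the Regular Expression based query-to-namespace mapping discussed above, the following code splits a Bio2RDF style query path into a namespace prefix and an identifier. The pattern, class name, and example input are assumptions for illustration, not the prototype's actual code.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Illustrative sketch: split a "namespace:identifier" query path into its parts. */
    public class NamespaceMatcher {
        // Assumed input structure: a namespace prefix, a colon, then the identifier
        private static final Pattern NAMESPACE_ID = Pattern.compile("^([\\w-]+):(.+)$");

        public static void main(String[] args) {
            String queryPath = "geneid:4129"; // hypothetical user query, taken from an HTTP URI path

            Matcher matcher = NAMESPACE_ID.matcher(queryPath);
            if (matcher.matches()) {
                String namespacePrefix = matcher.group(1); // "geneid"
                String identifier = matcher.group(2);      // "4129"
                // The prototype would now look up providers annotated with this namespace
                System.out.println("namespace=" + namespacePrefix + ", identifier=" + identifier);
            }
        }
    }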

7.1.2 Prototype implementation review

The prototype was implemented using Java Servlets, along with the URLRewrite library to support HTTP URI queries in a lightweight manner, and the Sesame RDF library to support internal RDF transformations. This made it very flexible during development, as the URL structures were easy to modify using URLRewrite when necessary, and the Servlets did not need to know about the query method, as they received a filtered path string. In addition, queries to the model are designed to be stateless, reducing the complexity of the application.

The processing code was implemented using a single Servlet. Each instance of the servlet used a single Settings class to maintain knowledge about the configuration files, and a similarly scoped BlacklistController class to maintain knowledge about failed queries on endpoints and the frequency of client queries. The Settings class contained maps that made it possible to efficiently look up references between configuration objects using their RDF URIs, as objects were not tightly linked in memory using Java object references.

The servlet was able to process multiple concurrent queries per second. However, the basic design would be difficult to optimise if there were a range of different normalisation rules in the common queries. For example, the RDF document needs to be processed as a stream of characters before being imported to an abstract RDF store to be processed for each query. Then, these RDF triples are integrated with RDF triples from other query types to be processed as a pool of RDF triples, in order to perform semantic normalisations that were not possible with the results from each provider. Then, in order to perform textual normalisation, the pool of RDF triples needs to be serialised to an RDF string that can have an arbitrary number of normalisations performed on it. All of the string normalisation stages could require the entire string, for example, an XSLT transformation. This made it difficult to generate a smooth pipeline.

In comparison, the BioMart transformation pipeline assumes that records are independent of each other, so that it can process them in a continuous stream, using ASCII control characters (\t for the end of a field and \n for the end of a record) in a byte stream to designate the end of each field and record. The Bio2RDF case, in contrast, could not assume the existence of relational database style records with identical fields in each record, as, for example, search services are able to return arbitrary results. It also required that results could be normalised from each endpoint individually, to fix errors in particular datasets, and that results could be normalised across all endpoints for a query, to make it possible to perform semantic queries across datasets.

The servlet regularly handled documents containing up to 30,000 RDF triples, as this range was required to process queries on the RDF triples in the configuration, although most results were limited to between 2,000 and 5,000 triples. In smaller datasets, records were very small if they were only sourced from their official dataset, but records that were highly linked to from other datasets produced much larger result documents. In other cases, records are relatively small only if internal links are not taken into account.
For example, in the Wikipedia dataset, the inter-article links are so numerous that DBpedia was forced to omit the page links dataset from the default Linked Data URI, as it was causing lightweight Linked Data clients to crash due to the large number of triples. The Bio2RDF installation of the prototype was able to process and deliver the full Wikipedia records, including the page links, although the dbpedia.org SPARQL endpoint experienced high traffic, so SPARQL queries sometimes failed to complete.
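To make the staged, string-level normalisation described in this section concrete, the sketch below applies an ordered list of regular expression rules to a serialised RDF document, using the DrugBank URI rewriting discussed in Section 7.1.6 as the example rule. The class and rule names are illustrative assumptions rather than the prototype's actual API.

    import java.util.ArrayList;
    import java.util.List;

    /** Illustrative sketch of ordered, string-level normalisation rules. */
    public class TextualNormalisation {

        /** A single regular expression rule, applied to the whole serialised document. */
        static final class RegexRule {
            final String matchPattern;
            final String replacement;

            RegexRule(String matchPattern, String replacement) {
                this.matchPattern = matchPattern;
                this.replacement = replacement;
            }

            String apply(String document) {
                return document.replaceAll(matchPattern, replacement);
            }
        }

        public static void main(String[] args) {
            // Example rule: rewrite FU-Berlin DrugBank URIs to the normalised Bio2RDF form,
            // mirroring the DB00001 example in Section 7.1.6
            List<RegexRule> rules = new ArrayList<>();
            rules.add(new RegexRule(
                    "http://www4\\.wiwiss\\.fu-berlin\\.de/drugbank/resource/drugs/",
                    "http://bio2rdf.org/drugbank_drugs:"));

            String serialisedResults =
                    "<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00001> "
                    + "<http://www.w3.org/2000/01/rdf-schema#label> \"example label\" .";

            // Rules are applied in order; each one may need the entire document
            for (RegexRule rule : rules) {
                serialisedResults = rule.apply(serialisedResults);
            }
            System.out.println(serialisedResults);
        }
    }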

7.1.3 Configuration maintenance

The configuration defines the types of queries, data providers, normalisation rules, and namespace entries that are to be used to resolve queries. It is flexible, and, when used with each user's set of profiles, it can be applied differently based on the user's context, rather than on something that is built into the basic configuration.

In comparison to Federated SPARQL systems, it is not limited by the basic descriptions of the statements and classes in an endpoint. Normalisation rules can be written to reform the URIs, properties and classes that are the core of Federated SPARQL configurations. However, this flexibility requires more effort, as Federated SPARQL descriptions can be completely generated without human input, and most are statically generated by producers rather than generated by users.
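A rough sketch of the configuration elements named above, and of their loose, URI-based linkage, is given below. The field names are assumptions derived from the model description in Chapter 3, not the prototype's actual classes, and Java records are used purely for brevity.

    import java.net.URI;
    import java.util.List;

    /** Illustrative sketch of the main configuration elements, linked to each other by URI. */
    public class ConfigurationSketch {

        record NamespaceEntry(URI uri, String preferredPrefix, List<String> alternatePrefixes) {}

        record NormalisationRule(URI uri, String matchRegex, String replacement, int stageOrder) {}

        record Provider(URI uri, List<URI> endpointUrls, List<URI> namespaces,
                        List<URI> queryTypes, List<URI> normalisationRules) {}

        record QueryType(URI uri, String templateString, List<URI> namespaces) {}

        record Profile(URI uri, List<URI> includedElements, List<URI> excludedElements) {}
    }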

In order to automatically generate configuration information for the prototype, additional information is required, defining the rules and queries that would be used for each namespace. This registry would then be used together with Federated SPARQL descriptions to generate a complete configuration, which could then be selectively customised by scientists using profiles.

The majority of the configuration management tasks involve identifying changes to datasets and deciding how to either take advantage of a change, or isolate it if it will interfere with current queries. For example, if a new data quality issue were identified for an endpoint, a new normalisation rule could be added to the relevant data providers. If a data provider was previously thought to be consistently applicable to a range of endpoints, the data provider may need to be split into two data providers to isolate the data quality issue.

The RDF statements that make up the prototype configuration files can be validated using a schema. The prototype schema is generated dynamically by the prototype itself. The current version can be resolved from http://purl.org/queryall/, which in turn resolves the schema using an instance of the prototype. Past versions can be generated using past prototype versions, or by passing the current prototype the desired configuration version as part of the schema request.

7.1.4 Trust metrics

Trust metrics are used to give users opinions about the trustworthiness of items of unknown quality. They generally take the form of numeric scales; for example, there may be a 5 star hotel or an 8 star movie. The model does not attempt to define trust metrics internally. In centralised registries it is appropriate to compile information from all users and generate ratings based on the wisdom of the crowds. However, the model is designed to be extensible based on unique contexts. If a Web Service is given a trust metric based on its uptime, that metric may not be relevant to a user that reimplemented the service exactly in their own location. Similarly, if a user perceives that a data provider is generating incorrect, or non-standard, data, they may rate the data provider lowly. However, the model provides the opportunity for others to generate wrappers to improve data quality, using normalisation rules for transformations. The wrappers are defined in a declarative manner, enabling their authors to publish the resulting rules for others to examine and reuse. If others reuse the wrapper, then effectively they are trusting the dataset. This contradicts the trust opinions given to the basic data provider interface, even though the data provider is providing the same information.

Instead, the trust in a data provider was defined based solely on whether the dataset was used or not. Although this definition provides clues about useful data providers, it does not reliably identify false negatives. That is, it does not correctly categorise providers that are not used, but that would be trusted if there were no other contextual factors, such as the performance capacity of the provider or the physical distance to the provider.

The data trust research question, defined in Section 1.3, relies on scientists identifying the positive aspects of trust. Scientists are able to identify trusted datasets and queries by examining published configurations that used the datasets or queries. They can then form their own opinions to generate further publications, which would increase the rank of a particular provider in the view of future peers.

7.1.5 Replicable queries

The model provides a framework for scientists to use when they wish to define and communicate details of their experiments to other scientists. The query provenance documents combine the replication details with the data cleaning methods that were thought to be necessary for the queries to be successful and useful. They also contain indications about positive trust ratings and the general data quality of each data provider.

The prototype can take query provenance details and execute the original query in the current scientist's context, with selections and deletions defined as extra RDF triples. That is, future scientists do not have to remove any triples from a query provenance document to replicate it, as any changes are additions, together with changes to the profiles used to replicate the queries. If a new normalisation rule is needed, it can be added to the original provider by extending the original provider definition with an extra RDF triple. This extension can occur in another document if necessary, as the RDF triples are all effectively linked after they are loaded into an RDF triplestore by the prototype. In addition, non-RDF systems that scientists may currently use can be integrated using normalisation rules to transform the content to RDF, including XML using XSLT, and textual transformations for formats such as CSV [88], or they can be accessed using queries to conversion software such as BioMart [130] or D2R [29]. Any textual queries are supported using the model, including URLs and HTTP POST queries. If a query does not function properly in the scientist's context, they can ignore the query type using their profile and create a replacement using a different URI. They can then link this new query type URI into the original provider.
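As a small illustration of extending a published provider definition by addition only, the sketch below builds an extra N-Triples statement, kept in a separate document, that attaches a new normalisation rule to an existing provider. All URIs and the property name are hypothetical; they stand in for whatever vocabulary the configuration actually uses.

    /** Illustrative sketch: extend a published provider definition with an extra statement,
     *  without modifying the original provenance document. All URIs are hypothetical. */
    public class ProvenanceExtension {
        public static void main(String[] args) {
            String providerUri = "http://example.org/provenance/provider/drugbankSparql";
            String ruleUri = "http://example.org/myLab/rule/fixDrugbankUris";

            // An additional N-Triples statement kept in a separate document; when both
            // documents are loaded into the same triplestore, the provider gains the rule.
            String extraStatement = "<" + providerUri + "> "
                    + "<http://example.org/queryall/includedNormalisationRule> "
                    + "<" + ruleUri + "> .";

            System.out.println(extraStatement);
        }
    }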

7.1.6 Prominent data quality issues

The use of the prototype as the engine for the Bio2RDF website revealed a range of data quality issues, many of which were introduced in Section 1.2. These included data URIs that did not match between different locations, a range of predicates that were either not defined in ontologies or defined to be redundant with other ontologies, and a lack of consistency in the use of some predicates between locations.

The issue of different URIs for the same data records was identified as a major issue by the Bio2RDF community. Although it was not an issue for the Bio2RDF datasets themselves, it affected the ability to source extra links from other providers. The Bio2RDF resolver has a major requirement to be a single resolver for information about different data records, while including references to the original URIs after the normalised URI is resolved, whether they are resolvable or not. This requirement made it necessary to include both the original URI and the normalised URI in some cases, meaning that the URI normalisation could not be applied in a single step. Some statements needed to be included without being normalised together with the results from each endpoint. In order to allow for all possibilities, two stages were included after these statements were inserted: one with access to the RDF triples, and one with access to the serialised results in whatever format was requested. For example, http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00001 was equivalent to http://bio2rdf.org/drugbank_drugs:DB00001, and when the original RDF triples were imported the URI needed to change, but after the equivalent RDF triples were inserted, the URIs needed not to change. This was achieved by applying the URI normalisation rules to the “after results import” stage, before the equivalence triples were inserted, with further normalisation done in the “after results to pool” and “after results to document” stages.

The Bio2RDF datasets did, however, have issues with inconsistent URIs over time for predicates and classes. These issues were solved over a period of time, as best practices were identified in the larger RDF community. However, the lack of resources for maintenance of the RDF datasets meant that the Bio2RDF resolver needed to recognise and normalise five different URI styles while the migrations of older datasets were performed. For example, an ontology URI was originally written as http://bio2rdf.org/bio2rdf#property. This was acceptable, but not ideal, as the property would not be resolved, due to the fragment “#property” being stripped before HTTP resolution. In order to isolate these cases from other URI resolutions, the ontology URI standard was changed to http://bio2rdf.org/ns/bio2rdf#property. The base URL in this URI was designed to resolve to a full ontology document containing a definition for the property. However, this was not ideal in cases where the property needed to be resolved specifically, so it was changed to http://bio2rdf.org/ns/bio2rdf:property. This URI was completely resolvable and distinguishable from other URIs, but it was not consistent with some other ontologies that were defined using the normal Bio2RDF URI format, http://bio2rdf.org/namespace:identifier. In response to this, a new namespace was created for each Bio2RDF ontology, for example, http://bio2rdf.org/bio2rdf_resource:property. At one stage another alternative was also experimentally used:
http://ontology.bio2rdf.org/bio2rdf:property; however, the use of a single domain name for all URIs was too valuable to ignore. The normalisation rules for each of these URI styles were, in practice, applied in series to eventually arrive at the current URI structure, although they could all have been modified to point directly to the current URI format.

In some cases, datasets define their own properties that overlap with current properties. Although this research did not aim to provide a general solution to ontology equivalence, the normalisation rules, particularly SPARQL rules, can be used to translate between different, but equivalent, sets of RDF triples. It is particularly important for the future of Linked Data that there are standard properties that can be relied on to be consistent. For example, the RDF Schema label property could be relied on to contain a human readable label for a URI in RDF, although its meaning was redefined by OWL2, and it can hence no longer be used for non-ontology purposes in cases where OWL2 may be used.

One term in particular was identified as being inconsistently interpreted, possibly due to a lack of specification and best practice information. The Dublin Core “identifier” property is designed to provide a way to uniquely identify an item in cases where the URI may be different. For example, it may be used to provide the unique identifier from the dataset that the item was originally derived from; in the drugbank case above this would be “DB00001”. However, it may also be used to represent the namespace prefixed identifier, for example, “drugbank_drugs:DB00001”. The SADI designers favour another specification that matches their Federated SPARQL strategy: a blank node with a key-value pair identifying the dataset using a URI in one triple and the identifier in another triple attached to the blank node4. Although this functionality may be useful, it requires a single URI to identify the dataset, which in turn requires everyone to accept a single URI for every dataset. In the Bio2RDF experience outlined here, that is unlikely to ever happen.

7.2 Future research

This thesis produced a viable model for scientists to use in cases where other scientists will need to replicate the research using their own resources. It incorporates the necessary elements for dynamically cleaning data in a declarative manner, so that other scientists could replicate and extend the research in future. However, it left open research questions related to studying the behaviour of scientists using the model with respect to trust and data provenance.

The model provides the necessary functions for scientists to individually trust any or all of the major elements, including query types, providers and normalisation rules. The trusted elements would then be represented in the query provenance record that a model implementation could generate for each query. Scientists could substitute their own implementations for these elements without removing the original implementations from the original provenance records. This thesis did not study the reactions of scientists to the complexity surrounding this functionality, as it enables arbitrary changes to both queries and data, in any sequence.

For example, there were two flags implemented in the prototype to control the implementation of the profiles feature. The first defined whether an element should be included if it implicitly matched a profile, and the second defined whether an element should be included if it was not excluded by any profile and did not implicitly match any profile. Although these rules are simple to define, different combinations of these two flags may change the results of a query in a range of different ways. These flags were used to simplify the maintenance of the Bio2RDF system, as they reduced the complexity of the resulting profiles. The first flag, “include when implicitly matching”, was enabled, and the second flag, “include when no profiles match”, was disabled.
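A minimal sketch of how the two flags might combine when deciding whether a configuration element is used is given below. The precedence of explicit inclusions and exclusions, and all method and parameter names, are assumptions rather than the prototype's actual implementation.

    /** Illustrative sketch of the two overall profile inclusion flags described above. */
    public class ProfileInclusion {

        /** Decide whether a configuration element (query type, provider, or rule) is used. */
        static boolean includeElement(boolean excludedByAnyProfile,
                                      boolean explicitlyIncluded,
                                      boolean implicitlyMatchesProfile,
                                      boolean includeWhenImplicitlyMatching,
                                      boolean includeWhenNoProfilesMatch) {
            if (excludedByAnyProfile) {
                return false;              // assumed: explicit exclusions always win
            }
            if (explicitlyIncluded) {
                return true;               // assumed: explicit inclusions always win
            }
            if (implicitlyMatchesProfile) {
                return includeWhenImplicitlyMatching;   // first flag
            }
            return includeWhenNoProfilesMatch;          // second flag
        }

        public static void main(String[] args) {
            // The Bio2RDF maintenance setting described above: first flag enabled, second disabled
            System.out.println(includeElement(false, false, true, true, false));  // true
            System.out.println(includeElement(false, false, false, true, false)); // false
        }
    }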

4https://groups.google.com/group/sadi-discuss/msg/d3e4289082098428

Future research could study the behaviour of the model using different combinations of these flags, in combination with the settings on each profile defining whether to implicitly include any query type or provider that advertised itself as being suitable for implicit inclusion. Although they were mostly implemented to simplify maintenance, the flags, along with other flags inside named profiles, enable scientists to dynamically switch between different levels of trust without changing the other configuration information. A very low level of trust would set both of the overall flags to disabled and include a profile with all of its flags set to disabled. This level of trust would require scientists to specifically include all of the query types, providers and normalisation rules that they wish to use. In comparison, a very high level of trust may be represented by setting both of the overall flags to enabled, without using any profiles.

The query provenance records produced by the prototype included the subset of the complete configuration that the scientist had enabled at the time they executed the request. This may reduce the ability of scientists to change the trust levels when replicating queries using the provenance record, as setting broader levels of trust will only ever include the elements in the record, even if there were other possibilities in the prototype configuration. If, in the Bio2RDF example, the entire configuration were included in each provenance record, the provenance size would increase, depending on which elements matched a particular query. The size increase could be significant in terms of the future processing of the query using the provenance record, and in terms of storing and publishing the provenance record.

Future research could investigate the viability of this model, and the viability of a large scale combined data and query provenance model, with respect to its usefulness and economic cost. For example, the query provenance, including the entire Bio2RDF configuration so that future scientists could fully experiment with novel profile and trust combinations, could be stored in a database. If it were stored using a single graph for each query provenance RDF record, then, given the 7.4 million queries during the 11 month statistics gathering period and the approximately 30 thousand triples in the configuration, there would be in excess of 222 billion triples in the resulting database. There were additional triples for each provenance record detailing the actual endpoints that were used and the exact queries that were sent to each endpoint, which would increase the actual number of triples. For the evaluation of this thesis, a reduced set of this provenance was stored in a database, with the records only containing URIs pointing to the full query type and provider definitions that would be present in the full record.

Most provenance research assumes that this cost will either be offset by the usefulness of the provenance in future, or be taken up by a new commercial provenance industry, perhaps modelled on the current advertising industry. A full examination of this would be required before recommending any large scale use of provenance information as a method of satisfying trust and replicability requirements. In comparison, the publication of a single large
configuration, such as the Bio2RDF configuration, along with the methods used for each of the queries used to produce a given scientific conclusion, would enable flexible replicability using a range of trust levels, without incurring the permanent storage cost associated with permanent provenance, as 30 thousand triples can be stored and communicated relatively easily compared to the alternative of 222 billion.

In the Bio2RDF configuration there was a single profile for each of the three mirrors, along with a single default profile for other users of the prototype. These profiles were used to exclude items from the configuration based on knowledge about the location of the provider. Most legitimate changes in the context of the Bio2RDF mirrors were made by changing the master configuration, as they were bug fixes that improved the results of a query. If a provider was no longer available, it was set to be excluded by default, without being removed. This made it possible to keep knowledge of the provider, if, for example, it reappeared at a different location, without continuing to attempt to use the provider. If a dataset changed so that a normalisation rule was no longer required, the rule was either changed to be excluded by default, or removed from the provider, depending on whether it might have some use again in the future. Future queries against the Bio2RDF website would use the new configuration; however, past provenance records would still include their original elements, even if the query could not be easily replicated due to the permanent disappearance of a provider, or a non-backwards-compatible data change.

Future research could examine the effects of modifications to the Bio2RDF master configuration, perhaps in conjunction with past research into the migration of ontologies [136]. There are conflicting objectives influencing the migration of configurations using the model. Firstly, if a public provider changes their data, then the URI for the provider would not be changed, or current profiles would be affected. The normalisation rules and namespaces applied to the provider would be added or deleted to match the real changes. However, if the old version of the provider was referenced in past provenance records, then those records could not be included together with the current configuration, or they could introduce conflicting normalisation rules with unknown combined effects. In this case it would be advantageous to change the URI of the provider; however, this would affect profiles that trusted the provider in both cases, as they may need to be changed to explicitly include the current URI to continue to have the same effect.

The research in this thesis does not examine ways to include and use identity information in configurations, other than to include it in namespaces to recognise the authority that defined a particular namespace. Future research could extend this thesis to examine the benefits of adding identity information, including individual scientists and the organisations they work for. One benefit might be to automatically identify the necessary citations for publications, based on the use of a particular group of datasets. Another might be to recognise scientific laboratory contributions, including the actual contributions from individual scientists in a team, which may currently only be identified as contributions from the authors on a resulting publication.

7.3 Future model and prototype extensions

The model may be extended in future to suit different scenarios, including those where complex novel queries are processed using information that is known about data providers. As part of this it may be extended to include references to statistics as a way of optimising large multi-dataset queries. The prototype can be extended to provide different ways of creating templates and different types of normalisation rules to suit any of the current normalisation stages. These changes would require the normalisation process to be intelligently applied to both the queries and the results of queries, although the way this may work is not immediately obvious. The independence of namespaces from normalisation rules may be an issue in automatically generating queries, as currently the prototype only includes an informative link between normalisation rules and any namespaces that they are related to. The rationale for this is that the namespaces make it possible to identify many different datasets using a single identifier, while the normalisation rules independently make location specific queries functional in some locations without any reference to other locations. Future research could examine this aspect, along with other semantic extensions to the model to facilitate semantic reasoning on the RDF triples that make up the configuration of the prototype.

The scientific community has endeavoured on a number of occasions to create schemes that require a single URI for all objects, regardless of how that would affect information retrieval using the URIs, as the gains for federated queries are thought to be more valuable than the losses for context dependent queries and URI resolution. Although the model encourages a normalised URI scheme for queries, to effectively translate queries and join results using the normalised data model across endpoints, it does not require it, so scientists can arbitrarily create rules and use URIs that manipulate queries and results outside of the normalised URI scheme.

The model could also be extended to recognise links between namespaces to indicate which namespaces are synonyms or subsets of each other. This could be used by scientists to identify data providers as being relevant, even if the provider did not indicate that the namespace was relevant. This would enable scientists to at least discover alternative data providers, even if the related queries and normalisation rules may not be directly applicable, as they may be specific to another normalised URI scheme.

The implementation of named parameters would need to ensure that if two services were implemented differently, they could still be used in parallel without having conflicting named parameters. Regular Expression query types are currently limited to numerically indexed parameters. If other query types wished to be backwards or cross compatible with Regular Expression query types, they would need to support similar parameters, for example, the “input_NN” convention used by the Regular Expression query types. Named parameters may restrict the way future scientists replicate queries if they have semantic connotations. For example, the two simplest named parameters, “namespace” and “identifier” from http://example.org/${namespace}:${identifier}, may conflict if “identifier” actually meant a namespace.
For example, to get data about “namespace”, one may use http://example.org/${ns}:${namespace}, which semantically connotes two namespaces, “ns” and “namespace”, resulting in a possible semantic conflict with other query types.

The current model is designed as a way of accessing data for single step queries, as opposed to multi-step workflows. However, the complexity of queries can be arbitrarily defined according to the behaviour of the query on a provider, with ordered normalisation rules filtering the results. In future, this may be extended to include linked queries that directly reference other queries. The model for this behaviour would need to decide what information would be required for scientists to be able to create new sub-queries without having to change the main query to fit their context. The current model assumes that the parameters are not directly related to a particular query interface, so that they can be interpreted differently based on the context of the user. This is similar to workflow management systems and programming languages, which both require particular query interfaces to be in place to enable different contextual implementations. This research, however, aimed to explore a different method that removed the direct connection between the user's query and the way data is structured in physical data providers. Given that this freedom was based on the goal of context sensitivity, the model may not be viable for use with linked queries. It may, however, be reasonable to modify the model to include explicit interfaces that query types implement, as the query implementation would be separated from the interface and the data providers. In effect, the model only requires that the separation occur so that each part can be freely substituted based on the scientist's context, without materially changing the results if the scientist can determine an equivalent way to derive the same results.

The model and prototype can be used to promote the reuse of identifiers and the standardisation of URIs. Perhaps surprisingly, normalisation rules can be used to integrate systems that would not otherwise be linked using URIs. This is due to the fact that, at the point that the RDF statements from the provider are combined with other results, novel unnormalised RDF statements relating the normalised URIs to any other equivalent strings or URIs can also be inserted into the pool of RDF results. This provides a solution to the issue of URI sustainability, as in the future any other normalised URI scheme can provide normalisation rules that map the future scheme back to current URIs if a particular organisation did not maintain or update their data using currently recognised URIs. Using this system, scientists can publish their data using URIs that have not yet been approved by a standards body, with the knowledge that they can migrate in future without reducing the usefulness of their queries.

The prototype should be extended in future to allow static unnormalised RDF statements to derive information from the results of queries. This could be implemented by adding an optional SPARQL SELECT template to each query type that could be used to fill template values dynamically, based on the actual results returned from the provider, between the parsing and inclusion-in-the-pool normalisation stages.
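The sketch below illustrates, under assumed names and URIs, how such an optional SPARQL SELECT template could supply values for a static, unnormalised statement template, anticipating the DailyMed example discussed next. It is not the prototype's implementation; the predicate and namespace URIs are hypothetical, and the query binding is hard-coded where a real implementation would evaluate the SELECT query against the parsed result triples.

    /** Illustrative sketch of a SPARQL SELECT template feeding a static statement template. */
    public class SelectTemplateSketch {
        public static void main(String[] args) {
            // A SELECT template that would be evaluated over the parsed provider results
            String selectTemplate =
                    "SELECT ?setId WHERE { ?record <http://example.org/vocab/setId> ?setId }";

            // A static statement template whose variables are filled from the SELECT results
            String staticTemplate =
                    "<http://example.org/dailymed_uuid:${inputId}> "
                  + "<http://www.w3.org/2002/07/owl#sameAs> "
                  + "<http://example.org/dailymed_setid:${setId}> .";

            // Pretend the SELECT query returned this binding from the provider's results
            String boundSetId = "6788";
            String filled = staticTemplate
                    .replace("${inputId}", "AC387AA0-3F04-4865-A913-DB6ED6F4FDC5")
                    .replace("${setId}", boundSetId);

            System.out.println(filled);
        }
    }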
This method would provide further value to the model, as it allows scientists to derive partially normalised results without relying on variables being present in their query. This would make it possible to dynamically insert unnormalised links to URIs that are not transformable using textual methods. For example, this would allow scientists to dynamically derive the relationship between the two incompatible DailyMed reference schemes shown in Figure 1.11 (i.e., “AC387AA0-3F04-4865-A913-DB6ED6F4FDC5” is textually incompatible with “6788”), even if the rest of the result statements contained normalised URIs.

The system does not need to rely on Linked Data HTTP URIs and RDF. If another well known resolution method, with a suitably arbitrary data model, is developed in the future, the prototype could easily be changed to suit the situation, and the model may not be affected. The model does not rely on HTTP requests for input, so it does not need to follow other web conventions regarding the meaning of HTTP requests for different items. For example, the model could be designed to fit with 303 redirects, but it would be entirely arbitrary to require this for direct Java queries into the system, so it does not require it. The metadata which forms the basis for the extended 303 discussion can be transported across a different channel as necessary. It is up to the scientist using the data to keep the RDF triples relating to the document separate from the RDF triples relating to the document provenance, if they do not wish to process them together.

The model allows scientists to change where their data is provided without changing their URIs, or relying on entire communities accepting the new source as authoritative for a dataset. For example, some schemes that are not designed to be directly resolvable to information, such as ISBNs, may be resolvable using various methods now and in the future without the scientist changing either their workflows or their preferred URI structure. This eliminates the major issue that is limiting the success of distributed RDF based queries, without imposing a new order such as LSIDs.

The prototype focuses on RDF as its authoritative communication syntax because of its simple extensibility and relatively well supported community. If there were another syntax that both allowed multiple information sources to be merged by machine into a useful combined document, without semantic conflict, and enabled universal references, such as those provided by URIs, it may be interchangeable with RDF as the single format of choice. The provision for multiple file formats for the RDF syntax may also be used in future to generate more concise representations of information, as the currently standardised file formats, RDF/XML and N-Triples, are verbose and may prevent the transfer of large amounts of information as efficiently as a binary RDF format, for example.

The prototype is designed to be backwards compatible as far as possible with older style configurations, since the configuration syntax changed as the model and prototype developed. The configuration file is designed based on the features in the model, making it possible for other implementations to take the configuration file and process it using the model theory as the semantic reference. The prototype has the ability to recognise configuration files designed for other implementations, although it may not be able to use them if they require functionality that is not implemented in, or is incompatible with, the prototype.
The prototype was used to evaluate practical methods for distributing trusted information about which queries are useful on different data providers, including references to the rules that scientists used with the data and the degree of trust in a data provider. However, the profiles mechanism is currently limited to include and exclude instructions; these do not directly define particular levels of trust, other than through the use of the profile by scientists in their work. It could be extended in future to provide a semantically rich reference to the meaning of a profile inclusion or exclusion. This would give scientists additional trust criteria when deciding whether a profile is applicable to them.

In this study, a constraint that was used for profile selection by the configuration maintainers was the physical location of a data provider. This constraint was not related to the main objectives of data quality, data trust and replicability, although it was related to the context sensitivity objective. The profiles were not actually annotated with the ideal physical location for users of the profile, as this was deemed not to be vital to the research goals and there were only three locations in the Bio2RDF case study. Although the prototype would still need to make a binary decision about whether to include or exclude an item, the scientist could have different operating criteria for the prototype based on the distance to the provider. This may be implemented in future for each instance of the prototype, or it may be implemented using additional parameters to each query. For example, the IP address of the request could be used with geolocation techniques to find the best location to redirect the user to, or the public IP address of the server could be used to geolocate the closest data providers, based on geographic locations attached to each provider.

The model is agnostic to the nature of the normalisation rules, making it simple to implement rules for more methods than just Regular Expressions and SPARQL CONSTRUCT queries. In particular, the prototype could be extended in future to support logic-based rules, either to decide whether entire results were valid, or to eliminate inconsistent statements based on the logic rules. Validation is possible using the model currently, as a normalisation rule could simply remove all of the invalid result statements from a given provider so that they would not be included, or it could include RDF triples in the response indicating a warning for possibly invalid results. However, it would also be possible to have formal validation rules, without modifying the provider interface, as the rule interface and its implementation can be extended independently.

7.4 Scientific publication changes

The scientific method, particularly with regard to peer review, has developed to match its environment, with funding models favouring scientists who continually generate new results and publish in the most prestigious journals and conferences [43]. The growth of the Internet as a publishing medium is gradually changing the status quo that academic publishers have previously sustained. In some academic disciplines, notably maths and physics, free and open publication databases such as arXiv5 are gradually displacing commercially published journals as the most common communication channel. In the case of arXiv, the operational costs are mostly paid for by the institutions that use the database the most6. In the absence of a commercial profit motive, arXiv publishes articles using an instant publication system that is very low cost.

Given that much of the funding for research is provided by governments, ideally the results of the research should be publicly available7. However, the initial issues with open access publication, the perceived absence of a genuine business model for open access publishers, and the absence of an authentic open peer review system8, make it difficult to resolve underlying issues relating to data access and automated replication of analysis processes. The model proposed in this thesis encourages scientists to link data, along with publishing the queries related to a publication. In doing so, it allows scientists to freely distribute their methodology, both attached to a publication and on its own, if they annotate the process description. Even if the underlying data is sensitive and private, the publication of the methodology in a computer understandable form allows for limited verification. Ideally, datasets should be published along with articles and computer understandable process models, to allow simple, complete replication. Both publishers and specialised data repositories such as Dryad [140]9 or DataCite [33]10 could fill this niche.

The current scientific culture makes it difficult to change the system, even though technologically there is no reason why a low cost electronic publication model could not be sustainable. Critical peer review systems on the internet tend to be anonymous, as are many traditional peer review methods. However, on the internet, peer review systems typically do not include verification of the peer's identity. In groups of scientists within a field this should not be an issue, and there is technology available to determine trust levels in an otherwise anonymous internet system, via PGP key signing as part of a “Web of Trust” [56]. In the context of this thesis, this encourages scientists to cross-verify publications that they have personally replicated using the published methodology, although possibly in a different context. However, there is currently no financial incentive to verify published research, as grants are mostly given for new research. Lowering the barriers to replication and continuous peer review will reduce the financial disincentive, although it remains to be seen whether it would be sufficient to change the current scientific publishing culture.

Current scientific culture is also designed to encourage a different perception of career researchers, compared with professional scientists or engineers. This may be evidenced in the order that authors are named on publications, even if it was patently obvious

5http://arxiv.org/
6http://arxiv.org/help/support/arxiv_busplan_July2011
7http://www.researchresearch.com/index.php?option=com_news&template=rr_2col&view=article&articleId=1102230
8http://www.guardian.co.uk/science/2011/sep/05/publish-perish-peer-review-science
9http://datadryad.org/
10http://www.datacite.org

to their peers and readers of the publication that they did not perform the bulk of the research. It may also be evidenced by the omission of lab assistants from publication credit completely, in cases where publications are monetarily valuable. In the context of this research, it may prevent the direct identification of query provenance, as the raw data contributions in a publication may be attributed to the lead author on the publication, which would make it difficult to investigate the actual collection methods if they did not actually collect the data. Researchers require the publication in order to continue receiving grants, while the income of professional scientists is linked to an employment contract that does not include publications as a success criterion. In evolutionary terms, this design favours the recognition of the career researchers who analyse and integrate the results. It offers no competitive advantage to those who do not succeed or fail based on the results. In terms of data provenance, this results in the data being attributed to the scientist, which may be accurate, considering their role in cleaning the raw data. Data provenance may be more important than query provenance in terms of verification; however, in terms of replicability, query provenance may be more useful, as it is more clearly useful for machines. Data provenance is generally informational; however, it could be used by machines to select and verify versions of data items and authors if a viable model and implementation are created in future.

Appendix A

Glossary

BGP A Basic Graph Pattern in SPARQL that defines the way the SPARQL query maps onto the underlying RDF graph.

Context Any environmental factor that affects the way a scientist wishes to perform their experiment. These may include location, time, network access to data items, or the replication of experiments with improved methodologies and a different view on the data quality of particular datasets.

Data item A set of properties and values reflecting some scientific data that are accessed as a group using a unique identifier from a linked dataset.

Data provider An element of the model proposed in this thesis to represent the normalisation rules and query types on particular namespaces that are relevant to a given location.

Link The use of an identifier for another data item in the properties attached to a data item in a dataset.

Linked dataset Any set of data that contains links to other data items using identifiers from the other datasets.

Namespace An element of the model proposed in this thesis to represent parts of linked datasets that contain unique identifiers, to make it possible to directly link to these items using only knowledge of the namespace and the identifier.

Normalisation rule An element of the model proposed in this thesis to represent any rules that are needed to transform the data that is sent to endpoints and the data that is returned from them.
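
To make the role of a normalisation rule concrete, the following Python sketch applies a regular-expression rewrite to a query before it is sent to an endpoint, and the reverse rewrite to results that come back. The URI patterns and the dictionary layout are illustrative assumptions, not the prototype's actual rule syntax.

```python
import re

# A hypothetical normalisation rule: rewrite Bio2RDF-style UniProt URIs into the
# form an endpoint expects before querying, and reverse the rewrite on results.
RULE = {
    "input_match": r"http://bio2rdf\.org/uniprot:(\w+)",
    "input_replace": r"http://purl.uniprot.org/uniprot/\1",
    "output_match": r"http://purl\.uniprot\.org/uniprot/(\w+)",
    "output_replace": r"http://bio2rdf.org/uniprot:\1",
}

def normalise_input(query_text: str) -> str:
    """Rewrite URIs in the query so that the endpoint recognises them."""
    return re.sub(RULE["input_match"], RULE["input_replace"], query_text)

def normalise_output(result_text: str) -> str:
    """Rewrite URIs in the results back into the normalised form."""
    return re.sub(RULE["output_match"], RULE["output_replace"], result_text)

if __name__ == "__main__":
    print(normalise_input("DESCRIBE <http://bio2rdf.org/uniprot:P05067>"))
    print(normalise_output("<http://purl.uniprot.org/uniprot/P05067>"))
```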

Profiles An element of the model proposed in this thesis to represent the contextual wishes of a scientist using the system in a given context.
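
As a rough illustration of how a profile can express a scientist's contextual wishes, the sketch below filters a list of candidate data providers using explicit include and exclude sets. The provider names, dictionary keys, and default-inclusion behaviour are assumptions made for this example rather than the semantics of the prototype.

```python
# A hypothetical profile: include/exclude sets plus a default behaviour for
# providers that the profile does not mention explicitly.
PROFILE = {
    "name": "local-mirror",
    "include_providers": {"local-uniprot-endpoint"},
    "exclude_providers": {"remote-uniprot-endpoint"},
    "include_by_default": True,
}

def select_providers(candidates, profile):
    """Drop excluded providers, keep included ones, and fall back to the
    profile's default behaviour for providers it does not mention."""
    selected = []
    for provider in candidates:
        if provider in profile["exclude_providers"]:
            continue
        if provider in profile["include_providers"] or profile["include_by_default"]:
            selected.append(provider)
    return selected

if __name__ == "__main__":
    candidates = ["local-uniprot-endpoint", "remote-uniprot-endpoint", "ncbi-geneid-endpoint"]
    print(select_providers(candidates, PROFILE))
    # ['local-uniprot-endpoint', 'ncbi-geneid-endpoint']
```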

Query type An element of the model proposed in this thesis to represent a query template and associated information that is used to identify data providers that may be relevant to the query type.
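
The sketch below shows the general shape of a query type as a parameterised template restricted to the namespaces it handles. The template text, placeholder names, and namespace list are invented for illustration and do not reproduce the prototype's configuration format.

```python
from string import Template

# A hypothetical query type: a SPARQL template plus the namespaces it handles.
DESCRIBE_ITEM = {
    "name": "describe-item",
    "template": Template("DESCRIBE <http://bio2rdf.org/${namespace}:${identifier}>"),
    "namespaces": {"uniprot", "geneid", "obo"},
}

def build_query(query_type, namespace, identifier):
    """Instantiate the template, but only for namespaces this query type handles."""
    if namespace not in query_type["namespaces"]:
        raise ValueError(f"{namespace!r} is not handled by {query_type['name']!r}")
    return query_type["template"].substitute(namespace=namespace, identifier=identifier)

if __name__ == "__main__":
    print(build_query(DESCRIBE_ITEM, "geneid", "4129"))
    # DESCRIBE <http://bio2rdf.org/geneid:4129>
```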


RDF Resource Description Framework: A generic data model used to represent linked data, including direct links between datasources using URIs. It is used by the prototype to make it possible to directly integrate data from different data providers into a single set of results for each scientist’s query.

Scientist Any member of a research team who is tasked with querying data and aggregating the results.

SPARQL SPARQL Query Language for RDF: A query language that matches graph patterns against sets of RDF triples to resolve a query.
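
As an illustration of the BGP and SPARQL entries, the following sketch builds a two-triple RDF graph with the rdflib Python library (assumed to be installed) and runs a SPARQL query whose WHERE clause is a single basic graph pattern; the example URIs are invented.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/")

# A tiny RDF graph: two triples describing one hypothetical data item.
g = Graph()
g.add((EX["geneid:4129"], RDFS.label, Literal("MAOB")))
g.add((EX["geneid:4129"], EX.linkedTo, URIRef("http://example.org/uniprot:P27338")))

# The WHERE clause is a basic graph pattern: two triple patterns sharing the
# variable ?item, matched against the underlying RDF graph.
query = """
PREFIX ex: <http://example.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?item ?label ?target
WHERE {
    ?item rdfs:label ?label .
    ?item ex:linkedTo ?target .
}
"""

for row in g.query(query):
    print(row.item, row.label, row.target)
```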

URI Uniform Resource Identifier: A string of characters that is used to identify shared items on the internet. URIs are used as the standard syntax for identifying unique items in RDF and SPARQL.

Appendix B

Common Ontologies

There are a number of common ontologies that have been used to make RDF documents on the internet recognisable by different users. RDF documents containing terms from these ontologies are meaningful to different users through a shared understanding of the meaning surrounding each term.
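
As a small, self-contained example of how shared vocabularies make an RDF document interpretable by different users, the sketch below describes a hypothetical document and its author using the DCTERMS and FOAF namespaces bundled with the rdflib Python library (assumed to be installed); the resource URIs and literal values are invented.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, FOAF, RDF

g = Graph()
doc = URIRef("http://example.org/documents/1")
author = URIRef("http://example.org/people/alice")

# Because dcterms and foaf are widely shared ontologies, any consumer that
# understands those vocabularies can interpret this description without any
# prior agreement with its author.
g.add((doc, RDF.type, FOAF.Document))
g.add((doc, DCTERMS.title, Literal("An example journal article")))
g.add((doc, DCTERMS.creator, author))
g.add((author, RDF.type, FOAF.Person))
g.add((author, FOAF.name, Literal("Alice Example")))

print(g.serialize(format="turtle"))
```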

RDF Resource Description Framework Syntax: The basic RDF ontology is represented in RDF, requiring user agents to understand its basic elements without reference to any other external specification.¹

RDFS RDF Schema: Provides a basic set of terms which are required to describe RDF documents and to represent restrictions on items inside of RDF documents.²

OWL Web Ontology Language: Enables more complex assertions about data than are possible in RDF Schema. It is generally assumed to be the common language for new web ontologies, although some endeavour to represent their ontologies in RDFS because its restricted format makes reasoning more reliable. The use of some limited subsets of OWL is guaranteed to provide semantically complete reasoning.³

DC Dublin Core Elements: The commonly accepted set of ontology terms for describing metadata about online resources such as documents. It is now referred to as the legacy set of terms, with the dcterms namespace being the current recommendation, although many ontologies still utilise this form; the dcterms set attempts to reference these terms wherever applicable to provide backward compatibility.⁴

DCTERMS Dublin Core Terms: Provides a revised set of document metadata elements including the elements in the original DC set, with domain and range specifications to further define the allowed contextual use of the DC terms. The original DC ontology was intentionally designed to provide a set of terms which did not have side-effects, and could therefore be used more widely without prejudice, but as the Semantic Web developed this was found to be unnecessarily vague, with user agents desiring to validate and integrate ontologies in non-textual forms.⁵

FOAF Friend Of A Friend: Provides an ontology which attempts to describe relationships between agents on the internet in social networking circles.⁶

WOT Web Of Trust: Provides an ontology which formalises a trust relationship based on distributed PGP keys and signatures. It can be integrated with the FOAF model by utilising the model of hashed mailbox addresses which is unique to the distributed FOAF community.⁷

SKOS Simple Knowledge Organization System: Aims to represent hierarchical and non-hierarchical semi-ordered thesauri and similar collections of literary terms. It is utilised by the DBpedia project to represent the knowledge which is encoded inside of Wikipedia, and hence has been practically verified to some extent.⁸

¹ http://www.w3.org/1999/02/22-rdf-syntax-ns
² http://www.w3.org/2000/01/rdf-schema
³ http://www.w3.org/2002/07/owl
⁴ http://purl.org/dc/elements/1.1/
⁵ http://purl.org/dc/terms/
⁶ http://xmlns.com/foaf/0.1/
⁷ http://xmlns.com/wot/0.1/
⁸ http://www.w3.org/2004/02/skos/core

Bibliography

[1] Gergely Adamku and Heiner Stuckenschmidt. Implementation and evaluation of a distributed rdf storage and retrieval system. In Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI’05), pages 393–396, Los Alamitos, CA, USA, 2005. IEEE Computer Society. ISBN 0-7695- 2415-X. doi: 10.1109/WI.2005.73.

[2] Hend S. Al-Khalifa and Hugh C. Davis. Measuring the semantic value of folk- sonomies. In Innovations in Information Technology, 2006, pages 1–5, November 2006. doi: 10.1109/INNOVATIONS.2006.301880.

[3] K Alexander, R Cyganiak, M Hausenblas, and J Zhao. Describing linked datasets. Proceedings of the Workshop on Linked Data on the Web (LDOW2009) Madrid, Spain, 2009. URL http://hdl.handle.net/10379/543.

[4] R. Alonso-Calvo, V. Maojo, H. Billhardt, F. Martin-Sanchez, M. García-Remesal, and D. Pérez-Rey. An agent- and ontology-based system for integrating public gene, protein, and disease databases. J. of Biomedical Informatics, 40(1):17–29, 2007. ISSN 1532-0464. doi: 10.1016/j.jbi.2006.02.014.

[5] Joanna Amberger, Carol A. Bocchini, Alan F. Scott, and Ada Hamosh. Mckusick’s online mendelian inheritance in man (omim). Nucleic Acids Research, 37(suppl 1):D793–D796, 2009. doi: 10.1093/nar/gkn665.

[6] Sophia Ananiadou and John McNaught, editors. Text mining for biology and biomedicine. Artech House, Boston, 2006.

[7] Sophia Ananiadou and John McNaught. Text mining for biology and biomedicine. Computational Linguistics, 33(1):135–140, 2007. doi: 10.1162/coli.2007.33.1.135.

[8] Peter Ansell. Bio2rdf: Providing named entity based search with a common biological database naming scheme. In Proceedings of BioSearch08: HCSNet Next- Generation Search Workshop on Search in Biomedical Information, November 2008.

[9] Peter Ansell. Collaborative development of cross-database bio2rdf queries. In eResearch Australasia 2009, Novotel Sydney Manly Pacific, November 2009.


[10] Peter Ansell. Model and prototype for querying multiple linked scientific datasets. Future Generation Computer Systems, 27(3):329–333, March 2011. ISSN 0167- 739X. doi: 10.1016/j.future.2010.08.016.

[11] Peter Ansell, Lawrence Buckingham, Xin-Yi Chua, James Hogan, Scott Mann, and Paul Roe. Finding friends outside the species: Making sense of large scale blast results with silvermap. In Proceedings of eResearch Australasia 2009, 2009.

[12] Peter Ansell, Lawrence Buckingham, Xin-Yi Chua, James Michael Hogan, Scott Mann, and Paul Roe. Enhancing blast comprehension with silvermap. In 2009 Microsoft eScience Workshop, October 2009.

[13] Peter Ansell, James Hogan, and Paul Roe. Customisable query resolution in biology and medicine. In Proceedings of the Fourth Australasian Workshop on Health Informatics and Knowledge Management - Volume 108, HIKM ’10, pages 69–76, Darlinghurst, Australia, Australia, 2010. Australian Computer Society, Inc. ISBN 978-1-920682-89-7. URL http://portal.acm.org/citation.cfm?id=1862303.1862315.

[14] Mikel Egana Aranguren, Sean Bechhofer, Phillip Lord, Ulrike Sattler, and Robert D. Stevens. Understanding and using the meaning of statements in a bio-ontology: recasting the gene ontology in owl. BMC Bioinformatics, 8:57, 2007. doi: 10.1186/1471-2105-8-57.

[15] Yigal Arens, Chin Y. Chee, Chun-Nan Hsu, and Craig A. Knoblock. Retrieving and integrating data from multiple information sources. International Journal of Intelligent and Cooperative Information Systems, 2:127–158, 1993. URL http://www.isi.edu/info-agents/papers/arens93-ijicis.pdf.

[16] Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet., 25:25–29, 2000. doi: 10.1038/75556.

[17] Vadim Astakhov, Amarnath Gupta, Simone Santini, and Jeffrey S. Grethe. Data integration in the biomedical informatics research network (birn). Data Integration in the Life Sciences, pages 317–320, 2005. doi: 10.1007/11530084_31.

[18] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. Dbpedia: A nucleus for a web of open data. Lecture Notes in Computer Science, 4825:722, 2007.

[19] Amos Bairoch, Rolf Apweiler, Cathy H. Wu, Winona C. Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez, Michele Magrane, Maria J. Martin, Darren A. Natale, Claire O’Donovan, Nicole Redaschi, and Lai-Su L. Yeh. The universal protein resource (uniprot). Nucleic Acids Research, 33(suppl_1):D154–159, 2005. doi: 10.1093/nar/gki070.

[20] JB Bard and SY Rhee. Ontologies in biology: design, applications and future challenges. Nat Rev Genet, 5(3):213–222, 2004. doi: 10.1038/nrg1295.

[21] Beth A. Bechky. Sharing meaning across occupational communities: The trans- formation of understanding on a production floor. Organization Science, 14(3): 312–330, 2003. ISSN 10477039. URL http://www.jstor.org/stable/4135139.

[22] S. Beco, B. Cantalupo, L. Giammarino, N. Matskanis, and M. Surridge. Owl- ws: a workflow ontology for dynamic grid service composition. In e-Science and Grid Computing, 2005. First International Conference on, page 8 pp., 2005. doi: 10.1109/E-SCIENCE.2005.64.

[23] François Belleau, Peter Ansell, Marc-Alexandre Nolin, Kingsley Idehen, and Michel Dumontier. Bio2rdf’s sparql point and search service for life science linked datasets. Poster at the 11th Annual Bio-Ontologies Meeting, 2008.

[24] François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. Bio2rdf: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5):706–716, 2008. doi: 10.1016/j. jbi.2008.03.004.

[25] S Bergamaschi, S Castano, and M Vincini. Semantic integration of semistructured and structured data sources. SIGMOD Record, 28:54–59, 1999. doi: 10.1145/ 309844.309897.

[26] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The protein data bank. Nucleic Acids Research, 28(1):235–242, 2000. doi: 10.1093/nar/28.1.235.

[27] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. Scientific American Magazine, May 2001. URL http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21.

[28] A Birkland and G Yona. Biozon: a system for unification, management and analysis of heterogeneous biological data. BMC Bioinformatics, 7:70, 2006. doi: 10.1186/1471-2105-7-70.

[29] C. Bizer and A. Seaborne. D2rq-treating non-rdf databases as virtual rdf graphs. In Proceedings of the 3rd International Semantic Web Conference (ISWC2004), 2004.

[30] U. Bojars, J.G. Breslin, A. Finn, and S. Decker. Using the semantic web for linking and reusing data across web 2.0 communities. Web Semantics: Science, Services and Agents on the World Wide Web, 6(1):21–28, February 2008. doi: 10.1016/j.websem.2007.11.010.

[31] Evan E. Bolton, Yanli Wang, Paul A. Thiessen, and Stephen H. Bryant. Chapter 12 pubchem: Integrated platform of small molecules and biological activities. In Ralph A. Wheeler and David C. Spellmeyer, editors, Annual Reports in Computa- tional Chemistry, volume 4 of Annual Reports in Computational Chemistry, pages 217–241. Elsevier, 2008. doi: 10.1016/S1574-1400(08)00012-1.

[32] Shawn Bowers and Bertram Ludäscher. An ontology-driven framework for data transformation in scientific workflows. In Data Integration in the Life Sciences, pages 1–16. Springer, 2004. doi: 10.1007/b96666.

[33] Jan Brase, Adam Farquhar, Angela Gastl, Herbert Gruttemeier, Maria Heijne, Alfred Heller, Arlette Piguet, Jeroen Rombouts, Mogens Sandfaer, and Irina Sens. Approach for a joint global registration agency for research data. Information Services and Use, 29(1):13–27, 01 2009. doi: 10.3233/ISU-2009-0595.

[34] Peter Buneman, Adriane Chapman, and James Cheney. Provenance management in curated databases. In SIGMOD ’06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 539–550, New York, NY, USA, 2006. ACM Press. ISBN 1-59593-434-0. doi: 10.1145/1142473.1142534.

[35] Eithon Cadag, Peter Tarczy-Hornoch, and Peter Myler. On the reachability of trustworthy information from integrated exploratory biological queries. Data Inte- gration in the Life Sciences, pages 55–70, 2009. doi: 10.1007/978-3-642-02879-3_ 6.

[36] Diego Calvanese, Giuseppe Giacomo, Domenico Lembo, Maurizio Lenzerini, Ric- cardo Rosati, and Marco Ruzzi. Using owl in data integration. Semantic Web Information Management, pages 397–424, 2010. doi: 10.1007/978-3-642-04329-1_ 17.

[37] Michael N Cantor and Yves A Lussier. Putting data integration into practice: using biomedical terminologies to add structure to existing data sources. AMIA Annu Symp Proc, pages 125–129, 2003.

[38] Monica Chagoyen, Pedro Carmona-Saez, Hagit Shatkay, Jose M Carazo, and Al- berto Pascual-Montano. Discovering semantic features in the literature: a foun- dation for building functional associations. BMC Bioinformatics, 7:41, 2006. doi: 10.1186/1471-2105-7-41.

[39] B Chen, X Dong, D Jiao, H Wang, Q Zhu, Y Ding, and D Wild. Chem2bio2rdf: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics, 11:255, 2010. doi: 10.1186/ 1471-2105-11-255.

[40] Kei-Hoi Cheung, H Robert Frost, M Scott Marshall, Eric Prud’hommeaux, Matthias Samwald, Jun Zhao, and Adrian Paschke. A journey to semantic web query federation in the life sciences. BMC Bioinformatics, 10(Suppl 10):S10, 2009. ISSN 1471-2105. doi: 10.1186/1471-2105-10-S10-S10.

[41] Tim Clark, Sean Martin, and Ted Liefeld. Globally distributed object identifica- tion for biological knowledgebases. Brief Bioinformatics, 5(1):59–70, 2004. doi: 10.1093/bib/5.1.59.

[42] Paulo C.G. Costa and Kathryn B. Laskey. Pr-owl: A framework for bayesian ontologies. In Proceedings of the International Conference on Formal Ontology in Information Systems (FOIS 2006). November 9-11, 2006, Baltimore, MD, USA, 2006. IOS Press. URL http://hdl.handle.net/1920/1734.

[43] Susan Crawford and Loretta Stucki. Peer review and the changing research record. Journal of the American Society for Information Science, 41(3):223–228, 1990. ISSN 1097-4571. doi: 10.1002/(SICI)1097-4571(199004)41:3<223::AID-ASI14>3.0.CO;2-3.

[44] Andreas Doms and Michael Schroeder. GoPubMed: exploring PubMed with the Gene Ontology. Nucl. Acids Res., 33(Supplement 2):W783–786, 2005. doi: 10. 1093/nar/gki470.

[45] Orri Erling and Ivan Mikhailov. Rdf support in the virtuoso dbms. In Proceedings of the 1st Conference on Social Semantic Web (CSSW), pages 7–24. Springer, 2007. doi: 10.1007/978-3-642-02184-8_2.

[46] T. Etzold, H. Harris, and S. Beulah. Srs: An integration platform for databanks and analysis tools in bioinformatics. In Bioinformatics: Managing Scientific Data, chapter 5, pages 35–74. Elsevier, 2003.

[47] Y Fang, H Huang, H Chen, and H Juan. Tcmgenedit: a database for as- sociated traditional chinese medicine, gene and disease information using text mining. BMC Complementary and Alternative Medicine, 8:58, 2008. doi: 10.1186/1472-6882-8-58.

[48] Robert D. Finn, Jaina Mistry, John Tate, Penny Coggill, Andreas Heger, Joanne E. Pollington, O. Luke Gavin, Prasad Gunasekaran, Goran Ceric, Kristof- fer Forslund, Liisa Holm, Erik L. L. Sonnhammer, Sean R. Eddy, and Alex Bate- man. The pfam protein families databases. Nucleic Acids Research, 38(suppl 1): D211–D222, 2010. doi: 10.1093/nar/gkp985.

[49] R. Gaizauskas, N. Davis, G. Demetriou, Y. Guo, and I. Roberts. Integrating text mining into distributed bioinformatics workflows: a web services implementation. In Services Computing, 2004. (SCC 2004). Proceedings. 2004 IEEE International Conference on, pages 145–152, 15-18 Sept. 2004. doi: 10.1109/SCC.2004.1358001.

[50] Michael Y. Galperin. The molecular biology database collection: 2008 update. Nucl. Acids Res., 36(suppl-1):D2–4, 2008. doi: 10.1093/nar/gkm1037.

[51] Robert Gentleman. Reproducible research: A bioinformatics case study. Statis- tical Applications in Genetics and Molecular Biology, 4(1), 2005. doi: 10.2202/ 1544-6115.1034.

[52] Yolanda Gil and Donovan Artz. Towards content trust of web resources. Web Semantics: Science, Services and Agents on the World Wide Web, 5(4):227–239, December 2007. doi: 10.1016/j.websem.2007.09.005.

[53] C. A. Goble, R. D. Stevens, G. Ng, S. Bechhofer, N. W. Paton, P. G. Baker, M. Peim, and A. Brass. Transparent access to multiple bioinformatics information sources. IBM Syst. J., 40(2):532–551, 2001. ISSN 0018-8670. doi: 10.1147/sj.402. 0532.

[54] Peter Godfrey-Smith. Theory and Reality: An Introduction to the Philosophy of Science. University Of Chicago Press, 2003.

[55] Kwang-Il Goh, Michael E. Cusick, David Valle, Barton Childs, Marc Vidal, and Albert-László Barabási. The human disease network. Proceedings of the Na- tional Academy of Sciences, 104(21):8685–8690, May 2007. doi: 10.1073/pnas. 0701361104.

[56] Jennifer Golbeck. Weaving a web of trust. Science, 321(5896):1640–1641, 2008. doi: 10.1126/science.1163357.

[57] Mike Graves, Adam Constabaris, and Dan Brickley. Foaf: Connecting people on the semantic web. Cataloging & Classification Quarterly, 43(3):191–202, 2007. ISSN 0163-9374. doi: 10.1300/J104v43n03_10.

[58] Tom Gruber. Collective knowledge systems: Where the social web meets the semantic web. Web Semantics: Science, Services and Agents on the World Wide Web, 6(1):4–13, February 2008. doi: 10.1016/j.websem.2007.11.011.

[59] Miguel Esteban Gutiérrez, Isao Kojima, Said Mirza Pahlevi, Oscar Corcho, and Asunción Gómez-Pérez. Accessing rdf(s) data resources in service-based grid in- frastructures. Concurrency and Computation: Practice and Experience, 21(8): 1029–1051, 2009. doi: 10.1002/cpe.1409.

[60] LM Haas, PM Schwarz, and P Kodali. Discoverylink: a system for integrated access to life sciences data sources. IBM Syst. J, 40:489–511, 2001. URL http://portal.acm.org/citation.cfm?id=1017236.

[61] Carole D. Hafner and Natalya Fridman Noy. Ontological foundations for biology knowledge models. In Proceedings of the Fourth International Conference on In- telligent Systems for Molecular Biology, pages 78–87. AAAI Press, 1996. ISBN 1-57735-002-2.

[62] MA Harris, J Clark, A Ireland, J Lomax, M Ashburner, R Foulger, K Eilbeck, S Lewis, B Marshall, and C Mungall. The gene ontology (go) database and informatics resource. Nucleic Acids Res, 32(1):D258–261, 2004. doi: 10.1093/nar/gkh036.

[63] A. Harth, J. Umbrich, A. Hogan, and S. Decker. Yars2: A federated repository for querying graph structured data from the webs. Lecture Notes in Computer Science, 4825:211, 2007. doi: 10.1007/978-3-540-76298-0_16.

[64] Olaf Hartig and Jun Zhao. Using web data provenance for quality assessment. In Proceedings of the Workshop on Semantic Web and Provenance Management at ISWC, 2009.

[65] O Hassanzadeh, A Kementsietsidis, L Lim, RJ Miller, and M Wang. Linkedct: A linked data space for clinical trials. arXiv:0908.0567v1, August 2009. URL http://arxiv.org/abs/0908.0567.

[66] Tom Heath and Enrico Motta. Ease of interaction plus ease of integration: Com- bining web 2.0 and the semantic web in a reviewing site. Web Semantics: Science, Services and Agents on the World Wide Web, 6(1):76–83, February 2008. doi: 10.1016/j.websem.2007.11.009.

[67] John J. Helly, T. Todd Elvins, Don Sutton, David Martinez, Scott E. Miller, Steward Pickett, and Aaron M. Ellison. Controlled publication of digital scientific data. Commun. ACM, 45(5):97–101, 2002. ISSN 0001-0782. doi: 10.1145/506218. 506222.

[68] Martin Hepp. Goodrelations: An ontology for describing products and services of- fers on the web. In Proceedings of the 16th International Conference on Knowledge Engineering and Knowledge Management (EKAW2008), September 29 - October 3, 2008, Acitrezza, Italy, volume 5268, pages 332–347. Springer LNCS, 2008. doi: 10.1007/978-3-540-87696-0_29.

[69] Katherine G. Herbert and Jason T.L. Wang. Biological data cleaning: a case study. International Journal of Information Quality, 1(1):60 – 82, 2007. doi: 10.1504/IJIQ.2007.013376.

[70] Mauricio A. Hernández and Salvatore J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9–37, January 1998. doi: 10.1023/A:1009761603038.

[71] Tony Hey, Stewart Tansley, and Kristin Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Wash- ington, 2009. URL http://research.microsoft.com/en-us/collaboration/ fourthparadigm/.

[72] Shahriyar Hossain and Hasan Jamil. A visual interface for on-the-fly biological database integration and workflow design using vizbuilder. Data Integration in the Life Sciences, pages 157–172, 2009. doi: 10.1007/978-3-642-02879-3_13.

[73] Gareth Hughes, Hugo Mills, David De Roure, Jeremy G. Frey, Luc Moreau, M. C Schraefel, Graham Smith, and Ed Zaluska. The semantic smart laboratory: a system for supporting the chemical escientist. Org. Biomol. Chem., 2:3284 – 3293, 2004. doi: 10.1039/B410075A.

[74] Zachary Ives. Data integration and exchange for scientific collaboration. Data In- tegration in the Life Sciences, pages 1–4, 2009. doi: 10.1007/978-3-642-02879-3_ 1.

[75] Hai Jin, Aobing Sun, Ran Zheng, Ruhan He, and Qin Zhang. Ontology- based semantic integration scheme for medical image grid. Cluster Comput- ing and the Grid, IEEE International Symposium on, 0:127–134, 2007. doi: 10.1109/CCGRID.2007.77.

[76] Minoru Kanehisa, Susumu Goto, Miho Furumichi, Mao Tanabe, and Mika Hi- rakawa. Kegg for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Research, 38(suppl 1):D355–D360, 2010. doi: 10.1093/nar/gkp896.

[77] P. D. Karp. An ontology for biological function based on molecular interactions. Bioinformatics, 16(3):269–285, Mar 2000. doi: 10.1093/bioinformatics/16.3.269.

[78] Marc Kenzelmann, Karsten Rippe, and John S Mattick. Rna: Networks & imag- ing. Molecular Systems Biology, 2(44), 2006. doi: 10.1038/msb4100086.

[79] Jacob Köhler, Stephan Philippi, and Matthias Lange. Semeda: ontology based semantic integration of biological databases. Bioinformatics, 19(18):2420–2427, Dec 2003. doi: 10.1093/bioinformatics/btg340.

[80] Michael Kuhn, Monica Campillos, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol, 6, January 2010. doi: 10.1038/msb.2009.98.

[81] Thomas Samuel Kuhn. The Structure of Scientific Revolutions. University Of Chicago Press, 3rd edition, 1996.

[82] Camille Laibe and Nicolas Le Novere. Miriam resources: tools to generate and resolve robust cross-references in systems biology. BMC Systems Biology, 1(1):58, 2007. ISSN 1752-0509. doi: 10.1186/1752-0509-1-58.

[83] P. Lambrix and V. Jakoniene. Towards transparent access to multiple biological databanks. In Proceedings of the First Asia-Pacific bioinformatics conference on Bioinformatics 2003-Volume 19, pages 53–60. Australian Computer Society, Inc. Darlinghurst, Australia, Australia, 2003. URL http://portal.acm.org/ citation.cfm?id=820189.820196.

[84] Ning Lan, Gaetano T Montelione, and Mark Gerstein. Ontologies for proteomics: towards a systematic definition of structure and function that scales to the genome level. Current Opinion in Chemical Biology, 7(1):44–54, February 2003. doi: 10.1016/S1367-5931(02)00020-0.

[85] Andreas Langegger, Martin Blochl, and Wolfram Woss. Sharing data on the grid using ontologies and distributed sparql queries. In 18th International Conference on Database and Expert Systems Applications, 2007. DEXA ’07, pages 450–454, 2007. doi: 10.1109/DEXA.2007.4312934.

[86] A. Lash, W.-J. Lee, and L. Raschid. A methodology to enhance the semantics of links between pubmed publications and markers in the human genome. In Fifth IEEE Symposium on Bioinformatics and Bioengineering (BIBE 2005), pages 185– 192, Minneapolis, Minnesota, USA, October 2005. doi: 10.1109/BIBE.2005.3.

[87] D. Le-Phuoc, A. Polleres, M. Hauswirth, G. Tummarello, and C. Morbidoni. Rapid prototyping of semantic mash-ups through semantic web pipes. In Proceedings of the 18th international conference on World wide web, pages 581–590. ACM New York, NY, USA, 2009. URL http://www2009.org/proceedings/pdf/p581.pdf.

[88] Timothy Lebo, John S. Erickson, Li Ding, Alvaro Graves, Gregory Todd Williams, Dominic DiFranzo, Xian Li, James Michaelis, Jin Guang Zheng, Johanna Flores, Zhenning Shangguan, Deborah L. McGuinness, and Jim Hendler. Producing and using linked open government data in the twc logd portal (to appear). In David Wood, editor, Linking Government Data. Springer, New York, NY, 2011.

[89] Peter Li, Yuhui Chen, and Alexander Romanovsky. Measuring the dependability of web services for use in e-science experiments. In Dave Penkler, Manfred Re- itenspiess, and Francis Tam, editors, Service Availability, volume 4328 of Lecture Notes in Computer Science, pages 193–205. Springer Berlin / Heidelberg, 2006. doi: 10.1007/11955498_14.

[90] Sarah Cohen-Boulakia and Frédérique Lisacek. Proteome informatics ii: Bioinformatics for comparative proteomics. Proteomics, 6(20):5445–5466, 2006. doi: 10.1002/pmic.200600275.

[91] Phillip Lord, Sean Bechhofer, Mark D. Wilkinson, Gary Schiltz, Damian Gessler, Duncan Hull, , and . Applying semantic web services to bioinformatics: Experiences gained, lessons learnt. In Sheila A. McIlraith, Dimitris Plexousakis, and Frank van Harmelen, editors, The Semantic Web – ISWC 2004, volume 3298 of Lecture Notes in Computer Science, pages 350–364. Springer Berlin / Heidelberg, 2004. doi: 10.1007/978-3-540-30475-3_25.

[92] Stefan Maetschke, Michael W. Towsey, and James M. Hogan. Biopatml: pattern sharing for the genomic sciences. In 2008 Microsoft eScience Workshop, 7-9 December 2008, University Place Conference Center & Hotel, Indianapolis, 2008. URL http://eprints.qut.edu.au/27327/.

[93] D. Maglott, J. Ostell, K. D. Pruitt, and T. Tatusova. Entrez gene: gene-centered information at ncbi. Nucleic Acids Res., 35:D26–D31, 2007. doi: 10.1093/nar/ gkl993.

[94] Sylvain Mathieu, Isabelle Boutron, David Moher, Douglas G. Altman, and Philippe Ravaud. Comparison of registered and published primary outcomes in randomized controlled trials. JAMA, 302(9):977–984, 2009. doi: 10.1001/jama. 2009.1242.

[95] S. Miles, P. Groth, S. Munroe, S. Jiang, T. Assandri, and L. Moreau. Extracting causal graphs from an open provenance data models. Concurrency and Compu- tation: Practice and Experience, 20(5):577–586, 2007. doi: 10.1002/cpe.1236.

[96] J. Mingers. Real-izing information systems: Critical realism as an underpinning philosophy for information systems. Information and Organization, 14(2):87–103, April 2004. doi: 10.1016/j.infoandorg.2003.06.001.

[97] Olivo Miotto, Tin Tan, and Vladimir Brusic. Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza a viruses. BMC Bioinformat- ics, 9(Suppl 1):S7, 2008. ISSN 1471-2105. doi: 10.1186/1471-2105-9-S1-S7.

[98] Barend Mons. Which gene did you mean? BMC Bioinformatics, 6:142, 2005. doi: 10.1186/1471-2105-6-142.

[99] Michael Mrissa, Chirine Ghedira, Djamal Benslimane, Zakaria Maamar, Florian Rosenberg, and Schahram Dustdar. A context-based mediation approach to com- pose semantic web services. ACM Trans. Internet Technol., 8(1):4, 2007. ISSN 1533-5399. doi: 10.1145/1294148.1294152.

[100] C. J. Mungall, D. B. Emmert, and Consortium FlyBase. A chado case study: an ontology-based modular schema for representing genome-associated biological information. Bioinformatics, 23:i337–i346, 2007. doi: 10.1093/bioinformatics/ btm189.

[101] Ph. Mylonas, D. Vallet, P. Castells, M. Fernández, and Y. Avrithis. Personalized information retrieval based on context and ontological knowledge. The Knowledge Engineering Review, 23(Special Issue 01):73–100, 2008. doi: 10.1017/S0269888907001282.

[102] Rex Nelson, Shulamit Avraham, Randy Shoemaker, Gregory May, Doreen Ware, and Damian Gessler. Applications and methods utilizing the simple semantic web architecture and protocol (sswap) for bioinformatics resource discovery and disparate data and service integration. BioData Mining, 3(1):3, 2010. ISSN 1756- 0381. doi: 10.1186/1756-0381-3-3.

[103] Natalya F. Noy. Semantic integration: a survey of ontology-based approaches. SIGMOD Rec., 33(4):65–70, 2004. ISSN 0163-5808. doi: 10.1145/1041410.1041421.

[104] P. V. Ogren, K. B. Cohen, G. K. Acquaah-Mensah, J. Eberlein, and L. Hunter. The compositional structure of gene ontology terms. Pac Symp Biocomput, 9:214–225, 2004. URL http://psb.stanford.edu/psb-online/proceedings/psb04/ogren.pdf.

[105] Tom Oinn, Mark Greenwood, Matthew Addis, M. Nedim Alpdemir, Justin Ferris, Kevin Glover, Carole Goble, Antoon Goderis, Duncan Hull, Darren Marvin, Peter Li, Phillip Lord, Matthew R. Pocock, Martin Senger, Robert D. Stevens, Anil Wipat, and Chris Wroe. Taverna: lessons in creating a workflow environment for the life sciences. Concurrency and Computation: Practice and Experience, 18(10): 1067–1100, 2006. ISSN 1532-0626. doi: 10.1002/cpe.993.

[106] A. Paschke. Rule responder hcls escience infrastructure. In Proceedings of the 3rd International Conference on the Pragmatic Web: Innovating the Interactive Society, pages 59–67. ACM New York, NY, USA, 2008. doi: 10.1145/1479190. 1479199.

[107] C. Pasquier. Biological data integration using semantic web technologies. Biochimie, 90(4):584–594, 2008. doi: 10.1016/j.biochi.2008.02.007.

[108] Steve Pettifer, David Thorne, Philip McDermott, James Marsh, Alice Villeger, Douglas Kell, and Teresa Attwood. Visualising biological data: a semantic ap- proach to tool and database integration. BMC Bioinformatics, 10(Suppl 6):S19, 2009. ISSN 1471-2105. doi: 10.1186/1471-2105-10-S6-S19.

[109] William Pike and Mark Gahegan. Beyond ontologies: Toward situated representa- tions of scientific knowledge. International Journal of Human-Computer Studies, 65(7):674–688, July 2007. doi: 10.1016/j.ijhcs.2007.03.002.

[110] Andreas Prlić, , Tony Cox, Thomas Down, Rob Finn, Stefan Gräf, David Jackson, Andreas Kähäri, Eugene Kulesha, Roger Pettett, James Smith, Jim Stalker, and Tim Hubbard. The distributed annotation system for integration of biological data. In Data Integration in the Life Sciences, pages 195–203, 2006. doi: 10.1007/11799511_17.

[111] B. Quilitz and U. Leser. Querying distributed rdf data sources with sparql. Lecture Notes in Computer Science, 5021:524, 2008. doi: 10.1007/978-3-540-68234-9_39.

[112] Erhard Rahm and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23:2000, 2000. URL http://sites.computer.org/debull/A00DEC-CD.pdf.

[113] Lila Rao and Kweku-Muata Osei-Bryson. Towards defining dimensions of knowl- edge systems quality. Expert Systems with Applications, 33(2):368–378, 2007. ISSN 0957-4174. doi: 10.1016/j.eswa.2006.05.003.

[114] N. Redaschi. Uniprot in rdf: Tackling data integration and distributed annotation with the semantic web. Nature Precedings, 2009. doi: 10.1038/npre.2009.3193.1.

[115] Michael Rosemann, Jan C. Recker, Christian Flender, and Peter D. Ansell. Un- derstanding context-awareness in business process design. In 17th Australasian Conference on Information Systems, Adelaide, Australia, December 2006. URL http://eprints.qut.edu.au/6160/.

[116] Joseph S. Ross, Gregory K. Mulvey, Elizabeth M. Hines, Steven E. Nissen, and Harlan M. Krumholz. Trial publication after registration in clinicaltrials.gov: A cross-sectional analysis. PLoS Med, 6(9):e1000144, 09 2009. doi: 10.1371/journal. pmed.1000144.

[117] A. Ruttenberg, J.A. Rees, M. Samwald, and M.S. Marshall. Life sciences on the semantic web: the neurocommons and beyond. Briefings in Bioinformatics, 10 (2):193, 2009. doi: 10.1093/bib/bbp004.

[118] Satya S Sahoo, Olivier Bodenreider, Joni L Rutter, Karen J Skinner, and Amit P Sheth. An ontology-driven semantic mashup of gene and biological pathway infor- mation: Application to the domain of nicotine dependence. Journal of Biomedical Informatics, Feb 2008. doi: 10.1016/j.jbi.2008.02.006.

[119] Satya S. Sahoo, Olivier Bodenreider, Pascal Hitzler, and Amit Sheth. Provenance context entity (pace): Scalable provenance tracking for scientific rdf datasets. In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM 2010), 2010. URL http://knoesis.wright.edu/library/download/ProvenanceTracking_PaCE.pdf.

[120] Joel Saltz, Scott Oster, Shannon Hastings, Stephen Langella, Tahsin Kurc, William Sanchez, Manav Kher, Arumani Manisundaram, Krishnakant Shanbhag, and Peter Covitz. cagrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics, 22(15):1910–1916, 2006. doi: 10.1093/bioinformatics/btl272.

[121] Matthias Samwald, Anja Jentzsch, Christopher Bouton, Claus Kallesoe, Egon Willighagen, Janos Hajagos, M Marshall, Eric Prud’hommeaux, Oktie Hassan- zadeh, Elgar Pichler, and Susie Stephens. Linked open drug data for pharmaceu- tical research and development. Journal of Cheminformatics, 3(1):19, 2011. ISSN 1758-2946. doi: 10.1186/1758-2946-3-19.

[122] Marco Schorlemmer and Yannis Kalfoglou. Institutionalising ontology-based se- mantic integration. Applied Ontology, 3(3):131–150, 2008. ISSN 1570-5838. URL http://eprints.ecs.soton.ac.uk/18563/.

[123] Ronald Schroeter, Jane Hunter, and Andrew Newman. The Semantic Web: Research and Applications, volume 4519, chapter Annotating Relationships Between Multiple Mixed-Media Digital Objects by Extending Annotea, pages 533–548. Springer Berlin / Heidelberg, 2007. doi: 10.1007/978-3-540-72667-8_38.

[124] S. Schulze-Kremer. Adding semantics to genome databases: towards an ontology for molecular biology. Proceedings of the 5th International Conference on Intelligent Systems for Molecular Biology, 5:272–275, 1997. URL https://www.aaai.org/Papers/ISMB/1997/ISMB97-042.pdf.

[125] Ruth L. Seal, Susan M. Gordon, Michael J. Lush, Mathew W. Wright, and El- speth A. Bruford. genenames.org: the hgnc resources in 2011. Nucleic Acids Research, 2010. doi: 10.1093/nar/gkq892.

[126] Hagit Shatkay and Ronen Feldman. Mining the biomedical literature in the ge- nomic era: an overview. J Comput Biol, 10(6):821–855, 2003. doi: 10.1089/ 106652703322756104.

[127] E. Patrick Shironoshita, Yves R Jean-Mary, Ray M Bradley, and Mansur R Kabuka. semcdi: A query formulation for semantic data integration in cabig. J Am Med Inform Assoc, Apr 2008. doi: 10.1197/jamia.M2732.

[128] David Shotton. Semantic publishing: the coming revolution in scientific journal publishing. Learned Publishing, 22(2):85–94, APRIL 2009. doi: 10.1087/2009202.

[129] Y. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. SIGMOD Record, 34(3):31–36, 2005. doi: 10.1145/1084805.1084812.

[130] D. Smedley, S. Haider, B. Ballester, R. Holland, D. London, G. Thorisson, and A. Kasprzyk. Biomart – biological queries made easy. BMC genomics, 10(1):22, 2009. doi: 10.1186/1471-2164-10-22.

[131] Miriam Solomon. Groupthink versus the wisdom of crowds: The social episte- mology of deliberation and dissent. The Southern Journal of Philosophy, 44(S1): 28–42, 2006. doi: 10.1111/j.2041-6962.2006.tb00028.x.

[132] Young C. Song, Edward Kawas, Ben M. Good, Mark D. Wilkinson, and Scott J. Tebbutt. Databins: a biomoby-based data-mining workflow for biological path- ways and non-synonymous snps. Bioinformatics, 23(6):780–782, Jan 2007. doi: 10.1093/bioinformatics/btl648.

[133] Damires Souza. Using Semantics to Enhance Query Reformulation in Dynamic Distributed Environments. PhD thesis, Universidade Federal de Pernambuco, April 2009. URL http://www.bdtd.ufpe.br/tedeSimplificado/tde_arquivos/22/TDE-2009-07-06T123119Z-6016/Publico/dysf.pdf.

[134] Irena Spasić and Sophia Ananiadou. Using automatically learnt verb selectional preferences for classification of biomedical terms. J Biomed Inform, 37(6):483–497, Dec 2004. doi: 10.1016/j.jbi.2004.08.002.

[135] Robert D. Stevens, C Goble, I Horrocks, and S Bechhofer. Building a bioinformatics ontology using oil. IEEE Transactions on Information Technology in Biomedicine, 6:135–141, June 2002. ISSN 1089-7771. doi: 10.1109/TITB.2002.1006301.

[136] Ljiljana Stojanovic, Alexander Maedche, Boris Motik, and N. Stojanovic. User-driven ontology evolution management. In Proceedings of the 13th European Conference on Knowledge Engineering and Knowledge Management EKAW, pages 53–62, Madrid, Spain, 2002. URL http://www.fzi.de/ipe/publikationen.php?id=804.

[137] John Strassner, Sven van der Meer, Declan O’Sullivan, and Simon Dobson. The use of context-aware policies and ontologies to facilitate business-aware network management. Journal of Network and Systems Management, 17(3):255–284, September 2009. doi: 10.1007/s10922-009-9126-4.

[138] James Surowiecki. The Wisdom of Crowds. Random House, New York, 2004. ISBN 0-385-72170-6.

[139] The UniProt Consortium. The universal protein resource (uniprot) in 2010. Nu- cleic Acids Research, 38(suppl 1):D142–D148, 2010. doi: 10.1093/nar/gkp846.

[140] Todd J. Vision. Open data and the social contract of scientific publishing. Bio- Science, 60(5):330–331, 2010. ISSN 00063568. doi: 10.1525/bio.2010.60.5.2.

[141] Stephan Weise, Christian Colmsee, Eva Grafahrend-Belau, Björn Junker, Chris- tian Klukas, Matthias Lange, Uwe Scholz, and Falk Schreiber. An integration and analysis pipeline for systems biology in crop plant metabolism. Data Integration in the Life Sciences, pages 196–203, 2009. doi: 10.1007/978-3-642-02879-3_16.

[142] Patricia L. Whetzel, Helen Parkinson, Helen C. Causton, Liju Fan, Jennifer Fostel, Gilberto Fragoso, Laurence Game, Mervi Heiskanen, Norman Morrison, Philippe Rocca-Serra, Susanna-Assunta Sansone, Chris Taylor, Joseph White, and Jr Stoeckert, Christian J. The MGED Ontology: a resource for semantics-based description of microarray experiments. Bioinformatics, 22(7):866–873, 2006. doi: 10.1093/bioinformatics/btl005.

[143] Mark D Wilkinson, Benjamin Vandervalk, and Luke McCarthy. Sadi semantic web services – cause you can’t always get what you want! In Proceedings of IEEE APSCC Workshop on Semantic Web Services in Practice (SWSIP 2009), pages 13–18, 2009. doi: 10.1109/APSCC.2009.5394148.

[144] David S. Wishart, Craig Knox, An Chi Guo, Dean Cheng, Savita Shrivastava, Dan Tzur, Bijaya Gautam, and Murtaza Hassanali. Drugbank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Research, 36(suppl 1): D901–D906, 2008. doi: 10.1093/nar/gkm958.

[145] Limsoon Wong. Technologies for integrating biological data. Brief Bioinform, 3 (4):389–404, 2002. doi: 10.1093/bib/3.4.389.

[146] Alexander C. Yu. Methods in biomedical ontology. Journal of Biomedical Informatics, 39(3):252–266, June 2006. doi: 10.1016/j.jbi.2005.11.006.

[147] J. Zemánek, S. Schenk, and V. Svátek. Optimizing sparql queries over disparate rdf data sources through distributed semi-joins. In Proceedings of ISWC 2008 Poster and Demo Session, 2008. URL http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-401/iswc2008pd_submission_69.pdf.

[148] Jun Zhao, Carole Goble, Robert Stevens, and Daniele Turi. Mining taverna’s semantic web of provenance. Concurrency and Computation: Practice and Expe- rience, Online Publication, 2007. doi: 10.1002/cpe.1231.

[149] Jun Zhao, Graham Klyne, and David Shotton. Provenance and linked data in biological data webs. In Linked Open Data Workshop at The 17th International World Wide Web Conference, 2008. URL http://data.semanticweb.org/workshop/LDOW/2008/paper/4.

[150] Jun Zhao, Alistair Miles, Graham Klyne, and David Shotton. Openflydata: The way to go for biological data integration. Data Integration in the Life Sciences, pages 47–54, 2009. doi: 10.1007/978-3-642-02879-3_5.