Provenance of E-Science Experiments - Experience from Bioinformatics

Provenance of E-Science Experiments - Experience from Bioinformatics

Provenance of e-Science Experiments - experience from Bioinformatics Mark Greenwood, Carole Goble, Robert Stevens, Jun Zhao, Matthew Addis, Darren Marvin, Luc Moreau, Tom Oinn EPSRC e-Science Pilot Project myGrid http://www.mygrid.org.uk Abstract Like experiments performed at a laboratory bench, the data associated with an e-Science experiment are of reduced value if other scientists are not able to identify the origin, or provenance, of those data. Provenance information is essential if experiments are to be validated and verified by others, or even by those who originally performed them. In this article, we give an overview of our initial work on the provenance of bioinformatics e-Science experiments within myGrid. We use two kinds of provenance: the derivation path of information and annotation. We show how this kind of provenance can be delivered within the myGrid demonstrator WorkBench and we explore how the resulting Webs of experimental data holdings can be mined for useful information and presentations for the e-Scientist. 1 Introduction management and sharing of data-intensive in silico experiments in biology. The generation and produc- tion of provenance data is seen as a central require- Like experiments performed at a laboratory bench, the my data associated with an e-science experiment are of ment for the Grid project. reduced value if other scientists are not able to hold the provenance of those data. Provenance is a kind of metadata, recording the process of biological experi- ments for e-Science, the purpose and results of exper- 3 Generating Provenance in my iments as well as annotations and notes about experi- Grid ments by scientists [1]. This information is essential if experiments are to be validated and verified by others, Within myGrid we want to generate and exploit prove- or even by those who originally performed them. It is nance that is intended for machine consumption. The also important in assessing the quality, and timeliness large amount of provenance data that is required to of results. allow scientists to verify the experiments of others, Our investigation of provenance is part of the myGrid means that as much of the collection as possible must project, which is reviewed in Section 2. We then de- be automatically and systematically generated. There scribe our understanding of provenance data for in sil- is considerable potential for programs that process ico experiments and how it is generated. We then collections of provenance records and provide users present how we think these data can be exploited in with higher-level views of what has happened within Section 4 and conclude with a discussion in Section 5. their virtual organisation. There are two major forms of provenance: First, the derivation path records the process by which results my 2 The Grid project are generated from input data. This could include a database query, a program and its parameters, or e-Science is the use of electronic resources – instru- a workflow that orchestrates a number of services. ments, sensors, databases, computational methods, Knowledge of the derivation path is essential for ef- computers - by scientists working collaboratively in fective change management or knowing when an ex- large distributed project teams in order to solve scien- periment needs to be re-run in the light of new infor- tific problems. An in silico experiment is a procedure mation [6]. Second, annotations are attached to ob- that uses computer-based information repositories and jects, or collections of objects. There are standard an- computational analysis to test a hypothesis, derive a notations such as when an object was created, last up- summary, search for patterns, or demonstrate a known dated, who owns it and its format. In addition, there fact. The myGrid project1 [5] is developing high-level can be annotations that describe an object with con- service-based middleware to support the construction, cepts related to the scientific domain. For example, a database entry is a protein sequence in FASTA format 1http://www.mygrid.org.uk and it describes an enzyme with a glucose substrate and a kinase function. All annotations add context in- • some note on where it came from; formation that is essential to the interpretation of the raw scientific data. • some other biological information such as species, function, etc; The workflows that represent in silico experiments in myGrid describe the orchestration of bioinformatics • comments on why this data was being used; data and analysis services that are used to derive out- • standard mIR metadata (who input the puts. The outputs of one workflow may form the in- data, when, its syntactic, semantic and dis- puts to another so that a complete in silico experiment play(MIME) type). includes a network of related workflow invocations. my The Grid workflow enactment system ensures that The workflow descriptions are themselves objects any output can be associated with its corresponding within the environment that can have their own prove- workflow invocation record and the associated prove- nance data. Bioinformatics data and analysis services nance data. In this way, the detail of how an output is may themselves have provenance information indicat- related to its inputs is available when required [7]. ing: The myGrid environment uses the open source Freefluo workflow enactment engine2. Coupled with • their versions; the myGrid WorkBench [4], this enactment engine • default parameters; produces a provenance log that records what events have been performed during the enactment. Figure 1 • resource versions - For services that involve shows the myGrid schema for workflow or derivation searching databases, it can be important to know path provenance information myGrid generates. This what version of the database was used. schema is being further extended to include more an- notation provenance. This is of particular relevance in bioinformatics, The workflow provenance log provides a record of the where there are a large number of secondary databases derivation path of the workflow. Thus the provenance whose content is based both on automatic annotation log of a workflow enactment is the recording of the of the primary data, and on the additional annotations start time, end time and service instances operated in of a small number of human experts (curators). This this workflow. At the end of the workflow execution, means that there is an inevitable time lag between the resulting data, metadata about the workflow and changes to the primary data, and the corresponding the provenance logs are held in the myGrid Informa- changes in secondary databases. tion Repository (mIR) [2]. When the myGrid Work- Running workflows is not the only activity that goes Bench is used, there is additional annotation about the on within the myGrid environment. Other activities context of the workflow. In addition to the derivation might include: Tracking what users have selected; path, a set of metadata is associated with the workflow tracking the texts that users refer to; and understand- invocation instance: the input and output relation- ing what colleagues have done. Potentially all actions ships between the workflow instance and data items, of the user can be recorded and act as annotations; the ‘is defined by’ relationship between the workflow indeed, these can be themselves annotated with the instance, the semantic description document and the input of the e-Scientist. workflow template. Other annotations regarding the hypothesis of the experiment, thoughts and opinions All this information provides a ‘work context’ for by the scientist and quality of results are also stored in silico experiments. We would like to build a web as XML in the mIR or as regular web documents3. of related pages relevant to an experimental inves- tigation, marked up with, and linked together using annotations drawn from shared ontologies (see Fig- ure 2). This web includes not only the provenance 3.1 Beyond workflow provenance record of a workflow run, but also links to other prove- nance records of other directly or indirectly related It is also clear that there is much more than just data workflow runs, diagrams of the workflow specifica- provenance of workflow outputs to consider. The sci- tions; web pages about people who ran the workflow entists using myGrid may want to record information or have related study in provenance; literatures rele- about the provenance of data that they load directly vant to provenance study; notes of the experiment and into the mIR. For example if the data is a DNA se- so on. This is the idea behind a “web of science” as quence, then scientists might want to store: proposed by Hendler [3]. 2http://freefluo.sourceforge.net/ 3An example of a myGrid provenance record may be seen at http://twiki.mygrid.org.uk/twiki/bin/view/ 4 Exploiting Provenance Records Mygrid/GDProvenanceExample. An example of the addi- tional information available through the mIR may be seen at http: //twiki.mygrid.org.uk/twiki/bin/view/Mygrid/ There are a potentially wide number of uses for this ProvenanceData\#provenance_from_mIR_metadata. provenance data: Figure 1: The myGrid information model for provenance information. • The derivation path provenance should be the most used bio-services in my organisa- enough to allow others to repeat and validate an tion/group/project?’; ‘is is worth renewing my experiment. If exactly the same conditions are subscription to this expensive service?’. Such available (same database version, same program, queries obviously have security and privacy im- same algorithm, etc.) then it may be possible to plications – another fundamental topic for e- repeat the in silico experiment. Science. • The provenance data also provides a store of • In the volatile world of bioinformatics such in- know-how, a means of learning from history and formation could be used to automatically re-run disseminating best practise. an in silico experiment in the light of a notifica- tion of change in data, analysis tool or third-party • There is also substantial benefit in using prove- data repository.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    4 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us