Distributed Computing Environments and Workflows for Systems

Total Pages: 16

File Type: PDF, Size: 1020 KB

Going with the Flow: Distributed Computing for Systems Biology Using Taverna
Prof Carole Goble, The University of Manchester, UK
http://www.mygrid.org.uk http://www.omii.ac.uk

Data pipelines in bioinformatics
Resources/services such as Genscan, EMBL, BLAST and Clustal-W are chained together: query protein sequences are passed to Clustal-W for multiple sequence alignment.

Example in silico experiment: investigate the evolutionary relationships between proteins [Peter Li]
• Manual creation
• Semi-automation using bespoke software
• Issues: volatility of data in the life sciences; data and metadata storage; integration of heterogeneous biological data; visualisation of models; brittleness
[The slides show an excerpt of raw genomic sequence as an illustration of the underlying data.]

Warehouses and Databases
Data warehouse:
• Copy the data sets and combine them into a pre-determined model before query; query that model
• Clean data; refresh; fixed
• High cost, front loaded
• You can only use what has been set up for you
Distributed database integration:
• Marshal the data sets and combine them into a pre-determined model when you query
• Always fresh; map from the model to the databases dynamically
• More flexible, but still depends on the model; high cost
• You can only use what has been set up
[Mark Wilkinson, 2006 BioMOBY]

It would be good if you could systematically automate…
• Make data sets / resources / tools / codes / models accessible to a computer. And cope when they change. And run them where they are hosted…
• …Link resources together. Automate the protocol so I don't have to do it every time I need to repeat the search or re-run the analysis. And do it accurately and systematically every time, without mistakes. And not get bored and sloppy. And be comprehensive too…
• …Rerun it over and over and over again. Automatically. And keep the log of what actually happened. Automatically.
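The wish so far – re-run the same protocol automatically and keep a log of what actually happened – can be caricatured in a few lines of code. The Python sketch below is only an illustration of that idea: the two analysis steps are stubs, and the input identifiers and log format are invented; none of it is part of Taverna.

```python
# Minimal sketch of "re-run the protocol automatically and keep a log of what
# actually happened". Both steps are placeholder stubs, not real services.
import json
import time

def similarity_search(sequence: str) -> list[str]:
    """Placeholder for a remote similarity search (e.g. a BLAST-like service)."""
    return [f"{sequence}-hit1", f"{sequence}-hit2"]

def align(hits: list[str]) -> str:
    """Placeholder for a multiple sequence alignment step."""
    return "alignment(" + ", ".join(hits) + ")"

def run_protocol(sequence: str, log_file) -> str:
    """Run both steps for one input and append a record of each invocation."""
    def log(step: str, detail: dict) -> None:
        log_file.write(json.dumps({"time": time.time(), "step": step,
                                   "detail": detail}) + "\n")
    hits = similarity_search(sequence)
    log("similarity_search", {"input": sequence, "hits": len(hits)})
    result = align(hits)
    log("align", {"inputs": hits})
    return result

if __name__ == "__main__":
    queries = ["P12345", "Q67890"]              # illustrative identifiers only
    with open("run_log.jsonl", "w") as log_file:
        for q in queries:                       # the same protocol, re-run per input
            print(run_protocol(q, log_file))
```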
Manage the results of the protocol: not just the data results, but the evidence for the results, the source of the data, and the log of what you did and why. Helpful when you publish!

It would be good if you could systematically automate…
• …Record this protocol and share it with colleagues. Fiddle with it. Remember what it was two weeks later. Adapt a colleague's or an expert's protocol to suit you. Give your protocol to a colleague…
• And do it in my lab without having to have a lot of systems administrators and developers building databases for me. Or writing Perl. And it runs on my crappy laptop.

And be…
• Un-biased and unambiguous in my science
• Systematic, efficient, scalable
• Flexible, customisable
• Transparent in my scientific method

The Two W's
• Web Services: a technology and standard for exposing code or a database with an API that can be consumed remotely by a third party. Describes how to interact with it.
• Workflows: a general technique for describing and enacting a process. Describes what you want to do, not how you want to do it.

Workflows
A workflow language specifies how bioinformatics processes fit together. The high-level workflow diagram is separated from any lower-level coding – you don't have to be a coder to build workflows. A workflow is a kind of script or protocol that you configure when you run it. It is easier to explain, share, relocate, reuse and repurpose.

myGrid
• http://www.mygrid.org.uk
• UK e-Science pilot project since 2001
• Builds middleware for life scientists that enables them to undertake in silico experiments and to share those experiments and their results
• For individual scientists, in under-resourced labs, who use other people's applications
• Open source; workflows; data flows; ad hoc and exploratory

Taverna Workflow Workbench
http://taverna.sourceforge.net

Trypanosomiasis in Africa
http://www.genomics.liv.ac.uk/tryps/trypsindex.html
Genotype to phenotype: the phenotypic response is investigated using microarrays; genes are captured in the form of expressed genes on the microarray, or evidence is provided through experiment and QTL mapping; around 200 genes are present in the QTL region.

Microarray + QTL [Andy Brass, Steve Kemp, Paul Fisher, 2006]
Key:
A – Retrieve genes in the QTL region
B – Annotate genes with external database IDs
C – Cross-reference the IDs with KEGG gene IDs
D – Retrieve microarray data from the MaxD database
E – For each KEGG gene, get the pathways it is involved in
F – For each pathway, get a description of what it does
G – For each KEGG gene, get a description of what it does
(A rough code sketch of these steps appears after this case study.)

Result [Andy Brass, Steve Kemp, Paul Fisher, 2006]
• Captured the pathways returned by the QTL and microarray workflows over the MaxD microarray database
• Identified a pathway whose correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance
• Manual analysis of the microarray and QTL data had failed to identify this gene as a candidate

Trichuris muris (mouse whipworm) infection [Joanne Pennock, Paul Fisher, 2006]
• Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite
• Manual experimentation: a two-year study of candidate genes left the processes unidentified
• Workflows: the trypanosomiasis cattle workflow was reused without change
• Analysis of the data by a biologist found the processes in a couple of days
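The A–G key above describes the QTL/microarray protocol only in prose. Purely as an illustration of how such a protocol becomes an executable pipeline, here is a minimal Python sketch; every function is a hypothetical stub, and none of the identifiers or return values correspond to the real Taverna workflow or to the actual KEGG and MaxD interfaces.

```python
# Hypothetical sketch of the QTL/microarray protocol (steps A-G above).
# Every function below is a stub standing in for a remote service call.
from typing import Dict, List

def retrieve_genes_in_qtl_region(region: str) -> List[str]:          # step A
    return ["gene1", "gene2"]                      # placeholder result

def annotate_with_external_ids(genes: List[str]) -> Dict[str, str]:  # step B
    return {g: f"EXT:{g}" for g in genes}

def cross_reference_kegg(ext_ids: Dict[str, str]) -> List[str]:      # step C
    return [f"kegg:{g}" for g in ext_ids]

def retrieve_microarray_data(genes: List[str]) -> Dict[str, float]:  # step D
    return {g: 0.0 for g in genes}                 # expression values (stub)

def pathways_for_gene(kegg_id: str) -> List[str]:                    # step E
    return ["path:mmu00001"]

def describe_pathway(pathway_id: str) -> str:                        # step F
    return f"description of {pathway_id}"

def describe_gene(kegg_id: str) -> str:                              # step G
    return f"description of {kegg_id}"

def run_protocol(region: str) -> Dict[str, List[str]]:
    genes = retrieve_genes_in_qtl_region(region)                     # A
    ext_ids = annotate_with_external_ids(genes)                      # B
    kegg_ids = cross_reference_kegg(ext_ids)                         # C
    expression = retrieve_microarray_data(genes)                     # D (not used further here)
    report: Dict[str, List[str]] = {}
    for kegg_id in kegg_ids:
        pathways = pathways_for_gene(kegg_id)                        # E
        report[kegg_id] = [describe_pathway(p) for p in pathways]    # F
        report[kegg_id].append(describe_gene(kegg_id))               # G
    return report

if __name__ == "__main__":
    print(run_protocol("example QTL region"))      # region name is illustrative
```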
Systems biology model construction [Peter Li, Doug Kell, 2006]
• Pull: retrieve and parse data from public databases (BIND) plus in-house data to build the model
• Core SBML model construction workflow
• Visualise the results using routine SBML tools, e.g. SharkView, an interactive SBML viewer

Model construction, post-Taverna
• Captures the scientific process of model construction as workflows
• Workflows are enacted on demand to construct the most up-to-date models using the latest data
• Models are pushed into a data model of choice
• Various ways of visualising the models are provided

Multi-disciplinary uptake
• ~20,000 downloads; users in the US, Singapore, the UK, Europe and Australia
• Systems biology, proteomics, gene/protein annotation, microarray data analysis, medical image analysis, heart simulations, high-throughput screening, phenotypical studies
• Plants, mouse, human; astronomy; Dilbert cartoons

A workflow marketplace
[The slide shows the architecture: the Taverna Workbench and third-party applications and portals (myExperiment, DAS, Utopia), Feta for finding and sharing services, the workflow enactor and its clients, service management, LSIDs, logs, and default or custom metadata and data stores.]

Results management: transparency (KAVE, BAKLAVA)

Provenance (Smart Tea, BioMOBY)
• Who, what, where, when, why, how?
• Context and interpretation
• Logging and debugging
• Reproducibility and repeatability
• Evidence and audit; non-repudiation
• Credit and credibility
• Accurate reuse and interpretation
• Smart re-running; cross-experiment mining
• Just good scientific practice

Tracking
From which Ensembl gene does pathway mmu004620 come?

Workflows over results
Automatically backtrack through the data provenance graph, following "derived from" links between Pathway_id, KEGG_id, UniProt, Ensembl_gene_id and Entrez identifiers.

An open world
• Open-domain services and resources: Taverna accesses 3000+ operations
• Third party: all the major providers (NCBI, DDBJ, EBI, …)
• Enforces NO common data model
• Quality Web Services are considered desirable

If you don't provide a Web Service interface…
• SoapLab (http://www.ebi.ac.uk/soaplab/)
• Java API Consumer: import the Java API of libSBML as workflow components

Shield the scientist: bury the complexity
The workflow enactor hides the differences between processor types: Styx, BioMART, WSRF, plain Web Services, SoapLab, BioMOBY, local Java applications, R, and other enactor client packages.

User interaction [University of Bergen]
• Allows a workflow to call out to an expert human user
• E.g. used to embed the Artemis annotation editor within an otherwise automated genome annotation pipeline

No miracles here.
• Building good workflows …
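The "Workflows over results" slide above reduces provenance backtracking to a diagram. As a rough illustration of what walking such a graph means in code, here is a small Python sketch; the graph contents and most identifiers are invented (only the pathway ID mmu004620 comes from the slides), and this is not Taverna's actual provenance store.

```python
# Illustrative sketch of backtracking "derived from" links in a provenance
# graph. The dictionary below is made up for the example; Taverna's real
# provenance store and identifiers are not shown here.
from typing import Dict, List

# value -> values it was derived from
derived_from: Dict[str, List[str]] = {
    "path:mmu004620": ["kegg:12345"],
    "kegg:12345": ["uniprot:P12345"],
    "uniprot:P12345": ["ensembl:ENSMUSG00000000001"],
}

def backtrack(value: str) -> List[str]:
    """Walk derived-from links until reaching values with no recorded origin."""
    trail = [value]
    frontier = list(derived_from.get(value, []))
    while frontier:
        current = frontier.pop()
        trail.append(current)
        frontier.extend(derived_from.get(current, []))
    return trail

if __name__ == "__main__":
    # Answers "from which Ensembl gene does pathway mmu004620 come?"
    print(" <- ".join(backtrack("path:mmu004620")))
```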
Recommended publications
  • The Design and Realisation of the myExperiment Virtual Research Environment for Social Sharing of Workflows
    The Design and Realisation of the myExperiment Virtual Research Environment for Social Sharing of Workflows. David De Roure (University of Southampton, UK), Carole Goble and Robert Stevens (The University of Manchester, UK). Abstract: In this paper we suggest that the full scientific potential of workflows will be achieved through mechanisms for sharing and collaboration, empowering scientists to spread their experimental protocols and to benefit from those of others. To facilitate this process we have designed and built the myExperiment Virtual Research Environment for collaboration and sharing of workflows and experiments. In contrast to systems which simply make workflows available, myExperiment provides mechanisms to support the sharing of workflows within and across multiple communities. It achieves this by adopting a social web approach which is tailored to the particular needs of the scientist. We present the motivation, design and realisation of myExperiment. Key words: Scientific Workflow, Workflow Management, Virtual Research Environment, Collaborative Computing, Taverna Workflow Workbench. 1 Introduction: Scientific workflows are attracting considerable attention in the community. Increasingly they support scientists in advancing research through in silico experimentation, while the workflow systems themselves are the subject of ongoing research and development (1). The National Science Foundation Workshop on the Challenges of Scientific Workflows identified the potential for scientific advance as workflow systems address more sophisticated requirements and as workflows are created through collaborative design processes involving many scientists across disciplines (2). Rather than looking at the application or machinery of workflow systems, it is this dimension of collaboration and sharing that is the focus of this paper.
  • Data Curation + Process Curation = Data Integration + Science
    BRIEFINGS IN BIOINFORMATICS. VOL 9. NO 6. 506–517. doi:10.1093/bib/bbn034. Advance Access publication December 6, 2008. Data curation + process curation = data integration + science. Carole Goble, Robert Stevens, Duncan Hull, Katy Wolstencroft and Rodrigo Lopez. Submitted: 16th May 2008; Received (in revised form): 25th July 2008. Abstract: In bioinformatics, we are familiar with the idea of curated data as a prerequisite for data integration. We neglect, often to our cost, the curation and cataloguing of the processes that we use to integrate and analyse our data. Programmatic access to services, for data and processes, means that compositions of services can be made that represent the in silico experiments or processes that bioinformaticians perform. Data integration through workflows depends on being able to know what services exist and where to find those services. The large number of services and the operations they perform, their arbitrary naming and lack of documentation, however, mean that they can be difficult to use. The workflows themselves are composite processes that could be pooled and reused, but only if they too can be found and understood. Thus appropriate curation, including semantic mark-up, would enable processes to be found, maintained and consequently used more easily. This broader view on semantic annotation is vital for full data integration that is necessary for the modern scientific analyses in biology. This article will brief the community on the current state of the art and the current challenges for process curation, both within and without the Life Sciences.
  • Data Management in Systems Biology I
    Data management in systems biology I – Overview and bibliography. Gerhard Mayer, University of Stuttgart, Institute of Biochemical Engineering (IBVT), Allmandring 31, D-70569 Stuttgart. Abstract: Large systems biology projects can encompass several workgroups often located in different countries. An overview about existing data standards in systems biology and the management, storage, exchange and integration of the generated data in large distributed research projects is given, the pros and cons of the different approaches are illustrated from a practical point of view, and the existing software (open source as well as commercial) and the relevant literature are extensively overviewed, so that the reader should be enabled to decide which data management approach is best suited for his special needs. An emphasis is laid on the use of workflow systems and of TAB-based formats. The data in this format can be viewed and edited easily using spreadsheet programs which are familiar to working experimental biologists. The use of workflows for standardized access to data in either own or publicly available databanks and the standardization of operation procedures is presented. The use of ontologies and semantic web technologies for data management will be discussed in a further paper. Keywords: MIBBI; data standards; data management; data integration; databases; TAB-based formats; workflows; Open Data. INTRODUCTION: The large amount of data produced by biological research projects grows at a fast rate. The 2009 special edition of the annual Nucleic Acids Research database issue mentions 1170 databases [1] alone; … the foundation of a new journal about biological databases [24], the foundation of the ISB (International Society for Biocuration) and conferences like DILS (Data Integration in the Life Sciences) [25].
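    The abstract above emphasises TAB-based (tab-separated) formats because they can be opened directly in a spreadsheet. As a minimal illustration, the Python sketch below reads such a file with the standard library; the file name and column names are invented and do not come from any particular systems biology standard.

```python
# Read a hypothetical tab-separated measurement table. The file name and its
# columns ("sample", "gene", "expression") are illustrative only.
import csv
from typing import Dict, List

def read_tab_file(path: str) -> List[Dict[str, str]]:
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))

if __name__ == "__main__":
    rows = read_tab_file("measurements.tsv")
    for row in rows[:5]:
        print(row["sample"], row["gene"], row["expression"])
```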
  • myGrid: a Collection of Web Pages
    Home
    The myGrid team produce and use a suite of tools designed to "help e-Scientists get on with science and get on with scientists". The tools support the creation of e-laboratories and have been used in domains as diverse as systems biology, social science, music, astronomy, multimedia and chemistry. The tools have been adopted by a large number of projects and institutions. The team has developed tools and infrastructure to allow:
    • the design, editing and execution of workflows in Taverna
    • the sharing of workflows and related data by myExperiment
    • the cataloguing and annotation of services in BioCatalogue and Feta
    • the creation of user-friendly rich clients such as UTOPIA
    These tools help to form the basis for the team's work on e-Labs.
    Taverna Workbench 2.1.2 available for download
    The myGrid team have released Taverna Workbench 2.1.2. It supports secure access to data on the web, secure web services and secured private workflows. Taverna 2.1.2 includes:
    • Copy/paste, shortcuts, undo/redo, drag and drop
    • Animated workflow diagram
    • Remembers added/removed services
    • Secure Web services support
    • Secure access to resources on the web
    • Up-to-date R support
    • Intermediate values during workflow runs
    • myExperiment integration
    • Excel and CSV spreadsheet support
    Hosted at the School of Computer Science, University of Manchester, UK. Copyright 2008 – All Rights Reserved. Sponsored by Microsoft, EPSRC, BBSRC, ESRC and JISC. From www.mygrid.org.uk, 27 May 2010.
    Outreach
    The myGrid consortium put a lot of effort into reaching out to users. The team do not just "preach to the converted" but seek new users by offering tutorials and giving talks, papers and posters in a very wide variety of places.
  • Hosts: Monash eResearch Centre and MessageLab. Seminar: The Long Tail Scientist. Presenter: Prof Carole Goble, Computer Science
    Hosts: Monash eResearch Centre and MessageLab. Seminar: The Long Tail Scientist. Presenter: Prof Carole Goble, Computer Science, University of Manchester. Venue: Seminar Room 135, Building 26 Clayton. Time and Date: Wed 3 August 2011, 5-6pm. Abstract: Big science with big, coordinated and collaborative programmes – the Large Hadron Collider, the Sloan Sky Survey, the Human Genome and its successor the 1000 Genomes project – hogs headlines and fascinates funders. But this big science makes up a small fraction of research being done. Whilst the big journals – Nature, Science – are often the first to publish breakthrough research, work in a vast array of smaller journals still contributes to scientific knowledge. Every day, PhD students and post-docs are slaving away in small labs building up the bulk of scientific data. In disciplines like chemistry, biology and astronomy they are taking advantage of the multitude of public datasets and analytical tools to make their own investigations. Jim Downing at the Unilever Centre for Molecular Informatics was one of the first to coin the term "Long Tail Science": the idea that large numbers of small research units together make an important contribution. We are not just standing on the shoulders of a few giants but standing on the shoulders of a multitude of the average sized. Ten years ago I started up the myGrid e-Science project (http://www.mygrid.org.uk) specifically to help the long-tail bioinformatician, and later other long-tail scientists from other disciplines. And it turns out that the software, services and methods we develop and deploy (the Taverna workflow system, myExperiment, BioCatalogue, SysMO-SEEK, MethodBox) apply just as well to the big science projects.
  • myExperiment – A Web 2.0 Virtual Research Environment
    myExperiment – A Web 2.0 Virtual Research Environment. David De Roure (Electronics and Computer Science, University of Southampton, UK, [email protected], +44 23 8059 2418) and Carole Goble (School of Computer Science, The University of Manchester, UK, [email protected], +44 161 275 6195). ABSTRACT: e-Science has given rise to new forms of digital object in the Virtual Research Environment which can usefully be shared amongst collaborating scientists to assist in generating new scientific results. In particular, descriptions of Scientific Workflows capture pieces of scientific knowledge which may transcend their immediate application and can be shared and reused in other experiments. We are building the myExperiment Virtual Research Environment to support scientists in sharing and collaboration with workflows and other objects. Rather than adopting traditional methods prevalent in the e-Science developer community, our approach draws upon the social software techniques characterised as Web 2.0. In this paper we report on the preliminary design work of myExperiment. […] …established as a means of automating the processing of scientific data in a scalable and reusable way. The myExperiment Virtual Research Environment (VRE) provides a personalised environment which enables users to share, re-use and repurpose experiments. Our vision is that scientists should be able to swap workflows and other scientific objects as easily as citizens can share documents, photos and videos on the Web. Hence myExperiment owes far more to social networking websites such as MySpace (www.myspace.com) and YouTube (www.youtube.com) than to the traditional portals of Grid computing, and is immediately familiar to the new generation of scientists. Where many e-Science projects have focused on bringing…
  • Taverna Workflow Management System
    http://www.mygrid.org.uk/ http://www.taverna.org.uk/ Taverna. Robert Haines, Stian Soiland-Reyes, myGrid, University of Manchester. [email protected] http://orcid.org/0000-0002-9538-7919. IS-ENES2 workshop on workflows, Hamburg, 2014-06-03. This work is licensed under a Creative Commons Attribution 3.0 Unported License.
    Taverna in Context
    • Comprehensive Scientific Workflow Management System plus auxiliary tools/repositories
    • Based at Manchester with multiple contributions and collaborations
    • Releases: three major; numerous rolling intermediate. First release 2004.
    • Downloads: 90,000+ cumulative; 1000 in the first month per intermediate release; a user audit for May 2013 had 900+ unique addresses using Taverna
    • Users: ~380 sites and institutions have used or use Taverna
    • Support: mailing list, community list and Jira
    Taverna workflows
    • Sophisticated analysis pipelines
    • A set of services to analyze or manage data (local or remote)
    • Workflows run through the workbench or via a server
    • Automation of data flow through services
    • Control of service invocation
    • Iteration over data sets
    • Provenance collection
    • Extensible and open source
    • Dataflow: graphically connect data between drag-and-dropped services
    • Service types: REST, SOAP, command line, web interactions, scripts (R, Python, Beanshell), domain-specific plugins, your tool?
    • Nested workflows
    • Components: reusable and inter-compatible workflow fragments, grouped into families, semantically annotated, curated. Application repositories…
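    The excerpt above lists automation of data flow, control of service invocation and iteration over data sets among Taverna's features. The Python sketch below illustrates that iteration pattern generically; the endpoint URL, query parameter and identifiers are placeholders and do not correspond to any actual Taverna service definition or API.

```python
# Generic illustration of iterating a (placeholder) REST service over a data
# set, as a workflow engine would do for each input item.
from concurrent.futures import ThreadPoolExecutor
from typing import List
import urllib.parse
import urllib.request

SERVICE_URL = "https://example.org/annotate"       # placeholder endpoint

def call_service(item: str) -> str:
    """Invoke the service once for a single input value."""
    query = urllib.parse.urlencode({"id": item})
    with urllib.request.urlopen(f"{SERVICE_URL}?{query}") as response:
        return response.read().decode("utf-8")

def iterate(items: List[str], workers: int = 4) -> List[str]:
    """Apply the service to every item, a few invocations at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(call_service, items))

if __name__ == "__main__":
    print(iterate(["P12345", "P67890"]))           # identifiers are illustrative
```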
  • Bit Brochure.Pdf
    Register by March 5 and Save up to $200! Final Agenda. Enabling Technology. Leveraging Data. Transforming Medicine. Keynote presentations by: John Halamka, M.D., M.S., CIO, Harvard Medical School; Christoph Westphal, M.D., Ph.D., CEO, Sirtris Pharmaceuticals, and Senior Vice President, Centre of Excellence for External Drug Discovery, GlaxoSmithKline. Concurrent tracks: (1) IT Infrastructure – Hardware; (2) IT Infrastructure – Software; (3) Bioinformatics and Next-Gen Data; (4) Systems and Predictive Biology; (5) Cheminformatics and Computer-Aided Modeling; (6) eClinical Trials Technology; (7) eHealth Solutions. Event features: access all seven tracks for one price; network with 1,500+ attendees; hear 100+ technology and scientific presentations; attend Bio-IT World's Best Practices Awards; connect with attendees using CHI's Intro-Net; participate in the poster competition; see the winners of the 2010 Benjamin Franklin, Best of Show and Best Practices awards; view novel technologies and solutions in the expansive exhibit hall; and much more. Keynote panel: The Future of Personal Genomics – a special plenary panel discussion featuring James Heywood, Co-founder and Chairman, PatientsLikeMe; Dan Vorhaus, J.D., M.A., Attorney, Robinson, Bradshaw & Hinson, and Editor, Genomics Law Report; Dietrich Stephan, Ph.D., President & CEO, Ignite Institute; Kevin Davies, Ph.D., Editor-in-Chief, Bio-IT World. Bio-ITWorldExpo.com. Organized and managed by Cambridge Healthtech Institute, 250 First Avenue, Suite 300, Needham, MA 02494. Phone: 781-972-5400. Fax: 781-972-5425. Toll-free in the U.S.
  • Web Workflows for Data Mining in the Cloud
    WEB WORKFLOWS FOR DATA MINING IN THE CLOUD. Janez Kranjc. Doctoral Dissertation, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia. Supervisor: Prof. Dr. Nada Lavrač, Jožef Stefan Institute, Ljubljana, Slovenia, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia. Co-Supervisor: Assoc. Prof. Dr. Marko Robnik Šikonja, Faculty of Computer and Information Science, University of Ljubljana, Slovenia. Evaluation Board: Assist. Prof. Dr. Martin Žnidaršič, Chair, Jožef Stefan Institute, Ljubljana, Slovenia, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia; Assoc. Prof. Dr. Igor Mozetič, Member, Jožef Stefan Institute, Ljubljana, Slovenia, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia; Prof. Dr. Hendrik Blockeel, Member, Katholieke Universiteit Leuven, Leuven, Belgium. Slovenian title: SPLETNI DELOTOKI ZA RUDARJENJE PODATKOV V OBLAKU (Doktorska disertacija). Ljubljana, Slovenia, March 2017. Acknowledgments: This thesis would have never been finished without the help of a number of people. Their support and help throughout my studies are greatly appreciated. First of all I would like to thank my supervisor Nada Lavrač for her infinite amount of patience, enthusiasm, support and her always helpful and insightful research ideas and comments. I am also very grateful to my co-supervisor Marko Robnik Šikonja for the copious amount of
  • Tracking Workflow Execution with TavernaProv
    Tracking workflow execution with TavernaProv. Document id: https://github.com/stain/2016-provweek-tavernaprov/. Stian Soiland-Reyes, Pinar Alper, Carole Goble; eScience Lab, University of Manchester.
    Apache Taverna is a scientific workflow system for combining web services and local tools. Taverna records provenance of workflow runs, intermediate values and user interactions, both as an aid for debugging while designing the workflow and as a record for later reproducibility and comparison. Taverna also records provenance of the evolution of the workflow definition (including a chain of wasDerivedFrom relations), attributions and annotations; for brevity we here focus on how Taverna's workflow run provenance extends PROV and is embedded with Research Objects.
    Data bundle: A workflow run can be exported from Taverna as a workflow data bundle: a Research Object bundle in the form of a ZIP archive that contains the workflow definition (itself a Research Object), annotations, inputs, outputs and intermediate values, and a PROV-O trace of the workflow execution, showing every process execution within the workflow run and linking to the produced and consumed values using relative paths. A workflow run can thus be downloaded as a single file from a Taverna Server, shared (myExperiment, SEEK), published (ROHub, Zenodo), imported into another Taverna Workbench, shown in the Databundle Viewer or modified with the DataBundle API.
    Abstraction levels: PROV is a generic model for describing provenance. While this means there are generally many ways to express the same history in PROV, a scientific workflow run with processors and data values naturally matches the Activity and Entity relations wasGeneratedBy and used.
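    The abstraction-levels paragraph above maps a workflow run onto PROV's Activity and Entity classes via wasGeneratedBy and used. As a minimal, hand-rolled illustration of that pattern (assuming the rdflib library is available; the URIs are invented and this is not the trace Taverna itself emits):

```python
# One PROV Activity (a process execution) that used one Entity and generated
# another. All URIs below are placeholders, not Taverna's real identifiers.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/run/")          # placeholder namespace

g = Graph()
g.bind("prov", PROV)

run = EX["process-execution-1"]    # an Activity: one processor invocation
inp = EX["data/input-1"]           # an Entity consumed by the activity
out = EX["data/output-1"]          # an Entity produced by the activity

g.add((run, RDF.type, PROV.Activity))
g.add((inp, RDF.type, PROV.Entity))
g.add((out, RDF.type, PROV.Entity))
g.add((run, PROV.used, inp))                  # the activity used the input
g.add((out, PROV.wasGeneratedBy, run))        # the output was generated by it

print(g.serialize(format="turtle"))
```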
  • BioUSeR: A Semantic-Based Tool for Retrieving Life Science Web Resources Driven by Text-Rich User Requirements
    Pérez et al. Journal of Biomedical Semantics 2013, 4:12. http://www.jbiomedsem.com/content/4/1/12. Research, Open Access. BioUSeR: a semantic-based tool for retrieving Life Science web resources driven by text-rich user requirements. María Pérez, Rafael Berlanga, Ismael Sanz and María José Aramburu.
    Abstract. Background: Open metadata registries are a fundamental tool for researchers in the Life Sciences trying to locate resources. While most current registries assume that resources are annotated with well-structured metadata, evidence shows that most resource annotations simply consist of informal free text. This reality must be taken into account in order to develop effective techniques for resource discovery in Life Sciences. Results: BioUSeR is a semantic-based tool aimed at retrieving Life Sciences resources described in free text. The retrieval process is driven by the user requirements, which consist of a target task and a set of facets of interest, both expressed in free text. BioUSeR is able to effectively exploit the available textual descriptions to find relevant resources by using semantic-aware techniques. Conclusions: BioUSeR overcomes the limitations of the current registries thanks to: (i) rich specification of user information needs, (ii) use of semantics to manage textual descriptions, and (iii) retrieval and ranking of resources based on user requirements. Keywords: Resources discovery, Semantic annotation, Information retrieval, Life science.
    Background: In this section we introduce the context of our work and our motivation. Then, we summarize the related work and we present the rationale of our proposal. […] …is prodigious, and the sheer amount of available resources to manage research data is a source of severe difficulties. In this work, we consider web resources as any application…
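    This excerpt does not spell out BioUSeR's semantic machinery, so the sketch below uses a deliberately generic stand-in: plain bag-of-words cosine similarity between a free-text requirement and free-text resource descriptions. It only illustrates the idea of requirement-driven ranking and is not the BioUSeR algorithm; the resource names and descriptions are invented.

```python
# Generic bag-of-words ranking of resource descriptions against a free-text
# requirement; a stand-in illustration only, not BioUSeR's semantic method.
from collections import Counter
from math import sqrt
from typing import Dict, List, Tuple

def vectorise(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(requirement: str, resources: Dict[str, str]) -> List[Tuple[str, float]]:
    query = vectorise(requirement)
    scored = {name: cosine(query, vectorise(desc)) for name, desc in resources.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    resources = {                                   # invented examples
        "AlignmentService": "multiple sequence alignment of protein sequences",
        "PathwayLookup": "retrieve metabolic pathway descriptions for genes",
    }
    print(rank("align protein sequences", resources))
```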