An Introduction to Taverna Workflows What Is Mygrid?
Total Page:16
File Type:pdf, Size:1020Kb
An Introduction to Taverna Workflows Katy Wolstencroft myGrid University of Manchester What is myGrid? • myGrid is a suite components to support in silico experiments in biology • Taverna workbench = myGrid user interface • Originally designed to support bioinformatics Expanded into new areas: Chemoinformatics Health Informatics Medical Imaging Integrative Biology • Open source – and always will be History EPSRC funded UK eScience Program Pilot Project OMII Open Middleware Infrastructure Institute • University of Manchester (myGrid) joined with the Universities of Edinburgh (OGSA-DAI) and Southampton (OMII phase 1) in March 2006 • OMII-UK aims to provide software and support to enable a sustained future for the UK e-Science community and its international collaborators. • A guarantee of development and support The Life Science Community In silico Biology is an open Community • Open access to data • Open access to resources • Open access to tools • Open access to applications Global in silico biological research The Community Problems • Everything is Distributed – Data, Resources and Scientists • Heterogeneous data • Very few standards – I/O formats, data representation, annotation – Everything is a string! Integration of data and interoperability of resources is difficult Lots of Resources NAR 2007 – 968 databases Traditional Bioinformatics 12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt 12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt 12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg 12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga 12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc 12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa 12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa Cutting and Pasting • Advantages: – Low Technology on both server and client side – Very Robust: Hard to break. – Data Integration happens along the way • Disadvantages: – Time Consuming (and painful!) • Can be repeated rarely • Limited to small data sets. – Error Prone: • Poor repeatability How do you do this for a genome/proteome/metabolome of information! Pipeline Programming • Advantages – Repeatable – Allows automation – Quick, reliable, efficient • Disadvantages – Requires programming skills – Difficult to modify – Requires local tool and database installation – Requires tool and database maintenance!!! What we want as a solution A system that is: • Allows automation • Allows easy repetition, verification and sharing of experiments • Works on distributed resource • Requires few programming skills • Runs on a local desktop / laptop myGrid as a solution myGrid allows the automated orchestration of in silico experiments over distributed resources from the scientist’s desktop Built on computer science technologies of: • Web services • Workflows • Semantic web technologies Web Services Web services support machine-to-machine interaction over a network. Note: NOT the same as services on the web Web services are a: – technology and standard for exposing code / databases with an API that can be consumed by a third party remotely. – describes how to interact with it. They are: • Self-contained • Self-describing • Modular • Platform independent Workflows – General technique for describing and enacting a process – Describes what you want to do, not how you want to do it – High level description of the experiment Repeat GenScan Blast Masker Web Service Web service Web Service Workflows Workflow language specifies how bioinformatics processes fit together. High level workflow diagram separated from any lower level coding – you don’t have to be a coder to build workflows. Workflow is a kind of script or protocol that you configure when you run it. Easier to explain, share, relocate, reuse and repurpose. Workflow <=> Model Workflow is the integrator of knowledge The METHODS section of a scientific publication Workflow Advantages • Automation – Capturing processes in an explicit manner – Tedium! Computers don’t get bored/distracted/hungry/impatient! – Saves repeated time and effort • Modification, maintenance, substitution and personalisation • Easy to share, explain, relocate, reuse and build • Releases Scientists/Bioinformaticians to do other work • Record – Provenance: what the data is like, where it came from, its quality – Management of data (LSID - Life Science Identifiers) Different Workflow Systems •Kepler • Triana • DiscoveryNet • Taverna • Geodise • Pegasus • Pipeline Pilot Each has differences in action, language, access restrictions, subject areas Taverna Workflow Components Web Service e.g. DDBJ BLAST Scufl Simple Conceptual Unified Flow Language SOAPLAB Taverna Writing, running workflows & examining results Any Application SOAPLAB Makes applications available Web Service An Open World •Opendomain services and resources. • Taverna accesses 3000+ services • Third party – we don’t own them – we didn’t build them • All the major providers – NCBI, DDBJ, EBI … • Enforce NO common data model. • Quality Web Services considered desirable Adding your own web services • SoapLab • Java API Consumer import Java API of libSBML as workflow components http://www.ebi.ac.uk/soaplab/ Services Landscape Shield the Scientist – Bury the Complexity Taverna Application Workbench Scufl Model Simple Conceptual Unified Flow Language Workflow Execution Workflow enactor Processor Processor Processor Processor Processor Processor Processor Styx Processor ... Bio WSRF Plain Soap Bio Local Enactor Styx R ... MART Web lab MOBY Java client package Service App What can you do with myGrid? • ~37000 downloads •Users worldwide US, Singapore, UK, Europe, Australia • Systems biology •Proteomics • Gene/protein annotation • Microarray data analysis • Medical image analysis • Heart simulations • High throughput screening • Genotype/Phenotype studies • Health Informatics • Astronomy • Chemoinformatics • Data integration Trypanosomiasis in Africa Steve Kemp Andy Brass Paul Fisher http://www.genomics.liv.ac.uk/tryps/trypsindex.html Trypanosomiasis Study • A form of Sleeping sickness in cattle – Known as n’gana • Caused by Trypanosoma brucei • Can we breed cattle resistant to n’gana infection? • What are the causes of the differences between resistant and susceptible strains? Trypanosomiasis Study Understanding Phenotype • Comparing resistant vs susceptible strains – Microarrays Understanding Genotype • Mapping quantitative traits – Classical genetics QTL Need to access microarray data, genomic sequence information, pathway databases AND integrate the results Genotype Phenotype 200 ? Genes captured in microarray Phenotypic response investigated experiment and present in QTL using microarray in form of region expressed genes or evidence provided through QTL mapping Microarray + QTL Key: A – Retrieve genes in QTL region B – Annotate genes with external database Ids C – Cross-reference Ids with KEGG gene ids D – Retrieve microarray data from MaxD database E – For each KEGG gene get the pathways it’s involved in F – For each pathway get a description of what it does G – For each KEGG gene get a description of what it does Results • Identified a pathway for which its correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance. • Manual analysis on the microarray and QTL data had failed to identify this gene as a candidate. Why was the Workflow Approach Successful? • Workflow analysed each piece of data systematically – Eliminated user bias and premature filtering of datasets and results leading to single sided, expert-driven hypotheses • The size of the QTL and amount of the microarray data made a manual approach impractical • Workflows capture exactly where data came from and how it was analysed • Workflow output produced a manageable amount of data for the biologists to interpret and verify – “make sense of this data” -> “does this make sense?” Trichuris muris (mouse whipworm) infection parasite model of the human parasite - Trichuris trichuria) • Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite. • Manual experimentation: Two year study of candidate genes, processes unidentified • Workflows: trypanosomiasis cattle experiment was reused without change. • Analysis of the resulting data by a biologist found the processes in a couple of days. Joanne Pennock, Richard Grencis University of manchester Workflow Reuse – Workflows are Scientific Protocols – Share them! Addisons Disease SNP design Protein annotation Microarray analysis A workflow marketplace A Practical Guide to Building and Managing in silico Experiments Semantic Web Technologies • myGrid built on Web Services, Workflows AND semantic web technologies • Semantic web technologies are used to: – Find appropriate services during workflow design – Find similar workflows for reuse and repurposing – Record the process and outcome of an experiment, in context ->>>> the experimental provenance Finding Services There are over 3000 distributed services. How do we find an appropriate one? Find services by their function instead of their name • We need to annotate services