Distributed Computing Environments and Workflows for Systems
Total Page:16
File Type:pdf, Size:1020Kb
Going with the Flow Distributed Computing for Systems Biology Using Taverna Prof Carole Goble The University of Manchester, UK http://www.mygrid.org.uk http://www.omii.ac.uk Data pipelines in bioinformatics Genscan Resources /Services EMBL BLAST Clustal-W Multiple Query Protein sequences Clustal-W sequence alignment 2 Example in silico experiment: Investigate© the evolutionary relationships between proteins [Peter Li] z Manual creation z Semi-automation using bespoke software 12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt z Issues: 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt 12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt z Volatility of data12541 ttcttataag in tctgtggtttlife ttatattaatsciences gtttttattg atgactgttt tttacaattg 12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga 12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc 12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa z Data and metadata12781 taacccattt tctgtctcta storage tggatttgcc tgttctggat attcatatta atagaatcaa z Integration of heterogeneous biological data z Visualisation of models z Brittleness 3 © 12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg Warehouses and WarehousesDatabases and WarehousesDatabases and Databases 4 © Data Warehouse z Copy the data sets z Combine them into a pre- determined model before query z Query that model z Clean data z Refresh, Fixed z High cost, Front loaded z You can only use what has been set up for you. 5 © Distributed Database Integration z Marshal the data sets z Combine it into a pre- determined model when you query z Always fresh z Map from model to databases dynamically z More flexible but still depends on model z High cost z You can only use what has been set up 6 © 12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg Warehouses and WarehousesDatabases and WarehousesDatabases and Databases [Mark Wilkinson, 2006 BioMOBY] 7 © It would be good if you could systematically automate… Make data sets / resources / tools / codes / models accessible to a computer. And cope when they change And run them where they are hosted …. 8 © It would be good if you could systematically automate… ….Link together resources Automate the protocol so I don’t have to do it every time I need to repeat the search or re-run the analysis. And do it accurately and systematically every time without mistakes. And not get bored and sloppy. And be comprehensive too…. 9 © It would be good if you could systematically automate… …Rerun it over and over and over and over again. Automatically. And keep the log of what actually happened. Automatically. Manage the results of the protocol. Not just the data results, but the evidence for the results, the source of the data, the log of what you did and why. Helpful when you publish!.... 10 © It would be good if you could systematically automate… …Record this protocol, share it with colleagues. Fiddle with it. Remember what it was 2 weeks later. Adapt a colleague’s or expert’s to suit you Give your protocol to a colleague… 11 © It would be good if you could systematically automate… And do it in my lab without having to have a lot of systems administrators and developers building databases for me. Or writing Perl. And it runs on my crappy laptop. 12 © And be … z Un-biased and Unambiguous in my science z Systematic z Efficient z Scalable z Flexible z Customisable z Transparent in my scientific method 13 © The Two W’s z Web Services z Technology and standard for exposing code / database with an API that can be consumed by a third party remotely. z Describes how to interact with it. z Workflows z General technique for describing and enacting a process z Describes what you want to do, not how you want to do it 14 © Workflows Workflow language specifies how bioinformatics processes fit together. High level workflow diagram separated from any lower level coding – you don’t have to be a coder to build workflows. Workflow is a kind of script or protocol that you configure when you run it. Easier to explain, share, relocate, reuse and repurpose. 15 © myGrid z myGrid http://www.mygrid.org.uk z UK e-Science pilot project since 2001 z Build middleware for Life Scientists that enables them to undertake in silico experiments and share those experiments and their results. z Individual scientists, in under-resourced labs, who use other people’s applications. z Open source. z Workflows. z Data flows. Ad hoc & exploratory 16 © Taverna Workflow Workbench 17 http://taverna.sourceforge.net© Trypanosomiasis in Africa 18 © http://www.genomics.liv.ac.uk/tryps/trypsindex.html Genotype Phenotype 200 ? Phenotypic response investigated using microarray Genes captured in in form of expressed genes microarray or evidence provided through experiment and QTL mapping 19 present in QTL © region Microarray + QTL [Andy Brass, Steve Kemp, Paul Fisher, 2006] Key: A – Retrieve genes in QTL region B – Annotate genes with external database Ids C – Cross-reference Ids with KEGG gene ids D – Retrieve microarray data from MaxD database E – For each KEGG gene get the pathways it’s involved in F – For each pathway get a description of what it does G – For each KEGG gene get a description of what it does 20 [Andy Brass, Steve Kemp, © Paul Fisher, 2006] Result z Captured the pathways returned by QTL and Microarray workflows over the MaxD microarray database z Identified a pathway for which its correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance. z Manually analysis on the microarray and QTL data had failed to identify this gene as a candidate. 21 © [Andy Brass, Steve Kemp, Paul Fisher, 2006] Trichuris muris (mouse whipworm) infection z Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite. z Manual experimentation: Two year study of candidate genes, processes unidentified z Workflows: trypanosomiasis cattle experiment, was reused without change. z Analysis of the data by a biologist found the processes in a couple of days. 22 © [Joanne Pennock, Paul Fisher, 2006] Pull Retrieve and parse data fromPublic BIND Databases + inHouse Data = Model Core SBML model construction workflow 23 © [Peter Li, Doug Kell, 2006] Visualise results using routine SBML tools SharkView – interactive SBML viewer 24 © [Peter Li, Doug Kell, 2006] Model construction: Post-Taverna z Captures the scientific process of model construction as workflows z Workflows enacted ‘on demand’ to construct most up-to-date models using the latest data z Models are pushed into a data model of choice z Provide various ways of visualising models 25 © Multi-disp. z ~20000 downloads z Users in US, Singapore, UK, Europe, Australia, z Systems biology z Proteomics z Gene/protein annotation z Microarray data analysis z Medical image analysis z Heart simulations z High throughput screening z Phenotypical studies z Plants, Mouse, Human z Astronomy z Dilbert Cartoons 26 © A workflow 27 marketplace © Finding and Taverna Workbench 3rd Party Sharing Tools Applications and myExperiment DAS Portals Utopia Feta Workflow Enactor Workflow enactor ClientsClients Service Management LSIDs Log Default Custom Meta Data Store data Store 28 © Results KAVE BAKLAVA Management Transparency 29 © Provenance Smart Tea z Who, What, Where, When, Why?, How? z Context z Interpretation z Logging & Debugging z Reproducibility and repeatability z Evidence & Audit BioMOBY z Non-repudiation z Credit z Credibility z Accurate reuse and interpretation z Smart re-running z Cross experiment mining z Just good scientific practice 30 © Tracking From which Ensembl gene does pathway come from mmu004620 ? 31 © Workflows over Results Automatically backtrack through the data provenance graph Entrez dF dF dF dF Pathway_id KEGG_id Uniprot Ensembl_gene_id 32 © An Open World z Open domain services and resources. z Taverna accesses 3000+ operation. z Third party. z All the major providers z NCBI, DDBJ, EBI … z Enforce NO common data model. z Quality Web Services considered desirable . 33 © If you don’t provide a Web Service Interface… z SoapLab z Java API Consumer import Java API of libSBML as workflow components http://www.ebi.ac.uk/soaplab/ 34 © Shield the Scientist Bury the complexity Workflow enactor Processor Processor Processor Processor Processor Processor Processor Styx Processor ... Bio WSRF Plain Soap Bio Local Enactor Styx R ... MART Web lab MOBY Java client package 35 Service © App User Interaction z Allows a workflow to call out to an expert human user z E.g. Used to embed the Artemis annotation editor within an otherwise automated genome annotation pipeline 36 © [University of Bergen] No miracles here. z Building good workflows