Helping Scientists Do Science: Examples from Confessions of an Applied Computer Scientist

Professor Carole Goble CBE FREng FBCS
The University of Manchester, UK
and the myGrid Team
http://www.mygrid.org.uk
ACM womENcourage Europe, 01 March 2014, Manchester, UK

e-Science, Computational Science, Scientific Computing
• Support global scientific collaboration, enable large-scale sharing of resources, tools and results, assist scientific processing, avoid unnecessary repeated work.
• Accelerate scientific discovery, improve scientific productivity, stimulate technological innovation.
• Cope with the scale and speed of scientific innovation and data.
http://research.microsoft.com/en-us/collaboration/fourthparadigm/

Computational, data intensive problems
Managed worlds / in the wild
• Drug discovery, small molecules, targets, compounds: OpenPHACTS
• Next Generation Genome Sequencing based Patient Diagnostics: Eagle Genomics
• Models of Human Physiology: VPH-Share
• Ecological Niche and Population Modelling: BioVeL
• Systems Biology of Micro-Organisms, data & model management: SysMO
• Astronomy & HelioPhysics analytical pipelines: HELIO, Wf4ever
• Document Preservation, Digitisation: SCAPE
• Metagenomics: Ocean Sampling Day

Distributed Computing
Linking up different codes, resources, platforms & e-infrastructure.

Knowledge Computing
Describing, finding and linking up different data, models, methods, science stuff…

Social Computing
Sharing different science stuff. Collaborations between different scientists.
[Diagram: Computer Science, Computational Science, Scientific Informatics and Software Engineering, spanning THEORY, APPLICATION and PRACTICE (fundamental to applied): PRINCIPLE, “USE CASE”, PRODUCT (Open Source)]

Biodiversity
marine monitoring and health assessment
ecological niche modelling
Enclosed sea problem, Pilumnus hirtellus (Ready et al., 2010)
Data Intensive Science, Collaborative Science
Sarah Bourlat

Lots of different resources
http://www.catalogueoflife.org/

Lots of different software
Including other researchers’ software
Zeeya Merali, “Computational science: ...Error… why scientific programming does not compute”, Nature 467, 775-777 (2010), doi:10.1038/467775a
Aleksandra Pawlik, Devasena Inupakutika, Ghaithaa Manla

Analytical cycle
• Data collection
• Data discovery
• Data assembly, cleaning, and refinement
• Ecological Niche Modeling
• Statistical analysis
• Insights
• Scholarly Communication & Reporting

Analytical cycle
• Volume
• Variety
  – Integrative Multi-*
  – Multi-step, repetitive process
• Volatile and Velocity
  – evolving, reanalysis
• Variant
  – Comparable: sweep across data & parameters
  – different experiments.
• Valid
  – Reporting & Replication

E.Science laboris
Scientific Workflow Management Systems
data, parameters, configurations
• Coordinate execution of services and codes.
• Dataflow at scale
• Reusable variants
• Comparable repetitions
• Import own data / codes + public libraries/datasets
• Honour hosted codes
• Shield operational complexity
• Auto-document provenance
• Package up dependencies

E.Science laboris
Tools, Services, Standards
• Visual Programming
• Computational Lambda Calculus
• Process mining
• Adaptive & parallel computing
• Cloud computing
• SOA, Semantic Web Services
• Automated wrapping of codes
• Data integration, knowledge modelling
• Reporting & tracking
• …..

Security
Design tools and practices for mortals
Shielding vs Obfuscation
Auto assembly. Guided assembly
Fragility, automated adaption
Changes in infrastructures & resources
Provenance
Reproducible executions…
Packaging, preservation & portability
Workflows as commodities

Provenance
• How, What, Where, When, Why, Who
• Trace lineage, Process history, Accountability
• The link between computation and results
• Transparency
[Figure: (i) Trace A, (ii) Trace B] [Woodman et al, 2011]
Social [Cheney, 2012]
Provenance Week, June 9-13, 2014, Cologne
http://provenanceweek.dlr.de

Mind the Provenance Gap
[Diagram labels: Fine grain, Big, A White box, One System, Special tools, A Big Graph; Summarisation, Labelling, Distillation; What do I cite?, What did I do?, N Black boxes, Many Systems, My Lab Book, Collection Analytics, Smart in situ Presentation]
Pinar Alper, Sarah Cohen-Boulakia, Juliana Freire, Susan Davidson

Primacy of Method (a la Code)
What code was run? Which executable?
Which options did you set?
What was the input data?
Where can I get hold of the code / script / workflow?
How does it work?
What are its assumptions?
Who authored it?
How do I version it?
How do I cite it?
What’s its licence?
How do I get credit for it?
How fragile is it?
How do we repair it?
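
Several of the questions above (what code was run, which options were set, what the input data was, who authored it, what its licence is) can be answered mechanically at run time. The following is a minimal sketch of such a record, not the approach of any tool named in the talk. The function names, field names and all example values are hypothetical, and only the Python standard library is assumed.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


def make_provenance_record(code_name, code_version, code_bytes,
                           options, input_bytes, author, licence):
    """Collect answers to the 'what was run?' questions in one dictionary."""
    return {
        # What code was run, and which version of it?
        "code": {
            "name": code_name,
            "version": code_version,
            "sha256": sha256_hex(code_bytes),
        },
        # Which options did you set? (sorted so records are comparable)
        "options": dict(sorted(options.items())),
        # What was the input data? (identified by its digest)
        "input_sha256": sha256_hex(input_bytes),
        # Who authored it, and what is its licence?
        "author": author,
        "licence": licence,
        # Where and when was it run?
        "environment": {
            "python": platform.python_version(),
            "platform": platform.platform(),
        },
        "run_at": datetime.now(timezone.utc).isoformat(),
    }


# All values below are made up for illustration.
record = make_provenance_record(
    code_name="niche_model.py",
    code_version="0.1.0",
    code_bytes=b"print('hypothetical model code')",
    options={"resolution_km": 10, "algorithm": "example"},
    input_bytes=b"species,lat,lon\nspecies_a,0.0,0.0\n",
    author="A. Researcher",
    licence="MIT",
)
print(json.dumps(record, indent=2))
```

Hashing the code and the input, rather than storing only their file names, lets two runs be compared even after files have been moved or renamed. The remaining questions (how it works, what it assumes, how to cite or repair it) still need human-written description.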
Primacy of Methods
Systems Biology: Sharing and interlinking Methods, Models, Data…
[Diagram labels: Metadata, External Databases, Data, Article, Model]

Social Computation
Storing, Sharing and Reusing data, methods, models, between collaborating and competing scientists
e-Laboratories, collaboratories, VREs, repositories

An ego-system
experimentalists, modellers, X-informaticians, computational Xs, software engineers, computer scientists, systems administrators, resource providers, tool builders, social scientists, librarians, curators

“Startup-Like”
Balance Innovation with Usefulness
Knowledge Turns amongst Scientists [Josh Sommer]

E.Science Sociam
Platforms, Services, Standards, Policies/Practices
• HCI, Human Factors
• Security
• Data and Knowledge management
• Distributed Computing
• Digital Preservation
• Social Machines
• Information Systems
• Social Science

Scientists Share Strategically and Sparingly
Sharing Creep
Data Hugging
Data Flirting
Data Voyeurism
Tools
Collaborating to Compete
Cost, Reward, Risk

Social Engineer, Computer Scientist, Software Engineer

credit is like ♥ not £$€¥
• Universal identity
• Inter-platform tracking
• Auto-tracking
• Credit recommendation
• Credit recognition
• Standards
• Tools
• Socio-Technical development
• Credit for Developers!!
Rebecca Lawrence, Victoria Stodden, Kaitlin Thaney, Anita De Waard, Liz Lyon, Heather Piwowar
Katy Borner, Christine Borgman

Describing X well enough to share it, find it, understand it, reuse it, combine it with Y & Z
X, Y, Z = data, models, methods, workflows, services, codes, *

Knowledge Computation
• Accurate, intelligible and comparable descriptions
• Data interoperability
• Machine readable metadata
Semantic technologies, Ontologies, Linked Data, Data schema

Semantic Description [Taheriyan et al, adapted]
Describing and linking data in terms of shared concepts, relationships and identifiers

[Figure: Ontology with classes Person, Place, City, State, Organization, Event; properties bornIn, nearby, birthdate, livesIn, isPartOf, name, organizer, location, postalCode, ceo, worksFor, state, title, endDate, phone, startDate; legend: object property, data property, subClassOf]

Data
Column 1          Column 2   Column 3    Column 4       Column 5
Bill Gates        Oct 1955   Microsoft   Seattle        WA
Mark Zuckerberg   May 1984   Facebook    White Plains   NY
Larry Page        Mar 1973   Google      East Lansing   MI

Environment Ontology
Shared, controlled, structured vocabulary for biomes, environmental features, and environmental materials.
Common source of names and synonyms for matching, linking, searching, indexing, structuring data
Web Ontology Language OWL

E.Science Semantii
Tools, Resources, Standards
• Database theory
• Query Answering
• Description Logics
• Reasoners
• Artificial Intelligence
• Automated annotation
• Data integration & Search
• Crowd sourcing knowledge
• Knowledge elicitation

Smart Search
Pay as you Go Integration
Changes in data & metadata
Scalability
Security
Adding Semantics to Data
Capturing metadata
Rich knowledge representation and reasoning
Crowd sourced Annotation
Semantic ETL pipelines
Curation
Knowledge Ramps
Katy Wolstencroft
Populous
http://www.rightfield.org.uk

http://www.economist.com/printedition/2013-10-19

Born Reproducible | Exchangeable | Reusable
Rich descriptions
Transparent Method
Open & Available
Re-executable
Lemberger T, Mol Syst Biol 2014;10:715, ©2014 by European Molecular Biology Organization

Research Objects
• Bundle and relate multi-hosted digital resources of a scientific experiment or investigation using standard mechanisms
• Exchange, Releasing paradigm for publishing
Jun Zhao
http://www.researchobject.org/

Research is like software. Release research
Jennifer Schopf, Treating Data Like Software: A Case for Production Quality Data, JCDL 2012

"To be a proper professional you need to think about the context and motivation and justifications of what you're doing... once you see how important computing is for life you can't just leave it as a blank box and assume that somebody reasonably competent and relatively benign will do something right with it."
Karen Spärck Jones, IEEE Spectrum, "Computer Science, A Woman's Work", May 2007

• myGrid – http://www.mygrid.org.uk
• Taverna – http://www.taverna.org.uk
• myExperiment – http://www.myexperiment.org
• BioCatalogue – http://www.biocatalogue.org
• Biodiversity Catalogue – http://www.biodiversitycatalogue.org
• Seek – http://www.seek4science.org
• Rightfield – http://www.rightfield.org.uk
• Open PHACTS – http://www.openphacts.org
• Wf4ever – http://www.wf4ever-project.org
• Software Sustainability Institute – http://www.software.ac.uk
• BioVeL – http://www.biovel.eu
• Force11 – http://www.force11.org
