Helping Scientists do Science
Examples from Confessions of an Applied Computer Scientist
Professor Carole Goble CBE FREng FBCS
The University of Manchester, UK
and the myGrid Team
http://www.mygrid.org.uk
ACM womENcourage Europe, 01 March 2014, Manchester, UK
e-Science, Computational Science, Scientific Computing
• Support global scientific collaboration, enable large scale resource, tools and results sharing, assist scientific processing, avoid unnecessary repeated work.
• Accelerate scientific discovery, improve scientific productivity, stimulate technological innovation.
• Cope with the scale and speed of scientific innovation and data.
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
Computational, data intensive problems
• Drug discovery: small molecules, targets, compounds (OpenPHACTS)
• Next Generation Genome Sequencing based Patient Diagnostics (Eagle Genomics)
• Models of Human Physiology (VPH-Share)
• Ecological Niche and Population Modelling (BioVeL)
• Systems Biology of Micro-Organisms: data & model management (SysMO)
• Astronomy & HelioPhysics: analytical pipelines (HELIO, Wf4ever)
• Document Preservation, Digitisation (SCAPE)
• Metagenomics (Ocean Sampling Day)
managed worlds / in the wild
Distributed Computing
Linking up different codes, resources, platforms & e-infrastructure.
Knowledge Computing
Describing, finding and linking up different data, models, methods, science stuff…
Social Computing
Sharing different science stuff. Collaborations between different scientists.
[Figure: Science alongside Computer Science, Scientific Informatics, Computational Science and Software Engineering, on a spectrum from THEORY (fundamental) through APPLICATION to PRACTICE (applied)]
PRINCIPLE, “USE CASE”, PRODUCT (Open Source)
Biodiversity: marine monitoring and health assessment, ecological niche modelling
Enclosed sea problem: Pilumnus hirtellus (Ready et al., 2010)
Data Intensive Science Collaborative Science
Sarah Bourlat
Lots of different resources
http://www.catalogueoflife.org/
Lots of different software
Including other researchers’ software
Zeeya Merali, Nature 467, 775-777 (2010) | doi:10.1038/467775a
Computational science: ...Error…why scientific programming does not compute.
Aleksandra Pawlik
Devasena Inupakutika
Ghaithaa Manla
Data collection
Data discovery
Data assembly, cleaning, and refinement
Ecological Niche Modeling
Statistical analysis
Insights
Scholarly Communication & Reporting
Analytical cycle
• Volume
• Variety
  – Integrative Multi-*
  – Multi-step, repetitive process
• Volatile and Velocity
  – evolving, reanalysis
• Variant
  – Comparable: sweep across data & parameters
  – different experiments.
• Valid
  – Reporting & Replication
E.Science laboris
Scientific Workflow Management Systems
data, parameters, configurations
• Coordinate execution of services and codes.
• Dataflow at scale
• Reusable variants
• Comparable repetitions
• Import own data / codes + public libraries/datasets
• Honour hosted codes
• Shield operational complexity
• Auto-document provenance
• Package up dependencies
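A minimal sketch of what "coordinate execution", "reusable variants" and "auto-document provenance" can mean in practice. It is not how Taverna or any particular workflow system is implemented. The step functions, the latitude threshold and the occurrence records are all invented toy values for illustration.

```python
import hashlib
import json


def clean(records):
    """Drop records with missing coordinates (stand-in for data cleaning)."""
    return [r for r in records if r["lat"] is not None and r["lon"] is not None]


def niche_model(records, threshold):
    """Toy 'model' (an assumption): keep occurrences at or above a latitude."""
    return [r for r in records if r["lat"] >= threshold]


def fingerprint(obj):
    """Short, stable hash of any JSON-serialisable value."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def run_workflow(records, threshold, provenance):
    """Run the steps in order and log each step's parameters, input and output."""
    steps = [
        ("clean", clean, {}),
        ("niche_model", niche_model, {"threshold": threshold}),
    ]
    data = records
    for name, step, params in steps:
        result = step(data, **params)
        provenance.append({
            "step": name,
            "params": params,
            "input": fingerprint(data),
            "output": fingerprint(result),
        })
        data = result
    return data


# Invented toy occurrence records, not real observations.
occurrences = [
    {"species": "example", "lat": 54.1, "lon": 10.2},
    {"species": "example", "lat": None, "lon": 11.0},
    {"species": "example", "lat": 57.3, "lon": 11.9},
]

# Comparable repetitions: the same workflow swept across a parameter.
provenance = []
results = {t: run_workflow(occurrences, t, provenance) for t in (50.0, 56.0)}
for t, kept in results.items():
    print(t, len(kept))
```

Because every step's input and output are fingerprinted, two runs of the sweep can be compared step by step: identical fingerprints mean the runs only diverge later in the pipeline.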
E.Science laboris
• Visual Programming
• Computational Lambda Calculus
• Process mining
• Adaptive & parallel computing
• Cloud computing
• SOA, Semantic Web Services
• Automated wrapping of codes
• Data integration, knowledge modelling
• Reporting & tracking
• …
Tools, Services, Standards
Security
Design tools and practices for mortals
Shielding vs Obfuscation
Guided assembly. Auto assembly.
Fragility: changes in infrastructures & resources, automated adaption
Provenance
Reproducible executions…
Packaging, preservation & portability
Workflows as commodities
• How, What, Where, When, Why, Who
• Trace lineage, Process history, Accountability
• The link between computation and results
• Transparency
[Figure: two provenance traces, (i) Trace A and (ii) Trace B, over data items and services] [Woodman et al, 2011]
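A rough sketch of comparing two provenance traces, in the spirit of the Trace A / Trace B figure. It is not the method of Woodman et al. Each trace is assumed to be a list of (source, target) edges, and the node names and graph shapes below are hypothetical.

```python
def diff_traces(trace_a, trace_b):
    """Return the edges unique to each trace and the edges they share."""
    a, b = set(trace_a), set(trace_b)
    return {"only_a": a - b, "only_b": b - a, "shared": a & b}


# Hypothetical traces: an edge links a data item to the service that
# consumed it, or a service to the data item it produced.
trace_a = [("d1", "S0"), ("S0", "w"), ("w", "S2"), ("S2", "y")]
trace_b = [("d1", "S0"), ("S0", "w"), ("w", "S2_v2"), ("S2_v2", "y_v2")]

delta = diff_traces(trace_a, trace_b)
print(sorted(delta["shared"]))
print(sorted(delta["only_b"]))
```

The shared edges show where the two executions agree. The remaining edges point at the step where the result started to differ, which is the link between computation and results that the bullets above ask for.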
Social
[Cheney, 2012]
Provenance Week, June 9-13, 2014, Cologne
http://provenanceweek.dlr.de
Mind the Provenance Gap
[Figure: on one side Fine grain, Big, A White box, One System, Special tools, Collection, A Big Graph; on the other What do I cite?, What did I do?, N Black boxes, Many Systems, My Lab Book, Smart in situ; between them Summarisation, Labelling, Distillation, Analytics, Presentation]
Pinar Alper, Sarah Cohen-Boulakia, Juliana Freire, Susan Davidson
Primacy of Method (à la Code)
What code was run? – which executable?
Which options did you set? What was the input data?
Where can I get hold of the code / script / workflow?
How does it work? What are its assumptions?
Who authored it? How do I version it? How do I cite it? What’s its licence? How do I get credit for it?
How fragile is it? How do we repair it?
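Several of the questions above (what was run, with which options, on what input, by whom, under what licence) can be answered by recording a small manifest at run time. The following is a sketch under assumptions: the field names, the script name, the options and the input content are all hypothetical, and no standard manifest format is implied.

```python
import hashlib
import platform
import sys
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Checksum that identifies exactly which input bytes were used."""
    return hashlib.sha256(data).hexdigest()


def run_manifest(script, options, inputs, author, licence):
    """Record what ran, with which options, on which data, and who owns it."""
    return {
        "script": script,
        "interpreter": sys.executable,
        "python_version": platform.python_version(),
        "options": dict(options),
        "inputs": {name: sha256_of(content) for name, content in inputs.items()},
        "author": author,
        "licence": licence,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical values, for illustration only.
manifest = run_manifest(
    script="niche_model.py",
    options={"threshold": 56.0},
    inputs={"occurrences.csv": b"species,lat,lon\nexample,54.1,10.2\n"},
    author="A. Scientist",
    licence="MIT",
)
print(sorted(manifest))
```

Storing the manifest next to the results gives a later reader something concrete to cite, version and repair.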
Primacy of Methods
Systems Biology
Sharing and interlinking Methods, Models, Data…
[Figure: interlinked Article, Model, Data, Metadata and External Databases]
Social Computation
Storing, Sharing and Reusing data, methods, models, between collaborating and competing scientists
e-Laboratories, collaboratories, VREs, repositories
An ego-system: experimentalists, modellers, X-informaticians, computational Xs, software engineers, computer scientists, systems administrators, resource providers, tool builders, social scientists, librarians, curators
“Startup-Like” Balance Innovation with Usefulness
Knowledge Turns amongst Scientists
[Josh Sommer]
E.Science Sociam
• HCI, Human Factors
• Security
• Data and Knowledge management
• Distributed Computing
• Digital Preservation
• Social Machines
• Information Systems
• Social Science
Platforms, Services, Standards, Policies/Practices
Scientists Share Strategically and Sparingly
Sharing Creep
Data Hugging
Data Flirting
Data Voyeurism
Tools
Collaborating to Compete
Cost Reward Risk
Social Engineer Computer Scientist
Software Engineer
credit is like ♥ not £$€¥
• Universal identity
• Inter-platform tracking
• Auto-tracking
• Credit recommendation
• Credit recognition
• Standards
• Tools
• Socio-Technical development
• Credit for Developers!!
Rebecca Lawrence, Victoria Stodden, Kaitlin Thaney, Anita De Waard, Liz Lyon, Heather Piwowar, Katy Borner, Christine Borgman
Describing X well enough to share it, find it, understand it, reuse it, combine it with Y & Z
X, Y, Z = data, models, methods, workflows, services, codes, *
Knowledge Computation
• Accurate, intelligible and comparable descriptions
• Data interoperability
• Machine readable metadata
Semantic technologies, Ontologies, Linked Data, Data schema
Semantic Description [Taheriyan et al, adapted]
Describing and linking data in terms of shared concepts, relationships and identifiers
[Figure: an ontology with classes Person, Place, City, State, Organization and Event; object and data properties bornIn, nearby, birthdate, livesIn, isPartOf, name, organizer, location, postalCode, ceo, worksFor, state, title, phone, startDate, endDate; and subClassOf links, mapped onto the columns of a data table]
Data
Column 1          Column 2   Column 3    Column 4       Column 5
Bill Gates        Oct 1955   Microsoft   Seattle        WA
Mark Zuckerberg   May 1984   Facebook    White Plains   NY
Larry Page        Mar 1973   Google      East Lansing   MI
Environment Ontology: shared, controlled, structured vocabulary for biomes, environmental features, and environmental materials.
Common source of names and synonyms for matching, linking, searching, indexing, structuring data
Web Ontology Language OWL
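A small sketch of describing the table rows in terms of shared concepts and relationships. The class and property names (Person, City, name, birthdate, worksFor, bornIn, state) come from the ontology figure, but which column maps to which property is an assumption made here. The identifier scheme (person:, city:, org:) is hypothetical, and this is not the mapping procedure of Taheriyan et al.

```python
rows = [
    ["Bill Gates", "Oct 1955", "Microsoft", "Seattle", "WA"],
    ["Mark Zuckerberg", "May 1984", "Facebook", "White Plains", "NY"],
    ["Larry Page", "Mar 1973", "Google", "East Lansing", "MI"],
]


def row_to_triples(row):
    """Turn one table row into (subject, property, value) triples."""
    # Assumed column meanings: name, birthdate, employer, birth city, state.
    name, birthdate, organization, city, state = row
    person = "person:" + name.replace(" ", "_")
    place = "city:" + city.replace(" ", "_")
    return [
        (person, "type", "Person"),
        (person, "name", name),
        (person, "birthdate", birthdate),
        (person, "worksFor", "org:" + organization),
        (person, "bornIn", place),
        (place, "type", "City"),
        (place, "name", city),
        (place, "state", state),
    ]


triples = [t for row in rows for t in row_to_triples(row)]
for triple in triples[:5]:
    print(triple)
```

Once the columns are tied to named properties, data from another table that uses the same property names can be matched, linked and searched without knowing its column order.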
E.Science Semantii
• Database theory
• Query Answering
• Description Logics
• Reasoners
• Artificial Intelligence
• Automated annotation
• Data integration & Search
• Crowd sourcing knowledge
• Knowledge elicitation
Tools, Resources, Standards
Pay as you Go Integration
Smart Search
Changes in data & metadata
Scalability
security
Adding Semantics to Data
Capturing metadata
Crowd sourced Annotation
Rich knowledge representation and reasoning
Semantic ETL pipelines
Curation Knowledge Ramps
Katy Wolstencroft
Populous
http://www.rightfield.org.uk
http://www.economist.com/printedition/2013-10-19
Born Reproducible | Exchangeable | Reusable
Rich descriptions
Transparent Method
Open & Available
Re-executable
Lemberger T, Mol Syst Biol 2014;10:715. ©2014 by European Molecular Biology Organization
Research Objects
• Bundle and relate multi-hosted digital resources of a scientific experiment or investigation using standard mechanisms
• Exchange, Releasing paradigm for publishing
Jun Zhao
http://www.researchobject.org/
Research is like software. Release research
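To make "bundle and relate multi-hosted digital resources" concrete, here is a toy manifest that aggregates resources by URI and records how they relate. It is only a sketch: the field names (aggregates, relations, role) and the example.org URIs are hypothetical, and this is not the format defined at researchobject.org.

```python
import json


def make_research_object(identifier, title, resources, relations):
    """Bundle references to resources hosted elsewhere, plus their relations."""
    known = {uri for uri, _role in resources}
    for source, _relation, target in relations:
        # A relation may only link resources that the bundle aggregates.
        if source not in known or target not in known:
            raise ValueError(f"relation refers to unknown resource: {source} -> {target}")
    return {
        "id": identifier,
        "title": title,
        "aggregates": [{"uri": uri, "role": role} for uri, role in resources],
        "relations": [
            {"source": s, "relation": r, "target": t} for s, r, t in relations
        ],
    }


# Hypothetical resources hosted in three different places.
ro = make_research_object(
    identifier="https://example.org/ro/1",
    title="Niche modelling experiment",
    resources=[
        ("https://example.org/workflows/42", "workflow"),
        ("https://data.example.org/occurrences.csv", "input data"),
        ("https://papers.example.org/article.pdf", "article"),
    ],
    relations=[
        ("https://example.org/workflows/42", "used",
         "https://data.example.org/occurrences.csv"),
        ("https://papers.example.org/article.pdf", "describes",
         "https://example.org/workflows/42"),
    ],
)
print(json.dumps(ro, indent=2))
```

The bundle stores references, not copies, so it can be exchanged and released like a software package while the resources stay where they are hosted.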
Jennifer Schopf, Treating Data Like Software: A Case for Production Quality Data, JCDL 2012
"To be a proper professional you need to think about the context and motivation and justifications of what you're doing... once you see how important computing is for life you can't just leave it as a blank box and assume that somebody reasonably competent and relatively benign will do something right with it."
Karen Spärck Jones, IEEE Spectrum, Computer Science, A Woman's Work, May 2007
• myGrid – http://www.mygrid.org.uk
• Taverna – http://www.taverna.org.uk
• myExperiment – http://www.myexperiment.org
• BioCatalogue – http://www.biocatalogue.org
• Biodiversity Catalogue – http://www.biodiversitycatalogue.org
• Seek – http://www.seek4science.org
• Rightfield – http://www.rightfield.org.uk
• Open PHACTS – http://www.openphacts.org
• Wf4ever – http://www.wf4ever-project.org
• Software Sustainability Institute – http://www.software.ac.uk
• BioVeL – http://www.biovel.eu
• Force11 – http://www.force11.org