Helping Scientists do Science: Examples from Confessions of an Applied Computer Scientist

Professor CBE FREng FBCS The , UK [email protected] and the myGrid Team http://www.mygrid.org.uk

ACM womENcourage Europe, 01 March 2014, Manchester, UK

e-Science, Computational Science Scientific
• Support global scientific collaboration, enable large-scale sharing of resources, tools and results, assist scientific processing, and avoid unnecessary repeated work.
• Accelerate scientific discovery, improve scientific productivity, and stimulate technological innovation.
• Cope with the scale and speed of scientific innovation and data.

http://research.microsoft.com/en-us/collaboration/fourthparadigm/

Computational, data intensive problems
• Drug discovery, small molecules, targets, compounds: OpenPHACTS
• Next Generation Genome Sequencing based Patient Diagnostics: Eagle
• Models of Human Physiology: VPH-Share
• Managed worlds / in the wild

• Ecological Niche and Population Modelling: BioVeL
• Micro-Organisms data & model management: SysMO
• Astronomy & HelioPhysics analytical pipelines: HELIO, Wf4ever
• Document Preservation, Digitisation: SCAPE
• Metagenomics: Ocean Sampling Day

Distributed Computing

Linking up different codes, resources, platforms & e-infrastructure.

Knowledge Computing

Describing, finding and linking up different data, models, methods, science stuff…

Social Computing

Sharing different science stuff. Collaborations between different scientists.

Science

Computer Scientific Informatics Software Science Computational Science Engineering

THEORY APPLICATION PRACTICE fundamental applied

PRINCIPLE “USE CASE” PRODUCT (Open Source)

Biodiversity
marine monitoring and health assessment
ecological niche modelling
Enclosed sea problem
Pilumnus hirtellus (Ready et al., 2010)

Data Intensive Science Collaborative Science

Sarah Bourlat

Lots of different resources

http://www.catalogueoflife.org/

Lots of different software

Including other researcher’s software

Zeeya Merali, Computational science: ...Error…why scientific programming does not compute. Nature 467, 775-777 (2010) | doi:10.1038/467775a

Aleksandra Pawlik

Devasena Inupakutika

Ghaithaa Manla

Data collection

Data discovery

Data assembly, cleaning, and refinement

Ecological Niche Modeling

Statistical analysis

Insights

Scholarly Communication & Reporting

Analytical cycle

• Volume
• Variety
  – Integrative Multi-*
  – Multi-step, repetitive process
• Volatile and Velocity
  – evolving, reanalysis
• Variant
  – Comparable: sweep across data & parameters
  – different experiments
• Valid
  – Reporting & Replication

E.Science laboris

Scientific Workflow Management Systems
data, parameters, configurations
• Coordinate execution of services and codes
• Dataflow at scale
• Reusable variants
• Comparable repetitions
• Import own data / codes + public libraries/datasets
• Honour hosted codes
• Shield operational complexity
• Auto-document provenance
• Package up dependencies
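The workflow bullets above (coordinate execution, dataflow, auto-documented provenance) can be illustrated with a toy example. This is only a minimal sketch, not how Taverna or any real workflow system is implemented: the `run_workflow` function, the step names and the provenance record layout are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(steps, inputs):
    """Run steps in dataflow order and record a provenance entry per step.

    steps:  {output_name: (function, [names of the values it consumes])}
    inputs: {name: value} supplied by the scientist
    """
    values = dict(inputs)
    provenance = []
    graph = {name: set(deps) for name, (_func, deps) in steps.items()}
    for name in TopologicalSorter(graph).static_order():
        if name in values:
            continue  # a workflow input, nothing to execute
        func, deps = steps[name]
        values[name] = func(*(values[d] for d in deps))
        # Hypothetical provenance record: which step used which values.
        provenance.append({"step": name, "used": list(deps), "generated": values[name]})
    return values, provenance


# A two-step pipeline: clean the records, then summarise them.
steps = {
    "cleaned": (lambda records: [r for r in records if r is not None], ["records"]),
    "mean": (lambda cleaned: sum(cleaned) / len(cleaned), ["cleaned"]),
}
values, provenance = run_workflow(steps, {"records": [2.0, None, 4.0]})
```

Re-running with different `inputs` gives a comparable repetition, and the `provenance` list is produced as a side effect of execution rather than written by hand.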

E.Science laboris

• Visual Programming
• Computational Lambda Calculus
• Process mining
• Adaptive & parallel computing
• Cloud computing
• SOA, Semantic Web Services
• Automated wrapping of codes
• Data integration, knowledge modelling
• Reporting & tracking
• …..

Tools, Services, Standards

Security
Design tools and practices for mortals
Shielding vs Obfuscation
Guided assembly. Auto assembly.
Fragility: changes in infrastructures & resources, automated adaption
Provenance
Reproducible executions…
Packaging, preservation & portability
Workflows as commodities

• How, What, Where, When, Why, Who
• Trace lineage, Process history, Accountability
• The link between computation and results
• Transparency

[Figure: (i) Trace A, (ii) Trace B] [Woodman et al, 2011]

Social

[Cheney, 2012]

Provenance Week, June 9-13, 2014, Cologne
http://provenanceweek.dlr.de

Mind the Provenance Gap

Fine grain; Summarisation; What do I cite?; Big; Labelling; What did I do?; A White box; Distillation; N Black boxes

One System; Many Systems; Special tools; My Lab Book; Collection; Analytics; A Big Graph; Smart in situ Presentation

Pinar Alper, Sarah Cohen-Boulakia, Juliana Freire, Susan Davidson

Primacy of Method (a la Code)

What code was run? – which executable?

Which options did you set? What was the input data?

Where can I get hold of the code / script / workflow?

How does it work? What are its assumptions?

Who authored it? How do I version it? How do I cite it? What’s its licence? How do I get credit for it?

How fragile is it? How do we repair it?
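Several of the questions above (what code was run, which options were set, what the input data was, who authored it, what its licence is) can be answered by writing a small record at run time. The following is a minimal sketch only: the `describe_run` function and its field names are hypothetical, not part of any tool named in these slides.

```python
import hashlib
import sys
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Content hash, so the exact code and input can be identified later."""
    return hashlib.sha256(data).hexdigest()


def describe_run(script_source: bytes, options: dict, input_data: bytes,
                 author: str, licence: str) -> dict:
    """Build a record of one execution (hypothetical field names)."""
    return {
        "code_sha256": sha256_of(script_source),     # what code was run?
        "options": dict(sorted(options.items())),    # which options were set?
        "input_sha256": sha256_of(input_data),       # what was the input data?
        "author": author,                            # who authored it?
        "licence": licence,                          # what is its licence?
        "python": sys.version.split()[0],            # part of the environment
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


record = describe_run(
    script_source=b"print('niche model')",
    options={"threshold": 0.5, "iterations": 100},
    input_data=b"species,lat,lon\n",
    author="A. Scientist",
    licence="MIT",
)
```

Hashing the code and data does not say where to get them or how they work, but it does make it possible to check later that two runs really used the same method and input.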

Primacy of Methods

Systems Biology
Sharing and interlinking Methods, Models, Data…

Metadata External Databases Data

Article Model

Social Computation
Storing, Sharing and Reusing data, methods, models, between collaborating and competing scientists
e-Laboratories, collaboratories, VREs, repositories

An ego-system
experimentalists, modellers, X-informaticians, computational Xs, software engineers, computer scientists, systems administrators, resource providers, tool builders, social scientists, librarians, curators

“Startup-Like” Balance Innovation with Usefulness

Knowledge Turns amongst Scientists

[Josh Sommer]

E.Science Sociam

• HCI, Human Factors
• Security
• Data and Knowledge management
• Distributed Computing
• Digital Preservation
• Social Machines
• Information Systems
• Social Science

Platforms, Services, Standards, Policies/Practices

Scientists Share Strategically and Sparingly

Sharing Creep
Data Hugging
Data Flirting
Data Voyeurism

Tools

Collaborating to Compete

Cost Reward Risk

Social Engineer Computer Scientist

Software Engineer

credit is like ♥ not £$€¥

• Universal identity
• Inter-platform tracking
• Auto-tracking
• Credit recommendation
• Credit recognition
• Standards
• Tools
• Socio-Technical development
• Credit for Developers!!

Rebecca Lawrence, Victoria Stodden, Kaitlin Thaney, Anita De Waard, Liz Lyon, Heather Piwowar, Katy Borner, Christine Borgman

Describing X well enough to share it, find it, understand it, reuse it, combine it with Y & Z

X, Y, Z = data, models, methods, workflows, services, codes, *

Knowledge Computation
• Accurate, intelligible and comparable descriptions
• Data interoperability
• Machine readable metadata

Semantic technologies, Ontologies, Linked Data, Data schema

Semantic Description [Taheriyan et al, adapted]
Describing and linking data in terms of shared concepts, relationships and identifiers

[Figure: Ontology with classes Person, Place, City, State, Organization, Event and properties bornIn, nearby, birthdate, livesIn, isPartOf, name, organizer, location, postalCode, ceo, worksFor, state, title, endDate, phone, startDate. Legend: object property, data property, subClassOf.]

Data
Column 1          Column 2   Column 3   Column 4       Column 5
Bill Gates        Oct 1955              Seattle        WA
Mark Zuckerberg   May 1984   Facebook   White Plains   NY
Larry Page        Mar 1973   Google     East Lansing   MI

Environment Ontology
Shared, controlled, structured vocabulary for biomes, environmental features, and environmental materials.
Common source of names and synonyms for matching, linking, searching, indexing, structuring data

Web Ontology Language OWL
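The semantic description slide above, where table columns are described in terms of shared ontology concepts and relationships, can be sketched in a few lines. This is an illustration under stated assumptions: the column-to-property mapping (Column 4 as the City a Person was `bornIn`, Column 3 as the Organization they `worksFor`) is a guess at the figure's intent, the identifier scheme is hypothetical, and plain tuples stand in for a real RDF library.

```python
# Two complete rows from the example table.
rows = [
    ("Mark Zuckerberg", "May 1984", "Facebook", "White Plains", "NY"),
    ("Larry Page", "Mar 1973", "Google", "East Lansing", "MI"),
]


def slug(text):
    """Turn a label into a simple identifier fragment (hypothetical scheme)."""
    return text.lower().replace(" ", "-")


def row_to_triples(row):
    """Describe one table row as (subject, property, object) triples."""
    name, birthdate, organization, city, state = row
    person_id = "person:" + slug(name)
    org_id = "org:" + slug(organization)
    city_id = "city:" + slug(city)
    state_id = "state:" + slug(state)
    return [
        (person_id, "rdf:type", "Person"),
        (person_id, "name", name),
        (person_id, "birthdate", birthdate),
        (person_id, "worksFor", org_id),   # assumed mapping for Column 3
        (person_id, "bornIn", city_id),    # assumed mapping for Column 4
        (org_id, "rdf:type", "Organization"),
        (org_id, "name", organization),
        (city_id, "rdf:type", "City"),
        (city_id, "name", city),
        (city_id, "state", state_id),
        (state_id, "rdf:type", "State"),
        (state_id, "name", state),
    ]


triples = [triple for row in rows for triple in row_to_triples(row)]
```

Once every dataset is expressed against the same classes, properties and identifiers, linking two datasets reduces to matching identifiers rather than guessing what "Column 3" means.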

E.Science Semantii

• Database theory
• Query Answering
• Description Logics
• Reasoners
• Resources
• Automated annotation
• Data integration & Search
• Crowd sourcing knowledge
• Knowledge elicitation

Tools, Standards

Pay as you Go Integration
Smart Search
Changes in data & metadata
Scalability
Security
Adding Semantics to Data
Capturing metadata
Crowd sourced Annotation
Rich knowledge representation and reasoning
Semantic ETL pipelines

Curation Knowledge Ramps

Katy Wolstencroft
Populous
http://www.rightfield.org.uk
http://www.economist.com/printedition/2013-10-19

Born Reproducible | Exchangeable | Reusable

Rich descriptions

Transparent Method

Open & Available

Re-executable

Lemberger T, Mol Syst Biol 2014;10:715. ©2014 by European Molecular Biology Organization

Research Objects
• Bundle and relate multi-hosted digital resources of a scientific experiment or investigation using standard mechanisms
• Exchange, Releasing paradigm for publishing

Jun Zhao

http://www.researchobject.org/

Research is like software. Release research
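The Research Object idea, bundling and relating resources that live on different hosts, can be sketched as a small manifest. This is not the Research Object specification's actual format: the `make_research_object` function, the field names, the relation names and the example.org URLs are all hypothetical and chosen only to illustrate the idea.

```python
import json


def make_research_object(title, resources, relations):
    """Bundle resource descriptions and the relations between them.

    resources: list of {"id", "kind", "url"} dicts (hypothetical fields)
    relations: list of (subject_id, predicate, object_id) tuples
    """
    known = {resource["id"] for resource in resources}
    for subject, _predicate, obj in relations:
        missing = {subject, obj} - known
        if missing:
            raise ValueError(f"relation refers to unknown resources: {sorted(missing)}")
    return {
        "title": title,
        "aggregates": list(resources),
        "relations": [
            {"subject": s, "predicate": p, "object": o} for s, p, o in relations
        ],
    }


research_object = make_research_object(
    title="Niche model of an enclosed sea species",
    resources=[
        {"id": "data", "kind": "dataset", "url": "https://example.org/occurrences.csv"},
        {"id": "workflow", "kind": "workflow", "url": "https://example.org/niche-model.wf"},
        {"id": "result", "kind": "result", "url": "https://example.org/suitability-map.png"},
    ],
    relations=[
        ("workflow", "used", "data"),
        ("result", "wasGeneratedBy", "workflow"),
    ],
)
manifest_text = json.dumps(research_object, indent=2)
```

The resources stay where they are hosted; only the manifest is exchanged or released, much as a software release pins the versions of its dependencies.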

Jennifer Schopf, Treating Data Like Software: A Case for Production Quality Data, JCDL 2012

"To be a proper professional you need to think about the context and motivation and justifications of what you're doing... once you see how important computing is for life you can't just leave it as a blank box and assume that somebody reasonably competent and relatively benign will do something right with it."

Karen Spärck Jones

IEEE Spectrum, Computer Science, A Woman's Work, May 2007

• myGrid – http://www.mygrid.org.uk
• Taverna – http://www.taverna.org.uk
• myExperiment – http://www.myexperiment.org
• BioCatalogue – http://www.biocatalogue.org
• Biodiversity Catalogue – http://www.biodiversitycatalogue.org
• Seek – http://www.seek4science.org
• Rightfield – http://www.rightfield.org.uk
• Open PHACTS – http://www.openphacts.org
• Wf4ever – http://www.wf4ever-project.org
• Software Sustainability Institute – http://www.software.ac.uk
• BioVeL – http://www.biovel.eu
• Force11 – http://www.force11.org