Provenance in Data Exploration for Reproducibility and Beyond

Juliana Freire
VisTrails Group & Web and Databases Lab, NYU Poly

Science Today: Data Intensive

[Figure: data sources (simulations, sensors, user studies, particle colliders, the Web, sequencing machines, databases) feeding the cycle Obtain Data, Analyze/Visualize, Publish/Share]

CATT Research Review 2011 Juliana Freire 2

Science and Business Today: Data Intensive

[Figure: the same data sources and the cycle Obtain Data, Analyze/Visualize, Publish/Share]

Science Today: Data + Computing Intensive

[Figure: the same diagram, with the tools AVS, VisTrails, and Taverna added]

Science Today: Incomplete Publications

◆ Publications are just the tip of the iceberg
  - The scientific record is incomplete: too large to fit in a paper
  - Large volumes of data
  - Complex processes
◆ Others can’t (easily) reproduce results
◆ Authors can’t remember all the steps that led to a result

“It’s impossible to verify most of the results that computational scientists present at conferences and in papers.” [Donoho et al., 2009]

“Scientific and mathematical journals are filled with pretty pictures of computational experiments that the reader has no hope of repeating.” [LeVeque, 2009]

“Published documents are merely the advertisement of scholarship whereas the computer programs, input data, parameter values, etc. embody the scholarship itself.” [Schwab et al., 2007]

http://en.wikipedia.org/wiki/Scientific_misconduct
http://ori.dhhs.gov/misconduct/cases/

Science and Reproducibility

Provenance in Science

◆ Not a new issue!
◆ Lab notebooks have been used for a long time
◆ What is new?
  – Large volumes of data
  – Complex analyses: computational processes
◆ Writing notes is no longer an option

[Figure: lab notebook page on DNA recombination by Lederberg, labeled: When, Annotation, Observed data]

The VisTrails System

◆ Workflow-based system for data analysis and visualization
  – Allows multiple tools to be combined into pipelines
◆ Comprehensive provenance infrastructure
◆ Transparently tracks provenance of the discovery process, from data acquisition to visualization
  – The trail followed as users generate and test hypotheses
◆ Leverage provenance to streamline exploration
  – Support for reflective reasoning and collaboration
  – Query and mine provenance
◆ Focus on usability
◆ The system is open source: http://www.vistrails.org
  – Multi-platform: Linux, Mac, Windows
  – Written in Python + Qt
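
The idea of tracking "the trail followed as users generate and test hypotheses" can be illustrated with a small version tree: every edit to a pipeline is stored as an action, and any earlier version can be rebuilt by replaying actions from the root. This is a minimal sketch under stated assumptions, not the VisTrails implementation or API. The names `Action`, `VersionTree`, `add_action`, and `materialize`, the action format, and the module names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One recorded edit to a pipeline (hypothetical format)."""
    version: int
    parent: int          # version this edit was applied to (0 = empty pipeline)
    op: str              # "add_module", "delete_module", or "set_param"
    args: tuple


@dataclass
class VersionTree:
    """Stores every edit as a node, so no earlier version is ever lost."""
    actions: dict = field(default_factory=dict)

    def add_action(self, parent, op, *args):
        version = len(self.actions) + 1
        self.actions[version] = Action(version, parent, op, args)
        return version

    def materialize(self, version):
        """Rebuild a pipeline by replaying actions from the root to `version`."""
        path = []
        while version != 0:
            action = self.actions[version]
            path.append(action)
            version = action.parent
        pipeline = {}  # module name -> parameter dict
        for action in reversed(path):
            if action.op == "add_module":
                pipeline[action.args[0]] = {}
            elif action.op == "delete_module":
                pipeline.pop(action.args[0], None)
            elif action.op == "set_param":
                module, name, value = action.args
                pipeline[module][name] = value
        return pipeline


tree = VersionTree()
v1 = tree.add_action(0, "add_module", "ReadData")
v2 = tree.add_action(v1, "add_module", "Isosurface")
v3 = tree.add_action(v2, "set_param", "Isosurface", "value", 57)
# Branch from v2 to test a different hypothesis without losing v3.
v4 = tree.add_action(v2, "set_param", "Isosurface", "value", 120)

print(tree.materialize(v3))
print(tree.materialize(v4))
```

Because edits are stored rather than final states, both branches stay available and can be compared side by side.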

Applications and user groups:

• Study on the use of TMS for improving memory (Psychiatry, U. Utah)
• Visualizing environmental simulations (CMOP STC)
• Simulation for solid, fluid and structural mechanics (Galileo Network, UFRJ Brazil)
• eBird (Cornell, NSF DataONE)
• Astrophysical Systems (Tohline, LSU)
• Quantum physics simulations (ALPS, ETH Switzerland)
• NIH NBCR (UCSD)
• Climate analysis (CDAT)
• Pervasive Technology Labs (Heiland, Indiana University)
• Habitat modeling (USGS)
• Open Wildland Fire Modeling (U. Colorado, NCAR)
• Linköping University (Sweden)
• High-energy physics (LEPP, Cornell)
• University of North Carolina, Chapel Hill
• Cosmology simulations (LANL)
• UTEP

Demo

Provenance Beyond Reproducibility

◆ Support for reflective reasoning
◆ Ability to compare data products
◆ Explore parameter spaces and compare results
◆ Support for collaboration

[Freire et al., IPAW 2006] [Ellkvist et al., IPAW 2008]
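
Exploring a parameter space amounts to running the same pipeline over a grid of parameter values and keeping each result together with the settings that produced it, so that results can be compared afterwards. A minimal sketch in plain Python follows. The `explore` helper, the toy `pipeline` function, and its parameter names are hypothetical and are not part of VisTrails.

```python
from itertools import product


def explore(pipeline, grid):
    """Run `pipeline` once per combination of values in `grid`.

    Returns a list of (parameters, result) pairs so each result stays
    attached to the settings that produced it.
    """
    names = sorted(grid)
    runs = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        runs.append((params, pipeline(**params)))
    return runs


def pipeline(threshold, scale):
    """Toy stand-in for an analysis: count scaled samples above a threshold."""
    samples = [1, 2, 3, 4, 5]
    return sum(1 for s in samples if s * scale > threshold)


runs = explore(pipeline, {"threshold": [2, 4], "scale": [1, 2]})
for params, result in runs:
    print(params, "->", result)

# Compare results: which setting keeps the most samples?
best_params, best_result = max(runs, key=lambda run: run[1])
```

Keeping the parameters next to each result is what makes the later comparison possible without rerunning anything.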

Provenance Enabling 3rd-Party Tools

Autodesk Maya

ParaView

VisIt

ImageVis3d

[Callahan et al., IPAW 2008]

Provenance Plugin for ParaView

http://www.cs.utah.edu/~juliana/videos/paraview_plugin.avi

Provenance Plugin for

http://www.cs.utah.edu/~juliana/videos/paraview_plugin.avi

Provenance-Rich Publications

http://www.crowdlabs.org/vistrails/medleys/details/2/

Reproducible Publications: Benefits

◆ Produce more re-usable knowledge, not just text
◆ Allow scientists to stand on the shoulders of giants, and on their own shoulders!
◆ Science can move faster
  – http://www.nytimes.com/2011/06/26/opinion/sunday/26ideas.html?_r=1
◆ Higher-quality publications
  – Authors will be more careful
  – Many eyes to check results
◆ Describe more of the discovery process: learn from successes and mistakes
◆ Expose the scientific community to different techniques and tools: expedite their training, reduce time to insight
◆ More impact, more citations (?)

A Reproducible Paper: ALPS2.0

The ALPS project release 2.0: Open source software for strongly correlated systems
[Bauer et al., JSTAT 2011], arXiv:1101.2646v4 [cond-mat.str-el], 23 May 2011
http://adsabs.harvard.edu/abs/2011arXiv1101.2646B

[Figure: pages of the ALPS 2.0 paper: the title page with authors and affiliations, Figure 3, Figure 4, and the start of Section 9, "Tutorials and Examples"]

Figure 3. In this example we show a data collapse of the Binder Cumulant in the classical Ising model. The data has been produced by remotely run simulations and the critical exponent has been obtained with the help of the VisTrails parameter exploration functionality.

    cat > parm << EOF
    LATTICE="chain lattice"
    MODEL="spin"
    local_S=1/2
    L=60
    J=1
    THERMALIZATION=5000
    SWEEPS=50000
    ALGORITHM="loop"
    {T=0.05;}
    {T=0.1;}
    {T=0.2;}
    {T=0.3;}
    {T=0.4;}
    {T=0.5;}
    {T=0.6;}
    {T=0.7;}
    {T=0.75;}
    {T=0.8;}
    {T=0.9;}
    {T=1.0;}
    {T=1.25;}
    {T=1.5;}
    {T=1.75;}
    {T=2.0;}
    EOF

    parameter2xml parm
    loop --auto-evaluate --write-xml parm.in.xml

Figure 4. A shell script to perform an ALPS simulation to calculate the uniform susceptibility of a Heisenberg spin chain. Evaluation options are limited to viewing the output files. Any further evaluation requires the use of Python, VisTrails, or a program written by the user.

Some Videos

Editing an executable paper written using LaTeX and VisTrails http://www.vistrails.org/download/download.php?type=MEDIA&id=executable_paper_latex.mov

Exploring a Web-hosted paper using server-based computation http://www.vistrails.org/download/download.php?type=MEDIA&id=executable_paper_server.mov

An interactive paper on a Wiki* http://www.vistrails.org/index.php/User:Tohline/CPM/Levels2and3

Reproducible Papers

An interactive paper on a Wiki* http://www.vistrails.org/index.php/User:Tohline/CPM/Levels2and3

The ALPS 2.0 paper http://adsabs.harvard.edu/abs/2011arXiv1101.2646B

Research: Provenance Analytics

◆ Opportunity for knowledge discovery, sharing and re-use
◆ Query information
  – Understand processes and data dependencies
  – Find useful workflows, e.g., given a piece of data or task, which workflow should we run?
◆ Mine information
  – Discover interesting patterns (e.g., common workflow patterns) → recommendation system, discover analogies
  – Identify homogeneous workflow groups by clustering → organize collections [Santos et al., IPAW 2008]
  – Infer workflow specification from execution log [Aalst et al., TKDE 2004]
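
A query about data dependencies, such as "what does this result depend on?", reduces to walking a provenance graph backwards from a data product. The sketch below assumes provenance is stored as a mapping from each item to the items it was directly derived from. The graph contents and the helpers `lineage` and `affected_by` are hypothetical illustrations, not a VisTrails query interface.

```python
def lineage(derived_from, item):
    """Return everything `item` depends on, directly or indirectly."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def affected_by(derived_from, source):
    """Inverse query: which items would change if `source` changed?"""
    return {item for item in derived_from if source in lineage(derived_from, item)}


# Hypothetical provenance: each entry maps an item to its direct inputs.
derived_from = {
    "figure.png": ["render"],
    "render": ["isosurface", "colormap.json"],
    "isosurface": ["simulation.dat"],
    "table.csv": ["simulation.dat"],
}

print(sorted(lineage(derived_from, "figure.png")))
print(sorted(affected_by(derived_from, "simulation.dat")))
```

The same structure answers both directions of the question: what a result was built from, and what needs to be regenerated when an input changes.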

The Need for Guidance in Workflow Design

VisComplete: A Workflow Recommendation System

◆ Mine provenance collection: identify graph fragments that co-occur in a collection of workflows
◆ Predict sets of likely workflow additions to a given partial workflow
◆ Similar to a Web browser suggesting URL completions

[Koop et al., IEEE Vis 2008]
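
The recommendation idea can be illustrated with a much simpler stand-in than the graph-fragment mining of [Koop et al., IEEE Vis 2008]: count how often one module directly follows another across a collection of workflows, then suggest the most frequent successors of the last module in a partial workflow. This sketch deliberately simplifies (single edges instead of larger fragments, linear workflows instead of graphs), and the module names and the helpers `mine_successors` and `suggest` are hypothetical.

```python
from collections import Counter, defaultdict


def mine_successors(workflows):
    """Count, for each module, which modules directly follow it."""
    successors = defaultdict(Counter)
    for workflow in workflows:
        for current, following in zip(workflow, workflow[1:]):
            successors[current][following] += 1
    return successors


def suggest(successors, partial, k=2):
    """Suggest up to k likely next modules for a partial workflow."""
    if not partial:
        return []
    ranked = successors[partial[-1]].most_common()
    return [module for module, _ in ranked if module not in partial][:k]


collection = [
    ["Reader", "Contour", "Mapper", "Renderer"],
    ["Reader", "Contour", "Smooth", "Mapper", "Renderer"],
    ["Reader", "Threshold", "Mapper", "Renderer"],
    ["Reader", "Contour", "Mapper", "Renderer"],
]

successors = mine_successors(collection)
print(suggest(successors, ["Reader"]))
print(suggest(successors, ["Reader", "Contour"]))
```

As with URL completion, the suggestions are only as good as the collection they were mined from, which is why a shared provenance repository matters.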

Exploring Provenance Trails

◆ Task provenance provides insights regarding
  – Complexity and nature: number of actions; structural vs. parameter changes; task duration
  – User confusion: a large branching factor means lots of trial-and-error steps
◆ Very detailed (and honest!) feedback: instructors/managers can leverage this information
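
The measures listed above can be computed directly from a recorded trail. The sketch below assumes each recorded action has an id, a parent id, and a kind ("structural" or "parameter"). This record format and the `trail_metrics` helper are hypothetical, not the format used in [Lins et al., SSDBM 2008], and task duration is left out because it would need timestamps.

```python
from collections import Counter


def trail_metrics(actions):
    """Summarize a provenance trail.

    `actions` is a list of (action_id, parent_id, kind) tuples, where
    parent_id is None for the first action.
    """
    kinds = Counter(kind for _, _, kind in actions)
    children = Counter(parent for _, parent, _ in actions if parent is not None)
    # Branching factor: average number of children over actions that have any.
    branching = sum(children.values()) / len(children) if children else 0.0
    return {
        "actions": len(actions),
        "structural": kinds["structural"],
        "parameter": kinds["parameter"],
        "max_branches": max(children.values(), default=0),
        "avg_branching": branching,
    }


# Hypothetical trail: action 2 was revisited three times (trial and error).
trail = [
    (1, None, "structural"),
    (2, 1, "structural"),
    (3, 2, "parameter"),
    (4, 2, "parameter"),
    (5, 2, "structural"),
    (6, 5, "parameter"),
]

metrics = trail_metrics(trail)
print(metrics)
```

A high `max_branches` value points to the step where a user repeatedly backed up and tried again, which is the kind of signal an instructor could follow up on.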

[Lins et al., SSDBM 2008]

Conclusions and Future Work

◆ Provenance is crucial for science and an enabler for executable papers
◆ Provenance must be at the center of the scientific process!
◆ Built an end-to-end solution based on VisTrails; currently working on integrating the infrastructure with other systems
  – Provenance-enabling other tools
◆ Many challenges and several open research questions
◆ Great opportunity to have impact in science and industry

Additional Information

◆ The VisTrails System
  http://www.vistrails.org
◆ An infrastructure to support the creation, review and re-use of reproducible papers
  http://www.vistrails.org/index.php/ExecutablePapers

Acknowledgments

◆ Thanks to: Philippe Bonnet, Dennis Shasha, Matthias Troyer, Claudio Silva and the VisTrails team
◆ This work is partially supported by the National Science Foundation, the Department of Energy, and IBM Faculty Awards.

Merci Ευχαριστω Thank you Obrigada

Vision: Provenance-Rich Science

[Figure: the Obtain Data, Analyze/Visualize, Collaborate, and Publish/Share stages, shown with provenance repositories]
