Making Computations and Publications Reproducible with VisTrails

Reproducible Research for Scientific Computing

Juliana Freire and Claudio T. Silva
Polytechnic Institute of New York University

Computing in Science & Engineering, July/August 2012. 1521-9615/12/$31.00 © 2012 IEEE. Copublished by the IEEE CS and the AIP. This article has been peer-reviewed.

The VisTrails system supports the creation of reproducible experiments. VisTrails integrates data acquisition, derivation, analysis, and visualization as executable components throughout the scientific exploration process, and through systematic provenance capture, it makes it easier to generate and share reproducible results. Using VisTrails, authors can link results to their provenance, reviewers can assess the experiment’s validity, and readers can repeat and utilize the computations.

Important scientific results give insight and lead to practical progress. The ability to test these results is crucial for science to be self-correcting, and the ability to reuse and extend the results enables science to move forward. In natural science, long tradition requires that results be reproducible, and in math, results must be accompanied by formal, verifiable proofs. However, the same standard hasn’t been applied for the results of computational experiments. Most computational experiments are specified only informally in papers, where experimental results are briefly described in figure captions, and the code that produced the results is seldom available. The lack of reproducibility for computational results currently reported in the literature has raised questions about their reliability [1] and led to a widespread discussion on the importance of computational reproducibility.

Academic institutions such as the Swiss Federal Institute of Technology, Zurich (ETH, Zurich; www.vpf.ethz.ch/services/researchethics/Broschure.pdf), funding agencies, conferences (www.sigmod2011.org/calls_papers_sigmod_research_repeatability.shtml), and journals (www.signalprocessingsociety.org/publications/periodicals/tsp) have started to encourage (or require) authors to include reproducible results in their publications. However, a major barrier to the wider adoption of reproducibility is the fact that it’s hard for authors to derive a compendium that encapsulates all the components (for example, the data, code, parameter settings, and environment) needed to reproduce a result; and even when a compendium is available, it’s often hard for reviewers to verify results.

As a step toward simplifying the creation and review of reproducible results, and motivated by the needs of computational scientists, we built an infrastructure that supports the life cycle of computational experiments. A key component of this infrastructure is a provenance management system that systematically and transparently captures the metadata necessary to reproduce experiments, including the specifications of the computations, input and output data, source code, and library versions. We also developed a set of solutions to address practical aspects related to reproducibility, including methods to link results to their provenance, explore parameter spaces, wrap command-line tools, interact with results through a Web-based interface, and upgrade the specification of computational experiments to work in different environments and with newer versions of software. This infrastructure has been implemented and released as part of VisTrails (www.vistrails.org), an open source workflow-based data exploration and visualization tool [2], and it’s already being used by different groups of scientists. Videos that illustrate the process to create reproducible publications using VisTrails are available at www.vistrails.org/index.php/RepeatabilityCentral. In this article, we give an overview of this infrastructure and its components, how it can be used, and its benefits and limitations.

Creating Reproducible Papers

Before discussing how to create a reproducible paper, let’s first examine a real reproducible paper. Figure 1 illustrates the anatomy of a reproducible paper created using our infrastructure. This paper investigates Galois conjugates of quantum double models [3]. Figures in the paper are accompanied by their provenance, consisting of the workflow used to derive the plot, the underlying libraries invoked by the workflow, and links to the input data (simulation results stored in an archival site). This provenance information allows all of the paper’s results to be reproduced. In the paper’s PDF version (available at http://arxiv.org/abs/1106.3267), the figures are active and, when clicked, their corresponding workflow is loaded into VisTrails and executed on the reader’s machine. The reader can then modify the workflow, change parameter values, and input data. The same provenance also enables the result to be published on a website, where users and reviewers can execute it and examine the results using a Web browser [4].

Figure 1. Anatomy of a real reproducible paper that investigates Galois conjugates of quantum double models [3]. Figures in the paper are accompanied by their provenance, and users and reviewers can execute and examine the interactive results on the Web.

VisTrails and Provenance for Computational Experiments

Provenance is a critical ingredient for reproducible experiments [5, 6]. If we know how a figure or table was generated (the computational processes and data used), we can incorporate them in the paper so that the result can be reproduced. However, because computational experiments can be complex and their design involves many trial-and-error steps, it’s easy to get lost. For example, it’s easy to forget the exact parameter values or the version of an input file that was used to derive a specific result. Therefore, systematic mechanisms are needed to capture the provenance of these experiments.

In our infrastructure, we’ve adopted the VisTrails system as a means to capture provenance. (For a basic overview of VisTrails and a discussion about other possible tools, see the sidebar “VisTrails and Related Systems.”)

Compared to both scientific workflows and visualization systems, a distinguishing feature of VisTrails is its provenance infrastructure: VisTrails was designed from the start to both capture and leverage provenance information. VisTrails captures a detailed history of the steps followed and data derived in the course of an exploratory task. Workflow systems have traditionally been used to automate repetitive tasks, but in applications that are exploratory in nature, such as simulations, data analysis, and visualization, not [...]

[...] and visualization systems. To cater to a broader set of users, including many who don’t have programming expertise, it leverages provenance information to provide a series of operations and user interfaces that simplify workflow design and use, including the ability to create and refine workflows by analogy, to query workflows by example, and to suggest workflow completions as users interactively construct their workflows using a recommendation system [5]. We’ve also developed a framework that lets users create custom applications (mashups) that can be more easily deployed to end users [7].

Sidebar: VisTrails and Related Systems

VisTrails (see www.vistrails.org/usersguide) is an open source system designed to support exploratory computational experiments. VisTrails is written in Python and uses Qt as its GUI toolkit (through PyQt Python bindings). It is multiplatform and runs on Windows, Mac, and Linux. Since its beta release in 2007, the system has been downloaded more than 35,000 times. The VisTrails wiki has had more than 1.2 million page views, and Google Analytics reports that visitors to the site come from 75 different countries.

VisTrails includes and substantially extends useful features of scientific workflow and visualization systems. Similar to scientific workflow systems such as Kepler (https://kepler-project.org) and Taverna (www.taverna.org.uk), VisTrails allows the specification of computational processes that integrate existing applications, loosely coupled resources, and libraries according to a set of rules. As with visualization systems such as Advanced Visual Systems (www.avs.com) and ParaView (www.paraview.org), VisTrails makes advanced scientific and information visualization techniques available to users, letting them explore and compare different visual representations of their data. As a result, users can create complex workflows that encompass important steps of scientific discovery, from data gathering and manipulation to complex analyses and visualizations, all integrated in one system.

There are two key aspects that distinguish VisTrails from these systems. First, it provides comprehensive provenance support: in addition to capturing data provenance (for example, the steps followed to create a given data product), VisTrails also captures provenance of the exploration process, including the trial-and-error refinements applied to workflows. Second, VisTrails has been a pioneer in the support of reproducible publications. It has introduced functionality for sharing and publishing computational experiments, including the ability to link results reported in a document to their provenance, run workflow in multiple environments, manage files manipulated by workflows, and automatically upgrade workflows when the underlying libraries change.
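As a rough illustration of the kind of metadata the article says must be captured to reproduce a result (parameter values, the exact version of each input file, and library versions), the following Python sketch records them by hand. It is not VisTrails code and does not use the VisTrails API; the function names `sha256_of`, `library_versions`, and `provenance_record`, as well as the layout of the record, are hypothetical choices made only for this example.

```python
import hashlib
import platform
from datetime import datetime, timezone
from importlib import metadata


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents.

    The digest identifies the exact version of an input file.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large simulation outputs need not fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def library_versions(names):
    """Map each distribution name to its installed version (None if absent)."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def provenance_record(step, parameters, input_paths, libraries=()):
    """Build a plain dictionary describing how one result was derived.

    Hypothetical record layout, assumed for illustration only.
    """
    return {
        "step": step,                          # name of the computation
        "parameters": dict(parameters),        # exact parameter values used
        "inputs": {str(p): sha256_of(p) for p in input_paths},
        "python": platform.python_version(),   # interpreter version
        "libraries": library_versions(libraries),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
```

A record like this could be serialized (for example with `json.dumps`) and stored next to a figure, so that the parameter values and input file versions behind it are not forgotten. A system such as VisTrails captures this kind of information automatically and for every step of the exploration, which a hand-written helper like this one does not attempt.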
