V is u a l i z a t i o n C o r n e r

Editors: Claudio Silva, [email protected] Joel E. Tohline, [email protected]

Provenance for Visualizations: Reproducibility and Beyond

By Claudio T. Silva, Juliana Freire, and Steven P. Callahan

The demand for the construction of complex visualizations is growing in many disciplines of science, as users are faced with ever increasing volumes of data to analyze. The authors present VisTrails, an open-source provenance management system that provides infrastructure for data exploration and visualization.

omputing has been an enor- Without provenance, it’s difficult on the notion of data flows,2 and they mous accelerator for science, (and sometimes impossible) to repro- provide visual interfaces for producing C leading to an information ex- duce and share results, solve problems visualizations by assembling pipelines plosion in many different fields. Future collaboratively, validate results with out of modules (or functions) connect- scientific advances depend on our abil- different input data, and understand ed in a network. SCIRun supports an ity to comprehend the vast amounts the process used to solve a particular interface that lets users directly edit of data currently being produced and problem. In addition, data products’ data flows. MayaVi and ParaView have acquired. To analyze and understand longevity becomes limited—with- a different interaction paradigm that this data, though, we must assemble out precise and sufficient informa- implicitly builds data flows as the user complex computational processes and tion about how the data product makes “task-oriented” choices (such as generate insightful visualizations, was generated, its value diminishes selecting an isosurface value). which often require combining loosely significantly. Although these systems let users coupled resources, specialized librar- The lack of adequate provenance create complex visualizations, they ies, and grid and Web services. Such support in visualization systems mo- lack the ability to support data explo- processes could generate yet more tivated us to build VisTrails, an open ration at a large scale. Notably, they data, adding to the information over- source provenance-management sys- don’t adequately support collaborative flow scientists currently deal with. tem that provides infrastructure for creation and exploration of multiple Today, the scientific community data exploration and visualization visualizations. Because these systems uses ad hoc approaches to data ex- through workflows. VisTrails trans- don’t distinguish between the defini- ploration, but such approaches have parently records detailed provenance tion of a data flow and its instances, serious limitations. In particular, sci- of exploratory computational tasks to execute a given data flow with dif- entists and engineers must expend and leverages this information be- ferent parameters (for example, differ- substantial effort managing data (such yond just the ability to reproduce and ent input files), users must manually as scripts that encode computational share results. In particular, it uses this set these parameters through a GUI. tasks, raw data, data products, images, information to simplify the process of Clearly, this process doesn’t scale to and notes) and recording provenance exploring data through visualization. more than a few visualizations at a information (that is, all the informa- time. Additionally, modifications to tion necessary for reproducing a cer- Visualization Systems parameters or to a data flow’s defini- tain piece of data) so that they can Visualization systems such as MayaVi tion are destructive—the systems don’t answer basic questions: Who created (http://mayavi.sourceforge.net) and maintain any change history. This re- a data product and when? When was ParaView (www.paraview.org)—which quires the user to first construct the it modified, and who modified it? are built on top of Kitware’s Visual- visualization and then remember the What process was used to create the ization Toolkit (VTK)1—as well as input data sets, parameter values, and data product? Were two data products SCIRun (http://software.sci.utah.edu/ the exact dataflow configuration that derived from the same raw data? Not scirun.html) enable users to interac- led to a particular image. only is this process time-consuming, tively create and manipulate complex Finally, before constructing a vi- it’s also error-prone. visualizations. Such systems are based sualization, users must often acquire,

 Copublished by the IEEE CS and the AIP 1521-9615/07/$25 ©2007 IEEE Computing in Science & Engineering Reproducibility and Sharing we do things. This first column discusses the benefits of provenance Data and Processes for the and makes a case that better provenance mechanisms Visualization Corner are needed for visualization. In upcoming columns of the Visualization Corner, we will attempt to inform the sci- By Claudio Silva and Joel E. Tohline entific community at large about the benefits and tech- nologies related to provenance. In particular, we want to reetings! We’re the new co-editors for the Visualiza- promote the idea of reproducible visualizations. We en- Gtion Corner. Claudio is a computer science professor courage authors of articles published in the department at the University of Utah and faculty member of the Scien- to provide metadata for visualizations in their articles that tific Computing and Imaging (SCI) Institute, where he does let readers reproduce images as well as generate related research primarily in visualization, graphics, and applied ones (for example, using different data). Ultimately, our geometry. Joel is a professor of physics and astronomy at hope is that this trend will spread to the point that pub- Louisiana State University and a faculty member in LSU’s lished articles will contain not only textual descriptions Center for Computation and Technology (CCT), with a of the techniques, but links to data, code, and the com- research focus on complex fluid flows in astrophysical plete overall process used to generate the scientific results. systems. We both have extensive experience in high-per- As a mechanism to capture and share provenance formance computing. In partnership with our readers and metadata, authors can use VisTrails to produce speci- colleagues, we hope to bring you relevant and effective in- fications of the figures and plots presented in their- ar formation about visualization techniques that can directly ticles. We’ll archive this information at www.vistrails.org/ affect the way our readers do science. We would like to use index.php/CiSE. The data and processes associated with new Web technologies (Wikis, blogs, and so on) to encour- this column are already available on the Web site, so age the community to more actively participate in the way you can reproduce them, right now, from your desktop!

generate, or transform a given data line evolution as a visualization trail, users integrate a wide range of librar- set—for example, to calibrate a simu- or vistrail. ies. This makes the system suitable for lation, they must obtain data from sen- The stored provenance ensures other exploratory tasks, including data sors, generate data from a simulation, that users will be able to reproduce mining and integration. and finally construct and compare the the visualizations and lets them easily visualizations for both data sets. Most navigate through the space of pipe- Creating an Interactive visualization systems, however, don’t lines created for a given exploration Visualization with VisTrails give users adequate support for cre- task. The VisTrails interface lets users To illustrate the issues involved in ating complex pipelines that support query, interact with, and understand creating visualizations and how prov- multiple libraries and services. the visualization process’s history. In enance can aid in this process, we particular, they can return to previous present the following scenario, com- VisTrails: Provenance versions of a pipeline and change the mon in medical data visualization. for Visualization specification or parameters to gener- Starting from a volumetric computed The VisTrails system (www.vistrails. ate a new visualization without losing tomography (CT) data set, we generate org) we developed at the University previous changes. different visualizations by exploring of Utah is a new visualization system Another important feature of the the data through volume rendering, that provides a comprehensive prov- action-based provenance model is isosurfacing (extracting a contour), and enance-management infrastructure that it enables a series of operations slicing. Note that with proper modi- and can be easily combined with ex- that greatly simplify the exploration fications, this example also works for isting visualization libraries. Unlike process and could reduce the time visualizing other types of data (for ex- previous systems, VisTrails uses an to insight. In particular, it allows the ample, tetrahedral meshes). action-based provenance model that flexible reuse of pipelines and provides uniformly captures changes to both a scalable mechanism for creating and Dataflow Processing Networks parameter values and pipeline defini- comparing numerous visualizations as and Visual Programming. tions by unobtrusively tracking all well as their corresponding pipelines. A useful paradigm for building visu- changes that users make to pipelines Although we originally built VisTrails alization applications is the dataflow in an exploration task. We refer to to support exploratory visualization model. A data flow is a directed graph in this detailed provenance of the pipe- tasks, its extensible infrastructure lets which nodes represent computations,

September/October 2007  V is u a l i z a t i o n C o r n e r

(a) (b) (c)

Figure 1. Dataflow programming for visualization. (a) We commonly use a script to describe a pipeline from existing libraries such as the Visualization Toolkit (VTK). (b) Visual programming interfaces, such as the one VisTrails provides, facilitate the creation and maintenance of these dataflow pipelines. The green rectangles represent modules, and the black lines represent connections. (c) The end result of the script or the VisTrails pipeline is a set of interactive visualizations. and edges represent data streams: each for example, “scripting” in a modern that we can use expressive languages node or module corresponds to a pro- dynamic language, such as Python. for querying and updating workflows. cedure that’s applied on the input data Consider Figure 1a, which defines the and generates some output data as a re- workflow via a script written in Py- Comparing and Exploring sult. The flow of data in the graph de- thon that uses VTK to read a volume Multiple Visualizations termines the order in which a dataflow data set from a file, extract an isosur- Regardless of the specific mechanism system executes the processing nodes. face, map the isosurface to renderable we use to define a pipeline, the visu- In visualization, we commonly refer to geometry, and then finally render it in alization process’s end goal is to gain a dataflow network as a visualization an interactive window. insight from the data. To obtain such pipeline. (For this article, we use the Visual programming interfaces for insight, users must often generate and terms workflow, data flow, and pipe- designing data flows have become compare multiple visualizations. Go- line interchangeably.) Figure 1b shows popular and several systems, such as ing back to our scenario, several al- an example of the data flow used to SCIRun, have adopted them. These ternatives exist for rendering our CT derive the images shown in Figure 1c. interfaces give users a more intuitive data. Isosurfacing is a commonly used The green rectangles represent mod- view of the pipeline. They also dy- technique. Given a function f: Rn → R ules, and the black lines represent con- namically perform type checking and and a value a, an isosurface consists of nections. Most of the modules Figure guide the connection between mod- the set of points in a domain that map n 1 shows are from VTK, and labels on ules’ input and output ports—once the to a—that is, Sa = {x ∈ R : f(x) = a}. each module indicate the correspond- user selects a module’s output, con- The range of a values determines ing VTK class. In this figure, we nat- nections are allowed only to the target all possible isosurfaces that the user urally think of data flowing from top module’s appropriate input. VisTrails can generate. To identify “good” a to bottom, eventually being rendered automatically pulls edges toward the values that represent a data set’s im- and presented for display. correct input port. As we discuss later, portant features, we can look at the We can use different mechanisms another benefit of having a high-level, range of values taken by a, and their for creating visualization pipelines— structured workflow description is frequency, in the form of a histogram.

 Computing in Science & Engineering Figure 2. VisTrails’ parameter exploration interface. The system computes the results efficiently by avoiding redundant computation and displays them in the for interactive comparative visualization.

Using VisTrails, we can straightfor- The VisTrails spreadsheet lets us are complementary techniques, and wardly extend the isosurface pipeline compare visualizations in different they can generate very similar imag- to also display a histogram of the data. dimensions (row, column, sheet, and ery depending on parameters. In fact, VisTrails provides a very simple plu- time), and we can link the spreadsheet’s distinguishing between them can be gin functionality that you can use to cells to synchronize the interactions difficult. The VisTrails system lets us add packages and libraries, including between visualizations. Note that Vis- compare workflows using a visual dif- your own. For our example, we used Trails leverages the dataflow specifica- ference interface. To demonstrate this ’s 2D plotting functional- tions to identify and avoid redundant capability, we compute the difference ity (http://matplotlib.sourceforge.net) operations. By using the same cache between the original isosurface gen- to generate the histogram at the top for different cells in the spreadsheet, eration pipeline and the new volume of Figure 1c. This histogram helps VisTrails lets users efficiently explore rendering pipeline. Figure 3 shows in data exploration by suggesting re- numerous related visualizations. the visual difference of the workflows gions of interest in the volume. The that we can inspect, along with their plot shows that the highest frequency Comparing different visualization tech- resulting visualizations. In Figure 3a, features lie between the ranges [0,25] niques. Volume rendering is a pow- we use volume rendering to create the and [58,68]. To identify the features erful computer graphics technique image, in which we can see the skin that correspond to these ranges, we for visualizing 3D data. While many on top of the bone structure; Figure must explore these regions directly visualization algorithms focus on 3b shows only the bone structure ren- through visualization. creating a rendering of surfaces—al- dered with our standard isosurface though they might be surfaces of 3D technique. This ability to (efficiently) Scalable exploration of parameter spaces. objects—volume rendering lets us see compare workflows and visualizations VisTrails provides an interface for “inside” the volume. This technique is one of the benefits of the VisTrails parameter exploration that lets users models the volume as cloud-like cells action-based provenance model and specify a set of parameters to explore, of semitransparent material. Although becomes increasingly important as a as well as how to explore, group, and a surface rendering of the human workflow becomes more complex and display them. As a simple 1D example, body might show only the skin, for is shared among collaborators. Figure 2 shows an exploration of the example, a complete volume render- With other workflow systems, these isosurface value as four steps between ing might also show the bones and comparisons are challenging because 50 and 70, displayed horizontally in internal organs. they require module-by-module (vi- the VisTrails spreadsheet. Volume rendering and isosurfacing sual programming) or line-by-line

September/October 2007  V is u a l i z a t i o n C o r n e r

(a)

(b) (c)

Figure 3. A visual difference between different pipelines in VisTrails. We show the difference between pipelines that generated (a) volume rendering and (b) isosurface visualizations. (c) The interface distinguishes shared modules in dark gray, the modules unique to isosurfacing in blue, those unique to direct volume rendering in orange, and those with parameter changes in light gray.

(c)

(a) (b)

Figure 4. Multiple rendering techniques. (a) VisTrails renders the visualizations by combining volume rendering and isosurfacing and updates them with user interactions. (b) The corresponding pipeline represents the data flow for creating interactive visualizations. (c) VisTrails provides a fully browseable history of the exploration process that led to this final set of visualizations.

(scripting) comparisons. Although the the latter could lead to results that are images we generated so far corre- former can be computationally intrac- hard to interpret. spond to simple, static workflows. table (the related decision problem of To perform a more dynamic com- subgraph isomorphism is NP-complete), Interacting with visualizations. The parison between volume rendering

 Computing in Science & Engineering The VisTrails System (SVN); http://subversion.tigris.org). • Extensibility. VisTrails provides a very simple plugin n this article, we focused on using VisTrails as a tool for functionality that can help dynamically add packages Iexploratory visualization. Additional features that might and libraries. Neither changes to the user interface nor be relevant for CiSE readers include system recompilation are necessary. Because VisTrails is written in Python, integrating Python-wrapped libraries • Flexible provenance architecture. VisTrails transparently is straightforward. tracks changes made to workflows. It maintains a • Scalable derivation of data products and parameter explora- detailed record of all the steps followed in the explora- tion. VisTrails supports a series of operations for simulta- tion. The system can optionally track runtime informa- neously generating multiple data products, including an tion about workflow execution (such as who executed interface that lets users specify sets of values for different a module, on which machine, and how much time parameters in a workflow. Users can display the results elapsed). VisTrails also provides a flexible annotation of a parameter exploration side by side in the VisTrails framework whereby users can specify application- spreadsheet for easy comparison. ­specific provenance information. • Task creation by analogy. VisTrails supports analogies as • Querying and re-using history. The provenance informa- first-class operations to guide semi-automated changes tion is stored in a structured way. Users have the choice to multiple workflows, without requiring users to di- of using a relational database (such as MySQL and IBM rectly manipulate or edit the workflow specifications. DB2) or XML files in the file system. The system provides flexible and intuitive query interfaces through which Please visit www.vistrails.org to access the VisTrails com- users can explore and reuse provenance information. munity Web site. You’ll find information including instruc- Users can formulate simple keyword-based and selection tions for obtaining the software, online documentation, queries (find a visualization that a given user created, for video tutorials, and pointers to papers and presentations. example) as well as structured queries (find visualizations VisTrails is written in Python and uses the multiplatform that apply simplification before an isosurface computa- library for its user interface. The system is available as tion for irregular grid data sets). open source, released under the GPL 2.0 license. The pre- • Support for collaborative exploration. Users can configure compiled versions for Windows, Mac OS X, and come the system with a database backend that can act as a with an installer and include several packages, including shared repository. It also provides a synchronization VTK, matplotlib, and Image Magick. Additional packages, facility that lets users collaborate asynchronously and in including ones users have written, are also available, but you a disconnected fashion—they can check changes in and can easily add new packages using the VisTrails plugin infra- out, akin to a version control system (such as subversion structure. Detailed instructions are available at our Web site.

and isosurfacing, we add a feedback Provenance and organization becomes a major com- loop into the workflow to let users Exploratory Visualization ponent of the process. adjust the visualization interactively. The combination of multiple visual- VisTrails manages the data ma- We build a new workflow that uses ization algorithms, different librar- nipulated and metadata created in the the isosurfacing and volume render- ies, and the interactions between course of an exploratory task.3 As a user ing algorithms simultaneously. We them considerably complicates the (or group of users) generates a series of add a clipping plane into the volume workflow specification. In addition, visualizations, VisTrails transparently visualization to assign the volume creating a set of visualizations from tracks all the steps this exploration fol- regions used for each algorithm. In data is not always a linear process lowed—that is, the modules and con- addition, we use a point on the plane and often involves several itera- nections added and deleted, parameter to define axis-aligned slices of the tions as a user formulates and tests value changes, and so on. Figure 4c volume that we display in distinct hypotheses. Whereas for simple shows a history tree of the different spreadsheet cells. The pipeline in- experiments, manual approaches pipelines created in the course of our teractively updates these slices along to provenance management might running example. The nodes in this with the plane during user interac- be feasible, complex computational tree correspond to pipelines, and an tions. Figure 4 shows the resulting tasks involving large volumes of data edge between two pipelines corre- visualizations along with the com- or multiple researchers require au- sponds to changes performed on the plex dynamic workflow required to tomated approaches. As these tasks’ parent pipeline to obtain its child. For produce them. complexity and scale increases, data readability, by default, only the nodes

September/October 2007  V is u a l i z a t i o n C o r n e r

Figure 5. The VisTrails query by example interface. (a) Users can define a set of modules and parameters in the visual programming interface to create a query template. (b) The query results are shown in the history tree, which users can browse for specific instances of the match (inset).

Table 1. Query examples. of the actions that led to a particu- Query Result lar result. volume Highlights all nodes in the history tree where the string “volume” appears (for example, in a Thus, users can efficiently explore module name, parameter name, annotation) several related visualizations. The issue of reproducibility for user:juliana Highlights all nodes in the history tree created visualization has been considered by the user “juliana” before,5 we should note that whereas before: March 30, 2007 Highlights all nodes in the history tree created some visualization and workflow sys- before “March 30, 2007” tems provide support for provenance tracking, their focus has been on data provenance—that is, information in the tree that the user tags as impor- the alternative of explicitly storing about how the system derived a given tant are displayed. multiple versions of a pipeline.3 data product, including the param- By tracking all the changes made to a The exploration trail VisTrails cap- eter values used6—and on interaction workflow ensemble, VisTrails properly tures also supports various activities provenance (such as capturing a visu- captures each step, leaving a complete that are crucial for performing reflec- alization’s viewing manipulations).7 trail of the work. Having access to the tive reasoning and obtaining insights, VisTrails is the first system to capture different pipelines’ specifications lets such as following chains of reasoning information about how workflows others reproduce and share the results backward and forward and comparing evolve over time. of each step in the exploratory process. different results.4 The tree-based view For instance, to generate the com- To demonstrate this, we made the vis- lets users posite visualization in our final exam- trail associated with this example avail- ple, we extended our pipeline labeled able for download with the VisTrails • seamlessly navigate over the history “Volume Rendering” to include mod- system (see the “VisTrails System” tree and return to previous pipeline ules from the pipeline labeled “Isosur- sidebar). You can recreate each figure versions after reaching a dead end; facing.” Having two pipelines lets us shown in this article by executing the • undo bad changes; further explore the visualization—by different nodes in the history tree. • reuse pipelines and pipeline frag- trying different isosurface values, for Note that by using the action-based ments from previous versions; example (see Figure 2). In addition, provenance model, we obtain a very • compare different pipelines and we can compare the pipelines by drag- concise representation of the history, their results; and ging one node on top of the other (see which uses substantially less space than • be reminded, in an intuitive way, Figure 3). These computed differences

 Computing in Science & Engineering are useful for understanding the visu- enance data of exploratory tasks. With 7. D.P. Groth and K. Streefkerk, “Provenance and Annotation for Visual Exploration alization process, and the user can also this infrastructure in place, our re- Systems,” IEEE Trans. Visualization and reuse them. In this case, we applied the search focus is now on what we can do Computer Graphics, vol. 12, no. 6, 2006, pp. modules unique to “Isosurfacing” to with all the provenance that is accumu- 1500–1510. “Volume Rendering” to create a new lated. By mining this information, we 8. C. Scheidegger et al., Querying and Creat- ing Visualizations by Analogy,” IEEE Trans. pipeline called “Combined Rendering” hope to learn useful patterns that can Visualization and Computer Graphics, to be that uses a cutting plane to define re- help guide users in assembling and re- published, 2007. gions for the rendering methods. Vis- fining complex computational tasks. Trails can automatically apply pipeline differences (like a patch) to derive new Acknowledgments Claudio T. Silva is an associate professor at pipelines in a process we call visualiza- This article summarizes work being the University of Utah. His research inter- tion creation by analogy.8 done in the VisTrails project. It’s only ests include visualization, geometry pro- Another benefit to having a high- possible through the work of all team cessing, graphics, and high-performance level specification of the visualiza- members: Erik Anderson, Jason Cal- computing. Silva has a PhD in computer tion process is that users can query lahan, David Koop, Emanuele San- science from SUNY at Stony Brook. He is a the pipelines and their execution in- tos, Carlos E. Scheidegger, and Huy member of the IEEE, the ACM, Eurograph- stances. Scientists can query a vistrail T. Vo. The data used in this article ics, and Sociedade Brasileira de Matemati- to find anomalies in previously gen- is available courtesy of the National ca. Contact him at [email protected]. erated visualizations and locate data Library of Medicine’s Visible Hu- products and visualizations based on man Project. The US National Sci- Juliana Freire is an assistant professor at operations applied in the visualiza- ence Foundation partially supported the University of Utah. Her research inter- tion process. VisTrails supports sim- this work under grants IIS-0513692, ests include scientific data management, ple, keyword-based queries as well as CCF-0401498, EIA-0323604, CNS- Web information systems, and information structured queries. In addition to pro- 0541560, OCE-0424602, and OISE- integration. Freire has a PhD in computer viding information about the results 0405402. The US Department of science from SUNY at Stony Brook. She is a (for example, workflow identifiers and Energy, an IBM Faculty Award, and member of the ACM and the IEEE. Contact attributes), VisTrails can visually dis- a University of Utah Seed Grant her at [email protected]. play query results by highlighting the also partially supported this work. workflows and modules that satisfy Steven P. Callahan is a research assistant the query. Table 1 shows an example. References and PhD candidate at the University of Users might also define queries by 1. W. Schroeder, K. Martin, and B. Lorensen, Utah. His research interests include scientif- example.8 As Figure 5 illustrates, us- The Visualization Toolkit An Object-Oriented ic visualization, visualization systems, and Approach To 3D Graphics, Kitware, 2003. ers can construct (or copy and paste) computer graphics. He has an MS in compu- 2. E.A. Lee and T.M. Parks, “Dataflow Process a pipeline fragment into the VisTrails Networks,” Proc. IEEE, vol. 83, no. 5, 1995, tational engineering and science from the query tab to identify in the history pp. 773–801. University of Utah. Contact him at stevec@ tree all nodes that contain that frag- 3. S. Callahan et al., “Managing the Evolu- sci.utah.edu. ment. They can then browse through tion of Dataflows with VisTrails (extended abstract),” IEEE Workshop on Workflow and the highlighted nodes and click on one Data Flow for Scientific Applications (SciFlow), to display the workflow and highlight IEEE CS Press, 2006. the modules that match the query. 4. D.A. Norman, Things That Make Us Smart: Users can then click on the individual Defending Human Attributes in the Age of the Machine, Addison-Wesley, 1994. modules to view execution log records 5. G. Kindlmann, “Lack of Reproducibility Hin- associated with them. ders Visualization Science,” IEEE Visualization Compendium, 2006; IEEE CS Press, p. 69. 6. T. Jankun-Kelly, K.-L. Ma, and M. Gertz, he VisTrails project has focused “A Model and Framework for Visualization Exploration,” IEEE Trans. Visualization and Ton creating an infrastructure to Computer Graphics, vol. 13, no. 2, 2007, pp. address the need to manage the prov- 357–369.

September/October 2007