V is u a l i z a t i o n C o r n e r

Editors: Claudio Silva, [email protected] Joel E. Tohline, [email protected]

Provenance for Visualizations

Reproducibility and Beyond

By Claudio T. Silva, Juliana Freire, and Steven P. Callahan

The demand for the construction of complex visualizations is growing in many disciplines of science, as users are faced with ever increasing volumes of data to analyze. The authors present VisTrails, an open source provenance-management system that provides infrastructure for data exploration and visualization.

omputing has been an enor- Without provenance, it’s diffi- on the notion of data flows,2 and they mous accelerator for science, cult (and sometimes impossible) to provide visual interfaces for produc- C leading to an information ex- reproduce and share results, solve ing visualizations by assembling pipe- plosion in many different fields. Fu- problems collaboratively, validate re­ lines out of modules (or functions) ture scientific advances depend on our sults with different input data, and connected in a network. SCIRun ability to comprehend the vast amounts understand the process used to solve supports an interface that lets users of data currently being produced and a particular problem. In addition, directly edit data flows. MayaVi and acquired. To analyze and understand data products’ longevity becomes ParaView have a different interac- this data, though, we must assemble limited—without precise and suffi- tion paradigm that implicitly builds complex computational processes and cient information about how the data data flows as the user makes “task- generate insightful visualizations, product was generated, its value di- ­oriented” choices (such as selecting which often require combining loosely minishes significantly. an isosurface value). coupled resources, specialized librar- The lack of adequate provenance Although these systems let users ies, and grid and Web services. Such support in visualization systems mo- create complex visualizations, they processes could generate yet more tivated us to build VisTrails, an open lack the ability to support data explo- data, adding to the information over- source provenance-management sys- ration at a large scale. Notably, they flow scientists currently deal with. tem that provides infrastructure for don’t adequately support collabora- Today, the scientific community data exploration and visualization tive creation and exploration of mul- uses ad hoc approaches to data ex- through workflows. VisTrails trans- tiple visualizations. Because these ploration, but such approaches have parently records detailed provenance systems don’t distinguish between serious limitations. In particular, of exploratory computational tasks the definition of a data flow and its scientists and engineers must expend and leverages this information be- instances, to execute a given data substantial effort managing data yond just the ability to reproduce and flow with different parameters (for (such as scripts that encode computa- share results. In particular, it uses this example, different input files), users tional tasks, raw data, data products, information to simplify the process of must manually set these parameters images, and notes) and recording exploring data through visualization. through a GUI. Clearly, this process provenance information (that is, all the doesn’t scale to more than a few vi- information necessary to reproduce Visualization Systems sualizations at a time. Additionally, a certain piece of data) so that they Visualization systems such as Maya­ modifications to parameters or to a can answer basic questions: Who cre- Vi (http://mayavi.sourceforge.net) and data flow’s definition are destruc- ated a data product and when? When ParaView (www.paraview.org)—which tive—the systems don’t maintain any was it modified, and who modified it? are built on top of Kitware’s Visual- change history. This requires the What process was used to create the ization Toolkit (VTK)1—as well as user to first construct the visualiza- data product? Were two data products SCIRun (http://software.sci.utahedu/ tion and then remember the input derived from the same raw data? This scirun.html) enable users to interac- data sets, parameter values, and the process is not only time-consuming, tively create and manipulate complex exact dataflow configuration that led but also error-prone. visualizations. Such systems are based to a particular image.

82 Copublished by the IEEE CS and the AIP 1521-9615/07/$25.00 ©2007 IEEE Computing in Science & Engineering Reproducibility and Sharing more actively participate in the way we do things. This first column discusses the benefits of provenance Data and Processes for the and makes a case that better provenance mechanisms are Visualization Corner needed for visualization. In upcoming installments, we’ll attempt to inform the scientific community at large about By Claudio Silva and Joel E. Tohline the benefits and technologies related to provenance. In particular, we want to promote the idea of reproducible reetings! We’re the new co-editors for the Visualiza- visualizations. We encourage authors of articles published Gtion Corner. Claudio is a computer science professor at here to provide metadata for visualizations in their articles the University of Utah and faculty member of the Scientific that let readers reproduce images as well as generate Computing and Imaging Institute, where he does research related ones (for example, using different data). Ultimately, primarily in visualization, graphics, and applied geometry. our hope is that this trend will spread to the point that Joel is a professor of physics and astronomy at Louisiana published articles will contain not only textual descriptions State University and a faculty member in LSU’s Center for of the techniques, but links to data, code, and the complete Computation and Technology, with a research focus on overall process used to generate the scientific results. complex fluid flows in astrophysical systems. We both have As a mechanism to capture and share provenance meta- extensive experience in high-performance computing. In data, authors can use VisTrails to produce specifications partnership with our readers and colleagues, we hope to of the figures and plots presented in their articles. We’ll bring you relevant and effective information about visual- archive this information at www.vistrails.org/index.php/ ization techniques that can directly affect the way our read- CiSE. The data and processes associated with this column ers do science. We would like to use new Web technologies are already available on the Web site, so you can reproduce (Wikis, blogs, and so on) to encourage the community to them, right now, from your desktop!

Finally, before constructing a vi- this detailed provenance of the pipe- extensible infrastructure lets users in- sualization, users must often acquire, line evolution as a visualization trail, tegrate a wide range of libraries. This generate, or transform a given data or vistrail. makes the system suitable for other set—for example, to calibrate a simu- The stored provenance ensures exploratory tasks, including data min- lation, they must obtain data from sen- that users will be able to reproduce ing and integration. sors, generate data from a simulation, the visualizations and lets them easily and finally construct and compare the navigate through the space of pipe- Creating an Interactive visualizations for both data sets. Most lines created for a given exploration Visualization with VisTrails visualization systems, however, don’t task. The VisTrails interface lets users To illustrate the issues involved in give users adequate support for cre- query, interact with, and understand creating visualizations and how prov- ating complex pipelines that support the visualization process’s history. In enance can aid in this process, we multiple libraries and services. particular, they can return to previous present the following scenario, com- versions of a pipeline and change the mon in medical data visualization. VisTrails: Provenance specification or parameters to gener- Starting from a volumetric computed for Visualization ate a new visualization without losing tomography (CT) data set, we generate The VisTrails system (www.vistrails. previous changes. different visualizations by exploring org) we developed at the Univer- Another important feature of the the data through volume rendering, sity of Utah is a new visualization action-based provenance model is isosurfacing (extracting a contour), and ­system that provides a comprehensive that it enables a series of operations slicing. Note that with proper modi- ­provenance-management infrastruc- that greatly simplify the exploration fications, this example also works for ture and can be easily combined with process and could reduce the time to visualizing other types of data (for ex- existing visualization libraries. Unlike insight. In particular, the model al- ample, tetrahedral meshes). previous systems, VisTrails uses an lows the flexible reuse of pipelines action-based provenance model that and provides a scalable mechanism Dataflow-Processing Networks uniformly captures changes to both for creating and comparing numer- and Visual Programming parameter values and pipeline defini- ous visualizations as well as their cor- A useful paradigm for building visu- tions by unobtrusively tracking all responding pipelines. Although we alization applications is the dataflow changes that users make to pipelines originally built VisTrails to support model. A data flow is a directed graph in in an exploration task. We refer to exploratory visualization tasks, its which nodes represent computations,

September/October 2007 83 V is u a l i z a t i o n C o r n e r

(a) (b) (c)

Figure 1. Dataflow programming for visualization. (a) We commonly use a script to describe a pipeline from existing libraries such as the Visualization Toolkit (VTK). (b) Visual programming interfaces, such as the one VisTrails provides, facilitate the creation and maintenance of these dataflow pipelines. The green rectangles represent modules, and the black lines represent connections. (c) The end result of the script or VisTrails pipeline is a set of interactive visualizations. and edges represent data streams: for creating visualization pipelines— ing a high-level, structured workflow each node or module corresponds to for example, scripting in a modern description is that we can use expres- a procedure that’s applied on the in- dynamic language, such as Python. sive languages in querying and updat- put data and generates some output Consider Figure 1a, which defines the ing workflows. data as a result. The flow of data in workflow via a script written in Py- the graph determines the order in thon that uses VTK to read a volume Comparing and Exploring which a dataflow system executes the data set from a file, extract an isosur- Multiple Visualizations processing nodes. In visualization, we face, map the isosurface to renderable Regardless of the specific mechanism commonly refer to a dataflow network geometry, and then finally render it in we use to define a pipeline, the visu- as a visualization pipeline. (For this an interactive window. alization process’s end goal is to gain article, we use the terms workflow, Visual programming interfaces for insight from the data. To obtain such data flow, and pipeline interchange- designing data flows have become insight, users must often generate ably.) Figure 1b shows an example of popular and several systems, such as and compare multiple visualizations. the data flow used to derive the im- SCIRun, have adopted them. These Going back to our scenario, several ages shown in Figure 1c. The green interfaces give users a more intuitive alternatives exist for rendering our rectangles represent modules, and view of the pipeline. They also dy- CT data. Isosurfacing is a commonly the black lines represent connec- namically perform type checking and used technique. Given a function f: tions. Most of the modules in Figure guide the connection between mod- Rn → R and a value a, an isosurface 1 are from VTK, and labels on each ules’ input and output ports—once consists of the set of points in a do- module indicate the corresponding the user selects a module’s output, main that map to a—that is, Sa = {x ∈ VTK class. In this figure, we natu- connections are allowed only to the Rn: f(x) = a}. rally think of data flowing from top target module’s appropriate input. The range of a values determines all to bottom, eventually being rendered VisTrails automatically pulls edges possible isosurfaces that the user can and presented for display. toward the correct input port. As we generate. To identify “good” a values We can use different mechanisms discuss later, another benefit of hav- that represent a data set’s important

84 Computing in Science & Engineering Figure 2. VisTrails’ parameter exploration interface. The system computes the results efficiently by avoiding redundant computation and displays them in the for interactive comparative visualization. features, we can look at the range of isosurface value as four steps between dering might also show the bones and values taken by a, and their frequen- 50 and 70, displayed horizontally in internal organs. cy, in the form of a histogram. Using the VisTrails spreadsheet. Volume rendering and isosurfacing VisTrails, we can straightforwardly This spreadsheet lets us compare are complementary techniques, and extend the isosurface pipeline to visualizations in different dimensions they can generate very similar imag- also display a histogram of the data. (row, column, sheet, and time); we ery depending on parameters. In fact, VisTrails provides a very simple plug­ can also link the spreadsheet’s cells to distinguishing between them can be in functionality that you can use to synchronize the interactions between difficult. The VisTrails system lets us add packages and libraries, including visualizations. Note that VisTrails compare workflows using a visual dif- your own. For our example, we used leverages the dataflow specifications ference interface. To demonstrate this ’s 2D plotting functionality to identify and avoid redundant op- capability, we compute the difference (http://matplotlib.sourceforge.net) to erations. By using the same cache between the original isosurface-gen- generate the histogram at the top of for different cells in the spreadsheet, eration pipeline and the new volume- Figure 1c. This histogram helps in VisTrails lets users efficiently explore rendering pipeline. Figure 3 shows data exploration by suggesting re- numerous related visualizations. the visual difference of the workflows gions of interest in the volume. The that we can inspect, along with their plot shows that the highest frequency Comparing different visualization tech- resulting visualizations. In Figure 3a, features lie between the ranges [0,25] niques. Volume rendering is a power- we use volume rendering to create the and [58,68]. To identify the features ful computer graphics technique for image, in which we can see the skin that correspond to these ranges, we visualizing 3D data. Whereas many on top of the bone structure; Figure must explore these regions directly visualization algorithms focus on 3b shows only the bone structure ren- through visualization. creating a rendering of surfaces— dered with our standard isosurface ­although they might be surfaces of technique. This ability to (efficiently) Scalable exploration of parameter spaces. 3D objects—volume rendering lets us compare workflows and visualizations VisTrails provides an interface for see “inside” the volume. This tech- is one of the benefits of the VisTrails parameter exploration that lets users nique models the volume as cloud- action-based provenance model and specify a set of parameters to explore, like cells of semitransparent material. becomes increasingly important as a as well as how to explore, group, and A surface rendering of the human workflow becomes more complex and display them. As a simple 1D example, body might show only the skin, for is shared among collaborators. Figure 2 shows an exploration of the example, but a complete volume ren- With other workflow systems, these

September/October 2007 85 V is u a l i z a t i o n C o r n e r

(a)

(b) (c)

Figure 3. A visual difference between different pipelines in VisTrails. We show the difference between pipelines that generated (a) volume rendering and (b) isosurface visualizations. (c) The interface distinguishes shared modules in dark gray, the modules unique to isosurfacing in blue, those unique to direct volume rendering in orange, and those with parameter changes in light gray.

(c)

(a) (b)

Figure 4. Multiple rendering techniques. (a) VisTrails renders visualizations by combining volume rendering and isosurfacing and updates them with user interactions. (b) The corresponding pipeline represents the data flow for creating interactive visualizations. (c) VisTrails provides a fully browseable history of the exploration process that led to this final set of visualizations. comparisons are challenging because (scripting) comparisons. Although the subgraph isomorphism is NP-complete), they require module-by-module (vi- former can be computationally intrac- the latter could lead to results that are sual programming) or line-by-line table (the related decision problem of hard to interpret.

86 Computing in Science & Engineering The VisTrails System • Extensibility. VisTrails provides a very simple plug-in functionality that can help dynamically add packages n this article, we focus on using VisTrails as a tool for and libraries. Neither changes to the user interface nor Iexploratory visualization. Additional features might be system recompilation are necessary. Because VisTrails is relevant for CiSE readers: written in Python, integrating Python-wrapped libraries is straightforward. • Flexible provenance architecture. VisTrails transparently • Scalable derivation of data products and parameter explora- tracks changes made to workflows by maintaining a tion. VisTrails supports a series of operations for simulta- detailed record of all the steps followed in the explora- neously generating multiple data products, including an tion. The system can optionally track runtime informa- interface that lets users specify sets of values for different tion about workflow execution (such as who executed parameters in a workflow. Users can display the results a module, on which machine, and how much time of a parameter exploration side by side in the VisTrails elapsed). VisTrails also provides a flexible annotation spreadsheet for easy comparison. framework through which users can specify application- • Task creation by analogy. VisTrails supports analogies as ­specific provenance information. first-class operations to guide semiautomated changes to • Querying and reusing history. Provenance information multiple workflows, without requiring users to directly is stored in a structured way. Users have the choice of manipulate or edit the workflow specifications. using a relational database (such as MySQL or IBM DB2) or XML files in the file system. The system provides Please visit www.vistrails.org to access the VisTrails flexible and intuitive query interfaces through which community Web site. You’ll find information including users can explore and reuse provenance information. instructions for obtaining the software, online docu- Users can formulate simple keyword-based and selection mentation, video tutorials, and pointers to papers and queries (find a visualization that a given user created, for presentations. example) as well as structured queries (find visualizations VisTrails is written in Python and uses the multiplat- that apply simplification before an isosurface computa- form library for its user interface. The system is open tion for irregular grid data sets). source, released under the GPL 2.0 license. The pre-com- • Support for collaborative exploration. Users can configure piled versions for Windows, Mac OS X, and come the system with a database back end that can act as a with an installer and several packages, including VTK, shared repository. This back end also provides a syn- matplotlib, and Image Magick. Additional packages, chronization facility that lets users collaborate asynchro- including ones users have written, are also available, nously and in a disconnected fashion—they can check but you can easily add new packages using the VisTrails changes in and out, akin to a version-control system plug-in infrastructure. Detailed instructions are available (such as subversion (SVN), http://subversion.tigris.org). at our Web site.

Interacting with visualizations. The teractively updates these slices along hypotheses. Whereas for simple images we generated so far corre- with the plane during user interac- experiments, manual approaches spond to simple, static workflows. To tions. Figure 4 shows the resulting to provenance management might perform a more dynamic compari- visualizations along with the com- be feasible, complex computational son between volume rendering and plex dynamic workflow required to tasks involving large volumes of data isosurfacing, we add a feedback loop produce them. or multiple researchers require au- into the workflow to let users adjust tomated approaches. As these tasks’ the visualization interactively. We Provenance and complexity and scale increases, data thus build a new workflow that uses Exploratory Visualization organization becomes a major com- the isosurfacing and volume render- The combination of multiple visual- ponent of the process. ing algorithms simultaneously. We ization algorithms, different librar- VisTrails manages the data ma- add a clipping plane into the volume ies, and the interactions between nipulated and metadata created in visualization to assign the volume them considerably complicates the the course of an exploratory task.3 As regions used for each algorithm. In workflow specification. In addition, a user (or group of users) generates addition, we use a point on the plane creating a set of visualizations from a series of visualizations, VisTrails to define axis-aligned slices of the data is not always a linear process transparently tracks all the steps this volume that we display in distinct and often involves several itera- exploration followed—that is, the spreadsheet cells. The pipeline in- tions as a user formulates and tests modules and connections added and

September/October 2007 87 V is u a l i z a t i o n C o r n e r

Figure 5. The VisTrails query-by-example interface. (a) Users can define a set of modules and parameters in the visual programming interface to create a query template. (b) The query results are shown in the history tree, which users can browse for specific instances of the match (inset).

Table 1. Query examples. • seamlessly navigate over the history Query Result tree and return to previous pipeline volume Highlights all nodes in the history tree in which versions after reaching a dead end; the string “volume” appears (for example, in a • undo bad changes; module name, parameter name, annotation) • reuse pipelines and pipeline frag- ments from previous versions; user:juliana Highlights all nodes in the history tree created by the user “juliana” • compare different pipelines and their results; and before: March 30, 2007 Highlights all nodes in the history tree created • be reminded, intuitively, of the ac- before “March 30, 2007” tions that led to a particular result.

Thus, users can efficiently explore deleted, parameter value changes, and ciated with this example available for several related visualizations. so on. Figure 4c shows a history tree download with the VisTrails system The issue of reproducibility for vi- of the different pipelines created in the (see “The VisTrails System” sidebar). sualization has been considered be- course of our running example. The You can recreate each figure shown in fore,5 but we should note that whereas nodes in this tree correspond to pipe- this article by executing the different some visualization and workflow sys- lines; an edge between two pipelines nodes in the history tree. Note that tems provide support for provenance corresponds to changes performed on by using the action-based provenance tracking, their focus has been on data the parent pipeline to obtain its child. model, we obtain a very concise rep- provenance—that is, information For readability, by default, only the resentation of the history, which uses about how the system derived a given nodes in the tree that the user tags as substantially less space than the alter- data product, including the param- important are displayed. native of explicitly storing multiple eter values used6—and on interaction By tracking all the changes made versions of a pipeline.3 provenance (such as capturing a visu- to a workflow ensemble, VisTrails The exploration trail VisTrails cap- alization’s viewing manipulations).7 properly captures each step, leaving tures also supports various activities VisTrails is the first system to capture a complete trail of the work. Hav- that are crucial for performing reflec- information about how workflows ing access to the different pipelines’ tive reasoning and obtaining insights, evolve over time. specifications lets others reproduce such as following chains of reasoning For instance, to generate the and share the results of each step in backward and forward and comparing composite visualization in our final the exploratory process. To demon- different results.4 The tree-based view example, we extended our pipeline strate this, we made the vistrail asso- lets users labeled Volume Rendering to include

88 Computing in Science & Engineering modules from the pipeline labeled to display the workflow and highlight Data Flow for Scientific Applications (SciFlow), Isosurfacing. (Having two pipelines the modules that match the query. IEEE CS Press, 2006. lets us further explore the visualiza- Users can then click on the individual 4. D.A. Norman, Things That Make Us Smart: Defending Human Attributes in the Age of the tion—by trying different isosurface modules to view the execution log re- Machine, Addison-Wesley, 1994. values, for example [see Figure 2]). In cords associated with them. 5. G. Kindlmann, “Lack of Reproducibility Hin- addition, we can compare the pipe- ders Visualization Science,” IEEE Visualization lines by dragging one node on top Compendium, IEEE CS Press, 2006, p. 69. of the other (see Figure 3). These he VisTrails project has focused 6. T. Jankun-Kelly, K.-L. Ma, and M. Gertz, “A Model and Framework for Visualization computed differences are useful for T on creating an infrastructure to Exploration,” IEEE Trans. Visualization and understanding the visualization pro- manage the provenance data of ex- Computer Graphics, vol. 13, no. 2, 2007, pp. cess, and the user can also reuse them. ploratory tasks. With this infrastruc- 357–369. In this case, we applied the modules ture in place, our research focus is 7. D.P. Groth and K. Streefkerk, “Provenance and Annotation for Visual Exploration unique to “Isosurfacing” to “Volume now on what we can do with all the Systems,” IEEE Trans. Visualization and Rendering” to create a new pipeline provenance accumulatation. By min- Computer Graphics, vol. 12, no. 6, 2006, pp. called “Combined Rendering,” that ing this information, we hope to learn 1500–1510. uses a cutting plane to define regions useful patterns that can guide users 8. C. Scheidegger et al., Querying and Creating Visualizations by Analogy,” to be for the rendering methods. VisTrails in assembling and refining complex published in IEEE Trans. Visualization and can automatically apply pipeline dif- computational tasks. Computer Graphics, 2007. ferences (like a patch) to derive new pipelines in a process we call visual- Acknowledgments ization creation by analogy.8 This article summarizes work being Claudio T. Silva is an associate professor at Another benefit to having a high- done in the VisTrails project. It’s only the University of Utah. His research inter- level specification of the visualiza- possible through the work of all our ests include visualization, geometry pro- tion process is that users can query team members: Erik Anderson, Jason cessing, graphics, and high-performance the pipelines and their execution Callahan, David Koop, Emanuele computing. Silva has a PhD in computer instances. Scientists can query a vis- Santos, Carlos E. Scheidegger, and science from SUNY at Stony Brook. He trail to find anomalies in previously Huy T. Vo. The data used in this ar- is a member of the IEEE, the ACM, Eu- generated visualizations and locate ticle is available courtesy of the Na- rographics, and Sociedade Brasileira data products and visualizations tional Library of Medicine’s Visible de Matematica. Contact him at csilva@ based on operations applied in the Human Project. The US National Sci- cs.utah.edu. visualization process. VisTrails sup- ence Foundation partially supported ports simple, keyword-based queries this work under grants IIS-0513692, Juliana Freire is an assistant professor at as well as structured queries. In addi- CCF-0401498, EIA-0323604, CNS- the University of Utah. Her research inter- tion to providing information about 0541560, OCE-0424602, and OISE- ests include scientific data management, the results (for example, workflow 0405402. The US Department of Web information systems, and information identifiers and attributes), VisTrails Energy, an IBM Faculty Award, and integration. Freire has a PhD in computer can visually display query results by a University of Utah Seed Grant also science from SUNY at Stony Brook. She is a highlighting the workflows and mod- partially supported this work. member of the ACM and the IEEE. Contact ules that satisfy the query. Table 1 her at [email protected]. shows an example. References Users might also define queries by Steven P. Callahan is a research assistant and 8 1. W. Schroeder, K. Martin, and B. Lorensen, example. As Figure 5 illustrates, us- The Visualization Toolkit: An Object-Oriented PhD candidate at the University of Utah. His ers can construct (or copy and paste) Approach To 3D Graphics, Kitware, 2003. research interests include scientific visual- a pipeline fragment into the VisTrails 2. E.A. Lee and T.M. Parks, “Dataflow Process ization, visualization systems, and computer query tab to identify in the history Networks,” Proc. IEEE, vol. 83, no. 5, 1995, graphics. Callahan has an MS in computa- pp. 773–801. tree all nodes that contain that frag- tional engineering and science from the 3. S. Callahan et al., “Managing the Evolution ment. They can then browse through of Dataflows with VisTrails (extended ab- University of Utah. Contact him at stevec@ the highlighted nodes and click on one stract),” Proc. IEEE Workshop on Workflow and sci.utah.edu.

September/October 2007 89