DESIGNING VISUALIZATIONS FOR BIOLOGICAL DATA Miriah Meyer, School of Computing, , Salt Lake City, UT, 84112, U.S.A. E-mail: [email protected]

Submitted:

Abstract is now a vital component of the biological discovery process. In this article I pre- sent visualization design studies as a promising Fig. 1. This nine-stage framework [1] provides practical guidance for designing visualization process for creating effective, visualization tools tools that focus on solving specific, real-world problems. Copyright Miriah Meyer, 2012. for biological data.

Design studies are often highly col- chemical reactions that occur in cells. The field of biology has been trans- laborative projects that, at their heart, The problem they faced at the onset of formed over the last decade with the rely on a visualization practitioner ac- our collaboration was that existing tools rapidly decreasing cost of data. Today, quiring a deep understanding of a prob- could only look at a subset of this data at the focus is on combining huge ge- lem and exploring the broad space of a time. nomics databases with large amounts of design solutions. Through progressively During the discover stage of the pro- molecular data, and then augmenting refining prototypes based on feedback ject we conducted weekly meetings with this data with long-term clinical out- from the target end-users, effective visu- the group of 7 biologists over the course comes across large populations of peo- alization tools emerge for solving a real- of 3 months to learn about their scien- ple. Biologists believe that embedded in world analysis problem. tific questions, data analysis needs, and these complex, massive datasets are The nine-stage framework [1] lays out existing visualization tools. scientific goldmines like a cure for can- a practical process for conducting design From these meetings we came to un- cer. But how do you combine data that studies, shown in Figure 1. This frame- derstand that the group was working spans from the molecular level up to the work contains three precondition stages with four kinds of data: several dozen population level in a meaningful way? that focus on learning the space of visu- graphs that describe cellular reactions Visualization has emerged as an im- alization techniques and establishing that are catalyzed by genes; a large portant way to make sense of this data. synergistic collaborations. The inner of quantitative values that describe Lying at the intersection of computer four core stages describe the process for how much each gene in each species is science, design, and biology, visualiza- characterizing a real-world problem on or off over time during an experi- tion of biological data enables scientists followed by the design, implementation, ment; a tree depicting the evolutionary to understand data by encoding meaning and evaluation of a visualization solu- relationship of the fourteen species of through images and supporting explora- tion. The final two analysis stages de- yeast; and quantitative correlation tion through human-computer interac- scribe steps for analyzing lessons- values that describe how similar the tions. My research shows that producing learned during the project and com- temporal values are for a specific gene agile visualizations by matching the rate municating those findings to the visuali- across the species. of software development to the biolo- zation research community. The entire We preformed a similar analysis on gists’ rate of experimentation produces process is iterative and practitioners the questions our collaborators were tools that not only support scientists, but often jump backwards to earlier stages asking and translated them into a list of also influence them as they tackle com- when refining ideas. data analysis tasks. We characterized plex questions. these tasks as operating at four different A recent trend in the visualization Pathline: A Design Study levels. Briefly, those tasks, starting at community is to develop influential To illustrate how the nine-stage frame- the lowest level, are: tools by conducting design studies in work applies to a real-world, visualiza- 1. Study the gene data over time to close collaboration with scientists and tion design problem, I’ll discuss the observe dynamic trends in the data. other expert end users. In this paper I process of creating the tool Pathline [2], 2. Compare a limited number of time describe a methodology for conducting a visualization system for exploring series, such as those corresponding to a design studies, and illustrate their effec- molecular biology data. specific gene across all the species, to tiveness with an example from my own In this project we collaborated with a characterize meaningful differences. work. group of biologists at the Broad Institute 3. Compare the correlation values of Harvard and MIT who are studying across one or more of the reaction Visualization Design Studies how the same sets of genes can facilitate graphs to understand where the genes A design study is a “project in which different kinds of cellular functions in behave the same, and differently, in the visualization researchers analyze a spe- related species. Our collaborators specif- species. cific real-world problem faced by do- ically study metabolism in yeast, and 4. Compare multiple correlation val- main experts, design a visualization they spent several years conducting ex- ues to validate which correlation metric system that supports solving this prob- periments and collecting data for many is the most informative for the specific lem, validate the design, and reflect different genes in fourteen species of analysis needs of the group. about lessons learned in order to refine yeast. Their goal is to understand pat- This data and task abstraction [3] visualization design guidelines”[1]. terns and trends in this data in the con- served as the input for the design stage. text of known information about In this stage we focused our designs around the idea of encoding quantitative the discovery of a previously unknown References and Notes values with a spatial encoding channel evolutionary event. * This paper was presented as a keynote talk instead of other, less effective channels The design study project that under- at Arts, Humanities, and Complex Networks – 3rd Leonardo satellite symposium at NetSci2012. such as color [4]. This means that in- lies Pathline has several visualization See stead of using a node-link to research contributions, including a de- 1. M. Sedlmair, M. Meyer, T. Munzner, “Design visualize the reaction graphs which re- tailed characterization of this subdomain Study Methodology: Reflections from the Trenches lies on spatial position to show topology in biology, the invention and validation and the Stacks”, IEEE TVCG 18, No. 12 (2012), pp. [5], we instead linearize the graphs and of new visualization representations, and 2431-2440. use position to layer on the correlation the creation of the first visualization tool 2. M. Meyer, B. Wong, M. Styczyski, T. Munzner, values. We also use spatial encoding to to combine all four kinds of biological H. Pfister, “Pathline: A Tool for Comparative Functional Genomics”, Forum show the temporal gene data using a data used by the group. 29, No. 3 (2010), pp. 1043-1052. matrix layout of line , called a 3. T. Munzner, “A Nested Model for Visualization curvemap, as opposed to the widely used Conclusions Design and Validation”, IEEE TVCG 15, No. 6 color encoding of the heatmap display Today, biological data holds the promise (2009), pp. 921-928. [6]. The linearized graph representation of explaining the origins of life, curing 4. W. Cleveland and R. McGill, “Graphical Percep- and the curvemap display are shown in tion: Theory, Experimentation, and Application to human disease, and helping us to live the Development of Graphical Methods”, J. Ameri- the left and right of Figure 2 respective- healthy and happier lives. Reaching the- can Statistical Association 79, (1984), pp. 531-554. ly. This figure is a screenshot of the se goals, however, relies on making 5. G. Di Battista, P. Eades, R. Tamassia, I. Tollis, Pathline visualization interface. sense of vast amounts of biological data. Graph Drawing: Algorithms for the Visualization These two views are linked together This challenge has placed visualization of Graphs (Prentice Hall PTR, 1998). by interaction. A user selects genes of directly within the scientific discovery 6. L. Wilkinson and M. Friendly, “The History of interest on-the-fly in the linearized rep- process. the Cluster Heat ”, American Statistician 63, No. 2 (2009), pp. 179-184. resentation which populates the Based on close collaborations between curvemap display by adding columns of visualization practitioners and domain 7. B. Shneiderman, “The Eyes Have It: A Task by Data Type Taxonomy for Information Visualiza- temporal gene data. Thus, the linearized experts, design studies are a visualiza- tions”, Proc. IEEE Symp. Visual Languages, representation serves as an overview of tion method for creating tools that sup- (1996), pp. 336-342. the data, guiding the user to select inter- port complex, real-world, data analysis 8. B. Fry, Visualizing Data (O’Reilly Media, 2008). esting genes for more detailed analysis problems. Design studies have proven to 9. P. Cairns and A. Cox, Research Methods for in the curvemap display [7]. be a powerful method for designing Human-Computer Interaction (Cambridge Univer- During the implement stage of the de- effective visualization tools –this trend sity Press, 2008). sign study process we used an iterative is expected to continue as the flood of refinement scheme over the course of biological data continues to grow. two months. We created numerous low- fidelity prototypes using Fig. 2. Pathline is a tool for exploring molecular biology data [2]. The view on the left is a software, followed by rudimentary inter- linearized representation of graphs that describe cellular reactions. The view on the right active prototypes using the Processing is a curvemap display showing time-series data for several user-selected genes across related species. Copyright Miriah Meyer, 2012. programming language [8]. Each proto- type was refined based on feedback from our collaborators. Source code, executa- bles, and example data for the final ver- sion of Pathline can be found at http://www.pathline.org. In the deploy stage we released the fi- nal version of Pathline to our collabora- tors – it is now one of the primary analysis tools used by the group. The biologists verified that Pathline shows known information more clearly than could be seen with their previous visual- ization tools, and they directly attributed new insights into their data to the use of Pathline. We complied a series of case studies [9] that illustrate how the biologists use Pathline and what types of analysis they are able to perform. These case studies decribe how, using Pathline, the biolo- gists were able to: discover a bias in their data processing pipeline that pro- duced missing data; confirm known findings orders of magnitude faster than using conventional visualization tools; develop numerous hypotheses, one of which led to a follow-up experiment and