Supplement on visualizing biological data

Biology is a visually grounded scientific discipline—from the way data is collected and analyzed to the manner in which the results are communicated to others. Visualization methods have advanced greatly from the hand-drawn pictures found in scientific publications before the twentieth century and now rely almost exclusively on computer-based visualization tools. But the similarity of modern computer-generated phylogenetic trees to their ancestral hand-drawn evolutionary trees illustrates the challenges involved in developing novel visualization methods that present information in a self-evident way and yet can handle the demands placed on them by modern methods of data generation.

The exponentially increasing amount of scientific data is taxing the abilities of scientists to make sense of it all and communicate it to others in a concise and meaningful way. Although the computers that facilitate this data deluge also help handle it, it is critical that scientists be able to participate intimately in the analysis steps using qualitative and quantitative abstractions of the underlying data.

This supplement describes methods and tools and how these methods are adapting to the challenges accompanying modern biology. A Commentary introduces the topic and summarizes the general challenges. Five Reviews describe the visualization approaches and software tools that biologists use for, respectively, data visualization of genomes, alignments and phylogenies, image-based data, macromolecular structures and systems biology data.

Each review highlights a recommended fraction of the available tools. Because these tools can be very specialized and the writers themselves are developers of some of the tools, there is little comparative assessment. Instead, the reviews focus more on the challenges and methods behind the tools. The tools themselves, ranging from simple stand-alone software to complex integrated software packages, are conveniently listed in tables within each review, and links are provided so that readers may easily access the tools and evaluate which ones best meet their specific needs.

Daniel Evanko

Contents

S2 Visualizing biological data—now and in the future
S I O’Donoghue, A-C Gavin, N Gehlenborg, D S Goodsell, J-K Hériché, C B Nielsen, C North, A J Olson, J B Procter, D W Shattuck, T Walter & B Wong

S5 Visualizing genomes: techniques and challenges
C B Nielsen, M Cantor, I Dubchak, D Gordon & T Wang

S16 Visualization of multiple alignments, phylogenies and gene family evolution
J B Procter, J Thompson, I Letunic, C Creevey, F Jossinet & G J Barton

S26 Visualization of image data from cells to organisms
T Walter, D W Shattuck, R Baldock, M E Bastin, A E Carpenter, S Duce, J Ellenberg, A Fraser, N Hamilton, S Pieper, M A Ragan, J E Schneider, P Tomancak & J-K Hériché

S42 Visualization of macromolecular structures
S I O’Donoghue, D S Goodsell, A S Frangakis, F Jossinet, R A Laskowski, M Nilges, H R Saibil, A Schafferhans, R C Wade, E Westhof & A J Olson

S56 Visualization of omics data for systems biology
N Gehlenborg, S I O’Donoghue, N S Baliga, A Goesmann, M A Hibbs, H Kitano, O Kohlbacher, H Neuweger, R Schneider, D Tenenbaum & A-C Gavin

The cover image shows a range of data visualizations currently used by life scientists. Source images come from figures in the Nature Methods supplement “Visualizing biological data” and from Nature Cell Biology and Nature Biotechnology. Cover design by Seán O’Donoghue and Bang Wong.

Editor, Nature Methods: Daniel Evanko
Publisher: Veronique Kiermer
Senior Copy Editor: Anita Gould
Senior Production Editor: Brandy Cafarella
Production Editor: Amanda Crawford
Managing Production Editor: Ingrid McNamara
Marketing: Joanna Budukiewicz

© 2010 Nature America, Inc. All rights reserved.

nature methods SUPPLEMENT | VOL.7 NO.3s | MARCH 2010 | S1 commentary

Visualizing biological data—now and in the future
Seán I O’Donoghue¹, Anne-Claude Gavin¹, Nils Gehlenborg²,³, David S Goodsell⁴, Jean-Karim Hériché¹, Cydney B Nielsen⁵, Chris North⁶, Arthur J Olson⁴, James B Procter⁷, David W Shattuck⁸, Thomas Walter¹ & Bang Wong⁹

Methods and tools for visualizing biological data have improved considerably over the last decades, but they are still inadequate for some high-throughput data sets. For most users, a key challenge is to benefit from the deluge of data without being overwhelmed by it. This challenge is still largely unfulfilled and will require the development of truly integrated and highly useable tools.

¹European Molecular Biology Laboratory, Heidelberg, Germany. ²European Bioinformatics Institute, Cambridge, UK. ³Graduate School of Life Sciences, University of Cambridge, Cambridge, UK. ⁴The Scripps Research Institute, La Jolla, California, USA. ⁵British Columbia Cancer Agency, Genome Sciences Centre, Vancouver, British Columbia, Canada. ⁶Virginia Tech, Blacksburg, Virginia, USA. ⁷School of Life Sciences Research, College of Life Sciences, University of Dundee, Dundee, UK. ⁸Laboratory of Neuro Imaging, University of California, Los Angeles, California, USA. ⁹Broad Institute of MIT & Harvard, Cambridge, Massachusetts, USA. e-mail: [email protected]

Computer-based visualization is widely used in biology to help understand and communicate data, to generate ideas and to gain insight into biological processes. This collection of reviews examines the key methods now being used to visualize genomes¹, alignments and phylogenies², macromolecular structures³, systems biology data⁴ and image-based data⁵. Here, we outline several common trends, challenges and recent advances that suggest the nature of future visualization in biology.

Visualization goes mainstream

Twenty years ago, only experts could create computer images of a protein structure at atomic detail, a large phylogenetic tree, or a complex biochemical pathway. Today, software tools for creating these images are widely available and widely used. Of the different visualization areas in biology, molecular graphics³ is perhaps the most mature, and as a result, molecular graphic images are widely used in textbooks, presentations and popular media. Other fields, such as genome visualization¹, are much younger; however, even here, molecular biologists have a rich toolbox of visualization software at their disposal, many of these tools amenable to use by non-experts¹.

A main reason for the increased accessibility and use of visualization software has been the advances in computer hardware and network access. Many visualization tasks that previously required expensive and specialized hardware can now be easily managed with a standard personal computer. However, an equally important factor has been the development of a wide range of methods and tools specialized in visualizing specific kinds of biological data. In this Supplement, we discuss over 200 tools selected from the much greater number now available. This diversity of tools can be confusing, but it is probably unavoidable, given the diverse nature of the biosciences. In fact, in many cases, biologists still find that their exact requirements are not met by current tools and often have to create custom solutions. This has helped spur a growing trend to allow reuse of visualization software, either by means of open source software libraries (for example, http://www.vtk.org/) or by means of architectures specifically designed to allow extensions (for example, Cytoscape⁶).

Integration is improving

In the past, visualization tools were typically stand-alone programs designed to view data from a single experiment. In contrast, many of today’s tools are integrated with remote databases and provide visualizations that integrate data from multiple sources. For instance, Jalview⁷—a popular tool for editing multiple sequence alignments—can connect to multiple data sources and displays not only alignments but also a wide variety of sequence feature information.

In addition, tools are increasingly being designed to interoperate directly with other visualization and analysis tools. Such interoperation can enable, for example, simultaneous interactive visualization of a multiple sequence alignment with corresponding three-dimensional structures (Procter et al.² and O’Donoghue et al.³)—or of a network with corresponding heat maps, profile plots or phylogenetic trees and dendrograms (Gehlenborg et al.⁴).

Finally, many of today’s visualization tools can be either directly embedded into, or launched from, web pages; and such tools are being used to construct integrated web applications for data mining and browsing, often using multiple visualization tools. For example, the UCSC Genome Browser⁸ shows genomic sequences assembled from many laboratories and provides access to a diverse range of related data, including multiple sequence alignments among sequences from similar organisms, three-dimensional structures and in situ hybridization images.

The improved integration in visualization tools has been helped greatly by a trend toward increased consolidation of experimental data. An exemplary case of this trend is macromolecular three-dimensional structure: almost all experimentally determined structures are consolidated in a single resource (wwPDB⁹). Unfortunately, such consolidation is still the exception: it is more typical in biology to have equivalent data distributed over many resources. In the case of image data from high-throughput experiments, most of these data are never made publicly available, even though this would clearly be of value. Some preliminary steps are being made (for example, CCDB¹⁰, http://ccdb.ucsd.edu/), but a truly consolidated resource for image data is likely to remain a distant goal owing to difficulties with defining standards for organizing and categorizing these data and to data set sizes that are prohibitively large for network-based transfer.

[Figure 1: image panels labeled Tissue level, Cellular level and Molecular level, with sub-labels Cell types and Lung X-ray]

Figure 1 | Possible integrated visualization environment. Soon, biologists may be able to seamlessly move between data from tissue, cellular and molecular scales, as well as data from genomes, networks and pathways. Many of these data will be organized around a cellular coordinate framework, and visualization of biochemical pathways will allow increasingly detailed representations of cellular topology and of proteins. For instance, selecting a tissue (left image) could automatically show micrographs of cell types; selecting a cell type could show relevant pathways (center image); selecting a protein from the pathway could access micrographs showing the cellular distribution and effects of knockdown experiments (bottom, center); in addition, selecting the protein could show atomic-detail three-dimensional structural information, sequence features, alignments and genomic location. Many of these interoperations are already being used today. Images courtesy of ClearScience (drawing), iStockPhoto (lung X-ray), Univ. of Kansas Medical Center (lung histology), Digizyme and Cell Signaling Technology (pathway). Protein structure and sequence alignment made using SRS 3D; chromosome image from UCSC Genome Browser⁸.

User-interface challenges

Although visualization methods and tools have greatly improved, there has also been an exponential increase in the size and complexity of data sets studied in biology. A common challenge faced by many biologists is how to benefit from this data deluge without being overwhelmed by it. Visualization is clearly part of the solution; however, the sheer number and diversity of tools available can make the problem worse. Below we discuss several recent advances toward addressing these issues.

Usability. Very often, biologists fail to fully benefit from visualization methods because software tools are too difficult to learn. Making software that is easy to use often requires considerable work. Fortunately, there have been many advances in understanding principles of software usability¹¹. These principles are increasingly being adopted by developers of visualization tools for biologists. Judging from progress over the past decade, we expect the usability standard to continue to improve. Unfortunately, improvements may be slow, because work on usability is usually less rewarded in science than is research on new methods.

Visual analytics. In the process of understanding and interpreting biological data, tools ideally would provide visualization for tasks that require human judgment, and other tasks would be automated where possible. But finding a productive balance between automation and visualization is a challenge and is one of the goals of visual analytics methods¹², which involve studying the role of visualization in the whole process of analyzing and understanding data. Recently, these methods have begun to be applied to biological visualization tools, and, if successful, these developments will improve the ability of tools to provide meaningful biological insights¹³ and to meet user requirements¹⁴.

Multiscale representation and navigation. Biological data visualization often deals with a broad range of scales—for example, images may range from the atomic scale to the cellular level³,⁵, and genomic browsers provide information from whole chromosomes down to an individual nucleotide position. To be useful, the graphical representations used need to adjust, ideally displaying the level of detail appropriate to a particular scale. For example, in showing the three-dimensional structure of a protein, ribbon representation is often used to hide all atoms except those involved in ligand interactions; as a user zooms out to see higher-order protein complexes, ribbon representation is too detailed and is replaced by an overall surface. Although the basic ideas are not new, the details of how to realize multiscale navigation vary greatly with the data type. This is the subject of ongoing research, particularly in visualizing genomic data, pathways and networks and joint visualization of image data sets acquired at different resolution, requiring multimodal image registration.

Innovative representations. In all areas of biology, new visual metaphors and graphical representations are being developed to convey information and to facilitate navigation. Innovation of representations is often inspired by the need to visualize new types of data or to support new analysis tasks. Examples include the need to display expression profile data together with pathway data (Gehlenborg et al.⁴) or the need to make genome assembly structures easier to see (for example, ABySS-Explorer¹⁵). In some cases, the innovations are brought in from outside of biology; for instance, partial order graphs are representations taken from discrete mathematics that are now being used to create concise summaries of multiple alignment information (for example, POAVIZ¹⁶) and to visualize alternative gene splicing.

Standardized representations. Because visualization methods are still rapidly evolving, part of the difficulty faced by end users today arises from a lack of standards in representations. Although there is an obvious strength in diversity, and indeed a need for continued innovation in graphical representation, in many cases usability would be enhanced by the adoption of some standards in representation. In systems biology, there has recently been a significant community-driven proposal¹⁷ toward developing a more unified standard for graphical notation of biochemical networks, and we anticipate similar proposals in other areas.

Display hardware. To help display and use complex biological data sets, large display devices and tiled arrays with improved resolution are likely to be of significant benefit¹⁸. As these devices become more affordable, they are likely to see more use. For instance, touch tables are promising for navigation and collaborative work on complex phylogenetic hierarchies (http://involvweb.org/).

Adding a third dimension. The use of three-dimensional visualization is being explored for networks¹⁹, phylogenetic trees (Procter et al.²) and genomics data (for example, http://genodive.org/). Although the third dimension adds complexity to the user interface, three-dimensional visualization may be necessary for some very complex data sets. Visualization in three dimensions is helped greatly by hardware stereo, which is now becoming easily affordable.

Augmented computer interaction. For challenging data sets, we anticipate the increased use of methods that augment or improve the ability to interact with visual data. For example, tangible devices that give touch feedback are becoming more affordable and are promising for three-dimensional structure visualization²⁰. Preliminary studies on augmenting visualization with auditory techniques (‘sonification’) have also been done, using molecular three-dimensional structure and sequence information, and preliminary results are encouraging²¹.

Computational challenges

Today, many visualization tasks are easily accomplished using a standard personal computer. However, in almost all areas of biology, visualization of cutting-edge data sets remains a challenge. For instance, a modern high-throughput image data set may consist of thousands of videos or hundreds of channels (each channel typically corresponding to one gene product)—and may be up to tens of terabytes in size. To interactively visualize these data, personal computers are often inadequate.

This situation is inspiring further innovation in software—especially in methods for dimension reduction and classification, which underlie visualization tools in many areas of biology. For example, the recently developed MCL clustering algorithm²²—which enables fast network clustering—has been implemented in a range of visualization tools, particularly in systems biology. Although these advances will undoubtedly improve on today’s limitations, our ability to collect data will also continue to improve, and it is certain to continually challenge our visualization capabilities.

Future visualization

Ultimately, the goal of visualizing biological data is to provide biologists with an integrated framework they can use to gain insight into the processes in organelles, cells, organs and even whole organisms. Fulfilling this ambitious goal requires substantial further development in visualization methods, especially better integration of different tool types.

Several efforts to build such integrated visualization frameworks have begun—for example, using the framework of genomic coordinates to integrate increasingly diverse data²³. Other frameworks based on commonly used systems biology data types are being developed²⁴. And projects from the structural biology and microscopy communities aim to integrate biological data on the basis of a cellular coordinate framework (for example, Visible Cell²⁵ and others²⁶) by synthesizing multiscale data, including data from cellular tomograms, cryo-electron microscopy, and atomic-detail three-dimensional structures, as well as inventories of expressed proteins, estimations of organelle shapes and distributions, and protein localizations and gradients. Probably no single framework will suit all biologists; however, the goals of these different efforts may eventually produce a standardized visualization environment that allows seamless integration of biological data (Fig. 1).

Creating such integrated visualization frameworks will require a collective effort, and several initiatives toward collaborative, community-based editing of biological image data have already begun (for example, CATMAID²⁷). But all these efforts are still very much at the pioneering stage, and, to paraphrase Alan Kay, we could say that the revolution in biological data visualization hasn’t started yet.

COMPETING INTERESTS STATEMENT
The authors declare no competing financial interests.

1. Nielsen, C.B., Cantor, M., Dubchak, I., Gordon, D. & Wang, T. Nat. Methods 7, S5–S15 (2010).
2. Procter, J.B. et al. Nat. Methods 7, S16–S25 (2010).
3. O’Donoghue, S.I. et al. Nat. Methods 7, S42–S55 (2010).
4. Gehlenborg, N. et al. Nat. Methods 7, S56–S68 (2010).
5. Walter, T. et al. Nat. Methods 7, S26–S41 (2010).
6. Shannon, P. et al. Genome Res. 13, 2498–2504 (2003).
7. Waterhouse, A.M., Procter, J.B., Martin, D.M., Clamp, M. & Barton, G.J. Bioinformatics 25, 1189–1191 (2009).
8. Rhead, B. et al. Nucleic Acids Res. 38 (database issue), D613–D619 (2010).
9. Berman, H., Henrick, K. & Nakamura, H. Nat. Struct. Biol. 10, 980 (2003).
10. Martone, M.E. et al. J. Struct. Biol. 161, 220–231 (2008).
11. Shneiderman, B. & Plaisant, C. Designing the User Interface: Strategies for Effective Human-Computer Interaction 5th edn. (Addison Wesley, Reading, Massachusetts, USA, 2009).
12. Thomas, J.J. & Cook, K.A. Illuminating the Path: The Research and Development Agenda for Visual Analytics (National Visual Analytics Center & IEEE, Richland, Washington, USA, 2005).
13. Saraiya, P., North, C. & Duca, K. IEEE Trans. Vis. Comput. Graph. 11, 443–456 (2005).
14. Mirel, B. J. Biomed. Discov. Collab. 4, 2 (2009).
15. Nielsen, C.B., Jackman, S.D., Birol, I. & Jones, S.J.M. IEEE Trans. Vis. Comput. Graph. 15, 881–888 (2009).
16. Grasso, C., Quist, M., Ke, K. & Lee, C. Bioinformatics 19, 1446–1448 (2003).
17. Le Novère, N. et al. Nat. Biotechnol. 27, 735–741 (2009).
18. Ball, R. & North, C. Comput. Graph. 31, 380–400 (2007).
19. Freeman, T.C. et al. PLOS Comput. Biol. 3, e206 (2007).
20. Gillet, A., Sanner, M., Stoffler, D. & Olson, A. Structure 13, 483–491 (2005).
21. Garcia-Ruiz, M.A. & Gutierrez-Pulido, J.R. Interact. Comput. 18, 853–868 (2006).
22. Enright, A.J., Van Dongen, S. & Ouzounis, C.A. Nucleic Acids Res. 30, 1575–1584 (2002).
23. Dowell, R.D., Jokerst, R.M., Day, A., Eddy, S.R. & Stein, L. BMC Bioinformatics 2, 7 (2001).
24. Shannon, P.T., Reiss, D.J., Bonneau, R. & Baliga, N.S. BMC Bioinformatics 7, 176 (2006).
25. Burrage, K., Hood, L. & Ragan, M.A. Brief. Bioinform. 7, 390–398 (2006).
26. Suderman, M. & Hallett, M. Bioinformatics 23, 2651–2659 (2007).
27. Saalfeld, S., Cardona, A., Hartenstein, V. & Tomancák, P. Bioinformatics 25, 1984–1986 (2009).


Visualizing genomes: techniques and challenges
Cydney B Nielsen¹, Michael Cantor², Inna Dubchak²,³, David Gordon⁴ & Ting Wang⁵

As our ability to generate sequencing data continues to increase, data analysis is replacing data generation as the rate-limiting step in genomics studies. Here we provide a guide to genomic data visualization tools that facilitate analysis tasks by enabling researchers to explore, interpret and manipulate their data, and in some cases perform on-the-fly computations. We will discuss graphical methods designed for the analysis of de novo sequencing assemblies and read alignments, genome browsing, and comparative genomics, highlighting the strengths and limitations of these approaches and the challenges ahead.

The study of genomes has to a large extent become a digital science made possible by the advent of sequencing technology and its power to detect genomic sequence at nucleotide resolution. The emergence of extensive sequence data resources opened new interfaces with computer science, fuelling fields such as bioinformatics, and enabled biological questions to be tackled computationally. The recent innovations in sequencing technology provide an unprecedented capacity for data generation. Now more than ever, we require intuitive and rapid data exploration and analysis capabilities.

Although many genome data analysis tasks can be accomplished with automated processes, some steps continue to require human judgment and are frequently rate limiting. Visualization can augment our ability to reason about complex data, thereby increasing the efficiency of manual analyses. In some cases, the appropriate image makes the solution obvious. Given the importance of human interpretation particularly in the early hypothesis generation stages of biological research, visual tools also provide a valuable complement to automated computational techniques in enabling us to derive scientific insight from large-scale genomic data sets. Visual and automated approaches are particularly powerful when used in combination, such that a user can seamlessly inspect and perform computations on their data, iteratively refining their analyses.

One challenge in designing visual tools is deciding on a graphical representation—essentially, how the data are encoded into colors and shapes or transformed onto different scales. The choice of representation can either help or hinder a user’s ability to interpret the data and ideally should be designed to facilitate the analysis task. For example, genomic rearrangements may be more easily viewed as arcs on a circle than on a line. Genomic data are derived from diverse sources using different techniques, each accompanied by its own experimental error. It is important that visual representations capture this technical uncertainty and any resulting inconsistencies. There is also substantial biological variation between individuals, which needs to be distinguished from the technical variation mentioned above. In addition to the challenges of choosing an appropriate visual representation, some types of primary data are unavailable owing to their prohibitive online storage requirements, and enabling real-time interaction with large-scale data sets is nontrivial.

This review highlights examples from three core user tasks: (i) analyzing sequence data, both in the context of de novo assembly and of resequencing experiments, (ii) browsing annotations and experimental data mapped to a reference genome and, finally, (iii) comparing sequences from different organisms or individuals.
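The remark that genomic rearrangements may read more easily on a circle than on a line can be made concrete with a small sketch. It maps linear genome coordinates to points on a circle, so that a rearrangement between two breakpoints becomes a link across the circle's interior. The function names and the example coordinates are hypothetical and are not taken from any tool discussed in this review; plotting tools usually draw the link as a curve, whereas this sketch only computes its two endpoints.

```python
import math


def position_to_angle(position, genome_length):
    """Map a 0-based genomic coordinate to an angle in radians."""
    if not 0 <= position < genome_length:
        raise ValueError("position lies outside the genome")
    return 2.0 * math.pi * position / genome_length


def position_to_point(position, genome_length, radius=1.0):
    """Return (x, y) on the circle; coordinate 0 sits at 12 o'clock
    and coordinates increase clockwise."""
    angle = position_to_angle(position, genome_length)
    return (radius * math.sin(angle), radius * math.cos(angle))


def rearrangement_link(start, end, genome_length, radius=1.0):
    """Endpoints of the link joining the two breakpoints of a rearrangement."""
    return (
        position_to_point(start, genome_length, radius),
        position_to_point(end, genome_length, radius),
    )


if __name__ == "__main__":
    # Hypothetical rearrangement joining positions 10 and 60 of a 100-unit genome.
    print(rearrangement_link(10, 60, 100))
```

On a linear axis, links between distant positions stack into long overlapping arches; on the circle, every pair of positions is at most one diameter apart, which is one reason the circular layout can be easier to read.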

¹British Columbia Cancer Agency, Genome Sciences Centre, Vancouver, British Columbia, Canada. ²Department of Energy Joint Genome Institutes, Walnut Creek, California, USA. ³Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA. ⁴Department of Genome Sciences, University of Washington, Seattle, Washington, USA. ⁵Department of Genetics, Center for Genome Sciences, Washington University School of Medicine, St. Louis, Missouri, USA. Correspondence should be addressed to C.B.N. ([email protected]).
Published online 1 March 2010; doi:10.1038/nmeth.1422


Visualization methods in these domains are at different stages of maturity, and we will discuss their respective strengths and limitations. One important consideration is that the field of genomics is rapidly evolving. Although we have attempted to provide a guide to the techniques in this area, it is likely that new tools and data formats will emerge in the very near future, and we will discuss some of the associated challenges. We encourage readers to consult online resources, such as SEQanswers (http://seqanswers.com/), for the most recent tool developments.

Visualizing sequencing data

Interpreting the raw data from a sequencing machine begins with automated data processing. Base calling and quality calculations are followed by sequence assembly in the case of de novo genome sequencing projects, or by read alignment to a reference in the case of resequencing. Recent innovations in sequencing technology have been accompanied by a growth in new assembly and alignment programs to cope with the shorter read lengths and larger numbers of reads (for reviews, see refs. 1,2), but no standards have been reached. For some downstream analysis tasks, visual inspection is valuable in interpreting and validating automated outputs and can drive both biological insight and algorithmic improvements. For example, automated single nucleotide polymorphism (SNP) detection based on sequencing data remains imperfect, and visual inspection is still used to evaluate individual cases for both biological implications and technical observations that may be used to improve the prediction algorithm. This section highlights current graphical tools for analyzing sequencing reads.

Visualizing alignments. Analysis of assemblies and read alignments often involves examination of the sequencing reads themselves, and all tools listed in Table 1 provide a view of aligned read bases. Read sequences are typically represented as letter strings oriented horizontally from left to right and stacked vertically. In the case of assemblies, a user can scan down the corresponding column in the read stack (Fig. 1) to identify the bases contributing to the consensus at a given position. Base qualities (log-transformed error probabilities) are often indicated with gray scale and bases that disagree with the consensus emphasized with color³⁻⁵. Some tools minimize the visual clutter in the read stack by highlighting only discrepancies and concealing all consistent base pairs (for example, Integrative Genomics Viewer (IGV), Hawkeye⁶, US National Center for Biotechnology Information (NCBI) Assembly Archive Viewer⁷, Text Alignment Viewer in SAMtools⁸).

Most tools built before the emergence of next-generation sequencing (NGS) continue to support visualization of the underlying primary data for Sanger reads through a separate ‘trace’ view. For example, in the popular program Consed³, the ‘trace’ window can be launched from the ‘aligned reads’ window, and cursor movement is synchronized between the two displays (Fig. 1). This view allows a user to inspect positions with conflicting bases and uncover the source of ambiguity within the primary traces directly (for example, a base-calling error in one of the reads, a misassembly, or a polymorphism). To a large extent, NGS data has changed how a user evaluates uncertain consensus bases. For example, Consed allows the user to inspect raw Roche 454 sequencing data, but in the case of Illumina and Applied Biosystems’ SOLiD data, there are no raw read traces, only image data. (Details of these sequencing technologies are reviewed elsewhere⁹,¹⁰.) Consed and similar programs do not display primary image data, in part because their large size makes them too expensive to keep in online storage and too slow to display. However, the high read coverage routinely generated by NGS often alleviates the need to inspect any one read. A user can evaluate a suspect base in one read through comparison with the corresponding bases in the other aligned reads at the same location.

Finishing. The output of automated sequence assembly programs is imperfect, and repeat regions, read length and coverage limit contiguity. The next step, ‘finishing’, involves closing gaps, correcting misassemblies and improving the error probabilities of consensus bases. Specialized finishing software facilitates this process by automating and/or allowing a user to perform the above-mentioned tasks. In some cases, automated finishing is sufficient—for example, as performed by Autofinish¹¹, which is a program that examines the output of an assembly program and suggests what laboratory data to acquire (for example, specific primers for PCR). However, in other situations manual inspection and editing are needed to complement automation. Gap4 (refs. 12,13), Consed and commercially available products such as Sequencher (Gene Codes Corporation) and Lasergene¹⁴ (DNASTAR) are widely used finishing programs that provide rich editing functionality and history tracking and enable the user to manually break apart and join contigs, which distinguishes them from static alignment viewers that do not allow editing (Table 1).

In most sequencing protocols, the size range of genomic fragments is known. The sequencing reads derived from opposite ends of the same source genomic fragment (‘mate pairs’) therefore have an expected distance (‘insert size’) and expected orientation (one top strand read and one bottom strand read). Mate pairs that violate these spatial constraints can be used to reveal misassemblies, while consistent mate pairs can be used to join contigs together.

Consed’s ‘assembly view’ depicts mate pairs as color-coded lines spanning contigs, with the contigs represented by horizontally oriented blocks. This display visually separates ‘consistent’ pairs (those of expected insert size and orientation) from the ‘inconsistent’ pairs (those with unexpected insert size or orientation) by plotting them above and below the contig boxes, respectively, which helps to reveal misassemblies (Fig. 1a). One advantage of this tool is that it allows interactive filtering of the displayed data (contigs, mate pairs, similar sequences and so on). Despite this filtering, one limitation is that the view can quickly become cluttered as the number of mate pairs increases. For example, in Consed it is sometimes desirable to turn off the display of all consistent mate pairs internal to a contig because their number overwhelms the image. Applying biologically meaningful aggregation methods and summary techniques to highlight only the most well-supported connections remains an outstanding challenge.

In addition to mate pair relationships, sequence similarity can be used to identify possible contig joins missed by the assembly program. For example, a user can interactively request an alignment of two selected regions within Consed and inspect the output in the ‘compare contigs’ window. Similar functionality exists in other finishing software; for example, Gap4 provides a ‘contig joining editor’. These sequence-based views are complemented by overview displays. Gap4 uses a dot-plot representation wherein each axis indicates positions along a contig’s length and dots demark the locations sharing above-threshold

S2 | VOL.7 NO.3s | MARCH 2010 | nature methods supplement review

Table 1 | Tools for visualizing sequencing data

Name | Cost | OS | Description | URL

Stand-alone tools
ABySS-Explorer25 | Free | Win, Mac, | Interactive assembly structure visualization tool | http://tinyurl.com/abyss-explorer/
CLC Genomics Workbench | $ | Win, Mac, Linux | Integrates NGS data visualization with analysis tools; user friendly | http://www.clcbio.com/
Consed3* | Free | Mac, Linux | Widely used; assembly finishing package; NGS compatible | http://www.phrap.org/
DNASTAR Lasergene14 | $ | Win, Mac | Analysis suite with an assembly finishing package; NGS compatible | http://www.dnastar.com/
EagleView17 | Free | Win, Mac, Linux | Assembly viewer; compatible with single-end NGS | http://tinyurl.com/eagleview/
Gap12,13 | Free | Linux | Widely used; assembly finishing package; Gap5 is NGS compatible | http://staden.sourceforge.net/
Hawkeye6 | Free | Win, Mac, Linux (S) | Sanger sequencing assembly viewer | http://amos.sourceforge.net/hawkeye/
Integrative Genomics Viewer (IGV)* | Free | Win, Mac, Linux | Genome browser with alignment view support (Table 2); NGS compatible | http://www.broadinstitute.org/igv/
MapView18 | Free | Win, Linux | Read alignment viewer; custom file format for fast NGS data loading | http://evolution.sysu.edu.cn/mapview/
MaqView | Free | Mac, Linux | Read alignment viewer; fast NGS data loading from Maq alignment files | http://maq.sourceforge.net/
Orchid | Free | Linux (S) | Assembly viewer customized to display paired-end relationships | http://tinyurl.com/orchid-view/
Sequencher | $ | Win, Mac | Assembly finishing package | http://www.genecodes.com/
SAMtools tview8 | Free | Win, Mac, Linux | Simple and fast text alignment viewer; NGS compatible | http://samtools.sourceforge.net/

Web-based tools
LookSeq19 | Free | | Uses AJAX; y axis for insert size; user configures data resources; NGS compatible | http://lookseq.sourceforge.net/
NCBI Assembly Archive Viewer7 | Free | | Graphical interface to contig and trace data in NCBI’s Assembly Archive | http://tinyurl.com/assmbrowser/

Free means the tool is free for academic use; $ means there is a cost. OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. (S) indicates that compilation from source is required. “Assembly finishing package” enables interactive sequence editing and/or integration with tools for automated assembly improvement. *Our recommendation
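The distinction between ‘consistent’ and ‘inconsistent’ mate pairs that viewers such as Consed draw above and below the contig boxes can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not how any tool in Table 1 is implemented: the MatePair fields, the classify function, the inward-facing library orientation and the numeric values for expected insert size and tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    """One paired-end placement (illustrative fields, not a Consed data structure)."""
    name: str
    left_start: int      # leftmost aligned base of the upstream read
    right_end: int       # rightmost aligned base of the downstream read
    left_forward: bool   # True if the upstream read aligns to the forward strand
    right_forward: bool  # True if the downstream read aligns to the forward strand


def classify(pair: MatePair, expected_insert: int, tolerance: int) -> str:
    """Label a mate pair 'consistent' or 'inconsistent'.

    Assumes a library in which mates point toward each other
    (forward read upstream, reverse read downstream).
    """
    insert_size = pair.right_end - pair.left_start + 1
    orientation_ok = pair.left_forward and not pair.right_forward
    size_ok = abs(insert_size - expected_insert) <= tolerance
    return "consistent" if orientation_ok and size_ok else "inconsistent"


pairs = [
    MatePair("p1", 100, 3100, True, False),  # expected size, facing inward
    MatePair("p2", 500, 9500, True, False),  # insert far larger than expected
    MatePair("p3", 800, 3750, True, True),   # both mates on the forward strand
]
labels = {p.name: classify(p, expected_insert=3000, tolerance=500) for p in pairs}
print(labels)
```

A viewer could then draw the pairs labeled ‘consistent’ on one side of the contig and the rest on the other, or hide the consistent ones entirely when they overwhelm the image.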

sequence similarity. A user can interactively explore the sequence relationships between different contigs and view the results of search operations such as ‘find repeats’. Consed’s assembly view can display the output of a sequence comparison utility called ‘cross_match’, using arcs to connect regions with sequence similarity between user-selected contigs. Different colors distinguish features such as directed repeats from inverted repeats. One advantage of viewing sequence similarity in ‘assembly view’ is that it can be integrated with a read coverage plot (Fig. 1a), which can reveal regions of unexpectedly high coverage often indicative of similar sequences that were erroneously collapsed by the assembler into one. The user can click to examine the sequence similarity at the base level, and click again to examine the underlying reads. There are also standalone tools with related functionality; for example, Miropeats15, widely used for early genome sequencing projects, is a UNIX C-shell script that generates static images using arc representations to indicate different types of repeats.

Next-generation sequence viewers. As sequencing throughput increases and costs decrease, individual genome sequencing has become feasible and has led to initiatives such as the 1,000 Genomes project (http://www.1000genomes.org/). These data provide an unprecedented opportunity to characterize the landscape of human genotypes, and a new generation of computational methods has emerged as a result16. In some cases, visual inspection can facilitate the evaluation and interpretation of read alignment techniques and variation detection outputs.

Assembly visualization tools possess most of the necessary functionality, but they were built with Sanger data in mind and initially strained under the substantially higher read volume of NGS technologies. Several of these tools are being retrofitted to tackle larger data sets, including Consed and the updated Gap5, but a new wave of tools is also being designed with this purpose in mind: for example, EagleView17, MapView18 and IGV (Table 1). Unlike finishing software, these tools are primarily data viewers and do not provide direct editing functionality. Because of their emphasis on browsing, many provide more flexible zooming capabilities and enable a user to freely zoom out to higher-level views. The commercially available CLC Genomics Workbench (CLC bio) is particularly user friendly and includes its own read alignment programs, which can be launched through a GUI.

In the resequencing context, mate pairs provide valuable information about structural variation, such as insertions, deletions and inversions. As discussed in the previous section, mate pairs can also indicate misassemblies, and users performing variation detection on draft assemblies should be aware of these issues. LookSeq19 and Gap5 use the vertical-axis position to indicate insert size. This places inconsistent mate pairs at the extremes of the plot and visually separates large insert sizes, which are consistent with deletions, from small insert sizes, which suggest insertion events. When analyzing structural variations, it is important to consider gene annotations (for example, whether a single nucleotide variation leads to a synonymous or nonsynonymous amino acid change). For this reason, several of these visualization


tools and some finishing software support the display of annotations. Consed, for instance, optionally displays the amino acid translation of the consensus in all six reading frames and allows the user to annotate genotypes, repeats and user-defined genes.

Figure 1 | Screenshots of connected views in Consed. (a) Contigs from a human BAC clone assembly are shown in ‘assembly view’ as gray boxes with a scale of nucleotide positions within the contig. Angled colored lines represent mate pairs (aqua, consistent; red, inconsistent; purple, multiple at same location). Curved lines indicate sequence similarity computed using cross_match (orange, directed; black, inverted orientation). The read coverage is plotted along the contigs in dark green and mate-pair coverage highlighted in light green. (b) The ‘aligned reads’ window displays a vertical stack of read sequences, optionally separated by strand, with forward on top (right arrows) and reverse on bottom (left arrows). The * character in the computed consensus indicates that one or more of the reads contains an insertion at this position that the assembly program deems incorrect. (c) By inspecting the read traces in the ‘trace’ window, the user can evaluate the insertion and override the assembly program’s choice of consensus if needed.

Challenges. NGS technologies and the high volume of data they produce give rise to both computational and representational challenges. New file formats, for example the Sequence Alignment/Map (SAM) format8, adopted for the 1,000 Genomes Project, and the Compact Alignment Format, CALF (http://www.phrap.org/phredphrap/calf.pdf), provide compact storage of read alignment data. Preindexing, for example of BAM files (the companion binary representation of SAM), is being increasingly used to achieve fast random retrieval of alignment data and reduce the memory requirements of interactive alignment viewers.

In addition to these computational hurdles, NGS data pose representational challenges. For example, most read alignment viewers render all available aligned reads using sorting or color-coding by quality to guide the user. However, this representation breaks down when hundreds or thousands of reads map to a single location. Users require summary methods that consider base and alignment qualities in order to obtain a high-level overview, together with interactive access to the underlying data on demand. In addition, recent NGS assembly programs based on de Bruijn graphs produce contig connectivity information that can become complex (for reviews, see refs. 1,2). Assembly graph images20–24 including an interactive viewer25 are emerging to enable higher-level assembly structure visualization.

Part of the power of assembly finishing software comes from integrating on-the-fly analysis operations with the visualization. Sequence similarity searches resulting in dynamic alignment visualizations are one example. In addition, user efficiency can be greatly improved by providing recommendations for where to look. For example, a user can jump to the next ‘low consensus quality’ region using Consed’s navigation menu instead of manually evaluating all positions. Achieving this type of integration between visual and computational analyses will be important in tackling our growing data analysis needs.

Browsing genomes

The end product of genome sequencing, assembly and finishing cycles is a highly contiguous sequence in which most contigs have lengths that are orders of magnitude longer than an individual read. How can a researcher navigate this sequence to find regions of interest? The sequence provides a reference coordinate system and a natural platform on which to assemble scientific annotations and genome-mapped data sets from diverse sources. Genome browsers were originally developed to display data on early draft assemblies, such as the Caenorhabditis elegans genome26 (for example, AceDB27), and, later, those of other model organisms (for example, GBrowse28), and the human genome assembly29 (for example, the University of California Santa Cruz (UCSC) Genome Browser30, the Ensembl Genome Browser31,32 and the NCBI MapViewer33). These browsers share much functionality and their main differences have been reviewed elsewhere34,35. Today, browsers have become a standard tool for exploring genomes, facilitating analysis of genome-anchored data, and providing a common platform for investigators to share, store and publish scientific discoveries (Table 2).

Genome browsers in a nutshell. In general, genome browsers display data and biological annotations from many sources in their genomic context within a graphical interface. These tools support data types including gene expression, genotype variation, cross-species comparisons and many more. Annotations of functionally important regions such as the locations of genes, regions with transcriptional activity or regulatory elements, derive from either experimental results (for example, sequenced transcripts) or from simulation studies (for example, gene model predictions). Both data and annotations are


Table 2 | Genome browsers

Name | Description | URL

Stand-alone browsers
Argo | Supports manual annotation of whole genomes | http://tinyurl.com/argo-combo
CGView82 | Circular genome visualization | http://wishart.biology.ualberta.ca/cgview/
Gaggle83 | Genome browser within an analysis framework; good microarray support | http://gaggle.systemsbiology.net/
Integrative Genomics Viewer (IGV)* | Flexible user interface; can integrate metadata as heat maps | http://www.broadinstitute.org/igv/
Integrated Genome Browser (IGB)84 | GenoViz project genome browser; reusable visualization components | http://genoviz.sourceforge.net/
NCBI Genome Workbench | Displays sequence data in many views; integrated with BLAST | http://tinyurl.com/gbench/

Web-based browsers
AnnoJ | Designed for NGS data; uses AJAX; assemble by html configuration | http://www.annoj.org/
Cancer Molecular Analysis Portal | Integrates clinical data; designed for TCGA project | https://cma.nci.nih.gov/cma-tcga/
Ensembl31,32* | Comprehensive genome browser and database; strong user support | http://www.ensembl.org/
GBrowse28* | GMOD28* component; back end of WormBase, FlyBase; v2.0 uses AJAX | http://gmod.org/wiki/Gbrowse
Genome Projector42 | Offers circular and pathway views; user configures data resources | http://tinyurl.com/gprojector/
JBrowse39 | Component of GMOD28*; AJAX interface; user configures data resources | http://jbrowse.org/
JGI | Supports live annotation; primary portal for JGI genome projects | http://genome.jgi-psf.org/
NCBI Map Viewer33 | Vertically oriented viewer; integrated with NCBI resources and tools | http://tinyurl.com/mapview1/
UCSC Genome Browser30* | Comprehensive genome browser and database; strong user support | http://tinyurl.com/ucscbrowser/
UCSC Cancer Genomics Browser43 | Integrates clinical data; offers a pathway view; portal for TCGA data | http://genome-cancer.ucsc.edu/
UTGB | Toolkit to construct personalized browser; uses AJAX; user configures data resources | http://utgenome.org/
X:map41 | Customized to view Affymetrix exon arrays | http://xmap.picr.man.ac.uk/

All listed tools are free for academic use, and all are available for Microsoft Windows, Macintosh OS X and Linux. Tools running on Linux usually also run on other versions of Unix.
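The idea of preindexing data so that a browser can quickly retrieve only what falls in the displayed window can be sketched with a simple fixed-width binning scheme. This is a hedged illustration only: it is not the indexing scheme used by BAM files or by any browser in Table 2, and BIN_SIZE, build_index, query and the example features are hypothetical names and values chosen for the sketch.

```python
from collections import defaultdict

BIN_SIZE = 10_000  # hypothetical bin width in base pairs


def build_index(features):
    """Map each bin number to the features that overlap it.

    `features` is an iterable of (start, end, name) tuples using
    1-based, inclusive coordinates on a single chromosome.
    """
    index = defaultdict(list)
    for start, end, name in features:
        for b in range(start // BIN_SIZE, end // BIN_SIZE + 1):
            index[b].append((start, end, name))
    return index


def query(index, win_start, win_end):
    """Return names of features overlapping [win_start, win_end], sorted by start."""
    hits = set()
    for b in range(win_start // BIN_SIZE, win_end // BIN_SIZE + 1):
        for start, end, name in index.get(b, ()):
            if start <= win_end and end >= win_start:
                hits.add((start, end, name))
    return [name for _, _, name in sorted(hits)]


track = [
    (1_200, 4_800, "geneA"),
    (9_500, 21_000, "geneB"),   # spans three bins
    (35_000, 36_000, "geneC"),
]
idx = build_index(track)
print(query(idx, 10_000, 20_000))  # window that lies inside geneB
print(query(idx, 4_000, 9_600))    # window touching geneA and geneB
```

Only the bins that intersect the requested window are inspected, so the cost of a query depends on the window size rather than on the total size of the track.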

usually organized into ‘tracks’, which can be preloaded into a genome browser or uploaded on demand.

Investigators frequently wish to examine particular regions of interest, and all current genome browsers allow a user to select specific genomic locations to display. Most tools also provide the ability to search for sequences and for specific genome annotations, such as gene names, that reside in the underlying databases. Many genome browsers also permit complex database queries and provide a suite of tools to access annotation lists for specific regions or for the whole genome. For example, Galaxy36 is a service specifically designed to interface with genome browsers and facilitate data manipulation and analysis.

Part of the value of genome browsers is that they are customizable. For example, a user can decide on the resolution at which information is shown (say, a window of several hundred base pairs versus tens of thousands) and zoom and pan at will. Data tracks can be freely ordered and organized to facilitate comparisons. In most cases, users can also choose among and configure several modes of display to examine the same underlying data. For example, continuously valued data such as that from chromatin immunoprecipitation (ChIP) can be uploaded as ‘wiggle tracks’ and displayed as heat maps or histograms (Fig. 2a). The popularity of the UCSC Genome Browser stems from its flexibility in displaying user-provided data sets and its quick response time. However, the validity of the displayed comparisons requires user evaluation. For example, the user must interpret a colocalization of histone H3 acetylation (H3ac) with Usf1 transcription factor binding as either biologically meaningful or experimental artifact (Fig. 2a).

Next-generation genome browsers. Newer and higher-throughput genomic technologies, including NGS, have enabled researchers to generate unprecedented amounts of data. International consortia (for example, the Encyclopedia of DNA Elements (ENCODE) project37, the Cancer Genome Atlas (TCGA) project38, the 1,000 Genomes Project and Epigenome Roadmap Project (http://nihroadmap.nih.gov/epigenomics/)) each will produce thousands of genome-wide data sets. Even comparatively small groups of researchers are now able to obtain large volumes of genomic data over a short time period. A new generation of genome browsers and associated databases are emerging to efficiently manage and distribute this high volume of data (Table 2).

The traditional web-based genome browsers use a centralized model whereby both data and service are located on the server side. Information flows from data providers to the genome browser server, which renders the requested image and passes it to the end user. When the size of the data set increases to a critical point, the substantial overhead burdens the server and internet connections and ultimately disrupts smooth genome browsing. Decentralizing the data, the service or a combination of the two can relieve such server load. For example, JBrowse39 uses Asynchronous JavaScript and XML (AJAX) technology to distribute work between the server and client, thereby incurring substantially less server overhead while also replacing traditional static image loading with smoothly animated genome navigation and track selection. Anno-J40 (Annotation with JavaScript) provides similar smooth Web 2.0 navigation; however, it achieves its client-side rendering using the ‘canvas’ HTML element, which only some web browsers support. Several other applications use the technology behind Google Maps API to reduce response time on the server’s side and create the effect of panning smoothly when navigating through genomic locations41,42.

Using another approach, UCSC Genome Browser recently improved its popular custom track function by developing BigBed and BigWig formats to handle very large data sets (hundreds of


megabytes to gigabytes of data). Such large data sets are formatted and stored locally on the client computer. Instead of storing the entire data set in the browser’s database, the browser only fetches a slice of data around the requested genomic locus. Besides improving efficiency, locally stored data also have the distinct advantage of simplifying the steps necessary to secure sensitive data, such as those from individual human subjects. The University of Tokyo Genome Browser, UTGB, is specifically designed for browsing locally stored data in a customized manner. There are also several standalone tools, in particular two Java-based packages: the Affymetrix Integrated Genome Browser (IGB, pronounced ig-bee) and the Integrative Genomics Viewer (IGV) developed at the Broad Institute.

In addition to experimental data associated with genomic sequences, other types of data, such as clinical information associated with specimens, are often critical in the interpretation of genomic data. Several recently developed genome browsers are designed to provide a platform to integrate large genomic data sets, especially cancer genomic data. These include the UCSC Cancer Genomics Browser43, the IGV and the Cancer Molecular Analysis Portal developed at the US National Cancer Institute. The main innovation of these new tools is the simultaneous display of genomics data and clinical data. These browsers display a whole-genome-oriented view of genome-wide experimental measurements for individual samples and sets of samples as heat maps. Clinical features are displayed alongside genomic data in a separate heat map. Investigators interact with the browser to order, filter, aggregate and display data according to clinical features, annotated biological pathways or user-edited collections of genes. Statistical analyses can be applied to defined data sets and displayed graphically on the browser.

The UCSC Cancer Genomics Browser uses a heat map view in which the x axis represents genomic coordinates and the y axis is an ordered stack of genome-wide measurements, each row representing data of one sample. This display makes it easy to identify common patterns across a sample set. For example, the user can clearly identify where a region of chromosome 10 around the PTEN locus appears to be deleted recurrently in available brain tumor samples (Fig. 2b). Below the genome heat map is a summary view of the same data, where the characteristic copy number variation profile is apparent. A clinical heat map allows researchers to visually examine the relationship between genomic measurements and selected clinical features available to the user on the basis of their authorized level of data access. Rearrangement of the vertical (clinical sample) order in both the clinical and genomic heat maps can be accomplished by simultaneously sorting on the basis of a numerically encoded clinical feature or combination of features. For example, when glioblastoma data are sorted on ‘tumor versus unaffected’, there is an obvious difference between the genomic content of these two sample types, with the ‘normal’ samples showing almost no large-scale copy number abnormalities and the tumors rife with them (Fig. 2b).

Figure 2 | The UCSC Genome and Cancer Genomics Browsers. (a) The UCSC Genome Browser displays diverse data types across the human reference assembly (for example, gene annotations with exons (boxes), introns (thin lines) and untranslated regions (intermediate-height boxes); ChIP data as heat maps or histograms). (b) The UCSC Cancer Genomics Browser provides an improved overview and links back to the Genome Browser. Agilent 244A comparative genomic hybridization (CGH) data are taken from randomly selected glioblastoma tumor samples made available through the TCGA consortium, together with a small number of unaffected tissues (blue, deletion; red, insertion). Two publicly available clinical parameters are displayed: tumor (olive) versus unaffected (yellow), and male (yellow) versus female (black); gray, data unavailable.

Constraining the data visualization to sequence-based coordinates can be limiting. This is particularly true when visualizing structural variations or long-range interactions between two genomic loci. In addition, global patterns across genomes are often better appreciated in the context of features that do not map to genome coordinates. One recent example is the UCSC Cancer Genomics Browser, in which genomic data are displayed within the context of biological pathways43. By organizing the placement of data into sets of genes according to individual pathways as opposed to chromosomal location, users obtain a more robust and biologically meaningful summary of their genomic data across genes that may act in a concerted manner. Anders and colleagues provide another approach, in which genomic data are organized on a Hilbert curve to provide a global overview44. In the future, there is great potential in exploring new ways to better navigate the genomic data landscape.

Challenges. Several key challenges in genomics data analysis have emerged in recent years, including issues of data volume, data type and data representation. Several new genome browsers, as introduced above, are available that tackle some of these topics; however, a consensus has not yet been reached. In addition, it will be important that new genome browsers build on the successes of earlier tools, including easy cross-platform access, data and display customization, and the ability to perform on-the-fly computation within the visualization (for example, the BLAT search functionality in the UCSC Genome Browser).

Genome browsers are beginning to interface with sensitive information, and the community is increasingly aware of the challenge of data security. The personal information encoded in genomic DNA, a person’s clinical parameters, and other private information require careful protection. Genome browsers should


take advantage of many security systems developed for electronic information to ensure that only authorized investigators can access these data. In addition, these tools can aim to maximize the utility of sensitive data by presenting them in an anonymized form, such as aggregates or summaries, while preventing the extraction of personal information from such aggregates45.

Comparing genomes

The recent availability of a large number of completely sequenced and assembled genomes has stimulated active research in the field of comparative genomics. This includes the development of algorithms and tools for pairwise and multiple alignment of very long genomic intervals and complete genomes. Among the goals of this work are (i) the identification of functional elements, such as exons or enhancers (reviewed in refs. 46,47), (ii) the study of large-scale rearrangements and evolution of individual genomes48 and (iii) the alignment of unfinished and reference genomes in the course of assembly and finishing49. Visualization of alignment data is critically important for each of these goals but is challenging because of the difficulty of graphically identifying relationships of interest across multiple chromosomes in multiple genomes and over multiple scales. In this section, we review the variety of techniques that have been developed to help investigators navigate sequence conservation between two or more genomes.

Calculation of whole-genome alignment and synteny. A variety of methods exist for pairwise and multiple whole-genome alignment (for example, BLASTZ50, MULTIZ51, Shuffle-LAGAN52, Mercator and MAVID53, Mauve54 and symmetric multiple alignment55). All these techniques are unified by the common principle of finding the most similar genomic intervals (‘anchors’), extending these regions, chaining alignments to make them contiguous, and analyzing rearrangements. After alignment, the next step is to find conserved signals that may indicate potential functional regions. Methods for calculating short conservation signals in alignments range from a simple window-based approach in PipMaker and VISTA50,56 to the phylogenetic hidden Markov model Phastcons57,58 and another statistical model, Gumby59. Calculation of conserved synteny, defined as the conservation of chromosomal location of multiple genes60, is based either on the analysis of DNA alignment or bidirectional comparison of genes on orthologous intervals in two genomes. The evolutionary significance of synteny derives from the assumption that the precise order of genes on a chromosome passes down from a common ancestor60.

Visualization of alignments has been approached at several levels of resolution, supporting different analytical tasks. Graphical representations of synteny at the level of the whole genome are critical for the exploration of genome evolution. Also critical is the ability to ‘drill down’ from a global representation of conserved synteny to explore a specific region of conservation between two genomes in the context of annotations. In addition, genome assembly and genome model annotation may be served by comparing the neighborhood of an unknown gene to that of its ortholog in a different organism that has a finished or well annotated genome sequence. Below we describe visualization methods used to depict synteny at both the micro and at the macro level (Table 3).

Visualization of whole-genome alignments. A wide variety of strategies have been explored for graphically depicting synteny at the level of whole genomes. Two-dimensional ‘dot plots’, historically used in the analysis of local alignment, have seen a modern resurgence as a powerful way to visualize increasingly available whole-genome alignments (DAGChainer61, VISTA-Dot, MUMmer62, GenomeMatcher63 and MEDEA). The genomes of two organisms are represented along the x and y axes of the plot, with grid lines indicating chromosome boundaries. Points in the plot indicate some measure of alignment, forming 45° lines in conserved regions. Genome rearrangement and duplication are immediately identifiable as, respectively, off-diagonal lines and identical lines stacked horizontally or vertically. DAGChainer61, the first publicly available tool for generating dot plots, calculates synteny on the basis of a meta-alignment of genes paired by BLAST matches between two organisms. VISTA-Dot offers a dot-plot viewing mode for the browsing of synteny based on whole-genome DNA alignments (Supplementary Fig. 1). This tool has an interactive Google Maps–like interface, allowing users to zoom and pan within the plot, as well as to link out from aligned segments to view them in VISTA or in the JGI Genome Browser. Dot plots are useful not just in analyzing synteny between finished genomes but also in genome assembly and finishing. For instance, the OSLay tool49 automates the increasingly common technique of using a dot plot to align a collection of contigs from an unfinished assembly against a reference assembly and thereby map the target genome.

Global conservation may also be visualized by representing a reference genome using pill-shaped ideograms of chromosomes, banded to indicate regions of alignment with a compared genome. Bands are color-coded to indicate the chromosome of the aligned region on the compared genome. The ideogram representation of genome alignment is a popular choice for custom-generated figures in the publication of newly sequenced genomes64–66. Three options are publicly available for automatically generating variations of this visualization given user-supplied genomic data: Cinteny67, Apollo68 and MEDEA. The Sybil ‘gradient view’ uses an innovative visualization in which genes are displayed along a color gradient in the reference genome, with these colors then used to mark the locations of homologs in a set of aligned genomes. The VISTA Synteny Viewer (VSV) (Supplementary Fig. 2) uses an ideogram-based depiction of pairwise genome alignments as a navigational tool to select chromosomes in a reference organism for closer inspection.

In comparison to a dot plot, the ideogram representation of synteny loses information about the physical location of aligned regions on the compared genome. However, the use of color in these ideograms makes it very easy to visualize the way in which the compared genome has been redistributed across the reference genome. Furthermore, colored segments in the reference genome can be linked to specific compared loci by drawing lines to smaller glyphs of compared-organism chromosomes. This approach is taken in Apollo, as well as by the PhIGs website69, which allows users to generate synteny maps from among 45 sequenced fungi and metazoans.

An alternative and aesthetically pleasing approach to depicting genomic synteny has been introduced by Circos70. The Circos tool represents two or more genomes as arcs in a single circle. Tracks of a variety of types can be aligned as inner circles along the genomes. Lines cross the middle of the circle connecting aligned regions.
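The dot-plot representation discussed above can be illustrated with a toy example. The sketch below is a simplification under stated assumptions: it marks a point wherever two sequences share an exact k-mer, which stands in for the ‘measure of alignment’ that real tools derive from gene-level or whole-genome alignments, and the function names, the sequences and the value of k are all hypothetical.

```python
from collections import defaultdict


def dot_plot_points(seq_x: str, seq_y: str, k: int = 4):
    """Return sorted (i, j) pairs where seq_x[i:i+k] == seq_y[j:j+k] (0-based)."""
    positions = defaultdict(list)
    for i in range(len(seq_x) - k + 1):
        positions[seq_x[i:i + k]].append(i)
    points = []
    for j in range(len(seq_y) - k + 1):
        for i in positions.get(seq_y[j:j + k], ()):
            points.append((i, j))
    return sorted(points)


def render(points, width, height):
    """Draw the points as a text grid, x positions across and y positions down."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for i, j in points:
        grid[j][i] = "*"
    return "\n".join("".join(row) for row in grid)


# Toy sequences: y holds the same two blocks as x, in swapped order.
x = "ACGTTGCAAGGCT"
y = "TTGCAAGACGTTG"
pts = dot_plot_points(x, y, k=4)
print(render(pts, len(x) - 3, len(y) - 3))
```

In this toy case the shared k-mers fall on two separate diagonal runs rather than one continuous diagonal, which is the kind of off-diagonal pattern the text associates with rearrangement.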

nature methods supplement | VOL.7 NO.3s | MARCH 2010 | S7 review

Table 3 | Tools for comparative genomics visualization

| Name | Description | Data | URL |
|---|---|---|---|
| Web-based tools | | | |
| Cinteny67 | Three-scale view of synteny calculated from user-specified markers | H | http://cinteny.cchmc.org/ |
| CoGe SynMap85 | Dot plots from DAGChainer61 alignments; histograms of synonymous substitutions | H | http://tinyurl.com/synmap/ |
| GenomeMatcher63 | A rich, mostly dot plot–based viewer displaying alignments and annotation | F,E,G | http://tinyurl.com/genomematcher/ |
| MEDEA* | A Flash-based suite of linked-track, dot-plot and global-synteny viewing tools | C | http://tinyurl.com/broadmedea/ |
| MultiPipMaker86 | Vertically arranged display of user-supplied multiple alignments | F | http://pipmaker.bx.psu.edu/pipmaker/ |
| PhIGs69 | Ideogram-style interactive display of orthologs across >75 genomes | H | |
| UCSC Genome Browser72* | Conservation tracks within popular UCSC genome browser | H,F,G | http://genome.ucsc.edu/cgi-bin/hgGateway/ |
| VISTA87* | Conservation tracks connected to a variety of analysis tools | H | http://genome.lbl.gov/vista/index.shtml |
| VSV, VISTA-Dot* | Three-scale viewer for synteny and dynamic, interactive dot plots for whole-genome DNA alignments | H | http://genome.jgi-psf.org/synteny/ |
| Stand-alone tools | | | |
| ACT76 | Linked-track views; annotation track search; stacking of multiple genomes | E,G,F,D | http://www.sanger.ac.uk/Software/ACT/ |
| Circos70 | Circle-graph presentation of synteny; animations for increased dimensionality | C | http://mkweb.bcgsc.ca/circos |
| CMap88 | Stacked vertical depictions of arbitrary relations among DNA segments | D,S | http://gmod.org/wiki/CMap |
| Combo77 | Dot-plot and linked-track views; integration of annotation in both views | G,F,C | http://tinyurl.com/argo-combo |
| GBrowse_syn | GMOD28* component; highly customizable linked-track view of synteny | D,S | http://gmod.org/wiki/GBrowse_syn |
| MizBee71 | Synteny visualized using circular and linked-track views at multiple scales | C | http://mizbee.org/ |
| Sybil78 | Local and global views of synteny based on BlastP and protein clustering | D | http://sybil.sourceforge.net/ |
| SynBrowse75 | GMOD28 component; local synteny based on gene order, orthology or structure | D | http://www.synbrowse.org/ |
| SynView79 | GMOD28 component; synteny at different scales with multiple feature tracks | D | http://gmod.org/wiki/SynView |

All tools listed are free and are either web-based or available for all three operating systems. The Data column describes the formats accepted for display within each tool: H, only alignment data hosted at the tool’s website; F, FASTA format; E, EMBL/GenBank/DDBJ format; G, GFF format; C, a custom text-based format; D, designed for use with a user-hosted database; S, requires hosting from a user-supplied web server. *Our recommendations.

This circular arrangement reduces the visual confusion that would result from the equivalent linear representation, in which a spiderweb of lines connects distant regions in stacked genomes. The tool also supports animation of the alignment such that connections between individual genomes or chromosomes can be viewed in sequence, further reducing visual complexity. A circular genome viewer is also available in MEDEA and MizBee71.

The dot-plot, ideogram and circular representations discussed above represent strategies for visually presenting conservation at the whole-genome scale. Tools implementing these representations can be used to identify regions of synteny, duplication and translocation between genomes. Upon identifying such regions, investigators need the means to view them at a higher level of resolution and in visual association with annotation data.

Visualization of local conservation. The most straightforward way to visually associate conservation with annotation data is to represent alignments of compared genomes as ‘tracks’ within a genome browser. This strategy is best exemplified in the ‘conservation tracks’ in the UCSC Genome Browser72 and the VISTA Browser73 (Fig. 3). In both cases, pairwise or multiple alignment is represented as a two-dimensional plot in which the x axis shows position along the reference genome and conservation scoring in genome-wide multiple alignments is plotted along the y axis. In addition, the UCSC browser has tracks of ‘chained alignments’, shown as different shades of gray74. In the case of VISTA tracks, features such as conserved exons, UTRs and noncoding regions are indicated by color in the areas under the curves. VISTA tracks may also be exported for viewing within their respective reference organisms on other genome browsers such as the JGI Genome Browser and the UCSC Genome Browser.

Alignment tracks provide a valuable means of quickly identifying conservation when browsing within an individual genome. However, this representation of conservation does not allow the investigator to view features within both the reference and compared regions of the alignment simultaneously. For this reason, many tools have been developed with the capability to visualize local synteny67,75,76,77 (Table 3). Generally these tools use a common strategy of stacking track-like representations of a reference and one or more compared genomic regions and drawing lines between them to indicate synteny (a ‘linked-tracks’ representation). Feature tracks, indicating annotations such as gene models or expressed sequence tags (ESTs), may be overlaid above or below the aligned regions, in a manner analogous to that used by genome browsers. This presentation allows the user to visually browse an alignment while maintaining the context of the genomic annotations that describe the content of all regions under investigation. Links connecting conserved regions may be drawn on the basis of genomic alignment, gene orthology, protein cluster membership78 or even gene model structure75,79.

GMOD, the Generic Model Organism Database project (http://gmod.org/), including the popular GBrowse genome browser28, is perhaps the most widely used framework for developing software tools to support genome analysis and curation. Three synteny browsing tools have been developed within the GMOD framework: SynBrowse75, SynView79 and GBrowse_syn. SynBrowse, an extension of the GBrowse family of tools75, allows users to switch among three modes for displaying links between conserved regions. In ‘synteny block’ mode, regions are connected according to a user-specified definition of synteny (a certain number of collinear genes within a certain minimum distance). In ‘coding gene’ and ‘coding exon’ modes, protein alignments are displayed as lines grouping aligned genes and exons, respectively, across the reference and compared segments. Alignment quality is further indicated by the color of each line.
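To make the idea of a user-specified synteny definition concrete, the following is a minimal sketch of one way collinear ortholog pairs might be grouped into blocks. It is not the SynBrowse algorithm. The function name, the `min_genes` and `max_gap` parameters, and the use of gene-order indices as the measure of distance are all assumptions made for illustration.

```python
def synteny_blocks(pairs, min_genes=3, max_gap=2):
    """Group ortholog pairs into collinear blocks (illustrative only).

    pairs: iterable of (ref_idx, cmp_idx) gene-order indices on one
           reference and one compared chromosome.
    min_genes: hypothetical minimum number of genes per block.
    max_gap: hypothetical maximum index step between neighboring pairs.
    """
    blocks, current, direction = [], [], 0
    for ref, cmp_ in sorted(pairs):
        if current:
            d_ref = ref - current[-1][0]
            d_cmp = cmp_ - current[-1][1]
            step = 1 if d_cmp > 0 else -1 if d_cmp < 0 else 0
            # Extend the block only if both genomes advance by a small
            # step and the orientation (forward or inverted) is unchanged.
            if (0 < d_ref <= max_gap and 0 < abs(d_cmp) <= max_gap
                    and direction in (0, step)):
                current.append((ref, cmp_))
                direction = step
                continue
            if len(current) >= min_genes:
                blocks.append(current)
        # Start a new candidate block at this pair.
        current, direction = [(ref, cmp_)], 0
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks
```

Each returned block is a list of index pairs that a linked-track viewer could draw as a single band between the reference and compared segments; inverted blocks are those in which the compared index decreases.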


Figure 3 | The VISTA browser. This plot corresponds to a 14-kb interval on the Sorghum bicolor v.1.0 assembly (chr. 3, 66815542–66829466). Conserved regions are colored according to the gene annotation displayed above the conservation plot (blue, conservation in exons; light blue, in untranslated regions; pink, in conserved noncoding sequences). Several alignments can be viewed at the same time to assist in analysis. The following VISTA conservation tracks are displayed: (1) duplicated region on S. bicolor (chr. 9, 52532014–52544345); (2) Oryza sativa in the multiple three-way alignment of S. bicolor, O. sativa and Arabidopsis thaliana; (3) A. thaliana in the same three-way alignment; (4) Rank-VISTA plot of the three-way alignment; (5) maize BAC AC198485.2_6; (6) orthologous region in the soybean genome.

Challenges. A variety of representations have been used to visualize synteny at scales ranging from whole-genome alignment to the conservation of intron/exon structure in regions of preserved gene order. A major challenge in the further development of these tools is to provide a means for the investigator to navigate seamlessly across these levels of resolution. Fortunately, the increasing sophistication of web application technology enables ever-greater interactivity and the ability to connect visual elements to informational resources on the internet. The VSV takes advantage of these technologies by presenting a novel interface to unify scales in the display of synteny (Supplementary Fig. 2). The VSV depicts synteny in three cross-navigable panels representing different scales of the alignment. Both the Combo tool77 and GenomeMatcher63 bridge levels of resolution in the visualization of synteny by connecting an interactive dot plot with a ‘linked-track’ view of local conservation. MizBee71, released very recently, provides interactive side-by-side views of the data across the range of scales supporting exploration of all of these relationship types.

Most of the tools described above follow a model of aligning one or more ‘compared’ genomes against a single ‘reference’ genome. This model, although seemingly necessitated by visual tractability, is limiting in that the relationships among compared organisms cannot be explored. One approach to address this limitation, taken in both the Artemis Comparison Tool76 and the CMAP application, is to allow the user to stack genomes so that an arbitrary set of pairwise comparisons can be visualized (although for a given genome it is still possible to compare it to at most two others).

Another drawback of the ‘reference genome’ model for displaying synteny is that the x axis for the entire alignment is usually defined by position along the reference genome, potentially obscuring interesting features in the compared sequences. Two tools, Phylo-VISTA80 and SynPlot81, implement visualizations of conservation in which position is depicted relative to the length of the overall alignment.

Still another challenge in the visualization of synteny is the graphical representation of insertions and deletions (‘indels’), which are critical to tracking genome evolution at the chromosome, gene family and gene structure scales. Although many alignment algorithms are capable of identifying indels, most synteny viewers offer no means of indicating them visually, displaying only correspondence between conserved regions. To our knowledge, only the GBrowse_syn viewer allows for the visualization of indels. When ‘grid-lines’ are enabled in GBrowse_syn, an indel is represented by grid lines connecting an insertion region on one genome to the single point of deletion on another.

Many successful visualization tools are carefully tailored to the specialized analysis demands of their users, and it is unlikely that a universal tool for genomics analysis is feasible or desirable. There is, however, a great need to improve the integration among tools and ease the transition from one analysis to another. Rapid advances in sequencing technologies continue to strain existing software and challenge developers to anticipate future requirements. The paradigms of more mature tools, both in terms of computational approaches and visual representations, struggle to scale to today’s data demands. More recent tools address some of the core issues, but they often sacrifice richness of functionality to satisfy the urgent needs for speed and for ease of distribution. It is likely that widespread integration between tools will only be realized once we acquire greater stability in the data generation technologies and file format standards.

We have highlighted several widely used tools to guide a researcher wishing to tackle genome analyses today. However, given the pace at which this relatively young field is evolving, it is very likely that new software tools will emerge and revised file formats will be proposed in the near future. As a consequence of this dynamic nature, the potential for innovation in this domain is great.

To meet future analysis demands, visualization tools first need to successfully integrate diverse data forms, such as clinical information together with genomic data. Second, these tools require visual representations that scale smoothly to comparisons of thousands or millions of elements. For example, the track-based displays used by current genome browsers will not readily support the output of the 1,000 Genomes Project. Third, advances in this domain will require the seamless navigation across relevant levels of resolution, taking advantage of aggregation methods to reveal global trends and interactive interfaces to provide user access to details at lower levels on demand. And fourth, improved integration between automated computation and visualization will be necessary to allow users to interactively refine and iterate their analyses. This type of integration will also enable a broader biology community to perform genome-wide analyses, rather than these studies being limited to computational specialists.
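The aggregation idea mentioned in this discussion (summarizing data to reveal global trends while keeping detail available on demand) can be sketched as follows. This is only an illustration under simple assumptions: the track is a plain list of per-position values, and the function name and the choice of minimum, mean and maximum as summary statistics are hypothetical rather than taken from any tool reviewed here.

```python
def aggregate_track(values, n_bins):
    """Summarize a per-position signal into at most n_bins bins.

    Returns a list of (min, mean, max) tuples, one per bin, so that a
    zoomed-out display can show the trend and the range within each bin.
    """
    n = len(values)
    if n == 0 or n_bins <= 0:
        return []
    n_bins = min(n_bins, n)  # never create empty bins
    summary = []
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        window = values[lo:hi]
        summary.append((min(window), sum(window) / len(window), max(window)))
    return summary
```

A viewer could call such a function with a bin count equal to the display width in pixels, then recompute it over a narrower slice of the data when the user zooms in.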


Note: Supplementary information is available on the Nature Methods website.

Acknowledgments
Thanks to Y. Butterfield, S. Diguistini, P. Gorniak, M. Krzywinski, N. Liao, G. Robertson and G. Taylor for helpful discussions and comments. C.B.N. was funded with US federal funds from the National Cancer Institute, National Institutes of Health (NIH), under contract no. NO1-CO-12400. The contributions of M.C. and I.D. were performed under the auspices of the US Department of Energy’s Office of Science, Biological and Environmental Research Program, and by the University of California, Lawrence Livermore National Laboratory under contract no. DE-AC52-07NA27344, Lawrence Berkeley National Laboratory under contract no. DE-AC02-05CH11231, and Los Alamos National Laboratory under contract no. DE-AC02-06NA25396. D.G. was supported by NIH grants R01 HL094976 and 1RC2HL10296-01 and by the Howard Hughes Medical Institute. T.W. was supported by funds from the Helen Hay Whitney Foundation.

COMPETING INTERESTS STATEMENT
The authors declare no competing financial interests.

Published online at http://www.nature.com/naturemethods/.
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/.

1. Pop, M. Genome assembly reborn: recent computational challenges. Brief. Bioinform. 10, 354–366 (2009).
2. Flicek, P. & Birney, E. Sense from sequence reads: methods for alignment and assembly. Nat. Methods 6 (suppl.), S6–S12 (2009).
3. Gordon, D., Abajian, C. & Green, P. Consed: a graphical tool for sequence finishing. Genome Res. 8, 195–202 (1998).
    A widely used finishing tool that was the first to use error probabilities as an objective criterion to guide the finishing process.
4. Ewing, B., Hillier, L., Wendl, M.C. & Green, P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).
5. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194 (1998).
6. Schatz, M.C., Phillippy, A.M., Shneiderman, B. & Salzberg, S.L. Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biol. 8, R34 (2007).
7. Salzberg, S.L., Church, D., DiCuccio, M., Yaschenko, E. & Ostell, J. The Genome Assembly Archive: a new public resource. PLoS Biol. 2, E285 (2004).
8. Li, H. et al. The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
9. Mardis, E.R. Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 9, 387–402 (2008).
10. Turner, D.J., Keane, T.M., Sudbery, I. & Adams, D.J. Next-generation sequencing of vertebrate experimental organisms. Mamm. Genome 20, 327–338 (2009).
11. Gordon, D., Desmarais, C. & Green, P. Automated finishing with autofinish. Genome Res. 11, 614–625 (2001).
12. Dear, S. & Staden, R. A sequence assembly and editing program for efficient management of large projects. Nucleic Acids Res. 19, 3907–3911 (1991).
13. Bonfield, J.K., Smith, K.F. & Staden, R. A new DNA sequence assembly program. Nucleic Acids Res. 23, 4992–4999 (1995).
    One of the first and a widely used finishing tool with an interactive graphical user interface and sequence editing capabilities. An updated version (Gap5) is designed to handle NGS data.
14. Burland, T.G. DNASTAR’s Lasergene sequence analysis software. Methods Mol. Biol. 132, 71–91 (2000).
15. Parsons, J.D. Miropeats: graphical DNA sequence comparisons. Comput. Appl. Biosci. 11, 615–619 (1995).
16. Medvedev, P., Stanciu, M. & Brudno, M. Computational methods for discovering structural variation with next-generation sequencing. Nat. Methods 6 (suppl.), S13–S20 (2009).
17. Huang, W. & Marth, G. EagleView: a genome assembly viewer for next-generation sequencing technologies. Genome Res. 18, 1538–1543 (2008).
18. Bao, H. et al. MapView: visualization of short reads alignment on a desktop computer. Bioinformatics 25, 1554–1555 (2009).
19. Manske, H. & Kwiatkowski, D. LookSeq: a browser-based viewer for deep sequencing data. Genome Res. 19, 2125–2132 (2009).
20. Kim, P.-G., Cho, H.-G. & Park, K. A scaffold analysis tool using mate-pair information in genome sequencing. J. Biomed. Biotechnol. 2008, 675741 (2008).
21. Zerbino, D.R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
22. Chaisson, M.J. & Pevzner, P.A. Short read fragment assembly of bacterial genomes. Genome Res. 18, 324–330 (2008).
23. Hernandez, D., François, P., Farinelli, L., Osterås, M. & Schrenzel, J. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 18, 802–809 (2008).
24. MacCallum, I. et al. ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads. Genome Biol. 10, R103 (2009).
25. Nielsen, C.B., Jackman, S.D., Birol, I. & Jones, S.J. ABySS-Explorer: visualizing genome sequence assemblies. IEEE Trans. Vis. Comput. Graph. 15, 881–888 (2009).
26. C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012–2018 (1998).
27. Eeckman, F.H. & Durbin, R. ACeDB and macace. Methods Cell Biol. 48, 583–605 (1995).
28. Stein, L.D. et al. The generic genome browser: a building block for a model organism system database. Genome Res. 12, 1599–1610 (2002).
    The Generic Model Organism Database project is the most widely used framework for developing software tools to support genome analysis and curation. Three synteny-specific tools have been developed within the GMOD framework: SynBrowse, SynView and GBrowseSyn.
29. Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
30. Kent, W.J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
    Widely used genome browser with user-friendly web interface and capability to display third party data.
31. Birney, E., Bateman, A., Clamp, M.E. & Hubbard, T.J. Mining the draft human genome. Nature 409, 827–828 (2001).
32. Stalker, J. et al. The Ensembl web site: mechanics of a genome browser. Genome Res. 14, 951–955 (2004).
33. Wheeler, D.L. et al. Database resources of the National Center for Biotechnology. Nucleic Acids Res. 31, 28–33 (2003).
34. Cline, M.S. & Kent, W.J. Understanding genome browsing. Nat. Biotechnol. 27, 153–155 (2009).
35. Furey, T.S. Comparison of human (and other) genome browsers. Hum. Genomics 2, 266–270 (2006).
36. Giardine, B. et al. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 15, 1451–1455 (2005).
37. ENCODE Project Consortium. et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816 (2007).
38. Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061–1068 (2008).
39. Skinner, M.E., Uzilov, A.V., Stein, L.D., Mungall, C.J. & Holmes, I.H. JBrowse: a next-generation genome browser. Genome Res. 19, 1630–1638 (2009).
40. Lister, R. et al. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133, 523–536 (2008).
41. Yates, T., Okoniewski, M.J. & Miller, C.J. X:Map: annotation and visualization of genome structure for Affymetrix exon array analysis. Nucleic Acids Res. 36 Database issue, D780–D786 (2008).
42. Arakawa, K. et al. Genome Projector: zoomable genome map with multiple views. BMC Bioinformatics 10, 31 (2009).
43. Zhu, J. et al. The UCSC Cancer Genomics Browser. Nat. Methods 6, 239–240 (2009).
44. Anders, S. Visualization of genomic data with the Hilbert curve. Bioinformatics 25, 1231–1235 (2009).
45. Homer, N. et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 4, e1000167 (2008).
46. Ureta-Vidal, A., Ettwiller, L. & Birney, E. Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nat. Rev. Genet. 4, 251–262 (2003).
47. Freeling, M. & Subramaniam, S. Conserved noncoding sequences (CNSs) in higher plants. Curr. Opin. Plant Biol. 12, 126–132 (2009).
48. Drosophila 12 Genomes Consortium. et al. Evolution of genes and genomes on the Drosophila phylogeny. Nature 450, 203–218 (2007).
49. Richter, D.C., Schuster, S.C. & Huson, D.H. OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics 23, 1573–1579 (2007).
50. Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Res. 13, 103–107 (2003).
51. Blanchette, M. et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14, 708–715 (2004).
52. Brudno, M. et al. Glocal alignment: finding rearrangements during alignment. Bioinformatics 19 (suppl. 1), i54–i62 (2003).


53. Dewey, C.N. Aligning multiple whole genomes with Mercator and MAVID. Methods Mol. Biol. 395, 221–236 (2007).
54. Darling, A.C.E., Mau, B., Blattner, F.R. & Perna, N.T. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 14, 1394–1403 (2004).
55. Dubchak, I., Poliakov, A., Kislyuk, A. & Brudno, M. Multiple whole-genome alignments without a reference organism. Genome Res. 19, 682–689 (2009).
56. Frazer, K.A., Pachter, L., Poliakov, A., Rubin, E.M. & Dubchak, I. VISTA: computational tools for comparative genomics. Nucleic Acids Res. 32 (Web Server issue), W273–W279 (2004).
    A comprehensive suite of programs and databases for comparative analysis of genomic sequences. Whole-genome alignments of many species from different taxa (vertebrates to prokaryotes) and tools for custom analysis of user-submitted sequences are provided.
57. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
58. Karolchik, D. et al. Comparative genomic analysis using the UCSC genome browser. Methods Mol. Biol. 395, 17–34 (2007).
59. Prabhakar, S. et al. Close sequence comparisons are sufficient to identify human cis-regulatory elements. Genome Res. 16, 855–863 (2006).
60. Gregory, S.G. et al. A physical map of the mouse genome. Nature 418, 743–750 (2002).
61. Haas, B.J., Delcher, A.L., Wortman, J.R. & Salzberg, S.L. DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics 20, 3643–3646 (2004).
62. Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).
63. Ohtsubo, Y., Ikeda-Ohtsubo, W., Nagata, Y. & Tsuda, M. GenomeMatcher: a graphical user interface for DNA sequence comparison. BMC Bioinformatics 9, 376 (2008).
64. Mouse Genome Sequencing Consortium. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
65. Galagan, J.E. et al. Sequencing of Aspergillus nidulans and comparative analysis with A. fumigatus and A. oryzae. Nature 438, 1105–1115 (2005).
66. Putnam, N.H. et al. Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317, 86–94 (2007).
67. Sinha, A.U. & Meller, J. Cinteny: flexible analysis and visualization of synteny and genome rearrangements in multiple organisms. BMC Bioinformatics 8, 82 (2007).
    A flexible web-based tool allowing investigators to view synteny at the level of whole genomes, individual pairs of chromosomes, or regions around markers of interest, which can be uploaded by the user.
68. Lewis, S.E. et al. Apollo: a sequence annotation editor. Genome Biol. 3, RESEARCH0082 (2002).
69. Dehal, P.S. & Boore, J.L. A phylogenomic gene cluster resource: the Phylogenetically Inferred Groups (PhIGs) database. BMC Bioinformatics 7, 201 (2006).
70. Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009).
71. Meyer, M., Munzner, T. & Pfister, H. MizBee: a multiscale synteny browser. IEEE Trans. Vis. Comput. Graph. 15, 897–904 (2009).
72. Miller, W. et al. 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res. 17, 1797–1808 (2007).
73. Dubchak, I. Comparative analysis and visualization of genomic sequences using VISTA browser and associated computational tools. Methods Mol. Biol. 395, 3–16 (2007).
74. Kent, W.J. et al. Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc. Natl. Acad. Sci. USA 100, 11484–11489 (2003).
75. Brendel, V., Kurtz, S. & Pan, X. Visualization of syntenic relationships with SynBrowse. Methods Mol. Biol. 396, 153–163 (2007).
76. Carver, T. et al. Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database. Bioinformatics 24, 2672–2676 (2008).
77. Engels, R. et al. Combo: a whole genome comparative browser. Bioinformatics 22, 1782–1783 (2006).
78. Crabtree, J., Angiuoli, S.V., Wortman, J.R. & White, O.R. Sybil: methods and software for multiple genome comparison and visualization. Methods Mol. Biol. 408, 93–108 (2007).
79. Wang, H., Su, Y., Mackey, A.J., Kraemer, E.T. & Kissinger, J.C. SynView: a GBrowse-compatible approach to visualizing comparative genome data. Bioinformatics 22, 2308–2309 (2006).
80. Shah, N. et al. Phylo-VISTA: interactive visualization of multiple DNA sequence alignments. Bioinformatics 20, 636–643 (2004).
81. Göttgens, B. et al. Long-range comparison of human and mouse SCL loci: localized regions of sensitivity to restriction endonucleases correspond precisely with peaks of conserved noncoding sequences. Genome Res. 11, 87–97 (2001).
82. Stothard, P. & Wishart, D.S. Circular genome visualization and exploration using CGView. Bioinformatics 21, 537–539 (2005).
83. Shannon, P.T., Reiss, D.J., Bonneau, R. & Baliga, N.S. The Gaggle: an open-source software system for integrating bioinformatics software and data sources. BMC Bioinformatics 7, 176 (2006).
84. Nicol, J.W., Helt, G.A., Blanchard, S.G. Jr., Raja, A. & Loraine, A.E. The Integrated Genome Browser: free software for distribution and exploration of genome-scale data sets. Bioinformatics 25, 2730–2731 (2009).
85. Lyons, E. et al. Finding and comparing syntenic regions among Arabidopsis and the outgroups papaya, poplar, and grape: CoGe with rosids. Plant Physiol. 148, 1772–1781 (2008).
86. Elnitski, L., Riemer, C., Burhans, R., Hardison, R. & Miller, W. MultiPipMaker: comparative alignment server for multiple DNA sequences. Curr. Protoc. Bioinformatics Ch. 10, unit 10.14 (2005).
87. Mayor, C. et al. VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics 16, 1046–1047 (2000).
88. Youens-Clark, K., Faga, B., Yap, I.V., Stein, L. & Ware, D. CMap 1.01: a comparative mapping application for the Internet. Bioinformatics 25, 3040–3042 (2009).


Visualizing genomes: techniques and challenges Cydney B Nielsen, Michael Cantor, Inna Dubchak, David Gordon & Ting Wang

Supplementary figures and text:

Supplementary Figure 1 Global dot-plot of Sorghum bicolor and Oryza sativa assemblies displayed using VISTA-Dot.
Supplementary Figure 2 The VISTA Synteny Browser.

Nature Methods: doi:10.1038/nmeth.1422

Supplementary Figure 1 Global dot-plot of Sorghum bicolor and Oryza sativa assemblies displayed using VISTA-Dot.

Supplementary Figure 1. Axis scales are the lengths of the concatenated S. bicolor and O. sativa chromosomes (430 Mb and 730 Mb, respectively). Red lines show alignments on different strands (i.e., inversions between genomes). Blue lines show alignments on the same strand. This plot highlights several large conserved regions between chromosomes of these two genomes. For example, there is extensive similarity between chromosome 2 of O. sativa and chromosome 4 of S. bicolor (blue), with a relatively large inversion (red).
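The coordinate scheme behind such a whole-genome dot plot can be sketched in a few lines. This is not VISTA-Dot code; the function names and the dictionary keys used for each alignment record are hypothetical. The sketch only shows how per-chromosome positions might be shifted onto concatenated axes, and how strand could determine both the slope and the color of each plotted segment, following the red and blue convention of the caption above.

```python
from itertools import accumulate

def chromosome_offsets(lengths):
    """Map each chromosome name to its start on the concatenated axis."""
    names = list(lengths)
    starts = [0] + list(accumulate(lengths[n] for n in names))[:-1]
    return dict(zip(names, starts))

def dotplot_segments(alignments, x_lengths, y_lengths):
    """Convert alignment records into colored line segments.

    alignments: list of dicts with the (hypothetical) keys x_chrom,
    x_start, x_end, y_chrom, y_start, y_end and strand ('+' or '-').
    """
    x_off = chromosome_offsets(x_lengths)
    y_off = chromosome_offsets(y_lengths)
    segments = []
    for a in alignments:
        x0 = x_off[a["x_chrom"]] + a["x_start"]
        x1 = x_off[a["x_chrom"]] + a["x_end"]
        y0 = y_off[a["y_chrom"]] + a["y_start"]
        y1 = y_off[a["y_chrom"]] + a["y_end"]
        if a["strand"] == "-":
            # Opposite strands: draw with a negative slope, in red.
            y0, y1 = y1, y0
            color = "red"
        else:
            color = "blue"
        segments.append(((x0, y0), (x1, y1), color))
    return segments
```

The resulting segments could be passed to any plotting library; long runs of blue segments along a diagonal would then indicate conserved chromosomes, and red runs would indicate inversions.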


Supplementary Figure 2 The VISTA Synteny Browser.

Supplementary Figure 2. An alignment of a Mycosphaerella fijiensis assembly and the Mycosphaerella graminicola finished genome is displayed at three levels of resolution. (a) An ideogram represents the whole-genome alignment. (b) Conservation within scaffold_7 of M. fijiensis is displayed. Syntenic regions are colored according to the chromosome to which they align on M. graminicola. (c) A “linked tracks” view is shown of M. fijiensis scaffold_7 versus M. graminicola chromosome 5. Regions in M. fijiensis (dark gray) are connected with lines to regions in M. graminicola (green) to indicate conservation. A gene model track is displayed as a line with hollow blocks immediately inside of the position scale for each organism. The chromosome and the scaffold can each be zoomed and panned independently.


Visualization of multiple alignments, phylogenies and gene family evolution James B Procter1, Julie Thompson2, Ivica Letunic3, Chris Creevey4, Fabrice Jossinet5 & Geoffrey J Barton1

Software tools for visualizing sequence alignments and trees are essential for life scientists. In this review, we describe the major features and capabilities of a selection of stand-alone and web-based applications useful when investigating the function and evolution of a gene family. These range from simple viewers to systems that provide sophisticated editing and analysis functions. We conclude with a discussion of the challenges that these tools now face due to the flood of next-generation sequence data and the increasingly complex network of bioinformatics information sources.

Tree and sequence alignment visualizations have a long history. Evolutionary tree diagrams can be found in even the earliest descriptions of evolution, and their visualization still plays a key role in modern phylogenetics. However, although trees visualize an organism’s evolutionary history, it is the biological data used in their construction that contains the information that distinguishes each organism. Sequence alignments are the most common data used in phylogenetic analysis, and their visualization assists in understanding the molecular mechanisms that differentiate each species, down to the level of the individual nucleotide bases and amino acids.

Many tools for tree and sequence alignment visualization have been developed in the last 20 years, and a comprehensive analysis is beyond the scope of this review. Instead, we describe the main visualization approaches found in a selection of applications that are available at present (Tables 1 and 2), and that we consider either to be widely used or to represent a significant contribution to each field. We also highlight important capabilities and drawbacks for each tool, but since many are under active development, we urge the user to explore a tool’s capabilities for themselves.

Several functions can be found among the tree and alignment visualization tools we consider here: ‘renderers’ generate static figures, ‘viewers’ allow interactive display and analysis, and ‘workbenches’ provide a complete environment for creation, visualization, editing, annotation and analysis. Some tools are more specialized and provide functions essential for editing and analyzing alignments or working with RNA (Table 1) or allow the user to map other kinds of biological data (‘Annotators’, Table 2).

Sequence database searches
Many sequence analysis exercises begin by using a search tool such as BLAST1. These tools use fast alignment methods to compare a query sequence against a library of potential sequence or sequence-family matches. The result is a ranked list of query–hit alignments, each with an associated alignment score and estimate of the significance of the match. The user is then tasked with examining this list, to identify the alignments relevant to their investigation for use in the next stage of the analysis.

Probably the most widely used visualization tool for sequence database search results is the BLAST viewer2 at the US National Center for Biotechnology Information (NCBI) website. This web-based system has its roots in the textual report generated by BLAST search tools. However, the main advantage of this viewer is that it provides a summary that gives a bird’s-eye view of the aligned positions of each hit on the query sequence. Each hit is colored by the bit score for its match to the query to indicate alignment quality, and a hyperlink takes the viewer to the pairwise alignment, enabling

1School of Life Sciences Research, College of Life Sciences, University of Dundee, Dundee, UK. 2Institute of Genetics and Molecular and Cellular Biology (IGBMC), Strasbourg, France. 3European Molecular Biology Laboratory, Heidelberg, Germany. 4Animal Bioscience Centre, Teagasc, Ireland. 5Architecture et réactivité de l’ARN, Université de Strasbourg, Institut de Biologie Moléculaire et Cellulaire du Centre National de la Recherche Scientifique (CNRS), Strasbourg, France. Correspondence should be addressed to J.B.P. ([email protected]). PUBLISHED ONLINE 1 march 2010; DOI:10.1038/NMETH.1434

S16 | VOL.7 NO.3s | MARCH 2010 | nature methods SUPPLEMENT review

Table 1 | Selected tools for multiple sequence alignment visualization
Name | Cost (a) | OS (c) | Function (d) | Description | URL
Stand-alone
AlScript9 | Free | Win, Mac, Linux | Renderer | Powerful layout engine but complex control files | http://tinyurl.com/pol2ta/
BOXShade | Free | Win, Mac, Linux | Renderer | Simple web interface, basic alignment figures | http://tinyurl.com/mxf6nd/
ClustalX24 | Free | Win, Mac, Linux | Viewer | User interface to ClustalW | http://www.clustal.org/
VISSA37 | Free | Win, Mac, Linux | Viewer | Mapping between alignments and protein structures | http://tinyurl.com/quxjt7/
BioEdit | Free | Win | Edit & anal. | Nucleic acid sequence alignment tools | http://tinyurl.com/nofcdr/
Cinema24 | Free | Win, Mac, Linux | Edit & anal. | Motif autogeneration; part of Utopia50 | http://tinyurl.com/rxjb8e/
GeneDoc | Free | Win | Edit & anal. | Visualizer for MEME21 motif discovery results | http://tinyurl.com/om6mfm/
Jalview25* | Free | Win, Mac, Linux | Edit & anal. | Interactive annotation; linked tree views | http://www.jalview.org/
Jevtrace35 | Free | Win, Mac, Linux | Edit & anal. | Automated tree subfamily analysis | http://tinyurl.com/n6jvr5/
PFAAT18* | Free | Win, Mac, Linux | Edit & anal. | Quantitative annotation modeling | http://pfaat.sourceforge.net/
SeaView23 | Free | Win, Mac, Linux | Edit & anal. | Lightweight, ‘guided’ alignment editing | http://tinyurl.com/oddnsa/
4SALE59 | Free | Win, Mac, Linux | RNA | RNA structure prediction tools | http://tinyurl.com/o2n97e/
ConStruct60 | Free | Mac, Linux | RNA | Handles pseudoknots | http://tinyurl.com/lo63cd/
S2S61 | Free | Win, Mac, Linux | RNA | Requires RNA reference structure | http://tinyurl.com/qeda97/
SARSE62 | Free | Mac, Linux | RNA | Semiautomated alignment refinement | http://sarse.kvl.dk/
eBioX11 | Free | Mac | Workbench | Access to extensive tool set including EMBOSS suite | http://tinyurl.com/yldee4e/
Geneious* | $ | Win, Mac, Linux | Workbench | Innovative BLAST result viewer | http://www.geneious.com/
MacVector19 | $ | Mac | Workbench | Spreadsheet for recording analysis results | http://www.macvector.com/
SeqPad | Free | Win, Mac, Linux | Workbench | Built on biojava platform; not yet stable | http://trac.seqpad.org/
Strap63 | Free | Win, Mac, Linux | Workbench | Sequence and structure alignment and analysis | http://3d-alignment.eu/
Vector NTI3 | $ (b) | Win, Mac, Linux | Workbench | Sophisticated annotation diagrams | http://tinyurl.com/c92xm7/
Web-based
Chroma+10 | Free | | Renderer | Supports JOY64 annotated alignments | http://www.llew.org.uk/chroma/
ESPript8 | Free | | Renderer | Rendering engine for ENDSCRIPT | http://tinyurl.com/m5dqwd/
(a) Free means the tool is free for academic use; $ means the tool is not free, but a demo version is available. (b) Demo version is severely limited. (c) OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. (d) Renderers are tools that generate figures via the web or command line only. Viewers are interactive alignment visualization tools without editing capabilities. Edit & anal.: editing, annotation and analysis tools. RNA, tools with special support for RNA alignment editing and analysis. Workbench tools are often aimed at experts and provide a range of analysis and visualization functions in addition to alignment visualization. *Our recommendations.

deeper inspection. The NCBI interface has been enhanced over the years, to keep pace with the increasing size of sequence databases, and now provides a tree representation of the search results, so that the relationships within the hit set may be seen. Furthermore, the annotation of each hit with an NCBI taxonomy identifier allows a phylogenetic breakdown of the hit list to be displayed, so the researcher can focus on the query’s similarity to sequences found in a single organism, or specific clades.

There are surprisingly few alternatives to the style of visualization provided by the NCBI viewer. However, two of the alignment workbench tools in Table 1 include innovative approaches (Supplementary Fig. 1). Vector NTI3 presents a fivefold linked view: a hierarchical list giving the details of each hit, the hit summary diagram, a pairwise alignment view, the currently selected hit’s alignment trace showing the corresponding homologous segments in both sequences, and a two-dimensional plot of the hit profile on the query sequence. Hit selection is facilitated by the plot, and sequence region selection is possible in either alignment view. Geneious’s BLAST viewer provides a ‘Linnaeus view’ in addition to a more conventional multiple alignment view (see below). In the Linnaeus view, hits are shown as a two-dimensional taxonomic tree map with the ‘top hit’ identified within its clade by an arrow. Each cell contains a distance tree for hits in that clade, with cells grayed for clades with hits below a user-defined threshold.

Multiple sequence alignments
A multiple sequence alignment (MSA) is a matrix, in which each row corresponds to a sequence and each column defines equivalent positions across all sequences. Sequence search results are a collection of alignments, and they must usually be transformed into a single MSA before further analysis. BLAST, and some of the workbenches in Table 1, can create an MSA by aligning each hit to the query using the pairwise alignment from the search results (Fig. 1a), but this approach often introduces errors. True multiple-alignment algorithms obtain more accurate information with a variety of optimization heuristics, such as the guide tree approach (Fig. 1b) found in ClustalW4 and the consistency method (Fig. 1c) used by T-Coffee5, and these are extensively reviewed elsewhere6,7.

Multiple alignment renderers
Many tools for multiple alignment visualization have been developed over the years. Each one offers some variant of the spreadsheet-style alignment diagram shown in Figure 2a. Here, sequences are laid out in rows, and corresponding residues and bases are represented as letters arranged on a grid. Renderers (such as ESPript8, ALSCRIPT9 and Chroma10) were the first dedicated systems to generate these kinds of visualizations, and although appearing outdated by twenty-first-century standards, they still provide the greatest control for automated figure creation. In addition to parsing sequence alignment files, they take a set of parameters via either the web interface,


Table 2 | Selected tools for phylogenetic tree visualization (a)
Name | OS (b) | Function (c) | Description | URL
Stand-alone
TreeDyn46* | Win, Mac, Linux | Renderer | Turnkey tree editor and annotator | http://www.treedyn.org/
Archaeopteryx65* | Win, Mac, Linux | Viewer | Viewer/editor providing reference support for phyloXML (d) | http://tinyurl.com/c9vp2d/
CTree66 | Win, Mac, Linux | Viewer | Viewer for analysis and visualization of clusters within trees | http://tinyurl.com/pd3m3l/
Dendroscope67* | Win, Mac, Linux | Viewer | Interactive viewer for large phylogenetic trees and networks | http://tinyurl.com/2etsd8/
FigTree | Win, Mac, Linux | Viewer | Modern tree viewer with coloring and collapsing | http://tinyurl.com/cjrxcd/
HyperTree41 | Win, Mac, Linux | Viewer | Simple hyperbolic viewer | http://tinyurl.com/55moet/
NJplot68 | Win, Mac, Linux | Viewer | Interactive tree plotter; reroots, exports as PDF | http://tinyurl.com/lbjw4x/
Tree Set Viz69 | Win, Mac, Linux | Viewer | Viewer that computes and visualizes distances between trees | http://tinyurl.com/otvc7g/
TreeView70 | Win, Mac, Linux | Viewer | Classic tree viewing software that is very highly cited | http://tinyurl.com/nn95wv/
TreeJuxtaposer71 | Win, Mac, Linux | Viewer | The first viewer implementing the focus+context navigation technique | http://olduvai.sourceforge.net/
Walrus42 | Win, Mac, Linux | Viewer | Generic 3D hyperbolic viewer; no support for standard phylogenetic formats | http://tinyurl.com/ac4cs/
NOTUNG39 | Win, Mac, Linux | Annotator | ATV-based tool for ortholog and paralog identification by tree reconciliation (e) | http://tinyurl.com/yhyztd7/
Treebolic | Win, Mac, Linux | Annotator | Generic hyperbolic viewer/editor; no support for phylogenetic formats | http://treebolic.sourceforge.net/
TreeGraph49 | Win, Mac, Linux | Annotator | Annotate with multiple support values or through different widths and colors | http://treegraph.bioinfweb.info/
Treevolution47 | Win, Mac, Linux | Annotator | ‘Distortable’ tree layout, subfamily highlighting | http://tinyurl.com/kq22s9/
ARB72* | Mac, Linux | Workbench | Complete analysis environment | http://www.arb-home.de/
MEGA72* | Win, Mac, Linux | Workbench | Workbench for molecular evolutionary genetics analysis | http://www.megasoftware.net/
Mesquite | Win, Mac, Linux | Workbench | Modular system for evolutionary analysis | http://mesquiteproject.org/
SplitsTree473 | Win, Mac, Linux | Workbench | Tree and network creator and viewer | http://tinyurl.com/mpbhsg/
TOPALi74 | Win, Mac, Linux | Workbench | Nucleic acid and protein evolutionary analysis | http://www.topali.org/
Web-based
Phylodendron | | Renderer | Supports a range of tree and branch styles and output formats | http://tinyurl.com/m3cdqb/
Hypergeny | | Viewer | Hyperbolic tree browser | http://tinyurl.com/nhrfbq/
iTOL48* | | Annotator | Powerful tree-based annotation visualizer; batch interface | http://itol.embl.de/
PhyloWidget75 | | Annotator | Processing-based editor/publisher; annotate with image and web links | http://www.phylowidget.org/
(a) All tools in this table are free for academic use. (b) OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. (c) Renderers are tools that generate figures by means of a web or command line interface only. Viewers are tools for interactive visualization that have no tree-generation capabilities. Annotators are viewers that allow additional data to be mapped onto the phylogenetic visualization. Workbench tools can generate, manipulate, analyze and visualize trees. (d) phyloXML is a new format for the exchange of phylogenetic trees. (e) ATV refers to ‘Another Tree Viewer’, which has been superseded by Archaeopteryx. 3D, three-dimensional. *Our recommendations.

command line arguments or a separate file. These parameters control how the alignment is drawn and annotated, and some can be defined independently of the MSAs being rendered, facilitating the use of these noninteractive tools in sequence analysis pipelines.

Interactive alignment viewers
Interactive viewers generally provide the same visualizations as static renderers but adapted for display on screen. Historically, they were developed as a user-friendly interface to alignment programs (ClustalX4), and more recently, for sequence analysis suites (eBioX11). Importantly, their interface allows shading and display styles to be easily changed, facilitating figure generation. A further advantage is gained when working with large alignments. For example, an MSA taken from the Pfam12 protein domain family database contains, on average, 300 sequences12, which is too large to interpret without the ability to scroll around the alignment matrix and zoom in or out to perceive and focus in on gross trends.

Coloring. The primary role of color in alignment visualizations is to identify regions where specific properties predominate and to highlight variation. The simplest way this is achieved is to color each sequence symbol according to a specific amino acid or nucleotide color scheme (Fig. 2b,c). Schemes are usually one of two types: quantitative schemes convey trends in specific empirical properties, such as hydrophobicity or ‘burial’ in proteins, and qualitative colorings reflect general physicochemical class (for example, sugar ring geometry or amino acid side chain size, shape, polarity and aromaticity). The assignment of colors according to chemical nature is analogous to the conventions for atom colors prevalent in molecular graphics, and the amino acid colors used by Clustal4 in alignments (Fig. 2a) broadly correspond with the main groupings of physicochemical attributes of the 20 amino acids (Fig. 2d). Tools vary in the precise choice of scales and color gradations used for quantitative schemes, but among the qualitative schemes, Taylor13 and Clustal4 (Fig. 2c) are widely supported and may be considered de facto standards.

Shading. Coloring every symbol in an alignment can help identify gross trends, but becomes confusing for regions showing complex patterns of variation. A more effective approach, pioneered in ClustalX4, is to shade symbols on the basis of both their type and their predominance at each alignment position (Fig. 2a). This approach is widely supported, and it has many variations, as other measures can be used to define color or control shading, such as a symbol’s similarity to some reference (usually the consensus or the sequence used for a BLAST search). Alignment quality can also be emphasized. Dissimilar sites can be rendered with lower-case letters, or, when working with a family of closely


related homologs, variable regions can be highlighted as such by replacing letters identical to the reference with periods.

Summary plots: conservation, consensus and quantitative annotation. Annotation is important for navigation, in both flat diagrams and interactive systems, because it guides the eye toward ‘important’ regions of an alignment. MSA workbenches and most of the editing and analysis tools reviewed here allow the user to interactively annotate alignments (see below). However, practically all MSA visualizations include some form of automatically generated annotation, such as consensus lines and alignment quality plots, displayed either above or below the alignment. Consensus annotation has its roots in the textual alignment files generated by MSA programs, but a variety of plots are now provided by modern tools (Fig. 3). Quality and consensus plots are calculated from each column’s symbol frequency distribution using one of the many measures available14,15. Alternatively, sequence logos16,17 provide a user-friendly indication of the dominant symbols at each position of the alignment. As in shading, described above, annotation can result from other kinds of calculations. For example, PFAAT18, MacVector19, VectorNTI3 and Geneious are able to compute and plot averaged physicochemical quantities such as isoelectric point, and STRAP20 supports extension of the program to allow complete customization.

Figure 1 | Alignment topologies. Consistency graphs demonstrating complexity of different types of alignment algorithm. Nodes represent the query and each hit from the results of a sequence search, and edges indicate the mapping between a pair of sequences that the alignment algorithm optimizes. (a) MSA constructed directly from pairwise database search results. (b) MSA constructed using a guide tree, in which closely related sequences and then groups of sequences are optimally aligned. (c) MSA from consistency-based algorithms such as T-Coffee, in which all sequences are optimally aligned with one another.

Alignment editing, analysis and annotation
Integrated systems to support the editing and analysis of sequences have become possible with increased computing power and the ubiquity of internet connectivity. Most of the tools for MSA visualization mentioned here provide alignment coloring, shading and automated annotation facilities, as described above. However, ‘editing and analysis’ tools and most of the MSA workbench tools also allow alignments to be interactively edited,

[Figure 2 graphic. Panel a: alignment of ARSA_MOUSE, Q9DC66_MOUSE, Q32K15_CANFA, Q5BL32_BRARE, Q6AX40_XENLA, Q8WNR3_PIG, Q5XFU5_MOUSE, Q3TYD4_MOUSE, ARSG_HUMAN and Q32KJ9_RAT with a consensus row. Panel b: amino acid color schemes (buried index, turn propensity, strand propensity, helix propensity, hydrophobicity, Taylor, Zappo, ClustalX). Panel c: nucleotide color schemes (Nucleic Acid Database, Jalview/TOPALi, ClustalX, MacClade, GC versus AT (Geneious), purine versus pyrimidine). Panel d: Venn diagram of amino acid groups (small, tiny, hydrophobic, aliphatic, aromatic, polar, pos. charge, neg. charge).]

Figure 2 | Multiple alignment visualization. (a) A protein sequence alignment diagram rendered with Jalview25. Aligned sequences are arranged in rows and placed into a single reference frame, where each aligned position occupies a column in a table. Dashes indicate gaps. The label on the left-hand side of each sequence gives its Uniprot53 entry name, start and end positions are shown at each end of its row and tick marks at the top allow a particular aligned column or sequence position to be read off. The consensus row at the bottom shows the most frequent residue at each column or a ‘+’ if two or more residues are equally abundant. Residues in the alignment are colored according to the ClustalX4 shading model: a color is only applied when that residue’s abundance in the column is above a residue-specific threshold, highlighting potentially important residues (for example, proline and glycine) or patterns of conservation. (b) Examples of amino acid color schemes. Schemes are either quantitative, reflecting empirical or statistical properties of amino acids; or qualitative, reflecting an assignment according to physicochemical attributes. Zappo is a qualitative scheme developed by M. Clamp (personal communication); B, X and Z are amino acid ambiguity codes: B is aspartate or asparagine; Z is glutamate or glutamine; X is an unknown (or ‘other’). (c) Examples of nucleotide color schemes used by the Nucleic Acid Database54 and a selection of visualization tools. (d) Venn diagram after Taylor55 showing the amino acids grouped according to their physicochemical properties. Coloring of each group (or amino acid label) is according to ClustalX, demonstrating the correspondence between color and physicochemical properties. Pos., positive; neg., negative.
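The consensus row and the abundance-based shading described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration only, not the ClustalX or Jalview implementation: the function names are hypothetical, gaps are assumed to be written as '-' and are ignored when counting, and a single global threshold stands in for the residue-specific thresholds of the ClustalX model.

```python
from collections import Counter

def column_counts(alignment, col):
    """Count the non-gap symbols in one alignment column."""
    return Counter(seq[col] for seq in alignment if seq[col] != "-")

def consensus(alignment):
    """Most frequent symbol per column; '+' on a tie; '-' if only gaps."""
    out = []
    for col in range(len(alignment[0])):
        counts = column_counts(alignment, col)
        if not counts:
            out.append("-")
            continue
        ranked = counts.most_common()
        top = ranked[0][1]
        tied = [sym for sym, n in ranked if n == top]
        out.append(tied[0] if len(tied) == 1 else "+")
    return "".join(out)

def shade_mask(alignment, threshold=0.6):
    """True where a residue's share of its column's non-gap symbols
    reaches the threshold (assumption: one threshold for all residues)."""
    mask = []
    for seq in alignment:
        row = []
        for col, sym in enumerate(seq):
            counts = column_counts(alignment, col)
            total = sum(counts.values())
            row.append(sym != "-" and counts[sym] / total >= threshold)
        mask.append(row)
    return mask

# Toy alignment: rows are sequences, columns are aligned positions.
alignment = ["ACD-", "ACE-", "AGEK"]
print(consensus(alignment))      # ACEK
print(shade_mask(alignment)[0])  # [True, True, False, False]
```

Storing the alignment as a list of equal-length strings mirrors the row-and-column matrix view used throughout this review; a real tool would add residue-specific thresholds and a color lookup on top of the boolean mask.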


annotated and analyzed, and often include additional visualizations for data associated with sequences or the result of analyses. Special support is provided for exploring sequence annotation, visualizing a sequence’s associated protein or nucleic acid structure, or inspecting trees resulting from the application of phylogenetic analyses to the alignment. However, the degree of integrated visualization that these tools provide varies considerably. The latest tools use modern information visualization techniques, such as linked highlighting and brushing. For example, applying a color to a branch of a tree calculated from an MSA also colors the sequences in the linked MSA visualization. Conversely, older tools tend to provide either static or independent views of each type of data, but they often have unique visualization or analysis features; for example, GeneDoc has dedicated support for the MEME21 motif discovery suite.

Editing and curation. MSAs from even the most accurate multiple alignment algorithm can contain errors, known as alignment artifacts22, that make those regions of the alignment biologically meaningless. These occur because MSA algorithms find the optimal solution to a mathematical problem, rather than one reflecting the biochemical equivalence of the sequences. Such errors are hard to detect through automated means because correcting them often requires specialized knowledge of the protein or gene family. Editing and analysis tools such as Jalview are designed for alignment curation and so allow the user to modify parts of the alignment easily, either manually or with automated assistance23, and allow changes to be undone. The shading and quality histograms in these tools also reflect changes immediately, to provide feedback on the effect of the modification.

Navigation, overviews, searching and selective row and column display. Systems such as PFAAT18 and CINEMA24 that are designed for curation, annotation and analysis provide navigation aids, including bird’s-eye or overview windows that locate the visible region in its wider context. Search functions are also essential, and tools vary in capability, but typically they allow the user to locate and select sequence name, position or sequence pattern matches. Some tools (for example, Jalview25) also allow the user to create multiple views on the same alignment and to hide rows or columns, thus juxtaposing regions far apart, to aid in exploration and figure composition.

Interactive alignment annotation. Curation and figure generation require a flexible user interface for interactive alignment annotation. For example, a set of sequences in an alignment might be interactively grouped on the basis of the user’s own knowledge, or regions corresponding to aligned domains or sequence motifs might be annotated so they are highlighted with a different rendering style. Alternatively, annotation tracks for the alignment may be added above or below the sequences containing colored labels and symbols to indicate potentially conserved properties such as protein secondary-structure elements or RNA secondary-structure contacts. Tools that support alignment column annotation display them in a similar fashion to the automated alignment summary annotation and usually allow the user to create and modify them interactively. Modern systems such as Jalview, PFAAT and CINEMA also provide a means to import and export annotation, and they offer ‘project files’ that store the complete state of an annotated alignment, enabling the user to return to it at a later date.

Figure 3 | Examples of automatically generated summary annotation for an alignment generated by MSA visualization tools. (a) ClustalW quality annotation from ClustalX: ‘*’, ‘:’ and ‘.’ highlight identical, conserved and ‘mostly conserved’ columns, respectively, under a particular substitution model. (b) Mirny56 conservation measure from PFAAT. Shannon entropy score is calculated for each column based on a reduced amino acid alphabet. (c) Amino acid physicochemical property conservation, consensus and overlaid sequence logo from Jalview. (d) Mean hydrophobicity and isoelectric point from Geneious. (e) HMMlogo visualization from Logomat-P (ref. 57) using corresponding HMMER58 model. Labels have been added to the original images obtained from the tools in the creation of b,d,e.
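As an illustration of how a column conservation track such as the entropy score in Figure 3b might be computed, the sketch below calculates the Shannon entropy of each alignment column, optionally after mapping residues onto a reduced alphabet. It is a hedged example rather than the method used by PFAAT or any other tool: the six-class grouping, the names GROUPS, REDUCED, column_entropy and entropy_profile, and the choice to ignore gaps (written as '-') are all assumptions made for illustration.

```python
import math
from collections import Counter

# Illustrative six-class grouping (an assumption, not taken from any tool).
GROUPS = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
}
REDUCED = {aa: name for name, members in GROUPS.items() for aa in members}

def column_entropy(column, mapping=None):
    """Shannon entropy in bits of the non-gap symbols in one column.
    Low values indicate conserved columns."""
    symbols = [s for s in column if s != "-"]
    if mapping is not None:
        # Symbols missing from the mapping are kept as their own class.
        symbols = [mapping.get(s, s) for s in symbols]
    if not symbols:
        return 0.0
    total = len(symbols)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(symbols).values())

def entropy_profile(alignment, mapping=None):
    """Entropy of every column; zip(*alignment) walks the columns."""
    return [column_entropy(col, mapping) for col in zip(*alignment)]

alignment = ["AK", "VR", "LD"]
print(entropy_profile(alignment))           # raw 20-letter alphabet
print(entropy_profile(alignment, REDUCED))  # reduced alphabet
```

In this toy alignment the first column holds three different aliphatic residues, so it scores as variable with the raw alphabet but as fully conserved (entropy 0) with the reduced one, which is the motivation for scoring on physicochemical classes.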


Box 1 SOURCES OF ANNOTATION

Annotation, at the level of a complete sequence or for a given subsequence, can be obtained by importing a flat file (such as a GFF file; http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml) containing information associated with sequences in the alignment, or by remotely accessing bioinformatics databases (for example, UniProt, PDB, InterPro), either directly or using the Sequence Retrieval System (SRS). Recent tools also retrieve annotation by means of programmatic web services, such as the Distributed Annotation System (DAS)76.

The structural and functional annotation of genomic and protein sequences has been the object of a large community effort in recent years. Although experimental evidence is clearly the most reliable source of annotation, it is unfeasible for the huge number of sequences available today. Therefore, automated or semi-automated systems, such as HAMAP77 or MACSIMS78, are being developed to transfer (in a controlled way) the known annotation from the characterized sequences in the databases to the uncharacterized ones. These systems assume that closely related sequences generally share a similar three-dimensional fold and often have similar functions. Under these assumptions, MSAs are analyzed to identify and annotate regions of the sequences sufficiently homologous to conserve structure and, perhaps, function.

A crucial aspect of publicly disseminated annotation is ensuring that structural and functional information associated with a sequence is ‘machine readable’. As a consequence, ontologies and other structured vocabularies are being developed to represent the knowledge in a formal way. Widely used ontologies include the Gene Ontology79,80, which describes gene products in terms of their associated biological processes, cellular components and molecular functions; the Sequence Ontology81,82, describing the features and attributes of biological sequences; and the NCBI organismal classification83.
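To make the flat-file route concrete, the following sketch reads the first five tab-separated columns of a GFF line (sequence name, source, feature type, start and end) and maps the feature’s residue range onto alignment columns. It is a minimal example under stated assumptions: the names Feature, parse_gff_line, residue_to_column and feature_columns are hypothetical, gaps are written as '-', the aligned row is assumed to start at residue 1 of the sequence, and the remaining GFF columns (score, strand, frame, attributes) are ignored.

```python
from dataclasses import dataclass

@dataclass
class Feature:
    seqname: str
    source: str
    feature: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

def parse_gff_line(line):
    """Parse one GFF line; return None for blank lines and '#' comments."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least five tab-separated fields")
    return Feature(fields[0], fields[1], fields[2],
                   int(fields[3]), int(fields[4]))

def residue_to_column(aligned_row, position):
    """0-based alignment column holding the 1-based ungapped residue."""
    seen = 0
    for col, sym in enumerate(aligned_row):
        if sym != "-":
            seen += 1
            if seen == position:
                return col
    raise IndexError("position lies beyond the end of the sequence")

def feature_columns(aligned_row, feat):
    """First and last alignment columns spanned by a feature."""
    return (residue_to_column(aligned_row, feat.start),
            residue_to_column(aligned_row, feat.end))

feat = parse_gff_line("seq1\tUniProt\tDOMAIN\t3\t5\t.\t.\t.")
print(feature_columns("MK--TAYIA", feat))  # (4, 6)
```

The coordinate mapping is the step that lets annotation defined on an ungapped sequence be drawn at the correct aligned positions once gaps have been inserted.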

Visualization of annotation. Sequence annotation is an increasingly important part of alignment visualization, as it enables the user to rapidly identify key regions that should be curated, or inspected for variation. Annotation is available from a variety of sources (such as the Distributed Annotation System (DAS); see Box 1), and tools typically provide a means of importing annotation from flat files (for example, GFF or GenBank files), or automatically retrieving it by means of web services provided by databases or DAS annotation servers. Annotation associated with the complete sequence, such as its originating organism or biological function, can be shown adjacent to the sequence name (PFAAT, CINEMA) or in ‘mouse-overs’ (Jalview). Local annotation, such as domains, catalytic sites and protein secondary-structure elements, is rendered at its aligned positions variously as colored boxes, glyphs or other annotation, with any extra information provided by means of mouse-overs and embedded hyperlinks to web pages. Many sources of annotation indicate the provenance of their individual annotation records, and it is important to distinguish experimentally observed and predicted annotation in visualizations (Supplementary Fig. 2). Workbenches, and some of the editing and analysis tools (including PFAAT and Jalview), also allow the interactive creation and manipulation of sequence annotation, making these tools useful for sequence annotation curation.

Investigation of function. The analysis of sequences from diverse organisms is one of the most powerful ways to probe the structure and function of biological systems26. Most of the strategies for this include phylogenetic tree–based alignment analysis, which is discussed in the penultimate section of this review. However, some alignment visualization tools also support alternative approaches for functional site analysis. PFAAT and CINEMA can highlight regions of alignments that match sequence motif databases (for example, PROSITE27 and TRANSFAC28,29) or, in the case of GENEDOC, ab initio motif discovery predictions30–32. The principal component techniques used in tools such as SeqSpace33 (implemented in Jalview) and, more recently, pHMM34 enable them to present a more abstract view of an MSA, by representing sequences and aligned positions as points on a two- or three-dimensional interactive scatter plot. Here, interactive brushing allows the user to locate and select a cluster of sequences or correlated sites and to view their locations in the linked view of the alignment.

Combined alignment and three-dimensional structure visualization. A linked molecular structure viewer enables exploration and interpretation of specific mutations in an MSA (Supplementary Figs. 2b and 3a). Most of the tools capable of tree-based alignment analysis (discussed below) also allow alignment shading to be transferred to an associated protein structure25,35–37.

RNA alignment visualization
Unlike that of proteins, the tertiary structure of an RNA macromolecule is almost solely determined by the pattern of nucleotide base pairs formed as it folds. RNA alignment visualization tools (Table 1) provide specialized shading and annotation models for investigating and highlighting the conservation of this secondary structure, and some, such as 4SALE, provide linked visualizations of the network of base pairs. However, RNA alignments still present problems, and new ways of representing these MSAs are being sought38.

Sequence analysis workbenches
Sequence alignment workbenches differ from the other tools described here in that they provide a wide range of data management, analysis and visualization capabilities, of which alignment, sequence and phylogenetic analysis are just components. Because of this, they tend to separate sequence and sequence annotation editing from alignment visualization, unlike tools with their roots in MSA visualization, which deal with sequence and alignment annotation within a single context.

Semantic sequence annotation visualization. Workbench and MSA editing and analysis tools vary greatly in the way in which they render positional sequence annotation on alignments. However, all include some standard mapping between the graphical representation used and each type of annotation (for example, domains are


displayed differently from metal binding sites). This enables them to exploit the formal terms found in annotation retrieved from public databases (Box 1) to display the annotation appropriately. Workbench tools with their roots in sequence visualization (such as MacVector and VectorNTI) provide the most advanced annotation display capabilities, and they allow each row of the MSA to be decorated with its own numbering, histograms, annotation tracks and complex diagrammatic glyphs.

Visualizing phylogenetic trees
Phylogenetic analysis is an important part of the scientific workflow, and its backbone is the visual inspection, annotation and exploration of phylogenetic trees. It is therefore no surprise that the selection presented in Table 2 barely represents the extensive repertoire of applications dedicated to tree visualization. These stand-alone or web-based tools can be placed into the three main functional classes used for MSA visualization tools, ‘renderers’, ‘viewers’, and ‘workbenches’, plus a further class, ‘annotators’. Rather than detail the attributes of each class, we instead provide a basic guide to the principles underlying phylogenetic analysis (Box 2) and the tree visualization supported by these tools. We follow this by a brief discussion of the combined use of tree and MSA visualizations, and the state-of-the-art tools available for annotated phylogenetic visualization.

Basic tree terminology. Trees are directed graphs, in which branches connect internal (ancestral) nodes to their descendants. Leaf nodes represent elements of the character data used to construct the tree. These could be sequences that have been aligned in an MSA, or some other kind of characteristic information. In the case of MSA-based trees, two types are possible, which affects how the tree’s internal structure should be formally interpreted. In gene trees, constructed from an MSA containing multiple gene families, internal nodes represent either speciation or gene duplication events. Conversely, if the MSA only contains sequences for a single gene in many species, the result is a species tree, and internal nodes then correspond to speciation events. However, precise interpretation of both types of tree requires knowledge of the wider evolutionary context, and specialized applications such as NOTUNG39,40 have been developed to aid their analysis.

Tree visualization styles
Historically, phylogenetic trees were drawn to mimic real trees, from the ground up. However, with increasing numbers of nodes such tree layouts quickly become cluttered and difficult to read. Therefore, various alternative approaches are used to increase the readability and ease of annotation of trees, dependent on the tree size. These methods can be separated into two main categories, based on the geometry they use:

Euclidean geometry. This is the most common display method, and most tools in Table 2 fall into this category. A variety of Euclidean tree styles exist (Fig. 4); but the choice of whether to present a cladogram or phylogram depends on the reliability of the evolutionary information available, and whether the tree is to be used to highlight differences in ancestry or rates of evolution. Trees with up to several hundred terminal nodes can be visualized with various rectangular layouts. Circular and radial layouts make it difficult to compare distal branches of the tree, but they are more useful for annotation, since they offer greater capacity for a given diagram size and can handle up to several thousand nodes.

[Figure 4 graphic: phylogram and cladogram versions of rectangular, slanted, circular and radial tree layouts.]

Figure 4 | Euclidean tree layouts. Trees are usually viewed as either phylograms, where branch length reflects similarity; or cladograms, where branch order reflects number of ancestors. Rectangular layouts highlight variation in branch length, whereas slanted layouts facilitate comparison of branch order (in cladograms). Circular layouts are most efficient when visualizing large numbers of taxa but make it more difficult to compare branch lengths. Radial layouts do not convey ancestral information and are most appropriate for unrooted trees, which are obtained when appropriate outgroups (reference phyla known to be less related to the phyla of interest than those are to one another) are not available.

Hyperbolic geometry. Hyperbolic display models are often used for very large network visualizations; tools that use this approach can easily handle thousands or even hundreds of thousands of nodes. Tools such as HyperTree41, Hypergeny and Treebolic (Table 2) use hyperbolic projection to provide a view, analogous to that of a fish-eye lens, often termed ‘focus+context’. This projection results in a circle in which distances between nodes of the tree are reduced exponentially, according to their distance from the center. By interactively panning the tree and bringing different branches to the central magnified region, it is possible to examine every part of the tree in detail while keeping a sense of the context. An


Box 2 SOURCES OF PHYLOGENETIC DATA

Phylogenetic trees are calculated by applying mathematical models to infer evolutionary relationships between organisms, based on a set of characters that describe their differences. The most common characters are nucleotide or protein MSAs, but morphological information has also been used. There are four main categories of phylogenetic reconstruction methods: maximum parsimony, distance matrix, maximum likelihood and Bayesian approaches84.

Parsimony is the principle of choosing simpler hypotheses in preference to those requiring a more complex explanation85. Maximum parsimony approaches create trees using the minimum number of ancestors needed to explain the observed characters86.

Distance matrix methods, such as neighbor joining, allow more sophisticated evolutionary models than parsimony approaches. They estimate the mean evolutionary time (measured as the mean number of changes per site) since two species diverged from their most recent common ancestor86. However, because they reduce the estimate of most recent common ancestor to a single value, information on character evolution is lost.

Maximum likelihood and Bayesian methods constitute the state-of-the-art approaches for tree reconstruction. Maximum likelihood methods search a set of tree and evolutionary models to find the ones most likely to generate the observed characters87. Bayesian approaches offer more flexibility, as they allow optimization of all aspects of a tree (model, topology, branch length)88. But this comes at a cost: they require computationally expensive techniques such as Markov chain Monte Carlo to estimate terms in the Bayes equation.
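As a concrete illustration of the distance matrix idea in Box 2, the following Python sketch computes pairwise distances from a toy alignment. It is a minimal example under stated assumptions: the three sequences are invented, gapped sites are simply skipped, and the Jukes-Cantor correction stands in as one simple evolutionary model. It is not the procedure of any tool listed in Table 2, and all function names are hypothetical.

```python
import math
from itertools import combinations


def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    # Keep only columns where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance (expected substitutions per site) for DNA."""
    if p >= 0.75:
        return math.inf  # the correction is undefined once sites are saturated
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment: dict[str, str]) -> dict[tuple[str, str], float]:
    """Corrected distance for every unordered pair of sequence names."""
    return {
        (x, y): jukes_cantor(p_distance(alignment[x], alignment[y]))
        for x, y in combinations(sorted(alignment), 2)
    }


# Invented toy alignment (assumption, for illustration only).
toy_alignment = {"human": "ACGTACGT", "mouse": "ACGTACGA", "fly": "ACCTAGGA"}
matrix = distance_matrix(toy_alignment)
for (x, y), d in matrix.items():
    print(f"{x:>5} vs {y:<5} {d:.4f}")
```

A matrix of this kind is the input that a clustering method such as neighbor joining turns into a tree. Each pair of sequences is reduced to a single number, which is why information on character evolution is lost.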

alternative approach, introduced by H3Viewer, is to render trees in three dimensions embedded within a sphere, which allows the visualization of hundreds of thousands of nodes. A snapshot of a hyperbolic visualization of the whole NCBI taxonomy using Walrus42 is shown in Supplementary Figure 3.

Tree-based alignment analysis
Phylogenetic trees and alignments are intrinsically related, and there are tools in Tables 1 and 2 that can work with both kinds of data. However, tree-based alignment analysis35,36,43–45 methods, which enable identification of functional motifs, are usually found only in MSA analysis tools. A notable example is JEvTrace35 (Table 1; Supplementary Fig. 4a), which, given an MSA and, optionally, an associated tree, partitions the aligned sequences into subfamilies and automatically annotates columns containing sites showing variation significantly different from the overall tree. Other tools from Table 1 that have interactive tree viewers usually allow the user to create manual or tree-based groupings on alignments. Once groups are defined, standard alignment shading models can highlight patterns of conservation and mutation that differ between groups (Supplementary Fig. 4b).

Annotation of trees
Phylogenetic trees in their raw form contain valuable information about the relatedness of the sequences (or other data) used to construct the tree, but it is possible to annotate the trees with further information, increasing their value for the interpretation of biological data. The most basic forms of annotation include branch lengths representing evolutionary distances and labels showing the phylogenetic support for each of the internal branches on the tree (such as bootstrap proportions). Tree branches are displayed in varying colors, either to highlight whole clades or to annotate particular features present in different nodes. Tools that support coloring of the tree branches include TreeDyn46, Treevolution47, iTOL48 and FigTree. Some tools, such as TreeGraph49, can also use the width of the nodes to convey quantitative annotation.

However, there is a growing need for tools capable of mapping more complex information onto trees. For example, metagenomic studies generate experimental results that are more easily interpreted when displayed using existing phylogenetic information. Supplementary Figure 5 demonstrates such a visualization using the interactive Tree Of Life (iTOL), which allows users to annotate trees with various data set types, from simple histograms and pie charts to animated time series data and representations of protein domain architectures. Such tools are already powerful, but their capabilities will need to be expanded further as phylogenetic trees become more commonly used in multidisciplinary investigation.

Perspectives and challenges
Many alignment and tree visualization tools can display molecular structure and sequence annotation data, enabling in-depth analysis of a sequence family. Thanks to open standards and improved software and web services technology, more of these tools are becoming interoperable, and we expect them to provide increasingly flexible visualization and annotation interfaces for these types of biological data. Work has begun in this direction with the development of integrated tools such as eBioX11, Utopia50 and general workbenches such as Bioclipse. In the future, we can expect integration with visualization tools for other kinds of ‘omics’ data, such as complete genome browsers and protein-protein interaction maps.

The ability to perform analyses and create annotation is an intrinsic property of tools designed for creating, editing and exploring trees and alignments. Systems such as DAS are now being exploited to facilitate the gathering of information from a host of bioinformatics databases, and it is possible to obtain rich and complex annotation derived from large-scale systems biology experiments. As a consequence, constraining the complexity of annotated visualizations is becoming necessary, calling for innovative visual representation techniques that aggregate and summarize annotations to make the most pertinent information accessible. Furthermore, biologists are notorious for the complex questions they ask of their data, and tools need more sophisticated query mechanisms to enable visualization of data selected on evolutionary and functional annotation criteria.

Lastly, other issues of scale must also be addressed. The sequence databases are growing exponentially, and the alignment of large sets of sequences has become a standard requirement. As an example, the largest protein family in the Pfam database contains over


100,000 sequences, but, given the accelerating rate of sequencing, it is likely that most families will contain thousands rather than hundreds of members within the next 5 years. Tools must therefore be improved to remedy any technical and conceptual limitations exposed when operating with such large data sets. Phylogenetic tools have already been developed that cope with relatively large trees (up to several thousand leaves), but size is still a particular problem for multiple alignment systems. Some show usability problems, such as poor interactive response times when loading or saving files or during other simple operations, such as selection or coloring. More generally, tools will need to provide access to the mass of information in these very large data sets, with enhanced overview displays that can summarize and provide easy navigation to more detailed views. In the case of trees, summary techniques include pruning and collapsing of branches. For sequence alignments, alternative visualization approaches such as partial order graphs51 and circular alignment diagrams52 have been developed, but, as far as we are aware, no interactive tool that supports them exists as yet. In conclusion, the increasingly dense biological data landscape presents new challenges for alignment and phylogenetic visualization. In response, exciting new approaches for the visualization and annotation of trees and alignments are being developed, and we look forward to using them in the future.

Note: Supplementary information is available on the Nature Methods website.

ACKNOWLEDGMENTS
J.B.P. acknowledges the support of the ENFIN European Network of Excellence (contract LSHG-CT-2005-518254) awarded to G.J.B. Several tools were made available as prereleases to the authors for evaluation purposes, and we thank the individuals and companies who obliged our requests.

COMPETING INTERESTS STATEMENT
The authors declare no competing financial interests.

Published online at http://www.nature.com/naturemethods/.
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/.

1. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
2. Johnson, M. et al. NCBI BLAST: a better web interface. Nucleic Acids Res. 36, W5–W9 (2008).
3. Lu, G. & Moriyama, E.N. Vector NTI, a balanced all-in-one sequence analysis suite. Brief. Bioinform. 5, 378–388 (2004).
4. Thompson, J.D., Gibson, T.J. & Higgins, D.G. Multiple sequence alignment using ClustalW and ClustalX. Curr. Protoc. Bioinformatics 2, 2.3.1–2.3.22 (2002).
5. Notredame, C., Higgins, D.G. & Heringa, J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205–217 (2000).
6. Edgar, R.C. & Batzoglou, S. Multiple sequence alignment. Curr. Opin. Struct. Biol. 16, 368–373 (2006).
    A comprehensive review of the approaches available for the alignment of many sequences.
7. Raghava, G.P., Searle, S.M., Audley, P.C., Barber, J.D. & Barton, G.J. OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics 4, 47 (2003).
8. Gouet, P., Robert, X. & Courcelle, E. ESPript/ENDscript: extracting and rendering sequence and 3D information from atomic structures of proteins. Nucleic Acids Res. 31, 3320–3323 (2003).
9. Barton, G.J. ALSCRIPT: a tool to format multiple sequence alignments. Protein Eng. 6, 37–40 (1993).
10. Goodstadt, L. & Ponting, C.P. CHROMA: consensus-based colouring of multiple alignments for publication. Bioinformatics 17, 845–846 (2001).
11. Barrio, A.M., Lagercrantz, E., Sperber, G.O., Blomberg, J. & Bongcam-Rudloff, E. Annotation and visualization of endogenous retroviral sequences using the Distributed Annotation System (DAS) and eBioX. BMC Bioinformatics 10 (suppl. 6), S18 (2009).
12. Finn, R.D. et al. The Pfam protein families database. Nucleic Acids Res. 36, D281–D288 (2008).
13. Lin, K., May, A.C. & Taylor, W.R. Amino acid encoding schemes from protein structure alignments: multi-dimensional vectors to describe residue types. J. Theor. Biol. 216, 361–365 (2002).
    The empirical analysis underlying the ‘Taylor’ amino acid color scheme; this builds on Taylor’s earlier work (1986) concerning approaches for the classification of amino acids.
14. Valdar, W.S. Scoring residue conservation. Proteins 48, 227–241 (2002).
15. Chakrabarti, S. & Lanczycki, C.J. Analysis and prediction of functionally important sites in proteins. Protein Sci. 16, 4–13 (2007).
16. Schneider, T.D. & Stephens, R.M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18, 6097–6100 (1990).
17. Schneider, T.D. Twenty years of Delila and molecular information theory: the Altenberg-Austin Workshop in Theoretical Biology biological information, beyond metaphor: causality, explanation, and unification Altenberg, Austria, 11–14 July 2002. Biol. Theory 1, 250–260 (2006).
18. Caffrey, D.R. et al. PFAAT version 2.0: a tool for editing, annotating, and analyzing multiple sequence alignments. BMC Bioinformatics 8, 381 (2007).
19. Rastogi, P.A. MacVector. Integrated sequence analysis for the Macintosh. Methods Mol. Biol. 132, 47–69 (2000).
20. Gille, C. & Robinson, P.N. HotSwap for bioinformatics: a STRAP tutorial. BMC Bioinformatics 7, 64 (2006).
21. Bailey, T.L. et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37, W202–W208 (2009).
22. Landan, G. & Graur, D. Characterization of pairwise and multiple sequence alignment errors. Gene 441, 141–147 (2009).
    To our knowledge, this is the first detailed analysis of the errors that may be introduced by tree based sequence alignment algorithms.
23. Galtier, N., Gouy, M. & Gautier, C. SEAVIEW and PHYLO_WIN: two graphic tools for sequence alignment and molecular phylogeny. Comput. Appl. Biosci. 12, 543–548 (1996).
24. Lord, P.W., Selley, J.N. & Attwood, T.K. CINEMA-MX: a modular multiple alignment editor. Bioinformatics 18, 1402–1403 (2002).
25. Waterhouse, A.M., Procter, J.B., Martin, D.M., Clamp, M. & Barton, G.J. Jalview Version 2–a multiple sequence alignment editor and analysis workbench. Bioinformatics 25, 1189–1191 (2009).
26. Margulies, E.H. & Birney, E. Approaches to comparative sequence analysis: towards a functional view of vertebrate genomes. Nat. Rev. Genet. 9, 303–313 (2008).
27. Hulo, N. et al. The 20 years of PROSITE. Nucleic Acids Res. 36, D245–D249 (2008).
28. Wingender, E. The TRANSFAC project as an example of framework technology that supports the analysis of genomic regulation. Brief. Bioinform. 9, 326–332 (2008).
29. Matys, V. et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34, D108–D110 (2006).
30. Zvelebil, M.J., Barton, G.J., Taylor, W.R. & Sternberg, M.J. Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J. Mol. Biol. 195, 957–961 (1987).
31. Chakrabarti, S. & Panchenko, A.R. Ensemble approach to predict specificity determinants: benchmarking and validation. BMC Bioinformatics 10, 207 (2009).
32. Horner, D.S., Pirovano, W. & Pesole, G. Correlated substitution analysis and the prediction of amino acid structural contacts. Brief. Bioinform. 9, 46–56 (2008).
33. Casari, G., Sander, C. & Valencia, A. A method to predict functional residues in proteins. Nat. Struct. Biol. 2, 171–178 (1995).
34. Schwarz, R. et al. Detecting species-site dependencies in large multiple sequence alignments. Nucleic Acids Res. 37, 5959–5968 (2009).
35. Joachimiak, M.P. & Cohen, F.E. JEvTrace: refinement and variations of the evolutionary trace in JAVA. Genome Biol. 3, RESEARCH0077 (2002).
36. Goldenberg, O., Erez, E., Nimrod, G. & Ben-Tal, N. The ConSurf-DB: pre-calculated evolutionary conservation profiles of protein structures. Nucleic Acids Res. 37, D323–D327 (2009).
37. Li, W. & Godzik, A. VISSA: a program to visualize structural features from structure sequence alignment. Bioinformatics 22, 887–888 (2006).
38. Brown, J.W. et al. The RNA structure alignment ontology. RNA 15, 1623–1631 (2009).
39. Chen, K., Durand, D. & Farach-Colton, M. NOTUNG: a program for dating gene duplications and optimizing gene family trees. J. Comput. Biol. 7, 429–447 (2000).
40. Vernot, B., Stolzer, M., Goldman, A. & Durand, D. Reconciliation with non-binary species trees. J. Comput. Biol. 15, 981–1006 (2008).


41. Bingham, J. & Sudarsanam, S. Visualizing large hierarchical clusters in hyperbolic space. Bioinformatics 16, 660–661 (2000).
42. Hughes, T., Hyun, Y. & Liberles, D.A. Visualising very large phylogenetic trees in three dimensional hyperbolic space. BMC Bioinformatics 5, 48 (2004).
43. Livingstone, C.D. & Barton, G.J. Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation. Comput. Appl. Biosci. 9, 745–756 (1993).
44. Sankararaman, S. & Sjolander, K. INTREPID–INformation-theoretic TREe traversal for Protein functional site IDentification. Bioinformatics 24, 2445–2452 (2008).
45. Engelen, S., Trojan, L.A., Sacquin-Mora, S., Lavery, R. & Carbone, A. Joint evolutionary trees: a large-scale method to predict protein interfaces based on sequence sampling. PLoS Comput. Biol. 5, e1000267 (2009).
46. Chevenet, F., Brun, C., Banuls, A.L., Jacq, B. & Christen, R. TreeDyn: towards dynamic graphics and annotations for analyses of trees. BMC Bioinformatics 7, 439 (2006).
47. Santamaría, R. & Theron, R. Treevolution: visual analysis of phylogenetic trees. Bioinformatics 25, 1970–1971 (2009).
48. Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics 23, 127–128 (2007).
49. Müller, J. & Müller, K. TreeGraph: automated drawing of complex tree figures using an extensible tree description format. Mol. Ecol. Notes 4, 786–788 (2004).
50. Pettifer, S. et al. Visualising biological data: a semantic approach to tool and database integration. BMC Bioinformatics 10 (suppl. 6), S19 (2009).
51. Raphael, B., Zhi, D., Tang, H. & Pevzner, P. A novel method for multiple alignment of sequences with repeated and shuffled elements. Genome Res. 14, 2336–2346 (2004).
    Introduces the partially ordered alignment algorithm and demonstrates how this graph based alignment visualization provides a more compact view of complex alignments.
52. Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009).
    Describes the CIRCOS approach for visualization of comparative genomic data, which can provide a more compact view of large multiple sequence alignments.
53. UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res. 37, D169–D174 (2009).
54. Berman, H.M. et al. The nucleic acid database. A comprehensive relational database of three-dimensional structures of nucleic acids. Biophys. J. 63, 751–759 (1992).
55. Taylor, W.R. The classification of amino acid conservation. J. Theor. Biol. 119, 205–218 (1986).
56. Mirny, L.A. & Shakhnovich, E.I. Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J. Mol. Biol. 291, 177–196 (1999).
57. Schuster-Böckler, B. & Bateman, A. Visualizing profile-profile alignment: pairwise HMM logos. Bioinformatics 21, 2912–2913 (2005).
58. Eddy, S.R. Profile hidden Markov models. Bioinformatics 14, 755–763 (1998).
59. Seibel, P.N., Muller, T., Dandekar, T. & Wolf, M. Synchronous visual analysis and editing of RNA sequence and secondary structure alignments using 4SALE. BMC Res. Notes 1, 91 (2008).
60. Wilm, A., Linnenbrink, K. & Steger, G. ConStruct: improved construction of RNA consensus structures. BMC Bioinformatics 9, 219 (2008).
61. Jossinet, F. & Westhof, E. Sequence to Structure (S2S): display, manipulate and interconnect RNA data from sequence to structure. Bioinformatics 21, 3320–3321 (2005).
62. Andersen, E.S. et al. Semiautomated improvement of RNA alignments. RNA 13, 1850–1859 (2007).
63. Gille, C. Structural interpretation of mutations and SNPs using STRAP-NT. Protein Sci. 15, 208–210 (2006).
64. Mizuguchi, K., Deane, C.M., Blundell, T.L., Johnson, M.S. & Overington, J.P. JOY: protein sequence-structure representation and analysis. Bioinformatics 14, 617–623 (1998).
65. Zmasek, C.M. & Eddy, S.R. ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics 17, 383–384 (2001).
66. Archer, J. & Robertson, D.L. CTree: comparison of clusters between phylogenetic trees made easy. Bioinformatics 23, 2952–2953 (2007).
67. Huson, D.H. et al. Dendroscope: an interactive viewer for large phylogenetic trees. BMC Bioinformatics 8, 460 (2007).
68. Perrière, G. & Gouy, M. WWW-query: an on-line retrieval system for biological sequence banks. Biochimie 78, 364–369 (1996).
69. Hillis, D.M., Heath, T.A. & St. John, K. Analysis and visualization of tree space. Syst. Biol. 54, 471–482 (2005).
    A demonstration of different kinds of tree visualization, and an examination of how spatial techniques such as multidimensional scaling can be used to visualize and compare ensembles of trees.
70. Page, R.D. TreeView: an application to display phylogenetic trees on personal computers. Comput. Appl. Biosci. 12, 357–358 (1996).
71. Munzner, T., Guimbretiere, F., Tasiran, S., Zhang, L. & Zhou, Y. TreeJuxtaposer: scalable tree comparison using focus+context with guaranteed visibility. ACM Trans. Graph. 22, 453–462 (2003).
72. Kumar, S., Nei, M., Dudley, J. & Tamura, K. MEGA: a biologist-centric software for evolutionary analysis of DNA and protein sequences. Brief. Bioinform. 9, 299–306 (2008).
73. Huson, D.H. & Bryant, D. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. 23, 254–267 (2006).
    Describes the phylogenetic network visualization approach implemented in SplitsTree4, where evolutionary distance and bootstrap support are represented in one network structure, rather than an annotated tree.
74. Milne, I. et al. TOPALi v2: a rich graphical interface for evolutionary analyses of multiple alignments on HPC clusters and multi-core desktops. Bioinformatics 25, 126–127 (2009).
75. Jordan, G.E. & Piel, W.H. PhyloWidget: web-based visualizations for the tree of life. Bioinformatics 24, 1641–1642 (2008).
76. Prlić, A. et al. Integrating sequence and structural biology with DAS. BMC Bioinformatics 8, 333 (2007).
77. Berman, H.M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
78. Thompson, J.D. et al. MACSIMS: multiple alignment of complete sequences information management system. BMC Bioinformatics 7, 318 (2006).
79. Barrell, D. et al. The GOA database in 2009–an integrated Gene Ontology Annotation resource. Nucleic Acids Res. 37, D396–D403 (2009).
80. The Gene Ontology’s Reference Genome Project. A unified framework for functional annotation across species. PLoS Comput. Biol. 5, e1000431 (2009).
81. Reeves, G.A. et al. The Protein Feature Ontology: a tool for the unification of protein feature annotations. Bioinformatics 24, 2767–2772 (2008).
82. Eilbeck, K. et al. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 6, R44 (2005).
83. Sayers, E.W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 37, D5–D15 (2009).
84. Holder, M. & Lewis, P.O. Phylogeny estimation: traditional and Bayesian approaches. Nat. Rev. Genet. 4, 275–284 (2003).
85. Swofford, D.L., Olsen, G.J., Waddell, P.J. & Hillis, D.M. Phylogenetic inference. in Molecular Systematics (eds. Hillis, D.M., Moritz, C. & Mable, B.K.) 407–514 (Sinauer, Sunderland, Massachusetts, USA, 1996).
86. Felsenstein, J. Inferring Phylogenies (Sinauer, Sunderland, Massachusetts, USA, 2004).
87. Felsenstein, J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376 (1981).
88. Huelsenbeck, J.P., Ronquist, F., Nielsen, R. & Bollback, J.P. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294, 2310–2314 (2001).


Visualization of multiple alignments, phylogenies and gene family evolution James B Procter, Julie Thompson, Ivica Letunic, Chris Creevey, Fabrice Jossinet & Geoffrey J Barton

Supplementary figures and text:

Supplementary Figure 1 BLAST results for the human aryl sulfatase sequence as viewed in VectorNTI and Geneious

Supplementary Figure 2 Annotated visualization of the Pfam alignment for the sulfatase family and linked view of PDB structure 1fsu

Supplementary Figure 3 Visualizing the NCBI taxonomy

Supplementary Figure 4 Tree-based alignment analysis

Supplementary Figure 5 Annotating phylogenetic trees with complex data

Nature Methods: doi: 10.1038/nmeth.1434

Supplementary Figure 1 | BLAST results for the human aryl sulfatase sequence as viewed in VectorNTI (a) and Geneious (b). See main text for details.

[Figure labels: hierarchical hit list for further details; hit distribution and consensus profile on query positions; alignment trace showing context of aligned hits on query sequence; bird's-eye view of segments; alignment of query and hit; slider sets threshold to grey out clades in hit list; distance tree for hits in clade; top hit.]

Supplementary Figure 2 | Annotated visualization of the Pfam alignment for the sulfatase family and linked view of PDB structure 1fsu.

Sulfatases are a highly conserved enzyme family. They hydrolyze sulfate ester bonds in a variety of structurally diverse compounds but have similar overall folds, mechanisms of action, and bivalent metal ion-binding sites1,2. (a) Pfam family alignment rendered with Jalview. Knowledge of the substrate specificity for each sequence allows the alignment to be divided into six functional sub-families (names on the left were added manually). Conserved sequence regions calculated by MACSIMS are indicated by colored shapes. Regions 1–4 (above the alignment) are shared by all the sub-families. Within these, highlighted single residues correspond to known functionally active sites. Disulphide bond annotation above the alignment was obtained from PDB sequence 1n2l (ARSA_HUMAN). Secondary structure annotation from PDB structure 1fsu (ARSB_HUMAN) is shown below the alignment, above the Livingstone and Barton conservation score. (b) Image taken from Jalview’s linked Jmol view of 1fsu, showing the structural context of regions 1–4 in a. (c) Close-up of the region underscored in red in a, with annotated regions of sequences colored according to type and origin, locating PROSITE motifs and known mutations in the alignment. The inset box shows a close-up of the Jalview tooltip conveying additional annotation information. Sequence annotations obtained from public databases (UniProt2, PDB3 or InterPro4) are colored using a dark shade. Features shown in a lighter shade were inferred by MACSIMS, which propagates these known properties to the uncharacterized sequences.

[Supplementary Figure 2 image, panels a–c. Recoverable labels: Region 1 to Region 4; sub-family rows ARSA, ARSG, ARS STS/ARSE/F, ARSB, ARSI/J; disulphide bond; annotation key: PROSITE, sulfatase1, sulfatase2, metal binding, active site, glycosylation, NAG binding, hydrophobic, mutation.]

Supplementary Figure 3 | Visualizing the NCBI taxonomy. Perhaps the closest we can get to a visualization of the entire tree of life. In this figure, taken from Hughes et al.5, Walrus and Phylo3D were used to visualize the complete NCBI taxonomy, containing close to 200,000 species, in a hyperbolic 3D space. Bacteria are in focus in the image and shown in orange. Eukaryotes are shown in yellow on the left-hand side. Archaea are represented by the red nodes at the top and in the background of the image.
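The hyperbolic projection behind displays of this kind can be illustrated in two dimensions. The Python sketch below is not the Walrus, Phylo3D or H3Viewer algorithm. It assumes the standard Poincaré disc model, in which a point at hyperbolic distance d from the focus is drawn at Euclidean radius tanh(d/2), and the function names are hypothetical.

```python
import math


def disc_radius(hyperbolic_distance: float) -> float:
    """Euclidean radius in the unit Poincare disc for a point at the given
    hyperbolic distance from the focus (the disc centre)."""
    return math.tanh(hyperbolic_distance / 2.0)


def layout_chain(n_nodes: int, step: float = 1.0) -> list[float]:
    """Radii for nodes spaced `step` hyperbolic units apart along one ray."""
    return [disc_radius(i * step) for i in range(n_nodes)]


radii = layout_chain(8)
# On-screen (Euclidean) distance between consecutive nodes on the ray.
gaps = [outer - inner for inner, outer in zip(radii, radii[1:])]

for depth, (radius, gap) in enumerate(zip(radii[1:], gaps), start=1):
    print(f"depth {depth}: radius {radius:.4f}, on-screen gap {gap:.4f}")
```

Every node stays inside the unit circle, and nodes far from the focus crowd toward the rim. With a step of one hyperbolic unit, each on-screen gap shrinks by a factor that approaches 1/e, which is the exponential compression that gives a focus+context view. Moving the focus to another node and recomputing the radii magnifies that part of the tree instead.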

Supplementary Figure 4 | Tree-based alignment analysis. (a) Tree-based alignment analysis of the sulfatase family using JEvTrace, applied to the Pfam family alignment and tree. The panel on the left shows the algorithm’s automatically generated tree partition. The alignment, shown adjacent to the leaves, is annotated with regions found to exhibit sub-family conservation. A view of the associated PDB structure (1fsu) is shown on the right, colored according to sub-family-specific mutations. (b) Snapshot from Jalview showing the same region of the sulfatase alignment as in Supplementary Figure 2c, with Clustal conservation-based shading and coloring applied to each subgroup. This rendering style reveals sub-family-specific conservation patterns that generally contain the residues known to be involved in substrate binding (compare the annotation in Supplementary Figure 2c). (c) Neighbor-joining tree for the alignment in Supplementary Figure 2a calculated with Jalview, with sub-trees corresponding to each sub-family highlighted with different colors.
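The idea of sub-family-specific conservation can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the JEvTrace algorithm or the Clustal conservation scheme used by Jalview. It assumes the groups have already been defined (manually or from a tree), uses invented toy sequences, and scores a column only by the frequency of its most common residue. All names are hypothetical.

```python
from collections import Counter


def column_consensus(rows: list[str], col: int) -> tuple[str, float]:
    """Most common residue in one alignment column and its frequency."""
    counts = Counter(row[col] for row in rows)
    residue, n = counts.most_common(1)[0]
    return residue, n / len(rows)


def subfamily_specific_columns(
    groups: dict[str, list[str]], min_identity: float = 1.0
) -> list[int]:
    """Columns conserved within every group whose consensus residue
    differs between at least two groups."""
    length = len(next(iter(groups.values()))[0])
    hits = []
    for col in range(length):
        consensus = [column_consensus(rows, col) for rows in groups.values()]
        conserved_within = all(frac >= min_identity for _, frac in consensus)
        differs_between = len({res for res, _ in consensus}) > 1
        if conserved_within and differs_between:
            hits.append(col)
    return hits


# Invented toy alignment split into two groups (assumption, for illustration).
groups = {
    "subfamily_1": ["MKCTL", "MKCSL", "MKCTL"],
    "subfamily_2": ["MRCTV", "MRCAV", "MRCTV"],
}
print(subfamily_specific_columns(groups))  # zero-based column indices
```

Columns that are invariant across the whole family (here the first and third) are not reported, because they do not distinguish the groups. Columns that vary within a group are also skipped, since they are not conserved at the sub-family level.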


Supplementary Figure 5 | Annotating phylogenetic trees with complex data.

iTOL6 was used to annotate an automatically generated tree of life7. Blue bar charts represent the genome sizes, defined as the number of predicted protein-coding genes. Pie charts show the distribution of preferred habitats for various taxa identified in several metagenomics sequencing projects8.


References

1. Ghosh, D. Human sulfatases: a structural perspective to catalysis. Cell Mol Life Sci 64, 2013-22 (2007).
2. The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res 37, D169-74 (2009).
3. Berman, H.M. et al. The Protein Data Bank. Nucleic Acids Res 28, 235-42 (2000).
4. Hunter, S. et al. InterPro: the integrative protein signature database. Nucleic Acids Res 37, D211-5 (2009).
5. Hughes, T., Hyun, Y. & Liberles, D.A. Visualising very large phylogenetic trees in three dimensional hyperbolic space. BMC Bioinformatics 5, 48 (2004).
6. Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics 23, 127-8 (2007).
7. Ciccarelli, F.D. et al. Toward automatic reconstruction of a highly resolved tree of life. Science 311, 1283-7 (2006).
8. von Mering, C. et al. Quantitative phylogenetic assessment of microbial communities in diverse environments. Science 315, 1126-30 (2007).


Visualization of image data from cells to organisms

Thomas Walter1, David W Shattuck2, Richard Baldock3, Mark E Bastin4, Anne E Carpenter5, Suzanne Duce6, Jan Ellenberg1, Adam Fraser5, Nicholas Hamilton7, Steve Pieper8, Mark A Ragan7, Jurgen E Schneider9, Pavel Tomancak10 & Jean-Karim Hériché1

Advances in imaging techniques and high-throughput technologies are providing scientists with unprecedented possibilities to visualize internal structures of cells, organs and organisms and to collect systematic image data characterizing genes and proteins on a large scale. To make the best use of these increasingly complex and large image data resources, the scientific community must be provided with methods to query, analyze and crosslink these resources to give an intuitive visual representation of the data. This review gives an overview of existing methods and tools for this purpose and highlights some of their limitations and challenges.

By their very nature, microscopy and magnetic resonance imaging (MRI) (Fig. 1 and Boxes 1 and 2) are dependent on data visualization. Whereas in the past it was considered sufficient to show images (or digitized images) in the printed version of an article to illustrate an experimental result, the presentation of image data has become more challenging for three reasons. First, new imaging techniques allow the generation of massive datasets that cannot be adequately presented on paper nor be browsed and looked at with older software tools. MRI, which is mostly used to acquire three-dimensional (3D) imagery, has faced some of these problems for many years. Second, the availability of high-throughput techniques enables experiments on a large scale, generating large sets of image data, and even though the readout of each single experiment may be easily visualized, this is no longer true for whole screens consisting of thousands of such experiments. Third, microscopy and MRI are increasingly part of a broader analytical context that may include quantitative measurement, statistical analysis, mathematical modeling and simulation and/or automated reasoning over multiple datasets reflecting different properties and possibly resulting from different acquisition techniques at different scales of resolution, often generated at different institutions. This review describes how the visualization challenges in these three areas are addressed for a range of imaging modalities.

To be useful to the immediate research group and more broadly to the scientific community, massive datasets must be presented in a way that enables them to be browsed, analyzed, queried and compared with other resources: not only other images but also molecular sequences, structures, pathways and regulatory networks, tissue physiology and micromorphology. In addition, intuitive and efficient visualization is important at all intermediate steps in such projects: proper visualization tools are indispensable for quality control (for example, identification of dead cells, 'misbehaving' markers or image acquisition artifacts), the sharing of generated resources among a network of collaborators or the setup and validation of an automated analysis pipeline.

The first section of this review briefly describes issues related to digital images. The second section deals with visualization techniques for complex multidimensional image datasets at relatively low throughput. Next, we discuss typical visualization problems arising with an increase in scale: here, the challenges are to provide tools allowing the user to navigate through large image-derived datasets at different levels of abstraction and to develop meaningful profiles and clustering methods. The last section deals with how images can be shared with collaborators or with the community. Finally, we conclude with the need for integration and linking of different image-based source data (including computational models) into a comprehensive view of biological entities.

1European Molecular Biology Laboratory, Heidelberg, Germany. 2Laboratory of Neuro Imaging, University of California, Los Angeles, Los Angeles, California, USA. 3Medical Research Council Human Genetics Unit, Institute of Genetics and Molecular Medicine, Edinburgh, UK. 4Medical and Radiological Sciences (Medical Physics), University of Edinburgh, Edinburgh, UK. 5Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA. 6Division of Biological Chemistry and Drug Discovery, College of Life Sciences, University of Dundee, Dundee, UK. 7The University of Queensland, Institute for Molecular Bioscience, Brisbane, Australia. 8Isomics, Inc., Cambridge, Massachusetts, USA. 9British Heart Foundation Experimental Magnetic Resonance Unit (BMRU), Department of Cardiovascular Medicine, University of Oxford, Oxford, UK. 10Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany. Correspondence should be addressed to J.-K.H. ([email protected]).

Published online 1 March 2010; doi:10.1038/nmeth.1431

S26 | VOL.7 NO.3s | MARCH 2010 | nature methods supplement

Accessing the images

Digital representation of images. The use of digital images as a convenient replacement for photographic film has paved the way for the increase in the volume of images produced. While we expect a digital image to carry the same amount of visual information as its analog counterpart, it is amenable to faster and more complex processing, and the task of viewing an image is complicated by the lack of standard image representation. Whereas photographic film used to provide a common format for image representation, digital images have different formats with respect to the number of bits per pixel or whether the encoded values are signed or unsigned. Although most image-handling software programs support unsigned 8-bit images (values between 0 and 255) and unsigned 16-bit images (values between 0 and 65,535), care must be taken with more 'exotic' formats, such as unsigned 12-bit images (values between 0 and 4,095) or signed 16-bit images (values between -32,768 and 32,767), which are routinely produced by modern imaging equipment. If, for instance, an unsigned 12-bit image is simply interpreted as an unsigned 16-bit image, only ~6% of the dynamic range will be used and the images may appear 'dark'. If the image is rescaled to cover the maximal dynamic range (as is the default behavior of many image viewers), the absolute intensity information is lost, which makes any comparison between different images impossible. Signed values are also often misinterpreted by the image-handling software (for example, negative values may be ignored). Although the above may be trivial issues for imaging experts, they are pitfalls routinely encountered by biologists.

Image file formats. The fields of microscopy and MRI both face significant challenges in the sharing and processing of data owing to the variety of digital file formats that are used.

Figure 1 | Imaging techniques. (a) Brightfield microscopy: mouse embryo, in situ expression pattern of Irx1, Eurexpress; scale bar, 2 mm. (b) Fluorescence microscopy: HT29 cells stained for DNA (blue), actin (red) and phospho-histone H3 (green)75; scale bar, 20 µm. (c) Confocal microscopy: actin polymerization along the breaking nuclear envelope during meiotic maturation of a starfish oocyte. Actin filaments, red (rhodamine-phalloidin stain); chromosomes, cyan (Hoechst 33342 stain). Projection of confocal sections (image courtesy P. Lénárt); scale bar, 20 µm. (d) Bioluminescence imaging: in vivo bioluminescence imaging of mice after implantation of Gli36-Gluc cells76 (figure courtesy B.A. Tannous). (e) Optical projection tomography: mouse embryo, EMAP33,66; scale bar, 1 mm. (f) Single/selective plane illumination microscopy: late-stage Drosophila embryo probed with anti-GFP antibody and DRAQ5 nuclear marker: frontal, caudal, lateral and ventral views of the same embryo77; scale bar, 50 µm. (g) Transmission electron microscopy: human fibroblast, glancing section close to surface (image courtesy R. Parton and M. Floetenmeyer); scale bar, 100 nm. (h) Scanning electron microscopy: zebrafish peridermal skin cells (courtesy R. Parton and M. Floetenmeyer); scale bar, 10 µm. (i) microMRI: mouse embryo (source: http://mouseatlas.caltech.edu/); scale bar, 5 mm. (j) T2-weighted MRI: human cervical spine (source: http://www.radswiki.net/); scale bar, 5 cm. (k) Fluid attenuation inversion recovery (FLAIR) image of a human brain with acute disseminated encephalomyelitis. Bright areas indicate demyelination and possibly some edema (image courtesy N. Salamon); scale bar, 5 cm. (l) Diffusion-weighted image of a human brain after a stroke. Bright areas indicate areas of restricted diffusion (image courtesy N. Salamon); scale bar, 5 cm. (m) Maximum intensity projection image of a magnetic resonance angiogram of a C57BL/6J mouse brain acquired in vivo using blood pool contrast78 (image courtesy G. Howles); scale bar, 5 mm. (n) 3D proton magnetic resonance spectroscopic imaging study of normal human brain. Graph shows proton spectrum for the brain location identified by yellow markers on the T1-weighted MRI (lower left) and N-acetylaspartate (NAA; lower right) images. Data acquired using the MIDAS/EPSI methodology79 (image courtesy J. Alger); scale bar, 5 cm. (o) Functional MRI activation map overlaid on a T1-weighted MRI: human brain (image courtesy L. Foland-Ross); scale bar, 5 cm. (p) Direction-encoded color map computed from DTI. Red, left–right directionality; green, anterior–posterior; blue, superior–inferior; scale bar, 5 cm.
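The bit-depth pitfalls described under "Digital representation of images" can be made concrete with a small sketch. The helper names and pixel values below are hypothetical and chosen only for illustration; the code shows why 12-bit data occupy about 6% of a 16-bit range, why a min-max stretch (assumed here as a stand-in for a viewer's default rescaling) removes absolute intensity information, and how a signed 16-bit value can be misread as unsigned.

```python
def used_dynamic_range(source_bits: int, container_bits: int) -> float:
    """Fraction of the container's value range that the source data can occupy."""
    return (2 ** source_bits) / (2 ** container_bits)


def rescale_to_full_range(pixels, container_bits=16):
    """Min-max stretch to the full container range (absolute values are lost)."""
    lo, hi = min(pixels), max(pixels)
    top = 2 ** container_bits - 1
    if hi == lo:
        return [0 for _ in pixels]
    return [round((p - lo) * top / (hi - lo)) for p in pixels]


def as_signed16(raw: int) -> int:
    """Reinterpret an unsigned 16-bit word as a signed 16-bit value."""
    return raw - 65536 if raw >= 32768 else raw


# 12-bit data read as 16-bit: 4096 / 65536 of the range is used.
fraction = used_dynamic_range(12, 16)

# Two hypothetical images that differ tenfold in absolute intensity.
dim = [100, 200, 400]
bright = [1000, 2000, 4000]
dim_stretched = rescale_to_full_range(dim)
bright_stretched = rescale_to_full_range(bright)

print(f"fraction of range used: {fraction:.4f}")
print("stretched images identical:", dim_stretched == bright_stretched)
print("0xFFFF read as signed 16-bit:", as_signed16(0xFFFF))
```

After stretching, the two images are indistinguishable even though one is ten times brighter, which is why rescaled images should not be compared quantitatively.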

For microscopy images, no format has been adopted as a universal standard. Faced with a choice, many new users are unaware that image quality is degraded when using a file format that relies on a lossy compression algorithm (for example, JPEG). Image files can also hold further information about the image. Instrument manufacturers use either a proprietary format or a customized version of a pre-existing extensible format (for example, TIFF) to include metadata such as the time the image was acquired within the image file itself. These embedded metadata usually do not survive conversion to another format. To address this important issue, the BioFormats project has been working to create translators for a variety of image formats and has accomplished this task for over 70 image file formats so far (http://www.loci.wisc.edu/software/bio-formats).

However, most high-dimensional and high-throughput projects require devising a system to store and query further metadata about the images. For example, interpreting a time-lapse experiment requires understanding which images represent which time points for which samples, and there is no standard way of organizing the images to

BOX 1 MICROSCOPY TECHNIQUES

Brightfield microscopy with colorimetric stains is the primary technique for capturing tissue and whole organism morphology (Fig. 1a). For high-throughput capture of in situ expression patterns, automated bright-field microscopy has been used for whole-genome projects such as the Allen Brain Atlas.

Widefield fluorescence microscopy is the most widely used imaging technique in biology (Fig. 1b). Fluorescent markers make it possible to see particular structures with high contrast, either in fixed samples using immunostaining or in living cells with expressed GFP-tagged proteins83. The resolution is limited by diffraction to about 200 nm.

Confocal scanning microscopy generates optical sections through a specimen by pointwise scanning of different focal planes and thereby reduces both scattered light from the focal plane and out-of-focus light84. The image quality of two-dimensional images is therefore improved, and 3D images can be taken (axial resolution is typically 2–3 times lower than lateral resolution; see Fig. 1c). The method is also applicable to live cell imaging. There are variants of this method increasing axial resolution (for example, 4Pi microscopy)85.

Computational optical sectioning microscopy (COSM) achieves optical sectioning by taking a series of two-dimensional images with a widefield microscope focusing in different planes of the specimen84. Out-of-focus light is then removed computationally.

Structured illumination microscopy acquires several widefield images at different focal planes using spatial illumination patterns84. As the out-of-focus light is less dependent on the spatial illumination pattern than the in-focus light, combinations of different images at the same focal plane under laterally shifted illumination patterns allow computational attenuation of out-of-focus light.

Two-photon microscopy is similar to confocal scanning microscopy but uses nonlinear excitation involving two-photon (or multiphoton) absorption86. This allows the use of longer excitation wavelengths, permitting deeper penetration into the tissue and, owing to the nonlinearity, confines emission to the perifocal region, leading to substantial reduction of scattering.

Super-resolution fluorescence microscopy groups several recently developed methods in light microscopy capable of significantly increasing resolution and visualizing details at the nanometer scale. In stimulated emission depletion (STED) microscopy85, the focal spot is 'narrowed' by overlapping it with a doughnut-shaped spot that prevents the surrounding fluorophores from fluorescing and thereby contributing to the collected light. In PALM (photo-activated localization microscopy)87 and STORM (stochastic optical reconstruction microscopy)88, subsets of the fluorophores present are activated and localized. Iterating this process and combining the acquired raw images yields a high-resolution image.

Bioluminescence imaging (Fig. 1d) is based on the detection of light produced by luciferase-mediated oxidation of a substrate in living organisms. Transfected cells expressing luciferase can be injected into animals, or transgenic animals can be created that express luciferase as a reporter gene. When such animals are injected with a luciferase substrate, light is produced by the luciferase-expressing cells in the presence of oxygen. The bioluminescence image is often superimposed on a white-light image to show localization of the light-producing cells.

Optical projection tomography captures object projections in different directions as line integrals of the transmitted light89 (Fig. 1e). From these projections (corresponding to the 'shadow' of the object), a volumetric model can be calculated by means of back-projection algorithms.

Light sheet–based fluorescence microscopy uses a thin sheet of laser light for optical sectioning and a perpendicularly oriented objective with a CCD camera for detection of the fluorescent signal. Single- or selective plane illumination microscopy (SPIM)90 (Fig. 1f) adds sample rotation that enables acquisition of large samples from multiple angles. Low phototoxicity, high acquisition speed and ability to cover large samples make it particularly suitable for in toto time-lapse imaging of developing biological specimens, such as model organism embryos, with cellular resolution.

Transmission electron microscopy (TEM) (Fig. 1g) uses accelerated electrons instead of visible light for imaging. As a result, the achievable resolution (typically 2 nm) is much higher than in light microscopy. The method is not applicable to live cell imaging, and the specimen preparation is technically very complex. In electron tomography, the specimen is physically sectioned and 3D images are obtained by imaging each section at progressive angles of rotation, followed by computational reassembly to yield a tomogram. Resolution ranges from 20–30 nm to 5 nm or less.

Scanning electron microscopy (SEM) (Fig. 1h) produces an image of the 3D structure of the surface of the specimen by collecting the scattered electrons (rather than the transmitted electrons as in TEM). The resolution is typically lower than for TEM.
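The figure of about 200 nm quoted in Box 1 for diffraction-limited fluorescence microscopy can be related to the textbook Abbe estimate, d = wavelength / (2 × numerical aperture). The formula is not given in the review, and the wavelength and numerical aperture below are assumed, illustrative values (roughly green emission and a high-aperture oil objective); the sketch is only meant to show the order of magnitude.

```python
def abbe_lateral_limit_nm(wavelength_nm: float, numerical_aperture: float) -> float:
    """Abbe estimate of the smallest resolvable lateral distance, in nanometers."""
    if wavelength_nm <= 0 or numerical_aperture <= 0:
        raise ValueError("wavelength and numerical aperture must be positive")
    return wavelength_nm / (2.0 * numerical_aperture)


# Assumed illustrative values, not taken from the review.
lateral_nm = abbe_lateral_limit_nm(wavelength_nm=510.0, numerical_aperture=1.4)

# Box 1 states that axial resolution is typically 2-3 times lower than lateral.
axial_range_nm = (2.0 * lateral_nm, 3.0 * lateral_nm)

print(f"lateral limit: about {lateral_nm:.0f} nm")
print(f"axial estimate: about {axial_range_nm[0]:.0f} to {axial_range_nm[1]:.0f} nm")
```

With these assumed inputs the lateral estimate comes out a little under 200 nm, consistent with the value given in the box.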

reflect this information (typically, a time-lapse experiment is stored as a stack of images, where the time information is encoded in the file names). Hence, researchers must often rely on their notes to determine what each image represents, which becomes an issue, particularly when the data are to be shared between collaborators. The most common practice is to duplicate images and share metadata in spreadsheets, although a suitable laboratory information management system (LIMS) informatics platform could be used for managing the metadata in a reliable and convenient way. An attempt to overcome these issues is the OMERO platform from the Open Microscopy Environment (OME), which provides a client-server system for managing images and their associated metadata through a common interface1–3.

Commercial microscopy and image analysis software companies often engage in format 'wars', whereas open-source solutions struggle to bridge the gaps among the many proprietary formats. A movement toward universally adopted standards, with a degree of data integration like that which has been achieved for genome sequences (for example, GenBank) and microarray data (for example, MIAME), must become a common goal of industry and academia.

MRI is an inherently digital medium and similarly faces problems with file formats. Acquisition systems from different scanners often use proprietary file formats. Though clinical scanners support the DICOM standard managed by NEMA, the Association of Electrical and Equipment Manufacturers, writing a validly formatted DICOM image file is neither practical nor required for many academic imaging projects. In addition, emerging imaging techniques are often not fully standardized within DICOM, and implementation of the standard varies by scanner vendor. As a result, investigators often rely on file formats that are both simpler and better able to capture the parameters required for their particular domains. The Analyze 7.5 file format (Analyze Direct) has been widely used in many software packages, but its interpretation often differs among these. As a result, ambiguities arise regarding the orientation of the stored data, and great care must be taken to ensure that the right and left sides of the image volume are interpreted correctly. Furthermore, the Analyze format is not designed to store much of the metadata that is contained in DICOM or other proprietary formats.

The NIfTI file format4 (http://nifti.nimh.nih.gov/nifti-1/) was recently developed to address many of these problems and is rapidly becoming the standard in the community. It is supported by many of the popular image analysis suites and provides unambiguous information about image orientation, additional

BOX 2 MAGNETIC RESONANCE IMAGING TECHNIQUES

Magnetic resonance imaging uses the intrinsic nuclear magnetization of materials to probe their general physical and chemical structure. A sample to be imaged is first placed in a strong static magnetic field. Gradients in the static field force the Larmor frequency (resonance frequency) of the sample's atomic nuclei to be a function of their spatial position within the sample space. The sample is then excited by a carefully crafted radio frequency electromagnetic pulse that deflects the magnetic moments of the sample's nuclei away from their steady-state orientation. The relaxation of the magnetic moments back to their steady state creates a radio frequency echo that is detected by an acquisition system. The composition of the material, the spatially dependent Larmor frequency and the magnetic pulse itself determine the characteristics of that echo. Variations in the power, orientation and duration of the radio frequency pulse allow different tissue properties to be probed while retaining some details of differentiation (different composition) and position. Paramagnetic T1 contrast agents, such as gadolinium, may be injected into the subject. The agent alters the relaxation characteristic of water, and the image appears hyperintense in areas of contrast agent concentration; applications include vascular imaging (Fig. 1m) and detection of active tumors or lesions. Some of the widely used acquisition methods are described below. Clinical MRI devices typically use static field strengths in the range of 1.5–3T and have resolutions on the order of 1 mm. Small-animal scanners apply the same principles but use stronger field strengths (typically in the range 7–11T) and are capable of resolutions on the order of tens of microns.

T1 applies a short excitation time and a short relaxation time; fat appears bright, water appears dark. In brain images, white matter appears bright, gray matter slightly darker and cerebrospinal fluid very dark (Fig. 1o).

T2 typically uses a long excitation time and a long relaxation time; fluid (for example, cerebrospinal fluid) appears bright in these scans, and fat is less bright (Fig. 1j).

T2* (usually pronounced "T2-star") is observed in long-excitation-time gradient echo images; contrast is sensitive to local magnetic field inhomogeneities produced, for example, by iron oxide T2 contrast agents and air-tissue interfaces.

Proton density information is obtained from scans with a short excitation time and a long relaxation time, or by extrapolating relaxation-weighted datasets back to zero time.

Fluid attenuation inversion recovery (FLAIR) pulse sequences suppress the fluid signal, which allows otherwise hidden fluid-covered lesions to be observed (Fig. 1k).

Magnetic resonance angiography uses the water proton signal to produce millimeter-scale images of arteries and veins without the addition of contrast agents.

Magnetic resonance spectroscopy acquires localized spectra from a defined region within the sample, with spectral peaks indicating the presence of various metabolites or biomolecules such as lactate, creatine, phosphocreatine and glutamate (Fig. 1n).

Functional MRI (fMRI) measures the signal change that occurs when blood is deoxygenated; neuronal activity relates to increased oxygen demand, allowing maps of activation to be made by examining the blood oxygenation level–dependent (BOLD) signal (Fig. 1o).

Diffusion MRI uses the reduction in the detected MR signal produced by diffusion of water molecules along the magnetic gradient. Areas with lower diffusion are affected less than areas with high diffusion, producing brighter signals (Fig. 1l). Performing multiple acquisitions with different gradients and field strengths allows models of the directionality of the local diffusion properties to be resolved in the form of diffusion tensors (DTI) or more complicated patterns (Q-Ball and DSI). The diffusion properties are governed by local physical structures in the material (Fig. 1p).
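Box 2 states that field gradients make the Larmor frequency a function of position. A minimal sketch of that relationship, for a single linear gradient along one axis, is f(x) = (γ/2π) · (B0 + G · x). The proton value γ/2π ≈ 42.577 MHz/T is a standard physical constant that the review does not quote, and the gradient strength and positions below are hypothetical values picked for illustration.

```python
# Proton gyromagnetic ratio divided by 2*pi, in MHz per tesla
# (standard constant, not quoted in the review).
GAMMA_BAR_MHZ_PER_T = 42.577


def larmor_mhz(b0_tesla: float, gradient_t_per_m: float = 0.0,
               position_m: float = 0.0) -> float:
    """Proton resonance frequency at a position along a linear field gradient."""
    local_field = b0_tesla + gradient_t_per_m * position_m
    return GAMMA_BAR_MHZ_PER_T * local_field


def position_from_frequency(freq_mhz: float, b0_tesla: float,
                            gradient_t_per_m: float) -> float:
    """Invert the relation: recover position from the observed frequency."""
    if gradient_t_per_m == 0:
        raise ValueError("a non-zero gradient is needed to encode position")
    return (freq_mhz / GAMMA_BAR_MHZ_PER_T - b0_tesla) / gradient_t_per_m


# Field strengths taken from the ranges mentioned in Box 2.
for b0 in (1.5, 3.0, 7.0):
    print(f"B0 = {b0} T -> {larmor_mhz(b0):.2f} MHz")

# Hypothetical gradient of 10 mT/m; a nucleus 5 cm from the center.
freq = larmor_mhz(3.0, gradient_t_per_m=0.01, position_m=0.05)
recovered = position_from_frequency(freq, 3.0, 0.01)
print(f"frequency at 5 cm: {freq:.4f} MHz, recovered position: {recovered:.3f} m")
```

Because frequency maps one-to-one onto position while the gradient is applied, the frequency content of the detected echo carries spatial information, which is the basis of the position encoding described in the box.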

codes that describe aspects of the image including its intent and a standardized method for adding extensions to the format. Although standardized formats address many interoperability issues, significant challenges remain in digitally describing the full experimental paradigm used to collect the data. For example, functional MRI stimulus paradigms must typically be hand coded into an application-specific proprietary format for statistical analysis. Similar issues appear in the analysis of dynamic contrast enhanced images, diffusion images and other new scanning techniques.

Visualization of high-dimensional image data

As technology develops, images are carrying more and more information in the form of additional dimensions. Typically, these dimensions correspond to space (3D imaging techniques; Fig. 1c,e–f,i–p), time (for example, live cell imaging, functional MRI; Fig.

Figure 2 | Visualization of high-dimensional image data. (a) SPIM scan of autofluorescent adult Drosophila female gives an impression of 3D rendering in maximum intensity projection (image courtesy D.J. White); scale bar, 100 µm. (b) Maximum-intensity projection of tiled 3D multichannel acquisition of Drosophila larval nervous system; scale bar, 400 µm. (c) The corresponding 3D rendering in Fiji 3D viewer; borders of the tiles are highlighted; scale bar, 100 µm. (d) Visualization of gastrulation in Drosophila expressing His-YFP in all cells by time-lapse SPIM microscopy. The images show six reconstructed time points covering early Drosophila embryonic development rendered in Fiji 3D viewer. Fluorescent beads visible around sample were used as fiduciary markers for registration of multi-angle SPIM acquisition; scale bar, 100 µm. (e,f) Two consecutive slices from serial section transmission electron microscopy dataset of first-instar larval brain. Yellow marks, corresponding SIFT features that can be used for registration; yellow grid, position and orientation of one of the SIFT descriptors; inset, corresponding pixel intensities in the area covered by the descriptor; scale bar, 1 µm. (g) Multimodal acquisition of Drosophila first-instar larval brain by confocal (red, green) and electron microscopy (underlying gray). The two separate specimens were registered using manually extracted corresponding landmarks (not shown). Main anatomical landmarks of the brain correspond in the two modalities after registration (white labels). (Electron microscopy images courtesy A. Cardona; confocal image courtesy V. Hartenstein). Scale bar, 20 µm.

cells and annotations of subcompartments or tissues. Care must be taken in the interpretation of visualized samples, analysis results and derived measurements, as each acquisition method has its own resolution limitation, and therefore not all biological structures might be imaged at sufficient resolution to show relevant detail. Last but not least, intuitive visualization using simulated behavior of biological entities can aid understanding of scientific methods, models and hypotheses not only for scientists themselves but also for the general public. Visualization and analysis of many complex datasets are beyond the capabilities of existing software packages and rely on cutting-edge research in and computer vision fields.

In most biological experiments, visualization means displaying the variations in several channels over the spatiotemporal dimensions. As standard computer monitors can only display two spatial dimensions directly, some sort of data reduction must be applied to visualize multidimensional images. The simplest solution is to display only selected dimensions from the multidimensional dataset at a time (for instance, one two-dimensional image) and allow the user to interactively change the remaining dimensions. Because computer memory becomes limiting for large datasets, multidimensional image browsers must ensure that only data that are being viewed are loaded into memory. Proper memory management is particularly important for online browsing applications that must minimize the amount of image data transferred between the client and the server5.

3D visualization techniques

Multidimensional images can typically be observed as a collection of separate slice planes, but often dimensions are combined using various projection methods to form a single display object (Fig. 2). For two-dimensional display, one spatial dimension can be collapsed by an orthographic projection (for example, maximum intensity projection), creating a partially flattened image (Fig. 2a,b). The projection can also be applied along any other axis, such as time (creating a kymogram) or joint display of color-coded channels. A more advanced technique, the perspective projection, preserves the 3D appearance of the object in the two-dimensional projection image (Fig. 2c). In perspective projection, the geometry of the image is modi-
1o) and channels (for example, different fluorescent markers, fied to have the x and y coordinates of objects in the image converge multispectral imaging; Fig. 1b,c,f). Emerging microscopy tech- toward vanishing points, whereas in the so-called isometric projec- niques, such as single plane illumination microscopy (SPIM; Fig. tion, the original sizes of the objects are preserved. Perspective views 1f) or high-throughput, time-lapse live cell imaging, combine all look more realistic, but isometric views are useful if the image is to be these dimensional expansions and generate massive 3D, time-lapse, used for distance measurements. multichannel acquisitions. High-dimensional visualization is not Projections can be combined with other techniques from computer limited to raw image data; it can also be useful for understand- graphics, such as wire frame models, shading, reflection and illumina- ing features derived from the image data, such as segmentations of tion, to create a realistic 3D rendering of the biological object. When


only the outer shape of the 3D object needs to be realistically visualized, surface rendering of the manually or automatically extracted outlines of organelles, cells or tissue can help in assessing their topological arrangement within the 3D volume of the imaged specimen. In contrast, when the interior of 3D objects is of interest, ‘volume rendering’ coupled with transparency manipulations or orthogonal sectioning is required.

In direct volume rendering, viewing rays are projected through the data6. Data points in the volume are sampled along these rays, and their visual representation is accumulated using a transfer function that maps the data values to opacity and color values (Fig. 3a). The transfer functions can be adjusted to emphasize different structures or features and may introduce color or opacity changes as a function of the local intensity gradient. Similarly, the intensity gradient vector can be used to emulate the effect of external light sources interacting with tissue boundaries. Although direct volume rendering can be computationally expensive, the advent of high-powered graphical processing units has allowed many software tools (for example, OsiriX, ImageVis3D in SCIRun, 3D Slicer7,8, VTK (Tables 1 and 2)) to provide these capabilities interactively on personal computers. Direct volume rendering has the advantage of requiring little preprocessing to produce high-quality renderings of multidimensional data and is best suited for data in which the structures of interest are readily differentiated by the pixel intensity. When this is not the case, further analytical techniques are required to clearly visualize these structures.

It is also possible to view all three spatial dimensions at once. In stereoscopic views, an image is presented to the right eye and the same image rotated by a small angle is presented to the left eye. This can be achieved by presenting the two images in the two halves of the monitor or by superimposing the two images with a small relative shift. The final frontier in this area is volume visualization of biological image data that combines various visualization approaches and couples them to virtual reality environments to allow not only seamless navigation through the data but also intuitive interaction with the visualized biological entities.

Treatment of the time dimension

The changes along the time axis in dynamically changing biological specimens are best visualized by assembling a static gallery of

Figure 3 | Visualization of anatomical features in MRI. (a) Volume rendering of a difference image computed from a pre- and post-gadolinium contrast scan. Brighter areas indicate a concentration of gadolinium, emphasizing the vasculature. (b) Time-lapse imaging of a subpopulation with Alzheimer’s disease showing loss of cortical gray matter density at 0, 6, 12 and 18 months80. Blue, no significant difference in cortical thickness from elderly control subjects; red and white, significant differences in cortical thickness (image courtesy P. Thompson). (c) Cardiac MRI analysis using anatomical scans and DTI of an ex vivo rat heart81. Color encoding of the DTI indicates the direction of the primary eigenvector: x direction, green; y, red; z, blue. (d) Visualization of a human brain DTI field during a fluid deformation process for image registration82. Orientation and shape of each ellipsoid indicate the pattern of diffusion at that location. Color encoding: low diffusion, green, to high diffusion, red. (e) Interactive visualization of high angular resolution diffusion imaging (HARDI) data using spherical harmonics27. Each shape represents the orientation distribution function measured at that point, which indicates the probability of diffusion in each angular direction. Colors indicate direction of maximum probability: red, lateral; blue, inferior–superior; green, anterior–posterior. Visible in this frame are portions of corpus callosum (central red area) and corticospinal tracts (blue vertical areas near edges). (f) White matter tracts computed from diffusion spectral imaging (DSI) data using Diffusion Toolkit (http://www.trackvis.org/dtk/). The tracts were then clustered automatically into bundles based on shape similarity measures and finally rendered using BrainSuite27. Each color indicates a different bundle. (g) 3D orthogonal views of an MRI volume, displayed with an automatically extracted surface mesh model of the surface of the cerebral cortex (BrainSuite27). (h) 3D surface reconstructions (Amira) from micro-MRI data: left hindlimb of a mouse with peroneal muscular atrophy. (i) Surgical planning visualization for assessment of white matter integrity: tumor model (green mass), ventricles (blue), local diffusion for one slice plane (ellipsoid scale and orientation indicating local diffusion tensor: red, low anisotropy; blue, high) and white matter fiber tracts shaded red to blue with increasing local anisotropy (thin lines, peri-tumoral; thick lines, corticospinal tracts). 3D Slicer: http://wiki.na-mic.org/Wiki/index.php/IGT:ToolKit/Neurosurgical-Planning.
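
The maximum intensity projection and the direct volume rendering described in this section both reduce a 3D volume to a 2D image by processing the samples met along each viewing ray. The following is a minimal sketch in plain Python, not the implementation used by any tool in Tables 1 and 2. It assumes axis-aligned rays along z, a volume stored as nested lists indexed volume[z][y][x] with intensities in [0, 1], and a hypothetical grayscale transfer function; all function names are illustrative.

```python
def max_intensity_projection(volume):
    """Collapse the z axis by keeping the brightest sample along each ray."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    return [[max(volume[z][y][x] for z in range(depth))
             for x in range(width)]
            for y in range(height)]


def linear_transfer(value):
    """Hypothetical transfer function: intensity -> (gray level, opacity)."""
    clamped = min(max(value, 0.0), 1.0)
    return clamped, clamped


def direct_volume_rendering(volume, transfer=linear_transfer):
    """Front-to-back compositing of samples along axis-aligned rays."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    image = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            color_acc, alpha_acc = 0.0, 0.0
            for z in range(depth):  # z = 0 is assumed nearest to the viewer
                color, alpha = transfer(volume[z][y][x])
                # Each sample contributes only through what is still transparent.
                weight = (1.0 - alpha_acc) * alpha
                color_acc += weight * color
                alpha_acc += weight
                if alpha_acc >= 0.999:  # ray is effectively opaque: stop early
                    break
            image[y][x] = color_acc
    return image


# Tiny volume of depth 3, height 1, width 2: two rays with three samples each.
volume = [
    [[1.0, 0.0]],  # nearest slice
    [[0.2, 0.5]],
    [[0.5, 1.0]],  # farthest slice
]
mip = max_intensity_projection(volume)
dvr = direct_volume_rendering(volume)
```

In this toy volume the projection keeps only the brightest sample of each ray, whereas compositing lets the semi-transparent middle sample of the second ray attenuate what lies behind it. Swapping in a different transfer function changes which structures are emphasized.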


Table 1 | A selective list of image visualization tools

Name | Cost | OS | Description | URL
Stand-alone
Amira* | $ | Win, Mac, Linux | Multichannel 4D images, image processing, extensible (in C++), scripting (in Tcl) | http://www.amiravis.com/
Arivis | $ | Win | Multichannel 4D images, image acquisition, processing, collaborative annotation and browsing, extensible (MATLAB and Python) | http://www.arivis.com/
Axiovision | $ | Win | Multichannel 4D images, image acquisition, image processing | http://tiny.cc/cZUbB
BioImageXD | Free | Win, Mac, Linux | 3D image analysis and visualization, in Python using VTK library | http://www.bioimagexd.net/
Blender | Free | Win, Mac, Linux | 3D content creation suite, the open source Maya | http://www.blender.org/
Fiji* | Free | Win, Mac, Linux | ImageJ distribution focused on registration and analysis of confocal and electron microscopy data. Six scripting languages, extensive wiki documentation, video tutorials | http://pacific.mpi-cbg.de/
Imaris* | $ | Win, Mac | Multichannel 4D images, image processing | http://www.bitplane.com/
IMOD | Free | Win, Mac, Linux | Monochannel 4D images, extensible (in C/C++) | http://tiny.cc/kfLgQ
Huygens | $ | Win, Mac, Linux | Multichannel 4D images, image processing, scripting (in Tcl), web interface for batch processing | http://www.svi.nl/
Image-Pro | $ | Win | Multichannel 4D images, image acquisition, image processing | http://www.mediacy.com/
ImageJ* | Free | Win, Mac, Linux | Image processing, extensible (in Java) | http://rsbweb.nih.gov/ij/
LSM image browser | Free | Win | Multichannel 4D images | http://tiny.cc/WMHsE
MetaMorph | $ | Win | Multichannel 4D images, image acquisition, image processing, extensible (in Visual Basic), scripting (with macros) | http://tiny.cc/YrCK3
POV Ray | Free | Win, Mac, Linux | High-quality tool for creating impressive 3D graphics | http://www.povray.org/
Priism/IVE | Free | Mac, Linux | Multichannel 4D images, image processing, extensible (in C and Fortran) | http://tiny.cc/SIAMF
V3D, VANO and Cell Explorer | Free | Win, Mac, Linux | 3D Image visualization, analysis and annotation | http://tiny.cc/JWdFb
VisBio | Free | Win, Mac, Linux | Multichannel 3D images, image processing (with ImageJ), connection to an OMERO server | http://tiny.cc/TOZad/
Volocity | $ | Win, Mac | Multichannel 4D images, image acquisition, image processing | http://www.improvision.com/
VOXX | Free | Win, Mac, Linux | Real-time rendering of large multichannel 3D and 4D microscopy datasets | http://tiny.cc/b0KRt
VTK* | Free | Win, Mac, Linux | Library of C++ code for 3D computer graphics, image processing and visualization | http://www.vtk.org/
Web-based
Brain Maps | Free | Win, Mac, Linux | Interactive multiresolution next-generation brain atlas | http://brainmaps.org/
CATMAID | Free | Win, Mac, Linux | Collaborative Annotation Toolkit for Massive Amounts of Image Data: distributed architecture, modeled after Google Maps | http://fly.mpi-cbg.de/~saalfeld/catmaid/

*Recommended and popular tools. Free means the tool is free for academic use; $ means there is a cost. OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. 4D, four-dimensional.

images from different time points (for printed media, see Fig. 2d) or presenting a movie (for the web). The biological processes are often too slow to be shown in real time, and time-lapse techniques, where the frames are replayed faster, may reveal surprising details. Nevertheless, movies are significant simplifications of the acquired multidimensional data (for example, they do not allow rigorous time point comparisons and typically discard too much of the captured data) and researchers should always have the possibility of browsing through the raw data along arbitrary dimensions. Movies of 3D volume renderings of biological data tend to be particularly impressive but require substantial computational power. Alternatively, signal changes over time can be visualized as heat maps overlaid on the other dimensions of the image9, on normalized reference templates or on surface models of cellular or anatomical structures. For example, functional MRI acquires many images in the span of several minutes during the application of some study paradigm. These are then processed using statistical methods to produce maps of activation, using tools such as SPM10 or FSL11. These maps may then be aligned to higher-spatial-resolution structural MRIs to provide anatomical context for the functional activation information (Fig. 1o).

Analysis and visualization of temporal information also depend on the time scale of the studies. Typical studies in molecular biology cover relatively short time intervals, ranging from several minutes to several hours and sometimes to several days. Often these studies require reliable tracking algorithms12,13 that enable researchers to follow the same object (for example, single molecules or cells) over time and to extract and visualize trajectories and other measurements (for example, color-coded speed of cells in a developing embryo14). In other cases, imaging is only a means to derive various parameters, whose kinetics are then visualized. For instance, variations in fluorescence intensity over time can be used to measure diffusion coefficients or concentrations. Another example is high temporal resolution MRI (‘cine-MRI’) of the heart, where epi- and endocardial borders can be traced in the images to obtain global cardiac functional parameters such as ventricular volumes or wall thickening. Alternatively, displacement or velocities of the ventricular wall can


be measured temporally to quantify transmural wall motion and to assess cardiac function regionally.

In the medical sciences, there is interest in long-term studies, often measuring effects over time periods of several months or years. For example, in neurology, the ability to accurately measure the local thickness of the cerebral cortex provides an important measure of pathological changes associated with Alzheimer’s disease and other types of cognitive decline. Since these studies rely upon separate acquisitions and compare multiple subjects, the data must be registered spatially (see Registration below) for the time series to be analyzed. These dynamic effects are often analyzed using many subjects; this adds the requirement that all subject data be spatially resampled to bring them into anatomical correspondence. The FreeSurfer package, for instance, accomplishes this using a cortical surface matching technique that aligns brains on the basis of their cortical folding patterns and is widely used for detecting these brain changes,

Table 2 | A selective list of MRI visualization tools

Name | Cost | OS | Description | URL
3D Slicer* | Free | Win, Mac, Linux | Tools for visualization, registration, segmentation and quantification of medical data; extensible; uses VTK and ITK | http://www.slicer.org/
Amira | $ | Win, Mac, Linux | Allows 2D slices to be viewed from any angle; provides image segmentation, 3D mesh generation; surface rendering; data overlay and quantitative measurements | http://www.amiravis.com/
Analyze | $ | Win, Mac, Linux | Many processing and visualization features for many types of medical imaging data | http://tiny.cc/gXO76
Anatomist | Free | Win, Mac, Linux | Visualization software that works in concert with BrainVisa; can map data onto 3D renderings of the brain; provides manual drawing tools | http://brainvisa.info/
AVS | $ | Win, Mac, Linux | General purpose data visualization package | http://www.avs.com/
BioImage Suite* | Free | Win, Mac, Linux | Tools for biomedical image analysis; includes preprocessing, voxel-based classification; image registration; diffusion image analysis; cardiac image analysis; fMRI activation detection | http://bioimagesuite.org/
BrainSuite | Free | Win, Mac, Linux | Automated cortical surface extraction from MRI; orthogonal image viewer; automated and interactive segmentation and labeling; surface visualization | http://tiny.cc/Qxv6x
BrainVisa | Free | Win, Mac, Linux | Toolbox for segmentation of T1-weighted images; performs classification and mesh generation on brain images; automated sulcal labeling | http://brainvisa.info/
BrainVoyager | $ | Win, Mac, Linux | Analysis and visualization of MRI and fMRI data and for EEG and MEG distributed source imaging | http://tiny.cc/kFKhv
Cardiac Image Modeller | $ | Irix | Visualization and functional analysis, in 3D space and through time of cardiac cine data | http://tiny.cc/4KY6E
DTIStudio | Free | Win | Tools for tensor calculation, color mapping, fiber tracking and 3D visualization | http://tiny.cc/pvU9B
FreeSurfer | Free | Mac, Linux | Automated tools for reconstruction of the brain’s cortical surface from structural MRI data and overlay of functional MRI data onto the reconstructed surface | http://tiny.cc/H3uG5
FSL* | Free | Win, Mac, Linux | Comprehensive library of analysis tools for fMRI, MRI and DTI brain imaging data; includes widely used registration and segmentation tools | http://tiny.cc/NFPHO
ImageJ | Free | Win, Mac, Linux | Image processing, extensible (in Java), large user community | http://rsb.info.nih.gov/ij/
ImagePro | $ | Win | 2D and 3D image processing and enhancement software | http://www.mediacy.com/
ITK | Free | Win, Mac, Linux | Extensive suite of software tools for image analysis | http://www.itk.org/
Jim | $ | Win, Mac, Linux | Calculates T1 and T2 relaxation times, magnetization transfer, diffusion maps from MRI data | http://www.xinapse.com/
MBAT | Free | Win, Mac, Linux | Workflow environment bringing together online resources, a user’s image data and biological atlases in a unified workspace; extensible via plug-ins | http://tiny.cc/W2Tx2
MedINRIA | Free | Win, Mac, Linux | Many algorithms dedicated to medical image processing and visualization; provides many modules, including DTI and HARDI viewing and analysis | http://tiny.cc/RCptw
MIPAV | Free | Win, Mac, Linux | Quantitative analysis and visualization of medical images of numerous modalities such as PET, MRI, CT or microscopy | http://mipav.cit.nih.gov/
OpenDX | Free/$ | Win, Mac, Linux | General-purpose data visualization package | http://www.opendx.org/
OsiriX* | Free | Mac | Image processing and viewing tool for DICOM images, provides 2D viewers, 3D planar reconstruction, surface and volume rendering, export to QuickTime | http://tiny.cc/kOTzy
SCIRun | Free | Win, Mac, Linux | Environment for modeling, simulation and visualization of scientific problems; includes many biomedical analysis components, such as BioTensor, BioFEM and BioImage | http://tiny.cc/eLufx
SPM | Free | Win, Mac, Linux | Analysis of brain imaging data sequences; applies statistical parametric mapping methods to sequences of images; widely used in fMRI; provides segmentation and registration | http://tiny.cc/dVc7v
TrackVis | Free | Win, Mac, Linux | Tools to visualize and analyze fiber track data from diffusion MRI (DTI, DSI, HARDI, Q-Ball) tractography | http://trackvis.org/
TractoR | Free | Linux | Tools to segment comparable tracts in group studies using FSL tractography | http://tiny.cc/OsBH9
VTK* | Free | Win, Mac, Linux | Library of C++ code that implements many state-of-the-art visualization techniques with a consistent developer interface | http://www.vtk.org/

*Recommended and popular tools. Free means the tool is free for academic use; $ means there is a cost; free/$ means free for Windows and Linux, at a cost for Mac OS X. OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. Irix is SGI’s Unix operating system. 2D, two-dimensional; CT, computed tomography; PET, positron emission tomography; HARDI, high angular resolution diffusion imaging.


for populations and for single subjects over time15–17. Once such a spatial normalization has been performed, statistical maps may be computed to examine changes in various biomarkers, such as cortical thickness, and these measures may then be mapped onto images or surface models for visualization in the form of renderings or time-lapse animations (Fig. 3b).

Extra dimensions

Image data can have more dimensions than space and time in two ways. Either extra channels can be recorded or each voxel can be associated with a dataset encoding various properties. Although different channels could be browsed as extra dimensions, they are usually color coded and jointly displayed. For more than three channels, however, the combinations of channel values do not result in unique colors. Dimensionality reduction techniques can be applied to map meaningful channel combinations to unique colors. This, however, only partially alleviates the problem as the number of combinations to display could easily exceed the number of available colors. Color coding becomes useless when tens or hundreds of channels must be visualized simultaneously (for example, in multi-epitope-ligand cartography18). To solve this problem, some authors have even considered converting data to sound (‘data sonification’; T. Hermann, T. Nattkemper, H. Ritter and W. Schubert. Proc. Mathematical and Engineering Techniques in Medical and Biological Sciences, 745–750, 2000) to take advantage of people’s ability to distinguish subtle variations in sound patterns, which shows that in this challenging field, there is still room for new, sophisticated visualization tools.

The advent of diffusion MRI has increased the number of dimensions of MRI images. In their most basic form, these images present a scalar value at each voxel indicating a measure of the local diffusion properties of water molecules in the imaged sample along a particular direction19. These water diffusion properties indicate the local structure along that direction in the image and can be used to examine, for example, the architecture of white matter in the brain. When multiple images are acquired using different gradient directions, a more complete spatial approximation of the diffusion can be formed. In the case of diffusion tensor imaging (DTI), a rank-2 diffusion tensor is estimated at each voxel20. These images may be visualized by using color to represent the principal direction of diffusion (Figs. 1p and 3c). They may also be visualized as fields of glyphs representing the two-dimensional tensor as an ellipsoid or other shapes that indicate the pattern of water diffusion and thus provide an indication of the structure in the image (Fig. 3d). As the number of angular samples increases (for example, in Q-Ball imaging21), to resolve multiple white matter fiber populations in each voxel, the orientation distribution function, which describes the probability of diffusion in a given angular direction, becomes more complicated and can be represented using higher-order functions, such as spherical harmonic series (Fig. 3e). For both DTI and Q-ball, the data are represented as two-dimensional surfaces at each point in a 3D volume. Diffusion spectral imaging (DSI)22 further increases the dimensionality with multiple acquisitions at different magnetic gradient strengths yielding a 3D dataset at each voxel in the 3D volume.

The challenge of processing and visualizing these data is to convert the raw data into the tensor or glyph representations and display them in a meaningful way to the user. This may include additional processing to reduce the data dimensionality into scalar measures (for example, fractional anisotropy) or to extract features from the tensor, Q-ball or DSI data that indicate structure in the sample. A prime example of this is white matter tractography in the brain. Identification and visualization of white matter tracts (see, for example, Fig. 3f) in diffusion imaging are provided by several of the software packages listed in Table 2 (for example, DTIStudio, TrackVis, ITK, 3D Slicer, MedINRIA, TractoR and FSL).

Although in microscopy some tools have been developed aimed at visualizing directional data such as diffusion properties and mapping them onto two- or 3D datasets (for example, spatio-temporal image correlation spectroscopy (STICS)23), the microscopy application field seems to be less advanced than in MRI. It will therefore be interesting to see whether the tools developed for visualization of diffusion-weighted MRI data will be adopted for microscopy applications.

Segmentation

Whether analyzing scalar images, vector volumes or more complicated data types, a frequent task in processing and visualizing 3D data is the segmentation of cellular or anatomical structures to define the boundaries of target structures. Once the boundaries of a structure have been defined, a surface mesh model can then be generated to represent that structure. These models are often generated using isosurface approaches such as the marching cubes algorithm24. The meshes may then be rendered rapidly using accelerated 3D graphics hardware that is optimized for drawing triangles. The VTK software library provides a widely used implementation of these techniques and is incorporated in, for example, OsiriX, BioImage Suite and 3D Slicer. Surface mesh techniques can visualize anatomy, produce 3D digital reconstructions and make volumetric measurements (Fig. 3g,h). Extra related data, such as statistical maps of tissue changes, can also be represented on these surfaces for display purposes (Fig. 3b).

In many biomedical imaging applications, structures to be segmented are identified in the data either through manual delineation or through automated and semiautomated computational approaches. Image analysis tools, such as 3D Slicer25, MIPAV26, BrainSuite27, MedINRIA and Amira (Visage Imaging), can be used to display and manually delineate 3D volumetric data, which are then turned into 3D surface models. Although manual delineation is often the gold standard for identifying structure, many computational approaches have been developed. Extensible tool suites such as ITK, SCIRun, MIPAV and ImageJ provide collections of automated approaches to the general problem of segmentation for two- and 3D images, as well as tools designed to extract specific anatomical structures. Several tools have been developed for the task of extracting, analyzing and visualizing models of the brain from MRI (for example, FreeSurfer28, BrainSuite27, BrainVoyager29, MedINRIA and BrainVisa30), as have tools specific to cardiac image processing and analysis (for example, Cardiac Image Modeller). Many of these display tools provide facilities for image, volume or surface registration, allowing 3D surface-rendered data acquired during different experiments to be overlaid and displayed. These capabilities allow 3D anatomy to be digitally reconstructed and enable comparison of various samples. Combining organ- and tissue-specific analysis techniques with calibrated MRI acquisition sequences can support accurate in vivo measurement of anatomical structures.

Registration

The dimensional expansion of biological image data goes beyond individual datasets. Gene activity, to take one exemplar of the properties of biological systems, may be imaged and visualized one gene at a time but systematically for many genes in different specimens31–36.


Furthermore, different imaging techniques offering different resolutions may be used to visualize different aspects of gene activity. The gene identity becomes yet another dimension in the data, and to quantitatively compare across this dimension the datasets must be properly registered. Many tools have been developed for image registration36–39. In its simplest form, registration is achieved by designating one acquisition as the reference and registering all other acquisitions to it, but this approach introduces reference-specific bias to the data. The computationally most elegant solution is to register all datasets to one another simultaneously in an empty output image space, but such an approach is also the most computationally expensive. Therefore, the most commonly used technique is atlas registration, whereby individual acquisitions are registered to an idealized expert-defined atlas based on prior knowledge of the imaged system.
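
To make the diffusion tensor measures discussed under "Extra dimensions" more concrete, the sketch below computes fractional anisotropy and a principal-direction color for a single voxel. It is an illustration in plain Python under stated assumptions, not the method of any package in Table 2. It assumes the tensor is a symmetric, positive semidefinite 3 × 3 matrix with a unique largest eigenvalue. It approximates the principal direction by power iteration rather than a full eigendecomposition. The mapping of |x|, |y| and |z| to red, green and blue is one possible convention (Fig. 3c uses a different assignment). All function names are hypothetical.

```python
import math


def fractional_anisotropy(tensor):
    """FA = sqrt(3/2) * ||D - mean(D) * I||_F / ||D||_F for a symmetric 3x3 tensor."""
    mean_diffusivity = (tensor[0][0] + tensor[1][1] + tensor[2][2]) / 3.0
    deviatoric_sq, total_sq = 0.0, 0.0
    for i in range(3):
        for j in range(3):
            value = tensor[i][j]
            total_sq += value * value
            if i == j:
                value -= mean_diffusivity  # remove the isotropic part
            deviatoric_sq += value * value
    if total_sq == 0.0:
        return 0.0
    return math.sqrt(1.5 * deviatoric_sq / total_sq)


def principal_direction(tensor, iterations=100):
    """Approximate the dominant eigenvector by power iteration.

    Assumes a unique largest eigenvalue and a start vector that is not
    orthogonal to the corresponding eigenvector.
    """
    start = 1.0 / math.sqrt(3.0)
    vector = [start, start, start]
    for _ in range(iterations):
        product = [sum(tensor[i][j] * vector[j] for j in range(3))
                   for i in range(3)]
        norm = math.sqrt(sum(c * c for c in product))
        if norm == 0.0:
            break
        vector = [c / norm for c in product]
    return vector


def direction_color(tensor):
    """Color a voxel by principal direction, with brightness scaled by FA."""
    fa = fractional_anisotropy(tensor)
    direction = principal_direction(tensor)
    # Assumed convention: |x| -> red, |y| -> green, |z| -> blue.
    return tuple(min(1.0, fa * abs(c)) for c in direction)


# Illustrative tensor (arbitrary units) with faster diffusion along x.
tensor = [
    [1.7, 0.0, 0.0],
    [0.0, 0.3, 0.0],
    [0.0, 0.0, 0.3],
]
fa = fractional_anisotropy(tensor)
color = direction_color(tensor)
```

For this example tensor, diffusion is fastest along x, so the voxel is colored almost purely in the first channel, with brightness set by the anisotropy. An isotropic tensor gives an anisotropy of zero and therefore a black voxel.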


Figure 4 | Visualization of high-throughput data. (a) By analogy with the ‘eisengram’ for microarray data, the discrete spatial gene expression annotation data (left) can be summarized by a so-called ‘anatogram’ (right), wherein anatomical structures are color coded, grouped temporally (vertical black lines) and ordered consistently within the temporal groups. Over- or under-representation of an anatomical term in a group of genes is expressed by height of the color-coded bar; width of the bar is proportional to the frequency of the anatomical term in the annotation dataset. (b) Typical visualizing and browsing of high-throughput data at experiment level: color-coded cell density on a 384-well plate with link to raw data. (c) Typical visualizing and browsing of high-throughput data at the level of exploratory analysis: density plots of nuclear features (area and intensity), linked to the single segmented nuclei. (d) Joint visualization of 2,600 time-lapse experiments with one-dimensional readout (here proliferation curves): values are color-coded; each row corresponds to one experiment. (e) Time-resolved heat map for multidimensional read-out (here percentages of nuclei in the different morphological classes shown at the top): values are color-coded; each row corresponds to one RNAi experiment. Rows are arranged according to trajectory clustering64. (f) Event order map visualizing the relative order of phenotypic events in cell populations: events are color-coded and centered around one phenotype (here dynamic). (g) Visualizing high-throughput subcellular localization data (iCluster): images of ten subcellular localizations (indicated by outline color) spatially arranged by statistical similarity to identify outliers and representative images. (h) Visualization of spatially mapped simulation results (The Visible Cell): simulation of insulin secretion within a beta cell based on electron microscope tomography data (resolution, 15 nm). Blue granules are primed for insulin release, white are docked into the membrane (releasing insulin) and red are returning to the cytoplasm after having been docked.


Table 3 | A selective list of high-throughput visualization tools

Name | Cost | OS | Description | URL
BD Pathway | $ | Win | Automated image acquisition, analysis, data mining and visualization | http://tiny.cc/093OJ
Cellenger | $ | Win | Automated image analysis | http://tiny.cc/rARky
CellHTS Bioconductor (R) | Free | Win, Mac, Linux | Analysis of cell-based screens, visualization of screening data, statistical analysis, links to bioinformatics resources | http://www.bioconductor.org/
CellProfiler CP-Analyst* | Free | Win, Mac, Linux | Automated image analysis, classification, interactive data browsing, data mining and visualization; extensible, supports distributed processing | http://www.cellprofiler.org/
CompuCyte | $ | Win | Automated image acquisition, analysis, data mining and visualization | http://tiny.cc/jHsAm
GE IN Cell Investigator, Miner | $ | Win | Automated image acquisition, analysis, data mining and visualization | http://tiny.cc/9rFoh
Genedata Screener* | $ | Win, Mac, Linux | Data analysis and visualization | http://tiny.cc/HBfpY
Evotec Columbus, Acapella | $ | Win, Linux | Automated image analysis, distributed processing, data management (OME compatible), data mining and visualization | http://tiny.cc/yvyek
HCDC | Free | Win, Mac, Linux | Workflow management, data mining, statistical analysis, visualization, based on KNIME (http://www.knime.org/) | http://hcdc.ethz.ch/
iCluster | Free | Win, Mac, Linux | Spatial layout of imaging by statistical similarity, statistical testing for difference | http://icluster.imb.uq.edu.au/
MetaMorph MetaXpress, AcuityXpress* | $ | Win, Mac, Linux | Automated image acquisition, analysis, data mining and visualization | http://tiny.cc/OU9sf
Olympus Scan^R | $ | Win | Image acquisition, automated image analysis, extensible (with LabView) | http://tiny.cc/NtEhH
Pathfinder Morphoscan | $ | Win | Automated image analysis, cell and nuclear analysis, karyotyping, histo- and cytopathology, high-content screening | http://www.imstar.fr/
Pipeline Pilot | $ | Win, Linux | Workflow management, image processing, data mining and visualization | http://tiny.cc/uY4ZO
Spotfire* | $ | Win, Mac, Linux | Data analysis and visualization | http://tiny.cc/rtVeL
Thermo Scientific Cellomics | $ | Win | Automated image analysis, analysis, data mining and visualization | http://tiny.cc/Bt7Ov
TMALab | $ | Win | Automated image acquisition, storage, analysis, scoring, remote sharing and annotation; mainly for clinical pathology | (no URL given)

*Recommended and popular tools. Free means the tool is free for academic use; $ means there is a cost. OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix.

Registration algorithms can take advantage of the actual pixel intensities in the 3D datasets and iteratively minimize some cost function that reflects the overall image content similarity. Given the size of typical 3D image data, such intensity-based approaches are often slow or unfeasible. Therefore, the image content is typically reduced to some relatively small set of salient features (Fig. 2e,f) and correspondence analysis is used to match the features in different 3D acquisitions and iteratively minimize their displacement. The features may be extracted from the images fully automatically, as in the popular ‘scale invariant feature transform’ (SIFT40), or an expert can define them manually. The manual definition of the corresponding landmarks is at present the only option when registering multimodal data of vastly different scales, such as from confocal and electron microscopy (Fig. 2g). When technically possible, it is beneficial to uncouple the registration problem from the image intensities by using fiducial markers such as fluorescent beads41 or gold particles. Regardless of the image content representation, the algorithms used for registration use some form of iterative optimization of an appropriately chosen cost function.

An interesting idea for multimodal image registration is to establish a reference output space where the different modalities are registered to each other once, and subsequently new instances of one modality are mapped onto the already registered example. This process can also be iterative, increasing the registration precision with each new incoming dataset. The visualization of registered multimodal image data of different scales presents a new set of challenges (Fig. 2g). Proper downsampling techniques based on Gaussian convolution must be used when changing the scale dimension of the multiresolution data42.

Multimodal image registration is also an important issue for MRI data. Multiple scanning technologies are often applied to the same subject to provide an integrated assessment. For example, the integration of structural, functional and diffusion MRI has been productively applied to the analysis of and treatment planning for deep brain tumors, in which functional MRI is used to indicate the ‘eloquent cortex’ (involved in tasks to be preserved during surgery) while diffusion imaging indicates the white matter fiber bundles and how they are invaded and/or displaced by the tumor43. Interactive visualization techniques allow clinicians to superimpose the 3D renderings of the various image modalities to better understand the clinical situation and evaluate treatment options (Fig. 3i). Creating an integrated visualization is complicated by patient motion between scans and by inherent geometric distortion associated with different scan techniques, for example eddy current–induced distortions in diffusion MRI. Automated registration and distortion correction techniques can be used to compensate for these effects when creating the integrated view.

Multimodal MRI visualization can also be used together with real-time data to guide therapeutic procedures such as neurosurgery. Current neurosurgical practice is often augmented with so-called ‘navigation’ systems consisting of surgical tools whose position and orientation are digitally tracked. This information is used to provide a reference between the preoperative image data and the live patient with submillimeter accuracy. In this context, it is possible to support the procedure with visualizations of MRI data collected preoperatively. For many interventions, nonlinear deformation of the image data is required owing to significant changes between pre- and intraoperative patient anatomy. The VectorVision (BrainLab AG) System is an example of a state-of-the-art MRI surgical navigation system, while BioImage Suite, 3D Slicer and other open source software tools are available for researchers looking to provide enhanced functionality.

Implementation issues

The main commercial software tools providing methods for viewing primary image data in cell biology are MetaMorph, Imaris, Volocity, Amira and LSM Image Browser. There is also a wide range of open source tools, such as various plug-ins to the image analysis suite ImageJ and Fiji, its distribution specialized in 3D registration and visualization; BioImageXD, based on the state-of-the-art VTK library; and the V3D toolkit, emerging from systematic 3D imaging efforts in neurobiology at Janelia Farm in Ashburn, Virginia, USA (see Table 1). Because visualization strategies are very diverse, no single software fits all needs, and the availability of either the program’s source code or a functional application programming interface (API) is a must for the programming required to realize complex visualizations.

Most of these software tools require that the image volumes of interest be read into the computer memory before they can provide efficient visualization and reasonable interactive response. Recent commercial and open-source tools have started to use graphics hardware to accelerate 3D visualization. Wider adoption of these approaches is prevented by the small spectrum of graphical processing units (GPUs) accessible for parallel programming using standard programming languages and the relatively small size of GPU memory. Image datasets produced nowadays are often so large that it is impossible to load them even into the CPU memory except on systems configured expressly (and expensively) for the purpose. Open-source software suites such as ImageJ and Fiji provide practical solutions for managing memory and displaying massive datasets without unrealistic hardware requirements. But although these tools are very user friendly and popular, they lack generic multidimensional data structures like those developed by the ITK/VTK project that enable programmatic abstraction of the access to arbitrary dimensions in the image data residing on the hard drive. Approaches that rely on random data access from the hard drive at multiple predefined resolutions are now available for very large-scale two-dimensional images in a client-server mode and can be accessed over the Internet, which greatly enhances their visibility and the possibilities for collaborative annotation. Examples include Google Maps, Zoomify and CATMAID. Browser-based visualization of 3D image data is so far limited to slice-by-slice browsing of the z dimension (CATMAID, BrainMaps). Web viewers for 3D data enabling section browsing at arbitrary angles and scales are just now emerging44,45.

Visualization of high-throughput microscopy data

In recent years, systems for performing high-throughput microscopy-based experiments have become available and are often used to test the effects of chemical or genetic perturbations on cells46,47, to determine the subcellular localization of proteins48,49 or to study gene expression patterns in development50. These screens produce huge amounts of image data (sometimes tens of terabytes and millions of images) that must be managed, quality controlled, browsed, annotated and interpreted. As a consequence, tools for visualization and analysis are key at virtually all levels of such projects (see Fig. 4).

Some large-scale experiments involving particularly complex readouts have been annotated manually using controlled vocabularies51, in some cases (as for high-content analysis of gene expression during development) eased by the use of custom-built annotation tools34,52,53. Several visualization aids have been developed to succinctly summarize the complex, multidimensional annotations and organize them using clustering methods borrowed from the microarray data analysis field. The main challenge of representing qualitative rather than quantitative annotation data was addressed by introducing discrete color coding for the controlled vocabulary terms collapsed to the most informative level in the annotation ontology. In that way, the analog microarray ‘eisengram’ evolved into the digital ‘anatogram’ capable of visually summarizing the gene expression properties of arbitrary groups of genes (Fig. 4a). Once large, expert-annotated image sets became available, computational approaches were successfully used to automate the annotation process (for example, automatic annotation of subcellular protein localization54 and gene expression patterns55,56).

In most cases, large image datasets are automatically processed to extract a wide range of attributes from the images. To navigate efficiently through this sea of data, users need visualization software that can display informative summaries at different levels: in the acquisition and quality-control phase, multiwell plate and similar visualizations (Fig. 4b) that show image-based data values with a raw data link or thumbnails of images that can be enlarged for careful examination are very helpful. Tools are also needed to show relevant image-based data to biologists in an intuitive manner and enable them to identify meaningful characteristics and explore potential correlations and relationships between data and to point them toward the most interesting samples in their experiment. For this exploratory data analysis, data-enhanced scatter and density plots (Fig. 4c) and histograms of image-derived data can be used, in which the user can select subsets of data, view examples of the raw images producing those data points and filter data points for further analysis. Browsing these graphical representations linked to the raw data allows biologists to identify interesting subsets (for example, the morphological classes present in a cell-based screen or training sets for subsequent supervised machine learning) in an interactive and intuitive manner. Linking to the original data is particularly important: first, because users must frequently locate relevant images to manually confirm the automated, quantitative results and, second, because there is often no obvious a priori link between quantitative image descriptors and biological meaning. Further analysis of these attributes and eventually identified subsets leads to image annotations (for example, phenotypes) and/or classifications (for example, as a ‘hit’), often by means of supervised learning methods.

When putting the experiment into the context of existing biological knowledge, researchers are concerned about how images and their derived data relate to known biological entities. For example, one may want to browse all images related to a given gene, gene ontology term or chemical treatment. This requires integration with other sources of information, usually external databases. The visualization methods suited for this are also commonly used in systems biology: heat maps and projections in two-dimensional maps57.

However, many of these goals remain unaddressed by existing software tools. Gracefully and intuitively presenting rich image data representing possibly hundreds of attributes extracted from billions of cells is a demanding task for a visual analysis tool. Still, some recent developments have begun to ease aspects of these visualization challenges for high-throughput experiments. Several software tools offered by screening-oriented microscope companies enable certain data visualizations (Table 3), as does third-party software such as Cellenger and the open-source CellProfiler project58,59 (Table 3).
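
As a minimal sketch of the multiwell plate view mentioned above (Fig. 4b), the following Python fragment arranges per-well values into plate rows and columns and prints a coarse character heat map. The well naming (row letter plus column number), the function names plate_grid and render, and the example values are assumptions made for illustration; they are not taken from any of the tools in Table 3.

```python
import string

def plate_grid(well_values, n_rows=8, n_cols=12):
    """Arrange {well_id: value} into a row-major grid; wells without data stay None."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for well_id, value in well_values.items():
        row = string.ascii_uppercase.index(well_id[0].upper())  # 'A' -> 0, 'B' -> 1, ...
        col = int(well_id[1:]) - 1                               # '01' -> 0, '12' -> 11
        if not (0 <= row < n_rows and 0 <= col < n_cols):
            raise ValueError(f"well {well_id!r} is outside the plate")
        grid[row][col] = value
    return grid

def render(grid, low, high, shades=" .:*#"):
    """Map each value to a character shade so the plate can be scanned by eye."""
    lines = []
    for row in grid:
        chars = []
        for value in row:
            if value is None:
                chars.append("?")  # no measurement for this well
                continue
            frac = min(max((value - low) / (high - low), 0.0), 1.0)
            chars.append(shades[int(frac * (len(shades) - 1))])
        lines.append("".join(chars))
    return lines

# Hypothetical image-derived values (for example, a mitotic index) for three wells.
values = {"A01": 0.02, "A02": 0.40, "B01": 0.95}
grid = plate_grid(values, n_rows=2, n_cols=2)
print("\n".join(render(grid, low=0.0, high=1.0)))
```

In an interactive tool each cell of such a grid would additionally carry a link back to the raw images of the well, which is the property the text singles out as most important.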

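The landmark-based registration described earlier in this section, in which matched features or fiducial markers are aligned by minimizing their displacement, can be illustrated for the simplest case. The sketch below assumes already matched 2D landmarks and a rigid transform (rotation plus translation); for this special case the least-squares solution has a closed form, so it is used here in place of the iterative optimization that the text describes for the general problem. All function names and bead coordinates are hypothetical.

```python
import math

def fit_rigid_2d(source, target):
    """Least-squares rotation and translation mapping source landmarks onto target."""
    n = len(source)
    sx = sum(p[0] for p in source) / n
    sy = sum(p[1] for p in source) / n
    tx = sum(q[0] for q in target) / n
    ty = sum(q[1] for q in target) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(source, target):
        px, py, qx, qy = px - sx, py - sy, qx - tx, qy - ty  # center both point sets
        dot += px * qx + py * qy
        cross += px * qy - py * qx
    theta = math.atan2(cross, dot)  # angle that minimizes the summed squared displacement
    c, s = math.cos(theta), math.sin(theta)
    # Translation that carries the rotated source centroid onto the target centroid.
    shift = (tx - (c * sx - s * sy), ty - (s * sx + c * sy))
    return theta, shift

def apply_rigid_2d(points, theta, shift):
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1]) for x, y in points]

def mean_displacement(a, b):
    """Cost function: mean Euclidean distance between corresponding landmarks."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

# Hypothetical bead positions seen in two acquisitions of the same sample.
beads_a = [(0.0, 0.0), (10.0, 0.0), (0.0, 5.0), (7.0, 3.0)]
beads_b = apply_rigid_2d(beads_a, math.radians(30.0), (2.0, -1.0))

theta, shift = fit_rigid_2d(beads_a, beads_b)
residual = mean_displacement(apply_rigid_2d(beads_a, theta, shift), beads_b)
print(math.degrees(theta), shift, residual)
```

With noisy or nonrigidly deformed data the residual would not vanish, and a richer transform model with iterative optimization of the same kind of cost function would be needed.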

These packages integrate image processing algorithms with statistical analysis and graphical representations of the data and also offer machine learning methods that capitalize on the multiple attributes measured in the images. In workflow management software (for example, HCDC; Table 3), where modules communicate through defined inputs and outputs, user-defined visualization modules can be integrated into a data acquisition and processing workflow; this increase in flexibility and history tracking typically comes with a loss in user-interactivity and browsing capabilities.

Although presentation, representation and querying of primary visual and quantitative data are a significant problem, an associated difficulty is that the dimensionality of data derived from or associated with each image or object is rapidly growing. The problem is to visualize such high-dimensional data in a concise way so that it may be explored to identify patterns and trends at the image level. A common strategy projects high-dimensional data into low dimensions for visualization using various forms of multidimensional scaling60 (for example, principal component analysis, Sammon mapping61). Multidimensional scaling aims to map high-dimensional vectors into low dimensions in such a way as to preserve some measure of distance between the vectors. Once such an embedding or mapping into two or three dimensions has been accomplished, the data can be visualized and any relationships observed. One approach to visualizing and interacting with high-dimensional data and microscopy imaging is the iCluster system62, developed in association with the Visible Cell63. Here, large image sets from single or multiple fluorescence microscopy experiments may be visualized in three dimensions (Fig. 4g). Spatial placement in three dimensions can be automatically generated by Sammon mapping using high-dimensional texture measures or through user-supplied statistics associated with each image. Thus, sets of images that are statistically and visually similar are presented as spatially proximate, whereas dissimilar images are distant. This allows outliers and unusual images to be detected easily, while differences between classes (for example, treatment versus control) or multiple classes within an experiment can be seen as spatial separation. Visualization of relationships and correlations among the data allows the user to find and define the unusual, the representative and broad patterns in the data.

Most of the above visualization schemes apply to cellular-level measurements of populations of cells, but none of these methods takes into consideration time-resolved data. Although the temporal evolution of one or several cellular or population features from a single experiment can easily be plotted over time, this approach is impractical when relationships between hundreds or thousands of experiments must be visualized. In this case, the time series can be ordered according to some similarity criterion and visualized as a color-coded matrix (Fig. 4d). Similarly, heat maps can be extended to represent multidimensional time series (Fig. 4e); the time series corresponding to different dimensions can be concatenated. Here, the most difficult part is to define an appropriate distance function for multidimensional time series according to whether absolute or relative temporal information is important64.

Often, the time itself is less informative than the relative order in which events occur. In this case, it is also possible to estimate a representative order of events from the time-lapse experiments (for example, phenotypic events at the single-cell level). This event order can be used for characterizing, grouping and visualizing experimental conditions, creating an event order map (Fig. 4f)64.

Dissemination of image datasets

High-throughput microscopy techniques have led to an exponential increase in visual biological data. Although every high-throughput imaging project strives to perform a comprehensive analysis of its image data, the sheer volume of the images and the inadequacy of the computational tools make such efforts incomplete. It is likely that distributed, research community–driven and competitive analysis of these datasets will lead to new discoveries, as it has for publicly released genome sequences. Although standalone applications are catching up with the immediate needs of primary data visualization, solutions for distributing image data to the community or in collaborative environments are lagging behind. Traditional paper publication or publication as online supplementary materials is clearly inadequate, and the Journal of Cell Biology and Journal of the Optical Society of America have attempted to address this by implementing software systems to link original image data to articles. The former’s DataViewer (http://jcb-dataviewer.rupress.org/) is based on OMERO and provides web-based interactivity, whereas the latter’s ISP software uses VTK and requires readers of ISP-enabled articles to download and install the ISP software on their machine.

The first attempts at distributing large image datasets to the biology community have come from atlases of gene expression in model organisms31–35,65 (Table 4). Two projects have now completed respectively a transcriptome atlas for the adult mouse brain35 (Allen Mouse Brain Atlas) and for the mouse embryo (Eurexpress). In these projects images of tissue sections are captured at about 0.5 µm pixel resolution, resulting in images with pixel dimensions of about 4,000 × 4,000. With sampling through the brain or embryo at about 150 µm and for most (~20,000) expressed genes, this results in an archive of millions of images. These have been manually and automatically annotated; in the case of the brain data, this was done using 3D registration.

With the exception of Phenobank (http://worm.mpi-cbg.de/phenobank2/cgi-bin/ProjectInfoPage.py), which provides data for a genome-wide time-lapse screen in Caenorhabditis elegans, image data from high-throughput, image-based RNAi screens have not yet arrived in the public domain, although several projects (Mitocheck, GenomeRNAi) aim to make their images available. The logistics and storage requirements are formidable, and perhaps a publicly funded centralized repository similar to GenBank or ArrayExpress should be established. Success of such a repository would depend on the willingness of data producers to share their images through a central system. Alternatively, a distributed infrastructure could be considered.

Querying these resources relies on textual annotations of all images. Although already useful, text-based queries are limited by the lack of ontologies for many descriptive attributes. For example, how can the user retrieve all images of mitotic phenotypes when some annotations are free text and use wording such as “chromosome segregation defect”? However, ontologies will not solve all problems in image retrieval, as many images will not have been annotated with the required level of detail. For example, in a screen, most images are just annotated as ‘not a hit’ for a given phenotype.

Another challenge is to allow browsing without significant download time. In the Edinburgh Mouse Atlas33,66 (Table 4), the data range from medium-resolution (0.5 µm) tissue section images captured using light microscopy to full 3D images captured typically with optical projection tomography (see Box 1). The data are mapped onto a standard mouse embryo model to allow direct comparison and analysis in spatial terms. Mapping the spatial patterns of gene expression provides some powerful options for query and analysis and avoids the partiality, ambiguity and resolution problems of text annotation. For example, the EMAGE gene expression database67, which has spatially mapped patterns, allows query on the basis of spatial location and pattern similarity. Using a straightforward Jaccard index measure, the spatial search is sorted according to similarity with the search area, which can either be manually drawn or can be the pattern resulting from an analysis of an input image, that is, “find one like this.” The same similarity measures can be used to cluster the data and enable varieties of automated data mining. Another example comes from neurobiology: the anatomic gene expression atlas (AGEA)38 of the mouse brain allows users to interactively explore spatial patterns of gene expression through correlation maps, to apply hierarchical, multiscale partitioning of the image volumes according to spatial gene expression similarity and to identify genes with localized enrichment in a chosen region of interest.

Systematic efforts are underway in neurobiology to map the anatomy and connectivity of entire nervous systems. Various imaging modalities are used in tiling mode across serially sectioned tissues, resulting in massive layered image canvases. Several projects use a Google Maps–like user interface to provide web access to the data. In BrainMaps (Table 1), images of serial sections of both primate and nonprimate brains scanned with submicrometer resolution are presented in a Google Maps–like viewer with possibilities for controlled vocabulary annotation by registered users. CATMAID also uses the same navigation principle to allow collaborative manual annotation of serial-section transmission electron microscopy images of the Drosophila larval brain imaged at a resolution of 4 nm per pixel5. CATMAID implements synchronized browsing and annotation of linked multimodal image data (such as confocal and electron microscopy). CATMAID does not require the use of a central data repository but instead allows the images to be distributed across several laboratories and thus avoids duplication of massive datasets. The use of a lightweight web client that is ‘aware’ of various datasets around the world through a collection of expert-submitted annotations in a centralized database seems to be a good approach to start mapping large biological image collections using a community-driven effort. The alternative possibility would be the creation of well-structured repositories. Such repositories should ideally provide three levels of access to the data. The first level represents the raw data, and the second level comprises the data analysis results and/or annotations, with relevant external information. The third level consists in integrating different data sources such that data from multiple datasets can be simultaneously queried.

These efforts at making biological image data accessible and usable by the scientific community are paralleled by similar efforts in the MRI community. But whereas clinical scanners often rely upon specialized servers dedicated to the storage and distribution of images (so-called picture archiving and communication systems, or PACS), most research facilities rely upon local systems and protocols for data storage and organization. Large-scale research databases do exist that provide access for the neuroimaging community. Both the Alzheimer’s Disease Neuroimaging Initiative (ADNI68) and the International Consortium for Brain Mapping (ICBM69) make use of the Laboratory of Neuro Imaging’s Image Data Archive (IDA) to store data for thousands of subjects, including MRI, positron emission tomography, magnetic resonance angiography and DTI, as well as related meta-data70. The system provides investigators with fine-grained control of data access rights, ranging from allowing data to be made fully public to restricting it to collaborators. The Functional MRI Data Center (fMRIDC71) provides investigators with a repository for peer-reviewed fMRI studies and underlying data, which may then be requested by visitors to the fMRIDC website. The Open Access Series of Imaging Studies (OASIS72) also provides collections of hundreds of brain MRI volumes to the scientific community at no charge. The Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC; http://nitrc.org/) provides a central repository through which neuroimaging resources, such as software tools or data, may be described, disseminated and discussed. The development of large, publicly available collections of data provides opportunities for large-scale neuroimaging studies, which introduces new challenges for information visualization.

Table 4 | Selected image repositories

Name | Description | URL
4DXpress | Cross-species gene expression database | http://tiny.cc/DT6lh
ADNI | Imaging and genetics data from elderly controls and subjects with mild cognitive impairment and Alzheimer’s disease | http://www.loni.ucla.edu/ADNI/
Allen Brain Atlas | Interactive, genome-wide image database of gene expression | http://www.brain-map.org/
APOGEE | Atlas of patterns of gene expression in Drosophila embryogenesis | http://tiny.cc/ZASKo
BIRN | Lists of tools and datasets, mostly from the neuroimaging community | http://www.birncommunity.org/
Bisque | Exchange and exploration of biological images | http://www.bioimage.ucsb.edu/
Cell Centered Database | Database for high-resolution 2D, 3D and 4D data from light and electron microscopy, including correlated imaging | http://ccdb.ucsd.edu/index.shtm
Edinburgh Mouse Atlas | Atlas of mouse embryonic development (emap) and gene expression patterns (EMAGE), 2D, 3D spatially annotated data | http://www.emouseatlas.org/
Fly-FISH | Atlas of patterns of RNA localization in Drosophila embryogenesis | http://fly-fish.ccbr.utoronto.ca/
fMRIDC | Public repository of peer-reviewed functional MRI studies and underlying data | http://www.fmridc.org/
ICBM | Web-based query interface for selecting subject data from the ICBM archive | http://tiny.cc/0JDXE
Mitocheck | Microscopy-based RNAi screening data | http://www.mitocheck.org/
OASIS | Cross-sectional MRI data in young, middle-aged, undemented and demented older adults; longitudinal MRI data in undemented and demented older adults | http://www.oasis-brains.org/
ZFIN | The zebrafish model organism database | http://tiny.cc/uitMn

2D, 3D and 4D: two-, three- and four-dimensional, respectively.
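
The Jaccard-based spatial search described above for spatially mapped expression patterns can be sketched in a few lines. This is an illustration under stated assumptions, not the EMAGE implementation: patterns are represented as sets of pixel coordinates in a shared reference space, and the names jaccard, spatial_search, block and the gene labels are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets of pixel coordinates."""
    union = a | b
    if not union:
        return 0.0  # two empty regions: define similarity as zero
    return len(a & b) / len(union)

def spatial_search(query, patterns):
    """Return (name, score) pairs sorted by similarity to the query region, best first."""
    scored = [(name, jaccard(query, region)) for name, region in patterns.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

def block(x0, x1, y0, y1):
    """Helper: rectangular region covering x0 <= x < x1 and y0 <= y < y1."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}

# Hypothetical expression patterns already mapped onto a common reference model.
patterns = {
    "gene_a": block(0, 4, 0, 4),      # identical to the query region
    "gene_b": block(2, 6, 0, 4),      # overlaps half of the query region
    "gene_c": block(10, 12, 10, 12),  # no overlap
}
query = block(0, 4, 0, 4)  # a manually drawn search area, or a pattern from an input image
ranking = spatial_search(query, patterns)
print(ranking)
```

Because the score depends only on the mapped regions, the same function can serve as the similarity measure for clustering patterns, as the text notes.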

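The statement that multidimensional scaling seeks a low-dimensional mapping that preserves distances between vectors can be made concrete through the quantity that Sammon mapping minimizes. The sketch below only evaluates Sammon's stress for a given embedding; it does not perform the minimization, and the point coordinates are invented for illustration.

```python
import math
from itertools import combinations

def sammon_stress(high, low):
    """Sammon's stress between high-dimensional points and their low-dimensional images.

    stress = (1 / sum d*_ij) * sum (d*_ij - d_ij)^2 / d*_ij over all pairs i < j,
    where d* is the distance in the original space and d the distance in the embedding.
    """
    total = 0.0
    error = 0.0
    for i, j in combinations(range(len(high)), 2):
        d_high = math.dist(high[i], high[j])
        d_low = math.dist(low[i], low[j])
        if d_high == 0.0:
            continue  # coincident points carry no distance information
        total += d_high
        error += (d_high - d_low) ** 2 / d_high
    return error / total if total else 0.0

# Hypothetical feature vectors for three images (they happen to lie in the plane z = 0).
features = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
faithful = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]  # keeps every pairwise distance
collapsed = [(0.0,), (3.0,), (0.0,)]             # a 1D projection that merges two images

print(sammon_stress(features, faithful))   # no distortion
print(sammon_stress(features, collapsed))  # distances are distorted
```

A layout of the kind used for spatial placement of images would be obtained by adjusting the low-dimensional coordinates until this stress stops decreasing.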

Perspectives on image data integration

Although some tools are available for browsing and querying a complex large-scale dataset, a systems understanding will require more-comprehensive queries and views. Just as the sequencing revolution led to tools such as BLAST73 that allow users to find, compare, sort and make inferences from vast numbers of sequences, a corresponding set of tools must be developed for imaging to fully exploit the flood of data becoming available from modern imaging techniques and provide a foundation from which to build a spatially aware systems-biology model of the cell. At their best, “graphics are instruments for reasoning about quantitative information”74. In this context, visualization extends beyond presentation of image data per se to enable, facilitate and integrate statistical tests, mathematical modeling and simulation and automated reasoning over multiple data types. Primary image data are combined with instrument and experimental meta-data, data derived from analysis of the image series and information from diverse external resources. In this way, visualization can help biologists and modelers address broader questions: for example, how specific pathways or functions are organized spatially within a cell and how they change with cell type, disease state or treatment. Already environments such as Virtual Cell and BioSPICE embed analyses (for example, biomolecular interaction networks, or simulations of metabolite flow through pathways) in an abstract two- or 3D space. The Visible Cell environment Illoura63 (Fig. 4h) aims to visualize such analysis in empirical two- or 3D microscopy data. Within the Visible Cell, first primary microscopy data are segmented and the resulting objects are marked up ontologically and stored in a way that supports further annotation. Key elements are database federation to enable multiuser access; data storage in Semantic Web format Resource Description Framework (RDF) to enable complex semantic queries across multiple data types; and visualization components to view and interact with two- and 3D spatial data and results of queries.

It is said that a picture is worth a thousand words. It is difficult to imagine how many words the enormous image space of biological data would be worth. Many primary data in modern biology are in the form of images. These images are a rich source of qualitative as well as quantitative information about the biological system in many dimensions and across many scales. Biologists and computer scientists put substantial efforts into untangling this vast image space and giving meaning to these visual data. Its visualization provides the scientific community with tools to gain systematic and unprecedented insights at many levels of biological scale.

ACKNOWLEDGMENTS
T.W. was supported by a grant to J.E. (within the Mitocheck European Integrated Project LSHG-CT-2004-503464). D.W.S. was partially supported by US National Institutes of Health (NIH) grant P41 RR013642. M.E.B. was partly supported by NIH grant R01 EB004155-03. S.P. was partially supported by NIH grant P41 RR13218. S.D. was supported by the Wellcome Trust. J.-K.H. was supported by the ENFIN European Network of Excellence (contract LSHG-CT-2005-518254) awarded to J.E. A.E.C. and A.F. were supported by NIH grant 5 RL1 CA133834-03. J.E.S. was supported by the British Heart Foundation (grant BS/06/001) and the BBSRC (grant E003443). This work was funded in part through the NIH Roadmap for Medical Research grants U54 RR021813 (D.W.S.) and U54 EB005149 (S.P.). Information on the US National Centers for Biomedical Computing can be obtained from http://nihroadmap.nih.gov/bioinformatics/.

COMPETING INTERESTS STATEMENT
The authors declare no competing financial interests.

Published online at http://www.nature.com/naturemethods/.
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/.

1. Goldberg, I.G. et al. The Open Microscopy Environment (OME) Data Model and XML file: open tools for informatics and quantitative analysis in biological imaging. Genome Biol. 6, R47 (2005).
2. Moore, J. et al. Open tools for storage and management of quantitative image data. Methods Cell Biol. 85, 555–570 (2008).
3. Swedlow, J.R., Goldberg, I.G. & Eliceiri, K.W. & the OME Consortium. Bioimage informatics for experimental biology. Annu. Rev. Biophys. 38, 327–346 (2009).
4. Cox, R. et al. A (sort of) new image data format standard: NifTI-1. Neuroimage 22, 99 (2004).
5. Saalfeld, S., Cardona, A., Hartenstein, V. & Tomancák, P. CATMAID: Collaborative Annotation Toolkit for Massive Amounts of Image Data. Bioinformatics 25, 1984–1986 (2009).
6. Levoy, M. Display of surfaces from volume data. IEEE Comput. Graph. Appl. 8, 29–37 (1988).
   This paper is a seminal work on the rendering of volumetric data by directly shading each voxel value and projecting it onto the viewing plane. The method provides realistic volumetric rendering without the need to model the data with geometric primitives.
7. Pieper, S.D., Halle, M. & Kikinis, R. 3D Slicer. in IEEE International Symposium on Biomedical Imaging: From Nano to Macro 632–635 (2004).
8. Pieper, S., Lorensen, B., Schroeder, W. & Kikinis, R. The na-mic kit: Itk, vtk, pipelines, grids and 3d slicer as an open platform for the medical image computing community. in IEEE International Symposium on Biomedical Imaging: From Nano to Macro 698–701 (2006).
9. Gordon, J.L., Buguliskis, J.S., Buske, P.J. & Sibley, L.D. Actin-like protein 1 (ALP1) is a component of dynamic, high molecular weight complexes in Toxoplasma gondii. Cell Motil. Cytoskeleton 67, 23–31 (2009).
10. Friston, K.J. et al. Statistical parametric maps in functional imaging: a general linear approach. Hum. Brain Mapp. 2, 189–210 (1994).
11. Smith, S.M. et al. Advances in functional and structural MR image analysis and implementation as FSL. Neuroimage 23 suppl. 1, S208–S219 (2004).
12. Dufour, A. et al. Segmenting and tracking fluorescent cells in dynamic 3-D microscopy with coupled active surfaces. IEEE Trans. Image Process. 14, 1396–1410 (2005).
13. Jaqaman, K. et al. Robust single-particle tracking in live-cell time-lapse sequences. Nat. Methods 5, 695–702 (2008).
14. Keller, P.J., Schmidt, A.D., Wittbrodt, J. & Stelzer, E.H. Reconstruction of zebrafish early embryonic development by scanned light sheet microscopy. Science 322, 1065–1069 (2008).
15. Fischl, B. & Dale, A.M. Measuring the thickness of the human cerebral cortex from magnetic resonance images. Proc. Natl. Acad. Sci. USA 97, 11050–11055 (2000).
16. Fischl, B. et al. Sequence-independent segmentation of magnetic resonance images. Neuroimage 23 suppl. 1, S69–S84 (2004).
17. Salat, D.H. et al. Age-associated alterations in cortical gray and white matter signal intensity and gray to white matter contrast. Neuroimage 48, 21–28 (2009).
18. Schubert, W. et al. Analyzing proteome topology and function by automated multidimensional fluorescence microscopy. Nat. Biotechnol. 24, 1270–1278 (2006).
19. Le Bihan, D. et al. MR imaging of intravoxel incoherent motions: application to diffusion and perfusion in neurologic disorders. Radiology 161, 401–407 (1986).
20. Basser, P.J., Mattiello, J. & LeBihan, D. Estimation of the effective self-diffusion tensor from the NMR spin echo. J. Magn. Reson. B. 103, 247–254 (1994).
21. Tuch, D.S. et al. High angular resolution diffusion imaging reveals intravoxel white matter fiber heterogeneity. Magn. Reson. Med. 48, 577–582 (2002).
22. Wedeen, V.J., Hagmann, P., Tseng, W.-Y.I., Reese, T.G. & Weisskoff, R.M. Mapping complex tissue architecture with diffusion spectrum magnetic resonance imaging. Magn. Reson. Med. 54, 1377–1386 (2005).
23. Hebert, B., Costantino, S. & Wiseman, P.W. Spatiotemporal image correlation spectroscopy (STICS) theory, verification, and application to protein velocity mapping in living CHO cells. Biophys. J. 88, 3601–3614 (2005).
24. Lorensen, W.E. & Cline, H.E. Marching cubes: a high resolution 3D surface construction algorithm. SIGGRAPH ‘87: Proc. 14th Ann. Conf. Computer Graphics and Interactive Techniques 21, 163–169 (1987).
    This paper presented a fast algorithm for computing a triangular mesh corresponding to an isosurface in a 3D data volume.
25. Lindig, T.M. et al. Spiny versus stubby: 3D reconstruction of human myenteric (type I) neurons. Histochem. Cell Biol. 131, 1–12 (2009).
26. McAuliffe, M. et al. Medical image processing, analysis and visualization in clinical research. in Proc. 14th IEEE Symp. Computer-based Medical Systems (CBMS2001) 381–386 (2001).
27. Shattuck, D.W. & Leahy, R.M. BrainSuite: an automated cortical surface identification tool. Med. Image Anal. 6, 129–142 (2002).

S40 | VOL.7 NO.3s | MARCH 2010 | nature methods supplement review

28. Fischl, B., Sereno, M.I. & Dale, A.M. Cortical surface-based analysis. II: Inflation, flattening, and a surface-based coordinate system. Neuroimage 9, 195–207 (1999).
29. Goebel, R., Esposito, F. & Formisano, E. Analysis of functional image analysis contest (FIAC) data with Brainvoyager QX: from single-subject to cortically aligned group general linear model analysis and self-organizing group independent component analysis. Hum. Brain Mapp. 27, 392–401 (2006).
30. Cointepas, Y., Mangin, J.-F., Garnero, L., Poline, J.-B. & Benali, H. BrainVISA: software platform for visualization and analysis of multi-modality brain data. Neuroimage 13, S98 (2001).
31. Visel, A., Thaller, C. & Eichele, G. GenePaint.org: an atlas of gene expression patterns in the mouse embryo. Nucleic Acids Res. 32, D552–D556 (2004).
32. Gray, P.A. et al. Mouse brain organization revealed through direct genome-scale TF expression analysis. Science 306, 2255–2257 (2004).
33. Christiansen, J.H. et al. EMAGE: a spatial database of gene expression patterns during mouse embryo development. Nucleic Acids Res. 34, D637–D641 (2006).
34. Tomancak, P. et al. Global analysis of patterns of gene expression during Drosophila embryogenesis. Genome Biol. 8, R145 (2007).
35. Lein, E.S. et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature 445, 168–176 (2007).
36. Fowlkes, C.C. et al. A quantitative spatiotemporal atlas of gene expression in the Drosophila blastoderm. Cell 133, 364–374 (2008).
37. Hill, D.L.G., Batchelor, P.G., Holden, M. & Hawkes, D.J. Medical image registration. Phys. Med. Biol. 46, R1–R45 (2001).
38. Ng, L. et al. Neuroinformatics for genome-wide 3D gene expression mapping in the mouse brain. IEEE/ACM Trans. Comput. Biol. Bioinform. 4, 382–393 (2007).
39. Klein, A. et al. Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI registration. Neuroimage 46, 786–802 (2009).
40. Lowe, D. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 91–110 (2004).
41. Preibisch, S., Saalfeld, S., Rohlfing, T. & Tomancak, P. Bead-based mosaicing of single plane illumination microscopy images using geometric local descriptor matching. Proc. SPIE 7259 (2009).
42. Lindeberg, T. Scale-space for discrete signals. IEEE Trans. Pattern Anal. Mach. Learn. 12, 234–254 (1990).
43. Tharin, S. & Golby, A. Functional brain mapping and its applications to neurosurgery. Neurosurgery 60, 185–201; discussion 201–202 (2007).
44. Lau, C. et al. Exploration and visualization of gene expression with neuroanatomy in the adult mouse brain. BMC Bioinformatics 9, 153 (2008).
45. Bertrand, L. & Nissanov, J. The Neuroterrain 3D mouse brain atlas. Front. Neuroinformatics 2, 3 (2008).
46. Carpenter, A.E. & Sabatini, D.M. Systematic genome-wide screens of gene function. Nat. Rev. Genet. 5, 11–22 (2004).
47. Pepperkok, R. & Ellenberg, J. High-throughput fluorescence microscopy for systems biology. Nat. Rev. Mol. Cell Biol. 7, 690–696 (2006).
48. Glory, E. & Murphy, R.F. Automated subcellular location determination and high-throughput microscopy. Dev. Cell 12, 7–16 (2007).
49. Simpson, J.C., Wellenreuther, R., Poustka, A., Pepperkok, R. & Wiemann, S. Systematic subcellular localization of novel proteins identified by large-scale cDNA sequencing. EMBO Rep. 1, 287–292 (2000).
50. Lécuyer, E. & Tomancak, P. Mapping the gene expression universe. Curr. Opin. Genet. Dev. 18, 506–512 (2008).
51. Sönnichsen, B. et al. Full-genome RNAi profiling of early embryogenesis in Caenorhabditis elegans. Nature 434, 462–469 (2005).
52. Tomancak, P. et al. Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biol. 3, research0088.1–0088.14 (2002).
53. Han, L., Hemert, J., Baldock, R. & Atkinson, M. Automating gene expression annotation for mouse embryo. in Proceedings of the 5th International Conference on Advanced Data Mining and Applications 469–478 (Springer, Beijing, 2009).
54. Newberg, J. & Murphy, R.F. A framework for the automated analysis of subcellular patterns in human protein atlas images. J. Proteome Res. 7, 2300–2308 (2008).
55. Ji, S., Sun, L., Jin, R., Kumar, S. & Ye, J. Automated annotation of Drosophila gene expression patterns using a controlled vocabulary. Bioinformatics 24, 1881–1888 (2008).
56. Peng, H. et al. Automatic image analysis for gene expression patterns of fly embryos. BMC Cell Biol. 8 suppl. 1, S7 (2007).
57. Gehlenborg, N. et al. Visualization of omics data for systems biology. Nat. Methods 7, S56–S68 (2010).
58. Carpenter, A.E. et al. CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biol. 7, R100 (2006).
59. Jones, T.R. et al. CellProfiler Analyst: data exploration and analysis software for complex image-based screens. BMC Bioinformatics 9, 482 (2008).
60. Kruskal, J.B. & Wish, M. Multidimensional Scaling (Sage Publications, Beverly Hills, California, USA and London, 1978).
61. Sammon, J.W. A nonlinear mapping for data structure analysis. IEEE Trans. Comput. C-18, 401–409 (1969).
62. Hamilton, N.A., Wang, J.T.H., Kerr, M.C. & Teasdale, R.D. Statistical and visual differentiation of subcellular imaging. BMC Bioinformatics 10, 94 (2009).
63. McComb, T. et al. Illoura: a software tool for analysis, visualization and semantic querying of cellular and other spatial biological data. Bioinformatics 25, 1208–1210 (2009).
64. Walter, T. et al. Automatic identification and clustering of chromosome phenotypes in a genome wide RNAi screen by time-lapse imaging. J. Struct. Biol. published online, doi:10.1016/j.jsb.2009.10.004 (23 October 2009).
65. Ringwald, M. et al. A database for mouse development. Science 265, 2033–2034 (1994).
66. Richardson, L. et al. EMAGE mouse embryo spatial gene expression database: 2010 update. Nucleic Acids Res. 38, D703–D709 (2010).
67. Baldock, R.A. et al. EMAP and EMAGE: a framework for understanding spatially organized data. Neuroinformatics 1, 309–325 (2003).
68. Mueller, S.G. et al. Ways toward an early diagnosis in Alzheimer’s disease: the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Alzheimers Dement. 1, 55–66 (2005).
69. Mazziotta, J. et al. A probabilistic atlas and reference system for the human brain: International Consortium for Brain Mapping (ICBM). Phil. Trans. R. Soc. Lond. B 356, 1293–1322 (2001).
70. Toga, A.W. Neuroimage databases: the good, the bad and the ugly. Nat. Rev. Neurosci. 3, 302–309 (2002).
71. Van Horn, J.D. et al. The Functional Magnetic Resonance Imaging Data Center (fMRIDC): the challenges and rewards of large-scale databasing of neuroimaging studies. Phil. Trans. R. Soc. Lond. B 356, 1323–1339 (2001).
72. Marcus, D.S. et al. Open Access Series of Imaging Studies (OASIS): Cross-sectional MRI data in young, middle aged, nondemented, and demented older adults. J. Cogn. Neurosci. 19, 1498–1507 (2007).
73. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
74. Tufte, E.R. The Visual Display of Quantitative Information (Graphics Press, Cheshire, Connecticut, USA, 2001).
The classic text on the science of data visualization.
75. Moffat, J. et al. A lentiviral RNAi library for human and mouse genes applied to an arrayed viral high-content screen. Cell 124, 1283–1298 (2006).
76. Wurdinger, T. et al. A secreted luciferase for ex vivo monitoring of in vivo processes. Nat. Methods 5, 171–173 (2008).
77. Ejsmont, R.K., Sarov, M., Winkler, S., Lipinski, K.A. & Tomancak, P. A toolkit for high-throughput, cross-species gene engineering in Drosophila. Nat. Methods 6, 435–437 (2009).
78. Howles, G.P., Ghaghada, K.B., Qi, Y., Munkundan, S. & Johnson, G.A. High-resolution magnetic resonance angiography in the mouse using a nanoparticle blood-pool contrast agent. Magn. Reson. Med. 62, 1447–1456 (2009).
79. Maudsley, A.A. et al. Comprehensive processing, display and analysis for in vivo MR spectroscopic imaging. NMR Biomed. 19, 492–503, doi:10.1002/nbm.1025 (2006).
80. Thompson, P.M. et al. Dynamics of gray matter loss in Alzheimer’s disease. J. Neurosci. 23, 994–1005 (2003).
81. Plank, G. et al. Generation of histo-anatomically representative models of the individual heart: tools and application. Phil. Transact. A Math. Phys. Eng. Sci. 367, 2257–2292 (2009).
82. Chiang, M.C. et al. Fluid registration of diffusion tensor images using information theory. IEEE Trans. Med. Imaging 27, 442–456 (2008).
83. Lichtman, J.W. & Conchello, J.A. Fluorescence microscopy. Nat. Methods 2, 910–919 (2005).
84. Conchello, J.-A. & Lichtman, J.W. Optical sectioning microscopy. Nat. Methods 2, 920–931 (2005).
85. Hell, S.W. Toward fluorescence nanoscopy. Nat. Biotechnol. 21, 1347–1355 (2003).
86. Helmchen, F. & Denk, W. Deep tissue two-photon microscopy. Nat. Methods 2, 932–940 (2005).
87. Betzig, E. et al. Imaging intracellular fluorescent proteins at nanometer resolution. Science 313, 1642–1645 (2006).
88. Rust, M.J., Bates, M. & Zhuang, X. Sub-diffraction-limit imaging by stochastic optical reconstruction microscopy (STORM). Nat. Methods 3, 793–795 (2006).
89. Sharpe, J. et al. Optical projection tomography as a tool for 3D microscopy and gene expression studies. Science 296, 541–545 (2002).
90. Huisken, J., Swoger, J., Del Bene, F., Wittbrodt, J. & Stelzer, E.H. Optical sectioning deep inside live embryos by selective plane illumination microscopy. Science 305, 1007–1009 (2004).


Visualization of macromolecular structures

Seán I O’Donoghue1, David S Goodsell2, Achilleas S Frangakis3, Fabrice Jossinet4, Roman A Laskowski5, Michael Nilges6, Helen R Saibil7, Andrea Schafferhans1, Rebecca C Wade8, Eric Westhof4 & Arthur J Olson2

Structural biology is rapidly accumulating a wealth of detailed information about protein function, binding sites, RNA, large assemblies and molecular motions. These data are increasingly of interest to a broader community of life scientists, not just structural experts. Visualization is a primary means for accessing and using these data, yet visualization is also a stumbling block that prevents many life scientists from benefiting from three-dimensional structural data. In this review, we focus on key biological questions where visualizing three-dimensional structures can provide insight and describe available methods and tools.

1European Molecular Biology Laboratory, Heidelberg, Germany. 2The Scripps Research Institute, La Jolla, California, USA. 3Goethe University, Frankfurt, Germany. 4Institut de Biologie Moléculaire et Cellulaire du Centre National de la Recherche Scientifique (CNRS), Université de Strasbourg, Strasbourg, France. 5European Bioinformatics Institute, Cambridge, UK. 6Institut Pasteur, Paris, France. 7Institute of Structural and Molecular Biology, Birkbeck College, London, UK. 8Heidelberg Institute for Theoretical Sciences (HITS), Heidelberg, Germany. Correspondence should be addressed to S.I.O. ([email protected]).
Published online 1 March 2010; doi:10.1038/nmeth.1427

Decades ago, when structural biology was still in its infancy, structures were rare and structural biologists often dedicated years of their life to studying just one structure at atomic detail. The first tools used for visualizing macromolecular structures were tools for specialists.

Today’s situation is very different: the rate at which structures are solved has greatly increased, with over 60,000 high-resolution protein structures now available in the consolidated Worldwide Protein Data Bank (wwPDB)1. These data provide a wealth of detailed information that can yield significant insight into macromolecular function. To use this information most effectively, visualization tools have been developed and are increasingly becoming everyday tools for biologists. For example, many biochemists regularly view protein structures to gain insight into protein function (Fig. 1). Chemists look at ligand-binding sites as part of drug design. Molecular biologists view RNA structures and complexes with proteins to gain insight into RNA signal and message processing. Some aspects of structure visualization remain mostly the domain of the specialist, such as molecular motion and large-scale molecular assemblies. Even in these intrinsically more complex fields, however, resources are beginning to enable bench biologists to visualize and use this information.

However, although structural information is now viewed and used by a large and diverse group of scientists, most of them are not prepared to spend months learning complex user interfaces or scripting languages. Even today, complex user interfaces in visualization tools are often a stumbling block, preventing many scientists from benefiting from structural data. Even structural experts have come to expect ease of use from molecular graphics tools, in addition to improved speed, features and capabilities.

In the past, molecular graphics tools were invariably stand-alone, designed to view one molecular system at a time. Today’s tools are increasingly internet aware, often integrated tightly with structure databases (Table 1), as well as with databases containing sequences and other features (for example, domains, single-nucleotide polymorphisms (SNPs), interactions).

Today, we are spoiled for choice when it comes to molecular graphics tools for viewing proteins and other macromolecular structures. Indeed, the sheer range of available tools can be overwhelming. Many molecular graphics tools have been developed to address diverse requirements, as documented in recent reviews2–4 and in several web resources maintaining lists of such tools (see footnote to Table 1). Most of these tools have a large set of features in common, including standard representations (ribbon, space-filling, ball-and-stick and so on) and coloring schemes (element-based coloring of atoms, coloring by secondary structure and so on). It is beyond the scope of this review to comprehensively compare all of these tools; instead, we focus on key biological questions for which visualizing structures can provide insight, and we highlight practical methods and tools with outstanding features that are particularly suited to addressing these questions.

Protein structures

Finding three-dimensional structures. For a biochemist looking to use three-dimensional structures to gain insight into the functions of a particular protein, the typical first step is a search for relevant structures. This task is considerably simplified by the remarkable degree to which all experimentally determined

Table 1 | Selected resources for finding and visualizing macromolecules

Stand-alone
Name | Cost | OS | Description | URL
Amira | $ | Win, Mac, Linux | Combines many different methods and scripting (EDM, MRI, optical) | http://www.amiravis.com/
Cn3D17 | Free | Win, Mac, Linux | Integrated sequence alignment view; embeddable | http://tinyurl.com/Cn3D-NCBI/
Chime | Free | Win | Widely used; structure editing; electrostatic maps; embeddable | http://tinyurl.com/chime-pro/
Chimera16 | Free | Win, Mac, Linux | Popular; integrated sequence alignment viewer (EDM, MD) | http://www.cgl.ucsf.edu/chimera/
DS Visualizer | Free | Win, Mac, Linux | Free version of Accelrys’s powerful viewer/editor program | http://tinyurl.com/DSVisualizer/
ICM-Browser | Free | Win, Mac, Linux | High quality images; integrates with sequence alignment viewer | http://tinyurl.com/icm-browser/
IMOD109 | Free | Win, Mac, Linux | Tomogram alignment, display, segmentation (EDM, optical) | http://bio3d.colorado.edu/imod/
Jmol | Free | Win, Mac, Linux | Widely used; embeddable | http://www.jmol.org/
KiNG | Free | Win, Mac, Linux | Generic tool for creating ‘kinemages’ | http://tinyurl.com/KiNGapp/
Mage6 | Free | Win, Mac, Linux | Generic tool for creating ‘kinemages’; allows structure editing | http://tinyurl.com/kinemage/
MOE | $ | Win, Mac, Linux | Integrated multifunctional suite; useful for drug design (MM) | http://www.chemcomp.com/
Molscript23 | Free | Unix | Useful for preparing manuscript images | http://www.avatar.se/molscript/
MolSurfer | Free | Win, Mac, Linux | Shows macromolecular interfaces, for example, by electrostatic potential | http://tinyurl.com/molsurfer/
MOLMOL35 | Free | Win, Mac, Linux | Many features, particularly suited for NMR structures | http://tinyurl.com/molmol1/
OpenAstexViewer18 | Free | Win, Mac, Linux | Embedded in many PDBe (see below) services | http://www.openastexviewer.net/
ProSAT2 (ref. 31) | Free | Win, Mac, Linux | Displays sequence features on three-dimensional structure | http://tinyurl.com/ProSAT2/
PMV25 | Free | Win, Mac, Linux | Dynamically extensible; multiple structures, large assemblies (MM) | http://tinyurl.com/PMV-MGL/
PyMOL | Free | Win, Mac, Linux | Widely used; embeddable; high-quality images (EDM, MM) | http://www.pymol.org/
RasMol110 | Free | Win, Mac, Linux | Widely used; fast; scripting | http://www.rasmol.org/
Raster3D24 | Free | Win, Mac, Linux | High-quality, photorealistic rendering | http://tinyurl.com/raster3d/
SPICE27 | Free | Win, Mac, Linux | Adds DAS features to three-dimensional structures | http://tinyurl.com/spice-browser/
STRAP19 | Free | Win, Mac, Linux | Editor for structural alignments of proteins (HM) | http://tinyurl.com/STRAP1/
Swiss-PdbViewer20 | Free | Win, Mac, Linux | Integrated sequence view (EDM, MM) | http://spdbv.vital-it.ch/
SYBYL | $ | Win, Mac, Linux | Popular molecular modeling tool (MM) | http://tinyurl.com/triposSYBYL/
VMD26* | Free | Win, Mac, Linux | Widely used; extensible, many add-ons (EDM, MD, MM, NMR) | http://tinyurl.com/VMD-viewer/
WHAT IF42 | $ | Win, Mac, Linux | Powerful features; good support (EDM, HM, MM) | http://swift.cmbi.ru.nl/whatif/
Yasara | Free | Win, Mac, Linux | Innovative ‘virtual reality’ graphical user-interface (EDM, MM, NMR) | http://www.yasara.org/

Web-based
Name | Cost | Description | URL
CAME | Free | Assesses structure quality (ProSA-Web111); finds structural homologs | http://www.came.sbg.ac.at/
EMDB | Free | Central repository for electron microscopy density maps | http://emdatabank.org/
Entrez Structure | Free | Finds related structures for a sequence | http://tinyurl.com/entrez3d/
FirstGlance | Free | Useful for a first impression of a structure | http://firstglance.jmol.org/
JenaLib28 | Free | Displays sequence features on three-dimensional structure | http://tinyurl.com/JenaLib/
NDB68 | Free | Central repository for nucleic acid structures | http://ndbserver.rutgers.edu/
PDBe | Free | European branch of wwPDB (formerly MSD); many services | http://www.ebi.ac.uk/pdbe/
PDBsum29 | Free | Pictorial structural annotations | http://www.ebi.ac.uk/pdbsum/
PISA33 | Free | Predicts biologically relevant quaternary structure | http://tinyurl.com/piserver/
Relibase58,59 | Free/$ | Finds similar ligands and binding sites; free version has limits | http://tinyurl.com/relibase/
RCSB PDB5* | Free | US branch of wwPDB; has wide range of services | http://www.pdb.org/
PMP10 | Free | Consolidated portal for homology-modeled structures | http://tinyurl.com/ThePMP/
Proteopedia94 | Free | Community annotation of structures | http://www.proteopedia.org/
SRS 3D7 | Free | Finds related structures for a sequence; displays sequence features | http://SRS3D.org/
Swiss-Model11 | Free | Finds related structures for a sequence | http://swissmodel.expasy.org/
TraceSuite II | Free | Maps phylogenetic information onto structures, finds functional residues | http://tinyurl.com/TraceSuite/

The table shows only tools with outstanding features or strengths; more complete lists can be found on Wikipedia (http://tinyurl.com/moleculargraphics/), at the World Index of Molecular Visualization Resources (http://www.molvisindex.org/), at the PDB (http://tinyurl.com/moleculargraphics-pdb/) and at http://molviz.org/. *Our recommendations. Free means the tool is free for academic use; $ means there is a cost. OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. EDM, electron density maps; HM, homology modeling; MD, molecular dynamics; MM, molecular modeling and molecular orbital visualization; MRI, magnetic resonance imaging; NMR, nuclear magnetic resonance; optical, optical microscopy.


Figure 1 | Visualizing a tyrosine kinase structure (PDB 1QCF)97. (a–d,f) A simple way to gain insight into function is to use ribbon representation colored by sequence features: for example, domains (a), SNPs (b), exons (c), protein binding sites (d) and sequence conservation (f). (e) An effective way to show overall shape is with nonphotorealistic rendering using flat colors and outlines. (g,h) Solvent-accessible surfaces are often used for displaying electrostatic (g) and hydrophobic potentials (h; hydrophilic in saturated colors and hydrophobic in white). (i) Superposition is commonly used to compare two or more related structures: for example, two distinct states of the same protein, or, as shown here, two separate proteins with similar structure (PDB 1QCF and 1FMK)98. (j,k) Increasingly many tools have an integrated, interactive sequence viewer, which helps users understand the relationship between sequence and three-dimensional structure. Images were made using SRS 3D7 (a–d,f,j,k), PMV25 (e,g,h) and RCSB PDB5 (i).
Panel labels: (a) Domains (Kinase, SH3, SH2); (b) SNPs (A->V and P->L, both in leukemia); (c) Exons 1–12; (d) Protein binding sites (interacts with PTPRH); (e) Non-photorealistic rendering; (f) Sequence conservation (identical, conserved, non-conserved, unaligned); (g) Electrostatic potential (color scale from –5 to 5 kT/e); (h) Hydrophobic potential; (i) Superposition.

protein three-dimensional structures are consolidated into a single data repository, the Worldwide Protein Data Bank (wwPDB)1. Three primary distribution sites (RCSB PDB5, PDB Europe and PDB Japan; Table 1) provide access to the same underlying data bank, each with a wide range of integrated visualization and analysis tools. In addition, the PDB is mirrored at many other sites, some of which provide innovative visualization tools tailored to make specific questions easier to answer (Table 1). Most of these sites offer, embedded directly in their web pages, one or more molecular graphics tools (for example, Jmol, PyMOL, KiNG and Mage6). Increasingly, the process of finding and visualizing structures is becoming one seamless step for most users.

Finding structures from sequence. Several websites (for example, RCSB PDB5) allow the user to find structures using a sequence identifier or BLAST search (Table 1). Entrez Structure and SRS 3D7 allow the sequence to be aligned to any related three-dimensional structure (Fig. 1f). So far, experimental three-dimensional structures have been determined for less than 1% of all known proteins (based on direct links from PDB to protein sequences in UniProt8). However, for around 42–48% of all proteins, at least part of their sequence is considered significantly similar to a PDB entry, so that some structural information can be inferred9,10. Several websites (for example, Swiss-Model11) provide comparative models for such cases12,13. Each service uses slightly varying cut-off criteria for defining ‘significant sequence similarity’ (for example, in some cases depending on the length of aligned regions), but generally >40% sequence identity to a PDB structure is considered sufficiently good to create a high-quality comparative model structure10. These comparative models can be accessed at a single consolidated website, the Protein Model Portal (PMP)10. The original PDB templates also include information on experimental conditions, ligands and cofactors, which can be relevant in deciding to use or discard a comparative model.

For sequences where no template PDB structure can be found by the above resources, it may be possible to calculate a structure using so-called ab initio methods14. However, in spite of progress15, ab initio methods still require much improvement14 and we recommend they be used with caution.

Getting a first impression. To gain an initial overview of a protein structure, it is often useful to choose a representation that hides side chain atoms; ribbon-like representations do that well and also convey information about secondary structure (Fig. 1a–d). Ligand molecules are best displayed in space-filling or ball-and-stick atom representations. Many of the websites in Table 1 provide such a view (for example, FirstGlance, among others), some by default. Typically, each protein chain is colored differently, thus giving a quick insight into the number of molecules present in the PDB entry. To highlight overall shape and form, nonphotorealistic rendering can be very effective (Fig. 1e), especially with images for presentation and publication.

Some molecular graphics tools (for example, Chimera16, Cn3D17, OpenAstexViewer18, SRS 3D7, STRAP19 and Swiss-PdbViewer20) offer an integrated view of both the amino acid


BO X 1 X-RAY CRYSTAL STRUCTURES

About 86% of PDB entries are derived from X-ray crystallography. Each X-ray structure has a resolution value, that is, a measure of the crystal order. The expected error in the three-dimensional atomic coordinates is correlated to, but much smaller than, the resolution. An average resolution is about 2.5 Å, and 1.2 Å is very high quality112. At low resolution (≥4 Å), there is a significant chance of errors in the structure. Each atom also has a B-factor, a parameter correlated with molecular motion, and many molecular graphics tools can use this parameter to color the molecule. However, the correlation with motion is only partial, as other effects contribute to the B-factor113. Regions with very high B-factor (>80) should generally be treated as of unknown structure.

The goodness of fit of the structure to the X-ray crystal data is indicated by the R-value and free R-value114, where values of around 20% are considered to indicate a good structure. The difference between the R-value and the free R-value should be low. The goodness of fit can be examined at the Electron Density Server115. It can be useful to also check further independent measures of structure accuracy, and many are available directly on the web page for each structure at the RCSB PDB site. The PDBe site also has several useful services for assessing accuracy, particularly PDBsum29. A very useful, and independent, accuracy estimate is also available from ProSA-Web111, which can show residue-specific quality scores mapped on the three-dimensional structure (Table 1).

When viewing crystal structures, it is important to be aware that PDB entries only give explicit coordinates of the ‘asymmetric subunit’ of the crystal (that is, the smallest portion of a crystal needed to produce the unit cell of the crystal), which often is only part of the full biologically relevant assembly (Fig. 2).

A primary limitation of X-ray crystallography is the need to crystallize the macromolecule; this is especially difficult for membrane proteins, for proteins with natively disordered regions and for transient complexes. To aid crystallization, macromolecules are often tampered with, for example, by mutating surface residues and truncating segments of sequence; in addition, modifications present in vivo are often missing in X-ray crystal structures. To help with structure calculation, macromolecules are often crystallized with seleno-methionine, heavy metal ions, or other impurities (sometimes even other proteins) that are not present in vivo and that may sometimes distort the structure.

X-ray crystal structures are calculated by fitting an atomic model of the molecule into an electron-density map (EDM) or isosurface and can be visualized by tools such as COOT116, FRODO117 and O118. Highly mobile parts of the molecule often are missing in the EDM and are usually removed from the molecular model, resulting in missing atoms or residues. Current methods for structure determination are dominated by the concept of a rigid structure, although improved methods have very recently been proposed119 that account for molecular motion and produce structural ensembles similar to NMR structures (Box 2).

Figure 2 | Caution for beginners: symmetry in crystal structures. PDB entries often do not have explicit three-dimensional coordinates for all parts of symmetric oligomers. (a,b) For example, in PDB 2C2A107, coordinates are given for only one monomer (a), although the biologically active state is a homodimer (b). (a–e) Usually this information is given in ‘REMARK 350’; however, we recommend using PISA33, which automatically constructs a range of assemblies that occur in the crystal and predicts which of these is most biologically relevant. In this case, PISA gives the asymmetric unit (a), three dimer forms (b,c,d) and the unit cell (e). Increasingly, sites such as RCSB PDB5 provide the biologically relevant assembly precalculated with PISA. Image of PISA output made using VMD26.
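As a minimal sketch of how the B-factor rule of thumb in Box 1 could be applied outside a graphics tool, the following Python fragment averages B-factors per residue from the fixed-column ATOM and HETATM records of a PDB-format file and lists the residues above a cutoff. The function name `high_b_residues`, the helper `make_atom_line`, the demo records and the choice of a per-residue mean are illustrative assumptions, not the method of any tool cited here; only the cutoff of 80 is taken from the text above.

```python
def make_atom_line(serial, name, resname, chain, resseq, x, y, z, occ, b):
    """Build one fixed-column PDB ATOM record (hypothetical helper for the demo)."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")


def high_b_residues(pdb_lines, cutoff=80.0):
    """Return (chain, residue number, residue name) for residues whose
    mean B-factor exceeds `cutoff`. Averaging per residue is an assumption."""
    totals = {}
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        # PDB format columns: chain 22, residue number 23-26,
        # residue name 18-20, temperature factor (B-factor) 61-66.
        key = (line[21], int(line[22:26]), line[17:20].strip())
        b_factor = float(line[60:66])
        running_sum, count = totals.get(key, (0.0, 0))
        totals[key] = (running_sum + b_factor, count + 1)
    return sorted(k for k, (s, n) in totals.items() if s / n > cutoff)


# Invented demo records: one well-ordered glycine, one poorly ordered lysine.
DEMO_LINES = [
    make_atom_line(1, "N", "GLY", "A", 10, 0.0, 0.0, 0.0, 1.0, 25.0),
    make_atom_line(2, "CA", "GLY", "A", 10, 1.5, 0.0, 0.0, 1.0, 30.0),
    make_atom_line(3, "N", "LYS", "A", 11, 3.0, 0.0, 0.0, 1.0, 85.0),
    make_atom_line(4, "CA", "LYS", "A", 11, 4.5, 0.0, 0.0, 1.0, 95.0),
]

flagged = high_b_residues(DEMO_LINES)
print(flagged)
```

In practice the lines would be read from a downloaded entry, and the flagged residues could then be hidden or colored differently in whichever viewer is being used.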

sequence and the three-dimensional structure, and further enable interaction between these two views (Fig. 1j). For example, clicking on a residue in the sequence view causes the corresponding residue to be highlighted and selected in the three-dimensional view, and vice versa. This feature can significantly help a scientist in understanding and using three-dimensional structures. For example, by viewing the location of key residues or sequence motifs, a scientist can assess whether they are likely to be accessible for posttranslational modification, such as phosphorylation21. Some viewers (for example, STRAP19) go one step further, showing structure integrated with a multiple sequence alignment viewer, a feature we anticipate will continue to become available for other viewers22.

For publication and presentations, some viewers can create impressive, ray-traced images (for example, Amira, Chimera16, ICM-Browser, Molscript23 plus Raster3D24, PMV25, PyMOL, VMD26).

The majority of PDB structures are derived from X-ray crystallography (Box 1, Fig. 2), about 13% from NMR spectroscopy (Box 2, Fig. 3) and less than 1% from electron microscopy (Box 3). These three experimental methods often require specific considerations and visualization methods (discussed in each display box).

Viewing sequence features on three-dimensional structures. A very straightforward way to use three-dimensional structures to gain insight into function is by coloring based on features such

nature methods supplement | vol.7 No.3s | march 2010 | S45 review

BOX 2 NMR STRUCTURES

About 13% of PDB entries are derived from NMR spectroscopy. NMR structures are usually deposited in the PDB as an ensemble of 10–50 structures (Fig. 3), providing a visual representation of precision. The ensemble precision derives not from the dynamics of the molecule in solution but from the lack of data to describe the structure fully. The method used for structure calculation also affects the ensemble precision, and recently a significantly improved method has been developed that ensures that the ensemble precision more truly reflects the data120.

Often, a single ‘minimized average’ structure, or a representative structure, is also provided in the PDB. The ensemble precision is often measured as the root-mean-square (r.m.s.) deviation to the average structure, where a value of about 2.0 Å is typical121; more precise ensembles with lower r.m.s. deviation may indicate an overfitting of data, rather than high quality. An ensemble r.m.s. deviation of ≥5 Å generally indicates a low quality structure, although values this high may occur, for example, in a structure comprising two well ordered domains connected by a flexible linker region. There is still no standardized measure for assessing the goodness of fit of structures to NMR data, although it is common to report the r.m.s. deviation of violations of distance and other constraints. As with X-ray structures, it can be useful to check independent measures of accuracy (for example, available at RCSB PDB, PDBe and via CAME/ProSA-Web111).

Compared with X-ray crystallography, NMR has the advantage that it is not necessary to crystallize a molecule; instead, NMR usually studies biomolecules in solution, hence arguably in a more natural state. NMR structures also lack the heavy metal contaminants of many X-ray structures. NMR can also dynamically track specific reactions in living cells122 and can more easily study weak associations as well as disordered protein states123.

However, NMR has the disadvantage that it imposes an upper limit on the size of the molecule studied: the largest molecule solved by NMR so far is 82 kDa (ref. 124), and most NMR structures are 25–30 kDa or less. Another disadvantage is that, for the same molecular system, NMR usually produces less precise structures than does X-ray crystallography. As with X-ray crystallography, studying structures by NMR often requires tampering with the target macromolecule. Frequently, truncated segments, often single domains, are studied by NMR, rather than full-length proteins. In addition, mutations of surface residues are often introduced to avoid aggregation. Finally, post-translational modifications that are present in vivo are often missing in NMR structures.

NMR structures are calculated automatically from constraints derived from the NMR spectra, primarily constraints on interatomic distances125. However, manual checks are sometimes needed during structure determination, and several molecular graphics tools (for example, MOLMOL35) offer the possibility of easily displaying NMR data, such as distance constraints, directly on the structure. In contrast to X-ray crystal structure calculation, it is customary to retain residues with very little or no data, with the result that some regions of the molecule can be very divergent in the ensemble. The display of ensembles generally requires prior superposition of the structures. Some graphics packages offer automatic superposition (VMD26, MOLMOL35), but for more difficult cases, with highly divergent regions, dedicated programs should be used; for example, THESEUS37.

Figure 3 | Visualization of an NMR ensemble for SH3 (ref. 108). (a,b) NMR structures are typically deposited in the PDB as an ensemble of superimposed structures (a), with the spread of the ensemble giving an indication of precision, but not of accuracy. The ‘sausage’ representation (b) gives an informative summary of an ensemble by adjusting the width of the tube to match the width of the ensemble. Images made using MOLMOL35 (a) and VMD26 (b).
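The ensemble precision described in Box 2 can be illustrated with a short calculation. The sketch below, in plain Python, computes the r.m.s. deviation of each model to the average structure and reports the mean over the ensemble. It assumes that the models have already been superimposed and contain the same atoms in the same order; the function name `ensemble_rmsd` and the toy coordinates are hypothetical, and published values may use a different atom selection (for example, backbone atoms only) or a pairwise definition.

```python
import math


def ensemble_rmsd(models):
    """Mean r.m.s. deviation (same units as the input, e.g. angstroms) of
    each model to the average structure.

    `models` is a list of models; each model is a list of (x, y, z) tuples.
    The models are assumed to be superimposed beforehand.
    """
    n_models = len(models)
    n_atoms = len(models[0])
    if any(len(model) != n_atoms for model in models):
        raise ValueError("all models must contain the same number of atoms")

    # Average structure: coordinate-wise mean over the ensemble.
    average = [
        tuple(sum(model[i][k] for model in models) / n_models for k in range(3))
        for i in range(n_atoms)
    ]

    per_model = []
    for model in models:
        squared = sum(
            (model[i][k] - average[i][k]) ** 2
            for i in range(n_atoms)
            for k in range(3)
        )
        per_model.append(math.sqrt(squared / n_atoms))
    return sum(per_model) / n_models


# Toy ensemble: two one-atom "models" 2 units apart along x.
TOY_ENSEMBLE = [[(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
print(ensemble_rmsd(TOY_ENSEMBLE))
```

Restricting the atom list to a well-ordered core before calling such a function is one way to separate a flexible linker or tail from a genuinely imprecise structure.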

as domains, SNPs, exon boundaries, secondary structure and so forth (Fig. 1a–d,f). The ability to easily see where sequence features are located in the three-dimensional structure can be of substantial practical value to bench biochemists and molecular biologists. For example, the spatial location of residues within the structure and the proximity to solvent can help in designing primers and mutation experiments. The ability to show such views for a wide range of features is a particular strength of SRS 3D7 and SPICE27 and is also facilitated by JenaLib28, PDBsum29 and Entrez Structure. Viewers such as STRAP30 that provide easy access to multiple sequence alignment information mapped onto three-dimensional structures can help locate key conserved residues. ProSAT2 (ref. 31) can display SNPs and also predict their effects, allowing a scientist to gauge the potential impact of a SNP on the protein structure.

Protein-protein binding sites. Typically, as part of its biological role, a protein will bind to several other proteins through comparatively large but flat binding surfaces. In fact, a large percentage of PDB entries contain not just a single protein chain but several. In some cases, this means identical subunits assembled together; in other cases, it means a complex of several different protein chains. The arrangement of subunits, and of the interface residues that form the subunit-subunit contacts, is often of biological significance. Several websites specialize in finding and visualizing subunit-subunit interface residues32. In PDBsum29 the interacting residues, and the types of their interaction across the interface, are shown schematically. MolSurfer (Table 1) provides a range of methods that help users explore macromolecular interfaces.

For symmetric assemblies (dimers, trimers and so on), the PDB entry of an X-ray crystal structure will often have explicit


BOX 3 MACROMOLECULAR STRUCTURES FROM ELECTRON MICROSCOPY

The Electron Microscopy Database (EMDB, http://emdatabank.org/) now has nearly 700 entries, mostly three-dimensional electron density maps (EDMs) of macromolecular and cellular structures. For about 250 of these entries, an atomic model has been calculated, usually by fitting an X-ray crystal structure (for example, Fig. 5b): these atomic-detail models may be deposited in linked entries in the PDB, where they account for about 0.4% of PDB entries.

A significant advantage of electron microscopy structure determination126–128 is that it can be used to study a wide range of sample types, from large, ordered assemblies such as helical arrays to isolated complexes (single particles) and irregular objects such as cells or subcellular components. The upper size limit is mainly the sample thickness (up to a few hundred nanometers). Compared with X-ray and NMR structures, electron microscopy is almost always at lower resolution.

Three-dimensional maps are viewed by choosing an appropriate density threshold value, normally one that gives a surface enclosing the correct molecular volume, and displaying the isosurface (for example, with Chimera16). EMDB provides two map viewers, of which we recommend OpenAstexViewer18 because it displays surfaces well and lets the user change threshold levels.

Transmission electron microscopy images are projections, and three-dimensional structure determination involves the collection and merging of different projections (views) of the object. The main task is usually the determination of the relative positions and orientations for the set of views. Atomic structures of components in larger assemblies that have been determined by crystallography can be docked into the electron microscopy map. Often there are conformational changes between a structure in a crystal lattice and in solution. Flexible fitting makes it possible to account for changes, such as hinge rotations, in fitting. In the most favorable cases, the resolution of macromolecular electron microscopy structures can reach about 3 Å, although for cell or tissue sections, radiation damage limits the resolution to >30–40 Å. At this high resolution end, the conformation of protein and nucleic acid backbones and bulky side chains can be determined directly from the electron microscopy density129 with the same tools as in X-ray crystallography.

However, there are uncertainties in determining the resolution and in validation for noncrystalline samples. When atomic-detail structures are known that correspond to any part of the electron microscopy structure, docking of the atomic coordinates into the map provides an independent test of reliability. Two density maps can be compared by maximizing their cross-correlation (for example, using Chimera), and the comparison can be visualized using semitransparent and solid or wire mesh surface displays that can be overlaid. For tomographic reconstruction and for three-dimensional reconstruction of sections, IMOD109 is commonly used.
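The idea of comparing two density maps by maximizing their cross-correlation can be sketched in a few lines. The example below is a deliberately reduced illustration under stated assumptions: each map is flattened to a list of density values on the same grid, the score is a normalized (Pearson-style) correlation, and the only transformation searched is a circular shift of a one-dimensional profile. Real fitting tools search three-dimensional rotations and translations; the names `cross_correlation` and `best_circular_shift` and the two profiles are hypothetical and do not describe how Chimera implements its fitting.

```python
import math


def cross_correlation(map_a, map_b):
    """Normalized cross-correlation of two maps sampled on the same grid.

    Both maps are flat sequences of density values. Returns a value
    between -1 and 1. A perfectly flat map has zero variance, so this
    function would raise ZeroDivisionError for it.
    """
    if len(map_a) != len(map_b):
        raise ValueError("maps must have the same number of grid points")
    n = len(map_a)
    mean_a = sum(map_a) / n
    mean_b = sum(map_b) / n
    covariance = sum((a - mean_a) * (b - mean_b) for a, b in zip(map_a, map_b))
    var_a = sum((a - mean_a) ** 2 for a in map_a)
    var_b = sum((b - mean_b) ** 2 for b in map_b)
    return covariance / math.sqrt(var_a * var_b)


def best_circular_shift(map_a, map_b):
    """Try every circular shift of map_b and return (best score, shift)."""
    n = len(map_b)
    scored = []
    for shift in range(n):
        # Move the last `shift` values to the front (shift to the right).
        shifted = map_b[n - shift:] + map_b[:n - shift]
        scored.append((cross_correlation(map_a, shifted), shift))
    return max(scored)


# Toy one-dimensional density profiles: the same peak, offset by two points.
PROFILE_A = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0]
PROFILE_B = [1.0, 3.0, 1.0, 0.0, 0.0, 0.0]

score, shift = best_circular_shift(PROFILE_A, PROFILE_B)
print(score, shift)
```

The same score, evaluated at the best placement, is what makes an overlaid display of two maps meaningful: a low maximum suggests that the maps differ in more than position.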

three-dimensional coordinates for only one monomer. To construct the coordinates for all subunits in the biologically relevant assembly, we recommend PISA33 (see Box 1, Fig. 2).

Comparing related structures. It is often informative to visualize two related structures superimposed: for example, two states of the same molecule, or two proteins with homologous sequences, or two structural homologs found by structural comparison tools34. Many molecular graphics tools offer automatic superposition as a standard feature (for example, MOLMOL35, MOE, PyMOL or VMD26). These tools allow the researcher to specify a portion of the molecule to be superimposed. The results are highly dependent on the regions chosen for the superposition. Typically, the researcher identifies a more-or-less rigid core of the molecule and superimposes this region using a subset of the atoms (typically the α-carbons or the backbone atoms). But many other combinations are possible for addressing specific questions (Figs. 1i and 4d–f). For difficult cases (for example, low sequence similarity or large regions that cannot be aligned in sequence), it is best to use more robust, dedicated superimposition tools (for example, STAMP36, STRAP19 or THESEUS37).

Molecular surfaces and electrostatic potentials. Many tools can generate molecular surfaces, most commonly the so-called Connolly surface38, which is derived by rolling a sphere the radius of a water molecule around the atomic van der Waals surface of the molecule. This surface, also known as the solvent-excluded surface, can be used as a canvas to map a wide variety of properties such as residue conservation scores, hydrophobicity (Fig. 1h), depth-cue information (Fig. 1e), mean-force potentials39 and electrostatics (Fig. 1g). Such colored surfaces (sometimes called texture mappings) can give insight into molecular interactions and conformational changes, for example, by highlighting surface regions with complementary shape and charge. The molecular surface can also be used to estimate the energetics of molecular interactions, including the entropic cost of desolvation, by calculating the area buried from solvent upon binding of other molecules40. Although many programs can generate a surface, the program MSMS41 is widely used as it provides a good estimate of molecular surface area and volume, and the most relevant molecular geometry when analyzing molecular interactions and interfaces.

Ligand binding sites

Interactions between macromolecules and small molecules often occur in buried active sites; these may be catalytic active sites, allosteric sites, or sites that may either disrupt or stabilize protein-protein interactions. The PDB at present contains over 37,000 binding sites involving about 10,000 different types of ligand molecules. A range of methods are available to characterize and visualize these sites, depending on the questions asked by the end user.

Annotation and highlighting. For gaining an initial insight into the atomic interactions in the binding site, a useful representation is to display ligands using a ball-and-stick representation and to display only backbone atoms of the protein or nucleic acid, except for those residues in direct contact with ligands (Fig. 4a). Many molecular graphics tools have been developed to support working with small molecules (for example, DS Visualizer, MOE, PMV25,



Figure 4 | Visualizing ligand-binding sites. (a) A useful initial view is to show ligands and binding site residues in ball-and-stick and wire-frame representations, respectively. Here, an inhibitor is shown bound to HIV protease (PDB 1HVR99). (b) Visualizing the same binding site using a molecular surface colored by atom type reveals the catalytic oxygen atoms (center, red). (c) Here, AutoLigand44 has been used to find regions that might bind a ligand-sized molecule. (d) Two structures of the same protein (estrogen receptor) superimposed using Relibase58,59, one with estrogen (blue, PDB 1QKU)100, a second with an antagonist (red, PDB 1ERR)101, give insight into the antagonist mechanism. (e) All 74 structures of human estrogen receptor compared using PDBsum, showing estrogen (red) and cofactors (green). (f) Comparing binding sites of related structures can give insight into drug specificity. Image shows estrogen receptor (green), progesterone receptor (gray) and androgen receptor (orange). (g,h) Simplified two-dimensional diagrams can be useful for visualizing binding site interactions, such as hydrogen bonds (dashed lines), unbonded contacts (‘eyelashes’, g) and hydrophobic interactions (green curves, h). (i) To study drug specificity, interaction networks can be used to show all proteins known to interact with a drug. Images made using SRS 3D7 (a), PMV25 (b,c), OpenAstexViewer18 (d), Jmol (e), MOE (f), LIGPLOT65 (g), PoseView66 (h) and STITCH63 (i).

PyMOL, STRAP19, Swiss-PdbViewer20, SYBYL, VMD26, WHAT IF42, Yasara; Table 1). Almost any of these can implement such views, and those with scripting capabilities can often be programmed to recreate this view on demand.

In addition, many PDB entries or related files (for example, UniProt) have annotations indicating which residues form the binding site. It can be instructive to display these annotations directly on three-dimensional structures, and many molecular graphics tools enable such displays (for example, JenaLib28, PDBsum29, ProSAT2 (ref. 31), SRS 3D7 and Ligand Explorer in the RCSB PDB).

Surface-based approaches. Structural details of binding sites are widely used in rational drug design, usually to generate ideas for classes of compounds for screening43. A common question is to ask what kinds of small molecules may bind to a given binding site. Many molecular graphics viewers allow the surface to be colored by local properties, such as hydrogen bonding ability, hydrophobicity or electrostatics, to allow exploration of chemical complementarity (Fig. 1g,h). The local curvature of the surface may also be used to evaluate steric complementarity.

Volume-based approaches. An alternative approach is to analyze the space around the target molecule, highlighting regions that may form strong interactions with small molecules. Some tools (for example, AutoLigand44) allow probe atoms, such as carbon atoms or oxygen atoms, to be scanned through the entire space and the interaction energies of the probes with the molecule to be evaluated. The resultant three-dimensional data sets are then rendered to show the areas of most favorable interaction45. More recently, atomic probes have been used to create maps of the atomic affinity. These may be rendered using isocontours, texture-mapped clipping planes or volume rendering (Fig. 4b,c). Many researchers are now analyzing these volume data sets to identify and visualize ligand-sized regions of maximal affinity44.

Sequence-profile approaches. Another approach to identify ligand binding sites uses multiple sequence alignments mapped onto three-dimensional structures46. This approach is based on the observation that binding site residues tend to be more conserved than other positions, so it can be particularly useful when little is known about a protein. Even for well studied proteins, however, these methods sometimes find binding sites not previously noticed. Some examples of such services are TraceSuite47, ETV48 and others48–50.

Multiple ligands. A three-dimensional structure gives a snapshot of a single state; however, in some cases, several different structures of the same protein exist with different ligands. We can use this information to help explore the range of conformations available to the system. For example, such comparisons can highlight interactions common to all known binding partners, which may help to guide the search for further possible binding partners51–53. For such comparisons, it can be useful to try different sets of atoms for superposition: for example, the ligand alone, or all atoms involved in the binding site. Each of these superimpositions can highlight different aspects of the conformational differences.
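The observation behind the sequence-profile approaches described above, that binding site residues tend to be more conserved than other positions, can be made concrete with a simple per-column score. The sketch below uses Shannon entropy as the conservation measure; this is a stand-in chosen for brevity and is an assumption, not the scoring used by TraceSuite, ETV or the other services cited. The function names, the entropy threshold of 0.5 bits, the toy alignment and the treatment of gap characters as ordinary symbols are likewise illustrative.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column.

    0 means the column is fully conserved; higher values mean more variation.
    """
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conserved_positions(alignment, max_entropy=0.5):
    """Return 0-based column indices whose entropy is at most `max_entropy`.

    `alignment` is a list of equal-length aligned sequences.
    """
    n_columns = len(alignment[0])
    if any(len(seq) != n_columns for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        i
        for i in range(n_columns)
        if column_entropy([seq[i] for seq in alignment]) <= max_entropy
    ]


# Toy alignment: columns 0 and 1 are invariant, column 2 is fully variable,
# column 3 is mostly conserved (three K, one R).
TOY_ALIGNMENT = ["ACDK", "ACEK", "ACFR", "ACGK"]
print(conserved_positions(TOY_ALIGNMENT))
```

The resulting indices could be mapped to residue numbers and used to color a structure in any of the viewers discussed here, so that conserved patches on the surface stand out.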


Often, it is of interest to compare structures with multiple ligands obtained by means of docking tools (for example, FlexX54, AutoDock55). To preselect promising compounds, computational chemists can scan large libraries of drug-like molecules and dock ‘hits’ into the binding site of the protein target56. Subsequently, the docked structures can be inspected visually to find ways of enhancing the predicted strength of binding57. Some docking tools now provide graphical interfaces (for example, FlexV and AutoDockTools) for the preparation of the input structures and the analysis of the results. These tools allow the comparison of interaction geometries of different ligands with the same protein.

Two useful resources for comparing multiple ligand structures are Relibase58,59 and Superligands60, which both contain information about all ligands in the PDB and take special care to ensure the assignment of chemically correct atom and bond types. Both resources allow searching by identifiers as well as chemical substructure searches and similarity searches; Relibase also offers keyword searches and sequence similarity searches. The structures can be displayed in two or in three dimensions in embedded viewers. When exploring a specific protein, it is especially useful to search for similar complexes; Relibase lists similar proteins with their respective ligands, which can subsequently be superimposed and displayed in the embedded OpenAstexViewer18 (Fig. 4d and Supplementary Fig. 1). The extended functionalities of Relibase+ (which requires a paid license) give an analysis of the differences in the superimposed structures (protein movements and ligand overlap).

PDBsum can also help visualize multiple ligands binding to the same protein by superimposing the protein’s different structural models in the PDB and identifying any ‘ligand clusters’; that is, sites where the ligands from the different structures overlap (Fig. 4e).

Multiple proteins and ligands. Finding features that are specific to a given target adds another level of complexity when studying protein-ligand interactions. To identify features determining selectivity, it is useful to compare the target binding site with binding sites of similar proteins. The “similar binding site” as well as the “similar ligand” search of Relibase can help to identify and compare similar protein complexes. Here, again, the Relibase+ comparison table is especially useful for detecting differences in the protein binding sites: mutations, insertions and residue movements. MOE provides a similar facility to help compare multiple proteins bound to multiple ligands (Fig. 4f).

Structural visualization can be useful for predicting side effects and ‘off-label’ uses of known drugs by comparing the target binding site to other known protein structures61,62. Some graphic tools support this: for example, Relibase+ offers a search for “similar cavities,” where the protein comparison is based on physicochemical properties rather than residues, hence finding remote similarities not evident from sequence similarity.

Structural visualization can also help in developing more selective drugs. Although promising, such approaches remain speculative, and their success will be fundamentally limited, as the PDB contains only a small fraction of all binding site geometries. A complementary approach is to use the much larger set of known protein-drug interactions where no three-dimensional structure is available. For example, STITCH63 can be used to show a network featuring all proteins known to interact with a given drug, based on a wide range of experimental databases, including the PDB (Fig. 4i). In the future, we anticipate that such approaches will be improved, and that PDB data will be increasingly incorporated into network visualization methods64.

Schematic diagrams. For presentations and printouts, it can be useful to highlight key interactions in the binding site using simplified schematic illustrations produced by tools such as LIGPLOT65, PoseView66 and Ligand:Protein Interaction Diagrams67 (part of MOE). These illustrations show the ligand and interacting protein side chains ‘flattened’ in a plane, and indicate relevant hydrogen bonds, covalent bonds, unbonded contacts and water-mediated hydrogen bonds (Fig. 4g,h). For comparing different complexes, LIGPLOT65 and MOE allow the user to generate a series of plots for related proteins binding the same or different ligands. Equivalent components of each plot are plotted in the same relative location, thus highlighting residues and interactions present in some of the structures but missing in others.

RNA structures

Over 4,000 nucleic acid three-dimensional structures are on deposit in the Nucleic Acid Databank (NDB68), mostly RNA structures, either determined experimentally or by ab initio prediction. NDB is also synchronized with the PDB1, and RNA structures account at present for nearly 8% of PDB entries. Many standard aspects of visualizing three-dimensional structures of RNA can be performed completely adequately by molecular graphics tools designed for proteins, such as PyMOL and Swiss-PdbViewer20 (Table 1).

Knowing the secondary structure of an RNA molecule often gives significant insight into its function, much more so than for protein secondary structure. RNA secondary structure can be derived either from multiple sequence alignments or from thermodynamic predictions, although the process requires specialized features and capabilities not available in most tools for visualizing protein alignments or structures. Multiple sequence alignment is particularly important in RNA research; alignments can be used to find covariations between nucleotide positions, which are then taken as evidence for a contact between the two nucleotide positions, and these contacts in turn define secondary structure (Fig. 5).

Because of these special-purpose requirements, the RNA community has developed their own specialized visualization tools (Supplementary Table 1) for viewing RNA secondary structure. Some of these RNA tools (for example, S2S Assemble69) provide an integrated environment for interactively visualizing multiple sequence alignments, intramolecular contacts and RNA three-dimensional structures (Fig. 5). The most useful tools provide the option to manually edit the two-dimensional contacts, allowing not only reorientations of elements but also deletion and addition of nucleotides or a whole element, such as a helix.

At present, two of the main challenges in RNA visualization are as follows: first, RNA often adopts multiple structures depending on experimental conditions, and none of the available tools can deal with this properly. Second, RNA in vivo usually occurs in complex with proteins; however, the RNA-specific tools cannot yet manage such complexes. RNA researchers can use standard molecular graphics tools to view such complexes,


but of course this means losing RNA-specific features and capabilities.

Figure 5 | Visualization of RNA structure in one, two and three dimensions. Viewing multiple sequence alignment simultaneously with two- and three-dimensional representations greatly helps in assigning two-dimensional structure and understanding function. This process is aided by synchronizing colors in all three views. The RNA structure shown is from SARS virus102, and the image was made using S2S Assemble69 with PyMOL.

Molecular motion

Biomacromolecules are dynamic entities, and motion is usually essential to function70. Visualizing dynamic molecular processes is often key toward understanding these processes. Recently, several visualization tools have become available that allow quick and easy exploration of dynamic transitions between two known states of a molecule. For example, the Yale Morph Server71 (http://molmovdb.org/) provides morphed animations of potential plausible pathways between two structures; Moviemaker72 (http://tinyurl.com/moviemaker-v1/) is a web server that permits the user to generate simple animations of a variety of types of protein motion. These tools provide very approximate, often simply schematic, descriptions of the molecular motions.

To explore large-amplitude, low-frequency motions, such as protein domain flexing, methods based on normal mode analysis and elastic network models provide a computationally efficient approach73. There are now several websites, for example, NOMAD-ref74 and ANM75, where even a novice user can enter a PDB file, compute normal modes, and visualize and analyze the results.

At a slightly higher level of complexity, several programs allow users to generate conformational ensembles and trajectories using constraint-based methods. Such programs include tCONCOORD76 and FIRST/FRODA77. One application of these methods is to identify segmental flexibility in proteins. The researcher identifies rigid domains in the protein connected by flexible tethers, then defines the geometry of the hinge or shear motions that occur as the proteins change conformation78. The Database of Macromolecular Movements71 provides a service for analyzing hinge motion in proteins. Other websites enable molecular motions to be analyzed by means of hierarchical, multiresolution flexibility trees79.

More realistic and detailed studies of motion require molecular dynamics simulations, which typically simulate 10–100 ns of motion in ~1-fs time-steps. Unfortunately, such calculations are generally too CPU-intensive to be provided as a free service; hence, users usually need to calculate their own trajectories. For a first look at molecular dynamics simulations, DSMM80 (http://tinyurl.com/dsmm-eml/) is a site that collects movies showing molecular dynamics simulations. Generally, molecular dynamics simulations are recorded as trajectory files that can be played back in a range of molecular graphics tools that support molecular dynamics (Table 1). There is as yet no unified resource to deposit or access trajectory files, although there are several initiatives in this direction, for example, the MoDEL Molecular Dynamics Extended Library (http://mmb.pcb.ub.es/model/). A related project, called Dynameomics81 (http://www.dynameomics.org/), provides online interactive views of simulations of 30 proteins and plans to extend this to all known protein folds. Such services are still very new, and we can expect significant advances in the next few years.

Of the molecular graphics tools with molecular dynamics support, VMD26 is probably the most widely used. It can display ‘movies’ and analyze properties such as atomic fluctuations, and it allows flexible integration with other computational tools and with the user’s own scripts. Although VMD is popular, many other molecular graphics tools support molecular dynamics trajectories, and each tool often has unique features that may be useful for particular projects (Table 1).

In general, visualization of molecular dynamics trajectories remains challenging owing to intrinsic complexity, such as the large number of atoms involved and the many orders of magnitude in time relevant for biological processes. The most straightforward visualization is to superimpose several molecular dynamics snapshots (Fig. 6a). While often useful, this method has obvious limits. Overall motion can be viewed using ‘sausage-like’ representations (Fig. 3b); however, often dimension-reduction methods are needed58. An increasing number of such methods are being developed for visualization of specialized cases, for example, transient cavities (Fig. 6b) and molecular diffusion (Fig. 6c–e).

Large macromolecular assemblies

X-ray crystallography is being used to solve the structures of larger and more complex systems, and there is now considerable overlap in the size range of structures from X-ray crystallography and from electron microscopy (Box 3). It is common to see electron microscopy isosurfaces into which atomic-detail X-ray structures have been fitted. Meanwhile, electron microscopy continues to produce higher-resolution density maps of large assemblies and of single particles, such as viruses or other isolated complexes (Box 3), in addition to tomograms of higher-order, unique structures such as cell sections or isolated organelles82.

These data on large-scale assemblies that integrate data from X-ray crystallography (Box 1), NMR spectroscopy (Box 2), electron microscopy (Box 3) and even light microscopy82–84 pose many new challenges for visualization. Many of these data are not at atomic detail, so other representations must be used. In addition, the systems can be very large, and there are often issues with computational and graphics performance. There is a need for high-performance, interactive visualization of such large

S50 | vol.7 No.3s | march 2010 | nature methods supplement review

Figure 6 | Visualizations of molecular motion. a b (a) Four snapshots from a molecular dynamics simulation visualized (darker protein coloring indicating later snapshots). A ligand is shown moving from its initial position buried in an active site (right) to the protein exterior (left). (b) Same four snapshots using a simplified representation highlighting residues undergoing conformational changes as the ligand escapes. The contoured surface (generated with CAVER103) shows changes to the transient tunnel used by the ligand. (c–e) Visualization of protein-protein diffusion simulations made using SDA (http://tinyurl.com/SDA-EML/). (c) Representative trajectory of a protein (blue) diffusing around a second, target protein (orange). (d) Isocontours (blue) show the c d e 90 region most occupied by the diffusing protein during thousands of trajectories. Target protein, orange. (e) Two-dimensional map of occupancy versus protein-protein center-to-center 0 104

Phi_y distance; blue, the most occupied region .

–90 computer and display hardware are ade- –90 0 90 quate for all visualization tasks we require. Phi_x In the early days of molecular graphics tools, ­assemblies, and across very different distance scales, although hardware limitations were a key issue; display systems were often very some tools, such as Amira (Visage Imaging) and PMV25, were expensive, and they relied on nonstandard hardware. Significant effort designed with such challenges in mind. in software development was directed toward ameliorating hardware At present, researchers typically use a hierarchical approach limitations. Today, although most molecular graphics tools run com- to visualizing large macromolecular assemblies. For portions fortably on standard desktop computers, many hardware issues remain, for which atomic information is available, atomic representa- particularly for the more complex visualization tasks, such as the study tions may be used, and then abstracted to simpler, surface-based of molecular motion and of large assemblies. representations. These surfaces may then be integrated with density sections or volumes from the lower-resolution methods (for example, a b electron microscopy tomography). This approach scales nicely from the level of atoms to the level of cells, allowing the use of simpler, more abstracted representations of the individual components as one moves to large systems, such as intracellular com- ponents (Fig. 7a) or even whole-cell visualization (Fig. 7b), and to multiscale movies85.

Visualization hardware Most of this review has focused exclusively on software developments, tacitly assuming that

Figure 7 | Two examples of multiscale, hierarchical visualization. (a) An atomic structure of an antibody (bottom) was used to create a smoothed surface as part of a more complex scene of blood serum (top). Images made with AVS (http://www.avs.com/) and PMV25. (b) Top, a 2.4-nm electron tomogram slice of a human skin section showing part of the nuclear envelope (blue), cytoplasm (black background) and a desmosome (orange) at the boundary of the two cells. Using sub-tomogram averaging, the interaction of cadherin proteins can be resolved105, and they were used to calculate isosurfaces (below) into which the atomic-detail structure of C-cadherins106 has been fitted. Images created using MATLAB and Amira. Scale bars, 10 nm.
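The step from atomic coordinates to a smoothed surface, as in the antibody of panel a, can be illustrated with a small sketch. It assumes one common approach: summing an isotropic Gaussian per atom onto a regular grid and thresholding the result at an isosurface level. This is not necessarily how AVS or PMV compute their surfaces, and the function names, grid spacing, Gaussian width and threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_density(coords, grid_min, grid_max, spacing=2.0, sigma=3.0):
    """Sum one isotropic Gaussian per atom onto a regular 3D grid.

    coords: (n_atoms, 3) array of positions (units are arbitrary, e.g. angstroms).
    Returns the three grid axes and the density array.
    """
    axes = [np.arange(lo, hi + spacing, spacing)
            for lo, hi in zip(grid_min, grid_max)]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1)           # shape (nx, ny, nz, 3)
    density = np.zeros(gx.shape)
    for atom in np.asarray(coords, dtype=float):
        d2 = np.sum((grid - atom) ** 2, axis=-1)     # squared distance to this atom
        density += np.exp(-d2 / (2.0 * sigma ** 2))
    return axes, density

def inside_isosurface(density, level):
    """Boolean mask of grid points enclosed by the isosurface at `level`."""
    return density >= level

# Hypothetical usage: a single "atom" at the origin on a 5 x 5 x 5 grid.
axes, density = gaussian_density(
    np.array([[0.0, 0.0, 0.0]]),
    grid_min=(-4.0, -4.0, -4.0),
    grid_max=(4.0, 4.0, 4.0),
)
mask = inside_isosurface(density, level=0.5)
```

A larger `sigma` gives a blobbier, lower-resolution envelope, which is the kind of abstraction the hierarchical approach relies on as the scene grows from single molecules to cellular components.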


Figure 8 | Tangible models in research. Tangible models were used to explore the modes of self-assembly of viral capsids88. (a) The electrostatic and charge complementarity is displayed using isosurfaces for the protein and electrostatic potential. (b) Affordances for placement of magnets were designed into the protein surface using constructive solid geometry methods. (c) Physical models were built and fitted with magnets. Twelve pentameric subunits then self-assemble when shaken for several minutes in a tube. Images created with PMV25. (d) An augmented-reality interface used to study molecular interactions of the enzyme superoxide dismutase. An inexpensive video camera (not in the picture) views the models, and embedded markers on the surface (small black squares) are used to determine the orientation of the model from the video image. Volume-rendered electrostatic potentials and small animated arrows for the electrostatic field vectors are then overlapped onto the video image, following the video image as the user manipulates the model.

Stereo capabilities can greatly enhance molecular graphics and, although available for many years on expensive and specialized systems, stereo is only just now becoming available for desktop LCD screens. For particularly large assemblies, computational speed is often still an issue; here, it is important to use a top-of-the-line graphics card, and also to use molecular graphics tools that can take advantage of hardware acceleration. Fortunately, most tools can, the principal exceptions being RasMol and Chime.

Immersive virtual reality. Visualizing large, complex and multiscale macromolecular assemblies, especially combined with molecular motion, is not only challenging computationally, but ultimately may require display systems significantly better than current computer monitors can provide. Immersive virtual reality is very promising, enabling the user to virtually enter a microscopic world, flying through and interactively manipulating macromolecules. Experimental immersive environments (for example, CAVE86) have been in development for over 20 years, and concepts from this research have been used to enhance the user experience of several molecular graphics tools (for example, Yasara; Table 1). But such techniques have yet to find widespread use for molecular visualization, partially because of the still high cost and cumbersome nature of such systems, but perhaps also because the sense of immersion is not critical for interaction with the molecular world.

Today, however, some of the hardware components for virtual reality are becoming affordable and practical, such as head-mounted displays with head tracking, and a variety of haptic devices (mechanical input devices that are touch sensitive), such as the Wii controller, as well as devices such as wired gloves that can provide force feedback. These improvements are largely driven by the gaming market and are expected to continue rapidly. For most molecular graphics tools, minimal modifications should be required to allow them to work with such hardware, and some tools have been built with such support already in mind (for example, VMD26 and SRS 3D7). However, fully exploiting the promise of virtual reality will require substantial further software development, particularly to the user interface layer.

Physical models. Today, molecular visualization relies almost exclusively on computer-generated images. Although physical wooden and wire molecular models played a critical early role in structural chemistry and biology, the advent of three-dimensional interactive computer graphics in the 1970s provided new and much improved utility in macromolecular structure determination and analysis. However, a more recent technology, computer autofabrication, or ‘solid printing’, initially developed for industrial rapid prototyping, is now being used to produce physical molecular models. Such models bring back the properties of real-object perception and manipulation that were lost when the model resided only in the computer. Over the past decade, the variety of such printers has increased steadily as the entry price has dropped to below $10,000 and printing services have sprouted to fill this new niche. Because accurate and complex tangible models can be produced automatically as computer ‘printouts’ of molecular geometrical representations, the barrier to custom production has disappeared. Physical models with functional parts have been autofabricated with analog physical constraints, affinities and/or structural behavior of the molecular system87 (Fig. 8).

Such models have begun to be used for structural research. As persistent objects they are convenient, accessible and naturally manipulable. They can be used as springboards to ideas and hypotheses88. Such characteristics also make physical models useful in multidisciplinary collaborations, helping structural experts communicate better with other colleagues. In addition, physical models lend themselves to teaching89. We are in the early stages of learning how to best use physical models in structural biology education and research, perhaps comparable to where computer graphics was in the 1970s. This is an ongoing area of research90–92.

Future perspectives
Methods for visualizing molecular structures are very mature. In the near future, we can expect more effective computational approaches for representing, analyzing and synthesizing ever-more-complex molecular systems. Increased collaboration with the community will also lead to the development of more effective and intelligible rendering approaches. However, we expect that most of the advances in molecular visualization will come in the areas of computer interfaces, user interaction and new ways to represent and visualize nonspatial information.


These changes will help structures reach an even broader audience. Navigating a synthesis of structural data with image data82 and genomic22,93 and biological network information64 will require new methods that combine spatial and dynamic representations with statistical and high-dimensional abstract relationships. We also anticipate that collaborative community editing of structure-related data sources (for example, Proteopedia94) will change how scientists relate to structural data, and to each other. The fields of information visualization and visual analytics have developed over the past decade to address problems in making such complex data intelligible and navigable95,96.

Some of the drawbacks of immersive virtual reality may be overcome by the emerging technology of augmented reality (Fig. 8d), which provides inexpensive and accessible ways to interact in intuitive and perceptually rich ways with our computational models. Whatever direction new technologies will take us, the roles of macromolecular visualization in understanding, gaining insight and developing ideas will remain the same.

Note: Supplementary information is available on the Nature Methods website.

Acknowledgments
Thanks to M. Berynskyy and L. Biedermannova for assistance with Figure 6. This work was partly supported by the European Union Framework Programme 6 grant ‘TAMAHUD’ (LSHC-CT-2007-037472). R.C.W. gratefully acknowledges the support of the Klaus Tschira Foundation.

COMPETING INTERESTS STATEMENT
The authors declare no competing financial interests.

Published online at http://www.nature.com/naturemethods/.
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/.

1. Berman, H., Henrick, K. & Nakamura, H. Announcing the worldwide Protein Data Bank. Nat. Struct. Biol. 10, 980 (2003).
2. Goddard, T.D. & Ferrin, T.E. Visualization software for molecular assemblies. Curr. Opin. Struct. Biol. 17, 587–595 (2007).
3. Tate, J. Molecular visualization. Methods Biochem. Anal. 44, 135–158 (2003).
4. Olson, A.J. & Pique, M.E. Visualizing the future of molecular graphics. SAR QSAR Environ. Res. 8, 233–247 (1998).
5. Berman, H.M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
6. Richardson, D.C. & Richardson, J.S. The kinemage: a tool for scientific communication. Protein Sci. 1, 3–9 (1992).
7. O’Donoghue, S.I., Meyer, J.E.W., Schafferhans, A. & Fries, K. The SRS 3D module: integrating sequence, structure, and annotation data. Bioinformatics 20, 2476–2478 (2004).
8. Bairoch, A. et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. 33 (Database issue), D154–D159 (2005).
9. Schafferhans, A., Meyer, J.E.W. & O’Donoghue, S.I. The PSSH database of alignments between protein sequences and tertiary structures. Nucleic Acids Res. 31, 494–498 (2003).
10. Arnold, K. et al. The Protein Model Portal. J. Struct. Funct. Genomics 10, 1–8 (2009).
11. Schwede, T., Kopp, J., Guex, N. & Peitsch, M.C. SWISS-MODEL: an automated protein homology-modeling server. Nucleic Acids Res. 31, 3381–3385 (2003).
12. Krieger, E., Nabuurs, S.B. & Vriend, G. Homology modeling. Methods Biochem. Anal. 44, 509–523 (2003).
13. Cozzetto, D. et al. Evaluation of template-based models in CASP8 with standard measures. Proteins 77 (Suppl 9), 18–28 (2009).
14. Ben-David, M. et al. Assessment of CASP8 structure predictions for template free targets. Proteins 77 (Suppl 9), 50–65 (2009).
15. Bradley, P., Misura, K.M. & Baker, D. Toward high-resolution de novo structure prediction for small proteins. Science 309, 1868–1871 (2005).
16. Pettersen, E.F. et al. UCSF Chimera—a visualization system for exploratory research and analysis. J. Comput. Chem. 25, 1605–1612 (2004).
17. Wang, Y., Geer, L.Y., Chappey, C., Kans, J.A. & Bryant, S.H. Cn3D: sequence and structure views for Entrez. Trends Biochem. Sci. 25, 300–302 (2000).
18. Hartshorn, M.J. AstexViewer: a visualisation aid for structure-based drug design. J. Comput. Aided Mol. Des. 16, 871–881 (2002).
19. Gille, C. & Frommel, C. STRAP: editor for STRuctural Alignments of Proteins. Bioinformatics 17, 377–378 (2001).
20. Guex, N. & Peitsch, M.C. SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis 18, 2714–2723 (1997).
21. Zanzoni, A., Ausiello, G., Via, A., Gherardini, P.F. & Helmer-Citterich, M. Phospho3D: a database of three-dimensional structures of protein phosphorylation sites. Nucleic Acids Res. 35 (Database issue), D229–D231 (2007).
22. Procter, J.B. et al. Visualization of multiple alignments, phylogenies and gene family evolution. Nat. Methods 7, S16–S25 (2010).
23. Kraulis, P.J. Molscript: a program to produce both detailed and schematic plots of protein structures. J. Appl. Crystallogr. 24, 946–950 (1991).
24. Merritt, E.A. & Bacon, D.J. Raster3D: photorealistic molecular graphics. Methods Enzymol. 277, 505–524 (1997).
25. Sanner, M.F. A component-based software environment for visualizing large macromolecular assemblies. Structure 13, 447–462 (2005).
26. Humphrey, W., Dalke, A. & Schulten, K. VMD: visual molecular dynamics. J. Mol. Graph. 14, 33–38 (1996).
    Widely used and versatile tool for displaying, animating and analyzing large biomolecular systems. Particularly suited for MD simulations.
27. Prlic, A., Down, T.A. & Hubbard, T.J. Adding some SPICE to DAS. Bioinformatics 21 (Suppl. 2), ii40–ii41 (2005).
28. Huehne, R. & Suehnel, J. The Jena Library of Biological Macromolecules – JenaLib. Preprint at 〈http://precedings.nature.com/documents/3114/version/1/〉 (2009).
29. Laskowski, R.A. PDBsum: summaries and analyses of PDB structures. Nucleic Acids Res. 29, 221–222 (2001).
30. Gille, C. Structural interpretation of mutations and SNPs using STRAP-NT. Protein Sci. 15, 208–210 (2006).
31. Gabdoulline, R.R., Ulbrich, S., Richter, S. & Wade, R.C. ProSAT2–protein structure annotation server. Nucleic Acids Res. 34 (Web Server issue), W79–W83 (2006).
32. Bordner, A.J. & Gorin, A.A. Comprehensive inventory of protein complexes in the Protein Data Bank from consistent classification of interfaces. BMC Bioinformatics 9, 234 (2008).
33. Krissinel, E. & Henrick, K. Inference of macromolecular assemblies from crystalline state. J. Mol. Biol. 372, 774–797 (2007).
34. Kolodny, R., Koehl, P. & Levitt, M. Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol. 346, 1173–1188 (2005).
35. Koradi, R., Billeter, M. & Wuthrich, K. MOLMOL: a program for display and analysis of macromolecular structures. J. Mol. Graph. 14, 51–55, 29–32 (1996).
36. Russell, R.B. & Barton, G.J. Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins 14, 309–323 (1992).
37. Theobald, D.L. & Wuttke, D.S. THESEUS: maximum likelihood superpositioning and analysis of macromolecular structures. Bioinformatics 22, 2171–2172 (2006).
38. Connolly, M.L. Solvent-accessible surfaces of proteins and nucleic acids. Science 221, 709–713 (1983).
39. Sippl, M.J. Boltzmann’s principle, knowledge-based mean fields, and protein folding. An approach to the computational determination of protein structures. J. Comput. Aided Mol. Des. 7, 473–501 (1993).
40. Wesson, L. & Eisenberg, D. Atomic solvation parameters applied to molecular dynamics of proteins in solution. Protein Sci. 1, 227–235 (1992).
41. Sanner, M.F., Olson, A.J. & Spehner, J.-C. Reduced surface: an efficient way to compute molecular surfaces. Biopolymers 38, 305–320 (1996).
42. Vriend, G. WHAT IF: a molecular modeling and drug design program. J. Mol. Graph. 8, 52–56 (1990).
43. Hunter, W.N. Structure-based ligand design and the promise held for antiprotozoan drug discovery. J. Biol. Chem. 284, 11749–11753 (2009).


44. Harris, R., Olson, A.J. & Goodsell, D.S. Automated prediction of ligand-binding sites in proteins. Proteins 70, 1506–1517 (2008).
45. Goodford, P.J. A computational procedure for determining energetically favorable binding sites on biologically important macromolecules. J. Med. Chem. 28, 849–857 (1985).
46. Campbell, S.J., Gold, N.D., Jackson, R.M. & Westhead, D.R. Ligand binding: functional site location, similarity and docking. Curr. Opin. Struct. Biol. 13, 389–395 (2003).
47. Lichtarge, O., Bourne, H.R. & Cohen, F.E. An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. 257, 342–358 (1996).
48. Morgan, D.H., Kristensen, D.M., Mittelman, D. & Lichtarge, O. ET viewer: an application for predicting and visualizing functional sites in protein structures. Bioinformatics 22, 2049–2050 (2006).
49. Laskowski, R.A., Watson, J.D. & Thornton, J.M. ProFunc: a server for predicting protein function from 3D structure. Nucleic Acids Res. 33 (Web Server issue), W89–W93 (2005).
50. Kinoshita, K., Murakami, Y. & Nakamura, H. eF-seek: prediction of the functional sites of proteins by searching for similar electrostatic potential and molecular surface shape. Nucleic Acids Res. 35 (Web Server issue), W398–W402 (2007).
51. Wolber, G. & Kosara, R. Pharmacophores from macromolecular complexes with LigandScout. in Pharmacophores and Pharmacophore Searches (ed. Langer, T. & Hoffmann, R.D.) vol. 32, 131–150 (Wiley-VCH, Weinheim, Germany, 2006).
52. Vulpetti, A. & Pevarello, P. An analysis of the binding modes of ATP-competitive CDK2 inhibitors as revealed by X-ray structures of protein-inhibitor complexes. Curr. Med. Chem. Anticancer Agents 5, 561–573 (2005).
53. Zou, J. et al. Towards more accurate pharmacophore modeling: Multicomplex-based comprehensive pharmacophore map and most-frequent-feature pharmacophore model of CDK2. J. Mol. Graph. Model. 27, 430–438 (2008).
54. Rarey, M., Kramer, B., Lengauer, T. & Klebe, G. A fast flexible docking method using an incremental construction algorithm. J. Mol. Biol. 261, 470–489 (1996).
55. Morris, G.M. et al. AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem. 30, 2785–2791 (2009).
56. Zoete, V., Grosdidier, A. & Michielin, O. Docking, virtual high throughput screening and in silico fragment-based drug design. J. Cell. Mol. Med. 13, 238–248 (2009).
57. Karkola, S., Alho-Richmond, S. & Wahala, K. Pharmacophore modelling of 17β-HSD1 enzyme based on active inhibitors and enzyme structure. Mol. Cell. Endocrinol. 301, 225–228 (2009).
58. Hendlich, M. Databases for protein-ligand complexes. Acta Crystallogr. D Biol. Crystallogr. 54, 1178–1182 (1998).
59. Gunther, J., Bergner, A., Hendlich, M. & Klebe, G. Utilising structural knowledge in drug design strategies: applications using Relibase. J. Mol. Biol. 326, 621–636 (2003).
    Provides several detailed examples showing how Relibase can aid structure-based drug design.
60. Michalsky, E., Dunkel, M., Goede, A. & Preissner, R. SuperLigands—a database of ligand structures derived from the Protein Data Bank. BMC Bioinformatics 6, 122 (2005).
61. Xie, L., Li, J. & Bourne, P.E. Drug discovery using chemical systems biology: identification of the protein-ligand binding network to explain the side effects of CETP inhibitors. PLOS Comput. Biol. 5, e1000387 (2009).
62. Kinnings, S.L. et al. Drug discovery using chemical systems biology: repositioning the safe medicine Comtan to treat multi-drug and extensively drug resistant tuberculosis. PLOS Comput. Biol. 5, e1000423 (2009).
63. Kuhn, M., von Mering, C., Campillos, M., Jensen, L.J. & Bork, P. STITCH: interaction networks of chemicals and proteins. Nucleic Acids Res. 36 (Database issue), D684–D688 (2008).
    Useful and easy-to-use tool for visualizing graphical networks showing interactions between proteins and small molecules. Underlying data is consolidated from many sources, including PDB.
64. Gehlenborg, N. et al. Visualization of omics data for systems biology. Nat. Methods 7, S56–S68 (2010).
65. Wallace, A.C., Laskowski, R.A. & Thornton, J.M. LIGPLOT: a program to generate schematic diagrams of protein-ligand interactions. Protein Eng. 8, 127–134 (1995).
    Widely used for generating simplified, two-dimensional schematic diagrams of protein-ligand interactions from the three-dimensional coordinates.
66. Stierand, K., Maass, P.C. & Rarey, M. Molecular complexes at a glance: automated generation of two-dimensional complex diagrams. Bioinformatics 22, 1710–1716 (2006).
67. Clark, A.M. & Labute, P. 2D depiction of protein-ligand complexes. J. Chem. Inf. Model. 47, 1933–1944 (2007).
68. Berman, H.M. et al. The nucleic acid database: a comprehensive relational database of three-dimensional structures of nucleic acids. Biophys. J. 63, 751–759 (1992).
69. Jossinet, F. & Westhof, E. Sequence to Structure (S2S): display, manipulate and interconnect RNA data from sequence to structure. Bioinformatics 21, 3320–3321 (2005).
    Offers the most complete set of features for viewing RNA structures. Recommended for advanced users. Also available by web services.
70. Ringe, D. & Petsko, G.A. The ‘glass transition’ in protein dynamics: what it is, why it occurs, and how to exploit it. Biophys. Chem. 105, 667–680 (2003).
71. Flores, S. et al. The Database of Macromolecular Motions: new features added at the decade mark. Nucleic Acids Res. 34 (Database issue), D296–D301 (2006).
72. Maiti, R., Van Domselaar, G.H. & Wishart, D.S. MovieMaker: a web server for rapid rendering of protein motions and interactions. Nucleic Acids Res. 33 (Web Server issue), W358–W362 (2005).
73. Chennubhotla, C., Rader, A.J., Yang, L.W. & Bahar, I. Elastic network models for understanding biomolecular machinery: from enzymes to supramolecular assemblies. Phys. Biol. 2, S173–S180 (2005).
74. Lindahl, E., Azuara, C., Koehl, P. & Delarue, M. NOMAD-Ref: visualization, deformation and refinement of macromolecular structures based on all-atom normal mode analysis. Nucleic Acids Res. 34 (Web Server issue), W52–W56 (2006).
75. Eyal, E., Yang, L.W. & Bahar, I. Anisotropic network model: systematic evaluation and a new web interface. Bioinformatics 22, 2619–2627 (2006).
76. Seeliger, D. & De Groot, B.L. tCONCOORD-GUI: visually supported conformational sampling of bioactive molecules. J. Comput. Chem. 30, 1160–1166 (2009).
77. Thorpe, M.F., Lei, M., Rader, A.J., Jacobs, D.J. & Kuhn, L.A. Protein flexibility and dynamics using constraint theory. J. Mol. Graph. Model. 19, 60–69 (2001).
78. Gerstein, M., Lesk, A.M. & Chothia, C. Structural mechanisms for domain movements in proteins. Biochemistry 33, 6739–6749 (1994).
79. Zhao, Y., Stoffler, D. & Sanner, M. Hierarchical and multi-resolution representation of protein flexibility. Bioinformatics 22, 2768–2774 (2006).
80. Finocchiaro, G., Wang, T., Hoffmann, R., Gonzalez, A. & Wade, R.C. DSMM: a database of simulated molecular motions. Nucleic Acids Res. 31, 456–457 (2003).
81. Kehl, C., Simms, A.M., Toofanny, R.D. & Daggett, V. Dynameomics: a multi-dimensional analysis-optimized database for dynamic protein data. Protein Eng. Des. Sel. 21, 379–386 (2008).
82. Walter, T. et al. Visualization of image data from cells to organisms. Nat. Methods 7, S26–S41 (2010).
83. Goodsell, D.S. Visual methods from atoms to cells. Structure 13, 347–354 (2005).
84. Goodsell, D.S. Making the step from chemistry to biology and back. Nat. Chem. Biol. 3, 681–684 (2007).
85. McGill, G. Molecular movies... coming to a lecture near you. Cell 133, 1127–1132 (2008).
86. Cruz-Neira, C., Sandin, D.J., DeFanti, T.A., Kenyon, R.V. & Hart, J.C. The CAVE: audio visual experience automatic virtual environment. Commun. ACM 35, 64–72 (1992).
87. Gillet, A., Sanner, M.F., Stoffler, D. & Olson, A.J. Tangible interfaces for structural molecular biology. Structure 13, 483–491 (2005).
88. Olson, A.J., Hu, Y.H. & Keinan, E. Chemical mimicry of viral capsid self-assembly. Proc. Natl. Acad. Sci. USA 104, 20731–20736 (2007).
89. Herman, T. et al. Tactile teaching: exploring protein structure/function using physical models. Biochem. Mol. Biol. Educ. 34, 247–254 (2006).
90. Creem, S.H. & Proffitt, D.R. Grasping objects by their handles: a necessary interaction between cognition and action. J. Exp. Psychol. Hum. Percept. Perform. 27, 218–228 (2001).
91. Kozma, R. The material features of multiple representations and their cognitive and social affordances for science understanding. Learning and Instruction 13, 205–226 (2003).
92. Zhang, J. & Patel, V.L. Distributed cognition, representation, and affordance. Pragmatics & Cognition 14, 333–341 (2006).


93. Nielsen, C.B., Cantor, M., Dubchak, I., Gordon, D. & Wang, T. Visualizing genomes: techniques and challenges. Nat. Methods 7, S5–S15 (2010).
94. Hodis, E. et al. Proteopedia—a scientific ‘wiki’ bridging the rift between three-dimensional structure and function of biomacromolecules. Genome Biol. 9, R121 (2008).
95. Cook, K., Earnshaw, R. & Stasko, J. Discovering the unexpected. IEEE Comput. Graph. Appl. 27, 15–19 (2007).
96. Kerren, A., Stasko, J.T., Fekete, J.D. & North, C. Information Visualization (Springer, New York, 2008).
97. Schindler, T. et al. Crystal structure of Hck in complex with a Src family-selective tyrosine kinase inhibitor. Mol. Cell 3, 639–648 (1999).
98. Xu, W., Harrison, S.C. & Eck, M.J. Three-dimensional structure of the tyrosine kinase c-Src. Nature 385, 595–602 (1997).
99. Lam, P.Y. et al. Rational design of potent, bioavailable, nonpeptide cyclic ureas as HIV protease inhibitors. Science 263, 380–384 (1994).
100. Gangloff, M. et al. Crystal structure of a mutant hERα ligand-binding domain reveals key structural features for the mechanism of partial agonism. J. Biol. Chem. 276, 15059–15065 (2001).
101. Brzozowski, A.M. et al. Molecular basis of agonism and antagonism in the oestrogen receptor. Nature 389, 753–758 (1997).
102. Robertson, M.P. et al. The structure of a rigorously conserved RNA element within the SARS virus genome. PLoS Biol. 3, e5 (2005).
103. Petrek, M. et al. CAVER: a new tool to explore routes from protein clefts, pockets and cavities. BMC Bioinformatics 7, 316 (2006).
104. Spaar, A., Dammer, C., Gabdoulline, R.R., Wade, R.C. & Helms, V. Diffusional encounter of barnase and barstar. Biophys. J. 90, 1913–1924 (2006).
105. Al-Amoudi, A., Diez, D.C., Betts, M.J. & Frangakis, A.S. The molecular architecture of cadherins in native epidermal desmosomes. Nature 450, 832–837 (2007).
106. Boggon, T.J. et al. C-cadherin ectodomain structure and implications for cell adhesion mechanisms. Science 296, 1308–1313 (2002).
107. Marina, A., Waldburger, C.D. & Hendrickson, W.A. Structure of the entire cytoplasmic portion of a sensor histidine-kinase protein. EMBO J. 24, 4247–4259 (2005).
108. Mal, T.K., Matthews, S.J., Kovacs, H., Campbell, I.D. & Boyd, J. Some NMR experiments and a structure determination employing a {15N,2H} enriched protein. J. Biomol. NMR 12, 259–276 (1998).
109. Kremer, J.R., Mastronarde, D.N. & McIntosh, J.R. Computer visualization of three-dimensional image data using IMOD. J. Struct. Biol. 116, 71–76 (1996).
110. Sayle, R.A. & Milner-White, E.J. RASMOL: biomolecular graphics for all. Trends Biochem. Sci. 20, 374 (1995).
111. Wiederstein, M. & Sippl, M.J. ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins. Nucleic Acids Res. 35 (Web Server issue), W407–W410 (2007).
112. Rhodes, G. Crystallography Made Crystal Clear: A Guide for Users of Macromolecular Models 3rd edn. (Academic Press, 2006).
113. Glykos, N.M. On the application of molecular-dynamics simulations to validate thermal parameters and to optimize TLS-group selection for macromolecular refinement. Acta Crystallogr. D Biol. Crystallogr. 63, 705–713 (2007).
114. Brünger, A.T. The free R factor: a novel statistical quantity for assessing the accuracy of crystal structures. Nature 355, 472–474 (1992).
115. Kleywegt, G.J. et al. The Uppsala electron-density server. Acta Crystallogr. D Biol. Crystallogr. 60, 2240–2249 (2004).
116. Emsley, P. & Cowtan, K. Coot: model-building tools for molecular graphics. Acta Crystallogr. D Biol. Crystallogr. 60, 2126–2132 (2004).
117. Jones, T.A. Diffraction methods for biological macromolecules. Interactive computer graphics: FRODO. Methods Enzymol. 115, 157–171 (1985).
118. Jones, T.A., Zou, J.Y., Cowan, S.W. & Kjeldgaard, M. Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Crystallogr. A 47, 110–119 (1991).
119. Levin, E.J., Kondrashov, D.A., Wesenberg, G.E. & Phillips, G.N. Jr. Ensemble refinement of protein crystal structures: validation and application. Structure 15, 1040–1052 (2007).
120. Rieping, W., Habeck, M. & Nilges, M. Inferential structure determination. Science 309, 303–306 (2005).
121. Nederveen, A.J. et al. RECOORD: a recalculated coordinate database of 500+ proteins from the PDB using restraints from the BioMagResBank. Proteins 59, 662–672 (2005).
122. Selenko, P. & Wagner, G. Looking into live cells with in-cell NMR spectroscopy. J. Struct. Biol. 158, 244–253 (2007).
123. Eliezer, D. Biophysical characterization of intrinsically disordered proteins. Curr. Opin. Struct. Biol. 19, 23–30 (2009).
124. Tugarinov, V., Choy, W.-Y., Orekhov, V.Y. & Kay, L.E. Solution NMR-derived global fold of a monomeric 82-kDa enzyme. Proc. Natl. Acad. Sci. USA 102, 622–627 (2005).
125. Markwick, P.R., Malliavin, T. & Nilges, M. Structural biology by NMR: structure, dynamics, and interactions. PLOS Comput. Biol. 4, e1000168 (2008).
126. Wang, L. & Sigworth, F.J. Cryo-EM and single particles. Physiology (Bethesda) 21, 13–18 (2006).
127. Frank, J. Three-dimensional Electron Microscopy of Macromolecular Assemblies 2nd edn. (Oxford University Press, 2006).
128. Frank, J. ed. Electron Tomography 2nd edn. (Springer, 2006).
129. Yu, X., Jin, L. & Zhou, Z.H. 3.88 Å structure of cytoplasmic polyhedrosis virus by cryo-electron microscopy. Nature 453, 415–419 (2008).


Visualization of macromolecular structures

Seán I O’Donoghue, David S Goodsell, Achilleas S Frangakis, Fabrice Jossinet, Roman A Laskowski,

Michael Nilges, Helen R Saibil, Andrea Schafferhans, Rebecca Wade, Eric Westhof & Arthur J Olson

Supplementary figures and text:

Supplementary Figure 1 Relibase comparison of human estrogen receptor structures

Supplementary Table 1 Tools for visualizing RNA secondary structure

Nature Methods: doi: 10.1038/nmeth.1427 1

Supplementary Figure 1 | Relibase comparison of human estrogen receptor structures. The superimposition is calculated based on minimal RMSD to the Cα binding site atoms of the reference structure (PDB 1ERR-A). In the top left corner is OpenAstexViewer1 showing the human estrogen receptor complexed with the antagonist raloxifene (PDB 1ERR, red) and with the agonist estrogen (PDB 1QKU, green). Below the embedded viewer is the table calculated by Relibase+ highlighting differences in protein structure. Clicking on a link in the table creates a pop-up window (bottom right) giving details about the protein movement. The Cα movements are caused by the differences in volume of the agonist and antagonist (shown in stick representation), where the antagonist prevents Helix 12 (green) from closing the binding site and from building an interface to the co-activator. The view has been assembled with the help of the hierarchical selection menu (top right) that is part of the Astex viewer.


Supplementary Table 1 | Tools for visualizing RNA secondary structure

Name | OS | Input formats | 2D layout | Features and strengths (b) | Limitations | URL
jViz.Rna2 | Win, Mac, Linux | Bracket notation, BPSEQ, CT | Circular, Dot plot, Linear, Squiggle | 2D manipulation, pseudoknots. Can compare different structures with the same sequence. | Has a confusing squiggle layout | http://jviz.cs.sfu.ca
NAVRNA3 | Linux only | PDB | Squiggle | 2D editing & manipulation, interactive 2D & 3D display, pseudoknots. Uses novel approach to interact with structural data. | Requires hardware not available to many users | http://tinyurl.com/NAVRNA
PseudoViewer4 | Win only | Bracket notation, BPSEQ, CT, Custom format | Squiggle | 2D manipulation. Can easily recover RNA secondary structures with any kind of pseudoknots. | | http://tinyurl.com/pseudoview
RNAFamily | Win, Mac, Linux | CT | Linear | Can render several secondary structures in the same display. | Linear layout only | http://tinyurl.com/rnafmly
RNAMovies5 | Win, Mac, Linux | Custom format, DCSE, RNAStructML | Squiggle | Pseudoknots. Currently the only tool that can make movies. | No other functionality | http://tinyurl.com/rnamovies
Rnaviz6 | Win, Mac, Linux | DCSE, CT, RNAML | Squiggle | 2D manipulation, annotation, pseudoknots. Use of skeleton files makes it easy to produce and save RNA layouts. | No recent update | http://rnaviz.sourceforge.net
S2S Assemble*7 | Win, Mac, Linux | BPSEQ, Bracket notation, CT, FASTA, PDB, RNAML, Stockholm | Squiggle | 2D editing & manipulation, non-canonical interactions, pseudoknots, interactive 2D & 3D display, interactive 3D modeling, secondary structure prediction. Has a very rich set of features. | Steep learning curve | http://bioinformatics.org/s2s http://bioinformatics.org/assemble
StructureLab8 | Linux only | Various formats are supported depending on linked algorithm | Dot plot, Stem trace, Squiggle | 2D manipulation, annotations, pseudoknots, interactive 2D & 3D display, interactive 3D modeling, secondary structure prediction. | Steep learning curve due to rich feature set | http://tinyurl.com/structureLab
VARNA*9 | Win, Mac, Linux | BPSEQ, Bracket notation, CT, RNAML | Circular, Linear, Squiggle | 2D manipulation, 2D editing, annotations, non-canonical interactions, pseudoknots. Can be embedded in webpages. | | http://varna.lri.fr
XRNA | Win, Mac, Linux | Custom format (XRNA) | Squiggle | 2D manipulation, annotations. | Constraints on 2D editing | http://rna.ucsc.edu/rnacenter/xrna

* Our recommendations. All tools listed are free for academic use. (b) Tools with ‘2D editing’ can modify the secondary structure definition; tools with ‘non-canonical’ can handle non-canonical interactions. Tools with interactive 2D & 3D display allow simultaneous visualization of, and interaction between, both secondary structure and tertiary structure in a 3D viewer, e.g., PyMOL.
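Several tools in Supplementary Table 1 accept bracket notation as input. As an illustration only, the following minimal sketch shows how such a string can be converted into a list of base pairs. It assumes the widely used dot-bracket convention in which '.' marks an unpaired base and matching brackets mark a pair, and it further assumes that additional bracket types ('[ ]', '{ }', '< >') are used to encode pseudoknots; the function name parse_dot_bracket is hypothetical and is not taken from any of the listed tools.

```python
def parse_dot_bracket(structure):
    """Return a sorted list of (i, j) base pairs (0-based) from a dot-bracket string.

    Assumption: each bracket type is nested independently, so crossing pairs
    (pseudoknots) can be written with a second bracket type.
    """
    openers = {"(": ")", "[": "]", "{": "}", "<": ">"}
    closers = {close: open_ for open_, close in openers.items()}
    stacks = {open_: [] for open_ in openers}  # one stack per bracket type
    pairs = []
    for pos, ch in enumerate(structure):
        if ch in openers:
            stacks[ch].append(pos)
        elif ch in closers:
            stack = stacks[closers[ch]]
            if not stack:
                raise ValueError(f"unmatched '{ch}' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif ch != ".":
            raise ValueError(f"unexpected character '{ch}' at position {pos}")
    for open_, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched '{open_}' at position {stack[-1]}")
    return sorted(pairs)


# A simple hairpin, and a structure with a pseudoknot written using '[ ]'.
hairpin_pairs = parse_dot_bracket("((..))")
pseudoknot_pairs = parse_dot_bracket("((.[[.)).]]")
```

Keeping one stack per bracket type is what allows crossing pairs to be read; a single shared stack would only handle strictly nested structures.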


References

1. Hartshorn, M. J. AstexViewer: a visualisation aid for structure-based drug design. J Comput Aided Mol Des 16 (12), 871 (2002).
2. Wiese, K., Glen, E., and Vasudevan, A. Jviz.Rna - a Java Tool for RNA Secondary Structure Visualization. IEEE Trans Nanobioscience 4, 212 (2005).
3. Bailly, G., Auber, D., and Nigay, L., in IV'06 Conference on Information Visualization (London, UK, 2006), pp. 107.
4. Byun, Y. and Han, K. Pseudoviewer3: Generating Planar Drawings of Large-Scale RNA Structures With Pseudoknots. Bioinformatics 25, 1435 (2009).
5. Kaiser, A., Kruger, J., and Evers, D. RNA Movies 2: Sequential Animation of RNA Secondary Structures. Nucleic Acids Res. 35, W330 (2007).
6. De Rijk, P., Wuyts, J., and De Wachter, R. Rnaviz 2: An Improved Representation of RNA Secondary Structure. Bioinformatics 19, 299 (2003).
7. Jossinet, F. and Westhof, E. Sequence to Structure (S2S): Display, Manipulate and Interconnect RNA Data From Sequence to Structure. Bioinformatics 21, 3320 (2005). Offers the most complete set of features for viewing RNA structures. Recommended for advanced users. Also available via web services.
8. Shapiro, B. and Kasprzak, W. Structurelab: A Heterogeneous Bioinformatics System for RNA Structure Analysis. J Mol Graph 14, 194 (1996).
9. Darty, K., Denise, A., and Ponty, Y. VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics 25 (15), 1974 (2009). Simple, easy-to-use, and feature-rich tool for searching, display and manipulation of RNA secondary structures. Can be embedded in webpages.


Visualization of omics data for systems biology

Nils Gehlenborg1,2, Seán I O’Donoghue3, Nitin S Baliga4, Alexander Goesmann5, Matthew A Hibbs6, Hiroaki Kitano7–9, Oliver Kohlbacher10, Heiko Neuweger5, Reinhard Schneider3, Dan Tenenbaum4 & Anne-Claude Gavin3

High-throughput studies of biological systems are rapidly accumulating a wealth of ‘omics’-scale data. Visualization is a key aspect of both the analysis and understanding of these data, and users now have many visualization methods and tools to choose from. The challenge is to create clear, meaningful and integrated visualizations that give biological insight, without being overwhelmed by the intrinsic complexity of the data. In this review, we discuss how visualization tools are being used to help interpret protein interaction, gene expression and metabolic profile data, and we highlight emerging new directions.

Visualization has long been key in helping to understand biological systems, such as metabolism1, signaling2 and the regulation of gene expression3. In recent years, the study of such systems has been profoundly influenced by the development of a wide range of high-throughput experimental methods (Box 1), resulting in a greatly increased volume of complex, interconnected data. Remarkably, in spite of these changes, and in spite of the development of new methods for visualizing and analyzing these data, we still use the same primary visual metaphor to communicate ideas about biological systems: namely, pathways (graphs that show overall changes in state) or, more generally, networks (graphs that do not necessarily show state changes).

As high-throughput experimental methods have become more routine, many more scientists are using network and pathway visualization to record and communicate their findings. There are now over 300 web resources4 (see http://pathguide.org/) providing access to many thousands of pathways and networks that document millions of interactions between proteins, genes and small molecules.

There has been a corresponding increase in the development of visualization tools for systems biology data5–7. These tools are very diverse, but they can be broadly divided into two partly overlapping categories, the first consisting of tools focused on automated methods for interpreting and exploring large biological networks (Table 1), and the second consisting of tools focused on assembly and curation of pathways (Table 2). Many of these tools are tightly integrated with public databases, thus allowing users to visualize and interpret their own data in the context of previous knowledge.

For users and developers of these visualization tools, one of the key challenges is how to benefit from the explosion in systems biology data without being overwhelmed by it or, in practical terms, how to present the data at the right level of detail, in a cohesive, insightful manner. Clearly, the answers depend on context.

In this review, we first discuss the methods and tools now being used to visualize and analyze data sets from three main types of high-throughput experiments: namely, the investigation of protein-protein interactions, of gene expression profiles and of metabolic profiles. Such experiments are used to study cellular response to a wide variety of conditions, including drug exposure, disease states and specific genetic

1European Bioinformatics Institute, Cambridge, UK. 2Graduate School of Life Sciences, University of Cambridge, Cambridge, UK. 3European Molecular Biology Laboratory, Heidelberg, Germany. 4Institute for Systems Biology, Seattle, Washington, USA. 5CeBiTec, Bielefeld University, Bielefeld, Germany. 6The Jackson Laboratory, Bar Harbor, Maine, USA. 7Sony Computer Science Laboratories, Tokyo, Japan. 8The Systems Biology Institute, Tokyo, Japan. 9Okinawa Institute of Science and Technology, Okinawa, Japan. 10University of Tübingen, Tübingen, Germany. Correspondence should be addressed to S.I.O. ([email protected]). PUBLISHED ONLINE 1 march 2010; DOI:10.1038/NMETH.1436


BOX 1 KEY EXPERIMENTAL METHODS FOR SYSTEMS BIOLOGY

Oligonucleotide microarrays. The most widely used methods to monitor the expression levels of RNA transcripts in a biological sample are based on microarrays. They measure the hybridization of fluorescently labeled cDNA, synthesized from extracted mRNA, to known nucleotide sequences spotted on solid surfaces117. For all genes on the microarray, an expression value is derived from the fluorescence intensity of the hybridized RNAs. These expression values are typically unitless and have meaning only in the context of a reference measurement. Before further analysis takes place, the measurements must therefore be normalized to remove systematic biases and to make it possible to compare measurements from different samples.

Quality assessment is likewise essential for the validity of later analyses. This is typically performed with the help of (platform-dependent) quality scores at the level of both individual probes and entire arrays, complemented by diagnostic visualization tools that have been developed for this purpose118,119. Evaluating the quality of individual arrays is routinely done with spatial intensity distribution plots and plots of intensity ratio versus mean intensity (Supplementary Fig. 1a). Comparison of multiple arrays can be achieved with intensity box plots, which are a practical tool to detect outlier arrays that should be excluded from subsequent analysis (Supplementary Fig. 1b). Several tools that provide quality assessment visualizations are listed in Supplementary Table 1.

RNA deep sequencing. The most recent transcriptomics approaches are based on the deep sequencing of transcripts extracted from biological samples33. The resulting sequence reads (typically 30 to 400 base pairs long, depending on the DNA-sequencing technology used) are then commonly aligned to a reference genome and evaluated to determine their quality. Tools for data processing and quality assessment typically provide diagnostic visualizations. Examples include the R/Bioconductor packages ShortRead120 and edgeR121. The latter provides many functions that are analogous to those in the limma package122 for transcriptomics data from microarrays. Reads aligned to a genome can also be visualized and evaluated with some of the more recent genome browsers that can handle short read data, such as the Integrative Genomics Viewer (http://www.broadinstitute.org/igv/). This and similar tools are discussed in the accompanying review by Nielsen et al.72.

Mass spectrometry. In mass spectrometry (MS) experiments, the compounds present in a sample are identified through the accurate measurements of their mass-to-charge ratios. MS has applications in many fields, including proteomics, metabolomics and interactome mapping.

In proteomic applications, typical MS data sets consist of lists of proteolytic peptides characterized by their mass-to-charge ratios (MS spectra, MS1). These peptides can be further fragmented and measurements of the resulting mass spectra (MS-MS spectra or tandem MS spectra, MS2) used to deduce their sequences. In some cases, complex samples must be fractionated and proteolytic peptides are separated using high performance liquid chromatography (LC) before MS analysis (LC-MS).

Several search engines have been developed to predict peptides and proteins through the comparison of experimentally measured spectra to theoretical spectra (predicted from sequence databases). Quality scores provide a measure of the reliability of a given protein or peptide identification123. For example, for Mascot124, the most broadly used algorithm, the score features the number of identified peptides (sequence coverage).

The overall quality of entire MS data sets is generally measured by the false discovery rate (FDR), which is the ‘expected’ proportion of incorrect assignments among the accepted assignments. The most popular approach to calculate FDR is based on the use of a target-decoy database123. Also, an array of visualization tools has been developed to evaluate the technical quality of the samples and of MS runs. For example, the overall distribution of peptides in an LC-MS map can be visualized with Pep3D125 or TOPPView126, enabling the detection of possible biases, the presence of chemical contaminants or poor separations during the LC (Supplementary Fig. 2). Additionally, Pep3D can integrate quality scores for individual protein or peptide identifications generated by search engines into these maps (Supplementary Fig. 3). We list tools for mass spectrometry data visualization and evaluation in Supplementary Table 1.

In metabolomics applications, owing to large chemical diversity and variation in molecular composition of the analytes, various chromatographic systems, such as gas chromatography (GC), LC or electrochemistry (EC), are generally applied before MS. GC-MS is the most popular method for global metabolite profiling127. It can be complemented with LC-MS analysis to identify compounds that are not suitable for GC-MS analysis128. Similarly to the approaches developed for peptides, metabolites can be identified on the basis of their fragmentation patterns, for which mass spectral fingerprint libraries are being developed. Because the raw data are of the same kind as in proteomics mass spectrometry studies, very similar visualization methods are used to assess data quality (Supplementary Table 1).

Nuclear magnetic resonance. Nuclear magnetic resonance (NMR) is a common method in metabolomics and, in contrast to MS-based approaches, in most cases does not require analyte separation. NMR spectroscopy can provide detailed information on the molecular structure of compounds found in complex mixtures, and a wide range of small molecule metabolites in a sample can be detected simultaneously. Biofluids, cell and tissue extracts can be analyzed with minimal sample preparation through the use of 1H NMR spectroscopy129. With the use of two-dimensional NMR spectra, the identification and reliable quantification of individual metabolites becomes feasible, which enables NMR-based metabolite profiling. Data processing and spectral deconvolution are challenging, and databases of NMR spectra of pure metabolites are not yet comprehensive, but they nonetheless do already help in the identification process130. Applications such as MetaboMiner131 can be used for the semiautomated identification of metabolites in two-dimensional NMR spectra, supported by visualizations that allow the scientist to inspect the matches of peaks to reference spectra and assess match quality.
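The target-decoy estimate of the FDR mentioned in Box 1 can be illustrated with a short sketch. The version below assumes that each peptide-spectrum match carries a score (higher is better) and a flag saying whether it matched a decoy sequence, and it uses the simple ratio of accepted decoy matches to accepted target matches as the estimate. This is only one of several estimators in use, and the function name and input layout are hypothetical rather than those of any particular search engine.

```python
def target_decoy_fdr(matches, threshold):
    """Estimate the FDR among target matches scoring at or above `threshold`.

    `matches` is a list of (score, is_decoy) tuples. The estimate is
    (#accepted decoys) / (#accepted targets), capped at 1.0; this assumes the
    decoy database is the same size as the target database.
    """
    accepted_targets = sum(1 for score, is_decoy in matches
                           if score >= threshold and not is_decoy)
    accepted_decoys = sum(1 for score, is_decoy in matches
                          if score >= threshold and is_decoy)
    if accepted_targets == 0:
        return 0.0  # nothing accepted, so there is nothing to be wrong about
    return min(1.0, accepted_decoys / accepted_targets)


# Hypothetical scores for six matches; True marks a hit to a decoy sequence.
example_matches = [(90, False), (85, False), (80, True),
                   (70, False), (60, True), (50, False)]
fdr_at_65 = target_decoy_fdr(example_matches, threshold=65)
```

In practice the score threshold is usually lowered step by step until the estimated FDR reaches the level the analyst is willing to accept.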


perturbations (for example, gene deletions, gene insertions and siRNA knockdowns). Often, these experiments produce new knowledge that is then either added to existing pathways or used to create new pathways. Thus, we end with a discussion of methods and tools for pathway editing.

Protein interaction data

A range of experimental methods are at present being used for high-throughput studies of protein interactions8. For instance, in yeast, pairwise interactions have been studied on the genome scale using yeast two-hybrid screens or protein complementation assays, whereas the assembly of proteins within complexes has been systematically charted using tandem affinity purification coupled with mass spectrometry (TAP-MS) (Box 1 and Supplementary Figs. 2 and 3). Recent analyses have estimated that, in yeast, some 20,000 pairwise interactions may take place between the ~5,000 gene products9, and about 800 protein complexes may exist10. As a result of these and similar studies in other species, vast amounts of protein interaction data are accumulating in public databases11,12 such as DIP13, HPRD14 and IntAct15.

The size and complexity of these data sets can be daunting; hence, a common general strategy is to iteratively dissect the data sets into smaller subsets. Typically, these subsets are defined as sets of proteins that belong to the same complex, or that are found at the same subcellular location, or that belong to a similar functional category. Visualization is key in this strategy, as human judgment and intervention are often needed, in part because of errors (false positives and false negatives) in protein interaction data sets9,16. Many visualization tools have been developed specifically to support the analysis of protein networks (Table 1); here we discuss how these tools can be used to help dissect large data sets of interactions, extract biological insight and generate hypotheses leading to further experimental investigations.

As proteins rarely act alone, a first step in analyzing a protein interaction data set is to identify protein complexes and groups of complexes. For small, simple networks, visualized as a graph in which each node represents a protein and each edge represents an interaction between two proteins, the arrangement of proteins and complexes can usually be seen clearly using a standard ‘force-directed’ layout17, which automatically arranges each node

Table 1 | Visualization tools focused on interaction networks

Name | Cost | OS | Description | URL

Stand-alone
Arena 3D63 | Free | Win, Mac, Linux | Visualization of biological multi-layer networks in 3D | http://www.arena3d.org/
BiNA81 | Free | Win, Mac, Linux | Exploration and interactive visualization of pathways | http://www.bnplusplus.org/bina/
BioLayout Express 3D37 | Free | Win, Mac, Linux | Generation and cluster analysis of networks with 2D/3D visualization | http://www.biolayout.org/
BiologicalNetworks82 | Free | Win, Mac, Linux | Analysis suite; visualizes networks and heat map; abundance data | http://www.biologicalnetworks.org/
Cytoscape*20,83 | Free | Win, Mac, Linux | Network analysis; extensive list of plug-ins for advanced visualization | http://www.cytoscape.org/
GENeVis36 | Free | Win, Mac, Linux | Network and pathway visualization; abundance data | http://tinyurl.com/genevis/
Medusa84 | Free | Win, Mac, Linux | Basic network visualization tool | http://coot.embl.de/medusa/
N-Browse85 | Free | Win, Mac, Linux | Network visualization software for heterogeneous interaction data | http://www.gnetbrowse.org/
NAViGaTOR23,86 | Free | Win, Mac, Linux | Visualization of large protein-interaction data sets; abundance data | http://tinyurl.com/navigator1/
Ondex87 | Free | Win, Mac, Linux | Integrative workbench: large network visualizations; abundance data | http://www.ondex.org/
Osprey88 | Free | Win, Mac, Linux | Tool for visualization of interaction networks | http://tinyurl.com/osprey1/
Pajek89 | Free | Win | Generic network visualization and analysis tool | http://pajek.imfm.si/
ProViz | Free | Win, Mac, Linux | Software for visualization and exploration of interaction networks | http://tinyurl.com/proviz/
SpectralNET90 | Free | Win | Network visualizations; scatter plots for dimensionality reduction methods | http://tinyurl.com/spectralnet/
Tulip91 | Free | Win, Mac, Linux | Generic visualization tool; extremely large networks; 3D support | http://tulip.labri.fr/TulipDrupal/
VANTED21 | Free | Win, Mac, Linux | Combined visualization of abundance data, networks and pathways | http://tinyurl.com/vanted/
yEd | Free | Win, Mac, Linux | Generic network visualization software; offers many layout algorithms | http://tinyurl.com/yEdGraph/

Cytoscape plug-in
BiNoM92 | Free | Win, Mac, Linux | Extensive support for common systems biology network formats | http://tinyurl.com/binom1/
BioModules24 | Free | Win, Mac, Linux | Detects modules in networks; maps abundance data onto nodes and modules | http://tinyurl.com/biomodules/
Cerebral*26,78 | Free | Win, Mac, Linux | Biologically motivated layout algorithm; maps abundance data; clustering | http://tinyurl.com/cerebral1/
MCODE18 | Free | Win, Mac, Linux | Network clustering algorithm; support for manual cluster refinement | http://tinyurl.com/MCODE123/
VistaClara42 | Free | Win, Mac, Linux | Mapping of abundance data to nodes and ‘heat strips’; provides heat map | http://tinyurl.com/cytoplugins/

Web-based
Graphle93 | Free | | Distributed client/server network exploration and visualization tool | http://tinyurl.com/graphle/
Lichen | Free | | Library for web-based visualization of network and abundance matrix data | http://tinyurl.com/Lichen1/
MAGGIE Data Viewer | Free | | Visualization of networks; abundance data in heat maps and profile plots | http://maggie.systemsbiology.net/
STITCH31 | Free | | Construction and visualization of networks from a wide range of sources | http://stitch.embl.de/
VisANT22 | Free | Win, Mac, Linux | Analysis, mining and visualization of pathways and integrated omics data | http://visant.bu.edu/

Some of the tools in this table have capabilities similar to tools that are listed in other tables. To avoid listing tools in more than one table, we assigned tools to tables on the basis of what we understand to be their primary purpose. *Our recommendations. Free means the tool is free for academic use; $ means there is a cost. OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. 2D, two-dimensional; 3D, three-dimensional.
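To make the idea of a force-directed layout more concrete, the following minimal sketch implements a simple spring-embedder in the spirit of the Fruchterman-Reingold scheme: every pair of nodes repels, nodes joined by an edge attract, and a decreasing step limit lets the layout settle. It is an illustration under stated assumptions, not the algorithm used by any specific tool in Table 1. The function name, parameter defaults and example network are hypothetical, and the heuristic only tends to even out edge lengths and reduce crossings rather than optimizing either explicitly.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=200, seed=0):
    """Return {node: (x, y)} for a list of nodes and a list of (u, v) edges."""
    rng = random.Random(seed)                  # fixed seed gives a reproducible layout
    pos = {n: [rng.random(), rng.random()] for n in nodes}
    k = math.sqrt(1.0 / max(len(nodes), 1))    # preferred distance between nodes
    max_step = 0.1                             # step limit, reduced every iteration
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsive force between every pair of nodes.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = k * k / dist
                disp[u][0] += dx / dist * force
                disp[u][1] += dy / dist * force
                disp[v][0] -= dx / dist * force
                disp[v][1] -= dy / dist * force
        # Attractive force along every edge.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / k
            disp[u][0] -= dx / dist * force
            disp[u][1] -= dy / dist * force
            disp[v][0] += dx / dist * force
            disp[v][1] += dy / dist * force
        # Move each node along its net force, but no further than max_step.
        for n in nodes:
            dx, dy = disp[n]
            length = math.hypot(dx, dy) or 1e-9
            step = min(length, max_step)
            pos[n][0] += dx / length * step
            pos[n][1] += dy / length * step
        max_step *= 0.98
    return {n: (p[0], p[1]) for n, p in pos.items()}


# Hypothetical network: two small groups of proteins joined by one interaction.
example_nodes = ["A", "B", "C", "D", "E", "F"]
example_edges = [("A", "B"), ("B", "C"), ("A", "C"),
                 ("D", "E"), ("E", "F"), ("D", "F"), ("C", "D")]
example_layout = force_directed_layout(example_nodes, example_edges)
```

The all-pairs repulsion step grows quadratically with the number of nodes, which is one practical reason why plain force-directed layouts become slow and cluttered on large networks.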


to minimize the number of edge crossings while trying to keep the lengths of all edges approximately the same. However, this approach quickly becomes inadequate as the network size and complexity increase (Fig. 1a). Instead, clustering approaches are used, which predict higher-order protein complexes from the interaction data. One very commonly used tool for this purpose is MCODE18. For TAP-MS and other data sets where components of protein complexes are experimentally determined, other clustering methods are used (for example, ‘clique percolation’19). The results of these clustering analyses can then be used to change the layout and appearance of the network (Fig. 1a,b) in a way that may yield biological insights that cannot be easily obtained by simply examining lists of proteins or protein complexes. For instance, by viewing the network, the scientist may notice connections between two complexes that suggest a previously unknown biological relationship. Furthermore, on the basis of previous knowledge, the scientist may be able to assign a putative function or subcellular localization to the complex; this information can be visualized using node color or shape to represent the functional category or location of the proteins. Similarly, node color or shape can be used to show which proteins belong to the same complex (Fig. 1b).

Most network visualization tools provide the ability to interactively change the layout of the network, for example, by automatically arranging a user-defined group of proteins into any of a variety of arrangements (a circle, a line and so forth) or by manually moving nodes. This ability can be very useful in creating visualizations that emphasize biologically significant relationships and interactions between complexes (Fig. 1c) or between ‘hub’ proteins and their partners (for example, between kinases and their substrates). Tools that support such interactive editing particularly well include Cytoscape20, VANTED21, VisANT22 and NAViGaTOR23.

It is often useful to collapse all members of a protein complex or cluster into a single ‘meta-node’ (Fig. 1d) that can later be expanded, depending on screen space and the desired level of detail. Meta-nodes not only simplify the appearance of the network, they can also be useful in more clearly illustrating biological relationships between protein complexes. Meta-nodes can also help to visually arrange the network to give insight into the integration and coordination of cellular functions (Fig. 1d). Meta-nodes are supported by yEd (http://tinyurl.com/yEdGraph/), BioModules24 and VisANT, the last of which further allows meta-nodes to be nested hierarchically and can show ‘meta-edges’25 between meta-nodes; these can indicate, for example, when proteins are shared between two collapsed complexes.

Present high-throughput experimental methods often do not determine the spatial, or subcellular, location where an interaction takes place, so it can be highly informative to include any previous protein localization information in the analysis of these data sets. For instance, the network may be filtered to show only proteins known to occur in selected locations, thus simplifying it and allowing the scientist to focus only on interactions within a defined subcellular location. Alternatively, subcellular location

Figure 1 | Visualization of protein interaction networks. (a–d) Cytoscape20 images of Mycoplasma pneumoniae protein interaction data derived by mass spectrometry19 analysis. (a) Initial protein interaction network (>400 proteins) laid out with a force-directed algorithm. Nodes discussed in the following steps are overlaid with functional annotations (blue, RNA polymerase; dark or light green, small or large ribosomal subunits, respectively; red, elongation factor). (b) Recomputed network remaining after removal of nodes not of interest. Five computationally determined complexes are colored according to functional annotation. Node shapes represent different roles in the complex (circle, core protein of complex; diamond, protein attached to complex but not part of the core). At this stage, clusters emerge. (c) Manual refinement of the network layout emphasizing structure of protein complexes and interactions between them. (d) Collapse of nodes in each complex core, simplifying the network and emphasizing global properties. (e) Stages in deadenylation-dependent mRNA degradation in Saccharomyces cerevisiae. Reproduced from Gavin et al.10. Arrows show the order of sequential steps in a cellular process. Proteins are colored according to their localization (green, cytoplasm; red, nucleus; blue, punctate composite (undefined subcellular structure); yellow, mitochondria; white, unknown). Edge styles represent socio-affinity indices (dotted, 5–10; dashed, 10–15; solid, >15). TAP-MS bait proteins, bold; shaded circles, protein complexes.


Table 2 | Visualization tools focused on pathways

Name | Cost | OS | Description | URL

Stand-alone
BioTapestry94 | Free | Win, Mac, Linux | Visualization of genetic regulatory networks, also with experimental data | http://www.biotapestry.org/
Caleydo95 | Free | Win, Linux | Interactive framework for pathway and expression data; 3D ‘bucket’ view | http://www.caleydo.org/
CellDesigner*51 | Free | Win, Mac, Linux | Drawing and simulation of pathways and models; supports SBGN | http://www.celldesigner.org/
Edinburgh Pathway Editor | Free | Win, Mac, Linux | Construction and visualization of pathway diagrams; supports SBGN | http://tinyurl.com/EdinburghPE/
GenMAPP40 | Free | Win | Pathway visualization and construction; abundance data | http://www.genmapp.org/
IngenuityPathways | $ | Win, Mac, Linux | Full analysis suite; network and pathway visualizations; abundance data | http://tinyurl.com/IngenuityPath/
JDesigner52 | Free | Win | Drawing and simulation of pathways and models | http://tinyurl.com/jdesigner/
KaPPA View48 | Free | Win | Analysis and visualization of plant pathways and mapped abundance data | http://tinyurl.com/kappa-view/
KEGG Atlas96 | Free | Win, Mac, Linux | Visualization of abundance data on interactive KEGG pathways | http://www.genome.jp/kegg/
MetaCore | $ | Win, Mac, Linux | Pathway, network and omics data analysis and visualization suite | http://www.genego.com/
PathVisio97 | Free | Win, Mac, Linux | Pathway visualization and editing; supports mapping of omics data | http://www.pathvisio.org/
VitaPad98 | Free | Win, Mac, Linux | Editing of pathway diagrams; integration of abundance data | http://tinyurl.com/vitapad/

Web-based
ArrayXPath99 | Free | | Mapping of abundance data to pathway visualizations | http://tinyurl.com/ArrayXPath/
GEPAT100 | Free | | Analysis suite; visualization of transcriptomics data on pathways maps | http://tinyurl.com/GEPAT1/
iPath101 | Free | | Visualization and exploration of combined KEGG pathways | http://pathways.embl.de/
MapMan46 | Free | | Visualization of abundance data on metabolic pathways | http://tinyurl.com/MapManApp/
Omics Viewer47,102 | Free | | Mapping of abundance data to BioCyc pathway diagrams | http://www.biocyc.org/
Pathway Explorer49 | Free | | Visualization of abundance data on pathways | http://tinyurl.com/pathwayexp/
PATIKA103 | Free | | Pathway visualization suite; good support for signaling pathways | http://www.patika.org/
Payaologue | Free | | Collaborative pathway annotation and visualization tool | http://celldesigner.org/payao/
ProMeTra41 | Free | | Maps abundance matrices of multiple data types to pathways | http://tinyurl.com/ProMeTra/
Reactome SkyPainter30 | Free | | Visualization of over-represented pathways and reactions from gene lists | http://reactome.org/
WikiPathways62 | Free | | Wiki-based, community-driven pathway curation and visualization tool | http://www.wikipathways.org/

Some of the tools in this table have capabilities similar to tools that are listed in other tables. To avoid listing tools in more than one table, we assigned tools to tables on the basis of what we understand to be their primary purpose. *Our recommendations. Free means the tool is free for academic use; $ means there is a cost. OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. 3D, three-dimensional.
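The collapsing of protein complexes into meta-nodes, discussed in the section on protein interaction data, can be sketched in a few lines. The example below assumes the network is given as a list of protein pairs and the complexes as named sets of proteins; it counts the interactions that link two different meta-nodes (a simple form of meta-edge) and records which proteins are shared between complexes. All names and data are hypothetical, complex names are assumed not to clash with protein names, and the sketch is not meant to reproduce the behaviour of yEd, BioModules or VisANT.

```python
from itertools import combinations


def collapse_complexes(edges, complexes):
    """Collapse each complex into a meta-node.

    edges: list of (protein, protein) interactions.
    complexes: dict mapping complex name -> set of member proteins.
    Returns (meta_edges, shared):
      meta_edges maps a sorted (node, node) pair to the number of protein
      interactions linking the two meta-nodes;
      shared maps a (complex, complex) pair to the proteins they have in common.
    """
    membership = {}
    for name, members in complexes.items():
        for protein in members:
            membership.setdefault(protein, set()).add(name)

    def meta_nodes(protein):
        # Proteins outside every complex stay as their own node.
        return membership.get(protein, {protein})

    meta_edges = {}
    for a, b in edges:
        for node_a in meta_nodes(a):
            for node_b in meta_nodes(b):
                if node_a != node_b:  # interactions inside a meta-node are hidden
                    key = tuple(sorted((node_a, node_b)))
                    meta_edges[key] = meta_edges.get(key, 0) + 1

    shared = {}
    for c1, c2 in combinations(sorted(complexes), 2):
        common = complexes[c1] & complexes[c2]
        if common:
            shared[(c1, c2)] = common
    return meta_edges, shared


# Hypothetical data: protein C belongs to both complexes, E to neither.
example_complexes = {"C1": {"A", "B", "C"}, "C2": {"C", "D"}}
example_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "D")]
example_meta_edges, example_shared = collapse_complexes(example_edges, example_complexes)
```

Because a shared protein is treated as a member of every complex it belongs to, an interaction that involves it contributes to the meta-edge between those complexes; a stricter definition could count only interactions between proteins unique to each complex.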

can be indicated using node coloring; this can be particularly useful when studying the interactions of complexes that move between subcellular locations (Fig. 1e). Another common strategy is to arrange the network so that all proteins belonging to the same subcellular location are gathered together in one region (see Fig. 2a). For small networks, such a layout depicting subcellular localization is often created manually. However, for large networks, it is much more convenient to use tools that can achieve such a layout automatically, such as Cerebral26 (Fig. 2a) and PATIKA27. These tools also draw boundaries or use shading so that the scientist can see clearly which regions of the network correspond to which subcellular locations.

Protein interaction data sets commonly do not capture information about dynamic changes in protein abundance. Thus, as with spatial information, it is often useful to include temporal information from other experiments, for example by identifying proteins whose abundance is known to vary throughout the cell cycle. This information can be used to simplify a large network by either depicting only proteins that are coexpressed or by mapping expression or abundance profiles of proteins of interest onto nodes28 in the network, as described in more detail below (see Network enrichment).

These processes of dissection are all aimed at dividing a protein interaction data set into manageable, biologically significant parts that can be interpreted; during this process of interpretation, a scientist often makes use of previously established knowledge, particularly pathways (for example, KEGG29 or Reactome30) and networks (for example, STITCH31). In some cases, to illustrate a result or insight, it can be useful to add interactions derived from previous studies, thus forming a hybrid network that shows both new and old data (Fig. 1e).

Expression profile data

A range of experimental methods are being used for high-throughput expression profiling (Box 1, Supplementary Fig. 1 and Supplementary Table 1); in addition to gene expression profiling with DNA microarrays32 and RNA deep sequencing33, a promising emerging technology is quantitative protein expression profiling based on mass spectrometry34,35. Gene expression profile data sets are being deposited in two main repositories, ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) and Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/), with around 15,000 studies now in the public domain.

The initial goal in analyzing expression profiles is usually to find a set of genes or, less typically, proteins that share a related pattern of expression, for example genes that are up- or down-regulated in a certain genotype, disease model or human disease, or in response to a drug treatment. The challenge is that a single data set may contain expression profiles for over 10,000 genes, measured over a range of time points and experimental conditions, so that determining which genes are potentially relevant to the studied problem requires an extensive search through a large amount of often noisy, multivariate data. Together with various clustering algorithms, visualization is key in these analyses32, and


Figure 2 | Omics data overlaid onto biological networks. (a) Cerebral78 showing the TLR4-to-NF-κB signaling pathway79 laid out according to subcellular localization and functional annotation (green shading). Direction of information flow is from top to bottom. Node colors represent relative expression (red, upregulation; blue, downregulation) and edge colors represent interaction type (orange, phosphorylation; cyan, other protein interaction; purple, transcriptional regulation). The left two panels are a ‘small multiples’ display of the same pathway overlaid with gene expression data for two experimental conditions. In this case, the top panel has been selected, and hence is also shown in the main window. The bottom panels show the detailed expression profiles corresponding to genes shown in the pathway panel (see ‘Network enrichment’). The data set shows how upregulation of NFKB1 explains the observed upregulation of several chemokine proteins. (b) ProMeTra41 display showing both metabolomics and transcriptomics time series data from five time points. Metabolite and enzyme nodes in the pathway map are subdivided into five areas, one per time point. Areas are color coded (green, upregulation; yellow, no change; white, missing data) to indicate metabolite concentrations and transcript levels relative to a reference time point. (c) Lichen rendering of a gene regulatory network overlaid with transcriptomics data using a circular heat map. Each concentric ring represents a time point, and the color of the circle represents expression level (red, upregulation; green, downregulation; black, no change). Numbers identify genes. (d) VistaClara42 display showing transcript levels across four time points relative to a reference time point as ‘heat strips’ below the nodes (height of bar, relative expression; red, upregulation; green, downregulation). Node color indicates expression level from time point 4. Node size indicates reliability of the measurement taken at time point 4. (e) GENeVis36 visualization of the same data set. Color coding as in d; height of bar corresponds directly to the reliability of the measurement at each time point. (f) Profile plots of the same data embedded in nodes in VisANT22. The color of each line segment represents the change in expression levels between two time points (red and purple, increase; blue and cyan, decrease). (g) Visualization of metabolite concentrations in a pathway map in VANTED21 using a bar chart with error bars.

a wide range of tools have been developed to aid the visualization process (Table 3). Many of these tools implement a set of commonly applied methods (Box 2 and Fig. 3); in particular, scatter plots combined with dimensionality reduction (Fig. 3a), profile plots (Fig. 3b), heat maps, and dendrograms (Fig. 3c), as well as clustering. As microarray gene expression analysis has matured as an experimental technique, many of the corresponding visualization methods have become well established and are widely used.

Network enrichment. Once a list of potentially relevant genes has been found using the above types of analysis, the next task is often to find pathways or networks where these genes are significantly over-represented. These ‘enrichment’ searches can be launched directly from several network visualization tools, for example, GENeVis36, Reactome SkyPainter30, MetaCore (GeneGo Inc.) or BioLayout Express 3D37. A logical next step is then to map gene expression levels onto the identified pathways. Interpreting expression data in the context of a visualized pathway or network usually proves more insightful than interpreting it without this context. For instance, visualizing the data in the context of pathways may show how the upregulation of a transcription factor explains the upregulation of many other genes under its control (Fig. 2a) and may lead to testable experimental hypotheses.

A wide range of representations are used for mapping gene expression levels onto pathways and networks, with the ideal choice depending on the specific experiment and question of interest38 (Fig. 2). A simple approach that is available as part of many tools (Table 1) is to represent expression levels as a color gradient, as in a heat map, and then color the nodes in the network according to their expression level under a particular condition (Fig. 2a).
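The color-gradient approach to node coloring can be illustrated with a minimal sketch. It assumes that expression is given as a log2 ratio against a reference and uses a blue-white-red scale like the one in Fig. 2a; the function name, the saturation limit of 2.0 and the example values are hypothetical and are not taken from any of the tools discussed.

```python
def expression_to_hex(log_ratio, limit=2.0):
    """Map a log2 expression ratio to a hex color on a blue-white-red scale."""
    # Clamp to [-limit, limit] so that extreme values saturate the scale.
    clamped = max(-limit, min(limit, log_ratio))
    strength = abs(clamped) / limit          # 0 = no change, 1 = saturated
    fade = round(255 * (1.0 - strength))     # channels that fade toward white
    if clamped >= 0:
        red, green, blue = 255, fade, fade   # upregulation: white to red
    else:
        red, green, blue = fade, fade, 255   # downregulation: white to blue
    return f"#{red:02x}{green:02x}{blue:02x}"


# Hypothetical log2 ratios for three nodes under one condition.
node_expression = {"geneA": 2.0, "geneB": 0.0, "geneC": -3.0}
node_colors = {gene: expression_to_hex(v) for gene, v in node_expression.items()}
print(node_colors)
```

Clamping is a design choice: without it, a single outlier would compress the colors of all other nodes toward white.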

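The text does not state which statistical test the listed tools use for ‘enrichment’ searches. One common choice for testing over-representation is the one-sided hypergeometric test, sketched below under that assumption; the function name and the example numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(population, pathway_size, list_size, overlap):
    """Probability of seeing at least `overlap` pathway genes in a random list.

    population   -- number of genes measured in the experiment
    pathway_size -- number of those genes annotated to the pathway
    list_size    -- number of genes in the list of interest
    overlap      -- number of listed genes that belong to the pathway
    """
    upper = min(pathway_size, list_size)
    total = comb(population, list_size)
    # Sum the hypergeometric probabilities for k = overlap .. upper.
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 10,000 genes measured, 50 in the pathway,
# 200 genes in the list, 8 of which fall in the pathway.
print(enrichment_p_value(10_000, 50, 200, 8))
```

When many pathways are tested at once, the resulting p-values would normally also be corrected for multiple testing, which this sketch omits.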

Table 3 | Visualization tools for multivariate omics data

Name | Cost | OS | Description | URL

Stand-alone
BicOverlapper104 | Free | Win, Mac, Linux | Visualization of biclusters combined with profile plots and heat maps | http://vis.usal.es/bicoverlapper/
BiGGEsTS105 | Free | Win, Mac, Linux | Heat map–based bicluster visualization | http://tinyurl.com/BiGGEsTS/
Brain Explorer76 | Free | Win, Mac | Visualization of 3D transcription data in the central nervous system | http://tinyurl.com/brainExplorer/
Caryoscope75 | Free | Win, Mac, Linux | Abundance data mapped to chromosomal location | http://tinyurl.com/caryoscope/
Data Matrix Viewer | Free | Win, Mac, Linux | Simple profile plot visualization; supports Gaggle | http://gaggle.systemsbiology.net/
EXPANDER106 | Free | Win, Linux | Heat maps, scatter plots and profile plots of cluster averages | http://acgt.cs.tau.ac.il/expander/
Genesis107 | Free | Win, Mac, Linux | Analysis suite; offers several interactive visualizations | http://tinyurl.com/genesisclient/
GeneSpring GX* | $ | Win, Mac, Linux | Analysis suite; interactive and linked visualizations; also networks | http://tinyurl.com/genespring/
GeneVAnD108 | Free | Win, Mac, Linux | Linked heat maps, dendrograms and 2D/3D scatter plots | http://tinyurl.com/GeneVAnD/
geWorkbench | Free | Win, Mac, Linux | Modular suite; heat maps, dendrograms, profile and scatter plots | http://tinyurl.com/geWorkbench/
HCE*109 | Free | Win | Linked heat map, profile and scatter plots; systematic exploration | http://tinyurl.com/HCExplorer/
Java TreeView*110 | Free | Win, Mac, Linux | Linked heat maps, karyoscopes, sequence alignments, scatter plots | http://jtreeview.sourceforge.net/
Mayday111 | Free | Win, Mac, Linux | Modular suite; many linked visualizations; enhanced heat map112 | http://tinyurl.com/maydaywp/
MultiExperiment Viewer*113 | Free | Win, Mac, Linux | Analysis suite; heat maps, dendrograms, profile and scatter plots | http://www.tm4.org/
PointCloudXplore77 | Free | Win, Mac, Linux | Visualization of 3D transcription data in Drosophila embryos | http://tinyurl.com/PointCloudXplore/
Spotfire Functional Genomics | $ | Win | Analysis suite; many linked visualizations and exploration tools | http://spotfire.tibco.com/
TimeSearcher114 | Free | Win | Exploration and analysis of time series; advanced profile plots | http://tinyurl.com/timesearcher/

R/BioConductor
Geneplotter | Free | Win, Mac, Linux | Karyoscope-style plots and other visualizations | http://www.bioconductor.org/

Web-based
ExpressionProfiler115 | Free | | Transcriptomics data analysis suite with basic visualizations | http://tinyurl.com/exprespro/
GenePattern116 | Free | | Modular analysis platform; several visualization modules available | http://tinyurl.com/GenePatt/

Some of the tools in this table have capabilities similar to tools that are listed in other tables. To avoid listing tools in more than one table, we assigned tools to tables on the basis of what we understand to be their primary purpose. *Our recommendations. Free means the tool is free for academic use; $ means there is a cost. OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. 2D, two-dimensional; 3D, three-dimensional.

If expression levels from more than one condition are being studied, some tools (for example, VisANT and VistaClara) allow the scientist to visualize them sequentially, by updating node colors to reflect the expression levels of a selected condition. Some tools switch automatically to depiction of the next condition after a predefined time interval, which leads to an animation-like visualization that is well suited to interpreting data from a time series. An alternative strategy to viewing the data from different conditions in series is to view them in parallel, by arranging multiple versions of the same network in a grid, where each version represents the expression levels (visualized as node color) for one condition or time point. This approach is known as ‘small multiples’39 and allows the scientist to visually compare expression levels between conditions, which is not well supported by animation. A well-designed implementation of small multiples is available in Cerebral26 (Fig. 2a).

Besides animation and small multiples, a third approach is to show the complete expression profile within the nodes of a network. The most common representation of this type is based on a miniature heat map embedded in each node (Fig. 2b) and is available in several tools, including GenMAPP 2 (ref. 40), GeneSpring GX (Agilent Technologies) and ProMeTra41. The Lichen package (http://tinyurl.com/Lichen1/) uses a circular heat map to depict this information, which has the advantage of being very compact (Fig. 2c). VistaClara42 provides ‘heat strips’, in which the heights of the bars as well as their colors correspond to the expression levels (Fig. 2d). In contrast, in GENeVis, bar heights represent confidence measures, so that reliable measurements are taller and therefore emphasized (Fig. 2e). A less common alternative to showing a heat map embedded in the node is to embed a profile plot in the node. This is supported by, for instance, VisANT (Fig. 2f) and has the advantage that multiple profiles can be displayed in the same node, for example when a meta-node represents a set of genes. In VANTED, each node of the network has embedded visualizations of exceptionally high detail, showing legends, grid lines, bar charts or error bars (Fig. 2g). Although powerful, this representation requires the nodes to be rather large in order to show the details of the embedded visualizations, which effectively limits its application to only small pathways and networks.

For expression profiles with many conditions, visualizing all these data directly in the network is invariably problematic because of a lack of space, and an approach that links visualization of the network to a separate visualization of the expression profiles is more appropriate. In the linked approach, a heat map (as implemented in VistaClara) or a profile plot (as implemented in Cerebral; Fig. 2a) is shown next to the network and when the scientist selects nodes in the network, the corresponding expression profiles are highlighted in the linked heat map or profile plot, or vice versa. This approach allows the scientist to check, for instance, whether the members of a putative protein complex in the network visualization are coexpressed, by comparing the corresponding gene expression profiles in the linked heat map. Conversely, selection of a set of coexpressed genes in a clustered heat map would allow exploration of their role in the linked protein interaction network: the scientist could directly see whether these genes are part of the same complex, what their interaction


BOX 2 KEY VISUALIZATION METHODS FOR MULTIVARIATE DATA

Multivariate data, for instance from gene expression studies, are very common in systems biology, and many tools have been developed to analyze and visualize such data (Table 3). The three most commonly used visualization methods are scatter plots, profile plots and heat maps.

Scatter plots. Scatter plots (Fig. 3a) are primarily used to examine dependencies between two variables, but in combination with dimensionality reduction methods, they can also be applied to multivariate data. For instance, to gain insight into the global patterns in a gene expression matrix, a dimensionality reduction method may be applied to obtain a two-dimensional (sometimes three-dimensional) representation of the expression profiles, which are then visualized in a scatter plot to reveal clusters and outliers in the data. Some frequently applied dimensionality reduction methods for this purpose are principal component analysis132 (PCA) and multi-dimensional scaling133 (MDS), which are implemented in many tools. Besides PCA and MDS, many other suitable dimensionality reduction methods exist134, but they are often not easily accessible to the casual user.

Scatter plots combined with dimensionality reduction methods are an excellent tool for gaining insight into the overall structure of large sets of expression profiles. However, because of the dimensionality reduction itself, it is not possible to extract information about the relationship between expression levels and the conditions under study.

Profile plots. Profile plots (Fig. 3b), also known as parallel coordinate plots135, visualize the expression levels of a large number of transcripts across all samples. Thus, they provide insight into the patterns of correlation between samples and expression levels. For instance, at a glance, the scientist can determine whether a transcript is expressed constitutively in all conditions or whether it is only expressed in a single condition, such as a particular tissue or phase of the cell cycle. Furthermore, it is possible to generate hypotheses about trends, such as increasing expression levels for a transcript over time after a stimulus, or differential expression of a transcript, for instance between samples of diseased and normal tissue. Because many profiles are shown in the same plot, the scientist can interpret such observations in the context of the overall data set. A profile plot can also be queried visually for transcripts with a particular behavior, such as low expression in one set of samples and high in another set, or for profiles that are similar to that of a transcript of interest. A substantial disadvantage of profile plots is that, owing to the manner in which they are constructed, profiles overlap, severely limiting the number of profiles that can be visualized effectively at the same time.

Heat maps. Heat maps136,137 (Fig. 3c) are the most commonly used visualization method for expression matrices138 and can be generated using most tools. Like profile plots, heat maps visualize the abundance of each transcript in each sample, but the profiles do not overlap, which means that more profiles can be visualized effectively. However, the size of the heat map grows with the number of profiles, so that the available screen space is often a limiting factor.

A key aspect of heat map visualization is the reordering of the rows, which ensures that similar profiles are placed near each other. Typically this reordering is done using hierarchical clustering137, and a dendrogram showing the hierarchy is usually arranged immediately adjacent to the heat map (Fig. 3). This combined view helps a scientist to see groups of genes that have a similar expression pattern. The dendrogram conveys which genes are clustered together, and also which genes are outliers with an unusual expression pattern. The heat map allows the scientist to see in more detail which features of the expression pattern are shared by gene clusters. For example, genes in a cluster may have a peak expression at about the same time in an experiment.

Figure 3 | Visualization of gene expression profiles. Expression of 320 transcripts from S. cerevisiae, collected over 18 time points throughout the cell cycle80. Colors indicate cluster membership based on a k-means clustering (k = 4). (a) Scatter plot showing a projection of the profiles on the first two principal components obtained by PCA. (b) Profile plot of gene expression across all 18 time points, including k-means cluster information. Genes in the red and blue clusters appear active in the G1 and S phase of the cell cycle, respectively. Phase assignments for yellow and green clusters are unclear. (c) Heat map of the profiles. Colors represent abundance (red, higher than control; blue, lower than control; white, no change). Rows of the heat map have been reordered according to a hierarchical clustering, represented by the dendrogram. The color bars between the dendrogram and heat map indicate the k-means clusters, allowing comparison of the two clustering results. Images made with R (http://www.r-project.org/).
[Figure 3 axes: panel a, Principal Component 1 (horizontal) and Principal Component 2 (vertical); panel b, Time (min), from 0 to 119 in steps of 7.]
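The row reordering described for heat maps can be sketched as follows. This is a minimal illustration, assuming average-linkage agglomerative clustering with a distance of 1 minus the Pearson correlation; the box does not specify which linkage or distance the tools use, and the function names and toy profiles are hypothetical. The naive implementation is only suitable for small examples.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cluster_row_order(profiles):
    """Return row indices in the leaf order of an average-linkage clustering."""
    # Each cluster is a list of row indices, kept in leaf order.
    clusters = [[i] for i in range(len(profiles))]

    def distance(c1, c2):
        # Average of (1 - correlation) over all pairs of rows.
        pairs = [(i, j) for i in c1 for j in c2]
        total = sum(1.0 - pearson(profiles[i], profiles[j]) for i, j in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Find and merge the two closest clusters.
        a, b = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: distance(clusters[pq[0]], clusters[pq[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0]


# Toy expression matrix: rows 0 and 2 rise over time, rows 1 and 3 fall.
profiles = [[1, 2, 3, 4], [4, 3, 2, 1], [2, 4, 6, 8.5], [8, 6, 5, 2]]
order = cluster_row_order(profiles)
print(order)
```

Drawing the heat map rows in this order places similar profiles next to each other; the sequence of merges is what a dendrogram would display.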


partners are, or whether they are located in the same subcellular compartment. In contrast, when expression profiles are shown only in the nodes of the network, this type of analysis is not possible because coexpressed genes are not necessarily located next to each other in the visualization. However, there is a trade-off between the flexibility provided by linked views and the convenience of being able to see expression profiles and interactions without having to consult two separate visualizations.

Network clustering and correlation networks. Recently, there has been increased interest in a new kind of clustering method, called ‘network clustering’37, that is less susceptible to noise and can lead to more accurate identification of functionally related genes than established clustering methods (Box 2). Network clustering of gene expression data is done using so-called ‘correlation networks’, in which each gene is a node and each edge indicates coexpression of two genes under the conditions of the experiment43. As well as being an improved way to calculate clusters, correlation networks allow the scientist to interactively explore gene expression data sets using many of the rich set of network visualization tools that have been developed for visualizing protein interaction networks. The use of correlation networks for gene expression data is as yet supported by relatively few tools, including BioLayout Express 3D37, which has been developed specifically for this purpose, and Cytoscape, using either the MCODE18 or ClusterMaker plug-in (http://www.cgl.ucsf.edu/cytoscape/). However, we anticipate that correlation networks may become one of the established methods for interpreting gene expression data sets.

Metabolic profile data

A wide variety of spectroscopic methods are being used for high-throughput studies of small-molecule metabolites44, two of the most popular being mass spectrometry and nuclear magnetic resonance spectroscopy (Box 1 and Supplementary Table 1). Typically, present methods identify hundreds of metabolites per experiment. Additionally, many as-yet-unidentified compounds can be reproducibly detected. These experimental data are collected in several public repositories, the largest of which is now SetupX45, containing ~20,000 samples from more than 300 studies.

The general goal in analyzing metabolite profiles is to gain detailed insight into the molecular mechanisms of cellular metabolic pathways. The identification of molecules that may be used as reliable biomarkers of disease is also of great interest. Metabolite profiles are typically analyzed to find sets of metabolites with similar profiles or to measure the impact of genetic modifications, drugs and other biotic or abiotic factors on the metabolome of an organism.

As with gene and protein profiles, visualization is key in these analyses, and the same, or very similar, methods (Box 2) and tools (Table 3) are typically used. As for gene expression data, one of the key visualization methods in metabolomics involves the enrichment of metabolic pathways with visualizations of metabolite concentrations (Fig. 4a), and often the same visual representations as for gene expression data can be used (Fig. 2b–g). Visualizing such enriched metabolic pathways can be very useful in understanding the concerted changes of metabolite pools within the cell. In addition, enriched pathways can help to identify metabolites that should be present according to enzymatic reactions contained in the metabolic pathway but that have not been detected in the measurements. If such metabolites are identified, further experiments attempting to detect these molecules can be conducted. Many visualization tools (for example, MapMan46, Pathway Tools Omics Viewer47, KaPPA-view48, PathwayExplorer49 and ProMeTra41) have been developed to facilitate enriched views of metabolic pathways, usually with close integration of metabolic pathway databases. These tools overlay metabolite profiles primarily on static images of pathways obtained from sources such as KEGG29 or MetaCyc50-based databases.

Pathway editing

The analysis of new experimental data sets, as outlined above, usually produces new insight into biological processes, which may be used to modify existing pathways or to create new pathways. A wide range of tools that support pathway editing is available (Table 2); the choice of which tool to use depends on the specific requirements of the task at hand.

Figure 4 | Visualization of metabolic pathways and profile data. (a) A part of the glycolysis and citric acid cycle pathway in Corynebacterium glutamicum DM1730 overlaid with changes in metabolite concentrations and gene expression across five time points in relation to a reference time point. The visualization was created in ProMeTra41 (as in Fig. 2b). Nodes shaded gray indicate metabolites for which no concentration data were available. (b) Enlarged iPath/KEGG Atlas image showing the glycolysis pathway in the context of other parts of the metabolic system. Yellow, amino acid metabolism; purple, energy metabolism. The shaded area corresponds to the citric acid cycle shown in a.
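The idea of a correlation network, as described under ‘Network clustering and correlation networks’, can be sketched in a few lines. The sketch assumes Pearson correlation with a fixed threshold of 0.9 and uses connected components as a deliberately simple stand-in for the clustering step; the tools named in the text use more elaborate graph clustering algorithms, and all names and values below are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_network(profiles, threshold=0.9):
    """Return edges (gene_a, gene_b, r) for gene pairs with r >= threshold."""
    genes = sorted(profiles)
    edges = []
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if r >= threshold:
                edges.append((a, b, r))
    return edges


def connected_components(genes, edges):
    """Group genes that are linked, directly or indirectly, by an edge."""
    neighbors = {g: set() for g in genes}
    for a, b, _ in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, components = set(), []
    for gene in sorted(genes):
        if gene in seen:
            continue
        stack, component = [gene], set()
        while stack:  # depth-first traversal from this gene
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbors[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Toy profiles over four conditions (invented values).
profiles = {
    "g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1], "g4": [8, 6, 5, 2],
}
edges = correlation_network(profiles)
clusters = connected_components(profiles, edges)
print(clusters)
```

The threshold controls how dense the network is; in practice it would be chosen by inspecting how the number of edges and clusters changes as it varies.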


For building pathways from scratch or editing existing pathways, tools such as GenMAPP 2, PathVisio, and VANTED are useful, as they are designed to assist the manual task of arranging nodes and edges. To this category also belong CellDesigner51 and JDesigner52, which further support pathway simulations by means of kinetic modeling. The insights gained from these simulations often lead to new hypotheses, which can then be tested in further experiments.

Manual layout of pathways quickly becomes tedious as the size of the pathway grows. Fortunately, a range of automated layout methods have been developed, each addressing specific layout requirements. Typically, these methods will arrange the network to highlight the overall state changes that occur, for example by making sure that all interactions point from left to right, thus creating an overall causal flow from left to right. Automated layout can be particularly useful for updating large networks when new knowledge (nodes or interaction edges) becomes available. For example, PATIKA53 has an automated layout method that shows the causal flow of events through different subcellular compartments. This is particularly useful for depicting signaling networks27. Although these specialized automated layout methods are useful, they are usually of low quality compared to manually laid out pathways created by human experts and often require manual editing in addition; however, judging by recent progress, we expect these methods to continue to improve and to become increasingly useful54.

For very large pathways, it can be important to use compact visual representations and pathway layouts that reduce the amount of detail shown. A very clear example of such a concise visual representation is iPath, which combines 120 KEGG pathways into a single, vast pathway map that provides an overview of all metabolism in an organism (Fig. 4b). Scientists can zoom into parts of the map to navigate to individual pathways.

Future perspectives

Systems biology is still rapidly evolving, which can make it difficult for tool developers to know which visualization tasks are the most important ones. However, as the field matures, the key tasks will likely become clearer, and the requirements and limitations of current visualization methods will become better understood5. This process will also be aided by insights from the emerging field of visual analytics55, which specifically studies the role of visualization in the larger process of understanding and interpreting data. Visual analytics methods have begun to be applied to studying the connection between visualization and analytical reasoning in systems biology5,56.

We anticipate that the near future will bring significant improvements in automated pathway and network layout to better match biologists’ needs54,57. Innovation will continue to give more and better choices for the representations of nodes, edges and overlay information, as well as better ways to convey dynamic properties and to compare networks. Crucially, we expect that usability will improve, partly through improved navigation methods that help users manage large and complex networks23,25,58.

Today, many tools for network and pathway visualization are stand-alone applications (Tables 1 and 2); however, there is a trend toward web-based applications, often coupled tightly to underlying databases. Web-based tools show great promise for facilitating collaboration between scientists at different locations59–61, and several projects have recently been launched that are aimed at community-based collaborative editing of biological network data, notably Payaologue (http://celldesigner.org/payao/) and WikiPathways62.

As experimental methods enable scientists to tackle larger and more complex systems, it is likely that significant innovations will be needed for visualizing future data sets. One possible direction for future network visualization tools would be to move beyond the standard two-dimensional layout, and tool developers are already exploring three-dimensional layouts (for example, BioLayout Express 3D37), combinations of both three-dimensional layouts and time (for example, E-Cell 3D, http://tinyurl.com/ecell3d/), or layouts that mix aspects of two and three dimensions (for example, Arena3D63). In addition, systems biologists may well be among the early adopters of innovations in hardware, such as multi-touch interfaces and larger, high-resolution displays64.

As systems biology has evolved very quickly over the last decade, some of the difficulties faced by end-users today arise not from the intrinsic complexity of data but from a lack of standards. Biological pathways and networks are now distributed in over 300 web resources4, and in a field as interdisciplinary as systems biology, there is an obvious strength in such diversity. However, the field would clearly benefit from a parallel effort toward a consolidated resource, and we would like to add our voices to a call for a consolidated database, similar to the worldwide Protein Data Bank for three-dimensional structures65.

The situation is somewhat better with file formats used to store interaction data, pathways and biochemical models. Although many formats are used, several have emerged as de facto standards for the exchange of pathway and network data: for example, PSI-MI66 for protein interaction data, BioPAX (http://www.biopax.org/) for pathways and interaction networks, Systems Biology Markup Language (SBML)67 for models of biochemical reactions and gene regulation, and CellML68 for exchange of a range of different biological models. In regard to graphical notation, there has recently been a significant community-driven proposal (Systems Biology Graphical Notation, SBGN69) toward developing a more unified standard, and several tools already support the creation and visualization of networks using this standard (see Table 2).

Ultimately, systems biology seeks to provide insights into the processes of organelles, cells, organs and even whole organisms. Fulfilling this ambitious goal requires still further development in visualization methods; in particular, better integration with visualization of other kinds of data, such as imaging data70, macromolecular structures71, genomes72, and phylogenies73. Efforts to build such integrated visualization platforms have begun (for example, Visible Cell74), and in fact, many tools that bridge different data types and disciplines are already in place; for instance, there are tools that map transcript abundance (or, if available, protein abundance) onto chromosomal location75 and onto three-dimensional anatomical representations of tissue76,77. However, truly integrated visualization of systems biology data across the entire range of possible data types is still very much in its infancy.

Note: Supplementary information is available on the Nature Methods website.
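To make the role of exchange formats such as SBML more concrete, the following minimal sketch turns the reactions of a small SBML-style document into network edges that a visualization tool could draw. It uses only the Python standard library; the toy model is an invented fragment rather than a validated SBML file, and the function names are hypothetical. Real applications would more likely rely on a dedicated SBML library.

```python
import xml.etree.ElementTree as ET

# Invented toy fragment using SBML element names (not a validated model).
SAMPLE_SBML = """
<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy_model">
    <listOfReactions>
      <reaction id="hexokinase">
        <listOfReactants>
          <speciesReference species="glucose"/>
          <speciesReference species="ATP"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="G6P"/>
          <speciesReference species="ADP"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def reaction_edges(sbml_text):
    """Return (reactant, reaction_id, product) triples for every reaction."""
    root = ET.fromstring(sbml_text)
    edges = []
    for element in root.iter():
        if local_name(element.tag) != "reaction":
            continue
        reactants, products = [], []
        for child in element:
            species = [
                ref.get("species")
                for ref in child
                if local_name(ref.tag) == "speciesReference"
            ]
            if local_name(child.tag) == "listOfReactants":
                reactants = species
            elif local_name(child.tag) == "listOfProducts":
                products = species
        edges.extend((r, element.get("id"), p) for r in reactants for p in products)
    return edges


edges = reaction_edges(SAMPLE_SBML)
print(edges)
```

Matching on local tag names keeps the sketch independent of the SBML level and version declared in the namespace, at the cost of ignoring everything else the format can express (compartments, kinetic laws, modifiers).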


ACKNOWLEDGMENTS
The authors would like to acknowledge S. Kühner for providing the data for Figure 1 and Â. Gonçalves for comments on parts of the manuscript. This work was partly supported by the European Union Framework Programme 6 grant ‘TAMAHUD’ (LSHC-CT-2007-037472).

COMPETING INTERESTS STATEMENT
The authors declare no competing financial interests.

Published online at http://www.nature.com/naturemethods/.
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/.

1. Michal, G. Biochemical Pathways: An Atlas of Biochemistry and Molecular Biology (Wiley, New York, 1998).
2. Nishizuka, Y. The role of protein kinase C in cell surface signal transduction and tumour promotion. Nature 308, 693–698 (1984).
3. Levine, M. & Davidson, E.H. Gene regulatory networks for development. Proc. Natl. Acad. Sci. USA 102, 4936–4942 (2005).
4. Bader, G.D., Cary, M.P. & Sander, C. Pathguide: a pathway resource list. Nucleic Acids Res. 34 (database issue), D504–D506 (2006).
5. Saraiya, P., North, C. & Duca, K. Visualizing biological pathways: requirements analysis, systems evaluation, and research agenda. Inf. Vis. 4, 191–205 (2005).
   This paper represents one of the first attempts to critically evaluate the requirements for visualization software used in biology.
6. Suderman, M. & Hallett, M. Tools for visually exploring biological networks. Bioinformatics 23, 2651–2659 (2007).
7. Pavlopoulos, G.A.G., Wegener, A.L.A. & Schneider, R.R. A survey of visualization tools for biological network analysis. BioData Min. 1, 12 (2008).
8. Charbonnier, S., Gallego, O. & Gavin, A.C. The social network of a cell: recent advances in interactome mapping. Biotechnol. Annu. Rev. 14, 1–28 (2008).
9. Yu, H. et al. High-quality binary protein interaction map of the yeast interactome network. Science 322, 104–110 (2008).
10. Gavin, A.C. et al. Proteome survey reveals modularity of the yeast cell machinery. Nature 440, 631–636 (2006).
11. Mathivanan, S. et al. An evaluation of human protein-protein interaction data in the public domain. BMC Bioinformatics 7 (suppl. 5), S19 (2006).
12. Ma’ayan, A. Network integration and graph analysis in mammalian molecular systems biology. IET Syst. Biol. 2, 206–221 (2008).
13. Salwinski, L. et al. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 32 (database issue), D449–D451 (2004).
14. Prasad, T.S., Kandasamy, K. & Pandey, A. Human Protein Reference Database and Human Proteinpedia as discovery tools for systems biology. Methods Mol. Biol. 577, 67–79 (2009).
15. Aranda, B. et al. The IntAct molecular interaction database in 2010. Nucleic Acids Res. 38 (database issue), D525–D531 (2010).
16. von Mering, C. et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417, 399–403 (2002).
17. Fruchterman, T.M.J. & Reingold, E.M. Graph drawing by force-directed placement. Software Pract. Exper. 21, 1129–1164 (1991).
18. Bader, G.D. & Hogue, C.W. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 2 (2003).
19. Kuhner, S. et al. Proteome organization in a genome-reduced bacterium. Science 326, 1235–1240 (2009).
20. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
   This paper describes the core Cytoscape software, which has become one of the most popular tools to visualize and analyze biological networks. This is partly due to the modular design of the software that allows developers to create plug-ins to address virtually any network analysis problem.
21. Junker, B.H., Klukas, C. & Schreiber, F. VANTED: a system for advanced data analysis and visualization in the context of biological networks. BMC Bioinformatics 7, 109 (2006).
22. Hu, Z. et al. VisANT 3.5: multi-scale network visualization, analysis and inference based on the gene ontology. Nucleic Acids Res. 37 (web server issue), W115–W121 (2009).
23. McGuffin, M.J. & Jurisica, I. Interaction techniques for selecting and manipulating subgraphs in network visualizations. IEEE Trans. Vis. Comput. Graph. 15, 937–944 (2009).
24. Prinz, S. et al. Control of yeast filamentous-form growth by modules in an integrated molecular network. Genome Res. 14, 380–390 (2004).
25. Hu, Z. et al. Towards zoomable multidimensional maps of the cell. Nat. Biotechnol. 25, 547–554 (2007).
26. Barsky, A., Munzner, T., Gardy, J. & Kincaid, R. Cerebral: visualizing multiple experimental conditions on a graph with biological context. IEEE Trans. Vis. Comput. Graph. 14, 1253–1260 (2008).
27. Genc, B. & Dogrusoz, U. A layout algorithm for signaling pathways. Inf. Sci. 176, 135–149 (2006).
28. de Lichtenberg, U., Jensen, L.J., Brunak, S. & Bork, P. Dynamic complex formation during the yeast cell cycle. Science 307, 724–727 (2005).
29. Kanehisa, M. et al. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 36 (database issue), D480–D484 (2008).
30. Matthews, L. et al. Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res. 37 (database issue), D619–D622 (2009).
31. Kuhn, M. et al. STITCH 2: an interaction network database for small molecules and proteins. Nucleic Acids Res. 38 (database issue), D552–D556 (2010).
32. Quackenbush, J. Computational analysis of microarray data. Nat. Rev. Genet. 2, 418–427 (2001).
33. Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009).
34. Bantscheff, M., Schirle, M., Sweetman, G., Rick, J. & Kuster, B. Quantitative mass spectrometry in proteomics: a critical review. Anal. Bioanal. Chem. 389, 1017–1031 (2007).
35. Gstaiger, M. & Aebersold, R. Applying mass spectrometry-based proteomics to genetics, genomics and network biology. Nat. Rev. Genet. 10, 617–627 (2009).
36. Westenberg, M.A., van Hijum, S.A.F.T., Kuipers, O.P. & Roerdink, J.B.T.M. Visualizing genome expression and regulatory network dynamics in genomic and metabolic context. Comput. Graph. Forum 27, 887–894 (2008).
37. Freeman, T.C. et al. Construction, visualisation, and clustering of transcription networks from microarray expression data. PLOS Comput. Biol. 3, e206 (2007).
38. Saraiya, P., Lee, P. & North, C. in IEEE Symp. Information Visualization (InfoVis 2005) 225–232 (2005).
39. Tufte, E.R. The Visual Display of Quantitative Information 2nd edn. (Graphics Press, Cheshire, Connecticut, USA, 2001).
40. Salomonis, N. et al. GenMAPP 2: new features and resources for pathway analysis. BMC Bioinformatics 8, 217 (2007).
41. Neuweger, H. et al. Visualizing post genomics data-sets on customized pathway maps by ProMeTra – aeration-dependent gene expression and metabolism of Corynebacterium glutamicum as an example. BMC Syst. Biol. 3, 82 (2009).
42. Kincaid, R., Kuchinsky, A. & Creech, M. VistaClara: an expression browser plug-in for Cytoscape. Bioinformatics 24, 2112–2114 (2008).
43. Lee, H.K., Hsu, A.K., Sajdak, J., Qin, J. & Pavlidis, P. Coexpression analysis of human genes across many microarray data sets. Genome Res. 14, 1085–1094 (2004).
44. Dunn, W.B. & Ellis, D.I. Metabolomics: current analytical platforms and methodologies. TrAC Trends Anal. Chem. 24, 285–294 (2005).
45. Scholz, M. & Fiehn, O. Setup X – a public study design database for metabolomic projects. Pac. Symp. Biocomput. 169–180, doi:10.1142/9789812772435_0017 (2007).
46. Thimm, O. et al. MAPMAN: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant J. 37, 914–939 (2004).
47. Paley, S.M. & Karp, P.D. The Pathway Tools cellular overview diagram and omics viewer. Nucleic Acids Res. 34, 3771–3778 (2006).
48. Tokimatsu, T. et al. KaPPA-view: a web-based analysis tool for integration of transcript and metabolite data on plant metabolic pathway maps. Plant Physiol. 138, 1289–1300 (2005).
49. Mlecnik, B. et al. PathwayExplorer: web service for visualizing high-throughput expression data on biological pathways. Nucleic Acids Res. 33 (web server issue), W633–W637 (2005).
50. Caspi, R. et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 36 (database issue), D623–D631 (2008).
51. Funahashi, A. et al. CellDesigner 3.5: a versatile modeling tool for biochemical networks. Proc. IEEE 96, 1254–1265 (2008).
52. Sauro, H.M. et al. Next generation simulation tools: the Systems Biology Workbench and BioSPICE integration. OMICS 7, 355–372 (2003).
53. Demir, E. et al. PATIKA: an integrated visual environment for collaborative construction and analysis of cellular pathways. Bioinformatics 18, 996–1003 (2002).
54. Schreiber, F., Dwyer, T., Marriott, K. & Wybrow, M. A generic algorithm for layout of biological networks. BMC Bioinformatics 10, 375 (2009).
55. Thomas, J.J. & Cook, K.A. Illuminating the Path: The Research and Development Agenda for Visual Analytics. (National Visual Analytics Center & IEEE, Richland, Washington, USA, 2005).


56. Saraiya, P., North, C. & Duca, K. An Insight-based methodology for evaluating bioinformatics visualizations. IEEE Trans. Vis. Comput. Graph. 11, 443–456 (2005).
57. Dwyer, T., Koren, Y. & Marriott, K. IPSEP-COLA: an incremental procedure for separation constraint layout of graphs. IEEE Trans. Vis. Comput. Graph. 12, 821–828 (2006).
58. Dwyer, T. et al. Exploration of networks using overview+detail with constraint-based cooperative layout. IEEE Trans. Vis. Comput. Graph. 14, 1293–1300 (2008).
59. Viégas, F.B., Wattenberg, M., van Ham, F., Kriss, J. & McKeon, M. ManyEyes: a site for visualization at internet scale. IEEE Trans. Vis. Comput. Graph. 13, 1121–1128 (2007).
60. Heer, J., Viégas, F.B. & Wattenberg, M. Voyagers and voyeurs: supporting asynchronous collaborative information visualization. in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI’07) 1029–1038 (ACM, New York, 2007).
61. Heer, J. & Agrawala, M. Design considerations for collaborative visual analytics. Inf. Vis. 7, 49–62 (2008).
62. Pico, A.R. et al. WikiPathways: pathway editing for the people. PLoS Biol. 6, e184 (2008).
63. Pavlopoulos, G.A. et al. Arena3D: visualization of biological networks in 3D. BMC Syst. Biol. 2, 104 (2008).
64. Ball, R. & North, C. Realizing embodied interaction for visual analytics through large displays. Comput. Graph. 31, 380–400 (2007).
65. Berman, H., Henrick, K. & Nakamura, H. Announcing the worldwide Protein Data Bank. Nat. Struct. Biol. 10, 980 (2003).
66. Hermjakob, H. et al. The HUPO PSI’s molecular interaction format–a community standard for the representation of protein interaction data. Nat. Biotechnol. 22, 177–183 (2004).
67. Hucka, M. et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19, 524–531 (2003).
68. Lloyd, C.M., Halstead, M.D. & Nielsen, P.F. CellML: its future, present and past. Prog. Biophys. Mol. Biol. 85, 433–450 (2004).
69. Le Novère, N. et al. The Systems Biology Graphical Notation. Nat. Biotechnol. 27, 735–741 (2009).
   This publication marks the first serious attempt to create a community standard for a graphical notation to represent networks in systems biology.
70. Walter, T. et al. Visualization of image data from cells to organisms. Nat. Methods 7, S26–S40 (2010).
71. O’Donoghue, S.I. et al. Visualization of macromolecular structures. Nat. Methods 7, S42–S55 (2010).
72. Nielsen, C.B., Cantor, M., Dubchak, I., Gordon, D. & Wang, T. Visualizing genomes: techniques and challenges. Nat. Methods 7, S5–S15 (2010).
73. Procter, J.B. et al. Visualization of multiple alignments, phylogenies and gene family evolution. Nat. Methods 7, S16–S25 (2010).
74. Burrage, K., Hood, L. & Ragan, M.A. Advanced computing for systems biology. Brief. Bioinform. 7, 390–398 (2006).
75. Awad, I.A., Rees, C.A., Hernandez-Boussard, T., Ball, C.A. & Sherlock, G. Caryoscope: an open source Java application for viewing microarray data in a genomic context. BMC Bioinformatics 5, 151 (2004).
76. Lau, C. et al. Exploration and visualization of gene expression with neuroanatomy in the adult mouse brain. BMC Bioinformatics 9, 153 (2008).
77. Weber, G.H. et al. Visual exploration of three-dimensional gene expression using physical views and linked abstract views. IEEE/ACM Trans. Comput. Biol. Bioinform. 6, 296–309 (2009).
78. Barsky, A., Gardy, J.L., Hancock, R.E. & Munzner, T. Cerebral: a Cytoscape plugin for layout of and interaction with biological networks using subcellular localization annotation. Bioinformatics 23, 1040–1042 (2007).
79. Mookherjee, N. et al. Modulation of the TLR-mediated inflammatory response by the endogenous human host defense peptide LL-37. J. Immunol. 176, 2455–2464 (2006).
80. Spellman, P.T. et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9, 3273–3297 (1998).
81. Kuntzer, J. et al. BNDB – the Biochemical Network Database. BMC Bioinformatics 8, 367 (2007).
82. Baitaluk, M., Sedova, M., Ray, A. & Gupta, A. BiologicalNetworks: visualization and analysis tool for systems biology. Nucleic Acids Res. 34 (web server issue), W466–W471 (2006).
83. Cline, M.S. et al. Integration of biological networks and gene expression data using Cytoscape. Nat. Protoc. 2, 2366–2382 (2007).
84. Hooper, S.D. & Bork, P. Medusa: a simple tool for interaction graph analysis. Bioinformatics 21, 4432–4433 (2005).
85. Kao, H.L. & Gunsalus, K.C. Browsing multidimensional molecular networks with the generic network browser (N-Browse). Curr. Protoc. Bioinformatics 9, 9.11.1–9.11.21 (2008).
86. Brown, K.R. et al. NAViGaTOR: Network Analysis, Visualization and Graphing Toronto. Bioinformatics 25, 3327–3329 (2009).
87. Kohler, J. et al. Graph-based analysis and visualization of experimental results with ONDEX. Bioinformatics 22, 1383–1390 (2006).
88. Breitkreutz, B.J., Stark, C. & Tyers, M. Osprey: a network visualization system. Genome Biol. 4, R22 (2003).
89. Batagelj, V. & Mrvar, A. Pajek – Program for large network analysis. Connections 21, 47–57 (1998).
90. Forman, J.J., Clemons, P.A., Schreiber, S.L. & Haggarty, S.J. SpectralNET–an application for spectral graph analysis and visualization. BMC Bioinformatics 6, 260 (2005).
91. Auber, D. A huge graph visualization framework. in Graph Drawing Software (eds. Mutzel, P. & Jünger, M.) 105–126 (Springer, Heidelberg, Germany, 2004).
92. Zinovyev, A., Viara, E., Calzone, L. & Barillot, E. BiNoM: a Cytoscape plugin for manipulating and analyzing biological networks. Bioinformatics 24, 876–877 (2008).
93. Huttenhower, C., Mehmood, S.O. & Troyanskaya, O.G. Graphle: interactive exploration of large, dense graphs. BMC Bioinformatics 10, 417 (2009).
94. Longabaugh, W.J., Davidson, E.H. & Bolouri, H. Visualization, documentation, analysis, and communication of large-scale gene regulatory networks. Biochim. Biophys. Acta 1789, 363–374 (2009).
95. Streit, M., Lex, A., Kalkusch, M., Zatloukal, K. & Schmalstieg, D. Caleydo: connecting pathways and gene expression. Bioinformatics 25, 2760–2761 (2009).
96. Okuda, S. et al. KEGG atlas mapping for global analysis of metabolic pathways. Nucleic Acids Res. 36 (web server issue), W423–W426 (2008).
97. van Iersel, M.P. et al. Presenting and exploring biological pathways with PathVisio. BMC Bioinformatics 9, 399 (2008).
98. Holford, M., Li, N., Nadkarni, P. & Zhao, H. VitaPad: visualization tools for the analysis of pathway data. Bioinformatics 21, 1596–1602 (2005).
99. Chung, H.J., Kim, M., Park, C.H., Kim, J. & Kim, J.H. ArrayXPath: mapping and visualizing microarray gene-expression data with integrated biological pathway resources using scalable vector graphics. Nucleic Acids Res. 32 (web server issue), W460–W464 (2004).
100. Weniger, M., Engelmann, J.C. & Schultz, J. Genome Expression Pathway Analysis Tool–analysis and visualization of microarray gene expression data under genomic, proteomic and metabolic context. BMC Bioinformatics 8, 179 (2007).
101. Letunic, I., Yamada, T., Kanehisa, M. & Bork, P. iPath: interactive exploration of biochemical pathways and networks. Trends Biochem. Sci. 33, 101–103 (2008).
102. Karp, P.D., Paley, S. & Romero, P. The Pathway Tools software. Bioinformatics 18 (suppl. 1), S225–S232 (2002).
103. Dogrusoz, U. et al. PATIKAweb: a Web interface for analyzing biological pathways through advanced querying and visualization. Bioinformatics 22, 374–375 (2006).
104. Santamaria, R., Theron, R. & Quintales, L. BicOverlapper: a tool for bicluster visualization. Bioinformatics 24, 1212–1213 (2008).
105. Goncalves, J.P., Madeira, S.C. & Oliveira, A.L. BiGGEsTS: integrated environment for biclustering analysis of time series gene expression data. BMC Res. Notes 2, 124 (2009).
106. Shamir, R. et al. EXPANDER–an integrative program suite for microarray data analysis. BMC Bioinformatics 6, 232 (2005).
107. Sturn, A., Quackenbush, J. & Trajanoski, Z. Genesis: cluster analysis of microarray data. Bioinformatics 18, 207–208 (2002).
108. Hibbs, M.A., Dirksen, N.C., Li, K. & Troyanskaya, O.G. Visualization methods for statistical analysis of microarray clusters. BMC Bioinformatics 6, 115 (2005).
109. Seo, J. & Shneiderman, B. Interactively exploring hierarchical clustering results. Computer 35, 80–86 (2002).
110. Saldanha, A.J. Java Treeview–extensible visualization of microarray data. Bioinformatics 20, 3246–3248 (2004).
111. Dietzsch, J., Gehlenborg, N. & Nieselt, K. Mayday – a microarray data analysis workbench. Bioinformatics 22, 1010–1012 (2006).
112. Gehlenborg, N., Dietzsch, J. & Nieselt, K. A framework for visualization of microarray data and integrated meta information. Inf. Vis. 4, 164–175 (2005).
113. Saeed, A.I. et al. TM4: a free, open-source system for microarray data management and analysis. Biotechniques 34, 374–378 (2003).
   One of the first and still one of the most commonly used applications for the management, analysis and visualization of microarray data.
114. Hochheiser, H., Baehrecke, E.H., Mount, S.M. & Shneiderman, B. Dynamic querying for pattern identification in microarray and genomic data. Proc. IEEE Multimedia and Expo Int. Conf. 3, 453–456 (2003).


115. Kapushesky, M. et al. Expression Profiler: next generation–an online platform for analysis of microarray data. Nucleic Acids Res. 32 (web server issue), W465–W470 (2004).
116. Reich, M. et al. GenePattern 2.0. Nat. Genet. 38, 500–501 (2006).
117. Quackenbush, J. Microarray data normalization and transformation. Nat. Genet. 32 (suppl.), 496–501 (2002).
118. Brettschneider, J., Collin, F., Bolstad, B.M. & Speed, T.P. Quality assessment for short oligonucleotide microarray data. Technometrics 50, 241–264 (2008).
119. Kauffmann, A., Gentleman, R. & Huber, W. ArrayQualityMetrics–a bioconductor package for quality assessment of microarray data. Bioinformatics 25, 415–416 (2009).
120. Morgan, M. et al. ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data. Bioinformatics 25, 2607–2608 (2009).
121. Robinson, M.D., McCarthy, D.J. & Smyth, G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
122. Smyth, G.K., Yang, Y.H. & Speed, T. Statistical issues in cDNA microarray data analysis. Methods Mol. Biol. 224, 111–136 (2003).
123. Nesvizhskii, A.I., Vitek, O. & Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods 4, 787–797 (2007).
124. Perkins, D.N., Pappin, D.J., Creasy, D.M. & Cottrell, J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999).
125. Li, X.J. et al. A tool to visualize and evaluate data obtained by liquid chromatography-electrospray ionization-mass spectrometry. Anal. Chem. 76, 3856–3860 (2004).
126. Sturm, M. & Kohlbacher, O. TOPPView: an open-source viewer for mass spectrometry data. J. Proteome Res. 8, 3760–3763 (2009).
127. Kopka, J. Current challenges and developments in GC-MS based metabolite profiling technology. J. Biotechnol. 124, 312–322 (2006).
128. Broeckling, C.D. et al. Metabolic profiling of Medicago truncatula cell cultures reveals the effects of biotic and abiotic elicitors on metabolism. J. Exp. Bot. 56, 323–336 (2005).
129. Beckonert, O. et al. Metabolic profiling, metabolomic and metabonomic procedures for NMR spectroscopy of urine, plasma, serum and tissue extracts. Nat. Protoc. 2, 2692–2703 (2007).
130. Lindon, J.C. & Nicholson, J.K. Spectroscopic and statistical techniques for information recovery in metabonomics and metabolomics. Annu. Rev. Anal. Chem. 1, 45–69 (2008).
131. Xia, J., Bjorndahl, T.C., Tang, P. & Wishart, D.S. MetaboMiner–semi-automated identification of metabolites from 2D NMR spectra of complex biofluids. BMC Bioinformatics 9, 507 (2008).
132. Hotelling, H. Analysis of complex statistical variables into principal components. J. Educ. Psychol. 24, 417–441 (1933).
133. Kruskal, J. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1–26 (1964).
134. Venna, J. & Kaski, S. Comparison of visualization methods for an atlas of gene expression data sets. Inf. Vis. 6, 139–154 (2007).
135. Inselberg, A. The plane with parallel coordinates. Vis. Comput. 1, 69–91 (1985).
136. Eisen, M.B., Spellman, P.T., Brown, P.O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95, 14863–14868 (1998).
   Milestone publication that introduced the heat map visualization to the field of transcriptomics and has been cited several thousand times.
137. Wilkinson, L. & Friendly, M. The history of the cluster heat map. Am. Stat. 63, 179–184 (2009).
138. Weinstein, J.N. Biochemistry. A postgenomic visual icon. Science 319, 1772–1773 (2008).


Visualization of omics data for systems biology Nils Gehlenborg, Seán I O’Donoghue, Nitin S Baliga, Alexander Goesmann, Matthew A Hibbs, Hiroaki Kitano, Oliver Kohlbacher, Heiko Neuweger, Reinhard Schneider, Dan Tenenbaum & Anne-Claude Gavin

Supplementary figures and text:

Supplementary Figure 1 Diagnostic visualizations for normalization of gene expression microarray data

Supplementary Figure 2 Visualization of LC-MS maps

Supplementary Figure 3 Visualization of LC-MS/MS data and quality scores

Supplementary Table 1 Visualization tools for omics measurement data

Nature Methods: doi:10.1038/nmeth.1436

Supplementary Figure 1. Diagnostic visualizations for normalization of gene expression microarray data

(a) MA-plots before (left) and after LOWESS within-array normalization (right) for data from a two-color array. (b) Box and whiskers plots demonstrating the effects of LOWESS within- (center) and quantile between-array (right) normalization. Original data is shown on the left. The data for Array1 is also shown in the MA-plots.
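The quantities plotted in panels (a) and (b) can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used to produce the figure: the function names `ma_values` and `quantile_normalize` are hypothetical, the usual two-color definitions M = log2(R) − log2(G) and A = (log2(R) + log2(G))/2 are assumed, ties in quantile normalization are broken by position for simplicity, and the LOWESS within-array step is not implemented here.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists for paired two-color intensities.

    M is the log ratio log2(R) - log2(G); A is the mean log intensity
    (log2(R) + log2(G)) / 2. All intensities must be positive.
    """
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a


def quantile_normalize(arrays):
    """Quantile-normalize equal-length arrays (assumed: ties broken by position).

    Each value is replaced by the mean, across arrays, of the values that
    share its rank, so every array ends up with the same distribution.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(arr) for arr in arrays]
    # Mean of the k-th smallest value over all arrays, for each rank k.
    rank_means = [sum(arr[k] for arr in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for arr in arrays:
        order = sorted(range(n), key=lambda i: arr[i])  # indices by ascending value
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    m, a = ma_values([8.0, 4.0], [2.0, 4.0])
    print(m, a)
    print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```

Plotting M against A for each array gives the MA-plot; box plots of the arrays before and after `quantile_normalize` correspond to the between-array comparison in panel (b).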


Supplementary Figure 2. Visualization of LC-MS maps

(a) LC-MS data from proteomics displayed as a color-coded 2D plot of the data together with projections of the map onto the retention time and mass-to-charge axes and (b) as a 3D plot. Clearly recognizable are the sets of pairs showing distinct elution profiles (along the retention time axis) and isotope patterns (along the mass-to-charge ratio axis). Visualizations created with TOPPView (ref. 1).
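The axis projections described in panel (a) can be illustrated with a short Python sketch. It assumes that a projection is the summed intensity per bin along each axis; how TOPPView actually computes its projections is not stated here, and the function name `project_map`, the peak tuple layout and the bin widths are all hypothetical.

```python
from collections import defaultdict


def project_map(peaks, rt_bin=1.0, mz_bin=1.0):
    """Project an LC-MS map onto its retention time and m/z axes.

    `peaks` is an iterable of (retention_time, mz, intensity) tuples.
    Returns two dicts mapping a bin index to the summed intensity of all
    peaks falling into that bin along the respective axis.
    """
    rt_projection = defaultdict(float)
    mz_projection = defaultdict(float)
    for rt, mz, intensity in peaks:
        rt_projection[int(rt // rt_bin)] += intensity
        mz_projection[int(mz // mz_bin)] += intensity
    return dict(rt_projection), dict(mz_projection)


if __name__ == "__main__":
    example_peaks = [
        (10.2, 500.3, 100.0),
        (10.7, 500.8, 50.0),
        (11.1, 501.2, 25.0),
    ]
    rt_proj, mz_proj = project_map(example_peaks)
    print(rt_proj)
    print(mz_proj)
```

Read along the retention time axis, such a projection resembles a chromatogram; read along the mass-to-charge axis, it resembles a summed mass spectrum.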


Supplementary Figure 3. Visualization of LC-MS/MS data and quality scores

Peak intensities encoded into a grayscale value and plotted into a retention time t vs m/z coordinate system. The square markers represent MS-MS spectra used for identification purposes. Blue squares represent MS-MS spectra that were not identified as peptides or peptides with a p-value lower than 0.5. All other markers represent peptide identifications and corresponding p-values. Note that the program was set up to ignore all peptide identifications with a p-value below 0.5. This is reflected in the range of the color scale for the p-value.

Image produced with Pep3D (ref. 2).


Supplementary Table 1

Visualization tools for omics measurement data

| Name | Cost | OS | Data | Description | URL |
|---|---|---|---|---|---|
| Stand-alone | | | | | |
| Expression Console (Affymetrix) | Free | Win | A | Low-level analysis of Affymetrix array data; diagnostic plots | http://tinyurl.com/yjktyyb |
| Insilicos Viewer (Insilicos) | Free | Win | M | Data viewer; mass spectrum and chromatogram visualizations | http://tinyurl.com/yfpzcbw |
| GenePix Pro (Molecular Devices) | $ | Win | A | Image acquisition of microarray slides; basic visualizations of raw data | http://tinyurl.com/ygks2gv |
| MetaboMiner (ref. 3) | Free | Win Mac Linux | N | Tool for metabolite identification; visual inspection of matched peaks | http://tinyurl.com/ygwv8st |
| msInspect/Qurate (ref. 4) | Free | Win Mac Linux | M | Quantitative mass spectrometry data; chromatogram and LC-MS map | http://tinyurl.com/yk2xt3r |
| MzMine 2 (ref. 5) | Free | Win Mac Linux | M | Full analysis platform; most standard visualizations; PCA scatter plots | http://tinyurl.com/yzbefqt |
| Pep3D (ref. 2) | Free | Win Mac Linux | M | Mass spectrum plots, LC-MS maps; integration of statistical analyses | http://tinyurl.com/ylgpgkf |
| Prequips (ref. 6) | Free | Win Mac Linux | M | Mass spectrum plots, chromatograms and LC-MS maps | http://tinyurl.com/yf9l5he |
| Proteowizard/SeeMS (ref. 7) | Free | Win | M | Collection of tools; mass spectrum plots and LC-MS maps | http://tinyurl.com/yznqw4a |
| TOPPView* (ref. 1) | Free | Win Mac Linux | M | Mass spectrum plot, chromatogram, 2D and 3D LC-MS maps | http://tinyurl.com/yhx7zrc |
| R/BioConductor Packages | | | | | |
| affy (ref. 8) | Free | Win Mac Linux | A | Exploratory analysis of Affymetrix array data; several diagnostic plots | http://tinyurl.com/2lgo67 |
| affycomp (ref. 9) | Free | Win Mac Linux | A | Comparison of Affymetrix array data, e.g. with MA-plots | http://tinyurl.com/2lgo67 |
| arrayQualityMetrics* (ref. 10) | Free | Win Mac Linux | A | MA-plots, intensity density plots and spatial distribution plots; reports | http://tinyurl.com/2lgo67 |
| edgeR (ref. 11) | Free | Win Mac Linux | S | Estimation and testing for differential expression; diagnostic plots | http://tinyurl.com/2lgo67 |
| limma (ref. 12) | Free | Win Mac Linux | A | Analysis with linear models; measurement level diagnostic plots | http://tinyurl.com/2lgo67 |
| shortRead (ref. 13) | Free | Win Mac Linux | S | Processing and evaluation of short read sequencing data | http://tinyurl.com/2lgo67 |
| Web-based | | | | | |
| MeltDB* (ref. 14) | Free | | M | Extensive analysis platform; mass spectrum plots, heatmaps, pathways | http://tinyurl.com/yj6o5vs |
| MetaboAnalyst (ref. 15) | Free | | M | Extensive analysis platform; PCA scatter plots, heatmaps and more | http://tinyurl.com/yhg4plu |

Some of the tools in this table have capabilities similar to tools that are listed in other tables. To avoid listing tools in more than one table, we assigned each tool to a table based on what we understand to be its primary purpose. Abbreviations: an asterisk (*) means the tool is recommended. Free means the tool is free for academic use. Win refers to Microsoft Windows and Mac refers to Mac OS X; tools running on Linux usually also run on other versions of Unix. A = oligonucleotide microarray data, M = mass spectrometry data, S = deep sequencing data, N = nuclear magnetic resonance data.


References

1. Sturm, M. and Kohlbacher, O. TOPPView: an open-source viewer for mass spectrometry data. J Proteome Res 8 (7), 3760-3763 (2009)
2. Li, X. J. et al. A tool to visualize and evaluate data obtained by liquid chromatography-electrospray ionization-mass spectrometry. Anal Chem 76 (13), 3856-3860 (2004)
3. Xia, J., Bjorndahl, T. C., Tang, P., and Wishart, D. S. MetaboMiner--semi-automated identification of metabolites from 2D NMR spectra of complex biofluids. BMC Bioinformatics 9, 507 (2008)
4. May, D., Law, W., Fitzgibbon, M., Fang, Q., and McIntosh, M. Software platform for rapidly creating computational tools for mass spectrometry-based proteomics. J Proteome Res 8 (6), 3212-3217 (2009)
5. Katajamaa, M., Miettinen, J., and Oresic, M. MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics 22 (5), 634-636 (2006)
6. Gehlenborg, N. et al. Prequips - an extensible software platform for integration, visualization and analysis of LC-MS/MS proteomics data. Bioinformatics 25, 682-683 (2009)
7. Kessner, D., Chambers, M., Burke, R., Agus, D., and Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 24 (21), 2534-2536 (2008)
8. Gautier, L., Cope, L., Bolstad, B. M., and Irizarry, R. A. affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20 (3), 307-315 (2004)
9. Irizarry, R. A., Wu, Z., and Jaffee, H. A. Comparison of Affymetrix GeneChip expression measures. Bioinformatics 22 (7), 789-794 (2006)
10. Kauffmann, A., Gentleman, R., and Huber, W. arrayQualityMetrics--a bioconductor package for quality assessment of microarray data. Bioinformatics 25 (3), 415-416 (2009)
11. Robinson, M. D., McCarthy, D. J., and Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26 (1), 139-140 (2010)
12. Smyth, G. K., Yang, Y. H., and Speed, T. Statistical issues in cDNA microarray data analysis. Methods Mol Biol 224, 111-136 (2003)
13. Morgan, M. et al. ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data. Bioinformatics 25 (19), 2607-2608 (2009)
14. Neuweger, H. et al. MeltDB: a software platform for the analysis and integration of metabolomics experiment data. Bioinformatics 24 (23), 2726-2732 (2008)
15. Xia, J., Psychogios, N., Young, N., and Wishart, D. S. MetaboAnalyst: a web server for metabolomic data analysis and interpretation. Nucleic Acids Res 37 (Web Server issue), W652-660 (2009)
