Empirical Comparison of Visualization Tools for Larger-Scale Network Analysis
Hindawi
Advances in Bioinformatics
Volume 2017, Article ID 1278932, 8 pages
https://doi.org/10.1155/2017/1278932

Review Article

Georgios A. Pavlopoulos,1 David Paez-Espino,1 Nikos C. Kyrpides,1 and Ioannis Iliopoulos2

1 Department of Energy, Joint Genome Institute, Lawrence Berkeley Labs, 2800 Mitchell Drive, Walnut Creek, CA 94598, USA
2 Division of Basic Sciences, University of Crete Medical School, Andrea Kalokerinou Street, Heraklion, Greece

Correspondence should be addressed to Georgios A. Pavlopoulos; [email protected] and Ioannis Iliopoulos; [email protected]

Received 22 February 2017; Revised 14 May 2017; Accepted 4 June 2017; Published 18 July 2017

Academic Editor: Klaus Jung

Copyright © 2017 Georgios A. Pavlopoulos et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Gene expression, signal transduction, protein/chemical interactions, biomedical literature cooccurrences, and other concepts are often captured in biological network representations where nodes represent a certain bioentity and edges the connections between them. While many tools to manipulate, visualize, and interactively explore such networks already exist, only a few of them can scale up and follow today's indisputable information growth. In this review, we briefly list a catalog of available network visualization tools and, from a user-experience point of view, we identify four candidate tools suitable for larger-scale network analysis, visualization, and exploration. We comment on their strengths and their weaknesses and empirically discuss their scalability, user friendliness, and postvisualization capabilities.

1. Background

Health and natural sciences have become protagonists in the big-data world as high-throughput advances continuously contribute to the exponential growth of data volumes. Nowadays, biological repositories expand every day by hosting various entities such as proteins, genes, drugs, chemicals, ontologies, functions, articles, and the interactions between them, often leading to large-scale networks of thousands or even millions of nodes and connections. As such networks are characterized by different properties and topologies, graph theory comes to play a very important role by providing ways to efficiently store, analyze, and subsequently visualize them [1–5].

Visualization and exploration of biological networks at such scale are a computationally challenging task and many efforts in this direction have failed over the years. Recent review articles [3, 4, 6] discuss the challenges in the biological data visualization field and list a catalog of standalone and web-based visualization tools as well as the visual concepts they are implemented to serve. While these resources are valuable for capturing the big picture in the field, getting a sense of the available tools, and spotting the strengths and weaknesses of a tool of interest at a glance, they provide no obvious empirical feedback on the tools' scalability.

To briefly mention representative tools in the field, 2D standalone applications like graphVizdb [7], Ondex [8], Proviz [9], VizANT [10], GUESS [11], UCINET [12], MAPMAN [13], PATIKA [14], Medusa [15], or Osprey [16] as well as 3D visualization tools such as Arena3D [17, 18] and BioLayout Express [19] already exist. Each of them is designed to serve a different purpose. For example, Ondex is implemented to gather and manage data from diverse and heterogeneous datasets, Proviz is dedicated to handling protein-protein interaction datasets, VizANT focuses on metabolic networks and ecosystems, Medusa is able to show semantic networks and multiedged connections, GUESS supports dynamic and time-sensitive data, Osprey is implemented to annotate biological networks, Arena3D targets multilayered graphs, and BioLayout Express is designed for generic advanced 3D network visualizations.

Despite the fact that such tools are widely used and have great potential for further development, in our experience, they are not recommended for large-scale network analysis in their current versions. The UCINET Windows application could potentially be used for just visualization purposes. Its absolute maximum network size is about 2 million nodes but, in practice, most of its procedures are too slow to run on networks larger than about 5,000 nodes.

Among several existing tools that we tested, we find the Cytoscape (v3.5.1) [20], Tulip (v4.10.0) [21], Gephi (v0.9.1) [22], and Pajek (v5.01) [23, 24] standalone applications to be the top four candidates for visualization, manipulation, exploration, and analysis of very big networks. For these four tools, we empirically evaluate their pros and cons, we comment on their scalability, user friendliness, layout speed, offered analyses, profiling, memory efficiency, and visual styles, and we provide tips and advice on which of their features can scale and which are better avoided.

In order to show a representative visualization generated by these four tools, we constructed a graph consisting of 202,424 nodes and 354,468 edges showing the habitat distribution of 202,417 protein families across 7 habitats. Data was collected from the IMG integrated genome and metagenome comparative data analysis system [25], and the protein families originate from public metagenomes only. A step-by-step protocol describing how these images were generated is presented as Supplementary Material, available online at https://doi.org/10.1155/2017/1278932. Comments on problems which occurred during our analysis as well as drawbacks and strengths of the visualization tools used for the purposes of this review are extensively discussed.

2. Top Four Candidates for Large-Scale Network Visualization

2.1. Gephi (Version 0.9.1). Gephi is free, open-source, leading visualization and exploration software for all kinds of networks and runs on Windows, Mac OS X, and Linux. It is our top preference as it is highly interactive and users can easily edit the node/edge shapes and colors to reveal hidden patterns. The aim of the tool is to assist users in pattern discovery and hypothesis making through efficient dynamic filtering and iterative visualization routines. As a generic tool, it is applicable to exploratory data analysis, link analysis, social network analysis, biological network analysis, and poster creation.

2.1.1. Scalability. Gephi comes with a very fast rendering engine and sophisticated data structures for object handling, thus making it one of the most suitable tools for large-scale network visualization. It offers highly appealing visualizations and, on a typical computer, it can easily render networks up to

CPU and memory greedy by requiring long running time to be completed. While Gephi comes with a great variety of layout algorithms, the OpenOrd [26] and Yifan-Hu [27] force-directed algorithms are the ones most recommended for large-scale network visualization. OpenOrd, for example, can scale up to over a million nodes in less than half an hour, while Yifan-Hu is an ideal option to apply after the OpenOrd layout. Notably, the Yifan-Hu layout can give aesthetically comparable views to the ones produced by the widely used but conservative and time-consuming Fruchterman and Reingold [28]. Other algorithms offered by Gephi are the circular, contraction, dual circle, random, MDS, Geo, Isometric, GraphViz, and Force Atlas layouts. While most of them can run in an affordable running time, the combination of OpenOrd and Yifan-Hu seems to give the most appealing visualizations. A decent visualization is also offered by the OpenOrd layout algorithm if a user stops the process when ∼50–60% of the progress has been completed. Of course, efficient parametrization of any chosen layout algorithm will affect both the running time and the visual result.

2.1.3. Postvisualization Analysis. Edge-bundling and well-known clustering algorithms such as MCL [29] do not come by default with Gephi but can be downloaded from Gephi's plugin library (∼100 plugins). In addition, Gephi's GeoLayout plugin is very suitable for plotting a network with geographical information. Coming to dynamic network visualization, Gephi is at the forefront of innovation with dynamic graph analysis. Users can visualize how a network evolves over time by manipulating its embedded timeline. While visualization of a network over time is very useful, its current algorithms are not suitable for large-scale networks. Similarly, for large-scale networks it is highly recommended that users apply clustering algorithms using external command line applications and then import the clustering results into a visualization tool.

To study a network's topology, Gephi comes with a very basic but high-quality network profiler showing basic statistics about the network such as the number of nodes, the number of edges, its density, its clustering coefficient, and other metrics. Calculating node and edge attributes such as node connectivity, clustering coefficient, betweenness centrality, or edge weight is similarly a trivial task and does not take long.

2.1.4. Editing. Gephi is highly interactive and provides clever shortcuts to highlight communities, and shortest paths or relative distances of any node to a node of interest are offered. Moreover, users can easily adjust or interactively filter the shapes and colors of the network's edges and nodes according