A PRACTICAL GUIDE TO DRAWING AND COMPUTING WITH COMPLEX NETWORKS

ROSS M. RICHARDSON

Abstract. These notes are an attempt to document various resources which I have compiled in the course of my work. They are available to anyone who is interested, and I welcome comments and suggestions. They are very much a work in progress, and you are advised to check the last change date. I offer no warranty, implicit or explicit, and I make no claim as to the relevance of this information to your own computer system. Last changed: September 28, 2006

Contents
1. Prerequisites
2. Graph Formats
2.1. A Rogues Gallery
2.2. Conversion
3. Graph Computing
3.1. NetworkX
3.2. Boost Graph Library
4. Graph Drawing
4.1. Algorithms
4.2. Presentation
4.3. Worked Examples
4.4. A Sample Drawing Code
5. Degree Distributions
5.1. Visualization
5.2. Powerlaw Exponent
5.3. An example
6. A Sample Project
Acknowledgments
7. Appendix: Datasources.
References

The reader who enjoys presentations might enjoy a talk I gave on many of these topics. The relevant PDF file can be obtained at http://www.math.ucsd.edu/ ∼rmrichar/talks/graph drawing talk. (warning: 5MB).

1. Prerequisites

This guide assumes the reader is sufficiently familiar with computers and computer programming to comprehend the code and procedures contained herein. The author does not intend in any way for this guide to serve as a method of instruction for learning these skills. However, for the reader already familiar with computer programming, we do hope to provide enough examples that the reader feels comfortable tinkering on their own. The documentation also assumes that the reader has access to the tools on her own system; this guide makes no effort to explain their installation. For those members of Fan Chung’s research group, I will try to the best of my abilities to make sure this guide corresponds to currently installed software on math107. This guide includes code in Python, C++, and Perl.

The mathematical content in this guide is minimal, and should not be distracting to anyone for whom these notes might be of interest. That said, someone not familiar with the basic terminology of graph theory would do well to have a reference on hand. I suggest [14].

2. Graph Formats

For almost every graph tool out there, there is some sort of graph file format. Sadly, few apply generally to a large swath of graph drawing contexts. When discussing specific tools that require proprietary formats, we shall discuss the relevant formats. Here we just present a rogues gallery to help you quickly identify those files you come across in the wild. We also discuss some strategies for converting between the formats, which is often the most time-consuming task in any computing project.

2.1. A Rogues Gallery.

2.1.1. GraphXML. This is a newer format, based on XML (eXtensible Markup Language). It should not be confused with the custom XML format used in Lincoln Lu’s graph tools. We don’t currently have tools to use this format, but it is easy to recognize if you come across it. The basic syntax is simple; here is a sample

Figure 1. A GraphXML file.

For further reference, see [1].

2.1.2. Lincoln’s XML Format. This is an evolving format that I don’t feel very capable of documenting. Questions should be directed to Lincoln [2] or me if you have some reason to use this format. Files which begin: ... are probably in Lincoln’s format.

2.1.3. Large Graph Layout. This is the input format to the Large Graph Layout drawing engine. Proper documentation can be found at the Large Graph Layout web site, found in the references [3]. LGL actually accepts a number of different file formats. The first of these, the .ncol file format, is given as a simple two column file delimited by whitespace. Thus, to place edges between Paul and Endre and between Endre and Laszlo, a file would contain the lines:

    Paul Endre 3.2
    Endre Laszlo 4.5

Note here the optional edge weight following the two endpoints. An .lgl file is somewhat different. It lists vertices first, followed by neighbors. Thus, the same relations would be represented as follows:

    # Endre
    Paul 3.2
    Laszlo 4.5

There are a few caveats to this file format. For use, please see the section on Large Graph Layout, or read the documentation found in the references.
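Since both LGL formats are line oriented, reading them by hand is not hard. The following is a minimal Python sketch, not code from the LGL distribution: the function names parse_ncol and parse_lgl are my own, and the sketch ignores the caveats mentioned above (it performs no validation beyond skipping blank lines).

```python
def parse_ncol(lines):
    """Parse .ncol lines of the form "u v [weight]" into (u, v, weight) triples."""
    edges = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        weight = float(fields[2]) if len(fields) > 2 else None
        edges.append((fields[0], fields[1], weight))
    return edges


def parse_lgl(lines):
    """Parse .lgl lines: "# vertex" opens a block of "neighbor [weight]" lines."""
    edges = []
    current = None  # the vertex whose neighbor list we are reading
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            current = stripped[1:].strip()
        elif current is not None:
            fields = stripped.split()
            weight = float(fields[1]) if len(fields) > 1 else None
            edges.append((current, fields[0], weight))
    return edges


# The two sample files from the text, given as lists of lines.
ncol_edges = parse_ncol(["Paul Endre 3.2", "Endre Laszlo 4.5"])
lgl_edges = parse_lgl(["# Endre", "Paul 3.2", "Laszlo 4.5"])
```

Both calls describe the same two weighted edges; only the orientation of the Paul/Endre pair differs, which is immaterial for an undirected graph.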

2.1.4. Walrus. This is a strange one. If you see something akin to figure 2 you should back away very slowly. The file format is quite complicated. I refer the interested reader to the project site [4]. Good luck¹.

    Graph
    {
      ### metadata ###
      @name="IMDB1";
      @description=;
      @numNodes=2798;
      @numLinks=11135;
      @numPaths=0;
      @numPathLinks=0;
      ### structural data ###
      @links=[
        { 712; 0; },
        { 0; 735; },
        { 0; 2499; },
        { 0; 2744; },
        { 1; 2; },
        { 1; 942; },
        ...

Figure 2. Beware the Walrus.

    graph SD {
      OceanBeach -- PacificBeach [pos="1.0, 2.0"];
      PacificBeach -- LaJolla [pos="1.0, -3.0"];
      LaJolla -- ScrippsRanch [pos="-2.5, 1.2"];
      Hillcrest -- OceanBeach [pos="0.3,4.0"];
      MissionBeach -- OceanBeach [pos="2.7,4.2"];
      NationalCity;
    }

Figure 3. A simple DOT file.

2.1.5. Matlab. A .mat file is not human-readable. Say you have a file labeled graph.mat. In Matlab™² (or Octave), issue the following:

    >> load graph.mat
    >> whos
      Name      Size      Bytes  Class

      X         3x3          72  double array

    Grand total is 9 elements using 72 bytes

    >>

Here we see that graph.mat contained a 3 by 3 array labeled X. Matlab is not explicitly a graph format, but very often graphs are stored as adjacency matrices or lists. The use of Matlab in this context is well outside the scope of the present section.
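Once an adjacency matrix has been exported from Matlab, turning it into an edge list is usually the first step toward any of the other formats. Here is a small Python sketch of that step; the function name adjacency_to_edges is hypothetical, and the 3 by 3 matrix X below is invented for illustration (the contents of graph.mat are not shown above). The sketch assumes a symmetric matrix describing an undirected graph without loops.

```python
def adjacency_to_edges(matrix):
    """Return the undirected edges (i, j), i < j, of a symmetric adjacency matrix."""
    n = len(matrix)
    edges = []
    for i in range(n):
        # Only look above the diagonal, so each undirected edge appears once.
        for j in range(i + 1, n):
            if matrix[i][j]:
                edges.append((i, j))
    return edges


# A made-up 3x3 adjacency matrix: vertex 0 is adjacent to vertices 1 and 2.
X = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
edges = adjacency_to_edges(X)
```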

2.1.6. DOT. The DOT format, which comes from the AT&T Graphviz collection, is by now the default choice for graph storage and manipulation. This is due to two factors: the widespread use of AT&T’s Graphviz tools, and the generality and extensibility of the format itself. The format itself is easily recognized. We present an example in figure 3.

Both undirected and directed graphs are supported. The format allows for arbitrary attributes to be associated at the node, edge, and graph level, though there is a set of attributes which are standardized for use with the Graphviz tools.

I strongly urge all users who are looking for a format to store their data in to consider DOT. The advantages are many: it is human readable, standardized, popular, and highly extensible (only GraphXML is more extensible). The primary disadvantage, which is shared by all non-trivial formats, is that a full parser is required to read and manipulate DOT files. Luckily, there are a number of pre-written parsers, including the newly available pygraphviz parser (an add-on to the NetworkX package). Full documentation for the DOT format is available at the Graphviz project site [5].

¹ I do have some code which allows me to convert into this format from Lincoln’s format. I will release said code if asked.
² A registered trademark of The MathWorks.
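While reading DOT requires a parser, writing simple DOT files does not. The following Python sketch emits an undirected DOT graph from an edge list with optional attributes. The names to_dot and format_attrs are my own and belong to no library, and the sketch assumes node names are plain alphanumeric identifiers (it does no quoting or escaping of names).

```python
def format_attrs(attrs):
    """Render a dict of attributes as a DOT attribute list, e.g. [weight="2"]."""
    if not attrs:
        return ""
    inside = ", ".join('%s="%s"' % (key, attrs[key]) for key in sorted(attrs))
    return " [" + inside + "]"


def to_dot(name, edges, nodes=()):
    """Return DOT text for an undirected graph.

    edges is a list of (u, v, attrs) triples; nodes lists any extra
    (for instance isolated) vertices that appear in no edge.
    """
    lines = ["graph %s {" % name]
    for node in nodes:
        lines.append("  %s;" % node)
    for u, v, attrs in edges:
        lines.append("  %s -- %s%s;" % (u, v, format_attrs(attrs)))
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(
    "SD",
    [("OceanBeach", "PacificBeach", {"weight": "2"}),
     ("PacificBeach", "LaJolla", {})],
    nodes=["NationalCity"],
)
```

Writing dot_text to a file gives something the Graphviz tools should accept; anything fancier (quoted names, subgraphs, directed edges) is better left to an existing library.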

2.2. Conversion. TBD

3. Graph Computing

Graph computing, in the context of this section, refers to an integrated system or library for manipulating graphs. There is a large collection of such systems; indeed, the digital representation of a graph is an all-too-common project for beginning computer science undergraduates. For our purposes, we focus only on those systems which:
(1) are very general,
(2) contain a large number of primitive algorithms, and
(3) are well documented and actively supported.
In my assessment, there are two systems which meet these requirements. One, the Boost Graph Library, is a C++/Python library which has been around for a few years. This library seeks to be a generic set of data structures and algorithms suitable for constructing robust graph algorithms. The other, NetworkX, is a pure Python package with a host of tools which reflect recent trends in complex network research, and which puts an emphasis on interactivity.

In what follows, we give a general overview of the two packages, and assess their suitability to various computing tasks. We go over some of the basics of their operation, and provide a sample application of each.

3.1. NetworkX. According to their project description:

    “NetworkX (NX) is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.”

More precisely, NetworkX is a collection of complex network tools (many already in existence) which are collected in one place and given a more or less common Python interface. This is not unlike the SAGE project [6] in computational number theory. The project itself is due to Aric Hagberg at LANL, and it is currently in very active construction.

3.1.1. Strengths and Weaknesses. There are a number of features to recommend the NetworkX package. NetworkX is accessed through a simple Python interface, which is useful for a number of reasons. One is the ease of interaction; the cost of experimenting with examples is fairly low. Another is the short development cycle. Indeed, because Python is a dynamically typed language and has no compile or link phase, code can be designed and modified on a much shorter time-scale than C/C++.

NetworkX is also useful because it serves as an interface to a number of standard tools. Indeed, the Graphviz tools are available through NetworkX, as is the matplotlib drawing and graphing library. The package also allows for interaction with the scientific computing package NumPy, allowing for spectral investigations, for example.

One other benefit of NetworkX is the pragmatic balance between commonly used features and very generalized algorithms. Many general graph libraries lack primitives for such common operations as finding vertex neighborhoods or computing connected components; NetworkX has easily learned functions which compute both. Indeed, most natural graph manipulations which make sense from a theory point of view (taking subgraphs, computing diameters, etc.) are available as atomic operations to the user. Moreover, NetworkX provides a number of generators to create a whole host of common graph families, both deterministic and random (hypercubes, the Petersen graph, and random regular graphs, for example). On the other hand, the user has complete control over their graph objects, and NetworkX includes a number of fundamental algorithms from which to construct one’s own code (a general Dijkstra implementation, for example).

There are a few negatives to NetworkX, which make it inappropriate for a number of applications. Most notably, NetworkX is based almost entirely in Python. While this makes the code accessible, it is also correspondingly slow. This makes computing with large graphs unrealistic.

As mentioned, NetworkX is also pragmatically designed to be quick to learn. The price one pays for this is that not all the code is separable into small, efficient pieces. As a result, NetworkX is inappropriate for designing major applications.

A final caveat is that NetworkX is under active development, and is subject to both a changing interface and occasional bugs. In particular, this makes any code which is heavily reliant on the NetworkX interface liable to break with every upgrade.

In summary, NetworkX is well suited to experimentation and to rapidly testing out ideas. Because it integrates many tools, it is the preferred platform for day-to-day computing with graphs. However, given the speed and inflexibility of design, this package is not appropriate for building efficient applications, or anything of large complexity. Roughly speaking, anything that takes more than a few days to code is probably inappropriate for NetworkX.

3.1.2. Basic Commands. To begin with NetworkX, we first open Python.

    [triton@math107 ~]$ python
    Python 2.3.3 (#1, May 7 2004, 10:31:40)
    [GCC 3.3.3 20040412 (Red Hat 3.3.3-7)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from networkx import *
    >>> G = Graph()
    >>> G.add_node(0)
    >>> G.add_edge((1,2))
    >>> G.add_edge((2,3))
    >>> G.add_path([2,4,5])
    >>> G.nodes()
    [0, 1, 2, 3, 4, 5]
    >>> G.edges()
    [(1, 2), (2, 3), (2, 4), (4, 5)]
    >>> shortest_path(G,1,5)
    [1, 2, 4, 5]
    >>> connected_components(subgraph(G,[1,2,5]))
    [[1, 2], [5]]

Most commands in the NetworkX package are available by invoking the magic line from networkx import *³. Alternatively, one could issue the command import networkx as NX. In this case, all NetworkX commands would need to be invoked with a preceding NX., as in G = NX.Graph(). Full documentation, as well as a better quickstart guide, can be found at the networkx web site [7].

³ This is a standard method for importing a module in Python.

3.1.3. Recipes.

Reading from a DOT file

Perhaps one of the most frequently performed tasks. Say we have a well-formed DOT file labeled sd.dot.

    >>> from pygraphviz import *
    >>> f = open("sd.dot", "r")
    >>> G = Agraph()
    >>> G.read(f)
    >>> from networkx import *
    >>> H = networkx_from_pygraphviz(G)

Here, we use the separate library pygraphviz for reading in DOT files. Note that pygraphviz imports files into an Agraph object; this is meant to replicate the functionality of a similar library in the Graphviz package. In particular, an Agraph structure maintains all the attribute data found in DOT files. NOTE: NetworkX graphs do not have attribute data, and networkx_from_pygraphviz does not preserve this data.

It is also worth paying attention to how pygraphviz deals with files; the G.read() method expects a file handle, which is returned here from the system standard open() command. Don’t worry about closing files (i.e. f.close()), since Python will do this for you at exit.

Getting and setting attributes in an Agraph() is easily done.

    >>> from pygraphviz import *
    >>> G = Agraph()
    >>> G.add_node('erdos')
    >>> G.add_node('turan')
    >>> G.set_node_attr(['erdos', 'turan'], pos='0,0')
    >>> nodea = G.get_node('erdos')
    >>> nodea.set_attr(pos='1,2')
    >>> for n in G.nodes():
    ...     print n, n.get_attr('pos')
    ...
    erdos 1,2
    turan 0,0
    >>> G.write(open('foo.dot','w'))

Note that attributes can be accessed one or many at a time. One should be careful about when to pass strings; the above example is paradigmatic in this regard. Finally, note that we include the syntax for writing a DOT file. In particular, one should observe that the file was opened with a 'w' option, indicating that we are opening the file for writing.

There is, alas, little documentation as to how to use the Agraph class (though the above examples illustrate much of its use). One should see the file pygraphviz.py for all available features. This is found by issuing a locate pygraphviz.py command in UNIX™ or by browsing the project source, found on the NetworkX website [7].

Randomly Generating a Graph

There are many ways to generate random graphs, or non-random graphs for that matter, in NetworkX.

    >>> G = gnp_random_graph(1000,.002)
    >>> G.info()
    Name: gnp_random_graph(1000,0.002)
    Type: Graph
    Number of nodes: 1000
    Number of edges: 975
    Average degree: 1.95
    >>> G = hypercube_graph(10)
    >>> G.info()
    Name: hypercube_graph(10)
    Type: Graph
    Number of nodes: 1024
    Number of edges: 5120
    Average degree: 10.0

Some fancier graphs need to be imported. For example, the G(w) model is accessed as follows:

    >>> H = barabasi_albert_graph(100,1)
    >>> from networkx.generators.degree_seq import *
    >>> G = expected_degree_graph(H.degree())

Here we create a Barabasi-Albert preferential attachment graph, and then use the resulting degree sequence to create a G(w) graph.

Iterating over Nodes/Edges

NetworkX obeys the Python convention for iteration. For instance, we could do the following:

    >>> G = gnp_random_graph(1000,.02)
    >>> L = []
    >>> for n in G.nodes():
    ...     if G.degree(n) > 20:
    ...         L.append(n)
    ...
    >>> H = subgraph(G,L)

This shows how one can construct the subgraph of G induced by vertices of degree greater than twenty.
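Because the NetworkX interface changes between releases, it can help to see the same computation written against nothing but the Python standard library. The sketch below is not NetworkX code: the helpers gnp_adjacency and induced_subgraph are hypothetical names of my own, the graph is stored as a dictionary mapping each vertex to its set of neighbors, and the parameters (200 vertices, p = 0.1, seed 1) are chosen only to keep the example small.

```python
import random
from itertools import combinations


def gnp_adjacency(n, p, seed=None):
    """Sample G(n, p): each of the n*(n-1)/2 possible edges appears with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def induced_subgraph(adj, vertices):
    """Keep only the given vertices and the edges with both endpoints among them."""
    keep = set(vertices)
    return {v: adj[v] & keep for v in keep}


G = gnp_adjacency(200, 0.1, seed=1)
L = [v for v in G if len(G[v]) > 20]   # vertices of degree greater than twenty
H = induced_subgraph(G, L)
```

The dictionary-of-sets representation makes both the degree test and the induced subgraph one-liners, which is roughly the convenience NetworkX packages up behind G.degree() and subgraph().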

3.2. Boost Graph Library. The Boost Graph Library (BGL) is a C++ graph library made for fast and efficient graph computation. It is due to a team at Indiana University, copyright 2000-2001. There is also a parallel version of the BGL currently in active development for large scale projects [9]. It should also be noted that the research machine, math107, does not currently have the Python bindings installed.

3.2.1. Strengths and Weaknesses. One of the primary advantages of the BGL is the large selection of algorithms available as primitives for the construction of larger algorithms. These include the categories of shortest paths, minimum spanning trees, connected components, max flow, sorting, layout, and others. The algorithms are both highly generic and optimized for speed. It is noteworthy that a number of software packages currently utilize BGL for their production code; in particular, the Large Graph Layout graph drawing package is written using the BGL.

Another major advantage of the BGL is the highly generic interface. For those familiar with C++, BGL is generic in the same manner as the Standard Template Library. In any case, the generality has a number of practical consequences. One such consequence is the ability to use a large class of graph data-structures; any graph data-structure need only satisfy some interface conventions (there are a range of conventions, from highly specific to highly generic) to be usable by the majority of the library. Another consequence is the ability to fine-tune any pre-built algorithm with very little code. For instance, the library includes an implementation of Dijkstra's shortest-path algorithm. Customizations include the ability to act on directed or undirected graphs, weighted graphs, to specify an arbitrary distance function, to specify an arbitrary method of combining distances, and even the ability to perform arbitrary actions at various standard points in the algorithm’s execution. In general, the design of the BGL allows for highly functional code which is at the same time highly customizable.

It is important to note that the BGL is the most comprehensively documented graph library currently available. Documentation is available in both web and book formats, written by the package’s authors. The documentation consists of both a how-to section and a full listing of the data-structure and algorithm interface. The listing is written in a style very similar to that of the STL documentation, and in particular thus includes complexity guarantees for most operations. For any application that requires speed, this level of control is essential.

It is also of some use that the BGL has a number of Python bindings (in other words, an interface to the Python language that still allows one to access the library). The interface is designed to mimic the C++ interface in function while maintaining Python syntax and simplicity. This can be useful in either getting started quickly with the BGL or as a means of prototyping code for later migration to a faster C++ implementation.

The primary disadvantage to the BGL is the added complexity associated with the highly generic C++ interface. To fully utilize many of the features, the user needs some familiarity with basic C++ programming, as well as some knowledge about more esoteric features such as templates and iterators. Moreover, as the BGL shares the advantages of the Standard Template Library, it also shares its well known weaknesses. The two most important of these are the lack of full compatibility with all but the most modern compilers⁴ and the difficulty in obtaining useful debugging information given the inherent complexity of the underlying data-structures⁵.

Another aspect of the complexity is reflected in the greatly increased development time due to both the inherent complexity in the BGL and the use of C++ as the development language. This is somewhat moderated by the availability of the Python binding, but at the cost of much of the efficiency and flexibility of the C++ interface.

⁴ This is a decreasingly common problem as most production C++ compilers have finally become compliant with the full language specification.
⁵ We will not attempt to flesh out this difficulty, sufficing only to mention that anyone who has used STL will be familiar with the multipage string-of-symbols error messages commonly reported for simple errors.

3.2.2. A Small Example. We offer only a small snippet to give the flavor of the BGL⁶. The BGL documentation [8] offers much more.

    #include <iostream>   // for std::cout
    #include <utility>    // for std::pair
    #include <algorithm>  // for std::for_each
    #include <boost/graph/graph_traits.hpp>
    #include <boost/graph/adjacency_list.hpp>
    #include <boost/graph/dijkstra_shortest_paths.hpp>

    using namespace boost;

    int main(int, char*[])
    {
      // create a typedef for the Graph type
      typedef adjacency_list<vecS, vecS, bidirectionalS> Graph;

      // Make convenient labels for the vertices
      enum { A, B, C, D, E, N };
      const int num_vertices = N;
      const char* name = "ABCDE";

      // writing out the edges in the graph
      typedef std::pair<int, int> Edge;
      Edge edge_array[] =
      { Edge(A,B), Edge(A,D), Edge(C,A), Edge(D,C),
        Edge(C,E), Edge(B,D), Edge(D,E) };
      const int num_edges = sizeof(edge_array)/sizeof(edge_array[0]);

      // declare a graph object
      Graph g(num_vertices);

      // add the edges to the graph object
      for (int i = 0; i < num_edges; ++i)
        add_edge(edge_array[i].first, edge_array[i].second, g);
      ...
      return 0;
    }

⁶ These examples are lifted from the BGL documentation [8].

Here we have an example constructing a simple graph with seven edges.

Next, we have an example of the Python bindings used to read in a DOT file, compute the minimum spanning tree, and return a DOT file with the MST highlighted.

    import boost.graph as bgl

    # Load a graph from the GraphViz file 'mst.dot'
    graph = bgl.Graph.read_graphviz('mst.dot')

    # Convert the weight into floating-point values
    weight = graph.convert_property_map(graph.edge_properties['weight'], 'float')

    # Compute the minimum spanning tree of the graph
    mst_edges = bgl.kruskal_minimum_spanning_tree(graph, weight)

    # Compute the weight of the minimum spanning tree
    print 'MST weight =', sum([weight[e] for e in mst_edges])

    # Put the weights into the label. Make MST edges solid while all other
    # edges remain dashed.
    label = graph.edge_property_map('string')
    style = graph.edge_property_map('string')
    for e in graph.edges:
        label[e] = str(weight[e])
        if e in mst_edges:
            style[e] = 'solid'
        else:
            style[e] = 'dashed'

    # Associate the label and style property maps with the graph for output
    graph.edge_properties['label'] = label
    graph.edge_properties['style'] = style

    # Write out the graph in GraphViz DOT format
    graph.write_graphviz('mst-out.dot')
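For readers without the BGL Python bindings installed, it may help to see what a Kruskal minimum spanning tree computation involves. The following is a plain Python sketch of Kruskal's algorithm with a simple union-find structure; it is not the BGL implementation, and the function name kruskal_mst and the small example graph are my own inventions.

```python
def kruskal_mst(num_vertices, weighted_edges):
    """Return the edges (u, v, weight) of a minimum spanning forest.

    weighted_edges is a list of (weight, u, v) triples on vertices
    0 .. num_vertices - 1.
    """
    parent = list(range(num_vertices))  # union-find forest, one tree per component

    def find(v):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    # Consider edges from lightest to heaviest, keeping those that join
    # two different components.
    for weight, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree


# A made-up weighted graph on four vertices.
edges = [(1.0, 0, 1), (4.0, 0, 2), (2.0, 1, 2), (5.0, 2, 3), (3.0, 1, 3)]
mst = kruskal_mst(4, edges)
mst_weight = sum(weight for _, _, weight in mst)
```

The membership test "e in mst_edges" in the BGL example plays the same role as checking whether an edge appears in the list returned here.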

4. Graph Drawing

The general topic of graph drawing is quite broad, so some definitions are in order. By a graph drawing we intend a geometrical embedding of a graph G (possibly directed and with loops) into $\mathbb{R}^2$ or $\mathbb{R}^3$. The intended application of such an embedding is the production of an image suitable for illustration of some feature of the graph.

Before we discuss the topic of interest to this guide, let us discuss some graph drawing topics which are not going to be discussed further. One such topic is the problem of finding embeddings for very well-behaved families of graphs. These include trees, planar/outerplanar graphs, graphs of bounded genus, graphs arising from some natural and regular geometry (e.g. lattices), and algebraic graphs (e.g. Cayley graphs). Another topic is drawings which seek to optimize or meet very strict criteria, such as planar drawings/sphere embeddings, minimum crossing embeddings, lattice embeddings, symmetric embeddings, and the like.

So what, then, is our goal? We would like to present here techniques for drawing “large real-world graphs” quickly, automatically, and suitable for publication. Naturally, we would feel more comfortable with a definition that did not resort to the use of quotation marks, but the nature of the problem makes any such description necessarily of a heuristic nature. However, we note that practically speaking, the graphs of interest tend to be large (≥ 100 vertices, say), very sparse, and quite often displaying power-law degree distributions.

This section is organized as follows. We present two algorithms which are concerned with producing the desired embedding, and illustrate their use. We then discuss issues of presentation, including the use of color, edge order, translucence and the like, and present tools based on NetworkX. Finally, we include some worked examples. The reader who would like to begin immediately should skip ahead to the examples section.

4.1. Algorithms.

4.1.1. Force-Directed Algorithms: Kamada-Kawai. Perhaps the most natural class of algorithms for graph layout are the spring or force-directed algorithms. The archetypical algorithm begins with a random placement of the vertices in $\mathbb{R}^d$. To each edge we attach an ideal spring (obeying Hooke’s law, say) with some prescribed ideal length. The algorithm then iterates to minimize the energy of the system, typically until the change in energy drops below a given threshold.

While there are a number of implementations of this general algorithm, we discuss one in particular due to Kamada and Kawai [13], with numerical improvements by Gansner et al. [12]. The input to the algorithm is a list of edge weights $(w_{ij})$ and edge lengths $(d_{ij})$. The $(d_{ij})$ are taken to be the graph-theoretic distance if not otherwise specified. Given some positions $X = (X_1, \ldots, X_n)$ of the vertices, the stress is defined to be

$$\mathrm{stress}(X) = \sum_{i<j} w_{ij} \left( \lVert X_i - X_j \rVert_2 - d_{ij} \right)^2 .$$
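To make the stress function concrete, here is a short Python sketch which evaluates it for a given layout. This is only an evaluation of the formula, not the minimization performed by any Graphviz or BGL routine. The function name stress and the matrix-style arguments are my own choices, and the default of unit weights when no w is supplied is an assumption made for this illustration.

```python
import math


def stress(positions, d, w=None):
    """Sum over pairs i < j of w[i][j] * (|X_i - X_j| - d[i][j])**2.

    positions: list of coordinate tuples X_1, ..., X_n
    d:         matrix of target distances d[i][j]
    w:         matrix of weights w[i][j]; unit weights if omitted (an assumption)
    """
    n = len(positions)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w_ij = 1.0 if w is None else w[i][j]
            # Euclidean distance between the two placed vertices.
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(positions[i], positions[j])))
            total += w_ij * (dist - d[i][j]) ** 2
    return total


# Graph-theoretic distances for the path 0 - 1 - 2.
d = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # path drawn as a straight line
folded = [(0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]    # endpoints placed on top of each other
stress_straight = stress(straight, d)
stress_folded = stress(folded, d)
```

The straight drawing realizes every target distance exactly and so has zero stress, while the folded drawing is penalized for collapsing the two endpoints.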

The formulation due to [12] forms the core of the neato algorithm, found in the popular AT&T Graphviz library. In particular, the algorithm accepts as input both edge weights and edge lengths, and has the default behavior as described. The version due to Kamada and Kawai [13] is found in the Boost Graph Library. These are perhaps the two most commonly used implementations of the force-directed class of drawing tools.

Figure 4. A graph with layout produced by the neato algorithm of the Graphviz package.

4.1.2. Spanning Tree Algorithms: LGL. The force-directed algorithm is characterized by an attempt to find an approximate isometry of some n point metric space into our two (or three) dimensional space. If one dispenses with this notion of graph layout, the next natural choice is to lay out important spanning subgraphs of a given graph.

One important choice of spanning subgraph is the notion of a spanning tree. Given some (possibly rooted) spanning tree T of a connected graph G, one can then apply any host of tree layout algorithms. As this produces a set of coordinates for the vertices, this induces a drawing of the full graph.

This scheme is implemented in the fairly recent code Large Graph Layout (LGL) [3], written by Alex Adai. The algorithm can be described as follows: Beginning with a rooted spanning tree, place the root at the origin. For every neighbor of the root, place it on a unit circle/unit sphere about the root to maximize the angle between any other neighbor (thus, on the unit circle, the neighbors are spaced evenly). For each neighbor, draw a circle about it and lay out the remaining neighbors on the part of this circle which falls outside the prior circle, distributing vertices evenly on this partial circle. This is then repeated iteratively for all points. Details for the exact procedure can be found in a paper linked off of the LGL website [3]. In particular, it is worth noting that the full algorithm includes a smoothing procedure, where neighborhoods of the leaves of our spanning tree are passed through a force-directed method for improved layout.

In practice, this algorithm does a remarkable job at “spreading out” real world graphs. In part, this is due to the “octopus” structure of complex networks CITE FAN.
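The first step of the procedure above is easy to sketch in Python: place the root at the origin and space its neighbors evenly on the unit circle. The function name place_on_circle is hypothetical, the sketch is two dimensional only, and it deliberately omits the partial-circle placement of later generations and the smoothing pass; it is an illustration of the idea rather than the LGL code.

```python
import math


def place_on_circle(center, neighbors, radius=1.0):
    """Space the given neighbors evenly on a circle about center."""
    cx, cy = center
    k = len(neighbors)
    positions = {}
    for index, v in enumerate(neighbors):
        # Equal angular gaps maximize the smallest angle between neighbors.
        angle = 2.0 * math.pi * index / k
        positions[v] = (cx + radius * math.cos(angle),
                        cy + radius * math.sin(angle))
    return positions


# Root at the origin with four (made-up) tree neighbors.
pos = {"root": (0.0, 0.0)}
pos.update(place_on_circle(pos["root"], ["a", "b", "c", "d"]))
```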

Figure 5. A graph with layout produced by the LGL code. Compare with figure 4; both represent the same graph.

4.2. Presentation. The prior section considered the question of graph layout, assigning to each vertex a spatial coordinate⁷. In this section, we consider the question of actually drawing a graph given an embedding.

In principle, drawing a graph given its vertex coordinates is simple: one simply calls an appropriate draw operation for every edge. A simple-minded implementation, however, quickly yields problems. Consider the following figure DRAW THIS, drawn with the LGL code. The resulting tangle of edges is not very informative. However, given the performance of LGL, we expect the high degree vertices to be clustered in the center. We suspect our layout indeed has this property, but it is hard to really determine which edges participate in any such subgraph.

One such solution is the use of color, or for publication, shades of gray. Examples of this technique have appeared previously in this guide; see figure 5. A few remarks are in order. First, while the vertex degrees are generating the color choices for our figure, they are reflected in colored edges. Why not simply color the vertices instead? From a practical standpoint, vertices take up less area, so they are harder to correctly identify⁸. Another criticism of this approach is that there is no way to visually distinguish important from negligible edges.

A second note about figure 5 is that the edges are drawn with a given ordering, where the order of drawing is least to most important. In practice, we suggest the following general principle for edge drawing.
(1) Give all edges a weight from, say, [0, 1].
(2) Order the edges by weight.
(3) Draw the edges from least weight to most weight.
Why use weights instead of colors or grayscale values? With weights, one can then utilize different colormaps⁹ for various effects.
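The three-step principle above, together with a colormap in the sense of the footnote, can be sketched in a few lines of Python. The names make_colormap and drawing_order are my own, the interpolation rule is assumed to be linear, and the white-to-black map is just one possible choice; no actual drawing is done here, only the preparation of the order and the colors.

```python
def make_colormap(colors):
    """Build a colormap from n >= 2 RGB triples assigned to n evenly spaced
    points of [0, 1], using linear interpolation in between (an assumption)."""
    n = len(colors)

    def colormap(t):
        t = min(max(t, 0.0), 1.0)        # clamp to [0, 1]
        scaled = t * (n - 1)
        low = min(int(scaled), n - 2)    # index of the color at or below t
        frac = scaled - low
        return tuple(a + frac * (b - a)
                     for a, b in zip(colors[low], colors[low + 1]))

    return colormap


def drawing_order(weighted_edges):
    """Sort (u, v, weight) triples so the heaviest edges are drawn last, on top."""
    return sorted(weighted_edges, key=lambda edge: edge[2])


# White for weight 0, black for weight 1: heavy edges come out dark.
gray = make_colormap([(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)])
edges = [("a", "b", 0.9), ("b", "c", 0.1), ("a", "c", 0.5)]
ordered = drawing_order(edges)
edge_colors = [gray(weight) for _, _, weight in ordered]
```

Swapping in a different list of colors changes the visual effect without touching the weights, which is the point of keeping weights and colors separate.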

⁷ We have assumed all graph embeddings are straight-line embeddings, and as such our layout is reduced to a vertex embedding. One could, however, look at the case of curved edges. Though we shall not do so, it should be noted that many graph formats and tools have the ability to work with spline edges in addition to their linear counterparts.
⁸ Often vertex sizes are changed to reflect degree. While this can help identify large degree vertices, in large graphs this has the practical effect of covering much of the detail of the graph.
⁹ For the uninitiated, a colormap is formally a function c from [0, 1] into some colorspace, say RGB for concreteness. In most computer implementations, a colormap assigns n evenly spaced points in [0, 1] to n specific colors, utilizing an interpolation rule to assign the remainder of the interval [0, 1].

So how then to assign weights? Any scheme appropriate to the graph is fine, though for general graphs I prefer the notion of weighted average degree. If $d(u)$ is the degree of vertex $u$, then we weight edge $(u, v)$ as

$$w(u, v) = d(u)\, d(v)\, \Delta^{-2}(G),$$

where $\Delta(G)$ is the maximum degree. For graphs with a power-law degree distribution, we need a minor modification in order to obtain an appropriate gradient, namely

$$w(u, v) = \frac{\ln d(u) + \ln d(v)}{2 \ln \Delta(G)}.$$

One could instead use a multiplicative logarithmic version of the one above, but then degree one vertices need a special case.

Of course, the choice of edge weights may vary with application. Consider the following example (see figure 6).

Figure 6. A very homogeneous graph with substructure highlighted.

Here, the edge weights are not given by degree information, but rather highlight some pre-computed substructure. In contrast, a degree-based weighting would be very homogeneous, hence little improved over a monochromatic drawing. The lesson to be drawn here is simple: for large graphs, drawings should include edge weights, and an appropriate method for differentiating the small from the large.

Finally, we discuss a few miscellaneous issues of which one should be aware. The use of transparency, for example, is to be encouraged. Properly used, transparency allows greater detail in medium sized graph drawings. As a rough guide, transparency shows a noticeable effect on graphs of up to a few thousand edges.

In addition to transparency, it is well known that a proper background can highlight contrast between edges of different edge weights. When used in conjunction with an appropriate colormap, a background can make a surprising difference. See figure 7.

Figure 7. Note how the background emphasizes contrast in the edge colors.
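As a concrete companion to the two degree-based weightings defined in this section, the following is a minimal sketch in plain Python. The function name degree_weights and the toy edge list are our own illustrations, not part of the drawing code listed later; the handling of the case ∆(G) = 1 (where ln ∆(G) = 0) is likewise an assumption made only to avoid a division by zero.

```python
import math

def degree_weights(edges, scheme="degree"):
    """Return {(u, v): weight} for an undirected edge list.

    scheme="degree": w(u, v) = d(u) d(v) / Delta^2
    scheme="logdeg": w(u, v) = (ln d(u) + ln d(v)) / (2 ln Delta)
    """
    # Count degrees directly from the edge list.
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    max_deg = max(degree.values())

    weights = {}
    for u, v in edges:
        if scheme == "degree":
            w = degree[u] * degree[v] / float(max_deg ** 2)
        elif scheme == "logdeg":
            if max_deg == 1:
                # Assumption: ln(Delta) = 0 here, so give every edge weight 1.
                w = 1.0
            else:
                w = (math.log(degree[u]) + math.log(degree[v])) / (2 * math.log(max_deg))
        else:
            raise ValueError("unknown scheme: %r" % scheme)
        weights[(u, v)] = w
    return weights

# Toy graph (hypothetical): a star centred at "c" plus the edge a-b.
edges = [("c", "a"), ("c", "b"), ("c", "d"), ("c", "e"), ("a", "b")]
linear = degree_weights(edges, "degree")
logarithmic = degree_weights(edges, "logdeg")
print(linear[("c", "a")], logarithmic[("c", "a")])
```

Both schemes produce values in (0, 1] for edges of a graph with ∆(G) > 1, which is what makes them convenient inputs to a colormap. Note that the additive logarithmic form assigns weight 0 only to an edge joining two degree one vertices.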

4.3. Worked Examples. In this example we shall show how to draw a moderately sized graph using both force-directed and LGL codes. We shall also examine presentation issues. The tools we shall use are primarily NetworkX (and the accompanying package pygraphviz). In keeping with the rest of the document, we shall work with the DOT format as our primary storage medium, and show how to manipulate this format using pygraphviz. We shall require NetworkX 0.31 and pygraphviz 0.32, and we include all the relevant code in the following section.

To begin, we need a data source. We shall use collaboration data from the discrete geometry literature. Our primary file geom-1.net begins

    *Vertices 7343
    1 "S. Kambhampati" 0.0000 0.0000 0.5000
    2 "Christian A. Duncan" 0.0000 0.0000 0.5000
    3 "V. P. Grishukhin" 0.0000 0.0000 0.5000
    4 "T. T. Moh" 0.0000 0.0000 0.5000
    5 "R. P. Brent" 0.0000 0.0000 0.5000
    6 "Afra Zomorodian" 0.0000 0.0000 0.5000
    7 "Gianfranco Bilardi" 0.0000 0.0000 0.5000
    8 "Y. I. Yoon" 0.0000 0.0000 0.5000
    9 "N. Bourbaki" 0.0000 0.0000 0.5000
    10 "F. W. Levi" 0.0000 0.0000 0.5000
    11 "J. P. Kermode" 0.0000 0.0000 0.5000
    12 "B. B. Kimia" 0.0000 0.0000 0.5000
    13 "R. Livne" 0.0000 0.0000 0.5000
    14 "H. N. Gabow" 0.0000 0.0000 0.5000
    15 "Jonathan C. Hardwick" 0.0000 0.0000 0.5000
    ...

and 7343 lines later we find

    7343 "J. Wilson" 0.0000 0.0000 0.5000
    *Arcs
    *Edges
    272 6588 1
    3308 6588 1

    4884 6588 1
    272 3308 1
    272 4884 1
    3308 4884 1
    3867 6582 1
    2990 3867 1
    257 4949 2
    601 4949 2
    1219 4949 2
    1077 4949 2
    257 601 2
    257 1219 2
    257 1077 10
    601 1219 2
    601 1077 2
    1077 1219 3
    ...

We shall disregard the label data, caring only about the edge data. We thus craft a small Perl script (geom.pl) which parses this data and gives us a basic DOT file.

    #!/usr/bin/perl -w
    print "graph disc_geom {\n";
    while (<>) {
        chomp;
        @vert = split;
        print "$vert[0] -- $vert[1];\n";
    }
    print "}";

We thus issue

    [triton@math107 ~]$ tail -n +7374 geom-1.net | ./geom.pl > geom.dot

to obtain the file geom.dot. Note that here we use the UNIX tail command to ask for line 7374 and everything that follows. Now, we shift into Python.

    >>> import pygraphviz
    >>> import networkx
    >>> G = pygraphviz.AGraph('geom.dot') # Read our data
    >>> H = G.subgraph(G.nodes()[1:1000]) # Get an induced subgraph on 999 nodes
    >>> L = networkx.connected_components(H)[0] # Find the largest connected component
    >>> K = H.subgraph(L)
    >>> f = open('geom_small.dot', 'w') # Open the output file for writing

    >>> K.write(f)
    >>> f.close()
    >>> len(K.nodes())
    699
    >>> len(K.edges())
    2635

At this point, in the file geom_small.dot we have a connected graph on 699 vertices and 2635 edges. We shall now lay out this graph using the force-directed method previously discussed. Specifically, we use the algorithm neato, which is part of the AT&T Graphviz package. We issue the command

    [triton@math107 ~]$ neato geom_small.dot > geom_small_layout.dot

We could also have done this inside of Python, using the command K.layout(prog='neato'). We now use the drawing code supplied in the next section, which uses the drawing engine of Matplotlib. Thus, we go to Python.

    >>> import graph_draw_agraph
    >>> from pygraphviz import AGraph
    >>> G = AGraph('geom_small_layout.dot')
    >>> graph_draw_agraph.compute_weights(G, scheme='logdeg') # Create a logarithmic weighting
    >>> graph_draw_agraph.draw(G, 'draw1.png')

Here we used a logarithmic weighting scheme. The result is in figure 8.

Figure 8. draw1.png
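The conversion from geom-1.net to DOT at the start of this example relied on a Perl script and a hard-coded line offset passed to tail. As an alternative, here is a minimal sketch of the same step in Python which locates the *Edges marker itself. The function name pajek_edges_to_dot is our own, the two edges in the sample are invented for illustration, and we assume (as in the excerpt shown earlier) that each edge line begins with two vertex numbers.

```python
def pajek_edges_to_dot(lines, graph_name="disc_geom"):
    """Convert the *Edges section of a Pajek .net file into DOT text."""
    out = ["graph %s {" % graph_name]
    in_edges = False
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("*"):
            # Section markers such as *Vertices, *Arcs, *Edges.
            in_edges = (fields[0].lower() == "*edges")
            continue
        if in_edges and len(fields) >= 2:
            # Keep only the two endpoints; any trailing weight is dropped.
            out.append("%s -- %s;" % (fields[0], fields[1]))
    out.append("}")
    return "\n".join(out)

# Three vertices from the data file; the two edges below are made up.
sample = '''*Vertices 3
1 "S. Kambhampati" 0.0000 0.0000 0.5000
2 "Christian A. Duncan" 0.0000 0.0000 0.5000
3 "V. P. Grishukhin" 0.0000 0.0000 0.5000
*Arcs
*Edges
1 2 1
2 3 1
'''
dot_text = pajek_edges_to_dot(sample.splitlines())
print(dot_text)
```

For the real data one would pass an open file object for geom-1.net in place of sample.splitlines() and write the returned string to geom.dot.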

The default coloring scheme uses a colormap called matplotlib.cm.jet. Other standard colormaps are available; see the matplotlib.cm documentation. Of course, one can create colormaps as well. Good information about colormaps can be found at http://www.scipy.org/Cookbook/Matplotlib/Show_colormaps. Let us create a small colormap. To do so, we issue the following:

    >>> from pylab import *
    >>> cdict = {
    ... 'red': ((0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.7, 0.98, 0.98),
    ...         (1.0, 1.0, 1.0)),
    ... 'green': ((0.0, 0.0, 0.0), (0.3, 0.15, 0.15), (0.7, 0.75, 0.75),
    ...           (1.0, 1.0, 1.0)),
    ... 'blue': ((0.0, 0.0, 0.0), (0.3, 0.34, 0.34), (0.7, 0.08, 0.08),
    ...          (1.0, 1.0, 1.0))
    ... }
    >>> my_cmap = matplotlib.colors.LinearSegmentedColormap('my_colormap', cdict, 256)
    >>> graph_draw_agraph.draw(G, 'draw2.png', my_cmap) # Use our module's draw, not pylab's

An explanation is provided in the reference above. Briefly, the first coordinate in every triple corresponds to a value in [0, 1], and the second coordinate represents the intensity value of the given color (we ignore the third coordinate for now). We thus have four reference values given above, and the rest of the colormap is filled in via interpolation. We thus obtain figure 9.

Figure 9. draw2.png
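To make the interpolation rule for the colormap dictionary concrete, the following sketch evaluates the same cdict by hand, without Matplotlib. The helper names channel_value and colour_at are hypothetical. We assume the usual Matplotlib convention for each row (x, y0, y1): between consecutive rows the channel runs linearly from the third entry of the left row to the second entry of the right row. In the dictionary above the second and third entries coincide, so the distinction does not matter here.

```python
cdict = {
    'red':   ((0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.7, 0.98, 0.98), (1.0, 1.0, 1.0)),
    'green': ((0.0, 0.0, 0.0), (0.3, 0.15, 0.15), (0.7, 0.75, 0.75), (1.0, 1.0, 1.0)),
    'blue':  ((0.0, 0.0, 0.0), (0.3, 0.34, 0.34), (0.7, 0.08, 0.08), (1.0, 1.0, 1.0)),
}

def channel_value(rows, t):
    """Piecewise-linear interpolation of one colour channel at t in [0, 1]."""
    for left, right in zip(rows, rows[1:]):
        x0, y_start = left[0], left[2]    # value just above the left anchor
        x1, y_end = right[0], right[1]    # value just below the right anchor
        if x0 <= t <= x1:
            frac = (t - x0) / (x1 - x0)
            return y_start + frac * (y_end - y_start)
    raise ValueError("t must lie in [0, 1]")

def colour_at(cdict, t):
    """Return the (red, green, blue) triple assigned to t."""
    return tuple(channel_value(cdict[c], t) for c in ('red', 'green', 'blue'))

midpoint = colour_at(cdict, 0.5)
print(midpoint)
```

At t = 0.5, halfway between the reference values 0.3 and 0.7, each channel is simply the average of its two neighbouring intensities.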

Next, we wish to lay out the same graph using LGL. To do so, we need to do some format conversion. The format for LGL is described in a prior section, and in the following subsection we give a simple code which performs the desired conversion. Using this code, which we call conversion.py, we can create the desired file.

    >>> from pygraphviz import *
    >>> G = AGraph('geom_small_layout.dot')
    >>> from conversion import *
    >>> to_lgl(G, 'geom.lgl')

We should now have a file geom.lgl in the current directory. We shall assume that our current directory also contains the files setup.pl and lgl.pl, which are included with the LGL package. We further assume that we are at the root of our user account, in this case /home/triton.

Layout using LGL requires a few steps. First, we must create a directory where LGL will leave its output. We issue the command mkdir /tmp/lgl to create this work directory. Next, we issue the command

    [triton@math107 ~]$ ./setup.pl -c config

This creates our configuration file, which we now edit. We need to edit two specific lines to indicate where the work directory and LGL file are located. Note that the tools require absolute path names. Editing config we find the following:

    ...
    # All paths should be absolute.
    ###########################################################

    # The output directory for all the LGL results. Note that
    # several files and subdirectories will be generated.
    # This has to be a valid directory name.
    tmpdir = '/tmp/lgl'

    # The edge file to use for the layout. Has to be a file readable
    # by LGLFormatHandler.pm. It has to be an existing/valid file name,
    # with the absolute path.
    inputfile = ''

    # The output file that will have the final coordinates of
    # each vertex. This has to be a valid file name, and it
    # will be place in 'tmpdir'.
    ...

We see that tmpdir is already set to the correct value, since we made /tmp/lgl as our work directory. We thus need to change inputfile = '' to inputfile = '/home/triton/geom.lgl'. With these edits made, we can now run LGL by typing

    [triton@math107 ~]$ ./lgl.pl -c config

where the argument passed is the name of our configuration file, config. When the layout is finished, /tmp/lgl will be filled with a number of files. As we are interested simply in the final layout, the relevant file will be /tmp/lgl/final.coords. Opening this file, we obtain:

    1001 6.4891 5.8631
    4938 6.2929 7.6615
    5564 6.1897 6.9687
    3851 6.3344 5.1216
    101 8.2566 6.0554
    696 7.7475 6.8037
    ...

The file format thus lists triples consisting of the vertex label followed by the first and second coordinates. The file conversion.py contains a function which will allow us to import these position values into our graph.

    >>> from pygraphviz import *
    >>> G = AGraph('geom_small_layout.dot')

    >>> from conversion import *
    >>> from_lgl_coords(G, '/tmp/lgl/final.coords') # Note that we overwrite prior 'pos' attributes

At this point, we have an AGraph object with 'pos' attributes, and we can thus draw such a graph using the techniques developed earlier. See figure 10.

Figure 10. An LGL produced layout.

4.4. A Sample Drawing Code. Here we present an extended drawing code which takes advantage of the graphics services of Matplotlib. The full code may be found here.

    # GRAPH DRAW
    # A flexible graph drawing code. Pass in a graph with weighted edges.
    # Copyright (C) 2006 Ross M. Richardson
    #
    # This program is free software; you can redistribute it and/or
    # modify it under the terms of the GNU General Public License
    # as published by the Free Software Foundation; either version 2
    # of the License, or (at your option) any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    # GNU General Public License for more details.
    #
    # You should have received a copy of the GNU General Public License
    # along with this program; if not, write to the Free Software
    # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
    # The author can be contacted via email at
    # or by postal mail at
    # 9500 Gilman Drive
    # Dept. of Mathematics
    # UCSD
    # La Jolla, CA 92093-0112

    import matplotlib
    import pygraphviz
    import networkx
    import sys
    import math
    matplotlib.use('Agg')
    from matplotlib.backends.backend_agg import RendererAgg
    from matplotlib.transforms import *
    from matplotlib.cm import jet

    def draw(graph, outputfile, colormap=matplotlib.cm.jet, height=400, width=400,
             dotsperinch=72.0, linewidth=3):
        """
        This will produce a (currently) PNG image given an AGraph instance
        (according to pygraphviz 0.32).

        Note that we expect a graph to have POS attributes, and to have
        edge weights.

        Parameters:
        graph -- an AGraph instance.
        outputfile -- file name (we do not add the '.png' suffix).
        colormap -- matplotlib.cm instance.
        height, width -- in points.
        dotsperinch -- a float.
        linewidth -- how thick to make the lines.

        Only the first two arguments are necessary.
        """
        # Here we set up our instance. RendererAgg takes the width first.
        dpi = Value(dotsperinch)
        r = RendererAgg(width, height, dpi)
        gc = r.new_gc()

        # Here we convert to an XGraph instance
        G = networkx.from_agraph(graph)

        E = G.edges()
        N = G.nodes()
        Pos = {}
        xmin = 0
        xmax = 0
        ymin = 0
        ymax = 0

        # Establish plotting positions
        for n in N:
            posString = G.node_attr[n]['pos']
            if posString is not None:
                p = posString.split(',')
                x = float(p[0])
                y = float(p[1])
                Pos[n] = [x, y]
                if x > xmax:
                    xmax = x
                if x < xmin:
                    xmin = x
                if y > ymax:
                    ymax = y
                if y < ymin:
                    ymin = y
            else:
                print "No 'pos' attribute on vertex ", n

        # Now we need to set up a coordinate transform
        displayLim = Bbox(Point(Value(0), Value(0)),
                          Point(Value(width), Value(height)))
        viewLim = Bbox(Point(Value(xmin), Value(ymin)),
                       Point(Value(xmax), Value(ymax)))

        # We just need to pass this transformation to every
        # draw call.
        trans = SeparableTransformation(viewLim, displayLim,
                                        Func(IDENTITY), Func(IDENTITY))

        # Some display properties
        gc.set_antialiased(1)
        gc.set_linewidth(linewidth)
        gc.set_alpha(.8) # This is a global parameter.

        # A little magic to get the edges sorted correctly
        sortedEdges = []
        for e in E:
            weight = e[2]['weight']
            weight = float(weight)
            sortedEdges.append([weight, e])
        sortedEdges.sort()

        # Draw the lightest edges first so that heavy edges end up on top
        for [weight, e] in sortedEdges:
            source = e[0]
            dest = e[1]
            gc.set_foreground(colormap.__call__(weight))
            gc.set_alpha(.5 + .5 * weight) # A magical value
            x = [Pos[source][0], Pos[dest][0]]
            y = [Pos[source][1], Pos[dest][1]]
            r.draw_lines(gc, x, y, trans)

        # Need to implement vertex drawing!!

        # Output
        r._renderer.write_png(outputfile)

    def compute_weights(graph, scheme='uniform'):
        """
        Write edgeweights to an AGraph instance.

        Parameters:
        graph -- AGraph instance.
        scheme -- 'uniform', 'degree', or 'logdeg'
        """

        G = networkx.from_agraph(graph)

        maxdeg = max(G.degree())
        maxdeg2 = float(maxdeg * maxdeg)
        for e in graph.edges():
            if scheme == 'degree':
                e.attr['weight'] = str(G.degree(e[0]) * G.degree(e[1]) / maxdeg2)
            elif scheme == 'logdeg':
                e.attr['weight'] = str(math.log(G.degree(e[0]) * G.degree(e[1])) / \
                                       math.log(maxdeg2))
            else: # uniform
                e.attr['weight'] = str(1.0)

Next, we list conversion.py. This file contains code for converting DOT to LGL, and similarly for importing coordinates from LGL's final.coords file into an AGraph object (and hence DOT file). The code can be found at http://www.math.ucsd.edu/~rmrichar/drawing/links/conversion.py.

    import pygraphviz
    import networkx

    def to_lgl(graph, filename):
        """
        Write an LGL file given an AGraph file
        """

        f = open(filename, 'w')

        # Each vertex gets a header line, followed by its larger neighbors,
        # so that every edge is written exactly once.
        for n in graph.nodes():
            f.write('#' + str(n) + '\n')
            for nbr in graph[n]:
                if n < nbr:
                    f.write(nbr + '\n')
        f.close()

    def from_lgl_coords(graph, coordfile):
        """
        Fill in the pos attribute from a LGL coord file.
        """

        f = open(coordfile, 'r')
        for line in f:
            L = line.split()
            n = graph.get_node(L[0])
            n.attr['pos'] = str(L[1]) + ',' + str(L[2])
        f.close()

5. Degree Distributions

One very standard procedure of computing with large graphs is the analysis of the degree distribution. Recall that the degree distribution of a graph G is the sequence d = (d_1, d_2, ..., d_n), where d_i is the degree of vertex i of G. It is sometimes convenient to order the degree distribution.

5.1. Visualization. Visualizing the degree distribution is often very useful. Perhaps the most common procedure is to plot a histogram of the degrees against the associated frequencies. We can do this easily in MATLAB. If D is the degree distribution of a graph, then

    >> hist(D)

produces figure 11.

Figure 11. Simple histogram.

The most common use of degree distributions is the verification of power-law distributions. For this, we use hist to put our data in bins. Again, we assume D is the degree distribution.

    >> [freq, bins] = hist(D, 47)
    >> loglog(bins, freq, 'x')

Here, the 47 refers to the number of bins, which we set here to make comparison with the next figure possible. Applied to the same graph as before, we now get a clear powerlaw relationship in figure 12. You will note, however, that the resulting figure is a bit messy toward the high degree end. This is because high degrees occur much less frequently, resulting in a number of singletons toward the end. To solve this problem, we use exponential binning, or simply stated, we make the bin sizes grow exponentially. To do this, we set the center of each bin to be r times the center of the previous, giving

\[ \mathrm{bin}(i + 1) = r \cdot \mathrm{bin}(i), \qquad r > 1. \]

Figure 12. Note the right hand side.

The function degree_dist (found here) will do this automatically if we pass the desired value of r as the second argument.

    >> [freq, bins] = degree_dist(D, 1.1);
    >> loglog(bins, freq, 'x')

See figure 13 for the difference.
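Since degree_dist itself is not listed in these notes, the following Python sketch shows one way exponential binning could be carried out. It is an illustration under our own assumptions, not a transcription of degree_dist: the function name exponential_bins is hypothetical, the first bin is centered at the smallest positive degree, each bin covers the interval from center/sqrt(r) to center*sqrt(r), and raw counts are returned without normalizing by bin width.

```python
import math

def exponential_bins(degrees, r):
    """Bin positive degrees into bins whose centres grow by a factor r > 1.

    Returns (centres, counts). Bin i has centre c_i = min_degree * r**i
    and covers the half-open interval [c_i / sqrt(r), c_i * sqrt(r)).
    """
    if r <= 1:
        raise ValueError("r must exceed 1")
    positive = [d for d in degrees if d > 0]
    lo = min(positive)
    hi = max(positive)

    def index(d):
        # d lies in bin i exactly when log_r(d / lo) is within 1/2 of i.
        return int(math.floor(math.log(d / float(lo), r) + 0.5))

    n_bins = index(hi) + 1
    centres = [lo * r ** i for i in range(n_bins)]
    counts = [0] * n_bins
    for d in positive:
        counts[index(d)] += 1
    return centres, counts

# A small made-up degree sequence, binned with r = 2.
centres, counts = exponential_bins([1, 1, 1, 1, 2, 2, 3, 4, 8], 2)
print(centres, counts)
```

With r = 2 the degrees 3 and 4 share the bin centered at 4, which is exactly the pooling of sparse high-degree values that cleans up the right hand side of the plot.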

5.2. Powerlaw Exponent. We can extract the powerlaw exponent simply from our degree distribution by regression as follows:

    >> [freq, bins] = degree_dist(D, 1.1);
    >> size(freq)
    ans =
         1    43
    >> lb = []
    >> lf = []

Figure 13. Exponential binning really helps out.

    >> for i = 1:43
    >>     if (freq(i) ~= 0)
    >>         lf = [lf log(freq(i))];
    >>         lb = [lb log(bins(i))];
    >>     end
    >> end
    >> polyfit(lb, lf, 1)
    ans =
       -1.6333    7.3827

Here, the 43 above just applies to the particular degree sequence used (yours will differ). The above code basically linearizes the data and removes zero frequency entries. The powerlaw exponent is given here as −1.6333, namely, the first value returned by polyfit.
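The same regression can be sketched in plain Python, which may be convenient when the binned data is already available there. The function name loglog_slope is our own, and the data below is synthetic, chosen to follow freq = 1000 * bin^(-2) exactly so that the recovered slope is easy to check; it is not the degree sequence used above.

```python
import math

def loglog_slope(bins, freq):
    """Least-squares line through (ln bin, ln freq), skipping empty bins.

    Returns (slope, intercept); the slope estimates the powerlaw exponent.
    """
    pairs = [(math.log(b), math.log(f)) for b, f in zip(bins, freq) if f > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic data: freq = 1000 * bin**(-2), with one empty bin to be skipped.
bins = [1, 2, 4, 8, 16]
freq = [1000.0, 250.0, 62.5, 0, 3.90625]
slope, intercept = loglog_slope(bins, freq)
print(slope, intercept)
```

As with polyfit(lb, lf, 1), the first returned value is the slope of the fitted line in log-log coordinates and the second is its intercept.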

5.3. An example. Here we extract the degrees of a random preferential attachment graph. We first work in Python.

    >>> from networkx import *
    >>> G = barabasi_albert_graph(1000, 1)
    >>> D = G.degree()
    >>> D.sort()
    >>> import sys
    >>> sys.stdout = open('degrees.txt', 'a') # We now redirect output to a file
    >>> for d in D:
    ...     print d
    ...

We should now have a file labeled degrees.txt in our current directory. We switch to MATLAB.

    >> load('degrees.txt')
    >> [freq, bins] = degree_dist(degrees, 1.1)
    >> size(freq)
    ans =
         1    45
    >> lf = []
    >> lb = []
    >> for i = 1:45
           if (freq(i) ~= 0)
               lf = [lf log(freq(i))];
               lb = [lb log(bins(i))];
           end
       end
    >> polyfit(lb, lf, 1)
    ans =
       -1.7641    5.8714
    >> fid = fopen('degrees_processed.txt', 'wt');
    >> fprintf(fid, '%e\t%e\n', [bins', freq']');
    >> fclose(fid);

Now back to Python.

    >>> from pylab import *
    >>> X = load('degrees_processed.txt')
    >>> B = []
    >>> F = []
    >>> for x in X:
    ...     if x[1] != 0:
    ...         B.append(x[0])
    ...         F.append(x[1])
    ...
    >>> loglog(B, F, 'or')
    >>> xlabel('log(degree)')
    >>> ylabel('log(freq)')
    >>> title('Degree Plot')
    >>> savefig('plot.png')

Figure 14. Our example (a log-log "Degree Plot" of log(freq) against log(degrees)).

Here, we demonstrated the use of matplotlib to produce nice output. A simple plot is found in figure 14. The nice output can be seen here.

6. A Sample Project

Acknowledgments

I thank Fan for the support which made this documentation possible (as well as much of the graph drawing knowledge). I thank Lincoln for his work in graph drawing and computing which he freely shared with me, as well as Reid for a number of enlightening conversations. I would like to thank in advance any possible reader who alerts me to errors, omissions, or ideas regarding these notes.

7. Appendix: Datasources.

References

[1] GraphXML. http://ftp.cwi.nl/CWIreports/INS/INS-R0009.pdf
[2] Lu, Lincoln. [email protected]. http://www.math.sc.edu/~lu/
[3] Large Graph Layout. http://bioinformatics.icmb.utexas.edu/lgl/
[4] Walrus. http://www.caida.org/tools/visualization/walrus/
[5] AT&T Graphviz Tools. http://www.graphviz.org/
[6] SAGE. http://modular.math.washington.edu/sage/
[7] NetworkX. https://networkx.lanl.gov/
[8] Boost Graph Library. http://www.boost.org/libs/graph/doc/
[9] Parallel BGL. http://www.osl.iu.edu/research/pbgl/
[10] Roberto Tamassia's Graph Drawing Resources. http://www.cs.brown.edu/~rt/gd.html
[11] Goodman, J.E. and J. O'Rourke, Handbook of Discrete and Computational Geometry, 2nd Edition, CRC Press, 2004.
[12] Gansner, E., Koren, Y. and S. North. "Graph Drawing by Stress Majorization". Graph Drawing: 12th International Symposium, GD 2004, New York, NY, 2004. Lecture Notes in CS 3383, 2005.
[13] T. Kamada and S. Kawai, An Algorithm for Drawing General Undirected Graphs, Information Processing Letters 31 (1989), pp. 7-15.
[14] West, D. Introduction to Graph Theory, 2nd Edition, Prentice-Hall, 2001.

9500 Gilman Drive., University of California, San Diego, La Jolla, CA 92093-0112 E-mail address: [email protected]