Journal of Advanced Statistics, Vol. 2, No. 2, June 2017 78 https://dx.doi.org/10.22606/jas.2017.22002

Network Exploration by Complements of Graphs with

Tai-Chi Wang1, Frederick Kin Hing Phoa2* and Yuan-Lung Lin2

1National Center for High-Performance Computing, National Applied Research Laboratories, Taiwan
2Institute of Statistical Science, Academia Sinica, Taiwan
Email: [email protected]

Abstract Network data have become very popular with the growth of technologies and social applications such as Twitter and Facebook. However, few visualization tools have been created for exploring large-scale networks. We propose a simple and quick procedure to explore a network in this study. The algorithm changes the edge representation based on the complement of a simple graph and the partition method of vertex coloring. Furthermore, the colors provide additional information on top of the partitions. Our proposed method is demonstrated in some famous networks.

Keywords: Visualization, Complement of Graph, Greedy Algorithm, Network Partition, N-

1 Introduction

Graph theory has proven to be a useful approach for explaining complex networks. For example, Erdős and Rényi [1] first proposed the Poisson random model to explain the relations in a network. Watts and Strogatz [2] used a small-world model to describe dynamic network systems. Barabási and Albert [3] proposed a scale-free model to explain the expanding connectivity of networks. A large number of studies and applications have appeared since these models were proposed. It is difficult to glimpse the structure of a large network because of its complexity. Although many useful R-, Python- and C-based software packages such as Cytoscape [4], Gephi [5], and igraph [6] have been developed to visualize and analyze network data, many network features remain hidden inside their complicated structures. Few studies have proposed simple and efficient methods to visualize networks. Fruchterman and Reingold [7] provided some useful methods to place vertices at proper locations and make visualization easier. However, once a network is big, the analysis becomes time-consuming and the network features are still difficult to visualize clearly.

Among complicated network patterns, recent research has focused on exploring network patterns through graph partition and network communities. Graph partition is an important issue in network recognition. Its main question is how to divide a graph into subgraphs so that the number of edges connecting different subgraphs is minimized. For example, Karypis and Kumar [8] provided a multilevel partition algorithm based on coarsening and uncoarsening procedures. Arora et al. [9] developed an $O(\sqrt{\log n})$-approximation algorithm that can quickly partition a graph into two parts by an $\ell_2^2$-representation approach. Similar but not exactly equivalent to the graph partition problem, which seeks the minimum number of cut edges, the community detection problem focuses on finding subgraphs so that each subgraph is densely connected.
Given the reachability of community and the geodesic distance among community members, Wasserman and Faust [10] defined two important quantities: (1) the n−clique is a maximal subgraph in which the largest geodesic distance between any two nodes is no greater than n, and (2) the n−club restricts the geodesic distance within the subgraph to be no greater than n. These n−cliques and n−clubs are possible communities under a weaker definition. Furthermore, the modularity-based method [11] considered a modularity measure to evaluate the similarity of groups. A spectral optimization method is used to classify network communities. Bickel and Chen [12] further investigated the theoretical properties of the modularity measure. Bayesian models, which were developed by Handcock et al. [13] and Heard et al. [14], deduced network properties and assigned cluster labels to vertices by posterior samples. Readers who are interested in other community detection approaches and definitions can refer to Wasserman and Faust [10] and Tang and Liu [15].

JAS Copyright © 2017 Isaac Scientific Publishing Journal of Advanced Statistics, Vol. 2, No. 2, June 2017 79

Although these models and methodologies are helpful in understanding network patterns, they are commonly limited by their computational efficiency. Specifically, they might not be helpful or informative if one needs to characterize a complex network within a limited amount of time. For visualization purposes, an initial step in understanding a network is still raw visual recognition. In addition, traditional approaches commonly use geodesic distances between pairs of vertices in a network. If we want to quickly glimpse the structure of a network according to the distances among vertices, it is essential to develop a simple representation that summarizes the network. In general, we want to construct an algorithm that achieves the following goals:

– In terms of computing efficiency, the algorithm can quickly visualize and explore a network.
– In terms of graph partition, the algorithm can label vertices in a network.
– In terms of cluster pattern visualization, the same labels are assigned to close vertices.
– In terms of extended information, the proposed approach can provide additional information on top of the network partition.

In order to achieve the above goals, we propose a popular but old-school approach, graph coloring, to visualize and explore networks in a simple way. Graph coloring is one of the most useful tools developed in graph theory. It originally arose from the four-color problem and has been widely used in other fields such as timetabling and scheduling, register allocation, printed circuit board testing, and so on [16]. In operations research, many studies have used this approach to deal with scheduling problems and to make decisions [17, 18]. In compiler optimization, graph coloring is an important tool for solving the register allocation problem [19, 20]. The original purpose of graph coloring is to find the chromatic number, i.e., the minimum number of colors needed to label a graph.
This leads to a partition of a graph into small groups, where each group has good properties such as being an n−clique or an n−club (for recognizing partition and cluster patterns).

Although graph coloring is a promising approach for characterizing a network, it runs into some difficulties when applied to network visualization. One of the main difficulties is the meaning of the labels produced by graph coloring. The most important and fundamental property of graph coloring is that “two adjacent vertices must be labeled with different colors”. Equivalently, “two vertices with the SAME color must NOT be adjacent”. Suppose the graph under study is denoted as $G$. Graph coloring on $G$ sheds light on the dependence within cliques, i.e., the members of a clique must all have different colors. On the other hand, we consider coloring the complement of the graph, $\overline{G}$, where two vertices of $\overline{G}$ are connected if and only if they are not connected in $G$. Based on the above “opposite” property, two vertices with the SAME color must NOT be adjacent in the colored $\overline{G}$. Interestingly, this means that vertices with the same color are adjacent in the original graph $G$. Thus, the colors of $\overline{G}$ can serve as partition labels. Based on this idea, we can apply graph coloring to partition a network.

In this study, we propose the greedy vertex coloring approach for coloring a graph. The greedy vertex coloring approach has been shown to run in O(n + m) time [21], which is efficient and acceptable for a large-scale network. Thus, graph coloring is a good approach to achieve our goals. The remainder of this paper is organized as follows. First, we introduce some preliminary concepts of graph theory and the basic algorithm for coloring in section 2.
In section 3, we consider the complements of different edge representations for partitions by graph coloring, and then we use the Karate network [22] to demonstrate our proposed method with an emphasis on aspects such as labeling the complement results, the representation of partitions, and statistics of labels and edges. Section 4 shows the feasibility and efficiency of our proposed approach using other famous networks. The last section provides the discussion and the final conclusion of the paper.

2 Preliminary Definitions and Graph Coloring

In this study, we focus on the undirected network and vertex coloring only. Since an undirected network is simpler to visualize than a directed network, we can directly see the patterns via graph coloring.

2.1 Preliminary Definitions

A simple graph is an ordered pair G = (V,E), where V = {v1, . . . , vn} is a set of vertices and E = {e1, . . . , em} is a set of edges. In practice, each edge is denoted as (vi, vj), where the vertices vi and vj are known as the endpoints of the edge. Additionally, they are said to be adjacent to each other and are also called neighbors of each other. The adjacency matrix of G is an n by n square matrix with entries aij = 1 if vi and vj are adjacent and aij = 0 otherwise. The degree of a vertex v is the number of neighbors of v. A u-v path is a sequence of distinct vertices {u = v0, v1, . . . , vp−1, vp = v} in which any two consecutive vertices are joined by an edge. The shortest path from u to v is a u-v path containing the minimum number of edges among all paths from u to v, and the distance between two vertices, denoted as dG(u, v), is the number of edges in a shortest path from u to v. A graph is connected if it has at least one u-v path for every pair u, v ∈ V. A colored graph is a graph in which each vertex is assigned a color. In this study, we use proper vertex coloring, namely two vertices are assigned distinct colors if they are adjacent. The chromatic number of a graph G, denoted as χ(G), is the least number of distinct colors needed to properly color G. The complement or inverse of a simple graph G, denoted as $\overline{G}$, is a graph with the same vertex set as G in which any two distinct vertices are adjacent if and only if they are not adjacent in G.
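The definitions of the adjacency matrix, the degree, and the complement can be illustrated with a short sketch. The helper names (`adjacency_matrix`, `complement`, `degrees`) and the 4-vertex path graph are illustrative assumptions and do not come from the paper.

```python
def adjacency_matrix(n, edges):
    """Build the n-by-n 0/1 adjacency matrix of an undirected simple graph."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    return A


def complement(A):
    """Adjacency matrix of the complement: flip every off-diagonal entry."""
    n = len(A)
    return [[1 if i != j and A[i][j] == 0 else 0 for j in range(n)]
            for i in range(n)]


def degrees(A):
    """Degree of each vertex = number of neighbors = row sum."""
    return [sum(row) for row in A]


# Hypothetical toy graph: the path 0 - 1 - 2 - 3.
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
A_bar = complement(A)  # its edges are (0, 2), (0, 3) and (1, 3)
```

In this toy graph, vertices 0 and 3 are not adjacent in the path, so they are adjacent in its complement.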

2.2

Graph coloring is an approach in graph theory for labeling a graph. It uses “colors” to assign labels to a graph subject to certain constraints. This study only considers vertex coloring, i.e., coloring the vertices of a graph so that no two adjacent vertices share the same color. There are many different algorithms to color a graph [23, 24, 25]. Some assign the colors evenly, while others sequentially assign to each vertex the smallest color that is not already assigned to its current neighbors. We call the latter methods greedy coloring. Since graph coloring is an NP-complete problem, most of these algorithms cannot guarantee the least number of colors. However, greedy algorithms provide a quick procedure to assign colors. We refer readers to Kosowski and Manuszewski [21] for details on the properties of graph coloring and some other coloring techniques.

Given a graph G = (V,E), we define the neighborhood of ν, denoted as NG(ν), as the set of vertices adjacent to ν, where ν can be a single vertex or a set of vertices. The greedy algorithm proceeds as in Algorithm 1. In the algorithm, each color is indexed by a positive integer, and each uncolored vertex is labeled as 0. In general, greedy coloring assigns the smallest proper integer to an uncolored vertex first. We use “the minimum degree” criterion in steps 2 and 4 to select nodes in the coloring procedure. To provide a unified and reproducible coloring result, “the smallest ID” criterion is suggested in step 4 when more than one vertex has the same degree. Thus, the IDs must be set before applying our approach. Under the minimum degree criterion, vertices with lower degrees tend to be assigned to the same color classes (smaller integers). On the other hand, vertices with higher degrees often carry more important information.
We can clearly view these vertices labeled as larger integers. It should be noted that this algorithm is applied to a “connected” graph. For a disconnected graph, the algorithm can be applied to each connected component separately. Please refer to the Appendix for a simple illustration of the algorithm on a toy example.

2.3 Demonstration of Greedy Coloring

To make the algorithm and the following discussion easier to follow, we use the karate network [22] to demonstrate the utility of graph coloring. The karate network is an undirected social network of friendships among 34 members of a karate club at a US university in the 1970s, with 77 edges. Fig. 1 shows the network of karate club membership. The edges represent the interactions of the members both during and after the club lessons. The labels next to the vertices are the node IDs, and those inside the vertices are the degrees of the vertices. Since graph coloring implies that two vertices are labeled with different colors if they are adjacent, the frequency of labels provides useful information about a network. If we treat colors as numbers, then the largest number indicates the possible maximum size of a clique in the network. These cliques correspond to possible groups or communities of interest. Therefore, we can quickly check the


Algorithm 1 Greedy Coloring
procedure GreedyColoring(G = (V,E))
  step 1 Initially set the colors of vertices as 0, i.e., c(vi) ← 0 ∀vi ∈ V.
  step 2 Select the vertex with the minimum degree, called v1.
  step 3 Assign c(v1) ← 1 and set ν = {v1} to obtain NG(ν).
  for j = 2 to n do
    step 4 Select the vertex with the minimum degree among the neighbor list NG(ν), called vj. (When more than one vertex has the same degree, “the smallest ID” criterion is suggested.)
    step 5 Select the minimum color that is not assigned to NG(vj), i.e.,
           $c(v_j) = \min\{k \in \mathbb{N} \mid k \neq c(w)\ \forall w \in N_G(v_j)\}$.
    step 6 Set $\nu = \bigcup_{i=1}^{j} v_i$.
  end for
  step 7 return c (the vector of colors)
end procedure
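One way to read Algorithm 1 is the following minimal Python sketch. It assumes that the graph is connected, that it is given as a dictionary mapping each vertex ID to the set of its neighbors, and that "the neighbor list NG(ν)" in step 4 means the not-yet-colored vertices adjacent to the already colored set. The function name `greedy_coloring` and the toy graph are illustrative and do not come from the paper.

```python
def greedy_coloring(adj):
    """Greedy vertex coloring with minimum-degree / smallest-ID selection.

    adj: dict mapping vertex ID -> set of neighbor IDs (connected graph assumed).
    Returns a dict vertex -> color, with colors indexed by positive integers.
    """
    color = {v: 0 for v in adj}                  # step 1: 0 means "uncolored"
    degree = {v: len(adj[v]) for v in adj}

    # step 2: minimum degree, ties broken by smallest ID
    start = min(adj, key=lambda v: (degree[v], v))
    color[start] = 1                             # step 3
    colored = {start}
    frontier = set(adj[start])                   # uncolored neighbors of the colored set

    while frontier:
        # step 4: minimum-degree vertex in the frontier, smallest ID on ties
        v = min(frontier, key=lambda u: (degree[u], u))
        # step 5: smallest positive integer not used by the neighbors of v
        used = {color[w] for w in adj[v]}
        k = 1
        while k in used:
            k += 1
        color[v] = k
        # step 6: enlarge the colored set and refresh the frontier
        colored.add(v)
        frontier |= adj[v]
        frontier -= colored
    return color                                 # step 7


# Hypothetical toy graph: a triangle {1, 2, 3} with a pendant vertex 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
toy_colors = greedy_coloring(toy)
```

On this toy graph the pendant vertex is colored first, and the triangle forces three colors in total, which matches its maximum clique size.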


Figure 1: Karate club network

neighbors of these vertices with the maximum number to view the possible community structure of a network. Based on the frequency of labels, we can quickly draw a rough sketch of a network. Fig. 2 shows the results of graph coloring for the karate network and the frequency of the labels. Only one vertex receives label 5, which suggests that the possible maximum clique size is 5. Checking whether the true maximum cliques in this network have size 5, we found 2 cliques of size 5, both of which contain the vertex “v3” labeled as 5 (the vertices encircled by the black lines shown in Figure 3 (a)).

3 Applications to the Complement of a Simple Graph

Before discussing the coloring results of the complements, we recall the most important and fundamental property of graph coloring: two adjacent vertices must be labeled with different colors. As mentioned in the introduction, applying the graph coloring technique can shed some light on a network. This technique cannot capture all the information in a network. However, it can provide useful information for quickly viewing a complex network.


(a) Graph Coloring (b) Bar Chart of Labels

Figure 2: Graph coloring of karate network and its bar chart of labels

3.1 Complements of Path k Networks and Graph Partitions

We apply the properties of the complement $\overline{G}$ of a graph $G$ to visualize networks in this section. In graph coloring, two adjacent vertices labeled with different colors are connected at distance 1. Obviously, any two adjacent vertices in $\overline{G}$ are nonadjacent in $G$. This implies that these pairs of vertices are at a distance greater than 1 in $G$. On the other hand, vertices with the same color in $\overline{G}$ are adjacent to each other in $G$. For this reason, graph coloring is also a useful tool for graph partition. Before discussing the complement of a graph, we need to introduce some important theorems that bound the chromatic number χ(G) of a graph and justify why we color complements of graphs.

Theorem 1. If the maximum vertex degree of a graph G is ∆, then χ(G) ≤ ∆ + 1.

Theorem 1 provides a preliminary insight on how many colors we use and determines if these colors are helpful in visualizing the network.

Theorem 2. Given the edge probability p, let b = 1/(1 − p). For constant p and constant ε > 0, almost every random graph G with p satisfies

$$\left(\tfrac{1}{2} - \varepsilon\right) \frac{n}{\log_b n} \le \chi(G) \le \left(\tfrac{1}{2} + \varepsilon\right) \frac{n}{\log_b n}.$$

Readers can refer to West et al. [26, Chapter 8] for the details of the proofs. If a network comes from a random network assumption, the chromatic number is governed by the edge probability p. If p is high, additional colors are needed for graph coloring. On the other hand, if p is low, only a few colors are needed to complete the graph coloring. According to Theorems 1 and 2, if a network is to be partitioned into few groups for easier observation, we need a low edge probability to obtain a low χ(G). However, the edge probability of an observed network is usually fixed. Instead, we consider the kth power of G, which is the simple graph $G^k$ with the vertex set V(G) and the edge set $\{(v_i, v_j) \mid d_G(v_i, v_j) \le k\}$. Let M be the adjacency matrix of G. The kth power of M counts the walks of length k in G. It is expressed as

$$M^k = \underbrace{M \times \cdots \times M}_{k \text{ times}},$$

and its entry $[M]^k_{ij}$ is the number of walks of length k from $v_i$ to $v_j$. Thus, the adjacency matrix of the graph $G^k$ is defined as $\Sigma^k$ and its entry is

$$[\Sigma]^k_{ij} = \begin{cases} 1 & \text{if } \sum_{d=1}^{k} [M]^d_{ij} > 0,\ \forall i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$

Based on the adjacency matrix of $G^k$, the graph $G^k$ can be created quickly. A high edge probability can be obtained in this manner. This motivates us to use the complement instead of the original graph. If a long distance k is selected, the result is a highly connected network $G^k$. In contrast, the complement graph of $G^k$, denoted as $\overline{G^k}$, is a network with few edges, and $\overline{G^k}$ is simple to observe. The complement $\overline{G^k}$ can be created from its adjacency matrix Λ, which is obtained by switching the (0, 1) off-diagonal elements in the adjacency matrix of $G^k$, $[\Sigma]^k$. The element (i, j) of the adjacency matrix of $\overline{G^k}$ is expressed as

$$[\Lambda]^k_{ij} = \begin{cases} 1 & \text{if } [\Sigma]^k_{ij} = 0,\ \forall i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$

When the greedy vertex coloring algorithm is applied, two connected vertices in $\overline{G^k}$ are assigned different colors and their distance in $G$ is longer than k. Conversely, the distance between two vertices with the same color is within k. This is just the definition of an n−clique with n = k. Thus, the partitions obtained by our algorithm are n−cliques. Consider the karate network with distance 2 as an example. Fig. 3 (a) shows the $G^2$ graph of the karate network and Fig. 3 (b) is its complement $\overline{G^2}$. We see little information when the network has a large chromatic number. In contrast, its complement clearly reveals some possible partitions.
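The construction of the adjacency matrix of the kth power graph and of its complement can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the matrices are lists of lists, and walks are tracked as 0/1 indicators rather than counts, because only the positivity of the walk counts matters for the definition.

```python
def power_graph_adjacency(M, k):
    """0/1 adjacency matrix of G^k: i ~ j (i != j) iff a walk of length <= k joins them."""
    n = len(M)
    walk = [row[:] for row in M]    # indicator: a walk of the current length exists
    reach = [row[:] for row in M]   # indicator: a walk of length 1..d exists
    for _ in range(k - 1):
        # Boolean matrix product walk * M: extend every walk by one edge.
        walk = [[1 if any(walk[i][t] and M[t][j] for t in range(n)) else 0
                 for j in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(n):
                if walk[i][j]:
                    reach[i][j] = 1
    # Drop the diagonal so that the result is a simple graph.
    return [[1 if i != j and reach[i][j] else 0 for j in range(n)]
            for i in range(n)]


def complement_adjacency(S):
    """Switch the off-diagonal 0/1 entries of S."""
    n = len(S)
    return [[1 if i != j and S[i][j] == 0 else 0 for j in range(n)]
            for i in range(n)]


# Hypothetical toy graph: the path 0 - 1 - 2 - 3.
M = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
lam2 = complement_adjacency(power_graph_adjacency(M, 2))
lam3 = complement_adjacency(power_graph_adjacency(M, 3))
```

For this path, only the two end vertices are farther apart than distance 2, so the complement of the second power keeps a single edge, and the complement of the third power is empty. This mirrors the observation that the complement becomes sparser as k grows.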

(a) Colored Graph $G^2$ (b) The Complement of $G^2$

Figure 3: Graph coloring of karate network with distance 2

Although the number of colors is reduced to 5 in $\overline{G^2}$, the network structure is still not clear enough to observe because of the large number of edges. We therefore simplify the network into small partitions, each consisting of the vertices with the same color. Only the number of vertices in each partition and the number of edges between each pair of colors are recorded. Notice that more edges between two partitions in the complement suggest a weaker connection between them, as shown in the following table and figure.


Table 1: Information of Partition Graph of $\overline{G^2}$ of karate network

G1   G2   |V(G1)|   |V(G2)|   |E(G1,G2)|
1    2    14        5         37
1    3    14        5         23
1    4    14        8         25
1    5    14        2         15
2    3    5         5         5
2    4    5         8         40
2    5    5         2         10
3    4    5         8         40
3    5    5         2         10
4    5    8         2         13

G1 and G2 are the sets of vertices in which all vertices have the same colors/labels in the partition graph. |V (G1)| and |V (G2)| are the numbers of vertices in G1 and G2, respectively. |E(G1,G2)| represents the number of edges between these two sets.

Figure 4: Partition plot of $\overline{G^2}$ of karate network

The number of edges in the complement graph decreases monotonically as the distance increases, which is an important property of the complement of a graph. Again, consider the karate network with distance 3 as an example. Fig. 5 (a) shows the $G^3$ graph of the karate network and Fig. 5 (b) is its complement. Fig. 5 is comparably more informative than Fig. 3. In this example, the complement graph $\overline{G^3}$ of the karate network is a bipartite graph, in which the vertices are partitioned into two disjoint parts. That is, the distance between any two adjacent vertices in $G^3$ is at most 3, and the distance between any two adjacent vertices in $\overline{G^3}$ is greater than 3. Clearly, it is easy to point out which vertices are at a distance greater than 3.

(a) Graph Coloring of $G^3$ (b) Complement Graph of $G^3$

Figure 5: Graph coloring of karate network with distance 3

By considering the distances from 1 to 4, we collect the information of all coloring labels of both original graphs and their complements. These graphs are summarized as stacked bar plots (Fig. 6) in which each segment represents the number of vertices in each color group. According to the changes of


stacks with different distances, it is helpful to visualize the movement of the partition patterns. We can see how the numbers of vertices in the color groups change as the distance increases, in both the graphs and their complements. These results correspond to Theorems 1 and 2. The edge probability gets higher when the distance increases. Thus, stronger connections lead to more colors for graph coloring in the original graphs. In contrast, only a few colors are needed for coloring the complements at larger distances.

(a) Label Frequencies of $G^k$s (b) Label Frequencies of $\overline{G^k}$s

Figure 6: Graph coloring results of karate network by bar charts with distances from 1 to 4, where each segment represents the number of vertices in each color group.

3.2 Statistics of Labels

The labels of graph coloring carry much information, but this information is implicit unless the labels undergo some transformation. We focus on the cluster pattern in this study. In practice, the clustering coefficient [2] is an important quantity that measures the cluster relation of neighbors in networks. The original clustering coefficient of a vertex vi is defined as

$$C_i = \frac{2\,|\{e_{jk} : v_j, v_k \in N_G(v_i),\ e_{jk} \in E(G)\}|}{|N_G(v_i)|\,(|N_G(v_i)| - 1)}, \qquad (1)$$

where NG(vi) is the set of neighbors of vertex vi, and |NG(vi)| is the cardinality of NG(vi). This quantity measures which nodes tend to cluster together by considering the edges between their neighbors. Since our approach also provides clustering information, we are interested in knowing whether the labels can deliver this information by considering the relations of vertices instead of edges. Although we cannot restrict the type of graph coloring, the rule that two adjacent vertices receive different colors always holds. Therefore, the labels of the neighbors of each vertex can be examined. Since an edge forces its two endpoints to have different colors when the original graph is colored, different colors are considered to represent a kind of relation between two vertices. We call the difference in colors among vertices the “variation”. A small variation of labels indicates that the neighbors of a vertex have similar colors and implies that the connections among the neighbors are weak. Two vertices with different colors are not guaranteed to be adjacent. However, a latent mechanism might exist that leads the greedy coloring algorithm to label them with different colors. Thus, the variation of labels is just a comparative quantity that describes the relations among vertices.

Copyright © 2017 Isaac Scientific Publishing JAS 86 Journal of Advanced Statistics, Vol. 2, No. 2, June 2017

In order to compute the variation, we use the variation of categorical data provided in Light and Margolin [27]. The variation of categorical data is expressed as

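$$\rho = \frac{1}{2n} \sum_{i} \sum_{j} d_{ij},$$

where $d_{ij} = d(X_i, X_j)$ is defined as

$$d_{ij} = \begin{cases} 1 & \text{if } X_i \text{ and } X_j \text{ are labeled as different categories,} \\ 0 & \text{if } X_i \text{ and } X_j \text{ are labeled as the same category.} \end{cases}$$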

We adjust this equation to network expression as

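$$V_i = \frac{1}{\binom{|N_G(v_i)|}{2}} \sum_{\substack{j \neq k; \\ \forall v_j, v_k \in N_G(v_i)}} d_{jk}, \qquad (2)$$

where $d_{jk}$ is defined as

$$d_{jk} = \begin{cases} 1 & \text{if } v_j \text{ and } v_k \text{ are labeled as different colors in the original graph } G, \\ 0 & \text{if } v_j \text{ and } v_k \text{ are labeled as the same color in the original graph } G. \end{cases}$$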

On the other hand, when the coloring result of the graph complement is discussed, the same color obviously represents that two vertices belong to the same group. We consider the similarity of the neighbors by adjusting the variation measure. We define the similarity of labels as

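$$S_i = \frac{1}{\binom{|N_G(v_i)|}{2}} \sum_{\substack{j \neq k; \\ \forall v_j, v_k \in N_G(v_i)}} s_{jk}, \qquad (3)$$

and $s_{jk}$ has the contrary definition compared with $d_{jk}$, i.e.,

$$s_{jk} = \begin{cases} 1 & \text{if } v_j \text{ and } v_k \text{ are labeled as the same color in } \overline{G^1}, \\ 0 & \text{if } v_j \text{ and } v_k \text{ are labeled as different colors in } \overline{G^1}. \end{cases}$$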

The $N_G(v_i)$ is obtained under $G$ instead of $\overline{G^1}$, although the coloring result is obtained under $\overline{G^1}$. Since the clustering pattern of the original graph is our focus, we only discuss how the different representations of vertices affect the measures of clustering pattern. Compared with the variation of labels, which tends to overstate the relation between neighbors, the similarity is very conservative. Recall that the variation counts a pair of vertices with different colors as 1, even though such vertices may have no edge between them. The similarity counts a pair of vertices with the same color as 1, even though vertices with different colors may still share an edge. More precisely, two neighbors with the same color in $\overline{G^1}$ are nonadjacent in $\overline{G^1}$ and hence adjacent in $G$, while two neighbors that are adjacent in $G$ must have different colors in $G$. Consequently, the clustering coefficient lies between the similarity and the variation, i.e., Si ≤ Ci ≤ Vi. From this perspective, we can not only visualize networks by graph coloring, but also compute information from the coloring labels. Since graph coloring can produce flexible outputs, more measures besides the variation and similarity can be created if other coloring procedures are applied.

The karate network is used to compute these statistics, including similarity, variation, and clustering coefficient. $\overline{G^1}$ of the karate network is provided in Fig. 7. Since the original karate network contains few edges, its complement shows the opposite pattern and needs many labels for coloring. We expect low similarities because there are few members in each label. We present the different patterns of similarity, variation, and clustering coefficient in Fig. 8. Most peripheral vertices have high values and central vertices have lower values in both clustering coefficient and variation. On the other hand, most vertices in the left part of the network have similarity 0, which implies that the left part seems to have no cluster pattern under this conservative criterion, whereas the middle vertices have some relations with the other vertices.
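The three neighbor-pair statistics in Eqs. (1) to (3) can be sketched together, since each is an average of a 0/1 indicator over pairs of neighbors. The sketch assumes that every unordered pair of neighbors is counted once, so that all three values lie in [0, 1]. The function names, the toy graph, and the two hand-made proper colorings are illustrative assumptions, not results from the paper.

```python
from itertools import combinations


def pair_average(neighbors, indicator):
    """Average a 0/1 indicator over all unordered pairs of neighbors."""
    pairs = list(combinations(sorted(neighbors), 2))
    if not pairs:                      # fewer than two neighbors: define the value as 0
        return 0.0
    return sum(indicator(a, b) for a, b in pairs) / len(pairs)


def clustering(adj, v):
    """Eq. (1): fraction of neighbor pairs joined by an edge in G."""
    return pair_average(adj[v], lambda a, b: b in adj[a])


def variation(adj, colors_g, v):
    """Eq. (2): fraction of neighbor pairs with different colors in the coloring of G."""
    return pair_average(adj[v], lambda a, b: colors_g[a] != colors_g[b])


def similarity(adj, colors_comp, v):
    """Eq. (3): fraction of neighbor pairs with the same color in the coloring of the complement."""
    return pair_average(adj[v], lambda a, b: colors_comp[a] == colors_comp[b])


# Hypothetical toy graph: triangle {1, 2, 3} plus a pendant vertex 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
colors_g = {1: 1, 2: 3, 3: 2, 4: 1}      # a proper coloring of G
colors_comp = {1: 1, 2: 1, 3: 1, 4: 2}   # a proper coloring of the complement of G

s3 = similarity(adj, colors_comp, 3)
c3 = clustering(adj, 3)
v3 = variation(adj, colors_g, 3)
```

For vertex 3 of this toy graph, the ordering Si ≤ Ci ≤ Vi holds with values 1/3, 1/3, and 2/3.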


(a) Graph Coloring of $\overline{G^1}$ (b) Labels of $\overline{G^1}$ in $G^1$ layout

Figure 7: $\overline{G^1}$ of karate network with graph coloring and the coloring result represented in the $G^1$ layout

(a) Similarity (b) Clustering Coefficient (c) Variation

Figure 8: Statistics of Labels of karate network


3.3 Statistics of Edges

We consider the number of edges as an additional use of the complement of a graph. In each complement $\overline{G^k}$, an edge indicates that the distance between a pair of vertices is larger than k. Based on these graphs, the numbers of edges are collected to check the random model assumption. If the distances of the shortest paths are of interest, Blondel et al. [28] provided the density function of the path distance by assuming that the hidden variables follow Poisson random graphs. If a network is a Poisson random graph with homogeneous connection probability p, the probability that the shortest path between any pair of vertices is longer than k is expressed as

$$S(k) = \mathrm{Prob}(D > k) = \exp\left[-\frac{(np)^k}{n}\right], \qquad (4)$$

where n is the number of vertices. Each edge in $\overline{G^k}$ represents that two vertices are at a distance greater than k. Thus, the density of the shortest paths can be estimated by the empirical density, i.e.,

$$\hat{S}(k) = \frac{2\,|E(\overline{G^k})|}{|V(G)|\,(|V(G)| - 1)}. \qquad (5)$$
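Eqs. (4) and (5) can be evaluated with a few lines of Python. The sketch below uses the karate network sizes stated in the text (34 vertices and 77 edges). It assumes that p is estimated by the edge density 2m/(n(n − 1)), which the text does not state explicitly, and the function names are illustrative.

```python
import math


def theoretical_survival(n, p, k):
    """Eq. (4): S(k) = exp(-(n p)^k / n) for a Poisson random graph."""
    return math.exp(-((n * p) ** k) / n)


def empirical_survival(n, complement_edges):
    """Eq. (5): share of vertex pairs that are edges in the complement of G^k."""
    return 2 * complement_edges / (n * (n - 1))


n, m = 34, 77                      # karate network sizes as given in the text
p = 2 * m / (n * (n - 1))          # assumption: p estimated by the edge density

# For k = 1 the complement has every vertex pair that is not an edge of G.
complement_edges_k1 = n * (n - 1) // 2 - m

theory = [round(theoretical_survival(n, p, k), 2) for k in (1, 2, 3, 4)]
empirical_k1 = round(empirical_survival(n, complement_edges_k1), 2)
```

Under this assumption the theoretical values round to 0.87, 0.53, 0.05, and 0.00 for k = 1 to 4, and the empirical value for k = 1 rounds to 0.86.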

Similar to the Q-Q plot used to assess goodness of fit under a normal assumption, a scatter plot of $\hat{S}(k)$ versus $S(k)$ is drawn. If the points tend to fall on a straight (45-degree) line, the network is likely to be a Poisson random graph. Consider the karate network as an example. We create the graphs for all distances and then generate their complements. Based on the above equations, the theoretical and empirical densities of the shortest paths are obtained. The results are given in Table 2 and Figure 9. The similarity between the estimated and theoretical values suggests that this network is possibly a Poisson random graph. Although we cannot provide a statistical test of goodness of fit for these results, we can glimpse the possibility of randomness from the provided information.

Table 2: Information of Density Estimation

Distance k   $\hat{S}(k)$   $S(k)$
1            0.86           0.87
2            0.39           0.53
3            0.14           0.05
4            0.01           0.00

Figure 9: Density of Karate network

4 Partitions of Other Networks

In this section, we apply our proposed method to other classical network examples for the purpose of partitioning, that is, coloring the complements of graphs with different distances. In order to provide a unified output, a framework different from that of Fig. 4 for the karate network is applied to visualize the graph partitions. Since we are interested in the relationships among all partitions, the partition graph is represented by a circular embedding. Thus, we can check all the relations among different partitions. In addition, the sizes of vertices and the widths of edges are set to be equal, but the labels of vertices represent the number of vertices in each group and the types of edges show the strength of connection.


There are four types of edges representing the group connections according to the inter-grouping probability. Suppose the complement of a graph is colored into several groups such that all vertices within a group have the same color. Given a pair of groups $G_s$ and $G_t$, the inter-grouping probability is defined as

$$p_{st} = \frac{|E(G_s, G_t)|}{|V(G_s)|\,|V(G_t)|}, \qquad (6)$$

where $|V(G_s)|$ and $|V(G_t)|$ are the numbers of vertices in $G_s$ and $G_t$, respectively, and $|E(G_s, G_t)|$ is the number of edges between these two groups. A solid dark line represents a strong distance between two groups, with a probability between 0.75 and 1; a dashed dark line represents a less strong distance, with a probability between 0.5 and 0.75; a solid gray line represents a weak distance, with a probability between 0.25 and 0.5; and a dashed gray line represents a very weak distance, with a probability between 0 and 0.25. Here, a strong distance means that the two groups cannot easily be reached from one another. In each case, the bar chart for the different distances is shown with the modularity measure [11] printed at the top, which is defined as

$$Q = \frac{1}{2|E(G)|} \sum_{ij} \left( e_{ij} - \frac{k_i k_j}{2|E(G)|} \right) \delta(v_i, v_j), \qquad (7)$$

where $k_i$ and $k_j$ are the degrees of $v_i$ and $v_j$; $e_{ij}$ is 1 if $v_i$ and $v_j$ are adjacent and 0 otherwise; and $\delta(v_i, v_j)$ is 1 if $v_i$ and $v_j$ are labeled with the same color and 0 otherwise. In order to provide a clear partition, we select the partition with the maximum modularity, excluding distance 1. Although the partitions with distance 1 have a high modularity in some cases, we do not choose them because too many labels are not helpful for clear visualization. Our results leave some uncertainty about whether the suggested number of groups is appropriate, but we suggest this method as an initial step for viewing the structure of a network when the distance between vertices is the main concern.
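As an illustration of Eqs. (6) and (7), the following Python sketch computes the inter-grouping probability, maps it to one of the four edge types, and evaluates the modularity of a labeling. The function names, the toy graph, and the handling of the interval boundaries (upper end inclusive) are assumptions made for this example only. Which edge list is passed in, the original graph or a complement, is left to the caller.

```python
from collections import defaultdict


def inter_group_probability(edges, labels, s, t):
    """Eq. (6) for two distinct groups s and t."""
    size_s = sum(1 for g in labels.values() if g == s)
    size_t = sum(1 for g in labels.values() if g == t)
    between = sum(1 for u, v in edges if {labels[u], labels[v]} == {s, t})
    return between / (size_s * size_t)


def edge_style(p):
    """Map a probability to an edge type (boundary handling is an assumption)."""
    if p > 0.75:
        return "solid dark"
    if p > 0.5:
        return "dashed dark"
    if p > 0.25:
        return "solid gray"
    return "dashed gray"


def modularity(edges, labels):
    """Eq. (7): sum over ordered vertex pairs that share a label."""
    m = len(edges)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total = 0.0
    for i in labels:
        for j in labels:
            if labels[i] == labels[j]:
                e_ij = 1 if j in adj[i] else 0
                total += e_ij - len(adj[i]) * len(adj[j]) / (2 * m)
    return total / (2 * m)


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

p_ab = inter_group_probability(edges, labels, "A", "B")
print(p_ab, edge_style(p_ab), modularity(edges, labels))
```

The double loop in `modularity` follows Eq. (7) term by term and is quadratic in the number of vertices; it is meant for readability rather than speed.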

4.1 Dolphin Network

The dolphin network compiled by Lusseau et al. [29] and Lusseau [30] is an undirected social network of frequent associations between dolphins in a community living off Doubtful Sound, New Zealand. This network contains 62 vertices (dolphins) and 159 edges, each of which indicates that a pair of dolphins was observed to have a statistically significant frequent association. The partition results are shown in Fig. 10. One particular feature of this network is that the maximum distance (diameter) between dolphin pairs is 7. For a network of only 62 vertices, this diameter is large, which suggests that the dolphins are sparsely connected. When distance 4 (Q = 0.3628) is chosen, we obtain the maximum modularity and 4 partitions. The network is mainly partitioned into 2 major groups (A and B in Fig. 10 (b), with 24 and 35 vertices, respectively) that are separated by a small distance (gray solid line).
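The selection rule used throughout this section (take the distance with the largest modularity, but skip distance 1) is simple enough to state as code. The sketch below is illustrative; the function name is hypothetical, and the modularity values are those printed above the bars in Fig. 10 (a), assumed here to correspond to distances 1 to 7 in order.

```python
def select_distance(modularity_by_distance, excluded=(1,)):
    """Return the distance with maximum modularity, ignoring excluded distances."""
    candidates = {d: q for d, q in modularity_by_distance.items()
                  if d not in excluded}
    return max(candidates, key=candidates.get)


# Modularity values read from Fig. 10 (a) for the dolphin network.
dolphin_q = {1: 0.2179, 2: 0.2386, 3: 0.2371, 4: 0.3628,
             5: -0.0283, 6: 0.0444, 7: 0.0}

print(select_distance(dolphin_q))  # prints 4
```

Excluding distance 1 matters only when distance 1 has the largest modularity; in that case the rule falls back to the best of the remaining distances.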

4.2 Les Misérables Network

The Les Misérables network compiled by Knuth [31] is the weighted network of co-appearances of characters in Victor Hugo’s novel “Les Misérables”. It contains 77 vertices (characters) and 254 edges, each of which represents the co-appearance of two characters in one or more scenes. The partition result is shown in Fig. 11. The modularity in the bar chart drops sharply from distance 1 (Q = 0.3039) to distance 2 (Q = 0.1764), and so does the number of groups. When distance 2 is selected, this network is partitioned into 11 groups, and all groups are separated by strong distances (black solid line). This result matches our knowledge of this famous novel: the characters belong to different scenes and eras.

4.3 Football Network

The football network compiled by Girvan and Newman [32] is the network of American football games between Division I colleges during the regular season in Fall 2000. There are 115 vertices (teams)


[Figure 10 panels: (a) Bar Chart of Labels, number of vertices (0 to 60) against distance (1 to 7), with modularity values 0.2179, 0.2386, 0.2371, 0.3628, −0.0283, 0.0444, and 0 printed above the bars; (b) The Suggested Partitions with Distance 4, with groups A (24), B (35), C (2), and D (1).]

Figure 10: Visualization of dolphin network by the coloring of complements of graphs. (a) The stacked bar chart where each segment represents the number of vertices in each color group. (b) The partition with distance 4.

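[Figure 11 panels: (a) Bar Chart of Labels, number of vertices (0 to 80) against distance (1 to 4), with modularity values 0.3039, 0.1764, 0.022, and 0.0059 printed above the bars; (b) The Suggested Partitions with Distance 2, with groups A (37), B (8), C (4), D (7), E (2), F (7), G (1), H (7), I (2), J (1), and K (1).]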

Figure 11: Visualization of Les Misérables network by the coloring of complements of graphs. (a) The stacked bar chart where each segment represents the number of vertices in each color group. (b) The partition with distance 2.

that belong to 12 different conferences, and 616 edges, each of which indicates that two teams played one or more games against each other during this regular season. The partition result is shown in Fig. 12. Unlike the bar chart in Fig. 11, the modularity decreases only slightly from distance 1 (Q = 0.3189) to distance 2 (Q = 0.2467), and the number of groups drops slightly at the same time. This implies that the structures of the Les Misérables network and the football network are different, although our plots cannot show the details of the differences.


When distance 2 is selected, this network is partitioned into 15 groups, and most groups have similar sizes of around 10, except for a few groups with very few members. In fact, the teams in this network belong to 12 college football conferences, which is close to our finding.

[Figure 12 panels: (a) Bar Chart of Labels, number of vertices (0 to 120) against distance (1 to 3), with modularity values 0.3189, 0.2437, and −0.0036 printed above the bars; (b) The Suggested Partitions with Distance 2, with groups A (13), B (8), C (13), D (10), E (10), F (8), G (10), H (8), I (10), J (5), K (5), L (6), M (2), N (1), and O (6).]

Figure 12: Visualization of football network by the coloring of complements of graphs. (a) The stacked bar chart where each segment represents the number of vertices in each color group. (b) The partition with distance 2.

4.4 Political Book Network

The political book network is the network of sales of books about US politics that were published around the time of the 2004 presidential election and sold by an online bookseller. This network contains 105 vertices (books) and 441 edges, each of which indicates that two books were bought at the same time at least once. The network was compiled by V. Krebs and is unpublished, but it can be found on Krebs’ website http://www.orgnet.com/. The partition result is shown in Fig. 13. A great improvement in both modularity and group size is found with distance 3 (Q = 0.3178). Only 7 groups are obtained, and the network is mainly composed of three of them (A, B, and C), which matches the number of original categories of these books (liberal, conservative, and neutral).

4.5 Facebook Network

The Facebook network compiled by Leskovec and McAuley [33] is obtained from 10 ego-networks, consisting of 193 circles and 4,039 users. This network comprises 4,039 vertices (people) and 88,234 edges, each of which indicates that two friends of an ego are also friends with each other. The partition result is shown in Fig. 14. This example demonstrates the utility of our proposed method. Although this is not a big network, one cannot grasp its whole structure quickly. By applying our method, this network is quickly partitioned into 11 groups with distance 2, with a high modularity of 0.684. In fact, this network was compiled by combining 10 ego-networks, so our result is very close to the real structure. Note that the first bar in the bar chart does not consist of a single black group; rather, there are too many labels for the available gray levels to distinguish. Moreover, it takes only around 2 minutes to execute the coloring of the complements of graphs for all 7 distances on our personal computer (Intel Core i7-4770 CPU 3.40GHz).


[Figure 13 panels: (a) Bar Chart of Labels, number of vertices (0 to 100) against distance (1 to 6), with modularity values 0.1773, 0.2917, 0.3178, 0.2623, −0.0058, and −0.0001 printed above the bars; (b) The Suggested Partitions with Distance 3, with groups A (32), B (35), C (29), D (4), E (1), F (2), and G (2).]

Figure 13: Visualization of book network by the coloring of complements of graphs. (a) The stacked bar chart where each segment represents the number of vertices in each color group. (b) The partition with distance 3.

[Figure 14 panels: (a) Bar Chart of Labels, number of vertices (0 to 4000) against distance (1 to 7), with modularity values 0.1112, 0.684, 0.5607, 0.3428, 0.1579, 0.1726, and 0.0021 printed above the bars; (b) The Suggested Partitions with Distance 2, with groups A (1046), B (71), C (748), D (777), E (546), F (146), G (100), H (340), I (170), J (36), and K (59).]

Figure 14: Visualization of Facebook network by the coloring of complements of graphs. (a) The stacked bar chart where each segment represents the number of vertices in each color group. (b) The partition with distance 2.

5 Discussion and Conclusion

It is difficult to glimpse a large-scale network quickly within a limited plot region. In this study, we summarize a large network into a few partitions, according to the distances between its vertices, by the greedy vertex coloring approach, an efficient algorithm that has been verified to run in O(n + m) time. The basic notion of graph coloring is to force two connected vertices to be assigned to different


colors/labels. By considering the complement, two connected vertices become unconnected, and vertices that are close to each other have a high probability of being assigned the same color. We can treat different colors as different groups to glimpse the structure of a network. In general, a network can be partitioned into n-cliques by the graph coloring approach. Thus, we can recognize graph partitions and cluster patterns in a simple way. Based on the labels of vertices after coloring, we obtain further information by summarizing these labels. Several famous networks are used to demonstrate the abilities of our method. If the distances between pairs of vertices are the main concern, our method is a useful tool for summarizing a large network.

Although we leave open the question of whether the partition obtained by the greedy coloring procedure reaches the chromatic number, the partitions provide useful clues for future research. For example, if finding the best communities is of interest, one can start from the partitions suggested by our algorithm and design procedures that add or delete vertices to modify the groups. Even though other coloring methods can provide better coloring results, most of them take more time to color a graph. Using time-consuming algorithms contradicts our initial purpose of computing efficiency. Therefore, we look forward to constructing and searching for better methods to execute graph coloring.

On the other hand, the modularity is suggested as the criterion for selecting the best partition in our demonstrations. However, this criterion does not take the number of groups into account. If parsimonious partitions are considered a good representation, a modified modularity is needed to incorporate this information. Such a modified criterion is part of our future work. Furthermore, if information other than community structure is of interest, other criteria are also needed.

Note that the methods and statistics provided in this study involve no statistical inference. However, this does not imply that no statistical inference can be constructed. For example, the numbers of edges in the complements of graphs with different distances can be used to test the goodness of fit of a random model. The number of edges between two different groups can also serve as a test statistic to verify whether two groups are significantly separated. Further effort is needed to develop these improvements in the future.
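The idea summarized in this section can be sketched as follows: connect two vertices when their distance exceeds k, color that graph greedily, and read each color class as a group whose members are within distance k of one another. This is a minimal sketch under assumptions, not the paper's Algorithm 1: the function names and the toy graph are hypothetical, the vertex order is a simple static (degree, ID) ordering, and the all-pairs breadth-first search is chosen for clarity rather than for the running time quoted above.

```python
from collections import deque


def all_pairs_distances(adj):
    """Breadth-first search from every vertex of an unweighted graph."""
    dist = {}
    for s in adj:
        d = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[s] = d
    return dist


def far_graph(adj, k):
    """Vertices are adjacent here when their distance in adj exceeds k."""
    dist = all_pairs_distances(adj)
    inf = float("inf")
    return {u: {v for v in adj if v != u and dist[u].get(v, inf) > k}
            for u in adj}


def greedy_coloring(graph):
    """Visit vertices in (degree, ID) order and give each the smallest free color."""
    order = sorted(graph, key=lambda v: (len(graph[v]), v))
    color = {}
    for v in order:
        used = {color[u] for u in graph[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


def partition_by_distance(adj, k):
    """Group vertices by the color they receive in the far graph."""
    color = greedy_coloring(far_graph(adj, k))
    groups = {}
    for v, c in color.items():
        groups.setdefault(c, []).append(v)
    return [sorted(g) for g in groups.values()]


# Hypothetical toy graph: two triangles joined by the edge 2 - 3.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(partition_by_distance(toy, 2))
```

Because two vertices with the same color are never adjacent in the far graph, every pair inside a group is at distance at most k, which is the property the partitions in Section 4 rely on.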

Acknowledgments. The authors would like to thank Ms. Wendy Tzu-Wen Kao and Dr. Martin Tshishimbi Wa Lukusa for their help in improving the English of this paper. This work was supported by (a) Career Development Award of Academia Sinica (Taiwan) grant number 103-CDA-M04, (b) Ministry of Science and Technology (Taiwan) grant numbers 104-2118-M-001-016-MY2 and 105-2118-M-001-007-MY2, and (c) Thematic Research Program of Academia Sinica (Taiwan) grant number AS-103-TP-C03.

References

1. P. Erdős and A. Rényi, “On random graphs,” Publicationes Mathematicae Debrecen, vol. 6, pp. 290–297, 1959.
2. D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature, vol. 393, no. 6684, pp. 440–442, 1998.
3. A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.
4. P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker, “Cytoscape: a software environment for integrated models of biomolecular interaction networks,” Genome research, vol. 13, no. 11, pp. 2498–2504, 2003.
5. M. Bastian, S. Heymann, M. Jacomy et al., “Gephi: an open source software for exploring and manipulating networks,” ICWSM, vol. 8, pp. 361–362, 2009.
6. G. Csardi and T. Nepusz, “The igraph software package for complex network research,” InterJournal, Complex Systems, vol. 1695, no. 5, 2006.
7. T. M. Fruchterman and E. M. Reingold, “Graph drawing by force-directed placement,” Software: Practice and experience, vol. 21, no. 11, pp. 1129–1164, 1991.
8. G. Karypis and V. Kumar, “A fast and high quality multilevel scheme for partitioning irregular graphs,” SIAM Journal on scientific Computing, vol. 20, no. 1, pp. 359–392, 1998.
9. S. Arora, S. Rao, and U. Vazirani, “Expander flows, geometric embeddings and graph partitioning,” Journal of the ACM (JACM), vol. 56, no. 2, p. 5, 2009.


10. S. Wasserman and K. Faust, Social network analysis: Methods and applications. Cambridge university press, 1994, vol. 8.
11. M. E. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical review E, vol. 69, no. 2, p. 026113, 2004.
12. P. J. Bickel and A. Chen, “A nonparametric view of network models and Newman–Girvan and other modularities,” Proceedings of the National Academy of Sciences, vol. 106, no. 50, pp. 21068–21073, 2009.
13. M. S. Handcock, A. E. Raftery, and J. M. Tantrum, “Model-based clustering for social networks,” Journal of the Royal Statistical Society: Series A (Statistics in Society), vol. 170, no. 2, pp. 301–354, 2007.
14. N. A. Heard, D. J. Weston, K. Platanioti, and D. J. Hand, “Bayesian anomaly detection methods for social networks,” The Annals of Applied Statistics, vol. 4, no. 2, pp. 645–662, 2010.
15. L. Tang and H. Liu, “Community detection and mining in social media,” Synthesis Lectures on Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 1–137, 2010.
16. D.-Z. Du and P. M. Pardalos, Handbook of combinatorial optimization: supplement. Springer Science & Business Media, 1999, vol. 1.
17. F. T. Leighton, “A graph coloring algorithm for large scheduling problems,” Journal of research of the national bureau of standards, vol. 84, no. 6, pp. 489–506, 1979.
18. P. Hell and J. Nešetřil, “On the complexity of H-coloring,” Journal of Combinatorial Theory, Series B, vol. 48, no. 1, pp. 92–110, 1990.
19. P. Briggs, K. D. Cooper, and L. Torczon, “Improvements to graph coloring register allocation,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 16, no. 3, pp. 428–455, 1994.
20. M. Thorup, “All structured programs have small tree width and good register allocation,” Information and Computation, vol. 142, no. 2, pp. 159–181, 1998.
21. A. Kosowski and K. Manuszewski, “Classical coloring of graphs,” Contemporary Mathematics, vol. 352, pp. 1–20, 2004.
22. W. W. Zachary, “An information flow model for conflict and fission in small groups,” Journal of anthropological research, pp. 452–473, 1977.
23. A. Hertz and D. de Werra, “Using tabu search techniques for graph coloring,” Computing, vol. 39, no. 4, pp. 345–351, 1987.
24. D. S. Johnson, C. R. Aragon, L. A. McGeoch, and C. Schevon, “Optimization by simulated annealing: an experimental evaluation; part ii, graph coloring and number partitioning,” Operations research, vol. 39, no. 3, pp. 378–406, 1991.
25. P. Galinier and J.-K. Hao, “Hybrid evolutionary algorithms for graph coloring,” Journal of combinatorial optimization, vol. 3, no. 4, pp. 379–397, 1999.
26. D. B. West et al., Introduction to graph theory. Prentice hall Upper Saddle River, 2001, vol. 2.
27. R. J. Light and B. H. Margolin, “An analysis of variance for categorical data,” Journal of the American Statistical Association, vol. 66, no. 335, pp. 534–544, 1971.
28. V. D. Blondel, J.-L. Guillaume, J. M. Hendrickx, and R. M. Jungers, “Distance distribution in random graphs and application to network exploration,” Physical Review E, vol. 76, no. 6, p. 066101, 2007.
29. D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson, “The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations,” Behavioral Ecology and Sociobiology, vol. 54, no. 4, pp. 396–405, 2003.
30. D. Lusseau, “The emergent properties of a dolphin social network,” Proceedings of the Royal Society of London. Series B: Biological Sciences, vol. 270, no. Suppl 2, pp. S186–S188, 2003.
31. D. E. Knuth, The Stanford GraphBase: a platform for combinatorial computing. Addison-Wesley Reading, 1993, vol. 37.
32. M. Girvan and M. E. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences, vol. 99, no. 12, pp. 7821–7826, 2002.
33. J. Leskovec and J. J. McAuley, “Learning to discover social circles in ego networks,” in Advances in neural information processing systems, 2012, pp. 539–547.


Appendix: Toy Example of Greedy Coloring Algorithm

Here is a toy example to illustrate the greedy coloring approach and its procedure based on Algorithm 1. We construct a synthetic network with 6 vertices and 11 edges. Fig. 15 (a) shows the vertex degrees of this network. In our algorithm, we start the coloring at the vertex with the smallest degree and then sequentially color the graph from this vertex. Fig. 15 (b) shows the order in which the vertices are colored by our algorithm. If more than one adjacent vertex has the same smallest degree, we select the one with the smallest vertex ID, so V2 is the first node to be assigned a label. Following this order, the network is colored successively, and the coloring result is shown in Fig. 15 (c). Moreover, nodes {V1,V2,V5,V6} form a clique, so at least 4 colors are needed. Since our coloring uses exactly 4 colors, the chromatic number of this graph is 4 and our algorithm reaches the chromatic number in this example.

[Figure 15 panels, with the value shown inside each vertex:]

Vertex    (a) Degree    (b) Coloring Order    (c) Labels of Colors
V1        5             6                     4
V2        3             1                     1
V3        4             2                     2
V4        3             3                     1
V5        3             4                     2
V6        4             5                     3

Figure 15: Graph coloring for the synthetic network where the labels inside the vertices are the node degrees.
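The optimality argument used in this appendix (a proper coloring whose number of colors equals the size of some clique must be optimal) can be checked mechanically. The sketch below is illustrative only: the helper names are hypothetical, and the edge list is a made-up graph with 6 vertices and 11 edges, not the edge list of Fig. 15, which is not given in the text.

```python
from itertools import combinations


def is_proper_coloring(edges, color):
    """True if no edge joins two vertices of the same color."""
    return all(color[u] != color[v] for u, v in edges)


def is_clique(edges, vertices):
    """True if every pair of the given vertices is joined by an edge."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(pair) in edge_set
               for pair in combinations(vertices, 2))


# Hypothetical graph (NOT the graph of Fig. 15): 6 vertices, 11 edges.
edges = [(1, 2), (1, 5), (1, 6), (2, 5), (2, 6), (5, 6),
         (1, 3), (1, 4), (3, 4), (3, 6), (4, 6)]
# Hypothetical coloring of that graph with 4 colors.
color = {1: 4, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3}
clique = [1, 2, 5, 6]

n_colors = len(set(color.values()))
# A clique of size c forces at least c colors, so a proper coloring
# with exactly c colors attains the chromatic number.
optimal = (is_proper_coloring(edges, color)
           and is_clique(edges, clique)
           and n_colors == len(clique))
print(n_colors, optimal)
```

The check certifies optimality only when such a clique is found; a greedy coloring that uses more colors than the largest known clique may or may not be optimal.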
