Cliques and a New Measure of Clustering: with Application to U.S. Domestic Airlines
Steve Lawford† and Yll Mehmeti
Data, Economics and Interactive Visualization (DEVI) group, ENAC (University of Toulouse), 7 avenue Edouard Belin, CS 54005, 31055 Toulouse Cedex 4, France
†Corresponding author. Email: [email protected]
arXiv:1806.05866v3 [cs.SI] 23 Aug 2020
Abstract We propose a higher-order generalization of the well-known overall clustering coefficient for triples C(3) to any number of nodes. We give analytic formulae for the special cases of three, four, and five nodes and show that they have very fast runtime performance for small graphs. We discuss some theoretical properties and limitations of the new measure, and use it to provide insight into dynamic changes in the structure of U.S. airline networks.
1 Introduction
Complex networks are widely used to describe important systems, with applications to biology, technology and infrastructure, and social and economic relationships [4, 5, 25, 59, 69, 83]. A network or “graph” involves a set of nodes or “vertices” that are linked by edges. For example, an airline company’s transportation of passengers can be thought of as a network of airports (nodes) joined by routes that have regular service (edges). The statistical physics and graph theory communities have focused in particular on the topology and dynamics of random and real-world networks, and have been successful in identifying robust structural features and organizational principles.1 These include the small-world property, characterized by systems that are highly clustered but have short characteristic path lengths; and scale-free networks, which means that the number of neighbours of a node, or its “degree”, follows a power-law distribution whereby the topology of the system is dominated by a few high degree nodes [7, 22, 72]. One common property of networks is clustering or “transitivity”, which measures the relative frequency with which two neighbours of a given node are also neighbours of one another, forming a connected triangle of nodes. Many real-world networks display higher levels of clustering than would be expected if those networks were
random, with nodes creating tightly connected groups [4, 58, 69, 72]. Clustering is especially important in economic and social networks, and there is strong evidence that it is related to cooperative social behaviour and beneficial information and reputation transfer [41, 42, 44, 59]. Other recent examples of the empirical application of graphs in economics include [6, 32, 42] (social networks) and [3, 23, 31, 40, 64] (financial networks). In this paper, we focus our empirical application on air transportation. Recent work has considered air cargo networks [12, 53], the world-wide airport network [19, 37, 38, 51, 67, 71], and airline networks in the U.S. [2, 9, 35, 50, 65, 66, 73], Europe [63], and China [18, 30]; good surveys of research in this area appear in [52, 65, 79]. Typically, these papers report a selection of summary statistics to capture global or local aspects of the network, and provide insight into topology and dynamics that would not be available from other methods.
1The field is continually expanding and is far too large to survey here. We thank one anonymous referee for pointing us towards recent work on edge prediction [80] and multiplex models [81, 82].
One widely used measure of clustering is the overall clustering coefficient or "transitivity", which is defined in [8, 59, 60, 61, 62] as, in our notation,

C(3) = (3 × number of triangles in the network) / (number of connected triples of vertices),  (1)

where a connected triple is a set of three distinct nodes u, v and w, such that at least two of the possible edges between them exist. In a social network this measures how often an individual's "friends" are also friends with one another, on average, across the entire network. An alternative measure of clustering, the average clustering coefficient, takes a different approach to (1): it is computed locally for each node, and then averaged across all nodes.2 We focus on overall clustering (transitivity) in this paper. There is now substantial evidence that significant topological structures (known as "graphlets", "motifs" or "subgraphs"), on more than three nodes, can be found in real-world networks, and that they may perform precise specialized functions [1, 10, 11, 14, 45, 55, 74]. Since the usual clustering coefficient C(3) is based upon connected triples of nodes, it is natural to ask whether a similar measure can be derived for any number of nodes. A generalized clustering coefficient could potentially identify hidden higher-order clustering and enable a better understanding of the structure of real-world networks. The need to go beyond usual three-node average clustering has been addressed by [16] (cycles of length four), who find that "grid clustering" scales with node degree in a similar way to usual clustering, [34] (shortest paths of length greater than one between a node's neighbours), who confirm the absence of clustering in Barabási–Albert preferential attachment (scale-free) models, and [46] (connectivity of more than one of a node's nearest neighbours), who investigate scaling properties using data on urban street networks.
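As a quick illustration of definition (1), a minimal sketch in Python (all names hypothetical) that counts triangles and connected triples directly; note that each triangle contributes three connected triples, one centred on each of its nodes:

```python
from itertools import combinations

def overall_clustering(edges):
    """C(3) = 3 * (number of triangles) / (number of connected triples),
    following definition (1). Connected triples are counted once per
    centre node, so each triangle contributes three of them."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    triangles = sum(1 for a, b, c in combinations(sorted(adj), 3)
                    if b in adj[a] and c in adj[a] and c in adj[b])
    triples = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    return 3 * triangles / triples

# Triangle {1,2,3} with a pendant node 4: one triangle, five triples.
tadpole = [(1, 2), (1, 3), (2, 3), (1, 4)]
print(overall_clustering(tadpole))  # 0.6
```

On a complete graph the numerator and denominator coincide (every triple is closed), so the function returns 1.0, as expected.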
The work that is most closely related to the present paper is by Yin-Benson-Leskovec [76], hereafter referred to as YBL, who propose new overall and average clustering coefficients based on the relative frequency of cliques of order greater than three. Our work differs essentially from YBL's overall coefficient in the way that we define the "relative frequency", but also in the methods that are used for computation, and the motivation and empirical application. We draw careful comparisons between our method and YBL, for theoretical Erdős–Rényi random graphs and simulated small-world models, and argue that the two approaches are complementary.3 In this paper, we make the following specific contributions:
• We propose a new generalized clustering coefficient C(b), based upon connected groups of b nodes, which nests the standard clustering coefficient C(3). We develop a very fast analytic implementation for connected groups of three, four and five nodes, that we show to be up to 2,000 times faster than a naïve nested loop algorithm, for some small dense graphs.
• We examine the theoretical properties of C(b) for Erdős–Rényi and small-world random graphs, and lollipop graphs, and draw comparisons with YBL. We show that it will become increasingly difficult to compute C(b) efficiently as b becomes large, even using analytic formulae. Using dynamic data on U.S. airline networks, we also observe that C(b) can be highly correlated across b, and with network density. When we control for lower-order clustering, we find low to moderate higher-order clustering in these networks. It is not known whether this finding holds generally for large classes of networks.
All of the analytic formulae that we use, and several proofs, are collected in Appendix A, and additional figures and tables are reported in Appendix B.
2Overall clustering (1) assigns the same weight to every triangle in the graph. Average clustering gives each node the same weight. Since high-degree nodes may be adjacent to more triangles than low-degree nodes, overall and average clustering can give different values.
3We became aware of the excellent [76] after the original version of our paper had been completed and submitted to the arXiv repository. The present paper contains substantial new material to address this omission. There is related work by YBL-Gleich [75] and by YBL [77].
2 Graph Theory and Clustering
We briefly review some relevant tools of graph theory. Important monographs include [29] (mathematics), [41] (economics of social networks) and [47] (algorithms). A graph is an ordered pair G = (V,E) where V and E denote the sets of nodes and edges of G, respectively. We use n = |V| and m = |E| to represent the numbers of nodes and edges of G. A graph has an associated n × n adjacency matrix g, with representative element (g)_ij that takes value one when an edge is present between nodes i and j, and zero otherwise. We also use (i, j) ∈ E to denote an edge between nodes i and j, and say that they are directly-connected. A graph is simple and unweighted if (g)_ii = 0 (no self-links) and (g)_ij ∈ {0,1} (no pair of nodes is linked by more than one edge, or by an edge with a weight that is different from one). A graph is undirected if (g)_ij = (g)_ji. A walk between nodes i and j is a sequence of edges {(i_r, i_{r+1})}_{r=1,...,R} such that i_1 = i and i_{R+1} = j, and a path is a walk with distinct nodes. A graph is connected if there is at least one path between any pair of nodes i and j; otherwise the graph is disconnected. A bridge is an edge the removal of which will disconnect the graph. In this paper, we consider simple, unweighted, undirected and connected graphs. The degree k_i = ∑_j (g)_ij is the number of nodes that are directly-connected to node i, and the (1-degree) neighbourhood of node i in G, denoted by Γ_G(i) = { j : (i, j) ∈ E}, is the set of all nodes that are directly-connected to i. The density d(G) = 2m/(n(n−1)) is the number of edges in G relative to the maximum possible number of edges in a graph with n nodes. A graph G′ = (V′,E′) is a subgraph of G if V′ ⊆ V and E′ ⊆ E, where (i, j) ∈ E′ implies that i, j ∈ V′. A tree is a connected graph with no cycles.
A spanning tree on a connected G is a connected subgraph with nodes V and the minimum possible number of edges m = n − 1. A complete graph on n nodes, K_n, has all possible edges, and a complete subgraph on b nodes is called a b-clique. A maximal clique is a clique that cannot be made larger by the addition of another node in G with its associated edges, while preserving the complete-connectivity of the clique. A maximum clique is a (maximal) clique of the largest possible size in G, and the clique number w(G) of the graph G is the number of nodes in a maximum clique in G. Let G(n, p) be an Erdős–Rényi random graph with nodes V = {1,...,n} and edges that arise independently with a constant edge-formation probability p, giving a statistically homogeneous network that has, on average, (n − 1)p edges for a given node, and n(n−1)p/2 randomly-distributed edges in total. We also use the lollipop graph L(b, n − b), with n nodes and b(b−3)/2 + n edges, and 3 ≤ b ≤ n. The lollipop can be thought of as a b-clique K_b that is attached by a bridge to a path graph on n − b nodes.4 Note that L(n,0) is the complete graph K_n. Using the notation of [1], we refer to particular topological subgraphs by M_a^(b), where b is the number of nodes in the subgraph, and a is the decimal representation of the smallest binary number derived from a row-by-row reading of the upper triangles of each adjacency matrix g from the set of all topologically-identical subgraphs on the same b nodes (also see [48]).
2.1 Analytic formulae for a generalized clustering coefficient

The clustering coefficient C(3) is bounded by 0 ≤ C(3) ≤ 1, attaining the minimum when there are no triangles in the graph, and taking the maximum value for a complete graph K_n. Since each triangle contains three triples of nodes, a factor of three appears in the numerator of (1). A naïve algorithm that is based on nested loops, and considers every distinct triple of nodes in G, will run in O(n³) time. However, it is easy to write down an analytic version of C(3), using the nested subgraph enumeration formulae in [1, equations (1) and (2)]:
C(3) = 3|M_7^(3)| / |M_3^(3)| = tr(g³) / ∑_i k_i(k_i − 1),  (2)

where C(3) makes explicit the definition of clustering in terms of triples M_3^(3) and triangles M_7^(3).
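The right-hand side of (2) can be checked numerically with nothing more than integer matrix powers; a minimal sketch in pure Python (names hypothetical):

```python
def matmul(A, B):
    """Integer matrix product, sufficient for small adjacency matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def clustering_from_traces(g):
    """C(3) = tr(g^3) / sum_i k_i(k_i - 1), the analytic form in (2).
    tr(g^3) is six times the number of triangles; the denominator is
    twice the number of connected triples."""
    g3 = matmul(matmul(g, g), g)
    trace = sum(g3[i][i] for i in range(len(g)))
    denom = sum(sum(row) * (sum(row) - 1) for row in g)
    return trace / denom

# Triangle on nodes 0,1,2 with a pendant node 3 attached to node 0.
g = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
print(clustering_from_traces(g))  # 0.6
```

Here tr(g³) = 6 (one triangle) and ∑_i k_i(k_i − 1) = 10 (five connected triples), so the ratio agrees with the direct count from (1).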
4The lollipop was first introduced by [49, Example 2] in the one parameter case b = n/2, and was generalized to two parameters by [15]. It has applications in the fields of combinatorics (Ramsey theory) [33] and linear algebra (spectral theory) [13, 39].
If we instead interpret (2) as the average probability that any three connected nodes in a graph are also completely-connected, then a natural generalization follows to any number b of nodes, such that 3 ≤ b ≤ n. In this paper, we define the generalized clustering coefficient as follows:
C(b) = a(b) × (number of b-cliques in G) / (number of b-spanning trees in G),  (3)

where Cayley's formula a(b) = b^(b−2) gives the number of spanning trees in K_b, and ensures that 0 ≤ C(b) ≤ 1. Clearly, C(b) nests C(3), and equals zero if and only if there are no b-cliques in the graph. It is natural that the generalized clustering coefficient should attain its maximum value for a complete graph, in the same way as C(3), and we show this in:
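Cayley's formula a(b) = b^(b−2) can be verified mechanically via the Matrix-Tree theorem, which states that the number of spanning trees of a graph equals any cofactor of its Laplacian. A sketch under that assumption, with exact rational arithmetic (names hypothetical):

```python
from fractions import Fraction

def spanning_tree_count(g):
    """Matrix-Tree theorem: delete one row and column of the Laplacian
    L = D - g and return the determinant (computed exactly)."""
    n = len(g)
    deg = [sum(row) for row in g]
    L = [[Fraction(deg[i]) if i == j else Fraction(-g[i][j])
          for j in range(n - 1)] for i in range(n - 1)]
    det = Fraction(1)
    for c in range(n - 1):
        pivot = next((r for r in range(c, n - 1) if L[r][c] != 0), None)
        if pivot is None:
            return 0  # singular minor: the graph is disconnected
        if pivot != c:
            L[c], L[pivot] = L[pivot], L[c]
            det = -det
        det *= L[c][c]
        for r in range(c + 1, n - 1):
            f = L[r][c] / L[c][c]
            L[r] = [L[r][j] - f * L[c][j] for j in range(n - 1)]
    return int(det)

# K_b has b^(b-2) spanning trees, which is why a(b) normalizes C(b) in (3).
for b in range(3, 8):
    K = [[int(i != j) for j in range(b)] for i in range(b)]
    assert spanning_tree_count(K) == b ** (b - 2)
```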
Proposition 2.1. Let G be a connected graph with at least b nodes (b ≥ 3). Then C(b) = 1 if and only if G is complete.
See Appendix A for a proof of Proposition 2.1. A naïve algorithm for (3), based on nested loops, will run in O(n^b) time. For example, the denominator of (3) can be calculated by considering every distinct set of b nodes in G, and counting the number of spanning trees on each subgraph. This will be excessively slow. If we instead think of C(b) as a measure of the prevalence of b-cliques relative to all connected groups of b nodes, then it is clear that we can use analytic subgraph enumeration for counting the cliques and the spanning trees for the special cases C(4) and C(5), in the same way as for (2):
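The naïve O(n^b) scheme just described can be sketched directly: enumerate every b-subset, test for a clique, and count spanning trees on each induced subgraph via the Matrix-Tree theorem. This is a slow reference implementation rather than the analytic route, and all names are hypothetical:

```python
from fractions import Fraction
from itertools import combinations

def spanning_tree_count(g):
    """Cofactor of the Laplacian L = D - g (Matrix-Tree theorem);
    returns 0 when the subgraph is disconnected."""
    n = len(g)
    deg = [sum(row) for row in g]
    L = [[Fraction(deg[i]) if i == j else Fraction(-g[i][j])
          for j in range(n - 1)] for i in range(n - 1)]
    det = Fraction(1)
    for c in range(n - 1):
        pivot = next((r for r in range(c, n - 1) if L[r][c] != 0), None)
        if pivot is None:
            return 0
        if pivot != c:
            L[c], L[pivot] = L[pivot], L[c]
            det = -det
        det *= L[c][c]
        for r in range(c + 1, n - 1):
            f = L[r][c] / L[c][c]
            L[r] = [L[r][j] - f * L[c][j] for j in range(n - 1)]
    return int(det)

def generalized_clustering(edges, b):
    """Naive C(b) of (3): a(b) * (#b-cliques) / (#b-spanning trees)."""
    nodes = sorted({u for e in edges for u in e})
    E = {frozenset(e) for e in edges}
    cliques = trees = 0
    for subset in combinations(nodes, b):
        g = [[int(u != v and frozenset((u, v)) in E) for v in subset]
             for u in subset]
        if all(sum(row) == b - 1 for row in g):
            cliques += 1
        trees += spanning_tree_count(g)
    return b ** (b - 2) * cliques / trees

tadpole = [(1, 2), (1, 3), (2, 3), (1, 4)]
print(generalized_clustering(tadpole, 3))  # 0.6, matching C(3)
```

Disconnected b-subsets contribute zero spanning trees, so only connected groups of b nodes enter the denominator, exactly as in (3).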
C(4) = 16|M_63^(4)| / (|M_11^(4)| + |M_13^(4)|) = 4 ∑_i tr(g_{−i}³) / (∑_i k_i(k_i − 1)(k_i − 2) + 6 ∑_{(i,j)∈E} (k_i − 1)(k_j − 1) − 3 tr(g³)),  (4)

where g_{−i} denotes the adjacency matrix of the subgraph of G induced by the neighbourhood Γ_G(i).
C(5) = 125|M_1023^(5)| / (|M_75^(5)| + |M_77^(5)| + |M_86^(5)|)
     = 25 ∑_i ∑_{j∈Γ_G(i)} tr(((g_{−i})_{−j})³) / (∑_i k_i(k_i − 1){(k_i − 2)(k_i − 15) − 24} + 12 ∑_{(i,j)∈E} (k_i − 1)(k_i + k_j − 8)(k_j − 1) − 48 ∑_i (g³)_ii(k_i − 2) + 12 ∑_{i≠j} (g⁴)_ij − 12 tr(g³)).  (5)
The numerator terms |M_63^(4)| and |M_1023^(5)| are the number of 4-cliques and 5-cliques respectively. The denominator terms are the counts of the 4-star (|M_11^(4)|), the 4-path (|M_13^(4)|), the 5-star (|M_75^(5)|), the 5-arrow (|M_77^(5)|), and the 5-path (|M_86^(5)|), which are illustrated in Figures 1 and 2. Since there are sixteen possible spanning trees on any given four nodes in the graph, all of which will occur in K_4, the factor a(4) equals 16. Similarly, counting the distinct 5-spanning trees in K_5 gives a(5) equal to 125. We do not recommend using the right-hand-sides of (4) and (5) for computation. We report the runtime performance of the analytic formulae in Appendix B.1 for small graphs. However, while this approach seems promising, it will rapidly become hard to derive analytic formulae for larger values of b, because the number of denominator terms will explode. Essentially, we would need to find a formula for every non-isomorphic tree on b nodes. For example, C(6) would require evaluation of six denominator terms (Figure 3). Numerical values for the number of trees on n unlabelled nodes are given as series A000055 in the Online Encyclopedia of Integer Sequences ( http://oeis.org/A000055 ). For example, C(7) has 11 denominator terms, C(8) has 23 denominator terms, and C(36) has more than 6.2 × 10¹² terms! This creates an intrinsic bound on the general applicability of analytic formulae for C(b): we can reasonably expect to use them for C(3), C(4), C(5) and perhaps C(6) and C(7), but not beyond. There has been considerable research on efficient numerical algorithms for generating all possible spanning trees of a simple undirected connected graph; see [17] for a review and comparison
Figure 1: The sixteen spanning trees on four labelled nodes: four 4-stars M_11^(4) and twelve 4-paths M_13^(4). This illustrates Cayley's formula a(b) = b^(b−2), which appears in the numerator of the generalized clustering coefficient, for b = 4.
(a) 5-star M_75^(5). (b) 5-arrow M_77^(5). (c) 5-path M_86^(5).
Figure 2: The three spanning trees on five unlabelled nodes: the 5-star M_75^(5), the 5-arrow M_77^(5) and the 5-path M_86^(5). The total count of these subgraphs appears in the denominator of the generalized clustering coefficient C(b), for b = 5.
Figure 3: The six non-isomorphic spanning trees on six unlabelled nodes. The total count of these subgraphs appears in the denominator of the generalized clustering coefficient C(b), for b = 6. The number of spanning trees that must be counted in order to compute C(b) increases very rapidly with the number of nodes b in the subgraph.
of different methods. It is possible that numerical methods could be used to extend C(b) to higher values of b, although computation of C(b) requires consideration of every possible set of b connected nodes in the graph, and it is well known that the number of spanning trees of a graph increases exponentially in the number of nodes.5
2.2 The generalized clustering coefficient C_{b−1} of Yin-Benson-Leskovec

In recent work, Yin-Benson-Leskovec [76] develop an overall generalized clustering coefficient, based on clique expansion, that is closely related to C(b).6 Their coefficient (Equation 4 in [76]) is
C_ℓ = (ℓ² + ℓ)|K_{ℓ+1}| / |W_ℓ|,  ℓ ≥ 2,

where K_{ℓ+1} is the set of (ℓ + 1)-cliques and W_ℓ is the set of ℓ-wedges (they define a wedge as an ℓ-clique with one additional node that is adjacent to any node in the clique). Despite the apparently different formulation, we can write YBL's coefficient in the notation of our paper, with ℓ = b − 1, as
C_{b−1} = (b² − b)|K_b| / |L(b − 1, 1)|,  b ≥ 4,

where L(·,·) is the lollipop graph. Note that the lollipop L(2,1) is not typically defined, and so we need b − 1 ≥ 3. YBL nest the usual clustering coefficient in their generalized framework by implicitly defining L(2,1) as a 3-path with directed edges (double-counting the undirected 3-path), which gives the coefficient 6 in the numerator rather than the usual 3. This is just a counting issue, and so we assume in the rest of the paper that C(3) = C_2. We now compare the higher-order clustering coefficients on four and five nodes:
C(4) = 16|K_4| / (number of 4-spanning trees in G) = 16|M_63^(4)| / (|M_11^(4)| + |M_13^(4)|);  C_3 = 12|K_4| / |L(3,1)| = 12|M_63^(4)| / |M_15^(4)|,
where M_15^(4) is the tadpole subgraph; and
C(5) = 125|K_5| / (number of 5-spanning trees in G) = 125|M_1023^(5)| / (|M_75^(5)| + |M_77^(5)| + |M_86^(5)|);  C_4 = 20|K_5| / |L(4,1)| = 20|M_1023^(5)| / |M_127^(5)|,
where M_127^(5) is the kite subgraph (see [48] for the count formula). Hence YBL's coefficient C_{b−1} fits naturally into the analytic framework of our paper. Likewise, our coefficient C(b) can potentially use similar computational techniques to those in YBL. Intuitively, both methods will face computational difficulties as b becomes large, and in fact YBL do not go beyond b = 5. In practice, this is unlikely to be a serious issue, as shown by the empirical results of YBL, and by Section 3 in this paper. Small values of b appear to be sufficient to capture much of the higher-order clustering that is present in some real-world networks. The essential difference between C(b) and C_{b−1} is in the definition of the "relative frequency". We compute the frequency of b-cliques relative to the number of minimally connected subgraphs on b nodes. On the other hand, YBL compute the frequency of b-cliques relative to the number of lollipops L(b − 1,1), i.e. a b-clique where all but one of the edges adjacent to one node have been removed. This reflects the two possible interpretations of the 3-path
5Note that the analytical algorithm is not a property of the generalized clustering coefficient itself, but a means to compute it efficiently for small b, and for small to moderate n. As in other areas, when exact analytic methods become intractable, it often becomes necessary to use numerical techniques and approximations instead, such as approximate sampling algorithms and asymptotic results.
6"The novelty of our interpretation of the clustering coefficient is considering it as a form of clique expansion rather than as the closure of a length-2 path, which is key to our generalizations in the next section." (Page 052306-2 in [76])
in the denominator of C(3), either as a spanning tree on three nodes (our paper) or as a 2-clique with an additional adjacent node (YBL), and the two natural generalizations of C(3) to higher-order clustering.

Figure 4: The theoretical expectation of the clustering coefficient C(b) for the Erdős–Rényi random graph G(n, p) is E_G[C(b)] = p^((b−1)(b−2)/2) as n becomes large, with edge-formation probability 0 ≤ p ≤ 1. We observe that clustering is monotonically increasing in probability p, as the network moves from a set of disconnected nodes to a complete graph. For a given p, the expected level of clustering is decreasing in b.
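For b = 4, both coefficients can be evaluated by direct counting, which makes the differing denominators concrete: C(4) divides 16|K_4| by the number of 4-node spanning trees, while C_3 divides 12|K_4| by the number of tadpoles, i.e. (triangle, outgoing edge) pairs. A hedged sketch of the YBL side (names hypothetical):

```python
from itertools import combinations

def ybl_c3(edges):
    """YBL coefficient C_3 = 12 |K_4| / |L(3,1)|: 4-cliques relative to
    tadpoles, each tadpole being a triangle plus one edge leaving it."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    k4 = sum(1 for quad in combinations(nodes, 4)
             if all(y in adj[x] for x, y in combinations(quad, 2)))
    tadpoles = 0
    for tri in combinations(nodes, 3):
        if all(y in adj[x] for x, y in combinations(tri, 2)):
            # count edges from a triangle node to a node outside the triangle
            tadpoles += sum(len(adj[v] - set(tri)) for v in tri)
    return 12 * k4 / tadpoles

# K_5 is complete, so C_3 = 1: five 4-cliques against sixty tadpoles.
K5 = list(combinations(range(5), 2))
print(ybl_c3(K5))  # 1.0
```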
2.3 Analysis of generalized clustering C(b) and C_{b−1} for the G(n, p) model

In the special case of the Erdős–Rényi random graph G = G(n, p), it follows from (3) that the expectation of C(b) is given by E_G[C(b)] = p^((b−1)(b−2)/2) as n becomes large, since there are (n choose b) p^(b(b−1)/2) b-cliques and b^(b−2) (n choose b) p^(b−1) b-spanning trees in G(n, p), on average. Also see Figure 4. We implicitly assume that both C(b) and C_{b−1} are well-defined on G. Numerical values of (b−1)(b−2)/2 are given in A161680 of the Online Encyclopedia of Integer Sequences ( http://oeis.org/A161680 ). We can also show that E_G[C_{b−1}] = p^(b−2) as n becomes large, since there are (n choose b) p^(b(b−1)/2) b-cliques and (b² − b)(n choose b) p^((b−1)(b−2)/2 + 1) lollipops L(b − 1,1) in G(n, p), on average (see also Proposition 2(1) in [76] for this result). It follows immediately that E_G[C(b)] ⋛ E_G[C_{b−1}] as (b−2)(b−3) ⋚ 0, with p ≠ 0,1. Directly, E_G[C(b)] ≤ E_G[C_{b−1}], with equality when p = 0,1 (for all b) or when b = 3 (for all p). Intuitively, there will be more b-spanning trees than there are lollipops L(b − 1,1) in G. For example, there is one 4-star and two 4-paths in the tadpole; and there can also be 4-spanning trees that are not in a tadpole. In Figure 5 we examine the expected difference between C(b) and C_{b−1} for G = G(n, p), where E_G[C_{b−1}] − E_G[C(b)] = p^(b−2)(1 − p^((b−2)(b−3)/2)) ≥ 0, with equality at p = 0,1. We see that the difference can be substantial, and that it increases in the edge-formation probability p (as the graph G becomes more dense) up to a certain point, and then falls to zero; but that it can increase or decrease in b, the order of the clustering, for a given p.
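The b = 3 case of the limit above, E_G[C(3)] = p, is easy to check by simulation; a minimal sketch, with sample sizes chosen for speed rather than precision (names hypothetical):

```python
import random

def gnp_clustering(n, p, rng):
    """Overall clustering C(3) of a single G(n, p) draw."""
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    # each triangle counted once via the ordering i < j < k
    triangles = sum(1 for i in range(n) for j in adj[i] if j > i
                    for k in adj[i] & adj[j] if k > j)
    triples = sum(len(a) * (len(a) - 1) for a in adj) // 2
    return 3 * triangles / triples

rng = random.Random(0)
estimate = sum(gnp_clustering(150, 0.3, rng) for _ in range(5)) / 5
print(round(estimate, 2))  # close to p = 0.3
```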
2.4 Invariance of C_{b−1} on graphs with vanishing density
The YBL coefficient C_{b−1} (b ≥ 4) has a peculiar invariance property for a particular class of graphs, that is neither a feature of the usual clustering coefficient C(3) = C_2 nor of our generalized clustering C(b). Note that C(b) increases as the number of b-cliques increases or the number of b-spanning trees falls, while C_{b−1} increases as the number of b-cliques increases or the number of lollipops L(b − 1,1) falls. It follows that C_{b−1} will give the same value for
Figure 5: The theoretical difference in expectation between the clustering coefficient C_{b−1} of [76] and our coefficient C(b) for the Erdős–Rényi random graph G(n, p) is E_G[C_{b−1}] − E_G[C(b)] = p^(b−2)(1 − p^((b−2)(b−3)/2)) as n becomes large, with edge-formation probability 0 ≤ p ≤ 1. We note that C(3) = C_2. We observe that C(b) and C_{b−1} are numerically identical when p = 0 (a set of disconnected nodes) and p = 1 (a complete graph). For a given b, the expected difference is positive and increases in p up to a certain point, and then falls to zero. It is easy to show that the maximum is attained at p = (2/(b − 1))^(2/((b−2)(b−3))), which gives p = 2/3 (for b = 4), p = 2^(−1/3) ≈ 0.7937 (for b = 5) and p = (2/5)^(1/6) ≈ 0.8584 (for b = 6). It is clear that C_{b−1} > C(b) in expectation for all p ≠ 0,1. The numerical difference between the two statistics can be quite substantial, given that 0 ≤ C_{b−1} ≤ 1 and 0 ≤ C(b) ≤ 1. The maximum difference is 4/27 ≈ 0.1481 (for b = 4), 1/4 (for b = 5) and (108/3125)^(1/3) ≈ 0.3257 (for b = 6).
a lollipop graph G = L(b,1) as for any graph G_2 on more than b + 1 nodes that does not contain any additional lollipops L(b − 1,1) beyond those contained in G.7 We illustrate with two specific examples. First, take the lollipop L(3, n − 3) with n ≥ 4. It is easy to show that C(3) = C_2 = 3/(n + 1), which is decreasing in n. This makes intuitive sense: as the lollipop becomes progressively less dense, there is less clustering, and in the limit n → ∞ the lollipop resembles a path graph. Second, take the lollipop L(4, n − 4) that has C(4) = 16/(n + 22) for n ≥ 6: our fourth order clustering decreases in n. The usual clustering C(3) = 12/(n + 10) (for n ≥ 5) also falls in n. Again, this is intuitively correct: the density of L(4, n − 4), for n ≥ 5, is d(G) = 2(n + 2)/(n(n − 1)) → 0 as n → ∞.8 However, C_3 = 12|M_63^(4)|/|M_15^(4)| = 0.8 for all n ≥ 5, i.e. it is invariant as n increases. This is interesting for two reasons. First, C_3 will not change even as the graph becomes infinitely sparse in the limit. Second, C_3 does not behave in the same way as the usual clustering coefficient C_2 in this respect. The problem is not specific to the lollipop, and C_{b−1} will display this behaviour whenever a lollipop L(b,1) is extended in a non-trivial way so that no additional L(b − 1,1) structure is added to the graph. Consider G = L(5,1) with n = 6 nodes and m = 11 edges, so that C_4 ≈ 0.8333. Then consider a connected graph G_2 with an arbitrarily large number of nodes, that includes one copy of G as a subgraph, and that has additional non-trivial topological structure outside of the subgraph G, that can include 17 of the possible 21 non-isomorphic subgraphs on five nodes [48], but with no additional L(4,1) structure. Nevertheless, C_4 will take the same value on G_2 as on G.
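The invariance is easy to reproduce numerically: build L(4, n − 4) for growing n and compare the usual clustering C(3) = 12/(n + 10), which vanishes, with the YBL coefficient C_3, which stays at 0.8. A hedged sketch by direct counting (names hypothetical):

```python
from itertools import combinations

def lollipop(b, tail):
    """L(b, tail): a b-clique joined by a bridge to a path on tail nodes."""
    edges = list(combinations(range(b), 2))
    prev = 0  # the bridge leaves clique node 0
    for t in range(b, b + tail):
        edges.append((prev, t))
        prev = t
    return edges

def c3_and_ybl(edges):
    """Return (C(3), C_3) by direct subgraph counting."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    tri = [t for t in combinations(nodes, 3)
           if all(y in adj[x] for x, y in combinations(t, 2))]
    triples = sum(len(a) * (len(a) - 1) // 2 for a in adj.values())
    k4 = sum(1 for q in combinations(nodes, 4)
             if all(y in adj[x] for x, y in combinations(q, 2)))
    tadpoles = sum(len(adj[v] - set(t)) for t in tri for v in t)
    return 3 * len(tri) / triples, 12 * k4 / tadpoles

for n in (6, 10, 20):
    usual, ybl = c3_and_ybl(lollipop(4, n - 4))
    print(n, round(usual, 4), ybl)  # C(3) = 12/(n+10) falls; C_3 stays 0.8
```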
2.5 Analysis of generalized clustering C(b) for the small-world model

We now investigate the properties of C(b) for a simulated small-world model that can interpolate between regular and random behaviour. Many real-world networks exhibit small-world behaviour [4, 8, 43, 54, 57, 72, 78]. Following the original one-parameter model of [72], we start with a regular ring lattice on n nodes, where each node is adjacent to its k nearest neighbours in both the clockwise and anti-clockwise directions, giving a total of nk edges. To have a sparse but connected network, we assume that n ≫ 2k ≫ ln(n) ≫ 1.9 We choose an edge that connects a node u to its nearest neighbour v in a clockwise direction, and rewire this edge, uniformly with probability p, to an edge (u,w), avoiding self-loops and duplicated edges. With probability 1 − p the original edge is left in place. We continue in a clockwise direction around the ring until all nodes have been considered once. We then look at edges that connect each node to its second-nearest neighbour, and so on, continuing around the ring k times, until each edge in the original lattice has been examined once. The edge rewiring creates more randomness in the network: when p = 0, the ring lattice is unchanged and completely regular, and when p = 1 all edges are rewired randomly. It is well-known [72] that intermediate values of p give graphs with the small-world property, characterized by low average path length (as in a random graph) but high clustering (as in a regular graph). This is due to the presence of a small number of "short-cut" edges that connect nodes that would otherwise be far apart in a regular graph. In Figure 6, we report the expected clustering coefficient E_G[C(b)] from 250 replications of a small-world graph G on n = 50 nodes, with k = 7. The small-world model is connected by construction, and so C(b) will be well defined.
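The rewiring procedure just described can be sketched as follows; a minimal version of the construction in [72] (names hypothetical), which preserves the edge count nk and returns the regular ring lattice at p = 0:

```python
import random

def small_world(n, k, p, rng):
    """One-parameter model of [72]: a ring lattice with k neighbours on
    each side, where each lattice edge (u, u+r) is rewired with
    probability p, avoiding self-loops and duplicate edges."""
    edges = {frozenset((u, (u + r) % n)) for u in range(n)
             for r in range(1, k + 1)}
    for r in range(1, k + 1):          # r-th nearest-neighbour pass
        for u in range(n):
            old = frozenset((u, (u + r) % n))
            if old in edges and rng.random() < p:
                free = [w for w in range(n)
                        if w != u and frozenset((u, w)) not in edges]
                if free:
                    edges.remove(old)
                    edges.add(frozenset((u, rng.choice(free))))
    return [tuple(e) for e in edges]

rng = random.Random(1)
g = small_world(50, 7, 0.1, rng)
print(len(g))  # nk = 350 edges, for any rewiring probability p
```

Each rewiring step removes one edge and adds one previously absent edge, so the number of edges never changes; only the placement of "short-cuts" does.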
We note that clustering falls in b (given p) but that it is not monotonic as p increases (given b): at some point, introducing more randomness actually increases clustering.10 Finding a closed-form formula for the expected value of C(b) in a finite small-world model is an open problem. Some partial results are available [4, 8, 72]. When p = 0,
7If G_2 contains any positive contribution to higher-order clustering (in the sense of more b-cliques) then both C(b) and C_{b−1} will detect it.
8We observe qualitatively the same result for the lollipop L(5, n − 5), with C(3) = 30/(n + 28) for n ≥ 6 and C(5) = 125/(n + 203) for n ≥ 8, but C_4 = 5/6 ≈ 0.8333 for n ≥ 6, despite a vanishing density d(G) = 2(n + 5)/(n(n − 1)) → 0 as n → ∞.
9If n = 2k + 1 then the graph is complete, and C(b) = 1 from Proposition 2.1, for 3 ≤ b ≤ n. If n > 2k + 1 then, trivially, we have that C(b) = 0 for b > k + 1 on the regular ring lattice with no rewiring.
10The observation that expected clustering falls in b for a given p is qualitatively the same as Figure 3 in [76], who also consider this small-world model, but for their average clustering coefficient, and with (apparently one replication of) a small-world model on n = 20,000 nodes, and k = 5. Figure 3 in [76] also suggests that their average clustering decreases monotonically in p (for a given b) for this large n. See Figure B.5 for simulation results on the expected overall clustering coefficient E_G[C_{b−1}] of Yin-Benson-Leskovec, and Figure B.6 for simulation results on the difference in expected clustering E_G[C_{b−1}] − E_G[C(b)], for the small-world model of [72].
Figure 6: The simulated expected clustering coefficient E_G[C(b)] from 250 replications of a small-world graph with n = 50 nodes, each of which has degree 2k = 14, and edge-rewiring probability 0 ≤ p ≤ 1, which parameterizes the network from a regular graph to a random one. As in [72], we begin with a regular ring lattice, where each node is connected to its k nearest neighbours in both the clockwise and anti-clockwise directions. Moving around the lattice in a clockwise direction, each edge between a node u and its nearest neighbour v is randomly and uniformly rewired with probability p to another edge (u,w), where self-loops and repeated edges are not allowed. We then continue with the second nearest neighbour, and so on. We observe that clustering falls in b but that it is not monotonic as p increases. As p increases from p = 0, the randomness that is added to the graph "breaks" the regular clusters and reduces C(b). At some point, though, introducing more randomness actually increases clustering. Finding a closed-form formula for the expectation of C(b) in a small-world model with n < ∞ remains an open problem, except for b = 3 and p = 0, whereupon C(3) = (3/2) × (k − 1)/(2k − 1) (see [4]), which equals 9/13 ≈ 0.6923 in this example.