Cliques and a New Measure of Clustering: with Application to U.S
Total Page:16
File Type:pdf, Size:1020Kb
Cliques and a New Measure of Clustering: with Application to U.S. Domestic Airlines Steve Lawford† and Yll Mehmeti Data, Economics and Interactive Visualization (DEVI) group, ENAC (University of Toulouse), 7 avenue Edouard Belin, CS 54005, 31055, Toulouse, Cedex 4, France †Corresponding author. Email: [email protected] Abstract We propose a higher-order generalization of the well-known overall clustering coefficient for triples C(3) to any number of nodes. We give analytic formulae for the special cases of three, four, and five nodes and show that they have very fast runtime performance for small graphs. We discuss some theoretical properties and limitations of the new measure, and use it to provide insight into dynamic changes in the structure of U.S. airline networks. 1 Introduction Complex networks are widely used to describe important systems, with applications to biology, technology and infrastructure, and social and economic relationships [4, 5, 25, 59, 69, 83]. A network or “graph” involves a set of nodes or “vertices” that are linked by edges. For example, an airline company’s transportation of passengers can be thought of as a network of airports (nodes) joined by routes that have regular service (edges). The statistical physics and graph theory communities have focused in particular on the topology and dynamics of random and real-world networks, and have been successful in identifying robust structural features and organizational principles.1 These include the small-world property, characterized by systems that are highly clustered but have short characteristic path lengths; and scale-free networks, which means that the number of neighbours of a node, or its “degree”, follows a power-law distribution whereby the topology of the system is dominated by a few high degree nodes [7, 22, 72]. One common property of networks is clustering or “transitivity”, which measures the relative frequency with which two neighbours of a given node are also neighbours of one another, forming a connected triangle of nodes. Many real-world networks display higher levels of clustering than would be expected if those networks were arXiv:1806.05866v3 [cs.SI] 23 Aug 2020 random, with nodes creating tightly connected groups [4, 58, 69, 72]. Clustering is especially important in economic and social networks, and there is strong evidence that it is related to cooperative social behaviour and beneficial information and reputation transfer [41, 42, 44, 59]. Other recent examples of the empirical application of graphs in economics include [6, 32, 42] (social networks) and [3, 23, 31, 40, 64] (financial networks). In this paper, we focus our empirical application on air transportation. Recent work has considered air cargo networks [12, 53], the world-wide airport network [19, 37, 38, 51, 67, 71], and airline networks in the U.S. [2, 9, 35, 50, 65, 66, 73], Europe [63], and China [18, 30]; good surveys of research in this area appear in [52, 65, 79]. Typically, these papers report a selection of summary statistics to capture global or local aspects of the network, and provide insight into topology and dynamics that would not be available from other methods. 1The field is continually expanding and is far too large to survey here. We thank one anonymous referee for pointing us towards recent work on edge prediction [80] and multiplex models [81, 82]. 1 One widely used measure of clustering is the overall clustering coefficient or “transitivity” which is defined in [8, 59, 60, 61, 62] as, in our notation, 3 × number of triangles in the network C(3) = ; (1) number of connected triples of vertices where a connected triple is a set of three distinct nodes u, v and w, such that at least two of the possible edges between them exist. In a social network this measures how often an individual’s “friends” are also friends with one another, on average, across the entire network. An alternative measure of clustering, the average clustering coefficient, takes a different approach to (1) and is computed locally for each node, and then averaged across all nodes.2 We focus on overall clustering (transitivity) in this paper. There is now substantial evidence that significant topological structures (known as “graphlets”, “motifs” or “subgraphs”), on more than three nodes, can be found in real-world networks, and that they may perform precise specialized functions [1, 10, 11, 14, 45, 55, 74]. Since the usual clustering coefficient C(3) is based upon connected triples of nodes, it is natural to ask whether a similar measure can be derived for any number of nodes. A generalized clustering coefficient could potentially identify hidden higher-order clustering and enable a better understanding of the structure of real-world networks. The need to go beyond usual three-node average clustering has been addressed by [16] (cycles of length four), who find that “grid clustering” scales with node degree in a similar way to usual clustering, [34] (shortest paths of length greater than one between a node’s neighbours), who confirm the absence of clustering in Barabasi-Albert´ preferential attachment (scale-free) models, and [46] (connectivity of more than one of a node’s nearest neighbours), who investigate scaling properties using data on urban street networks. The work that is most closely related to the present paper is by Yin-Benson-Leskovec [76], hereafter referred to as YBL, who propose new overall and average clustering coefficients based on the relative frequency of cliques of order greater than three. Our work differs essentially from YBL’s overall coefficient in the way that we define the “relative frequency”, but also in the methods that are used for computation, and the motivation and empirical application. We draw careful comparisons between our method and YBL, for theoretical Erdos-R˝ enyi´ random graphs and simulated small-world models, and argue that the two approaches are complementary.3 In this paper, we make the following specific contributions: • We propose a new generalized clustering coefficient C(b), based upon connected groups of b nodes, which nests the standard clustering coefficient C(3). We develop a very fast analytic implementation for connected groups of three, four and five nodes, that we show to be up to 2,000 times faster than a na¨ıve nested loop algorithm, for some small dense graphs. • We examine the theoretical properties of C(b) for Erdos-R˝ enyi´ and small-world random graphs, and lollipop graphs, and draw comparisons with YBL. We show that it will become increasingly difficult to compute C(b) efficiently as b becomes large, even using analytic formulae. Using dynamic data on U.S. airline networks, we also observe that C(b) can be highly correlated across b, and with network density. When we control for lower-order clustering, we find low to moderate higher-order clustering in these networks. It is not known whether this finding holds generally for large classes of networks. All of the analytic formulae that we use, and several proofs, are collected in Appendix A, and additional figures and tables are reported in Appendix B. 2Overall clustering (1) assigns the same weight to every triangle in the graph. Average clustering gives each node the same weight. Since high-degree nodes may be adjacent to more triangles than low-degree nodes, overall and average clustering can give different values. 3We became aware of the excellent [76] after the original version of our paper had been completed and submitted to the arXiv repository. The present paper contains substantial new material to address this omission. There is related work by YBL-Gleich [75] and by YBL [77]. 2 2 Graph Theory and Clustering We briefly review some relevant tools of graph theory. Important monographs include [29] (mathematics), [41] (economics of social networks) and [47] (algorithms). A graph is an ordered pair G = (V;E) where V and E denote the sets of nodes and edges of G, respectively. We use n = jVj and m = jEj to represent the numbers of nodes and edges of G. A graph has an associated n × n adjacency matrix g, with representative element (g)i j that takes value one when an edge is present between nodes i and j, and zero otherwise. We also use (i; j) 2 E to denote an edge between nodes i and j, and say that they are directly-connected. A graph is simple and unweighted if (g)ii = 0 (no self-links) and (g)i j 2 f0;1g (no pair of nodes is linked by more than one edge, or by an edge with a weight that is different from one). A graph is undirected if (g)i j = (g) ji.A walk between nodes i and j is a sequence of edges f(ir;ir+1)gr=1;:::;R such that i1 = i and iR+1 = j, and a path is a walk with distinct nodes. A graph is connected if there is at least one path between any pair of nodes i and j; otherwise the graph is disconnected.A bridge is an edge the removal of which will disconnect the graph. In this paper, we consider simple, unweighted, undirected and connected graphs. The degree ki = ∑ j(g)i j is the number of nodes that are directly-connected to node i, and the (1-degree) neighbourhood of node i in G, denoted by GG(i) = f j : (i; j) 2 Eg, is the set of all nodes that are directly-connected to i. The density d(G) = 2m=n(n−1) is the number of edges in G relative to the maximum possible number of edges in a graph with n nodes. A graph G0 = (V 0;E0) is a subgraph of G if V 0 ⊆ V and E0 ⊆ E where (i; j) 2 E0 implies that i; j 2 V 0.A tree is a connected graph with no cycles.