Centrality Measures and Link Analysis
Gonzalo Mateos Dept. of ECE and Goergen Institute for Data Science University of Rochester [email protected] http://www.ece.rochester.edu/~gmateosb/
February 21, 2020
Network Science Analytics Centrality Measures and Link Analysis

Centrality measures
Centrality measures
Case study: Stability of centrality measures in weighted graphs
Centrality, link analysis and web search
A primer on Markov chains
PageRank as a random walk
PageRank algorithm leveraging Markov chain structure
Quantifying vertex importance
I In network analysis many questions relate to vertex importance
Example
I Q1: Which actors in a social network hold the ‘reins of power’?
I Q2: How authoritative is a WWW page considered by peers?
I Q3: The ‘knock-out’ of which genes is likely to be lethal?
I Q4: How critical to the daily commute is a subway station?
I Measures of vertex centrality quantify such notions of importance
⇒ Degrees are the simplest centrality measures. Let's study others
Closeness centrality
I Rationale: ‘central’ means a vertex is ‘close’ to many other vertices
I Def: Distance d(u, v) between vertices u and v is the length of the shortest u − v path. Oftentimes referred to as geodesic distance
I Closeness centrality of vertex v is given by

    c_Cl(v) = 1 / Σ_{u ∈ V} d(u, v)
I Interpret v* = argmax_v c_Cl(v) as the most approachable node in G
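As a quick illustration (a sketch not in the slides; the function names are illustrative), closeness centrality follows directly from the definition by computing BFS distances from each vertex:

```python
from collections import deque

def bfs_distances(adj, src):
    """Unweighted shortest-path distances from src via breadth-first search."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def closeness(adj, v):
    """c_Cl(v) = 1 / sum_u d(v, u); assumes a connected graph."""
    dist = bfs_distances(adj, v)
    return 1.0 / sum(d for u, d in dist.items() if u != v)

# Path graph 0 - 1 - 2: the middle vertex is the most approachable
adj = {0: [1], 1: [0, 2], 2: [1]}
print(closeness(adj, 1))  # 1 / (1 + 1) = 0.5
print(closeness(adj, 0))  # 1 / (1 + 2) ≈ 0.333
```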
Normalization, computation and limitations
I To compare with other centrality measures, often normalize to [0, 1]
c_Cl(v) = (Nv − 1) / Σ_{u ∈ V} d(u, v)
I Computation: need all pairwise shortest path distances in G
⇒ Dijkstra's algorithm in O(Nv² log Nv + Nv Ne) time
I Limitation 1: sensitivity, values tend to span a small dynamic range ⇒ Hard to discriminate between central and less central nodes
I Limitation 2: assumes connectivity, if not cCl (v) = 0 for all v ∈ V ⇒ Compute centrality indices in different components
Betweenness centrality
I Rationale: ‘central’ node is (in the path) ‘between’ many vertex pairs
I Betweenness centrality of vertex v is given by
c_Be(v) = Σ_{s ≠ t ≠ v ∈ V} σ(s, t | v) / σ(s, t)
I σ(s, t) is the total number of s − t shortest paths
I σ(s, t|v) is the number of s − t shortest paths through v ∈ V
I Interpret v* = argmax_v c_Be(v) as the controller of information flow
Computational considerations
I Notice that an s − t shortest path goes through v if and only if
d(s, t) = d(s, v) + d(v, t)
I Betweenness centralities can be naively computed for all v ∈ V by:
Step 1: Use Dijkstra to tabulate d(s, t) and σ(s, t) for all s, t
Step 2: Use the tables to identify σ(s, t | v) for all v
Step 3: Sum the fractions to obtain c_Be(v) for all v (O(Nv³) time)
I Cubic complexity can be prohibitive for large networks
I O(Nv Ne )-time algorithm for unweighted graphs in: U. Brandes, “A faster algorithm for betweenness centrality,” Journal of Mathematical Sociology, vol. 25, no. 2, pp. 163-177, 2001
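The naive scheme above can be sketched in a few lines (illustrative code, not Brandes's algorithm): BFS from each source tabulates d(s, t) and σ(s, t), and the d(s, t) = d(s, v) + d(v, t) criterion identifies paths through v, since then σ(s, t | v) = σ(s, v) σ(v, t).

```python
from collections import deque
from itertools import combinations

def bfs_dist_counts(adj, s):
    """BFS from s: shortest-path distances d and shortest-path counts sigma."""
    d, sigma = {s: 0}, {s: 1}
    q = deque([s])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in d:
                d[w] = d[u] + 1
                sigma[w] = 0
                q.append(w)
            if d[w] == d[u] + 1:       # u is a predecessor of w on a shortest path
                sigma[w] += sigma[u]
    return d, sigma

def betweenness(adj):
    """Naive c_Be for an undirected, connected, unweighted graph."""
    nodes = list(adj)
    d, sigma = {}, {}
    for s in nodes:                     # Step 1: tabulate d(s,t) and sigma(s,t)
        d[s], sigma[s] = bfs_dist_counts(adj, s)
    c = {v: 0.0 for v in nodes}
    for s, t in combinations(nodes, 2):  # unordered pairs s != t
        for v in nodes:
            if v in (s, t):
                continue
            # Steps 2-3: v is on an s-t shortest path iff d(s,t) = d(s,v) + d(v,t)
            if d[s][v] + d[v][t] == d[s][t]:
                c[v] += sigma[s][v] * sigma[v][t] / sigma[s][t]
    return c

# Star: center 0 joined to leaves 1..4; all 6 leaf pairs route through 0
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(betweenness(star)[0])  # 6.0
```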
Eigenvector centrality
I Rationale: ‘central’ vertex if ‘in-neighbors’ are themselves important ⇒ Compare with ‘importance-agnostic’ degree centrality
I Eigenvector centrality of vertex v is implicitly defined as

    c_Ei(v) = α Σ_{(u,v) ∈ E} c_Ei(u)
[Figure: example directed graph on vertices 1, …, 6]
I No one points to 1
I Only 1 points to 2
I Only 2 points to 3, but 2 more important than 1
I 4 as high as 5 with fewer links
I Links to 5 have lower rank
I Same for 6
Eigenvalue problem
I Recall the adjacency matrix A and

    c_Ei(v) = α Σ_{(u,v) ∈ E} c_Ei(u)
I Vector c_Ei = [c_Ei(1), …, c_Ei(Nv)]ᵀ solves the eigenvalue problem

    A c_Ei = α⁻¹ c_Ei

⇒ Typically α⁻¹ chosen as the largest eigenvalue of A [Bonacich'87]
I If G is undirected and connected, then by Perron's Theorem
⇒ The largest eigenvalue of A is positive and simple
⇒ All the entries in the dominant eigenvector c_Ei are positive
I Can compute c_Ei and α⁻¹ via power iterations of O(Nv²) complexity

    c_Ei(k + 1) = A c_Ei(k) / ‖A c_Ei(k)‖,  k = 0, 1, …
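The power iteration above is a one-liner with NumPy (a sketch under illustrative names; note that plain power iteration needs a connected, non-bipartite graph so the dominant eigenvalue is strictly largest in magnitude):

```python
import numpy as np

def eigenvector_centrality(A, iters=200):
    """Power iteration c(k+1) = A c(k) / ||A c(k)||."""
    c = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    for _ in range(iters):
        c = A @ c
        c /= np.linalg.norm(c)
    return c

# Triangle 0-1-2 plus a pendant vertex 3 attached to 0 (connected and
# non-bipartite, so the iterates settle on the dominant eigenvector)
A = np.array([[0., 1., 1., 1.],
              [1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [1., 0., 0., 0.]])
c = eigenvector_centrality(A)
print(c)  # vertex 0, whose neighbors are themselves well connected, scores highest
```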
Example: Comparing centrality measures
I Q: Which vertices are more central? A: It depends on the context
I Each measure identifies a different vertex as most central ⇒ None is ‘wrong’, they target different notions of importance
[Figure 4.4: Illustration of (b) closeness, (c) betweenness, and (d) eigenvector centrality measures on the graph in (a). Example and figures courtesy of Ulrik Brandes.]
I Small green vertices are arguably more peripheral ⇒ Less clear how the yellow, dark blue and red vertices compare
Case study
Centrality measures
Case study: Stability of centrality measures in weighted graphs
Centrality, link analysis and web search
A primer on Markov chains
PageRank as a random walk
PageRank algorithm leveraging Markov chain structure
Centrality measures robustness
I Robustness to noise in network data is of practical importance
I Approaches have been mostly empirical ⇒ Find average response in random graphs when perturbed ⇒ Not generalizable and does not provide explanations
I Characterize behavior in noisy real graphs ⇒ Degree and closeness are more reliable than betweenness
I Q: What is really going on? ⇒ Framework to study formally the stability of centrality measures
I S. Segarra and A. Ribeiro, “Stability and continuity of centrality measures in weighted graphs,” IEEE Trans. Signal Process., 2015
Definitions for weighted digraphs
I Weighted and directed graphs G(V, E, W)
⇒ Set V of Nv vertices
⇒ Set E ⊆ V × V of edges
⇒ Map W : E → R++ of weights on each edge
I Path P(u, v) is an ordered sequence of nodes from u to v
I When weights represent dissimilarities ⇒ Path length is the sum of the dissimilarities encountered
I Shortest path length sG (u, v) from u to v
s_G(u, v) := min_{P(u,v)} Σ_{i=0}^{ℓ−1} W(u_i, u_{i+1})
Stability of centrality measures
I Space of graphs G(V ,E) with (V , E) as vertex and edge set
I Define the metric d_(V,E) : G(V,E) × G(V,E) → R+

    d_(V,E)(G, H) := Σ_{e ∈ E} |W_G(e) − W_H(e)|
I Def: A centrality measure c(·) is stable if for any vertex v ∈ V in any two graphs G, H ∈ G(V,E)

    |c^G(v) − c^H(v)| ≤ K_G d_(V,E)(G, H)
I K_G is a constant depending on G only
I Stability is related to Lipschitz continuity in G(V,E)
I Independent of the definition of d_(V,E) (equivalence of norms)
I Node importance should be robust to small perturbations in the graph
Degree centrality
I Sum of the weights of incoming arcs

    c_De(v) := Σ_{u : (u,v) ∈ E} W(u, v)
I Applied to graphs where the weights in W represent similarities
I High cDe (v) ⇒ v similar to its large number of neighbors
Proposition 1
For any vertex v ∈ V in any two graphs G, H ∈ G(V ,E), we have that
|c_De^G(v) − c_De^H(v)| ≤ d_(V,E)(G, H)
i.e., degree centrality cDe is a stable measure
I Can show closeness and eigenvector centralities are also stable
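Proposition 1 is easy to check numerically (a sketch, not from the case study; the graph and perturbation sizes are illustrative): perturb the weights of a random graph and compare the per-vertex change in degree centrality against d_(V,E)(G, H).

```python
import numpy as np

rng = np.random.default_rng(0)

# Common edge set E on n nodes; entry (u, v) holds the weight W(u, v)
n = 6
mask = rng.random((n, n)) < 0.5
np.fill_diagonal(mask, False)
W_G = np.where(mask, rng.uniform(0.5, 1.5, (n, n)), 0.0)
W_H = W_G * np.where(mask, rng.uniform(0.99, 1.01, (n, n)), 1.0)  # perturbed copy

c_G = W_G.sum(axis=0)              # degree centrality: sum of incoming weights
c_H = W_H.sum(axis=0)
d_GH = np.abs(W_G - W_H).sum()     # graph distance d_(V,E)(G, H)

print(np.abs(c_G - c_H).max(), d_GH)  # per-vertex change never exceeds d_GH
```

The bound holds by the triangle inequality: each vertex's centrality change is a sum of weight changes on its incoming edges, which is at most the sum over all edges.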
Betweenness centrality
I Look at the shortest paths for every two nodes distinct from v ⇒ Sum the proportion that contains node v
c_Be(v) := Σ_{s ≠ v ≠ t ∈ V} σ(s, t | v) / σ(s, t)
I σ(s, t) is the total number of s − t shortest paths
I σ(s, t|v) is the number of those paths going through v
Proposition 2
The betweenness centrality measure cBe is not stable
Instability of betweenness centrality
I Compare the value of c_Be(v) in graphs G and H
[Figure: G and H share the same topology; all edge weights are 1 in G, while in H the edges incident to v have weight 1 + ε]

    c_Be^G(v) = 9,    c_Be^H(v) = 0

⇒ Centrality value c_Be^H(v) = 0 remains unchanged for any ε > 0
I For small values of ε, graphs G and H become arbitrarily similar

    9 = |c_Be^G(v) − c_Be^H(v)| ≤ K_G d_(V,E)(G, H) → 0

⇒ Inequality cannot hold for any constant K_G
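The same discontinuity shows up in an even smaller example (a sketch; this triangle is an illustrative stand-in for the slides' graph): take vertices s, v, t with w(s, v) = w(v, t) = 1 and a direct edge w(s, t) = 2. Shrinking the direct edge by any ε > 0 drops c_Be(v) from 1/2 to 0.

```python
# Triangle on {s, v, t}: w(s,v) = w(v,t) = 1, and a direct s-t edge of weight w_st
def c_be_v(w_st):
    via_v = 1 + 1                 # length of the s-v-t path
    direct = w_st                 # length of the direct s-t edge
    if via_v < direct:            # every shortest s-t path passes through v
        return 1.0
    if via_v == direct:           # two shortest paths, one of them through v
        return 0.5
    return 0.0                    # v lies on no shortest path

print(c_be_v(2.0))         # tie: c_Be(v) = 0.5
print(c_be_v(2.0 - 1e-9))  # c_Be(v) = 0.0, yet d_(V,E)(G, H) = 1e-9
```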
Stable betweenness centrality
I Define G^v = (V^v, E^v, W^v), with V^v = V \ {v}, E^v = E|_{V^v × V^v}, W^v = W|_{E^v}
⇒ G^v obtained by deleting from G node v and the edges connected to v
I Stable betweenness centrality c_SBe(v)

    c_SBe(v) := Σ_{s ≠ v ≠ t ∈ V} [s_{G^v}(s, t) − s_G(s, t)]

⇒ Captures the impact of deleting v on the shortest paths
I If v is (not) in the s − t shortest path, then s_{G^v}(s, t) − s_G(s, t) > 0 (= 0)
⇒ Same notion as (traditional) betweenness centrality c_Be

Proposition 3
For any vertex v ∈ V in any two graphs G, H ∈ G(V,E), we have that
|c_SBe^G(v) − c_SBe^H(v)| ≤ 2 Nv² d_(V,E)(G, H)
i.e., stable betweenness centrality cSBe is a stable measure
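Stable betweenness can be computed with two shortest-path sweeps per pair, one in G and one in G^v (a sketch summing over unordered pairs; the weighted 4-cycle and function names are illustrative):

```python
import heapq
from itertools import combinations

def dijkstra(adj, src, skip=None):
    """Shortest-path lengths from src, optionally deleting node `skip`."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue
        for w, wt in adj[u].items():
            if w == skip:
                continue
            nd = d + wt
            if nd < dist.get(w, float('inf')):
                dist[w] = nd
                heapq.heappush(pq, (nd, w))
    return dist

def stable_betweenness(adj, v):
    """c_SBe(v): sum over pairs s, t != v of s_{G^v}(s, t) - s_G(s, t)."""
    total = 0.0
    others = [u for u in adj if u != v]
    for s, t in combinations(others, 2):
        d_G = dijkstra(adj, s)[t]
        d_Gv = dijkstra(adj, s, skip=v)[t]   # shortest s-t path avoiding v
        total += d_Gv - d_G
    return total

# Weighted 4-cycle 0-1-2-3-0 with one heavy edge (3, 0)
adj = {0: {1: 1, 3: 10}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1, 0: 10}}
print(stable_betweenness(adj, 2))  # pairs (0,1), (0,3), (1,3): 0 + 7 + 9 = 16.0
```

Note the sketch assumes G^v stays connected for the pairs considered; otherwise s_{G^v}(s, t) = ∞.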
Centrality ranking variation in random graphs
I G_{n,p} graphs with p = 10/n and weights drawn from U(0.5, 1.5)
⇒ Vary n from 10 to 200
⇒ Perturb by multiplying each weight by a random number drawn from U(0.99, 1.01)
I Compare centrality rankings in the original and perturbed graphs
[Figure: mean maximum change (left) and mean average change (right) in centrality ranking vs. network size, for degree (C_D), betweenness (C_B), eigenvector (C_E), and stable betweenness (C_SB) centralities]
I Betweenness centrality presents larger maximum and average changes
Centrality ranking variation in random graphs
I Compute probability of observing a ranking change ≥ 5 ⇒ Plot the histogram giving rise to the empirical probabilities
[Figure: probability of a maximum ranking change ≥ 5 vs. network size (left) and histogram of the maximum change in ranking (right), for C_D, C_B, C_E, and C_SB]
I For cBe some node varies its ranking by 5 positions with high probability
I Long tail in histogram is evidence of instability ⇒ Minor perturbation generates change of 19 positions
Centrality ranking variation in an airport graph
I Real-world graph based on the air traffic between popular U.S. airports
⇒ Nodes are Nv = 25 popular airports ⇒ Edge weights are the number of yearly passengers between them
[Figure: probability of a maximum ranking change ≥ 1 vs. perturbation size 1000 × δ (left) and histogram of the maximum change in ranking (right), for C_D, C_B, C_E, and C_SB]
I Betweenness centrality still presents the largest variations
Centrality, link analysis and web search
Centrality measures
Case study: Stability of centrality measures in weighted graphs
Centrality, link analysis and web search
A primer on Markov chains
PageRank as a random walk
PageRank algorithm leveraging Markov chain structure
The problem of ranking websites
I Search engines rank pages by looking at the Web itself ⇒ Enough information intrinsic to the Web and its structure
I Information retrieval is a historically difficult problem ⇒ Keywords vs complex information needs (synonymy, polysemy)
I Beyond the explosion in scale, unique issues arose with the Web
I Diversity of authoring styles, people issuing queries
I Dynamic and constantly changing content
I Paradigm: from scarcity to abundance
I Finding and indexing documents that are relevant is ‘easy’
I Q: Which few of these should the engine recommend? ⇒ Key is understanding Web structure, i.e., link analysis
Voting by in-links
Ex: Suppose we issue the query ‘newspapers’
I First, use text-only information retrieval to identify relevant pages
I Idea: Links suggest implicit endorsements of other relevant pages
I Count in-links to assess the authority of a page on ‘newspapers’
A list-finding technique
I Query also returns pages that compile lists of relevant resources
I These hubs voted for many highly endorsed (authoritative) pages
I Idea: Good lists have a better sense of where the good results are
I Page’s hub value is the sum of votes received by its linked pages
Repeated improvement
I Reasonable to give more weight to the votes of pages that score well as lists
⇒ Recompute each page's votes by summing the list values of the pages linking to it
I Q: Why stop here? Use also improved votes to refine the list scores ⇒ Principle of repeated improvement
Hubs and authorities
I Relevant pages fall in two categories: hubs and authorities
I Authorities are pages with useful, relevant content
I Newspaper home pages
I Course home pages
I Auto manufacturer home pages
I Hubs are ‘expert’ lists pointing to multiple authorities
I List of newspapers
I Course bulletin
I List of US auto manufacturers
I Rules: Authorities and hubs have a mutual reinforcement relationship ⇒ A good hub links to multiple good authorities ⇒ A good authority is linked from multiple good hubs
Hubs and authorities ranking algorithm
I Hyperlink-Induced Topic Search (HITS) algorithm [Kleinberg’98]
I Each page v ∈ V has a hub score h_v and authority score a_v
⇒ Network-wide vectors h = [h_1, …, h_{Nv}]ᵀ, a = [a_1, …, a_{Nv}]ᵀ
Authority update rule:
a_v(k) = Σ_{(u,v) ∈ E} h_u(k − 1), for all v ∈ V  ⇔  a(k) = Aᵀ h(k − 1)
Hub update rule:

    h_v(k) = Σ_{(v,u) ∈ E} a_u(k), for all v ∈ V  ⇔  h(k) = A a(k)

I Initialize h(0) = (1/√Nv) 1, normalize a(k) and h(k) each iteration
Limiting values
I Define the hub and authority rankings as
a := lim_{k→∞} a(k),  h := lim_{k→∞} h(k)
I From the HITS update rules one finds for k = 0, 1,...
a(k + 1) = AᵀA a(k) / ‖AᵀA a(k)‖,  h(k + 1) = AAᵀ h(k) / ‖AAᵀ h(k)‖
I Power iterations converge to dominant eigenvectors of AᵀA and AAᵀ

    AᵀA a = α_a⁻¹ a,  AAᵀ h = α_h⁻¹ h

⇒ Hub and authority ranks are eigenvector centrality measures
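The HITS iterations above can be sketched with NumPy (illustrative names; the tiny graph is an assumption for demonstration):

```python
import numpy as np

def hits(A, iters=100):
    """HITS power iterations: a <- A^T h, h <- A a, normalizing each step."""
    n = A.shape[0]
    h = np.ones(n) / np.sqrt(n)
    for _ in range(iters):
        a = A.T @ h
        a /= np.linalg.norm(a)
        h = A @ a
        h /= np.linalg.norm(h)
    return a, h

# Directed graph: pages 0 and 1 both link to page 2 (A[u, v] = 1 iff u -> v)
A = np.array([[0., 0., 1.],
              [0., 0., 1.],
              [0., 0., 0.]])
a, h = hits(A)
print(a)  # authority concentrates on page 2
print(h)  # pages 0 and 1 are equally good hubs
```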
Link analysis beyond the web
Ex: link analysis of citations among US Supreme Court opinions
I Rise and fall of authority of key Fifth Amendment cases [Fowler-Jeon’08]
PageRank
I Node rankings to measure website relevance, social influence
I Key idea: in-links as votes, but ‘not all links are created equal’ ⇒ How many links point to a node (outgoing links irrelevant) ⇒ How important are the links that point to a node
I PageRank key to Google’s original ranking algorithm [Page-Brin’98]
I Intuition 1: fluid that percolates through the network
⇒ Eventually accumulates at the most relevant Web pages
I Intuition 2: random web surfer (more soon)
⇒ In the long run, relevant Web pages are visited more often
I The success of PageRank and HITS was quite different after 1998
Basic PageRank update rule
I Each page v ∈ V has PageRank r_v, let r = [r_1, …, r_{Nv}]ᵀ
⇒ Define P := (D^out)⁻¹ A, where D^out is the out-degree matrix
PageRank update rule:
r_v(k) = Σ_{(u,v) ∈ E} r_u(k − 1) / d_u^out, for all v ∈ V  ⇔  r(k) = Pᵀ r(k − 1)
I Split current PageRank evenly among outgoing links and pass it on ⇒ New PageRank is the total fluid collected in the incoming links
⇒ Initialize r(0) = (1/Nv) 1. Flow is conserved, so no normalization needed
I Problem: 'spider traps'
I Accumulate all PageRank
I Arise only when G is not strongly connected
Scaled PageRank update rule
I Apply the basic PageRank rule and scale the result by s ∈ (0, 1)
⇒ Split the leftover (1 − s) evenly among all nodes (evaporation-rain)
Scaled PageRank update rule:
r_v(k) = s Σ_{(u,v) ∈ E} r_u(k − 1) / d_u^out + (1 − s)/Nv, for all v ∈ V
I Can view as the basic update r(k) = P̄ᵀ r(k − 1) with

    P̄ := s P + (1 − s) 11ᵀ / Nv

⇒ Scaling factor s typically chosen between 0.8 and 0.9
⇒ Power iteration converges to the dominant eigenvector of P̄ᵀ
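The scaled update is a few lines of NumPy (a sketch with illustrative names; it assumes every node has at least one outgoing link, as the definition P = (D^out)⁻¹ A requires):

```python
import numpy as np

def scaled_pagerank(A, s=0.85, iters=100):
    """Power iteration r(k) = P_bar^T r(k-1), with P_bar = s P + (1-s) 11^T / Nv.

    Assumes no dangling nodes (every row of A has at least one nonzero entry)."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)            # P = (D^out)^{-1} A
    P_bar = s * P + (1 - s) * np.ones((n, n)) / n
    r = np.ones(n) / n                              # r(0) = (1/Nv) 1
    for _ in range(iters):
        r = P_bar.T @ r
    return r

# Links: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = np.array([[0., 1., 1.],
              [0., 0., 1.],
              [1., 0., 0.]])
r = scaled_pagerank(A)
print(r, r.sum())  # page 2 collects the most rank; total mass stays at 1
```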
A primer on Markov chains
Centrality measures
Case study: Stability of centrality measures in weighted graphs
Centrality, link analysis and web search
A primer on Markov chains
PageRank as a random walk
PageRank algorithm leveraging Markov chain structure
Markov chains
I Consider discrete-time index n = 0, 1, 2,...
I Time-dependent random state X_n takes values on a countable set
I In general denote states as i = 0, 1, 2, …, i.e., here the state space is N
I If X_n = i we say "the process is in state i at time n"
I Random process is X_N; its history up to n is Xⁿ = [X_n, X_{n−1}, …, X_0]ᵀ
I Def: process X_N is a Markov chain (MC) if for all n ≥ 1, i, j ∈ N, x ∈ Nⁿ

    P(X_{n+1} = j | X_n = i, X^{n−1} = x) = P(X_{n+1} = j | X_n = i) = P_ij

I Future depends only on the current state X_n (memoryless, Markov property)
⇒ Future conditionally independent of the past, given the present
Matrix representation
I Group the P_ij in a transition probability "matrix" P

    P = [ P_00  P_01  P_02  …  P_0j  …
          P_10  P_11  P_12  …  P_1j  …
           ⋮     ⋮     ⋮        ⋮
          P_i0  P_i1  P_i2  …  P_ij  …
           ⋮     ⋮     ⋮        ⋮   ]

⇒ Not really a matrix if the number of states is infinite
I Row-wise sums should be equal to one, i.e., Σ_{j=0}^∞ P_ij = 1 for all i
Graph representation
I A graph representation or state transition diagram is also used
[State transition diagram: states …, i − 1, i, i + 1, … with self-loops P_{i−1,i−1}, P_{i,i}, P_{i+1,i+1}, forward transitions P_{i−1,i}, P_{i,i+1}, and backward transitions P_{i,i−1}, P_{i+1,i}]
I Useful when the number of states is infinite, skip arrows if P_ij = 0
I Again, sum of per-state outgoing arrow weights should be one
Example: Bipolar mood
I I can be happy (Xn = 0) or sad (Xn = 1) ⇒ My mood tomorrow is only affected by my mood today
I Model as Markov chain with transition probabilities
    P = [ 0.8  0.2
          0.3  0.7 ]

[State diagram: states H and S with self-loops 0.8 and 0.7, transition H → S w.p. 0.2 and S → H w.p. 0.3]
I Inertia ⇒ happy or sad today, likely to stay happy or sad tomorrow
I But when sad, a little less likely so (P00 > P11)
Example: Random (drunkard's) walk
I Step to the right w.p. p, to the left w.p. 1 − p
⇒ Not drunk enough to stay in the same place
[State diagram: states …, i − 1, i, i + 1, … with transitions i → i + 1 w.p. p and i → i − 1 w.p. 1 − p]
I States are 0, ±1, ±2,... (state space is Z), infinite number of states
I Transition probabilities are
P_{i,i+1} = p,  P_{i,i−1} = 1 − p
I Pij = 0 for all other transitions
Multiple-step transition probabilities
I Q: What can be said about multiple transitions?
I Probabilities of Xm+n given Xm ⇒ n-step transition probabilities
P_ij^n = P(X_{m+n} = j | X_m = i)

⇒ Define the matrix P^(n) with elements P_ij^n
Theorem
The matrix of n-step transition probabilities P^(n) is given by the n-th power of the transition probability matrix P, i.e.,

    P^(n) = Pⁿ

Henceforth we write Pⁿ
Unconditional probabilities
I All probabilities so far are conditional, i.e., P_ij^n = P(X_n = j | X_0 = i)
⇒ May want unconditional probabilities p_j(n) = P(X_n = j)
I Requires specification of initial conditions p_i(0) = P(X_0 = i)
I Using the law of total probability and the definitions of P_ij^n and p_j(n)
p_j(n) = P(X_n = j) = Σ_{i=0}^∞ P(X_n = j | X_0 = i) P(X_0 = i) = Σ_{i=0}^∞ P_ij^n p_i(0)
I In matrix form (define the vector p(n) = [p_1(n), p_2(n), …]ᵀ)

    p(n) = (Pⁿ)ᵀ p(0)
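Using the bipolar-mood chain from earlier in the slides, p(n) = (Pⁿ)ᵀ p(0) can be computed directly (a quick sketch):

```python
import numpy as np

P = np.array([[0.8, 0.2],    # happy -> happy, happy -> sad
              [0.3, 0.7]])   # sad -> happy, sad -> sad

p0 = np.array([1.0, 0.0])    # start happy for sure: p_i(0) = P(X_0 = i)

for n in (1, 5, 50):
    pn = np.linalg.matrix_power(P, n).T @ p0   # p(n) = (P^n)^T p(0)
    print(n, pn)             # approaches (0.6, 0.4) regardless of p(0)
```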
Limiting distributions
I MCs have one-step memory. Eventually they forget the initial state
I Q: What can we say about probabilities for large n?