
Centralities (4)

By: Ralucca Gera, NPS

Some slides from last week that we didn’t talk about in class:

PageRank

• Eigenvector: i’s rank score $x_i$ is (up to the scaling constant α) the sum of the rank scores of all pages j that point to i:

$$x_i = \alpha \sum_{(j,i) \in E} x_j$$

• Then add teleportation by adding a small-weight edge to each node (using a weight of β):

$$x_i = \alpha \sum_{(j,i) \in E} x_j + \beta$$

• BUT, since a page j may point to many other pages, its prestige score should be shared among these pages (for example, NPS pointing to many sites):

$$x_i = \alpha \sum_{(j,i) \in E} \frac{x_j}{\deg^{out}_j} + \beta$$

Matrix notation (1)

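• Let $\mathbf{x}$ be an n-dimensional column vector of PageRank values, i.e., $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$.
• Let A be the adjacency matrix of our digraph, with entries $A_{ij} = 1$ if $(j,i) \in E$ and $A_{ij} = 0$ otherwise.
• Then the PageRank centrality of node i is given by:

$$x_i = \alpha \sum_j A_{ij} \frac{x_j}{\deg^{out}_j} + \beta \quad \text{or} \quad \mathbf{x} = \alpha A D^{-1} \mathbf{x} + \beta \mathbf{1}$$

where D is the diagonal matrix of out-degrees and α is the damping factor, generally set to α = 0.85 (more on the next page).

Matrix notation (2)

So the PageRank centrality of node i is given by:

$$x_i = \alpha \sum_j A_{ij} \frac{x_j}{\deg^{out}_j} + \beta$$

where α is the damping factor (generally α = 0.85).

Recall from eigenvector centrality: $A\mathbf{x} = \lambda \mathbf{x}$, or $\mathbf{x} = \frac{1}{\lambda} A \mathbf{x}$.

• Small values of α (close to 0): the contribution given by paths longer than one hop is small, so centrality scores are mainly influenced by in-degrees.
• Large values of α (close to $1/\lambda_1$, where $\lambda_1$ is the largest eigenvalue): long paths are devalued smoothly, and centrality scores are influenced by the topology of G.
• Recommendation: choose $\alpha \in (0, 1/\lambda_1)$; the centrality diverges at $\alpha = 1/\lambda_1$. The default is usually 0.85.

Overview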

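For each centrality: What makes a vertex central in a network? How do you describe it mathematically? When is it appropriate to use it? How can we capture it?

Eigenvector centrality
• What makes a vertex central: lots of one-hop connections to high-centrality vertices.
• Mathematical description: a weighted degree centrality based on the weight of the neighbors.
• When to use it: for example, when the people you are connected to matter.
• How to capture it: $x_i = \alpha \sum_j A_{ij} x_j$, where A is the in-degree matrix.

Katz
• What makes a vertex central: lots of one-hop connections to high out-degree vertices.
• Mathematical description: a weighted degree centrality based on the out-degree of the neighbors.
• When to use it: directed graphs that are not strongly connected.
• How to capture it: $x_i = \alpha \sum_j A_{ij} x_j + \beta$, where β is some small weight for each node.

PageRank
• What makes a vertex central: as above, but distribute the weight that a node has to the nodes it points to.
• Mathematical description: as above, but distributing the wealth of a node to the ones it points to.
• How to capture it: $x_i = \alpha \sum_j A_{ij} \frac{x_j}{\deg^{out}_j} + \beta$ or $\mathbf{x} = \alpha A D^{-1} \mathbf{x} + \beta \mathbf{1}$.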
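As a rough illustration of the PageRank entry in the overview, the following Python sketch repeatedly applies the update $x_i = \alpha \sum_{(j,i) \in E} x_j / \deg^{out}_j + \beta$ on a small made-up digraph. The edge list, the function name `pagerank`, the choice β = (1 − α)/n, and the fixed number of iterations are all assumptions made for this example; they are not taken from the slides.

```python
def pagerank(edges, n, alpha=0.85, iterations=100):
    """Iterate x_i = alpha * sum_{(j,i) in E} x_j / outdeg_j + beta.

    edges: list of (j, i) pairs meaning "page j points to page i".
    beta = (1 - alpha) / n is an assumed choice that keeps the scores summing to 1
    when every node has at least one out-link.
    """
    out_deg = [0] * n
    for j, _ in edges:
        out_deg[j] += 1
    # Guard against division by zero for nodes without out-links.
    out_deg = [max(1, d) for d in out_deg]

    beta = (1 - alpha) / n
    x = [1 / n] * n  # start from the uniform vector
    for _ in range(iterations):
        new_x = [beta] * n
        for j, i in edges:
            # Node j shares its score equally among the nodes it points to.
            new_x[i] += alpha * x[j] / out_deg[j]
        x = new_x
    return x


# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
scores = pagerank(edges, n=3)
print([round(s, 4) for s in scores])
```

In this toy graph, page 2 receives links from both other pages and ends up with the highest score, while page 1, which only receives half of page 0's score, ends up lowest.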

PageRank (PR) is the best-known and most influential algorithm for computing the relevance of web pages.

An example as just described:

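Problem vertex: vertex 5 (no outgoing links)

Recall that the problem with vertices with in-degree = 0 was solved by using β:

$$x_i = \alpha \sum_j A_{ij} \frac{x_j}{\deg^{out}_j} + \beta \quad \text{or} \quad \mathbf{x} = \alpha A D^{-1} \mathbf{x} + \beta \mathbf{1}$$

In-degree matrix (each row shows the in-degree, each column shows the out-degree):

$$A = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{pmatrix}$$

Is the formula above well defined? If not, how could we fix the formula or the matrix?

How can we fix the problem?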

1. Remove those pages with no out-links during the PageRank computation, as these pages do not affect the ranking of any other page directly (these pages will get outgoing links in the future).
2. Add a complete set of outgoing links from each such page i to all the pages on the Web.

The second choice is used in PR, since the matrix may get updated.

In-degree matrix after the fix (each row shows the in-degree, each column shows the out-degree):

$$A = \begin{pmatrix} 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 0 \end{pmatrix}$$

How can we fix the out-degree = 0?

010010  α β 101010 110110 A   000011 000111  000110 1/2 0 0 0 0 0 in-degree matrix  01/20 0 0 0 001/1000 D-1   0001/300 00001/60 Inverse of the out-degree matrix  000001/29 PR centrality formula is well defined

By multiplying them we obtain the matrix that captures:
1. The in- and out-degree per vertex
2. The division of the centrality of each vertex by its out-degree

$$\mathbf{x} = \alpha A D^{-1} \mathbf{x} + \beta \mathbf{1}$$

Product of the in-degree matrix A and the inverse of the out-degree matrix $D^{-1}$:

$$AD^{-1} = \begin{pmatrix} 0 & 1/2 & 0 & 0 & 1/6 & 0 \\ 1/2 & 0 & 1 & 0 & 1/6 & 0 \\ 1/2 & 1/2 & 0 & 1/3 & 1/6 & 0 \\ 0 & 0 & 0 & 0 & 1/6 & 1/2 \\ 0 & 0 & 0 & 1/3 & 1/6 & 1/2 \\ 0 & 0 & 0 & 1/3 & 1/6 & 0 \end{pmatrix}$$

The contribution of node 5 is insignificant, and the formula is now well defined.

Transition probability matrix

• This modified matrix is called the state transition probability matrix. Denote its entries by $p_{ij}$:

$$AD^{-1} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{pmatrix}$$

• $p_{ij}$ represents the transition probability that the surfer in state j (page j) will move to state i (page i).
• Here is an example:

A small Internet consisting of just 4 websites

Source: http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html

A small Internet consisting of just 4 websites

$p_{ij}$ represents the transition probability that the surfer on page j will move to page i:

$$AD^{-1} = (p_{ij}) = \begin{pmatrix} 0 & 0 & 1 & 1/2 \\ 1/3 & 0 & 0 & 0 \\ 1/3 & 1/2 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$$

Source: http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html

A small Internet consisting of just 4 websites

Random surfer: each page has equal probability ¼ to be chosen as a starting point.

$$AD^{-1} = (p_{ij}) = \begin{pmatrix} 0 & 0 & 1 & 1/2 \\ 1/3 & 0 & 0 & 0 \\ 1/3 & 1/2 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$$

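The probability that page i will be visited after k steps (i.e., the random surfer ends up at page i) is equal to the i-th entry of $(AD^{-1})^k \mathbf{x}$, where $\mathbf{x} = (1/4, 1/4, 1/4, 1/4)^T$ is the starting distribution.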

Simplification for this example: no β was involved, since the in-degree of page i is greater than 0 for all i.
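The random-surfer computation for the 4-website example can be sketched in a few lines of Python. The function names `step` and `surf` and the choice of 100 steps are assumptions made for illustration; the matrix and the uniform starting vector are the ones given above.

```python
# Transition matrix of the 4-website example: M[i][j] is the probability
# that a surfer on page j+1 moves to page i+1 (each column sums to 1).
M = [
    [0.0, 0.0, 1.0, 0.5],
    [1 / 3, 0.0, 0.0, 0.0],
    [1 / 3, 0.5, 0.0, 0.5],
    [1 / 3, 0.5, 0.0, 0.0],
]


def step(M, x):
    """One step of the random surfer: return the matrix-vector product M x."""
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]


def surf(M, x, k):
    """Distribution after k steps, i.e. M^k x."""
    for _ in range(k):
        x = step(M, x)
    return x


start = [0.25, 0.25, 0.25, 0.25]   # each page equally likely as a starting point
after_one = surf(M, start, 1)
limit = surf(M, start, 100)

print([round(p, 4) for p in after_one])
print([round(p, 4) for p in limit])
```

After many steps the distribution settles at the vector proportional to (12, 4, 9, 6), which satisfies $AD^{-1}\mathbf{x} = \mathbf{x}$ for this matrix, so page 1 is visited most often.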

Source: http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html

Overview Updated!

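For each centrality: What makes a vertex central in a network? How do you describe it mathematically? When is it appropriate to use it? How can we capture it?

Eigenvector centrality
• What makes a vertex central: lots of one-hop connections to high-centrality vertices.
• Mathematical description: a weighted degree centrality based on the weight of the neighbors.
• When to use it: for example, when the people you are connected to matter.
• How to capture it: $x_i = \alpha \sum_{(j,i) \in E} x_j$

Katz
• What makes a vertex central: lots of one-hop connections to high out-degree vertices.
• Mathematical description: a weighted degree centrality based on the out-degree of the neighbors.
• When to use it: directed graphs that are not strongly connected.
• How to capture it: $x_i = \alpha \sum_{(j,i) \in E} x_j + \beta$, where β is some initial weight.

PageRank
• What makes a vertex central: as above, but distribute the weight that a node has to the nodes it points to.
• Mathematical description: as above, but distributing the wealth of a node to the ones it points to.
• How to capture it: $x_i = \alpha \sum_{(j,i) \in E} \frac{x_j}{\deg^{out}_j} + \beta$, where $\deg^{out}_j = \max\{1, \text{out-degree of node } j\}$.

Some comments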

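• Newman’s book gives:

$$x_i = \alpha \sum_j A_{ij} \frac{x_j}{\deg^{out}_j} + \beta,$$

where α is called the damping factor, which can be set between 0 and 1 (more precisely, below the reciprocal of the largest eigenvalue of $AD^{-1}$).
• And the formula in the original PageRank is:

$$x_i = (1 - d) + d \sum_{(j,i) \in E} \frac{x_j}{\deg^{out}_j},$$

where d is the damping factor (d = 0.85 by default).
• Gephi: the default value for the Probability parameter is 0.85, and Epsilon is the stopping criterion for eigenvector convergence based on the power method.

Final Points on PageRank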

• Fighting spam.
  – A page is important if the pages pointing to it are important.
  – Since it is not easy for a Web page owner to add in-links to his/her page from other important pages, it is not easy to influence PageRank.
• PageRank is a global measure and is query independent.
  – The PageRank values of all the pages are computed and saved off-line rather than at query time, which makes queries fast.
• Criticism:
  – There are companies that can increase your page’s PageRank by adding it to a cluster and increasing its in-degree.
  – It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.
• But it works based on the keyword search.

Some pages are adapted from Dan Ryan, Mills College

Different types of centrality:

Betweenness Centrality

Closeness Centrality

Eigenvector Centrality

Degree Centrality

Source: Discovering Sets of Key Players in Social Networks, Daniel Ortiz-Arroyo, Springer 2010

Betweenness Centrality

• Intuition: how many pairs of individuals would have to go through you in order to reach one another in the minimum number of hops?
• Interactions between two individuals depend on the other individuals in the set of nodes. The nodes in the middle have some control over the paths in the graph.
• Useful for flow, such as information or data packets

Assumptions

• When there is more than one geodesic (shortest path), all are equally likely to be used.
• Flow takes the shortest path (we’ll look at alternatives).
• Every pair of nodes in G exchanges a message with equal probability per unit time.
• Question: How many messages, on average, will have passed through each vertex en route to their destination?
  – A node’s betweenness is given by all pairs of nodes, including the node in question.

Meaning of betweenness centrality

• Vertices with high betweenness centrality have influence in the network by virtue of their control over information passing between others.

– They get to see the messages as they pass through
– They could get paid for passing the message along

Thus they get a lot of power: their removal would disrupt communication.

How would you capture it in a mathematical formula?

Formula for betweenness centrality

$$b_i = \sum_{s,t \in V} n^i_{st},$$

where $n^i_{st}$ is the number of s-t geodesics that i belongs to (default: i could equal s or t; in other versions it cannot, and that’s where you see 0 values).
– In an undirected graph, an s-t geodesic is the same as a t-s geodesic, so it gets counted twice.

It is applicable to directed networks as well.

Bounds for disconnected graphs

Let G be a disconnected graph:
• What is the minimum value of betweenness centrality a vertex can have in disconnected graphs?
  – an isolated vertex: 0
• What is the maximum value of betweenness centrality a vertex can have in disconnected graphs?
  – center of a star component: let the component be a star on m vertices with center node c. Then there are $m^2$ pairs of nodes, from which we take away the $m - 1$ paths from each leaf to itself, since c is not on them: $m^2 - (m - 1)$.

Bounds for connected graphs

Let G be connected, with n vertices:
• What is the minimum value of betweenness centrality a vertex can have in connected graphs?
  – A leaf x would have it: $(n-1) + (n-1) + 1 = 2n - 1$, where we have $n - 1$ paths from x to each other vertex, $n - 1$ more paths from each other vertex to x, and finally one path from x to x.

• What is the maximum value of betweenness centrality a vertex can have in connected graphs?
  – center of a star: $n^2 - (n - 1) = n^2 - n + 1$, which is the number of pairs of nodes minus the paths from a leaf to itself.
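These bounds can be checked numerically with a short Python sketch. It counts, for every ordered pair (s, t) with s = t allowed and endpoints included, the fraction of s-t geodesics passing through each vertex; in a tree such as a star every pair has exactly one geodesic, so this is a plain count. The function names `bfs_counts` and `betweenness` and the 5-vertex star are assumptions made for this illustration.

```python
from collections import deque
from fractions import Fraction


def bfs_counts(adj, s):
    """Return (dist, sigma): shortest distances from s and the number of geodesics."""
    dist = {s: 0}
    sigma = {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:          # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u lies on a geodesic from s to v
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """Sum over ordered pairs (s, t) of the fraction of s-t geodesics through i.

    Endpoints count as lying on the path, and the pair (s, s) contributes 1 to s.
    """
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {i: Fraction(0) for i in adj}
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in dist_s:                      # only reachable targets
            for i in adj:
                dist_i, sigma_i = info[i]
                if i in dist_s and t in dist_i and dist_s[i] + dist_i[t] == dist_s[t]:
                    score[i] += Fraction(sigma_s[i] * sigma_i[t], sigma_s[t])
    return score


# Star on n = 5 vertices: vertex 0 is the center, vertices 1..4 are leaves.
n = 5
star = {0: list(range(1, n))}
star.update({leaf: [0] for leaf in range(1, n)})

star_scores = betweenness(star)
print(star_scores[0], star_scores[1])
```

For n = 5 the center scores n² − n + 1 = 21 and each leaf scores 2n − 1 = 9, matching the bounds above.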

A refined formula

$$b_i = \sum_{s,t \in V} \frac{n^i_{st}}{g_{st}},$$

where $n^i_{st}$ is the number of s-t geodesics that i belongs to, and $g_{st}$ is the number of s-t geodesics.
– Convention: if $n^i_{st} = 0$ and $g_{st} = 0$, then $n^i_{st}/g_{st} = 0$ (in an undirected graph, an s-t geodesic is the same as a t-s geodesic, so it gets counted twice).

In class activity: betweenness of A?

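• Fraction of shortest paths that include vertex A:

$$b_A = \sum_{s,t} \frac{n^A_{st}}{g_{st}}, \quad \forall s,t \in V$$

Number of shortest paths between each pair of vertices:

$$\begin{array}{c|ccccccc} & A & B & C & D & E & F & G \\ \hline A & - & 1 & 1 & 1 & 4 & 1 & 1 \\ B & & - & 4 & 1 & 1 & 1 & 1 \\ C & & & - & 1 & 1 & 1 & 1 \\ D & & & & - & 1 & 4 & 4 \\ E & & & & & - & 1 & 1 \\ F & & & & & & - & 1 \\ G & & & & & & & - \end{array}$$

For three of the pairs with 4 shortest paths, 1 shortest path of the 4 goes through A, so $b_A = \frac{1}{4} + \frac{1}{4} + \frac{1}{4} = 0.75$.

A normalized refined formula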

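$$b_i = \Big( \sum_{s,t \in V} \frac{n^i_{st}}{g_{st}} \Big) / n^2$$

(dividing by the total number of ordered pairs of vertices), where $n^i_{st}$ is the number of s-t geodesics that i belongs to, and $g_{st}$ is the number of s-t geodesics.
– Convention: if $n^i_{st} = 0$ and $g_{st} = 0$, then $n^i_{st}/g_{st} = 0$.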

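Another normalized formula

$$b_i = \Big( \sum_{s,t \in V} \frac{n^i_{st}}{g_{st}} \Big) / (n^2 - n + 1)$$

(dividing by the maximum possible value, attained by the center of a star), where $n^i_{st}$ is the number of s-t geodesics that i belongs to, and $g_{st}$ is the number of s-t geodesics.
– Convention: if $n^i_{st} = 0$ and $g_{st} = 0$, then $n^i_{st}/g_{st} = 0$.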

Betweenness Centrality

• Used generally for information flow
• Typically distributed over a wide range
• Betweenness only uses geodesic paths
• Information can also flow on longer paths
• Sometimes we hear it through the grapevine
• While betweenness focuses just on the geodesic, flow betweenness centrality focuses on how information might flow through many different paths.

Flow betweenness centrality

$$b_i = \sum_{s,t \in V} \frac{n^i_{st}}{g_{st}},$$

BUT $n^i_{st}$ is the maximum flow transmitted from s to t through all possible paths that i belongs to, and $g_{st}$ is the maximum flow transmitted from s to t through all possible paths.
– Convention: if $n^i_{st} = 0$ and $g_{st} = 0$, then $n^i_{st}/g_{st} = 0$ (in an undirected graph, an s-t geodesic is the same as a t-s geodesic, so it gets counted twice).

Random walk betweenness centrality

$$b_i = \sum_{s,t \in V} n^i_{st},$$

BUT $n^i_{st}$ is the number of times a random walk from s to t passes through i, averaged over many repetitions of the walk.
• Note that $n^i_{st} \neq n^i_{ts}$
• A good measure for traffic that doesn’t have a particular destination

Other extensions of centralities

• How would you extend the centralities you have seen? What else would you introduce that would capture the centrality of a vertex?
• Would you use it for edges?
• This is a good time to share your thoughts
• Subgraph/subset centrality?
  – How central are you to that particular subgraph?
  – How central is the subgraph to the network?
  – If so, would you repeat the centralities seen before for that subgraph?

Overview

Local measure: degree

Relative to rest of network: closeness, betweenness, eigenvector, Katz, PageRank

How evenly is centrality distributed among nodes? hubs and authorities

You’ve learned the traditional centralities. Based on your understanding of the methodologies that create them, decide which one is appropriate to use for your application.

• Let’s practice in Gephi
• And if there is time, in Python (code online, same code as before)