Centralities (4)
By: Ralucca Gera, NPS

Some slides from last week that we didn't talk about in class:

PageRank algorithm
• Eigenvector centrality: i's rank score is proportional to the sum of the rank scores of all pages j that point to i:
  $x_i = \alpha \sum_{(j,i) \in E} x_j$
• Katz centrality then adds teleportation by adding a small-weight edge to each node (using a weight of $\beta$):
  $x_i = \alpha \sum_{(j,i) \in E} x_j + \beta$
• BUT, since a page j may point to many other pages, its prestige score should be shared among these pages (for example, NPS pointing to many sites):
  $x_i = \alpha \sum_{(j,i) \in E} \frac{x_j}{\deg^{out}_j} + \beta$

Matrix notation (1)
• Let $x$ be an n-dimensional column vector of PageRank values, i.e., $x = (x_1, x_2, \dots, x_n)^T$.
• Let A be the adjacency matrix of our digraph, with entries $A_{ij} = 1$ if $(j,i) \in E$ and $A_{ij} = 0$ otherwise, and let D be the diagonal matrix of out-degrees.
• Then the PageRank centrality of node i is given by:
  $x_i = \alpha \sum_j A_{ij} \frac{x_j}{\deg^{out}_j} + \beta$   or   $x = \alpha A D^{-1} x + \beta \mathbf{1}$
  where $\alpha$ is the damping factor, generally set to $\alpha = 0.85$ (more on the next page).

Matrix notation (2)
So the PageRank centrality is given by:
  $x = \alpha A D^{-1} x + \beta \mathbf{1}$
where $\alpha$ is the damping factor (generally $\alpha = 0.85$).
Recall from eigenvector centrality: $A x = \lambda x$, or $x = \frac{1}{\lambda} A x$, where $\lambda$ is the largest eigenvalue.
• Small values of $\alpha$ (close to 0): the contribution given by paths longer than one hop is small, so centrality scores are mainly influenced by in-degrees.
• Large values of $\alpha$ (close to $1/\lambda$): long paths are devalued smoothly, and centrality scores are influenced by the topology of G.
• Recommendation: choose $\alpha \in (0, 1/\lambda)$, where the centrality diverges at $\alpha = 1/\lambda$. The default is usually 0.85.

Overview
Columns: What makes a vertex central in a network? | How do you describe it mathematically? | When is it appropriate to use it? | How can we capture it?
Eigenvector centrality: Lots of one-hop connections to high-centrality vertices | A weighted degree centrality based on the weight of the neighbors | For example, when the people you are connected to matter | $x_i = \alpha \sum_j A_{ij} x_j$, where A is the in-degree matrix
Katz centrality: Lots of one-hop connections to high out-degree vertices | A weighted degree centrality based on the out-degree of the neighbors | Directed graphs that are not strongly connected | $x_i = \alpha \sum_j A_{ij} x_j + \beta$, where $\beta$ is some small weight for each node
PageRank: As above, but distribute the weight that a node has to the nodes it points to | As above, but distributing the wealth of a node to the ones it points to | | $x_i = \alpha \sum_j A_{ij} \frac{x_j}{\deg^{out}_j} + \beta$ or $x = \alpha A D^{-1} x + \beta \mathbf{1}$

PageRank (PR) is one of the best-known and most influential algorithms for computing the relevance of web pages.

An example as just described
Problem vertex: vertex 5 has no outgoing links (column 5 of A is all zeros). Recall that the problem with vertices with in-degree = 0 was solved by using $\beta$.
In-degree matrix (each row shows the in-degree, each column shows the out-degree):
  $A = \begin{pmatrix} 0&1&0&0&0&0 \\ 1&0&1&0&0&0 \\ 1&1&0&1&0&0 \\ 0&0&0&0&0&1 \\ 0&0&0&1&0&1 \\ 0&0&0&1&0&0 \end{pmatrix}$
  $x_i = \alpha \sum_j A_{ij} \frac{x_j}{\deg^{out}_j} + \beta$   or   $x = \alpha A D^{-1} x + \beta \mathbf{1}$
Is the formula above well defined? If not, how could we fix the formula or the matrix?

How can we fix the problem?
1. Remove those pages with no out-links during the PageRank computation, as these pages do not affect the ranking of any other page directly (these pages will get outgoing links in the future).
2. Add a complete set of outgoing links from each such page i to all the pages on the Web.
The second choice is used in PR, since the matrix may get updated. The in-degree matrix becomes (each row shows the in-degree, each column shows the out-degree):
  $A = \begin{pmatrix} 0&1&0&0&1&0 \\ 1&0&1&0&1&0 \\ 1&1&0&1&1&0 \\ 0&0&0&0&1&1 \\ 0&0&0&1&1&1 \\ 0&0&0&1&1&0 \end{pmatrix}$

How can we fix the out-degree = 0?
With the in-degree matrix A above, the inverse of the out-degree matrix is:
  $D^{-1} = \begin{pmatrix} 1/2&0&0&0&0&0 \\ 0&1/2&0&0&0&0 \\ 0&0&1/1&0&0&0 \\ 0&0&0&1/3&0&0 \\ 0&0&0&0&1/6&0 \\ 0&0&0&0&0&1/2 \end{pmatrix}$
so the PR centrality formula $x = \alpha A D^{-1} x + \beta \mathbf{1}$ is well defined.

By multiplying them we obtain the matrix that:
1. Captures the in- and out-degree per vertex.
2. Divides the centrality of each vertex by its out-degree.
  $A D^{-1} = \begin{pmatrix} 0&1/2&0&0&1/6&0 \\ 1/2&0&1&0&1/6&0 \\ 1/2&1/2&0&1/3&1/6&0 \\ 0&0&0&0&1/6&1/2 \\ 0&0&0&1/3&1/6&1/2 \\ 0&0&0&1/3&1/6&0 \end{pmatrix}$
The contribution of node 5 is insignificant, and the formula is now well defined.

Transition probability matrix
• This modified matrix is called the state transition probability matrix. Denote its entries by $p_{ij}$:
  $A D^{-1} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{pmatrix}$
• $p_{ij}$ represents the transition probability that the surfer in state j (page j) will move to state i (page i), so each column sums to 1.
• Here is an example:

A small Internet consisting of just 4 websites
$p_{ij}$ represents the transition probability that the surfer on page j will move to page i:
  $A D^{-1} = (p_{ij}) = \begin{pmatrix} 0&0&1&1/2 \\ 1/3&0&0&0 \\ 1/3&1/2&0&1/2 \\ 1/3&1/2&0&0 \end{pmatrix}$
Random surfer: each page has equal probability 1/4 to be chosen as a starting point, so $x = (1/4, 1/4, 1/4, 1/4)^T$.
The probability that page i will be visited after k steps (i.e., that the random surfer ends up at page i) is equal to the i-th entry of $(A D^{-1})^k x$.
Simplification for this example: no $\beta$ was involved, since the in-degree of i is greater than 0 for all i.
Source: http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html

Overview (updated)
Columns: What makes a vertex central in a network? | How do you describe it mathematically? | When is it appropriate to use it? | How can we capture it?
Eigenvector centrality: Lots of one-hop connections to high-centrality vertices | A weighted degree centrality based on the weight of the neighbors | For example, when the people you are connected to matter | $x_i = \alpha \sum_{(j,i) \in E} x_j$
Katz centrality: Lots of one-hop connections to high out-degree vertices | A weighted degree centrality based on the out-degree of the neighbors | Directed graphs that are not strongly connected | $x_i = \alpha \sum_{(j,i) \in E} x_j + \beta$, where $\beta$ is some initial weight
PageRank: As above, but distribute the weight that a node has to the nodes it points to | As above, but distributing the wealth of a node to the ones it points to | | $x_i = \alpha \sum_{(j,i) \in E} \frac{x_j}{\deg^{out}_j} + \beta$, where $\deg^{out}_j = \max\{1, \text{out-degree of node } j\}$

Some comments
• Newman's book gives:
  $x_i = \alpha \sum_j A_{ij} \frac{x_j}{\deg^{out}_j} + \beta$
  where $\alpha$ is called the damping factor, which can be set between 0 and 1 (or, in general, up to the inverse of the largest eigenvalue).
• The formula in the original PageRank is:
  $x_i = (1 - d) + d \sum_{(j,i) \in E} \frac{x_j}{\deg^{out}_j}$
  where d is the damping factor (d = 0.85 by default).
• Gephi: the default value for the probability parameter is 0.85, and Epsilon is the criterion for eigenvector convergence based on the power method.

Final Points on PageRank
• Fighting spam.
  – A page is important if the pages pointing to it are important.
  – Since it is not easy for a Web page owner to add in-links to his/her page from other important pages, it is not easy to influence PageRank.
• PageRank is a global measure and is query independent.
  – The PageRank values of all the pages are computed and saved off-line rather than at query time, so lookups are fast.
• Criticism:
  – There are companies that can increase your PageRank by adding your page to a cluster and increasing its in-degree.
  – It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.
• But in practice it works together with the keyword search.

Betweenness Centrality
Some pages are adapted from Dan Ryan, Mills College.

Different types of centralities:
• Betweenness Centrality
• Closeness Centrality
• Eigenvector Centrality
• Degree Centrality
Source: Discovering Sets of Key Players in Social Networks, Daniel Ortiz-Arroyo, Springer 2010

Betweenness Centrality
• Intuition: how many pairs of individuals would have to go through you in order to reach one another in the minimum number of hops?
• Interactions between two individuals depend on the other individuals in the set of nodes. The nodes in the middle have some control over the paths in the graph.
• Useful for flow, such as information or data packets.

Assumptions
• When there is more than one geodesic, all geodesics are equally likely to be used.
• Flow takes the shortest path (we'll look at alternatives).
• Every pair of nodes in G exchanges a message with equal probability per unit time.
• Question: How many messages, on average, will have passed through each vertex en route to their destination?
  – A node's betweenness is computed over all pairs of nodes, including the pairs that contain the node in question.

Meaning of betweenness centrality
• Vertices with high betweenness centrality have influence in the network by virtue of their control over information passing between others.
  – They get to see the messages as they pass through.
  – They could get paid for passing the message along.
Thus they get a lot of power: their removal would disrupt communication.
How would you capture it in a mathematical formula?

Formula for betweenness centrality
  $x_i = \sum_{s,t} n^i_{st}$
where $n^i_{st}$ is the number of s-t geodesics that i belongs to.
• By default, i could equal s or t; in other versions it cannot, and that is where you see 0 values.
• In an undirected graph, an s-t geodesic is the same as a t-s geodesic, so each such path gets counted twice.
• It is applicable to directed networks as well.
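
The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own implementation: the names `bfs_counts`, `betweenness`, and `include_endpoints` are hypothetical, and the graph is assumed to be an unweighted adjacency dict (for a directed graph, list out-neighbors). Following the "all geodesics are equally likely" assumption, each s-t pair contributes the fraction of its geodesics that pass through i, which reduces to a plain count when the geodesic is unique.

```python
from collections import deque


def bfs_counts(adj, s):
    """Breadth-first search from s.

    Returns (dist, sigma): hop distance from s to every reachable node,
    and the number of geodesics (shortest paths) from s to that node.
    """
    dist = {s: 0}
    sigma = {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:            # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u is a predecessor of v on a geodesic
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(adj, include_endpoints=True):
    """Sum over ordered pairs (s, t), s != t, of the fraction of s-t
    geodesics through i. Assumption: pairs with s == t are skipped."""
    nodes = list(adj)
    info = {s: bfs_counts(adj, s) for s in nodes}
    score = {i: 0.0 for i in nodes}
    for s in nodes:
        dist_s, sigma_s = info[s]
        for t in nodes:
            if t == s or t not in dist_s:
                continue                 # no s-t path, nothing to count
            for i in nodes:
                if not include_endpoints and i in (s, t):
                    continue
                dist_i, sigma_i = info[i]
                # i lies on an s-t geodesic iff d(s,i) + d(i,t) == d(s,t)
                if i in dist_s and t in dist_i and dist_s[i] + dist_i[t] == dist_s[t]:
                    score[i] += sigma_s[i] * sigma_i[t] / sigma_s[t]
    return score


# Undirected path a - b - c, stored with both directions of each edge.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness(path))                           # {'a': 4.0, 'b': 6.0, 'c': 4.0}
print(betweenness(path, include_endpoints=False))  # {'a': 0.0, 'b': 2.0, 'c': 0.0}
```

On the path graph, the middle vertex b scores 2 without endpoints because the a-c geodesic is counted once in each direction, and the end vertices drop to 0 exactly as the note on excluding endpoints suggests. The triple loop is cubic in the number of nodes and is meant for small teaching examples only.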