Linear Response Theory for Google Matrix
EPJ manuscript No. (will be inserted by the editor)

Klaus M. Frahm and Dima L. Shepelyansky
Laboratoire de Physique Théorique, IRSAMC, Université de Toulouse, CNRS, UPS, 31062 Toulouse, France

August 26, 2019

arXiv:1908.08924v1 [cs.SI] 23 Aug 2019

Abstract. We develop the linear response theory for the Google matrix PageRank algorithm with respect to a general weak perturbation, together with a numerically efficient and accurate algorithm, called the LIRGOMAX algorithm, to compute the linear response of the PageRank with respect to this perturbation. We illustrate its efficiency on the example of the English Wikipedia network with more than 5 million articles (nodes). For a group of initial nodes (or simply a pair of nodes) this algorithm allows one to identify the effective pathway between the initial nodes, thus selecting a particular subset of nodes which are most sensitive to the weak perturbation applied to them (injection or pumping at one node and absorption of probability at another node). The further application of the reduced Google matrix algorithm (REGOMAX) allows one to determine the effective interactions between the nodes of this subset. General linear response theory has already found numerous applications in various areas of science, including statistical and mesoscopic physics. On these grounds we argue that the developed LIRGOMAX algorithm will find broad applications in the analysis of complex directed networks.

1 Introduction

Linear response theory finds a great variety of applications in statistical physics, stochastic processes, electron transport, current density correlations and dynamical systems (see e.g. [1,2,3,4,5]). In this work we apply the approach of linear response to Google matrices of directed networks with the aim to characterize nontrivial interactions between nodes.

The concept of the Google matrix and the related PageRank algorithm for the World Wide Web (WWW) was proposed by Brin and Page in 1998 [6]. A detailed description of the Google matrix construction and its properties is given in [7]. This approach can be applied to numerous situations and various directed networks [8].

Here we develop the LInear Response algorithm for GOogle MAtriX (LIRGOMAX), which applies to a very general model of a weakly perturbed Google matrix or the related PageRank algorithm. As a particular application we consider a model of injection and absorption at a small number of nodes of the network and test its efficiency on examples of the English Wikipedia network of 2017 [9]. However, the scope of the LIRGOMAX algorithm is more general. Thus, for example, it can also be applied to compute efficiently and accurately the PageRank sensitivity with respect to small modifications of individual elements of the Google matrix or its reduced version [10,11,12,13].

From a physical viewpoint the approach of injection/absorption corresponds to a small pumping probability at a certain network node (or group of nodes) and absorbing probability at another specific node (or group of nodes). In a certain sense such a procedure is reminiscent of lasing in random media, where a laser pumping at a certain frequency generates a response in complex absorbing media [14].

More specifically, we select two particular nodes, one for injection and one for absorption, for which we use the LIRGOMAX algorithm to determine a subset of most sensitive nodes involved in a pathway between these two nodes. Furthermore, we apply to this subset of nodes the REduced GOogle MAtriX (REGOMAX) algorithm developed in [10,11] and obtain in this way an effective Google matrix description between the nodes of the found pathway.

In general the REGOMAX algorithm determines effective interactions between selected nodes of a certain relatively small subset embedded in a huge global network. Its efficiency was recently demonstrated for the Wikipedia networks of politicians [11] and world universities [12], the SIGNOR network of protein-protein interactions [13] and the multiproduct world trade network of UN COMTRADE [15].

In this work our aim is to provide a first illustration of the efficiency of the LIRGOMAX algorithm combined with the reduced Google matrix analysis. For this reason we restrict our considerations in this work to the analytical description of the LIRGOMAX algorithm and the illustration of its application to two cases from the English Wikipedia network of 2017.

The paper is organized as follows: in Section 2 we provide the analytical description of the LIRGOMAX algorithm, complemented by a brief description of the Google matrix construction and the REGOMAX algorithm; in Section 3 we present certain results for two examples of the Wikipedia network; the discussion of results is given in Section 4. Additional data are also available at [16].

2 Theory of a weakly perturbed Google matrix

2.1 Google matrix construction

We first briefly recall the general construction of the Google matrix $G$ from a directed network of $N$ nodes. For this one first computes the adjacency matrix $A_{ij}$ with elements 1 if node $j$ points to node $i$ and zero otherwise. The matrix elements of $G$ have the usual form $G_{ij} = \alpha S_{ij} + (1-\alpha)/N$ [6,7,8], where $S$ is the matrix of Markov transitions with elements $S_{ij} = A_{ij}/k_{out}(j)$, with $k_{out}(j) = \sum_{i=1}^{N} A_{ij} \neq 0$ being the out-degree of node $j$ (number of outgoing links), or $S_{ij} = 1/N$ if $j$ has no outgoing links (dangling node). The parameter $0 < \alpha < 1$ is the damping factor with the usual value $\alpha = 0.85$ [7] used here. We note that for the range $0.5 \le \alpha \le 0.95$ the results are not sensitive to $\alpha$ [7,8]. This corresponds to a model of a random surfer who follows with probability $\alpha$ at random one of the links available from the actual node or jumps with probability $(1-\alpha)$ to an arbitrary other node in the network.

The right PageRank eigenvector of $G$ is the solution of the equation $GP = \lambda P$ for the unit eigenvalue $\lambda = 1$ [6,7]. The PageRank values $P(j)$ represent positive probabilities to find a random surfer on a node $j$ ($\sum_j P(j) = 1$). All nodes can be ordered by decreasing probability $P$, numbered by the PageRank index $K = 1, 2, \ldots, N$ with a maximal probability at $K = 1$ and minimal at $K = N$. The numerical computation of $P(j)$ is done efficiently with the PageRank iteration algorithm described in [6,7].

It is also useful to consider the original network with inverted direction of links. After inversion the Google matrix $G^*$ is constructed via the same procedure (using the transposed adjacency matrix) and its leading eigenvector $P^*$, determined by $G^* P^* = P^*$, is called CheiRank [17] (see also [8]). Its values $P^*(j)$ can again be ordered in decreasing order, resulting in the CheiRank index $K^*$ with the highest value of $P^*$ at $K^* = 1$ and the smallest value at $K^* = N$.

... full Google matrix $G$ between the $N_r$ selected nodes of interest. The PageRank vector $P_r$ of $G_R$ coincides with the full PageRank vector projected on the subset of nodes, up to a constant multiplicative factor due to the sum normalization. The mathematical computation of $G_R$ provides a decomposition of $G_R$ into matrix components that clearly distinguish direct from indirect interactions: $G_R = G_{rr} + G_{pr} + G_{qr}$ [11]. Here $G_{rr}$ is given by the direct links between the selected $N_r$ nodes in the global $G$ matrix with $N$ nodes. $G_{pr}$ is a rank one matrix whose columns are rather close (up to a constant factor) to the reduced PageRank vector $P_r$. Even though the numerical weight of $G_{pr}$ is typically quite large, it does not give much new interesting information about the reduced effective network structure.

The most interesting role is played by $G_{qr}$, which takes into account all indirect links between selected nodes happening due to multiple pathways via the global network nodes $N$ (see [10,11]). The matrix $G_{qr} = G_{qrd} + G_{qr}^{(nd)}$ has diagonal ($G_{qrd}$) and non-diagonal ($G_{qr}^{(nd)}$) parts, with $G_{qr}^{(nd)}$ describing indirect interactions between selected nodes. The exact formulas and the numerical algorithm for an efficient numerical computation of all three components of $G_R$ are given in [10,11]. It is also useful to compute the weights $W_R$, $W_{pr}$, $W_{rr}$, $W_{qr}$ of $G_R$ and its 3 matrix components $G_{pr}$, $G_{rr}$, $G_{qr}$, each given by the sum of all elements of the corresponding matrix divided by the matrix size $N_r$. Due to the column sum normalization of $G_R$ we obviously have $W_R = W_{rr} + W_{pr} + W_{qr} = 1$.

2.3 General model of linear response

We consider a Google matrix $G(\varepsilon)$ (with non-negative matrix elements satisfying the usual column sum normalization) depending on a small parameter $\varepsilon$ and a general stochastic process $P(t+1) = G(\varepsilon)\, F(\varepsilon, P(t))$, where $F(\varepsilon, P)$ is a general function of $\varepsilon$ and $P$ which does NOT need to be linear in $P$. (Here $P(t)$ denotes a time dependence of the vector $P$; below and for the rest of this paper we will use the notation $P(j)$ for the $j$th component of the vector $P$.)

Let $E^T = (1, \ldots, 1)$ be the usual vector with unit entries. Then the condition of column sum normalization of $G(\varepsilon)$ reads $E^T G(\varepsilon) = E^T$. The function $F(\varepsilon, P)$ should satisfy the condition $E^T F(\varepsilon, P) = 1$ if $E^T P = 1$. At $\varepsilon = 0$ we also require that $F(0, P) = P$, i.e. $F(0, P)$ is the identity operation on $P$. We denote by $G_0 = G(0)$ the Google
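As an illustration of the Google matrix construction and the PageRank iteration recalled in Section 2.1, the following minimal Python sketch builds $G$ from a small adjacency matrix and computes PageRank and CheiRank by repeated multiplication. It is only a sketch under stated assumptions and not the authors' implementation: the function names `google_matrix` and `pagerank`, the 4-node toy network, the tolerance and the dense NumPy storage are all illustrative choices. For a network of the size of Wikipedia one would presumably need sparse storage instead.

```python
import numpy as np

def google_matrix(A, alpha=0.85):
    """Return G = alpha*S + (1 - alpha)/N, with A[i, j] = 1 if node j points to node i."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    k_out = A.sum(axis=0)                  # out-degree of node j = sum over column j
    S = np.full((N, N), 1.0 / N)           # dangling nodes keep the uniform column 1/N
    linked = k_out > 0
    S[:, linked] = A[:, linked] / k_out[linked]
    return alpha * S + (1.0 - alpha) / N

def pagerank(G, tol=1e-12, max_iter=1000):
    """Iterate P -> G P from the uniform vector until the L1 change drops below tol."""
    N = G.shape[0]
    P = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        P_next = G @ P
        if np.abs(P_next - P).sum() < tol:
            return P_next
        P = P_next
    return P

# Hypothetical toy network: links 0->1, 0->2, 1->2, 2->0, 2->3; node 3 is dangling.
A = np.array([[0, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 0]])

G = google_matrix(A)
P = pagerank(G)                            # PageRank vector, sum_j P(j) = 1
P_star = pagerank(google_matrix(A.T))      # CheiRank: same procedure with inverted links
order = np.argsort(-P)                     # order[K-1] is the node with PageRank index K
```

The column sums of `G` equal one by construction, which is the numerical counterpart of the normalization condition $E^T G = E^T$ used in Section 2.3.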