IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. 12, DECEMBER 2016 3339 Near-Optimal Algorithms for Controlling Propagation at Group Scale on Networks
Yao Zhang, Abhijin Adiga, Sudip Saha, Anil Vullikanti, and B. Aditya Prakash
Abstract—Given a network with groups, such as a contact-network grouped by ages, which are the best groups to immunize to control the epidemic? Equivalently, how to choose best communities in social media like Facebook to stop rumors from spreading? Immunization is an important problem in multiple different domains like epidemiology, public health, cyber security, and social media. Additionally, clearly immunization at group scale (like schools and communities) is more realistic due to constraints in implementations and compliance (e.g., it is hard to ensure specific individuals take the adequate vaccine). Hence, efficient algorithms for such a “group- based” problem can help public-health experts take more practical decisions. However, most prior work has looked into individual-scale immunization. In this paper, we study the problem of controlling propagation at group scale. We formulate a set of novel Group Immunization problems for multiple natural settings (for both threshold and cascade-based contagion models under both node-level and edge-level interventions) and develop multiple efficient algorithms, including provably approximate solutions. Finally, we show the effectiveness of our methods via extensive experiments on real and synthetic datasets.
Index Terms—Graph mining, social networks, immunization, diffusion, groups Ç
1INTRODUCTION NFECTIOUS diseases account for a large fraction of deaths individual utility. We model such limited compliance by ran- Iworldwide. The main public health response to containing dom vaccine allocation within each group, which motivates epidemic outbreaks is by vaccination and social distancing, our paper. Our focus in this paper is on developing interven- e.g., [1], [2]. These interventions have resource constraints tions that can be implemented before the start of the epidemic. (e.g., limited supply of vaccines and the high cost of social dis- Further, interventions can be of two kinds: vaccination (which tancing), and therefore, designing optimal control strategies is can be modeled in terms of node removals) and social distanc- an active area of research in public health policy planning, ing (which can be modeled in terms of edge removal, e.g., e.g., [1], [3], [4], [5], [6], [7]. However, optimal strategies based reducing contacts between certain sub-populations). We con- on node level characteristics, such as the degree or spectral sider two kinds of metrics: (1) maximizing the expected num- properties [6], [7] cannot be easily turned into implementable ber of people who do not get infected, and (2) minimizing the policies, because such targeted immunization of specific indi- time for the epidemic to die out. These are both commonly viduals raises significant social and moral issues. As a result, studied metrics in public health (see, e.g., [8], [9]). Most of the vaccination policies, such as those specified by CDC are at the work in mathematical epidemiology has been formalized in level of groups (e.g., based on demographics), and almost all terms of reducing the reproductive number. However, these the efforts in epidemiology are focused on developing group methods do not extend to network based models. In this level strategies, even though this may lead to sub-optimal sol- paper, we develop algorithms for optimizing these metrics in utions compared to the individual level policies. For instance, two different models of diffusion. Medlock et al. [1] develop an optimal vaccine allocation for Similar diffusion processes arise in other domains such as different age groups. Even so, all prior work on optimal group social media, e.g., the spread of spam/rumors on Facebook, level immunization has focused on differential equation Twitter, LiveJournal or Friendster. These are also commonly based models, and has not been studied on network models modeled by models such as the Linear Threshold (LT) of epidemic spread. Implementing such interventions is chal- model [10]. Analogous to the public-health case, we can con- lenging because people “comply” with them based on their trol such processes by ‘immunization’ via blocking users or preventing some interactions, such that the expected number of users who adopt spam/rumors is minimal. Past work has Y. Zhang and B.A. Prakash are with the Department of Computer Science, studied individual-level based immunization algorithms for Virginia Tech, Blacksburg, VA 24061. E-mail: {yaozhang, badityap}@cs.vt.edu. the LT model [11]. However, it is more realistic to issue a A. Adiga is with NDSSL, Biocomplexity Institute of Virginia Tech, warning bulletin on group pages, and some members within Blacksburg, VA 24061. E-mail: [email protected]. those groups comply with the warning to stop disseminating S. Saha and A. Vullikanti are with the Department of Computer Science, rumors. Similarly, Twitter can warn a group of accounts to Virginia Tech, Blacksburg, VA 24061, and NDSSL, Biocomplexity Insti- tute of Virginia Tech, Blacksburg, VA 24061. control the spread of the malicious tweets. The same holds E-mail: {ssaha, akumar}@vbi.vt.edu. true for user groups in Friendster and LiveJournal. Manuscript received 29 Dec. 2015; revised 9 July 2016; accepted 24 Aug. In this paper, we present a unified approach to study 2016. Date of publication 1 Sept. 2016; date of current version 2 Nov. 2016. strategies for controlling the spread of diffusion processes Recommended for acceptance by A. Gionis. through group level interventions, capturing both uncer- For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. tainty and lack of control at high resolution within groups. Digital Object Identifier no. 10.1109/TKDE.2016.2605088 The main contributions of our paper are:
1041-4347 ß 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. 3340 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. 12, DECEMBER 2016
1) Problem Formulation: We develop group level interven- behaviors, e.g., ideas/spam/rumors on Twitter and Face- tion problems in both the LT model, and the SIS/SIR book. A classic example is the linear threshold model, which models, for which we consider a spectral radius based has been extensively studied [10]. In this paper, we study the formulation. We consider arbitrarily specified groups, problem of minimizing propagation for the LT model. and interventions that involve both edge and node Cascade style models, such as the ‘flu-like’ Susceptible- removal, modeling quarantining and vaccination, Infectious-Susceptible (SIS), ‘mumps-like’ Susceptible-Infec- respectively. The interventions specify the number xi tious-Recovered (SIR) and its special-case the Independent of nodes/edges that can be removed within each Cascade (IC) [8], [9], [10], [22], are popular in epidemiology group Ci; however, these are chosen randomly within literature to model different epidemiological states of people, the group. These problems generalize the node level and their state-transitions. Much work has gone into in find- problems and have not been studied before. ing the ‘epidemic threshold’ for such models (the minimum 2) Effective Algorithms: We develop efficient theoretical virulence of a virus which results in an epidemic over the and practical algorithms for the four problem classes network). For example, recent studies [12], [26] show that the we consider, including provable approximation algo- spectral radius of the underlying network (the largest eigen- rithms (SDP and GROUPGREEDYWALK). We find that value of the adjacency matrix of the graph) is related to the diverse kinds of techniques are needed for these epidemic threshold for a wide-range of cascade models. problems—submodular function maximization on an Hence, here we investigate how to control an epidemic by integer lattice, semidefinite programming, quadratic minimizing the spectral radius for cascade style models. programming, and the link between closed walks Immunization. There has been much work on finding opti- and spectral radius. Our algorithms also leverage mal strategies for vaccination and social distancing [1], [3], prior techniques for analyzing contagion processes, [4], [5], [6], [7]. Much of the work in the epidemiology litera- e.g., [6], [12], [13], but require non-trivial extensions. ture has been based on differential equation methods [1], [3], 3) Experimental Evaluation: We present extensive experi- [4]. Cohen et al. [5] studied the popular acquaintance immuni- ments on multiple real datasets including epidemio- zation policy (pick a random person, and immunize one of its logical and social networks, and demonstrate that neighbors at random). Using game theory, Aspnes et al. [27] our algorithms outperform other competitors on developed inoculation strategies for victims of viruses under node and edge deletion at group scale for controlling random starting points. Kuhlman et al. [28] studied two for- infection as well as spectral radius minimization. mulations of the problem of blocking a contagion through Outline of the Paper. The rest of the paper is organized as edge removals under the model of discrete dynamical sys- follows. We first discuss the related work in Section 2, and tems. Tong et al. [7], 29], Van Miegham et al. [30], Prakash then formulate the Group Immunization Problem in et al. [6] proposed various node-based and edge-based immu- Section 3. Section 4 presents our algorithms for different set- nization algorithms based on minimizing the largest eigen- tings of the problem for both edge and node removal. Experi- value of the graph. Other non-spectral approaches for mental results on several datasets are in Section 5. We finally immunization have been studied by Budak et al. [31], He discuss future work, and conclude in Section 6. et al. [32], Khalil et al. [11], Saha et al. [13], and Zhang et al. [33]. All of these papers studied individual-based immu- nization (where either one targets specific individuals or ELATED ORK 2R W whole demographics). Here we study group-based problems, In general, there has been a lot of interest in studying where vaccines are distributed randomly inside groups. dynamical processes on large graphs like (a) blogs and Other Diffusion Problems. Other diffusion based optimiza- propagations [14], [15], (b) information cascades [16], [17]; tion problems include the influence maximization problem, (c) marketing and product penetration [18], [19] and (d) which was introduced by Domingos and Richardson [34], malware prediction [20]. These dynamic processes are all and formulated by Kempe et al. [10] as a combinatorial opti- closely related to virus propagation in epidemiology, rumor mization problem. They proved it is NP-Hard and also gave spread in social media, malware outbreaks in computer net- a simple ð1 1=eÞ-approximation based on the submodular- works, etc. In this section, we review related work mainly ity of expected spread of a set of starting seeds. Recently the from four areas: epidemiology, propagation models, immu- paper by Eftekhar et al. [35] studied this problem at group nization and other diffusion based optimization problems. scale. Other such problems where we wish to select a subset In short, past work concentrates on individual-based immuni- of ‘important’ vertices on graphs, include ‘outbreak zation—in contrast, in this paper we study group-based detection’ [36] and ‘finding most-likely culprits of epi- immunization problems under various models. demics’ [37]. Purohit et al. [38] looked into ‘zooming-out’ of Epidemiology. The classical texts on epidemic models and a graph by forming groups based on similar influence. analysis are May and Anderson [8] and Hethcote [21]. Widely studied epidemiological models include homoge- UR ROBLEM ORMULATIONS neous models [8], [9], [22], which assume that every individ- 3O P F ual has equal contact with others in the population. Table 1 lists the main symbols we use throughout the paper. Propagation Models. There are broadly two types of propa- Here we assume our graph GðV; EÞ is directed and gation models which have been used to describe dynamical weighted. We refer to both node and edge level interven- processes on graphs: threshold based and cascade style. tions as immunization. Threshold based models are well-motivated in the social Groups in a Graph. For a graph GðV; EÞ, we assume that the science literature [23], [24], [25] to represent ‘threshold’ edge (node) set is partitioned into groups C ¼fC1; ...;Cng ZHANG ET AL.: NEAR-OPTIMAL ALGORITHMS FOR CONTROLLING PROPAGATION AT GROUP SCALE ON NETWORKS 3341
TABLE 1 expected number of nodes we can save from being infected, Terms and Symbols by selecting groups for removing edges/nodes. In the LT model, a node v can be influenced by each neigh- Symbol Definition and Description P bor u according to a weight puv where eðu;vÞ2E puv 1.The GðV; EÞ graph G with the node set V and the edge set E C set containing groups diffusion process proceeds as follows: at the start, every node A set of initial infected nodes u uniformly randomly chooses a threshold uu from the range n the number of groups in the graph [0,1], which represents the weighted fraction of u’s neighbors m budget (the number of vaccines) that must be active toP activate u;aninactivenodeu becomes puv weight on edge eðu; vÞ t active at time t þ 1 if w Nt pwu uu where Nu is the set of gðvÞ group index of node v, i.e., gðvÞ¼i if v 2 Ci 2 i gðu; vÞ group index of edge ðu; vÞ, i.e., gðu; vÞ¼i if ðu; vÞ2Ci active neighbors of u at time t; all active nodes will stay active. x vaccine allocation vector ðx1; ...;xnÞ for edges/nodes The process stops when no additional node becomes active. s C;AðxÞ the expected number of infected nodes at the Each group may have some seeds (initial infected nodes). The end when x is allocated to edges 0 seeds will spread information/virus by the LT model. sC;AðxÞ the expected number of infected nodes at the end when x is allocated to nodes For the edge deletion under the LT model, let sC;AðxÞ n ek vector with ek ¼ 1 and ei ¼ 0 for i 6¼ k (Z ! R) denote the expected number of infected nodes in E MEðxÞ ½MðxÞ G (the footprint of G), given seed set A and vaccine alloca- DEðxÞ maximum expected degree of GðxÞ tion vector x for the group set C. Now we are ready to EðxÞ expected spectral radius of MðxÞ ðMEðxÞÞ spectral radius of the expected matrix MEðxÞ define the edge version of the problem under the LT model. min E minimum expected spectral radius over all PROBLEM 1: GROUP IMMUNIZATION under LT model MðxÞ, i.e., minx EðxÞ (edge version): x the allocation vector which minimizes ðMEðxÞÞ min GIVEN: Graph GðV; EÞ, a partition of the edge set over all x, i.e., arg minx ðMEðxÞÞ C ¼fC1; ...;Cng, seed set A and m vaccines (budget). Let x be s number of samples in GROUPGREEDYWALK the edge vaccine allocation vector. FIND: The optimum allocation xopt which maximizes for the edge (node) immunization problems. For a node parti- fðxÞ¼sC;Að0Þ sC;AðxÞ s.t. jxj m. tion, C may correspond to groups of communities, locations, Next, we define the node version of this problem. Let demographics, etc. And edge groups can be induced from 0 sC;AðxÞ denote the footprint of G. It is same as sC;AðxÞ except e ¼ðu; vÞ u v node groups. For example, for an edge ,if and that the allocation vector x corresponds to node vaccination. belong to a group Ct,thene 2 Ct, otherwise it belongs to PROBLEM 2: GROUP IMMUNIZATION under LT model C ¼fe ju 2 C ;v2 C g group ij uv i j . The edge groups we defined (node version): ensure that every edge e has a group even if the endpoints of e GIVEN: Graph GðV; EÞ, a partition of the vertex set C ¼ belong to different node groups.Note that we assume there fC1; ...;Cng,seedsetA and m vaccines (budget). Let x be the node are no overlaps among groups. vaccine allocation vector. Allocating Vaccines to Groups. We define x ¼ðx1; ...;xnÞ 0 FIND: The optimum allocation xopt which maximizes f ðxÞ¼ x as the vaccine allocation vector, i.e., if we give i vaccines to s0 ð0Þ s0 ðxÞ s.t. jxj m. group C , x edges (nodes) will be uniformly randomly C;A C;A i i Hardness of Our Problems. Problems 1 and 2 are removed from C , which means those edges/nodes will not i NP-hard as their special case, individual-level based immu- be involved in the diffusion process. The objective of our nizations (when each edge/node is a group), are NP-hard immunization problem is to find an allocation that controls themselves [11], [33]. the diffusion process most effectively. For edge deletion, a good solution tends to give more vaccines to the edge groups where edges inside have high 3.2 Problem Definition for Spectral Radius chance to be a part of cuts/walks. Similarly, for node dele- Our second set of problems are based on the spectral radius tion, we prefer the node groups where nodes insides have formulation [7], [29] for a variety of cascade models includ- high impact on the influence/eigenvalue. ing the fundamental SIR (‘mumps-like’ which generalizes Main Idea of Our Problem Definitions. We give two differ- the well-known IC model [10]), SIS (‘flu-like’), and SEIS ent sets of problems which cover a wide range of contagion- (with incubation period) models. In the SIS/SIR models, like processes both threshold-based and cascade-style in the every node can be either susceptible (S), infectious (I) or next two sections. In addition, all our problems have been recovered (R). Each infected node u (in state I) can infect carefully formulated to be seamless generalizations of the each susceptible neighbor v (in state S) with the probability corresponding individual-level problems. puv. In the SIS model, each infected node u can switch to the susceptible state with the recovery rate d. In the SIR model, 3.1 Problem Definition under LT Model each infected node u can switch to the recovered state with Our first set of problems are based on the LT model which the recovery rate d, meaning u cannot be infected again. is a well-known model for social media and complex propa- Spectral radius, denoted by , refers to the largest eigen- gations [10] suited for representing ‘threshold’ behaviors for value of the adjacency matrix of a graph G. Recent results [12], activation. As mentioned in the introduction, the vaccination [26] have shown that is connected to the reproduction num- problem here can help to control such processes like spam ber in epidemiology, and determines the phase-transition and rumors on Twitter and Facebook. Under the LT model, (‘epidemic threshold’ t) between epidemic/non-epidemic our goal is to minimize the expected number of infected regimes in a very large range of cascade-style models [12], nodes at the end of diffusion, in other words, maximize the including SIR, SIS , SEIS and so on. As shown in [12], t / , 3342 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. 12, DECEMBER 2016 and if t < 1 the disease will die out quickly irrespective of ini- The notion of submodularity of set functions has been tial conditions. This gives us the motivation to control the dis- extended to functions over integer lattices—see, e.g., [39], ease spread by minimizing in the underlying network. which shows that a greedy algorithm gives a constant factor Tong et al. [7], [29] proposed effective node-based and approximation to submodular lattice functions with budget edge-based individual immunization methods to minimize . constraints. We note that in the context of functions defined Following their methodology, in this paper we aim to maxi- on an integer lattice, unlike in the case of set functions, submo- mize the drop of the spectral radius of G, D ,whenvaccines dularity need not be equivalent to the diminishing return are allocated to groups. Similar to Problems 1 and 2, when xi property. Besides, there are multiple non-equivalent defini- vaccines are given to group Ci, we uniformly remove xi tions of the diminishing return property, as observed in [39]. nodes/edge at random. Hence, we want to find the optimal Next, we show that in Theorem 1 that a greedy algorithm allocation x such that the expectation of D , E½D ðxÞ is maxi- gives an ð1 1=eÞ-factor approximation to an integer lattice mum. Note that we do not define the problems here based on function satisfying the properties ðP1Þ, ðP2Þ and ðP3Þ above, the ‘footprint’ (as in the previous section for LT) for primarily and our objective function follows all above properties two reasons: (a) these versions naturally generalize the corre- (Lemma 2). Note that it is not clear whether the analysis of sponding individual-level immunization problems studied in [39] implies a similar bound for the kind of functions fðxÞ we past literature [7], [29]; and (b) due to the epidemic threshold need to consider here. results, using the spectral radius allows us to immediately for- T Z LemmaP 1. Suppose y ¼ðyi; ...;ynPÞ where yi 2 and mulate a general problem for multiple cascade-style models yj ¼ m,thenfðx þ yÞ fðxÞ yjðfðx þ ejÞ fðxÞÞ. (like SIR/SIS/IC) each with differences in their exact spread- j j ing process which we can ignore. Formally our problems are: Proof. The proof is in the appendix.1 tu PROBLEM 3: GROUP IMMUNIZATION for spectral radius (edge n Theorem 1. Suppose fðxÞ, x 2 Z satisfies the properties ðP1Þ, version) ðP2Þ and ðP3Þ above. Then, Algorithm 1 gives a GIVEN: Graph GðV; EÞ, a partition of the edge set C ¼ ð1 1=eÞ-approximateP solution to the problem of maximizing fC1; ...;Cng, and m vaccines (budget). Let x be the edge vaccine fðxÞ subject to xi m. allocation vector, and let E½D ðxÞ denote the expected drop in the i spectral radius after the immunization. Proof. Suppose x is the solution from the greedy algorithm,P IND E D F : The optimum allocation xopt which maximizes ½ , andP x is the optimal solution. Hence, we have j xj ¼ i.e., x ¼ arg max E½D ðxÞ s.t. jxj m. opt x j xj ¼ m.SincesCð0Þ is constant, the greedy algorithm is PROBLEM 4: GROUP IMMUNIZATION for spectral radius (node equivalent to version) C ¼ arg maxCi fðx þ eiÞ fðxÞ: GIVEN: Graph GðV; EÞ, a partition of the node set C ¼ ðiÞ fC1; ...;Cng, and m vaccines (budget). Let x be the edge vaccine Let us define x as the solution got from the ith itera- E D ðmÞ allocation vector, and let ½ ðxÞ denote the expected drop in the tion of the greedyP algorithm, hence x ¼ x . And x can spectral radius after the immunization. be represent as j xj ej. We have E D FIND: The optimum allocation xopt which maximizes ½ , ðiÞ E D fðx Þ fðx þ x Þ i.e., xopt ¼ arg maxx ½ ðxÞ s.t. jxj m. ¼ fð ðiÞÞþðfð þ ðiÞÞ fð ðiÞÞÞ Hardness of Our Problems. Problems 3 and 4 are NP-hard x Xx x x too—their special cases, individual-level immunizations are ðiÞ ðiÞ ðiÞ fðx Þþ xj ðfðx þ ejÞ fðx ÞÞ ðLemma 1Þ NP-hard [7], [29]. Xj fðxðiÞÞþ x ðfðxðiþ1ÞÞ fðxðiÞÞÞ ðGreedy Alg:Þ 4PROPOSED METHODS j j We first discuss our algorithms for the GROUP IMMUNIZATION ¼ fðxðiÞÞþmðfðxðiþ1ÞÞ fðxðiÞÞÞ: problem under the LT model (Sections 4.1 and 4.2 for ðiþ1Þ 1 ðiÞ 1 Problems 1 and 2), and then the spectral radius versions Hence, fðx Þ ð1 mÞfðx Þþm fðx Þ. Recursively, (Sections 4.3 and 4.4 for Problems 3 and 4). ðiÞ 1 i we can get fðx Þ ð1 ð1 mÞ Þfðx Þ. Therefore, fðxÞ¼ f ðmÞ 1 1 1 m f 1 1=e f 4.1 Edge Deletion under LT Model ðx Þ ð ð mÞ Þ ðx Þ ð Þ ðx Þ. tu Recall that the function fðxÞ in Problem 1 is not a simple set Algorithm 1. Greedy Algorithm function; it is over an integer lattice. Hence the submodularity f m property used in [11] is not applicable to our problem, and we Require: , budget 1: x ¼ 0 can not simply apply their greedy algorithm. Instead, our 2: for j ¼ 1 to m do approach is to carefully identify a ‘submodularity like’ condi- 3: i ¼ arg max fðx þ e Þ fðxÞ tion that is satisfied by our function fðxÞ, for which a greedy k¼1;...;n k 4: x ¼ x þ e algorithm gives good performance. Let e be the vector with 1 i k 5: end for k at the th index and 0 be the all zeros vector. We consider the 6: return x following three properties. Now, we will show that the objective function ðP1Þ fðxÞ 0 and fð0Þ¼0. fðxÞ¼sC;Að0Þ sC;AðxÞ for the edge deletion problem under ðP2Þ (Non-decreasing) fðxÞ fðx þ ekÞ for any k. 0 ðP3Þ (Diminishing returns) For any x x and k, we have 0 0 1. The proof is in the appendix, which can be found at: http://peo- fðx þ ekÞ fðxÞ fðx þ ekÞ fðx Þ. ple.cs.vt.edu/yaozhang/group-immu/ ZHANG ET AL.: NEAR-OPTIMAL ALGORITHMS FOR CONTROLLING PROPAGATION AT GROUP SCALE ON NETWORKS 3343 the LT model satisfies the properties stated in Theorem 1. In this will imply that fðxÞ satisfies Property 3. Suppose 0 the ensuing discussion, we will assume without loss of gener- x ¼ x þ ej, we have two cases to consider: (1). ek ¼ ej; ality that there is only one seed node. This is because, if there (2). ek 6¼ ej. are multiple seed nodes, then, we can merge all of them to a For 1 i n, let ci ¼jCij and xi denote the ith ele- single ‘super’ node (say s) inP the following manner: for every ment in x. vertex v 2 V n A,setpsv ¼ u2NðvÞ\A puv,whereNðvÞ is the First,Q we consider case (1) (ek ¼Qej). For R 2RðxÞ, 1 1 1 set of neighbors of v. We note that after this modification the Pr½R ¼ i ¼ r , where, r ¼ i6¼k ci c ci xi k xi edges between v andP its susceptibleP neighbors are unchanged, xk and at time 0, pwv ¼ pwv ¼ psv. Hence, w2Nv w2NðvÞ\A ^ ^ sCðG; xÞ sCðG; x þ ekÞ sC;AðxÞ¼sC;sðxÞ. Henceforth, we will assume that there is X 1 X 1 only one seed node, and drop the subscript A from sC;AðxÞ, ¼ r g^ðRÞ r g^ðR0Þ ck ck denoting it by sCðxÞ. R2RðxÞ R02Rðx0Þ 2 xk ðxkþ1Þ 3 Lemma 2. The function fðxÞ¼sC ð0Þ sCðxÞ satisfies the X X 4 1 1 1 5 properties ðP1Þ, ðP2Þ and ðP3Þ above. ¼ r g^ðRÞ g^ðR [fegÞ : ck x þ 1 ck R2RðxÞ k e2C nR xk k xkþ1 Proof. Property 1 is trivially true because, when x ¼ 0,by definition, fð0Þ¼0, and since vaccination does not The factor 1 is due to the fact that R [feg comes up xkþ1 increase the number of infections, sCðxÞ sC ð0Þ. For the in ðxk þ 1Þ combinations involving R and e.Thissim- rest of the proof, since sCð0Þ is a constant, we only need plifies to 0 to analyze sCðxÞ. Note that for any x x, we can find a ^ ^ sequence of vectors ðz1; z2; ...; zlÞ for some l such that sC ðG; xÞ sC ðG; x þ ekÞ 0 x ¼ z1, x ¼ zl and zi ¼ zi 1 þ eki 1 for some index ki 1. X X rxk!ðck xk 1Þ! Therefore, it is enough to prove that Properties 1 and 2 ¼ g^ðRÞ g^ðR [fegÞ : (1) 0 ck! hold for x ¼ x þ ej for some index j. Also, we can assume R2RðxÞ e2CknR that xj < jCjj, for j ¼ 1; ...;n, for if this is not true for some j, then, it implies that all the edges in Cj will be vac- Similarly, we have cinated, and therefore, we can simply remove all C from j ^ 0 ^ 0 sCðG; x Þ sCðG; x þ ekÞ the analysis and reduce the budget by xj. V X X Let RðxÞ 2 be the collection of sets R satisfying rðxk þ 1Þ!ðck xk 2Þ! ¼ g^ðR0Þ g^ðR0 [fegÞ jR \ Cij¼xi. Following the equivalence between influence ck! 0 0 0 R 2Rðx Þ e2CknR in the LT model andP the directedP percolation process [10], X X ^ ^ rðxk þ 1Þ!ðck xk 2Þ! 1 we have sCðxÞ¼ G^ Pr½G R2RðxÞ Pr½R gCðG; RÞ,where ¼ ck! ðxk þ 1Þ 0 the first sum is over all possible live-edge subgraphs G^ of G X R2RðxÞ e 2CknR in the percolation process, Pr½R is the probability when the g^ðR [fe0gÞ g^ðR [fe; e0gÞ : 0 ^ e2CknðR[fe gÞ set R is removed , and gC ðG; RÞ is the expected number of ^ infected nodes in G at the end of the LT process after the set From [11, proof of Theorem 6], g^ðRÞ g^ðR [fegÞ R is removed. This can be rewritten as s ðxÞ¼ 0 0 P C g^ðR [fe gÞ g^ðR [fP e; e gÞ (supermodularity).P P There- ^ Pr½G^ s ðG;^ xÞ,wherePr½G^ is the probability of sam- G C P fore, ðck xk 1Þ e½g^ðRÞ g^ðR [fegÞ e0 e g^ðR [ ^ ^ ^ e0 g^ R e; e0 pling G,andsC ðG; xÞ¼ R2RðxÞ Pr½R gCðG; RÞ.Hence- f gÞ ð [f gÞ. Hence proved. ^ Now, we consider case (2). Let forth, we will abbreviate gC ðG; RÞ as g^ðRÞ. We will show that s ðG;^ xÞ is non-increasing, i.e., 1 1 C Pr R r0 ; ^ ^ 0 0 ½ ¼ sCðG; xÞ sC ðG; x Þ where x ¼ x þ ej, thereby showing ck cj x x that fðxÞ satisfies Property 2. Since the number of nodes k j Q reachable from the seed node with R removed is at least 0 1 ^ ^ where r ¼ . We can get sCðG; xÞ sCðG; x þ ekÞ i6¼j;k ci as many as those with R [feg removed, for any xi e 2 C n R, we have g^ðRÞ g^ðR [fegÞ. Therefore, from Eqn. (1). And j X ^ 0 0 0 ^ 0 ^ 0 sC ðG; x Þ¼ Pr½R g^ðR Þ sCðG; x Þ sC ðG; x þ ekÞ 0 0 R 2Rðx Þ 0 X X r xk!ðck xk 1Þ! X X 1 ¼ ½g^ðR0Þ g^ðR0 [fegÞ cj ¼ Pr½R g^ðR [fegÞ ck! R02Rðx0Þ e2C nR0 jCjj xj ðxjþ1Þ k R2RðxÞ e2CjnR X X X r0ðx þ 1Þ!ðc x 1Þ! x !ðc x 1Þ! 1 1 ¼ j j j k k k Pr½R g^ðRÞ cj! ck! xj þ 1 jCjj xj X X R R2RðxÞ e2CjnR X ½g^ðR [ e Þ g^ðR0 [fe; e gÞ : ^ j j ¼ Pr½R g^ðRÞ¼sC ðG; xÞ: ej2CjnR e2CknðR[fejgÞ R2RðxÞ P ^ ^ Finally, we will show that sCðG; x þ ekÞ sCðG; xÞ AgainP P from [11], ðcj xj 1Þ e g^ðRÞ g^ðR [fegÞ ^ 0 ^ 0 g^ðR [fe gÞ g^ðR [fe ;egÞ tu sCðG; x þ ekÞ sCðG; x Þ. From the above discussion, ej e j j . Hence proved. 3344 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. 12, DECEMBER 2016
I Algorithm 1 provides a simple greedy algorithm. Here, Running Time of GREEDY-LT. Calculating all rðu; TXÞ costs I to estimate sC ðxÞ when vaccines are uniformly at random OðjMjjV jÞ time since we can traverse TX once to get all val- I allocated within groups, we apply the Sample Average ues of rðu; TXÞ. And greedily choosing m vaccine allocation Approximation (SAA) framework. Let L RðxÞ, denote a needs OðmnjMjjV jÞ. Hence, the serial version of GREEDY-LT sample set fromP the set of all possible allocations. costs OðmnjMjjV jÞ. Note that in practice, we can speed it 1 sCðxÞ s^CðxÞ¼ g ðRÞ, Kempe et al. [10] show that I jLj R2L C up by computing and updating riðu; TXÞ in parallel. In addi- I gCðRÞ can be estimated by sampling from the set of live- tion, since TX is tree, the increasing difference property still edge graphs. A live-edge graphs T is generated as follows: holds, hence we can accelerate GREEDY-LT by “lazy eval- for each node v 2 V , independently select at most one of its uation” [36], [40] as well. incomingP edges with probability puv, and with probability 1 u:ðu;vÞ2E puv no edge is selected. Let this sample set be 4.2 Node Deletion under LT Model denoted by M. This approach takes OðjMjjLjðjEjþjV jÞÞ Our algorithm for the node version of the GROUP IMMUNIZA- time to estimate sC ðxÞ, and OðmnjMjjLjðjEjþjV jÞÞ for the TION problem is also the greedy Algorithm 1, as in the edge full greedy algorithm, which is not practical for large net- version in Section 4.1. Without loss of generality, we also works. However, we can speed up this naive greedy assume that all seed nodes in A are merged, and drop the 0 0 algorithm. subscript A from sC;AðxÞ, denoting it by sC ðxÞ. Our analysis 0 0 0 relies on proving that the function f ðxÞ¼sCð0Þ sC ðxÞ in Algorithm 2. GREEDY-LT Problem 2 satisfies the properties ðP1Þ, ðP2Þ and ðP3Þ from Require: Graph G, group set C, seed set A, and budget m Section 4.1, as discussed below. 1: Merge seed set A to I 0 0 0 Lemma 3. The function f ðxÞ¼sCð0Þ sCðxÞ satisfies the 2: Sample live-edge graphs M¼fT I ; ...;TI g X1 X P P P I I jMj properties ð 1Þ, ð 2Þ and ð 3Þ. 3: For each TX, calculate rðu; TXÞ for all nodes (in parallel) 4: Set x ¼ 0 Proof. The proof is in the appendix. tu 5: for j ¼ 1 to m do I Lemma 3 suggests that Theorem 1 holds for node ver- 6: for each TX and Ci do Ci I sion as well: GREEDY algorithm will provide a ð1 7: pick an edge eX at random for Ci and TX 1=eÞ-approximate solution. We extend GREEDY-LT (Algo- 8: end for P I I Ci C ¼ arg max C ðrðI; T Þ rðI; T n e ÞÞ rithm 2) to the node version: instead of randomly pick 9: Ci e i T I X X X X 2 X 10: xC ¼ xC þ 1 edges (Line 7), we randomly pick nodes to calculate the I marginal loss (Line 9), and remove the corresponding 11: for each TX do C I Ci I 12: If eX ðu; vÞ2TX, remove edge eX and update rðn; TXÞ nodes (Line 12). The observation is that calculating the for node n (in parallel) marginal loss of removing node v in C in constant I I 13: end for timeholdshereaswell,i.e.,rðI; TXÞ rðI; TX n vÞ¼ I 14: end for rðv; TXÞþ1. Hence, the updating process is the same as 15: return x theedgeversionofGREEDY-LT.
Speed-Up of the Greedy Algorithm: GREEDY-LT. Since a live- 4.3 Edge Deletion for Spectral Radius s graph sampled from M is a tree, we can denote it as TX s We propose three algorithms for Problem 3 (edge immu- where s is the root, and rðu; TXÞ¼jfvjv 2 subtreeðuÞgj, i.e., s nization based on spectral radius) with different trade- the number of nodes that are under the subtree of u in TX. offs of quality and running time: the first one, SDP, is a GREEDY-LT is summarized in Algorithm 2. It first merges all constant factor approximation algorithm that minimizes s jMj seeds into a ‘supernode’ and samples live-edge the actual eigendrop; the second algorithm, GROUPGREE- rðu; T s Þ graphs, and then compute X in parallel for all nodes DYWALK, is a bicriteria approximation algorithm based on in all the live graphs (Lines 1-3). After that we greedily hitting-walks; the third algorithm, LP, is an Linear Pro- select m vaccines (Lines 4-10): we initially set the allocation gramming (LP) based method which uses an estimation vector x ¼ 0, and in each iteration, for eachP group Ci, we cal- of the eigendrop. D s culate the marginal loss Ci;sðx þ eiÞ¼ e u;v T s rðs; TXÞ ð Þ2 X SDP is a constant-factor approximation algorithm, which s rðs; TX n eÞ, i.e., we randomly pick one edge from each group gives us good results, but it is very slow with a for each live-edge graph, then sum their marginal losses up OðjV j4polylogðjV jÞÞ time complexity. Hence, we develop s over TX as Ci’s marginal loss. Note that GROUPGREEDYWALK, a bicriteria approximation algorithm s s s rðs; TXÞ rðs; TX n eÞ¼rðv; TXÞþ1, where node v is the based on hitting closed walks [13]. Though GROUPGREEDY- endpoint of e [11]. We pick the group C with the maximum WALK loses a little quality compared to SDP, it is faster with a marginal loss. Finally we removed the edge that has been Oðsm2jV j3Þ time complexity (where s corresponds to the s picked, and update rðu; TXÞ in parallel (Lines 11-13). There number of samples, described later). However, it may still not s s are two cases to update TX if eðu; vÞ2TX: (1) for v’s children, be scalable to very large networks with millions of nodes. we can remove them because it is not reachable from s; (2) Therefore, we come up with LP, a linear programming based s s s for any ancestor a of v, rða; TX n eÞ¼rða; TXÞ rðv; TXÞ 1, heuristic whose time complexity depends only on the number which can be done in constant time. Following Theorem 1, of groups, not graph size. In reality, the number of groups in a GREEDY-LT is a ð1 1=e Þ-approximation algorithm where group is typically much smaller than the number of nodes. is the approximation factor for estimating sCðxÞ. Hence, LP is much faster than SDP and GROUPGREEDYWALK. ZHANG ET AL.: NEAR-OPTIMAL ALGORITHMS FOR CONTROLLING PROPAGATION AT GROUP SCALE ON NETWORKS 3345
D 4 And experimental results demonstrate that it is scalable to degree. If EðHÞ log jVpj,ffiffiffiffiffiffiffiffiffiffiffiffiffiffi then, almost surely iðMðHÞÞ networks with millions of nodes, and provides competitive iðMEðHÞÞj ð2 þ oð1ÞÞ DEðHÞ, for i ¼ 1; ...; jV j. empirical performance (see Section 5). Note that even though SDP and GROUPGREEDYWALK may Recall that xmin is the output of SDP (4), and it corre- not be scalable to very large networks, both have proven per- sponds to the allocation vector which minimizes ðMEðxÞÞ D formance guarantee. In addition, they are not merely of theo- over all x. Let EðxminÞ denote the maximum expected retical interest: they can be used as a baseline to assess the degree of GðxminÞ. The following lemma proves that the performance of faster heuristics on smaller networks. SDP formulation givespffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi us an approximation algorithm with Next, we will introduce the SDP algorithm (Section 4.3.1), constant factor Oð DEðxminÞÞ. the GROUPGREEDYWALK algorithm (Section 4.3.2), and the LP 4 Lemma 4. If x is such that DEðx Þ log jV j, then, heuristic (Section 4.3.3) respectively. min pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffimin min E ðMEðxminÞÞþð2 þ oð1ÞÞ DEðxminÞ þ 1. pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 4.3.1 SDP: A Constant Factor Approximation Algorithm Proof. Let z ¼ ðMEðxminÞÞþð2 þ oð1ÞÞ DEðxminÞ. Apply- Let GðV; EÞ be a graph whose edge set is partitioned into n ing Theorem 2 to GðxminÞ, ðMðxminÞÞ z almost surely. D 4 groups C1; ...;Cn. Let x be the edge allocation vector. For In fact, for Eðxmin Þ log jV j, it can be shown that an edge ðu; vÞ, let gðu; vÞ denote the index of the group to Pr ðMðxminÞÞ z 1=jV j (see [41, proof of Theorem ðu; vÞ Gð Þ which belongs. Let x be the random graph 6]). Noting that ðMðxminÞÞ ðMÞ, obtained by removing each edge in Ci with probability p ¼ x =c , where c ¼jC j. Let MðxÞ be its adjacency matrix E i i i i i EðxminÞ¼ ½ ðMðxminÞÞ and EðxÞ¼E½ ðMðxÞÞ be the expected spectral radius Pr ðMðxminÞÞ z z 1; with prob. ð1 pgðu;vÞÞ if ðu; vÞ2EðGÞ; þ Pr ðMðxminÞÞ z ðMÞ ðMðxÞÞ ¼ uv 0; otherwise: 1 1 z þ ðMÞ Let MEðxÞ¼E½MðxÞ be the expectation of the adjacency min min By definition, E EðxminÞ. Therefore, E E matrix of GðxÞ ðxminÞ z þ 1. Hence, proved. tu 1 pgðu;vÞ; if ðu; vÞ2EðGÞ; Running Time. ðMEðxÞÞuv ¼ (3) The SDP step (Eq. (4)) dominates the run- 0; otherwise: ning time of this algorithm, which is OðjV j4polylogðjV jÞÞ. The problem is to find the optimal allocation, i.e., the x for which EðxÞ is minimized. We will denote this value by 4.3.2 GROUPGREEDYWALK: A Bicriteria Approximation min E :¼ minx EðxÞ. Algorithm 4 Remark 4.1. In the SDP formulation, for ease of analysis, we As shown above, SDP with a ðjV j polylogðjV jÞÞ time com- replace the hard budget constraint by an expected budget plexity, is too slow for large networks. In this section, we constraint, i.e., the expected size of the vaccine allocation leverage the technique of hitting closed walks [13] for the vector x is m. This is not a problem since, in reality, the bud- GROUP IMMUNIZATION problem, and propose a bicriteria get is sufficiently high ( log n). Hence, with high probabil- approximation algorithm called GROUPGREEDYWALK. ity, the number of vaccines in the solution will be very close Saha et al. [13] studied the problem of minimizing the spec- to the expected budget. Given this small difference, we can tral radius under a given threshold by removing the smallest force the number of vaccines to be within the budget con- number of edges, and developed a greedy based approxima- straints, with very little effect on the performance. tion algorithm for it. Different from their work, our goal is to distribute a given budget of vaccines to groups to minimize The SDP Formulation: Finding the Allocation x with Minimum the spectral radius as small as possible. We can adapt their ðMEðxÞÞ. Note that, MEðxÞuv ¼ð1 pgðu;vÞÞ,ifðu; vÞ2EðGÞ. greedy algorithm to the group immunization, by choosing We use a simple SDP to find the allocation which minimizes groups with maximum marginal gain of hitting closed walks. ðMEðxÞÞ and meets the budget constraint m However, it is not clear whether this works, as we need to con- sider all “possible worlds” for group immunization. minimize t In graph G, a closed walk is a sequence of nodes starting subject to 0 p 1; for i ¼ 1; ...;n P i (4) and ending at the same node, with two consecutive nodes ipijCij m; adjacent to each other. Closed k-walk is a walk with tI MEðxÞ 0: length k. Let walks ðe; G; kÞ denote the number of closed k-walks in G containing e ¼ði; jÞ. We say that an edge set E Let xmin denote the allocation vector corresponding to the hits a walk w if w contains an edge from E. Recall that GðxÞ solution of the SDP. min is a random graph obtained by removing a random subset Analysis: Relating E to ðMEðxminÞÞ. One can use the fol- of xi edges in Ci, where C1; ...;Cn is a partition of the edge lowing result by Lu and Peng [41] to bound EðxÞ with set E. Let WðG; kÞ be the set of all walks of length k in the respect to ðMEðxÞÞ. graph G. Let nkðG; eÞ denote the number of walks of Theorem 2 ([41]). Consider an edge-independent random graph length k in G that pass through edge e. Similarly, let H. Let MðHÞ denote its adjacency matrix and nkðG; SÞ denote the walks of length k in G that pass though MEðHÞ¼E½MðHÞ . DEðHÞ denotes the maximum expected edges in the set S. Let nkðGÞ¼nkðG; EÞ ¼ jWðG; kÞj denote 3346 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. 12, DECEMBER 2016 the number of walks with length k in G. Here we focus on Lemma 6. Let xoptðmÞ be the optimum allocation such that opt walks of a fixed length k ¼ uðlog jV jÞ. Note that for GðxÞ, T ¼ E½ 1ðGðx ðmÞÞÞ . Let y be defined as nkðGðxÞÞ is a random variable. opt opt xi ; if xi mi=2; yi ¼ Algorithm 3. GROUPGREEDYWALK (G, m) mi otherwise, Require: Graph G, group set C, and budget m 1: x ¼ 0 where mi is the number of edges in group Ci. Then, we have k k 2: for j ¼ 1 to m do E½nkðGðyÞÞ jV j2 T . 3: i ¼ arg max ExpCountWalksðx þ e Þ k¼1;...;n k Proof. The proof is in the appendix. tu ExpCountWalksðxÞ 4: x ¼ x þ ei Now, we prove Theorem 3. 5: end for 6: return x Proof of Theorem 3. Let gðxÞ denote the expected number of walks in WðG; kÞ hit by the edges that are removed in GðxÞ. Then gðxÞ has the diminishing returns property, i.e., Algorithm 3 gives the pseudocode of our GROUPGREE- 0 0 0 for x x , we have gðx þ eiÞ gðxÞ gðx þ eiÞ gðx Þ. DYWALK algorithm. EXPCOUNTWALKSðG; xÞ returns the The proof of the diminishing returns follows the proof of expected number of walks surviving in GðxÞ.Notethat Lemma 2. EXPCOUNTWALKS is different from COUNTWALKS in [13], as We will compare gðxgÞ to gðyÞ where y is defined as it returns the expected number of hitting walks when a vaccine allocation vector x is assigned to groups, while opt opt xi ; if xi mi=2; COUNTWALKS in [13] is based on removing a set of edges yi ¼ mi otherwise, given a budget constraint. The idea of GROUPGREEDYWALK is that, each time we select a group Ci with the maxi- whereP miPis the number of edges in group Ci. Note that mummarginalgaininEXPCOUNTWALKSðG; xÞ,whenallo- opt i yi 2 i xi 2m. cating one vaccine to Ci. ðiÞ Let x denote the vector after ith iteration of GROUP- Algorithm 3 follows the framework of the individual based GREEDYWALK. Since gðxÞ has the diminishing returns GREEDYWALK algorithm [13]. Instead of picking edges, it choo- property, it follows the proof of Theorem 1 that ses groups to maximize marginal gain of eigendrop. The main ðiÞ 1 i 2 fðx Þ ð1 ð1 2mÞ ÞfðyÞ. Therefore, for i ¼ Oðm log jV jÞ, challenge here is to show GROUPGREEDYWALK is a provable 2 2 opt 1 Oðm log jV jÞ log jV j approximation algorithm. Let x ðmÞ be the optimum solu- we have 1 ð1 2mÞ 1 ð1=eÞ 1 opt 1 1 tion corresponding to budget m ,andT ¼ 1ðGðx ðmÞÞÞ (the 1 , where N is the number of total walks in jV jlog jV j N spectral radius after vaccine allocation for the optimum solu- the original graph G. tion). We can prove the following theorem: c k From Lemma 6, we have E½nkðGðyÞÞ jV j T for a c k Theorem 3. Let xoptðmÞ be the optimum solution corresponding to constant c. This implies fðyÞ N jV j T . Therefore, g g 1 c k c k budget m of edges removed. Let x be the allocation returned by fðx Þ ð1 jV jÞðN jV j T Þ N 1 jV j T . This implies 2 g c k GROUPGREEDYWALK (G; c1m log jV j), for a constant c1.Thenwe that nkðGðx ÞÞ OðjV j T Þ. From Lemma 5, it follows g 0 0 g g 0 have 1ðGðx ÞÞ c T for a constant c ,where 1ðGðx ÞÞ is the that 1ðGðx ÞÞ c T. tu spectral radius after allocating vaccines based on xg. Implementation Notes. Given the adjacency matrix A of G, ROUP REEDY ALK k 1 Remark 4.2. Theorem 3 shows that G G W is a the number of k-length walks from u to v is given by Auv .It 2 0 ðc1 log jV j;cÞ-bicriteria approximation algorithm. Differ- also corresponds to the number of walks hit by the ent from the analysis of traditional approximation algo- edge ðu; vÞ. We implement the algorithm as follows. In each rithms, in order to bound the result of GROUPGREEDYWALK iteration, we randomly sample a set of edges of the G w.r.t to the optimal solution, we need a larger budget according to x. For each sample, we compute the expected 2 2 c1 log jV jm.Typically,log jV j is much smaller than the decrease in the number of walks for the removal of one budget m. And when the budget m is very large, the mar- edge in group i (for computing the effect of allocation 2 ginal gain of eigendrop for a larger budget c1 log jV jm will vector x þ ei) as follows: We construct GðxÞ, compute tend to be very close to the marginal gain of eigendrop for A0 ¼ AðGðxÞÞk 1 and take the average over all A0ðu; vÞ ele- the budget m. Hence, adding such small factor into the bud- ments where ðu; vÞ belongs to group i. We perform this for get m will have little effect on the performance. each sample (number of samples is s) and take the average over all the samples. Finally, we choose that i which gives We will use Lemmas 5 and 6 to prove this theorem. Intui- the maximum average and update x by adding ei to it. tively, Lemma 5 shows the expected spectral radius is Running Time. For budget m, Am 1 can be computed upperbounded by T if the number of walk k ¼ Oðlog jV jÞ; in time Oðm2jV j3Þ. For each sample of x,wecompute while Lemma 6 shows that the expected number of walks m 1 with length k can be upperbounded by T as well. AðGðxÞÞ . Note that, computing the effect of removing 2 ei for each sample takes only OðjV j Þ time. Therefore, for k k Lemma 5. If E½nkðGðxÞ ¼ OðjV j2 T Þ for k ¼ Oðlog jV jÞ, asamplesizeofs, the algorithm overall takes then E½ ðGðxÞÞ c T for a constant c . 3 1 3 3 Oðsm2jV j Þ time. If m ¼ Oðlog jV jÞ, the time complexity is Proof. The proof is in the appendix. tu OðsjV j3log 2jV jÞ. ZHANG ET AL.: NEAR-OPTIMAL ALGORITHMS FOR CONTROLLING PROPAGATION AT GROUP SCALE ON NETWORKS 3347 4.3.3 LP: A Fast Heuristic Applying the above to (8), GROUPGREEDYWALK is a good approximation algorithm like X X X X SDP, however, it may not be scalable to very large networks 2 2 jCaj fðxÞ¼ 2 uj xa Mijuiujxa jCaj 1 with millions of nodes. In this section, the propose a much a j2Ca a i;j2Ca faster heuristic based on estimating eigendrop. X X M u u x x : The eigendrop when removingP edges in the set ET can be ij i j a b fðTÞ¼ M u u Mu ¼ u a6¼b i2Ca;b2Cb approximated by ði;jÞ2ET ij i j where u x and u ¼ðu1; ...;ui; ...Þ [29]. Given the allocation vector x, ObservingP that Mij, i andP a are constants, defining 2 jCaj the expected drop in spectral radius is then given by aa ¼ 2 u , b ¼ Mijuiuj, and j2PCa j jCaj 1 a i;j2Ca G u u ab ¼ i2Ca;j2C Mij i j, we get, E½D fðxÞ b X X X X 2 ¼ Mijuiuj Pr ði; jÞ is removed fðxÞ¼ aaxa b x Gabxaxb : 5 a a Xi;j2E X ( ) a a a6¼b ¼ Mijuiujxa : a2C ði;jÞ2Ck Our aim is to find that x which maximizes fðxÞ. This can be formulated as a quadratic program P P If we define aa ¼ ði;jÞ2C Mijuiuj, then, fðxÞ¼ a aaxa. P P P a minimize b x2 G x x a x We want to maximize fðxÞ subject to the budget constraints. a a a þ a6¼b ab a b a a a This can be formulated as a linear program as given below ¼ 1 xT Qx þ cT x P2 (10) P subject to a xajCaj B maximize aaxa Pa 0 xa 1; subject to a xajCaj m (6) 0 xa 1: G where, Qaa ¼ 2ba and for a 6¼ b, Qab ¼ 2 ab and ca ¼ aa.IfQ Oðn4Þ n is not semi-definite, the problem is NP-Hard [42]. In that case, Running Time. The LP takes time where is the num- b ber of groups. Note that it is not a function of the graph size. we use a low-rank matrix Q formed by all its eigenvectors cor- Typically, the number of groups is small, hence this algo- responding to non-negative eigenvalues. The QP on Q^ can be rithm is very fast. solved in polynomial time using the ellipsoid method [42]. b 1 T b T Let fðxÞ¼ x Qx þ c x,andxQ, xb correspond to the best 4.4 Node Deletion for Spectral Radius 2 Q b Here, we propose an algorithm for solving Problem 4: the allocation vectors corresponding to Q and Q respectively. group node immunization problem with respect to eigen- The next lemma shows that xb is a good approximation to xQ. Q drop. It is based on the approximate eigendrop method b n b which was discussed in Section 4.3. The eigendrop when Lemma 7. jfðxbÞ fðxQÞj kQ Qk , where n is the Q 2 F removing nodes in S can be approximated as follows [7]: number of groups in the graph. X X D 2 Proof. The proof is in the appendix. tu fðSÞ¼ 2 uj Mijuiuj; (7) j2S i;j2S Running Time. The QP takes Oðn4Þ time. Again, note that n is the number of groups. Hence, it is fast when the number where Mu ¼ u and u ¼ðu1; ...;ui; ...Þ. Recall that C is the of groups is small. set of groups and x ¼ðx1; ...;xi; ...Þ is the allocation vector where, xi is the fraction of nodes vaccinated in group Ci. 5EMPIRICAL STUDY For the group vaccination problem, the expected eigendrop can be approximated by applying (7) as follows: We present a detailed experimental evaluation now. X 5.1 Experimental Setup E D 2 2 ½ fðxÞ¼ 2 uj Prðj is vaccinatedÞ We implemented the algorithms in Python, and conducted j2V X (8) the experiments using a 4 Xeon E7-4850 CPU with 512 GB Mijuiuj Prði & j are vaccinatedÞ : of 1,066 Mhz main memory. i;j2V Datasets. Table 2 briefly summarizes the dataset. We run our experiments on multiple datasets, which were cho- Let gðvÞ denote the index of the group to which v belongs to, senfortheirsizeaswellasdifferentdomainswherethe v 2 C gðvÞ¼i j i.e., if i,then, . The probability that is vacci- GROUP IMMUNIZATION problem is especially applicable. nated is xgðjÞ and the probability that both i and j are vacci- Note that all our datasets are networks, not diffusion nated is traces. If diffusion traces areprovidedasinputsinstead ( of a network, there are state-of-the-art algorithms (such x x ; if g i g j ; gðiÞ gðjÞ ð Þ 6¼ ð Þ as [43]) which can be applied to learn edge weights first, Prði & j are vaccinatedÞ¼ jC j x2 gðiÞ ; otherwise. and then apply our algorithm. gðiÞ jCgðiÞj 1 (9) 2. Code: http://people.cs.vt.edu/yaozhang/group-immu/ 3348 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. 12, DECEMBER 2016 TABLE 2 three baselines for both node and edge deletion to better Datasets judge their performance. Analogous versions of these base- lines have been regularly used in state-of-the-art individual Dataset Num. of nodes Num. of edges Num. of groups immunization studies [7], [13], [29]. SBM 1,500 5,000 20 Protein 2,361 7,182 13 (1) RANDOM: uniformly randomly assign vaccines to OregonAS 10,670 22,002 100 groups for both node deletion and edge deletion. YouTube 50 K 450 K 5,000 (2) DEGREE: for node deletion, we calculate the average Portland 0.5 million 1.6 million 91 degree d of each group C , and independently Miami 0.6 million 2.1 million 91 Ci i P C d = d assign vaccines to i with probability Ci Ck2C Ck ; for edge deletion, we first calculate the product 1) SBM (Stochastic Block Model) [44] is a well-known degree de [30] of each edge e ¼ðu; vÞ, i.e., de ¼ du dv, model to generate synthetic graphs with groups. We then similar to node deletion, we calculate the aver- generate small networks from the Stochastic Block age product degree dCi Pof Ci, and assign vaccines to C d = d Model to test the effectiveness of all our methods. i with probability Ci Ck2C Ck . 3 2) Protein is a protein-protein interaction network in (3) EIGEN: Eigenvalue centrality has been widely used in budding yeast. There are 13 classes of proteins, which the immunization literature [7], [29], even as a base- are naturally treated as groups. It is a biological net- line for LT model [11]. Let u be the eigenvector corre- work, where our immunization algorithms can be sponding to the first eigenvalue of the graph. The potentially applied to block protein interactions. eigenscore of node a is ua, while the eigenscore of 4 3) OregonAS is the Oregon AS router graph collected edge eða; bÞ is juaubj [29]. For both node and edge from the Oregon router views, and groups here are deletion, we calculate the average eigenscore uCi of based on router conductivities. We use Lou- each group Ci, and independentlyP assign vaccines to vain [45], a fast community detection algorithm to C u = u i with probability Ci Ck2C Ck . specify groups. It is a computer network where our algorithms can be used to stop malware outbreaks. Remark 5.1. Note that we do not compare and run the indi- 4) YouTube5 is a friendship network in which users vidual based immunization methods [7], [11] “as-is” on can form groups. We create an induced graph by the original graph because these methods directly pick selecting nodes that are in the top 5,000 communi- nodes which we do not allow in our problems. Instead, ties. It is a social media network where we can apply we aim to pick the best groups, and then uniformly at our algorithms to control rumor spread. random allocate vaccines within the group. In addition, 5) Portland and Miami are social-contact graphs based we did study the effect of our algorithm w.r.t. the size of on detailed microscopic simulations of large US cities, groups (see Fig. 6). If each node is a group, GROUP IMMUNI- which has been used in national smallpox and influ- ZATION reduces to the individual based immunization. enza modeling studies using the SIR model [2]. We Indeed the reason we formulate the group immunization divided people into groups by ages ranging from 0-90 problems in this paper is that it is typically not feasible to (hence 91 groups in both networks). They are both con- force targeted individuals to be vaccinated in practice (as tact networks where our algorithms can be adopted to discussed before in the introduction). minimize virus propagation. Settings. For LT model, we uniformly randomly choose 1 5.2 Results percent nodes as the infected nodes (seeds) at the start. And In short, we demonstrate that our methods outperform other we use the same method in [11] to generate the probabilities baselines on all datasets. We also show how the behaviors of on the edges: for a node v, we assign each its incoming edge our methods change as groups vary. Finally, we conduct a ðu; vÞ with a probability p^uv uniformly at random, then we case study to analyze the vaccine allocations at group scale. uniformly randomly give a probability wv to v representing v’s incoming edges fail to activateP it. Then we get the nor- 5.2.1 Performance malized weight puv ¼ p^uv=ð u2V p^uv þ wvÞ. We construct 1,000 live-edge graphs in our algorithm for LT model. For Fig. 1 shows experimental results under LT model for group robustness, each data point we show is the mean of 1,000 edge deletion, while Fig. 2 demonstrates the results for node runs of randomly sampling removed edges/nodes from deletion. In all networks, GREEDY-LT consistently outperform groups. In the edge deletion version, edge communities are other competitors. Since we have same budgets for both edge induced from node communities, i.e., for an edge e ¼ðu; vÞ, and node deletion, clearly node removal should perform bet- if both u and v belong to a group Ct, then e 2 Ct, otherwise ter than edge deletion as node deletion removes more edges. it belongs to group Cij ¼feuvju 2 Ci;v2 Cjg. Our results demonstrate this fact. As shown in Fig. 1, GREEDY- Baselines. As we are not aware of any direct competitor LT performs pretty well for edge deletion compared with tackling our group immunization problems, we construct other competitors, e.g., in YouTube,GREEDY-LT can reduce about 25 percent of the infection if 500 edges are removed, while for RANDOM,DEGREE and EIGEN, the infection almost 3. http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast. remains the same even removing 500 edges. For node deletion htm 4. http://snap.stanford.edu/data/oregon1.html (Fig. 2), GREEDY-LT performs even better: it reduces more than 5. http://snap.stanford.edu/data/com-Youtube.html 30 percent of the infection given the maximum budgets. ZHANG ET AL.: NEAR-OPTIMAL ALGORITHMS FOR CONTROLLING PROPAGATION AT GROUP SCALE ON NETWORKS 3349 footprint when vaccines are given Fig. 1. Effectiveness for LTmodel various Real Datasets (edge deletion). Footprint ratio ( footprint without giving vaccines ) versus number of vaccines. Lower is better. GREEDY-LTconsistently outperforms other baseline algorithms. footprint when vaccines are given Fig. 2. Effectiveness for LTmodel various Real Datasets (node deletion). Footprint ratio ( footprint without giving vaccines ) versus number of vaccines. Lower is better. GREEDY-LTconsistently outperforms other baseline algorithms.