IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. 12, DECEMBER 2016

Near-Optimal Algorithms for Controlling Propagation at Group Scale on Networks

Yao Zhang, Abhijin Adiga, Sudip Saha, Anil Vullikanti, and B. Aditya Prakash

Abstract—Given a network with groups, such as a contact network grouped by age, which are the best groups to immunize to control an epidemic? Equivalently, how should we choose the best communities in social media like Facebook to stop rumors from spreading? Immunization is an important problem in multiple domains such as epidemiology, public health, cyber security, and social media. Moreover, immunization at group scale (like schools and communities) is more realistic because of constraints on implementation and compliance (e.g., it is hard to ensure that specific individuals take the adequate vaccine). Hence, efficient algorithms for such a "group-based" problem can help public-health experts make more practical decisions. However, most prior work has looked into individual-scale immunization. In this paper, we study the problem of controlling propagation at group scale. We formulate a set of novel Group Immunization problems for multiple natural settings (for both threshold and cascade-based contagion models under both node-level and edge-level interventions) and develop multiple efficient algorithms, including provably approximate solutions. Finally, we show the effectiveness of our methods via extensive experiments on real and synthetic datasets.

Index Terms—Graph mining, social networks, immunization, diffusion, groups

Y. Zhang and B.A. Prakash are with the Department of Computer Science, Virginia Tech, Blacksburg, VA 24061. E-mail: {yaozhang, badityap}@cs.vt.edu.
A. Adiga is with NDSSL, Biocomplexity Institute of Virginia Tech, Blacksburg, VA 24061. E-mail: [email protected].
S. Saha and A. Vullikanti are with the Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, and NDSSL, Biocomplexity Institute of Virginia Tech, Blacksburg, VA 24061. E-mail: {ssaha, akumar}@vbi.vt.edu.
Manuscript received 29 Dec. 2015; revised 9 July 2016; accepted 24 Aug. 2016. Date of publication 1 Sept. 2016; date of current version 2 Nov. 2016.
Recommended for acceptance by A. Gionis.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TKDE.2016.2605088

1 INTRODUCTION

Infectious diseases account for a large fraction of deaths worldwide. The main public health response to containing epidemic outbreaks is by vaccination and social distancing, e.g., [1], [2]. These interventions have resource constraints (e.g., limited supply of vaccines and the high cost of social distancing), and therefore, designing optimal control strategies is an active area of research in public health policy planning, e.g., [1], [3], [4], [5], [6], [7]. However, optimal strategies based on node level characteristics, such as the degree or spectral properties [6], [7], cannot be easily turned into implementable policies, because such targeted immunization of specific individuals raises significant social and moral issues. As a result, vaccination policies, such as those specified by the CDC, are at the level of groups (e.g., based on demographics), and almost all the efforts in epidemiology are focused on developing group level strategies, even though this may lead to sub-optimal solutions compared to the individual level policies. For instance, Medlock et al. [1] develop an optimal vaccine allocation for different age groups. Even so, all prior work on optimal group level immunization has focused on differential equation based models, and has not been studied on network models of epidemic spread. Implementing such interventions is challenging because people "comply" with them based on their individual utility. We model such limited compliance by random vaccine allocation within each group, which motivates our paper. Our focus in this paper is on developing interventions that can be implemented before the start of the epidemic. Further, interventions can be of two kinds: vaccination (which can be modeled in terms of node removals) and social distancing (which can be modeled in terms of edge removal, e.g., reducing contacts between certain sub-populations). We consider two kinds of metrics: (1) maximizing the expected number of people who do not get infected, and (2) minimizing the time for the epidemic to die out. These are both commonly studied metrics in public health (see, e.g., [8], [9]). Most of the work in mathematical epidemiology has been formalized in terms of reducing the reproductive number. However, these methods do not extend to network based models. In this paper, we develop algorithms for optimizing these metrics in two different models of diffusion.

Similar diffusion processes arise in other domains such as social media, e.g., the spread of spam/rumors on Facebook, Twitter, LiveJournal or Friendster. These are also commonly modeled by models such as the Linear Threshold (LT) model [10]. Analogous to the public-health case, we can control such processes by 'immunization' via blocking users or preventing some interactions, such that the expected number of users who adopt spam/rumors is minimal. Past work has studied individual-level based immunization algorithms for the LT model [11]. However, it is more realistic to issue a warning bulletin on group pages, and some members within those groups comply with the warning to stop disseminating rumors. Similarly, Twitter can warn a group of accounts to control the spread of the malicious tweets. The same holds true for user groups in Friendster and LiveJournal.

In this paper, we present a unified approach to study strategies for controlling the spread of diffusion processes through group level interventions, capturing both uncertainty and lack of control at high resolution within groups. The main contributions of our paper are:

1041-4347 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1) Problem Formulation: We develop group level intervention problems in both the LT model, and the SIS/SIR models, for which we consider a spectral radius based formulation. We consider arbitrarily specified groups, and interventions that involve both edge and node removal, modeling quarantining and vaccination, respectively. The interventions specify the number $x_i$ of nodes/edges that can be removed within each group $C_i$; however, these are chosen randomly within the group. These problems generalize the node level problems and have not been studied before.
2) Effective Algorithms: We develop efficient theoretical and practical algorithms for the four problem classes we consider, including provable approximation algorithms (SDP and GROUPGREEDYWALK). We find that diverse kinds of techniques are needed for these problems: submodular function maximization on an integer lattice, semidefinite programming, quadratic programming, and the link between closed walks and spectral radius. Our algorithms also leverage prior techniques for analyzing contagion processes, e.g., [6], [12], [13], but require non-trivial extensions.
3) Experimental Evaluation: We present extensive experiments on multiple real datasets including epidemiological and social networks, and demonstrate that our algorithms outperform other competitors on node and edge deletion at group scale for controlling infection as well as spectral radius minimization.

Outline of the Paper. The rest of the paper is organized as follows. We first discuss the related work in Section 2, and then formulate the Group Immunization Problem in Section 3. Section 4 presents our algorithms for different settings of the problem for both edge and node removal. Experimental results on several datasets are in Section 5. We finally discuss future work, and conclude in Section 6.

2 RELATED WORK

In general, there has been a lot of interest in studying dynamical processes on large graphs like (a) blogs and propagations [14], [15], (b) information cascades [16], [17], (c) marketing and product penetration [18], [19] and (d) malware prediction [20]. These dynamic processes are all closely related to virus propagation in epidemiology, rumor spread in social media, malware outbreaks in computer networks, etc. In this section, we review related work mainly from four areas: epidemiology, propagation models, immunization and other diffusion based optimization problems. In short, past work concentrates on individual-based immunization; in contrast, in this paper we study group-based immunization problems under various models.

Epidemiology. The classical texts on epidemic models and analysis are May and Anderson [8] and Hethcote [21]. Widely studied epidemiological models include homogeneous models [8], [9], [22], which assume that every individual has equal contact with others in the population.

Propagation Models. There are broadly two types of propagation models which have been used to describe dynamical processes on graphs: threshold based and cascade style. Threshold based models are well-motivated in the social science literature [23], [24], [25] to represent 'threshold' behaviors, e.g., ideas/spam/rumors on Twitter and Facebook. A classic example is the linear threshold model, which has been extensively studied [10]. In this paper, we study the problem of minimizing propagation for the LT model.

Cascade style models, such as the 'flu-like' Susceptible-Infectious-Susceptible (SIS), 'mumps-like' Susceptible-Infectious-Recovered (SIR) and its special case the Independent Cascade (IC) [8], [9], [10], [22], are popular in the epidemiology literature to model different epidemiological states of people, and their state-transitions. Much work has gone into finding the 'epidemic threshold' for such models (the minimum virulence of a virus which results in an epidemic over the network). For example, recent studies [12], [26] show that the spectral radius of the underlying network (the largest eigenvalue of the adjacency matrix of the graph) is related to the epidemic threshold for a wide range of cascade models. Hence, here we investigate how to control an epidemic by minimizing the spectral radius for cascade style models.

Immunization. There has been much work on finding optimal strategies for vaccination and social distancing [1], [3], [4], [5], [6], [7]. Much of the work in the epidemiology literature has been based on differential equation methods [1], [3], [4]. Cohen et al. [5] studied the popular acquaintance immunization policy (pick a random person, and immunize one of its neighbors at random). Using game theory, Aspnes et al. [27] developed inoculation strategies for victims of viruses under random starting points. Kuhlman et al. [28] studied two formulations of the problem of blocking a contagion through edge removals under the model of discrete dynamical systems. Tong et al. [7], [29], Van Mieghem et al. [30], and Prakash et al. [6] proposed various node-based and edge-based immunization algorithms based on minimizing the largest eigenvalue of the graph. Other non-spectral approaches for immunization have been studied by Budak et al. [31], He et al. [32], Khalil et al. [11], Saha et al. [13], and Zhang et al. [33]. All of these papers studied individual-based immunization (where either one targets specific individuals or whole demographics). Here we study group-based problems, where vaccines are distributed randomly inside groups.

Other Diffusion Problems. Other diffusion based optimization problems include the influence maximization problem, which was introduced by Domingos and Richardson [34], and formulated by Kempe et al. [10] as a combinatorial optimization problem. They proved it is NP-hard and also gave a simple $(1 - 1/e)$-approximation based on the submodularity of the expected spread of a set of starting seeds. Recently the paper by Eftekhar et al. [35] studied this problem at group scale. Other such problems where we wish to select a subset of 'important' vertices on graphs include 'outbreak detection' [36] and 'finding most-likely culprits of epidemics' [37]. Purohit et al. [38] looked into 'zooming-out' of a graph by forming groups based on similar influence.

3 OUR PROBLEM FORMULATIONS

Table 1 lists the main symbols we use throughout the paper. Here we assume our graph $G(V, E)$ is directed and weighted. We refer to both node and edge level interventions as immunization.

Groups in a Graph. For a graph $G(V, E)$, we assume that the edge (node) set is partitioned into groups $C = \{C_1, \ldots, C_n\}$

for the edge (node) immunization problems. For a node partition, $C$ may correspond to groups of communities, locations, demographics, etc. Edge groups can be induced from node groups. For example, for an edge $e = (u, v)$, if $u$ and $v$ belong to the same group $C_t$, then $e \in C_t$; otherwise it belongs to the group $C_{ij} = \{e_{uv} \mid u \in C_i, v \in C_j\}$. The edge groups defined this way ensure that every edge $e$ has a group even if the endpoints of $e$ belong to different node groups. Note that we assume there are no overlaps among groups.

TABLE 1
Terms and Symbols

Symbol | Definition and Description
------ | --------------------------
$G(V, E)$ | graph $G$ with the node set $V$ and the edge set $E$
$C$ | set containing groups
$A$ | set of initial infected nodes
$n$ | the number of groups in the graph
$m$ | budget (the number of vaccines)
$p_{uv}$ | weight on edge $e(u, v)$
$g(v)$ | group index of node $v$, i.e., $g(v) = i$ if $v \in C_i$
$g(u, v)$ | group index of edge $(u, v)$, i.e., $g(u, v) = i$ if $(u, v) \in C_i$
$x$ | vaccine allocation vector $(x_1, \ldots, x_n)$ for edges/nodes
$\sigma_{C,A}(x)$ | the expected number of infected nodes at the end when $x$ is allocated to edges
$\sigma'_{C,A}(x)$ | the expected number of infected nodes at the end when $x$ is allocated to nodes
$e_k$ | vector with $e_k = 1$ and $e_i = 0$ for $i \neq k$
$M^E(x)$ | $E[M(x)]$
$\Delta^E(x)$ | maximum expected degree of $G(x)$
$\lambda^E(x)$ | expected spectral radius of $M(x)$
$\lambda(M^E(x))$ | spectral radius of the expected matrix $M^E(x)$
$\lambda^E_{\min}$ | minimum expected spectral radius over all $M(x)$, i.e., $\min_x \lambda^E(x)$
$x_{\min}$ | the allocation vector which minimizes $\lambda(M^E(x))$ over all $x$, i.e., $\arg\min_x \lambda(M^E(x))$
$s$ | number of samples in GROUPGREEDYWALK

Allocating Vaccines to Groups. We define $x = (x_1, \ldots, x_n)$ as the vaccine allocation vector, i.e., if we give $x_i$ vaccines to group $C_i$, then $x_i$ edges (nodes) will be removed uniformly at random from $C_i$, which means those edges/nodes will not be involved in the diffusion process. The objective of our immunization problem is to find an allocation that controls the diffusion process most effectively. For edge deletion, a good solution tends to give more vaccines to the edge groups whose edges have a high chance of being part of cuts/walks. Similarly, for node deletion, we prefer the node groups whose nodes have a high impact on the influence/eigenvalue.

Main Idea of Our Problem Definitions. In the next two sections we give two different sets of problems which cover a wide range of contagion-like processes, both threshold-based and cascade-style. In addition, all our problems have been carefully formulated to be seamless generalizations of the corresponding individual-level problems.

3.1 Problem Definition under LT Model

Our first set of problems is based on the LT model, which is a well-known model for social media and complex propagations [10], suited for representing 'threshold' behaviors for activation. As mentioned in the introduction, the vaccination problem here can help to control processes such as spam and rumors on Twitter and Facebook. Under the LT model, our goal is to minimize the expected number of infected nodes at the end of diffusion, in other words, maximize the expected number of nodes we can save from being infected, by selecting groups for removing edges/nodes.

In the LT model, a node $v$ can be influenced by each neighbor $u$ according to a weight $p_{uv}$, where $\sum_{e(u,v) \in E} p_{uv} \le 1$. The diffusion process proceeds as follows: at the start, every node $u$ uniformly randomly chooses a threshold $\theta_u$ from the range $[0, 1]$, which represents the weighted fraction of $u$'s neighbors that must be active to activate $u$; an inactive node $u$ becomes active at time $t + 1$ if $\sum_{w \in N_u^t} p_{wu} \ge \theta_u$, where $N_u^t$ is the set of active neighbors of $u$ at time $t$; all active nodes will stay active. The process stops when no additional node becomes active. Each group may have some seeds (initial infected nodes). The seeds will spread information/virus by the LT model.

For the edge deletion under the LT model, let $\sigma_{C,A}(x)$ ($\mathbb{Z}^n \to \mathbb{R}$) denote the expected number of infected nodes in $G$ (the footprint of $G$), given seed set $A$ and vaccine allocation vector $x$ for the group set $C$. Now we are ready to define the edge version of the problem under the LT model.

PROBLEM 1: GROUP IMMUNIZATION under LT model (edge version):
GIVEN: Graph $G(V, E)$, a partition of the edge set $C = \{C_1, \ldots, C_n\}$, seed set $A$ and $m$ vaccines (budget). Let $x$ be the edge vaccine allocation vector.
FIND: The optimum allocation $x_{opt}$ which maximizes $f(x) = \sigma_{C,A}(0) - \sigma_{C,A}(x)$ s.t. $|x| \le m$.

Next, we define the node version of this problem. Let $\sigma'_{C,A}(x)$ denote the footprint of $G$. It is the same as $\sigma_{C,A}(x)$ except that the allocation vector $x$ corresponds to node vaccination.

PROBLEM 2: GROUP IMMUNIZATION under LT model (node version):
GIVEN: Graph $G(V, E)$, a partition of the vertex set $C = \{C_1, \ldots, C_n\}$, seed set $A$ and $m$ vaccines (budget). Let $x$ be the node vaccine allocation vector.
FIND: The optimum allocation $x_{opt}$ which maximizes $f'(x) = \sigma'_{C,A}(0) - \sigma'_{C,A}(x)$ s.t. $|x| \le m$.

Hardness of Our Problems. Problems 1 and 2 are NP-hard because their special cases, the individual-level immunization problems (when each edge/node is its own group), are themselves NP-hard [11], [33].

3.2 Problem Definition for Spectral Radius

Our second set of problems is based on the spectral radius formulation [7], [29] for a variety of cascade models including the fundamental SIR ('mumps-like', which generalizes the well-known IC model [10]), SIS ('flu-like'), and SEIS (with incubation period) models. In the SIS/SIR models, every node can be either susceptible (S), infectious (I) or recovered (R). Each infected node $u$ (in state I) can infect each susceptible neighbor $v$ (in state S) with the probability $p_{uv}$. In the SIS model, each infected node $u$ can switch to the susceptible state with the recovery rate $\delta$. In the SIR model, each infected node $u$ can switch to the recovered state with the recovery rate $\delta$, meaning $u$ cannot be infected again.

Spectral radius, denoted by $\lambda$, refers to the largest eigenvalue of the adjacency matrix of a graph $G$. Recent results [12], [26] have shown that $\lambda$ is connected to the reproduction number in epidemiology, and determines the phase transition ('epidemic threshold' $\tau$) between epidemic/non-epidemic regimes in a very large range of cascade-style models [12], including SIR, SIS, SEIS and so on. As shown in [12], $\tau \propto \lambda$, and if $\tau < 1$ the disease will die out quickly irrespective of initial conditions. This gives us the motivation to control the disease spread by minimizing $\lambda$ in the underlying network.

Tong et al. [7], [29] proposed effective node-based and edge-based individual immunization methods to minimize $\lambda$. Following their methodology, in this paper we aim to maximize the drop of the spectral radius of $G$, $\Delta\lambda$, when vaccines are allocated to groups. Similar to Problems 1 and 2, when $x_i$ vaccines are given to group $C_i$, we remove $x_i$ nodes/edges uniformly at random. Hence, we want to find the optimal allocation $x$ such that the expectation of $\Delta\lambda$, $E[\Delta\lambda(x)]$, is maximum. Note that we do not define the problems here based on the 'footprint' (as in the previous section for LT), primarily for two reasons: (a) these versions naturally generalize the corresponding individual-level immunization problems studied in past literature [7], [29]; and (b) due to the epidemic threshold results, using the spectral radius allows us to immediately formulate a general problem for multiple cascade-style models (like SIR/SIS/IC), each with differences in their exact spreading process which we can ignore. Formally our problems are:

PROBLEM 3: GROUP IMMUNIZATION for spectral radius (edge version)
GIVEN: Graph $G(V, E)$, a partition of the edge set $C = \{C_1, \ldots, C_n\}$, and $m$ vaccines (budget). Let $x$ be the edge vaccine allocation vector, and let $E[\Delta\lambda(x)]$ denote the expected drop in the spectral radius after the immunization.
FIND: The optimum allocation $x_{opt}$ which maximizes $E[\Delta\lambda]$, i.e., $x_{opt} = \arg\max_x E[\Delta\lambda(x)]$ s.t. $|x| \le m$.

PROBLEM 4: GROUP IMMUNIZATION for spectral radius (node version)
GIVEN: Graph $G(V, E)$, a partition of the node set $C = \{C_1, \ldots, C_n\}$, and $m$ vaccines (budget). Let $x$ be the node vaccine allocation vector, and let $E[\Delta\lambda(x)]$ denote the expected drop in the spectral radius after the immunization.
FIND: The optimum allocation $x_{opt}$ which maximizes $E[\Delta\lambda]$, i.e., $x_{opt} = \arg\max_x E[\Delta\lambda(x)]$ s.t. $|x| \le m$.

Hardness of Our Problems. Problems 3 and 4 are NP-hard too: their special cases, the individual-level immunization problems, are NP-hard [7], [29].

4 PROPOSED METHODS

We first discuss our algorithms for the GROUP IMMUNIZATION problem under the LT model (Sections 4.1 and 4.2 for Problems 1 and 2), and then the spectral radius versions (Sections 4.3 and 4.4 for Problems 3 and 4).

4.1 Edge Deletion under LT Model

Recall that the function $f(x)$ in Problem 1 is not a simple set function; it is defined over an integer lattice. Hence the submodularity property used in [11] is not applicable to our problem, and we cannot simply apply their greedy algorithm. Instead, our approach is to carefully identify a 'submodularity like' condition that is satisfied by our function $f(x)$, for which a greedy algorithm gives good performance. Let $e_k$ be the vector with 1 at the $k$th index and $0$ be the all zeros vector. We consider the following three properties.

(P1) $f(x) \ge 0$ and $f(0) = 0$.
(P2) (Non-decreasing) $f(x) \le f(x + e_k)$ for any $k$.
(P3) (Diminishing returns) For any $x' \ge x$ and $k$, we have $f(x + e_k) - f(x) \ge f(x' + e_k) - f(x')$.

The notion of submodularity of set functions has been extended to functions over integer lattices; see, e.g., [39], which shows that a greedy algorithm gives a constant factor approximation to submodular lattice functions with budget constraints. We note that in the context of functions defined on an integer lattice, unlike in the case of set functions, submodularity need not be equivalent to the diminishing return property. Besides, there are multiple non-equivalent definitions of the diminishing return property, as observed in [39]. Next, we show in Theorem 1 that a greedy algorithm gives a $(1 - 1/e)$-factor approximation to an integer lattice function satisfying the properties (P1), (P2) and (P3) above, and that our objective function satisfies all of these properties (Lemma 2). Note that it is not clear whether the analysis of [39] implies a similar bound for the kind of functions $f(x)$ we need to consider here.

Lemma 1. Suppose $y = (y_1, \ldots, y_n)^T$ where $y_i \in \mathbb{Z}$ and $\sum_j y_j = m$, then $f(x + y) - f(x) \le \sum_j y_j \left( f(x + e_j) - f(x) \right)$.

Proof. The proof is in the appendix, which can be found at: http://people.cs.vt.edu/yaozhang/group-immu/ □

Theorem 1. Suppose $f(x)$, $x \in \mathbb{Z}^n$, satisfies the properties (P1), (P2) and (P3) above. Then, Algorithm 1 gives a $(1 - 1/e)$-approximate solution to the problem of maximizing $f(x)$ subject to $\sum_i x_i \le m$.

Proof. Suppose $x$ is the solution from the greedy algorithm, and $x^*$ is the optimal solution. Hence, we have $\sum_j x_j = \sum_j x^*_j = m$. Since $\sigma_C(0)$ is constant, the greedy algorithm is equivalent to

$$C^* = \arg\max_{C_i} \; f(x + e_i) - f(x).$$

Let us define $x^{(i)}$ as the solution obtained after the $i$th iteration of the greedy algorithm, hence $x = x^{(m)}$. And $x^*$ can be represented as $\sum_j x^*_j e_j$. We have

$$\begin{aligned}
f(x^*) &\le f(x^* + x^{(i)}) \\
&= f(x^{(i)}) + \left( f(x^* + x^{(i)}) - f(x^{(i)}) \right) \\
&\le f(x^{(i)}) + \sum_j x^*_j \left( f(x^{(i)} + e_j) - f(x^{(i)}) \right) && \text{(Lemma 1)} \\
&\le f(x^{(i)}) + \sum_j x^*_j \left( f(x^{(i+1)}) - f(x^{(i)}) \right) && \text{(Greedy Alg.)} \\
&= f(x^{(i)}) + m \left( f(x^{(i+1)}) - f(x^{(i)}) \right).
\end{aligned}$$

Hence, $f(x^{(i+1)}) \ge (1 - \frac{1}{m}) f(x^{(i)}) + \frac{1}{m} f(x^*)$. Recursively, we get $f(x^{(i)}) \ge \left(1 - (1 - \frac{1}{m})^i\right) f(x^*)$. Therefore, $f(x) = f(x^{(m)}) \ge \left(1 - (1 - \frac{1}{m})^m\right) f(x^*) \ge (1 - 1/e) f(x^*)$. □

Algorithm 1. Greedy Algorithm
Require: $f$, budget $m$
1: $x = 0$
2: for $j = 1$ to $m$ do
3:   $i = \arg\max_{k = 1, \ldots, n} f(x + e_k) - f(x)$
4:   $x = x + e_i$
5: end for
6: return $x$

Now, we will show that the objective function $f(x) = \sigma_{C,A}(0) - \sigma_{C,A}(x)$ for the edge deletion problem under the LT model satisfies the properties stated in Theorem 1. In the ensuing discussion, we will assume without loss of generality that there is only one seed node. This is because, if there are multiple seed nodes, we can merge all of them into a single 'super' node (say $s$) in the following manner: for every vertex $v \in V \setminus A$, set $p_{sv} = \sum_{u \in N(v) \cap A} p_{uv}$, where $N(v)$ is the set of neighbors of $v$. We note that after this modification the edges between $v$ and its susceptible neighbors are unchanged, and at time 0, $\sum_{w \in N_v^0} p_{wv} = \sum_{w \in N(v) \cap A} p_{wv} = p_{sv}$. Hence, $\sigma_{C,A}(x) = \sigma_{C,s}(x)$. Henceforth, we will assume that there is only one seed node, and drop the subscript $A$ from $\sigma_{C,A}(x)$, denoting it by $\sigma_C(x)$.

Lemma 2. The function $f(x) = \sigma_C(0) - \sigma_C(x)$ satisfies the properties (P1), (P2) and (P3) above.

Proof. Property 1 is trivially true because, when $x = 0$, by definition, $f(0) = 0$, and since vaccination does not increase the number of infections, $\sigma_C(x) \le \sigma_C(0)$. For the rest of the proof, since $\sigma_C(0)$ is a constant, we only need to analyze $\sigma_C(x)$. Note that for any $x' \ge x$, we can find a sequence of vectors $(z_1, z_2, \ldots, z_l)$ for some $l$ such that $x = z_1$, $x' = z_l$ and $z_i = z_{i-1} + e_{k_{i-1}}$ for some index $k_{i-1}$. Therefore, it is enough to prove that Properties 2 and 3 hold for $x' = x + e_j$ for some index $j$. Also, we can assume that $x_j < |C_j|$ for $j = 1, \ldots, n$, for if this is not true for some $j$, then all the edges in $C_j$ will be vaccinated, and therefore we can simply remove all of $C_j$ from the analysis and reduce the budget by $x_j$.

Let $\mathcal{R}(x) \subseteq 2^E$ be the collection of sets $R$ satisfying $|R \cap C_i| = x_i$. Following the equivalence between influence in the LT model and the directed percolation process [10], we have $\sigma_C(x) = \sum_{\hat{G}} \Pr[\hat{G}] \sum_{R \in \mathcal{R}(x)} \Pr[R] \, g_C(\hat{G}, R)$, where the first sum is over all possible live-edge subgraphs $\hat{G}$ of $G$ in the percolation process, $\Pr[R]$ is the probability that the set $R$ is removed, and $g_C(\hat{G}, R)$ is the expected number of infected nodes in $\hat{G}$ at the end of the LT process after the set $R$ is removed. This can be rewritten as $\sigma_C(x) = \sum_{\hat{G}} \Pr[\hat{G}] \, \sigma_C(\hat{G}, x)$, where $\Pr[\hat{G}]$ is the probability of sampling $\hat{G}$, and $\sigma_C(\hat{G}, x) = \sum_{R \in \mathcal{R}(x)} \Pr[R] \, g_C(\hat{G}, R)$. Henceforth, we will abbreviate $g_C(\hat{G}, R)$ as $\hat{g}(R)$.

We will show that $\sigma_C(\hat{G}, x)$ is non-increasing, i.e., $\sigma_C(\hat{G}, x) \ge \sigma_C(\hat{G}, x')$ where $x' = x + e_j$, thereby showing that $f(x)$ satisfies Property 2. Since the number of nodes reachable from the seed node with $R$ removed is at least as large as that with $R \cup \{e\}$ removed, for any $e \in C_j \setminus R$, we have $\hat{g}(R) \ge \hat{g}(R \cup \{e\})$. Therefore,

$$\begin{aligned}
\sigma_C(\hat{G}, x') &= \sum_{R' \in \mathcal{R}(x')} \Pr[R'] \, \hat{g}(R') \\
&= \sum_{R \in \mathcal{R}(x)} \sum_{e \in C_j \setminus R} \Pr[R] \, \hat{g}(R \cup \{e\}) \, \frac{1}{|C_j| - x_j} \\
&\le \sum_{R \in \mathcal{R}(x)} \sum_{e \in C_j \setminus R} \Pr[R] \, \hat{g}(R) \, \frac{1}{|C_j| - x_j} \\
&= \sum_{R \in \mathcal{R}(x)} \Pr[R] \, \hat{g}(R) = \sigma_C(\hat{G}, x).
\end{aligned}$$

Finally, we will show that $\sigma_C(\hat{G}, x + e_k) - \sigma_C(\hat{G}, x) \le \sigma_C(\hat{G}, x' + e_k) - \sigma_C(\hat{G}, x')$. From the above discussion, this will imply that $f(x)$ satisfies Property 3. Suppose $x' = x + e_j$; we have two cases to consider: (1) $e_k = e_j$; (2) $e_k \neq e_j$.

For $1 \le i \le n$, let $c_i = |C_i|$ and let $x_i$ denote the $i$th element in $x$.

First, we consider case (1) ($e_k = e_j$). For $R \in \mathcal{R}(x)$, $\Pr[R] = \prod_i \frac{1}{\binom{c_i}{x_i}} = r \, \frac{1}{\binom{c_k}{x_k}}$, where $r = \prod_{i \neq k} \frac{1}{\binom{c_i}{x_i}}$. Hence,

$$\begin{aligned}
\sigma_C(\hat{G}, x) - \sigma_C(\hat{G}, x + e_k)
&= r \sum_{R \in \mathcal{R}(x)} \frac{1}{\binom{c_k}{x_k}} \hat{g}(R) - r \sum_{R' \in \mathcal{R}(x')} \frac{1}{\binom{c_k}{x_k + 1}} \hat{g}(R') \\
&= r \sum_{R \in \mathcal{R}(x)} \left[ \frac{1}{\binom{c_k}{x_k}} \hat{g}(R) - \frac{1}{x_k + 1} \sum_{e \in C_k \setminus R} \frac{1}{\binom{c_k}{x_k + 1}} \hat{g}(R \cup \{e\}) \right].
\end{aligned}$$

The factor $\frac{1}{x_k + 1}$ is due to the fact that $R \cup \{e\}$ comes up in $(x_k + 1)$ combinations involving $R$ and $e$. This simplifies to

$$\sigma_C(\hat{G}, x) - \sigma_C(\hat{G}, x + e_k) = \sum_{R \in \mathcal{R}(x)} \sum_{e \in C_k \setminus R} \frac{r \, x_k! \, (c_k - x_k - 1)!}{c_k!} \left[ \hat{g}(R) - \hat{g}(R \cup \{e\}) \right]. \quad (1)$$

Similarly, we have

$$\begin{aligned}
\sigma_C(\hat{G}, x') - \sigma_C(\hat{G}, x' + e_k)
&= \sum_{R' \in \mathcal{R}(x')} \sum_{e \in C_k \setminus R'} \frac{r \, (x_k + 1)! \, (c_k - x_k - 2)!}{c_k!} \left[ \hat{g}(R') - \hat{g}(R' \cup \{e\}) \right] \\
&= \sum_{R \in \mathcal{R}(x)} \sum_{e' \in C_k \setminus R} \frac{r \, (x_k + 1)! \, (c_k - x_k - 2)!}{c_k!} \, \frac{1}{x_k + 1} \sum_{e \in C_k \setminus (R \cup \{e'\})} \left[ \hat{g}(R \cup \{e'\}) - \hat{g}(R \cup \{e, e'\}) \right].
\end{aligned}$$

From [11, proof of Theorem 6], $\hat{g}(R) - \hat{g}(R \cup \{e\}) \ge \hat{g}(R \cup \{e'\}) - \hat{g}(R \cup \{e, e'\})$ (supermodularity). Therefore, $(c_k - x_k - 1) \sum_e \left[ \hat{g}(R) - \hat{g}(R \cup \{e\}) \right] \ge \sum_{e'} \sum_e \left[ \hat{g}(R \cup \{e'\}) - \hat{g}(R \cup \{e, e'\}) \right]$. Hence proved.

Now, we consider case (2). Let

$$\Pr[R] = r' \, \frac{1}{\binom{c_k}{x_k}} \, \frac{1}{\binom{c_j}{x_j}},$$

where $r' = \prod_{i \neq j, k} \frac{1}{\binom{c_i}{x_i}}$. We can get $\sigma_C(\hat{G}, x) - \sigma_C(\hat{G}, x + e_k)$ from Eqn. (1), with $r = r' / \binom{c_j}{x_j}$. And

$$\begin{aligned}
\sigma_C(\hat{G}, x') - \sigma_C(\hat{G}, x' + e_k)
&= \sum_{R' \in \mathcal{R}(x')} \sum_{e \in C_k \setminus R'} \frac{r' \, (x_j + 1)! \, (c_j - x_j - 1)!}{c_j!} \, \frac{x_k! \, (c_k - x_k - 1)!}{c_k!} \left[ \hat{g}(R') - \hat{g}(R' \cup \{e\}) \right] \\
&= \sum_{R \in \mathcal{R}(x)} \frac{r' \, (x_j + 1)! \, (c_j - x_j - 1)!}{c_j!} \, \frac{x_k! \, (c_k - x_k - 1)!}{c_k!} \, \frac{1}{x_j + 1} \sum_{e_j \in C_j \setminus R} \; \sum_{e \in C_k \setminus (R \cup \{e_j\})} \left[ \hat{g}(R \cup \{e_j\}) - \hat{g}(R \cup \{e, e_j\}) \right].
\end{aligned}$$

Again from [11], $(c_j - x_j) \sum_e \left[ \hat{g}(R) - \hat{g}(R \cup \{e\}) \right] \ge \sum_{e_j} \sum_e \left[ \hat{g}(R \cup \{e_j\}) - \hat{g}(R \cup \{e_j, e\}) \right]$, since each of the $c_j - x_j$ choices of $e_j$ contributes at most $\sum_e \left[ \hat{g}(R) - \hat{g}(R \cup \{e\}) \right]$ by supermodularity. Hence proved. □
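The greedy procedure of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `greedy_lattice`, `caps` and `toy_objective` are hypothetical, and the toy objective is an assumed separable function with diminishing returns (so it satisfies (P1) to (P3)), standing in for the footprint reduction $f(x)$, which in practice has to be estimated by sampling.

```python
from typing import Callable, List, Optional, Sequence


def greedy_lattice(f: Callable[[Sequence[int]], float], n: int, m: int,
                   caps: Optional[Sequence[int]] = None) -> List[int]:
    """Greedy maximisation of f over the integer lattice, as in Algorithm 1.

    f    : objective taking an allocation vector of length n
    n    : number of groups
    m    : budget (number of unit increments)
    caps : optional per-group upper bounds (hypothetical addition, e.g. |C_i|)
    """
    x = [0] * n
    for _ in range(m):
        base = f(x)
        best_gain, best_k = None, None
        for k in range(n):
            if caps is not None and x[k] >= caps[k]:
                continue  # group k is already fully vaccinated
            x[k] += 1
            gain = f(x) - base  # marginal gain f(x + e_k) - f(x)
            x[k] -= 1
            if best_gain is None or gain > best_gain:
                best_gain, best_k = gain, k
        if best_k is None:
            break  # no group can take another vaccine
        x[best_k] += 1
    return x


def toy_objective(x: Sequence[int]) -> float:
    """Assumed toy objective: each extra vaccine in group i saves half of
    what remains to be saved there, so marginal gains shrink."""
    weights = [10.0, 6.0, 2.0]
    return sum(w * (1.0 - 0.5 ** xi) for w, xi in zip(weights, x))


allocation = greedy_lattice(toy_objective, n=3, m=4)
print(allocation)
```

Each iteration costs $n$ evaluations of $f$, so the whole loop uses $O(mn)$ evaluations; the expense of a single evaluation is what the speed-up in GREEDY-LT addresses.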

Algorithm 1 provides a simple greedy algorithm. Here, to estimate $\sigma_C(x)$ when vaccines are allocated uniformly at random within groups, we apply the Sample Average Approximation (SAA) framework. Let $L \subseteq R(x)$ denote a sample set from the set of all possible allocations; then $\sigma_C(x) \approx \hat{\sigma}_C(x) = \frac{1}{|L|} \sum_{R \in L} g_C(R)$. Kempe et al. [10] show that $g_C(R)$ can be estimated by sampling from the set of live-edge graphs. A live-edge graph $T$ is generated as follows: for each node $v \in V$, independently select at most one of its incoming edges $(u, v)$ with probability $p_{uv}$, and with probability $1 - \sum_{u:(u,v) \in E} p_{uv}$ no edge is selected. Let this sample set be denoted by $\mathcal{M}$. This approach takes $O(|\mathcal{M}||L|(|E| + |V|))$ time to estimate $\sigma_C(x)$, and $O(mn|\mathcal{M}||L|(|E| + |V|))$ for the full greedy algorithm, which is not practical for large networks. However, we can speed up this naive greedy algorithm.

Algorithm 2. GREEDY-LT
Require: Graph G, group set C, seed set A, and budget m
1: Merge seed set A to I
2: Sample live-edge graphs $\mathcal{M} = \{T^I_{X_1}, \ldots, T^I_{X_{|\mathcal{M}|}}\}$
3: For each $T^I_X$, calculate $r(u, T^I_X)$ for all nodes (in parallel)
4: Set $x = 0$
5: for $j = 1$ to $m$ do
6:   for each $T^I_X$ and $C_i$ do
7:     pick an edge $e^{C_i}_X$ at random for $C_i$ and $T^I_X$
8:   end for
9:   $C^* = \arg\max_{C_i} \sum_{T^I_X \in \mathcal{M}} \left( r(I, T^I_X) - r(I, T^I_X \setminus e^{C_i}_X) \right)$
10:  $x_{C^*} = x_{C^*} + 1$
11:  for each $T^I_X$ do
12:    If $e^{C^*}_X(u, v) \in T^I_X$, remove edge $e^{C^*}_X$ and update $r(n, T^I_X)$ for node n (in parallel)
13:  end for
14: end for
15: return x

Speed-Up of the Greedy Algorithm: GREEDY-LT. Since a live-edge graph sampled from $\mathcal{M}$ is a tree, we can denote it as $T^s_X$ where $s$ is the root, and $r(u, T^s_X) = |\{v \mid v \in \mathrm{subtree}(u)\}|$, i.e., the number of nodes that are under the subtree of $u$ in $T^s_X$. GREEDY-LT is summarized in Algorithm 2. It first merges all seeds into a 'supernode' $s$ and samples $|\mathcal{M}|$ live-edge graphs, and then computes $r(u, T^s_X)$ in parallel for all nodes in all the live-edge graphs (Lines 1-3). After that we greedily select $m$ vaccines (Lines 4-10): we initially set the allocation vector $x = 0$, and in each iteration, for each group $C_i$, we calculate the marginal loss $\Delta_{C_i,s}(x + e_i) = \sum_{e(u,v) \in T^s_X} \left( r(s, T^s_X) - r(s, T^s_X \setminus e) \right)$, i.e., we randomly pick one edge from each group for each live-edge graph, then sum their marginal losses up over $T^s_X$ as $C_i$'s marginal loss. Note that $r(s, T^s_X) - r(s, T^s_X \setminus e) = r(v, T^s_X) + 1$, where node $v$ is the endpoint of $e$ [11]. We pick the group $C^*$ with the maximum marginal loss. Finally we remove the edge that has been picked, and update $r(u, T^s_X)$ in parallel (Lines 11-13). There are two cases to update $T^s_X$ if $e(u, v) \in T^s_X$: (1) for $v$'s children, we can remove them because they are not reachable from $s$; (2) for any ancestor $a$ of $v$, $r(a, T^s_X \setminus e) = r(a, T^s_X) - r(v, T^s_X) - 1$, which can be done in constant time. Following Theorem 1, GREEDY-LT is a $(1 - 1/e - \epsilon)$-approximation algorithm where $\epsilon$ is the approximation factor for estimating $\sigma_C(x)$.

Running Time of GREEDY-LT. Calculating all $r(u, T^I_X)$ costs $O(|\mathcal{M}||V|)$ time since we can traverse $T^I_X$ once to get all values of $r(u, T^I_X)$. Greedily choosing $m$ vaccine allocations needs $O(mn|\mathcal{M}||V|)$. Hence, the serial version of GREEDY-LT costs $O(mn|\mathcal{M}||V|)$. Note that in practice, we can speed it up by computing and updating $r(u, T^I_X)$ in parallel. In addition, since $T^I_X$ is a tree, the increasing difference property still holds, hence we can accelerate GREEDY-LT by "lazy evaluation" [36], [40] as well.

4.2 Node Deletion under LT Model
Our algorithm for the node version of the GROUP IMMUNIZATION problem is also the greedy Algorithm 1, as in the edge version in Section 4.1. Without loss of generality, we also assume that all seed nodes in $A$ are merged, and drop the subscript $A$ from $\sigma'_{C,A}(x)$, denoting it by $\sigma'_C(x)$. Our analysis relies on proving that the function $f'(x) = \sigma'_C(0) - \sigma'_C(x)$ in Problem 2 satisfies the properties (P1), (P2) and (P3) from Section 4.1, as discussed below.

Lemma 3. The function $f'(x) = \sigma'_C(0) - \sigma'_C(x)$ satisfies the properties (P1), (P2) and (P3).

Proof. The proof is in the appendix.

Lemma 3 suggests that Theorem 1 holds for the node version as well: the GREEDY algorithm will provide a $(1 - 1/e)$-approximate solution. We extend GREEDY-LT (Algorithm 2) to the node version: instead of randomly picking edges (Line 7), we randomly pick nodes to calculate the marginal loss (Line 9), and remove the corresponding nodes (Line 12). The observation is that calculating the marginal loss of removing node $v$ in $C$ in constant time holds here as well, i.e., $r(I, T^I_X) - r(I, T^I_X \setminus v) = r(v, T^I_X) + 1$. Hence, the updating process is the same as the edge version of GREEDY-LT.

4.3 Edge Deletion for Spectral Radius
We propose three algorithms for Problem 3 (edge immunization based on spectral radius) with different trade-offs of quality and running time: the first one, SDP, is a constant factor approximation algorithm that minimizes the actual eigendrop; the second algorithm, GROUPGREEDYWALK, is a bicriteria approximation algorithm based on hitting walks; the third algorithm, LP, is a Linear Programming (LP) based method which uses an estimation of the eigendrop.

SDP is a constant-factor approximation algorithm, which gives us good results, but it is very slow with an $O(|V|^4 \mathrm{polylog}(|V|))$ time complexity. Hence, we develop GROUPGREEDYWALK, a bicriteria approximation algorithm based on hitting closed walks [13]. Though GROUPGREEDYWALK loses a little quality compared to SDP, it is faster with an $O(sm^2|V|^3)$ time complexity (where $s$ corresponds to the number of samples, described later). However, it may still not be scalable to very large networks with millions of nodes. Therefore, we come up with LP, a linear programming based heuristic whose time complexity depends only on the number of groups, not graph size. In reality, the number of groups in a graph is typically much smaller than the number of nodes. Hence, LP is much faster than SDP and GROUPGREEDYWALK.
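To make the constant-time marginal-loss rule used by GREEDY-LT in Section 4.1 concrete, the following is a minimal sketch, not the authors' implementation. It assumes a live-edge tree stored as a child-list dictionary with a separate parent map; the helper names `descendant_counts` and `remove_edge` are hypothetical. It computes r(u, T) as the number of nodes strictly below u, and on deleting a tree edge (u, v) it returns the loss r(v, T) + 1 and subtracts that amount from every ancestor of v.

```python
def descendant_counts(children, root):
    """Return r[u] = number of nodes strictly below u in the tree rooted at root."""
    order, stack = [], [root]
    while stack:                      # iterative DFS: parents are listed before children
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))
    r = {}
    for u in reversed(order):         # so children are processed before their parents
        r[u] = sum(r[c] + 1 for c in children.get(u, []))
    return r


def remove_edge(children, parent, r, u, v):
    """Delete tree edge (u, v); update r on ancestors of v; return the marginal loss."""
    loss = r[v] + 1                   # v and all its descendants become unreachable
    children[u].remove(v)
    a = u
    while a is not None:              # r(a, T \ e) = r(a, T) - r(v, T) - 1
        r[a] -= loss
        a = parent.get(a)
    return loss


# Toy live-edge tree rooted at the merged seed 's' (hypothetical data).
children = {"s": ["a", "b"], "a": ["c", "d"], "d": ["e"]}
parent = {"a": "s", "b": "s", "c": "a", "d": "a", "e": "d"}
r = descendant_counts(children, "s")
loss = remove_edge(children, parent, r, "a", "d")
```

In this toy tree the root initially reaches five nodes; deleting edge (a, d) disconnects d and e, so the loss equals r(d) + 1 without re-traversing the tree.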

D 4 And experimental results demonstrate that it is scalable to degree. If EðHÞlog jVpj,ffiffiffiffiffiffiffiffiffiffiffiffiffiffi then, almost surely iðMðHÞÞ networks with millions of nodes, and provides competitive iðMEðHÞÞj ð2 þ oð1ÞÞ DEðHÞ, for i ¼ 1; ...; jV j. empirical performance (see Section 5). Note that even though SDP and GROUPGREEDYWALK may Recall that xmin is the output of SDP (4), and it corre- not be scalable to very large networks, both have proven per- sponds to the allocation vector which minimizes ðMEðxÞÞ D formance guarantee. In addition, they are not merely of theo- over all x. Let EðxminÞ denote the maximum expected retical interest: they can be used as a baseline to assess the degree of GðxminÞ. The following lemma proves that the performance of faster heuristics on smaller networks. SDP formulation givespffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi us an approximation algorithm with Next, we will introduce the SDP algorithm (Section 4.3.1), constant factor Oð DEðxminÞÞ. the GROUPGREEDYWALK algorithm (Section 4.3.2), and the LP 4 Lemma 4. If x is such that DEðx Þlog jV j, then, heuristic (Section 4.3.3) respectively. min pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffimin min E ðMEðxminÞÞþð2 þ oð1ÞÞ DEðxminÞ þ 1. pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 4.3.1 SDP: A Constant Factor Approximation Algorithm Proof. Let z ¼ ðMEðxminÞÞþð2 þ oð1ÞÞ DEðxminÞ. Apply- Let GðV; EÞ be a graph whose edge set is partitioned into n ing Theorem 2 to GðxminÞ, ðMðxminÞÞ z almost surely. D 4 groups C1; ...;Cn. Let x be the edge allocation vector. For In fact, for Eðxmin Þlog jV j, it can be shown that an edge ðu; vÞ, let gðu; vÞ denote the index of the group to Pr ðMðxminÞÞ z 1=jV j (see [41, proof of Theorem ðu; vÞ Gð Þ which belongs. Let x be the 6]). Noting that ðMðxminÞÞ ðMÞ, obtained by removing each edge in Ci with probability p ¼ x =c , where c ¼jC j. 
Let MðxÞ be its adjacency matrix E i i i i i EðxminÞ¼ ½ðMðxminÞÞ and EðxÞ¼E½ðMðxÞÞ be the expected spectral radius Pr ðMðxminÞÞ z z 1; with prob. ð1 pgðu;vÞÞ if ðu; vÞ2EðGÞ; þ Pr ðMðxminÞÞ z ðMÞ ðMðxÞÞ ¼ uv 0; otherwise: 1 1 z þ ðMÞ

Let MEðxÞ¼E½MðxÞ be the expectation of the adjacency min min By definition, E EðxminÞ. Therefore, E E matrix of GðxÞ ðxminÞz þ 1. Hence, proved. tu 1 pgðu;vÞ; if ðu; vÞ2EðGÞ; Running Time. ðMEðxÞÞuv ¼ (3) The SDP step (Eq. (4)) dominates the run- 0; otherwise: ning time of this algorithm, which is OðjV j4polylogðjV jÞÞ. The problem is to find the optimal allocation, i.e., the x for which EðxÞ is minimized. We will denote this value by 4.3.2 GROUPGREEDYWALK: A Bicriteria Approximation min E :¼ minxEðxÞ. Algorithm 4 Remark 4.1. In the SDP formulation, for ease of analysis, we As shown above, SDP with a ðjV j polylogðjV jÞÞ time com- replace the hard budget constraint by an expected budget plexity, is too slow for large networks. In this section, we constraint, i.e., the expected size of the vaccine allocation leverage the technique of hitting closed walks [13] for the vector x is m. This is not a problem since, in reality, the bud- GROUP IMMUNIZATION problem, and propose a bicriteria get is sufficiently high ( log n). Hence, with high probabil- approximation algorithm called GROUPGREEDYWALK. ity, the number of vaccines in the solution will be very close Saha et al. [13] studied the problem of minimizing the spec- to the expected budget. Given this small difference, we can tral radius under a given threshold by removing the smallest force the number of vaccines to be within the budget con- number of edges, and developed a greedy based approxima- straints, with very little effect on the performance. tion algorithm for it. Different from their work, our goal is to distribute a given budget of vaccines to groups to minimize The SDP Formulation: Finding the Allocation x with Minimum the spectral radius as small as possible. We can adapt their ðMEðxÞÞ. Note that, MEðxÞuv ¼ð1 pgðu;vÞÞ,ifðu; vÞ2EðGÞ. greedy algorithm to the group immunization, by choosing We use a simple SDP to find the allocation which minimizes groups with maximum marginal gain of hitting closed walks. 
ðMEðxÞÞ and meets the budget constraint m However, it is not clear whether this works, as we need to con- sider all “possible worlds” for group immunization. minimize t In graph G, a closed walk is a sequence of nodes starting subject to 0 p 1; for i ¼ 1; ...;n P i (4) and ending at the same node, with two consecutive nodes ipijCijm; adjacent to each other. Closed k-walk is a walk with tI MEðxÞ0: length k. Let walks ðe; G; kÞ denote the number of closed k-walks in G containing e ¼ði; jÞ. We say that an edge set E Let xmin denote the allocation vector corresponding to the hits a walk w if w contains an edge from E. Recall that GðxÞ solution of the SDP. min is a random graph obtained by removing a random subset Analysis: Relating E to ðMEðxminÞÞ. One can use the fol- of xi edges in Ci, where C1; ...;Cn is a partition of the edge lowing result by Lu and Peng [41] to bound EðxÞ with set E. Let WðG; kÞ be the set of all walks of length k in the respect to ðMEðxÞÞ. graph G. Let nkðG; eÞ denote the number of walks of Theorem 2 ([41]). Consider an edge-independent random graph length k in G that pass through edge e. Similarly, let H. Let MðHÞ denote its adjacency matrix and nkðG; SÞ denote the walks of length k in G that pass though MEðHÞ¼E½MðHÞ. DEðHÞ denotes the maximum expected edges in the set S. Let nkðGÞ¼nkðG; EÞ ¼ jWðG; kÞj denote 3346 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. 12, DECEMBER 2016 the number of walks with length k in G. Here we focus on Lemma 6. Let xoptðmÞ be the optimum allocation such that opt walks of a fixed length k ¼ uðlog jV jÞ. Note that for GðxÞ, T ¼ E½1ðGðx ðmÞÞÞ. Let y be defined as nkðGðxÞÞ is a random variable. opt opt xi ; if xi mi=2; yi ¼ Algorithm 3. GROUPGREEDYWALK (G, m) mi otherwise, Require: Graph G, group set C, and budget m 1: x ¼ 0 where mi is the number of edges in group Ci. Then, we have k k 2: for j ¼ 1 to m do E½nkðGðyÞÞ jV j2 T . 3: i ¼ arg max ExpCountWalksðx þ e Þ k¼1;...;n k Proof. 
The proof is in the appendix. tu ExpCountWalksðxÞ 4: x ¼ x þ ei Now, we prove Theorem 3. 5: end for 6: return x Proof of Theorem 3. Let gðxÞ denote the expected number of walks in WðG; kÞ hit by the edges that are removed in GðxÞ. Then gðxÞ has the diminishing returns property, i.e., Algorithm 3 gives the pseudocode of our GROUPGREE- 0 0 0 for x x , we have gðx þ eiÞgðxÞgðx þ eiÞgðx Þ. DYWALK algorithm. EXPCOUNTWALKSðG; xÞ returns the The proof of the diminishing returns follows the proof of expected number of walks surviving in GðxÞ.Notethat Lemma 2. EXPCOUNTWALKS is different from COUNTWALKS in [13], as We will compare gðxgÞ to gðyÞ where y is defined as it returns the expected number of hitting walks when a vaccine allocation vector x is assigned to groups, while opt opt xi ; if xi mi=2; COUNTWALKS in [13] is based on removing a set of edges yi ¼ mi otherwise, given a budget constraint. The idea of GROUPGREEDYWALK is that, each time we select a group Ci with the maxi- whereP miPis the number of edges in group Ci. Note that mummarginalgaininEXPCOUNTWALKSðG; xÞ,whenallo- opt i yi 2 i xi 2m. cating one vaccine to Ci. ðiÞ Let x denote the vector after ith iteration of GROUP- Algorithm 3 follows the framework of the individual based GREEDYWALK. Since gðxÞ has the diminishing returns GREEDYWALK algorithm [13]. Instead of picking edges, it choo- property, it follows the proof of Theorem 1 that ses groups to maximize marginal gain of eigendrop. The main ðiÞ 1 i 2 fðx Þð1 ð1 2mÞ ÞfðyÞ. Therefore, for i ¼ Oðm log jV jÞ, challenge here is to show GROUPGREEDYWALK is a provable 2 2 opt 1 Oðm log jV jÞ log jV j approximation algorithm. Let x ðmÞ be the optimum solu- we have 1 ð1 2mÞ 1 ð1=eÞ 1 opt 1 1 tion corresponding to budget m ,andT ¼ 1ðGðx ðmÞÞÞ (the 1 , where N is the number of total walks in jV jlog jV j N spectral radius after vaccine allocation for the optimum solu- the original graph G. tion). 
We can prove the following theorem: c k From Lemma 6, we have E½nkðGðyÞÞ jV j T for a c k Theorem 3. Let xoptðmÞ be the optimum solution corresponding to constant c. This implies fðyÞN jV j T . Therefore, g g 1 c k c k budget m of edges removed. Let x be the allocation returned by fðx Þð1 jV jÞðN jV j T ÞN 1 jV j T . This implies 2 g c k GROUPGREEDYWALK (G; c1m log jV j), for a constant c1.Thenwe that nkðGðx ÞÞ OðjV j T Þ. From Lemma 5, it follows g 0 0 g g 0 have 1ðGðx ÞÞ c T for a constant c ,where1ðGðx ÞÞ is the that 1ðGðx ÞÞ c T. tu spectral radius after allocating vaccines based on xg. Implementation Notes. Given the adjacency matrix A of G, ROUP REEDY ALK k1 Remark 4.2. Theorem 3 shows that G G W is a the number of k-length walks from u to v is given by Auv .It 2 0 ðc1 log jV j;cÞ-bicriteria approximation algorithm. Differ- also corresponds to the number of walks hit by the ent from the analysis of traditional approximation algo- edge ðu; vÞ. We implement the algorithm as follows. In each rithms, in order to bound the result of GROUPGREEDYWALK iteration, we randomly sample a set of edges of the G w.r.t to the optimal solution, we need a larger budget according to x. For each sample, we compute the expected 2 2 c1 log jV jm.Typically,log jV j is much smaller than the decrease in the number of walks for the removal of one budget m. And when the budget m is very large, the mar- edge in group i (for computing the effect of allocation 2 ginal gain of eigendrop for a larger budget c1 log jV jm will vector x þ ei) as follows: We construct GðxÞ, compute tend to be very close to the marginal gain of eigendrop for A0 ¼ AðGðxÞÞk1 and take the average over all A0ðu; vÞ ele- the budget m. Hence, adding such small factor into the bud- ments where ðu; vÞ belongs to group i. We perform this for get m will have little effect on the performance. each sample (number of samples is s) and take the average over all the samples. 
Finally, we choose that i which gives We will use Lemmas 5 and 6 to prove this theorem. Intui- the maximum average and update x by adding ei to it. tively, Lemma 5 shows the expected spectral radius is Running Time. For budget m, Am1 can be computed upperbounded by T if the number of walk k ¼ Oðlog jV jÞ; in time Oðm2jV j3Þ. For each sample of x,wecompute while Lemma 6 shows that the expected number of walks m1 with length k can be upperbounded by T as well. AðGðxÞÞ . Note that, computing the effect of removing 2 ei for each sample takes only OðjV j Þ time. Therefore, for k k Lemma 5. If E½nkðGðxÞ ¼ OðjV j2 T Þ for k ¼ Oðlog jV jÞ, asamplesizeofs, the algorithm overall takes then E½ ðGðxÞÞ c T for a constant c . 3 1 3 3 Oðsm2jV j Þ time. If m ¼ Oðlog jV jÞ, the time complexity is Proof. The proof is in the appendix. tu OðsjV j3log 2jV jÞ. ZHANG ET AL.: NEAR-OPTIMAL ALGORITHMS FOR CONTROLLING PROPAGATION AT GROUP SCALE ON NETWORKS 3347
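As an illustration of the sampling idea in the implementation notes of Section 4.3.2, here is a small sketch under stated assumptions; it is not the authors' code. It assumes an undirected, unweighted graph, counts closed k-walks as trace(A^k), and estimates the expected number of surviving walks in G(x) by Monte Carlo sampling, removing x_i edges uniformly at random from group i. The function names `closed_walks`, `exp_count_walks` and `greedy_step` are hypothetical, and the greedy step simply picks the group whose extra vaccine leaves the fewest expected surviving walks.

```python
import random


def mat_mul(a, b):
    """Plain-Python product of two square matrices."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def closed_walks(adj, k):
    """trace(A^k): closed walks of length k, counted once per start node."""
    n = len(adj)
    power = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        power = mat_mul(power, adj)
    return sum(power[i][i] for i in range(n))


def exp_count_walks(n, groups, x, k, samples=200, seed=0):
    """Monte Carlo estimate of the expected closed k-walks surviving in G(x)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        adj = [[0] * n for _ in range(n)]
        for edges, xi in zip(groups, x):
            removed = set(rng.sample(range(len(edges)), xi))
            for idx, (u, v) in enumerate(edges):
                if idx not in removed:
                    adj[u][v] = adj[v][u] = 1
        total += closed_walks(adj, k)
    return total / samples


def greedy_step(n, groups, x, k, **kw):
    """Group index whose additional vaccine minimizes expected surviving walks."""
    best, best_val = None, None
    for i, edges in enumerate(groups):
        if x[i] >= len(edges):
            continue                  # group already fully removed
        trial = list(x)
        trial[i] += 1
        val = exp_count_walks(n, groups, trial, k, **kw)
        if best_val is None or val < best_val:
            best, best_val = i, val
    return best


# Toy graph: a triangle (group 0) plus one pendant edge (group 1).
groups = [[(0, 1), (1, 2), (0, 2)], [(2, 3)]]
choice = greedy_step(4, groups, [0, 0], k=3, samples=5)
```

On this toy input every closed 3-walk lies on the triangle, so removing any triangle edge hits all of them and the greedy step selects group 0. The dense matrix powers here mirror the cubic cost discussed in the running-time paragraph and are only practical for small graphs.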

4.3.3 LP: A Fast Heuristic
GROUPGREEDYWALK is a good approximation algorithm like SDP; however, it may not be scalable to very large networks with millions of nodes. In this section, we propose a much faster heuristic based on estimating eigendrop.

The eigendrop when removing edges in the set $E_T$ can be approximated by $\Delta\lambda_f(T) = \sum_{(i,j) \in E_T} M_{ij} u_i u_j$ where $Mu = \lambda u$ and $u = (u_1, \ldots, u_i, \ldots)$ [29]. Given the allocation vector $x$, the expected drop in spectral radius is then given by

$$E[\Delta\lambda_f(x)] = \sum_{(i,j) \in E} M_{ij} u_i u_j \Pr\big((i,j) \text{ is removed}\big) = \sum_{a \in C} \sum_{(i,j) \in C_a} M_{ij} u_i u_j x_a. \tag{5}$$

If we define $\alpha_a = \sum_{(i,j) \in C_a} M_{ij} u_i u_j$, then $f(x) = \sum_a \alpha_a x_a$. We want to maximize $f(x)$ subject to the budget constraints. This can be formulated as a linear program as given below:

$$\begin{aligned} \text{maximize} \quad & \sum_a \alpha_a x_a \\ \text{subject to} \quad & \sum_a x_a |C_a| \le m, \\ & 0 \le x_a \le 1. \end{aligned} \tag{6}$$

Running Time. The LP takes $O(n^4)$ time where $n$ is the number of groups. Note that it is not a function of the graph size. Typically, the number of groups is small, hence this algorithm is very fast.

4.4 Node Deletion for Spectral Radius
Here, we propose an algorithm for solving Problem 4: the group node immunization problem with respect to eigendrop. It is based on the approximate eigendrop method which was discussed in Section 4.3. The eigendrop when removing nodes in $S$ can be approximated as follows [7]:

$$\Delta\lambda_f(S) = \sum_{j \in S} 2\lambda u_j^2 - \sum_{i,j \in S} M_{ij} u_i u_j, \tag{7}$$

where $Mu = \lambda u$ and $u = (u_1, \ldots, u_i, \ldots)$. Recall that $C$ is the set of groups and $x = (x_1, \ldots, x_i, \ldots)$ is the allocation vector where $x_i$ is the fraction of nodes vaccinated in group $C_i$. For the group vaccination problem, the expected eigendrop can be approximated by applying (7) as follows:

$$E[\Delta\lambda_f(x)] = \sum_{j \in V} 2\lambda u_j^2 \Pr(j \text{ is vaccinated}) - \sum_{i,j \in V} M_{ij} u_i u_j \Pr(i \text{ \& } j \text{ are vaccinated}). \tag{8}$$

Let $g(v)$ denote the index of the group to which $v$ belongs, i.e., if $v \in C_i$, then $g(v) = i$. The probability that $j$ is vaccinated is $x_{g(j)}$ and the probability that both $i$ and $j$ are vaccinated is

$$\Pr(i \text{ \& } j \text{ are vaccinated}) = \begin{cases} x_{g(i)} x_{g(j)}, & \text{if } g(i) \ne g(j), \\ x_{g(i)}^2 \dfrac{|C_{g(i)}|}{|C_{g(i)}| - 1}, & \text{otherwise.} \end{cases} \tag{9}$$

Applying the above to (8),

$$f(x) = \sum_a \sum_{j \in C_a} 2\lambda u_j^2 x_a - \sum_a \sum_{i,j \in C_a} M_{ij} u_i u_j x_a^2 \frac{|C_a|}{|C_a| - 1} - \sum_{a \ne b} \sum_{i \in C_a, j \in C_b} M_{ij} u_i u_j x_a x_b.$$

Observing that $M_{ij}$, $u_i$ and $\lambda$ are constants, defining $\alpha_a = \sum_{j \in C_a} 2\lambda u_j^2$, $\beta_a = \frac{|C_a|}{|C_a| - 1} \sum_{i,j \in C_a} M_{ij} u_i u_j$, and $\Gamma_{ab} = \sum_{i \in C_a, j \in C_b} M_{ij} u_i u_j$, we get

$$f(x) = \sum_a \alpha_a x_a - \sum_a \beta_a x_a^2 - \sum_{a \ne b} \Gamma_{ab} x_a x_b.$$

Our aim is to find the $x$ which maximizes $f(x)$. This can be formulated as a quadratic program

$$\begin{aligned} \text{minimize} \quad & \sum_a \beta_a x_a^2 + \sum_{a \ne b} \Gamma_{ab} x_a x_b - \sum_a \alpha_a x_a = \tfrac{1}{2} x^T Q x + c^T x \\ \text{subject to} \quad & \sum_a x_a |C_a| \le B, \\ & 0 \le x_a \le 1, \end{aligned} \tag{10}$$

where $Q_{aa} = 2\beta_a$ and, for $a \ne b$, $Q_{ab} = 2\Gamma_{ab}$ and $c_a = -\alpha_a$. If $Q$ is not positive semi-definite, the problem is NP-hard [42]. In that case, we use a low-rank matrix $\hat{Q}$ formed by all its eigenvectors corresponding to non-negative eigenvalues. The QP on $\hat{Q}$ can be solved in polynomial time using the ellipsoid method [42]. Let $\hat{f}(x) = \tfrac{1}{2} x^T \hat{Q} x + c^T x$, and let $x_Q$, $x_{\hat{Q}}$ be the best allocation vectors corresponding to $Q$ and $\hat{Q}$ respectively. The next lemma shows that $x_{\hat{Q}}$ is a good approximation to $x_Q$.

Lemma 7. $|\hat{f}(x_{\hat{Q}}) - f(x_Q)| \le \frac{n}{2} \|Q - \hat{Q}\|_F$, where $n$ is the number of groups in the graph.

Proof. The proof is in the appendix.

Running Time. The QP takes $O(n^4)$ time. Again, note that $n$ is the number of groups. Hence, it is fast when the number of groups is small.

5 EMPIRICAL STUDY
We present a detailed experimental evaluation now.

5.1 Experimental Setup
We implemented the algorithms in Python (see footnote 2), and conducted the experiments using a 4 Xeon E7-4850 CPU with 512 GB of 1,066 MHz main memory.

Datasets. Table 2 briefly summarizes the datasets. We run our experiments on multiple datasets, which were chosen for their size as well as the different domains where the GROUP IMMUNIZATION problem is especially applicable. Note that all our datasets are networks, not diffusion traces. If diffusion traces are provided as inputs instead of a network, there are state-of-the-art algorithms (such as [43]) which can be applied to learn edge weights first, after which our algorithm can be applied.

2. Code: http://people.cs.vt.edu/yaozhang/group-immu/
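Returning to the LP heuristic of Section 4.3.3, the sketch below shows one way Eq. (6) could be evaluated on a small unweighted graph. It is an illustration under assumptions, not the authors' implementation: the leading eigenvector is obtained by power iteration on A + I (same eigenvectors as A, assumed here for stable convergence on a connected graph), each undirected edge is counted once in the group score, and the LP is solved with the standard fractional-knapsack greedy rule rather than a general LP solver. The function names `leading_eigenvector` and `lp_allocation` are hypothetical.

```python
def leading_eigenvector(n, edges, iters=500):
    """Power iteration on A + I for an undirected, unweighted edge list."""
    u = [1.0] * n
    for _ in range(iters):
        nxt = list(u)                          # the "+ I" term
        for i, j in edges:
            nxt[i] += u[j]
            nxt[j] += u[i]
        norm = sum(v * v for v in nxt) ** 0.5
        u = [v / norm for v in nxt]
    return u


def lp_allocation(n, groups, budget):
    """Fractional allocation x maximizing sum_a alpha_a * x_a under the budget."""
    edges = [e for g in groups for e in g]
    u = leading_eigenvector(n, edges)
    # alpha_a = sum of u_i * u_j over the edges of group a (M_ij = 1 here)
    alpha = [sum(u[i] * u[j] for i, j in g) for g in groups]
    # One knapsack constraint plus box constraints: fill groups by score per edge.
    order = sorted(range(len(groups)),
                   key=lambda a: alpha[a] / len(groups[a]), reverse=True)
    x = [0.0] * len(groups)
    left = float(budget)
    for a in order:
        take = min(1.0, left / len(groups[a]))
        x[a] = take
        left -= take * len(groups[a])
        if left <= 0:
            break
    return x, alpha


# Toy graph: a 4-clique (group 0, six edges) with a two-edge tail (group 1).
clique = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
tail = [(3, 4), (4, 5)]
x_small, alpha = lp_allocation(6, [clique, tail], budget=3)
x_large, _ = lp_allocation(6, [clique, tail], budget=7)
```

Because the leading eigenvector concentrates on the clique, the clique edges have the higher score per edge, so a budget of three edges is spent entirely on group 0 and only a larger budget spills into the tail.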

TABLE 2
Datasets

Dataset     Num. of nodes   Num. of edges   Num. of groups
SBM         1,500           5,000           20
Protein     2,361           7,182           13
OregonAS    10,670          22,002          100
YouTube     50 K            450 K           5,000
Portland    0.5 million     1.6 million     91
Miami       0.6 million     2.1 million     91

1) SBM (Stochastic Block Model) [44] is a well-known model to generate synthetic graphs with groups. We generate small networks from the Stochastic Block Model to test the effectiveness of all our methods.
2) Protein (footnote 3) is a protein-protein interaction network in budding yeast. There are 13 classes of proteins, which are naturally treated as groups. It is a biological network, where our immunization algorithms can be potentially applied to block protein interactions.
3) OregonAS (footnote 4) is the Oregon AS router graph collected from the Oregon router views, and groups here are based on router conductivities. We use Louvain [45], a fast community detection algorithm, to specify groups. It is a network where our algorithms can be used to stop malware outbreaks.
4) YouTube (footnote 5) is a friendship network in which users can form groups. We create an induced graph by selecting nodes that are in the top 5,000 communities. It is a social media network where we can apply our algorithms to control rumor spread.
5) Portland and Miami are social-contact graphs based on detailed microscopic simulations of large US cities, which have been used in national smallpox and influenza modeling studies using the SIR model [2]. We divided people into groups by ages ranging from 0-90 (hence 91 groups in both networks). They are both contact networks where our algorithms can be adopted to minimize virus propagation.

3. http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm
4. http://snap.stanford.edu/data/oregon1.html
5. http://snap.stanford.edu/data/com-Youtube.html

Settings. For the LT model, we uniformly randomly choose 1 percent of the nodes as the infected nodes (seeds) at the start. We use the same method as in [11] to generate the probabilities on the edges: for a node $v$, we assign each of its incoming edges $(u, v)$ a probability $\hat{p}_{uv}$ uniformly at random, then we uniformly randomly give a probability $w_v$ to $v$, representing the probability that $v$'s incoming edges fail to activate it. Then we get the normalized weight $p_{uv} = \hat{p}_{uv} / (\sum_{u \in V} \hat{p}_{uv} + w_v)$. We construct 1,000 live-edge graphs in our algorithm for the LT model. For robustness, each data point we show is the mean of 1,000 runs of randomly sampling removed edges/nodes from groups. In the edge deletion version, edge communities are induced from node communities, i.e., for an edge $e = (u, v)$, if both $u$ and $v$ belong to a group $C_t$, then $e \in C_t$, otherwise it belongs to group $C_{ij} = \{e_{uv} \mid u \in C_i, v \in C_j\}$.

Baselines. As we are not aware of any direct competitor tackling our group immunization problems, we construct three baselines for both node and edge deletion to better judge their performance. Analogous versions of these baselines have been regularly used in state-of-the-art individual immunization studies [7], [13], [29].

(1) RANDOM: uniformly randomly assign vaccines to groups for both node deletion and edge deletion.
(2) DEGREE: for node deletion, we calculate the average degree $d_{C_i}$ of each group $C_i$, and independently assign vaccines to $C_i$ with probability $d_{C_i} / \sum_{C_k \in C} d_{C_k}$; for edge deletion, we first calculate the product degree $d_e$ [30] of each edge $e = (u, v)$, i.e., $d_e = d_u \times d_v$, then similar to node deletion, we calculate the average product degree $d_{C_i}$ of $C_i$, and assign vaccines to $C_i$ with probability $d_{C_i} / \sum_{C_k \in C} d_{C_k}$.
(3) EIGEN: Eigenvalue has been widely used in the immunization literature [7], [29], even as a baseline for the LT model [11]. Let $u$ be the eigenvector corresponding to the first eigenvalue of the graph. The eigenscore of node $a$ is $u_a$, while the eigenscore of edge $e(a, b)$ is $|u_a u_b|$ [29]. For both node and edge deletion, we calculate the average eigenscore $u_{C_i}$ of each group $C_i$, and independently assign vaccines to $C_i$ with probability $u_{C_i} / \sum_{C_k \in C} u_{C_k}$.

Remark 5.1. Note that we do not compare and run the individual based immunization methods [7], [11] "as-is" on the original graph because these methods directly pick nodes, which we do not allow in our problems. Instead, we aim to pick the best groups, and then uniformly at random allocate vaccines within the group. In addition, we did study the effect of our algorithm w.r.t. the size of groups (see Fig. 6). If each node is a group, GROUP IMMUNIZATION reduces to the individual based immunization. Indeed the reason we formulate the group immunization problems in this paper is that it is typically not feasible to force targeted individuals to be vaccinated in practice (as discussed before in the introduction).

5.2 Results
In short, we demonstrate that our methods outperform other baselines on all datasets. We also show how the behaviors of our methods change as groups vary. Finally, we conduct a case study to analyze the vaccine allocations at group scale.

5.2.1 Performance
Fig. 1 shows experimental results under the LT model for group edge deletion, while Fig. 2 demonstrates the results for node deletion. In all networks, GREEDY-LT consistently outperforms other competitors. Since we have the same budgets for both edge and node deletion, clearly node removal should perform better than edge deletion as node deletion removes more edges. Our results demonstrate this fact. As shown in Fig. 1, GREEDY-LT performs well for edge deletion compared with other competitors, e.g., in YouTube, GREEDY-LT can reduce about 25 percent of the infection if 500 edges are removed, while for RANDOM, DEGREE and EIGEN, the infection almost remains the same even after removing 500 edges. For node deletion (Fig. 2), GREEDY-LT performs even better: it reduces more than 30 percent of the infection given the maximum budgets.

Fig. 1. Effectiveness for the LT model on various real datasets (edge deletion). Footprint ratio (footprint when vaccines are given / footprint without giving vaccines) versus number of vaccines. Lower is better. GREEDY-LT consistently outperforms other baseline algorithms.

Fig. 2. Effectiveness for the LT model on various real datasets (node deletion). Footprint ratio (footprint when vaccines are given / footprint without giving vaccines) versus number of vaccines. Lower is better. GREEDY-LT consistently outperforms other baseline algorithms.

Fig. 3. Effectiveness for the change of the first eigenvalue on various real datasets (edge deletion). Eigendrop ratio ($\lambda'_G / \lambda_G$) versus number of vaccines ($\lambda'_G$ is the expected eigenvalue after allocating vaccines). Lower is better. SDP, GROUPGREEDYWALK, and LP consistently outperform other baseline algorithms.

Fig. 4. Effectiveness for the change of the first eigenvalue on various real datasets (node deletion). Eigendrop ratio ($\lambda'_G / \lambda_G$) versus number of vaccines ($\lambda'_G$ is the expected eigenvalue after allocating vaccines). Lower is better. QP consistently outperforms other baseline algorithms.

Fig. 3 shows experimental results of the edge version of group immunization for spectral radius, while Fig. 4 demonstrates the results for node deletion. In all networks, SDP, GROUPGREEDYWALK, LP and QP consistently outperform other competitors. SDP gives the best results for Protein; however, it is not scalable to large networks with more than thousands of nodes. GROUPGREEDYWALK gives the second best performance, and it works for graphs with about 10 K nodes. For very large networks like YouTube and Portland with millions of nodes, approximation algorithms like SDP and GROUPGREEDYWALK cannot finish within an allocated time. LP for edge deletion and QP for node deletion perform very well for large networks. For edge deletion (Fig. 3), RANDOM, DEGREE and EIGEN cannot decrease more than 10 percent of the first eigenvalue in YouTube when 5k vaccines are given to groups, while LP can reduce more than 20 percent of the eigenvalue. For node deletion (Fig. 4), QP can get more than twice the reduction of the eigenvalue compared to other competitors. When comparing between node and edge deletion, we get the same result as in Figs. 1 and 2: given the same number of vaccines for both edges and nodes, node removal can get a larger decrease of the spectral radius.

As mentioned above, the problems of minimizing the spectral radius are motivated by the epidemic threshold [12]; an epidemic will quickly die out if the spectral radius is very small. Hence as an example, we also run the SIS model to show how effective our algorithms are at preventing an epidemic from breaking out. We assume all nodes are in the infectious state at the beginning, and the recovery rate is 0.6. Fig. 5 shows the results on Portland and YouTube for node deletion and edge deletion respectively, averaged over 1,000 runs (note that we got similar results on other networks). We observe that LP and QP consistently outperform other competitors: they both have the least number of infected nodes in the network when vaccines are allocated.

Fig. 5. SIS simulations after vaccine allocation. The fraction of infected nodes (in log-scale) versus the time step. Lower is better. SDP, GROUPGREEDYWALK, and LP consistently outperform other baseline algorithms.

5.2.2 Varying Groups
We would like to see the effect of the change of granularity of vaccine allocation. We changed the number of groups on Portland, YouTube and OregonAS. For Portland, age ranges from 0 to 90, hence there are initially 91 groups. We decrease the number of groups by randomly merging two adjacent age groups. For OregonAS, we use the community detection algorithm Louvain [45] to find different numbers of groups. For YouTube, we randomly merge ground-truth communities to form a smaller set of groups.

Figs. 6a and 6b show the performance of QP and LP as the number of groups changes. First, both of them outperform other baselines for Portland and YouTube. Second, as the number of groups increases, the spectral radius decreases more for all algorithms (except for RANDOM) due to the fact that the randomization of allocating vaccines decreases. The extreme case is that when there is only one group, QP, DEGREE and EIGEN uniformly randomly allocate vaccines to the whole graph, which is exactly the same as RANDOM. On the contrary, when the number of groups is equal to the number of nodes, group immunization becomes individual immunization, which is effective but much more expensive. Figs. 6c and 6d show the performance of GREEDY-LT as the number of groups varies. Similar to QP and LP, it consistently outperforms other baselines. And the performance improvement is even more obvious: when the graph size increases from 1 to 200, GREEDY-LT almost reduces 90 percent of the infection.

Fig. 6. (a) and (b): Eigendrop ratio versus number of groups. (c) and (d): Footprint ratio versus number of groups. Lower is better. Our algorithms consistently outperform other baseline algorithms as the number of groups changes as well as the size of groups changes.

5.2.3 Scalability
Although our algorithms are polynomial-time, we show some running time results to demonstrate the scalability of our algorithms. Fig. 7 shows the running time of our algorithms w.r.t. the number of vaccines. We did not show the running time of RANDOM, DEGREE and EIGEN, because they are faster heuristics. First, as expected from the time complexity of GREEDY-LT, when the number of vaccines $m$ increases, the running time of GREEDY-LT increases linearly (Fig. 7a). Second, since the time complexities of SDP and LP are independent of $m$, as shown in Fig. 7b, their running time remains almost constant. Furthermore, we observe that when $m$ is small, GROUPGREEDYWALK ran faster than SDP. As the performances of GROUPGREEDYWALK and SDP are very close, in large graphs with a relatively small budget, we could get a very good solution from GROUPGREEDYWALK.

Fig. 7. Running Time (seconds). Running time versus number of vaccines.

5.2.4 Case Study
We now study the group vaccination problem on realistic social contact networks, Portland and Miami, using age based groups; as discussed earlier, age based directives are commonly used by public health agencies. Fig. 8 shows the number of vaccines assigned to different age groups, for a total of 10,000 vaccines, using the QP algorithm. We find the groups with age 70-79 and 60-66 get the maximum allocation,

for the Portland and Miami networks, respectively. This contrasts with CDC recommendations and with the strategy proposed by Medlock et al. [1]: CDC recommendations include children through age 18, and Medlock et al. suggest prioritizing schoolchildren and adults aged 30 to 39 years. This might be because these results do not use the detailed network structure. We believe this is an interesting result which merits further study.

Fig. 8. Vaccine Distributions for Portland and Miami (Budget = 10,000). Number of vaccines versus Age. (Age range ‘0-9’: 1; ‘10-19’: 2; ‘20-29’: 3; ‘30-39’: 4; ‘40-49’: 5; ‘50-59’: 6; ‘60-69’: 7; ‘70-79’: 8; ‘80-89’: 9; ‘90-’: 10.)

6 DISCUSSION AND CONCLUSION

This paper addresses the problems of controlling epidemics by means of interventions that can be implemented at a group level. We formulate the GROUP IMMUNIZATION problem in the LT model as well as SIS/SIR models (considering the spectral radius minimization) for both edge-level and node-level interventions. We develop algorithms with rigorous performance guarantees and good empirical performance for all these problem classes. Our algorithms require a diverse class of techniques, including submodular function maximization, linear programming, quadratic programming, semidefinite programming, and hitting closed walks. Finally, we evaluate them on real networks of diverse scales. We demonstrate that our algorithms significantly outperform other heuristics, and adapt to the group structure. Some of our algorithms, e.g., SDP, are fairly time intensive, though they run in polynomial time. However, it is important to keep in mind that these algorithms are expected to be run before an epidemic outbreak, where the solution quality is much more critical than the run time.

Currently our SDP and GROUPGREEDYWALK algorithms work for edge deletion. Developing provable approximation algorithms for node deletion by leveraging SDP and GROUPGREEDYWALK can be another future direction. In addition, our formulations capture the uncertainty, lack of control and compliance at a fine granularity in immunization interventions in public health and social media. Another important practical consideration is the economies of scale that arise in such group level formulations; these could be the result of decreasing per unit cost of production or distribution within a group. Such constraints can be modeled as $\sum_{i=1}^{n} f_i(x_i) \le B$, where $f_i(x_i)$ is a concave function, $x_i$ is the allocation to group $C_i$, and $B$ is a budget constraint. Extending our algorithms to handle such constraints with our formulation is an interesting direction for future work.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their comments. This work has been partially supported by the following grants: DTRA Grant HDTRA1-11-1-0016, DTRA CNIMS Contract HDTRA1-11-D-0016-0010, US National Science Foundation Career CNS 0845700, US National Science Foundation ICES CCF-1216000, US National Science Foundation NETSE Grant CNS-1011769, US National Science Foundation DIBBS Grant ACI-1443054, US National Science Foundation Grant IIS-1353346, NEH Grant HG-229283-15, Maryland Procurement Office under contract H98230-14-C-0127, and a Facebook faculty gift. This work is also supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337. The US Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the respective funding agencies.

REFERENCES

[1] J. Medlock and A. P. Galvani, “Optimizing influenza vaccine distribution,” Science, vol. 325, pp. 1705–1708, 2009.
[2] S. Eubank, et al., “Modelling disease outbreaks in realistic urban social networks,” Nature, vol. 429, no. 6988, pp. 180–184, May 2004.
[3] D. Z. Roth and B. Henry, “Social distancing as a pandemic influenza prevention measure: Evidence review,” National Collaborating Centre for Infectious Diseases, 2011.
[4] E. Shim, “Optimal strategies of social distancing and vaccination against seasonal influenza,” Math. Biosciences Eng., vol. 10, no. 5/6, pp. 1615–1634, 2013.
[5] R. Cohen, S. Havlin, and D. ben Avraham, “Efficient immunization strategies for computer networks and populations,” Phys. Rev. Lett., vol. 91, no. 24, Dec. 2003, Art. no. 247901.
[6] B. A. Prakash, L. A. Adamic, T. J. Iwashyna, H. Tong, and C. Faloutsos, “Fractional immunization in networks,” in Proc. SIAM Int. Conf. Data Mining, 2013, pp. 659–667.
[7] H. Tong, B. A. Prakash, C. E. Tsourakakis, T. Eliassi-Rad, C. Faloutsos, and D. H. Chau, “On the vulnerability of large graphs,” in Proc. IEEE Int. Conf. Data Mining, 2010, pp. 1091–1096.
[8] R. M. Anderson and R. M. May, Infectious Diseases of Humans. London, U.K.: Oxford Univ. Press, 1991.
[9] N. Bailey, The Mathematical Theory of Infectious Diseases and Its Applications. London, U.K.: Griffin, 1975.
[10] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proc. 9th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2003, pp. 137–146.
[11] E. B. Khalil, B. Dilkina, and L. Song, “Scalable diffusion-aware optimization of network topology,” in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2014, pp. 1226–1235.
[12] B. A. Prakash, D. Chakrabarti, M. Faloutsos, N. Valler, and C. Faloutsos, “Threshold conditions for arbitrary cascade models on arbitrary networks,” Knowl. Inf. Syst., vol. 33, pp. 549–575, 2012.
[13] S. Saha, A. Adiga, B. A. Prakash, and A. K. S. Vullikanti, “Approximation algorithms for reducing the spectral radius to control epidemic spread,” in Proc. SIAM Int. Conf. Data Mining, 2015, pp. 568–576.
[14] D. Gruhl, R. V. Guha, D. Liben-Nowell, and A. Tomkins, “Information diffusion through blogspace,” in Proc. 13th Int. Conf. World Wide Web, May 2004, pp. 491–501.
[15] R. Kumar, J. Novak, P. Raghavan, and A. Tomkins, “On the bursty evolution of blogspace,” in Proc. 12th Int. Conf. World Wide Web, 2003, pp. 568–576.
[16] S. Bikhchandani, D. Hirshleifer, and I. Welch, “A theory of fads, fashion, custom, and cultural change in informational cascades,” J. Political Economy, vol. 100, no. 5, pp. 992–1026, 1992.
[17] J. Goldenberg, B. Libai, and E. Muller, “Talk of the network: A complex systems look at the underlying process of word-of-mouth,” Marketing Lett., vol. 12, pp. 211–223, 2001.
[18] E. M. Rogers, Diffusion of Innovations, 5th ed. New York, NY, USA: Free Press, Aug. 2003.

[19] J. Leskovec, L. A. Adamic, and B. A. Huberman, “The dynamics of viral marketing,” in Proc. 7th ACM Conf. Electron. Commerce, 2006, pp. 228–237.
[20] E. E. Papalexakis, T. Dumitras, D. H. P. Chau, B. A. Prakash, and C. Faloutsos, “Spatio-temporal mining of software adoption & penetration,” in Proc. IEEE/ACM Int. Conf. Advances Social Netw. Anal. Mining, 2013, pp. 878–885.
[21] H. W. Hethcote, “The mathematics of infectious diseases,” SIAM Rev., vol. 42, pp. 599–653, 2000.
[22] A. G. McKendrick, “Applications of mathematics to medical problems,” Proc. Edinburgh Math. Soc., vol. 44, pp. 98–130, 1925.
[23] M. Granovetter, “Threshold models of collective behavior,” Amer. J. Sociology, vol. 83, pp. 1420–1443, 1978.
[24] D. J. Watts, “A simple model of global cascades on random networks,” Proc. Nat. Academy Sci. USA, vol. 99, no. 9, pp. 5766–5771, 2002.
[25] D. Centola and M. Macy, “Complex contagions and the weakness of long ties,” Amer. J. Sociology, vol. 113, no. 3, pp. 702–734, 2007.
[26] A. Ganesh, L. Massoulie, and D. Towsley, “The effect of network topology on the spread of epidemics,” in Proc. IEEE INFOCOM, 2005, pp. 1455–1466.
[27] J. Aspnes, K. Chang, and A. Yampolskiy, “Inoculation strategies for victims of viruses and the sum-of-squares partition problem,” in Proc. 16th Annu. ACM-SIAM Symp. Discr. Algorithms, 2005, pp. 43–52.
[28] C. J. Kuhlman, G. Tuli, S. Swarup, M. V. Marathe, and S. Ravi, “Blocking simple and complex contagion by edge removal,” in Proc. IEEE 13th Int. Conf. Data Mining, 2013, pp. 399–408.
[29] H. Tong, B. A. Prakash, T. Eliassi-Rad, M. Faloutsos, and C. Faloutsos, “Gelling, and melting, large graphs by edge manipulation,” in Proc. 21st ACM Int. Conf. Inf. Knowl. Manage., 2012, pp. 245–254.
[30] P. V. Mieghem, et al., “Decreasing the spectral radius of a graph by link removals,” Phys. Rev. E, vol. 84, 2011, Art. no. 016101.
[31] C. Budak, D. Agrawal, and A. E. Abbadi, “Limiting the spread of misinformation in social networks,” in Proc. 20th Int. Conf. World Wide Web, 2011, pp. 665–674.
[32] X. He, G. Song, W. Chen, and Q. Jiang, “Influence blocking maximization in social networks under the competitive linear threshold model,” in Proc. SIAM Int. Conf. Data Mining, 2012, pp. 463–474.
[33] Y. Zhang and B. Prakash, “DAVA: Distributing vaccines over large networks under prior information,” in Proc. SIAM Int. Conf. Data Mining, 2014, pp. 46–54.
[34] M. Richardson and P. Domingos, “Mining knowledge-sharing sites for viral marketing,” in Proc. 8th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2002, pp. 61–70.
[35] M. Eftekhar, Y. Ganjali, and N. Koudas, “Information cascade at group scale,” in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2013, pp. 401–409.
[36] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. S. Glance, “Cost-effective outbreak detection in networks,” in Proc. 13th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2007, pp. 420–429.
[37] T. Lappas, E. Terzi, D. Gunopulos, and H. Mannila, “Finding effectors in social networks,” in Proc. 16th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2010, pp. 1059–1068.
[38] M. Purohit, B. A. Prakash, C. Kang, Y. Zhang, and V. Subrahmanian, “Fast influence-based coarsening for large networks,” in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2014, pp. 1296–1305.
[39] T. Soma, N. Kakimura, K. Inaba, and K.-i. Kawarabayashi, “Optimal budget allocation: Theoretical guarantee and efficient algorithm,” in Proc. 31st Int. Conf. Mach. Learn., 2014, pp. 351–359.
[40] M. Minoux, “Accelerated greedy algorithms for maximizing submodular set functions,” in Optimization Techniques. Berlin, Germany: Springer, 1978, pp. 234–243.
[41] L. Lu and X. Peng, “Spectra of edge-independent random graphs,” Electron. J. Combinatorics, vol. 20, no. 4, 2013, Art. no. P27.
[42] S. Sahni, “Computationally related problems,” SIAM J. Comput., vol. 3, no. 4, pp. 262–279, 1974.
[43] A. Goyal, F. Bonchi, and L. V. Lakshmanan, “Learning influence probabilities in social networks,” in Proc. 3rd ACM Int. Conf. Web Search Data Mining, 2010, pp. 241–250.
[44] A. F. McDaid, T. B. Murphy, N. Friel, and N. J. Hurley, “Improved Bayesian inference for the stochastic block model with application to large networks,” Comput. Statist. & Data Anal., Elsevier, vol. 60, pp. 12–31, 2013.
[45] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” J. Statistical Mech.: Theory Experiment, vol. 2008, no. 10, 2008, Art. no. P10008.

Yao Zhang received the bachelor’s and master’s degrees in computer science from Nanjing University, China. He is working toward the PhD degree in the Department of Computer Science, Virginia Tech. His current research interests include graph mining and with focus on understanding and managing information diffusion in networks. He has published several papers in top conferences and journals such as the Knowledge Discovery in Databases, ICDM, SDM, and the Transactions on Knowledge Discovery from Data.

Abhijin Adiga received the PhD degree from the Department of Computer Science & Automation, Indian Institute of Science, in 2011. He is a research assistant professor in the Network Dynamics and Simulation Science Laboratory, Biocomplexity Institute of Virginia Tech. His interests include , modeling, algorithms, combinatorics, and game theory with current focus on dynamical processes over networks and design & implementation of complex simulation systems.

Sudip Saha received the BSc degree in computer science and engineering from the Bangladesh University of Engineering and Technology, in 2006, and the MS degree in computer science from the University of Memphis, Tennessee, in 2010. He is currently working toward the PhD degree in computer science at Virginia Tech. His research interests include graph mining, game theory, and cybersecurity.

Anil Vullikanti received the undergraduate degree from the Indian Institute of Technology, Kanpur, and the PhD degree from the Indian Institute of Science, Bangalore. He is an associate professor in the Department of Computer Science and the Biocomplexity Institute of Virginia Tech. He was a post-doctoral researcher with the Max-Planck Institute for Informatics, and a technical staff member with the Los Alamos National Laboratory. His research interests include the broad areas of approximation and randomized algorithms and dynamical systems, and their applications to computational epidemiology and the modeling, simulation and analysis of socio-technical systems.

B. Aditya Prakash received the BTech (computer science) degree from the Indian Institute of Technology (IIT) Bombay, in 2007, and the PhD degree from the Computer Science Department, Carnegie Mellon University, in 2012. He is an assistant professor in the Computer Science Department, Virginia Tech. His research interests include data mining, applied machine learning, and databases, with emphasis on big-data problems in large real-world networks and time-series. He has published more than 40 refereed papers in major venues, holds two US patents and has given two tutorials (VLDB 2012 and ECML/PKDD 2012) at leading conferences. His work has also received a best paper award, two best-of-conference selections (CIKM 2012, ICDM 2012, and ICDM 2011), and multiple travel awards. His work has been funded through grants/gifts from US NSF, DoE, NSA, NEH and from companies like Symantec. He received a Facebook Faculty Gift Award in 2015. He is also an affiliated faculty member in the Discovery Analytics Center, Virginia Tech.