Large Social Networks can be Targeted for Viral Marketing with Small Seed Sets

Paulo Shakarian and Damon Paulo Network Science Center and Department of Electrical Engineering and Computer Science United States Military Academy West Point, New York 10996 Email: paulo[at]shakarian.net, damon.paulo[at]usma.edu

Abstract—In a “tipping” model, each node in a , Section IV. Finally, we provide an overview of related work representing an individual, adopts a behavior if a certain number in Section V. of his incoming neighbors previously held that property. A key problem for viral marketers is to determine an initial “seed” II.TECHNICAL PRELIMINARIES set in a network such that if given a property then the entire Throughout this paper we assume the existence of a social network adopts the behavior. Here we introduce a method for quickly finding seed sets that scales to very large networks. Our network, G = (V,E), where V is a set of vertices and E is a approach finds a set of nodes that guarantees spreading to the set of directed edges. We will use the notation n and m for the entire network under the tipping model. After experimentally cardinality of V and E respectively. For a given node vi ∈ V , evaluating 31 real-world networks, we found that our approach in the set of incoming neighbors is ηi , and the set of outgoing often finds such sets that are several orders of magnitude smaller out neighbors is ηi . The cardinalities of these sets (and hence than the population size. Our approach also scales well - on a in out Friendster social network consisting of 5.6 million nodes and the in and out degrees of node vi) are di , di respectively. 28 million edges we found a seed sets in under 3.6 hours. We We now define a threshold function that for each node returns also find that highly clustered local neighborhoods and dense the fraction of incoming neighbors that must be activated for network-wide community structure together suppress the ability it to become activate as well. of a trend to spread under the tipping model. Definition 1 (Threshold Function): We define the thresh- I.INTRODUCTION old function as mapping from V to (0, 1]. Formally: θ : V → A much studied model in network science, tipping [10], (0, 1]. For the number of neighbors that must be active, we will [11], [20] (a.k.a. deterministic linear threshold [12]) is often in associated with “seed” or “target” set selection, [7] (a.k.a. use the shorthand ki. Hence, for each vi, ki = dθ(vi) · di e. the maximum influence problem). In this problem we have a We now define an activation function that, given an initial set social network in the form of a directed graph and thresholds of active nodes, returns a set of active nodes after one time for each individual. Based on this data, the desired output is step. the smallest possible set of individuals such that, if initially Definition 2 (Activation Function): Given a threshold func- tion, θ, an activation function Aθ maps subsets of V to activated, the entire population will adopt the new behavior (a 0 seed set). This problem is NP-Complete [9], [12]. Although subsets of V, where for some V ⊆ V , 0 0 in 0 approximation algorithms have been proposed, [3], [7], [8], Aθ(V ) = V ∪ {vi ∈ V s.t. |ηi ∩ V | ≥ ki} (1) [15] none seem to scale to very large data sets. Here, inspired We now define multiple applications of the activation func- arXiv:1205.4431v2 [cs.SI] 15 Mar 2013 by shell decomposition, [2], [5], [13] we present a method tion. guaranteed to find a set of nodes that causes the entire popula- Definition 3 (Multiple Applications of the Activation Function): tion to activate - but is not necessarily of minimal size. We then Given a natural number i > 0, set V 0 ⊆ V , and threshold evaluate the algorithm on 31 large real-world social networks function, θ, we define the multiple applications of the and show that it often finds very small seed sets (often several activation function, Ai (V 0), as follows: orders of magnitude smaller than the population size). We also θ show that the size of a seed set is related to Louvain modularity ( 0 i 0 Aθ(V ) if i = 1 and average clustering coefficient. Therefore, we find that Aθ(V ) = i−1 0 (2) Aθ(A (V )) otherwise dense community structure and tight-knit local neighborhoods θ i 0 i−1 0 together inhibit the spreading of trends under the tipping Clearly, when Aθ(V ) = Aθ (V ) the process has con- model. verged. Further, this occurs in no more than n steps (as, in The rest of the paper is organized as follows. In Section II, each step, at least one new node must be activated). Based we provide formal definitions of the tipping model. This on this idea, we define the function Γ which returns the set is followed by the presentation of our new algorithm in of all nodes activated upon the convergence of the activation Section III. We then describe our experimental results in function. Definition 4 (Γ Function): Let j be the least value such that b:1 h:0 b:1 j 0 j−1 0 V V A B Aθ(V ) = Aθ (V ). We define the function Γθ : 2 → 2 as follows. a:1 a:1 j:0 j:1 Remove e,f,h,i 0 j 0 c:2 c:2 Γθ(V ) = Aθ(V ) (3) g:1 g:0 d:1 We now have all the pieces to introduce our problem - i:0 d:-1 e:0 finding the minimal number of nodes that are initially active f:0 to ensure that the entire set V becomes active. Definition 5 (The MIN-SEED Problem): The MIN-SEED C b:1 D b:0 E c:0 Problem is defined as follows: given a threshold function, θ, g:-1 0 0 a:1 return V ⊆ V s.t. Γθ(V ) = V , and there does not exist c:1 d:-1 00 00 0 00 c:2 V ⊆ V where |V | < |V | and Γθ(V ) = V . g:-1 g:-1 The following theorem is from the literature [9], [12] and d:-1 d:-1 F g:-2 tells us that the MIN-SEED problem is NP-complete. d:-2 Theorem 1 (Complexity of MIN-SEED [9], [12]): MIN- SEED in NP-Complete. Remove g,j Remove b Fig. 1. Example of our algorithm for a simpleRemove network a depicted in box A. III.ALGORITHM We use a threshold value set to 50% of the node degree. Next to each node in in di To deal with the intractability of the MIN-SEED problem, label (lower-case letter) is the value for di − ki (where ki = d 2 e). In the first four iterations, nodes e, f, h, and i are removed resulting in the network we design an algorithm that finds a non-trivial subset of nodes in box B. This is followed by the removal of node j resulting in the network that causes the entire graph to activate, but we do not guarantee in box C. In the next two iterations, nodes a and b are removed (boxes D- that the resulting set will be of minimal size. The algorithm E respectively). Finally, node c is removed (box F). The nodes of the final network, consisting of d and g, have negetive values for di − θi and become is based on the idea of shell decomposition often cited in the output of the algorithm. physics literature [2], [5], [13], [21] but modified to ensure that the resulting set will lead to all nodes being activated. The algorithm, TIP DECOMP is presented in this section. Theorem 2: If all nodes in V 0 ⊆ V returned by TIP DECOMP are initially active, then every node in V will Algorithm 1 TIP DECOMP eventually be activated, too. Require: Threshold function, θ and directed social network Proof: Let w be the total number of nodes removed by G = (V,E) TIP DECOMP, where v1 is the last node removed and vw is Ensure: V 0 the first node removed. We prove the theorem by induction on w as follows. We use P (w) to denote the inductive hypothesis 1: For each vertex vi, compute ki. which states that all nodes from v1 to vw are active. In the 2: For each vertex v , dist = din − k . i i i i base case, P (1) trivially holds as we are guaranteed that from 3: FLAG = TRUE. 0 set V there are at least k1 edges to v1 (or it would not be 4: while FLAG do removed). For the inductive step, assuming P (w) is true, when 5: Let vi be the element of v where disti is minimal. vw+1 was removed from the graph distw+1 ≥ 0 which means 6: if disti = ∞ then in in that dw+1 ≥ kw+1. All nodes in ηw+1 at the time when vw+1 7: FLAG = FALSE. was removed are now active, so vw+1 will now be activated - 8: else which completes the proof. 9: Remove v from G and for each v in ηout, if dist > i j i j We also note that by using the appropriate data structure (we 0, set dist = dist − 1. Otherwise set dist = ∞. j j j used a binomial heap in our implementation), for a network 10: end if of n nodes and m edges, this algorithm can run in time 11: end while O(m log n). 12: return All nodes left in G. Proposition 1: The complexity of TIP DECOMP is O(m· log(n)). Intuitively, the algorithm proceeds as follows (Figure 1). IV. RESULTS Given network G = (V,E) where each node vi has threshold in ki = dθ(vi) · di e, at each iteration, pick the node for which All experiments were run on a computer equipped with an in di − ki is the least but positive (or 0) and remove it. Once Intel X5677 Xeon Processor operating at 3.46 GHz with a in there are no nodes for which di − ki is positive (or 0), the 12 MB Cache. The machine was running Red Hat Enterprise algorithm outputs the remaining nodes in the network. Linux version 6.1 and equipped with 70 GB of physical Now, we prove that the resulting set of nodes is guaranteed memory. TIP DECOMP was written using Python 2.6.6 in to cause all nodes in the graph to activate under the tipping 200 lines of code that leveraged the NetworkX library available model. This proof follows from the fact that any node removed from http://networkx.lanl.gov/. The code used a binomial is activated by the remaining nodes in the network. heap library written by Bjorn¨ B. Brandenburg available from http://www.cs.unc.edu/∼bbb/. All statistics presented in this users to share web bookmarks with their friends. [23] It section were calculated using R 2.13.1. was extracted in Dec. 2010. • Digg is a social news website that allows users to share A. Datasets with friends. [23] It was extracted in Dec. 2010. In total, we examined 31 networks: nine academic col- • EU E-Mail is an e-mail network extracted from a large laboration networks, three e-mail networks, and 19 networks European Union research institution. [14] It is based on extracted from social-media sites. The sites included included e-mail traffic from Oct. 2003 to May 2005. general-purpose social-media (similar to or MyS- • is a popular general-purpose Dutch social net- pace) as well as special-purpose sites (i.e. focused on sharing working site. [23] It was extracted in Dec. 2010. of blogs, photos, or video). • Yelp is a social networking site that allows users to share All datasets used in this paper were obtained from one of product reviews. [23] It was extracted in Nov. 2010. four sources: the ASU Social Computing Data Repository, [23] the Stanford Network Analysis Project, [14] the University D. Category C of Michigan, [17] and Universitat Rovira i Virgili. [1] All • CA-AstroPh is a an academic collaboration network for networks considered were symmetric – i.e. if a directed edge Astro Physics from Jan. 1993 - Apr. 2003. [14] from vertex v to v0 exists, there is also an edge from vertex v0 • CA-CondMat is an academic collaboration network for to v. Tables I (A-C) show some of the pertinent qualities of Condense Matter Physics. Samples from 1999 (Cond- these networks. The networks are categorized by the results Mat99), 2003 (CondMat03), and 2005 (CondMat05) were (explained later in this section). In what follows, we provide obtained from the University of Michigan. [17] A second their real-world context. sample from 2003 (CondMat03a) was obtained from B. Category A Stanford University. [14] • CA-GrQc is a an academic collaboration network for • BlogCatalog is a social blog directory that allows users General Relativity and Quantum Cosmology from Jan. to share blogs with friends. [23] The first two samples of 1993 - Apr. 2003. [14] this site, BlogCatalog1 and 2, were taken in Jul. 2009 and • CA-HepPh is a an academic collaboration network for June 2010 respectively. The third sample, BlogCatalog3 High Energy Physics - Phenomenology from Jan. 1993 - was uploaded to ASU’s Social Computing Data Reposi- Apr. 2003. [14] tory in Aug. 2010. • CA-HepTh is a an academic collaboration network for • Buzznet is a network designed for sharing High Energy Physics - Theory from Jan. 1993 - Apr. photographs, journals, and videos. [23] It was extracted 2003. [14] in Nov. 2010. • CA-NetSci is a an academic collaboration network for • Douban is a Chinese social medial website designed to Network Science from May 2006. provide user reviews and recommendations. [23] It was • Enron E-Mail is an e-mail network from the Enron cor- extracted in Dec. 2010. poration made public by the Federal Energy Regulatory • Flickr is a social media website that allows users to Commission during its investigation. [14] share photographs. [23] It was uploaded to ASU’s Social • URV E-Mail is an e-mail network based on commu- Computing Data Repository in Aug. 2010. nications of members of the University Rovira i Virgili • Flixster is a social media website that allows users to (Tarragona). [1] It was extracted in 2003. share reviews and other information about cinema. [23] • YouTube is a video-sharing website that allows users It was extracted in Dec. 2010. to establish friendship links. [23] The first sample • FourSquare is a location-based social media site. [23] It (YouTube1) was extracted in Dec. 2008. The second was extracted in Dec. 2010. sample (YouTube2) was uploaded to ASU’s Social Com- • Frienster is a general-purpose social-networking puting Data Repository in Aug. 2010. site. [23] It was extracted in Nov. 2010. • Last.Fm is a music-centered social media site. [23] It E. Runtime was extracted in Dec. 2010. • LiveJournal is a site designed to allow users to share First, we examined the runtime of the algorithm (see Fig- their blogs. [23] It was extracted in Jul. 2010. ure 2). Our experiments aligned well with our time complexity • Livemocha is touted as the “world’s largest result (Proposition 1). For example, a network extracted from community.” [23] It was extracted in Dec. 2010. the Dutch social-media site Hyves consisting of 1.4 million • WikiTalk is a network of individuals who set and re- nodes and 5.5 million directed edges was processed by our ceived messages while editing WikiPedia pages. [14] It algorithm in at most 12.2 minutes. The often-cited LiveJournal was extracted in Jan. 2008. dataset consisting of 2.2 million nodes and 25.6 million directed edges was processed in no more than 66 minutes C. Category B - a short time for an NP-hard combinatorial problem on a • Delicious is a social bookmarking site, designed to allow large-sized input. htfrms ewrs hr a ierrato between realtion linear a size. was seed and there value networks, threshold most for than that less size set seed osataogalvrie o ahtiladsta ninteger an at set and interval trial the each for in vertices all among set constant we trials, these Size Seed F. Fgr n al I.I eea,oln oilntok had a networks social results - (as online sets on seed general, smallest algorithm In based the II). categories Table our and three by 3 into (Figure returned have networks we discussion, our nodes the of grouped purposes on of For nodes). based set total of model percentage the tipping of the under size spreading promote to o ahntok eperformed we network, each For I FRAINO H EWRSIN NETWORKS THE ON NFORMATION YouTube2 YouTube1 URV E-Mail Enron E-Mail CA-NetSci CA-HepTh CA-HepPh CA-GrQc CA-CondMat99 CA-CondMat05 CA-CondMat03a CA-CondMat03 CA-AstroPh Yelp Hyves EU E-Mail Digg Delicious WikiTalk Livemocha LiveJournal Last.Fm Frienster FourSquare Flixster Flickr Douban Buzznet BlogCatalog3 BlogCatalog2 BlogCatalog1

Name [1 , 10] 1,138,499 1,402,673 2,394,385 2,238,731 1,191,812 5,689,498 2,523,386 θ eeautdteaiiyo network a of ability the evaluated We . 487,401 265,214 771,231 536,408 104,103 639,014 154,908 101,163 ( 97,884 88,784 13,723 36,692 12,008 16,264 39,577 23,133 30,460 18,772 80,513 10,312 2% v 13 1,133 1,463 9,877 5,242

i # Nodes min( = ) ewrso hstp a naverage an had type this of networks CATEGORY C CATEGORY B CATEGORY A ftepplto.W lonoticed, also We population. the of AL I TABLE 11,814,826 25,632,368 28,135,774 15,837,602 11,799,764 5,980,886 4,686,962 5,554,838 2,732,272 9,319,130 4,386,166 9,038,680 6,429,972 5,526,132 3,337,294 4,186,390 153,530 367,662 236,978 351,384 186,878 240,058 396,100 728,962 654,324 667,966 10,902 51,946 28,968 95,188 5,486 # Edges d C 10.55 21.07 73.28 27.31 32.39 17.05 23.58 i in ATEGORIES 2.63 5.59 4.81 5.01 1.87 2.63 9.87 2.76 2.93 4.44 4.04 3.94 4.81 1.98 1.37 7.66 2.55 1.95 5.72 3.79 2.47 5.03 3.14 2.11 k ,

10 Avg. Degree ASU ASU ASU ASU ASU ASU ASU ASU ASU ASU URV SNAP UMICH SNAP SNAP SNAP UMICH UMICH SNAP UMICH SNAP ASU ASU SNAP ASU ASU SNAP ASU ASU ASU ASU ) itgr ras In trials. “integer” where Source SocMedia SocMedia E-Mail E-Mail Collab Collab Collab Collab Collab Collab Collab Collab Collab SocMedia SocMedia E-Mail SocMedia SocMedia SocMedia SocMedia SocMedia SocMedia SocMedia SocMedia SocMedia SocMedia SocMedia SocMedia SocMedia SocMedia SocMedia ,B, A,

k Type AND a kept was C. ae edfieti ocp omlybelow. formally undirected the concept For this triangle. define a making we - case, edge an share that pairs network-wide two be clustering ( to with coefficient First, structure. likely results community more characterizing our measures network promotes compared a network We into make insight given “tipped.” gain aspects a to structural measurement well this which use how we of Here, spreading. measurement a Structure as Community of Function a as Size Seed YouTube. on G. links friendship networks, from derived email networks the two of and two networks, seed collaboration initial average academic than an greater having model, size tipping the in network. diffusion e-mail an networks, and seed media networks, initial media social social purpose average extent specialty general an the two two to had included not They group than but A. greater - Category nodes size in initial those of of set small criteria. relatively that meeting networks nine had we four, a than of For less small. size particularly seed was a had nodes we seed three of of levels, set threshold threshold the lower of the size of the some under For was websites. size media seed social entire average the the by trials, adoption 2% ten to our lead In can population. individuals behavior of fraction a small having very a initially as - influence to susceptible 2 n i.2. Fig. . 2 snme fnds.Terltosi slna with linear is relationship The nodes). of number is nti eto,w iwterslso u ersi algorithm heuristic our of results the view we section, this In hamper to seemed that networks of consisted C Category a with influence to susceptible are B Category in Networks highly networks social as of thought be can A Category · 10 o aho these of each for

− Runtime (seconds, log scale) 50000 5000 16 0.05 m 500 0.5 50 1.20E+04 . 5 ln C n sdfie o oea h rcino neighbor of fraction the as node a for defined is ) s utm nscns(o scale, (log seconds in runtime vs. 2% m ln (n) (m= # of edges, n= # of nodes, log scale) 10% 1.20E+05 0 . 13 u esthan less but 5% hsctgr nlddalo the of all included category This . 11 ewrs l eeetatdfrom extracted were All networks. ftepplto.Frathreshold a For population. the of fteCtgr ewrswith networks A Category the of 1.20E+06 10% 1.20E+07 ebr nthis in Members . m snme fedges, of number is R 2

0 = . 9015 , p = 3.5 partitions compared to the density of edges between partitions. 3 A BlogCatalog1 We present a formal definition for an undirected network BlogCatalog2 below. 2.5 BlogCatalog3 Buzznet Definition 7 (Modularity [16]): Modularity, 2 Douban 1 P didj M = [1 − ]δ(ci, cj), where m is the Flickr 2m i,j∈V 2m 1.5 Flixster number of undirected edges, di is node degree, ci is the FourSquare community to which vi belongs and δ(x, y) = 1 if x = y and

1 Friendster Seed Size (PercentageofNodes) Last.Fm 0 otherwise. 0.5 LiveJournal The modularity of an optimal network partition can be used Livemocha 0 to measure the quality of its community structure. Though WikiTalk 1 2 3 4 5 6 7 8 9 10 modularity-maximization is NP-hard, the approximation algo- Threshold Value rithm of Blondel et al. [4] (a.k.a. the “Louvain algorithm”) 14 has been shown to produce near-optimal partitions.1 We call

12 B the modularity associated with this algorithm the “Louvain modularity.” Unlike the C, which describes local properties, 10

Delicious M is descriptive of the community level. For the 31 networks 8 2 Digg we considered, M and C appear uncorrelated (R = 0.0538, 6 EU E-mail p = 0.2092). Hyves 4 Yelp We plotted the initial seed set size (S) (from our algorithm Seed Size (PercentageofNodes) - averaged over the 10 threshold settings) as a function of 2 M and C (Figure 4a) and uncovered a correlation (planar fit, 0 R2 = 0.8666, p = 5.666·10−13, see Figure 4 A). The majority 1 2 3 4 5 6 7 8 9 10 Threshold Value of networks in Category C (less susceptible to spreading) were characterized by relatively large M and C (Category 60 CA-AstroPh C includes the top nine networks w.r.t. C and top five w.r.t. C CA-CondMat03 50 M). Hence, networks with dense, segregated, and close-knit CA-CondMat03a CA-CondMat05 communities (large M and C) suppress spreading. Likewise, 40 CA-CondMat99 those with low M and C tended to promote spreading. Also, CA-GrQC 30 we note that there were networks that promoted spreading with CA-HepPh CA-HepTh dense and segregated communities, yet were less clustered (i.e. 20

CA-NetSci Category A networks Friendster and LiveJournal both have Seed Size (PercentageofNodes) Enron E-Mail 10 M ≥ 0.65 and C ≤ 0.13). Further, some networks with a URV E-Mail YouTube1 moderately large clustering coefficient were also in Category 0 1 3 5 7 9 YouTube2 A (two networks extracted from BlogCatalog had C ≥ 0.46) Threshold Value but had a relatively less dense community structure (for those two networks M ≤ 0.33). We also studied the effects on spreading when the threshold Fig. 3. Threshold value (assigned as an integer in the interval [1, 10]) vs. size of initial seed set as returned by our algorithm in our three identified categories values would be assigned as a certain fraction of the node’s of networks (categories A-C are depicted in panels A-C respectively). Average in-degree. [11], [22] This results in heterogeneous θi’s for the seed sizes were under 2% for Categorty A, 2−10% for Category B and over nodes. We performed 12 trials for each network. Thresholds 10% for Category C. The relationship, in general, was linear for categories A and B and lograthimic for C. CA-NetSci had the largest Louvain Modularity for each trial were based on the product of in-degree and a and clustering coefficient of all the networks. This likely explains why that fraction in the interval [0.05, 0.60] (multiples of 0.05). The particular network seems to inhibit spreading. results (Figure 5 and Table II) were analogous to our integer tests. We also compared the averages over these trials with M and C and obtained similar results as with the other trials Definition 6 (Clustering Coefficient): Let r be the number (Figure 4 B). of edges between nodes with which vi has an edge and di be 2r the degree of v . The clustering coefficient, C = . V. RELATED WORK i i d (d − 1) i i Tipping models first became popular by the works of Intuitively, a node with high C tends to have more pairs i [10] and [20] where it was presented primarily in a social of friends that are also mutual friends. We use the average context. Since then, several variants have been introduced in clustering coefficient as a network-wide measure of this local the literature including the non-deterministic version of [12] property. (described later in this section) and a generalized version of Second, we consider modularity (M) defined by Newman and Girvan. [16]. For a partition of a network, M is a real 1Louvain modularity was computed using the implementation available number in [−1, 1] that measures the density of edges within from CRANS at http://perso.crans.org/aynaud/communities/. 12

BlogCatalog1 A 10 A

BlogCatalog2

BlogCatalog3 8

Buzznet Douban 6 Flickr Flixster

4 FourSquare

Friendster Seed Size (PercentageofNodes) Last.Fm 2 LiveJournal Livemocha 0 WikiTalk 0.05 0.15 0.25 0.35 0.45 0.55 Threshold Value 12

Average Seed Size (Percent of Nodes) of (Percent Size Seed Average 10

B

Louvain Modularity 8 Delicious 6 Digg B EU E-mail 4 Hyves

Yelp

Seed Size (PercentageofNodes)

2

0 0.06 0.16 0.26 0.36 0.46 0.56 Threshold Value (Fraction of Neighbors) 50

45 CA-AstroPh C CA-CondMat03 40 CA-CondMat03a 35 CA-CondMat05 30 CA-CondMat99

25 CA-GrQC Average Seed Size (Percent of Nodes) of (Percent Size Seed Average 20 CA-HepPh CA-HepTh 15 CA-NetSci

Louvain Modularity Seed Size (PercentageofNodes) 10 Enron E-Mail

5 URV E-Mail YouTube1 0 YouTube2 0.05 0.15 0.25 0.35 0.45 0.55 Threshold Value (Fraction of Neighbors)

Fig. 4. (A) Louvain modularity (M) and average clustering coefficient (C) vs. the average seed size (S). The planar fit depicted is S = 43.374 · M + Fig. 5. Threshold value (assigned as a fraction of node in-degree as a multiple 33.794 · C − 24.940 with R2 = 0.8666, p = 5.666 · 10−13. (B) Same of 0.05 in the interval [0.05, 0.60]) vs. size of initial seed set as returned by plot at (A) except the averages are over the 12 percentage-based threshold our algorithm in our three identified categories of networks (categories A-C values. The planar fit depicted is S = 18.105 · M + 17.257 · C − 10.388 are depicted in panels A-C respectively, categories are the same as in Figure with R2 = 0.816, p = 5.117 · 10−11. 1). Average seed sizes were under 5% for Categorty A, 1 − 7% for Category B and over 3% for Category C. In general, the relationship between threshold and initial seed size for networks in all categories was exponential.

[11]. In this paper we focused on the deterministic version. In [22], the authors look at deterministic tipping where each node Our algorithm is based on the idea of shell-decomposition is activated upon a percentage of neighbors being activated. that currently is prevalent in physics literature. In this process, Dryer and Roberts [9] introduce the MIN-SEED problem, which was introduced in [21], vertices (and their adjacent study its complexity, and describe several of its properties edges) are iteratively pruned from the network until a network w.r.t. certain special cases of graphs/networks. The hardness “core” is produced. In the most common case, for some value of approximation for this problem is described in [7]. The k, nodes whose degree is less than k are pruned (in order of work of [3] presents an algorithm for target-set selection degree) until no more nodes can be removed. This process whose complexity is determined by the tree-width of the was used to model the Internet in [5] and find key spreaders graph - though it provides no experiments or evidence that under the SIR epidemic model in [13]. More recently, a the algorithm can scale for large datasets. The recent work of “heterogeneous” version of decomposition was introduced in [18] prove a non-trivial upper bound on the smallest seed set. [2] - in which each node is pruned according to a certain o reyapoiainshm.I hsagrtm teach at the algorithm, this that In prove scheme. to allowed approximation the was which greedy - model, a of Kempe submodular for this was by maximizes adopters for of finding that on work number main expected classic size focused the The certain area al., adopters. a this et of of in seed number set Work the expected seed [15] of a [12], version finding [8], approxi- corresponding inspired problem. the has set for randomly schemes assigned mation are thresholds the Tipping Non-Deterministic on Notes A. this on for based values work certain that with in nodes parameter. studied of distribution is probability process a the and - parameter eas oeta natraevrino h oe where model the of version alternate an that note also We R GESO NLSSADNETWORK AND ANALYSIS EGRESSION YouTube2 YouTube1 URV E-Mail Enron E-Mail CA-NetSci CA-HepTh CA-HepPh CA-GrQc CA-CondMat99 CA-CondMat05 CA-CondMat03a CA-CondMat03 CA-AstroPh Yelp Hyves EU E-Mail Digg Delicious WikiTalk Livemocha LiveJournal Last.Fm Frienster FourSquare Flixster Flickr Douban Buzznet BlogCatalog3 BlogCatalog2 BlogCatalog1

EWRSIN NETWORKS Name 0.08 0.14 0.22 0.50 0.69 0.47 0.61 0.53 0.64 0.65 0.63 0.65 0.63 0.11 0.04 0.07 0.09 0.03 0.05 0.05 0.13 0.07 0.05 0.11 0.08 0.17 0.02 0.23 0.46 0.49 0.35 Clust. Coeff. 0.72 0.67 0.57 0.62 0.96 0.77 0.66 0.86 0.85 0.73 0.73 0.76 0.63 0.62 0.77 0.79 0.53 0.75 0.58 0.35 0.65 0.58 0.76 0.40 0.60 0.52 0.60 0.31 0.24 0.33 0.32 Louv. Mod. C CATEGORY A CATEGORY C CATEGORY B 16.06 11.21 13.17 18.15 50.69 30.63 21.35 35.09 34.71 25.59 26.52 27.80 14.31 AL II TABLE ATEGORIES 7.07 4.90 6.66 4.64 8.27 0.90 1.04 1.12 1.31 0.21 0.07 1.14 0.69 1.54 0.40 0.29 0.01 0.73 Int.-based Avg. Seed Size (%) 0.87 0.98 0.97 0.95 0.82 0.95 0.98 0.92 0.95 0.98 0.98 0.98 1.00 0.99 0.97 0.81 0.98 0.98 0.98 0.89 0.97 0.97 0.95 0.27 1.00 0.95 0.99 0.83 0.89 0.86 0.97 R^2 (linear fit for Int. tests) 7.9E-05 4.8E-08 1.5E-07 1.3E-06 3.0E-04 1.3E-06 1.8E-08 1.2E-05 1.3E-06 2.8E-08 2.3E-08 7.8E-08 6.3E-11 2.2E-10 1.5E-07 3.8E-04 2.0E-08 2.9E-08 8.0E-08 3.6E-05 1.4E-07 1.2E-07 1.2E-06 1.2E-01 1.1E-11 1.2E-06 3.2E-09 2.7E-04 3.9E-05 1.1E-04 1.4E-07 - IEMAUE O THE FOR MEASURES WIDE ,B, A, p-value (linear fit for Int. tests) 29.22 12.47 10.59 16.86 15.48 11.26 11.62 12.45 3.73 4.24 5.54 7.64 8.53 1.44 2.10 6.48 1.10 2.87 0.37 2.31 1.09 0.93 0.42 0.44 0.98 3.11 1.73 1.78 3.62 0.69 1.01 AND Deg.-based Avg. Seed Size (%) 0.51 0.89 0.89 0.84 0.93 0.96 0.90 0.90 0.79 0.86 0.87 0.90 0.93 0.89 0.91 0.92 0.93 0.91 0.91 0.92 0.89 0.70 0.79 0.95 0.73 0.86 0.82 0.90 0.79 0.79 0.86 C. R^2 (exp. fit for Deg. tests) 9.50E-03 5.06E-06 3.89E-06 2.76E-05 4.99E-07 1.42E-08 2.25E-06 2.15E-06 5.56E-05 2.99E-06 1.22E-04 1.19E-04 1.38E-05 1.2E-04 1.3E-05 9.8E-06 2.5E-06 5.5E-07 4.1E-06 1.2E-06 8.1E-07 3.0E-07 1.6E-06 1.2E-06 8.7E-07 3.4E-06 7.2E-04 1.2E-04 5.8E-08 3.8E-04 1.5E-05 p-value (exp. fit for Deg. tests) ia upr.Adtoal,w ol iet hn i opar- no (in thank to like would we Additionally, support. nical many for promise viral.” hold hold “go to to to wishing seems seem those algorithm for type our networks, this social of online results a expect As could investment. marketeer under viral network, a For adopted), social datasets. ( large online threshold be for FourSquare often majority even the - can with algorithm sizes en- example, fast relatively seed are our assume initial using results we small found our if values, [24] Even threshold marketers. [6], large viral networks, for social couraging real in occur 6 Figure See more. by or magnitude bounded details. of order be for considerably an were by can sets often seed - set smaller Our bound. seed Reichman’s to minimal results the of size the P that d devel- were algorithm. Reichman no the performs of of Reichman work evaluation that experimental note the also and We We independently. work case). oped special our (although that that approach to our note only with mirrors applies networks that algorithm algorithm Reicshman’s undirected constructive an of is yields proof case The and special [18] of values. the size threshold minimal for homogeneous the set on seed bound upper a an Set proves Seed that Initial Reichman the of Bound Upper an on Note result. B. approximation their by in leveraged al. property Sub- et the here. Kempe is hold returns) not counter- (diminishing does simple modularity sub-modularity A (de- that paper. original shows this the example in to presented apply model directly terministic) not does case deterministic approach. our than expensive more ersishv enpooe,btteetpclyrl nthe on rely an - typically geodesics these of but computation proposed, been have and heuristics simulation per (typically, runs nodes of be number to the complexity time is overall expensive the often causing is - operation simulation ratio open approximation the non-trivial Further, an unlikely. a be is still that that indicate is as [19] to Roth such guarantee Dan proved problems of a counting-complexity later such on a work was finding question, which Though - [8] hard. adopters expected of the determining number for guarantee approximation approximation an than The upon (less selected. obtained is the adopters guarantee in increase of greatest number the for expected allows which node the iteration, i ewudlk otakGye og(SA o i tech- his for (USMA) Wong Gaylen thank to like would We can it that indicates tipping on work empirical recent As network undirected Given Daniel by research of aware made were we recently, Very non- the for argument approximation the Additionally, i n h hehl au o l oe is nodes all for value threshold the and min { 1 , d i k +1 } n > x o u nee et,w oprdour compared we tests, integer our For . A 50% CKNOWLEDGMENTS I C VI. .I re oaodsmlto,various simulation, avoid to order In ). ficmn egbr previously neighbors incoming of ONCLUSION G O 0 hr ahnode each where . ( 63 n 3 ) fotml scontingent is optimal) of prto hc salso is which - operation k n 297 ecmnproves Reichman , O stenme of number the is ( x fl eunon return -fold · v n i 2 a degree has ) where # P x - REFERENCES 0.05 BlogCatalog1 A BlogCatalog2 [1] A. Arenas, “Network data sets,” 2012. [Online]. Available: http:

//deim.urv.cat/∼aarenas/data/welcome.htm 0.04 BlogCatalog3 Buzznet [2] G. J. Baxter, S. N. Dorogovtsev, A. V. Goltsev, and J. F. F. Mendes, “Het- k Douban erogeneous -core versus bootstrap percolation on complex networks,” 0.03 Phys. Rev. E Flickr , vol. 83, May 2011. [3] O. Ben-Zwi, D. Hermelin, D. Lokshtanov, and I. Newman, “Treewidth Flixster governs the complexity of target set selection,” Discrete Optimization, 0.02 FourSquare vol. 8, no. 1, pp. 87–96, 2011.

Seed Size OverUpper Bound Friendster [4] V. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding Last.Fm 0.01 of communities in large networks,” Journal of Statistical Mechanics: LiveJournal Theory and Experiment, vol. 2008, p. P10008, 2008. Livemocha 0 [5] S. Carmi, S. Havlin, S. Kirkpatrick, Y. Shavitt, and E. Shir, “From WikiTalk 1 2 3 4 5 6 7 8 9 10 the Cover: A model of Internet topology using k-shell decomposition,” Threshold Value PNAS, vol. 104, no. 27, pp. 11 150–11 154, 2007. 0.16 [6] D. Centola, “The Spread of Behavior in an Online Social Network Experiment,” Science, vol. 329, no. 5996, pp. 1194–1197, Sep. 2010. 0.14 B [7] N. Chen, “On the approximability of influence in social networks,” SIAM 0.12 J. Discret. Math., vol. 23, pp. 1400–1415, September 2009. [8] W. Chen, C. Wang, and Y. Wang, “Scalable influence maximization for 0.1 Delicious prevalent viral marketing in large-scale social networks,” in Proceedings Digg 0.08 of the 16th ACM SIGKDD international conference on Knowledge EU E-mail 0.06 discovery and data mining, ser. KDD ’10. New York, NY, USA: ACM, Hyves 2010, pp. 1029–1038.

Seed Size OverUpper Bound 0.04 Yelp [9] P. Dreyer and F. Roberts, “Irreversible -threshold processes: Graph- 0.02 theoretical threshold models of the spread of disease and of opinion,” 0 Discrete Applied Mathematics, vol. 157, no. 7, pp. 1615 – 1627, 2009. 1 2 3 4 5 6 7 8 9 10 [10] M. Granovetter, “Threshold models of collective behavior,” The Ameri- Threshold Value can Journal of Sociology, no. 6, pp. 1420–1443. 0.7 [11] M. Jackson and L. Yariv, “Diffusion on social networks,” in Economie Publique, vol. 16, no. 1, 2005, pp. 69–82. [12] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of 0.6 CA-AstroPh C influence through a social network,” in KDD ’03: Proceedings of the CA-CondMat03

ninth ACM SIGKDD international conference on Knowledge discovery 0.5 CA-CondMat03a and data mining. New York, NY, USA: ACM, 2003, pp. 137–146. CA-CondMat05 [13] M. Kitsak, L. K. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H. E. 0.4 CA-CondMat99 Stanley, and H. A. Makse, “Identification of influential spreaders in CA-GrQC complex networks,” Nat Phys, no. 11, pp. 888–893, Nov. 0.3 CA-HepPh [14] J. Leskovec, “Stanford network analysis project (snap),” 2012. [Online]. CA-HepTh.txt Available: http://snap.stanford.edu/index.html

Seed Size OverUpper Bound CA-NetSci 0.2 [15] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and Enron E-Mail N. Glance, “Cost-effective outbreak detection in networks,” in KDD 0.1 URV E-Mail ’07: Proceedings of the 13th ACM SIGKDD international conference on YouTube1 Knowledge discovery and data mining. New York, NY, USA: ACM, YouTube2 0 2007, pp. 420–429. 1 3 5 7 9 Threshold Value [16] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Phys. Rev. E, vol. 69, no. 2, p. 026113, Feb 2004. [17] M. Newman, “Network data,” 2011. [Online]. Available: http: //www-personal.umich.edu/∼mejn/netdata/ Fig. 6. Integer threshold values vs. the seed size divided by Reichman’s upper [18] D. Reichman, “New bounds for contagious sets,” Discrete Mathematics bound [18] the three categories of networks (categories A-C are depicted (in press), no. 0, pp. –, 2012. [Online]. Available: http://www. in panels A-C respectively). Note that in nearly every trial, our algorithm sciencedirect.com/science/article/pii/S0012365X12000301 produced an initial seed set significantly smaller than the bound - in many [19] D. Roth, “On the hardness of approximate reasoning,” Artificial Intelli- cases by an order of magnitude or more. gence, vol. 82, pp. 273–302, 1996. [20] T. C. Schelling, Micromotives and Macrobehavior. W.W. Norton and Co., 1978. [21] S. B. Seidman, “Network structure and minimum degree,” Social Networks, vol. 5, no. 3, pp. 269 – 287, 1983. [Online]. Available: http://www.sciencedirect.com/science/article/pii/037887338390028X ticular order) Albert-Laszl´ o´ Barabasi´ (NEU), Sameet Sreeni- [22] D. J. Watts and P. S. Dodds, “Influentials, networks, and public opinion formation,” Journal of Consumer Research, vol. 34, no. 4, pp. 441–458, vasan (RPI), Boleslaw Szymanski (RPI), John James (USMA), 2007. [Online]. Available: http://www.journals.uchicago.edu/doi/abs/10. and Chris Arney (USMA) for their discussions relating to this 1086/518527 work. Finally, we would also like to thank Megan Kearl, Javier [23] R. Zafarani and H. Liu, “Social computing data repository at ASU,” 2009. [Online]. Available: http://socialcomputing.asu.edu Ivan Parra, and Reza Zafarani of ASU for their help with [24] M. P. Zhang, L., “Two is a crowd: Optimal trend adoption in social some of the datasets. The authors are supported under by the networks,” in Proceedings of International Conference on Game Theory Army Research Office (project 2GDATXR042) and the Office for Networks (GameNets), 2011. of the Secretary of Defense (project F1AF262025G001). The opinions in this paper are those of the authors and do not nec- essarily reflect the opinions of the funders, the U.S. Military Academy, or the U.S. Army.