STAND: A Spatio-Temporal Algorithm for Network Di€usion Simulation Fangcao Xu Bruce Desmarais Donna Peuquet [email protected] [email protected] [email protected] Department of Geography, Depatment of Political Science, Department of Geography, Pennsylvania State University Pennsylvania State University Pennsylvania State University State College, Pennsylvania, USA State College, Pennsylvania, USA State College, Pennsylvania, USA

ABSTRACT which each node gets infected is usually not. Many algorithms Information, ideas, and diseases, or more generally, contagions, have been developed to simulate cascades for di‚erent contagions spread over time and space through individual transmissions via to drive a beŠer understanding of di‚usion process. However, most social networks, as well as through external sources. A detailed current algorithms only exploit time as an explanatory variable and picture of any di‚usion process can be achieved only when both don’t take the geographic space into consideration. a detailed network structure and individual di‚usion pathways Nevertheless, all di‚usion processes that involve physical agents, are obtained. Studying such di‚usion networks provides valuable locations or interactions, are embedded in geographic space. For insights to understand important actors in carrying and spread- examples, news events usually occur at speci€c locations. People ing contagions and to help predict occurrences of new infections. receiving and exchanging information or ideas via social media Most prior research focuses on modeling di‚usion process only also have a physical location. or physically being proximal to each in the temporal dimension. Œe advent of rich social, media and other. When we trace the di‚usion phenomena at a large scale geo-tagged data now allows us to study and model this di‚usion over a long time, it has also been demonstrated the spread and process in both temporal and spatial dimensions than previously adoption of many contagions are di‚erent from region to region possible. Nevertheless, how information, ideas or diseases are prop- with signi€cant local characteristics [6, 8]. Instead of geographic agated through the network as an overall spatiotemporal process becoming increasingly irrelevant, it’s more accurate to say is dicult to trace. Œis propagation is continuous over time and technology has made border less relevant. space, where individual transmissions occur at di‚erent rates via Œe motivation of this research is to provide a well-de€ned and complex, latent connections. mathematically solid approach for solving di‚usion simulation and To tackle this challenge, a probabilistic spatiotemporal algorithm modeling problems taking both spatial and temporal information for network di‚usion simulation (STAND) is developed based on the into consideration. To achieve this, we have developed a probabilis- survival model in this research. Both time and geographic distance tic algorithm called STAND,using a survival approach, to simulate are used as explanatory variables to simulate the di‚usion process spatiotemporal di‚usion cascades. Œis algorithm is applicable to over two di‚erent network structures. Œe aim is to provide a various types of network structures. It is intended that our research more detailed measure of how di‚erent contagions are transmiŠed can lead to new insights of how di‚erent contagions, including through various networks where nodes denote geographic locations topics, opinions,sentiments, or events mined from world wide web at a large scale. are propagated over space and time.

KEYWORDS 2 RELATED WORK spatiotemporal di‚usion; probabilistic function; survival model; Simulating the complete of spatiotemporal networks and network analysis multiple contagions spread over them is challenging for two reasons: First, in many cases we can only observe the timing information of when nodes get infected [11]. Second, even though large amounts arXiv:1904.05998v3 [cs.SI] 26 Jan 2020 1 INTRODUCTION of digital heterogeneous data are now available via the World Wide With the advent of rich social and other web-based media contain- Web, locational information is extremely sparse, unstructured and ing both temporal and locational information, tracing the di‚usion o‰en ambiguous [1, 4, 13]. Developing a ƒexible model for deriving of ideas, political opinions and even evidence of processes such as the network structure and cascade behaviors is key to uncovering the spread of disease at a large scale has become a focus of research the mechanism that governs spatiotemporal di‚usion processes in recent years. Di‚usion is the process by which contagions spread and their dynamics. over space and time via structures. Contagions Several di‚erential equation models (DE) and agent-based mod- start at speci€c nodes and spread from node to node over the edges els (AB) have been proposed to simulate the network structure of the network. Œese traces are called cascades. and spatiotemporal di‚usion cascades. DE models [9, 10] usually Previous network research [7] indicates that observing when aggregate agents into several states (e.g., infected or uninfected). individual nodes in the network get infected by various contagions Œe transitions among di‚erent states are modeled by di‚erential is easy, but determining the transmission pathways is dicult. In equations. In contrast, agent-based models [2, 5] have considered other words, the times at which nodes get infected are noted in the heterogeneity of agents. Œey simulate the di‚usion in real- the observational data but the sequence and parent node through istic networks by de€ning how the infection may occur through Table 1: Parameters in Di‚usion Networks

Parameters Def G(V, E) with node set V and edge set E c Contagion that spreads over G C Set of contagions c T Cascade propagation tree Tc (G) Set of all possible cascade trees of the contagion c ti Time when node i get infected by a contagion ∆i, j Time di‚erence between the node infection time tj − ti α Di‚usion speed scaling parameter Figure 1: Di‚erent cascade trees β Probability that contagion spreads over the edge of G individual-based interactions (e.g., infection can only occur when For example, Figure 1 shows three di‚erent trees with the same agents are at the same location). observed spatiotemporal di‚usion sequence where the nodes are ge- ( ) While, most existing simulation models of network di‚usion ographic locations, tW ashington, tPhiladelphia, tPittsburgh, tChicago c . are based on assumptions of agent homogeneity or how agents Œe infection time tW ashington < tPhiladelphia < tPittsburgh < interact with each other for spreading contagions, none have inte- tChicago. Œe di‚usion path from Washington to Chicago can ( → → → ) grated space and time together to account for di‚usion probability be W ashington Philadelphia Pittsburgh Chicago as ( → → individually for each node-pair over spatiotemporal networks at shown by the red line in the le‰ plot or W ashington Pittsburgh ) large-scale. We also intend to €ll this gap. We develop an exponen- Chicago in the middle plot. tial di‚usion probabilistic algorithm that integrates geographic dis- Œe di‚usion probability from Washington to Chicago of these tance, node infection time and transmission speed together based three paths de€ned by Equation 1 are same, demonstrated as below: upon NETINF [11]. Œe reasons to choose an exponential net- β β β β3 work di‚usion model as a starting point for STAND are: 1) it’s a Pc (i,k,p, j) = ∗ ∗ = eα ∆i,k α ∆k,p eα ∆p, j α(∆i,k +∆k,p +∆p, j ) continuous-time model without any assumptions about the indi- e e 3 3 vidual interactions, 2) it is amenable to inference/estimation for β β = = = Pc (i,p,q, j) di‚erent datasets, and 3) it has a ƒexible structure in terms of adding eα ∆i, j eα(∆i,p +∆p,q +∆q, j ) features like geo-distance. Other time distributions (e.g., log-normal, β3 = = Pc (i,k,q, j) Gamma, etc.) can also be considered in the future research as well. eα(∆i,k +∆k,q +∆q, j ) (2) 3 THE STAND ALGORITHM where i is Washington, k is Philadelphia, p is PiŠsburgh, q is New Table I de€nes the parameters used in many di‚usion network York and j is Chicago. models, and we build upon this notation. When a speci€c contagion It’s thus needed to extend the current di‚usion probabilistic spreads over the network, it will create a cascade by infecting nodes algorithms to include space as well as time to beŠer model the in a temporal sequence. Œe cascade of a contagion c is denoted by dynamic di‚usion over the spatiotemporal network. a directed tree T , consisting of a set of infected nodes with observed infection time: (i, t ) where i ∈ V . i c 3.2 Spatiotemporal Probabilistic Survival 3.1 Temporal Probabilistic Survival Likelihood Likelihood Œe likelihood of a contagion c spreading over an edge (i, j) in We assume the spatiotemporal di‚usion probability Pc (i, j) that a NETINF is calculated by a probabilistic exponential function and cascade c will spread from a node i to a node j decreases with both assumed to depend only on the time di‚erence ∆i, j in Equation (1). the spatial distance dij and time di‚erence ∆i, j in a certain way. Geographic distance is integrated into Equation (1) as a starting β point. Considering that the geographic distance may have direct Pc (i, j) ≈ Pc (∆i, j ) ∝ (1) eα ∆i, j inƒuence on the di‚usion speed and infection time interval, this Only the €rst time when a node gets infected will be counted for inƒuence is modeled by the survival analysis model. each contagion c and the contagion must di‚use forward in time. Supposing T is the observed infection time of a node, then the If there is a node j that never got infected by any contagion, then probability that this node would not yet be infected at any time t is denoted by the survival function, S(t) = P(T > t). tj = ∞ and ∆i, j ≈ ∞. Œus each infected node will only have one Œe probability that a given node would get infected at any time parent node who di‚uses the contagion c and the tj > ti . However, there may exist more than one possible di‚usion cas- t, is denoted by a cumulative probability density function (CDF), cade tree for the same set of infected nodes. Œese trees have the F(t) = P(T ≤ t) = 1 − S(t). Œe probability that a node will get same temporal sequence for the occurrences of infections among infected within a time interval (t, t+dt) is P(t ≤ T ≤ t+dt) = f (t)dt nodes but their di‚usion paths are di‚erent. Equation (1) which where f (t) is the infection rate over time, given by the probability only exploits the infection time di‚erence to explain the di‚usion density function (PDF): probability, cannot solve which di‚usion path has a higher proba- d d d f (t) = F(t) = (1 − S(t)) = − S(t) (3) bility. dt dt dt 2 Œe instantaneous infection rate h(t) at the given time t, is called as: t the hazard rate, that reƒects the likelihood that the uninfected node ∫ (λ1+λ1di, j ) − h(t)dt −θe ∆i, j will get infected within a very short time interval (t, t + dt), given Fi, j (∆i, j ) = 1 − e 0 = 1 − e this node hasn’t been infected before: θe(λ0+λ1di, j ) f , (∆ , ) = i j i j ( ) (9) Pr(t ≤ T < t + dt) f (t) eθe λ0+λ1di, j ∆i, j h(t) = lim = (4) × ( ) ( ) βθe(λ0+λ1di, j ) dt→0 dt S t S t ( ) ∝ ( ) Pc i, j β fi, j ∆i, j = ( eθe λ0+λ1di, j )∆i, j Œe survival likelihood S(t) and the cumulative infection prob- ability F(t) have the following relationships with the hazard rate Œe network inference problem can also be solved by developing h(t): a greedy algorithm to search for all edges that can maximize the probability of a network structure over which all contagions C can ∫ t ∫ t − h(t)dt − h(t)dt spread. S(t) = e 0 F(t) = 1 − e 0 (5)

∫ t 4 EXPERIMENTAL SIMULATION where the h(t)dt is the cumulative hazard that represents the 0 In order to simulate real-world di‚usion processes, and evaluate total risks of a node being infected up to the time point t. our algorithm for recovering the network structure on the synthetic In order to measure pairwise infection likelihood in Equation cascades, experiments are described in the following sections. (1), the probability density f (t) and the hazard rate h(t) are needed to calculate the probability that a node j gets infected by the node 4.1 Simulation of Ground-truth Network i within the time interval (ti , ti + dt). A higher value of density function f (t) means the node j is more likely to be infected by the One hundred geographic cities/counties with the largest population node i within a short time interval dt which results in a higher in the United States, excluding Hawaii and Alaska, are taken as hazard rate. spatial network nodes V . Œe original data is from the U.S. Geo- In this step, we use the a proportional hazard function suggested logical Survey collected in 2017. Some cities/counties having high by [12] to exploit the geographic properties in the form below: population and very close geographic locations were merged into larger metro areas (e.g., New York City or Greater Los Angeles). (λX +Y ) Œen we manually deleted a few small cities and add other cities in h(t;Y; λ;θ) = e h0(t;θ) (6) Montana, Wyoming, North and South Dakota to make the nodes more evenly distributed over the United States. where the h0(t;θ) is the baseline hazard function, and t is the ob- served infection time. X is a vector of explanatory variables associ- Considering that many real-world networks share some simi- ated with node properties and Y is a vector of explanatory variables lar properties, such as relatively small average path length, high associated with the spatial dependencies. λ and θ are parameter clustering coecient, or high reciprocity (i.e., the percentage of coecients for the explanatory variables X and the baseline hazard mutually connected node-pairs), two directed network structures function. are considered to generate the ground-truth network G. Œey are the random network structure [3] and the small-world network To simplify, the baseline hazard rate h0(t;θ) in Equation (6) is assumed to be constant and independent with time for any node- structure [14]. Characterizing how contagions spread over these two extreme network structures can help us beŠer understand more pairs. Œus the h0(t;θ) and its associated cumulative hazard rate complex network di‚usion phenomena. However, our proposed H0(t;θ) are: STAND algorithm can be applied to any kind of network structure. ¹ t Œe pseudocode of STAND algorithm is given as below: h0(t;θ) = θ H0(t;θ) = h0(t;θ)dt = θt (7) 0 4.1.1 Simulation of Random Network. Œe random network as- sumes the node in the network has a normal distribution Furthermore, only the spatial distance is implemented into the and edges are randomly picked with a constant probability with a di‚usion probability as a starting point. Œus the λX is a constant same node-edge density. Œis has been widely used for network modeled by the parameter λ and the explanatory variables Y, ac- simulations with the absence of topological information about the counting for spatial dependencies in Equation (6) is modeled by network structure. the distance between any node-pair i and j with a decay parameter. Building this type of network begins with n spatial isolated To unify the parameter coecients, the λ for node properties is nodes and selecting out of those any two to randomly place an edge reset as λ and the decay parameter for spatial dependencies (i.e., 0 between them until all edges of the required number have been distance here) is set as λ . Equation (6) can be reformaŠed as: 1 added to the network without repetition. An example is shown in Figure 2 (Le‰). h(t;Y; λ;θ) = θe(λ0+λ1di, j ) (8) 4.1.2 Simulation of the Small-world Network. Œe small-world where the di, j is the geographic distance between the node i and j. network as shown in Figure 2 (Right) assumes the node degree Œe cumulative probability F(t), the probability density f (t), and follows the power-law distribution, in which only a small propor- the infection likelihood de€ned in Equation (1) with the hazard rate tion of nodes has a large number of links while the majority has (λ +λ d ) θe 0 1 i, j over the infection time interval ∆i, j can be rewriŠen only few or no links. Œus it has a high clustering coecient and 3 Figure 2: Random network (Le ) and Small-world network (Right) with 100 nodes and 600 edges.

Algorithm 1 STAND Algorithm Table 2: Parameters of λ1 and tmax Require: coordinates of n nodes (lng,lat), number of edges m, net- −5 ∗ 10−5 −3 ∗ 10−5 −2 ∗ 10−5 −10−5 work G(V , E), parameter β,θ, λ0, λ1, cuto‚ di‚usion time tmax , −6 −6 −6 −6 λ1 −9 ∗ 10 −8 ∗ 10 −7 ∗ 10 −6 ∗ 10 number of nearest neighbours k. −5 ∗ 10−6 −4 ∗ 10−6 −3 ∗ 10−6 −2 ∗ 10−6 Step 1: Simulate Random & Small-world Networks 500 1000 2000 4000 V ← (lng,lat) t max 8000 ∞ Grandom ← E ← COMBINATION (V , 2,m) for all node v ∈ V do Gsw ← E ← (v, K N EAREST (v,k, Euclidean)) 4.2 Simulation of Cascades RECON N ECT (E ∈ G (V , E), 0.05) sw A set of cascadesC that spreads over the network G are generated by return Grandom,Gsw the probabilistic function de€ned in Equation (9). At this step, each Step 2: Simulate Cascades node is selected as the single starting infected node with assigned for all edges (i, j) ∈ Grandom |Gsw do an infection time 0, and then it starts to spread the contagion to di, j ← EuclideanDist((lngi ,lati ), (lngj ,latj )) every remaining node. Œe infection time of remaining nodes is (λ0+λ1di, j ) hrandom |hsw ← h(i, j) = θe Equation (8) calculated based on Equation (9). If there is no edge between any return hr andom, hsw two nodes, the hazard rate α is forced to be set as 0. Cr andom = SI MU LAT E(Gr andom, hr andom ) Equation (9) Di‚erent parameter values are selected for generating di‚usion ( ) Csw = SI MU LAT E Gsw, hsw cascades. Recalling that the hazard rate α = θe(λ0+λ1di, j ) in Equa- return simulated cascades C , C r andom sw tion (8) controls the infection rate that a contagion spreads over edges, there are three parameters to express this: θ (i.e., a base- line hazard rate), λ0 (i.e., inƒuence of node properties), and λ1 (i.e., parameter coecient of inƒuence of distance). Several parameters are set as €xed values in this step because a relative short average path length. Œe small-network model is they are constant scaling factors irrelevant to account for the inƒu- widely used in many application contexts, such as the outbreak of ence of explanatory variables (i.e., the distance). Œey are: 1) the disease or social activities. prior probability of an edge to successfully spread the contagion, Building this type of network starts with connecting each spatial β = 0.5. A higher value of β means most of edges are more likely node to its k nearest neighbors. Next, a small proportion of edges to spread the contagion, which results in large infections/cascade are randomly rewired by removing the original edge and recon- size within the network; 2) the baseline hazard rate, θ = 0.5; 3) the necting the starting node of the original edge to another node with inƒuence of node properties on the di‚usion hazard rate, λ0 = 1; a probability p (i.e., 0.05). and 4) the penalty parameter that accounts for the small probability Here, we have built 600 edges among these geographic locations that the contagion may spread via non-existing edges, ϵ = 10−4. for each network structure. In comparing le‰ and right €gures in Œe parameter coecient of the inƒuence of the distance λ1 and Figure 2, it is obvious that the average edge distance of the ran- the cuto‚ di‚usion time tmax used to decide to what extent the dom network is much longer than that of the small-world network, di‚usion should fail when the di‚usion time along an edge is too which has made the small-world network seem sparser over space long, are chosen from a list of values in Table II. than the random network even though they have the same num- Œese values of λ and tmax increase monotonically and can ber of edges. Œe small-world network also shows a signi€cant represent cascades ranging from small (i.e., contagions can almost clustering paŠern of cliques in contrast to the random network. not spread) to large (i.e., a large infection over all nodes). Œe 4 Table 3: Information about Simulated Cascades

node name time parent node v path cascade id 3: Chicago 0.00000 3 3 3 31: Milwaukee 1.71148 3 (3, 31) 3 56: Fort Wayne 3.61669 3 (3, 56) 3 82: Grand Rapids 4.03627 56 (3, 56, 82) 3

Figure 3: Histogram and QQ-plot of the distance between any two nodes in the Random and Small-world networks reason to set λ1 as negative is that in general, the hazard rate α should be small when the spatial distance is very large. In other words, the di‚usion time that a contagion spreads via a network edge is longer when two nodes are far away from each other if all other parameters are €xed. Œe α has a positive relationship with λ . When λ increases from the −5 ∗ 10−5 to −2 ∗ 10−6, the 1 1 Figure 4: Box plots of the number of infected nodes in 100 di‚usion/infection speed via network edges will be larger. simulated cascades for Random network and Small-world Œe cuto‚ di‚usion time t is used to set those nodes that are max network infected beyond the threshold as ∞. For example, if 1000 (time units) is selected as the cuto‚ value, then nodes with simulated infection 5.2 Infected Nodes in Simulated Cascades time more than 1000 should be set as ”won’t get infected”. It can For each combination of λ and t shown in Table II, 100 cas- model phenomena that the outbreaks of some contagions happen 1 max cades are generated by taking every node as a starting infected within a short period then fades very quickly due to geographic node and the information including the infected node name, the constraints while others can easily spread at a large scale and last infection time, the parent node from whom the contagion spreads, for a long time. the di‚usion pathway and the cascade id are reported. For exam- Œe analysis of the simulated cascades of the static random net- ple, the information in Table III suggests a di‚usion cascade that work and the static small-world network is given in the following starts from the node Chicago spreads via two pathways (Chicago → section using the parameters selected above. Milwaukee) and (Chicago → Fort W ayne → Grand Rapids). 5 ANALYSIS OF RESULTS Œe box-plots in Figure 4 give a clearer picture of the distribution of the number of infected nodes under di‚erent sets of λ1 and tmax . 5.1 Distribution of the Distance Œe le‰ series of plots in Figure 5 shows how the total number of Œe histograms and ‹antile-‹antile (QQ) plots in Figure 3 are infected nodes changes for di‚erent values of λ1, €xing the cuto‚ used to examine the statistical distribution of the edge distance. di‚usion time while the right plot in Figure 5 shows how the total Œe range of the statistical distance (km) between two nodes in number of infected nodes changes for di‚erent cuto‚ di‚usion time the random networks is much larger than that in the small-world tmax , €xing λ1. network. Both of them have a bell-shaped curve as shown in the Several conclusions about the inƒuence of λ1 and tmax and the top of Figure 3. However, the density curves in black suggest that di‚erence between the random network structure and the small- the distribution of the edge distance in the random network is world network structure are drawn from these statistical results: smoother and more widely distributed while the edge distance (1) If the cuto‚ di‚usion time is set as ∞ where all nodes can the small-world is dominated by more lower values. Œe reason get infected (i.e., the boŠom row of the le‰ plot in Figure accounting for this in the random network may be due to the limited 4, the value of λ1 has no inƒuence on the total number of number of geographic points, thus few observations can easily skew infected nodes; the distribution. For the small-world network, the reason is the (2) For other cuto‚ di‚usion times, it is clear that the geo- existence of a small portion of edges that connect di‚erent cliques. graphic distance has a great inƒuence on the total number 5 Figure 5: Number of infected nodes in 100 simulated cascades for the random networks and small-world networks

of infected nodes. Œe number of infected nodes monoton- When λ1 is very small, most nodes fail to spread contagions for ically increases to the peak when λ1 increases; di‚erent cuto‚ di‚usion time except for ∞. When λ1 increases, (3) When the absolute value of λ1 is low, the inƒuence of most nodes in the small-world network can get infected within a the distance on the di‚usion is small, which results in an short time due to an average small distance. So the inƒuence of the −5 approximate constant hazard rate for all location-pairs; cuto‚ di‚usion time is only obvious for λ1 around −2 ∗ 10 . Œe (4) When the absolute value of λ1 is high, more nodes will distribution of the number of infected nodes of the random network survive from contagions due to the geographic constraint is nearly normally distributed while that of the small-world network as €rst several columns of the right plot in Figure 4; is more skewed with a large proportion of outliers identi€ed. (5) Œe maximum number of infected nodes that the small- world network can reach is more than that in the random network because the small-world network is highly clus- 6 VISUALIZATION OF DIFFUSION CASCADES tered that makes the contagion spreads more easily; To beŠer understand the geographic distribution of the infected (6) Œe simulation process of the random network is more nodes in the di‚usion process, Figure 6 below shows simulated sensitive to the change of parameters and the number of cascades generated by taking Chicago as the starting infected infected nodes will increase gradually when λ1 and tmax node under di‚erent sets of λ1 and cuto‚ di‚usion time tmax for increase. Œis number in the small-world network will the random network and the small-world network. increase dramatically to the peak within a short range of When λ1 and cuto‚ di‚usion time tmax is very small, it is hard λ1 and is less sensitive to the change of the cuto‚ di‚usion for contagions to spread over the random network. However, conta- −5 time except when λ1 = −2 ∗ 10 . gions can still spread from Chicago to Milwaukee in the small-world −5 network when λ1 = −5 ∗ 10 and tmax = 500 due to nodes in 6 −5 −5 λ1 = −3 ∗ 10 , tmax = 1000 λ1 = −10 , tmax = 1000

−5 −5 λ1 = −10 , tmax = 4000 λ1 = −10 , tmax = ∞

−5 −5 λ1 = −5 ∗ 10 , tmax = 500 λ1 = −2 ∗ 10 , tmax = 500

−5 −5 λ1 = −2 ∗ 10 , tmax = 8000 λ1 = −2 ∗ 10 , tmax = ∞ Figure 6: Simulated cascades in the random and small-world networks under di‚erent sets of λ1 and tmax

7 the small-world network are highly connected with a small av- Also, a more complex interactive visualization tool and an R erage geographic distance. It is also intuitive to observe that the package that integrates the STAND algorithm should be developed contagion spreads like a chain via the connected cliques in the in the future to model the synthetic spatial network and simulated small-world network but more divergent over space in the random di‚usion cascades as an aid to understanding high-dimensional and network. When the cuto‚ di‚usion time is Inf , still the node 40 complex results. (i.e., Raleigh) in the random network and the node 41 (i.e., Miami) in the small-world network haven’t been infected. Œis is because DATA AVAILABILITY some nodes only have outward connections. Contagions cannot We have published our codes and data at the Spatiotemporal Net- spread to these nodes unless they are the starting infected node. work Di‚usion Codes and Data repository on the ScholarSphere. Œis repository collection includes: 1) 100 geographic locations we 7 DISCUSSION used for modeling the network nodes; 2) R codes for simulating First novel aspect of the STAND algorithm is that the geographic both the random network and small-world network as well as the distance is implemented into the hazard rate function, integrated di‚usion cascades spreading over these two network structures, over the time to quantify its inƒuence on the di‚usion probability using the proposed STAND algorithm; and 3) CSV €les of simulated di‚usion cascades under di‚erent sets of parameters λ1 and tmax in Equation (9). By varying its coecient λ1, we can model di‚erent real-world di‚usion phenomena, where the geographic aspects have di‚erent e‚ects. ACKNOWLEDGMENTS Œe second novel aspect is that it can be applied to any kind of We would also like to thank to Prof. Guido Cervone who provided network structures that are observed in the real-world di‚usion many suggestions and comments about this research. phenomena. Œe network structure is regarded as the input, and the hazard rate and di‚usion probability are calculated separately REFERENCES for each single edge dependent on the distance and the infection [1] Zack W Almquist. 2018. Large-scale spatial network models: An application to modeling information di‚usion through the homeless population of San time di‚erence between two nodes that this edge connects. Œis is Francisco. Environment and Planning B: Urban Analytics and City Science (2018). more ƒexible and accurate than previous research that assume a [2] Christopher L BarreŠ, Keith R Bisset, Stephen G Eubank, Xizhou Feng, and uniform global di‚usion rate over the network. Madhav V Marathe. 2008. EpiSimdemics: an ecient algorithm for simulating the spread of infectious disease over large realistic social networks. In SC’08: Œe third novel aspect of the STAND algorithm is that the node Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE, 1–12. properties and spatial dependencies (e.g., geographic distance) are [3] Paul Erdos and Alfred´ Renyi.´ 1960. On the evolution of random graphs. Publi- cation of the Mathematical Institute of the Hungarian Academy of Sciences 5, 1 considered to be vectors of explanatory variables X and Y in the (1960), 17–60. de€nition of the hazard rate in Equation (6). Our STAND algorithm [4] KM Ariful Kabir and Jun Tanimoto. 2019. Analysis of epidemic outbreaks in two- allows di‚erent node properties or spatial dependencies for di‚erent layer networks with di‚erent structures for information spreading and disease di‚usion. Communications in Nonlinear Science and Numerical Simulation (2019). node-pairs within the same network since the di‚usion probability [5] Elmar Kiesling, Markus Gunther,¨ Christian Stummer, and Lea M. Wakolbinger. is calculated separately for each edge. 2012. Agent-based simulation of innovation di‚usion: a review. Central European Journal of Operations Research (2012). [6] Jure Leskovec and Eric Horvitz. 2014. Geospatial structure of a planetary-scale 8 CONCLUSIONS AND FUTURE WORK . IEEE Transactions on Computational Social Systems 1, 3 (2014), 156–163. Œis research presents a network di‚usion algorithm, called STAND, [7] David Liben-Nowell and Jon Kleinberg. 2008. Tracing information ƒow on a based on the framework of probabilistic survival modeling, to simu- global scale using chain-leŠer data. Proceedings of the National Academy of Sciences 105, 12 (2008). late the di‚usion cascades and infer the underlying spatiotemporal [8] David Liben-Nowell, Jasmine Novak, Ravi Kumar, Prabhakar Raghavan, and network over which various contagions (e.g., political policies, so- Andrew Tomkins. 2005. Geographic routing in social networks. Proceedings of cial opinions, news, diseases, etc.) spread over both space and time. the National Academy of Sciences 102, 33 (2005). [9] Vijay Mahajan, Eitan Muller, and Yoram Wind. 2000. New-product di‚usion It can estimate the di‚usion speed and the individual-level infec- models. Vol. 11. Springer Science & Business Media. tion likelihood given any network structures as input. It can help [10] Hazhir Rahmandad and John Sterman. 2008. Heterogeneity and network struc- ture in the dynamics of di‚usion: Comparing agent-based and di‚erential equa- discover the di‚usion pathways and also predict new occurrences tion models. Management Science 54, 5 (2008), 998–1014. of infections over space and time as well as as well as trace back to [11] Manuel Gomez Rodriguez, David Balduzzi, and Bernhard Scholkopf.¨ 2011. Un- detect where and when the di‚usion began. covering the temporal dynamics of di‚usion networks. Še 28th International Conference on Machine Learning (2011). Real-world network di‚usion phenomena are usually more com- [12] Benjamin Taylor and Barry Rowlingson. 2017. spatsurv: An R package for plex than the simulated scenarios we have described here. Œe Bayesian inference with spatial survival models. Journal of Statistical So‡ware complexities can be summarized as follows: 1) the real-world net- 77, 4 (2017), 1–32. [13] Zheye Wang, Xinyue Ye, and Ming-Hsiang Tsou. 2016. Spatial, temporal, and works over which contagions spread are a mixture of di‚erent content analysis of TwiŠer for wild€re hazards. Natural Hazards 83, 1 (2016), levels of randomness and clustering rather than the simple random 523–540. [14] Duncan J WaŠs and Steven H Strogatz. 1998. Collective dynamics of small-world network or small-world network structure; and 2) the network networks. Nature 393, 6684 (1998). nodes are heterogeneous and their properties have high inƒuence on the di‚usion process. We will address these complexities as the next step in our research. Future work will collect large-scale real- world spatiotemporal di‚usion dataset and take various geographic characteristics into consideration, such as the local population or geographic scale. 8