arXiv:1510.04839v1 [cs.SI] 16 Oct 2015

Identifying spatial invasion of pandemics on metapopulation networks via anatomizing arrival history*

Jian-Bo Wang, Student Member, IEEE, Lin Wang, Member, IEEE, and Xiang Li, Senior Member, IEEE

Abstract: Spatial spread of infectious diseases among populations via the mobility of humans is highly stochastic and heterogeneous. Accurate forecast/mining of the spread process is often hard to achieve by using statistical or mechanical models. Here we propose a new reverse problem, which aims to identify the stochastic spatial spread process itself from observable information regarding the arrival history of infectious cases in each subpopulation. We solved the problem by developing an efficient optimization algorithm based on dynamical programming, which comprises three procedures: i, anatomizing the whole spread process among all subpopulations into disjoint componential patches; ii, inferring the most probable invasion pathways underlying each patch via maximum likelihood estimation; iii, recovering the whole process by assembling the invasion pathways in each patch iteratively, without burdens in parameter calibrations and computer simulations. Based on the entropy theory, we introduced an identifiability measure to assess the difficulty level that an invasion pathway can be identified. Results on both artificial and empirical metapopulation networks show the robust performance in identifying actual invasion pathways driving pandemic spread.

Index Terms: Spatial spread, infectious diseases, metapopulation, networks, process identification, identifiability.

This work was partially supported by the National Science Fund for Distinguished Young Scholars of China (No. 61425019), the National Natural Science Foundation (No. 61273223), and Shanghai SMEC-EDF Shuguang project (14SG03). J.B.W. and L.W. also acknowledge the partial support from Fudan University Excellent Doctoral Research Program (985 Program).
J.-B. Wang and X. Li are with the Adaptive Networks and Control Laboratory, Department of Electronic Engineering, and with the Research Center of Smart Networks and Systems, School of Information Science and Engineering, Fudan University, Shanghai 200433, China (e-mail: {jianbowang11, lix}@fudan.edu.cn).
L. Wang was with the Adaptive Networks and Control Laboratory, Department of Electronic Engineering, School of Information Science and Engineering, Fudan University, Shanghai 200433, China, and is with the School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China (email: sph.linwang@hku.hk).
* All correspondences should contact X. Li.

I. INTRODUCTION

The frequent outbreaks of emerging infectious diseases in recent decades lead to great social, economic, and public health burdens [1]-[3]. This trend is partially due to the urbanization process and, in particular, the establishment of long distance traffic networks, which facilitate the dissemination of pathogens accompanied with passengers [4], [5]. Real-world examples include the trans-national spread of SARS-CoV in 2003 [6], [7], the global outbreak of A (H1N1) pandemic flu in 2009 [8], avian influenza in southeast Asia [9], [10], the spark of Ebola infections in western countries in 2014 [11], and the recent potential threat of MERS [12].

During almost the same epoch, the theory of complex networks has been developed as a valuable tool for modeling the structure and dynamics of/on complex systems [13]-[16]. In the study of network epidemiology, networks are often used to describe the epidemic spreading via human contacts, where nodes represent persons and edges represent interpersonal contacts [17]-[22]. To characterize the spatial spread between different geo-locations, simple network models are generalized with the metapopulation framework, in which each node represents a population of individuals that reside at the same geo-region (e.g. a city), and each edge describes the traffic route that drives the individual mobility between populations [18], [19]. The networked metapopulation models have been applied to study real-world cases such as SARS [6], A (H1N1) pandemic flu [23], and Ebola [11], and can capture some key dynamic features, including basic epidemic curves, peak times, and epidemic sizes. Quantitative model results can be used to evaluate the effectiveness of control strategies [24]-[28], such as optimizing the vaccine allocation.

The numerical computing of large-scale metapopulation models is time-consuming, because of the requirement of high-level computer power. The model calibrations need high-resolution data of incidence cases, which may not be available or accurate during the early weeks of initial outbreaks [4]. Hence, continuous model training with data collected in real time is essential in achieving a reliable model prediction [29]. Generally, model results are the ensemble average over numerous simulation realizations, which aims to predict the mean and variance of epidemic curves, while in reality there is no such thing described by the average over realizations [30]. To extract more meaningful information from epidemic data generated by surveillance systems, recent studies in different fields (particularly in engineering) start paying attention to reverse problems, such as source detection and network reconstruction, which are briefly summarized here.

A. Related Works

The theory of system identification has been established in engineering fields, where it is usually used to infer system parameters. The use of system identification in epidemiology mainly focuses on inferring epidemic parameters, such as the transmission rate and generation time [31], which relies on constructing dynamical systems of ordinary differential equations. The methodology of system identification is not helpful in solving high-dimensional stochastic many-body systems, such as metapopulation models.

Source detection for rumor spreading on complex networks is becoming a popular topic, attracting extensive discussions in recent years. The target is to figure out the causality that can trigger the explosive dissemination across social networks, such as Facebook, Twitter, and Weibo. For example, using maximum likelihood estimators, D. Shah and T. Zaman [32] proposed the concept of rumor centrality that quantifies the role of nodes in network spreading. W. Luo et al. [33] designed new estimators to infer infection sources and regions in large networks. Z. Wang et al. [34]-[36] extended the scope by using multiple observations, which largely improves the detection accuracy. Another interesting topic is network inference, which engages in revealing the topology structure of a network from the hint underlying the dynamics on a network [37]. Some useful algorithms (e.g. NetInf) have been proposed in refs. [38]-[42]. Note that the algorithms for source detection and network inference are not feasible in identifying the spreading processes on metapopulation networks.

Using metapopulation network models, some heuristic measures have been proposed to understand the spatial spread of infectious diseases, which are most related to this work. Gautreau et al. [30] developed an approximation for the mean first arrival time between populations that have a direct connection, which can be used to construct the shortest path tree (SPT) that characterizes the average transmission pathways among populations. Brockmann et al. [4] proposed a measure called 'effective distance', which can also be used to build the SPT. Using a different method based on the maximum likelihood, Balcan et al. [43] generated the transmission pathways by extracting the minimum spanning tree from extensive Monte Carlo simulation results. Details about these measures will be given in Sec. IV, which compares the algorithmic performance.

B. Motivation

Current algorithms for inferring pandemic spatial spread generally make use of the topology features of metapopulation networks or extensive epidemic simulations. The resulting outcome is an ensemble average over all possible transmission pathways, which may fail in capturing those indeed transmitting the disease between populations, because of the high-level stochasticity and heterogeneity in the spreading process.

Good news comes from the development of modern sentinel and internet-based surveillance systems, which become increasingly popular in guiding public health control strategies. Such systems can or will provide high-resolution, location-specific data on human and poultry cases [44]. Human mobility data are also available from mass transportation systems or GPS-based mobile Apps [3]. Integrating these data often used in different fields, a natural reverse problem poses itself, which is the central interest of this work: Is it possible to design an efficient algorithm to identify or retrospect the stochastic pandemic spatial spread process among populations by linking epidemic data and models?

C. Our Contributions

Main contributions of this work are as follows:
i) A novel reverse problem of identifying the stochastic pandemic spatial spread process on metapopulation networks is proposed, which cannot be solved by existing techniques.
ii) An efficient algorithm based on dynamical programming is proposed to solve the problem, which comprises three procedures. Firstly, the whole spread process among all populations is decomposed into disjoint componential patches, which can be categorized into four types of invasion cases. Then, since two types of invasion cases contain hidden pathways, an optimization approach based on the maximum likelihood estimation is developed to infer the most probable invasion pathways underlying each patch. Finally, the whole spread process is recovered by assembling the invasion pathways of each patch chronologically, without burdens in parameter calibrations and computer simulations.
iii) An entropy-based measure called identifiability is introduced to depict the difficulty level with which an invasion case can be identified. Comparisons on both artificial and empirical networks show that our algorithm outperforms the existing methods in accuracy and robustness.

The remaining sections are organized as follows: Sec. II provides the preliminary definitions and problem formulation; Sec. III describes the procedures of our identification algorithm, and introduces the identifiability measure; Sec. IV performs computer experiments to compare the performance of algorithms; and Sec. V gives the conclusion and discussion.

II. PRELIMINARY AND PROBLEM FORMULATION

This section first elucidates the structure of the networked metapopulation model, and then provides the preliminary definitions and problem formulation.

A. Networked Metapopulation Model

In the networked metapopulation model, individuals are organized into social units such as counties and cities, defined as subpopulations, which are interconnected by traffic networks of transportation routes. The disease prevails in each subpopulation due to interpersonal contacts, and spreads between subpopulations via the mobility of infected persons. Fig. 1 illustrates the model structure.

Within each subpopulation, individuals mix homogeneously. This assumption is partially supported by recent empirical findings on intra-urban human mobility patterns [19], [45]-[48]. The intra-population epidemic dynamics are characterized by compartment models. Considering the wide applications in describing the spread of pathogens, species, rumors, emotion, behavior, crisis, etc. [32], [33], [35], [49], we used the susceptible-infected (SI) model in this work. Define $N_i$ as the population size of each subpopulation $i$, $I_i(t)$ as the number of infected cases in subpopulation $i$ at time $t$, and $\beta$ as the transmission rate at which an infected host infects a susceptible individual sharing the same location in unit time. As such, the risk of infection within subpopulation $i$ at time $t$ is characterized by $\lambda_i(t) = \beta I_i(t)/N_i$. Per unit time, the number of individuals newly infected in subpopulation $i$ can be calculated from a binomial distribution with probability $\lambda_i(t)$ and a number of trials equal to the number of susceptible persons $S_i(t)$.
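The within-subpopulation reaction step described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names `binomial_draw` and `si_infection_step`, the example population sizes, and the value of `beta` are assumptions made here, and $N_i = S_i(t) + I_i(t)$ is assumed because the SI model has only two compartments.

```python
import random

def binomial_draw(rng, trials, prob):
    """Number of successes among `trials` independent Bernoulli(prob) draws."""
    return sum(1 for _ in range(trials) if rng.random() < prob)

def si_infection_step(S, I, beta, rng):
    """One unit-time SI reaction step, applied to every subpopulation.

    S, I : lists with the susceptible / infected counts of each subpopulation.
    beta : transmission rate (hypothetical value chosen by the caller).
    Returns the updated (S, I); each N_i = S_i + I_i is conserved.
    """
    new_S, new_I = [], []
    for s, i in zip(S, I):
        n = s + i
        # Risk of infection lambda_i(t) = beta * I_i(t) / N_i, capped at 1.
        lam = min(beta * i / n, 1.0) if n > 0 else 0.0
        # New infections: binomial with S_i(t) trials and probability lambda_i(t).
        newly = binomial_draw(rng, s, lam)
        new_S.append(s - newly)
        new_I.append(i + newly)
    return new_S, new_I

# Two subpopulations: the first is seeded with 10 cases, the second is disease-free.
rng = random.Random(0)
S, I = [990, 500], [10, 0]
for _ in range(5):
    S, I = si_infection_step(S, I, beta=0.5, rng=rng)
```

Without mobility, the second subpopulation stays disease-free, since $\lambda_i(t) = 0$ whenever $I_i(t) = 0$; it can only be contaminated through the arrival of infected travelers.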

Fig. 1. Illustration of a networked metapopulation model, which comprises six subpopulations/patches that are coupled by the mobility of individuals. In each subpopulation, each individual can be in one of two disease statuses (i.e. susceptible and infectious), shown in different colors. Each individual can travel between connected subpopulations. (a) Networked metapopulation. (b) Two subpopulations.

The mobility of individuals among subpopulations is conceptually described by diffusion dynamics,

$$\partial_t X_i = \sum_{j \in \nu(i)} \big[ p_{ji} X_j(t) - p_{ij} X_i(t) \big],$$

where $X_i(t)$ is a placeholder for $S_i(t)$ or $I_i(t)$, $\nu(i)$ is the set of subpopulations directly connected with subpopulation $i$, and $p_{ij}$ is the per capita mobility rate from subpopulation $i$ to $j$, which equals the ratio between the daily flux of passengers from subpopulation $i$ to $j$ and the population size of the departure subpopulation $i$. The ensemble of mobility rates $0 \le p_{ij} < 1$ defines a transition matrix $P$, determined by the topology structure and traffic fluxes of the mobility network. The inter-population mobility of individuals is simulated with a binomial or multinomial process (Appendix A). More details on the modeling rules can be found in our review paper [19].

B. Basic Definitions

The epidemic arrival time (EAT) is the first arrival time of infectious hosts traveling to a susceptible subpopulation. At a given EAT, at least one unaffected (susceptible) subpopulation will be contaminated, characterizing the occurrence of invasion event(s). Herein, $S$ ($I$) denotes a susceptible (an infected) subpopulation.

For an invasion event, organizing the newly contaminated subpopulations (remaining unaffected prior to that invasion event) into set $\mathcal{S}$, and the infected subpopulations into set $\mathcal{I}$, we define the four types of invasion case (INC) as follows:
(i) $I \mapsto S$: $\mathcal{I}$ and $\mathcal{S}$ are each composed of a single subpopulation, which represents that a previously unaffected subpopulation is infected by the new arrival of infectious host(s) from its unique neighboring infected subpopulation.
(ii) $I \mapsto nS$ $(n > 1)$: In this case, $\mathcal{I}$ only consists of a single subpopulation, while $\mathcal{S}$ contains $n$ $(n > 1)$ subpopulations. This represents that $n$ previously unaffected subpopulations are contaminated due to the new arrival of infectious hosts from their common infected subpopulation in $\mathcal{I}$.
(iii) $mI \mapsto S$ $(m > 1)$: $\mathcal{S}$ only consists of a single subpopulation, and $\mathcal{I}$ contains $m$ $(m > 1)$ subpopulations. This means that the newly infected subpopulation in $\mathcal{S}$ is infected by the arrival of infected host(s) from $m$ potential upstream subpopulations in $\mathcal{I}$ through the invasion edges.
(iv) $mI \mapsto nS$ $(m, n > 1)$: In this case, $\mathcal{S}$ and $\mathcal{I}$ are both composed of no less than two subpopulations, and they constitute a connected subgraph. Each previously unaffected subpopulation in $\mathcal{S}$ is contaminated due to the simultaneous arrival of infected hosts from $m$ potential source subpopulations in $\mathcal{I}$. Each subpopulation in $\mathcal{I}$ may lead to the contamination of at least one but no more than $n$ neighboring downstream subpopulations in $\mathcal{S}$ through the invasion edges. Multiple edges between any pair of subpopulations are forbidden.

Figure 2(a)–(b) illustrate the two scenarios of $mI \mapsto S$ $(m > 1)$ and $mI \mapsto nS$ $(m, n > 1)$. A decomposition procedure of invasion partition (INP) is used to generate the components of invasion cases in each invasion event. The heuristic search algorithm that performs the invasion partition when an invasion event occurs is given in Algorithm 1.

C. Problem Formulation

Suppose that the spread starts at an infected subpopulation. It forms the invasion pathways when this source invades many susceptible subpopulations and the cascading invasion goes on. We record the infected individuals of each subpopulation per unit time. From the data, we know when a subpopulation is infected and how many infected individuals it contains, but we may not know which infected subpopulations invade this subpopulation if it has $m$ $(m \ge 2)$ infected neighbor subpopulations connected through the corresponding edge(s) (see Figure 2(a)) at that time step. The question of interest is how to identify the instantaneous spatial invasion process just according to the surveillance data. Herein, we know the network topology including the subpopulation sizes and travel flows, such as the city populations of airports and the travelers by airline in the real network of American airports.

Define an invasion pathway as the directed edges along which infected individuals invade susceptible subpopulations at an EAT. To identify it, we proceed with the following invasion pathways identification (IPI) algorithm:
i) Decompose the whole pathways into four types of invasion cases by the invasion partition at each EAT. Suppose the whole invasion pathways $T$ are anatomized into $\Lambda$ invasion cases of the four types. Let $\hat{a}_i$ denote the identified invasion pathways based on the surveillance data $\mathcal{G}$ of invasion case $i$ and the given graph $G$. According to the (stochastic) dynamic programming, we have the following equation to optimally solve this problem

$$T_{\text{whole invasion pathways}} = \operatorname{opt} \sum_{i=1}^{\Lambda} \hat{a}_i. \tag{1}$$

ii) For each invasion case, we first judge whether it has a unique set of invasion pathways or more than one potential set of invasion pathways. When an invasion case has more than one possible set of invasion pathways, each set is called a potential invasion pathway. If it has more than one potential invasion pathway, we estimate the true invasion pathways $a_i^*$, denoted by $\hat{a}_i$, based on the surveillance data $\mathcal{G}$ of that invasion case


and the given graph $G$. A potential pathway belonging to that invasion case is denoted by $a_i \in \mathcal{G}_{\mathrm{INC}_i}$. To make this estimation, we compute the likelihood of a potential invasion pathway $a_i$. With respect to this setting, the maximum likelihood (ML) estimator of $a_i^*$ with respect to the networked metapopulation model given by that invasion case maximizes the correct identification probability. Therefore, we define the ML estimator

$$\hat{a}_i = \arg\max_{a_i \in \mathcal{G}_{\mathrm{INC}_i}} P(a_i \mid \mathcal{G}_{\mathrm{INC}_i}), \tag{2}$$

where $P(a_i \mid \mathcal{G}_{\mathrm{INC}_i})$ is the likelihood of observing the potential pathway $a_i$ assuming it is the true pathway $a_i^*$. Thus we would like to evaluate $P(a_i \mid \mathcal{G}_{\mathrm{INC}_i})$ for all $a_i \in \mathcal{G}_{\mathrm{INC}_i}$ and then choose the maximal one.

Fig. 2. (a) An example of the $mI \mapsto S$ invasion case, in which $m$ infected subpopulations invade one susceptible subpopulation. The red patches denote the infected subpopulations, while the plain patch is the subpopulation that remains susceptible before time $t$ but will be contaminated during that time step due to the arrival of infectious cases from upstream infected subpopulations. (b) An example of the $mI \mapsto nS$ invasion case, in which $m$ infected subpopulations invade $n$ $(n \ge 2)$ susceptible subpopulations.

Algorithm 1 Invasion Partition
1: for an invasion event, collect all newly infected subpopulations as the initial set $\mathcal{S}$ and their previously infected neighbors as set $\mathcal{I}$;
2: start with an arbitrary element $S_i$ in set $\mathcal{S}$;
3: find all neighbors $\mathcal{I}^*$ of $S_i$ in set $\mathcal{I}$;
4: find the new neighbors $\mathcal{S}^*$ in set $\mathcal{S}$, if any;
5: find the new neighbors in set $\mathcal{I}$, if any;
6: repeat the above two steps until no new neighbors can be found in $\mathcal{S}$ and $\mathcal{I}$; this gives an invasion case consisting of $\mathcal{I}^*$ and $\mathcal{S}^*$; then update $\mathcal{S}$ and $\mathcal{I}$;
7: repeat steps 2-6 to get new invasion cases until there are no elements in $\mathcal{S}$.

III. IDENTIFICATION ALGORITHM TO INVASION PATHWAY

According to the above invasion partition algorithm, it is easy to identify the invasion pathways for the invasion case scenario $I \mapsto nS$ $(n \ge 1)$, since each newly infected subpopulation has only one invasion pathway, coming from its neighboring infected subpopulation. Thus our invasion pathways identification algorithm mainly deals with the other two kinds of invasion cases, $mI \mapsto S$ $(m > 1)$ and $mI \mapsto nS$ $(m, n > 1)$. To make the description clear, we restate that $I_i$ denotes subpopulation $i$ which is infected, and that the number of infected individuals of $I_i$ at time $t$ is denoted by $I_i(t)$.

As time evolves, infected hosts travel among subpopulations, inducing the spatial pandemic dispersal. For each invasion case, by analyzing the variation in the number of infected hosts in each subpopulation $i$, we define three levels of subpopulation observability to reflect the information held for the inference of the relevant invasion pathway:
(i) Observable Subpopulation: Subpopulation $i$ is observable during an invasion case, given the occurrence of one of the three most evident (subpopulation's) status transitions. The first refers to the transition $S_i \to I_i$, accounting that the previously unaffected subpopulation $i$ is contaminated during that invasion case due to the arrival of infected hosts. The second concerns the transition $I_i \to S_i$, in which the previously infected subpopulation $i$ becomes susceptible again during that invasion case, since the infected hosts do not trigger a local outbreak and leave $i$. In the third transition $S_i \to S_i$, despite having infected subpopulations in the neighborhood, subpopulation $i$ remains unaffected during that invasion case due to no arrival of infected hosts. Figure 3(a) illustrates such observable transitions.
(ii) Partially Observable Subpopulation: Subpopulation $i$ is partially observable during an invasion case occurring at time $t$, if its number of infected hosts is decreased, i.e., $I_i(t) < I_i(t-1)$ and $I_i(t) > 0$, which implies that at least $\Delta I_i(t) = |I_i(t) - I_i(t-1)|$ infected hosts leave $i$ during that invasion case. It is impossible to distinguish their mobility destinations unless the invasion case $I \mapsto S$ or $I \mapsto nS$ occurs. Fig. 3(b) illustrates the partially observable subpopulation.
(iii) Unobservable Subpopulation: Subpopulation $i$ is unobservable during an INC occurring at time $t$, if its number of infected hosts has not decreased, i.e., $I_i(t) \ge I_i(t-1)$, considering the difficulty in judging whether there are infected hosts leaving subpopulation $i$ during that invasion case. See Fig. 3(c) for an illustration.

We further categorize the edges emanating from each infected subpopulation in set $\mathcal{I}$ into four types, i.e., invasion edges, observable edges, partially observable edges and unobservable edges:
(i) Invasion Edges: In an invasion case, invasion edges represent each route emanating from subpopulation $i$ in $\mathcal{I}$ to subpopulation $j$ in $\mathcal{S}$. They are considered as a unique category, because invasion edges contain all invasion pathways (an invasion pathway must be an invasion edge, but an invasion edge is not necessarily an invasion pathway).
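Algorithm 1 can be read as extracting the connected components of the bipartite graph formed by the newly infected subpopulations and their previously infected neighbors. The Python sketch below follows that reading; it is an illustration under stated assumptions rather than the authors' code. The function names `invasion_partition` and `case_type`, the dictionary-of-sets graph representation, the toy network, and the deterministic choice of the "arbitrary" starting element are all assumptions introduced here.

```python
from collections import deque

def invasion_partition(adjacency, newly_infected, previously_infected):
    """Split one invasion event into invasion cases (steps 1-7 of Algorithm 1).

    adjacency           : dict mapping a subpopulation to the set of its neighbours.
    newly_infected      : subpopulations contaminated at this EAT (set S).
    previously_infected : subpopulations infected before this EAT.
    Returns a list of (I_star, S_star) pairs, each given as sorted lists.
    """
    newly, sources = set(newly_infected), set(previously_infected)
    remaining = set(newly)
    cases = []
    while remaining:                          # step 7: until S is exhausted
        seed = min(remaining)                 # step 2: "arbitrary" element, made deterministic
        s_star, i_star = {seed}, set()
        queue = deque([seed])
        while queue:                          # steps 3-6: alternate between S and I
            node = queue.popleft()
            neighbours = adjacency.get(node, set())
            if node in s_star:                # infected neighbours are candidate sources
                fresh = (neighbours & sources) - i_star
                i_star |= fresh
            else:                             # other destinations reached by this source
                fresh = (neighbours & newly) - s_star
                s_star |= fresh
            queue.extend(fresh)
        cases.append((sorted(i_star), sorted(s_star)))
        remaining -= s_star
    return cases

def case_type(i_star, s_star):
    """Label an invasion case with one of the four INC types."""
    m, n = len(i_star), len(s_star)
    if m == 0:
        raise ValueError("a newly infected subpopulation needs an infected neighbour")
    if m == 1:
        return "I->S" if n == 1 else "I->nS"
    return "mI->S" if n == 1 else "mI->nS"

# Toy event: sources A, B, C; newly infected x, y, z.
adjacency = {"A": {"x"}, "B": {"x", "y"}, "C": {"z"},
             "x": {"A", "B"}, "y": {"B"}, "z": {"C"}}
cases = invasion_partition(adjacency, {"x", "y", "z"}, {"A", "B", "C"})
```

In this toy event, `x` and `y` share the source `B` and therefore fall into the same $mI \mapsto nS$ case together with `A`, whereas `z` forms a separate $I \mapsto S$ case whose pathway is unique.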

[Fig. 3: subpopulation observability. Panel (b): partially observable, $I_i(t-1) > I_i(t)$.]

A. The Case of $mI \mapsto S$ $(m > 1)$

As shown in Figure 2(a), a typical INC $mI \mapsto S$ $(m > 1)$ is composed of two sets of subpopulations, i.e., the previously infected subpopulations $\mathcal{I} = \{I_1, I_2, \ldots, I_m\}$ and the previously unaffected subpopulation $\mathcal{S} = \{S_1\}$. Suppose that subpopulation $S_1$ is contaminated at time $t$ due to the appearance of $H$ infected hosts ($H$ is a positive integer) that come from the potential sources in $\mathcal{I}$. If the actual number of infected hosts from subpopulation $I_i$ is $H_{i1}$, $i \in \mathcal{I}$, we have

$$\sum_{i=1}^{m} H_{i1} = H, \tag{3}$$

with the conditions $0 \le H_{i1} \le H$ and $H_{i1} \le I_i(t-1)$.

(i) Accurate Identification of Invasion Pathway
Given a few satisfied prerequisites, Eq. (3) can have a unique solution, which implies that the invasion pathways of that invasion case can be identified accurately. Theorem 1 elucidates this scenario.

Theorem 1 (Accurate Identification of Invasion Pathway): With the following conditions: 1) among the $m$ possible sources illustrated in set $\mathcal{I}$, there are only $m'$ $(m' \le m)$ partially observable subpopulations $\mathcal{I}'$, whose neighboring subpopulations (excluding the invasion destination $S_1$) only experience the transition $S$ to $S$ or $I$ to $S$ at that EAT; 2) $\sum_{i \in \mathcal{I}'} [I_i(t-1) - I_i(t)] = H$, the invasion pathway of an invasion case $mI \mapsto S$ $(m > 1)$ can be identified accurately.

[…] 1) and $mI \mapsto S$ $(m, n > 1)$.

We define the solution space $\Phi$ of Eq. (3) of the invasion case $mI \mapsto S$ $(m > 1)$, which is subject to two conditions: (i) $\sum_{i=1}^{m} H_{i1} = H$; (ii) $\forall H_{i1}$, $H_{i1} \le I_i(t-1)$. The second condition is obvious, since the number of infected travelers departing from the source $I_i$ cannot exceed $I_i(t-1)$. Let us assume that $\Phi$ contains $M$ solutions, and a typical solution is formulated as $\sigma_j = \{H_{i1}^{(j)}, i = [1, \ldots, m]\}$. Obviously, each solution $\sigma_j$ corresponds to a potential invasion pathway $a_j$.

Through the invasion case $mI \mapsto S$ $(m > 1)$, the observed event $E$ shows that the destination $S_1$ is contaminated due to the arrival of $H$ infected hosts in total from the potential sources $I_i$, $i = [1, \ldots, m]$. With this posterior information, we first measure the likelihood of each possible solution $\sigma_j$, which corresponds to the reasoning event that for each source $I_i$, $i \in [1, m]$, $H_{i1}$ infected hosts are transferred to $S_1$. It is evident that $\forall j$, $P(E_{mIS} \mid \sigma_j) = 1$, since $\sigma_j$ will lead to the occurrence of event $E_{mIS}$, which corresponds to $\mathcal{G}_{mIS}$.

According to Bayes' theorem, the likelihood of the solution

$\sigma_j$ is characterized by

$$
\begin{aligned}
P(\sigma_j \mid E_{mIS}) &= \frac{P(E_{mIS} \mid \sigma_j)\, P(\sigma_j)}{P(E_{mIS})} \\
&= \frac{P(E_{mIS} \mid \sigma_j)\, P(\sigma_j)}{\sum_{i=1}^{M} P(E_{mIS} \mid \sigma_i)\, P(\sigma_i)} \\
&= \frac{P(\sigma_j)}{\sum_{i=1}^{M} P(\sigma_i)} \\
&= \frac{\prod_{k=1}^{m} \Omega\big(H_{k1}^{(j)}\big)}{\sum_{i=1}^{M} \prod_{k=1}^{m} \Omega\big(H_{k1}^{(i)}\big)},
\end{aligned}
\tag{4}
$$

where $M$ represents the number of potential solutions $\sigma_j$, and the last item $\Omega(H_{k1}^{(i)})$ represents the mobility likelihood transferring estimator of infected subpopulation $I_k$ in $\mathcal{I}$.

One linchpin of our algorithm in handling the scenario $mI \mapsto S$ $(m > 1)$ is to estimate the probability of transferring infected hosts from each infected subpopulation $I_i$, $i \in \mathcal{I}$, to the destination subpopulation $S_1$. Based on the independence between the intra-subpopulation epidemic reactions and the inter-subpopulation personal diffusion, we introduce a transferring estimator to analyze the individual mobility of each source $I_i$, which is particularly useful if there are partially observable and unobservable edges emanating from the focal infected subpopulation.

The specific formalisms of the transferring estimator are defined according to the three types of infected subpopulation $I_i$ that set $\mathcal{I}$ consists of, namely the unobservable subpopulation, the partially observable subpopulation, and the observable subpopulation with the transition of $I$ to $S$.

Unobservable Subpopulation $I_i$: Due to the occurrence of $mI \mapsto S$, among all edges emanating from subpopulation $I_i$, there is only one invasion edge in that invasion case, labeled as $k_i$, along which the traveling rate is $p_{k_i}$ and $H_{i1}$ infected hosts are transferred to the destination $S_1$. Assume that there are $\ell_i$ $(1 \le \ell_i < k_i)$ unobservable and partially observable edges, labeled as $1, 2, \ldots, \ell_i$, respectively. Along each unobservable or partially observable edge, the traveling rate is $p_\ell$, $\ell \in [1, \ell_i]$, and $x_\ell$ infected hosts leave $I_i$. Accordingly, in total $\eta_i = \sum_\ell x_\ell$ infected hosts leave $I_i$ through the unobservable and partially observable edges. There remain $k_i - \ell_i - 1$ observable edges, labeled as $\ell_i + 1, \ldots, k_i - 1$, respectively. Along each observable edge, the traveling rate is $p_\aleph$, $\aleph \in [\ell_i + 1, k_i - 1]$, and $x_\aleph$ infected hosts leave $I_i$. With probability $p_i = 1 - p_{k_i} - \sum_\ell p_\ell - \sum_\aleph p_\aleph$, an infected host keeps staying at source $I_i$.

Since the infected hosts transferred by unobservable and partially observable edges are untraceable, it is impossible to reveal accurately the actual invasion pathways resulting in that invasion case. Fortunately, the traveling rates on each edge are available by collecting and analyzing the human mobility transportation networks. Therefore, the mobility multinomial distribution (Appendix A Eq. (31)) can be used to obtain the conditional probability that $H_{i1}$ infected hosts are transferred from infected source $I_i$ to destination $S_1$, which is measured by the following transferring estimator:

$$\Omega_u(H_{i1}) = P\big(H_{i1}, p_{k_i};\, I_i(t-1);\, x_\ell, p_\ell, \ell = [1, \ldots, \ell_i];\, x_\aleph, p_\aleph, \aleph = [\ell_i + 1, \ldots, k_i - 1];\, x_i, p_i\big), \tag{5}$$

where $x_i$ accounts for the number of infected hosts that do not leave source $I_i$ after the invasion case. Here, the observed number of infected persons in source $I_i$ before the invasion case, i.e., $I_i(t-1)$, is used for the estimation, since the probability that a newly infected host also experiences the mobility process is very low. Considering the conservation of infected hosts, and the implication of observable edges (i.e., $x_\aleph = 0$, $\forall \aleph$), we have $I_i(t-1) = H_{i1} + \sum_\ell x_\ell + x_i = H_{i1} + \eta_i + x_i$. Taking into account all scenarios that fulfill the condition $\eta'_i = \eta_i + x_i = I_i(t-1) - H_{i1}$, the transferring estimator is simplified by the marginal distribution of Eq. (5), i.e.,

$$\sum_{\eta'_i = I_i(t-1) - H_{i1}} P\big(x_i, x_\ell, \ell = [1, \ldots, \ell_i];\, H_{i1}\big) = \sum_{\eta'_i = I_i(t-1) - H_{i1}} \frac{I_i(t-1)!}{H_{i1}!\, \prod_\ell x_\ell!\; x_i!}\; p_{k_i}^{H_{i1}} \prod_\ell p_\ell^{x_\ell}\; p_i^{x_i}. \tag{6}$$

With independence, the transferring estimator becomes

$$\Omega_u = \frac{I_i(t-1)!}{H_{i1}!\, \eta'_i!}\; p_{k_i}^{H_{i1}} \Big[ \sum_\ell p_\ell + p_i \Big]^{\eta'_i}. \tag{7}$$

Observable Subpopulation $I_i$ ($I_i$ to $S_i$): If the infected hosts of source $I_i$ all leave between $t_{EAT} - 1$ and $t_{EAT}$, the subpopulation $I_i$ is observable at that invasion case. In this case, we have additional posterior messages, i.e., $I_i(t) = 0$, $\Delta I_i(t) = I_i(t-1)$. Here, the number of infected hosts transferred to $S_1$ cannot exceed the total number of infected travelers departing from source $I_i$, i.e., $H_{i1} \le \Delta I_i(t)$. In this regard, the probability that $H_{i1}$ infected hosts arrive in destination $S_1$ is measured by the following transferring estimator

$$\binom{\Delta I_i(t)}{H_{i1}} \Big[ \frac{p_{k_i}}{\sum_\ell p_\ell + p_{k_i}} \Big]^{H_{i1}} \Big[ 1 - \frac{p_{k_i}}{\sum_\ell p_\ell + p_{k_i}} \Big]^{\Delta I_i(t) - H_{i1}}. \tag{8}$$

Partially Observable Subpopulation $I_i$: If source $I_i$ is partially observable, we can develop the inference algorithm with an additional posterior message, which reveals that at least $\Delta I_i(t) = I_i(t-1) - I_i(t) \ge 1$ infected hosts leave the focal source $I_i$ after the occurrence of that invasion case. In order to measure the conditional probability that $H_{i1}$ infected hosts are transferred from source $I_i$ to destination $S_1$, we inspect all possible scenarios in detail, as follows:

Type 1: $\Delta I_i(t) \le H_{i1}$, i.e., the observed reduction in the number of infected hosts $\Delta I_i(t)$ does not exceed the number transferred from $I_i$ to $S_1$. Here, we consider all cases that are in accordance with this condition.

If all $\Delta I_i(t)$ confirmed infected travelers are transferred from $I_i$ to $S_1$, the transferring estimator can be used to quantify the conditional probability that the remaining $H_{i1} - \Delta I_i(t)$

infected hosts concerned also visit $S_1$, i.e.,

$$\Omega_p\big(H_{i1} - \Delta I_i(t) \mid I_i(t-1) - \Delta I_i(t)\big) = \frac{[I_i(t-1) - \Delta I_i(t)]!}{[H_{i1} - \Delta I_i(t)]!\, \eta'_i!}\; p_{k_i}^{[H_{i1} - \Delta I_i(t)]} \Big[ \sum_\ell p_\ell + p_i \Big]^{\eta'_i} \cdot \Big[ \frac{p_{k_i}}{\sum_\ell p_\ell + p_{k_i}} \Big]^{\Delta I_i(t)}, \tag{9}$$

where $p_{k_i} / (\sum_\ell p_\ell + p_{k_i})$ represents the relative traveling rate at which any person from source $I_i$ is transferred to $S_1$; thus the last item on the right-hand side (r.h.s.) accounts for the probability that the $\Delta I_i(t)$ confirmed infected travelers all visit $S_1$.

If only a fraction of the $\Delta I_i(t)$ confirmed infected travelers are transferred from $I_i$ to $S_1$, the situation is more complicated. Assume that $\Delta I_i(t) - \varphi$ $(1 \le \varphi < \Delta I_i(t))$ confirmed travelers successfully come to $S_1$; the corresponding transferring estimator becomes

$$\Omega_p\big(H_{i1} - \Delta I_i(t) + \varphi \mid I_i(t-1) - \Delta I_i(t)\big) = \binom{\Delta I_i(t)}{\varphi} \Big[ \frac{p_{k_i}}{\sum_\ell p_\ell + p_{k_i}} \Big]^{\Delta I_i(t) - \varphi} \Big[ 1 - \frac{p_{k_i}}{\sum_\ell p_\ell + p_{k_i}} \Big]^{\varphi} \cdot \frac{[I_i(t-1) - \Delta I_i(t)]!}{[H_{i1} - \Delta I_i(t) + \varphi]!\, \eta'_i!}\; p_{k_i}^{[H_{i1} - \Delta I_i(t) + \varphi]} \Big[ \sum_\ell p_\ell + p_i \Big]^{\eta'_i}, \tag{10}$$

where the first item on the r.h.s. accounts for the probability that $\Delta I_i(t) - \varphi$ confirmed travelers visit $S_1$.

If none of the $\Delta I_i(t)$ confirmed infected travelers from $I_i$ are transferred to $S_1$, the conditional probability that, among the remaining $I_i(t-1) - \Delta I_i(t)$ infected hosts, $H_{i1}$ infected travelers are transferred to $S_1$ is measured by the following transferring estimator

$$\Omega_p\big(H_{i1} \mid I_i(t-1) - \Delta I_i(t)\big) = \frac{[I_i(t-1) - \Delta I_i(t)]!}{H_{i1}!\, \eta'_i!}\; p_{k_i}^{H_{i1}} \Big[ \sum_\ell p_\ell + p_i \Big]^{\eta'_i} \cdot \Big[ 1 - \frac{p_{k_i}}{\sum_\ell p_\ell + p_{k_i}} \Big]^{\Delta I_i(t)}, \tag{11}$$

where the last item on the r.h.s. accounts for the probability that the $\Delta I_i(t)$ confirmed infected travelers all do not visit $S_1$.

Taking into account all the above cases, the probability that $H_{i1}$ infected hosts arrive at destination $S_1$ is measured by the following transferring estimator

$$\Omega_p(H_{i1}) = \sum_{\varphi = 0}^{\Delta I_i(t)} \binom{\Delta I_i(t)}{\varphi} \Big[ \frac{p_{k_i}}{\sum_\ell p_\ell + p_{k_i}} \Big]^{\Delta I_i(t) - \varphi} \Big[ 1 - \frac{p_{k_i}}{\sum_\ell p_\ell + p_{k_i}} \Big]^{\varphi} \cdot \frac{[I_i(t-1) - \Delta I_i(t)]!}{[H_{i1} - \Delta I_i(t) + \varphi]!\, \eta'_i!}\; p_{k_i}^{[H_{i1} - \Delta I_i(t) + \varphi]} \Big[ \sum_\ell p_\ell + p_i \Big]^{\eta'_i}. \tag{12}$$

Type 2: $\Delta I_i(t) > H_{i1}$, i.e., the observed reduction in the number of infected hosts $\Delta I_i(t)$ exceeds the number of infected hosts transferred to $S_1$. Similar to the above analysis, we develop the transferring estimator by considering all possible cases that are in accordance with this condition.

If the $H_{i1}$ infected hosts transferred to $S_1$ are all from the observable travelers $\Delta I_i(t)$, the transferring estimator becomes

$$\binom{\Delta I_i(t)}{H_{i1}} \Big[ \frac{p_{k_i}}{\sum_\ell p_\ell + p_{k_i}} \Big]^{H_{i1}} \Big[ 1 - \frac{p_{k_i}}{\sum_\ell p_\ell + p_{k_i}} \Big]^{\Delta I_i(t) - H_{i1}} \cdot \Big[ \sum_\ell p_\ell + p_i \Big]^{I_i(t) - \Delta I_i(t)}, \tag{13}$$

where the last item accounts for the constraint that the remaining $I_i(t) - \Delta I_i(t)$ infected hosts will not be transferred to $S_1$.

Similar to Type 1, there are two other cases: only a fraction of the $H_{i1}$ infected hosts transferred to $S_1$ are from the observable travelers $\Delta I_i(t)$, or none of the $H_{i1}$ infected hosts transferred to $S_1$ are from the observable travelers $\Delta I_i(t)$. The transferring estimators for these cases can be derived in the same way.

Taking into account all these cases, the probability that $H_{i1}$ infected hosts move to the destination subpopulation $S_1$ is measured by the following transferring estimator

$$\sum_{\Delta H_{i1} = 0}^{H_{i1}} \binom{\Delta I_i(t)}{\Delta H_{i1}} \Big[ \frac{p_{k_i}}{\sum_\ell p_\ell + p_{k_i}} \Big]^{\Delta H_{i1}} \Big[ 1 - \frac{p_{k_i}}{\sum_\ell p_\ell + p_{k_i}} \Big]^{\Delta I_i(t) - \Delta H_{i1}} \cdot \frac{[I_i(t-1) - \Delta I_i(t)]!}{(H_{i1} - \Delta H_{i1})!\, \eta'_i!}\; p_{k_i}^{H_{i1} - \Delta H_{i1}} \Big[ \sum_\ell p_\ell + p_i \Big]^{\eta'_i}. \tag{14}$$

Generally, set $\mathcal{I}$ consists of the three classes of subpopulations $I_i$ $(1 \le i \le m)$ discussed above: unobservable subpopulations, partially observable subpopulations, and observable subpopulations with the transition $I \to S$. According to Eq. (4), since each potential pathway $a_i$ corresponds to a potential solution $\sigma_i$, the most likely invasion pathway for an $mI \mapsto S$ case can be identified as

$$\hat{a}_{mIS} = \arg\max_{\sigma_i} P(\sigma_i \mid E_{mIS}) = \arg\max_{a_i} P(a_i \mid \mathcal{G}_{mIS}). \tag{15}$$

B. The Case of $mI \mapsto nS$ $(m > 1, n > 1)$

Finally, we consider the case of $mI \mapsto nS$ $(m > 1, n > 1)$, which is more complicated than $mI \mapsto S$, because some infected subpopulations in set $\mathcal{I}$ may have more than one invasion edge to the corresponding susceptible subpopulations in set $\mathcal{S}$, and set $\mathcal{S}$ contains more than one element, so that the transferred hosts obey a joint probability distribution of transferring likelihood.

As shown in Fig. 2, an invasion case $mI \mapsto nS$ includes the sets $\mathcal{I} = \{I_i \mid i = 1, 2, \ldots, m\}$ and $\mathcal{S} = \{S_i \mid i = 1, 2, \ldots, n\}$. The numbers of first-arrival infected individuals that invade each susceptible subpopulation in set $\mathcal{S}$ are $\{H_i \mid i = 1, 2, \ldots, n\}$, respectively. Here, denote by $U_i$ $(i = 1, 2, \ldots, m)$ the subset of susceptible neighbor subpopulations in set $\mathcal{S}$ of infected subpopulation $I_i$, and by $Y_j$ $(j = 1, 2, \ldots, n)$ the subset of infected neighbor subpopulations in set $\mathcal{I}$ of susceptible subpopulation $S_j$.

We define $\sigma = \{\{H_{i1} \mid i \in Y_1\}, \ldots, \{H_{in} \mid i \in Y_n\}\}$ as a potential solution for the $mI \mapsto nS$ case, if it is subject to two

conditions: (i)

$$\sum_{i \in Y_k} H_{ik} = H_k, \tag{16}$$

with $H_{ik} \ge 0$; (ii) for any $H_{ik}$, which denotes the number of infected hosts traveling to subpopulation $S_k$ from $I_i$ at $t_{EAT}$, we have $\sum_{k \in U_i} H_{ik} \le I_i(t-1)$, where $1 \le i \le m$, $1 \le k \le n$. If an $mI \mapsto nS$ case has $M$ potential solutions, let $\sigma_j = \{\{H_{i1}^{(j)} \mid i \in Y_1\}, \ldots, \{H_{in}^{(j)} \mid i \in Y_n\}\}$, $1 \le j \le M$.

Similarly, we first discuss the directly identifiable pathway for a given $mI \mapsto nS$, and then estimate the most likely value of each $H_{ik}$ as accurately as possible with our identification algorithm, since one solution of Eq. (16) corresponds to one invasion pathway of an invasion case $mI \mapsto nS$.

(i) Accurate Identification of Invasion Pathway

Given a few prerequisites, for all $k \in U_i$, $i \in Y_k$, the equations constituted by Eq. (16) have a unique solution, which implies that the invasion pathway of that invasion case can be identified accurately. Theorem 2 elucidates this scenario.

Theorem 2 (Accurate Identification of Invasion Pathway): Under the following conditions: 1) the number of invasion edges $E_{in} \le n + m$; 2) the neighbor subpopulations of each subpopulation in set $I$, except their neighbor subpopulations in set $S$, undergo the transition $S$ to $S$ or $I$ to $S$ during $t_{EAT}-1$ to $t_{EAT}$; 3) $\sum_{i=1}^{m} \Delta I_i(t) = \sum_{k=1}^{n} H_k$; the invasion pathway of an invasion case $mI \mapsto nS$ ($m, n > 1$) can be identified accurately.

Proof: Since the number of infected individuals in the partially observable subpopulation $i$ decreases at time $t$, i.e., $I_i(t) < I_i(t-1)$, $I_i(t) > 0$, some infected carriers must have diffused away from subpopulation $i$. Given the state transitions $S \to I$, $I \to S$ occurring at time $t$, the subpopulations in the neighborhood of $i$ (excluding the newly contaminated subpopulation $j$) cannot receive infected travelers. Therefore, the only possible destination for those infected travelers is subpopulation $S_j$.

The conditions $E_{in} \le n + m$ and $\sum_{i=1}^{m} \Delta I_i(t) = \sum_{k=1}^{n} H_k$ ensure that the equations $\sum_{i \in Y_k} H_{ik} = H_k$ and $\sum_{k \in U_i} H_{ik} = \Delta I_i(t)$ have the unique solution $\sigma = \{\{H_{i1} \mid i \in Y_1\}, \ldots, \{H_{in} \mid i \in Y_n\}\}$. The reason is that $\mathrm{rank}(A_{coef}) = E_{in}$, where $A_{coef}$ is the coefficient matrix of the equations $\sum_{i \in Y_k} H_{ik} = H_k$ and $\sum_{k \in U_i} H_{ik} = \Delta I_i(t)$. Thus the invasion pathway of this $mI \mapsto nS$ ($m, n > 1$) can be identified accurately.

(ii) Potential Invasion Pathway

If the conditions of Theorem 2 are not satisfied, the equations constituted by Eq. (16) have multiple solutions, and each solution corresponds to a set of potential invasion pathways that can result in the related $mI \mapsto nS$ ($m, n > 1$). We derive the transferring likelihood of each potential solution in the same way as in the case of $mI \mapsto S$. Therefore, the likelihood of solution $\sigma_j$ is characterized by

$$P(\sigma_j \mid E_{mInS}) = \prod_{k=1}^{m} \Omega^{(j)}(H_{k k_\jmath}) \Big/ \sum_{i=1}^{M} \prod_{k=1}^{m} \Omega^{(i)}(H_{k k_\jmath}), \tag{17}$$

where $M$ represents the number of solutions $\sigma_j$, and the item $\Omega^{(i)}(H_{k k_\jmath})$ represents the transferring estimator of infected subpopulation $I_k$ in $I$, $k_\jmath \in Y_k$. Note that $\sigma_j$ and $E_{mInS}$ correspond to a potential invasion pathway $a_j$ of $mI \mapsto nS$ and to $G_{mInS}$, respectively.

Now we discuss the transferring estimator of subpopulation $I_i$ according to its extent of observability.

(a) Subpopulation $I_i$ has only one neighbor (invasion edge) in set $S$.

In this case, the transferring estimator is the same as the one derived for $mI \mapsto S$.

(b) Subpopulation $I_i$ has $\rho$ ($\rho \ge 2$) neighbors (invasion edges) in set $S$.

Suppose that in total $k_i$ edges emanate from $I_i$, which consist of the following three kinds: (1) There are $\rho_i$ invasion edges ($2 \le \rho_i \le n$), labeled $1, 2, \ldots, \rho_i$, along which the traveling rates are $p_\jmath$, $\jmath \in [1, \rho_i]$, and $H_{i i_\jmath}$ infected hosts invade the corresponding neighbor subpopulations $i_\jmath$ in set $S$, respectively. (2) There are $\ell_i$ unobservable and partially observable edges, labeled $1 + \rho_i, \ldots, \ell_i + \rho_i$, respectively. Along each unobservable or partially unobservable edge, the traveling rate is $p_\ell$, $\ell \in [1, \ell_i]$, and $x_\ell$ infected hosts leave $I_i$. Accordingly, in total $\eta = \sum_\ell x_\ell$ infected hosts leave $I_i$ through the unobservable and partially unobservable edges. (3) There remain $k_i - \ell_i - \rho_i$ observable edges, labeled $\ell_i + \rho_i + 1, \ldots, k_i$, respectively. Along each observable edge, the traveling rate is $p_\aleph$, $\aleph \in [\ell_i + \rho_i + 1, k_i]$, and $x_\aleph$ infected hosts leave $I_i$.

With probability $p_i = 1 - \sum_\jmath p_\jmath - \sum_\ell p_\ell - \sum_\aleph p_\aleph$, an infected host stays at the source $I_i$. There are $x_i$ infected hosts staying in subpopulation $I_i$ with the probability $p_i$. Because $I_i$ connects to the unobservable and partially observable infected subpopulations, we only know the sum $\sum_\ell x_\ell + x_i = \eta_i'$.

Now we employ the following estimators to evaluate the transferring likelihood of the three categories of $I_i$.

Unobservable Subpopulation $I_i$: Because $\Delta I_i(t) = I_i(t-1) - I_i(t) \le 0$, we do not know whether infected individuals travel, how many of them do, or to which destinations. Similar to the invasion case $mI \mapsto S$, the transferring likelihood estimator of $I_i$ is

$$\Omega_u(H_{i i_\jmath}) = P\big(H_{i i_\jmath}, p_\jmath, \jmath = [1, \ldots, \rho];\; x_\ell, p_\ell, \ell = [1 + \rho, \ldots, l + \rho];\; x_\aleph, p_\aleph, \aleph = [l + \rho + 1, \ldots, k];\; x_i, p_i\big). \tag{18}$$

By means of the observable edges, the transferring estimator can be simplified as

$$\sum_{\eta + x_i = \eta_i'} P\big(H_{i i_\jmath}, p_\jmath, \jmath = [1, \ldots, \rho];\; x_\ell, p_\ell, \ell = [1 + \rho, \ldots, l + \rho];\; x_i, p_i\big) = \sum_{\eta + x_i = \eta_i'} \frac{I_i(t-1)!}{\prod_\jmath H_{i i_\jmath}!\,\eta!\,x_i!} \prod_\jmath p_\jmath^{H_{i i_\jmath}} \Big(\sum_\ell p_\ell\Big)^{\eta} p_i^{x_i}, \tag{19}$$

where $\eta_i' = I_i(t-1) - \sum_\jmath H_{i i_\jmath}$. Taking the marginal distribution, the transferring estimator becomes

$$\Omega_u = \frac{I_i(t-1)!}{\prod_\jmath H_{i i_\jmath}!\,\eta_i'!} \prod_\jmath p_\jmath^{H_{i i_\jmath}} \Big[\sum_\ell p_\ell + p_i\Big]^{\eta_i'}. \tag{20}$$

Observable Subpopulation $I_i$ ($I_i$ to $S_i$): In this situation, $H_i = \{H_{i i_\jmath} \mid \jmath = 1, \ldots, \rho\}$ all come from $\Delta I_i(t)$. The transferring likelihood estimator of an $I \to S$ observable subpopulation $I_i$ is

$$\Omega_{ob} = \frac{\Delta I_i(t)!}{\prod_\jmath H_{i i_\jmath}!\,\big(\Delta I_i(t) - \sum_\jmath H_{i i_\jmath}\big)!} \prod_\jmath \Big(\frac{p_\jmath}{\sum_{k=1}^{l+\rho} p_k}\Big)^{H_{i i_\jmath}} \cdot \Big(\frac{\sum_\ell p_\ell}{\sum_{j=1}^{l+\rho} p_j}\Big)^{\Delta I_i(t) - \sum_\jmath H_{i i_\jmath}}, \tag{21}$$

where $\Delta I_i(t) = I_i(t-1) - I_i(t) = I_i(t-1)$.

Partially Unobservable Subpopulation $I_i$: Because $\Delta I_i(t) = I_i(t-1) - I_i(t) > 0$, at least $\Delta I_i(t)$ infected hosts leave source $I_i$ from $t_{EAT}-1$ to $t_{EAT}$.

We first decompose $H_i = \{H_{i i_\jmath} \mid \jmath = 1, \ldots, \rho\}$ into two subsets, $H_i' = \{H_{i i_\jmath}' \mid \jmath = 1, \ldots, \rho\}$ and $H_i'' = \{H_{i i_\jmath}'' \mid \jmath = 1, \ldots, \rho\}$, with $H_{i i_\jmath}' + H_{i i_\jmath}'' = H_{i i_\jmath}$, where $H_{i i_\jmath}' \ge 0$ and $H_{i i_\jmath}'' \ge 0$. Denote by $H_i'$ the infected hosts coming from $I_i(t-1) - \Delta I_i(t)$, and by $H_i''$ the infected hosts coming from $\Delta I_i(t)$. We then analyze the transferring estimator for the following two types.

Type 1: $\sum_\jmath H_{i i_\jmath} \ge \Delta I_i(t)$.

Suppose $\varphi = \sum_\jmath H_{i i_\jmath}''$ ($0 \le \varphi \le \Delta I_i(t)$), which represents the number of infected hosts coming from $\Delta I_i(t)$. Given a fixed $\varphi$, there may be more than one permutation for $H_i'' = \{H_{i i_\jmath}'' \mid \jmath = 1, \ldots, \rho\}$. The transferring likelihood estimator is

$$\Omega_{pu} = \sum_{\varphi=0}^{\Delta I_i(t)} \; \sum_{\sum_\jmath H_{i i_\jmath}'' = \varphi} P_1 P_2, \tag{22}$$

where

$$P_1 = \frac{\Delta I_i(t)!}{\prod_\jmath H_{i i_\jmath}''!\,(\Delta I_i(t) - \varphi)!} \cdot \prod_\jmath \Big(\frac{p_\jmath}{\sum_{k=1}^{l+\rho} p_k}\Big)^{H_{i i_\jmath}''} \Big(\frac{\sum_\ell p_\ell}{\sum_{j=1}^{l+\rho} p_j}\Big)^{\Delta I_i(t) - \varphi},$$

$$P_2 = \frac{(I_i(t-1) - \Delta I_i(t))!}{\prod_\jmath H_{i i_\jmath}'!\,\big(I_i(t-1) - \Delta I_i(t) - \sum_\jmath H_{i i_\jmath} + \varphi\big)!} \prod_\jmath p_\jmath^{H_{i i_\jmath}'} \cdot \Big(\sum_\ell p_\ell + p_i\Big)^{I_i(t-1) - \Delta I_i(t) - \sum_\jmath H_{i i_\jmath} + \varphi}.$$

Type 2: $\sum_\jmath H_{i i_\jmath} < \Delta I_i(t)$.

Suppose $\varphi = \sum_\jmath H_{i i_\jmath}''$ ($0 \le \varphi \le \sum_\jmath H_{i i_\jmath}$), which represents the number of infected hosts coming from $\Delta I_i(t)$. Given a fixed $\varphi$, there may be more than one solution for $H_i''$. The transferring likelihood estimator is

$$\Omega_{pu} = \sum_{\varphi=0}^{\sum_\jmath H_{i i_\jmath}} \; \sum_{\sum_\jmath H_{i i_\jmath}'' = \varphi} P_1 P_2, \tag{23}$$

where $P_1$ and $P_2$ are the same as those in Eq. (22).

According to Eq. (17), the most-likely invasion pathways for an invasion case (INC) $mI \mapsto nS$ can be identified as

$$\hat{a}^{mInS} = \arg\max_{\sigma_i} P(\sigma_i \mid E_{mInS}) = \arg\max_{a_i} P(a_i \mid G_{mInS}). \tag{24}$$

Note that if the number of first-arrival infectious individuals is $H \ge 3$, there may be multiple potential solutions corresponding to one potential pathway. For example, an $mI \mapsto S$ case is illustrated in Fig. 4. In this situation, we merge the transferring likelihoods of potential solutions of $mI \mapsto S$ or $mI \mapsto nS$ if they belong to the same invasion pathway, and then find the most-likely invasion pathways, which correspond to the maximum transferring likelihood.

Fig. 4. An example of a $2I \mapsto S$ invasion case (sources $I_1$ and $I_2$ send $H_{11}$ and $H_{21}$ infected hosts to $S_1$). Suppose that three infected cases reach subpopulation $S_1$ simultaneously, which means $H = 3$. The three possible permutations are: $H^1 = 3$: $H_{11} = 1$, $H_{21} = 2$; $H^2 = 3$: $H_{11} = 2$, $H_{21} = 1$; $H^3 = 3$: $H_{11} = 3$, $H_{21} = 0$. Permutations 1 and 2 indicate the same pathways, but permutation 3 is different.

According to Eqs. (1) and (2), the whole invasion pathway $T$ can be reconstructed chronologically by assembling the identified invasion pathways of each invasion case, once the four classes of invasion cases have been identified. To depict the IPI algorithm explicitly, the pseudocode for our algorithm is given in Algorithm 2.

C. Analysis of IPI Algorithm

Since the IPI algorithm is based on a hierarchical-iteration-like decomposition technique, which reduces the temporal-spatial complexity of spreading, it can handle a large-scale spatial pandemic. Note that the number of invading infected hosts $H_i$ at the EAT is always very small (generally $\le 3$). Therefore, the computation cost of our IPI algorithm is small, and we employ an enumeration algorithm to compute each of the $M$ potential permutations. In this section, we only discuss the simplest situation, in which one pathway corresponds to only one permitted solution in an invasion case. The analysis can be extended to the situation in which one pathway corresponds to multiple potential solutions.

Denote by $\pi$ the probability corresponding to the most likely pathways for a given invasion case. Thus we have

$$\pi(\sigma) = \sup_{\sigma_i} \{P(\sigma_i \mid E)\}. \tag{25}$$

Property 1: Given an invasion case '$mI \mapsto S$' or '$mI \mapsto nS$' with $P(\sigma_j \mid E) = \prod_{k=1}^{m} \Omega \big/ \sum_{i=1}^{M} \prod_{k=1}^{m} \Omega$, there must exist $P_{min}$ and $P_{max}$ satisfying

$$P_{min} \le \pi(\sigma) \le P_{max}. \tag{26}$$

Proof: Suppose that $\prod_{k=1}^{m} P(I_k(t), \sigma_1) \le \cdots \le \prod_{k=1}^{m} P(I_k(t), \sigma_M)$. Thus

$$P_{max} = \frac{\prod_{k=1}^{m} P(I_k(t), \sigma_M)}{\prod_{k=1}^{m} P(I_k(t), \sigma_{M-1}) + \prod_{k=1}^{m} P(I_k(t), \sigma_M)}.$$

Because $\pi(\sigma) \ge 1/M$, let

$$P_{min} = \max\Big\{\frac{1}{M},\; \frac{\prod_{k=1}^{m} P(I_k(t), \sigma_j)}{\prod_{k=1}^{m} P(I_k(t), \sigma_1) + \sum_{i=1}^{M} \prod_{k=1}^{m} P(I_k(t), \sigma_j)}\Big\}.$$

We have $P_{min} \le \pi(\sigma) \le P_{max}$.
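The enumeration, normalization (Eq. (17)), and merging steps described above can be illustrated on the $2I \mapsto S$ example of Fig. 4. The following is only a minimal sketch under stated assumptions, not the authors' implementation: both sources are treated as unobservable with a single invasion edge, so Eq. (20) reduces to a binomial term, and every host that does not take the invasion edge is lumped into one "stay or leave untraceably" outcome with rate $1 - p$ (i.e., no observable edges). The function names, the infected counts, and the traveling rates are hypothetical.

```python
from itertools import product
from math import comb


def omega_single_edge(h, infected_prev, p_edge):
    """Eq. (20)-style estimator for a source with ONE invasion edge.

    Assumption: hosts not using the invasion edge either stay or leave
    along untraceable edges, lumped together with rate 1 - p_edge.
    """
    if h < 0 or h > infected_prev:
        return 0.0
    return comb(infected_prev, h) * p_edge**h * (1.0 - p_edge) ** (infected_prev - h)


def solution_likelihoods(first_arrivals, infected_prev, p_edges):
    """Enumerate solutions (H_11, ..., H_m1) whose sum equals the number of
    first arrivals and normalise the product of estimators as in Eq. (17)."""
    weights = {}
    for sol in product(range(first_arrivals + 1), repeat=len(infected_prev)):
        if sum(sol) != first_arrivals:
            continue
        w = 1.0
        for h, n, p in zip(sol, infected_prev, p_edges):
            w *= omega_single_edge(h, n, p)
        if w > 0.0:
            weights[sol] = w
    total = sum(weights.values())
    return {sol: w / total for sol, w in weights.items()}


def merge_by_pathway(likelihoods):
    """Merge solutions that use the same set of invasion edges (same pathway)."""
    merged = {}
    for sol, prob in likelihoods.items():
        pathway = frozenset(i for i, h in enumerate(sol) if h > 0)
        merged[pathway] = merged.get(pathway, 0.0) + prob
    return merged


# Hypothetical inputs: H = 3 first arrivals, I_1(t-1) = 10, I_2(t-1) = 4.
likelihoods = solution_likelihoods(3, infected_prev=[10, 4], p_edges=[0.02, 0.01])
pathways = merge_by_pathway(likelihoods)
best = max(pathways, key=pathways.get)  # most-likely pathway, cf. Eq. (24)
print(sorted(likelihoods.items()))
print(sorted(best), pathways[best])
```

In this sketch the solutions $(1, 2)$ and $(2, 1)$ are merged because both use the edges from $I_1$ and $I_2$, mirroring the remark in the Fig. 4 caption; the enumeration also yields $(0, 3)$, which is admissible here because $I_2(t-1) = 4 \ge 3$.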

Algorithm 2 Invasion Pathways Identification (IPI)
1: Inputs: the time series of infection data $F_i(t)$ and the topology of network $G(V, E)$ (including diffusion rates $p$)
2: Find all invasion events via EAT data
3: for each invasion event
4:   Invasion partition to find out the $I \mapsto S$, $I \mapsto nS$, $mI \mapsto S$ and $mI \mapsto nS$
5:   for each $mI \mapsto S$ or $mI \mapsto nS$
6:     if it satisfies the conditions of Theorem 1 or Theorem 2
7:       compute the unique invasion pathway
8:     end if
9:     if it does not satisfy the conditions of Theorem 1 or Theorem 2
         compute all $M$ potential solutions $\sigma_i$
10:      compute $P(\sigma_i \mid E_{mIS})$ or $P(\sigma_i \mid E_{mInS})$
11:      merge the $P(\sigma_i \mid E_{mIS})$ or $P(\sigma_i \mid E_{mInS})$ of potential solutions if they belong to the same pathway
12:    end if
13:  end for
14:  find the maximal $\hat{a}_j^{mIS}$ and $\hat{a}_j^{mInS}$ invasion pathway
15: end for
16: reconstruct the whole invasion pathway ($T$) by assembling each invasion case chronologically

D. Identifiability of Invasion Pathway

Accordingly, our IPI algorithm first decomposes the whole invasion pathway into four classes of invasion cases. Some invasion cases are easy to identify, but some are difficult. Therefore, it is important to describe how likely it is that an invasion case is wrongly identified. The extent to which an invasion case can be identified relates to the absolute value of $\pi(\sigma)$ and to the information given by the probability vector of all potential invasion pathways. We employ the entropy to describe the information of the likelihood vector, which contains the likelihoods of all $M$ potential solutions/pathways of an invasion case.

Definition 1 (Entropy of Transferring Likelihoods of $M$ Potential Solutions): According to Shannon entropy, we define the normalized entropy of the transferring likelihoods $P(\sigma_1 \mid E), \ldots, P(\sigma_M \mid E)$ as

$$S = -\frac{1}{\log M} \sum_{i=1}^{M} P(\sigma_i \mid E) \log P(\sigma_i \mid E). \tag{27}$$

This likelihood entropy $S$ measures the information embedded in the likelihood vector of the potential solutions of a given invasion case.

The larger $\pi(\sigma)$ and the smaller the entropy $S$, the easier it is to identify the epidemic pathways of an invasion case. We define the identifiability of invasion pathways to characterize how feasibly an invasion case can be identified:

$$\Pi = \pi(\sigma)(1 - S). \tag{28}$$

Although the likelihood entropies of some invasion cases are small (less than 0.5), they are still difficult to identify, because their $\pi(\sigma)$ are much less than 0.5. Therefore, the identifiability $\Pi$ describes the practicability of identifying a given $mI \mapsto S$ or $mI \mapsto nS$ better than $\pi(\sigma)$ or the likelihood entropy $S$ alone. The identifiability statistically tells us why some invasion cases, whose $\Pi$ is more than 0.5, are easy to identify, and why other invasion cases, whose $\Pi$ is much less than 0.5, are difficult to identify.

Next we show that there exist upper and lower bounds on the identifiability $\Pi$ of a given invasion case.

Theorem 3: Given an invasion case '$mI \mapsto S$' or '$mI \mapsto nS$', let $\Pi = \pi(\sigma)(1 - S)$ be the identifiability computed by the IPI algorithm. There exist a lower bound $\Pi_{min} = \frac{1}{M}(1 - S')$ and an upper bound $\Pi_{max} = \pi - S(\pi(\sigma))$ such that

$$\Pi_{min} \le \Pi \le \Pi_{max}, \tag{29}$$

where $S' = -\frac{1}{\log M}\Big(\pi \log(\pi) + \sum \frac{1-\pi}{M-1} \log\big(\frac{1-\pi}{M-1}\big)\Big)$.

Proof: $\Pi = \pi(1 - S) \ge \frac{1}{M}(1 - S) \ge \frac{1}{M}(1 - S')$, where

$$S' = -\frac{1}{\log M}\Big(\pi \log(\pi) + \sum_{\sigma_i} \frac{1-\pi}{M-1} \log\Big(\frac{1-\pi}{M-1}\Big)\Big) = -\frac{1}{\log M}\Big(\pi \log(\pi) + (1-\pi) \log\Big(\frac{1-\pi}{M-1}\Big)\Big) = -\frac{1}{\log M}\Big(\pi \log(\pi) + (1-\pi)\log(1-\pi) - (1-\pi)\log(M-1)\Big).$$

According to Fano's inequality, the entropy $S \le S'$.

On the other hand, we note that the function $f(y) = y \log(y)$ is strictly convex. According to Jensen's inequality,

$$\pi S(\sigma) = \pi \times \Big(-\pi(\sigma) \log(\pi(\sigma)) - \sum_{i=1}^{M-1} P(\sigma_i \mid E) \log P(\sigma_i \mid E)\Big) \ge -(\pi(\sigma))^2 \log\big((\pi(\sigma))^2\big) - \sum_{i=1}^{M-1} \pi(\sigma) P(\sigma_i \mid E) \log\big(\pi(\sigma) P(\sigma_i \mid E)\big).$$

Hence $\Pi = \pi(1 - S) = \pi - \pi S \le \pi - S(\pi\sigma)$. Therefore, $\Pi_{min} \le \Pi \le \Pi_{max}$, where $\Pi_{min} = \frac{1}{M}(1 - S')$ and $\Pi_{max} = \pi - S(\pi(\sigma))$. That completes the proof of Theorem 3.

IV. COMPUTATIONAL EXPERIMENTS

To verify the performance of our algorithm, we use a networked metapopulation-based Monte Carlo simulation method to simulate the stochastic epidemic process on the American airports network (AAN) and on a Barabasi-Albert (BA) networked metapopulation.

The AAN is a highly heterogeneous network. Each node of the AAN represents an airport, whose population size is the population of the area served by this airport. The directed traffic flow is the number of passengers traveling along the corresponding edge/airline. The AAN data used in the simulations are based on true demography and traffic statistics [50]. We take the largest connected component of all American airports, which consists of 404 nodes (airports/subpopulations), as the AAN. The average degree of the AAN is nearly $\langle k \rangle = 16$. The total population of the AAN is $N_{total} \approx 0.243 \times 10^9$, which covers most of the population of the USA.

The BA network obeys a heterogeneous degree distribution [51] and has the two properties of growth and preferential attachment. In a BA networked metapopulation, each node is a subpopulation containing many individuals. The details of how to generate a BA networked metapopulation, including the setting of travel rates, are presented in Appendix B. To test how well our algorithm handles a large-scale network, the number of subpopulations of the BA networked metapopulation is fixed at 3000. This is nearly equal to the size of the world airports network [52]. We fix $\langle k \rangle = 16$ as the average degree of the BA networked metapopulation. The initial population size of each subpopulation is $N_1 = N_2 = \cdots = N_N = 6 \times 10^5$, and the total population is $N_{total} = 6 \times 10^5 \times 3000 = 1.8 \times 10^9$, which covers most of the active travelers of the world.
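As a numerical illustration of Definition 1 and Eq. (28), the normalized entropy $S$ and the identifiability $\Pi$ of a likelihood vector can be computed as sketched below. This is only a sketch: the function names and the two example likelihood vectors are hypothetical, and returning $S = 0$ when there is a single solution ($M = 1$) is an assumed convention, since Eq. (27) is undefined there.

```python
from math import log


def normalized_entropy(likelihoods):
    """Eq. (27): Shannon entropy of the likelihood vector divided by log M."""
    m = len(likelihoods)
    if m < 2:
        return 0.0  # assumed convention: a single solution has no uncertainty
    return -sum(p * log(p) for p in likelihoods if p > 0.0) / log(m)


def identifiability(likelihoods):
    """Eq. (28): Pi = pi(sigma) * (1 - S), with pi(sigma) the largest likelihood."""
    return max(likelihoods) * (1.0 - normalized_entropy(likelihoods))


peaked = [0.90, 0.05, 0.05]  # one dominant solution (hypothetical values)
flat = [0.34, 0.33, 0.33]    # nearly indistinguishable solutions
print(identifiability(peaked), identifiability(flat))
```

A peaked likelihood vector gives a larger $\Pi$ than a nearly uniform one, and an exactly uniform vector gives $S = 1$ and hence $\Pi = 0$, which matches the reading that such invasion cases are the hardest to identify.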

[Figures 5 and 6 (plots). Top panels: "The whole spreading accuracy of American Airports Network" and "The whole accuracy of 3000 nodes of BA networked metapopulation". Middle panels: "The early stage accuracy of American Airports Network" and "The early accuracy of 3000 nodes of BA networked metapopulation". In the top and middle panels the x-axis is the index of each realization (0 to 20) and the y-axis is the accuracy. Bottom panels: "American Airports Network" and "3000 nodes of BA networked metapopulation", showing the accuracy of IPI, ARR, MCML, and EFF for early mIS, early mInS, whole mIS, and whole mInS.]

Fig. 5. The top and middle figures show the identified accuracy for the whole and early stage (before appearance of the first 50 infected subpopulations) invasion pathways for twenty independent spreading realizations on the AAN. The bottom shows accumulative identified accuracy of invasion cases ($mI \mapsto S$ and $mI \mapsto nS$) for the early stage and the whole invasion pathways on the AAN.

Fig. 6. The top and middle figures show the identified accuracy for the whole and early stage invasion pathways for twenty independent spreading realizations on 3000 subpopulations of the BA networked metapopulation. The bottom shows accumulative identified accuracy of $mI \mapsto S$ and $mI \mapsto nS$ for the early stage (the first 300 infected subpopulations) and the whole invasion pathways on 3000 subpopulations of the BA networked metapopulation.
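The identified accuracy reported in Figs. 5 and 6 can be read as the fraction of true invasion pathways that a method recovers. A minimal sketch of that ratio follows, assuming each invasion pathway is represented as a (source, destination) pair of subpopulation indices; this representation, the function name, and the toy edge lists are hypothetical and only serve the illustration.

```python
def identification_accuracy(true_edges, inferred_edges):
    """Fraction of true invasion pathways that were correctly identified."""
    true_set = set(true_edges)
    if not true_set:
        raise ValueError("no true invasion pathways to compare against")
    return len(true_set & set(inferred_edges)) / len(true_set)


# Toy example: four true invasion pathways, three of them recovered.
true_edges = [(1, 2), (1, 3), (2, 4), (3, 5)]
inferred_edges = [(1, 2), (1, 3), (3, 4), (3, 5)]
accuracy = identification_accuracy(true_edges, inferred_edges)
print(accuracy)  # 0.75
```

Restricting both edge lists to the invasion cases of one class (for example, only $mI \mapsto S$ cases) gives the per-class accuracy shown in the bottom panels.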

A. Networked Metapopulation-based Monte Carlo Simulation Method to Simulate Stochastic Epidemic Process

At the beginning, we assume that only one subpopulation is seeded as infected and the others are susceptible. Thus $I_1(0) = 5$, $I_i(0) = 0$ ($i = 2, \cdots, N$). We record and update each individual's state (i.e., susceptible or infected) at each time step. At each $\Delta t$ ($\Delta t$ is defined as the unit time from $t-1$ to $t$), the transmission rate $\beta$ and the diffusion rate $p_{ij}$ are converted into probabilities. The rules of the reaction and diffusion processes of individuals in $\Delta t$ are as follows:

(1) Reaction Process: Individuals in the same subpopulation are homogeneously mixed. Each susceptible individual (in subpopulation $i$) becomes infected with probability $\beta I_i(t) / N_i$. Therefore, the average number of newly infected individuals is $\beta (N_i(t) - I_i(t)) I_i(t) / N_i$, but the simulation results fluctuate from one realization to another. The reaction process is simulated by a binomial distribution.

(2) Diffusion Process: After the reaction process, the diffusion of individuals between different subpopulations is taken into account. Each individual from subpopulation $i$ migrates to the neighboring subpopulation $j$ with probability $p_{ij}$. The average number of new infectious travelers from subpopulation $i$ to $j$ is $p_{ij} I_i(t)$. The diffusion process is simulated by a binomial or multinomial distribution.

B. Numerical Results

We compared our IPI algorithm with three heuristic algorithms that generate the shortest path tree or the minimum spanning tree of the metapopulation networks. i) The ARR (average-arrival-time-based shortest path tree) [30]: The minimum distance path from subpopulation $i$ to subpopulation $j$ over all possible paths is generated in terms of the mean first arrival time. Thus the average-arrival-time-based shortest path tree is constructed by assembling all shortest paths from the seed subpopulation to the other subpopulations of the whole network. ii) The EFF (effective-distance-based most probable paths tree) [4]: From subpopulation $i$ to subpopulation $j$, the effective distance $D_{ij}$ is defined as the minimum of the sum of effective lengths along the arbitrary legs of the path. The set of shortest paths to all subpopulations from seed subpopulation $i$ constitutes a shortest path tree. iii) The MCML (Monte-Carlo-Maximum-Likelihood-based most likely epidemic invasion tree) [43]: To produce a most likely infection tree, the authors constructed the minimum spanning tree from the seed subpopulation to minimize the distance. Some recent works [53]-[55] use machine learning or genetic algorithms to infer transmission networks from surveillance data. Because their model assumptions and conditions differ from ours, we do not perform a comparison with them.

We assess the identification accuracy of the inferred invasion pathways. This accuracy is defined as the ratio between the number of invasion pathways correctly identified by each method and the number of true invasion pathways. We also compute the accumulative accuracy for the invasion cases of $mI \mapsto S$ and $mI \mapsto nS$. This accuracy is defined as the ratio between the number of invasion pathways correctly identified by each method in this class of invasion cases and the number of true invasion pathways in this class of invasion cases. Additionally, we investigate the identification accuracy in the early stage of a global pandemic spreading, which is important for understanding how to predict and control the prevalence of epidemics.

In the top and middle of Figure 5, we observe the whole identification accuracy and the early-stage identification accuracy. The bottom of Figure 5 shows the early and whole accumulative identification accuracy of $mI \mapsto S$ and $mI \mapsto nS$ through twenty independent realizations on the AAN for each algorithm, respectively. The simulation results show that our algorithm outperforms the others, which indicates that the heterogeneity of the

structure of the AAN plays an important role.

[Figure 7 (network diagrams of subpopulations 1 to 50, shown for the actual and for the identified invasion pathways).]

Fig. 7. Illustration of the actual invasion pathways and the most likely identified invasion pathways, in a given realization, during the early stage (before the appearance of 50 infected subpopulations) on the AAN. Subpopulation 1 is the seed.

Figure 6 shows the results for the BA networked metapopulation with 3000 subpopulations. The top of Figure 6 presents the identification accuracy of the whole invasion pathway for each realization of the four algorithms, while the middle of Figure 6 shows the identification accuracy of the early-stage invasion pathway for each realization. The bottom shows the accumulative identified accuracy of $mI \mapsto S$ and $mI \mapsto nS$ over twenty realizations for the four algorithms, respectively. The simulation results indicate that our algorithm can handle a large-scale networked metapopulation with robust performance. Note that the performance of the ARR for the BA networked metapopulation is the same as that of the EFF, because our parameter $C$ is a constant in the diffusion model (see Appendix B).

The numerical results suggest that networks with different topologies yield different identification performances, which indicates that an identification algorithm should embed the effects of both spreading and topology. Our algorithm takes into account both the heterogeneity of epidemics (the number of infected individuals) and the network topology (diffusion flows).

We finally test the identifiability of an invasion case. Figure 8 shows the entropy and identifiability of the wrongly identified $mI \mapsto S$ cases of twenty realizations on the AAN. The smaller the identifiability of an invasion case is, the more prone it is to be wrongly identified. The identifiability characterizes the wrongly identified $mI \mapsto S$ cases more reasonably than the likelihood entropy. This indicates that the identifiability $\Pi$ distinguishes whether an invasion case is difficult to identify better than the likelihood entropy does.

[Figure 8 (histogram). x-axis: intervals (0.1 to 1); y-axis: frequency (0.00 to 0.45); series: likelihoods entropies S of INCs, identifiabilities of INCs.]

Fig. 8. Statistics analysis of the likelihoods entropy and identifiability of wrongly identified $mI \mapsto S$ of 20 realizations of epidemic spreading on the AAN.

V. CONCLUSION AND DISCUSSION

To conclude, we have proposed an identification framework, the so-called IPI algorithm, to explore the problem of inferring the invasion pathway of a pandemic outbreak. We first anatomize the whole invasion pathway into four classes of invasion cases at each epidemic arrival time. Then we identify the four classes of invasion cases, and reconstruct the whole invasion pathway from the source subpopulation of a spreading process. We introduce the concept of identifiability to quantitatively analyze how difficult it is to identify an invasion case. The simulation results on the American Airports Network (AAN) and on a large-scale BA networked metapopulation have demonstrated that our algorithm performs robustly in identifying the spatial invasion pathway, especially in the early stage of an epidemic. We conjecture that the proposed IPI algorithm framework can be extended to the problems of virus diffusion in computer networks and human-to-human epidemic contact networks, and that the reaction dynamics may be extended to SIR or SIS dynamics.

APPENDIX A
MOBILITY OPERATOR

We discuss the individual mobility operator. Due to the presence of stochasticity and independence of individual mobility, the number of successful transfers of individuals between or among adjacent subpopulations is quantified by a binomial or a multinomial process, respectively. If the focal subpopulation $i$ has only one neighboring subpopulation $j$, the number of individuals in a given compartment $X$ ($X \in \{S, I\}$ and $\sum_X X_i = N_i$) transferred from $i$ to $j$ per unit time, $T_{ij}(U_i)$, is generated from a binomial distribution with probability $p$ representing the diffusion rate and the number of trials $U_i$, i.e.,

$$\mathrm{Binomial}(T_{ij}, U_i, p) = \frac{U_i!}{T_{ij}!\,(U_i - T_{ij})!}\, p^{T_{ij}} (1 - p)^{(U_i - T_{ij})}. \tag{30}$$

If the focal subpopulation $i$ has multiple neighboring subpopulations $j_1, j_2, \ldots, j_k$, with $k$ representing $i$'s degree, the numbers of individuals in a given compartment $U$ transferred from

$i$ to $j_1, j_2, \ldots, j_k$ are generated from a multinomial distribution with probabilities $p_{ij_1}, p_{ij_2}, \ldots, p_{ij_k}$ ($p_{ij_1} + p_{ij_2} + \cdots + p_{ij_k} = p$) representing the diffusion rates on the edges emanating from subpopulation $i$ and the number of trials $U_i$, i.e.,

$$\mathrm{Multinomial}(\{T_{ij_\ell}\}, U_i, \{p_{ij_\ell}\}) = \frac{U_i!}{\prod_\ell T_{ij_\ell}!\,\big(U_i - \sum_\ell T_{ij_\ell}\big)!} \Big(\prod_\ell p_{ij_\ell}^{T_{ij_\ell}}\Big) \Big(1 - \sum_\ell p_{ij_\ell}\Big)^{(U_i - \sum_\ell T_{ij_\ell})}, \tag{31}$$

where $\ell \in [1, k]$.

APPENDIX B
A GENERIC DIFFUSION MODEL TO GENERATE A BARABASI-ALBERT METAPOPULATION NETWORK

We develop a general diffusion model to generate the BA metapopulation network used in Section IV, which characterizes the human mobility pattern according to the empirical statistical rules of air transportation networks. The diffusion rate from subpopulation $i$ to $j$ is $p_{ij} = w_{ij} / N_i$, where $w_{ij}$ denotes the traffic flow from subpopulation $i$ to $j$.

These empirical statistical rules are verified in the air transportation network [52]: $\langle p_{ij} \rangle \sim (k_i k_j)^{\theta'}$, $\theta' = 0.5 \pm 0.1$; $T \sim k^{\beta'}$, $\beta' \simeq 1.5 \pm 0.1$; $N \sim T^{\lambda'}$ ($T = \sum_l w_{jl}$), $\lambda' \simeq 0.5$.

All the above empirical formulas relate to the node degree $k$. To generate an artificial transportation network, we introduce a generic diffusion model to determine the diffusion rate

$$p_{ij} = \frac{b_{ij} k_j^{\theta}}{\sum_l b_{il} k_l^{\theta}}\, C, \tag{32}$$

where $b_{ij}$ stands for the elements of the adjacency matrix ($b_{ij} = 1$ if $i$ connects to $j$, and $b_{ij} = 0$ otherwise), $C$ is a constant, and $\theta$ is a variable parameter. We assume that the parameter $\theta$ follows the Gaussian distribution $\theta \sim N(\hat{\theta}, \delta^2) = \frac{1}{\sqrt{2\pi}\,\delta} \exp\big(-\frac{(\hat{\theta} - \theta)^2}{2\delta^2}\big)$. Based on the empirical rule $T \sim k^{\beta'}$, $\beta' \simeq 1.5 \pm 0.1$, where $\beta'$ is approximately linear in $\theta$, least squares estimation is employed to evaluate the parameters $\hat{\theta}$ and $\delta^2$ once the initial population of each node and the constant $C$ are set. Then, for a given BA network, this method yields a BA networked metapopulation in which real statistical information is embedded.

REFERENCES

[1] M. J. Keeling and P. Rohani, Modeling Infectious Diseases in Humans and Animals, Princeton and Oxford: Princeton University Press, 2008.
[2] H. Heesterbeek, R. M. Anderson, V. Andreasen, et al., "Modeling infectious disease dynamics in the complex landscape of global health," Science, vol. 347, no. 6227, pp. aaa4339(1–10), Mar. 2015.
[3] J. P. Fitch, "Engineering a global response to infectious diseases," Proc. IEEE, vol. 103, no. 2, pp. 263–272, Feb. 2015.
[4] D. Brockmann and D. Helbing, "The hidden geometry of complex, network-driven contagion phenomena," Science, vol. 342, no. 6164, pp. 1337–1342, Dec. 2013.
[5] A. J. McMichael, "Globalization, climate change, and human health," N. Engl. J. Med., vol. 368, no. 14, pp. 1335–1343, Apr. 2013.
[6] A. R. Mclean, R. M. May, J. Pattison, et al. (eds.), SARS: A Case Study in Emerging Infections, New York: Oxford University Press, 2005.
[7] C. Fraser, C. A. Donnelly, S. Cauchemez, et al., "Pandemic potential of a strain of influenza A (H1N1): Early findings," Science, vol. 324, no. 5934, pp. 1557–1561, Jun. 2009.
[8] Y. Yang, J. D. Sugimoto, M. E. Halloran, et al., "The transmissibility and control of pandemic influenza A (H1N1) virus," Science, vol. 326, no. 5953, pp. 729–733, Oct. 2009.
[9] H. Yu, B. J. Cowling, L. Feng, et al., "Human infection with avian influenza A H7N9 virus: An assessment of clinical severity," Lancet, vol. 382, no. 9887, pp. 138–145, Jun. 2013.
[10] T. Tsan-Yuk Lam, B. Zhou, J. Wang, et al., "Dissemination, divergence and establishment of H7N9 influenza viruses in China," Nature, vol. 522, pp. 102–105, Jun. 2015.
[11] M. F. C. Gomes, A. P. y Piontti, L. Rossi, et al., "Assessing the international spreading risk associated with the 2014 West African Ebola outbreak," PLoS Curr., Sep. 2014.
[12] B. J. Cowling, M. Park, V. J. Fang, et al., "Preliminary epidemiological assessment of MERS-CoV outbreak in South Korea, May to June," Euro. Surveill., vol. 20, no. 25, pii=21163, Jun. 2015.
[13] X. Wang and G. Chen, "Complex network: Small-world, scale-free, and beyond," IEEE Circuits Syst. Mag., vol. 3, no. 1, pp. 6–20, Jan.-Mar. 2003.
[14] M. E. J. Newman, Networks: An Introduction, New York, US: Oxford University Press, 2010.
[15] C. W. Tan, M. Chiang, and R. Srikant, "Fast algorithms and performance bounds for sum rate maximization in wireless networks," IEEE/ACM Trans. on Netw., vol. 21, no. 3, pp. 706–719, Jun. 2013.
[16] B. Yang, J. Liu, and D. Liu, "Characterizing and extracting multiplex patterns in complex networks," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 469–481, Apr. 2012.
[17] X. Fu, M. Small, and G. Chen, "Propagation dynamics on complex networks: models, methods and stability analysis," John Wiley & Sons, 2013.
[18] R. Pastor-Satorras, C. Castellano, P. Van Mieghem, and A. Vespignani, "Epidemic processes in complex networks," Rev. Mod. Phys., vol. 87, no. 3, pp. 925–979, Jul.-Sep. 2015.
[19] L. Wang and X. Li, "Spatial epidemiology of networked metapopulation: an overview," Chin. Sci. Bull., vol. 59, no. 28, pp. 3511–3522, Jul. 2014.
[20] P.-Y. Chen, S.-M. Cheng, and K.-C. Chen, "Optimal control of epidemic information dissemination over networks," IEEE Trans. Cybern., vol. 44, no. 12, pp. 2316–2328, Dec. 2014.
[21] C. Griffin and R. Brooks, "A note on the spread of worms in scale-free networks," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 36, no. 1, pp. 198–202, Feb. 2006.
[22] L.-C. Chen and K. Carley, "The impact of countermeasure propagation on the prevalence of computer viruses," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 2, pp. 823–833, Apr. 2004.
[23] D. Balcan, H. Hu, B. Goncalves, "Seasonal transmission potential and activity peaks of the new influenza A(H1N1): a Monte Carlo likelihood analysis based on human mobility," BMC Med., vol. 7, no. 1, pp. 45(1–12), Sep. 2009.
[24] N. M. Ferguson, D. A. T. Cummings, C. Fraser, et al., "Strategies for mitigating an influenza pandemic," Nature, vol. 442, no. 7101, pp. 448–452, Jul. 2006.
[25] V. Colizza, A. Barrat, M. Barthelemy, et al., "Modeling the worldwide spread of pandemic influenza: baseline case and containment interventions," PLoS Med., vol. 4, no. 1, pp. 0095–0110, Jan. 2007.
[26] L. Wang, Y. Zhang, T. Huang, and X. Li, "Estimating the value of containment strategies in delaying the arrival time of an influenza pandemic: A case study of travel restriction and patient isolation," Phys. Rev. E, vol. 86, no. 3, pp. 032901, Sep. 2012.
[27] M. E. Halloran, N. M. Ferguson, S. Eubank, et al., "Modeling targeted layered containment of an influenza pandemic in the United States," Proc. Natl. Acad. Sci. USA, vol. 105, no. 12, pp. 4639–4644, Mar. 2008.
[28] J. T. Wu, G. M. Leung, M. Lipsitch, et al., "Hedging against antiviral resistance during the next influenza pandemic using small stockpiles of an alternative chemotherapy," PLoS Med., vol. 6, no. 5, pp. 1–11, May 2009.
[29] E. T. Lofgren, M. E. Halloran, C. M. Rivers, et al., "Opinion: Mathematical models: A key tool for outbreak response," Proc. Natl. Acad. Sci. USA, vol. 111, no. 51, pp. 18095–18096, Dec. 2014.
[30] A. Gautreau, A. Barrat, and M. Barthélemy, "Global disease spread: statistics and estimation of arrival times," J. Theor. Biol., vol. 251, no. 3, pp. 509–522, Apr. 2008.
[31] H. Miao, X. Xia, A. S. Perelson, et al., "On identifiability of nonlinear ODE models and applications in viral dynamics," SIAM Rev., vol. 53, no. 1, pp. 3–39, Feb. 2011.
[32] D. Shah and T. Zaman, "Rumors in a network: Who's the Culprit?" IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 5163–5181, Aug. 2011.
[33] W. Luo, W. P. Tay, and M. Leng, "Identifying infection sources and regions in large networks," IEEE Trans. Signal Process., vol. 61, no. 11, pp. 2850–2865, Jun. 2013.
[34] Z. Wang, W. Dong, W. Zhang, and C. W. Tan, "Rumor source detection with multiple observations: Fundamental limits and algorithms," in Proc. ACM SIGMETRICS 2014, Austin, TX, USA, Jun. 2014, pp. 1–13.
[35] W. Dong, W. Zhang, and C. W. Tan, "Rooting out the rumor culprit from suspects," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Istanbul, Turkey, Jul. 2013, pp. 2671–2675.
ODE models and applications in viral dynamics,” SIAM Rev., vol. 53, no. [7] C. Fraser, C. A. Donnelly, S. Cauchemez, et al., ”Pandemic potential of 1, pp. 3–39, Feb. 2011. a strain of influenza A (H1N1): Early findings,” Science, vol. 324, no. [32] D. Shah and T. Zaman, ”Rumors in a network: Who’s the Culprit?” 5934, pp. 1557–1561, Jun. 2009. IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 5163–5181, Aug. 2011. [8] Y. Yang, J. D. Sugimoto, M. E. Halloran, et al., ”The transmissibility [33] W. Luo, W. P. Tay, and M. Leng, ”Identifying infection sources and and control of pandemic influenza A (H1N1) virus,” Science, vol. 326, regions in large networks,” IEEE Trans. Signal Process., vol. 61, no. 11, no. 5953, pp. 729–733, Oct. 2009. pp. 2850–2865, Jun. 2013. [9] H. Yu, B. J. Cowling, L. Feng, et al., ”Human infection with avian [34] Z. Wang, W. Dong, W. Zhang and C. W. Tan, ”Rumor source detection influenza A H7N9 virus: An assessment of clinical severity,” Lancet, vol. with multiple observations: Fundamental limits and algorithms,” in Proc. 382, no. 9887, pp. 138–145, Jun. 2013. ACM SIGMETRICS 2014, Austin, TX, USA, Jun. 2014, pp. 1–13. [10] T. Tsan-Yuk Lam, B. Zhou, J. Wang, et al., ”Dissemination, divergence [35] W. Dong, W. Zhang, and C. W. Tan, ”Rooting out the rumor culprit from and establishment of H7N9 influenza viruses in China,” Nature, vol. 522, suspects,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Istanbul, Turkey, pp. 102–105, Jun. 2015. Jul. 2013, pp. 2671–2675.

[36] Z. Wang, W. Dong, W. Zhang, and C. W. Tan, "Rooting out rumor sources in online social networks: The value of diversity from multiple observations," IEEE J. Sel. Topics Signal Process., vol. 9, no. 4, pp. 663–677, Jun. 2015.
[37] X. Han, Z. Shen, W.-X. Wang, and Z. Di, "Robust reconstruction of complex networks from sparse data," Phys. Rev. Lett., vol. 114, no. 2, pp. 028701, Jan. 2015.
[38] M. Gomez-Rodriguez, J. Leskovec, A. Krause, "Inferring networks of diffusion and influence," in Proc. 16th ACM SIGKDD Conf. Knowl. Discov. Data Mining (KDD), Washington, DC, USA, Jul. 2010, pp. 1019–1028.
[39] M. Gomez-Rodriguez, J. Leskovec, and A. Krause, "Inferring networks of diffusion and influence," ACM Trans. Knowl. Discov. Data, vol. 5, no. 4, pp. 21(1–37), Feb. 2012.
[40] M. Gomez-Rodriguez, D. Balduzzi, and B. Schölkopf, "Uncovering the temporal dynamics of diffusion networks," in Proc. 28th Int. Conf. Machine Learning (ICML), Bellevue, WA, USA, Jul. 2011, pp. 561–568.
[41] M. Gomez-Rodriguez, J. Leskovec, D. Balduzzi, and B. Schölkopf, "Uncovering the structure and temporal dynamics of information propagation," Netw. Sci., vol. 2, no. 01, pp. 26–65, Apr. 2014.
[42] H. Daneshmand, M. Gomez-Rodriguez, L. Song, and B. Schölkopf, "Estimating diffusion network structures: recovery conditions, sample complexity & soft-thresholding algorithm," in Proc. 31st Int. Conf. Machine Learning (ICML), Beijing, China, Jun. 2014, vol. 32, pp. 793–801.
[43] D. Balcan, V. Colizza, B. Gonçalves, H. Hu, J. J. Ramasco, and A. Vespignani, "Multiscale mobility networks and the spatial spreading of infectious diseases," Proc. Natl. Acad. Sci. USA, vol. 106, no. 51, pp. 21484–21489, Dec. 2009.
[44] K.-L. Tsui, Z. S.-Y. Wong, D. Goldsman, et al., "Tracking infectious disease spread for global pandemic containment," IEEE Intell. Syst., vol. 28, no. 6, pp. 60–64, Nov.-Dec. 2013.
[45] M. Barthélemy, "Spatial networks," Phys. Rep., vol. 499, no. 1–3, pp. 1–101, Feb. 2010.
[46] A. Bazzani, B. Giorgini, S. Rambaldi, et al., "Statistical laws in urban mobility from microscopic GPS data in the area of Florence," J. Stat. Mech., vol. 2010, no. 05, p. P05001, May 2010.
[47] C. Peng, X. Jin, K.-C. Wong, et al., "Collective human mobility pattern from taxi trips in urban area," PLoS ONE, vol. 7, no. 4, p. e34487, Apr. 2012.
[48] X. Liang, J. Zhao, L. Dong, et al., "Unraveling the origin of exponential law in intra-urban human mobility," Sci. Rep., vol. 3, p. 2983, Oct. 2013.
[49] E. Brooks-Pollock, G. O. Roberts, and M. J. Keeling, "A dynamic model of bovine tuberculosis spread and control in Great Britain," Nature, vol. 511, no. 7508, pp. 228–231, Jul. 2014.
[50] L. Wang, X. Li, Y. Zhang, et al., "Evolution of scaling emergence in large-scale spatial epidemic spreading," PLoS ONE, vol. 6, no. 7, p. e21197, Jul. 2011.
[51] R. Albert and A.-L. Barabási, "Statistical mechanics of complex networks," Rev. Mod. Phys., vol. 74, no. 1, pp. 47–97, Jan. 2002.
[52] A. Barrat, M. Barthélemy, R. Pastor-Satorras, et al., "The architecture of complex weighted networks," Proc. Natl. Acad. Sci. USA, vol. 101, no. 11, pp. 3747–3752, Mar. 2004.
[53] X. Wan, J. Liu, W.-K. Cheung, T. Tong, "Inferring epidemic network topology from surveillance data," PLoS ONE, vol. 30, no. 9(6), pp. e100661, Jun. 2014.
[54] B. Shi, J. Liu, X.-N. Zhou, G.-J. Yang, "Inferring Plasmodium vivax transmission networks from tempo-spatial surveillance data," PLoS Negl. Trop. Dis., vol. 6, no. 8(2), pp. e2682, Feb. 2014.
[55] X. Yang, J. Liu, X.-N. Zhou, W.-K. Cheung, "Inferring disease transmission networks at a metapopulation level," Health Inf. Sci. Syst., vol. 17, no. 2, pp. 8, Nov. 2014.
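
Supplementary sketch for Appendix B. The diffusion-rate rule of Eq. (32) can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the network is taken to be undirected and unweighted, the degree of a node is taken as its row sum in the adjacency matrix, and `theta` is passed as a fixed number rather than drawn from the Gaussian distribution described in Appendix B. The function name `diffusion_rates` and the small star-shaped test network are hypothetical.

```python
def diffusion_rates(adjacency, theta, C):
    """Sketch of Eq. (32): p_ij = C * b_ij * k_j**theta / sum_l(b_il * k_l**theta).

    Assumptions (not from the paper): `adjacency` is a symmetric 0/1 matrix
    given as a list of lists, and the degree k_j is the j-th row sum.
    """
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    rates = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Only neighbours of i contribute, so every degree used here is >= 1.
        neighbours = [l for l in range(n) if adjacency[i][l]]
        norm = sum(degrees[l] ** theta for l in neighbours)
        if norm == 0:
            continue  # isolated node: all outgoing rates stay zero
        for j in neighbours:
            rates[i][j] = C * degrees[j] ** theta / norm
    return rates


# Hypothetical example: a star with hub 0 and three leaves.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
rates = diffusion_rates(star, theta=0.5, C=0.1)
```

By construction of Eq. (32), the outgoing rates of every non-isolated node sum to $C$, so $C$ acts as the total per-capita emigration rate, while $\theta$ only controls how that total is shared among neighbours of different degree.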
