Uncovering the Temporal Dynamics of Diffusion Networks

Manuel Gomez-Rodriguez (1,2), David Balduzzi (1), Bernhard Schölkopf (1)
(1) MPI for Intelligent Systems and (2) Stanford University

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

Abstract

Time plays an essential role in the diffusion of information, influence and disease over networks. In many cases we only observe when a node copies information, makes a decision or becomes infected, but the connectivity, transmission rates between nodes and transmission sources are unknown. Inferring the underlying dynamics is of outstanding interest since it enables forecasting, influencing and retarding infections, broadly construed. To this end, we model diffusion processes as discrete networks of continuous temporal processes occurring at different rates. Given cascade data (observed infection times of nodes) we infer the edges of the global diffusion network and estimate the transmission rates of each edge that best explain the observed data. The optimization problem is convex. The model naturally (without heuristics) imposes sparse solutions and requires no parameter tuning. The problem decouples into a collection of independent smaller problems, thus scaling easily to networks on the order of hundreds of thousands of nodes. Experiments on real and synthetic data show that our algorithm both recovers the edges of diffusion networks and accurately estimates their transmission rates from cascade data.

1. Introduction

Diffusion and propagation processes have received increasing attention in a broad range of domains: information propagation (Adar & Adamic, 2005; Gomez-Rodriguez et al., 2010; Meyers & Leskovec, 2010), social networks (Kempe et al., 2003; Lappas et al., 2010), viral marketing (Watts & Dodds, 2007) and epidemiology (Wallinga & Teunis, 2004).

Observing a diffusion process often reduces to noting when nodes (people, blogs, etc.) reproduce a piece of information, get infected by a virus, or buy a product. Epidemiologists can observe when a person becomes ill but they cannot tell who infected her or how many exposures and how much time were necessary for the infection to take hold. In information propagation, we observe when a blog mentions a piece of information. However, if, as is often the case, the blogger does not link to her source, we do not know where she acquired the information or how long it took her to post it. Finally, viral marketers can track when customers buy products or subscribe to services, but typically cannot observe who influenced customers' decisions, how long they took to make up their minds, or when they passed recommendations on to other customers. In all these scenarios, we observe where and when but not how or why information (be it in the form of a virus, a meme, or a decision) propagates through a population of individuals. The mechanism underlying the process is hidden. However, the mechanism is of outstanding interest in all three cases, since understanding diffusion is necessary for stopping infections, predicting meme propagation, or maximizing sales of a product.

This article presents a method for inferring the mechanisms underlying diffusion processes based on observed infections. To achieve this aim, we construct a model incorporating some basic assumptions about the spatiotemporal structures that generate diffusion processes. The assumptions are as follows. First, diffusion processes occur over static (fixed) but unknown networks (directed graphs). Second, infections are binary, i.e., a node is either infected or it is not; we do not model partial infections or the partial propagation of information. Third, infections along edges of the network occur independently of each other. Fourth, an infection can occur at different times: the likelihood of node a infecting node b at time t is modeled via a probability density function depending on a, b and t. Finally, we observe all infections occurring in the network during the recorded time window. Our aim is to infer the connectivity of the network and the likelihood of infections across its edges after observing the times at which nodes in the network become infected.

In more detail, we formulate a generative probabilistic model of diffusion that aims to describe realistically how infections occur over time in a static network. Finding the optimal network and transmission rates maximizing the likelihood of an observed set of infection cascades reduces to solving a convex program. The convex problem decouples into many smaller problems, allowing for natural parallelization so that our algorithm scales to networks with hundreds of thousands of nodes. We show the effectiveness of our method by reconstructing the connectivity and continuous temporal dynamics of synthetic and real networks using cascade data.

Related work. The work most closely related to ours (Gomez-Rodriguez et al., 2010; Meyers & Leskovec, 2010) also uses a generative probabilistic model for inferring diffusion networks. Gomez-Rodriguez et al. (2010) (NETINF) infer network connectivity using submodular optimization, and Meyers & Leskovec (2010) (CONNIE) infer not only the connectivity but also a prior probability of infection for every edge using a convex program and some heuristics. However, both papers force the transmission rate between all nodes to be fixed rather than inferred. In contrast, our model allows transmission at different rates across different edges so that we can infer temporally heterogeneous interactions within a network, as found in real-world examples. Thus, we can now infer the temporal dynamics of the underlying network.

The main innovation of this paper is to model diffusion as a spatially discrete network of continuous, conditionally independent temporal processes occurring at different rates. Infection transmission depends on the complex intricacies of the underlying mechanisms (e.g., a person's susceptibility to viral infections depends on weather, diet, age, stress levels, prior exposures to similar pathogens and so on). We avoid modeling the mechanisms underlying individual infections, and instead develop a data-driven approach, suitable for large-scale analyses, that infers the diffusion process using only the visible spatiotemporal traces (cascades) it generates. We therefore model diffusion using only time-dependent pairwise transmission likelihoods between pairs of nodes, transmission rates and infection times, but not prior probabilities of infection that depend on unknown external factors. To the best of our knowledge, the continuous temporal dynamics of diffusion networks have not been modeled or inferred in previous work. We believe this is a key point for understanding diffusion processes.

2. Problem formulation

This paper develops a method for inferring the spatiotemporal [...] In this section we formulate our model, starting from the data it is designed for, and concluding with a precise statement of the network inference problem.

Data. Observations are recorded on a fixed population of $N$ nodes and consist of a set $C$ of cascades $\{\mathbf{t}^1, \ldots, \mathbf{t}^{|C|}\}$. Each cascade $\mathbf{t}^c$ is a record of observed infection times within the population during a time interval of length $T^c$. A cascade is an $N$-dimensional vector $\mathbf{t}^c := (t^c_1, \ldots, t^c_N)$ recording when nodes are infected, $t^c_k \in [0, T^c] \cup \{\infty\}$. The symbol $\infty$ labels nodes that are not infected during the observation window $[0, T^c]$; it does not imply that nodes are never infected. The 'clock' is reset to 0 at the start of each cascade. Lengthening the observation window $T^c$ increases the number of observed infections within a cascade $c$ and results in a more representative sample of the underlying dynamics. However, these advantages must be weighed against the cost of observing for longer periods. For simplicity we assume $T^c = T$ for all cascades; the results generalize trivially.

The time-stamps assigned to nodes by a cascade induce the structure of a directed acyclic graph (DAG) on the network (which is not acyclic in general) by defining node $i$ to be a parent of node $j$ if $t_i < t_j$. Thus, it is meaningful to refer to parents and children within a cascade, but not on the network. The DAG structure dramatically simplifies the computational complexity of the inference problem. Also, since the underlying network is inferred from many cascades (each of which imposes its own DAG structure), the inferred network is typically not a DAG.

Pairwise transmission likelihood. The first step in modeling diffusion dynamics is to consider pairwise interactions. We assume that infections can occur at different rates over different edges of a network, and aim to infer the transmission rates between pairs of nodes in the network.

Define $f(t_i \mid t_j; \alpha_{j,i})$ as the conditional likelihood of transmission between a node $j$ and a node $i$. The transmission likelihood depends on the infection times $(t_j, t_i)$ and a pairwise transmission rate $\alpha_{j,i}$. A node cannot be infected by a node infected later in time. In other words, a node $j$ that has been infected at a time $t_j$ may infect a node $i$ at a time $t_i$ only if $t_j < t_i$. Although in some scenarios it may be possible to estimate a non-parametric likelihood empirically, for simplicity we consider three well-known parametric models: exponential, power-law and Rayleigh (see Table 1). Transmission rates are denoted as $\alpha_{j,i} \geq 0$, and $\delta$ is the minimum allowed time difference in the power-law model, needed for the likelihood to be bounded. As $\alpha_{j,i} \to 0$ the likelihood of infection tends to zero and the expected transmission time becomes arbitrarily long. Without loss of generality, we consider $\delta = 1$ in the power-law model from now on.
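To make the three parametric transmission likelihoods concrete, the following is a minimal sketch in Python. Table 1 is not reproduced in this section, so the functional forms below are an assumption: they are the standard exponential, power-law (Pareto-type) and Rayleigh densities written in terms of the time difference $t_i - t_j$, the rate $\alpha_{j,i}$ and the power-law offset $\delta$. The function names are hypothetical and chosen only for illustration.

```python
import math


def exponential_likelihood(t_i, t_j, alpha):
    """Assumed form: alpha * exp(-alpha * (t_i - t_j)) if t_j < t_i, else 0."""
    dt = t_i - t_j
    # No transmission backwards in time, at zero rate, or to an unobserved (infinite) time.
    if dt <= 0 or alpha <= 0 or not math.isfinite(dt):
        return 0.0
    return alpha * math.exp(-alpha * dt)


def power_law_likelihood(t_i, t_j, alpha, delta=1.0):
    """Assumed form: (alpha/delta) * ((t_i - t_j)/delta)**(-1 - alpha) if t_j + delta < t_i, else 0."""
    dt = t_i - t_j
    # delta is the minimum allowed time difference that keeps the density bounded.
    if dt <= delta or alpha <= 0 or not math.isfinite(dt):
        return 0.0
    return (alpha / delta) * (dt / delta) ** (-1.0 - alpha)


def rayleigh_likelihood(t_i, t_j, alpha):
    """Assumed form: alpha * (t_i - t_j) * exp(-alpha * (t_i - t_j)**2 / 2) if t_j < t_i, else 0."""
    dt = t_i - t_j
    if dt <= 0 or alpha <= 0 or not math.isfinite(dt):
        return 0.0
    return alpha * dt * math.exp(-0.5 * alpha * dt * dt)


# Usage: node j infected at time 0, node i infected at time 2, rate alpha = 1.
for f in (exponential_likelihood, power_law_likelihood, rayleigh_likelihood):
    print(f.__name__, f(2.0, 0.0, 1.0))
```

In each case the value is zero unless the candidate parent was infected strictly earlier, which mirrors the constraint $t_j < t_i$ stated above, and it vanishes as the rate goes to zero.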
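For the cascade representation described under Data, a minimal Python sketch may help. It stores a cascade as a list of infection times with `math.inf` for nodes not infected within the observation window, and derives the parent sets from the rule that node $j$ is a parent of node $i$ when $t_j < t_i$. The helper names `cascade_parents` and `candidate_edges` are hypothetical, and restricting candidate edges to children with finite infection times is a simplification for illustration, not the inference procedure of the paper.

```python
import math

INF = math.inf  # label for nodes not infected within the observation window


def cascade_parents(times):
    """Map every node i to the list of nodes j with t_j < t_i in this cascade."""
    return {
        i: [j for j, t_j in enumerate(times) if t_j < t_i]
        for i, t_i in enumerate(times)
    }


def candidate_edges(cascades):
    """Union of (parent, child) pairs over all cascades, for observed infections only."""
    edges = set()
    for times in cascades:
        for i, parents in cascade_parents(times).items():
            if math.isfinite(times[i]):  # illustrative choice: skip uninfected children
                edges.update((j, i) for j in parents)
    return edges


# Two toy cascades over N = 3 nodes; node 2 is never infected in the first one.
cascades = [
    [0.0, 1.5, INF],
    [2.0, 0.5, 3.0],
]

print(cascade_parents(cascades[0]))  # each cascade alone is acyclic
print(sorted(candidate_edges(cascades)))
```

Each cascade on its own only produces edges from earlier to later infection times, so it is acyclic. The union over the two toy cascades contains both (0, 1) and (1, 0), which illustrates why a network inferred from many cascades is typically not a DAG.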
