
Discovering Latent Network Structure in Point Process Data

Scott W. Linderman [email protected] Harvard University, Cambridge, MA 02138 USA

Ryan P. Adams [email protected] Harvard University, Cambridge, MA 02138 USA

Abstract

Networks play a central role in modern data analysis, enabling us to reason about systems by studying the relationships between their parts. Most often in network analysis, the edges are given. However, in many systems it is difficult or impossible to measure the network directly. Examples of latent networks include economic interactions linking financial instruments and patterns of reciprocity in gang violence. In these cases, we are limited to noisy observations of events associated with each node. To enable analysis of these implicit networks, we develop a probabilistic model that combines mutually-exciting point process models with random graph models. We show how the Poisson superposition principle enables an elegant auxiliary variable formulation and a fully-Bayesian, parallel inference algorithm. We evaluate this new model empirically on several datasets.

1. Introduction

Many types of modern data are characterized via relationships on a network. Social network analysis is the most commonly considered example, where the properties of individuals (vertices) can be inferred from "friendship" type connections (edges). Such analyses are also critical to understanding regulatory biological pathways, trade relationships between nations, and propagation of disease. The tasks associated with such data may be unsupervised (e.g., identifying low-dimensional representations of edges or vertices) or supervised (e.g., predicting unobserved links in the graph). Traditionally, network analysis has focused on explicit network problems in which the graph itself is considered to be the observed data. That is, the vertices are considered known and the data are the entries in the associated adjacency matrix. A rich literature has arisen in recent years for applying statistical models to this type of problem, e.g., Liben-Nowell & Kleinberg (2007); Hoff (2008); Goldenberg et al. (2010).

In this paper we are concerned with implicit networks that cannot be observed directly, but about which we wish to perform analysis. In an implicit network, the vertices or edges of the graph may not be directly observed, but the graph structure may be inferred from noisy emissions. These noisy observations are assumed to have been generated according to underlying dynamics that respect the latent network structure.

For example, trades on financial stock markets are executed thousands of times per second. Trades of one stock are likely to cause subsequent activity on stocks in related industries. How can we infer such interactions and disentangle them from market-wide fluctuations? Discovering latent structure underlying financial markets not only reveals interpretable patterns of interaction, but also provides insight into the stability of the market. In Section 4 we will analyze the stability of mutually-excitatory systems, and in Section 6 we will explore how stock similarity may be inferred from trading activity.

As another example, both the edges and vertices may be latent. In Section 7, we examine patterns of violence in Chicago, which can often be attributed to social structures in the form of gangs. We would expect that attacks from one gang onto another might induce cascades of violence, but the vertices (gang identity of both perpetrator and victim) are unobserved. As with the financial data, it should be possible to exploit dynamics to infer these social structures. In this case spatial information is available as well, which can help inform latent identities.

In both of these examples, the noisy emissions have the form of events in time, or "spikes," and our intuition is that a spike at a vertex will induce activity at adjacent vertices.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).
In this paper, we formalize this idea into a probabilistic model based on mutually-interacting point processes. Specifically, we combine the Hawkes process (Hawkes, 1971) with recently developed exchangeable random graph priors. This combination allows us to reason about latent networks in terms of the way that they regulate interaction in the Hawkes process. Inference in the resulting model can be done with Markov chain Monte Carlo, and an elegant data augmentation scheme results in efficient parallelism.

2. Preliminaries

2.1. Poisson Processes

Point processes are fundamental statistical objects that yield random finite sets of events {s_n}_{n=1}^N ⊂ S, where S is a compact subset of R^D, for example, space or time. The Poisson process is the canonical example. It is governed by a nonnegative "rate" or "intensity" function, λ(s) : S → R_+. The number of events in a subset S' ⊂ S follows a Poisson distribution with mean ∫_{S'} λ(s) ds. Moreover, the numbers of events in disjoint subsets are independent.

We use the notation {s_n}_{n=1}^N ∼ PP(λ(s)) to indicate that a set of events {s_n}_{n=1}^N is drawn from a Poisson process with rate λ(s). The likelihood is given by

    p({s_n}_{n=1}^N | λ(s)) = exp{ −∫_S λ(s) ds } ∏_{n=1}^N λ(s_n).    (1)

In this work we will make use of a special property of Poisson processes, the Poisson superposition theorem, which states that {s_n} ∼ PP(λ_1(s) + ... + λ_K(s)) can be decomposed into K independent Poisson processes. Letting z_n denote the origin of the n-th event, we perform the decomposition by independently sampling each z_n from Pr(z_n = k) ∝ λ_k(s_n), for k ∈ {1, ..., K} (Daley & Vere-Jones, 1988).

2.2. Hawkes Processes

Though Poisson processes have many nice properties, they cannot capture interactions between events. For this we turn to a more general model known as Hawkes processes (Hawkes, 1971). A Hawkes process consists of K point processes and gives rise to sets of marked events {s_n, c_n}_{n=1}^N, where c_n ∈ {1, ..., K} specifies the process on which the n-th event occurred. For now, we assume the events are points in time, i.e., s_n ∈ [0, T]. Each of the K processes is a conditionally Poisson process with a rate λ_k(t | {s_n : s_n < t}) that depends on the history of events up to time t.

Hawkes processes have additive interactions. Each process has a "background rate" λ_{0,k}(t), and each event s_n on process k adds a nonnegative impulse response h_{k,k'}(t − s_n) to the intensity of other processes k'. Causality and locality of influence are enforced by requiring h_{k,k'}(Δt) to be zero for Δt ∉ [0, Δt_max].

By the superposition theorem for Poisson processes, these additive components can be considered independent processes, each giving rise to their own events. We augment our data with a latent random variable z_n ∈ {0, ..., n−1} to indicate the cause of the n-th event (0 if the event is due to the background rate and 1, ..., n−1 if it was caused by a preceding event). The augmented Hawkes likelihood is then the product of likelihoods of each Poisson process:

    p({(s_n, c_n, z_n)}_{n=1}^N | {λ_{0,k}(t)}, {h_{k,k'}(Δt)}) =
        ∏_{k=1}^K p({s_n : c_n = k ∧ z_n = 0} | λ_{0,k}(t))
        × ∏_{n=1}^N ∏_{k=1}^K p({s_{n'} : c_{n'} = k ∧ z_{n'} = n} | h_{c_n,k}(t − s_n)),

where the densities in the product are given by Equation 1.

Figure 1: Illustration of a Hawkes process. Events induce impulse responses on connected processes and spawn "child" events. See the main text for a complete description.

Figure 1 illustrates a causal cascade of events for a simple network of three processes (I-III). The first event is caused by the background rate (z_1 = 0), and it induces impulse responses on processes II and III. Event 2 is spawned by the impulse on the third process (z_2 = 1), and feeds back onto processes I and II. In some cases a single parent event induces multiple children, e.g., event 4 spawns events 5a-c. In this simple example, processes excite one another, but do not excite themselves. Next we will introduce more sophisticated models for such interaction networks.

2.3. Random Graph Models

Graphs of K nodes correspond to K × K matrices. Unweighted graphs are binary adjacency matrices A where A_{k,k'} = 1 indicates a directed edge from node k to node k'. Weighted directed graphs can be represented by a real matrix W whose entries indicate the weights of the edges. Random graph models reflect the probability of different network structures through distributions over these matrices.

Recently, many random graph models have been unified under an elegant theoretical framework due to Aldous and Hoover (Aldous, 1981; Hoover, 1979). See Lloyd et al. (2012) for an overview. Conceptually, the Aldous-Hoover representation characterizes the class of exchangeable random graphs, that is, graph models for which the joint probability is invariant under permutations of the node labels. Just as de Finetti's theorem equates exchangeable sequences to independent draws from a random probability measure, Aldous-Hoover renders the entries of A conditionally independent given latent variables of each node.

Empty graph models (A_{k,k'} ≡ 0) and complete models (A_{k,k'} ≡ 1) are trivial examples, but much more structure may be encoded. For example, consider a model in which nodes are endowed with a location in space, x_k ∈ R^D. This could be an abstract feature space or a real location like the center of a gang territory. The probability of connection between two nodes decreases with the distance between them as A_{k,k'} ∼ Bern(ρ e^{−||x_k − x_{k'}||/τ}), where ρ is the overall sparsity and τ is the characteristic distance scale.

Many models can be constructed in this manner. Stochastic block models, latent eigenmodels, and their nonparametric extensions all fall under this class (Lloyd et al., 2012). We will leverage the generality of the Aldous-Hoover formalism to build a flexible model and inference algorithm for Hawkes processes with structured interaction networks.

3. The Network Hawkes Model

In order to combine Hawkes processes and random network models, we decompose the Hawkes impulse response h_{k,k'}(Δt) as follows:

    h_{k,k'}(Δt) = A_{k,k'} W_{k,k'} g_{θ_{k,k'}}(Δt).    (2)

Here, A ∈ {0,1}^{K×K} is a binary adjacency matrix and W ∈ R_+^{K×K} is a non-negative weight matrix. Together these specify the sparsity structure and strength of the interaction network, respectively. The non-negative function g_{θ_{k,k'}}(Δt) captures the temporal aspect of the interaction. It is parameterized by θ_{k,k'} and satisfies two properties: a) it has bounded support for Δt ∈ [0, Δt_max], and b) it integrates to one. In other words, g is a probability density with compact support.

Decomposing h as in Equation 2 has many advantages. It allows us to express our separate beliefs about the sparsity structure of the interaction network and the strength of the interactions through a spike-and-slab prior on A and W (Mohamed et al., 2012). The empty graph model recovers independent background processes, and the complete graph recovers the standard Hawkes process. Making g a probability density endows W with units of "expected number of events" and allows us to compare the relative strength of interactions. The form suggests an intuitive generative model: for each impulse response, draw m ∼ Poisson(W_{k,k'}) induced events and draw the m child event times i.i.d. from g, enabling computationally tractable conjugate priors.

Intuitively, the background rates, λ_{0,k}(t), explain events that cannot be attributed to preceding events. In the simplest case the background rate is constant. However, there are often fluctuations in overall intensity that are shared among the processes, and not reflective of process-to-process interaction, as we will see in the daily variations in trading volume on the S&P100 and the seasonal trends in homicide. To capture these shared background fluctuations, we use a sparse Log Gaussian Cox process (Møller et al., 1998) to model the background rate:

    λ_{0,k}(t) = µ_k + α_k exp{y(t)},    y(t) ∼ GP(0, K(t, t')).

The kernel K(t, t') describes the covariance structure of the background rate that is shared by all processes. For example, a periodic kernel may capture seasonal or daily fluctuations. The offset µ_k accounts for varying background intensities among processes, and the scaling factor α_k governs how sensitive process k is to these background fluctuations (when α_k = 0 we recover the constant background rate).

Finally, in some cases the process identities, c_n, must also be inferred. With gang incidents in Chicago we may have only a location, x_n ∈ R². In this case, we may place a spatial Gaussian mixture model over the c_n's, as in Cho et al. (2013). Alternatively, we may be given the label of the community in which the incident occurred, but we suspect that interactions occur between clusters of communities. In this case we can use a simple clustering model or a nonparametric model like that of Blundell et al. (2012).

3.1. Inference with Gibbs Sampling

We present a Gibbs sampling procedure for inferring the model parameters, W, A, {θ_{k,k'}}, {λ_{0,k}(t)}, and, if necessary, {c_n}. In order to simplify our Gibbs updates, we will also sample a set of parent assignments {z_n} for each event. Incorporating these parent variables enables conjugate prior distributions for W, θ_{k,k'}, and, in the case of constant background rates, λ_{0,k}. Detailed derivations are provided in the supplementary material.

Sampling weights W. A gamma prior on the weights, W_{k,k'} ∼ Gamma(α_W^0, β_W^0), results in the conditional distribution,

    W_{k,k'} | {s_n, c_n, z_n}_{n=1}^N, θ_{k,k'} ∼ Gamma(α_{k,k'}, β_{k,k'}),
    α_{k,k'} = α_W^0 + ∑_{n=1}^N ∑_{n'=1}^N δ_{c_n,k} δ_{c_{n'},k'} δ_{z_{n'},n},
    β_{k,k'} = β_W^0 + ∑_{n=1}^N δ_{c_n,k}.

The posterior parameters correspond to the number of events caused by an interaction and the total unweighted rate induced by events on node k. Here and elsewhere, δ_{i,j} is the Kronecker delta function. We use the inverse-scale parameterization of the gamma distribution.

Sampling background rates λ_{0,k}. Similarly, for constant background rates λ_{0,k}(t) ≡ λ_{0,k}, the prior λ_{0,k} ∼ Gamma(α_λ^0, β_λ^0) is conjugate with the likelihood and yields the conditional distribution

    λ_{0,k} | {s_n, c_n, z_n}_{n=1}^N ∼ Gamma(α_λ, β_λ),
    α_λ = α_λ^0 + ∑_n δ_{c_n,k} δ_{z_n,0},
    β_λ = β_λ^0 + T.

This conjugacy no longer holds for Gaussian process background rates, but conditioned upon the parent variables, we must simply fit a Gaussian process for those events for which z_n = 0. We use elliptical slice sampling (Murray et al., 2010) for this purpose.

Sampling impulse response parameters θ_{k,k'}. The logistic-normal density with parameters θ_{k,k'} = {µ, τ} provides a flexible model for the impulse response:

    g_{k,k'}(Δt | µ, τ) = (1/Z) exp{ −(τ/2) (σ^{−1}(Δt/Δt_max) − µ)² },
    σ^{−1}(x) = ln(x/(1−x)),
    Z = (Δt (Δt_max − Δt) / Δt_max) (τ/2π)^{−1/2}.

The normal-gamma prior µ, τ ∼ NG(µ, τ | µ_µ^0, κ_µ^0, α_τ^0, β_τ^0) is conjugate and yields a conditional distribution with the following sufficient statistics:

    x_{n,n'} = ln(s_{n'} − s_n) − ln(Δt_max − (s_{n'} − s_n)),
    m = ∑_{n=1}^N ∑_{n'=1}^N δ_{c_n,k} δ_{c_{n'},k'} δ_{z_{n'},n},
    x̄ = (1/m) ∑_{n=1}^N ∑_{n'=1}^N δ_{c_n,k} δ_{c_{n'},k'} δ_{z_{n'},n} x_{n,n'}.

Intuitively, these correspond to the number of events caused by an interaction and their average delay.

Collapsed Gibbs sampling A and z_n. With Aldous-Hoover graph priors, the entries in the binary adjacency matrix A are conditionally independent given the parameters of the prior. The likelihood introduces dependencies between the rows of A, but each column can be sampled in parallel. Gibbs updates are complicated by strong dependencies between the graph and the parent variables, z_n. Specifically, if z_{n'} = n, then we must have A_{c_n,c_{n'}} = 1. To improve the mixing of our sampling algorithm, first we update A | {s_n, c_n}, W, θ_{k,k'} by marginalizing the parent variables. The posterior is determined by the likelihood of the conditionally Poisson process λ_{k'}(t | {s_n : s_n < t}) (Equation 1) with and without interaction A_{k,k'}, and the prior comes from the Aldous-Hoover graph model. Then we update z_n | {s_n, c_n}, A, W, θ_{k,k'} by sampling from the discrete conditional distribution. Though there are N parent variables, they are conditionally independent and may be sampled in parallel. We have implemented our inference algorithm on GPUs to capitalize on this parallelism (code available at https://github.com/slinderman).

Sampling process identities c_n. As with the adjacency matrix, we use a collapsed Gibbs sampler to marginalize out the parent variables when sampling the process identities. Unfortunately, the c_n's are not conditionally independent and hence must be sampled sequentially.

Computational concerns. Compact impulse responses limit the number of potential event parents and significantly reduce the memory requirements and running time of our algorithm. If the average firing rate is constant, the expected number of potential parents per event will be linear in K. Summing the per-event contributions to the log likelihood can be done in O(log N) time using standard parallel reductions. Hence, after parallelizing over the columns of A and the parents z_n, one step of our sampling algorithm takes O(K(K + log N)) time when process identities are known, and O((K + N)(K + log N)) time otherwise. On the datasets used in the following experiments, our GPU implementation achieves 5-50 iterations per second.
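The collapsed parent update above is short to express in code. The following is a minimal numpy sketch, not the paper's GPU implementation: it assumes time-sorted events, a constant background rate, and a generic impulse density g passed as a callable (the exponential stand-in below and all names are illustrative; the paper uses the logistic-normal density).

```python
import numpy as np

def resample_parents(s, c, lam0, A, W, g, dt_max, rng):
    """Gibbs update for parent assignments z_n.

    z[n] = 0 means event n came from the background rate of process c[n];
    z[n] = m + 1 means it was caused by the earlier event m. Following the
    superposition principle, Pr(z_n = m) is proportional to the rate that
    component m contributes at time s[n].
    """
    N = len(s)
    z = np.zeros(N, dtype=int)
    for n in range(N):
        rates = [lam0[c[n]]]            # background component
        labels = [0]
        for m in range(n - 1, -1, -1):  # scan recent events (s is sorted)
            dt = s[n] - s[m]
            if dt >= dt_max:            # compact support: no older parents
                break
            rates.append(A[c[m], c[n]] * W[c[m], c[n]] * g(dt))
            labels.append(m + 1)
        rates = np.array(rates)
        z[n] = labels[rng.choice(len(rates), p=rates / rates.sum())]
    return z
```

With A ≡ 0 every event must be attributed to the background, which provides a quick sanity check on an implementation.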
4. Stability of Network Hawkes Processes

Due to their recurrent nature, Hawkes processes must be constrained to ensure their positive feedback does not lead to infinite numbers of events. A stable system must satisfy

    λ_max = max | eig(A ⊙ W) | < 1

(c.f. Daley & Vere-Jones (1988)), where λ_max here refers to an eigenvalue rather than a rate, and ⊙ denotes the Hadamard (element-wise) product. When we are conditioning on finite datasets we do not have to worry about this. We simply place weak priors on the network parameters, e.g., a beta prior on the sparsity ρ of an Erdős–Rényi graph, and a Jeffreys prior on the scale of the gamma weight distribution. For the generative model, however, we would like to set our hyperparameters such that the prior distribution places little mass on unstable networks. In order to do so, we use tools from random matrix theory.

The celebrated circular law describes the asymptotic eigenvalue distribution for K × K random matrices with entries that are i.i.d. with zero mean and variance σ². As K grows, the eigenvalues are uniformly distributed over a disk in the complex plane centered at the origin with radius σ√K. In our case, however, the mean of the entries, µ = E[A_{k,k'} W_{k,k'}], is not zero. Silverstein (1994) has analyzed such "noncentral" random matrices and shown that the largest eigenvalue is asymptotically distributed as λ_max ∼ N(µK, σ²).

In the simple case of W_{k,k'} ∼ Gamma(α, β) and A_{k,k'} ∼ Bern(ρ), we have µ = ρα/β and σ = √(ρ((1−ρ)α² + α))/β. For a given K, α, and β, we can tune the sparsity parameter ρ to achieve stability with high probability. We simply set ρ such that the minimum of σ√K and, say, µK + 3σ, equals one. Figures 2a and 2b show a variety of weight distributions and the maximum stable ρ. Increasing the network size, the mean, or the variance will require a concomitant increase in sparsity.

This approach relies on asymptotic eigenvalue distributions, and it is unclear how quickly the spectra of random matrices will converge to this distribution. To test this, we computed the empirical eigenvalue distribution for random matrices of various size, mean, and variance. We generated 10⁴ random matrices for each weight distribution in Figure 2a with sizes K = 4, 64, and 1024, and ρ set to the theoretical maximum indicated by dots in Figure 2b. The theoretical and empirical distributions of the maximum eigenvalue are shown in Figures 2c and 2d. We find that for small mean and variance weights, for example Gamma(1,5) in Figure 2c, the empirical results closely match the theory. As the weights grow larger, as in Gamma(8,12) in 2d, the empirical eigenvalue distributions have increased variance and lead to a greater than expected probability of unstable matrices for the range of network sizes tested here. We conclude that networks with strong weights should be counterbalanced by strong sparsity limits, or additional structure in the adjacency matrix that prohibits excitatory feedback loops.

Figure 2: Empirical and theoretical distribution of the maximum eigenvalue for Erdős–Rényi graphs with gamma weights. (a) Four gamma weight distributions. The colors correspond to the curves in the remaining panels. (b) Sparsity ρ that theoretically yields 99% probability of stability as a function of p(W) and K. (c) and (d) Theoretical (solid) and empirical (dots) distribution of the maximum eigenvalue. Color corresponds to the weight distribution in (a) and intensity indicates K and ρ shown in (b).

5. Synthetic Results

Our inference algorithm is first tested on synthetic data generated from the network Hawkes model. We perform two tests: a) a link prediction task where the process identities are given and the goal is to simply infer whether or not an interaction exists, and b) an event prediction task where we measure the probability of held-out event sequences.

The network Hawkes model can be used for link prediction by considering the posterior probability of interactions P(A_{k,k'} | {s_n, c_n}). By thresholding at varying probabilities we compute a ROC curve. A standard Hawkes process assumes a complete set of interactions (A_{k,k'} ≡ 1), but we can similarly threshold its inferred weight matrix to perform link prediction.

Cross correlation provides a simple alternative measure of interaction. By summing the cross-correlation over offsets Δt ∈ [0, Δt_max), we get a measure of directed interaction. A probabilistic alternative is offered by the generalized linear model for point processes (GLM), a popular model for spiking dynamics in computational neuroscience (Paninski, 2004). The GLM allows for constant background rates and both excitatory and inhibitory interactions. Impulse responses are modeled with linear basis functions. Area under the impulse response provides a measure of directed excitatory interaction that we use to compute a ROC curve. See the supplementary material for a detailed description of this model.

We sampled ten network Hawkes processes of 30 nodes each with Erdős–Rényi graph models, constant background rates, and the priors described in Section 3. The Hawkes processes were simulated for T = 1000 seconds.
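These stability checks are easy to reproduce numerically. Below is a minimal numpy sketch: `spectral_radius` computes λ_max = max|eig(A ⊙ W)|, and `max_stable_sparsity` tunes ρ from the Gamma(α, β)/Bern(ρ) moments µ = ρα/β and σ = √(ρ((1−ρ)α² + α))/β. One deliberate deviation to flag: the paper sets the minimum of σ√K and µK + 3σ to one, whereas this sketch conservatively bisects on the maximum of the two so that both the bulk radius and the mean-induced outlier stay below one; the function names and the bisection strategy are illustrative.

```python
import numpy as np

def spectral_radius(A, W):
    """lambda_max = max |eig(A * W)|; the process is stable iff this is < 1."""
    return np.max(np.abs(np.linalg.eigvals(A * W)))

def max_stable_sparsity(alpha, beta, K, n_sigma=3.0, iters=60):
    """Bisect for the largest rho with max(sigma*sqrt(K), mu*K + n_sigma*sigma) <= 1."""
    def edge(rho):
        mu = rho * alpha / beta
        sigma = np.sqrt(rho * ((1.0 - rho) * alpha**2 + alpha)) / beta
        return max(sigma * np.sqrt(K), mu * K + n_sigma * sigma)
    if edge(1.0) <= 1.0:
        return 1.0            # even a dense graph is predicted stable
    lo, hi = 0.0, 1.0         # invariant: edge(lo) <= 1 < edge(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if edge(mid) <= 1.0 else (lo, mid)
    return lo
```

For Gamma(1, 5) weights and K = 64 this returns a ρ well below one, matching the qualitative message of Figure 2b: larger or more variable weights demand sparser graphs.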

We used the models above to predict the presence or absence of interactions. The results of this experiment are shown in the ROC curves of Figure 3. The network Hawkes model accurately identifies the sparse interactions, outperforming all other models.

Figure 3: Comparison of models on a link prediction test averaged across ten randomly sampled synthetic networks of 30 nodes each. The network Hawkes model with the correct Erdős–Rényi graph prior outperforms a standard Hawkes model, GLM, and simple thresholding of the cross-correlation matrix.

With the Hawkes process and the GLM we can evaluate the log likelihood of held-out test data. On this task, the network Hawkes outperforms the competitors for all networks. On average, the network Hawkes model achieves 2.2 ± 0.1 bits/spike improvement in predictive log likelihood over a homogeneous Poisson process. Figure 4 shows that on average the standard Hawkes and the GLM provide only 60% and 72%, respectively, of this predictive power. See the supplementary material for further analysis.

Figure 4: Comparison of predictive log likelihoods for the same set of networks as in Figure 3, compared to a baseline of a Poisson process with constant rate. Improvement in predictive likelihood over baseline is normalized by the number of events in the test data to obtain units of "bits per spike." The network Hawkes model outperforms the competitors in all sample networks.

6. Trades on the S&P 100

As an example of how Hawkes processes may discover interpretable latent structure in real-world data, we study the trades on the S&P 100 index collected at 1s intervals during the week of Sep. 28 through Oct. 2, 2009. Every time a stock price changes by 0.1% of its current price an event is logged on the stock's process, yielding a total of K = 100 processes and N = 182,037 events.

Trading volume varies substantially over the course of the day, with peaks at the opening and closing of the market. This daily variation is incorporated into the background rate via a Log Gaussian Cox Process (LGCP) with a periodic kernel (see supplementary material). We look for short-term interactions on top of this background rate with time scales of Δt_max = 60s. In Figure 5 we compare the predictive performance of independent LGCPs, a standard Hawkes process with LGCP background rates, and the network Hawkes model with LGCP background rates under two graph priors. The models are trained on four days of data and tested on the fifth.

Figure 5: Comparison of financial models on an event prediction task, relative to a homogeneous Poisson process baseline.

    Financial Model                  | Pred. log lkhd. (bits/spike)
    Indep. LGCP                      | 0.579 ± 0.006
    Std. Hawkes                      | 0.903 ± 0.003
    Net. Hawkes (Erdős–Rényi)        | 0.893 ± 0.003
    Net. Hawkes (Latent Distance)    | 0.879 ± 0.004

Figure 6: Top: A sample from the posterior distribution over embeddings of stocks from the six largest sectors of the S&P100 under a latent distance graph model with two latent dimensions. Scale bar: the characteristic length scale of the latent distance model. The latent embedding tends to embed stocks such that they are nearby to, and hence more likely to interact with, others in their sector. Bottom: Hinton diagram of the top 4 eigenvectors of A ⊙ W. Size indicates the magnitude of each stock's component in the eigenvector and colors denote sectors as in the top panel, with the addition of Materials (aqua), Utilities (orange), and Telecomm (gray). We show the eigenvectors corresponding to the four largest eigenvalues, λ_max = 0.74 (top row) to λ_4 = 0.34 (bottom row).
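The link-prediction ROC in Figure 3 can be computed directly from posterior samples of A. A minimal numpy sketch follows; the sampler itself is elided, and the function and array names are illustrative.

```python
import numpy as np

def link_prediction_roc(A_samples, A_true):
    """ROC curve for link prediction from posterior adjacency samples.

    A_samples: (S, K, K) binary adjacency samples from the posterior.
    A_true:    (K, K) ground-truth adjacency matrix.
    Returns (fpr, tpr) swept from the highest threshold to the lowest.
    """
    p_edge = A_samples.mean(axis=0).ravel()   # Monte Carlo estimate of P(A_kk' = 1)
    truth = A_true.ravel().astype(bool)
    thresholds = np.unique(np.concatenate(([0.0, 1.0], p_edge)))[::-1]
    fpr, tpr = [], []
    for t in thresholds:
        pred = p_edge >= t
        tpr.append(np.sum(pred & truth) / max(truth.sum(), 1))
        fpr.append(np.sum(pred & ~truth) / max((~truth).sum(), 1))
    return np.array(fpr), np.array(tpr)
```

Thresholding an inferred weight matrix, as done for the standard Hawkes baseline, works the same way with W in place of `p_edge`.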

Though the network Hawkes is slightly outperformed by the standard Hawkes, the difference is small relative to the performance improvement from considering interactions, and the inferred network parameters provide interpretable insight into the market structure.

In the latent distance model for A, each stock has a latent embedding x_k ∈ R² such that nearby stocks are more likely to interact, as described in Section 2.3. Figure 6 shows a sample from the posterior distribution over embeddings in R² for ρ = 0.2 and τ = 1. We have plotted stocks in the six largest sectors, as listed on Bloomberg.com. Some sectors, notably energy and financials, tend to cluster together, indicating an increased probability of interaction between stocks in the same sector. Other sectors, such as consumer goods, are broadly distributed, suggesting that these stocks are less influenced by others in their sector. For the consumer industry, which is driven by slowly varying factors like inventory, this may not be surprising.

The Hinton diagram in the bottom panel of Figure 6 shows the top 4 eigenvectors of the interaction network. All eigenvalues are less than 1, indicating that the system is stable. The top row corresponds to the first eigenvector (λ_max = 0.74). Apple (AAPL), J.P. Morgan (JPM), and Exxon Mobil (XOM) have notably large entries in the eigenvector, suggesting that their activity will spawn cascades of self-excitation.

7. Gangs of Chicago

In our final example, we study spatiotemporal patterns of gang-related homicide in Chicago. Sociologists have suggested that gang-related homicide is mediated by underlying social networks and occurs in mutually-exciting, retaliatory patterns (Papachristos, 2009). This is consistent with a spatiotemporal Hawkes process in which processes correspond to gang territories and homicides incite further homicides in rival territories.

We study gang-related homicides between 1980 and 1995 (Block et al., 2005). Homicides are labeled by the community in which they occurred. Over this time-frame there were N = 1637 gang-related homicides in the 77 communities of Chicago.

We evaluate our model with an event-prediction task, training on 1980-1993 and testing on 1994-1995. We use a Log Gaussian Cox Process (LGCP) temporal background rate in all model variations. Our baseline is a single process with a uniform spatial rate for the city. We test two process identity models: a) the "community" model, which considers each community a separate process, and b) the "cluster" model, which groups communities into processes. The number of clusters is chosen by cross-validation (see supplementary material). For each process identity model, we compare four graph models: a) independent LGCPs (empty), b) a standard Hawkes process with all possible interactions (complete), c) a network Hawkes model with a sparsity-inducing Erdős–Rényi graph prior, and d) a network Hawkes model with a latent distance model that prefers short-range interactions.

Figure 7: Inferred interactions among clusters of community areas in the city of Chicago. (a) Predictive log likelihood for "communities" and "clusters" process identity models and four graph models. Panels (b-d) present results for the model with the highest predictive log likelihood: an Erdős–Rényi graph with K = 4 clusters. (b) The weighted interaction network in units of induced homicides over the training period (1980-1993). (c) Inferred clustering of the 77 community areas. (d) The intensity for each cluster, broken down into the offset, the shared background rate, and the interactions (units of 10⁻³ homicides per day per square kilometer).
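Sampling from the distance-dependent graph prior of Section 2.3, used as option (d) above, is direct. A minimal numpy sketch, with all parameter values and names illustrative:

```python
import numpy as np

def sample_latent_distance_graph(X, rho, tau, rng):
    """Draw A_kk' ~ Bern(rho * exp(-||x_k - x_k'|| / tau)).

    X:   (K, D) latent locations, e.g., community-area centroids.
    rho: overall sparsity; tau: characteristic distance scale.
    Returns the sampled adjacency matrix and the edge probabilities.
    """
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    p = rho * np.exp(-dists / tau)
    A = (rng.random(p.shape) < p).astype(int)
    return A, p

# Nearby nodes connect with probability near rho; distant ones rarely do.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
A, p = sample_latent_distance_graph(X, rho=0.5, tau=1.0, rng=rng)
```

Placing such a prior over A simply replaces the Bernoulli probabilities in the collapsed update for the adjacency matrix; the rest of the sampler is unchanged.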
(empty), b) a standard Hawkes process with all possible in- Discovering Latent Network Structure in Point Process Data teractions (complete), c) a network Hawkes model with a We have adapted their latent variable formulation in our sparsity-inducing Erdos-Renyi˝ graph prior, and d) a net- fully-Bayesian inference algorithm and introduced a frame- work Hawkes model with a latent distance model that work for prior distributions over the latent network. prefers short-range interactions. Others have considered special cases of the model we have The community process identity model improves predic- proposed. Blundell et al. (2012) combine Hawkes pro- tive performance by accounting for higher rates in South cesses and the Infinite Relational Model (a specific ex- and West Chicago where gangs are deeply entrenched. Al- changeable graph model with an Aldous-Hoover represen- lowing for interactions between community areas, how- tation) to cluster processes and discover interactions. Cho ever, results in a decrease in predictive power due to over- et al. (2013) applied Hawkes processes to gang incidents in fitting (there is insufficient data to fit all 772 potential in- Los Angeles. They developed a spatial Gaussian mixture teractions). Interestingly, sparse graph priors do not help. model (GMM) for process identities, but did not explore They bias the model toward sparser but stronger interac- structured network priors. We experimented with this pro- tions which are not supported by the test data. These re- cess identity model but found that it suffers in predictive sults are shown in the “communities” group of Figure 7a. log likelihood tests (see supplementary material). Clustering the communities improves predictive perfor- Recently, Iwata et al. (2013) developed a stochastic EM mance for all graph models, as seen in the “clusters” group. 
algorithm for Hawkes processes, leveraging similar conju- Moreover, the clustered models benefit from the inclusion gacy properties, but without network priors. Zhou et al. of excitatory interactions, with the highest predictive log (2013) have developed a promising optimization-based ap- likelihoods coming from a four-cluster Erdos-Renyi˝ graph proach to discovering low-rank networks in Hawkes pro- model with interactions shown in Figure 7b. Distance- cesses, similar to some of the network models we explored. dependent graph priors do not improve predictive perfor- mance on this dataset, suggesting that either interactions do Perry & Wolfe (2013) derived a partial likelihood inference not occur over short distances, or that local rivalries are not algorithm for Hawkes processes with a similar emphasis substantial enough to be discovered in our dataset. More on structural patterns in the network of interactions. They data is necessary to conclusively say which. provide an estimator capable of discovering and other network effects. Our fully-Bayesian approach gener- Looking into the inferred clusters in Figure 7c and their alizes this method to capitalize on recent developments in rates in 7d, we can interpret the clusters as “safe suburbs” random network models (Lloyd et al., 2012). in gold, “buffer neighborhoods” in green, and “gang ter- ritories” in red and blue. Self-excitation in the blue clus- Finally, generalized linear models (GLMs) are widely used ter (Figure 7b) suggests that these regions are prone to in computational neuroscience (Paninski, 2004). GLMs al- bursts of activity, as one might expect during a turf-war. low for both excitatory and inhibitory interactions, but, as This interpretation is supported by reports of “a burst of we have shown, when the data consists of purely excitatory street-gang violence in 1990 and 1991” in West Englewood interactions, Hawkes processes outperform GLMs in link- (41.77◦N, −87.67◦W) (Block & Block, 1993). 
and event-prediction tests. Figure 7d also shows a significant increase in the homicide rate between 1989 and 1995, consistent with reports of es- 9. Conclusion calating gang warfare (Block & Block, 1993). In addition We developed a framework for discovering latent network to this long-term trend, homicide rates show a pronounced structure from spiking data. Our auxiliary variable formu- seasonal effect, peaking in the summer and tapering in the lation of the multivariate Hawkes process supported arbi- winter. A LGCP with a quadratic kernel point-wise added trary Aldous-Hoover graph priors, Log Gaussian Cox Pro- to a periodic kernel captures both effects. cess background rates, and models of unobserved process identities. Our parallel MCMC algorithm allowed us to 8. Related Work reason about uncertainty in the latent network in a fully- Bayesian manner. We leveraged results from random ma- Gomez-Rodriguez et al. (2010) introduced one of the ear- trix theory to analyze the conditions under which random liest algorithms for discovering latent networks from cas- network models will be stable, and our applications uncov- cades of events. They developed a highly scalable approx- ered interpretable latent networks in a variety of synthetic imate inference algorithm, but they did not explore the po- and real-world problems. tential of random network models or emphasize the point process nature of the data. Simma & Jordan (2010) studied Acknowledgements We thank Eyal Dechter and Leslie this problem from the context of Hawkes processes and de- Valiant for their many contributions. S.W.L. was supported veloped an expectation-maximization inference algorithm. by a NDSEG Fellowship. This work was partially funded by DARPA Young Faculty Award N66001-12-1-4219. Discovering Latent Network Structure in Point Process Data
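The stability criterion invoked above — every eigenvalue of the inferred interaction matrix lying inside the unit circle — is straightforward to check numerically. A minimal sketch in NumPy; the matrix `W` is an illustrative stand-in for inferred Hawkes weights, not values from our experiments:

```python
import numpy as np

# Stand-in interaction weight matrix for a 3-process Hawkes network.
# (Illustrative values only: in the model, W[i, j] would be the inferred
# expected number of events on process j directly caused by an event on
# process i.)
W = np.array([[0.3, 0.2, 0.0],
              [0.1, 0.4, 0.2],
              [0.0, 0.1, 0.3]])

# The process is stationary when the spectral radius of W is below 1:
# each event then spawns a subcritical branching cascade of offspring.
spectral_radius = np.max(np.abs(np.linalg.eigvals(W)))
print(f"spectral radius = {spectral_radius:.3f}")
print("stable" if spectral_radius < 1.0 else "unstable")
```

Because each weight is an expected offspring count, a spectral radius below one means every cascade of excitation dies out in expectation.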
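The seasonal background model described above — an LGCP whose covariance is a quadratic kernel added point-wise to a periodic kernel — can be sketched as follows. The kernel forms and hyperparameters here are illustrative assumptions, not the settings used in our experiments:

```python
import numpy as np

def quadratic_kernel(t1, t2, sigma=0.2, c=1.0):
    """Degree-2 polynomial kernel: captures a smooth long-term trend."""
    return sigma**2 * (t1 * t2 + c) ** 2

def periodic_kernel(t1, t2, sigma=0.5, period=1.0, length=0.5):
    """Exp-sine-squared kernel: captures a seasonal oscillation."""
    return sigma**2 * np.exp(
        -2.0 * np.sin(np.pi * np.abs(t1 - t2) / period) ** 2 / length**2)

def background_kernel(t1, t2):
    # A point-wise sum of valid covariance kernels is itself valid.
    return quadratic_kernel(t1, t2) + periodic_kernel(t1, t2)

# Gram matrix on a grid of times (in years); jitter keeps it positive definite.
ts = np.linspace(0.0, 4.0, 50)
K = np.array([[background_kernel(s, t) for t in ts] for s in ts])
K += 1e-6 * np.eye(len(ts))

# A draw from GP(0, K), exponentiated, gives one sample background rate
# exhibiting both a long-term trend and a seasonal effect.
rng = np.random.default_rng(0)
rate = np.exp(rng.multivariate_normal(np.zeros(len(ts)), K))
```

Summing kernels lets the trend and seasonal components be specified separately and combined, which is why the two effects can be captured by a single LGCP background rate.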

References

Aldous, David J. Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis, 11(4):581–598, 1981.

Block, Carolyn R and Block, Richard. Street gang crime in Chicago. US Department of Justice, Office of Justice Programs, National Institute of Justice, 1993.

Block, Carolyn R, Block, Richard, and Authority, Illinois Criminal Justice Information. Homicides in Chicago, 1965–1995. ICPSR06399-v5. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], July 2005.

Blundell, Charles, Heller, Katherine, and Beck, Jeffrey. Modelling reciprocating relationships with Hawkes processes. Advances in Neural Information Processing Systems, 2012.

Cho, Yoon Sik, Galstyan, Aram, Brantingham, Jeff, and Tita, George. Latent point process models for spatial-temporal networks. arXiv:1302.2671, 2013.

Daley, Daryl J and Vere-Jones, David. An introduction to the theory of point processes. Springer-Verlag, New York, 1988.

Goldenberg, Anna, Zheng, Alice X, Fienberg, Stephen E, and Airoldi, Edoardo M. A survey of statistical network models. Foundations and Trends in Machine Learning, 2(2):129–233, 2010.

Gomez-Rodriguez, Manuel, Leskovec, Jure, and Krause, Andreas. Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1019–1028. ACM, 2010.

Hawkes, Alan G. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83, 1971.

Hoff, Peter D. Modeling homophily and stochastic equivalence in symmetric relational data. Advances in Neural Information Processing Systems, 20:1–8, 2008.

Hoover, Douglas N. Relations on probability spaces and arrays of random variables. Technical report, Institute for Advanced Study, Princeton, 1979.

Iwata, Tomoharu, Shah, Amar, and Ghahramani, Zoubin. Discovering latent influence in online social activities via shared cascade Poisson processes. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 266–274. ACM, 2013.

Liben-Nowell, David and Kleinberg, Jon. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.

Lloyd, James Robert, Orbanz, Peter, Ghahramani, Zoubin, and Roy, Daniel M. Random function priors for exchangeable arrays with applications to graphs and relational data. Advances in Neural Information Processing Systems, 2012.

Mohamed, Shakir, Ghahramani, Zoubin, and Heller, Katherine A. Bayesian and L1 approaches for sparse unsupervised learning. In Proceedings of the 29th International Conference on Machine Learning, pp. 751–758, 2012.

Møller, Jesper, Syversveen, Anne Randi, and Waagepetersen, Rasmus Plenge. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451–482, 1998.

Murray, Iain, Adams, Ryan P., and MacKay, David J.C. Elliptical slice sampling. Journal of Machine Learning Research: Workshop and Conference Proceedings (AISTATS), 9:541–548, 2010.

Paninski, Liam. Maximum likelihood estimation of cascade point-process neural encoding models. Network: Computation in Neural Systems, 15(4):243–262, January 2004.

Papachristos, Andrew V. Murder by structure: Dominance relations and the social structure of gang homicide. American Journal of Sociology, 115(1):74–128, 2009.

Perry, Patrick O and Wolfe, Patrick J. Point process modelling for directed interaction networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2013.

Silverstein, Jack W. The spectral radii and norms of large dimensional non-central random matrices. Stochastic Models, 10(3):525–532, 1994.

Simma, Aleksandr and Jordan, Michael I. Modeling events with cascades of Poisson processes. Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI), 2010.

Zhou, Ke, Zha, Hongyuan, and Song, Le. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 16, 2013.