
Variational Inference for Sparse Gaussian Process Modulated Hawkes Process

Rui Zhang,1,2 Christian Walder,1,2 Marian-Andrei Rizoiu3 1The Australian National University, 2Data61 CSIRO, 3University of Technology Sydney [email protected], [email protected], [email protected]

Abstract

The Hawkes process (HP) has been widely applied to modeling self-exciting events including neuron spikes, earthquakes and tweets. To avoid designing a parametric triggering kernel and to be able to quantify the prediction confidence, the non-parametric Bayesian HP has been proposed. However, the inference of such models suffers from unscalability or slow convergence. In this paper, we aim to solve both problems. Specifically, first, we propose a new non-parametric Bayesian HP in which the triggering kernel is modeled as a squared sparse Gaussian process. Then, we propose a novel variational inference schema for model optimization. We employ the branching structure of the HP so that maximization of the evidence lower bound (ELBO) is tractable by the expectation-maximization algorithm. We propose a tighter ELBO which improves the fitting performance. Further, we accelerate the novel variational inference schema to linear time complexity by leveraging the stationarity of the triggering kernel. Different from prior acceleration methods, ours enjoys higher efficiency. Finally, we exploit synthetic data and two large social media datasets to evaluate our method. We show that our approach outperforms state-of-the-art non-parametric frequentist and Bayesian methods. We validate the efficiency of our accelerated variational inference schema and the practical utility of our tighter ELBO for model selection. We observe that the tighter ELBO exceeds the common one in model selection.

(a) Poisson Cluster Process (b) Branching Structure

Figure 1: The Cluster Representation of a HP. (a) A HP with a decaying triggering kernel $\phi(\cdot)$ has intensity $\lambda(x)$ which increases after each new point (dashed line) is generated. It can be represented as a cluster of PPes: $\mathrm{PP}(\mu)$ and $\mathrm{PP}(\phi(x - x_i))$ associated with each $x_i$. (b) The branching structure corresponding to the triggering relationships shown in (a), where an edge $x_i \to x_j$ denotes that $x_i$ triggers $x_j$, and its probability is denoted as $p_{ji}$.

1 Introduction

The Hawkes process (HP) (Hawkes 1971) is particularly useful to model self-exciting point data, i.e., when the occurrence of a point increases the likelihood of occurrence of new points. The process is parameterized using a background intensity $\mu$ and a triggering kernel $\phi$. The Hawkes process can be alternatively represented as a cluster of Poisson processes (PPes) (Hawkes and Oakes 1974). In the cluster, a PP with an intensity $\mu$ (denoted as $\mathrm{PP}(\mu)$) generates immigrant points which are considered to arrive in the system from the outside, and every existing point triggers offspring points, which are generated internally through the self-excitement, following a $\mathrm{PP}(\phi)$. Points can therefore be structured into clusters where each cluster contains either a point and its direct offspring or the background process (an example is shown in Fig. 1a). Connecting all points using the triggering relations yields a tree structure, which is called the branching structure (an example is shown in Fig. 1b corresponding to Fig. 1a). With the branching structure, we can decompose the HP into a cluster of PPes. The triggering kernel $\phi$ is shared among all cluster Poisson processes relating to a HP, and it determines the overall behavior of the process. Consequently, designing the kernel functions is of utmost importance for employing the HP in a new application, and its study has attracted much attention.

Prior non-parametric frequentist solutions. In the case in which the optimal triggering kernel for a particular application is unknown, a typical solution is to express it using a non-parametric form, such as the work of Lewis and Mohler (2011); Zhou, Zha, and Song (2013); Bacry and Muzy (2014); Eichler, Dahlhaus, and Dueck (2017). These are all frequentist methods and among them, the Wiener-Hopf equation based method (Bacry and Muzy 2014) enjoys linear time complexity. Our method has a similar advantage of linear time complexity per iteration. The Euler-Lagrange equation based solutions (Lewis and Mohler 2011; Zhou, Zha, and Song 2013) require discretizing the input domain so they face a problem of poorly scaling with the dimension of the domain. The same problem is also faced by Eichler, Dahlhaus, and Dueck (2017)'s discretization based method. In contrast, our method requires no discretization, so it enjoys scalability with the dimension of the domain.

Prior Bayesian solutions. The Bayesian inference for the HP has also been studied, including the work of Rasmussen (2013); Linderman and Adams (2014); Linderman and Adams (2015). These works require either constructing a parametric triggering kernel (Rasmussen 2013; Linderman and Adams 2014) or discretizing the input domain to scale with the data size (Linderman and Adams 2015). The shortcoming of discretization was just mentioned and, to overcome it, Donnet, Rivoirard, and Rousseau (2018) propose a continuous non-parametric Bayesian HP and resort to an unscalable Markov chain Monte Carlo (MCMC) estimator of the posterior distribution. A more recent solution based on Gibbs sampling (Zhang et al. 2019) achieves linear time complexity per iteration by exploiting the HP's branching structure and the stationarity of the triggering kernel, similar to the work of Halpin (2012). This approach considers only high-probability triggering relationships in computations for acceleration. However, those relationships are updated in each iteration, which is less efficient than our pre-computing them. Besides, our variational inference schema enjoys faster convergence than that of the Gibbs sampling based method.

Contributions. In this paper, we propose the first sparse Gaussian process modulated HP which employs a novel variational inference schema, enjoys a linear time complexity per iteration and scales to large real world data. Our method is inspired by the variational Bayesian PP (VBPP) (Lloyd et al. 2015) which provides the Bayesian non-parametric inference only for the whole intensity of the HP, not for its components: the background intensity $\mu$ and the triggering kernel $\phi$. Thus, the VBPP loses the internal reactions between points, and developing the variational Bayesian non-parametric inference for the HP is non-trivial and more challenging than the VBPP. In this paper, we adapt the VBPP for the HP and term the new approach the variational Bayesian Hawkes process (VBHP). The contributions are summarized:

(1) (Sec. 3) We introduce a new Bayesian non-parametric HP which employs a sparse Gaussian process modulated triggering kernel and a Gamma distributed background intensity.

(2) (Sec. 3 & 4) We propose a new variational inference schema for such a model. Specifically, we employ the branching structure of the HP so that maximization of the evidence lower bound (ELBO) is tractable by the expectation-maximization algorithm, and we contribute a tighter ELBO which improves the fitting performance of our model.

(3) (Sec. 5) We propose a new acceleration trick based on the stationarity of the triggering kernel. The new trick enjoys higher efficiency than prior methods and accelerates the variational inference schema to linear time complexity per iteration.

(4) (Sec. 6) We empirically show that VBHP provides more accurate predictions than state-of-the-art methods on synthetic data and on two large online diffusion datasets. We validate the linear time complexity and faster convergence of our accelerated variational inference schema compared to the Gibbs method, and the practical utility of our tighter ELBO for model selection, which outperforms the common one in model selection.

2 Prerequisite

In this section, we review the Hawkes process, the variational inference and its application to the Poisson process.

2.1 Hawkes Process (HP)

The HP (Hawkes 1971) is a self-exciting point process, in which the occurrence of a point increases the arrival rate $\lambda(\cdot)$, a.k.a. the (conditional) intensity, of new points. Given a set of time-ordered points $\mathcal{D} = \{x_i\}_{i=1}^N$, $x_i \in \mathbb{R}^R$, the intensity at $x$ conditioned on given points is written as:
$$\lambda(x) = \mu + \sum_{x_i < x} \phi(x - x_i),$$
where $\mu > 0$ is a constant background intensity, and $\phi : \mathbb{R}^R \to [0, \infty)$ is the triggering kernel.

We are particularly interested in the branching structure of the HP. As introduced in Sec. 1, each point $x_i$ has a parent that we represent by a one-hot vector $\mathbf{b}_i = [b_{i0}, b_{i1}, \cdots, b_{i,i-1}]^T$: each element $b_{ij}$ is binary, $b_{ij} = 1$ represents that $x_i$ is triggered by $x_j$ ($0 \le j \le i-1$, $x_0$: the background), and $\sum_{j=0}^{i-1} b_{ij} = 1$. A branching structure $B$ is a set of $\mathbf{b}_i$, namely $B = \{\mathbf{b}_i\}_{i=1}^N$, and if the probability of $b_{ij} = 1$ is defined as $p_{ij} \equiv p(b_{ij} = 1)$, the probability of $B$ can be expressed as $p(B) = \prod_{i=1}^N \prod_{j=0}^{i-1} p_{ij}^{b_{ij}}$. Since $\mathbf{b}_i$ is a one-hot vector, there is $\sum_{j=0}^{i-1} p_{ij} = 1$ for all $i$.

Given both $\mathcal{D}$ and a branching structure $B$, the log likelihood of $\mu$ and $\phi$ becomes a sum of log likelihoods of PPes. With $B$, the HP can be decomposed into $\mathrm{PP}(\mu)$ and $\{\mathrm{PP}(\phi(x - x_i))\}_{i=1}^N$. The data domain of $\mathrm{PP}(\mu)$ equals that of the HP, which we denote $\mathcal{T}$, and we denote the data domain of $\mathrm{PP}(\phi(x - x_i))$ by $\mathcal{T}_i \subset \mathcal{T}$. As a result, the log likelihood $\log p(\mathcal{D}, B \mid \mu, \phi)$ is expressed as:
$$\log p(\mathcal{D}, B \mid \mu, \phi) = \sum_{i=1}^N \Big( \sum_{j=1}^{i-1} b_{ij} \log \phi_{ij} + b_{i0} \log \mu \Big) - \sum_{i=1}^N \int_{\mathcal{T}_i} \phi - |\mathcal{T}|\,\mu, \qquad (1)$$
where $|\mathcal{T}| \equiv \int_{\mathcal{T}} 1\, dx$, $\phi_{ij} \equiv \phi(x_i - x_j)$ and $\int_{\mathcal{T}_i} \phi \equiv \int_{\mathcal{T}_i} \phi(x)\, dx$. Throughout this paper, we simplify the integration by omitting the integration variable, which is $x$ unless otherwise specified. This work is developed based on the univariate HP. However, it can be extended to be multivariate following the same procedure.
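To make Eqn. (1) concrete, the following minimal sketch evaluates the conditional intensity and the branching-structure log likelihood for a univariate HP on $[0, T]$. It is an illustration only: the kernel `phi`, its integral `phi_integral`, and the parent encoding are placeholders chosen here, not part of the paper.

```python
import numpy as np

def hawkes_intensity(x, points, mu, phi):
    """Conditional intensity lambda(x) = mu + sum_{x_i < x} phi(x - x_i)."""
    past = points[points < x]
    return mu + np.sum(phi(x - past))

def branching_log_likelihood(points, parents, mu, phi, phi_integral, T):
    """log p(D, B | mu, phi) of Eqn. (1) for a fixed branching structure.

    parents[i] = j + 1 means point i is triggered by point j (j < i);
    parents[i] = 0 means point i is an immigrant from the background PP.
    phi_integral(t) must return int_0^t phi(s) ds.
    """
    ll = 0.0
    for i, xi in enumerate(points):
        if parents[i] > 0:
            ll += np.log(phi(xi - points[parents[i] - 1]))   # b_ij log phi_ij
        else:
            ll += np.log(mu)                                  # b_i0 log mu
        ll -= phi_integral(T - xi)        # - int_{T_i} phi, with T_i = [x_i, T]
    return ll - mu * T                    # - |T| mu

# Example with an (illustrative) exponential kernel phi(t) = a*b*exp(-b*t).
a, b, mu, T = 0.8, 2.0, 10.0, np.pi
phi = lambda t: a * b * np.exp(-b * t)
phi_integral = lambda t: a * (1.0 - np.exp(-b * t))
points = np.array([0.3, 0.5, 1.1, 2.0])
parents = np.array([0, 1, 0, 3])          # x2 triggered by x1, x4 by x3
print(branching_log_likelihood(points, parents, mu, phi, phi_integral, T))
```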
2.2 Variational Inference (VI)

Consider the latent variable model $p(x, z \mid \theta)$ where $x$ and $z$ are the data and the latent variables respectively. The variational approach introduces a variational distribution to approximate the posterior distribution, $q(z \mid \theta') \approx p(z \mid x, \theta)$, and maximizes a lower bound of the log-likelihood, which can be derived from the non-negative gap perspective:
$$\log p(x \mid \theta) = \log \frac{p(x, z \mid \theta)}{q(z \mid \theta')} - \log \frac{p(z \mid x, \theta)}{q(z \mid \theta')}$$
$$= \underbrace{\underbrace{\mathbb{E}_{q(z \mid \theta')}\big[\log p(x \mid z, \theta)\big]}_{\text{reconstruction term}} - \underbrace{\mathrm{KL}\big(q(z \mid \theta') \,\|\, p(z \mid \theta)\big)}_{\text{regularisation term}}}_{\equiv\, \mathrm{CELBO}(q(z),\, p(x \mid z),\, p(z))} + \underbrace{\mathrm{KL}\big(q(z \mid \theta') \,\|\, p(z \mid x, \theta)\big)}_{\text{intractable (non-negative) gap}}$$
$$\ge \mathrm{CELBO}(q(z), p(x \mid z), p(z)), \qquad (2)$$
where we omit $\theta$ and $\theta'$ in conditions. We term this the Common ELBO (CELBO) to differentiate it from our tighter ELBO of Sec. 4. For notational convenience, we will often omit conditioning on $\theta$ and $\theta'$ hereinafter. Optimizing the CELBO w.r.t. $\theta'$ balances between the reconstruction error and the Kullback-Leibler (KL) divergence from the prior. Generally, the conditional $p(x \mid z)$ is known, and so is the prior. Thus, for an appropriate choice of $q$, it is easier to work with this lower bound than with the intractable posterior $p(z \mid x)$. We also see that, due to the form of the intractable gap, if $q$ is from a distribution family containing elements close to the true unknown posterior, then $q$ will be close to the true posterior when the CELBO is close to the true likelihood. An alternative derivation applies Jensen's inequality (Jordan et al. 1999).

2.3 Variational Bayesian Poisson Process (VBPP)

VBPP (Lloyd et al. 2015) applies the VI to the Bayesian Poisson process, which exploits the sparse Gaussian process (GP) to model the Poisson intensity. Specifically, VBPP uses a squared link function to map a sparse GP distributed function $f$ to the Poisson intensity $\lambda(x) = f^2(x)$. The sparse GP employs the ARD kernel:
$$K(x, x') \equiv \gamma \prod_{r=1}^R \exp\Big(-\frac{(x_r - x'_r)^2}{2\alpha_r}\Big),$$
where $\gamma$ and $\{\alpha_r\}_{r=1}^R$ are GP hyper-parameters. Let $\mathbf{u} \equiv (f(z_1), f(z_2), \cdots, f(z_M))$ where $z_i$ are inducing points. The prior and the approximate posterior distributions of $\mathbf{u}$ are Gaussian distributions $p(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid 0, K_{zz})$ and $q(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{m}, S)$ where $\mathbf{m}$ and $S$ are the mean vector and the covariance matrix respectively. Note both $\mathbf{u}$ and $f$ employ zero mean priors. Notations of VBPP are connected with those of VI (Sec. 2.2) in Table 1.

Table 1: Notations

  VBHP                                                           | VBPP                           | VI
  D ≡ {x_n}_{n=1}^N                                              | D ≡ {x_n}_{n=1}^N              | x
  B, µ, f, u                                                     | f, u                           | z
  k_0, c_0, {α_i}_{i=1}^R, γ                                     | {α_i}_{i=1}^R, γ               | θ
  k, c, m, S, {α_i}_{i=1}^R, γ, {{q_ij}_{j=0}^{i-1}}_{i=1}^N     | m, S, {α_i}_{i=1}^R, γ         | θ'

Importantly, the variational joint distribution of $f$ and $\mathbf{u}$ uses the exact conditional distribution $p(f \mid \mathbf{u})$, i.e.,
$$q(f, \mathbf{u}) \equiv p(f \mid \mathbf{u})\, q(\mathbf{u}), \qquad (3)$$
which in turn leads to the posterior GP $q(f) = \mathcal{N}(f \mid \nu, \Sigma)$, (4), with
$$\nu(x) \equiv K_{xz} K_{zz}^{-1} \mathbf{m}, \qquad \Sigma(x, x') \equiv K_{xx'} + K_{xz} K_{zz}^{-1} (S K_{zz}^{-1} - I) K_{zx'}.$$
Then, the CELBO is obtained by using Eqn. (2):
$$\mathrm{CELBO}(q(f, \mathbf{u}), p(\mathcal{D} \mid f, \mathbf{u}), p(f, \mathbf{u})) = \mathbb{E}_{q(f)}[\log p(\mathcal{D} \mid f)] - \mathrm{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u})).$$
Note that the second term is the KL divergence between two multivariate Gaussian distributions, so it is available in closed form. The first term turns out to be the expectation w.r.t. $q(f)$ of the log-likelihood $\log p(\mathcal{D} \mid f) = \sum_{i=1}^N \log f^2(x_i) - \int_{\mathcal{T}} f^2$. The expectation of the integral part is relatively straightforward to compute and the expectation of the other (data-dependent) part is available in almost closed form with a hyper-geometric function.
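As a minimal sketch of Eqns. (3)-(4), the snippet below builds the ARD kernel and the resulting posterior GP moments $\nu$ and $\Sigma$; the function names and the jitter term are our own illustrative choices, not part of the paper.

```python
import numpy as np

def ard_kernel(X, Z, gamma, alphas):
    """K(x, z) = gamma * prod_r exp(-(x_r - z_r)^2 / (2 * alpha_r))."""
    X, Z, alphas = np.atleast_2d(X), np.atleast_2d(Z), np.asarray(alphas, float)
    d2 = (X[:, None, :] - Z[None, :, :]) ** 2 / (2.0 * alphas)
    return gamma * np.exp(-d2.sum(-1))

def posterior_gp(Xstar, Z, m, S, gamma, alphas, jitter=1e-6):
    """Posterior GP moments of q(f) as in Eqn. (4): nu(x) and Sigma(x, x')."""
    Kzz = ard_kernel(Z, Z, gamma, alphas) + jitter * np.eye(len(Z))
    Kxz = ard_kernel(Xstar, Z, gamma, alphas)
    Kxx = ard_kernel(Xstar, Xstar, gamma, alphas)
    A = np.linalg.solve(Kzz, Kxz.T).T              # K_xz K_zz^{-1}
    nu = A @ m                                     # K_xz K_zz^{-1} m
    Sigma = Kxx + A @ (S - Kzz) @ A.T              # = K_xx + K_xz K_zz^{-1}(S K_zz^{-1} - I)K_zx
    return nu, Sigma
```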
3 Variational Bayesian Hawkes Process

3.1 Notations

To extend VBPP to HP, we introduce two more variables: the background intensity $\mu$ and the branching structure $B$, defined in Sec. 2.1. We assume that the prior distribution of $\mu$ is a Gamma distribution $p(\mu) = \mathrm{Gamma}(\mu \mid k_0, c_0)$, with $\mathrm{Gamma}(\mu \mid k_0, c_0) = \frac{1}{\Gamma(k_0)\, c_0^{k_0}}\, \mu^{k_0 - 1} e^{-\mu/c_0}$, and the posterior distribution is approximated by another Gamma distribution $q(\mu) = \mathrm{Gamma}(\mu \mid k, c)$. For $B = \{\mathbf{b}_i\}_{i=1}^N$ given $\mathcal{D} = \{x_i\}_{i=1}^N$, we assume the variational posterior of $b_{ij}$, $j = 0, \cdots, i-1$, to be a categorical distribution: $q_{ij} = q(b_{ij} = 1)$ and $\sum_{j=0}^{i-1} q_{ij} = 1$, and thus, the variational posterior probability of $B$ is expressed as:
$$q(B) = \prod_{i=1}^N \prod_{j=0}^{i-1} q_{ij}^{b_{ij}}. \qquad (5)$$

The same squared link function is adopted for the triggering kernel $\phi(x) = f^2(x)$, and so are the priors for $f$ and $\mathbf{u}$, namely $\mathcal{N}(f \mid 0, K_{xx'})$ and $\mathcal{N}(\mathbf{u} \mid 0, K_{zz'})$. More link functions such as $\exp(\cdot)$ are discussed by Lloyd et al. (2015). Moreover, we use the same variational joint posterior on $f$ and $\mathbf{u}$ as Eqn. (3). Consequently, we complete the variational joint distribution on all latent variables as below:
$$q(B, \mu, f, \mathbf{u}) \equiv q(B)\, q(\mu)\, p(f \mid \mathbf{u})\, q(\mathbf{u}), \qquad (6)$$
and notations of VBHP are summarized in Table 1.

3.2 CELBO

Based on Eqn. (2) & (6), we obtain the CELBO for VBHP (see details in A.1 of the appendix (App. 2019)):
$$\mathrm{CELBO}(q(B, \mu, f, \mathbf{u}), p(\mathcal{D} \mid B, \mu, f, \mathbf{u}), p(B, \mu, f, \mathbf{u})) = \underbrace{\mathbb{E}_{q(B, \mu, f)}\big[\log p(\mathcal{D}, B \mid f, \mu)\big]}_{\text{Data Dependent Expectation (DDE)}} + H_B - \mathrm{KL}(q(\mu) \,\|\, p(\mu)) - \mathrm{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u})), \qquad (7)$$
where $H_B = -\sum_{i=1}^N \sum_{j=0}^{i-1} q_{ij} \log q_{ij}$ is the entropy of the variational posterior of $B$. The KL terms are between Gamma and Gaussian distributions, for which closed forms are provided in A.2 of the appendix (App. 2019).

3.3 Data Dependent Expectation

Now, we are left with the problem of computing the data dependent expectation (DDE) in Eqn. (7). The DDE is w.r.t. the variational posterior probability $q(B, \mu, f)$. From Eqn. (6), $q(B, \mu, f) = \int q(B, \mu, f, \mathbf{u})\, d\mathbf{u} = q(B)\, q(\mu)\, q(f)$ and $q(f)$ is identical to Eqn. (4). As a result, we can compute the DDE w.r.t. $q(B)$ first, and then w.r.t. $q(\mu)$ and $q(f)$.

Expectation w.r.t. $q(B)$. From Eqn. (1), we easily obtain $\log p(\mathcal{D}, B \mid f, \mu)$ by replacing $\phi$ with $f^2$, whereupon it is clear that only $b_{ij}$ in $\log p(\mathcal{D}, B \mid f, \mu)$ is dependent on $B$. Therefore, $\mathbb{E}_{q(B)}[\log p(\mathcal{D}, B \mid f, \mu)]$ is computed as:
$$\mathbb{E}_{q(B)}[\log p(\mathcal{D}, B \mid f, \mu)] = \sum_{i=1}^N \Big( \sum_{j=1}^{i-1} q_{ij} \log f_{ij}^2 + q_{i0} \log \mu \Big) - \sum_{i=1}^N \int_{\mathcal{T}_i} f^2 - |\mathcal{T}|\,\mu,$$
where $f_{ij} \equiv f(x_i - x_j)$.

Expectation w.r.t. $q(f)$ and $q(\mu)$. We compute the expectation w.r.t. $q(f)$ and $q(\mu)$ by exploiting the expectation and the log expectation of the Gamma distribution, $\mathbb{E}_{q(\mu)}(\mu) = kc$ and $\mathbb{E}_{q(\mu)}(\log \mu) = \psi(k) + \log c$, and also the property $\mathbb{E}(x^2) = \mathbb{E}(x)^2 + \mathrm{Var}(x)$:
$$\mathrm{DDE} = \sum_{i=1}^N \Big[ \sum_{j=1}^{i-1} q_{ij}\, \mathbb{E}_{q(f)}(\log f_{ij}^2) + q_{i0}\big(\psi(k) + \log c\big) \Big] - \sum_{i=1}^N \int_{\mathcal{T}_i} \mathbb{E}_{q(f)}(f)^2 - \sum_{i=1}^N \int_{\mathcal{T}_i} \mathrm{Var}_{q(f)}(f) - |\mathcal{T}|\, kc, \qquad (8)$$
where $\psi$ is the Digamma function. We provide closed form expressions for $\int_{\mathcal{T}_i} \mathbb{E}_{q(f)}(f)^2$ and $\int_{\mathcal{T}_i} \mathrm{Var}_{q(f)}(f)$ in A.3 of the appendix (App. 2019). As in VBPP, $\mathbb{E}_{q(f)}(\log f_{ij}^2) = -\tilde{G}\big(-\nu_{ij}^2 / (2\Sigma_{ij})\big) + \log(\Sigma_{ij}/2) - C$ is available in closed form with a hyper-geometric function, where $\nu_{ij} = \nu(x_i - x_j)$, $\Sigma_{ij} = \Sigma(x_i - x_j, x_i - x_j)$ ($\nu$ and $\Sigma$ are defined in Eqn. (4)), $C \approx 0.57721566$ is the Euler-Mascheroni constant and $\tilde{G}$ is defined as $\tilde{G}(z) = {}_1F_1^{(1,0,0)}(0, 1/2, z)$, i.e., the partial derivative of the confluent hyper-geometric function ${}_1F_1$ w.r.t. the first argument. We compute $\tilde{G}$ using the method of Ancarani and Gasaneo (2008) and implement $\tilde{G}$ and $\tilde{G}'$ by linear interpolation of a lookup table (see a demo in the supplementary material).

3.4 Predictive Distribution of $\phi$

The predictive distribution of $f(x)$ depends on the posterior of $\mathbf{u}$. We assume that the optimal variational distribution of $\mathbf{u}$ approximates the true posterior distribution, namely $q(\mathbf{u} \mid \mathcal{D}, \theta'^*) = \mathcal{N}(\mathbf{u} \mid \mathbf{m}^*, S^*) \approx p(\mathbf{u} \mid \mathcal{D}, \theta)$. Therefore, there is $q(f \mid \mathcal{D}, \theta'^*) \approx p(f \mid \mathcal{D}, \theta)$, i.e., the approximate predictive $f(\tilde{x}) \sim \mathcal{N}\big(K_{\tilde{x}z} K_{zz'}^{-1} \mathbf{m}^*,\; K_{\tilde{x}\tilde{x}} - K_{\tilde{x}z} K_{zz'}^{-1} K_{z\tilde{x}} + K_{\tilde{x}z} K_{zz'}^{-1} S^* K_{zz'}^{-1} K_{z\tilde{x}}\big) \equiv \mathcal{N}(\tilde{\nu}, \tilde{\sigma}^2)$. Given the relation $\phi = f^2$, it is straightforward to derive the corresponding $\phi(\tilde{x}) \sim \mathrm{Gamma}(\tilde{k}, \tilde{c})$ where the shape $\tilde{k} = (\tilde{\nu}^2 + \tilde{\sigma}^2)^2 / [2\tilde{\sigma}^2(2\tilde{\nu}^2 + \tilde{\sigma}^2)]$ and the scale $\tilde{c} = 2\tilde{\sigma}^2(2\tilde{\nu}^2 + \tilde{\sigma}^2) / (\tilde{\nu}^2 + \tilde{\sigma}^2)$.
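The shape and scale in Sec. 3.4 follow from matching the first two moments of $f^2$ with $f \sim \mathcal{N}(\tilde{\nu}, \tilde{\sigma}^2)$; a small sketch of this moment matching (our own helper, not from the paper):

```python
def squared_gp_to_gamma(nu, sigma2):
    """Moment-match phi = f^2, f ~ N(nu, sigma2), to Gamma(shape k, scale c).

    E[f^2] = nu^2 + sigma2 and Var[f^2] = 2*sigma2*(2*nu^2 + sigma2);
    solving mean = k*c and variance = k*c^2 yields Sec. 3.4's expressions.
    """
    mean = nu ** 2 + sigma2
    var = 2.0 * sigma2 * (2.0 * nu ** 2 + sigma2)
    return mean ** 2 / var, var / mean     # (k, c)
```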
4 New Variational Inference Schema

We now propose a new variational inference (VI) schema which uses a tighter ELBO than the common one, i.e.,

Theorem 1. For VBHP, there is a tighter ELBO
$$\underbrace{\mathbb{E}_{q(B, \mu, f)}\big[\log p(\mathcal{D}, B \mid f, \mu)\big] + H_B}_{\equiv\, \mathrm{TELBO}} \le \log p(\mathcal{D}).$$

Remark. The TELBO is tighter because it is equivalent to the CELBO (Eqn. (7)) except without subtracting non-negative KL divergences over $\mu$ and $\mathbf{u}$. Other graphical models such as the variational Gaussian mixture model (Attias 1999) have a similar TELBO. Later on, we propose a new VI schema based on the TELBO.

Proof. With the variational posterior probability of the branching structure $q(B)$ defined in Eqn. (5) and through Jensen's inequality, we have:
$$\log p(\mathcal{D}) \ge \sum_B q(B) \log p(\mathcal{D}, B) + H_B, \qquad (9)$$
where $H_B$ is the entropy of $B$ defined in Eqn. (7). The term $\sum_B q(B) \log p(\mathcal{D}, B)$ can be understood as follows. Consider that infinite branching structures are drawn from $q(B)$ independently, say $\{B_i\}_{i=1}^{\infty}$. Given a branching structure $B_i$, the Hawkes process can be decomposed into a cluster of Poisson processes, denoted as $(\mathcal{D}, B_i)$, and the corresponding log-likelihood is $\log p(\mathcal{D}, B_i)$. Then, $\sum_B q(B) \log p(\mathcal{D}, B)$ is the mean of all log likelihoods $\{\log p(\mathcal{D}, B_i)\}_{i=1}^{\infty}$,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n \log p(\mathcal{D}, B_i) = \lim_{n \to \infty} \sum_B \frac{n_B}{n} \log p(\mathcal{D}, B) = \sum_B q(B) \log p(\mathcal{D}, B), \qquad (10)$$
where $n_B$ is the number of occurrences of branching structure $B$. Since all branching structures $\{B_i\}_{i=1}^{\infty}$ are i.i.d., the clusters of Poisson processes generated over $\{B_i\}_{i=1}^{\infty}$ should also be independent, i.e., $\{(\mathcal{D}, B_i)\}_{i=1}^{\infty}$ are i.i.d. It follows that
$$\sum_{i=1}^{\infty} \log p(\mathcal{D}, B_i) = \log p\big(\{(\mathcal{D}, B_i)\}_{i=1}^{\infty}\big). \qquad (11)$$
We compute the CELBO of $\log p(\{(\mathcal{D}, B_i)\}_{i=1}^n)$ by making $z = (\mu, f, \mathbf{u})$ and $x = \{(\mathcal{D}, B_i)\}_{i=1}^n$ in Eqn. (2):
$$\log p\big(\{(\mathcal{D}, B_i)\}_{i=1}^n\big) \ge \mathbb{E}_{q(f, \mu)}\big[\log p(\{(\mathcal{D}, B_i)\}_{i=1}^n \mid f, \mu)\big] - \mathrm{KL}(q(\mu) \,\|\, p(\mu)) - \mathrm{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u})). \qquad (12)$$
Further, we plug Eqn. (11) and Eqn. (12) into Eqn. (10):
$$\text{Eqn. (10)} = \lim_{n \to \infty} \frac{1}{n} \log p\big(\{(\mathcal{D}, B_i)\}_{i=1}^n\big) \overset{(a)}{\ge} \lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}_{q(f, \mu)}\big[\log p(\{(\mathcal{D}, B_i)\}_{i=1}^n \mid f, \mu)\big] \overset{(b)}{=} \lim_{n \to \infty} \sum_B \frac{n_B}{n}\, \mathbb{E}_{q(f, \mu)}\big[\log p(\mathcal{D}, B \mid f, \mu)\big] = \mathbb{E}_{q(f, \mu, B)}\big[\log p(\mathcal{D}, B \mid f, \mu)\big],$$
where (a) is because the finite values of the KL terms are divided by the infinitely large $n$, and (b) is due to the i.i.d. $(\mathcal{D}, B_i)$ and the variational posterior of $B$ being independent of $f$ and $\mu$. Finally, we plug the above inequality into Eqn. (9) and obtain the TELBO.

New Optimization Schema for VBHP. To optimize the model parameters, we employ the expectation-maximization algorithm. Specifically, in the E step, all $q_{ij}$ are optimized to maximize the CELBO, and in the M step, $\mathbf{m}$, $S$, $k$ and $c$ are updated to increase the CELBO. We don't use the TELBO to optimize the variational distributions because it doesn't guarantee minimizing the KL divergence between variational and true posterior distributions. Instead, the TELBO is employed to select GP hyper-parameters:
$$\{\alpha_i^*\}_{i=1}^R, \gamma^* = \operatorname*{argmax}_{\{\alpha_i\}_{i=1}^R,\, \gamma} \mathrm{TELBO}.$$
The TELBO bounds the marginal likelihood more tightly than the CELBO, and is therefore expected to lead to a better predictive performance, an intuition which we empirically validate in Sec. 6.

The updating equations for $q_{ij}$ are derived through maximization of Eqn. (7) under the constraints $\sum_{j=0}^{i-1} q_{ij} = 1$ for all $i$. This maximization problem is handled with the Lagrange multiplier method, and yields the below updating equations:
$$q_{ij} = \begin{cases} \exp\big(\mathbb{E}_{q(f)}(\log f_{ij}^2)\big) / A_i, & j > 0; \\ c\,\exp(\psi(k)) / A_i, & j = 0, \end{cases}$$
where $A_i = c\,\exp(\psi(k)) + \sum_{j=1}^{i-1} \exp\big(\mathbb{E}_{q(f)}(\log f_{ij}^2)\big)$ is the normalizer.

Furthermore, and similarly to VBPP, we fix the inducing points on a regular grid over $\mathcal{T}$. Despite the observation that more inducing points lead to better fitting accuracy (Lloyd et al. 2015; Snelson and Ghahramani 2006), in the case of our more complex VBHP, more inducing points may cause slow convergence (Fig. 5a (App. 2019)) for some hyper-parameters, and therefore lead to poor performance in limited iterations. Generally, more inducing points improve accuracy at the expense of longer fitting time.
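A compact sketch of the E step above (our own illustrative implementation; `E_log_f2` is assumed to hold $\mathbb{E}_{q(f)}[\log f_{ij}^2]$ for $j < i$, and the background weight is $\exp(\mathbb{E}_{q(\mu)}[\log \mu]) = c\,e^{\psi(k)}$ under $q(\mu) = \mathrm{Gamma}(k, c)$):

```python
import numpy as np
from scipy.special import digamma

def e_step(E_log_f2, k, c):
    """E-step update of the parent probabilities q_ij (Sec. 4).

    E_log_f2 is an (N, N) array with E_q(f)[log f^2(x_i - x_j)] in entry (i, j), j < i.
    Returns q with column 0 holding the background responsibility q_i0.
    """
    N = E_log_f2.shape[0]
    q = np.zeros((N, N + 1))
    bg = c * np.exp(digamma(k))                  # exp(E_q(mu)[log mu])
    for i in range(N):
        w = np.concatenate(([bg], np.exp(E_log_f2[i, :i])))
        q[i, :i + 1] = w / w.sum()               # divide by the normalizer A_i
    return q
```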
5 Acceleration Trick

Time Complexity Without Acceleration. In the E step of model optimization, updating $q_{ij}$ requires computing the mean and the variance of all $f_{ij}$, which both take $O(M^3 + M^2 N^2)$ with $N$ points in the HP and $M$ inducing points. Here, we omit the dimension of data $R$ since normally $M > R$ for a regular grid of inducing points. Similarly, in the M step, computing the hyper-geometric term requires the means and variances of all the $f_{ij}$. Finally, computation of the integral terms takes $O(M^3 N)$. Thus, the total time complexity per iteration is $O(M^3 N + M^2 N^2)$.

Acceleration to Linear Time Complexity. To accelerate our VBHP, similarly to Zhang et al. (2019) we exploit the stationarity of the triggering kernel, assuming the kernel has negligible values for sufficiently large inputs. As a result, sufficiently distant pairs of points do not enter into the computations. This trick reduces the possible parents of a point from all prior points to a set of neighbors. The number of relevant neighbors is bounded by a constant $C$ and as a result the total time complexity is reduced to $O(C M^3 N)$.

Specifically, we introduce a compact region $\mathcal{S} = \times_{r=1}^R [\mathcal{S}_r^{\min}, \mathcal{S}_r^{\max}] \subseteq \mathcal{T}$ so that $\phi(x_i - x_j) = 0$ and $q_{ij} = 0$ if $x_i - x_j \notin \mathcal{S}$. As a result, all terms related to $x_i - x_j \notin \mathcal{S}$ vanish. To choose a suitable $\mathcal{S}$, we again use the TELBO, taking the smallest $\mathcal{S}$ for which the TELBO doesn't drop significantly; we optimize $\mathcal{S}_r^{\min}$ and $\mathcal{S}_r^{\max}$ by grid search with other dimensions fixed (so that this step is run $R$ times in total) and we optimize $\mathcal{S}_r^{\min}$ after optimizing $\mathcal{S}_r^{\max}$.

Rather than selecting pairs of points in each iteration in the manner of Halpin's trick (Halpin 2012; Zhang et al. 2019), our method pre-computes those pairs, leading to gains in computational efficiency. The similar aspect is that both tricks have hyper-parameters to select to threshold the triggering kernel value. We employ the TELBO for hyper-parameter selection while frequentist methods use cross validation.
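As an illustration of the pre-computation (not the authors' code), the candidate parents of every point can be collected once before inference; with a bounded support $[0, s_{\max}]$ on a univariate HP, each point keeps only a constant number of candidates:

```python
import numpy as np

def precompute_parent_candidates(points, s_max):
    """Candidate parents j of each point i with 0 < x_i - x_j <= s_max (Sec. 5).

    points must be sorted in time; the returned index lists are computed once and
    then reused in every E/M iteration, so no pair selection happens in the loop.
    """
    candidates = []
    for i, xi in enumerate(points):
        lo = np.searchsorted(points, xi - s_max, side="left")
        candidates.append(np.arange(lo, i))
    return candidates

# Example: only points within 0.5 time units are kept as candidate parents.
pts = np.array([0.1, 0.3, 0.9, 1.0, 2.5])
print(precompute_parent_candidates(pts, 0.5))   # candidate parents: [], [0], [], [2], []
```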
6 Experiments

Evaluation. We employ two metrics: the first is the L2 distance (for cases with a known ground truth), which measures the difference between predictive and true Hawkes kernels, formulated as $L_2(\phi_{\mathrm{pred}}, \phi_{\mathrm{true}}) = \big(\int_{\mathcal{T}} (\phi_{\mathrm{pred}}(x) - \phi_{\mathrm{true}}(x))^2\, dx\big)^{0.5}$ and $L_2(\mu_{\mathrm{pred}}, \mu_{\mathrm{true}}) = |\mu_{\mathrm{pred}} - \mu_{\mathrm{true}}|$; the second is the hold-out log likelihood (HLL), which describes how well the predictive model fits the test data, formulated as $\log p(\mathcal{D}_{\mathrm{Test}} = \{x_i\}_{i=1}^N \mid \mu, f) = \sum_{i=1}^N \log \lambda(x_i) - \int_{\mathcal{T}} \lambda$. To calculate the HLL for each process, we generate a number of test sequences by every time randomly assigning each point of the original process to either a training or a testing sequence with equal probability; HLLs of test sequences are normalized (by dividing by the test sequence length) and averaged.

Prediction. We use the pointwise mode of the approximate posterior triggering kernel as the prediction because it is computationally intractable to find the posterior mode at multiple point locations (Zhang et al. 2019). Besides, we exploit the mode of the approximate posterior background intensity as the predictive background intensity.

Baselines. We use the following models as baselines. (1) A parametric Hawkes process equipped with the sum-of-exponentials (SumExp) triggering kernel $\phi(x) = \sum_{i=1}^K a_1^i a_2^i \exp(-a_2^i x)$ and a constant background intensity. (2) The ordinary differential equation (ODE) based non-parametric non-Bayesian Hawkes process (Zhou, Zha, and Song 2013). The code is publicly available (Bacry et al. 2017). (3) The Wiener-Hopf (WH) equation based non-parametric non-Bayesian Hawkes process (Bacry and Muzy 2014). The code is publicly available (Bacry et al. 2017). (4) The Gibbs sampling based Bayesian non-parametric Hawkes process (Gibbs Hawkes) (Zhang et al. 2019). For fairness, the ARD kernel is used by Gibbs Hawkes and the corresponding eigenfunctions are approximated by the Nyström method (Williams and Seeger 2001), where regular grid points are used as in VBHP. Different from batch training in (Zhang et al. 2019), all experiments are conducted on single sequences.
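For concreteness, a sketch of the per-sequence HLL above (an illustrative helper; we assume the intensity of a test sequence is conditioned on the test points themselves and that `phi_integral(t)` returns $\int_0^t \phi$):

```python
import numpy as np

def holdout_log_likelihood(test_points, mu, phi, phi_integral, T):
    """Normalized hold-out log likelihood (HLL) of one test sequence (Sec. 6)."""
    ll = 0.0
    for i, x in enumerate(test_points):
        ll += np.log(mu + np.sum(phi(x - test_points[:i])))   # sum_i log lambda(x_i)
    ll -= mu * T + np.sum(phi_integral(T - test_points))      # - int_T lambda
    return ll / len(test_points)                               # divide by sequence length
```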

(a) Variational Posterior Triggering Kernel; (b) TELBO; (c) L2 Distance for φ

Figure 2: The Relationship between the Log Marginal Likelihood and the L2 Distance. In (a), the true φsin (dash green) is plotted with the median (solid) and the [0.1, 0.9] interval (filled) of the approximate posterior triggering kernel obtained by VBHP and Gibbs Hawkes (10 inducing points). It uses the maximum point of the TELBO (red star in (b)). In (c), the maximum point of the TELBO is marked. The maximum point overlaps with that of the CELBO. [0, 1.4] is used as the support of the predictive triggering kernel and 10 inducing points are used.

(a) Expected TELBO (Train); (b) HLL; (c) Time VS Inducing Points; (d) Time VS Data Size

Figure 3: (a)~(b) The Relationship between the TELBO and the HLL; (c)~(d) Average Fitting Time (Seconds) Per Iteration. In (a), the maximum point is marked by the red star. In (b), the maximum points of the TELBO and CELBO are marked by red and blue stars. (c) is plotted on 50 processes. (d) shows the fitting time of Gibbs Hawkes (star) and VBHP (circle) on 120 processes. 10 inducing points are used unless specified.

6.1 Synthetic Experiments

Synthetic Data. Our synthetic data are generated from three Hawkes processes over $\mathcal{T} = [0, \pi]$, whose triggering kernels are sin, cos and exp functions respectively, shown as below, and whose background intensities are the same $\mu = 10$:
$$\phi_{\sin}(x) = 0.9[\sin(3x) + 1],\ x \in [0, \pi/2];\ \text{otherwise } 0;$$
$$\phi_{\cos}(x) = \cos(2x) + 1,\ x \in [0, \pi/2];\ \text{otherwise } 0;$$
$$\phi_{\exp}(x) = 5\exp(-5x),\ x \in [0, \infty).$$
As a result, for any generated sequence, say $\{x_i\}_{i=1}^N$, $\mathcal{T}_i = [0, \pi - x_i]$ is used in the CELBO and the TELBO.

Model Selection. As the marginal likelihood $p(\mathcal{D} \mid \theta)$ is a key advantage of our method over non-Bayesian approaches (Zhou, Zha, and Song 2013; Bacry and Muzy 2014), we investigate its efficacy for model selection. Fig. 2b shows the contour plot of the approximate log marginal likelihood (the TELBO) of a sequence. It is observed that the contour plot of the TELBO agrees with the contour plot of $L_2(\phi)$ (Fig. 2c): GP hyper-parameters with relatively high marginal likelihoods have relatively low L2 errors. Fig. 2a plots the posterior triggering kernel corresponding to the maximal approximate marginal likelihood. A similar agreement is also observed between the TELBO and the HLL (Fig. 3a, 3b). This demonstrates the practical utility of both the marginal likelihood itself and our approximation of it.

Evaluation. To evaluate VBHP on synthetic data, 20 sequences are drawn from each model and 100 pairs of train and test sequences are drawn from each sample to compute the HLL. We select GP hyper-parameters of Gibbs Hawkes and of VBHP by maximizing approximate marginal likelihoods. Table 2 shows evaluations for baselines and VBHP (using 10 inducing points as a trade-off between accuracy and time, as does Gibbs Hawkes) in both L2 and HLL. VBHP achieves the best performance but is two orders of magnitude slower than Gibbs Hawkes per iteration (shown in Fig. 3c and 3d). The TELBO performs closely to the CELBO in the L2 error and this is also reflected in Fig. 2c where the maximum points of the TELBO and the CELBO overlap. In contrast, the TELBO consistently improves the performance of VBHP in the HLL, which is also reflected in Fig. 3b where hyper-parameters selected by the TELBO tend to have a higher HLL. Interestingly, when the parametric model SumExp uses the same triggering kernel (a single exponential function) as the ground truth $\phi_{\exp}$, SumExp fits $\phi_{\exp}$ best in L2 distance while, due to learning on single sequences, the background intensity has relatively high errors. Although our method is not aware of the parametric family of the ground truth, it performs well. Compared with non-parametric frequentist methods, which have strong fitting capacity but suffer from noisy data and have difficulties with hyper-parameter selection, our Bayesian solution overcomes these disadvantages and achieves better performance.

Table 2: Results on Synthetic and Real World Data (Mean ± One Standard Deviation). VBHP (C) and (T) use the CELBO and the TELBO to update the hyper-parameters respectively.

  Measure | Data    |       SumExp |          ODE |           WH | Gibbs Hawkes |     VBHP (C) |     VBHP (T)
  L2      | Sin  φ: |  0.693±0.028 |  0.665±0.121 |  2.463±0.145 |  0.408±0.198 |  0.152±0.091 |  0.183±0.076
          |      µ: |  2.968±1.640 |  4.514±3.808 |  6.794±5.054 |  4.108±3.949 |  0.640±0.528 |  0.579±0.523
          | Cos  φ: |  0.473±0.102 |  0.697±0.065 |  1.743±0.083 |  0.667±0.686 |  0.325±0.073 |  0.292±0.096
          |      µ: |  2.751±1.902 |  7.030±5.662 |  6.099±4.613 |  4.685±4.421 |  0.555±0.294 |  0.515±0.293
          | Exp  φ: |  0.133±0.138 |  1.835±0.539 |  2.254±2.042 |  0.676±0.233 |  0.257±0.086 |  0.235±0.102
          |      µ: |  3.290±1.991 |  8.969±8.604 |  16.66±20.95 |  7.648±9.647 |  0.471±0.432 |  0.486±0.418
  HLL     | Sin     |  3.490±0.400 |  3.489±0.413 |  3.233±0.273 |  3.492±0.406 |  3.488±0.400 |  3.497±0.406
          | Cos     |  3.874±0.544 |  3.872±0.552 |  3.613±0.373 |  3.871±0.562 |  3.876±0.541 |  3.878±0.548
          | Exp     |  2.825±0.481 |  2.822±0.496 |  2.782±0.490 |  2.826±0.492 |  2.826±0.491 |  2.829±0.487
          | ACTIVE  |  1.692±1.371 |  0.880±2.716 |  0.710±0.943 |  1.323±2.160 |  1.824±1.159 |  1.867±1.181
          | SEISMIC |  2.943±0.959 |  2.582±1.665 |  1.489±1.796 |  3.110±1.251 |  3.143±0.895 |  3.164±0.843

6.2 Real World Experiments

Real World Data. We conclude our experiments with two large scale tweet datasets. ACTIVE (Rizoiu et al. 2018) is a tweet dataset which was collected in 2014 and contains ~41k (re)tweet temporal point processes with links to Youtube videos. Each sequence contains at least 20 (re)tweets. SEISMIC (Zhao et al. 2015) is a large scale tweet dataset which was collected from October 7 to November 7, 2011, and contains ~166k (re)tweet temporal point processes. Each sequence contains at least 50 (re)tweets.

Evaluation. Similarly to the synthetic experiments, we evaluate the fitting performance by averaging the HLL of 20 test sequences randomly drawn from each original datum. We scale all original data to $\mathcal{T} = [0, \pi]$ (leading to $\mathcal{T}_i = [0, \pi - x_i]$ used in the CELBO and the TELBO for a sequence $\{x_i\}_{i=1}^N$) and employ 10 inducing points to balance time and accuracy. The model selection is performed by maximizing the approximate marginal likelihood. The obtained results are shown in Table 2. Again, we observe similar predictive performance of VBHP: the TELBO performs better than the CELBO; VBHP achieves the best scores. This demonstrates that our Bayesian model and novel VI schema are useful for flexible real life data.

Fitting Time. We further evaluate the fitting speed² of VBHP and Gibbs Hawkes on synthetic and real world point processes, which is summarized in Fig. 3c and 3d. The fitting time is averaged over iterations and we observe that the increasing trends with the number of inducing points and with the data size are similar between Gibbs Hawkes and VBHP. Although VBHP is significantly slower than Gibbs Hawkes per iteration, VBHP converges faster, in 10~20 iterations (Fig. 5 (App. 2019)), giving an average convergence time of 549 seconds for a sequence of 1000 events, compared to 699 seconds for Gibbs Hawkes. The slope of VBHP in Fig. 3d is 1.04 (log-scale) and the correlation coefficient is 0.96, so we conclude that the fitting time is linear in the data size.

7 Conclusions

In this paper, we presented a new Bayesian non-parametric Hawkes process whose triggering kernel is modulated by a sparse Gaussian process and whose background intensity is Gamma distributed. We provided a novel VI schema for such a model: we employed the branching structure so that the common ELBO is maximized by the expectation-maximization algorithm; we contributed a tighter ELBO which performs better in model selection than the common one. To address the difficulty with scaling with the data size, we utilize the stationarity of the triggering kernel to reduce the number of possible parents for each point. Different from prior acceleration methods, ours enjoys higher efficiency. On synthetic data and two large Twitter diffusion datasets, VBHP enjoys linear fitting time with the data size and a fast convergence rate, and provides more accurate predictions than those of state-of-the-art approaches. The novel ELBO is also demonstrated to exceed the common one in model selection.

8 Acknowledgements

We would like to thank Dongwoo Kim for helpful discussions. Rui is supported by the Australian National University and Data61 CSIRO PhD scholarships.

²The CPU we use is Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz and the language is Python 3.6.5.

References

Ancarani, L. U., and Gasaneo, G. 2008. Derivatives of any order of the confluent hypergeometric function 1F1(a, b, z) with respect to the parameter a or b. Journal of Mathematical Physics 49(6):063508.

App. 2019. Appendix: Variational inference for sparse gaussian process modulated hawkes process. In the supplementary material.

Attias, H. 1999. Inferring parameters and structure of latent variable models by variational bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 21–30. Morgan Kaufmann Publishers Inc.

Bacry, E., and Muzy, J.-F. 2014. Second order statistics characterization of hawkes processes and non-parametric estimation. arXiv preprint arXiv:1401.0903.

Bacry, E.; Bompaire, M.; Gaïffas, S.; and Poulsen, S. 2017. tick: a Python library for statistical learning, with a particular emphasis on time-dependent modeling. ArXiv e-prints.

Donnet, S.; Rivoirard, V.; and Rousseau, J. 2018. Nonparametric bayesian estimation of multivariate hawkes processes. arXiv preprint arXiv:1802.05975.

Eichler, M.; Dahlhaus, R.; and Dueck, J. 2017. Graphical modeling for multivariate hawkes processes with nonparametric link functions. Journal of Time Series Analysis 38(2):225–242.

Halpin, P. F. 2012. An em algorithm for hawkes process. Psychometrika 2.

Hawkes, A. G., and Oakes, D. 1974. A cluster process representation of a self-exciting process. Journal of Applied Probability 11(3):493–503.

Hawkes, A. G. 1971. Spectra of some self-exciting and mutually exciting point processes. Biometrika 58(1):83–90.

Jordan, M. I.; Ghahramani, Z.; Jaakkola, T. S.; and Saul, L. K. 1999. An introduction to variational methods for graphical models. Machine Learning 37(2):183–233.

Lewis, E., and Mohler, G. 2011. A nonparametric em algorithm for multiscale hawkes processes. Journal of Nonparametric Statistics 1(1):1–20.

Linderman, S., and Adams, R. 2014. Discovering latent network structure in point process data. In International Conference on Machine Learning, 1413–1421.

Linderman, S. W., and Adams, R. P. 2015. Scalable bayesian inference for excitatory point process networks. arXiv preprint arXiv:1507.03228.

Lloyd, C.; Gunter, T.; Osborne, M.; and Roberts, S. 2015. Variational inference for gaussian process modulated poisson processes. In ICML, 1814–1822.

Rasmussen, J. G. 2013. Bayesian inference for hawkes processes. Methodology and Computing in Applied Probability 15(3):623–642.

Rizoiu, M.-A.; Mishra, S.; Kong, Q.; Carman, M.; and Xie, L. 2018. Sir-hawkes: Linking epidemic models and hawkes processes to model diffusions in finite populations. In Proceedings of the 27th International Conference on World Wide Web, 419–428. International World Wide Web Conferences Steering Committee.

Snelson, E., and Ghahramani, Z. 2006. Sparse gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, 1257–1264.

Williams, C. K. I., and Seeger, M. 2001. Using the nyström method to speed up kernel machines. In Leen, T. K.; Dietterich, T. G.; and Tresp, V., eds., Advances in Neural Information Processing Systems 13. MIT Press. 682–688.

Zhang, R.; Walder, C.; Rizoiu, M.-A.; and Xie, L. 2019. Efficient non-parametric bayesian hawkes processes. In International Joint Conference on Artificial Intelligence (IJCAI'19).

Zhao, Q.; Erdogdu, M. A.; He, H. Y.; Rajaraman, A.; and Leskovec, J. 2015. Seismic: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1513–1522. ACM.

Zhou, K.; Zha, H.; and Song, L. 2013. Learning triggering kernels for multi-dimensional hawkes processes. In International Conference on Machine Learning, 1301–1309.

Accompanying the submission Variational Inference for Sparse Gaussian Process Modulated Hawkes Process.

A Omitted Derivations

A.1 Deriving Eqn. (7)

As per Eqn. (2), there is
$$\mathrm{CELBO}(q(B, \mu, f, \mathbf{u}), p(\mathcal{D} \mid B, \mu, f, \mathbf{u}), p(B, \mu, f, \mathbf{u})) = \mathbb{E}_{q(B, \mu, f)}[\log p(\mathcal{D} \mid B, \mu, f)] - \mathrm{KL}(q(B, \mu, f, \mathbf{u}) \,\|\, p(B, \mu, f, \mathbf{u})),$$
where the KL term can be simplified as
$$\mathrm{KL}(q(B, \mu, f, \mathbf{u}) \,\|\, p(B, \mu, f, \mathbf{u}))$$
$$= \sum_B \iiint q(B, \mu, f, \mathbf{u}) \log \frac{q(B)\, q(\mu)\, p(f \mid \mathbf{u})\, q(\mathbf{u})}{p(B)\, p(\mu)\, p(f \mid \mathbf{u})\, p(\mathbf{u})}\, d\mathbf{u}\, df\, d\mu \quad \text{(Eqn. (6) and Bayes' rule)}$$
$$= \sum_B \iiint q(B, \mu, f, \mathbf{u}) \log \frac{q(B)}{p(B)}\, d\mathbf{u}\, df\, d\mu + \sum_B \iiint q(B, \mu, f, \mathbf{u}) \log \frac{q(\mu)}{p(\mu)}\, d\mathbf{u}\, df\, d\mu + \sum_B \iiint q(B, \mu, f, \mathbf{u}) \log \frac{q(\mathbf{u})}{p(\mathbf{u})}\, df\, d\mu\, d\mathbf{u} \quad \text{(simplification)}$$
$$= \sum_B q(B) \log \frac{q(B)}{p(B)} + \int q(\mu) \log \frac{q(\mu)}{p(\mu)}\, d\mu + \int q(\mathbf{u}) \log \frac{q(\mathbf{u})}{p(\mathbf{u})}\, d\mathbf{u} \quad \text{(simplification)}$$
$$= \mathrm{KL}(q(B) \,\|\, p(B)) + \mathrm{KL}(q(\mu) \,\|\, p(\mu)) + \mathrm{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u})).$$

We utilise the likelihood $p(\mathcal{D}, B \mid \mu, f)$ by combining the reconstruction term and the KL term w.r.t. $B$:
$$\mathbb{E}_{q(B, \mu, f)}[\log p(\mathcal{D} \mid B, \mu, f)] - \mathrm{KL}(q(B) \,\|\, p(B))$$
$$= \sum_B \iint q(B, \mu, f) \log p(\mathcal{D} \mid B, \mu, f)\, df\, d\mu - \sum_B q(B) \log \frac{q(B)}{p(B)} \quad \text{(definition)}$$
$$= \sum_B \iint q(B, \mu, f) \log p(\mathcal{D} \mid B, \mu, f)\, df\, d\mu - \sum_B \iint q(B, \mu, f) \log \frac{q(B)}{p(B)}\, df\, d\mu \quad \text{(align probabilities)}$$
$$= \sum_B \iint q(B, \mu, f) \log \frac{p(\mathcal{D} \mid B, \mu, f)\, p(B)}{q(B)}\, df\, d\mu \quad \text{(merge)}$$
$$= \sum_B \iint q(B, \mu, f) \log p(\mathcal{D}, B \mid \mu, f)\, df\, d\mu - \sum_B q(B) \log q(B) \quad \text{(merge)}$$
$$= \mathbb{E}_{q(B, \mu, f)}\big[\log p(\mathcal{D}, B \mid \mu, f)\big] + H_B,$$
where $H_B = -\sum_B q(B) \log q(B)$ is the entropy of $B$ and is further computed as follows. We adopt $q(B)$ from Eqn. (5) and the closed form expression of $H_B$ is derived as
$$H_B = -\sum_{\{\mathbf{b}_i\}_{i=1}^N} \Big( \prod_{i=1}^N \prod_{j=0}^{i-1} q_{ij}^{b_{ij}} \Big) \log \Big( \prod_{i=1}^N \prod_{j=0}^{i-1} q_{ij}^{b_{ij}} \Big)$$
$$= -\sum_{j=0}^{k-1} q_{kj} \sum_{\{\mathbf{b}_i\}_{i \ne k}} \Big( \prod_{i \ne k} \prod_{j=0}^{i-1} q_{ij}^{b_{ij}} \Big) \log \Big( q_{kj} \prod_{i \ne k} \prod_{j=0}^{i-1} q_{ij}^{b_{ij}} \Big) \quad \text{(split the summation)}$$
$$= -\sum_{j=0}^{k-1} q_{kj} \sum_{\{\mathbf{b}_i\}_{i \ne k}} \Big( \prod_{i \ne k} \prod_{j=0}^{i-1} q_{ij}^{b_{ij}} \Big) \Big[ \log q_{kj} + \log \prod_{i \ne k} \prod_{j=0}^{i-1} q_{ij}^{b_{ij}} \Big] \quad \text{(split the log)}$$
$$= -\sum_{j=0}^{k-1} q_{kj} \log q_{kj} \underbrace{\sum_{\{\mathbf{b}_i\}_{i \ne k}} \prod_{i \ne k} \prod_{j=0}^{i-1} q_{ij}^{b_{ij}}}_{=1} - \underbrace{\sum_{j=0}^{k-1} q_{kj}}_{=1} \sum_{\{\mathbf{b}_i\}_{i \ne k}} \Big( \prod_{i \ne k} \prod_{j=0}^{i-1} q_{ij}^{b_{ij}} \Big) \log \Big( \prod_{i \ne k} \prod_{j=0}^{i-1} q_{ij}^{b_{ij}} \Big) \quad \text{(distributive law of multiplication)}$$
$$= -\sum_{j=0}^{k-1} q_{kj} \log q_{kj} - \sum_{\{\mathbf{b}_i\}_{i \ne k}} \Big( \prod_{i \ne k} \prod_{j=0}^{i-1} q_{ij}^{b_{ij}} \Big) \log \Big( \prod_{i \ne k} \prod_{j=0}^{i-1} q_{ij}^{b_{ij}} \Big) \quad \text{(simplification)}$$
$$= \cdots \quad \text{(do the same operations as above)}$$
$$= -\sum_{i=1}^N \sum_{j=0}^{i-1} q_{ij} \log q_{ij}.$$

A.2 Closed Forms of KL Terms in the ELBO

$$\mathrm{KL}(q(\mu) \,\|\, p(\mu)) = (k - k_0)\,\psi(k) - k_0 \log(c/c_0) - k - \log\big[\Gamma(k)/\Gamma(k_0)\big] + \frac{ck}{c_0},$$
$$\mathrm{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u})) = \frac{1}{2}\Big[\mathrm{Tr}(K_{zz'}^{-1} S) + \log\frac{|K_{zz'}|}{|S|} - M + \mathbf{m}^T K_{zz'}^{-1} \mathbf{m}\Big].$$

A.3 Extra Closed Form Expressions for Eqn. (8)

$$\int_{\mathcal{T}_i} \mathbb{E}_{q(f)}(f)^2 = \int_{\mathcal{T}_i} \mathbf{m}^T K_{zz'}^{-1} K_{zx} K_{xz} K_{zz'}^{-1} \mathbf{m}\, dx = \mathbf{m}^T K_{zz'}^{-1} \Big(\int_{\mathcal{T}_i} K_{zx} K_{xz}\, dx\Big) K_{zz'}^{-1} \mathbf{m} \equiv \mathbf{m}^T K_{zz'}^{-1} \Psi_i K_{zz'}^{-1} \mathbf{m}.$$

$$\int_{\mathcal{T}_i} \mathrm{Var}_{q(f)}[f] = \int_{\mathcal{T}_i} K_{xx} - K_{xz} K_{zz'}^{-1} K_{zx} + K_{xz} K_{zz'}^{-1} S K_{zz'}^{-1} K_{zx}\, dx$$
$$= \int_{\mathcal{T}_i} \gamma - \mathrm{Tr}(K_{zz'}^{-1} K_{zx} K_{xz}) + \mathrm{Tr}(K_{zz'}^{-1} S K_{zz'}^{-1} K_{zx} K_{xz})\, dx$$
$$= \gamma \int_{\mathcal{T}_i} dx - \mathrm{Tr}\Big(K_{zz'}^{-1} \int_{\mathcal{T}_i} K_{zx} K_{xz}\, dx\Big) + \mathrm{Tr}\Big(K_{zz'}^{-1} S K_{zz'}^{-1} \int_{\mathcal{T}_i} K_{zx} K_{xz}\, dx\Big)$$
$$\equiv \gamma |\mathcal{T}_i| - \mathrm{Tr}(K_{zz'}^{-1} \Psi_i) + \mathrm{Tr}(K_{zz'}^{-1} S K_{zz'}^{-1} \Psi_i).$$

$$\Psi_i(z, z') = \int_{\mathcal{T}_i} K_{zx} K_{xz'}\, dx = \gamma^2 \int_{\mathcal{T}_i} \prod_{r=1}^R \exp\Big(-\frac{(x_r - z_r)^2}{2\alpha_r}\Big) \exp\Big(-\frac{(z'_r - x_r)^2}{2\alpha_r}\Big)\, dx$$
$$= \gamma^2 \int_{\mathcal{T}_i} \prod_{r=1}^R \exp\Big(-\frac{(z_r - z'_r)^2}{4\alpha_r}\Big) \exp\Big(-\frac{(\bar{z}_r - x_r)^2}{\alpha_r}\Big)\, dx \qquad \big(\bar{z} \equiv (z + z')/2\big)$$
$$= \gamma^2 \prod_{r=1}^R \sqrt{\alpha_r}\, \exp\Big(-\frac{(z_r - z'_r)^2}{4\alpha_r}\Big) \int_{(\bar{z}_r - \mathcal{T}_{i,r}^{\max})/\sqrt{\alpha_r}}^{(\bar{z}_r - \mathcal{T}_{i,r}^{\min})/\sqrt{\alpha_r}} \exp(-y_r^2)\, dy_r \qquad \big(y_r = (\bar{z}_r - x_r)/\sqrt{\alpha_r}\big)$$
$$= \gamma^2 \prod_{r=1}^R \frac{\sqrt{\pi \alpha_r}}{2} \exp\Big(-\frac{(z_r - z'_r)^2}{4\alpha_r}\Big) \Big[\mathrm{erf}\Big(\frac{\bar{z}_r - \mathcal{T}_{i,r}^{\min}}{\sqrt{\alpha_r}}\Big) - \mathrm{erf}\Big(\frac{\bar{z}_r - \mathcal{T}_{i,r}^{\max}}{\sqrt{\alpha_r}}\Big)\Big],$$
where we explicitly express $\mathcal{T}_i$ as a Cartesian product $\mathcal{T}_i \equiv \times_{r=1}^R [\mathcal{T}_{i,r}^{\min}, \mathcal{T}_{i,r}^{\max}]$ for $R$-dimensional data.
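A numerical sketch of the $\Psi_i$ statistic above (our own helper, broadcasting over all pairs of inducing points; the erf ordering follows the derivation above):

```python
import numpy as np
from scipy.special import erf

def psi_matrix(Z, gamma, alphas, t_min, t_max):
    """Psi_i(z, z') = int_{T_i} K(z, x) K(x, z') dx for the ARD kernel (App. A.3)."""
    Z = np.atleast_2d(Z).astype(float)
    alphas = np.asarray(alphas, float)
    zbar = 0.5 * (Z[:, None, :] + Z[None, :, :])                  # midpoints, shape (M, M, R)
    diff2 = (Z[:, None, :] - Z[None, :, :]) ** 2
    pref = (np.sqrt(np.pi * alphas) / 2.0) * np.exp(-diff2 / (4.0 * alphas))
    erfs = erf((zbar - np.asarray(t_min)) / np.sqrt(alphas)) \
         - erf((zbar - np.asarray(t_max)) / np.sqrt(alphas))
    return gamma ** 2 * np.prod(pref * erfs, axis=-1)
```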

B Additional Experiment Results

(a) Exp; (b) Cos (Triggering Kernel Value vs. Time Difference; legend: Gibbs-Hawkes, VBHP, Truth)

Figure 4: Posterior Triggering Kernels Inferred By VBHP and Gibbs Hawkes. Results of Gibbs Hawkes are obtained in 2000 iterations.

(a) VBHP; (b) Gibbs Hawkes (Relative Error vs. Iteration)

Figure 5: Convergence Rate of VBHP and Gibbs Hawkes with Different Numbers of Inducing Points. VBHP and Gibbs Hawkes measure respectively the relative error of the approximate marginal likelihood and of the posterior distribution of the Gaussian process.