Variational Inference for Sparse Gaussian Process Modulated Hawkes Process
Rui Zhang,¹,² Christian Walder,¹,² Marian-Andrei Rizoiu³
¹The Australian National University, ²Data61 CSIRO, ³University of Technology Sydney
[email protected], [email protected], [email protected]
Abstract

The Hawkes process (HP) has been widely applied to modeling self-exciting events, including neuron spikes, earthquakes and tweets. To avoid designing a parametric triggering kernel and to be able to quantify the prediction confidence, the non-parametric Bayesian HP has been proposed. However, the inference of such models suffers from unscalability or slow convergence. In this paper, we aim to solve both problems. Specifically, first, we propose a new non-parametric Bayesian HP in which the triggering kernel is modeled as a squared sparse Gaussian process. Then, we propose a novel variational inference schema for model optimization. We employ the branching structure of the HP so that maximization of the evidence lower bound (ELBO) is tractable by the expectation-maximization algorithm. We propose a tighter ELBO which improves the fitting performance. Further, we accelerate the novel variational inference schema to linear time complexity by leveraging the stationarity of the triggering kernel. Different from prior acceleration methods, ours enjoys higher efficiency. Finally, we exploit synthetic data and two large social media datasets to evaluate our method. We show that our approach outperforms state-of-the-art non-parametric frequentist and Bayesian methods. We validate the efficiency of our accelerated variational inference schema and the practical utility of our tighter ELBO for model selection. We observe that the tighter ELBO exceeds the common one in model selection.

(a) Poisson Cluster Process (b) Branching Structure

Figure 1: The Cluster Representation of a HP. (a) A HP with a decaying triggering kernel φ(·) has intensity λ(x), which increases after each new point (dashed line) is generated. It can be represented as a cluster of PPes: PP(µ), and PP(φ(x − x_i)) associated with each x_i. (b) The branching structure corresponding to the triggering relationships shown in (a), where an edge x_i → x_j means that x_i triggers x_j, and its probability is denoted as p_ji.

1 Introduction

The Hawkes process (HP) (Hawkes 1971) is particularly useful to model self-exciting point data, i.e., when the occurrence of a point increases the likelihood of occurrence of new points. The process is parameterized by a background intensity µ and a triggering kernel φ. The Hawkes process can be alternatively represented as a cluster of Poisson processes (PPes) (Hawkes and Oakes 1974). In the cluster, a PP with an intensity µ (denoted as PP(µ)) generates immigrant points, which are considered to arrive in the system from the outside, and every existing point triggers offspring points, which are generated internally through the self-excitement, following a PP(φ). Points can therefore be structured into clusters, where each cluster contains either a point and its direct offspring or the background process (an example is shown in Fig. 1a). Connecting all points using the triggering relations yields a tree structure, which is called the branching structure (an example is shown in Fig. 1b, corresponding to Fig. 1a). With the branching structure, we can decompose the HP into a cluster of PPes. The triggering kernel φ is shared among all cluster Poisson processes relating to a HP, and it determines the overall behavior of the process. Consequently, designing the kernel functions is of utmost importance for employing the HP in a new application, and its study has attracted much attention.

Prior non-parametric frequentist solutions. In the case in which the optimal triggering kernel for a particular application is unknown, a typical solution is to express it using a non-parametric form, such as the work of Lewis and Mohler (2011); Zhou, Zha, and Song (2013); Bacry and Muzy (2014); Eichler, Dahlhaus, and Dueck (2017). These are all frequentist methods and, among them, the Wiener-Hopf equation based method (Bacry and Muzy 2014) enjoys linear time complexity. Our method has a similar advantage of linear time complexity per iteration. The Euler-Lagrange equation based solutions (Lewis and Mohler 2011; Zhou, Zha, and Song 2013) require discretizing the input domain, so they scale poorly with the dimension of the domain. The same problem is faced by Eichler, Dahlhaus, and Dueck (2017)'s discretization based method. In contrast, our method requires no discretization and so scales with the dimension of the domain.

Prior Bayesian solutions. Bayesian inference for the HP has also been studied, including the work of Rasmussen (2013); Linderman and Adams (2014); Linderman and Adams (2015). These works require either constructing a parametric triggering kernel (Rasmussen 2013; Linderman and Adams 2014) or discretizing the input domain to scale with the data size (Linderman and Adams 2015). The shortcoming of discretization was just mentioned, and to overcome it, Donnet, Rivoirard, and Rousseau (2018) propose a continuous non-parametric Bayesian HP and resort to an unscalable Markov chain Monte Carlo (MCMC) estimator of the posterior distribution. A more recent solution based … the Gibbs sampling method, and the practical utility of our tighter ELBO for model selection, which outperforms the common one.

2 Prerequisite

In this section, we review the Hawkes process, the variational inference and its application to the Poisson process.

2.1 Hawkes Process (HP)

The HP (Hawkes 1971) is a self-exciting point process, in which the occurrence of a point increases the arrival rate λ(·), a.k.a. the (conditional) intensity, of new points. Given a set of time-ordered points $\mathcal{D} = \{x_i\}_{i=1}^N$, $x_i \in \mathbb{R}^R$, the intensity at $x$ conditioned on the given points is written as:

$$\lambda(x) = \mu + \sum_{x_i < x} \phi(x - x_i).$$
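To make the cluster representation concrete, below is a minimal sketch that simulates a 1-D Hawkes process on [0, T] exactly as described above: PP(µ) immigrants first, then, recursively, PP(φ) offspring of every point. The exponential kernel φ(t) = a·b·e^{−bt} and all parameter values are illustrative assumptions, not the GP-modulated kernel proposed in this paper.

```python
# A minimal sketch of the cluster (branching) construction of a Hawkes
# process, using an illustrative exponential kernel phi(t) = a*b*exp(-b*t)
# with branching ratio a < 1 (NOT the paper's GP-modulated kernel).
import numpy as np

rng = np.random.default_rng(0)

def simulate_hawkes_cluster(mu=10.0, a=0.5, b=5.0, T=np.pi):
    # PP(mu): immigrant points arrive uniformly, with Poisson(mu*T) count.
    points = list(rng.uniform(0.0, T, rng.poisson(mu * T)))
    queue = list(points)
    while queue:
        parent = queue.pop()
        # Each point spawns PP(phi) offspring: the integral of phi over
        # [0, inf) is a, so the offspring count is Poisson(a) and the
        # offspring delays are Exp(b) draws; offspring beyond T are dropped.
        for delay in rng.exponential(1.0 / b, rng.poisson(a)):
            child = parent + delay
            if child < T:
                points.append(child)
                queue.append(child)
    return np.sort(points)

def intensity(x, points, mu=10.0, a=0.5, b=5.0):
    # lambda(x) = mu + sum_{x_i < x} phi(x - x_i)
    past = points[points < x]
    return mu + np.sum(a * b * np.exp(-b * (x - past)))

events = simulate_hawkes_cluster()
print(len(events), intensity(np.pi / 2, events))
```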
[Figure 2 plots: (a) Variational Posterior Triggering Kernel (Gibbs-Hawkes, VBHP, Truth; triggering kernel value vs. time difference), (b) TELBO and (c) L2 Distance for φ (contours over GP hyper-parameters α, γ).]
Figure 2: The Relationship between the Log Marginal Likelihood and the L2 Distance. In (a), the true φ_sin (dashed green) is plotted with the median (solid) and the [0.1, 0.9] interval (filled) of the approximate posterior triggering kernel obtained by VBHP and Gibbs Hawkes (10 inducing points). It uses the maximum point of the TELBO (red star in (b)). In (c), the maximum point of the TELBO is marked; it overlaps with that of the CELBO. [0, 1.4] is used as the support of the predictive triggering kernel and 10 inducing points are used.
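The median and [0.1, 0.9] band in Fig. 2a are pointwise quantiles over posterior draws of the triggering kernel. A sketch of that summary follows; the `samples` array is a synthetic stand-in, where in practice each row would be a draw of φ from the fitted posterior.

```python
# Pointwise median and [0.1, 0.9] band of posterior triggering-kernel
# draws, as plotted in Fig. 2a. The samples are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.4, 100)                    # support of the predictive kernel
samples = np.abs(rng.normal(np.exp(-3 * grid), 0.2,  # hypothetical posterior draws
                            size=(500, grid.size)))

median = np.quantile(samples, 0.5, axis=0)
lo, hi = np.quantile(samples, [0.1, 0.9], axis=0)    # filled band in the plot
print(median[:3], lo[:3], hi[:3])
```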
[Figure 3 plots: (a) Expected TELBO (Train), (b) HLL (contours over GP hyper-parameters α, γ), (c) Time vs. Inducing Points, (d) Time vs. Data Size.]
Figure 3: (a)∼(b) The Relationship between the TELBO and the HLL; (c)∼(d) Average Fitting Time (Seconds) Per Iteration. In (a), the maximum point is marked by the red star. In (b), the maximum points of the TELBO and CELBO are marked by red and blue stars. (c) is plotted on 50 processes. (d) shows the fitting time of Gibbs Hawkes (star) and VBHP (circle) on 120 processes. 10 inducing points are used unless specified.

…tions are approximated by the Nyström method (Williams and Seeger 2001), where regular grid points are used, as in VBHP. Different from the batch training in (Zhang et al. 2019), all experiments are conducted on single sequences.

6.1 Synthetic Experiments

Synthetic Data. Our synthetic data are generated from three Hawkes processes over $\mathcal{T} = [0, \pi]$, whose triggering kernels are sin, cos and exp functions respectively, shown below, and whose background intensities are the same, µ = 10:

$$\phi_{\sin}(x) = 0.9[\sin(3x) + 1],\ x \in [0, \pi/2];\ \text{otherwise } 0;$$
$$\phi_{\cos}(x) = \cos(2x) + 1,\ x \in [0, \pi/2];\ \text{otherwise } 0;$$
$$\phi_{\exp}(x) = 5\exp(-5x),\ x \in [0, \infty).$$

As a result, for any generated sequence, say $\{x_i\}_{i=1}^N$, $\mathcal{T}_i = [0, \pi - x_i]$ is used in the CELBO and the TELBO.

Model Selection. As the marginal likelihood p(D|θ) is a key advantage of our method over non-Bayesian approaches (Zhou, Zha, and Song 2013; Bacry and Muzy 2014), we investigate its efficacy for model selection. Fig. 2b shows the contour plot of the approximate log marginal likelihood (the TELBO) of a sequence. We observe that the contour plot of the TELBO agrees with the contour plot of L2(φ) (Fig. 2c): GP hyper-parameters with relatively high marginal likelihoods have relatively low L2 errors. Fig. 2a plots the posterior triggering kernel corresponding to the maximal approximate marginal likelihood. Similar agreement is also observed between the TELBO and the HLL (Fig. 3a, 3b). This demonstrates the practical utility of both the marginal likelihood itself and our approximation of it.

Evaluation. To evaluate VBHP on synthetic data, 20 sequences are drawn from each model and 100 pairs of train and test sequences are drawn from each sample to compute the HLL. We select GP hyper-parameters of Gibbs Hawkes and of VBHP by maximizing the approximate marginal likelihoods. Table 2 shows evaluations for the baselines and VBHP (using 10 inducing points as a trade-off between accuracy and time, as does Gibbs Hawkes) in both L2 and HLL. VBHP achieves the best performance but is two orders of magnitude slower than Gibbs Hawkes per iteration (shown in Fig. 3c and 3d). The TELBO performs closely to the CELBO in the L2 error, and this is also reflected in Fig. 2c, where the maximum points of the TELBO and the CELBO overlap.

| Measure | Data | SumExp | ODE | WH | Gibbs Hawkes | VBHP (C) | VBHP (T) |
|---|---|---|---|---|---|---|---|
| L2 | Sin φ | 0.693±0.028 | 0.665±0.121 | 2.463±0.145 | 0.408±0.198 | 0.152±0.091 | 0.183±0.076 |
| | Sin µ | 2.968±1.640 | 4.514±3.808 | 6.794±5.054 | 4.108±3.949 | 0.640±0.528 | 0.579±0.523 |
| | Cos φ | 0.473±0.102 | 0.697±0.065 | 1.743±0.083 | 0.667±0.686 | 0.325±0.073 | 0.292±0.096 |
| | Cos µ | 2.751±1.902 | 7.030±5.662 | 6.099±4.613 | 4.685±4.421 | 0.555±0.294 | 0.515±0.293 |
| | Exp φ | 0.133±0.138 | 1.835±0.539 | 2.254±2.042 | 0.676±0.233 | 0.257±0.086 | 0.235±0.102 |
| | Exp µ | 3.290±1.991 | 8.969±8.604 | 16.66±20.95 | 7.648±9.647 | 0.471±0.432 | 0.486±0.418 |
| HLL | Sin | 3.490±0.400 | 3.489±0.413 | 3.233±0.273 | 3.492±0.406 | 3.488±0.400 | 3.497±0.406 |
| | Cos | 3.874±0.544 | 3.872±0.552 | 3.613±0.373 | 3.871±0.562 | 3.876±0.541 | 3.878±0.548 |
| | Exp | 2.825±0.481 | 2.822±0.496 | 2.782±0.490 | 2.826±0.492 | 2.826±0.491 | 2.829±0.487 |
| | ACTIVE | 1.692±1.371 | 0.880±2.716 | 0.710±0.943 | 1.323±2.160 | 1.824±1.159 | 1.867±1.181 |
| | SEISMIC | 2.943±0.959 | 2.582±1.665 | 1.489±1.796 | 3.110±1.251 | 3.143±0.895 | 3.164±0.843 |

Table 2: Results on Synthetic and Real World Data (Mean ± One Standard Deviation). VBHP (C) and (T) use the CELBO and the TELBO to update the hyper-parameters, respectively.
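For reference, a sketch of the three synthetic kernels above and the L2 error reported in Table 2, computed here with a crude trapezoidal rule; the estimate `phi_hat` is a hypothetical stand-in for a fitted triggering kernel.

```python
# Ground-truth synthetic kernels and the L2 distance used in Table 2.
# phi_hat below is a hypothetical stand-in for a fitted triggering kernel.
import numpy as np

def phi_sin(x):
    return np.where(x <= np.pi / 2, 0.9 * (np.sin(3 * x) + 1), 0.0)

def phi_cos(x):
    return np.where(x <= np.pi / 2, np.cos(2 * x) + 1, 0.0)

def phi_exp(x):
    return 5 * np.exp(-5 * x)

def l2_distance(phi_a, phi_b, lo=0.0, hi=np.pi, n=10_000):
    # sqrt of the integral of (phi_a - phi_b)^2, by the trapezoidal rule
    x = np.linspace(lo, hi, n)
    return np.sqrt(np.trapz((phi_a(x) - phi_b(x)) ** 2, x))

phi_hat = lambda x: 4.5 * np.exp(-4.5 * x)   # hypothetical estimate
print(l2_distance(phi_hat, phi_exp))
```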
In contrast, the TELBO consistently improves the performance of VBHP in the HLL, which is also reflected in Fig. 3b, where hyper-parameters selected by the TELBO tend to have a higher HLL. Interestingly, when the parametric model SumExp uses the same triggering kernel (a single exponential function) as the ground truth φ_exp, SumExp fits φ_exp best in L2 distance, while, due to learning on single sequences, its background intensity has relatively high errors. Although our method is not aware of the parametric family of the ground truth, it performs well. Compared with non-parametric frequentist methods, which have strong fitting capacity but suffer from noisy data and have difficulties with hyper-parameter selection, our Bayesian solution overcomes these disadvantages and achieves better performance.

6.2 Real World Experiments

Real World Data. We conclude our experiments with two large-scale tweet datasets. ACTIVE (Rizoiu et al. 2018) is a tweet dataset which was collected in 2014 and contains ∼41k (re)tweet temporal point processes with links to YouTube videos. Each sequence contains at least 20 (re)tweets. SEISMIC (Zhao et al. 2015) is a large-scale tweet dataset which was collected from October 7 to November 7, 2011, and contains ∼166k (re)tweet temporal point processes. Each sequence contains at least 50 (re)tweets.

Evaluation. Similarly to the synthetic experiments, we evaluate the fitting performance by averaging the HLL of 20 test sequences randomly drawn from each original datum. We scale all original data to $\mathcal{T} = [0, \pi]$ (leading to $\mathcal{T}_i = [0, \pi - x_i]$ being used in the CELBO and the TELBO for a sequence $\{x_i\}_{i=1}^N$) and employ 10 inducing points to balance time and accuracy. The model selection is performed by maximizing the approximate marginal likelihood. The obtained results are shown in Table 2. Again, we observe similar predictive performance of VBHP: the TELBO performs better than the CELBO, and VBHP achieves the best scores. This demonstrates that our Bayesian model and novel VI schema are useful for flexible real-life data.

Fitting Time. We further evaluate the fitting speed² of VBHP and Gibbs Hawkes on synthetic and real-world point processes, which is summarized in Fig. 3c and 3d. The fitting time is averaged over iterations, and we observe that the increasing trends with the number of inducing points and with the data size are similar between Gibbs Hawkes and VBHP. Although VBHP is significantly slower than Gibbs Hawkes per iteration, VBHP converges faster, in 10∼20 iterations (Fig. 5 (App. 2019)), giving an average convergence time of 549 seconds for a sequence of 1000 events, compared to 699 seconds for Gibbs Hawkes. The slope of VBHP in Fig. 3d is 1.04 (log scale) and the correlation coefficient is 0.96, so we conclude that the fitting time is linear in the data size.

7 Conclusions

In this paper, we presented a new Bayesian non-parametric Hawkes process whose triggering kernel is modulated by a sparse Gaussian process and whose background intensity is Gamma distributed. We provided a novel VI schema for such a model: we employed the branching structure so that the common ELBO is maximized by the expectation-maximization algorithm, and we contributed a tighter ELBO which performs better in model selection than the common one. To address the difficulty of scaling with the data size, we utilize the stationarity of the triggering kernel to reduce the number of possible parents for each point. Different from prior acceleration methods, ours enjoys higher efficiency. On synthetic data and two large Twitter diffusion datasets, VBHP enjoys linear fitting time with the data size and a fast convergence rate, and provides more accurate predictions than those of state-of-the-art approaches. The novel ELBO is also demonstrated to exceed the common one in model selection.

8 Acknowledgements

We would like to thank Dongwoo Kim for helpful discussions. Rui is supported by the Australian National University and Data61 CSIRO PhD scholarships.
² The CPU we use is an Intel(R) Core(TM) i7-7700 @ 3.60GHz and the language is Python 3.6.5.

References

Ancarani, L. U., and Gasaneo, G. 2008. Derivatives of any order of the confluent hypergeometric function ₁F₁(a, b, z) with respect to the parameter a or b. Journal of Mathematical Physics 49(6):063508.

App. 2019. Appendix: Variational inference for sparse Gaussian process modulated Hawkes process. In the supplementary material.

Attias, H. 1999. Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 21–30. Morgan Kaufmann Publishers Inc.

Bacry, E., and Muzy, J.-F. 2014. Second order statistics characterization of Hawkes processes and non-parametric estimation. arXiv preprint arXiv:1401.0903.

Bacry, E.; Bompaire, M.; Gaïffas, S.; and Poulsen, S. 2017. tick: a Python library for statistical learning, with a particular emphasis on time-dependent modeling. arXiv e-prints.

Donnet, S.; Rivoirard, V.; and Rousseau, J. 2018. Nonparametric Bayesian estimation of multivariate Hawkes processes. arXiv preprint arXiv:1802.05975.

Eichler, M.; Dahlhaus, R.; and Dueck, J. 2017. Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis 38(2):225–242.

Halpin, P. F. 2012. An EM algorithm for Hawkes process. Psychometrika 2.

Hawkes, A. G., and Oakes, D. 1974. A cluster process representation of a self-exciting process. Journal of Applied Probability 11(3):493–503.

Hawkes, A. G. 1971. Spectra of some self-exciting and mutually exciting point processes. Biometrika 58(1):83–90.

Jordan, M. I.; Ghahramani, Z.; Jaakkola, T. S.; and Saul, L. K. 1999. An introduction to variational methods for graphical models. Machine Learning 37(2):183–233.

Lewis, E., and Mohler, G. 2011. A nonparametric EM algorithm for multiscale Hawkes processes. Journal of Nonparametric Statistics 1(1):1–20.

Linderman, S., and Adams, R. 2014. Discovering latent network structure in point process data. In International Conference on Machine Learning, 1413–1421.

Linderman, S. W., and Adams, R. P. 2015. Scalable Bayesian inference for excitatory point process networks. arXiv preprint arXiv:1507.03228.

Lloyd, C.; Gunter, T.; Osborne, M.; and Roberts, S. 2015. Variational inference for Gaussian process modulated Poisson processes. In ICML, 1814–1822.

Rasmussen, J. G. 2013. Bayesian inference for Hawkes processes. Methodology and Computing in Applied Probability 15(3):623–642.

Rizoiu, M.-A.; Mishra, S.; Kong, Q.; Carman, M.; and Xie, L. 2018. SIR-Hawkes: Linking epidemic models and Hawkes processes to model diffusions in finite populations. In Proceedings of the 27th International Conference on World Wide Web, 419–428. International World Wide Web Conferences Steering Committee.

Snelson, E., and Ghahramani, Z. 2006. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, 1257–1264.

Williams, C. K. I., and Seeger, M. 2001. Using the Nyström method to speed up kernel machines. In Leen, T. K.; Dietterich, T. G.; and Tresp, V., eds., Advances in Neural Information Processing Systems 13. MIT Press. 682–688.

Zhang, R.; Walder, C.; Rizoiu, M.-A.; and Xie, L. 2019. Efficient non-parametric Bayesian Hawkes processes. In International Joint Conference on Artificial Intelligence (IJCAI'19).

Zhao, Q.; Erdogdu, M. A.; He, H. Y.; Rajaraman, A.; and Leskovec, J. 2015. SEISMIC: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1513–1522. ACM.

Zhou, K.; Zha, H.; and Song, L. 2013. Learning triggering kernels for multi-dimensional Hawkes processes. In International Conference on Machine Learning, 1301–1309.

Accompanying the submission "Variational Inference for Sparse Gaussian Process Modulated Hawkes Process".
A Omitted Derivations

A.1 Deriving Eqn. (7)

As per Eqn. (2), there is

$$\mathrm{CELBO}\big(q(B,\mu,f,u),\, p(\mathcal{D}\mid B,\mu,f,u),\, p(B,\mu,f,u)\big) = \mathbb{E}_{q(B,\mu,f)}[\log p(\mathcal{D}\mid B,\mu,f)] - \mathrm{KL}\big(q(B,\mu,f,u)\,\|\,p(B,\mu,f,u)\big),$$

where the KL term can be simplified as

$$\begin{aligned}
\mathrm{KL}\big(q(B,\mu,f,u)\,\|\,p(B,\mu,f,u)\big)
&= \sum_B \iiint q(B,\mu,f,u)\, \log\frac{q(B)\,q(\mu)\,p(f\mid u)\,q(u)}{p(B)\,p(\mu)\,p(f\mid u)\,p(u)}\; du\, df\, d\mu && \text{(Eqn. (6) and Bayes' rule)}\\
&= \sum_B \iiint q(B,\mu,f,u)\, du\, df\, d\mu\; \log\frac{q(B)}{p(B)}
 + \sum_B \iiint q(B,\mu,f,u)\, du\, df\; \log\frac{q(\mu)}{p(\mu)}\, d\mu \\
&\quad + \sum_B \iiint q(B,\mu,f,u)\, df\, d\mu\; \log\frac{q(u)}{p(u)}\, du && \text{(simplification)}\\
&= \sum_B q(B)\log\frac{q(B)}{p(B)} + \int q(\mu)\log\frac{q(\mu)}{p(\mu)}\, d\mu + \int q(u)\log\frac{q(u)}{p(u)}\, du && \text{(simplification)}\\
&= \mathrm{KL}\big(q(B)\,\|\,p(B)\big) + \mathrm{KL}\big(q(\mu)\,\|\,p(\mu)\big) + \mathrm{KL}\big(q(u)\,\|\,p(u)\big).
\end{aligned}$$

We rewrite the reconstruction term together with the KL term w.r.t. $B$ using the likelihood $p(\mathcal{D}, B \mid \mu, f)$:

$$\begin{aligned}
&\mathbb{E}_{q(B,\mu,f)}[\log p(\mathcal{D}\mid B,\mu,f)] - \mathrm{KL}\big(q(B)\,\|\,p(B)\big) \\
&= \sum_B \iint q(B,\mu,f)\, \log p(\mathcal{D}\mid B,\mu,f)\; df\, d\mu \;-\; \sum_B q(B)\log\frac{q(B)}{p(B)} && \text{(definition)}\\
&= \sum_B \iint q(B,\mu,f)\, \log p(\mathcal{D}\mid B,\mu,f)\; df\, d\mu \;-\; \sum_B \iint q(B,\mu,f)\, \log\frac{q(B)}{p(B)}\; df\, d\mu && \text{(align probabilities)}\\
&= \sum_B \iint q(B,\mu,f)\, \log\frac{p(\mathcal{D}\mid B,\mu,f)\, p(B)}{q(B)}\; df\, d\mu && \text{(merge)}\\
&= \sum_B \iint q(B,\mu,f)\, \log p(\mathcal{D}, B\mid \mu,f)\; df\, d\mu \;-\; \sum_B q(B)\log q(B) && \text{(merge)}\\
&= \mathbb{E}_{q(B,\mu,f)}\big[\log p(\mathcal{D}, B\mid \mu,f)\big] + H_B,
\end{aligned}$$

where $H_B = -\sum_B q(B)\log q(B)$ is the entropy of $B$ and is further computed as follows. We adopt $q(B)$ from Eqn. (5), and the closed form expression of $H_B$ is derived as
$$\begin{aligned}
H_B &= -\sum_{\{b_i\}_{i=1}^N}\; \prod_{i=1}^N\prod_{j=0}^{i-1} q_{ij}^{b_{ij}}\; \log \prod_{i=1}^N\prod_{j=0}^{i-1} q_{ij}^{b_{ij}} \\
&= -\sum_{j=0}^{k-1} q_{kj} \sum_{\{b_i\}_{i\neq k}}\; \prod_{i\neq k}\prod_{j=0}^{i-1} q_{ij}^{b_{ij}}\; \log\Big[ q_{kj} \prod_{i\neq k}\prod_{j=0}^{i-1} q_{ij}^{b_{ij}} \Big] && \text{(split the summation)}\\
&= -\sum_{j=0}^{k-1} q_{kj} \sum_{\{b_i\}_{i\neq k}}\; \prod_{i\neq k}\prod_{j=0}^{i-1} q_{ij}^{b_{ij}}\; \Big[ \log q_{kj} + \log \prod_{i\neq k}\prod_{j=0}^{i-1} q_{ij}^{b_{ij}} \Big] && \text{(split the log)}\\
&= -\sum_{j=0}^{k-1} q_{kj}\log q_{kj} \underbrace{\sum_{\{b_i\}_{i\neq k}} \prod_{i\neq k}\prod_{j=0}^{i-1} q_{ij}^{b_{ij}}}_{=1} \;-\; \underbrace{\sum_{j=0}^{k-1} q_{kj}}_{=1} \sum_{\{b_i\}_{i\neq k}} \prod_{i\neq k}\prod_{j=0}^{i-1} q_{ij}^{b_{ij}}\, \log \prod_{i\neq k}\prod_{j=0}^{i-1} q_{ij}^{b_{ij}} && \text{(distributive law of multiplication)}\\
&= -\sum_{j=0}^{k-1} q_{kj}\log q_{kj} \;-\; \sum_{\{b_i\}_{i\neq k}} \prod_{i\neq k}\prod_{j=0}^{i-1} q_{ij}^{b_{ij}}\, \log \prod_{i\neq k}\prod_{j=0}^{i-1} q_{ij}^{b_{ij}} && \text{(simplification)}\\
&= \cdots && \text{(repeat the same operations)}\\
&= -\sum_{i=1}^N \sum_{j=0}^{i-1} q_{ij}\log q_{ij}.
\end{aligned}$$
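A quick numerical check of this closed form, enumerating all branching structures for a small N; the responsibilities q are random but row-normalized. This is a verification sketch only, not part of the inference code.

```python
# Verify H_B = -sum_ij q_ij log q_ij against brute-force enumeration of
# all branching structures B for a small N (verification sketch only).
import itertools
import numpy as np

rng = np.random.default_rng(2)
N = 4
# q[i] holds q_ij for the i-th point's i+1 candidate parents
# (background plus the i earlier points), normalized to sum to one.
q = [rng.dirichlet(np.ones(i + 1)) for i in range(N)]

closed = -sum(float(np.sum(qi * np.log(qi))) for qi in q)

brute = 0.0
for parents in itertools.product(*[range(i + 1) for i in range(N)]):
    p = np.prod([q[i][j] for i, j in enumerate(parents)])  # P(B)
    brute -= p * np.log(p)

assert np.isclose(closed, brute), (closed, brute)
print(closed, brute)
```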
A.2 Closed Forms of KL Terms in the ELBO

$$\mathrm{KL}\big(q(\mu)\,\|\,p(\mu)\big) = (k - k_0)\psi(k) - k_0\log(c/c_0) - \log\big[\Gamma(k)/\Gamma(k_0)\big] - k + ck/c_0,$$

$$\mathrm{KL}\big(q(u)\,\|\,p(u)\big) = \tfrac{1}{2}\big[\mathrm{Tr}(K_{zz'}^{-1}S) + \log(|K_{zz'}|/|S|) - M + m^\top K_{zz'}^{-1} m\big].$$

A.3 Extra Closed Form Expressions for Eqn. (8)

$$\int_{\mathcal{T}_i} \big[\mathbb{E}_{q(f)}(f)\big]^2\, dx
= \int_{\mathcal{T}_i} m^\top K_{zz'}^{-1} K_{zx}K_{xz} K_{zz'}^{-1} m\, dx
= m^\top K_{zz'}^{-1} \Big(\int_{\mathcal{T}_i} K_{zx}K_{xz}\, dx\Big) K_{zz'}^{-1} m
\equiv m^\top K_{zz'}^{-1} \Psi_i K_{zz'}^{-1} m.$$

$$\begin{aligned}
\int_{\mathcal{T}_i} \mathrm{Var}_{q(f)}[f]\, dx
&= \int_{\mathcal{T}_i} K_{xx} - K_{xz}K_{zz'}^{-1}K_{zx} + K_{xz}K_{zz'}^{-1}SK_{zz'}^{-1}K_{zx}\, dx \\
&= \int_{\mathcal{T}_i} \gamma - \mathrm{Tr}(K_{zz'}^{-1}K_{zx}K_{xz}) + \mathrm{Tr}(K_{zz'}^{-1}SK_{zz'}^{-1}K_{zx}K_{xz})\, dx \qquad (\text{with } K_{xx} = \gamma)\\
&= \gamma\int_{\mathcal{T}_i} dx - \mathrm{Tr}\Big(K_{zz'}^{-1}\int_{\mathcal{T}_i}K_{zx}K_{xz}\,dx\Big) + \mathrm{Tr}\Big(K_{zz'}^{-1}SK_{zz'}^{-1}\int_{\mathcal{T}_i}K_{zx}K_{xz}\,dx\Big) \\
&\equiv \gamma|\mathcal{T}_i| - \mathrm{Tr}(K_{zz'}^{-1}\Psi_i) + \mathrm{Tr}(K_{zz'}^{-1}SK_{zz'}^{-1}\Psi_i).
\end{aligned}$$

$$\begin{aligned}
\Psi_i(z, z') &= \int_{\mathcal{T}_i} K_{zx}K_{xz'}\, dx
= \gamma^2 \int_{\mathcal{T}_i} \prod_{r=1}^R \exp\Big(\!-\frac{(x_r - z_r)^2}{2\alpha_r}\Big)\exp\Big(\!-\frac{(z'_r - x_r)^2}{2\alpha_r}\Big)\, dx \\
&= \gamma^2 \int_{\mathcal{T}_i} \prod_{r=1}^R \exp\Big(\!-\frac{(z_r - z'_r)^2}{4\alpha_r}\Big)\exp\Big(\!-\frac{(\bar{z}_r - x_r)^2}{\alpha_r}\Big)\, dx \qquad \big(\bar{z}_r \equiv (z_r + z'_r)/2\big)\\
&= \gamma^2 \prod_{r=1}^R \exp\Big(\!-\frac{(z_r - z'_r)^2}{4\alpha_r}\Big) \int_{\mathcal{T}_{i,r}} \exp\Big(\!-\frac{(\bar{z}_r - x_r)^2}{\alpha_r}\Big)\, dx_r \qquad \big(y_r = (\bar{z}_r - x_r)/\sqrt{\alpha_r}\big)\\
&= \gamma^2 \prod_{r=1}^R \sqrt{\alpha_r}\, \exp\Big(\!-\frac{(z_r - z'_r)^2}{4\alpha_r}\Big) \int_{(\bar{z}_r - \mathcal{T}_{i,r}^{\max})/\sqrt{\alpha_r}}^{(\bar{z}_r - \mathcal{T}_{i,r}^{\min})/\sqrt{\alpha_r}} \exp(-y_r^2)\, dy_r \\
&= \gamma^2 \prod_{r=1}^R \frac{\sqrt{\pi\alpha_r}}{2}\, \exp\Big(\!-\frac{(z_r - z'_r)^2}{4\alpha_r}\Big) \Big[\mathrm{erf}\Big(\frac{\bar{z}_r - \mathcal{T}_{i,r}^{\min}}{\sqrt{\alpha_r}}\Big) - \mathrm{erf}\Big(\frac{\bar{z}_r - \mathcal{T}_{i,r}^{\max}}{\sqrt{\alpha_r}}\Big)\Big],
\end{aligned}$$

where we explicitly express $\mathcal{T}_i$ as a Cartesian product $\mathcal{T}_i \equiv \times_{r=1}^R [\mathcal{T}_{i,r}^{\min}, \mathcal{T}_{i,r}^{\max}]$ for $R$-dimensional data.
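The erf closed form for Ψ_i can be checked numerically. Below is a 1-D sketch comparing it against brute-force quadrature of the integrand; the values of γ, α, the bounds [a, b] and the inducing points are illustrative assumptions.

```python
# Numerical check of the closed-form Psi_i in 1-D (a sketch; gamma, alpha
# and the integration bounds are illustrative values, not fitted ones).
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

gamma_, alpha_ = 2.0, 0.3          # kernel variance and length-scale
a, b = 0.0, 1.4                    # T_i = [a, b]
z, zp = 0.4, 0.9                   # a pair of inducing points

def k(x, z):
    # RBF kernel with k(x, x) = gamma, matching K_xx = gamma above.
    return gamma_ * np.exp(-(x - z) ** 2 / (2 * alpha_))

# Brute-force integral of K_zx * K_xz' over T_i.
numeric, _ = quad(lambda x: k(x, z) * k(x, zp), a, b)

# Closed form from Appendix A.3 (R = 1).
zbar = (z + zp) / 2
closed = (gamma_ ** 2 * np.sqrt(np.pi * alpha_) / 2
          * np.exp(-(z - zp) ** 2 / (4 * alpha_))
          * (erf((zbar - a) / np.sqrt(alpha_))
             - erf((zbar - b) / np.sqrt(alpha_))))

assert np.isclose(numeric, closed), (numeric, closed)
print(numeric, closed)  # the two values agree to quadrature precision
```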
B Additional Experiment Results

[Figure 4 plots: triggering kernel value vs. time difference for Gibbs-Hawkes, VBHP and the truth; panels (a) Exp, (b) Cos.]
Figure 4: Posterior Triggering Kernels Inferred By VBHP and Gibbs Hawkes. Results of Gibbs Hawkes are obtained in 2000 iterations.
[Figure 5 plots: relative error vs. iteration for (a) VBHP and (b) Gibbs Hawkes, with 5, 10, 15, 20, 25 and 30 inducing points.]
Figure 5: Convergence Rate of VBHP and Gibbs Hawkes with Different Numbers of Inducing Points. VBHP and Gibbs Hawkes measure respectively the relative error of the approximate marginal likelihood and of the posterior distribution of the Gaussian process.
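The convergence behavior in Fig. 5 can be monitored with a simple relative-error rule on successive objective values; a sketch follows, where `elbo_trace` is a dummy stand-in for per-iteration approximate marginal likelihood (VBHP) or posterior (Gibbs Hawkes) values.

```python
# Relative-error convergence check of the kind plotted in Fig. 5.
# elbo_trace is a dummy stand-in for per-iteration objective values.
import numpy as np

elbo_trace = -1000.0 / (1.0 + np.arange(40)) - 500.0   # dummy increasing trace
tol = 1e-2

for t in range(1, len(elbo_trace)):
    rel_err = abs(elbo_trace[t] - elbo_trace[t - 1]) / abs(elbo_trace[t - 1])
    if rel_err < tol:
        print(f"converged at iteration {t}, relative error {rel_err:.2e}")
        break
```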