
Variational Inference for Gaussian Process Modulated Poisson Processes

Chris Lloyd* [email protected]
Tom Gunter* [email protected]
Michael A. Osborne [email protected]
Stephen J. Roberts [email protected]

Department of Engineering Science, University of Oxford. * Corresponding authors.

Abstract

We present the first fully variational Bayesian inference scheme for continuous Gaussian-process-modulated Poisson processes. Such point processes are used in a variety of domains, including neuroscience, geo-statistics and astronomy, but their use is hindered by the computational cost of existing inference schemes. Our scheme: requires no discretisation of the domain; scales linearly in the number of observed events; and is many orders of magnitude faster than previous sampling-based approaches. The resulting algorithm is shown to outperform standard methods on synthetic examples, coal mining disaster data and in the prediction of Malaria incidences in Kenya.

1. Introduction

Sparse events defined over a continuous domain arise in a variety of real-world applications. The geospatial spread of disease through time, for example, may be viewed as a set of infections which occur in three-dimensional space-time. In this work, we will consider data where the intensity (or average incidence rate) of the event-generating process is assumed to vary smoothly over the domain. A popular model for such data is the inhomogeneous Poisson process with a Gaussian process model for the smoothly-varying intensity function. This flexible approach has been adopted for applications in neuroscience (Cunningham et al., 2008b), finance (Basu & Dassios, 2002) and forestry (Heikkinen & Arjas, 1999).

However, existing inference schemes for such models scale poorly with the number of data, preventing them from finding greater use. The use of a full Gaussian process in modelling the intensity (Adams et al., 2009) incurs prohibitive O(N^3) computational scaling in the number of data points, N. To tackle this problem in practice, many approaches (Rathbun & Cressie, 1994; Møller et al., 1998) discretise the domain, binning counts within each segment. This approach enabled Cunningham et al. (2008a) to achieve O(N log N) performance. However, the discretisation approach suffers from poor scaling with the dimension of the domain and sensitivity to the choice of discretisation.

We introduce a new model for Gaussian-process-modulated Poisson processes that eliminates the requirement for discretisation, while simultaneously delivering O(N) scaling. We further introduce the first fully variational Bayesian inference scheme for such models, allowing computation many orders of magnitude faster than existing schemes. This approach, which we term Variational Bayes for Point Processes (VBPP), is shown to provide more accurate prediction than benchmarks on held-out data from datasets including synthetic examples, coal mining disaster data and Malaria incidences in Kenya. The power of our approach suggests many future applications: in particular, our fully generative model will permit the joint inference of real-valued covariates (such as log-rainfall) and a point process (such as disease outbreaks).

2. Cox Processes

Formally, a Cox process—a particular type of inhomogeneous Poisson process—is defined via a stochastic intensity function λ(x) : X → R^+. For a domain X = R^R of arbitrary dimension R, the number of points, N(T), found in a subregion T ⊂ X is Poisson distributed with parameter λ_T = ∫_T λ(x) dx—where dx indicates integration with respect to the Lebesgue measure over the domain—and for disjoint subsets T_i of X, the counts N(T_i) are independent. This independence is due to the completely independent nature of points in a Poisson process (Kingman, 1993).
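To make this generative definition concrete, the sketch below (our own illustration, not code from the paper; the intensity function and its upper bound are arbitrary choices) draws events from an inhomogeneous Poisson process on a bounded one-dimensional region by thinning a homogeneous process whose rate upper-bounds the intensity.

import numpy as np

def sample_inhomogeneous_pp(lambda_fn, lambda_max, t_max, seed=0):
    # Thinning (Lewis-Shedler): simulate a homogeneous process with rate
    # lambda_max on [0, t_max], then keep each point x with probability
    # lambda(x) / lambda_max.
    rng = np.random.default_rng(seed)
    n = rng.poisson(lambda_max * t_max)           # N(T) ~ Poisson(lambda_max * |T|)
    candidates = rng.uniform(0.0, t_max, size=n)
    keep = rng.uniform(size=n) < lambda_fn(candidates) / lambda_max
    return np.sort(candidates[keep])

# Example: a smoothly varying intensity on T = [0, 10].
events = sample_inhomogeneous_pp(lambda x: 2.0 + 1.5 * np.sin(x), lambda_max=3.5, t_max=10.0)
print(len(events), "events")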

If we restrict our consideration to some bounded region, T, the probability density of a set of N observed points, D = {x^(n) ∈ T}_{n=1}^N, conditioned on the rate function λ(x) is

$$p(\mathcal{D} \mid \lambda) = \exp\!\Big(-\!\int_{\mathcal{T}} \lambda(x)\,dx\Big) \prod_{n=1}^{N} \lambda\big(x^{(n)}\big). \qquad (1)$$

We use |T| to denote the measure of the continuous domain T. In this work we will assume T is a hyper-rectangular subset of R^R with boundaries T_r^min and T_r^max in each dimension r and

$$|\mathcal{T}| = \int_{\mathcal{T}} 1\,dx = \prod_{r=1}^{R} \big(\mathcal{T}_r^{\max} - \mathcal{T}_r^{\min}\big). \qquad (2)$$

Using Bayes' rule, the posterior distribution of the rate function conditioned on the data, p(λ | D), is

$$p(\lambda \mid \mathcal{D}) = \frac{p(\lambda)\, \exp\!\big(-\!\int_{\mathcal{T}} \lambda(x)\,dx\big) \prod_{n=1}^{N} \lambda\big(x^{(n)}\big)}{\int p(\lambda)\, \exp\!\big(-\!\int_{\mathcal{T}} \lambda(x)\,dx\big) \prod_{n=1}^{N} \lambda\big(x^{(n)}\big)\, d\lambda}, \qquad (3)$$

which is often described as "doubly-intractable" because of the nested integral in the denominator.

2.1. Inferring Intensity Functions

To overcome the challenges posed by the doubly intractable integral, Adams et al. (2009) propose the Sigmoidal Gaussian Cox Process (SGCP). In the SGCP, a Gaussian process (Rasmussen & Williams, 2006) is used to construct an intensity function prior by passing a random function, f ~ GP, through a sigmoid transformation and scaling it with a maximum intensity λ*. The intensity function is therefore λ(x) = λ* σ(f(x)), where σ(·) is the logistic sigmoid (squashing) function

$$\sigma(x) = \frac{1}{1 + \exp(-x)}. \qquad (4)$$

To remove the inner intractable integral, the authors augment the variable set to include latent data, such that the joint distribution of the latent and observed data is uniform Poisson over the region T. While this model works well in practice on small, sparse event data in one dimension, in reality it scales poorly with both the dimensionality of the domain and the maximum observed density of points. This is due to: the incorporation of latent, or thinned, data, whose number grows exponentially with the dimensionality of the space; and an O(N^3) cost in the number N of all data (thinned or otherwise).

In Gunter et al. (2014a), the authors go some way towards improving the scalability of the SGCP by introducing a further set of latent variables such that the entire space need no longer be thinned uniformly. Instead, they thin to a piecewise uniform Poisson process, maintaining the tractability of the inner integral and allowing the model to scale to higher dimensional point processes. The authors term this approach "adaptive thinning".

In Lasko (2014) the author performs renewal process inference without thinning the domain, by making use of a positively transformed intensity function. The intractability of their chosen approach forces them to resort to numerical integration techniques, however, and Bayesian inference is still performed using computationally expensive sampling.

3. Model

We construct our prior over the rate function using a Gaussian process. Rather than using a squashing function, we will assume¹ the intensity function is simply defined as λ(x) = f^2(x), x ∈ T, where f is a Gaussian-process-distributed random function, achieving a non-negative prior (Gunter et al., 2014b). Furthermore we will assume that f is dependent on a set of inducing points Z = {z^(m) ∈ T}_{m=1}^M. We denote the evaluation of f at these points u, and note u has distribution u ~ N(1̄ū, K_zz). Using this formulation f | u ~ GP(µ(x), Σ(x, x')) has mean and covariance functions

$$\mu(x) = k_{xz} K_{zz}^{-1} u, \qquad (5)$$
$$\Sigma(x, x') = K_{xx'} - k_{xz} K_{zz}^{-1} k_{zx'}, \qquad (6)$$

where k_xz, K_xx', K_zz are matrices evaluated at x, x' and Z using an appropriate kernel. We use the exponentiated quadratic (also known as the "squared exponential") ARD kernel

$$K(x, x') = \gamma \prod_{r=1}^{R} \exp\!\Big(-\frac{(x_r - x'_r)^2}{2\alpha_r}\Big). \qquad (7)$$

With this hierarchical formulation the joint distribution over D, f, u and Θ is

$$p(\mathcal{D}, f, u, \Theta) = p(\mathcal{D} \mid \lambda = f^2)\, p(f \mid u, \Theta)\, p(u \mid \Theta)\, p(\Theta), \qquad (8)$$

where p(Θ) is an optional prior on the set of model parameters Θ = {γ, α_1, ..., α_R, ū}. For notational convenience we will often omit conditioning on Θ.

¹See Section 5 for a detailed motivation of this choice.
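As a concrete illustration of the model above (our own sketch under assumed parameter values, not the authors' implementation), the following evaluates the ARD kernel of Equation 7, the conditional mean and variance of Equations 5-6 given inducing values u drawn from their prior, and the implied intensity λ(x) = f^2(x) evaluated at the conditional mean.

import numpy as np

def ard_kernel(X, Z, gamma=1.0, alpha=np.array([0.5])):
    # Equation 7: gamma * prod_r exp(-(x_r - z_r)^2 / (2 alpha_r))
    d2 = (X[:, None, :] - Z[None, :, :]) ** 2 / alpha[None, None, :]
    return gamma * np.exp(-0.5 * d2.sum(-1))

rng = np.random.default_rng(0)
Z = np.linspace(0.0, 10.0, 8)[:, None]                 # inducing points z^(m), M = 8
X = np.linspace(0.0, 10.0, 200)[:, None]               # evaluation locations x

Kzz = ard_kernel(Z, Z) + 1e-8 * np.eye(len(Z))         # jitter for numerical stability
Kxz = ard_kernel(X, Z)
u = rng.multivariate_normal(np.ones(len(Z)), Kzz)      # u ~ N(1*ubar, Kzz) with ubar = 1

A = Kxz @ np.linalg.inv(Kzz)
mu = A @ u                                             # Equation 5: conditional mean of f
var = ard_kernel(X, X).diagonal() - (A * Kxz).sum(-1)  # Equation 6: conditional variance of f
intensity = mu ** 2                                    # lambda(x) = f^2(x), at the conditional mean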
4. Inference

We will use variational inference to obtain a bound on the model evidence p(D). To achieve this we must integrate out f and u, but we must also integrate f^2 over the region T, due to the integral embedded in the likelihood, Equation 1.

4.1. Variational Bound

We begin by integrating out the inducing variables u, using a variational distribution q(u) = N(u; m, S) over the inducing points. We now multiply and divide the joint by q(u) and lower bound using Jensen's inequality to obtain a lower bound on the model evidence:

$$\log p(\mathcal{D} \mid \Theta) = \log \iint p(\mathcal{D} \mid f)\, p(f \mid u)\, p(u)\, \frac{q(u)}{q(u)}\, du\, df$$
$$\geq \iint p(f \mid u)\, q(u)\, du\, \log p(\mathcal{D} \mid f)\, df + \iint p(f \mid u)\, q(u)\, df\, \log \frac{p(u)}{q(u)}\, du$$
$$= \mathbb{E}_{q(f)}[\log p(\mathcal{D} \mid f)] - \mathrm{KL}\big(q(u)\,\|\,p(u)\big) \triangleq \mathcal{L}. \qquad (9)$$

Since p(f | u) is conjugate to q(u), we can write down in closed form the resulting integral:

$$q(f) = \int p(f \mid u)\, q(u)\, du = \mathcal{GP}(f;\, \tilde{\mu}, \tilde{\Sigma}), \qquad (10)$$
$$\tilde{\mu}(x) = k_{xz} K_{zz}^{-1} m,$$
$$\tilde{\Sigma}(x, x') = K_{xx'} - k_{xz} K_{zz}^{-1} k_{zx'} + k_{xz} K_{zz}^{-1} S K_{zz}^{-1} k_{zx'}.$$

KL(q(u) || p(u)) is simply the KL-divergence between two Gaussians

$$\mathrm{KL}\big(q(u)\,\|\,p(u)\big) = \tfrac{1}{2}\Big[\mathrm{tr}\big(K_{zz}^{-1} S\big) - \log\tfrac{|S|}{|K_{zz}|} - M + (\vec{1}\bar{u} - m)^{\top} K_{zz}^{-1} (\vec{1}\bar{u} - m)\Big]. \qquad (11)$$

We can now take expectations of the data log-likelihood under q(f):

$$\mathcal{L} = \mathbb{E}_{q(f)}[\log p(\mathcal{D} \mid f)] - \mathrm{KL}\big(q(u)\,\|\,p(u)\big)$$
$$= \mathbb{E}_{q(f)}\Big[-\!\int_{\mathcal{T}} f_x^2\, dx + \sum_{n=1}^{N} \log f_n^2\Big] - \mathrm{KL}\big(q(u)\,\|\,p(u)\big)$$
$$= -\!\int_{\mathcal{T}} \Big(\mathbb{E}_{q(f)}[f_x]^2 + \mathrm{Var}_{q(f)}[f_x]\Big)\, dx + \sum_{n=1}^{N} \mathbb{E}_{q(f)}\big[\log f_n^2\big] - \mathrm{KL}\big(q(u)\,\|\,p(u)\big), \qquad (12)$$

where to keep the notation concise we have introduced the following identities:

f_x ≜ f(x), µ̃_x ≜ µ̃(x), σ̃_x^2 ≜ Σ̃(x, x),
f_n ≜ f(x^(n)), µ̃_n ≜ µ̃(x^(n)), σ̃_n^2 ≜ Σ̃(x^(n), x^(n)).

Note we have used Tonelli's Theorem to reverse the ordering of the integrations over the positive integrand f_x^2 q(f). We now have two tasks remaining: we must compute the integral over the region T and calculate the expectations E_q(f)[log f_n^2] at the data points.

4.2. Integrating Over The Region T

This lower bound has the desirable property that we can take expectations under q(f) at any specific point, x, of the function value, f(x), since q(f(x)) is Gaussian. It is only possible to take useful expectations because: a) we used the conditional GP formulation, allowing tractable expectations to be taken w.r.t. q(f); b) we have already integrated out the inducing variables u; and c) we chose a suitable transformation, i.e. λ(x) = f^2(x).

The required statistics for Equation 12 are:

$$\mathbb{E}_{q(f)}[f_x]^2 = \tilde{\mu}_x^2 = m^{\top} K_{zz}^{-1} k_{zx} k_{xz} K_{zz}^{-1} m, \qquad (13)$$
$$\mathrm{Var}_{q(f)}[f_x] = \tilde{\sigma}_x^2 = k_{xx} - \mathrm{Tr}\big(K_{zz}^{-1} k_{zx} k_{xz}\big) + \mathrm{Tr}\big(K_{zz}^{-1} S K_{zz}^{-1} k_{zx} k_{xz}\big). \qquad (14)$$

It is now easy to calculate the integral since only k_zx = k_xz^T is a function of x, leading to the following terms:

$$\int_{\mathcal{T}} \mathbb{E}_{q(f)}[f_x]^2\, dx = m^{\top} K_{zz}^{-1} \Psi K_{zz}^{-1} m, \qquad (15)$$
$$\int_{\mathcal{T}} \mathrm{Var}_{q(f)}[f_x]\, dx = \gamma |\mathcal{T}| - \mathrm{Tr}\big(K_{zz}^{-1} \Psi\big) + \mathrm{Tr}\big(K_{zz}^{-1} S K_{zz}^{-1} \Psi\big). \qquad (16)$$

For the exponentiated quadratic ARD kernel, the matrix

$$\Psi = \int_{\mathcal{T}} K(z, x)\, K(x, z')\, dx \qquad (17)$$

can be calculated by re-arranging the product as a single exponentiated quadratic in x and z̄ as follows:

$$\Psi(z, z') = \gamma^2 \int_{\mathcal{T}} \prod_{r=1}^{R} \exp\!\Big(-\frac{(z_r - z'_r)^2}{4\alpha_r} - \frac{(x_r - \bar{z}_r)^2}{\alpha_r}\Big)\, dx$$
$$= \gamma^2 \prod_{r=1}^{R} \frac{\sqrt{\pi \alpha_r}}{2} \exp\!\Big(-\frac{(z_r - z'_r)^2}{4\alpha_r}\Big) \Big[\mathrm{erf}\Big(\tfrac{\bar{z}_r - \mathcal{T}_r^{\min}}{\sqrt{\alpha_r}}\Big) - \mathrm{erf}\Big(\tfrac{\bar{z}_r - \mathcal{T}_r^{\max}}{\sqrt{\alpha_r}}\Big)\Big],$$

where z̄ = [z̄_1, ..., z̄_R]^T has elements z̄_r = (z_r + z'_r)/2.

In addition to the exponentiated quadratic ARD kernel, the matrix Ψ can be computed in closed form for other kernels, including polynomial and periodic kernels, as well as sum and product combinations of kernels.
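As a sanity check on this closed form (our own sketch, one-dimensional, with assumed kernel parameters and region boundaries), the analytic Ψ of Equation 17 can be compared against brute-force quadrature over T:

import numpy as np
from scipy.special import erf

gamma, alpha = 1.3, 0.7            # kernel parameters of Equation 7 (R = 1)
t_min, t_max = 0.0, 5.0            # boundaries of the region T
Z = np.array([0.5, 2.0, 4.5])      # inducing point locations

def k(x, z):
    return gamma * np.exp(-(x - z) ** 2 / (2 * alpha))

def psi_closed_form(z, zp):
    # Psi(z, z') for the 1D exponentiated quadratic kernel
    zbar = 0.5 * (z + zp)
    return (gamma ** 2 * np.sqrt(np.pi * alpha) / 2
            * np.exp(-(z - zp) ** 2 / (4 * alpha))
            * (erf((zbar - t_min) / np.sqrt(alpha)) - erf((zbar - t_max) / np.sqrt(alpha))))

xs = np.linspace(t_min, t_max, 20001)
for z in Z:
    for zp in Z:
        numeric = np.trapz(k(xs, z) * k(xs, zp), xs)   # Psi(z, z') = int_T K(z, x) K(x, z') dx
        assert np.isclose(numeric, psi_closed_form(z, zp), rtol=1e-4)
print("closed-form Psi matches quadrature")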

4.3. Expectations At The Data Points

The expectation E_q(f)[log f_n^2] has an analytical—albeit complicated—solution expressed as

$$\mathbb{E}_{q(f)}\big[\log f_n^2\big] = \int_{-\infty}^{\infty} \log\big(f_n^2\big)\, \mathcal{N}\big(f_n;\, \tilde{\mu}_n, \tilde{\sigma}_n^2\big)\, df_n \qquad (18)$$
$$= -\tilde{G}\Big(-\frac{\tilde{\mu}_n^2}{2\tilde{\sigma}_n^2}\Big) + \log\Big(\frac{\tilde{\sigma}_n^2}{2}\Big) - C, \qquad (19)$$

where C ≈ 0.57721566 is the Euler-Mascheroni constant and G̃ is defined via the confluent hypergeometric function

$${}_1F_1(a, b, z) = \sum_{k=0}^{\infty} \frac{(a)_k\, z^k}{(b)_k\, k!}, \qquad (20)$$

where (·)_k denotes the rising Pochhammer series

$$(a)_0 = 1, \qquad (a)_k = a(a+1)(a+2)\cdots(a+k-1).$$

Specifically, G̃ is a specialised version of the partial derivative of 1F1 with respect to its first argument and can be computed using the method of Ancarani & Gasaneo (2008), which has a particular solution at a = 0, leading to the following definition of G̃:

$$\tilde{G}(z) = {}_1F_1^{(1,0,0)}\Big(0, \tfrac{1}{2}, z\Big) = 2z \sum_{j=0}^{\infty} \frac{j!\, z^j}{(2)_j\, \big(\tfrac{3}{2}\big)_j}. \qquad (21)$$

Naive implementation of Equation 21 has poor numerical stability. Although this can be improved somewhat using an iterative scheme, in practice we therefore use a large multi-resolution look-up table of precomputed values obtained from a numerical package. As shown in Figure 1, this function decreases very slowly as its argument becomes increasingly negative, so we can easily compute accurate evaluations of G̃(z) for any z by linear interpolation of our lookup table and, as a by-product, we also obtain G̃'.

Figure 1. Accurate evaluations of G̃ can be obtained from a precomputed look-up table. [Plot of G̃(z) for z ∈ [−1000, 0] omitted.]
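The following small illustration (ours, not the authors' look-up-table implementation) evaluates the series of Equation 21 naively and checks the resulting expectation of Equation 19 against a Monte Carlo estimate; as noted above, the naive series is only reliable for moderate |z|.

import numpy as np

EULER_MASCHERONI = 0.5772156649015329   # the constant C

def g_tilde(z, terms=200):
    # Equation 21: G(z) = 2z * sum_j j! z^j / ((2)_j (3/2)_j)
    total, term = 0.0, 1.0              # the j = 0 term equals 1
    for j in range(terms):
        total += term
        term *= (j + 1) * z / ((j + 2) * (1.5 + j))   # ratio of consecutive terms
    return 2 * z * total

def expected_log_square(mu, sig2):
    # Equation 19: E_q[log f_n^2] for f_n ~ N(mu, sig2)
    return -g_tilde(-mu ** 2 / (2 * sig2)) + np.log(sig2 / 2) - EULER_MASCHERONI

mu, sig2 = 1.3, 0.4
samples = np.random.default_rng(0).normal(mu, np.sqrt(sig2), size=2_000_000)
print(expected_log_square(mu, sig2), np.log(samples ** 2).mean())   # agree to ~3 decimal places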

4.4. Optimising The Bound

To perform inference we find variational parameters m*, S* and the model parameters Θ* that maximise L. To optimise these simultaneously we construct an augmented vector y = [Θ^T, m^T, vech(L)^T]^T, where vech(L) is the vectorisation of the lower triangular elements of L, such that S = LL^T.

As well as the maximum-likelihood solution, we can also compute the maximum a posteriori (MAP) estimate by maximising L(D; y) + log p(Θ).

To optimise the inducing point locations we use the change of variables

$$z_r^{(m)} = \frac{\mathcal{T}_r^{\min} + \mathcal{T}_r^{\max}}{2} - \frac{\mathcal{T}_r^{\max} - \mathcal{T}_r^{\min}}{2} \sin\big(\omega_r^{(m)}\big), \qquad (22)$$

and optimise in ω_r^(m) ∈ [−π, π], which ensures the inducing points always remain within the region T.

4.5. Locating The Inducing Points

One final part of our model we have so far left unspecified is the number and location of the inducing points. In principle, for a given set of parameters Θ, we will obtain a lower bound for the true GP likelihood for any number of inducing points in any configuration of locations. We consider two possible approaches: firstly, treating the inducing points as optimisation parameters and, secondly, fixing them on a regular grid.

If the locations of the inducing points are optimised, this suggests—in common with other sparse GP models—that we can achieve good performance using a small number of well-placed inducing points. This is achieved at the cost of an additional set of optimisation parameters, although the sizes of m and S are commensurately reduced. Optimisation of the inducing points is particularly computationally expensive, however, because of the necessity to recompute K_zz^{-1} for each dimension of each inducing point (M × R in total) and since K_zz affects every term in the bound.

Regular grids, on the other hand, also have several advantages. Consider firstly that—in contrast to standard sparse GP regression—the accuracy of our solution is not only governed by the distance between the inducing points and the data points. The variance of f(x) increases as x becomes further from the inducing points. However, the rate function, λ, is a function of both the mean and the variance of f. Since we are integrating λ over T, we need inducing points distributed across T and not just in regions close to the data. An evenly-spaced grid is one way to ensure this is achieved.

Regularly sampled grids also afford potential computational advantages. When the grid points are evenly spaced, the kernel matrix has Toeplitz structure, and hence allows matrix inversion (and linear solving) in O(M log^2 M) time, a fact previously utilised for efficient point processes by Cunningham et al. (2008a). Furthermore, when the kernel function is separable across the dimensions (as specified by Equation 7), the kernel matrix has Kronecker structure, which can further reduce the cost of matrix inversion (Osborne et al., 2012). The latter is relevant to all sparse GP applications based on inducing points; however, it is particularly relevant for this application as we are motivated by the doubly intractable nature of Equation 3. In our implementation, we use naïve inversion of the inducing point kernel matrix, K_zz, resulting in computational complexity of O(NM^3). Hence the computation times reported below could be improved with a relatively small amount of additional implementation effort.
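The Kronecker observation can be verified directly; the check below (our own illustration with arbitrary grid sizes and kernel parameters) confirms that the separable kernel of Equation 7 evaluated on a regular two-dimensional grid of inducing points factorises as a Kronecker product of per-dimension kernel matrices.

import numpy as np

def k1d(a, b, alpha):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * alpha))

gamma, alphas = 2.0, np.array([0.3, 0.8])
g1, g2 = np.linspace(0, 1, 5), np.linspace(0, 1, 7)                   # per-dimension grids
Z = np.stack(np.meshgrid(g1, g2, indexing="ij"), -1).reshape(-1, 2)   # 35 grid inducing points

K_full = gamma * np.exp(
    -((Z[:, None, :] - Z[None, :, :]) ** 2 / (2 * alphas)).sum(-1))   # Equation 7 on the grid
K_kron = gamma * np.kron(k1d(g1, g1, alphas[0]), k1d(g2, g2, alphas[1]))
assert np.allclose(K_full, K_kron)
print("Kzz has Kronecker structure on the regular grid")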

4.6. Predictive Distribution

To form the predictive distribution we assume our optimised variational distribution q*(u) = N(u; m*, S*) approximates the posterior p(u | D). Analogously to Equation 10 we next compute q*(f) ≈ p(f | D). We can now derive a lower bound of the (approximate) predictive log-likelihood on a held-out test dataset H:

$$\log p(\mathcal{H} \mid \mathcal{D}, \Theta^*) = \log \mathbb{E}_{p(f \mid \mathcal{D})}\big[p(\mathcal{H} \mid f)\big]$$
$$\approx \log \mathbb{E}_{q^*(f)}\big[p(\mathcal{H} \mid f)\big] \triangleq \mathcal{M}_p$$
$$\geq \mathbb{E}_{q^*(f)}\big[\log p(\mathcal{H} \mid f)\big] \triangleq \mathcal{L}_p. \qquad (23)$$

The derivation of L_p follows Equations 12-19. The resulting bound is the same as L except that m, S are replaced with m* and S*, and there is no KL divergence term. Kernel matrices are computed using Θ*.

The tightness of this final bounding step will be a function of how well the inducing variables u define the function f. Intuitively, when the variance at the inducing points is large, the entropy of f will be large, as it is unrestricted. From an information-theoretic perspective, we can say that the tightness of the bound will be a function of the entropy of f: H(f) = ½ log|Σ̃| + const.

Given this knowledge we define a second, tightened, lower bound L_0, where we allow the variance of function values at the inducing points to collapse to zero: p(u | D) ≈ δ(m). This reduces the conditional entropy of f given u by shrinking the final term in the definition of Σ̃ (Equation 10), resulting in a tighter bound for a slightly restricted class of models. As it is a slightly more restricted model, we expect the ground truth M_0 = log E_{q(f | u=m)}[p(H | f)] to be lower than the ground truth M_p. In practice, however, because the variational L_0 is so much tighter, we also use this to give results for approximate predictive likelihood when comparing against other approaches. In Figure 7 we demonstrate this empirically for the coal mining dataset by evaluating the true predictive bounds on a held-out 50% of the data via 10,000 MCMC samples, and shading between these and the variational approximations. We do this for a range of inducing point grid densities, both with and without optimisation of the inducing point locations. In Figures 5, 6 and 7, we plot all of the bounds described above as the number of inducing points increases. We note that for the relatively smooth coal mining data (Figure 7), none of the bounds benefit from more than about 10 inducing points, while in the case of the Twitter data, the faster dynamics call for increased numbers. In all figures the tightness of L_0 is evident as compared to L_p.
5. Alternative GP Transformations

At this point it is worth considering why we have chosen the function transformation λ(x) = f^2(x) in preference to other alternatives we might have used. An obvious first choice would be

$$\lambda(x) = \exp\big(f(x)\big). \qquad (24)$$

This transformation is undesirable for two reasons. The more obvious of these is that after taking expectations under q(f) we are left with the integral

$$-\int_{\mathcal{T}} \exp\!\Big(\tilde{\mu}_x + \frac{\tilde{\sigma}_x^2}{2}\Big)\, dx, \qquad (25)$$

which cannot be computed in closed form. We could approximate the integral using a series expansion; however this would be difficult with more than a couple of terms and, furthermore, since the function is concave, this approximation would not be a lower bound.

The second—and more subtle—reason is that in using this transformation, when we take expectations under q(f) of the data, we obtain

$$\mathbb{E}_{q(f)}\big[\log\{\exp(f_n)\}\big] = \tilde{\mu}_n. \qquad (26)$$

Since the mean, µ̃_n, is not a function of S, the variance of the variational distribution q(u), we have effectively decoupled the data from the uncertainty on our variational approximation; this is clearly undesirable.

Another possible candidate is the probit function, λ(x) = Φ(f(x)). This can be integrated analytically against the GP prior, however we are again left with a difficult integral over T, which is

$$\int_{\mathcal{T}} \Phi\!\Big(\frac{\tilde{\mu}_x}{\sqrt{1 + \tilde{\sigma}_x^2}}\Big)\, dx. \qquad (27)$$

As the range of this transformation is (0, 1) we would also require additional machinery to infer a scaling variable.

In contrast, the square transform presented allows the integral over the region T to be computed in closed form, and E_q(f)[log f_n^2] can be computed quickly and accurately. Importantly, this transformation also maintains the connection between the data and the variational uncertainty.

Although the square transform is not a one-to-one function—any rate function λ may have been generated by f^2 or (−f)^2—this sign ambiguity is integrated out in a Bayesian sense, Equation 18.

6. Relationship to Sparse GP Models

The use of inducing points in our model relates it to a wide range of sparse Gaussian process models, e.g. SPGP (Snelson & Ghahramani, 2005). The variational sparse Gaussian process framework was introduced by Titsias (2009); however, the bound we develop is more akin to the "Big-Data" GP bound (Hensman et al., 2013), since we explicitly maintain the variational distribution q(u). The variable Ψ that results from integrating the kernel over the input domain is similar to the so-called "Ψ-statistic" which arises when integrating out the uncertainty of latent variables in the variational GPLVM (Titsias & Lawrence, 2010).

7. Experiments

To evaluate our algorithm, we benchmarked against our frequentist kernel smoothing approach, described below, both with and without end correction (KS+EC and KS−EC respectively), and a fully Bayesian SGCP MCMC sampler. Our test data sets are generative data from the SGCP model and several real-world data sets.

7.1. Benchmarks

Our kernel smoothing method is similar to standard kernel density estimation except we use truncated normal kernels to account for our explicit knowledge of the domain; the latter is referred to as "end-correction" in some literature (Diggle, 1985). The kernel smoother optimises a diagonal covariance matrix, Σ*, by maximising the leave-one-out training objective

$$\Sigma^* = \arg\max_{\Sigma} \sum_{i=1}^{N} \log \sum_{\substack{j=1 \\ j \neq i}}^{N} \mathcal{N}_{\mathcal{T}}\big(x^{(i)};\, x^{(j)}, \Sigma\big). \qquad (28)$$

We can construct the predictive distribution by combining the maximum-likelihood estimates of the size and spatial location of the data. For a test data set H (with K! permutations) this distribution is

$$p(\mathcal{H} \mid \mathcal{D}) = K!\; p(K \mid \mathcal{D}) \prod_{k=1}^{K} p\big(\tilde{x}^{(k)} \mid \mathcal{D}\big), \qquad (29)$$

where the location density,

$$p\big(\tilde{x}^{(k)} \mid \mathcal{D}\big) = \frac{1}{N} \sum_{n=1}^{N} \mathcal{N}_{\mathcal{T}}\big(\tilde{x}^{(k)};\, x^{(n)}, \Sigma^*\big), \qquad (30)$$

is computed using the previously described method, and the distribution of the number of points,

$$p(K \mid \mathcal{D}) = \frac{N^K}{K!} \exp(-N), \qquad (31)$$

is simply a Poisson distribution with parameter N. It is straightforward to show that Equation 29 is equivalent to Equation 1, since we can interpret the rate function as λ(x) = Σ_{n=1}^N N_T(x; x^(n), Σ*) and since ∫_T λ(x) dx = N.

Our SGCP sampler is based on Adams et al. (2009). Our implementation differs by using elliptical slice sampling to infer the latent function f, and we perform hyper-parameter inference using Hybrid Monte Carlo (HMC). We also use the "adaptive-thinning" method described in Gunter et al. (2014a) to reduce the number of thinning points required.
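A one-dimensional sketch of this baseline as we read it (not the authors' code): truncated normal kernels provide the end correction, the bandwidth is selected with the leave-one-out objective of Equation 28 over a simple grid of candidate values, and the implied rate function is the sum of kernels centred on the data, which integrates to N over T.

import numpy as np
from scipy.stats import norm

def truncnorm_pdf(x, mean, std, lo, hi):
    # Normal kernel centred at `mean`, renormalised to the region [lo, hi]
    z = norm.cdf((hi - mean) / std) - norm.cdf((lo - mean) / std)
    return norm.pdf(x, mean, std) / z

def loo_objective(std, data, lo, hi):
    # Leave-one-out log likelihood as in Equation 28
    total = 0.0
    for i, xi in enumerate(data):
        others = np.delete(data, i)
        total += np.log(truncnorm_pdf(xi, others, std, lo, hi).sum())
    return total

def fit_kernel_smoother(data, lo, hi, grid=np.geomspace(1e-2, 1e1, 200)):
    best_std = max(grid, key=lambda s: loo_objective(s, data, lo, hi))
    # Implied rate function lambda(x) = sum_n N_T(x; x^(n), Sigma*), integrating to N over T
    rate = lambda x: truncnorm_pdf(x[:, None], data[None, :], best_std, lo, hi).sum(-1)
    return best_std, rate

rng = np.random.default_rng(0)
events = np.sort(rng.uniform(0.0, 10.0, size=80))
best_std, rate = fit_kernel_smoother(events, 0.0, 10.0)
print("selected bandwidth:", best_std)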

7.2. Synthetic Data

For the synthetic data sets, we first generate a 2D function from a high resolution grid using a Gaussian process and sigmoid link function, and then, conditioned on that function, we draw a training dataset and multiple test datasets. We give average performance results for these test datasets in Table 1. Figure 2 visualises an inferred 2D intensity conditioned on 500 observations. Although the SGCP sampler gives better predictive performance than the VBPP L_0 bound, it should be noted that the sampler uses well tuned hyper-parameters, uses the same link function as the generative process and is much more computationally expensive. VBPP outperforms kernel smoothing in terms of both predictive likelihood and RMS error.

Figure 2. 2D Synthetic Data. Clockwise from top left: Ground truth, VBPP, KS+EC, SGCP.
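For completeness, a sketch (ours, with an arbitrary example surface rather than the SGCP draw used in the paper) of generating events from an intensity defined on a high-resolution grid by thinning, followed by the probability-0.5 train/test split used for the real-data experiments below.

import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)
X1, X2 = np.meshgrid(grid, grid, indexing="ij")
intensity = 50 * np.exp(-((X1 - 0.3) ** 2 + (X2 - 0.7) ** 2) / 0.05)    # example rate surface

lam_max = intensity.max()
n = rng.poisson(lam_max * 1.0)                      # |T| = 1 for the unit square
cand = rng.uniform(0.0, 1.0, size=(n, 2))
idx = np.clip((cand * (len(grid) - 1)).round().astype(int), 0, len(grid) - 1)
keep = rng.uniform(size=n) < intensity[idx[:, 0], idx[:, 1]] / lam_max  # thinning step
events = cand[keep]

train_mask = rng.uniform(size=len(events)) < 0.5    # allocate each point with probability 0.5
train, test = events[train_mask], events[~train_mask]
print(len(train), "training events,", len(test), "test events")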

Table 1. Results for 2D synthetic data (drawn from the SGCP generative process).

                     SGCP                        KS+EC                       VBPP (L_0)
Function   Avg. LL   RMS   Time(s)    Avg. LL   RMS   Time(s)    Avg. LL   RMS   Time(s)
1            446.1  1.37   7547.83      389.8  1.48      0.34      392.9  1.21      3.26
2            -61.1  0.38   1039.65      -78.3  0.46      0.02      -76.1  0.38      2.00
3            122.4  0.88   3173.91       84.3  1.04      0.12       92.6  0.81      2.44
4            175.8  1.71   3773.75      147.0  1.26      0.05      148.3  1.14      2.58
5            446.1  2.94   6368.44      413.6  2.02      0.21      415.5  1.81      2.83

7.3. Real Data

We next investigate three real world data sets. For these data sets we create training and test subsets by allocating each point to either subset with probability 0.5. Since true rates are unknown for these datasets, we rely on held-out predictive likelihood as the only performance metric.

7.3.1. Coal Mining Disaster Data

Our first real dataset comprises 190 events recorded from March 15, 1851 to March 22, 1962; each represents a coal-mining disaster that killed at least ten people in the United Kingdom. These data, first analysed in this form in 1979 (Jarrett, 1979), have often been tackled with nonhomogeneous Poisson processes (Adams et al., 2009), as the rate of such disasters is expected to vary according to known historical developments. The events are indicated by the rug plot along the axis of Figure 3.

Our inferred intensity of disasters correlates with the historical introduction of safety regulation, as noted in previous work on this data (Fearnhead, 2006; Carlin et al., 1992). Firstly, our results depict a decline in the rate of such disasters throughout 1870-1890, a period that saw the UK parliament passing several acts with the aim of improving safety for mine workers, including the Coal Mines Regulation Acts of 1872 and 1887. Our inferred intensity also declines after 1950, likely related to the imposition of further safety regulation with the Mines and Quarries Act, 1954.

Predictive log-likelihood values on held out data (Figure 7) are also encouraging. VBPP outperforms kernel smoothing and SGCP with as few as 10 inducing points; more inducing points yield no further benefit.

Figure 3. Coal mining disaster data. [Plot of deaths per year, 1860-1960, with a rug plot of events; legend: Data, VB (L_p), VB (L_0), KS+EC, KS-EC, SGCP, VB 25-75%.]

7.3.2. Twitter Data

Next, we ran the models on the tweet profile of the chairman of the 'Better Together Campaign', Alistair Darling, one week either side of the Scottish independence election (189 tweets). Results are shown in Figure 4 and Table 2, where half the data was held out and a regular 31 point grid was used. Figures 5 and 6 compare the performance of regularly spaced and optimised inducing points, and show optimisation yields considerably improved performance on this dataset. The L_0 and L_p bounds become less tight as the number of inducing points is increased, suggesting there is less uncertainty represented in the variational parameter S and more uncertainty captured by a reduced kernel length scale. This transition is observed for fewer inducing points when inducing point optimisation is employed. Both with and without inducing point optimisation, VBPP M_0 and M_p outperform both the SGCP and kernel smoothing by a wide margin, suggesting the square link function is an appropriate model for this data.

Figure 4. Twitter data. [Plot of tweets per day, 11-26 Sept; legend as in Figure 3.]

Figure 5. Twitter data (Z on a fixed grid). [Log likelihood against number of inducing points; legend: L_p → M_p, L_0 → M_0, KS+EC, SGCP.]

Figure 6. Twitter data (Z optimised). [Log likelihood against number of inducing points; legend as in Figure 5.]

Table 2. Run times for 1D data sets.

Method   Coal Mining   Twitter
VBPP             0.7       0.5
KS+EC            0.0       0.3
KS-EC            0.0       0.2
SGCP           417.6     230.0

Figure 7. Coal Mining Data set: The difference between sampling M_0 and M_p, and the corresponding lower bounds, L_0 and L_p (see Section 4.6). The figure clearly demonstrates the tightness of the L_0 bound. Legend as in Figure 5. [Plot of log likelihood against the number of inducing points (8-30) omitted.]

7.3.3. Malaria Data

We expect that a major application of the contributions presented in this paper is the joint modelling of disease incidence with correlating factors, in a fully Bayesian, scalable framework. For example, those studying the spread of malaria often wish to use continuous rainfall measurements to better inform their epidemiological models. We use examples from the Malaria Atlas Project (2014) to test our scheme. We extracted 741 incidences of malaria outbreak documented in Kenya between 1985 and 2010, and ran our VBPP algorithm and kernel smoothing on approximately half of the resulting dataset, holding out the remainder for testing. Test log-likelihood results, given in Table 3, show VBPP performs comparably to kernel smoothing.

Table 3. Test log-likelihood for 2D Malaria data.

KS-EC   KS+EC   VBPP (M_p)   VBPP (L_0)
855.0   867.2        869.7        855.9

Figure 8. A sample of 741 malaria incidences in Kenya, which occurred over the course of 1985-2010, and the associated VBPP intensity function. 20 × 20 inducing points.

8. Further Work

Although the performance of the variational Bayesian point process inference algorithm described in this paper improves upon standard methods when used in isolation, it is in its extensions that its utility will be fully realised. Previous work (Gunter et al., 2014a) has shown that hierarchical modelling of point processes—structured point processes—can significantly improve predictive accuracy. In these multi-output models, statistical strength is shared across multiple rate processes via shared latent processes. The method presented here provides a likelihood model for point-process data that can be incorporated as a probabilistic building-block into these larger interconnected models. That is, our fully generative model can readily be extended to additionally incorporate other observation modalities. For example, real-valued observations such as (log-) household income could be modelled along with the intensity function over crime incidents using a variational multi-output GP framework. Future work will be aimed towards developing these variational structured point process algorithms and integrating these techniques into popular Gaussian process tool-kits, such as GPy (The GPy authors, 2014).

9. Conclusion

Point process models have hitherto been hindered by their scaling with the number of data. To address this problem, we propose a new model, accommodating non-discretised intensity functions, that permits linear scaling. We additionally contribute a variational Bayesian inference scheme that delivers rapid and accurate prediction. The scheme is validated on real datasets including the canonical coal mining disaster data set and malaria incidence in Kenya data.

Acknowledgements

The authors would like to thank James Hensman and Neil Lawrence for helpful discussions. Chris Lloyd is funded by a DSTL National PhD Scheme Studentship. Tom Gunter is supported by UK Research Councils.

References

Adams, R. P., Murray, I., and MacKay, D. J. C. Tractable Nonparametric Bayesian Inference in Poisson Processes with Gaussian Process Intensities. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 9-16, New York, NY, USA, 2009. ACM.

Ancarani, L. U. and Gasaneo, G. Derivatives of any order of the confluent hypergeometric function 1F1(a, b, z) with respect to the parameter a or b. Journal of Mathematical Physics, 49(6), 2008.

Basu, S. and Dassios, A. A Cox process with log-normal intensity. Insurance: Mathematics and Economics, 31(2):297-302, Oct 2002.

Carlin, B. P., Gelfand, A. E., and Smith, A. F. M. Hierarchical Bayesian analysis of changepoint problems. Journal of the Royal Statistical Society. Series C (Applied Statistics), 41(2):389-405, Jan 1992.

Cunningham, J. P., Shenoy, K. V., and Sahani, M. Fast Gaussian process methods for point process intensity estimation. In Proceedings of the 25th International Conference on Machine Learning, pp. 192-199. ACM, 2008a.

Cunningham, J. P., Yu, B. M., Shenoy, K. V., and Sahani, M. Inferring neural firing rates from spike trains using Gaussian processes. In Advances in Neural Information Processing Systems 20, 2008b.

Diggle, P. A kernel method for smoothing point process data. Journal of the Royal Statistical Society. Series C (Applied Statistics), 34(2):138-147, 1985. ISSN 00359254.

Fearnhead, P. Exact and efficient Bayesian inference for multiple changepoint problems. Statistics and Computing, 16(2):203-213, Jun 2006.

Gunter, T., Lloyd, C., Osborne, M. A., and Roberts, S. J. Efficient Bayesian nonparametric modelling of structured point processes. In Uncertainty in Artificial Intelligence (UAI), 2014a.

Gunter, T., Osborne, M. A., Garnett, R., Hennig, P., and Roberts, S. Sampling for inference in probabilistic models with fast Bayesian quadrature. In Cortes, C. and Lawrence, N. (eds.), Advances in Neural Information Processing Systems (NIPS), 2014b.

Heikkinen, J. and Arjas, E. Modeling a Poisson forest in variable elevations: A nonparametric Bayesian approach. Biometrics, 55(3):738-745, Sep 1999.

Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian Processes for Big Data. CoRR, abs/1309.6835, 2013.

Jarrett, R. G. A note on the intervals between coal-mining disasters. Biometrika, 66(1):191-193, 1979.

Kingman, J. F. C. Poisson Processes (Oxford Studies in Probability). Oxford University Press, Jan 1993. ISBN 0198536933.

Lasko, T. A. Efficient Inference of Gaussian Process Modulated Renewal Processes with Application to Medical Event Data. In Uncertainty in Artificial Intelligence (UAI), 2014.

Malaria Atlas Project. http://www.map.ox.ac.uk/explore/data-modelling/, 2014. [Online; accessed 24-October-2014].

Møller, J., Syversveen, A. R., and Waagepetersen, R. P. Log Gaussian Cox Processes. Scandinavian Journal of Statistics, 25(3):451-482, 1998. ISSN 1467-9469.

Osborne, M. A., Roberts, S. J., Rogers, A., and Jennings, N. R. Real-time information processing of environmental sensor network data. ACM Transactions on Sensor Networks, 9(1), 2012.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. The MIT Press, 2nd edition, 2006. ISBN 0-262-18253-X.

Rathbun, S. L. and Cressie, N. Asymptotic properties of estimators for the parameters of spatial inhomogeneous Poisson point processes. Advances in Applied Probability, pp. 122-154, 1994.

Snelson, E. and Ghahramani, Z. Sparse Gaussian Processes using Pseudo-inputs. In Advances in Neural Information Processing Systems (NIPS), 2005.

The GPy authors. GPy: A Gaussian process framework in Python. https://github.com/SheffieldML/GPy, 2014.

Titsias, M. K. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics 12, pp. 567-574, 2009.

Titsias, M. K. and Lawrence, N. D. Bayesian Gaussian process latent variable model. In Artificial Intelligence and Statistics 10, 2010.