Scalable Nonparametric Bayesian Inference on Point Processes with Gaussian Processes

Yves-Laurent Kom Samo [email protected]
Stephen Roberts [email protected]
Department of Engineering Science and Oxford-Man Institute, University of Oxford

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

In this paper we propose an efficient, scalable non-parametric Gaussian process model for inference on Poisson point processes. Our model does not resort to gridding the domain or to introducing latent thinning points. Unlike competing models that scale as O(n^3) over n data points, our model has a complexity O(nk^2) where k ≪ n. We propose an MCMC sampler and show that the model obtained is faster, more accurate and generates less correlated samples than competing approaches on both synthetic and real-life data. Finally, we show that our model easily handles data sizes not considered thus far by alternate approaches.

1. Introduction

Point processes are a standard model when the objects of study are the number and repartition of otherwise identical points on a domain, usually time or space. The Poisson point process is probably the most commonly used point process. It is fully characterised by an intensity function that is inferred from the data. Gaussian processes have been successfully used to form a prior over the (log-) intensity function for applications such as astronomy (Gregory & Loredo, 1992), forestry (Heikkinen & Arjas, 1999), finance (Basu & Dassios, 2002), and neuroscience (Cunningham et al., 2008b). We offer extensions to existing work as follows: we develop an exact non-parametric Bayesian model that enables inference on Poisson processes. Our method scales linearly with the number of data points and does not resort to gridding the domain. We derive an MCMC sampler for core components of the model and show that our approach offers a faster and more accurate solution, as well as producing less correlated samples, compared to other approaches on both real-life and synthetic data.

2. Related Work

Non-parametric inference on point processes has been extensively studied in the literature. Rathbun & Cressie (1994) and Moeller et al. (1998) used a finite-dimensional piecewise constant log-Gaussian prior for the intensity function. Such approximations are limited in that the choice of the grid on which to represent the intensity function is arbitrary and one has to trade off precision with computational complexity and numerical accuracy, with the complexity being cubic in the precision and exponential in the dimension of the input space. Kottas (2006) and Kottas & Sanso (2007) used a Dirichlet process mixture of Beta distributions as a prior for the normalised intensity function of a Poisson process. Cunningham et al. (2008a) proposed a model using Gaussian processes evaluated on a fixed grid for the estimation of intensity functions of renewal processes with log-concave renewal distributions. They turned hyper-parameter inference into an iterative series of convex optimization problems, where ordinarily cubic complexity operations such as Cholesky decompositions are evaluated in O(n log n), leveraging the uniformity of the grid and the log-concavity of the renewal distribution. Adams et al. (2009) proposed an exact Markov Chain Monte Carlo (MCMC) inference scheme for the posterior intensity function of a Poisson process with a sigmoid Gaussian prior intensity, or equivalently a Cox process (Cox, 1955) with sigmoid Gaussian stochastic intensity. The authors simplified the likelihood of a Cox process by introducing latent thinning points. The proposed scheme has a complexity exponential in the dimension of the input space, cubic in the number of data and thinning points, and performs particularly poorly when the data are sparse. Gunter et al. (2014) extended this model to structured point processes. Rao & Teh (2011) used uniformization to produce exact samples from a non-stationary renewal process whose hazard function is modulated by a Gaussian process, and consequently proposed an MCMC sampler to sample from the posterior intensity of a unidimensional point process. Although the authors have illustrated that their model is faster than that of Adams et al. (2009) on some synthetic and real-life data, their method still scales cubically in the number of thinned and data points, and is not applicable to data in dimension higher than 1, such as spatial point processes.

3. Model

3.1. Setup

We are tasked with making non-parametric Bayesian inference on the intensity function of a Poisson point process assumed to have generated a dataset D = {s_1, ..., s_n}. To simplify the discourse without loss of generality, we will assume that data points take values in R^d.

Firstly, let us recall that a Poisson point process (PPP) on a bounded domain S ⊂ R^d with non-negative intensity function λ is a locally finite random collection of points in S such that the numbers of points occurring in disjoint parts B_i of S are independent and each follows a Poisson distribution with mean ∫_{B_i} λ(s) ds.

The likelihood of a PPP is given by:

$$\mathcal{L}(\lambda \mid s_1, \ldots, s_n) = \exp\Big(-\int_S \lambda(s)\,ds\Big)\,\prod_{i=1}^{n} \lambda(s_i). \qquad (1)$$
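For concreteness, the sketch below (ours, not part of the paper) evaluates the log of the likelihood in Equation (1) for a given intensity function on a bounded interval, approximating the integral with a simple trapezoidal rule; the intensity used is the synthetic one of Section 5 and the event times are hypothetical.

```python
import numpy as np

def ppp_log_likelihood(points, intensity, a, b, n_grid=1000):
    """Log of Equation (1): -int_a^b lambda(s) ds + sum_i log lambda(s_i),
    with the integral approximated by the trapezoidal rule."""
    grid = np.linspace(a, b, n_grid)
    y = intensity(grid)
    integral = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(grid))
    return -integral + np.sum(np.log(intensity(np.asarray(points))))

# Synthetic intensity from Section 5: lambda(t) = 2 exp(-t/15) + exp(-((t-25)/10)^2).
lam = lambda t: 2.0 * np.exp(-t / 15.0) + np.exp(-((t - 25.0) / 10.0) ** 2)
events = [1.3, 4.2, 7.7, 12.5, 26.0, 31.4]   # hypothetical event times on [0, 50]
print(ppp_log_likelihood(events, lam, 0.0, 50.0))
```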

3.2. Tractability discussion

The approach adopted thus far in the literature to make non-parametric Bayesian inference on point processes using Gaussian processes (GPs) (Rasmussen & Williams, 2006) consists of putting a functional prior on the intensity function in the form of a positive function of a GP: λ(s) = f(g(s)), where g is drawn from a GP and f is a positive function. Examples of such f include the exponential function and a scaled sigmoid function (Adams et al., 2009; Rao & Teh, 2011). This approach can be seen as a Cox process where the stochastic intensity follows the same dynamics as the functional prior. When the Gaussian process used has almost surely continuous paths, the random vector

$$\Big(\lambda(s_1), \ldots, \lambda(s_n),\ \int_S \lambda(s)\,ds\Big) \qquad (2)$$

provably admits a probability density function (pdf). Moreover, we note that any piece of information not contained in the implied pdf over the vector in Equation (2) will be lost, as the likelihood only depends on those variables. Hence, given a functional prior postulated on the intensity function, the only necessary piece of information to be able to make a full Bayesian treatment is the implied joint pdf over the vector in Equation (2).

For many useful transformations f and covariance structures for the GP, the aforementioned implied pdf might not be available analytically. We note however that there is no need to put a functional prior on the intensity function. In fact, for every finite-dimensional prior over the vector in Equation (2), there exists a Cox process with an a.s. C∞ intensity process that coincides with the postulated prior (see the appendix for the proof).

This approach is similar to that of (Kottas, 2006). The author regarded I = ∫_S λ(s) ds as a random variable and noted that p(s) = λ(s)/∫_S λ(s) ds can be regarded as a pdf whose support is the domain S. He then made inference on (I, p(s_1), ..., p(s_n)), postulating as prior that I and (p(s_1), ..., p(s_n)) are independent, I has a Jeffrey's prior, and (s_1, ..., s_n) are i.i.d. draws from a Dirichlet process mixture of Beta distributions with pdf p.

The model we present in the following section puts an appropriate finite-dimensional prior on (λ(s_1), ..., λ(s_n), λ(s'_1), ..., λ(s'_k), ∫_S λ(s) ds) for some inducing points s'_j rather than putting a functional prior on the intensity function directly.

3.3. Our model

3.3.1. INTUITION

The intuition behind our model is that the data are not a 'natural grid' at which to infer the value of the intensity function. For instance, if the data consist of 200,000 points on the interval [0, 24] as in one of our experiments, it might not be necessary to infer the value of a function at 200,000 points to characterise it on [0, 24]. Instead, we find a small set of inducing points D' = {s'_1, ..., s'_k}, k ≪ n, on our domain, through which we will define the prior over the vector in Equation (2) augmented with λ(s'_1), ..., λ(s'_k). The set of inducing points will be chosen so that knowing λ(s'_1), ..., λ(s'_k) would result in knowing the values of the intensity function elsewhere on the domain, in particular λ(s_1), ..., λ(s_n), with 'arbitrary certainty'. We will then analytically integrate out the dependency in λ(s_1), ..., λ(s_n) from the posterior, thereby reducing the complexity from cubic to linear in the number of data points without 'loss of information', and reformulating our problem as that of making exact Bayesian inference on the value of the intensity function at the inducing points. We will then describe how to obtain the predictive mean and variance of the intensity function elsewhere on the domain from training.

3.3.2. MODEL SPECIFICATION

Let us denote by λ* a positive stochastic process on S such that log λ* is a stationary Gaussian process with covariance kernel γ*: (s_1, s_2) → γ*(s_1, s_2) and constant mean m*. Let us further denote by λ̂ a positive stochastic process on S such that log λ̂ is a conditional Gaussian process coinciding with log λ* at the k inducing points D' = {s'_1, ..., s'_k}, k ≪ n. That is, log λ̂ is the non-stationary Gaussian process whose mean function m is defined by

$$m(s) = m^* + \Sigma^*_{s D'}\,\Sigma^{*-1}_{D'D'}\,G^* \qquad (3)$$

where G* = (log λ*(s'_1) − m*, ..., log λ*(s'_k) − m*) and Σ*_{XY} is the covariance matrix between the vectors X and Y under the covariance kernel γ*. Moreover, log λ̂ is such that for every vector S_1 of points in S, the auto-covariance matrix Σ_{S_1 S_1} of the values of the process at S_1 reads¹

$$\Sigma_{S_1 S_1} = \Sigma^*_{S_1 S_1} - \Sigma^*_{S_1 D'}\,\Sigma^{*-1}_{D'D'}\,\Sigma^{*T}_{S_1 D'}. \qquad (4)$$

¹ The positive definiteness of the induced covariance kernel γ is a direct consequence of the positive definiteness of γ*.
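A minimal sketch of Equations (3) and (4): the conditional mean and covariance of log λ̂ only require covariance evaluations under γ*. The squared exponential kernel below matches the kernel used in Section 4, but the inducing locations, inducing values and hyper-parameter values are illustrative assumptions.

```python
import numpy as np

def se_kernel(x, y, h=1.0, l=1.0):
    """Squared exponential covariance gamma*(x, y) for 1-D inputs (output scale h, length scale l)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return h ** 2 * np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / l ** 2)

def conditional_gp(s, s_ind, log_lam_ind, m_star, h=1.0, l=10.0, jitter=1e-9):
    """Mean m(s) (Eq. 3) and covariance Sigma_{S1 S1} (Eq. 4) of log lambda-hat,
    conditional on the values log lambda*(s'_j) at the inducing points s'_j."""
    K_ii = se_kernel(s_ind, s_ind, h, l) + jitter * np.eye(len(s_ind))
    K_si = se_kernel(s, s_ind, h, l)
    G = np.asarray(log_lam_ind, float) - m_star                            # G* in Eq. (3)
    mean = m_star + K_si.dot(np.linalg.solve(K_ii, G))                     # Eq. (3)
    cov = se_kernel(s, s, h, l) - K_si.dot(np.linalg.solve(K_ii, K_si.T))  # Eq. (4)
    return mean, cov

# Hypothetical inducing points and log-intensity values on [0, 50].
s_ind = np.array([5.0, 25.0, 45.0])
log_lam_ind = np.array([0.7, 0.1, -0.4])
mean, cov = conditional_gp(np.linspace(0.0, 50.0, 6), s_ind, log_lam_ind, m_star=0.0)
```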

The prior distribution in our model is constructed as follows:

1. {log λ(s'_i)}_{i=1}^k are samples from the stationary GP log λ* at {s'_i}_{i=1}^k respectively, with m* = log(#D / µ(S)), where µ(S) is the size of the domain.

2. I = ∫_S λ(s) ds and {log λ(s_j)}_{j=1}^n are conditionally independent given {log λ(s'_i)}_{i=1}^k.

3. Conditional on {log λ(s'_i)}_{i=1}^k, {log λ(s_j)}_{j=1}^n are independent, and for each j ∈ [1..n], log λ(s_j) follows the same distribution as log λ̂(s_j).

4. Conditional on {log λ(s'_i)}_{i=1}^k, I follows a Gamma distribution with shape α_I and scale β_I.

5. The mean µ_I = α_I β_I and variance σ_I² = α_I β_I² of I are those of ∫_S λ̂(s) ds.

Assertion 3 above is somewhat similar to the FITC model of (Quinonero-Candela & Rasmussen, 2005).

This construction yields a prior pdf of the form:

$$p\big(\log\lambda(s_1), \ldots, \log\lambda(s_n), \log\lambda(s'_1), \ldots, \log\lambda(s'_k), I, \theta\big)
= \mathcal{N}\big(\log\lambda(s'_1), \ldots, \log\lambda(s'_k) \mid m^*\mathbf{1}_k, \Sigma^*_{D'D'}\big)
\times \mathcal{N}\big(\log\lambda(s_1), \ldots, \log\lambda(s_n) \mid M, \mathrm{diag}(\Sigma_{DD})\big)
\times \gamma_d\big(I \mid \alpha_I, \beta_I\big) \times p(\theta) \qquad (5)$$

where N(· | X, C) is the multivariate Gaussian pdf with mean X and covariance matrix C, M = (m(s_1), ..., m(s_n)), 1_k is the vector of length k with elements 1, diag(Σ_DD) is the diagonal matrix whose diagonal is that of Σ_DD, γ_d(x | α, β) is the pdf of the gamma distribution with shape α and scale β, and where θ denotes the hyper-parameters of the covariance kernel γ*.

It follows from the fifth assertion in our prior specification that α_I = µ_I²/σ_I² and β_I = σ_I²/µ_I. We also note that

$$\mu_I = \mathbb{E}\Big(\int_S \hat\lambda(s)\,ds\Big) = \mathbb{E}\Big(\int_S \exp(\log\hat\lambda(s))\,ds\Big) = \int_S \exp\Big(m(s) + \tfrac{1}{2}\gamma(s, s)\Big)\,ds := \int_S f(s)\,ds \qquad (6)$$

and

$$\sigma_I^2 = \mathbb{E}\Big(\Big(\int_S \hat\lambda(s)\,ds\Big)^2\Big) - \mu_I^2
= \mathbb{E}\Big(\int_S\!\!\int_S \exp\big(\log\hat\lambda(s_1) + \log\hat\lambda(s_2)\big)\,ds_1\,ds_2\Big) - \mu_I^2
= \int_S\!\!\int_S \mathbb{E}\Big(\exp\big(\log\hat\lambda(s_1) + \log\hat\lambda(s_2)\big)\Big)\,ds_1\,ds_2 - \mu_I^2
= \int_S\!\!\int_S \exp\Big(m(s_1) + m(s_2) + \gamma(s_1, s_2) + \tfrac{1}{2}\gamma(s_1, s_1) + \tfrac{1}{2}\gamma(s_2, s_2)\Big)\,ds_1\,ds_2 - \mu_I^2
:= \int_S\!\!\int_S g(s_1, s_2)\,ds_1\,ds_2 - \mu_I^2. \qquad (7)$$

The integrals in Equations (6) and (7) can be easily evaluated with numerical methods such as Gauss-Legendre quadrature (Hildebrand, 2003). In particular, when S = [a, b],

$$\mu_I \approx \frac{b-a}{2}\sum_{i=1}^{p}\omega_i\, f\Big(\frac{b-a}{2}x_i + \frac{b+a}{2}\Big) \qquad (8)$$

and

$$\sigma_I^2 \approx \frac{(b-a)^2}{4}\sum_{i=1}^{p}\sum_{j=1}^{p}\omega_i\,\omega_j\, g\Big(\frac{b-a}{2}x_i + \frac{b+a}{2},\ \frac{b-a}{2}x_j + \frac{b+a}{2}\Big) - \mu_I^2 \qquad (9)$$

where the roots x_i of the Legendre polynomial of order p and the weights ω_i are readily available from standard textbooks on numerical analysis such as (Hildebrand, 2003) and scientific programming packages (R, Matlab and Scipy). Extensions to rectangles in higher dimensions are straightforward. Moreover, the complexity of such approximations only depends on the number of inducing points and p (see Equations (3) and (4)), and hence scales well with the data size.
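A minimal sketch of the quadrature estimates (8) and (9), using numpy's Gauss-Legendre nodes and weights. Here f and g stand for the integrands defined in Equations (6) and (7); the placeholder mean and covariance used to build them below are illustrative assumptions rather than quantities inferred from data.

```python
import numpy as np

def gauss_legendre_moments(f, g, a, b, p=20):
    """Quadrature estimates of mu_I (Eq. 8) and sigma_I^2 (Eq. 9) on S = [a, b]."""
    x, w = np.polynomial.legendre.leggauss(p)     # roots and weights of the order-p Legendre polynomial
    s = 0.5 * (b - a) * x + 0.5 * (b + a)         # map nodes from [-1, 1] to [a, b]
    mu_I = 0.5 * (b - a) * np.sum(w * f(s))
    S1, S2 = np.meshgrid(s, s, indexing="ij")
    W = np.outer(w, w)
    sigma2_I = 0.25 * (b - a) ** 2 * np.sum(W * g(S1, S2)) - mu_I ** 2
    return mu_I, sigma2_I

# Placeholder integrands built as in Eqs. (6)-(7) from illustrative m(.) and gamma(.,.).
m = lambda s: np.zeros_like(s)
gamma = lambda s1, s2: np.exp(-0.5 * (s1 - s2) ** 2)
f = lambda s: np.exp(m(s) + 0.5 * gamma(s, s))
g = lambda s1, s2: np.exp(m(s1) + m(s2) + gamma(s1, s2) + 0.5 * gamma(s1, s1) + 0.5 * gamma(s2, s2))
print(gauss_legendre_moments(f, g, 0.0, 24.0, p=20))
```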

A critical step in the derivation of our model is to analytically integrate out log λ(s_1), ..., log λ(s_n) in the posterior, to eliminate the cubic complexity in the number of data points. To do so, we note that:

$$\int_{\mathbb{R}^n} \prod_{i=1}^{n} \lambda(s_i)\,\mathcal{N}\big(\log\lambda(s_1), \ldots, \log\lambda(s_n) \mid M, \mathrm{diag}(\Sigma_{DD})\big)\, d\log\lambda(s_1) \ldots d\log\lambda(s_n)
= \mathbb{E}\Big(\exp\Big(\sum_{i=1}^{n} \log\lambda(s_i)\Big)\Big)
= \exp\Big(\mathbf{1}_n^T M + \tfrac{1}{2}\mathrm{Tr}(\Sigma_{DD})\Big) \qquad (10)$$

where the second equality results from the moment generating function of a multivariate Gaussian.

Thus, putting together the likelihood of Equation (1) and Equation (5), and integrating out (log λ(s_1), ..., log λ(s_n)), we get:

$$p\big(\log\lambda(s'_1), \ldots, \log\lambda(s'_k), I, \theta \mid D\big)
\propto p(\theta)\,\mathcal{N}\big(\log\lambda(s'_1), \ldots, \log\lambda(s'_k) \mid m^*\mathbf{1}_k, \Sigma^*_{D'D'}\big)
\times \exp\Big(\mathbf{1}_n^T M + \tfrac{1}{2}\mathrm{Tr}(\Sigma_{DD})\Big)\exp(-I)\,\gamma_d\big(I \mid \alpha_I, \beta_I\big). \qquad (11)$$

Finally, although our model allows for joint inference on the intensity function and its integral, we restrict our attention to making inference on the intensity function for brevity. By integrating out I from Equation (11), we get the new posterior:

$$p(\lambda, \theta \mid D) := p\big(\log\lambda(s'_1), \ldots, \log\lambda(s'_k), \theta \mid D\big)
\propto p(\theta)\,\exp\Big(\mathbf{1}_n^T M + \tfrac{1}{2}\mathrm{Tr}(\Sigma_{DD})\Big)\,(1 + \beta_I)^{-\alpha_I}
\times \mathcal{N}\big(\log\lambda(s'_1), \ldots, \log\lambda(s'_k) \mid m^*\mathbf{1}_k, \Sigma^*_{D'D'}\big) \qquad (12)$$

where we noted that the dependency of Equation (11) in I is of the form exp(−x) γ_d(x | α, β), which can be integrated out as the moment generating function of the gamma distribution evaluated at −1, that is (1 + β)^{−α}.

3.3.3. SELECTION OF INDUCING POINTS

Inferring the number k and positions of the inducing points s'_i is critical to our model, as k directly affects the complexity of our scheme and the positions of the inducing points affect the quality of our prediction. Too large a k will lead to an unduly large complexity. Too small a k will lead to loss of information (and subsequently excessively uncertain predictions from training), and might make assertion 2 of our prior specification inappropriate. For a given k, if the inducing points are not carefully chosen, the coverage of the domain will not be adapted to changes in the intensity function and as a result, the predictive variance in certain parts of the domain might considerably differ from the posterior variance we would have obtained, had we chosen inducing points in those parts of the domain.

Intuitively, a good algorithm to find inducing points should leverage prior knowledge about the smoothness, periodicity, amplitude and length scale(s) of the intensity function to optimize for the quality of (post-training) predictions while minimising the number of inducing points.

We use as utility function for the choice of inducing points:

$$U(D') = \mathbb{E}_{\theta}\Big(\mathrm{Tr}\big(\Sigma^*_{DD'}(\theta)\,\Sigma^{*-1}_{D'D'}(\theta)\,\Sigma^{*T}_{DD'}(\theta)\big)\Big) \qquad (13)$$

where θ is the vector of hyper-parameters of the covariance kernel γ*, and the expectation is taken with respect to the prior distribution over θ. In other words, the utility of a set of inducing points is the expected total reduction of the (predictive) variances of log λ(s_1), ..., log λ(s_n) resulting from knowing log λ(s'_1), ..., log λ(s'_k).

In practice, the expectation in Equation (13) might not be available analytically. We can however use the Monte Carlo estimate:

$$\tilde U(D') = \frac{1}{N}\sum_{i=1}^{N} \mathrm{Tr}\big(\Sigma^*_{DD'}(\tilde\theta_i)\,\Sigma^{*-1}_{D'D'}(\tilde\theta_i)\,\Sigma^{*T}_{DD'}(\tilde\theta_i)\big). \qquad (14)$$

The algorithm proceeds as follows. We sample (θ̃_i)_{i=1}^N from the prior. Initially we set k = 0, D' = ∅ and u_0 = 0. We increment k by one, and consider adding an inducing point. We then find the point s'_k that maximises Ũ(D' ∪ {s}),

$$s'_k := \underset{s \in S}{\operatorname{argmax}}\ \tilde U(D' \cup \{s\}), \qquad (15)$$

using Bayesian optimisation (Mockus, 2013). We compute the utility of having k inducing points as u_k = Ũ(D' ∪ {s'_k}), we update D' = D' ∪ {s'_k} and stop when

(u_k − u_{k−1}) / u_k < α,

where 0 < α ≪ 1 is a convergence threshold.

Algorithm 1  Selection of inducing points
  Inputs: 0 < α ≪ 1, N, p_θ
  Output: u_f, D'
  k = 0, u_0 = 0, D' = ∅, e = 1;
  Sample (θ̃_i)_{i=1}^N from p(θ);
  while e > α do
    k = k + 1;
    s'_k = argmax_{s∈S} Ũ(D' ∪ {s});
    u_k = Ũ(D' ∪ {s'_k});
    D' = D' ∪ {s'_k};
    e = (u_k − u_{k−1}) / u_k;
  end while
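A minimal sketch of the greedy loop of Algorithm 1 using the Monte Carlo utility of Equation (14). For simplicity the argmax of Equation (15) is taken over a finite grid of candidate locations rather than with Bayesian optimisation, and the kernel, hyper-parameter samples and data below are illustrative assumptions.

```python
import numpy as np

def se_kernel(x, y, h, l):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return h ** 2 * np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / l ** 2)

def mc_utility(D, D_prime, thetas, jitter=1e-9):
    """Monte Carlo utility of Eq. (14): average over hyper-parameter samples of
    Tr(Sigma*_{DD'} Sigma*^{-1}_{D'D'} Sigma*_{DD'}^T)."""
    if len(D_prime) == 0:
        return 0.0
    total = 0.0
    for h, l in thetas:
        K_di = se_kernel(D, D_prime, h, l)
        K_ii = se_kernel(D_prime, D_prime, h, l) + jitter * np.eye(len(D_prime))
        total += np.trace(K_di.dot(np.linalg.solve(K_ii, K_di.T)))
    return total / len(thetas)

def select_inducing_points(D, candidates, thetas, alpha=0.01):
    """Greedy loop of Algorithm 1, with the argmax of Eq. (15) taken over `candidates`."""
    D_prime, u_prev = [], 0.0
    while True:
        best = max((s for s in candidates if s not in D_prime),
                   key=lambda s: mc_utility(D, D_prime + [s], thetas))
        u_k = mc_utility(D, D_prime + [best], thetas)
        D_prime.append(best)
        if u_prev > 0 and (u_k - u_prev) / u_k < alpha:   # stopping rule of Algorithm 1
            return D_prime, u_k
        u_prev = u_k

# Illustrative run: 200 hypothetical event times on [0, 50], N = 20 prior draws of (h, l).
rng = np.random.default_rng(0)
D = np.sort(rng.uniform(0.0, 50.0, 200))
thetas = list(zip(rng.uniform(0.5, 10.0, 20), rng.uniform(1.0, 25.0, 20)))
candidates = list(np.linspace(0.0, 50.0, 101))
D_prime, u_f = select_inducing_points(D, candidates, thetas)
```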

Proposition. (a) For any D, α, N and p_θ, Algorithm 1 stops in finite time and the sequence (u_k)_{k∈N} converges at least linearly with rate 1 − 1/#D.

(b) Moreover, the maximum utility u_f(α) returned by Algorithm 1 converges to the average total unconditional variance w_∞ := (1/N) Σ_{i=1}^N Tr(Σ*_{DD}(θ̃_i)) as α goes to 0.

The idea behind the proof of this proposition is that the sequence of maximum utilities u_k is positive, increasing², and upper-bounded by the total unconditional variance w_∞³. Hence, the sequence u_k converges to a strictly positive limit, which implies that the stopping condition of the while loop will be met in finite time regardless of D, α, N and p_θ. Finally, we construct a sequence w_k upper-bounded by the sequence u_k and that converges linearly to the average total unconditional variance w_∞ with rate 1 − 1/#D. As the sequence u_k converges and is itself upper-bounded by w_∞, its limit is w_∞ as well, and it converges at least as fast as w_k. (See the appendix for the full proof.)

² Intuitively, conditioning on a new point increases the reduction of variance from the unconditional variance.
³ The variance cannot be reduced by more than the total unconditional variance.

Our algorithm is particularly suitable to Poisson point processes as it prioritises sampling inducing points in parts of the domain where the data are denser. This corresponds to regions where the intensity function will be higher, thus where the local random counts of the underlying PPP will vary more⁴ and subsequently where the posterior variance of the intensity is expected to be higher. Moreover, it leverages prior smoothness assumptions on the intensity function to limit the number of inducing points and to appropriately and sequentially improve coverage of the domain.

⁴ The variance of the Poisson distribution is its mean.

Algorithm 1 is illustrated on a variety of real-life and synthetic data sets in Section 5.

4. Inference

We use a squared exponential kernel for γ* and scaled sigmoid Gaussian priors for the kernel hyper-parameters; that is θ_i = θ_i^max / (1 + exp(−x_i)), where the x_i are i.i.d. standard Normal. The problem-specific scales θ_i^max restrict the supports of those distributions using prior knowledge, to avoid unlikely extreme values and to improve conditioning.

We use a Block Gibbs Sampler (Geman & Geman, 1984) to sample from the posterior. We sample the hyper-parameters using the Metropolis-Hastings (Hastings, 1970) algorithm taking as proposal distribution the prior of the variable of interest. We sample the log-intensities at the inducing points using Elliptical Slice Sampling (Murray et al., 2010) with the pdf in Equation (12).

Prediction from training

To predict the posterior mean at the data points we note from the law of total expectation that

$$\forall s_i \in D, \quad \mathbb{E}\big(\log\lambda(s_i) \mid D\big) = \mathbb{E}\Big(\mathbb{E}\big(\log\lambda(s_i) \mid \{\log\lambda^*(s'_j)\}_{j=1}^{k}, D\big) \,\Big|\, D\Big). \qquad (16)$$

Also, we note from Equations (1) and (5) that the dependency of the posterior of log λ(s_i) conditional on {log λ*(s'_j)}_{j=1}^k is of the form

exp(log λ(s_i)) × N(log λ(s_i) | m(s_i), γ(s_i, s_i)),

where we recall that m(s_i) is the i-th element of the vector M and γ(s_i, s_i) is the i-th diagonal element of the matrix Σ_DD. Hence, the posterior distribution of log λ(s_i) conditional on {log λ*(s'_j)}_{j=1}^k is Gaussian with mean

$$\mathbb{E}\big(\log\lambda(s_i) \mid \{\log\lambda^*(s'_j)\}_{j=1}^{k}, D\big) = M[i] + \Sigma_{DD}[i, i] \qquad (17)$$

and variance

$$\mathrm{Var}\big(\log\lambda(s_i) \mid \{\log\lambda^*(s'_j)\}_{j=1}^{k}, D\big) = \Sigma_{DD}[i, i]. \qquad (18)$$

Finally, it follows from Equation (16) that E(log λ(s_i) | D) is obtained by averaging out M[i] + Σ_DD[i, i] over MCMC samples after burn-in.

Similarly, the law of total variance implies that

$$\mathrm{Var}\big(\log\lambda(s_i) \mid D\big) = \mathbb{E}\Big(\mathrm{Var}\big(\log\lambda(s_i) \mid \{\log\lambda^*(s'_j)\}_{j=1}^{k}, D\big) \,\Big|\, D\Big) + \mathrm{Var}\Big(\mathbb{E}\big(\log\lambda(s_i) \mid \{\log\lambda^*(s'_j)\}_{j=1}^{k}, D\big) \,\Big|\, D\Big). \qquad (19)$$

Hence, it follows from Equations (17) and (18) that the posterior variance at a data point s_i is obtained by summing up the sample mean of Σ_DD[i, i] with the sample variance of M[i] + Σ_DD[i, i], where sample mean and sample variance are taken over MCMC samples after burn-in.
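A minimal sketch of this prediction step: given post-burn-in MCMC samples of the per-point conditional means M[i] + Σ_DD[i, i] (Eq. 17) and variances Σ_DD[i, i] (Eq. 18), the posterior mean and variance follow from Equations (16) and (19). The arrays below are hypothetical stand-ins for quantities produced by the sampler.

```python
import numpy as np

def predict_log_intensity(cond_means, cond_vars):
    """Posterior mean and variance of log lambda(s_i) at the data points.

    cond_means: shape (n_samples, n_points), per-MCMC-sample values of M[i] + Sigma_DD[i, i] (Eq. 17).
    cond_vars:  same shape, per-sample values of Sigma_DD[i, i] (Eq. 18).
    """
    post_mean = cond_means.mean(axis=0)                         # Eq. (16), averaged over samples
    post_var = cond_vars.mean(axis=0) + cond_means.var(axis=0)  # Eq. (19): law of total variance
    return post_mean, post_var

# Hypothetical post-burn-in samples for 5 data points.
rng = np.random.default_rng(1)
cond_means = rng.normal(0.0, 0.3, size=(5000, 5))
cond_vars = rng.uniform(0.05, 0.2, size=(5000, 5))
mean, var = predict_log_intensity(cond_means, cond_vars)
```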

5. Experiments

We selected four data sets to illustrate the performance of our model. We restricted ourselves to one synthetic data set for brevity. We chose the most challenging of the synthetic intensity functions of (Adams et al., 2009) and (Rao & Teh, 2011), λ(t) = 2 exp(−t/15) + exp(−((t − 25)/10)²), to thoroughly compare our model with competing methods. We also ran our model on a standard 1-dimensional real-life data set (the coal mine disasters data set used in (Jarrett, 1979); 191 points) and a standard real-life 2-dimensional data set (spatial locations of bramble canes (Diggle, 1983); 823 points). Finally we ran our model on a real-life data set large enough to cause problems to competing models. This data set consists of the UTC timestamps (expressed in hours in the day) of Twitter updates in English published in the (Twitter sample stream, 2014) on September 1st 2014 (188,544 points).
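For reference, a draw from an inhomogeneous Poisson point process with the synthetic intensity above can be generated by thinning a homogeneous process. The sketch below is ours and only meant to show how such a data set can be reproduced; the upper bound on the intensity is chosen by inspection.

```python
import numpy as np

def sample_ppp_by_thinning(intensity, a, b, lam_max, rng):
    """Draw from a Poisson point process on [a, b] with the given intensity,
    by thinning a homogeneous process of rate lam_max >= sup intensity."""
    n = rng.poisson(lam_max * (b - a))
    candidates = rng.uniform(a, b, n)
    keep = rng.uniform(0.0, lam_max, n) < intensity(candidates)
    return np.sort(candidates[keep])

lam = lambda t: 2.0 * np.exp(-t / 15.0) + np.exp(-((t - 25.0) / 10.0) ** 2)
draw = sample_ppp_by_thinning(lam, 0.0, 50.0, lam_max=3.0, rng=np.random.default_rng(42))
```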

Table 1. Maximum output (resp. input) scale h_max (resp. l_max) used for each data set to select inducing points.

            SYNTHETIC   COAL MINE   BRAMBLE   TWITTER
  h_max        10.0        10.0       10.0      10.0
  l_max        25.0        50.0       0.25       5.0

Table 2. Number of inducing points produced by Algorithm 1 required to achieve some critical normalised utility values on the 4 data sets.

  u_k/u_∞   SYNTHETIC   COAL MINE   BRAMBLE   TWITTER
   0.75          2           2          8         3
   0.90          3           4         17         5
   0.95          4           5         28         8

[Figure 1. Average normalised utility u_k/u_∞ of choosing k inducing points using Algorithm 1 ± 1 std, as a function of k, on the synthetic data set (eg), the coal mine data set (cm), the Twitter data set (t) and the bramble canes data set (b). The average was taken over 10 runs.]

5.1. Inducing points selection

Figure 1 illustrates convergence of the selection of inducing points on the 4 data sets. We ran the algorithm 10 times with N = 20, and plotted the average normalised utility u_k/u_∞ ± 1 std as a function of the number of inducing points. Table 1 contains the maximum hyper-parameters that were used for each data set. Table 2 contains the number of inducing points required to achieve some critical normalised utility values for each of the 4 data sets. We note that just 8 inducing points were required to achieve a 95% utility for the Twitter data set (188,544 points). In regards to the positions of sampled inducing points, we note from Figures 2 and 3 that when the intensity function was bimodal, the first inducing point was sampled around the argument of the highest mode, and the second inducing point was sampled around the argument of the second highest mode. More generally, the algorithm sampled inducing points where the latent intensity function varies the most, as expected.

5.2. Intensity function

In each experiment we generated 5000 samples after burn-in (1000 samples). For each data set we used the set of inducing points that yielded a 95% normalised utility. The exact numbers are detailed in Table 2.

We ran a Monte Carlo simulation for the stochastic processes considered herein and found that the Legendre polynomial order p = 10 was sufficient to yield a quadrature estimate for the standard deviation of the integral less than 1% away from the Monte Carlo estimate (using the trapezoidal rule), and a quadrature estimate for the mean of the integral less than a standard error away from the Monte Carlo average. We took a more conservative stand and used p = 20.

Inference on synthetic data

We generated a draw from a Poisson point process with the intensity function λ(t) = 2 exp(−t/15) + exp(−((t − 25)/10)²) of (Adams et al., 2009) and (Rao & Teh, 2011). The draw consisted of 41 points (blue sticks in Figure 2). We compared our model to (Adams et al., 2009) (SGCP) and (Rao & Teh, 2011) (RMP). We ran the RMP model with the renewal parameter γ set to 1 (RMP 1), which corresponds to an exponential renewal distribution or equivalently an inhomogeneous Poisson process. We also ran the RMP model with a uniform prior on [1, 5] over the renewal parameter γ (RMP full). Figure 2 illustrates the posterior mean intensity function under each model. Finally we ran the Dirichlet process Mixture of Beta model of (Kottas, 2006) (DPMB). As detailed in Table 3, our model outperformed that of (Adams et al., 2009), (Rao & Teh, 2011) and (Kottas, 2006) in terms of accuracy and speed.
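The accuracy figures reported in Table 3 are errors between an estimated and the true intensity, normalised by the average true intensity over the domain. A minimal sketch of that computation, with a hypothetical stand-in for the inferred posterior mean:

```python
import numpy as np

def normalised_errors(post_mean, true_intensity, grid):
    """RMSE and MAE between a posterior mean intensity and the true intensity,
    expressed as a proportion of the average true intensity over the domain (as in Table 3)."""
    diff = post_mean(grid) - true_intensity(grid)
    scale = np.mean(true_intensity(grid))
    return np.sqrt(np.mean(diff ** 2)) / scale, np.mean(np.abs(diff)) / scale

true_lam = lambda t: 2.0 * np.exp(-t / 15.0) + np.exp(-((t - 25.0) / 10.0) ** 2)
post_lam = lambda t: true_lam(t) * 1.1          # hypothetical stand-in for an inferred posterior mean
grid = np.linspace(0.0, 50.0, 501)
print(normalised_errors(post_lam, true_lam, grid))
```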

[Figure 2. Inference on a draw (blue sticks) from a Poisson point process with intensity λ(t) = 2 exp(−t/15) + exp(−((t − 25)/10)²) (black line). The red dots are the inducing points generated by our algorithm, labelled in the order they were selected. The solid blue line and the grey shaded area are the posterior mean ± 1 posterior standard deviation under our model. SGCP is the posterior mean under (Adams et al., 2009). RMP full and RMP 1 are the posterior mean intensities under (Rao & Teh, 2011) with γ inferred and set to 1 respectively. DPMB is the Dirichlet process mixture of Beta (Kottas, 2006).]

Table 3. Some statistics on the MCMC runs of Figure 2. RMSE and MAE denote the Root Mean Square Error and the Mean Absolute Error, expressed as a proportion of the average of the true intensity function over the domain. LP denotes the log mean predictive probability on 10 held-out PPP draws from the true intensity ± 1 std. t(s) is the average time in seconds it took to generate 1000 samples ± 1 std and ESS denotes the average effective sample size (Gelman et al., 2013) per 1000 samples.

              MAE    RMSE    LP                t(s)              ESS
  SGCP        0.31   0.37    -45.07 ± 1.64     257.72 ± 16.29      6
  RMP 1       0.32   0.38    -45.24 ± 1.41     110.19 ±  7.37     23
  RMP FULL    0.25   0.31    -43.51 ± 2.15     139.64 ±  5.24      6
  DPMB        0.23   0.32    -42.95 ± 3.58      23.27 ±  0.94     47
  US          0.19   0.27    -42.84 ± 3.07       4.35 ±  0.12     38

Inference on real-life data

Figure 3 shows the posterior mean intensity functions of the coal mine data set, the Twitter data set and the bramble canes data set under our model.

[Figure 3. Inference on the intensity functions of the coal mine data set (top), the Twitter data set (middle), and the bramble canes data set (bottom). Blue dots are data points, red dots are inducing points (labelled in the upper panels in the order they were selected), the grey area is the 1 standard deviation confidence band.]

Scalability: We note that it took only 240s on average to generate 1000 samples on the Twitter data set (188,544 points). As a comparison, this is the amount of time that would be required to generate as many samples on a data set that has 50 points (resp. 100 points) under the models of (Adams et al., 2009) (resp. (Rao & Teh, 2011)). More importantly, it was not possible to run either of those two competing models on the Twitter data set. Doing so would require computing 17 × 10^10 covariance coefficients to evaluate a single auto-covariance matrix of the log-intensity at the data points, which a typical personal computer cannot handle.

6. Discussion

Scalability of the selection of inducing points

The computational bottleneck of the selection of inducing points is in the evaluation of

Tr(Σ*_{DD'}(θ̃_i) Σ*⁻¹_{D'D'}(θ̃_i) Σ*ᵀ_{DD'}(θ̃_i)).

Hence, the complexity and the memory requirement of the selection of inducing points are both linear in the number of data points n := #D.

The number of inducing points generated by our algorithm does not increase with the size of the data, but rather as a function of the size of the domain and the resolution implied by the prior over the hyper-parameters.

Comparison with competing models

We note that the computational bottleneck of our MCMC inference is in the evaluation of

Tr(Σ_DD) = Tr(Σ*_DD) − Tr(Σ*_{DD'} Σ*⁻¹_{D'D'} Σ*ᵀ_{DD'}).

Hence, inferring the intensity function under our model scales computationally in O(nk²) and has a memory requirement O(nk), where the number of inducing points k is negligible. This is considerably better than alternative methods using Gaussian processes (Adams et al., 2009; Rao & Teh, 2011), whose complexities are cubic in the number of data points and whose memory requirement is squared in the number of data points.
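To see why this trace costs O(nk²) time and O(nk) memory, note that it can be evaluated from the n×k cross-covariance and the k×k inducing-point covariance alone, without ever forming an n×n matrix. A small sketch with hypothetical covariance matrices:

```python
import numpy as np

def reduction_trace(K_nk, K_kk, jitter=1e-9):
    """Tr(Sigma*_{DD'} Sigma*^{-1}_{D'D'} Sigma*_{DD'}^T) in O(n k^2) time and O(n k) memory."""
    k = K_kk.shape[0]
    L = np.linalg.cholesky(K_kk + jitter * np.eye(k))   # k x k factorisation
    V = np.linalg.solve(L, K_nk.T)                      # k x n solve, costs O(n k^2)
    return np.sum(V ** 2)                               # equals the trace above

# Hypothetical covariances for n = 1000 data points and k = 10 inducing points.
rng = np.random.default_rng(3)
X, Z = rng.normal(size=(1000, 1)), rng.normal(size=(10, 1))
K_nk = np.exp(-0.5 * (X - Z.T) ** 2)
K_kk = np.exp(-0.5 * (Z - Z.T) ** 2)
print(reduction_trace(K_nk, K_kk))
```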
Moreover, the superior accuracy of our model compared to (Adams et al., 2009) and (Rao & Teh, 2011) is due to our use of the exponential transformation rather than the scaled sigmoid one. In effect, unlike the inverse scaled sigmoid function that tends to amplify variations, the logarithm tends to smooth out variations. Hence, when the true intensity is uneven, the log-intensity is more likely to resemble a draw from a stationary GP than the inverse scaled sigmoid of the true intensity function, and subsequently a stationary GP prior in the inverse domain is more suitable to the exponential transformation than to the scaled sigmoid transformation.

Our model is also more suitable than that of (Cunningham et al., 2008a) when confidence bounds are needed for the intensity function, or when the input space is of dimension higher than 1. The model is a useful alternative to that of (Kottas, 2006), whose complexity is also linear. In effect, Gaussian processes (GP) are more flexible than a Dirichlet process (DP) mixture of Beta distributions. This is the result of the large number of known covariance kernels available in the literature and the state-of-the-art understanding of how well a given kernel can approximate an arbitrary function (Micchelli et al., 2006; Pillai et al., 2007). Moreover, unlike a Dirichlet process mixture of Beta distributions, Gaussian processes allow directly expressing practical prior features such as smoothness, amplitude, length scale(s) (memory), and periodicity.

As our model relies on the Gauss-Legendre quadrature, we would not recommend it for applications with a large input space dimension. However, most interesting point process applications involve modelling temporal, spatial or spatio-temporal events, for which our model scales considerably better with the data size than competing approaches. In effect, the models proposed by (Kottas, 2006; Cunningham et al., 2008a;b; Rao & Teh, 2011) are all specific to unidimensional input data, whereas the model introduced by (Kottas & Sanso, 2007) is specific to spatial data. As for the model of (Adams et al., 2009), it scales very poorly with the input space dimension, for its complexity is cubic in the sum of the number of data points and the number of latent thinning points, and the number of thinning points grows exponentially with the input space dimension⁵.

⁵ The expected number of thinning points grows proportionally with the volume of the domain, which is exponential in the dimension of the input space when the domain is a hypercube with a given edge length.

Extension of our model

Although the covariance kernel γ* was assumed stationary, no result in this paper relied on that assumption. We solely needed to evaluate covariance matrices under γ*. Hence, the proposed model and algorithm can also be used to account for known non-stationarities. More generally, the model presented in this paper can serve as a foundation to make inference on the stochastic dependency between multiple point processes when the intensities are assumed to be driven by known exogenous factors, hidden common factors, and latent idiosyncratic factors.

7. Summary

In this paper we propose a novel exact non-parametric model to make inference on Poisson point processes using Gaussian processes. We derive a robust MCMC scheme to sample from the posterior intensity function. Our model outperforms competing benchmarks in terms of speed and accuracy as well as in the decorrelation of MCMC samples. A critical advantage of our approach is that it has a numerical complexity and a memory requirement linear in the data size n (O(nk²) and O(nk) respectively, with k ≪ n). Competing models using Gaussian processes have a cubic numerical complexity and squared memory requirement. We show that our model readily handles data sizes not yet considered in the literature.

Acknowledgments

Yves-Laurent Kom Samo is supported by the Oxford-Man Institute of Quantitative Finance.

References

Adams, R.P., Murray, I., and MacKay, D.J.C. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of the 26th International Conference on Machine Learning, pp. 9-16, 2009.

Basu, S. and Dassios, A. A Cox process with log-normal intensity. Insurance: Mathematics and Economics, 31(2):297-302, 2002.

Cox, D.R. Some statistical methods connected with series of events. Journal of the Royal Statistical Society, 17:129-164, 1955.

Cox, D.R. and Isham, V. (eds.). Point Processes. Chapman & Hall/CRC, 1980.

Cunningham, J.P., Shenoy, K.V., and Sahani, M. Fast Gaussian process methods for point process intensity estimation. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008.

Cunningham, J.P., Yu, B., Shenoy, K.V., and Sahani, M. Inferring neural firing rates from spike trains using Gaussian processes. Advances in Neural Information Processing Systems 20, pp. 329-336.

Daley, D.J. and Vere-Jones, D. An Introduction to the Theory of Point Processes. Springer-Verlag, 2008.

Diggle, P.J. Statistical Analysis of Spatial Point Patterns. Academic Press, 1983.

Diggle, P.J. A kernel method for smoothing point process data. Applied Statistics, 34:138-147, 1985.

Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., and Rubin, D.B. (eds.). Bayesian Data Analysis, third edition. CRC Press, 2013.

Geman, S. and Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.

Gregory, P.C. and Loredo, T.J. A new method for the detection of a periodic signal of unknown shape and period. The Astrophysical Journal, 398:146-168, 1992.

Gunter, T., Lloyd, C., Osborne, M.A., and Roberts, S.J. Efficient Bayesian nonparametric modelling of structured point processes. Uncertainty in Artificial Intelligence (UAI), 2014.

Hastings, W.K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97-109, 1970.

Heikkinen, J. and Arjas, E. Modeling a Poisson forest in variable elevations: a nonparametric Bayesian approach. Biometrics, 55:738-745, 1999.

Hildebrand, F.B. Introduction to Numerical Analysis, second edition, Chap. 8. Dover Publications, Inc., 2003.

Jarrett, R.G. A note on the intervals between coal-mining disasters. Biometrika, 66:191-193, 1979.

Kingman, J.F.C. Poisson Processes. Oxford Science Publications, 1992.

Kottas, A. Dirichlet process mixtures of beta distributions, with applications to density and intensity estimation. In Proceedings of the Workshop on Learning with Nonparametric Bayesian Methods, 23rd ICML, Pittsburgh, PA, 2006.

Kottas, A. and Sanso, B. Bayesian mixture modeling for spatial Poisson process intensities, with applications to extreme value analysis. Journal of Statistical Planning and Inference, 137:3151-3163, 2007.

Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., and Teller, E. Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087-1092, 1953.

Micchelli, C.A., Xu, Y., and Zhang, H. Universal kernels. Journal of Machine Learning Research, 7:2651-2667, 2006.

Mockus, J. Bayesian Approach to Global Optimization: Theory and Applications. Kluwer Academic, 2013.

Moeller, J., Syversveen, A., and Waagepetersen, R. Log-Gaussian Cox processes. Scandinavian Journal of Statistics, 1998.

Murray, I., Adams, R.P., and MacKay, D.J.C. Elliptical slice sampling. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 9-16, 2010.

Pillai, N.S., Wu, Q., Liang, F., Mukherjee, S., and Wolpert, R.L. Characterizing the function space for Bayesian kernel models. Journal of Machine Learning Research, 8:1769-1797, 2007.

Quinonero-Candela, J. and Rasmussen, C.E. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939-1959, 2005.

Rao, V.A. and Teh, Y.W. Gaussian process modulated renewal processes. Neural Information Processing Systems (NIPS), 2011.

Rasmussen, C.E. and Williams, C.K.I. Gaussian Processes for Machine Learning. The MIT Press, 2006.

Rathbun, S.L. and Cressie, N.A.C. Asymptotic properties of estimators for the parameters of spatial inhomogeneous Poisson point processes. Advances in Applied Probability, 26:122-154, 1994.

Twitter Inc. Twitter sample stream API. https://dev.twitter.com/streaming/reference/get/statuses/sample