Scalable Nonparametric Bayesian Inference on Point Processes with Gaussian Processes
Yves-Laurent Kom Samo [email protected]
Stephen Roberts [email protected]
Department of Engineering Science and Oxford-Man Institute, University of Oxford

Abstract

In this paper we propose an efficient, scalable non-parametric Gaussian process model for inference on Poisson point processes. Our model does not resort to gridding the domain or to introducing latent thinning points. Unlike competing models that scale as $O(n^3)$ over $n$ data points, our model has a complexity $O(nk^2)$, where $k \ll n$. We propose an MCMC sampler and show that the model obtained is faster, more accurate and generates less correlated samples than competing approaches on both synthetic and real-life data. Finally, we show that our model easily handles data sizes not considered thus far by alternate approaches.

1. Introduction

Point processes are a standard model when the objects of study are the number and repartition of otherwise identical points on a domain, usually time or space. The Poisson point process is probably the most commonly used point process. It is fully characterised by an intensity function that is inferred from the data. Gaussian processes have been successfully used to form a prior over the (log-) intensity function for applications such as astronomy (Gregory & Loredo, 1992), forestry (Heikkinen & Arjas, 1999), finance (Basu & Dassios, 2002), and neuroscience (Cunningham et al., 2008b). We offer extensions to existing work as follows: we develop an exact non-parametric Bayesian model that enables inference on Poisson processes. Our method scales linearly with the number of data points and does not resort to gridding the domain. We derive an MCMC sampler for core components of the model and show that our approach offers a faster and more accurate solution, as well as producing less correlated samples, compared to other approaches on both real-life and synthetic data.

2. Related Work

Non-parametric inference on point processes has been extensively studied in the literature. Rathbun & Cressie (1994) and Moeller et al. (1998) used a finite-dimensional piecewise constant log-Gaussian for the intensity function. Such approximations are limited in that the choice of the grid on which to represent the intensity function is arbitrary, and one has to trade off precision against computational complexity and numerical accuracy, with the complexity being cubic in the precision and exponential in the dimension of the input space. Kottas (2006) and Kottas & Sanso (2007) used a Dirichlet process mixture of Beta distributions as prior for the normalised intensity function of a Poisson process. Cunningham et al. (2008a) proposed a model using Gaussian processes evaluated on a fixed grid for the estimation of intensity functions of renewal processes with log-concave renewal distributions. They turned hyper-parameter inference into an iterative series of convex optimization problems, where ordinarily cubic-complexity operations such as Cholesky decompositions are evaluated in $O(n \log n)$ by leveraging the uniformity of the grid and the log-concavity of the renewal distribution. Adams et al. (2009) proposed an exact Markov Chain Monte Carlo (MCMC) inference scheme for the posterior intensity function of a Poisson process with a sigmoid Gaussian prior intensity, or equivalently a Cox process (Cox, 1955) with sigmoid Gaussian stochastic intensity. The authors simplified the likelihood of a Cox process by introducing latent thinning points. The proposed scheme has a complexity exponential in the dimension of the input space, cubic in the number of data and thinning points, and performs particularly poorly when the data are sparse. Gunter et al. (2014) extended this model to structured point processes. Rao & Teh (2011) used uniformization to produce exact samples from a non-stationary renewal process whose hazard function is modulated by a Gaussian process, and consequently proposed an MCMC sampler to sample from the posterior intensity of a unidimensional point process. Although the authors have illustrated that their model is faster than that of Adams et al. (2009) on some synthetic and real-life data, their method still scales cubically in the number of thinned and data points, and is not applicable to data in dimension higher than 1, such as spatial point processes.

3. Model

3.1. Setup

We are tasked with making non-parametric Bayesian inference on the intensity function of a Poisson point process assumed to have generated a dataset $\mathcal{D} = \{s_1, \dots, s_n\}$. To simplify the discourse without loss of generality, we will assume that data points take values in $\mathbb{R}^d$.

Firstly, let us recall that a Poisson point process (PPP) on a bounded domain $S \subset \mathbb{R}^d$ with non-negative intensity function $\lambda$ is a locally finite random collection of points in $S$ such that the numbers of points occurring in disjoint parts $B_i$ of $S$ are independent and each follows a Poisson distribution with mean $\int_{B_i} \lambda(s)\,ds$.

The likelihood of a PPP is given by:

$$L(\lambda \mid s_1, \dots, s_n) = \exp\left(-\int_S \lambda(s)\,ds\right) \prod_{i=1}^{n} \lambda(s_i). \qquad (1)$$
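As a concrete illustration of Equation (1), the following Python sketch evaluates the PPP log-likelihood of a candidate intensity function on a one-dimensional domain, approximating the integral by trapezoidal quadrature. The sinusoidal intensity, the quadrature rule, and all parameter values are illustrative assumptions, not components of the model proposed in this paper.

```python
import numpy as np

def ppp_log_likelihood(intensity, events, a, b, n_quad=1000):
    """Log-likelihood of a Poisson point process on [a, b] (Equation 1):
    log L = -int_S lambda(s) ds + sum_i log lambda(s_i),
    with the integral approximated by trapezoidal quadrature."""
    grid = np.linspace(a, b, n_quad)
    integral = np.trapz(intensity(grid), grid)  # approximates int_S lambda(s) ds
    return -integral + np.sum(np.log(intensity(events)))

# Illustrative usage: 100 events on [0, 24] scored under a sinusoidal intensity.
rng = np.random.default_rng(0)
events = rng.uniform(0.0, 24.0, size=100)
lam = lambda s: 5.0 + 3.0 * np.sin(2 * np.pi * s / 24.0)  # strictly positive
print(ppp_log_likelihood(lam, events, 0.0, 24.0))
```

Note that the quadrature grid here only serves to score a fixed candidate intensity; it plays no role in the inference scheme developed below.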
3.2. Tractability discussion

The approach adopted thus far in the literature to make non-parametric Bayesian inference on point processes using Gaussian processes (GPs) (Rasmussen & Williams, 2006) consists of putting a functional prior on the intensity function in the form of a positive function of a GP: $\lambda(s) = f(g(s))$, where $g$ is drawn from a GP and $f$ is a positive function. Examples of such $f$ include the exponential function and a scaled sigmoid function (Adams et al., 2009; Rao & Teh, 2011). This approach can be seen as a Cox process where the stochastic intensity follows the same dynamics as the functional prior. When the Gaussian process used has almost surely continuous paths, the random vector

$$\left(\lambda(s_1), \dots, \lambda(s_n), \int_S \lambda(s)\,ds\right) \qquad (2)$$

provably admits a probability density function (pdf). Moreover, we note that any piece of information not contained in the implied pdf over the vector in Equation (2) will be lost, as the likelihood only depends on those variables. Hence, given a functional prior postulated on the intensity function, the only necessary piece of information to be able to make a full Bayesian treatment is the implied joint pdf over the vector in Equation (2).

For many useful transformations $f$ and covariance structures for the GP, the aforementioned implied pdf might not be available analytically. We note, however, that there is no need to put a functional prior on the intensity function. In fact, for every finite-dimensional prior over the vector in Equation (2), there exists a Cox process with an a.s. $C^1$ intensity process that coincides with the postulated prior (see appendix for the proof).

This approach is similar to that of Kottas (2006). The author regarded $I = \int_S \lambda(s)\,ds$ as a random variable and noted that $p(s) = \frac{\lambda(s)}{\int_S \lambda(s)\,ds}$ can be regarded as a pdf whose support is the domain $S$. He then made inference on $(I, p(s_1), \dots, p(s_n))$, postulating as prior that $I$ and $(p(s_1), \dots, p(s_n))$ are independent, that $I$ has a Jeffreys prior, and that $(s_1, \dots, s_n)$ are i.i.d. draws from a Dirichlet process mixture of Beta distributions with pdf $p$.

The model we present in the following section puts an appropriate finite-dimensional prior on $\left(\lambda(s_1), \dots, \lambda(s_n), \lambda(s'_1), \dots, \lambda(s'_k), \int_S \lambda(s)\,ds\right)$ for some inducing points $s'_j$, rather than putting a functional prior on the intensity function directly.
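To make the finite-dimensional object in Equation (2) tangible, the sketch below draws one Monte Carlo sample of that vector under a log-Gaussian Cox prior, $\lambda = \exp(g)$ with $g$ a GP, by jointly sampling $g$ at the data points and at a quadrature grid. The RBF kernel, the jitter, and all parameter values are illustrative assumptions; the paper's own prior is specified in Section 3.3.

```python
import numpy as np

def sample_eq2_vector(events, a, b, lengthscale=2.0, variance=1.0,
                      mean=1.0, n_quad=200, rng=None):
    """One draw of (lambda(s_1), ..., lambda(s_n), int_S lambda(s) ds) under
    a log-Gaussian Cox prior: lambda = exp(g), g ~ GP(mean, RBF kernel).
    The joint Gaussian over g at the events and at a quadrature grid is
    sampled exactly; the integral is approximated by trapezoidal quadrature."""
    rng = rng or np.random.default_rng()
    grid = np.linspace(a, b, n_quad)
    pts = np.concatenate([events, grid])
    d = pts[:, None] - pts[None, :]
    K = variance * np.exp(-0.5 * (d / lengthscale) ** 2)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(len(pts)))  # jitter for stability
    g = mean + L @ rng.standard_normal(len(pts))
    lam = np.exp(g)
    integral = np.trapz(lam[len(events):], grid)
    return lam[:len(events)], integral
```

The $O((n + n_{\text{quad}})^3)$ cost of the Cholesky factorisation in this naive construction is precisely the bottleneck that the inducing-point model below is designed to avoid.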
3.3. Our model

3.3.1. INTUITION

The intuition behind our model is that the data are not a 'natural grid' at which to infer the value of the intensity function. For instance, if the data consist of 200,000 points on the interval $[0, 24]$, as in one of our experiments, it might not be necessary to infer the value of a function at 200,000 points to characterise it on $[0, 24]$. Instead, we find a small set of inducing points $\mathcal{D}' = \{s'_1, \dots, s'_k\}$, $k \ll n$, on our domain, through which we will define the prior over the vector in Equation (2) augmented with $\lambda(s'_1), \dots, \lambda(s'_k)$. The set of inducing points will be chosen so that knowing $\lambda(s'_1), \dots, \lambda(s'_k)$ would result in knowing the values of the intensity function elsewhere on the domain, in particular $\lambda(s_1), \dots, \lambda(s_n)$, with 'arbitrary certainty'. We will then analytically integrate out the dependency on $\lambda(s_1), \dots, \lambda(s_n)$ from the posterior, thereby reducing the complexity from cubic to linear in the number of data points without 'loss of information', and reformulating our problem as that of making exact Bayesian inference on the value of the intensity function at the inducing points. We will then describe how to obtain the predictive mean and variance of the intensity function elsewhere on the domain from training.

3.3.2. MODEL SPECIFICATION

Let us denote by $\lambda^*$ a positive stochastic process on $S$ such that $\log \lambda^*$ is a stationary Gaussian process with covariance kernel $\gamma^*: (s_1, s_2) \mapsto \gamma^*(s_1, s_2)$ and constant mean $m^*$. Let us further denote by $\hat{\lambda}$ a positive stochastic process on $S$ such that $\log \hat{\lambda}$ is a conditional Gaussian process coinciding with $\log \lambda^*$ at the $k$ inducing points $s'_1, \dots, s'_k$.
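The conditional process $\log \hat{\lambda}$ follows the standard Gaussian conditioning formulas, which the sketch below implements: given the values of $\log \lambda^*$ at the inducing points, it returns the conditional mean and variance of $\log \hat{\lambda}$ at arbitrary query locations. This is a minimal sketch under stated assumptions; the function names are hypothetical, and the RBF kernel stands in for the unspecified stationary kernel $\gamma^*$.

```python
import numpy as np

def conditional_log_gp(s_query, s_ind, log_lam_ind, kernel, m_star):
    """Conditional mean and variance of log lambda-hat at s_query, given the
    values log_lam_ind of log lambda* at the inducing points s_ind:
    mu = m* + K_qi K_ii^{-1} (v - m*),  var = k(s, s) - diag(K_qi K_ii^{-1} K_iq)."""
    K_ii = kernel(s_ind, s_ind) + 1e-8 * np.eye(len(s_ind))  # jitter
    K_qi = kernel(s_query, s_ind)
    L = np.linalg.cholesky(K_ii)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, log_lam_ind - m_star))
    mean = m_star + K_qi @ alpha
    V = np.linalg.solve(L, K_qi.T)  # so that V^T V = K_qi K_ii^{-1} K_iq
    var = np.diag(kernel(s_query, s_query)) - np.sum(V * V, axis=0)
    return mean, var

def rbf(x, y, ell=2.0, sig2=1.0):
    """Illustrative stationary kernel standing in for gamma*(s1, s2)."""
    d = x[:, None] - y[None, :]
    return sig2 * np.exp(-0.5 * (d / ell) ** 2)
```

With $k$ inducing points, the Cholesky factorisation costs $O(k^3)$ once, and each query point thereafter costs $O(k^2)$, which is consistent with the $O(nk^2)$ overall scaling stated in the abstract.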