Gaussian Process Modulated Cox Processes under Linear Inequality Constraints

Andrés F. López-Lopera        ST John        Nicolas Durrande

Mines Saint-Étienne*        PROWLER.io        PROWLER.io

Abstract

Gaussian process (GP) modulated Cox processes are widely used to model point patterns. Existing approaches require a mapping (link function) between the unconstrained GP and the positive intensity function. This commonly yields solutions that do not have a closed form or that are restricted to specific covariance functions. We introduce a novel finite approximation of GP-modulated Cox processes where positiveness conditions can be imposed directly on the GP, with no restrictions on the covariance function. Our approach can also ensure other types of inequality constraints (e.g. monotonicity, convexity), resulting in more versatile models that can be used for other classes of point processes (e.g. renewal processes). We demonstrate on both synthetic and real-world data that our framework accurately infers the intensity functions. Where monotonicity is a feature of the process, our ability to include this in the inference improves results.

1 INTRODUCTION

Point processes are used in a variety of real-world problems for modelling temporal or spatiotemporal point patterns in fields such as astronomy, geography, and ecology (Baddeley et al., 2015; Møller and Waagepetersen, 2004). In reliability analysis, they are used as renewal processes to model the lifetime of items or failure (hazard) rates (Cha and Finkelstein, 2018).

Poisson processes are the foundation for modelling point patterns (Kingman, 1992). Their extension to stochastic intensity functions, known as doubly stochastic Poisson processes or Cox processes (Cox, 1955), enables non-parametric inference on the intensity function and allows expressing uncertainties (Møller and Waagepetersen, 2004). Moreover, previous studies have shown that other classes of point processes may also be seen as Cox processes. For example, Yannaros (1988) proved that Gamma renewal processes are Cox processes under non-increasing conditions. A similar analysis was made later for Weibull processes (Yannaros, 1994).

Gaussian processes (GPs) form a flexible prior over functions, and are widely used to model the intensity process Λ(·) (Møller et al., 2001; Adams et al., 2009; Teh and Rao, 2011; Gunter et al., 2014; Lasko, 2014; Lloyd et al., 2015; Fernandez et al., 2016; Donner and Opper, 2018). However, to ensure positive intensities, this commonly requires link functions between the intensity process and the GP g(·). Typical examples of mappings are Λ(x) = exp(g(x)) (Møller et al., 2001; Diggle et al., 2013; Flaxman et al., 2015) or Λ(x) = g(x)² (Lloyd et al., 2015; Kozachenko et al., 2016). The exponential transformation has the drawback that there is no closed-form expression for some of the integrals required to compute the likelihood. Although the square inverse link function allows closed-form expressions for certain kernels, it leads to models exhibiting "nodal lines" with zero intensity due to the non-monotonicity of the transformation (see John and Hensman, 2018, for a discussion). Furthermore, current approaches to Cox process inference cannot be used in applications such as renewal processes that require both positivity and monotonicity constraints.

Here, we introduce a novel approximation of GP-modulated Cox processes that does not rely on a mapping to obtain the intensity. In our approach we impose the constraints (e.g. non-negativeness or monotonicity) directly on Λ(·) by sampling from a truncated Gaussian vector. This has the advantage that the likelihood can be computed in closed form.

* Part of this work was completed during an internship of A. F. López-Lopera at PROWLER.io.

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).

Moreover, our approach can ensure any type of linear inequality constraint everywhere, which allows modelling of a broader range of point processes.

This paper is organised as follows. In Section 2, we briefly describe inhomogeneous Poisson processes and some of their extensions. In Sections 3 and 4, we introduce a finite representation of GP-modulated Cox processes and the corresponding Cox process inference under inequality constraints. In Section 5, we apply our framework to 1D and 2D inference examples under different inequality conditions. We also test its performance in reliability applications with hazard rates exhibiting monotonic behaviours. Finally, in Section 6, we summarise our results and outline potential future work.

2 POISSON POINT PROCESSES

A Poisson process X is a random countable subset of S ⊆ ℝ^d where points occur independently (Baddeley et al., 2006). Let N ∈ ℕ be a random variable (r.v.) denoting the number of points in X. Let X_1, ..., X_n be a set of n independent and identically distributed (i.i.d.) r.v.'s on S. The likelihood of (N = n, X_1 = x_1, ..., X_n = x_n) under an inhomogeneous Poisson process with non-negative intensity λ(·) is given by (Møller and Waagepetersen, 2004)

f_(N, X_1, ..., X_n)(n, x_1, ..., x_n) = (exp(−μ)/n!) ∏_{i=1}^n λ(x_i),   (1)

where

μ = ∫_S λ(s) ds   (2)

is the intensity measure or overall intensity.

When S is the real line, the distance (inter-arrival time) between consecutive points of a Poisson process follows an exponential distribution. Renewal processes are a generalisation of Poisson processes where inter-arrival times are i.i.d. but not necessarily exponentially distributed. An example is the Weibull process where inter-arrival times are distributed following λ(x) = αβx^{β−1} (Cha and Finkelstein, 2018).

Cox processes (Cox, 1955) are a natural extension of inhomogeneous Poisson processes where λ(·) is sampled from a non-negative stochastic process Λ(·). Previous studies have shown that many classes of point processes can be seen as Cox processes under certain conditions (Møller and Waagepetersen, 2004; Yannaros, 1988, 1994). For example, Weibull renewal processes are Cox processes for β ∈ (0, 1] (Yannaros, 1994). This motivates the construction of GPs with non-negative and monotonic constraints, so that they can be used as intensities Λ(·) of Cox processes.

3 APPROXIMATION OF GP-MODULATED COX PROCESSES

In this work, we approximate the intensity Λ(·) of the Cox process by a finite-dimensional GP Λ_m(·) subject to some inequality constraints (e.g. boundedness, monotonicity, convexity). Since positiveness constraints are imposed directly on Λ_m(·), a link function is no longer necessary. This has two main advantages. First, the likelihood (1) can be computed analytically. Second, as our approach ensures any linear inequality constraint, it can be used for modelling a broader range of point processes.

3.1 Finite Approximation of 1D GPs

Let Λ(·) be a zero-mean GP on ℝ with arbitrary covariance function k. Consider x ∈ S, with compact space S = [0, 1], and a set of knots t_1, ..., t_m ∈ S. Here we consider equispaced knots t_j = (j − 1)∆_m with ∆_m = 1/(m − 1). We define Λ_m(·) as the finite-dimensional approximation of Λ(·) consisting of its piecewise-linear interpolation at the knots t_1, ..., t_m, i.e.,

Λ_m(x) = Σ_{j=1}^m φ_j(x) ξ_j,   (3)

where ξ_j := Λ(t_j) for j = 1, ..., m, and φ_1, ..., φ_m are hat basis functions given by

φ_j(x) := 1 − |x − t_j|/∆_m  if |x − t_j|/∆_m ≤ 1,  and 0 otherwise.   (4)

Similarly to spline-based approaches (e.g., Sleeper and Harrington, 1990), we assume that Λ(·) is piecewise defined by (first-order) polynomials. The striking property of this basis is that satisfying the inequality constraints (e.g. boundedness, monotonicity, convexity) at the knots implies that the constraints are satisfied everywhere in the input space (Maatouk and Bay, 2017). Although it is tempting to generalise the above construction to smoother basis functions, this makes the property difficult to enforce.
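To make the construction above concrete, here is a small base-R sketch (our own naming, not the authors' lineqGPR implementation) that evaluates the hat basis (4) and the finite approximation (3), and draws a prior sample of ξ whose entries are all non-negative at the knots by naive rejection for a small m; the SE kernel and its parameter values are illustrative choices.

# Hat basis functions (4) on equispaced knots t_j = (j - 1) * Delta_m.
hat_basis <- function(x, knots) {
  dm <- knots[2] - knots[1]                       # Delta_m
  sapply(x, function(xi) pmax(1 - abs(xi - knots) / dm, 0))  # m x length(x)
}

# Finite approximation (3): Lambda_m(x) = sum_j phi_j(x) * xi_j.
lambda_m <- function(x, xi, knots) as.vector(t(hat_basis(x, knots)) %*% xi)

# Squared-exponential covariance of the knot values xi_j = Lambda(t_j).
se_cov <- function(t1, t2, sigma2 = 1, ell = 0.2)
  sigma2 * exp(-outer(t1, t2, "-")^2 / (2 * ell^2))

# Naive rejection draw of xi with all entries >= 0 (illustration only;
# the paper uses the exact HMC sampler of Pakman and Paninski (2014)).
m     <- 5
knots <- seq(0, 1, length.out = m)
L     <- chol(se_cov(knots, knots) + 1e-8 * diag(m))
repeat {
  xi <- as.vector(t(L) %*% rnorm(m))              # unconstrained Gaussian draw
  if (all(xi >= 0)) break                         # keep it only if non-negative
}
x_grid <- seq(0, 1, length.out = 200)
stopifnot(all(lambda_m(x_grid, xi, knots) >= 0))  # Lambda_m >= 0 everywhere

Because Λ_m is a convex combination of the neighbouring ξ_j values, non-negativity (and, similarly, monotonicity or convexity) checked at the knots carries over to the whole interval, which is the property exploited throughout the paper.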

Figure 1: Samples from the prior Λ_m(·) under (a) no constraints, (b) non-negativeness constraints (C_+), and (c) both non-negativeness and non-increasing constraints (C_+^↓). The grey region shows the 95% confidence interval.

We aim at computing the distribution of Λ_m(·) under the condition that it belongs to a convex set E of functions defined by some inequality constraints (e.g. positivity). This piecewise-linear representation has the benefit that satisfying Λ_m(·) ∈ E is equivalent to satisfying only a finite number of inequality constraints. More precisely,

Λ_m(·) ∈ E  ⇔  ξ ∈ C,   (5)

where ξ = [ξ_1, ..., ξ_m]^⊤ and C is a convex set on ℝ^m. For non-negativeness conditions E_+, C is given by

C_+ := {c ∈ ℝ^m; ∀ j = 1, ..., m : c_j ≥ 0},   (6)

and for non-increasing conditions E_↓, C is given by

C_↓ := {c ∈ ℝ^m; ∀ j = 2, ..., m : c_{j−1} ≥ c_j}.   (7)

Constraints can be composed, e.g. the convex set of non-negativeness and non-increasing conditions is given by C_+^↓ = C_+ ∩ C_↓.

Assuming that ξ is zero-mean Gaussian-distributed with covariance matrix Γ = (k(t_i, t_j))_{1 ≤ i, j ≤ m}, the distribution of ξ conditioned on ξ ∈ C is a truncated Gaussian distribution. Then, quantifying uncertainty on Λ_m relies on sampling ξ ∈ C (see López-Lopera et al., 2018, for further discussion).

The effect of different constraints on samples from the prior Λ_m(·) can be seen in Figure 1. Here we set m = 100 and use a squared-exponential (SE) covariance function¹ with covariance parameters σ² = 1, ℓ = 0.2. The samples were generated via Hamiltonian Monte Carlo (HMC) (Pakman and Paninski, 2014).

¹ SE covariance function: k(t, t′) = σ² exp(−(t − t′)²/(2ℓ²)).

3.2 Application to 1D GP-Modulated Cox Processes

The key challenge in building GP-modulated Cox processes is the evaluation of the integral in the intensity measure. By considering Λ_m(·) as the intensity of the Cox process, the intensity measure (2) becomes

μ_m = ∫_0^1 Λ_m(x) dx = Σ_{j=1}^m ∫_0^1 φ_j(x) ξ_j dx = Σ_{j=1}^m c_j ξ_j,

where c_1 = c_m = ∆_m/2 and c_j = ∆_m for 1 < j < m. The likelihood of (N = n, X_1 = x_1, ..., X_n = x_n) is

f_(N, X_1, ..., X_n) | {ξ_1, ..., ξ_m}(n, x_1, ..., x_n) = (1/n!) exp(−Σ_{j=1}^m c_j ξ_j) ∏_{i=1}^n Σ_{j=1}^m φ_j(x_i) ξ_j.   (8)

Since (8) depends on the r.v.'s ξ_1, ..., ξ_m, it can be approximated using samples of ξ. To estimate the covariance parameters θ of the vector ξ, we can use stochastic global optimisation (Jones et al., 1998).

3.3 Extension to Higher Dimensions

The approximation in (3) can be extended to grids in d dimensions by tensorisation. For ease of notation, we assume the same number of knots m and knot-spacing ∆_m in each dimension, but the generalisation to different m_1, ..., m_d or ∆_{m_1}, ..., ∆_{m_d} is straightforward. Consider x = (x_1, ..., x_d) ∈ [0, 1]^d, and a set of knots per dimension (t_1^1, ..., t_m^1), ..., (t_1^d, ..., t_m^d). Then Λ_m is given by

Λ_m(x) = Σ_{j_1, ..., j_d = 1}^m [ ∏_{i = 1, ..., d} φ_{j_i}(x_i) ] ξ_{j_1, ..., j_d},   (9)

where ξ_{j_1, ..., j_d} := Λ(t_{j_1}^1, ..., t_{j_d}^d) and the φ_{j_i} are the hat basis functions defined in (4). Inequality constraints can be imposed as in López-Lopera et al. (2018). By substituting (9) in (2), we obtain

μ_m = ∫_{[0,1]^d} Λ_m(x) dx = Σ_{j_1, ..., j_d = 1}^m [ ∏_{i = 1, ..., d} c_{j_i} ] ξ_{j_1, ..., j_d},

with the c_{j_i} defined as in 1D, and the likelihood is

f_(N, X_1, ..., X_n) | ξ(n, x_1, ..., x_n) = (1/n!) exp(−Σ_{j_1, ..., j_d = 1}^m [ ∏_{i = 1, ..., d} c_{j_i} ] ξ_{j_1, ..., j_d}) × ∏_{i=1}^n Σ_{j_1, ..., j_d = 1}^m [ ∏_{k = 1, ..., d} φ_{j_k}(x_{i,k}) ] ξ_{j_1, ..., j_d},   (10)

where x_{i,k} is the k-th component of the point x_i.
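Returning to the 1D case of Section 3.2, the following base-R sketch (our own function names, not the authors' code) evaluates the integration weights c_j and the log of the likelihood (8); the weights follow from ∫_0^1 φ_j(x) dx = ∆_m for interior knots and ∆_m/2 at the two boundary knots.

# Integration weights c_j of the hat basis: c_1 = c_m = Delta_m / 2 and
# c_j = Delta_m otherwise, so that mu_m = sum_j c_j * xi_j in closed form.
hat_weights <- function(m) {
  dm <- 1 / (m - 1)
  c(dm / 2, rep(dm, m - 2), dm / 2)
}

# Log of the 1D likelihood (8) for a point pattern x in [0, 1], given knot
# values xi (assumed to satisfy the chosen inequality constraints).
loglik_1d <- function(x, xi, knots) {
  dm  <- knots[2] - knots[1]
  Phi <- sapply(seq_along(knots), function(j)
           pmax(1 - abs(x - knots[j]) / dm, 0))   # n x m matrix of phi_j(x_i)
  mu  <- sum(hat_weights(length(knots)) * xi)     # intensity measure mu_m
  lam <- as.vector(Phi %*% xi)                    # Lambda_m at the points
  -mu + sum(log(lam)) - lfactorial(length(x))     # log of (8)
}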

Due to the tensor structure of the finite representation, it becomes costly as the dimension d increases. The HMC sampler for truncated multivariate Gaussians from Pakman and Paninski (2014) follows the same dynamics as a classical HMC sampler, but the particle "bounces" on the boundaries if its trajectory reaches one of the inequality constraints. The computational complexity of each iteration scales linearly with the number of inequality conditions (e.g. m^d for positiveness constraints) if the iteration does not require any reflection, but it also increases with each bounce. Hence, in the best case, the computational complexity is O(m^d). However, this drawback could be mitigated by using sparse representations of the constraints (Pakman and Paninski, 2014), or by using other types of designs of the knots (e.g. sparse designs).

4 COX PROCESS INFERENCE

Having introduced the model, we now establish an inference procedure for Λ(·) using the approximation Λ_m(·). For readability, we only assume non-negativeness constraints, i.e. ξ ≥ 0, but the extension to other types of constraints can be made by constructing a set of linear inequalities of the form l ≤ Aξ ≤ u, where A is a full-rank matrix encoding the linear operations, and l and u are the lower and upper bounds. In that case, results for Aξ | {l ≤ Aξ ≤ u} are similar to those for ξ | {0 ≤ ξ < ∞}, and samples of ξ can be recovered from samples of Aξ by solving a linear system.

Consider the non-negative Gaussian vector ξ and its sample χ. The posterior distribution of ξ conditioned on a point pattern (N = n, X_1 = x_1, ..., X_n = x_n) is

f_ξ | {N=n, X_1=x_1, ..., X_n=x_n}(χ) ∝ f_(N, X_1, ..., X_n) | {ξ=χ}(n, x_1, ..., x_n) f_ξ(χ),   (11)

where the likelihood is defined in (8) and f_ξ(χ) is the (truncated) Gaussian density given by

f_ξ(χ) = exp(−½ χ^⊤ Γ^{-1} χ) / ∫_0^∞ exp(−½ s^⊤ Γ^{-1} s) ds,  for χ ≥ 0.   (12)

Since the posterior distribution (11) can be approximated using samples of ξ, it is possible to infer Λ_m(·) via Metropolis-Hastings.

4.1 Metropolis-Hastings Algorithm with Truncated Gaussian Proposals

The implementation of the Metropolis-Hastings algorithm requires a proposal distribution q for the next step in the Markov chain. In practice, Gaussian proposals are often used, leading to the famous random-walk Metropolis algorithm (Murphy, 2012). However, since inequality constraints are not necessarily satisfied using (non-truncated) Gaussian proposals, the standard algorithm can suffer from small acceptance rates due to constraint violations. We propose as an alternative a constrained version of the random-walk Metropolis algorithm where inequality conditions are ensured when sampling from the proposal q. As ξ is (non-negative) truncated Gaussian-distributed (with covariance matrix Γ), we suggest the truncated Gaussian proposal q given by

q(χ^{k+1} | χ^k) = exp(−½ [χ^{k+1} − χ^k]^⊤ Σ^{-1} [χ^{k+1} − χ^k]) / ∫_0^∞ exp(−½ [s − χ^k]^⊤ Σ^{-1} [s − χ^k]) ds,   (13)

where χ^{k+1}, χ^k ≥ 0 are samples of ξ and Σ is the covariance matrix. Sampling from q can then be performed via MCMC (Pakman and Paninski, 2014). We use Σ = ηΓ, where η is a scale factor. This has the benefit that we are sampling from a distribution with similar structure to the true one, while η controls the step size of the Metropolis-Hastings procedure and can be manually tuned to obtain a trade-off between mixing speed and acceptance rate of the algorithm. The acceptance probability is given by

α_k = [f̃_ξ | {N=n, X_1=x_1, ..., X_n=x_n}(χ^{k+1}) / f̃_ξ | {N=n, X_1=x_1, ..., X_n=x_n}(χ^k)] × β_k,   (14)

where β_k = q(χ^k | χ^{k+1}) / q(χ^{k+1} | χ^k), and

f̃_ξ | {N=n, X_1=x_1, ..., X_n=x_n}(χ) = exp(−½ χ^⊤ Γ^{-1} χ − c^⊤ χ) ∏_{i=1}^n φ^⊤(x_i) χ   (15)

is the (unnormalised) posterior distribution. Here φ(·) = [φ_1(·), ..., φ_m(·)]^⊤ and c = [c_1, ..., c_m]^⊤ are defined in (4) and (8). We now focus on the term β_k. Since the truncated Gaussian density has the same functional form as the non-truncated one, apart from the differing support and normalising constants, this yields

β_k = ∫_0^∞ exp(−½ [s − χ^k]^⊤ Σ^{-1} [s − χ^k]) ds / ∫_0^∞ exp(−½ [s − χ^{k+1}]^⊤ Σ^{-1} [s − χ^{k+1}]) ds.   (16)

The orthant integrals ∫_0^∞ exp(−½ [x − μ]^⊤ Σ^{-1} [x − μ]) dx cannot be computed in closed form, but they can be estimated via MC (Genz, 1992; Botev, 2017). Algorithm 1 summarises the implementation of the Metropolis-Hastings algorithm for the Cox process inference using the finite approximation of Section 3.
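The two ingredients of the acceptance probability can be sketched directly (base R, our own naming): the unnormalised posterior (15) on the log scale, and a plain Monte Carlo estimate of the Gaussian orthant probabilities entering β_k in (16). The experiments in the paper rely on the much more efficient minimax-tilting estimator of Botev (2017); the crude estimator below is only meant to make the quantity explicit.

# Log of the unnormalised posterior (15) at a non-negative vector chi.
log_post_unnorm <- function(chi, x, knots, Gamma_inv, cw) {
  dm  <- knots[2] - knots[1]
  Phi <- sapply(seq_along(knots), function(j)
           pmax(1 - abs(x - knots[j]) / dm, 0))   # n x m matrix of phi_j(x_i)
  lam <- as.vector(Phi %*% chi)                   # Lambda_m at the events
  -0.5 * sum(chi * (Gamma_inv %*% chi)) - sum(cw * chi) + sum(log(lam))
}

# Plain MC estimate of P(s >= 0) for s ~ N(mu, Sigma); the two orthant
# integrals in (16) are proportional to such probabilities with the same
# proportionality constant, so their ratio beta_k only needs these.
orthant_prob_mc <- function(mu, Sigma, n_mc = 200) {
  L <- chol(Sigma)
  Z <- matrix(rnorm(n_mc * length(mu)), n_mc)     # standard normal draws
  S <- Z %*% L + matrix(mu, n_mc, length(mu), byrow = TRUE)
  mean(apply(S >= 0, 1, all))
}

# Ratio beta_k in (16) for the current state chi_k and the proposal chi_new.
beta_k <- function(chi_k, chi_new, Sigma)
  orthant_prob_mc(chi_k, Sigma) / orthant_prob_mc(chi_new, Sigma)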

Algorithm 1 Metropolis-Hastings algorithm for Cox process inference with truncated Gaussian proposals
1: Input: χ^(0) ∈ (ℝ^m)^+, Γ (covariance matrix of ξ), η (scale factor).
2: for k = 0, 1, 2, ... do
3:   Sample χ′ ∼ N(χ^(k), ηΓ) such that χ′ ∈ C_+.
4:   Compute α_k as in (14).
5:   Sample u_k ∼ uniform(0, 1).
6:   Set the new sample to
7:     χ^(k+1) = χ′ if α_k ≥ u_k, and χ^(k+1) = χ^(k) if α_k < u_k.
8:   Compute λ_m^(k)(x) = Σ_{j=1}^m φ_j(x) χ_j^(k) at location x, with φ_j(·) defined in (4).

4.2 Inference with Multiple Observations

For N_o independent observations (X_{ν,1}, ..., X_{ν,n_ν}) with ν = 1, ..., N_o, the acceptance probability follows

α_k = [∏_{ν=1}^{N_o} f̃_ξ | {N_ν=n_ν, ..., X_{ν,n_ν}=x_{ν,n_ν}}(χ^{k+1}) / ∏_{ν=1}^{N_o} f̃_ξ | {N_ν=n_ν, ..., X_{ν,n_ν}=x_{ν,n_ν}}(χ^k)] × β_k,   (17)

with the posterior f_ξ | {N_ν=n_ν, X_{ν,1}=x_{ν,1}, ..., X_{ν,n_ν}=x_{ν,n_ν}} and β_k given by (11) and (16). Then, Algorithm 1 can be used with (17).

5 EMPIRICAL RESULTS

We test the performance of the finite approximation of GP-modulated Cox processes on 1D and 2D applications. In the following, we use the squared-exponential covariance for the Gaussian vector ξ so that we can compare to Lloyd et al. (2015). We estimate the covariance parameters θ = (σ², ℓ) by maximising the likelihood (8). For all numerical experiments, we fix m such that we obtain accurate resolutions of the finite representations while minimising the cost of one MCMC iteration (see Bay et al., 2016; Maatouk and Bay, 2017, for discussion about the convergence of the finite-dimensional approximation of GPs).² For simulating ξ, we use the exact HMC sampler proposed by Pakman and Paninski (2014). To approximate the Gaussian orthant probabilities from (16), we use the estimator proposed by Botev (2017) with 200 MC samples. We run Algorithm 1 with a scale factor η between 10⁻³ and 10⁻⁴ for a good trade-off between the mixing speed and the acceptance rate for each experiment.³ The number of discarded burn-in samples until the Markov chains became stationary varied between 10³ and 10⁴ samples. The code was implemented in the R programming language based on the package lineqGPR (López-Lopera, 2018).

² We tested our model for various values of m, observing that, after a certain value, inference results are unchanged.
³ We observed convergence of Algorithm 1 for a wide range of values of η ∈ [10⁻⁵, 10⁻²]. Fine-tuning of η can help the experiment run faster by gradually increasing η until the sampling mixes well.

5.1 Examples with Multiple Observations

Here, we test our approach using the three toy examples proposed by Adams et al. (2009),

λ_1(x) = 2 exp(−x/15) + exp(−[(x − 25)/10]²),
λ_2(x) = 5 sin(x²) + 6,
λ_3(x) = piecewise linear through (0, 2), (25, 3), (50, 1), (75, 2.5) and (100, 3).

The domains for λ_1, λ_2 and λ_3 are S_1 = [0, 50], S_2 = [0, 5] and S_3 = [0, 100], respectively.

Figure 2 shows the inference results using N_o = 1, 10, 100 observations sampled from the ground truth. With an increasing number of observations, the inferred intensity converges to the ground truth. Here, we fixed m = 100 and η = 10⁻³.

In Table 1, we assess the performance of our approach under non-negativeness constraints (cGP-C_+). We compare our inference results to the ones obtained with a log-Gaussian process (log-GP) modulated Cox process (Møller et al., 2001) and Variational Bayes for Point Processes (VBPP) (Lloyd et al., 2015) using the Q² criterion. This criterion is defined as Q² = 1 − SMSE(λ(·), λ̂(·)), where SMSE is the standardised mean squared error (Rasmussen and Williams, 2005). Q² is equal to one if the inferred λ̂(·) is exactly equal to the true λ(·), zero if it is equal to the average intensity λ̄, and negative if it performs worse than λ̄. We compute the Q² indicator on a regular grid of 1000 locations in S. Then, we compute the mean μ and standard deviation σ of the Q² results across 20 different replicates. Table 1 shows that our approach outperforms its competitors, with consistently higher means of the Q² results and smaller dispersion σ.

We assess the computational cost of our approach using the third toy example λ_3 for N_o = 100 (which has the largest number of events, with on average 22500 events in total). Obtaining one sample using our approach takes around 60 milliseconds, and generating all 10⁴ samples takes 10 minutes in total (in contrast to the 18 minutes required by VBPP).⁴ The multivariate effective sample size (ESS) (Flegal et al., 2017) was estimated at 322, corresponding to an effective sampling rate of 0.536 s⁻¹.

⁴ These experiments were executed on a single core of an Intel® Core™ i7-6700HQ CPU.
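For completeness, here is a base-R sketch of Algorithm 1 for a single observed point pattern, reusing log_post_unnorm(), beta_k() and hat_weights() from the sketches above. The truncated Gaussian proposal in step 3 is obtained here by simply re-drawing from N(χ^(k), ηΓ) until the candidate is non-negative, which is one straightforward (if not the most efficient) way of realising q in (13); the experiments in the paper instead rely on the exact samplers cited above. All names and default values are ours.

# Metropolis-Hastings for Cox process inference (a sketch of Algorithm 1)
# for one point pattern x rescaled to [0, 1].
run_mh <- function(x, knots, Gamma, eta = 1e-3, n_iter = 5000) {
  m         <- length(knots)
  Gamma_inv <- solve(Gamma + 1e-10 * diag(m))     # small jitter for stability
  cw        <- hat_weights(m)
  Sigma     <- eta * Gamma
  L         <- chol(Sigma + 1e-10 * diag(m))
  chain     <- matrix(NA_real_, n_iter, m)
  chi       <- rep(length(x), m)                  # constant positive start
  for (k in seq_len(n_iter)) {
    repeat {                                      # truncated Gaussian proposal
      chi_new <- as.vector(chi + t(L) %*% rnorm(m))
      if (all(chi_new >= 0)) break
    }
    log_alpha <- log_post_unnorm(chi_new, x, knots, Gamma_inv, cw) -
                 log_post_unnorm(chi,     x, knots, Gamma_inv, cw) +
                 log(beta_k(chi, chi_new, Sigma))
    if (log(runif(1)) < log_alpha) chi <- chi_new # accept, otherwise keep chi
    chain[k, ] <- chi
  }
  chain                                           # samples of xi given the data
}

The posterior mean intensity at any x is then obtained by averaging Σ_j φ_j(x) χ_j^(k) over the retained (post-burn-in) rows of the chain; for several observations, the log-posterior terms are simply summed over the N_o patterns, as in (17).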

Figure 2: Inference results with multiple observations (N_o = 1, 10, 100; columns) for the toy examples λ_1, λ_2, λ_3 (rows) from Adams et al. (2009). Each panel shows the point patterns (black crosses), the true intensity λ (red dashed lines) and the intensity inferred by the finite approximation of GP-modulated Cox processes (blue solid lines). The estimated 90% confidence intervals of the finite approximation are shown in grey.
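For reference, the Q² criterion defined above and reported in Table 1 below can be written as one minus the mean squared error between the true and inferred intensities, normalised by the empirical variance of the true intensity over the test grid; a minimal sketch (names are ours) is:

# Q^2 = 1 - SMSE(lambda, lambda_hat), evaluated on a regular test grid.
q2_score <- function(lambda_true, lambda_hat)
  1 - mean((lambda_hat - lambda_true)^2) / var(lambda_true)

# Example with the second toy intensity, lambda_2(x) = 5 sin(x^2) + 6 on [0, 5]:
x_test <- seq(0, 5, length.out = 1000)
lam2   <- 5 * sin(x_test^2) + 6
# q2_score(lam2, lambda_hat_on_x_test)  # lambda_hat_on_x_test: posterior mean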

Table 1: Q² results for the toy examples of Figure 2, averaged over 20 (†10) replicates. Our results (cGP-C_+) are compared to results for Møller et al. (2001) (log-GP) and Lloyd et al. (2015) (VBPP). Entries are Q² (μ ± σ) [%].

Toy | N_o | log-GP       | VBPP         | cGP-C_+
λ_1 |   1 | 51.2 ± 30.1  | 51.9 ± 26.1  | 65.7 ± 14.3
λ_1 |  10 | 95.1 ± 3.9   | 94.6 ± 3.7   | 95.4 ± 2.3
λ_1 | 100 | 99.5 ± 0.2   | 99.5 ± 0.3   | 99.5 ± 0.3
λ_2 |   1 | -35.2 ± 43.4 | -1.1 ± 28.8  | 0.7 ± 24.0
λ_2 |  10 | 72.6 ± 9.1   | 71.7 ± 10.4  | 81.9 ± 7.4
λ_2 | 100 | 95.4 ± 0.7   | 92.1 ± 3.9   | 97.8 ± 0.6
λ_3 |   1 | 49.2 ± 22.6  | 49.5 ± 29.9  | 58.1 ± 21.4
λ_3 |  10 | 91.7 ± 4.4   | 93.8 ± 2.8   | 94.3 ± 2.5
λ_3 | 100 | 98.4 ± 0.4   | 98.9 ± 0.3†  | 98.8 ± 0.3

5.2 Modelling Hazard Rates in Renewal Processes

Poisson processes have been extended to model renewal processes where intensity functions are seen as hazard rates defining the probability that an operating object fails (Serfozo, 2009; Cha and Finkelstein, 2018). However, in many applications, e.g. reliability engineering and survival analysis, hazard rates exhibit monotonic behaviours describing the degradation of items or the lifetime of organisms. For example, the hazard functions for the failure of many mechanistic devices and the mortality of adult humans tend to exhibit monotonic behaviours. Thus, taking monotonicity constraints into account in renewal processes is crucial for the study of many applications. Moreover, it is known that introducing monotonicity information in GPs can lead to more realistic uncertainties (Riihimäki and Vehtari, 2010; Maatouk and Bay, 2017).

As discussed in Section 2, some renewal processes can be seen as Cox processes under certain conditions. In order to demonstrate that we can model other types of point patterns, here we use two toy examples where hazard rates are known to be monotonic. Both examples are inspired by two classical renewal processes: the Weibull process and the Gamma process.

For the first class, the Weibull hazard function is

λ^W(x) = αβx^{β−1}  for x ≥ 0,   (18)

where α and β are the scale and shape parameters, respectively. Depending on β, λ^W can be either non-increasing (0 < β < 1), constant (β = 1), or non-decreasing (β > 1). Moreover, for β ∈ (0, 1], the Weibull renewal process can be seen as a Cox process (Yannaros, 1988). For the numerical experiments, we consider the case of non-increasing conditions in the domain S = [0, 100] by fixing α = 1 and β = 0.7 (see Figure 3). We test our framework using N_o = 100 observations from λ^W, and we consider non-negativeness conditions, with (cGP-C_+^↓) or without (cGP-C_+) taking into account the non-increasing constraint. We also consider the case where λ^W is non-increasing and convex (cGP-C_+^{↓⌣}).

For the Gamma class, the hazard function is given by

λ^G(x) = α x^{β−1} e^{−x} / [Γ(β) − Γ_x(β)],  for x ≥ 0,   (19)

where Γ(·) and Γ_x(·) are the Gamma function and the incomplete Gamma function, respectively (Cha and Finkelstein, 2018), and α and β are the scale and shape parameters.
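As a small illustration (base R, parameter defaults as in the experiments, names ours), the two hazard functions (18) and (19) can be written as follows, using the fact that Γ(β) − Γ_x(β) equals Γ(β) P(X > x) for X ~ Gamma(shape = β, rate = 1):

# Weibull hazard (18); non-increasing on (0, infinity) for 0 < beta < 1.
hazard_weibull <- function(x, alpha = 1, beta = 0.7)
  alpha * beta * x^(beta - 1)

# Gamma hazard (19); Gamma(beta) - Gamma_x(beta) equals gamma(beta) * P(X > x)
# for X ~ Gamma(shape = beta, rate = 1).
hazard_gamma <- function(x, alpha = 5, beta = 1.7)
  alpha * x^(beta - 1) * exp(-x) /
    (gamma(beta) * pgamma(x, shape = beta, lower.tail = FALSE))

# Quick check of the monotonic behaviours exploited in Section 5.2.
xs <- seq(0.05, 5, length.out = 200)
stopifnot(all(diff(hazard_weibull(xs)) <= 0),   # non-increasing (beta = 0.7)
          all(diff(hazard_gamma(xs))  >= 0))    # non-decreasing (beta = 1.7)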

Figure 3: Renewal inference examples under different inequality constraints using N_o = 100 and m = 100. Inference results are shown for (top row) a Weibull renewal process with α = 1 and β = 0.7, with columns corresponding to cGP-C_+, cGP-C_+^↓ and cGP-C_+^{↓⌣}, and (bottom row) a Gamma renewal process with α = 5 and β = 1.7, with columns corresponding to cGP-C_+, cGP-C_+^↑ and cGP-C_+^{↑⌢}. The panel description is the same as in Figure 2.

As for the Weibull process, different behaviours can be obtained using different values of β. Since similar profiles are obtained for β ∈ (0, 1], here we are interested in the case where λ^G exhibits non-decreasing behaviour (β > 1). We fix S = [0, 5], α = 5 and β = 1.7, obtaining a non-decreasing profile as shown in Figure 3. Here, we consider non-decreasing (cGP-C_+^↑), and non-decreasing and concave (cGP-C_+^{↑⌢}) constraints. Since λ^G(x) < α for x ∈ S, we add the constraint λ^G ∈ [0, α].

Figure 3 shows the inferred intensities of λ^W and λ^G under the different conditions previously discussed. In both experiments, we fixed m = 100 and η = 10⁻⁴. For the Weibull class λ^W, the performance of all three models, cGP-C_+, cGP-C_+^↓ and cGP-C_+^{↓⌣}, tends to be similar. However, the model without the monotonicity constraint exhibits undesired oscillations, whereas the other two approaches provide more realistic decreasing profiles and more accurate inference results for x > 50. We can also observe that the three models cannot learn the singularity at x = 0. Note that the proposed methodology does not make any assumption on the kernel, and it would be possible to consider a covariance function such as k(x, y)/(xy) in order to improve the model behaviour for small and large values of x. For the Gamma hazard function λ^G, one may clearly observe the benefits of adding the non-decreasing and concave constraints, obtaining absolute improvements between 0.8% and 3.5% of the Q² indicator. Both examples of Figure 3 show that the monotonicity and convexity conditions found in certain point processes can be difficult to learn directly from the data. This suggests that including those constraints in the GP prior is necessary to get accurate models with more realistic uncertainties.

5.3 2D Redwoods Data

We now assess the performance of the proposed approach on a 2D spatial problem. We use the dataset provided by Ripley (1977) which describes the locations of redwood trees. The dataset contains n = 195 events scaled to the unit square (see Figure 4). Here we choose m = 15, obtaining 225 knots in total, to obtain a good trade-off between resolution and computational cost. We use the product of two SE kernels with covariance parameters θ = (σ², ℓ_1, ℓ_2) as the covariance function of the Gaussian vector ξ, and we choose η = 10⁻⁴ in Algorithm 1. Following the burn-in step, we keep 10⁵ samples for the inference of λ(·), yielding a total running time of 7.6 hours (i.e. a sampling rate of approximately 4 s⁻¹).

Figure 4 shows the normalised inference results for the redwood dataset for different values of the lengthscale parameters. Since in our approach we directly impose the inequality conditions on the Gaussian vector ξ instead of using a link function, the interpretation of the lengthscale parameters (ℓ_1, ℓ_2) is the same as for standard GPs: one can find a trade-off between fidelity and regularity by tuning ℓ. One can note, from Figures 4(a) and 4(b), that both profiles tend to properly learn the point patterns, but more regularity is exhibited when ℓ_1 = ℓ_2 = 10⁻¹. For the case ℓ_1 = ℓ_2 = 10⁻², although the model follows the point patterns, one may observe noisy behaviour in regions without points, e.g. around (x_1, x_2) = (0.30, 0.85), as small values of ℓ lead to more oscillatory Gaussian random fields. Finally, we infer λ(·) when the covariance parameters θ are estimated via maximum likelihood using (10). According to the estimated lengthscales (ℓ̂_1 = 0.055, ℓ̂_2 = 0.084), one can conclude that the estimated intensity λ(·) is smoother along the second dimension x_2. This is in agreement with the inference results by Adams et al. (2009), where more variations of λ(·) were exhibited across x_1.
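To sketch how the 2D case of Section 3.3 is set up for this experiment (an illustration under our own naming, not the lineqGPR implementation), one can build the 15 × 15 knot grid and the separable SE covariance of ξ as a Kronecker product, mirroring (9); the lengthscale values below are the estimates reported above, and σ² = 1 is a placeholder.

# 1D SE covariance over the knots of one dimension.
se_cov_1d <- function(t, sigma2, ell)
  sigma2 * exp(-outer(t, t, "-")^2 / (2 * ell^2))

m      <- 15
knots  <- seq(0, 1, length.out = m)
sigma2 <- 1                                      # illustrative value
ell1   <- 0.055; ell2 <- 0.084                   # estimated lengthscales
# Covariance of xi_{j1, j2} = Lambda(t_{j1}, t_{j2}), stored with j1 fastest:
Gamma2d <- kronecker(se_cov_1d(knots, 1, ell2), se_cov_1d(knots, sigma2, ell1))

# 2D hat-basis evaluation at a point x = (x1, x2), mirroring (9).
hat_1d <- function(x, knots)
  pmax(1 - abs(x - knots) / (knots[2] - knots[1]), 0)
phi_2d <- function(x1, x2, knots)
  as.vector(outer(hat_1d(x1, knots), hat_1d(x2, knots)))  # length m^2
# Lambda_m(x) = sum over (j1, j2) of phi_{j1}(x1) phi_{j2}(x2) xi_{j1, j2};
# given a constrained sample xi2d of length m^2 (hypothetical placeholder):
# lambda_val <- sum(phi_2d(0.30, 0.85, knots) * xi2d)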

Figure 4: Inference results for the redwoods data from Ripley (1977); Baddeley et al. (2015), for (a) ℓ_1 = ℓ_2 = 0.1, (b) ℓ_1 = ℓ_2 = 0.01, and (c) the estimated lengthscales ℓ̂_1 = 0.055, ℓ̂_2 = 0.084. Each panel shows the point pattern (white dots) and the estimated intensity λ(·).

6 CONCLUSIONS

The proposed model for GP-modulated Cox processes is based on a finite-dimensional approximation of a GP that is constrained to be positive. This approach shows several advantages. First of all, it is based on general linear inequality constraints, so it allows us to incorporate more information, such as monotonicity and convexity, in the prior. As seen in the experiments, this appears to be particularly helpful when few data are available. Second, imposing the positivity constraint directly on the GP makes the use of a link function unnecessary. Both the likelihood and the intensity measure can be computed analytically, which is not always the case when using a link function. Finally, the fact that our model is based on a finite-dimensional representation ensures that the computational burden grows linearly with the number of observations.

There are two key elements that make the method work: (a) the finite-dimensional representation of the GP, which ensures that the constraints are satisfied everywhere, and (b) the dedicated MCMC proposal distribution based on a truncated normal distribution, which allows us to have high acceptance rates compared to a naive multivariate Gaussian proposal.

The main limitation regarding the scaling of the proposed method lies in the dimension of the input space. This is due to the construction by tensorisation of the basis functions used to obtain the finite-dimensional representation. Moreover, our model is also sensitive to three parameters: the dimensionality of the space in which we perform HMC, the number of constraints, and the number of times the HMC particles violate a constraint. However, we believe that these limitations are not inherent to the proposed model and that other types of designs of the knots (e.g. sparse designs) could be used in high dimensions.

Acknowledgements

This work was supported by the Chair in Applied Mathematics OQUAIDO (oquaido.emse.fr, France) and by PROWLER.io (www.prowler.io, UK). We thank O. Roustant (EMSE) and D. Rullière (ISFA) for their advice throughout this work.

References

Adams, R. P., Murray, I., and MacKay, D. J. (2009). Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In ICML, pages 9–16.

Baddeley, A., Gregori, P., Mahiques, J., Stoica, R., and Stoyan, D. (2006). Case Studies in Spatial Point Process Modeling. Lecture Notes in Statistics. Springer, New York.

Baddeley, A., Rubak, E., and Turner, R. (2015). Spatial Point Patterns: Methodology and Applications with R. Chapman & Hall/CRC Interdisciplinary Statistics. CRC Press, Boca Raton, FL.

Bay, X., Grammont, L., and Maatouk, H. (2016). Generalization of the Kimeldorf–Wahba correspondence for constrained interpolation. Electronic Journal of Statistics, 10(1):1580–1595.

Botev, Z. I. (2017). The normal law under linear restrictions: Simulation and estimation via minimax tilting. Journal of the Royal Statistical Society: Series B, 79(1):125–148.

Cha, J. and Finkelstein, M. (2018). Point Processes for Reliability Analysis: Shocks and Repairable Systems. Springer Series in Reliability Engineering. Springer, New York.

Cox, D. R. (1955). Some statistical methods connected with series of events. Journal of the Royal Statistical Society: Series B, 17(2):129–164.

Diggle, P. J., Moraga, P., Rowlingson, B., and Taylor, B. M. (2013). Spatial and spatio-temporal log-Gaussian Cox processes: Extending the geostatistical paradigm. Statistical Science, 28(4):542–563.

Donner, C. and Opper, M. (2018). Efficient Bayesian inference of sigmoidal Gaussian Cox processes. Journal of Machine Learning Research, 19(67):1–34.

Fernandez, T., Rivera, N., and Teh, Y. W. (2016). Gaussian processes for survival analysis. In NIPS, pages 5021–5029.

Flaxman, S., Wilson, A., Neill, D., Nickisch, H., and Smola, A. (2015). Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods. In ICML, pages 607–616.

Flegal, J. M., Hughes, J., Vats, D., and Dai, N. (2017). mcmcse: Monte Carlo standard errors for MCMC. https://cran.r-project.org/web/packages/mcmcse/.

Genz, A. (1992). Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141–150.

Gunter, T., Lloyd, C. M., Osborne, M. A., and Roberts, S. J. (2014). Efficient Bayesian nonparametric modelling of structured point processes. In UAI, pages 310–319.

John, S. T. and Hensman, J. (2018). Large-scale Cox process inference using variational Fourier features. In ICML, pages 2362–2370.

Jones, D. R., Schonlau, M., and Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492.

Kingman, J. (1992). Poisson Processes. Oxford Studies in Probability. Clarendon Press, New York.

Kozachenko, Y., Pogorilyak, O., Rozora, I., and Tegza, A. (2016). Simulation of Cox random processes. In Simulation of Stochastic Processes with Given Accuracy and Reliability, pages 251–304. Elsevier, Amsterdam.

Lasko, T. A. (2014). Efficient inference of Gaussian-process-modulated renewal processes with application to medical event data. In UAI, pages 469–476.

Lloyd, C. M., Gunter, T., Osborne, M. A., and Roberts, S. J. (2015). Variational inference for Gaussian process modulated Poisson processes. In ICML, pages 1814–1822.

López-Lopera, A. F. (2018). lineqGPR: Gaussian process regression models with linear inequality constraints. https://cran.r-project.org/web/packages/lineqGPR/.

López-Lopera, A. F., Bachoc, F., Durrande, N., and Roustant, O. (2018). Finite-dimensional Gaussian approximation with linear inequality constraints. SIAM/ASA Journal on Uncertainty Quantification, 6(3):1224–1255.

Maatouk, H. and Bay, X. (2017). Gaussian process emulators for computer experiments with inequality constraints. Mathematical Geosciences, 49(5):557–582.

Møller, J., Syversveen, A. R., and Waagepetersen, R. P. (2001). Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451–482.

Møller, J. and Waagepetersen, R. P. (2004). Statistical Inference and Simulation for Spatial Point Processes. Chapman & Hall/CRC Monographs on Statistics and Applied Probability. CRC Press, Boca Raton, FL.

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning). The MIT Press, Cambridge.

Pakman, A. and Paninski, L. (2014). Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542.

Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). MIT Press, Cambridge.

Riihimäki, J. and Vehtari, A. (2010). Gaussian processes with monotonicity information. In AISTATS, pages 645–652.

Ripley, B. D. (1977). Modelling spatial patterns. Journal of the Royal Statistical Society: Series B, 39(2):172–212.

Serfozo, R. (2009). Renewal and Regenerative Processes, pages 99–167. Springer, New York.

Sleeper, L. A. and Harrington, D. P. (1990). Regression splines in the Cox model with application to covariate effects in liver disease. Journal of the American Statistical Association, 85(412):941–949.

Teh, Y. W. and Rao, V. (2011). Gaussian process modulated renewal processes. In NIPS, pages 2474–2482.

Yannaros, N. (1988). On Cox processes and Gamma renewal processes. Journal of Applied Probability, 25(2):423–427.

Yannaros, N. (1994). Weibull renewal processes. Annals of the Institute of Statistical Mathematics, 46(4):641–648.