
Nonparametric Fisher Geometry with Application to Density Estimation

Babak Shahbaba (UC Irvine), Shiwei Lan (Mathematical and Statistical Sciences, Arizona State University), Jeffrey D. Streets (Mathematics, UC Irvine), Andrew J. Holbrook∗ (Biostatistics, UCLA)

Abstract

It is well known that the Fisher information induces a Riemannian geometry on parametric families of probability density functions. Following recent work, we consider the nonparametric generalization of the Fisher geometry. The resulting nonparametric Fisher geometry is shown to be equivalent to a familiar, albeit infinite-dimensional, geometric object—the sphere. By shifting focus away from density functions and toward square-root density functions, one may calculate theoretical quantities of interest with ease. More importantly, the sphere of square-root densities is much more computationally tractable. As discussed here, this insight leads to a novel Bayesian nonparametric density estimation model.

1 INTRODUCTION

The Fisher information—and the geometry it induces—has been one of the unequivocal success stories of geometry in statistics. Building on recent work, we extend the Fisher geometry beyond parametric statistical models and show that the resulting geometry is equivalent to that of the infinite-dimensional sphere. The purpose of this paper is to bring attention to this new perspective and to demonstrate its theoretical and methodological consequences. As an application, we introduce the χ²-process density prior, a flexible nonparametric model for Bayesian density estimation that admits fast computation while requiring minimal assumptions.

The Fisher information matrix is canonical in statistics: it is rooted in classical statistical theory (Gourieroux and Monfort, 1995); it appears in the Jeffreys prior of Bayesian analysis (Jeffreys, 1946); and it plays a central role in Bayesian and Frequentist asymptotics (Le Cam, 2012). Fisher advocated the importance of the matrix in maximum likelihood estimation (Fisher, 1925). Fisher's student, Rao, was the first to place the information matrix in a differential geometric context (Rao, 1945). Since then, the differential geometric implications for parametric statistical models have been the subject of extensive inquiry (Amari and Nagaoka, 2007). Recently, a number of researchers have drawn connections between the Fisher geometry and the geometry of the infinite sphere (Srivastava, Jermyn, and Joshi, 2007; Chen, Streets, and Shahbaba, 2015; Itoh and Satoh, 2015; Kurtek and Bharath, 2015; Srivastava and Klassen, 2016; Peter, Rangarajan, and Moyou, 2017). Much of this work has been in the area of shape analysis and has focused on using the Fisher geometry to measure distances between probability densities. Bayesian uses for the nonparametric Fisher geometry were featured in (Chen, Streets, and Shahbaba, 2015), where Bayesian variational inference was accomplished by minimizing the Fisher distance, and in (Kurtek and Bharath, 2015), where the nonparametric Fisher geometry was used for sensitivity analysis of Bayesian models. Here, we focus on fully Bayesian nonparametric inference, including the generation of posterior samples using Hamiltonian Monte Carlo (HMC). In contrast to recent research, the geodesics associated with the nonparametric Fisher geometry are used to efficiently explore the MCMC state space and not to measure or minimize the distance between density functions.

This paper, and other recent research in the Fisher geometry, builds on the sub-field of square-root density estimation. (Pinheiro and Vidakovic, 1997) used a wavelet basis to estimate the square-root density by effectively fitting the curve and then normalizing a sparse collection of wavelet coefficients, and (Müller and Vidakovic, 1998) introduced a Bayesian follow-up to this work. Recently, (Hong and Gao, 2016) used Riemannian geometry to fit a square-root density model, but did not make any connections to the Fisher geometry.

∗Corresponding Author: [email protected]

Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124, 2020.

More recently, (Peter, Rangarajan, and Moyou, 2017) performed square-root density estimation for object recognition using minimum description length as the fitting criterion and used the nonparametric Fisher geometry to obtain a closed-form expression of this criterion.

In this paper, we focus on the application of the nonparametric Fisher geometry to Bayesian inference for probability densities. While the density function is the object of interest, we instead model the square-root density function, that is, the function the square of which integrates to unity. We take a Bayesian nonparametric approach and endow the square-root density with a Gaussian process (GP) prior (Williams and Rasmussen, 1996) multiplied by a Dirac measure limiting its support to the infinite-dimensional sphere. In order to maintain this restriction, it is useful to use the Karhunen-Loève (K-L) expansion (Wang, 2008) of the GP prior as opposed to its kernel representation. Every GP with bounded second moment may be represented in terms of the eigenfunction expansion of its covariance operator, but this (the K-L) expansion is only explicitly known for a few classes of GPs (Wang, 2008). Still, the K-L expansion has seen much recent success in the realm of Bayesian inverse problems (Dashti and Stuart, 2013; Cotter et al., 2013) and has been featured in infinite-dimensional HMC and infinite manifold HMC (∞-mHMC) (Beskos et al., 2016). The proposed application of the K-L expansion to model the square-root density is unprecedented and offers a probabilistic interpretation to the use of basis expansions for density estimation.

Due to the orthonormality of the eigenfunction basis, the restriction to the (uncountably) infinite-dimensional sphere translates to a restriction to the (countably) infinite-dimensional sphere for the K-L coefficients of the GP. Then, following the precedent set in (Beskos et al., 2016), the K-L expansion is truncated and the object of inference is reduced to the posterior distribution of a finite number of K-L coefficients restricted to a finite sphere. This computation is quick and easy using spherical HMC (Lan, Zhou, and Shahbaba, 2014). Thanks to the basis representation, computational complexity scales linearly with the number of data points, as opposed to the cubic rate of the GP density sampler (Murray, MacKay, and Adams, 2009). Moreover, we show that—in the square-root density estimation context—spherical HMC corresponds to Riemannian HMC in the infinite-dimensional limit.

Squaring the GP square-root density prior gives a χ²-process (cf. Rabier and Genz, 2014) density prior. We illustrate the use of this prior for a number of problems. The model is flexible and its posterior draws provide plausible realizations of the uncertainty inherent in the density estimation problem. Besides a recent application to Bayesian quadrature (Gunter et al., 2014), we are unaware of statistical applications for the χ²-process and are therefore pleased to present its novel application to Bayesian density estimation. In that sense, our method can be considered as an alternative to Dirichlet Process Mixture Models (DPMM), which are commonly used for nonparametric Bayesian estimation. DPMMs convolve the Dirichlet process with a smooth distribution, in effect constructing an infinite mixture model (Antoniak, 1974). More recently, (Murray, MacKay, and Adams, 2009) proposed a new method, called the Gaussian Process Density Sampler (GPDS), offering a similar amount of flexibility as the DPMM but having an arguably simpler framework. Nonetheless, inference for the DPMM requires an advanced Gibbs sampling routine (Neal, 2000), and inference for the GPDS requires exchange sampling to handle the unit-integral restriction on the GP model (Murray, MacKay, and Adams, 2009). In contrast, the model we propose here can be computed using generic spherical HMC (Lan, Zhou, and Shahbaba, 2014) or geodesic Monte Carlo (Byrne and Girolami, 2013) algorithms.

In summary, the contributions of this paper are as follows:

• we review a nonparametric generalization of the Fisher geometry and show its relationship to the infinite-dimensional (L²) sphere, the space of square-root density functions;

• we derive the geodesics on the L² sphere and use these geodesics to formalize the relationship between Riemannian HMC and infinite-dimensional spherical HMC;

• focusing on Bayesian nonparametric density estimation, we demonstrate the practical benefits of modeling the square-root density function. The resulting χ²-process density prior performs well for a variety of problems and is efficiently computed using spherical HMC.

The rest of the paper is organized in the following way. In Section 2 we review the parametric Fisher geometry, present a nonparametric extension of the Fisher geometry, and derive key results by relating this geometry to the infinite-dimensional sphere. Section 3 presents the χ²-process density prior along with some necessary tools, such as the Karhunen-Loève expansion. In Section 4, we discuss efficient Bayesian inference for the model and relate Riemannian HMC to infinite-dimensional spherical HMC. Section 5 relates our method to the Cox process (Cox, 1955). Empirical results are presented in Section 6. Finally, in Section 7 we discuss model limitations and possible extensions. All proofs are placed in the Supplement.

2 THE NONPARAMETRIC FISHER GEOMETRY

2.1 THE PARAMETRIC FISHER GEOMETRY

Given data x in domain D, it is often useful to specify a probabilistic model S = {p_θ = p(x, θ) | θ = (θ_1, ..., θ_p)}, where θ is a vector parameterizing the model and taking values in the continuous parameter space Θ. Then at any point θ ∈ Θ, the Fisher information is the expectation of the negative log-likelihood Hessian:

I(θ) = −E_x[ ∂²ℓ(θ)/∂θ∂θᵀ ] = −∫_D ( ∂²ℓ(θ)/∂θ∂θᵀ ) p(x|θ) µ(dx),

where ℓ(θ) = log p(x|θ). The Fisher information encodes second-order functional information about ℓ(θ). This fact explains the use of the Fisher information as a gradient preconditioning matrix in both (the Frequentist) Fisher scoring (Longford, 1987) and (the Bayesian) Riemannian HMC (Girolami and Calderhead, 2011).
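The Hessian and score-outer-product characterizations of the Fisher information are easy to verify numerically. The following minimal sketch (not from the paper; all function names are illustrative) evaluates both expectations exactly for a Bernoulli(θ) model, for which I(θ) = 1/(θ(1−θ)):

```python
# Minimal sketch (illustrative, not from the paper): for a Bernoulli(theta)
# model, the Fisher information is I(theta) = 1/(theta*(1-theta)), and both
# the negative-Hessian and the score-outer-product expectations recover it.

def score(x, theta):
    # d/dtheta log p(x|theta), with p(x|theta) = theta^x * (1-theta)^(1-x)
    return x / theta - (1 - x) / (1 - theta)

def neg_hessian(x, theta):
    # -d^2/dtheta^2 log p(x|theta)
    return x / theta**2 + (1 - x) / (1 - theta)**2

theta = 0.3
exact = 1.0 / (theta * (1.0 - theta))

# Expectations over x in {0, 1} computed as exact two-point sums
outer = theta * score(1, theta) ** 2 + (1 - theta) * score(0, theta) ** 2
hessian = theta * neg_hessian(1, theta) + (1 - theta) * neg_hessian(0, theta)

print(exact, outer, hessian)  # all three coincide
```

The same two-point sum trick works for any finite-support model; for continuous models the expectations would be integrals or Monte Carlo averages.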

The Fisher information may also be written as the expected outer product of the score vector ∂ log p(x|θ)/∂θ:

I(θ) = E_x[ (∂ℓ(θ)/∂θ)(∂ℓ(θ)/∂θ)ᵀ ] = ∫_D (∂ℓ(θ)/∂θ)(∂ℓ(θ)/∂θ)ᵀ p(x|θ) µ(dx).

The Fisher information is symmetric positive definite at any point θ ∈ Θ. Taking note of this fact, Rao (1945) interpreted the Fisher information matrix as a Riemannian metric tensor, i.e. a smoothly varying, symmetric positive definite matrix defined over the parameter space Θ. In this way, the Fisher information matrix induces a Riemannian metric g_θ(·,·) over Θ satisfying

g_θ(ℓ_i, ℓ_j) = I_ij(θ), and g_θ(ψ, φ) = Σ_{i,j} ψ^i φ^j I_ij(θ)

for ℓ_i = ∂ℓ(θ)/∂θ_i, ψ = Σ_{k=1}^p ψ^k ℓ_k, and φ = Σ_{k=1}^p φ^k ℓ_k. Hence, the Fisher information may be thought of as inducing a non-trivial geometry on the otherwise Euclidean parameter space Θ. There has been much inquiry into the nature of the parametric Fisher geometry. Efron used the Fisher geometry to prove the second-order efficiency of the MLE for exponential family models (Efron, 1978), and Amari and Nagaoka (2007) have constructed a body of work around the Fisher geometry and its dual connections. More recently, Girolami and Calderhead (2011) successfully used the Fisher geometry to guide the Hamiltonian flow of their Riemannian HMC. In this paper, we take another tack by generalizing the notion of the Fisher geometry to nonparametric models.

2.2 BEYOND PARAMETRIC MODELS

We consider probability distributions over smooth manifolds D, of which D ≅ R^d is a special case. Having fixed a background measure µ, let

P := { p : D → R | p ≥ 0, ∫_D p(x) µ(dx) = 1 }

be the space of probability density functions over D. That is, P is the set of Radon-Nikodym derivatives of probability measures that are absolutely continuous with respect to µ. The following construction is agnostic to whether µ is the Lebesgue measure over D = R^d or the Hausdorff measure over a general Riemannian manifold D = M.

We deal with the space P and do not fix a parametric model. Instead we give P the structure of an infinite-dimensional (formal) Riemannian manifold. First, we think of it as a smooth manifold. Observe that for a given p ∈ P, the tangent space can be identified with

T_pP := { φ ∈ C^∞(D) | ∫_D φ(x) µ(dx) = 0 }.

This identification arises when one differentiates the unit-measure condition on probability density functions. That is, for a smooth curve p_t : (−ε, ε) → P satisfying dp_t/dt|_{t=0} = φ, we have

0 = (d/dt) ∫_D p_t(x) µ(dx) |_{t=0} = ∫_D (dp_t/dt)(x) µ(dx) = ∫_D φ(x) µ(dx).

Now that we have a smooth manifold and an associated tangent space, we may define a Riemannian metric, i.e. a smoothly varying, symmetric, non-degenerate, bilinear function g(·,·)_p : T_pP × T_pP → {0} ∪ R⁺. Riemannian metrics are useful for developing a notion of distance on a manifold that does not depend on any embedding in Euclidean space. One may define uncountably many metrics on a general manifold, but we are interested in a generalization of the parametric Fisher information metric.

Definition 1. Given D, the nonparametric Fisher information metric on P(D) is (Srivastava, Jermyn, and Joshi, 2007; Srivastava and Klassen, 2016)

g_F(φ, ψ)_p := ∫_D φ(x)ψ(x)/p(x) µ(dx). (1)

This metric is a consistent generalization of the parametric Fisher information metric. To see this, consider the parametric model p(x|θ), with θ as a vector. Then each element θ_i of θ defines a curve Θ_i → P, where Θ_i is a slice of Θ, and

I_ij(θ) = ∫_D ℓ_i ℓ_j p(x|θ) µ(dx) = ∫_D ( p_i(x|θ)/p(x|θ) )( p_j(x|θ)/p(x|θ) ) p(x|θ) µ(dx) = ∫_D p_i(x|θ) p_j(x|θ)/p(x|θ) µ(dx).

Here, we have adopted the shorthand p_i(x|θ) = ∂p(x|θ)/∂θ_i. Expressed in a more invariant fashion, interpreting a model as a map θ : Θ → P, one has that the parametric Fisher metric is induced by the nonparametric Fisher metric, i.e.

θ∗ g_F = g_θ.

In what follows we make a nontrivial change of variables suggested by this geometric picture, which provides various theoretical and computational simplifications. In particular, for various reasons the manifold P equipped with Riemannian metric (1) is not particularly easy to deal with. In order to calculate geometric quantities of interest (e.g. geodesics, distances), we shift focus to the L² unit sphere, i.e. the space of square-root density functions

Q := { q : D → R | ∫_D q(x)² µ(dx) = 1 }.

This space, which is identified with P by a simple transformation indicated below, provides a much simpler backdrop for calculations. This infinite-dimensional L² sphere is a surprisingly familiar object. Its tangent spaces and geodesics are formally the exact same as those of the finite-dimensional sphere S^{n−1}, the only difference being the replacement of the Euclidean inner product with the integral inner product of L²:

⟨f, h⟩_{L²} = ∫_D f(x)h(x) µ(dx).

Remarkably, this simpler space is isometric to the space of density functions equipped with the nonparametric Fisher metric defined above. See the supplementary file for more information along with some basic results. As we will see below, not only is the L² sphere Q more theoretically tractable, it also turns out to be more computationally tractable. In the following sections, we take advantage of these two kinds of tractability to construct a Bayesian nonparametric model on Q and use it for an application in density estimation.

3 THE CHI-SQUARE PROCESS PRIOR

In this section, we transition from the theoretical to the methodological aspects of the nonparametric Fisher geometry. We find that the square-root representation q = √p is of use practically as well as theoretically. Here we focus on its natural application for density estimation and show that Bayesian density estimation can be much easier when one shifts focus to the sphere of square-root densities.

Suppose we want to attribute a smooth density function to observed data x_1, ..., x_n on finite domain D ⊂ R^d, and recall the definitions (from Section 2) of the space of density functions and the space of square-root density functions:

P := { p : D → R | p ≥ 0, ∫_D p(x) µ(dx) = 1 },
Q := { q : D → R | ∫_D q(x)² µ(dx) = 1 },

respectively. We want to find a suitable element p(·) ∈ P(D), the space of density functions over domain D. Although this space contains the functions of interest, we opt to deal with the space Q of square-root densities instead. As stated in the prior section, Q is the unit sphere in the infinite-dimensional Hilbert space L²(D). We model the square-root density with a GP prior (or a Gaussian measure in L²) multiplied by the Dirac measure restricting the function to the unit sphere:

q ∼ GP × δ_q(Q). (2)

It turns out that it is much easier to enforce the constraint given by Dirac measure δ_q(Q) than it is to enforce the corresponding constraint δ_p(P) (as is done for the GPDS). To do so, however, we do not represent the GP prior using its kernel representation as is commonly done in the literature. We opt instead to represent q in terms of the eigenvalues and orthonormal eigenfunctions of its covariance operator.

3.1 KARHUNEN-LOÈVE REPRESENTATION

In order to tractably enforce the constraint δ_q(Q) in (2), it is helpful to write q as a function (or linear sum of functions) for which we know the values of both

∫_D q(x) µ(dx) and ∫_D q(x)² µ(dx).

This condition is satisfied by representing the random function q as a linear combination of orthonormal basis functions. The K-L representation (Wang, 2008) provides a canonical way of doing so and thus links our fully probabilistic approach to other square-root density methods that rely on a basis (Pinheiro and Vidakovic, 1997; Müller and Vidakovic, 1998; Hong and Gao, 2016).
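The basis representation can be made concrete with a small sketch: draw coefficients u_i ∼ N(0, λ_i²), normalize them onto the sphere, and square the resulting function to obtain a nonnegative density integrating to one. The cosine (Laplacian) eigenbasis on D = [0, 1] and the eigenvalue decay used below are illustrative assumptions, not the paper's prescriptions:

```python
import numpy as np

# Illustrative sketch (assumptions, not the paper's implementation):
# a truncated Karhunen-Loeve draw on D = [0, 1], assuming the Laplacian
# eigenbasis phi_0(x) = 1, phi_i(x) = sqrt(2)*cos(i*pi*x) and a
# Matern-type decay lambda_i = sigma2 * (alpha + (i*pi)**2)**(-s).
rng = np.random.default_rng(0)
I, sigma2, alpha, s = 20, 1.0, 1.0, 1.0

lam = sigma2 * (alpha + (np.pi * np.arange(I + 1)) ** 2) ** (-s)
q = rng.normal(0.0, lam)      # coefficients u_i ~ N(0, lambda_i^2)
q /= np.linalg.norm(q)        # project onto the sphere: sum_i q_i^2 = 1

def phi(x):
    # Orthonormal cosine basis evaluated at the points x
    basis = np.sqrt(2.0) * np.cos(np.pi * np.outer(np.arange(I + 1), x))
    basis[0] = 1.0
    return basis

x = np.linspace(0.0, 1.0, 1001)
p = (q @ phi(x)) ** 2          # nonnegative density draw p = q^2

# By orthonormality of the basis, p integrates to one
# (up to discretization error of the trapezoid rule below)
dx = x[1] - x[0]
integral = np.sum((p[:-1] + p[1:]) / 2) * dx
print(integral)
```

Inference then amounts to sampling the finite coefficient vector q on the sphere, which is exactly the setting of the spherical HMC scheme discussed in Section 4.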

∞ since q is the square-root density. This prior can also be X ind 2 u(·) = ui φi(·), ui ∼ N(0, λi ), (3) interpreted as arising from an infinite-dimensional Bing- i=1 ham distribution on the coefficients (Dryden, 2005). The posterior distribution on q is then given by where the λis and the φis are respectively the eigenvalues N and eigenfunctions of operator K. That is to say, they π(x|q) π(q) Y π(q|x) = ∝ π(q) q2(x ) . satisfy R π(x|q) π(q) dq n Q n=1 Z 0 0 0 K(φi)(x ) = k(x, x )φi(x)µ(dx) = λiφi(x ) Suppressing the Dirac measure, the log-posterior given data x1:N may be written in terms of the K-L expansion where k(·, ·) is the usual covariance kernel. The eigen- of q: values are decreasing and their sum-of-squares is finite: N ∞ X 1 X λ < λ , P∞ λ2 < ∞. Finally, the eigenfunctions log π(q|x) ∝ log q(x )2 − q2/λ2 i+1 i i=1 i n 2 i i form an orthonormal basis of L2: n=1 i=1 N ∞ Z Z X 1 X 2 = 2 log |q(x )| − q2/λ2 φi(x)φj(x)µ(dx) = 0, and φi (x)µ(dx) = 1 . n 2 i i n=1 i=1 N ∞ ∞ In this paper, we model q as belonging to the Matern´ X X 1 X = 2 log | q φ (x )| − q2/λ2 . class of GPs. For the Matern´ class, a closed-form or- i i n 2 i i thonormal basis may be obtained from the eigenfunc- n=1 i=1 i=1 tions of the Laplacian (Chung, 2013; Beskos et al., By modelling the square-root density q with a GP prior, 2016). The covariance operator is given by we model the density function p with a χ2-process prior. Modeling the density p as a χ2-process, we automati- K = σ2(α − ∆)−s , (4) cally enforce the non-negativity requirement for proba- bility density functions. On the other hand, χ2-processes 2 where α and σ are positively constrained scale param- are not restricted to have unit integrals. We therefore rely eters, s is a smoothness parameter, and ∆ is the Lapla- on a geometric HMC inference scheme to restrict pro- Pd 2 2 cian i=1 ∂i . The eigenvalues and eigenfunctions cor- posals to the L sphere. 
This is discussed in details in responding to this covariance operator depend on the Section 4. area and dimensionality of domain D and are presented in Section 6 below. It should be noted that the decision to 4 INFERENCE use the Matern´ class is entirely dictated by ease of com- putation and does not preclude other classes of GP from Inference for the χ2-process density model is relatively being used in future applications. straightforward and amenable to advanced HMC meth- ods. In Section 4.1, we show that, in this context, infinite- 3.2 THE MODEL dimensional spherical HMC is equivalent to Riemannian HMC using the parametric Fisher information. In prac- The proposed density model is Bayesian nonparametric, tice, we follow Beskos et al. (2016) and truncate the K-L i.e. we place a prior distribution on a set of functions expansion of the GP square-root density prior for an in- and eschew a restrictive parametric form. Given data teger I using truncation operator TI : x = (x1, ··· , xN ) ∈ D, we obtain a posterior distri- bution, which is itself a distribution over the same set ∞ I   X  X of functions and is absolutely continuous with respect to TI q(x) = TI qi φi(x) = qi φi(x) . the specified prior distribution. As stated above, the prior i=0 i=0 Due to the orthonormality of the basis φi, the unit inte- chain states. Since these flows are formally equivalent 2 2 gral constraint on TI (q) translates directly to a spherical to the geodesic flows on the L sphere (see Section 2) I 2 constraint on the random coefficients q = (q0, ··· , qI ). and since the natural geometry on L is equivalent to That is, the nonparametric Fisher geometry, it is worth asking whether these inference schemes are adapted to the non- Z Z I 2 2  X  1 = TI q(x) µ(dx) = qi φi(x) µ(dx) parametric Fisher geometry in a similar way to Rieman- D D i=0 nian HMC’s adaptation to the parametric Fisher geome- I Z I try. 
X 2 2 X 2 = qi φi(x) µ(dx) = qi Indeed this is the case, and it is a simple consequence of i=0 i=0 Proposition 1 in the supplementary file and the isomet- where the penultimate equality is given by the orthog- ric relationship between square-integrable functions and onality of the basis elements and the last equality is on square-summable sequences induced by any orthonor- ∞ 2 account of the basis elements being normal. Thus, infer- mal basis {φi}i=1 with completion L . Denote the space ence can be performed over the coefficients qI by us- of square-summable sequences and its sphere ing spherical HMC (Lan, Zhou, and Shahbaba, 2014) I ( ∞ ) on the sphere S . Both of these methods augment the 2 ∞ X 2 state space with an auxiliary velocity variable v (satisfy- ` = q = {qi}i=1 hq, qi`2 = qi < ∞ , ing vT qI = 0) and simulate from a Hamiltonian system i=1 ( ∞ ) by splitting (Shahbaba et al., 2014) the Hamiltonian of ∞ 2 X 2 S = q ∈ ` hq, qi 2 = q = 1 . interest (H) into two Hamiltonians (H1 + H2): ` i i=1 1 1 H(qI , v) = − log π(qI ) + G(qI ) + vT v ∞ 2 2 Then it follows from the orthonormality of {φi}i=1 that 2 ∼ 2 1 (L , h·, ·iL2 ) = (` , h·, ·i`2 ), since for any arbitrary func- H1(qI , v) = − log π(qI ) + G(qI ) 2 2 tion q = q(·) ∈ L , 2 I 1 T Z Z ∞ H (q , v) = v v . 2 X 2 2 hq, qiL2 = q(x) µ(dx) = qiφi(x) µ(dx) Here π is the posterior distribution and G is the canoni- i=1 ∞ cal Riemann tensor for the sphere (Lan, Zhou, and Shah- X 2 baba, 2014). Simulating from H1 involves a small per- = qi = hq, qi`2 . turbation of the velocity by the gradient of H1 with re- i=1 I 2 spect to q ; simulating H involves moving along the It is an immediate result that the respective spheres are ∼ ∞ sphere’s geodesics in the direction v. This last fact is also isometric, i.e. (Q, h·, ·iL2 ) = (S , h·, ·i`2 ), and relevant to the discussion of the following section. hence, by Proposition 1, the following result holds. The computational bottlenecks for both HMC and spher- Lemma 1. 
Given an orthonormal basis for L2, the space ical HMC are the likelihood evaluations (within the of density functions equipped with the Fisher metric is accept-reject step) and the gradient evaluations (within isometric to the sphere S∞ with its natural Euclidean ∼ ∞ the discretized trajectory). For our model, both likeli- metric, i.e. (P, gF (·, ·)) = (S , h·, ·i`2 ). (See the sup- hood and gradient evaluations require a single summa- plementary file for the proof.) tion over N terms, each a simple function of the N ob- servations individually. Thus, the complexity is linear Our goal is to show that spherical HMC can be adapted in the number of data points (O(N)). This is orders to the nonparametric Fisher geometry in the infinite- faster than the O(N 3) computations required to perform dimensional limit. Given that the geodesic paths fol- ∞ inference for the GPDS (Murray, MacKay, and Adams, lowed by spherical HMC converge to geodesics on S , 2009). However, in a big data setting, even linear com- Lemma 35 will imply that these paths correspond to plexity might prove too costly. In such case, we recom- geodesics on (P, gF (·, ·)). mend performing these summations using a binary re- Lemma 2. Geodesic flows on the finite sphere SI−1 duction on a GPU with O(log2(N)) complexity (Hol- converge to geodesic flows on the infinite-dimensional brook et al., 2020). sphere S∞ as I → ∞. (See the supplementary file for the proof.) 4.1 INFERENCE IN THE LIMIT We are now ready to connect Riemannian HMC and We note that both spherical HMC uses geodesic flows spherical HMC in the infinite-dimensional limit (where on the finite dimensional sphere to propose new Markov the latter is applied to the square-root density estimation problem). 
To make this relationship as clear as possi- 5 RELATIONSHIP TO THE COX ble, we introduce a different (but equivalent) definition PROCESS of a geodesic based on the calculus of variations (in con- trast to the null acceleration definition from Lemma 1). The χ2-process density prior may be used to model the Assume that two points A and B are close together in intensity function of a Cox process (Cox, 1955). The a small open set of Riemannian manifold (M, g(·, ·)). Cox process is a point process over a given domain such Let Γ:[a, b] × (−, ) → M be a family of curves that each realization at point t is drawn from a Poisson γs :[a, b] → M satisfying γs(a) = A and γs(b) = B distribution with intensity µ(s), where intensity function for all s ∈ (−, ). Then γ is a geodesic if it minimizes µ(·) is itself a random process over the same given do- the energy functional main. Cox processes are useful for the analysis of spatial Z b and time series data. Given µ(·), the likelihood of such 1  N E(γ) = gγ(t) γ˙ (t), γ˙ (t) dt , data {sn}n=1 is given by 2 a  Z  N d E(γ ) = 0 N  Y and thus satisfies ds s . p {sn}n=1|µ(·) = exp − µ(s) ds × µ(sn) . D For a parametric family of distributions P equipped n=1 θ (5) with the Fisher metric, the parametric Fisher energy takes the form Bayesian inference on µ(·) requires the calculation of 1 Z b two integrals, that over the parameter space and that from E(θ) = g θ˙(t), θ˙(t) dt 2 θ(t) F Equation (5). We make the latter integral trivial by mod- a eling the intensity function as the product of a density Z b 1 T −1 function and a positively constrained random variable: = ∇θ`(θ(t)) I(θ(t)) ∇θ`(θ(t)) dt , 2 a µ(s) = M × p(s) = M × q(s)2 . where I(θ) is the Fisher information, and `(θ) = log p(θ). 
On the other hand by Lemmas 1 and 5, the N  In this case, the likelihood p {sn}n=1|µ(·) may be nonparametric Fisher energy for a family of curves in P written as takes the form N b  Z  1 Z 2 Y 2 E(p) = g p˙(t), p˙(t) dt exp − Mq(s) ds × Mq(sn) , p(t) F D 2 a n=1 1 Z b = hq˙(t), q˙(t)iL2 dt which is equal to 2 a 1 Z b N N Y 2 = hq˙(t), q˙(t)i`2 dt exp(−M) M q(sn) . 2 a n=1 √ where q = p = P∞ q φ (·). i=1 i i Since the likelihood factors in M and q(·), it follows that Theorem 1. Let q(·) = pp(·) ∈ Q be a square-root the two random variables will be independent in poste- density function with expansion satisfying rior distribution if they are specified to be independent in prior distribution. Indeed, M may even be given a ∞ X conjugate prior: it is easy to see that q(·) = qiφi(·) , and i=1 ∞ M ∼ Γ(a, b) , implies M|N ∼ Γ(a + N, b + 1) . Z X 1 = q(x)2 µ(dx) = q2 , i Sampling from the joint posterior of µ(·) is as simple as D i=1 independently sampling M from its posterior and q2(·) 2 with random, real-valued coefficients qi, i = 1,..., ∞. from the χ -process density sampler and then multiply- Then, in the infinite-dimensional limit, spherical HMC ing the two together. Such a model should be used with follows the nonparametric Fisher metric’s geodesic flows care. As a function of the data, the posterior distribution in the same way that Riemannian HMC follows the of M solely depends on N, which is itself a single real- Fisher metric’s geodesic flows over the parametric fam- ization from a Poisson distribution. Thus, our χ2-process ily of distributions Pθ. (See the supplementary file for density prior for the Cox process is useful in situations the proof). where ample prior information on M is available. 
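The conjugate Gamma update for M is easy to check by simulation; the following sketch uses illustrative hyperparameter values:

```python
import numpy as np

# Sketch of the conjugate Gamma update for the total mass M in the
# factorization mu(s) = M * q(s)^2. With a Gamma(a, b) prior (rate b),
# the posterior is M | N ~ Gamma(a + N, b + 1) and depends on the data
# only through the observed number of points N. Values are illustrative.
rng = np.random.default_rng(1)
a, b = 2.0, 1.0   # prior shape and rate
N = 37            # number of observed points

# NumPy parameterizes the Gamma by shape and scale = 1/rate
draws = rng.gamma(shape=a + N, scale=1.0 / (b + 1.0), size=200_000)
posterior_mean = draws.mean()
print(posterior_mean)  # close to (a + N) / (b + 1) = 19.5
```

Combined with a draw of q²(·) from the χ²-process density sampler, each draw of M yields a draw of the intensity µ(·) itself.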
[Figure: data scatter plots and estimated density curves; axis labels include "Dimension 2" and "Density values". Graphical content not recoverable from the text extraction.]
● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ● ● ●●●●●●● ● ● ● ●●● ● ● ● ●●● ● ●●● ● ● ●● ●● ● ●●●●● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ● ● ●● ● ● ● ●● ● ● ● ●● ●● ●● ● ● ●● ●● ●● ●● ● ● ● ● ●●●●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ●● ●● ●● ●● ●● ● ●●● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ●● ● ●● ● ●● ● ●● ● ● ● ●● ● 1 1 ● ● ● ● ● ● ●●●● ● ● ●● ●● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●●● ● ● ● ●● ● ●●●●●●●● ●●●● ● ●●● ●● ●●● ● ●● ● ● ●● ● ● ●●● ● ●● ● ●● ● ● ● ●● ● ●●●●●● ● ●●●● ●● ● ● ● ●●● ●●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●● ●● ● ●●● ●● ●●●●● ● ●● ● ● ● ●●● ● ● ● ● ●● ● ● ● ●● ●● ● ●● ●●● ●● ●● ●●● ●● ● ● ●●● ● ● ● ● ● ● ●● ●● ● ● ●●●● ●● ●●● ● ● Dimension 2 0.25 ●●● ● ●● ●● ● ● ●● 0.25 ● ●●● ●●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ●●●● ●● ● ● ●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ● ●● ● ● ● ● ● ●●● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● Density values ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●● ●● ● ● ●● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ●● ●● ● ● ● ●● ●● ●●● ● ● ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ●● ● ● ● ●● ● ● ●● ● ● ●● ●● ●●● ● ●● ●● ● ● ● ● ● ●●● ●● ●●● ●● ● ● ● ● ● ●● ● ●● ● ● ● ●●● ●● ●● ● ● ● ● ● ●●●● ● ● ● ● ●● ● ● ●● ● ● ● ● 0 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 0 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||| 0.00 ● ● ● ● ● 0.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Support Support Dimension 1 Dimension 1

Figure 1: Each plot shows 100 posterior draws from the χ²-process density sampler. 1,000 data samples were drawn from a different beta distribution for each plot. The generating pdf is given in red, and the red hash marks describe the actual data produced.

Figure 2: The contours (black) of the posterior median from 1,000 draws of the χ²-process density sampler. Each posterior is conditioned on 1,000 data points (red).

6 EMPIRICAL RESULTS

Here we apply the χ²-process density model to both simulated and real-world data. As stated in Section 3.1, the eigen-pairs corresponding to the GP with covariance operator (4) depend on both the dimension and the area of D. When D is the one-dimensional unit interval, the eigen-pairs are given by

λ_i = σ²(α² + π²i²)^{-s}, and φ_i(x) = √2 cos(π i x),

for i ≥ 0. For the two-dimensional unit square D = [0, 1] × [0, 1], the eigen-pairs are given by

λ_i = σ²(α² + π²(i₁² + i₂²))^{-s}, and φ_i(x) = 2 cos(π i₁ x₁) cos(π i₂ x₂),

for i₁, i₂ ≥ 0, where i₁ and i₂ are indices for the first and second dimensions of the domain, respectively. See Beskos et al. (2016) for a similar approach.
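To make the construction concrete, the sketch below computes the one-dimensional eigen-pairs above and generates a single prior draw. It assumes, in line with the square-root density construction, that a density draw is obtained by squaring a truncated Karhunen-Loève expansion of the GP and normalizing on a uniform grid; the function names and the grid-based normalization are ours, not the paper's.

```python
import numpy as np

def eigen_pairs_1d(I, sigma=0.5, alpha=0.5, s=0.8):
    """Eigen-pairs of the GP covariance operator on [0, 1], i = 0, ..., I - 1:
    lambda_i = sigma^2 (alpha^2 + pi^2 i^2)^(-s), phi_i(x) = sqrt(2) cos(pi i x)."""
    i = np.arange(I)
    lam = sigma**2 * (alpha**2 + np.pi**2 * i**2) ** (-s)
    phi = lambda x: np.sqrt(2.0) * np.cos(np.pi * np.outer(i, x))  # shape (I, len(x))
    return lam, phi

def sample_density(x, I=30, rng=None):
    """One (hypothetical) prior draw: square and normalize a truncated
    Karhunen-Loeve GP draw evaluated on a uniform grid x over [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    lam, phi = eigen_pairs_1d(I)
    z = rng.standard_normal(I)           # independent N(0, 1) KL coefficients
    g = (np.sqrt(lam) * z) @ phi(x)      # GP draw on the grid
    f = g**2                             # nonnegative, hence "chi-squared" process
    return f / f.mean()                  # on a uniform grid over [0, 1], mean = integral
```

Dividing by the grid mean makes the draw integrate to (approximately) one over [0, 1], since the grid is uniform on an interval of unit length.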

6.1 SIMULATED EXPERIMENTS

Figure 1 depicts 1,000 data points (red hash marks) drawn from four different beta distributions (red) along with 100 MCMC draws from the posterior distribution based on the χ²-process density model. From left to right and top to bottom, the beta distribution parameters are (1, 1), (5, 2), (.5, .5), and (2, 2). Note that while the individual posterior draws adhere closely to the sampled data, the variability in the posterior draws accounts for uncertainty and gives good coverage to the true density. The hyperparameter settings for the top-left plot are given by (σ, α, s) = (.5, 1, 1), and (σ, α, s) = (.5, .5, .8) is the hyperparameter setting for the rest. We set I = 30 for each example, and 10,000 thinned MCMC iterations were used to make each figure.

Figure 2 depicts 1,000 data points (red) drawn from four different distributions on the unit square along with the contours of the pointwise median of 1,000 posterior draws from the χ²-process density model. The data in the first three plots were generated using truncated Gaussians and mixtures of truncated Gaussians. The data for the last plot were generated by adding Gaussian noise to the uniform distribution on the circle. The model adapts easily to multimodal and patterned data samples. For all examples, the hyperparameters were fixed to (σ, α, s) = (.9, .1, 1.1) with 0 ≤ i₁, i₂ ≤ 5.

6.2 REAL-WORLD EXPERIMENTS

Figure 3 features the British coal mine disaster data set, in which the dates of 191 disasters are recorded between the years of 1851 and 1967. In both plots, the dates are given in red. Two comparisons are implied by the figure. The first compares the variability of 100 posterior draws based on 191 data points (left plot) with the variability of 100 posterior draws based on 1,000 data points, as in Figure 1. One sees much less variability in the latter. The other comparison is between the close fit exhibited by the posterior draws in the left plot and the smooth fit shown by the pointwise quantiles (median, black; .25, blue; .75, blue). As we can see, our method is valid for modeling densities without periodic tendencies, despite the specific form of the basis. Both plots are based on 10,000 thinned MCMC iterations, with hyperparameter settings (σ, α, s) = (.5, .5, .8) and I = 30.

Figure 4 features Hutchings’ bramble canes data (red) (Hutchings, 1978), consisting of the locations of 823 bramble canes in a square plot. The left figure contains a heatmap of the pointwise posterior mean of the χ²-process density model, where black pertains to low density and white pertains to high density. Finally, a single contour (blue) at density level 0.3 divides the majority of points from areas of extremely low density. The hyperparameters were set to (σ, α, s) = (2, .01, 1.1) with 0 ≤ i₁, i₂ ≤ 5, and the posterior sample featured 10,000 MCMC iterations. The right figure features 823 draws from the posterior predictive distribution of the χ²-process density model. Each draw from the posterior predictive distribution was obtained by randomly selecting one posterior draw from the χ²-process density model. Since this single posterior draw is itself a density function, one can then sample from its corresponding distribution using a rejection sampling scheme. There is a remarkable similarity between the posterior predictive sample (right, black) and the bramble canes data (left, red).
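The rejection step used for the posterior predictive draws can be sketched in a few lines. This is a generic rejection sampler with a Uniform(0, 1) proposal, our simplifying choice; the caller supplies any upper bound M on the density over its support.

```python
import numpy as np

def rejection_sample(pdf, n, M, rng=None):
    """Draw n points from a density on [0, 1] by rejection sampling
    with a Uniform(0, 1) proposal; M must upper-bound the density."""
    rng = np.random.default_rng() if rng is None else rng
    draws = []
    while len(draws) < n:
        x = rng.uniform(size=n)          # proposals
        u = rng.uniform(size=n)          # acceptance variables
        draws.extend(x[u * M < pdf(x)])  # keep x with probability pdf(x) / M
    return np.asarray(draws[:n])
```

For example, `rejection_sample(lambda x: 2 * x, 823, 2.0)` draws 823 points from the toy density f(x) = 2x; in the χ²-process setting, `pdf` would instead be a single posterior density draw evaluated pointwise.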

[Figures 3 and 4 about here: density values by Year (Figure 3); pointwise posterior mean and posterior predictive sample over Dimension 1 and Dimension 2 (Figure 4).]

Figure 3: Coal mining disasters data: the left figure shows 100 posterior draws from the χ²-process density model (gray) over 191 vertical lines (red) marking the precise date of each disaster. The right figure shows the pointwise median (black) for the same sample as well as pointwise quantile bands (blue).

Figure 4: Hutchings’ bramble canes data: the first figure depicts the 823 bramble canes (red), a heatmap of the pointwise posterior mean (black is low, white is high), and a single contour at density 0.3 (blue) including all but a few points. The second figure shows 823 draws from the χ²-process density posterior predictive distribution.

7 DISCUSSION

We have presented a nonparametric extension to the parametric Fisher geometry and shown that this generalization is consistent with its parametric predecessor. To do so, the set of probability density functions over a given domain was defined to be an infinite-dimensional smooth manifold where each point is itself a density function. This manifold becomes a Riemannian manifold when equipped with the nonparametric Fisher information metric and is then identified with the infinite-dimensional sphere. We demonstrated one application of this approach in the form of Bayesian nonparametric density estimation. The resulting χ²-process density model is flexible and computationally efficient: it is amenable to HMC and, in comparison to the cubic scaling of GP competitors, scales linearly in the number of data points. Of course, there is nothing a priori restricting the prior to be Gaussian. Also, an important next step is placing a prior on the number of basis functions to use, as is done in (Cotter et al., 2013).

The theoretical and methodological results presented in this paper are merely first steps in exploiting the simple geometry implied by the nonparametric Fisher metric.

Acknowledgement

This work is supported by NSF grant DMS 1622490 and NIH grant R01 MH115697.

References

Gourieroux, Christian and Alain Monfort (1995). Statistics and econometric models. Vol. 1. Cambridge University Press.
Jeffreys, Harold (1946). “An invariant form for the prior probability in estimation problems”. In: Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences. Vol. 186. 1007. The Royal Society, pp. 453–461.
Le Cam, Lucien (2012). Asymptotic methods in statistical decision theory. Springer Science & Business Media.
Fisher, Ronald Aylmer (1925). “Theory of statistical estimation”. In: Mathematical Proceedings of the Cambridge Philosophical Society. Vol. 22. 05. Cambridge Univ Press, pp. 700–725.
Rao, C Radhakrishna (1945). “Information and accuracy attainable in the estimation of statistical parameters”. In: Bull. Calcutta Math. Soc. 37.3, pp. 81–91.
Amari, Shun-ichi and Hiroshi Nagaoka (2007). Methods of information geometry. Vol. 191. American Mathematical Soc.
Srivastava, Anuj, Ian Jermyn, and Shantanu Joshi (2007). “Riemannian analysis of probability density functions with applications in vision”. In: Computer Vision and Pattern Recognition, 2007. CVPR ’07. IEEE Conference on. IEEE, pp. 1–8.
Chen, Tian, Jeffrey Streets, and Babak Shahbaba (2015). “A Geometric View of Posterior Approximation”. In: arXiv preprint arXiv:1510.00861.
Itoh, Mitsuhiro and Hiroyasu Satoh (2015). “Geometry of Fisher information metric and the barycenter map”. In: Entropy 17.4, pp. 1814–1849.
Kurtek, Sebastian and Karthik Bharath (2015). “Bayesian sensitivity analysis with the Fisher–Rao metric”. In: Biometrika 102.3, pp. 601–616.
Srivastava, Anuj and Eric P Klassen (2016). Functional and shape data analysis. Springer.
Peter, Adrian M, Anand Rangarajan, and Mark Moyou (2017). “The Geometry of Orthogonal-Series, Square-Root Density Estimators: Applications in Computer Vision and Model Selection”. In: Computational Information Geometry. Springer, pp. 175–215.
Pinheiro, Aluisio and Brani Vidakovic (1997). “Estimating the square root of a density via compactly supported wavelets”. In: Computational Statistics & Data Analysis 25.4, pp. 399–415.
Müller, Peter and Brani Vidakovic (1998). “Bayesian inference with wavelets: Density estimation”. In: Journal of Computational and Graphical Statistics 7.4, pp. 456–468.
Hong, Xia and Junbin Gao (2016). “A Fast Algorithm to Estimate the Square Root of Probability Density Function”. In: Research and Development in Intelligent Systems XXXIII: Incorporating Applications and Innovations in Intelligent Systems XXIV 33. Springer, pp. 165–176.
Williams, Christopher KI and Carl Edward Rasmussen (1996). “Gaussian processes for regression”. In: Advances in Neural Information Processing Systems, pp. 514–520.
Wang, Limin (2008). “Karhunen–Loève expansions and their applications”. PhD thesis. London School of Economics and Political Science (United Kingdom).
Dashti, Masoumeh and Andrew M Stuart (2013). “The Bayesian approach to inverse problems”. In: arXiv preprint arXiv:1302.6989.
Cotter, Simon L, Gareth O Roberts, Andrew M Stuart, David White, et al. (2013). “MCMC methods for functions: modifying old algorithms to make them faster”. In: Statistical Science 28.3, pp. 424–446.
Beskos, Alexandros, Mark Girolami, Shiwei Lan, Patrick E Farrell, and Andrew M Stuart (2016). “Geometric MCMC for infinite-dimensional inverse problems”. In: Journal of Computational Physics.
Lan, Shiwei, Bo Zhou, and Babak Shahbaba (2014). “Spherical Hamiltonian Monte Carlo for constrained target distributions”. In: JMLR Workshop and Conference Proceedings. Vol. 32. NIH Public Access, p. 629.
Murray, Iain, David MacKay, and Ryan P Adams (2009). “The Gaussian process density sampler”. In: Advances in Neural Information Processing Systems, pp. 9–16.
Rabier, Charles-Elie and Alan Genz (2014). “The supremum of Chi-Square processes”. In: Methodology and Computing in Applied Probability 16.3, pp. 715–729.
Gunter, Tom, Michael A Osborne, Roman Garnett, Philipp Hennig, and Stephen J Roberts (2014). “Sampling for inference in probabilistic models with fast Bayesian quadrature”. In: Advances in Neural Information Processing Systems, pp. 2789–2797.
Antoniak, Charles E (1974). “Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems”. In: The Annals of Statistics, pp. 1152–1174.
Neal, Radford M (2000). “Markov chain sampling methods for Dirichlet process mixture models”. In: Journal of Computational and Graphical Statistics 9.2, pp. 249–265.
Byrne, Simon and Mark Girolami (2013). “Geodesic Monte Carlo on embedded manifolds”. In: Scandinavian Journal of Statistics 40.4, pp. 825–845.
Cox, David R (1955). “Some statistical methods connected with series of events”. In: Journal of the Royal Statistical Society. Series B (Methodological), pp. 129–164.
Longford, Nicholas (1987). “A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects”. In: ETS Research Report Series 1987.1.
Girolami, Mark and Ben Calderhead (2011). “Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73.2, pp. 123–214.
Efron, Bradley (1978). “The geometry of exponential families”. In: The Annals of Statistics 6.2, pp. 362–376.
Chung, Kai Lai (2013). Lectures from Markov processes to Brownian motion. Vol. 249. Springer Science & Business Media.
Dryden, Ian L et al. (2005). “Statistical analysis on high-dimensional spheres and shape spaces”. In: The Annals of Statistics 33.4, pp. 1643–1665.
Shahbaba, Babak, Shiwei Lan, Wesley O Johnson, and Radford M Neal (2014). “Split Hamiltonian Monte Carlo”. In: Statistics and Computing 24.3, pp. 339–349.
Holbrook, Andrew J., Philippe Lemey, et al. (2020). “Massive Parallelization Boosts Big Bayesian Multidimensional Scaling”. In: Journal of Computational and Graphical Statistics 0.0, pp. 1–14.
Hutchings, Michael J (1978). “Standing crop and pattern in pure stands of Mercurialis perennis and Rubus fruticosus in mixed deciduous woodland”. In: Oikos, pp. 351–357.