Integral Geometry, Hamiltonian Dynamics, and Markov Chain Monte Carlo

by Oren Mangoubi

B.S., Yale University (2011)

Submitted to the Department of Mathematics in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology

June 2016

© Oren Mangoubi, MMXVI. All rights reserved. The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author: Signature redacted, Department of Mathematics, April 28, 2016

Certified by: Signature redacted, Alan Edelman, Professor, Thesis Supervisor

Accepted by: Signature redacted, Jonathan Kelner, Chairman, Applied Mathematics Committee

Integral Geometry, Hamiltonian Dynamics, and Markov Chain Monte Carlo

by Oren Mangoubi

Submitted to the Department of Mathematics on April 28, 2016, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

This thesis presents applications of differential geometry and graph theory to the design and analysis of Markov chain Monte Carlo (MCMC) algorithms. MCMC algorithms are used to generate samples from an arbitrary probability density π in computationally demanding situations, since their mixing times need not grow exponentially with the dimension of π. However, if π has many modes, MCMC algorithms may still have very long mixing times. It is therefore crucial to understand and reduce MCMC mixing times, and there is currently a need for global mixing time bounds as well as algorithms that mix quickly for multi-modal densities.

In the Gibbs sampling MCMC algorithm, the variance in the size of the modes intersected by the algorithm's search-subspaces can grow exponentially in the dimension, greatly increasing the mixing time. We use integral geometry, together with the Hessian of π and the Chern-Gauss-Bonnet theorem, to correct these distortions and avoid this exponential increase in the mixing time. Towards this end, we prove a generalization of the classical Crofton's formula in integral geometry that can allow one to greatly reduce the variance of Crofton's formula without introducing a bias.

Hamiltonian Monte Carlo (HMC) algorithms are some of the most widely-used MCMC algorithms. We use the symplectic properties of Hamiltonians to prove global Cheeger-type lower bounds for the mixing times of HMC algorithms, including Riemannian Manifold HMC as well as No-U-Turn HMC, the workhorse of the popular Bayesian software package Stan. One consequence of our work is the impossibility of energy-conserving Hamiltonian Markov chains to search for far-apart sub-Gaussian modes in polynomial time. We then prove another generalization of Crofton's formula that applies to Hamiltonian trajectories, and use our generalized Crofton formula to improve the convergence speed of HMC-based integration on manifolds.

We also present a generalization of the Hopf fibration acting on arbitrary-β ghost-valued random variables. For β = 4, the geometry of the Hopf fibration is encoded by the quaternions; we investigate the extent to which the elegant properties of this encoding are preserved when one replaces quaternions with general β > 0 ghosts.

Thesis Supervisor: Alan Edelman
Title: Professor

Acknowledgments

I am very grateful to my advisor and coauthor Alan Edelman [1] for his guidance and collaboration on this thesis. I am also deeply grateful to my coauthor Natesh Pillai [2] for his collaboration and advice on the Hamiltonian mixing times chapter of this thesis. I could not have finished this thesis without their insights. I am deeply thankful as well for indispensable advice and insights from Aaron Smith [3], Youssef Marzouk [4], Michael Betancourt [5], Jonathan Kelner [1], Michael La Croix [1], Jiahao Chen [1], Laurent Demanet [1], Dennis Amelunxen [6], Ofer Zeitouni [7, 8], Neil Shephard [2], and Nawaf Bou-Rabee [9]. I would also like to thank my mentors and previous coauthors Stephen Morse [10], Yakar Kannai [7], Edwin Marengo [11], and Lucio Frydman [12]. I am very grateful to my other mentors and professors at MIT and Yale, especially Roger Howe [13], Gregory Margulis [13], Victor Chernozhukov [14], Kumpati Narendra [10], Andrew Barron [15], Ivan Marcus [16], Paulo Lozano [4], and Manuel Martinez-Sanchez [4]. For valuable opportunities to learn, teach and conduct research, I would like to thank the MIT Mathematics Department and the Theory of Computation group at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), as well as the Yale Mathematics and Electrical Engineering departments, the Weizmann Institute Mathematics and Chemical Physics departments, and the Northeastern Electrical Engineering department.

I am very thankful to have been blessed with a kind and loving family for whose encouragement and support I am forever grateful: my mother and father, my brothers Tomer and Daniel, and, most importantly, my grandparents Mémé, Oma and Opa, as well as Pépé (of blessed memory). I am also very thankful to my friends for their kindness and companionship. My schoolteachers at Schechter and Gann, especially my Mathematics teacher Mrs. Voolich and my Science teacher Mrs. Schreiber, have been an inspiration to me as well. I also thank the MITxplore program for giving me the opportunity to design and teach weekly Mathematics enrichment classes for children in Cambridge and Boston public schools.

I deeply appreciate the generous support of a National Defense Science and Engineering Graduate (NDSEG) Fellowship, as well as support from the National Science Foundation (NSF DMS-1312831) and the MIT Mathematics Department.

Affiliations: [1] MIT Mathematics Department; [2] Harvard Statistics Department; [3] University of Ottawa Mathematics and Statistics Department; [4] MIT Department of Aeronautics and Astronautics; [5] University of Warwick Statistics Department; [6] City University of Hong Kong Mathematics Department; [7] Weizmann Institute of Science Mathematics Department; [8] Courant Institute of Mathematical Sciences at NYU; [9] Rutgers Mathematical Sciences Department; [10] Yale Electrical Engineering Department; [11] Northeastern University Electrical Engineering Department; [12] Weizmann Institute of Science Chemical Physics Department; [13] Yale Mathematics Department; [14] MIT Economics Department; [15] Yale Statistics Department; [16] Yale History Department.

Thesis Committee:

" Professor Alan Edelman Thesis Committee Chairman and Thesis Advisor Professor of Applied Mathematics, MIT

" Professor Natesh Pillai Associate Professor of Statistics, Harvard

* Professor Youssef Marzouk Associate Professor of Aeronautics and Astronautics, MIT

" Professor Jonathan Kelner Associate Professor of Applied Mathematics, MIT

Contents

1 Introduction
  1.1 Some widely-used MCMC algorithms
    1.1.1 Random Walk Metropolis
    1.1.2 Gibbs sampling algorithms
    1.1.3 Hamiltonian Monte Carlo algorithms
  1.2 Integral & differential geometry preliminaries
    1.2.1 Kinematic measure
    1.2.2 The Crofton formula
    1.2.3 Concentration of measure
    1.2.4 The Chern-Gauss-Bonnet theorem
  1.3 Contributions of this thesis

2 Integral Geometry for Gibbs Samplers
  2.1 Introduction
  2.2 A first-order reweighting via the Crofton formula
    2.2.1 The Crofton formula Gibbs sampler
    2.2.2 Traditional weights vs. integral geometry weights
  2.3 A generalized Crofton formula
    2.3.1 The generalized Crofton formula Gibbs sampler
    2.3.2 The … densities
    2.3.3 An MCMC …
    2.3.5 Higher-order …
    2.3.6 Collection-of-spheres example and orientation …
    2.3.7 Variance due to …
    2.3.8 Theoretical bounds derived using … concentration …
  2.4 Random matrix application: Sampling the stochastic …
    2.4.1 Approximate sampling algorithm …
  2.5 Conditioning on multiple eigenvalues
  2.6 Conditioning on a single eigenvalue …

3 Mixing Times of Hamiltonian Monte Carlo
  3.1 Introduction
  3.2 Hamiltonian …
  3.3 Cheeger bounds …
  3.4 Random walk …

4 A Generalization of Crofton's Formula to Hamiltonian … with Applications to … Monte Carlo
  4.2 Crofton formulae for Hamiltonian dynamics
  4.3 Manifold integration using HMC and the Hamiltonian Crofton formula

5 A Hopf Fibration for β-Ghost Gaussians
  5.1 Introduction
  5.2 Defining the …
  5.3 Hopf Fibration on …

Chapter 1

Introduction

Applications of sampling from probability distributions, defined on Euclidean space or on other manifolds, arise in many fields, such as Statistics [18, 6, 24], Machine Learning [4], Statistical Mechanics [39], General Relativity [16], Molecular Biology [15], Linguistics [5], and Genetics [31]. In many cases these probability distributions are difficult to sample from with straightforward methods such as rejection sampling, because the events we are conditioning on are very rare, or the probability density concentrates in some small regions of space. Typically, the complexity of sampling from these distributions grows exponentially with the dimension of the space. In such situations, we require alternative sampling methods whose complexity need not grow exponentially with dimension. In Markov chain Monte Carlo (MCMC) algorithms, one of the most commonly used such methods, we run a Markov chain that converges to the desired probability distribution [12]. MCMC algorithms are used to generate samples from an arbitrary probability density π in computationally demanding situations such as high-dimensional Bayesian statistics [60], machine learning [4], and molecular biology [15], since their mixing times need not grow exponentially with the dimension of π. However, if π has many modes, MCMC algorithms may still have very long mixing times [35, 48, 28]. It is therefore crucial to understand and reduce MCMC mixing times, and there is currently a need for global mixing time bounds as well as algorithms that mix quickly for multi-modal densities.

1.1 Some widely-used MCMC algorithms

In this section we review some widely-used MCMC algorithms.

1.1.1 Random Walk Metropolis

The Random Walk Metropolis (RWM) algorithm (Algorithm 1) is the most basic MCMC algorithm. At each step of the Markov chain, the RWM algorithm proposes to take the next step x'_{i+1} in a random direction and distance from the current position x_i. The step is accepted with probability min{π(x'_{i+1})/π(x_i), 1}. If the step is rejected, the algorithm stays at its current position until the next time step.

Algorithm 1 Random Walk Metropolis [45]
input: ε > 0, x_0, oracle for π : R^n → [0, ∞)
output: x_1, x_2, ..., with stationary distribution π
1: for i = 1, 2, ... do
2:   Sample independent h ~ N(0, ε)^n
3:   set x'_{i+1} = x_i + h
4:   set x_{i+1} = x'_{i+1} with probability min{π(x'_{i+1})/π(x_i), 1}; else, set x_{i+1} = x_i
5: end for

Although the RWM algorithm is widely-used in practice due to its simplicity [7], its mixing time slows down quadratically with a decrease in the step size since its associated "random walk" behavior approximates a diffusion. We will discuss this slowdown for the RWM algorithm further in Chapter 3.
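The sketch below is a minimal Python rendering of Algorithm 1; the target density pi, step size eps, and chain length are illustrative assumptions, not values used in the thesis.

    import numpy as np

    def rwm(pi, x0, eps, n_steps, rng=np.random.default_rng(0)):
        x = np.asarray(x0, dtype=float)
        chain = [x.copy()]
        for _ in range(n_steps):
            prop = x + eps * rng.standard_normal(x.shape)   # steps 2-3: Gaussian proposal
            if rng.random() < min(1.0, pi(prop) / pi(x)):    # step 4: accept with prob min{pi(x')/pi(x), 1}
                x = prop
            chain.append(x.copy())                            # if rejected, stay at x
        return np.array(chain)

    # Example: sample a standard Gaussian density in two dimensions.
    samples = rwm(lambda x: np.exp(-0.5 * np.dot(x, x)), x0=np.zeros(2), eps=0.5, n_steps=1000)

Shrinking eps in this sketch makes successive samples nearly identical, which is the diffusive slowdown described above.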

1.1.2 Gibbs sampling algorithms

Gibbs sampling MCMC algorithms [23] offer one way of taking larger steps to avoid the quadratic slowdown associated with diffusion-like "random walk" behavior. Gibbs sampling algorithms work by sampling the next step x_{i+1} in the Markov chain from the probability density π conditioned on a random search subspace S. The point x_{i+1} may be sampled from S using a subroutine Markov chain or another sampling method. If a subroutine Markov chain is used, the Gibbs sampler is oftentimes referred to as the

10 "Metropolis-within-Gibbs" algorithm, where the term "Metropolis" is used loosely here to denote the subroutine Markov chain. The search subspace may be a line or a multi-dimensional plane passing through xi. In this thesis we will consider search subspaces with isotropic random orientation.

Algorithm 2 Gibbs Sampler (with isotropic random search subspaces) [23]
input: x_0, oracle for π : R^n → [0, ∞)
output: x_1, x_2, ..., with stationary distribution π
1: for i = 1, 2, ... do
2:   Sample a k-dimensional isotropic random search subspace S passing through x_i
3:   Sample x_{i+1} with probability proportional to the restriction of π(x) to S (using a subroutine Markov chain or another sampling method)
4: end for
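A minimal sketch of Algorithm 2 with k = 1 search subspaces (a "hit-and-run" style Gibbs step) follows; the inner random-walk sampler is a crude stand-in for the "subroutine Markov chain", and pi, the step scale, and the chain lengths are illustrative assumptions.

    import numpy as np

    def gibbs_isotropic(pi, x0, n_steps, inner_steps=50, scale=1.0, rng=np.random.default_rng(0)):
        x = np.asarray(x0, dtype=float)
        chain = [x.copy()]
        for _ in range(n_steps):
            u = rng.standard_normal(x.shape)
            u /= np.linalg.norm(u)                 # step 2: isotropic random direction through x_i
            t = 0.0                                 # step 3: sample along the line x + t*u, proportional to pi
            for _ in range(inner_steps):            # subroutine Markov chain restricted to the line
                t_prop = t + scale * rng.standard_normal()
                if rng.random() < min(1.0, pi(x + t_prop * u) / pi(x + t * u)):
                    t = t_prop
            x = x + t * u
            chain.append(x.copy())
        return np.array(chain)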

Algorithm 3 Gibbs Sampler (for π supported on a submanifold)
Algorithm 3 is identical to Algorithm 2, except for the following step:

3:   Sample x_{i+1} from π(x)/‖dS_x/dM_x‖ restricted to S ∩ M (using a subroutine Markov chain or another sampling method)

(Here ‖dS_x/dM_x‖ denotes the product of the singular values of the projection map from S onto M_x^⊥, the orthogonal complement of the tangent space of M at x; a small numerical sketch of this weight appears at the end of this subsection.) If the manifold can be mapped onto a sphere, it is sometimes simpler to bypass the primary Markov chain in the Gibbs sampler and sample the manifold directly by intersecting it with random subspaces moving according to the kinematic measure:

Algorithm 4 Great sphere sampler
Algorithm 4 is identical to Algorithms 2 and 3, except for the following steps:
input: oracle for π supported on a manifold M ⊂ S^n.

2:   Sample a search subspace S_i ⊂ S^n that is an isotropic random great sphere independent of x_i.

One problem with Gibbs sampling algorithms when sampling from distributions with multiple modes is that the orientation of the search subspace S can greatly distort the apparent size of a mode, slowing the algorithm. In Chapter 2 we use concentration of measure to quantify how much these distortions slow down the Gibbs sampling algorithm. We also show how one can use integral geometry to eliminate some of these distortions.
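The sketch below shows one way the weight in step 3 of Algorithm 3 could be computed numerically: as the product of the singular values of the projection map from the search subspace onto M_x^⊥. The orthonormal bases S_basis and N_basis, and the example geometry, are assumptions chosen for illustration only.

    import numpy as np

    def projection_weight(S_basis, N_basis):
        A = N_basis.T @ S_basis                  # matrix of the projection map from S onto M_x^perp
        s = np.linalg.svd(A, compute_uv=False)   # its singular values
        return np.prod(s[: N_basis.shape[1]])    # product of the k singular values

    # Example: a 2-dimensional search subspace and a codimension-1 manifold in R^3
    # whose normal space at x is the z-axis.
    rng = np.random.default_rng(0)
    S_basis, _ = np.linalg.qr(rng.standard_normal((3, 2)))
    normal = np.array([[0.0], [0.0], [1.0]])
    print(projection_weight(S_basis, normal))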

1.1.3 Hamiltonian Monte Carlo algorithms

Like Gibbs sampling algorithms, Hamiltonian Monte Carlo (HMC) algorithms [19] seek to avoid quadratic slowdowns associated with diffusion-like "random walk" behavior. They do so by simulating the trajectory of a Hamiltonian particle for some amount of time T, and then refreshing the momentum according to the momentum's Boltzmann distribution from statistical mechanics. Since the particle has momentum, it will tend to take large steps in the direction of the momentum, avoiding "random walk" behavior. Since Hamiltonian trajectories conserve energy, there is no need to reject any proposed steps. For this reason HMC algorithms work especially well in high dimensions, since concentration of the posterior measure π causes most other MCMC algorithms to either sample steps that are very close (leading to "random walk" behavior) or to propose steps that have low probability density and are thus rejected with high probability. In this section we review three commonly-used HMC algorithms (Figure 1-1). The first two of these HMC algorithms (Algorithms 5 and 6) form the workhorse of the popular Bayesian software package Stan [10]. All three algorithms generate a Markov chain step by integrating a Hamiltonian trajectory for a time T, and refreshing the momentum. In Chapter 3 we will show global lower bounds on the mixing times of a large class of HMC algorithms sampling from arbitrary posterior distributions, including Algorithms 5-7. Isotropic-Momentum HMC (Algorithm 5) [44] is the most basic HMC algorithm (Figure 1-1, top, entire solid+dotted trajectory). The No-U-Turn Sampler is a modification of Algorithm 5 which seeks to take longer steps by avoiding U-turns in the Hamiltonian trajectories. It does so by stopping the trajectory once any two velocity vectors on the trajectory path form an angle of more than 90° (Figure 1-1, top, solid trajectory).


Figure 1-1: The Isotropic-Momentum HMC trajectory (dashed and solid, top), No-U-Turn HMC trajectory (solid only, top), and Riemannian Manifold HMC trajectory (bottom). The Isotropic-Momentum and Riemannian Manifold trajectories evolve for a fixed time T, while the No-U-Turn trajectory stops once any two momentum vectors on the trajectory are orthogonal. The Isotropic-Momentum HMC and No-U-Turn HMC both have spherical Gaussian random initial momenta, while Riemannian Manifold HMC has a non-spherical Gaussian random initial momentum determined by the Hessian of -log(π) at t = 0. In Chapter 3, we will use boundaries such as ∂S to establish lower bounds on the HMC mixing time.

Algorithm 5 Isotropic-Momentum HMC (idealized symplectic integrator) [44]
input: q_0, oracle for π : R^n → [0, ∞)
output: q_1, q_2, ..., with stationary distribution π
define: H(q, p) := -log(π(q)) + (1/2) pᵀp.
1: for i = 1, 2, ... do
2:   Sample independent p_i ~ N(0, 1)^n
3:   Integrate the Hamiltonian trajectory (q(t), p(t)) with Hamiltonian H over the time interval [0, T] and initial conditions (p(0), q(0)) = (p_i, q_i)
4:   set q_{i+1} = q(T)
5: end for

Algorithm 6 Idealized No-U-Turn Sampler HMC (perfect symplectic integrator) [32]
Algorithm 6 is identical to Algorithm 5, except for step 3.

3:   Integrate the Hamiltonian trajectory (q(t), p(t)) over the time interval [0, T], with initial conditions (p(0), q(0)) = (p_i, q_i), where T is the minimum time such that the velocity vectors at two points on the trajectory path form an angle greater than 90°.

Riemannian Manifold HMC seeks to take longer steps by choosing initial momenta from a multivariate Gaussian distribution that agrees with the local geometry of the posterior density π (Figure 1-1, bottom):

Algorithm 7 Riemannian Manifold HMC (idealized symplectic integrator) [27, 25]
Algorithm 7 is identical to Algorithm 5, except for the following steps:

define: H(q, p) := -log(π(q)) + c_n - (1/2) log det(G(q)) + (1/2) pᵀ G(q) p, where G(q) is the non-degenerate Fisher information matrix of π at q, and c_n = (n/2) log(2π).
2:   Sample p_i ~ N(0, G^{-1}(q_i))
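The thesis states Algorithms 5-7 with an idealized symplectic integrator; the sketch below instead approximates the Hamiltonian trajectory of Algorithm 5 with the standard leapfrog discretization. The gradient grad_log_pi, the step size dt, and the trajectory time T are illustrative assumptions.

    import numpy as np

    def hmc(grad_log_pi, q0, T=1.0, dt=0.05, n_samples=100, rng=np.random.default_rng(0)):
        q = np.asarray(q0, dtype=float)
        samples = [q.copy()]
        n_leap = int(T / dt)
        for _ in range(n_samples):
            p = rng.standard_normal(q.shape)            # step 2: refresh momentum, p ~ N(0, I)
            q_new, p_new = q.copy(), p.copy()
            p_new += 0.5 * dt * grad_log_pi(q_new)      # leapfrog half step for the momentum
            for _ in range(n_leap):                     # step 3: integrate H(q, p) = -log pi(q) + p.p/2
                q_new += dt * p_new
                p_new += dt * grad_log_pi(q_new)
            p_new -= 0.5 * dt * grad_log_pi(q_new)
            # step 4 (idealized): energy is conserved, so the endpoint is taken as the next sample;
            # with a leapfrog integrator one would normally add a Metropolis correction here.
            q = q_new
            samples.append(q.copy())
        return np.array(samples)

The No-U-Turn variant (Algorithm 6) would replace the fixed n_leap loop with a loop that stops once the dot product of two velocity vectors along the path becomes negative, i.e. once they form an angle of more than 90 degrees.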

1.2 Integral & differential geometry preliminaries

In this section we review results from differential geometry, integral geometry, and concentration of measure that we will use extensively in this thesis.

1.2.1 Kinematic measure

Up until this point we have talked about random search-subspaces informally. This notion of randomness is formally referred to as the kinematic measure [53, 54]. The kinematic measure provides the right setting to state the Crofton formula. The kinematic measure, as the name suggests, is invariant under translations and rotations. The random subspace is said to be "moving according to the kinematic measure".

The kinematic measure is the formal way of discussing the following simple situation: we would like to take a random point p uniformly on the unit sphere or, say, inside a cube in R^n. First we consider the sphere. After choosing p we then choose an isotropically random plane of dimension d + 1 through the point p and the center of the sphere. In the case of the sphere, this is simply an isotropic random plane through the center of the sphere. On a cube there are some technical issues, but the basic idea of choosing a random point and an isotropic random orientation using that point as the origin persists. On the cube we would allow any orientation, not only those through a "center". The technical issues relate to the boundary effects of a finite cube or the lack of a concept of a uniform probability measure on an infinite space. In any case the spherical geometry is the natural computational setting because it is compact (if we insist on artificially compactifying R^n by conditioning on a compact subset, then either the boundary effects cause the different search-subspaces to vary greatly in volume, slowing the algorithm, or we must restrict ourselves to such a large subset of R^n that most of the search-subspaces don't pass through much of the region of interest). However, for the sake of completeness we introduce the kinematic measure for the Euclidean as well as the spherical constant-curvature space, because it is relevant in more theoretical applications.

In the spherical geometry case, we define the kinematic measure with respect to a fixed non-random subset S_fixed ⊂ S^n, usually a great subsphere, by the action of the Haar measure on the special orthogonal group SO(n + 1) on S_fixed. When generalizing to Euclidean geometry, we must be a bit more careful, because there is no uniform probability distribution on R^n. In the case where S_fixed has finite d-volume, we can circumvent these issues simply by choosing p to be a point in a Poisson point process. To generalize to planes, we may define the kinematic measure as a Poisson-like point process for our search-subspaces with a translationally and rotationally invariant distribution on all of R^n (the "points" here are the search-subspaces):

Definition 1. (Kinematic measure) Let K^n ∈ {S^n, R^n} be a constant-curvature space. Let S_fixed be a d-dimensional manifold that either has a finite d-volume (in R^n or S^n), or is a plane (in R^n only). Let H be the Haar measure on G. If S_fixed has finite d-volume we take G to be the group I_n of isometries of K^n. If S_fixed is a plane, we instead take G to be the quotient group I_n/I_d of the isometries on K^n with the isometries on S_fixed. Let N be the counting process such that

(i) E[N(A)] = (1/Vol_d(S_fixed)) × H(A),
(ii) N(A) and N(B) are independent for any disjoint Haar-measurable subsets A, B ⊂ G,

where we drop the 1/Vol_d(S_fixed) term if S_fixed is a plane. We define the kinematic measure with respect to S_fixed ⊂ K^n to be the action of the elements of N on S_fixed.

If we wish to actually sample from the kinematic measure for the infinite-measure space R^n in practice, we must restrict ourselves to some (almost surely) finite subset of the infinite kinematic measure "point" process. For instance, we could condition on those subspaces that intersect the manifold M that we wish to sample from.
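In the spherical case the kinematic measure is simply the orbit of a fixed great subsphere under Haar-random rotations, which is easy to simulate. The sketch below represents a fixed great d-subsphere of S^n by the (d+1)-dimensional linear subspace containing it and applies a Haar-random rotation from SO(n+1); the dimensions are illustrative assumptions.

    import numpy as np

    def haar_rotation(m, rng):
        Q, R = np.linalg.qr(rng.standard_normal((m, m)))
        Q = Q * np.sign(np.diag(R))       # sign fix so that Q is Haar distributed on O(m)
        if np.linalg.det(Q) < 0:          # flip one column to land in SO(m)
            Q[:, 0] = -Q[:, 0]
        return Q

    rng = np.random.default_rng(0)
    n, d = 4, 2
    S_fixed = np.eye(n + 1)[:, : d + 1]   # subspace spanning the fixed great d-sphere of S^n
    S_random = haar_rotation(n + 1, rng) @ S_fixed   # a kinematic-random great d-sphere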

Remark 1. There is in fact a third constant-curvature space, the constant negative-curvature hyperbolic space H^n (S^n has constant positive curvature and R^n constant zero curvature). Since the proof of Theorem 3 in Chapter 2 seems to rely only on the constant curvature of the space, we suspect that nearly identical versions of this proof and theorem probably apply to hyperbolic space as well. However, we do not investigate this further, as it is beyond the scope of this thesis.

1.2.2 The Crofton formula

In this section, we state the Crofton formula [17, 53, 54], which says that the volume of a manifold M is proportional to the average of the volumes of the intersection S ∩ M of M with a random search-subspace S moving according to the kinematic measure. Our first-order reweighting of the Gibbs sampler for submanifolds (Section 2.2), referred to as the "angle-independent" reweighting in the introduction of Chapter 2, is based on this formula. In Section 2.3, we will prove a generalization of this formula that will allow for higher-order reweightings. In Chapter 4, we will prove a generalization of the Crofton formula that applies to trajectories in Hamiltonian dynamics, including trajectories in the HMC algorithms. We will then apply our "Hamiltonian Crofton formula" to improve the convergence rate of the HMC algorithm when it is used to compute integrals on manifolds.

Lemma 1. (Crofton formula) [17, 53, 54]

Let M be a codimension-k submanifold of K^n, where K^n ∈ {S^n, R^n}. Let S be a random d-dimensional manifold in K^n of finite volume (or a random plane), moving according to the kinematic measure. Then there exists a constant c_{d,k,n,K} such that

    Vol_{n-k}(M) = c_{d,k,n,K} × E_S[Vol_{d-k}(S ∩ M)] / Vol_d(S),     (1.1)

where we set Vol_d(S) to 1 if S is a plane. In the spherical case the constant c_{d,k,n,S} can be written explicitly as a ratio of volumes of spheres; c_{d,k,n,R} is given in [53] and depends on whether Vol_d(S) is finite.
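A quick Monte Carlo check of Lemma 1 in its classical planar (Cauchy-Crofton) form is sketched below: the length of the unit circle is a constant times the expected number of its intersections with a kinematic-random line. Lines are parametrized by an angle and a signed offset p, restricted to lines hitting a disk of radius R_big; the truncation radius and sample size are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    R_big, n_lines = 5.0, 200_000
    # By rotational symmetry the intersection count does not depend on the line's angle,
    # so only the signed offset p of the line from the origin is sampled.
    p = rng.uniform(-R_big, R_big, n_lines)
    hits = 2.0 * (np.abs(p) < 1.0)            # a line at distance |p| < 1 meets the unit circle twice
    estimate = np.pi * R_big * hits.mean()    # Cauchy-Crofton: Length = pi * R_big * E[#intersections]
    print(estimate, 2 * np.pi)                # both are close to 2*pi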

1.2.3 Concentration of measure

The Concentration of Measure phenomenon ([38], [46]) is the idea that volume concentrates in certain regions of high-dimensional space. One well-known result says that all but an exponentially small (in n) volume of an (n-1)-sphere concentrates at a small distance from any single (n-2)-dimensional equator [38]. In Section 2.3.6 we will briefly go over some of our generalizations [43] of this concentration result to the kinematic measure, which say that most of the intersection volume of an n-sphere with kinematic-measure-distributed d-dimensional search-subspaces concentrates in a fraction of these search-subspaces that is exponentially small in d, causing the variance of these intersection volumes to grow exponentially as well. We will then use our concentration results for the kinematic measure to compare the convergence rates of the traditional Gibbs sampler to our curvature-reweighted Gibbs sampler for an example involving the sampling of a manifold M that is a collection of spheres.
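The equator concentration mentioned above is easy to see numerically: for a uniform point on the (n-1)-sphere, the distance to the equator {x_1 = 0} is typically of order 1/sqrt(n). The sketch below is an illustration only; the dimensions and sample size are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (10, 100, 1000, 10000):
        x = rng.standard_normal((20_000, n))
        x /= np.linalg.norm(x, axis=1, keepdims=True)    # uniform points on the unit (n-1)-sphere
        dist = np.abs(x[:, 0])                            # distance to the equator x_1 = 0
        print(n, dist.mean(), np.quantile(dist, 0.99))    # both shrink roughly like 1/sqrt(n)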

1.2.4 The Chern-Gauss-Bonnet theorem

The Gauss-Bonnet theorem [56] states that the integral of the Gauss curvature C of a 2-dimensional manifold M is proportional to its Euler characteristic χ(M):

    ∫_M C dA = 2π χ(M).     (1.2)

The Chern-Gauss-Bonnet theorem, a generalization of the Gauss-Bonnet theorem to arbitrary even-m-dimensional manifolds [13, 57], states that

    ∫_M Pf(Ω) dVol_m = (2π)^{m/2} χ(M),     (1.3)

where Ω is the curvature form of the Levi-Civita connection and Pf is the Pfaffian. The curvature form Ω is an intrinsic property of the manifold, i.e., it does not depend on the embedding. In the special case when M is a hypersurface, the curvature Pf(Ω) may be computed as the Jacobian determinant of the Gauss map at x [59, 61]. The Chern-Gauss-Bonnet theorem is usually viewed as a way of relating the curvature of the manifold with its Euler characteristic. In Section 2.3 we will instead interpret the Chern-Gauss-Bonnet theorem as a way of relating the volume form dVol_m to the curvature form Ω. This will come in useful since the curvature form does not change very quickly in sufficiently smooth manifolds, allowing us to get a good estimate for the volume of the manifold from its curvature form at a single point.
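Equation 1.2 can be checked numerically on a torus of radii R > r, whose Euler characteristic is 0; the standard formulas for its Gauss curvature and area element are used below, and the radii and grid size are illustrative assumptions.

    import numpy as np

    R, r, m = 2.0, 0.5, 400
    theta = np.linspace(0.0, 2 * np.pi, m, endpoint=False)
    dtheta = dphi = 2 * np.pi / m
    K = np.cos(theta) / (r * (R + r * np.cos(theta)))    # Gauss curvature along the tube angle theta
    dA = r * (R + r * np.cos(theta)) * dtheta * dphi     # area element (independent of the other angle)
    total = m * np.sum(K * dA)                            # summing over the second angle is a factor of m
    print(total)                                          # ~ 0 = 2*pi*chi(torus), as Gauss-Bonnet requires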

18 1.3 Contributions of this thesis

The contributions of this thesis are as follows:

" In Chapter 2 we show how Crofton formulae from integral geometry can be used to eliminate inefficiencies in Gibbs sampling MCMC algorithms, and prove that the transition kernels of the primary Gibbs sampling Markov chains remain unchanged after applying Crofton formulae. In doing so, we also prove a gen- eralization of Crofton's formula that allows for the use of generalized Gauss curvature to reduce the variance in Crofton's formula without introducing a bias. Some of our integral geometry results from Chapter 2 have since been used and further generalized by Amelunxen and Lotz in [2, 3].

" In Chapter 3, we use the symplectic volume-preserving properties of Hamil- tonian dynamics and Cheeger's inequality from graph theory to prove upper bounds for the spectral gap of Hamiltonian Monte Carlo algorithms for general posterior densities ir. Our results apply to the classical HMC algorithm, as well as Riemannian Manifold HMC and No-U-Turn HMC, the workhorse of the popular Bayesian software package Stan [10]. One consequence of our work is the impossibility of energy-conserving Hamiltonian Markov chains to search for far-apart sub-Gaussian modes in polynomial time.

" In Chapter 4, we prove a generalization of the Crofton formula that applies to Hamiltonian trajectories, and use our generalized Crofton formula to improve the convergence speed of HMC-based integration on submanifolds.

* In Chapter 5, we present a generalization of the Hopf fibration acting on arbitrary-f ghost-valued random variables. For # = 4, the geometry of the Hopf fibration is encoded by the quaternions; we investigate the extent to which the elegant properties of this encoding are preserved when one replaces quaternions with general # > 0 ghosts.

Chapter 2

Integral Geometry for Gibbs Samplers

2.1 Introduction

In this chapter, we consider Gibbs sampler MCMC algorithms. If the density we wish to sample from has many modes, or if the density has support on a submanifold of R^n, then severe inefficiencies can arise. The purpose of this chapter is to demonstrate that integral geometry can be used to eliminate many of these inefficiencies. To illustrate these inefficiencies and our proposed fix, we imagine we would like to sample uniformly from a manifold ℳ ⊂ R^{n+1} (as illustrated in dark blue in Figure 2-1). By uniformly, we can imagine that ℳ has finite volume, and the probability of being picked in a region is equal to the volume of that region. More generally, we can put a probability measure on ℳ and sample from that measure.

We consider algorithms that produce a sequence of points {x_1, x_2, ...} (yellow dots in Figure 2-1) with the property that x_{i+1} will be chosen somehow in an (isotropically generated) random plane S (red plane in Figure 2-1) centered at x_i. Further, the step from x_i to x_{i+1} is independent of all the previous steps (the Markov chain property). This situation is known as a Gibbs sampling Markov chain with isotropic random search-subspaces. For our purposes, we find it helpful to pick a sphere (light blue) of radius r that represents the length of the jump we will take upon stepping from x_i to x_{i+1}. Note that r is usually random. The sphere will be the natural setting to mathematically exploit the symmetries associated with isotropically distributed planes. Conditioning on the sphere, the plane S becomes a great circle Ŝ (red), and the manifold ℳ becomes a submanifold (blue) of the sphere. Assuming we take a step of length r, then necessarily x_{i+1} must be on the intersection (green dots in Figure 2-1; higher-dimensional submanifolds in more general situations) of the red great circle and the blue submanifold. For definitiveness, suppose our ambient space is R^{n+1} where n = 2, our blue manifold ℳ has codimension k = 1, and our search-subspaces have dimension k + 1. Our sphere now has dimension n and the great circle dimension k = 1. The intersections (green dots) of the great circle with the blue submanifold are 0-dimensional points.

We now turn to the specifics of how x_{i+1} may be chosen from the intersection of the red curve and the blue curve. Every green point is on the intersection of the blue manifold and the red circle. It is worth pondering the distinction between shallower angles of intersection and steeper angles. If we thicken the circle by a small constant thickness ε, we see that a point with a shallow angle has a larger intersection than a steep angle. Therefore points with shallow angles should be weighted more. Figure 2-2 illustrates that 1/sin(θ_i) is the proper weighting for an intersection angle of θ_i. We will argue that the distinction between shallower and steeper angles takes on a false sense of importance, and traditional algorithms may become unnecessarily inefficient accordingly. A traditional algorithm focuses on the specific red circle that happens to be generated by the algorithm and then gives more weight to intersection points with shallower angles. We propose that knowledge of the isotropic distribution of the red circle indicates that all angles may be given the same weight. Therefore, any algorithmic work that goes into weighting points unequally based on the angle of intersection is wasted work.

Specifically, as we will see in Section 2.2.2, 1/sin(θ_i) has infinite variance, due in part to the fact that 1/sin(θ_i) can become arbitrarily large for small enough θ_i. The algorithm must therefore search through a large fraction of the (green) intersection points before converging, because any one point could contain a significant portion of the conditional probability density, provided that its intersection angle is small enough. This causes the algorithm to sample the intersection points very slowly in situations where the dimension is large and there are typically exponentially many possible intersection points to sample from.

This chapter justifies the validity of the angle-independent approach through the mathematics of integral geometry [53, 54, 22, 30, 1], and the Crofton formula in particular, in Section 2.2. We should note that sampling all the intersection points with equal probability cannot work for just any choice of random search-subspace S. For instance, if the search-subspaces are chosen to be random longitudes on the 2-sphere, parts of M that have a nearly east-west orientation would be sampled frequently, but parts of M that have a nearly north-south orientation would almost never be sampled, introducing a statistical bias to the samples in favor of the east-west oriented samples. However, if S is chosen to be isotropically random, the random orientation of S favors neither the north-south nor the east-west parts of M, suggesting that we can sample the intersection points with equal probability in this situation without introducing a bias. Effectively, by sampling with equal probability weights and isotropic search-subspaces, we will use integral geometry to compute an analytical average of the weights, an average that we would otherwise compute numerically, thereby freeing up computational resources and speeding up the algorithm.

In Part II of this chapter, we perform a numerical implementation of an approximate version of the above algorithm in order to sample the eigenvalues of a random matrix conditioned on certain rare events involving other eigenvalues of this matrix. We obtain different histograms from these samples weighted according to both the traditional weights and the integral geometry weights (Figure 2-3; Figures 2-10 and 2-11 in Part II). We find that using integral geometry greatly reduces the variance of the weights. For instance, the integral geometry weights normalized by the median weight had a sample variance that was 3.6 × 10^5, 578, and 1879 times smaller than that of the traditional weights, respectively, for the top, middle, and bottom simulations of Figure 2-3.


Figure 2-1: In this example we wish to generate random samples on a codimension-k manifold ℳ ⊂ R^n (dark blue) with a Gibbs sampling Markov chain {x_1, x_2, ...} that uses isotropic random search-subspaces S (light red) centered at the most recent point x_i (k = 1, n = 3 in the figure). We will consider the sphere rS^{n-1} of an arbitrary radius r centered at x_i (light blue), allowing us to make use of the spherical symmetry in the distribution of the random search-subspace to improve the algorithm's convergence speed. S now becomes an isotropically distributed random great k-sphere Ŝ = S ∩ rS^{n-1} (dark red) that intersects a codimension-k submanifold M = ℳ ∩ rS^{n-1} of the sphere.

This reduction in variance allows us to get faster-converging (i.e., smoother for the same number of data points) and more accurate histograms in Figure 2-3. In fact, as we show in Section 2.2.2, the traditional weights have infinite variance due to their second-order heavy-tailed probability density, so the sample variance tends to increase greatly as more samples are taken. Because of the second-order heavy-tailed behavior in the weights, the smoother we desire the histogram to be, the greater the speed-up in the convergence time obtained by using the integral geometry weights in place of the traditional weights.

Figure 2-2: Conditional on the next sample point x_{i+1} lying a distance r from x_i, the algorithm must randomly choose x_{i+1} from a probability distribution on the intersection points (middle, green) of the manifold M with the isotropic random great circle Ŝ (red). If traditional Gibbs sampling is used, intersection points with a very small angle of intersection θ_i must be sampled with a much greater (unnormalized) probability 1/sin(θ_i) (right, top) than intersection points with a large angle (right, bottom). This greatly increases the variance in the sampling probabilities for different points and slows down the convergence of the method used to generate the next sample x_{i+1}. However, since Ŝ is isotropically distributed on rS^{n-1}, the symmetry of the isotropic distribution of Ŝ allows us to use the Crofton formula from integral geometry to analytically average out these unequal probability weights so that every intersection point has the same weight, freeing the algorithm from the time-consuming task of effectively computing this average numerically.

Remark 2. Since we are using an approximate truncated version of the full algorithm that is not completely asymptotically accurate, the integral geometry weights also cause an increase in asymptotic accuracy. The full MCMC algorithm should have perfect asymptotic accuracy, so we expect this increase in accuracy to become an

25 increase in convergence speed if we allow the Markov chain to mix for a longer amount of time.

For situations where the intersections are higher-dimensional submanifolds rather than individual points, we show in Section 2.3 that the angle-independent approach generalizes to a curvature-dependent approach. We stress that traditional algorithms condition only on the plane that was actually generated while ignoring its isotropic distribution. By taking the isotropy into account, our algorithm can use the curvature information of the manifold to compute an analytical average of the local intersection volumes (local in a second-order sense) with all possible isotropically distributed search-subspaces, greatly reducing the variance of the volumes.

Higher-dimensional intersections occur in many (perhaps most) situations, such as applications with events that are rare for reasons other than that their associated submanifold has high codimension. In these situations, the probability of a low-dimensional search-subspace intersecting M can be very small, so one may wish to use a search-subspace S of dimension d that is greater than the codimension k of M in order to increase the probability of intersecting M.

As we will see in Section 2.3.6, the traditional approach can lead to a huge variance in the intersection volumes that increases exponentially with the difference in dimension d - k (Figure 2-4, right). This exponentially large variance leads to the same type of algorithmic slowdowns of the traditional algorithm as the variance in the traditional angle weights discussed above. Using the curvature-aware approach can oftentimes reduce or eliminate this exponential slowdown.

This chapter justifies the validity of the curvature-aware approach by proving a generalization of the Crofton formula (Section 2.3). We then motivate the use of the curvature-aware approach over the traditional curvature-oblivious approach using the mathematics of concentration of measure [43, 38, 46] (Section 2.3.6) and differential geometry [56, 57], specifically the Chern-Gauss-Bonnet theorem [13], whose curvature form we use to reweight the intersection volumes (Section 2.3.4).

[Figure 2-3 panels, top to bottom: histograms of an eigenvalue conditioned on several other eigenvalues; histograms of λ_2 given λ_1 = 2; histograms given λ_1 = 5. Each panel compares integral geometry weights, traditional weights, and (in the top two panels) rejection sampling.]

Figure 2-3: Histograms from three random matrix simulations (see Sections 2.5 and 2.6), where we seek the distribution of an eigenvalue given conditions on one or more other eigenvalues. In all three figures, the blue curve uses the integral geometry weights proposed in this chapter, the red curve uses traditional weights, and the black curve (only in the top two figures) is obtained by the accurate but very slow rejection sampling method. Two things worth noticing are that the integral geometry weight curve is more accurate than the traditional weight curve (at least when we have a rejection sampling curve to compare), and that the integral geometry weight curve is smoother than the traditional weight curve. The integral geometry algorithm achieves these benefits in part because of the much smaller variance in the weights. (In these three simulations the integral geometry sample variance was smaller by a factor of roughly 10^5, 600, and 2000, respectively.)

[Figure 2-4, right panel: log-scale plot of the variance of Vol(S ∩ M_i), normalized by its mean, against the search-subspace dimension d (0 to 400).]

Figure 2-4: In this example a collection ℳ = ∪_i ℳ_i of (n-1)-dimensional spheres ℳ_i (blue, left) is intersected (intersection depicted as green circles) by a random search-subspace S (red). The spheres that S intersects farther from their center will have a much smaller intersection volume than the spheres that S intersects closer to their center, with the variance in the intersection volumes increasing exponentially in the dimension d of S (logarithmic plot, right). This curse of dimensionality for the intersection volume can lead to an exponential slowdown when using a traditional algorithm to sample from S ∩ ℳ. In Section 2.3.6 we will see that this slowdown can be avoided if we use the curvature information to reweight the intersection volumes, reducing the variance in the intersection volumes.

Part I Theoretical results and discussion

2.2 A first-order reweighting via the Crofton formula

As discussed in the introduction to this chapter, we can use Crofton's formula directly to eliminate the weight 1/‖dS_x/dM_x‖ in step 3 of Algorithm 4:

Algorithm 8 Great sphere sampler (Crofton formula reweighting)
Algorithm 8 is identical to Algorithm 4, except for the following step:

3:   Sample x_{i+1} from π(x) restricted to S ∩ M (using a subroutine Markov chain or another sampling method)

Before we apply Crofton's formula to the Gibbs sampler (Algorithm 3), which uses isotropic random linear search subspaces (as opposed to great spheres), we need the following modification of Crofton's formula:

Theorem 1. (Crofton's formula for isotropic random linear subspaces)

Let S be an isotropic random d-dimensional linear subspace centered at the origin, and let π be a function on a codimension-k manifold M ⊂ R^n. Then

    ∫_M π(x) dx = c × E_S[ ∫_{S∩M} π(x) / ‖x̂_M‖ dx ],

where c = c_{d-1,k,n-1,S} is a constant and ‖x̂_M‖ is the sine of the angle between the line passing through both the origin and x and the normal space of M at x.

Proof. Decomposing R^n into the spheres rS^{n-1} centered at the origin,

    ∫_M π(x) dx = ∫_0^∞ ∫_{M ∩ rS^{n-1}} ( π(x) / ‖x̂_M‖ ) dx dr
                = ∫_0^∞ c_{d-1,k,n-1,S} × E_S[ ∫_{S ∩ M ∩ rS^{n-1}} ( π(x) / ‖x̂_M‖ ) dx ] dr     (Crofton formula on rS^{n-1})
                = c_{d-1,k,n-1,S} × E_S[ ∫_0^∞ ∫_{S ∩ M ∩ rS^{n-1}} ( π(x) / ‖x̂_M‖ ) dx dr ]     (Fubini)
                = c_{d-1,k,n-1,S} × E_S[ ∫_{S ∩ M} ( π(x) / ‖x̂_M‖ ) dx ].  □

2.2.1 The Crofton formula Gibbs sampler

As discussed in the introduction, we can apply the first-order reweighting of Theorem 1 to the Gibbs sampler algorithm with d-dimensional isotropic random search-subspaces to get a more efficient MCMC algorithm (Algorithm 9):

Algorithm 9 Crofton Formula Gibbs Sampler (for π supported on a submanifold)
Algorithm 9 is identical to Algorithm 3, except for the following step:

3:   Sample x_{i+1} from the (unnormalized) density π(x)/‖x̂_M‖ restricted to S ∩ M, with the notation of Theorem 1 applied with x_i as the origin (using a subroutine Markov chain or another sampling method)

Theorem 2. The primary Markov chains in Algorithms 9 and 3 (denoted by x_1, x_2, ... in both algorithms) have identical probability transition kernels.

Proof. For convenience, in this proof we will denote the primary Markov chains in Algorithms 9 and 3 by x̂_1, x̂_2, ... and x̃_1, x̃_2, ..., respectively.

Let K̃(U, V) := P(x̃_{i+1} ∈ V | x̃_i ∈ U) and K̂(U, V) := P(x̂_{i+1} ∈ V | x̂_i ∈ U) denote the probability transition kernels for the primary Markov chains in Algorithms 3 and 9, respectively (here U, V ⊂ R^n). Then

    K̃(U, V) = P(x̃_{i+1} ∈ V | x̃_i ∈ U) = ∫_U E_S[ ∫_{S∩V} π(y) / ‖dS_y/dM_y‖ dy ] dx
             = ∫_U ∫_V π(y) × (1/c) dy dx                          (Theorem 1)
             = ∫_U E_S[ ∫_{S∩V} π(y) / ‖ŷ_M‖ dy ] dx               (Theorem 1)
             = P(x̂_{i+1} ∈ V | x̂_i ∈ U)
             = K̂(U, V).  □

2.2.2 Traditional weights vs. integral geometry weights

In this section we find the theoretical distribution of the traditional weights and compare them to the integral geometry weights of Theorem 1. We will see that while both weights incorporate the first-order factor 1/‖x̂_M‖, the traditional weights have an additional component, not present in the integral geometry weights, that has infinite variance, greatly slowing the traditional MCMC algorithm. Indeed, ‖dS_x/dM_x‖ = ‖x̂_M‖ × ‖d(M ∩ rS^{n-1})_x/dŜ_x‖, where Ŝ := S ∩ rS^{n-1}, rS^{n-1} is the sphere of radius r = ‖x - x_i‖ centered at x_i, and ‖d(M ∩ rS^{n-1})_x/dŜ_x‖ is the analogous Jacobian factor obtained after projecting onto the tangent space at x of that sphere. Since both weights share the component ‖x̂_M‖, for the remainder of this section we will focus our analysis on the component ‖d(M ∩ rS^{n-1})_x/dŜ_x‖ that is unique to the traditional algorithm.

In the codimension-k = 1 case, we can find the distribution of the weights by observing that the symmetry of the Haar measure means that the distribution of the weights is a local property that does not depend on the choice of manifold M. Moreover, since the kinematic measure is locally the same for both constant-curvature spaces S^n and R^n, the distribution is the same regardless of the choice of constant-curvature space. Hence, without loss of generality, we may choose M to be a cylinder of unit radius in R^n. We observe that projecting the cylinder down to the unit circle in R^2, together with the dimension-k = 1 search-subspace, does not increase the weight. Because of the rotational symmetry of both the kinematic measure and the circle, without loss of generality we may condition on only the vertical lines {(x, t) : t ∈ R}, in which case x is distributed uniformly on [-1, 1]. The weights are then given by w = w(x) = 1/√(1 - x²), with exactly two intersections at almost every x. Hence, E[w] = 2 ∫_{-1}^{1} 1/√(1 - x²) dx = 2π, the circumference of the circle, as expected. However, E[w²] = 2 ∫_{-1}^{1} 1/(1 - x²) dx = ∞. Hence, w has infinite variance. Since projecting down to R^2 did not increase the weights, the original weights must have infinite variance as well, greatly slowing the convergence of the sampling algorithm even in the codimension-k = 1 case! On the other hand, the corresponding component of the integral geometry weights, being identically 1, has variance zero, so the weights do not slow down the convergence at all. (A related computation, which we do not give here, shows that the theoretical weights for general k are given by the Wishart-type determinant 1/√(det(GᵀG)), where G is a (k + 1) × k matrix of i.i.d. standard normals, which also has infinite variance.)
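A numerical companion to the computation above is sketched below: for the unit circle intersected by vertical lines with horizontal position x uniform on [-1, 1], the total traditional weight of the two intersection points is w(x) = 2/√(1 - x²); its average (times the interval length 2) recovers the circumference 2π, but its second moment is infinite, so the empirical variance never stabilizes. The sample sizes are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (10**3, 10**5, 10**7):
        x = rng.uniform(-1.0, 1.0, n)
        w = 2.0 / np.sqrt(1.0 - x**2)         # total weight of the two intersection points at position x
        print(n, 2.0 * w.mean(), w.var())      # 2*E[w] ~ 2*pi; the sample variance keeps growing with n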

2.3 A generalized Crofton formula

Oftentimes, it is necessary to use a random search-subspace of dimension d larger than the codimension k of the constraint manifold M (the manifold we wish to sample from). For instance, the manifold might represent a rare event, so we might use a higher dimension than the codimension to increase the probability of finding an intersection with the manifold. However, the intersections will no longer be points but submanifolds of dimension d - k. How should one assign weights to the points on this submanifold? The first-order factor in this weight is simple: it is the same as the Jacobian weight of Theorem 1. However, the size of the intersection still depends on the orientation of the search-subspace with respect to the constraint manifold. For instance, we will see in Section 2.3.6 that if we intersect a spherical manifold with a plane near the sphere's center, then we will get a much larger intersection than if we intersect the sphere with a plane far from its center.

This example suggests that we should weight the points on the intersection using the local curvature form: if we intersect in a direction where the curvature is greater (with the plane not passing near the center in the example), then we should use a larger weight than in directions where the curvature is smaller (when the plane passes near the center) (Figure 2-5).


Figure 2-5: Both d-dimensional slices, S_1 and S_2, pass through the green point x, but the slice passing through the center of the (n-1)-sphere M has a much bigger intersection volume than the slice passing far from the center. The smaller slice also has larger curvature at any given point x. If we reweight the density of S_i ∩ M at x by the Chern-Gauss-Bonnet curvature of S_i ∩ M at x, then both slices will have exactly the same total reweighted volume (exact in this case since the sphere has a constant curvature form), since the Chern-Gauss-Bonnet theorem relates this curvature to the volume measure.
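The effect in Figure 2-5 can be illustrated numerically one dimension up, so that the slice is even-dimensional: slicing the unit 3-sphere in R^4 with a 3-plane at distance h from the center gives a 2-sphere of radius √(1 - h²). Its area varies strongly with h, but its total Gauss curvature is always 4π = 2π·χ, as the Gauss-Bonnet theorem requires. The offsets h below are illustrative values.

    import numpy as np

    R = 1.0
    for h in (0.0, 0.5, 0.9, 0.99):
        rho = np.sqrt(R**2 - h**2)                # radius of the slice, which is a 2-sphere
        area = 4 * np.pi * rho**2                  # intersection volume: varies by orders of magnitude
        total_curvature = area * (1.0 / rho**2)    # curvature reweighting removes the variation
        print(h, area, total_curvature)            # total curvature is always 4*pi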

Consider the simple case where M is a collection of spheres. If we were just applying an algorithm based on the classical Crofton formula, such as Algorithm 9, we would sample uniformly from the volume on the intersection S ∩ M. However, the intersected volume depends heavily on the orientation of the search-subspace S with respect to each intersected sphere (Figure 2-4), meaning that the algorithm will in practice have to search through exponentially many spheres before converging to the uniform distribution on S ∩ M (see Section 2.3.6). To avoid this problem, we would like to sample from a density ŵ that is proportional to the absolute value of the Chern-Gauss-Bonnet curvature of S ∩ M at each point x in the intersection: ŵ = ŵ(x; S) = |Pf(Ω_x(S ∩ M))| (the motivation for using the Chern-Gauss-Bonnet curvature Pf(Ω(S ∩ M)) will be discussed in Section 2.3.4). However, sampling from the density ŵ(x; S) does not in general produce unbiased samples uniformly distributed on M, even when S is chosen at random according to the kinematic measure. We will see in Theorem 3 that in order to guarantee an unbiased uniform sampling of M we can instead sample from the normalized curvature density

    w(x; S) = ŵ(x; S) / ( c_{d,k,n,K} × E_Q[ ŵ(x; S_Q) × det(Proj_{M_x^⊥} Q) ] ).     (2.1)

The normalization term E_Q[ŵ(x; S_Q) × det(Proj_{M_x^⊥} Q)] is the average curvature at x over all the random orientations at which S could have passed through x. Here S_Q = Q(S - x) + x is a random isotropically distributed rotation of S about x, with Q the corresponding isotropic random orthogonal matrix. The determinant inside the expectation is there because, while S is originally isotropically distributed, the conditioning of S to intersect M (at x) modifies the probability density of its orientation by a factor of det(Proj_{M_x^⊥} Q). Here Proj_{M_x^⊥} Q is the projection of Q onto the orthogonal complement of the tangent space of M at x. In this collection-of-spheres example, the denominator is the same constant for every sphere of a given radius R; in the Euclidean case it can be computed analytically, as an explicit function of n, d, and R, using the Gauss-Bonnet theorem.

From this fact, together with the fact that the total curvature is always the same for any intersection by the Chern-Gauss-Bonnet theorem, we see that when sampling under the probability density w, the probability that we will sample from any given sphere is always the same regardless of the volume of the intersection of S with that sphere. Since each sphere (of the same radius) has an equal probability of being sampled, when sampling from M the algorithm has to search through far fewer spheres before converging to a uniformly random point on S ∩ M than when sampling from the uniform distribution on S ∩ M. The need to guarantee that w will still allow us to sample uniformly without bias from M motivates introducing the following generalization of the classical Crofton formula (Theorem 3), which, as far as we know, is new to the literature. Since the proof does not rely on the fact that ŵ is derived from a curvature form, we state the theorem in a more general form that allows for arbitrary ŵ (see Section 2.3.5 for a discussion of higher-order choices of ŵ beyond just the Chern-Gauss-Bonnet curvature).
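Before stating the theorem, the sketch below shows one way the normalization term E_Q[ŵ(x; S_Q) × det(Proj_{M_x^⊥} Q)] in Equation 2.1 could be estimated by Monte Carlo over random orientations Q. The basis N_basis of M_x^⊥, the placeholder curvature function w_hat, and the Monte Carlo sample size are assumptions for illustration; this is not the thesis's implementation.

    import numpy as np

    def haar_frame(n, d, rng):
        Q, R = np.linalg.qr(rng.standard_normal((n, d)))
        return Q * np.sign(np.diag(R))               # first d columns of a Haar-random rotation

    def normalization_term(w_hat, x, N_basis, n, d, n_mc=10_000, rng=np.random.default_rng(0)):
        k = N_basis.shape[1]
        total = 0.0
        for _ in range(n_mc):
            Q = haar_frame(n, d, rng)                # random orientation of the rotated slice S_Q at x
            s = np.linalg.svd(N_basis.T @ Q, compute_uv=False)
            det_proj = np.prod(s[:k])                # det of the projection of Q onto M_x^perp
            total += w_hat(x, Q) * det_proj          # curvature weight of the rotated slice, user supplied
        return total / n_mc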

Theorem 3. (Generalized Crofton formula) Let ŵ be a weight function, M a manifold, and S a random search-subspace moving according to the kinematic measure, satisfying smoothness conditions A1 and A2 (defined below). Then

    Vol(M) = c_{d,k,n,K} × E_S[ ∫_{S∩M} ŵ(x; S) / E_Q[ ŵ(x; S_Q) × det(Proj_{M_x^⊥} Q) ] dx ] / Vol_d(S).     (2.2)

Here Q is a matrix formed by the first d columns of a random matrix sampled from the Haar measure on SO(n), S_Q := Q(S - x) + x, and Proj_{M_x^⊥} is the projection onto the orthogonal complement of the tangent space of M at x. (As in Lemma 1, if S is a plane, we set the "Vol_d(S)" term to 1.)

Remark 3. We note that Amelunxen and Lotz [2, 3] recently managed to provide a more elegant proof of our Theorem 3, by modifying our proof using an algebraic approach similar to the group-theoretic double-fibration arguments of [22, 1, 30]. Although our proof of Theorem 3 relies on smoothness conditions A1 and A2 (defined below), their proof does not seem to rely on these two assumptions.

For MCMC applications, M is taken to be a component of a level set of π, and ŵ is taken to be the magnitude of the Chern-Gauss-Bonnet curvature of S ∩ M, since the Chern-Gauss-Bonnet theorem states that the integral of the curvature form over the intersection S ∩ M is invariant under rotations of S as long as the topology of S ∩ M is unchanged.

Definition 2. (Smoothness conditions)
A1: A manifold (such as M or S in Theorem 3) satisfies condition A1 if its curvature form is uniformly bounded above.

A2: The pre-normalized weight ŵ(x; S) is said to satisfy condition A2 if it is any function such that a < ŵ(x; S) < b for some 0 < a < b, and is Lipschitz in the variable x ∈ M for some Lipschitz constant 0 < c < ∞ (when using a translation of S to keep x in S ∩ M when we vary x).

Proof. (Of Theorem 3) We first observe that it suffices to prove Theorem 3 for the case where K^n = R^n is Euclidean, S is a random plane, and ŵ(x; S) = ŵ(x; L) depends only on the relative orientation L = dS_x/dM_x of the tangent spaces of S and M at x. This is because constant-curvature kinematic measure spaces are locally Euclidean (and converge uniformly to a Euclidean geometry if we restrict ourselves to increasingly small neighborhoods of any point in the space, because the curvature is the same). We may use any geodesic d-cube in place of the plane as a search-subspace S, since S can be decomposed as a collection of cubes, and Equation 2.2 treats each subset of S in an identical way (since so far we have assumed that ŵ(x; S) depends only on the orientation of the tangent spaces of S and M at x). We can then approximate any search-subspace S of bounded curvature, and any Lipschitz function ŵ(x; S) that depends on the location on S where S intersects M (in addition to L), by approximating S with very small squares, each with a different "ŵ(x; L)" that depends only on L. The remainder of the proof consists of two parts. In Part I we prove the theorem for the special case of very small codimension-k balls (in place of M). In Part II we extend this result to the entire manifold by tiling the manifold with randomly placed balls.

Part I: Special case for small codimension-k balls

Let B_ε = B_ε(x) be any codimension-k ball of radius ε that is tangent to M ⊂ R^n at the ball's center x. Let S and S̃ be independent random d-planes distributed according to the kinematic measure in R^n. Let r be the distance from S to the ball's center x, measured within the plane containing B_ε (along the shortest line contained in this plane). Let θ be the orthogonal matrix denoting the orientation of S. Then we may write S = S_{r,θ}.

Then almost surely (i.e., with probability 1; abbreviated "a.s."), Vol(S_{r,θ} ∩ B_ε) does not depend on θ (this is because B_ε is a codimension-k ball and S is a d-plane, so the volume of S ∩ B_ε, itself a (d - k)-ball, depends a.s. only on r and not on θ). We also note that ŵ(x; dS_{x,θ}/dB_ε) does not depend on r. Define the events E := {S_{r,θ} ∩ B_ε ≠ ∅} and Ẽ := {S̃ ∩ B_ε ≠ ∅}. Then

    E_{r,θ}[ w(x; dS_{r,θ}/dB_ε) × Vol_{d-k}(S_{r,θ} ∩ B_ε) ]                                                    (2.3)
    = E_{r,θ}[ w(x; dS_{r,θ}/dB_ε) × Vol_{d-k}(S_{r,θ} ∩ B_ε) | E ] × P(E)                                        (2.4)
    = E_θ[ w(x; dS_{x,θ}/dB_ε) | E ] × E_r[ Vol_{d-k}(S_{r,θ} ∩ B_ε) | E ] × P(E)                                 (2.5)
    = ( E_θ[ ŵ(x; dS_{x,θ}/dB_ε) | E ] / ( c_{d,k,n,R} × E_Q[ ŵ(x; S_Q) × det(Proj_{B_ε^⊥} Q) ] ) )
        × E_r[ Vol_{d-k}(S_{r,θ} ∩ B_ε) | E ] × P(E)                                                              (2.6)
    = ( E_θ[ ŵ(x; dS_{x,θ}/dB_ε) | E ] / ( c_{d,k,n,R} × E_S̃[ ŵ(x; dS̃_x/dB_ε) | Ẽ ] ) )
        × E_r[ Vol_{d-k}(S_{r,θ} ∩ B_ε) | E ] × P(E)                                                              (2.7)
    = (1/c_{d,k,n,R}) × 1 × E_r[ Vol_{d-k}(S_{r,θ} ∩ B_ε) | E ] × P(E)                                            (2.8)
    = (1/c_{d,k,n,R}) × E_{r,θ}[ Vol_{d-k}(S_{r,θ} ∩ B_ε) | E ] × P(E)                                            (2.9)
    = (1/c_{d,k,n,R}) × E_{r,θ}[ Vol_{d-k}(S_{r,θ} ∩ B_ε) ]                                                       (2.10)
    = (1/c_{d,k,n,R}) × c_{d,k,n,R} × Vol_{n-k}(B_ε)                                                              (2.11)
    = Vol_{n-k}(B_ε).                                                                                             (2.12)

• Equation 2.5 is due to the fact that r and θ are independent random variables even when conditioning on the event E. This is true because they are independent in the unconditioned kinematic measure on S, and remain independent once we condition on S intersecting B_ε (i.e., the event E) because of the symmetry of the codimension-k ball B_ε.

• Equations 2.6 and 2.7 follow from the definition of w (Equation 2.1) together with the fact that, by the change of variables formula,

    ∫_{R^{n-d}} Vol( (T_Q + R_{Q^⊥} y) ∩ B_ε ) dVol_{n-d}(y) × ( 1 / det(Proj_{B_ε^⊥} Q) ) = Vol(B_ε)     (2.13)

for every orthogonal matrix Q, where the coordinates of the integral are conveniently chosen with the origin at the center of B_ε, and R_{Q^⊥} is a rotation matrix rotating the vector y so that it is orthogonal to T_Q, the subspace spanned by the columns of Q.

37 Multiplying by tzbx; Q) and rearranging terms gives

(x; Q) x det(ProjB=Q)

fR-d Vol(TQ + RQLy n BE)dVolfd(y) b(X; Q) x (2.14) Vol(B 6)

Taking the expectation with respect to Q (where Q is the first d columns of a Haar(SO(n)) random matrix) on both sides of the equation gives

EQ[tb(x;Q) x det(ProjBIQ)] BE)dVol_ ~EK[WXnQ fR--dVol(TQ + RQIy n (y)I Vol(BE) (2.15)

Recognizing the right hand side as an expectation with respect to the kinematic measure on TQ+RQ y conditioned to intersect BE (since the fraction on the RHS is exactly the density of the probability of intersection for a given orientation of Q), we have:

EQ[w(x; Q) x det(Proj5 Q)] = E X; dM (2.16)

Equation 2.8 is due to the fact that ds' = dBtge6 dBr because BE has a constant tangent space, and hence

dSxo E] = ErO [ -(x; '6) E] 'dBe dSE = Er,O [ X; d~ o) E] = E I(X; E]. (2.17) dBE,

9 Equation 2.11 is by the Crofton formula.

Writing E_S in place of E_{r,θ} in Equation 2.3 (LHS) / 2.12 (RHS) (we may do this since S = S_{r,θ} is determined by r and θ), and observing that dS_x/dB_ε = dS_x/dM_x, we have shown that

E_S[ w(x; dS_x/dM_x) × Vol_{d−k}(S ∩ B_ε) ] = Vol_{d−k}(B_ε).   (2.18)

Part II: Extension to all of M

All that remains to be done is to extend this result over all of M. To do so, we consider a Poisson point process {x_i} on M, with intensity 1/Vol(B_ε). We wish to approximate the volume-measure on M using the collection of balls {B_ε(x_i)} (think of making a papier-mâché mold of M using the balls B_ε(x_i) as tiny bits of paper).

Let A ⊂ M be any measurable subset of M. Since M and S have uniformly bounded curvature forms, because of the symmetry of the balls and the symmetry of the Poisson distribution, the total volume of the balls intersected by S and A converges a.s. to Vol(S ∩ M̃ ∩ A) on any compact submanifold M̃ ⊂ M:

Σ_i Vol(S ∩ B_ε(x_i)) × ( Vol(B_ε(x_i) ∩ A) / Vol(B_ε(x_i)) )  →  Vol(S ∩ M ∩ A)   a.s. as ε ↓ 0,   (2.19)

and similarly,

Σ_i Vol(B_ε(x_i) ∩ A)  →  Vol(M ∩ A)   a.s. as ε ↓ 0.   (2.20)

But, by assumption, w is Lipschitz in x on M (since ŵ, which appears in both the numerator and denominator of w, is Lipschitz, and the denominator is bounded below by a > 0), so we can cut up M into a countable union of disjoint compact submanifolds ∪_{j=1}^∞ M_j such that |w(t; ·) − w(x; ·)| < δ for all x, t ∈ M_j, and hence, by Equation 2.19,

| lim_{ε↓0} Σ_{i: x_i ∈ M_j} Vol(S ∩ B_ε(x_i)) × ( Vol(B_ε(x_i) ∩ A) / Vol(B_ε(x_i)) ) × w(x_i; dS/dM_j)  −  ∫_{S ∩ M_j ∩ A} w(x; dS_x/dM_x) dVol(x) |  ≤  δ × Vol(S ∩ M_j ∩ A)   (2.21)

a.s. for every j. Summing over all j in Equation 2.21 implies that

| lim_{ε↓0} Σ_i Vol(S ∩ B_ε(x_i)) × ( Vol(B_ε(x_i) ∩ A) / Vol(B_ε(x_i)) ) × w(x_i; dS/dM)  −  ∫_{S ∩ M ∩ A} w(x; dS_x/dM_x) dVol(x) |  ≤  δ × Vol(S ∩ M ∩ A)   (2.22)

almost surely. Since Equation 2.22 is true for every δ > 0, we must have that

Σ_i Vol(S ∩ B_ε(x_i)) × ( Vol(B_ε(x_i) ∩ A) / Vol(B_ε(x_i)) ) × w(x_i; dS/dM)  →  ∫_{S ∩ M ∩ A} w(x; dS_x/dM_x) dVol(x)   a.s. as ε ↓ 0.   (2.23)

Hence, taking the expectation Es on both sides of Equation 2.23, we get

E_S[ Σ_i Vol(S ∩ B_ε(x_i)) × ( Vol(B_ε(x_i) ∩ A) / Vol(B_ε(x_i)) ) × w(x_i; dS/dM) ]  →  E_S[ ∫_{S ∩ M ∩ A} w(x; dS_x/dM_x) dVol(x) ]   (2.24)

a.s. as ε ↓ 0 (we may exchange the limit and the expectation by the dominated convergence theorem, since | Σ_i Vol(S ∩ B_ε(x_i) ∩ A) × w(x_i; ·) | is dominated by 2 × Vol(S ∩ M) × b for sufficiently small ε).

Since the sum on the LHS of Equation 2.24 is of nonnegative terms we may exchange the sum and expectation, by the monotone convergence theorem:

E_S[ Σ_i Vol(S ∩ B_ε(x_i)) × ( Vol(B_ε(x_i) ∩ A) / Vol(B_ε(x_i)) ) × w(x_i; dS/dM) ] = Σ_i E_S[ Vol(S ∩ B_ε(x_i)) × w(x_i; dS/dM) ] × ( Vol(B_ε(x_i) ∩ A) / Vol(B_ε(x_i)) ).   (2.25)

But by Equation 2.18, E_S[ Vol(S ∩ B_ε(x_i)) × w(x_i; dS/dM) ] = Vol(B_ε(x_i)), so

Σ_i E_S[ Vol(S ∩ B_ε(x_i)) × w(x_i; dS/dM) ] × ( Vol(B_ε(x_i) ∩ A) / Vol(B_ε(x_i)) ) = Σ_i Vol(B_ε(x_i) ∩ A)  →  Vol(M ∩ A)   (2.26)

almost surely as ε ↓ 0 by Equation 2.20. Combining Equations 2.24 and 2.26 gives

E_S[ ∫_{S ∩ M ∩ A} w(x; dS_x/dM_x) dVol(x) ] = Vol(M ∩ A).   (2.27)

We now prove Theorem 4, a version of Theorem 3 with somewhat more general analytical assumptions.

Theorem 4. Suppose that ŵ(x; S) is c(t)-Lipschitz on M ∩ {x : ŵ(x; S) < t}, and that

lim_{t→∞}  E_Q[ ( ŵ(x; S_Q) − (1/t) ∨ ŵ(x; S_Q) ∧ t ) × det(Proj_{M_x^⊥} Q) ] / E_Q[ ( (1/t) ∨ ŵ(x; S_Q) ∧ t ) × det(Proj_{M_x^⊥} Q) ]  =  0

and

lim_{t→∞}  E_S[ ∫_{S ∩ M} 1_A(x) × ( w(x; S) − (1/t) ∨ w(x; S) ∧ t ) dVol ]  =  0,

where we define the "∧" and "∨" operators to be r ∧ s := min{r, s} and r ∨ s := max{r, s}, respectively, for all r, s ∈ R.

Then Theorem 3 holds even for a = 0 and b = c = ∞.

Proof. (Of Theorem 4) Define

θ(t) := E_Q[ ( ŵ(x; S_Q) − (1/t) ∨ ŵ(x; S_Q) ∧ t ) × det(Proj_{M_x^⊥} Q) ] / E_Q[ ( (1/t) ∨ ŵ(x; S_Q) ∧ t ) × det(Proj_{M_x^⊥} Q) ],

so that by the first hypothesis of the theorem, θ(t) → 0 as t → ∞. Let A be any Lebesgue-measurable subset. Then

E_S[ ∫_{S ∩ M} 1_A(x) × w(x; S) dVol ]   (2.28)

= lim_{t→∞} E_S[ ∫_{S ∩ M} 1_A(x) × w(x; S) dVol ]   (2.29)

= lim_{t→∞} E_S[ ∫_{S ∩ M} 1_A(x) × ( (1/t) ∨ w(x; S) ∧ t ) dVol ]   (2.30)
  + lim_{t→∞} E_S[ ∫_{S ∩ M} 1_A(x) × ( w(x; S) − (1/t) ∨ w(x; S) ∧ t ) dVol ]

= lim_{t→∞} E_S[ ∫_{S ∩ M} 1_A(x) × ( (1/t) ∨ w(x; S) ∧ t ) dVol ] + 0   (2.31)

= lim_{t→∞} E_S[ ∫_{S ∩ M} 1_A(x) × ( (1/t) ∨ ŵ(x; S) ∧ t ) / ( c_{d,k,n,K} E_Q[ ŵ(x; S_Q) × det(Proj_{M_x^⊥} Q) ] ) dVol ]   (2.32)

= lim_{t→∞} E_S[ ∫_{S ∩ M} 1_A(x) × ( (1/t) ∨ ŵ(x; S) ∧ t ) / ( c_{d,k,n,K} E_Q[ ( (1/t) ∨ ŵ(x; S_Q) ∧ t ) × det(Proj_{M_x^⊥} Q) ] ) dVol ] × 1/(1 + θ(t))   (2.33)–(2.34)

= lim_{t→∞} Vol(M ∩ A) × 1/(1 + θ(t))   (2.35)

= Vol(M ∩ A) × 1   (2.36)

= Vol(M ∩ A).   (2.37)

* Equation 2.31 is true because

0 ≤ E_S[ ∫_{S ∩ M} 1_A(x) × ( w(x; S) − (1/t) ∨ w(x; S) ∧ t ) dVol ] → 0 as t → ∞,

by the second hypothesis of the theorem.

* Equation 2.35 follows from Theorem 3 using (1/t) ∨ ŵ(x; S) ∧ t as our pre-weight. Indeed, (1/t) ∨ ŵ(x; S) ∧ t obviously satisfies the boundedness conditions of Theorem 3. Moreover, since ŵ(x; S) is c(t)-Lipschitz everywhere on M where ŵ(x; S) < t, the pre-weight (1/t) ∨ ŵ(x; S) ∧ t must be c(t)-Lipschitz on all of M. □

2.3.1 The generalized Crofton formula Gibbs sampler

We now state a generalization of Theorem 1, which can be proved in much the same way as Theorem 1 by applying our generalized Crofton formula (Theorem 3) in place of the classical Crofton formula:

Theorem 5. Let S be an isotropic random subspace centered at the origin. Let π be a function on a manifold M. Then

∫_M π(q) dq = c × E_S[ ∫_{S ∩ M} ( ŵ(q; S ∩ S_q) / E_Q[ ŵ(q; S_Q ∩ S_q) × det(Proj_{(M_q ∩ S_q)^⊥} Q) ] ) × ( π(q) / ( ||q||^{n−d} ||q̂^⊥|| ) ) dq ],

where c = c_{d−1,k,n−1,S^{n−1}} is a constant and ||q̂^⊥|| is the sine of the angle between S and the line passing through both the origin and q. (S_q is the sphere of radius ||q|| centered at the origin.)

Proof. The proof is identical to our proof of Theorem 1, with Crofton's formula on the sphere replaced by the generalized Crofton formula on the sphere (Theorem 3). □

Applying Theorem 5 gives the following improvement to Algorithm 9:

Algorithm 10 Generalized Crofton Formula Gibbs Sampler (for π supported on a submanifold). Algorithm 10 is identical to Algorithms 9 and 3, except for the following step:

3: Sample x_{j+1} from the (unnormalized) density

w_i(x) = π(x) × ŵ_i(x; S_i ∩ S_x) / E_Q[ ŵ_i(x; S_Q ∩ S_x) × det(Proj_{(M_x ∩ S_x)^⊥} Q) ]

restricted to S_i ∩ M (using a subroutine Markov chain or another sampling method). (S_x is the sphere of radius ||x − x_i|| centered at x_i.)

As discussed in Section 2.3, we would usually set ŵ_i to ŵ_i(x, S) = |Pf(Ω_x(S ∩ M))|. However, as discussed in Section 2.3.5, in some cases it may be advantageous to use other functions for ŵ_i.

Theorem 6. The primary Markov chains in Algorithms 10 and 3 (denoted by x_1, x_2, ... in both algorithms) have identical probability transition kernels.

Proof. For convenience, in this proof we will denote the primary Markov chains in Algorithms 3 and 10 by x_1, x_2, ... and x̂_1, x̂_2, ..., respectively.

Let K(U, V) := P(x_{i+1} ∈ V | x_i ∈ U) and K̂(U, V) := P(x̂_{i+1} ∈ V | x̂_i ∈ U) denote the probability transition kernels of the primary Markov chains in Algorithms 3 and 10, respectively (here U, V ⊂ R^n). Then

K(U, V) = P(x_{i+1} ∈ V | x_i ∈ U) = ∫_U E_S[ ∫_{S ∩ M ∩ V} π(y) (dS_y/dM_y) dy ] dx,

which, by Theorem 5 applied on the sphere S_y of radius ||y − x|| centered at x, equals

∫_U E_S[ ∫_{S ∩ M ∩ V} π(y) × ŵ_i(y; S ∩ S_y) / E_Q[ ŵ_i(y; S_Q ∩ S_y) × det(Proj_{(M_y ∩ S_y)^⊥} Q) ] dy ] dx = P(x̂_{i+1} ∈ V | x̂_i ∈ U) = K̂(U, V). □

Remark 4. The curvature form Ω_x(S_{i+1} ∩ M ∩ S_x) of the intersected manifold can be computed in terms of the curvature form Ω_x(M) of the original manifold by applying the implicit function theorem twice in a row. Also, if M is a hypersurface, then |Pf(Ω_x(S_{i+1} ∩ M ∩ S_x))| is the determinant of the product of a random Haar-measure orthogonal matrix with known deterministic matrices, and hence E_Q[ |Pf(Ω_x(S_Q ∩ M ∩ S_x))| × det(Proj_{M_x^⊥} Q) ] is also the expectation of a determinant of a random matrix of this type. If the Hessian is positive-definite, then we can obtain an analytical solution in terms of zonal polynomials. Even in the case when the curvature form is not a positive-definite matrix (it is a matrix with entries in the algebra of differential forms), the fact that the pre-weight is the Pfaffian of a random curvature form (in particular, a determinant of a real-valued random matrix in the codimension-1 case) should make it very easy to compute numerically, perhaps by a Monte Carlo method. This fact also means that it should be easy to bound the expectation, which allows us to use Theorem 3 to get bounds for the volumes of algebraic manifolds (Section 2.3.8).
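To make the Monte Carlo suggestion of Remark 4 concrete, the following Python sketch estimates a denominator of the form E_Q[|det(·)| × det(Proj)] for a hypersurface M = {f = 0} ⊂ R^n at a point x. It is only an illustrative sketch, not the implementation used in this thesis: the function names are hypothetical, Haar-distributed orientations are drawn via QR factorizations of Gaussian matrices, and the second fundamental form of M restricted to the slice directions is used as a simplified stand-in for the Chern-Gauss-Bonnet pre-weight |Pf(Ω_x(S_Q ∩ M))| (omitting the implicit-function-theorem correction mentioned above).

    import numpy as np
    from scipy.linalg import null_space

    def haar_frame(n, d, rng):
        """Orthonormal basis (n x d) of a Haar-random d-dimensional subspace of R^n."""
        A = rng.standard_normal((n, d))
        Q, R = np.linalg.qr(A)
        return Q * np.sign(np.diag(R))

    def estimate_weight_denominator(grad_f, hess_f, x, d, n_samples=2000, seed=0):
        """Monte Carlo estimate of E_Q[ pre_weight(x; S_Q) * det(Proj) ] for the
        hypersurface M = {f = 0} at a point x with f(x) = 0 (codimension-1 case).
        The |det| of the second fundamental form restricted to the slice directions
        is used here as a simplified stand-in for the Chern-Gauss-Bonnet pre-weight."""
        rng = np.random.default_rng(seed)
        n = x.shape[0]
        g = grad_f(x)
        normal = g / np.linalg.norm(g)
        P = np.eye(n) - np.outer(normal, normal)      # projection onto the tangent space T_x M
        II = P @ hess_f(x) @ P / np.linalg.norm(g)    # second fundamental form of M at x
        total = 0.0
        for _ in range(n_samples):
            Q = haar_frame(n, d, rng)                 # orientation of the random d-plane S_Q
            u = Q.T @ normal
            proj_factor = np.linalg.norm(u)           # |cos| of the angle between the normal and S_Q
            W = null_space(u.reshape(1, -1))          # directions of S_Q orthogonal to the normal
            T = Q @ W                                 # orthonormal basis of S_Q ∩ T_x M (d-1 dims)
            pre_weight = abs(np.linalg.det(T.T @ II @ T))
            total += pre_weight * proj_factor
        return total / n_samples

    if __name__ == "__main__":
        # Example: the unit sphere f(x) = |x|^2 - 1 in R^5, at the point e_1.
        grad_f = lambda x: 2.0 * x
        hess_f = lambda x: 2.0 * np.eye(x.shape[0])
        x0 = np.zeros(5); x0[0] = 1.0
        print(estimate_weight_denominator(grad_f, hess_f, x0, d=3))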

Remark 5. While the Chern-Gauss-Bonnet theorem only holds for even-dimensional manifolds, if M has odd dimension we can always include a dummy variable to increase both the dimensions n and d by 1.

2.3.2 The generalized Crofton formula Gibbs sampler for full-dimensional densities

In many cases one might wish to sample from a full-dimensional set of nonzero prob- ability measure. One could still reweight in this situation to achieve faster conver- gence by decomposing the probability density into its level sets, and applying the weights of Theorem 5 separately to each of the (infinitely many) level sets. We ex- pect this reweighting to speed convergence in cases where the probability density is concentrated in certain regions, since when d is large, intersecting these regions with a random search-subspace S typically causes large variations in the integral of the probability density over the different regions intersected by S, unless we reweight using Theorem 5.

Algorithm 11 Generalized Crofton Formula Gibbs Sampler (for full-dimensional π). Algorithm 11 is identical to Algorithms 9 and 10, except for the following step:

3: Sample x_{j+1} (using a subroutine Markov chain or another sampling method) from the (unnormalized) density

w_i(x) = π(x) × ŵ_i(x; S_i ∩ S_x) / E_Q[ ŵ_i(x; S_Q ∩ S_x) ],

where S_x is the sphere of radius ||x − x_i|| centered at x_i.

As discussed at the beginning of Section 2.3, we would usually set ŵ_i to ŵ_i(x, S) = |Pf(Ω_x(S ∩ L_x))|, where L_x is the level set of π passing through x. If we instead set ŵ_i(x, S) = 1, we get the traditional Gibbs sampler (Algorithm 2).

Theorem 7. The primary Markov chains in Algorithms 11 and 2 (denoted by x1 , x 2, ... in both algorithms) have identical probability transition kernels.

Proof. For convenience, in this proof we will denote the primary Markov chains in Algorithms 2 and 11 by x_1, x_2, ... and x̂_1, x̂_2, ..., respectively.

Let K(U, V) := P(x_{i+1} ∈ V | x_i ∈ U) and K̂(U, V) := P(x̂_{i+1} ∈ V | x̂_i ∈ U) denote the probability transition kernels of the primary Markov chains in Algorithms 2 and 11, respectively (here U, V ⊂ R^n). Then

K(U, V) = P(x_{i+1} ∈ V | x_i ∈ U) = ∫_U E_S[ ∫_{S ∩ V} π(y) dy ] dx,

which, by Theorem 5 applied on the sphere S_y of radius ||y − x|| centered at x, equals

∫_U E_S[ ∫_{S ∩ V} π(y) × ŵ_i(y; S ∩ S_y) / E_Q[ ŵ_i(y; S_Q ∩ S_y) ] dy ] dx = P(x̂_{i+1} ∈ V | x̂_i ∈ U) = K̂(U, V),

where we set M = R^n when applying Theorem 5. □

2.3.3 An MCMC volume estimator based on the Chern-Gauss-Bonnet theorem

In this section we briefly go over a new MCMC method (which we plan to discuss in much greater detail in a future paper) of estimating the volume of a manifold that is

47 based on the Chern-Gauss-Bonnet curvature. While this method is interesting in its own right, we choose to introduce it here since it will serve as a good introduction to our motivation (Section 2.3.4) for using the Chern-Gauss-Bonnet curvature as a pre-weight for Theorem 3.

Suppose we somehow knew or had an estimate for the Euler characteristic χ(M) ≠ 0 of a closed manifold M of even dimension m. We could then use a Markov chain Monte Carlo algorithm to estimate the average Gauss curvature form E_M[Pf(Ω)] on M. The Chern-Gauss-Bonnet theorem says that

∫_M Pf(Ω) dVol_m = (2π)^{m/2} χ(M).   (2.38)

We may rewrite this as

∫_M Pf(Ω) dVol_m / ∫_M dVol_m = (2π)^{m/2} χ(M) / ∫_M dVol_m.   (2.39)

By definition, the left hand side is E_M[Pf(Ω)], and ∫_M dVol_m = Vol_m(M), so

E_M[Pf(Ω)] = (2π)^{m/2} χ(M) / Vol_m(M),   (2.40)

from which we may derive an equation for the volume in terms of the known quantities E_M[Pf(Ω)] and χ(M):

Vol_m(M) = (2π)^{m/2} χ(M) / E_M[Pf(Ω)].   (2.41)
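As a sanity check of Equation 2.41, the following sketch estimates the surface area of an ellipsoid (for which m = 2, χ(M) = 2, and Pf(Ω) is the Gauss curvature K), by averaging K over points sampled uniformly with respect to surface area and computing 4π/E_M[K]. The sampler, the curvature formula, and the function names are illustrative assumptions for this example rather than part of the estimator we plan to develop; a direct Monte Carlo estimate of the area is printed for comparison.

    import numpy as np

    def ellipsoid_gauss_curvature(p, a, b, c):
        """Gauss curvature of x^2/a^2 + y^2/b^2 + z^2/c^2 = 1 at a surface point p."""
        x, y, z = p
        s = x**2 / a**4 + y**2 / b**4 + z**2 / c**4
        return 1.0 / (a**2 * b**2 * c**2 * s**2)

    def sample_ellipsoid_surface(n, a, b, c, rng):
        """Rejection sampler for points uniform w.r.t. surface area on the ellipsoid.
        Proposal: uniform on the unit sphere, mapped by (x,y,z) -> (a x, b y, c z);
        the acceptance weight is the local area-stretch factor of that map."""
        m_max = max(a * b, b * c, a * c)
        pts = []
        while len(pts) < n:
            u = rng.standard_normal(3)
            u /= np.linalg.norm(u)
            stretch = np.sqrt((b * c * u[0])**2 + (a * c * u[1])**2 + (a * b * u[2])**2)
            if rng.uniform() < stretch / m_max:
                pts.append([a * u[0], b * u[1], c * u[2]])
        return np.array(pts)

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        a, b, c = 1.0, 1.5, 2.0
        pts = sample_ellipsoid_surface(20000, a, b, c, rng)
        K = np.array([ellipsoid_gauss_curvature(p, a, b, c) for p in pts])
        # Equation (2.41) with m = 2 and chi(M) = 2:  Vol_2(M) = 4*pi / E_M[K].
        area_cgb = 4.0 * np.pi / K.mean()
        # Independent direct Monte Carlo estimate of the area, for comparison.
        u = rng.standard_normal((200000, 3))
        u /= np.linalg.norm(u, axis=1, keepdims=True)
        stretch = np.sqrt((b*c*u[:, 0])**2 + (a*c*u[:, 1])**2 + (a*b*u[:, 2])**2)
        area_direct = 4.0 * np.pi * stretch.mean()
        print(f"Gauss-Bonnet estimate: {area_cgb:.3f}, direct estimate: {area_direct:.3f}")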

2.3.4 Motivation for reweighting with respect to Chern-Gauss-Bonnet curvature

While Theorem 3 tells us that any pre-weight ŵ generates an unbiased weight w, it does not tell us what pre-weights reduce the variance of the intersection volumes. We argue here that the Chern-Gauss-Bonnet theorem in many cases provides us with an ideal pre-weight if one only has access to local second-order information at a point x.

Equation 2.41 of Section 2.3.3 gives an estimate for the volume

Vol_{d−k}(S ∩ M) = (2π)^{(d−k)/2} χ(S ∩ M) / E_{S∩M}[Pf(Ω(S ∩ M))],   (2.42)

where Ω(S ∩ M) is the curvature form of the submanifold S ∩ M.

If we had access to all the quantities in Equation 2.42 our pre-weight would then be 1/Vol_{d−k}(S ∩ M) = E_{S∩M}[Pf(Ω(S ∩ M))] / ( (2π)^{(d−k)/2} χ(S ∩ M) ). However, as we shall see, we cannot actually implement this pre-weight since some of these quantities represent higher-order information. To make use of this weight to the best of our ability given only the second-order information, we must separate the higher-order components of the weight from the second-order components by dividing out the higher-order components.

The Euler characteristic is essentially a higher-order property, so it is not reasonable in general to try to estimate the Euler characteristic χ(S ∩ M) using the second derivatives of M at x, because the local second-order information gives us little if any information about χ(S ∩ M) (although it may in theory be possible to say a bit more about the Euler characteristic if one has some prior knowledge of the manifold). The best we can do at this point is to assume the Euler characteristic is a constant with respect to S, or more generally, statistically independent of S.

All that remains to be done is to estimate E_{S∩M}[Pf(Ω(S ∩ M))]. We observe that

E_{S∩M}[Pf(Ω(S ∩ M))] = E_{S∩M}[|Pf(Ω(S ∩ M))|] × ( E_{S∩M}[Pf(Ω(S ∩ M))] / E_{S∩M}[|Pf(Ω(S ∩ M))|] ).   (2.43)

But the ratio E_{S∩M}[Pf(Ω(S ∩ M))] / E_{S∩M}[|Pf(Ω(S ∩ M))|] is also a higher-order property, since all it does is describe how much the second-order Chern-Gauss-Bonnet curvature form changes globally over the manifold, so in general we can say nothing about it using only the local second-order information. The best we can do at this point is to assume that this ratio is statistically independent of S as well.

Hence, we have:

1/Vol_{d−k}(S ∩ M) = E_{S∩M}[|Pf(Ω(S ∩ M))|] × ( E_{S∩M}[Pf(Ω(S ∩ M))] / E_{S∩M}[|Pf(Ω(S ∩ M))|] ) / ( (2π)^{(d−k)/2} χ(S ∩ M) ),   (2.44)

where we lose nothing by dividing out the unknown quantity ( E_{S∩M}[Pf(Ω(S ∩ M))] / E_{S∩M}[|Pf(Ω(S ∩ M))|] ) / ( (2π)^{(d−k)/2} χ(S ∩ M) ), since we have no information about it and it is independent of S.

We would therefore like to use E_{S∩M}[|Pf(Ω(S ∩ M))|] as a pre-weight. Since we only know the curvature form Ω(S ∩ M) locally at x, our best estimate for E_{S∩M}[|Pf(Ω(S ∩ M))|] is the absolute value |Pf(Ω_x(S ∩ M))| of the Chern-Gauss-Bonnet curvature at x.

Hence, our best local second-order choice for the pre-weight is ŵ = |Pf(Ω_x(S ∩ M))|.

2.3.5 Higher-order Chern-Gauss-Bonnet reweightings

One may consider higher-order reweightings which attempt to guess not only the second-order local intersection volume, but also make a better guess for both the Euler characteristic of the intersection S_Q ∩ M and how the curvature would vary over S_Q ∩ M. Nevertheless, higher-order approximations are probably harder to implement, for the same reason that most nonlinear solvers, such as Newton's method, do not use higher-order derivatives. Moreover, it may not even be desirable to implement higher-order reweightings. Indeed, if the local intersection region whose volume we are aiming to estimate is so large that the second derivatives vary widely over this region, then the statistic we wish to compute with our algorithm will most likely also vary widely over this region, ensuring that different samples over this region will contain different information about this statistic. Hence, we probably only need to consider volume approximations that are local in a second-order sense.

50 2.3.6 Collection-of-spheres example and concentration-of-measure

In this section we argue that the traditional algorithms can suffer from an exponential slowdown (exponential in the search-subspace dimension) unless we reweight the in- tersection volumes using Theorem 3 with the Chern-Gauss-Bonnet curvature weights. We do so by applying two results (Theorems 8 and 9) related to the concentration- of-measure phenomenon, to an example involving a collection of hyperspheres.

Consider a collection of very many hyperspheres in R^n. We wish to sample uniformly from these hyperspheres. To do so, we imagine running a Markov chain with isotropically random search-subspaces. We imagine that there are so many hyperspheres that a random search-subspace typically intersects exponentially many hyperspheres. As a first step we would use Theorem 1, which allows us to sample the intersected hyperspheres from the uniform distribution on their intersection volumes. While using Theorem 1 should speed convergence somewhat (as discussed in Section 2.2.2), concentration-of-measure causes the intersections with the different hyperspheres to have very different volumes (Figure 2-6). In fact we shall see that the variance of these volumes increases exponentially in d, causing an exponential slowdown if only Theorem 1 is used, since the subroutine Markov chain would need to find exponentially many subspheres before converging.

Reweighting intersection volumes using Theorem 3 causes each random intersec- tion S n Mi (where Mi is a subsphere) to have exactly the same reweighted inter- section volume, regardless of the location where S intersects Mi, and regardless of d. Hence, in this example, Theorem 3 allows us to avoid the exponential slowdown in convergence speed that would arise from the variance of the intersection volumes.

The first result deals with the variance of the intersection volumes of a sphere in Euclidean space. It says that the variance of the intersection volume, normalized by its mean, increases exponentially with the dimension d (as long as d is not too close to n). Although isotropically random search-subspaces are (conditional on the radial direction) distributed according to the Haar measure in spherical space, the Euclidean case is still of interest to us since it represents the limiting case when the

Figure 2-6: The random search-subspace S intersects a collection of spheres M_i. Even though the spheres in this example all have the same (n−1)-volume, the (d−1)-volume of the intersection of S with each individual sphere (green circles) varies greatly depending on where S intersects the sphere if d is large. In fact, the variance of the intersection volume of each intersected sphere increases exponentially with d. This "curse of dimensionality" for the intersection volume variance leads to an exponential slowdown if we wish to sample from S ∩ M with a Markov chain sampler (and S ∩ M consists of exponentially many intersected spheres). However, if we use the Chern-Gauss-Bonnet curvature to reweight the intersection volumes, then all spheres in this example will have exactly the same reweighted intersection volume, greatly increasing the convergence speed of the Markov chain sampler.

Theorem 8. (Variance resulting from concentration of Euclidean kinematic measure) Let S ⊂ R^n be a random d-dimensional plane distributed according to the kinematic measure on R^n. Let M = S^{n−1} ⊂ R^n be the unit sphere in R^n. Defining α := d/n, we have

k(α, d) e^{d × φ(α)} − 1  ≤  Var( Vol(S ∩ M) / E[Vol(S ∩ M)] )  ≤  K(α, d) e^{d × φ(α)} − 1,   (2.45)

where

φ(α) = log(2) + (1/α) log(1/α) − (1/(2α) + 1/2) log(1/α + 1) − (1/(2α) − 1/2) log(1/α − 1),

and k(α, d) and K(α, d) are explicit prefactors, subexponential in d, which arise from Stirling's formula in the proof below.

Proof. Consider the unit sphere M = S^{n−1} centered at the origin. By symmetry of the sphere, the intersection M ∩ S of the unit sphere with a d-dimensional plane S is entirely determined (up to a rotation) by the plane's orthogonal complement S^⊥ that passes through the origin, and the intersection point x = S ∩ S^⊥. By symmetry, we may assume S^⊥ = R^{n−d} is aligned with the first n − d coordinate axes. If S is distributed according to the kinematic measure, we must have that x is distributed uniformly on the ball B^{n−d} ⊂ R^{n−d}. Hence, P(||x|| ≤ R) = Vol_{n−d}(R B^{n−d}) / Vol_{n−d}(B^{n−d}) = R^{n−d}, 0 ≤ R ≤ 1, where B^{n−d} is the unit (n − d)-ball.

The radius of S ∩ M is just √(1 − ||x||²), and hence Vol(S ∩ M) = c_d × (1 − ||x||²)^{d/2}.

Denoting by S_r a d-plane whose associated x-value has ||x|| = r, and by c_d the constant in the volume formula for the d-dimensional unit ball, we have:

E[Vol(S ∩ S^{n−1})^t] = ∫_0^1 Vol(S_r ∩ S^{n−1})^t dP(r) = ∫_0^1 ( c_d (1 − r²)^{d/2} )^t × (n − d) r^{n−d−1} dr

= (c_d)^t × Γ(td/2 + 1) Γ((n−d)/2 + 1) / Γ((td + n − d)/2 + 1),   (2.46)

where the last equality is by Gauss's theorem for the Gauss hypergeometric function.

In particular,

E[Vol(S ∩ S^{n−1})] = c_d × Γ(d/2 + 1) Γ((n−d)/2 + 1) / Γ(n/2 + 1),   (2.47)

and Var[Vol(S ∩ S^{n−1})] = E[Vol(S ∩ S^{n−1})²] − E[Vol(S ∩ S^{n−1})]², i.e.,

Var[Vol(S ∩ S^{n−1})] = (c_d)² Γ(d + 1) Γ((n−d)/2 + 1) / Γ((n+d)/2 + 1)  −  ( c_d Γ(d/2 + 1) Γ((n−d)/2 + 1) / Γ(n/2 + 1) )².   (2.48)

Combining Equations 2.47 and 2.48 gives

Var[ Vol(S ∩ S^{n−1}) / E[Vol(S ∩ S^{n−1})] ] = E[Vol(S ∩ S^{n−1})²] / E[Vol(S ∩ S^{n−1})]² − 1
= Γ(d + 1) Γ(n/2 + 1)² / ( Γ(d/2 + 1)² Γ((n+d)/2 + 1) Γ((n−d)/2 + 1) ) − 1.   (2.49)

Applying Stirling's formula to each Gamma function in Equation 2.49 shows that

log( Var[ Vol(S ∩ S^{n−1}) / E[Vol(S ∩ S^{n−1})] ] + 1 ) = d log(2) + n log(n) − ((n+d)/2) log(n + d) − ((n−d)/2) log(n − d) + A,

where A = A(n, d) collects the logarithms of the Stirling prefactors and is of lower order in d. Using the elementary bound

log(x) − a/(x − a) ≤ log(x − a) ≤ log(x) − a/x  whenever a > 0 and x > a,

to control the lower-order corrections gives a double-sided inequality (Equation 2.52) for log(Var[Vol(S ∩ S^{n−1}) / E[Vol(S ∩ S^{n−1})]] + 1) whose leading term is

d × [ log(2) + (1/α) log(1/α) − (1/(2α) + 1/2) log(1/α + 1) − (1/(2α) − 1/2) log(1/α − 1) ] = d × φ(α),

after substituting α = d/n (Equation 2.53), with the remaining terms subexponential in d; these remaining terms are what determine the prefactors k(α, d) and K(α, d) in the statement of the theorem. This completes the proof of Theorem 8. □

2.3.7 Variance due to spherical-geometry kinematic measure concentration

The next result (Theorem 9 and Figure 2-8) deals with the spherical geometry case. As in the Euclidean case, the concentration of spherical-geometry kinematic measure causes the variance of the intersection volume to increase exponentially with the dimension d as well. (While we were able to derive the analytical expression for the variance of the intersection volumes (Theorem 9), which we used to generate the plot in Figure 2-8 showing an exponential increase in variance, we have not yet finished deriving an inequality analogous to Theorem 8 for the spherical geometry case. We hope to make the analogous result available soon in [43].)

In this example M (Figure 2-7, dark blue) is the boundary of a spherical cap of the unit n-sphere S^n (Figure 2-7, light blue), contained in a hyperplane at a distance h from the center of S^n. We want to calculate the volume of the intersection S ∩ M, to show the exponentially large variance which results from the concentration of measure of these intersection volumes. S ∩ M is a dimension d − 1 subsphere (Figure 2-7, top left, green), which we project down to a dimension-0 sphere consisting of two green points in the figures, in order to save our precious 3 dimensions to illustrate other features. To compute Vol(S ∩ M), we must find the radius r of the intersection S ∩ M. As a first step we will consider the e_1 component y of the maximum point (Figure 2-7, yellow) of S (Figure 2-7, top left, red) in the e_1 direction, where e_1 is defined to be the direction orthogonal to the hyperplane containing M. By congruence of the two triangles in the bottom-left diagram, we know that y is just ||P_S e_1||, the length of the projection of e_1 onto S̄ (Figure 2-7, red), the smallest Euclidean subspace containing S. By congruence of the two triangles in the bottom-right diagram we have that the length a of the dotted diagonal line is a = h/y. Drawing S^n from a different (3-dimensional) projection (Figure 2-7, top right) that contains a diameter of S ∩ M, we see by the Pythagorean theorem that S ∩ M has radius r = √(1 − a²).


Figure 2-7: Diagrams used to obtain the volume of the intersection of a great d-sphere S ⊂ S^n with the boundary of a spherical cap M ⊂ S^n (diagrams arranged counterclockwise from top-left).

P_S can be generated as the submatrix consisting of the first d + 1 rows of a random matrix Q sampled from the Haar measure on the special orthogonal group SO(n+1). Since the distribution of Q is invariant under action by orthogonal matrices in SO(n+1), each row of Q corresponds to a random vector on S^n, and is therefore distributed according to (N_1, ..., N_{n+1})^T / ||(N_1, ..., N_{n+1})^T||, where N_1, ..., N_{n+1} are independent standard normals. Hence, y = ||P_S e_1|| is distributed according to ||(N_1, ..., N_{d+1})^T|| / ||(N_1, ..., N_{n+1})^T|| = √(X/(X+Y)), where X ~ χ²_{d+1} and Y ~ χ²_{n−d} are independent. Hence, 1/y² = 1 + Y/X = 1 + ((n−d)/(d+1)) Z, where Z is F-distributed with parameters (n − d, d + 1), conditional on Z < ((d+1)/(n−d))(1/h² − 1) (i.e., conditional on y > h). Now we can use the fact that we know the distribution of y to obtain the variance of the intersection volume (Theorem 9 and Figure 2-8). The plot (Figure 2-8) shows that the variance of the intersection volume grows exponentially with the dimension d of S when d is not too large, and grows exponentially with the codimension n − d of S when the codimension n − d is not too large.

Theorem 9. (Variance for spherical-geometry kinematic measure) Let M be an (n−1)-dimensional subsphere of S^n ⊂ R^{n+1}, such that the hyperplane containing M lies at a distance h from the center of S^n. Let S be an isotropic random d-dimensional great subsphere of S^n. Then, with κ := (n−d)/(d+1) and z_max := ((d+1)(1−h²))/((n−d)h²),

Var[ Vol(S ∩ M) / E[Vol(S ∩ M)] ] = ( ∫_0^{z_max} (1 − h²(1 + κz))^{d−1} z^{(n−d)/2 − 1} (1 + κz)^{−(n+1)/2} dz × ∫_0^{z_max} z^{(n−d)/2 − 1} (1 + κz)^{−(n+1)/2} dz ) / ( ∫_0^{z_max} (1 − h²(1 + κz))^{(d−1)/2} z^{(n−d)/2 − 1} (1 + κz)^{−(n+1)/2} dz )² − 1.   (2.54)

Proof. From the discussion above (illustrated in Figure 2-7), we have

f_Z(z) := ( Γ((n+1)/2) / ( Γ((n−d)/2) Γ((d+1)/2) ) ) × κ^{(n−d)/2} × z^{(n−d)/2 − 1} (1 + κz)^{−(n+1)/2},   (2.55)

the density of the F-distribution with parameters (n − d, d + 1), where κ = (n−d)/(d+1). Since the intersection S ∩ M is a (d−1)-sphere of radius r(z) = √(1 − h²(1 + κz)), conditioning on the intersection event {Z < z_max} gives

E[ Vol(S ∩ M)^t | S ∩ M ≠ ∅ ] = ∫_0^{z_max} ( c_{d−1} r(z)^{d−1} )^t f_Z(z) dz / ∫_0^{z_max} f_Z(z) dz.   (2.56)

Hence,

Var[ Vol(S ∩ M) / E[Vol(S ∩ M)] ] = E[Vol(S ∩ M)² | ·] / E[Vol(S ∩ M) | ·]² − 1,   (2.57)

which, after cancelling the constants c_{d−1} and the normalizing constant of f_Z, is exactly Equation 2.54. □

We can now numerically evaluate the integrals in Equation 2.54 to obtain a plot (Figure 2-8) of Var( Vol(S ∩ M) / E[Vol(S ∩ M)] ) for different values of d:

Figure 2-8: This log-scale plot shows the variance of Vol(S ∩ M_i) normalized by its mean, when S is an isotropic random d-dimensional great subsphere of S^n, for different values of d, where n = 400. M_i is taken to be the boundary of a spherical cap of the unit sphere S^n with geodesic radius r(d) such that S has a 10% probability of intersecting M_i. The variance increases exponentially with the dimension d of the search-subspace (as long as d is not too close to n), leading to an exponential slowdown in the convergence for the traditional Gibbs sampling algorithm applied to the collection-of-spheres example of Section 2.3.6. Reweighting the intersection volumes with the Chern-Gauss-Bonnet curvature using Theorem 3 in this example (where M = ∪_i M_i is a collection of equal-radius subspheres M_i) causes each (nonempty) random intersection S ∩ M_i to have exactly the same reweighted intersection volume regardless of d, allowing us to avoid the exponential slowdown in the convergence speed that would otherwise arise from the variance in the intersection volumes.
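The integrals in Equation 2.54 can be evaluated with standard quadrature; the following sketch (with illustrative parameter choices and function names, and a fixed cap height h rather than the fixed 10% intersection probability used in Figure 2-8) reproduces the qualitative growth of the variance with d:

    import numpy as np
    from scipy.integrate import quad

    def spherical_intersection_variance(n, d, h):
        """Numerically evaluate Equation 2.54: the variance of the normalized intersection
        volume Vol(S ∩ M)/E[Vol(S ∩ M)], where M is the boundary of a spherical cap of S^n
        at height h and S is an isotropic random great d-subsphere."""
        kappa = (n - d) / (d + 1.0)
        z_max = (d + 1.0) * (1.0 - h**2) / ((n - d) * h**2)
        g = lambda z: z ** ((n - d) / 2.0 - 1.0) * (1.0 + kappa * z) ** (-(n + 1) / 2.0)
        radius_pow = lambda z, p: (1.0 - h**2 * (1.0 + kappa * z)) ** p
        I0, _ = quad(g, 0.0, z_max)
        I1, _ = quad(lambda z: radius_pow(z, (d - 1) / 2.0) * g(z), 0.0, z_max)
        I2, _ = quad(lambda z: radius_pow(z, d - 1.0) * g(z), 0.0, z_max)
        return I2 * I0 / I1**2 - 1.0

    if __name__ == "__main__":
        n, h = 100, 0.3
        for d in (5, 10, 20, 40):
            print(f"d={d:2d}: Var(Vol/E[Vol]) = {spherical_intersection_variance(n, d, h):.3e}")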

2.3.8 Theoretical bounds derived using Theorem 3 and algebraic geometry

Generalizing on bounds for lower-dimensional algebraic manifolds based on the Crofton formula (such as the bounds for tubular neighborhoods in [42] and [29]), it is also possible to use Theorem 3 to get a bound for the volume of an algebraic manifold M of given degree s, as long as one can also use analytical arguments to bound the second-order Chern-Gauss-Bonnet curvature reweighting factor on M for some convenient search-subspace dimension d:

Theorem 10. Let M ⊂ R^n be an algebraic manifold of degree s and codimension 1, such that E_Q[ |Pf(Ω_x(S_Q ∩ M))| × det(Proj_{M_x^⊥} Q) ] ≥ b for every x ∈ M, and the conditions of Theorem 3 are satisfied if we set ŵ(x; S) = |Pf(Ω_x(S ∩ M))|. Then

Vol(M) ≤ (1 / c_{d,k,n,K}) × (1/b) × ( s (s − 1)^d / 2 ) × Vol(S^d).   (2.58)

Proof. If we have an algebraic manifold of degree s in R^n, by Bezout's theorem the intersection with an arbitrary plane is also of degree s. Hence (at least in the case where M has codimension 1), we can use Risler's bound to bound the integral of the absolute value of the Gauss curvature over S ∩ M by a := ( s (s − 1)^d / 2 ) × Vol(S^d) [51, 49]. By Theorem 3,

Vol(M) = (1 / c_{d,k,n,K}) E_S[ ∫_{S ∩ M} |Pf(Ω_x(S ∩ M))| / E_Q[ |Pf(Ω_x(S_Q ∩ M))| × det(Proj_{M_x^⊥} Q) ] dVol_{d−k} ]
≤ (1 / c_{d,k,n,K}) × (1/b) × E_S[ ∫_{S ∩ M} |Pf(Ω_x(S ∩ M))| dVol_{d−k} ] ≤ (1 / c_{d,k,n,K}) × (1/b) × a.  □

Unlike a bound derived using only the Crofton formula for point intersections, the bound in Theorem 10 allows us to incorporate additional information about the curvature, so we suspect that this bound will be much stronger in situations where the curvature does not vary too much in most directions over the manifold. We hope to investigate examples of such manifolds in the future where we suspect Theorem 10 will provide stronger bounds, but do not pursue such examples here because it is beyond the scope of this chapter.

Part II

Numerical simulations

2.4 Random matrix application: Sampling the stochastic Airy operator

Oftentimes, one would like to know the distribution of the largest eigenvalues of a random matrix in the large-n limit, for instance when performing principal component analysis [34]. For a large class of random matrices that includes the Gaussian orthogonal/unitary/symplectic ensembles, and more generally the beta-ensemble point processes, the joint distribution of the largest eigenvalues converges in the large-n limit, after rescaling, to the so-called soft-edge limiting distribution (the single-largest eigenvalue's limiting distribution is the well-known Tracy-Widom distribution) [34, 58, 20, 21]. One way to learn about these distributions is to generate samples from certain large matrix models. One such matrix model that converges particularly fast to the large-n limit is the tridiagonal matrix discretization of the stochastic Airy operator of Edelman and Sutton [58, 20],

d²/dx² − x + (2/√β) dW,   (2.59)

where dW is the white noise process. We wish to study the distributions of eigenvalues at the soft edge conditioned on other eigenvalue(s).

To obtain samples from these conditional distributions, we can use Algorithm 8, which is straightforward to apply in this case since dW is already discretized as i.i.d. Gaussians.

The stochastic Airy operator (2.59) can be discretized as the tridiagonal matrix [20, 58]

A_β = (1/h²) A − h × diag(1, 2, ..., k) + (2/√(βh)) N,   (2.60)

where A is the k × k discretized Laplacian (with −2 on the diagonal and 1 on the first off-diagonals), N = diag(N(0,1) i.i.d.) is a diagonal matrix of independent standard normals, and the cutoff k is chosen (as in [20, 58]) to be k = 10 n^{1/3} (the O(n^{1/3}) cutoff is due to the decay of the eigenvectors corresponding to the largest eigenvalues, which decay like the Airy function, causing only the first O(n^{1/3}) entries to be computationally significant).
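For concreteness, the discretization (2.60) can be assembled in a few lines; the sketch below assumes the standard mesh size h = n^{-1/3} (not restated in this excerpt) and uses illustrative function names. For β = 2 the sample mean of the largest eigenvalue should be close to the Tracy-Widom (GUE) mean of about −1.77.

    import numpy as np

    def stochastic_airy_matrix(n, beta, rng):
        """Tridiagonal discretization (Equation 2.60) of the stochastic Airy operator,
        assuming the standard mesh size h = n**(-1/3) and cutoff k = 10 * n**(1/3)."""
        h = n ** (-1.0 / 3.0)
        k = int(np.ceil(10 * n ** (1.0 / 3.0)))
        lap = (np.diag(-2.0 * np.ones(k)) + np.diag(np.ones(k - 1), 1)
               + np.diag(np.ones(k - 1), -1)) / h**2
        potential = -h * np.diag(np.arange(1, k + 1))
        noise = (2.0 / np.sqrt(beta)) / np.sqrt(h) * np.diag(rng.standard_normal(k))
        return lap + potential + noise

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        n, beta, n_samples = 500, 2.0, 200
        top = [np.linalg.eigvalsh(stochastic_airy_matrix(n, beta, rng))[-1]
               for _ in range(n_samples)]
        # For beta = 2 the largest eigenvalue converges to the Tracy-Widom GUE law,
        # whose mean is approximately -1.77.
        print(f"sample mean of largest eigenvalue: {np.mean(top):.3f}")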

2.4.1 Approximate sampling algorithm implementation

The discretized stochastic operator A_β is a function of spherical (i.i.d.) Gaussians (Equation 2.60). Since, conditional on their magnitude, these Gaussians are uniformly distributed on a sphere, we can use the following modification of Algorithm 8 to sample A_β conditional on our eigenvalue constraints of interest, after first independently sampling their χ_k-distributed magnitude (Figure 2-10):

Algorithm 12 Great sphere sampler (with weights). Algorithm 12 is identical to Algorithms 4 and 8, except for the following steps:

output: x_1, x_2, ..., with associated weights w_1, w_2, ..., having (weighted) distribution π

3: Sample x_{j+1} uniformly from S ∩ M (using a subroutine Markov chain or another sampling method). Set the weight w_{j+1} = π(x_{j+1}).

To simplify the algorithm, in our simulations we will use a deterministic nonlinear solver with random starting points in place of the nonlinear solver-based MCMC "Metropolis" subroutine of Algorithm 12 to get an approximate sampling. This is somewhat analogous to setting both the hot and cold baths in a simulated annealing-based (see, for instance, [28]) "Metropolis step" in a Metropolis-within-Gibbs algorithm to zero temperature, since we are setting the randomness of the Metropolis subroutine to zero while fully retaining the randomness of the search-subspaces.
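The following sketch illustrates this deterministic-solver-with-random-search-subspace idea in its simplest form (a d = 1 great circle through the current noise vector, rather than the higher-dimensional search-subspaces used in our simulations): it moves along a random great circle to a point where the largest eigenvalue of the discretized operator (2.60) equals a target value. The function names, the target value, and the choice h = n^{-1/3} are illustrative assumptions, and the integral geometry reweighting of the samples is omitted here.

    import numpy as np
    from scipy.optimize import brentq

    def airy_matrix_from_noise(g, n, beta):
        """Discretized stochastic Airy operator (2.60) built from a fixed noise vector g,
        assuming h = n**(-1/3)."""
        h, k = n ** (-1.0 / 3.0), g.shape[0]
        lap = (np.diag(-2.0 * np.ones(k)) + np.diag(np.ones(k - 1), 1)
               + np.diag(np.ones(k - 1), -1)) / h**2
        return lap - h * np.diag(np.arange(1, k + 1)) + (2.0 / np.sqrt(beta * h)) * np.diag(g)

    def great_circle_step(g, target, n, beta, rng, n_scan=200):
        """One simplified (d = 1) search-subspace step: move along a random great circle
        through the current noise vector g (the norm of g, i.e. the chi-distributed
        magnitude, is preserved) to a point where the largest eigenvalue equals `target`.
        Returns the new noise vector, or None if no crossing is detected."""
        r = np.linalg.norm(g)
        v = rng.standard_normal(g.shape[0])
        v -= (v @ g) / r**2 * g            # make v orthogonal to g
        v *= r / np.linalg.norm(v)
        point = lambda t: np.cos(t) * g + np.sin(t) * v
        f = lambda t: np.linalg.eigvalsh(airy_matrix_from_noise(point(t), n, beta))[-1] - target
        ts = np.linspace(0.02, 2 * np.pi - 0.02, n_scan)   # exclude t = 0 (the current point)
        vals = np.array([f(t) for t in ts])
        for i in range(n_scan - 1):
            if vals[i] * vals[i + 1] < 0.0:
                return point(brentq(f, ts[i], ts[i + 1]))
        return None

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        n, beta, target = 200, 2.0, -1.0     # condition on lambda_1 = -1.0
        k = int(np.ceil(10 * n ** (1.0 / 3.0)))
        g = rng.standard_normal(k)
        for _ in range(5):
            new_g = great_circle_step(g, target, n, beta, rng)
            if new_g is not None:
                g = new_g
                print("lambda_1 =", np.linalg.eigvalsh(airy_matrix_from_noise(g, n, beta))[-1])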


Figure 2-9: In Algorithm 12 the random great circle (red) intersects the constraint manifold (the blue ribbon, which represents the level set {g : λ = 3} in this example) at different points, generating samples (green dots). The constraint manifold has a different (differential) thickness at different points. Instead of weighting the green dots by the (differential) intersection length of the great circle and the constraint manifold at the green dot, Crofton's formula allows Algorithm 12 to instead weight them by the local differential thickness, greatly reducing the variation in the weights.

Remark 6. Using a deterministic solver with a random starting point in place of the more random nonlinear solver-based "Metropolis" Markov chain subroutine of Algorithm 12 introduces some bias in the samples, since the nonlinear solver probably will not find each point in the intersection S_{i+1} ∩ M ∩ rS_i with equal probability. There is nothing preventing us from using a more random Markov chain in place of the deterministic solver, which one would normally do. However, since we only wanted to compare weighting schemes, we can afford to use a more deterministic solver in order to simplify the numerical implementation for the time being, as the implementation of the "Metropolis" step would be beyond the scope of this chapter. It is important to note that this bias is not a failure of the reweighting scheme, but rather just a consequence of using a purely deterministic solver in place of the "Metropolis" step. On the contrary, we will see in Sections 2.5 and 2.6 that this bias is in fact much smaller than the bias present when the traditional weighting scheme is used together with the same deterministic solver. In the future, we plan to also perform numerical simulations with a random "Metropolis" step in place of the deterministic solver, as described in Algorithm 12.

2.5 Conditioning on multiple eigenvalues

In the first simulation (Figure 2-10), we sampled the fourth-largest eigenvalue conditioned on the remaining 1st- through 7th-largest eigenvalues. We begin with this example since in this particular situation, when conditioned only on the 3rd and 5th eigenvalues, the 4th eigenvalue is not too strongly dependent on the other eigenvalues (the intuition for this reasoning comes from the fact that the eigenvalues behave as a system of repelling particles with only weak repulsion, so the majority of the interaction involves the immediate neighbors of λ_4). Hence, in this situation, we are able to test the accuracy of the local solver approximation by comparison to brute force rejection sampling. Of course, in a more general situation where we do not have these relatively weak conditional dependencies, rejection sampling would be prohibitively slow (e.g., even if we allow a 10% probability interval for each of the six eigenvalues, conditioning on all six eigenvalues gives a box that would be rejection-sampled with probability 10^{-6}).

Despite the fact that the integral geometry algorithm is solving for 6 different eigenvalues simultaneously, the conditional probability density histogram obtained using Algorithm 12 with the integral geometry weights (Figure 2-10, blue) agrees closely with the conditional probability density histogram obtained using rejection sampling (Figure 2-10, black). Weighting the exact same data points obtained with Algorithm 12 with the traditional weights instead yields a probability density his- togram (Figure 2-10, red) that is much more skewed to the right than either the black or blue curves. This is probably because, while theoretically unbiased, the traditional weights greatly amplify a small bias in the nonlinear solver's selection of intersection points.

Figure 2-10: In this simulation we used Algorithm 12 together with both the traditional weights (red) and the integral geometry weights (blue) to plot the histogram of λ_4 | (λ_1, λ_2, λ_3, λ_5, λ_6, λ_7) = (−2, −3.5, −4.65, −7.9, −9, −10.8). We also provided a histogram obtained using rejection sampling of the approximated conditioning λ_4 | (λ_3, λ_5) ∈ [−4.65 ± 0.001] × [−7.9 ± 0.001] (black) for comparison (conditioning on all six eigenvalues would have caused rejection sampling to be much too slow). Since we used a deterministic solver in place of the Metropolis subroutine in Algorithm 12, some bias is expected for both reweighting schemes. Despite this, we see that the integral geometry histogram agrees closely with the approximated rejection sampling histogram, but the traditional weights lead to an extremely skewed histogram. This is probably because, while theoretically unbiased, the traditional weights greatly amplify a small bias in the nonlinear solver's selection of intersection points. The skewness is especially large (in comparison to Figure 2-11) because we are conditioning on 6 eigenvalues simultaneously.

2.6 Conditioning on a single-eigenvalue rare event

In this set of simulations (Figure 2-11), we sampled the second-largest eigenvalue conditioned on the largest eigenvalue being equal to −2, 0, 2, and 5. Since λ_1 = 5 is a very rare event, we do not have any reasonable chance of finding a point in the intersection of the codimension-1 constraint manifold {λ_1 = 5} with the search-subspace unless we use a search-subspace of dimension d >> 1. Indeed, the analytical solution for λ_1 tells us that P(λ_1 > 2) = 1 × 10^{-4}, P(λ_1 > 4) = 5 × 10^{-8}, and P(λ_1 > 5) < 8 × 10^{-10} [17]. For this same reason, rejection sampling for λ_1 = 2 is very slow (58 sec./sample vs. 0.25 sec./sample for Algorithm 12), and we cannot hope to perform rejection sampling for λ_1 = 5 (it would have taken about 84 days to get a single sample!). To allow us to make a histogram in a reasonable amount of time, we will use Algorithm 12 with search-subspaces of dimension d = 23 >> 1, vastly increasing the probability of the random search-subspace intersecting M. In (Figure 2-11, top), we see that while the rejection sampling (black) and integral geometry weight (blue) histograms of the density of λ_2 | λ_1 = 2 are fairly close to each other, the plot obtained with the exact same data as the blue plot but weighted in the traditional way (red) is much more skewed to the right and less smooth than both the black and blue curves, implying that using the integral geometry weights from Theorem 1 greatly reduces bias and increases the convergence speed. (Although the red curve is not as skewed as in Figure 2-10 of Section 2.5. This is probably because in this situation the codimension of M is 1, while in Section 2.5 the codimension was 6.)

In (Figure 2-11, middle), where we conditioned instead on λ_1 = 5, we see that solving from a random starting point but not restricting oneself to a random search-subspace (purple plot) causes huge errors in the histogram of λ_2 | λ_1. We also see that, as in the case of λ_1 = 2, the plot of λ_2 obtained with the traditional weights is much more skewed to the right and less smooth than the plot obtained using the integral geometry weights. In (Figure 2-11, bottom), we apply our Algorithm 12 to study the behavior of λ_2 | λ_1 for values of λ_1 at which it would be difficult to obtain accurate curves with traditional weights or rejection sampling. We see that as we move λ_1 to the right, the variance of λ_2 | λ_1 increases and the mean shifts to the right. One explanation for this is that the largest and third-largest eigenvalues normally repel the second-largest eigenvalue, squeezing it between the largest and third-largest eigenvalues, which reduces the variance of λ_2 | λ_1. Hence, moving the largest eigenvalue to the right effectively "decompresses" the probability density of the second-largest eigenvalue, increasing its variance. Moving the largest eigenvalue to the right also allows the second-largest eigenvalue's mean to move to the right by reducing the repulsion from the right caused by the largest eigenvalue.

Remark 7. As discussed in Remark 6 of Section 2.4.1, if we wanted to get a perfectly accurate plot, we would still need to use a randomized solver, such as a subroutine

Figure 2-11: Histograms of λ_2 | λ_1 = −2, 0, 2, and 5, generated using Algorithm 12. A search-subspace of dimension d = 23 was used, allowing us to sample the rare event λ_1 = 5. In the first plot (top) we see that the rejection sampling histogram of λ_2 | λ_1 = 2 is much closer to the histogram obtained using the integral geometry weights (blue) than to the histogram obtained with the traditional weights (red), because the red plot is much more skewed to the right and less smooth (it takes longer to converge) than either the blue or black plots. If we do not constrain the solver to a random search-subspace, the histogram we get for λ_2 | λ_1 = 5 (purple) is very skewed to the right (middle plot), implying that using a random search-subspace (as opposed to just a random starting point) greatly helps in the mixing of our samples towards the correct distribution. As an application of our algorithm, in the last plot (bottom), the probability densities of λ_2 | λ_1 obtained with the integral geometry weights show that moving the largest eigenvalue to the right has the effect of increasing the variance of the probability density of λ_2 | λ_1 and moving its mean to the right, probably because the second eigenvalue feels less repulsion from the largest eigenvalue as λ_1 → ∞.

Markov chain, to randomize over the intersection points. Since d = 23, the volumes of the exponentially many connected submanifolds in the intersection S_{i+1} ∩ M would be concentrated in just a few of these submanifolds, with the concentration being exponential in d, causing the algorithm to be prohibitively slow for d = 23 unless we use Algorithm 11, which uses the Chern-Gauss-Bonnet curvature reweighting of Theorem 3 (see Section 2.3.6). Hence, if we were to implement the randomized solver of Algorithm 12, the red curve would converge extremely slowly unless we reweighted according to Theorem 3 (in addition to Theorem 1). Hence, the situation for the traditional weights is in fact much worse in comparison to the integral geometry weights of Theorems 1 and 3 than even (Figure 2-11, middle) would suggest.

Acknowledgements

We gratefully acknowledge support from NSF DMS-1312831. Oren Mangoubi was supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program.

Chapter 3

Mixing Times of Hamiltonian Monte Carlo

3.1 Introduction

Hamiltonian Monte Carlo (also called Hybrid Monte Carlo or HMC) algorithms are some of the most widely-used [26, 14, 44] MCMC algorithms. In this chapter we derive lower bounds for the mixing times of a large class of Hamiltonian Monte Carlo algorithms sampling from an arbitrary probability density π, including the traditional Isotropic-Momentum HMC algorithm [19], Riemannian Manifold HMC [27, 25], and the No-U-Turn Sampler [32], the workhorse of the popular Bayesian software package Stan [10] (Section 3.2). We do so by applying the continuity equation, a generalization of the divergence theorem used extensively in fluid mechanics. For comparison, we also prove lower bounds for the mixing times of the Random Walk Metropolis MCMC algorithm (Section 3.4). Since true mixing times, in the narrow sense of the word, usually do not exist for continuous state-space Markov chains, we use the term "mixing time" here in the broader sense to refer to the relaxation time t_rel := 1/ρ, defined as the inverse of the spectral gap ρ; the term "mixing times" is oftentimes used more loosely to include a variety of measures of convergence times, including relaxation times [40]. Denoting the Markov transition kernel by P(·,·), the spectral gap ρ is the largest number

such that

||μP||_{L²(π)} ≤ (1 − ρ) ||μ||_{L²(π)}

for every signed measure μ with μ(R^d) = 0 [52]. If P has a second-largest eigenvalue λ_2 (for example, if P is the transition matrix of a finite state space Markov chain), then ρ = 1 − λ_2. Geometric ergodicity of HMC algorithms was proved under very general conditions in [41, 9], implying the existence of a non-zero spectral gap under those conditions [52]. Cheeger's inequality [11, 37, 55] provides bounds for the spectral gap in terms of the bottleneck ratio Φ(S) of a subset S of the state space, a quantity proportional to the probability that the Markov chain at the stationary distribution transitions between S and S^c:

(Φ*)²/2 ≤ ρ ≤ 2Φ*,   (3.1)

where Φ* := min_{S ⊂ R^d} Φ(S). In Section 3.2, we derive bounds for the spectral gap by using the symplectic volume-conservation properties of the Hamiltonian in the phase space to obtain an equation for the bottleneck ratio [40] and then applying Cheeger's inequality to bound ρ.

3.2 Hamiltonian Monte Carlo mixing times

In this section we derive equations for the bottleneck ratio. First, we define the following terms:

" Nas and Ng,) are the number of times a random trajectory intersects aS and an E-ball of (q, p), respectively.

* Pq is the component of p in the direction orthogonal to oS at q.

" Ps+(q) is the half-space of momentum vectors pointing away from S at q E aS.

Definition 3. (of Q and Qqp)) Let F be the probability measure on the random trajectories -y at stationary distribution conditioned to intersect aS at least once. Define Q to be the probability measure whose density is proportional to dP(-y) - Nas(7y).

72 Similarly, let lP,,,) be the probability measure on the random trajectories -y at sta- tionary distribution conditioned to intersect BE(q, p) nS x Rd at least once. Let Q'qp) to be the probability measure whose density is proportional to dP',)(Y) N6,,(-).4

Theorem 11. Isotropic-Momentum and Riemannian Manifold HMC.

Let π(q) be any probability density on R^d. Let S ⊂ R^d be any subset of the position space. Then the bottleneck ratio for an HMC Markov chain with fixed trajectory evolution time T and any smooth phase-space stationary distribution π(q, p) satisfies

Φ(S) = ( Φ^+ / π(S) ) × E_Q[ (1/N_∂S) · 1{N_∂S odd} ],   (3.2)

where the total positive flux Φ^+ is

Φ^+ = T ∫_{∂S} ∫_{P^+_{S^c}(q)} π(q, p) × ( ∂ log(π(q, p)) / ∂p_q ) dp dq.

In the case of Isotropic-Momentum HMC, Φ^+ reduces to

Φ^+ = ( T / √(2π) ) ∫_{∂S} π(q) dq.

The term E_Q[ (1/N_∂S) · 1{N_∂S odd} ] can be interpreted as the average periodicity of the Hamiltonian trajectories comprising each iteration of the algorithm. Observing that E_Q[ (1/N_∂S) · 1{N_∂S odd} ] ≤ 1 gives an upper bound for the bottleneck ratio. Numerical simulations for various two-mode densities that approximate Gaussian mixture models (Figure 3-1) suggest that this bound is nearly tight in many cases where T is not too large. To extend Theorem 11 to algorithms with non-fixed trajectory time, such as No-U-Turn HMC, we express T as a function T(q, p):

Theorem 12. (General HMC, including No-U-Turn HMC)

Let S ⊂ R^d be any subset of the position space. Then the bottleneck ratio Φ(S) of an HMC Markov chain with trajectory evolution time T(q, p) is

Φ(S) = (1/π(S)) ∫_{∂S} ∫_{P^+_{S^c}(q)} π(q, p) × ( ∂ log(π(q, p)) / ∂p_q ) × T(q, p) × lim_{ε↓0} E_{Q^ε_{(q,p)}}[ (1/N_∂S) · 1{N_∂S odd} ] dp dq.   (3.3)

The proofs of Theorems 11 and 12 make use of the continuity equation, a generalization of the divergence theorem that says that the change in probability measure of a subset S of the position space at any given time is equal to the flux of the probability measure flowing through ∂S [50, 33]. Also essential is the fact that the symplectic phase-space volume-preserving properties of Hamiltonian mechanics imply that the stationary distribution of HMC is equal to the invariant measure of Hamilton's equations at every point in time of the trajectory's evolution.

Proof. (of Theorem 11) Define Φ^+ to be the total flux flowing into (but not out of) S^c during one step of the algorithm (or equivalently, by reversibility, the flux into S).

First, note that by reversibility, P(N_∂S = n, γ(0) ∈ S) = P(N_∂S = n, γ(0) ∈ S^c). Hence, P(N_∂S = n) = P(N_∂S = n, γ(0) ∈ S) + P(N_∂S = n, γ(0) ∈ S^c) = 2 P(N_∂S = n, γ(0) ∈ S), i.e.,

P(N_∂S = n, γ(0) ∈ S) = P(N_∂S = n, γ(0) ∈ S^c) = (1/2) P(N_∂S = n).   (3.4)

Define the measure Q̃ by dQ̃(γ) := N_∂S(γ) · dP(γ) / (2Φ^+) at every trajectory γ.

By the continuity equation, counting the crossings of ∂S made by each trajectory,

Φ^+ = Σ_{n=1,3,...} [ ((n+1)/2) P(N_∂S = n, γ(0) ∈ S) + ((n−1)/2) P(N_∂S = n, γ(0) ∈ S^c) ] + Σ_{n=2,4,...} [ (n/2) P(N_∂S = n, γ(0) ∈ S) + (n/2) P(N_∂S = n, γ(0) ∈ S^c) ]

= Σ_{n=1,2,3,...} (n/2) P(N_∂S = n),

using Equation 3.4. Hence Σ_{n=1,2,3,...} Q̃(N_∂S = n) = Σ_{n} (n / (2Φ^+)) P(N_∂S = n) = 1, so the measure Q̃ is a probability measure, and hence Q̃ = Q.

Therefore,

P(γ(T) ∈ S^c, γ(0) ∈ S) = Σ_{n=1,3,...} P(N_∂S = n, γ(0) ∈ S) = (1/2) Σ_{n=1,3,...} P(N_∂S = n) = Σ_{n=1,3,...} (Φ^+/n) Q(N_∂S = n) = Φ^+ × E_Q[ (1/N_∂S) · 1{N_∂S odd} ].

All that remains to be done is to compute Φ^+.

At any time t the time derivative of the total flux into S^c is

dΦ^+/dt = ∫_{∂S} ∫_{P^+_{S^c}(q)} π(q, p) × v^+(q, p)^T η(q) dp dq,   (3.5)

and hence

Φ^+ = ∫_0^T ∫_{∂S} ∫_{P^+_{S^c}(q)} π(q, p) × v^+(q, p)^T η(q) dp dq dt = T × ∫_{∂S} ∫_{P^+_{S^c}(q)} π(q, p) × v^+(q, p)^T η(q) dp dq.   (3.6)

Applying Hamilton's equations gives

v^+(q, p)^T η(q) = ( ∂ log(π(q, p)) / ∂p_q ) × 1_{P^+_{S^c}(q)}(p),   (3.7)

so

Φ^+ = T ∫_{∂S} ∫_{P^+_{S^c}(q)} π(q, p) × ( ∂ log(π(q, p)) / ∂p_q ) dp dq.

For Isotropic-Momentum HMC, π(q, p) = π(q) × (2π)^{−d/2} e^{−||p||²/2}, and the marginal of p_q is a standard normal, so

Φ^+ = T ∫_{∂S} π(q) dq × (1/√(2π)) ∫_0^∞ y e^{−y²/2} dy = ( T / √(2π) ) ∫_{∂S} π(q) dq.   (3.8)

Proof. (of Theorem 12) Define Φ^+_n to be the total flux going into S^c during one step of the algorithm (or equivalently, by reversibility, the flux into S) due to only those trajectories for which N_∂S = n. Define Φ^{ε,+}_{(q,p)} to be the total flux flowing into S^c during one step of the algorithm through B_ε(q, p) ∩ (∂S × R^d), and let Φ^+_{(q,p)} := lim_{ε↓0} Φ^{ε,+}_{(q,p)} / Vol(B_ε(q, p)) denote the corresponding flux density. Then

P(γ(T(·)) ∈ S^c, γ(0) ∈ S) = Σ_{n=1,3,...} P(N_∂S = n, γ(0) ∈ S)

= Σ_{n=1,3,...} (1/n) Φ^+_n   (by the continuity equation)

= Σ_{n=1,3,...} (1/n) ∫_{∂S} ∫_{R^d} Φ^+_{(q,p)} × lim_{ε↓0} Q^ε_{(q,p)}(N_∂S = n) dp dq   (by the law of total probability)

= ∫_{∂S} ∫_{R^d} Φ^+_{(q,p)} × Σ_{n=1,3,...} (1/n) lim_{ε↓0} Q^ε_{(q,p)}(N_∂S = n) dp dq   (by the monotone convergence theorem)

= ∫_{∂S} ∫_{R^d} Φ^+_{(q,p)} × lim_{ε↓0} E_{Q^ε_{(q,p)}}[ (1/N_∂S) · 1{N_∂S odd} ] dp dq.

All that remains to be done is to compute Φ^+_{(q,p)}.

At any time t the time derivative of the flux density is

dΦ^+_{(q,p)}/dt = π(q, p) × 1{T(q, p) ≤ t} × v^+(q, p)^T η(q),   (3.9)

and hence

Φ^+_{(q,p)} = ∫_0^{T(q,p)} π(q, p) × v^+(q, p)^T η(q) dt = T(q, p) × π(q, p) × v^+(q, p)^T η(q).   (3.10)

Applying Hamilton's equations gives

v^+(q, p)^T η(q) = ( ∂ log(π(q, p)) / ∂p_q ) × 1_{P^+_{S^c}(q)}(p),

so

Φ^+_{(q,p)} = T(q, p) × π(q, p) × ( ∂ log(π(q, p)) / ∂p_q ) × 1_{P^+_{S^c}(q)}(p).   (3.11)

Substituting Equation 3.11 into the expression above and dividing by π(S) gives Equation 3.3. □

One consequence of this bound is that the time it takes energy-conserving Hamil- tonian Markov chains to search for far-apart sub-Gaussian modes grows exponentially with both the dimension and the distance between modes, resolving an open question posed by Prof. Neil Shephard at Harvard.

Equation 3.3 also suggests that an optimal HMC algorithm should minimize a particular function of the periodicity of each Hamiltonian trajectory. In the future, we plan to investigate to what extent the No-U-Turn algorithm, the default algorithm in the widely-used Stan software [10], approaches this optimality, and whether one can design a better algorithm closer to optimality. In many applications, the following bound for the bottleneck ratio of HMC may also be helpful:

Theorem 13. Let S ⊂ R^d be any subset of the state space. Then the bottleneck ratio Φ(S) of an HMC Markov chain (including Isotropic-Momentum, Riemannian Manifold and No-U-Turn HMC) with any stationary distribution π satisfies

Φ(S) ≤ (1/π(S)) ∫_S [ 1 − F_{χ²}(E − U(q)) ] π(q) dq,   (3.12)

where U(q) := −log(π(q)), E := min_{q∈∂S} U(q), and F_{χ²} is the CDF of the χ² random variable.

Proof. No trajectory starting at q_0 ∈ S can exit S if it does not start with energy at least E := min_{q∈∂S} U(q). Hence, the probability of exiting S starting at the stationary distribution π must be at most P_π({H(q_0, p_0) ≥ E} ∩ {q_0 ∈ S}) = ∫_S (1 − F_{χ²}(E − U(q))) π(q) dq; dividing by π(S) gives Equation 3.12. □

Finally, we provide a simulation (Figure 3-1) of isotropic HMC sampling of a two- mode density. The results of the simulation agree closely with Theorems 11 and 12, and illustrate the various components of Equations 3.3 and 3.2.
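For the two-mode density of Figure 3-1, the upper bound of Theorem 11 (together with the crude bound E_Q[(1/N_∂S) 1{N_∂S odd}] ≤ 1 and the choice S = (−∞, 0]) can be turned into an explicit lower bound on the relaxation time; the short sketch below does this computation, with an illustrative function name and illustrative parameter values:

    import numpy as np
    from scipy.stats import norm

    def relaxation_time_lower_bound(a, T):
        """Lower bound on the HMC relaxation time for the two-mode density
        pi(q) proportional to max(phi(q - a), phi(q + a)) of Figure 3-1, via Theorem 11
        with S = (-inf, 0] and the crude bound E_Q[(1/N) 1{N odd}] <= 1:
            Phi(S) <= T * pi(0) / (sqrt(2*pi) * pi(S)),    t_rel = 1/rho >= 1/(2*Phi(S))."""
        Z = 2.0 * norm.cdf(a)                 # normalizing constant of max(phi(q-a), phi(q+a))
        pi_at_boundary = norm.pdf(a) / Z      # density at the bottleneck point q = 0
        pi_S = 0.5                            # pi((-inf, 0]) = 1/2 by symmetry
        phi_S_upper = T * pi_at_boundary / (np.sqrt(2.0 * np.pi) * pi_S)
        return 1.0 / (2.0 * phi_S_upper)

    if __name__ == "__main__":
        for a in (1.0, 2.0, 3.0, 4.0):
            print(f"a = {a}: t_rel >= {relaxation_time_lower_bound(a, 1.0):.3e}")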

3.3 Cheeger bounds in extreme cases

If the mean step size T = ε is small, the spectral gap is close to the Cheeger upper bound 2Φ*. Why is this true? First of all, as ε ↓ 0 the expectation in the integrand of Theorem 12 approaches 1 (i.e., there are no "u-turns" as ε ↓ 0, since all paths are nearly

Figure 3-1: In this simulation we computed the spectral gap for the Isotropic-Momentum HMC algorithm with the stationary distribution π(q) ∝ max( f_{N(0,1)}(q − a), f_{N(0,1)}(q + a) ), for different distances 2a between the two modes and different Hamiltonian trajectory times T. As predicted in Theorem 11, the spectral gap is bounded above by a linear function of T, and in fact increases linearly with T when a = 0 for small enough T. The approximate periodicity in T is due to the fact that the trajectories here are (approximately) periodic, meaning that the conditional expectation term in Equation 3.2 varies (approximately) periodically with T. The exponential decay in a² is explained by the fact that ∫_{∂S} π(q) dq ∝ f_{N(0,1)}(a), so the corresponding term in Equation 3.2 (if we choose ∂S = {0}, the halfway point between the two modes) decreases exponentially in a². Note that π(q) approximates the Gaussian mixture model φ(q) = (1/2) f_{N(0,1)}(q − a) + (1/2) f_{N(0,1)}(q + a); indeed the difference between the two tends to 0 as a → ∞. The plots were generated by numerically diagonalizing an analytical solution for the transition matrix of the HMC Markov chain.

Hence, ignoring constant factors, by Theorem 12, Φ_* ~ ε for small enough ε.

If we multiply the integration time T = ε by n, the algorithm (which approximates a diffusion for small T = ε) travels on average n times the distance in one step. However, if we instead take n independent steps of size ε in succession, the mean distance traveled is only √n·ε, so we need to take n² steps of size ε to achieve the same average displacement as a single step of size nε. Indeed, if Φ_* = ε, then the upper bound is Φ_*² = ε², but if Φ_* = nε then the upper bound is n²ε². So, to get a spectral gap of n²ε² using steps of size ε, we would need to apply the transition matrix n² times (i.e., take n² steps): λ^{n²} ≈ (1 − ε²)^{n²} = 1 − n²ε² + l.o.t. ≈ 1 − n²ε².

On the other extreme, if there are two sets S_* and S_*^c for which Φ(S_*) is much smaller than the inverse of the relaxation time of the chain restricted to either S_* or S_*^c by itself, then we would expect the spectral gap for the entire space S_* ∪ S_*^c to be close to Φ(S_*). Indeed, this behavior has been proved in the case of a lazy random walk on two discrete tori glued together at a single vertex [40]. We conjecture that this behavior will occur for HMC as well, for instance, in the case of a two-mode density with a deep valley, with one mode in the region S_* and the other mode in S_*^c (assuming T is not too small).
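As a quick numerical check of the step-size argument above (the values below are illustrative assumptions, not taken from the thesis), a few lines of Python confirm that applying a chain whose per-step spectral gap is ε² for n² steps produces an effective gap of roughly n²ε²:

# Per-step gap eps^2, applied n^2 times, gives an effective gap of about n^2 * eps^2.
eps, n = 1e-3, 20
per_step_gap = eps ** 2
effective_gap = 1.0 - (1.0 - per_step_gap) ** (n ** 2)
print(effective_gap, n ** 2 * eps ** 2)   # both are approximately 4.0e-4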

3.4 Random Walk Metropolis mixing times

In this section we show two bounds for the RWM algorithm. Together, the bounds suggest that the relaxation time of RWM grows exponentially with both the dimension d and the distance between modes, even if one uses a mean step size ε that is optimal. Theorem 14 shows that the relaxation time grows with both the dimension d and with the distance between modes for small ε. Theorem 15 shows that the relaxation time grows with the dimension d for all other (i.e., non-small) ε.
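Before stating the bounds, here is a minimal Random Walk Metropolis sketch (the function, parameter values, and target below are illustrative assumptions, not code from the thesis) for a one-dimensional two-mode Gaussian mixture; running it shows how rarely the chain moves between far-apart modes, which is the behavior the two theorems quantify:

import numpy as np

def rwm(log_pi, q0, eps, n_steps, rng):
    # Minimal Random Walk Metropolis with N(0, eps^2) proposals (1-D sketch).
    q, lp = q0, log_pi(q0)
    samples = np.empty(n_steps)
    for i in range(n_steps):
        prop = q + eps * rng.standard_normal()
        lp_prop = log_pi(prop)
        if np.log(rng.uniform()) < lp_prop - lp:      # Metropolis accept/reject
            q, lp = prop, lp_prop
        samples[i] = q
    return samples

a = 4.0
log_pi = lambda q: np.logaddexp(-0.5 * (q - a) ** 2, -0.5 * (q + a) ** 2)  # unnormalized
rng = np.random.default_rng(0)
samples = rwm(log_pi, q0=-a, eps=1.0, n_steps=200_000, rng=rng)
print("fraction of time in the right mode:", np.mean(samples > 0))
print("number of mode switches:", int(np.sum(np.diff(np.sign(samples)) != 0)))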

Theorem 14. Consider a RWM Markov chain q₁, q₂, … with proposal distribution q_{i+1} − q_i ~ 𝒩(0, ε)^d for every i ∈ ℕ. Let S be a subset of the state space. Then


In particular, for every r > 0, we have

Proof. Let q₁, q₂, … be a RWM Markov chain. We wish to bound

P(q_{i+1} ∈ S^c, q_i ∈ S). Suppose that the step size satisfies ||q_i − q_{i+1}|| ≤ r. Then whenever both q_{i+1} ∈ S^c and q_i ∈ S, it must be true that either q_i ∈ ∂S + B_r or q_{i+1} ∈ ∂S + B_r. Hence,

P(q_{i+1} ∈ S^c, q_i ∈ S | ||q_{i+1} − q_i|| ≤ r) ≤ P(q_i ∈ ∂S + B_r) + P(q_{i+1} ∈ ∂S + B_r) = 2π(∂S + B_r).   (3.15)

Equations 3.13 and 3.14 now follow directly from the law of total probability.

Theorem 15. Let π(q) = Σ_{k=1}^{k_max} c_k π_k(q), with Σ_k c_k = 1, be a Gaussian mixture model, where π_k has covariance matrix Σ_k and mean a_k. Let the RWM proposal have covariance matrix ΔΣ.

Then, for every η, δ > 0,

P_accept ≤ δ + min_k Σ_{j=1}^{k_max} c_j φ_{j,k}(η) + Σ_{k=1}^{k_max} Σ_{j=1}^{k_max} c_j ψ_{j,k}(δ),

where

φ_{j,k}(η) := exp( − [min{ −2|Σ_j| d^{3/2} + √(4|Σ_j|² d³ − 8|Σ_j|(|Σ_j| d − (π_k^{−1}(η))²)), 0 }]² / (4|Σ_j|) ),

ψ_{j,k}(δ) := exp( − (min{ |Σ_j + ΔΣ| d − (π_k^{−1}(δ/(c_k k_max)))², 0 })² / (2d|Σ_j + ΔΣ|) ),

and

(π_k^{−1}(t))² := −2|Σ_k| log( (2π)^{d/2} |Σ_k|^{1/2} t ).

Proof. Let X be distributed according to the stationary distribution (that is, an index j is sampled at random with probability c_j, and then a Gaussian random variable X = X_j is sampled from X_j ~ π_j), and let Y be the independent random jump proposal (so that the next proposed move is the point X + Y).

Then

P_accept = E[ min{1, π(X + Y)/π(X)} ] ≤ δ · P({π(X) > η} ∩ {π(X + Y) ≤ δ}) + 1 − P({π(X) > η} ∩ {π(X + Y) ≤ δ})

≤ δ + P(π(X) < η) + P(π(X + Y) > δ).

Now, {Σ_k c_k π_k(X) < η} ⊂ {c_k π_k(X) < η} for every k, so

P(π(X) < η) ≤ min_k P(π_k(X) < η) = min_k Σ_j c_j P(π_k(X_j) < η).

By Equation 4.3 of [36], if Z² ~ χ²_d, then

P(|Σ_j| Z² > t) ≤ exp( − [min{ −2|Σ_j| d^{3/2} + √(4|Σ_j|² d³ − 8|Σ_j|(|Σ_j| d − t)), 0 }]² / (4|Σ_j|) ),

so

P(π_k(X_j) < η) = P(|Σ_j| Z² > (π_k^{−1}(η))²) ≤ exp( − [min{ −2|Σ_j| d^{3/2} + √(4|Σ_j|² d³ − 8|Σ_j|(|Σ_j| d − (π_k^{−1}(η))²)), 0 }]² / (4|Σ_j|) ) = φ_{j,k}(η).

All that remains to be done is to bound P(π(X + Y) > δ).

To do so, we observe that {Σ_k c_k π_k(X + Y) > δ} ⊂ ∪_k {c_k π_k(X + Y) > δ/k_max}, so

P(π(X + Y) > δ) ≤ Σ_k P(π_k(X + Y) > δ/(c_k k_max)) = Σ_k Σ_j c_j P(π_k(X_j + Y) > δ/(c_k k_max)),

but (for fixed covariance determinants |Σ_k| and |Σ_j + ΔΣ|), P(π_k(X_j + Y) > δ/(c_k k_max)) is maximized if the means of X_j + Y and π_k are the same and the covariance matrices are both multiples of the identity matrix. Hence, we may replace ||X_j + Y||² with |Σ_j + ΔΣ| Z² without decreasing P(π_k(X_j + Y) > δ/(c_k k_max)). By Equation 4.4 of [36], P(|Σ_j + ΔΣ| Z² < t) ≤ exp( − (min{|Σ_j + ΔΣ| d − t, 0})² / (2d|Σ_j + ΔΣ|) ), so

P(π_k(X_j + Y) > δ/(c_k k_max)) = P(|Σ_j + ΔΣ| Z² < (π_k^{−1}(δ/(c_k k_max)))²)

≤ exp( − (min{ |Σ_j + ΔΣ| d − (π_k^{−1}(δ/(c_k k_max)))², 0 })² / (2d|Σ_j + ΔΣ|) ) = ψ_{j,k}(δ).  □

Acknowledgement

We gratefully acknowledge support from ONR 14-0001 and NSF DMS-1312831.

Chapter 4

A Generalization of Crofton's Formula to Hamiltonian Dynamics, with Applications to Hamiltonian Monte Carlo

4.1 Introduction

In this chapter we prove generalizations of Crofton's formula (Theorems 16 and 17) that apply to particle trajectories in Hamiltonian Dynamics. We then use one of these new formulae (Theorem 16) to increase the efficiency of computing integrals over codimension-1 manifolds with Hamiltonian Monte Carlo algorithms.

4.2 Crofton formulae for Hamiltonian dynamics

Theorem 16. Let M be a codimension-1 submanifold of the position space ℝⁿ. Let γ be a random Hamiltonian trajectory with Hamiltonian energy functional H(q, p) = U(q) + ½pᵀp. Then

∫_M π(q) dq = (c/T) · E[N_M],   (4.1)

where N_M is the number of times γ intersects M (counted with multiplicity) and c := c_{1,n−1,n,ℝ} is the same constant used in the classical Crofton formula (Lemma 1).

Remark 8. If M = ∂S, then Theorem 16 gives the expected number of times that γ crosses from S to S^c or vice versa. If the stationary density π and integration time T are such that γ never crosses ∂S more than once, then Theorem 16 and Theorem 11 imply each other in this special case.

Remark 9. The classical Crofton formula for lines (Lemma 1 with k = 1) can be viewed as a corollary of Theorem 16: if we choose the Hamiltonian potential to be uniform over a compact set Q, the Hamiltonian trajectories will be composed of lines that move with the kinematic measure, conditioned to intersect Q. Since the potential is uniform, we have ∫_M π(q) dq · Vol(Q) = Vol(M), the value on the LHS of the classical Crofton formula.
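As a concrete illustration of Remark 9 in the planar case, the following sketch (all choices below are illustrative assumptions) estimates the circumference of a circle by counting intersections with random lines, writing out the standard Cauchy–Crofton normalization explicitly rather than the constant c of Lemma 1:

import numpy as np

rng = np.random.default_rng(1)
rho, R = 1.0, 3.0            # circle of radius rho; sampled lines meet a disk of radius R
n_lines = 200_000

# A line is parametrized by its signed offset p from the origin (the angle is irrelevant
# for a circle centered at the origin); the kinematic measure of lines with |p| <= R is pi*2R.
p = rng.uniform(-R, R, n_lines)
n_intersections = 2.0 * (np.abs(p) < rho)        # such a line meets the circle exactly twice

# Cauchy-Crofton: length = (1/2) * integral of (number of intersections) d(kinematic measure)
length_estimate = 0.5 * (np.pi * 2 * R) * n_intersections.mean()
print(length_estimate, 2 * np.pi * rho)          # estimate vs. exact circumference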

Proof (of Theorem 16). Let γ be a random Hamiltonian trajectory over time [0, T] with position and momentum at time 0 sampled from the stationary density π(q, p) = π(q) · f_{𝒩(0,1)ⁿ}(p). Let N_M be the number of intersection points of γ with M, and let (q_i, p_i) be the phase space coordinates of γ at its i-th intersection point with M. Since Hamiltonian flow preserves the stationary distribution, γ is at the stationary distribution at any time t ∈ [0, T], so for any function h we have

∫_M ∫_{ℝⁿ} h(q, p) π(q, p) dp dq = (1/T) · E[ Σ_{i=1}^{N_M} h(q_i, p_i) / ||proj_{M⊥} p_i|| ],   (4.2)

where proj_{M⊥} denotes the projection onto the normal vector to M at q. Also, since the marginal stationary density of the momentum, f_{𝒩(0,1)ⁿ}(p), is independent of the position q and is rotationally invariant with respect to p, for every q ∈ M we have

∫_{ℝⁿ} ||proj_{M⊥} p|| · f_{𝒩(0,1)ⁿ}(p) dp = ∫_{ℝⁿ} ||proj_{e₁} p|| · f_{𝒩(0,1)ⁿ}(p) dp = 1/c,   (4.3)

where e₁ := (1, 0, …, 0)ᵀ is a coordinate vector. Hence, setting h(q, p) := ||proj_{M⊥} p||, we get

∫_M π(q) dq = ∫_M π(q) · 1 dq
(by Eq. 4.3) = ∫_M π(q) · c · ∫_{ℝⁿ} ||proj_{M⊥} p|| · f_{𝒩(0,1)ⁿ}(p) dp dq
= c · ∫_M ∫_{ℝⁿ} ||proj_{M⊥} p|| · π(q) · f_{𝒩(0,1)ⁿ}(p) dp dq
= c · ∫_M ∫_{ℝⁿ} ||proj_{M⊥} p|| · π(q, p) dp dq
(by Eq. 4.2) = (c/T) · E[ Σ_{i=1}^{N_M} ||proj_{M⊥} p_i|| / ||proj_{M⊥} p_i|| ]
= (c/T) · E[N_M].  □

More generally, for arbitrary Hamiltonians we have

Theorem 17. Let M be a codimension-1 submanifold of the position space ℝⁿ. Let γ be a random Hamiltonian trajectory with arbitrary Hamiltonian energy functional H(q, p). Then

∫_M π(q) dq = (1/T) · E[ Σ_{i=1}^{N_M} π(q_i) / ∫_{ℝⁿ} ||v_⊥(q_i, p)|| π(q_i, p) dp ],   (4.4)

where q_i is the position where γ intersects M for the i-th time, and N_M is the number of intersections (counted with multiplicity). Here v(q, p) := dq/dt = ∂H/∂p is the velocity given by Hamilton's equations at position q and momentum p, and ||v_⊥(q, p)|| is the magnitude of the component of v in the direction orthogonal to the tangent space of M at q_i.

Proof. The proof of Theorem 17 follows the same steps as the proof of Theorem 16: Let γ be a random Hamiltonian trajectory over time [0, T] with position and momentum at time 0 sampled from the stationary density π(q, p). Let N_M be the number of intersection points of γ with M, and let (q_i, p_i) be the phase space coordinates of γ at its i-th intersection point with M.

Since Hamiltonian flow preserves the stationary distribution, γ is at the stationary distribution at any time t ∈ [0, T], so for any function h we have

∫_M ∫_{ℝⁿ} h(q, p) π(q, p) dp dq = (1/T) · E[ Σ_{i=1}^{N_M} h(q_i, p_i) / ||v_⊥(q_i, p_i)|| ].   (4.5)

Hence, setting h(q, p) := ( π(q) / ∫_{ℝⁿ} ||v_⊥(q, p′)|| π(q, p′) dp′ ) · ||v_⊥(q, p)||, we get

∫_M π(q) dq = ∫_M π(q) · 1 dq
= ∫_M ( π(q) / ∫_{ℝⁿ} ||v_⊥(q, p)|| π(q, p) dp ) · ∫_{ℝⁿ} ||v_⊥(q, p)|| π(q, p) dp dq
= ∫_M ∫_{ℝⁿ} h(q, p) π(q, p) dp dq
(by Eq. 4.5) = (1/T) · E[ Σ_{i=1}^{N_M} h(q_i, p_i) / ||v_⊥(q_i, p_i)|| ]
= (1/T) · E[ Σ_{i=1}^{N_M} π(q_i) / ∫_{ℝⁿ} ||v_⊥(q_i, p)|| π(q_i, p) dp ].  □

4.3 Manifold integration using HMC and the Hamiltonian Crofton formula

We now state a conventional method of using Hamiltonian Monte Carlo to compute integrals over a submanifold (Algorithm 13).

Algorithm 13 Integration on a submanifold with HMC
input: q₀, oracle for π : ℝ^d → [0, ∞), oracle for intersections with M
output: Estimator A for ∫_M π(q) dq.
define: H(q, p) := −log(π(q)) + ½pᵀp.

1: for i = 1, 2, … do
2:   sample independent p_i ~ 𝒩(0, I)^d
3:   integrate the Hamiltonian trajectory (q(t), p(t)) with Hamiltonian H over the time interval [0, T] and initial conditions (q(0), p(0)) = (q_i, p_i)
4:   set q_{i+1} = q(T)
5:   compute the sequence of intersection angles θ₁ⁱ, θ₂ⁱ, … of the trajectory with M
6: end for
7: compute A from the intersection angles, weighting the j-th intersection of the i-th trajectory by 1/sin(θ_jⁱ)

Algorithm 13 requires the use of the weights 1/sin(θᵢ). Unfortunately, these weights have infinite variance, greatly slowing the algorithm (Var(1/sin(θᵢ)) = ∞, as shown in Section 2.2.2 when analyzing the classical Crofton formula). To eliminate these weights we can apply our Hamiltonian Crofton formula (Theorem 16) to obtain an algorithm with a much faster-converging estimator for the integral (Algorithm 14):
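To see the problem numerically, the following sketch samples crossing angles from the length-biased density ½ sin θ on [0, π] (an assumption made here only for illustration, in the spirit of the kinematic-measure analysis cited above); the empirical mean of the weights 1/sin θ stabilizes near π/2, but their empirical variance keeps growing with the sample size because E[1/sin² θ] diverges:

import numpy as np

rng = np.random.default_rng(2)
# Inverse-CDF sampling for the density (1/2) sin(theta) on [0, pi]: F(theta) = (1 - cos(theta))/2.
u = rng.uniform(size=10_000_000)
theta = np.arccos(1.0 - 2.0 * u)
w = 1.0 / np.sin(theta)

for n in [10**4, 10**5, 10**6, 10**7]:
    print(n, w[:n].mean(), w[:n].var())   # mean ~ pi/2; variance does not stabilize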

Algorithm 14 Crofton formula integration on a submanifold with HMC
Algorithm 14 is identical to Algorithm 13, except for the following steps:
output: Estimator (c/(T·n))·N_M for ∫_M π(q) dq, where n is the number of trajectories run.
5: compute the number N_i (with multiplicity) of intersections of the i-th trajectory with M
7: compute N_M := Σ_i N_i
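The following is a minimal, self-contained sketch of a Crofton-formula estimator in the spirit of Algorithm 14 (everything below, including the target, the manifold, the leapfrog step size, and the per-trajectory averaging, is an illustrative assumption rather than code from the thesis). The target π is the standard Gaussian on ℝ², M is the circle of radius r, so ∫_M π(q) dq = r·e^{−r²/2} in closed form, and the constant is taken as c = 1/E|𝒩(0,1)| = √(π/2), which is what Equation 4.3 gives in this isotropic-momentum setting:

import numpy as np

rng = np.random.default_rng(3)
d, r, T, dt = 2, 1.5, 1.0, 0.01            # dimension, radius of M, trajectory time, leapfrog step
n_traj = 100_000
n_steps = int(round(T / dt))

# Independent stationary starts (q, p) ~ pi(q, p) = N(0, I_d) x N(0, I_d).
q = rng.standard_normal((n_traj, d))
p = rng.standard_normal((n_traj, d))

# Leapfrog for H(q, p) = 0.5*||q||^2 + 0.5*||p||^2 (standard Gaussian target), counting
# sign changes of ||q(t)|| - r, i.e. crossings of the circle M = {||q|| = r}.
crossings = np.zeros(n_traj)
sign_prev = np.sign(np.linalg.norm(q, axis=1) - r)
p = p - 0.5 * dt * q                        # initial half kick; grad U(q) = q
for k in range(n_steps):
    q = q + dt * p                          # drift
    p = p - dt * q if k < n_steps - 1 else p - 0.5 * dt * q   # full kick (half on last step)
    sign_new = np.sign(np.linalg.norm(q, axis=1) - r)
    crossings += sign_prev * sign_new < 0
    sign_prev = sign_new

c = np.sqrt(np.pi / 2.0)                    # 1 / E|N(0,1)|, as in Equation 4.3
estimate = (c / T) * crossings.mean()       # crossings averaged over trajectories
exact = r * np.exp(-r ** 2 / 2.0)           # integral of pi over the circle of radius r
print(estimate, exact)                      # the two agree to within Monte Carlo error

Because the estimator only counts crossings, it avoids the 1/sin(θ) weights of Algorithm 13 entirely.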

Acknowledgements

We gratefully acknowledge support from NSF DMS-1312831.

Chapter 5

A Hopf Fibration for β-Ghost Gaussians

5.1 Introduction

There is a wonderful geometrical construction known as the Hopf fibration. The Hopf map is a continuous map from the 3-sphere to the 2-sphere where every point on the 2-sphere is the image of a circle on the 3-sphere. This Hopf map has the nice property that if a point is uniformly distributed on the 3-sphere, its image is uniformly distributed on the 2-sphere. The entire Hopf story can be expressed quickly and elegantly with quaternions. In quaternion language, consider the map that takes z to z i z̄, where z is a unit quaternion (|z| = 1). By taking its conjugate, it is easy to see that z i z̄ is a unit quaternion with zero real part (these are known as versors). Identifying unit quaternions with the 3-sphere, and those with zero real part with the 2-sphere, we have our Hopf map. Notice that for a fixed unit quaternion q, the map from z with zero real part to q z q̄ is a linear function of z which preserves |z|. The matrix representation is a 3x3 rotation matrix. This association is widely used in practice in computer graphics and other fields. Given any 3x3 rotation matrix, there is a well-known construction to go the other way based on computing the eigenvector (the axis of the rotation) and the eigenvalues (which encode the angle of rotation).
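The uniformity property just described is easy to check numerically; the sketch below (all names and sample sizes are illustrative) samples z uniformly on the 3-sphere by normalizing a 4-dimensional Gaussian, applies z ↦ z i z̄ with the Hamilton product written out by hand, and confirms empirically that the image has zero real part and a uniform height coordinate, as it must for a uniform point on the 2-sphere (a quick proof of the uniformity follows below):

import numpy as np

def quat_mul(x, y):
    # Hamilton product of quaternions stored as (..., 4) arrays [w, i, j, k].
    w1, x1, y1, z1 = np.moveaxis(x, -1, 0)
    w2, x2, y2, z2 = np.moveaxis(y, -1, 0)
    return np.stack([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2], axis=-1)

rng = np.random.default_rng(4)
z = rng.standard_normal((500_000, 4))
z /= np.linalg.norm(z, axis=1, keepdims=True)         # uniform on the 3-sphere

i_unit = np.broadcast_to(np.array([0.0, 1.0, 0.0, 0.0]), z.shape)
z_conj = z * np.array([1.0, -1.0, -1.0, -1.0])
h = quat_mul(quat_mul(z, i_unit), z_conj)             # z * i * conj(z)

print(np.abs(h[:, 0]).max())                          # ~ 0: the image has zero real part
print(np.quantile(h[:, 3], [0.25, 0.5, 0.75]))        # ~ [-0.5, 0.0, 0.5]: height is uniform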

We offer a quick proof based on orthogonal invariance that if z is uniformly distributed on the 3-sphere, then z i z̄ is uniformly distributed on the 2-sphere. As just discussed, any orthogonal rotation of the non-real coordinates of z i z̄ may be written as q(z i z̄)q̄ = (qz) i (qz)̄. Now if z is uniform on the 3-sphere, so is qz, and thus q(z i z̄)q̄ has the same distribution as z i z̄, implying that it is uniformly distributed on the 2-sphere. This proof is very elegant and worth reading a few times. The very fact that the geometry of Hopf is so nicely encoded in quaternions inspired us to ask what happens if we replace quaternions with general β ghosts.

5.2 Defining the β-dimensional algebra

Let β be an even integer. Write 𝔸 = ℝ^β = A × B × C × D,

where A = span{1}, B = span{i},

C = span{j₁, …, j_{(β−2)/2}} ≅ ℝ^{(β−2)/2}, and D = span{k₁, …, k_{(β−2)/2}} ≅ ℝ^{(β−2)/2}.

Definition 4. (multiplication) The only multiplication allowed in this algebra is multiplication of an element of 𝔸 by elements of span{1, i}, as well as any multiplication where the output only has elements of 𝔸 = A × B × C × D (e.g., r² and rir are allowed for any r ∈ 𝔸). Left multiplication is defined for all s, t ∈ {1, …, (β−2)/2} as:

i² = j_t² = k_t² = −1   (5.1)

i j_t = k_t   (5.2)

i k_t = −j_t   (5.3)

and right multiplication is defined by

j_t i = −k_t   (5.4)

k_t i = j_t   (5.5)

Finally, we assume that all orthogonal pure imaginary components are anticommutative, although we only allow such operations if the end results cancel so that the output is in A × B × C × D:

i_s i_t = −i_t i_s

j_s j_t = −j_t j_s   (5.6)

i_s j_t = −j_t i_s

Associativity of multiplication is assumed as well (but NOT commutativity). Addition is done as in a vector space. Finally, we define the conjugation operation on an element Z = a + bi + cr, where a, b, c ∈ ℝ and r ∈ S^{β−3} ⊂ C × D, by Z̄ := a − bi − cr. The following theorems (Theorems 18–20) give algebraic properties of the β-dimensional algebra that generalize properties of the quaternion algebra. These properties will come in handy when generalizing the Hopf fibration to ℝ^β.

Theorem 18. r² = −1 and (ir)² = −1.

Proof. 1 = |r|² = r·r̄ = r·(−r) = −r², hence r² = −1.

Since ir is pure imaginary (no real part) (this is because we are assuming that the pure imaginary (β−2)-sphere is closed under orthogonal multiplication), 1 = |ir|² = (ir)·(−ir) = −(ir)². Hence, (ir)² = −1.  □

Theorem 19. ir = −ri.

Proof. The product of (i + r) with its conjugate is |i + r|² ∈ ℝ,

but (i + r)(−i − r) = −i² − r² − ir − ri = 2 − ir − ri.

Hence, since ir and ri are non-real (since Equations 5.2–5.4 imply that the pure imaginary (β−2)-sphere is closed under orthogonal multiplication), and |i + r|² is real, it must be true that −ir − ri = 0. Hence, ir = −ri.  □

Theorem 20. |xr + yir| = √(x² + y²) for all x, y ∈ ℝ.

Proof. |xr + yir|² = (xr + yir)(−xr − yir) = −x²r² − xy·i r² − xy·r i r − y²(ir)² = x² + xyi − xyi + y² = x² + y².  □

5.3 Hopf Fibration on ℝ^β

We begin by defining a version of the Hopf fibration ℋ : ℝ^β → ℝ^{β−1} for even integers β by generalizing the quaternion representation of the β = 4 Hopf fibration. Let Z ~ 𝒩(0, 1)_β (where "Z ~ 𝒩(0, 1)_β" means that Z is a random variable sampled from the distribution 𝒩(0, 1)_β). Then

Z = a + bi + cr,

where a, b ~ 𝒩(0, 1), c ~ χ_{β−2}, and r ~ Uniform(S^{β−3}) ⊂ C × D are independently distributed.

We can now define the Hopf map ℋ : ℝ^β → ℝ^{β−1} by ℋ(Z) := Z i Z̄.

Theorem 21.

ℋ(Z) := Z i Z̄ = (a² + b² − c²)i + 2bc·r − 2ac·ir ~ (a² + b² − c²)i + 2c√(a² + b²)·r.

Proof.

Z i Z̄ = (a + bi + cr)·i·(a − bi − cr) = (a + bi + cr)(ai + b − c·ir)
= a²i + ab − ac·ir − ab + b²i − bc·i²r + ac·ri + bc·r − c²·rir
= (a² + b²)i − ac·ir + bc·r − ac·ir + bc·r + c²·r²i
= (a² + b² − c²)i + 2bc·r − 2ac·ir
~ (a² + b² − c²)i + 2c√(a² + b²)·r.  □
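Since the span of {1, i, r, ir} multiplies like the quaternion basis {1, i, j, k} under the identification r ↔ j and ir ↔ k (compare Definition 4 and Theorems 18–20), the algebra in the proof above can be sanity-checked numerically with quaternions standing in for r. The sketch below is an illustration under that identification, with hypothetical names; it verifies that Z i Z̄ = (a² + b² − c²)i + 2bc·r − 2ac·ir for randomly chosen a, b, c:

import numpy as np

def qmul(x, y):
    # Hamilton product of two quaternions [w, i, j, k].
    w1, x1, y1, z1 = x
    w2, x2, y2, z2 = y
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

rng = np.random.default_rng(5)
a, b, c = rng.standard_normal(3)

Z      = np.array([a, b, c, 0.0])                     # Z = a + b*i + c*r, with j standing in for r
Z_conj = np.array([a, -b, -c, 0.0])
i_unit = np.array([0.0, 1.0, 0.0, 0.0])

lhs = qmul(qmul(Z, i_unit), Z_conj)                   # Z * i * conj(Z)
rhs = np.array([0.0, a**2 + b**2 - c**2, 2*b*c, -2*a*c])   # (a^2+b^2-c^2)i + 2bc*r - 2ac*ir
print(np.allclose(lhs, rhs))                          # True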

In particular, Theorem 21 shows that the Hopf map eliminates the real component, just as it does in the quaternion (β = 4) case. More generally, Theorem 21 allows us to compare the distribution of Z i Z̄ for different β. Towards this end, let W be the i-component of ℋ(Z)/|ℋ(Z)|. Then

W = imag_i(Z i Z̄)/|Z i Z̄| = (X − Y)/(X + Y), where X := a² + b² ~ χ²_2 and Y := c² ~ χ²_{β−2}.

A quick multivariable integral computation gives the densities f_W and f_|W| of W and |W|, respectively:

f_W(t) = ((β − 2)/4) · ((1 − t)/2)^{(β−4)/2},   t ∈ [−1, 1],

and

f_|W|(t) = ((β − 2)/4) · [ ((1 + t)/2)^{(β−4)/2} + ((1 − t)/2)^{(β−4)/2} ],   t ∈ [0, 1].

In particular, f_W(t) is constant for β = 4, has negative second derivative for 4 < β < 6, is linear for β = 6, and has positive second derivative for β > 6. f_|W| is uniform for both β = 4 and β = 6, and likewise has negative second derivative for 4 < β < 6 and positive second derivative for β > 6.
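The formula for f_W can be checked directly by Monte Carlo using the chi-squared representation of X and Y above (the script below is an illustrative sketch; the names and sample sizes are arbitrary). It also uses the equivalent representation (1 + W)/2 = X/(X + Y) ~ Beta(1, (β − 2)/2), which follows from the standard fact about ratios of independent chi-squared variables:

import numpy as np
from scipy import stats

beta_dim = 8                                     # any even beta > 2; beta = 4 is the quaternion case
rng = np.random.default_rng(6)

X = stats.chi2.rvs(2, size=1_000_000, random_state=rng)
Y = stats.chi2.rvs(beta_dim - 2, size=1_000_000, random_state=rng)
W = (X - Y) / (X + Y)

# Compare the empirical density of W with f_W(t) = ((beta-2)/4) * ((1-t)/2)**((beta-4)/2).
hist, edges = np.histogram(W, bins=200, range=(-1, 1), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
f_W = (beta_dim - 2) / 4 * ((1 - centers) / 2) ** ((beta_dim - 4) / 2)
print(np.max(np.abs(hist - f_W)))                # small: Monte Carlo plus binning error

# Cross-check: (1 + W)/2 = X/(X + Y) is Beta(1, (beta-2)/2) distributed.
print(stats.kstest((1 + W) / 2, stats.beta(1, (beta_dim - 2) / 2).cdf).statistic)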

The (β − 2)-dimensional surface area density p(t) of ℋ(Z)/|ℋ(Z)| on the (β − 2)-sphere at height t on the i-axis is:

p(t) = f_W(t) / Vol(P_t ∩ S^{β−2}) = (β − 2) / ( 4 · S^{β−3} · 2^{(β−4)/2} · (1 + t)^{(β−4)/2} ),   t ∈ [−1, 1],

where P_t is a (β − 2)-plane at distance t from the origin and S^{β−3} := Vol(S^{β−3}) = 2π^{(β−2)/2} / Γ((β−2)/2).

In particular, 0 < p(t) < ∞ everywhere except at t = −1, where p(−1) = ∞. This suggests that ℋ has a dimension reduction of 1 everywhere except at the "south pole" of the (β − 2)-sphere (−i), where the dimension is reduced by β − 2 (r gets mapped to −i for any (β − 2)-dimensional "phase" r. However, for instance, only the circle {a + bi : a² + b² = 1} gets mapped to +i, so the map is well-behaved at +i because the dimension goes down by 1.)

Moreover, this fact, together with the fact that ℋ(a + bi + cr) = (a² + b² − c²)i + 2bc·r − 2ac·ir, suggests that ℋ is analytic everywhere except at −i ∈ S^{β−2}.

Acknowledgements

We gratefully acknowledge support from NSF DMS-1312831.

Bibliography

[1] J. C. Alvarez Paiva and E. Fernandes. Gelfand transforms and Crofton formulas. Selecta Math. (N.S.), 13(3):369-390, 2007.

[2] Dennis Amelunxen and Martin Lotz. Computational kinematics. manuscript in preparation.

[3] Dennis Amelunxen and Martin Lotz. A comment on "Integral geometry for MCMC". Private correspondence.

[4] C. Andrieu, N. de Freitas, A. Doucet, and M.I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5-43, 2003.

[5] R.H. Baayen, D.J. Davidson, and D.M. Bates. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59:390-412, 2008.

[6] Julian Besag. Markov chain Monte Carlo for statistical inference. Technical report, University of Washington, Department of Statistics, 04 2001.

[7] Louis J Billera and Persi Diaconis. A geometric interpretation of the Metropolis-Hastings algorithm. Statistical Science, 16(4):335-339, 2001.

[8] F. Bornemann. On the numerical evaluation of distributions in random matrix theory: a review. Markov Process. Related Fields, 16(4):803-866, 2010.

[9] Nawaf Bou-Rabee and Jesús María Sanz-Serna. Randomized Hamiltonian Monte Carlo. arXiv preprint arXiv:1511.09382v1, 2015.

[10] Bob Carpenter, A Gelman, M Hoffman, D Lee, B Goodrich, M Betancourt, M Brubaker, J Guo, P Li, and A Riddell. Stan: a probabilistic programming language. Journal of Statistical Software, in press, 2015.

[11] Jeff Cheeger. A lower bound for the smallest eigenvalue of the Laplacian. In Problems in analysis (Papers dedicated to Salomon Bochner, 1969), pages 195-199. Princeton Univ. Press, Princeton, N. J., 1970.
[12] Ming-Hui Chen, Qi-Man Shao, and Joseph G. Ibrahim. Monte Carlo methods in Bayesian computation. Springer Series in Statistics. Springer-Verlag, New York, 2000.

[13] Shiing-shen Chern. On the curvatura integra in a Riemannian manifold. Ann. of Math. (2), 46:674-684, 1945.

[14] Sai Hung Cheung and James L Beck. Bayesian model updating using hybrid Monte Carlo simulation with application to structural dynamic models with many uncertain parameters. Journal of engineering mechanics, 135(4):243-255, 2009.

[15] Anders S. Christensen, Troels E. Linnet, Mikael Borg, Kresten Lindorff-Larsen, Wouter Boomsma, Thomas Hamelryck, and Jan H. Jensen. Protein structure validation and refinement using amide proton chemical shifts derived from quantum mechanics. PLoS ONE, 8(12):1-10, 2013.

[16] Neil J. Cornish and Edward K. Porter. MCMC exploration of supermassive black hole binary inspirals. Classical Quantum Gravity, 23(19):761-767, 2006.

[17] Morgan W. Crofton. On the theory of local probability, applied to straight lines drawn at random in a plane; the methods used being also extended to the proof of certain new theorems in the integral calculus. Philosophical Transactions of the Royal Society of London, 158:181-199, 1868.

[18] Persi Diaconis, Susan Holmes, and Mehrdad Shahshahani. Sampling from a manifold. In Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton, pages 102-125. Institute of Mathematical Statistics, 2013.

[19] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.

[20] Alan Edelman and Brian D. Sutton. From random matrices to stochastic operators. J. Stat. Phys., 127(6):1121-1165, 2007.

[21] Alan Edelman and Brian D. Sutton. From random matrices to stochastic operators. J. Stat. Phys., 127(6):1121-1165, 2007.

[22] Israel M. Gelfand and Mikhail M. Smirnov. Lagrangians satisfying Crofton formulas, Radon transforms, and nonlocal differentials. Adv. Math., 109(2):188-227, 1994.

[23] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, (6):721-741, 1984.

[24] Charles J. Geyer. Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pages 156-163, 1991.

[25] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123-214, 2011.

[26] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123-214, 2011.

[27] Mark Girolami, Ben Calderhead, and Siu A Chin. Riemannian Manifold Hamiltonian Monte Carlo. arXiv preprint, 2009.

[28] Yongtao Guan and Stephen M. Krone. Small world MCMC and convergence to multi-modal distributions: from slow mixing to fast mixing. Ann. Appl. Probab., 17(1):284-304, 2007.

[29] Larry Guth. Degree reduction and graininess for Kakeya-type sets in R³. Preprint, arXiv:1402.0518, 2014.

[30] Sigurdur Helgason. Integral geometry and Radon transforms. Springer, New York, 2011.

[31] Jody Hey and Rasmus Nielsen. Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proceedings of the National Academy of Sciences of the United States of America, 104(8):2785-2790, 2006.

[32] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research, 15(1):1593-1623, 2014.

[33] JH Irving and John G Kirkwood. The statistical mechanical theory of transport processes. IV. The equations of hydrodynamics. The Journal of Chemical Physics, 18(6):817-829, 1950.

[34] Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist., 29(2):295-327, 2001.

[35] Daphne Koller and Nir Friedman. Probabilistic Graphical Models. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2009. Principles and techniques.

[36] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28(5):1302-1338, 2000.

[37] Gregory F Lawler and Alan D Sokal. Bounds on the L2 spectrum for Markov chains and Markov processes: a generalization of Cheeger's inequality. Transactions of the American Mathematical Society, 309(2):557-580, 1988.

[38] Michel Ledoux. The concentration of measure phenomenon, volume 89 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2001.

[39] Tony Lelièvre, Mathias Rousset, and Gabriel Stoltz. Free energy computations: A Mathematical Perspective. Imperial College Press, 2010.

[40] David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov chains and mixing times. American Mathematical Soc., 2009.

[41] Samuel Livingstone, Michael Betancourt, Simon Byrne, and Mark Girolami. On the geometric ergodicity of Hamiltonian Monte Carlo. arXiv preprint arXiv:1601.08057, 2016.

[42] Martin Lotz. On the volume of tubular neighborhoods of real algebraic varieties. Proc. Amer. Math. Soc., 143(5):1875-1889, 2015.

[43] Oren Mangoubi. Concentration of kinematic measure. manuscript in preparation.

[44] B Mehlig, DW Heermann, and BM Forrest. Hybrid Monte Carlo method for condensed-matter systems. Physical Review B, 45(2):679, 1992.
[45] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087-1092, 1953.
[46] V. D. Milman. A new proof of A. Dvoretzky's theorem on cross-sections of convex bodies. Funkcional. Anal. i Priložen., 5(4):28-37, 1971.

[47] Boaz Nadler. On the distribution of the ratio of the largest eigenvalue to the trace of a Wishart matrix. J. Multivariate Anal., 102(2):363-371, 2011. [48] Radford M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov chain Monte Carlo, Chapman & Hall/CRC Handb. Mod. Stat. Methods, pages 113-162. CRC Press, Boca Raton, FL, 2011.

[49] S. Yu. Orevkov. Sharpness of Risler's upper bound for the total curvature of an affine real algebraic hypersurface. Uspekhi Mat. Nauk, 62(2):169-170, 2007.
[50] Hannes Risken. Fokker-Planck equation. Springer, 1984.
[51] Jean-Jacques Risler. On the curvature of the real Milnor fiber. Bull. London Math. Soc., 35(4):445-454, 2003.

[52] Gareth O. Roberts and Jeffrey S. Rosenthal. Geometric ergodicity and hybrid Markov chains. Electron. Comm. Probab, 2(2):13-25, 1997.

[53] Luis A. Santaló. Integral geometry and geometric probability. Cambridge Mathematical Library. Cambridge University Press, Cambridge, second edition, 2004. With a foreword by Mark Kac.

[54] Rolf Schneider and Wolfgang Weil. Stochastic and integral geometry. Probability and its Applications (New York). Springer-Verlag, Berlin, 2008.
[55] Alistair Sinclair and Mark Jerrum. Approximate counting, uniform generation and rapidly mixing Markov chains. Information and Computation, 82(1):93-133, 1989.

[56] Michael Spivak. A comprehensive introduction to differential geometry. Vol. III. Publish or Perish, Inc., Wilmington, Del., second edition, 1979.

[57] Michael Spivak. A comprehensive introduction to differential geometry. Vol. V. Publish or Perish, Inc., Wilmington, Del., second edition, 1979.

[58] Brian D. Sutton. The stochastic operator approach to random matrix theory. ProQuest LLC, Ann Arbor, MI, 2005. Thesis (Ph.D.)-Massachusetts Institute of Technology.

[59] Mihai Tibar and Dirk Siersma. Curvature and Gauss-Bonnet defect of global affine hypersurfaces. Bulletin des Sciences Mathématiques, 130(2):110-122, 2006.

[60] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681-688, 2011.

[61] Chenchang Zhu. The Gauss-Bonnet theorem and its applications. http://math.berkeley.edu/~alanw/240papers00/zhu.pdf, 2004.
