The Power Spherical Distribution
Total Page:16
File Type:pdf, Size:1020Kb
The Power Spherical distribution Nicola De Cao 1 2 Wilker Aziz 1 Abstract = 10 0.6 p( = /2, = 4) = 100 There is a growing interest in probabilistic models 1k samples = 500 0.5 defined in hyper-spherical spaces, be it to accom- 1 modate observed data or latent structure. The von 0.4 Mises-Fisher (vMF) distribution, often regarded 0.3 0 as the Normal distribution on the hyper-sphere, is 0.2 a standard modeling choice: it is an exponential 1 0.1 1 1 family and thus enjoys important statistical results, 0 0 0.0 for example, known Kullback-Leibler (KL) diver- 0 /2 3 /2 1 1 1 2 gence from other vMF distributions. Sampling (a) On the circle S . (b) On the sphere S . from a vMF distribution, however, requires a re- jection sampling procedure which besides being Figure 1. Example of draws and density function of the Power slow poses difficulties in the context of stochas- Spherical distribution. For the circle (a) we plot both the density tic backpropagation via the reparameterization and histograms for 1k samples. For the sphere (b) we plot draws trick. Moreover, this procedure is numerically from 3 distributions of orthogonal directions (µ) and different unstable for certain vMFs, e.g., those with high concentration parameter κ. concentration and/or in high dimensions. We pro- pose a novel distribution, the Power Spherical distribution, which retains some of the important Bijral et al., 2007), mixed-membership models (Reisinger aspects of the vMF (e.g., support on the hyper- et al., 2010), computer vision (Liu et al., 2017), and natural sphere, symmetry about its mean direction param- language processing (Kumar & Tsvetkov, 2019). eter, known KL from other vMF distributions) The von Mises-Fisher distribution (vMF; Mardia & Jupp, while addressing its main drawbacks (i.e., scala- 2009) is a natural and standard choice for densities in hyper- bility and numerical stability). We demonstrate spheres. It is a two-parameter exponential family, one pa- the stability of Power Spherical distributions with rameter being a mean direction and the other a scalar con- a numerical experiment and further apply it to a centration, and it is symmetric about the former. Because variational auto-encoder trained on MNIST. Code of that, it is often regarded as the Normal distribution on at: github.com/nicola-decao/power spherical spheres. Amongst other useful properties, it has closed-form KullbackLeibler (KL) divergence with other vMF densities including the uniform distribution, one of its special cases. 1. Introduction Thanks to the tangent-normal decomposition, sampling Manifold learning and machine learning applications of from a vMF, no matter its dimensionality, requires only arXiv:2006.04437v2 [stat.ML] 15 Jun 2020 directional statistics (Sra, 2018) have spurred interest in sampling from a univariate marginal distribution. Unfortu- distributions defined in non-Euclidean spaces (e.g., sim- nately, the inverse cumulative density function (cdf ) of this plex, hyper-torus, hyper-sphere). Examples include learning marginal is not known analytically, which prevents straight- rotations (i.e., SO(n), Falorsi et al., 2018; 2019) and hierar- forward generation of independent samples. Fortunately, a chical structures on hyperbolic spaces (Mathieu et al., 2019; rejection sampling procedure is known (Ulrich, 1984), but Nagano et al., 2019). Hyper-spherical distributions, in par- unsurprisingly, rejection sampling is inefficient. ticular, find applications in clustering (Banerjee et al., 2005; In deep learning, an appealing use of the vMF distribution is 1University of Amsterdam 2The University of Edinburgh. Cor- as a random generator in stochastic and differentiable com- respondence to: Nicola De Cao <[email protected]>. putation graphs. For that, we need a reparameterization trick Proceedings of the 37 th International Conference on Machine to enable unbiased and low variance estimates of gradients Learning INNF+ 2020 Workshop, Vienna, Austria, PMLR 108, of samples with respect to the vMF parameters (Rezende 2020. Copyright 2020 by the author(s). Under review *. et al., 2014; Kingma & Welling, 2014). With rejection sam- The Power Spherical distribution pling, reparameterization does not come naturally, requiring Corollary 1 (9.3.1 in Mardia & Jupp(2009)) . The intersec- d−1 a correction term which has high variance (Naesseth et al., tion of S with the planep through tµ and normal to µ is a 2017). This plays against widespread use of vMFs. For (d − 2)-sphere of radius 1 − t2. Moreover, t has density example, Davidson et al.(2018) successfully used the vMF d−3 2 2 distribution to approximate the posterior distribution of a pT (t; d) / 1 − t with t 2 [−1; 1] : (2) hyper-spherical variational auto-encoder (VAE; Kingma & Welling, 2014), but they had to omit the correction term, Importantly, it follows from Theorem1 and Corollary1 trading variance for bias. Additionally, due to its exponen- that every distribution that depends on x only through t = tial form as well as a dependency on the modified Bessel µ>x can be expressed in terms of a marginal distribution function of the first kind (Weisstein, 2002), the vMF distri- p and a uniform distribution p on the subspace d−1. bution is numerically unstable in high dimensions or with T V S Density evaluation as well as sampling just requires dealing high concentration (Davidson et al., 2018). with the marginal pT as pV has constant density and it is To overcome all of the vMF drawbacks, we propose the trivial to sample from. An example from this class is the novel Power Spherical distribution. We start from the von Mises-Fisher distribution. tangent-normal decomposition of vectors in hyper-spheres Definition 1. Let’s define a unnormalized density as and specifically design a univariate marginal distribution that admits an analytical inverse cdf . This marginal is used > d−1 pX (x; µ, κ) / exp κµ x with x 2 S ; (3) to derive a distribution on hyper-spheres of any dimension. d−1 The resulting distribution is not an exponential family and with direction µ 2 S and concentration κ 2 R≥0. When is defined via a power law. Crucially, it is numerically sta- normalized, this is the von Mises-Fisher distribution. ble and dispenses with rejection sampling for independent Theorem 2 (9.3.12 in Mardia & Jupp(2009)) . The marginal sampling. We verify the stability of the Power Spherical p (t; d; κ) of a von Mises-Fisher distribution is distribution with a numerical experiment as well as repro- T ducing some of the experiments of Davidson et al.(2018) d−3 κt 2 2 while substituting the vMF with our proposed distribution. pT (t; d; κ) = CT (κ, d) · e 1 − t ; (4) with normalizer C (κ, d) = Contributions We propose a new distribution defined on T any d-dimensional hyper-sphere which d −1 κ 2 −1 d − 1 1 Γ Γ I d−1 (κ) ; (5) • has closed form marginal cdf and inverse cdf , thus it 2 2 2 2 does not require rejection sampling; where I (z) is the modified Bessel function of the first kind. • is fully reparameterizable without a correction term; v • is numerically stable in high dimensions and/or high Although evaluation of pT is tractable, the Bessel func- concentrations. tion (Weisstein, 2002) (see AppendixA) is slow to compute and unstable for large arguments. Besides, pT does not 2. Method admit a known closed-form cdf (nor its inverse). Thus, re- jection sampling is normally used to draw samples from We start with an overview of the vMF distribution, also it (Ulrich, 1984). introducing results that help formulate the Power Spherical distribution. We then define the Power Spherical and present 2.2. The Power Spherical distribution some of its proprieties such as mean, mode, variance, and entropy as well as its KullbackLeibler divergence from a As all the issues of the vMF distribution stem from a prob- vMF and from a uniform distribution. lematic marginal, we address them by defining a new dis- tribution that shares some basic proprieties of a vMF but 2.1. Preliminaries none of its drawbacks. Namely i) it is rotationally symmet- ric about µ, ii) it can be expressed in terms of a marginal d−1 d Let S = fx 2 R : kxk2 = 1g be the hyper-spherical distribution pT , and iii) this marginal has closed-form and set. A key idea in directional distribution theory is the stable cdf (and inverse cdf ). tangent-normal decomposition. Definition 2. Let’s define an unnormalized density as Theorem 1 (9.1.2 in Mardia & Jupp(2009)) . Any unit vec- d−1 > κ d−1 tor x 2 S can be decomposed as pX (x; µ, κ) / 1 + µ x with x 2 S ; (6) 2 1 x = µt + v(1 − t ) 2 ; (1) d−1 with direction µ 2 S and concentration κ 2 R≥0. When with t 2 [−1; 1] and v 2 Sd−2 a tangent to Sd−1 at µ. normalized this is the Power Spherical distribution. The Power Spherical distribution Algorithm 1 Power Spherical sampling Property Value Input: dimension p, direction µ, concentration κ E[X] µ(α − β)=(α + β) 2α > sample z ∼ Beta (Z;(d − 1)=2 + κ, (d − 1)=2) var(X) 2 (β − α)µµ + (α + β)Id d−2 (α+β) (α+β+1) sample v ∼ U(S ) Mode µ (for κ > 0) t 2z − 1 H(T ) H(Beta(α; β)) + log 2 p y [t;( 1 − t2)v> ]> fconcatenationg H(X) log NX (κ, d) − κ log 2 + (α) − (α + β) > u^ e1 − µ fe1 is the base vector [1; 0; ··· ; 0] g u = u^ Table 1. Properties of X ∼ PowerSpherical(µ, κ). Recall that ku^k2 > α = (d − 1)=2 + κ and β = (d − 1)=2.