The Power Spherical distribution

Nicola De Cao¹ ²   Wilker Aziz¹

¹University of Amsterdam  ²The University of Edinburgh. Correspondence to: Nicola De Cao.

Proceedings of the 37th International Conference on Machine Learning, INNF+ 2020 Workshop, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).

Abstract

There is a growing interest in probabilistic models defined in hyper-spherical spaces, be it to accommodate observed data or latent structure. The von Mises-Fisher (vMF) distribution, often regarded as the Normal distribution on the hyper-sphere, is a standard modeling choice: it is an exponential family and thus enjoys important statistical results, for example, a known Kullback-Leibler (KL) divergence from other vMF distributions. Sampling from a vMF distribution, however, requires a rejection sampling procedure which, besides being slow, poses difficulties in the context of stochastic backpropagation via the reparameterization trick. Moreover, this procedure is numerically unstable for certain vMFs, e.g., those with high concentration and/or in high dimensions. We propose a novel distribution, the Power Spherical distribution, which retains some of the important aspects of the vMF (e.g., support on the hyper-sphere, symmetry about its mean direction parameter, known KL from other vMF distributions) while addressing its main drawbacks (i.e., scalability and numerical stability). We demonstrate the stability of Power Spherical distributions with a numerical experiment and further apply it to a variational auto-encoder trained on MNIST. Code at: github.com/nicola-decao/power_spherical

Figure 1. Example of draws and density function of the Power Spherical distribution. For the circle (a) we plot both the density p(x; µ = π/2, κ = 4) and 1k samples. For the sphere (b) we plot draws from 3 distributions with orthogonal directions (µ) and different concentration parameters κ ∈ {10, 100, 500}.

1. Introduction

Manifold learning and machine learning applications of directional statistics (Sra, 2018) have spurred interest in distributions defined in non-Euclidean spaces (e.g., simplex, hyper-torus, hyper-sphere). Examples include learning rotations (i.e., SO(n), Falorsi et al., 2018; 2019) and hierarchical structures on hyperbolic spaces (Mathieu et al., 2019; Nagano et al., 2019). Hyper-spherical distributions, in particular, find applications in clustering (Banerjee et al., 2005; Bijral et al., 2007), mixed-membership models (Reisinger et al., 2010), computer vision (Liu et al., 2017), and natural language processing (Kumar & Tsvetkov, 2019).

The von Mises-Fisher distribution (vMF; Mardia & Jupp, 2009) is a natural and standard choice for densities on hyper-spheres. It is a two-parameter exponential family, one parameter being a mean direction and the other a scalar concentration, and it is symmetric about the former. Because of that, it is often regarded as the Normal distribution on spheres. Amongst other useful properties, it has a closed-form Kullback-Leibler (KL) divergence with other vMF densities, including the uniform distribution, one of its special cases.

Thanks to the tangent-normal decomposition, sampling from a vMF, no matter its dimensionality, requires only sampling from a univariate marginal distribution. Unfortunately, the inverse cumulative density function (cdf) of this marginal is not known analytically, which prevents straightforward generation of independent samples. A rejection sampling procedure is known (Ulrich, 1984), but, unsurprisingly, rejection sampling is inefficient.

In deep learning, an appealing use of the vMF distribution is as a random generator in stochastic and differentiable computation graphs.
For that, we need a reparameterization trick to enable unbiased and low variance estimates of gradients of samples with respect to the vMF parameters (Rezende et al., 2014; Kingma & Welling, 2014). With rejection sampling, reparameterization does not come naturally, requiring a correction term which has high variance (Naesseth et al., 2017). This plays against widespread use of vMFs. For example, Davidson et al. (2018) successfully used the vMF distribution to approximate the posterior distribution of a hyper-spherical variational auto-encoder (VAE; Kingma & Welling, 2014), but they had to omit the correction term, trading variance for bias. Additionally, due to its exponential form as well as a dependency on the modified Bessel function of the first kind (Weisstein, 2002), the vMF distribution is numerically unstable in high dimensions or with high concentration (Davidson et al., 2018).

To overcome all of the vMF drawbacks, we propose the novel Power Spherical distribution. We start from the tangent-normal decomposition of vectors on hyper-spheres and specifically design a univariate marginal distribution that admits an analytical inverse cdf. This marginal is used to derive a distribution on hyper-spheres of any dimension. The resulting distribution is not an exponential family and is defined via a power law. Crucially, it is numerically stable and dispenses with rejection sampling for independent sampling. We verify the stability of the Power Spherical distribution with a numerical experiment as well as by reproducing some of the experiments of Davidson et al. (2018) while substituting the vMF with our proposed distribution.

Contributions  We propose a new distribution defined on any d-dimensional hyper-sphere which

• has a closed-form marginal cdf and inverse cdf, thus it does not require rejection sampling;
• is fully reparameterizable without a correction term;
• is numerically stable in high dimensions and/or at high concentrations.

2. Method

We start with an overview of the vMF distribution, also introducing results that help formulate the Power Spherical distribution. We then define the Power Spherical and present some of its properties such as mean, mode, variance, and entropy, as well as its Kullback-Leibler divergence from a vMF and from a uniform distribution.

2.1. Preliminaries

Let S^(d−1) = {x ∈ R^d : ‖x‖₂ = 1} be the hyper-spherical set. A key idea in directional distribution theory is the tangent-normal decomposition.

Theorem 1 (9.1.2 in Mardia & Jupp (2009)). Any unit vector x ∈ S^(d−1) can be decomposed as

    x = µt + v(1 − t²)^(1/2),   (1)

with t ∈ [−1, 1] and v ∈ S^(d−2) a tangent to S^(d−1) at µ.

Corollary 1 (9.3.1 in Mardia & Jupp (2009)). The intersection of S^(d−1) with the plane through tµ and normal to µ is a (d − 2)-sphere of radius √(1 − t²). Moreover, t has density

    p_T(t; d) ∝ (1 − t²)^((d−3)/2)   with t ∈ [−1, 1].   (2)

Importantly, it follows from Theorem 1 and Corollary 1 that every distribution that depends on x only through t = µᵀx can be expressed in terms of a marginal distribution p_T and a uniform distribution p_V on the subsphere S^(d−2). Density evaluation as well as sampling only requires dealing with the marginal p_T, as p_V has constant density and is trivial to sample from. An example from this class is the von Mises-Fisher distribution.

Definition 1. Let us define an unnormalized density as

    p_X(x; µ, κ) ∝ exp(κ µᵀx)   with x ∈ S^(d−1),   (3)

with direction µ ∈ S^(d−1) and concentration κ ∈ R≥0. When normalized, this is the von Mises-Fisher distribution.

Theorem 2 (9.3.12 in Mardia & Jupp (2009)). The marginal p_T(t; d, κ) of a von Mises-Fisher distribution is

    p_T(t; d, κ) = C_T(κ, d) · e^(κt) (1 − t²)^((d−3)/2),   (4)

with normalizer

    C_T(κ, d) = (κ/2)^(d/2 − 1) [Γ((d − 1)/2) Γ(1/2) I_(d/2 − 1)(κ)]^(−1),   (5)

where I_v(z) is the modified Bessel function of the first kind.

Although evaluation of p_T is tractable, the Bessel function (Weisstein, 2002) (see Appendix A) is slow to compute and unstable for large arguments. Besides, p_T does not admit a known closed-form cdf (nor its inverse). Thus, rejection sampling is normally used to draw samples from it (Ulrich, 1984).

2.2. The Power Spherical distribution

As all the issues of the vMF distribution stem from a problematic marginal, we address them by defining a new distribution that shares some basic properties of a vMF but none of its drawbacks. Namely, i) it is rotationally symmetric about µ, ii) it can be expressed in terms of a marginal distribution p_T, and iii) this marginal has a closed-form and stable cdf (and inverse cdf).

Definition 2. Let us define an unnormalized density as

    p_X(x; µ, κ) ∝ (1 + µᵀx)^κ   with x ∈ S^(d−1),   (6)

with direction µ ∈ S^(d−1) and concentration κ ∈ R≥0. When normalized, this is the Power Spherical distribution.
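To make the Bessel dependence of Theorem 2 concrete, the following is a small illustrative sketch (ours, not part of the paper; the function name is hypothetical). It evaluates the log of C_T(κ, d) from Equation (5) with SciPy and shows that, for large concentrations, the modified Bessel function overflows, which is precisely the numerical instability discussed above.

```python
# Illustrative sketch (not from the paper): log C_T(kappa, d) of Eq. (5).
import numpy as np
from scipy.special import gammaln, iv

def log_vmf_marginal_normalizer(kappa, d):
    """Naive evaluation of log C_T(kappa, d) via the Bessel function I_{d/2-1}."""
    log_bessel = np.log(iv(d / 2 - 1, kappa))  # iv overflows to inf for large kappa
    return ((d / 2 - 1) * np.log(kappa / 2)
            - gammaln((d - 1) / 2) - gammaln(0.5) - log_bessel)

print(log_vmf_marginal_normalizer(kappa=10.0, d=8))   # finite
print(log_vmf_marginal_normalizer(kappa=1e4, d=8))    # -inf once iv has overflowed
```

The Power Spherical normalizer derived below involves only Gamma functions, so the analogous computation can be done entirely in log-space with gammaln.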

Algorithm 1  Power Spherical sampling
  Input: dimension d, direction µ, concentration κ
  sample z ∼ Beta(Z; (d − 1)/2 + κ, (d − 1)/2)
  sample v ∼ U(S^(d−2))
  t ← 2z − 1
  y ← [t; (√(1 − t²)) vᵀ]ᵀ   {concatenation}
  û ← e₁ − µ   {e₁ is the base vector [1, 0, ···, 0]ᵀ}
  u ← û / ‖û‖₂
  x ← (I_d − 2uuᵀ)y   {I_d is the d × d identity matrix}
  Return: x

As we show in Theorem 12 (Appendix C.1), the marginal of the Power Spherical distribution has the valuable property of being defined in terms of an affine transformation of a Beta-distributed variable, i.e.,

    T = 2Z − 1   with Z ∼ Beta(α, β),   (7)

where α = (d − 1)/2 + κ and β = (d − 1)/2. Therefore, its density can be easily assessed via the change of variable theorem (Theorem 8 in Appendix B). Sampling and evaluating a Beta distribution is numerically stable and, crucially, it permits backpropagation through sampling via implicit reparameterization gradients (Figurnov et al., 2018). The properly normalized density of the Power Spherical distribution is derived in Theorem 13 (Appendix C.2) and is

    p_X(x; µ, κ) = N_X(κ, d)^(−1) (1 + µᵀx)^κ,   where   N_X(κ, d) = 2^(α+β) π^β Γ(α) / Γ(α + β).   (8)

Sampling  As for the vMF,¹ draws are obtained by sampling

    t ∼ p_T(t; κ, d)   and   v ∼ U(S^(d−2)),   (9)

and constructing y = [t; vᵀ√(1 − t²)]ᵀ using Theorem 1. Finally, we apply a Householder reflection about µ to y to obtain a sample x (see Algorithm 1). All these operations are differentiable, which means we can use the reparameterization trick to obtain low variance and unbiased estimates of gradients of Monte Carlo samples with respect to the parameters of the density (Rezende et al., 2014; Kingma & Welling, 2014). Importantly, and differently from a vMF, sampling from a Power Spherical does not require rejection sampling. This leads to two main advantages: i) fast sampling (as we demonstrate in Section 3), and ii) no need for a high variance gradient correction term that compensates for sampling from a proposal distribution rather than the true one (Naesseth et al., 2017; Davidson et al., 2018).

¹This is equal to the method of Davidson et al. (2018) (Algorithms 1 and 3) for sampling from a vMF, where we use the marginal of the Power Spherical instead.
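To make Algorithm 1 and the reparameterized sampling path concrete, here is a minimal PyTorch sketch (ours; it is not the reference implementation from github.com/nicola-decao/power_spherical, and the function name is hypothetical). It assumes µ ≠ e₁ so that the Householder direction is well defined.

```python
# Minimal sketch of Algorithm 1 (ours), assuming mu is a unit vector and mu != e_1.
import torch

def sample_power_spherical(mu, kappa, n_samples=1):
    d = mu.shape[-1]
    alpha, beta = (d - 1) / 2 + kappa, (d - 1) / 2
    z = torch.distributions.Beta(alpha, beta).rsample((n_samples,))  # differentiable in kappa
    t = 2 * z - 1
    v = torch.randn(n_samples, d - 1)                 # uniform direction on S^{d-2}
    v = v / v.norm(dim=-1, keepdim=True)
    y = torch.cat([t.unsqueeze(-1), torch.sqrt(1 - t ** 2).unsqueeze(-1) * v], dim=-1)
    e1 = torch.zeros(d); e1[0] = 1.0
    u = e1 - mu                                       # Householder direction mapping e_1 to mu
    u = u / u.norm()
    return y - 2 * (y @ u).unsqueeze(-1) * u          # x = (I_d - 2 u u^T) y

mu = torch.tensor([0.0, 0.0, 1.0])
x = sample_power_spherical(mu, torch.tensor(50.0), n_samples=4)
print(x.norm(dim=-1))  # all ones: samples lie on the sphere
```

Because the Beta draw uses rsample and every subsequent operation is differentiable, gradients with respect to κ flow through the whole path, matching the reparameterization argument above.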

2.3. Properties

Table 1. Properties of X ∼ PowerSpherical(µ, κ). Recall that α = (d − 1)/2 + κ and β = (d − 1)/2.

  Property   Value
  E[X]       µ(α − β)/(α + β)
  var(X)     2α/((α + β)²(α + β + 1)) · [(β − α)µµᵀ + (α + β)I_d]
  Mode       µ (for κ > 0)
  H(T)       H(Beta(α, β)) + log 2
  H(X)       log N_X(κ, d) − κ(log 2 + ψ(α) − ψ(α + β))

Table 1 summarizes some basic properties of the Power Spherical distribution. See Appendix C.3 for derivations. In particular, note that having a closed-form differential entropy allows using the Power Spherical in applications such as variational inference (VI; Jordan et al., 1999) and mutual information minimization.

Kullback-Leibler divergence  The KL divergence between a Power Spherical P and a uniform Q = U(S^(d−1)) is

    D_KL[P‖Q] = −H(P) + H(Q).   (10)

See Theorem 17 (Appendix C.6) for the full derivation. Being able to compute the KL divergence from a uniform distribution in closed form is useful in the context of variational inference, as Q can be used as a prior. Another important result we present here is a closed-form KL divergence of a Power Spherical distribution P with parameters µ_p, κ_p from a vMF Q with parameters µ_q, κ_q, which evaluates to

    −H(P) + log C_X(κ_q, d) − κ_q µ_qᵀµ_p (α − β)/(α + β),   (11)

with C_X(κ_q, d) the vMF normalizer (see Theorem 16 in Appendix C.5). This is valuable when using the Power Spherical to approximate a vMF, as we can assess the quality of the approximation. For example, if one has a vMF prior, it is then straightforward to use a Power Spherical approximate posterior in VI. If a vMF is necessary for a particular application, we can train with the Power Spherical (enabling fast sampling and stable optimization) and then return the vMF that is closest to it in terms of KL.
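To illustrate how cheaply these closed forms can be evaluated, here is a small NumPy/SciPy sketch (ours; function names are hypothetical) for the entropy in Table 1 and the KL divergence to the uniform distribution of Equation (10); it uses only log-Gamma and digamma functions.

```python
# Sketch (ours) of the closed-form entropy and KL-to-uniform; no Bessel functions needed.
import numpy as np
from scipy.special import gammaln, digamma

def power_spherical_entropy(kappa, d):
    alpha, beta = (d - 1) / 2 + kappa, (d - 1) / 2
    log_nx = ((alpha + beta) * np.log(2) + beta * np.log(np.pi)
              + gammaln(alpha) - gammaln(alpha + beta))              # Eq. (8)
    return log_nx - kappa * (np.log(2) + digamma(alpha) - digamma(alpha + beta))

def kl_to_uniform(kappa, d):
    log_area = np.log(2) + (d / 2) * np.log(np.pi) - gammaln(d / 2)  # H(Q) = log A_{d-1}
    return -power_spherical_entropy(kappa, d) + log_area            # Eq. (10)

print(kl_to_uniform(kappa=0.0, d=3))    # ~0: kappa = 0 recovers the uniform distribution
print(kl_to_uniform(kappa=100.0, d=3))  # grows with concentration
```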

3. Experiments

In this section we aim to show that Power Spherical distributions are more stable than vMFs, that they are also faster to sample from, and that they lead to comparable performance when used in the context of variational auto-encoders.

Stability  We tested numerical stability of both distributions for dimensions d ∈ {a · 10^b} and concentrations κ ∈ {a · 10^b} for all a ∈ {1, ..., 9}, b ∈ {0, ..., 5}. For every combination of ⟨d, κ⟩, we sample 10 vectors x^(i) and compute the gradient g^(i) = ∇_κ µᵀx^(i). If at least one of the samples x^(i) or one of the gradients g^(i) returns Not a Number (NaN), we mark ⟨d, κ⟩ as unstable for that distribution. In Figure 2(a) we show the regions of instability. As intended, the Power Spherical does not present numerical issues in these intervals, while the vMF does. This makes our distribution more suitable where high dimensional vectors are needed, such as for language modelling (Kumar & Tsvetkov, 2019).
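The sketch below (ours) illustrates this protocol for the Power Spherical side only, reusing the sample_power_spherical function from the earlier sketch in Section 2.2; the actual experiment runs the same grid for both distributions.

```python
# Sketch (ours) of the stability check for the Power Spherical distribution.
import torch

def is_stable(d, kappa, n=10):
    mu = torch.zeros(d); mu[-1] = 1.0
    kappa = torch.tensor(float(kappa), requires_grad=True)
    x = sample_power_spherical(mu, kappa, n_samples=n)
    grad, = torch.autograd.grad((mu * x).sum(-1).sum(), kappa)   # g = d(mu^T x)/d(kappa)
    return bool(torch.isfinite(x).all()) and bool(torch.isfinite(grad).all())

for d in (10, 1000, 100000):
    print(d, is_stable(d, kappa=1e5))   # expected: True for the Power Spherical
```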

Figure 2. Comparing stability (a) and running time (b) of the von Mises-Fisher and the Power Spherical distributions. (a) Stability of the vMF distribution; ours does not have numerical issues in these intervals. (b) Sampling time (on GPU) with d = 64 for a batch of 100 vectors of various concentrations.

Efficiency  We also compare sampling efficiency between the Power Spherical and the vMF to highlight that rejection sampling is an undesirable bottleneck. We measured sampling time with d = 64 for a batch of 100 vectors of various concentrations κ ∈ {a · 10^b} with a ∈ {1, ..., 5}, b ∈ {0, ..., 4}.² For every concentration κ we computed mean and variance of 7 trials, each consisting of computing the mean execution time (in milliseconds) of sampling 100 times. Figure 2(b) shows the results. Sampling from a Power Spherical is at least 6× faster than sampling from a vMF. For some concentrations where the rejection ratio is worse, it is almost 20× faster. Noticeably, and differently from a vMF, sampling time is constant regardless of the concentration κ.

²On an NVIDIA Titan X 12GB GPU.

Variational inference  Finally, we employed our Power Spherical distribution in the context of variational auto-encoders (Kingma & Welling, 2014). In particular, we replicated some of Davidson et al.'s (2018) experiments comparing the Power Spherical and the vMF on the MNIST dataset (LeCun et al., 1998). We implement a simple feed-forward encoder [784, 256, tanh, 128, tanh, d] and decoder [d, 128, tanh, 256, tanh, 784] and optimize the evidence lower bound for 100 epochs for d ∈ {5, 10, 20, 40}. We used Adam (Kingma & Ba, 2014) with learning rate 10⁻³ and batch size 64. Table 2 shows importance sampling estimates (with 5k Monte Carlo samples) of the log-likelihood and the evidence lower bound on the test set. We observe no substantial difference in performance between the two distributions. This shows, across dimensions, that the Power Spherical is sufficiently expressive to replace the vMF in this application. However, training with the Power Spherical was >2× faster than with the vMF. One can notice that both log-likelihood and ELBO do not improve after d = 40. This is in line with Davidson et al.'s (2018) findings and a consequence of a shallow architecture. It is not the purpose of this experiment to show improvements on this task.

Table 2. Comparison between the vMF and Power Spherical distributions in a VAE on MNIST with different dimensional latent spaces S^(d−1). We show estimated (with 5k Monte Carlo samples) log-likelihood (LL) and evidence lower bound (ELBO) on the test set.

             vMF                   Power Spherical
  Method     LL        ELBO        LL        ELBO
  d = 5     −114.51   −117.68     −114.49   −118.01
  d = 10     −97.37   −101.78      −97.46   −101.86
  d = 20     −93.80    −99.38      −93.70    −99.27
  d = 40     −98.64   −108.44      −98.63   −108.32

4. Conclusion and future work

We presented a novel distribution on the d-dimensional sphere that, unlike the typically used von Mises-Fisher, i) is numerically stable in high dimensions and at high concentration, ii) has gradients of its samples with respect to its parameters, and iii) does not require rejection sampling, allowing faster computation and exact reparameterization gradients without a high variance correction term. We empirically show that our distribution is more numerically stable and faster to sample from compared to a vMF while performing equally well in a variational auto-encoder setting. As shown in our experiments, high dimensional hyper-spheres suffer from surface area collapse at the expense of the expressivity of latent space embeddings. Davidson et al. (2019) addressed some of these issues; future work might explore this direction further. Another future direction is to use the Power Spherical in combination with normalizing flows (Rezende & Mohamed, 2015). Though neural autoregressive flows (Huang et al., 2018; De Cao et al., 2019) have been shown to be remarkably flexible, they are still subject to topological constraints (Dinh et al., 2019; Cornish et al., 2019), which motivates extensions for complex manifolds (Brehmer & Cranmer, 2020; Wu et al., 2020) including spheres and tori (Rezende et al., 2020). In addition, recent work by Falorsi & Forré (2020) extends Neural Ordinary Differential Equations (Chen et al., 2018) to arbitrary Riemannian manifolds, allowing the use of continuous normalizing flows on such spaces.

Acknowledgements

The authors want to thank Luca Falorsi and Sergio De Cao for helpful discussions and technical support, as well as Srinivas Vasudevan and Robin Scheibler for spotting some typos. This project is supported by SAP Innovation Center Network, ERC Starting Grant BroadSem (678254).

References

Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6(Sep):1345–1382, 2005.

Bijral, A. S., Breitenbach, M., and Grudic, G. Mixture of Watson distributions: a generative model for hyperspherical embeddings. In Artificial Intelligence and Statistics, pp. 35–42, 2007.

Brehmer, J. and Cranmer, K. Flows for simultaneous manifold learning and density estimation. arXiv preprint arXiv:2003.13913, 2020.

Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems 31, pp. 6571–6583, 2018.

Cornish, R., Caterini, A. L., Deligiannidis, G., and Doucet, A. Relaxing bijectivity constraints with continuously indexed normalising flows. arXiv preprint arXiv:1909.13833, 2019.

Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. Hyperspherical variational auto-encoders. 34th Conference on Uncertainty in Artificial Intelligence (UAI-18), 2018.

Davidson, T. R., Tomczak, J. M., and Gavves, E. Increasing expressivity of a hyperspherical VAE. NeurIPS 2019 Workshop on Bayesian Deep Learning, 2019.

De Cao, N., Titov, I., and Aziz, W. Block neural autoregressive flow. 35th Conference on Uncertainty in Artificial Intelligence (UAI-19), 2019.

Dinh, L., Sohl-Dickstein, J., Pascanu, R., and Larochelle, H. A RAD approach to deep mixture models. In ICLR, Workshop track, 2019.

Falorsi, L. and Forré, P. Neural ordinary differential equations on manifolds. arXiv preprint arXiv:2006.06663, 2020.

Falorsi, L., de Haan, P., Davidson, T. R., De Cao, N., Weiler, M., Forré, P., and Cohen, T. S. Explorations in homeomorphic variational auto-encoding. ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.

Falorsi, L., de Haan, P., Davidson, T. R., and Forré, P. Reparameterizing distributions on Lie groups. AISTATS, 2019.

Figurnov, M., Mohamed, S., and Mnih, A. Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, pp. 441–452, 2018.

Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. International Conference on Machine Learning, 2018.

Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.

Kumar, S. and Tsvetkov, Y. Von Mises-Fisher loss for training sequence to sequence models with continuous outputs. Seventh International Conference on Learning Representations (ICLR 2019), 2019.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., and Song, L. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212–220, 2017.

Mardia, K. V. and Jupp, P. E. Directional statistics, volume 494. John Wiley & Sons, 2009.

Mathieu, E., Le Lan, C., Maddison, C. J., Tomioka, R., and Teh, Y. W. Continuous hierarchical representations with Poincaré variational auto-encoders. In Advances in Neural Information Processing Systems, pp. 12544–12555, 2019.

Naesseth, C., Ruiz, F., Linderman, S., and Blei, D. Reparameterization gradients through acceptance-rejection sampling algorithms. AISTATS, pp. 489–498, 2017.

Nagano, Y., Yamaguchi, S., Fujita, Y., and Koyama, M. A wrapped normal distribution on hyperbolic space for gradient-based learning. In International Conference on Machine Learning, pp. 4693–4702, 2019.

Reisinger, J., Waters, A., Silverthorn, B., and Mooney, R. J. Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 903–910, 2010.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. ICML, 37:1530–1538, 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. ICML, pp. 1278–1286, 2014.

Rezende, D. J., Papamakarios, G., Racanière, S., Albergo, M. S., Kanwar, G., Shanahan, P. E., and Cranmer, K. Normalizing flows on tori and spheres. arXiv preprint arXiv:2002.02428, 2020.

Sra, S. Directional statistics in machine learning: a brief review. Applied Directional Statistics: Modern Methods and Case Studies, pp. 225, 2018.

Ulrich, G. Computer generation of distributions on the m-sphere. Journal of the Royal Statistical Society, Series C (Applied Statistics), 33(2):158–163, 1984.

Weisstein, E. W. Bessel function of the first kind. Wolfram, 2002.

Wu, H., Köhler, J., and Noé, F. Stochastic normalizing flows. arXiv preprint arXiv:2002.06707, 2020.

A. Definitions

Definition 3. The Gamma function is:

    Γ(x) = ∫₀^∞ u^(x−1) exp(−u) du.   (12)

Definition 4. The Beta function is:

    B(a, b) = Γ(a)Γ(b) / Γ(a + b).   (13)

Definition 5. The incomplete Beta function is:

    B_x(a, b) = ∫₀^x u^(a−1) (1 − u)^(b−1) du.   (14)

Definition 6. The regularized incomplete Beta function is:

    I_x(a, b) = B_x(a, b) / B(a, b).   (15)

Definition 7. The surface area of the hyper-sphere S^(d−1) is:

    A_(d−1) = 2π^(d/2) / Γ(d/2).   (16)

Definition 8. The uniform distribution on S^(d−1) has constant density equal to the reciprocal of the surface area (Definition 7):

    p_X(x) = 1 / A_(d−1).   (17)

Definition 9. The modified Bessel function of the first kind is defined as

    I_v(z) = (z/2)^v Σ_(k=0)^∞ (z²/4)^k / (k! Γ(v + k + 1)).   (18)

A useful integral is

    ∫ (1 + x)^a (1 − x)^b dx = 2^(a+b+1) B_((x+1)/2)(a + 1, b + 1) + C.   (19)

Definition 10. For a distribution P of a continuous random variable, the differential entropy is defined to be the integral

    H(P) = −E_(p(x))[log p(x)],   (20)

where p denotes the probability density function of P.

Definition 11. For distributions P and Q of a continuous random variable, the Kullback-Leibler divergence is defined to be the integral

    D_KL(P‖Q) = −H(P) − E_(p(x))[log q(x)],   (21)

where p and q denote the probability density functions of P and Q respectively.

Definition 12. A random variable X is distributed according to the Beta distribution if its probability density function is

    B(α, β)^(−1) x^(α−1) (1 − x)^(β−1),   (22)

for x ∈ [0, 1] and zero elsewhere, where α ∈ R>0 and β ∈ R>0 are shape parameters.

B. Theorems

Theorem 3. The entropy of a Beta-distributed random variable X is

    H(X) = log B(α, β) + (α + β − 2)ψ(α + β) − (α − 1)ψ(α) − (β − 1)ψ(β),   (23)

where ψ(x) = ∂/∂x log Γ(x).

Theorem 4. The expectation of a Beta-distributed random variable X is

    E[X] = α / (α + β).   (24)

Theorem 5. The expectation of log X of a Beta-distributed random variable X is

    E[log X] = ψ(α) − ψ(α + β).   (25)

Theorem 6. The expectation of X² of a Beta-distributed random variable X is

    E[X²] = (α + 1)α / ((α + β + 1)(α + β)).   (26)

Theorem 7. The variance of a Beta-distributed random variable X is

    var[X] = αβ / ((α + β)²(α + β + 1)).   (27)

Theorem 8. Given a bijective function f : 𝒳 → 𝒴 between two continuous random variables X ∈ 𝒳 ⊆ R^d and Y ∈ 𝒴 ⊆ R^d, a relation between the probability density functions p_Y(y) and p_X(x) is

    p_Y(y) = p_X(x) |det Jf(x)|^(−1),   (28)

where y = f(x), and |det Jf(x)| is the absolute value of the determinant of the Jacobian of f evaluated at x.

Theorem 9. Given a bijective function f : 𝒳 → 𝒴 between two continuous random variables X ∈ 𝒳 ⊆ R^d and Y ∈ 𝒴 ⊆ R^d, a relation between the two respective entropies H(Y) and H(X) is

    H(Y) = H(X) + ∫_𝒳 p_X(x) log |det Jf(x)| dx,   (29)

where y = f(x), and |det Jf(x)| is the absolute value of the determinant of the Jacobian of f evaluated at x.
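As a quick numerical sanity check (ours, not part of the paper), the Beta-distribution facts in Theorems 3–7 can be verified against scipy.stats; the parameter values below are arbitrary.

```python
# Numerical check (ours) of Theorems 3, 4 and 7 against scipy.stats.beta.
import numpy as np
from scipy.stats import beta as beta_dist
from scipy.special import digamma, gammaln

a, b = 3.5, 2.0
rv = beta_dist(a, b)

log_B = gammaln(a) + gammaln(b) - gammaln(a + b)                    # Definition 4
entropy = (log_B + (a + b - 2) * digamma(a + b)
           - (a - 1) * digamma(a) - (b - 1) * digamma(b))           # Theorem 3
print(np.isclose(entropy, rv.entropy()))                            # True
print(np.isclose(a / (a + b), rv.mean()))                           # Theorem 4
print(np.isclose(a * b / ((a + b) ** 2 * (a + b + 1)), rv.var()))   # Theorem 7
```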

Theorem 10 (9.3.33 in Mardia & Jupp (2009)). Continuous distributions with rotational symmetry about a direction µ and with probability density functions of the form p_X ∝ g(µᵀx) have mean

    E[X] = E[T]µ.   (30)

Theorem 11 (9.3.34 in Mardia & Jupp (2009)). Continuous distributions with rotational symmetry about a direction µ and with probability density functions of the form p_X ∝ g(µᵀx) have variance

    var[X] = var[T]µµᵀ + (1 − E[T²])/(d − 1) (I_d − µµᵀ),   (31)

where I_d is the d × d identity matrix.

C. Derivations

C.1. The Power Spherical marginal

Theorem 12. Let S^(d−1) = {x ∈ R^d : ‖x‖ = 1} be the hyper-spherical set. Let an unnormalized density be

    p_X(x; µ, κ) ∝ (1 + µᵀx)^κ   with x ∈ S^(d−1),   (32)

with direction µ ∈ S^(d−1) and concentration parameter κ ∈ R≥0. Let T be a random variable that denotes the dot-product t = µᵀx; then

    T = 2Z − 1   with Z ∼ Beta(α, β),   (33)

where α = (d − 1)/2 + κ and β = (d − 1)/2.

Proof. Given Corollary 1, the marginal distribution of the dot-product t is ∝ (1 + t)^κ (1 − t²)^((d−3)/2), so its normalizer is

    N_T(κ, d) = ∫_(−1)^1 (1 + t)^κ (1 − t²)^((d−3)/2) dt   (34)
              = ∫_(−1)^1 (1 + t)^((d−3)/2 + κ) (1 − t)^((d−3)/2) dt   (35)
              = 2^(d+κ−2) [B₁((d − 1)/2 + κ, (d − 1)/2) − B₀((d − 1)/2 + κ, (d − 1)/2)]   (by Eq. 19)   (36)
              = 2^(d+κ−2) B((d − 1)/2 + κ, (d − 1)/2),   (37)

where the second term in (36) is zero. It follows that the probability density function of the dot-product marginal distribution is

    p_T(t; κ, d) = N_T(κ, d)^(−1) (1 + t)^κ (1 − t²)^((d−3)/2)   (38)
                 = N_T(κ, d)^(−1) (1 + t)^((d−3)/2 + κ) (1 − t)^((d−3)/2)   (39)
                 = N_T(κ, d)^(−1) (2z)^((d−1)/2 + κ − 1) (2 − 2z)^((d−1)/2 − 1)   (*)   (40)
                 = B(α, β)^(−1) z^(α−1) (1 − z)^(β−1),   (41)

where (*) indicates the substitution t = 2z − 1 and eventually α = (d − 1)/2 + κ, β = (d − 1)/2. Notice that Equation 41 is a Beta distribution. Therefore, if we define the random variable Z ∼ Beta(α, β), it turns out that T is simply T = 2Z − 1.

Corollary 2. Following Theorem 12, the marginal of a Power Spherical distribution has cumulative density function

    F(t; κ, d) = N_T(κ, d)^(−1) ∫_(−1)^t (1 + x)^κ (1 − x²)^((d−3)/2) dx   (42)
               = B(α, β)^(−1) B_((t+1)/2)(α, β)   (43)
               = I_((t+1)/2)(α, β),   (44)

and inverse cumulative density function

    F^(−1)(y; κ, d) = 2 I_y^(−1)(α, β) − 1.   (45)

Corollary 3. Using Theorems 3 and 9, we derive the differential entropy of the marginal Power Spherical p_T. Using t = 2z − 1 = f(z), |det Jf(z)| = 2 for all z, and then

    H(T) = H(Z) + log 2   (46)
         = log B(α, β) + (α + β − 2)ψ(α + β) − (α − 1)ψ(α) − (β − 1)ψ(β) + log 2.   (by Eq. 23)   (47)
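Corollary 2 is what makes cdf and inverse-cdf evaluation trivial in practice. The following sketch (ours; function names are hypothetical) evaluates Equations (44)–(45) with SciPy's regularized incomplete Beta function and checks that they invert each other.

```python
# Sketch (ours) of the marginal cdf and inverse cdf of Corollary 2.
import numpy as np
from scipy.special import betainc, betaincinv

def marginal_cdf(t, kappa, d):
    alpha, beta = (d - 1) / 2 + kappa, (d - 1) / 2
    return betainc(alpha, beta, (t + 1) / 2)        # Eq. (44)

def marginal_inverse_cdf(y, kappa, d):
    alpha, beta = (d - 1) / 2 + kappa, (d - 1) / 2
    return 2 * betaincinv(alpha, beta, y) - 1       # Eq. (45)

t = 0.3
y = marginal_cdf(t, kappa=5.0, d=4)
print(np.isclose(marginal_inverse_cdf(y, kappa=5.0, d=4), t))   # True
```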

C.2. The normalized Power Spherical distribution

Theorem 13. The normalized density of the Power Spherical distribution is

    p_X(x; µ, κ) = [2^(α+β) π^β Γ(α) / Γ(α + β)]^(−1) (1 + µᵀx)^κ,   (48)

with α = (d − 1)/2 + κ and β = (d − 1)/2.

Proof. The Power Spherical is expressed via the tangent-normal decomposition (Theorem 1) as a joint distribution between T ∼ p_T(t; κ, d) (from Theorem 12) and V ∼ U(S^(d−2)). Since T ⊥⊥ V, the Power Spherical normalizer N_X(κ, d) is the product of the normalizer of p_T(t; κ, d) and that of the uniform distribution on S^(d−2) (whose density is constant on the (d − 2)-sphere; see Definition 8), that is

    N_X(κ, d) = N_T(κ, d) · A_(d−2)   (49)
              = 2^(α+β−1) B(α, β) · 2π^β / Γ(β)   (50)
              = 2^(α+β) π^β Γ(α) / Γ(α + β).   (51)

Thus, p_X(x; µ, κ) = N_X(κ, d)^(−1) (1 + µᵀx)^κ.

C.3. Power Spherical properties

Corollary 4. Directly applying Theorem 10, the mean of a Power Spherical is

    E[X] = E[T]µ = (2E[Z] − 1)µ = µ(α − β)/(α + β),   (by Eq. 24)   (52)

with α = (d − 1)/2 + κ, β = (d − 1)/2.

Corollary 5. Directly applying Theorem 11, the variance of a Power Spherical is

    var[X] = var[T]µµᵀ + (1 − E[T²])/(d − 1) (I_d − µµᵀ)   (53)
           = 4 var[Z]µµᵀ + 4 (E[Z] − E[Z²])/(2β) (I_d − µµᵀ)   (54)
           = 4 var[Z]µµᵀ + 4 var[Z](α + β)/(2β) (I_d − µµᵀ)   (by Eqs. 26, 27)   (55)
           = 2 var[Z]/β [(β − α)µµᵀ + (α + β)I_d]   (56)
           = 2α [(β − α)µµᵀ + (α + β)I_d] / ((α + β)²(α + β + 1)),   (by Eq. 27)   (57)

with α = (d − 1)/2 + κ, β = (d − 1)/2, and I_d the d × d identity matrix.

Theorem 14. The mode of a Power Spherical with κ > 0 is

    argmax_(x ∈ S^(d−1)) p_X(x; µ, κ) = µ.   (58)

Proof. We have

    argmax_(x ∈ S^(d−1)) p_X(x; µ, κ) = argmax_(x ∈ S^(d−1)) N_X(κ, d)^(−1) (1 + µᵀx)^κ   (59)
                                      = argmax_(x ∈ S^(d−1)) log(1 + µᵀx)   (for κ > 0)   (60)
                                      = argmax_(x ∈ S^(d−1)) µᵀx = µ.   (61)

C.4. Power Spherical differential entropy

Theorem 15. The differential entropy of the Power Spherical p_X is

    H(X) = log N_X(κ, d) − κ (log 2 + ψ(α) − ψ(α + β)),   (62)

with α = (d − 1)/2 + κ, β = (d − 1)/2.

Proof. Applying Definition 10,

    H(X) = −E_X[log p_X(X)]   (63)
         = log N_X(κ, d) − κ E_X[log(1 + µᵀX)]   (64)
         = log N_X(κ, d) − κ (log 2 + E_Z[log Z])   (65)
         = log N_X(κ, d) − κ (log 2 + ψ(α) − ψ(α + β)).   (by Eq. 25)   (66)

C.5. Kullback-Leibler divergence with the von Mises-Fisher distribution

Theorem 16. The Kullback-Leibler divergence D_KL (Definition 11) between a Power Spherical distribution P with parameters µ_p, κ_p and a von Mises-Fisher Q with parameters µ_q, κ_q is

    D_KL[P‖Q] = −H(P) + log C_X(κ_q, d) − κ_q µ_qᵀµ_p (α − β)/(α + β),   (67)

with α = (d − 1)/2 + κ_p, β = (d − 1)/2.

Proof. Applying Definition 11,

    D_KL[P‖Q] = −H(P) − E_(p_X)[log q(X)]   (68)
              = −H(P) + log C_X(κ_q, d) − E_(p_X)[κ_q µ_qᵀX]   (69)
              = −H(P) + log C_X(κ_q, d) − κ_q µ_qᵀ E_(p_T)[T]µ_p   (70)
              = −H(P) + log C_X(κ_q, d) − κ_q µ_qᵀµ_p (α − β)/(α + β),   (71)

where C_X(κ_q, d) is the von Mises-Fisher normalizer.

C.6. Kullback-Leibler divergence with U(S^(d−1))

Theorem 17. The Kullback-Leibler divergence D_KL (Definition 11) between a Power Spherical distribution P and a uniform distribution on the sphere Q = U(S^(d−1)) is

    D_KL[P‖Q] = −H(P) + H(Q).   (72)

Proof. Applying Definition 11,

    D_KL[P‖Q] = −H(P) − E_(p_X)[log q(X)]   (73)
              = −H(P) + H(Q),   (74)

where H(Q) = log A_(d−1) (from Definition 8).
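As a final numerical sanity check (ours, not part of the paper), the closed-form entropy of Theorem 15 can be compared against a Monte Carlo estimate built from the sampler of Algorithm 1 and the density of Theorem 13; for simplicity we take µ = e₁, so the Householder reflection can be skipped since the base samples are already aligned with µ.

```python
# Monte Carlo check (ours): the sample average of -log p_X(x) matches Eq. (62).
import numpy as np
from scipy.special import gammaln, digamma

rng = np.random.default_rng(0)
d, kappa = 5, 20.0
alpha, beta = (d - 1) / 2 + kappa, (d - 1) / 2

n = 200_000
t = 2 * rng.beta(alpha, beta, size=n) - 1               # marginal draws, Theorem 12
v = rng.standard_normal((n, d - 1))
v /= np.linalg.norm(v, axis=1, keepdims=True)
x = np.concatenate([t[:, None], np.sqrt(1 - t[:, None] ** 2) * v], axis=1)

log_nx = (alpha + beta) * np.log(2) + beta * np.log(np.pi) + gammaln(alpha) - gammaln(alpha + beta)
log_px = -log_nx + kappa * np.log1p(x[:, 0])            # Eq. (48), with mu = e_1
mc_entropy = -log_px.mean()
closed_form = log_nx - kappa * (np.log(2) + digamma(alpha) - digamma(alpha + beta))   # Eq. (62)
print(mc_entropy, closed_form)                          # agree closely
```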