Maximum likelihood estimation of the Fisher–Bingham distribution via efficient calculation of its normalizing constant

Yici Chen · Ken'ichiro Tanaka

arXiv:2004.14660v1 [stat.CO] 30 Apr 2020

Y. Chen
Department of Information Science and Technology, The University of Tokyo
E-mail: [email protected]

K. Tanaka
Department of Mathematical Informatics, The University of Tokyo, Tokyo, Japan
E-mail: [email protected]

Abstract This paper proposes an efficient numerical integration formula to compute the normalizing constant of Fisher–Bingham distributions. The formula applies numerical integration with the continuous Euler transform to a Fourier-type integral representation of the normalizing constant. As this method is fast and accurate, it can be applied to the calculation of the normalizing constant of high-dimensional Fisher–Bingham distributions. More precisely, the error decays exponentially with an increase in the number of integration points, and the computation cost increases linearly with the dimension. In addition, this formula is useful for calculating the gradient and Hessian matrix of the normalizing constant. Therefore, we apply this formula to efficiently perform maximum likelihood estimation (MLE) for high-dimensional data. Finally, we apply the MLE to the hyperspherical variational auto-encoder (S-VAE), a deep-learning-based generative model that restricts the latent space to a unit hypersphere. We use the S-VAE trained with images of handwritten numbers to estimate the distributions of each label. This application is useful for adding new labels to the models.

Keywords Fisher–Bingham distributions · continuous Euler transform · high-dimensional data · maximum likelihood estimation · hyperspherical variational auto-encoder

1 Introduction

1.1 Fisher–Bingham distribution

The Fisher–Bingham distribution is defined as a multivariate normal distribution restricted on a unit sphere.

Definition 1 For a $p$-dimensional multivariate normal distribution with a mean $\mu$ and a variance-covariance matrix $\Sigma$, the Fisher–Bingham distribution is given by the density function
\[
f(x; \mu, \Sigma) := \frac{1}{C} \exp\left( -\frac{x^{T} \Sigma^{-1} x}{2} + x^{T} \Sigma^{-1} \mu \right) \mathrm{d}_{S^{p-1}}(x),
\]
where $x \in \mathbb{R}^{p}$ and
\[
C = C\left( \frac{\Sigma^{-1}}{2}, \Sigma^{-1} \mu \right) := \int_{S^{p-1}} \exp\left( -\frac{x^{T} \Sigma^{-1} x}{2} + x^{T} \Sigma^{-1} \mu \right) \mathrm{d}_{S^{p-1}}(x)
\]
is the normalizing constant and $\mathrm{d}_{S^{p-1}}(x)$ is the uniform measure in the $(p-1)$-dimensional sphere $S^{p-1}$.

The Fisher–Bingham distribution plays an essential role in directional statistics, which is concerned with data on various manifolds, especially data represented in a high-dimensional sphere. For example, wind direction and the geomagnetic field are common types of data that can be represented on a sphere $S^{2}$. In addition, data on a hypersphere are used in link prediction of networks and image generation. Therefore, the Fisher–Bingham distribution, a normal distribution restricted on a unit sphere, is commonly used in this field.

However, the spherical domain causes some problems when using Fisher–Bingham distributions. One such problem is calculating the normalizing constant. As it is difficult to calculate analytically, a numerical method is necessary. The saddlepoint approximation method is a numerical method for computing the normalizing constant $C(\theta, \gamma)$ developed by Kume and Wood (2005). Another approach, the holonomic gradient method considered by Kume and Sei (2018), computes the normalizing constant as well. However, these methods have some limitations. The saddlepoint approximation method is not as accurate as the holonomic gradient method, which is theoretically exact because the problem of calculating $C(\theta, \gamma)$ is mathematically characterized by solving an ODE. However, the holonomic gradient method is computationally expensive and cannot be applied to calculate the normalizing constant of high-dimensional distributions. Hence, it is necessary to create a numerical method that is efficient, numerically stable, and accurate.

To construct such a numerical method, the following details about Fisher–Bingham distributions are required (Kume and Sei (2018)).

Since any orthogonal transformation in $S^{p-1}$ is isometric, the parameter dimensions are reduced from $(p \times p + p)$ to $2p$ by singular value decomposition. Therefore, we have
\[
C\left( \frac{\Sigma^{-1}}{2}, \Sigma^{-1} \mu \right) = C\left( \frac{\Delta^{-1}}{2}, \Delta^{-1} O \mu \right),
\]
where $\Delta = \mathrm{diag}(\delta_{1}^{2}, \dots, \delta_{p}^{2})$ and $O$ is the orthogonal matrix obtained from $\Sigma = O^{T} \Delta O$. Thus, without loss of generality, we can assume that the variance-covariance matrix $\Sigma$ is diagonal. After reducing the parameter dimensions to $2p$, the normalizing constant becomes
\[
C\left( \frac{\Delta^{-1}}{2}, \Delta^{-1} O \mu \right) = C(\theta, \gamma) := \int_{S^{p-1}} \exp\left( \sum_{i=1}^{p} \left( -\theta_{i} x_{i}^{2} + \gamma_{i} x_{i} \right) \right) \mathrm{d}_{S^{p-1}}(x),
\]
where
\[
\theta = (\theta_{1}, \dots, \theta_{p}) = \left( \frac{1}{2\delta_{1}^{2}}, \dots, \frac{1}{2\delta_{p}^{2}} \right) = \mathrm{diag}\left( \frac{\Delta^{-1}}{2} \right)
\]
and
\[
\gamma = (\gamma_{1}, \dots, \gamma_{p}) = \Delta^{-1} O \mu.
\]

Since $x$ is restricted on a unit sphere, we have
\begin{align*}
C(\theta + cI, \gamma)
&= \int_{S^{p-1}} \exp\left( \sum_{i=1}^{p} \left( -(\theta_{i} + c) x_{i}^{2} + \gamma_{i} x_{i} \right) \right) \mathrm{d}_{S^{p-1}}(x) \\
&= \int_{S^{p-1}} \exp\left( -c + \sum_{i=1}^{p} \left( -\theta_{i} x_{i}^{2} + \gamma_{i} x_{i} \right) \right) \mathrm{d}_{S^{p-1}}(x) \\
&= e^{-c} \int_{S^{p-1}} \exp\left( \sum_{i=1}^{p} \left( -\theta_{i} x_{i}^{2} + \gamma_{i} x_{i} \right) \right) \mathrm{d}_{S^{p-1}}(x) \\
&= e^{-c} C(\theta, \gamma),
\end{align*}
where $c$ is a real number and $I = (1, 1, \dots, 1) \in \mathbb{R}^{p}$. If we put
\[
f(x; \theta, \gamma) := \frac{1}{C(\theta, \gamma)} \exp\left( \sum_{i=1}^{p} \left( -\theta_{i} x_{i}^{2} + \gamma_{i} x_{i} \right) \right) \mathrm{d}_{S^{p-1}}(x),
\]
then we have
\begin{align*}
f(x; \theta + cI, \gamma)
&= \frac{1}{C(\theta + cI, \gamma)} \exp\left( \sum_{i=1}^{p} \left( -(\theta_{i} + c) x_{i}^{2} + \gamma_{i} x_{i} \right) \right) \mathrm{d}_{S^{p-1}}(x) \\
&= \frac{e^{c}}{C(\theta, \gamma)} \exp\left( -c + \sum_{i=1}^{p} \left( -\theta_{i} x_{i}^{2} + \gamma_{i} x_{i} \right) \right) \mathrm{d}_{S^{p-1}}(x) \\
&= \frac{1}{C(\theta, \gamma)} \exp\left( \sum_{i=1}^{p} \left( -\theta_{i} x_{i}^{2} + \gamma_{i} x_{i} \right) \right) \mathrm{d}_{S^{p-1}}(x) \\
&= f(x; \theta, \gamma).
\end{align*}
As a result, if the normalizing constant $C(\theta, \gamma)$ is obtained, $C(\theta + cI, \gamma)$ can also be obtained. Moreover, for the maximum likelihood estimation (MLE), as $f(x; \theta, \gamma) = f(x; \theta + cI, \gamma)$, $\theta$ can be shifted to $\theta + cI$ for all $c \in \mathbb{R}$. Additionally, because the unit sphere is symmetrical,
\begin{align*}
C(\theta, |\gamma|)
&= \int_{S^{p-1}} \exp\left( \sum_{i=1}^{p} \left( -\theta_{i} x_{i}^{2} + |\gamma_{i}| x_{i} \right) \right) \mathrm{d}_{S^{p-1}}(x) \\
&= \int_{S^{p-1}} \exp\left( \sum_{i=1}^{p} \left( -\theta_{i} x_{i}^{2} + \gamma_{i} x_{i} \right) \right) \mathrm{d}_{S^{p-1}}(x) \\
&= C(\theta, \gamma).
\end{align*}
As a result, it can be assumed that $\gamma$ has non-negative entries when calculating the normalizing constant.

1.2 Aim of this paper

In this paper,
1. we propose an efficient numerical integration formula to compute the normalizing constant.
2. we apply this formula to perform MLE.
3. we apply MLE to the latent variables of the hyperspherical variational auto-encoder (S-VAE) (Davidson et al. (2018)).

The normalizing constant of Fisher–Bingham distributions can be represented in a Fourier integration form. Therefore, we can use the numerical integration formula with the continuous Euler transform introduced by Ooura (2001). Note that the continuous Euler transform is useful for calculating the normalizing constant and MLE.

This method can be applied to the MLE of high-dimensional data, such as the latent variables of S-VAE (Davidson et al. (2018)), a generative model used in machine learning. The dimension of the latent space of the S-VAE depends on the complexity of the data. For example, for human face data, the latent variables may have 100 dimensions.

1.3 Organization of this paper

This paper is organized as follows. In Section 2, we make some general remarks about the Fisher–Bingham distribution and the Fourier transform representation of the normalizing constant. In Section 3, we explain the continuous Euler transform and its use for numerical computation of the normalizing constant. In Section 4, we discuss the calculation of the gradient of the normalizing constant, which is necessary for MLE. Subsequently, the MLE algorithm is provided. In Section 5, we present numerical experiments on MLE to show the effectiveness of this method. In Section 6, we show the application of MLE in the S-VAE, whose latent space includes high-dimensional data on a hypersphere.

First, the distribution $f$ of $p$ independent normal random variables $X_{i} \sim N\left( \mu_{i}, \frac{1}{2\theta_{i}} \right)$ $(i = 1, \dots, p)$ is
\[
f(x_{1}, \dots, x_{p}) = \frac{\left( \prod_{i=1}^{p} \theta_{i} \right)^{1/2}}{\pi^{p/2}} \exp\left( -\sum_{i=1}^{p} \theta_{i} (x_{i} - \mu_{i})^{2} \right).
\]
We then apply the variable transform
\[
\begin{cases}
r = \sum_{i=1}^{p} x_{i}^{2} = x^{T} x, \\
\varphi = (\varphi_{1}, \dots, \varphi_{p}) = \left( \dfrac{x_{1}}{r^{1/2}}, \dots, \dfrac{x_{p}}{r^{1/2}} \right) = \dfrac{x}{r^{1/2}}
\end{cases}
\]
to $f(x_{1}, \dots, x_{p})$ and integrate it with respect to $\varphi$. Then, the marginalized distribution becomes
\[
f_{\mathrm{mrg}}(r) = \frac{1}{2} \pi^{-p/2} \left( \prod_{i=1}^{p} \theta_{i}^{1/2} \right) \hat{C}(r\theta, r^{1/2}\gamma) \times \exp\left( -\frac{1}{4} \sum_{i=1}^{p} \frac{\gamma_{i}^{2}}{\theta_{i}} \right) r^{p/2 - 1}, \tag{1}
\]
where
\[
\gamma = (2\theta_{1}\mu_{1}, \dots, 2\theta_{p}\mu_{p})
\]
and
\[
\hat{C}(r\theta, r^{1/2}\gamma) = \int_{S^{p-1}} \exp\left( -\sum_{i=1}^{p} \left( r\theta_{i}\varphi_{i}^{2} - r^{1/2}\gamma_{i}\varphi_{i} \right) \right) \mathrm{d}_{S^{p-1}}(\varphi). \tag{2}
\]
When $r = 1$, Equation (2) matches the definition of the normalizing constant $C(\theta, \gamma)$. As a result, based on Equation (1), we obtain
\[
C(\theta, \gamma) = 2\pi^{p/2} \left( \prod_{i=1}^{p} \theta_{i}^{-1/2} \right) f_{\mathrm{mrg}}(1) \exp\left( \frac{1}{4} \sum_{i=1}^{p} \frac{\gamma_{i}^{2}}{\theta_{i}} \right). \tag{3}
\]
Therefore, if the distribution $f_{\mathrm{mrg}}(r)$ can be represented in a one-dimensional integration form, the goal will be
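The shift identity $C(\theta + cI, \gamma) = e^{-c} C(\theta, \gamma)$, the sign symmetry $C(\theta, |\gamma|) = C(\theta, \gamma)$, and Equation (3) can be sanity-checked numerically in the smallest case $p = 2$, where the sphere is the unit circle. The following Python sketch is only an illustration under stated assumptions: the function names (`fb_constant_circle`, `normal_pdf`, `marginal_density_at_one`), the parameter values, and the quadrature rules (periodic trapezoidal rule and Gauss-Legendre) are hypothetical choices made here. It does not use the continuous Euler transform and is not the method proposed in this paper.

```python
import numpy as np


def fb_constant_circle(theta, gamma, n=2000):
    """Approximate C(theta, gamma) for p = 2 with the periodic trapezoidal rule.

    The unit circle is parametrized as x = (cos t, sin t), 0 <= t < 2*pi.
    """
    theta = np.asarray(theta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    t = 2.0 * np.pi * np.arange(n) / n
    x = np.stack([np.cos(t), np.sin(t)])  # shape (2, n)
    exponent = (-theta[:, None] * x**2 + gamma[:, None] * x).sum(axis=0)
    return 2.0 * np.pi / n * np.exp(exponent).sum()


def normal_pdf(x, mu, theta):
    """Density of N(mu, 1 / (2 * theta)) evaluated at x."""
    return np.sqrt(theta / np.pi) * np.exp(-theta * (x - mu) ** 2)


def marginal_density_at_one(theta, mu, nodes=64):
    """Density of r = X_1^2 + X_2^2 at r = 1 for X_i ~ N(mu_i, 1 / (2 * theta_i)).

    Assumption of this sketch: the density is obtained by convolving the
    densities of X_1^2 and X_2^2; the substitution s = sin(u)^2 removes the
    endpoint singularities, and Gauss-Legendre is applied on [0, pi / 2].
    """
    z, w = np.polynomial.legendre.leggauss(nodes)  # nodes and weights on [-1, 1]
    u = (z + 1.0) * np.pi / 4.0                    # mapped to [0, pi / 2]
    w = w * np.pi / 4.0
    s, c = np.sin(u), np.cos(u)
    a = normal_pdf(s, mu[0], theta[0]) + normal_pdf(-s, mu[0], theta[0])
    b = normal_pdf(c, mu[1], theta[1]) + normal_pdf(-c, mu[1], theta[1])
    return 0.5 * np.sum(w * a * b)


# Hypothetical test parameters.
theta = np.array([1.0, 2.5])
gamma = np.array([0.7, -0.3])
c = 0.8

C = fb_constant_circle(theta, gamma)

# Shift identity: C(theta + c*I, gamma) = exp(-c) * C(theta, gamma).
shift_gap = abs(fb_constant_circle(theta + c, gamma) - np.exp(-c) * C)

# Sign symmetry: C(theta, |gamma|) = C(theta, gamma).
sign_gap = abs(fb_constant_circle(theta, np.abs(gamma)) - C)

# Equation (3) with p = 2, using gamma_i = 2 * theta_i * mu_i.
mu = gamma / (2.0 * theta)
C_from_eq3 = (2.0 * np.pi * np.prod(theta) ** -0.5
              * marginal_density_at_one(theta, mu)
              * np.exp(0.25 * np.sum(gamma**2 / theta)))
eq3_gap = abs(C_from_eq3 - C)

print(shift_gap, sign_gap, eq3_gap)
```

All three printed gaps should be at the level of floating-point rounding. The check of Equation (3) confirms the constants $2\pi^{p/2}$, $\prod_i \theta_i^{-1/2}$, and the exponential factor for $p = 2$ only; it says nothing about cost or accuracy in high dimensions, which is where a one-dimensional representation of $f_{\mathrm{mrg}}$ matters.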