Maximum likelihood estimation of the Fisher–Bingham distribution via efficient calculation of its normalizing constant

Yici Chen · Ken'ichiro Tanaka

arXiv:2004.14660v1 [stat.CO] 30 Apr 2020

Abstract This paper proposes an efficient numerical integration formula to compute the normalizing constant of Fisher–Bingham distributions. This formula applies a numerical integration formula based on the continuous Euler transform to a Fourier-type integral representation of the normalizing constant. As this method is fast and accurate, it can be applied to the calculation of the normalizing constant of high-dimensional Fisher–Bingham distributions. More precisely, the error decays exponentially with an increase in the number of integration points, and the computation cost increases linearly with the dimension. In addition, this formula is useful for calculating the gradient and Hessian matrix of the normalizing constant. Therefore, we apply this formula to perform maximum likelihood estimation (MLE) efficiently for high-dimensional data. Finally, we apply the MLE to the hyperspherical variational auto-encoder (S-VAE), a deep-learning-based generative model that restricts the latent space to a unit hypersphere. We use an S-VAE trained on images of handwritten digits to estimate the distribution of each label. This application is useful for adding new labels to the models.

Keywords Fisher–Bingham distributions · continuous Euler transform · high-dimensional data · maximum likelihood estimation · hyperspherical variational auto-encoder

Y. Chen
Department of Information Science and Technology, The University of Tokyo
E-mail: [email protected]

K. Tanaka
Department of Mathematical Informatics, The University of Tokyo, Tokyo, Japan
E-mail: [email protected]

1 Introduction

1.1 Fisher–Bingham distribution

The Fisher–Bingham distribution is defined as a multivariate normal distribution restricted to the unit sphere.

Definition 1 For a p-dimensional multivariate normal distribution with mean µ and variance-covariance matrix Σ, the Fisher–Bingham distribution is given by the density function

f(x; \mu, \Sigma) := \frac{1}{C} \exp\left( -\frac{x^\top \Sigma^{-1} x}{2} + x^\top \Sigma^{-1} \mu \right) \mathrm{d}S_{p-1}(x),

where x ∈ R^p and

C = C\left( \frac{\Sigma^{-1}}{2}, \Sigma^{-1}\mu \right) := \int_{S^{p-1}} \exp\left( -\frac{x^\top \Sigma^{-1} x}{2} + x^\top \Sigma^{-1} \mu \right) \mathrm{d}S_{p-1}(x)

is the normalizing constant and dS_{p-1}(x) is the uniform measure on the (p − 1)-dimensional sphere S^{p-1}.
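To make Definition 1 concrete, the following sketch estimates the normalizing constant by naive Monte Carlo integration over the unit sphere. It is only an illustrative small-p baseline, not the integration formula proposed in this paper; it assumes dS_{p-1} is the unnormalized surface measure, and the parameter values are arbitrary.

```python
import numpy as np
from scipy.special import gamma


def mc_normalizing_constant(Sigma, mu, n_samples=200_000, seed=0):
    """Naive Monte Carlo estimate of C(Sigma^{-1}/2, Sigma^{-1} mu).

    Integrates exp(-x^T Sigma^{-1} x / 2 + x^T Sigma^{-1} mu) over the unit
    sphere S^{p-1}, treating dS_{p-1} as the (unnormalized) surface measure.
    This is a baseline for small p only, not the method of the paper.
    """
    rng = np.random.default_rng(seed)
    p = len(mu)
    # Uniform points on S^{p-1}: normalize standard Gaussian vectors.
    x = rng.standard_normal((n_samples, p))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    Sigma_inv = np.linalg.inv(Sigma)
    exponent = (-0.5 * np.einsum("ni,ij,nj->n", x, Sigma_inv, x)
                + x @ (Sigma_inv @ mu))
    surface_area = 2 * np.pi ** (p / 2) / gamma(p / 2)  # |S^{p-1}|
    return surface_area * np.exp(exponent).mean()


# Illustrative 3-dimensional example (parameter values are arbitrary).
Sigma = np.diag([1.0, 0.5, 0.25])
mu = np.array([0.3, -0.2, 0.1])
print(mc_normalizing_constant(Sigma, mu))
```

Such a brute-force estimate quickly becomes inefficient as p grows, which is precisely the difficulty addressed below.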
The Fisher–Bingham distribution plays an essential role in directional statistics, which is concerned with data on various manifolds, especially data represented on a high-dimensional sphere. For example, wind direction and the geomagnetic field are common types of data that can be represented on the sphere S^2. In addition, data on a hypersphere are used in link prediction for networks and in image generation. Therefore, the Fisher–Bingham distribution, a normal distribution restricted to the unit sphere, is commonly used in this field.

However, the spherical domain causes some problems when using Fisher–Bingham distributions. One such problem is calculating the normalizing constant. As it is difficult to calculate analytically, a numerical method is necessary. The saddlepoint approximation method is a numerical method for computing the normalizing constant C(θ, γ) developed by Kume and Wood (2005). Another approach, the holonomic gradient method considered by Kume and Sei (2018), computes the normalizing constant as well. However, these methods have some limitations. The saddlepoint approximation method is not as accurate as the holonomic gradient method, which is theoretically exact because the problem of calculating C(θ, γ) is mathematically characterized by solving an ODE. However, the holonomic gradient method is computationally expensive and cannot be applied to calculate the normalizing constant of high-dimensional distributions. Hence, it is necessary to create a numerical method that is efficient, numerically stable, and accurate.

To construct such a numerical method, the following facts about Fisher–Bingham distributions are required (Kume and Sei (2018)).

Since any orthogonal transformation of S^{p-1} is isometric, the parameter dimension is reduced from p × p + p to 2p by singular value decomposition. Therefore, we have

C\left( \frac{\Sigma^{-1}}{2}, \Sigma^{-1}\mu \right) = C\left( \frac{\Delta^{-1}}{2}, \Delta^{-1} O \mu \right),

where \Delta = \mathrm{diag}(\delta_1^2, \cdots, \delta_p^2) and O is the orthogonal matrix obtained from \Sigma = O^\top \Delta O. Thus, without loss of generality, we can assume that the variance-covariance matrix Σ is diagonal. After reducing the parameter dimension to 2p, the normalizing constant becomes

C\left( \frac{\Delta^{-1}}{2}, \Delta^{-1} O \mu \right) = C(\theta, \gamma) := \int_{S^{p-1}} \exp\left( \sum_{i=1}^{p} \left( -\theta_i x_i^2 + \gamma_i x_i \right) \right) \mathrm{d}S_{p-1}(x),

where

\theta = (\theta_1, \cdots, \theta_p) = \left( \frac{1}{2\delta_1^2}, \cdots, \frac{1}{2\delta_p^2} \right) = \mathrm{diag}\left( \frac{\Delta^{-1}}{2} \right)

and

\gamma = (\gamma_1, \cdots, \gamma_p) = \Delta^{-1} O \mu.

Since x is restricted to the unit sphere (so that \sum_{i=1}^{p} x_i^2 = 1), we have

C(\theta + cI, \gamma)
= \int_{S^{p-1}} \exp\left( \sum_{i=1}^{p} \left( -(\theta_i + c) x_i^2 + \gamma_i x_i \right) \right) \mathrm{d}S_{p-1}(x)
= \int_{S^{p-1}} \exp\left( -c + \sum_{i=1}^{p} \left( -\theta_i x_i^2 + \gamma_i x_i \right) \right) \mathrm{d}S_{p-1}(x)
= e^{-c} \int_{S^{p-1}} \exp\left( \sum_{i=1}^{p} \left( -\theta_i x_i^2 + \gamma_i x_i \right) \right) \mathrm{d}S_{p-1}(x)
= e^{-c} C(\theta, \gamma),

where c is a real number and I = (1, 1, \cdots, 1) \in \mathbb{R}^p. If we put

f(x; \theta, \gamma) := \frac{1}{C(\theta, \gamma)} \exp\left( \sum_{i=1}^{p} \left( -\theta_i x_i^2 + \gamma_i x_i \right) \right) \mathrm{d}S_{p-1}(x),

then we have

f(x; \theta + cI, \gamma)
= \frac{1}{C(\theta + cI, \gamma)} \exp\left( \sum_{i=1}^{p} \left( -(\theta_i + c) x_i^2 + \gamma_i x_i \right) \right) \mathrm{d}S_{p-1}(x)
= \frac{e^{c}}{C(\theta, \gamma)} \exp\left( -c + \sum_{i=1}^{p} \left( -\theta_i x_i^2 + \gamma_i x_i \right) \right) \mathrm{d}S_{p-1}(x)
= \frac{1}{C(\theta, \gamma)} \exp\left( \sum_{i=1}^{p} \left( -\theta_i x_i^2 + \gamma_i x_i \right) \right) \mathrm{d}S_{p-1}(x)
= f(x; \theta, \gamma).

As a result, if the normalizing constant C(θ, γ) is obtained, C(θ + cI, γ) can also be obtained. Moreover, for maximum likelihood estimation (MLE), since f(x; θ, γ) = f(x; θ + cI, γ), θ can be shifted to θ + cI for any c ∈ R.

Additionally, because the unit sphere is symmetric under coordinate sign changes,

C(\theta, |\gamma|) = \int_{S^{p-1}} \exp\left( \sum_{i=1}^{p} \left( -\theta_i x_i^2 + |\gamma_i| x_i \right) \right) \mathrm{d}S_{p-1}(x) = \int_{S^{p-1}} \exp\left( \sum_{i=1}^{p} \left( -\theta_i x_i^2 + \gamma_i x_i \right) \right) \mathrm{d}S_{p-1}(x) = C(\theta, \gamma).

As a result, it can be assumed that γ has non-negative entries when calculating the normalizing constant.
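The two identities above are easy to check numerically. The following sketch (illustration only, with arbitrary parameter values and dS_{p-1} again taken as the surface measure) estimates C(θ, γ) by Monte Carlo and compares both sides of the shift identity C(θ + cI, γ) = e^{-c} C(θ, γ) and of the sign symmetry C(θ, |γ|) = C(θ, γ).

```python
import numpy as np
from scipy.special import gamma


def mc_C(theta, gam, n_samples=200_000, seed=0):
    """Monte Carlo estimate of C(theta, gamma) for diagonalized parameters,
    treating dS_{p-1} as the surface measure (illustration only)."""
    rng = np.random.default_rng(seed)
    p = len(theta)
    x = rng.standard_normal((n_samples, p))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    exponent = x ** 2 @ (-theta) + x @ gam
    area = 2 * np.pi ** (p / 2) / gamma(p / 2)
    return area * np.exp(exponent).mean()


theta = np.array([0.5, 1.0, 2.0])   # arbitrary illustrative parameters
gam = np.array([0.8, -0.3, 0.2])
c = 1.5

# Shift invariance: C(theta + c*I, gamma) = exp(-c) * C(theta, gamma).
print(mc_C(theta + c, gam), np.exp(-c) * mc_C(theta, gam))
# Sign symmetry: C(theta, |gamma|) = C(theta, gamma).
print(mc_C(theta, np.abs(gam)), mc_C(theta, gam))
```

With a fixed seed the shift identity holds exactly, sample by sample, because Σ x_i² = 1 on the sphere; the sign symmetry agrees only up to Monte Carlo error.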
1.2 Aim of this paper

In this paper,

1. we propose an efficient numerical integration formula to compute the normalizing constant;
2. we apply this formula to perform MLE;
3. we apply the MLE to the latent variables of the hyperspherical variational auto-encoder (S-VAE) (Davidson et al. (2018)).

The normalizing constant of Fisher–Bingham distributions can be represented in a Fourier integration form. Therefore, we can use the numerical integration formula with the continuous Euler transform introduced by Ooura (2001). Note that the continuous Euler transform is useful for calculating both the normalizing constant and the MLE.

This method can be applied to the MLE of high-dimensional data, such as the latent variables of the S-VAE (Davidson et al. (2018)), a generative model used in machine learning. The dimension of the latent space of the hyperspherical variational auto-encoder depends on the complexity of the data; for human face data, for example, the latent variables may have 100 dimensions.

1.3 Organization of this paper

This paper is organized as follows. In Section 2, we make some general remarks about the Fisher–Bingham distribution and the Fourier transform representation of the normalizing constant. In Section 3, we explain the continuous Euler transform and its use for the numerical computation of the normalizing constant. In Section 4, we discuss the calculation of the gradient of the normalizing constant, which is necessary for MLE. Subsequently, the MLE algorithm is provided. In Section 5, we present some numerical experiments on MLE to show the effectiveness of this method. In Section 6, we show the application of the MLE to the S-VAE, whose latent space contains high-dimensional data on a hypersphere.

First, the distribution f of p independent normal random variables X_i ∼ N(µ_i, 1/(2θ_i)) (i = 1, ..., p) is

f(x_1, \cdots, x_p) = \frac{\sqrt{\prod_{i=1}^{p} \theta_i}}{\pi^{\frac{p}{2}}} \exp\left( -\sum_{i=1}^{p} \theta_i (x_i - \mu_i)^2 \right).

We then apply the variable transform

r = \sum_{i=1}^{p} x_i^2 = x^\top x, \qquad \phi = (\phi_1, \cdots, \phi_p) = \left( \frac{x_1}{r^{1/2}}, \cdots, \frac{x_p}{r^{1/2}} \right) = \frac{x}{r^{1/2}}

to f(x_1, ..., x_p) and integrate it with respect to φ. The marginalized distribution then becomes

f_{\mathrm{mrg}}(r) = \frac{1}{2} \pi^{-\frac{p}{2}} \prod_{i=1}^{p} \theta_i^{\frac{1}{2}} \, \hat{C}(r\theta, r^{\frac{1}{2}}\gamma) \exp\left( -\frac{1}{4} \sum_{i=1}^{p} \frac{\gamma_i^2}{\theta_i} \right) r^{\frac{p}{2}-1},   (1)

where γ = (2θ_1µ_1, ..., 2θ_pµ_p) and

\hat{C}(r\theta, r^{\frac{1}{2}}\gamma) = \int_{S^{p-1}} \exp\left( -\sum_{i=1}^{p} \left( r\theta_i \phi_i^2 - r^{\frac{1}{2}} \gamma_i \phi_i \right) \right) \mathrm{d}S_{p-1}(\phi).   (2)

When r = 1, Equation (2) matches the definition of the normalizing constant C(θ, γ). As a result, based on Equation (1), we obtain

C(\theta, \gamma) = 2 \pi^{\frac{p}{2}} \prod_{i=1}^{p} \theta_i^{-\frac{1}{2}} f_{\mathrm{mrg}}(1) \exp\left( \frac{1}{4} \sum_{i=1}^{p} \frac{\gamma_i^2}{\theta_i} \right).   (3)

Therefore, if the distribution f_mrg(r) can be represented in a one-dimensional integration form, the goal will be achieved.
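As a rough numerical check of Equation (3) (not part of the paper's method; the parameter values below are illustrative and dS_{p-1} is again taken to be the surface measure), one can estimate f_mrg(1) with a kernel density estimate of r = ||X||² from samples of the independent normals X_i ~ N(µ_i, 1/(2θ_i)), plug it into Equation (3), and compare the result with a direct Monte Carlo estimate of C(θ, γ) on the sphere.

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
p = 3
theta = np.array([0.5, 1.0, 2.0])   # arbitrary illustrative parameters
mu = np.array([0.4, -0.1, 0.2])
gam = 2 * theta * mu                # gamma_i = 2 * theta_i * mu_i

# Estimate f_mrg(1): sample X_i ~ N(mu_i, 1/(2 theta_i)), form r = ||X||^2,
# and evaluate a kernel density estimate of r at r = 1.
n = 200_000
X = mu + rng.standard_normal((n, p)) / np.sqrt(2 * theta)
r = np.sum(X ** 2, axis=1)
f_mrg_1 = gaussian_kde(r)(np.array([1.0]))[0]

# Equation (3): recover C(theta, gamma) from f_mrg(1).
C_eq3 = (2 * np.pi ** (p / 2) / np.prod(np.sqrt(theta))
         * f_mrg_1 * np.exp(0.25 * np.sum(gam ** 2 / theta)))

# Direct Monte Carlo estimate of C(theta, gamma) on the sphere, for comparison.
x = rng.standard_normal((n, p))
x /= np.linalg.norm(x, axis=1, keepdims=True)
area = 2 * np.pi ** (p / 2) / gamma(p / 2)
C_mc = area * np.mean(np.exp(x ** 2 @ (-theta) + x @ gam))

print(C_eq3, C_mc)  # should agree up to Monte Carlo and KDE error
```

The two estimates agree only up to sampling and kernel-density error; the paper instead represents f_mrg through a one-dimensional Fourier-type integral, which is what allows a fast and accurate computation of C(θ, γ) in high dimensions.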
