
arXiv:1801.04658v1 [cs.IT] 15 Jan 2018

Information Geometric Approach to Bayesian Lower Error Bounds

M. Ashok Kumar
Discipline of Mathematics
Indian Institute of Technology
Indore, Madhya Pradesh, India
Email: [email protected]

Kumar Vijay Mishra
IIHR - Hydroscience & Engineering
The University of Iowa
Iowa City, IA, USA
Email: [email protected]

Abstract—Information geometry describes a framework where probability densities can be viewed as differential geometry structures. This approach has shown that the geometry in the space of probability distributions that are parameterized by their covariance matrix is linked to the fundamental concepts of estimation theory. In particular, prior work proposes a Riemannian metric - the distance between the parameterized probability distributions - that is equivalent to the Fisher Information Matrix, and helpful in obtaining the deterministic Cramér-Rao lower bound (CRLB). Recent work in this framework has led to establishing links with several practical applications. However, classical CRLB is useful only for unbiased estimators and inaccurately predicts the mean square error in low signal-to-noise (SNR) scenarios. In this paper, we propose a general Riemannian metric that, at once, is used to obtain both Bayesian CRLB and deterministic CRLB along with their vector parameter extensions. We also extend our results to the Barankin bound, thereby enhancing their applicability to low SNR situations.

I. INTRODUCTION

Information geometry is a study of statistical models from a Riemannian geometric perspective. The differential geometric modeling methods were introduced to statistics by C. R. Rao in his seminal paper [1] and later formally developed by Cencov [2]. The information geometric concept of a manifold that describes the parameterized probability distributions has garnered considerable interest in recent years. The main advantages include structures in the space of probability distributions that are invariant to non-singular transformation of parameters [3], robust estimation of covariance matrices [4], and usage of the Fisher Information Matrix (FIM) as a statistical metric [5].

Information geometry has now transcended its initial statistical scope and expanded to several novel research areas including, but not limited to, Finsler geometry [6], optimal transport geometry [7], Fisher-Rao Riemannian geometry [8], and quantum information geometry [9]. Many problems in science and engineering use probability distributions and, therefore, information geometry has been used as a useful and rigorous tool for analyses in applications such as neural networks [10, 11], optimization [12, 13], radar systems [14, 15], communications [16], computer vision [6], and machine learning [17, 18]. More recently, several developments in deep learning [19, 20] that employ various approximations to the FIM to calculate the gradient descent have incorporated information geometric concepts.

Information geometry bases the distance between parameterized probability distributions on the FIM. In estimation theory, the well-known deterministic Cramér-Rao lower bound (CRLB) is derived from the inverse of the FIM [3]. Nearly all prior works in information geometry exploited this link directly in their analyses, because the CRLB is the most widely used benchmark for evaluating the mean square error (MSE) performance of an estimator. However, the classical CRLB holds only if the estimator is unbiased. In general, the estimators are biased in many practical problems such as nonparametric regression [21], communication [22], and radar [23]. The above-mentioned information geometric framework then ceases its utility in these cases.

Moreover, the classical CRLB is a tight bound only when the errors are small. It is well known that, for nonlinear estimation problems with finite support of the parameters, for example the time delay estimation in radar [23], the performance of the estimator is characterized by the presence of three distinct SNR regions [24]. When the observations are large or the signal-to-noise-ratio (SNR) is high (asymptotic region), the CRLB describes the MSE accurately. In the case of few observations or low SNR regions, the information from the signal observations is insufficient and the estimator is hugely corrupted by noise. Here, the MSE is close to that obtained via only a priori information, that is, a quasi-uniform random variable on the parameter support. In between these two limiting cases lies the threshold region, where the signal observations are subjected to ambiguities and the estimator MSE increases sharply due to the outlier effect. The CRLB is useful only in the asymptotic region and breaks down in the threshold region due to the increase in noise.

The Bayesian CRLB [21] is typically used for assessing the quality of estimators when the parameters to be estimated are random with an a priori probability density function. It is similar to the deterministic CRLB except that it assumes the parameters to be random. The Bayesian CRLB addresses these drawbacks of the classical CRLB and provides a more accurate predictor of the MSE.

In this paper, we propose the information geometric framework to develop a general Riemannian metric that can be modified to link to both Bayesian and deterministic CRLB. To address the problem of the threshold effect, other bounds that are tighter than the CRLB have been developed (see e.g. [25] for an overview) to accurately identify the SNR thresholds that define the ambiguity region. In particular, the Barankin bound [26] is a fundamental statistical tool to understand the threshold effect for unbiased estimators. In simple terms, the threshold effect can be understood as the region where the Barankin bound deviates from the CRLB [27]. In this paper, we show that our metric is also applicable to the Barankin bound. Hence, compared to previous works [3], our information geometric approach to minimum bounds on MSE holds good for both the Bayesian CRLB and deterministic CRLB, their vector equivalents, and the threshold effect through the Barankin bound.

The paper is organized as follows: In the next section, we provide a brief background to information geometry and describe the notation used in the later sections. Further, we explain the dual structure of the manifolds because, in most applications, the underlying manifolds are dually flat. Here, we define a divergence function between two points in a manifold. In Section III, we establish the connection between the Riemannian metric and the Kullback-Leibler divergence for the Bayesian case. The manifold of all discrete probability distributions is dually flat and the Kullback-Leibler divergence plays a key role here. We also show that, under certain conditions, our approach yields the previous results from [3]. In Section IV, we state and prove our main result applicable to several other bounds before providing concluding remarks in Section V.

II. INFORMATION GEOMETRY: A BRIEF BACKGROUND

An n-dimensional manifold is a Hausdorff and second countable topological space which is locally homeomorphic to the Euclidean space of dimension n [28-30]. A Riemannian manifold is a real differentiable manifold in which the tangent space at each point is a finite dimensional Hilbert space and, therefore, equipped with an inner product. The collection of all these inner products is called a Riemannian metric.

In the information geometry framework, the statistical models play the role of a manifold and the Fisher information matrix and its various generalizations play the role of a Riemannian metric. Formally, by a statistical manifold, we mean a parametric family of probability distributions S = {p_θ : θ ∈ Θ} with a "continuously varying" parameter space Θ (statistical model). The dimension of a statistical manifold is the dimension of the parameter space. For example, S = {N(µ, σ²) : µ ∈ R, σ² > 0} is a two dimensional statistical manifold. The tangent space at a point of S is a linear space that corresponds to a "local linearization" at that point. The tangent space at a point p of S is denoted by T_p(S). The elements of T_p(S) are called tangent vectors of S at p. A Riemannian metric at a point p of S is an inner product defined for any pair of tangent vectors of S at p.

In this paper, let us restrict to statistical manifolds defined on a finite set X = {a_1, ..., a_d}. Let P := P(X) denote the space of all probability distributions on X. Let S ⊂ P be a sub-manifold. Let θ = (θ_1, ..., θ_k) be a parameterization of S. Given a divergence function¹ D on S, Eguchi [31] defines a Riemannian metric on S by the matrix

    G^(D)(θ) = [ g^(D)_{i,j}(θ) ],

where

    g^(D)_{i,j}(θ) := −D[∂_i, ∂_j] := − (∂/∂θ_i)(∂/∂θ'_j) D(p_θ, p_θ') |_{θ'=θ},

where g_{i,j} is the element in the ith row and jth column of the matrix G, θ = (θ_1, ..., θ_n), θ' = (θ'_1, ..., θ'_n), and dual affine connections ∇^(D) and ∇^(D*), with connection coefficients described by the following Christoffel symbols

    Γ^(D)_{ij,k}(θ) := −D[∂_i ∂_j, ∂_k] := − (∂/∂θ_i)(∂/∂θ_j)(∂/∂θ'_k) D(p_θ, p_θ') |_{θ'=θ}

and

    Γ^(D*)_{ij,k}(θ) := −D[∂_k, ∂_i ∂_j] := − (∂/∂θ_k)(∂/∂θ'_i)(∂/∂θ'_j) D(p_θ, p_θ') |_{θ'=θ},

such that ∇^(D) and ∇^(D*) form a dualistic structure in the sense that

    ∂_k g^(D)_{i,j} = Γ^(D)_{ki,j} + Γ^(D*)_{kj,i},                                      (1)

where D*(p, q) = D(q, p).

¹By a divergence, we mean a non-negative function D defined on S × S such that D(p, q) = 0 iff p = q.
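Eguchi's construction can be checked numerically. When D is the Kullback-Leibler divergence, the induced metric is exactly the Fisher information. The short sketch below is our own illustration and not part of the paper; it assumes a scalar-parameter Bernoulli family purely for concreteness, approximates g^(D)(θ) = −(∂/∂θ)(∂/∂θ') D(p_θ, p_θ')|_{θ'=θ} by a central finite difference, and recovers the closed-form Fisher information 1/(θ(1 − θ)).

import numpy as np

def kl_bernoulli(t, tp):
    # KL divergence D(p_t || p_tp) between Bernoulli(t) and Bernoulli(tp).
    return t * np.log(t / tp) + (1 - t) * np.log((1 - t) / (1 - tp))

def eguchi_metric(D, theta, h=1e-4):
    # g(theta) = - d/dtheta d/dtheta' D(p_theta, p_theta') at theta' = theta,
    # approximated by a symmetric finite difference (illustrative, scalar parameter).
    mixed = (D(theta + h, theta + h) - D(theta + h, theta - h)
             - D(theta - h, theta + h) + D(theta - h, theta - h)) / (4 * h * h)
    return -mixed

theta = 0.3
print(eguchi_metric(kl_bernoulli, theta))   # approximately 4.7619
print(1.0 / (theta * (1 - theta)))          # Fisher information 1/(theta(1-theta)) = 4.7619...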

III. FISHER INFORMATION MATRIX FOR THE BAYESIAN CASE

Eguchi's theory in Section II can also be extended to the space P̃(X) of all positive measures on X, that is, P̃ = {p̃ : X → (0, ∞)}. Let S = {p_θ : θ = (θ_1, ..., θ_k) ∈ Θ} be a k-dimensional sub-manifold of P and let

    S̃ := {p̃_θ(x) = p_θ(x)λ(θ) : p_θ ∈ S},                                              (2)

where λ is a probability distribution on Θ. Then S̃ is a (k+1)-dimensional sub-manifold of P̃. For p̃_θ, p̃_θ' ∈ P̃, the Kullback-Leibler divergence (KL-divergence) between p̃_θ and p̃_θ' is given by

    I(p̃_θ ‖ p̃_θ') = Σ_x p̃_θ(x) log [ p̃_θ(x) / p̃_θ'(x) ] − Σ_x p̃_θ(x) + Σ_x p̃_θ'(x)
                  = Σ_x p_θ(x)λ(θ) log [ p_θ(x)λ(θ) / (p_θ'(x)λ(θ')) ] − λ(θ) + λ(θ').   (3)

We define a Riemannian metric G^(I)(θ) = [g^(I)_{i,j}(θ)] on S̃ by

    g^(I)_{i,j}(θ) := −I[∂_i ‖ ∂_j]
      = − (∂/∂θ_i)(∂/∂θ'_j) Σ_x p_θ(x)λ(θ) log [ p_θ(x)λ(θ) / (p_θ'(x)λ(θ')) ] |_{θ'=θ}
      = Σ_x ∂_i(p_θ(x)λ(θ)) · ∂_j log(p_θ(x)λ(θ))
      = Σ_x p_θ(x)λ(θ) ∂_i(log p_θ(x)λ(θ)) · ∂_j(log p_θ(x)λ(θ))
      = λ(θ) Σ_x p_θ(x) [∂_i(log p_θ(x)) + ∂_i(log λ(θ))] · [∂_j(log p_θ(x)) + ∂_j(log λ(θ))]
      = λ(θ) { E_θ[∂_i log p_θ(X) · ∂_j log p_θ(X)] + ∂_i(log λ(θ)) · ∂_j(log λ(θ)) }     (4)
      = λ(θ) { g^(e)_{i,j}(θ) + J^λ_{i,j}(θ) },                                           (5)

where

    g^(e)_{i,j}(θ) := E_θ[∂_i log p_θ(X) · ∂_j log p_θ(X)]                                (6)

and

    J^λ_{i,j}(θ) := ∂_i(log λ(θ)) · ∂_j(log λ(θ)).                                        (7)

Let G^(e)(θ) := [g^(e)_{i,j}(θ)] and J^λ(θ) := [J^λ_{i,j}(θ)]. Then

    G^(I)(θ) = λ(θ) [ G^(e)(θ) + J^λ(θ) ].                                                (8)

Notice that G^(e)(θ) is the usual Fisher information matrix.
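To make the decomposition in (8) concrete, the following numerical sketch is our own illustration and not part of the paper. It assumes a Bernoulli(θ) family on X = {0, 1} and, purely for illustration, a Beta(2,2) prior λ(θ) = 6θ(1 − θ); it computes G^(I)(θ) directly from the score of the unnormalized density p̃_θ(x) = p_θ(x)λ(θ) and compares it with λ(θ)[G^(e)(θ) + J^λ(θ)].

import numpy as np

theta = 0.3
lam = lambda t: 6 * t * (1 - t)                    # illustrative Beta(2,2) prior density on Theta
dlog_lam = lambda t: 1 / t - 1 / (1 - t)           # d/dtheta log lambda(theta)

# Direct computation of G^(I)(theta) from the unnormalized density
# p~_theta(x) = p_theta(x) * lambda(theta) on the alphabet {0, 1}.
p = np.array([1 - theta, theta])                   # p_theta(x)
dlog_p = np.array([-1 / (1 - theta), 1 / theta])   # d/dtheta log p_theta(x)
score_tilde = dlog_p + dlog_lam(theta)             # d/dtheta log p~_theta(x)
G_I = lam(theta) * np.sum(p * score_tilde ** 2)

# Right-hand side of (8): lambda(theta) * [ G^(e)(theta) + J^lambda(theta) ]
G_e = np.sum(p * dlog_p ** 2)                      # usual Fisher information, here 1/(theta(1-theta))
J_lam = dlog_lam(theta) ** 2
print(G_I, lam(theta) * (G_e + J_lam))             # the two expressions agree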
Also observe that P̃ is a subset of R^X̃, where X̃ := X ∪ {a_{d+1}}. The tangent space at every point of P̃ is A_0 := {A ∈ R^X̃ : Σ_{x∈X̃} A(x) = 0}. That is, T_p(P̃) = A_0. We denote a tangent vector (that is, an element of A_0) by X^(m). The manifold P̃ can be recognized by its homeomorphic image {log p̃ : p̃ ∈ P̃} under the mapping p̃ ↦ log p̃. Under this mapping the tangent vector X ∈ T_p̃(P̃) can be represented by X^(e), which is defined by X^(e)(x) = X^(m)(x)/p̃(x), and we define

    T^(e)_p̃(P̃) = {X^(e) : X ∈ T_p̃(P̃)} = {A ∈ R^X̃ : E_p̃[A] = 0}.

For the natural basis ∂_i of a coordinate system θ = (θ_i), (∂_i)^(m) = ∂_i p̃_θ and (∂_i)^(e) = ∂_i log p̃_θ.

With these notations, for any two tangent vectors X, Y ∈ T_p̃(P̃), the Fisher metric in (5) can be written as

    ⟨X, Y⟩_p̃ = E_p̃[X^(e) Y^(e)].

Let S̃ be a sub-manifold of P̃ of the form as in (2), together with the metric G^(I) as in (5). Let T*_p̃(S̃) be the dual space (cotangent space) of the tangent space T_p̃(S̃) and, for each Y ∈ T_p̃(S̃), let ω_Y ∈ T*_p̃(S̃) be the element which maps X to ⟨X, Y⟩_p̃. The correspondence Y ↦ ω_Y is a linear map between T_p̃(S̃) and T*_p̃(S̃). An inner product and a norm on T*_p̃(S̃) are naturally inherited from T_p̃(S̃) by

    ⟨ω_X, ω_Y⟩_p̃ := ⟨X, Y⟩_p̃

and

    ‖ω_X‖_p̃ := ‖X‖_p̃ = √⟨X, X⟩_p̃.

Now, for a (smooth) real function f on S̃, the differential of f at p̃, (df)_p̃, is a member of T*_p̃(S̃) which maps X to X(f). The gradient of f at p̃ is the tangent vector corresponding to (df)_p̃, hence it satisfies

    (df)_p̃(X) = X(f) = ⟨(grad f)_p̃, X⟩_p̃                                                (9)

and

    ‖(df)_p̃‖²_p̃ = ⟨(grad f)_p̃, (grad f)_p̃⟩_p̃.                                           (10)

Since grad f is a tangent vector, we can write

    grad f = Σ_{i=1}^k h_i ∂_i                                                            (11)

for some scalars h_i. Applying (9) with X = ∂_j, for each j = 1, ..., k, and using (11), we get

    (∂_j)(f) = ⟨ Σ_{i=1}^k h_i ∂_i, ∂_j ⟩ = Σ_{i=1}^k h_i ⟨∂_i, ∂_j⟩ = Σ_{i=1}^k h_i g_{i,j},   j = 1, ..., k.

From this, we have

    [h_1, ..., h_k]^T = G^{-1} [∂_1(f), ..., ∂_k(f)]^T,

and so

    grad f = Σ_{i,j} g^{i,j} ∂_j(f) ∂_i.                                                  (12)

From (9), (10), and (12), we get

    ‖(df)_p̃‖²_p̃ = Σ_{i,j} g^{i,j} ∂_j(f) ∂_i(f),                                         (13)

where g^{i,j} is the (i, j)th entry of the inverse of G^(I). The above and the following results are indeed an extension of [3, Sec. 2.5] to P̃.

Theorem 1 ([3]): Let A : X → R be any mapping (that is, a vector in R^X). Let E[A] : P̃ → R be the mapping p̃ ↦ E_p̃[A]. We then have

    Var_p̃(A) = ‖(dE_p̃[A])_p̃‖²_p̃.                                                        (14)

Proof: For any tangent vector X ∈ T_p̃(P̃), let us consider

    X(E_p̃[A]) = Σ_x X(x)A(x)
              = E_p̃[X^(e)_p̃ · A]                                                         (15)
              = E_p̃[X^(e)_p̃ (A − E_p̃[A])].                                               (16)

Since A − E_p̃[A] ∈ T^(e)_p̃(P̃), there exists Y ∈ T_p̃(P̃) such that A − E_p̃[A] = Y^(e)_p̃, and grad(E[A]) = Y. Hence

    ‖(dE[A])_p̃‖²_p̃ = E_p̃[Y^(e)_p̃ Y^(e)_p̃] = E_p̃[(A − E_p̃[A])²].

Corollary 2 ([3]): If S̃ is a submanifold of P̃, then

    Var_p̃[A] ≥ ‖(dE[A]|_S̃)_p̃‖²_p̃                                                         (17)

with equality iff

    A − E_p̃[A] ∈ {X^(e)_p̃ : X ∈ T_p̃(S̃)} =: T^(e)_p̃(S̃).

Proof: Since (grad E[A]|_S̃)_p̃ is the orthogonal projection of (grad E[A])_p̃ onto T_p̃(S̃), the result follows from the theorem.
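Before turning to the error bounds, here is a minimal numerical illustration (ours, not from the paper) of the variance inequality (17) in its deterministic, scalar specialization (λ(θ) ≡ 1): for any statistic A, Corollary 2 gives Var_θ[A] ≥ (∂_θ E_θ[A])² / G^(e)(θ). The sketch assumes a Binomial(n, θ) model and an arbitrary nonlinear statistic, and checks the inequality by direct summation over the finite alphabet.

import numpy as np
from math import comb

n, theta, h = 10, 0.3, 1e-5
x = np.arange(n + 1)
A = (x / n) ** 2                                # an arbitrary (biased, nonlinear) statistic

def pmf(t):
    # Binomial(n, t) probabilities on the alphabet {0, 1, ..., n}.
    return np.array([comb(n, k) * t ** k * (1 - t) ** (n - k) for k in x])

mean_A = lambda t: np.sum(pmf(t) * A)

var_A = np.sum(pmf(theta) * (A - mean_A(theta)) ** 2)
dmean = (mean_A(theta + h) - mean_A(theta - h)) / (2 * h)   # numerical d/dtheta E_theta[A]
G_e = n / (theta * (1 - theta))                 # Fisher information of Binomial(n, theta)

print(var_A, dmean ** 2 / G_e)                  # roughly 0.0094 >= 0.0086: the bound holds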
IV. DERIVATION OF ERROR BOUNDS

We state our main result in the following theorem.

Theorem 3: Let S and S̃ be as in (2). Let θ̂ be an estimator of θ. Then

(a) Bayesian Cramér-Rao:

        E_λ[Var_θ(θ̂)] ≥ E_λ[G^(I)(θ)]^{-1},                                              (18)

    where Var_θ(θ̂) = [Cov_θ(θ̂_i(X), θ̂_j(X))] is the covariance matrix and G^(I)(θ) is as in (8). (In (18), we use the usual convention that, for two matrices A and B, A ≥ B means that A − B is positive semi-definite.)

(b) Deterministic Cramér-Rao (unbiased): If θ̂ is an unbiased estimator of θ, then

        Var_θ[θ̂] ≥ [G^(e)(θ)]^{-1}.                                                      (19)

(c) Deterministic Cramér-Rao (biased): For any estimator θ̂ of θ,

        MSE_θ[θ̂] ≥ (1 + B'(θ)) [G^(e)(θ)]^{-1} (1 + B'(θ)) + b(θ)b(θ)^T,

    where b(θ) = (b_1(θ), ..., b_k(θ)) := E_θ[θ̂] − θ is the bias and 1 + B'(θ) is the matrix whose (i, j)th entry is 0 if i ≠ j and is (1 + ∂_i b_i(θ)) if i = j.

(d) Barankin bound (scalar case): If θ̂ is an unbiased estimator of θ, then

        Var_θ[θ̂] ≥ sup_{n, a_l, θ^(l)}  [ Σ_{l=1}^n a_l (θ^(l) − θ) ]² / Σ_x [ Σ_{l=1}^n a_l L_{θ^(l)}(x) ]² p_θ(x),   (20)

    where L_{θ^(l)}(x) := p_{θ^(l)}(x) / p_θ(x) and the supremum is over all a_1, ..., a_n ∈ R, n ∈ N, and θ^(1), ..., θ^(n) ∈ Θ.

Proof:

(a) Let A = Σ_{i=1}^k c_i θ̂_i, where θ̂ = (θ̂_1, ..., θ̂_k) is an unbiased estimator of θ, in Corollary 2. Then, from (17), we have

        Σ_{i,j} c_i c_j Cov_θ̃(θ̂_i, θ̂_j) ≥ Σ_{i,j} c_i c_j (g^(I))^{i,j}(θ).

    This implies that

        λ(θ) Σ_{i,j} c_i c_j Cov_θ(θ̂_i, θ̂_j) ≥ Σ_{i,j} c_i c_j (g^(I))^{i,j}(θ).          (21)

    Hence, taking expectation on both sides with respect to λ, we get

        Σ_{i,j} c_i c_j E_λ[Cov_θ(θ̂_i, θ̂_j)] ≥ Σ_{i,j} c_i c_j E_λ[(g^(I))^{i,j}(θ)].

    That is,

        E_λ[Var_θ(θ̂)] ≥ E_λ[G^(I)(θ)^{-1}].

    But E_λ[G^(I)(θ)^{-1}] ≥ E_λ[G^(I)(θ)]^{-1} by [32]. This proves the result.

(b) This follows from (21) by taking λ(θ) = 1.

(c) Let us first observe that θ̂ is an unbiased estimator of θ + b(θ). Let A = Σ_{i=1}^k c_i θ̂_i as before. Then E[A] = Σ_{i=1}^k c_i (θ_i + b_i(θ)). Then, from Corollary 2 and (17), we have

        Var_θ[θ̂] ≥ (1 + B'(θ)) [G^(e)(θ)]^{-1} (1 + B'(θ)).

    But MSE_θ[θ̂] = Var_θ[θ̂] + b(θ)b(θ)^T. This proves the assertion.

(d) For fixed a_1, ..., a_n ∈ R and θ^(1), ..., θ^(n) ∈ Θ, let us define a metric by the following formula

        g(θ) := Σ_x [ Σ_{l=1}^n a_l L_{θ^(l)}(x) ]² p_θ(x).                               (22)

    Let f be the mapping p ↦ E_p[A]. Let A(·) = θ̂(·) − θ, where θ̂ is an unbiased estimator of θ. Then the partial derivative of f in (13) equals

        Σ_x (θ̂(x) − θ) [ Σ_{l=1}^n a_l L_{θ^(l)}(x) ] p_θ(x)
          = Σ_{l=1}^n a_l Σ_x (θ̂(x) − θ) [ p_{θ^(l)}(x) / p_θ(x) ] p_θ(x)
          = Σ_{l=1}^n a_l (θ^(l) − θ).

    Hence from (13) and Corollary 2, we have

        Var_θ[θ̂] ≥ [ Σ_{l=1}^n a_l (θ^(l) − θ) ]² / Σ_x [ Σ_{l=1}^n a_l L_{θ^(l)}(x) ]² p_θ(x).

    Since a_l and θ^(l) are arbitrary, taking the supremum over all a_1, ..., a_n ∈ R, n ∈ N, and θ^(1), ..., θ^(n) ∈ Θ, we get (20).
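As a rough numerical companion to part (d), the following sketch is our own and not from the paper. For a fixed set of test points θ^(1), ..., θ^(m), the supremum over the weights a_l in (20) has the closed form dᵀB⁻¹d, where d_l = θ^(l) − θ and B_{lm} = Σ_x p_{θ^(l)}(x) p_{θ^(m)}(x)/p_θ(x); this follows from the same Cauchy-Schwarz argument used in the proof. The sketch assumes a Binomial(n, θ) model and an arbitrary choice of test points; the Barankin bound itself is the further supremum over all such choices.

import numpy as np
from math import comb

n, theta = 10, 0.3
x = np.arange(n + 1)

def pmf(t):
    # Binomial(n, t) probabilities on {0, 1, ..., n}.
    return np.array([comb(n, k) * t ** k * (1 - t) ** (n - k) for k in x])

test_points = np.array([0.2, 0.3, 0.4, 0.5])    # arbitrary illustrative test points theta^(l)
P = np.array([pmf(t) for t in test_points])     # row l holds p_{theta^(l)}(x)
B = (P / pmf(theta)) @ P.T                      # B_{lm} = sum_x p_l(x) p_m(x) / p_theta(x)
d = test_points - theta

fixed_point_bound = d @ np.linalg.solve(B, d)   # sup over the weights a_l for these test points
crlb = theta * (1 - theta) / n                  # [G^(e)(theta)]^{-1}, attained here by X/n

print(fixed_point_bound, crlb)                  # both are valid lower bounds for unbiased estimators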
V. CONCLUSION

We have shown that our Theorem 3 provides a general information geometric characterization of the statistical manifolds linking them to the Bayesian CRLB for vector parameters; the extension to estimators of measurable functions of the parameter θ is trivial. We exploited the general definition of the Kullback-Leibler divergence when the probability densities are not normalized. This is an improvement over Amari's work [3] on information geometry, which only dealt with the notion of deterministic CRLB of scalar parameters. Further, we proposed an approach to arrive at the Barankin bound, thereby shedding light on the relation between the threshold effect and information geometry. Both of our improvements enable usage of information geometric approaches in critical scenarios of biased estimators and low SNRs. This is especially useful in the analyses of many practical problems such as radar and communication. In future investigations, we intend to explore these methods further, especially in the context of the threshold effect.

REFERENCES

[1] C. R. Rao, "Information and the accuracy attainable in the estimation of statistical parameters," Bulletin of Calcutta Mathematical Society, vol. 37, pp. 81–91, 1945.
[2] N. N. Cencov, Statistical decision rules and optimal inference (Translations of mathematical monographs). American Mathematical Society, 2000, no. 53.
[3] S. Amari and H. Nagaoka, Methods of information geometry. American Mathematical Society, Oxford University Press, 2000, vol. 191.
[4] B. Balaji, F. Barbaresco, and A. Decurninge, "Information geometry and estimation of Toeplitz covariance matrices," in IEEE Radar Conference, 2014, pp. 1–4.
[5] F. Nielsen, "Cramér-Rao lower bound and information geometry," arXiv preprint arXiv:1301.3578, 2013.
[6] S. J. Maybank, S. Ieng, and R. Benosman, "A Fisher-Rao metric for paracatadioptric images of lines," International Journal of Computer Vision, vol. 99, no. 2, pp. 147–165, 2012.
[7] Z. Shen, "Riemann-Finsler geometry with applications to information geometry," Chinese Annals of Mathematics - Series B, vol. 27, no. 1, pp. 73–94, 2006.
[8] W. Gangbo and R. J. McCann, "The geometry of optimal transportation," Acta Mathematica, vol. 177, no. 2, pp. 113–161, 1996.
[9] M. R. Grasselli and R. F. Streater, "On the uniqueness of the Chentsov metric in quantum information geometry," Infinite Dimensional Analysis, Quantum Probability and Related Topics, vol. 4, no. 02, pp. 173–182, 2001.
[10] S. Amari, "Information geometry of neural networks: An overview," in Mathematics of Neural Networks, ser. Operations Research/Computer Science Interfaces Series, S. W. Ellacott, J. C. Mason, and I. J. Anderson, Eds. Springer US, 1997, vol. 8.
[11] ——, "Information geometry of neural learning and belief propagation," in IEEE International Conference on Neural Information Processing, vol. 2, 2002, pp. 886–.
[12] S. Amari and M. Yukawa, "Minkovskian gradient for sparse optimization," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 4, pp. 576–585, 2013.
[13] H.-G. Beyer, "Convergence analysis of evolutionary algorithms that are based on the paradigm of information geometry," Evolutionary Computation, vol. 22, no. 4, pp. 679–709, 2014.
[14] E. de Jong and R. Pribić, "Design of radar grid cells with constant information distance," in IEEE Radar Conference, 2014, pp. 1–5.
[15] F. Barbaresco, "Innovative tools for radar signal processing based on Cartan's geometry of SPD matrices & information geometry," in IEEE Radar Conference, 2008, pp. 1–6.
[16] M. Coutino, R. Pribić, and G. Leus, "Direction of arrival estimation based on information geometry," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 3066–3070.
[17] K. Sun and S. Marchand-Maillet, "An information geometry of statistical manifold learning," in International Conference on Machine Learning, 2014, pp. 1–9.
[18] S. Amari, "Natural gradient works efficiently in learning," Neural Computation, vol. 10, no. 2, pp. 251–276, 1998.
[19] G. Desjardins, K. Simonyan, R. Pascanu et al., "Natural neural networks," in Advances in Neural Information Processing Systems, 2015, pp. 2071–2079.
[20] N. L. Roux, P.-A. Manzagol, and Y. Bengio, "Topmoumoute online natural gradient algorithm," in Advances in Neural Information Processing Systems, 2008, pp. 849–856.
[21] R. D. Gill and B. Y. Levit, "Applications of the van Trees inequality: A Bayesian Cramér-Rao bound," Bernoulli, pp. 59–79, 1995.
[22] R. Prasad and C. R. Murthy, "Cramér-Rao-type bounds for sparse Bayesian learning," IEEE Transactions on Signal Processing, vol. 61, no. 3, pp. 622–632, 2013.
[23] K. V. Mishra and Y. C. Eldar, "Performance of time delay estimation in a cognitive radar," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 3141–3145.
[24] H. L. Van Trees, K. L. Bell, and Z. Tian, Detection Estimation and Modulation Theory, Part I: Detection, Estimation, and Filtering Theory, 2nd ed. Wiley, 2013.
[25] A. Renaux, P. Forster, P. Larzabal, C. D. Richmond, and A. Nehorai, "A fresh look at the Bayesian bounds of the Weiss-Weinstein family," IEEE Transactions on Signal Processing, vol. 56, no. 11, pp. 5334–5352, 2008.
[26] E. W. Barankin, "Locally best unbiased estimates," The Annals of Mathematical Statistics, pp. 477–501, 1949.
[27] L. Knockaert, "The Barankin bound and threshold behavior in frequency estimation," IEEE Transactions on Signal Processing, vol. 45, no. 9, pp. 2398–2401, 1997.
[28] S. Gallot, D. Hulin, and J. Lafontaine, Riemannian Geometry. Springer, 2004.
[29] J. Jost, Riemannian Geometry and Geometric Analysis. Springer, 2005.
[30] M. Spivak, A Comprehensive Introduction to Differential Geometry - Volume I. Publish or Perish Inc., 2005.
[31] S. Eguchi, "Geometry of minimum contrast," Hiroshima Mathematical Journal, vol. 22, no. 3, pp. 631–647, 1992.
[32] T. Groves and T. Rothenberg, "A note on the expected value of an inverse matrix," Biometrika, vol. 56, pp. 690–691, 1969.