


Generalized Bayesian Cramér-Rao Inequality via Information Geometry of Relative α-Entropy

Kumar Vijay Mishra† and M. Ashok Kumar‡
†United States CCDC Army Research Laboratory, Adelphi, MD 20783 USA
‡Department of Mathematics, Indian Institute of Technology Palakkad, 678557 India
Email: [email protected], [email protected]

Abstract—The relative α-entropy is the Rényi analog of relative entropy and arises prominently in information-theoretic problems. Recent information-geometric investigations of this quantity have enabled the generalization of the Cramér-Rao inequality, which provides a lower bound for the variance of an estimator of an escort of the underlying parametric distribution. However, this framework remains unexamined in the Bayesian setting. In this paper, we propose a general Riemannian metric based on relative α-entropy to obtain a generalized Bayesian Cramér-Rao inequality. This establishes a lower bound for the variance of an unbiased estimator for the α-escort distribution starting from an unbiased estimator for the underlying distribution. We show that, in the limiting case when the entropy order approaches unity, this framework reduces to the conventional Bayesian Cramér-Rao inequality. Further, in the absence of priors, the same framework yields the deterministic Cramér-Rao inequality.

Index Terms—Bayesian bounds, cross-entropy, Rényi entropy, Riemannian metric, Sundaresan divergence.

I. INTRODUCTION

In information geometry, a parameterized family of probability distributions is expressed as a manifold in the Riemannian space [1], in which the parameters form the coordinate system on the manifold and the metric is given by the Fisher information matrix (FIM) [2]. This framework reduces certain important information-theoretic problems to investigations of different Riemannian manifolds [3]. This perspective is helpful in analyzing many problems in engineering and sciences where probability distributions are used, including optimization [4], signal processing [5], machine learning [6], optimal transport [7], and quantum information [8].

In particular, when the separation between two points on the manifold is defined by the Kullback-Leibler divergence (KLD) or relative entropy between two probability distributions p and q on a finite state space X = {0, 1, 2, ..., M}, i.e.,

I(p, q) := ∑_{x∈X} p(x) log [ p(x)/q(x) ],    (1)

then the resulting Riemannian metric is defined by the FIM [9]. This method of defining a Riemannian metric on statistical manifolds from a general divergence function is due to Eguchi [10]. Since the FIM is the inverse of the well-known deterministic Cramér-Rao lower bound (CRLB), the information-geometric results are directly connected with those of estimation theory. Further, the relative entropy is related to the Shannon entropy H(p) := −∑_{x∈X} p(x) log p(x) by I(p, u) = log |X| − H(p), where u is the uniform distribution on X.

It is, therefore, instructive to explore information-geometric frameworks for key estimation-theoretic results. For example, the Bayesian CRLB [11, 12] is the lower bound analogous to the CRLB for random parameters: it assumes the parameters to be random with an a priori probability density function. In [13], we derived the Bayesian CRLB using a general definition of the KLD for probability densities that are not normalized.

Recently, [14] studied the information geometry of Rényi entropy [15], which is a generalization of Shannon entropy. In the source coding problem where normalized cumulants of compressed lengths are considered instead of expected compressed lengths, Rényi entropy is used as a measure of uncertainty [16]. The Rényi entropy of p of order α, α ≥ 0, α ≠ 1, is defined to be H_α(p) := 1/(1−α) · log ∑_x p(x)^α. In the mismatched source distribution version of this problem, the Rényi analog of relative entropy is the relative α-entropy [17, 18]. The relative α-entropy of p with respect to q (or the Sundaresan divergence between p and q) is defined as

I_α(p, q) := α/(1−α) · log ∑_x p(x) q(x)^{α−1} − 1/(1−α) · log ∑_x p(x)^α + log ∑_x q(x)^α.    (2)

It follows that, as α → 1, we have I_α(p, q) → I(p‖q) and H_α(p) → H(p) [19]. Rényi entropy and relative α-entropy are related by the equation I_α(p, u) = log |X| − H_α(p). Relative α-entropy is closely related to the Csiszár f-divergence D_f as

I_α(p, q) = α/(1−α) · log [ sgn(1−α) · D_f(p^{(α)}, q^{(α)}) + 1 ],    (3)

where p^{(α)}(x) := p(x)^α / ∑_y p(y)^α, q^{(α)}(x) := q(x)^α / ∑_y q(y)^α, and f(u) = sgn(1−α)·(u^{1/α} − 1), u ≥ 0 [19, Sec. II]. The measures p^{(α)} and q^{(α)} are called α-escort or α-scaled measures [20, 21]. It is easy to show that, indeed, the right side of (3) is the Rényi divergence between p^{(α)} and q^{(α)} of order 1/α (see the numerical sketch at the end of this section).

The Rényi entropy and relative α-entropy arise in several important information-theoretic problems such as guessing [18, 22, 23] and task encoding [24]. Relative α-entropy arises in statistics as a generalized likelihood function robust to outliers [25], [26]. It also shares many interesting properties with relative entropy; see, e.g., [19, Sec. II] for a summary. For example, relative α-entropy behaves like squared Euclidean distance and satisfies a Pythagorean property in a way similar to relative entropy [13, 19]. This property helps in establishing a computation method [26] for a robust estimation procedure [27].

Motivated by such analogous relationships, our previous work [14] investigated the relative α-entropy from a differential-geometric perspective. In particular, we applied Eguchi's method with relative α-entropy as the divergence function to obtain the resulting statistical manifold with a general Riemannian metric. This metric is specified by a Fisher information matrix that is the inverse of the so-called deterministic α-CRLB [19]. In this paper, we study the structure of statistical manifolds with respect to relative α-entropy in a Bayesian setting. This is a non-trivial extension of our work in [13], where we proposed a Riemannian metric arising from the relative entropy for the Bayesian case. In the process, we derive a general Bayesian Cramér-Rao inequality and the resulting Bayesian α-CRLB, which embed the compounded effects of both the Rényi order α and the Bayesian prior distribution. We show that, in limiting cases, the bound reduces to the deterministic α-CRLB (in the absence of a prior), the Bayesian CRLB (when α → 1), or the CRLB (no prior and α → 1).

The rest of the paper is organized as follows. In the next section, we provide the essential background on information geometry. We then introduce the definition of Bayesian relative α-entropy in Section III and show that it is a valid divergence function. In Section IV, we establish the connection between this divergence and the Riemannian metric, and we derive the Bayesian α-version of the Cramér-Rao inequality in Section V. Finally, we state our main result for the Bayesian α-CRLB in Section VI and conclude in Section VII.
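The following short numerical sketch (ours, for illustration only; the probability vectors and function names are arbitrary choices, not from the paper) evaluates H_α, the α-escorts, and I_α(p, q) of (2), checks that I_α approaches the KLD as α → 1, and verifies the remark after (3) that I_α(p, q) equals the Rényi divergence of order 1/α between p^{(α)} and q^{(α)}.

import numpy as np

def renyi_entropy(p, alpha):
    # H_alpha(p) = 1/(1-alpha) * log sum_x p(x)^alpha
    return np.log(np.sum(p**alpha)) / (1.0 - alpha)

def escort(p, alpha):
    # alpha-escort (alpha-scaled) measure p^(alpha)
    return p**alpha / np.sum(p**alpha)

def relative_alpha_entropy(p, q, alpha):
    # Sundaresan divergence I_alpha(p, q), Eq. (2)
    a = alpha
    return (a / (1 - a)) * np.log(np.sum(p * q**(a - 1))) \
           - (1 / (1 - a)) * np.log(np.sum(p**a)) \
           + np.log(np.sum(q**a))

def kld(p, q):
    return np.sum(p * np.log(p / q))

def renyi_divergence(p, q, order):
    # Rényi divergence of the given order between p and q
    return np.log(np.sum(p**order * q**(1 - order))) / (order - 1)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
alpha = 0.7

# I_alpha tends to the KLD as alpha -> 1
print(relative_alpha_entropy(p, q, 0.999), kld(p, q))

# Right side of (3): I_alpha(p, q) equals the Rényi divergence of order 1/alpha
# between the alpha-escorts of p and q
lhs = relative_alpha_entropy(p, q, alpha)
rhs = renyi_divergence(escort(p, alpha), escort(q, alpha), 1.0 / alpha)
print(lhs, rhs)   # agree up to floating-point error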

II. DESIDERATA FOR INFORMATION GEOMETRY

An n-dimensional manifold is a Hausdorff and second countable topological space which is locally homeomorphic to the Euclidean space of dimension n [2]. A Riemannian manifold is a real differentiable manifold in which the tangent space at each point is a finite-dimensional Hilbert space and, therefore, equipped with an inner product. The collection of all these inner products is the Riemannian metric. In information geometry, the statistical models play the role of the manifold, and the Fisher information matrix and its various generalizations play the role of a Riemannian metric. The statistical manifold here means a parametric family of probability distributions S = {p_θ : θ ∈ Θ} with a continuously varying parameter space Θ. The dimension of a statistical manifold is the dimension of the parameter space. For example, S = {N(µ, σ²) : µ ∈ R, σ² > 0} is a two-dimensional statistical manifold. The tangent space at a point of S is a linear space that corresponds to a "local linearization" at that point. The tangent space at a point p of S is denoted by T_p(S). The elements of T_p(S) are called tangent vectors of S at p. A Riemannian metric at a point p of S is an inner product defined for any pair of tangent vectors of S at p.

Let us restrict to statistical manifolds defined on a finite set X = {a_1, ..., a_d}. Let P := P(X) denote the space of all probability distributions on X. Let S ⊂ P be a sub-manifold. Let θ = (θ_1, ..., θ_k) be a parameterization of S. By a divergence, we mean a non-negative function D defined on S × S such that D(p, q) = 0 iff p = q. Given a divergence function D on S, Eguchi [28] defines a Riemannian metric on S by the matrix

G^{(D)}(θ) = [ g^{(D)}_{i,j}(θ) ],

where

g^{(D)}_{i,j}(θ) := −D[∂_i , ∂_j] := − (∂/∂θ′_j)(∂/∂θ_i) D(p_θ, p_{θ′}) |_{θ′=θ},

g^{(D)}_{i,j} is the entry in the i-th row and j-th column of the matrix G^{(D)}, θ = (θ_1, ..., θ_n), and θ′ = (θ′_1, ..., θ′_n). Eguchi also defines dual affine connections ∇^{(D)} and ∇^{(D∗)}, with connection coefficients described by the following Christoffel symbols

Γ^{(D)}_{ij,k}(θ) := −D[∂_i ∂_j , ∂_k] := − (∂/∂θ_i)(∂/∂θ_j)(∂/∂θ′_k) D(p_θ, p_{θ′}) |_{θ′=θ}

and

Γ^{(D∗)}_{ij,k}(θ) := −D[∂_k , ∂_i ∂_j] := − (∂/∂θ_k)(∂/∂θ′_i)(∂/∂θ′_j) D(p_θ, p_{θ′}) |_{θ′=θ},

such that ∇^{(D)} and ∇^{(D∗)} form a dualistic structure in the sense that

∂_k g^{(D)}_{i,j} = Γ^{(D)}_{ki,j} + Γ^{(D∗)}_{kj,i},    (4)

where D∗(p, q) := D(q, p).
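To illustrate Eguchi's construction (an illustrative sketch of ours; the Bernoulli family is an assumed example, not from the paper), the code below approximates g^{(D)}(θ) = −∂_{θ′}∂_θ D(p_θ, p_{θ′})|_{θ′=θ} by a finite-difference mixed partial derivative with D taken to be the KLD, and recovers the Fisher information 1/(θ(1−θ)), in line with the discussion around (1).

import numpy as np

def kld(p, q):
    return np.sum(p * np.log(p / q))

def bernoulli(theta):
    return np.array([1.0 - theta, theta])

def eguchi_metric(D, family, theta, h=1e-4):
    # g(theta) = - d/dtheta' d/dtheta D(p_theta, p_theta') at theta' = theta,
    # approximated by a central finite difference of the mixed partial derivative.
    def Dval(t, tp):
        return D(family(t), family(tp))
    return -(Dval(theta + h, theta + h) - Dval(theta + h, theta - h)
             - Dval(theta - h, theta + h) + Dval(theta - h, theta - h)) / (4 * h * h)

theta = 0.3
g = eguchi_metric(kld, bernoulli, theta)
fisher = 1.0 / (theta * (1.0 - theta))   # Fisher information of a Bernoulli(theta) model
print(g, fisher)                         # the two values agree up to discretization error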
III. RELATIVE α-ENTROPY IN THE BAYESIAN SETTING

We now introduce relative α-entropy in the Bayesian case. Define S = {p_θ : θ = (θ_1, ..., θ_k) ∈ Θ} as a k-dimensional sub-manifold of P and

S̃ := {p̃_θ(x) = p_θ(x)λ(θ) : p_θ ∈ S},    (5)

where λ is a probability distribution on Θ. Then S̃ is a sub-manifold of P̃. Let p̃_θ, p̃_{θ′} ∈ S̃. The relative entropy of p̃_θ with respect to p̃_{θ′} is (c.f. [29, Eq. (2.4)] and [13])

I(p̃_θ ‖ p̃_{θ′}) = ∑_x p̃_θ(x) log [ p̃_θ(x)/p̃_{θ′}(x) ] − ∑_x p̃_θ(x) + ∑_x p̃_{θ′}(x)
 = ∑_x p_θ(x)λ(θ) log [ p_θ(x)λ(θ) / (p_{θ′}(x)λ(θ′)) ] − λ(θ) + λ(θ′).

We define the relative α-entropy of p̃_θ with respect to p̃_{θ′} by

I_α(p̃_θ, p̃_{θ′}) := λ(θ)/(1−α) · log ∑_x p_θ(x)(λ(θ′)p_{θ′}(x))^{α−1}
 − λ(θ)/(α(1−α)) · log ∑_x p_θ(x)^α + λ(θ)/α · log ∑_x p_{θ′}(x)^α
 − λ(θ){1 − log λ(θ)} + λ(θ′).

We present the following Lemma 1, which shows that our definition of Bayesian relative α-entropy is not only a valid divergence function but also coincides with the KLD as α → 1.

Lemma 1.
1) I_α(p̃_θ, p̃_{θ′}) ≥ 0 with equality if and only if p̃_θ = p̃_{θ′}.
2) I_α(p̃_θ, p̃_{θ′}) → I(p̃_θ ‖ p̃_{θ′}) as α → 1.

Proof: 1) Let α > 1. Applying Hölder's inequality with Hölder conjugates α and α/(α−1), we have

∑_x p_θ(x)(λ(θ′)p_{θ′}(x))^{α−1} ≤ ‖p_θ‖ · λ(θ′)^{α−1} · ‖p_{θ′}‖^{α−1},

where ‖·‖ denotes the α-norm. When α < 1, the inequality is reversed. Hence

λ(θ)/(1−α) · log ∑_x p_θ(x)(λ(θ′)p_{θ′}(x))^{α−1}
 ≥ λ(θ)/(α(1−α)) · log ∑_x p_θ(x)^α − λ(θ) log λ(θ′) − λ(θ)/α · log ∑_x p_{θ′}(x)^α
 ≥ λ(θ)/(α(1−α)) · log ∑_x p_θ(x)^α − λ(θ) log λ(θ) + λ(θ) − λ(θ′) − λ(θ)/α · log ∑_x p_{θ′}(x)^α
 = λ(θ)[ 1/(α(1−α)) · log ∑_x p_θ(x)^α − {log λ(θ) − 1} ] − λ(θ)/α · log ∑_x p_{θ′}(x)^α − λ(θ′),

where the second inequality follows because, for x, y ≥ 0, x log(x/y) ≥ x − y, and hence x log y ≤ x log x − x + y. The last expression is exactly the negative of the sum of the remaining terms of I_α(p̃_θ, p̃_{θ′}); therefore I_α(p̃_θ, p̃_{θ′}) ≥ 0. The conditions for equality follow from the corresponding conditions in Hölder's inequality and in log x ≤ x − 1.

2) This follows by applying the L'Hôpital rule to the first term of I_α:

lim_{α→1} λ(θ)/(1−α) · log ∑_x p_θ(x)(λ(θ′)p_{θ′}(x))^{α−1}
 = λ(θ) lim_{α→1} [ − ∑_x p_θ(x)(λ(θ′)p_{θ′}(x))^{α−1} log(λ(θ′)p_{θ′}(x)) / ∑_x p_θ(x)(λ(θ′)p_{θ′}(x))^{α−1} ]
 = − ∑_x (λ(θ)p_θ(x)) log(λ(θ′)p_{θ′}(x)),

and from the fact that the Rényi entropy coincides with the Shannon entropy as α → 1, so that the remaining terms converge to ∑_x p̃_θ(x) log p̃_θ(x) − λ(θ) + λ(θ′).
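As a quick numerical illustration (ours; it uses the Bayesian relative α-entropy in the form written above together with randomly drawn distributions and weights), the code below checks the two assertions of Lemma 1: nonnegativity, and convergence to the unnormalized relative entropy I(p̃_θ ‖ p̃_{θ′}) as α → 1.

import numpy as np

rng = np.random.default_rng(0)

def unnormalized_kld(pt, qt):
    # I(pt || qt) for positive measures, as displayed above
    return np.sum(pt * np.log(pt / qt)) - np.sum(pt) + np.sum(qt)

def bayes_relative_alpha_entropy(p, lam_p, q, lam_q, alpha):
    # Bayesian relative alpha-entropy of p~ = lam_p * p with respect to q~ = lam_q * q,
    # in the form written above
    a = alpha
    t1 = (lam_p / (1 - a)) * np.log(np.sum(p * (lam_q * q) ** (a - 1)))
    t2 = -(lam_p / (a * (1 - a))) * np.log(np.sum(p ** a))
    t3 = (lam_p / a) * np.log(np.sum(q ** a))
    return t1 + t2 + t3 - lam_p * (1 - np.log(lam_p)) + lam_q

p = rng.random(4); p /= p.sum()
q = rng.random(4); q /= q.sum()
lam_p, lam_q = rng.uniform(0.1, 1.0, size=2)

# Lemma 1.1: nonnegativity for several orders alpha
for alpha in (0.3, 0.7, 1.5, 3.0):
    d = bayes_relative_alpha_entropy(p, lam_p, q, lam_q, alpha)
    print(alpha, d, d >= 0)

# Lemma 1.2: alpha -> 1 recovers the unnormalized relative entropy
print(bayes_relative_alpha_entropy(p, lam_p, q, lam_q, 0.9999),
      unnormalized_kld(lam_p * p, lam_q * q))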

IV. FISHER INFORMATION MATRIX FOR THE BAYESIAN CASE

The Eguchi theory we provided in Section II can also be extended to the space P̃(X) of all positive measures on X, that is, P̃ = {p̃ : X → (0, ∞)}. Following Eguchi [28], we define a Riemannian metric [g^{I_α}_{i,j}(θ)] on S̃ by

g^{I_α}_{i,j}(θ)
 = − (∂/∂θ′_j)(∂/∂θ_i) I_α(p̃_θ, p̃_{θ′}) |_{θ′=θ}
 = − 1/(1−α) · (∂/∂θ′_j)(∂/∂θ_i) [ λ(θ) log ∑_y p_θ(y)(λ(θ′)p_{θ′}(y))^{α−1} ] |_{θ′=θ}
   − (1/α) ∂_iλ(θ) · (∂/∂θ′_j) log ∑_x p_{θ′}(x)^α |_{θ′=θ},    (6)

since the terms of I_α that depend only on θ or only on θ′ do not contribute. For the second term of (6),

(∂/∂θ′_j) log ∑_x p_{θ′}(x)^α |_{θ′=θ} = α E_{θ^{(α)}}[∂_j log p_θ(X)],

so that this term equals −∂_iλ(θ) E_{θ^{(α)}}[∂_j log p_θ(X)], where E_{θ^{(α)}} denotes expectation with respect to the α-escort distribution p^{(α)}_θ. Carrying out the differentiation in the first term of (6) and evaluating at θ′ = θ gives

g^{I_α}_{i,j}(θ) = λ(θ)[ Cov_{θ^{(α)}}[∂_i log p_θ(X), ∂_j log p_θ(X)] + ∂_i log λ(θ) · { E_{θ^{(α)}}[∂_j log p_θ(X)] + ∂_j log λ(θ) } ] − ∂_iλ(θ) E_{θ^{(α)}}[∂_j log p_θ(X)]

 = λ(θ){ Cov_{θ^{(α)}}[∂_i log p_θ(X), ∂_j log p_θ(X)] + ∂_i log λ(θ) ∂_j log λ(θ) }

 = λ(θ)[ g^{(α)}_{i,j}(θ) + J^λ_{i,j}(θ) ],    (7)

where

g^{(α)}_{i,j}(θ) := Cov_{θ^{(α)}}[ ∂_i log p_θ(X), ∂_j log p_θ(X) ]    (8)

and

J^λ_{i,j}(θ) := ∂_i(log λ(θ)) · ∂_j(log λ(θ)).    (9)

Let G^{(α)}(θ) := [g^{(α)}_{i,j}(θ)], J^λ(θ) := [J^λ_{i,j}(θ)], and G^λ_α(θ) := G^{(α)}(θ) + J^λ(θ). Notice that, when α = 1, G^λ_α becomes G^λ, the usual Fisher information matrix in the Bayesian case (c.f. [13]).
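For concreteness, the following sketch (ours; the Bernoulli model and the Beta(2,2)-type prior λ(θ) = 6θ(1−θ) are illustrative assumptions, not from the paper) evaluates G^λ_α(θ) = g^{(α)}(θ) + J^λ(θ) and the metric λ(θ)G^λ_α(θ) of (7) for a scalar parameter, and confirms that g^{(α)} reduces to the ordinary Fisher information at α = 1.

import numpy as np

def escort(p, alpha):
    return p**alpha / np.sum(p**alpha)

def g_alpha(p_theta, dp_theta, alpha):
    # g^(alpha)(theta): covariance of d/dtheta log p_theta(X) under the alpha-escort of p_theta
    score = dp_theta / p_theta           # elementwise d/dtheta log p_theta(x)
    w = escort(p_theta, alpha)           # escort probabilities
    mean = np.sum(w * score)
    return np.sum(w * (score - mean) ** 2)

# Bernoulli(theta) on X = {0, 1} with an illustrative prior lambda(theta) = 6 theta (1 - theta)
theta = 0.3
p = np.array([1 - theta, theta])
dp = np.array([-1.0, 1.0])               # d/dtheta of (1 - theta, theta)

lam = 6 * theta * (1 - theta)            # lambda(theta)
dlog_lam = 1 / theta - 1 / (1 - theta)   # d/dtheta log lambda(theta)

for alpha in (0.5, 1.0, 2.0):
    G_lambda_alpha = g_alpha(p, dp, alpha) + dlog_lam**2   # G^lambda_alpha = g^(alpha) + J^lambda
    metric = lam * G_lambda_alpha                          # lambda(theta) [g^(alpha) + J^lambda], Eq. (7)
    print(alpha, metric)

# At alpha = 1, g^(alpha) reduces to the ordinary Fisher information 1/(theta(1-theta))
print(np.isclose(g_alpha(p, dp, 1.0), 1 / (theta * (1 - theta))))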
V. AN α-VERSION OF CRAMÉR-RAO INEQUALITY IN THE BAYESIAN SETTING

We now investigate the geometry of P̃ with respect to the metric G^λ_α. Later, we formulate an α-equivalent version of the Cramér-Rao inequality associated with a submanifold S̃. Observe that P̃ is a subset of R^{X̃}_+, where X̃ := X ∪ {a_{d+1}}. The tangent space at every point of P̃ is A_0 := {A ∈ R^{X̃} : ∑_{x∈X̃} A(x) = 0}; that is, T_p̃(P̃) = A_0. We denote a tangent vector (that is, an element of A_0) by X^{(m)}. The manifold P̃ can be recognized by its homeomorphic image {log p̃ : p̃ ∈ P̃} under the mapping p̃ ↦ log p̃. Under this mapping, the tangent vector X ∈ T_p̃(P̃) can be represented by X^{(e)}, which is defined by X^{(e)}(x) = X^{(m)}(x)/p̃(x), and we define

T^{(e)}_p̃(P̃) = {X^{(e)} : X ∈ T_p̃(P̃)} = {A ∈ R^{X̃} : E_p̃[A] = 0}.    (10)

Motivated by the expression for the Riemannian metric in (6), define

∂^{(α)}_i(p_θ(x)) := 1/(α−1) · ∂/∂θ′_i [ p_{θ′}(x)^{α−1} / ∑_y p_θ(y) p_{θ′}(y)^{α−1} ] |_{θ′=θ}
 = p_θ(x)^{α−2} ∂_i p_θ(x) / ∑_y p_θ(y)^α − p_θ(x)^{α−1} ∑_y p_θ(y)^{α−1} ∂_i p_θ(y) / ( ∑_y p_θ(y)^α )²
 = ( p^{(α)}_θ(x) / p_θ(x) ) [ ∂_i(log p_θ(x)) − E_{θ^{(α)}}[∂_i(log p_θ(X))] ].    (11)

We shall call the above an α-representation of ∂_i at p_θ. With this notation, the metric G^{(α)} is given by

g^{(α)}_{i,j}(θ) = ∑_x ∂_i p_θ(x) · ∂^{(α)}_j(p_θ(x)).

It should be noted that E_θ[∂^{(α)}_i(p_θ(X))] = 0. This follows since

∂^{(α)}_i(p_θ) = ( p^{(α)}_θ / (α p_θ) ) ∂_i log p^{(α)}_θ.

When α = 1, the right-hand side of (11) reduces to ∂_i(log p_θ). Motivated by (11), the α-representation of a tangent vector X at p̃ is

X^{(α)}_p̃(x) := ( p̃^{(α)}(x) / p̃(x) ) [ X^{(e)}_p̃(x) − E_{p̃^{(α)}}[X^{(e)}_p̃] ]
 = ( p^{(α)}(x) / p̃(x) ) [ X^{(e)}_p̃(x) − E_{p^{(α)}}[X^{(e)}_p̃] ],    (12)

where the last equality follows because p̃^{(α)} = p^{(α)}, the α-escort of the normalized version p of p̃. The collection of all such α-representations is

T^{(α)}_p̃(P̃) := {X^{(α)}_p̃ : X ∈ T_p̃(P̃)}.    (13)

Clearly, E_p̃[X^{(α)}_p̃] = 0. Also, any A ∈ R^{X̃} with E_p̃[A] = 0 can be written as

A = ( p^{(α)}(x) / p̃(x) ) ( B(x) − E_{p^{(α)}}[B] )    (14)

with B = B̃ − E_p̃[B̃], where

B̃(x) := ( p̃(x) / p^{(α)}(x) ) A(x).

In view of (10), we have T^{(e)}_p̃(P̃) = T^{(α)}_p̃(P̃). Now the inner product between any two tangent vectors X, Y ∈ T_p̃(P̃) defined by the α-information metric in (6) is

⟨X, Y⟩^{(α)}_p̃ := E_p̃[ X^{(e)}_p̃ Y^{(α)}_p̃ ].    (15)

Consider now an n-dimensional statistical manifold S̃, a submanifold of P̃, together with the metric G^{(α)} as in (15). Let T∗_p̃(S̃) be the dual space (cotangent space) of the tangent space T_p̃(S̃), and consider, for each Y ∈ T_p̃(S̃), the element ω_Y ∈ T∗_p̃(S̃) which maps X to ⟨X, Y⟩^{(α)}_p̃. The correspondence Y ↦ ω_Y is a linear map between T_p̃(S̃) and T∗_p̃(S̃). An inner product and a norm on T∗_p̃(S̃) are naturally inherited from T_p̃(S̃) by

⟨ω_X, ω_Y⟩_p̃ := ⟨X, Y⟩^{(α)}_p̃ and ‖ω_X‖_p̃ := ‖X‖_p̃ = √(⟨X, X⟩^{(α)}_p̃).

Now, for a (smooth) real function f on S̃, the differential of f at p̃, (df)_p̃, is a member of T∗_p̃(S̃) which maps X to X(f). The gradient of f at p̃ is the tangent vector corresponding to (df)_p̃; hence, it satisfies

(df)_p̃(X) = X(f) = ⟨(grad f)_p̃, X⟩^{(α)}_p̃    (16)

and

‖(df)_p̃‖²_p̃ = ⟨(grad f)_p̃, (grad f)_p̃⟩^{(α)}_p̃.    (17)

Since grad f is a tangent vector,

grad f = ∑_{i=1}^n h_i ∂_i    (18)

for some scalars h_i. Applying (16) with X = ∂_j, for each j = 1, ..., n, and using (18), we obtain

(∂_j)(f) = ⟨ ∑_{i=1}^n h_i ∂_i , ∂_j ⟩^{(α)} = ∑_{i=1}^n h_i ⟨∂_i, ∂_j⟩^{(α)} = ∑_{i=1}^n h_i g^{(α)}_{i,j},  j = 1, ..., n.

This yields

[h_1, ..., h_n]^T = (G^{(α)})^{−1} [∂_1(f), ..., ∂_n(f)]^T,

and so

grad f = ∑_{i,j} (g^{i,j})^{(α)} ∂_j(f) ∂_i.    (19)

From (16), (17), and (19), we get

‖(df)_p̃‖²_p̃ = ∑_{i,j} (g^{i,j})^{(α)} ∂_j(f) ∂_i(f),    (20)

where (g^{i,j})^{(α)} is the (i, j)-th entry of the inverse of G^{(α)}.

With these preliminaries, we now state our main results. These are analogous to those in [30, Sec. 2.5].

Theorem 2. Let A : X̃ → R be any mapping (that is, a vector in R^{X̃}). Let E[A] : P̃ → R be the mapping p̃ ↦ E_p̃[A]. We then have

Var_{p^{(α)}}[ ( p̃(X)/p^{(α)}(X) ) (A − E_p̃[A]) ] = ‖(dE_p̃[A])_p̃‖²_p̃.    (21)

Proof. For any tangent vector X ∈ T_p̃(P̃),

X(E_p̃[A]) = ∑_x X(x) A(x)
 = E_p̃[ X^{(e)}_p̃ · A ]    (22)
 = E_p̃[ X^{(e)}_p̃ (A − E_p̃[A]) ].    (23)

Since A − E_p̃[A] ∈ T^{(α)}_p̃(P̃) (c.f. (14)), there exists Y ∈ T_p̃(P̃) such that A − E_p̃[A] = Y^{(α)}_p̃, and grad(E[A]) = Y. Hence we see that

‖(dE[A])_p̃‖²_p̃
 = E_p̃[ Y^{(e)}_p̃ Y^{(α)}_p̃ ]
 = E_p̃[ Y^{(e)}_p̃ (A − E_p̃[A]) ]
 (a)= E_p̃[ { ( p̃(X)/p^{(α)}(X) ) (A − E_p̃[A]) + E_{p^{(α)}}[Y^{(e)}_p̃] } (A − E_p̃[A]) ]
 (b)= E_p̃[ ( p̃(X)/p^{(α)}(X) ) (A − E_p̃[A]) (A − E_p̃[A]) ]
 = E_{p^{(α)}}[ ( p̃(X)/p^{(α)}(X) ) (A − E_p̃[A]) · ( p̃(X)/p^{(α)}(X) ) (A − E_p̃[A]) ]
 = Var_{p^{(α)}}[ ( p̃(X)/p^{(α)}(X) ) (A − E_p̃[A]) ],

where equality (a) is obtained by applying (12) to Y and (b) follows because E_p̃[A − E_p̃[A]] = 0.

Corollary 3. If S̃ is a submanifold of P̃, then

Var_{p^{(α)}}[ ( p̃(X)/p^{(α)}(X) ) (A − E_p̃[A]) ] ≥ ‖(dE[A]|_S̃)_p̃‖²_p̃    (24)

with equality if and only if

A − E_p̃[A] ∈ {X^{(α)}_p̃ : X ∈ T_p̃(S̃)} =: T^{(α)}_p̃(S̃).

Proof. Since (grad E[A]|_S̃)_p̃ is the orthogonal projection of (grad E[A])_p̃ onto T_p̃(S̃), the proof follows from Theorem 2.

We use the aforementioned ideas to establish an α-version of the Cramér-Rao inequality for the α-escort of the underlying distribution. This gives a lower bound for the variance of an estimator of S^{(α)} starting from an unbiased estimator of S.
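As a small consistency check of the α-representation (11) (an illustrative sketch of ours; the three-point family below is an arbitrary choice, not from the paper), the following code verifies numerically that E_θ[∂^{(α)}(p_θ(X))] = 0 and that ∑_x ∂p_θ(x) ∂^{(α)}(p_θ(x)) equals the escort covariance of the score, i.e., g^{(α)}(θ).

import numpy as np

def escort(p, alpha):
    return p**alpha / np.sum(p**alpha)

# A one-parameter family p_theta on a 3-point alphabet (an arbitrary smooth
# illustrative choice): softmax of theta * [1, 2, 3].
def p_of(theta):
    z = np.exp(theta * np.array([1.0, 2.0, 3.0]))
    return z / z.sum()

theta, h, alpha = 0.4, 1e-6, 0.7
p = p_of(theta)
dp = (p_of(theta + h) - p_of(theta - h)) / (2 * h)   # d/dtheta p_theta(x)

score = dp / p                                       # d/dtheta log p_theta(x)
w = escort(p, alpha)
mean_score = np.sum(w * score)

# alpha-representation of the coordinate vector field, last line of (11)
d_alpha = (w / p) * (score - mean_score)

# E_theta[ d^(alpha)(p_theta(X)) ] = 0 (expectation under p_theta itself)
print(np.sum(p * d_alpha))

# g^(alpha)(theta) via the representation equals the escort covariance of the score
g_via_representation = np.sum(dp * d_alpha)
g_via_covariance = np.sum(w * (score - mean_score) ** 2)
print(g_via_representation, g_via_covariance)        # agree up to numerical error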

VI. DERIVATION OF ERROR BOUNDS

We state our main result in the following theorem.

Theorem 4 (Bayesian α-Cramér-Rao inequality). Let S = {p_θ : θ = (θ_1, ..., θ_m) ∈ Θ} be the given statistical model and let S̃ be as before. Let θ̂ = (θ̂_1, ..., θ̂_m) be an unbiased estimator of θ = (θ_1, ..., θ_m) for the statistical model S. Then

∫ Var_{θ^{(α)}}[ ( p̃_θ(X)/p^{(α)}_θ(X) ) (θ̂(X) − θ) ] dθ ≥ { E_λ[G^λ_α] }^{−1},    (25)

where E_{θ^{(α)}} denotes expectation with respect to p^{(α)}_θ. (In (25), we use the usual convention that, for two matrices A and B, A ≥ B means that A − B is positive semi-definite.)

Proof: Given an unbiased estimator θ̂ of θ for S̃, let A = ∑_{i=1}^m c_i θ̂_i for c = (c_1, ..., c_m) ∈ R^m. Then, from (24) and (20), we have

c Var_{θ^{(α)}}[ ( p̃_θ(X)/p^{(α)}_θ(X) ) (θ̂(X) − θ) ] c^t ≥ c { λ(θ) G^λ_α }^{−1} c^t.    (26)

Integrating the above over θ, we get

c [ ∫ Var_{θ^{(α)}}[ ( p̃_θ(X)/p^{(α)}_θ(X) ) (θ̂(X) − θ) ] dθ ] c^t ≥ c [ ∫ [λ(θ) G^λ_α]^{−1} dθ ] c^t.    (27)

But

∫ [λ(θ) G^λ_α]^{−1} dθ ≥ { E_λ[G^λ_α(θ)] }^{−1}    (28)

by [31]. This proves the result.

Remark 1.
1) The above result reduces to the usual Bayesian Cramér-Rao inequality when α = 1, as in [13].
2) When λ is the uniform distribution, we obtain the α-Cramér-Rao inequality as in [14].
3) When α = 1 and λ is the uniform distribution, this yields the usual deterministic Cramér-Rao inequality.
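Step (28) rests on the expected-inverse matrix inequality of Groves and Rothenberg [31], namely that E[X^{−1}] − (E[X])^{−1} is positive semi-definite for a random positive definite matrix X. The short sketch below (ours, purely illustrative) checks a discrete form of this inequality on randomly generated positive definite matrices.

import numpy as np

rng = np.random.default_rng(1)

def random_pd(k):
    # random symmetric positive definite k x k matrix
    M = rng.normal(size=(k, k))
    return M @ M.T + k * np.eye(k)

k, n = 3, 200
mats = [random_pd(k) for _ in range(n)]
weights = rng.random(n)
weights /= weights.sum()                 # a discrete "prior" over the matrices

mean_of_inverses = sum(w * np.linalg.inv(M) for w, M in zip(weights, mats))
inverse_of_mean = np.linalg.inv(sum(w * M for w, M in zip(weights, mats)))

gap = mean_of_inverses - inverse_of_mean
print(np.linalg.eigvalsh(gap))           # all eigenvalues are >= 0 (up to round-off)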
VII. CONCLUSION

We have shown that our Theorem 4 provides a general information-geometric characterization of statistical manifolds, linking them to the Bayesian α-CRLB for vector parameters; the extension to estimators of measurable functions of the parameter θ is trivial. We exploited the general definition of relative α-entropy in the Bayesian case. This is an improvement over Amari's work [9] on information geometry, which only dealt with the notion of deterministic CRLB for scalar parameters. Further, this is a generalization of our earlier information-geometric frameworks of Bayesian [13] and deterministic α-CRLB [14]. These improvements enable the use of information-geometric approaches for biased estimators and noisy situations, as in radar and communications problems [32].

REFERENCES

[1] M. Spivak, A Comprehensive Introduction to Differential Geometry - Volume I. Publish or Perish Inc., 2005.
[2] S. Gallot, D. Hulin, and J. Lafontaine, Riemannian Geometry. Springer, 2004.
[3] N. Ay, J. Jost, H. Vân Lê, and L. Schwachhöfer, Information Geometry. Springer, 2017, vol. 64.
[4] S. Amari and M. Yukawa, "Minkovskian gradient for sparse optimization," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 4, pp. 576–585, 2013.
[5] S.-i. Amari, Information Geometry and Its Applications. Springer, 2016, vol. 194.
[6] S. Amari, "Natural gradient works efficiently in learning," Neural Computation, vol. 10, no. 2, pp. 251–276, 1998.
[7] W. Gangbo and R. J. McCann, "The geometry of optimal transportation," Acta Mathematica, vol. 177, no. 2, pp. 113–161, 1996.
[8] M. R. Grasselli and R. F. Streater, "On the uniqueness of the Chentsov metric in quantum information geometry," Infinite Dimensional Analysis, Quantum Probability and Related Topics, vol. 4, no. 02, pp. 173–182, 2001.
[9] S. Amari and H. Nagaoka, Methods of Information Geometry. American Mathematical Society, Oxford University Press, 2000, vol. 191.
[10] S. Eguchi, "Geometry of minimum contrast," Hiroshima Mathematical Journal, vol. 22, no. 3, pp. 631–647, 1992.
[11] H. L. Van Trees, K. L. Bell, and Z. Tian, Detection Estimation and Modulation Theory, Part I: Detection, Estimation, and Filtering Theory, 2nd ed. Wiley, 2013.
[12] R. D. Gill and B. Y. Levit, "Applications of the van Trees inequality: A Bayesian Cramér-Rao bound," Bernoulli, pp. 59–79, 1995.
[13] M. A. Kumar and K. V. Mishra, "Information geometric approach to Bayesian lower error bounds," in IEEE International Symposium on Information Theory, 2018, pp. 746–750.
[14] ——, "Cramér-Rao lower bounds arising from generalized Csiszár divergences," arXiv preprint arXiv:2001.04769, 2020.
[15] A. Rényi et al., "On measures of entropy and information," in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, 1961, pp. 547–561.
[16] L. L. Campbell, "A coding theorem and Rényi's entropy," Information and Control, vol. 8, pp. 423–429, 1965.
[17] A. C. Blumer and R. J. McEliece, "The Rényi redundancy of generalized Huffman codes," IEEE Transactions on Information Theory, vol. 34, no. 5, pp. 1242–1249, September 1988.
[18] R. Sundaresan, "Guessing under source uncertainty," IEEE Transactions on Information Theory, vol. 53, no. 1, pp. 269–287, 2007.
[19] M. A. Kumar and R. Sundaresan, "Minimization problems based on relative α-entropy I: Forward projection," IEEE Transactions on Information Theory, vol. 61, no. 9, pp. 5063–5080, 2015.
[20] C. Tsallis, R. S. Mendes, and A. R. Plastino, "The role of constraints within generalized nonextensive statistics," Physica A, vol. 261, pp. 534–554, 1998.
[21] P. N. Karthik and R. Sundaresan, "On the equivalence of projections in relative α-entropy and Rényi divergence," in National Conference on Communications, 2018, pp. 1–6.
[22] E. Arıkan, "An inequality on guessing and its application to sequential decoding," IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 99–105, January 1996.
[23] W. Huleihel, S. Salamatian, and M. Médard, "Guessing with limited memory," in IEEE International Symposium on Information Theory, 2017, pp. 2253–2257.
[24] C. Bunte and A. Lapidoth, "Codes for tasks and Rényi entropy," IEEE Transactions on Information Theory, vol. 60, no. 9, pp. 5065–5076, September 2014.
[25] M. C. Jones, N. L. Hjort, I. R. Harris, and A. Basu, "A comparison of related density based minimum divergence estimators," Biometrika, vol. 88, no. 3, pp. 865–873, 2001.
[26] M. Ashok Kumar and R. Sundaresan, "Minimization problems based on relative α-entropy II: Reverse projection," IEEE Transactions on Information Theory, vol. 61, no. 9, pp. 5081–5095, 2015.
[27] H. Fujisawa and S. Eguchi, "Robust parameter estimation with a small bias against heavy contamination," Journal of Multivariate Analysis, vol. 99, pp. 2053–2081, 2008.
[28] S. Eguchi, "Geometry of minimum contrast," Hiroshima Mathematical Journal, vol. 22, no. 3, pp. 631–647, 1992.
[29] I. Csiszár, "Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems," The Annals of Statistics, vol. 19, no. 4, pp. 2032–2066, 1991.
[30] S. Amari and H. Nagaoka, Methods of Information Geometry. Oxford University Press, 2000.
[31] T. Groves and T. Rothenberg, "A note on the expected value of an inverse matrix," Biometrika, vol. 56, pp. 690–691, 1969.
[32] K. V. Mishra and Y. C. Eldar, "Performance of time delay estimation in a cognitive radar," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 3141–3145.