
BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES, Vol. 58, No. 1, 2010 DOI: 10.2478/v10175-010-0019-1

Information geometry of divergence functions

S. AMARI∗ and A. CICHOCKI RIKEN Brain Science Institute, Wako-shi, Hirosawa 2-1, Saitama 351-0198, Japan

Abstract. Measures of divergence between two points play a key role in many engineering problems. One such measure is a distance function, but there are many important measures which do not satisfy the properties of the distance. The Bregman divergence, Kullback-Leibler divergence and f-divergence are such measures. In the present article, we study the differential-geometrical structure of a manifold induced by a divergence function. It consists of a Riemannian metric and a pair of dually coupled affine connections, which are studied in information geometry. The class of Bregman divergences is characterized by a dually flat structure, which originates from the Legendre duality. A dually flat space admits a generalized Pythagorean theorem. The class of f-divergences, defined on a manifold of probability distributions, is characterized by information monotonicity, and the Kullback-Leibler divergence belongs to the intersection of both classes. The f-divergence always gives the α-geometry, which consists of the Fisher information metric and a dual pair of ±α-connections. The α-divergence is a special class of f-divergences. It is unique in that it sits at the intersection of the f-divergence and Bregman divergence classes in a manifold of positive measures. The geometry derived from the Tsallis q-entropy and related divergences is also addressed.

Key words: information geometry, divergence functions.

1. Introduction

Given two points P and Q in a space S, we may define a divergence D[P : Q] which measures their discrepancy. The standard distance is indeed such a measure. However, there are many other measures frequently used in many areas of applications (see, e.g., [1, 2]). In particular, for two probability distributions p(x) and q(x), one can define various measures D[p(x) : q(x)] such as the Kullback-Leibler divergence and the Hellinger distance. A divergence is not necessarily symmetric, that is, the relation D[P : Q] = D[Q : P] does not generally hold, nor does it satisfy the triangular inequality. It usually has the dimension of squared distance, and a Pythagorean-like relation holds in some cases.

The present paper aims at elucidating the differential-geometrical structure of a manifold equipped with a divergence function. We study the geometry induced by a divergence function, and demonstrate that it endows the manifold with a Riemannian metric and a pair of dually coupled affine connections [1]. One of the original contributions of the present paper is to give a unified geometrical framework for various divergence functions, which have a lot of applications [2, 3]. We use modern differential geometry, but do not follow its rigorous mathematical formalism; instead we try to give intuitive explanations understandable to those who are not familiar with modern differential geometry.

We begin with two typical classes of divergences. One is the class of Bregman divergences [4], introduced through a convex function (see, e.g., [5]). The other is the class of invariant divergences, called f-divergences, where f is a convex function [6–9]. Both are frequently used in engineering applications. Csiszár [10] gives an axiomatic characterization of these divergences. The geometrical structures of these classes are explained intuitively. Most properties are already known, but here we give their new geometrical explanations.

Bregman divergences are derived from convex functions. The Bregman divergence induces a dual structure through the Legendre transformation. It gives a geometrical structure consisting of a Riemannian metric and dually flat affine connections, called the dually flat Riemannian structure [1]. A dually flat Riemannian manifold is a generalization of the Euclidean space, in which the generalized Pythagorean theorem and projection theorem hold. These two theorems provide powerful tools for solving problems in optimization, statistical inference and signal processing. We show that the Bregman type divergence is automatically induced from the dual flatness of a Riemannian manifold.

Then we study the class of invariant divergences [1, 11]. The invariance requirement comes from information monotonicity, which states that a divergence measure does not increase by coarse graining of information [12]. This leads to the class of f-divergences. The α-divergences are typical examples belonging to this class, which also includes the Kullback-Leibler divergence as a special case. This class of divergences induces an invariant Riemannian metric given by the Fisher information matrix and a pair of invariant dual affine connections, the ±α-connections, which are not necessarily flat. See [13] for more delicate problems occurring in the function space.

When a family of unnormalized probability distributions, that is, a family of positive measures or arrays, is considered, we show that the α-divergence is the only class that is both invariant and dually flat at the same time [14].

We further study the geometry derived from a general divergence in detail. This part is the original contribution of the present paper. A divergence endows a geometrical structure to the underlying space.

∗e-mail: [email protected]


We discuss the inverse problem of constructing a divergence function from the geometrical structure of a manifold. The study also comprises various examples of divergence functions, and provides an insight into the geometry derived from the Tsallis q-entropy [15, 17–19]. This extends our views to study how the geometry changes by modifying a divergence function.

2. Convex function, Bregman divergence and dual geometry

2.1. Bregman divergence and Riemannian metric. Let k(z) be a strictly convex differentiable function defined in a space S with a local coordinate system z. Then, for two points z and y in S, we can define the following function

D[z : y] = k(z) − k(y) − Grad k(y) · (z − y),   (1)

where Grad k is the gradient vector

Grad k(z) = (∂k(z)/∂z_i),   (2)

and the operator '·' denotes the inner product,

Grad k(y) · (z − y) = Σ_i (∂k/∂y_i)(z_i − y_i).   (3)

The function D[z : y] satisfies the following conditions for divergences:

1) D[z : y] ≥ 0,
2) D[z : y] = 0, when and only when z = y,
3) For small dz, the Taylor expansion

D[z + dz : z] ≈ (1/2) Σ g_ij dz_i dz_j   (4)

gives a positive-definite quadratic form.

We call D[z : y] the Bregman divergence between two points z and y. In general, the divergence is not symmetric with respect to z and y, so that

D[y : z] ≠ D[z : y].   (5)

When z is infinitesimally close to y, y = z + dz, we have

D[z : z + dz] = (1/2) Σ (∂²k(z)/∂z_i∂z_j) dz_i dz_j = D[z + dz : z]   (6)

by Taylor expansion. This is regarded as a half of the squared distance between z and z + dz, defined in the following. We can use the Hessian of k,

g_ij(z) = ∂²k(z)/∂z_i∂z_j,   (7)

to define the squared local distance as

ds² = Σ g_ij(z) dz_i dz_j.   (8)

A space S is called a Riemannian manifold, when a positive-definite matrix g(z) = (g_ij(z)) is defined at each point z ∈ S such that (8) is the squared local distance.

When S is a Riemannian manifold, we can define a Riemannian geodesic z(t) parameterized by t. The length of a curve z(t) connecting two points z_1 = z(t_1) and z_2 = z(t_2) is defined by the integral

s = ∫_{t_1}^{t_2} √( Σ g_ij(z(t)) ż_i(t) ż_j(t) ) dt,   (9)

where ż_i(t) = (d/dt) z_i(t). This is the Riemannian distance along the curve between the two points. A Riemannian geodesic z(t) is the curve that minimizes the above distance.

We next introduce a dual structure in S, defined by the Legendre transformation (see, e.g., [1, 20]). The gradient vector

z* = Grad k(z)   (10)

of a convex function k(z) is in one-to-one correspondence with z. This is the Legendre transformation, and z* can be regarded as another coordinate system of S different from z. We can calculate the dual function of k, defined by

k*(z*) = max_z { z · z* − k(z) },   (11)

which is a convex function of z*. Hence we can describe the geometry of S by using the dual convex function k* and the dual coordinates z*. Obviously, z and z* are dual, since we have

z = Grad k*(z*).   (12)

The Riemannian metric is given in the dual coordinates by

g*_ij(z*) = ∂²k*(z*)/∂z*_i∂z*_j.   (13)

Theorem 1. The Riemannian metrics g_ij and g*_ij in their matrix forms are mutually inverse. They are the same tensor represented in different coordinate systems z and z*, giving the same local distance.

Proof. From z* = Grad k(z), we have

dz*_i = Σ_j (∂²k(z)/∂z_i∂z_j) dz_j = Σ_j g_ij dz_j.   (14)

In a similar way, we have

dz_i = Σ_j g*_ij dz*_j.   (15)

By using the vector-matrix notation, they are rewritten as

dz* = g dz,   dz = g* dz*,   (16)

where g = (g_ij) and g* = (g*_ij). This shows

g* = g^(−1).   (17)

The local lengths of dz and dz* are given by the quadratic forms

ds² = dz^T g dz,   (18)

ds*² = dz*^T g* dz*,   (19)

where dz^T is the transpose of dz. Hence, from (16) and (17), we have

ds² = ds*².   (20)

In other words, g and g* are the same Riemannian metric represented in two different coordinate systems.
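As a quick numerical illustration of Theorem 1 (our own sketch, not part of the paper), the following Python fragment takes the quadratic potential k(z) = (1/2) zᵀAz with an arbitrarily chosen positive-definite matrix A and sample points, and checks that the Bregman divergence (1) has the expected closed form and that the primal and dual metrics give the same local length, eqs. (16)–(20).

```python
import numpy as np

# Hypothetical example: k(z) = 1/2 z^T A z with A positive-definite.
# Then the Bregman divergence (1) is D[z:y] = 1/2 (z-y)^T A (z-y),
# the dual coordinate (10) is z* = A z, and the metrics (7), (13)
# are g = A and g* = A^{-1}, as Theorem 1 asserts.

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # Hessian of k, i.e. the metric g

def k(z):
    return 0.5 * z @ A @ z

def bregman(z, y):
    # D[z : y] = k(z) - k(y) - Grad k(y) . (z - y), eq. (1)
    return k(z) - k(y) - (A @ y) @ (z - y)

z = np.array([1.0, -0.5])
y = np.array([0.3, 0.8])

# the divergence agrees with the closed form 1/2 (z-y)^T A (z-y)
same_divergence = np.isclose(bregman(z, y),
                             0.5 * (z - y) @ A @ (z - y))

# eqs. (16)-(20): dz* = g dz, and the local lengths agree, ds^2 = ds*^2
g_star = np.linalg.inv(A)           # dual metric, eq. (17)
dz = np.array([1e-3, -2e-3])
dz_star = A @ dz
same_length = np.isclose(dz @ A @ dz, dz_star @ g_star @ dz_star)

print(same_divergence, same_length)
```

The choice of A and of the test points is arbitrary; any strictly convex quadratic would do, since the Hessian is then constant and the dual potential is k*(z*) = (1/2) z*ᵀA⁻¹z*.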


The dual function k*(z*) also induces a divergence,

D*[y* : z*] = k*(y*) − k*(z*) − Grad k*(z*) · (y* − z*).   (21)

Theorem 2. The two divergences D and D* are mutually reciprocal, in the sense of

D*[y* : z*] = D[z : y].   (22)

The divergence between two points z and y is written in the dual form

D[z : y] = k(z) + k*(y*) − z · y*.   (23)

Proof. The right-hand side of (11) is maximized when z and z* correspond to each other, that is,

∂/∂z { z · z* − k(z) } = z* − Grad k(z) = 0.   (24)

Hence, we have the identity from (11),

k(z) + k*(z*) − z · z* = 0.   (25)

By using this relation for y, we have

D[z : y] = k(z) − k(y) − y* · (z − y)   (26)
         = k(z) + k*(y*) − z · y*.   (27)

Analogously, we have for k*(z*),

D[z : y] = D*[y* : z*].   (28)

The theorem shows that it suffices to consider only one divergence function.

2.2. Dual affine structure and Pythagorean theorem. We now introduce an affine structure in S [20], which is different from the Riemannian structure defined by g. We simply assume that the coordinate system z is affine. Hence, a curve represented in the form

z(t) = t a + b   (29)

is a geodesic in this sense, where t is the parameter along the curve and a and b are constant vectors. This geodesic is not a Riemannian geodesic that minimizes the length of a curve. Further we define a dual affine structure. A dual geodesic z*(t) is defined by a linear curve

z*(t) = t a* + b*,   (30)

for constant vectors a* and b*, regarding z* as another affine coordinate system of S. This defines a dual affine structure of S, which is different from the primal affine structure. However, we will later show their differential-geometrical formalism and prove that the two affine structures are dually coupled with respect to the Riemannian metric.

We have shown that a space S equipped with a Bregman divergence is Riemannian, and has two dually flat affine structures. This gives rise to the following generalized Pythagorean theorem [1, 20]. As a preliminary, we mention the orthogonality of two curves in S. This is defined by using the Riemannian metric. Let z_1(t) and z_2(t) be two curves intersecting at t = 0, z_1(0) = z_2(0), where t is a parameter along the curves. The two curves intersect orthogonally at t = 0, when their tangent vectors

t_1 = (d/dt) z_1(0),   t_2 = (d/dt) z_2(0)   (31)

satisfy the Riemannian orthogonality condition,

⟨t_1, t_2⟩ = Σ g_ij t_1i t_2j = 0,   (32)

where ⟨t_1, t_2⟩ is the inner product with respect to the Riemannian metric g. When the dual coordinate system is used for one curve, say, z_2(t), the orthogonality condition is simplified to

Σ_i (d/dt) z_1i(0) (d/dt) z*_2i(0) = 0,   (33)

because

⟨ż_1, ż_2⟩ = ż_1^T g ż_2 = ż_1^T ż*_2.   (34)

Pythagorean Theorem. Let P, Q, R be three points in S whose coordinates (and dual coordinates) are represented by z_P, z_Q, z_R (z*_P, z*_Q, z*_R), respectively. When the geodesic connecting P and Q is orthogonal at Q to the dual geodesic connecting Q and R, then

D[P : R] = D[P : Q] + D[Q : R].   (35)

Dually, when the dual geodesic connecting P and Q is orthogonal at Q to the geodesic connecting Q and R, we have

D[R : P] = D[Q : P] + D[R : Q].   (36)

Proof. By using (23) and (25), we have

D[R : Q] + D[Q : P]
 = k(z_R) + k*(z*_Q) + k(z_Q) + k*(z*_P) − z_R · z*_Q − z_Q · z*_P   (37)
 = k(z_R) + k*(z*_P) + z_Q · z*_Q − z_R · z*_Q − z_Q · z*_P   (38)
 = D[z_R : z_P] + (z_Q − z_R) · (z*_Q − z*_P).   (39)

The tangent vector of the geodesic connecting Q and R is z_Q − z_R, and the tangent vector of the dual geodesic connecting Q and P is z*_Q − z*_P in the dual coordinate system. Hence, the second term of the right-hand side of the above equation vanishes, because the primal and dual geodesics connecting Q and R, and Q and P, are orthogonal.

The following projection theorem is a consequence of the generalized Pythagorean theorem. Let M be a smooth submanifold of S. Given a point P outside M, we connect it to a point Q in M by a geodesic (dual geodesic). When the geodesic (dual geodesic) connecting P and Q is orthogonal to M (that is, orthogonal to any tangent vector of M), Q is said to be the geodesic projection (dual geodesic projection) of P to M.

Projection Theorem. Given P and M, the point Q (Q*) that minimizes the divergence D[P : R], R ∈ M (D[R : P], R ∈ M) is the projection (dual projection) of P to M.
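The generalized Pythagorean relation can be checked numerically. The sketch below is our own example, using the dually flat structure generated by k(z) = Σ(z_i log z_i − z_i) on positive vectors, for which z* = log z and the Bregman divergence is the generalized Kullback-Leibler divergence; the concrete points and directions are arbitrary choices satisfying the orthogonality condition (32)–(34).

```python
import numpy as np

# Our own numerical sketch of the generalized Pythagorean theorem,
# using k(z) = sum(z log z - z): then z* = log z and the Bregman
# divergence is D[z:y] = sum(z log(z/y) - z + y), of the dual form (23).

def D(z, y):
    return np.sum(z * np.log(z / y) - z + y)

zQ = np.array([1.0, 2.0, 0.5])

b      = np.array([0.2, 0.2, 0.0])   # primal direction (z is affine)
a_star = np.array([1.0, -1.0, 2.0])  # dual direction (z* = log z is affine)
assert b @ a_star == 0               # orthogonality at Q, eqs. (32)-(34)

zR = zQ + b                          # R on the primal geodesic through Q
zP = zQ * np.exp(0.3 * a_star)       # P on the dual geodesic through Q

# primal geodesic Q-R orthogonal at Q to dual geodesic Q-P,
# hence D[R:P] = D[R:Q] + D[Q:P], the relation of eq. (36)
print(D(zR, zP), D(zR, zQ) + D(zQ, zP))
```

Because both geodesics are exactly linear in their respective affine coordinates, the relation holds exactly (up to floating-point rounding), not merely approximately.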


This theorem is useful when we search for the point belonging to M that minimizes the divergence D[P : Q] or D[Q : P] for a preassigned P. In many engineering problems, P is given from observed data, and M is a model to describe the underlying structure.

2.3. Examples of dually flat geometry. The following examples illustrate our approach:

1) Euclidean geometry: When k(z) has the quadratic form

k(z) = (1/2) Σ z_i²,   (40)

the induced Riemannian metric g is the identity matrix. Hence the space is Euclidean. The primal and dual coordinates are the same, z* = z, so that the space is self-dual. The divergence is then a half of the square of the Euclidean distance,

D[z : y] = (1/2) Σ |z_i − y_i|².   (41)

The Pythagorean theorem and the projection theorem are exactly the same as the well-known counterparts in a Euclidean space.

2) Entropy geometry on discrete probability distributions: Consider the set S_n of all discrete probability distributions over n + 1 elements X = {x_0, x_1, ..., x_n}. A probability distribution is given by

p(x) = Σ_{i=0}^n p_i δ_i(x),   (42)

where

p_i = Prob{x = x_i},   δ_i(x) = 1 for x = x_i, and 0 otherwise.   (43)

Obviously,

Σ_{i=0}^n p_i = 1.   (44)

We can use a coordinate system z = (p_1, ..., p_n) for the set S_n of all such distributions, where z_0 = p_0 is regarded as a function of the other coordinates,

p_0 = 1 − Σ_{i=1}^n z_i.   (45)

The Shannon entropy

H(z) = − Σ z_i log z_i − (1 − Σ z_i) log(1 − Σ z_i)   (46)

is concave, so that k(z) = −H(z) is a convex function of z. The Riemannian metric induced from k(z) is calculated as

g_ij(z) = δ_ij/p_i + 1/p_0,   (47)

which is the Fisher information matrix. The divergence function is given by

D[z : y] = Σ_{i=0}^n z_i log(z_i/y_i),   (48)

which is known as the Kullback-Leibler divergence. It is written in general as

KL[p(x) : q(x)] = Σ_x p(x) log( p(x)/q(x) ).   (49)

The dual coordinates are given by

z*_i = log(p_i/p_0),   (50)

and are known as the natural parameters of an exponential family, where the distribution is rewritten in the form

p(x, z*) = exp{ Σ z*_i δ_i(x) − k*(z*) }.   (51)

An exponential family of probability distributions is usually represented as

p(x, θ) = exp{ Σ θ_i δ_i(x) − ψ(θ) }.   (52)

In our case, θ_i = z*_i, and ψ(θ) = k*(z*) is the cumulant generating function. This is the dual potential,

ψ(θ) = k*(z*) = − log p_0.   (53)

The induced geometrical structure is studied in detail in information geometry [1], where the Pythagorean theorem and projection theorem hold. It should be mentioned that the usual definition begins with the exponential form (52), where ψ is the convex function defining the geometry and θ = z* is the primal affine coordinate system. Therefore, z = p is the dual affine coordinate system. Hence, the definitions of the primal and dual coordinates are reversed.

3) Positive-definite matrices: Let 𝒫 be the set of n × n positive-definite matrices. It is an n(n + 1)/2-dimensional manifold, since P ∈ 𝒫 is a symmetric matrix. When |P| is the determinant of P,

k(P) = − log |P|   (54)

is a convex function of P. Its gradient is

Grad k(P) = − P^(−1).   (55)

Hence, the induced divergence is

D[P : Q] = − log |P Q^(−1)| + tr(P Q^(−1)) − n,   (56)

where the operator tr is the trace of a matrix. The dual affine coordinates are

P* = − P^(−1),   (57)

and the dual potential is

k*(P*) = − log |P*|.   (58)

Consider a family of probability distributions of x,

p(x, P) = exp{ − (1/2) x^T P^(−1) x − ψ(P) }.   (59)

This is a multi-variate Gaussian distribution with mean 0 and covariance matrix P. The geometrical structure introduced here is the same as that derived from the exponential family of distributions [1].
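The positive-definite matrix divergence (56) can be checked against the Bregman definition directly. The following sketch (our own check, with arbitrarily chosen matrices) computes D[P : Q] both from eq. (1) with k(P) = −log|P| and Grad k(Q) = −Q⁻¹, and from the closed form, and verifies that the divergence vanishes at P = Q.

```python
import numpy as np

# Our own check of eqs. (54)-(56): the divergence induced by
# k(P) = -log|P| on positive-definite matrices, computed both from
# the Bregman definition (1) and from the closed form.

def k(P):
    return -np.log(np.linalg.det(P))

def bregman(P, Q):
    # k(P) - k(Q) - <Grad k(Q), P - Q>, with <A, B> = tr(A^T B)
    # and Grad k(Q) = -Q^{-1}, eq. (55)
    return k(P) - k(Q) + np.trace(np.linalg.inv(Q) @ (P - Q))

def closed_form(P, Q):
    # D[P : Q] = -log|P Q^{-1}| + tr(P Q^{-1}) - n, eq. (56)
    n = P.shape[0]
    PQinv = P @ np.linalg.inv(Q)
    return -np.log(np.linalg.det(PQinv)) + np.trace(PQinv) - n

P = np.array([[2.0, 0.3], [0.3, 1.0]])
Q = np.array([[1.5, -0.2], [-0.2, 0.8]])

# the two expressions agree, and D[P : P] vanishes (up to rounding)
print(np.isclose(bregman(P, Q), closed_form(P, Q)),
      np.isclose(closed_form(P, P), 0.0))
```

The agreement is exact up to rounding since tr(Q⁻¹P) = tr(PQ⁻¹) and log|PQ⁻¹| = log|P| − log|Q| for any positive-definite P, Q.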

Quantum information geometry [1, 21–23] uses the convex function

k(P) = tr(P log P − P).   (60)

Its gradient is

P* = Grad k(P) = log P,   (61)

and hence the dual affine coordinate is log P. The divergence is

D[P : Q] = tr{ P (log P − log Q) − P + Q },   (62)

which is the von Neumann divergence.

We further define the following function by using a convex function f,

k_f(P) = tr f(P) = Σ f(λ_i),   (63)

where λ_i are the eigenvalues of P. Then, k_f is a convex function of P, from which we derive a dual geometrical structure depending on f [24].

For f(λ) = (1/2) λ², we have

k_f(P) = (1/2) tr(P^T P),   (64)

and the derived divergence is

D_f[P : Q] = (1/2) ||P − Q||² = (1/2) Σ |p_ij − q_ij|².   (65)

The dual is P* = P, so that the geometry is self-dual and Euclidean. The statistical case of (54) is derived from

f(λ) = − log λ,   (66)

while

f(λ) = λ log λ − λ   (67)

gives the quantum-information structure (60). More generally, by putting [22]

f_α(λ) = (4/(1 − α²)) ( λ^((1+α)/2) − λ ),   (−1 < α < 1),   (68)

we have the α-divergence,

D_α[P : Q] = (4/(1 − α²)) tr{ ((1 − α)/2) P + ((1 + α)/2) Q − P^((1+α)/2) Q^((1−α)/2) }.   (69)

This is a generalization of the α-divergence defined later in the space of positive measures.

4) Linear programming: Let us consider the following LP (linear programming) problem.

Problem. Minimize the cost function

c(x) = Σ c_i x_i   (70)

under the constraints on x ∈ R^n given by

Σ_{j=1}^n A_ij x_j ≥ b_i,   i = 1, ..., m.   (71)

Let S = {x} be the open region satisfying the constraints

Σ_j A_ij x_j > b_i.   (72)

Then,

k(x) = − Σ_{i=1}^m log( Σ_{j=1}^n A_ij x_j − b_i )   (73)

is a convex function. Hence, we can introduce a dually flat Riemannian structure to S. The inner method of LP makes use of this structure [25]. The primal and dual curvatures play an important role in evaluating the efficiency of algorithms [26]. This can be generalized to the cone programming and semi-definite programming problems [27].

5) Other divergences: There are many other divergences in S_n derived from a convex function of the form k(z) = Σ f(z_i). We show some of them. An extensive list of divergences is given in [2].

i) For f(z) = − log z, we have the Itakura-Saito divergence

D_IS[p : q] = Σ ( log(q_i/p_i) + p_i/q_i − 1 ),   (74)

which is useful for spectral analysis of speech signals.

ii) For

f(z) = z log z + (1 − z) log(1 − z),   (75)

we have the Fermi-Dirac divergence

D_FD[p : q] = Σ { p_i log(p_i/q_i) + (1 − p_i) log( (1 − p_i)/(1 − q_i) ) }.   (76)

iii) The β-divergence is a family of divergences derived from the following β-functions,

f_β(z) = (1/(β(β + 1))) ( z^(β+1) − (β + 1) z + β ),   β ≠ 0, −1,
f_β(z) = z log z − z,   β = 0,
f_β(z) = − log z + z,   β = −1.   (77)

The divergences are

D_β[p : q] = (1/(β(β + 1))) Σ { p_i^(β+1) − q_i^(β+1) − (β + 1) q_i^β (p_i − q_i) },   β ≠ 0, −1,
D_β[p : q] = Σ p_i log(p_i/q_i),   β = 0,
D_β[p : q] = Σ ( log(q_i/p_i) + p_i/q_i − 1 ),   β = −1.   (78)

Hence, the family includes the KL-divergence and the Itakura-Saito divergence. It is used in machine learning [28] and robust estimation [29–31].
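The β-divergence family interpolates between its special cases smoothly. The sketch below is our own numerical check (with arbitrary sample distributions) that the generic branch of (78) approaches the KL-divergence as β → 0 and the Itakura-Saito divergence as β → −1.

```python
import numpy as np

# Our own sketch of the beta-divergence family (78): the generic
# closed form for beta != 0, -1, with the KL and Itakura-Saito
# divergences as the beta -> 0 and beta -> -1 limits.

def beta_div(p, q, beta):
    if beta == 0:
        return np.sum(p * np.log(p / q))                 # KL-divergence
    if beta == -1:
        return np.sum(np.log(q / p) + p / q - 1)         # Itakura-Saito
    c = beta * (beta + 1)
    return np.sum(p**(beta + 1) - q**(beta + 1)
                  - (beta + 1) * q**beta * (p - q)) / c

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

# the generic branch approaches the special cases in the limits
print(abs(beta_div(p, q, 1e-6) - beta_div(p, q, 0)) < 1e-4,
      abs(beta_div(p, q, -1 + 1e-6) - beta_div(p, q, -1)) < 1e-4)
```

Since each f_β is convex, every member of the family is a decomposable Bregman divergence, nonnegative and vanishing only at p = q.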


3. Invariant divergence in manifolds of probability and positive measures

3.1. Information monotonicity. Let us consider again the space S_n of all probability distributions over n + 1 atoms X = {x_0, x_1, ..., x_n}. The probability distributions are given by p = (p_0, p_1, ..., p_n), p_i = Prob{x = x_i}, i = 0, 1, ..., n, Σ p_i = 1. We try to define a new divergence measure D[p : q] between two distributions p and q. To this end, we introduce the concept of information monotonicity [12, 14], of which the original idea was proposed by Chentsov in 1972 [11].

Let us divide X into m groups, G_1, G_2, ..., G_m (m < n + 1), say

G_1 = {x_1, x_2, x_5},   G_2 = {x_3, x_8, ...},   ...   (79)

This is a partition of X,

X = ∪ G_i,   (80)

G_i ∩ G_j = ∅.   (81)

Assume that we do not know the outcome x_i directly, but can observe which group G_j it belongs to. This is called coarse-graining of X. The coarse-graining generates a new probability distribution p̄ = (p̄_1, ..., p̄_m) over G_1, ..., G_m,

p̄_j = Prob{G_j} = Σ_{x_i ∈ G_j} Prob{x_i}.   (82)

Let D̄[p̄ : q̄] be the induced divergence between p̄ and q̄. Since coarse-graining summarizes some of the elements into one group, detailed information of the outcome in each group is lost. Therefore, it is natural to require

D̄[p̄ : q̄] ≤ D[p : q].   (83)

When does the equality hold? For two distributions p and q, assume that the outcome x_i is known to belong to G_j. Then, we require more information to distinguish the two probability distributions p and q by knowing further detail inside group G_j. Since x_i belongs to group G_j, we consider the conditional probability distributions

p(x_i | x_i ∈ G_j),   q(x_i | x_i ∈ G_j)   (84)

inside group G_j. If they are equal, we cannot obtain further information to distinguish p from q by observing elements inside G_j. Hence,

D̄[p̄ : q̄] = D[p : q]   (85)

holds, when and only when

p(x_i | G_j) = q(x_i | G_j)   (86)

for all G_j and all x_i ∈ G_j, or

p_i/q_i = λ_j   (87)

for all x_i ∈ G_j, for some constant λ_j. A divergence satisfying the above requirements is called an invariant divergence, and such a property is termed information monotonicity.

3.2. f-divergence and information monotonicity. The f-divergence was introduced by Csiszár [7] and also by Ali and Silvey [6]. It is defined by

D_f[p : q] = Σ p_i f(q_i/p_i),   (88)

where f is a convex function satisfying

f(1) = 0.   (89)

For f̄ = c f with a constant c, we have

D_cf[p : q] = c D_f[p : q].   (90)

Hence, f and cf give the same divergence except for the scale factor. In order to standardize the scale of divergence, we may assume that

f''(1) = 1,   (91)

provided f is differentiable. Further, for f_c(u) = f(u) − c(u − 1), where c is a constant, we have

D_fc[p : q] = D_f[p : q].   (92)

Hence, we may use such an f that satisfies

f'(1) = 0   (93)

without loss of generality. A convex function satisfying the above three conditions (89), (91), (93) is called a standard f function.

The f-divergence (88) is written as a sum of functions of the two variables p_i and q_i. Such a divergence is said to be decomposable. Csiszár found that an f-divergence satisfies information monotonicity. Moreover, the class of f-divergences is unique in the sense that any decomposable divergence satisfying information monotonicity is an f-divergence.

Theorem 3. The f-divergence satisfies information monotonicity. Conversely, any decomposable information monotonic divergence is written in the form of an f-divergence.

The proof is found, e.g., in [14]. The Riemannian metric and affine connections derived from the f-divergence have a common invariant structure. They are given by the Fisher information Riemannian metric and the ±α-connections, which are shown in a later section.

3.3. Examples of f-divergence in S_n. An extensive list of f-divergences is given in [2]. Some of them are listed below.

1) Total variation: f(u) = |u − 1|. The total variation distance is defined by

D[p : q] = Σ |p_i − q_i|.   (94)

Note that it is not differentiable, and no Riemannian metric is derived. However, this gives a Minkovskian metric.

2) Squared Hellinger distance: f(u) = (√u − 1)², for which

D[p : q] = Σ ( √p_i − √q_i )².   (95)
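Information monotonicity (83) is easy to observe numerically. The sketch below is our own example: it coarse-grains two arbitrarily chosen distributions over five atoms into two groups, as in (82), and checks that the KL-divergence (the f-divergence with standard f(u) = u − 1 − log u) can only decrease.

```python
import numpy as np

# Our own example of information monotonicity (83): coarse-graining
# the outcomes can only decrease an f-divergence. Here we use the
# KL-divergence, the f-divergence with f(u) = u - 1 - log u.

p = np.array([0.1, 0.2, 0.3, 0.15, 0.25])
q = np.array([0.3, 0.1, 0.2, 0.25, 0.15])

def kl(p, q):
    return np.sum(p * np.log(p / q))

# partition X = {x0,...,x4} into groups G1 = {x0, x1}, G2 = {x2, x3, x4}
groups = [[0, 1], [2, 3, 4]]
p_bar = np.array([p[g].sum() for g in groups])   # eq. (82)
q_bar = np.array([q[g].sum() for g in groups])

print(kl(p_bar, q_bar) <= kl(p, q))   # eq. (83)
```

Equality would hold only if p and q were proportional within each group, the condition (87).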

3) Pearson and Neyman chi-square divergences: f(u) = (1/2)(u − 1)² and f(u) = (1/2)(u − 1)²/u, for which

D[p : q] = (1/2) Σ (p_i − q_i)²/p_i,   (96)

D[p : q] = (1/2) Σ (p_i − q_i)²/q_i.   (97)

4) The KL-divergence: f(u) = u − 1 − log u, and

D[p : q] = Σ p_i log(p_i/q_i).   (98)

5) The α-divergence:

f(u) = (4/(1 − α²)) ( 1 − u^((1+α)/2) ) + (2/(1 − α)) (u − 1),   (99)

D_α[p : q] = (4/(1 − α²)) ( 1 − Σ p_i^((1−α)/2) q_i^((1+α)/2) ).   (100)

The α-divergence was introduced by Havrda and Charvát in 1967 [32], and has been studied extensively by Amari and Nagaoka [1]. Its applications were described earlier by Chernoff in 1952 [33], and later in [34, 35] etc., to mention a few. It is the squared Hellinger distance for α = 0, and the KL-divergence and its reciprocal are obtained in the limit of α → ±1.

3.4. f-divergence in the space of positive measures. We have studied both invariant and flat geometrical structures in the manifold S_n of probability distributions. Here, we study a similar structure in the space M_n of positive measures over X = {x_1, ..., x_n}, whose points are denoted by z = (z_1, ..., z_n), z_i > 0. Here z_i is the mass (measure) of x_i, where the total mass Σ z_i is arbitrary. In many applications, z is a non-negative array, and we can extend it to a non-negative double array z = (z_ij) etc., that is, matrices and tensors. We first derive an f-divergence in M_n: for two positive measures z and y, an f-divergence is given by

D_f[z : y] = Σ z_i f(y_i/z_i),   (101)

where f is a standard convex function. It should be noted that an f-divergence is no longer invariant under the transformation from f(u) to

f_c(u) = f(u) − c(u − 1).   (102)

Hence, it is absolutely necessary to use a standard f in the case of M_n, because the condition of divergence is violated otherwise.

Among all f-divergences, the α-divergence, given in the form of

D_α[z : y] = Σ z_i f_α(y_i/z_i),   (103)

where

f_α(u) = (4/(1 − α²)) ( 1 − u^((1+α)/2) ) + (2/(1 − α)) (u − 1),   α ≠ ±1,
f_α(u) = u log u − (u − 1),   α = 1,
f_α(u) = − log u + (u − 1),   α = −1,   (104)

plays a special role in M_n. This f_α(u) is a standard convex function.

The α-divergence is given in the following form,

D_α[z : y] = (4/(1 − α²)) Σ { ((1 − α)/2) z_i + ((1 + α)/2) y_i − z_i^((1−α)/2) y_i^((1+α)/2) },   α ≠ ±1,
D_α[z : y] = Σ { z_i − y_i + y_i log(y_i/z_i) },   α = 1,
D_α[z : y] = Σ { y_i − z_i + z_i log(z_i/y_i) },   α = −1.   (105)

3.5. Characterization of α-divergence. We show that the α-divergence is not only invariant, but also has a dually flat structure for any α in M_n. This is different from the case of S_n, because it is not dually flat in S_n, as will be shown later. If M_n is dually flat, it has primal and dual affine coordinate systems w, w* such that primal and dual geodesics are represented in the linear forms (29), (30) in these coordinate systems. To this end, we introduce

r_α(u) = (2/(1 − α)) ( u^((1−α)/2) − 1 ),   α ≠ 1,
r_α(u) = log u,   α = 1,   (106)

which is called the α-representation of u. The new coordinates w and w* of M_n are defined by

w_i = r_α(z_i),   w*_i = r_{−α}(z_i).   (107)

We further define convex functions k(w) and k*(w*) by

k(w) = k_α(w) = (2/(1 + α)) Σ ( 1 + ((1 − α)/2) w_i )^(2/(1−α)),   (108)

k*(w*) = k_{−α}(w*).   (109)

Theorem 4. The α-divergence is a Bregman divergence, where w and w* are affine coordinate systems, having the dual convex functions k_α(w) and k_{−α}(w*), respectively.

Proof. The divergence D_α between two points w and s (whose dual coordinates are w* and s*, respectively) derived from k_α(w) is written as

D_α[w : s] = k_α(w) + k_{−α}(s*) − w · s*.   (110)

For w = r_α(z), s = r_α(y), we have

k_α(w) = (2/(1 + α)) Σ z_i,   (111)

k_{−α}(s*) = (2/(1 − α)) Σ y_i,   (112)

and

w · s* = (4/(1 − α²)) Σ z_i^((1−α)/2) y_i^((1+α)/2)   (113)

(terms affine in w or s*, which do not change a Bregman divergence, are omitted here). By substituting (111)–(113) in (110), we prove that D_α is the α-divergence given in (105). This demonstrates that M_n has the dually flat α-structure for any α. We can furthermore prove that the α-divergence is unique in the sense that it belongs to both the classes of f-divergences and Bregman divergences [14].
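The two expressions for the α-divergence on positive measures can be cross-checked numerically. The sketch below (our own check, with arbitrary positive vectors) computes D_α both from the f-divergence form (103) with the standard function f_α and from the closed form (105), and verifies that α → 1 recovers the KL-type case.

```python
import numpy as np

# Our own check that the alpha-divergence on positive measures can be
# computed either via the standard f_alpha, eq. (103), or via the
# closed form, eq. (105), and that alpha -> 1 gives the KL-type case.

def f_alpha(u, a):
    return 4 / (1 - a**2) * (1 - u**((1 + a) / 2)) + 2 / (1 - a) * (u - 1)

def D_from_f(z, y, a):
    return np.sum(z * f_alpha(y / z, a))               # eq. (103)

def D_closed(z, y, a):                                  # eq. (105)
    return 4 / (1 - a**2) * np.sum((1 - a) / 2 * z + (1 + a) / 2 * y
                                   - z**((1 - a) / 2) * y**((1 + a) / 2))

z = np.array([1.0, 2.0, 0.5])
y = np.array([0.8, 2.5, 0.4])
a = 0.3

kl_type = np.sum(z - y + y * np.log(y / z))   # alpha = 1 case of (105)

print(np.allclose(D_from_f(z, y, a), D_closed(z, y, a)),
      abs(D_closed(z, y, 1 - 1e-6) - kl_type) < 1e-4)
```

The two forms agree exactly: the f-divergence form differs from the closed form only by terms proportional to Σ(y_i − z_i) that the linear part of f_α contributes, and these cancel identically.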

Bull. Pol. Ac.: Tech. 58(1) 2010 189 S. Amari and A. Cichocki

4. Geometry derived from general divergence function

4.1. Tangent space, Riemannian metric and affine connection. We have so far studied the geometry derived from Bregman and f-divergences without mentioning the underlying mathematical background. Here, we study the geometry derived from a general divergence, together with basic mathematical concepts from differential geometry. Let us consider a manifold $S$ having a local coordinate system $z = (z_1, \ldots, z_n)$, and a differentiable function $D[y : z]$ satisfying the conditions of a divergence. Then, the Taylor expansion

$$D(z + dz : z) = \frac{1}{2} \sum g_{ij}(z)\, dz_i\, dz_j \qquad (114)$$

is a positive definite quadratic form. Hence, a Riemannian metric is induced in $S$ by the second-order derivatives of $D$ at $y = z$, that is,

$$g_{ij}(z) = \frac{\partial^2}{\partial z_i \partial z_j} D(z : y)\Big|_{y=z}. \qquad (115)$$

It is easy to prove that this is a tensor.

The Bregman divergence induces a dually flat geometrical structure, but this does not necessarily hold in the case of general divergences. However, we still have a pair of dually coupled (non-flat) affine connections together with a Riemannian metric. In order to elucidate the induced geometry, we briefly give intuitive explanations of the notions of a tangent space and an affine connection.

Let us consider the tangent space $T_z$ of $S$ at point $z$. It is a linear space spanned by basis vectors $e_1, \ldots, e_n$, where $e_i$ represents the tangent vector along the coordinate axis $z_i$. One may regard it as a local linearization of $S$ in a neighborhood of $z$. Any tangent vector $X$ of $T_z$ is represented as

$$X = \sum X_i e_i. \qquad (116)$$

In particular, the tangent vector of a curve $z(t)$ is represented by

$$\dot z(t) = \sum \frac{d}{dt} z_i(t)\, e_i. \qquad (117)$$

The Riemannian metric tensor is expressed by the inner product of two basis vectors,

$$g_{ij}(z) = \langle e_i, e_j \rangle, \qquad (118)$$

so that the inner product of two tangent vectors $X$ and $Y$ is

$$\langle X, Y \rangle = \sum g_{ij} X_i Y_j. \qquad (119)$$

We next consider an affine connection, which is necessary for defining a geodesic. The tangent spaces at all points $z \in S$ form a collection of local linear approximations of $S$ at different points. Such a collection is a fibre bundle called the tangent bundle. A geodesic is an extension of the "straight line", and is defined as a curve whose tangent directions are the same along the curve. To define this "sameness", we need to connect tangent spaces defined at different points and specify which directions are the same in different tangent spaces. The basis vectors $e_i$ always have the same directions in a flat space, provided affine coordinates are taken. In this special case, $e_i(z)$ and $e_i(z + dz)$ have the same direction. This is the case with a Bregman divergence. When a space is curved, the directions of $e_i(z)$ and $e_i(z + dz)$ are different, and we cannot have an affine coordinate system in general.

Let us compare a basis vector $e_i(z) \in T_z$ with $\tilde e_i(z + dz) \in T_{z+dz}$. Their intrinsic change $\delta e_i$ is defined by $\delta e_i = e_i(z + dz) - e_i(z)$, where $e_i(z + dz)$ is the vector in $T_z$ corresponding to $\tilde e_i(z + dz) \in T_{z+dz}$. We need to define such a correspondence by connecting $T_z$ and $T_{z+dz}$. By this correspondence, $e_i(z + dz)$ is a vector in $T_z$, so that $\delta e_i$ is spanned by $\{e_k\}$, and it reduces to $0$ as $dz \to 0$. Hence, it may be written in the linear form

$$\delta e_j = \sum \Gamma^k_{ij}(z)\, dz_i\, e_k. \qquad (120)$$

Therefore, once the coefficients $\Gamma^k_{ij}$ are given, we have a correspondence between two nearby tangent spaces. This is an affine connection, and the $\Gamma^k_{ij}$ are called the coefficients of the affine connection. Note that $\tilde e_i(z + dz)$ and $e_i(z)$ belong to different tangent spaces, so that we cannot subtract one from the other directly. We have defined the intrinsic difference $\delta e_i$ by the above equation.

Formally, an affine connection is defined by using the covariant derivative operator $\nabla_X Y$, which operates on vector fields $X$ and $Y$, and shows how the field $Y$ changes as points move in the direction of $X$. A covariant derivative $\nabla_X Y$ is defined by using an affine connection as

$$\nabla_X Y = \sum_k \Big( \frac{\partial Y_k}{\partial z_i} + \sum \Gamma^k_{ij} Y_j \Big) X_i\, e_k, \qquad (121)$$

for two vector fields

$$X = \sum X_i e_i, \qquad (122)$$

$$Y = \sum Y_j e_j. \qquad (123)$$

The basis vectors are regarded as vector fields, and the covariant derivative of the vector field $e_j(z)$ in the direction of $e_i(z)$ is

$$\nabla_{e_i} e_j = \sum \Gamma^k_{ij}\, e_k. \qquad (124)$$

This shows how $e_j(z)$ changes intrinsically as points move in the direction of $e_i$. The covariant version of $\Gamma^k_{ij}$ is

$$\Gamma_{ijk} = \sum_m \Gamma^m_{ij}\, g_{mk}. \qquad (125)$$

4.2. Affine connection derived from divergence. An affine connection is derived from a divergence function $D[y : z]$. It was shown by Eguchi [37] that the coefficients of the derived affine connection are given by

$$\Gamma_{ijk}(z) = -\frac{\partial^3}{\partial z_i \partial z_j \partial y_k} D[z : y]\Big|_{y=z}. \qquad (126)$$

A curve $z(t)$ is a geodesic when

$$\nabla_{\dot z}\, \dot z = 0, \qquad (127)$$

where $\dot z = \frac{d}{dt} z(t)$. This can be rewritten in the component form

$$\ddot z_k(t) + \sum \Gamma_{ijk}\, \dot z_i\, \dot z_j = 0. \qquad (128)$$

Since our affine connection is different from that derived from the Riemannian metric, a geodesic is not a curve of minimal distance connecting two points. But it keeps straightness in the sense of this affine connection, because the tangent direction never changes along the curve.

We can define another affine connection, the dual connection, from the dual divergence

$$D^*[z : y] = D[y : z]. \qquad (129)$$

The dual affine connection $\Gamma^*_{ijk}$ is given, in terms of the coefficients, as

$$\Gamma^*_{ijk} = -\frac{\partial^3}{\partial y_i \partial y_j \partial z_k} D[z : y]\Big|_{y=z}. \qquad (130)$$

Two affine connections are said to be mutually dual when

$$X \langle Y, Z \rangle = \langle \nabla_X Y, Z \rangle + \langle Y, \nabla^*_X Z \rangle \qquad (131)$$

holds for three vector fields $X$, $Y$ and $Z$. In terms of the components, this relation can be written as

$$\Gamma_{kij} + \Gamma^*_{kji} = \frac{\partial}{\partial z_k}\, g_{ij}. \qquad (132)$$

The Riemannian connection (Levi-Civita connection) $\Gamma^0_{ijk}$ is given by

$$\Gamma^0_{ijk} = \frac{1}{2} \Big( \frac{\partial}{\partial z_i} g_{jk} + \frac{\partial}{\partial z_j} g_{ki} - \frac{\partial}{\partial z_k} g_{ij} \Big). \qquad (133)$$

It is the average of the two connections,

$$\Gamma^0_{ijk} = \frac{1}{2} \big( \Gamma_{ijk} + \Gamma^*_{ijk} \big). \qquad (134)$$

The Riemannian geodesic, which minimizes the Riemannian length of a curve, is given by

$$\ddot z_k + \sum \Gamma^0_{ijk}\, \dot z_i\, \dot z_j = 0. \qquad (135)$$

When two affine connections are dually coupled, we have a tensor

$$T_{ijk} = \Gamma_{ijk} - \Gamma^*_{ijk}, \qquad (136)$$

which is symmetric with respect to the three indices $i$, $j$ and $k$. Therefore, by using this tensor, the two dually coupled affine connections can be written as

$$\Gamma_{ijk} = \Gamma^0_{ijk} - \frac{1}{2} T_{ijk}, \qquad \Gamma^*_{ijk} = \Gamma^0_{ijk} + \frac{1}{2} T_{ijk}. \qquad (137)$$

4.3. Geometry of Bregman divergence. We have already studied the geometry derived from a Bregman divergence,

$$D[y : z] = k(y) - k(z) - \{\mathrm{Grad}\; k(z)\} \cdot (y - z). \qquad (138)$$

From (115), the Riemannian metric is given by the Hessian of $k$,

$$g_{ij}(z) = \frac{\partial^2}{\partial z_i \partial z_j} k(z). \qquad (139)$$

Furthermore, we see from (126) that

$$\Gamma_{ijk}(z) = 0 \qquad (140)$$

holds. Hence, the space is flat with respect to this connection, and the related coordinates $z$ are affine. However, note that $\Gamma^*_{ijk}(z) \neq 0$. If we use the dual affine coordinate system $z^*$, then the dual affine connection $\Gamma^*_{ijk}(z^*)$ vanishes in the dual coordinate system.

4.4. Geometry of f-divergence. We now study the geometrical structure of $S_n$ induced by an f-divergence, which has information monotonicity. We begin with the induced Riemannian metric.

Theorem 5. Any f-divergence induces a unique information monotonic Riemannian metric, which is given by the Fisher information matrix.

Proof. The Riemannian metric induced by $D_f$ is calculated from (88) as

$$D_f[p : p + dp] = \frac{1}{2} \sum_{i=0}^n \frac{(dp_i)^2}{p_i}. \qquad (141)$$

Let $z = (p_1, \ldots, p_n)$ be the coordinate system, where $p_i = z_i$ ($i = 1, \ldots, n$) and $p_0 = 1 - \sum z_i$. By eliminating $dp_0$ by using $dp_0 = -\sum dz_i$, we have

$$g_{ij} = \frac{1}{p_i}\, \delta_{ij} + \frac{1}{p_0}. \qquad (142)$$

This coincides with the Fisher information matrix defined by

$$g_{ij}(z) = \int p(x, z)\, \frac{\partial \log p(x, z)}{\partial z_i}\, \frac{\partial \log p(x, z)}{\partial z_j}\, dx. \qquad (143)$$

Hence, for any $f$, the Riemannian metric is the same and is given by the Fisher information matrix. This is the only invariant metric of $S_n$.

Since an f-divergence $D_f[p : q]$ cannot be written in the form of a Bregman divergence, except for the case of the KL-divergence, there are no affine and dual affine coordinate systems such that $\Gamma_{ijk}(z) = 0$ or $\Gamma^*_{ijk}(z^*) = 0$ holds. This implies that $D_f$ does not induce a flat structure in $S_n$. However, it induces a pair of dual affine connections, which define primal and dual geodesics. A symmetric tensor is defined by

$$T_{ijk}(z) = \int p(x, z)\, \frac{\partial \log p(x, z)}{\partial z_i}\, \frac{\partial \log p(x, z)}{\partial z_j}\, \frac{\partial \log p(x, z)}{\partial z_k}\, dx. \qquad (144)$$
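As a numerical illustration of (115) and Theorem 5 (a sketch of ours, not part of the original paper; the function names, test point and step size are our choices), the metric induced by the KL-divergence on the simplex can be approximated by finite differences of the first argument and compared with the closed form (142):

```python
import numpy as np

def p_full(z):
    """Full probability vector (p_0, p_1, ..., p_n) with p_0 = 1 - sum z_i."""
    return np.concatenate(([1.0 - z.sum()], z))

def kl(u, z):
    """KL divergence between the distributions with coordinates u and z."""
    pu, pz = p_full(u), p_full(z)
    return np.sum(pu * np.log(pu / pz))

def induced_metric(D, z, h=1e-4):
    """g_ij(z) = d^2 D(u : z)/du_i du_j at u = z, cf. (115),
    via central finite differences in the first argument."""
    n = len(z)
    I = np.eye(n)
    g = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            g[i, j] = (D(z + h * I[i] + h * I[j], z) - D(z + h * I[i] - h * I[j], z)
                       - D(z - h * I[i] + h * I[j], z) + D(z - h * I[i] - h * I[j], z)) / (4 * h**2)
    return g

def fisher_closed_form(z):
    """Fisher information matrix of (142): g_ij = delta_ij / p_i + 1 / p_0."""
    return np.diag(1.0 / z) + 1.0 / (1.0 - z.sum())
```

Up to discretization error, the numerically induced metric agrees with (142) and is positive definite, as required by (114).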


The integration reduces to the summation in the discrete case. We can then calculate the induced primal and dual affine connections.

Theorem 6. An f-divergence $D_f[z : y]$ induces the primal and dual affine connections defined by

$$\Gamma_{ijk} = \Gamma^{\alpha}_{ijk} = \Gamma^0_{ijk} - \frac{\alpha}{2} T_{ijk}, \qquad (145)$$

$$\Gamma^*_{ijk} = \Gamma^{-\alpha}_{ijk} = \Gamma^0_{ijk} + \frac{\alpha}{2} T_{ijk}, \qquad (146)$$

where $\alpha = 3 + 2 f'''(1)$.

Proof. By simple calculations, we have

$$\Gamma_{ijk} = -\frac{\partial^3}{\partial z_i \partial z_j \partial y_k} D_f[z : y]\Big|_{y=z} = \Gamma^0_{ijk} - \Big( f'''(1) + \frac{3}{2} \Big) T_{ijk}, \qquad (147)$$

and its dual,

$$\Gamma^*_{ijk} = \Gamma^0_{ijk} + \Big( f'''(1) + \frac{3}{2} \Big) T_{ijk}. \qquad (148)$$

They are two dually coupled affine connections, given by

$$\Gamma^{\pm\alpha}_{ijk} = \Gamma^0_{ijk} \mp \frac{\alpha}{2} T_{ijk}, \qquad (149)$$

where $\alpha = 3 + 2 f'''(1)$. The geometrical structure given by the triplet $\{S, g_{ij}, T_{ijk}\}$ is called the $\alpha$-geometry.

The $\alpha$-geometry is a natural consequence of information monotonicity. This is explained from the invariance principle that the geometrical structure is invariant under a transformation from $x$ to $t(x)$, when $t$ is a sufficient statistic. For more details, see [1, 11, 36].

Among these divergences, the KL-divergence and its dual are unique in the sense that the space $S_n$ is dually flat. In other words, there exist affine and dual affine coordinate systems $z$ and $z^*$ such that the two affine connections $\Gamma_{ijk}$ and $\Gamma^*_{ijk}$ vanish in the respective coordinate systems. Hence, it can be written in the form of a Bregman divergence, as is given in Subsec. 2.3. A dually flat manifold admits the Pythagorean and projection theorems, which are useful for a wide range of applications.

4.5. Invariant Geometry of M_n. An f-divergence induces a dually flat structure in $M_n$, as we saw in the previous section. For positive measures $z = (z_1, \ldots, z_n)$, any f-divergence gives

$$D_f(z : z + dz) = \frac{1}{2} \sum \frac{1}{z_i}\, (dz_i)^2. \qquad (150)$$

Hence, the Riemannian metric is

$$g_{ij}(z) = \frac{1}{z_i}\, \delta_{ij}. \qquad (151)$$

It is Euclidean, because it can be changed into

$$\tilde g_{ij}(\tilde z) = \delta_{ij} \qquad (152)$$

by the coordinate transformation

$$\tilde z_i = 2 \sqrt{z_i}. \qquad (153)$$

We can also calculate the coefficients of the primal and dual affine connections.

Theorem 7. An f-divergence $D_f[z : y]$ induces the primal and dual affine connections in $M_n$,

$$\Gamma^{\alpha}_{ijk}(z) = \frac{1}{2 z_i^2}\, (-1 - \alpha)\, \delta_{ijk}, \qquad (154)$$

$$\Gamma^*_{ijk}(z) = \frac{1}{2 z_i^2}\, (-1 + \alpha)\, \delta_{ijk}, \qquad (155)$$

where $\alpha = 3 + 2 f'''(1)$, and $\delta_{ijk} = 1$ when $i = j = k$ and $0$ otherwise.

This shows that any f-divergence induces a family of affine connections indexed by $\alpha = 3 + 2 f'''(1)$. We can further prove the following flatness theorem.

Theorem 8. The primal and dual affine connections of $M_n$ induced by an f-divergence are flat.

Proof. The theorem can be proved by calculating the Riemann-Christoffel curvature tensors, which we omit.

A flat manifold has an affine coordinate system. We have already shown that the $\alpha$-representation given in (107) is an affine coordinate system, and the $-\alpha$-representation is a dual affine coordinate system.

4.6. Canonical divergence. Given a divergence $D[z : y]$, we have a Riemannian metric and a dual pair of affine connections. However, given a Riemannian manifold with a dual pair of affine connections, we have infinitely many divergences that generate the same geometrical structure. This is because the differential-geometrical structure depends only on local properties of a divergence function. For example, given a divergence $D[z : y]$, its modification

$$\tilde D[z : y] = D[z : y] + c \sum |z_i - y_i|^4, \qquad c > 0, \qquad (156)$$

gives the same geometry.

When a manifold $S$ is dually flat, there exist two dual affine coordinate systems $z$ and $z^*$, accompanied by two convex functions $k(z)$ and $k^*(z^*)$. These coordinate systems are unique up to affine transformations, and the convex functions are unique up to linear terms. However, the divergence is uniquely determined,

$$D[y : z] = k(y) + k^*(z^*) - y \cdot z^*. \qquad (157)$$

We call it the canonical divergence of a dually flat manifold. The Bregman divergence is the canonical divergence of a dually flat Riemannian manifold. So we have:

Theorem 9. The $\alpha$-divergence is the canonical divergence in the dually flat space $M_n$ of positive measures. The KL-divergence is the canonical divergence in the dually flat space $S_n$ of discrete probability distributions.
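The flattening transformation (153) can be checked numerically. The following sketch (our own illustration; the sample measure and the small displacement are arbitrary choices) compares the invariant squared line element (150) with the Euclidean one computed in the coordinates $\tilde z_i = 2\sqrt{z_i}$:

```python
import numpy as np

z = np.array([0.5, 1.0, 2.0])            # a positive measure (example values)
dz = 1e-5 * np.array([1.0, -2.0, 0.5])   # a small displacement

# Squared line element of the invariant metric g_ij = delta_ij / z_i, cf. (150), (151)
ds2_invariant = 0.5 * np.sum(dz**2 / z)

# The same displacement expressed in the coordinates z~_i = 2 sqrt(z_i), cf. (153),
# in which the metric is Euclidean, cf. (152)
dz_tilde = 2.0 * np.sqrt(z + dz) - 2.0 * np.sqrt(z)
ds2_euclidean = 0.5 * np.sum(dz_tilde**2)
```

For an infinitesimal displacement, the two line elements agree to first order, confirming that the metric (151) is Euclidean in the transformed coordinates.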


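To illustrate how the canonical divergence (157) reproduces the Bregman divergence (138), consider a sketch (ours, not from the paper; the potential and all names are our choices) using the negative-entropy potential $k(z) = \sum (z_i \log z_i - z_i)$ on positive measures, whose canonical divergence is the extended KL-divergence:

```python
import numpy as np

def k(z):
    """Convex potential (our example): negative entropy on positive measures."""
    return np.sum(z * np.log(z) - z)

def grad_k(z):
    """Dual coordinates z* = Grad k(z) from the Legendre duality."""
    return np.log(z)

def k_star(zs):
    """Legendre dual potential k*(z*)."""
    return np.sum(np.exp(zs))

def canonical(y, z):
    """Canonical divergence (157): D[y : z] = k(y) + k*(z*) - y . z*."""
    zs = grad_k(z)
    return k(y) + k_star(zs) - np.dot(y, zs)

def bregman(y, z):
    """Bregman divergence (138): k(y) - k(z) - Grad k(z) . (y - z)."""
    return k(y) - k(z) - np.dot(grad_k(z), y - z)

def extended_kl(y, z):
    """KL-divergence extended to positive measures."""
    return np.sum(y * np.log(y / z) - y + z)
```

With this potential, the three expressions agree identically, which is a concrete instance of the Bregman divergence being the canonical divergence of a dually flat manifold.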
4.7. Symmetric divergence. A divergence is not symmetric in general. However, we can construct a symmetric divergence $D_S[z : y]$ from an asymmetric $D[z : y]$ by

$$D_S[z : y] = \frac{1}{2} \big\{ D[z : y] + D[y : z] \big\}. \qquad (158)$$

Its geometry is elucidated by the following theorem.

Theorem 10. The symmetrized divergence gives the same Riemannian metric as the asymmetric one. The derived affine connections are self-dual and coincide with the Riemannian connection,

$$\Gamma^S_{ijk} = \Gamma^{*S}_{ijk} = \Gamma^0_{ijk}. \qquad (159)$$

Proof. From

$$\frac{\partial^2}{\partial z_i \partial z_j} D[z : y]\Big|_{y=z} = \frac{\partial^2}{\partial y_i \partial y_j} D[z : y]\Big|_{y=z}, \qquad (160)$$

we have

$$g^S_{ij} = g_{ij}. \qquad (161)$$

It is easy to see that

$$-\frac{\partial^3}{\partial z_i \partial z_j \partial y_k} D_S[z : y]\Big|_{y=z} = \frac{1}{2} \big( \Gamma_{ijk} + \Gamma^*_{ijk} \big). \qquad (162)$$

We may say that the squared Riemannian distance is the canonical divergence of a space with a symmetric divergence.

4.8. Tsallis entropy, q-exponential family, and conformal geometry. Let us define

$$h_q(p) = \sum p_i^q, \qquad 0 < q. \qquad (163)$$

For a dual geodesic in the original geometry, given by

$$z^*(t) = t a^* + (1 - t) b^*, \qquad (169)$$

the curve in the new dual coordinate system is given by

$$\tilde z^*(t) = t \tilde f'(t)\, a^* + (1 - t) \tilde f'(t)\, b^*, \qquad (170)$$

where

$$\tilde f'(t) = f' \big[ k \{ z(z^*(t)) \} \big]. \qquad (171)$$

This is not a dual geodesic of the derived geometry. However, when $b^* = 0$, a dual geodesic passes through the origin $z^* = 0$, which is the point minimizing the convex functions $k(z)$ and $\tilde k(z)$. Such a dual geodesic is again a dual geodesic with respect to the transformed geometry, since

$$z^*(t) = t a^* \qquad (172)$$

is transformed to

$$\tilde z^*(t) = t \tilde f'(t)\, a^*. \qquad (173)$$

When $k(z)$ is the negative entropy, this represents the fact that the maximum-entropy theorem holds commonly both for $k(z)$ and $\tilde k(z)$.

We further study the geometry of $S_n$ derived from the q-exponential family [17, 18, 27]. Let us define the q-exponential function by

$$\exp_q(z) = \begin{cases} \{ 1 + (1 - q) z \}^{\frac{1}{1-q}}, & q \neq 1, \\ \exp z, & q = 1. \end{cases} \qquad (174)$$
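The q-exponential function (174) is straightforward to implement. A minimal sketch (ours, following the formula above), which recovers the ordinary exponential as $q \to 1$:

```python
import numpy as np

def exp_q(z, q):
    """q-exponential of (174): {1 + (1 - q) z}^{1/(1 - q)} for q != 1, exp(z) for q = 1.
    Note: for the paper's formula to stay real-valued, 1 + (1 - q) z must be positive."""
    if q == 1.0:
        return np.exp(z)
    return (1.0 + (1.0 - q) * z) ** (1.0 / (1.0 - q))
```

For example, $q = 2$ gives $\exp_2(z) = 1/(1 - z)$, and values of $q$ close to 1 approach the ordinary exponential continuously.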

Theorem 12. The q-divergence is a conformal transform of the $\alpha$-divergence, with $q = (1 + \alpha)/2$,

$$D_q[p : q] = \frac{1}{(1 - q)\, h_q(p)} \Big( 1 - \sum p_i^q\, q_i^{1-q} \Big). \qquad (182)$$

These results pave the way to develop conformal information geometry further.

5. Conclusions

We have studied various divergence measures and the differential-geometrical structures derived therefrom. This connects many important engineering problems in computational vision, optimization, signal processing, neural networks, etc., with modern information geometry. A divergence function gives a Riemannian metric to the underlying space, and furthermore a pair of dually coupled affine connections. It has been shown that the Bregman divergence always gives a dually flat Riemannian structure, and conversely, a dually flat Riemannian manifold always gives a canonical divergence of the Bregman type. We have presented a number of important divergences of the Bregman type.

Information monotonicity is a natural requirement for a divergence defined in a manifold of probability distributions. This leads to the class of f-divergences. We have proved that an f-divergence gives the unique invariant Riemannian metric, which is the Fisher information matrix. It also gives a class of $\alpha$-connections, where the $\alpha$- and $-\alpha$-connections are dually coupled. By extending it to the manifold of positive measures, we have shown that the $\alpha$-divergences are canonical divergences.

We have addressed the geometrical structure inspired by the Tsallis q-entropy and the corresponding divergences. This will open a new field of conformal information geometry.

REFERENCES

[1] S. Amari and H. Nagaoka, Methods of Information Geometry, Oxford University Press, New York, 2000.
[2] A. Cichocki, R. Zdunek, A.H. Phan, and S. Amari, Nonnegative Matrix and Tensor Factorizations, John Wiley, New York, 2009.
[3] F. Nielsen, "Emerging trends in visual computing", Lecture Notes in Computer Science 6, CD-ROM (2009).
[4] L. Bregman, "The relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming", Comp. Math. Phys., USSR 7, 200–217 (1967).
[5] A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences", J. Machine Learning Research 6, 1705–1749 (2005).
[6] M.S. Ali and S.D. Silvey, "A general class of coefficients of divergence of one distribution from another", J. Royal Statistical Society B 28, 131–142 (1966).
[7] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observations", Studia Sci. Math. 2, 299–318 (1967).
[8] I. Csiszár, "Information measures: a critical survey", Transactions of the 7th Prague Conf. 1, 83–86 (1974).
[9] I. Taneja and P. Kumar, "Relative information of type s, Csiszár's f-divergence, and information inequalities", Information Sciences 166, 105–125 (2004).
[10] I. Csiszár, "Why least squares and maximum entropy? An axiomatic approach to inference for linear problems", Annals of Statistics 19, 2032–2066 (1991).
[11] N.N. Chentsov, Statistical Decision Rules and Optimal Inference, American Mathematical Society, New York, 1972.
[12] I. Csiszár, "Axiomatic characterizations of information measures", Entropy 10, 261–273 (2008).
[13] G. Pistone and C. Sempi, "An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one", Annals of Statistics 23, 1543–1561 (1995).
[14] S. Amari, "Alpha divergence is unique, belonging to both classes of f-divergence and Bregman divergence", IEEE Trans. Information Theory 55, 4925–4931 (2009).
[15] C. Tsallis, "Possible generalization of Boltzmann-Gibbs statistics", J. Stat. Phys. 52, 479–487 (1988).
[16] A. Rényi, "On measures of entropy and information", Proc. 4th Berkeley Symp. Math. Statist. and Prob. 1, 547–561 (1961).
[17] J. Naudts, "Estimators, escort probabilities, and phi-exponential families in statistical physics", J. Ineq. Pure App. Math. 5, 102 (2004).
[18] J. Naudts, "Generalized exponential families and associated entropy functions", Entropy 10, 131–149 (2008).
[19] H. Suyari, "Mathematical structures derived from the q-multinomial coefficient in Tsallis statistics", Physica A 368, 63–82 (2006).
[20] S. Amari, "Information geometry and its applications: convex function and dually flat manifold", Emerging Trends in Visual Computing, Lecture Notes in Computer Science 5416 (2009).
[21] M.R. Grasselli, "Duality, monotonicity and Wigner-Yanase-Dyson metrics", Infinite Dimensional Analysis, Quantum Probability and Related Topics 7, 215–232 (2004).
[22] H. Hasegawa, "α-divergence of the non-commutative information geometry", Reports on Mathematical Physics 33, 87–93 (1993).
[23] D. Petz, "Monotone metrics on matrix spaces", Linear Algebra and its Applications 244, 81–96 (1996).
[24] I.S. Dhillon and J.A. Tropp, "Matrix nearness problem with Bregman divergences", SIAM J. on Matrix Analysis and Applications 29, 1120–1146 (2007).
[25] Yu. Nesterov and M.J. Todd, "On the Riemannian geometry defined by self-concordant barriers and interior-point methods", Foundations of Computational Mathematics 2, 333–361 (2002).
[26] A. Ohara and T. Tsuchiya, An Information Geometric Approach to Polynomial-time Interior-point Algorithms, (to be published).
[27] A. Ohara, "Information geometric analysis of an interior point method for semidefinite programming", Geometry in Present Day Science 1, 49–74 (1999).
[28] N. Murata, T. Takenouchi, T. Kanamori, and S. Eguchi, "Information geometry of U-boost and Bregman divergence", Neural Computation 26, 1651–1686 (2004).
[29] S. Eguchi and J. Copas, "A class of logistic type discriminant functions", Biometrika 89, 1–22 (2002).
[30] M. Minami and S. Eguchi, "Robust blind source separation by beta-divergence", Neural Computation 14, 1859–1886 (2004).
[31] H. Fujisawa and S. Eguchi, "Robust parameter estimation with a small bias against heavy contamination", J. Multivariate Analysis 99, 2053–2081 (2008).
[32] J. Havrda and F. Charvát, "Quantification method of classification process. Concept of structural α-entropy", Kybernetika 3, 30–35 (1967).
[33] H. Chernoff, "A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations", Annals of Mathematical Statistics 23, 493–507 (1952).
[34] S. Amari, "Integration of stochastic models by minimizing α-divergence", Neural Computation 19, 2780–2796 (2007).
[35] Y. Matsuyama, "The α-EM algorithm: Surrogate likelihood maximization using α-logarithmic information measures", IEEE Trans. 49, 672–706 (2002).
[36] J. Zhang, "Divergence function, duality, and convex analysis", Neural Computation 16, 159–195 (2004).
[37] S. Eguchi, "Second order efficiency of minimum contrast estimators in a curved exponential family", Annals of Statistics 11, 793–803 (1983).
