
Front. Electr. Electron. Eng. China 2010, 5(3): 241–260 DOI 10.1007/s11460-010-0101-3

Shun-ichi AMARI

Information geometry in optimization, machine learning and statistical inference

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2010

Abstract   The present article gives an introduction to information geometry and surveys its applications in the area of machine learning, optimization and statistical inference. Information geometry is explained intuitively by using divergence functions introduced in a manifold of probability distributions and other general manifolds. They give a Riemannian structure together with a pair of dual flatness criteria. Many manifolds are dually flat. When a manifold is dually flat, a generalized Pythagorean theorem and related projection theorem are introduced. They provide useful means for various approximation and optimization problems. We apply them to alternative minimization problems, Ying-Yang machines and belief propagation algorithm in machine learning.

Keywords   information geometry, machine learning, optimization, statistical inference, divergence, graphical model, Ying-Yang machine

Received January 15, 2010; accepted February 5, 2010

Shun-ichi AMARI
RIKEN Brain Science Institute, Saitama 351-0198, Japan
E-mail: [email protected]

1 Introduction

Information geometry [1] deals with a manifold of probability distributions from the geometrical point of view. It studies the invariant structure by using the Riemannian geometry equipped with a dual pair of affine connections. Since probability distributions are used in many problems in optimization, machine learning, vision, statistical inference, neural networks and others, information geometry provides a useful and strong tool to many areas of information sciences and engineering.

Many researchers in these fields, however, are not familiar with modern differential geometry. The present article intends to give an understandable introduction to information geometry without modern differential geometry. Since underlying manifolds in most applications are dually flat, the dually flat structure plays a fundamental role. We explain the fundamental dual structure and related dual geodesics without using the concept of affine connections and covariant derivatives.

We begin with a divergence function between two points in a manifold. When it satisfies an invariance criterion of information monotonicity, it gives a family of f-divergences [2]. When a divergence is derived from a convex function in the form of Bregman divergence [3], this gives another type of divergence, where the Kullback-Leibler divergence belongs to both of them. We derive a geometrical structure from a divergence function [4]. The Riemannian structure is derived from an invariant divergence (f-divergence) (see Refs. [1,5]), while the dually flat structure is derived from the Bregman divergence (convex function).

The manifold of all discrete probability distributions is dually flat, where the Kullback-Leibler divergence plays a key role. We give the generalized Pythagorean theorem and projection theorem in a dually flat manifold, which play a fundamental role in applications. Such a structure is not limited to a manifold of probability distributions, but can be extended to the manifolds of positive arrays, matrices and visual signals, and will be used in neural networks and optimization problems.

After introducing basic properties, we show three areas of applications. One is application to the alternative minimization procedures such as the expectation-maximization (EM) algorithm in statistics [6–8]. The second is an application to the Ying-Yang machine introduced and extensively studied by Xu [9–14]. The third one is application to the belief propagation algorithm of stochastic reasoning in machine learning or artificial intelligence [15–17]. There are many other applications in analysis of spiking patterns of the brain, neural networks, boosting algorithm of machine learning, as well as a wide range of statistical inference, which we do not mention here.

2 Divergence function and information geometry

2.1 Manifold of probability distributions and positive arrays

We introduce divergence functions in various spaces or manifolds. To begin with, we show typical examples of manifolds of probability distributions. A one-dimensional Gaussian distribution with mean μ and variance σ² is represented by its probability density function

    p(x; μ, σ) = 1/(√(2π)σ) exp{−(x − μ)²/(2σ²)}.   (1)

It is parameterized by a two-dimensional parameter ξ = (μ, σ). Hence, when we treat all such Gaussian distributions, not a particular one, we need to consider the set SG of all the Gaussian distributions. It forms a two-dimensional manifold

    SG = {p(x; ξ)},   (2)

where ξ = (μ, σ) is a coordinate system of SG. This is not the only coordinate system. It is possible to use other parameterizations or coordinate systems when we study SG.

We show another example. Let x be a discrete random variable taking values on a finite set X = {0, 1, ..., n}. Then, a probability distribution is specified by a vector p = (p0, p1, ..., pn), where

    pi = Prob{x = i}.   (3)

We may write

    p(x; p) = Σi pi δi(x),   (4)

where

    δi(x) = 1 (x = i), 0 (x ≠ i).   (5)

Since p is a probability vector, we have

    Σi pi = 1,   (6)

and we assume

    pi > 0.   (7)

The set of all the probability distributions is denoted by

    Sn = {p},   (8)

which is an n-dimensional simplex because of (6) and (7). When n = 2, Sn is a triangle (Fig. 1). Sn is an n-dimensional manifold, and ξ = (p1, ..., pn) is a coordinate system. There are many other coordinate systems. For example,

    θi = log(pi/p0),   i = 1, ..., n,   (9)

is an important coordinate system of Sn, as we will see later.

Fig. 1   Manifold S2 of discrete probability distributions

The third example deals with positive measures, not probability measures. When we disregard the constraint Σ pi = 1 of (6) in Sn, keeping pi > 0, p is regarded as an (n + 1)-dimensional positive array, or a positive measure where x = i has measure pi. We denote the set of positive measures or arrays by

    Mn+1 = {z; zi > 0, i = 0, 1, ..., n}.   (10)

This is an (n + 1)-dimensional manifold with a coordinate system z. Sn is its submanifold derived by the linear constraint Σ zi = 1.

In general, we can regard any regular statistical model

    S = {p(x, ξ)},   (11)

parameterized by ξ, as a manifold with a (local) coordinate system ξ. It is a space M of positive measures when the constraint ∫ p(x, ξ)dx = 1 is discarded. We may treat any other types of manifolds and introduce dual structures in them. For example, we will consider a manifold consisting of positive-definite matrices.

2.2 Divergence function and geometry

We consider a manifold S having a local coordinate system z = (zi). A function D[z : w] between two points z and w of S is called a divergence function when it satisfies the following two properties:
1) D[z : w] ≥ 0, with equality when and only when z = w.
2) When the difference between w and z is infinitesimally small, we may write w = z + dz, and Taylor expansion gives

    D[z : z + dz] = (1/2) Σ gij(z) dzi dzj,   (12)

where

    gij(z) = ∂²D[z : w]/∂zi∂zj |w=z   (13)

is a positive-definite matrix.
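As a concrete illustration of (12) and (13), the following short numerical sketch (ours, not part of the original paper; the function names are chosen only for illustration) extracts the matrix gij(z) from a given divergence by central finite differences. Applied to the Kullback-Leibler divergence extended to positive arrays, it recovers the diagonal metric diag(1/zi) in these coordinates.

import numpy as np

def metric_from_divergence(D, z, eps=1e-4):
    # g_ij(z): second derivative of D[. : z] in its first argument at z, cf. (13)
    n = len(z)
    g = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            g[i, j] = (D(z + ei + ej, z) - D(z + ei - ej, z)
                       - D(z - ei + ej, z) + D(z - ei - ej, z)) / (4 * eps ** 2)
    return g

def kl(p, q):
    # Kullback-Leibler divergence extended to positive arrays (not necessarily normalized)
    return np.sum(p * np.log(p / q) + q - p)

z = np.array([0.2, 0.3, 0.5])
print(metric_from_divergence(kl, z))   # approximately diag(1/z_i)
print(np.diag(1.0 / z))

The same routine applied to the Euclidean divergence introduced below returns the identity matrix, in agreement with (19).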


A divergence does not need to be symmetric, and 3 ∂ Γijk(z)=− D[z : w]|w=z, (16) D [z : w] = D [w : z] (14) ∂zi∂zj∂wk 3 ∗ z − ∂ z w in general, nor does it satisfy the triangular inequality. Γijk( )= D[ : ]|w=z, (17) ∂wi∂wj∂zk Hence, it is not a . It rather has a dimension of square of distance as is seen from (12). So (12) is consid- derived from a divergence D[z : w]. These two are du- ered to define the square of the local distance ds between ally coupled with respect to the Riemannian metric gij two nearby points Z =(z)andZ +dZ =(z +dz), [1]. The meaning of dually coupled affine connections is not explained here, but will become clear in later sec- 2 ds = gij (z)dzidzj. (15) tions, by using specific examples. The Euclidean divergence, defined by More precisely, dz is regarded as a small line element z w 1 − 2 connecting two points Z and Z +dZ. Thisisatan- DE[ : ]= (zi wi) , (18) 2 gent vector at point Z (Fig. 2). When a manifold has a positive-definite matrix gij (z)ateachpointz,itis is a special case of divergence. We have called a Riemannian manifold, and (gij )isaRieman- gij = δij , (19) nian metric tensor. ∗ Γijk =Γijk =0, (20)

where δij is the Kronecker delta. Therefore, the derived geometry is Euclidean. Since it is self-dual (Γijk = Γ*ijk), the duality does not play a role.

2.3 Invariant divergence: f-divergence

2.3.1 Information monotonicity

Let us consider a function t(x) of random variable x, where t and x are vector-valued. We can derive proba- Fig. 2 Manifold, tangent space and tangent vector bility distributionp ¯(t, ξ)oft from p(x, ξ)by An affine connection defines a correspondence between p¯(t, ξ)dt = p(x, ξ)dx. (21) two nearby tangent spaces. By using it, a geodesic is de- t=t(x) fined: A geodesic is a curve of which the tangent direc- offprintWhen t(x) is not reversible, that is, t(x)isamany-to- tions do not change along the curve by this correspon- dence (Fig. 3). It is given mathematically by a covari- one mapping, there is loss of information by summariz- x t t x ant derivative, and technically by the Christoffel symbol, ing observed data into reduced = ( ). Hence, for two divergences between two distributions specified by Γijk(z), which has three indices. ξ ξ 1 and 2, D = D [p (x, ξ1):p (x, ξ2)] , (22)

    D̄ = D[p̄(t, ξ1) : p̄(t, ξ2)],   (23)

it is natural to require

    D̄ ≤ D.   (24)

The equality holds when and only when t(x) is a sufficient statistic. A divergence function is said to be invariant when this requirement is satisfied. The invariance is a key concept in constructing information geometry of probability distributions [1,5].
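The monotonicity requirement (24) is easy to check numerically. The following sketch (ours, for illustration only; the partition and the two distributions are arbitrary toy choices) coarse-grains two distributions over six atoms into three subsets and compares the Kullback-Leibler divergences before and after coarse-graining:

import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.10, 0.15, 0.05, 0.30, 0.25, 0.15])
q = np.array([0.20, 0.05, 0.15, 0.25, 0.10, 0.25])

# coarse-graining t(x): partition the six atoms into three subsets
partition = [[0, 1], [2, 3], [4, 5]]
p_bar = np.array([p[T].sum() for T in partition])
q_bar = np.array([q[T].sum() for T in partition])

print(kl(p_bar, q_bar), kl(p, q))   # the coarse-grained value never exceeds the original, cf. (24)

Equality would require the conditional distributions inside each subset to coincide, which is the condition discussed in the next subsection.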

Fig. 3 Geodesic, keeping the same tangent direction Here, we use a simplified version of invariance due to Csisz´ar [18,19]. We consider the space Sn of Oneoffprint may skip the following two paragraphs, since all probability distributions over n + 1 atoms X = technical details are not used in the following. Eguchi {x0,x1,...,xn}. A probability distributions is given by [4] proposed the following two connections: p =(p0,p1,...,pn), pi =Prob{x = xi}.


Let us divide X into m subsets, T1,T2,...,Tm (m< by Ali and Silvey [20]. It is defined by n +1),say qi { } { } Df [p : q]= pif , (34) T1 = x1,x2,x5 ,T2 = x3,x8,... , ... (25) pi

This is a partition of X, where f is a convex function satisfying X = ∪Ti, (26) f(1) = 0. (35) Ti ∩ Tj = φ. (27) For a function cf with a constant c,wehave Let t be a mapping from X to {T1,...,Tm}. Assume that we do not know the outcome x directly. Instead Dcf [p : q]=cDf [p : q] . (36) we can observe t(x)=Tj, knowing the subset Tj to which x belongs. This is called coarse-graining of X Hence, f and cf give the same divergence except for into T = {T1,...,Tm}. the scale factor c. In order to standardize the scale of The coarse-graining generates a new probability dis- divergence, we may assume that tributions p¯ =(¯p1,...,p¯m)overT1,...,Tm, f (1) = 1, (37) p¯j =Prob{Tj} = Prob {x} . (28)

x∈Tj provided f is differentiable. Further, for fc(u)=f(u) − − Let D¯ [p¯ : q¯] be an induced divergence between p¯ and q¯. c(u 1) where c is any constant, we have Since coarse-graining summarizes a number of elements Df [p : q]=Df [p : q] . (38) into one subset, detailed information of the outcome is c lost. Therefore, it is natural to require Hence, we may use such an f that satisfies

    D̄[p̄ : q̄] ≤ D[p : q].   (29)

    f′(1) = 0   (39)

When does the equality hold? Assume that the out- without loss of generality. A convex function satisfying come x isknowntobelongtoTj. This gives some in- the above three conditions (35), (37), (39) is called a p q formation to distinguish two distributions and .If standard f function. we know further detail of x inside subset Tj,weobtain A divergence is said to be decomposable, when it is a more information to distinguish the two probability dis- sum of functions of components tributions p and q.Sincex belongs to Tj,weconsider the two conditional probability distributions D[p : q]= D [pi : qi] . (40) i p (xi |xi ∈ Tj ) ,q(xi |xi ∈ Tj ) (30) offprintThe f-divergence (34) is a decomposable divergence. under the condition that x is in subset Tj.Ifthetwo Csisz´ar [18] found that any f-divergence satisfies distributions are equal, we cannot obtain further infor- information monotonicity. Moreover, the class of f- mation to distinguish p from q by observing x inside Tj. divergences is unique in the sense that any decompos- Hence, able divergence satisfying information monotonicity is D¯ [p¯ : q¯]=D [p : q] (31) an f-divergence. holds, when and only when Theorem 1 Any f-divergence satisfies the informa- tion monotonicity. Conversely, any decomposable infor- p (xi |Tj )=q (xi |Tj ) (32) mation monotonic divergence is written in the form of f-divergence. for all Tj and all xi ∈ Tj,or A proof is found, e.g., in Ref. [21]. pi The Riemannian metric and affine connections derived = λj (33) qi from an f-divergence has a common invariant structure [1]. They are given by the Fisher information metric and ∈ for all xi Tj for some constant λj. ±α-connections, which are shown in a later section. A divergence satisfying the above requirements is An extensive list of f-divergences is given in Cichocki called an invariant divergence, and such a property is et al. [22]. Some of them are listed below. termed as information monotonicity. 1) The Kullback-Leibler (KL-) divergence: f(u)= u log u − (u − 1): 2.3.2 offprintf-divergence qi DKL[p : q]= pi log . (41) The f-divergence was introduced by Csisz´ar [2] and also pi


√ 2 2) Squared Hellinger distance: f(u)=( u − 1) : two positive measures z and y,anf-divergence is given by √ √ 2 D [p : q]= ( pi − qi) . (42) yi Hel Df [z : y]= zif , (45) zi 3) The α-divergence: where f is a standard convex function. It should be noted that an f-divergence is no more invariant under 4 1+α 2 fα(u)= 1 − u 2 − (u − 1),(43) the transformation from f(u)to 1 − α2 1 − α 4 1+α 1−α Dα[p : q]= 1 − p 2 q 2 . (44) fc(u)=f(u) − c(u − 1). (46) 1 − α2 i i The α-divergence was introduced by Havrda and Hence, it is absolutely necessary to use a standard f in Charv´at [23], and has been studied extensively by thecaseofMn, because the conditions of divergence are Amari and Nagaoka [1]. Its applications were violated otherwise. described earlier in Chernoff [24], and later in Among all f-divergences, the α divergence [1,21,22] is given by Matsuyama [25], Amari [26], etc. to mention a few. yi It is the squared Hellinger distance for α =0.The Dα[z : y]= zifα , (47) zi KL-divergence and its reciprocal are obtained in the limit of α →±1. See Refs. [21,22,26] for the α- where ⎧ structure. ⎪ 4 1+α 2 ⎪ 1 − u 2 − (u − 1),α= ±1, ⎨ 1 − α2 1 − α fα(u)= 2.3.3 f-divergence of positive measures ⎪ u log u − (u − 1),α=1, ⎩⎪ − log u +(u − 1),α= −1, We now extend the f-divergence to the space of positive (48) measures Mn over X = {x1,...,xn},whosepointsare and plays a special role in Mn. Here, we use a simple z (1+α)/2 given by the coordinates =(z1,...,zn), zi > 0. Here, power function u , changing it by adding a lin- zi is the mass (measure) of xi where the total mass zi ear and a constant term to become the standard convex is positive and arbitrary. In many applications, z is function fα(u). It includes the logarithm as a limiting a non-negative array, and we can extend it to a non- case. negative double array z =(zij ), etc., that is, matrices The α-divergence is explicitly given in the following and tensors. We first derive an f-divergence in Mn:For form [1,21,22]:

    Dα[z : y] =
      (4/(1 − α²)) Σi { (1 − α)/2 · zi + (1 + α)/2 · yi − zi^{(1−α)/2} yi^{(1+α)/2} },   α ≠ ±1,
      Σi { zi − yi + yi log(yi/zi) },   α = 1,
      Σi { yi − zi + zi log(zi/yi) },   α = −1.   (49)
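As an illustration of (45)–(49), the following fragment (ours, not from the paper; the arrays z and y are arbitrary positive test values) evaluates the decomposable f-divergence (45) with the standard α-generator, i.e., the representative with f(1) = 0, f′(1) = 0, f″(1) = 1 required for positive measures, and shows that it varies continuously in α, approaching the logarithmic cases of (49) at α → ±1:

import numpy as np

def f_divergence(z, y, f):
    # decomposable f-divergence D_f[z : y] = sum_i z_i f(y_i / z_i), cf. (34), (45)
    z, y = np.asarray(z, float), np.asarray(y, float)
    return np.sum(z * f(y / z))

def f_alpha(alpha):
    # standard convex generator of the alpha-divergence
    if np.isclose(alpha, 1.0):
        return lambda u: u * np.log(u) - (u - 1.0)
    if np.isclose(alpha, -1.0):
        return lambda u: -np.log(u) + (u - 1.0)
    return lambda u: (4.0 / (1.0 - alpha ** 2) * (1.0 - u ** ((1.0 + alpha) / 2))
                      + 2.0 / (1.0 - alpha) * (u - 1.0))

z = np.array([0.5, 1.2, 0.8])   # positive measures, not necessarily normalized
y = np.array([0.9, 0.4, 1.1])

for a in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(a, f_divergence(z, y, f_alpha(a)))
print(f_divergence(z, y, f_alpha(0.999)))   # close to the alpha = 1 value, cf. (49)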

2.4 Flat divergence: Bregman divergence

A divergence D[z : w] gives a Riemannian metric gij by (13), and a pair of affine connections by (16), (17). When (20) holds, the manifold is flat. We study divergence functions which give a flat structure in this subsection. In this case, the coordinate system z is flat, that is affine, although the metric gij(z) is not Euclidean. When z is affine, all the coordinate curves are regarded as straight lines, that is, geodesics. Here, we separate two concepts, flatness and metric. Mathematically speaking, we use an affine connection Γijk to define flatness, which is not necessarily derived from the metric gij. Note that the Levi-Civita affine connection is derived from the metric in Riemannian geometry, but our approach is more general.

2.4.1 Convex function

We show in this subsection that a dually flat geometry is derived from a divergence due to a convex function in terms of the Bregman divergence [3]. Conversely, a dually flat manifold always has a convex function to define its geometry. The divergence derived from it is called the canonical divergence of a dually flat manifold [1].

Let ψ(z) be a strictly convex differentiable function. Then, the linear hyperplane tangent to the graph

    y = ψ(z)   (50)


∗ ∗ at z0 is function ψ (z ). We can obtain a dual potential func- tion by y = ∇ψ (z0) · (z − z0)+ψ (z0) , (51) ψ∗ (z∗)=max{z · z∗ − ψ(z)} , (58) z and is always below the graph y = ψ(z) (Fig. 4). Here, which is convex in z∗. The two coordinate systems are ∇ψ is the gradient, dual, since the pair z and z∗ satisfies the following rela- tion: ∂ ∂ ∗ ∗ ∗ ∇ψ = ψ(z),..., ψ(z) . (52) ψ(z)+ψ (z ) − z · z =0. (59) ∂z1 ∂zn The inverse transformation is given by The difference of ψ(z) and the tangent hyperplane is z = ∇ψ∗ (z∗) . (60) z z z − z −∇ z · z − z  ∗ ∗ Dψ [ : 0]=ψ( ) ψ ( 0) ψ ( 0) ( 0) 0. We can define the Bregman divergence Dψ∗ [z : w ] (53) by using ψ∗ (z∗). However, it is easy to prove See Fig. 4. This is called the Bregman divergence in- ∗ ∗ duced by ψ(z). Dψ∗ [z : w ]=Dψ [w : z] . (61) Hence, they are essentially the same, that is the same when w and z are interchanged. It is enough to use only one common divergence. By simple calculations, we see that the Bregman di- vergence can be rewritten in the dual form as D[z : w]=ψ(z)+ψ∗ (w∗) − z · w∗. (62) We have thus two coordinate systems z and z∗ of S. They define two flat structures such that a curve with parameter t, Fig. 4 Bregman divergence due to convex function z(t)=ta + b, (63) is a ψ-geodesic, and This satisfies the conditions of a divergence. The in- ∗ duced Riemannian metric is z∗(t)=tc∗ + d (64) 2 ∗ ∂ is a ψ∗-geodesic, where a, b, c∗ and d are constants. gij (z)= ψ(z), (54) ∂zi∂zj A Riemannian metric is defined by two metric tensors ∗ ∗ G =(gij )andG = g , and the coefficients of the affine connection vanish luck- ij ily, offprint ∂2 z gij = ψ(z), (65) Γijk( )=0. (55) ∂zi∂zj 2 Therefore, z is an affine coordinate system, and a ∗ ∂ ∗ ∗ gij = ∗ ∗ ψ (z ) , (66) geodesic is always written as the linear form ∂zi ∂zj z a b in the two coordinate systems, and they are mutual in- (t)= t + (56) verses, G∗ = G−1 [1,27]. Because the squared local dis- with parameter t and constant vectors a and b (see Ref. tance of two nearby points z and z +dz is given by [27]). 1 D [z : z +dz]= gij (z)dzidzj, (67) 2 2.4.2 Dual structure 1 ∗ ∗ ∗ ∗ = gij (z )dzi dzj , (68) 2 In order to give a dual description related to a convex they give the same Riemannian distance. However, the function ψ(z), let us define its gradient, two geodesics are different, which are also different from z∗ = ∇ψ(z). (57) the Riemannian geodesic derived from the Riemannian metric. Hence, the minimality of the curve length does This is the Legendre transformation, and the correspon- not hold for ψ-andψ∗-geodesics. dence between z and z∗ is one-to-one. Hence, z∗ is Two geodesic curves (63) and (64) intersect at t =0, ∗ regardedoffprint as another (nonlinear) coordinate system of S. when b = d . In such a case, they are orthogonal if In order to describe the geometry of S in terms of the their tangent vectors are orthogonal in the sense of the dual coordinate system z∗, we search for a dual convex Riemannian metric. The orthogonality condition can be

THE AUTHORS WARRANT THAT THEY WILL NOT POST THE E-OFFPRINT OF THE PAPER ON PUBLIC WEBSITES. Shun-ichi AMARI. Information geometry in optimization, machine learning and statistical inference 247 represented in a simple form in terms of the two dual Projection Theorem. Given p ∈ S, the minimizer ∗ ∗ coordinates as r of Dψ(p, r), r ∈ M,istheψ -projection of p to ∗∗ M, and the minimizer r of Dψ(r, p), r ∈ M,isthe ∗ ∗ a, b = aibi =0. (69) ψ-projection of p to M. The theorem is useful in various optimization prob- 2.4.3 Pythagorean theorem and projection theorem lems searching for the closest point r ∈ M to a given p. A generalized Pythagorean theorem and projection the- orem hold for a dually flat space [1]. They are highlights 2.4.4 Decomposable convex function and of a dually flat manifold. generalization Generalized Pythagorean Theorem. Let r, s, q be three points in S such that the ψ∗-geodesic connect- Let U(z) be a convex function of a scalar z. Then, we have a simple convex function of z, ing r and s is orthogonal to the ψ-geodesic connecting s and q (Fig. 5). Then, ψ(z)= U (zi) . (72)

Dψ[r : s]+Dψ[s : q]=Dψ[r : q]. (70) The dual of U is given by U ∗ (z∗)=max{zz∗ − U(z)} , (73) z and the dual coordinates are ∗  zi = U (zi) . (74) The dual convex function is ∗ ∗ ∗ ∗ ψ (z )= U (zi ) . (75) The ψ-divergence due to U between z and w is decom- posable and given by the sum of components [28–30], ∗ ∗ ∗ Dψ[z : w]= {U (zi)+U (wi ) − ziwi } . (76) We have discussed the flat structure given by a convex function due to U(z). We now consider a more general Fig. 5 Pythagorean theorem case by using a nonlinear rescaling. Let us define by When S is a Euclidean space with r = k(z) a nonlinear scale r in the z-axis, where k is a monotone function. If we have a convex function U(r)of 1 2 the induced variable r, there emerges a new dually flat offprintD[r : s]= (ri − si) , (71) 2 structure with a new convex function U(r). Given two functions k(z)andU(r), we can introduce a new dually (70) is the Pythagorean theorem itself. flat Riemannian structure in S, where the flat coordi- Now, we state the projection theorem. Let M be a nates are not z but r = k(z). There are infinitely many submanifold of S (Fig. 6). Given p ∈ S,apointq ∈ M such structures depending on the choice of k and U.We is said to be the ψ-projection (ψ∗-projection) of p to M show two typical examples, which give the α-divergence when the ψ-geodesic (ψ∗-geodesic) connecting p and q is and β-divergence [22] in later sections. Both of them orthogonal to M with respect to the Riemannian metric use a power function of the type uq including the log gij . and exponential function as the limiting cases.
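To see how (53), (69) and (70) fit together in a computation, here is an illustrative sketch of ours (the choice of ψ, the points and all names are arbitrary assumptions). Take ψ(z) = Σ (zi log zi − zi) on positive arrays, for which the Bregman divergence (53) is the Kullback-Leibler divergence and the dual coordinates are z* = log z. For any Bregman divergence, Dψ[r : s] + Dψ[s : q] − Dψ[r : q] = (r − s)·(q* − s*), so the Pythagorean relation (70) holds exactly when a coordinate difference and a dual-coordinate difference are orthogonal in the sense of (69):

import numpy as np

def psi(z):                    # a strictly convex function on positive arrays
    return np.sum(z * np.log(z) - z)

def grad_psi(z):               # dual coordinates z* = grad psi(z), cf. (57)
    return np.log(z)

def bregman(a, b):             # D_psi[a : b] = psi(a) - psi(b) - grad psi(b).(a - b), cf. (53)
    return psi(a) - psi(b) - grad_psi(b) @ (a - b)

s = np.array([0.5, 1.0, 1.5])
q = np.array([1.2, 0.7, 0.9])

# choose r so that the coordinate difference r - s is orthogonal, in the sense of (69),
# to the dual-coordinate difference q* - s*
d = grad_psi(q) - grad_psi(s)
v = np.array([d[1], -d[0], 0.0])      # v . d = 0
r = s + 0.3 * v

lhs = bregman(r, s) + bregman(s, q)
rhs = bregman(r, q)
print(lhs, rhs, lhs - rhs)            # the difference vanishes up to rounding, cf. (70)

Moving r off the orthogonal direction makes the discrepancy reappear, with magnitude given by the pairing above.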

2.5 Invariant flat structure of S

This section focuses on the manifolds which are equipped with both invariant and flat geometrical structure.

2.5.1 Exponential family of probability distributions

Fig. 6   Projection of p to M

The following type of probability distributions of random variable y, parameterized by θ,

    p(y, θ) = exp{ Σi θi si(y) − ψ(θ) },   (77)

THE AUTHORS WARRANT THAT THEY WILL NOT POST THE E-OFFPRINT OF THE PAPER ON PUBLIC WEBSITES. 248 Front. Electr. Electron. Eng. China 2010, 5(3): 241–260 under a certain measure μ(y)onthespaceY = {y} is From θi called an exponential family. We denote it now by S. p0 =1− pi =1− p0 e , (92) We introduce random variables x =(xi), we have ψ(θ)=log 1+ eθi . (93) xi = si(y), (78) and rewrite (77) in the standard form 2.5.2 Geometry of exponential family p(x, θ)=exp θixi − ψ(θ) , (79) The Fisher information matrix is defined in a manifold of probability distributions by where exp {−ψ(θ)} is the normalization factor satisfying ∂ ∂ gij (θ)=E log p(x, θ) p(x, θ) , (94) ∂θi ∂θj ψ(θ)=log exp θixi dμ(x). (80) where E is expectation with respect to p(x, θ). When S This is called the free energy in the physics community is an exponential family (79), it is calculated as and is a convex function. ∂2 We first give a few examples of exponential families. gij (θ)= ψ(θ). (95) ∂θi∂θj 1. Gaussian distributions SG A Gaussian distribution is given by This is a positive-definite matrix and ψ(θ) is a convex function. Hence, this is the Riemannian metric induced 1 (y − μ)2 p (y; μ, σ)=√ exp − . (81) from a convex function ψ. 2πσ 2σ2 We can also prove that the invariant Riemannian met- ric derived from an f-divergence is exactly the same as Hence, all Gaussian distributions form a two- the Fisher metric for any standard convex function f.It dimensional manifold S ,where(μ, σ)playstheroleof G is an invariant metric. a coordinate system. Let us introduce new variables A geometry is induced to S = {p(x, θ)} through the θ θ x1 = y, (82) convex function ψ( ). Here, is an affine coordinate 2 system, which we call e-affine or e-flat coordinates, “e” x2 = y , (83) μ representing the exponential structure of (79). By the θ1 = 2 , (84) Legendre transformation of ψ, we have the dual coordi- 2σ ∗ nates (we denote them by η instead of θ ), − 1 θ2 = 2 . (85) 2σ η = ∇ψ(θ). (96) Then, (81) can be rewritten in the standard form offprintBy calculating the expectation of xi,wehave p(x, θ)=exp{θ1x1 + θ2x2 − ψ(θ)} , (86) ∂ E[xi]=ηi = ψ(θ). (97) ∂θi where η μ2 √ By this reason, the dual parameters =(ηi) are called ψ(θ)= − log 2πσ . (87) 2σ2 the expectation parameters of an exponential family. We call η the m-affine or m-flat coordinates, where “m”im- Here, θ =(θ1,θ2)isanewcoordinatesystemofS , G plying the mixture structure [1]. called the natural or canonical coordinate system. The dual convex function is given by the negative of 2. Discrete distributions Sn entropy Let y be a random variable taking values on { } ∗ 0, 1,...,n . By defining ψ (η)= p (x, θ(η)) log p (x, θ(η)) dx, (98)

xi = δi(y),i=1, 2,...,n, (88) where p(x, θ(η)) is considered as a function of η. Then, and the Bregman divergence is pi ∗ θi =log , (89) D [θ1 : θ2]=ψ (θ1)+ψ (η2) − θ1 · η2, (99) p0 (4) is rewritten in the standard form where ηi is the dual coordinates of θi, i =1, 2. We show that manifolds of exponential families have p(x, θ)=exp θixi − ψ(θ) , (90) both invariant and flat geometrical structure. However, offprintthey are not unique and it is possible to introduce both where invariant flat structure to a mixture family of probability ψ(θ)=− log p0. (91) distributions.
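The discrete family (90)–(93) makes the dual structure of this subsection easy to verify numerically. The sketch below is ours (the parameter values are arbitrary); it computes ψ(θ) of (93), checks that its gradient returns the expectation parameters (96)–(97), and evaluates the Bregman divergence built from ψ, which coincides with a Kullback-Leibler divergence between the two member distributions (the argument order printed is the one produced by this convention; cf. (99)–(100)):

import numpy as np

def psi(theta):
    # free energy (93) of the discrete family: psi(theta) = log(1 + sum_i e^theta_i)
    return np.log1p(np.sum(np.exp(theta)))

def dist_from_theta(theta):
    # recover p = (p_0, ..., p_n) from the natural coordinates theta_i = log(p_i / p_0), cf. (89)
    w = np.concatenate(([1.0], np.exp(theta)))
    return w / w.sum()

def grad_psi(theta, eps=1e-6):
    # numerical gradient of psi: the expectation coordinates eta, cf. (96)-(97)
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (psi(theta + e) - psi(theta - e)) / (2 * eps)
    return g

def kl(a, b):
    return np.sum(a * np.log(a / b))

theta1 = np.array([0.3, -0.5, 1.0])
theta2 = np.array([-0.2, 0.4, 0.1])
p1, p2 = dist_from_theta(theta1), dist_from_theta(theta2)

print(grad_psi(theta1), p1[1:])    # eta = grad psi(theta) reproduces (p_1, ..., p_n)

bregman = psi(theta1) - psi(theta2) - grad_psi(theta2) @ (theta1 - theta2)
print(bregman, kl(p2, p1))         # the Bregman divergence of psi equals a KL divergence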


Theorem 2 The Bregman divergence is the is a convex function of r. Kullback-Leibler divergence given by The dual potential is simply given by p (x, θ1) ∗ ∗ ∗ KL [p (x, θ1):p (x, θ2)] = p (x, θ1)log dx. ψα (r )=ψ−α (r ) , (108) p (x, θ2) (100) and the dual affine coordinates are This shows that the KL-divergence is the canonical divergence derived from the dual flat structure of ψ(θ). ∗(α) (−α) r = r = k−α(z). (109) Moreover, it is proved that the KL-divergence is the unique divergence belonging to both the f-divergence We can then prove the following theorem. and Bregman divergence classes for manifolds of proba- Theorem 3 The α-divergence of M is the Bregman bility distributions. divergence derived from the α-representation kα and the For a manifold of positive measures, there are diver- linear function Uα of z. gences other than the KL-divergence, which are invari- Proof We have the Bregman divergence between ant and dually flat, that is, belonging to the classes f- r = k(z)ands = k(y) based on ψα as divergences and Bregman divergences at the same time. See the next subsection. ∗ ∗ Dα[r : s]=ψα(r)+ψ−α (s ) − r · s , (110)

2.5.3 Invariant and flat structure in manifold of where s∗ is the dual of s. By substituting (107), (108) positive measures and (109) in (110), we see that Dα[r(z):s(y)] is equal to the α-divergence defined in (49). z We consider a manifold M of positive arrays ,zi > 0, This proves that the α-divergence for any α belongs to where zi can be arbitrary. We define an invariant and the intersection of the classes of f-divergences and Breg- flat structure in M. Here, z is a coordinate system, and man divergences in M. Hence, it possesses information consider the f-divergence given by monotonicity and the induced geometry is dually flat at n the same time. However, the constraint zi =1isnot wi α −α Df [z : w]= zif . (101) linear in the flat coordinates r or r except for the zi i=1 case α = ±1. Hence, although M is flat for any α,the manifold P of probability distributions is not dually flat We now introduce a new coordinate system defined by except for α = ±1, and it is a curved submanifold in M. (α) ri = kα (zi) , (102) Hence, the α-divergences (α = ±1) do not belong to the class of Bregman divergences in P . This proves that the where the α-representation of zi is given by KL-divergence belongs to both classes of divergences in ⎧ ⎨ 2 1−α P and is unique. offprintu 2 − 1 ,α=1 , 1 − α The α-divergences are used in many applications, e.g., kα(u)=⎩ (103) log u, α =1. Refs. [21,25,26]. Theorem 4 The α-divergence is the unique class of r Then, the α-coordinates α of M are defined by divergences sitting at the intersection of the f-divergence (α) and Bregman divergence classes. ri = kα (zi) . (104) Proof The f-divergence is of the form (45) and the We use a convex function Uα(r)ofr defined by decomposable Bregman divergence is of the form 2 −1 ∗ ∗ Uα(r)= kα (r). (105) D[z : y]= U˜ (zi)+ U˜ (yi) − ri (zi) r (yi) , 1+α i (111) In terms of z, this is a linear function, where 2 U˜(z)=U(k(z)), (112) Uα {r(z)} = z, (106) 1+α and so on. When they are equal, we have f, r,andr∗ which is not a (strictly) convex function of z but is con- that satisfy vex with respect to r.Theα-potential function defined y zf = r(z)r∗(y) (113) by z 2 −1 except for additive terms depending only on z or y.Dif- ψα(r)= k (ri) offprint1+α α ferentiating the above by y,weget 2 − 1−α 2 1 α y  = 1+ ri (107) f  = r(z)r∗ (y). (114) 1+α 2 z


By putting x = y, y =1/z,weget By changing the above into the standard form, we arrive at the α-divergence. 1  f (xy)=r r∗ (x). (115) y 2.6 β-divergence Hence, by putting h(u)=logf (u), we have

h(xy)=s(x)+t(y), (116) It is useful to introduce a family of β-divergences, which  where s(x)=logr∗ (x), t(y)=r(1/y). By differenti- are not invariant but dually flat, having the structure of ating the above equation with respect to x and putting Bregman divergences. Thisisusedinthefieldofma- x =1,weget chine learning and statistics [28,29]. c Use z itself as a coordinate system of M;thatis,the h(y)= . (117) y representation function is k(z)=z.Theβ-divergence [30] is induced by the potential function This proves that f is of the form ⎧ ⎧ ⎨ 1 β+1 1+α (1 + βz) β ,β>0, ⎪ cu 2 ,α= ±1, β +1 ⎨ Uβ(z)=⎩ (119) exp z, β =0. f(u)=⎪ cu log u, α =1, (118) ⎩⎪ c log u, α = −1. The β-divergence is thus written as

    Dβ[z : y] =
      Σi { (yi^{β+1} − zi^{β+1})/(β + 1) − zi (yi^β − zi^β)/β },   β > 0,
      Σi { zi log(zi/yi) + yi − zi },   β = 0.   (120)
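A direct implementation of (120) (ours, with arbitrary positive test arrays) also shows the continuity of the family at β → 0:

import numpy as np

def beta_divergence(z, y, beta):
    # beta-divergence between positive arrays, cf. (120)
    z, y = np.asarray(z, float), np.asarray(y, float)
    if beta == 0:
        return np.sum(z * np.log(z / y) + y - z)
    return np.sum((y ** (beta + 1) - z ** (beta + 1)) / (beta + 1)
                  - z * (y ** beta - z ** beta) / beta)

z = np.array([0.5, 1.2, 0.8])
y = np.array([0.9, 0.4, 1.1])

print(beta_divergence(z, y, 1e-4), beta_divergence(z, y, 0))     # small beta approaches the KL case
print(beta_divergence(z, y, 0.5) >= 0, beta_divergence(z, z, 0.5))  # nonnegative, zero only at y = z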

It is the KL-divergence when β = 0, but it is different a statistical model to which the true distribution is as- from the α-divergence when β>0. sumed to belong [8]. The EM algorithm is popular in Minami and Eguchi [30] demonstrated that statisti- statistics, machine learning and others [7,31,32]. cal inference based on the β-divergence (β>0) is ro- bust. Such idea has been applied to machine learning in 3.1 Geometry of alternative minimization Murata et al. [29]. The β-divergence induces a dually z flat structure in M. Since its flat coordinates are ,the When a divergence function D[z : w] is given in a man- restriction ifold S, we consider two submanifolds M and E,where offprinti z ∈ w ∈ z = 1 (121) we assume M and E. The divergence of two submanifolds M and E is given by is a linear constraint. Hence, the manifold P of the z probability distributions is also dually flat, where , D[M : E]= min D[z : w]=D [z∗ : w∗] , (123) zi = 1, is its flat coordinates. The dual flat coor- z∈M,w∈E dinates are ∗ ∗ z ∈ w ∈ where M and E are a pair of the points 1 i β  that attain the minimum. When M and E intersect, ∗ (1 + βz ) ,β=0, ∗ ∗ zi = (122) D[M : E] = 0, and z = w lies at the intersection. exp zi,β=0, An alternative minimization is a procedure to calcu- z∗ w∗ depending on β. late ( , ) iteratively (see Fig. 7): 1) Take an initial point z0 ∈ M for t = 0. Repeat steps 2and3fort =0, 1, 2,... until convergence. 3 Information geometry of alternative 2) Calculate minimization wt =argminD zt : w . (124) w∈E Given a divergence function D[z : w]onaspaceS,we encounter the problem of minimizing D[z : w] under the 3) Calculate z ⊂ constraints that belongs to a submanifold M S and t+1 t+1 w ⊂ z =argminD z : w . (125) belongsoffprint to another submanifold E S.Atypical z∈M example is the EM algorithm [6], where M represents a manifold of partially observed data and E represents When the direct minimization with respect to one


Assume that vector random variable z is divided into two parts z =(h, x) and the values of x are observed but h are not. Non-observed random variables h =(hi) are called missing data or hidden variables. It is possible to eliminate unobservable h by summing up p(h, x)over all h, and we have a new statistical model, p˜(x, ξ)= p(h, x, ξ), (133) h

E˜ = {p˜(x; ξ)} . (134) Thisp ˜(x, ξ) is the marginal distribution of p(h, x; ξ). Fig. 7 Alternative minimization Based on observed x, we can estimate ξˆ, for example, by maximizing the likelihood, variable is computationally difficult, we may use a decre- mental procedure: ξˆ =argmaxp˜(x, ξ). (135) ξ t+1 t t t w = w − ηt∇wD z : w , (126) t+1 t t+1 t z ξ z = z − ηt∇zD z : w , (127) However, in many problems p( ; ) has a simple form whilep ˜(x; ξ) does not. The marginal distributionp ˜(x, ξ) ∇ ∇ where ηt is a learning constant and w and z are gra- has a complicated form and it is computationally diffi- w z dients with respect to and , respectively. cult to estimate ξ from x. It is much easier to calculate It should also be noted that the natural gradient (Rie- ξˆ when z is observed. The EM algorithm is a powerful mannian gradient of D) [33] is given by method used in such a case [6].

−1 ∂ The EM algorithm consists of iterative procedures to ∇˜ zD[z : w]=G (z) D[z : w], (128) ∂z use values of the hidden variables h guessed from ob- served x and the current estimator ξˆ.Letusassume where 2 ∂ that n iid data zt =(ht, xt), t =1,...,n, are generated G = D[z : w]|w=z. (129) ∂z∂z from a distribution p(z, ξ). If data ht are not missing, Hence, this becomes the Newton-type method. we have the following empirical distribution: 1 3.2 Alternative projections p¯(h, x h¯, x¯ )= δ (x − xt) δ (h − ht) , (136) n t

When D[z : w] is a Bregman divergence derived from a ¯ offprintwhere h and x¯ represent the sets of data (h1, h2,...) convex function ψ,thespaceS is equipped with a dually and (x1, x2,...)andδ is the delta function. When S w z w flat structure. Given , the minimization of D[ : ] is an exponential family,p ¯(h, x h¯, x¯ ) is the distribution z ∈ w with respect to M is given by the ψ-projection of given by the sufficient statistics composed of h¯, x¯ . to M, (ψ) The hidden variables are not observed in reality, and arg min D[z : w]= w. (130) the observed part x¯ cannot determine the empirical dis- z M tributionp ¯(x, y) (136). To overcome this difficulty, let This is unique when M is ψ∗-flat. On the other hand, us use a conditional probability q(h|x)ofh when x is given z, the minimization with respect to w is given by observed. We then have the ψ∗-projection, (ψ∗) p(h, x)=q(h|x)p(x). (137) arg min D[z : w]= z. (131) w M In the present case, the observed part x¯ determines the This is unique when M is ψ-flat. empirical distributionp ¯(x), 3.3 EM algorithm 1 p¯(x)= δ (x − xt) , (138) n t Let us consider a statistical model p(z; ξ) with random h|x variables z, parameterized by ξ. The set of all such dis- but the conditional part q( ) remains unknown. Tak- tributions E = {p(z; ξ)} forms a submanifold included ing this into account, we define a submanifold based on offprintthe observed data x¯, in the entire manifold of probability distributions S = {p(z)} . (132) M (x¯)={p(h, x) |p(h, x)=q(h|x)¯p(x)} , (139)


where q(h|x) is arbitrary. 1) Begin with an arbitrary initial guess ξ0 at t =0,and In the observable case, we have an empirical distribu- repeat the E-step and M-step for t =0, 1, 2,...,until tionp ¯(h, x) ∈ S from the observation, which does not convergence. usually belong to our model E. The maximum likeli- 2) E-step: Use the e-projection of p (h, x; ξt)toM (x¯) hood estimator ξˆ is the point in E that minimizes the to obtain KL-divergence fromp ¯ to E, qt(h|x)=p (h |x; ξt ) , (146) ξˆ h x x ξ and calculate the conditional expectation =argminKL[¯p( , ):p( , )] . (140) Lt(ξ)= p (h |xi; ξ )logp (h, xi; ξ) . (147) In the present case, we do not knowp ¯ but know M (x¯) t h,xi which includes the unknownp ¯(h, x). Hence, we search for ξˆ ∈ E andq ˆ(h|x) ∈ M (x¯) that minimizes 3) M-step: Calculate the maximizer of Lt(ξ), that is the m-projection of pt(h, x)toE, KL[M : E]. (141) ξt+1 =argmaxLt(ξ). (148) The pair of minimizers gives the points which define the divergence of two submanifolds M (x¯)andE.Themax- imum likelihood estimator is a minimizer of (141). Given partial observations x¯, the EM algorithm con- 4 Information geometry of Ying-Yang sists of the following interactive procedures, t =0, 1,.... machine We begin with an arbitrary initial guess ξ0 and q0(h|x), and construct a candidate pt (h, x)=qt(h|x)¯p (x) ∈ 4.1 Recognition model and generating model M = M (x¯)fort =0, 1, 2,.... Wesearchforthenext guess ξt+1 that minimizes KL [pt (h, x):p (h, x, ξ)]. We study the Ying-Yang mechanism proposed by Xu [9– This is the same as maximizing the pseudo-log-likelihood 14] from the information geometry point of view. Let us consider a hierarchical stochastic system, which consists Lt(ξ)= pt (h, xi)logp (h, xi; ξ) . (142) of two vector random variables x and y.Letx be a lower h,xi level random variable representing a primitive descrip- y This is called the M-step, that is the m-projection of tion of the world, while let be a higher level random variable representing a more conceptual or elaborated pt (h, x) ∈ M to E. Let the maximum value be ξ , t+1 x y which is the estimator at t +1. description. Obviously, and are closely correlated. One may consider them as information representations We then search for the next candidate pt (h, x) ∈ M in the brain. Sensory information activates neurons in that minimizes KL p (h, x):p h, x, ξt+1 . The min- a primitive sensory area of the brain, and its firing pat- imizer is denoted by pt+1 (h, x). This distribution is x x used tooffprint calculate the expectation of the new log likeli- tern is represented by .Componentsxi of can be binary variables taking values 0 and 1, or analog val- hood function Lt+1(ξ). Hence, this is called the E-step. The following theorem [8] plays a fundamental role in ues representing firing rates of neurons. The conceptual information activates a higher-order area of the brain, calculating the expectation with respect to pt(h, x). y Theorem 5 Along the e-geodesic of projecting with a firing pattern . h x ξ x A probability distribution p(x)ofx reflects the in- p ( , ; t)toM (¯), the conditional probability dis- formation structure of the outer world. When x is ob- tribution qt(h|x) is kept invariant. Hence, by the e- served, the brain processes this primitive information projection of p (h, x; ξt)toM, the resultant conditional distribution is given by and activates a higher-order area to recognize its higher- order representation y. Its mechanism is represented y|x qt (h |xi )=p (h |xi; ξt ) . 
(143) by a conditional probability distribution p( ). This is possibly specified by a parameter ξ,intheformof This shows that the e-projection is given by p(y|x; ξ). Then, the joint distribution is given by

qt(h, x)=p (h |x; ξt ) δ (x − x¯) . (144) p(y, x; ξ)=p(y|x; ξ)p(x). (149)
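The E-step/M-step alternation of this section can be made concrete with a toy model. The following sketch is ours and is not taken from the paper; the mixture model, the data and the initial values are arbitrary illustrative choices. The hidden variable h is the component label of a two-component Gaussian mixture; the E-step computes the conditional distribution of h as in (146), and the M-step maximizes the expected complete-data log likelihood (147), which has a closed form here:

import numpy as np

rng = np.random.default_rng(0)

# synthetic data from a two-component Gaussian mixture (unit variance); h is the hidden label
true_w, true_mu = 0.3, np.array([-2.0, 2.0])
h_true = rng.random(500) < true_w
x = np.where(h_true, true_mu[0], true_mu[1]) + rng.normal(size=500)

def gauss(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

# initial guess xi_0 = (w, mu_0, mu_1)
w, mu = 0.5, np.array([-1.0, 1.0])

for t in range(100):
    # E-step: responsibilities q_t(h | x_i) = p(h | x_i; xi_t), cf. (146)
    joint = np.stack([w * gauss(x, mu[0]), (1 - w) * gauss(x, mu[1])])
    q = joint / joint.sum(axis=0)
    # M-step: maximize L_t(xi) of (147); closed form for this model, cf. (148)
    w = q[0].mean()
    mu = (q @ x) / q.sum(axis=1)

print(w, mu)   # approaches the generating parameters, up to sampling noise

Each pass is one alternation between the data submanifold M(x̄) and the model E in the sense of Sect. 3.1.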

Hence, the theorem implies that the next expected log The parameters ξ would be synaptic weights of neurons likelihood is calculated by to generate pattern y. When we receive information x, it is transformed to higher-order conceptual information ξ h x ξ h x ξ Lt+1( )= p i; t+1 log p ( , i, ) , (145) y. This is the bottom-up process and we call this a offprinth,xi recognition model. which is to be maximized. Our brain can work in the reverse way. From concep- y x The EM algorithm is formally described as follows. tual information pattern , a primitive pattern will

THE AUTHORS WARRANT THAT THEY WILL NOT POST THE E-OFFPRINT OF THE PAPER ON PUBLIC WEBSITES. Shun-ichi AMARI. Information geometry in optimization, machine learning and statistical inference 253 be generated by the top-down neural mechanism. It is These relations hold for any dually flat divergence, gen- represented by a conditional probability q(x|y), which erated by a convex function ψ over S. will be parameterized by ζ,asq(x|y; ζ). Here, ζ will An iterative algorithm to realize the Ying-Yang har- represent the top-down neural mechanism. When the mony is, for t =0, 1, 2,..., probability distribution of y is q(y; ζ), the entire process (m) is represented by another joint probability distribution pt+1 = qt, (158) MR q(y, x; ζ)=q(y; ζ)q(x|y; ζ). (150) (e) qt+1 = pt+1. (159) MG We have thus two stochastic mechanisms p(y, x; ξ) and q(y, x; ζ). Let us denoted by S the set of all the This is a typical alternative minimization algorithm. joint probability distributions of x and y. The recogni- tion model 4.3 Various models of Ying-Yang structure MR = {p(y, x; ξ)} (151) forms its submanifold, and the generative model A Yang machine receives signal x from the outside, and processes it by generating a higher-order signal y,which { y x ζ } MG = q( , ; ) (152) is stochastically generated by using the conditional prob- y|x ξ forms another submanifold. ability p( , ). The performance of the machine is ξ This type of information mechanism is proposed by Xu improved by learning, where the parameter is mod- [9,14], and named the Ying-Yang machine. The Yang ified step-by-step. We give two typical examples of machine (male machine) is responsible for the recog- information processing; pattern classification and self- nition model {p(y, x; ξ)} and the Ying machine (fe- organization. male machine) is responsible for the generative model Learning of a pattern classifier is a supervised learning {q(y, x; ζ)}. These two models are different, but should process, where teacher signal z is given. be put in harmony (Ying-Yang harmony). 1. pattern classifier Let C = {C1,...,Cm} denotes the set of categories, x x 4.2 Ying-Yang harmony and each belongs to one of them. Given an ,ama- chineprocessesitandgivesanansweryκ,whereyκ, κ =1,...,m, correspond to one of the categories. In When a divergence function D [p(x, y):q(x, y)] is de- other words, y is a pattern representing Cκ.Theout- fined in S, we have the divergence between the two mod- κ put y is generated from x: els MR and MG, y f x ξ n D [MR : MG]=minD [p(y, x; ξ):q(y, x; ζ)] . (153) = ( , )+ , (160) ξ,ζ offprintn When MR and MG intersect, where is a zero-mean Gaussian noise with covariance matrix σ2I, I being the identity matrix. Then, the con- D [MR : MG] = 0 (154) ditional distribution is given by at the intersecting points. However, they do not inter- 1 2 p(y|x; ξ)=c exp − |y − f(x, ξ)| , (161) sect in general, 2σ2 MR ∩ MG = φ. (155) It is desirable that the recognition and generating mod- where c is the normalization constant. f x ξ els are closely related. The minimizer of D gives such a The function ( , ) is specified by a number of pa- ξ harmonized pair called the Ying-Yang harmony [14]. rameters . In the case of multilayer perceptron, it is A typical divergence is the Kullback-Leibler diver- given by gence KL[p : q], which has merits of being invariant and generating dually flat geometrical structure. In this yi = vij ϕ wjkxk − hj . (162) case, we can define the two projections, e-projection and j k m-projection. 
Let p∗(y, x)andq∗(y, x) be the pair of minimizers of D[p : q]. Then, the m-projection of q∗ to Here, wjk is the synaptic weight from input xk to the ∗ MR is p , jth hidden unit, hj is its threshold, and ϕ is a sigmoidal (m) ∗ ∗ function. The output of the jth hidden unit is hence q = p , (156) MR ϕ ( wjkxk − hj) and is connected linearly to the ith ∗ ∗ and theoffprinte-projection of p to MG is q , output unit with synaptic weight vij . (e) In the case of supervised learning, a teacher signal yκ ∗ ∗ p = q . (157) representing the category Cκ is given, when the input MG

THE AUTHORS WARRANT THAT THEY WILL NOT POST THE E-OFFPRINT OF THE PAPER ON PUBLIC WEBSITES. 254 Front. Electr. Electron. Eng. China 2010, 5(3): 241–260 pattern is x ∈ Cκ. Since the current output of the Yang to specify the distribution of x, and we have a parame- machine is f(x, ξ), the error of the machine is terized statistical model

1 2 { x|y } l (ξ, x, y )= |y − f(x, ξ)| . (163) MYing = p( ) , (167) κ 2 κ generated by the Ying mechanism. Here, the parameter Therefore, the stochastic gradient learning rule modifies y is a random variable, and hence the Bayesian frame- the current ξ to ξ +Δξ by work is introduced. The prior probability distribution of y ξ − ∂l is Δ = ε ξ , (164) ∂ p(y)= p(x, y)dx. (168) where ε is the learning constant and ∇l = The Bayesian framework to estimate y from observed ∂l ∂l ,..., is the gradient of l with respect to ξ. x, or more rigorously, a number of independent obser- ∂ξ1 ∂ξm vations, x1,...,xn, is as follows. Let x¯ =(x1,...,xn) Since the space of ξ is Riemannian, the natural gradient be observed data and learning rule [33] is given by n −1 Δξ = −εG ∇l, (165) p (x¯|y)= p (xi|y) . (169) i=1 where G is the Fisher information matrix derived from The prior distribution of y is p(y), but after observing the conditional distribution. x, the posterior distribution is 2. Non-supervised learning and clustering When signals x in the outside world are divided into p(x¯, y) p(y|x¯)= . (170) a number of clusters, it is expected that Yang machine p(x¯, y)dy reproduces such a clustering structure. An example of clustered signals is the Gaussian mixture This is the Yang mechanism, and the m y y|x x − 1 |x − y |2 ˆ =argmaxp( ¯) (171) p( )= wκ exp 2 κ , (166) y κ=1 2σ is the Bayesian posterior estimator. where yκ is the center of cluster Cκ.Givenx,onecan A full exponential family p(x, y)=exp{ yixi+ see which cluster it belongs to by calculating the output k(x)+w(y) − ψ(x, y)} is an interesting special case. y f x ξ n y = ( , )+ . Here, is an inner representation of Here, y is the natural parameter for the exponential fam- ξ the clusters and its calculation is controlled by .There ily of distributions, are a number of clustering algorithms. See, e.g., Ref. [9]. offprint We can use them to modify ξ. Since no teacher signals q(x|y)=exp yixi + k(x) − ψ(y) . (172) exist, this is unsupervised learning. Another typical non-supervised learning is imple- The family of distributions MYing = {q(x|y)} specified mented by the self-organization model given by Amari by parameters y is again an exponential family. It has and Takeuchi [34]. a dually flat Riemannian structure. The role of the Ying machine is to generate a pat- Dually to the above, the family of the posterior dis- x y tern ˜ from a conceptual information pattern .Since tributions form a manifold M = {p(y|x)},consisting x Yang the mechanism of generating ˜ is governed by the con- of x|y ζ ditional probability q( ; ), its parameter is modified y|x ξ ∗ to fit well the corresponding Yang machine p( , ). p(y|x)=exp xiyi + w(y) − ψ (y) . (173) As ξ changes by learning, ζ changes correspondingly. The Ying machine can be used to improve the primi- It is again an exponential family, where x (x¯ = xi/n) tive representation x by filling missing information and is the natural parameter. It defines a dually flat Rie- removing noise by using x˜. mannian structure, too. When the probability distribution p(y) of the Yang 4.4 Bayesian Ying-Yang machine machine includes a hyper parameter p(y)=p(y, ζ), we have a family q(x, y; ζ). It is possible that the Yang Let us consider joint probability distributions p(x, y)= machine have a different parametric family p(x, y; ξ). q(x, y)offprint which are equal. 
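The difference between the ordinary gradient rule (164) and the natural gradient rule (165) is easiest to see on a model whose Fisher information matrix is known in closed form. The following sketch is ours; the Gaussian model, the coordinates (μ, log σ) and all numerical choices are illustrative assumptions, not material from the paper:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=2.0, size=1000)   # data from N(1.5, 2^2)

def grad_nll(mu, s):
    # gradient of the average negative log-likelihood of N(mu, e^{2s})
    var = np.exp(2 * s)
    g_mu = -(x - mu).mean() / var
    g_s = 1.0 - ((x - mu) ** 2).mean() / var
    return np.array([g_mu, g_s])

def fisher(mu, s):
    # Fisher information matrix of the Gaussian in the coordinates (mu, s = log sigma)
    return np.diag([np.exp(-2 * s), 2.0])

mu, s = 0.0, 0.0
eps = 0.1
for t in range(100):
    xi = np.array([mu, s]) - eps * np.linalg.inv(fisher(mu, s)) @ grad_nll(mu, s)   # cf. (165)
    mu, s = xi
print(mu, np.exp(s))   # close to the sample mean and standard deviation

Replacing the matrix G by the identity gives the ordinary rule (164), which is much more sensitive to the scaling of the coordinates.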
Here, the Ying machine and the Then, we need to discuss the Ying mechanism and the Yang machine are identical, and hence the machines are Yang mechanism separately, keeping the Ying-Yang har- in perfect matching. Now consider y as the parameters mony.


This opens a new perspective to the Bayesian frame- This depends only on the marginal distribution of xi, work to be studied in future from the Ying-Yang stand-  qi (xi)= q (x1,...,xn) , (176) point. Information geometry gives a useful tool for ana-  lyzing it. where summation is taken over all x1,...,xn except for xi. The expectation is denoted by − − 5 Geometry of belief propagation ηi =E[xi]=Prob[xi =1] Prob [xi = 1] = qi(1) − qi(−1). (177)

A direct application of information geometry to machine When ηi > 0, Prob [xi =1] > Prob [xi = −1], so that, learning is studied here. When a number of correlated x˜i =1,otherwise˜xi = −1. Therefore, random variables z1,...,zN exist, we need to treat their x˜i =sgn(ηi) (178) joint probability distribution q (z1,...,zN ). Here, we as- sume for simplicity’s sake that zi is binary, taking values is the estimator that minimizes the number of errors in − the components of x˜,where 1and 1. Among all variables zi, some are observed and the others are not. Therefore, zi are divided into 1,u 0, two parts, {x1,...,xn} and {yn+1,...,yN },wherethe sgn(u)= (179) −1,u<0. values of y =(yn+1,...,yN ) are observed. Our prob- lem is to estimate the values of unobserved variables The marginal distribution qi (xi) is given in terms of x =(x1,...,xn) based on observed y. This is a typical ηi by 1+ηi problem of stochastic inference. We use the conditional Prob {xi =1} = . (180) probability distribution q(x|y) for estimating x. 2 Therefore, our problem reduces to calculation of ηi,the The graphical model or the Bayesian network is used expectation of xi. However, exact calculation is compu- to represent stochastic interactions among random vari- tationally heavy when n is large, because (176) requires ables [35]. In order to estimate x, the belief propagation n−1 summation over 2 x’s. Physicists use the mean field (BP) algorithm [15] uses a graphical model or a Bayesian approximation for this purpose [38]. The belief propa- network, and it provides a powerful algorithm in the gation is another powerful method of calculating it ap- field of artificial intelligence. However, its performance proximately by iterative procedures. is not clear when the underlying graph has loopy struc- ture. The present section studies this problem by using 5.2 Graphical structure and random Markov structure information geometry due to Ikeda, Tanaka and Amari [16,17]. The CCCP algorithm [36,37] is another power- x ful procedure for finding the BP solution. Its geometry Let us consider a general probability distribution p( ). is also studied. In the special case where there are no interactions among offprintx1,...,xn, they are independent, and the probability is written as the product of component distributions 5.1 Stochastic inference pi (xi). Since we can always write the probabilities, in the binary case, as We use a conditional probability q(x|y)toestimatethe h −h e i e i values of x.Sincey is observed and fixed, we hereafter pi(1) = ,pi(−1) = , (181) ψ ψ denote it simply by q(x), suppressing observed variables y, but it always means q(x|y), depending on y. where ψ = ehi + e−hi , (182) We have two estimators of x. One is the maximum likelihood estimator that maximizes q(x): we have pi (xi)=exp{hixi − ψ (hi)} . (183) x x ˆ mle =argmaxq( ). (174) Hence, x p(x)=exp hixi − ψ(h) , (184) x However, calculation of ˆ mle is computationally difficult where when n is large, because we need to search for the max- ψ(h)= ψ (hi) , (185) imum among 2n candidates x’s. Another estimator x˜ tries to minimize the expected number of errors of n ψi (hi)=log{exp (hi)+exp(−hi)} . (186) x component randomvariables˜x1,...,x˜n.Whenq( )is When x1,...,xn are not independent but their inter- known,offprint the expectation of xi is actions exist, we denote the interaction among k vari- · ··· · ables xi1 ,...,xik by a product term xi1 xik , E[xi]= xiq(x)= xiqi (xi) . (175) x · ··· · x c( )=xi1 xik . (187)
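For a small model the quantities (175)–(180) can be computed by brute force, which also makes clear why approximations are needed. This sketch is ours; the chain of pairwise interactions and all constants are arbitrary illustrative choices:

import numpy as np
from itertools import product

# a small model of the form (184)+(187) with binary x_i in {-1, +1} and pairwise cliques
n = 10
rng = np.random.default_rng(1)
h = rng.normal(scale=0.3, size=n)
pairs = [(i, i + 1) for i in range(n - 1)]      # a chain of pairwise interactions
s = {r: 0.5 for r in pairs}

def energy(x):
    return h @ x + sum(s[(i, j)] * x[i] * x[j] for (i, j) in pairs)

# exact marginals eta_i = E[x_i] by exhaustive summation over all 2^n states, cf. (175)-(177)
states = np.array(list(product([-1, 1], repeat=n)))
weights = np.exp([energy(x) for x in states])
Z = weights.sum()                               # the normalization factor exp(psi)
eta = (weights[:, None] * states).sum(axis=0) / Z
x_hat = np.sign(eta)                            # componentwise estimator (178)
print(eta)
print(x_hat)

The exhaustive sum runs over 2^n states, so it is feasible only for small n; the mean-field and belief propagation methods discussed below avoid it.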


5.2 Graphical structure and random Markov structure

Let us consider a general probability distribution p(x). In the special case where there are no interactions among x_1, ..., x_n, they are independent, and the probability is written as the product of the component distributions p_i(x_i). Since we can always write the probabilities, in the binary case, as

p_i(1) = \frac{e^{h_i}}{\psi}, \quad p_i(-1) = \frac{e^{-h_i}}{\psi},   (181)

where

\psi = e^{h_i} + e^{-h_i},   (182)

we have

p_i(x_i) = \exp\{h_i x_i - \psi_i(h_i)\}.   (183)

Hence,

p(x) = \exp\Big\{\sum_i h_i x_i - \psi(h)\Big\},   (184)

where

\psi(h) = \sum_i \psi_i(h_i),   (185)

\psi_i(h_i) = \log\{\exp(h_i) + \exp(-h_i)\}.   (186)

When x_1, ..., x_n are not independent but their interactions exist, we denote the interaction among k variables x_{i_1}, ..., x_{i_k} by a product term x_{i_1} \cdot x_{i_2} \cdots x_{i_k},

c(x) = x_{i_1} \cdot x_{i_2} \cdots x_{i_k}.   (187)

When there are many interaction terms, p(x) is written as

p(x) = \exp\Big\{\sum_i h_i x_i + \sum_r s_r c_r(x) - \psi\Big\},   (188)

where \psi is the normalization factor and c_r(x) represents a monomial interaction term such as

c_r(x) = x_{i_1} \cdot x_{i_2} \cdots x_{i_k},   (189)

where r is an index showing a subset {i_1, ..., i_k} of the indices {1, 2, ..., n}, k ≥ 2, and s_r is the intensity of such an interaction.

When all interactions are pairwise, k = 2, we have c_r(x) = x_i x_j for r = (i, j). To represent the structure of pairwise interactions, we use a graph G = {N, B}, in which we have n nodes N_1, ..., N_n representing the random variables x_1, ..., x_n and b branches B_1, ..., B_b. A branch B_r connects two nodes N_i and N_j, where r denotes the pair (x_i, x_j) of interaction. G is a non-directed graph having n nodes and b branches. When a graph has no loops, it is a tree graph; otherwise, it is a loopy graph. There are many physical and engineering problems having a graphical structure. They are, for example, spin glasses, error-correcting codes, etc.

Interactions are not necessarily pairwise, and there may exist many interactions among more than two variables. Hence, we consider the general case (188), where r includes {i_1, ..., i_k}, and b is the number of interactions. This is an example of the random Markov field, where each set r = {i_1, ..., i_k} forms a clique of the random graph.

We consider the following statistical model on the graph G or a random Markov field, specified by two vector parameters θ and v,

M = \{p(x, \theta, v)\},   (190)

p(x, \theta, v) = \exp\Big\{\sum_i \theta_i x_i + \sum_r v_r c_r(x) - \psi(\theta, v)\Big\}.   (191)

This is an exponential family, forming a dually flat manifold, where (θ, v) is its e-affine coordinates. Here, \psi(\theta, v) is a convex function from which a dually flat structure is derived. The model M includes the true distribution q(x) in (13), i.e.,

q(x) = \exp\{h \cdot x + s \cdot c(x) - \psi(h, s)\},   (192)

which is given by θ = h and v = s = (s_r). Our job is to find

\eta^* = E_q[x],   (193)

where the expectation E_q is taken with respect to q(x), and η is the dual (m-affine) coordinates corresponding to θ.

We consider b + 1 e-flat submanifolds M_0, M_1, ..., M_b in M. M_0 is the manifold of independent distributions given by p_0(x, θ) = exp{h·x + θ·x − ψ(θ)}, and its e-coordinates are θ. For each branch or clique r, we also consider an e-flat submanifold M_r,

M_r = \{p(x, \zeta_r)\},   (194)

p(x, \zeta_r) = \exp\{h \cdot x + s_r c_r(x) + \zeta_r \cdot x - \psi_r(\zeta_r)\},   (195)

where ζ_r is its e-affine coordinates. Note that a distribution in M_r includes only one nonlinear interaction term c_r(x). Therefore, it is computationally easy to calculate E_r[x] or Prob{x_i = 1} with respect to p(x, ζ_r). All of M_0 and M_r play proper roles for finding E_q[x] of the target distribution (192).

5.3 m-projection

Let us first consider the special submanifold M_0 of independent distributions. We show that the expectation η* = E_q[x] is given by the m-projection of q(x) ∈ M to M_0 [39,40]. We denote it by

\hat{p}(x) = \pi_0 \circ q(x).   (196)

This is the point in M_0 that minimizes the KL-divergence from q to M_0,

\hat{p}(x) = \arg\min_{p \in M_0} \mathrm{KL}[q : p].   (197)

It is known that \hat{p} is the point in M_0 such that the m-geodesic connecting q and \hat{p} is orthogonal to M_0, and it is unique, since M_0 is e-flat. The m-projection has the following property.

Theorem 6 The m-projection of q(x) to M_0 does not change the expectation of x.

Proof Let \hat{p}(x) = p(x, θ*) ∈ M_0 be the m-projection of q(x). By differentiating KL[q : p(x, θ)] with respect to θ, we have

\sum_{x} x\, q(x) - \frac{\partial}{\partial \theta} \psi_0(\theta^*) = 0.   (198)

The first term is the expectation of x with respect to q(x), and the second term is the η-coordinates of θ*, which is the expectation of x with respect to p_0(x, θ*). Hence, they are equal.
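As a small numerical check of this subsection (again only an illustrative sketch; it reuses q, states and numpy from the previous snippet), the m-projection of q onto M_0 is just the product of its marginals. The code verifies Theorem 6 (the projection leaves E[x] unchanged) and that any other point of M_0 has a larger KL-divergence from q, as in (197).

```python
# Continues the toy setup above (q, states, np).  The m-projection of q onto
# the independent manifold M0 is the product of the marginals q_i(x_i); by
# Theorem 6 it preserves E[x], and by Eq. (197) it minimizes KL[q : p] over M0.
def marginals(q, states):
    """Prob{x_i = 1} under q, obtained by summing q over the other variables."""
    return np.array([q[states[:, i] == 1].sum() for i in range(states.shape[1])])

def product_dist(p1, states):
    """Member of M0 with Prob{x_i = 1} = p1[i], evaluated on every joint state."""
    probs = np.where(states == 1, p1, 1 - p1)      # factorized componentwise probabilities
    return probs.prod(axis=1)

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

p1 = marginals(q, states)               # marginals of q
p_hat = product_dist(p1, states)        # m-projection of q onto M0

print("E_q[x]     :", np.round((states * q[:, None]).sum(axis=0), 3))
print("E_p_hat[x] :", np.round((states * p_hat[:, None]).sum(axis=0), 3))   # same values (Theorem 6)

other = product_dist(np.clip(p1 + 0.05, 1e-6, 1 - 1e-6), states)            # some other point of M0
print("KL[q:p_hat] =", round(kl(q, p_hat), 4), " <  KL[q:other] =", round(kl(q, other), 4))
```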


5.4 Clique manifold M_r

The m-projection of q(x) to M_0 is written as the product of the marginal distributions,

\pi_0 \circ q(x) = \prod_{i=1}^{n} q_i(x_i),   (199)

which is what we want to obtain. Its coordinates are given by θ*, from which we easily have η* = E_{θ*}[x], since M_0 consists of independent distributions. However, it is difficult to calculate the m-projection, or to obtain θ* or η* directly. Physicists use the mean field approximation for this purpose. Belief propagation is another method of obtaining an approximation of θ*. Here, the nodes and branches (cliques) play an important role.

Since the difficulty of calculation arises from the existence of a large number of branches or cliques B_r, we consider a model M_r which includes only one branch or clique B_r. Since a branch (clique) manifold M_r includes only one nonlinear term c_r(x), it is written as

p(x, \zeta_r) = \exp\{h \cdot x + s_r c_r(x) + \zeta_r \cdot x - \psi_r\},   (200)

where ζ_r is a free vector parameter. Comparing this with q(x), we see that the sum of all the nonlinear terms

\sum_{r' \neq r} s_{r'} c_{r'}(x)   (201)

except for s_r c_r(x) is replaced by a linear term ζ_r · x. Hence, by choosing an adequate ζ_r, p(x, ζ_r) is expected to give the same expectation E_q[x] as q(x), or a very good approximation of it. Moreover, M_r includes only one nonlinear term, so that it is easy to calculate the expectation of x with respect to p(x, ζ_r). In the following algorithm, all branch (clique) manifolds M_r cooperate iteratively to give a good entire solution.

5.5 Belief propagation

We have the true distribution q(x), b clique distributions p_r(x, ζ_r) ∈ M_r, r = 1, ..., b, and an independent distribution p_0(x, θ) ∈ M_0. All of them join forces to search for a good approximation \hat{p}(x) ∈ M_0 of q(x).

Let ζ_r be the current solution which M_r believes to give a good approximation of q(x). We then project it to M_0, giving the equivalent solution θ_r in M_0 having the same expectation,

p_0(x, \theta_r) = \pi_0 \circ p_r(x, \zeta_r).   (202)

We abbreviate this as

\theta_r = \pi_0 \circ \zeta_r.   (203)

The θ_r specifies the independent distribution equivalent to p_r(x, ζ_r) in the sense that they give the same expectation of x. Since θ_r includes both the effect of the single nonlinear term s_r c_r(x) specific to M_r and that due to the linear term ζ_r in (200),

\xi_r = \theta_r - \zeta_r   (204)

represents the effect of the single nonlinear term s_r c_r(x). This is the linearized version of s_r c_r(x). Hence, M_r knows that the linearized version of s_r c_r(x) is ξ_r, and broadcasts to all the other models M_0 and M_{r'} (r' ≠ r) that the linear counterpart of s_r c_r(x) is ξ_r. Receiving these messages from all M_r, M_0 guesses that the equivalent linear term of \sum_r s_r c_r(x) will be

\theta = \sum_r \xi_r.   (205)

Since M_r in turn receives the messages ξ_{r'} from all other M_{r'}, M_r uses them to form a new ζ_r,

\zeta_r = \sum_{r' \neq r} \xi_{r'},   (206)

which is the linearized version of \sum_{r' \neq r} s_{r'} c_{r'}(x). This process is repeated.

The above algorithm is written as follows, where the current candidates θ^t ∈ M_0 and ζ_r^t ∈ M_r at time t, t = 0, 1, ..., are renewed iteratively until convergence [17].

Geometrical BP Algorithm:

1) Put t = 0, and start with initial guesses ζ_r^0, for example, ζ_r^0 = 0.

2) For t = 0, 1, 2, ..., m-project p_r(x, ζ_r^t) to M_0, and obtain the linearized version ξ_r^{t+1} of s_r c_r(x),

\xi_r^{t+1} = \pi_0 \circ p_r(x, \zeta_r^t) - \zeta_r^t.   (207)

3) Summarize all the effects of the s_r c_r(x), to give

\theta^{t+1} = \sum_r \xi_r^{t+1}.   (208)

4) Update ζ_r by

\zeta_r^{t+1} = \sum_{r' \neq r} \xi_{r'}^{t+1} = \theta^{t+1} - \xi_r^{t+1}.   (209)

5) Stop when the algorithm converges.
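The following sketch runs the Geometrical BP Algorithm above on the same toy model (it reuses h, J, n, states, q and numpy from the earlier snippets; the helper names are invented for the demo). For the independent model p_0(x, θ) we have E[x_i] = tanh(h_i + θ_i), so the m-projection π_0 of any distribution can be read off from its expectations. The clique expectations are computed by brute force here purely for clarity; the point of M_r is that this step is cheap because only one nonlinear term is present.

```python
# Sketch of the Geometrical BP Algorithm (steps 1-5) on the toy model above.
# pi_0 in theta-coordinates: theta_i = atanh(E[x_i]) - h_i, because the
# independent model p_0(x, theta) has E[x_i] = tanh(h_i + theta_i).
def theta_of(p):
    """m-projection of the distribution p (given on all states) to M0, in theta-coordinates."""
    m = (states * p[:, None]).sum(axis=0)                    # E_p[x]
    return np.arctanh(np.clip(m, -0.999999, 0.999999)) - h

def clique_dist(r, zeta):
    """p_r(x, zeta_r) proportional to exp{h.x + s_r x_i x_j + zeta_r.x}, Eq. (200)."""
    (i, j), s = r, J[r]
    w = np.exp(states @ (h + zeta) + s * states[:, i] * states[:, j])
    return w / w.sum()

cliques = list(J.keys())
zeta = {r: np.zeros(n) for r in cliques}                     # step 1): zeta_r^0 = 0

for t in range(100):                                         # steps 2)-4), iterated
    xi = {r: theta_of(clique_dist(r, zeta[r])) - zeta[r] for r in cliques}   # Eq. (207)
    theta = sum(xi.values())                                                 # Eq. (208)
    new_zeta = {r: theta - xi[r] for r in cliques}                           # Eq. (209)
    if max(np.max(np.abs(new_zeta[r] - zeta[r])) for r in cliques) < 1e-10:
        zeta = new_zeta
        break                                                # step 5): converged
    zeta = new_zeta

eta_bp = np.tanh(h + theta)                                  # approximate E[x_i] from p_0(x, theta)
eta_exact = (states * q[:, None]).sum(axis=0)
print("BP estimate:", np.round(eta_bp, 3))
print("exact eta  :", np.round(eta_exact, 3))   # may differ slightly, since the graph is loopy
```

On a tree the two printed vectors agree exactly; on this loopy toy graph the estimate is only an approximation, which is the situation analyzed in the next subsection.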


5.6 Analysis of the solution of the algorithm

Let us assume that the algorithm has converged to {ζ_r*} and θ*. Then, our solution is given by p_0(x, θ*), from which

\eta_i^* = E_0[x_i]   (210)

is easily calculated. However, this might not be the exact solution but only an approximation. Therefore, we need to study its properties. To this end, we study the relations among the true distribution q(x), the converged clique distributions p_r^* = p_r(x, ζ_r*) and the converged marginalized distribution p_0^* = p_0(x, θ*). They are written as

q(x) = \exp\Big\{h \cdot x + \sum_r s_r c_r(x) - \psi_q\Big\},   (211)

p_0(x, \theta^*) = \exp\Big\{h \cdot x + \sum_r \xi_r^* \cdot x - \psi_0\Big\},   (212)

p_r(x, \zeta_r^*) = \exp\Big\{h \cdot x + \sum_{r' \neq r} \xi_{r'}^* \cdot x + s_r c_r(x) - \psi_r\Big\}.   (213)

The convergence point satisfies the two conditions:

1. m-condition: \theta^* = \pi_0 \circ p_r(x, \zeta_r^*);   (214)

2. e-condition: \theta^* = \frac{1}{b-1} \sum_{r=1}^{b} \zeta_r^*.   (215)

Let us consider the m-flat submanifold M* connecting p_0(x, θ*) and all of the p_r(x, ζ_r*),

M^* = \Big\{p(x) \,\Big|\, p(x) = d_0 p_0(x, \theta^*) + \sum_r d_r p_r(x, \zeta_r^*);\ d_0 + \sum_r d_r = 1\Big\}.   (216)

This is a mixture family composed of p_0(x, θ*) and the p_r(x, ζ_r*). We also consider the e-flat submanifold E* connecting p_0(x, θ*) and the p_r(x, ζ_r*),

E^* = \Big\{p(x) \,\Big|\, \log p(x) = v_0 \log p_0(x, \theta^*) + \sum_r v_r \log p_r(x, \zeta_r^*) - \psi\Big\}.   (217)

Theorem 7 The m-condition implies that M* intersects all of M_0 and M_r orthogonally. The e-condition implies that E* includes the true distribution q(x).

Proof The m-condition guarantees that the m-projection of p_r^* to M_0 is p_0^*. Since the m-geodesic connecting p_r^* and p_0^* is included in M* and the expectations are equal,

\eta_r^* = \eta_0^*,   (218)

and M* is orthogonal to M_0 and M_r. The e-condition is easily shown by putting v_0 = -(b-1), v_r = 1, because

q(x) = c\, p_0(x, \theta^*)^{-(b-1)} \prod_r p_r(x, \zeta_r^*)   (219)

from (211)-(213), where c = e^{-\psi_q}.

The algorithm searches for θ* and ζ_r* until both the m- and e-conditions are satisfied. If M* includes q(x), its m-projection to M_0 is p_0(x, θ*). Hence, θ* gives the true solution, but this is not guaranteed. Instead, E* includes q(x). The following theorem is known and easy to prove.

Theorem 8 When the underlying graph is a tree, both E* and M* include q(x), and the algorithm gives the exact solution.

5.7 CCCP procedure for belief propagation

The above algorithm is essentially the same as the BP algorithm given by Pearl [15] and studied by many followers. It is an iterative search procedure for finding M* and E*. In steps 2)-5), it uses the m-projection to obtain new ζ_r's, until the m-condition is satisfied eventually. The m-projections of all M_r become identical when the m-condition is satisfied. On the other hand, in each step of 3), we obtain θ ∈ M_0 such that the e-condition is satisfied. Hence, throughout the procedure, the e-condition is satisfied.

We have a different type of algorithm, which searches for a θ that eventually satisfies the e-condition, while the m-condition is always satisfied. Such an algorithm is proposed by Yuille [36] and Yuille and Rangarajan [37]. It consists of the following two steps, beginning with an initial guess θ^0.

CCCP Algorithm:

Step 1 (inner loop): Given θ^t, calculate ζ_r^{t+1} by solving

\pi_0 \circ p_r(x, \zeta_r^{t+1}) = b\theta^t - \sum_{r'} \zeta_{r'}^{t+1}.   (220)

Step 2 (outer loop): Given ζ_r^{t+1}, calculate θ^{t+1} by

\theta^{t+1} = b\theta^t - \sum_r \zeta_r^{t+1}.   (221)

The inner loop uses an iterative procedure to obtain the new ζ_r^{t+1} in such a way that they satisfy the m-condition, given the old θ^t. Hence, the m-condition is always satisfied, and we search for a new θ^{t+1} until the e-condition is satisfied. We may simplify the inner loop by

\pi_0 \circ p_r(x, \zeta_r^{t+1}) = p_0(x, \theta^t),   (222)

which can be solved directly. This gives a computationally easier procedure. There are some differences in the basins of attraction of the original and simplified procedures.

We have formulated a number of algorithms for stochastic reasoning in terms of information geometry. This not only clarifies the procedures intuitively, but also makes it possible to analyze the stability of the equilibrium and the speed of convergence. Moreover, we can estimate the error of estimation by using the curvatures of E* and M*. A new type of free energy is also defined. See Refs. [16,17] for more details.
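Below is a rough sketch of the simplified double-loop procedure, that is, the decoupled inner step (222) followed by the outer update (221), on the same toy setup. It reuses h, n, cliques, theta_of, clique_dist and numpy from the BP sketch; scipy.optimize.root is used merely as a convenient black-box solver for the small inner equations, which the paper does not prescribe. Convergence on a given loopy graph is not guaranteed for this simplified variant, so the snippet simply reports the estimate and the residual it reaches.

```python
# Simplified CCCP-style double loop: the inner step solves Eq. (222) for each
# clique separately (m-condition with the old theta^t), and the outer step
# applies Eq. (221).  A fixed point satisfies both the m- and e-conditions.
from scipy.optimize import root

b = len(cliques)

def solve_inner(r, theta_t, zeta_start):
    """Find zeta_r with  pi_0(p_r(x, zeta_r)) = theta_t,  Eq. (222)."""
    sol = root(lambda z: theta_of(clique_dist(r, z)) - theta_t, zeta_start)
    return sol.x

theta = np.zeros(n)
zeta = {r: np.zeros(n) for r in cliques}
for t in range(50):
    zeta = {r: solve_inner(r, theta, zeta[r]) for r in cliques}     # inner loop, Eq. (222)
    theta_new = b * theta - sum(zeta.values())                      # outer loop, Eq. (221)
    residual = np.max(np.abs(theta_new - theta))                    # e-condition gap
    theta = theta_new
    if residual < 1e-8:
        break

print("CCCP estimate of E[x]     :", np.round(np.tanh(h + theta), 3))
print("final e-condition residual:", residual)
```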

6 Conclusions

We have given the dual geometrical structure in manifolds of probability distributions, positive measures or arrays, matrices, tensors and others. The dually flat structure is derived from a convex function in general, which gives a Riemannian metric and a pair of dual flatness criteria. The information invariance and the dually flat structure are explained by using divergence functions, without rigorous differential-geometrical terminology. The dual geometrical structure is applied to various engineering problems, which include statistical inference, machine learning, optimization, signal processing and neural networks. Applications to alternative optimization, the Ying-Yang machine and belief propagation are shown from the geometrical point of view.

References

1. Amari S, Nagaoka H. Methods of Information Geometry. New York: Oxford University Press, 2000
2. Csiszár I. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 1967, 2: 299–318
3. Bregman L. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 1967, 7(3): 200–217
4. Eguchi S. Second order efficiency of minimum contrast estimators in a curved exponential family. The Annals of Statistics, 1983, 11(3): 793–803
5. Chentsov N N. Statistical Decision Rules and Optimal Inference. Rhode Island, USA: American Mathematical Society, 1982 (originally published in Russian, Moscow: Nauka, 1972)
6. Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977, 39(1): 1–38
7. Csiszár I, Tusnády G. Information geometry and alternating minimization procedures. Statistics and Decisions, 1984, Supplement Issue 1: 205–237
8. Amari S. Information geometry of the EM and em algorithms for neural networks. Neural Networks, 1995, 8(9): 1379–1408
9. Xu L. Bayesian Ying-Yang machine, clustering and number of clusters. Pattern Recognition Letters, 1997, 18(11–13): 1167–1178
10. Xu L. RBF nets, mixture experts, and Bayesian Ying-Yang learning. Neurocomputing, 1998, 19(1–3): 223–257
11. Xu L. Bayesian Kullback Ying-Yang dependence reduction theory. Neurocomputing, 1998, 22(1–3): 81–111
12. Xu L. BYY harmony learning, independent state space, and generalized APT financial analyses. IEEE Transactions on Neural Networks, 2001, 12(4): 822–849
13. Xu L. Best harmony, unified RPCL and automated model selection for unsupervised and supervised learning on Gaussian mixtures, three-layer nets and ME-RBF-SVM models. International Journal of Neural Systems, 2001, 11(1): 43–69
14. Xu L. BYY harmony learning, structural RPCL, and topological self-organizing on mixture models. Neural Networks, 2002, 15(8–9): 1125–1151
15. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann, 1988
16. Ikeda S, Tanaka T, Amari S. Information geometry of turbo and low-density parity-check codes. IEEE Transactions on Information Theory, 2004, 50(6): 1097–1114
17. Ikeda S, Tanaka T, Amari S. Stochastic reasoning, free energy, and information geometry. Neural Computation, 2004, 16(9): 1779–1810
18. Csiszár I. Information measures: A critical survey. In: Transactions of the 7th Prague Conference. 1974, 83–86
19. Csiszár I. Axiomatic characterizations of information measures. Entropy, 2008, 10(3): 261–273
20. Ali M S, Silvey S D. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 1966, 28(1): 131–142
21. Amari S. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Transactions on Information Theory, 2009, 55(11): 4925–4931
22. Cichocki A, Zdunek R, Phan A H, Amari S. Nonnegative Matrix and Tensor Factorizations. John Wiley, 2009
23. Havrda J, Charvát F. Quantification method of classification process: Concept of structural α-entropy. Kybernetika, 1967, 3: 30–35
24. Chernoff H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 1952, 23(4): 493–507
25. Matsuyama Y. The α-EM algorithm: Surrogate likelihood maximization using α-logarithmic information measures. IEEE Transactions on Information Theory, 2002, 49(3): 672–706
26. Amari S. Integration of stochastic models by minimizing α-divergence. Neural Computation, 2007, 19(10): 2780–2796
27. Amari S. Information geometry and its applications: Convex function and dually flat manifold. In: Nielsen F, ed. Emerging Trends in Visual Computing. Lecture Notes in Computer Science, Vol 5416. Berlin: Springer-Verlag, 2009, 75–102
28. Eguchi S, Copas J. A class of logistic-type discriminant functions. Biometrika, 2002, 89(1): 1–22
29. Murata N, Takenouchi T, Kanamori T, Eguchi S. Information geometry of U-boost and Bregman divergence. Neural Computation, 2004, 16(7): 1437–1481
30. Minami M, Eguchi S. Robust blind source separation by beta-divergence. Neural Computation, 2002, 14(8): 1859–1886
31. Byrne W. Alternating minimization and Boltzmann machine learning. IEEE Transactions on Neural Networks, 1992, 3(4): 612–620
32. Amari S, Kurata K, Nagaoka H. Information geometry of Boltzmann machines. IEEE Transactions on Neural Networks, 1992, 3(2): 260–271
33. Amari S. Natural gradient works efficiently in learning. Neural Computation, 1998, 10(2): 251–276
34. Amari S, Takeuchi A. Mathematical theory on formation of category detecting nerve cells. Biological Cybernetics, 1978, 29(3): 127–136
35. Jordan M I. Learning in Graphical Models. Cambridge, MA: MIT Press, 1999
36. Yuille A L. CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 2002, 14(7): 1691–1722
37. Yuille A L, Rangarajan A. The concave-convex procedure. Neural Computation, 2003, 15(4): 915–936
38. Opper M, Saad D. Advanced Mean Field Methods—Theory and Practice. Cambridge, MA: MIT Press, 2001


39. Tanaka T. Information geometry of mean-field approximation. Neural Computation, 2000, 12(8): 1951–1968
40. Amari S, Ikeda S, Shimokawa H. Information geometry and mean field approximation: The α-projection approach. In: Opper M, Saad D, eds. Advanced Mean Field Methods—Theory and Practice. Cambridge, MA: MIT Press, 2001, 241–257

Shun-ichi Amari was born in Tokyo, Japan, on January 3, 1936. He graduated from the Graduate School of the University of Tokyo in 1963, majoring in mathematical engineering, and received the Degree of Doctor of Engineering.

He worked as an Associate Professor at Kyushu University and the University of Tokyo, and then as a Full Professor at the University of Tokyo, and is now Professor-Emeritus. He moved to RIKEN Brain Science Institute, served as Director for five years, and is now Senior Advisor. He has been engaged in research in wide areas of mathematical science and engineering, such as topological network theory, differential geometry of continuum mechanics, pattern recognition, and information sciences. In particular, he has devoted himself to mathematical foundations of neural networks, including statistical neurodynamics, dynamical theory of neural fields, associative memory, self-organization, and general learning theory. Another main subject of his research is information geometry, initiated by himself, which applies modern differential geometry to statistical inference, information theory, control theory, stochastic reasoning, and neural networks, providing a new powerful method to information sciences and probability theory.

Dr. Amari is past President of the International Neural Networks Society and of the Institute of Electronics, Information and Communication Engineers, Japan. He received the Emanuel R. Piore Award and the Neural Networks Pioneer Award from the IEEE, the Japan Academy Award, the C&C Award and the Caianiello Memorial Award. He was the founding co-editor-in-chief of Neural Networks, among many other journals.

