
Front. Electr. Electron. Eng. China 2010, 5(3): 241–260 DOI 10.1007/s11460-010-0101-3

Shun-ichi AMARI

Information geometry in optimization, machine learning and statistical inference

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2010

Abstract   The present article gives an introduction to information geometry and surveys its applications in the area of machine learning, optimization and statistical inference. Information geometry is explained intuitively by using divergence functions introduced in a manifold of probability distributions and other general manifolds. They give a Riemannian structure together with a pair of dual flatness criteria. Many manifolds are dually flat. When a manifold is dually flat, a generalized Pythagorean theorem and related projection theorem are introduced. They provide useful means for various approximation and optimization problems. We apply them to alternative minimization problems, Ying-Yang machines and belief propagation algorithm in machine learning.

Keywords   information geometry, machine learning, optimization, statistical inference, divergence, graphical model, Ying-Yang machine

Received January 15, 2010; accepted February 5, 2010

Shun-ichi AMARI
RIKEN Brain Science Institute, Saitama 351-0198, Japan
E-mail: [email protected]

1 Introduction

Information geometry [1] deals with a manifold of probability distributions from the geometrical point of view. It studies the invariant structure by using the Riemannian geometry equipped with a dual pair of affine connections. Since probability distributions are used in many problems in optimization, machine learning, vision, statistical inference, neural networks and others, information geometry provides a useful and strong tool to many areas of information sciences and engineering.

Many researchers in these fields, however, are not familiar with modern differential geometry. The present article intends to give an understandable introduction to information geometry without modern differential geometry. Since underlying manifolds in most applications are dually flat, the dually flat structure plays a fundamental role. We explain the fundamental dual structure and related dual geodesics without using the concept of affine connections and covariant derivatives.

We begin with a divergence function between two points in a manifold. When it satisfies an invariance criterion of information monotonicity, it gives a family of f-divergences [2]. When a divergence is derived from a convex function in the form of Bregman divergence [3], this gives another type of divergence, where the Kullback-Leibler divergence belongs to both of them. We derive a geometrical structure from a divergence function [4]. The Riemannian structure is derived from an invariant divergence (f-divergence) (see Refs. [1,5]), while the dually flat structure is derived from the Bregman divergence (convex function).

The manifold of all discrete probability distributions is dually flat, where the Kullback-Leibler divergence plays a key role. We give the generalized Pythagorean theorem and projection theorem in a dually flat manifold, which play a fundamental role in applications. Such a structure is not limited to a manifold of probability distributions, but can be extended to the manifolds of positive arrays, matrices and visual signals, and will be used in neural networks and optimization problems.

After introducing basic properties, we show three areas of applications. One is application to the alternative minimization procedures such as the expectation-maximization (EM) algorithm in statistics [6–8]. The second is an application to the Ying-Yang machine introduced and extensively studied by Xu [9–14]. The third one is application to the belief propagation algorithm of stochastic reasoning in machine learning or artificial intelligence [15–17]. There are many other applications in analysis of spiking patterns of the brain, neural networks, boosting algorithm of machine learning, as well as a wide range of statistical inference, which we do not mention here.

2 Divergence function and information geometry

2.1 Manifold of probability distributions and positive arrays

We introduce divergence functions in various spaces or manifolds. To begin with, we show typical examples of manifolds of probability distributions. A one-dimensional Gaussian distribution with mean μ and variance σ² is represented by its probability density function

    p(x; μ, σ) = 1/(√(2π)σ) exp{−(x − μ)²/(2σ²)}.   (1)

It is parameterized by a two-dimensional parameter ξ = (μ, σ). Hence, when we treat all such Gaussian distributions, not a particular one, we need to consider the set SG of all the Gaussian distributions. It forms a two-dimensional manifold

    SG = {p(x; ξ)},   (2)

where ξ = (μ, σ) is a coordinate system of SG. This is not the only coordinate system. It is possible to use other parameterizations or coordinate systems when we study SG.

We show another example. Let x be a discrete random variable taking values on a finite set X = {0, 1, ..., n}. Then, a probability distribution is specified by a vector p = (p0, p1, ..., pn), where

    pi = Prob{x = i}.   (3)

We may write

    p(x; p) = Σi pi δi(x),   (4)

where

    δi(x) = 1 (x = i), 0 (x ≠ i).   (5)

Since p is a probability vector, we have

    Σi pi = 1,   (6)

and we assume

    pi > 0.   (7)

The set of all the probability distributions is denoted by

    Sn = {p},   (8)

which is an n-dimensional simplex because of (6) and (7). When n = 2, Sn is a triangle (Fig. 1). Sn is an n-dimensional manifold, and ξ = (p1, ..., pn) is a coordinate system. There are many other coordinate systems. For example,

    θi = log(pi/p0),   i = 1, ..., n,   (9)

is an important coordinate system of Sn, as we will see later.

Fig. 1   Manifold S2 of discrete probability distributions

The third example deals with positive measures, not probability measures. When we disregard the constraint Σ pi = 1 of (6) in Sn, keeping pi > 0, p is regarded as an (n + 1)-dimensional positive array, or a positive measure where x = i has measure pi. We denote the set of positive measures or arrays by

    Mn+1 = {z; zi > 0, i = 0, 1, ..., n}.   (10)

This is an (n + 1)-dimensional manifold with a coordinate system z. Sn is its submanifold derived by the linear constraint Σ zi = 1.

In general, we can regard any regular statistical model

    S = {p(x, ξ)},   (11)

parameterized by ξ, as a manifold with a (local) coordinate system ξ. It is a space M of positive measures when the constraint ∫ p(x, ξ)dx = 1 is discarded. We may treat any other types of manifolds and introduce dual structures in them. For example, we will consider a manifold consisting of positive-definite matrices.

2.2 Divergence function and geometry

We consider a manifold S having a local coordinate system z = (zi). A function D[z : w] between two points z and w of S is called a divergence function when it satisfies the following two properties:
1) D[z : w] ≥ 0, with equality when and only when z = w.
2) When the difference between w and z is infinitesimally small, we may write w = z + dz, and Taylor expansion gives

    D[z : z + dz] = (1/2) Σ gij(z) dzi dzj,   (12)

where

    gij(z) = ∂²D[z : w]/∂zi∂zj |w=z   (13)

is a positive-definite matrix.
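As a concrete illustration of (12) and (13), the following short numerical sketch (ours, not part of the original paper; the function names are chosen only for illustration) extracts the matrix gij(z) from a given divergence by central finite differences. Applied to the Kullback-Leibler divergence extended to positive arrays, it recovers the diagonal metric diag(1/zi) in these coordinates.

import numpy as np

def metric_from_divergence(D, z, eps=1e-4):
    # g_ij(z): second derivative of D[. : z] in its first argument at z, cf. (13)
    n = len(z)
    g = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            g[i, j] = (D(z + ei + ej, z) - D(z + ei - ej, z)
                       - D(z - ei + ej, z) + D(z - ei - ej, z)) / (4 * eps ** 2)
    return g

def kl(p, q):
    # Kullback-Leibler divergence extended to positive arrays (not necessarily normalized)
    return np.sum(p * np.log(p / q) + q - p)

z = np.array([0.2, 0.3, 0.5])
print(metric_from_divergence(kl, z))   # approximately diag(1/z_i)
print(np.diag(1.0 / z))

The same routine applied to the Euclidean divergence introduced below returns the identity matrix, in agreement with (19).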


A divergence does not need to be symmetric, and 3 ∂ Γijk(z)=− D[z : w]|w=z, (16) D [z : w] = D [w : z] (14) ∂zi∂zj∂wk 3 ∗ z − ∂ z w in general, nor does it satisfy the triangular inequality. Γijk( )= D[ : ]|w=z, (17) ∂wi∂wj∂zk Hence, it is not a . It rather has a dimension of square of distance as is seen from (12). So (12) is consid- derived from a divergence D[z : w]. These two are du- ered to define the square of the local distance ds between ally coupled with respect to the Riemannian metric gij two nearby points Z =(z)andZ +dZ =(z +dz), [1]. The meaning of dually coupled affine connections is not explained here, but will become clear in later sec- 2 ds = gij (z)dzidzj. (15) tions, by using specific examples. The Euclidean divergence, defined by More precisely, dz is regarded as a small line element z w 1 − 2 connecting two points Z and Z +dZ. Thisisatan- DE[ : ]= (zi wi) , (18) 2 gent vector at point Z (Fig. 2). When a manifold has a positive-definite matrix gij (z)ateachpointz,itis is a special case of divergence. We have called a Riemannian manifold, and (gij )isaRieman- gij = δij , (19) nian metric tensor. ∗ Γijk =Γijk =0, (20)

where δij is the Kronecker delta. Therefore, the derived geometry is Euclidean. Since it is self-dual (Γijk = Γ*ijk), the duality does not play a role.

2.3 Invariant divergence: f-divergence

2.3.1 Information monotonicity

Let us consider a function t(x) of random variable x, where t and x are vector-valued. We can derive proba- Fig. 2 Manifold, tangent space and tangent vector bility distributionp ¯(t, ξ)oft from p(x, ξ)by An affine connection defines a correspondence between p¯(t, ξ)dt = p(x, ξ)dx. (21) two nearby tangent spaces. By using it, a geodesic is de- t=t(x) fined: A geodesic is a curve of which the tangent direc- offprintWhen t(x) is not reversible, that is, t(x)isamany-to- tions do not change along the curve by this correspon- dence (Fig. 3). It is given mathematically by a covari- one mapping, there is loss of information by summariz- x t t x ant derivative, and technically by the Christoffel symbol, ing observed data into reduced = ( ). Hence, for two divergences between two distributions specified by Γijk(z), which has three indices. ξ ξ 1 and 2, D = D [p (x, ξ1):p (x, ξ2)] , (22)

    D̄ = D[p̄(t, ξ1) : p̄(t, ξ2)],   (23)

it is natural to require

    D̄ ≤ D.   (24)

The equality holds when and only when t(x) is a sufficient statistic. A divergence function is said to be invariant when this requirement is satisfied. The invariance is a key concept in constructing information geometry of probability distributions [1,5].
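The monotonicity requirement (24) is easy to check numerically. The following sketch (ours, for illustration only; the partition and the two distributions are arbitrary toy choices) coarse-grains two distributions over six atoms into three subsets and compares the Kullback-Leibler divergences before and after coarse-graining:

import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.10, 0.15, 0.05, 0.30, 0.25, 0.15])
q = np.array([0.20, 0.05, 0.15, 0.25, 0.10, 0.25])

# coarse-graining t(x): partition the six atoms into three subsets
partition = [[0, 1], [2, 3], [4, 5]]
p_bar = np.array([p[T].sum() for T in partition])
q_bar = np.array([q[T].sum() for T in partition])

print(kl(p_bar, q_bar), kl(p, q))   # the coarse-grained value never exceeds the original, cf. (24)

Equality would require the conditional distributions inside each subset to coincide, which is the condition discussed in the next subsection.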

Fig. 3 Geodesic, keeping the same tangent direction Here, we use a simplified version of invariance due to Csisz´ar [18,19]. We consider the space Sn of Oneoffprint may skip the following two paragraphs, since all probability distributions over n + 1 atoms X = technical details are not used in the following. Eguchi {x0,x1,...,xn}. A probability distributions is given by [4] proposed the following two connections: p =(p0,p1,...,pn), pi =Prob{x = xi}.


Let us divide X into m subsets, T1,T2,...,Tm (m< by Ali and Silvey [20]. It is defined by n +1),say qi { } { } Df [p : q]= pif , (34) T1 = x1,x2,x5 ,T2 = x3,x8,... , ... (25) pi

This is a partition of X, where f is a convex function satisfying X = ∪Ti, (26) f(1) = 0. (35) Ti ∩ Tj = φ. (27) For a function cf with a constant c,wehave Let t be a mapping from X to {T1,...,Tm}. Assume that we do not know the outcome x directly. Instead Dcf [p : q]=cDf [p : q] . (36) we can observe t(x)=Tj, knowing the subset Tj to which x belongs. This is called coarse-graining of X Hence, f and cf give the same divergence except for into T = {T1,...,Tm}. the scale factor c. In order to standardize the scale of The coarse-graining generates a new probability dis- divergence, we may assume that tributions p¯ =(¯p1,...,p¯m)overT1,...,Tm, f (1) = 1, (37) p¯j =Prob{Tj} = Prob {x} . (28)

x∈Tj provided f is differentiable. Further, for fc(u)=f(u) − − Let D¯ [p¯ : q¯] be an induced divergence between p¯ and q¯. c(u 1) where c is any constant, we have Since coarse-graining summarizes a number of elements Df [p : q]=Df [p : q] . (38) into one subset, detailed information of the outcome is c lost. Therefore, it is natural to require Hence, we may use such an f that satisfies

    D̄[p̄ : q̄] ≤ D[p : q].   (29)

    f′(1) = 0   (39)

When does the equality hold? Assume that the out- without loss of generality. A convex function satisfying come x isknowntobelongtoTj. This gives some in- the above three conditions (35), (37), (39) is called a p q formation to distinguish two distributions and .If standard f function. we know further detail of x inside subset Tj,weobtain A divergence is said to be decomposable, when it is a more information to distinguish the two probability dis- sum of functions of components tributions p and q.Sincex belongs to Tj,weconsider the two conditional probability distributions D[p : q]= D [pi : qi] . (40) i p (xi |xi ∈ Tj ) ,q(xi |xi ∈ Tj ) (30) offprintThe f-divergence (34) is a decomposable divergence. under the condition that x is in subset Tj.Ifthetwo Csisz´ar [18] found that any f-divergence satisfies distributions are equal, we cannot obtain further infor- information monotonicity. Moreover, the class of f- mation to distinguish p from q by observing x inside Tj. divergences is unique in the sense that any decompos- Hence, able divergence satisfying information monotonicity is D¯ [p¯ : q¯]=D [p : q] (31) an f-divergence. holds, when and only when Theorem 1 Any f-divergence satisfies the informa- tion monotonicity. Conversely, any decomposable infor- p (xi |Tj )=q (xi |Tj ) (32) mation monotonic divergence is written in the form of f-divergence. for all Tj and all xi ∈ Tj,or A proof is found, e.g., in Ref. [21]. pi The Riemannian metric and affine connections derived = λj (33) qi from an f-divergence has a common invariant structure [1]. They are given by the Fisher information metric and ∈ for all xi Tj for some constant λj. ±α-connections, which are shown in a later section. A divergence satisfying the above requirements is An extensive list of f-divergences is given in Cichocki called an invariant divergence, and such a property is et al. [22]. Some of them are listed below. termed as information monotonicity. 1) The Kullback-Leibler (KL-) divergence: f(u)= u log u − (u − 1): 2.3.2 offprintf-divergence qi DKL[p : q]= pi log . (41) The f-divergence was introduced by Csisz´ar [2] and also pi


√ 2 2) Squared Hellinger distance: f(u)=( u − 1) : two positive measures z and y,anf-divergence is given by √ √ 2 D [p : q]= ( pi − qi) . (42) yi Hel Df [z : y]= zif , (45) zi 3) The α-divergence: where f is a standard convex function. It should be noted that an f-divergence is no more invariant under 4 1+α 2 fα(u)= 1 − u 2 − (u − 1),(43) the transformation from f(u)to 1 − α2 1 − α 4 1+α 1−α Dα[p : q]= 1 − p 2 q 2 . (44) fc(u)=f(u) − c(u − 1). (46) 1 − α2 i i The α-divergence was introduced by Havrda and Hence, it is absolutely necessary to use a standard f in Charv´at [23], and has been studied extensively by thecaseofMn, because the conditions of divergence are Amari and Nagaoka [1]. Its applications were violated otherwise. described earlier in Chernoff [24], and later in Among all f-divergences, the α divergence [1,21,22] is given by Matsuyama [25], Amari [26], etc. to mention a few. yi It is the squared Hellinger distance for α =0.The Dα[z : y]= zifα , (47) zi KL-divergence and its reciprocal are obtained in the limit of α →±1. See Refs. [21,22,26] for the α- where ⎧ structure. ⎪ 4 1+α 2 ⎪ 1 − u 2 − (u − 1),α= ±1, ⎨ 1 − α2 1 − α fα(u)= 2.3.3 f-divergence of positive measures ⎪ u log u − (u − 1),α=1, ⎩⎪ − log u +(u − 1),α= −1, We now extend the f-divergence to the space of positive (48) measures Mn over X = {x1,...,xn},whosepointsare and plays a special role in Mn. Here, we use a simple z (1+α)/2 given by the coordinates =(z1,...,zn), zi > 0. Here, power function u , changing it by adding a lin- zi is the mass (measure) of xi where the total mass zi ear and a constant term to become the standard convex is positive and arbitrary. In many applications, z is function fα(u). It includes the logarithm as a limiting a non-negative array, and we can extend it to a non- case. negative double array z =(zij ), etc., that is, matrices The α-divergence is explicitly given in the following and tensors. We first derive an f-divergence in Mn:For form [1,21,22]:

    Dα[z : y] =
      (4/(1 − α²)) Σi { (1 − α)/2 · zi + (1 + α)/2 · yi − zi^{(1−α)/2} yi^{(1+α)/2} },   α ≠ ±1,
      Σi { zi − yi + yi log(yi/zi) },   α = 1,
      Σi { yi − zi + zi log(zi/yi) },   α = −1.   (49)
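As an illustration of (45)–(49), the following fragment (ours, not from the paper; the arrays z and y are arbitrary positive test values) evaluates the decomposable f-divergence (45) with the standard α-generator, i.e., the representative with f(1) = 0, f′(1) = 0, f″(1) = 1 required for positive measures, and shows that it varies continuously in α, approaching the logarithmic cases of (49) at α → ±1:

import numpy as np

def f_divergence(z, y, f):
    # decomposable f-divergence D_f[z : y] = sum_i z_i f(y_i / z_i), cf. (34), (45)
    z, y = np.asarray(z, float), np.asarray(y, float)
    return np.sum(z * f(y / z))

def f_alpha(alpha):
    # standard convex generator of the alpha-divergence
    if np.isclose(alpha, 1.0):
        return lambda u: u * np.log(u) - (u - 1.0)
    if np.isclose(alpha, -1.0):
        return lambda u: -np.log(u) + (u - 1.0)
    return lambda u: (4.0 / (1.0 - alpha ** 2) * (1.0 - u ** ((1.0 + alpha) / 2))
                      + 2.0 / (1.0 - alpha) * (u - 1.0))

z = np.array([0.5, 1.2, 0.8])   # positive measures, not necessarily normalized
y = np.array([0.9, 0.4, 1.1])

for a in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(a, f_divergence(z, y, f_alpha(a)))
print(f_divergence(z, y, f_alpha(0.999)))   # close to the alpha = 1 value, cf. (49)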

2.4 Flat divergence: Bregman divergence

A divergence D[z : w] gives a Riemannian metric gij by (13), and a pair of affine connections by (16), (17). When (20) holds, the manifold is flat. We study divergence functions which give a flat structure in this subsection. In this case, the coordinate system z is flat, that is affine, although the metric gij(z) is not Euclidean. When z is affine, all the coordinate curves are regarded as straight lines, that is, geodesics. Here, we separate two concepts, flatness and metric. Mathematically speaking, we use an affine connection Γijk to define flatness, which is not necessarily derived from the metric gij. Note that the Levi-Civita affine connection is derived from the metric in Riemannian geometry, but our approach is more general.

2.4.1 Convex function

We show in this subsection that a dually flat geometry is derived from a divergence due to a convex function in terms of the Bregman divergence [3]. Conversely, a dually flat manifold always has a convex function to define its geometry. The divergence derived from it is called the canonical divergence of a dually flat manifold [1].

Let ψ(z) be a strictly convex differentiable function. Then, the linear hyperplane tangent to the graph

    y = ψ(z)   (50)


∗ ∗ at z0 is function ψ (z ). We can obtain a dual potential func- tion by y = ∇ψ (z0) · (z − z0)+ψ (z0) , (51) ψ∗ (z∗)=max{z · z∗ − ψ(z)} , (58) z and is always below the graph y = ψ(z) (Fig. 4). Here, which is convex in z∗. The two coordinate systems are ∇ψ is the gradient, dual, since the pair z and z∗ satisfies the following rela- tion: ∂ ∂ ∗ ∗ ∗ ∇ψ = ψ(z),..., ψ(z) . (52) ψ(z)+ψ (z ) − z · z =0. (59) ∂z1 ∂zn The inverse transformation is given by The difference of ψ(z) and the tangent hyperplane is z = ∇ψ∗ (z∗) . (60) z z z − z −∇ z · z − z  ∗ ∗ Dψ [ : 0]=ψ( ) ψ ( 0) ψ ( 0) ( 0) 0. We can define the Bregman divergence Dψ∗ [z : w ] (53) by using ψ∗ (z∗). However, it is easy to prove See Fig. 4. This is called the Bregman divergence in- ∗ ∗ duced by ψ(z). Dψ∗ [z : w ]=Dψ [w : z] . (61) Hence, they are essentially the same, that is the same when w and z are interchanged. It is enough to use only one common divergence. By simple calculations, we see that the Bregman di- vergence can be rewritten in the dual form as D[z : w]=ψ(z)+ψ∗ (w∗) − z · w∗. (62) We have thus two coordinate systems z and z∗ of S. They define two flat structures such that a curve with parameter t, Fig. 4 Bregman divergence due to convex function z(t)=ta + b, (63) is a ψ-geodesic, and This satisfies the conditions of a divergence. The in- ∗ duced Riemannian metric is z∗(t)=tc∗ + d (64) 2 ∗ ∂ is a ψ∗-geodesic, where a, b, c∗ and d are constants. gij (z)= ψ(z), (54) ∂zi∂zj A Riemannian metric is defined by two metric tensors ∗ ∗ G =(gij )andG = g , and the coefficients of the affine connection vanish luck- ij ily, offprint ∂2 z gij = ψ(z), (65) Γijk( )=0. (55) ∂zi∂zj 2 Therefore, z is an affine coordinate system, and a ∗ ∂ ∗ ∗ gij = ∗ ∗ ψ (z ) , (66) geodesic is always written as the linear form ∂zi ∂zj z a b in the two coordinate systems, and they are mutual in- (t)= t + (56) verses, G∗ = G−1 [1,27]. Because the squared local dis- with parameter t and constant vectors a and b (see Ref. tance of two nearby points z and z +dz is given by [27]). 1 D [z : z +dz]= gij (z)dzidzj, (67) 2 2.4.2 Dual structure 1 ∗ ∗ ∗ ∗ = gij (z )dzi dzj , (68) 2 In order to give a dual description related to a convex they give the same Riemannian distance. However, the function ψ(z), let us define its gradient, two geodesics are different, which are also different from z∗ = ∇ψ(z). (57) the Riemannian geodesic derived from the Riemannian metric. Hence, the minimality of the curve length does This is the Legendre transformation, and the correspon- not hold for ψ-andψ∗-geodesics. dence between z and z∗ is one-to-one. Hence, z∗ is Two geodesic curves (63) and (64) intersect at t =0, ∗ regardedoffprint as another (nonlinear) coordinate system of S. when b = d . In such a case, they are orthogonal if In order to describe the geometry of S in terms of the their tangent vectors are orthogonal in the sense of the dual coordinate system z∗, we search for a dual convex Riemannian metric. The orthogonality condition can be

THE AUTHORS WARRANT THAT THEY WILL NOT POST THE E-OFFPRINT OF THE PAPER ON PUBLIC WEBSITES. Shun-ichi AMARI. Information geometry in optimization, machine learning and statistical inference 247 represented in a simple form in terms of the two dual Projection Theorem. Given p ∈ S, the minimizer ∗ ∗ coordinates as r of Dψ(p, r), r ∈ M,istheψ -projection of p to ∗∗ M, and the minimizer r of Dψ(r, p), r ∈ M,isthe ∗ ∗ a, b = aibi =0. (69) ψ-projection of p to M. The theorem is useful in various optimization prob- 2.4.3 Pythagorean theorem and projection theorem lems searching for the closest point r ∈ M to a given p. A generalized Pythagorean theorem and projection the- orem hold for a dually flat space [1]. They are highlights 2.4.4 Decomposable convex function and of a dually flat manifold. generalization Generalized Pythagorean Theorem. Let r, s, q be three points in S such that the ψ∗-geodesic connect- Let U(z) be a convex function of a scalar z. Then, we have a simple convex function of z, ing r and s is orthogonal to the ψ-geodesic connecting s and q (Fig. 5). Then, ψ(z)= U (zi) . (72)

Dψ[r : s]+Dψ[s : q]=Dψ[r : q]. (70) The dual of U is given by U ∗ (z∗)=max{zz∗ − U(z)} , (73) z and the dual coordinates are ∗  zi = U (zi) . (74) The dual convex function is ∗ ∗ ∗ ∗ ψ (z )= U (zi ) . (75) The ψ-divergence due to U between z and w is decom- posable and given by the sum of components [28–30], ∗ ∗ ∗ Dψ[z : w]= {U (zi)+U (wi ) − ziwi } . (76) We have discussed the flat structure given by a convex function due to U(z). We now consider a more general Fig. 5 Pythagorean theorem case by using a nonlinear rescaling. Let us define by When S is a Euclidean space with r = k(z) a nonlinear scale r in the z-axis, where k is a monotone function. If we have a convex function U(r)of 1 2 the induced variable r, there emerges a new dually flat offprintD[r : s]= (ri − si) , (71) 2 structure with a new convex function U(r). Given two functions k(z)andU(r), we can introduce a new dually (70) is the Pythagorean theorem itself. flat Riemannian structure in S, where the flat coordi- Now, we state the projection theorem. Let M be a nates are not z but r = k(z). There are infinitely many submanifold of S (Fig. 6). Given p ∈ S,apointq ∈ M such structures depending on the choice of k and U.We is said to be the ψ-projection (ψ∗-projection) of p to M show two typical examples, which give the α-divergence when the ψ-geodesic (ψ∗-geodesic) connecting p and q is and β-divergence [22] in later sections. Both of them orthogonal to M with respect to the Riemannian metric use a power function of the type uq including the log gij . and exponential function as the limiting cases.
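To see how (53), (69) and (70) fit together in a computation, here is an illustrative sketch of ours (the choice of ψ, the points and all names are arbitrary assumptions). Take ψ(z) = Σ (zi log zi − zi) on positive arrays, for which the Bregman divergence (53) is the Kullback-Leibler divergence and the dual coordinates are z* = log z. For any Bregman divergence, Dψ[r : s] + Dψ[s : q] − Dψ[r : q] = (r − s)·(q* − s*), so the Pythagorean relation (70) holds exactly when a coordinate difference and a dual-coordinate difference are orthogonal in the sense of (69):

import numpy as np

def psi(z):                    # a strictly convex function on positive arrays
    return np.sum(z * np.log(z) - z)

def grad_psi(z):               # dual coordinates z* = grad psi(z), cf. (57)
    return np.log(z)

def bregman(a, b):             # D_psi[a : b] = psi(a) - psi(b) - grad psi(b).(a - b), cf. (53)
    return psi(a) - psi(b) - grad_psi(b) @ (a - b)

s = np.array([0.5, 1.0, 1.5])
q = np.array([1.2, 0.7, 0.9])

# choose r so that the coordinate difference r - s is orthogonal, in the sense of (69),
# to the dual-coordinate difference q* - s*
d = grad_psi(q) - grad_psi(s)
v = np.array([d[1], -d[0], 0.0])      # v . d = 0
r = s + 0.3 * v

lhs = bregman(r, s) + bregman(s, q)
rhs = bregman(r, q)
print(lhs, rhs, lhs - rhs)            # the difference vanishes up to rounding, cf. (70)

Moving r off the orthogonal direction makes the discrepancy reappear, with magnitude given by the pairing above.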

2.5 Invariant flat structure of S

This section focuses on the manifolds which are equipped with both invariant and flat geometrical structure.

2.5.1 Exponential family of probability distributions

Fig. 6   Projection of p to M

The following type of probability distributions of random variable y, parameterized by θ,

    p(y, θ) = exp{ Σi θi si(y) − ψ(θ) },   (77)

THE AUTHORS WARRANT THAT THEY WILL NOT POST THE E-OFFPRINT OF THE PAPER ON PUBLIC WEBSITES. 248 Front. Electr. Electron. Eng. China 2010, 5(3): 241–260 under a certain measure μ(y)onthespaceY = {y} is From θi called an exponential family. We denote it now by S. p0 =1− pi =1− p0 e , (92) We introduce random variables x =(xi), we have ψ(θ)=log 1+ eθi . (93) xi = si(y), (78) and rewrite (77) in the standard form 2.5.2 Geometry of exponential family p(x, θ)=exp θixi − ψ(θ) , (79) The Fisher information matrix is defined in a manifold of probability distributions by where exp {−ψ(θ)} is the normalization factor satisfying ∂ ∂ gij (θ)=E log p(x, θ) p(x, θ) , (94) ∂θi ∂θj ψ(θ)=log exp θixi dμ(x). (80) where E is expectation with respect to p(x, θ). When S This is called the free energy in the physics community is an exponential family (79), it is calculated as and is a convex function. ∂2 We first give a few examples of exponential families. gij (θ)= ψ(θ). (95) ∂θi∂θj 1. Gaussian distributions SG A Gaussian distribution is given by This is a positive-definite matrix and ψ(θ) is a convex function. Hence, this is the Riemannian metric induced 1 (y − μ)2 p (y; μ, σ)=√ exp − . (81) from a convex function ψ. 2πσ 2σ2 We can also prove that the invariant Riemannian met- ric derived from an f-divergence is exactly the same as Hence, all Gaussian distributions form a two- the Fisher metric for any standard convex function f.It dimensional manifold S ,where(μ, σ)playstheroleof G is an invariant metric. a coordinate system. Let us introduce new variables A geometry is induced to S = {p(x, θ)} through the θ θ x1 = y, (82) convex function ψ( ). Here, is an affine coordinate 2 system, which we call e-affine or e-flat coordinates, “e” x2 = y , (83) μ representing the exponential structure of (79). By the θ1 = 2 , (84) Legendre transformation of ψ, we have the dual coordi- 2σ ∗ nates (we denote them by η instead of θ ), − 1 θ2 = 2 . (85) 2σ η = ∇ψ(θ). (96) Then, (81) can be rewritten in the standard form offprintBy calculating the expectation of xi,wehave p(x, θ)=exp{θ1x1 + θ2x2 − ψ(θ)} , (86) ∂ E[xi]=ηi = ψ(θ). (97) ∂θi where η μ2 √ By this reason, the dual parameters =(ηi) are called ψ(θ)= − log 2πσ . (87) 2σ2 the expectation parameters of an exponential family. We call η the m-affine or m-flat coordinates, where “m”im- Here, θ =(θ1,θ2)isanewcoordinatesystemofS , G plying the mixture structure [1]. called the natural or canonical coordinate system. The dual convex function is given by the negative of 2. Discrete distributions Sn entropy Let y be a random variable taking values on { } ∗ 0, 1,...,n . By defining ψ (η)= p (x, θ(η)) log p (x, θ(η)) dx, (98)

xi = δi(y),i=1, 2,...,n, (88) where p(x, θ(η)) is considered as a function of η. Then, and the Bregman divergence is pi ∗ θi =log , (89) D [θ1 : θ2]=ψ (θ1)+ψ (η2) − θ1 · η2, (99) p0 (4) is rewritten in the standard form where ηi is the dual coordinates of θi, i =1, 2. We show that manifolds of exponential families have p(x, θ)=exp θixi − ψ(θ) , (90) both invariant and flat geometrical structure. However, offprintthey are not unique and it is possible to introduce both where invariant flat structure to a mixture family of probability ψ(θ)=− log p0. (91) distributions.
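The discrete family (90)–(93) makes the dual structure of this subsection easy to verify numerically. The sketch below is ours (the parameter values are arbitrary); it computes ψ(θ) of (93), checks that its gradient returns the expectation parameters (96)–(97), and evaluates the Bregman divergence built from ψ, which coincides with a Kullback-Leibler divergence between the two member distributions (the argument order printed is the one produced by this convention; cf. (99)–(100)):

import numpy as np

def psi(theta):
    # free energy (93) of the discrete family: psi(theta) = log(1 + sum_i e^theta_i)
    return np.log1p(np.sum(np.exp(theta)))

def dist_from_theta(theta):
    # recover p = (p_0, ..., p_n) from the natural coordinates theta_i = log(p_i / p_0), cf. (89)
    w = np.concatenate(([1.0], np.exp(theta)))
    return w / w.sum()

def grad_psi(theta, eps=1e-6):
    # numerical gradient of psi: the expectation coordinates eta, cf. (96)-(97)
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (psi(theta + e) - psi(theta - e)) / (2 * eps)
    return g

def kl(a, b):
    return np.sum(a * np.log(a / b))

theta1 = np.array([0.3, -0.5, 1.0])
theta2 = np.array([-0.2, 0.4, 0.1])
p1, p2 = dist_from_theta(theta1), dist_from_theta(theta2)

print(grad_psi(theta1), p1[1:])    # eta = grad psi(theta) reproduces (p_1, ..., p_n)

bregman = psi(theta1) - psi(theta2) - grad_psi(theta2) @ (theta1 - theta2)
print(bregman, kl(p2, p1))         # the Bregman divergence of psi equals a KL divergence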


Theorem 2 The Bregman divergence is the is a convex function of r. Kullback-Leibler divergence given by The dual potential is simply given by p (x, θ1) ∗ ∗ ∗ KL [p (x, θ1):p (x, θ2)] = p (x, θ1)log dx. ψα (r )=ψ−α (r ) , (108) p (x, θ2) (100) and the dual affine coordinates are This shows that the KL-divergence is the canonical divergence derived from the dual flat structure of ψ(θ). ∗(α) (−α) r = r = k−α(z). (109) Moreover, it is proved that the KL-divergence is the unique divergence belonging to both the f-divergence We can then prove the following theorem. and Bregman divergence classes for manifolds of proba- Theorem 3 The α-divergence of M is the Bregman bility distributions. divergence derived from the α-representation kα and the For a manifold of positive measures, there are diver- linear function Uα of z. gences other than the KL-divergence, which are invari- Proof We have the Bregman divergence between ant and dually flat, that is, belonging to the classes f- r = k(z)ands = k(y) based on ψα as divergences and Bregman divergences at the same time. See the next subsection. ∗ ∗ Dα[r : s]=ψα(r)+ψ−α (s ) − r · s , (110)

2.5.3 Invariant and flat structure in manifold of where s∗ is the dual of s. By substituting (107), (108) positive measures and (109) in (110), we see that Dα[r(z):s(y)] is equal to the α-divergence defined in (49). z We consider a manifold M of positive arrays ,zi > 0, This proves that the α-divergence for any α belongs to where zi can be arbitrary. We define an invariant and the intersection of the classes of f-divergences and Breg- flat structure in M. Here, z is a coordinate system, and man divergences in M. Hence, it possesses information consider the f-divergence given by monotonicity and the induced geometry is dually flat at n the same time. However, the constraint zi =1isnot wi α −α Df [z : w]= zif . (101) linear in the flat coordinates r or r except for the zi i=1 case α = ±1. Hence, although M is flat for any α,the manifold P of probability distributions is not dually flat We now introduce a new coordinate system defined by except for α = ±1, and it is a curved submanifold in M. (α) ri = kα (zi) , (102) Hence, the α-divergences (α = ±1) do not belong to the class of Bregman divergences in P . This proves that the where the α-representation of zi is given by KL-divergence belongs to both classes of divergences in ⎧ ⎨ 2 1−α P and is unique. offprintu 2 − 1 ,α=1 , 1 − α The α-divergences are used in many applications, e.g., kα(u)=⎩ (103) log u, α =1. Refs. [21,25,26]. Theorem 4 The α-divergence is the unique class of r Then, the α-coordinates α of M are defined by divergences sitting at the intersection of the f-divergence (α) and Bregman divergence classes. ri = kα (zi) . (104) Proof The f-divergence is of the form (45) and the We use a convex function Uα(r)ofr defined by decomposable Bregman divergence is of the form 2 −1 ∗ ∗ Uα(r)= kα (r). (105) D[z : y]= U˜ (zi)+ U˜ (yi) − ri (zi) r (yi) , 1+α i (111) In terms of z, this is a linear function, where 2 U˜(z)=U(k(z)), (112) Uα {r(z)} = z, (106) 1+α and so on. When they are equal, we have f, r,andr∗ which is not a (strictly) convex function of z but is con- that satisfy vex with respect to r.Theα-potential function defined y zf = r(z)r∗(y) (113) by z 2 −1 except for additive terms depending only on z or y.Dif- ψα(r)= k (ri) offprint1+α α ferentiating the above by y,weget 2 − 1−α 2 1 α y  = 1+ ri (107) f  = r(z)r∗ (y). (114) 1+α 2 z


By putting x = y, y =1/z,weget By changing the above into the standard form, we arrive at the α-divergence. 1  f (xy)=r r∗ (x). (115) y 2.6 β-divergence Hence, by putting h(u)=logf (u), we have

h(xy)=s(x)+t(y), (116) It is useful to introduce a family of β-divergences, which  where s(x)=logr∗ (x), t(y)=r(1/y). By differenti- are not invariant but dually flat, having the structure of ating the above equation with respect to x and putting Bregman divergences. Thisisusedinthefieldofma- x =1,weget chine learning and statistics [28,29]. c Use z itself as a coordinate system of M;thatis,the h(y)= . (117) y representation function is k(z)=z.Theβ-divergence [30] is induced by the potential function This proves that f is of the form ⎧ ⎧ ⎨ 1 β+1 1+α (1 + βz) β ,β>0, ⎪ cu 2 ,α= ±1, β +1 ⎨ Uβ(z)=⎩ (119) exp z, β =0. f(u)=⎪ cu log u, α =1, (118) ⎩⎪ c log u, α = −1. The β-divergence is thus written as

    Dβ[z : y] =
      Σi { (yi^{β+1} − zi^{β+1})/(β + 1) − zi (yi^β − zi^β)/β },   β > 0,
      Σi { zi log(zi/yi) + yi − zi },   β = 0.   (120)
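A direct implementation of (120) (ours, with arbitrary positive test arrays) also shows the continuity of the family at β → 0:

import numpy as np

def beta_divergence(z, y, beta):
    # beta-divergence between positive arrays, cf. (120)
    z, y = np.asarray(z, float), np.asarray(y, float)
    if beta == 0:
        return np.sum(z * np.log(z / y) + y - z)
    return np.sum((y ** (beta + 1) - z ** (beta + 1)) / (beta + 1)
                  - z * (y ** beta - z ** beta) / beta)

z = np.array([0.5, 1.2, 0.8])
y = np.array([0.9, 0.4, 1.1])

print(beta_divergence(z, y, 1e-4), beta_divergence(z, y, 0))     # small beta approaches the KL case
print(beta_divergence(z, y, 0.5) >= 0, beta_divergence(z, z, 0.5))  # nonnegative, zero only at y = z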

It is the KL-divergence when β = 0, but it is different a statistical model to which the true distribution is as- from the α-divergence when β>0. sumed to belong [8]. The EM algorithm is popular in Minami and Eguchi [30] demonstrated that statisti- statistics, machine learning and others [7,31,32]. cal inference based on the β-divergence (β>0) is ro- bust. Such idea has been applied to machine learning in 3.1 Geometry of alternative minimization Murata et al. [29]. The β-divergence induces a dually z flat structure in M. Since its flat coordinates are ,the When a divergence function D[z : w] is given in a man- restriction ifold S, we consider two submanifolds M and E,where offprinti z ∈ w ∈ z = 1 (121) we assume M and E. The divergence of two submanifolds M and E is given by is a linear constraint. Hence, the manifold P of the z probability distributions is also dually flat, where , D[M : E]= min D[z : w]=D [z∗ : w∗] , (123) zi = 1, is its flat coordinates. The dual flat coor- z∈M,w∈E dinates are ∗ ∗ z ∈ w ∈ where M and E are a pair of the points 1 i β  that attain the minimum. When M and E intersect, ∗ (1 + βz ) ,β=0, ∗ ∗ zi = (122) D[M : E] = 0, and z = w lies at the intersection. exp zi,β=0, An alternative minimization is a procedure to calcu- z∗ w∗ depending on β. late ( , ) iteratively (see Fig. 7): 1) Take an initial point z0 ∈ M for t = 0. Repeat steps 2and3fort =0, 1, 2,... until convergence. 3 Information geometry of alternative 2) Calculate minimization wt =argminD zt : w . (124) w∈E Given a divergence function D[z : w]onaspaceS,we encounter the problem of minimizing D[z : w] under the 3) Calculate z ⊂ constraints that belongs to a submanifold M S and t+1 t+1 w ⊂ z =argminD z : w . (125) belongsoffprint to another submanifold E S.Atypical z∈M example is the EM algorithm [6], where M represents a manifold of partially observed data and E represents When the direct minimization with respect to one


Assume that vector random variable z is divided into two parts z =(h, x) and the values of x are observed but h are not. Non-observed random variables h =(hi) are called missing data or hidden variables. It is possible to eliminate unobservable h by summing up p(h, x)over all h, and we have a new statistical model, p˜(x, ξ)= p(h, x, ξ), (133) h

E˜ = {p˜(x; ξ)} . (134) Thisp ˜(x, ξ) is the marginal distribution of p(h, x; ξ). Fig. 7 Alternative minimization Based on observed x, we can estimate ξˆ, for example, by maximizing the likelihood, variable is computationally difficult, we may use a decre- mental procedure: ξˆ =argmaxp˜(x, ξ). (135) ξ t+1 t t t w = w − ηt∇wD z : w , (126) t+1 t t+1 t z ξ z = z − ηt∇zD z : w , (127) However, in many problems p( ; ) has a simple form whilep ˜(x; ξ) does not. The marginal distributionp ˜(x, ξ) ∇ ∇ where ηt is a learning constant and w and z are gra- has a complicated form and it is computationally diffi- w z dients with respect to and , respectively. cult to estimate ξ from x. It is much easier to calculate It should also be noted that the natural gradient (Rie- ξˆ when z is observed. The EM algorithm is a powerful mannian gradient of D) [33] is given by method used in such a case [6].

−1 ∂ The EM algorithm consists of iterative procedures to ∇˜ zD[z : w]=G (z) D[z : w], (128) ∂z use values of the hidden variables h guessed from ob- served x and the current estimator ξˆ.Letusassume where 2 ∂ that n iid data zt =(ht, xt), t =1,...,n, are generated G = D[z : w]|w=z. (129) ∂z∂z from a distribution p(z, ξ). If data ht are not missing, Hence, this becomes the Newton-type method. we have the following empirical distribution: 1 3.2 Alternative projections p¯(h, x h¯, x¯ )= δ (x − xt) δ (h − ht) , (136) n t

When D[z : w] is a Bregman divergence derived from a ¯ offprintwhere h and x¯ represent the sets of data (h1, h2,...) convex function ψ,thespaceS is equipped with a dually and (x1, x2,...)andδ is the delta function. When S w z w flat structure. Given , the minimization of D[ : ] is an exponential family,p ¯(h, x h¯, x¯ ) is the distribution z ∈ w with respect to M is given by the ψ-projection of given by the sufficient statistics composed of h¯, x¯ . to M, (ψ) The hidden variables are not observed in reality, and arg min D[z : w]= w. (130) the observed part x¯ cannot determine the empirical dis- z M tributionp ¯(x, y) (136). To overcome this difficulty, let This is unique when M is ψ∗-flat. On the other hand, us use a conditional probability q(h|x)ofh when x is given z, the minimization with respect to w is given by observed. We then have the ψ∗-projection, (ψ∗) p(h, x)=q(h|x)p(x). (137) arg min D[z : w]= z. (131) w M In the present case, the observed part x¯ determines the This is unique when M is ψ-flat. empirical distributionp ¯(x), 3.3 EM algorithm 1 p¯(x)= δ (x − xt) , (138) n t Let us consider a statistical model p(z; ξ) with random h|x variables z, parameterized by ξ. The set of all such dis- but the conditional part q( ) remains unknown. Tak- tributions E = {p(z; ξ)} forms a submanifold included ing this into account, we define a submanifold based on offprintthe observed data x¯, in the entire manifold of probability distributions S = {p(z)} . (132) M (x¯)={p(h, x) |p(h, x)=q(h|x)¯p(x)} , (139)


where q(h|x) is arbitrary. 1) Begin with an arbitrary initial guess ξ0 at t =0,and In the observable case, we have an empirical distribu- repeat the E-step and M-step for t =0, 1, 2,...,until tionp ¯(h, x) ∈ S from the observation, which does not convergence. usually belong to our model E. The maximum likeli- 2) E-step: Use the e-projection of p (h, x; ξt)toM (x¯) hood estimator ξˆ is the point in E that minimizes the to obtain KL-divergence fromp ¯ to E, qt(h|x)=p (h |x; ξt ) , (146) ξˆ h x x ξ and calculate the conditional expectation =argminKL[¯p( , ):p( , )] . (140) Lt(ξ)= p (h |xi; ξ )logp (h, xi; ξ) . (147) In the present case, we do not knowp ¯ but know M (x¯) t h,xi which includes the unknownp ¯(h, x). Hence, we search for ξˆ ∈ E andq ˆ(h|x) ∈ M (x¯) that minimizes 3) M-step: Calculate the maximizer of Lt(ξ), that is the m-projection of pt(h, x)toE, KL[M : E]. (141) ξt+1 =argmaxLt(ξ). (148) The pair of minimizers gives the points which define the divergence of two submanifolds M (x¯)andE.Themax- imum likelihood estimator is a minimizer of (141). Given partial observations x¯, the EM algorithm con- 4 Information geometry of Ying-Yang sists of the following interactive procedures, t =0, 1,.... machine We begin with an arbitrary initial guess ξ0 and q0(h|x), and construct a candidate pt (h, x)=qt(h|x)¯p (x) ∈ 4.1 Recognition model and generating model M = M (x¯)fort =0, 1, 2,.... Wesearchforthenext guess ξt+1 that minimizes KL [pt (h, x):p (h, x, ξ)]. We study the Ying-Yang mechanism proposed by Xu [9– This is the same as maximizing the pseudo-log-likelihood 14] from the information geometry point of view. Let us consider a hierarchical stochastic system, which consists Lt(ξ)= pt (h, xi)logp (h, xi; ξ) . (142) of two vector random variables x and y.Letx be a lower h,xi level random variable representing a primitive descrip- y This is called the M-step, that is the m-projection of tion of the world, while let be a higher level random variable representing a more conceptual or elaborated pt (h, x) ∈ M to E. Let the maximum value be ξ , t+1 x y which is the estimator at t +1. description. Obviously, and are closely correlated. One may consider them as information representations We then search for the next candidate pt (h, x) ∈ M in the brain. Sensory information activates neurons in that minimizes KL p (h, x):p h, x, ξt+1 . The min- a primitive sensory area of the brain, and its firing pat- imizer is denoted by pt+1 (h, x). This distribution is x x used tooffprint calculate the expectation of the new log likeli- tern is represented by .Componentsxi of can be binary variables taking values 0 and 1, or analog val- hood function Lt+1(ξ). Hence, this is called the E-step. The following theorem [8] plays a fundamental role in ues representing firing rates of neurons. The conceptual information activates a higher-order area of the brain, calculating the expectation with respect to pt(h, x). y Theorem 5 Along the e-geodesic of projecting with a firing pattern . h x ξ x A probability distribution p(x)ofx reflects the in- p ( , ; t)toM (¯), the conditional probability dis- formation structure of the outer world. When x is ob- tribution qt(h|x) is kept invariant. Hence, by the e- served, the brain processes this primitive information projection of p (h, x; ξt)toM, the resultant conditional distribution is given by and activates a higher-order area to recognize its higher- order representation y. Its mechanism is represented y|x qt (h |xi )=p (h |xi; ξt ) . 
(143) by a conditional probability distribution p( ). This is possibly specified by a parameter ξ,intheformof This shows that the e-projection is given by p(y|x; ξ). Then, the joint distribution is given by

qt(h, x)=p (h |x; ξt ) δ (x − x¯) . (144) p(y, x; ξ)=p(y|x; ξ)p(x). (149)
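The E-step/M-step alternation of this section can be made concrete with a toy model. The following sketch is ours and is not taken from the paper; the mixture model, the data and the initial values are arbitrary illustrative choices. The hidden variable h is the component label of a two-component Gaussian mixture; the E-step computes the conditional distribution of h as in (146), and the M-step maximizes the expected complete-data log likelihood (147), which has a closed form here:

import numpy as np

rng = np.random.default_rng(0)

# synthetic data from a two-component Gaussian mixture (unit variance); h is the hidden label
true_w, true_mu = 0.3, np.array([-2.0, 2.0])
h_true = rng.random(500) < true_w
x = np.where(h_true, true_mu[0], true_mu[1]) + rng.normal(size=500)

def gauss(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

# initial guess xi_0 = (w, mu_0, mu_1)
w, mu = 0.5, np.array([-1.0, 1.0])

for t in range(100):
    # E-step: responsibilities q_t(h | x_i) = p(h | x_i; xi_t), cf. (146)
    joint = np.stack([w * gauss(x, mu[0]), (1 - w) * gauss(x, mu[1])])
    q = joint / joint.sum(axis=0)
    # M-step: maximize L_t(xi) of (147); closed form for this model, cf. (148)
    w = q[0].mean()
    mu = (q @ x) / q.sum(axis=1)

print(w, mu)   # approaches the generating parameters, up to sampling noise

Each pass is one alternation between the data submanifold M(x̄) and the model E in the sense of Sect. 3.1.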

Hence, the theorem implies that the next expected log The parameters ξ would be synaptic weights of neurons likelihood is calculated by to generate pattern y. When we receive information x, it is transformed to higher-order conceptual information ξ h x ξ h x ξ Lt+1( )= p i; t+1 log p ( , i, ) , (145) y. This is the bottom-up process and we call this a offprinth,xi recognition model. which is to be maximized. Our brain can work in the reverse way. From concep- y x The EM algorithm is formally described as follows. tual information pattern , a primitive pattern will

THE AUTHORS WARRANT THAT THEY WILL NOT POST THE E-OFFPRINT OF THE PAPER ON PUBLIC WEBSITES. Shun-ichi AMARI. Information geometry in optimization, machine learning and statistical inference 253 be generated by the top-down neural mechanism. It is These relations hold for any dually flat divergence, gen- represented by a conditional probability q(x|y), which erated by a convex function ψ over S. will be parameterized by ζ,asq(x|y; ζ). Here, ζ will An iterative algorithm to realize the Ying-Yang har- represent the top-down neural mechanism. When the mony is, for t =0, 1, 2,..., probability distribution of y is q(y; ζ), the entire process (m) is represented by another joint probability distribution pt+1 = qt, (158) MR q(y, x; ζ)=q(y; ζ)q(x|y; ζ). (150) (e) qt+1 = pt+1. (159) MG We have thus two stochastic mechanisms p(y, x; ξ) and q(y, x; ζ). Let us denoted by S the set of all the This is a typical alternative minimization algorithm. joint probability distributions of x and y. The recogni- tion model 4.3 Various models of Ying-Yang structure MR = {p(y, x; ξ)} (151) forms its submanifold, and the generative model A Yang machine receives signal x from the outside, and processes it by generating a higher-order signal y,which { y x ζ } MG = q( , ; ) (152) is stochastically generated by using the conditional prob- y|x ξ forms another submanifold. ability p( , ). The performance of the machine is ξ This type of information mechanism is proposed by Xu improved by learning, where the parameter is mod- [9,14], and named the Ying-Yang machine. The Yang ified step-by-step. We give two typical examples of machine (male machine) is responsible for the recog- information processing; pattern classification and self- nition model {p(y, x; ξ)} and the Ying machine (fe- organization. male machine) is responsible for the generative model Learning of a pattern classifier is a supervised learning {q(y, x; ζ)}. These two models are different, but should process, where teacher signal z is given. be put in harmony (Ying-Yang harmony). 1. pattern classifier Let C = {C1,...,Cm} denotes the set of categories, x x 4.2 Ying-Yang harmony and each belongs to one of them. Given an ,ama- chineprocessesitandgivesanansweryκ,whereyκ, κ =1,...,m, correspond to one of the categories. In When a divergence function D [p(x, y):q(x, y)] is de- other words, y is a pattern representing Cκ.Theout- fined in S, we have the divergence between the two mod- κ put y is generated from x: els MR and MG, y f x ξ n D [MR : MG]=minD [p(y, x; ξ):q(y, x; ζ)] . (153) = ( , )+ , (160) ξ,ζ offprintn When MR and MG intersect, where is a zero-mean Gaussian noise with covariance matrix σ2I, I being the identity matrix. Then, the con- D [MR : MG] = 0 (154) ditional distribution is given by at the intersecting points. However, they do not inter- 1 2 p(y|x; ξ)=c exp − |y − f(x, ξ)| , (161) sect in general, 2σ2 MR ∩ MG = φ. (155) It is desirable that the recognition and generating mod- where c is the normalization constant. f x ξ els are closely related. The minimizer of D gives such a The function ( , ) is specified by a number of pa- ξ harmonized pair called the Ying-Yang harmony [14]. rameters . In the case of multilayer perceptron, it is A typical divergence is the Kullback-Leibler diver- given by gence KL[p : q], which has merits of being invariant and generating dually flat geometrical structure. In this yi = vij ϕ wjkxk − hj . (162) case, we can define the two projections, e-projection and j k m-projection. 
Let p∗(y, x)andq∗(y, x) be the pair of minimizers of D[p : q]. Then, the m-projection of q∗ to Here, wjk is the synaptic weight from input xk to the ∗ MR is p , jth hidden unit, hj is its threshold, and ϕ is a sigmoidal (m) ∗ ∗ function. The output of the jth hidden unit is hence q = p , (156) MR ϕ ( wjkxk − hj) and is connected linearly to the ith ∗ ∗ and theoffprinte-projection of p to MG is q , output unit with synaptic weight vij . (e) In the case of supervised learning, a teacher signal yκ ∗ ∗ p = q . (157) representing the category Cκ is given, when the input MG

THE AUTHORS WARRANT THAT THEY WILL NOT POST THE E-OFFPRINT OF THE PAPER ON PUBLIC WEBSITES. 254 Front. Electr. Electron. Eng. China 2010, 5(3): 241–260 pattern is x ∈ Cκ. Since the current output of the Yang to specify the distribution of x, and we have a parame- machine is f(x, ξ), the error of the machine is terized statistical model

1 2 { x|y } l (ξ, x, y )= |y − f(x, ξ)| . (163) MYing = p( ) , (167) κ 2 κ generated by the Ying mechanism. Here, the parameter Therefore, the stochastic gradient learning rule modifies y is a random variable, and hence the Bayesian frame- the current ξ to ξ +Δξ by work is introduced. The prior probability distribution of y ξ − ∂l is Δ = ε ξ , (164) ∂ p(y)= p(x, y)dx. (168) where ε is the learning constant and ∇l = The Bayesian framework to estimate y from observed ∂l ∂l ,..., is the gradient of l with respect to ξ. x, or more rigorously, a number of independent obser- ∂ξ1 ∂ξm vations, x1,...,xn, is as follows. Let x¯ =(x1,...,xn) Since the space of ξ is Riemannian, the natural gradient be observed data and learning rule [33] is given by n −1 Δξ = −εG ∇l, (165) p (x¯|y)= p (xi|y) . (169) i=1 where G is the Fisher information matrix derived from The prior distribution of y is p(y), but after observing the conditional distribution. x, the posterior distribution is 2. Non-supervised learning and clustering When signals x in the outside world are divided into p(x¯, y) p(y|x¯)= . (170) a number of clusters, it is expected that Yang machine p(x¯, y)dy reproduces such a clustering structure. An example of clustered signals is the Gaussian mixture This is the Yang mechanism, and the m y y|x x − 1 |x − y |2 ˆ =argmaxp( ¯) (171) p( )= wκ exp 2 κ , (166) y κ=1 2σ is the Bayesian posterior estimator. where yκ is the center of cluster Cκ.Givenx,onecan A full exponential family p(x, y)=exp{ yixi+ see which cluster it belongs to by calculating the output k(x)+w(y) − ψ(x, y)} is an interesting special case. y f x ξ n y = ( , )+ . Here, is an inner representation of Here, y is the natural parameter for the exponential fam- ξ the clusters and its calculation is controlled by .There ily of distributions, are a number of clustering algorithms. See, e.g., Ref. [9]. offprint We can use them to modify ξ. Since no teacher signals q(x|y)=exp yixi + k(x) − ψ(y) . (172) exist, this is unsupervised learning. Another typical non-supervised learning is imple- The family of distributions MYing = {q(x|y)} specified mented by the self-organization model given by Amari by parameters y is again an exponential family. It has and Takeuchi [34]. a dually flat Riemannian structure. The role of the Ying machine is to generate a pat- Dually to the above, the family of the posterior dis- x y tern ˜ from a conceptual information pattern .Since tributions form a manifold M = {p(y|x)},consisting x Yang the mechanism of generating ˜ is governed by the con- of x|y ζ ditional probability q( ; ), its parameter is modified y|x ξ ∗ to fit well the corresponding Yang machine p( , ). p(y|x)=exp xiyi + w(y) − ψ (y) . (173) As ξ changes by learning, ζ changes correspondingly. The Ying machine can be used to improve the primi- It is again an exponential family, where x (x¯ = xi/n) tive representation x by filling missing information and is the natural parameter. It defines a dually flat Rie- removing noise by using x˜. mannian structure, too. When the probability distribution p(y) of the Yang 4.4 Bayesian Ying-Yang machine machine includes a hyper parameter p(y)=p(y, ζ), we have a family q(x, y; ζ). It is possible that the Yang Let us consider joint probability distributions p(x, y)= machine have a different parametric family p(x, y; ξ). q(x, y)offprint which are equal. 
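The difference between the ordinary gradient rule (164) and the natural gradient rule (165) is easiest to see on a model whose Fisher information matrix is known in closed form. The following sketch is ours; the Gaussian model, the coordinates (μ, log σ) and all numerical choices are illustrative assumptions, not material from the paper:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=2.0, size=1000)   # data from N(1.5, 2^2)

def grad_nll(mu, s):
    # gradient of the average negative log-likelihood of N(mu, e^{2s})
    var = np.exp(2 * s)
    g_mu = -(x - mu).mean() / var
    g_s = 1.0 - ((x - mu) ** 2).mean() / var
    return np.array([g_mu, g_s])

def fisher(mu, s):
    # Fisher information matrix of the Gaussian in the coordinates (mu, s = log sigma)
    return np.diag([np.exp(-2 * s), 2.0])

mu, s = 0.0, 0.0
eps = 0.1
for t in range(100):
    xi = np.array([mu, s]) - eps * np.linalg.inv(fisher(mu, s)) @ grad_nll(mu, s)   # cf. (165)
    mu, s = xi
print(mu, np.exp(s))   # close to the sample mean and standard deviation

Replacing the matrix G by the identity gives the ordinary rule (164), which is much more sensitive to the scaling of the coordinates.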
Here, the Ying machine and the Then, we need to discuss the Ying mechanism and the Yang machine are identical, and hence the machines are Yang mechanism separately, keeping the Ying-Yang har- in perfect matching. Now consider y as the parameters mony.


This opens a new perspective to the Bayesian frame- This depends only on the marginal distribution of xi, work to be studied in future from the Ying-Yang stand-  qi (xi)= q (x1,...,xn) , (176) point. Information geometry gives a useful tool for ana-  lyzing it. where summation is taken over all x1,...,xn except for xi. The expectation is denoted by − − 5 Geometry of belief propagation ηi =E[xi]=Prob[xi =1] Prob [xi = 1] = qi(1) − qi(−1). (177)

A direct application of information geometry to machine When ηi > 0, Prob [xi =1] > Prob [xi = −1], so that, learning is studied here. When a number of correlated x˜i =1,otherwise˜xi = −1. Therefore, random variables z1,...,zN exist, we need to treat their x˜i =sgn(ηi) (178) joint probability distribution q (z1,...,zN ). Here, we as- sume for simplicity’s sake that zi is binary, taking values is the estimator that minimizes the number of errors in − the components of x˜,where 1and 1. Among all variables zi, some are observed and the others are not. Therefore, zi are divided into 1,u 0, two parts, {x1,...,xn} and {yn+1,...,yN },wherethe sgn(u)= (179) −1,u<0. values of y =(yn+1,...,yN ) are observed. Our prob- lem is to estimate the values of unobserved variables The marginal distribution qi (xi) is given in terms of x =(x1,...,xn) based on observed y. This is a typical ηi by 1+ηi problem of stochastic inference. We use the conditional Prob {xi =1} = . (180) probability distribution q(x|y) for estimating x. 2 Therefore, our problem reduces to calculation of ηi,the The graphical model or the Bayesian network is used expectation of xi. However, exact calculation is compu- to represent stochastic interactions among random vari- tationally heavy when n is large, because (176) requires ables [35]. In order to estimate x, the belief propagation n−1 summation over 2 x’s. Physicists use the mean field (BP) algorithm [15] uses a graphical model or a Bayesian approximation for this purpose [38]. The belief propa- network, and it provides a powerful algorithm in the gation is another powerful method of calculating it ap- field of artificial intelligence. However, its performance proximately by iterative procedures. is not clear when the underlying graph has loopy struc- ture. The present section studies this problem by using 5.2 Graphical structure and random Markov structure information geometry due to Ikeda, Tanaka and Amari [16,17]. The CCCP algorithm [36,37] is another power- x ful procedure for finding the BP solution. Its geometry Let us consider a general probability distribution p( ). is also studied. In the special case where there are no interactions among offprintx1,...,xn, they are independent, and the probability is written as the product of component distributions 5.1 Stochastic inference pi (xi). Since we can always write the probabilities, in the binary case, as We use a conditional probability q(x|y)toestimatethe h −h e i e i values of x.Sincey is observed and fixed, we hereafter pi(1) = ,pi(−1) = , (181) ψ ψ denote it simply by q(x), suppressing observed variables y, but it always means q(x|y), depending on y. where ψ = ehi + e−hi , (182) We have two estimators of x. One is the maximum likelihood estimator that maximizes q(x): we have pi (xi)=exp{hixi − ψ (hi)} . (183) x x ˆ mle =argmaxq( ). (174) Hence, x p(x)=exp hixi − ψ(h) , (184) x However, calculation of ˆ mle is computationally difficult where when n is large, because we need to search for the max- ψ(h)= ψ (hi) , (185) imum among 2n candidates x’s. Another estimator x˜ tries to minimize the expected number of errors of n ψi (hi)=log{exp (hi)+exp(−hi)} . (186) x component randomvariables˜x1,...,x˜n.Whenq( )is When x1,...,xn are not independent but their inter- known,offprint the expectation of xi is actions exist, we denote the interaction among k vari- · ··· · ables xi1 ,...,xik by a product term xi1 xik , E[xi]= xiq(x)= xiqi (xi) . (175) x · ··· · x c( )=xi1 xik . (187)
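For a small model the quantities (175)–(180) can be computed by brute force, which also makes clear why approximations are needed. This sketch is ours; the chain of pairwise interactions and all constants are arbitrary illustrative choices:

import numpy as np
from itertools import product

# a small model of the form (184)+(187) with binary x_i in {-1, +1} and pairwise cliques
n = 10
rng = np.random.default_rng(1)
h = rng.normal(scale=0.3, size=n)
pairs = [(i, i + 1) for i in range(n - 1)]      # a chain of pairwise interactions
s = {r: 0.5 for r in pairs}

def energy(x):
    return h @ x + sum(s[(i, j)] * x[i] * x[j] for (i, j) in pairs)

# exact marginals eta_i = E[x_i] by exhaustive summation over all 2^n states, cf. (175)-(177)
states = np.array(list(product([-1, 1], repeat=n)))
weights = np.exp([energy(x) for x in states])
Z = weights.sum()                               # the normalization factor exp(psi)
eta = (weights[:, None] * states).sum(axis=0) / Z
x_hat = np.sign(eta)                            # componentwise estimator (178)
print(eta)
print(x_hat)

The exhaustive sum runs over 2^n states, so it is feasible only for small n; the mean-field and belief propagation methods discussed below avoid it.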


5.2 Graphical structure and random Markov structure

Let us consider a general probability distribution p(x). In the special case where there are no interactions among x_1, ..., x_n, they are independent, and the probability is written as the product of the component distributions p_i(x_i). Since we can always write the probabilities, in the binary case, as

p_i(1) = \frac{e^{h_i}}{\psi}, \quad p_i(-1) = \frac{e^{-h_i}}{\psi},   (181)

where

\psi = e^{h_i} + e^{-h_i},   (182)

we have

p_i(x_i) = \exp\{h_i x_i - \psi_i(h_i)\}.   (183)

Hence,

p(x) = \exp\Big\{\sum_i h_i x_i - \psi(h)\Big\},   (184)

where

\psi(h) = \sum_i \psi_i(h_i),   (185)

\psi_i(h_i) = \log\{\exp(h_i) + \exp(-h_i)\}.   (186)

When x_1, ..., x_n are not independent but their interactions exist, we denote the interaction among k variables x_{i_1}, ..., x_{i_k} by a product term x_{i_1} \cdot x_{i_2} \cdots x_{i_k},

c(x) = x_{i_1} \cdot x_{i_2} \cdots x_{i_k}.   (187)

When there are many interaction terms, p(x) is written as

p(x) = \exp\Big\{\sum_i h_i x_i + \sum_r s_r c_r(x) - \psi\Big\},   (188)

where \psi is the normalization factor and c_r(x) represents a monomial interaction term such as

c_r(x) = x_{i_1} \cdot x_{i_2} \cdots x_{i_k},   (189)

where r is an index showing a subset {i_1, ..., i_k} of the indices {1, 2, ..., n}, k ≥ 2, and s_r is the intensity of such an interaction.

When all interactions are pairwise, k = 2, we have c_r(x) = x_i x_j for r = (i, j). To represent the structure of pairwise interactions, we use a graph G = {N, B}, in which we have n nodes N_1, ..., N_n representing the random variables x_1, ..., x_n and b branches B_1, ..., B_b. A branch B_r connects two nodes N_i and N_j, where r denotes the pair (x_i, x_j) of interaction. G is a non-directed graph having n nodes and b branches. When a graph has no loops, it is a tree graph; otherwise, it is a loopy graph. There are many physical and engineering problems having a graphical structure. They are, for example, spin glasses, error-correcting codes, etc.

Interactions are not necessarily pairwise, and there may exist many interactions among more than two variables. Hence, we consider the general case (188), where r includes {i_1, ..., i_k}, and b is the number of interactions. This is an example of the random Markov field, where each set r = {i_1, ..., i_k} forms a clique of the random graph.

We consider the following statistical model on the graph G or a random Markov field, specified by two vector parameters θ and v,

M = \{p(x, \theta, v)\},   (190)

p(x, \theta, v) = \exp\Big\{\sum_i \theta_i x_i + \sum_r v_r c_r(x) - \psi(\theta, v)\Big\}.   (191)

This is an exponential family, forming a dually flat manifold, where (θ, v) is its e-affine coordinates. Here, \psi(\theta, v) is a convex function from which a dually flat structure is derived. The model M includes the true distribution q(x) in (13), i.e.,

q(x) = \exp\{h \cdot x + s \cdot c(x) - \psi(h, s)\},   (192)

which is given by θ = h and v = s = (s_r). Our job is to find

\eta^* = E_q[x],   (193)

where the expectation E_q is taken with respect to q(x), and η is the dual (m-affine) coordinates corresponding to θ.

We consider b + 1 e-flat submanifolds M_0, M_1, ..., M_b in M. M_0 is the manifold of independent distributions given by p_0(x, θ) = exp{h·x + θ·x − ψ(θ)}, and its e-coordinates are θ. For each branch or clique r, we also consider an e-flat submanifold M_r,

M_r = \{p(x, \zeta_r)\},   (194)

p(x, \zeta_r) = \exp\{h \cdot x + s_r c_r(x) + \zeta_r \cdot x - \psi_r(\zeta_r)\},   (195)

where ζ_r is its e-affine coordinates. Note that a distribution in M_r includes only one nonlinear interaction term c_r(x). Therefore, it is computationally easy to calculate E_r[x] or Prob{x_i = 1} with respect to p(x, ζ_r). All of M_0 and M_r play proper roles for finding E_q[x] of the target distribution (192).

5.3 m-projection

Let us first consider the special submanifold M_0 of independent distributions. We show that the expectation η* = E_q[x] is given by the m-projection of q(x) ∈ M to M_0 [39,40]. We denote it by

\hat{p}(x) = \pi_0 \circ q(x).   (196)

This is the point in M_0 that minimizes the KL-divergence from q to M_0,

\hat{p}(x) = \arg\min_{p \in M_0} \mathrm{KL}[q : p].   (197)

It is known that \hat{p} is the point in M_0 such that the m-geodesic connecting q and \hat{p} is orthogonal to M_0, and it is unique, since M_0 is e-flat. The m-projection has the following property.

Theorem 6 The m-projection of q(x) to M_0 does not change the expectation of x.

Proof Let \hat{p}(x) = p(x, θ*) ∈ M_0 be the m-projection of q(x). By differentiating KL[q : p(x, θ)] with respect to θ, we have

\sum_{x} x\, q(x) - \frac{\partial}{\partial \theta} \psi_0(\theta^*) = 0.   (198)

The first term is the expectation of x with respect to q(x), and the second term is the η-coordinates of θ*, which is the expectation of x with respect to p_0(x, θ*). Hence, they are equal.
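As a small numerical check of this subsection (again only an illustrative sketch; it reuses q, states and numpy from the previous snippet), the m-projection of q onto M_0 is just the product of its marginals. The code verifies Theorem 6 (the projection leaves E[x] unchanged) and that any other point of M_0 has a larger KL-divergence from q, as in (197).

```python
# Continues the toy setup above (q, states, np).  The m-projection of q onto
# the independent manifold M0 is the product of the marginals q_i(x_i); by
# Theorem 6 it preserves E[x], and by Eq. (197) it minimizes KL[q : p] over M0.
def marginals(q, states):
    """Prob{x_i = 1} under q, obtained by summing q over the other variables."""
    return np.array([q[states[:, i] == 1].sum() for i in range(states.shape[1])])

def product_dist(p1, states):
    """Member of M0 with Prob{x_i = 1} = p1[i], evaluated on every joint state."""
    probs = np.where(states == 1, p1, 1 - p1)      # factorized componentwise probabilities
    return probs.prod(axis=1)

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

p1 = marginals(q, states)               # marginals of q
p_hat = product_dist(p1, states)        # m-projection of q onto M0

print("E_q[x]     :", np.round((states * q[:, None]).sum(axis=0), 3))
print("E_p_hat[x] :", np.round((states * p_hat[:, None]).sum(axis=0), 3))   # same values (Theorem 6)

other = product_dist(np.clip(p1 + 0.05, 1e-6, 1 - 1e-6), states)            # some other point of M0
print("KL[q:p_hat] =", round(kl(q, p_hat), 4), " <  KL[q:other] =", round(kl(q, other), 4))
```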


5.4 Clique manifold M_r

The m-projection of q(x) to M_0 is written as the product of the marginal distributions,

\pi_0 \circ q(x) = \prod_{i=1}^{n} q_i(x_i),   (199)

which is what we want to obtain. Its coordinates are given by θ*, from which we easily have η* = E_{θ*}[x], since M_0 consists of independent distributions. However, it is difficult to calculate the m-projection, or to obtain θ* or η* directly. Physicists use the mean field approximation for this purpose. Belief propagation is another method of obtaining an approximation of θ*. Here, the nodes and branches (cliques) play an important role.

Since the difficulty of calculation arises from the existence of a large number of branches or cliques B_r, we consider a model M_r which includes only one branch or clique B_r. Since a branch (clique) manifold M_r includes only one nonlinear term c_r(x), it is written as

p(x, \zeta_r) = \exp\{h \cdot x + s_r c_r(x) + \zeta_r \cdot x - \psi_r\},   (200)

where ζ_r is a free vector parameter. Comparing this with q(x), we see that the sum of all the nonlinear terms

\sum_{r' \neq r} s_{r'} c_{r'}(x)   (201)

except for s_r c_r(x) is replaced by a linear term ζ_r · x. Hence, by choosing an adequate ζ_r, p(x, ζ_r) is expected to give the same expectation E_q[x] as q(x), or a very good approximation of it. Moreover, M_r includes only one nonlinear term, so that it is easy to calculate the expectation of x with respect to p(x, ζ_r). In the following algorithm, all branch (clique) manifolds M_r cooperate iteratively to give a good entire solution.

5.5 Belief propagation

We have the true distribution q(x), b clique distributions p_r(x, ζ_r) ∈ M_r, r = 1, ..., b, and an independent distribution p_0(x, θ) ∈ M_0. All of them join forces to search for a good approximation \hat{p}(x) ∈ M_0 of q(x).

Let ζ_r be the current solution which M_r believes to give a good approximation of q(x). We then project it to M_0, giving the equivalent solution θ_r in M_0 having the same expectation,

p_0(x, \theta_r) = \pi_0 \circ p_r(x, \zeta_r).   (202)

We abbreviate this as

\theta_r = \pi_0 \circ \zeta_r.   (203)

The θ_r specifies the independent distribution equivalent to p_r(x, ζ_r) in the sense that they give the same expectation of x. Since θ_r includes both the effect of the single nonlinear term s_r c_r(x) specific to M_r and that due to the linear term ζ_r in (200),

\xi_r = \theta_r - \zeta_r   (204)

represents the effect of the single nonlinear term s_r c_r(x). This is the linearized version of s_r c_r(x). Hence, M_r knows that the linearized version of s_r c_r(x) is ξ_r, and broadcasts to all the other models M_0 and M_{r'} (r' ≠ r) that the linear counterpart of s_r c_r(x) is ξ_r. Receiving these messages from all M_r, M_0 guesses that the equivalent linear term of \sum_r s_r c_r(x) will be

\theta = \sum_r \xi_r.   (205)

Since M_r in turn receives the messages ξ_{r'} from all other M_{r'}, M_r uses them to form a new ζ_r,

\zeta_r = \sum_{r' \neq r} \xi_{r'},   (206)

which is the linearized version of \sum_{r' \neq r} s_{r'} c_{r'}(x). This process is repeated.

The above algorithm is written as follows, where the current candidates θ^t ∈ M_0 and ζ_r^t ∈ M_r at time t, t = 0, 1, ..., are renewed iteratively until convergence [17].

Geometrical BP Algorithm:

1) Put t = 0, and start with initial guesses ζ_r^0, for example, ζ_r^0 = 0.

2) For t = 0, 1, 2, ..., m-project p_r(x, ζ_r^t) to M_0, and obtain the linearized version ξ_r^{t+1} of s_r c_r(x),

\xi_r^{t+1} = \pi_0 \circ p_r(x, \zeta_r^t) - \zeta_r^t.   (207)

3) Summarize all the effects of the s_r c_r(x), to give

\theta^{t+1} = \sum_r \xi_r^{t+1}.   (208)

4) Update ζ_r by

\zeta_r^{t+1} = \sum_{r' \neq r} \xi_{r'}^{t+1} = \theta^{t+1} - \xi_r^{t+1}.   (209)

5) Stop when the algorithm converges.
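The following sketch runs the Geometrical BP Algorithm above on the same toy model (it reuses h, J, n, states, q and numpy from the earlier snippets; the helper names are invented for the demo). For the independent model p_0(x, θ) we have E[x_i] = tanh(h_i + θ_i), so the m-projection π_0 of any distribution can be read off from its expectations. The clique expectations are computed by brute force here purely for clarity; the point of M_r is that this step is cheap because only one nonlinear term is present.

```python
# Sketch of the Geometrical BP Algorithm (steps 1-5) on the toy model above.
# pi_0 in theta-coordinates: theta_i = atanh(E[x_i]) - h_i, because the
# independent model p_0(x, theta) has E[x_i] = tanh(h_i + theta_i).
def theta_of(p):
    """m-projection of the distribution p (given on all states) to M0, in theta-coordinates."""
    m = (states * p[:, None]).sum(axis=0)                    # E_p[x]
    return np.arctanh(np.clip(m, -0.999999, 0.999999)) - h

def clique_dist(r, zeta):
    """p_r(x, zeta_r) proportional to exp{h.x + s_r x_i x_j + zeta_r.x}, Eq. (200)."""
    (i, j), s = r, J[r]
    w = np.exp(states @ (h + zeta) + s * states[:, i] * states[:, j])
    return w / w.sum()

cliques = list(J.keys())
zeta = {r: np.zeros(n) for r in cliques}                     # step 1): zeta_r^0 = 0

for t in range(100):                                         # steps 2)-4), iterated
    xi = {r: theta_of(clique_dist(r, zeta[r])) - zeta[r] for r in cliques}   # Eq. (207)
    theta = sum(xi.values())                                                 # Eq. (208)
    new_zeta = {r: theta - xi[r] for r in cliques}                           # Eq. (209)
    if max(np.max(np.abs(new_zeta[r] - zeta[r])) for r in cliques) < 1e-10:
        zeta = new_zeta
        break                                                # step 5): converged
    zeta = new_zeta

eta_bp = np.tanh(h + theta)                                  # approximate E[x_i] from p_0(x, theta)
eta_exact = (states * q[:, None]).sum(axis=0)
print("BP estimate:", np.round(eta_bp, 3))
print("exact eta  :", np.round(eta_exact, 3))   # may differ slightly, since the graph is loopy
```

On a tree the two printed vectors agree exactly; on this loopy toy graph the estimate is only an approximation, which is the situation analyzed in the next subsection.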


5.6 Analysis of the solution of the algorithm

Let us assume that the algorithm has converged to {ζ_r*} and θ*. Then, our solution is given by p_0(x, θ*), from which

\eta_i^* = E_0[x_i]   (210)

is easily calculated. However, this might not be the exact solution but only an approximation. Therefore, we need to study its properties. To this end, we study the relations among the true distribution q(x), the converged clique distributions p_r^* = p_r(x, ζ_r*) and the converged marginalized distribution p_0^* = p_0(x, θ*). They are written as

q(x) = \exp\Big\{h \cdot x + \sum_r s_r c_r(x) - \psi_q\Big\},   (211)

p_0(x, \theta^*) = \exp\Big\{h \cdot x + \sum_r \xi_r^* \cdot x - \psi_0\Big\},   (212)

p_r(x, \zeta_r^*) = \exp\Big\{h \cdot x + \sum_{r' \neq r} \xi_{r'}^* \cdot x + s_r c_r(x) - \psi_r\Big\}.   (213)

The convergence point satisfies the two conditions:

1. m-condition: \theta^* = \pi_0 \circ p_r(x, \zeta_r^*);   (214)

2. e-condition: \theta^* = \frac{1}{b-1} \sum_{r=1}^{b} \zeta_r^*.   (215)

Let us consider the m-flat submanifold M* connecting p_0(x, θ*) and all of the p_r(x, ζ_r*),

M^* = \Big\{p(x) \,\Big|\, p(x) = d_0 p_0(x, \theta^*) + \sum_r d_r p_r(x, \zeta_r^*);\ d_0 + \sum_r d_r = 1\Big\}.   (216)

This is a mixture family composed of p_0(x, θ*) and the p_r(x, ζ_r*). We also consider the e-flat submanifold E* connecting p_0(x, θ*) and the p_r(x, ζ_r*),

E^* = \Big\{p(x) \,\Big|\, \log p(x) = v_0 \log p_0(x, \theta^*) + \sum_r v_r \log p_r(x, \zeta_r^*) - \psi\Big\}.   (217)

Theorem 7 The m-condition implies that M* intersects all of M_0 and M_r orthogonally. The e-condition implies that E* includes the true distribution q(x).

Proof The m-condition guarantees that the m-projection of p_r^* to M_0 is p_0^*. Since the m-geodesic connecting p_r^* and p_0^* is included in M* and the expectations are equal,

\eta_r^* = \eta_0^*,   (218)

and M* is orthogonal to M_0 and M_r. The e-condition is easily shown by putting v_0 = -(b-1), v_r = 1, because

q(x) = c\, p_0(x, \theta^*)^{-(b-1)} \prod_r p_r(x, \zeta_r^*)   (219)

from (211)-(213), where c = e^{-\psi_q}.

The algorithm searches for θ* and ζ_r* until both the m- and e-conditions are satisfied. If M* includes q(x), its m-projection to M_0 is p_0(x, θ*). Hence, θ* gives the true solution, but this is not guaranteed. Instead, E* includes q(x). The following theorem is known and easy to prove.

Theorem 8 When the underlying graph is a tree, both E* and M* include q(x), and the algorithm gives the exact solution.

5.7 CCCP procedure for belief propagation

The above algorithm is essentially the same as the BP algorithm given by Pearl [15] and studied by many followers. It is an iterative search procedure for finding M* and E*. In steps 2)-5), it uses the m-projection to obtain new ζ_r's, until the m-condition is satisfied eventually. The m-projections of all M_r become identical when the m-condition is satisfied. On the other hand, in each step of 3), we obtain θ ∈ M_0 such that the e-condition is satisfied. Hence, throughout the procedure, the e-condition is satisfied.

We have a different type of algorithm, which searches for a θ that eventually satisfies the e-condition, while the m-condition is always satisfied. Such an algorithm is proposed by Yuille [36] and Yuille and Rangarajan [37]. It consists of the following two steps, beginning with an initial guess θ^0.

CCCP Algorithm:

Step 1 (inner loop): Given θ^t, calculate ζ_r^{t+1} by solving

\pi_0 \circ p_r(x, \zeta_r^{t+1}) = b\theta^t - \sum_{r'} \zeta_{r'}^{t+1}.   (220)

Step 2 (outer loop): Given ζ_r^{t+1}, calculate θ^{t+1} by

\theta^{t+1} = b\theta^t - \sum_r \zeta_r^{t+1}.   (221)

The inner loop uses an iterative procedure to obtain the new ζ_r^{t+1} in such a way that they satisfy the m-condition, given the old θ^t. Hence, the m-condition is always satisfied, and we search for a new θ^{t+1} until the e-condition is satisfied. We may simplify the inner loop by

\pi_0 \circ p_r(x, \zeta_r^{t+1}) = p_0(x, \theta^t),   (222)

which can be solved directly. This gives a computationally easier procedure. There are some differences in the basins of attraction of the original and simplified procedures.

We have formulated a number of algorithms for stochastic reasoning in terms of information geometry. This not only clarifies the procedures intuitively, but also makes it possible to analyze the stability of the equilibrium and the speed of convergence. Moreover, we can estimate the error of estimation by using the curvatures of E* and M*. A new type of free energy is also defined. See Refs. [16,17] for more details.
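Below is a rough sketch of the simplified double-loop procedure, that is, the decoupled inner step (222) followed by the outer update (221), on the same toy setup. It reuses h, n, cliques, theta_of, clique_dist and numpy from the BP sketch; scipy.optimize.root is used merely as a convenient black-box solver for the small inner equations, which the paper does not prescribe. Convergence on a given loopy graph is not guaranteed for this simplified variant, so the snippet simply reports the estimate and the residual it reaches.

```python
# Simplified CCCP-style double loop: the inner step solves Eq. (222) for each
# clique separately (m-condition with the old theta^t), and the outer step
# applies Eq. (221).  A fixed point satisfies both the m- and e-conditions.
from scipy.optimize import root

b = len(cliques)

def solve_inner(r, theta_t, zeta_start):
    """Find zeta_r with  pi_0(p_r(x, zeta_r)) = theta_t,  Eq. (222)."""
    sol = root(lambda z: theta_of(clique_dist(r, z)) - theta_t, zeta_start)
    return sol.x

theta = np.zeros(n)
zeta = {r: np.zeros(n) for r in cliques}
for t in range(50):
    zeta = {r: solve_inner(r, theta, zeta[r]) for r in cliques}     # inner loop, Eq. (222)
    theta_new = b * theta - sum(zeta.values())                      # outer loop, Eq. (221)
    residual = np.max(np.abs(theta_new - theta))                    # e-condition gap
    theta = theta_new
    if residual < 1e-8:
        break

print("CCCP estimate of E[x]     :", np.round(np.tanh(h + theta), 3))
print("final e-condition residual:", residual)
```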

6 Conclusions

We have given the dual geometrical structure in manifolds of probability distributions, positive measures or arrays, matrices, tensors and others. The dually flat structure is derived from a convex function in general, which gives a Riemannian metric and a pair of dual flatness criteria. The information invariance and the dually flat structure are explained by using divergence functions, without rigorous differential-geometrical terminology. The dual geometrical structure is applied to various engineering problems, which include statistical inference, machine learning, optimization, signal processing and neural networks. Applications to alternative optimization, the Ying-Yang machine and belief propagation are shown from the geometrical point of view.

References

1. Amari S, Nagaoka H. Methods of Information Geometry. New York: Oxford University Press, 2000
2. Csiszár I. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 1967, 2: 299–318
3. Bregman L. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 1967, 7(3): 200–217
4. Eguchi S. Second order efficiency of minimum contrast estimators in a curved exponential family. The Annals of Statistics, 1983, 11(3): 793–803
5. Chentsov N N. Statistical Decision Rules and Optimal Inference. Rhode Island, USA: American Mathematical Society, 1982 (originally published in Russian, Moscow: Nauka, 1972)
6. Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977, 39(1): 1–38
7. Csiszár I, Tusnády G. Information geometry and alternating minimization procedures. Statistics and Decisions, 1984, Supplement Issue 1: 205–237
8. Amari S. Information geometry of the EM and em algorithms for neural networks. Neural Networks, 1995, 8(9): 1379–1408
9. Xu L. Bayesian Ying-Yang machine, clustering and number of clusters. Pattern Recognition Letters, 1997, 18(11–13): 1167–1178
10. Xu L. RBF nets, mixture experts, and Bayesian Ying-Yang learning. Neurocomputing, 1998, 19(1–3): 223–257
11. Xu L. Bayesian Kullback Ying-Yang dependence reduction theory. Neurocomputing, 1998, 22(1–3): 81–111
12. Xu L. BYY harmony learning, independent state space, and generalized APT financial analyses. IEEE Transactions on Neural Networks, 2001, 12(4): 822–849
13. Xu L. Best harmony, unified RPCL and automated model selection for unsupervised and supervised learning on Gaussian mixtures, three-layer nets and ME-RBF-SVM models. International Journal of Neural Systems, 2001, 11(1): 43–69
14. Xu L. BYY harmony learning, structural RPCL, and topological self-organizing on mixture models. Neural Networks, 2002, 15(8–9): 1125–1151
15. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann, 1988
16. Ikeda S, Tanaka T, Amari S. Information geometry of turbo and low-density parity-check codes. IEEE Transactions on Information Theory, 2004, 50(6): 1097–1114
17. Ikeda S, Tanaka T, Amari S. Stochastic reasoning, free energy, and information geometry. Neural Computation, 2004, 16(9): 1779–1810
18. Csiszár I. Information measures: A critical survey. In: Transactions of the 7th Prague Conference. 1974, 83–86
19. Csiszár I. Axiomatic characterizations of information measures. Entropy, 2008, 10(3): 261–273
20. Ali M S, Silvey S D. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 1966, 28(1): 131–142
21. Amari S. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Transactions on Information Theory, 2009, 55(11): 4925–4931
22. Cichocki A, Zdunek R, Phan A H, Amari S. Nonnegative Matrix and Tensor Factorizations. John Wiley, 2009
23. Havrda J, Charvát F. Quantification method of classification process: Concept of structural α-entropy. Kybernetika, 1967, 3: 30–35
24. Chernoff H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 1952, 23(4): 493–507
25. Matsuyama Y. The α-EM algorithm: Surrogate likelihood maximization using α-logarithmic information measures. IEEE Transactions on Information Theory, 2002, 49(3): 672–706
26. Amari S. Integration of stochastic models by minimizing α-divergence. Neural Computation, 2007, 19(10): 2780–2796
27. Amari S. Information geometry and its applications: Convex function and dually flat manifold. In: Nielsen F, ed. Emerging Trends in Visual Computing. Lecture Notes in Computer Science, Vol 5416. Berlin: Springer-Verlag, 2009, 75–102
28. Eguchi S, Copas J. A class of logistic-type discriminant functions. Biometrika, 2002, 89(1): 1–22
29. Murata N, Takenouchi T, Kanamori T, Eguchi S. Information geometry of U-boost and Bregman divergence. Neural Computation, 2004, 16(7): 1437–1481
30. Minami M, Eguchi S. Robust blind source separation by beta-divergence. Neural Computation, 2002, 14(8): 1859–1886
31. Byrne W. Alternating minimization and Boltzmann machine learning. IEEE Transactions on Neural Networks, 1992, 3(4): 612–620
32. Amari S, Kurata K, Nagaoka H. Information geometry of Boltzmann machines. IEEE Transactions on Neural Networks, 1992, 3(2): 260–271
33. Amari S. Natural gradient works efficiently in learning. Neural Computation, 1998, 10(2): 251–276
34. Amari S, Takeuchi A. Mathematical theory on formation of category detecting nerve cells. Biological Cybernetics, 1978, 29(3): 127–136
35. Jordan M I. Learning in Graphical Models. Cambridge, MA: MIT Press, 1999
36. Yuille A L. CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 2002, 14(7): 1691–1722
37. Yuille A L, Rangarajan A. The concave-convex procedure. Neural Computation, 2003, 15(4): 915–936
38. Opper M, Saad D. Advanced Mean Field Methods—Theory and Practice. Cambridge, MA: MIT Press, 2001


39. Tanaka T. Information geometry of mean-field approximation. Neural Computation, 2000, 12(8): 1951–1968
40. Amari S, Ikeda S, Shimokawa H. Information geometry and mean field approximation: The α-projection approach. In: Opper M, Saad D, eds. Advanced Mean Field Methods—Theory and Practice. Cambridge, MA: MIT Press, 2001, 241–257

Shun-ichi Amari was born in Tokyo, Japan, on January 3, 1936. He graduated from the Graduate School of the University of Tokyo in 1963, majoring in mathematical engineering, and received the Degree of Doctor of Engineering.

He worked as an Associate Professor at Kyushu University and the University of Tokyo, and then as a Full Professor at the University of Tokyo, and is now Professor-Emeritus. He moved to RIKEN Brain Science Institute, served as Director for five years, and is now Senior Advisor. He has been engaged in research in wide areas of mathematical science and engineering, such as topological network theory, differential geometry of continuum mechanics, pattern recognition, and information sciences. In particular, he has devoted himself to mathematical foundations of neural networks, including statistical neurodynamics, dynamical theory of neural fields, associative memory, self-organization, and general learning theory. Another main subject of his research is information geometry, initiated by himself, which applies modern differential geometry to statistical inference, information theory, control theory, stochastic reasoning, and neural networks, providing a new powerful method to information sciences and probability theory.

Dr. Amari is past President of the International Neural Networks Society and of the Institute of Electronics, Information and Communication Engineers, Japan. He received the Emanuel R. Piore Award and the Neural Networks Pioneer Award from the IEEE, the Japan Academy Award, the C&C Award and the Caianiello Memorial Award. He was the founding co-editor-in-chief of Neural Networks, among many other journals.

