Divergence Geometry
2015 Workshop on High-Dimensional Statistical Analysis
Dec. 11 (Friday) - 15 (Tuesday), Humanities and Social Sciences Center, Academia Sinica, Taiwan

Information Geometry and Spontaneous Data Learning
Shinto Eguchi, The Institute of Statistical Mathematics, Japan
This talk is based on joint work with Osamu Komori and Atsumi Ohara, University of Fukui.

Outline
- A short review of information geometry
- Kolmogorov-Nagumo mean and the φ-path in a function space
- Generalized mean and variance
- φ-divergence geometry
- U-divergence geometry
- Minimum U-divergence and density estimation

A short review of IG
- The nonparametric space of density functions and the space of statistics.
- Information geometry is discussed on the product space of the two.

Bartlett's identity
- Parametric model; Bartlett's first identity; Bartlett's second identity.

Metric and connections
- Information metric; mixture connection; exponential connection.
- Rao (1945), Dawid (1975), Amari (1982)

Geodesic curves and surfaces in IG
- m-geodesic curve and e-geodesic curve; m-geodesic surface and e-geodesic surface.

Kullback-Leibler divergence
1. The KL divergence is the expected log-likelihood ratio.
2. Maximum likelihood is minimum KL divergence. Akaike (1974)
3. The KL divergence induces the m-connection and the e-connection. Eguchi (1983)

Pythagoras Thm
- Amari-Nagaoka (2001)

Exponential model
- Exponential model and its mean parameter. Amari (1982)
- Degenerate Bartlett identity.

Minimum KL leaf
- Exponential model; mean-equal space.

Pythagoras foliation

log & exp
- Bartlett identities; KL-divergence
- e-connection and e-geodesic; m-connection and m-geodesic
- Pythagoras identity; exponential model
- Pythagoras foliation; mean-equal space

Path geometry
- { m-geodesic, e-geodesic, ... }

Kolmogorov-Nagumo mean
- The K-N mean of positive numbers $x_1, \dots, x_n$ with a monotone generator $\phi$ is $\phi^{-1}\bigl(\tfrac{1}{n}\sum_{i=1}^{n}\phi(x_i)\bigr)$.

K-N mean in Y
- Def. The K-N mean in the function space Y. Cf. Naudts (2009)

φ-path
- Def. The φ-path connecting f and g: $f_t = \phi^{-1}\bigl((1-t)\,\phi(f) + t\,\phi(g)\bigr)$, $0 \le t \le 1$.

Examples of φ-path
- Exm 0: $\phi(t) = t$ gives the mixture path (m-geodesic).
- Exm 1: $\phi(t) = \log t$ gives the exponential path (e-geodesic).
- Exm 2: the power function $\phi_\beta$ (cf. the power-entropy example below).
- Exm 3.
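The following is a minimal numerical sketch of the K-N mean and the φ-path just defined, assuming NumPy and the power generator $\phi_\beta(t) = t^\beta$ suggested by the power-entropy example later in the talk; the function names are illustrative choices for this sketch, not from the talk itself.

```python
import numpy as np

def kn_mean(x, phi, phi_inv):
    """Kolmogorov-Nagumo mean: phi^{-1}((1/n) * sum_i phi(x_i))."""
    return phi_inv(np.mean(phi(x)))

def phi_path(f, g, t, phi, phi_inv):
    """phi-path connecting f and g: phi^{-1}((1-t)*phi(f) + t*phi(g))."""
    return phi_inv((1.0 - t) * phi(f) + t * phi(g))

# Power generator phi_beta(t) = t**beta (an assumed concrete choice).
beta = 0.5
phi_b = lambda u: u**beta
phi_b_inv = lambda u: u**(1.0 / beta)

x = np.array([1.0, 2.0, 4.0])
print(kn_mean(x, phi_b, phi_b_inv))            # between geometric and arithmetic
print(kn_mean(x, np.log, np.exp))              # phi = log: geometric mean
print(kn_mean(x, lambda u: u, lambda u: u))    # phi = identity: arithmetic mean

# On discrete densities, phi = identity traces the mixture path (m-geodesic),
# while phi = log traces the exponential path (e-geodesic, up to normalization).
f = np.array([0.2, 0.3, 0.5])
g = np.array([0.5, 0.3, 0.2])
h = phi_path(f, g, 0.5, np.log, np.exp)
print(h / h.sum())                             # normalized e-geodesic midpoint
```

For φ = log the K-N mean reduces to the geometric mean, matching the slides' remark that φ = log recovers the exponential (KL) picture, while φ = identity recovers the mixture picture.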
Identities of φ-density
- Model; first identity; second identity.

Generalized mean and variance
- Def. The φ-mean $E_\phi(\cdot)$ and the φ-variance $\mathrm{Cov}_\phi(\cdot)$.

φ-Bartlett identity
- Model; φ-Bartlett identities.

Tangent space of Y
- Tangent space and Riemannian metric; the expectation gives the tangent space.
- Topological properties of $T_f$ depend on φ. If φ = log, then $T_f$ is too large to do statistics on Y. Cf. Pistone (1992)

Parallel transport
- Def. A vector field is parallel along a curve; a curve is a φ-geodesic. Cf. Amari (1982)

φ-geodesic
- Thm. Characterization of the φ-geodesic curve.

φ-divergence
- φ-cross entropy, φ-entropy, and φ-divergence.
- Note: the φ-divergence is the KL-divergence if φ = log.

Divergence geometry
- Def. Let M be a statistical model. A divergence on M induces a Riemannian metric on M and a pair of affine connections on M.

φ-divergence geometry
- The metric and the pair of affine connections induced by the φ-divergence.

φ-Pythagorean theorem
- Thm. When the φ-geodesic connecting f and h meets the φ*-geodesic connecting h and g orthogonally at h, the Pythagoras identity holds for the φ-divergence.

φ-Pythagorean foliation
- φ-mean equal space.

φ-mean
- φ-Bartlett identities; φ-divergence
- φ-connection and φ-geodesic; φ*-connection and φ*-geodesic
- φ-Pythagoras identity; φ-model
- φ-Pythagoras foliation; φ-mean equal space
- What is the statistical meaning of the φ-mean and the φ-variance?

U-divergence
- U-cross-entropy, U-entropy, and U-divergence.
- Note: the U-divergence is the KL-divergence if U = exp.

U-divergence geometry
- The metric associated with the U-divergence; the affine connections associated with the U-divergence.
- Thm. (i) The U-geodesic is the mixture geodesic. (ii) The U*-geodesic is the φ-geodesic.

U-geometry ≠ φ-geometry
- φ-metric on a model M vs. U-metric on a model M.
- φ-connection vs. U*-connection.

Triangle with D_U
- Thm. For the triangle with vertices f, g, h formed by a mixture geodesic and a φ-geodesic meeting at h, the Pythagoras identity holds for the U-divergence.

U-estimation
- Let g(x) be a data density function with a statistical model.
- U-loss function; U-empirical loss function; U-estimator.

U-estimator under the φ-model
- φ-model; U-empirical loss function; U-estimator.
- The U-estimator under the φ-model has an analogy with the MLE under the exponential model.

Potential function
- Def. The potential function on the φ-model; the mean parameter is defined through it. Cf. the mean parameter of the exponential model.
- Thm. The U-estimator for the mean parameter is given by the sample mean.

Pythagoras foliation

U-Boost learning for density estimation
- U-loss function: $L_U(f) = \int_{\mathbb{R}^p} U(\phi(f(x)))\,dx - \frac{1}{n}\sum_{i=1}^{n}\phi(f(x_i))$, with $\phi = (U')^{-1}$.
- Dictionary of density functions: $\mathcal{W} = \{\, g_\lambda(x) : g_\lambda(x) \ge 0,\ \int g_\lambda(x)\,dx = 1,\ \lambda \in \Lambda \,\}$.
- Learning space (φ-model): $\mathcal{W}_U^{*} = \phi^{-1}(\mathrm{co}(\phi(\mathcal{W}))) = \{\, \phi^{-1}(\sum_\lambda \pi_\lambda\,\phi(g_\lambda(x))) \,\}$.
- Let $f(x,\pi) = \phi^{-1}(\sum_\lambda \pi_\lambda\,\phi(g_\lambda(x)))$. Then $f(x,(0,\dots,1,\dots,0)) = g_\lambda(x)$.
- Goal: find $f^{*} = \arg\min_{f \in \mathcal{W}_U^{*}} L_U(f)$.

U-Boost algorithm
(A) Find $f_1 = \arg\min_{g \in \mathcal{W}} L_U(g)$.
(B) Update $f_k \mapsto f_{k+1} = \phi^{-1}\bigl((1-\pi_{k+1})\,\phi(f_k) + \pi_{k+1}\,\phi(g_{k+1})\bigr)$, where $(\pi_{k+1}, g_{k+1}) = \arg\min_{(\pi, g) \in (0,1)\times\mathcal{W}} L_U\bigl(\phi^{-1}((1-\pi)\,\phi(f_k) + \pi\,\phi(g))\bigr)$.
(C) Select $K$ and set $\hat f = \phi^{-1}\bigl((1-\pi_K)\,\phi(f_{K-1}) + \pi_K\,\phi(g_K)\bigr)$.

Example 2: power entropy
- $f^{*}(x) = \phi^{-1}\bigl(\sum_k \pi_k\,\phi(g_k(x))\bigr)$ with the power function $\phi_\beta$.
- If $\beta \to 0$: $f^{*}(x) \propto \exp\bigl(\sum_k \pi_k \log g_k(x)\bigr)$. Friedman et al (1984)
- If $\beta = 1$: $f^{*}(x) = \sum_k \pi_k\, g_k(x)$. Klemela (2007)

Inner step in the convex hull
- $\mathcal{W} = \{\, t(x,\lambda) : \lambda \in \Lambda \,\}$, $\mathcal{W}_U^{*} = \phi^{-1}(\mathrm{co}(\phi(\mathcal{W})))$.
- Goal: $\hat f = \arg\min_{f \in \mathcal{W}_U^{*}} L_U(f)$, where $\hat f(x) = \phi^{-1}\bigl(\hat\pi_1\,\phi(\hat f_1(x)) + \cdots + \hat\pi_{\hat K}\,\phi(\hat f_{\hat K}(x))\bigr)$.
- [Diagram: dictionary densities $g_1, \dots, g_7$ with $\hat f(x)$ and $f^{*}(x)$ inside $\mathcal{W}_U^{*}$, the convex hull taken in the φ-scale.]

Non-asymptotic bound
- Theorem. Assume that the data distribution has a density g(x) and that
  (A) $\sup\, U''(\tau)\,\{\phi(f) - \phi(g)\}^2 \le b_U$ over $(\tau, f, g) \in \mathrm{co}(\phi(\mathcal{W})) \times \mathcal{W} \times \mathcal{W}$.
  Then $\mathbb{E}_g\, D_U(g, \hat f_K) \le \mathrm{FA}(g, \mathcal{W}_U^{*}) + \mathrm{EE}(g, \mathcal{W}) + \mathrm{IE}(K, \mathcal{W})$, where
  $\mathrm{FA}(g, \mathcal{W}_U^{*}) = \inf_{f \in \mathcal{W}_U^{*}} D_U(g, f)$ (functional approximation),
  $\mathrm{EE}(g, \mathcal{W}) = 2\,\mathbb{E}_g\bigl\{\sup_{f \in \mathcal{W}} \bigl|\tfrac{1}{n}\sum_{i=1}^{n}\phi(f(x_i)) - \mathbb{E}_g\,\phi(f)\bigr|\bigr\}$ (estimation error),
  $\mathrm{IE}(K, \mathcal{W}) = \frac{c^2 b_U}{K}$, with c a step-length constant (iteration effect).
- Remark. There is a trade-off between $\mathrm{FA}(g, \mathcal{W}_U^{*})$ and $\mathrm{EE}(g, \mathcal{W})$.

β-Boost
- [Figure: density estimates by β-Boost, KDE, and RSDE (Girolami-He, 2004) for three test densities: C (skewed-unimodal), H (bimodal), and L (quadrimodal).]

Conclusion
- K-N mean and φ-path ⇒ $E_\phi(\cdot)$, $\mathrm{Cov}_\phi(\cdot)$, φ-geodesic.
- φ-divergence ⇒ $(G^{(\phi)}, \nabla^{(\phi)}, \nabla^{*(\phi)})$ ⇒ φ-geodesic and φ*-geodesic.
- U-divergence ⇒ $(G^{(U)}, \nabla^{(U)}, \nabla^{*(U)})$ ⇒ φ-geodesic and m-geodesic.
- φ-path ⇒ $\mathcal{W}_U^{*} = \phi^{-1}(\mathrm{co}(\phi(\mathcal{W})))$.

Future problems
- Tangent space; path space.
- φ-mean, φ-variance, φ-divergence: these are natural ideas from the IG viewpoint.
- We can build φ-efficiency in estimation theory, but what is the statistical meaning of the φ-mean and the φ-variance?
- Can we define a random sample of the φ-version?

Thank you
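To make steps (A) and (B) of the U-Boost algorithm above concrete for the power-entropy case, here is a minimal sketch, assuming $\phi_\beta(t) = (t^\beta - 1)/\beta$ and a U chosen so that $U(\phi_\beta(t)) = t^{1+\beta}/(1+\beta)$; the Gaussian dictionary, the grid-based integration, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# One U-Boost update for the power-entropy case (assumed concrete choice):
# phi(t) = (t**beta - 1)/beta, so U(phi(t)) = t**(1+beta)/(1+beta) and the
# U-loss is: integral of f**(1+beta)/(1+beta) minus the mean of phi(f(x_i)).
beta = 0.5
xs = np.linspace(-6.0, 6.0, 1201)   # integration grid (illustrative)
dx = xs[1] - xs[0]

def phi(t):     return (t**beta - 1.0) / beta
def phi_inv(s): return (1.0 + beta * s) ** (1.0 / beta)

def u_loss(f_grid, data):
    """Empirical U-loss L_U(f), with the integral taken on the grid."""
    f_at_data = np.interp(data, xs, f_grid)
    return np.sum(f_grid**(1.0 + beta)) / (1.0 + beta) * dx - np.mean(phi(f_at_data))

def gauss(mu, sigma=1.0):
    return np.exp(-0.5 * ((xs - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
dictionary = [gauss(mu) for mu in np.linspace(-4, 4, 17)]

# (A) Initial density: the best single dictionary element.
f = min(dictionary, key=lambda g: u_loss(g, data))

# (B) One boosting step: grid-search (pi, g) minimizing the U-loss of the
# phi-mixture phi^{-1}((1 - pi) * phi(f) + pi * phi(g)).
best_loss, best_f = np.inf, f
for g in dictionary:
    for pi in np.linspace(0.05, 0.95, 19):
        cand = phi_inv((1.0 - pi) * phi(f) + pi * phi(g))
        loss = u_loss(cand, data)
        if loss < best_loss:
            best_loss, best_f = loss, cand
f = best_f
print("U-loss after one update:", best_loss)
```

Iterating step (B) and stopping at step K yields the estimator $\hat f$ of step (C); for β = 1 the φ-mixture reduces to an ordinary mixture, in line with the Klemela (2007) case on the power-entropy slide.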